• T156@lemmy.world · 21 hours ago

    It’s also cheaper, if they can offload a portion to the user’s computer.

    • Em Adespoton@lemmy.ca · 16 hours ago

      Cheaper for them, that is.

      What I want to see is throttleable models, kind of like progressive JPEG: the default model is “nano”, and a watch function analyzes whether a task might need more tokens and scales up as needed. If it determines the required resources exceed what the device has, it offloads to the cloud (with explicit permission), but only if, and always if, needed. Over time, as the technology improves, larger models move to the endpoint.

      And then people could have a basic slider: on-device only at one end, cloud-only at the other, or anywhere in between, based on the user’s preferences. Roughly the routing logic sketched below.
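
      A minimal sketch of what that escalation loop might look like, in Python, assuming a placeholder complexity estimate and made-up model runners (none of these names come from a real API):

      ```python
      from enum import Enum

      class Preference(Enum):
          ON_DEVICE_ONLY = 0   # never leave the device
          PREFER_DEVICE = 1    # offload only with permission
          CLOUD_OK = 2         # offloading is pre-approved

      def estimate_complexity(prompt: str) -> float:
          # Placeholder "watch function": a real one might use the nano
          # model itself to score how hard the task looks (0.0 to 1.0).
          return min(len(prompt.split()) / 500, 1.0)

      def run_nano(prompt: str) -> str:          # hypothetical on-device model
          return f"[nano] answer to: {prompt[:20]}..."

      def run_large_local(prompt: str) -> str:   # hypothetical larger local model
          return f"[large-local] answer to: {prompt[:20]}..."

      def run_cloud(prompt: str) -> str:         # hypothetical cloud model
          return f"[cloud] answer to: {prompt[:20]}..."

      def ask_permission() -> bool:
          reply = input("Task exceeds device resources. Offload to cloud? [y/N] ")
          return reply.strip().lower() == "y"

      def route(prompt: str, pref: Preference, device_budget: float = 0.7) -> str:
          score = estimate_complexity(prompt)
          if score < 0.3:
              return run_nano(prompt)             # default: smallest model
          if score < device_budget:
              return run_large_local(prompt)      # scale up, still on-device
          # Resources exceed the device: offload only with explicit permission.
          if pref is Preference.CLOUD_OK:
              return run_cloud(prompt)
          if pref is Preference.PREFER_DEVICE and ask_permission():
              return run_cloud(prompt)
          return run_large_local(prompt)          # best effort on-device

      print(route("Summarize this paragraph.", Preference.PREFER_DEVICE))
      ```

      The “slider” here is just the device_budget threshold plus the preference setting; a real implementation would persist both as user settings.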

      • T156@lemmy.world · 15 hours ago

        That’s basically model routing, and it has existed for a while. OpenAI’s GPT-5 and llama-swap do that, for example: if the task is simple, it uses a smaller, less intensive model, and only uses the slower, larger one if the task is more complex.

        Though most tend to route between models on the same device/service, rather than handing off to a model running elsewhere.
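
        For comparison, routing on the client side of an OpenAI-compatible endpoint (which llama-swap exposes, loading whichever model a request names) might look like the sketch below; the URL, model names, and word-count cutoff are assumptions, not anything from its docs:

        ```python
        import requests  # pip install requests

        # Assumed llama-swap-style proxy running locally.
        ROUTER_URL = "http://localhost:8080/v1/chat/completions"

        def route_model(prompt: str) -> str:
            # Trivial stand-in for a real complexity classifier:
            # short prompts go to a small model, long ones to a large one.
            return "small-model" if len(prompt.split()) < 100 else "large-model"

        def complete(prompt: str) -> str:
            resp = requests.post(ROUTER_URL, json={
                "model": route_model(prompt),  # hypothetical model names
                "messages": [{"role": "user", "content": prompt}],
            })
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]

        print(complete("What's 2 + 2?"))  # routed to the small model
        ```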