Google Chrome is downloading a 4 GB Gemini Nano model onto users' machines without consent, with no opt-in, no opt-out short of enterprise tooling, and an automatic re-download every time the user deletes it. The pattern is identical to the Anthropic Claude Desktop case I wrote about last month, but the scale is between two and three orders of magnitude larger. This article does the legal analysis and, for the first time, the environmental analysis. The numbers are not small.
What I want to see is throttleable models, kind of like progressive JPEG, where the default model is "nano" and a watch function analyzes whether more tokens might be needed for a given task and scales up as needed, identifying when the resources exceed what the device can handle and offloading to the cloud (with explicit permission) only if (but always if) needed. Over time, as the technology improves, larger models move to the endpoint.
And then people could have a basic set of sliders: on-device only, on-cloud only, or somewhere in between, based on the user’s preferences.
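The escalation-plus-slider design above can be sketched in a few lines. This is a minimal illustration under assumed names and numbers: the model tiers, token-capacity figures, and the `estimate_tokens` "watch function" are all hypothetical stand-ins, not anything Chrome or Gemini Nano actually exposes.

```python
from enum import Enum

class Preference(Enum):
    """The user's slider position."""
    ON_DEVICE_ONLY = "on-device"
    CLOUD_ONLY = "cloud"
    AUTO = "auto"  # "somewhere in between": local first, cloud with consent

# Hypothetical local model tiers, smallest first.
LOCAL_TIERS = ["nano", "small", "medium"]

def estimate_tokens(task: str) -> int:
    """Crude stand-in for the 'watch function': guess the budget a task needs."""
    return len(task.split()) * 8  # assume ~8 output tokens per input word

def device_can_run(tier: str, needed_tokens: int) -> bool:
    """Stand-in for a resource check (RAM, battery, thermal headroom)."""
    capacity = {"nano": 500, "small": 2_000, "medium": 8_000}
    return needed_tokens <= capacity[tier]

def route(task: str, pref: Preference, ask_permission) -> str:
    """Pick where a task runs: a local tier, or the cloud only with consent."""
    if pref is Preference.CLOUD_ONLY:
        return "cloud"
    needed = estimate_tokens(task)
    for tier in LOCAL_TIERS:  # escalate through local tiers first
        if device_can_run(tier, needed):
            return tier
    if pref is Preference.ON_DEVICE_ONLY:
        return "medium"  # best local effort; never leaves the device
    # Offload only if (but always if) needed, and only with explicit consent.
    return "cloud" if ask_permission(task) else "medium"
```

The key property is that the cloud is never reached without passing through `ask_permission`, so the "on-device only" slider position is enforced structurally rather than by policy.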
That’s basically model routing, which has existed for a while. OpenAI’s GPT-5 and llama-swap do this, for example: if the task is simple, the router uses a smaller, less resource-intensive model, and only uses the slower, larger one if the task is more complex.
Though most routers tend to operate among models on the same device or service, rather than routing to a model run elsewhere.
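The routing heuristic itself can be very simple. A minimal sketch of the simple-vs-complex split described above; the model names, keyword list, and length threshold are illustrative assumptions, not any vendor's actual routing logic.

```python
SMALL_MODEL = "small-local-model"   # hypothetical: fast, less capable
LARGE_MODEL = "large-hosted-model"  # hypothetical: slower, more capable

# Keywords that hint a prompt needs the heavier model (illustrative only).
COMPLEX_MARKERS = ("prove", "refactor", "analyze", "step by step")

def pick_model(prompt: str) -> str:
    """Very rough complexity check: prompt length plus a few keyword triggers."""
    lowered = prompt.lower()
    if len(prompt.split()) > 200 or any(m in lowered for m in COMPLEX_MARKERS):
        return LARGE_MODEL
    return SMALL_MODEL
```

Real routers (llama-swap among them) make this decision on richer signals, but the shape is the same: a cheap classification step in front of an expensive generation step.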
It’s also cheaper, if they can offload a portion to the user’s computer.
Cheaper for them, that is.