In this blog post, I’ll describe one of my hobby projects - a browser extension for checking if shopping items align with my values - and how I ended up routing its LLM calls to a desktop computer sitting in my apartment.

About Vegan Confirmed

Vegan Confirmed is an extension for Chrome and Firefox, designed for vegans who want to make sure products they buy online are vegan.

It analyzes text on shopping web pages either when the user clicks a button such as “Buy” or “Add to Cart”, or when they explicitly request an analysis.

It has very few users at the moment, which is totally fine since I mainly built it for myself.

High Level Design

  • The extension’s content script detects clicks on buttons such as “Add to Cart”.
  • It extracts the text from the page, attempting to discard irrelevant parts (scripts, css, etc.).
  • The content script sends a message to the background script.
  • The background script calls the Vegan Confirmed backend.
  • The backend records the call.
  • The backend invokes an LLM for classification and summarization.
  • The backend records the results.
  • The backend returns the result to the extension.
  • The extension displays the result.
  • The extension possibly alerts the user.

The content extraction happens inside the extension because it already has access to the content of the page, so we don’t need to fetch it from the backend, risking being blocked.

Keeping it free

Since this is a hobby project without any potential for monetization in sight, I wanted to keep the price of running it to an absolute minimum. Ideally zero. This constraint ended up shaping every infrastructure decision that follows.

I contemplated self-hosting the backend, so I didn’t need to pay for any hosting at all, and prototyped using dynamic DNS, but it seemed too flimsy even for this (not particularly demanding for reliability) use case.

Instead I decided to use public cloud, but keep it within free tier (assuming it doesn’t explode in popularity all of a sudden). This basically means it has to auto scale down to zero, meaning we might need to tolerate occasional cold starts.

I chose GCP Cloud Functions as a very simple solution, plus I’m already quite familiar with them.

The cloud version

As an LLM provider, I chose Gemini 1.5 Flash Lite (later Gemini 2.5 Flash Lite) - the cheapest Gemini model available at that time. Truth is, the task at hand is pretty easy for any modern LLM - I could choose anything. So I chose the cheapest of all Gemini models, and chose Gemini over alternatives simply because it’s particularly easy to use on Cloud Functions.

This worked. Gemini was cheap, Cloud Functions scaled to zero, and the whole thing comfortably fit inside the free tier. I could have stopped there.

Why I moved inference home

But there was a machine I kept thinking about. I happened to own a powerful Framework Desktop - Ryzen AI Max+ 395, 128 GB RAM - capable of running local inference for many open models, and most of the time it sat idle. The classification task was easy enough that I didn’t need more horsepower. I just thought it’d be fun to cut the cloud out of the loop and run the inference at home.

How do we run models locally? The easiest way is LM Studio.

Choosing a local model

I tried three: google/gemma-4-31b, qwen/qwen3.5-9b and openai/gpt-oss-20b. All three handle the classification problem easily, albeit they need to be prompted slightly differently to get good analysis summaries. So it’s mainly a question of response time, and openai/gpt-oss-20b is the only one of these three with acceptable response time.

ModelMean latency
google/gemma-4-31b62.7s
qwen/qwen3.5-9b38.5s
openai/gpt-oss-20b4.7s
gemini/gemini-2.5-flash-lite1.0s

The gap is striking: gpt-oss-20b is roughly 10x faster than the others. I’m not certain why, but my guess is that it’s because gpt-oss is a mixture-of-experts model that activates only a small fraction of its parameters per token, while gemma and qwen are dense models that run every parameter on every token.

The Framework Desktop is capable of running larger and more powerful models, but it’s overkill since, again, the classification problem is sufficiently easy.

Note that all of them are significantly slower than Gemini 2.5 Flash Lite, but even 4 second latency is acceptable for this use case.

Wiring it together

How do we access these models from the Cloud Function? I considered two options - Tailscale on Cloud Run and using Firestore as the communication mechanism - and chose the latter for its simplicity.

When the backend needs to analyze a page, it writes the request to a dedicated Firestore collection, and subscribes to updates on that document. On the desktop, a background process listens for new documents in the collection and, when one arrives, executes it using the local LLM and writes the results back. The backend receives the update and replies to the caller. It falls back to Gemini if the desktop request fails or times out.

Reflection

So, was it worth it? In one sense, no - the cloud version already worked and cost nothing, and routing inference through Firestore to a desktop in my apartment is objectively a more fragile setup. But it was fun, it taught me a lot about running open models locally, and the Gemini fallback means the fragility never reaches the user.

More broadly, I wish browser extensions regained their popularity - they are a great mechanism for customizing the browsing experience. One obvious problem is that they are not supported by mobile browsers.

Source Code

Extension

Source code for the backend is coming soon.

Disclaimer about AI use

The post was largely written by a human. AI was used to proofread grammar and the flow.

Most of the code for the extension and the backend services was written using AI.