Why run local?

I’ve been struggling to answer this question for awhile, but I think I finally have some glimpses of the solution. I hope this short summary of my adventures will answer that question.

Edit: This was turning out to be much longer than I expected so I decided to break it up into a few series.

What is a Local LLM?

So instead of using ChatGPT or Claude, you have a model on your own computer that you can query and ask questions and even plug into an agentic workflows and it all works offline.

No token costs, no subscriptions.

Is it cheap?

It costs $0 and some change to run! It is practically free.

… who am I kidding. It is not. In terms of hobbies it is up there between luxury watch collecting and Legos. Assuming you don’t have a GPU.

To run a local LLM of any usefulness whatsoever, you’ll need at bare minimum, a 16 GB graphics card. A modern, RTX 5070 Ti from microcenter starts from $1,099. That is the bare minimum. Your model barely has enough intelligence to put together a coherent go program without hallucinating the imports.

So let’s say you’re someone with a bit more spending power. You don’t mind getting top tier tools for your hobby. The pro stuff. You’re a prosumer. $3,499 gets you the top-end RTX 5090 with 32 GB. You can run a quantised (think of it as a more ’lossy’ version) model at reasonable levels of intelligence and speed. You probably don’t want to delegate planning to it, or run it overnight without any checks, but it’ll work for agentic coding.

Or maybe you have an unlimited budget (compared to most people), you’re John Wick, and you want the night to go out with a bang. The Sommelier nods solemnly, only the best for your sir. and disappears into the back.

May I recommend the RTX PRO 6000, 1792 GB/s bandwidth, 96 GB of VRAM containing 24,064 CUDA cores. Hums like a dispossessed ghost when cranked to full power. Custom of course, for… special circumstances.

The best microcenter deal starts at $9,399 and the line goes vertical from there. For that, you get the ability to run all new local models at full intelligence and speed. Now that’s true power.

Beyond this range lies the H100s which you cannot get and / or those people clustering RTX PROs that cost more than a luxury car into a single PC case lurking in r/LocalLLMs.

Is it like running Opus?

ehhh

.. no.

What about Sonnet?

…. eh maybe.

If you trust the benchmarks, which you shouldn’t because of benchmaxxing.

Running a proper open weights model like minimax 2.7 with 230B parameters at full strength requires 5x RTX PRO 6000s, that’s like $50,000 in just the GPUs.

More reasonably, running a Q4 (94% accuracy) version still requires 2x cards. So even if you can, it’s definitely going to cost you WAY more than the $200/month Anthropic charges you.

Why use Local LLMs?

So we’ve established that Local LLMs is expensive, and it is kind of not great. Like some bars. So why do we still go there?

One way to look at it are areas that a Cloud run models are not be able to provide. Cloud run models provide high speed at low cost: that already covers two big reasons why someone might switch.

So to run a local LLM, the following reasons must be compelling enough for you:

Privacy

I win. No further explanation required.

Sensitive workloads, things you don’t ever want to send to a third party provider. Spicy workloads LIKE processing tax documents.

Everyone has a price for their privacy. It is listed clearly on microcenter’s website. What’s yours?

Resiliency

If running GPUs for AI is so expensive, how are companies like ChatGPT and Anthropic able to offer monthly pricing at $20/month?

They can’t.

It is like a gym membership. They count on you not using your allocated quota to the fullest.

And the moment users start to, wham! that’s the sound of the policy changing. Look at what happened to OpenClaw users. Look at how people have been complaining that the usage rates being lowered across the providers, about how the newest model tends to be really verbose (aka, token heavy).

I’d argue the true cost is really reflected in the API, and for heavy users, the cost can easily be 10x ~ 30x the price they pay for subscriptions.

I’ve also never really been a guy to trust cloud provisioned services either:

Opinion: No Cloud Please

Needing a cloud service to operate your Smart Home products is not always the best option, and here in this post I explore why

West Side Electronics

All of that to say that once you built your life around AI, what are your guarantees that the quality won’t be reduced or your ability to use it won’t be cut in the future?

Freedom

When LLMs say sorry I can’t help with that.

What they mean is that they won’t.

Guardrails are there for a reason and crucial for preventing misuse. But it can limit the model’s flexibility and responsiveness. I don’t want a future where my learning is limited to the boundaries defined by companies that own the models.

When to use Local LLMs?

Split your use cases

Coding may not need to be secret: who cares about the trillionth line of AI written code that day? Use a cloud-hosted model for that. Coding burns a ton of tokens going back and forth. But quiet inference on confidential data as part of a workflow can be moved offline.

Consider carefully what you need to keep local, because local capacity is precious.

Use Cloud to test

I believe running LLMs bare metal (whether on local or on a rented GPU) counts as ’local'.

Before working with a local LLM, try running your workload on the cloud with an equivalent model to see if it can meet your needs. It is also a more cost effective way to run open-weight models too, with a RTX 6000 Pro costing only $1.89 USD/hr. I personally use runpod for this, because they provide a pretty simple interface to spin up a GPU and deploy a workload on it.

For runpod, the easiest way to start is to use a public template or container. Ollama is a popular one. I use llama.cpp’s docker container, which gives me better performance on certain cards and spins up faster than vllm.

I’ve provided my own template that I use to spin up Qwen 3.6 26B UD-Q8_K_XL, which allows it to perform at near native accuracy while keeping the inference speed reasonably fast. Since there is plenty of headroom on a RTX 6000 PRO, I’ve also provisioned two slots to maximise usage: use it with chat and code at the same time, or work in parallel.

Runpod GPU Cloud

GPU Instance deploy for Jupyter, PyTorch, Tensorflow, and many more.

runpod.io

Conclusion

I spent many hours thinking about why I would use a local LLM, aside from the fun and magic of running your very own intelligence in a box that you own, and these were the considerations that I made.

Running your own LLM is very informative as well, and it is a whole area of knowledge in itself, trying to squeeze the best performance out of the hardware that you own (or rent!).

What is a Local LLM?#

Is it cheap?#

Is it like running Opus?#

What about Sonnet?#

Why use Local LLMs?#

Privacy#

Resiliency#

Freedom#

When to use Local LLMs?#

Split your use cases#

Use Cloud to test#

Conclusion#