The brain in a jar got cheap. The arms came with it. For small operators, this is the year the math finally works.
Six months ago, running a useful AI on your own hardware meant spending eight thousand dollars on a workstation, knowing what BIOS was, and accepting that whatever model you got would feel a half-step behind whatever the big cloud labs were doing that week. The gap has closed. Open-weights models in the twenty-to-seventy-billion parameter range are now doing what frontier cloud APIs were doing in mid-2025: coding, customer support, document analysis, reasoning over a service catalog. The hardware to run them dropped under $2,500. And there's a category of mini PC on the market right now that turns running a local AI agent from a research project into a thing a small shop can actually buy and plug in.
This post is the math, the hardware, the model, and the agent that sits on top of all of it.
§ 01The shift. Small models do real work now.
For the last two years, the case against on-prem AI was simple: cloud frontier models were better, the cost was someone else's problem, and the open models you could run yourself were noticeably worse at anything that mattered. All three of those assumptions broke at once.
The gap on small models closed.
Open-weights models in the 7B–70B range now consistently land at performance levels that frontier APIs were posting in mid-2025. For the kinds of tasks a small business actually wants (answering customer questions, drafting follow-ups, parsing invoices, summarizing call transcripts, classifying inbound emails), the difference between a strong open model and the latest cloud frontier model is no longer obvious in the output. It's only obvious on the bill.
Cloud agent costs got worse, not better.
Cloud API pricing isn't getting cheaper for the way agents actually use them. A single agent making twenty tool calls to handle one customer inquiry can burn through tokens fast. Multiply by 24/7 operation. The bill is the bill, and it shows up every month.
The hardware caught up.
The new generation of integrated GPUs (AMD's Ryzen AI Max+, Apple's M-series, Intel's Arrow Lake-H equivalents) ship with enough unified memory to load a 70-billion-parameter model quantized to four bits and run it at usable speed. That wasn't true a year ago. It is now.
On a traditional desktop, the GPU has its own VRAM (usually 8–24GB) and the rest of the system has separate RAM. For LLMs, that VRAM cap is the ceiling. Unified memory architectures share one big pool between CPU and integrated GPU, so a 64GB machine can load models that wouldn't fit on a $4,000 discrete-GPU rig.
§ 02The hardware. A mini PC in a closet, on a UPS.
The reason I keep coming back to this specific machine is the unified memory. 64GB of LPDDR5X at 8000 MHz, shared between the CPU and integrated GPU. For LLM inference that's the spec that matters more than anything else on the box.
Specs that actually matter for this use case.
- Ryzen AI Max+ 395. Strix Halo class, strong NPU plus a beefy integrated Radeon. Real inference throughput.
- 64GB LPDDR5X-8000 unified memory. The headline. Fits big models with quantization, no swapping to disk.
- 1TB PCIe 4.0 SSD. Enough for a few models on disk plus logs, vector indexes, and growing context.
- WiFi 7 + USB4. Network and I/O are not bottlenecks.
Specs that don't matter (for this use case).
- The 8K quad-screen output. You won't use it; the box lives headless in a closet.
- Gaming features. Irrelevant.
- SD card reader. Nice to have, not load-bearing.
§ 03The brain. Pick a model that fits the shop.
The hardware is the easy part. The model is where most operators get lost. Here's the way to think about it.
For a shop running a single agent (answer customer questions, draft replies, look up service info, summarize calls), anything in the 14B–32B parameter range is the sweet spot. Quantized to four bits (the standard format for this kind of deployment) it fits in roughly 16–22GB of memory and runs fast enough to feel responsive in conversation.
Concrete options as of right now.
- Qwen 3 32B. Strong reasoning, multilingual, very competitive on tool use. Good default.
- Llama 3.3 70B (4-bit quantized). Heavier, slower, but excellent at long-context work.
- Gemma 3 27B. Google's open weights, strong at structured output and JSON.
- Mistral Small / Magistral. Efficient, sharp, good when you want concise answers.
- Hermes 3 (Nous's own). A fine-tuned line built for agentic work; pairs naturally with their agent layer.
You don't pick a model once. The whole point of running locally is that you can swap whenever a better one drops, which on this curve happens roughly every eight to twelve weeks.
# on the box, with Ollama installed:
ollama pull qwen3:32b # default everyday
ollama pull llama3.3:70b # heavy lift
ollama serve # API on :11434
# point your agent at http://<box-ip>:11434/v1
If you can't decide, run a smaller model and a bigger one in parallel for a month. Most operators land on the smaller faster one for everyday work and only reach for the larger model on specific harder questions. Pick by watching the logs, not by reading benchmarks.
§ 04The agent. Turning the brain into a coworker.
A model is a brain in a jar. An agent is the brain plus the arms. The model can answer a question. The agent can answer the question, look up the customer's history in your CRM, send the follow-up SMS, schedule a callback, and log the interaction, without you. That's the whole difference between "we have AI on our website" and "we have an AI employee."
Hermes, from Nous Research, is one of the better open-source agent layers you can run on top of a local model. It handles the parts of agency that are tedious to write yourself: structured tool use, function calling, conversation memory, retry logic when a tool call fails. You point it at your local model, give it access to whatever tools you want it to use (a database, an SMS API, your inventory system, your calendar), and it runs.
In practice: the agent runs as a process on the mini PC, listens
for events (a form fill, a missed call, a customer text), and
decides what to do. You define the tools (look_up_inventory,
send_sms, schedule_call), and the agent
uses them as needed. It's not magic; it's a model with
structured access to your business systems, kept honest by an
orchestration layer that knows how to recover when something
goes wrong.
§ 05The cost math. Cloud vs. on-prem over 24 months.
Cloud agent usage is sneakier than chat usage. A single customer-handling agent can fire 15–30 model calls to resolve one inquiry: context lookups, tool calls, reasoning steps, retries. The token bill compounds.
Cloud baseline (frontier API, agent-style usage).
- Per-token pricing: roughly $3–15 per million input tokens, $15–75 per million output, depending on model tier.
- Realistic agent usage for a single small shop: 50M–200M tokens / month.
- That puts the monthly bill at $200–$2,000+, sometimes higher once tool-use overhead is honest.
On-prem cost.
- Hardware: ~$1,800–$2,200 one-time.
- Power: ~100W average × 24h × 30d × $0.15/kWh ≈ $11/month.
- Software: open-weights models are free. Ollama, llama.cpp, vLLM are free. Hermes pricing varies by tier.
- UPS + small admin overhead: trivial.
Over 24 months.
Cloud spend on a moderately busy agent: $4,800 to $48,000. On-prem spend including hardware, power, and the agent layer: $2,000 to $3,500. The crossover is usually inside the first six months. Past that, every month is gravy.
This isn't a knock on cloud APIs. They make sense for bursty, light, or genuinely model-specific workloads. The argument is narrower: if you have a steady-state agent that runs all day, every day, on-prem wins on every axis that matters to a small shop. From the audit notes
§ 06The privacy upside. The part operators raise on the call.
This is the part operators bring up in the first fifteen minutes but rarely write down anywhere, and one of the biggest reasons a local AI agent ends up being a values decision, not just a cost one.
If you're a clinic, your patient data shouldn't go to a third-party API. If you're a law firm, your client conversations shouldn't either. If you're an auto dealer with customer financial info, same. If you're any business with proprietary process knowledge that gets exposed in the prompts (pricing logic, vendor relationships, sourcing methods), a local model means none of it ever leaves the building.
On-prem doesn't automatically make your operation HIPAA-compliant or solve every regulatory question. It does make the underlying privacy story dramatically simpler. The data path stops at your closet. There's no third-party processor in the chain, no data-residency questions, no terms-of-service that change next quarter and surprise you.
§ 07What it looks like deployed at a real shop.
Here's the configuration we'd ship at a small operator today.
- Day 1. Mini PC arrives. UPS, static IP, SSH, tailscale or wireguard for remote admin. Install Ollama, pull a 32B model, smoke test.
- Day 2. Install the Hermes agent. Configure tool access: read CRM, send SMS, write calendar, log to Sheet. Define the agent's job in plain English.
- Day 3. Wire the agent into the existing site forms and inbound channels. Form fill triggers the agent. Agent routes urgent items to the owner's phone.
- Day 4–7. Tune. Watch the logs. Fix mismatched answers, sharpen prompts, add the tools the agent reaches for and doesn't have yet.
- Day 8+. Runs on its own. Owner gets alerts on the cases the agent escalates. Everything else gets handled and logged.
The agent doesn't sleep. It doesn't call in sick. It doesn't ghost on Friday afternoon. It also doesn't replace humans. What it replaces is the part of the day where a human had to do work a model is genuinely better at.
§ 08Honest constraints.
You give up some things going on-prem. The list, written down so there's no surprise on day fourteen.
- You're responsible for updates. Better models drop, you swap. Security patches happen, you patch. Roughly a half-day a month of admin overhead, in our experience.
- The hardest reasoning still benefits from frontier cloud. For routine work, local is fine. For occasional one-off complex reasoning you might still call out to a cloud API. Hybrid is normal.
- Setup is non-trivial. The mini PC arrives in a box. Turning it into a production 24/7 agent takes work. (This is the part we do in a sprint.)
- Power and noise. Mini PCs get loud under sustained load. Plan a closet, not a desk.
- One box, one point of failure. If uptime is mission-critical, run two and fail over. For most shops, one box plus a UPS is enough. The agent isn't running a hospital.
Don't go on-prem to "save on AI" if your team will use the model twice a week. The crossover math depends on steady-state usage. If your usage is bursty and light, a $20/month cloud subscription is the right answer and you should ignore this whole post.
§ 09Takeaway, in one line.
The economics of running a local AI agent just changed. A two-thousand dollar mini PC plus an open-weights model plus an agent layer is now a viable replacement for cloud-only setups that didn't quite do what you needed anyway. The privacy story is better. The cost story is better. The latency is better. The control story is dramatically better.
The longest part of the project, again, is the operator deciding to do it.
If you want help getting a local AI employee up and running (picking the model for your shop, deploying the agent, wiring it into the tools you already use), that's the work we do. The sprint is the same shape as our lead-capture rig but with a different brain at the center. For a longer-running build — sourcing, follow-ups, internal reporting on top of the local model, see custom builds. Fifteen minutes, one workflow, one question.