How much does a local AI agent cost versus cloud APIs?

On-prem: roughly $1,800–$2,200 one-time for the mini PC plus about $11/month in power. Cloud agent usage for a busy small shop typically runs $200–$2,000+ per month in token costs because agents fire 15–30 model calls per inquiry. Over 24 months, on-prem totals $2,000–$3,500 and cloud totals $4,800–$48,000. Crossover is usually inside the first six months for steady-state agent workloads.

Which open-weights model should I run on a mini PC for a small business?

For a single-agent small business workload, anything in the 14B–32B parameter range is the sweet spot. Quantized to 4-bit it fits in 16–22GB of memory and stays responsive in conversation. Good defaults right now: Qwen 3 32B (strong reasoning, tool use), Gemma 3 27B (structured output), Mistral Small / Magistral (concise answers), or Llama 3.3 70B 4-bit for heavier work. Swap models as better ones drop, that's the whole point of running locally.

What hardware do I need to run a 70B model locally?

The key spec is unified memory, not VRAM. A mini PC with 64GB of unified LPDDR5X memory (such as the GMKtec EVO-X2 with AMD Ryzen AI Max+ 395) can load a 70-billion-parameter model quantized to 4-bit and run it at usable speed. That wasn't possible on a desk-sized box a year ago. Traditional 8–24GB discrete-GPU setups cap out well below 70B.

When is on-prem AI worth it versus a cloud subscription?

On-prem wins when you have steady-state agent usage running all day, every day. If your team will use the model twice a week, a $20/month cloud subscription is the right answer. The crossover math depends on usage volume, for bursty or light workloads, cloud APIs make sense. For 24/7 agents handling inquiries, follow-ups, and routine reasoning, on-prem pays back inside six months and saves $3,000–$45,000 over 24 months.

Field note · 001 · Tooling · Self-hosted

Local AI Agent on a Mini PC.
A 24/7 brain for under $2,500.

Buy the box, own the brain.

The cost of bespoke software just dropped. The cost of cloud AI keeps going up. Somewhere in the middle is a small mini PC sitting in a corner, running an open-weights model that doesn't sleep, doesn't call in sick, and doesn't share your data with anyone.

Adam Pichardo · founder, Pulp AI Studio ● May 5, 2026 ● 11 min read Tooling Self-hosted Hardware Field note

Affiliate disclosure One link in this post, the GMKtec mini PC on Amazon, is an affiliate link. If you buy through it, we may earn a small commission at no extra cost to you. The Hermes link is a regular referral, not paid. We only recommend tools we'd ship in a real client engagement.

The brain in a jar got cheap. The arms came with it. For small operators, this is the year the math finally works.

Six months ago, running a useful AI on your own hardware meant spending eight thousand dollars on a workstation, knowing what BIOS was, and accepting that whatever model you got would feel a half-step behind whatever the big cloud labs were doing that week. The gap has closed. Open-weights models in the twenty-to-seventy-billion parameter range are now doing what frontier cloud APIs were doing in mid-2025: coding, customer support, document analysis, reasoning over a service catalog. The hardware to run them dropped under $2,500. And there's a category of mini PC on the market right now that turns running a local AI agent from a research project into a thing a small shop can actually buy and plug in.

This post is the math, the hardware, the model, and the agent that sits on top of all of it.

§ 01The shift. Small models do real work now.

For the last two years, the case against on-prem AI was simple: cloud frontier models were better, the cost was someone else's problem, and the open models you could run yourself were noticeably worse at anything that mattered. All three of those assumptions broke at once.

The gap on small models closed.

Open-weights models in the 7B–70B range now consistently land at performance levels that frontier APIs were posting in mid-2025. For the kinds of tasks a small business actually wants (answering customer questions, drafting follow-ups, parsing invoices, summarizing call transcripts, classifying inbound emails), the difference between a strong open model and the latest cloud frontier model is no longer obvious in the output. It's only obvious on the bill.

Cloud agent costs got worse, not better.

Cloud API pricing isn't getting cheaper for the way agents actually use them. A single agent making twenty tool calls to handle one customer inquiry can burn through tokens fast. Multiply by 24/7 operation. The bill is the bill, and it shows up every month.

The hardware caught up.

The new generation of integrated GPUs (AMD's Ryzen AI Max+, Apple's M-series, Intel's Arrow Lake-H equivalents) ship with enough unified memory to load a 70-billion-parameter model quantized to four bits and run it at usable speed. That wasn't true a year ago. It is now.

Tip · what "unified memory" buys you

On a traditional desktop, the GPU has its own VRAM (usually 8–24GB) and the rest of the system has separate RAM. For LLMs, that VRAM cap is the ceiling. Unified memory architectures share one big pool between CPU and integrated GPU, so a 64GB machine can load models that wouldn't fit on a $4,000 discrete-GPU rig.

§ 02The hardware. A mini PC in a closet, on a UPS.

The reason I keep coming back to this specific machine is the unified memory. 64GB of LPDDR5X at 8000 MHz, shared between the CPU and integrated GPU. For LLM inference that's the spec that matters more than anything else on the box.

TOOL · A · LOCAL INFERENCE BOX

GMKtec EVO-X2 AI Mini PC

AMD Ryzen AI Max+ 395 (up to 5.1GHz),64GB LPDDR5X 8000 unified memory,1TB PCIe 4.0 SSD, WiFi 7, USB4. The unified memory is the headline. It's what lets you run a 70B-parameter model quantized to 4-bit on a desk-sized box. Stick it in a closet on a UPS, point a static IP at it, serve models on your local network. This is the rig we'd pick today if a client said "we want AI on-prem."

~$2,000 one-time · No recurring

View on Amazon → Have us set it up

Specs that actually matter for this use case.

Ryzen AI Max+ 395. Strix Halo class, strong NPU plus a beefy integrated Radeon. Real inference throughput.
64GB LPDDR5X-8000 unified memory. The headline. Fits big models with quantization, no swapping to disk.
1TB PCIe 4.0 SSD. Enough for a few models on disk plus logs, vector indexes, and growing context.
WiFi 7 + USB4. Network and I/O are not bottlenecks.

Specs that don't matter (for this use case).

The 8K quad-screen output. You won't use it; the box lives headless in a closet.
Gaming features. Irrelevant.
SD card reader. Nice to have, not load-bearing.

§ 03The brain. Pick a model that fits the shop.

The hardware is the easy part. The model is where most operators get lost. Here's the way to think about it.

For a shop running a single agent (answer customer questions, draft replies, look up service info, summarize calls), anything in the 14B–32B parameter range is the sweet spot. Quantized to four bits (the standard format for this kind of deployment) it fits in roughly 16–22GB of memory and runs fast enough to feel responsive in conversation.

Concrete options as of right now.

Qwen 3 32B. Strong reasoning, multilingual, very competitive on tool use. Good default.
Llama 3.3 70B (4-bit quantized). Heavier, slower, but excellent at long-context work.
Gemma 3 27B. Google's open weights, strong at structured output and JSON.
Mistral Small / Magistral. Efficient, sharp, good when you want concise answers.
Hermes 3 (Nous's own). A fine-tuned line built for agentic work; pairs naturally with their agent layer.

You don't pick a model once. The whole point of running locally is that you can swap whenever a better one drops, which on this curve happens roughly every eight to twelve weeks.

# on the box, with Ollama installed:
ollama pull qwen3:32b          # default everyday
ollama pull llama3.3:70b       # heavy lift
ollama serve                   # API on :11434

# point your agent at http://<box-ip>:11434/v1

Tip · run two for a month

If you can't decide, run a smaller model and a bigger one in parallel for a month. Most operators land on the smaller faster one for everyday work and only reach for the larger model on specific harder questions. Pick by watching the logs, not by reading benchmarks.

§ 04The agent. Turning the brain into a coworker.

A model is a brain in a jar. An agent is the brain plus the arms. The model can answer a question. The agent can answer the question, look up the customer's history in your CRM, send the follow-up SMS, schedule a callback, and log the interaction, without you. That's the whole difference between "we have AI on our website" and "we have an AI employee."

Hermes, from Nous Research, is one of the better open-source agent layers you can run on top of a local model. It handles the parts of agency that are tedious to write yourself: structured tool use, function calling, conversation memory, retry logic when a tool call fails. You point it at your local model, give it access to whatever tools you want it to use (a database, an SMS API, your inventory system, your calendar), and it runs.

In practice: the agent runs as a process on the mini PC, listens for events (a form fill, a missed call, a customer text), and decides what to do. You define the tools (look_up_inventory, send_sms, schedule_call), and the agent uses them as needed. It's not magic; it's a model with structured access to your business systems, kept honest by an orchestration layer that knows how to recover when something goes wrong.

§ 05The cost math. Cloud vs. on-prem over 24 months.

Cloud agent usage is sneakier than chat usage. A single customer-handling agent can fire 15–30 model calls to resolve one inquiry: context lookups, tool calls, reasoning steps, retries. The token bill compounds.

Cloud baseline (frontier API, agent-style usage).

Per-token pricing: roughly $3–15 per million input tokens, $15–75 per million output, depending on model tier.
Realistic agent usage for a single small shop: 50M–200M tokens / month.
That puts the monthly bill at $200–$2,000+, sometimes higher once tool-use overhead is honest.

On-prem cost.

Hardware: ~$1,800–$2,200 one-time.
Power: ~100W average × 24h × 30d × $0.15/kWh ≈ $11/month.
Software: open-weights models are free. Ollama, llama.cpp, vLLM are free. Hermes pricing varies by tier.
UPS + small admin overhead: trivial.

Over 24 months.

Cloud spend on a moderately busy agent: $4,800 to $48,000. On-prem spend including hardware, power, and the agent layer: $2,000 to $3,500. The crossover is usually inside the first six months. Past that, every month is gravy.

This isn't a knock on cloud APIs. They make sense for bursty, light, or genuinely model-specific workloads. The argument is narrower: if you have a steady-state agent that runs all day, every day, on-prem wins on every axis that matters to a small shop. From the audit notes

§ 06The privacy upside. The part operators raise on the call.

This is the part operators bring up in the first fifteen minutes but rarely write down anywhere, and one of the biggest reasons a local AI agent ends up being a values decision, not just a cost one.

If you're a clinic, your patient data shouldn't go to a third-party API. If you're a law firm, your client conversations shouldn't either. If you're an auto dealer with customer financial info, same. If you're any business with proprietary process knowledge that gets exposed in the prompts (pricing logic, vendor relationships, sourcing methods), a local model means none of it ever leaves the building.

On-prem doesn't automatically make your operation HIPAA-compliant or solve every regulatory question. It does make the underlying privacy story dramatically simpler. The data path stops at your closet. There's no third-party processor in the chain, no data-residency questions, no terms-of-service that change next quarter and surprise you.

§ 07What it looks like deployed at a real shop.

Here's the configuration we'd ship at a small operator today.

Day 1. Mini PC arrives. UPS, static IP, SSH, tailscale or wireguard for remote admin. Install Ollama, pull a 32B model, smoke test.
Day 2. Install the Hermes agent. Configure tool access: read CRM, send SMS, write calendar, log to Sheet. Define the agent's job in plain English.
Day 3. Wire the agent into the existing site forms and inbound channels. Form fill triggers the agent. Agent routes urgent items to the owner's phone.
Day 4–7. Tune. Watch the logs. Fix mismatched answers, sharpen prompts, add the tools the agent reaches for and doesn't have yet.
Day 8+. Runs on its own. Owner gets alerts on the cases the agent escalates. Everything else gets handled and logged.

The agent doesn't sleep. It doesn't call in sick. It doesn't ghost on Friday afternoon. It also doesn't replace humans. What it replaces is the part of the day where a human had to do work a model is genuinely better at.

§ 08Honest constraints.

You give up some things going on-prem. The list, written down so there's no surprise on day fourteen.

You're responsible for updates. Better models drop, you swap. Security patches happen, you patch. Roughly a half-day a month of admin overhead, in our experience.
The hardest reasoning still benefits from frontier cloud. For routine work, local is fine. For occasional one-off complex reasoning you might still call out to a cloud API. Hybrid is normal.
Setup is non-trivial. The mini PC arrives in a box. Turning it into a production 24/7 agent takes work. (This is the part we do in a sprint.)
Power and noise. Mini PCs get loud under sustained load. Plan a closet, not a desk.
One box, one point of failure. If uptime is mission-critical, run two and fail over. For most shops, one box plus a UPS is enough. The agent isn't running a hospital.

Heads up · what we don't recommend

Don't go on-prem to "save on AI" if your team will use the model twice a week. The crossover math depends on steady-state usage. If your usage is bursty and light, a $20/month cloud subscription is the right answer and you should ignore this whole post.

§ 09Takeaway, in one line.

The economics of running a local AI agent just changed. A two-thousand dollar mini PC plus an open-weights model plus an agent layer is now a viable replacement for cloud-only setups that didn't quite do what you needed anyway. The privacy story is better. The cost story is better. The latency is better. The control story is dramatically better.

The longest part of the project, again, is the operator deciding to do it.

If you want help getting a local AI employee up and running (picking the model for your shop, deploying the agent, wiring it into the tools you already use), that's the work we do. The sprint is the same shape as our lead-capture rig but with a different brain at the center. For a longer-running build — sourcing, follow-ups, internal reporting on top of the local model, see custom builds. Fifteen minutes, one workflow, one question.

Written from the corner of the office where the mini PC lives. Adam Pichardo

The offer

Have us set up your local AI.

We pick the box, the model, the agent layer, and wire it into your shop's tools. Two-week sprint, fixed fee, you own the rig.

Book 15-min Call →

Featured tool

GMKtec EVO-X2 Mini PC

Ryzen AI Max+ 395,64GB unified memory. The rig in this post.

View on Amazon →

Featured tool

Hermes Agent

Open-source agent layer from Nous Research. Tool use, memory, structured output.

Get Hermes →

Best Mini PC for an AI Agent

The $400 box pick, GMKtec M5 Ultra, specs and trade-offs for running the agent.

Read the field note →

Local AI Agent on a Mini PC.
A 24/7 brain for under $2,500.

§ 01The shift. Small models do real work now.

The gap on small models closed.

Cloud agent costs got worse, not better.

The hardware caught up.

§ 02The hardware. A mini PC in a closet, on a UPS.

Specs that actually matter for this use case.

Specs that don't matter (for this use case).

§ 03The brain. Pick a model that fits the shop.

Concrete options as of right now.

§ 04The agent. Turning the brain into a coworker.

§ 05The cost math. Cloud vs. on-prem over 24 months.

Cloud baseline (frontier API, agent-style usage).

On-prem cost.

Over 24 months.

§ 06The privacy upside. The part operators raise on the call.

§ 07What it looks like deployed at a real shop.

§ 08Honest constraints.

§ 09Takeaway, in one line.

An AI employee that doesn't sleep, on hardware you own.

Offer

Studio

Direct

§ 01The shift. Small models do real work now.

The gap on small models closed.

Cloud agent costs got worse, not better.

The hardware caught up.

§ 02The hardware. A mini PC in a closet, on a UPS.

Specs that actually matter for this use case.

Specs that don't matter (for this use case).

§ 03The brain. Pick a model that fits the shop.

Concrete options as of right now.

§ 04The agent. Turning the brain into a coworker.

§ 05The cost math. Cloud vs. on-prem over 24 months.

Cloud baseline (frontier API, agent-style usage).

On-prem cost.

Over 24 months.

§ 06The privacy upside. The part operators raise on the call.

§ 07What it looks like deployed at a real shop.

§ 08Honest constraints.

§ 09Takeaway, in one line.

AI receptionist for HVAC. No heat, no AC, no voicemail.

AI receptionist for plumbers. Emergency triage that books.

AI receptionist for real estate. The 5-minute rule.

An AI employee that doesn't sleep, on hardware you own.

Offer

Studio

Direct