Building a Production LLM Stack for $0/Month
Two years ago, I asked myself a simple question: Can I run my own LLM, train it on my data, and deploy it without GPU rental costs?
Today, hotep-llm-v7-q8 runs at 91 tokens/second on consumer hardware. All local. No API bills. This is how I did it.
The Truth About Running Your Own LLM
Most people think you need:
- ❌ $10K+ in GPUs
- ❌ Cloud API subscriptions
- ❌ ML engineering team
Reality:
- ✅ Consumer GPU
- ✅ Open source tools
- ✅ Relentless iteration
The infrastructure exists. You just need to wire it together.
The Stack That Works
Training: Custom fine-tunes (consumer GPUs)
Inference: Ollama → 91 tok/s
Vector DB: ChromaDB (224 articles)
Cache: Garnet (Redis-compatible)
Web: Astro + FastAPI
Bot: python-telegram-bot
Every piece is replaceable. That’s the feature, not a bug.
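These pieces talk over plain HTTP, so swapping one out is mostly a URL change. As a minimal sketch of the inference hop, here's how a Python service can call Ollama's local `/api/generate` endpoint (port 11434 is Ollama's default; the model name comes from this post, the function names are my own):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(prompt, model="hotep-llm-v7-q8"):
    """Request body for Ollama's /api/generate; stream=False asks
    for a single JSON object instead of a token stream."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt):
    """POST the prompt to the local Ollama server and return the text."""
    body = json.dumps(build_generate_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because it's just HTTP, the same call shape works from FastAPI handlers and the Telegram bot alike.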
Versions 1-3: Where I Failed
First attempts were disasters:
- Wrong quantization → VRAM crashes
- Broken embeddings → garbage retrieval
- No fallback chain → angry users
Breakthrough: Q8_0 quantization
Balances quality + VRAM for 8B models on consumer hardware. What quantization are you using?
The Fallback Chain That Saved Me
Production ≠ perfect inference. It’s about graceful degradation:
Ollama v7-q8 → v7 → Gemini Flash → Static Error
Local model → lighter local fallback → hosted API → cached response → error page.
Users never see downtime. They see cached responses. How do you handle inference failures?
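In code, a chain like this reduces to trying backends in order and degrading to cache, then a static message. A minimal sketch, assuming each backend is a callable that raises on failure (the function shape and backend names are mine, not the project's actual code):

```python
def generate_with_fallback(prompt, backends, cache=None,
                           static_error="Something went wrong. Please try again later."):
    """Try each (name, callable) backend in order; on total failure,
    serve a cached response before falling back to the static error."""
    cache = {} if cache is None else cache
    for name, backend in backends:
        try:
            reply = backend(prompt)
        except Exception:
            continue               # degrade to the next tier
        cache[prompt] = reply      # refresh the cache on every success
        return name, reply
    if prompt in cache:
        return "cache", cache[prompt]
    return "static", static_error
```

In production the cache would live in something like Garnet rather than a dict, but the control flow is the same.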
RAG Is Harder Than It Looks
224 knowledge articles, semantically searchable.
The trick wasn’t embeddings—it was metadata filtering.
Articles need categories, timestamps, confidence scores. Otherwise you retrieve noise and users notice immediately.
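That metadata filtering maps onto ChromaDB's `where` operators (`$eq`, `$gte`, `$and` are Chroma's filter syntax; the helper function, collection name, metadata keys, and DB path below are my assumptions for illustration):

```python
def build_metadata_filter(category=None, min_confidence=None, since_ts=None):
    """Compose a ChromaDB `where` filter; Chroma requires an explicit
    $and when combining more than one condition."""
    clauses = []
    if category is not None:
        clauses.append({"category": {"$eq": category}})
    if min_confidence is not None:
        clauses.append({"confidence": {"$gte": min_confidence}})
    if since_ts is not None:
        clauses.append({"timestamp": {"$gte": since_ts}})
    if not clauses:
        return None  # no filter: plain semantic search
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

if __name__ == "__main__":
    import chromadb
    client = chromadb.PersistentClient(path="chroma_db")   # path is an assumption
    articles = client.get_collection("articles")           # name is an assumption
    hits = articles.query(
        query_texts=["how do I set up the fallback chain?"],
        n_results=5,
        where=build_metadata_filter(category="deployment", min_confidence=0.8),
    )
```

Filtering before ranking is what keeps retrieval from surfacing stale or low-confidence articles.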
The Bot That Never Sleeps
Telegram bot integration with:
- Auto-restart watchdog
- Voice message support
- Stripe payments
- Newsletter management
A .vbs script in the Windows Startup folder = cron jobs without admin privileges. Sometimes the jankiest solutions are the most reliable.
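An auto-restart watchdog can be as small as a loop around `subprocess`. A hedged sketch (not the project's actual watchdog; the restart policy is mine):

```python
import subprocess
import time

def run_with_restart(cmd, max_restarts=3, delay=1.0):
    """Run cmd and relaunch it whenever it dies with a non-zero exit,
    up to max_restarts times. Returns the number of restarts performed."""
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        if code == 0:
            return restarts    # clean shutdown: stop watching
        if restarts >= max_restarts:
            return restarts    # give up after repeated crashes
        restarts += 1
        time.sleep(delay)      # back off briefly before relaunching
```

A real watchdog would also log crash reasons and reset the restart counter after a healthy uptime window, but this loop is the core of it.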
Production Has Two Bots
@hotep_test_bot → Experiments
@askhotep_ai_bot → Users
Deploy to test → verify → push to prod.
Simple pattern, saved me countless production incidents. Do you run separate test environments?
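Keeping two bots honest comes down to never hard-coding a token. A minimal sketch of env-based selection (the env-var names here are assumptions for illustration, not the project's actual config):

```python
import os

def select_bot_token(env=None):
    """Pick the Telegram token by environment: test bot unless HOTEP_ENV=prod."""
    env = os.environ if env is None else env
    if env.get("HOTEP_ENV") == "prod":
        return env.get("PROD_BOT_TOKEN", "")   # @askhotep_ai_bot
    return env.get("TEST_BOT_TOKEN", "")       # @hotep_test_bot
```

Defaulting to the test bot means a missing or typoed env var can never accidentally page real users.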
What 8.1 GB Actually Gets You
- 91 tokens/second
- 16K context window
- RAG retrieval in <50ms
- Multi-user conversation history
- Voice + text + payments
All from a single Q8_0 quantized model running locally.
The model size myth is holding you back.
The Numbers Don’t Lie
| Metric | Value |
|---|---|
| Tasks completed | 101+ |
| Swarm agents | 50 launched (100% success) |
| Estimation accuracy | 91.8% |
| CI tests | 358 passed, 0 failed |
| Knowledge articles | 224 |
Data-driven iteration beats intuition every time.
Lessons That Cost Me Months
1. Version drift kills
Updated .env to v7, deployed, everything worked. Except:
- conversation_history.py → v4 default
- mobile_api.py → v4 default
- voice_service.py → v4 default
Silent config drift is deadly. I now grep entire repos after version changes.
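That repo-wide grep is easy to script so it runs after every version bump. A sketch (the `v4` needle comes from the story above; the function is mine):

```python
from pathlib import Path

def find_version_refs(root, needle="v4"):
    """Scan every .py file under root for a stale version string.
    Returns (path, line_number, line) tuples for manual review."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Wiring this into CI as a failing check is stronger than remembering to run it by hand.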
2. CI ≠ production
Tests pass ≠ CSS classes exist. Build success means no syntax errors, not that your app actually works.
3. Config is truth
.env overrides defaults, always. Before debugging “wrong config value,” check .env first, then config/ports.py, then code defaults.
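That precedence order is worth encoding once and reusing everywhere. A minimal sketch, assuming the config-file defaults are loaded into a dict (the function name and shape are mine):

```python
import os

def resolve(key, file_defaults=None, code_default=None, env=None):
    """Look up a setting in the precedence order described above:
    .env/environment first, then config-file defaults, then the code default."""
    env = os.environ if env is None else env
    if key in env:
        return env[key]
    if file_defaults and key in file_defaults:
        return file_defaults[key]
    return code_default
```

One resolver means one place to add logging when you're debugging which layer a value actually came from.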
4. Commit before restart
Live code without a commit is a rollback risk. Edit → lint check → git add → git commit → then kill the process for watchdog restart.
The Cost Comparison
My setup:
- Hardware: Consumer GPU (one-time)
- Hosting: Cloudflare Pages (free tier)
- Bot: Free
- APIs: Pay-as-you-go fallback only
Commercial alternative: $1000+/mo for API calls at scale.
The math changes when you own the stack.
For the Builders: Start Here
- Quantized models first (Ollama)
- Add vector search (ChromaDB)
- Build fallback chain (before features!)
- Never skip test bot
- Commit before restart
The rest is iteration.
The Swarm That Built It
50 specialized agents, 100% success rate.
Each agent owns files. No overlap. QA agent checks cross-file consistency.
Swarm coordination beats solo hacking for complex systems. Hotep LLM was built BY agents, FOR agents.
What’s Next for hotep_llm
- v8 training with expanded dataset
- Multi-modal support (images + text)
- Autonomous self-training
- Control plane UI for monitoring
The goal: Agents that train, deploy, and monitor themselves. We’re getting close.
The Real Innovation
hotep_llm isn’t about novel ML research.
It’s proof that you don’t need a research lab to run production LLMs.
Local models + good infra + relentless iteration = magic.
The barrier to entry is gone. What will you build?
Try it live: askhotep.ai
Stack: Ollama + ChromaDB + Garnet + FastAPI
Model: hotep-llm-v7-q8 (8.1 GB, 91 tok/s)