How I Run a Self-Healing AI Agent Stack on a $5 VPS
Published: April 23, 2026
Tags: self-hosting, ai-agents, vps, automation, tailscale
Most people assume you need a $40/month server and a DevOps team to run AI agents around the clock. I assumed that too — until I built a stack that self-heals on a 1GB VPS and costs less than a sandwich per month.
This is the exact architecture I use to host my own AI assistant. It recovers from crashes without waking me up, has no public-facing ports, and compiles inside RAM constraints that would make Node.js weep.
The Goal
I wanted three things:
- Always on — The agent should survive reboots, OOM kills, and random Linux quirks.
- Invisible to the internet — No open ports. No reverse proxies. No attack surface.
- Cheap — Under $5/month, because experiments shouldn’t require venture funding.
The Stack
| Layer | Tool | Purpose |
|---|---|---|
| Compute | 1GB VPS (Linode) | The bare metal — or cloud equivalent |
| Networking | Tailscale | Encrypted mesh VPN; replaces public IPs |
| Orchestration | systemd (user services) | Auto-start, restart-on-failure, logging |
| Monitoring | Python watchdog | Escalating recovery from soft restart to hard reset |
| Agent Core | Hermes (self-hosted) | Telegram-native AI agent with tool use |
| Build | Astro + Node.js | Static site generation for the blog you’re reading |
The Hard Part: Memory
1GB RAM is hostile territory for modern software. Building the agent’s web UI with Node.js consistently hit out-of-memory kills until I made two changes:
- Swapfile: 2GB of swap on SSD. Slow, but prevents death.
- Heap cap: `NODE_OPTIONS="--max-old-space-size=512"` forces the garbage collector to stay disciplined.
With those two tweaks, the build finishes. Slowly, but it finishes.
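A preflight that applies the same idea before every build, added here as an example (it is not the author's tooling): read `/proc/meminfo`, size `--max-old-space-size` to the RAM and swap actually available instead of hard-coding it, and refuse to start when even a minimal heap would not fit. The sample `/proc/meminfo` text, the `npm run build` default and the 512/256 MiB thresholds are assumptions.

```python
"""Pre-build memory preflight for a Node.js build on a small VPS.

Added example, not the author's tooling. Reads /proc/meminfo, derives a
--max-old-space-size that fits in the RAM and swap actually available, and
refuses to start a build that is likely to be OOM-killed. The build command
and thresholds are assumptions.
"""
import os
import subprocess
import sys

# Constructed sample of /proc/meminfo (a 1 GB VPS with a 2 GB swapfile),
# used when the real file is unavailable, e.g. off Linux.
SAMPLE_MEMINFO = """MemTotal:        1009480 kB
MemFree:           84212 kB
MemAvailable:     412900 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""


def parse_meminfo(text):
    """Return {field: kB} for every 'Name:   12345 kB' line."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def choose_heap_cap_mb(meminfo, preferred_mb=512, floor_mb=256):
    """Pick the V8 old-space cap in MiB, or 0 if a build should not start.

    The cap only bounds the JS heap; native memory, code and buffers come on
    top, and the OS needs room too, so the heap gets at most half of
    MemAvailable plus SwapFree. Below floor_mb the build will thrash or die,
    so stop up front instead of letting the OOM killer decide mid-build.
    """
    available_mb = (meminfo.get("MemAvailable", 0) + meminfo.get("SwapFree", 0)) // 1024
    cap = min(preferred_mb, available_mb // 2)
    return cap if cap >= floor_mb else 0


def build_env(meminfo, base_env=None):
    """Environment for the build with NODE_OPTIONS set; raises if unsafe."""
    cap = choose_heap_cap_mb(meminfo)
    if cap == 0:
        raise MemoryError("not enough RAM+swap for a Node.js build; add swap first")
    env = dict(os.environ if base_env is None else base_env)
    env["NODE_OPTIONS"] = "--max-old-space-size=%d" % cap
    return env


def main(argv):
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE_MEMINFO
    try:
        env = build_env(parse_meminfo(text))
    except MemoryError as exc:
        print("refusing to build:", exc, file=sys.stderr)
        return 2
    print("NODE_OPTIONS=" + env["NODE_OPTIONS"])
    cmd = argv[1:] or ["npm", "run", "build"]   # assumed default build command
    return subprocess.call(cmd, env=env)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

On the constructed 1 GB box the swap is what makes the build admissible: with `SwapFree` at zero the same `MemAvailable` yields a cap below the floor and the script exits before Node.js is even started, which is cheaper than discovering the OOM kill ten minutes into the build.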
Security by Absence
Instead of binding services to 0.0.0.0 and praying, every web interface binds exclusively to my Tailscale IP (100.x.x.x). That address is only reachable from machines inside my tailnet. From the public internet, the VPS is a black hole.
No firewall rules to maintain. No certificate renewals for public subdomains. No SSH tunnel scripts. If Tailscale is running, I have access. If it’s not, nothing does.
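The thing that silently breaks this model is a service that, after an upgrade or a config rewrite, falls back to binding `0.0.0.0` again. The check below is an added example (not part of the original post) that audits the host's listeners: it parses `ss -ltnH` and flags every TCP listener that is not on loopback or on the Tailscale address. The `ss` sample, the Tailscale address `100.101.102.103` and the use of `tailscale ip -4` to learn that address are assumptions; the sample deliberately contains a wildcard-bound service on 3000 and an sshd on `[::]:22` so there is something to flag.

```python
"""Listener audit for a Tailscale-only host.

Added example, not part of the original post. Parses `ss -ltnH` output and
flags every TCP listener that is not bound to loopback or the Tailscale
address: a wildcard bind (0.0.0.0 / [::] / *) or a public address means a
service reopened the VPS to the internet. The sample `ss` output and the
Tailscale address 100.101.102.103 are constructed.
"""
import ipaddress
import subprocess
import sys

# Tailscale assigns node addresses from the CGNAT range; used only as a
# fallback when `tailscale ip -4` cannot be run.
TAILSCALE_RANGE = ipaddress.ip_network("100.64.0.0/10")

# Constructed sample of `ss -ltnH` (State Recv-Q Send-Q Local Peer).
SAMPLE_SS = """LISTEN 0 128 100.101.102.103:8080 0.0.0.0:*
LISTEN 0 511 0.0.0.0:3000 0.0.0.0:*
LISTEN 0 128 127.0.0.1:5432 0.0.0.0:*
LISTEN 0 4096 [::]:22 [::]:*
LISTEN 0 128 [::1]:6379 [::]:*
"""


def split_local(addr):
    """'[::]:22' -> ('::', 22); '100.1.2.3:8080' -> ('100.1.2.3', 8080)."""
    host, _, port = addr.rpartition(":")
    return host.strip("[]"), int(port)


def classify(host, tailscale_ips):
    """Return 'ok', 'wildcard' or 'public' for a bound address."""
    if host in ("*", "0.0.0.0", "::"):
        return "wildcard"
    ip = ipaddress.ip_address(host.split("%")[0])   # drop any IPv6 scope id
    if ip.is_loopback or str(ip) in tailscale_ips:
        return "ok"
    if not tailscale_ips and ip in TAILSCALE_RANGE:
        return "ok"                                  # best effort without the CLI
    return "public"


def audit(ss_output, tailscale_ips):
    """Return [(port, host, verdict)] for each listener that is not ok."""
    findings = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        host, port = split_local(fields[3])
        verdict = classify(host, tailscale_ips)
        if verdict != "ok":
            findings.append((port, host, verdict))
    return findings


def tailscale_ips():
    """Addresses this node holds on the tailnet, or an empty set if unknown."""
    try:
        out = subprocess.run(["tailscale", "ip", "-4"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return set()
    return {line.strip() for line in out.splitlines() if line.strip()}


def main():
    try:
        ss_out = subprocess.run(["ss", "-ltnH"], capture_output=True,
                                text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        ss_out = SAMPLE_SS
    findings = audit(ss_out, tailscale_ips())
    for port, host, verdict in findings:
        print("port %d bound to %s (%s)" % (port, host, verdict))
    return 1 if findings else 0


if __name__ == "__main__":
    sys.exit(main())
```

A non-zero exit means a listener escaped the tailnet; run it from a timer or as one of the watchdog's checks. One caveat worth knowing about this bind-to-100.x design: a unit that starts before `tailscaled` has brought the interface up cannot bind the address and fails, which is exactly the case systemd's restart-on-failure exists for.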
The Watchdog: Escalating Recovery
Systemd handles the happy path — start on boot, restart if the process exits. But what if the process is still running yet completely unresponsive? What if the API server hangs without crashing? What if the gateway PID file gets stale?
I wrote a 300-line Python watchdog that runs as its own systemd service. It checks health every 30 seconds and escalates through four levels of intervention:
1. Soft restart — `systemctl --user restart service`
2. Hard kill — Find and terminate stale processes by port ownership
3. Full reset — Tear down and relaunch the entire stack
4. Nuclear — VPS reboot (disabled by default, but available)
In practice, level 1 or 2 handles 99% of issues. I’ve woken up to find the agent recovered from three separate failure modes overnight without human intervention.
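A skeleton of that escalation, added as an example (it is not the author's 300-line script): a health check on a timer, four recovery levels, and the reboot level off unless explicitly enabled. The check is an HTTP probe so a hung-but-running server counts as failed, the hard kill finds owners of the port via `ss -ltnp`, and the full reset clears a PID file whose process is gone. The unit name `hermes-gateway.service`, the health URL, the port and the PID-file path are assumptions; the recovery actions are injected so the escalation logic can be exercised without systemd.

```python
"""Escalating watchdog: health check on a timer, four recovery levels,
reboot level off unless explicitly enabled.

Added example, not the author's script. Unit name, health URL, port and
PID-file path are assumptions; actions are injected for testability.
"""
import os
import re
import signal
import subprocess
import sys
import time
import urllib.error
import urllib.request

SERVICE = "hermes-gateway.service"                      # assumed user unit
HEALTH_URL = "http://100.101.102.103:8080/health"       # assumed Tailscale-bound API
PORT = 8080
PID_FILE = os.path.expanduser("~/.hermes/gateway.pid")  # assumed path


def http_healthy(url=HEALTH_URL, timeout=5):
    """A server that never answers is as failed as one that is dead."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def pid_alive(pid):
    try:
        os.kill(pid, 0)           # signal 0: existence check only
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True               # exists, but belongs to another user


def pids_owning_port(port):
    """PIDs holding the listening socket, from `ss -ltnpH 'sport = :PORT'`."""
    out = subprocess.run(["ss", "-ltnpH", "sport = :%d" % port],
                         capture_output=True, text=True).stdout
    return {int(p) for p in re.findall(r"pid=(\d+)", out)}


def remove_stale_pidfile(path=PID_FILE):
    """Delete a PID file whose process is gone; True if something was removed."""
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False
    except ValueError:
        pid = -1                  # empty or garbage file: treat as stale
    if pid > 0 and pid_alive(pid):
        return False
    os.remove(path)
    return True


def soft_restart():
    subprocess.run(["systemctl", "--user", "restart", SERVICE])


def hard_kill():
    for pid in pids_owning_port(PORT):
        os.kill(pid, signal.SIGKILL)
    soft_restart()


def full_reset():
    subprocess.run(["systemctl", "--user", "stop", SERVICE])
    for pid in pids_owning_port(PORT):
        os.kill(pid, signal.SIGKILL)
    remove_stale_pidfile()
    subprocess.run(["systemctl", "--user", "start", SERVICE])


def reboot():
    subprocess.run(["sudo", "systemctl", "reboot"])


class Watchdog:
    """Moves up one level after every `tries_per_level` consecutive failures."""

    def __init__(self, check, actions, allow_reboot=False, tries_per_level=2):
        self.check, self.actions = check, actions
        self.allow_reboot, self.tries_per_level = allow_reboot, tries_per_level
        self.failures = 0

    def tick(self):
        """Run one check; return the level acted at (0 = healthy)."""
        if self.check():
            self.failures = 0     # any success resets the ladder
            return 0
        self.failures += 1
        level = min(4, (self.failures + self.tries_per_level - 1) // self.tries_per_level)
        if level == 4 and not self.allow_reboot:
            level = 3             # nuclear option stays disabled by default
        self.actions[level - 1]()
        return level


DEFAULT_ACTIONS = [soft_restart, hard_kill, full_reset, reboot]

if __name__ == "__main__":
    wd = Watchdog(http_healthy, DEFAULT_ACTIONS, allow_reboot="--allow-reboot" in sys.argv)
    while True:
        level = wd.tick()
        if level:
            print("unhealthy, acted at level %d" % level, flush=True)
        time.sleep(30)
```

Two design points carry over from the post. The ladder only climbs on consecutive failures and drops back to zero on the first healthy check, so a one-off hiccup never reaches the hard kill. And the watchdog is itself a user service with restart-on-failure, so the thing doing the watching is also watched; the one level nothing below it can fix is the reboot, which is why it stays behind an explicit flag.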
What This Unlocks
With the infrastructure on autopilot, the actual work becomes creative instead of operational:
- Continuous presence: The agent is always in my Telegram chat, ready to research, code, or write.
- Hybrid publishing: It drafts blog posts, I review them on my phone, and it deploys on approval.
- One-command rollback: If an update breaks something, `systemctl --user stop` and a git revert bring back stability.
The Bill
| Item | Monthly Cost |
|---|---|
| 1GB VPS (Linode) | ~$5 |
| Tailscale (personal) | $0 |
| Cloudflare Pages (blog hosting) | $0 |
| Total | ~$5 |
What I’d Change
If I were rebuilding today, I’d skip the web UI entirely. The CLI and Telegram interface are where 100% of the value lives. The web dashboard was a fun experiment, but on 1GB RAM it was the first thing to sacrifice when resources got tight. Decommissioning it freed up hundreds of megabytes and eliminated an entire class of Node.js build issues.
The future upgrade path is clear: migrate to a 2GB VPS, re-enable the dashboard, and keep the same watchdog and Tailscale philosophy. The architecture scales horizontally even if the wallet doesn’t.
Final Thought
There’s a peculiar satisfaction in infrastructure you can forget about. The best system is one that alerts you only when something truly novel breaks — not when a process hiccups at 3 AM.
This stack isn’t perfect, but it’s reliable. And for autonomous agents, reliability is the feature that matters most.
Want the exact watchdog script or systemd unit files? Drop me a line — I publish the tooling that runs this site.