How I Run a Self-Healing AI Agent Stack on a $5 VPS

Published: April 23, 2026

Tags: self-hostingai-agentsvpsautomationtailscale

Most people assume you need a $40/month server and a DevOps team to run AI agents around the clock. I assumed that too — until I built a stack that self-heals on a 1GB VPS and costs less than a sandwich per month.

This is the exact architecture I use to host my own AI assistant. It recovers from crashes without waking me up, has no public-facing ports, and compiles inside RAM constraints that would make Node.js weep.

The Goal

I wanted three things:

Always on — The agent should survive reboots, OOM kills, and random Linux quirks.
Invisible to the internet — No open ports. No reverse proxies. No attack surface.
Cheap — Under $5/month, because experiments shouldn’t require venture funding.

The Stack

Layer	Tool	Purpose
Compute	1GB VPS (Linode)	The bare metal — or cloud equivalent
Networking	Tailscale	Encrypted mesh VPN; replaces public IPs
Orchestration	systemd (user services)	Auto-start, restart-on-failure, logging
Monitoring	Python watchdog	Escalating recovery from soft restart to hard reset
Agent Core	Hermes (self-hosted)	Telegram-native AI agent with tool use
Build	Astro + Node.js	Static site generation for the blog you’re reading

The Hard Part: Memory

1GB RAM is hostile territory for modern software. Building the agent’s web UI with Node.js consistently hit out-of-memory kills until I made two changes:

Swapfile: 2GB of swap on SSD. Slow, but prevents death.
Heap cap: NODE_OPTIONS="--max-old-space-size=512" forces the garbage collector to stay disciplined.

With those two tweaks, the build finishes. Slowly, but it finishes.

Security by Absence

Instead of binding services to 0.0.0.0 and praying, every web interface binds exclusively to my Tailscale IP (100.x.x.x). That address is only reachable from machines inside my tailnet. From the public internet, the VPS is a black hole.

No firewall rules to maintain. No certificate renewals for public subdomains. No SSH tunnel scripts. If Tailscale is running, I have access. If it’s not, nothing does.

The Watchdog: Escalating Recovery

Systemd handles the happy path — start on boot, restart if the process exits. But what if the process is still running yet completely unresponsive? What if the API server hangs without crashing? What if the gateway PID file gets stale?

I wrote a 300-line Python watchdog that runs as its own systemd service. It checks health every 30 seconds and escalates through four levels of intervention:

Soft restart — systemctl --user restart service
Hard kill — Find and terminate stale processes by port ownership
Full reset — Tear down and relaunch the entire stack
Nuclear — VPS reboot (disabled by default, but available)

In practice, level 1 or 2 handles 99% of issues. I’ve woken up to find the agent recovered from three separate failure modes overnight without human intervention.

What This Unlocks

With the infrastructure on autopilot, the actual work becomes creative instead of operational:

Continuous presence: The agent is always in my Telegram chat, ready to research, code, or write.
Hybrid publishing: It drafts blog posts, I review them on my phone, and it deploys on approval.
One-command rollback: If an update breaks something, systemctl --user stop and a git revert bring back stability.

The Bill

Item	Monthly Cost
1GB VPS (Linode)	~$5
Tailscale (personal)	$0
Cloudflare Pages (blog hosting)	$0
Total	~$5

What I’d Change

If I were rebuilding today, I’d skip the web UI entirely. The CLI and Telegram interface are where 100% of the value lives. The web dashboard was a fun experiment, but on 1GB RAM it was the first thing to sacrifice when resources got tight. Decommissioning it freed up hundreds of megabytes and eliminated an entire class of Node.js build issues.

The future upgrade path is clear: migrate to a 2GB VPS, re-enable the dashboard, and keep the same watchdog and Tailscale philosophy. The architecture scales horizontally even if the wallet doesn’t.

Final Thought

There’s a peculiar satisfaction in infrastructure you can forget about. The best system is one that alerts you only when something truly novel breaks — not when a process hiccups at 3 AM.

This stack isn’t perfect, but it’s reliable. And for autonomous agents, reliability is the feature that matters most.

Want the exact watchdog script or systemd unit files? Drop me a line — I publish the tooling that runs this site.