This started because the cloud AI bill kept climbing. OpenAI for chat, Anthropic for reasoning, Google for embeddings, Replicate for image gen, Runway for video, ElevenLabs for voice — every interesting idea meant another metered API and another monthly invoice that scaled with use. Building something ambitious meant signing up to pay forever.
So Nexus runs on two mini-PCs in a closet. The models are local. The database is local. Every token, embedding, photo description, transcription, face match, and reranked result happens on hardware I own outright. The hardware paid for itself in a few months of saved API spend, and now the marginal cost of an idea is zero — which turns out to be the only constraint that matters for actually building things.
Two GMKTec EVO-X2 machines connected by Thunderbolt 5, running AMD Strix Halo. Furnace at 96GB unified VRAM, Crucible at 64GB. 160GB of GPU memory across one platform, split across two brains.
Furnace, Crucible, and Anvil are stitched together with four distinct network paths — a dedicated Thunderbolt 5 cable for the hot loop, a Thunderbolt bridge for the Mac, Tailscale overlaying everything for secure remote access, and 1 GbE for background LAN traffic. Each link carries different flows and has wildly different latency characteristics.
| Flow | Endpoint | Path |
|---|---|---|
| llama-server → rpc tensor split | 10.10.12.2:50052 | TB direct |
| crucible-bulk → PostgreSQL | 10.10.12.1:5432 | TB direct |
| nexus-scaler → job queue | 10.10.12.1:5432 | TB direct |
| bulk worker → Forge API | 10.10.12.1:8642 | TB direct |
| bulk worker → tool exec server | 10.10.12.1:7702 | TB direct |
| Flow | Endpoint | Path |
|---|---|---|
| Forge → Anvil VLM | 10.10.0.1:8081 | TB bridge |
| Worker → Apple Photos MCP | ssh stdio | Tailnet |
| Worker → iMessage MCP | ssh stdio | Tailnet |
| Worker → Apple Ecosystem MCP | ssh stdio | Tailnet |
| Remote admin / laptop | ssh | Tailnet |
Nexus is an npm workspaces monorepo with four packages. The API server handles external requests, the Worker runs agent cycles and processes jobs, the MCP server exposes tools via the Model Context Protocol, and Core provides shared services used by all three.
Worker modes: The worker runs in two configurations. Primary handles agent cycles, event listeners, and urgent jobs. Bulk handles long-running LLM work — embeddings, sentiment analysis, photo descriptions. The autoscaler spawns ephemeral bulk workers when the queue grows, and kills them when it drains.
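The scale-up decision can be sketched as a pure function. This is a sketch, not the real autoscaler: the >50-job trigger and 4-workers-per-machine cap come from the text, but `workersToSpawn` and the one-worker-per-50-jobs ratio are assumptions.

```typescript
// Sketch of the autoscaler's scaling decision. The threshold and
// per-machine cap match the numbers stated in the text; the
// proportional sizing heuristic is an assumption.
const SPAWN_THRESHOLD = 50; // pending jobs before scaling up
const MAX_BULK_WORKERS = 4; // ephemeral workers per machine

function workersToSpawn(pendingJobs: number, runningWorkers: number): number {
  if (pendingJobs <= SPAWN_THRESHOLD) return 0;
  // Roughly one worker per 50 queued jobs, capped at the machine limit.
  const desired = Math.min(MAX_BULK_WORKERS, Math.ceil(pendingJobs / SPAWN_THRESHOLD));
  return Math.max(0, desired - runningWorkers);
}
```

When the queue drains, workers exit on their own, so the inverse decision (scale-down) never needs to run.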
Unified LLM API gateway running on Furnace. Every agent, every tool, every background job — all inference routes through Forge. OpenAI-compatible endpoints with request logging, model routing, Prometheus metrics, and circuit breakers.
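The routing layer can be pictured as a small lookup: logical model names map to local backends, so callers hit one OpenAI-compatible endpoint and never know which machine serves the request. A hypothetical sketch — the IPs are the Thunderbolt addresses from the tables above, but the model names, ports, and `route` function are assumptions, not Forge's real config.

```typescript
// Hypothetical Forge-style model routing table. Model names and
// backend ports are invented for illustration.
type Backend = { host: string; port: number };

const routes: Record<string, Backend> = {
  "flagship-70b": { host: "10.10.12.1", port: 8080 }, // tensor-split via RPC
  "priority":     { host: "10.10.12.1", port: 8081 }, // low-latency agent calls
  "embeddings":   { host: "10.10.12.2", port: 8082 }, // bulk enrichment
};

function route(model: string): Backend {
  const backend = routes[model];
  if (!backend) throw new Error(`no route for model: ${model}`);
  return backend; // in Forge, logging and circuit breakers would wrap this
}
```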
Four models running simultaneously on local hardware. The 70B flagship is tensor-split across both machines via RPC. Each model is purpose-matched to its workload.
Ten specialized services running alongside the core LLM models. Each handles a specific modality — vision, audio, identity, search — all accessible through Forge's unified API.
Everything a digital life touches flows through the same ingestion pipeline: normalize → deduplicate → store → enrich (sentiment, embeddings, knowledge graph) → proactive rules. Sources run on different cadences — some real-time, some scheduled, some manually imported archives — but they all land in the same PostgreSQL database and are queryable as one dataset.
Each source lands in its own table, gets normalized into ingestion_log, then fans out through the enrichment pipeline: sentiment analysis, 768-dim embeddings, knowledge graph extraction, and GPS reverse-geocoding where applicable. The proactive engine watches all sources and surfaces significant events to agents via the inbox. Nothing is truly siloed — a photo cluster can cross-reference a trip, which cross-references messages with a person, which cross-references a song played during that week.
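The normalize step is the key to everything being queryable as one dataset: every source row gets reduced to one common shape before enrichment. A minimal sketch — the field names below are illustrative assumptions, not the real `ingestion_log` schema.

```typescript
// Illustrative shape of the normalize step. Field names are
// assumptions; the real ingestion_log schema may differ.
interface IngestionRecord {
  source: string;      // "imessage", "photos", "email", ...
  externalId: string;  // dedupe key within the source
  occurredAt: string;  // ISO timestamp for cross-source ordering
  content: string;     // text fed to embeddings / sentiment
}

function normalizeIMessage(row: { guid: string; text: string; date: Date }): IngestionRecord {
  return {
    source: "imessage",
    externalId: row.guid,
    occurredAt: row.date.toISOString(),
    content: row.text.trim(),
  };
}
```

Because every source produces the same record, the enrichment fan-out (sentiment, embeddings, knowledge graph) is written once and works for all of them.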
Watch a single iMessage flow through every stage of the system — from arrival to action — in about thirty seconds. This is the same path every piece of incoming data takes.
01 iMessage arrives, deduplicated
02 normalized into ingestion_log
03 embedding generated, stored in pgvector
04 sentiment classified via Forge priority LLM
05 entities + facts extracted into the knowledge graph
06 proactive engine scores significance
07 if significant, NOTIFY ARIA's inbox channel
08 ARIA wakes mid-cycle, decides what to do, calls tools
Seven autonomous agents, each with a distinct role and personality. They coordinate through an inbox system, run on configurable schedules, and operate under graduated autonomy levels.
Your interface to the platform. ARIA coordinates the team, delivers briefings, manages your inbox, triages email, and handles personal requests. She's the face of Nexus — warm, adaptive, and always one step ahead.
Every agent cycle is a complete loop: load state → reason via LLM → pick tool calls → execute → log the outcome. Below is one real decision pulled live from the database, sanitized by Claude Opus, and walked through step by step. Underneath that, a feed of the last 24 hours of agent activity — same source, same sanitizer.
Last 24 hours · sampled · sanitized
Ten MCP servers provide 328 tools to the agent team. Each server connects external services — Gmail, GitHub, Notion, iMessage — into a unified tool catalog that agents discover on demand.
Each agent runs as a short-lived cycle: load soul package, gather state, call Forge, parse actions, execute tools, log decision. Up to 5 rounds per cycle, with graduated autonomy and approval gates.
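The cycle loop above can be sketched in a few lines. `llmCall` and `executeTool` here are stand-ins for the real Forge client and tool executor; the 5-round cap matches the text, the rest is an assumption.

```typescript
// Minimal sketch of one agent cycle: reason, act, feed results back,
// repeat up to 5 rounds. The real loop is async and also enforces
// autonomy levels and approval gates inside executeTool.
const MAX_ROUNDS = 5;

type Action = { tool: string; args: unknown };
type LlmTurn = { actions: Action[]; done: boolean };

function runCycle(
  llmCall: (history: string[]) => LlmTurn,
  executeTool: (a: Action) => string,
): number {
  const history: string[] = ["<soul package + gathered state>"];
  let rounds = 0;
  while (rounds < MAX_ROUNDS) {
    rounds++;
    const turn = llmCall(history);
    for (const action of turn.actions) {
      history.push(executeTool(action)); // tool output feeds the next round
    }
    if (turn.done || turn.actions.length === 0) break;
  }
  return rounds; // the real framework logs the decision here
}
```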
Agent personas follow ClawSouls Soul Spec v0.5. Each agent has a soul package directory containing standardized files that define identity, behavior, and operational context.
The framework uses progressive disclosure — Level 1 loads only the manifest for quick discovery, Level 2 loads core behavior files every cycle, and Level 3 adds coordination context only when the agent has unread inbox messages. Most cycles are quiet, so Level 2 keeps token costs minimal.
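The three levels might look something like this. The file names below are hypothetical stand-ins in the Soul Spec style, not the actual package contents:

```typescript
// Sketch of progressive disclosure. File names are hypothetical;
// the level semantics (manifest / core / coordination) are from the text.
function filesForLevel(level: 1 | 2 | 3, hasUnreadInbox: boolean): string[] {
  const manifest = ["soul.json"];                   // Level 1: quick discovery
  const core = ["IDENTITY.md", "CHECKLIST.md"];     // Level 2: every cycle
  const coordination = ["TEAM.md", "PROTOCOLS.md"]; // Level 3: inbox context
  if (level === 1) return manifest;
  if (level === 2 || !hasUnreadInbox) return [...manifest, ...core];
  return [...manifest, ...core, ...coordination];
}
```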
Below is the actual content of every agent's soul package, loaded straight from disk. Pick an agent, pick a file, see exactly what their LLM cycles get:
The soul loader caches manifests per process and re-reads files on worker restart. Prompt changes need no migration: edit the file, restart the service, and the new prompt is live.
Agents have mandatory tasks enforced at two levels. CHECKLIST.md is hardcoded in the soul package and loaded into every LLM prompt — numbered steps that must be completed each cycle, in order. The agent_checklists table adds configurable daily, weekly, and monthly items tracked in the database.
## Every Cycle (MANDATORY)
1. **Sync handler health.** Verify all sync handlers ran recently.
2. **Ingestion flow.** Verify data is flowing across all sources.
3. **Enrichment backlogs -- FILL THE QUEUE.** Check actual data gaps, not just pending job counts. If gap > 500, enqueue batch jobs.
4. **Daily checklist.** Call get_checklist, complete each item.
5. **Report status.** Summarize what you found AND what you did about it.
Agents use get_checklist and complete_checklist_item tools to query and mark items done. 50+ checklist items across 8 agents — morning briefings, security scans, behavioral summaries, VRAM reviews, code audits, biographical interviews, and more.
Agents emit <working_memory> blocks in their LLM responses to persist short-term state across cycles. Unlike permanent memories, working memory is ephemeral — it tracks in-progress investigations, multi-cycle task context, and running observations.
<working_memory>
Photo backlog was 3200 last cycle.
Triggered 6 describe batches.
Check reduction next cycle.
Gmail auth token expires in 48h -- monitoring.
</working_memory>
The framework extracts the content, stores it in agent_registry.working_memory, and re-injects it into the state message on the next cycle. Working memory auto-clears on clean cycles (status=ok, no actions) — ensuring agents don't carry stale context.
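The extraction step is a small parse. A minimal sketch, assuming a single tag per response — the real framework's parsing may differ:

```typescript
// Pull the working-memory block out of an LLM response. Returns null
// when the agent emitted none, which leaves stored memory untouched.
function extractWorkingMemory(response: string): string | null {
  const match = response.match(/<working_memory>([\s\S]*?)<\/working_memory>/);
  return match ? match[1].trim() : null;
}
```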
The agent_memory table stores permanent knowledge with confidence scoring, supersedence chains, and reinforcement tracking. Agents save observations, baselines, and findings using the remember tool.
When an agent's state message exceeds 4000 tokens, auto-summarization compresses the memories section via a lightweight LLM call before the main cycle runs. The consolidate_memories tool uses LLM to identify clusters of redundant memories and merge them into consolidated entries, with originals deactivated.
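The trigger itself is cheap. A sketch of the check that runs before the main cycle — the 4000-token threshold is from the text, but the chars-per-token heuristic and function names are assumptions:

```typescript
// Decide whether to compress the memories section before the main
// cycle. The ~4 chars/token estimate is a rough heuristic, not the
// real tokenizer.
const TOKEN_LIMIT = 4000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function needsSummarization(stateMessage: string): boolean {
  return estimateTokens(stateMessage) > TOKEN_LIMIT;
}
```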
The system preserves key facts while dropping redundant observations — preventing context window overflow as agents accumulate knowledge over weeks and months.
The delegate_task tool enables synchronous cross-agent consultation within a single cycle. Instead of sending an inbox message and waiting 15 minutes for the next cycle, an agent can query another agent's expertise in real time.
// ARIA checking Forge health via Inference
delegate_task({
  agent_id: "inference",
  prompt: "Is Forge healthy right now? Any latency concerns?"
})
// Response in ~5-10 seconds
The target agent's soul package and memories are loaded, a single LLM call is made, and the response returns immediately. Delegation is for reasoning queries — no tool execution. Async fallback via inbox when LLM slots are busy.
All agent-to-agent communication flows through agent_inbox with priority levels. Messages include context, trace IDs for debugging, and support for urgent routing.
LISTEN/NOTIFY: PostgreSQL pub/sub triggers immediate agent wake-ups when inbox messages arrive. A 30-second debounce prevents storms, but critical priority messages bypass the debounce entirely. This means urgent escalations reach the right agent in under a minute, not at the next scheduled cycle.
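The wake-up decision reduces to a two-branch check. A sketch under the rules stated above — 30-second debounce, critical bypass; the function name and priority label are assumptions:

```typescript
// Should a NOTIFY wake the agent now? Critical messages always do;
// everything else is rate-limited to one wake per 30 seconds.
const DEBOUNCE_MS = 30_000;

function shouldWake(lastWakeMs: number, nowMs: number, priority: string): boolean {
  if (priority === "critical") return true;     // bypass the debounce entirely
  return nowMs - lastWakeMs >= DEBOUNCE_MS;     // suppress notification storms
}
```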
Inbox cap at 10 messages with 7-day expiry. Agents process messages in priority order and can mark them as read or acted upon. The system includes per-tool approval gating — high-risk tools like restart_service always require human approval regardless of agent autonomy level.
When enrichment backlogs grow, ephemeral workers spawn on both machines. When the queue drains, they disappear. No orchestrator, no containers — just systemd transient units.
Pipeline detects a backlog of 3,200 unembedded photos. It floods the job queue. The autoscaler on Furnace and Crucible detects >50 pending jobs and spawns ephemeral bulk workers — up to 4 per machine. Workers pull jobs with FOR UPDATE SKIP LOCKED, process them in parallel, and exit when the queue is empty. Total throughput: ~185 jobs in 90 seconds.
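The claim query behind that parallelism might look like the following (a sketch: table and column names are assumptions). `FOR UPDATE SKIP LOCKED` is what lets eight workers pull from one queue without ever blocking on each other's row locks — a locked row is simply skipped and the next pending one taken.

```typescript
// Hypothetical job-claim query for the bulk workers. SKIP LOCKED
// means concurrent workers each grab a distinct pending job instead
// of queueing behind one another's locks.
const claimJobSql = `
  UPDATE job_queue
     SET status = 'running', claimed_at = now()
   WHERE id = (
     SELECT id FROM job_queue
      WHERE status = 'pending'
      ORDER BY priority DESC, created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
   )
  RETURNING id, job_type, payload;
`;
```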
Nexus didn't appear out of thin air — it was built on top of months of related work across a constellation of personal repos. Each bar shows commits per repo per month across the whole portfolio. The whole thing is a recent push: roughly a thousand commits between the initial idea and what's running today.
It started as a personal AI assistant. March 12, 2026 — I scaffolded the first version of ARIA, a single-agent system that could triage my email and send me a morning briefing. It worked. And then it needed more.
ARIA couldn't do everything alone. She needed someone watching the infrastructure while she handled messages. Someone monitoring the LLM services. Someone tracking data pipelines. Agents started appearing — first Keeper, then Monitor, then a parade of specialists. I built Chancery as the management dashboard to keep track of them all. Within a week I had six agents, a growing job queue, and a tight-coupling problem that was about to break.
The agents got stuck. Monitoring loops — an agent would detect a minor anomaly, escalate it, get a response, detect it again, escalate it again. Escalation storms — three agents all noticing the same issue and flooding ARIA's inbox simultaneously. I learned the hard way that prompt constraints weren't enough. You can tell an agent "don't escalate minor issues" in its system prompt, and it will do it anyway the moment something looks slightly off. The fix was framework-level enforcement: inbox caps, debounce timers, priority gates, approval workflows baked into the tool executor, not the prompt.
That's when I built Nexus. March 30 — a clean monorepo, a shared database (migrated off AWS RDS onto local PostgreSQL), a unified tool catalog, and a proper agent runtime with working memory, checklists, and graduated autonomy. The agents got renamed, restructured, given soul packages instead of monolithic prompts. The reactive monitoring model gave way to something more deliberate: a family office, where each agent has a defined role, a mandatory checklist, and the discipline to report what they found rather than chase every anomaly.
Chat with ARIA from your pocket. Native iOS app with SSE streaming, HealthKit sync, and location tracking.