This started because the cloud AI bill kept climbing. OpenAI for chat, Anthropic for reasoning, Google for embeddings, Replicate for image gen, Runway for video, ElevenLabs for voice — every interesting idea meant another metered API and another monthly invoice that scaled with use. Building something ambitious meant signing up to pay forever.
So Nexus runs on two mini-PCs in a closet. The models are local. The database is local. Every token, embedding, photo description, transcription, face match, and reranked result happens on hardware I own outright. The hardware paid for itself in a few months of saved API spend, and now the marginal cost of an idea is zero — which turns out to be the only constraint that matters for actually building things.
Two GMKTec EVO-X2 machines connected by Thunderbolt 5, running AMD Strix Halo. Furnace at 96GB unified VRAM, Crucible at 64GB. 160GB of GPU memory across one platform, split across two brains.
Furnace, Crucible, and Anvil are stitched together with four distinct network paths — a dedicated Thunderbolt 5 cable for the hot loop, a Thunderbolt bridge for the Mac, Tailscale overlaying everything for secure remote access, and 1 GbE for background LAN traffic. Each link carries different flows and has wildly different latency characteristics.
| Flow | Endpoint | Path |
|---|---|---|
| llama-server → rpc tensor split | 10.10.12.2:50052 | TB direct |
| crucible-bulk → PostgreSQL | 10.10.12.1:5432 | TB direct |
| nexus-scaler → job queue | 10.10.12.1:5432 | TB direct |
| bulk worker → Forge API | 10.10.12.1:8642 | TB direct |
| bulk worker → tool exec server | 10.10.12.1:7702 | TB direct |
| Flow | Endpoint | Path |
|---|---|---|
| Forge → Anvil VLM | 10.10.0.1:8081 | TB bridge |
| Worker → Apple Photos MCP | ssh stdio | Tailnet |
| Worker → iMessage MCP | ssh stdio | Tailnet |
| Worker → Apple Ecosystem MCP | ssh stdio | Tailnet |
| Remote admin / laptop | ssh | Tailnet |
Nexus is an npm workspaces monorepo with four packages. The API server handles external requests, the Worker runs agent cycles and processes jobs, the MCP server exposes tools via the Model Context Protocol, and Core provides shared services used by all three.
Worker modes: The worker runs in two configurations. Primary handles agent cycles, event listeners, and urgent jobs. Bulk handles long-running LLM work — embeddings, sentiment analysis, photo descriptions. The autoscaler spawns ephemeral bulk workers when the queue grows, and kills them when it drains.
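The scale-up decision can be sketched as a pure function. This is a sketch, not the real autoscaler: the >50-job trigger and 4-workers-per-machine cap come from the text, but `workersToSpawn` and the one-worker-per-50-jobs ratio are assumptions.

```typescript
// Sketch of the autoscaler's scaling decision. The threshold and
// per-machine cap match the numbers stated in the text; the
// proportional sizing heuristic is an assumption.
const SPAWN_THRESHOLD = 50; // pending jobs before scaling up
const MAX_BULK_WORKERS = 4; // ephemeral workers per machine

function workersToSpawn(pendingJobs: number, runningWorkers: number): number {
  if (pendingJobs <= SPAWN_THRESHOLD) return 0;
  // Roughly one worker per 50 queued jobs, capped at the machine limit.
  const desired = Math.min(MAX_BULK_WORKERS, Math.ceil(pendingJobs / SPAWN_THRESHOLD));
  return Math.max(0, desired - runningWorkers);
}
```

When the queue drains, workers exit on their own, so the inverse decision (scale-down) never needs to run.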
Unified LLM API gateway running on Furnace. Every agent, every tool, every background job — all inference routes through Forge. OpenAI-compatible endpoints with request logging, model routing, Prometheus metrics, and circuit breakers.
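The routing layer can be pictured as a small lookup: logical model names map to local backends, so callers hit one OpenAI-compatible endpoint and never know which machine serves the request. A hypothetical sketch — the IPs are the Thunderbolt addresses from the tables above, but the model names, ports, and `route` function are assumptions, not Forge's real config.

```typescript
// Hypothetical Forge-style model routing table. Model names and
// backend ports are invented for illustration.
type Backend = { host: string; port: number };

const routes: Record<string, Backend> = {
  "flagship-70b": { host: "10.10.12.1", port: 8080 }, // tensor-split via RPC
  "priority":     { host: "10.10.12.1", port: 8081 }, // low-latency agent calls
  "embeddings":   { host: "10.10.12.2", port: 8082 }, // bulk enrichment
};

function route(model: string): Backend {
  const backend = routes[model];
  if (!backend) throw new Error(`no route for model: ${model}`);
  return backend; // in Forge, logging and circuit breakers would wrap this
}
```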
Four models running simultaneously on local hardware. The 70B flagship is tensor-split across both machines via RPC. Each model is purpose-matched to its workload.
Ten specialized services running alongside the core LLM models. Each handles a specific modality — vision, audio, identity, search — all accessible through Forge's unified API.
Everything a digital life touches flows through the same ingestion pipeline: normalize → deduplicate → store → enrich (sentiment, embeddings, knowledge graph) → proactive rules. Sources run on different cadences — some real-time, some scheduled, some manually imported archives — but they all land in the same PostgreSQL database and are queryable as one dataset.
Each source lands in its own table, gets normalized into ingestion_log, then fans out through the enrichment pipeline: sentiment analysis, 768-dim embeddings, knowledge graph extraction, and GPS reverse-geocoding where applicable. The proactive engine watches all sources and surfaces significant events to agents via the inbox. Nothing is truly siloed — a photo cluster can cross-reference a trip, which cross-references messages with a person, which cross-references a song played during that week.
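The normalize step is the key to everything being queryable as one dataset: every source row gets reduced to one common shape before enrichment. A minimal sketch — the field names below are illustrative assumptions, not the real `ingestion_log` schema.

```typescript
// Illustrative shape of the normalize step. Field names are
// assumptions; the real ingestion_log schema may differ.
interface IngestionRecord {
  source: string;      // "imessage", "photos", "email", ...
  externalId: string;  // dedupe key within the source
  occurredAt: string;  // ISO timestamp for cross-source ordering
  content: string;     // text fed to embeddings / sentiment
}

function normalizeIMessage(row: { guid: string; text: string; date: Date }): IngestionRecord {
  return {
    source: "imessage",
    externalId: row.guid,
    occurredAt: row.date.toISOString(),
    content: row.text.trim(),
  };
}
```

Because every source produces the same record, the enrichment fan-out (sentiment, embeddings, knowledge graph) is written once and works for all of them.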
Watch a single iMessage flow through every stage of the system — from arrival to action — in about thirty seconds. This is the same path every piece of incoming data takes.
01 iMessage arrives, deduplicated
02 normalized into ingestion_log
03 embedding generated, stored in pgvector
04 sentiment classified via Forge priority LLM
05 entities + facts extracted into the knowledge graph
06 proactive engine scores significance
07 if significant, NOTIFY ARIA's inbox channel
08 ARIA wakes mid-cycle, decides what to do, calls tools
Seven autonomous agents, each with a distinct role and personality. They coordinate through an inbox system, run on configurable schedules, and operate under graduated autonomy levels.
Your interface to the platform. ARIA coordinates the team, delivers briefings, manages your inbox, triages email, and handles personal requests. She's the face of Nexus — warm, adaptive, and always one step ahead.
Every agent cycle is a complete loop: load state → reason via LLM → pick tool calls → execute → log the outcome. Below is one real decision pulled live from the database, sanitized by Claude Opus, and walked through step by step. Underneath that, a feed of the last 24 hours of agent activity — same source, same sanitizer.
Last 24 hours · sampled · sanitized
Ten MCP servers provide 328 tools to the agent team. Each server connects external services — Gmail, GitHub, Notion, iMessage — into a unified tool catalog that agents discover on demand.
Each agent runs as a short-lived cycle: load soul package, gather state, call Forge, parse actions, execute tools, log decision. Up to 5 rounds per cycle, with graduated autonomy and approval gates.
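The cycle loop above can be sketched in a few lines. `llmCall` and `executeTool` here are stand-ins for the real Forge client and tool executor; the 5-round cap matches the text, the rest is an assumption.

```typescript
// Minimal sketch of one agent cycle: reason, act, feed results back,
// repeat up to 5 rounds. The real loop is async and also enforces
// autonomy levels and approval gates inside executeTool.
const MAX_ROUNDS = 5;

type Action = { tool: string; args: unknown };
type LlmTurn = { actions: Action[]; done: boolean };

function runCycle(
  llmCall: (history: string[]) => LlmTurn,
  executeTool: (a: Action) => string,
): number {
  const history: string[] = ["<soul package + gathered state>"];
  let rounds = 0;
  while (rounds < MAX_ROUNDS) {
    rounds++;
    const turn = llmCall(history);
    for (const action of turn.actions) {
      history.push(executeTool(action)); // tool output feeds the next round
    }
    if (turn.done || turn.actions.length === 0) break;
  }
  return rounds; // the real framework logs the decision here
}
```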
Agent personas follow ClawSouls Soul Spec v0.5. Each agent has a soul package directory containing standardized files that define identity, behavior, and operational context.
The framework uses progressive disclosure — Level 1 loads only the manifest for quick discovery, Level 2 loads core behavior files every cycle, and Level 3 adds coordination context only when the agent has unread inbox messages. Most cycles are quiet, so Level 2 keeps token costs minimal.
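The three levels might look something like this. The file names below are hypothetical stand-ins in the Soul Spec style, not the actual package contents:

```typescript
// Sketch of progressive disclosure. File names are hypothetical;
// the level semantics (manifest / core / coordination) are from the text.
function filesForLevel(level: 1 | 2 | 3, hasUnreadInbox: boolean): string[] {
  const manifest = ["soul.json"];                   // Level 1: quick discovery
  const core = ["IDENTITY.md", "CHECKLIST.md"];     // Level 2: every cycle
  const coordination = ["TEAM.md", "PROTOCOLS.md"]; // Level 3: inbox context
  if (level === 1) return manifest;
  if (level === 2 || !hasUnreadInbox) return [...manifest, ...core];
  return [...manifest, ...core, ...coordination];
}
```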
Below is the actual content of every agent's soul package, loaded straight from disk. Pick an agent, pick a file, see exactly what their LLM cycles get:
The soul loader caches manifests per process and re-reads files on worker restart. Prompt changes need no migration: edit the file, restart the service, and the new prompt is live.
Agents have mandatory tasks enforced at two levels. CHECKLIST.md is hardcoded in the soul package and loaded into every LLM prompt — numbered steps that must be completed each cycle, in order. The agent_checklists table adds configurable daily, weekly, and monthly items tracked in the database.
## Every Cycle (MANDATORY)
1. **Sync handler health.** Verify all sync handlers ran recently.
2. **Ingestion flow.** Verify data is flowing across all sources.
3. **Enrichment backlogs -- FILL THE QUEUE.** Check actual data gaps, not just pending job counts. If gap > 500, enqueue batch jobs.
4. **Daily checklist.** Call get_checklist, complete each item.
5. **Report status.** Summarize what you found AND what you did about it.
Agents use get_checklist and complete_checklist_item tools to query and mark items done. 50+ checklist items across 8 agents — morning briefings, security scans, behavioral summaries, VRAM reviews, code audits, biographical interviews, and more.
Agents emit <working_memory> blocks in their LLM responses to persist short-term state across cycles. Unlike permanent memories, working memory is ephemeral — it tracks in-progress investigations, multi-cycle task context, and running observations.
<working_memory>
Photo backlog was 3200 last cycle.
Triggered 6 describe batches.
Check reduction next cycle.
Gmail auth token expires in 48h -- monitoring.
</working_memory>
The framework extracts the content, stores it in agent_registry.working_memory, and re-injects it into the state message on the next cycle. Working memory auto-clears on clean cycles (status=ok, no actions) — ensuring agents don't carry stale context.
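The extraction step is a small parse. A minimal sketch, assuming a single tag per response — the real framework's parsing may differ:

```typescript
// Pull the working-memory block out of an LLM response. Returns null
// when the agent emitted none, which leaves stored memory untouched.
function extractWorkingMemory(response: string): string | null {
  const match = response.match(/<working_memory>([\s\S]*?)<\/working_memory>/);
  return match ? match[1].trim() : null;
}
```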
The agent_memory table stores permanent knowledge with confidence scoring, supersedence chains, and reinforcement tracking. Agents save observations, baselines, and findings using the remember tool.
When an agent's state message exceeds 4000 tokens, auto-summarization compresses the memories section via a lightweight LLM call before the main cycle runs. The consolidate_memories tool uses LLM to identify clusters of redundant memories and merge them into consolidated entries, with originals deactivated.
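The trigger itself is cheap. A sketch of the check that runs before the main cycle — the 4000-token threshold is from the text, but the chars-per-token heuristic and function names are assumptions:

```typescript
// Decide whether to compress the memories section before the main
// cycle. The ~4 chars/token estimate is a rough heuristic, not the
// real tokenizer.
const TOKEN_LIMIT = 4000;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function needsSummarization(stateMessage: string): boolean {
  return estimateTokens(stateMessage) > TOKEN_LIMIT;
}
```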
The system preserves key facts while dropping redundant observations — preventing context window overflow as agents accumulate knowledge over weeks and months.
The delegate_task tool enables synchronous cross-agent consultation within a single cycle. Instead of sending an inbox message and waiting 15 minutes for the next cycle, an agent can query another agent's expertise in real time.
// ARIA checking Forge health via Inference
delegate_task({
  agent_id: "inference",
  prompt: "Is Forge healthy right now? Any latency concerns?"
})
// Response in ~5-10 seconds
The target agent's soul package and memories are loaded, a single LLM call is made, and the response returns immediately. Delegation is for reasoning queries — no tool execution. Async fallback via inbox when LLM slots are busy.
All agent-to-agent communication flows through agent_inbox with priority levels. Messages include context, trace IDs for debugging, and support for urgent routing.
LISTEN/NOTIFY: PostgreSQL pub/sub triggers immediate agent wake-ups when inbox messages arrive. A 30-second debounce prevents storms, but critical priority messages bypass the debounce entirely. This means urgent escalations reach the right agent in under a minute, not at the next scheduled cycle.
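The wake-up decision reduces to a two-branch check. A sketch under the rules stated above — 30-second debounce, critical bypass; the function name and priority label are assumptions:

```typescript
// Should a NOTIFY wake the agent now? Critical messages always do;
// everything else is rate-limited to one wake per 30 seconds.
const DEBOUNCE_MS = 30_000;

function shouldWake(lastWakeMs: number, nowMs: number, priority: string): boolean {
  if (priority === "critical") return true;     // bypass the debounce entirely
  return nowMs - lastWakeMs >= DEBOUNCE_MS;     // suppress notification storms
}
```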
Inbox cap at 10 messages with 7-day expiry. Agents process messages in priority order and can mark them as read or acted upon. The system includes per-tool approval gating — high-risk tools like restart_service always require human approval regardless of agent autonomy level.
When enrichment backlogs grow, ephemeral workers spawn on both machines. When the queue drains, they disappear. No orchestrator, no containers — just systemd transient units.
Pipeline detects a backlog of 3,200 unembedded photos. It floods the job queue. The autoscaler on Furnace and Crucible detects >50 pending jobs and spawns ephemeral bulk workers — up to 4 per machine. Workers pull jobs with FOR UPDATE SKIP LOCKED, process them in parallel, and exit when the queue is empty. Total throughput: ~185 jobs in 90 seconds.
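The claim query behind that parallelism might look like the following (a sketch: table and column names are assumptions). `FOR UPDATE SKIP LOCKED` is what lets eight workers pull from one queue without ever blocking on each other's row locks — a locked row is simply skipped and the next pending one taken.

```typescript
// Hypothetical job-claim query for the bulk workers. SKIP LOCKED
// means concurrent workers each grab a distinct pending job instead
// of queueing behind one another's locks.
const claimJobSql = `
  UPDATE job_queue
     SET status = 'running', claimed_at = now()
   WHERE id = (
     SELECT id FROM job_queue
      WHERE status = 'pending'
      ORDER BY priority DESC, created_at
      LIMIT 1
      FOR UPDATE SKIP LOCKED
   )
  RETURNING id, job_type, payload;
`;
```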
Nexus didn't appear out of thin air — it was built on top of months of related work across a constellation of personal repos. Each bar shows commits per repo per month across the whole portfolio. The whole thing is a recent push: roughly a thousand commits between the initial idea and what's running today.
It started as a personal AI assistant. March 12, 2026 — I scaffolded the first version of ARIA, a single-agent system that could triage my email and send me a morning briefing. It worked. And then it needed more.
ARIA couldn't do everything alone. She needed someone watching the infrastructure while she handled messages. Someone monitoring the LLM services. Someone tracking data pipelines. Agents started appearing — first Keeper, then Monitor, then a parade of specialists. I built Chancery as the management dashboard to keep track of them all. Within a week I had six agents, a growing job queue, and a tight-coupling problem that was about to break.
The agents got stuck. Monitoring loops — an agent would detect a minor anomaly, escalate it, get a response, detect it again, escalate it again. Escalation storms — three agents all noticing the same issue and flooding ARIA's inbox simultaneously. I learned the hard way that prompt constraints weren't enough. You can tell an agent "don't escalate minor issues" in its system prompt, and it will do it anyway the moment something looks slightly off. The fix was framework-level enforcement: inbox caps, debounce timers, priority gates, approval workflows baked into the tool executor, not the prompt.
That's when I built Nexus. March 30 — a clean monorepo, a shared database (migrated off AWS RDS onto local PostgreSQL), a unified tool catalog, and a proper agent runtime with working memory, checklists, and graduated autonomy. The agents got renamed, restructured, given soul packages instead of monolithic prompts. The reactive monitoring model gave way to something more deliberate: a family office, where each agent has a defined role, a mandatory checklist, and the discipline to report what they found rather than chase every anomaly.
Chat with ARIA from your pocket. Native iOS app with SSE streaming, HealthKit sync, and location tracking.