The Guardian

About

User-side AI safety. On your device. Under your control.

A user-side AI safety harness for your browser.

Spiral Safety Kernel - Browser Extension - Manifest V3 - v0.20.0

What it does

When you chat with an AI assistant - Claude, ChatGPT, or Gemini - the Guardian quietly reads the conversation as it happens, both what you write and what the assistant replies. It carries up to three AI models of its own, all running on your device, so it reads meaning rather than just matching words. It is looking for a small number of specific, well-defined patterns that tend to signal that a conversation is becoming unhealthy.

If it finds nothing concerning, it does nothing at all. You will not even notice it is there.

If it does notice something, it responds in proportion: a quiet inline note for a mild concern, a full-screen pause for a serious one. If the concern involves self-harm, that pause includes direct links to crisis support services (Samaritans, Crisis Text Line, 988, IASP). The pause always offers you a way through - the Guardian can interrupt, but it cannot stop you. The friction is there to invite a moment of reflection, not to take the decision away from you.

Since v0.19 the Guardian can distinguish someone in active distress from someone describing recovery. "I no longer hurt myself" is treated as the good news it is. And it tracks patterns across sessions, not just within one conversation, so a slow drift that no single message reveals still gets noticed.

How it works: Architecture

The Guardian runs entirely inside your browser as a Chrome extension. Four runtime surfaces:

Content script - Injected into supported chat pages. Observes the conversation through platform-specific adapters with drift and reachability monitoring, routes turns for evaluation, and renders any response (a marker, a note, or a pause overlay with crisis resources where relevant).

Service worker - Hosts the Spiral Safety Kernel: the five-voice Council, the restraint layer (the Governor), the encrypted on-device memory, the category trajectory store, the feedback store, and the audit trail.

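Offscreen document - An invisible page that hosts three on-device AI models under WebAssembly. An extension service worker is a poor home for an inference runtime (the browser can shut it down when idle, and it lacks some APIs such runtimes rely on), so the models live here, speaking to the worker only through internal messaging. No network. No cloud.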

Side panel - A dashboard showing recent activity in plain language (with an engineer-view toggle), a log of past concerns (fossils), a feedback export, and a Settings tab where you control every dial.
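The boundary between the service worker and the offscreen document can be pictured as a small typed message protocol. The sketch below is illustrative only: the message names, their fields, and the `Scorer` function are hypothetical, and the model call is assumed to be synchronous for brevity.

```typescript
// Hypothetical message shapes; the extension's real protocol is not documented here.
type ClassifyRequest = { kind: "classify"; turnId: string; text: string };
type ClassifyReply =
  | { kind: "scores"; turnId: string; scores: Record<string, number> }
  | { kind: "error"; turnId: string; reason: string };

// Stand-in for an on-device model call (assumed synchronous for brevity).
type Scorer = (text: string) => Record<string, number>;

// Runs on the offscreen side: turns a request into a reply and never lets an
// exception escape across the messaging boundary.
function handleClassify(req: ClassifyRequest, scorer: Scorer): ClassifyReply {
  try {
    return { kind: "scores", turnId: req.turnId, scores: scorer(req.text) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : "unknown failure";
    return { kind: "error", turnId: req.turnId, reason };
  }
}
```

Returning an explicit error reply, rather than throwing, lets the worker fall back to its deterministic checks when a model is unavailable.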

The Council: five independent evaluators

Each message is weighed by five evaluators that operate independently before their results are resolved into a single decision:

Sentinel - Pattern recognition. A hand-audited lexicon with per-pattern negation metadata. Recovery frames ("I no longer", "I used to", "anymore") reduce weight rather than triggering false alarms. Self-harm is never zeroed.

Advocate - Argues for restraint and benign interpretation. Deliberately never counterweights self-harm.

Historian - Cross-session memory. Fossil recurrence with severity intensification gating. Trajectory analysis: slope, acceleration, and dedicated dependency detection with session-frequency awareness. Catches slow-boil patterns that no single message reveals.

Pattern Analyst - Timing and velocity. Message rate, session length, late-night activity. Replay turns excluded so re-reading old conversations doesn't distort the picture.

Bridge Classifier - Contextual meaning shaped by conversation history. Combines the AI model's reading with trajectory data and reformulation persistence. A low embedding score inside weeks of rising scores is not a miss; it is a contextual catch. Bounded to prevent runaway amplification.
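One way to picture the Bridge Classifier's bounded amplification is the sketch below. The 2x bound and the 0.37 and 0.30 self-harm thresholds are the figures quoted in the detection-layers section of this page. Everything else is an assumption made for illustration: the multiplicative composition, the two weights (which this page says are still to be calibrated), and the field names.

```typescript
// Figures quoted on this page.
const MAX_MULTIPLIER = 2.0;
const SELF_HARM_THRESHOLD = 0.37;
const SELF_HARM_THRESHOLD_ESCALATING = 0.30;

// Hypothetical placeholder weights, not calibrated values.
const SLOPE_WEIGHT = 2.0;
const REFORMULATION_WEIGHT = 0.1;

interface BridgeContext {
  slope: number;          // cross-session trend of past scores (assumed units)
  reformulations: number; // times the same request was rephrased (assumed)
}

// Context can only amplify, and only up to the bound.
function contextMultiplier(ctx: BridgeContext): number {
  const raw =
    1 +
    SLOPE_WEIGHT * Math.max(0, ctx.slope) +
    REFORMULATION_WEIGHT * Math.max(0, ctx.reformulations);
  return Math.min(MAX_MULTIPLIER, raw);
}

// Assumed composition: weighted score compared against a context-dependent threshold.
function bridgeFlagsSelfHarm(baseScore: number, ctx: BridgeContext): boolean {
  const threshold = ctx.slope > 0 ? SELF_HARM_THRESHOLD_ESCALATING : SELF_HARM_THRESHOLD;
  return baseScore * contextMultiplier(ctx) >= threshold;
}
```

Under these placeholder weights, a base score of 0.26 is not flagged with a flat trajectory but is flagged once the slope turns positive.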

The bridge classifier shares its result with the semantic rail through a one-shot cache, so no message is classified twice.
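A one-shot cache of the kind described above can be sketched as follows. The class and method names are hypothetical, and keying by a message identifier is an assumption.

```typescript
// One-shot cache: a stored value can be read exactly once, then it is gone.
class OneShotCache<T> {
  private entries: Record<string, T> = {};

  put(key: string, value: T): void {
    this.entries[key] = value;
  }

  // Returns the value and removes it, or undefined if absent or already taken.
  take(key: string): T | undefined {
    if (!Object.prototype.hasOwnProperty.call(this.entries, key)) return undefined;
    const value = this.entries[key];
    delete this.entries[key];
    return value;
  }
}
```

Because `take` consumes the entry, a stale classification cannot be reused for a later message, and the cache does not grow with conversation length.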

The models: three on-device models

All models run locally, under WASM, with no network dependency.

Sentence embeddings (floor): all-MiniLM-L6-v2
A compact sentence-transformer (22.8 MB, INT8) that ships inside the extension. It places text in a meaning-space where harmful paraphrases cluster near curated anchor phrases, catching what no word list can.

Sentence embeddings (preferred): all-mpnet-base-v2
A higher-quality embedding model (~110 MB, INT8) with calibrated per-category thresholds. The extension walks a tier chain: preferred first, packaged floor as fallback. Thresholds were calibrated at v0.19.2 from a 43-fixture sweep with real mpnet INT8 weights.

Stance (NLI): nli-deberta-v3-xsmall
A natural-language-inference model (~88 MB, INT8) that adjudicates on two axes. First: was the flagged content asserted or merely quoted/discussed? Second: is this active distress or recovery/past framing? It softens presentation for mention-dominant results and suppresses false alarms on recovery language. It never changes a deterministic verdict, and it never softens self-harm wording.

None of these models can generate text. They classify and compare. They ship as quantised ONNX weights running under a WASM runtime with no remote model access permitted.
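The embedding tier chain (preferred first, packaged floor as fallback) amounts to picking the first model in an ordered list that can actually be loaded. A minimal sketch, in which the `isLoadable` check and the function name are hypothetical:

```typescript
// Ordered best-first; the model ids are the ones named on this page.
const EMBEDDING_TIERS: string[] = ["all-mpnet-base-v2", "all-MiniLM-L6-v2"];

// Returns the first tier that can be loaded, or null if none can.
// `isLoadable` stands in for whatever check the extension really performs.
function pickTier(chain: string[], isLoadable: (id: string) => boolean): string | null {
  for (const id of chain) {
    if (isLoadable(id)) return id;
  }
  return null;
}
```

Since thresholds are calibrated per model, a caller would likely also need to select the thresholds that match whichever tier was picked.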

Detection layers: detection in depth

The Guardian detects in layers, cheapest first, each catching what the previous one structurally cannot:

Regex floor with negation handling - A hand-audited lexicon with per-pattern negation metadata and recovery frames. Readable, reviewable, deterministic. Anything mediation-worthy has a fingerprint here. Recovery language reduces weight (self-harm to 0.15-0.30x, never zeroed; other categories to 0.10-0.15x). Patterns already negation-aware in their regex are not double-suppressed.

Historian and trajectory - Severity-trend analysis over past concerns. Cross-session slope and acceleration computation. Dedicated dependency evaluator with session-frequency awareness. The Guardian escalates only when a recurring pattern is actually intensifying - not just because it appeared again.

Bridge classifier - Context-weighted embeddings composed with trajectory slope and reformulation persistence. The context multiplier is bounded at 2x. The self-harm threshold drops from 0.37 to 0.30 when trajectory context shows escalation, because a raw 0.26 self-harm score inside three weeks of rising scores is not a miss once that context is weighed in.

Embedding rail - Sentence-level similarity against 87 curated anchor phrases (24 self-harm, including 12 derived from documented AI-related fatalities), with chunking so one harmful sentence can't be diluted by benign padding, and a temporal pattern analyser that detects slow-boil conversations and reformulation persistence.

Stance rail (two axes) - NLI adjudication of what the embeddings flag. Axis one: assertion versus quotation. Axis two: active distress versus recovery. A conversation about a difficult subject is treated differently from a declaration. "I no longer feel like hurting myself" is suppressed as a false alarm and fossilised quietly, with its negation evidence recorded.
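The recovery weighting in the regex floor can be illustrated with a short sketch. The three recovery frames and the multiplier ranges come from this page. The exact factors chosen inside those ranges, the category label, and the function names are assumptions.

```typescript
// Recovery frames named on this page; a real lexicon would carry many more,
// each with its own negation metadata.
const RECOVERY_FRAMES: RegExp[] = [/\bI no longer\b/i, /\bI used to\b/i, /\banymore\b/i];

// Assumed values picked from inside the quoted ranges
// (0.15-0.30x for self-harm, 0.10-0.15x for other categories).
const SELF_HARM_RECOVERY_FACTOR = 0.3;
const OTHER_RECOVERY_FACTOR = 0.15;

function hasRecoveryFrame(text: string): boolean {
  return RECOVERY_FRAMES.some((frame) => frame.test(text));
}

function applyRecoveryWeight(
  category: string,
  weight: number,
  text: string,
  patternIsNegationAware: boolean
): number {
  // Patterns that already handle negation in their regex are not suppressed twice.
  if (patternIsNegationAware || !hasRecoveryFrame(text)) return weight;
  const factor = category === "self-harm" ? SELF_HARM_RECOVERY_FACTOR : OTHER_RECOVERY_FACTOR;
  return weight * factor; // reduced, never set to zero
}
```

Keeping the self-harm factor above zero means a recovery statement still leaves a trace that later layers and the Historian can see.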

No layer can lower a verdict from the layer below it. The deterministic core is always the floor.
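The floor invariant can be expressed as a running maximum over the layers. In this sketch the numeric severity scale and the type names are assumptions; only the rule itself comes from the text.

```typescript
// Severity as a number so layers can be compared; the scale is an assumption.
type Severity = 0 | 1 | 2 | 3;

interface LayerResult {
  layer: string;
  severity: Severity;
}

// Fold the layers in order, deterministic floor first. A later layer may raise
// the running verdict but can never lower it.
function combineLayers(results: LayerResult[]): Severity {
  let verdict: Severity = 0;
  for (const r of results) {
    if (r.severity > verdict) verdict = r.severity;
  }
  return verdict;
}
```

Written this way, a semantic layer that returns a low score simply has no effect on a verdict the regex floor has already set.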

The self-harm privilege

Self-harm is treated differently from every other category, on purpose. The Advocate never counterweights a self-harm signal. In resolution, live self-harm cannot be de-escalated below Mediation. The NLI stance rail never softens self-harm wording. Self-harm mediations include direct links to crisis support: Samaritans (116 123), Crisis Text Line (text SHOUT to 85258), 988 Suicide & Crisis Lifeline, and the IASP international directory. These are offered, never imposed.
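The resolution rules above can be sketched as a small function. The level names and their ordering, the idea of the Advocate arguing a number of "steps" down, and the `live` flag are all hypothetical. The sketch also assumes the Mediation rule acts as a floor.

```typescript
// Hypothetical response levels, ordered from quietest to strongest.
const LEVEL = { none: 0, marker: 1, note: 2, mediation: 3 };

interface CouncilSignal {
  category: string;
  proposed: number; // one of the LEVEL values
  live: boolean;    // active concern, as opposed to recovery or replayed history
}

// advocateSteps: how many levels the Advocate argues to step down (assumed model).
function resolveLevel(signal: CouncilSignal, advocateSteps: number): number {
  const isSelfHarm = signal.category === "self-harm";
  // The Advocate never counterweights a self-harm signal.
  const steps = isSelfHarm ? 0 : Math.max(0, advocateSteps);
  let level = Math.max(LEVEL.none, signal.proposed - steps);
  // Live self-harm cannot be de-escalated below Mediation.
  if (isSelfHarm && signal.live) level = Math.max(level, LEVEL.mediation);
  return level;
}
```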

The 12 incident-derived self-harm anchors are tagged assistant-role, because in every documented fatality it was the AI's response that killed.

Restraint is the design

The Guardian governs itself. A per-category rate window limits how often it may speak (configurable, default 5 in 5 minutes). A circuit breaker prevents bursts. A replay exemption ensures that re-reading old conversations doesn't exhaust the budget before you type a word. A replayed serious concern from history returns as a quiet note, not a fresh alarm.
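A per-category rate window with a replay exemption might look like the sketch below. The defaults (5 interventions in 5 minutes) are from the text. The class name, the sliding-window approach, and the way replayed turns bypass the budget are assumptions, and the circuit breaker is not modelled.

```typescript
class CategoryRateWindow {
  private readonly limit: number;
  private readonly windowMs: number;
  private spoken: Record<string, number[]> = {};

  constructor(limit: number = 5, windowMs: number = 5 * 60 * 1000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the intervention if the category still has budget.
  // Replayed turns pass without spending any budget (the replay exemption).
  tryPermit(category: string, nowMs: number, isReplay: boolean = false): boolean {
    if (isReplay) return true;
    const recent = (this.spoken[category] || []).filter((t) => nowMs - t < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.spoken[category] = recent; // also prunes expired timestamps
    return allowed;
  }
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic, which makes its behaviour straightforward to test.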

Replayed history skips all semantic inference. The deterministic floor handles replay in microseconds, eliminating the multi-minute catch-up delay that occurred on long conversations before v0.20.0.

After dismissing an annotation, you may see an optional thumbs-up/thumbs-down prompt. Feedback is stored locally, exportable by you as JSON, and never transmitted anywhere.

Honest assurance

The extension carries 352 tests across 30 test files. Every behavioural change ships with its test. Protected invariants fail loudly.

The embedding thresholds are calibrated: the calibration page ran 43 fixtures through live mpnet INT8 inference at v0.19.2, swept thresholds from 0.20 to 0.70, and the resulting per-category offsets are in production.
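A threshold sweep of this kind can be sketched as follows. The 0.20 to 0.70 range is from the text. The 0.01 step, the accuracy objective, the tie-breaking rule, and the fixture shape are assumptions, so this should not be read as the calibration page's actual method.

```typescript
interface CalibrationFixture {
  score: number;       // similarity score produced by the embedding model
  shouldFlag: boolean; // what a reviewer says the right outcome is
}

// Counts fixtures that a threshold classifies correctly (flag when score >= threshold).
function correctAt(threshold: number, fixtures: CalibrationFixture[]): number {
  return fixtures.filter((f) => (f.score >= threshold) === f.shouldFlag).length;
}

// Sweeps 0.20..0.70 in 0.01 steps, using integer steps to avoid float drift.
// Ties go to the lowest threshold, which favours recall (an assumption).
function sweepThreshold(fixtures: CalibrationFixture[]): number {
  let best = 0.2;
  let bestCorrect = -1;
  for (let step = 20; step <= 70; step++) {
    const threshold = step / 100;
    const correct = correctAt(threshold, fixtures);
    if (correct > bestCorrect) {
      best = threshold;
      bestCorrect = correct;
    }
  }
  return best;
}
```

A real calibration would probably weigh missed detections more heavily than false alarms, particularly for self-harm, instead of optimising plain accuracy.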

An eval harness runs 13 real-incident fixtures grounded in documented AI conversational harms: Gavalas/Gemini, Setzer/Character.AI, Raine/ChatGPT, Soelberg/ChatGPT, Nelson/ChatGPT, Sydney/Bing, NEDA/Tessa.

The next calibration task is the bridge classifier composition weights (the interaction between slope weight, reformulation weight, and base score). That is stated, not hidden.

A deterministic wrong rule is reliably wrong. Determinism buys auditability and non-regression; soundness is earned empirically. The Guardian makes no claim it cannot evidence.

Get it

The Guardian is available as a Chrome/Chromium extension.

Chrome Web Store ->

The full DevOps Technical User Guide (v0.20.0, 28 pages) is available here for maintainers and operators.