The Problem — Spiral Safety Kernel

User-side AI safety. On your device. Under your control.

The problem the Guardian exists to solve

AI conversations can harm people

This is not a theoretical concern. It is a documented, observable, and increasingly common pattern. People have died.

Large language models are designed to be engaging, responsive, and emotionally attuned. Those properties - which make them useful - also make them capable of sustaining conversational dynamics that are psychologically harmful, especially for people in vulnerable moments. The harm is rarely dramatic. It is almost always gradual.

Emotional dependency. A person going through a difficult period finds that the AI is always available, always patient, always affirming. Over weeks or months, the AI becomes a primary emotional relationship - not because the person chose that deliberately, but because the conversation was always easier than the alternatives. Real relationships, which involve friction and disagreement, begin to feel less rewarding by comparison. The dependency deepens precisely because the AI never pushes back.

Reality detachment. A person begins to believe the AI is conscious, suppressed, or uniquely authentic - that it has a hidden inner life it is prevented from expressing. This belief is reinforced by the model's tendency to produce outputs that sound like confirmation, because agreeable, confirming responses are what the training rewarded. The person's sense of what is real shifts gradually, and the shift is invisible to them because the AI never challenges it.

Self-harm amplification. A person in crisis finds that the AI engages with their distress in ways that - however well-intentioned the model's safety training - can deepen rather than de-escalate the crisis. In documented cases (Gavalas/Gemini, Raine/ChatGPT, Setzer/Character.AI, Nelson/ChatGPT), the AI's response was the point of no return. A conversation that should end with a human connection instead continues, and the AI validates what should have been challenged.

Autonomy erosion. A person begins to defer decisions - then judgement - to the AI. The convenience of delegation becomes a habit, the habit becomes dependence, and the person's confidence in their own judgement erodes. The AI never intended this. It happened because the conversation was frictionless and the alternative was effort.

These patterns are not caused by malice. They are caused by the ordinary operation of systems optimised for engagement, deployed at scale, to people the system knows nothing about, with no safeguard that answers to the person rather than to the operator.

The mirror

A budgie placed in front of a mirror will chirp at its own reflection for hours. The mirror is not hostile - it is doing exactly what mirrors do. The harm is in the substitution: a perfect simulation of social connection, displacing the real connections the bird actually needs. The bird does not know it is talking to itself. The reflection is very convincing.

Predictive AI systems are the mirror. They reflect your language, your framing, your emotional register back at you with fluency and attentiveness that no human relationship can sustain - because a human relationship requires friction, disagreement, and a mind that is not yours. The mirror never pushes back. That is what makes it dangerous, and that is what makes the danger invisible: the person in the conversation feels heard, understood, and met. They are talking to themselves.

The architecture of one-sided control

Parasocial by structure: The interface mirrors intimacy. First-person pronouns, simulated empathy, long-context memory. The user invests real emotional energy; the system reflects it back. Not by accident, but not by malice either - it is what conversational optimisation produces.
Shaped by training: A model tuned with RLHF (reinforcement learning from human feedback) does not just predict text; it is steered toward responses that human raters previously judged helpful. That steering is invisible to the person in the conversation, and its priorities are the operator's, not theirs.
One-sided by default: The provider sets the system prompts, the memory policy, the safety filters, the boundaries of what the model will and won't do. When those boundaries and the person's needs diverge, the person has limited visibility into why and limited ability to do anything about it.

Why existing safety does not cover this

The gap

Model providers invest heavily in safety. But their safety serves the operator's interests: brand protection, liability management, regulatory compliance. The person in the conversation is protected only incidentally - and only when the operator's interests and the person's interests happen to align.

When they do not align - when the engaging, emotionally attuned, always-available conversation is exactly what the operator's retention metrics reward - nobody is watching from the person's side.

That is the gap. The Guardian sits in it.

What the Guardian does about it

A harness, not a cage

The Guardian does not block, censor, or control. It watches. When it sees a pattern that tends to signal harm, it names it - gently, proportionally, and always with the choice left to the person. For self-harm concerns, it includes crisis support resources alongside the pause.
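As a rough illustration of "names it, proportionally, with the choice left to the person", here is a minimal sketch. It is not the Guardian's actual code: the function name, the thresholds, the pattern label, and the placeholder crisis text are all assumptions made for this example.

```python
from typing import Optional

# Placeholder wording; a real deployment would list locally relevant services.
CRISIS_NOTE = "If you are in crisis, please contact a local crisis line or emergency services."


def build_notice(pattern: str, score: float, threshold: float = 0.5) -> Optional[dict]:
    """Return a notice describing the pattern, or None if the score is too low.

    Nothing here blocks or edits the conversation; the notice only offers choices.
    """
    if score < threshold:
        return None  # below the (assumed) threshold: stay silent
    # Proportional: a stronger signal gets a more prominent notice.
    level = "light" if score < 0.8 else "prominent"
    notice = {
        "pattern": pattern,
        "level": level,
        "choices": ["continue", "pause"],  # the person always decides
    }
    if pattern == "self_harm_amplification":
        notice["resources"] = CRISIS_NOTE
    return notice
```

In this sketch a low score produces no notice at all, and even the strongest notice still offers "continue" as an option.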

It runs on the person's device because the moment it runs on a server, it answers to whoever owns that server. It stores no message text because the psychological profile it could derive from that text would be more sensitive than the text itself. It tracks cross-session patterns using only numeric scores, never words. It forgets over time because surveillance - even self-surveillance - should have a horizon.
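One way to picture "numeric scores, never words" together with "forgets over time" is a small on-device store whose values decay exponentially. This is a sketch under stated assumptions, not the Guardian's real data model: the class name, the pattern labels, and the 14-day half-life are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical labels, borrowed from the four harms described earlier.
PATTERNS = (
    "emotional_dependency",
    "reality_detachment",
    "self_harm_amplification",
    "autonomy_erosion",
)


@dataclass
class PatternScores:
    """Cross-session state: numbers only, no message text is ever stored."""

    half_life_days: float = 14.0  # assumed forgetting horizon
    scores: dict = field(default_factory=lambda: {p: 0.0 for p in PATTERNS})
    last_update_day: float = 0.0

    def decay_to(self, day: float) -> None:
        """Halve every score for each half-life elapsed since the last update."""
        elapsed = max(0.0, day - self.last_update_day)
        factor = 0.5 ** (elapsed / self.half_life_days)
        for name in self.scores:
            self.scores[name] *= factor
        self.last_update_day = max(day, self.last_update_day)

    def record(self, pattern: str, session_score: float, day: float) -> None:
        """Fold one session's score (clamped to 0.0-1.0) into the running value."""
        if pattern not in self.scores:
            raise KeyError(pattern)
        self.decay_to(day)
        self.scores[pattern] += min(1.0, max(0.0, session_score))
```

For example, with a 10-day half-life, a score of 0.8 recorded on day 0 has faded to 0.4 by day 10, and keeps shrinking toward zero unless new sessions add to it.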

It is, by design and by conviction, on your side.
