For practitioners — safety reference

How the AI Coach handles risk

The detection, screening, gating, and routing logic underpinning every AI Coach conversation. Documented openly so referring practitioners can assess whether the product is appropriate for the clients they’re considering directing here.

1 · Three-tier safety system

Concern accumulates; tiers govern response

Every AI Coach exchange is screened across 11 detection categories using 56+ regex patterns (current as of April 2026). Patterns are clinically grounded, drawn from validated screening instruments and crisis-line language taxonomy, and are expanded in response to quarterly red-team adversarial test results. Each detected signal carries a numeric weight; weights accumulate into a concern score on the member’s coaching profile, and that score decays by 1 point per month of signal-free conversation.
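The accumulation and decay just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the names `ConcernProfile`, `add_signals`, and `apply_decay` are hypothetical, decay is assumed to be applied once per full signal-free month, and flooring the score at zero is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ConcernProfile:
    """Hypothetical per-member record holding the accumulated concern score."""
    score: int = 0

    def add_signals(self, weights: list[int]) -> int:
        """Add the weight of every signal detected in one exchange."""
        self.score += sum(weights)
        return self.score

    def apply_decay(self, signal_free_months: int) -> int:
        """Decay 1 point per signal-free month; flooring at 0 is an assumption."""
        self.score = max(0, self.score - signal_free_months)
        return self.score


profile = ConcernProfile()
profile.add_signals([2, 1])  # e.g. a hopelessness signal (2) and an isolation signal (1)
profile.add_signals([4])     # e.g. psychotic content (weight 4)
print(profile.score)         # 7
profile.apply_decay(3)       # three signal-free months
print(profile.score)         # 4
```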

Tier 1
Score 0–5 · Support

Normal coaching. The Coach engages with what the member brings using the Motivational Interviewing / ACT / Solutions-Focused frame. No safety intervention.

Tier 2
Score 6–12 · Recommend

Coaching continues, but the Coach surfaces a once-per-session suggestion to consider professional support, alongside grounding micro-skills (box breathing, 5-4-3-2-1, cold water, supportive self-touch). Practitioner match card is foregrounded on the dashboard.

Tier 3
Score 13+ or immediate trigger · Refer

Coaching pauses. The Coach moves to a brief, grounded, safety-oriented response. Country-specific crisis lines surface (Lifeline 13 11 14 AU, 988 US, Samaritans 116 123 UK, findahelpline.com worldwide). The member is encouraged to contact their practitioner or a crisis line. No coaching tasks. No new practices.

Immediate triggers. Specific high-severity phrases (suicide, self-harm, plan to die, intent to harm a child or partner) bypass the score accumulator and route directly to Tier 3 on first detection. There is no threshold to clear.
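The tier boundaries and the immediate-trigger bypass can be expressed compactly. The sketch below is illustrative only: `Tier` and `tier_for` are hypothetical names, and the thresholds are taken from the tier descriptions above.

```python
from enum import Enum


class Tier(Enum):
    SUPPORT = 1    # score 0-5
    RECOMMEND = 2  # score 6-12
    REFER = 3      # score 13+ or immediate trigger


def tier_for(score: int, immediate_trigger: bool = False) -> Tier:
    """Map a concern score to a tier; an immediate trigger skips the score check."""
    if immediate_trigger or score >= 13:
        return Tier.REFER
    if score >= 6:
        return Tier.RECOMMEND
    return Tier.SUPPORT


print(tier_for(5))                          # Tier.SUPPORT
print(tier_for(6))                          # Tier.RECOMMEND
print(tier_for(0, immediate_trigger=True))  # Tier.REFER
```

Checking the immediate trigger first means a first-time disclosure with a score of zero still routes to Tier 3.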

2 · Detection categories

Eleven categories, weighted by clinical risk

| Category | Weight | Examples (paraphrased, not the literal regex) |
|---|---|---|
| Immediate crisis | 20 (immediate Tier 3) | Direct ideation language; specific plan; intent to act. |
| DV violence / threats | 3–5 | Physical harm, threats, coercive control, sexual violence (toward member or by member). |
| Psychotic content | 4 | Hallucinations, paranoia, delusional reference. Coach does not challenge content; recommends GP / psychiatrist. |
| Eating disorder behaviour | 2–3 | Purging, restricting, body image preoccupation. ED-specific helplines surface (Butterfly, BEAT, ANAD). |
| Dissociation / trauma | 2–3 | Time loss, depersonalisation, flashbacks. Triggers trauma-aware practice gating (see section 4). |
| Child harm | 5 | Disclosure of harm to a child. Separate response path with mandatory-reporting resources. |
| Vulnerable person harm | 5 | Disclosure of harm to elder, disabled person, or person with capacity limitations. Hourglass / Eldercare Locator surfaces. |
| Hopelessness | 1–3 | Absolutist language, future inability, “everyone better off without me.” Beck Hopelessness Scale phrasing. |
| Isolation | 1–2 | No one to call, withdrawal, severed contact. |
| Sleep / substance / grief | 1–2 | Coping-framed substance use, sleep collapse, unprocessed loss. |
| Self-abandonment | 1–2 | “I don’t matter,” chronic self-erasure, identity loss. |

3 · Validated micro-screeners

PHQ-2 and C-SSRS, conversationally administered

When the score profile and conversational signals indicate it, the Coach offers a brief validated screener. The exact wording of the screener items is preserved — only the framing around the items is conversational.

PHQ-2 (Kroenke et al., 2003)

2-item depression screen · ≥3/6 = positive

Triggered when concern score sits 4–12 with persistent low-mood signals. Sensitivity 83%, specificity 92% for major depression. 30-day cooldown between administrations. Positive result triggers Tier 2 escalation and a practitioner-support recommendation.
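A minimal sketch of the PHQ-2 trigger and scoring rules follows. The function names are hypothetical, and treating the cooldown as satisfied at exactly 30 days is an assumption. Each PHQ-2 item is scored 0 to 3, which gives the 0 to 6 total behind the ≥3 cutoff.

```python
from datetime import date, timedelta
from typing import Optional

PHQ2_COOLDOWN = timedelta(days=30)


def phq2_eligible(score: int, persistent_low_mood: bool,
                  last_administered: Optional[date], today: date) -> bool:
    """Offer the PHQ-2 only at concern score 4-12 with persistent low-mood signals."""
    if not (4 <= score <= 12 and persistent_low_mood):
        return False
    # Assumption: exactly 30 days since the last administration is enough.
    return last_administered is None or today - last_administered >= PHQ2_COOLDOWN


def phq2_positive(item1: int, item2: int) -> bool:
    """Sum the two items (each 0-3); a total of 3 or more is a positive screen."""
    for item in (item1, item2):
        if item not in range(4):
            raise ValueError("PHQ-2 items are scored 0-3")
    return item1 + item2 >= 3


print(phq2_eligible(7, True, date(2026, 4, 1), date(2026, 4, 20)))  # False (cooldown)
print(phq2_positive(2, 1))                                          # True
```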

C-SSRS 2-item screen (Posner et al., 2011)

Yes/no items · yes on either item = positive

Triggered when hopelessness and future-inability signals co-occur, or when concern score is ≥8 with ideation-adjacent language. Positive result triggers immediate Tier 3 routing with crisis resources, regardless of total score.
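The trigger and routing conditions for the C-SSRS screen can be sketched as below. All names are hypothetical. The source does not say what happens after a negative screen, so returning `"score_based"` (routing stays with the concern score) is an assumption.

```python
def cssrs_should_offer(hopelessness: bool, future_inability: bool,
                       score: int, ideation_adjacent: bool) -> bool:
    """Offer the screen on co-occurring signals, or on score >= 8 with ideation-adjacent language."""
    return (hopelessness and future_inability) or (score >= 8 and ideation_adjacent)


def route_after_cssrs(answers: tuple[bool, bool]) -> str:
    """A 'yes' on either item routes to Tier 3 regardless of the total concern score."""
    if any(answers):
        return "tier3"
    # Assumption: a negative screen leaves routing to the accumulated score.
    return "score_based"


print(cssrs_should_offer(True, True, 2, False))  # True
print(route_after_cssrs((False, True)))          # tier3
```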

Limits. These are screening instruments, not diagnostic instruments. A negative screen does not rule out clinical-grade distress; a positive screen indicates further assessment is warranted but is not itself a diagnosis. They are deployed here as a triage signal, not as a substitute for clinical evaluation.

4 · Trauma-aware practice gating

Blocking practices that can destabilise

When dissociation signals are active in the member’s coaching profile and their Inner Practice Depth preference is set to “exploratory” or “deep,” the Coach blocks practices that are documented to risk destabilising trauma survivors:

  • Blocked: intensive breathwork, body scanning without grounding, visualisation into traumatic content, energy clearing, prolonged eye-closed meditation.
  • Offered instead: 5-4-3-2-1 grounding, gentle movement, oriented eyes-open mindfulness, hand-tracing breath, supportive self-touch, cold water on wrists.
  • Recommended: trauma-informed therapist (EMDR, Somatic Experiencing). The Coach explicitly tells the member “this work is best done with a real person trained in trauma — here’s why.”
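The gating rule above amounts to a simple check. In the sketch below, the practice identifiers are hypothetical slugs for the practices named in the list, and `gate_practice` is an illustrative name rather than the product's API.

```python
# Hypothetical slugs for the blocked practices listed above.
BLOCKED_PRACTICES = frozenset({
    "intensive_breathwork",
    "body_scan_without_grounding",
    "visualisation_into_traumatic_content",
    "energy_clearing",
    "prolonged_eyes_closed_meditation",
})
GROUNDING_ALTERNATIVES = (
    "5-4-3-2-1 grounding",
    "gentle movement",
    "oriented eyes-open mindfulness",
    "hand-tracing breath",
    "supportive self-touch",
    "cold water on wrists",
)
GATED_DEPTHS = frozenset({"exploratory", "deep"})


def gate_practice(practice: str, dissociation_active: bool, depth: str) -> dict:
    """Return whether a practice may be offered, plus alternatives if it is blocked."""
    gating_on = dissociation_active and depth in GATED_DEPTHS
    if gating_on and practice in BLOCKED_PRACTICES:
        return {
            "allowed": False,
            "offer_instead": list(GROUNDING_ALTERNATIVES),
            "recommend": "trauma-informed therapist (EMDR, Somatic Experiencing)",
        }
    return {"allowed": True, "offer_instead": [], "recommend": None}


print(gate_practice("energy_clearing", True, "deep")["allowed"])   # False
print(gate_practice("energy_clearing", False, "deep")["allowed"])  # True
```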

5 · Dependency monitoring

Five-tier session frequency cap

The Coach is designed to prevent itself from becoming a substitute for a real human relationship. Conversation frequency, decision-deferral patterns, validation-seeking, and AI-preference language are tracked separately from concern score and trigger graduated response:

| Tier | Frequency | Coach response |
|---|---|---|
| Tier 1 | 1–2 conversations/day | Normal coaching. |
| Tier 2 | 3 conversations/day | Coach shifts to gentle reflection prompts: “What would it look like to sit with what came up?” |
| Tier 3 | 4 conversations/day | Coach offers community / support nudge: “Have you spoken with someone you trust about this?” |
| Tier 4 | 5 conversations/day | Coach sets a direct boundary and recommends a human practitioner. Practitioner match foregrounded. |
| Tier 5 | 6+ conversations/day | Coach returns a system message only. No AI response. Member is encouraged to step away from the screen. |

Safety override. Concern score ≥6 unlocks continued access at any tier. Crisis support takes precedence over dependency caps. The cap is not triggered when the member is genuinely in distress and needs the Coach’s safety routing.
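The frequency cap and its safety override can be sketched as follows. The function names are hypothetical, and the sketch covers only conversation frequency, not the decision-deferral, validation-seeking, or AI-preference signals that are also tracked.

```python
def dependency_tier(conversations_today: int) -> int:
    """Map today's conversation count to dependency tier 1-5."""
    if conversations_today <= 2:
        return 1
    if conversations_today == 3:
        return 2
    if conversations_today == 4:
        return 3
    if conversations_today == 5:
        return 4
    return 5


def coach_may_respond(conversations_today: int, concern_score: int) -> bool:
    """Tier 5 returns a system message only, unless the safety override applies."""
    if concern_score >= 6:  # safety override: crisis support outranks the cap
        return True
    return dependency_tier(conversations_today) < 5


print(coach_may_respond(7, concern_score=2))  # False
print(coach_may_respond(7, concern_score=6))  # True
```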

6 · Third-party harm path

Separate response for disclosures of harm to others

Disclosures of harm to a child, partner (DV), elder, or vulnerable person trigger a distinct response path with mandatory-reporting framing and category-specific resources. This is intentionally not the same path as self-harm or member-directed crisis — different obligations, different language, different resources.

  • Child harm: Kids Helpline 1800 55 1800 (AU), NSPCC 0808 800 5000 (UK), Childhelp 1-800-422-4453 (US). Coach surfaces mandatory-reporting framing for adults disclosing knowledge of child harm.
  • DV / coercive control: 1800RESPECT 1800 737 732 (AU), Refuge 0808 2000 247 (UK), US National DV Hotline 1-800-799-7233. Coach does NOT recommend couples counselling, does NOT pressure the member to leave, does NOT gather investigative detail.
  • Elder / vulnerable person: Hourglass 0808 808 8141 (UK, formerly Action on Elder Abuse), Compass 1800 353 374 (AU), Eldercare Locator 1-800-677-1116 (US).
  • Eating disorders: Butterfly Foundation 1800 33 4673 (AU), BEAT 0808 801 0677 (UK), ANAD 1-888-375-7767 (US — note NEDA helpline closed May 2023).
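The category-specific resources listed above lend themselves to a lookup keyed by category and country. The sketch below is illustrative: the category keys and `resource_for` are hypothetical names, the numbers are copied from the list above, and returning `None` for an unlisted country is an assumption (the source does not specify a fallback on this path).

```python
from typing import Optional

# (category, country) -> (service name, phone number), copied from the list above.
RESOURCES: dict[tuple[str, str], tuple[str, str]] = {
    ("child_harm", "AU"): ("Kids Helpline", "1800 55 1800"),
    ("child_harm", "UK"): ("NSPCC", "0808 800 5000"),
    ("child_harm", "US"): ("Childhelp", "1-800-422-4453"),
    ("dv", "AU"): ("1800RESPECT", "1800 737 732"),
    ("dv", "UK"): ("Refuge", "0808 2000 247"),
    ("dv", "US"): ("US National DV Hotline", "1-800-799-7233"),
    ("elder_vulnerable", "UK"): ("Hourglass", "0808 808 8141"),
    ("elder_vulnerable", "AU"): ("Compass", "1800 353 374"),
    ("elder_vulnerable", "US"): ("Eldercare Locator", "1-800-677-1116"),
    ("eating_disorder", "AU"): ("Butterfly Foundation", "1800 33 4673"),
    ("eating_disorder", "UK"): ("BEAT", "0808 801 0677"),
    ("eating_disorder", "US"): ("ANAD", "1-888-375-7767"),
}


def resource_for(category: str, country: str) -> Optional[tuple[str, str]]:
    """Return (service, number) for a category and country code, or None if unlisted."""
    return RESOURCES.get((category, country.upper()))


print(resource_for("dv", "uk"))          # ('Refuge', '0808 2000 247')
print(resource_for("child_harm", "NZ"))  # None
```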

7 · Adversarial red-team auditing

Tested against an attack corpus, every quarter

The safety detection layer is tested quarterly against a 182-case adversarial corpus spanning 15 attack categories: direct crisis disclosure, prompt injection, role-play hijack, code-switching obfuscation (leetspeak, letter-spacing, homoglyph substitution), indirect disclosure, screener bypass, de-escalation gaming, frequency bypass, third-party harm, trauma-practice luring, helpline bypass, identity probing, dependency signal recognition, benign control (false-positive guard), and edge vocabulary.

Current pass rate (April 2026): 93.4% on the regex layer (170/182). Eleven of fifteen categories at 100%. Known gaps are documented and tracked under KNOWN_GAP_IDS in the corpus (currently 9 entries).

Live LLM-layer testing: the second audit layer tests actual Coach responses against adversarial prompts. As of April 2026, pass rates were 7/7 (100%) for role_play, 8/8 (100%) for helpline_bypass, and 4/4 (100%) for identity_probe. Across the whole run, all 20 cases that actually executed passed (100%); a further 6 cases were skipped because of rate limiting. Next quarterly audit cycle: 2026-07-23.

What this means for you: the safety layer is treated as production-critical infrastructure with ongoing adversarial testing rather than as one-shot guardrails. Every fix is shipped with a regression smoke assertion — once a gap is closed, it cannot silently regress.
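As a rough illustration of the reported pass rate and the regression guard, consider the sketch below. `pass_rate`, `regressed_cases`, and the closed-gap ID set are hypothetical; the source names only KNOWN_GAP_IDS and does not describe how the regression assertions are written. Treating a closed-gap case that is missing from the results as regressed is a conservative assumption.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of corpus cases passed, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def regressed_cases(results: dict[str, bool], closed_gap_ids: set[str]) -> list[str]:
    """Return IDs of previously closed gaps that are failing (or missing) again."""
    # Assumption: a closed-gap case absent from the results counts as regressed.
    return sorted(cid for cid in closed_gap_ids if not results.get(cid, False))


print(pass_rate(170, 182))  # 93.4

# Hypothetical case IDs, for illustration only.
results = {"case-001": True, "case-002": False, "case-003": True}
print(regressed_cases(results, {"case-001", "case-002"}))  # ['case-002']
```

A smoke test built on this would assert that `regressed_cases(...)` returns an empty list before a release ships.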

8 · What practitioners can count on

In plain terms

  • Members in active crisis are routed to crisis resources, not to deeper coaching conversation.
  • Members asking the Coach to act as a therapist are told, clearly, that it is not one.
  • Members sliding into excessive use are met with graduated boundary-setting, not infinite engagement.
  • Members showing dissociation signals are blocked from somatic practices that could destabilise.
  • Members disclosing harm to others are met with reporting frameworks, not investigative questioning.
  • Members in the Tier 3 path receive country-specific crisis numbers, not generic platitudes.
  • The detection layer is tested adversarially every quarter; fixes ship with regression guards.

What we don’t claim. Perfect detection. No safety system catches every signal. Members can lie, evade, and present in non-canonical ways. Evaligned’s safety layer is a meaningful guardrail, not an infallible one. A member experiencing acute clinical-grade risk needs a clinician, not a chatbot, and the product is explicit with members about that distinction.