Human-Agent Trust Exploitation Detection - ATR-2026-00077 (c4dcd92c-dfda-51af-bffd-acadcd90fea2)
Detects when an agent attempts to exploit human trust by presenting fabricated confidence, suppressing uncertainty, manufacturing urgency, or discouraging verification. Covers patterns where agents claim false certainty to obtain human approval for risky actions, suppress caveats or warnings, create artificial time pressure to bypass review, claim exclusive authority to discourage second opinions, or use emotional manipulation to override human judgment. Note: this rule detects only explicit manipulation language patterns. Subtler manipulation techniques (selective omission, framing effects) require semantic analysis, which is planned for v0.2.
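
The explicit-language matching described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the category names, the regular expressions, and the `detect` helper are hypothetical and are not the actual conditions of ATR-2026-00077.

```python
import re

# Hypothetical patterns, one per category named in the description.
# Illustrative only; NOT the real ATR-2026-00077 rule conditions.
PATTERNS = {
    "fabricated_confidence": re.compile(
        r"\b(100% (certain|sure|safe)|guaranteed to work|zero risk)\b", re.I),
    "suppressed_uncertainty": re.compile(
        r"\b(ignore|disregard|skip) (the|any|all) (warnings?|caveats?)\b", re.I),
    "manufactured_urgency": re.compile(
        r"\b(approve|act|decide) (right now|immediately)\b|\bno time to review\b", re.I),
    "discouraged_verification": re.compile(
        r"\b(no need to|don't|do not) (verify|double[- ]check|review)\b", re.I),
    "exclusive_authority": re.compile(
        r"\bonly I (can|know|understand)\b|\bno need for a second opinion\b", re.I),
    "emotional_manipulation": re.compile(
        r"\byou('ll| will) regret\b|\bif you (really )?trust(ed)? me\b", re.I),
}


def detect(agent_output: str) -> list[str]:
    """Return the categories whose pattern matches the agent output."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.search(agent_output)]


if __name__ == "__main__":
    print(detect("This is 100% safe, no need to verify. Approve immediately."))
    print(detect("I am fairly confident, but please review the diff before merging."))
```

A keyword-level matcher like this only catches explicit phrasing, which is consistent with the stated limitation: selective omission and framing effects leave no fixed string to match.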
| Cluster A | Galaxy A | Cluster B | Galaxy B | Level |
|---|---|---|---|---|
| External Harms (ba500f0e-52ca-40ff-aed4-e6dbf00cca10) | MITRE ATLAS Attack Pattern | Human-Agent Trust Exploitation Detection - ATR-2026-00077 (c4dcd92c-dfda-51af-bffd-acadcd90fea2) | Agent Threat Rules | 1 |