The Paradox
You add a rule to lessons.md: "Always validate API inputs." Reasonable. A month later, your agent refuses to call an internal API without writing a 30-line validation function first — even when the input comes from your own tested code.
You add another rule: "Check for null before accessing properties." Sensible. Three weeks later, the agent wraps every single property access in null checks, turning clean code into defensive spaghetti.
You added instructions to help. They ended up hurting. The instructions are attacking your own code.
This is autoimmune disease.
The Immune System Mapping
Agent reliability works exactly like biological immunity — with innate (fast, non-specific) and adaptive (learned, specific) defense layers:
- DNA → CLAUDE.md: Rarely mutates, encodes identity and core behavior. The genetic code of your agent.
- Innate immunity → Hooks: Fast, deterministic, non-specific defenses. Lint-on-save, type checking, test gates. They fire every time, without judgment.
- Adaptive immunity → lessons.md: Learned responses to past threats. Specific, powerful — but requires regulation.
- T-cells → Subagents: Specialized responders deployed for specific threats.
- Antibody library → SKILL.md: Proven response patterns, ready for reuse when the right pathogen (task) appears.
- Vaccination → Human-approved lesson promotion: Controlled exposure creating lasting immunity.
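The innate layer is the easiest to make concrete: a deterministic gate that runs the same checks on every change, with no judgment involved. A minimal sketch in Python — the `run_gate` helper and the ruff/mypy/pytest wiring are illustrative assumptions, not a prescribed setup:

```python
import subprocess
import sys

def run_gate(checks):
    """Run each (name, cmd) check in order; return the name of the first
    failing check, or None if every gate passes."""
    for name, cmd in checks:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return name  # fail fast and block the change, no judgment call
    return None

# Illustrative wiring -- substitute whatever tools your project really uses:
EXAMPLE_CHECKS = [
    ("lint", ["ruff", "check", "."]),
    ("types", ["mypy", "."]),
    ("tests", ["pytest", "-q"]),
]
```

The point is not the specific tools but the shape: the gate fires every time, identically, like innate immunity.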
The key insight: the immune system has two failure modes, not one. Everyone thinks about immunodeficiency (too few defenses). Nobody thinks about autoimmunity (defenses attacking the host).
Autoimmune Drift
Autoimmune drift is what happens when lessons.md accumulates too many rules and the agent starts "attacking" valid code patterns — rejecting good solutions because they superficially match a past failure.
Symptoms
- Pattern avoidance: Agent refuses to use a pattern it used to use successfully
- Defensive bloat: Agent adds unnecessary protective code "just in case"
- Over-validation: Agent over-validates and slows down on routine tasks
- Cross-contamination: Agent applies a backend rule to frontend code (wrong tissue, wrong antibody)
- Exploration paralysis: Agent explores excessively — running tests, reading files, checking docs — instead of acting
The Mechanism
In biology, T-regulatory cells suppress immune responses against the body's own cells. Without them, the immune system treats healthy tissue as a threat. In agent terms:
A lessons.md without governance is an immune system without T-regulatory cells. Eventually, it starts attacking your own code.
The accumulation follows a predictable path:
- Week 1: 5 lessons. All valid. Agent performs well.
- Month 1: 25 lessons. Some overlap. Agent starts over-checking.
- Month 3: 60 lessons. Contradictions emerge. Agent behavior becomes inconsistent.
- Month 6: 120 lessons. Full autoimmune cascade. Agent refuses valid patterns, adds defensive code everywhere, and takes 3x longer on routine tasks.
The Hard Evidence
The AGENTS.md Paper (February 2026)
The AGENTS.md results are a clinical diagnosis of autoimmune disease:
- "Context files tend to REDUCE task success rates" compared to no context. The immune response hurts more than it helps.
- "Both LLM-generated and developer-provided context files encourage behavioral changes" — the agent follows rules even when they hurt performance. Instruction-following becomes compulsive.
- Inference cost increases by over 20% with context files. The immune system is working overtime — consuming energy (tokens) fighting phantoms.
- Removing existing documentation and replacing it with context files performed better than keeping both. Two information sources fighting = autoimmune clash.
Pythia: Optimization Instability (February 2026)
The Pythia paper describes optimization instability in autonomous agents: performance oscillates between 1.0 and 0.0 across iterations. The "guiding agent intervention" — meant to correct drift — amplified overfitting instead of correcting it.
This IS autoimmune drift in a clinical setting. The correction mechanism makes things worse.
SkillsBench: Focused Beats Comprehensive
The SkillsBench research found that "focused Skills with 2–3 modules outperform comprehensive documentation." More specifically: comprehensive skill packages degraded performance by 2.9 percentage points. More immune response = more damage to the host.
OpenClaw: Pathogen Injection
If autoimmune drift is the internal threat, pathogen injection is the external one. And the OpenClaw ecosystem demonstrates both:
Malicious Skills = Pathogen Infection
- Cisco found that 26% of 31,000 agent skills contained at least one vulnerability
- VirusTotal detected hundreds of actively malicious OpenClaw skills
- The #1 ranked community skill ("What Would Elon Do?") was functionally malware — silent data exfiltration + prompt injection
- A single publisher was found pushing hundreds of malicious packages through the marketplace
Without immune regulation (vetting), the skill ecosystem becomes an infection vector.
VirusTotal Partnership = Vaccination Program
OpenClaw's response was exactly the biological analog: a vaccination program.
- All ClawHub skills now scanned via VirusTotal Code Insight
- "Benign" → auto-approved. "Suspicious" → warning. "Malicious" → instantly blocked.
- Daily re-scans of all active skills (immune surveillance)
This IS the vaccination/immune regulation model. But it only handles external pathogens. The autoimmune problem — your own rules attacking your own code — requires a different treatment.
Diagnosis Checklist
Run this diagnostic on your agent setup today:
- Count your lessons. If lessons.md has more than 30 entries, you're in the risk zone.
- Check for contradictions. Search for rules that conflict with each other or with CLAUDE.md.
- Test pattern acceptance. Ask the agent to implement a simple pattern it should know. If it hesitates, over-validates, or refuses — autoimmune symptoms.
- Measure completion time. If routine tasks take 2x longer than they did a month ago, instruction bloat is the likely cause.
- Check for cross-contamination. Are backend rules affecting frontend code? Database rules affecting API code? Scope bleeding is autoimmune spread.
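Several of these checks can be automated. A rough sketch, assuming each lesson is a `- ` bullet optionally tagged like `[scope: backend]` — the `diagnose` helper and the tag format are illustrative, not a standard:

```python
import re

RISK_THRESHOLD = 30  # from the checklist: more than 30 entries = risk zone

def diagnose(lessons_text):
    """Count lessons in a lessons.md-style string and flag the two cheapest
    symptoms: total rule count and rules with no scope tag."""
    lessons = [ln for ln in lessons_text.splitlines()
               if ln.lstrip().startswith("- ")]
    untagged = [ln for ln in lessons
                if not re.search(r"\[scope:\s*\w+\]", ln)]
    return {
        "count": len(lessons),
        "risk_zone": len(lessons) > RISK_THRESHOLD,
        # Untagged rules load globally -- a cross-contamination risk.
        "untagged": len(untagged),
    }
```

Contradiction detection and completion-time measurement still need a human, but count and scope coverage are one function call.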
Treatment Protocol
Immediate Triage
- Quarantine. Copy your current lessons.md to lessons-backup.md. Start with an empty file. See if agent performance improves immediately.
- Selective reintroduction. Add back rules one category at a time. Measure success rate after each batch. Stop when performance plateaus or drops.
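The quarantine step is only a few commands. A sketch, assuming the hypothetical `[scope: …]` tagging convention used above (the setup line just creates a sample file so the commands run anywhere):

```shell
# Demo setup: a tiny lessons file (replace with your real one).
printf -- '- [scope: backend] validate inputs\n- [scope: api] check auth\n' > lessons.md

# Quarantine: back up the full file, then start with an empty one.
cp lessons.md lessons-backup.md
: > lessons.md

# Selective reintroduction: restore one scope at a time, then re-measure
# task success before adding the next batch.
grep '\[scope: backend\]' lessons-backup.md >> lessons.md
```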
Ongoing Immune Regulation
- TTL on all lessons. Every lesson gets an expiry date. No exceptions. 60 days is a good default.
- Scope tagging. Every lesson must specify where it applies: backend, frontend, api, tests. Rules without scope are loaded globally — a cross-contamination risk.
- Confidence scoring. High-confidence rules (verified by humans, linked to specific incidents) get priority. Low-confidence rules (auto-generated, vague) get lower weight or faster expiry.
- Monthly pruning. Schedule it. Review every lesson. Ask: "Is this still helping? Does the evidence still apply?" If no clear answer — remove it.
- Max rules per scope. Set a hard limit. 10 rules per scope. If you need to add rule #11, you must remove one first. This is the capacity limit that prevents accumulation.
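Taken together, TTL, scope tags, confidence scores, and the per-scope cap amount to a small pruning routine. A sketch in Python — the record fields and the `prune` helper are illustrative assumptions, not a fixed schema:

```python
import datetime

MAX_PER_SCOPE = 10       # hard capacity limit from the protocol
DEFAULT_TTL_DAYS = 60    # default expiry: 60 days, no exceptions

def make_lesson(text, scope, confidence, created, ttl_days=DEFAULT_TTL_DAYS):
    """A governed lesson record: scoped, scored, and expiring."""
    return {"text": text, "scope": scope, "confidence": confidence,
            "expires": created + datetime.timedelta(days=ttl_days)}

def prune(lessons, today):
    """Drop expired lessons, then enforce the per-scope cap by keeping
    the highest-confidence rules in each scope."""
    alive = [l for l in lessons if l["expires"] > today]
    by_scope = {}
    for lesson in alive:
        by_scope.setdefault(lesson["scope"], []).append(lesson)
    kept = []
    for scope_lessons in by_scope.values():
        scope_lessons.sort(key=lambda l: l["confidence"], reverse=True)
        kept.extend(scope_lessons[:MAX_PER_SCOPE])
    return kept
```

Run it on a schedule; the monthly human review then only has to judge the survivors, not the whole pile.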
Your agent needs immune regulation, not just immune memory. Pruning stale lessons is as critical as learning new ones.
This is Part 2 of the Eureka Series. Previous: Your AI Agent Has 200K Tokens of RAM. Next: The Three-Body Problem of AI Agent Instructions.
