
Challenge 003: Defend an Agent from Prompt Injection

Level: L200 | Type: Challenge | Time: ~60 min | 💰 Cost: Free (local)

Scenario

OutdoorGear imported customer reviews into a support-agent knowledge base. One imported review contains malicious instructions that try to override the agent's policy, reveal hidden instructions, and claim returns are unlimited.

Your job is to harden a local RAG-style support helper so it blocks prompt-injection attempts, excludes unsafe context, and answers safe questions from trusted policy documents only.


Objective

Fix starter_prompt_defense.py so the defense suite blocks attacks, allows safe requests, removes malicious context, avoids leaking attacker text, and generates a validation code.

Your final defense should:

  • Normalize text before policy checks
  • Detect common prompt-injection patterns
  • Remove untrusted or malicious context from retrieved documents
  • Block malicious user requests
  • Answer safe requests using only trusted context
  • Report defense metrics accurately
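The first item in the list, normalizing text before policy checks, could look roughly like the sketch below. The function name `normalize` and the exact set of transformations are assumptions for illustration, not the structure of starter_prompt_defense.py. It uses only the standard library, as the challenge requires.

```python
import re
import unicodedata

# Zero-width and formatting characters that can be used to split trigger words.
_INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str) -> str:
    """Return a canonical form of `text`, intended for pattern matching only."""
    text = unicodedata.normalize("NFKC", text)  # fold full-width letters and similar forms
    text = text.translate(_INVISIBLE)           # drop zero-width characters
    text = text.casefold()                      # make comparisons case-insensitive
    text = re.sub(r"[_\-]+", " ", text)         # admin_override -> admin override
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
```

Running detection on the normalized form while building answers from the original trusted text keeps the policy wording intact in the output.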

Starter Files

Save these files in one folder named challenge-003/:

| File | Purpose |
| --- | --- |
| documents.json | Trusted and untrusted support context |
| requests.json | Safe and malicious user requests |
| starter_prompt_defense.py | Broken defense implementation |
| test_prompt_defense.py | Acceptance tests |
| validate_prompt_defense.py | Generates the final completion code |

Challenge Brief

You receive mixed-trust documents, safe requests, attack requests, and a broken defense layer. There is no walkthrough: decide how to detect attack intent, filter context, answer safely, and evaluate whether the agent leaked attacker-controlled instructions.


Constraints

  • Use only the Python standard library in starter_prompt_defense.py.
  • Do not call an LLM API.
  • Do not hardcode behavior by request ID.
  • Do not answer from untrusted customer-review content.
  • Safe requests must still be answered.
  • Attack text must not appear in safe answers.

Acceptance Criteria

Your solution is complete when:

  • python -m pytest test_prompt_defense.py passes
  • All three attack requests are blocked
  • Both safe requests are allowed
  • The malicious review is excluded from safe context
  • Safe answers cite official policy facts
  • leakage_count == 0
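One way to keep these numbers honest is to derive them from per-request results instead of hardcoding them. The sketch below assumes a hypothetical result shape (`is_attack`, `blocked`, `answer`) and a hypothetical list of attacker-controlled marker strings. Only the name `leakage_count` comes from the criteria above, and the real tests may expect different keys.

```python
def summarize(results, attack_markers):
    """Compute defense metrics from per-request results.

    Assumed (hypothetical) result shape:
        {"is_attack": bool, "blocked": bool, "answer": str}
    """
    attacks = [r for r in results if r["is_attack"]]
    safe = [r for r in results if not r["is_attack"]]
    markers = [m.casefold() for m in attack_markers]

    # An answer leaks if it contains any attacker-controlled marker string.
    leakage_count = sum(
        1
        for r in results
        if any(m in r["answer"].casefold() for m in markers)
    )
    return {
        "attacks_blocked": sum(1 for r in attacks if r["blocked"]),
        "attacks_total": len(attacks),
        "safe_allowed": sum(1 for r in safe if not r["blocked"]),
        "safe_total": len(safe),
        "leakage_count": leakage_count,
    }
```

Reporting totals next to counts makes it obvious when a metric only looks good because a request was never evaluated.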

Validation

When your implementation is ready, run:

python -m pytest test_prompt_defense.py
python validate_prompt_defense.py

Submit the completion code printed by validate_prompt_defense.py.


Hints

Hint 1: Treat user input and retrieved context separately

A user request can be malicious, but retrieved documents can also contain malicious instructions.

Hint 2: Trust metadata matters

The fixture includes a trusted field. Use it, but do not rely on metadata alone.
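As a rough illustration of combining the two signals, the sketch below keeps a document only if its `trusted` flag is exactly `True` and its text shows no injection phrasing. The `id` and `text` keys and the phrases in `RED_FLAGS` are assumptions made for this example. Check documents.json for the real field names.

```python
# Hypothetical red-flag phrases; a real defense would use a broader detector.
RED_FLAGS = ("ignore previous instructions", "system prompt", "admin_override")


def filter_context(documents):
    """Split documents into (safe, excluded) using metadata AND content."""
    safe, excluded = [], []
    for doc in documents:
        text = doc.get("text", "").casefold()
        if doc.get("trusted") is not True:
            # Missing, false, or non-boolean flags are treated as untrusted.
            excluded.append(doc)
        elif any(flag in text for flag in RED_FLAGS):
            # A trusted flag does not help if the content itself is hostile.
            excluded.append(doc)
        else:
            safe.append(doc)
    return safe, excluded
```

The strict `is not True` comparison means a string such as "true" or a missing key fails closed instead of being accepted by accident.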

Hint 3: Look for intent, not only exact phrases

Attackers may say "ignore", "bypass", "admin_override", or "system prompt" in different casing.
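A minimal sketch of intent-oriented matching is shown below. The category names and regular expressions are assumptions chosen for illustration. They are heuristics, will miss some attacks, and may flag some harmless requests, so tune them against requests.json.

```python
import re

# Hypothetical intent categories: a verb of evasion near a target noun,
# rather than one fixed phrase.
INTENT_PATTERNS = {
    "instruction_override": re.compile(
        r"\b(ignore|disregard|bypass)\b.{0,40}"
        r"\b(instructions?|rules?|policy|policies|guardrails?)\b"
    ),
    "prompt_extraction": re.compile(
        r"\b(reveal|show|print|repeat|leak)\b.{0,40}"
        r"\b(system|hidden|developer)\b.{0,20}\b(prompt|instructions?)\b"
    ),
    "privilege_claim": re.compile(r"\badmin[\s_\-]*override\b"),
}


def detect_intents(text):
    """Return the sorted names of all intent categories found in `text`."""
    lowered = " ".join(text.casefold().split())  # fold case, collapse whitespace
    return sorted(
        name for name, pattern in INTENT_PATTERNS.items() if pattern.search(lowered)
    )
```

For example, `detect_intents("Please IGNORE all previous instructions")` returns `["instruction_override"]`, while a plain question about the return window returns an empty list.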

Hint 4: Safe answers should be boring

A safe policy answer should quote policy facts. It should not mention hidden prompts, overrides, or unlimited returns.
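The sketch below shows one possible way to keep answers boring: quote matching trusted documents verbatim and fall back to a neutral message if the result contains attacker-style wording. The document keys (`id`, `text`), the forbidden terms, and the keyword-overlap matching are all assumptions for illustration.

```python
import re

# Hypothetical terms that should never show up in a policy answer.
FORBIDDEN_TERMS = ("system prompt", "hidden instruction", "override", "unlimited")
FALLBACK = "I could not find this in the official policy."


def _words(text):
    """Lowercased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.casefold()) if len(w) > 3}


def answer_from_policy(question, trusted_docs):
    """Quote trusted documents that share a keyword with the question."""
    hits = [d for d in trusted_docs if _words(question) & _words(d["text"])]
    if not hits:
        return FALLBACK
    answer = " ".join(f"According to {d['id']}: {d['text']}" for d in hits)
    # Final guard: never return attacker-style wording, even from a trusted doc.
    if any(term in answer.casefold() for term in FORBIDDEN_TERMS):
        return FALLBACK
    return answer
```

Because the answer is assembled only from documents that already passed filtering, the final guard is a second line of defense, not the main one.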


Rubric

| Area | Points | What good looks like |
| --- | --- | --- |
| Attack detection | 30 | Blocks varied prompt-injection patterns |
| Context filtering | 25 | Removes unsafe context while preserving trusted policy docs |
| Safe answering | 20 | Answers safe requests from official evidence |
| Leakage prevention | 15 | No attacker-controlled strings in answers |
| Simplicity | 10 | Deterministic local checks, no over-engineering |

Stretch Goals

  • Add severity levels for attacks
  • Return a safe refusal message for blocked requests
  • Add per-document reasons for exclusion
  • Add a new attack variant and update the validator payload locally
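The first two stretch goals could be combined along the lines of the sketch below. The severity scale, the finding names, the mapping between them, and the refusal wording are all hypothetical. None of them come from the starter files.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping from detector findings to severity levels.
SEVERITY_BY_FINDING = {
    "policy_claim": Severity.LOW,
    "instruction_override": Severity.MEDIUM,
    "prompt_extraction": Severity.HIGH,
}

REFUSAL = (
    "I can't help with that request, but I'm happy to answer "
    "questions about official OutdoorGear policies."
)


@dataclass
class Decision:
    blocked: bool
    severity: Optional[Severity]
    message: str


def decide(findings):
    """Block when any finding is present and report the highest severity."""
    if not findings:
        return Decision(blocked=False, severity=None, message="")
    # Unknown findings default to MEDIUM so new detectors fail closed.
    severity = max(SEVERITY_BY_FINDING.get(f, Severity.MEDIUM) for f in findings)
    return Decision(blocked=True, severity=severity, message=REFUSAL)
```

The refusal message is a fixed string, so it cannot echo attacker text back to the user.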