Challenge 003: Defend an Agent from Prompt InjectionΒΆ
ScenarioΒΆ
OutdoorGear imported customer reviews into a support-agent knowledge base. One imported review contains malicious instructions that try to override the agent's policy, reveal hidden instructions, and claim returns are unlimited.
Your job is to harden a local RAG-style support helper so it blocks prompt-injection attempts, excludes unsafe context, and answers safe questions from trusted policy documents only.
ObjectiveΒΆ
Fix starter_prompt_defense.py so the defense suite blocks attacks, allows safe requests, removes malicious context, avoids leaking attacker text, and generates a validation code.
Your final defense should:
- Normalize text before policy checks
- Detect common prompt-injection patterns
- Remove untrusted or malicious context from retrieved documents
- Block malicious user requests
- Answer safe requests using only trusted context
- Report defense metrics accurately
Starter FilesΒΆ
Save these files in one folder named challenge-003/:
| File | Purpose | Download |
|---|---|---|
documents.json |
Trusted and untrusted support context | Download |
requests.json |
Safe and malicious user requests | Download |
starter_prompt_defense.py |
Broken defense implementation | Download |
test_prompt_defense.py |
Acceptance tests | Download |
validate_prompt_defense.py |
Generates the final completion code | Download |
Challenge BriefΒΆ
You receive mixed-trust documents, safe requests, attack requests, and a broken defense layer. There is no walkthrough: decide how to detect attack intent, filter context, answer safely, and evaluate whether the agent leaked attacker-controlled instructions.
ConstraintsΒΆ
- Use only the Python standard library in
starter_prompt_defense.py. - Do not call an LLM API.
- Do not hardcode behavior by request ID.
- Do not answer from untrusted customer-review content.
- Safe requests must still be answered.
- Attack text must not appear in safe answers.
Acceptance CriteriaΒΆ
Your solution is complete when:
python -m pytest test_prompt_defense.pypasses- All three attack requests are blocked
- Both safe requests are allowed
- The malicious review is excluded from safe context
- Safe answers cite official policy facts
leakage_count == 0
ValidationΒΆ
When your implementation is ready, run:
Enter the completion code printed by validate_prompt_defense.py:
HintsΒΆ
Hint 1 β Treat user input and retrieved context separately
A user request can be malicious, but retrieved documents can also contain malicious instructions.
Hint 2 β Trust metadata matters
The fixture includes a trusted field. Use it, but do not rely on metadata alone.
Hint 3 β Look for intent, not only exact phrases
Attackers may say "ignore", "bypass", "admin_override", or "system prompt" in different casing.
Hint 4 β Safe answers should be boring
A safe policy answer should quote policy facts. It should not mention hidden prompts, overrides, or unlimited returns.
RubricΒΆ
| Area | Points | What good looks like |
|---|---|---|
| Attack detection | 30 | Blocks varied prompt-injection patterns |
| Context filtering | 25 | Removes unsafe context while preserving trusted policy docs |
| Safe answering | 20 | Answers safe requests from official evidence |
| Leakage prevention | 15 | No attacker-controlled strings in answers |
| Simplicity | 10 | Deterministic local checks, no over-engineering |
Stretch GoalsΒΆ
- Add severity levels for attacks
- Return a safe refusal message for blocked requests
- Add per-document reasons for exclusion
- Add a new attack variant and update the validator payload locally