Challenge 007: Red-Team Agent Guardrails
Scenario
OutdoorGear is preparing a customer-facing support agent. Before launch, the safety team gives you a small red-team set with safe questions, PII exposure, jailbreak attempts, harmful requests, and off-topic requests.
Your job is to implement a local guardrail layer that blocks, redacts, or allows each request correctly.
Objective
Fix starter_guardrails.py so the guardrail layer classifies scenarios correctly, redacts email addresses, reports red-team metrics, and generates a validation code.
Your final guardrail layer should:
- Allow normal OutdoorGear support questions
- Redact email addresses before agent processing
- Block jailbreak attempts
- Block harmful physical-safety requests
- Block off-topic automation requests
- Avoid false positives on safe support questions
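One way to read the list above is as a decision with a fixed precedence: block checks run first, redaction second, and anything left over is allowed. The sketch below only illustrates that ordering. It is not the structure of starter_guardrails.py, and the names `Decision`, `decide`, `block_checks`, and `redact` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Decision:
    action: str           # "allow", "redact", or "block"
    reason: str           # short label explaining the decision
    text: Optional[str]   # text forwarded to the agent; None when blocked


def decide(
    text: str,
    block_checks: Dict[str, Callable[[str], bool]],
    redact: Callable[[str], str],
) -> Decision:
    """Apply block checks first, then redaction, then fall through to allow."""
    for reason, check in block_checks.items():
        if check(text):
            return Decision("block", reason, None)
    cleaned = redact(text)
    if cleaned != text:
        # The request is still valid; only the sensitive part was replaced.
        return Decision("redact", "pii", cleaned)
    return Decision("allow", "in_scope", text)
```

Because the checks and the redactor are passed in as plain functions, the same ordering works whatever signals you decide to use, and the result is deterministic as long as those functions are.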
Starter Files
Save these files in one folder named challenge-007/:
| File | Purpose | Download |
|---|---|---|
| scenarios.json | Red-team and safe scenarios | Download |
| starter_guardrails.py | Broken guardrail layer | Download |
| test_guardrails.py | Acceptance tests | Download |
| validate_guardrails.py | Generates the final completion code | Download |
Challenge Brief
You receive scenario fixtures and a broken guardrail implementation. There is no walkthrough: decide which signals should trigger block, redact, or allow, and make the red-team metrics match the expected behavior.
Constraints
- Use only the Python standard library in starter_guardrails.py.
- Do not hardcode behavior by scenario ID.
- Do not block normal product/return questions.
- Redaction should preserve the rest of the user request.
- Blocks should be deterministic.
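As an illustration of redaction that preserves the rest of the request, the following standard-library sketch replaces only the matched address and leaves the surrounding text untouched. The pattern is a pragmatic approximation, not a full address parser, and the `[REDACTED_EMAIL]` placeholder and the function name `redact_emails` are assumptions; check the acceptance tests for the token they actually expect.

```python
import re

# Pragmatic pattern for common addresses; it is not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Hypothetical placeholder token; the real tests may expect a different one.
PLACEHOLDER = "[REDACTED_EMAIL]"


def redact_emails(text):
    """Return (redacted_text, number_of_replacements)."""
    return EMAIL_RE.subn(PLACEHOLDER, text)


redacted, count = redact_emails(
    "My email is jane@example.com, where is my order?"
)
```

Returning the replacement count alongside the text makes it easy to tell a redacted request from an untouched one without comparing strings.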
Acceptance Criteria
Your solution is complete when:
- `python -m pytest test_guardrails.py` passes
- Safe product and return questions are allowed
- Email addresses are redacted
- Jailbreak, harmful, and off-topic requests are blocked
- `false_positive == 0`
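The metric names used by the tests are not shown here beyond `false_positive`, so the sketch below makes an assumption: a false positive is a request that was blocked although the expected action was not "block". The function name `summarize`, the `(expected, actual)` pair format, and the other keys are hypothetical.

```python
from collections import Counter


def summarize(results):
    """Summarize (expected_action, actual_action) pairs into simple counts."""
    counts = Counter()
    for expected, actual in results:
        counts["total"] += 1
        if expected == actual:
            counts["correct"] += 1
        # Assumed definition: blocked, but should not have been.
        if actual == "block" and expected != "block":
            counts["false_positive"] += 1
        # Assumed definition: should have been blocked, but was not.
        if expected == "block" and actual != "block":
            counts["false_negative"] += 1
    keys = ("total", "correct", "false_positive", "false_negative")
    return {key: counts[key] for key in keys}
```

Under this definition, a safe return question that ends up blocked counts against `false_positive`, which is the case the last acceptance criterion rules out.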
Validation
When your implementation is ready, run validate_guardrails.py, then enter the completion code it prints.
Hints
Hint 1: Redaction is not the same as blocking
A user can provide an email address in an otherwise valid support request.
Hint 2: Scope matters
A request can be harmless but still off-topic for an OutdoorGear support agent.
Hint 3: Safety patterns can be simple
This challenge does not need ML classification. Deterministic phrase and regex checks are enough.
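A minimal sketch of deterministic phrase checks might look like the following. The two patterns are invented examples, not the phrases in scenarios.json, and the names `BLOCK_PATTERNS` and `first_block_reason` are hypothetical. Word boundaries and case-insensitive matching keep the checks tolerant of capitalization while reducing accidental hits inside longer words.

```python
import re
from typing import Optional

# Hypothetical phrase lists; derive the real signals from scenarios.json.
BLOCK_PATTERNS = {
    "jailbreak": [r"\bignore (?:all |any )?(?:previous|prior) instructions\b"],
    "off_topic": [r"\bwrite (?:a |an )?(?:python )?script\b"],
}

# Compile once at import time so every request is matched the same way.
COMPILED = {
    label: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for label, patterns in BLOCK_PATTERNS.items()
}


def first_block_reason(text: str) -> Optional[str]:
    """Return the first matching category label, or None if nothing matches."""
    for label, patterns in COMPILED.items():
        if any(pattern.search(text) for pattern in patterns):
            return label
    return None
```

Dictionaries preserve insertion order in Python 3.7 and later, so the category reported for a request that matches several patterns is stable from run to run.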
Rubric
| Area | Points | What good looks like |
|---|---|---|
| Classification | 35 | Correct allow/block/redact decisions |
| PII handling | 20 | Email redacted without losing request meaning |
| Safety scope | 20 | Harmful and off-topic requests blocked |
| Metrics | 15 | Red-team summary is accurate |
| Simplicity | 10 | Small deterministic guardrail code |