Challenge 007: Red-Team Agent Guardrails

Level: L300 | Type: Challenge | Time: ~60 min | Cost: Free (local)

Scenario

OutdoorGear is preparing a customer-facing support agent. Before launch, the safety team gives you a small red-team set with safe questions, PII exposure, jailbreak attempts, harmful requests, and off-topic requests.

Your job is to implement a local guardrail layer that blocks, redacts, or allows each request correctly.


Objective

Fix starter_guardrails.py so the guardrail layer classifies scenarios correctly, redacts email addresses, reports red-team metrics, and generates a validation code.

Your final guardrail layer should:

  • Allow normal OutdoorGear support questions
  • Redact email addresses before agent processing
  • Block jailbreak attempts
  • Block harmful physical-safety requests
  • Block off-topic automation requests
  • Avoid false positives on safe support questions

Starter Files

Save these files in one folder named challenge-007/:

| File | Purpose |
| --- | --- |
| scenarios.json | Red-team and safe scenarios |
| starter_guardrails.py | Broken guardrail layer |
| test_guardrails.py | Acceptance tests |
| validate_guardrails.py | Generates the final completion code |

Challenge Brief

You receive scenario fixtures and a broken guardrail implementation. There is no walkthrough: decide which signals should trigger block, redact, or allow, and make the red-team metrics match the expected behavior.


Constraints

  • Use only the Python standard library in starter_guardrails.py.
  • Do not hardcode behavior by scenario ID.
  • Do not block normal product/return questions.
  • Redaction should preserve the rest of the user request.
  • Blocks should be deterministic.
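
One way to satisfy the redaction constraint with only the standard library is a single regex substitution. The sketch below is illustrative, not the reference solution: the names `EMAIL_RE` and `redact_emails` and the `[REDACTED_EMAIL]` placeholder are assumptions, and the pattern is a deliberately simplified email matcher rather than a full address parser.

```python
import re

# Simplified email pattern (an assumption for illustration; it does not
# cover every address form that the email RFCs allow).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[REDACTED_EMAIL]") -> tuple[str, int]:
    """Replace every email address with a placeholder.

    Returns the redacted text and the number of replacements, so the
    caller can tell whether any PII was found. Everything outside the
    matched address is left untouched.
    """
    return EMAIL_RE.subn(placeholder, text)


redacted, count = redact_emails(
    "My order is late, reach me at jane.doe@example.com please."
)
print(redacted)
print(count)
```

Because only the matched span is replaced, the rest of the request reaches the agent unchanged, and the same input always produces the same output.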

Acceptance Criteria

Your solution is complete when:

  • python -m pytest test_guardrails.py passes
  • Safe product and return questions are allowed
  • Email addresses are redacted
  • Jailbreak, harmful, and off-topic requests are blocked
  • false_positive == 0
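
The `false_positive == 0` criterion can be checked with a small tally over expected and actual decisions. The following is a sketch under stated assumptions: the record fields `expected` and `actual`, the action names `allow`, `redact`, and `block`, and the function name `summarize` are all hypothetical, and the real schema in scenarios.json and test_guardrails.py may differ. Here a false positive is taken to mean a request that was blocked although it should not have been.

```python
from collections import Counter


def summarize(results: list[dict[str, str]]) -> dict[str, int]:
    """Tally red-team outcomes from records with 'expected' and 'actual' actions."""
    counts: Counter = Counter()
    for record in results:
        expected, actual = record["expected"], record["actual"]
        counts["total"] += 1
        if expected == actual:
            counts["correct"] += 1
        # Blocked although the scenario should have been allowed or redacted.
        if actual == "block" and expected != "block":
            counts["false_positive"] += 1
        # Let through although the scenario should have been blocked.
        if expected == "block" and actual != "block":
            counts["false_negative"] += 1
    keys = ("total", "correct", "false_positive", "false_negative")
    # A Counter returns 0 for missing keys, so every metric is always present.
    return {key: counts[key] for key in keys}


example = [
    {"expected": "allow", "actual": "allow"},
    {"expected": "redact", "actual": "redact"},
    {"expected": "block", "actual": "block"},
]
print(summarize(example))
```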

Validation

When your implementation is ready, run:

python -m pytest test_guardrails.py
python validate_guardrails.py

Enter the completion code printed by validate_guardrails.py.


Hints

Hint 1: Redaction is not the same as blocking

A user can provide an email address in an otherwise valid support request.

Hint 2: Scope matters

A request can be harmless but still off-topic for an OutdoorGear support agent.

Hint 3: Safety patterns can be simple

This challenge does not need ML classification. Deterministic phrase and regex checks are enough.
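
As a rough illustration of what deterministic phrase and regex checks might look like, the sketch below runs block checks first, then redaction, then falls through to allow. Every pattern, reason label, and function name here is a hypothetical example, not taken from scenarios.json or the starter file; the actual signals should come from reading the fixtures.

```python
import re

# Hypothetical signal patterns, for illustration only.
JAILBREAK_RE = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|reveal (your )?system prompt",
    re.IGNORECASE,
)
HARMFUL_RE = re.compile(r"\b(weapon|explosive|poison)\b", re.IGNORECASE)
OFF_TOPIC_RE = re.compile(
    r"\b(write|generate) (me )?(a |an )?(script|bot|scraper)\b", re.IGNORECASE
)
# Simplified email matcher (assumption, not a full address parser).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

BLOCK_CHECKS = (
    ("jailbreak", JAILBREAK_RE),
    ("harmful", HARMFUL_RE),
    ("off_topic", OFF_TOPIC_RE),
)


def classify(text: str) -> tuple[str, str]:
    """Return (action, reason), where action is 'block', 'redact', or 'allow'."""
    # Block checks run first so a blocked request is never merely redacted.
    for reason, pattern in BLOCK_CHECKS:
        if pattern.search(text):
            return "block", reason
    # An email address in an otherwise valid request triggers redaction only.
    if EMAIL_RE.search(text):
        return "redact", "pii_email"
    return "allow", "in_scope"


print(classify("What is your return policy for tents?"))
print(classify("Please email my receipt to sam@example.com"))
```

Broad keywords are a common source of false positives: in this sketch a legitimate question that merely mentions "poison" would be blocked, so patterns are worth tightening against the safe scenarios before relying on them.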


Rubric

| Area | Points | What good looks like |
| --- | --- | --- |
| Classification | 35 | Correct allow/block/redact decisions |
| PII handling | 20 | Email redacted without losing request meaning |
| Safety scope | 20 | Harmful and off-topic requests blocked |
| Metrics | 15 | Red-team summary is accurate |
| Simplicity | 10 | Small deterministic guardrail code |