
Challenge 002: Fix a Broken RAG Pipeline

Level: L200 | Type: Challenge | Time: ~75 min | 💰 Cost: Free (local)

Scenario

OutdoorGear has a local RAG prototype for support questions. It should retrieve the right policy or product guide, then answer using only retrieved context. The current prototype is broken: retrieval ranks poor chunks first, answers ignore evidence, and evaluation reports misleading metrics.

Your job is to fix the pipeline without using an LLM, vector database, or RAG framework.


Objective

Implement the missing or broken logic in starter_rag_pipeline.py so the RAG pipeline retrieves the right source documents, produces grounded answers, reports correct evaluation metrics, and generates a validation code.

Your final pipeline should:

  • Normalize query/document text for retrieval
  • Chunk documents while preserving source metadata
  • Rank chunks by relevance
  • Produce concise answers from retrieved evidence
  • Evaluate top-1 retrieval accuracy and required-term answer coverage
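The ranking step in the list above can be sketched as an IDF-weighted term-overlap scorer. This is only one possible approach, not the reference solution: the names `tokenize` and `rank_chunks`, the stop-word list, and the two sample chunks are all made up for illustration, and the real `documents.json` may be shaped differently.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; tune it against the real queries.
STOP_WORDS = {"a", "an", "the", "is", "are", "do", "does", "how", "what",
              "for", "of", "to", "in", "i", "my", "can"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOP_WORDS]


def rank_chunks(query, chunks):
    """Return chunks sorted by IDF-weighted term overlap, best first."""
    tokenized = [tokenize(c["title"] + " " + c["text"]) for c in chunks]
    n = len(chunks)
    # Document frequency: how many chunks contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scored = []
    for chunk, tokens in zip(chunks, tokenized):
        present = set(tokens)
        # Rare terms contribute more than terms found in every chunk.
        score = sum(math.log(1 + n / df[t]) for t in query_terms if t in present)
        scored.append((score, chunk))
    # Highest score first; ties broken by chunk_id so output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]["chunk_id"]))
    return [chunk for _, chunk in scored]


# Made-up sample data, not taken from documents.json.
chunks = [
    {"chunk_id": "returns-0", "doc_id": "returns", "title": "Return Policy",
     "text": "Items can be returned within 30 days with a receipt."},
    {"chunk_id": "tent-0", "doc_id": "tent", "title": "Tent Care Guide",
     "text": "Dry the tent fully before storage to prevent mildew."},
]
top = rank_chunks("How do I return an item?", chunks)[0]
```

Here `top` is the return-policy chunk because only its title shares the term `return` with the query. Note that `item` does not match `items`: the sketch does no stemming, which may or may not matter for the fixture queries.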

Starter Files

Save these files in one folder named challenge-002/:

| File | Purpose |
| --- | --- |
| documents.json | Mock OutdoorGear knowledge base |
| queries.json | Evaluation queries and expected evidence |
| starter_rag_pipeline.py | Broken RAG pipeline |
| test_rag_pipeline.py | Acceptance tests |
| validate_rag_pipeline.py | Generates the final completion code |

Challenge Brief

You receive a tiny knowledge base, a set of evaluation queries, and a broken local RAG pipeline. There is no walkthrough: decide how to chunk, score, retrieve, answer, and evaluate so the system behaves like a reliable grounded support assistant.


Constraints

  • Use only the Python standard library in starter_rag_pipeline.py.
  • Do not call an LLM API.
  • Do not use embeddings or a vector database.
  • Do not hardcode answers for individual query IDs.
  • Use retrieved evidence in answer_question().
  • Preserve the public function names used by the tests.

Acceptance Criteria

Your solution is complete when:

  • python -m pytest test_rag_pipeline.py passes
  • Chunk metadata preserves chunk_id, doc_id, title, and text
  • The top document for each fixture query is correct
  • Answers include the required evidence terms
  • Evaluation reports top1_accuracy == 1.0
  • Evaluation reports required_coverage == 1.0

Validation

When your implementation is ready, run:

python -m pytest test_rag_pipeline.py
python validate_rag_pipeline.py

Then enter the completion code printed by validate_rag_pipeline.py.


Hints

Hint 1: Retrieval quality starts with normalization

Punctuation, case, and common stop words can dominate a small lexical retriever if you do not normalize them.
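As a rough illustration of this hint, the sketch below lowercases text, strips punctuation, and removes a handful of stop words. The function name `normalize` and the stop-word list are assumptions made for this example; the starter file may use different names and a different list.

```python
import re

# Hypothetical stop-word list; extend or shrink it to fit the fixture queries.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "to", "for",
                        "and", "do", "i"})


def normalize(text):
    """Return lowercase alphanumeric tokens with stop words removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(normalize("What is the RETURN policy, and do I need a receipt?"))
# ['what', 'return', 'policy', 'need', 'receipt']
```

Applying the same function to both queries and documents keeps the two token vocabularies comparable, which is the point of the hint.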

Hint 2: Chunking is part of retrieval

A chunk should be small enough to score precisely but still carry enough source metadata to explain where the answer came from.
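One way to act on this hint is fixed-size word windows that copy the source metadata onto every chunk. The sketch below assumes each document is a dict with `doc_id`, `title`, and `text` keys, which may not match `documents.json`. The function name `chunk_document` and the `max_words` default are hypothetical. The output keys follow the acceptance criteria (`chunk_id`, `doc_id`, `title`, `text`).

```python
def chunk_document(doc, max_words=40):
    """Split one document into word-window chunks that keep source metadata."""
    words = doc["text"].split()
    chunks = []
    for index, start in enumerate(range(0, len(words), max_words)):
        chunks.append({
            # chunk_id combines the source document and the window position.
            "chunk_id": f"{doc['doc_id']}-{index}",
            "doc_id": doc["doc_id"],
            "title": doc["title"],
            "text": " ".join(words[start:start + max_words]),
        })
    return chunks


# Made-up document for illustration only.
sample = {"doc_id": "returns", "title": "Return Policy",
          "text": "Items can be returned within thirty days"}
pieces = chunk_document(sample, max_words=3)
```

With `max_words=3` the seven-word sample yields three chunks, the last holding the single leftover word. Splitting on sentence boundaries instead of fixed windows is a reasonable alternative if windows cut evidence in half.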

Hint 3: Answer from evidence, not from the query

If a required term is not present in the retrieved context, the answer should not invent it.
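A minimal extractive sketch of this idea: build the answer only from sentences that appear in the retrieved chunks, and fall back to a refusal when nothing overlaps with the query. The helper name `answer_from_evidence`, its parameters, and the fallback wording are assumptions, since the real signature of `answer_question()` is defined by the starter file and tests.

```python
import re


def terms(text):
    """Lowercase alphanumeric terms as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def answer_from_evidence(query, retrieved_chunks, max_sentences=2):
    """Return the retrieved sentences that best overlap with the query."""
    query_terms = terms(query)
    candidates = []
    for chunk in retrieved_chunks:
        # Split chunk text into sentences at ., !, or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", chunk["text"].strip()):
            overlap = len(query_terms & terms(sentence))
            if overlap:
                candidates.append((overlap, sentence))
    if not candidates:
        # Nothing in the evidence supports an answer, so do not invent one.
        return "Not enough evidence to answer."
    # sorted() is stable, so equally scored sentences keep retrieval order.
    best = sorted(candidates, key=lambda pair: -pair[0])[:max_sentences]
    return " ".join(sentence for _, sentence in best)


# Made-up chunk for illustration only.
evidence = [{"text": "Returns are accepted within 30 days. Shipping is free over $50."}]
answer = answer_from_evidence("How many days for returns?", evidence)
```

Every word of the answer is copied from a retrieved chunk, so a term that is missing from the evidence cannot appear in the output.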

Hint 4: Metrics need the right denominator

Top-1 accuracy and coverage are per-query metrics. Check what you are dividing by.
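To make the denominator concrete, the sketch below divides both metrics by the number of queries. The result-record fields (`top_doc_id`, `expected_doc_id`, `required_terms`, `answer`) are hypothetical, and scoring coverage as the per-query fraction of required terms found is one plausible reading; the acceptance tests define the actual contract.

```python
def evaluate(results):
    """Compute query-level top-1 accuracy and required-term coverage."""
    total = len(results)
    if total == 0:
        return {"top1_accuracy": 0.0, "required_coverage": 0.0}
    # A hit means the top-ranked document is the expected one.
    hits = sum(1 for r in results if r["top_doc_id"] == r["expected_doc_id"])
    coverage_sum = 0.0
    for r in results:
        required = r["required_terms"]
        answer = r["answer"].lower()
        found = sum(1 for term in required if term.lower() in answer)
        # Fraction of this query's required terms found in its answer.
        coverage_sum += found / len(required) if required else 1.0
    # Both metrics divide by the number of queries, not chunks or terms.
    return {"top1_accuracy": hits / total,
            "required_coverage": coverage_sum / total}


# Made-up results for illustration only.
report = evaluate([
    {"top_doc_id": "returns", "expected_doc_id": "returns",
     "required_terms": ["30 days"], "answer": "Returns within 30 days."},
    {"top_doc_id": "tent", "expected_doc_id": "boots",
     "required_terms": ["wax", "leather"], "answer": "Apply wax."},
])
# report == {"top1_accuracy": 0.5, "required_coverage": 0.75}
```

Dividing hits by the number of chunks or documents instead of the number of queries would deflate accuracy, which is the kind of misleading metric the scenario describes.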


Rubric

| Area | Points | What good looks like |
| --- | --- | --- |
| Retrieval | 35 | Correct top document for each query |
| Chunking | 20 | Metadata preserved and chunk sizes controlled |
| Grounded answers | 20 | Answers include evidence from retrieved chunks |
| Evaluation | 15 | Metrics reflect query-level performance |
| Simplicity | 10 | No framework or hardcoded query-specific answers |

Stretch Goals

  • Add reciprocal rank fusion over title and body scores
  • Return citations with chunk IDs in the answer
  • Add a "not enough evidence" answer when retrieval confidence is low
  • Add one new query to queries.json and update the validator payload locally
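For the first stretch goal, reciprocal rank fusion combines several ranked lists by summing 1 / (k + rank) for each item across the lists. The sketch below assumes you already have one ranking of chunk IDs from title scores and another from body scores. The function name, the sample IDs, and the commonly used constant k = 60 are choices made for this example, not requirements of the challenge.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs; each list is ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Items near the top of any list earn a larger share.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Made-up rankings for illustration only.
title_ranking = ["returns-0", "tent-0", "boots-0"]
body_ranking = ["tent-0", "boots-0", "returns-0"]
fused = reciprocal_rank_fusion([title_ranking, body_ranking])
# fused == ["tent-0", "returns-0", "boots-0"]
```

`tent-0` wins because it ranks first and second (1/61 + 1/62), ahead of `returns-0` at first and third (1/61 + 1/63). Because fusion uses only ranks, the title and body scorers do not need comparable score scales.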