Challenge 002: Fix a Broken RAG PipelineΒΆ
ScenarioΒΆ
OutdoorGear has a local RAG prototype for support questions. It should retrieve the right policy or product guide, then answer using only retrieved context. The current prototype is broken: retrieval ranks poor chunks first, answers ignore evidence, and evaluation reports misleading metrics.
Your job is to fix the pipeline without using an LLM, vector database, or RAG framework.
ObjectiveΒΆ
Implement the missing or broken logic in starter_rag_pipeline.py so the RAG pipeline retrieves the right source documents, produces grounded answers, reports correct evaluation metrics, and generates a validation code.
Your final pipeline should:
- Normalize query/document text for retrieval
- Chunk documents while preserving source metadata
- Rank chunks by relevance
- Produce concise answers from retrieved evidence
- Evaluate top-1 retrieval accuracy and required-term answer coverage
Starter FilesΒΆ
Save these files in one folder named challenge-002/:
| File | Purpose | Download |
|---|---|---|
documents.json |
Mock OutdoorGear knowledge base | Download |
queries.json |
Evaluation queries and expected evidence | Download |
starter_rag_pipeline.py |
Broken RAG pipeline | Download |
test_rag_pipeline.py |
Acceptance tests | Download |
validate_rag_pipeline.py |
Generates the final completion code | Download |
Challenge BriefΒΆ
You receive a tiny knowledge base, a set of evaluation queries, and a broken local RAG pipeline. There is no walkthrough: decide how to chunk, score, retrieve, answer, and evaluate so the system behaves like a reliable grounded support assistant.
ConstraintsΒΆ
- Use only the Python standard library in
starter_rag_pipeline.py. - Do not call an LLM API.
- Do not use embeddings or a vector database.
- Do not hardcode answers for individual query IDs.
- Use retrieved evidence in
answer_question(). - Preserve the public function names used by the tests.
Acceptance CriteriaΒΆ
Your solution is complete when:
python -m pytest test_rag_pipeline.pypasses- Chunk metadata preserves
chunk_id,doc_id,title, andtext - The top document for each fixture query is correct
- Answers include the required evidence terms
- Evaluation reports
top1_accuracy == 1.0 - Evaluation reports
required_coverage == 1.0
ValidationΒΆ
When your implementation is ready, run:
Enter the completion code printed by validate_rag_pipeline.py:
HintsΒΆ
Hint 1 β Retrieval quality starts with normalization
Punctuation, case, and common stop words can dominate a small lexical retriever if you do not normalize them.
Hint 2 β Chunking is part of retrieval
A chunk should be small enough to score precisely but still carry enough source metadata to explain where the answer came from.
Hint 3 β Answer from evidence, not from the query
If a required term is not present in the retrieved context, the answer should not invent it.
Hint 4 β Metrics need the right denominator
Top-1 accuracy and coverage are per-query metrics. Check what you are dividing by.
RubricΒΆ
| Area | Points | What good looks like |
|---|---|---|
| Retrieval | 35 | Correct top document for each query |
| Chunking | 20 | Metadata preserved and chunk sizes controlled |
| Grounded answers | 20 | Answers include evidence from retrieved chunks |
| Evaluation | 15 | Metrics reflect query-level performance |
| Simplicity | 10 | No framework or hardcoded query-specific answers |
Stretch GoalsΒΆ
- Add reciprocal rank fusion over title and body scores
- Return citations with chunk IDs in the answer
- Add a "not enough evidence" answer when retrieval confidence is low
- Add one new query to
queries.jsonand update the validator payload locally