Swiss Legal RAG: EN Question → Swiss Law Citation Retrieval
Overview
Given an English legal question, retrieve the Swiss federal law citations (in German) that a Federal Supreme Court (Bundesgericht) decision would cite. Predictions are scored by macro-F1 over 40 queries, so every false positive lowers that query's precision. The structural difficulty: ~40% of the gold citations are court decisions that dense search over 2M paragraphs cannot rank, and much of the remaining law-article gold is procedural boilerplate with no topical overlap with the question. The score is therefore won by recovering what retrieval systematically misses, not by better retrieval alone.
Result: 0.229 public / 0.235 private macro-F1, top 5% (29 / 586) on the Kaggle leaderboard, from a retrieval + reranking stack augmented with rule- and agent-based recovery of missed citations.
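To make the metric concrete, here is a minimal sketch of per-query F1 and its macro average. It assumes citations are compared as exact strings and that a query with no true positives scores 0; the names `query_f1` and `macro_f1` are illustrative and the competition's own scorer may handle edge cases differently.

```python
def query_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 for one query, comparing citation strings by exact match."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # assumption: no overlap (or an empty set) scores zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions: list[set[str]], golds: list[set[str]]) -> float:
    """Unweighted mean of the per-query F1 scores."""
    scores = [query_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


# Toy data: one correct citation, then the same prediction padded with
# three wrong ones. Recall is unchanged, precision drops from 1.0 to 0.25.
gold = {"Art. 100 Abs. 1 BGG", "Art. 29 Abs. 2 BV"}
tight = {"Art. 100 Abs. 1 BGG"}
padded = tight | {"WRONG_1", "WRONG_2", "WRONG_3"}

print(round(query_f1(tight, gold), 3))   # 0.667
print(round(query_f1(padded, gold), 3))  # 0.333
```

The padded prediction halves the query's F1 without losing a single correct citation, which is why the later stages add candidates only when they are likely to be gold.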
Pipeline
Five stages: DeepSeek query analysis → code-filtered cross-lingual retrieval → LLM reranking → rule-augmented prediction → an agentic verifier that grounds co-citation candidates in source court text.
┌──────────────────────────────────────────────┐
│   English legal question (40 test queries)   │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ① Query analysis · DeepSeek                  │
│   reasoner → chat (tool-calling)             │
│   German search fields + law codes           │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ② Cross-lingual retrieval · BGE-M3           │
│   bilingual fine-tune (CachedMNR + HN mining)│
│   code-filter: 170k articles → few thousand  │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ③ LLM reranking · Qwen2.5-32B-AWQ (vLLM)     │
│   bilingual 0-9 citation-suitability prompt  │
│   digit logprobs → E[d], DE+EN avg, top-500  │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ④ Rule-augmented prediction                  │
│   always-add procedural articles             │
│   + IDF-weighted court co-citation graph     │
│     over 547k BGer decisions                 │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ⑤ Agentic verification · LangGraph           │
│   parallel agents read BGer decision text    │
│   accept / reject / correct candidates       │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│   Predicted Swiss federal law citations      │
│   Macro-F1: 0.229 public / 0.235 private     │
└──────────────────────────────────────────────┘
Stage 1: DeepSeek Query Analysis
- Two-step reason → structure: `deepseek-reasoner` maps the full legal landscape of the case; `deepseek-chat` (tool-calling) structures it into German search fields and a flat set of relevant law codes.
- Code filter (near-lossless): restricting retrieval to the predicted law codes keeps 97.3% of gold law citations while shrinking the search space from 170k articles to a few thousand. This is what makes recall viable at all.
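The code filter above can be sketched as a simple membership test on each article's law code. This sketch assumes citations look like `Art. 100 Abs. 1 BGG`, with the law code as the last token; the helper names and the regular expression are hypothetical, and the real corpus may need a more careful parser.

```python
import re
from typing import Optional

# Assumed citation shape: "Art. <number...> <CODE>", e.g. "Art. 29 Abs. 2 BV".
_CODE_AT_END = re.compile(r"^Art\.\s.*\s([A-Za-zÄÖÜäöü]+)$")


def law_code(citation: str) -> Optional[str]:
    """Return the trailing law code, or None if the citation does not match."""
    match = _CODE_AT_END.match(citation.strip())
    return match.group(1) if match else None


def filter_by_codes(articles: list[str], predicted_codes: set[str]) -> list[str]:
    """Keep only articles whose law code was predicted by query analysis."""
    return [a for a in articles if law_code(a) in predicted_codes]


def gold_kept_fraction(gold: set[str], predicted_codes: set[str]) -> float:
    """Share of gold citations that survive the filter (the 'lossless' check)."""
    if not gold:
        return 1.0
    kept = [g for g in gold if law_code(g) in predicted_codes]
    return len(kept) / len(gold)


# Toy corpus, for illustration only.
corpus = ["Art. 100 Abs. 1 BGG", "Art. 29 Abs. 2 BV", "Art. 41 OR", "Art. 8 ZGB"]
print(filter_by_codes(corpus, {"BGG", "ZGB"}))
```

Measuring `gold_kept_fraction` on held-out queries is one way to check that a filter of this kind stays near-lossless before relying on it at test time.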
Stage 2: Cross-lingual BGE-M3 Retrieval
- Bilingual fine-tune: CachedMNR with iterative hard-negative mining on German and DeepSeek-translated English training queries; R@1000 0.47 → 0.51 vs DE-only (clean holdout), ~2× at the top (R@10 0.036 → 0.071).
- English-raw beats the rewrite: on the bilingual model the original English query retrieves better than the machine German rewrite (R@1000 0.51 vs 0.36), because the EN→DE translation step is lossy.
- Indexing (FAISS): BGE-M3 embeddings are stored in a FAISS index for top-1000 kNN search; at few-thousand-article scale, FAISS was chosen for its batched-search API, not raw speed.
The contrast in one line: DeepSeek translation helps as training augmentation (paired DE-EN queries → bilingual encoder, ~2× R@10), but hurts as test-time query rewriting (EN→DE rewrite is lossy, R@1000 0.51 → 0.36). Once the encoder is bilingual, keep the query in its native language.
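The R@10 and R@1000 figures in this stage are recall-at-k values. A minimal sketch of that metric follows, assuming recall is computed per query and then averaged; the README does not state the averaging scheme, so treat `mean_recall_at_k` as one plausible reading rather than the exact evaluation code.

```python
def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold citations found in the first k ranked candidates."""
    if not gold:
        return 0.0  # assumption: a query without gold contributes zero
    return len(set(ranked[:k]) & gold) / len(gold)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (ranking, gold) pairs, one pair per query."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)


# Toy ranking: two of the three gold items are retrieved, one is never found.
ranking = ["a", "b", "c", "d"]
gold = {"b", "d", "z"}
print(round(recall_at_k(ranking, gold, 2), 3))  # 0.333
print(round(recall_at_k(ranking, gold, 4), 3))  # 0.667
```

Comparing the same query set under the English original and the German rewrite with a function like this is how a gap such as 0.51 vs 0.36 at k=1000 would show up.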
Stage 3: Qwen2.5-32B Reranking
- Bilingual citation-suitability prompt: for each (query, candidate), Qwen rates citation likelihood on a 0-9 scale: "would the Bundesgericht cite this article in a decision on this legal question?" The same article is scored under separate German and English prompts and the two scores are averaged, which lifts Stage 2's "translation as augmentation" idea to the reranker.
- Single-token expected value, not greedy decoding: `max_tokens=1, temperature=0, logprobs=20`. The first-token logprobs over digit tokens 0-9 are softmaxed into $p(d)$ and combined into an expected value $E[d] = \sum_{d=0}^{9} d \cdot p(d)$. One forward pass per candidate yields a continuous score, so borderline cases like "5 vs 6" stay informative instead of collapsing onto a single greedy digit.
- vLLM + AWQ serving: Qwen2.5-32B with AWQ quantization on vLLM; reranking the top-500 candidates per query is fast enough to run end-to-end on a single consumer GPU.
- Tie-break (implicit): two candidates with identical expected-value scores fall back to their bi-encoder rank. This is a side effect of Python's stable sort, not an explicit fusion; it never drives the main ordering, but it gives a deterministic floor under exact ties (rare under continuous-valued scoring anyway).
- Biggest single recall lift in the stack: R@10 0.07 → 0.15, R@100 0.25 → 0.37 on validation. The LLM-as-judge separates relevance far better than the bi-encoder alone.
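The expected-value scoring and the implicit tie-break from the bullets above can be sketched in a few lines. The input shape here, a plain dict from token string to logprob, is an assumption for illustration; extracting it from the serving library's response is left out, and the function names are hypothetical.

```python
import math

DIGITS = set("0123456789")


def expected_digit(top_logprobs: dict[str, float]) -> float:
    """E[d] from first-token logprobs, renormalised over the digits 0-9."""
    mass: dict[int, float] = {}
    for token, logprob in top_logprobs.items():
        tok = token.strip()
        if tok in DIGITS:  # ignore non-digit tokens such as "Yes"
            mass[int(tok)] = mass.get(int(tok), 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        return 0.0  # assumption: no digit among the returned tokens -> lowest score
    # exp(logprob) / total is a softmax restricted to the digit tokens.
    return sum(d * p for d, p in mass.items()) / total


def rerank(candidates: list[str], de: dict[str, float], en: dict[str, float]) -> list[str]:
    """Sort by the DE/EN average; `candidates` must arrive in bi-encoder order.

    sorted() is stable even with reverse=True, so exact ties keep that order.
    """
    average = {c: 0.5 * (de[c] + en[c]) for c in candidates}
    return sorted(candidates, key=lambda c: average[c], reverse=True)


# A borderline "5 vs 6" case keeps its in-between value.
logprobs = {"5": math.log(0.5), "6": math.log(0.25), "Yes": math.log(0.25)}
print(round(expected_digit(logprobs), 3))  # 5.333
```

With greedy decoding the same candidate would simply score 5; the expected value preserves the information that a quarter of the mass sat on 6.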
Stage 4: Rule-Augmented Prediction
- False-negative analysis showed the misses are systematic: procedural boilerplate (e.g. `Art. 100 BGG`, the appeal deadline, which is gold in 9/10 val queries yet never in the reranked top-500) and fixed co-occurring article clusters.
- Always-add + scenario-triggered expert lists inject the procedural articles retrieval can never surface (zero topical overlap with the question): an unconditional `{Art. 100 Abs. 1 BGG, Art. 29 Abs. 2 BV}` set added to every query, plus a hand-curated `STPO_CLUSTER` (costs + jurisdiction articles) injected only when the query is a criminal-procedure case.
- Court co-citation (+0.03 F1, 0.20 → 0.23): an IDF-weighted graph over 547k BGer decisions; from the query's top retrieved articles, find the most similar decisions and add the article they most co-cite, e.g. `ZGB 204 Abs. 2` (the valuation-date rule), cited by 41 such decisions, which the retriever had ranked too low.
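One way to read the co-citation step is sketched below: weight each article by inverse document frequency over the decisions, score decisions by the IDF mass they share with the query's top retrieved articles, then count what the nearest decisions cite beyond those articles. The exact similarity and weighting used in this project are not given in the README, so the functions, parameters and placeholder data here are assumptions.

```python
import math
from collections import Counter


def idf_weights(decisions: dict[str, set[str]]) -> dict[str, float]:
    """IDF per article: rarely cited articles say more about a decision."""
    n = len(decisions)
    doc_freq = Counter(a for cited in decisions.values() for a in cited)
    return {article: math.log(n / count) for article, count in doc_freq.items()}


def cocitation_candidates(
    seeds: set[str],
    decisions: dict[str, set[str]],
    idf: dict[str, float],
    top_decisions: int = 50,
    top_articles: int = 5,
) -> list[tuple[str, int]]:
    """Articles most often co-cited by the decisions closest to the seeds."""
    similarity = {}
    for decision_id, cited in decisions.items():
        shared = sum(idf.get(a, 0.0) for a in seeds & cited)
        if shared > 0:
            similarity[decision_id] = shared
    nearest = sorted(similarity, key=similarity.get, reverse=True)[:top_decisions]
    votes = Counter()
    for decision_id in nearest:
        votes.update(decisions[decision_id] - seeds)  # only new articles
    return votes.most_common(top_articles)


# Placeholder data: ART_B is co-cited with the seed ART_A in both decisions.
toy = {
    "dec1": {"ART_A", "ART_B", "ART_C"},
    "dec2": {"ART_A", "ART_B"},
    "dec3": {"ART_C", "ART_D"},
    "dec4": {"ART_D"},
}
print(cocitation_candidates({"ART_A"}, toy, idf_weights(toy)))
```

The vote count plays the role of the "cited by 41 such decisions" figure in the example above: it is raw frequency, which is why a verification step is still useful afterwards.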
Stage 5: Agentic Verification (LangGraph)
- Map-reduce graph: co-citation proposes candidates → parallel agents read the actual Bundesgericht decision text and decide accept / reject / correct → validate against the law dataset.
- Turns raw co-citation frequency into precision: of ~29 proposed additions, agents kept ~22, rejected 4 as co-citation noise, and corrected 1 sub-paragraph.
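The map-reduce shape of this stage can be illustrated without LangGraph. The sketch below swaps in a standard-library thread pool for the graph runtime and a caller-supplied `verify_one` function for the LLM agent, so it shows only the control flow (fan out, collect verdicts, validate), not the project's actual graph; `Verdict`, `verify_all` and the stub are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    candidate: str
    action: str  # "accept", "reject" or "correct"
    corrected: Optional[str] = None  # replacement citation when action == "correct"


def verify_all(
    candidates: list[str],
    verify_one: Callable[[str], Verdict],
    known_citations: set[str],
    max_workers: int = 8,
) -> list[str]:
    """Fan candidates out to parallel verifiers, then merge the verdicts."""
    # Map: one verifier call per candidate; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(verify_one, candidates))
    # Reduce: keep accepted or corrected citations that exist in the law dataset.
    kept: list[str] = []
    for verdict in verdicts:
        if verdict.action == "accept":
            final = verdict.candidate
        elif verdict.action == "correct" and verdict.corrected:
            final = verdict.corrected
        else:
            continue
        if final in known_citations and final not in kept:
            kept.append(final)
    return kept


def stub_verifier(candidate: str) -> Verdict:
    """Stand-in for an agent that would read the decision text."""
    table = {
        "ART_X Abs. 1": Verdict("ART_X Abs. 1", "accept"),
        "ART_Y Abs. 3": Verdict("ART_Y Abs. 3", "correct", "ART_Y Abs. 2"),
        "ART_Z": Verdict("ART_Z", "reject"),
    }
    return table[candidate]


print(verify_all(list(("ART_X Abs. 1", "ART_Y Abs. 3", "ART_Z")),
                 stub_verifier, {"ART_X Abs. 1", "ART_Y Abs. 2"}))
```

The final membership check mirrors the "validate against the law dataset" step: a corrected sub-paragraph is only kept if it is a real citation in the corpus.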