⚖️ Top 5% (29 / 586) — Kaggle

Swiss Legal RAG: EN Question → Swiss Law Citation Retrieval

📅 2026 📈 Macro-F1 0.235 (private) / 0.229 (public) 🏆 Kaggle: LLM Agentic Legal IR 📂 GitHub Repository

Overview

Given an English legal question, retrieve the Swiss federal law citations (in German) that a Federal Supreme Court (Bundesgericht) decision would cite. Submissions are scored by macro-F1 over 40 queries, so every false positive drags down that query's precision. The structural difficulty: ~40% of the gold citations are court decisions that dense search over 2M paragraphs cannot rank, and much of the remaining law-article gold is procedural boilerplate with no topical overlap with the question. The score is therefore won by recovering what retrieval systematically misses, not by better retrieval alone.
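To make the precision pressure concrete, here is a minimal sketch of a per-query F1 averaged over queries. The function names and the handling of empty sets are assumptions for illustration; the competition's exact scoring code is not reproduced here.

```python
def query_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 for a single query over sets of citation strings."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or empty gold (assumed convention)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions: dict[str, set[str]], gold: dict[str, set[str]]) -> float:
    """Unweighted mean of per-query F1; a query with no prediction scores 0."""
    scores = [query_f1(predictions.get(qid, set()), g) for qid, g in gold.items()]
    return sum(scores) / len(scores)
```

With two gold citations, predicting exactly those two gives F1 = 1.0, while predicting them plus two wrong ones gives precision 0.5, recall 1.0 and F1 ≈ 0.667. This is why padding the prediction list is costly.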

🏆 Result: 0.229 public / 0.235 private macro-F1 — top 5% (29 / 586) on the Kaggle leaderboard, from a retrieval + reranking stack augmented with rule- and agent-based recovery of missed citations.

Pipeline

Five stages: DeepSeek query analysis → code-filtered cross-lingual retrieval → LLM reranking → rule-augmented prediction → an agentic verifier that grounds co-citation candidates in source court text.

   ┌──────────────────────────────────────────────┐
   │   English legal question  (40 test queries)  │
   └──────────────────────┬───────────────────────┘
                          ▼
   ┌──────────────────────────────────────────────┐
   │ ①  Query analysis · DeepSeek                  │
   │    reasoner → -chat (tool-calling)            │
   │    German search fields + law codes           │
   └──────────────────────┬───────────────────────┘
                          ▼
   ┌──────────────────────────────────────────────┐
   │ ②  Cross-lingual retrieval · BGE-M3           │
   │    bilingual fine-tune (CachedMNR + HN mining)│
   │    code-filter: 170k articles → few thousand  │
   └──────────────────────┬───────────────────────┘
                          ▼
   ┌──────────────────────────────────────────────┐
   │ ③  LLM reranking · Qwen2.5-32B-AWQ (vLLM)     │
   │    bilingual 0–9 citation-suitability prompt  │
   │    digit logprobs → E[d], DE+EN avg, top-500 │
   └──────────────────────┬───────────────────────┘
                          ▼
   ┌──────────────────────────────────────────────┐
   │ ④  Rule-augmented prediction                  │
   │    always-add procedural articles             │
   │    + IDF-weighted court co-citation graph     │
   │    over 547k BGer decisions                   │
   └──────────────────────┬───────────────────────┘
                          ▼
   ┌──────────────────────────────────────────────┐
   │ ⑤  Agentic verification · LangGraph           │
   │    parallel agents read BGer decision text    │
   │    accept / reject / correct candidates       │
   └──────────────────────┬───────────────────────┘
                          ▼
   ┌──────────────────────────────────────────────┐
   │   Predicted Swiss federal law citations       │
   │   Macro-F1: 0.229 public  /  0.235 private    │
   └──────────────────────────────────────────────┘

Stage 1 — DeepSeek Query Analysis

Two-step reason → structure: deepseek-reasoner maps the full legal landscape of the case; deepseek-chat (tool-calling) structures it into German search fields and a flat set of relevant law codes.
Code filter (near-lossless): restricting retrieval to the predicted law codes keeps 97.3% of gold law-citations while shrinking the search space from 170k articles to a few thousand — this is what makes recall viable at all.
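The code filter can be sketched as a simple membership test on each article's law code, with a helper that measures how much gold survives the filter. The `Article` record, its field names, and both function names are hypothetical; the project's actual data model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    citation: str  # full citation string, e.g. "Art. 100 Abs. 1 BGG"
    law_code: str  # law code abbreviation, e.g. "BGG"
    text: str


def filter_by_law_codes(articles: list[Article], predicted_codes: set[str]) -> list[Article]:
    """Keep only the articles whose law code was predicted by query analysis."""
    codes = {c.strip().upper() for c in predicted_codes}
    return [a for a in articles if a.law_code.upper() in codes]


def gold_retention(gold_citations: set[str], kept: list[Article]) -> float:
    """Fraction of gold citations still reachable after filtering."""
    if not gold_citations:
        return 1.0
    kept_citations = {a.citation for a in kept}
    return len(gold_citations & kept_citations) / len(gold_citations)
```

Averaging `gold_retention` over validation queries is one way a figure like the reported 97.3% could be measured; the retrieval stage then only has to search the filtered subset.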

Stage 2 — Cross-lingual BGE-M3 Retrieval

Bilingual fine-tune: CachedMNR with iterative hard-negative mining on German and DeepSeek-translated English training queries. Against the DE-only model on a clean holdout, R@1000 rises from 0.47 to 0.51 and R@10 roughly doubles (0.036 → 0.071).
English-raw beats the rewrite: on the bilingual model the original English query retrieves better than the machine German rewrite (R@1000 0.51 vs 0.36) — the EN→DE translation step is lossy.
Indexing (FAISS): BGE-M3 embeddings are stored in a FAISS index for top-1000 kNN search; at a scale of a few thousand articles, FAISS was chosen for its batched-search API, not for raw speed.

The contrast in one line: DeepSeek translation helps as training augmentation (paired DE↔EN queries → bilingual encoder, ~2× R@10), but hurts as test-time query rewriting (EN→DE rewrite is lossy, R@1000 0.51 → 0.36). Once the encoder is bilingual, keep the query in its native language.
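The R@k figures quoted above can be read as the share of a query's gold citations found in the top k retrieved candidates. The sketch below assumes a per-query mean, which the source does not state explicitly; the function names are illustrative.

```python
def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Share of gold citations that appear in the top-k ranked candidates."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)


def mean_recall_at_k(runs: dict[str, list[str]], gold: dict[str, set[str]], k: int) -> float:
    """Average recall@k over queries (assumed aggregation)."""
    scores = [recall_at_k(runs.get(qid, []), g, k) for qid, g in gold.items()]
    return sum(scores) / len(scores)
```

Comparing the same encoder on the raw English query and on its German rewrite with this function is the kind of check behind the 0.51 vs 0.36 contrast.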

Stage 3 — Qwen2.5-32B Reranking

Bilingual citation-suitability prompt: for each (query, candidate), Qwen rates citation likelihood on a 0–9 scale — "would the Bundesgericht cite this article in a decision on this legal question?" The same article is scored under separate German and English prompts and the two scores are averaged — Stage 2's "translation as augmentation" idea lifted to the reranker.
Single-token expected value, not greedy decoding: max_tokens=1, temperature=0, logprobs=20. The first-token logprobs over digit tokens 0–9 are softmaxed into p(d) and combined into an expected value E[d] = Σ_d d · p(d). One forward pass per candidate yields a continuous score — borderline cases like "5 vs 6" stay informative instead of collapsing onto a single greedy digit.
vLLM + AWQ serving: Qwen2.5-32B with AWQ quantization on vLLM; reranking the top-500 candidates per query is fast enough to run end-to-end on a single consumer GPU.
Tie-break (implicit): two candidates with identical expected-value scores fall back to their bi-encoder rank, a side effect of Python's stable sort rather than an explicit fusion. The bi-encoder rank never drives the main ordering; it only guarantees a deterministic order under exact ties, which are rare with continuous-valued scores.
Biggest single recall lift in the stack: R@10 0.07 → 0.15, R@100 0.25 → 0.37 on validation — the LLM-as-judge separates relevance far better than the bi-encoder alone.
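The expected-value scoring and the implicit tie-break can be sketched in a few lines. The input is assumed to be a token → logprob mapping for the first generated token (the shape one would extract from the server's top-logprobs output); the serving call itself is not shown, and the function names are illustrative.

```python
import math


def expected_digit(top_logprobs: dict[str, float]) -> float:
    """E[d] = sum_d d * p(d), with p renormalized over the digit tokens 0-9."""
    mass = [0.0] * 10
    for token, logprob in top_logprobs.items():
        t = token.strip()  # merge variants such as " 5" and "5"
        if len(t) == 1 and t in "0123456789":
            mass[int(t)] += math.exp(logprob)
    total = sum(mass)
    if total == 0.0:
        return 0.0  # no digit among the returned tokens (assumed fallback)
    return sum(d * m for d, m in enumerate(mass)) / total


def rerank(candidates: list[str], de_scores: dict[str, float],
           en_scores: dict[str, float]) -> list[str]:
    """Order candidates by the mean of the German and English scores.

    `candidates` must arrive in bi-encoder order: sorted() is stable, so
    exact ties keep that order.
    """
    avg = {c: 0.5 * (de_scores[c] + en_scores[c]) for c in candidates}
    return sorted(candidates, key=lambda c: avg[c], reverse=True)
```

A candidate whose probability mass is split evenly between "5" and "6" scores 5.5 here, whereas greedy decoding would have to pick one of the two digits.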

Stage 4 — Rule-Augmented Prediction

False-negative analysis showed the misses are systematic: procedural boilerplate (e.g. Art. 100 BGG, the appeal deadline — gold in 9/10 val queries yet never in the reranked top-500) and fixed co-occurring article clusters.
Always-add + scenario-triggered expert lists inject the procedural articles retrieval can never surface (zero topical overlap with the question) — an unconditional {Art. 100 Abs. 1 BGG, Art. 29 Abs. 2 BV} set added to every query, plus a hand-curated STPO_CLUSTER (costs + jurisdiction articles) injected only when the query is a criminal-procedure case.
Court co-citation (+0.03 F1, 0.20 → 0.23): an IDF-weighted graph over 547k BGer decisions. From the query's top retrieved articles, find the most similar decisions and add the articles they most often co-cite, e.g. Art. 204 Abs. 2 ZGB (the valuation-date rule), cited by 41 such decisions, which the retriever had ranked too low.
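One way to read the two rule mechanisms above is sketched below. The exact weighting and thresholds are not given in the source, so this assumes that decision similarity is the summed IDF of the articles shared with the query's seed articles, and that candidates are ranked by how many of the nearest decisions cite them. All function names, `top_decisions`, and `min_support` are hypothetical, and the contents of the criminal-procedure cluster are passed in rather than guessed.

```python
import math
from collections import Counter

# The unconditional set named in the text.
ALWAYS_ADD = {"Art. 100 Abs. 1 BGG", "Art. 29 Abs. 2 BV"}


def apply_rules(predicted: set[str], is_criminal_procedure: bool,
                stpo_cluster: set[str]) -> set[str]:
    """Add the always-on articles, plus the cluster for criminal-procedure queries."""
    out = set(predicted) | ALWAYS_ADD
    if is_criminal_procedure:
        out |= stpo_cluster
    return out


def idf_weights(decisions: dict[str, set[str]]) -> dict[str, float]:
    """IDF per article: rarely cited articles say more about a decision."""
    n = len(decisions)
    df = Counter(a for cited in decisions.values() for a in cited)
    return {a: math.log(n / count) for a, count in df.items()}


def cocitation_candidates(seeds: set[str], decisions: dict[str, set[str]],
                          idf: dict[str, float], top_decisions: int = 50,
                          min_support: int = 2) -> list[tuple[str, int]]:
    """Articles most often cited by the decisions closest to the seed articles."""
    sims = {}
    for decision_id, cited in decisions.items():
        shared = seeds & cited
        if shared:
            sims[decision_id] = sum(idf[a] for a in shared)
    nearest = sorted(sims, key=sims.get, reverse=True)[:top_decisions]
    support = Counter(a for d in nearest for a in decisions[d] - seeds)
    return [(a, n) for a, n in support.most_common() if n >= min_support]
```

The returned support count plays the role of the "cited by 41 such decisions" figure: it is evidence of co-occurrence, not of relevance, which is why a verification step follows.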

Stage 5 — Agentic Verification (LangGraph)

Map-reduce graph: co-citation proposes candidates → parallel agents read the actual Bundesgericht decision text and decide accept / reject / correct → validate against the law dataset.
Turns raw co-citation frequency into precision: of ~29 proposed additions, agents kept ~22, rejected 4 as co-citation noise, and corrected 1 sub-paragraph.
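The map-reduce shape of this stage can be illustrated without LangGraph. The sketch below swaps in a plain `ThreadPoolExecutor` for the graph runtime and a rule-based stub for the LLM agent, so it shows only the control flow (fan out, collect verdicts, validate against the law dataset). `Verdict`, `verify_candidates`, and `toy_verifier` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    action: str              # "accept", "reject" or "correct"
    citation: Optional[str]  # final citation for accept/correct, None for reject


def verify_candidates(candidates: list[str],
                      verify_one: Callable[[str], Verdict],
                      known_citations: set[str],
                      max_workers: int = 8) -> list[str]:
    """Map: verify each candidate in parallel. Reduce: keep validated citations."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(verify_one, candidates))  # preserves input order
    kept = []
    for verdict in verdicts:
        if verdict.action in ("accept", "correct") and verdict.citation in known_citations:
            kept.append(verdict.citation)
    return list(dict.fromkeys(kept))  # de-duplicate, keep first occurrence


def toy_verifier(candidate: str) -> Verdict:
    """Stand-in for an agent that reads decision text; these rules are made up."""
    if candidate.endswith("?"):
        return Verdict("reject", None)
    if candidate.endswith("*"):
        return Verdict("correct", candidate.rstrip("*"))
    return Verdict("accept", candidate)
```

In this toy setup a corrected candidate only survives if the corrected citation exists in the known set, mirroring the final validation against the law dataset.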

Tech Stack

Python · PyTorch · BGE-M3 · CachedMNR · FAISS · Qwen2.5-32B · vLLM · DeepSeek · LangGraph · Co-citation Graph · Information Retrieval
