⚖️ Top 5% (29 / 586) on Kaggle

Swiss Legal RAG: EN Question → Swiss Law Citation Retrieval

📅 2026 · 📈 Macro-F1 0.235 (private) / 0.229 (public) · 🏆 Kaggle: LLM Agentic Legal IR · 📂 GitHub Repository

Overview

Given an English legal question, retrieve the Swiss federal law citations (in German) that a Federal Supreme Court (Bundesgericht) decision would cite. Submissions are scored by macro-F1 over 40 queries, so every false positive lowers that query's precision. The structural difficulty: ~40% of the gold citations are court decisions that dense search over 2M paragraphs cannot rank, and much of the remaining law-article gold is procedural boilerplate with no topical overlap with the question. The score is therefore won by recovering what retrieval systematically misses, not by better retrieval alone.
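To make the precision penalty concrete, here is a minimal sketch of per-query F1 averaged over queries. It assumes citations are compared as exact-match strings and that a query with no correct prediction scores 0; the competition's exact matching rules are not stated here, and the function names and sample citations are illustrative only.

```python
from typing import Iterable, List


def query_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """F1 for one query, treating citations as exact-match string sets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions: List[List[str]], golds: List[List[str]]) -> float:
    """Unweighted mean of per-query F1 (every query counts equally)."""
    scores = [query_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


# Illustrative labels only, not taken from the competition data.
gold = ["Art. 41 OR", "Art. 8 ZGB", "BGE_A"]
tight = ["Art. 41 OR", "Art. 8 ZGB"]
# Same two hits, plus four wrong guesses.
padded = tight + ["Art. 1 OR", "Art. 2 OR", "Art. 3 OR", "Art. 4 OR"]

print(round(query_f1(tight, gold), 3))
print(round(query_f1(padded, gold), 3))
```

With identical recall, the four extra false positives cut this query's F1 from about 0.80 to about 0.44, which is why padding predictions is costly under this metric.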

🏆 Result: 0.229 public / 0.235 private macro-F1, top 5% (29 / 586) on the Kaggle leaderboard, from a retrieval + reranking stack augmented with rule- and agent-based recovery of missed citations.

Pipeline

Five stages: DeepSeek query analysis → code-filtered cross-lingual retrieval → LLM reranking → rule-augmented prediction → an agentic verifier that grounds co-citation candidates in source court text.

   ┌───────────────────────────────────────────────┐
   │   English legal question  (40 test queries)   │
   └──────────────────────┬────────────────────────┘
                          ▼
   ┌───────────────────────────────────────────────┐
   │ ①  Query analysis · DeepSeek                  │
   │    reasoner → -chat (tool-calling)            │
   │    German search fields + law codes           │
   └──────────────────────┬────────────────────────┘
                          ▼
   ┌───────────────────────────────────────────────┐
   │ ②  Cross-lingual retrieval · BGE-M3           │
   │    bilingual fine-tune (CachedMNR + HN mining)│
   │    code-filter: 170k articles → few thousand  │
   └──────────────────────┬────────────────────────┘
                          ▼
   ┌───────────────────────────────────────────────┐
   │ ③  LLM reranking · Qwen2.5-32B-AWQ (vLLM)     │
   │    bilingual 0–9 citation-suitability prompt  │
   │    digit logprobs → E[d], DE+EN avg, top-500  │
   └──────────────────────┬────────────────────────┘
                          ▼
   ┌───────────────────────────────────────────────┐
   │ ④  Rule-augmented prediction                  │
   │    always-add procedural articles             │
   │    + IDF-weighted court co-citation graph     │
   │    over 547k BGer decisions                   │
   └──────────────────────┬────────────────────────┘
                          ▼
   ┌───────────────────────────────────────────────┐
   │ ⑤  Agentic verification · LangGraph           │
   │    parallel agents read BGer decision text    │
   │    accept / reject / correct candidates       │
   └──────────────────────┬────────────────────────┘
                          ▼
   ┌───────────────────────────────────────────────┐
   │   Predicted Swiss federal law citations       │
   │   Macro-F1: 0.229 public  /  0.235 private    │
   └───────────────────────────────────────────────┘

Stage 1: DeepSeek Query Analysis

Stage 2: Cross-lingual BGE-M3 Retrieval

The contrast in one line: DeepSeek translation helps as training augmentation (paired DE↔EN queries → bilingual encoder, ~2× R@10) but hurts as test-time query rewriting (the EN→DE rewrite is lossy: R@1000 drops from 0.51 to 0.36). Once the encoder is bilingual, keep the query in its native language.
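The pipeline diagram also lists a code filter that narrows the 170k articles to a few thousand before dense scoring. The sketch below shows one way such a filter could work, assuming each article carries a law-code label and a unit-length embedding. The toy corpus, the random vectors standing in for BGE-M3 embeddings, and the function names are all hypothetical; a real index would use FAISS rather than a Python loop.

```python
import math
import random


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def code_filtered_search(query_vec, articles, allowed_codes, top_k=5):
    """Score only the articles whose law code is in allowed_codes.

    articles: list of (citation, law_code, unit_vector) tuples.
    """
    scored = []
    for citation, law_code, vec in articles:
        if law_code not in allowed_codes:
            continue  # filtered out before any similarity is computed
        # Unit vectors, so the dot product equals cosine similarity.
        score = sum(q * v for q, v in zip(query_vec, vec))
        scored.append((score, citation))
    scored.sort(reverse=True)
    return [(citation, score) for score, citation in scored[:top_k]]


# Toy corpus: 200 fake articles spread over four law codes.
random.seed(0)
codes = ["OR", "ZGB", "StGB", "BGG"]
articles = [
    (f"Art. {i} {codes[i % 4]}", codes[i % 4],
     normalise([random.gauss(0, 1) for _ in range(8)]))
    for i in range(1, 201)
]

# Stand-in for the encoded English question; reuses the vector of "Art. 1 ZGB".
query_vec = articles[0][2]
hits = code_filtered_search(query_vec, articles, allowed_codes={"ZGB"})
print(hits[0][0])
```

In this sketch the law codes would come from the query-analysis stage, and the question vector would come from encoding the original English text, with no EN→DE rewrite.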

Stage 3: Qwen2.5-32B Reranking
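The pipeline diagram summarises this stage as digit logprobs → E[d], averaged over a German and an English prompt. A minimal sketch of that scoring step follows, assuming the reranker returns a token-to-logprob mapping for the first generated token. How those logprobs are requested from vLLM is not shown, and renormalising over the digit tokens that are present is an assumption, not a confirmed detail of the original pipeline.

```python
import math

DIGITS = {str(d) for d in range(10)}


def expected_digit(logprobs):
    """Expected score E[d] over digit tokens, renormalised over those present.

    logprobs: mapping from token string to log-probability (hypothetical format).
    """
    digit_lp = {int(tok): lp for tok, lp in logprobs.items() if tok in DIGITS}
    if not digit_lp:
        raise ValueError("no digit token among the returned logprobs")
    max_lp = max(digit_lp.values())  # subtract the max for numerical stability
    weights = {d: math.exp(lp - max_lp) for d, lp in digit_lp.items()}
    total = sum(weights.values())
    return sum(d * w for d, w in weights.items()) / total


def bilingual_score(logprobs_de, logprobs_en):
    """Average the expected score from the German and English prompts."""
    return 0.5 * (expected_digit(logprobs_de) + expected_digit(logprobs_en))


# Made-up logprobs for one (question, candidate) pair.
logprobs_de = {"8": math.log(0.6), "7": math.log(0.3), "2": math.log(0.1)}
logprobs_en = {"9": math.log(0.4), "5": math.log(0.4), "The": math.log(0.2)}

print(round(bilingual_score(logprobs_de, logprobs_en), 2))
```

Using the expectation rather than the single most likely digit gives a continuous score, so candidates that would otherwise tie on the same digit can still be ordered before the top-500 cut.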

Stage 4: Rule-Augmented Prediction
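The diagram describes this stage as always-add procedural articles plus an IDF-weighted co-citation graph over court decisions. The sketch below is one plausible reading, not the original implementation. It assumes each decision is a list of citations, weights each seed by log(N / df), and scores a candidate by how often it co-occurs with that seed. The weighting formula, the names, and the toy decisions are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def build_cocitation(decisions):
    """Count document frequency and pairwise co-occurrence over decisions."""
    doc_freq = Counter()
    cooc = defaultdict(Counter)
    for cited in decisions:
        unique = set(cited)
        doc_freq.update(unique)
        for a, b in permutations(unique, 2):
            cooc[a][b] += 1
    return doc_freq, cooc


def expand(seeds, doc_freq, cooc, n_decisions, always_add=(), top_k=3):
    """Add always-on articles and the top co-cited candidates to the seeds."""
    scores = Counter()
    for seed in seeds:
        if seed not in doc_freq:
            continue
        # Rare seeds are more informative, so they get a larger weight.
        idf = math.log(n_decisions / doc_freq[seed])
        for cand, n in cooc[seed].items():
            if cand not in seeds:
                scores[cand] += idf * n / doc_freq[seed]
    ranked = [cand for cand, _ in scores.most_common(top_k)]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys([*seeds, *always_add, *ranked]))


# Toy decisions with placeholder labels.
decisions = [
    ["Art. 41 OR", "Art. 8 ZGB", "BGE_A"],
    ["Art. 41 OR", "BGE_A", "Art. 42 OR"],
    ["Art. 8 ZGB", "Art. 100 BGG"],
    ["Art. 100 BGG", "Art. 42 BGG"],
    ["Art. 100 BGG", "Art. 41 OR", "BGE_A"],
]
doc_freq, cooc = build_cocitation(decisions)
result = expand(["Art. 41 OR"], doc_freq, cooc, len(decisions),
                always_add=["Art. 100 BGG"], top_k=1)
print(result)
```

Here the court decision `BGE_A` is recovered purely from co-citation with the seed article, which is the kind of gold citation that dense retrieval over article text would not surface.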

Stage 5: Agentic Verification (LangGraph)
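The diagram describes parallel agents that read decision text and accept, reject, or correct each candidate. The sketch below illustrates only that control flow and deliberately swaps in simpler parts: a thread pool instead of a LangGraph graph, and a string and regex check instead of an LLM agent. The `verify` rule, the function names, and the sample text are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def verify(candidate, decision_text):
    """Return (verdict, citation). Stub standing in for an LLM agent."""
    if candidate in decision_text:
        return "accept", candidate
    # Look for the same article and code with a paragraph number added,
    # e.g. "Art. 41 OR" -> "Art. 41 Abs. 1 OR".
    match = re.fullmatch(r"Art\. (\S+) (\S+)", candidate)
    if match:
        article, code = (re.escape(g) for g in match.groups())
        found = re.search(rf"Art\. {article} Abs\. \d+ {code}", decision_text)
        if found:
            return "correct", found.group(0)
    return "reject", None


def verify_all(candidates, decision_text, max_workers=4):
    """Check candidates in parallel and keep accepted or corrected ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        verdicts = list(pool.map(lambda c: verify(c, decision_text), candidates))
    return [citation for verdict, citation in verdicts if verdict != "reject"]


# Invented decision excerpt for illustration.
text = ("Die Haftung richtet sich nach Art. 41 Abs. 1 OR; "
        "die Beweislast folgt aus Art. 8 ZGB.")
candidates = ["Art. 41 OR", "Art. 8 ZGB", "Art. 97 OR"]

print(verify_all(candidates, text))
```

The point of the sketch is the grounding step: a co-citation candidate is kept only if the source decision text supports it, and a near-miss is rewritten to the form found in the text.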

Tech Stack

Python · PyTorch · BGE-M3 · CachedMNR · FAISS · Qwen2.5-32B · vLLM · DeepSeek · LangGraph · Co-citation Graph · Information Retrieval