Datathon: Multi-Modal Property Search

RobinReal Challenge

📅 2026 🏢 Datathon Challenge Track 🛠️ My focus: Retrieval pipeline 📂 GitHub Repository

Overview

Built a conversational, multi-modal real-estate search system for the RobinReal datathon. Users can ask in any of four languages (e.g. "3-room bright apartment in Zurich under 2800 CHF"), and the system returns ranked listings by combining structured filters, multi-turn user profiling, and a hybrid retrieval stack that combines dense, sparse, cross-modal image, and lexical signals.

🎯 Goal: Move beyond keyword-and-checkbox real-estate search. Let users describe what they actually want in any of EN / DE / FR / IT, fold soft preferences in from conversation context, and treat listing photos as a first-class retrieval signal, not a post-hoc filter.

Pipeline (high level)

Five stages, with my contribution focused on Stages 2–4 (the retrieval stack):

  1. Hard Filter: LLM-extracted structured slots → SQLite, deterministic candidate set.
  2. Query Encoding: BGE-M3 (dense + sparse), SigLIP-2 text tower (cross-modal), and 4-language BM25 over pre-translated indices.
  3. Multi-Signal Scoring + RRF: three neural rankers (BGE Dense, BGE Sparse, SigLIP Image) fused via Reciprocal Rank Fusion (k = 60).
  4. BM25 Diversity Injection: top-20% BM25 hits not in top-100 fused → replace tail; fills embedding blind spots.
  5. Multi-turn User Profiling: running profile of soft preferences; augments the query across turns.
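
To make Stage 1 concrete, here is a minimal sketch of how extracted slots could be turned into a parameterized SQLite query. The table name `listings`, its columns (`city`, `rooms`, `price_chf`), the slot keys, and the `hard_filter` helper are all hypothetical; the project's actual schema and slot set are not shown on this page.

```python
import sqlite3

# Hypothetical slot -> SQL condition mapping; the real schema may differ.
SLOT_CONDITIONS = {
    "city": "city = ?",
    "min_rooms": "rooms >= ?",
    "max_price_chf": "price_chf <= ?",
}


def hard_filter(conn, slots):
    """Return ids of listings matching every provided slot, in a fixed order."""
    clauses, params = [], []
    for key, condition in SLOT_CONDITIONS.items():
        if slots.get(key) is not None:
            clauses.append(condition)
            params.append(slots[key])
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = f"SELECT id FROM listings WHERE {where} ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


# Toy in-memory data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, city TEXT, rooms REAL, price_chf INTEGER)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [(1, "Zurich", 3.0, 2700), (2, "Zurich", 3.5, 3100), (3, "Bern", 3.0, 2000)],
)
candidates = hard_filter(conn, {"city": "Zurich", "min_rooms": 3, "max_price_chf": 2800})
```

Only whitelisted condition strings are interpolated into the SQL; slot values always go through bound parameters, so the candidate set stays deterministic and injection-safe.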

Stage 1: Hard Filter

Retrieval (Stages 2–4): Deep Dive

This is the part of the system I owned. The retrieval stack is built around three observations: (1) a single embedding model has blind spots, (2) listing photos carry signal that text descriptions miss, and (3) the corpus is multilingual (German, French, Italian, English), and translating queries on the fly was too lossy. The design is a hybrid stack with four parallel scorers: three neural scorers fused by rank, plus BM25 as a lexical safety net injected at the end.

                  ┌──────────────────────────────────────┐
                  │   Raw Query  (multilingual · no MT)  │
                  └─────────────────┬────────────────────┘
                                    │
          ┌─────────────────────────┼─────────────────────────┐
          ▼                         ▼                         ▼
   ┌──────────────┐          ┌──────────────┐          ┌──────────────┐
   │   BGE-M3     │          │  SigLIP-2    │          │  BM25 query  │   ── Stage 2
   │ dense+sparse │          │  Text Tower  │          │  EN · DE ·   │      Query
   │   (ONNX)     │          │   (ONNX)     │          │  FR · IT     │      Encoding
   └──────┬───────┘          └──────┬───────┘          └──────┬───────┘
          │                         │                         │
          ▼                         ▼                         │
   ┌──────────────┐          ┌──────────────┐                 │
   │  BGE Corpus  │          │ Image Corpus │                 │   index-time:
   │ dense+sparse │          │ SigLIP image │                 │   listing text
   │   (FAISS)    │          │ vecs (FAISS) │                 │   pre-translated
   └───┬──────┬───┘          └──────┬───────┘                 │   to 4 languages
       │      │                     │                         │
       ▼      ▼                     ▼                         ▼
   ┌──────┐ ┌───────┐        ┌────────────┐           ┌────────────────┐
   │ BGE  │ │  BGE  │        │ SigLIP     │           │ BM25 (4-lang)  │   ── Stage 3
   │Dense │ │Sparse │        │ Image      │           │ word + char    │      Multi-Signal
   │cosine│ │tok·dot│        │ text↔image │           │ MAX fusion     │      Scoring + RRF
   └───┬──┘ └───┬───┘        └──────┬─────┘           └───────┬────────┘
       │        │                   │                         │
       └────────┴─────────┬─────────┘                         │
                          ▼                                   │
         ┌─────────────────────────────────────┐              │
         │ Reciprocal Rank Fusion (RRF · k=60) │              │
         │ score[lid] += 1 / (60 + rank)       │              │
         │   per list  (rank starts at 1)      │              │
         │ → fuses BGE-Dense, BGE-Sparse,      │              │
         │   SigLIP-Image  (BM25 NOT fused)    │              │
         └─────────────────┬───────────────────┘              │
                           ▼                                  ▼
         ┌─────────────────────────────────────────────────────┐
         │   BM25 Diversity Injection                          │   ── Stage 4
         │   take top-20% BM25 hits NOT in top-100 fused       │      Diversity
         │   results · replace tail at 0.9 × last_score        │      Injection
         └─────────────────┬───────────────────────────────────┘
                           ▼
            ┌─────────────────────────────┐
            │   Final Ranked Results      │
            │ multi-signal · diversity-   │
            │       injected              │
            └─────────────────────────────┘

Stage 2: Query Encoding (dual retrieval over ONNX)

Query encoders are exported to ONNX and served with ONNX Runtime on CPU, so the whole retrieval path is GPU-free at inference. The two ONNX-served encoders feed two retrieval modalities, and BM25 runs alongside them as a purely lexical scorer:

🔷 Text → Text retrieval (BGE-M3, ONNX)

BGE-M3 is multi-functional: a single forward pass yields both a dense embedding (for semantic similarity, scored by cosine) and a sparse lexical-weight vector (for vocabulary overlap, scored by a token-wise dot product). It is natively multilingual, so the raw query is encoded directly without translation. Listing-side embeddings are precomputed and indexed in FAISS.
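
As an illustration of the two scoring rules named above, the following sketch computes a cosine score on dense vectors and a token-wise dot product on sparse weight maps. It uses plain Python on toy values; the helper names `cosine` and `sparse_dot` are hypothetical, and in the real system the dense side is served from FAISS rather than computed this way.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sparse_dot(query_weights, doc_weights):
    """Sum of weight products over tokens present in both sparse vectors."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Toy values standing in for BGE-M3 outputs (not real model weights).
dense_score = cosine([0.6, 0.8], [0.8, 0.6])
sparse_score = sparse_dot(
    {"zurich": 0.5, "bright": 0.2},
    {"zurich": 0.4, "balcony": 0.3},
)
```

Only the shared token "zurich" contributes to the sparse score, which is why this signal rewards exact vocabulary overlap while the dense score can still be high without it.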

🖼️ Text → Image retrieval (SigLIP-2 text tower, ONNX)

The SigLIP-2 text tower projects the query into the same embedding space as listing photos. Listing image embeddings (vision tower) are computed once offline and indexed. At query time only the text tower runs, which turns visual descriptors like "bright", "balcony", "with a view" into a real cross-modal retrieval signal instead of a keyword match against captions.
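
A listing usually has several photos, so the per-photo similarities have to be reduced to one score per listing. How the project does this is not stated here; the sketch below assumes unit-length embeddings and takes the best-matching photo (max), which is one plausible choice. The function name and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(a * b for a, b in zip(u, v))


def listing_image_scores(text_vec, photo_vecs_by_listing):
    """Score each listing by its best-matching photo (assumed max aggregation)."""
    return {
        lid: max(dot(text_vec, vec) for vec in vecs)
        for lid, vecs in photo_vecs_by_listing.items()
        if vecs  # listings without photos get no image score
    }


# Toy unit-length vectors standing in for SigLIP-2 embeddings.
query_vec = [1.0, 0.0]                  # e.g. text-tower output for "bright"
photos = {
    "A": [[0.0, 1.0], [1.0, 0.0]],      # second photo matches the query
    "B": [[0.0, 1.0]],
    "C": [],
}
image_scores = listing_image_scores(query_vec, photos)
```

With max aggregation a single strong photo is enough to surface a listing; a mean would instead favor listings whose photos all match.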

📄 BM25 over 4-language indices

Listing text (title + description + features) is concatenated and pre-translated to EN / DE / FR / IT at index time, with a separate BM25 index per language built with both word and character n-gram tokenizers. At query time we score against all four and fuse via MAX, which is robust to typos, mixed-language input, and rare-word matches like proper nouns or specific feature terms.
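
The two ingredients described above, character n-gram tokenization and MAX fusion across indices, can be sketched as follows. The BM25 scoring itself is omitted: `max_fuse` takes per-index score maps as given. The function names, the trigram size, and the toy scores are assumptions, and whether the project normalizes scores before taking the max is not stated.

```python
import re


def word_tokens(text):
    """Lowercased word tokens (Unicode-aware, so umlauts and accents survive)."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n=3):
    """Character n-grams per word; near-identical spellings share some of them."""
    grams = []
    for word in word_tokens(text):
        grams.extend(word[i:i + n] for i in range(max(1, len(word) - n + 1)))
    return grams


def max_fuse(score_maps):
    """Per listing, keep the highest score seen across all indices."""
    fused = {}
    for scores in score_maps:
        for lid, score in scores.items():
            fused[lid] = max(score, fused.get(lid, float("-inf")))
    return fused


# A misspelt query term still overlaps with the indexed German word.
shared = set(char_ngrams("Balkon")) & set(char_ngrams("balcon"))

# Toy per-index scores, e.g. from the DE and EN indices.
bm25_scores = max_fuse([{"A": 2.1, "B": 0.4}, {"A": 1.0, "C": 3.2}])
```

A word-level tokenizer would find no overlap between "Balkon" and "balcon", while the trigram view still shares "bal"; MAX fusion then lets a listing score well if it matches strongly in any one language.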

Why two ONNX encoders, not one: BGE-M3 is excellent at semantic text matching but blind to images; SigLIP-2 is excellent at text↔image matching but a weaker pure-text retriever. Running both as dual retrievers (text-text via BGE, text-image via SigLIP) captures complementary signals. ONNX serving makes both fast enough on CPU that we don't have to pick.

Stage 3: Multi-Signal Scoring + RRF

Three neural ranked lists are fused via Reciprocal Rank Fusion:

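from collections import defaultdict

scores = defaultdict(float)                   # listing id -> fused RRF score
for rlist in rank_lists:                      # one ranked list of listing ids per scorer
    for rank, lid in enumerate(rlist, 1):     # rank starts at 1
        scores[lid] += 1.0 / (60 + rank)      # k = 60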

BM25 is deliberately not in RRF. Lexical scores collapse onto a small number of exact-match listings and would drown out the neural signals if mixed in by rank. BM25 is held back and used as a separate diversity signal in Stage 4.
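
To see how the fusion behaves, here is a small self-contained version of the loop run on toy data. The wrapper name `rrf_fuse` and the three example lists are hypothetical; only the update rule and k = 60 come from the text above.

```python
from collections import defaultdict


def rrf_fuse(rank_lists, k=60):
    """Fuse ranked lists of listing ids; returns (lid, score) pairs, best first."""
    scores = defaultdict(float)
    for rlist in rank_lists:
        for rank, lid in enumerate(rlist, 1):   # rank starts at 1
            scores[lid] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy ranked lists standing in for BGE-Dense, BGE-Sparse, and SigLIP-Image.
fused = rrf_fuse([
    ["A", "B", "C"],
    ["B", "A", "D"],
    ["B", "C", "A"],
])
top_lid = fused[0][0]
```

Here "B" comes out on top because it ranks first or second in all three lists, while "D" appears in only one list and lands last. Because only ranks enter the formula, the raw score scales of the three scorers never need to be calibrated against each other.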

Stage 4: BM25 Diversity Injection

Concrete rule from ranking.py:

  1. Take the top 20% of BM25's ranked listings.
  2. Drop any that already appear in the top 100 of the RRF-fused results.
  3. The remainder replaces the tail of the result list, each scored at 0.9 × last_fused_score.
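
The three steps above can be sketched as follows. This is not the code from ranking.py: the function name, the parameter names, and several details are assumptions, namely that the result list is the top-100 fused list, that "last fused score" means the score of its final entry before replacement, and that the top fused hit is never displaced.

```python
def inject_bm25_diversity(fused, bm25_ranked, top_frac=0.2, head=100, decay=0.9):
    """fused: (lid, score) pairs, best first. bm25_ranked: listing ids, best first."""
    results = list(fused[:head])
    if not results:
        return results

    # Step 1: top 20% of the BM25 ranking.
    n_top = int(len(bm25_ranked) * top_frac)
    # Step 2: drop ids already present in the fused head.
    head_ids = {lid for lid, _ in results}
    extras = [lid for lid in bm25_ranked[:n_top] if lid not in head_ids]
    extras = extras[:len(results) - 1]        # assumption: keep at least the top hit
    if not extras:
        return results

    # Step 3: replace the tail, scoring injected ids just below the last fused score.
    injected_score = decay * results[-1][1]
    kept = results[:len(results) - len(extras)]
    return kept + [(lid, injected_score) for lid in extras]


# Toy data: "x" is a strong BM25 hit that the neural rankers missed.
fused = [("a", 0.05), ("b", 0.04), ("c", 0.03), ("d", 0.02)]
bm25_ranked = ["x", "a", "y", "b", "c", "d", "e", "f", "g", "h"]
final = inject_bm25_diversity(fused, bm25_ranked)
```

In this toy run the top 20% of BM25 is ["x", "a"]; "a" is already in the fused list, so only "x" is injected and it replaces "d" at the tail. Scoring injected items at 0.9 × the last fused score keeps the list sorted without letting lexical hits outrank anything the neural fusion kept.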

Stage 5: Multi-turn User Profiling

Wraps the retrieval stack: profile state is built up across turns and folded into the query before the next retrieval call.
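
One simple way to picture this wrapper is a profile object that accumulates soft preferences and appends them to the next query. The sketch below is an assumption throughout: the `UserProfile` class, its methods, and the plain string concatenation are hypothetical, and the preference phrases are taken as already extracted from the conversation.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Running list of soft preferences collected across conversation turns."""
    preferences: list = field(default_factory=list)

    def update(self, new_preferences):
        # Keep first-seen order and skip case-insensitive duplicates.
        seen = {pref.lower() for pref in self.preferences}
        for pref in new_preferences:
            if pref.lower() not in seen:
                self.preferences.append(pref)
                seen.add(pref.lower())

    def augment(self, query):
        """Append accumulated preferences to the raw query text."""
        if not self.preferences:
            return query
        return " ".join([query, *self.preferences])


profile = UserProfile()
profile.update(["quiet street", "balcony"])     # preferences from turn 1
profile.update(["Balcony", "near a lake"])      # turn 2 repeats one, adds one
augmented = profile.augment("3-room apartment in Zurich")
```

The augmented string is what would then be passed to the query encoders, so preferences mentioned in earlier turns keep influencing retrieval without the user restating them.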

Serving

Tech Stack

Python FastAPI SQLite BGE-M3 SigLIP-2 ONNX Runtime FAISS BM25 RRF LLM MCP Apps SDK Vite + React AWS Nominatim