Datathon — Multi-Modal Property Search

RobinReal Challenge

📅 2026 🏢 Datathon Challenge Track 🛠️ My focus: Retrieval pipeline 📂 GitHub Repository

Overview

Built a conversational, multi-modal real-estate search system for the RobinReal datathon. Users can ask in any of four languages, e.g. "3-room bright apartment in Zurich under 2800 CHF", and the system returns ranked listings using structured filters, multi-turn user profiling, and a hybrid retrieval stack that fuses dense, sparse, and cross-modal image signals and backs them with a lexical BM25 safety net.

🎯 Goal: Move beyond keyword-and-checkbox real-estate search. Let users describe what they actually want in any of EN / DE / FR / IT, fold soft preferences in from conversation context, and treat listing photos as a first-class retrieval signal — not a post-hoc filter.

Pipeline (high level)

Five stages, with my contribution focused on Stages 2–4 (the retrieval stack):

Hard Filter — LLM-extracted structured slots → SQLite, deterministic candidate set.
Query Encoding — BGE-M3 (dense + sparse), SigLIP-2 text tower (cross-modal), and 4-language BM25 over pre-translated indices.
Multi-Signal Scoring + RRF — three neural rankers (BGE Dense, BGE Sparse, SigLIP Image) fused via Reciprocal Rank Fusion (k = 60).
BM25 Diversity Injection — top-20% BM25 hits not in top-100 fused → replace tail; fills embedding blind spots.
Multi-turn User Profiling — running profile of soft preferences; augments the query across turns.

Stage 1 — Hard Filter

Slot extraction: An LLM parses the query into structured slots (city, postal code, price range, room count, offer type, geo radius, must-have features).
SQL-backed filter: Slots are translated to SQLite queries with auto-generated indices over the imported listing CSVs, producing a deterministic candidate set before any neural retrieval runs.
Geo handling: Listings without street addresses fall back to reverse geocoding via Nominatim, with precomputed lat/lon features so radius filters stay cheap at query time.
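
The slot-to-SQL step can be sketched as follows. This is a minimal illustration, not the project's code: the table name `listings`, its columns, the slot keys (`city`, `max_price`, `min_rooms`), and the helper `build_filter_query` are all assumptions made for the example. Values are bound as parameters rather than interpolated, so LLM-extracted text never ends up inside the SQL string.

```python
import sqlite3

def build_filter_query(slots):
    """Turn extracted slots into a parameterized SQL query (hypothetical schema)."""
    clauses, params = [], []
    if "city" in slots:
        clauses.append("city = ?")
        params.append(slots["city"])
    if "max_price" in slots:
        clauses.append("price <= ?")
        params.append(slots["max_price"])
    if "min_rooms" in slots:
        clauses.append("rooms >= ?")
        params.append(slots["min_rooms"])
    where = " AND ".join(clauses) if clauses else "1=1"
    # ORDER BY id keeps the candidate set deterministic across runs.
    return f"SELECT id FROM listings WHERE {where} ORDER BY id", params

# Tiny in-memory table standing in for the imported listing CSVs.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings (id INTEGER PRIMARY KEY, city TEXT, price REAL, rooms REAL)"
)
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [(1, "Zurich", 2700, 3.0), (2, "Zurich", 3100, 3.5), (3, "Bern", 1900, 3.0)],
)
conn.execute("CREATE INDEX idx_city_price ON listings (city, price)")

sql, params = build_filter_query({"city": "Zurich", "max_price": 2800, "min_rooms": 3})
candidate_ids = [row[0] for row in conn.execute(sql, params)]
print(candidate_ids)  # [1]
```

Only the ids that survive this filter would be passed on to the neural rankers, which is what keeps the later stages cheap.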

Retrieval (Stages 2–4) — Deep Dive

This is the part of the system I owned. The retrieval stack is built around three observations: (1) a single embedding model has blind spots, (2) listing photos carry signal that text descriptions miss, and (3) the corpus is multilingual (German, French, Italian, English), and translating queries on the fly was too lossy. The design is a hybrid stack: four scorers run in parallel, the three neural ones are fused by rank, and BM25 is held back as a lexical safety net at the end.

                  ┌──────────────────────────────────────┐
                  │   Raw Query  (multilingual · no MT)  │
                  └─────────────────┬────────────────────┘
                                    │
          ┌─────────────────────────┼─────────────────────────┐
          ▼                         ▼                         ▼
   ┌──────────────┐         ┌──────────────┐          ┌──────────────┐
   │   BGE-M3     │         │  SigLIP-2    │          │  BM25 query  │   ── Stage 2
   │ dense+sparse │         │  Text Tower  │          │  EN · DE ·   │     Query
   │   (ONNX)     │         │   (ONNX)     │          │  FR · IT     │     Encoding
   └──────┬───────┘         └──────┬───────┘          └──────┬───────┘
          │                        │                         │
          ▼                        ▼                         │
   ┌──────────────┐         ┌──────────────┐                 │
   │  BGE Corpus  │         │ Image Corpus │                 │   index-time:
   │ dense+sparse │         │ SigLIP image │                 │   listing text
   │   (FAISS)    │         │ vecs (FAISS) │                 │   pre-translated
   └───┬──────┬───┘         └──────┬───────┘                 │   to 4 languages
       │      │                    │                         │
       ▼      ▼                    ▼                         ▼
   ┌──────┐ ┌──────┐         ┌────────────┐         ┌────────────────┐
   │ BGE  │ │ BGE  │         │ SigLIP     │         │ BM25 (4-lang)  │   ── Stage 3
   │Dense │ │Sparse│         │ Image      │         │ word + char    │     Multi-Signal
   │cosine│ │ dot  │         │ text↔image │         │ MAX fusion     │     Scoring + RRF
   └──┬───┘ └──┬───┘         └─────┬──────┘         └────────┬───────┘
      │        │                   │                         │
      └────────┴──────────┬────────┘                         │
                          ▼                                  │
         ┌─────────────────────────────────────┐             │
         │ Reciprocal Rank Fusion (RRF · k=60) │             │
         │ score[lid] += 1 / (60 + rank)       │             │
         │   per list  (rank starts at 1)      │             │
         │ → fuses BGE-Dense, BGE-Sparse,      │             │
         │   SigLIP-Image  (BM25 NOT fused)    │             │
         └─────────────────┬───────────────────┘             │
                           ▼                                 ▼
         ┌─────────────────────────────────────────────────────┐
         │   BM25 Diversity Injection                          │   ── Stage 4
         │   take top-20% BM25 hits NOT in top-100 fused       │     Diversity
         │   results · replace tail at 0.9 × last_score        │     Injection
         └─────────────────┬───────────────────────────────────┘
                           ▼
              ┌─────────────────────────────┐
              │   Final Ranked Results      │
              │ multi-signal · diversity-   │
              │       injected              │
              └─────────────────────────────┘

Stage 2 — Query Encoding (dual retrieval over ONNX)

Query encoders are exported to ONNX and served with ONNX Runtime on CPU, so the whole retrieval path is GPU-free at inference. The two ONNX-served encoders feed two retrieval modalities, and a BM25 path over pre-translated indices runs alongside them:

🔷 Text → Text retrieval (BGE-M3, ONNX)

BGE-M3 is multi-functional: a single forward pass yields both a dense embedding (for semantic similarity, cosine) and a sparse lexical-weighted vector (for vocabulary overlap, token dot product). Multilingual natively, so the raw query is encoded directly without translation. Listing-side embeddings are precomputed and indexed in FAISS.
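
The two BGE-M3 scores can be illustrated in plain Python. This is a sketch under assumptions: the vectors are toy values, the sparse vectors are keyed by readable token strings rather than tokenizer ids, and the brute-force functions stand in for whatever the FAISS-backed code actually does.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sparse_dot(q_weights, d_weights):
    """Sum of weight products over tokens present in both sparse vectors."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)

# Toy query and listing representations (made-up numbers).
q_dense, d_dense = [1.0, 0.0], [1.0, 1.0]
q_sparse = {"balcony": 0.5, "zurich": 0.4}
d_sparse = {"balcony": 0.6, "garden": 0.3}

dense_score = cosine(q_dense, d_dense)        # semantic similarity
sparse_score = sparse_dot(q_sparse, d_sparse) # vocabulary overlap: only "balcony" is shared
```

The two numbers live on different scales, which is one reason the later fusion step works on ranks instead of raw scores.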

🖼️ Text → Image retrieval (SigLIP-2 text tower, ONNX)

The SigLIP-2 text tower projects the query into the same embedding space as listing photos. Listing image embeddings (vision tower) are computed once offline and indexed. At query time only the text tower runs — turning visual descriptors like "bright", "balcony", "with a view" into a real cross-modal retrieval signal instead of a keyword match against captions.
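
A listing usually has several photos, so the image scores have to be collapsed to one score per listing. The source does not say how this is done; the sketch below assumes max-pooling over photos and unit-normalized embeddings, and uses a brute-force loop in place of the FAISS index. The function name `best_photo_ranking` and the toy vectors are hypothetical.

```python
def best_photo_ranking(query_vec, photo_index):
    """Rank listings by their best-matching photo.

    photo_index: iterable of (listing_id, unit-normalized image embedding).
    Max over photos is an assumption, not the documented behavior.
    """
    scores = {}
    for listing_id, photo_vec in photo_index:
        # For unit vectors the dot product equals cosine similarity.
        sim = sum(q * p for q, p in zip(query_vec, photo_vec))
        if sim > scores.get(listing_id, float("-inf")):
            scores[listing_id] = sim
    return sorted(scores, key=scores.get, reverse=True)

photo_index = [
    ("A", [1.0, 0.0]),  # listing A, photo 1
    ("A", [0.0, 1.0]),  # listing A, photo 2
    ("B", [0.6, 0.8]),  # listing B, single photo
]
ranked = best_photo_ranking([0.0, 1.0], photo_index)
print(ranked)  # ['A', 'B']
```

Listing A wins because one of its photos matches the query direction exactly, even though its other photo does not match at all.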

📄 BM25 over 4-language indices

Listing text (title + description + features) is concatenated and pre-translated to EN / DE / FR / IT at index time, and each language gets its own BM25 indices built with both word and character n-gram tokenizers. At query time the raw query is scored against all four languages and each listing keeps its maximum score across indices (MAX fusion), which makes the signal robust to typos, mixed-language input, and rare-word matches such as proper nouns or specific feature terms.
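
MAX fusion across per-language indices can be sketched with a small self-contained BM25. Everything here is illustrative: the textbook Okapi BM25 formula with `k1=1.5`, `b=0.75`, the trigram size, and the helper names are assumptions, and the real indices may normalize scores differently before taking the maximum.

```python
import math
from collections import Counter

def word_tokens(text):
    return text.lower().split()

def char_ngrams(text, n=3):
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Plain Okapi BM25; returns one score per document, in input order."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter(tok for d in docs_tokens for tok in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for tok in query_tokens:
            if tok not in tf:
                continue
            idf = math.log(1 + (n_docs - df[tok] + 0.5) / (df[tok] + 0.5))
            norm = tf[tok] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / norm
        scores.append(score)
    return scores

def max_fused_bm25(query, indices):
    """indices: {name: (tokenizer, [doc_text, ...])}; all lists share listing order."""
    per_index = [
        bm25_scores(tok(query), [tok(doc) for doc in docs])
        for tok, docs in indices.values()
    ]
    # Each listing keeps its best score over all indices.
    return [max(col) for col in zip(*per_index)]

en_docs = ["bright flat with balcony in zurich", "quiet studio near the lake"]
de_docs = ["helle wohnung mit balkon in zürich", "ruhiges studio am see"]
indices = {
    "en_word": (word_tokens, en_docs),
    "de_word": (word_tokens, de_docs),
    "de_char": (char_ngrams, de_docs),
}
fused = max_fused_bm25("Balkon Zürich", indices)
```

The German query finds nothing in the English word index, but the first listing still scores through the German indices, which is the point of scoring against every language and keeping the maximum.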

Why two ONNX encoders, not one: BGE-M3 is excellent at semantic text matching but blind to images; SigLIP-2 is excellent at text↔image matching but a weaker pure-text retriever. Running both as dual retrievers — text-text via BGE, text-image via SigLIP — captures complementary signals. ONNX serving makes both fast enough on CPU that we don't have to pick.

Stage 3 — Multi-Signal Scoring + RRF

Three neural ranked lists are fused via Reciprocal Rank Fusion:

BGE Dense — cosine similarity, query dense vector ↔ listing dense vector.
BGE Sparse — token-level dot product over BGE-M3's lexical-weighted vectors.
SigLIP Image — cross-modal similarity, query text embedding ↔ pre-computed listing image embeddings.

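from collections import defaultdict

scores = defaultdict(float)                   # listing id -> fused RRF score
for rlist in rank_lists:                      # one ranked list of listing ids per neural ranker
    for rank, lid in enumerate(rlist, 1):     # rank starts at 1
        scores[lid] += 1.0 / (60 + rank)      # k = 60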

BM25 is deliberately not in RRF. Lexical scores collapse onto a small number of exact-match listings and would drown out the neural signals if mixed in by rank. BM25 is held back and used as a separate diversity signal in Stage 4.

Why ranks, not raw scores: cosine similarities and sparse dot products live on incompatible scales. RRF throws magnitudes away and only uses ordering, which keeps any one ranker from dominating just because its numbers happen to be bigger.
Why k = 60: this is the standard value from Cormack et al. It is large enough that a top-1 hit in a single list doesn't overpower agreement across lists, and small enough that rank order still matters, so deep-tail positions add comparatively little to the fused score.
Effect: a listing only needs to rank reasonably well across multiple neural signals to surface — single-channel head monopolies are suppressed and consensus picks bubble up.
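
To make the consensus effect concrete, the RRF loop can be wrapped in a function and run on toy rankings. The function name `rrf_fuse` and the listing ids are made up for the example; the scoring rule is the same `1 / (k + rank)` with `k = 60` and ranks starting at 1.

```python
from collections import defaultdict

def rrf_fuse(rank_lists, k=60):
    """Reciprocal Rank Fusion; returns (listing_id, score) pairs, best first."""
    scores = defaultdict(float)
    for rlist in rank_lists:
        for rank, lid in enumerate(rlist, 1):  # rank starts at 1
            scores[lid] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Each ranker has a different favourite, but all three place "B" second.
dense = ["A", "B", "C"]
sparse = ["D", "B", "A"]
image = ["E", "B", "C"]

fused = rrf_fuse([dense, sparse, image])
print([lid for lid, _ in fused][:3])  # ['B', 'A', 'C']
```

"B" is never ranked first by any single ranker, yet it scores 3/62 ≈ 0.0484 and beats "A" (1/61 + 1/63 ≈ 0.0323), while the single-list favourites "D" and "E" end up at the bottom with 1/61 each.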

Stage 4 — BM25 Diversity Injection

Concrete rule from ranking.py:

Take the top 20% of BM25's ranked listings.
Drop any that already appear in the top 100 of the RRF-fused results.
The remainder replaces the tail of the result list, each scored at 0.9 × last_fused_score.

Why: dense and cross-modal embeddings have known blind spots — rare proper nouns, very specific feature words ("Stockwerkeigentum", a specific street name), tokenizer-induced quirks. The neural rankers can all miss the same way, and RRF can't recover from that.
Net effect: recall stays high on exact-match queries (street names, building IDs, unusual amenities) without giving up the semantic strengths of the neural rankers on softer queries.
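
The three-step rule can be sketched as below. This is one reading of the description, not the contents of ranking.py: the function name `inject_bm25`, rounding the 20% cut down, replacing exactly as many tail entries as there are injected listings, and giving all of them the same score are assumptions.

```python
def inject_bm25(fused, bm25_ranked, top_frac=0.2, protect_top=100, discount=0.9):
    """fused: [(listing_id, score)] sorted best first; bm25_ranked: [listing_id] best first."""
    if not fused:
        return []
    n_head = int(len(bm25_ranked) * top_frac)           # top 20% of BM25 (rounded down)
    protected = {lid for lid, _ in fused[:protect_top]}  # already in the fused top-100
    extras = [lid for lid in bm25_ranked[:n_head] if lid not in protected]
    extras = extras[:len(fused)]                         # cannot replace more than exists
    if not extras:
        return list(fused)
    inject_score = discount * fused[-1][1]               # 0.9 x last fused score
    extra_set = set(extras)
    # Drop stale copies of injected ids, then cut the tail to make room.
    kept = [item for item in fused if item[0] not in extra_set]
    kept = kept[:len(fused) - len(extras)]
    return kept + [(lid, inject_score) for lid in extras]

# Toy run with protect_top=3 so the effect is visible on a short list.
fused = [("A", 0.05), ("B", 0.04), ("C", 0.03), ("D", 0.02)]
bm25_ranked = ["X", "A", "Y", "B", "C", "D", "E", "F", "G", "H"]
result = inject_bm25(fused, bm25_ranked, protect_top=3)
print([lid for lid, _ in result])  # ['A', 'B', 'C', 'X']
```

Of the top two BM25 hits, "A" is already in the protected head and is skipped, while "X" was missed by every neural ranker and takes over the last slot from "D". Because the injected score is below the last fused score, the list stays sorted.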

Stage 5 — Multi-turn User Profiling

Wraps the retrieval stack: profile state is built up across turns and folded into the query before the next retrieval call.

Running profile: Across conversation turns, soft facts (lifestyle, neighborhood feel, willingness to compromise on price vs. space, etc.) are accumulated into a user profile rather than restarting from a cold query each turn.
Query construction: The current turn's text is augmented with profile-derived preferences before being passed to encoding — so a user reacting to results ("I want it brighter") refines retrieval instead of replacing it.
Soft vs. hard separation: Profile signals influence ranking, never filtering — we never reject a listing because of an inferred preference.
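
A running profile of this kind can be sketched as a small state object. The class name `UserProfile`, the list-of-strings representation, and the way preferences are appended to the query text are assumptions for illustration; the source does not describe the actual data structure or prompt format.

```python
from dataclasses import dataclass, field

@dataclass
class UserProfile:
    """Accumulates soft preferences across conversation turns (hypothetical structure)."""
    soft_prefs: list = field(default_factory=list)

    def update(self, new_prefs):
        # Keep first-seen order and skip duplicates from later turns.
        for pref in new_prefs:
            if pref not in self.soft_prefs:
                self.soft_prefs.append(pref)

    def augment(self, turn_text):
        # Soft preferences only extend the text sent to the encoders;
        # they are never turned into hard filters.
        if not self.soft_prefs:
            return turn_text
        return f"{turn_text} | preferences: {', '.join(self.soft_prefs)}"

profile = UserProfile()
profile.update(["quiet neighborhood"])            # turn 1
profile.update(["bright", "quiet neighborhood"])  # turn 2 repeats one preference
query = profile.augment("3-room apartment in Zurich")
print(query)  # 3-room apartment in Zurich | preferences: quiet neighborhood, bright
```

Because the augmented string goes through the same encoders as any other query, a follow-up like "I want it brighter" shifts the ranking without shrinking the candidate set produced by the hard filter.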

Serving

ONNX Runtime on CPU for both encoders (BGE-M3, SigLIP-2 text tower) — dual retrieval (text↔text + text↔image) without a GPU dependency in the serving path.
FAISS over the BGE corpus and the SigLIP image corpus, both with embeddings precomputed offline once and reused across queries. BM25 indices built per language at index time.
API: FastAPI service exposing both a high-level /listings NL endpoint and a low-level /listings/search/filter structured endpoint.
MCP integration: the whole pipeline is exposed via the MCP Apps SDK so it can be driven directly from ChatGPT or Claude Desktop as a conversational tool.
Frontend: Vite + React widget rendering ranked results alongside a map view.
Deployed on AWS.

Tech Stack

Python, FastAPI, SQLite, BGE-M3, SigLIP-2, ONNX Runtime, FAISS, BM25, RRF, LLM, MCP Apps SDK, Vite + React, AWS, Nominatim
