🥈 2nd Place — TalentCLEF 2026 Task B

TalentCLEF 2026 Job-Skill Retrieval

📅 2026 · 👤 Co-First Author & Corresponding Author · 🏛️ CLEF 2026 Working Notes · 🏆 Codabench: TalentCLEF Task B

Overview

TalentCLEF 2026 Task B is a retrieval problem: given a free-text job-title query, rank the full ESCO skill corpus by graded relevance, scored with graded nDCG. It is hard because queries are short and lexically distant from the structured skill records, and the corpus holds thousands of near-synonym aliases the ranker must disambiguate. Our system is a four-stage pipeline that confines fine-tuning to a single bi-encoder stage and otherwise relies on zero-shot LLM inference.

🏆 Achievement: Reached 0.7913 graded nDCG on the official test set — 2nd place on the Task B leaderboard. The full pipeline runs end-to-end on a single consumer GPU.

Pipeline

A GIST-fine-tuned bi-encoder ranker produces the initial ranking; two-sided test-time augmentation enriches both sides of the encoder; a two-stage LLM-reranker cascade then refines the top of the ranking with pointwise scoring and a pairwise tournament.

   ┌─────────────────────────────────────────────┐
   │  ①  Job-title query   "data science intern"  │
   └───────────────────────┬─────────────────────┘
                           ▼
   ┌─────────────────────────────────────────────┐
   │  ②  JobBERT ranker (GIST fine-tuned)         │
   │     siamese bi-encoder · scores all          │
   │     9,052 ESCO skills (no top-K cutoff)      │
   └───────────────────────┬─────────────────────┘
                           ▼
   ┌─────────────────────────────────────────────┐
   │  ③  Two-sided test-time augmentation         │
   │     doc-side: alias-explode (max-cosine)     │
   │     query-side: multi-style HyDE             │
   └───────────────────────┬─────────────────────┘
                           ▼
   ┌─────────────────────────────────────────────┐
   │  ④  Pointwise LLM rerank (top 500)           │
   │     Qwen 0–9 relevance · z-score fusion      │
   │     s = 0.3·z(s_be) + 0.7·z(s_llm)           │
   └───────────────────────┬─────────────────────┘
                           ▼
   ┌─────────────────────────────────────────────┐
   │  ⑤  Pairwise tournament (top 150)            │
   │     A/B comparisons · Bradley-Terry win count│
   └───────────────────────┬─────────────────────┘
                           ▼
              ┌──────────────────────────┐
              │  Ranked ESCO skills      │
              └──────────────────────────┘

Stage 2 — JobBERT ranker with GIST fine-tuning

Base encoder: JobBERT-v2, a 110M-parameter MPNet encoder pretrained on a proprietary job-skill corpus. Despite being the smallest candidate, it outperforms the larger multilingual encoders (mE5-large, BGE-M3, ESCOXLM-R) by a wide margin in zero-shot evaluation.
GIST loss: GIST (Guided In-sample Selection of Training Negatives) uses a frozen guide model to filter false negatives out of each batch, giving a cleaner contrastive signal without hard-negative mining. GIST beat MNRL (multiple negatives ranking loss) by +0.033 graded nDCG and the graded-ranking losses (CoSENT, AnglE) by +0.039–0.044.
Augmented pairs: the original (job-alias, skill-alias) training pairs are augmented with (job-alias, ESCO description) pairs, doubling the effective training set and exposing the encoder to long descriptions in the same embedding space as short aliases.
Full-ranking: with only 9,052 skills, the bi-encoder scores every one of them with no top-K candidate cutoff, so it acts as a ranker rather than a retriever.
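
The false-negative filtering described under "GIST loss" can be illustrated with a minimal pure-Python sketch. It is an illustration under stated assumptions, not the training code used for this system: the function names (gist_keep_mask, gist_loss), the temperature value, and the toy 2-D vectors are hypothetical, and only the anchor-to-positive similarities are filtered, which simplifies the full GIST formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def gist_keep_mask(guide_anchors, guide_positives):
    """keep[i][j] is False when the guide rates positive j as more similar
    to anchor i than anchor i's own positive, i.e. a likely false negative."""
    n = len(guide_anchors)
    keep = []
    for i in range(n):
        own = cosine(guide_anchors[i], guide_positives[i])
        keep.append([j == i or cosine(guide_anchors[i], guide_positives[j]) <= own
                     for j in range(n)])
    return keep


def gist_loss(anchors, positives, keep, temperature=0.05):
    """Mean in-batch contrastive (InfoNCE) loss over the kept columns only."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature
                  for j in range(n) if keep[i][j]]
        pos_logit = cosine(anchors[i], positives[i]) / temperature
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - pos_logit
    return total / n


# Toy data (assumption): the same 2-D vectors stand in for both the frozen
# guide embeddings and the embeddings of the encoder being trained.
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
positives = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]

keep = gist_keep_mask(anchors, positives)
no_filter = [[True] * 3 for _ in range(3)]
masked_loss = gist_loss(anchors, positives, keep)
plain_loss = gist_loss(anchors, positives, no_filter)
```

In this toy batch, pairs 0 and 2 are near-duplicates, so the guide removes each from the other's negative set; plain_loss corresponds to the unfiltered in-batch objective and is larger because it penalises those near-duplicates as if they were wrong answers.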

Stage 3 — Two-sided test-time augmentation

This stage delivers the single largest gain in the pipeline (+0.116 graded nDCG), and it comes purely from indexing and output design, with no change to the model.

📄 Doc-side: alias-explode

Rather than concatenating all aliases of a skill into one document, each alias is encoded as an independent document. A skill's score is the max cosine over its alias embeddings — capturing whichever alias view best matches the query.
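
As a rough sketch of the alias-explode scoring above: every alias is stored as its own index row tagged with its skill, and a skill's score is the best cosine among its rows. The function name score_skills_alias_explode, the skill identifiers, and the 2-D vectors are hypothetical placeholders for real bi-encoder embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def score_skills_alias_explode(query_vec, alias_index):
    """alias_index holds one (skill_id, alias_embedding) row per alias.
    Returns (skill_id, score) pairs sorted best first, where score is the
    maximum cosine over that skill's alias rows."""
    best = {}
    for skill_id, alias_vec in alias_index:
        sim = cosine(query_vec, alias_vec)
        if sim > best.get(skill_id, -math.inf):
            best[skill_id] = sim
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy index (assumption): two aliases for one skill, one alias for another.
alias_index = [
    ("skill_python", [0.0, 1.0]),
    ("skill_python", [1.0, 0.1]),
    ("skill_welding", [0.5, 0.5]),
]
ranked = score_skills_alias_explode([1.0, 0.0], alias_index)
```

Here the first alias of skill_python is orthogonal to the query, yet the skill still ranks first because its second alias matches well. That is the behaviour max-pooling is meant to provide.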

🔮 Query-side: multi-style HyDE

For each query, Qwen generates three hypothetical skill descriptions in different styles (long paragraph, one-sentence, keyword list). Each is encoded with the same bi-encoder; the skill score is the max cosine over the original query and the three HyDE views. Max-pooling over all three styles beats any single style.
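
The query-side pooling can be sketched the same way. The example below assumes the original query and the three HyDE texts have already been encoded by the bi-encoder; generation and encoding are not shown, and the function name, skill identifiers, and vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def score_with_query_views(view_vecs, skill_vecs):
    """view_vecs: embeddings of the original query plus its HyDE rewrites.
    skill_vecs: mapping skill_id -> embedding.
    Each skill is scored by its best cosine over all query views."""
    return {
        skill_id: max(cosine(view, skill_vec) for view in view_vecs)
        for skill_id, skill_vec in skill_vecs.items()
    }


# Toy vectors (assumption): view 0 is the raw query, views 1-3 stand in for
# the paragraph, one-sentence, and keyword-list HyDE styles.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.6, 0.8]]
skills = {"skill_a": [0.0, 1.0], "skill_b": [1.0, 1.0]}

multi_view = score_with_query_views(views, skills)
query_only = score_with_query_views(views[:1], skills)
```

Because the raw query is always one of the views, the pooled score of a skill can never fall below its query-only score; the extra views can only raise it.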

Stage 4–5 — LLM-reranker cascade

Qwen 2.5-7B-Instruct-AWQ is applied zero-shot as a reranker in two complementary modes on top of the bi-encoder output. Together, the cascade adds +0.043 graded nDCG on the test set.

Pointwise scoring: the LLM rates the top-500 candidates for absolute relevance on a 0–9 scale; bi-encoder and LLM scores are fused per query via z-score normalisation, s = 0.3·z(s_be) + 0.7·z(s_llm) (weight tuned on validation). Contributes +0.022.
Pairwise tournament: the top-150 are reordered via a pairwise A/B tournament — each candidate accumulates a win count over 149 matches, with A/B order swapped for half the comparisons to mitigate positional bias. Under the Bradley-Terry model, total wins approximate the latent relevance ranking. Contributes +0.019.
Shared LLM: one Qwen Instruct-AWQ model serves three zero-shot roles — HyDE generation, 0–9 pointwise scoring, and A/B pairwise comparison.
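
The two reranking steps above can be sketched in a few lines. This is a hedged illustration, not the system's code: the helper names are hypothetical, the population standard deviation is an assumption (the source does not say which estimator is used), alternating the A/B order on every other match is one possible way to swap half the comparisons, and the judge is a stand-in callable where the real pipeline would prompt the LLM.

```python
import statistics
from itertools import combinations


def zscores(values):
    """Standardise one query's scores; constant lists map to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def fuse(be_scores, llm_scores, w_be=0.3, w_llm=0.7):
    """s = w_be * z(s_be) + w_llm * z(s_llm), computed within one query."""
    return [w_be * b + w_llm * l
            for b, l in zip(zscores(be_scores), zscores(llm_scores))]


def round_robin_rank(candidates, prefers_first):
    """Compare every pair once and rank by win count.
    prefers_first(a, b) returns True when the judge picks the item shown first."""
    wins = {c: 0 for c in candidates}
    for k, (a, b) in enumerate(combinations(candidates, 2)):
        # Alternate presentation order so half the matches are shown swapped.
        first, second = (a, b) if k % 2 == 0 else (b, a)
        winner = first if prefers_first(first, second) else second
        wins[winner] += 1
    # sorted() is stable, so ties keep the incoming (fused) order.
    return sorted(candidates, key=lambda c: wins[c], reverse=True), wins


# Toy inputs (assumption): three candidates for a single query.
fused = fuse([0.9, 0.5, 0.1], [2, 9, 5])

# Stand-in judge with a hidden relevance table instead of an LLM call.
hidden_relevance = {"x": 3, "y": 1, "z": 2}
ranking, wins = round_robin_rank(
    ["x", "y", "z"], lambda a, b: hidden_relevance[a] > hidden_relevance[b]
)
```

If every pair is compared exactly once, a tournament over n candidates needs n·(n−1)/2 judge calls, so 150 candidates correspond to 11,175 comparisons per query, with each candidate appearing in 149 of them.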

Tech Stack

Python · PyTorch · Sentence-Transformers · JobBERT-v2 · GIST Loss · HyDE · Qwen 2.5-7B · LLM Reranker · ESCO · Information Retrieval
