TalentCLEF 2026 Job-Skill Retrieval
Overview
TalentCLEF 2026 Task B is a retrieval problem: given a free-text job-title query, rank the full ESCO skill corpus by graded relevance, scored with graded nDCG. It is hard because queries are short and lexically distant from the structured skill records, and the corpus holds thousands of near-synonym aliases the ranker must disambiguate. Our system is a four-stage pipeline that confines fine-tuning to a single bi-encoder stage and otherwise relies on zero-shot LLM inference.
Achievement: Reached 0.7913 graded nDCG on the official test set, placing 2nd on the Task B leaderboard. The full pipeline runs end-to-end on a single consumer GPU.
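To make the metric concrete, here is a minimal sketch of graded nDCG for a single query. It assumes linear gains and a log2 position discount, which may differ from the official TalentCLEF scorer; the names `dcg` and `graded_ndcg` are illustrative only.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def graded_ndcg(ranked_ids, relevance):
    """nDCG of one ranking.

    ranked_ids: skill ids in ranked order (best first).
    relevance:  dict mapping skill id -> graded label; missing ids count as 0.
    Assumption: linear gain (the label itself), not 2**label - 1.
    """
    gains = [relevance.get(skill_id, 0.0) for skill_id in ranked_ids]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


labels = {"a": 2, "b": 1}
perfect = graded_ndcg(["a", "b", "c"], labels)
reversed_order = graded_ndcg(["c", "b", "a"], labels)
```

A ranking that places higher-graded skills first scores 1.0; pushing them down the list lowers the score.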
Pipeline
A GIST-fine-tuned bi-encoder ranker produces the initial ranking; two-sided test-time augmentation enriches both sides of the encoder; a two-stage LLM-reranker cascade then refines the top of the ranking with pointwise scoring and a pairwise tournament.
┌──────────────────────────────────────────────┐
│ ① Job-title query "data science intern"      │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ② JobBERT ranker (GIST fine-tuned)           │
│   siamese bi-encoder · scores all            │
│   9,052 ESCO skills (no top-K cutoff)        │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ③ Two-sided test-time augmentation           │
│   doc-side: alias-explode (max-cosine)       │
│   query-side: multi-style HyDE               │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ④ Pointwise LLM rerank (top 500)             │
│   Qwen 0–9 relevance · z-score fusion        │
│   s = 0.3·z(s_be) + 0.7·z(s_llm)             │
└──────────────────────┬───────────────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│ ⑤ Pairwise tournament (top 150)              │
│   A/B comparisons · Bradley-Terry win count  │
└──────────────────────┬───────────────────────┘
                       ▼
          ┌─────────────────────────┐
          │   Ranked ESCO skills    │
          └─────────────────────────┘
Stage 2: JobBERT ranker with GIST fine-tuning
- Base encoder: JobBERT-v2, a 110M-parameter MPNet encoder pretrained on a proprietary job-skill corpus. Despite being the smallest candidate, it outperforms larger multilingual encoders (mE5-large, BGE-M3, ESCOXLM-R) by a wide margin in the zero-shot setting.
- GIST loss: Guided In-sample Selection of Training negatives uses a frozen guide model to filter in-batch false negatives, giving a cleaner contrastive signal without hard-negative mining. GIST beat MNRL by +0.033 graded nDCG and graded-ranking losses (CoSENT, AnglE) by +0.039–0.044.
- Augmented pairs: the original (job-alias, skill-alias) training pairs are augmented with (job-alias, ESCO description) pairs, doubling the effective training set and exposing the encoder to long descriptions in the same embedding space as short aliases.
- Full-ranking: with only 9,052 skills, the bi-encoder scores all of them without a top-K candidate cutoff, so it acts as a ranker, not a retriever.
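The false-negative filtering behind the GIST loss can be illustrated with a simplified sketch. It assumes precomputed similarity matrices from the trainable encoder and the frozen guide, masks any in-batch negative that the guide scores higher than the labelled positive, and applies a softmax cross-entropy over what remains. The function name `gist_loss`, the temperature value, and the single job-to-skill direction are assumptions for illustration, not the training code used here.

```python
import math


def gist_loss(student_sim, guide_sim, temperature=0.05):
    """Simplified GIST-style contrastive loss over one batch.

    student_sim[i][j]: trainable-encoder similarity between job i and skill j.
    guide_sim[i][j]:   frozen-guide similarity for the same pair.
    Pair (i, i) is the labelled positive; other columns are in-batch negatives.
    """
    n = len(student_sim)
    total = 0.0
    for i in range(n):
        threshold = guide_sim[i][i]
        # Drop negatives the guide rates above the positive: likely false negatives.
        kept = [j for j in range(n) if j == i or guide_sim[i][j] <= threshold]
        logits = [student_sim[i][j] / temperature for j in kept]
        # Numerically stable log-sum-exp over the kept candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - student_sim[i][i] / temperature
    return total / n


student = [[0.9, 0.8], [0.1, 0.9]]
guide_filtering = [[0.7, 0.95], [0.2, 0.8]]  # guide flags skill 1 as a false negative for job 0
guide_keep_all = [[1.0, 0.0], [0.0, 1.0]]    # guide keeps every in-batch negative
loss_filtered = gist_loss(student, guide_filtering)
loss_unfiltered = gist_loss(student, guide_keep_all)
```

With the filtering guide, the high-scoring negative for job 0 no longer contributes to the denominator, so the loss stops penalising the encoder for ranking a plausible match highly.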
Stage 3: Two-sided test-time augmentation
This stage delivers the single largest gain in the pipeline (+0.116 graded nDCG), purely from indexing and output design, with no model change.
Doc-side: alias-explode
Rather than concatenating all aliases of a skill into one document, each alias is encoded as an independent document. A skill's score is the max cosine over its alias embeddings, capturing whichever alias view best matches the query.
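A minimal sketch of alias-explode scoring, assuming alias embeddings are already computed and stored as one index entry per alias. The names `cosine` and `score_skills` and the toy two-dimensional vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_skills(query_vec, alias_index):
    """Score each skill by its best-matching alias.

    alias_index: list of (skill_id, alias_embedding), one entry per alias.
    Returns (skill_id, score) pairs sorted by descending score.
    """
    best = {}
    for skill_id, alias_vec in alias_index:
        sim = cosine(query_vec, alias_vec)
        if sim > best.get(skill_id, -math.inf):
            best[skill_id] = sim
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


index = [
    ("python", [1.0, 0.0]),  # alias that matches the query
    ("python", [0.0, 1.0]),  # unrelated alias of the same skill
    ("sql", [1.0, 1.0]),
]
ranked = score_skills([1.0, 0.0], index)
```

Because each alias is scored separately, a single well-matching alias is enough to lift the skill; unrelated aliases of the same skill do not dilute the score.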
Query-side: multi-style HyDE
For each query, Qwen generates three hypothetical skill descriptions in different styles (long paragraph, one-sentence, keyword list). Each is encoded with the same bi-encoder; the skill score is the max cosine over the original query and the three HyDE views. Max-pooling over all three styles beats any single style.
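The max-pooling over query views can be sketched as follows. It assumes the original query and the three generated descriptions have already been embedded with the same bi-encoder; text generation is out of scope, and the names `multi_view_score` and `rank_with_views` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def multi_view_score(view_vecs, skill_vec):
    """Max cosine between a skill and any query view (original query or HyDE text)."""
    return max(cosine(view, skill_vec) for view in view_vecs)


def rank_with_views(view_vecs, skill_vecs):
    """Rank skill ids by their max-pooled score over all query views."""
    scores = {sid: multi_view_score(view_vecs, vec) for sid, vec in skill_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


skills = {"a": [0.0, 1.0], "b": [1.0, 1.0]}
query_only = rank_with_views([[1.0, 0.0]], skills)
with_hyde = rank_with_views([[1.0, 0.0], [0.0, 1.0]], skills)  # second view stands in for a HyDE text
```

In this toy case the extra view surfaces skill "a", which the bare query embedding does not match at all.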
Stage 4–5: LLM-reranker cascade
Qwen 2.5-7B-Instruct-AWQ is applied zero-shot as a reranker in two complementary modes on top of the bi-encoder output. Together, the two LLM stages add +0.043 on test.
- Pointwise scoring: the LLM rates the top-500 candidates for absolute relevance on a 0–9 scale; bi-encoder and LLM scores are fused per query via z-score normalisation, s = 0.3·z(s_be) + 0.7·z(s_llm) (weight tuned on validation). Contributes +0.022.
- Pairwise tournament: the top-150 are reordered via a pairwise A/B tournament. Each candidate accumulates a win count over 149 matches, with A/B order swapped for half the comparisons to mitigate positional bias. Under the Bradley-Terry model, total wins approximate the latent relevance ranking. Contributes +0.019.
- Shared LLM: one Qwen Instruct-AWQ model serves three zero-shot roles: HyDE generation, 0–9 pointwise scoring, and A/B pairwise comparison.
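The two reranking steps can be sketched together. This is a minimal illustration under stated assumptions: z-scores use the population standard deviation, the presentation order alternates on every other comparison, and the LLM judge is replaced by a hypothetical callback `prefers_first`. The names `zscores`, `fuse`, and `tournament` are not from the original system.

```python
import statistics
from itertools import combinations


def zscores(values):
    """Standardise one query's scores; constant lists map to all zeros."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


def fuse(be_scores, llm_scores, w_be=0.3, w_llm=0.7):
    """Per-query fusion: s = 0.3*z(s_be) + 0.7*z(s_llm)."""
    pairs = zip(zscores(be_scores), zscores(llm_scores))
    return [w_be * be + w_llm * llm for be, llm in pairs]


def tournament(candidates, prefers_first):
    """Reorder candidates by win count in a round-robin A/B tournament.

    prefers_first(a, b) returns True when the judge picks the option shown first.
    """
    wins = {c: 0 for c in candidates}
    for n, (x, y) in enumerate(combinations(candidates, 2)):
        # Swap which candidate is shown first on every other comparison.
        first, second = (x, y) if n % 2 == 0 else (y, x)
        winner = first if prefers_first(first, second) else second
        wins[winner] += 1
    # sorted() is stable, so tied candidates keep their incoming order.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)


fused = fuse([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
true_relevance = {"a": 3, "b": 2, "c": 1}
reordered = tournament(
    ["c", "a", "b"],
    lambda first, second: true_relevance[first] >= true_relevance[second],
)
```

In the round-robin, each of n candidates plays n - 1 matches, which gives the 149 matches per candidate quoted above for the top 150.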