🥇 1st Place — CLEF 2025 GutBrainIE
Biomedical NER System
Overview
Developed a state-of-the-art Named Entity Recognition (NER) system for the CLEF 2025 GutBrainIE shared task, extracting biomedical entities from PubMed abstracts on the gut-brain axis. The core philosophy: "Smart Fine-Tuning Is All You Need".
🏆 Achievement: Ranked 1st on the leaderboard (Micro-F1: 0.8408), roughly 3 Micro-F1 points above the baseline (0.8117), through strategic fine-tuning and ensemble techniques.
Key Innovations
- Differential Learning Rates: Applied different learning rates to the BERT encoder (2e-5), classifier head (2e-4), and CRF (2e-3), respecting their different learning dynamics (see the optimizer sketch after this list)
- Training Format Consistency: Matched the training format to the inference format (separate title/abstract tokenization), yielding a significant improvement
- Weight-Based Ensemble: A novel approach that averages emission and transition matrices across models trained with different seeds (sketched under Technical Pipeline)
- Inference Truncation Removal: Dropping max_len truncation during inference prevented performance loss on overlong texts (also shown in the tokenization sketch below)
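A minimal sketch of the differential-learning-rate setup using AdamW parameter groups. The module layout (`bert`, `classifier`, CRF transition matrix) is an assumption about the PubMedBERT-CRF architecture, not the actual project code:

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# Illustrative stand-in for the PubMedBERT-CRF tagger; attribute names are assumptions.
class BertCrfTagger(nn.Module):
    def __init__(self, hidden=768, num_labels=9):
        super().__init__()
        self.bert = nn.Linear(hidden, hidden)            # placeholder for the encoder
        self.classifier = nn.Linear(hidden, num_labels)  # emission head
        self.crf_transitions = nn.Parameter(torch.randn(num_labels, num_labels))

model = BertCrfTagger()

# Three parameter groups, three learning rates: the pretrained encoder moves
# cautiously, the freshly initialized classifier faster, and the CRF fastest.
optimizer = AdamW([
    {"params": model.bert.parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 2e-4},
    {"params": [model.crf_transitions], "lr": 2e-3},
])
```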
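And a sketch of format-consistent tokenization with truncation removed at inference. The checkpoint ID is the public PubMedBERT release on the Hugging Face Hub, assumed here rather than confirmed from the project:

```python
from transformers import AutoTokenizer

# Assumed checkpoint: the public PubMedBERT release.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

title = "Gut microbiota composition and anxiety-like behaviour"
abstract = "We review evidence linking the gut microbiome to ..."

# Training mirrors inference: title and abstract are tokenized as two
# independent sequences, never concatenated into a single input.
train_title = tokenizer(title, truncation=True, max_length=512)
train_abstract = tokenizer(abstract, truncation=True, max_length=512)

# At inference, the max_length cap is dropped so overlong abstracts are
# tagged in full instead of being silently cut off.
infer_abstract = tokenizer(abstract)
```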
Experiment Results
Systematic experiments revealed key insights:
- Seed Selection Matters: Seed 42 consistently underperformed; seeds 11 and 17 showed better results
- Noisy Data Helps: Adding lower-quality Bronze data improved performance, indicating the model benefits from additional training examples despite noise
- Ensemble Stability: Combining models (seeds 42, 11, 17) provided consistent improvements
Final Results: Baseline 0.7211 → Best Ensemble 0.7773 (Macro-F1) | 0.8117 → 0.8408 (Micro-F1)
Competition Results
The team's runs were the best-performing on T61 (the NER task) by Micro-F1:
| Run | System           | Macro-F1 | Micro-F1 |
|-----|------------------|----------|----------|
| 1   | PubMedBERT-CRF   | 0.7100   | 0.8120   |
| 2   | AugEnsemble      | 0.7613   | 0.8408   |
| 3   | Ensemble         | 0.7634   | 0.8328   |
| 4   | EnsembleContGood | 0.7686   | 0.8361   |
Technical Pipeline
- Data Tokenization: Separate title/abstract tokenization matching the inference format
- Model Training: PubMedBERT-CRF with differential learning rates, trained with 3 seeds
- Ensemble: Weight-based ensemble averaging emission and transition matrices (see the sketch after this list)
- Data Augmentation: Scraped 500 gut-brain-axis papers from PubMed for pseudo-labeling
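A hedged sketch of the weight-based ensemble: same-architecture checkpoints from the three seed runs are averaged parameter-by-parameter, which for a BERT-CRF tagger averages both the emission-layer weights and the CRF transition matrix. File names are hypothetical:

```python
import torch

def average_checkpoints(paths):
    """Average same-architecture state dicts parameter-by-parameter.

    Assumes every checkpoint shares the same keys and shapes, and that
    all averaged tensors are floating point.
    """
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical file names for the three seed runs (42, 11, 17).
merged = average_checkpoints(
    ["pubmedbert_crf_seed42.pt", "pubmedbert_crf_seed11.pt", "pubmedbert_crf_seed17.pt"]
)
# Load into a fresh model instance: model.load_state_dict(merged)
```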
Tech Stack
- Python
- PyTorch
- Transformers
- PubMedBERT
- CRF
- Hugging Face
- Jupyter