🥇 1st Place — CLEF 2025 GutBrainIE
Biomedical NER System
Overview
Developed a state-of-the-art Named Entity Recognition (NER) system for the CLEF 2025 GutBrainIE shared task, extracting biomedical entities from PubMed abstracts on the gut-brain axis. The core philosophy: "Smart Fine-Tuning Is All You Need".
🏆 Achievement: Ranked 1st on the leaderboard (Micro-F1: 0.8408), roughly 3 Micro-F1 points above the baseline (0.8117), through strategic fine-tuning and ensemble techniques.
Key Innovations
- Differential Learning Rates: Applied different learning rates to the BERT encoder (2e-5), classifier head (2e-4), and CRF (2e-3), respecting their different learning dynamics (see the optimizer sketch after this list)
- Training Format Consistency: Matched the training format to the inference format (separate title/abstract tokenization), yielding a significant improvement
- Weight-Based Ensemble: A novel approach that averages emission and transition matrices across models trained with different seeds (sketched under Technical Pipeline)
- Inference Truncation Removal: Dropping max_len truncation during inference prevented performance loss on overlong texts (also shown in the tokenization sketch below)
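A minimal sketch of the differential-learning-rate setup using AdamW parameter groups. The module layout (`bert`, `classifier`, CRF transition matrix) is an assumption about the PubMedBERT-CRF architecture, not the actual project code:

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# Illustrative stand-in for the PubMedBERT-CRF tagger; attribute names are assumptions.
class BertCrfTagger(nn.Module):
    def __init__(self, hidden=768, num_labels=9):
        super().__init__()
        self.bert = nn.Linear(hidden, hidden)            # placeholder for the encoder
        self.classifier = nn.Linear(hidden, num_labels)  # emission head
        self.crf_transitions = nn.Parameter(torch.randn(num_labels, num_labels))

model = BertCrfTagger()

# Three parameter groups, three learning rates: the pretrained encoder moves
# cautiously, the freshly initialized classifier faster, and the CRF fastest.
optimizer = AdamW([
    {"params": model.bert.parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 2e-4},
    {"params": [model.crf_transitions], "lr": 2e-3},
])
```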
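And a sketch of format-consistent tokenization with truncation removed at inference. The checkpoint ID is the public PubMedBERT release on the Hugging Face Hub, assumed here rather than confirmed from the project:

```python
from transformers import AutoTokenizer

# Assumed checkpoint: the public PubMedBERT release.
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

title = "Gut microbiota composition and anxiety-like behaviour"
abstract = "We review evidence linking the gut microbiome to ..."

# Training mirrors inference: title and abstract are tokenized as two
# independent sequences, never concatenated into a single input.
train_title = tokenizer(title, truncation=True, max_length=512)
train_abstract = tokenizer(abstract, truncation=True, max_length=512)

# At inference, the max_length cap is dropped so overlong abstracts are
# tagged in full instead of being silently cut off.
infer_abstract = tokenizer(abstract)
```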
Experiment Results
Systematic experiments revealed key insights:
- Seed Selection Matters: Seed 42 consistently underperformed; seeds 11 and 17 showed better results
- Noisy Data Helps: Adding lower-quality Bronze data improved performance, indicating the model benefits from additional training examples despite noise
- Ensemble Stability: Combining models (seeds 42, 11, 17) provided consistent improvements
Final Results: Baseline 0.7211 → Best Ensemble 0.7773 (Macro-F1) | 0.8117 → 0.8408 (Micro-F1)
Competition Results
The team's runs were the best-performing on T61 (the NER task) by Micro-F1:
| Run | System           | Macro-F1 | Micro-F1 |
|-----|------------------|----------|----------|
| 1   | PubMedBERT-CRF   | 0.7100   | 0.8120   |
| 2   | AugEnsemble      | 0.7613   | 0.8408   |
| 3   | Ensemble         | 0.7634   | 0.8328   |
| 4   | EnsembleContGood | 0.7686   | 0.8361   |
Technical Pipeline
- Data Tokenization: Separate title/abstract tokenization matching the inference format
- Model Training: PubMedBERT-CRF with differential learning rates, trained with 3 seeds
- Ensemble: Weight-based ensemble averaging emission and transition matrices (see the sketch after this list)
- Data Augmentation: Scraped 500 gut-brain-axis papers from PubMed for pseudo-labeling
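A hedged sketch of the weight-based ensemble: same-architecture checkpoints from the three seed runs are averaged parameter-by-parameter, which for a BERT-CRF tagger averages both the emission-layer weights and the CRF transition matrix. File names are hypothetical:

```python
import torch

def average_checkpoints(paths):
    """Average same-architecture state dicts parameter-by-parameter.

    Assumes every checkpoint shares the same keys and shapes, and that
    all averaged tensors are floating point.
    """
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical file names for the three seed runs (42, 11, 17).
merged = average_checkpoints(
    ["pubmedbert_crf_seed42.pt", "pubmedbert_crf_seed11.pt", "pubmedbert_crf_seed17.pt"]
)
# Load into a fresh model instance: model.load_state_dict(merged)
```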
Tech Stack
- Python
- PyTorch
- Transformers
- PubMedBERT
- CRF
- Hugging Face
- Jupyter