GutBrainIE 2026 NER · Linking · RE
Overview
NightSun is a single sequential pipeline that addresses all four GutBrainIE 2026 subtasks over gut-brain axis biomedical abstracts: named entity recognition (T611), entity disambiguation / linking (T612), and mention- and concept-level relation extraction (T621/T622). T611 entities feed T612 disambiguation and T621 relation extraction; T621 mention-level triples are then lifted to canonical concept URIs via T612.
Achievement: Ranked 1st in all four official subtasks. The pipeline builds on our prior 2025 first-place NER system, extending it end-to-end to entity linking and relation extraction.
Why this domain is hard
- Heterogeneous vocabulary: a single abstract may reference microorganism taxonomy (Lactobacillus rhamnosus GG), dietary components (inulin), psychiatric disorders (depression), and statistical methods (ANOVA), each demanding a different recognition signal.
- Overlapping categories: the 13 entity types have fuzzy boundaries, for example microbiome vs. bacteria, food vs. dietary supplement, and DDF spanning both diseases and clinical findings.
- Evolving taxonomy: NCBI Taxonomy and GTDB disagree on species boundaries, complicating disambiguation.
- Near-synonymous predicates: relation types like influence / affect / impact share surface patterns but carry distinct semantics that classifiers easily confuse.
Pipeline
┌────────────────────────────────────────────────┐
│   Gut-brain axis PubMed abstract (raw text)    │
└────────────────────────┬───────────────────────┘
                         ▼
┌────────────────────────────────────────────────┐
│ T611 · NER                                     │
│ BiomedBERT-CRF · 27 BIO labels (13 types)      │
│ chunked inference → cross-model voting         │
└────────────────────────┬───────────────────────┘
            entity spans │
           ┌─────────────┴────────────┐
           ▼                          ▼
┌──────────────────────┐  ┌──────────────────────┐
│ T612 · Entity        │  │ T621 · Mention-level │
│ Linking (NERD)       │  │ Relation Extraction  │
│ dict → SapBERT →     │  │ typed entity markers │
│ type-prior rerank    │  │ CE-Dice · 18 classes │
└──────────┬───────────┘  └───────────┬──────────┘
           │ canonical URIs           │ mention triples
           └─────────────┬────────────┘
                         ▼
             ┌────────────────────────┐
             │ T622 · Concept-level   │
             │ RE: lift triples to    │
             │ canonical concept URIs │
             └────────────────────────┘
T611 · Named Entity Recognition
Token-level sequence labeling with BertCrfForTokenClassification: a BiomedBERT encoder, a linear emission head, and a CRF layer over a 27-label BIO scheme.
- Differential learning rates: 2e-5 (encoder), 2e-4 (linear head), 2e-3 (CRF transition matrix), respecting the different convergence speeds of pre-trained vs. randomly initialized components.
- Bronze relabeling: the 2,972 distant-supervision "bronze" articles are re-annotated by a cross-model ensemble trained only on gold+silver data, keeping spans on which at least 2 of 3 models agree. This cuts false positives (gene agreement with gold-trained models was only 59%).
- Chunked inference: long abstracts are split into ≤512-token chunks at punctuation boundaries, and predictions are merged back to original offsets so that entities near the end of long abstracts are not lost to truncation.
- Within-model ensembling: three seeds (7, 11, 17) are averaged at the emission + CRF transition level before Viterbi decoding; this soft averaging consistently beats majority voting on predictions.
- Cross-model unanimous voting: the winning run requires 3/3 agreement across heterogeneous encoder-data combinations; per-class analysis shows each encoder has a categorical advantage tracing back to its pretraining corpus.
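The chunked-inference step above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names `chunk_text` and `merge_predictions` are hypothetical, whitespace-delimited words stand in for the encoder's subword tokens, and the set of punctuation marks treated as boundaries is an assumption.

```python
import re

# Assumption: whitespace-delimited words stand in for subword tokens;
# a real pipeline would count tokens with the encoder's own tokenizer.
TOKEN = re.compile(r"\S+")
BOUNDARY = ".;:!?,"  # a token ending in one of these is a preferred split point


def chunk_text(text, max_tokens=512):
    """Split text into (start, end) character windows of at most max_tokens tokens."""
    tokens = [(m.start(), m.end()) for m in TOKEN.finditer(text)]
    chunks, i = [], 0
    while i < len(tokens):
        j = min(i + max_tokens, len(tokens))  # exclusive index at the hard limit
        if j < len(tokens):
            # Walk back to the last token that ends with punctuation, if any.
            k = j
            while k > i and text[tokens[k - 1][1] - 1] not in BOUNDARY:
                k -= 1
            if k > i:
                j = k  # otherwise fall back to a hard split at the limit
        chunks.append((tokens[i][0], tokens[j - 1][1]))
        i = j
    return chunks


def merge_predictions(chunks, per_chunk_spans):
    """Shift chunk-relative (start, end, label) spans back to document offsets."""
    merged = []
    for (chunk_start, _), spans in zip(chunks, per_chunk_spans):
        for start, end, label in spans:
            merged.append((chunk_start + start, chunk_start + end, label))
    return sorted(merged)
```

Each chunk is passed to the tagger on its own, and `merge_predictions` adds the chunk's start offset to every predicted span so that the final spans index into the original abstract.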
T612 · Entity Linking (three-stage cascade)
Linking uses a unified knowledge base of 52,263 concept aliases derived from MeSH, NCBI Gene, GTDB, NCBI Taxonomy, FoodOn, and USDA FNDDS, plus a direct 18,512-entry surface-form lookup table built from the training annotations.
- Dictionary lookup: normalized (surface form, entity type) query against the lookup table; handles common unambiguous mentions with near-perfect precision in under a second.
- SapBERT dense retrieval: unresolved mentions are encoded with SapBERT, and the top-k (k=20) candidate URIs are retrieved from a FAISS index over the 52,263 alias embeddings (95.6% top-1 accuracy where a correct KB entry exists).
- Type-prior reranking: candidates are reranked by sim(e,c) + α·log p(c|t), correcting type-ambiguous surface forms (e.g. a chemical that also appears as a drug) without over-relying on frequency priors.
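The reranking score sim(e,c) + α·log p(c|t) can be illustrated with a short sketch. The function name `rerank`, the default `alpha=0.5`, and the `floor` probability used for unseen (concept, type) pairs are assumptions made for this example; the source does not state the value of α or how missing priors are handled.

```python
import math


def rerank(candidates, type_prior, mention_type, alpha=0.5, floor=1e-6):
    """Order candidates by sim(e, c) + alpha * log p(c | t), best first.

    candidates:   list of (concept_id, similarity) pairs from dense retrieval.
    type_prior:   dict mapping (concept_id, entity_type) -> p(c | t).
    mention_type: entity type t predicted for the mention.
    floor:        assumed probability for unseen pairs, so the log is defined.
    """
    def score(item):
        concept_id, sim = item
        prior = type_prior.get((concept_id, mention_type), floor)
        return sim + alpha * math.log(prior)

    return sorted(candidates, key=score, reverse=True)


# Hypothetical identifiers: one surface form with a drug sense and a chemical sense.
candidates = [("kb:drug_sense", 0.91), ("kb:chemical_sense", 0.90)]
type_prior = {
    ("kb:drug_sense", "drug"): 0.7,
    ("kb:chemical_sense", "chemical"): 0.6,
}
best = rerank(candidates, type_prior, mention_type="chemical")[0][0]
```

With similarity alone the drug sense would win by a narrow margin; once the type prior is added, a mention tagged as a chemical resolves to the chemical sense. Keeping α small relative to the similarity range is what stops the prior from dominating.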
Key finding: retrieval accuracy on in-KB entities is near-ceiling; the real bottleneck is KB coverage, not retrieval quality, which explains the dev-to-test gap on this subtask.
T621 / T622 · Relation Extraction
- Typed entity markers: BiomedBERT fine-tuned for 18-class relation classification (no_relation + 17 predicates); subject and object spans are wrapped with type-bearing markers (@ * t · span * @ and # t · span #) so that entity-type information stays in the encoder's receptive field.
- CE-Dice loss: Dice loss combined with cross-entropy to handle the severe class imbalance, where no_relation pairs vastly outnumber positives; this gives stable gradients while keeping an effective focus on minority classes.
- Per-class threshold sweep: soft-probability ensembling with a per-predicate threshold sweep handles the 17-predicate imbalance better than majority voting.
- T622: mention-level T621 triples are lifted to canonical concept-level triples using the T612 URI assignments.
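The T622 lifting step can be sketched as a lookup over the T612 assignments. The function name `lift_triples`, the mention identifiers, and the `ex:` URIs are hypothetical; dropping triples with an unlinked argument and collapsing duplicates are assumptions of this sketch, since the source only states that triples are lifted.

```python
def lift_triples(mention_triples, mention_to_uri):
    """Map (subject_mention, predicate, object_mention) triples to concept URIs.

    Triples with an unlinked subject or object are dropped, and triples that
    collapse onto the same concept pair are emitted once, in first-seen order.
    """
    seen, lifted = set(), []
    for subj, pred, obj in mention_triples:
        s_uri, o_uri = mention_to_uri.get(subj), mention_to_uri.get(obj)
        if s_uri is None or o_uri is None:
            continue  # assumption: no concept-level triple without both URIs
        triple = (s_uri, pred, o_uri)
        if triple not in seen:
            seen.add(triple)
            lifted.append(triple)
    return lifted
```

Because several mentions in one abstract often share a concept, the concept-level output is typically smaller than the mention-level input.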
Error analysis: the oracle-to-pipeline RE gap is driven mostly by upstream NER errors, not by the relation classifier itself, which makes NER recall the highest-leverage direction for further gains.
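For the CE-Dice objective used by the relation classifier, one common formulation adds a soft multi-class Dice term to the cross-entropy. The sketch below is written in plain Python for readability. The additive combination, the `dice_weight` factor, and the smoothing constant `eps` are assumptions, as the source does not give the exact form.

```python
import math


def ce_dice_loss(probs, targets, num_classes, dice_weight=1.0, eps=1e-8):
    """Cross-entropy plus macro-averaged soft Dice loss over a batch.

    probs:   list of per-example probability lists (each sums to 1).
    targets: list of gold class indices, one per example.
    """
    n = len(targets)
    # Cross-entropy: mean negative log-probability of the gold class.
    ce = -sum(math.log(max(p[y], eps)) for p, y in zip(probs, targets)) / n

    # Soft Dice per class: 1 - 2 * overlap / (predicted mass + gold count).
    dice_terms = []
    for c in range(num_classes):
        overlap = sum(p[c] for p, y in zip(probs, targets) if y == c)
        pred_mass = sum(p[c] for p in probs)
        gold_count = sum(1 for y in targets if y == c)
        dice_terms.append(1.0 - (2.0 * overlap + eps) / (pred_mass + gold_count + eps))
    dice = sum(dice_terms) / num_classes

    return ce + dice_weight * dice
```

The Dice term is normalized per class, so a rare predicate contributes as much to the loss as no_relation, while the cross-entropy term keeps gradients well behaved early in training.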