🥉 Kaggle Bronze Medal — 76th / 1000+
Curriculum Recommender
Overview
Developed a machine learning solution for the Kaggle "Learning Equality - Curriculum Recommendations" competition. The goal was to match educational content with curriculum topics to help students in underserved communities access relevant learning materials.
🎯 Challenge: Given a curriculum topic, retrieve and rank the most relevant educational content from a large corpus of learning materials across multiple languages.
Stage 1: Retrieval
Data Strategy
- Used the source category as training data in all folds but excluded it from validation to prevent data leakage
- Applied StratifiedGroupKFold on non-source category topics (group=topic, target=language)
- Split correlations rather than directly splitting topics/content for train/validation
Model & Training
- Base Model: sentence-transformers/all-MiniLM-L6-v2 (compact size for faster experimentation)
- Training Approach: SimCSE contrastive learning
- Grouped samples by language before shuffling so that each batch contained a single language (small recall improvement)
- Hard negative mining significantly improved model recall
- Multi-round boosting with non-source data improved top-50/top-100 recall for two rounds
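Hard negative mining, the biggest recall win above, works by embedding topics and content with the current bi-encoder, then taking each topic's highest-scoring *non-matching* content as negatives for the next training round. The sketch below uses brute-force cosine search over random vectors in place of the real encoder + FAISS index; all names are illustrative:

```python
import numpy as np

def mine_hard_negatives(topic_emb, content_emb, positives, k=5):
    """For each topic, return the top-k most similar non-matching
    content indices -- the hard negatives for the next round.

    topic_emb:   (T, d) L2-normalised topic embeddings
    content_emb: (C, d) L2-normalised content embeddings
    positives:   dict topic_idx -> set of true content indices
    """
    sims = topic_emb @ content_emb.T          # cosine similarity matrix
    hard = {}
    for t in range(topic_emb.shape[0]):
        order = np.argsort(-sims[t])          # most similar first
        negs = [c for c in order if c not in positives.get(t, set())]
        hard[t] = negs[:k]
    return hard

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-in for real encoder outputs (illustrative only).
rng = np.random.default_rng(0)
topics = l2norm(rng.normal(size=(3, 8)))
contents = l2norm(rng.normal(size=(10, 8)))
hard = mine_hard_negatives(topics, contents, {0: {1, 2}, 1: {3}}, k=4)
```

In the full pipeline a FAISS index would replace the dense matrix product, but the selection logic — nearest neighbours minus known positives — is the same.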
Model Ensemble
Combined predictions using voting + average cosine similarity ranking for topic-content pairs.
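A minimal sketch of the voting + average-cosine-similarity combination, assuming each model contributes a dict of candidate content IDs with their similarities (the exact aggregation details are illustrative):

```python
from collections import defaultdict

def ensemble_rank(model_preds, top_k=3):
    """Combine per-model candidate lists for one topic.

    model_preds: list of dicts {content_id: cosine_similarity},
    one dict per model (each holding that model's top candidates).
    Candidates are ranked first by vote count (how many models
    retrieved them), then by average cosine similarity.
    """
    votes = defaultdict(int)
    sims = defaultdict(list)
    for preds in model_preds:
        for cid, sim in preds.items():
            votes[cid] += 1
            sims[cid].append(sim)
    scored = sorted(
        votes,
        key=lambda c: (votes[c], sum(sims[c]) / len(sims[c])),
        reverse=True,
    )
    return scored[:top_k]

# Example: "c2" is retrieved by all three models, so it ranks first.
preds = [
    {"c1": 0.90, "c2": 0.80},
    {"c2": 0.85, "c3": 0.70},
    {"c2": 0.60, "c4": 0.75},
]
```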
Stage 2: Ranking
- Binary classification model (softmax + CrossEntropyLoss) for re-ranking
- Used threshold tuning to handle sample imbalance
- Tokenized inputs in pair format to keep text1/text2 lengths consistent, reusing the Stage 1 prompts
- Added the Stage 1 recall rank as a text feature (small improvement observed)
- CV strategy aligned with LB evaluation strategy
- Larger re-ranking models showed clear improvements
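The threshold tuning mentioned above can be sketched as a grid sweep that picks the probability cutoff maximising a recall-weighted F-beta over the re-ranker's positive-class probabilities (the metric choice and names here are illustrative):

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta=2 weights recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tune_threshold(probs, labels, grid=None):
    """Sweep cutoffs and keep the one that maximises F-beta.
    Suits the imbalanced re-ranking setting, where a fixed 0.5
    cutoff would discard too many true pairs."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_f = 0.5, -1.0
    for t in grid:
        preds = [p >= t for p in probs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        f = fbeta(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy re-ranker outputs: a clean separation around 0.5.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, True, False, False, False]
t, f = tune_threshold(probs, labels)
```

In practice the sweep would run on out-of-fold predictions so the chosen cutoff transfers to the leaderboard.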
Key Insights
- Non-source Topics Critical: their distribution matches the LB test set, making them essential for generalization
- Hard Negative Mining: Most effective strategy for improving retrieval performance
- Multi-round Boosting: Iterative training with hard examples from previous models
- Model Size Trade-off: Increasing max_len/batch_size beyond 128/192 showed diminishing returns; larger models might have been more effective
- CV-LB Alignment: Consistent validation strategy crucial for reliable model selection
Results
- Achieved a Bronze Medal, finishing 76th out of 1000+ teams
- Developed a scalable two-stage retrieve-and-rank pipeline
- Gained experience in handling multi-lingual educational content matching
Tech Stack
Python
PyTorch
Transformers
SimCSE
FAISS
Sentence-BERT
XGBoost
Pandas