🥉 Kaggle Bronze Medal — 76th / 1000+
Curriculum Recommender
Overview
Developed a machine learning solution for the Kaggle "Learning Equality - Curriculum Recommendations" competition. The goal was to match educational content with curriculum topics to help students in underserved communities access relevant learning materials.
🎯 Challenge: Given a curriculum topic, retrieve and rank the most relevant educational content from a large corpus of learning materials across multiple languages.
Stage 1: Retrieval
Data Strategy
- Used the source category as training data in all folds but excluded it from validation to prevent data leakage
- Applied StratifiedGroupKFold on non-source category topics (group=topic, target=language)
- Split correlations rather than directly splitting topics/content for train/validation
Model & Training
- Base Model: sentence-transformers/all-MiniLM-L6-v2 (compact size for faster experimentation)
- Training Approach: SimCSE contrastive learning
- Grouped samples by language before shuffling so that each batch contained a single language (small recall improvement)
- Hard negative mining significantly improved model recall
- Multi-round boosting with non-source data improved top-50/top-100 recall for two rounds
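Hard negative mining, the biggest recall win above, works by embedding topics and content with the current bi-encoder, then taking each topic's highest-scoring *non-matching* content as negatives for the next training round. The sketch below uses brute-force cosine search over random vectors in place of the real encoder + FAISS index; all names are illustrative:

```python
import numpy as np

def mine_hard_negatives(topic_emb, content_emb, positives, k=5):
    """For each topic, return the top-k most similar non-matching
    content indices -- the hard negatives for the next round.

    topic_emb:   (T, d) L2-normalised topic embeddings
    content_emb: (C, d) L2-normalised content embeddings
    positives:   dict topic_idx -> set of true content indices
    """
    sims = topic_emb @ content_emb.T          # cosine similarity matrix
    hard = {}
    for t in range(topic_emb.shape[0]):
        order = np.argsort(-sims[t])          # most similar first
        negs = [c for c in order if c not in positives.get(t, set())]
        hard[t] = negs[:k]
    return hard

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Stand-in for real encoder outputs (illustrative only).
rng = np.random.default_rng(0)
topics = l2norm(rng.normal(size=(3, 8)))
contents = l2norm(rng.normal(size=(10, 8)))
hard = mine_hard_negatives(topics, contents, {0: {1, 2}, 1: {3}}, k=4)
```

In the full pipeline a FAISS index would replace the dense matrix product, but the selection logic — nearest neighbours minus known positives — is the same.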
Model Ensemble
Combined predictions using voting + average cosine similarity ranking for topic-content pairs.
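A minimal sketch of the voting + average-cosine-similarity combination, assuming each model contributes a dict of candidate content IDs with their similarities (the exact aggregation details are illustrative):

```python
from collections import defaultdict

def ensemble_rank(model_preds, top_k=3):
    """Combine per-model candidate lists for one topic.

    model_preds: list of dicts {content_id: cosine_similarity},
    one dict per model (each holding that model's top candidates).
    Candidates are ranked first by vote count (how many models
    retrieved them), then by average cosine similarity.
    """
    votes = defaultdict(int)
    sims = defaultdict(list)
    for preds in model_preds:
        for cid, sim in preds.items():
            votes[cid] += 1
            sims[cid].append(sim)
    scored = sorted(
        votes,
        key=lambda c: (votes[c], sum(sims[c]) / len(sims[c])),
        reverse=True,
    )
    return scored[:top_k]

# Example: "c2" is retrieved by all three models, so it ranks first.
preds = [
    {"c1": 0.90, "c2": 0.80},
    {"c2": 0.85, "c3": 0.70},
    {"c2": 0.60, "c4": 0.75},
]
```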
Stage 2: Ranking
- Binary classification model (softmax + CrossEntropyLoss) for re-ranking
- Used threshold tuning to handle sample imbalance
- Tokenized inputs in pair format to keep text1/text2 lengths consistent, reusing the Stage 1 prompts
- Added the Stage 1 recall rank as a text feature (small improvement observed)
- CV strategy aligned with LB evaluation strategy
- Larger re-ranking models showed clear improvements
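The threshold tuning mentioned above can be sketched as a grid sweep that picks the probability cutoff maximising a recall-weighted F-beta over the re-ranker's positive-class probabilities (the metric choice and names here are illustrative):

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta=2 weights recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def tune_threshold(probs, labels, grid=None):
    """Sweep cutoffs and keep the one that maximises F-beta.
    Suits the imbalanced re-ranking setting, where a fixed 0.5
    cutoff would discard too many true pairs."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_f = 0.5, -1.0
    for t in grid:
        preds = [p >= t for p in probs]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        f = fbeta(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Toy re-ranker outputs: a clean separation around 0.5.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, True, False, False, False]
t, f = tune_threshold(probs, labels)
```

In practice the sweep would run on out-of-fold predictions so the chosen cutoff transfers to the leaderboard.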
Key Insights
- Non-source Topics Critical: their distribution matches the LB test set, making them essential for generalization
- Hard Negative Mining: Most effective strategy for improving retrieval performance
- Multi-round Boosting: Iterative training with hard examples from previous models
- Model Size Trade-off: Increasing max_len/batch_size beyond 128/192 showed diminishing returns; larger models might have been more effective
- CV-LB Alignment: Consistent validation strategy crucial for reliable model selection
Results
- Achieved a Bronze Medal, finishing 76th out of 1000+ teams
- Developed a scalable two-stage retrieve-and-rank pipeline
- Gained experience in handling multi-lingual educational content matching
Tech Stack
Python
PyTorch
Transformers
SimCSE
FAISS
Sentence-BERT
XGBoost
Pandas