Datathon Teams & Tasks
35 participants • 7 teams
Applications of Data Science in Health AI using Vector Embeddings and Clinical Metadata
This datathon uses retinal image vector embeddings extracted from MedSigLIP, DINOv2, and RETFound models, combined with patient metadata to investigate clinically meaningful machine learning applications in diabetic retinopathy screening.
Teams & Assignments
Team 1
Group 1: Multimodal Fusion Study
Members
- Roland Mfondoum
- Samuel Kyobe
- EFFAH JEMIMA N'DA
- Joel Sabiti
- Hafeez Akinniyi
Goal
Study the impact of combining metadata and embeddings.
Suggested tasks
- Compare metadata-only vs embeddings-only vs fused models
- Experiment with fusion strategies
- Perform ablation studies
- Analyze the added value of metadata
Expected deliverables
- Fusion performance comparison
- Ablation analysis
- Recommendations on multimodal learning
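As a starting point for the metadata-only vs embeddings-only vs fused comparison, the sketch below shows one possible early-fusion baseline: each feature block is standardized with training-split statistics and the blocks are concatenated column-wise. The data is synthetic, and the helper names (`standardize`, `build_features`, `nearest_centroid_accuracy`) and the nearest-centroid classifier are illustrative assumptions, not part of the datathon materials. Any downstream model could replace the classifier.

```python
import numpy as np

def standardize(train, test):
    """Z-score both splits using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero after centering
    return (train - mean) / std, (test - mean) / std

def build_features(blocks_train, blocks_test, use):
    """Early fusion: standardize each selected block, then concatenate columns."""
    train_parts, test_parts = [], []
    for name in use:
        tr, te = standardize(blocks_train[name], blocks_test[name])
        train_parts.append(tr)
        test_parts.append(te)
    return np.hstack(train_parts), np.hstack(test_parts)

def nearest_centroid_accuracy(x_train, y_train, x_test, y_test):
    """Placeholder classifier: assign each test row to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y_test).mean())

# Synthetic stand-ins (NOT BRSET data): 64-d "embeddings", 3 "metadata" columns.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
emb = rng.normal(size=(n, 64)) + 0.3 * y[:, None]
meta = rng.normal(size=(n, 3)) + 0.5 * y[:, None]
split = n // 2
blocks_train = {"embeddings": emb[:split], "metadata": meta[:split]}
blocks_test = {"embeddings": emb[split:], "metadata": meta[split:]}

# Ablation: run the same pipeline with each block alone and with both fused.
results = {}
for use in (["metadata"], ["embeddings"], ["embeddings", "metadata"]):
    x_tr, x_te = build_features(blocks_train, blocks_test, use)
    results["+".join(use)] = nearest_centroid_accuracy(
        x_tr, y[:split], x_te, y[split:]
    )
print(results)
```

Standardizing per block keeps a high-dimensional embedding from swamping a handful of metadata columns purely through scale. Other fusion strategies, such as late fusion of per-block predictions, would slot into the same loop.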
Team 2
Group 2: Embedding Space Exploration
Members
- Oluwole Olajide
- David Nkurunziza
- Nondumiso Mthiyane
- Busisiwe Mlotshwa
- Wisdom A Akurugu
Goal
Explore and visualize embedding representations learned by the models.
Suggested tasks
- Perform PCA, t-SNE, or UMAP
- Visualize clusters of disease severity
- Compare MedSigLIP, DINOv2, and RETFound embeddings
- Investigate separability of classes
Expected deliverables
- 2D embedding visualizations
- Cluster analysis
- Embedding comparison insights
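The PCA step among the suggested tasks can be sketched with a plain singular value decomposition. The embeddings below are random numbers standing in for model output, and `pca_2d` is a hypothetical helper name. t-SNE and UMAP need dedicated libraries and are not shown.

```python
import numpy as np

def pca_2d(embeddings):
    """Project rows onto the top two principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Share of total variance captured by each of the two components.
    explained = (singular_values[:2] ** 2) / (singular_values ** 2).sum()
    return coords, explained

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 32))  # synthetic stand-in, not model output
coords, explained = pca_2d(fake_embeddings)
print(coords.shape, explained)
```

The two columns of `coords` can be passed to any scatter-plot routine and colored by DR grade to inspect class separability. Running the same function on each model's embeddings gives directly comparable 2D views.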
Team 3
Group 3: Risk Stratification & Screening Prioritization
Members
- Keph Makoyi
- Judith Mangeni
- Norbert Nawe
- Penina Safari
- Tumai Muzorewa
Goal
Develop risk scoring systems for prioritizing diabetic retinopathy screening.
Suggested tasks
- Build risk prediction models
- Optimize sensitivity-focused thresholds
- Investigate triage strategies
- Discuss operational impact
Expected deliverables
- Risk stratification framework
- Screening prioritization strategy
- Operational recommendations
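One way to approach a sensitivity-focused threshold is to pick the highest risk-score cutoff that still meets a target sensitivity, then report the specificity at that cutoff. The sketch below does this in plain Python. The function names, the 0.80 target, and the toy scores are all illustrative assumptions.

```python
def sensitivity_at(threshold, scores, labels):
    """Fraction of positive cases flagged when score >= threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not positives:
        raise ValueError("need at least one positive case")
    return sum(s >= threshold for s in positives) / len(positives)

def specificity_at(threshold, scores, labels):
    """Fraction of negative cases left unflagged when score < threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one negative case")
    return sum(s < threshold for s in negatives) / len(negatives)

def pick_threshold(scores, labels, target_sensitivity):
    """Highest observed score whose sensitivity meets the target."""
    for candidate in sorted(set(scores), reverse=True):
        if sensitivity_at(candidate, scores, labels) >= target_sensitivity:
            return candidate
    return min(scores)  # the lowest score flags every case

# Toy risk scores and binary labels (1 = disease present); not real data.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

threshold = pick_threshold(scores, labels, target_sensitivity=0.80)
sens = sensitivity_at(threshold, scores, labels)
spec = specificity_at(threshold, scores, labels)
print(threshold, sens, spec)
```

The threshold should be chosen on a validation split and then frozen before it is evaluated on held-out data. Selecting it on the test set would overstate performance.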
Team 4
Group 4: Embedding Quality Comparison Benchmark
Members
- Adewemimo Olaosebikan
- Natasha Lalloo
- Anotida Marrian Chapunza
- Nhlamulo Khoza
- Imonikhe Kio
Goal
Perform a systematic benchmark comparison of MedSigLIP, DINOv2, and RETFound embeddings.
Suggested tasks
- Train identical downstream models on each embedding type
- Compare accuracy
- Compare sensitivity
- Compare specificity
- Compare inference speed
- Compare robustness
- Analyze dimensionality and redundancy
- Study how embeddings combine with metadata
Expected deliverables
- Benchmark leaderboard
- Comparative visualizations
- Recommendations for healthcare deployment
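A benchmark leaderboard needs the same metrics computed the same way for every embedding type. The sketch below assumes binary labels (1 = disease present) and uses placeholder names (`embedding_A`, `embedding_B`) with invented predictions. It is not a result for any real model. The `timed` helper shows one simple way to measure inference time.

```python
import time

def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def timed(predict_fn, inputs):
    """Run predict_fn once and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    predictions = predict_fn(inputs)
    return predictions, time.perf_counter() - start

# Invented predictions for two placeholder embedding types.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = {
    "embedding_A": [1, 1, 1, 0, 0, 0, 0, 1],
    "embedding_B": [1, 1, 0, 0, 0, 0, 0, 0],
}

# Leaderboard rows sorted by sensitivity, highest first.
leaderboard = sorted(
    ((name, binary_metrics(y_true, pred)) for name, pred in predictions.items()),
    key=lambda item: item[1]["sensitivity"],
    reverse=True,
)
for name, metrics in leaderboard:
    print(name, metrics)

# Timing demo with a trivial stand-in predictor.
_, elapsed = timed(lambda xs: [0 for _ in xs], y_true)
```

Sorting by sensitivity is one choice suited to screening. A deployment-oriented leaderboard might rank on several columns at once.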
Team 5
Group 5: Explainable Health AI
Members
- Eyad Elbahtety
- Ezinne Uvere
- Chika Ezeador
- Oluwaseun Williams
- OLUSOLA OGUNSANYA
Goal
Develop interpretable diabetic retinopathy prediction systems.
Suggested tasks
- Train interpretable ML models
- Perform SHAP analysis
- Generate feature importance rankings
- Interpret clinically meaningful variables
Expected deliverables
- Explainability visualizations
- Feature importance analysis
- Clinical insights report
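For a linear model with features treated as independent, SHAP values have a closed form: the contribution of feature i to a prediction is its weight times the feature's deviation from the background mean. The sketch below computes this directly with NumPy instead of calling the `shap` library, fits the model by ordinary least squares, and ranks features by mean absolute contribution. The data, the feature names, and the helper names are synthetic placeholders.

```python
import numpy as np

def fit_linear(x, y):
    """Ordinary least squares with an intercept; returns (weights, bias)."""
    design = np.hstack([x, np.ones((x.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:-1], coef[-1]

def linear_shap(weights, x, background):
    """phi[n, i] = w_i * (x[n, i] - mean of feature i over the background)."""
    return weights * (x - background.mean(axis=0))

# Synthetic data: the target depends strongly on feat_a, weakly on feat_b,
# and not at all on feat_c.
rng = np.random.default_rng(0)
feature_names = ["feat_a", "feat_b", "feat_c"]
x = rng.normal(size=(300, 3))
y = 2.0 * x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.1, size=300)

w, b = fit_linear(x, y)
phi = linear_shap(w, x, background=x)

# Global importance: mean absolute contribution per feature.
importance = np.abs(phi).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(-importance)]
print(ranking)
```

Each row of `phi` sums to that sample's prediction minus the average prediction, which is the additivity property that makes the values easy to present to clinicians. Non-linear models need a general-purpose explainer.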
Team 6
Group 6: Baseline DR Severity Prediction
Members
- Emmy MUGISHA
- Maria Gloria Nassuuna
- Hillary Koros
- Chantale Ruto
- Brigitte UMUTONI
Goal
Predict diabetic retinopathy severity using embeddings only, metadata only, and combined multimodal features.
Suggested tasks
- Build baseline ML models
- Compare embeddings vs metadata vs combined features
- Generate confusion matrices and evaluation metrics
- Interpret model performance
Expected deliverables
- Performance comparison table
- Best-performing baseline model
- Clinical interpretation of findings
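The confusion matrix and per-grade metrics mentioned in the tasks can be computed without any library. The sketch below assumes integer severity grades 0 to 4, places true grades on rows and predicted grades on columns, and uses invented labels.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """matrix[t][p] counts samples with true grade t predicted as grade p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Recall per true grade; None when a grade has no samples."""
    recalls = []
    for grade, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[grade] / total if total else None)
    return recalls

def accuracy(matrix):
    """Share of samples on the diagonal."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Invented grades for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 4, 4]
y_pred = [0, 0, 1, 1, 0, 2, 3, 3, 4, 2]

matrix = confusion_matrix(y_true, y_pred)
for row in matrix:
    print(row)
print(per_class_recall(matrix), accuracy(matrix))
```

Per-grade recall matters here because severity grades are typically imbalanced, so overall accuracy can look good while the severe grades are mostly missed.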
Team 7
Group 7: Bias & Fairness Analysis
Members
- Celestin Twizere
- Tendayi Mutangadura
- Henriette Dukuzimana
- Eric Katagirya
- Jacqueline Wahura
Goal
Investigate whether model performance varies across patient subgroups.
Suggested tasks
- Evaluate performance across sex and age groups
- Investigate site or camera-related bias
- Perform subgroup-specific error analysis
- Discuss fairness implications
Expected deliverables
- Fairness analysis report
- Bias visualizations
- Recommendations for equitable deployment
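A subgroup evaluation can begin by computing the same metric separately for each group and reporting the largest gap. The sketch below does this for sensitivity, using the 1 = male, 2 = female coding of `patient_sex`. The age-band cutoffs, the helper names, and the labels are illustrative assumptions.

```python
from collections import defaultdict

def age_band(age):
    """Hypothetical age bands; the cutoffs are not prescribed by the datathon."""
    if age < 40:
        return "<40"
    if age < 60:
        return "40-59"
    return "60+"

def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity per group, skipping groups with no positive cases."""
    true_pos = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            true_pos[g] += int(p == 1)
    return {g: true_pos[g] / positives[g] for g in positives}

# Invented labels and predictions; sex coded 1 = male, 2 = female.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
sex = [1, 1, 1, 2, 2, 2, 1, 2]

by_sex = sensitivity_by_group(y_true, y_pred, sex)
gap = max(by_sex.values()) - min(by_sex.values())
print(by_sex, gap)
```

Passing `[age_band(a) for a in ages]`, the `camera` column, or a site identifier as `groups` reuses the same function for the other comparisons. Small subgroups give noisy estimates, so group sizes should be reported next to each value.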
Suggested Evaluation Criteria
| Criterion | Weight (%) |
|---|---|
| Problem Understanding | 15 |
| EDA & Data Cleaning | 20 |
| Technical Soundness | 20 |
| Innovation | 15 |
| Clinical Relevance | 15 |
| Deployment Feasibility | 10 |
| Presentation Quality | 5 |
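The weights above sum to 100, so a team's total can be read as a percentage. The sketch below shows the arithmetic, assuming each criterion is scored from 0 to 10. That scale and the example scores are assumptions, as the table specifies only the weights.

```python
WEIGHTS = {
    "Problem Understanding": 15,
    "EDA & Data Cleaning": 20,
    "Technical Soundness": 20,
    "Innovation": 15,
    "Clinical Relevance": 15,
    "Deployment Feasibility": 10,
    "Presentation Quality": 5,
}

def weighted_total(scores, max_score=10):
    """Weighted percentage: sum of weight * (score / max_score) over criteria."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the listed criteria")
    return sum(WEIGHTS[c] * scores[c] / max_score for c in WEIGHTS)

# Hypothetical scores for one team on the assumed 0-10 scale.
example_scores = {
    "Problem Understanding": 8,
    "EDA & Data Cleaning": 7,
    "Technical Soundness": 9,
    "Innovation": 6,
    "Clinical Relevance": 8,
    "Deployment Feasibility": 5,
    "Presentation Quality": 10,
}
total = weighted_total(example_scores)
print(total)
```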
BRSET Data Dictionary
A Brazilian Multilabel Ophthalmological Dataset (BRSET) v1.0.1
PhysioNet — BRSET v1.0.1 • DOI
BRSET consists of 16,266 fundus images from 8,524 Brazilian patients. The labels.csv file contains identifiers, demographics, anatomical labels, diabetic retinopathy grading, quality parameters, and multi-label disease classifications for each image.
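With 16,266 images from 8,524 patients, many patients contribute more than one image. If images are split at random, the same patient can land in both training and test sets and inflate the scores. The sketch below shows one way to split by `patient_id` instead. The function name, the 20% test fraction, and the toy rows are assumptions.

```python
import random

def split_by_patient(rows, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient spans both."""
    patients = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in rows if r["patient_id"] not in test_patients]
    test = [r for r in rows if r["patient_id"] in test_patients]
    return train, test

# Toy rows: ten hypothetical patients with two images each.
rows = [
    {"image_id": f"img{p}_{eye}", "patient_id": p}
    for p in range(10)
    for eye in (1, 2)
]
train, test = split_by_patient(rows)
print(len(train), len(test))
```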
Dataset Files
| File / Folder | Description |
|---|---|
| fundus_photos | 16,266 color fundus photograph images (JPEG). |
| labels.csv | Table with image identifier, demographics, structural labels, diagnoses, and quality parameters. |
Identifiers & Demographics
| Column | Description |
|---|---|
| image_id | Image identifier. |
| patient_id | Patient identifier. |
| camera | Retinal camera (Canon CR-2 or Nikon NF505). |
| patient_age | Age of patient in years. |
| patient_sex | 1 = male, 2 = female. |
| exam_eye | 1 = right eye, 2 = left eye. |
| nationality | Patient's nationality. |
| comorbidities | Free text of self-reported clinical antecedents. |
| diabetes | Diabetes diagnosis. |
| diabetes_time | Self-reported time since diabetes diagnosis (years). |
| insulin_use | Self-reported insulin use (yes or no). |
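The integer codes above are easy to misread during analysis, so it can help to decode them once when loading the data. The sketch below reads a few invented rows with the standard `csv` module and maps `patient_sex` and `exam_eye` to labels. It assumes the codes are stored as the plain strings "1" and "2" and that a missing age appears as an empty field. Neither assumption has been checked against the real file.

```python
import csv
import io

SEX_CODES = {"1": "male", "2": "female"}
EYE_CODES = {"1": "right", "2": "left"}

def decode_row(row):
    """Return a copy of the row with coded columns replaced by readable labels."""
    decoded = dict(row)
    decoded["patient_sex"] = SEX_CODES.get(row["patient_sex"], "unknown")
    decoded["exam_eye"] = EYE_CODES.get(row["exam_eye"], "unknown")
    age = row["patient_age"].strip()
    decoded["patient_age"] = float(age) if age else None
    return decoded

# Invented rows standing in for labels.csv; real code would open the file instead.
sample_csv = io.StringIO(
    "image_id,patient_id,patient_age,patient_sex,exam_eye\n"
    "imgA,1,52,1,1\n"
    "imgB,1,52,1,2\n"
    "imgC,2,,2,1\n"
)
rows = [decode_row(r) for r in csv.DictReader(sample_csv)]
for row in rows:
    print(row)
```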
Anatomical Parameters
| Column | Description |
|---|---|
| optic_disc | 1 = normal, 2 = abnormal. |
| vessels | 1 = normal, 2 = abnormal. |
| macula | 1 = normal, 2 = abnormal. |
Diabetic Retinopathy Classification
| Column | Description |
|---|---|
| DR_ICDR | International Clinical Diabetic Retinopathy (ICDR) grade 0–4. |
| DR_SDRG | Scottish Diabetic Retinopathy Grading (SDRG) grade 0–4. |
Quality Parameters
| Column | Description |
|---|---|
| focus | 1 = satisfactory, 2 = unsatisfactory. |
| illumination | 1 = satisfactory, 2 = unsatisfactory. |
| image_field | 1 = satisfactory, 2 = unsatisfactory. |
| artifacts | 1 = satisfactory, 2 = unsatisfactory. |
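These four columns can be used to restrict an analysis to images that pass every quality check, or to study how performance changes on lower-quality images. The sketch below filters on all four columns being 1 (satisfactory). The helper name `passes_quality` and the toy rows are hypothetical.

```python
QUALITY_COLUMNS = ("focus", "illumination", "image_field", "artifacts")

def passes_quality(row):
    """True when every quality column equals 1 (satisfactory)."""
    return all(int(row[col]) == 1 for col in QUALITY_COLUMNS)

# Toy rows with invented values.
rows = [
    {"image_id": "imgA", "focus": 1, "illumination": 1, "image_field": 1, "artifacts": 1},
    {"image_id": "imgB", "focus": 2, "illumination": 1, "image_field": 1, "artifacts": 1},
    {"image_id": "imgC", "focus": 1, "illumination": 1, "image_field": 1, "artifacts": 2},
]
kept = [row["image_id"] for row in rows if passes_quality(row)]
print(kept)
```

Whether to exclude low-quality images is itself a design decision. A screening model deployed in the field will encounter them, so reporting results with and without the filter is often more informative.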
Classification Parameters (Multi-label)
Binary labels: 1 = present, 0 = absent.
| Column | Description |
|---|---|
| diabetic_retinopathy | Diabetic retinopathy present. |
| macular_edema | Diabetic macular edema present. |
| scar | Scar (e.g. toxoplasmosis) present. |
| nevus | Nevus present. |
| amd | Age-related macular degeneration present. |
| vascular_occlusion | Vascular occlusion present. |
| hypertensive_retinopathy | Hypertensive retinopathy present. |
| drusens | Drusen present. |
| hemorrhage | Non-diabetic retinal hemorrhage present. |
| retinal_detachment | Retinal detachment present. |
| myopic_fundus | Myopic fundus present. |
| increased_cup_disc | Increased cup-to-disc ratio present. |
| other | Other pathology present. |