Datathon Teams & Tasks

35 participants • 7 teams

Applications of Data Science in Health AI using Vector Embeddings and Clinical Metadata

This datathon combines retinal image embeddings extracted with the MedSigLIP, DINOv2, and RETFound models with patient metadata to investigate clinically meaningful machine learning applications in diabetic retinopathy screening.

Teams & Assignments

Team 1

Group 1 — Multimodal Fusion Study

Members

  1. Roland Mfondoum
  2. Samuel Kyobe
  3. EFFAH JEMIMA N'DA
  4. Joel Sabiti
  5. Hafeez Akinniyi

Goal

Study the impact of combining metadata and embeddings.

Suggested tasks

  • Compare metadata-only vs embeddings-only vs fused models
  • Experiment with fusion strategies
  • Perform ablation studies
  • Analyze the added value of metadata
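
One simple fusion strategy to start from is early fusion: standardize each modality, then concatenate the columns. The sketch below is a minimal illustration on synthetic arrays, not a prescribed method. The function names and array shapes are assumptions made for the example.

```python
import numpy as np

def standardize(train, test):
    """Z-score columns using statistics from the training split only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # constant columns map to zero instead of dividing by zero
    return (train - mean) / std, (test - mean) / std

def early_fusion(emb_train, meta_train, emb_test, meta_test):
    """Concatenate standardized embeddings and metadata column-wise."""
    emb_tr, emb_te = standardize(emb_train, emb_test)
    meta_tr, meta_te = standardize(meta_train, meta_test)
    return np.hstack([emb_tr, meta_tr]), np.hstack([emb_te, meta_te])

# Synthetic stand-ins: 8-dimensional "embeddings" and 2 numeric metadata columns.
rng = np.random.default_rng(0)
emb_train, emb_test = rng.normal(size=(100, 8)), rng.normal(size=(20, 8))
meta_train, meta_test = rng.normal(size=(100, 2)), rng.normal(size=(20, 2))
fused_train, fused_test = early_fusion(emb_train, meta_train, emb_test, meta_test)
```

Dropping one of the two blocks before concatenation gives the metadata-only and embeddings-only variants for an ablation.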

Expected deliverables

  • Fusion performance comparison
  • Ablation analysis
  • Recommendations on multimodal learning

Team 2

Group 2 — Embedding Space Exploration

Members

  1. Oluwole Olajide
  2. David Nkurunziza
  3. Nondumiso Mthiyane
  4. Busisiwe Mlotshwa
  5. Wisdom A Akurugu

Goal

Explore and visualize embedding representations learned by the models.

Suggested tasks

  • Perform PCA, t-SNE, or UMAP
  • Visualize clusters of disease severity
  • Compare MedSigLIP, DINOv2, and RETFound embeddings
  • Investigate separability of classes
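
As a starting point for the projections listed above, PCA can be written in a few lines with an SVD. This is a minimal sketch on a synthetic matrix. The 32-column shape is an assumption for the example, and t-SNE or UMAP would need a dedicated library.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return coords, explained[:n_components]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # synthetic stand-in for one model's embeddings
coords, explained = pca_project(embeddings)
```

Plotting `coords` colored by DR grade, once per embedding model, gives a quick visual check of class separability.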

Expected deliverables

  • 2D embedding visualizations
  • Cluster analysis
  • Embedding comparison insights

Team 3

Group 3 — Risk Stratification & Screening Prioritization

Members

  1. Keph Makoyi
  2. Judith Mangeni
  3. Norbert Nawe
  4. Penina Safari
  5. Tumai Muzorewa

Goal

Develop risk scoring systems for prioritizing diabetic retinopathy screening.

Suggested tasks

  • Build risk prediction models
  • Optimize sensitivity-focused thresholds
  • Investigate triage strategies
  • Discuss operational impact
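
A sensitivity-focused threshold can be chosen by fixing a target sensitivity and taking the highest score cutoff that still reaches it. The sketch below assumes binary labels (1 = disease) and higher scores meaning higher risk. The 0.75 target is only for the toy example.

```python
import numpy as np

def threshold_for_sensitivity(y_true, scores, target=0.95):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity is at least `target`. A case is flagged when
    score >= threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos_scores = np.sort(scores[y_true == 1])[::-1]  # positives, descending
    # At least k positives must be flagged to reach the target sensitivity.
    k = max(1, int(np.ceil(target * len(pos_scores))))
    threshold = pos_scores[k - 1]
    flagged = scores >= threshold
    sensitivity = flagged[y_true == 1].mean()
    specificity = (~flagged)[y_true == 0].mean()
    return float(threshold), float(sensitivity), float(specificity)

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]
threshold, sensitivity, specificity = threshold_for_sensitivity(y_true, scores, target=0.75)
```

The threshold should be selected on a validation split and then reported on held-out data, otherwise the sensitivity estimate is optimistic.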

Expected deliverables

  • Risk stratification framework
  • Screening prioritization strategy
  • Operational recommendations

Team 4

Group 4 — Embedding Quality Comparison Benchmark

Members

  1. Adewemimo Olaosebikan
  2. Natasha Lalloo
  3. Anotida Marrian Chapunza
  4. Nhlamulo Khoza
  5. Imonikhe Kio

Goal

Perform a systematic benchmark comparison of MedSigLIP, DINOv2, and RETFound embeddings.

Suggested tasks

  • Train identical downstream models on each embedding type
  • Compare accuracy
  • Compare sensitivity
  • Compare specificity
  • Compare inference speed
  • Compare robustness
  • Analyze dimensionality and redundancy
  • Study how embeddings combine with metadata
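
The core of a fair benchmark is holding the downstream model and the split fixed while only the embedding source changes. The sketch below uses a nearest-centroid classifier as a deliberately simple stand-in for the downstream model. The two embedding sets are synthetic and their names are hypothetical.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test row to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def binary_metrics(y_true, y_pred):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def benchmark(embedding_sets, y, train_idx, test_idx):
    """Run the same model and the same split on every embedding type."""
    results = {}
    for name, X in embedding_sets.items():
        y_pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
        results[name] = binary_metrics(y[test_idx], y_pred)
    return results

rng = np.random.default_rng(0)
y = np.arange(200) % 2  # alternating labels so both classes appear in each split
embedding_sets = {
    "informative": rng.normal(size=(200, 16)) + 3.0 * y[:, None],  # separable classes
    "noise": rng.normal(size=(200, 16)),                           # no class signal
}
results = benchmark(embedding_sets, y, np.arange(140), np.arange(140, 200))
```

The `results` dictionary maps directly onto a leaderboard table with one row per embedding type.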

Expected deliverables

  • Benchmark leaderboard
  • Comparative visualizations
  • Recommendations for healthcare deployment

Team 5

Group 5 — Explainable Health AI

Members

  1. Eyad Elbahtety
  2. Ezinne Uvere
  3. Chika Ezeador
  4. Oluwaseun Williams
  5. OLUSOLA OGUNSANYA

Goal

Develop interpretable diabetic retinopathy prediction systems.

Suggested tasks

  • Train interpretable ML models
  • Perform SHAP analysis
  • Generate feature importance rankings
  • Interpret clinically meaningful variables
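
For a linear model with features treated as independent, the SHAP value of feature j for one sample reduces to weight_j × (x_j − mean of x_j over a background set). The sketch below applies that closed form directly instead of calling the `shap` library. The weights and patient rows are made up for illustration, and only the column names come from the BRSET dictionary.

```python
import numpy as np

def linear_shap(weights, X, background):
    """SHAP values for a linear model under feature independence:
    phi[i, j] = weights[j] * (X[i, j] - mean(background[:, j]))."""
    return weights * (X - background.mean(axis=0))

def rank_features(shap_values, names):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(names[j], float(importance[j])) for j in order]

names = ["patient_age", "diabetes_time", "insulin_use"]
weights = np.array([0.02, 0.08, 0.5])  # hypothetical fitted coefficients
X = np.array([
    [50.0, 5.0, 0.0],
    [60.0, 10.0, 1.0],
    [70.0, 15.0, 0.0],
    [40.0, 2.0, 1.0],
])
shap_values = linear_shap(weights, X, background=X)
scores = X @ weights
baseline = scores.mean()  # expected model output over the background set
ranking = rank_features(shap_values, names)
```

Each row of `shap_values` sums to that sample's score minus `baseline`, which is a useful sanity check before plotting.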

Expected deliverables

  • Explainability visualizations
  • Feature importance analysis
  • Clinical insights report

Team 6

Group 6 — Baseline DR Severity Prediction

Members

  1. Emmy MUGISHA
  2. Maria Gloria Nassuuna
  3. Hillary Koros
  4. Chantale Ruto
  5. Brigitte UMUTONI

Goal

Predict diabetic retinopathy severity using embeddings only, metadata only, and combined multimodal features.

Suggested tasks

  • Build baseline ML models
  • Compare embeddings vs metadata vs combined features
  • Generate confusion matrices and evaluation metrics
  • Interpret model performance
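
For five-grade severity prediction, a confusion matrix and per-grade recall show which grades get confused, which overall accuracy hides. This is a minimal sketch with toy labels, assuming integer grades 0 to 4.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=5):
    """Rows are true grades, columns are predicted grades."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_recall(cm):
    """Recall per true grade; NaN for grades absent from the evaluation set."""
    totals = cm.sum(axis=1)
    return np.where(totals > 0, cm.diagonal() / np.maximum(totals, 1), np.nan)

y_true = [0, 0, 0, 1, 2, 2, 3, 4, 4, 0]
y_pred = [0, 0, 1, 1, 2, 3, 3, 4, 2, 0]
cm = confusion_matrix(y_true, y_pred)
accuracy = cm.trace() / cm.sum()
recall = per_class_recall(cm)
```

Running the same two functions on the embeddings-only, metadata-only, and combined models fills in the comparison table.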

Expected deliverables

  • Performance comparison table
  • Best-performing baseline model
  • Clinical interpretation of findings

Team 7

Group 7 — Bias & Fairness Analysis

Members

  1. Celestin Twizere
  2. Tendayi Mutangadura
  3. Henriette Dukuzimana
  4. Eric Katagirya
  5. Jacqueline Wahura

Goal

Investigate whether model performance varies across patient subgroups.

Suggested tasks

  • Evaluate performance across sex and age groups
  • Investigate site or camera-related bias
  • Perform subgroup-specific error analysis
  • Discuss fairness implications
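
Subgroup evaluation amounts to computing the same metric separately for each group and comparing the results. The sketch below computes sensitivity per group on toy data. The sex codes follow the data dictionary (1 = male, 2 = female), while the age bands are hypothetical cutoffs chosen for the example.

```python
from collections import defaultdict

def sensitivity_by_group(y_true, y_pred, groups):
    """Sensitivity (recall on positives) computed separately per group."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            if pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    # Groups with no positive cases are omitted rather than reported as 0.
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}

def age_band(age):
    """Hypothetical age bands; adjust the cutoffs to the cohort."""
    if age < 40:
        return "<40"
    if age < 60:
        return "40-59"
    return "60+"

y_true = [1, 1, 1, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
patient_sex = [1, 1, 2, 2, 1, 2, 1, 2]
by_sex = sensitivity_by_group(y_true, y_pred, patient_sex)
gap = max(by_sex.values()) - min(by_sex.values())
```

Passing `[age_band(a) for a in ages]`, camera names, or site labels as `groups` reuses the same function for the other comparisons.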

Expected deliverables

  • Fairness analysis report
  • Bias visualizations
  • Recommendations for equitable deployment

Suggested Evaluation Criteria

Criterion                Weight (%)
Problem Understanding    15
EDA & Data Cleaning      20
Technical Soundness      20
Innovation               15
Clinical Relevance       15
Deployment Feasibility   10
Presentation Quality     5

Teams are encouraged to focus on clinical relevance, explainability, fairness, and deployment feasibility in addition to model performance.
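
The weights above sum to 100, so a team's total can be computed as a weighted average of per-criterion ratings. The sketch below assumes each criterion is rated on a 0 to 10 scale, which the rubric itself does not specify.

```python
WEIGHTS = {
    "Problem Understanding": 15,
    "EDA & Data Cleaning": 20,
    "Technical Soundness": 20,
    "Innovation": 15,
    "Clinical Relevance": 15,
    "Deployment Feasibility": 10,
    "Presentation Quality": 5,
}

def weighted_score(ratings, weights=WEIGHTS, max_rating=10):
    """Combine per-criterion ratings (0..max_rating) into a 0..100 total.
    The rating scale is an assumption, not part of the published rubric."""
    return sum(weights[name] * ratings[name] / max_rating for name in weights)

perfect = {name: 10 for name in WEIGHTS}
half = {name: 5 for name in WEIGHTS}
```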

BRSET Data Dictionary

A Brazilian Multilabel Ophthalmological Dataset (BRSET) v1.0.1

PhysioNet — BRSET v1.0.1 DOI

BRSET consists of 16,266 fundus images from 8,524 Brazilian patients. The labels.csv file contains identifiers, demographics, anatomical labels, diabetic retinopathy grading, quality parameters, and multi-label disease classifications for each image.

Dataset Files

File / Folder    Description
fundus_photos    16,266 color fundus photograph images (JPEG).
labels.csv       Table with image identifier, demographics, structural labels, diagnoses, and quality parameters.

Identifiers & Demographics

Column           Description
image_id         Image identifier.
patient_id       Patient identifier.
camera           Retinal camera (Canon CR-2 or Nikon NF505).
patient_age      Age of patient in years.
patient_sex      1 = male, 2 = female.
exam_eye         1 = right eye, 2 = left eye.
nationality      Patient's nationality.
comorbidities    Free text of self-reported clinical antecedents.
diabetes         Diabetes diagnosis.
diabetes_time    Self-reported time since diabetes diagnosis (years).
insulin_use      Self-reported insulin use (yes or no).
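
Because the dataset has more images than patients, some patients contribute several images, and a split made per image could place the same patient in both training and test sets. The sketch below splits by `patient_id` instead. The rows are synthetic, and the function name and 20% test fraction are assumptions for the example.

```python
import random

def split_by_patient(rows, test_fraction=0.2, seed=0):
    """Split rows so that every patient's images land on one side only."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(test_fraction * len(patient_ids)))
    test_patients = set(patient_ids[:n_test])
    train = [row for row in rows if row["patient_id"] not in test_patients]
    test = [row for row in rows if row["patient_id"] in test_patients]
    return train, test

# Synthetic rows: 10 patients with one image per eye (1 = right, 2 = left).
rows = [
    {"image_id": f"img{patient:02d}_{eye}", "patient_id": patient, "exam_eye": eye}
    for patient in range(10)
    for eye in (1, 2)
]
train_rows, test_rows = split_by_patient(rows)
```

The same grouping applies to cross-validation folds, not only to a single train/test split.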

Anatomical Parameters

Column        Description
optic_disc    1 = normal, 2 = abnormal.
vessels       1 = normal, 2 = abnormal.
macula        1 = normal, 2 = abnormal.

Diabetic Retinopathy Classification

Column     Description
DR_ICDR    International Clinical Diabetic Retinopathy (ICDR) grade 0–4.
  • 0 — No retinopathy
  • 1 — Mild non-proliferative diabetic retinopathy
  • 2 — Moderate non-proliferative diabetic retinopathy
  • 3 — Severe non-proliferative diabetic retinopathy
  • 4 — Proliferative diabetic retinopathy and post-laser status
DR_SDRG    Scottish Diabetic Retinopathy Grading (SDRG) grade 0–4.
  • 0 — No retinopathy
  • 1 — Mild background
  • 2 — Moderate background
  • 3 — Severe non-proliferative or pre-proliferative diabetic retinopathy
  • 4 — Proliferative diabetic retinopathy and post-laser status
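
Screening models are often trained on a binary "referable" target derived from the grade rather than on the five grades directly. The sketch below uses one common convention (ICDR grade 2 or higher, or macular edema present). That cutoff is an assumption and is not defined by the BRSET dictionary, so teams should confirm it against their clinical question.

```python
def is_referable(dr_icdr, macular_edema=0):
    """Binary screening target under an assumed convention:
    ICDR grade >= 2 (moderate NPDR or worse) or macular edema present."""
    return dr_icdr >= 2 or macular_edema == 1

grades = [0, 1, 2, 3, 4]
targets = [int(is_referable(grade)) for grade in grades]
```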

Quality Parameters

Column          Description
focus           1 = satisfactory, 2 = unsatisfactory.
illumination    1 = satisfactory, 2 = unsatisfactory.
image_field     1 = satisfactory, 2 = unsatisfactory.
artifacts       1 = satisfactory, 2 = unsatisfactory.
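
The four quality flags can be combined into a single filter, for example to compare model performance on all images versus only those that pass every check. The sketch below assumes the columns hold the numeric codes listed above. The exact column spelling and encoding in labels.csv should be checked before use.

```python
QUALITY_COLUMNS = ("focus", "illumination", "image_field", "artifacts")

def passes_quality(row):
    """True when all four quality parameters are coded 1 (satisfactory)."""
    return all(int(row[column]) == 1 for column in QUALITY_COLUMNS)

# Synthetic rows; values are strings, as they would be when read with csv.DictReader.
rows = [
    {"image_id": "a", "focus": "1", "illumination": "1", "image_field": "1", "artifacts": "1"},
    {"image_id": "b", "focus": "2", "illumination": "1", "image_field": "1", "artifacts": "1"},
    {"image_id": "c", "focus": "1", "illumination": "1", "image_field": "1", "artifacts": "2"},
]
good_ids = [row["image_id"] for row in rows if passes_quality(row)]
```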

Classification Parameters (Multi-label)

Binary labels: 1 = present, 0 = absent.

Column                      Description
diabetic_retinopathy        Diabetic retinopathy present.
macular_edema               Diabetic macular edema present.
scar                        Scar (e.g. toxoplasmosis) present.
nevus                       Nevus present.
amd                         Age-related macular degeneration present.
vascular_occlusion          Vascular occlusion present.
hypertensive_retinopathy    Hypertensive retinopathy present.
drusens                     Drusen present.
hemorrhage                  Non-diabetic retinal hemorrhage present.
retinal_detachment          Retinal detachment present.
myopic_fundus               Myopic fundus present.
increased_cup_disc          Increased cup-to-disc ratio present.
other                       Other pathology present.
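
Because these labels are multi-label, one image can carry several findings at once, and class prevalence is worth checking before modeling. The sketch below computes per-label prevalence and the number of findings per image on synthetic rows, assuming the 0/1 coding described above.

```python
LABEL_COLUMNS = [
    "diabetic_retinopathy", "macular_edema", "scar", "nevus", "amd",
    "vascular_occlusion", "hypertensive_retinopathy", "drusens", "hemorrhage",
    "retinal_detachment", "myopic_fundus", "increased_cup_disc", "other",
]

def label_prevalence(rows):
    """Fraction of images with each label set to 1."""
    return {
        column: sum(int(row[column]) for row in rows) / len(rows)
        for column in LABEL_COLUMNS
    }

def findings_per_image(row):
    """Number of labels present on one image."""
    return sum(int(row[column]) for column in LABEL_COLUMNS)

def make_row(*present):
    """Build a synthetic row with the named labels set to 1 and the rest to 0."""
    return {column: int(column in present) for column in LABEL_COLUMNS}

rows = [
    make_row(),
    make_row("diabetic_retinopathy", "macular_edema"),
    make_row("drusens"),
    make_row("diabetic_retinopathy"),
]
prevalence = label_prevalence(rows)
counts = [findings_per_image(row) for row in rows]
```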