The full pipeline
Each box below corresponds to one source file in the repo. The website's later sections drill into each step.
Data loading
Wrap the HF grodino/waterbirds split in a torch Dataset that returns (image, label, place, group, idx). The training transform additionally applies a random horizontal flip.
src/dataset.py

Model
ResNet18 (or ResNet50) initialised with ImageNet weights; final FC replaced with 2 logits.
src/model.py

Training
Adam, 15 epochs, batch 32, lr 1e-4. Best checkpoint selected by validation worst-group accuracy — a DRO-style criterion that explicitly fights the shortcut.
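The selection criterion itself is simple to state. A sketch of worst-group accuracy, with the surrounding checkpoint loop only indicated in comments (the helper names there are hypothetical):

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy: the checkpoint-selection criterion."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append(float((preds[mask] == labels[mask]).mean()))
    return min(accs)

# Selection sketch: keep the checkpoint with the best validation WGA.
# best = -1.0
# for epoch in range(15):
#     train_one_epoch(model, loader)          # Adam, lr 1e-4, batch 32
#     wga = worst_group_accuracy(*validate(model, val_loader))
#     if wga > best:
#         best = wga
#         torch.save(model.state_dict(), "best.pt")
```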
src/train.py

Subgroup evaluation
Overall accuracy, macro precision/recall/F1, full confusion matrix, per-subgroup accuracy and confidence, and worst-group accuracy.
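A sketch of the per-subgroup part of that report, assuming softmax probabilities as input; the function name and dict layout are illustrative, not the repo's API:

```python
import numpy as np

def subgroup_report(probs, labels, groups):
    """Per-group accuracy and mean confidence, plus the 2x2 confusion matrix."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)          # confidence = top softmax probability
    report = {}
    for g in np.unique(groups):
        m = groups == g
        report[int(g)] = {
            "acc": float((preds[m] == labels[m]).mean()),
            "confidence": float(conf[m].mean()),
        }
    cm = np.zeros((2, 2), dtype=int)  # rows: true label, cols: prediction
    for t, p in zip(labels, preds):
        cm[t, p] += 1
    return report, cm
```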
src/evaluate.py

Grad-CAM analysis
Grad-CAM (Selvaraju et al., 2017) applied to layer4[-1]. Up to 30 representative test samples per subgroup. Each sample produces a 3-panel image (original / Grad-CAM overlay / center-crop heuristic).
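A hook-based sketch of the Grad-CAM computation (the repo may instead use a library such as pytorch-grad-cam; this just shows the mechanics):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, class_idx):
    """Weight the target layer's activations by its pooled gradients."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove()
        h2.remove()
    a, g = acts[0], grads[0]
    weights = g.mean(dim=(2, 3), keepdim=True)  # global-average-pooled grads
    cam = F.relu((weights * a).sum(dim=1))      # weighted channel sum, ReLU'd
    cam = cam / (cam.max() + 1e-8)              # normalise to [0, 1]
    return cam[0].detach()
```

For the real model the target layer would be `model.layer4[-1]`, matching the text.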
src/gradcam_analysis.py

Foreground / background score
A 60% center crop approximates the foreground. attention_bias_score = sum(saliency outside crop) / total. This is exactly the score requested by the brief.
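The score as defined above, in a few lines (the function name matches the text; the `1e-8` guard against empty saliency maps is my addition):

```python
import numpy as np

def attention_bias_score(saliency, crop_frac=0.6):
    """Fraction of total saliency falling OUTSIDE a centred crop (= background)."""
    h, w = saliency.shape
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    total = saliency.sum()
    inside = saliency[top:top + ch, left:left + cw].sum()
    return float((total - inside) / (total + 1e-8))
```

A score near 0 means saliency concentrates on the (approximate) bird; near 1 means it sits on the background.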
src/gradcam_analysis.py

Interventions
Four inference-time edits — background blur, background mask, background patch shuffle, and foreground mask. We re-run inference + Grad-CAM under each and log accuracy, prediction-flip rate, confidence drop, and saliency.
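A sketch of the first intervention, background blur, using the same 60% center-crop foreground heuristic. The naive box blur stands in for whatever blur the repo actually uses (e.g. a Gaussian):

```python
import numpy as np

def box_blur(img, k=7):
    """Naive box blur over an HxWxC array; a stand-in for a real Gaussian blur."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def background_blur(img, crop_frac=0.6, k=7):
    """Blur everything, then restore the centre crop (foreground heuristic)."""
    h, w = img.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    out = box_blur(img, k)
    out[top:top + ch, left:left + cw] = img[top:top + ch, left:left + cw]
    return out
```

Background mask, patch shuffle, and foreground mask follow the same pattern: edit pixels on one side of the crop boundary, then re-run inference and Grad-CAM.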
src/interventions.py

Summary report
Aggregates everything into outputs/PROJECT_RESULTS_SUMMARY.md, which this website also reads at build time.
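A minimal sketch of the write-out step; the metrics dict and table layout are illustrative, only the output path comes from the text:

```python
from pathlib import Path

def write_summary(metrics, path="outputs/PROJECT_RESULTS_SUMMARY.md"):
    """Render a metrics dict as a small markdown table the website can ingest."""
    lines = ["# Project results summary", "", "| metric | value |", "|---|---|"]
    lines += [f"| {k} | {v:.4f} |" for k, v in metrics.items()]
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n")
    return str(out)
```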
src/report_summary.py

Reproducing it
pip install -r requirements.txt
bash run_all.sh   # train -> evaluate -> Grad-CAM -> interventions -> summary
The full pipeline lives in run_all.sh.