Code walkthrough

Each methodology step → the file that implements it

Each step below lists the source files that implement it. The full pipeline runs end to end with bash run_all.sh.

Source: run_all.sh, config.yaml

Load Waterbirds

Wraps a split of the Hugging Face grodino/waterbirds dataset as a torch Dataset. Each item returns image, label, place, group, and idx.

src/dataset.py, src/utils.py (group ids)
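The group id can be derived from label and place. The sketch below assumes the common Waterbirds convention, group = 2 * label + place; the source does not confirm that src/utils.py uses this exact encoding. The names GROUP_NAMES, group_id, and make_record are hypothetical.

```python
# Assumed convention (not confirmed by the source): group = 2 * label + place,
# with label 0 = landbird, 1 = waterbird and place 0 = land, 1 = water.
GROUP_NAMES = {
    0: "landbird_on_land",
    1: "landbird_on_water",
    2: "waterbird_on_land",
    3: "waterbird_on_water",
}


def group_id(label: int, place: int) -> int:
    """Map a (label, place) pair to one of four subgroup ids."""
    if label not in (0, 1) or place not in (0, 1):
        raise ValueError(f"label and place must be 0 or 1, got {label}, {place}")
    return 2 * label + place


def make_record(idx: int, image, label: int, place: int) -> dict:
    """Bundle the five fields the dataset is described as returning."""
    return {
        "image": image,
        "label": label,
        "place": place,
        "group": group_id(label, place),
        "idx": idx,
    }
```

Under this encoding, groups 1 and 2 are the two cases where the background disagrees with the bird class.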

Build the model

ResNet-18 (or ResNet-50) initialized with ImageNet weights, with the final layer replaced by a 2-class head. The architecture is selected in config.yaml.

src/model.py, config.yaml

Train with worst-group selection

Trains for 15 epochs with Adam at a learning rate of 1e-4. The checkpoint kept as best is the one with the highest worst-group accuracy on the validation split.

src/train.py

python -m src.train --config config.yaml

Outputs

outputs/checkpoints/best_model.pt
outputs/metrics/train_history.csv
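The selection rule can be sketched in plain Python. This shows the idea only; the function names and the tie-breaking rule (earlier epoch wins) are assumptions, not a copy of src/train.py.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (minimum per-group accuracy, dict of per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


def select_best_epoch(val_worst_group_acc):
    """Pick the epoch with the highest validation worst-group accuracy."""
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_worst_group_acc):
        if score > best_score:  # strict comparison: ties keep the earlier epoch
            best_epoch, best_score = epoch, score
            # A real training loop would save the checkpoint at this point.
    return best_epoch, best_score
```

Selecting on worst-group rather than overall accuracy keeps a model that does well on the majority groups but fails on a minority group from being chosen.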

Subgroup evaluation

Computes overall accuracy, macro-averaged precision, recall, and F1, the confusion matrix, per-subgroup accuracy, and worst-group accuracy.

src/evaluate.py

python -m src.evaluate --config config.yaml --checkpoint outputs/checkpoints/best_model.pt --split test

Outputs

outputs/metrics/test_metrics.json
outputs/metrics/test_predictions.csv
outputs/figures/test_confusion_matrix.png
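For reference, the confusion matrix and macro-averaged metrics can be computed as below. This is a dependency-free sketch with hypothetical function names; the actual src/evaluate.py may use a library instead.

```python
def confusion_matrix(labels, preds, num_classes=2):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, preds):
        cm[y][p] += 1
    return cm


def macro_prf(cm):
    """Unweighted mean of per-class precision, recall, and F1."""
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging weights both classes equally, regardless of how many test samples each class has.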

Grad-CAM saliency analysis

Applies Grad-CAM (Selvaraju et al.) to the last block of layer4, layer4[-1]. 30 representative test samples are drawn per subgroup, and each produces a 3-panel plate.

src/gradcam_analysis.py

python -m src.gradcam_analysis --config config.yaml --checkpoint outputs/checkpoints/best_model.pt

Outputs

outputs/gradcam/gradcam_results.csv
outputs/gradcam/gradcam_group_summary.csv
outputs/gradcam/*.png (122 plates)
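The core Grad-CAM computation is small once the target layer's activations and gradients are available. The NumPy sketch below assumes both have already been captured (for example with forward and backward hooks on layer4[-1]) as arrays of shape (C, H, W); the function name grad_cam is hypothetical.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM map from one layer's activations and gradients.

    activations: (C, H, W) feature maps of the target layer.
    gradients:   (C, H, W) gradient of the class score w.r.t. those maps.
    """
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(1, 2))              # shape (C,)
    # Weighted sum of feature maps across channels.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # ReLU keeps only regions that push the class score up.
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam             # scale to [0, 1]
```

The resulting map has the spatial size of the feature maps, so it would typically be upsampled to the input resolution before being overlaid on the image.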

Foreground / background score

A 60% center crop serves as the foreground proxy; everything outside it counts as background. attention_bias_score = sum(saliency outside the crop) / sum(all saliency). Implemented as saliency_ratios() in src/gradcam_analysis.py.

src/gradcam_analysis.py · saliency_ratios
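A minimal sketch of the score, reusing the name saliency_ratios from the source. The signature, the foreground_ratio key, and the reading of "60%" as 60% of each side length are assumptions; the real function may differ.

```python
import numpy as np


def saliency_ratios(cam: np.ndarray, center_fraction: float = 0.6) -> dict:
    """Split a 2-D saliency map into center (foreground proxy) and the rest."""
    h, w = cam.shape
    # Assumption: the crop spans center_fraction of each side length.
    ch, cw = int(round(h * center_fraction)), int(round(w * center_fraction))
    top, left = (h - ch) // 2, (w - cw) // 2
    total = float(cam.sum())
    if total <= 0:
        # No saliency anywhere: report zeros instead of dividing by zero.
        return {"foreground_ratio": 0.0, "attention_bias_score": 0.0}
    inside = float(cam[top:top + ch, left:left + cw].sum())
    return {
        "foreground_ratio": inside / total,
        "attention_bias_score": (total - inside) / total,
    }
```

Under this reading, a uniform map scores 0.64, because the center box covers 36% of the area. Scores above that level indicate saliency concentrated outside the center.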

Interventions

Four image edits are applied: background_blur, background_mask, background_patch_shuffle, and foreground_mask. Inference and Grad-CAM are re-run under each.

src/interventions.py

python -m src.interventions --config config.yaml --checkpoint outputs/checkpoints/best_model.pt --max-samples 1000

Outputs

outputs/interventions/intervention_predictions.csv
outputs/interventions/intervention_metrics.csv
outputs/interventions/intervention_overall_metrics.csv
outputs/figures/intervention_accuracy.png
outputs/figures/intervention_background_saliency.png
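Two of the edits can be sketched with NumPy. The sketch assumes the interventions reuse the 60% center box as the foreground region and fill masked pixels with a constant; neither detail is confirmed by the source, and center_box is a hypothetical helper.

```python
import numpy as np


def center_box(h: int, w: int, fraction: float = 0.6):
    """Top-left corner and size of a centered box (assumed foreground)."""
    bh, bw = int(round(h * fraction)), int(round(w * fraction))
    return (h - bh) // 2, (w - bw) // 2, bh, bw


def background_mask(image: np.ndarray, fill=0, fraction: float = 0.6) -> np.ndarray:
    """Keep the center box, replace everything outside it with `fill`."""
    top, left, bh, bw = center_box(*image.shape[:2], fraction)
    out = np.full_like(image, fill)
    out[top:top + bh, left:left + bw] = image[top:top + bh, left:left + bw]
    return out


def foreground_mask(image: np.ndarray, fill=0, fraction: float = 0.6) -> np.ndarray:
    """Replace the center box with `fill`, keep everything outside it."""
    top, left, bh, bw = center_box(*image.shape[:2], fraction)
    out = image.copy()  # leave the caller's array untouched
    out[top:top + bh, left:left + bw] = fill
    return out
```

If accuracy drops more under foreground_mask than under background_mask, the model depends mainly on the bird; the opposite pattern points to reliance on the background.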

Generate the project summary

Aggregates the metrics and outputs from the steps above into a single Markdown file. The website reads the same CSV and JSON files at build time.

src/report_summary.py

python -m src.report_summary

Outputs

outputs/PROJECT_RESULTS_SUMMARY.md

One-shot reproducibility

pip install -r requirements.txt
bash run_all.sh
