Shortcut Learning · Waterbirds
03 · Methodology

The full pipeline

Each box below corresponds to one source file in the repo. The website's later sections drill into each step.

01 · Load Waterbirds: HF grodino/waterbirds · 4 subgroups
02 · Train ResNet18: save best by worst-group acc
03 · Subgroup eval: Acc · P · R · F1 · CM · WG
04 · Grad-CAM: class-conditional saliency on layer4[-1]
05 · Bias score: BG saliency / total · 60% center crop
06 · Interventions: blur · mask · shuffle · FG mask
07 · Compare: Δ accuracy · Δ flips · Δ saliency
1 · Data loading

Wrap the HF grodino/waterbirds split in a torch Dataset that returns (image, label, place, group, idx). Training transforms add a random horizontal flip.

src/dataset.py
2 · Model

ResNet18 (or ResNet50) initialised with ImageNet weights; final FC replaced with 2 logits.

src/model.py
3 · Training

Adam, 15 epochs, batch 32, lr 1e-4. Best checkpoint selected by validation worst-group accuracy — a DRO-style criterion that explicitly fights the shortcut.

src/train.py
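The selection criterion can be sketched as a small helper (the name `worst_group_accuracy` is an assumption; the checkpointing step is indicated in the trailing comment):

```python
import torch

def worst_group_accuracy(preds, labels, groups, num_groups=4):
    """Minimum accuracy over the subgroups present in `groups`."""
    accs = []
    for g in range(num_groups):
        mask = groups == g
        if mask.any():  # skip empty subgroups
            accs.append((preds[mask] == labels[mask]).float().mean().item())
    return min(accs)

# Checkpoint selection (illustrative): keep the epoch with the best value.
# if wg_acc > best_wg_acc:
#     best_wg_acc = wg_acc
#     torch.save(model.state_dict(), "outputs/best_model.pt")
```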
4 · Subgroup evaluation

Overall accuracy, macro precision/recall/F1, full confusion matrix, per-subgroup accuracy and confidence, and worst-group accuracy.

src/evaluate.py
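A self-contained sketch of those metrics; the function name and return keys are illustrative and the real `src/evaluate.py` may differ:

```python
import numpy as np

def subgroup_report(preds, labels, groups, num_classes=2, num_groups=4):
    """Sketch: overall accuracy, macro P/R/F1, confusion matrix,
    per-subgroup accuracy, and worst-group accuracy."""
    preds, labels, groups = map(np.asarray, (preds, labels, groups))
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(labels, preds):
        cm[t, p] += 1  # rows = true class, cols = predicted class
    tp = np.diag(cm).astype(float)
    prec = np.divide(tp, cm.sum(0), out=np.zeros_like(tp), where=cm.sum(0) > 0)
    rec = np.divide(tp, cm.sum(1), out=np.zeros_like(tp), where=cm.sum(1) > 0)
    f1 = np.divide(2 * prec * rec, prec + rec,
                   out=np.zeros_like(tp), where=(prec + rec) > 0)
    group_acc = {g: float((preds[groups == g] == labels[groups == g]).mean())
                 for g in range(num_groups) if (groups == g).any()}
    return {
        "accuracy": float((preds == labels).mean()),
        "macro_precision": float(prec.mean()),
        "macro_recall": float(rec.mean()),
        "macro_f1": float(f1.mean()),
        "confusion_matrix": cm,
        "group_accuracy": group_acc,
        "worst_group_accuracy": min(group_acc.values()),
    }
```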
5 · Grad-CAM analysis

Grad-CAM (Selvaraju et al.) computed on layer4[-1]. Up to 30 representative test samples per subgroup; each sample produces a 3-panel image (original / Grad-CAM overlay / center-crop heuristic).

src/gradcam_analysis.py
6 · Foreground / background score

A 60% center crop approximates the foreground. attention_bias_score = sum(saliency outside crop) / total. This is exactly the score requested by the brief.

src/gradcam_analysis.py
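The score can be computed directly from a 2-D saliency map (the `fg_frac` parameter name is an assumption). As a sanity check, a uniform map scores 1 - 0.6^2 = 0.64:

```python
import numpy as np

def attention_bias_score(saliency, fg_frac=0.60):
    """Sketch: fraction of saliency mass outside a central crop
    covering `fg_frac` of each side (the foreground proxy)."""
    h, w = saliency.shape
    dh, dw = int(h * (1 - fg_frac) / 2), int(w * (1 - fg_frac) / 2)
    total = saliency.sum()
    if total == 0:
        return 0.0
    fg = saliency[dh:h - dh, dw:w - dw].sum()  # mass inside the crop
    return float((total - fg) / total)
```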
7 · Interventions

Four inference-time edits — background blur, background mask, background patch shuffle, and foreground mask. We re-run inference + Grad-CAM under each and log accuracy, prediction-flip rate, confidence drop, and saliency.

src/interventions.py
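A sketch of the four edits on a CHW tensor, reusing the 60% center-crop foreground proxy; pixel-level shuffling stands in for the patch shuffle here, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def intervene(img, mode, fg_frac=0.60):
    """Sketch of the four inference-time edits on a (C, H, W) tensor,
    with the same center-crop foreground proxy as the bias score."""
    c, h, w = img.shape
    dh, dw = int(h * (1 - fg_frac) / 2), int(w * (1 - fg_frac) / 2)
    fg = torch.zeros(1, h, w, dtype=torch.bool)
    fg[:, dh:h - dh, dw:w - dw] = True  # foreground proxy mask
    out = img.clone()
    if mode == "bg_blur":
        blurred = F.avg_pool2d(img.unsqueeze(0), 9, stride=1,
                               padding=4).squeeze(0)
        out = torch.where(fg, img, blurred)
    elif mode == "bg_mask":
        out = torch.where(fg, img, torch.zeros_like(img))
    elif mode == "bg_shuffle":  # pixel shuffle as a stand-in for patches
        bg_idx = (~fg[0]).flatten().nonzero().squeeze(1)
        perm = bg_idx[torch.randperm(bg_idx.numel())]
        flat = out.reshape(c, -1)        # view into `out`
        flat[:, bg_idx] = flat[:, perm]  # RHS is a copy, so this is safe
    elif mode == "fg_mask":
        out = torch.where(fg, torch.zeros_like(img), img)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out
```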
8 · Summary report

Aggregates everything into outputs/PROJECT_RESULTS_SUMMARY.md, which this website also reads at build time.

src/report_summary.py

Reproducing it

pip install -r requirements.txt
bash run_all.sh    # train -> evaluate -> Grad-CAM -> interventions -> summary

The full pipeline lives in run_all.sh.