The full pipeline
Each box below corresponds to one source file in the repo. The website's later sections drill into each step.
Data loading
Wrap the HF grodino/waterbirds split in a torch Dataset that returns (image, label, place, group, idx). The training transform additionally applies a random horizontal flip.
src/dataset.py

Model
ResNet18 (or ResNet50) initialised with ImageNet weights; final FC replaced with 2 logits.
src/model.py

Training
Adam, 15 epochs, batch 32, lr 1e-4. Best checkpoint selected by validation worst-group accuracy — a DRO-style criterion that explicitly fights the shortcut.
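The selection criterion itself is simple to state. A sketch of worst-group accuracy, with the surrounding checkpoint loop only indicated in comments (the helper names there are hypothetical):

```python
import numpy as np

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy: the checkpoint-selection criterion."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append(float((preds[mask] == labels[mask]).mean()))
    return min(accs)

# Selection sketch: keep the checkpoint with the best validation WGA.
# best = -1.0
# for epoch in range(15):
#     train_one_epoch(model, loader)          # Adam, lr 1e-4, batch 32
#     wga = worst_group_accuracy(*validate(model, val_loader))
#     if wga > best:
#         best = wga
#         torch.save(model.state_dict(), "best.pt")
```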
src/train.py

Subgroup evaluation
Overall accuracy, macro precision/recall/F1, full confusion matrix, per-subgroup accuracy and confidence, and worst-group accuracy.
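A sketch of the per-subgroup part of that report, assuming softmax probabilities as input; the function name and dict layout are illustrative, not the repo's API:

```python
import numpy as np

def subgroup_report(probs, labels, groups):
    """Per-group accuracy and mean confidence, plus the 2x2 confusion matrix."""
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)          # confidence = top softmax probability
    report = {}
    for g in np.unique(groups):
        m = groups == g
        report[int(g)] = {
            "acc": float((preds[m] == labels[m]).mean()),
            "confidence": float(conf[m].mean()),
        }
    cm = np.zeros((2, 2), dtype=int)  # rows: true label, cols: prediction
    for t, p in zip(labels, preds):
        cm[t, p] += 1
    return report, cm
```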
src/evaluate.py

Grad-CAM analysis
Grad-CAM (Selvaraju et al., 2017) applied to layer4[-1]. Up to 30 representative test samples per subgroup. Each sample produces a 3-panel image (original / Grad-CAM overlay / center-crop heuristic).
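A hook-based sketch of the Grad-CAM computation (the repo may instead use a library such as pytorch-grad-cam; this just shows the mechanics):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, class_idx):
    """Weight the target layer's activations by its pooled gradients."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()
    finally:
        h1.remove()
        h2.remove()
    a, g = acts[0], grads[0]
    weights = g.mean(dim=(2, 3), keepdim=True)  # global-average-pooled grads
    cam = F.relu((weights * a).sum(dim=1))      # weighted channel sum, ReLU'd
    cam = cam / (cam.max() + 1e-8)              # normalise to [0, 1]
    return cam[0].detach()
```

For the real model the target layer would be `model.layer4[-1]`, matching the text.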
src/gradcam_analysis.py

Foreground / background score
A 60% center crop approximates the foreground. attention_bias_score = sum(saliency outside crop) / total. This is exactly the score requested by the brief.
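The score as defined above, in a few lines (the function name matches the text; the `1e-8` guard against empty saliency maps is my addition):

```python
import numpy as np

def attention_bias_score(saliency, crop_frac=0.6):
    """Fraction of total saliency falling OUTSIDE a centred crop (= background)."""
    h, w = saliency.shape
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    total = saliency.sum()
    inside = saliency[top:top + ch, left:left + cw].sum()
    return float((total - inside) / (total + 1e-8))
```

A score near 0 means saliency concentrates on the (approximate) bird; near 1 means it sits on the background.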
src/gradcam_analysis.py

Interventions
Four inference-time edits — background blur, background mask, background patch shuffle, and foreground mask. We re-run inference + Grad-CAM under each and log accuracy, prediction-flip rate, confidence drop, and saliency.
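A sketch of the first intervention, background blur, using the same 60% center-crop foreground heuristic. The naive box blur stands in for whatever blur the repo actually uses (e.g. a Gaussian):

```python
import numpy as np

def box_blur(img, k=7):
    """Naive box blur over an HxWxC array; a stand-in for a real Gaussian blur."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def background_blur(img, crop_frac=0.6, k=7):
    """Blur everything, then restore the centre crop (foreground heuristic)."""
    h, w = img.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    out = box_blur(img, k)
    out[top:top + ch, left:left + cw] = img[top:top + ch, left:left + cw]
    return out
```

Background mask, patch shuffle, and foreground mask follow the same pattern: edit pixels on one side of the crop boundary, then re-run inference and Grad-CAM.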
src/interventions.py

Summary report
Aggregates everything into outputs/PROJECT_RESULTS_SUMMARY.md, which this website also reads at build time.
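A minimal sketch of the write-out step; the metrics dict and table layout are illustrative, only the output path comes from the text:

```python
from pathlib import Path

def write_summary(metrics, path="outputs/PROJECT_RESULTS_SUMMARY.md"):
    """Render a metrics dict as a small markdown table the website can ingest."""
    lines = ["# Project results summary", "", "| metric | value |", "|---|---|"]
    lines += [f"| {k} | {v:.4f} |" for k, v in metrics.items()]
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n")
    return str(out)
```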
src/report_summary.py

Reproducing it
pip install -r requirements.txt
bash run_all.sh   # train -> evaluate -> Grad-CAM -> interventions -> summary
The full pipeline lives in run_all.sh.