Shortcut Learning · Waterbirds
04 · CNN training

Fine-tuning ResNet18 with worst-group selection

The model is selected by validation worst-group accuracy, not validation overall accuracy. This DRO-style selection criterion keeps a high average from hiding poor performance on the minority groups where the background shortcut misleads the model, and the chart below shows why that matters.
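As a concrete sketch of that selection criterion: compute accuracy separately for each group and take the minimum. The snippet below is a minimal illustration, not the project's actual code; the function and tensor names are placeholders.

    import torch

    def worst_group_accuracy(preds: torch.Tensor,
                             labels: torch.Tensor,
                             groups: torch.Tensor) -> float:
        """Minimum per-group accuracy. `groups` holds one integer id per example,
        e.g. the four bird-type x background combinations in Waterbirds."""
        per_group = []
        for g in groups.unique():
            mask = groups == g
            per_group.append((preds[mask] == labels[mask]).float().mean().item())
        return min(per_group)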

Architecture: ResNet18 · ImageNet pretrained
Optimizer: Adam · lr 1e-4 · weight decay 1e-4
Training: 15 epochs · batch size 32 · 224×224 inputs
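A minimal fine-tuning setup matching the configuration above could look like this. It is a sketch under assumptions: the augmentation choices (random crop and flip) are not stated in the project, only the 224×224 input size is, and the Waterbirds dataloaders are omitted.

    import torch
    from torch import nn, optim
    from torchvision import models, transforms

    # ResNet18 with ImageNet weights and a fresh 2-way head (waterbird vs. landbird).
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Adam with lr 1e-4 and weight decay 1e-4, as in the configuration above.
    optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()

    # 224x224 inputs with ImageNet normalization; the augmentation is an assumption.
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])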
Training history (chart): validation accuracy by subgroup, per epoch.
Notice how overall validation accuracy stays high (~83%) while worst-group accuracy oscillates between 16% and 56% — this gap is the shortcut signal the project investigates.
Best val accuracy: 86.0% (epoch 1)
Best val worst-group acc: 56.4% (epoch 4, checkpoint kept here)
Final-epoch train acc: 99.2% (strong fit on the biased train split)

Why the chart matters

The blue line (overall validation accuracy) plateaus around 80–86%. The pink line (worst-group accuracy) jumps from 17% to 56% across epochs and never catches up. This volatility is the smoking gun: the model is learning at different speeds on different subgroups because it is leaning on a feature (background) that conflicts with the label in the minority groups.

By saving the checkpoint at the epoch with the highest val_worst_group_acc, we get the fairest available model from this single training run — without adding GroupDRO or reweighting (those would be the next step; see Limitations).
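A sketch of that selection loop is shown below, reusing the worst_group_accuracy helper from earlier. train_one_epoch, the loaders, and the checkpoint path are placeholders standing in for the project's own code, not its actual API.

    import torch

    @torch.no_grad()
    def collect_predictions(model, loader, device="cuda"):
        """Gather predictions, labels, and group ids from a loader that yields
        (image, label, group) batches, as Waterbirds splits typically do."""
        model.eval()
        preds, labels, groups = [], [], []
        for x, y, g in loader:
            preds.append(model(x.to(device)).argmax(dim=1).cpu())
            labels.append(y)
            groups.append(g)
        return torch.cat(preds), torch.cat(labels), torch.cat(groups)

    best_wg = 0.0
    for epoch in range(15):
        train_one_epoch(model, train_loader, optimizer, criterion)  # placeholder training step
        p, y, g = collect_predictions(model, val_loader)
        val_worst_group_acc = worst_group_accuracy(p, y, g)         # helper defined earlier
        if val_worst_group_acc > best_wg:
            best_wg = val_worst_group_acc
            # keep only the epoch with the best worst-group validation accuracy
            torch.save(model.state_dict(), "best_worst_group.pt")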