Fine-tuning ResNet18 with worst-group selection
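The model is selected by validation worst-group accuracy, not overall validation accuracy. This is a selection criterion in the style of distributionally robust optimization (DRO): a checkpoint that relies on the shortcut scores poorly on the groups where the shortcut fails, so it cannot be selected on aggregate accuracy alone. The chart below shows why that matters.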
Why the chart matters
The blue line (overall validation accuracy) plateaus around 80–86%. The pink line (worst-group accuracy) swings between 17% and 56% across epochs and never catches up with the overall accuracy. This volatility is the telltale sign of a shortcut: the model learns at different speeds on different subgroups because it is leaning on a feature (the background) that conflicts with the label in the minority groups.
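For reference, the two curves differ only in how per-example correctness is aggregated: overall accuracy averages over all examples, while worst-group accuracy takes the minimum of the per-group accuracies. The following is a minimal sketch in plain Python, assuming each validation example carries a group id (for instance, one id per label/background combination). The function names and the toy data are illustrative and do not come from the original training code.

```python
from collections import defaultdict


def group_accuracies(preds, labels, groups):
    """Return {group_id: accuracy} for every group that appears in `groups`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {group: correct[group] / total[group] for group in total}


def worst_group_accuracy(preds, labels, groups):
    """Accuracy of the group on which the model performs worst."""
    return min(group_accuracies(preds, labels, groups).values())


def overall_accuracy(preds, labels):
    """Plain accuracy over all examples, ignoring group membership."""
    hits = sum(int(pred == label) for pred, label in zip(preds, labels))
    return hits / len(labels)


# Toy data (hypothetical): group 0 is the majority group, group 1 the minority.
preds = [0, 0, 0, 0, 1, 1, 1, 0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [0, 0, 0, 0, 0, 0, 1, 1]

overall = overall_accuracy(preds, labels)            # 7 of 8 correct
worst = worst_group_accuracy(preds, labels, groups)  # minority group: 1 of 2
print(f"overall={overall:.3f}  worst_group={worst:.3f}")
```

In this toy case a single error in a two-example minority group leaves overall accuracy high while halving the worst-group figure, which is the same kind of gap the chart shows.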
By saving the checkpoint at the epoch with the highest val_worst_group_acc, we get the fairest available model from this single training run, without adding GroupDRO or reweighting (those would be the next step; see Limitations).
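The selection rule can be sketched as a small tracker that remembers the best value of the chosen metric seen so far. This is an illustration under stated assumptions, not the original training loop: `BestCheckpointTracker` and the `history` values are hypothetical, and only the metric name `val_worst_group_acc` comes from the text. In a PyTorch loop, the point where `update` returns `True` is where one would typically call `torch.save(model.state_dict(), path)`.

```python
class BestCheckpointTracker:
    """Tracks the epoch with the highest value of a monitored metric."""

    def __init__(self):
        self.best_value = float("-inf")
        self.best_epoch = None

    def update(self, epoch, value):
        """Return True if `value` is a new best, i.e. a checkpoint should be saved.

        The strict comparison means ties keep the earliest epoch.
        """
        if value > self.best_value:
            self.best_value = value
            self.best_epoch = epoch
            return True
        return False


# Hypothetical per-epoch validation metrics, for illustration only.
history = [
    {"val_acc": 0.80, "val_worst_group_acc": 0.17},
    {"val_acc": 0.86, "val_worst_group_acc": 0.41},
    {"val_acc": 0.84, "val_worst_group_acc": 0.56},
    {"val_acc": 0.85, "val_worst_group_acc": 0.33},
]

by_worst_group = BestCheckpointTracker()
by_overall = BestCheckpointTracker()
for epoch, metrics in enumerate(history, start=1):
    if by_worst_group.update(epoch, metrics["val_worst_group_acc"]):
        pass  # assumption: the model weights would be saved here
    by_overall.update(epoch, metrics["val_acc"])

print("selected by worst-group accuracy: epoch", by_worst_group.best_epoch)
print("selected by overall accuracy:     epoch", by_overall.best_epoch)
```

With these made-up numbers the two criteria pick different epochs, which is the situation where selecting on worst-group accuracy changes the saved model.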