04 · CNN training

Fine-tuning ResNet18 with worst-group selection

The model is selected by validation worst-group accuracy rather than overall validation accuracy. This DRO-style criterion stops a checkpoint from being chosen only because it does well on the majority groups, which makes the shortcut harder to ignore. The chart below shows why that matters.
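To make the selection metric concrete, here is a minimal sketch of how overall and worst-group accuracy can be computed from per-sample predictions. The function names, the group labels ("majority", "minority"), and the toy data are all hypothetical; they are not taken from src/train.py, and the project's real subgroup definition may differ.

```python
from collections import defaultdict


def group_accuracies(preds, labels, groups):
    """Return {group_id: accuracy} for every group that appears in `groups`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return {g: correct[g] / total[g] for g in total}


def worst_group_accuracy(preds, labels, groups):
    """Accuracy of the subgroup on which the model does worst."""
    return min(group_accuracies(preds, labels, groups).values())


def overall_accuracy(preds, labels):
    """Plain accuracy over all samples, ignoring group membership."""
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Toy data (made up for illustration): 8 majority-group samples, 2 minority.
preds = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
groups = ["majority"] * 8 + ["minority"] * 2

overall = overall_accuracy(preds, labels)
worst = worst_group_accuracy(preds, labels, groups)
print(f"overall={overall:.2f} worst_group={worst:.2f}")
```

In this toy case a single error in a two-sample minority group barely moves overall accuracy but halves worst-group accuracy, which is the kind of gap the overall number hides.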

Source: src/train.py, src/model.py, config.yaml

Architecture

ResNet18

ImageNet pretrained

Optimizer

Adam

lr 1e-4 · wd 1e-4

Epochs

batch 32 · 224×224
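The settings on the cards above could be gathered in one place along these lines. This is only a sketch: the `TrainConfig` class, its field names, and the `adam_kwargs` helper are hypothetical and do not reflect the actual layout of config.yaml. It assumes "wd" means weight decay, and it leaves the epoch count unset because this section does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    arch: str = "resnet18"        # ResNet18 backbone
    pretrained: bool = True       # start from ImageNet weights
    optimizer: str = "adam"
    lr: float = 1e-4
    weight_decay: float = 1e-4    # assumed meaning of "wd" on the card
    batch_size: int = 32
    image_size: int = 224         # inputs are 224x224
    epochs: Optional[int] = None  # not stated in this section; left unset


def adam_kwargs(cfg: TrainConfig) -> dict:
    """Keyword arguments in the form torch.optim.Adam accepts (lr, weight_decay)."""
    return {"lr": cfg.lr, "weight_decay": cfg.weight_decay}


cfg = TrainConfig()
print(adam_kwargs(cfg))
```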

Training history

Validation accuracy by subgroup, per epoch

Notice how overall validation accuracy stays high (~83%) while worst-group accuracy oscillates between 16% and 56%. This gap is the shortcut signal the project investigates.

Best val accuracy

86.0%

epoch 1

Best val worst-group acc

56.4%

epoch 4 · checkpoint kept here

Final-epoch train acc

99.2%

strong fit on biased train split

Why the chart matters

The blue line (overall validation accuracy) plateaus around 80–86%. The pink line (worst-group accuracy) swings between 17% and 56% across epochs and never catches up with the overall line. This volatility is the smoking gun: the model learns at different speeds on different subgroups because it leans on a feature (background) that conflicts with the label in the minority groups.

By saving the checkpoint at the epoch with the highest val_worst_group_acc, we get the fairest available model from this single training run, without adding GroupDRO or reweighting. Those would be the next step (see Limitations).
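The checkpoint rule can be sketched as a small tracker that remembers the best value of a chosen metric and reports when a new best arrives. The `BestCheckpointTracker` class, the `val_acc` key, and the history values are hypothetical and made up for illustration; only the `val_worst_group_acc` name comes from the text above, and the real loop in src/train.py may be organised differently.

```python
class BestCheckpointTracker:
    """Remember the epoch with the highest value of one validation metric."""

    def __init__(self, metric_name):
        self.metric_name = metric_name
        self.best_value = float("-inf")
        self.best_epoch = None

    def update(self, epoch, metrics):
        """Return True when this epoch beats the best so far (ties keep the earlier epoch)."""
        value = metrics[self.metric_name]
        if value > self.best_value:
            self.best_value = value
            self.best_epoch = epoch
            return True
        return False


# Illustrative per-epoch metrics (invented, not the project's results).
history = [
    {"epoch": 1, "val_acc": 0.90, "val_worst_group_acc": 0.20},
    {"epoch": 2, "val_acc": 0.85, "val_worst_group_acc": 0.50},
    {"epoch": 3, "val_acc": 0.88, "val_worst_group_acc": 0.35},
]

by_overall = BestCheckpointTracker("val_acc")
by_worst = BestCheckpointTracker("val_worst_group_acc")

for record in history:
    by_overall.update(record["epoch"], record)
    if by_worst.update(record["epoch"], record):
        # In a real training loop this is where the weights would be written,
        # for example with torch.save(model.state_dict(), path).
        pass

print("best by overall accuracy:", by_overall.best_epoch)
print("best by worst-group accuracy:", by_worst.best_epoch)
```

With these toy numbers the two criteria pick different epochs, so the choice of selection metric alone changes which model is kept, with no change to the training loss.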
