Shortcut Learning · Waterbirds
05 · Subgroup evaluation

Strong on average, weak where it counts

Standard classification metrics are computed on the balanced test split, then broken down across the four (bird × background) subgroups.

Overall accuracy       83.9%
Macro F1               78.6%
Macro precision        76.9%
Worst-group accuracy   59.5%
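
To reproduce the aggregate numbers, a minimal sketch along these lines should work against the per-image predictions CSV listed at the bottom of this page. The y_true and y_pred column names are assumptions; check the CSV header first.

    # Sketch: recompute the aggregate test metrics from the predictions CSV.
    # Column names y_true / y_pred are assumed, not confirmed by the repo.
    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score, precision_score

    df = pd.read_csv("outputs/metrics/test_predictions.csv")

    print(f"Overall accuracy     {accuracy_score(df.y_true, df.y_pred):.1%}")
    print(f"Macro F1             {f1_score(df.y_true, df.y_pred, average='macro'):.1%}")
    print(f"Macro precision      {precision_score(df.y_true, df.y_pred, average='macro'):.1%}")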

Per-subgroup accuracy

Subgroup                                Count   Accuracy   Avg confidence   Notes
Waterbird on land (waterbird-land)        642      59.5%            90.1%   worst group · shortcut conflict
Landbird on water (landbird-water)       2255      73.6%            87.6%   shortcut conflict
Waterbird on water (waterbird-water)      642      93.3%            96.9%   majority group
Landbird on land (landbird-land)         2255      98.6%            98.4%   majority group

The two minority / conflict groups (waterbird-land and landbird-water) drop sharply below the majority groups. The gap between overall accuracy (83.9%) and worst-group accuracy (59.5%) is the signature of shortcut learning.
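
Worst-group accuracy is simply the minimum of the four per-group accuracies, so it can be recomputed from the same CSV. This sketch assumes the same hypothetical y_true / y_pred columns plus a group column holding the subgroup slug; the label may instead live in separate bird and background columns.

    # Sketch: per-subgroup accuracy and worst-group accuracy.
    # The 'group' column name is an assumption about the CSV schema.
    import pandas as pd

    df = pd.read_csv("outputs/metrics/test_predictions.csv")
    df["correct"] = df.y_true == df.y_pred

    per_group = df.groupby("group")["correct"].agg(["mean", "size"])
    print(per_group)                                  # accuracy and count per subgroup
    print("Worst-group accuracy:", per_group["mean"].min())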

Confusion matrix · test split
Rows = true class, columns = predicted class; counts with row-normalized percentages in parentheses. (The shaded version, where cell intensity is proportional to count, is the figure linked below.)

                   Pred: landbird         Pred: waterbird
True: landbird     3882 (86.1% of row)     628 (13.9% of row)
True: waterbird     303 (23.6% of row)     981 (76.4% of row)

What the confusion matrix tells us

Errors are asymmetric: a true waterbird is misread as a landbird 23.6% of the time, while a true landbird is misread as a waterbird only 13.9% of the time, so per class the model leans toward landbird, the more frequent class. This matches the bias direction of the training set and is consistent with the model relying on background features that correlate with class frequency.
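
Under the same column-name assumptions as above, the row-normalized matrix can be recomputed directly:

    # Sketch: rebuild the confusion matrix with row-normalized percentages.
    # y_true / y_pred column names are assumed; labels fix the row/column order.
    import pandas as pd
    from sklearn.metrics import confusion_matrix

    df = pd.read_csv("outputs/metrics/test_predictions.csv")
    labels = ["landbird", "waterbird"]
    cm = confusion_matrix(df.y_true, df.y_pred, labels=labels)
    row_pct = cm / cm.sum(axis=1, keepdims=True)      # normalize each true-class row

    for i, t in enumerate(labels):
        for j, p in enumerate(labels):
            print(f"true={t:9s} pred={p:9s} n={cm[i, j]:4d} ({row_pct[i, j]:.1%} of row)")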

Where to look in the repo

- outputs/metrics/test_predictions.csv — the CSV with one row per test image
- outputs/metrics/test_metrics.json — the raw JSON with subgroup metrics
- outputs/figures/test_confusion_matrix.png — the confusion matrix figure
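
The schema of the metrics JSON isn't documented here, so a safe first step is to pretty-print whatever keys it holds:

    # Sketch: inspect the raw subgroup metrics (schema unknown, so just dump it).
    import json

    with open("outputs/metrics/test_metrics.json") as f:
        print(json.dumps(json.load(f), indent=2))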