05 · Subgroup evaluation
Strong on average, weak where it counts
Standard classification metrics are computed on the balanced test split, then broken down across the four (bird × background) subgroups.
| Metric | Value |
|---|---|
| Overall accuracy | 83.9% |
| Macro F1 | 78.6% |
| Macro precision | 76.9% |
| Worst-group accuracy | 59.5% |
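These headline numbers can be recomputed from the per-image predictions CSV listed under "Where to look in the repo" below. A minimal sketch, assuming the CSV exposes `label` and `pred` columns (the actual column names may differ; check the file header):

```python
# Sketch: recompute the headline metrics from the per-image predictions.
# The "label" and "pred" column names are assumptions, not confirmed by the repo.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score

df = pd.read_csv("outputs/metrics/test_predictions.csv")

overall_acc = accuracy_score(df["label"], df["pred"])
macro_f1 = f1_score(df["label"], df["pred"], average="macro")
macro_prec = precision_score(df["label"], df["pred"], average="macro")

print(f"overall accuracy : {overall_acc:.1%}")  # 83.9%
print(f"macro F1         : {macro_f1:.1%}")     # 78.6%
print(f"macro precision  : {macro_prec:.1%}")   # 76.9%
```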
Per-subgroup accuracy
| Subgroup | Count | Accuracy | Avg confidence | Notes |
|---|---|---|---|---|
| Waterbird on land (`waterbird-land`) | 642 | 59.5% | 90.1% | worst group · shortcut conflict |
| Landbird on water (`landbird-water`) | 2255 | 73.6% | 87.6% | shortcut conflict |
| Waterbird on water (`waterbird-water`) | 642 | 93.3% | 96.9% | majority group |
| Landbird on land (`landbird-land`) | 2255 | 98.6% | 98.4% | majority group |
The two minority / conflict groups (waterbird-land and landbird-water) drop sharply below the majority groups. The gap between overall accuracy (83.9%) and worst-group accuracy (59.5%) is the signature of shortcut learning.
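The per-subgroup and worst-group numbers fall out of a simple groupby over the predictions CSV. A sketch, assuming a single `subgroup` column; if the CSV instead stores separate bird and background columns, combine them first:

```python
# Sketch: per-subgroup accuracy and worst-group accuracy.
# The "subgroup" column name is an assumption about the CSV layout.
import pandas as pd

df = pd.read_csv("outputs/metrics/test_predictions.csv")
df["correct"] = (df["label"] == df["pred"]).astype(float)

# Mean correctness per subgroup, plus group sizes.
per_group = df.groupby("subgroup")["correct"].agg(["mean", "size"])
worst_group_acc = per_group["mean"].min()

print(per_group)
print(f"worst-group accuracy: {worst_group_acc:.1%}")  # 59.5%
```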
Confusion matrix · test split
Rows = true class, columns = predicted class; percentages are relative to the true-class row.

|  | Pred: landbird | Pred: waterbird |
|---|---|---|
| True: landbird | 3882 (86.1%) | 628 (13.9%) |
| True: waterbird | 303 (23.6%) | 981 (76.4%) |
What the confusion matrix tells us
Errors are asymmetric: 23.6% of true waterbirds are misclassified as landbird, versus only 13.9% of true landbirds misclassified as waterbird. The model leans toward landbird, the more frequent class, which matches the bias direction of the training set and is consistent with the model relying on background features that correlate with class frequency.
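The matrix and its row-normalized percentages can be rebuilt from the same CSV. A sketch, again assuming `label`/`pred` columns and an integer class encoding (e.g. 0 = landbird, 1 = waterbird):

```python
# Sketch: rebuild the row-normalized confusion matrix.
# Column names and class encoding are assumptions about the CSV.
import pandas as pd
from sklearn.metrics import confusion_matrix

df = pd.read_csv("outputs/metrics/test_predictions.csv")

counts = confusion_matrix(df["label"], df["pred"])    # raw counts
row_pct = counts / counts.sum(axis=1, keepdims=True)  # share of each true row

print(counts)   # expected: [[3882, 628], [303, 981]] per the table above
print(row_pct)  # expected: [[0.861, 0.139], [0.236, 0.764]]
```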
Where to look in the repo
- `outputs/metrics/test_predictions.csv` — CSV with one row per test image
- `outputs/metrics/test_metrics.json` — raw JSON with subgroup metrics
- `outputs/figures/test_confusion_matrix.png` — confusion matrix figure