Shortcut Learning · Waterbirds
05 · Subgroup evaluation

Strong on average, weak where it counts

Standard classification metrics are computed on the balanced test split, then broken down across the four (bird × background) subgroups.

Overall accuracy       83.9%
Macro F1               78.6%
Macro precision        76.9%
Worst-group accuracy   59.5%
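
To reproduce the aggregate numbers, a minimal sketch along these lines should work against the per-image predictions CSV listed at the bottom of this page. The y_true and y_pred column names are assumptions; check the CSV header first.

    # Sketch: recompute the aggregate test metrics from the predictions CSV.
    # Column names y_true / y_pred are assumed, not confirmed by the repo.
    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score, precision_score

    df = pd.read_csv("outputs/metrics/test_predictions.csv")

    print(f"Overall accuracy     {accuracy_score(df.y_true, df.y_pred):.1%}")
    print(f"Macro F1             {f1_score(df.y_true, df.y_pred, average='macro'):.1%}")
    print(f"Macro precision      {precision_score(df.y_true, df.y_pred, average='macro'):.1%}")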

Per-subgroup accuracy

Subgroup                                Count   Accuracy   Avg confidence   Notes
Waterbird on land (waterbird-land)        642      59.5%            90.1%   worst group · shortcut conflict
Landbird on water (landbird-water)       2255      73.6%            87.6%   shortcut conflict
Waterbird on water (waterbird-water)      642      93.3%            96.9%   majority group
Landbird on land (landbird-land)         2255      98.6%            98.4%   majority group

The two minority / conflict groups (waterbird-land and landbird-water) drop sharply below the majority groups. The gap between overall accuracy (83.9%) and worst-group accuracy (59.5%) is the signature of shortcut learning.
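
Worst-group accuracy is simply the minimum of the four per-group accuracies, so it can be recomputed from the same CSV. This sketch assumes the same hypothetical y_true / y_pred columns plus a group column holding the subgroup slug; the label may instead live in separate bird and background columns.

    # Sketch: per-subgroup accuracy and worst-group accuracy.
    # The 'group' column name is an assumption about the CSV schema.
    import pandas as pd

    df = pd.read_csv("outputs/metrics/test_predictions.csv")
    df["correct"] = df.y_true == df.y_pred

    per_group = df.groupby("group")["correct"].agg(["mean", "size"])
    print(per_group)                                  # accuracy and count per subgroup
    print("Worst-group accuracy:", per_group["mean"].min())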

Confusion matrix · test split
Rows = true class, columns = predicted class; counts with row-normalized percentages in parentheses. (The shaded version, where cell intensity is proportional to count, is the figure linked below.)

                   Pred: landbird         Pred: waterbird
True: landbird     3882 (86.1% of row)     628 (13.9% of row)
True: waterbird     303 (23.6% of row)     981 (76.4% of row)

What the confusion matrix tells us

Errors are asymmetric: a true waterbird is misread as a landbird 23.6% of the time, while a true landbird is misread as a waterbird only 13.9% of the time, so per class the model leans toward landbird, the more frequent class. This matches the bias direction of the training set and is consistent with the model relying on background features that correlate with class frequency.
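
Under the same column-name assumptions as above, the row-normalized matrix can be recomputed directly:

    # Sketch: rebuild the confusion matrix with row-normalized percentages.
    # y_true / y_pred column names are assumed; labels fix the row/column order.
    import pandas as pd
    from sklearn.metrics import confusion_matrix

    df = pd.read_csv("outputs/metrics/test_predictions.csv")
    labels = ["landbird", "waterbird"]
    cm = confusion_matrix(df.y_true, df.y_pred, labels=labels)
    row_pct = cm / cm.sum(axis=1, keepdims=True)      # normalize each true-class row

    for i, t in enumerate(labels):
        for j, p in enumerate(labels):
            print(f"true={t:9s} pred={p:9s} n={cm[i, j]:4d} ({row_pct[i, j]:.1%} of row)")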

Where to look in the repo

- outputs/metrics/test_predictions.csv — the CSV with one row per test image
- outputs/metrics/test_metrics.json — the raw JSON with subgroup metrics
- outputs/figures/test_confusion_matrix.png — the confusion matrix figure
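
The schema of the metrics JSON isn't documented here, so a safe first step is to pretty-print whatever keys it holds:

    # Sketch: inspect the raw subgroup metrics (schema unknown, so just dump it).
    import json

    with open("outputs/metrics/test_metrics.json") as f:
        print(json.dumps(json.load(f), indent=2))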