The model uses both the bird and the background, but it leans on the background more than it should
Three independent lines of evidence converge on the same diagnosis.
Overall test accuracy is 83.9% versus 59.5% on the worst group: a drop of 24.4 percentage points, concentrated on the conflict subgroup.
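The worst-group gap above can be computed as sketched below. The array names and the toy data are hypothetical (not the actual evaluation pipeline); the point is that the gap should be reported in percentage points, since it is a difference of two accuracies.

```python
import numpy as np

def group_accuracies(preds, labels, groups):
    """Per-subgroup accuracy plus overall and worst-group summary."""
    accs = {int(g): float((preds[groups == g] == labels[groups == g]).mean())
            for g in np.unique(groups)}
    overall = float((preds == labels).mean())
    worst = min(accs.values())
    return accs, overall, worst

# Toy example: group 3 plays the role of the conflict subgroup.
preds  = np.array([1, 1, 0, 0, 1, 0, 1, 0])
labels = np.array([1, 1, 0, 0, 1, 1, 0, 0])
groups = np.array([0, 0, 1, 1, 2, 3, 3, 3])

accs, overall, worst = group_accuracies(preds, labels, groups)
gap_points = 100 * (overall - worst)  # percentage points, not a relative "%" drop
```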
Across all four subgroups, 30–48% of Grad-CAM mass lies outside the 60% center crop, peaking on the conflict groups.
Masking the bird drops accuracy from 80.2% to 53.4%, flipping 31.8% of predictions. Conversely, masking the background raises accuracy to 86.0%: on conflict cases the background was actively misleading the model.
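The masking interventions can be sketched as below, assuming a binary segmentation mask with 1 on bird pixels. The function names, the zero fill value, and the mask format are illustrative assumptions, not the study's exact implementation.

```python
import numpy as np

def mask_image(img, seg, region="foreground", fill=0.0):
    """Blank out the bird (foreground) or the scene (background).

    img: HxWxC image array; seg: HxW binary mask, 1 = bird pixels.
    """
    out = img.copy()
    if region == "foreground":
        out[seg == 1] = fill   # remove the bird, keep the background
    else:
        out[seg == 0] = fill   # remove the background, keep the bird
    return out

def flip_rate(preds_before, preds_after):
    """Fraction of predictions that change under an intervention."""
    before = np.asarray(preds_before)
    after = np.asarray(preds_after)
    return float((before != after).mean())
```

Re-running the classifier on the masked images and comparing predictions before and after gives the flip rate and the masked-input accuracy reported above.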
Final statement (data-supported)
The CNN reaches 83.9% overall test accuracy on Waterbirds, but only 59.5% on the worst subgroup (waterbirds on land backgrounds). Grad-CAM analysis confirms that a substantial share of the model's attention falls on the background even when the prediction is correct, and the inference-time masking interventions provide direct causal evidence that background cues drive predictions: foreground masking flips 31.8% of predictions and drops accuracy to 53.4%, while background masking raises overall accuracy to 86.0% by removing a misleading signal. We therefore conclude that the model has partially internalised the background → bird-type shortcut induced by the training distribution.
The model is not purely a background classifier: foreground masking still hurts substantially, so it does use the bird. But the background contributes more to its predictions than a robust model should allow.