What this study does not claim
An honest accounting of the design's limits, with concrete suggestions for what a follow-up should add.
Birds are not always at the geometric center, so a fixed 60% center crop will misattribute saliency for off-center birds. The brief explicitly allows this approximation, but a tighter heuristic (a CUB segmentation mask or saliency-based bounding boxes) would sharpen the bias score.
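To make the approximation concrete, here is a minimal sketch of the center-crop bias score: the fraction of saliency mass that falls inside a central 60% box. The function name and `crop_frac` default are illustrative, not taken from the study's codebase.

```python
import numpy as np

def center_crop_bias_score(saliency: np.ndarray, crop_frac: float = 0.6) -> float:
    """Fraction of total saliency mass inside a central crop_frac x crop_frac box.

    Values near 1.0 suggest attention on the (assumed) bird region; low values
    suggest background reliance. Mis-scores images where the bird is off-center,
    which is exactly the limitation noted above.
    """
    h, w = saliency.shape
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    inside = saliency[top:top + ch, left:left + cw].sum()
    total = saliency.sum()
    return float(inside / total) if total > 0 else 0.0
```

On a uniform saliency map the score simply equals the crop's area fraction (0.36 for a 60% crop), which is a useful baseline to compare against.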
Results are reported for one ResNet18 run with one random seed. Re-running across seeds and reporting confidence intervals would tighten the claims.
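A follow-up could aggregate per-seed metrics like this; the sketch below uses a normal-approximation 95% interval, which is an assumption on my part rather than the study's protocol.

```python
import statistics

def mean_ci(values, z: float = 1.96):
    """Mean and normal-approximation 95% CI half-width across seed runs.

    With few seeds a t-interval or bootstrap would be more defensible;
    this is the simplest version for reporting mean +/- half-width.
    """
    m = statistics.mean(values)
    if len(values) < 2:
        return m, 0.0
    half = z * statistics.stdev(values) / len(values) ** 0.5
    return m, half
```

Usage: `mean_ci([0.80, 0.82, 0.78])` returns the mean worst-group accuracy across three seeds and its interval half-width.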
background_patch_shuffle uses a single permutation per image rather than averaging over multiple shuffles. The reported flip rate is therefore noisier than the other interventions.
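The averaging fix is cheap to implement. The sketch below assumes a hypothetical patch-shuffling helper and a `predict` callable mapping an image to a class label; it is not the study's actual `background_patch_shuffle` implementation.

```python
import numpy as np

def shuffle_background_patches(image, rng, grid=4):
    """Split an (H, W, C) image into grid x grid patches and permute them.

    One permutation per call; callers average over several rng draws.
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(patches))
    out = image.copy()
    for k, idx in enumerate(order):
        i, j = divmod(k, grid)
        out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = patches[idx]
    return out

def averaged_flip_rate(predict, image, n_shuffles=10, seed=0):
    """Mean prediction-flip rate over n_shuffles permutations.

    Averaging reduces the single-shuffle noise noted above.
    """
    rng = np.random.default_rng(seed)
    base = predict(image)
    flips = [predict(shuffle_background_patches(image, rng)) != base
             for _ in range(n_shuffles)]
    return float(np.mean(flips))
```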
We adopt one DRO-flavoured choice (model selection by worst-group accuracy) but don't train an actual GroupDRO model (Sagawa et al. 2019). Comparing standard ERM vs. GroupDRO on the same metrics would directly test whether our shortcut diagnosis can be mitigated.
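The worst-group selection criterion itself is straightforward; a minimal sketch (helper name is mine) of the metric a GroupDRO comparison would share with the ERM baseline:

```python
from collections import defaultdict

def worst_group_accuracy(preds, labels, groups):
    """Minimum per-group accuracy over (label, place) subgroups.

    This is the DRO-flavoured model-selection criterion: a model that
    exploits the background shortcut scores well on average but poorly here.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    return min(correct[g] / total[g] for g in total)
```

Training GroupDRO itself additionally reweights the loss toward the worst group at each step; this sketch only covers the evaluation side.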
Different attribution methods (Integrated Gradients, SHAP, RISE) can disagree. Cross-checking the bias score with at least one other method would strengthen the saliency-based claims.
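A lightweight agreement check, before trusting any single method, is the rank correlation between two saliency maps. The sketch below is a tie-unaware Spearman correlation in plain numpy (an assumption about how one might cross-check, not the study's procedure):

```python
import numpy as np

def rank_correlation(map_a, map_b):
    """Spearman rank correlation between two flattened saliency maps.

    Note: argsort-of-argsort ranking does not handle ties; for maps with
    many equal values, use a tie-aware implementation instead.
    """
    a = np.argsort(np.argsort(map_a.ravel())).astype(float)
    b = np.argsort(np.argsort(map_b.ravel())).astype(float)
    a -= a.mean()
    b -= b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))
```

Low correlation between, say, Grad-CAM and Integrated Gradients maps would flag images where the bias score is method-dependent.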
We follow the convention used by grodino/waterbirds (label 0 = landbird, label 1 = waterbird, place 0 = land, place 1 = water). If a different mirror reverses these, the subgroup names need to be re-verified before re-running.
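Encoding the convention in one place makes the re-verification step mechanical; a small sketch (function name is illustrative):

```python
def subgroup_name(label: int, place: int) -> str:
    """Subgroup name under the grodino/waterbirds convention
    (label 0 = landbird, 1 = waterbird; place 0 = land, 1 = water).

    Re-verify against the mirror's metadata before reusing elsewhere:
    a flipped convention silently swaps the minority and majority groups.
    """
    bird = "landbird" if label == 0 else "waterbird"
    background = "land" if place == 0 else "water"
    return f"{bird}_on_{background}"
```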
Future work
- Replace center-crop heuristic with the CUB segmentation masks where available.
- Train a GroupDRO baseline and recompute every saliency / intervention metric.
- Average background_patch_shuffle over N permutations.
- Add a saliency-method comparison (Integrated Gradients vs. Grad-CAM vs. RISE).
- Try a different backbone (ConvNeXt-Tiny or a small ViT) to test whether the shortcut is architecture-specific.