Shortcut Learning · Waterbirds
10 · Limitations

What this study does not claim

An honest accounting of the design's limits, with concrete suggestions for what a follow-up should add.

Coarse foreground heuristic

Birds aren't always at the geometric center, so a fixed 60% center crop counts bird saliency as background (and background saliency as foreground) whenever the bird is off-center. The brief explicitly allows this approximation, but a tighter heuristic (segmentation masks from CUB or saliency-based bounding boxes) would sharpen the bias score.
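
To make the failure mode concrete, here is a minimal sketch of how a center-box score behaves on an off-center bird. It is not the study's implementation: `center_box` and `foreground_fraction` are hypothetical helpers, and the sketch assumes "60%" means 60% of each side length rather than 60% of the area.

```python
def center_box(height, width, frac=0.6):
    """Centered box covering `frac` of each side (assumption: per side, not per area).

    Returns (top, bottom, left, right) with half-open row/column ranges.
    """
    box_h, box_w = round(height * frac), round(width * frac)
    top, left = (height - box_h) // 2, (width - box_w) // 2
    return top, top + box_h, left, left + box_w


def foreground_fraction(saliency, box):
    """Share of the total saliency mass that falls inside `box`."""
    top, bottom, left, right = box
    total = sum(sum(row) for row in saliency)
    if total == 0:
        return 0.0
    inside = sum(sum(row[left:right]) for row in saliency[top:bottom])
    return inside / total


# Toy 10x10 saliency map: all mass sits on a "bird" in the top-left 3x3 corner.
toy = [[1.0 if r < 3 and c < 3 else 0.0 for c in range(10)] for r in range(10)]

crop_score = foreground_fraction(toy, center_box(10, 10))  # center-crop heuristic
mask_score = foreground_fraction(toy, (0, 3, 0, 3))        # tight box around the bird
```

In this toy case the model attends only to the bird, yet the center crop credits most of that saliency to the background, while a tight box credits all of it to the foreground.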

Single seed, single architecture

Results are reported for one ResNet-18 run with one random seed. Re-running across several seeds and reporting the mean with confidence intervals would show how much of each reported difference is seed-to-seed variance.
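
A minimal sketch of the multi-seed reporting a follow-up could use, assuming one scalar metric (for example worst-group accuracy) per seed. `mean_ci95` is a hypothetical helper, and the lookup table only covers 2 to 10 runs.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def mean_ci95(values):
    """Return (mean, lower, upper) of a 95% t-interval over per-seed metrics."""
    n = len(values)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports between 2 and 10 runs")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width
```

With only a handful of seeds the interval will be wide. That is the point: it shows whether a gap between two conditions is larger than the run-to-run spread.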

Patch shuffle is non-deterministic

background_patch_shuffle uses a single permutation per image rather than averaging over multiple shuffles. The reported flip rate therefore depends on the particular permutation drawn and is noisier than the flip rates of the other interventions.
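
Averaging over permutations could look like the sketch below. The names are hypothetical: `predict_fn` stands in for the trained model's class prediction, and `shuffle_fn(image, rng)` stands in for a wrapper around background_patch_shuffle that takes an explicit random generator. The real function's signature is not shown in this report, so that wrapper is an assumption.

```python
import random


def mean_flip_rate(images, predict_fn, shuffle_fn, n_perms=10, seed=0):
    """Fraction of (image, permutation) pairs whose prediction changes.

    A seeded generator keeps the estimate reproducible across re-runs.
    """
    if not images or n_perms < 1:
        raise ValueError("need at least one image and one permutation")
    rng = random.Random(seed)
    flips = 0
    for image in images:
        baseline = predict_fn(image)
        for _ in range(n_perms):
            if predict_fn(shuffle_fn(image, rng)) != baseline:
                flips += 1
    return flips / (len(images) * n_perms)
```

Setting n_perms=1 reproduces the single-permutation estimate, so the same helper can show how much the flip rate moves as N grows.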

No GroupDRO or reweighting baseline

We adopt one DRO-flavoured choice (model selection by worst-group accuracy) but don't train an actual GroupDRO model (Sagawa et al. 2019) or a group-reweighted baseline. Comparing standard ERM vs. GroupDRO on the same metrics would directly test whether the shortcut we diagnose can be mitigated.
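
For clarity about what the DRO-flavoured choice covers, the sketch below shows checkpoint selection by worst-group accuracy. This is a selection criterion applied to an ERM-trained model, not GroupDRO training. Both function names are hypothetical, and groups are assumed to be (label, place) pairs.

```python
from collections import defaultdict


def worst_group_accuracy(labels, places, preds):
    """Return (worst accuracy, per-group accuracy) over (label, place) groups."""
    correct = defaultdict(int)
    count = defaultdict(int)
    for y, place, y_hat in zip(labels, places, preds):
        count[(y, place)] += 1
        correct[(y, place)] += int(y == y_hat)
    per_group = {group: correct[group] / count[group] for group in count}
    return min(per_group.values()), per_group


def select_checkpoint(history):
    """history: list of (epoch, worst_group_acc); return the best epoch."""
    return max(history, key=lambda item: item[1])[0]
```

A GroupDRO baseline would instead change the training objective, so the same metric would be reported for both models rather than used only to pick a checkpoint.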

Grad-CAM is only one saliency method

Different attribution methods (Integrated Gradients, SHAP, RISE) can disagree. Cross-checking the bias score with at least one other method would strengthen the saliency-based claims.
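
One simple way to quantify disagreement between two saliency maps of the same image is the overlap of their top-k pixels. The sketch below is one possible agreement measure, not one used in this study. `topk_iou` is a hypothetical helper, and both maps are assumed to be same-sized 2D lists.

```python
def topk_iou(map_a, map_b, k):
    """Intersection-over-union of the k most salient pixels in each map."""
    if k < 1:
        raise ValueError("k must be at least 1")

    def top_indices(saliency):
        flat = [value for row in saliency for value in row]
        # Sort by descending saliency; break ties by pixel index for determinism.
        order = sorted(range(len(flat)), key=lambda i: (-flat[i], i))
        return set(order[:k])

    a, b = top_indices(map_a), top_indices(map_b)
    return len(a & b) / len(a | b)
```

A value near 1 means the two methods highlight the same region. A low value on many images would suggest that a bias score built on Grad-CAM alone is method-dependent.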

Class label semantics in the HF mirror

We follow the convention used by grodino/waterbirds (label 0 = landbird, label 1 = waterbird, place 0 = land, place 1 = water). If a different mirror reverses these, the subgroup names need to be re-verified before re-running.
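
The convention above can be pinned down in code and given a cheap sanity check. The sketch relies on the fact that the Waterbirds training split is dominated by the aligned groups (landbirds on land, waterbirds on water). The helper names are hypothetical, and the caller supplies the counts from whichever mirror is loaded.

```python
# Convention assumed in this study (as used by grodino/waterbirds).
LABEL_NAMES = {0: "landbird", 1: "waterbird"}
PLACE_NAMES = {0: "land", 1: "water"}


def subgroup_name(label, place):
    """Human-readable subgroup name for a (label, place) pair."""
    return f"{LABEL_NAMES[label]}_on_{PLACE_NAMES[place]}"


def aligned_groups_dominate(train_counts):
    """train_counts maps (label, place) -> number of training images.

    Returns False if, for either class, the conflicting group is at least
    as large as the aligned one, which suggests a flipped label or place code.
    """
    for label in LABEL_NAMES:
        aligned = train_counts.get((label, label), 0)
        conflicting = train_counts.get((label, 1 - label), 0)
        if aligned <= conflicting:
            return False
    return True
```

This check only catches a mirror that flips one of the two codes. If both label and place are reversed together, the counts still look aligned, so a visual spot check of a few images per subgroup remains necessary.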

Future work

  • Replace the center-crop heuristic with CUB segmentation masks where available.
  • Train a GroupDRO baseline and recompute every saliency / intervention metric.
  • Average background_patch_shuffle over N permutations.
  • Add a saliency-method comparison (Integrated Gradients vs. Grad-CAM vs. RISE).
  • Try a different backbone (ConvNeXt-Tiny or a small ViT) to test whether the shortcut is architecture-specific.