Shortcut Learning · Waterbirds
Project 18 · Advanced Deep Learning

Saliency-based Analysis of Shortcut Learning in CNNs

A printable summary generated from outputs/. Repository: abidali9000/Deep-Learning

1. Abstract

We train a ResNet18 on the Waterbirds dataset, in which background and bird type are spuriously correlated, and use Grad-CAM together with a foreground/background attention-bias score to test whether the model relies on background cues. The CNN reaches 83.9% overall accuracy on the balanced test split but only 59.5% on the worst (waterbird-on-land) subgroup. Four inference-time interventions provide direct causal evidence that background cues drive predictions on the conflict subgroups: foreground masking flips 31.8% of predictions, while background masking actually raises overall accuracy to 86.0%. We conclude that the model has partially internalised the background → bird-type shortcut.

2. Method

The pipeline is implemented in eight Python files inside src/ (src/train.py, src/evaluate.py, src/gradcam_analysis.py, src/interventions.py, etc.). The full model configuration lives in config.yaml.

  • Backbone: ImageNet-pretrained ResNet18, FC replaced with 2 logits.
  • Optim: Adam, lr 1e-4, weight-decay 1e-4, batch 32, 15 epochs, 224×224.
  • Best checkpoint by validation worst-group accuracy.
  • Grad-CAM target layer: model.layer4[-1], predicted-class target.
  • Foreground heuristic: the central 60% crop is treated as foreground. Bias score = background saliency mass / total saliency mass.
  • Interventions: background blur (Gaussian k=31), background mask (grey 0.5), background patch shuffle (16×16), foreground mask (grey 0.5).
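The attention-bias score above can be sketched in a few lines of NumPy: treat the central 60% crop as foreground and report the fraction of total Grad-CAM saliency mass that falls outside it. This is an illustrative re-implementation, not the code in src/; the function name `bias_score` is ours.

```python
import numpy as np

def bias_score(saliency: np.ndarray, fg_frac: float = 0.6) -> float:
    """Fraction of total saliency mass outside the central
    fg_frac x fg_frac crop (the foreground heuristic)."""
    h, w = saliency.shape
    fh, fw = int(h * fg_frac), int(w * fg_frac)
    top, left = (h - fh) // 2, (w - fw) // 2
    fg_mask = np.zeros((h, w), dtype=bool)
    fg_mask[top:top + fh, left:left + fw] = True
    total = saliency.sum()
    if total == 0:
        return 0.0
    return float(saliency[~fg_mask].sum() / total)

# A saliency map concentrated inside the 60% centre crop scores 0.
cam = np.zeros((224, 224))
cam[80:144, 80:144] = 1.0
print(bias_score(cam))  # → 0.0
```

A uniform saliency map scores roughly 0.64 at 224×224, which is why the subgroup scores in section 5 (35–48%) should be read against that baseline rather than against 50%.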

3. Test classification metrics

  • Overall accuracy: 83.9%
  • Macro precision: 76.9%
  • Macro recall: 81.2%
  • Macro F1: 78.6%
  • Worst-group accuracy: 59.5%
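Worst-group accuracy, as used above and for checkpoint selection, is simply the minimum per-subgroup accuracy. A minimal sketch (variable names are illustrative, not from src/evaluate.py):

```python
import numpy as np

def worst_group_accuracy(correct: np.ndarray, groups: np.ndarray) -> float:
    """Minimum accuracy over subgroups.
    correct: boolean per-example correctness; groups: integer group ids."""
    return min(float(correct[groups == g].mean()) for g in np.unique(groups))

correct = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
groups = np.array([0, 0, 0, 1, 1, 1])
print(worst_group_accuracy(correct, groups))  # group 1 is worst: 1/3 correct
```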

Subgroup metrics

Subgroup                      | Count | Accuracy | Avg confidence
Waterbird on water (majority) |   642 |    93.3% |          96.9%
Waterbird on land (conflict)  |   642 |    59.5% |          90.1%
Landbird on land (majority)   |  2255 |    98.6% |          98.4%
Landbird on water (conflict)  |  2255 |    73.6% |          87.6%
Confusion matrix (figure)

4. Training history (selected epochs)

Epoch | Train acc | Val acc | Worst-group acc | WB-land | LB-water
    1 |     93.7% |   86.0% |           32.3% |   32.3% |    89.5%
    3 |     98.7% |   80.2% |           54.1% |   54.1% |    64.2%
    5 |     99.1% |   80.7% |           26.3% |   26.3% |    74.9%
    7 |     99.5% |   81.2% |           50.4% |   50.4% |    67.6%
    9 |     99.7% |   81.8% |           40.6% |   40.6% |    72.5%
   11 |     99.7% |   83.6% |           29.3% |   29.3% |    80.9%
   13 |     99.8% |   81.6% |           26.3% |   26.3% |    76.8%
   15 |     99.2% |   79.0% |           16.5% |   16.5% |    73.6%
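Checkpoint selection by validation worst-group accuracy (section 2) is an argmax over epochs; for the history above it picks epoch 3 rather than the final epoch, where worst-group accuracy has collapsed to 16.5%. A minimal sketch using the table's values:

```python
# (epoch, validation worst-group accuracy) pairs from the table above
history = [(1, 32.3), (3, 54.1), (5, 26.3), (7, 50.4),
           (9, 40.6), (11, 29.3), (13, 26.3), (15, 16.5)]

# Keep the checkpoint with the highest worst-group accuracy
best_epoch, best_wga = max(history, key=lambda e: e[1])
print(best_epoch, best_wga)  # → 3 54.1
```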

5. Grad-CAM saliency by subgroup

Subgroup                      | Foreground | Background (bias) | Sample acc.
Landbird on land (majority)   |      51.6% |             48.4% |       93.3%
Landbird on water (conflict)  |      59.3% |             40.7% |       60.0%
Waterbird on land (conflict)  |      59.3% |             40.7% |       70.0%
Waterbird on water (majority) |      65.2% |             34.8% |       96.7%

Representative shortcut failures

  • Shortcut failure: a waterbird placed on land. The model misclassifies it as a landbird; saliency leaks onto the land background.
  • Another waterbird-on-land failure, giving strong evidence that the background is steering the prediction.
  • Shortcut failure: a landbird on water becomes "waterbird"; saliency drifts toward the water background.

6. Intervention results

Condition                | Accuracy | Flip rate | Avg BG saliency
Background blur          |    83.5% |      7.7% |           33.0%
Background mask          |    86.0% |     10.4% |           27.8%
Background patch shuffle |    83.6% |     12.0% |           40.6%
Foreground mask          |    53.4% |     31.8% |           67.4%
Original                 |    80.2% |      0.0% |           41.0%
Figures: accuracy by intervention; background saliency by intervention
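The background-mask intervention and the flip-rate metric can be sketched as follows: grey-fill (0.5) everything outside the 60% centre crop, then count how many predictions change relative to the original images. This is an illustrative NumPy sketch, not the code in src/interventions.py; the function names are ours.

```python
import numpy as np

def mask_background(img: np.ndarray, fg_frac: float = 0.6,
                    grey: float = 0.5) -> np.ndarray:
    """Replace everything outside the central crop with grey (0.5)."""
    h, w = img.shape[:2]
    fh, fw = int(h * fg_frac), int(w * fg_frac)
    top, left = (h - fh) // 2, (w - fw) // 2
    out = np.full_like(img, grey)
    out[top:top + fh, left:left + fw] = img[top:top + fh, left:left + fw]
    return out

def flip_rate(orig_preds: np.ndarray, new_preds: np.ndarray) -> float:
    """Fraction of predictions changed by the intervention."""
    return float((orig_preds != new_preds).mean())

img = np.random.rand(224, 224, 3)
masked = mask_background(img)  # corners grey, centre crop untouched
print(flip_rate(np.array([0, 1, 1, 0]), np.array([0, 1, 0, 0])))  # → 0.25
```

The foreground-mask condition is the complement: grey-fill the centre crop and keep the background, which is what collapses accuracy to 53.4% in the table above.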

Per subgroup × condition (selected)

Condition                | Subgroup                     |   Acc |  Flip | BG sal.
Background blur          | Landbird on water (conflict) | 79.1% | 12.8% |   28.7%
Background blur          | Waterbird on land (conflict) | 53.2% | 11.5% |   35.1%
Background mask          | Landbird on water (conflict) | 86.2% | 17.8% |   27.7%
Background mask          | Waterbird on land (conflict) | 59.7% | 12.2% |   26.8%
Background patch shuffle | Landbird on water (conflict) | 83.6% | 19.8% |   38.8%
Background patch shuffle | Waterbird on land (conflict) | 48.9% | 15.8% |   41.2%
Foreground mask          | Landbird on water (conflict) | 20.6% | 55.6% |   70.2%
Foreground mask          | Waterbird on land (conflict) |  7.2% | 47.5% |   66.2%
Original                 | Landbird on water (conflict) | 70.0% |  0.0% |   36.4%
Original                 | Waterbird on land (conflict) | 50.4% |  0.0% |   44.7%

7. Conclusion

The CNN's 83.9% overall accuracy hides a 24.4-point gap to its worst subgroup (59.5%). Saliency analysis shows the model's attention drifts onto the background, and inference-time interventions confirm a causal role for the background: foreground masking collapses accuracy to 53.4% with a 31.8% flip rate, while background masking raises overall accuracy to 86.0%. This is consistent with shortcut learning: the model uses the bird, but it also relies on the background to a degree that hurts minority-subgroup generalisation.

8. References

  1. Sagawa et al., 2019. Distributionally Robust Neural Networks for Group Shifts. arXiv:1911.08731.
  2. Geirhos et al., 2020. Shortcut Learning in Deep Neural Networks. Nature Machine Intelligence.
  3. Selvaraju et al., 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. ICCV.
  4. Wah et al., 2011. The Caltech-UCSD Birds-200-2011 (CUB-200-2011) Dataset. Caltech Technical Report.
  5. Zhou et al., 2017. Places: A 10 Million Image Database for Scene Recognition. IEEE TPAMI.