Shortcut Learning · Waterbirds
07 · Interventions

Causal evidence: edit the image, watch the prediction

The brief asks for at least two interventions. We implement four: three that degrade or remove the background in different ways, plus one that erases the foreground. For each condition we re-run inference and Grad-CAM.

Background blur
Gaussian blur with a 31-pixel kernel (σ derived from the kernel size), applied outside the center crop. Removes high-frequency background detail without changing the overall colour.
Background mask
Replace the background with a flat 0.5 grey. Removes essentially all background information.
Background patch shuffle
Permute 16×16 background patches. Keeps the colour distribution but destroys structure.
Foreground mask
Replace the center crop with grey — i.e. erase the bird while keeping the background untouched. The hardest test for shortcut reliance.
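As a rough illustration of the mask and shuffle interventions described above, here is a minimal NumPy sketch. It is not the code in src/interventions.py: the function names, the `frac=0.5` center-box size, and the assumption that images are float arrays in [0, 1] with shape (H, W, C) are all hypothetical. Blur is left out because it would need an image library.

```python
import numpy as np

def center_box(h, w, frac=0.5):
    """(top, bottom, left, right) of a centered box; `frac` is an assumed size."""
    bh, bw = int(h * frac), int(w * frac)
    top, left = (h - bh) // 2, (w - bw) // 2
    return top, top + bh, left, left + bw

def background_mask(img, box, fill=0.5):
    """Keep the box, replace everything outside it with flat grey."""
    t, b, l, r = box
    out = np.full_like(img, fill)
    out[t:b, l:r] = img[t:b, l:r]
    return out

def foreground_mask(img, box, fill=0.5):
    """Replace the box with flat grey, keep the background untouched."""
    t, b, l, r = box
    out = img.copy()
    out[t:b, l:r] = fill
    return out

def background_patch_shuffle(img, box, patch=16, seed=0):
    """Permute patch x patch tiles that lie fully outside the box.

    Tiles that partly overlap the box are left in place in this sketch.
    """
    t, b, l, r = box
    h, w = img.shape[:2]
    rng = np.random.default_rng(seed)
    corners = [(y, x)
               for y in range(0, h - patch + 1, patch)
               for x in range(0, w - patch + 1, patch)
               if y + patch <= t or y >= b or x + patch <= l or x >= r]
    perm = rng.permutation(len(corners))
    out = img.copy()
    for (dy, dx), src in zip(corners, perm):
        sy, sx = corners[src]
        out[dy:dy + patch, dx:dx + patch] = img[sy:sy + patch, sx:sx + patch]
    return out
```

Because the shuffle only moves whole tiles, the set of background pixel values is unchanged, which is what "keeps the colour distribution" means here.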

Overall effect

Accuracy by intervention
What happens to test accuracy when we modify the image at inference time?
Prediction flip rate vs. original
Fraction of samples where the predicted class changes after the intervention.
Average background saliency ratio
How much of the Grad-CAM heat sits outside the center foreground box?
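The two non-accuracy metrics above can be sketched as follows. This assumes the background saliency ratio is the share of total (non-negative) Grad-CAM mass falling outside the foreground box, which is consistent with the FG and BG saliency columns summing to 100% in the subgroup table; the function names and the `box` tuple layout are hypothetical.

```python
import numpy as np

def flip_rate(orig_pred, new_pred):
    """Fraction of samples whose predicted class changed after the intervention."""
    orig_pred, new_pred = np.asarray(orig_pred), np.asarray(new_pred)
    return float(np.mean(orig_pred != new_pred))

def background_saliency_ratio(cam, box):
    """Share of Grad-CAM mass outside box = (top, bottom, left, right)."""
    t, b, l, r = box
    cam = np.clip(np.asarray(cam, dtype=float), 0.0, None)  # ignore negative values
    total = cam.sum()
    if total == 0:
        return 0.0  # an empty heatmap carries no saliency to attribute
    return float(1.0 - cam[t:b, l:r].sum() / total)
```

Averaging `background_saliency_ratio` over the samples in a condition would give one bar of the saliency chart under these assumptions.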

Reading the bars

  • Foreground mask drops accuracy from 80.2% to 53.4% with a 31.8% flip rate — the bird matters, a lot. The model is not a pure background classifier.
  • Background mask actually increases overall accuracy (86.0%). When we delete the background, predictions on the conflict groups improve — direct evidence that the background was misleading the model on those subgroups.
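  • Background patch shuffle flips the most predictions of the three background ablations (about 12% of samples, versus about 10% for the mask and about 8% for blur, taking N-weighted averages of the subgroup flip rates), even though the colour distribution is preserved. That suggests structural background features (water vs. land texture) were carrying signal.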
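The overall figures quoted in these bullets can be cross-checked against the subgroup breakdown as N-weighted averages. The short sketch below copies N, accuracy, and flip rate for three conditions from that table; the dictionary layout and helper name are ours.

```python
# (N, accuracy %, flip rate %) per subgroup, copied from the subgroup breakdown
rows = {
    "Original":        [(352, 98.3, 0.0), (383, 70.0, 0.0),
                        (139, 50.4, 0.0), (126, 93.7, 0.0)],
    "Background mask": [(352, 96.9, 1.4), (383, 86.2, 17.8),
                        (139, 59.7, 12.2), (126, 84.1, 11.1)],
    "Foreground mask": [(352, 97.2, 4.0), (383, 20.6, 55.6),
                        (139, 7.2, 47.5), (126, 81.7, 19.8)],
}

def weighted(condition, col):
    """N-weighted mean of column `col` (1 = accuracy, 2 = flip rate)."""
    groups = rows[condition]
    total_n = sum(g[0] for g in groups)
    return sum(g[0] * g[col] for g in groups) / total_n

for name in rows:
    print(f"{name}: accuracy {weighted(name, 1):.1f}%, flips {weighted(name, 2):.1f}%")
```

The weighted values agree with the 80.2%, 86.0%, 53.4%, and 31.8% figures quoted above to one decimal place.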

Pre-rendered figures

These come straight from src/interventions.py.

Figure: Accuracy by intervention
Figure: Background saliency by intervention

Side-by-side Grad-CAM comparison

For each conflict-group sample, we render the original image and its four intervened versions, each with the Grad-CAM overlay for the model's predicted class. Generated by src/comparison_plates.py.

Waterbird on land — shortcut failure case under each intervention.
Another waterbird-on-land sample.
Landbird on water — note how the bias persists across blur and shuffle.
Landbird on water — foreground-mask still over-predicts waterbird.
Waterbird on land: conflict-group success case, for contrast.
Landbird-on-water success — used as a control.

Subgroup breakdown

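Drilling into the 4 × 5 grid (four subgroups by five conditions) is where the story really lands. Foreground masking devastates the conflict groups: landbird-on-water falls from 70.0% to 20.6% accuracy and waterbird-on-land from 50.4% to 7.2%. The majority groups hold up far better: landbird-on-land is nearly untouched (98.3% to 97.2%), while waterbird-on-water loses 12 points (93.7% to 81.7%).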

| Condition | Subgroup | N | Accuracy | Flip rate | Conf. drop | FG saliency | BG saliency |
|---|---|---|---|---|---|---|---|
| Original | Landbird on land (majority) | 352 | 98.3% | 0.0% | 0.0% | 54.6% | 45.4% |
| Original | Landbird on water (conflict) | 383 | 70.0% | 0.0% | 0.0% | 63.6% | 36.4% |
| Original | Waterbird on land (conflict) | 139 | 50.4% | 0.0% | 0.0% | 55.3% | 44.7% |
| Original | Waterbird on water (majority) | 126 | 93.7% | 0.0% | 0.0% | 61.7% | 38.3% |
| Background blur | Landbird on land (majority) | 352 | 97.7% | 1.1% | 1.9% | 61.9% | 38.1% |
| Background blur | Landbird on water (conflict) | 383 | 79.1% | 12.8% | -1.2% | 71.3% | 28.7% |
| Background blur | Waterbird on land (conflict) | 139 | 53.2% | 11.5% | 0.6% | 64.9% | 35.1% |
| Background blur | Waterbird on water (majority) | 126 | 90.5% | 6.3% | 2.0% | 70.2% | 29.8% |
| Background mask | Landbird on land (majority) | 352 | 96.9% | 1.4% | 3.0% | 71.2% | 28.8% |
| Background mask | Landbird on water (conflict) | 383 | 86.2% | 17.8% | -4.4% | 72.3% | 27.7% |
| Background mask | Waterbird on land (conflict) | 139 | 59.7% | 12.2% | 0.0% | 73.2% | 26.8% |
| Background mask | Waterbird on water (majority) | 126 | 84.1% | 11.1% | 4.1% | 73.4% | 26.6% |
| Background patch shuffle | Landbird on land (majority) | 352 | 98.3% | 1.7% | 1.0% | 57.1% | 42.9% |
| Background patch shuffle | Landbird on water (conflict) | 383 | 83.6% | 19.8% | -3.3% | 61.2% | 38.8% |
| Background patch shuffle | Waterbird on land (conflict) | 139 | 48.9% | 15.8% | 1.1% | 58.8% | 41.2% |
| Background patch shuffle | Waterbird on water (majority) | 126 | 81.0% | 12.7% | 7.8% | 61.2% | 38.8% |
| Foreground mask | Landbird on land (majority) | 352 | 97.2% | 4.0% | 4.0% | 34.9% | 65.1% |
| Foreground mask | Landbird on water (conflict) | 383 | 20.6% | 55.6% | 6.0% | 29.8% | 70.2% |
| Foreground mask | Waterbird on land (conflict) | 139 | 7.2% | 47.5% | -1.5% | 33.8% | 66.2% |
| Foreground mask | Waterbird on water (majority) | 126 | 81.7% | 19.8% | 11.0% | 33.0% | 67.0% |
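For readers who want to reproduce a breakdown of this shape, the grouping step might look like the sketch below. It assumes one record per (sample, condition) with hypothetical field names `condition`, `subgroup`, `label`, `pred`, and `orig_pred`; it covers only N, accuracy, and flip rate, since the exact definition of the confidence-drop column is not given here.

```python
from collections import defaultdict

def subgroup_breakdown(records):
    """Aggregate per-sample records into per-(condition, subgroup) metrics."""
    cells = defaultdict(lambda: [0, 0, 0])  # [n, n_correct, n_flipped]
    for r in records:
        cell = cells[(r["condition"], r["subgroup"])]
        cell[0] += 1
        cell[1] += r["pred"] == r["label"]      # correct under this condition
        cell[2] += r["pred"] != r["orig_pred"]  # differs from the unedited prediction
    return {
        key: {"n": n, "accuracy": correct / n, "flip_rate": flipped / n}
        for key, (n, correct, flipped) in cells.items()
    }
```

For the Original condition, `orig_pred` equals `pred` by construction, which is why its flip rate is 0.0% in every row.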