02 · Waterbirds dataset

A dataset engineered to expose background bias

Waterbirds (Sagawa et al., 2019) crops birds from CUB and pastes them onto land or water backgrounds from Places. The training split makes background highly predictive of bird type; the test split removes that correlation.

Source: src/dataset.py, config.yaml

Train split (skewed)

waterbird-water (majority): ≈ 95% of waterbirds
waterbird-land (minority): ≈ 5% of waterbirds
landbird-land (majority): ≈ 95% of landbirds
landbird-water (minority): ≈ 5% of landbirds

Test split (balanced)

waterbird-water: 50% of waterbirds
waterbird-land (conflict): 50% of waterbirds
landbird-land: 50% of landbirds
landbird-water (conflict): 50% of landbirds

Why this design exposes shortcuts

In the train split, predicting the bird type from the background alone already reaches ~95% accuracy, so a model that takes that shortcut looks competent during training. In the test split, the same shortcut scores only ~50%, no better than guessing between the two classes, so subgroup evaluation surfaces the failure mode immediately.
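The arithmetic behind those two numbers can be checked with a short sketch. The helper `shortcut_accuracy` is hypothetical (it is not part of the project code) and takes the within-class background shares quoted above as input. Because both classes have the same share on their "own" background, the result does not depend on how the classes are weighted.

```python
# Background-only shortcut: predict "waterbird" on water, "landbird" on land.
# Such a predictor is right exactly when the bird sits on its "own" background.

def shortcut_accuracy(match_fraction_per_class, class_weights=None):
    """Accuracy of a predictor that looks only at the background.

    match_fraction_per_class: fraction of each class on its matching background.
    class_weights: relative class frequencies (uniform if omitted; an assumption).
    """
    n = len(match_fraction_per_class)
    if class_weights is None:
        class_weights = [1.0 / n] * n
    total = sum(class_weights)
    weighted = sum(w * m for w, m in zip(class_weights, match_fraction_per_class))
    return weighted / total


# Shares taken from the split descriptions above.
train_acc = shortcut_accuracy([0.95, 0.95])
test_acc = shortcut_accuracy([0.50, 0.50])
print(f"train: {train_acc:.2f}, test: {test_acc:.2f}")
```

The gap between the two values is what makes the conflict subgroups informative: overall train accuracy hides the shortcut, while per-group test accuracy does not.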

Subgroup definitions

We use the four standard subgroups defined in src/utils.py:

0 · waterbird-water
1 · waterbird-land (conflict)
2 · landbird-land
3 · landbird-water (conflict)
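As an illustration of the numbering above, here is a minimal sketch of how a (label, place) pair could map to a subgroup index. The integer encoding (label 1 = waterbird, place 1 = water) and the names `group_index` and `is_conflict` are assumptions made for this example; the actual definitions live in src/utils.py and may differ.

```python
# Assumed encoding (not confirmed by the source):
#   label: 1 = waterbird, 0 = landbird
#   place: 1 = water background, 0 = land background
GROUP_NAMES = {
    0: "waterbird-water",
    1: "waterbird-land (conflict)",
    2: "landbird-land",
    3: "landbird-water (conflict)",
}

# (label, place) -> subgroup index, following the numbering listed above.
_GROUP_INDEX = {(1, 1): 0, (1, 0): 1, (0, 0): 2, (0, 1): 3}


def group_index(label: int, place: int) -> int:
    """Return the subgroup index for a bird label and background place."""
    return _GROUP_INDEX[(label, place)]


def is_conflict(group: int) -> bool:
    """True when bird type and background disagree."""
    return group in (1, 3)


print(GROUP_NAMES[group_index(label=1, place=0)])
```

Note that this numbering is not the same as the `2 * label + place` formula used in some other Waterbirds code, so an explicit lookup table is the safer choice.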

How we load it

We load the public mirror grodino/waterbirds via Hugging Face Datasets and wrap it in a small torch.utils.data.Dataset that exposes (image tensor, label, place, group, index). All transforms are deterministic at evaluation time.
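The shape of that wrapper can be sketched without any dependencies. This is only an outline of the indexing contract, not the code in src/dataset.py: the class name `WaterbirdsWrapper`, the column names `image`, `label`, and `place`, and the injected `transform` and `group_fn` callables are all assumptions. In the project, the class would subclass torch.utils.data.Dataset, `records` would be the Hugging Face split, and `transform` would return an image tensor.

```python
from typing import Any, Callable, Sequence, Tuple


class WaterbirdsWrapper:
    """Map-style dataset sketch returning (image, label, place, group, index)."""

    def __init__(
        self,
        records: Sequence[dict],             # anything indexable that yields dict rows
        transform: Callable[[Any], Any],     # deterministic at evaluation time
        group_fn: Callable[[int, int], int], # maps (label, place) to a subgroup index
    ):
        self.records = records
        self.transform = transform
        self.group_fn = group_fn

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, index: int) -> Tuple[Any, int, int, int, int]:
        row = self.records[index]
        label, place = int(row["label"]), int(row["place"])  # assumed column names
        image = self.transform(row["image"])
        return image, label, place, self.group_fn(label, place), index


# Toy stand-ins so the sketch runs without PyTorch or a download.
toy_records = [
    {"image": [0, 255], "label": 1, "place": 0},
    {"image": [128, 64], "label": 0, "place": 0},
]
scale = lambda pixels: [v / 255.0 for v in pixels]
to_group = lambda label, place: {(1, 1): 0, (1, 0): 1, (0, 0): 2, (0, 1): 3}[(label, place)]

ds = WaterbirdsWrapper(toy_records, scale, to_group)
print(len(ds), ds[0])
```

Returning the index alongside each sample makes it possible to trace a per-example prediction back to its source row when inspecting subgroup failures.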
