08 · Results dashboard
All numbers in one place
The full project results assembled into one scrollable view, suitable for live presentation.
Source: PROJECT_RESULTS_SUMMARY.md
Headline metrics
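| Metric | Value | Note |
|---|---|---|
| Overall test accuracy | 83.9% | |
| Worst-group accuracy | 59.5% | waterbird-land |
| Foreground-mask accuracy | 53.4% | 31.8% of predictions flip |
| Background-mask accuracy | 86.0% | background removal helps |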
Subgroup metrics
| Subgroup | Count | Accuracy | Avg confidence | Notes |
|---|---|---|---|---|
| Waterbird on land (conflict) · waterbird-land | 642 | 59.5% | 90.1% | worst group · shortcut conflict |
| Landbird on water (conflict) · landbird-water | 2255 | 73.6% | 87.6% | shortcut conflict |
| Waterbird on water (majority) · waterbird-water | 642 | 93.3% | 96.9% | majority group |
| Landbird on land (majority) · landbird-land | 2255 | 98.6% | 98.4% | majority group |
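As a quick consistency check, the headline figures can be recomputed from the subgroup rows. The sketch below assumes the overall accuracy is the count-weighted mean of the per-group accuracies and that the worst group is simply the row with the lowest accuracy. The function names are illustrative and not taken from the project code.

```python
# (count, accuracy) pairs copied from the subgroup table.
subgroups = {
    "waterbird-land": (642, 0.595),
    "landbird-water": (2255, 0.736),
    "waterbird-water": (642, 0.933),
    "landbird-land": (2255, 0.986),
}

def overall_accuracy(groups):
    """Count-weighted mean of the per-group accuracies."""
    total = sum(n for n, _ in groups.values())
    correct = sum(n * acc for n, acc in groups.values())
    return correct / total

def worst_group(groups):
    """Return (name, accuracy) of the lowest-accuracy group."""
    name = min(groups, key=lambda g: groups[g][1])
    return name, groups[name][1]

overall = overall_accuracy(subgroups)
worst_name, worst_acc = worst_group(subgroups)
print(f"overall = {overall:.3f}, worst group = {worst_name} ({worst_acc:.3f})")
```

Because the table values are already rounded to one decimal place, the recomputed overall accuracy can differ from the headline 83.9% in the last digit.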
Confusion matrix
Confusion matrix · test split
Rows = true class, columns = predicted class. Cell intensity is proportional to count.
| | Pred: landbird | Pred: waterbird |
|---|---|---|
| True: landbird | 3882 (86.1% of true row) | 628 (13.9% of true row) |
| True: waterbird | 303 (23.6% of true row) | 981 (76.4% of true row) |
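The "% of true row" figures are each cell divided by its row total. A minimal sketch of that normalization, using the counts shown above (the helper name `row_normalize` is an illustrative choice, not part of the project):

```python
# Confusion-matrix counts from the test split (rows = true, columns = predicted).
labels = ["landbird", "waterbird"]
counts = [
    [3882, 628],  # true landbird
    [303, 981],   # true waterbird
]

def row_normalize(matrix):
    """Divide each cell by its row total, giving per-true-class rates."""
    return [[cell / sum(row) for cell in row] for row in matrix]

rates = row_normalize(counts)

# Overall accuracy is the diagonal (correct predictions) over all samples.
overall_acc = sum(counts[i][i] for i in range(len(counts))) / sum(map(sum, counts))

for label, row in zip(labels, rates):
    print(f"True {label}: " + ", ".join(f"{r:.1%}" for r in row))
print(f"overall accuracy = {overall_acc:.1%}")
```

The diagonal of the normalized matrix is the per-class recall, so the 76.4% for waterbirds is noticeably below the 86.1% for landbirds.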
Training dynamics
Training history
Validation accuracy by subgroup, per epoch
Overall validation accuracy stays high (~83%) while worst-group accuracy oscillates between 16% and 56%. This gap is the shortcut signal the project investigates.
Grad-CAM saliency
Foreground vs. background saliency by subgroup
Fraction of Grad-CAM heat that falls inside the 60% center crop (foreground heuristic) vs. outside (background).
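One way to compute this split is sketched below. It assumes "60% center crop" means a centered box whose sides are 60% of the image sides (it could instead mean 60% of the area), and it takes the heatmap as a plain 2D list of non-negative values. The function name and signature are hypothetical, not the project's own.

```python
def center_box_fraction(heatmap, box_frac=0.6):
    """Fraction of total heat that falls inside a centered box.

    heatmap:  2D list of non-negative floats (rows x columns).
    box_frac: box side length relative to each image side
              (assumption: "60%" refers to side length, not area).
    """
    h, w = len(heatmap), len(heatmap[0])
    top = round(h * (1 - box_frac) / 2)
    left = round(w * (1 - box_frac) / 2)
    bottom, right = h - top, w - left
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0  # no heat at all; avoid division by zero
    inside = sum(sum(row[left:right]) for row in heatmap[top:bottom])
    return inside / total

# A uniform 10x10 heatmap: the 6x6 center box holds 36 of 100 units.
uniform = [[1.0] * 10 for _ in range(10)]
fg = center_box_fraction(uniform)
bg = 1.0 - fg
print(f"foreground = {fg:.2f}, background = {bg:.2f}")
```

Under this side-length reading, a heatmap with no spatial preference would put 36% of its heat in the foreground box, which gives a rough baseline for interpreting the reported fractions.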
Interventions
Accuracy by intervention
What happens to test accuracy when we modify the image at inference time?
Prediction flip rate vs. original
Fraction of samples where the predicted class changes after the intervention.
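The flip rate defined here can be sketched as a comparison of two aligned prediction lists. The labels below are toy values for illustration only, not project data, and `flip_rate` is a hypothetical helper name.

```python
def flip_rate(original_preds, intervened_preds):
    """Share of samples whose predicted class differs after the intervention."""
    if len(original_preds) != len(intervened_preds):
        raise ValueError("prediction lists must be aligned sample by sample")
    if not original_preds:
        return 0.0
    flips = sum(o != n for o, n in zip(original_preds, intervened_preds))
    return flips / len(original_preds)

# Toy example: two of four predictions change.
orig = ["landbird", "waterbird", "waterbird", "landbird"]
masked = ["landbird", "landbird", "waterbird", "waterbird"]
rate = flip_rate(orig, masked)
print(f"flip rate = {rate:.1%}")
```

Note that a flip counts any change of prediction, whether it turns a correct answer wrong or a wrong answer right, so flip rate and accuracy change can move independently.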
Average background saliency ratio
How much of the Grad-CAM heat sits outside the center foreground box?
Intervention × subgroup table
| Condition | Subgroup | N | Accuracy | Flip rate | Conf. drop | FG saliency | BG saliency |
|---|---|---|---|---|---|---|---|
| Original | Landbird on land (majority) | 352 | 98.3% | 0.0% | 0.0% | 54.6% | 45.4% |
| Original | Landbird on water (conflict) | 383 | 70.0% | 0.0% | 0.0% | 63.6% | 36.4% |
| Original | Waterbird on land (conflict) | 139 | 50.4% | 0.0% | 0.0% | 55.3% | 44.7% |
| Original | Waterbird on water (majority) | 126 | 93.7% | 0.0% | 0.0% | 61.7% | 38.3% |
| Background blur | Landbird on land (majority) | 352 | 97.7% | 1.1% | 1.9% | 61.9% | 38.1% |
| Background blur | Landbird on water (conflict) | 383 | 79.1% | 12.8% | -1.2% | 71.3% | 28.7% |
| Background blur | Waterbird on land (conflict) | 139 | 53.2% | 11.5% | 0.6% | 64.9% | 35.1% |
| Background blur | Waterbird on water (majority) | 126 | 90.5% | 6.3% | 2.0% | 70.2% | 29.8% |
| Background mask | Landbird on land (majority) | 352 | 96.9% | 1.4% | 3.0% | 71.2% | 28.8% |
| Background mask | Landbird on water (conflict) | 383 | 86.2% | 17.8% | -4.4% | 72.3% | 27.7% |
| Background mask | Waterbird on land (conflict) | 139 | 59.7% | 12.2% | 0.0% | 73.2% | 26.8% |
| Background mask | Waterbird on water (majority) | 126 | 84.1% | 11.1% | 4.1% | 73.4% | 26.6% |
| Background patch shuffle | Landbird on land (majority) | 352 | 98.3% | 1.7% | 1.0% | 57.1% | 42.9% |
| Background patch shuffle | Landbird on water (conflict) | 383 | 83.6% | 19.8% | -3.3% | 61.2% | 38.8% |
| Background patch shuffle | Waterbird on land (conflict) | 139 | 48.9% | 15.8% | 1.1% | 58.8% | 41.2% |
| Background patch shuffle | Waterbird on water (majority) | 126 | 81.0% | 12.7% | 7.8% | 61.2% | 38.8% |
| Foreground mask | Landbird on land (majority) | 352 | 97.2% | 4.0% | 4.0% | 34.9% | 65.1% |
| Foreground mask | Landbird on water (conflict) | 383 | 20.6% | 55.6% | 6.0% | 29.8% | 70.2% |
| Foreground mask | Waterbird on land (conflict) | 139 | 7.2% | 47.5% | -1.5% | 33.8% | 66.2% |
| Foreground mask | Waterbird on water (majority) | 126 | 81.7% | 19.8% | 11.0% | 33.0% | 67.0% |
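The per-condition totals can be recovered by pooling the four subgroup rows, weighted by N. The sketch below assumes the headline foreground-mask and background-mask figures are such count-weighted averages over the 1,000 samples in this table, which the arithmetic appears to support. The row tuples are copied from the table and the helper name is illustrative.

```python
from collections import defaultdict

# (condition, N, accuracy, flip rate) copied from the table, subgroup order as listed.
rows = [
    ("Original", 352, 0.983, 0.000), ("Original", 383, 0.700, 0.000),
    ("Original", 139, 0.504, 0.000), ("Original", 126, 0.937, 0.000),
    ("Background blur", 352, 0.977, 0.011), ("Background blur", 383, 0.791, 0.128),
    ("Background blur", 139, 0.532, 0.115), ("Background blur", 126, 0.905, 0.063),
    ("Background mask", 352, 0.969, 0.014), ("Background mask", 383, 0.862, 0.178),
    ("Background mask", 139, 0.597, 0.122), ("Background mask", 126, 0.841, 0.111),
    ("Background patch shuffle", 352, 0.983, 0.017), ("Background patch shuffle", 383, 0.836, 0.198),
    ("Background patch shuffle", 139, 0.489, 0.158), ("Background patch shuffle", 126, 0.810, 0.127),
    ("Foreground mask", 352, 0.972, 0.040), ("Foreground mask", 383, 0.206, 0.556),
    ("Foreground mask", 139, 0.072, 0.475), ("Foreground mask", 126, 0.817, 0.198),
]

def pooled(rows, column):
    """Count-weighted mean of one metric column (2 = accuracy, 3 = flip rate)."""
    weighted, totals = defaultdict(float), defaultdict(int)
    for row in rows:
        condition, n = row[0], row[1]
        weighted[condition] += n * row[column]
        totals[condition] += n
    return {c: weighted[c] / totals[c] for c in totals}

acc = pooled(rows, 2)
flip = pooled(rows, 3)
for condition in acc:
    print(f"{condition}: accuracy {acc[condition]:.3f}, flip rate {flip[condition]:.3f}")
```

Pooled this way, the Original condition on these 1,000 samples comes out near 80.2% rather than the 83.9% full-test figure, so the background-mask result is best compared against the same-sample baseline.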
Source files
outputs/metrics/test_metrics.json
outputs/metrics/test_predictions.csv
outputs/metrics/train_history.csv
outputs/gradcam/gradcam_results.csv
outputs/gradcam/gradcam_group_summary.csv
outputs/interventions/intervention_metrics.csv
outputs/interventions/intervention_overall_metrics.csv
outputs/interventions/intervention_predictions.csv
outputs/PROJECT_RESULTS_SUMMARY.md