Add accuracy test framework, prompts, results, and analysis reports

Includes accuracy test scripts for Qwen (local) and Gemini (cloud API),
three prompt variants (original, capstone, constrained), test results
from all runs, and two analysis reports with an HTML presentation version.
commit 5405d7f7dc
parent 435033ea07
2026-03-03 18:44:49 -07:00
13 changed files with 8561 additions and 0 deletions

accuracy_analysis_report.md
# Jersey Color Detection Accuracy Analysis
## Test Configuration
- **Models tested:** Gemini 3 Flash Preview (cloud API), Qwen3-VL-8B (local, via llama.cpp)
- **Prompts tested:** `jersey_prompt.txt` (original), `jersey_prompt_capstone.txt` (capstone)
- **Test images:** 161 annotated basketball jersey images
- **Ground truth colors:** 202 labels across the 161 images (excluding white)
- **Images resized** to max 768px wide before submission
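As a rough illustration of the resize step, here is a minimal Pillow sketch (the helper name and the LANCZOS filter are assumptions, not taken from the actual test scripts):

```python
from PIL import Image

def resize_for_vlm(path: str, max_width: int = 768) -> Image.Image:
    """Downscale an image to at most max_width pixels wide, keeping aspect ratio."""
    img = Image.open(path)
    if img.width > max_width:
        scale = max_width / img.width
        img = img.resize((max_width, round(img.height * scale)), Image.LANCZOS)
    return img
```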
---
## Summary Comparison
| Metric | Gemini + Original | Gemini + Capstone | Qwen + Original | Qwen + Capstone |
|----------------------------|:-----------------:|:-----------------:|:----------------:|:---------------:|
| **Recall (exact)** | 64.4% | 60.9% | 64.4% | 65.8% |
| **Recall (exact+similar)** | **81.2%** | 78.2% | 77.2% | 77.7% |
| **Missed (1 − recall)**    | 18.8% | 21.8% | 22.8% | 22.3% |
| **Precision (exact)** | 74.7% | 70.7% | 70.7% | 73.9% |
| **Precision (exact+sim.)** | **93.7%** | 90.2% | 84.8% | 87.2% |
| **Extra/wrong** | **6.3%** | 9.8% | 15.2% | 12.8% |
| PASS images | **124** | 118 | 117 | 119 |
| PARTIAL images | 19 | 21 | 18 | 19 |
| FAIL images | **18** | 22 | 26 | 23 |
| Avg time per image | 13.3s | 11.7s | 9.5s | 8.9s |
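Concretely, recall is tallied over the 202 ground-truth colors and precision over the colors each run returned. A minimal per-image scoring sketch, assuming a `same_family` predicate built from the Appendix families (the function and its signature are illustrative, not the actual test harness):

```python
def score_image(predicted: list[str], ground_truth: list[str], same_family) -> dict:
    """Bucket one image's colors into the categories tallied above."""
    exact = [g for g in ground_truth if g in predicted]
    similar = [g for g in ground_truth if g not in predicted
               and any(same_family(p, g) for p in predicted)]
    missed = [g for g in ground_truth if g not in exact and g not in similar]
    extra = [p for p in predicted if p not in ground_truth
             and not any(same_family(p, g) for g in ground_truth)]
    return {"exact": exact, "similar": similar, "missed": missed, "extra": extra}
```

Summing `exact` over all images and dividing by 202 gives recall (exact); adding `similar` gives recall (exact+similar); dividing the matched predictions by the total number of colors predicted gives the precision columns.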
### Key Takeaways
1. **Gemini + original prompt is the best combination** across all major metrics: highest recall (81.2%), highest precision (93.7%), fewest failures (18), and fewest extra/wrong colors (6.3%).
2. **Exact recall is remarkably stable** across all four runs (60.9%–65.8%), suggesting ~35% of ground truth colors are inherently difficult for current VLMs regardless of model or prompt.
3. **Gemini produces far fewer hallucinated colors** than Qwen. Gemini's extra/wrong rate is 6.3%–9.8% vs. Qwen's 12.8%–15.2%. When Gemini detects a color, it is almost always correct.
4. **The capstone prompt did not improve results** for either model. For Gemini it degraded both recall and precision. For Qwen the difference was negligible.
5. **Qwen is ~30% faster** (8.9–9.5s vs 11.7–13.3s per image) but at the cost of lower accuracy and more false positives.
---
## Color-Level Analysis
### Most Problematic Ground Truth Colors
Colors most frequently missed across all four test runs:
| Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Total Misses | Common Confusion |
|-----------------|:-----------:|:----------:|:---------:|:--------:|:------------:|---------------------|
| **gray** | 7 | 6 | 7 | 9 | 29 | Often returned as "grey" (similar match) or missed entirely |
| **maroon** | 5 | 9 | 8 | 7 | 29 | Frequently confused with "red" |
| **black** | 7 | 7 | 6 | 6 | 26 | Often not detected at all |
| **light blue** | 2 | 2 | 8 | 5 | 17 | Returned as "blue" (Qwen especially) |
| **green** | 3 | 4 | 3 | 4 | 14 | Sometimes returned as "black" |
| **blue**        | 3 | 3 | 3 | 2 | 11 | Sometimes not detected at all |
| **dark brown**  | 0 | 1 | 4 | 4 | 9  | Returned as "black" or "brown" |
| **brown**       | 1 | 1 | 3 | 3 | 8  | Returned as "black" or "orange" |
| **teal**        | 2 | 2 | 2 | 2 | 8  | Confused with "green" or "blue" |
| **gold/yellow** | 2 | 2 | 1 | 1 | 6 | Occasionally missed entirely |
### Most Common Extra/Wrong Colors Reported
| Extra Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Notes |
|--------------|:-----------:|:----------:|:---------:|:--------:|-------|
| **red** | 3 | 7 | 7 | 6 | Typically a misread of maroon |
| **black** | 2 | 4 | 7 | 7 | Misread of dark brown/green/gray |
| **blue** | 3 | 2 | 10 | 6 | Misread of light blue or teal |
| **green** | 1 | 1 | 1 | 1 | Misread of teal |
| **orange** | 1 | 1 | 1 | 1 | Misread of brown |
### Similar-Match Confusion Patterns
These are cases where the VLM returned a color in the same family but not the exact ground truth term:
| Expected | Returned As | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap |
|------------------|----------------|:-----------:|:----------:|:---------:|:--------:|
| gray | grey | 9 | 10 | — | — |
| navy blue | blue | 7 | 6 | 8 | 8 |
| dark blue | blue | 5 | 6 | 10 | 9 |
| dark brown | brown | 5 | 5 | 2 | 2 |
| gold | yellow | 3 | 2 | 5 | 3 |
| dark blue | navy blue/navy | 4 | 4 | — | 1 |
**Observations:**
- **gray/grey** is purely a spelling variant — Gemini consistently uses British spelling. Qwen uses "gray" so this never triggers for Qwen.
- **navy blue → blue** and **dark blue → blue** are the most common simplifications. Both models tend to drop shade qualifiers.
- **dark brown → brown** follows the same pattern of dropping the shade qualifier.
- **gold → yellow** is a genuine color perception difference where models see yellow-dominant gold jerseys.
---
## Persistently Failed Images
These 11 images failed across **all four** test runs, representing the hardest cases:
| Image | GT Colors | Typical VLM Response | Failure Pattern |
|-------|-----------|---------------------|-----------------|
| 016 - maroon.jpg | maroon | (none) or red | Maroon not recognized |
| 029 - maroon_white.jpg | maroon | red | Maroon → red confusion |
| 034 - light blue.jpg | light blue | blue | Shade qualifier dropped |
| 046 - green.jpg | green | black | Dark green misread as black |
| 053 - black_white.jpg | black | (not detected) | Black jerseys missed |
| 057 - gold or yellow.jpg | gold\|yellow | (not detected) | Gold/yellow missed |
| 132 - brown_white.jpg | brown | orange | Brown → orange confusion |
| 134 - teal_white.jpg | teal | blue or green | Teal not in model vocabulary |
| 138 - maroon.jpg | maroon | red | Maroon → red confusion |
| 150 - green_gray.jpg | green, gray | black | Both colors misread |
| 160 - blue_white.jpg | blue | (not detected) | Blue not detected |
### Root Cause Categories
1. **Maroon blindness (3 images):** Both models consistently classify maroon as red. This is the single largest systematic error.
2. **Dark color confusion (3 images):** Dark green, brown, and black are frequently confused with each other, especially in low-contrast or shadowed images.
3. **Shade qualifier loss (2 images):** "Light blue" and "teal" are simplified to "blue" or "green" — models use a coarser color vocabulary than the ground truth.
4. **Non-detection (3 images):** Some jerseys are simply not detected at all, likely due to occlusion, unusual angles, or low image quality.
---
## Model-Specific Observations
### Gemini 3 Flash
- **Strengths:** Highest precision (93.7%), very few hallucinated colors, good at similar-family matching. Never produced gibberish color names.
- **Weaknesses:** Consistently uses British "grey" instead of "gray". Slower than local model.
- **Prompt sensitivity:** The capstone prompt slightly hurt performance (81.2% → 78.2% recall), suggesting the original simpler prompt works better.
### Qwen3-VL-8B
- **Strengths:** Faster inference (8.9s avg). Slightly higher exact match rate with capstone prompt (65.8%).
- **Weaknesses:** Much higher false positive rate (12.8–15.2% extra/wrong). Struggles significantly with "light blue" (8 misses with the original prompt). Produced one gibberish color ("redolas"). Over-reports "blue" and "black".
- **Prompt sensitivity:** Minimal difference between prompts. Capstone prompt slightly reduced errors.
---
## Recommendations
1. **Normalize "grey" → "gray"** in post-processing to eliminate the most common similar-match gap for Gemini.
2. **Add "maroon" to the prompt** as an explicit color option or example, since both models struggle to distinguish it from red without guidance.
3. **Consider a constrained color vocabulary** in the prompt (e.g., "Choose from: red, blue, green, yellow, orange, purple, black, gray, brown, maroon, teal, light blue, navy blue, gold, pink") to reduce vocabulary mismatch and shade-qualifier drift.
4. **Post-processing color mapping** could recover many similar-match cases automatically: navy→navy blue, grey→gray, dark blue→navy blue, etc. (a sketch follows this list).
5. **The original `jersey_prompt.txt` is the better prompt** — the capstone prompt's additional constraints did not improve accuracy for either model.
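A minimal post-processing sketch covering recommendations 1 and 4; the mapping contains only the substitutions named above, and `normalize_colors` is an illustrative helper, not the project's actual code:

```python
# Canonicalization map from recommendations 1 and 4.
COLOR_MAP = {
    "grey": "gray",
    "navy": "navy blue",
    "dark blue": "navy blue",
}

def normalize_colors(predicted: list[str]) -> list[str]:
    """Map VLM color names onto the ground-truth vocabulary before scoring."""
    out = []
    for color in predicted:
        color = color.strip().lower()
        out.append(COLOR_MAP.get(color, color))
    return out
```

Applied to each model response before scoring, this would turn Gemini's grey/gray mismatches into exact matches and collapse the navy blue/dark blue drift seen in the similar-match table.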
---
## Appendix: Color Similarity Families
The following color families were used for "similar match" scoring. Two colors count as a similar match if they appear in the same family:
| Family | Member Colors |
|------------|-------------------------------------------------------|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |
**Note:** Colors in *different* families are never counted as similar, even if perceptually close (e.g., maroon and red are in separate families; brown and orange are in separate families). This is intentional — the similar-match metric captures vocabulary variation within the same color concept, not genuine color misidentification.
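These families translate directly into a lookup; a minimal sketch of the `same_family` predicate referenced in the scoring sketch earlier (names are illustrative):

```python
# Similarity families copied from the table above.
COLOR_FAMILIES = {
    "blue": ["blue", "dark blue", "navy blue", "navy", "royal blue"],
    "light_blue": ["light blue", "sky blue", "baby blue", "carolina blue", "powder blue"],
    "red": ["red", "scarlet", "crimson"],
    "dark_red": ["maroon", "burgundy", "dark red", "wine"],
    "green": ["green", "dark green", "forest green", "kelly green"],
    "yellow": ["yellow", "gold", "golden"],
    "orange": ["orange", "burnt orange"],
    "brown": ["brown", "dark brown"],
    "purple": ["purple", "violet"],
    "gray": ["gray", "grey", "silver", "charcoal"],
    "black": ["black"],
    "teal": ["teal", "turquoise", "cyan", "aqua"],
    "pink": ["pink", "magenta", "hot pink", "rose"],
}

# Invert to a color -> family lookup.
FAMILY_OF = {color: family
             for family, members in COLOR_FAMILIES.items()
             for color in members}

def same_family(a: str, b: str) -> bool:
    """True when both colors belong to the same similarity family."""
    fam_a = FAMILY_OF.get(a.strip().lower())
    return fam_a is not None and fam_a == FAMILY_OF.get(b.strip().lower())
```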