# Jersey Color Detection Accuracy Analysis

## Test Configuration

- **Models tested:** Gemini 3 Flash Preview (cloud API), Qwen3-VL-8B (local, via llama.cpp)
- **Prompts tested:** `jersey_prompt.txt` (original), `jersey_prompt_capstone.txt` (capstone)
- **Test images:** 161 annotated basketball jersey images
- **Ground truth colors:** 202 (excluding white)
- **Images resized** to max 768px wide before submission

---

## Summary Comparison

| Metric | Gemini + Original | Gemini + Capstone | Qwen + Original | Qwen + Capstone |
|----------------------------|:-----------------:|:-----------------:|:---------------:|:---------------:|
| **Recall (exact)**         | 64.4%     | 60.9% | 64.4% | 65.8% |
| **Recall (exact+similar)** | **81.2%** | 78.2% | 77.2% | 77.7% |
| **Recall (missed)**        | 18.8%     | 21.8% | 22.8% | 22.3% |
| **Precision (exact)**      | 74.7%     | 70.7% | 70.7% | 73.9% |
| **Precision (exact+sim.)** | **93.7%** | 90.2% | 84.8% | 87.2% |
| **Extra/wrong**            | **6.3%**  | 9.8%  | 15.2% | 12.8% |
| PASS images                | **124**   | 118   | 117   | 119   |
| PARTIAL images             | 19        | 21    | 18    | 19    |
| FAIL images                | **18**    | 22    | 26    | 23    |
| Avg time per image         | 13.3s     | 11.7s | 9.5s  | 8.9s  |

### Key Takeaways

1. **Gemini + original prompt is the best combination** across all major metrics: highest recall (81.2%), highest precision (93.7%), fewest failures (18), and fewest extra/wrong colors (6.3%).
2. **Exact recall is remarkably stable** across all four runs (60.9%–65.8%), suggesting ~35% of ground truth colors are inherently difficult for current VLMs regardless of model or prompt.
3. **Gemini produces far fewer hallucinated colors** than Qwen: its extra/wrong rate is 6.3%–9.8% vs. Qwen's 12.8%–15.2%. When Gemini detects a color, it is almost always correct.
4. **The capstone prompt did not improve results** for either model. For Gemini it degraded both recall and precision; for Qwen the difference was negligible.
5. **Qwen is ~30% faster** (8.9–9.5s vs. 11.7–13.3s per image), but at the cost of lower accuracy and more false positives.
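The recall and precision figures above can be reproduced from per-image ground-truth and detected color lists roughly as sketched below. This is a minimal sketch, not the actual evaluation script: the `score_run` name, the per-image dict layout, and the `similar` predicate (same-family membership, see the appendix) are illustrative assumptions. Note that precision (exact+similar) is simply the complement of the extra/wrong rate, which matches the table (e.g., 93.7% + 6.3% = 100%).

```python
from collections.abc import Callable

def score_run(images: list[dict], similar: Callable[[str, str], bool]) -> dict:
    """Aggregate recall/precision over one model+prompt run.

    `images` is assumed to look like:
        {"expected": ["maroon"], "detected": ["red"]}   # one entry per test image
    `similar(a, b)` should return True when a and b fall in the same color family.
    """
    exact = similar_only = missed = extra = n_expected = n_detected = 0

    for img in images:
        expected = [c.lower().strip() for c in img["expected"]]
        detected = [c.lower().strip() for c in img["detected"]]
        n_expected += len(expected)
        n_detected += len(detected)

        for gt in expected:
            if gt in detected:                               # exact term match
                exact += 1
            elif any(similar(gt, d) for d in detected):      # same color family
                similar_only += 1
            else:
                missed += 1

        # Detected colors that match nothing in the ground truth count as extra/wrong.
        extra += sum(1 for d in detected
                     if not any(d == gt or similar(gt, d) for gt in expected))

    return {
        "recall_exact": exact / n_expected,
        "recall_exact_plus_similar": (exact + similar_only) / n_expected,
        "recall_missed": missed / n_expected,
        "precision_exact_plus_similar": 1 - extra / n_detected,
        "extra_wrong_rate": extra / n_detected,
    }
```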
---

## Color-Level Analysis

### Most Problematic Ground Truth Colors

Colors most frequently missed across all four test runs:

| Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Total Misses | Common Confusion |
|-----------------|:-----------:|:----------:|:---------:|:--------:|:------------:|------------------|
| **gray**        | 7 | 6 | 7 | 9 | 29 | Often returned as "grey" (similar match) or missed entirely |
| **maroon**      | 5 | 9 | 8 | 7 | 29 | Frequently confused with "red" |
| **black**       | 7 | 7 | 6 | 6 | 26 | Often not detected at all |
| **light blue**  | 2 | 2 | 8 | 5 | 17 | Returned as "blue" (Qwen especially) |
| **green**       | 3 | 4 | 3 | 4 | 14 | Sometimes returned as "black" |
| **blue**        | 3 | 3 | 3 | 2 | 11 | Sometimes not detected at all |
| **dark brown**  | 0 | 1 | 4 | 4 | 9  | Returned as "black" or "brown" |
| **brown**       | 1 | 1 | 3 | 3 | 8  | Returned as "black" or "orange" |
| **teal**        | 2 | 2 | 2 | 2 | 8  | Confused with "green" or "blue" |
| **gold/yellow** | 2 | 2 | 1 | 1 | 6  | Occasionally missed entirely |

### Most Common Extra/Wrong Colors Reported

| Extra Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Notes |
|-------------|:-----------:|:----------:|:---------:|:--------:|-------|
| **red**     | 3 | 7 | 7  | 6 | Typically a misread of maroon |
| **black**   | 2 | 4 | 7  | 7 | Misread of dark brown/green/gray |
| **blue**    | 3 | 2 | 10 | 6 | Misread of light blue or teal |
| **green**   | 1 | 1 | 1  | 1 | Misread of teal |
| **orange**  | 1 | 1 | 1  | 1 | Misread of brown |

### Similar-Match Confusion Patterns

These are cases where the VLM returned a color in the same family but not the exact ground truth term:

| Expected   | Returned As    | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap |
|------------|----------------|:-----------:|:----------:|:---------:|:--------:|
| gray       | grey           | 9 | 10 | —  | — |
| navy blue  | blue           | 7 | 6  | 8  | 8 |
| dark blue  | blue           | 5 | 6  | 10 | 9 |
| dark brown | brown          | 5 | 5  | 2  | 2 |
| gold       | yellow         | 3 | 2  | 5  | 3 |
| dark blue  | navy blue/navy | 4 | 4  | —  | 1 |

**Observations:**

- **gray/grey** is purely a spelling variant: Gemini consistently uses the British spelling, while Qwen uses "gray", so this never triggers for Qwen.
- **navy blue → blue** and **dark blue → blue** are the most common simplifications; both models tend to drop shade qualifiers.
- **dark brown → brown** follows the same pattern of dropping the shade qualifier.
- **gold → yellow** is a genuine color-perception difference: the models see gold jerseys as yellow-dominant.
---

## Persistently Failed Images

These 11 images failed across **all four** test runs, representing the hardest cases:

| Image | GT Colors | Typical VLM Response | Failure Pattern |
|-------|-----------|----------------------|-----------------|
| 016 - maroon.jpg         | maroon       | (none) or red  | Maroon not recognized |
| 029 - maroon_white.jpg   | maroon       | red            | Maroon → red confusion |
| 034 - light blue.jpg     | light blue   | blue           | Shade qualifier dropped |
| 046 - green.jpg          | green        | black          | Dark green misread as black |
| 053 - black_white.jpg    | black        | (not detected) | Black jerseys missed |
| 057 - gold or yellow.jpg | gold\|yellow | (not detected) | Gold/yellow missed |
| 132 - brown_white.jpg    | brown        | orange         | Brown → orange confusion |
| 134 - teal_white.jpg     | teal         | blue or green  | Teal not in model vocabulary |
| 138 - maroon.jpg         | maroon       | red            | Maroon → red confusion |
| 150 - green_gray.jpg     | green, gray  | black          | Both colors misread |
| 160 - blue_white.jpg     | blue         | (not detected) | Blue not detected |

### Root Cause Categories

1. **Maroon blindness (3 images):** Both models consistently classify maroon as red. This is the single largest systematic error.
2. **Dark color confusion (3 images):** Dark green, brown, and black are frequently confused with each other, especially in low-contrast or shadowed images.
3. **Shade qualifier loss (2 images):** "Light blue" and "teal" are simplified to "blue" or "green"; the models use a coarser color vocabulary than the ground truth.
4. **Non-detection (3 images):** Some jerseys are simply not detected at all, likely due to occlusion, unusual angles, or low image quality.

---

## Model-Specific Observations

### Gemini 3 Flash

- **Strengths:** Highest precision (93.7%), very few hallucinated colors, good at similar-family matching. Never produced gibberish color names.
- **Weaknesses:** Consistently uses British "grey" instead of "gray". Slower than the local model.
- **Prompt sensitivity:** The capstone prompt slightly hurt performance (81.2% → 78.2% recall), suggesting the original, simpler prompt works better.

### Qwen3-VL-8B

- **Strengths:** Faster inference (8.9s avg). Slightly higher exact-match rate with the capstone prompt (65.8%).
- **Weaknesses:** Much higher false-positive rate (12.8–15.2% extra/wrong). Struggles significantly with "light blue" (8 misses with the original prompt). Produced one gibberish color ("redolas"). Over-reports "blue" and "black".
- **Prompt sensitivity:** Minimal difference between prompts; the capstone prompt slightly reduced errors.

---

## Recommendations

1. **Normalize "grey" → "gray"** in post-processing to eliminate the most common similar-match gap for Gemini.
2. **Add "maroon" to the prompt** as an explicit color option or example, since both models struggle to distinguish it from red without guidance.
3. **Consider a constrained color vocabulary** in the prompt (e.g., "Choose from: red, blue, green, yellow, orange, purple, black, gray, brown, maroon, teal, light blue, navy blue, gold, pink") to reduce vocabulary mismatch and shade-qualifier drift.
4. **Post-processing color mapping** could recover many similar-match cases automatically: navy → navy blue, grey → gray, dark blue → navy blue, etc. (see the sketch after this list).
5. **The original `jersey_prompt.txt` is the better prompt**: the capstone prompt's additional constraints did not improve accuracy for either model.
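Recommendations 1 and 4 amount to a small canonicalization map applied to model output before scoring. A minimal sketch, assuming the mapping below; the table entries and the `normalize_colors` name are illustrative and would be extended to cover the final prompt vocabulary:

```python
# Illustrative canonicalization map (not exhaustive): keys are raw model terms,
# values are the corresponding ground-truth vocabulary terms.
CANONICAL_COLOR = {
    "grey": "gray",
    "silver": "gray",
    "navy": "navy blue",
    "dark blue": "navy blue",
    "golden": "gold",
    "sky blue": "light blue",
    "turquoise": "teal",
}

def normalize_colors(detected: list[str]) -> list[str]:
    """Map raw VLM color terms onto the ground-truth vocabulary, de-duplicating."""
    normalized: list[str] = []
    for color in detected:
        term = color.lower().strip()
        canonical = CANONICAL_COLOR.get(term, term)
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized

# Gemini's "grey" and a bare "navy" become exact matches after normalization:
print(normalize_colors(["Grey", "navy", "red"]))   # ['gray', 'navy blue', 'red']
```

Applied before scoring, a mapping like this would convert most of the similar-match rows in the confusion table above into exact matches without touching the prompts.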
---

## Appendix: Color Similarity Families

The following color families were used for "similar match" scoring. Two colors count as a similar match if they appear in the same family:

| Family     | Member Colors                                               |
|------------|-------------------------------------------------------------|
| blue       | blue, dark blue, navy blue, navy, royal blue                |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red        | red, scarlet, crimson                                       |
| dark_red   | maroon, burgundy, dark red, wine                            |
| green      | green, dark green, forest green, kelly green                |
| yellow     | yellow, gold, golden                                        |
| orange     | orange, burnt orange                                        |
| brown      | brown, dark brown                                           |
| purple     | purple, violet                                              |
| gray       | gray, grey, silver, charcoal                                |
| black      | black                                                       |
| teal       | teal, turquoise, cyan, aqua                                 |
| pink       | pink, magenta, hot pink, rose                               |

**Note:** Colors in *different* families are never counted as similar, even if perceptually close (e.g., maroon and red are in separate families; brown and orange are in separate families). This is intentional: the similar-match metric captures vocabulary variation within the same color concept, not genuine color misidentification.
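For completeness, a sketch of how the family table translates into a similar-match predicate for scoring. The `COLOR_FAMILIES` structure mirrors the table above; the function names are illustrative, and `is_similar_match` could serve as the `similar` predicate assumed in the scoring sketch earlier.

```python
# Color similarity families from the table above, as a lookup structure.
COLOR_FAMILIES = {
    "blue":       {"blue", "dark blue", "navy blue", "navy", "royal blue"},
    "light_blue": {"light blue", "sky blue", "baby blue", "carolina blue", "powder blue"},
    "red":        {"red", "scarlet", "crimson"},
    "dark_red":   {"maroon", "burgundy", "dark red", "wine"},
    "green":      {"green", "dark green", "forest green", "kelly green"},
    "yellow":     {"yellow", "gold", "golden"},
    "orange":     {"orange", "burnt orange"},
    "brown":      {"brown", "dark brown"},
    "purple":     {"purple", "violet"},
    "gray":       {"gray", "grey", "silver", "charcoal"},
    "black":      {"black"},
    "teal":       {"teal", "turquoise", "cyan", "aqua"},
    "pink":       {"pink", "magenta", "hot pink", "rose"},
}

def family_of(color: str) -> str | None:
    """Return the family a color term belongs to, or None if it is unknown."""
    color = color.lower().strip()
    for family, members in COLOR_FAMILIES.items():
        if color in members:
            return family
    return None

def is_similar_match(expected: str, detected: str) -> bool:
    """Two different color terms are a similar match only if they share a family."""
    fam_expected, fam_detected = family_of(expected), family_of(detected)
    return (
        fam_expected is not None
        and fam_expected == fam_detected
        and expected.lower().strip() != detected.lower().strip()
    )

# Illustrates the note above: gray/grey share a family, maroon/red do not.
print(is_similar_match("gray", "grey"))    # True
print(is_similar_match("maroon", "red"))   # False
```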