# Jersey Color Detection Accuracy Analysis

## Test Configuration

- **Models tested:** Gemini 3 Flash Preview (cloud API), Qwen3-VL-8B (local, via llama.cpp)
- **Prompts tested:** `jersey_prompt.txt` (original), `jersey_prompt_capstone.txt` (capstone)
- **Test images:** 161 annotated basketball jersey images
- **Ground truth colors:** 202 (excluding white)
- **Images resized** to max 768px wide before submission

---

## Summary Comparison

| Metric | Gemini + Original | Gemini + Capstone | Qwen + Original | Qwen + Capstone |
|----------------------------|:-----------------:|:-----------------:|:---------------:|:---------------:|
| **Recall (exact)**         | 64.4%     | 60.9% | 64.4% | 65.8% |
| **Recall (exact+similar)** | **81.2%** | 78.2% | 77.2% | 77.7% |
| **Recall (missed)**        | 18.8%     | 21.8% | 22.8% | 22.3% |
| **Precision (exact)**      | 74.7%     | 70.7% | 70.7% | 73.9% |
| **Precision (exact+sim.)** | **93.7%** | 90.2% | 84.8% | 87.2% |
| **Extra/wrong**            | **6.3%**  | 9.8%  | 15.2% | 12.8% |
| PASS images                | **124**   | 118   | 117   | 119   |
| PARTIAL images             | 19        | 21    | 18    | 19    |
| FAIL images                | **18**    | 22    | 26    | 23    |
| Avg time per image         | 13.3s     | 11.7s | 9.5s  | 8.9s  |

### Key Takeaways

1. **Gemini + original prompt is the best combination** across all major metrics: highest recall (81.2%), highest precision (93.7%), fewest failures (18), and fewest extra/wrong colors (6.3%).
2. **Exact recall is remarkably stable** across all four runs (60.9%–65.8%), suggesting ~35% of ground truth colors are inherently difficult for current VLMs regardless of model or prompt.
3. **Gemini produces far fewer hallucinated colors** than Qwen: its extra/wrong rate is 6.3%–9.8% vs. Qwen's 12.8%–15.2%. When Gemini detects a color, it is almost always correct.
4. **The capstone prompt did not improve results** for either model. For Gemini it degraded both recall and precision; for Qwen the difference was negligible.
5. **Qwen is ~30% faster** (8.9–9.5s vs. 11.7–13.3s per image), but at the cost of lower accuracy and more false positives.
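The recall and precision figures above can be reproduced from per-image ground-truth and detected color lists roughly as sketched below. This is a minimal sketch, not the actual evaluation script: the `score_run` name, the per-image dict layout, and the `similar` predicate (same-family membership, see the appendix) are illustrative assumptions. Note that precision (exact+similar) is simply the complement of the extra/wrong rate, which matches the table (e.g., 93.7% + 6.3% = 100%).

```python
from collections.abc import Callable

def score_run(images: list[dict], similar: Callable[[str, str], bool]) -> dict:
    """Aggregate recall/precision over one model+prompt run.

    `images` is assumed to look like:
        {"expected": ["maroon"], "detected": ["red"]}   # one entry per test image
    `similar(a, b)` should return True when a and b fall in the same color family.
    """
    exact = similar_only = missed = extra = n_expected = n_detected = 0

    for img in images:
        expected = [c.lower().strip() for c in img["expected"]]
        detected = [c.lower().strip() for c in img["detected"]]
        n_expected += len(expected)
        n_detected += len(detected)

        for gt in expected:
            if gt in detected:                               # exact term match
                exact += 1
            elif any(similar(gt, d) for d in detected):      # same color family
                similar_only += 1
            else:
                missed += 1

        # Detected colors that match nothing in the ground truth count as extra/wrong.
        extra += sum(1 for d in detected
                     if not any(d == gt or similar(gt, d) for gt in expected))

    return {
        "recall_exact": exact / n_expected,
        "recall_exact_plus_similar": (exact + similar_only) / n_expected,
        "recall_missed": missed / n_expected,
        "precision_exact_plus_similar": 1 - extra / n_detected,
        "extra_wrong_rate": extra / n_detected,
    }
```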
---

## Color-Level Analysis

### Most Problematic Ground Truth Colors

Colors most frequently missed across all four test runs:

| Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Total Misses | Common Confusion |
|-----------------|:-----------:|:----------:|:---------:|:--------:|:------------:|------------------|
| **gray**        | 7 | 6 | 7 | 9 | 29 | Often returned as "grey" (similar match) or missed entirely |
| **maroon**      | 5 | 9 | 8 | 7 | 29 | Frequently confused with "red" |
| **black**       | 7 | 7 | 6 | 6 | 26 | Often not detected at all |
| **light blue**  | 2 | 2 | 8 | 5 | 17 | Returned as "blue" (Qwen especially) |
| **green**       | 3 | 4 | 3 | 4 | 14 | Sometimes returned as "black" |
| **blue**        | 3 | 3 | 3 | 2 | 11 | Sometimes not detected at all |
| **dark brown**  | 0 | 1 | 4 | 4 | 9  | Returned as "black" or "brown" |
| **brown**       | 1 | 1 | 3 | 3 | 8  | Returned as "black" or "orange" |
| **teal**        | 2 | 2 | 2 | 2 | 8  | Confused with "green" or "blue" |
| **gold/yellow** | 2 | 2 | 1 | 1 | 6  | Occasionally missed entirely |

### Most Common Extra/Wrong Colors Reported

| Extra Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Notes |
|-------------|:-----------:|:----------:|:---------:|:--------:|-------|
| **red**     | 3 | 7 | 7  | 6 | Typically a misread of maroon |
| **black**   | 2 | 4 | 7  | 7 | Misread of dark brown/green/gray |
| **blue**    | 3 | 2 | 10 | 6 | Misread of light blue or teal |
| **green**   | 1 | 1 | 1  | 1 | Misread of teal |
| **orange**  | 1 | 1 | 1  | 1 | Misread of brown |

### Similar-Match Confusion Patterns

These are cases where the VLM returned a color in the same family but not the exact ground truth term:

| Expected   | Returned As    | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap |
|------------|----------------|:-----------:|:----------:|:---------:|:--------:|
| gray       | grey           | 9 | 10 | —  | — |
| navy blue  | blue           | 7 | 6  | 8  | 8 |
| dark blue  | blue           | 5 | 6  | 10 | 9 |
| dark brown | brown          | 5 | 5  | 2  | 2 |
| gold       | yellow         | 3 | 2  | 5  | 3 |
| dark blue  | navy blue/navy | 4 | 4  | —  | 1 |

**Observations:**

- **gray/grey** is purely a spelling variant: Gemini consistently uses the British spelling, while Qwen uses "gray", so this never triggers for Qwen.
- **navy blue → blue** and **dark blue → blue** are the most common simplifications; both models tend to drop shade qualifiers.
- **dark brown → brown** follows the same pattern of dropping the shade qualifier.
- **gold → yellow** is a genuine color-perception difference: the models see gold jerseys as yellow-dominant.
---

## Persistently Failed Images

These 11 images failed across **all four** test runs, representing the hardest cases:

| Image | GT Colors | Typical VLM Response | Failure Pattern |
|-------|-----------|----------------------|-----------------|
| 016 - maroon.jpg         | maroon       | (none) or red  | Maroon not recognized |
| 029 - maroon_white.jpg   | maroon       | red            | Maroon → red confusion |
| 034 - light blue.jpg     | light blue   | blue           | Shade qualifier dropped |
| 046 - green.jpg          | green        | black          | Dark green misread as black |
| 053 - black_white.jpg    | black        | (not detected) | Black jerseys missed |
| 057 - gold or yellow.jpg | gold\|yellow | (not detected) | Gold/yellow missed |
| 132 - brown_white.jpg    | brown        | orange         | Brown → orange confusion |
| 134 - teal_white.jpg     | teal         | blue or green  | Teal not in model vocabulary |
| 138 - maroon.jpg         | maroon       | red            | Maroon → red confusion |
| 150 - green_gray.jpg     | green, gray  | black          | Both colors misread |
| 160 - blue_white.jpg     | blue         | (not detected) | Blue not detected |

### Root Cause Categories

1. **Maroon blindness (3 images):** Both models consistently classify maroon as red. This is the single largest systematic error.
2. **Dark color confusion (3 images):** Dark green, brown, and black are frequently confused with each other, especially in low-contrast or shadowed images.
3. **Shade qualifier loss (2 images):** "Light blue" and "teal" are simplified to "blue" or "green"; the models use a coarser color vocabulary than the ground truth.
4. **Non-detection (3 images):** Some jerseys are simply not detected at all, likely due to occlusion, unusual angles, or low image quality.

---

## Model-Specific Observations

### Gemini 3 Flash

- **Strengths:** Highest precision (93.7%), very few hallucinated colors, good at similar-family matching. Never produced gibberish color names.
- **Weaknesses:** Consistently uses British "grey" instead of "gray". Slower than the local model.
- **Prompt sensitivity:** The capstone prompt slightly hurt performance (81.2% → 78.2% recall), suggesting the original, simpler prompt works better.

### Qwen3-VL-8B

- **Strengths:** Faster inference (8.9s avg). Slightly higher exact-match rate with the capstone prompt (65.8%).
- **Weaknesses:** Much higher false-positive rate (12.8–15.2% extra/wrong). Struggles significantly with "light blue" (8 misses with the original prompt). Produced one gibberish color ("redolas"). Over-reports "blue" and "black".
- **Prompt sensitivity:** Minimal difference between prompts; the capstone prompt slightly reduced errors.

---

## Recommendations

1. **Normalize "grey" → "gray"** in post-processing to eliminate the most common similar-match gap for Gemini.
2. **Add "maroon" to the prompt** as an explicit color option or example, since both models struggle to distinguish it from red without guidance.
3. **Consider a constrained color vocabulary** in the prompt (e.g., "Choose from: red, blue, green, yellow, orange, purple, black, gray, brown, maroon, teal, light blue, navy blue, gold, pink") to reduce vocabulary mismatch and shade-qualifier drift.
4. **Post-processing color mapping** could recover many similar-match cases automatically: navy → navy blue, grey → gray, dark blue → navy blue, etc. (see the sketch after this list).
5. **The original `jersey_prompt.txt` is the better prompt**: the capstone prompt's additional constraints did not improve accuracy for either model.
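Recommendations 1 and 4 amount to a small canonicalization map applied to model output before scoring. A minimal sketch, assuming the mapping below; the table entries and the `normalize_colors` name are illustrative and would be extended to cover the final prompt vocabulary:

```python
# Illustrative canonicalization map (not exhaustive): keys are raw model terms,
# values are the corresponding ground-truth vocabulary terms.
CANONICAL_COLOR = {
    "grey": "gray",
    "silver": "gray",
    "navy": "navy blue",
    "dark blue": "navy blue",
    "golden": "gold",
    "sky blue": "light blue",
    "turquoise": "teal",
}

def normalize_colors(detected: list[str]) -> list[str]:
    """Map raw VLM color terms onto the ground-truth vocabulary, de-duplicating."""
    normalized: list[str] = []
    for color in detected:
        term = color.lower().strip()
        canonical = CANONICAL_COLOR.get(term, term)
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized

# Gemini's "grey" and a bare "navy" become exact matches after normalization:
print(normalize_colors(["Grey", "navy", "red"]))   # ['gray', 'navy blue', 'red']
```

Applied before scoring, a mapping like this would convert most of the similar-match rows in the confusion table above into exact matches without touching the prompts.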
---

## Appendix: Color Similarity Families

The following color families were used for "similar match" scoring. Two colors count as a similar match if they appear in the same family:

| Family     | Member Colors                                               |
|------------|-------------------------------------------------------------|
| blue       | blue, dark blue, navy blue, navy, royal blue                |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red        | red, scarlet, crimson                                       |
| dark_red   | maroon, burgundy, dark red, wine                            |
| green      | green, dark green, forest green, kelly green                |
| yellow     | yellow, gold, golden                                        |
| orange     | orange, burnt orange                                        |
| brown      | brown, dark brown                                           |
| purple     | purple, violet                                              |
| gray       | gray, grey, silver, charcoal                                |
| black      | black                                                       |
| teal       | teal, turquoise, cyan, aqua                                 |
| pink       | pink, magenta, hot pink, rose                               |

**Note:** Colors in *different* families are never counted as similar, even if perceptually close (e.g., maroon and red are in separate families; brown and orange are in separate families). This is intentional: the similar-match metric captures vocabulary variation within the same color concept, not genuine color misidentification.
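For completeness, a sketch of how the family table translates into a similar-match predicate for scoring. The `COLOR_FAMILIES` structure mirrors the table above; the function names are illustrative, and `is_similar_match` could serve as the `similar` predicate assumed in the scoring sketch earlier.

```python
# Color similarity families from the table above, as a lookup structure.
COLOR_FAMILIES = {
    "blue":       {"blue", "dark blue", "navy blue", "navy", "royal blue"},
    "light_blue": {"light blue", "sky blue", "baby blue", "carolina blue", "powder blue"},
    "red":        {"red", "scarlet", "crimson"},
    "dark_red":   {"maroon", "burgundy", "dark red", "wine"},
    "green":      {"green", "dark green", "forest green", "kelly green"},
    "yellow":     {"yellow", "gold", "golden"},
    "orange":     {"orange", "burnt orange"},
    "brown":      {"brown", "dark brown"},
    "purple":     {"purple", "violet"},
    "gray":       {"gray", "grey", "silver", "charcoal"},
    "black":      {"black"},
    "teal":       {"teal", "turquoise", "cyan", "aqua"},
    "pink":       {"pink", "magenta", "hot pink", "rose"},
}

def family_of(color: str) -> str | None:
    """Return the family a color term belongs to, or None if it is unknown."""
    color = color.lower().strip()
    for family, members in COLOR_FAMILIES.items():
        if color in members:
            return family
    return None

def is_similar_match(expected: str, detected: str) -> bool:
    """Two different color terms are a similar match only if they share a family."""
    fam_expected, fam_detected = family_of(expected), family_of(detected)
    return (
        fam_expected is not None
        and fam_expected == fam_detected
        and expected.lower().strip() != detected.lower().strip()
    )

# Illustrates the note above: gray/grey share a family, maroon/red do not.
print(is_similar_match("gray", "grey"))    # True
print(is_similar_match("maroon", "red"))   # False
```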