# Jersey Color Detection Accuracy — Round 2 Analysis

**Date:** March 3, 2026
**Models:** Gemini 3 Flash Preview, Qwen3-VL-8B (local via llama.cpp)
**Prompts:** jersey_prompt.txt (original), jersey_prompt_capstone.txt (capstone), jersey_prompt_constrained.txt (constrained)
**Test set:** 161 annotated images, 202 ground truth colors (excluding white)

---

## Summary Comparison

| Metric | Qwen Original | Qwen Capstone | Qwen Constrained | Gemini Original | Gemini Capstone | Gemini Constrained |
|----------------------------|:-------------:|:-------------:|:-----------------:|:---------------:|:---------------:|:------------------:|
| **Recall (exact)** | 65.3% | 66.3% | **71.8%** | 62.4% | 60.9% | 67.8% |
| **Recall (exact+similar)** | 78.2% | 78.2% | **82.7%** | 79.7% | 78.2% | 81.7% |
| **Missed** | 21.8% | 21.8% | **17.3%** | 20.3% | 21.8% | 18.3% |
| **Precision (exact)** | 71.7% | 74.0% | 78.4% | 72.0% | 69.5% | **78.7%** |
| **Precision (exact+sim.)** | 85.9% | 87.3% | 90.3% | 91.4% | 88.7% | **94.3%** |
| **Extra/wrong** | 14.1% | 12.7% | 9.7% | 8.6% | 11.3% | **5.7%** |
| PASS | 118 | 120 | **127** | 120 | 117 | 124 |
| PARTIAL | 19 | 19 | **15** | 20 | 22 | 19 |
| FAIL | 24 | 22 | 19 | 21 | 22 | **18** |
| Total time | 1557s | 1437s | 1596s | 253s | 260s | 344s |

---

## Key Findings

### 1. The constrained prompt is the best prompt for both models

The constrained vocabulary prompt delivered the strongest results across the board:

- **Qwen + Constrained** achieved the highest recall of any combination at **82.7%** (167/202 found), up from 78.2% with both other prompts. It also posted the most PASS images (**127**, up from 118 and 120) and the fewest FAIL images for Qwen (**19**, down from 24 and 22).
- **Gemini + Constrained** achieved the highest precision of any combination at **94.3%** (164/174 correct), with only **5.7% extra/wrong** colors — the lowest error rate across all six runs. It also posted the fewest failures of any run at **18**.

### 2. Exact match rates jumped significantly

The constrained prompt's biggest impact was converting similar matches into exact matches by forcing models to use the ground truth vocabulary:

| Model | Exact Match (Original) | Exact Match (Constrained) | Improvement |
|--------|:----------------------:|:-------------------------:|:-----------:|
| Qwen | 65.3% (132) | **71.8% (145)** | +6.5 pp |
| Gemini | 62.4% (126) | **67.8% (137)** | +5.4 pp |

This came partly from eliminating vocabulary mismatches (e.g., grey→gray, navy→navy blue) and partly from teaching the models to use specific color terms like "maroon" and "light blue."

### 3. Targeted color improvements

The constrained prompt's explicit color guidance fixed several of the worst systematic errors:

| Problem Color | Qwen Misses (Orig→Constrained) | Gemini Misses (Orig→Constrained) |
|----------------|:------------------------------:|:--------------------------------:|
| **maroon** | 8 → **3** | 6 → **3** |
| **light blue** | 7 → **1** | 3 → **1** |
| **dark brown** | 4 → **2** | 1 → 1 |
| **teal** | 2 → 2 | 2 → 2 |
| **gray** | 7 → 8 | 6 → 6 |
| **black** | 6 → 6 | 7 → 7 |

- **Maroon:** Cut in half for both models. Previously the most-missed color for Qwen; now ranks 5th.
- **Light blue:** Near-elimination of the "light blue → blue" confusion for both models (7→1 for Qwen, 3→1 for Gemini).
- **Gray/grey:** The spelling normalization instruction eliminated the grey→gray similar-match penalty for Gemini entirely (10 confusions → 0). However, gray detection misses remain unchanged — these are images where gray jerseys aren't detected at all, not a naming issue.
- **Teal and black** remain stubbornly problematic regardless of prompt.

### 4. New overcorrection pattern with constrained prompt

The constrained prompt introduced a new failure mode — models now occasionally over-apply newly learned color terms:

- **Qwen + Constrained** reported "maroon" as an extra/wrong color **5 times** (was 0 previously). It now calls some brown, red, and even orange jerseys "maroon" — the opposite of the original problem. Specific cases: 007 (brown→maroon), 031 (brown→maroon), 048 (red→maroon), 142 (orange→maroon).
- **Gemini + Constrained** reported "light blue" as an extra/wrong color **2 times** (was 0 previously), including misidentifying navy blue as light blue (image 081).

This overcorrection is a smaller problem than the original misses it replaced, but worth noting.

### 5. The capstone prompt did not improve results

The capstone prompt performed at or slightly below the original prompt for both models:

- Qwen: 78.2% recall (same), 87.3% precision (slight improvement)
- Gemini: 78.2% recall (down from 79.7%), 88.7% precision (down from 91.4%)

The capstone prompt's emphasis on precision over recall ("do not guess") may have hurt overall detection rates without meaningfully improving color accuracy.

### 6. Gemini speed improvement from concurrency

The concurrent processing optimization (8 workers + session reuse + JPEG quality 85) delivered major speed gains for the Gemini runs:

| Prompt | Previous sequential run | Current concurrent run |
|:------------|:-----------------------:|:----------------------:|
| Original | 2134s (13.3s avg) | 253s (1.6s avg) |
| Capstone | 1882s (11.7s avg) | 260s (1.6s avg) |
| Constrained | | 344s (2.1s avg) |

That's roughly an **8x speedup** for the first two prompts. The constrained prompt run was slightly slower (344s) due to its longer prompt text (2223 chars vs ~1500 chars).
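The worker-pool pattern behind those numbers can be sketched as follows. This is a minimal illustration, not the actual harness code: `analyze_fn` is a hypothetical callable that should hold one shared HTTP session internally (and re-encode images at JPEG quality 85 before upload) so the per-image cost is just the API round trip.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(image_paths, analyze_fn, workers=8):
    """Fan each image out to a small thread pool.

    analyze_fn is assumed to reuse a single HTTP session across calls
    rather than opening a fresh connection per image; the pool size of 8
    matches the report's concurrency setting.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with image_paths
        return list(pool.map(analyze_fn, image_paths))
```

With 161 images at ~13s per sequential request, 8 concurrent workers put the ideal wall time near 267s, consistent with the observed ~250-260s runs.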
--- ## Persistently Failed Images These **10 images** failed across all six runs, representing the hardest cases for current VLMs regardless of model or prompt: | Image | GT Colors | Typical Error | |-------|-----------|---------------| | 016 - maroon.jpg | maroon | Not detected or called "red" | | 034 - light blue.jpg | light blue | Called "blue" | | 046 - green.jpg | green | Called "black" | | 053 - black_white.jpg | black | Not detected | | 077 - teal_white.jpg | teal | Called "green" | | 132 - brown_white.jpg | brown | Called "orange" | | 134 - teal_white.jpg | teal | Called "blue" or "light blue" | | 138 - maroon.jpg | maroon | Called "red" | | 150 - green_gray.jpg | green, gray | Called "black" | | 160 - blue_white.jpg | blue | Not detected | Notable improvements: Images **029** (maroon), **087/141/161** (light blue), and **099** (maroon) were previously persistent failures but were **fixed by the constrained prompt** for at least one model. --- ## Model Comparison ### Gemini 3 Flash - **Best at:** Precision (94.3% with constrained prompt), fewest hallucinated colors - **Weakness:** Lower exact recall than Qwen; still uses shade variants even with constraints - **Speed:** ~250-340s with 8 concurrent workers ### Qwen3-VL-8B - **Best at:** Recall (82.7% with constrained prompt), highest PASS count (127) - **Weakness:** Higher false positive rate; introduced "maroon" overcorrection with constrained prompt - **Speed:** ~1440-1600s sequential (local GPU inference) --- ## Recommendations 1. **Use the constrained prompt** (`jersey_prompt_constrained.txt`) — it is the clear winner for both models, improving recall and precision simultaneously. 2. **Post-processing normalization** could still recover additional matches: - Map `grey` → `gray` (catches any remaining Gemini outputs) - Map `navy` → `navy blue` (catches shorthand usage) 3. 
**Consider a brown/maroon calibration** — the constrained prompt overcorrected on Qwen, turning brown→maroon confusion into a new error source. Adding "Use 'brown' for warm, non-reddish dark colors" or similar guidance may help.
4. **Gray and black detection remain unsolved** at the prompt level — these are likely image quality or model perception limitations that no amount of prompt engineering will fix. These colors may benefit from a secondary computer vision pass (e.g., dominant color extraction from the jersey region).
5. **Retire the capstone prompt** — it offered no benefit over the original and performed worse than the constrained prompt in every accuracy metric (it was faster only because its prompt is shorter).

---

## Appendix: Color Similarity Families Used for Scoring

| Family | Member Colors |
|------------|-------------------------------------------------------|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |

---

## Appendix: Constrained Prompt (`jersey_prompt_constrained.txt`)

```
You are an expert at detecting sports jerseys in images. Carefully examine the provided image and identify all visible sports jerseys.

CRITICAL INSTRUCTIONS:
1. ONLY detect jerseys that are CLEARLY VISIBLE in the image
2. ONLY include jersey numbers that you can ACTUALLY READ in the image
3. If you CANNOT see any jerseys, you MUST return {"jerseys": []}
4. DO NOT make up, imagine, or guess jersey numbers that aren't visible
5. DO NOT include jerseys if you cannot clearly see the number

COLOR VOCABULARY:
For "jersey_color" and "number_color", you MUST choose from this list ONLY:
red, blue, dark blue, navy blue, light blue, green, yellow, gold, orange, purple, black, white, gray, brown, dark brown, maroon, teal, pink

Important color distinctions:
- Use "maroon" for dark brownish-red, NOT "red"
- Use "light blue" for pale or sky blue, NOT "blue"
- Use "navy blue" for very dark blue, NOT "blue" or "dark blue"
- Use "teal" for blue-green, NOT "green" or "blue"
- Use "gray" (not "grey") for silver or neutral tones
- Use "dark brown" for very dark brown, NOT "black"
- Use "gold" for metallic or deep yellow, NOT "yellow"

RESPONSE FORMAT:
Respond ONLY with a valid JSON object. No explanations, no markdown, no extra text.
Use DOUBLE QUOTES (") for all JSON keys and string values.
The JSON must have a single key "jerseys" with an array of dictionaries.
Each dictionary must have exactly these three keys:
- "jersey_number": The number on the jersey (as a string, only if clearly visible)
- "jersey_color": The primary color of the jersey (MUST be from the color list above)
- "number_color": The color of the number on the jersey (MUST be from the color list above)

Example response for an image WITH visible jerseys:
{
  "jerseys": [
    {
      "jersey_number": "10",
      "jersey_color": "maroon",
      "number_color": "gold"
    },
    {
      "jersey_number": "42",
      "jersey_color": "light blue",
      "number_color": "white"
    }
  ]
}

Example response for an image WITHOUT jerseys or with unclear numbers:
{"jerseys": []}

REMEMBER: Only include jerseys with numbers you can ACTUALLY SEE in the image. When in doubt, return empty array.

Now analyze the image and return the JSON object.
```
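As a guard against both vocabulary drift and the normalization gaps noted in the recommendations, model replies can be checked against the constrained list before scoring. A minimal sketch, assuming raw JSON strings as input; `validate_response` and `NORMALIZE` are illustrative names, not part of the actual harness:

```python
import json

# The constrained prompt's allowed vocabulary, verbatim.
ALLOWED = {
    "red", "blue", "dark blue", "navy blue", "light blue", "green",
    "yellow", "gold", "orange", "purple", "black", "white", "gray",
    "brown", "dark brown", "maroon", "teal", "pink",
}

# Post-processing normalization from recommendation 2.
NORMALIZE = {"grey": "gray", "navy": "navy blue"}

def validate_response(raw):
    """Parse a model reply, normalize spellings, and flag off-list colors."""
    data = json.loads(raw)
    violations = []
    for jersey in data.get("jerseys", []):
        for key in ("jersey_color", "number_color"):
            color = NORMALIZE.get(jersey.get(key, ""), jersey.get(key, ""))
            jersey[key] = color
            if color not in ALLOWED:
                violations.append(color)
    return data, violations
```

Any color in `violations` indicates the model ignored the vocabulary constraint, which can be logged separately from ordinary miss/extra scoring.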