| Metric | Qwen Original | Qwen Capstone | Qwen Constrained | Gemini Original | Gemini Capstone | Gemini Constrained |
|---|---|---|---|---|---|---|
| Recall (exact) | 65.3% | 66.3% | 71.8% | 62.4% | 60.9% | 67.8% |
| Recall (exact+similar) | 78.2% | 78.2% | 82.7% | 79.7% | 78.2% | 81.7% |
| Missed | 21.8% | 21.8% | 17.3% | 20.3% | 21.8% | 18.3% |
| Precision (exact) | 71.7% | 74.0% | 78.4% | 72.0% | 69.5% | 78.7% |
| Precision (exact+sim.) | 85.9% | 87.3% | 90.3% | 91.4% | 88.7% | 94.3% |
| Extra/wrong | 14.1% | 12.7% | 9.7% | 8.6% | 11.3% | 5.7% |
| PASS | 118 | 120 | 127 | 120 | 117 | 124 |
| PARTIAL | 19 | 19 | 15 | 20 | 22 | 19 |
| FAIL | 24 | 22 | 19 | 21 | 22 | 18 |
| Total time | 1557s | 1437s | 1596s | 253s | 260s | 344s |
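The percentage rows above follow directly from per-run match counts. A minimal sketch of that derivation; the function name is an assumption, and the counts are illustrative values back-solved from the Qwen constrained column (they reproduce its six percentages but are not stated as raw counts in this report):

```python
# Hypothetical helper: derive the recall/precision rows from match tallies.
# exact/similar/missed count ground-truth colors; extra counts predictions
# that matched nothing.
def color_metrics(exact, similar, missed, extra):
    gt_total = exact + similar + missed      # all ground-truth colors
    pred_total = exact + similar + extra     # all predicted colors
    return {
        "recall_exact": exact / gt_total,
        "recall_exact_similar": (exact + similar) / gt_total,
        "missed": missed / gt_total,
        "precision_exact": exact / pred_total,
        "precision_exact_similar": (exact + similar) / pred_total,
        "extra_wrong": extra / pred_total,
    }

# Illustrative counts consistent with the Qwen constrained column.
m = color_metrics(exact=145, similar=22, missed=35, extra=18)
print({k: f"{v:.1%}" for k, v in m.items()})
```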
The constrained vocabulary prompt delivered the strongest results across the board.
The constrained prompt's biggest impact was converting similar matches into exact matches by forcing models to use the ground truth vocabulary:
| Model | Exact Match (Original) | Exact Match (Constrained) | Improvement |
|---|---|---|---|
| Qwen | 65.3% (132) | 71.8% (145) | +6.5 pp |
| Gemini | 62.4% (126) | 67.8% (137) | +5.4 pp |
This came partly from eliminating vocabulary mismatch (e.g., grey→gray, navy→navy blue) and partly from teaching models to use specific color terms like "maroon" and "light blue."
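The vocabulary-mismatch fix amounts to a small normalization pass over model outputs before scoring. A minimal sketch, assuming the mapping table and function name (neither is named in this report):

```python
# Hypothetical normalization applied to predicted colors before comparing
# against ground truth; collapses vocabulary variants onto canonical terms.
NORMALIZE = {
    "grey": "gray",       # British spelling occasionally emitted by Gemini
    "navy": "navy blue",  # shorthand for the full term
}

def normalize_color(color: str) -> str:
    c = color.strip().lower()
    return NORMALIZE.get(c, c)

assert normalize_color("Grey") == "gray"
assert normalize_color("navy") == "navy blue"
assert normalize_color("maroon") == "maroon"  # unmapped colors pass through
```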
The constrained prompt's explicit color guidance fixed the worst systematic errors:
| Problem Color | Qwen Misses (Orig→Constrained) | Gemini Misses (Orig→Constrained) |
|---|---|---|
| maroon | 8 → 3 | 6 → 3 |
| light blue | 7 → 1 | 3 → 1 |
| dark brown | 4 → 2 | 1 → 1 |
| teal | 2 → 2 | 2 → 2 |
| gray | 7 → 8 | 6 → 6 |
| black | 6 → 6 | 7 → 7 |
The lone regression, Qwen's gray misses ticking up from 7 to 8, suggests mild overcorrection from the explicit color guidance; it is a smaller problem than the original misses it replaced, but worth noting.
The concurrent processing optimization (8 workers + session reuse + JPEG quality 85) delivered major speed gains:
| Previous Sequential Runs | Current Concurrent Runs |
|---|---|
| 2134s (13.3s avg) | 253s (1.6s avg) |
| 1882s (11.7s avg) | 260s (1.6s avg) |
| — | 344s (2.1s avg) |
That's roughly an 8x speedup for the first two prompts. The constrained prompt run was slightly slower (344s) due to its longer prompt text (2223 chars vs ~1500 chars).
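The concurrency pattern is straightforward to sketch. Assumptions here: the worker function and thread-local session handling are illustrative stand-ins, not the actual implementation; in the real runs each worker would hold a reused HTTP session (e.g. `requests.Session()`) and re-encode images as JPEG at quality 85 before upload:

```python
# Sketch of the 8-worker concurrent pipeline with per-thread session reuse.
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def get_session():
    # One session per worker thread, created lazily and reused, so
    # TCP/TLS connections are not re-established for every image.
    if not hasattr(_local, "session"):
        _local.session = object()  # stand-in for requests.Session()
    return _local.session

def call_model(image_path):
    session = get_session()
    # Real code would JPEG-encode at quality 85 and POST via `session` here.
    return (image_path, "ok")  # placeholder response

image_paths = [f"img_{i:03d}.jpg" for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_model, image_paths))
```

`ThreadPoolExecutor.map` preserves input order, which keeps per-image results aligned with the image list for scoring.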
These 10 images failed across all six runs, representing the hardest cases for current VLMs regardless of model or prompt:
| Image | GT Colors | Typical Error |
|---|---|---|
| 016 - maroon.jpg | maroon | Not detected or called "red" |
| 034 - light blue.jpg | light blue | Called "blue" |
| 046 - green.jpg | green | Called "black" |
| 053 - black_white.jpg | black | Not detected |
| 077 - teal_white.jpg | teal | Called "green" |
| 132 - brown_white.jpg | brown | Called "orange" |
| 134 - teal_white.jpg | teal | Called "blue" or "light blue" |
| 138 - maroon.jpg | maroon | Called "red" |
| 150 - green_gray.jpg | green, gray | Called "black" |
| 160 - blue_white.jpg | blue | Not detected |
Notable improvements: Images 029 (maroon), 087/141/161 (light blue), and 099 (maroon) were previously persistent failures but were fixed by the constrained prompt for at least one model.
The constrained vocabulary prompt (jersey_prompt_constrained.txt) is the clear winner for both models, improving recall and precision simultaneously. Two normalization mappings are also worth keeping: grey → gray (catches any remaining Gemini outputs) and navy → navy blue (catches shorthand usage).

For scoring, colors are grouped into families; a prediction in the same family as the ground truth counts as a similar match:

| Family | Member Colors |
|---|---|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |
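The family table above drives the "exact vs. similar" distinction in the scoring. A minimal sketch, with the dictionary truncated for brevity and the function names assumed:

```python
# Hypothetical similar-match scoring built on the color-family table.
FAMILIES = {
    "blue": {"blue", "dark blue", "navy blue", "navy", "royal blue"},
    "light_blue": {"light blue", "sky blue", "baby blue", "carolina blue", "powder blue"},
    "dark_red": {"maroon", "burgundy", "dark red", "wine"},
    "gray": {"gray", "grey", "silver", "charcoal"},
    "teal": {"teal", "turquoise", "cyan", "aqua"},
    # ... remaining families from the table elided
}

def family_of(color):
    for name, members in FAMILIES.items():
        if color in members:
            return name
    return color  # unlisted colors form their own one-member family

def match_kind(predicted, ground_truth):
    if predicted == ground_truth:
        return "exact"
    if family_of(predicted) == family_of(ground_truth):
        return "similar"
    return "miss"
```

Note the asymmetry this encodes: "navy" vs. "blue" scores as similar (same family), while "maroon" vs. "red" scores as a miss, since maroon sits in the dark_red family, which is why the maroon confusions above counted against recall.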
Constrained prompt (jersey_prompt_constrained.txt):

```text
You are an expert at detecting sports jerseys in images. Carefully examine the provided image and identify all visible sports jerseys.

CRITICAL INSTRUCTIONS:
1. ONLY detect jerseys that are CLEARLY VISIBLE in the image
2. ONLY include jersey numbers that you can ACTUALLY READ in the image
3. If you CANNOT see any jerseys, you MUST return {"jerseys": []}
4. DO NOT make up, imagine, or guess jersey numbers that aren't visible
5. DO NOT include jerseys if you cannot clearly see the number

COLOR VOCABULARY:
For "jersey_color" and "number_color", you MUST choose from this list ONLY:
red, blue, dark blue, navy blue, light blue, green, yellow, gold, orange, purple, black, white, gray, brown, dark brown, maroon, teal, pink

Important color distinctions:
- Use "maroon" for dark brownish-red, NOT "red"
- Use "light blue" for pale or sky blue, NOT "blue"
- Use "navy blue" for very dark blue, NOT "blue" or "dark blue"
- Use "teal" for blue-green, NOT "green" or "blue"
- Use "gray" (not "grey") for silver or neutral tones
- Use "dark brown" for very dark brown, NOT "black"
- Use "gold" for metallic or deep yellow, NOT "yellow"

RESPONSE FORMAT:
Respond ONLY with a valid JSON object. No explanations, no markdown, no extra text.
Use DOUBLE QUOTES (") for all JSON keys and string values.
The JSON must have a single key "jerseys" with an array of dictionaries.
Each dictionary must have exactly these three keys:
- "jersey_number": The number on the jersey (as a string, only if clearly visible)
- "jersey_color": The primary color of the jersey (MUST be from the color list above)
- "number_color": The color of the number on the jersey (MUST be from the color list above)

Example response for an image WITH visible jerseys:
{
  "jerseys": [
    {
      "jersey_number": "10",
      "jersey_color": "maroon",
      "number_color": "gold"
    },
    {
      "jersey_number": "42",
      "jersey_color": "light blue",
      "number_color": "white"
    }
  ]
}

Example response for an image WITHOUT jerseys or with unclear numbers:
{"jerseys": []}

REMEMBER: Only include jerseys with numbers you can ACTUALLY SEE in the image. When in doubt, return empty array.

Now analyze the image and return the JSON object.
```
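On the consuming side, the prompt's contract (single `"jerseys"` key, exactly three fields per entry, colors from the fixed list) can be enforced before scoring. A minimal sketch; the function name and the decision to reject rather than repair malformed output are assumptions:

```python
# Hypothetical validator for model responses against the prompt's contract.
import json

ALLOWED_COLORS = {
    "red", "blue", "dark blue", "navy blue", "light blue", "green",
    "yellow", "gold", "orange", "purple", "black", "white", "gray",
    "brown", "dark brown", "maroon", "teal", "pink",
}
REQUIRED_KEYS = {"jersey_number", "jersey_color", "number_color"}

def parse_response(raw: str):
    data = json.loads(raw)  # raises on invalid JSON
    jerseys = data["jerseys"]
    for j in jerseys:
        if set(j) != REQUIRED_KEYS:
            raise ValueError(f"unexpected keys: {sorted(j)}")
        for field in ("jersey_color", "number_color"):
            if j[field] not in ALLOWED_COLORS:
                raise ValueError(f"off-vocabulary color: {j[field]!r}")
    return jerseys

parse_response('{"jerseys": []}')  # the empty-image case from the prompt
```

Rejecting off-vocabulary colors outright surfaces any prompt-compliance failures immediately instead of silently degrading the similar-match numbers.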