Jersey Color Detection Accuracy — Round 2 Analysis

Date: March 3, 2026
Models: Gemini 3 Flash Preview, Qwen3-VL-8B (local via llama.cpp)
Prompts: jersey_prompt.txt (original), jersey_prompt_capstone.txt (capstone), jersey_prompt_constrained.txt (constrained)
Test set: 161 annotated images, 202 ground truth colors (excluding white)

Summary Comparison

| Metric | Qwen Original | Qwen Capstone | Qwen Constrained | Gemini Original | Gemini Capstone | Gemini Constrained |
|---|---|---|---|---|---|---|
| Recall (exact) | 65.3% | 66.3% | 71.8% | 62.4% | 60.9% | 67.8% |
| Recall (exact+similar) | 78.2% | 78.2% | 82.7% | 79.7% | 78.2% | 81.7% |
| Missed | 21.8% | 21.8% | 17.3% | 20.3% | 21.8% | 18.3% |
| Precision (exact) | 71.7% | 74.0% | 78.4% | 72.0% | 69.5% | 78.7% |
| Precision (exact+sim.) | 85.9% | 87.3% | 90.3% | 91.4% | 88.7% | 94.3% |
| Extra/wrong | 14.1% | 12.7% | 9.7% | 8.6% | 11.3% | 5.7% |
| PASS | 118 | 120 | 127 | 120 | 117 | 124 |
| PARTIAL | 19 | 19 | 15 | 20 | 22 | 19 |
| FAIL | 24 | 22 | 19 | 21 | 22 | 18 |
| Total time | 1557s | 1437s | 1596s | 253s | 260s | 344s |
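For reference, the percentages above all derive from a handful of raw counts. The sketch below shows the arithmetic, using the Qwen + Constrained column as a worked example; the function name and the back-solved raw counts (22 similar matches, 185 total predictions) are inferred from the reported percentages, not taken from the eval logs.

```python
# Illustrative sketch of how the summary metrics relate to raw counts.
# Variable names are hypothetical, not the actual eval script's.
def summarize(exact, similar, missed, extra, gt_total, pred_total):
    """Return the recall/precision percentages reported in the table."""
    return {
        "recall_exact": round(exact / gt_total * 100, 1),
        "recall_exact_similar": round((exact + similar) / gt_total * 100, 1),
        "missed_pct": round(missed / gt_total * 100, 1),
        "precision_exact": round(exact / pred_total * 100, 1),
        "precision_exact_similar": round((exact + similar) / pred_total * 100, 1),
        "extra_wrong_pct": round(extra / pred_total * 100, 1),
    }

# Qwen + Constrained: 145 exact of 202 GT colors, 167 exact+similar
m = summarize(exact=145, similar=22, missed=35, extra=18,
              gt_total=202, pred_total=185)
print(m)  # reproduces the Qwen Constrained column above
```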

Key Findings

1. The constrained prompt is the best prompt for both models

The constrained vocabulary prompt delivered the strongest results across the board:

- Qwen + Constrained: highest recall of any combination at 82.7% (167/202 found), up from 78.2% with both other prompts. It also posted the most PASS images (127, vs. 118 and 120) and the fewest FAILs for Qwen (19, down from 24 and 22).
- Gemini + Constrained: highest precision of any combination at 94.3% (164/174 correct), with only 5.7% extra/wrong colors, the lowest error rate across all six runs. It also had the fewest FAILs of any run (18).

2. Exact match rates jumped significantly

The constrained prompt's biggest impact was converting similar matches into exact matches by forcing models to use the ground truth vocabulary:

| Model | Exact Match (Original) | Exact Match (Constrained) | Improvement |
|---|---|---|---|
| Qwen | 65.3% (132) | 71.8% (145) | +6.5 pp |
| Gemini | 62.4% (126) | 67.8% (137) | +5.4 pp |

This came partly from eliminating vocabulary mismatch (e.g., grey→gray, navy→navy blue) and partly from teaching models to use specific color terms like "maroon" and "light blue."
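The vocabulary-mismatch portion of these gains could also be recovered in post-processing. A minimal sketch, using only the two synonym pairs named above (the helper name is illustrative):

```python
# Minimal post-hoc vocabulary normalization (hypothetical helper, not the
# actual eval code): collapse synonym spellings before exact matching.
SYNONYMS = {
    "grey": "gray",        # spelling variant
    "navy": "navy blue",   # shorthand
}

def normalize_color(color: str) -> str:
    c = color.strip().lower()
    return SYNONYMS.get(c, c)

print(normalize_color("Grey"))  # -> "gray"
print(normalize_color("navy"))  # -> "navy blue"
```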

3. Targeted color improvements

The constrained prompt's explicit color guidance fixed the worst systematic errors:

| Problem Color | Qwen Misses (Orig → Constrained) | Gemini Misses (Orig → Constrained) |
|---|---|---|
| maroon | 8 → 3 | 6 → 3 |
| light blue | 7 → 1 | 3 → 1 |
| dark brown | 4 → 2 | 1 → 1 |
| teal | 2 → 2 | 2 → 2 |
| gray | 7 → 8 | 6 → 6 |
| black | 6 → 6 | 7 → 7 |

4. New overcorrection pattern with constrained prompt

Overcorrection warning: the constrained prompt introduced a new failure mode in which models occasionally over-apply newly learned color terms.

This overcorrection is a smaller problem than the original misses it replaced, but worth noting.

5. The capstone prompt did not improve results

Capstone prompt: no benefit. The capstone prompt performed at or slightly below the original prompt for both models. Its emphasis on precision over recall ("do not guess") hurt overall detection rates without meaningfully improving color accuracy.

6. Gemini speed improvement from concurrency

The concurrent processing optimization (8 workers + session reuse + JPEG quality 85) delivered major speed gains:

| Prompt | Previous Sequential Run | Current Concurrent Run |
|---|---|---|
| Original | 2134s (13.3s avg) | 253s (1.6s avg) |
| Capstone | 1882s (11.7s avg) | 260s (1.6s avg) |
| Constrained | n/a | 344s (2.1s avg) |

That's roughly an 8x speedup for the first two prompts. The constrained prompt run was slightly slower (344s) due to its longer prompt text (2223 chars vs ~1500 chars).
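The concurrent driver can be sketched as below. This is illustrative only: `analyze_image` stands in for the real per-image Gemini call, which in the actual run also reused a single HTTP session and re-encoded images as JPEG quality 85 before upload.

```python
# Sketch of the 8-worker concurrent driver (hypothetical function names).
from concurrent.futures import ThreadPoolExecutor

def analyze_image(path: str) -> dict:
    # placeholder for: re-encode as JPEG (quality=85), POST via shared session
    return {"image": path, "jerseys": []}

def run_all(image_paths: list[str], workers: int = 8) -> list[dict]:
    # pool.map preserves input order, so results line up with image_paths
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_image, image_paths))

results = run_all([f"img_{i:03d}.jpg" for i in range(161)])
print(len(results))  # -> 161
```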


Persistently Failed Images

These 10 images failed across all six runs, representing the hardest cases for current VLMs regardless of model or prompt:

| Image | GT Colors | Typical Error |
|---|---|---|
| 016 - maroon.jpg | maroon | Not detected or called "red" |
| 034 - light blue.jpg | light blue | Called "blue" |
| 046 - green.jpg | green | Called "black" |
| 053 - black_white.jpg | black | Not detected |
| 077 - teal_white.jpg | teal | Called "green" |
| 132 - brown_white.jpg | brown | Called "orange" |
| 134 - teal_white.jpg | teal | Called "blue" or "light blue" |
| 138 - maroon.jpg | maroon | Called "red" |
| 150 - green_gray.jpg | green, gray | Called "black" |
| 160 - blue_white.jpg | blue | Not detected |

Notable improvements: Images 029 (maroon), 087/141/161 (light blue), and 099 (maroon) were previously persistent failures but were fixed by the constrained prompt for at least one model.


Model Comparison

Gemini 3 Flash

Higher precision in every prompt condition (exact+similar 91.4% / 88.7% / 94.3% vs. Qwen's 85.9% / 87.3% / 90.3%) and roughly 4-6x faster wall-clock time with concurrent processing. Recall was comparable to Qwen, trailing only under the constrained prompt (81.7% vs. 82.7%).

Qwen3-VL-8B

Best recall of any run under the constrained prompt (82.7%) and the most PASS images (127), while running fully locally via llama.cpp. The trade-off is a higher extra/wrong rate and 1437-1596s total runtime.


Recommendations

  1. Use the constrained prompt (jersey_prompt_constrained.txt) — it is the clear winner for both models, improving recall and precision simultaneously.
  2. Post-processing normalization could still recover additional matches: map grey → gray (catches any remaining Gemini outputs) and navy → navy blue (catches shorthand usage).
  3. Consider a brown/maroon calibration — the constrained prompt overcorrected on Qwen, turning brown→maroon confusion into a new error source. Adding "Use 'brown' for warm, non-reddish dark colors" or similar guidance may help.
  4. Gray and black detection remain unsolved at the prompt level — these are likely image quality or model perception limitations that no amount of prompt engineering will fix. These colors may benefit from a secondary computer vision pass (e.g., dominant color extraction from the jersey region).
  5. Retire the capstone prompt — it offered no benefit over the original and performed worse than the constrained prompt in every metric.
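The secondary computer-vision pass suggested in recommendation 4 could start as simple as a low-saturation check on the cropped jersey region. The sketch below classifies a mean RGB value with fixed thresholds; the function name and all threshold values are illustrative assumptions, not tuned against this test set.

```python
from typing import Optional

# Hypothetical secondary check for gray/black: classify the mean RGB of a
# cropped jersey region with simple thresholds. Thresholds are illustrative.
def classify_neutral(r: int, g: int, b: int) -> Optional[str]:
    """Return 'black' or 'gray' for low-saturation pixels, else None."""
    brightness = (r + g + b) / 3
    spread = max(r, g, b) - min(r, g, b)   # rough proxy for saturation
    if spread > 30:                        # visibly colored: defer to the VLM
        return None
    if brightness < 60:
        return "black"
    if brightness < 200:
        return "gray"
    return None                            # near-white: excluded from scoring

print(classify_neutral(40, 42, 45))     # -> "black"
print(classify_neutral(128, 130, 127))  # -> "gray"
print(classify_neutral(200, 40, 40))    # -> None (clearly colored)
```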

Appendix: Color Similarity Families Used for Scoring

| Family | Member Colors |
|---|---|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |
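The "similar" tier of the scoring can be sketched directly from these families: a prediction counts as similar when it shares a family with the ground truth. The family data below is copied from the table above; the function name is illustrative.

```python
# Similarity families from the appendix table above.
FAMILIES = {
    "blue": ["blue", "dark blue", "navy blue", "navy", "royal blue"],
    "light_blue": ["light blue", "sky blue", "baby blue", "carolina blue", "powder blue"],
    "red": ["red", "scarlet", "crimson"],
    "dark_red": ["maroon", "burgundy", "dark red", "wine"],
    "green": ["green", "dark green", "forest green", "kelly green"],
    "yellow": ["yellow", "gold", "golden"],
    "orange": ["orange", "burnt orange"],
    "brown": ["brown", "dark brown"],
    "purple": ["purple", "violet"],
    "gray": ["gray", "grey", "silver", "charcoal"],
    "black": ["black"],
    "teal": ["teal", "turquoise", "cyan", "aqua"],
    "pink": ["pink", "magenta", "hot pink", "rose"],
}
COLOR_TO_FAMILY = {c: fam for fam, members in FAMILIES.items() for c in members}

def match_kind(predicted: str, ground_truth: str) -> str:
    """Classify one prediction as 'exact', 'similar', or 'miss'."""
    if predicted == ground_truth:
        return "exact"
    if predicted in COLOR_TO_FAMILY and \
            COLOR_TO_FAMILY[predicted] == COLOR_TO_FAMILY.get(ground_truth):
        return "similar"
    return "miss"

print(match_kind("navy", "blue"))        # -> "similar"
print(match_kind("maroon", "maroon"))    # -> "exact"
print(match_kind("light blue", "blue"))  # -> "miss" (separate families)
```

Note that `light_blue` being its own family is what makes "light blue" vs. "blue" a miss rather than a similar match, which is why the constrained prompt's light-blue fixes moved the recall numbers.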

Appendix: Constrained Prompt (jersey_prompt_constrained.txt)

You are an expert at detecting sports jerseys in images. Carefully examine the provided image and identify all visible sports jerseys.

CRITICAL INSTRUCTIONS:
1. ONLY detect jerseys that are CLEARLY VISIBLE in the image
2. ONLY include jersey numbers that you can ACTUALLY READ in the image
3. If you CANNOT see any jerseys, you MUST return {"jerseys": []}
4. DO NOT make up, imagine, or guess jersey numbers that aren't visible
5. DO NOT include jerseys if you cannot clearly see the number

COLOR VOCABULARY:
For "jersey_color" and "number_color", you MUST choose from this list ONLY:
red, blue, dark blue, navy blue, light blue, green, yellow, gold, orange, purple, black, white, gray, brown, dark brown, maroon, teal, pink

Important color distinctions:
- Use "maroon" for dark brownish-red, NOT "red"
- Use "light blue" for pale or sky blue, NOT "blue"
- Use "navy blue" for very dark blue, NOT "blue" or "dark blue"
- Use "teal" for blue-green, NOT "green" or "blue"
- Use "gray" (not "grey") for silver or neutral tones
- Use "dark brown" for very dark brown, NOT "black"
- Use "gold" for metallic or deep yellow, NOT "yellow"

RESPONSE FORMAT:
Respond ONLY with a valid JSON object. No explanations, no markdown, no extra text.

Use DOUBLE QUOTES (") for all JSON keys and string values.

The JSON must have a single key "jerseys" with an array of dictionaries.

Each dictionary must have exactly these three keys:
- "jersey_number": The number on the jersey (as a string, only if clearly visible)
- "jersey_color": The primary color of the jersey (MUST be from the color list above)
- "number_color": The color of the number on the jersey (MUST be from the color list above)

Example response for an image WITH visible jerseys:
{
  "jerseys": [
    {
      "jersey_number": "10",
      "jersey_color": "maroon",
      "number_color": "gold"
    },
    {
      "jersey_number": "42",
      "jersey_color": "light blue",
      "number_color": "white"
    }
  ]
}

Example response for an image WITHOUT jerseys or with unclear numbers:
{"jerseys": []}

REMEMBER: Only include jerseys with numbers you can ACTUALLY SEE in the image. When in doubt, return empty array.

Now analyze the image and return the JSON object.
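A minimal validator for the response format the prompt demands might look like the sketch below. This is an assumption about how the pipeline could check model output, not the actual parsing code; the vocabulary set is copied from the prompt.

```python
import json

# Allowed vocabulary, copied from the COLOR VOCABULARY section of the prompt.
ALLOWED = {
    "red", "blue", "dark blue", "navy blue", "light blue", "green",
    "yellow", "gold", "orange", "purple", "black", "white", "gray",
    "brown", "dark brown", "maroon", "teal", "pink",
}

def validate_response(raw: str) -> list:
    """Parse a model reply and enforce the schema and color vocabulary."""
    data = json.loads(raw)
    jerseys = data["jerseys"]
    for j in jerseys:
        assert set(j) == {"jersey_number", "jersey_color", "number_color"}
        assert isinstance(j["jersey_number"], str)
        assert j["jersey_color"] in ALLOWED, f"off-vocabulary: {j['jersey_color']}"
        assert j["number_color"] in ALLOWED, f"off-vocabulary: {j['number_color']}"
    return jerseys

ok = validate_response(
    '{"jerseys": [{"jersey_number": "10", '
    '"jersey_color": "maroon", "number_color": "gold"}]}'
)
print(len(ok))  # -> 1
```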