| Metric | Qwen Original | Qwen Capstone | Qwen Constrained | Gemini Original | Gemini Capstone | Gemini Constrained |
|---|---|---|---|---|---|---|
| Recall (exact) | 65.3% | 66.3% | 71.8% | 62.4% | 60.9% | 67.8% |
| Recall (exact+similar) | 78.2% | 78.2% | 82.7% | 79.7% | 78.2% | 81.7% |
| Missed | 21.8% | 21.8% | 17.3% | 20.3% | 21.8% | 18.3% |
| Precision (exact) | 71.7% | 74.0% | 78.4% | 72.0% | 69.5% | 78.7% |
| Precision (exact+sim.) | 85.9% | 87.3% | 90.3% | 91.4% | 88.7% | 94.3% |
| Extra/wrong | 14.1% | 12.7% | 9.7% | 8.6% | 11.3% | 5.7% |
| PASS | 118 | 120 | 127 | 120 | 117 | 124 |
| PARTIAL | 19 | 19 | 15 | 20 | 22 | 19 |
| FAIL | 24 | 22 | 19 | 21 | 22 | 18 |
| Total time | 1557s | 1437s | 1596s | 253s | 260s | 344s |
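The percentage rows above follow directly from per-run match counts. A minimal sketch of that derivation; the function name is an assumption, and the counts are illustrative values back-solved from the Qwen constrained column (they reproduce its six percentages but are not stated as raw counts in this report):

```python
# Hypothetical helper: derive the recall/precision rows from match tallies.
# exact/similar/missed count ground-truth colors; extra counts predictions
# that matched nothing.
def color_metrics(exact, similar, missed, extra):
    gt_total = exact + similar + missed      # all ground-truth colors
    pred_total = exact + similar + extra     # all predicted colors
    return {
        "recall_exact": exact / gt_total,
        "recall_exact_similar": (exact + similar) / gt_total,
        "missed": missed / gt_total,
        "precision_exact": exact / pred_total,
        "precision_exact_similar": (exact + similar) / pred_total,
        "extra_wrong": extra / pred_total,
    }

# Illustrative counts consistent with the Qwen constrained column.
m = color_metrics(exact=145, similar=22, missed=35, extra=18)
print({k: f"{v:.1%}" for k, v in m.items()})
```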
The constrained vocabulary prompt delivered the strongest results across the board.
The constrained prompt's biggest impact was converting similar matches into exact matches by forcing models to use the ground truth vocabulary:
| Model | Exact Match (Original) | Exact Match (Constrained) | Improvement |
|---|---|---|---|
| Qwen | 65.3% (132) | 71.8% (145) | +6.5 pp |
| Gemini | 62.4% (126) | 67.8% (137) | +5.4 pp |
This came partly from eliminating vocabulary mismatch (e.g., grey→gray, navy→navy blue) and partly from teaching models to use specific color terms like "maroon" and "light blue."
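The vocabulary-mismatch fix amounts to a small normalization pass over model outputs before scoring. A minimal sketch, assuming the mapping table and function name (neither is named in this report):

```python
# Hypothetical normalization applied to predicted colors before comparing
# against ground truth; collapses vocabulary variants onto canonical terms.
NORMALIZE = {
    "grey": "gray",       # British spelling occasionally emitted by Gemini
    "navy": "navy blue",  # shorthand for the full term
}

def normalize_color(color: str) -> str:
    c = color.strip().lower()
    return NORMALIZE.get(c, c)

assert normalize_color("Grey") == "gray"
assert normalize_color("navy") == "navy blue"
assert normalize_color("maroon") == "maroon"  # unmapped colors pass through
```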
The constrained prompt's explicit color guidance fixed the worst systematic errors:
| Problem Color | Qwen Misses (Orig→Constrained) | Gemini Misses (Orig→Constrained) |
|---|---|---|
| maroon | 8 → 3 | 6 → 3 |
| light blue | 7 → 1 | 3 → 1 |
| dark brown | 4 → 2 | 1 → 1 |
| teal | 2 → 2 | 2 → 2 |
| gray | 7 → 8 | 6 → 6 |
| black | 6 → 6 | 7 → 7 |
The lone regression, Qwen's gray misses ticking up from 7 to 8, suggests mild overcorrection from the explicit color guidance; it is a smaller problem than the original misses it replaced, but worth noting.
The concurrent processing optimization (8 workers + session reuse + JPEG quality 85) delivered major speed gains:
| Previous Sequential Runs | Current Concurrent Runs |
|---|---|
| 2134s (13.3s avg) | 253s (1.6s avg) |
| 1882s (11.7s avg) | 260s (1.6s avg) |
| — | 344s (2.1s avg) |
That's roughly an 8x speedup for the first two prompts. The constrained prompt run was slightly slower (344s) due to its longer prompt text (2223 chars vs ~1500 chars).
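The concurrency pattern is straightforward to sketch. Assumptions here: the worker function and thread-local session handling are illustrative stand-ins, not the actual implementation; in the real runs each worker would hold a reused HTTP session (e.g. `requests.Session()`) and re-encode images as JPEG at quality 85 before upload:

```python
# Sketch of the 8-worker concurrent pipeline with per-thread session reuse.
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def get_session():
    # One session per worker thread, created lazily and reused, so
    # TCP/TLS connections are not re-established for every image.
    if not hasattr(_local, "session"):
        _local.session = object()  # stand-in for requests.Session()
    return _local.session

def call_model(image_path):
    session = get_session()
    # Real code would JPEG-encode at quality 85 and POST via `session` here.
    return (image_path, "ok")  # placeholder response

image_paths = [f"img_{i:03d}.jpg" for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(call_model, image_paths))
```

`ThreadPoolExecutor.map` preserves input order, which keeps per-image results aligned with the image list for scoring.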
These 10 images failed across all six runs, representing the hardest cases for current VLMs regardless of model or prompt:
| Image | GT Colors | Typical Error |
|---|---|---|
| 016 - maroon.jpg | maroon | Not detected or called "red" |
| 034 - light blue.jpg | light blue | Called "blue" |
| 046 - green.jpg | green | Called "black" |
| 053 - black_white.jpg | black | Not detected |
| 077 - teal_white.jpg | teal | Called "green" |
| 132 - brown_white.jpg | brown | Called "orange" |
| 134 - teal_white.jpg | teal | Called "blue" or "light blue" |
| 138 - maroon.jpg | maroon | Called "red" |
| 150 - green_gray.jpg | green, gray | Called "black" |
| 160 - blue_white.jpg | blue | Not detected |
Notable improvements: Images 029 (maroon), 087/141/161 (light blue), and 099 (maroon) were previously persistent failures but were fixed by the constrained prompt for at least one model.
The constrained vocabulary prompt (jersey_prompt_constrained.txt) is the clear winner for both models, improving recall and precision simultaneously. Two normalization mappings are also worth keeping: grey → gray (catches any remaining Gemini outputs) and navy → navy blue (catches shorthand usage).

For scoring, colors are grouped into families; a prediction in the same family as the ground truth counts as a similar match:

| Family | Member Colors |
|---|---|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |
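The family table above drives the "exact vs. similar" distinction in the scoring. A minimal sketch, with the dictionary truncated for brevity and the function names assumed:

```python
# Hypothetical similar-match scoring built on the color-family table.
FAMILIES = {
    "blue": {"blue", "dark blue", "navy blue", "navy", "royal blue"},
    "light_blue": {"light blue", "sky blue", "baby blue", "carolina blue", "powder blue"},
    "dark_red": {"maroon", "burgundy", "dark red", "wine"},
    "gray": {"gray", "grey", "silver", "charcoal"},
    "teal": {"teal", "turquoise", "cyan", "aqua"},
    # ... remaining families from the table elided
}

def family_of(color):
    for name, members in FAMILIES.items():
        if color in members:
            return name
    return color  # unlisted colors form their own one-member family

def match_kind(predicted, ground_truth):
    if predicted == ground_truth:
        return "exact"
    if family_of(predicted) == family_of(ground_truth):
        return "similar"
    return "miss"
```

Note the asymmetry this encodes: "navy" vs. "blue" scores as similar (same family), while "maroon" vs. "red" scores as a miss, since maroon sits in the dark_red family, which is why the maroon confusions above counted against recall.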
Constrained prompt (jersey_prompt_constrained.txt):

```text
You are an expert at detecting sports jerseys in images. Carefully examine the provided image and identify all visible sports jerseys.

CRITICAL INSTRUCTIONS:
1. ONLY detect jerseys that are CLEARLY VISIBLE in the image
2. ONLY include jersey numbers that you can ACTUALLY READ in the image
3. If you CANNOT see any jerseys, you MUST return {"jerseys": []}
4. DO NOT make up, imagine, or guess jersey numbers that aren't visible
5. DO NOT include jerseys if you cannot clearly see the number

COLOR VOCABULARY:
For "jersey_color" and "number_color", you MUST choose from this list ONLY:
red, blue, dark blue, navy blue, light blue, green, yellow, gold, orange, purple, black, white, gray, brown, dark brown, maroon, teal, pink

Important color distinctions:
- Use "maroon" for dark brownish-red, NOT "red"
- Use "light blue" for pale or sky blue, NOT "blue"
- Use "navy blue" for very dark blue, NOT "blue" or "dark blue"
- Use "teal" for blue-green, NOT "green" or "blue"
- Use "gray" (not "grey") for silver or neutral tones
- Use "dark brown" for very dark brown, NOT "black"
- Use "gold" for metallic or deep yellow, NOT "yellow"

RESPONSE FORMAT:
Respond ONLY with a valid JSON object. No explanations, no markdown, no extra text.
Use DOUBLE QUOTES (") for all JSON keys and string values.
The JSON must have a single key "jerseys" with an array of dictionaries.
Each dictionary must have exactly these three keys:
- "jersey_number": The number on the jersey (as a string, only if clearly visible)
- "jersey_color": The primary color of the jersey (MUST be from the color list above)
- "number_color": The color of the number on the jersey (MUST be from the color list above)

Example response for an image WITH visible jerseys:
{
  "jerseys": [
    {
      "jersey_number": "10",
      "jersey_color": "maroon",
      "number_color": "gold"
    },
    {
      "jersey_number": "42",
      "jersey_color": "light blue",
      "number_color": "white"
    }
  ]
}

Example response for an image WITHOUT jerseys or with unclear numbers:
{"jerseys": []}

REMEMBER: Only include jerseys with numbers you can ACTUALLY SEE in the image. When in doubt, return empty array.

Now analyze the image and return the JSON object.
```
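On the consuming side, the prompt's contract (single `"jerseys"` key, exactly three fields per entry, colors from the fixed list) can be enforced before scoring. A minimal sketch; the function name and the decision to reject rather than repair malformed output are assumptions:

```python
# Hypothetical validator for model responses against the prompt's contract.
import json

ALLOWED_COLORS = {
    "red", "blue", "dark blue", "navy blue", "light blue", "green",
    "yellow", "gold", "orange", "purple", "black", "white", "gray",
    "brown", "dark brown", "maroon", "teal", "pink",
}
REQUIRED_KEYS = {"jersey_number", "jersey_color", "number_color"}

def parse_response(raw: str):
    data = json.loads(raw)  # raises on invalid JSON
    jerseys = data["jerseys"]
    for j in jerseys:
        if set(j) != REQUIRED_KEYS:
            raise ValueError(f"unexpected keys: {sorted(j)}")
        for field in ("jersey_color", "number_color"):
            if j[field] not in ALLOWED_COLORS:
                raise ValueError(f"off-vocabulary color: {j[field]!r}")
    return jerseys

parse_response('{"jerseys": []}')  # the empty-image case from the prompt
```

Rejecting off-vocabulary colors outright surfaces any prompt-compliance failures immediately instead of silently degrading the similar-match numbers.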