jersey_test/accuracy_analysis_report_round2.html
Rick McEwen 5405d7f7dc Add accuracy test framework, prompts, results, and analysis reports
Includes accuracy test scripts for Qwen (local) and Gemini (cloud API),
three prompt variants (original, capstone, constrained), test results
from all runs, and two analysis reports with an HTML presentation version.
2026-03-03 18:44:49 -07:00

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Jersey Color Detection Accuracy — Round 2 Analysis</title>
<style>
:root {
--green: #16a34a;
--green-bg: #dcfce7;
--red: #dc2626;
--red-bg: #fee2e2;
--blue: #2563eb;
--blue-bg: #dbeafe;
--amber: #d97706;
--amber-bg: #fef3c7;
--gray-50: #f9fafb;
--gray-100: #f3f4f6;
--gray-200: #e5e7eb;
--gray-300: #d1d5db;
--gray-600: #4b5563;
--gray-700: #374151;
--gray-800: #1f2937;
--gray-900: #111827;
}
* { box-sizing: border-box; margin: 0; padding: 0; }
body {
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
line-height: 1.6;
color: var(--gray-800);
max-width: 1100px;
margin: 0 auto;
padding: 2rem;
background: #fff;
}
h1 {
font-size: 2rem;
color: var(--gray-900);
border-bottom: 3px solid var(--blue);
padding-bottom: 0.5rem;
margin-bottom: 0.5rem;
}
h2 {
font-size: 1.5rem;
color: var(--blue);
margin-top: 2.5rem;
margin-bottom: 1rem;
border-bottom: 2px solid var(--gray-200);
padding-bottom: 0.3rem;
}
h3 {
font-size: 1.15rem;
color: var(--gray-700);
margin-top: 1.5rem;
margin-bottom: 0.5rem;
}
.meta {
color: var(--gray-600);
font-size: 0.95rem;
margin-bottom: 1.5rem;
}
.meta strong { color: var(--gray-800); }
p { margin-bottom: 0.75rem; }
hr {
border: none;
border-top: 1px solid var(--gray-200);
margin: 2rem 0;
}
/* Tables */
table {
width: 100%;
border-collapse: collapse;
margin: 1rem 0 1.5rem;
font-size: 0.9rem;
}
th, td {
padding: 0.5rem 0.75rem;
text-align: center;
border: 1px solid var(--gray-200);
}
th {
background: var(--gray-800);
color: #fff;
font-weight: 600;
}
td:first-child, th:first-child {
text-align: left;
font-weight: 600;
}
tr:nth-child(even) { background: var(--gray-50); }
tr:hover { background: var(--gray-100); }
/* Highlight classes */
.best {
background: var(--green-bg) !important;
color: var(--green);
font-weight: 700;
}
.worst {
background: var(--red-bg) !important;
color: var(--red);
font-weight: 600;
}
.improved {
background: var(--blue-bg) !important;
color: var(--blue);
font-weight: 600;
}
.warning {
background: var(--amber-bg) !important;
color: var(--amber);
font-weight: 600;
}
/* Callout boxes */
.callout {
border-left: 4px solid;
padding: 1rem 1.25rem;
margin: 1rem 0;
border-radius: 0 6px 6px 0;
}
.callout-green {
border-color: var(--green);
background: var(--green-bg);
}
.callout-red {
border-color: var(--red);
background: var(--red-bg);
}
.callout-blue {
border-color: var(--blue);
background: var(--blue-bg);
}
.callout-amber {
border-color: var(--amber);
background: var(--amber-bg);
}
.callout strong { display: block; margin-bottom: 0.25rem; }
/* Model comparison cards */
.model-cards {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 1.5rem;
margin: 1rem 0;
}
.model-card {
border: 2px solid var(--gray-200);
border-radius: 8px;
padding: 1.25rem;
}
.model-card h3 {
margin-top: 0;
padding-bottom: 0.4rem;
border-bottom: 2px solid;
}
.model-card.gemini h3 { border-color: var(--blue); color: var(--blue); }
.model-card.qwen h3 { border-color: var(--green); color: var(--green); }
.model-card ul {
list-style: none;
padding: 0;
margin: 0.5rem 0 0;
}
.model-card li {
padding: 0.3rem 0;
font-size: 0.9rem;
}
.model-card li strong { color: var(--gray-700); }
/* Recommendation list */
ol.recs {
counter-reset: rec;
list-style: none;
padding: 0;
}
ol.recs li {
counter-increment: rec;
padding: 0.75rem 1rem 0.75rem 3.25rem;
margin-bottom: 0.5rem;
border-radius: 6px;
background: var(--gray-50);
border: 1px solid var(--gray-200);
position: relative;
}
ol.recs li::before {
content: counter(rec);
position: absolute;
left: 0.75rem;
top: 0.75rem;
width: 1.75rem;
height: 1.75rem;
background: var(--blue);
color: #fff;
border-radius: 50%;
text-align: center;
line-height: 1.75rem;
font-weight: 700;
font-size: 0.85rem;
}
/* Code / prompt block */
pre {
background: var(--gray-900);
color: #e5e7eb;
padding: 1.25rem;
border-radius: 8px;
overflow-x: auto;
font-size: 0.85rem;
line-height: 1.5;
margin: 1rem 0;
}
code {
font-family: "SF Mono", "Fira Code", "Fira Mono", Menlo, Consolas, monospace;
background: var(--gray-100);
padding: 0.15rem 0.35rem;
border-radius: 3px;
font-size: 0.88em;
}
pre code {
background: none;
padding: 0;
}
/* Color swatch in similarity table */
.swatch {
display: inline-block;
width: 14px;
height: 14px;
border-radius: 3px;
margin-right: 6px;
vertical-align: middle;
border: 1px solid var(--gray-300);
}
/* Badge */
.badge {
display: inline-block;
padding: 0.15rem 0.5rem;
border-radius: 4px;
font-size: 0.8rem;
font-weight: 700;
text-transform: uppercase;
letter-spacing: 0.03em;
}
.badge-pass { background: var(--green-bg); color: var(--green); }
.badge-partial { background: var(--amber-bg); color: var(--amber); }
.badge-fail { background: var(--red-bg); color: var(--red); }
/* Print styles */
@media print {
body { padding: 0; font-size: 11pt; }
.callout, .model-card { break-inside: avoid; }
h2 { break-after: avoid; }
}
</style>
</head>
<body>
<h1>Jersey Color Detection Accuracy — Round 2 Analysis</h1>
<div class="meta">
<strong>Date:</strong> March 3, 2026<br>
<strong>Models:</strong> Gemini 3 Flash Preview, Qwen3-VL-8B (local via llama.cpp)<br>
<strong>Prompts:</strong> jersey_prompt.txt (original), jersey_prompt_capstone.txt (capstone), jersey_prompt_constrained.txt (constrained)<br>
<strong>Test set:</strong> 161 annotated images, 202 ground truth colors (excluding white)
</div>
<hr>
<h2>Summary Comparison</h2>
<table>
<thead>
<tr>
<th>Metric</th>
<th>Qwen Original</th>
<th>Qwen Capstone</th>
<th>Qwen Constrained</th>
<th>Gemini Original</th>
<th>Gemini Capstone</th>
<th>Gemini Constrained</th>
</tr>
</thead>
<tbody>
<tr>
<td>Recall (exact)</td>
<td>65.3%</td>
<td>66.3%</td>
<td class="best">71.8%</td>
<td>62.4%</td>
<td class="worst">60.9%</td>
<td class="improved">67.8%</td>
</tr>
<tr>
<td>Recall (exact+similar)</td>
<td>78.2%</td>
<td>78.2%</td>
<td class="best">82.7%</td>
<td>79.7%</td>
<td class="worst">78.2%</td>
<td class="improved">81.7%</td>
</tr>
<tr>
<td>Missed</td>
<td>21.8%</td>
<td>21.8%</td>
<td class="best">17.3%</td>
<td>20.3%</td>
<td class="worst">21.8%</td>
<td class="improved">18.3%</td>
</tr>
<tr>
<td>Precision (exact)</td>
<td>71.7%</td>
<td>74.0%</td>
<td class="improved">78.4%</td>
<td>72.0%</td>
<td class="worst">69.5%</td>
<td class="best">78.7%</td>
</tr>
<tr>
<td>Precision (exact+sim.)</td>
<td>85.9%</td>
<td>87.3%</td>
<td class="improved">90.3%</td>
<td>91.4%</td>
<td class="worst">88.7%</td>
<td class="best">94.3%</td>
</tr>
<tr>
<td>Extra/wrong</td>
<td>14.1%</td>
<td>12.7%</td>
<td class="improved">9.7%</td>
<td>8.6%</td>
<td class="worst">11.3%</td>
<td class="best">5.7%</td>
</tr>
<tr>
<td><span class="badge badge-pass">PASS</span></td>
<td>118</td>
<td>120</td>
<td class="best">127</td>
<td>120</td>
<td>117</td>
<td class="improved">124</td>
</tr>
<tr>
<td><span class="badge badge-partial">PARTIAL</span></td>
<td>19</td>
<td>19</td>
<td class="best">15</td>
<td>20</td>
<td class="worst">22</td>
<td>19</td>
</tr>
<tr>
<td><span class="badge badge-fail">FAIL</span></td>
<td>24</td>
<td>22</td>
<td>19</td>
<td>21</td>
<td>22</td>
<td class="best">18</td>
</tr>
<tr>
<td>Total time</td>
<td>1557s</td>
<td>1437s</td>
<td>1596s</td>
<td class="best">253s</td>
<td>260s</td>
<td>344s</td>
</tr>
</tbody>
</table>
<hr>
<h2>Key Findings</h2>
<h3>1. The constrained prompt is the best prompt for both models</h3>
<p>The constrained vocabulary prompt delivered the strongest accuracy results for both models:</p>
<div class="callout callout-green">
<strong>Qwen + Constrained</strong>
Achieved the highest recall of any combination at <strong>82.7%</strong> (167/202 found), up from 78.2% with both other prompts. It also posted the most PASS images of any run (<strong>127</strong>, up from 118 and 120) and the fewest FAIL images among the Qwen runs (<strong>19</strong>, down from 24 and 22).
</div>
<div class="callout callout-blue">
<strong>Gemini + Constrained</strong>
Achieved the highest precision of any combination at <strong>94.3%</strong> (164/174 correct), with only <strong>5.7% extra/wrong</strong> colors, the lowest error rate across all six runs. It also had the fewest FAIL images of any run (<strong>18</strong>).
</div>
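<p>The percentages in the callouts follow directly from the raw counts. A minimal sketch of the arithmetic, assuming recall is found ground-truth colors over all ground-truth colors and precision is correct predictions over all predictions (the function names are illustrative and are not taken from the test scripts):</p>

```python
def recall(found: int, ground_truth: int) -> float:
    """Share of ground-truth colors that the model reported."""
    return found / ground_truth


def precision(correct: int, predicted: int) -> float:
    """Share of reported colors that match a ground-truth color."""
    return correct / predicted


def as_percent(value: float) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{value * 100:.1f}%"


# Counts quoted in the callouts above.
qwen_recall = recall(found=167, ground_truth=202)
gemini_precision = precision(correct=164, predicted=174)

print(as_percent(qwen_recall))           # 82.7%
print(as_percent(gemini_precision))      # 94.3%
print(as_percent(1 - gemini_precision))  # 5.7% extra/wrong
```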
<h3>2. Exact match rates jumped significantly</h3>
<p>The constrained prompt's biggest impact was converting similar matches into exact matches by forcing models to use the ground truth vocabulary:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Exact Match (Original)</th>
<th>Exact Match (Constrained)</th>
<th>Improvement</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qwen</td>
<td>65.3% (132)</td>
<td class="best">71.8% (145)</td>
<td class="improved">+6.5 pp</td>
</tr>
<tr>
<td>Gemini</td>
<td>62.4% (126)</td>
<td class="best">67.8% (137)</td>
<td class="improved">+5.4 pp</td>
</tr>
</tbody>
</table>
<p>This came partly from eliminating vocabulary mismatch (e.g., grey→gray, navy→navy blue) and partly from teaching models to use specific color terms like "maroon" and "light blue."</p>
<h3>3. Targeted color improvements</h3>
<p>The constrained prompt's explicit color guidance fixed the worst systematic errors:</p>
<table>
<thead>
<tr>
<th>Problem Color</th>
<th>Qwen Misses (Orig→Constrained)</th>
<th>Gemini Misses (Orig→Constrained)</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="swatch" style="background:#800000"></span>maroon</td>
<td>8 → <span style="color:var(--green);font-weight:700">3</span></td>
<td>6 → <span style="color:var(--green);font-weight:700">3</span></td>
</tr>
<tr>
<td><span class="swatch" style="background:#87ceeb"></span>light blue</td>
<td>7 → <span style="color:var(--green);font-weight:700">1</span></td>
<td>3 → <span style="color:var(--green);font-weight:700">1</span></td>
</tr>
<tr>
<td><span class="swatch" style="background:#3e2723"></span>dark brown</td>
<td>4 → <span style="color:var(--green);font-weight:700">2</span></td>
<td>1 → 1</td>
</tr>
<tr>
<td><span class="swatch" style="background:#008080"></span>teal</td>
<td>2 → 2</td>
<td>2 → 2</td>
</tr>
<tr>
<td><span class="swatch" style="background:#9e9e9e"></span>gray</td>
<td class="warning">7 → 8</td>
<td>6 → 6</td>
</tr>
<tr>
<td><span class="swatch" style="background:#222"></span>black</td>
<td>6 → 6</td>
<td>7 → 7</td>
</tr>
</tbody>
</table>
<ul>
<li><strong>Maroon:</strong> Misses fell by half or more for both models (8→3 for Qwen, 6→3 for Gemini). Previously the most-missed color for Qwen; now ranks 5th.</li>
<li><strong>Light blue:</strong> Near-elimination of the "light blue → blue" confusion for both models (7→1 for Qwen, 3→1 for Gemini).</li>
<li><strong>Gray/grey:</strong> The spelling normalization instruction eliminated the grey→gray similar-match penalty for Gemini entirely (10 confusions → 0). However, gray detection misses did not improve (7→8 for Qwen, 6→6 for Gemini): these are images where gray jerseys aren't detected at all, not a naming issue.</li>
<li><strong>Teal and black</strong> remain stubbornly problematic regardless of prompt.</li>
</ul>
<h3>4. New overcorrection pattern with constrained prompt</h3>
<div class="callout callout-amber">
<strong>Overcorrection Warning</strong>
The constrained prompt introduced a new failure mode — models now occasionally over-apply newly-learned color terms.
</div>
<ul>
<li><strong>Qwen + Constrained</strong> reported "maroon" as an extra/wrong color <strong>5 times</strong> (was 0 previously). It now calls some brown, red, and orange jerseys "maroon", the opposite of the original problem. Cases include: 007 (brown→maroon), 031 (brown→maroon), 048 (red→maroon), 142 (orange→maroon).</li>
<li><strong>Gemini + Constrained</strong> reported "light blue" as an extra/wrong color <strong>2 times</strong> (was 0 previously), including misidentifying navy blue as light blue (image 081).</li>
</ul>
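<p>Across both models, these overcorrections (5 extra "maroon", 2 extra "light blue") are outnumbered by the 16 maroon and light blue misses that the constrained prompt recovered, so this is a smaller problem than the one it replaced, but worth noting.</p>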
<h3>5. The capstone prompt did not improve results</h3>
<div class="callout callout-red">
<strong>Capstone Prompt: No Benefit</strong>
The capstone prompt did not meaningfully improve on the original prompt: Qwen's results were flat to marginally better, and Gemini's were slightly worse. Its emphasis on precision over recall ("do not guess") lowered Gemini's detection rate without improving its color accuracy.
</div>
<ul>
<li>Qwen: 78.2% recall (unchanged), 87.3% precision (up from 85.9%)</li>
<li>Gemini: 78.2% recall (down from 79.7%), 88.7% precision (down from 91.4%)</li>
</ul>
<h3>6. Gemini speed improvement from concurrency</h3>
<p>The concurrent processing optimization (8 workers + session reuse + JPEG quality 85) delivered major speed gains:</p>
<table>
<thead>
<tr>
<th>Prompt</th>
<th>Previous Sequential Run</th>
<th>Current Concurrent Run</th>
</tr>
</thead>
<tbody>
<tr>
<td>Original</td>
<td>2134s (13.3s avg)</td>
<td class="best">253s (1.6s avg)</td>
</tr>
<tr>
<td>Capstone</td>
<td>1882s (11.7s avg)</td>
<td class="best">260s (1.6s avg)</td>
</tr>
<tr>
<td>Constrained</td>
<td>n/a</td>
<td>344s (2.1s avg)</td>
</tr>
</tbody>
</table>
<p>That's roughly a <strong>7x to 8x speedup</strong> for the first two prompts (2134s → 253s and 1882s → 260s). The constrained prompt run was slower (344s), likely because of its longer prompt text (2223 chars vs ~1500 chars).</p>
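<p>The worker-pool pattern behind this speedup can be sketched with the standard library. This is an illustration under stated assumptions, not the project's script: <code>classify_image</code> is a hypothetical stand-in for one Gemini request, and session reuse and JPEG re-encoding are left out to keep the example self-contained.</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 8  # worker count quoted in the report


def classify_image(path: str) -> dict:
    """Hypothetical stand-in for one API request (network-bound in practice)."""
    time.sleep(0.01)  # simulate request latency
    return {"image": path, "jerseys": []}


def run_concurrently(paths: list[str]) -> list[dict]:
    # Threads suit I/O-bound API calls; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(classify_image, paths))


paths = [f"{i:03d}.jpg" for i in range(1, 162)]  # 161 placeholder file names
results = run_concurrently(paths)
print(len(results))  # 161

# Speedups implied by the timings reported in this section.
print(round(2134 / 253, 1))  # 8.4
print(round(1882 / 260, 1))  # 7.2
```

<p>Because the work is dominated by waiting on the network, threads are enough here; a process pool would add overhead without a clear benefit.</p>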
<hr>
<h2>Persistently Failed Images</h2>
<p>These <strong>10 images</strong> failed across all six runs, representing the hardest cases for current VLMs regardless of model or prompt:</p>
<table>
<thead>
<tr>
<th>Image</th>
<th>GT Colors</th>
<th>Typical Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>016 - maroon.jpg</td>
<td><span class="swatch" style="background:#800000"></span>maroon</td>
<td class="worst">Not detected or called "red"</td>
</tr>
<tr>
<td>034 - light blue.jpg</td>
<td><span class="swatch" style="background:#87ceeb"></span>light blue</td>
<td class="worst">Called "blue"</td>
</tr>
<tr>
<td>046 - green.jpg</td>
<td><span class="swatch" style="background:#388e3c"></span>green</td>
<td class="worst">Called "black"</td>
</tr>
<tr>
<td>053 - black_white.jpg</td>
<td><span class="swatch" style="background:#222"></span>black</td>
<td class="worst">Not detected</td>
</tr>
<tr>
<td>077 - teal_white.jpg</td>
<td><span class="swatch" style="background:#008080"></span>teal</td>
<td class="worst">Called "green"</td>
</tr>
<tr>
<td>132 - brown_white.jpg</td>
<td><span class="swatch" style="background:#795548"></span>brown</td>
<td class="worst">Called "orange"</td>
</tr>
<tr>
<td>134 - teal_white.jpg</td>
<td><span class="swatch" style="background:#008080"></span>teal</td>
<td class="worst">Called "blue" or "light blue"</td>
</tr>
<tr>
<td>138 - maroon.jpg</td>
<td><span class="swatch" style="background:#800000"></span>maroon</td>
<td class="worst">Called "red"</td>
</tr>
<tr>
<td>150 - green_gray.jpg</td>
<td><span class="swatch" style="background:#388e3c"></span>green, <span class="swatch" style="background:#9e9e9e"></span>gray</td>
<td class="worst">Called "black"</td>
</tr>
<tr>
<td>160 - blue_white.jpg</td>
<td><span class="swatch" style="background:#2196f3"></span>blue</td>
<td class="worst">Not detected</td>
</tr>
</tbody>
</table>
<p>Notable improvements: Images <strong>029</strong> (maroon), <strong>087/141/161</strong> (light blue), and <strong>099</strong> (maroon) were previously persistent failures but were <strong>fixed by the constrained prompt</strong> for at least one model.</p>
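<p>A persistent failure is an image that appears in the FAIL list of every run, which is a set intersection. The sketch below uses made-up placeholder image IDs and run names purely to show the operation; it does not reproduce the actual result files.</p>

```python
# Placeholder FAIL sets per run; real runs would load these from result files.
failed_by_run = {
    "qwen_original": {"img_a", "img_b", "img_c", "img_d"},
    "qwen_constrained": {"img_a", "img_c"},
    "gemini_original": {"img_a", "img_b", "img_c"},
    "gemini_constrained": {"img_a", "img_c", "img_e"},
}

# An image is a persistent failure only if it is in every run's FAIL set.
persistent = set.intersection(*failed_by_run.values())
print(sorted(persistent))  # ['img_a', 'img_c']

# Images that failed in some run but passed in at least one other.
fixed_somewhere = set.union(*failed_by_run.values()) - persistent
print(sorted(fixed_somewhere))  # ['img_b', 'img_d', 'img_e']
```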
<hr>
<h2>Model Comparison</h2>
<div class="model-cards">
<div class="model-card gemini">
<h3>Gemini 3 Flash</h3>
<ul>
<li><strong>Best at:</strong> Precision (94.3% with constrained prompt), fewest hallucinated colors</li>
<li><strong>Weakness:</strong> Lower exact recall than Qwen; still uses shade variants even with constraints</li>
<li><strong>Speed:</strong> ~250 to 340s with 8 concurrent workers</li>
</ul>
</div>
<div class="model-card qwen">
<h3>Qwen3-VL-8B</h3>
<ul>
<li><strong>Best at:</strong> Recall (82.7% with constrained prompt), highest PASS count (127)</li>
<li><strong>Weakness:</strong> Higher false positive rate; introduced "maroon" overcorrection with constrained prompt</li>
<li><strong>Speed:</strong> ~1440 to 1600s sequential (local GPU inference)</li>
</ul>
</div>
</div>
<hr>
<h2>Recommendations</h2>
<ol class="recs">
<li><strong>Use the constrained prompt</strong> (<code>jersey_prompt_constrained.txt</code>) — it is the clear winner for both models, improving recall and precision simultaneously.</li>
<li><strong>Post-processing normalization</strong> could still recover additional matches: map <code>grey</code> → <code>gray</code> (catches any remaining Gemini outputs) and <code>navy</code> → <code>navy blue</code> (catches shorthand usage).</li>
<li><strong>Consider a brown/maroon calibration</strong> — the constrained prompt overcorrected on Qwen, turning brown→maroon confusion into a new error source. Adding "Use 'brown' for warm, non-reddish dark colors" or similar guidance may help.</li>
<li><strong>Gray and black detection remain unsolved</strong> at the prompt level. Miss counts for both colors barely moved across prompts, which suggests image quality or model perception limits that prompt engineering alone is unlikely to fix. These colors may benefit from a secondary computer vision pass (e.g., dominant color extraction from the jersey region).</li>
<li><strong>Retire the capstone prompt</strong>: it offered little or no benefit over the original and trailed the constrained prompt on every accuracy metric.</li>
</ol>
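<p>The normalization step from recommendation 2 could be as small as an alias table applied before scoring. A minimal sketch, assuming model outputs arrive as plain strings; <code>ALIASES</code> and <code>normalize_color</code> are illustrative names, and only the two mappings named above are included:</p>

```python
# Spelling and shorthand variants mapped onto the ground-truth vocabulary.
ALIASES = {
    "grey": "gray",
    "navy": "navy blue",
}


def normalize_color(raw: str) -> str:
    """Lowercase, collapse whitespace, and apply the alias table."""
    color = " ".join(raw.lower().split())
    return ALIASES.get(color, color)


print(normalize_color("Grey"))       # gray
print(normalize_color(" navy "))     # navy blue
print(normalize_color("navy blue"))  # navy blue (already canonical)
```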
<hr>
<h2>Appendix: Color Similarity Families Used for Scoring</h2>
<table>
<thead>
<tr>
<th>Family</th>
<th>Member Colors</th>
</tr>
</thead>
<tbody>
<tr><td><span class="swatch" style="background:#2196f3"></span>blue</td><td>blue, dark blue, navy blue, navy, royal blue</td></tr>
<tr><td><span class="swatch" style="background:#87ceeb"></span>light_blue</td><td>light blue, sky blue, baby blue, carolina blue, powder blue</td></tr>
<tr><td><span class="swatch" style="background:#f44336"></span>red</td><td>red, scarlet, crimson</td></tr>
<tr><td><span class="swatch" style="background:#800000"></span>dark_red</td><td>maroon, burgundy, dark red, wine</td></tr>
<tr><td><span class="swatch" style="background:#388e3c"></span>green</td><td>green, dark green, forest green, kelly green</td></tr>
<tr><td><span class="swatch" style="background:#fdd835"></span>yellow</td><td>yellow, gold, golden</td></tr>
<tr><td><span class="swatch" style="background:#ff9800"></span>orange</td><td>orange, burnt orange</td></tr>
<tr><td><span class="swatch" style="background:#795548"></span>brown</td><td>brown, dark brown</td></tr>
<tr><td><span class="swatch" style="background:#9c27b0"></span>purple</td><td>purple, violet</td></tr>
<tr><td><span class="swatch" style="background:#9e9e9e"></span>gray</td><td>gray, grey, silver, charcoal</td></tr>
<tr><td><span class="swatch" style="background:#222"></span>black</td><td>black</td></tr>
<tr><td><span class="swatch" style="background:#008080"></span>teal</td><td>teal, turquoise, cyan, aqua</td></tr>
<tr><td><span class="swatch" style="background:#e91e63"></span>pink</td><td>pink, magenta, hot pink, rose</td></tr>
</tbody>
</table>
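<p>One way these families could drive the exact/similar/miss distinction is sketched below. The family table is copied from this appendix; the matching rule itself (same string is exact, same family is similar, anything else is a miss) is an assumption about how the scoring works, and <code>match_type</code> is a hypothetical name.</p>

```python
FAMILIES = {
    "blue": ["blue", "dark blue", "navy blue", "navy", "royal blue"],
    "light_blue": ["light blue", "sky blue", "baby blue", "carolina blue", "powder blue"],
    "red": ["red", "scarlet", "crimson"],
    "dark_red": ["maroon", "burgundy", "dark red", "wine"],
    "green": ["green", "dark green", "forest green", "kelly green"],
    "yellow": ["yellow", "gold", "golden"],
    "orange": ["orange", "burnt orange"],
    "brown": ["brown", "dark brown"],
    "purple": ["purple", "violet"],
    "gray": ["gray", "grey", "silver", "charcoal"],
    "black": ["black"],
    "teal": ["teal", "turquoise", "cyan", "aqua"],
    "pink": ["pink", "magenta", "hot pink", "rose"],
}

# Reverse lookup: color name to the family it belongs to.
COLOR_TO_FAMILY = {
    color: family for family, members in FAMILIES.items() for color in members
}


def match_type(ground_truth: str, predicted: str) -> str:
    """Classify one prediction against one ground-truth color."""
    if predicted == ground_truth:
        return "exact"
    gt_family = COLOR_TO_FAMILY.get(ground_truth)
    if gt_family is not None and gt_family == COLOR_TO_FAMILY.get(predicted):
        return "similar"
    return "miss"


print(match_type("gray", "grey"))        # similar
print(match_type("maroon", "red"))       # miss (dark_red vs red)
print(match_type("light blue", "blue"))  # miss (light_blue vs blue)
print(match_type("teal", "teal"))        # exact
```

<p>Under this rule, "maroon" reported as "red" and "light blue" reported as "blue" both count as misses, which matches how those confusions are treated elsewhere in this report.</p>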
<hr>
<h2>Appendix: Constrained Prompt (<code>jersey_prompt_constrained.txt</code>)</h2>
<pre><code>You are an expert at detecting sports jerseys in images. Carefully examine the provided image and identify all visible sports jerseys.
CRITICAL INSTRUCTIONS:
1. ONLY detect jerseys that are CLEARLY VISIBLE in the image
2. ONLY include jersey numbers that you can ACTUALLY READ in the image
3. If you CANNOT see any jerseys, you MUST return {"jerseys": []}
4. DO NOT make up, imagine, or guess jersey numbers that aren't visible
5. DO NOT include jerseys if you cannot clearly see the number
COLOR VOCABULARY:
For "jersey_color" and "number_color", you MUST choose from this list ONLY:
red, blue, dark blue, navy blue, light blue, green, yellow, gold, orange, purple, black, white, gray, brown, dark brown, maroon, teal, pink
Important color distinctions:
- Use "maroon" for dark brownish-red, NOT "red"
- Use "light blue" for pale or sky blue, NOT "blue"
- Use "navy blue" for very dark blue, NOT "blue" or "dark blue"
- Use "teal" for blue-green, NOT "green" or "blue"
- Use "gray" (not "grey") for silver or neutral tones
- Use "dark brown" for very dark brown, NOT "black"
- Use "gold" for metallic or deep yellow, NOT "yellow"
RESPONSE FORMAT:
Respond ONLY with a valid JSON object. No explanations, no markdown, no extra text.
Use DOUBLE QUOTES (") for all JSON keys and string values.
The JSON must have a single key "jerseys" with an array of dictionaries.
Each dictionary must have exactly these three keys:
- "jersey_number": The number on the jersey (as a string, only if clearly visible)
- "jersey_color": The primary color of the jersey (MUST be from the color list above)
- "number_color": The color of the number on the jersey (MUST be from the color list above)
Example response for an image WITH visible jerseys:
{
"jerseys": [
{
"jersey_number": "10",
"jersey_color": "maroon",
"number_color": "gold"
},
{
"jersey_number": "42",
"jersey_color": "light blue",
"number_color": "white"
}
]
}
Example response for an image WITHOUT jerseys or with unclear numbers:
{"jerseys": []}
REMEMBER: Only include jerseys with numbers you can ACTUALLY SEE in the image. When in doubt, return empty array.
Now analyze the image and return the JSON object.</code></pre>
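<p>Since the prompt fixes both the JSON shape and the color list, a response can be checked mechanically before scoring. The following is a sketch under stated assumptions rather than the project's parser: <code>parse_response</code> is a hypothetical helper that silently drops anything that violates the format described in the prompt.</p>

```python
import json

# The 18 colors listed in the prompt's COLOR VOCABULARY section.
ALLOWED_COLORS = {
    "red", "blue", "dark blue", "navy blue", "light blue", "green", "yellow",
    "gold", "orange", "purple", "black", "white", "gray", "brown",
    "dark brown", "maroon", "teal", "pink",
}
REQUIRED_KEYS = {"jersey_number", "jersey_color", "number_color"}


def is_allowed(value: object) -> bool:
    """True when the value is a string from the constrained vocabulary."""
    return isinstance(value, str) and value in ALLOWED_COLORS


def parse_response(text: str) -> list[dict]:
    """Return the jerseys that satisfy the prompt's format; drop the rest."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    jerseys = data.get("jerseys")
    if not isinstance(jerseys, list):
        return []
    valid = []
    for item in jerseys:
        # Each entry must be a dict with exactly the three required keys.
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            continue
        if is_allowed(item["jersey_color"]) and is_allowed(item["number_color"]):
            valid.append(item)
    return valid


reply = (
    '{"jerseys": ['
    '{"jersey_number": "10", "jersey_color": "maroon", "number_color": "gold"},'
    '{"jersey_number": "7", "jersey_color": "grey", "number_color": "white"}'
    ']}'
)
print(len(parse_response(reply)))  # 1 ("grey" is outside the vocabulary)
print(parse_response("not json"))  # []
```

<p>In practice it may be preferable to normalize out-of-vocabulary colors rather than discard them; dropping them here just keeps the example short.</p>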
</body>
</html>