# Jersey Color Detection Accuracy Analysis

## Test Configuration

  - Models tested: Gemini 3 Flash Preview (cloud API), Qwen3-VL-8B (local, via llama.cpp)
  - Prompts tested: jersey_prompt.txt (original), jersey_prompt_capstone.txt (capstone)
  - Test images: 161 annotated basketball jersey images
  - Ground truth colors: 202 (excluding white)
  - Images resized to max 768px wide before submission (see the sketch below)
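
For reference, the resize step can be reproduced in a few lines of Pillow. This is a minimal sketch, assuming Pillow is installed; the helper name is illustrative, not the actual test script's code:

```python
from PIL import Image

def resize_for_submission(path: str, max_width: int = 768) -> Image.Image:
    """Downscale an image to at most max_width pixels wide, preserving aspect ratio."""
    img = Image.open(path)
    if img.width > max_width:
        new_height = round(img.height * max_width / img.width)
        img = img.resize((max_width, new_height), Image.LANCZOS)
    return img
```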

## Summary Comparison

| Metric | Gemini + Original | Gemini + Capstone | Qwen + Original | Qwen + Capstone |
|---|---|---|---|---|
| Recall (exact) | 64.4% | 60.9% | 64.4% | 65.8% |
| Recall (exact+similar) | 81.2% | 78.2% | 77.2% | 77.7% |
| Recall (missed) | 18.8% | 21.8% | 22.8% | 22.3% |
| Precision (exact) | 74.7% | 70.7% | 70.7% | 73.9% |
| Precision (exact+sim.) | 93.7% | 90.2% | 84.8% | 87.2% |
| Extra/wrong | 6.3% | 9.8% | 15.2% | 12.8% |
| PASS images | 124 | 118 | 117 | 119 |
| PARTIAL images | 19 | 21 | 18 | 19 |
| FAIL images | 18 | 22 | 26 | 23 |
| Avg time per image | 13.3s | 11.7s | 9.5s | 8.9s |
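
To make these metrics concrete, here is a minimal per-image scoring sketch, assuming ground truth and detections are sets of lowercase color strings; the function name, return format, and aggregation scheme are illustrative assumptions, not the test framework's actual API:

```python
from typing import Callable

def score_image(ground_truth: set[str], detected: set[str],
                is_similar: Callable[[str, str], bool]) -> dict[str, int]:
    """Tally exact, similar, missed, and extra colors for one image."""
    exact = ground_truth & detected
    # A remaining ground-truth color counts as a similar match if any
    # unmatched detection falls in the same color family (see the appendix).
    similar = {g for g in ground_truth - exact
               if any(is_similar(g, d) for d in detected - exact)}
    missed = ground_truth - exact - similar
    # An extra/wrong color is a detection matching no ground-truth color,
    # exactly or by family.
    extra = {d for d in detected - exact
             if not any(is_similar(d, g) for g in ground_truth)}
    return {"exact": len(exact), "similar": len(similar),
            "missed": len(missed), "extra": len(extra)}
```

Under this scheme, recall would divide the summed exact (and exact+similar) counts by the 202 ground-truth colors, and precision would divide by the total number of colors the models reported.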

## Key Takeaways

  1. Gemini + original prompt is the best combination across all major metrics: highest recall (81.2%), highest precision (93.7%), fewest failures (18), and fewest extra/wrong colors (6.3%).

  2. Exact recall is remarkably stable across all four runs (60.9%–65.8%), suggesting ~35% of ground truth colors are inherently difficult for current VLMs regardless of model or prompt.

  3. Gemini produces far fewer hallucinated colors than Qwen. Gemini's extra/wrong rate is 6.3%–9.8% vs. Qwen's 12.8%–15.2%. When Gemini detects a color, it is almost always correct.

  4. The capstone prompt did not improve results for either model. For Gemini it degraded both recall and precision. For Qwen the difference was negligible.

  5. Qwen is ~30% faster (8.9–9.5s vs. 11.7–13.3s per image) but at the cost of lower accuracy and more false positives.


## Color-Level Analysis

### Most Problematic Ground Truth Colors

Colors most frequently missed across all four test runs:

| Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Total Misses | Common Confusion |
|---|---|---|---|---|---|---|
| gray | 7 | 6 | 7 | 9 | 29 | Often returned as "grey" (similar match) or missed entirely |
| maroon | 5 | 9 | 8 | 7 | 29 | Frequently confused with "red" |
| black | 7 | 7 | 6 | 6 | 26 | Often not detected at all |
| light blue | 2 | 2 | 8 | 5 | 17 | Returned as "blue" (Qwen especially) |
| green | 3 | 4 | 3 | 4 | 14 | Sometimes returned as "black" |
| dark brown | 0 | 1 | 4 | 4 | 9 | Returned as "black" or "brown" |
| brown | 1 | 1 | 3 | 3 | 8 | Returned as "black" or "orange" |
| teal | 2 | 2 | 2 | 2 | 8 | Confused with "green" or "blue" |
| blue | 3 | 3 | 3 | 2 | 11 | Sometimes not detected at all |
| gold/yellow | 2 | 2 | 1 | 1 | 6 | Occasionally missed entirely |

### Most Common Extra/Wrong Colors Reported

| Extra Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Notes |
|---|---|---|---|---|---|
| red | 3 | 7 | 7 | 6 | Typically a misread of maroon |
| black | 2 | 4 | 7 | 7 | Misread of dark brown/green/gray |
| blue | 3 | 2 | 10 | 6 | Misread of light blue or teal |
| green | 1 | 1 | 1 | 1 | Misread of teal |
| orange | 1 | 1 | 1 | 1 | Misread of brown |

### Similar-Match Confusion Patterns

These are cases where the VLM returned a color in the same family but not the exact ground truth term:

| Expected | Returned As | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap |
|---|---|---|---|---|---|
| gray | grey | 9 | 10 | | |
| navy blue | blue | 7 | 6 | 8 | 8 |
| dark blue | blue | 5 | 6 | 10 | 9 |
| dark brown | brown | 5 | 5 | 2 | 2 |
| gold | yellow | 3 | 2 | 5 | 3 |
| dark blue | navy blue/navy | 4 | 4 | 1 | |

Observations:

  - gray/grey is purely a spelling variant: Gemini consistently uses British spelling, while Qwen uses "gray", so this never triggers for Qwen.
  - navy blue → blue and dark blue → blue are the most common simplifications. Both models tend to drop shade qualifiers.
  - dark brown → brown follows the same pattern of dropping the shade qualifier.
  - gold → yellow is a genuine color perception difference where models see yellow-dominant gold jerseys.

### Persistently Failed Images

These 11 images failed across all four test runs, representing the hardest cases:

| Image | GT Colors | Typical VLM Response | Failure Pattern |
|---|---|---|---|
| 016 - maroon.jpg | maroon | (none) or red | Maroon not recognized |
| 029 - maroon_white.jpg | maroon | red | Maroon → red confusion |
| 034 - light blue.jpg | light blue | blue | Shade qualifier dropped |
| 046 - green.jpg | green | black | Dark green misread as black |
| 053 - black_white.jpg | black | (not detected) | Black jerseys missed |
| 057 - gold or yellow.jpg | gold\|yellow | (not detected) | Gold/yellow missed |
| 132 - brown_white.jpg | brown | orange | Brown → orange confusion |
| 134 - teal_white.jpg | teal | blue or green | Teal not in model vocabulary |
| 138 - maroon.jpg | maroon | red | Maroon → red confusion |
| 150 - green_gray.jpg | green, gray | black | Both colors misread |
| 160 - blue_white.jpg | blue | (not detected) | Blue not detected |

### Root Cause Categories

  1. Maroon blindness (3 images): Both models consistently classify maroon as red. This is the single largest systematic error.

  2. Dark color confusion (3 images): Dark green, brown, and black are frequently confused with each other, especially in low-contrast or shadowed images.

  3. Shade qualifier loss (2 images): "Light blue" and "teal" are simplified to "blue" or "green" — models use a coarser color vocabulary than the ground truth.

  4. Non-detection (3 images): Some jerseys are simply not detected at all, likely due to occlusion, unusual angles, or low image quality.


## Model-Specific Observations

### Gemini 3 Flash

  - Strengths: Highest precision (93.7%), very few hallucinated colors, good at similar-family matching. Never produced gibberish color names.
  - Weaknesses: Consistently uses British "grey" instead of "gray". Slower than the local model.
  - Prompt sensitivity: The capstone prompt slightly hurt performance (81.2% → 78.2% recall), suggesting the original, simpler prompt works better.

### Qwen3-VL-8B

  - Strengths: Faster inference (8.9s avg). Slightly higher exact match rate with capstone prompt (65.8%).
  - Weaknesses: Much higher false positive rate (12.8–15.2% extra/wrong). Struggles significantly with "light blue" (8 misses with original prompt). Produced one gibberish color ("redolas"). Over-reports "blue" and "black".
  - Prompt sensitivity: Minimal difference between prompts. Capstone prompt slightly reduced errors.

## Recommendations

  1. Normalize "grey" → "gray" in post-processing to eliminate the most common similar-match gap for Gemini.

  2. Add "maroon" to the prompt as an explicit color option or example, since both models struggle to distinguish it from red without guidance.

  3. Consider a constrained color vocabulary in the prompt (e.g., "Choose from: red, blue, green, yellow, orange, purple, black, gray, brown, maroon, teal, light blue, navy blue, gold, pink") to reduce vocabulary mismatch and shade-qualifier drift.

  4. Post-processing color mapping could recover many similar-match cases automatically: navy → navy blue, grey → gray, dark blue → navy blue, etc. (see the sketch after this list).

  5. The original jersey_prompt.txt is the better prompt — the capstone prompt's additional constraints did not improve accuracy for either model.
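
A minimal sketch of the post-processing pass suggested in recommendations 1 and 4. The mapping entries are drawn from the confusion patterns above, and both the table and the function name are illustrative assumptions, not the test framework's actual code:

```python
# Illustrative normalization map; extend as more variants are observed.
CANONICAL = {
    "grey": "gray",
    "navy": "navy blue",
    "dark blue": "navy blue",
    "golden": "gold",
}

def normalize_colors(detected: list[str]) -> list[str]:
    """Map spelling variants and shade synonyms to canonical terms, de-duplicating."""
    seen: set[str] = set()
    result: list[str] = []
    for color in detected:
        key = color.strip().lower()
        canonical = CANONICAL.get(key, key)
        if canonical not in seen:  # avoid double-counting after mapping
            seen.add(canonical)
            result.append(canonical)
    return result

# e.g. normalize_colors(["Grey", "navy"]) -> ["gray", "navy blue"]
```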


## Appendix: Color Similarity Families

The following color families were used for "similar match" scoring. Two colors count as a similar match if they appear in the same family:

| Family | Member Colors |
|---|---|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |

Note: Colors in different families are never counted as similar, even if perceptually close (e.g., maroon and red are in separate families; brown and orange are in separate families). This is intentional — the similar-match metric captures vocabulary variation within the same color concept, not genuine color misidentification.
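
For completeness, the similar-match test this table implies can be expressed directly. The family sets below are transcribed from the table; the function name is illustrative, not the framework's actual API:

```python
# Family table transcribed from the appendix above.
COLOR_FAMILIES = {
    "blue": {"blue", "dark blue", "navy blue", "navy", "royal blue"},
    "light_blue": {"light blue", "sky blue", "baby blue", "carolina blue", "powder blue"},
    "red": {"red", "scarlet", "crimson"},
    "dark_red": {"maroon", "burgundy", "dark red", "wine"},
    "green": {"green", "dark green", "forest green", "kelly green"},
    "yellow": {"yellow", "gold", "golden"},
    "orange": {"orange", "burnt orange"},
    "brown": {"brown", "dark brown"},
    "purple": {"purple", "violet"},
    "gray": {"gray", "grey", "silver", "charcoal"},
    "black": {"black"},
    "teal": {"teal", "turquoise", "cyan", "aqua"},
    "pink": {"pink", "magenta", "hot pink", "rose"},
}

def is_similar(a: str, b: str) -> bool:
    """True if two distinct color terms fall in the same family."""
    if a == b:
        return False  # an exact match is scored separately
    return any(a in members and b in members
               for members in COLOR_FAMILIES.values())
```

Consistent with the note above, `is_similar("navy blue", "blue")` is True, while `is_similar("maroon", "red")` is False because the terms sit in separate families.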