# Jersey Color Detection Accuracy Analysis

## Test Configuration

  - Models tested: Gemini 3 Flash Preview (cloud API), Qwen3-VL-8B (local, via llama.cpp)
  - Prompts tested: jersey_prompt.txt (original), jersey_prompt_capstone.txt (capstone)
  - Test images: 161 annotated basketball jersey images
  - Ground truth colors: 202 (excluding white)
  - Images resized to max 768px wide before submission (see the sketch below)
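
For reference, the resize step can be reproduced in a few lines of Pillow. This is a minimal sketch, assuming Pillow is installed; the helper name is illustrative, not the actual test script's code:

```python
from PIL import Image

def resize_for_submission(path: str, max_width: int = 768) -> Image.Image:
    """Downscale an image to at most max_width pixels wide, preserving aspect ratio."""
    img = Image.open(path)
    if img.width > max_width:
        new_height = round(img.height * max_width / img.width)
        img = img.resize((max_width, new_height), Image.LANCZOS)
    return img
```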

## Summary Comparison

| Metric | Gemini + Original | Gemini + Capstone | Qwen + Original | Qwen + Capstone |
|---|---|---|---|---|
| Recall (exact) | 64.4% | 60.9% | 64.4% | 65.8% |
| Recall (exact+similar) | 81.2% | 78.2% | 77.2% | 77.7% |
| Recall (missed) | 18.8% | 21.8% | 22.8% | 22.3% |
| Precision (exact) | 74.7% | 70.7% | 70.7% | 73.9% |
| Precision (exact+sim.) | 93.7% | 90.2% | 84.8% | 87.2% |
| Extra/wrong | 6.3% | 9.8% | 15.2% | 12.8% |
| PASS images | 124 | 118 | 117 | 119 |
| PARTIAL images | 19 | 21 | 18 | 19 |
| FAIL images | 18 | 22 | 26 | 23 |
| Avg time per image | 13.3s | 11.7s | 9.5s | 8.9s |
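
To make these metrics concrete, here is a minimal per-image scoring sketch, assuming ground truth and detections are sets of lowercase color strings; the function name, return format, and aggregation scheme are illustrative assumptions, not the test framework's actual API:

```python
from typing import Callable

def score_image(ground_truth: set[str], detected: set[str],
                is_similar: Callable[[str, str], bool]) -> dict[str, int]:
    """Tally exact, similar, missed, and extra colors for one image."""
    exact = ground_truth & detected
    # A remaining ground-truth color counts as a similar match if any
    # unmatched detection falls in the same color family (see the appendix).
    similar = {g for g in ground_truth - exact
               if any(is_similar(g, d) for d in detected - exact)}
    missed = ground_truth - exact - similar
    # An extra/wrong color is a detection matching no ground-truth color,
    # exactly or by family.
    extra = {d for d in detected - exact
             if not any(is_similar(d, g) for g in ground_truth)}
    return {"exact": len(exact), "similar": len(similar),
            "missed": len(missed), "extra": len(extra)}
```

Under this scheme, recall would divide the summed exact (and exact+similar) counts by the 202 ground-truth colors, and precision would divide by the total number of colors the models reported.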

## Key Takeaways

  1. Gemini + original prompt is the best combination across all major metrics: highest recall (81.2%), highest precision (93.7%), fewest failures (18), and fewest extra/wrong colors (6.3%).

  2. Exact recall is remarkably stable across all four runs (60.9%–65.8%), suggesting ~35% of ground truth colors are inherently difficult for current VLMs regardless of model or prompt.

  3. Gemini produces far fewer hallucinated colors than Qwen. Gemini's extra/wrong rate is 6.3%–9.8% vs. Qwen's 12.8%–15.2%. When Gemini detects a color, it is almost always correct.

  4. The capstone prompt did not improve results for either model. For Gemini it degraded both recall and precision. For Qwen the difference was negligible.

  5. Qwen is ~30% faster (8.9–9.5s vs. 11.7–13.3s per image) but at the cost of lower accuracy and more false positives.


## Color-Level Analysis

### Most Problematic Ground Truth Colors

Colors most frequently missed across all four test runs:

| Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Total Misses | Common Confusion |
|---|---|---|---|---|---|---|
| gray | 7 | 6 | 7 | 9 | 29 | Often returned as "grey" (similar match) or missed entirely |
| maroon | 5 | 9 | 8 | 7 | 29 | Frequently confused with "red" |
| black | 7 | 7 | 6 | 6 | 26 | Often not detected at all |
| light blue | 2 | 2 | 8 | 5 | 17 | Returned as "blue" (Qwen especially) |
| green | 3 | 4 | 3 | 4 | 14 | Sometimes returned as "black" |
| dark brown | 0 | 1 | 4 | 4 | 9 | Returned as "black" or "brown" |
| brown | 1 | 1 | 3 | 3 | 8 | Returned as "black" or "orange" |
| teal | 2 | 2 | 2 | 2 | 8 | Confused with "green" or "blue" |
| blue | 3 | 3 | 3 | 2 | 11 | Sometimes not detected at all |
| gold/yellow | 2 | 2 | 1 | 1 | 6 | Occasionally missed entirely |

### Most Common Extra/Wrong Colors Reported

| Extra Color | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap | Notes |
|---|---|---|---|---|---|
| red | 3 | 7 | 7 | 6 | Typically a misread of maroon |
| black | 2 | 4 | 7 | 7 | Misread of dark brown/green/gray |
| blue | 3 | 2 | 10 | 6 | Misread of light blue or teal |
| green | 1 | 1 | 1 | 1 | Misread of teal |
| orange | 1 | 1 | 1 | 1 | Misread of brown |

### Similar-Match Confusion Patterns

These are cases where the VLM returned a color in the same family but not the exact ground truth term:

| Expected | Returned As | Gemini+Orig | Gemini+Cap | Qwen+Orig | Qwen+Cap |
|---|---|---|---|---|---|
| gray | grey | 9 | 10 | | |
| navy blue | blue | 7 | 6 | 8 | 8 |
| dark blue | blue | 5 | 6 | 10 | 9 |
| dark brown | brown | 5 | 5 | 2 | 2 |
| gold | yellow | 3 | 2 | 5 | 3 |
| dark blue | navy blue/navy | 4 | 4 | 1 | |

Observations:

  - gray/grey is purely a spelling variant: Gemini consistently uses British spelling, while Qwen uses "gray", so this never triggers for Qwen.
  - navy blue → blue and dark blue → blue are the most common simplifications. Both models tend to drop shade qualifiers.
  - dark brown → brown follows the same pattern of dropping the shade qualifier.
  - gold → yellow is a genuine color perception difference where models see yellow-dominant gold jerseys.

### Persistently Failed Images

These 11 images failed across all four test runs, representing the hardest cases:

| Image | GT Colors | Typical VLM Response | Failure Pattern |
|---|---|---|---|
| 016 - maroon.jpg | maroon | (none) or red | Maroon not recognized |
| 029 - maroon_white.jpg | maroon | red | Maroon → red confusion |
| 034 - light blue.jpg | light blue | blue | Shade qualifier dropped |
| 046 - green.jpg | green | black | Dark green misread as black |
| 053 - black_white.jpg | black | (not detected) | Black jerseys missed |
| 057 - gold or yellow.jpg | gold\|yellow | (not detected) | Gold/yellow missed |
| 132 - brown_white.jpg | brown | orange | Brown → orange confusion |
| 134 - teal_white.jpg | teal | blue or green | Teal not in model vocabulary |
| 138 - maroon.jpg | maroon | red | Maroon → red confusion |
| 150 - green_gray.jpg | green, gray | black | Both colors misread |
| 160 - blue_white.jpg | blue | (not detected) | Blue not detected |

### Root Cause Categories

  1. Maroon blindness (3 images): Both models consistently classify maroon as red. This is the single largest systematic error.

  2. Dark color confusion (3 images): Dark green, brown, and black are frequently confused with each other, especially in low-contrast or shadowed images.

  3. Shade qualifier loss (2 images): "Light blue" and "teal" are simplified to "blue" or "green" — models use a coarser color vocabulary than the ground truth.

  4. Non-detection (3 images): Some jerseys are simply not detected at all, likely due to occlusion, unusual angles, or low image quality.


## Model-Specific Observations

### Gemini 3 Flash

  - Strengths: Highest precision (93.7%), very few hallucinated colors, good at similar-family matching. Never produced gibberish color names.
  - Weaknesses: Consistently uses British "grey" instead of "gray". Slower than the local model.
  - Prompt sensitivity: The capstone prompt slightly hurt performance (81.2% → 78.2% recall), suggesting the original, simpler prompt works better.

### Qwen3-VL-8B

  - Strengths: Faster inference (8.9s avg). Slightly higher exact match rate with capstone prompt (65.8%).
  - Weaknesses: Much higher false positive rate (12.8–15.2% extra/wrong). Struggles significantly with "light blue" (8 misses with original prompt). Produced one gibberish color ("redolas"). Over-reports "blue" and "black".
  - Prompt sensitivity: Minimal difference between prompts. Capstone prompt slightly reduced errors.

## Recommendations

  1. Normalize "grey" → "gray" in post-processing to eliminate the most common similar-match gap for Gemini.

  2. Add "maroon" to the prompt as an explicit color option or example, since both models struggle to distinguish it from red without guidance.

  3. Consider a constrained color vocabulary in the prompt (e.g., "Choose from: red, blue, green, yellow, orange, purple, black, gray, brown, maroon, teal, light blue, navy blue, gold, pink") to reduce vocabulary mismatch and shade-qualifier drift.

  4. Post-processing color mapping could recover many similar-match cases automatically: navy → navy blue, grey → gray, dark blue → navy blue, etc. (see the sketch after this list).

  5. The original jersey_prompt.txt is the better prompt — the capstone prompt's additional constraints did not improve accuracy for either model.
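
A minimal sketch of the post-processing pass suggested in recommendations 1 and 4. The mapping entries are drawn from the confusion patterns above, and both the table and the function name are illustrative assumptions, not the test framework's actual code:

```python
# Illustrative normalization map; extend as more variants are observed.
CANONICAL = {
    "grey": "gray",
    "navy": "navy blue",
    "dark blue": "navy blue",
    "golden": "gold",
}

def normalize_colors(detected: list[str]) -> list[str]:
    """Map spelling variants and shade synonyms to canonical terms, de-duplicating."""
    seen: set[str] = set()
    result: list[str] = []
    for color in detected:
        key = color.strip().lower()
        canonical = CANONICAL.get(key, key)
        if canonical not in seen:  # avoid double-counting after mapping
            seen.add(canonical)
            result.append(canonical)
    return result

# e.g. normalize_colors(["Grey", "navy"]) -> ["gray", "navy blue"]
```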


## Appendix: Color Similarity Families

The following color families were used for "similar match" scoring. Two colors count as a similar match if they appear in the same family:

| Family | Member Colors |
|---|---|
| blue | blue, dark blue, navy blue, navy, royal blue |
| light_blue | light blue, sky blue, baby blue, carolina blue, powder blue |
| red | red, scarlet, crimson |
| dark_red | maroon, burgundy, dark red, wine |
| green | green, dark green, forest green, kelly green |
| yellow | yellow, gold, golden |
| orange | orange, burnt orange |
| brown | brown, dark brown |
| purple | purple, violet |
| gray | gray, grey, silver, charcoal |
| black | black |
| teal | teal, turquoise, cyan, aqua |
| pink | pink, magenta, hot pink, rose |

Note: Colors in different families are never counted as similar, even if perceptually close (e.g., maroon and red are in separate families; brown and orange are in separate families). This is intentional — the similar-match metric captures vocabulary variation within the same color concept, not genuine color misidentification.
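
For completeness, the similar-match test this table implies can be expressed directly. The family sets below are transcribed from the table; the function name is illustrative, not the framework's actual API:

```python
# Family table transcribed from the appendix above.
COLOR_FAMILIES = {
    "blue": {"blue", "dark blue", "navy blue", "navy", "royal blue"},
    "light_blue": {"light blue", "sky blue", "baby blue", "carolina blue", "powder blue"},
    "red": {"red", "scarlet", "crimson"},
    "dark_red": {"maroon", "burgundy", "dark red", "wine"},
    "green": {"green", "dark green", "forest green", "kelly green"},
    "yellow": {"yellow", "gold", "golden"},
    "orange": {"orange", "burnt orange"},
    "brown": {"brown", "dark brown"},
    "purple": {"purple", "violet"},
    "gray": {"gray", "grey", "silver", "charcoal"},
    "black": {"black"},
    "teal": {"teal", "turquoise", "cyan", "aqua"},
    "pink": {"pink", "magenta", "hot pink", "rose"},
}

def is_similar(a: str, b: str) -> bool:
    """True if two distinct color terms fall in the same family."""
    if a == b:
        return False  # an exact match is scored separately
    return any(a in members and b in members
               for members in COLOR_FAMILIES.values())
```

Consistent with the note above, `is_similar("navy blue", "blue")` is True, while `is_similar("maroon", "red")` is False because the terms sit in separate families.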