jersey_test/COLOR_TEST_REPORT.md
Rick McEwen 435033ea07 Add color variety and hex specificity test scripts with report
- test_color_variety.py: named-color test for local llama.cpp VLM
- test_color_variety_gemini.py: named-color test for Gemini 3 Flash API
- test_hex_color_specificity.py: hex color specificity test for Gemini
- test_hex_color_specificity_llama.py: hex color specificity test for local VLM
- jersey_prompt_hex_color.txt: prompt requesting hex color codes
- COLOR_TEST_REPORT.md: analysis report comparing 3 models across 5 tests
- color_test_results.md: raw test output from all runs
2026-02-24 11:30:41 -07:00


Jersey Color Detection - VLM Comparison Report

Date: 2026-02-24

Test set: 161 basketball images (basketball_jersery_color_test_files/)

Overview

Five tests were run to evaluate how vision-language models describe jersey colors:

| Test | Model | Images | Prompt | Purpose |
|---|---|---|---|---|
| 1 | Qwen2.5-VL-7B (local, llama.cpp) | 161 | Named colors | Baseline color vocabulary |
| 2 | Gemini 3 Flash (cloud API) | 161 | Named colors | Cloud model color vocabulary |
| 3 | Qwen3-VL-8B (local, llama.cpp) | 161 | Named colors | Newer local model color vocabulary |
| 4 | Gemini 3 Flash (cloud API) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |
| 5 | Qwen3-VL-8B (local, llama.cpp) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |

Named Color Vocabulary (Tests 1-3)

Detection Volume

| Metric | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|---|
| Jerseys detected | 369 | 453 | 444 |
| Errors | 0 | 0 | 1 |
| Avg time/image | 14.9s | 15.9s | 17.0s |
| Unique jersey colors | 15 | 19 | 15 |
| Unique number colors | 11 | 15 | 13 |
| Combined palette size | 15 | 19 | 17 |

Gemini detected the most jerseys (453) and used the broadest color vocabulary (19 terms). Qwen3-VL-8B detected nearly as many jerseys (444) as Gemini but with a vocabulary closer to the older Qwen2.5 model.
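The detection counts can be turned into per-image averages and a relative gain with a few lines of arithmetic. The sketch below uses only figures from the table above; the function names are illustrative and not taken from the test scripts.

```python
# Figures copied from the Detection Volume table; helper names are hypothetical.
IMAGES = 161
jerseys = {"Qwen2.5-VL-7B": 369, "Gemini 3 Flash": 453, "Qwen3-VL-8B": 444}


def per_image(count, images=IMAGES):
    """Average number of jerseys detected per image, rounded to 2 decimals."""
    return round(count / images, 2)


def pct_increase(new, old):
    """Relative increase of `new` over `old`, in percent (1 decimal)."""
    return round(100 * (new - old) / old, 1)


averages = {model: per_image(count) for model, count in jerseys.items()}
qwen_gain = pct_increase(jerseys["Qwen3-VL-8B"], jerseys["Qwen2.5-VL-7B"])

print(averages)   # per-image averages for each model
print(qwen_gain)  # Qwen3-VL-8B gain over Qwen2.5-VL-7B, in percent
```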

Jersey Color Distribution

| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B | Notes |
|---|---|---|---|---|
| white | 84 (22.8%) | 125 (27.6%) | 120 (27.0%) | Top color for all three |
| blue | 60 (16.3%) | 43 (9.5%) | 69 (15.5%) | Both Qwen models lump blues |
| green | 48 (13.0%) | 60 (13.2%) | 53 (11.9%) | Consistent across models |
| black | 31 (8.4%) | 21 (4.6%) | 33 (7.4%) | |
| purple | 25 (6.8%) | 28 (6.2%) | 30 (6.8%) | Consistent |
| red | 27 (7.3%) | 22 (4.9%) | 28 (6.3%) | |
| orange | 24 (6.5%) | 27 (6.0%) | 27 (6.1%) | Very consistent |
| yellow | 27 (7.3%) | 24 (5.3%) | 26 (5.9%) | |
| maroon | 14 (3.8%) | 23 (5.1%) | 15 (3.4%) | Gemini uses maroon more |
| light blue | 6 (1.6%) | 22 (4.9%) | 13 (2.9%) | Gemini distinguishes light blue most |
| gray/grey | 9 (2.4%) | 12 (2.6%) | 10 (2.3%) | |
| brown | 6 (1.6%) | 13 (2.9%) | 9 (2.0%) | |
| teal | 4 (1.1%) | 7 (1.5%) | 7 (1.6%) | |
| pink | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
| gold | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
| navy blue | -- | 11 (2.4%) | -- | Gemini-only |
| dark blue | -- | 9 (2.0%) | -- | Gemini-only |
| dark brown | -- | 1 (0.2%) | -- | Gemini-only |
| navy | -- | 1 (0.2%) | -- | Gemini-only |

Number Color Distribution

| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|---|
| white | 195 (52.8%) | 183 (40.4%) | 184 (41.4%) |
| black | 60 (16.3%) | 40 (8.8%) | 44 (9.9%) |
| yellow | 39 (10.6%) | 58 (12.8%) | 32 (7.2%) |
| red | 30 (8.1%) | 44 (9.7%) | 41 (9.2%) |
| blue | 23 (6.2%) | 39 (8.6%) | 39 (8.8%) |
| orange | 8 (2.2%) | 21 (4.6%) | 29 (6.5%) |
| gold | -- | 5 (1.1%) | 21 (4.7%) |
| dark blue | -- | 14 (3.1%) | 9 (2.0%) |
| maroon | 2 (0.5%) | 14 (3.1%) | 12 (2.7%) |
| green | 3 (0.8%) | 13 (2.9%) | 14 (3.2%) |
| purple | 4 (1.1%) | 11 (2.4%) | 11 (2.5%) |
| pink | 3 (0.8%) | 6 (1.3%) | 6 (1.4%) |
| brown | 2 (0.5%) | 2 (0.4%) | -- |
| grey | -- | 2 (0.4%) | -- |
| navy blue | -- | 1 (0.2%) | -- |
| silver | -- | -- | 2 (0.5%) |

Key Differences in Named Color Mode

  1. Gemini has the richest vocabulary. It uses 19 distinct jersey color terms vs 15 for both Qwen models. The four extras are three blue-shade variants (navy blue, dark blue, navy) and dark brown.

  2. Both Qwen models lump blues together. Qwen2.5-VL-7B reports 60 "blue" jerseys and Qwen3-VL-8B reports 69, with "light blue" (6 and 13) as their only other blue term. Gemini splits its blues into blue (43), light blue (22), navy blue (11), dark blue (9), and navy (1), for 86 blue-family detections at much finer granularity.

  3. Qwen3-VL-8B is a modest upgrade over Qwen2.5-VL-7B. It detects 20% more jerseys (444 vs 369) and uses the same 15 jersey color terms with a slightly more balanced distribution. Its number-color palette adds "gold", "dark blue", and "silver" and drops "brown"; of these, "dark blue" and "silver" are terms Qwen2.5 never used at all.

  4. Gemini detects the most jerseys overall. 453 vs 444 (Qwen3) vs 369 (Qwen2.5). The two newer models are close, while Qwen2.5 lags behind.

  5. All three models are dominated by basic colors. White, blue, green, and black together account for 55% to 62% of each model's jersey detections. None spontaneously uses precise shade names like "crimson", "cobalt", or "forest green".

  6. Qwen3-VL-8B favors "gold" for number colors. It reported gold 21 times for number colors vs Gemini's 5 and Qwen2.5's 0. This may reflect team-specific coloring (e.g., Lakers gold numbers).
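The blue-family total and the basic-color share cited in the list above can be recomputed from the jersey color table. This is a minimal sketch using only counts from that table; the grouping into a "blue family" follows the five terms named in item 2, and the dictionary layout and function names are illustrative.

```python
# Counts copied from the Jersey Color Distribution table (only the rows needed here).
counts = {
    "Qwen2.5-VL-7B": {"white": 84, "blue": 60, "green": 48, "black": 31,
                      "light blue": 6},
    "Gemini 3 Flash": {"white": 125, "blue": 43, "green": 60, "black": 21,
                       "light blue": 22, "navy blue": 11, "dark blue": 9, "navy": 1},
    "Qwen3-VL-8B": {"white": 120, "blue": 69, "green": 53, "black": 33,
                    "light blue": 13},
}
totals = {"Qwen2.5-VL-7B": 369, "Gemini 3 Flash": 453, "Qwen3-VL-8B": 444}

BLUE_FAMILY = ("blue", "light blue", "navy blue", "dark blue", "navy")
BASIC = ("white", "blue", "green", "black")


def family_total(model, family):
    """Sum the detections for every term in `family` that the model used."""
    return sum(counts[model].get(term, 0) for term in family)


def share(model, family):
    """Percentage of the model's jersey detections covered by `family`."""
    return round(100 * family_total(model, family) / totals[model], 1)


blue_totals = {m: family_total(m, BLUE_FAMILY) for m in counts}
basic_share = {m: share(m, BASIC) for m in counts}
print(blue_totals)
print(basic_share)
```

Because the 86 figure for Gemini includes light blue, the Qwen totals computed here include light blue as well, which makes the three numbers directly comparable.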


Hex Color Specificity (Tests 4-5)

Both tests used the same 20 random images (seed=42) and evaluated jersey colors only (number colors excluded since they are usually primary colors like white or black).
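The report does not show how the test scripts drew the 20 images, so the following is only one plausible way to get the same subset for both models: sort the file names, then sample with a private, seeded random generator. The file names and the function name are hypothetical.

```python
import random


def pick_subset(filenames, k=20, seed=42):
    """Return a reproducible sample of k file names.

    Sorting first makes the result independent of directory listing order;
    a private Random instance leaves the global RNG state untouched.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(filenames), k)


# Hypothetical file names standing in for the 161-image test set.
files = [f"img_{i:03d}.jpg" for i in range(161)]

subset_a = pick_subset(files)
subset_b = pick_subset(list(reversed(files)))  # same files, different input order
print(subset_a == subset_b)
```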

Summary

| Metric | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|
| Images tested | 20 | 20 |
| Total jerseys | 56 | 59 |
| Jersey color values | 56 | 59 |
| Valid hex codes | 56/56 (100%) | 59/59 (100%) |
| Unique hex values | 24 | 21 |
| Specific (distinct shade) | 40 (71.4%) | 37 (62.7%) |
| Generic (near primary) | 16 (28.6%) | 22 (37.3%) |
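The report does not say which rule the scripts used to decide that a value is a valid hex code. A minimal sketch, assuming "valid" means a leading `#` followed by exactly six hexadecimal digits, could look like this; both function names are illustrative.

```python
import re

# Assumption: a valid code is '#' plus exactly six hex digits (e.g. "#004B23").
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def is_valid_hex(value):
    """True if `value` is a string of the form #RRGGBB (surrounding spaces ignored)."""
    return isinstance(value, str) and HEX_RE.fullmatch(value.strip()) is not None


def hex_to_rgb(value):
    """Convert a #RRGGBB string to an (r, g, b) tuple of ints."""
    digits = value.strip().lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(is_valid_hex("#004B23"), hex_to_rgb("#004B23"))
```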

Distance from Nearest Primary Color

| Stat | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|
| Min | 0.0 | 0.0 |
| Avg | 44.5 | 34.5 |
| Max | 111.0 | 110.7 |

(Scale: 0 = exact primary match. 20 = generic threshold. Higher = more specific.)
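The report does not state how the distance is computed. As a sketch, Euclidean distance in RGB space against the reference values below (inferred here, matching common CSS named colors) reproduces the d values listed for the entries checked, such as #004B23 (63.5) and #B22234 (18.2). The palette, the strict `<` comparison at the threshold, and the function name are all assumptions.

```python
import math

# Assumed reference palette; names follow the "near ..." labels used in the report.
REFERENCE = {
    "white": (255, 255, 255), "black": (0, 0, 0), "red": (255, 0, 0),
    "green (dark)": (0, 128, 0), "navy": (0, 0, 128), "purple": (128, 0, 128),
    "maroon": (128, 0, 0), "silver": (192, 192, 192), "brown": (165, 42, 42),
    "gold": (255, 215, 0), "pink": (255, 192, 203), "orange": (255, 165, 0),
    "yellow": (255, 255, 0),
}
GENERIC_THRESHOLD = 20.0  # from the scale note: 20 = generic threshold


def classify(hex_code):
    """Return (label, nearest reference name, distance) for a #RRGGBB code."""
    rgb = tuple(int(hex_code[i:i + 2], 16) for i in (1, 3, 5))
    # Nearest reference color by straight-line distance in RGB space.
    name, ref = min(REFERENCE.items(), key=lambda kv: math.dist(rgb, kv[1]))
    distance = round(math.dist(rgb, ref), 1)
    label = "GENERIC" if distance < GENERIC_THRESHOLD else "specific"
    return label, name, distance


for code in ("#004B23", "#B22234", "#FFCD00", "#FFFFFF"):
    print(code, classify(code))
```

Plain RGB distance is not perceptually uniform, so two shades that look equally different to a viewer can score quite differently on this scale.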

Gemini 3 Flash - Unique Hex Values (24)

| Hex | RGB | Count | Classification |
|---|---|---|---|
| #004B23 | (0, 75, 35) | x7 | specific, near green (dark), d=63.5 |
| #1A2344 | (26, 35, 68) | x2 | specific, near navy, d=74.2 |
| #1E4BA1 | (30, 75, 161) | x1 | specific, near navy, d=87.3 |
| #2B231D | (43, 35, 29) | x1 | specific, near black, d=62.6 |
| #3D2B1F | (61, 43, 31) | x1 | specific, near black, d=80.8 |
| #461D7C | (70, 29, 124) | x1 | specific, near purple, d=65.0 |
| #4B2E83 | (75, 46, 131) | x5 | specific, near purple, d=70.2 |
| #701112 | (112, 17, 18) | x1 | specific, near maroon, d=29.5 |
| #7BAFD4 | (123, 175, 212) | x3 | specific, near silver, d=73.8 |
| #990000 | (153, 0, 0) | x2 | specific, near maroon, d=25.0 |
| #A9A9A9 | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
| #C41230 | (196, 18, 48) | x1 | specific, near brown, d=39.7 |
| #D11111 | (209, 17, 17) | x2 | specific, near red, d=51.9 |
| #D32F2F | (211, 47, 47) | x2 | specific, near brown, d=46.5 |
| #E31837 | (227, 24, 55) | x1 | specific, near brown, d=65.9 |
| #E31B23 | (227, 27, 35) | x1 | specific, near red, d=52.3 |
| #E3242B | (227, 36, 43) | x2 | specific, near brown, d=62.3 |
| #E6E600 | (230, 230, 0) | x1 | specific, near gold, d=29.2 |
| #E8E8E8 | (232, 232, 232) | x1 | specific, near white, d=39.8 |
| #E91E63 | (233, 30, 99) | x1 | specific, near brown, d=89.5 |
| #F06292 | (240, 98, 146) | x2 | specific, near pink, d=111.0 |
| #F57C00 | (245, 124, 0) | x1 | specific, near orange, d=42.2 |
| #FFCD00 | (255, 205, 0) | x1 | GENERIC, near gold, d=10.0 |
| #FFFFFF | (255, 255, 255) | x15 | GENERIC, near white, d=0.0 |

Qwen3-VL-8B - Unique Hex Values (21)

| Hex | RGB | Count | Classification |
|---|---|---|---|
| #000000 | (0, 0, 0) | x1 | GENERIC, near black, d=0.0 |
| #006400 | (0, 100, 0) | x10 | specific, near green (dark), d=28.0 |
| #191970 | (25, 25, 112) | x1 | specific, near navy, d=38.8 |
| #19418A | (25, 65, 138) | x1 | specific, near navy, d=70.4 |
| #3D2B21 | (61, 43, 33) | x2 | specific, near black, d=81.6 |
| #66B2FF | (102, 178, 255) | x3 | specific, near silver, d=110.7 |
| #6A0DAD | (106, 13, 173) | x6 | specific, near purple, d=51.7 |
| #8B0000 | (139, 0, 0) | x1 | GENERIC, near maroon, d=11.0 |
| #A9A9A9 | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
| #B22234 | (178, 34, 52) | x2 | GENERIC, near brown, d=18.2 |
| #D32F2F | (211, 47, 47) | x3 | specific, near brown, d=46.5 |
| #D60000 | (214, 0, 0) | x3 | specific, near red, d=41.0 |
| #DC143C | (220, 20, 60) | x2 | specific, near brown, d=61.9 |
| #F5F5DC | (245, 245, 220) | x2 | specific, near white, d=37.7 |
| #F5F5F5 | (245, 245, 245) | x1 | GENERIC, near white, d=17.3 |
| #FF0000 | (255, 0, 0) | x1 | GENERIC, near red, d=0.0 |
| #FF6347 | (255, 99, 71) | x1 | specific, near orange, d=96.9 |
| #FF69B4 | (255, 105, 180) | x2 | specific, near pink, d=90.0 |
| #FFD700 | (255, 215, 0) | x1 | GENERIC, near gold, d=0.0 |
| #FFFF00 | (255, 255, 0) | x1 | GENERIC, near yellow, d=0.0 |
| #FFFFFF | (255, 255, 255) | x14 | GENERIC, near white, d=0.0 |

Notable Findings

  • Both models can produce valid hex codes. 100% of returned values were valid hex in both cases.

  • Gemini is more specific overall. 71.4% of its jersey hex codes were distinct shades vs 62.7% for Qwen3. Gemini also produced more unique hex values (24 vs 21) and had a higher average distance from primaries (44.5 vs 34.5).

  • Gemini uses more varied shades of each color family. For red-family jerseys, Gemini returned 8 distinct hex values (#701112, #990000, #C41230, #D11111, #D32F2F, #E31837, #E31B23, #E3242B), all classified as specific. Qwen3 returned 6 (#8B0000, #B22234, #D32F2F, #D60000, #DC143C, #FF0000), three of which are classified as generic (#8B0000, #B22234, #FF0000), including one exact primary (#FF0000).

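  • Qwen3 reuses hex values somewhat more heavily. #006400 (dark green) appeared 10 times and #FFFFFF 14 times, so two values account for 41% of its 59 results. Gemini's two most repeated values, #FFFFFF (x15) and #004B23 (x7), account for 39% of its 56 results, with #FFFFFF alone at 27%. The clearer difference is in the tail: Gemini spreads the rest over 22 other values, Qwen3 over 19.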

  • White dominates both models. #FFFFFF was the single most common value for both (Gemini: x15, Qwen3: x14), which is consistent with white being the top named jersey color for all three models in Tests 1-3.

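  • Both models share some exact hex codes. Besides #FFFFFF, #A9A9A9 (dark silver/gray) and #D32F2F (medium red) appeared in both models' outputs. The dark browns are near-identical but not equal: Gemini returned #3D2B1F and Qwen3 returned #3D2B21, which differ by 2 in the blue channel. This suggests some convergence on certain color estimations.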
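How concentrated each model's output is can be checked by taking the share of the top N values. The sketch below uses the highest counts from the two hex tables and the jersey totals from the summary; the function name is illustrative.

```python
from collections import Counter

# Highest counts copied from the two hex tables; totals from the summary table.
gemini = Counter({"#FFFFFF": 15, "#004B23": 7, "#4B2E83": 5, "#7BAFD4": 3})
qwen3 = Counter({"#FFFFFF": 14, "#006400": 10, "#6A0DAD": 6, "#66B2FF": 3})
TOTALS = {"gemini": 56, "qwen3": 59}


def top_share(counts, total, n=2):
    """Percentage of all results covered by the n most frequent hex values."""
    covered = sum(count for _, count in counts.most_common(n))
    return round(100 * covered / total, 1)


gemini_top1 = top_share(gemini, TOTALS["gemini"], n=1)
gemini_top2 = top_share(gemini, TOTALS["gemini"], n=2)
qwen3_top2 = top_share(qwen3, TOTALS["qwen3"], n=2)
print(gemini_top1, gemini_top2, qwen3_top2)
```

Comparing the same N for both models keeps the comparison like for like; a single-value share for one model and a two-value share for the other are not directly comparable.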


Conclusions

  1. For basic color categorization, all three models work. If you only need to distinguish "white vs dark vs colored" jerseys, any will do. Gemini offers slightly finer granularity with its blue-shade vocabulary (navy blue, dark blue, navy).

  2. Gemini detects the most jerseys per image (2.81 avg), followed closely by Qwen3-VL-8B (2.76 avg), with Qwen2.5-VL-7B trailing (2.29 avg).

  3. Qwen3-VL-8B is a solid upgrade over Qwen2.5-VL-7B for detection volume (20% more jerseys) while maintaining the same jersey color vocabulary. It runs locally without cloud API costs, making it a good default choice.

  4. Hex color prompting works for jersey body colors. Both models return specific hex shades the majority of the time (Gemini 71%, Qwen3 63%). Gemini produces more varied and specific shades, while Qwen3 tends to reuse a smaller set of hex values.

  5. Neither model is a reliable colorimeter. The hex values should be treated as rough shade estimates, not pixel-accurate measurements. For precise color matching, traditional computer vision (e.g., sampling pixels from the detected jersey region) would be more reliable.

  6. Recommendation: Use named-color prompts for general jersey classification. Reserve hex-color prompts for use cases where distinguishing similar shades matters (e.g., telling apart two teams that both wear "blue"). Gemini gives the best hex specificity but requires a cloud API; Qwen3-VL-8B is a capable local alternative.
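For the pixel-sampling alternative mentioned in conclusion 5, one simple option is a per-channel median over pixels taken from the jersey region. This is only a sketch: how the region is detected and how pixels are read from the image are outside its scope, and the function name is hypothetical.

```python
from statistics import median


def median_hex(pixels):
    """Estimate a region's color as the per-channel median, formatted as #RRGGBB.

    `pixels` is an iterable of (r, g, b) integer tuples sampled from the
    jersey region. The median is less sensitive than the mean to outliers
    such as number or trim pixels inside the region.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("no pixels sampled")
    # zip(*pixels) regroups the samples into one sequence per channel.
    r, g, b = (int(median(channel)) for channel in zip(*pixels))
    return f"#{r:02X}{g:02X}{b:02X}"


# Two dark green samples and one near-white outlier (e.g. part of a number).
sample = [(0, 100, 0), (2, 104, 4), (250, 250, 250)]
print(median_hex(sample))
```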