
Rick McEwen 435033ea07 Add color variety and hex specificity test scripts with report

- test_color_variety.py: named-color test for local llama.cpp VLM
- test_color_variety_gemini.py: named-color test for Gemini 3 Flash API
- test_hex_color_specificity.py: hex color specificity test for Gemini
- test_hex_color_specificity_llama.py: hex color specificity test for local VLM
- jersey_prompt_hex_color.txt: prompt requesting hex color codes
- COLOR_TEST_REPORT.md: analysis report comparing 3 models across 5 tests
- color_test_results.md: raw test output from all runs

2026-02-24 11:30:41 -07:00

12 KiB


Jersey Color Detection - VLM Comparison Report

Date: 2026-02-24
Test set: 161 basketball images (basketball_jersery_color_test_files/)

Overview

Five tests were run to evaluate how vision-language models describe jersey colors:

Test	Model	Images	Prompt	Purpose
1	Qwen2.5-VL-7B (local, llama.cpp)	161	Named colors	Baseline color vocabulary
2	Gemini 3 Flash (cloud API)	161	Named colors	Cloud model color vocabulary
3	Qwen3-VL-8B (local, llama.cpp)	161	Named colors	Newer local model color vocabulary
4	Gemini 3 Flash (cloud API)	20 (random, seed=42)	Hex codes (jersey only)	Hex color specificity
5	Qwen3-VL-8B (local, llama.cpp)	20 (random, seed=42)	Hex codes (jersey only)	Hex color specificity

Named Color Vocabulary (Tests 1-3)

Detection Volume

Metric	Qwen2.5-VL-7B	Gemini 3 Flash	Qwen3-VL-8B
Jerseys detected	369	453	444
Errors	0	0	1
Avg time/image	14.9s	15.9s	17.0s
Unique jersey colors	15	19	15
Unique number colors	11	15	13
Combined palette size	15	19	17

Gemini detected the most jerseys (453) and used the broadest color vocabulary (19 terms). Qwen3-VL-8B detected nearly as many jerseys (444) as Gemini but with a vocabulary closer to the older Qwen2.5 model.
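
The vocabulary metrics in the table above can be recomputed from per-jersey detections. The following is a minimal sketch, not the test scripts' actual code: the record shape (a list of `(jersey_color, number_color)` pairs) and the function name `vocabulary_stats` are assumptions made for illustration. It treats "combined palette size" as the union of jersey and number color terms, which is consistent with the 15 / 13 / 17 figures reported for Qwen3-VL-8B.

```python
from collections import Counter


def vocabulary_stats(detections):
    """Summarize the color vocabulary used by one model.

    `detections` is a list of (jersey_color, number_color) string pairs,
    one per detected jersey. This record shape is hypothetical.
    """
    # Normalize case and whitespace so "Blue" and "blue " count as one term.
    jersey = Counter(j.strip().lower() for j, _ in detections)
    number = Counter(n.strip().lower() for _, n in detections)
    return {
        "jerseys_detected": len(detections),
        "unique_jersey_colors": len(jersey),
        "unique_number_colors": len(number),
        # Union of both vocabularies: a term used for both jerseys and
        # numbers is counted once.
        "combined_palette_size": len(set(jersey) | set(number)),
    }


# Toy input, not data from the report.
sample = [("white", "black"), ("Blue", "white"), ("white", "gold"), ("green", "white")]
stats = vocabulary_stats(sample)
print(stats)
```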

Jersey Color Distribution

Color	Qwen2.5-VL-7B	Gemini 3 Flash	Qwen3-VL-8B	Notes
white	84 (22.8%)	125 (27.6%)	120 (27.0%)	Top color for all three
blue	60 (16.3%)	43 (9.5%)	69 (15.5%)	Both Qwen models lump blues
green	48 (13.0%)	60 (13.2%)	53 (11.9%)	Consistent across models
black	31 (8.4%)	21 (4.6%)	33 (7.4%)
purple	25 (6.8%)	28 (6.2%)	30 (6.8%)	Consistent
red	27 (7.3%)	22 (4.9%)	28 (6.3%)
orange	24 (6.5%)	27 (6.0%)	27 (6.1%)	Very consistent
yellow	27 (7.3%)	24 (5.3%)	26 (5.9%)
maroon	14 (3.8%)	23 (5.1%)	15 (3.4%)	Gemini uses maroon more
light blue	6 (1.6%)	22 (4.9%)	13 (2.9%)	Gemini distinguishes light blue most
gray/grey	9 (2.4%)	12 (2.6%)	10 (2.3%)
brown	6 (1.6%)	13 (2.9%)	9 (2.0%)
teal	4 (1.1%)	7 (1.5%)	7 (1.6%)
pink	2 (0.5%)	2 (0.4%)	2 (0.5%)
gold	2 (0.5%)	2 (0.4%)	2 (0.5%)
navy blue	--	11 (2.4%)	--	Gemini-only
dark blue	--	9 (2.0%)	--	Gemini-only
dark brown	--	1 (0.2%)	--	Gemini-only
navy	--	1 (0.2%)	--	Gemini-only

Number Color Distribution

Color	Qwen2.5-VL-7B	Gemini 3 Flash	Qwen3-VL-8B
white	195 (52.8%)	183 (40.4%)	184 (41.4%)
black	60 (16.3%)	40 (8.8%)	44 (9.9%)
yellow	39 (10.6%)	58 (12.8%)	32 (7.2%)
red	30 (8.1%)	44 (9.7%)	41 (9.2%)
blue	23 (6.2%)	39 (8.6%)	39 (8.8%)
orange	8 (2.2%)	21 (4.6%)	29 (6.5%)
gold	--	5 (1.1%)	21 (4.7%)
dark blue	--	14 (3.1%)	9 (2.0%)
maroon	2 (0.5%)	14 (3.1%)	12 (2.7%)
green	3 (0.8%)	13 (2.9%)	14 (3.2%)
purple	4 (1.1%)	11 (2.4%)	11 (2.5%)
pink	3 (0.8%)	6 (1.3%)	6 (1.4%)
brown	2 (0.5%)	2 (0.4%)	--
grey	--	2 (0.4%)	--
navy blue	--	1 (0.2%)	--
silver	--	--	2 (0.5%)

Key Differences in Named Color Mode

Gemini has the richest vocabulary. It uses 19 distinct jersey color terms vs 15 for both Qwen models. The four extras are three blue-shade variants (navy blue, dark blue, navy) plus dark brown.
Both Qwen models lump blues together. Qwen2.5-VL-7B reports 60 "blue" jerseys and Qwen3-VL-8B reports 69, with only 6 and 13 "light blue" respectively. Gemini splits its blues into blue (43), light blue (22), navy blue (11), dark blue (9), and navy (1), for 86 blue-family detections with much finer granularity.
Qwen3-VL-8B is a modest upgrade over Qwen2.5-VL-7B. It detects 20% more jerseys (444 vs 369) and uses the same 15 jersey color terms with a slightly more balanced distribution. The vocabulary change is in number colors, where it adds "gold", "dark blue", and "silver" and drops "brown" (13 terms vs 11).
Gemini detects the most jerseys overall. 453 vs 444 (Qwen3) vs 369 (Qwen2.5). The two newer models are close, while Qwen2.5 lags behind.
All three models are dominated by basic colors. White, blue, green, and black together account for roughly 55% to 62% of jersey detections in each model. None spontaneously uses precise shade names like "crimson", "cobalt", or "forest green".
Qwen3-VL-8B favors "gold" for number colors. It reported gold 21 times for number colors vs Gemini's 5 and Qwen2.5's 0. This may reflect team-specific coloring (e.g., Lakers gold numbers).
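
To compare the models on blue-family jerseys like for like, the named terms can be folded into one family before counting. This is a small sketch using the counts from the Jersey Color Distribution table; the `BLUE_FAMILY` grouping (which leaves out teal) and the helper `family_total` are assumptions made for illustration, not part of the test scripts.

```python
# Terms treated as "blue family". Excluding teal is a judgment call.
BLUE_FAMILY = {"blue", "light blue", "navy blue", "dark blue", "navy"}

# Jersey counts copied from the Jersey Color Distribution table.
jersey_counts = {
    "Qwen2.5-VL-7B": {"blue": 60, "light blue": 6},
    "Gemini 3 Flash": {"blue": 43, "light blue": 22, "navy blue": 11,
                       "dark blue": 9, "navy": 1},
    "Qwen3-VL-8B": {"blue": 69, "light blue": 13},
}


def family_total(counts, family):
    """Sum the counts of every color term that belongs to `family`."""
    return sum(n for color, n in counts.items() if color in family)


blue_totals = {model: family_total(counts, BLUE_FAMILY)
               for model, counts in jersey_counts.items()}
print(blue_totals)
```

Grouped this way, the three models report 66, 86, and 82 blue-family jerseys, so the difference lies mostly in how finely each model names its blues.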

Hex Color Specificity (Tests 4-5)

Both tests used the same 20 random images (seed=42) and evaluated jersey colors only (number colors excluded since they are usually primary colors like white or black).
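
A seeded sample like this is typically drawn as sketched below. The report does not show how the scripts select images, so the function name `pick_test_images` and the choice to sort filenames before sampling are assumptions. Sorting first makes the selection independent of directory listing order.

```python
import random


def pick_test_images(filenames, k=20, seed=42):
    """Return a reproducible random sample of k filenames."""
    rng = random.Random(seed)  # private generator, leaves global state alone
    return rng.sample(sorted(filenames), k)


# Hypothetical filenames standing in for the 161-image test set.
all_images = [f"img_{i:03d}.jpg" for i in range(161)]
subset = pick_test_images(all_images)
print(len(subset))
```

Two scripts reproduce the same 20 images only if they use the same seed, the same sampling call, and the same input ordering.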

Summary

Metric	Gemini 3 Flash	Qwen3-VL-8B
Images tested	20	20
Total jerseys	56	59
Jersey color values	56	59
Valid hex codes	56/56 (100%)	59/59 (100%)
Unique hex values	24	21
Specific (distinct shade)	40 (71.4%)	37 (62.7%)
Generic (near primary)	16 (28.6%)	22 (37.3%)
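
The "Valid hex codes" row implies a format check on each returned value. The following is a minimal sketch of such a check, assuming that only the six-digit `#RRGGBB` form counts as valid; the report does not say whether the scripts also accept three-digit shorthand, and `parse_hex` is a hypothetical helper name.

```python
import re

# Six hex digits after a leading '#'. Shorthand like '#FFF' is rejected
# in this sketch.
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def parse_hex(value):
    """Return (r, g, b) for a '#RRGGBB' string, or None if it is not valid."""
    value = value.strip()
    if not HEX_RE.fullmatch(value):
        return None
    return tuple(int(value[i:i + 2], 16) for i in (1, 3, 5))


print(parse_hex("#004B23"))
print(parse_hex("dark green"))
```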

Distance from Nearest Primary Color

Stat	Gemini 3 Flash	Qwen3-VL-8B
Min	0.0	0.0
Avg	44.5	34.5
Max	111.0	110.7

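(Scale: 0 = exact primary match. 20 = threshold separating generic from specific. Higher = more specific. The reported values are consistent with Euclidean distance in 8-bit RGB space.)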
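
The classification can be sketched as a nearest-reference-color lookup. The report does not list the reference palette or the exact threshold comparison, so both are assumptions here. The `PRIMARIES` values below are CSS-style named colors chosen because they reproduce the distances in the tables that follow, and a strict `<` against `GENERIC_THRESHOLD` is a guess at the boundary behavior. The names `PRIMARIES`, `GENERIC_THRESHOLD`, and `classify` are hypothetical.

```python
import math

# Reference palette inferred from the reported distances. The palette the
# test scripts actually use may contain more entries.
PRIMARIES = {
    "white": (255, 255, 255),
    "black": (0, 0, 0),
    "red": (255, 0, 0),
    "green (dark)": (0, 128, 0),
    "navy": (0, 0, 128),
    "maroon": (128, 0, 0),
    "purple": (128, 0, 128),
    "silver": (192, 192, 192),
    "brown": (165, 42, 42),
    "gold": (255, 215, 0),
    "yellow": (255, 255, 0),
    "orange": (255, 165, 0),
    "pink": (255, 192, 203),
}
GENERIC_THRESHOLD = 20.0  # from the scale note; '<' vs '<=' is assumed


def classify(hex_code):
    """Return (label, nearest_name, distance) for a '#RRGGBB' string."""
    rgb = tuple(int(hex_code[i:i + 2], 16) for i in (1, 3, 5))
    # Nearest reference color by Euclidean distance in RGB space.
    name, ref = min(PRIMARIES.items(), key=lambda item: math.dist(rgb, item[1]))
    d = math.dist(rgb, ref)
    label = "GENERIC" if d < GENERIC_THRESHOLD else "specific"
    return label, name, round(d, 1)


print(classify("#F5F5F5"))
print(classify("#006400"))
```

Plain RGB distance is not perceptually uniform, which may explain why several saturated reds in the tables land nearest to brown rather than red.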

Gemini 3 Flash - Unique Hex Values (24)

Hex	RGB	Count	Classification
`#004B23`	(0, 75, 35)	x7	specific, near green (dark), d=63.5
`#1A2344`	(26, 35, 68)	x2	specific, near navy, d=74.2
`#1E4BA1`	(30, 75, 161)	x1	specific, near navy, d=87.3
`#2B231D`	(43, 35, 29)	x1	specific, near black, d=62.6
`#3D2B1F`	(61, 43, 31)	x1	specific, near black, d=80.8
`#461D7C`	(70, 29, 124)	x1	specific, near purple, d=65.0
`#4B2E83`	(75, 46, 131)	x5	specific, near purple, d=70.2
`#701112`	(112, 17, 18)	x1	specific, near maroon, d=29.5
`#7BAFD4`	(123, 175, 212)	x3	specific, near silver, d=73.8
`#990000`	(153, 0, 0)	x2	specific, near maroon, d=25.0
`#A9A9A9`	(169, 169, 169)	x1	specific, near silver, d=39.8
`#C41230`	(196, 18, 48)	x1	specific, near brown, d=39.7
`#D11111`	(209, 17, 17)	x2	specific, near red, d=51.9
`#D32F2F`	(211, 47, 47)	x2	specific, near brown, d=46.5
`#E31837`	(227, 24, 55)	x1	specific, near brown, d=65.9
`#E31B23`	(227, 27, 35)	x1	specific, near red, d=52.3
`#E3242B`	(227, 36, 43)	x2	specific, near brown, d=62.3
`#E6E600`	(230, 230, 0)	x1	specific, near gold, d=29.2
`#E8E8E8`	(232, 232, 232)	x1	specific, near white, d=39.8
`#E91E63`	(233, 30, 99)	x1	specific, near brown, d=89.5
`#F06292`	(240, 98, 146)	x2	specific, near pink, d=111.0
`#F57C00`	(245, 124, 0)	x1	specific, near orange, d=42.2
`#FFCD00`	(255, 205, 0)	x1	GENERIC, near gold, d=10.0
`#FFFFFF`	(255, 255, 255)	x15	GENERIC, near white, d=0.0

Qwen3-VL-8B - Unique Hex Values (21)

Hex	RGB	Count	Classification
`#000000`	(0, 0, 0)	x1	GENERIC, near black, d=0.0
`#006400`	(0, 100, 0)	x10	specific, near green (dark), d=28.0
`#191970`	(25, 25, 112)	x1	specific, near navy, d=38.8
`#19418A`	(25, 65, 138)	x1	specific, near navy, d=70.4
`#3D2B21`	(61, 43, 33)	x2	specific, near black, d=81.6
`#66B2FF`	(102, 178, 255)	x3	specific, near silver, d=110.7
`#6A0DAD`	(106, 13, 173)	x6	specific, near purple, d=51.7
`#8B0000`	(139, 0, 0)	x1	GENERIC, near maroon, d=11.0
`#A9A9A9`	(169, 169, 169)	x1	specific, near silver, d=39.8
`#B22234`	(178, 34, 52)	x2	GENERIC, near brown, d=18.2
`#D32F2F`	(211, 47, 47)	x3	specific, near brown, d=46.5
`#D60000`	(214, 0, 0)	x3	specific, near red, d=41.0
`#DC143C`	(220, 20, 60)	x2	specific, near brown, d=61.9
`#F5F5DC`	(245, 245, 220)	x2	specific, near white, d=37.7
`#F5F5F5`	(245, 245, 245)	x1	GENERIC, near white, d=17.3
`#FF0000`	(255, 0, 0)	x1	GENERIC, near red, d=0.0
`#FF6347`	(255, 99, 71)	x1	specific, near orange, d=96.9
`#FF69B4`	(255, 105, 180)	x2	specific, near pink, d=90.0
`#FFD700`	(255, 215, 0)	x1	GENERIC, near gold, d=0.0
`#FFFF00`	(255, 255, 0)	x1	GENERIC, near yellow, d=0.0
`#FFFFFF`	(255, 255, 255)	x14	GENERIC, near white, d=0.0

Notable Findings

Both models can produce valid hex codes. 100% of returned values were valid hex in both cases.
Gemini is more specific overall. 71.4% of its jersey hex codes were distinct shades vs 62.7% for Qwen3. Gemini also produced more unique hex values (24 vs 21) and had a higher average distance from primaries (44.5 vs 34.5).
Gemini uses more varied shades of each color family. For red-family jerseys, Gemini returned 8 distinct hex values (#701112, #990000, #C41230, #D11111, #D32F2F, #E31837, #E31B23, #E3242B), all classified as specific. Qwen3 returned 6 (#8B0000, #B22234, #D32F2F, #D60000, #DC143C, #FF0000), of which three are classified as generic (#8B0000, #B22234, #FF0000) and one, #FF0000, is an exact primary.
Qwen3 reuses hex values slightly more heavily. #006400 (dark green) appeared 10 times and #FFFFFF 14 times, so two values account for 41% of its results (24 of 59). Gemini's two most repeated values, #FFFFFF (15 times, 27%) and #004B23 (7 times), account for 39% (22 of 56), with the remainder spread over 22 other hex values versus 19 for Qwen3.
White dominates both models. #FFFFFF was the single most common value for both (Gemini: x15, Qwen3: x14), which is expected given white jerseys are the most common in basketball.
Both models share some exact hex codes. Besides #FFFFFF, #A9A9A9 (dark silver/gray) and #D32F2F (medium red) appeared in both models' outputs, and their dark browns are nearly identical (#3D2B1F for Gemini, #3D2B21 for Qwen3), suggesting some convergence on certain color estimations.
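
The reuse figures can be recomputed from the Count columns of the two hex tables. This is a small sketch in which `top_share` is a hypothetical helper; the count lists are copied from the tables in row order.

```python
def top_share(counts, n):
    """Fraction of all results covered by the n most frequent values."""
    return sum(sorted(counts, reverse=True)[:n]) / sum(counts)


# Count column of each unique-hex table, in row order.
gemini_counts = [7, 2, 1, 1, 1, 1, 5, 1, 3, 2, 1, 1,
                 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 15]
qwen3_counts = [1, 10, 1, 1, 2, 3, 6, 1, 1, 2, 3,
                3, 2, 2, 1, 1, 1, 2, 1, 1, 14]

for name, counts in (("Gemini 3 Flash", gemini_counts), ("Qwen3-VL-8B", qwen3_counts)):
    print(name, f"top-1 {top_share(counts, 1):.1%}", f"top-2 {top_share(counts, 2):.1%}")
```

Comparing the same `n` for both models keeps the concentration figures like for like.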

Conclusions

For basic color categorization, all three models work. If you only need to distinguish "white vs dark vs colored" jerseys, any will do. Gemini offers slightly finer granularity with its blue-shade vocabulary (navy blue, dark blue, navy).
Gemini detects the most jerseys per image (2.81 avg), followed closely by Qwen3-VL-8B (2.76 avg), with Qwen2.5-VL-7B trailing (2.29 avg).
Qwen3-VL-8B is a solid upgrade over Qwen2.5-VL-7B for detection volume (20% more jerseys) while keeping the same jersey color vocabulary. It runs locally without cloud API costs, making it a good default choice.
Hex color prompting works for jersey body colors. Both models return specific hex shades the majority of the time (Gemini 71%, Qwen3 63%). Gemini produces more varied and specific shades, while Qwen3 tends to reuse a smaller set of hex values.
Neither model is a reliable colorimeter. The hex values should be treated as rough shade estimates, not pixel-accurate measurements. For precise color matching, traditional computer vision (e.g., sampling pixels from the detected jersey region) would be more reliable.
Recommendation: Use named-color prompts for general jersey classification. Reserve hex-color prompts for use cases where distinguishing similar shades matters (e.g., telling apart two teams that both wear "blue"). Gemini gives the best hex specificity but requires a cloud API; Qwen3-VL-8B is a capable local alternative.
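
The pixel-sampling alternative mentioned above can be sketched as follows. This is an illustration under stated assumptions, not something the test scripts do: it assumes the jersey region has already been located and its pixels extracted as `(r, g, b)` tuples, and it uses the per-channel median (a choice made here so that numbers, logos, and trim pull the estimate less than they would with a mean). `median_hex` is a hypothetical helper name.

```python
from statistics import median


def median_hex(pixels):
    """Return the per-channel median color of `pixels` as '#RRGGBB'.

    `pixels` is an iterable of (r, g, b) tuples with 0-255 channel values,
    sampled from the detected jersey region.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("no pixels sampled")
    # zip(*pixels) regroups the tuples into three per-channel sequences.
    r, g, b = (int(median(channel)) for channel in zip(*pixels))
    return f"#{r:02X}{g:02X}{b:02X}"


# Toy pixels, not data from the report.
print(median_hex([(0, 100, 0), (4, 96, 2), (10, 120, 8)]))
```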
