- test_color_variety.py: named-color test for local llama.cpp VLM
- test_color_variety_gemini.py: named-color test for Gemini 3 Flash API
- test_hex_color_specificity.py: hex color specificity test for Gemini
- test_hex_color_specificity_llama.py: hex color specificity test for local VLM
- jersey_prompt_hex_color.txt: prompt requesting hex color codes
- COLOR_TEST_REPORT.md: analysis report comparing 3 models across 5 tests
- color_test_results.md: raw test output from all runs
Jersey Color Detection - VLM Comparison Report
Date: 2026-02-24
Test set: 161 basketball images (basketball_jersery_color_test_files/)
Overview
Five tests were run to evaluate how vision-language models describe jersey colors:
| Test | Model | Images | Prompt | Purpose |
|---|---|---|---|---|
| 1 | Qwen2.5-VL-7B (local, llama.cpp) | 161 | Named colors | Baseline color vocabulary |
| 2 | Gemini 3 Flash (cloud API) | 161 | Named colors | Cloud model color vocabulary |
| 3 | Qwen3-VL-8B (local, llama.cpp) | 161 | Named colors | Newer local model color vocabulary |
| 4 | Gemini 3 Flash (cloud API) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |
| 5 | Qwen3-VL-8B (local, llama.cpp) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |
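Tests 4 and 5 rely on both models seeing the same 20 images. The report only states "random, seed=42", so the following is a minimal sketch of one way to draw such a subset reproducibly; the function name `pick_subset` and the placeholder file names are assumptions, not taken from the test scripts.

```python
import random

def pick_subset(filenames, k=20, seed=42):
    """Return k file names chosen reproducibly from filenames."""
    # Sort first so the result does not depend on directory listing order.
    ordered = sorted(filenames)
    # A private RNG keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))

# Placeholder names standing in for the 161 test images (hypothetical).
names = [f"img_{i:03d}.jpg" for i in range(161)]
subset = pick_subset(names)
```

Because the list is sorted before sampling, two scripts that call this with the same seed get the same subset even if they enumerate the directory in different orders.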
Named Color Vocabulary (Tests 1-3)
Detection Volume
| Metric | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|---|
| Jerseys detected | 369 | 453 | 444 |
| Errors | 0 | 0 | 1 |
| Avg time/image | 14.9s | 15.9s | 17.0s |
| Unique jersey colors | 15 | 19 | 15 |
| Unique number colors | 11 | 15 | 13 |
| Combined palette size | 15 | 19 | 17 |
Gemini detected the most jerseys (453) and used the broadest color vocabulary (19 terms). Qwen3-VL-8B detected nearly as many jerseys (444) as Gemini but with a vocabulary closer to the older Qwen2.5 model.
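As a quick arithmetic check, the detection counts above can be turned into per-image rates and a relative gain. This sketch assumes all 161 images are counted for every model, including the one image on which Qwen3-VL-8B reported an error.

```python
IMAGES = 161

# Jerseys detected per model, copied from the Detection Volume table.
jerseys = {
    "Qwen2.5-VL-7B": 369,
    "Gemini 3 Flash": 453,
    "Qwen3-VL-8B": 444,
}

# Average jerseys detected per image.
per_image = {model: round(n / IMAGES, 2) for model, n in jerseys.items()}

# Relative gain of Qwen3-VL-8B over Qwen2.5-VL-7B, in percent.
gain_pct = (jerseys["Qwen3-VL-8B"] / jerseys["Qwen2.5-VL-7B"] - 1) * 100

print(per_image)           # 2.29, 2.81 and 2.76 jerseys per image
print(round(gain_pct, 1))  # 20.3
```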
Jersey Color Distribution
| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B | Notes |
|---|---|---|---|---|
| white | 84 (22.8%) | 125 (27.6%) | 120 (27.0%) | Top color for all three |
| blue | 60 (16.3%) | 43 (9.5%) | 69 (15.5%) | Both Qwen models lump blues |
| green | 48 (13.0%) | 60 (13.2%) | 53 (11.9%) | Consistent across models |
| black | 31 (8.4%) | 21 (4.6%) | 33 (7.4%) | |
| purple | 25 (6.8%) | 28 (6.2%) | 30 (6.8%) | Consistent |
| red | 27 (7.3%) | 22 (4.9%) | 28 (6.3%) | |
| orange | 24 (6.5%) | 27 (6.0%) | 27 (6.1%) | Very consistent |
| yellow | 27 (7.3%) | 24 (5.3%) | 26 (5.9%) | |
| maroon | 14 (3.8%) | 23 (5.1%) | 15 (3.4%) | Gemini uses maroon more |
| light blue | 6 (1.6%) | 22 (4.9%) | 13 (2.9%) | Gemini distinguishes light blue most |
| gray/grey | 9 (2.4%) | 12 (2.6%) | 10 (2.3%) | |
| brown | 6 (1.6%) | 13 (2.9%) | 9 (2.0%) | |
| teal | 4 (1.1%) | 7 (1.5%) | 7 (1.6%) | |
| pink | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
| gold | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
| navy blue | -- | 11 (2.4%) | -- | Gemini-only |
| dark blue | -- | 9 (2.0%) | -- | Gemini-only |
| dark brown | -- | 1 (0.2%) | -- | Gemini-only |
| navy | -- | 1 (0.2%) | -- | Gemini-only |
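The table merges the spellings "gray" and "grey" into one row. How the test scripts tally labels is not shown in this report; the sketch below is one plausible approach, where the `ALIASES` map and the `tally_colors` helper are hypothetical.

```python
from collections import Counter

# Spelling variants folded into one label (assumption: only gray/grey is merged).
ALIASES = {"grey": "gray"}

def tally_colors(labels):
    """Map each normalized color name to (count, percent of all labels)."""
    counts = Counter()
    for label in labels:
        # Lowercase and collapse whitespace so "Light  Blue" matches "light blue".
        name = " ".join(label.lower().split())
        counts[ALIASES.get(name, name)] += 1
    total = sum(counts.values())
    return {name: (n, round(100 * n / total, 1)) for name, n in counts.most_common()}

# Made-up labels for illustration, not data from the test runs.
example = tally_colors(["White", "grey", "gray", " white", "Light  Blue", "white"])
```

Note that "navy" and "navy blue" are deliberately left unmerged here, since the report counts them as separate vocabulary terms.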
Number Color Distribution
| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|---|
| white | 195 (52.8%) | 183 (40.4%) | 184 (41.4%) |
| black | 60 (16.3%) | 40 (8.8%) | 44 (9.9%) |
| yellow | 39 (10.6%) | 58 (12.8%) | 32 (7.2%) |
| red | 30 (8.1%) | 44 (9.7%) | 41 (9.2%) |
| blue | 23 (6.2%) | 39 (8.6%) | 39 (8.8%) |
| orange | 8 (2.2%) | 21 (4.6%) | 29 (6.5%) |
| gold | -- | 5 (1.1%) | 21 (4.7%) |
| dark blue | -- | 14 (3.1%) | 9 (2.0%) |
| maroon | 2 (0.5%) | 14 (3.1%) | 12 (2.7%) |
| green | 3 (0.8%) | 13 (2.9%) | 14 (3.2%) |
| purple | 4 (1.1%) | 11 (2.4%) | 11 (2.5%) |
| pink | 3 (0.8%) | 6 (1.3%) | 6 (1.4%) |
| brown | 2 (0.5%) | 2 (0.4%) | -- |
| grey | -- | 2 (0.4%) | -- |
| navy blue | -- | 1 (0.2%) | -- |
| silver | -- | -- | 2 (0.5%) |
Key Differences in Named Color Mode
- Gemini has the richest vocabulary. It uses 19 distinct jersey color terms vs 15 for each Qwen model. The four extras are three blue-shade variants (navy blue, dark blue, navy) plus dark brown.
- Both Qwen models lump blues together. Qwen2.5-VL-7B reports 60 "blue" jerseys and Qwen3-VL-8B reports 69, with "light blue" as their only other blue term (6 and 13 detections). Gemini splits the family into blue (43), light blue (22), navy blue (11), dark blue (9), and navy (1), totaling 86 blue-family detections with much finer granularity.
- Qwen3-VL-8B is a modest upgrade over Qwen2.5-VL-7B. It detects 20% more jerseys (444 vs 369) and uses the same 15 jersey color terms with a slightly more balanced distribution. Its number-color palette grows from 11 to 13 terms: it adds "gold", "dark blue", and "silver" and drops "brown".
- Gemini detects the most jerseys overall. 453 vs 444 (Qwen3) vs 369 (Qwen2.5). The two newer models are close, while Qwen2.5 lags behind.
- All three models are dominated by basic colors. White, blue, green, and black together account for between 55% and 62% of jersey detections in every model. None spontaneously uses precise shade names like "crimson", "cobalt", or "forest green".
- Qwen3-VL-8B favors "gold" for number colors. It reported gold 21 times for number colors vs Gemini's 5 and Qwen2.5's 0. This may reflect team-specific coloring (e.g., Lakers gold numbers).
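The blue-family comparison can be recomputed directly from the Jersey Color Distribution table. This is a small sketch; treating exactly these five terms as the "blue family" follows the grouping used in the list above, and leaving teal out is an assumption.

```python
# Terms counted as blue-family (teal is excluded by assumption).
BLUE_FAMILY = {"blue", "light blue", "navy blue", "dark blue", "navy"}

# Blue-family jersey counts copied from the Jersey Color Distribution table.
jersey_counts = {
    "Qwen2.5-VL-7B": {"blue": 60, "light blue": 6},
    "Gemini 3 Flash": {"blue": 43, "light blue": 22, "navy blue": 11,
                       "dark blue": 9, "navy": 1},
    "Qwen3-VL-8B": {"blue": 69, "light blue": 13},
}

for model, counts in jersey_counts.items():
    blues = {term: n for term, n in counts.items() if term in BLUE_FAMILY}
    print(f"{model}: {sum(blues.values())} detections across {len(blues)} terms")

# Totals and term counts per model, kept for reuse.
blue_totals = {m: sum(c.values()) for m, c in jersey_counts.items()}
blue_terms = {m: len(c) for m, c in jersey_counts.items()}
```

The totals come out at 66, 86, and 82 detections, spread over 2, 5, and 2 terms respectively, so the difference between the models lies more in how finely they name blues than in how many blue jerseys they find.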
Hex Color Specificity (Tests 4-5)
Both tests used the same 20 random images (seed=42) and evaluated jersey colors only (number colors excluded since they are usually primary colors like white or black).
Summary
| Metric | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|
| Images tested | 20 | 20 |
| Total jerseys | 56 | 59 |
| Jersey color values | 56 | 59 |
| Valid hex codes | 56/56 (100%) | 59/59 (100%) |
| Unique hex values | 24 | 21 |
| Specific (distinct shade) | 40 (71.4%) | 37 (62.7%) |
| Generic (near primary) | 16 (28.6%) | 22 (37.3%) |
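The "Valid hex codes" row implies some validity check on each returned value. The report does not show that check; a minimal sketch, assuming only six-digit `#RRGGBB` values count as valid (three-digit shorthand is rejected here by assumption), could look like this:

```python
import re

# Six hex digits after a leading '#'; case-insensitive via the character class.
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}")

def is_valid_hex(value):
    """True if value is a '#RRGGBB' color code, ignoring surrounding whitespace."""
    return HEX_RE.fullmatch(value.strip()) is not None

def valid_ratio(values):
    """Return (valid_count, total) for a list of model outputs."""
    return sum(is_valid_hex(v) for v in values), len(values)
```

For example, `valid_ratio(["#FFFFFF", "#4B2E83", "white"])` returns `(2, 3)`.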
Distance from Nearest Primary Color
| Stat | Gemini 3 Flash | Qwen3-VL-8B |
|---|---|---|
| Min | 0.0 | 0.0 |
| Avg | 44.5 | 34.5 |
| Max | 111.0 | 110.7 |
(Scale: 0 = exact match with a primary color. Values under the threshold of 20 are classed as generic. Higher values indicate a more specific shade.)
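The report does not spell out how the distance is computed. The `d` values listed in this section are consistent with plain Euclidean distance in RGB space to the nearest entry in a palette of named reference colors, so the sketch below uses that. The palette is reconstructed from the "near ..." labels in the per-model tables using standard CSS color values, and the strict `<` comparison against the threshold is an assumption; the actual script may differ on both points.

```python
import math

# Reference palette (assumed): names from the "near ..." labels, CSS RGB values.
PRIMARIES = {
    "white": (255, 255, 255), "black": (0, 0, 0), "red": (255, 0, 0),
    "green (dark)": (0, 128, 0), "yellow": (255, 255, 0), "orange": (255, 165, 0),
    "purple": (128, 0, 128), "pink": (255, 192, 203), "maroon": (128, 0, 0),
    "navy": (0, 0, 128), "gold": (255, 215, 0), "silver": (192, 192, 192),
    "brown": (165, 42, 42),
}
GENERIC_THRESHOLD = 20.0

def hex_to_rgb(hex_code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

def classify(hex_code):
    """Return (nearest primary, distance rounded to 0.1, 'generic' or 'specific')."""
    rgb = hex_to_rgb(hex_code)
    name, ref = min(PRIMARIES.items(), key=lambda item: math.dist(rgb, item[1]))
    d = math.dist(rgb, ref)
    label = "generic" if d < GENERIC_THRESHOLD else "specific"
    return name, round(d, 1), label

print(classify("#004B23"))  # ('green (dark)', 63.5, 'specific')
print(classify("#FFCD00"))  # ('gold', 10.0, 'generic')
```

Euclidean RGB distance is not perceptually uniform, which helps explain some counterintuitive labels in the tables, such as bright reds landing nearest to "brown" and light blues nearest to "silver".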
Gemini 3 Flash - Unique Hex Values (24)
| Hex | RGB | Count | Classification |
|---|---|---|---|
| `#004B23` | (0, 75, 35) | x7 | specific, near green (dark), d=63.5 |
| `#1A2344` | (26, 35, 68) | x2 | specific, near navy, d=74.2 |
| `#1E4BA1` | (30, 75, 161) | x1 | specific, near navy, d=87.3 |
| `#2B231D` | (43, 35, 29) | x1 | specific, near black, d=62.6 |
| `#3D2B1F` | (61, 43, 31) | x1 | specific, near black, d=80.8 |
| `#461D7C` | (70, 29, 124) | x1 | specific, near purple, d=65.0 |
| `#4B2E83` | (75, 46, 131) | x5 | specific, near purple, d=70.2 |
| `#701112` | (112, 17, 18) | x1 | specific, near maroon, d=29.5 |
| `#7BAFD4` | (123, 175, 212) | x3 | specific, near silver, d=73.8 |
| `#990000` | (153, 0, 0) | x2 | specific, near maroon, d=25.0 |
| `#A9A9A9` | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
| `#C41230` | (196, 18, 48) | x1 | specific, near brown, d=39.7 |
| `#D11111` | (209, 17, 17) | x2 | specific, near red, d=51.9 |
| `#D32F2F` | (211, 47, 47) | x2 | specific, near brown, d=46.5 |
| `#E31837` | (227, 24, 55) | x1 | specific, near brown, d=65.9 |
| `#E31B23` | (227, 27, 35) | x1 | specific, near red, d=52.3 |
| `#E3242B` | (227, 36, 43) | x2 | specific, near brown, d=62.3 |
| `#E6E600` | (230, 230, 0) | x1 | specific, near gold, d=29.2 |
| `#E8E8E8` | (232, 232, 232) | x1 | specific, near white, d=39.8 |
| `#E91E63` | (233, 30, 99) | x1 | specific, near brown, d=89.5 |
| `#F06292` | (240, 98, 146) | x2 | specific, near pink, d=111.0 |
| `#F57C00` | (245, 124, 0) | x1 | specific, near orange, d=42.2 |
| `#FFCD00` | (255, 205, 0) | x1 | GENERIC, near gold, d=10.0 |
| `#FFFFFF` | (255, 255, 255) | x15 | GENERIC, near white, d=0.0 |
Qwen3-VL-8B - Unique Hex Values (21)
| Hex | RGB | Count | Classification |
|---|---|---|---|
| `#000000` | (0, 0, 0) | x1 | GENERIC, near black, d=0.0 |
| `#006400` | (0, 100, 0) | x10 | specific, near green (dark), d=28.0 |
| `#191970` | (25, 25, 112) | x1 | specific, near navy, d=38.8 |
| `#19418A` | (25, 65, 138) | x1 | specific, near navy, d=70.4 |
| `#3D2B21` | (61, 43, 33) | x2 | specific, near black, d=81.6 |
| `#66B2FF` | (102, 178, 255) | x3 | specific, near silver, d=110.7 |
| `#6A0DAD` | (106, 13, 173) | x6 | specific, near purple, d=51.7 |
| `#8B0000` | (139, 0, 0) | x1 | GENERIC, near maroon, d=11.0 |
| `#A9A9A9` | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
| `#B22234` | (178, 34, 52) | x2 | GENERIC, near brown, d=18.2 |
| `#D32F2F` | (211, 47, 47) | x3 | specific, near brown, d=46.5 |
| `#D60000` | (214, 0, 0) | x3 | specific, near red, d=41.0 |
| `#DC143C` | (220, 20, 60) | x2 | specific, near brown, d=61.9 |
| `#F5F5DC` | (245, 245, 220) | x2 | specific, near white, d=37.7 |
| `#F5F5F5` | (245, 245, 245) | x1 | GENERIC, near white, d=17.3 |
| `#FF0000` | (255, 0, 0) | x1 | GENERIC, near red, d=0.0 |
| `#FF6347` | (255, 99, 71) | x1 | specific, near orange, d=96.9 |
| `#FF69B4` | (255, 105, 180) | x2 | specific, near pink, d=90.0 |
| `#FFD700` | (255, 215, 0) | x1 | GENERIC, near gold, d=0.0 |
| `#FFFF00` | (255, 255, 0) | x1 | GENERIC, near yellow, d=0.0 |
| `#FFFFFF` | (255, 255, 255) | x14 | GENERIC, near white, d=0.0 |
Notable Findings
- Both models can produce valid hex codes. 100% of returned values were valid hex in both cases.
- Gemini is more specific overall. 71.4% of its jersey hex codes were distinct shades vs 62.7% for Qwen3. Gemini also produced more unique hex values (24 vs 21) and had a higher average distance from primaries (44.5 vs 34.5).
- Gemini uses more varied shades of each color family. For red-family jerseys, Gemini returned 8 distinct hex values (`#701112`, `#990000`, `#C41230`, `#D11111`, `#D32F2F`, `#E31837`, `#E31B23`, `#E3242B`), all classed as specific. Qwen3 returned 6 (`#8B0000`, `#B22234`, `#D32F2F`, `#D60000`, `#DC143C`, `#FF0000`), three of which are classed as generic (`#8B0000`, `#B22234`, `#FF0000`), including one exact primary (`#FF0000`).
- Qwen3 reuses hex values somewhat more heavily. `#006400` (dark green) appeared 10 times and `#FFFFFF` 14 times, so two values account for 41% of its results. Gemini's most repeated value was `#FFFFFF` at 15 times (27%); its top two values (`#FFFFFF` and `#004B23`, x7) account for 39%, and the remainder is spread across more unique shades (24 vs 21).
- White dominates both models. `#FFFFFF` was the single most common value for both (Gemini: x15, Qwen3: x14), which is expected given white jerseys are the most common in basketball.
- Both models share some exact hex codes. `#A9A9A9` (dark silver/gray), `#D32F2F` (medium red), and `#FFFFFF` appeared in both models' outputs, and their dark browns are nearly identical (`#3D2B1F` for Gemini, `#3D2B21` for Qwen3), suggesting some convergence on certain color estimations.
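The overlap and concentration figures in these findings can be recomputed from the two per-model tables. The sketch below copies the hex counts from those tables; the `top_share` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

# Hex value counts copied from the Gemini 3 Flash table.
gemini = Counter({
    "#004B23": 7, "#1A2344": 2, "#1E4BA1": 1, "#2B231D": 1, "#3D2B1F": 1,
    "#461D7C": 1, "#4B2E83": 5, "#701112": 1, "#7BAFD4": 3, "#990000": 2,
    "#A9A9A9": 1, "#C41230": 1, "#D11111": 2, "#D32F2F": 2, "#E31837": 1,
    "#E31B23": 1, "#E3242B": 2, "#E6E600": 1, "#E8E8E8": 1, "#E91E63": 1,
    "#F06292": 2, "#F57C00": 1, "#FFCD00": 1, "#FFFFFF": 15,
})

# Hex value counts copied from the Qwen3-VL-8B table.
qwen3 = Counter({
    "#000000": 1, "#006400": 10, "#191970": 1, "#19418A": 1, "#3D2B21": 2,
    "#66B2FF": 3, "#6A0DAD": 6, "#8B0000": 1, "#A9A9A9": 1, "#B22234": 2,
    "#D32F2F": 3, "#D60000": 3, "#DC143C": 2, "#F5F5DC": 2, "#F5F5F5": 1,
    "#FF0000": 1, "#FF6347": 1, "#FF69B4": 2, "#FFD700": 1, "#FFFF00": 1,
    "#FFFFFF": 14,
})

def top_share(counts, n=2):
    """Percent of all values covered by the n most frequent hex codes."""
    total = sum(counts.values())
    return round(100 * sum(c for _, c in counts.most_common(n)) / total, 1)

# Hex codes that appear verbatim in both models' outputs.
shared = sorted(set(gemini) & set(qwen3))

print(shared)                # ['#A9A9A9', '#D32F2F', '#FFFFFF']
print(top_share(qwen3))      # 40.7
print(top_share(gemini))     # 39.3
print(top_share(gemini, 1))  # 26.8
```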
Conclusions
- For basic color categorization, all three models work. If you only need to distinguish "white vs dark vs colored" jerseys, any of them will do. Gemini offers slightly finer granularity with its blue-shade vocabulary (navy blue, dark blue, navy).
- Gemini detects the most jerseys per image (2.81 avg), followed closely by Qwen3-VL-8B (2.76 avg), with Qwen2.5-VL-7B trailing (2.29 avg).
- Qwen3-VL-8B is a solid upgrade over Qwen2.5-VL-7B for detection volume (about 20% more jerseys) while keeping the same jersey color vocabulary. It runs locally without cloud API costs, making it a good default choice.
- Hex color prompting works for jersey body colors. Both models return specific hex shades the majority of the time (Gemini 71%, Qwen3 63%). Gemini produces more varied and specific shades, while Qwen3 tends to reuse a smaller set of hex values.
- Neither model is a reliable colorimeter. The hex values should be treated as rough shade estimates, not pixel-accurate measurements. For precise color matching, traditional computer vision (e.g., sampling pixels from the detected jersey region) would be more reliable.
- Recommendation: Use named-color prompts for general jersey classification. Reserve hex-color prompts for use cases where distinguishing similar shades matters (e.g., telling apart two teams that both wear "blue"). Gemini gives the best hex specificity but requires a cloud API; Qwen3-VL-8B is a capable local alternative.
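The pixel-sampling alternative mentioned in the conclusions could be sketched as follows. This is not part of the tested pipeline: it assumes a jersey bounding box is already available from some detector, represents the image as nested lists of `(r, g, b)` tuples to stay dependency-free, and uses a per-channel median so that numbers, logos, and stray background pixels have limited influence. The function name `region_hex` is hypothetical.

```python
from statistics import median

def region_hex(pixels, box):
    """Estimate a region's color as '#RRGGBB' from the per-channel median.

    pixels: list of rows, each a list of (r, g, b) tuples.
    box: (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    region = [px for row in pixels[top:bottom] for px in row[left:right]]
    if not region:
        raise ValueError("bounding box selects no pixels")
    # zip(*region) regroups the pixels into three channel sequences.
    r, g, b = (int(median(channel)) for channel in zip(*region))
    return f"#{r:02X}{g:02X}{b:02X}"

# Synthetic 4x4 dark-green patch with one white outlier pixel.
patch = [[(0, 100, 0)] * 4 for _ in range(4)]
patch[1][1] = (255, 255, 255)
print(region_hex(patch, (0, 0, 4, 4)))  # #006400
```

In practice the region would usually be cropped inward from the detected box to avoid skin, court, and background pixels near the edges.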