Add color variety and hex specificity test scripts with report

- test_color_variety.py: named-color test for local llama.cpp VLM
- test_color_variety_gemini.py: named-color test for Gemini 3 Flash API
- test_hex_color_specificity.py: hex color specificity test for Gemini
- test_hex_color_specificity_llama.py: hex color specificity test for local VLM
- jersey_prompt_hex_color.txt: prompt requesting hex color codes
- COLOR_TEST_REPORT.md: analysis report comparing 3 models across 5 tests
- color_test_results.md: raw test output from all runs

205
COLOR_TEST_REPORT.md
Normal file
@@ -0,0 +1,205 @@
# Jersey Color Detection - VLM Comparison Report

**Date:** 2026-02-24
**Test set:** 161 basketball images (`basketball_jersery_color_test_files/`)

## Overview

Five tests were run to evaluate how vision-language models describe jersey colors:

| Test | Model | Images | Prompt | Purpose |
|------|-------|--------|--------|---------|
| 1 | Qwen2.5-VL-7B (local, llama.cpp) | 161 | Named colors | Baseline color vocabulary |
| 2 | Gemini 3 Flash (cloud API) | 161 | Named colors | Cloud model color vocabulary |
| 3 | Qwen3-VL-8B (local, llama.cpp) | 161 | Named colors | Newer local model color vocabulary |
| 4 | Gemini 3 Flash (cloud API) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |
| 5 | Qwen3-VL-8B (local, llama.cpp) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |

---

## Named Color Vocabulary (Tests 1-3)

### Detection Volume

| Metric | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
|--------|---------------|----------------|--------------|
| Jerseys detected | 369 | 453 | 444 |
| Errors | 0 | 0 | 1 |
| Avg time/image | 14.9s | 15.9s | 17.0s |
| Unique jersey colors | 15 | 19 | 15 |
| Unique number colors | 11 | 15 | 13 |
| Combined palette size | 15 | 19 | 17 |

Gemini detected the most jerseys (453) and used the broadest color vocabulary (19 terms). Qwen3-VL-8B detected nearly as many jerseys (444) as Gemini but with a vocabulary closer to the older Qwen2.5 model.

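The per-image averages and the relative gain quoted later in this report follow directly from the totals above. A minimal sketch of that arithmetic, assuming the 161-image test set as the denominator for all three models (the helper names `per_image` and `relative_gain` are illustrative, not taken from the test scripts):

```python
# Totals copied from the Detection Volume table; 161 is the test-set size.
IMAGES = 161
jerseys = {"Qwen2.5-VL-7B": 369, "Gemini 3 Flash": 453, "Qwen3-VL-8B": 444}


def per_image(total, images=IMAGES):
    """Average number of jerseys detected per image, rounded to 2 places."""
    return round(total / images, 2)


def relative_gain(new, old):
    """Percentage increase of `new` over `old`, rounded to 1 place."""
    return round(100 * (new - old) / old, 1)


averages = {model: per_image(n) for model, n in jerseys.items()}
gain = relative_gain(jerseys["Qwen3-VL-8B"], jerseys["Qwen2.5-VL-7B"])

print(averages)
print(f"Qwen3-VL-8B vs Qwen2.5-VL-7B: +{gain}%")
```
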
### Jersey Color Distribution

| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B | Notes |
|-------|---------------|----------------|--------------|-------|
| white | 84 (22.8%) | 125 (27.6%) | 120 (27.0%) | Top color for all three |
| blue | 60 (16.3%) | 43 (9.5%) | 69 (15.5%) | Both Qwen models lump blues |
| green | 48 (13.0%) | 60 (13.2%) | 53 (11.9%) | Consistent across models |
| black | 31 (8.4%) | 21 (4.6%) | 33 (7.4%) | |
| purple | 25 (6.8%) | 28 (6.2%) | 30 (6.8%) | Consistent |
| red | 27 (7.3%) | 22 (4.9%) | 28 (6.3%) | |
| orange | 24 (6.5%) | 27 (6.0%) | 27 (6.1%) | Very consistent |
| yellow | 27 (7.3%) | 24 (5.3%) | 26 (5.9%) | |
| maroon | 14 (3.8%) | 23 (5.1%) | 15 (3.4%) | Gemini uses maroon more |
| light blue | 6 (1.6%) | 22 (4.9%) | 13 (2.9%) | Gemini distinguishes light blue most |
| gray/grey | 9 (2.4%) | 12 (2.6%) | 10 (2.3%) | |
| brown | 6 (1.6%) | 13 (2.9%) | 9 (2.0%) | |
| teal | 4 (1.1%) | 7 (1.5%) | 7 (1.6%) | |
| pink | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
| gold | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
| navy blue | -- | 11 (2.4%) | -- | Gemini-only |
| dark blue | -- | 9 (2.0%) | -- | Gemini-only |
| dark brown | -- | 1 (0.2%) | -- | Gemini-only |
| navy | -- | 1 (0.2%) | -- | Gemini-only |

### Number Color Distribution

| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
|-------|---------------|----------------|--------------|
| white | 195 (52.8%) | 183 (40.4%) | 184 (41.4%) |
| black | 60 (16.3%) | 40 (8.8%) | 44 (9.9%) |
| yellow | 39 (10.6%) | 58 (12.8%) | 32 (7.2%) |
| red | 30 (8.1%) | 44 (9.7%) | 41 (9.2%) |
| blue | 23 (6.2%) | 39 (8.6%) | 39 (8.8%) |
| orange | 8 (2.2%) | 21 (4.6%) | 29 (6.5%) |
| gold | -- | 5 (1.1%) | 21 (4.7%) |
| dark blue | -- | 14 (3.1%) | 9 (2.0%) |
| maroon | 2 (0.5%) | 14 (3.1%) | 12 (2.7%) |
| green | 3 (0.8%) | 13 (2.9%) | 14 (3.2%) |
| purple | 4 (1.1%) | 11 (2.4%) | 11 (2.5%) |
| pink | 3 (0.8%) | 6 (1.3%) | 6 (1.4%) |
| brown | 2 (0.5%) | 2 (0.4%) | -- |
| grey | -- | 2 (0.4%) | -- |
| navy blue | -- | 1 (0.2%) | -- |
| silver | -- | -- | 2 (0.5%) |

### Key Differences in Named Color Mode

1. **Gemini has the richest vocabulary.** It uses 19 distinct jersey color terms vs 15 for both Qwen models. The four extras are three blue-shade variants (navy blue, dark blue, navy) plus dark brown.

2. **Both Qwen models lump blues together.** Qwen2.5-VL-7B reports 60 "blue" jerseys and Qwen3-VL-8B reports 69. Gemini splits these into blue (43), light blue (22), navy blue (11), dark blue (9), and navy (1), totaling 86 blue-family detections with much finer granularity. The Qwen models do report "light blue" separately (6 and 13 times) but never use a dark or navy shade for jerseys.

3. **Qwen3-VL-8B is a modest upgrade over Qwen2.5-VL-7B.** It detects 20% more jerseys (444 vs 369) and uses the same 15 jersey color terms with a slightly more balanced distribution. For number colors it adds "gold", "dark blue", and "silver" and drops "brown"; of these, "dark blue" and "silver" are new to its combined palette (17 terms vs 15).

4. **Gemini detects the most jerseys overall.** 453 vs 444 (Qwen3) vs 369 (Qwen2.5). The two newer models are close, while Qwen2.5 lags behind.

5. **All three models are dominated by basic colors.** White, blue, green, and black together account for the majority of jersey detections (roughly 55-62% depending on the model). None spontaneously uses precise shade names like "crimson", "cobalt", or "forest green".

6. **Qwen3-VL-8B favors "gold" for number colors.** It reported gold 21 times for number colors vs Gemini's 5 and Qwen2.5's 0. This may reflect team-specific coloring (e.g., Lakers gold numbers).

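The blue-family total in point 2 comes from collapsing shade names into a parent color. A minimal sketch of that grouping, using Gemini's jersey counts from the distribution table; the `FAMILY` mapping is an assumption made for illustration and is not part of the test scripts:

```python
from collections import Counter

# Gemini 3 Flash jersey counts, copied from the Jersey Color Distribution table.
gemini_jersey = {
    "white": 125, "blue": 43, "green": 60, "black": 21, "purple": 28,
    "red": 22, "orange": 27, "yellow": 24, "maroon": 23, "light blue": 22,
    "gray/grey": 12, "brown": 13, "teal": 7, "pink": 2, "gold": 2,
    "navy blue": 11, "dark blue": 9, "dark brown": 1, "navy": 1,
}

# Hypothetical shade-to-family mapping; any name not listed is its own family.
FAMILY = {
    "light blue": "blue", "navy blue": "blue", "dark blue": "blue",
    "navy": "blue", "dark brown": "brown",
}


def by_family(counts):
    """Sum detection counts after mapping each color name to its family."""
    totals = Counter()
    for name, n in counts.items():
        totals[FAMILY.get(name, name)] += n
    return totals


families = by_family(gemini_jersey)
print(families["blue"], families["brown"], sum(families.values()))
```

Grouping this way makes the models comparable at the family level while keeping the raw shade names available for the cases where they matter.
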
---

## Hex Color Specificity (Tests 4-5)

Both tests used the same 20 random images (seed=42) and evaluated **jersey colors only** (number colors excluded since they are usually primary colors like white or black).

### Summary

| Metric | Gemini 3 Flash | Qwen3-VL-8B |
|--------|----------------|--------------|
| Images tested | 20 | 20 |
| Total jerseys | 56 | 59 |
| Jersey color values | 56 | 59 |
| Valid hex codes | 56/56 (100%) | 59/59 (100%) |
| Unique hex values | 24 | 21 |
| Specific (distinct shade) | 40 (71.4%) | 37 (62.7%) |
| Generic (near primary) | 16 (28.6%) | 22 (37.3%) |

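The report does not say how a value was judged a "valid hex code". One plausible check, assuming the expected format is a six-digit `#RRGGBB` string (the function name `is_valid_hex` is hypothetical):

```python
import re

# Assumption: a valid code is '#' followed by exactly six hex digits.
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def is_valid_hex(value):
    """Return True if `value` is a #RRGGBB color code (whitespace ignored)."""
    return HEX_RE.fullmatch(value.strip()) is not None


for candidate in ["#004B23", " #ffffff ", "#FFF", "dark green", "#GG0000"]:
    print(repr(candidate), is_valid_hex(candidate))
```

A stricter or looser rule (for example, also accepting three-digit shorthand) would change the validity rate, so the 100% figure should be read against whatever rule the scripts actually apply.
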
### Distance from Nearest Primary Color

| Stat | Gemini 3 Flash | Qwen3-VL-8B |
|------|----------------|--------------|
| Min | 0.0 | 0.0 |
| Avg | 44.5 | 34.5 |
| Max | 111.0 | 110.7 |

(Scale: 0 = exact primary match. 20 = generic threshold. Higher = more specific.)

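The distances reported in this section are consistent with plain Euclidean distance in RGB space to the nearest color in a small reference palette. The sketch below reproduces several of the reported values under that reading. The palette is an assumption reconstructed from the labels in the classification columns (with standard web-color RGB values), and treating `d < 20` as generic is also an assumption, since the report gives only the threshold value:

```python
import math

# Assumed reference palette: names from the report's "near ..." labels,
# RGB values from the standard web-color definitions.
REFERENCE = {
    "white": (255, 255, 255), "black": (0, 0, 0), "red": (255, 0, 0),
    "green (dark)": (0, 128, 0), "navy": (0, 0, 128), "maroon": (128, 0, 0),
    "purple": (128, 0, 128), "silver": (192, 192, 192),
    "brown": (165, 42, 42), "gold": (255, 215, 0), "yellow": (255, 255, 0),
    "pink": (255, 192, 203), "orange": (255, 165, 0),
}
GENERIC_THRESHOLD = 20.0  # from the scale note; strict '<' is an assumption


def hex_to_rgb(code):
    """Convert '#RRGGBB' to an (r, g, b) tuple of ints."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def classify(code):
    """Return (label, nearest reference name, distance rounded to 1 place)."""
    rgb = hex_to_rgb(code)
    name, ref = min(REFERENCE.items(), key=lambda item: math.dist(rgb, item[1]))
    d = round(math.dist(rgb, ref), 1)
    label = "GENERIC" if d < GENERIC_THRESHOLD else "specific"
    return label, name, d


for code in ["#F5F5F5", "#990000", "#B22234", "#006400", "#E31837"]:
    print(code, classify(code))
```

Euclidean RGB distance is not perceptually uniform, which may explain some surprising neighbors in the tables (for example, bright reds landing nearest to brown).
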
### Gemini 3 Flash - Unique Hex Values (24)

| Hex | RGB | Count | Classification |
|-----|-----|-------|---------------|
| `#004B23` | (0, 75, 35) | x7 | specific, near green (dark), d=63.5 |
| `#1A2344` | (26, 35, 68) | x2 | specific, near navy, d=74.2 |
| `#1E4BA1` | (30, 75, 161) | x1 | specific, near navy, d=87.3 |
| `#2B231D` | (43, 35, 29) | x1 | specific, near black, d=62.6 |
| `#3D2B1F` | (61, 43, 31) | x1 | specific, near black, d=80.8 |
| `#461D7C` | (70, 29, 124) | x1 | specific, near purple, d=65.0 |
| `#4B2E83` | (75, 46, 131) | x5 | specific, near purple, d=70.2 |
| `#701112` | (112, 17, 18) | x1 | specific, near maroon, d=29.5 |
| `#7BAFD4` | (123, 175, 212) | x3 | specific, near silver, d=73.8 |
| `#990000` | (153, 0, 0) | x2 | specific, near maroon, d=25.0 |
| `#A9A9A9` | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
| `#C41230` | (196, 18, 48) | x1 | specific, near brown, d=39.7 |
| `#D11111` | (209, 17, 17) | x2 | specific, near red, d=51.9 |
| `#D32F2F` | (211, 47, 47) | x2 | specific, near brown, d=46.5 |
| `#E31837` | (227, 24, 55) | x1 | specific, near brown, d=65.9 |
| `#E31B23` | (227, 27, 35) | x1 | specific, near red, d=52.3 |
| `#E3242B` | (227, 36, 43) | x2 | specific, near brown, d=62.3 |
| `#E6E600` | (230, 230, 0) | x1 | specific, near gold, d=29.2 |
| `#E8E8E8` | (232, 232, 232) | x1 | specific, near white, d=39.8 |
| `#E91E63` | (233, 30, 99) | x1 | specific, near brown, d=89.5 |
| `#F06292` | (240, 98, 146) | x2 | specific, near pink, d=111.0 |
| `#F57C00` | (245, 124, 0) | x1 | specific, near orange, d=42.2 |
| `#FFCD00` | (255, 205, 0) | x1 | GENERIC, near gold, d=10.0 |
| `#FFFFFF` | (255, 255, 255) | x15 | GENERIC, near white, d=0.0 |

### Qwen3-VL-8B - Unique Hex Values (21)

| Hex | RGB | Count | Classification |
|-----|-----|-------|---------------|
| `#000000` | (0, 0, 0) | x1 | GENERIC, near black, d=0.0 |
| `#006400` | (0, 100, 0) | x10 | specific, near green (dark), d=28.0 |
| `#191970` | (25, 25, 112) | x1 | specific, near navy, d=38.8 |
| `#19418A` | (25, 65, 138) | x1 | specific, near navy, d=70.4 |
| `#3D2B21` | (61, 43, 33) | x2 | specific, near black, d=81.6 |
| `#66B2FF` | (102, 178, 255) | x3 | specific, near silver, d=110.7 |
| `#6A0DAD` | (106, 13, 173) | x6 | specific, near purple, d=51.7 |
| `#8B0000` | (139, 0, 0) | x1 | GENERIC, near maroon, d=11.0 |
| `#A9A9A9` | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
| `#B22234` | (178, 34, 52) | x2 | GENERIC, near brown, d=18.2 |
| `#D32F2F` | (211, 47, 47) | x3 | specific, near brown, d=46.5 |
| `#D60000` | (214, 0, 0) | x3 | specific, near red, d=41.0 |
| `#DC143C` | (220, 20, 60) | x2 | specific, near brown, d=61.9 |
| `#F5F5DC` | (245, 245, 220) | x2 | specific, near white, d=37.7 |
| `#F5F5F5` | (245, 245, 245) | x1 | GENERIC, near white, d=17.3 |
| `#FF0000` | (255, 0, 0) | x1 | GENERIC, near red, d=0.0 |
| `#FF6347` | (255, 99, 71) | x1 | specific, near orange, d=96.9 |
| `#FF69B4` | (255, 105, 180) | x2 | specific, near pink, d=90.0 |
| `#FFD700` | (255, 215, 0) | x1 | GENERIC, near gold, d=0.0 |
| `#FFFF00` | (255, 255, 0) | x1 | GENERIC, near yellow, d=0.0 |
| `#FFFFFF` | (255, 255, 255) | x14 | GENERIC, near white, d=0.0 |

### Notable Findings

- **Both models can produce valid hex codes.** 100% of returned values were valid hex in both cases.

- **Gemini is more specific overall.** 71.4% of its jersey hex codes were distinct shades vs 62.7% for Qwen3. Gemini also produced more unique hex values (24 vs 21) and had a higher average distance from primaries (44.5 vs 34.5).

- **Gemini uses more varied shades of each color family.** For red-family jerseys, Gemini returned 8 distinct hex values (`#701112`, `#990000`, `#C41230`, `#D11111`, `#D32F2F`, `#E31837`, `#E31B23`, `#E3242B`), all classified as specific. Qwen3 returned 6 (`#8B0000`, `#B22234`, `#D32F2F`, `#D60000`, `#DC143C`, `#FF0000`), three of which are classified as generic (`#8B0000`, `#B22234`, `#FF0000`), including one exact primary (`#FF0000`).

- **Qwen3 reuses hex values more heavily.** `#006400` (dark green) appeared 10 times and `#FFFFFF` 14 times, so two values account for 41% of all results. Gemini's two most repeated values, `#FFFFFF` (x15, 27%) and `#004B23` (x7), account for 39%, and it spreads its 56 results over 24 unique values versus 21 unique values for Qwen3's 59 results.

- **White dominates both models.** `#FFFFFF` was the single most common value for both (Gemini: x15, Qwen3: x14), which is consistent with white being the most frequent jersey color in the named-color tests.

- **Both models share some exact hex codes.** Besides `#FFFFFF`, `#A9A9A9` (dark silver/gray) and `#D32F2F` (medium red) appeared in both models' outputs. The dark browns are a near match rather than an exact one: Gemini returned `#3D2B1F` and Qwen3 returned `#3D2B21`, which differ by 2 in the blue channel. This suggests some convergence on certain color estimations.

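The overlap between the two models' outputs can be checked directly from the two hex tables. A short sketch using set intersection (the helper `channel_gap` is illustrative and not part of the test scripts):

```python
# Unique hex values copied from the Gemini 3 Flash and Qwen3-VL-8B tables.
gemini = {
    "#004B23", "#1A2344", "#1E4BA1", "#2B231D", "#3D2B1F", "#461D7C",
    "#4B2E83", "#701112", "#7BAFD4", "#990000", "#A9A9A9", "#C41230",
    "#D11111", "#D32F2F", "#E31837", "#E31B23", "#E3242B", "#E6E600",
    "#E8E8E8", "#E91E63", "#F06292", "#F57C00", "#FFCD00", "#FFFFFF",
}
qwen3 = {
    "#000000", "#006400", "#191970", "#19418A", "#3D2B21", "#66B2FF",
    "#6A0DAD", "#8B0000", "#A9A9A9", "#B22234", "#D32F2F", "#D60000",
    "#DC143C", "#F5F5DC", "#F5F5F5", "#FF0000", "#FF6347", "#FF69B4",
    "#FFD700", "#FFFF00", "#FFFFFF",
}


def channel_gap(a, b):
    """Largest per-channel difference between two '#RRGGBB' codes."""
    pairs = ((int(a[i:i + 2], 16), int(b[i:i + 2], 16)) for i in (1, 3, 5))
    return max(abs(x - y) for x, y in pairs)


shared = gemini & qwen3
print(sorted(shared))
print(channel_gap("#3D2B1F", "#3D2B21"))
```
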
---

## Conclusions

1. **For basic color categorization, all three models work.** If you only need to distinguish "white vs dark vs colored" jerseys, any will do. Gemini offers slightly finer granularity with its blue-shade vocabulary (navy blue, dark blue, navy).

2. **Gemini detects the most jerseys per image** (2.81 avg), followed closely by Qwen3-VL-8B (2.76 avg), with Qwen2.5-VL-7B trailing (2.29 avg).

3. **Qwen3-VL-8B is a solid upgrade over Qwen2.5-VL-7B** for detection volume (20% more jerseys) while maintaining the same jersey color vocabulary. It runs locally without cloud API costs, making it a good default choice.

4. **Hex color prompting works for jersey body colors.** Both models return specific hex shades the majority of the time (Gemini 71%, Qwen3 63%). Gemini produces more varied and specific shades, while Qwen3 tends to reuse a smaller set of hex values.

5. **Neither model is a reliable colorimeter.** The hex values should be treated as rough shade estimates, not pixel-accurate measurements. For precise color matching, traditional computer vision (e.g., sampling pixels from the detected jersey region) would be more reliable.

6. **Recommendation:** Use named-color prompts for general jersey classification. Reserve hex-color prompts for use cases where distinguishing similar shades matters (e.g., telling apart two teams that both wear "blue"). Gemini gives the better hex specificity of the two models tested but requires a cloud API; Qwen3-VL-8B is a capable local alternative.

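As an illustration of the pixel-sampling alternative mentioned in conclusion 5, the sketch below reduces a set of sampled pixels to one hex code. It assumes the jersey region has already been located and its pixels extracted as `(r, g, b)` tuples; that step is outside the scope of this report. The function name `region_hex` and the choice of a per-channel median (to resist outliers such as white number pixels) are assumptions, not something the test scripts implement:

```python
from statistics import median


def region_hex(pixels):
    """Return the per-channel median of (r, g, b) pixels as '#RRGGBB'."""
    if not pixels:
        raise ValueError("no pixels sampled")
    # zip(*pixels) regroups the tuples into one sequence per channel.
    r, g, b = (int(median(channel)) for channel in zip(*pixels))
    return f"#{r:02X}{g:02X}{b:02X}"


# Hypothetical sample: four dark-green jersey pixels and one white outlier.
sample = [(0, 100, 0), (2, 98, 4), (0, 104, 1), (250, 250, 250), (1, 101, 0)]
print(region_hex(sample))
```

A mean would be pulled toward the white outlier, whereas the median stays on the dominant jersey shade, which is why it is used here.
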