Add color variety and hex specificity test scripts with report

- test_color_variety.py: named-color test for local llama.cpp VLM
- test_color_variety_gemini.py: named-color test for Gemini 3 Flash API
- test_hex_color_specificity.py: hex color specificity test for Gemini
- test_hex_color_specificity_llama.py: hex color specificity test for local VLM
- jersey_prompt_hex_color.txt: prompt requesting hex color codes
- COLOR_TEST_REPORT.md: analysis report comparing 3 models across 5 tests
- color_test_results.md: raw test output from all runs
2026-02-24 11:30:41 -07:00
parent 825f3c19a9
commit 435033ea07
7 changed files with 1646 additions and 0 deletions
--- a/COLOR_TEST_REPORT.md
+++ b/COLOR_TEST_REPORT.md
@@ -0,0 +1,205 @@
+# Jersey Color Detection - VLM Comparison Report
+
+**Date:** 2026-02-24
+**Test set:** 161 basketball images (`basketball_jersery_color_test_files/`)
+
+## Overview
+
+Five tests were run to evaluate how vision-language models describe jersey colors:
+
+| Test | Model | Images | Prompt | Purpose |
+|------|-------|--------|--------|---------|
+| 1 | Qwen2.5-VL-7B (local, llama.cpp) | 161 | Named colors | Baseline color vocabulary |
+| 2 | Gemini 3 Flash (cloud API) | 161 | Named colors | Cloud model color vocabulary |
+| 3 | Qwen3-VL-8B (local, llama.cpp) | 161 | Named colors | Newer local model color vocabulary |
+| 4 | Gemini 3 Flash (cloud API) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |
+| 5 | Qwen3-VL-8B (local, llama.cpp) | 20 (random, seed=42) | Hex codes (jersey only) | Hex color specificity |
+
+---
+
+## Named Color Vocabulary (Tests 1-3)
+
+### Detection Volume
+
+| Metric | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
+|--------|---------------|----------------|--------------|
+| Jerseys detected | 369 | 453 | 444 |
+| Errors | 0 | 0 | 1 |
+| Avg time/image | 14.9s | 15.9s | 17.0s |
+| Unique jersey colors | 15 | 19 | 15 |
+| Unique number colors | 11 | 15 | 13 |
+| Combined palette size | 15 | 19 | 17 |
+
+Gemini detected the most jerseys (453) and used the broadest color vocabulary (19 terms). Qwen3-VL-8B detected nearly as many jerseys (444) as Gemini but with a vocabulary closer to the older Qwen2.5 model.
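
As a quick cross-check of the detection counts, the sketch below derives jerseys per image and the relative gain of Qwen3-VL-8B over Qwen2.5-VL-7B. The counts and the 161-image test-set size come from this report; the helper name `relative_gain` is illustrative only.

```python
# Detection counts from the Detection Volume table; 161 is the test-set size.
IMAGES = 161
jerseys = {"Qwen2.5-VL-7B": 369, "Gemini 3 Flash": 453, "Qwen3-VL-8B": 444}

# Average number of jerseys reported per image, rounded to two decimals.
per_image = {model: round(count / IMAGES, 2) for model, count in jerseys.items()}


def relative_gain(new: int, old: int) -> float:
    """Percentage increase of `new` over `old`, rounded to one decimal."""
    return round(100 * (new - old) / old, 1)


qwen_gain = relative_gain(jerseys["Qwen3-VL-8B"], jerseys["Qwen2.5-VL-7B"])

print(per_image)
print(qwen_gain)
```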
+
+### Jersey Color Distribution
+
+| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B | Notes |
+|-------|---------------|----------------|--------------|-------|
+| white | 84 (22.8%) | 125 (27.6%) | 120 (27.0%) | Top color for all three |
+| blue | 60 (16.3%) | 43 (9.5%) | 69 (15.5%) | Both Qwen models lump blues |
+| green | 48 (13.0%) | 60 (13.2%) | 53 (11.9%) | Consistent across models |
+| black | 31 (8.4%) | 21 (4.6%) | 33 (7.4%) | |
+| purple | 25 (6.8%) | 28 (6.2%) | 30 (6.8%) | Consistent |
+| red | 27 (7.3%) | 22 (4.9%) | 28 (6.3%) | |
+| orange | 24 (6.5%) | 27 (6.0%) | 27 (6.1%) | Very consistent |
+| yellow | 27 (7.3%) | 24 (5.3%) | 26 (5.9%) | |
+| maroon | 14 (3.8%) | 23 (5.1%) | 15 (3.4%) | Gemini uses maroon more |
+| light blue | 6 (1.6%) | 22 (4.9%) | 13 (2.9%) | Gemini distinguishes light blue most |
+| gray/grey | 9 (2.4%) | 12 (2.6%) | 10 (2.3%) | |
+| brown | 6 (1.6%) | 13 (2.9%) | 9 (2.0%) | |
+| teal | 4 (1.1%) | 7 (1.5%) | 7 (1.6%) | |
+| pink | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
+| gold | 2 (0.5%) | 2 (0.4%) | 2 (0.5%) | |
+| navy blue | -- | 11 (2.4%) | -- | Gemini-only |
+| dark blue | -- | 9 (2.0%) | -- | Gemini-only |
+| dark brown | -- | 1 (0.2%) | -- | Gemini-only |
+| navy | -- | 1 (0.2%) | -- | Gemini-only |
+
+### Number Color Distribution
+
+| Color | Qwen2.5-VL-7B | Gemini 3 Flash | Qwen3-VL-8B |
+|-------|---------------|----------------|--------------|
+| white | 195 (52.8%) | 183 (40.4%) | 184 (41.4%) |
+| black | 60 (16.3%) | 40 (8.8%) | 44 (9.9%) |
+| yellow | 39 (10.6%) | 58 (12.8%) | 32 (7.2%) |
+| red | 30 (8.1%) | 44 (9.7%) | 41 (9.2%) |
+| blue | 23 (6.2%) | 39 (8.6%) | 39 (8.8%) |
+| orange | 8 (2.2%) | 21 (4.6%) | 29 (6.5%) |
+| gold | -- | 5 (1.1%) | 21 (4.7%) |
+| dark blue | -- | 14 (3.1%) | 9 (2.0%) |
+| maroon | 2 (0.5%) | 14 (3.1%) | 12 (2.7%) |
+| green | 3 (0.8%) | 13 (2.9%) | 14 (3.2%) |
+| purple | 4 (1.1%) | 11 (2.4%) | 11 (2.5%) |
+| pink | 3 (0.8%) | 6 (1.3%) | 6 (1.4%) |
+| brown | 2 (0.5%) | 2 (0.4%) | -- |
+| grey | -- | 2 (0.4%) | -- |
+| navy blue | -- | 1 (0.2%) | -- |
+| silver | -- | -- | 2 (0.5%) |
+
+### Key Differences in Named Color Mode
+
+1. **Gemini has the richest vocabulary.** It uses 19 distinct jersey color terms vs 15 for both Qwen models. The four extras are three blue-shade variants (navy blue, dark blue, navy) plus dark brown.
+
+2. **Both Qwen models lump blues together.** Qwen2.5-VL-7B reports 60 "blue" and 6 "light blue" jerseys; Qwen3-VL-8B reports 69 and 13. Gemini splits the same family into blue (43), light blue (22), navy blue (11), dark blue (9), and navy (1), totaling 86 blue-family detections with much finer granularity.
+
+3. **Qwen3-VL-8B is a modest upgrade over Qwen2.5-VL-7B.** It detects 20% more jerseys (444 vs 369) and uses the same 15 jersey color terms with a slightly more balanced distribution. Its number color palette differs: it adds "gold", "dark blue", and "silver" and drops "brown".
+
+4. **Gemini detects the most jerseys overall.** 453 vs 444 (Qwen3) vs 369 (Qwen2.5). The two newer models are close, while Qwen2.5 lags behind.
+
+5. **All three models are dominated by basic colors.** White, blue, green, and black together account for 55% to 62% of jersey detections in every model. None spontaneously uses precise shade names like "crimson", "cobalt", or "forest green".
+
+6. **Qwen3-VL-8B favors "gold" for number colors.** It reported gold 21 times for number colors vs Gemini's 5 and Qwen2.5's 0. This may reflect team-specific coloring (e.g., Lakers gold numbers).
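
The blue-family comparison in item 2 can be reproduced by grouping named colors into a family and summing their counts. This is a minimal sketch: the counts are taken from the Jersey Color Distribution table, while the `BLUE_FAMILY` grouping and the `family_total` helper are assumptions made for illustration ("teal" is left out, matching the report's total).

```python
from collections import Counter

# Blue-related jersey counts per model, copied from the distribution table.
jersey_counts = {
    "Qwen2.5-VL-7B": Counter({"blue": 60, "light blue": 6, "teal": 4}),
    "Gemini 3 Flash": Counter({"blue": 43, "light blue": 22, "navy blue": 11,
                               "dark blue": 9, "navy": 1, "teal": 7}),
    "Qwen3-VL-8B": Counter({"blue": 69, "light blue": 13, "teal": 7}),
}

# Hypothetical family definition; adjust it if teal should count as blue.
BLUE_FAMILY = {"blue", "light blue", "navy blue", "dark blue", "navy"}


def family_total(counts, family):
    """Sum the counts of every color term that belongs to the family."""
    return sum(n for color, n in counts.items() if color in family)


blue_totals = {m: family_total(c, BLUE_FAMILY) for m, c in jersey_counts.items()}
# Number of distinct blue-family terms each model actually used.
blue_terms = {m: len(BLUE_FAMILY & set(c)) for m, c in jersey_counts.items()}

print(blue_totals)
print(blue_terms)
```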
+
+---
+
+## Hex Color Specificity (Tests 4-5)
+
+Both tests used the same 20 random images (seed=42) and evaluated **jersey colors only** (number colors excluded since they are usually primary colors like white or black).
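
One way to get the same 20 images in both runs is a seeded sample. The sketch below is an assumption about how such a subset could be drawn, not the test scripts' actual code; `pick_subset` and the placeholder filenames are hypothetical.

```python
import random


def pick_subset(filenames, k=20, seed=42):
    """Return a reproducible random subset of k filenames."""
    rng = random.Random(seed)  # private generator, global random state untouched
    # Sort first so the result does not depend on directory listing order.
    return rng.sample(sorted(filenames), k)


# Placeholder names standing in for the 161 test images.
files = [f"img_{i:03d}.jpg" for i in range(161)]

subset_a = pick_subset(files)
subset_b = pick_subset(list(reversed(files)))  # same files, different input order
```

Sorting before sampling matters because file listings are not guaranteed to come back in the same order on every machine.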
+
+### Summary
+
+| Metric | Gemini 3 Flash | Qwen3-VL-8B |
+|--------|----------------|--------------|
+| Images tested | 20 | 20 |
+| Total jerseys | 56 | 59 |
+| Jersey color values | 56 | 59 |
+| Valid hex codes | 56/56 (100%) | 59/59 (100%) |
+| Unique hex values | 24 | 21 |
+| Specific (distinct shade) | 40 (71.4%) | 37 (62.7%) |
+| Generic (near primary) | 16 (28.6%) | 22 (37.3%) |
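
The "Valid hex codes" row implies a format check on each returned value. A minimal sketch of such a check follows, assuming the expected format is a six-digit `#RRGGBB` string; the `parse_hex` helper is illustrative and not taken from the test scripts.

```python
import re

# Six hex digits after a leading '#'; shorthand like '#FFF' is rejected here.
HEX_RE = re.compile(r"#[0-9A-Fa-f]{6}")


def parse_hex(value: str):
    """Return (r, g, b) for a '#RRGGBB' string, or None if it is not valid."""
    value = value.strip()
    if not HEX_RE.fullmatch(value):
        return None
    return tuple(int(value[i:i + 2], 16) for i in (1, 3, 5))


print(parse_hex("#004B23"))
print(parse_hex("dark green"))
```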
+
+### Distance from Nearest Primary Color
+
+| Stat | Gemini 3 Flash | Qwen3-VL-8B |
+|------|----------------|--------------|
+| Min | 0.0 | 0.0 |
+| Avg | 44.5 | 34.5 |
+| Max | 111.0 | 110.7 |
+
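+(Scale: 0 = exact primary match. 20 = generic threshold. Higher = more specific. The `d` values in the tables below are consistent with Euclidean distance in RGB space to the nearest reference color.)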
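
The distance and classification can be sketched as follows, assuming Euclidean distance in RGB space. The `REFERENCE` palette is an assumption (values chosen to match the `d` figures in this report; the scripts may use a different or larger set), and treating the threshold as a strict "less than 20" is also an assumption.

```python
import math

# Assumed reference palette; not confirmed against the test scripts.
REFERENCE = {
    "white": (255, 255, 255), "black": (0, 0, 0), "red": (255, 0, 0),
    "maroon": (128, 0, 0), "brown": (165, 42, 42), "orange": (255, 165, 0),
    "gold": (255, 215, 0), "yellow": (255, 255, 0), "green (dark)": (0, 128, 0),
    "navy": (0, 0, 128), "purple": (128, 0, 128), "pink": (255, 192, 203),
    "silver": (192, 192, 192),
}
GENERIC_THRESHOLD = 20.0  # from the scale note; strictness is assumed


def classify(hex_code: str):
    """Return (nearest reference name, distance, label) for a '#RRGGBB' code."""
    rgb = tuple(int(hex_code[i:i + 2], 16) for i in (1, 3, 5))
    name, dist = min(
        ((n, math.dist(rgb, ref)) for n, ref in REFERENCE.items()),
        key=lambda pair: pair[1],
    )
    label = "GENERIC" if dist < GENERIC_THRESHOLD else "specific"
    return name, round(dist, 1), label


print(classify("#990000"))
print(classify("#FFCD00"))
```

Euclidean RGB distance is simple but not perceptually uniform, so a distance of 40 in dark reds is not visually equivalent to 40 in light blues.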
+
+### Gemini 3 Flash - Unique Hex Values (24)
+
+| Hex | RGB | Count | Classification |
+|-----|-----|-------|---------------|
+| `#004B23` | (0, 75, 35) | x7 | specific, near green (dark), d=63.5 |
+| `#1A2344` | (26, 35, 68) | x2 | specific, near navy, d=74.2 |
+| `#1E4BA1` | (30, 75, 161) | x1 | specific, near navy, d=87.3 |
+| `#2B231D` | (43, 35, 29) | x1 | specific, near black, d=62.6 |
+| `#3D2B1F` | (61, 43, 31) | x1 | specific, near black, d=80.8 |
+| `#461D7C` | (70, 29, 124) | x1 | specific, near purple, d=65.0 |
+| `#4B2E83` | (75, 46, 131) | x5 | specific, near purple, d=70.2 |
+| `#701112` | (112, 17, 18) | x1 | specific, near maroon, d=29.5 |
+| `#7BAFD4` | (123, 175, 212) | x3 | specific, near silver, d=73.8 |
+| `#990000` | (153, 0, 0) | x2 | specific, near maroon, d=25.0 |
+| `#A9A9A9` | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
+| `#C41230` | (196, 18, 48) | x1 | specific, near brown, d=39.7 |
+| `#D11111` | (209, 17, 17) | x2 | specific, near red, d=51.9 |
+| `#D32F2F` | (211, 47, 47) | x2 | specific, near brown, d=46.5 |
+| `#E31837` | (227, 24, 55) | x1 | specific, near brown, d=65.9 |
+| `#E31B23` | (227, 27, 35) | x1 | specific, near red, d=52.3 |
+| `#E3242B` | (227, 36, 43) | x2 | specific, near brown, d=62.3 |
+| `#E6E600` | (230, 230, 0) | x1 | specific, near gold, d=29.2 |
+| `#E8E8E8` | (232, 232, 232) | x1 | specific, near white, d=39.8 |
+| `#E91E63` | (233, 30, 99) | x1 | specific, near brown, d=89.5 |
+| `#F06292` | (240, 98, 146) | x2 | specific, near pink, d=111.0 |
+| `#F57C00` | (245, 124, 0) | x1 | specific, near orange, d=42.2 |
+| `#FFCD00` | (255, 205, 0) | x1 | GENERIC, near gold, d=10.0 |
+| `#FFFFFF` | (255, 255, 255) | x15 | GENERIC, near white, d=0.0 |
+
+### Qwen3-VL-8B - Unique Hex Values (21)
+
+| Hex | RGB | Count | Classification |
+|-----|-----|-------|---------------|
+| `#000000` | (0, 0, 0) | x1 | GENERIC, near black, d=0.0 |
+| `#006400` | (0, 100, 0) | x10 | specific, near green (dark), d=28.0 |
+| `#191970` | (25, 25, 112) | x1 | specific, near navy, d=38.8 |
+| `#19418A` | (25, 65, 138) | x1 | specific, near navy, d=70.4 |
+| `#3D2B21` | (61, 43, 33) | x2 | specific, near black, d=81.6 |
+| `#66B2FF` | (102, 178, 255) | x3 | specific, near silver, d=110.7 |
+| `#6A0DAD` | (106, 13, 173) | x6 | specific, near purple, d=51.7 |
+| `#8B0000` | (139, 0, 0) | x1 | GENERIC, near maroon, d=11.0 |
+| `#A9A9A9` | (169, 169, 169) | x1 | specific, near silver, d=39.8 |
+| `#B22234` | (178, 34, 52) | x2 | GENERIC, near brown, d=18.2 |
+| `#D32F2F` | (211, 47, 47) | x3 | specific, near brown, d=46.5 |
+| `#D60000` | (214, 0, 0) | x3 | specific, near red, d=41.0 |
+| `#DC143C` | (220, 20, 60) | x2 | specific, near brown, d=61.9 |
+| `#F5F5DC` | (245, 245, 220) | x2 | specific, near white, d=37.7 |
+| `#F5F5F5` | (245, 245, 245) | x1 | GENERIC, near white, d=17.3 |
+| `#FF0000` | (255, 0, 0) | x1 | GENERIC, near red, d=0.0 |
+| `#FF6347` | (255, 99, 71) | x1 | specific, near orange, d=96.9 |
+| `#FF69B4` | (255, 105, 180) | x2 | specific, near pink, d=90.0 |
+| `#FFD700` | (255, 215, 0) | x1 | GENERIC, near gold, d=0.0 |
+| `#FFFF00` | (255, 255, 0) | x1 | GENERIC, near yellow, d=0.0 |
+| `#FFFFFF` | (255, 255, 255) | x14 | GENERIC, near white, d=0.0 |
+
+### Notable Findings
+
+- **Both models can produce valid hex codes.** 100% of returned values were valid hex in both cases.
+
+- **Gemini is more specific overall.** 71.4% of its jersey hex codes were distinct shades vs 62.7% for Qwen3. Gemini also produced more unique hex values (24 vs 21) and had a higher average distance from primaries (44.5 vs 34.5).
+
+- **Gemini uses more varied shades of each color family.** For red-family jerseys, Gemini returned 8 distinct hex values (`#701112`, `#990000`, `#C41230`, `#D11111`, `#D32F2F`, `#E31837`, `#E31B23`, `#E3242B`), all classified as specific. Qwen3 returned 6 (`#8B0000`, `#B22234`, `#D32F2F`, `#D60000`, `#DC143C`, `#FF0000`), of which three are classified GENERIC (`#8B0000`, `#B22234`, `#FF0000`) and one (`#FF0000`) is an exact primary.
+
+- **Qwen3 reuses hex values more heavily.** `#006400` (dark green) appeared 10 times and `#FFFFFF` 14 times — two values account for 41% of all results. Gemini's most repeated value was `#FFFFFF` at 15 times (27%), with better spread across other shades.
+
+- **White dominates both models.** `#FFFFFF` was the single most common value for both (Gemini: x15, Qwen3: x14), which is consistent with white being the most common jersey color in the named-color tests (Tests 1-3).
+
+- **Both models share some exact hex codes.** Besides `#FFFFFF`, `#A9A9A9` (dark gray) and `#D32F2F` (medium red) appeared in both models' outputs. The dark browns are nearly identical but not an exact match (`#3D2B1F` for Gemini vs `#3D2B21` for Qwen3). This suggests some convergence on certain color estimations.
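
The reuse and overlap figures can be checked with a few set and counter operations. The sketch below copies the hex counts from the two tables above; the helper `top_share` is illustrative.

```python
from collections import Counter

gemini = Counter({
    "#004B23": 7, "#1A2344": 2, "#1E4BA1": 1, "#2B231D": 1, "#3D2B1F": 1,
    "#461D7C": 1, "#4B2E83": 5, "#701112": 1, "#7BAFD4": 3, "#990000": 2,
    "#A9A9A9": 1, "#C41230": 1, "#D11111": 2, "#D32F2F": 2, "#E31837": 1,
    "#E31B23": 1, "#E3242B": 2, "#E6E600": 1, "#E8E8E8": 1, "#E91E63": 1,
    "#F06292": 2, "#F57C00": 1, "#FFCD00": 1, "#FFFFFF": 15,
})
qwen3 = Counter({
    "#000000": 1, "#006400": 10, "#191970": 1, "#19418A": 1, "#3D2B21": 2,
    "#66B2FF": 3, "#6A0DAD": 6, "#8B0000": 1, "#A9A9A9": 1, "#B22234": 2,
    "#D32F2F": 3, "#D60000": 3, "#DC143C": 2, "#F5F5DC": 2, "#F5F5F5": 1,
    "#FF0000": 1, "#FF6347": 1, "#FF69B4": 2, "#FFD700": 1, "#FFFF00": 1,
    "#FFFFFF": 14,
})


def top_share(counts: Counter, n: int) -> float:
    """Percentage of all results covered by the n most frequent values."""
    top = sum(c for _, c in counts.most_common(n))
    return round(100 * top / sum(counts.values()), 1)


shared = set(gemini) & set(qwen3)  # hex codes returned by both models

print(sorted(shared))
print(top_share(qwen3, 2), top_share(gemini, 1))
```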
+
+---
+
+## Conclusions
+
+1. **For basic color categorization, all three models work.** If you only need to distinguish "white vs dark vs colored" jerseys, any will do. Gemini offers slightly finer granularity with its blue-shade vocabulary (navy blue, dark blue, navy).
+
+2. **Gemini detects the most jerseys per image** (2.81 avg), followed closely by Qwen3-VL-8B (2.76 avg), with Qwen2.5-VL-7B trailing (2.29 avg).
+
+3. **Qwen3-VL-8B is a solid upgrade over Qwen2.5-VL-7B** for detection volume (20% more jerseys) while maintaining the same jersey color vocabulary. It runs locally without cloud API costs, making it a good default choice.
+
+4. **Hex color prompting works for jersey body colors.** Both models return specific hex shades the majority of the time (Gemini 71%, Qwen3 63%). Gemini produces more varied and specific shades, while Qwen3 tends to reuse a smaller set of hex values.
+
+5. **Neither model is a reliable colorimeter.** The hex values should be treated as rough shade estimates, not pixel-accurate measurements. For precise color matching, traditional computer vision (e.g., sampling pixels from the detected jersey region) would be more reliable.
+
+6. **Recommendation:** Use named-color prompts for general jersey classification. Reserve hex-color prompts for use cases where distinguishing similar shades matters (e.g., telling apart two teams that both wear "blue"). Gemini gives the best hex specificity but requires a cloud API; Qwen3-VL-8B is a capable local alternative.
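
For the pixel-sampling alternative mentioned in conclusion 5, a minimal sketch is shown below. It assumes the image is already decoded into rows of `(r, g, b)` tuples and that a jersey bounding box is available from some detector; both are assumptions, and `dominant_color` is a hypothetical helper rather than part of the test scripts.

```python
from statistics import median


def dominant_color(pixels, box):
    """Median color of a region as '#RRGGBB'.

    pixels: list of rows, each a list of (r, g, b) tuples.
    box: (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    region = [px for row in pixels[top:bottom] for px in row[left:right]]
    if not region:
        raise ValueError("bounding box selects no pixels")
    # Median per channel is less sensitive than the mean to numbers and logos.
    r, g, b = (int(median(channel)) for channel in zip(*region))
    return f"#{r:02X}{g:02X}{b:02X}"


# Tiny synthetic image: white background with a 2x2 dark green patch.
image = [[(255, 255, 255)] * 4 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        image[y][x] = (0, 100, 0)

print(dominant_color(image, (1, 1, 3, 3)))
```

A real pipeline would likely mask out the number and skin regions before sampling; that step is omitted here.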