jersey_test/accuracy_test_results.md
Rick McEwen 5405d7f7dc Add accuracy test framework, prompts, results, and analysis reports
Includes accuracy test scripts for Qwen (local) and Gemini (cloud API),
three prompt variants (original, capstone, constrained), test results
from all runs, and two analysis reports with an HTML presentation version.
2026-03-03 18:44:49 -07:00


# Gemini 3 Flash Results (Prompt: jersey_prompt.txt)

## ACCURACY SUMMARY (gemini-3-flash-preview)

- Images processed: 161
- Errors: 0
- Total time: 2134.4s (13.3s avg)
- Ground truth colors: 202 (excluding white)
- VLM unique colors: 174 (excluding white)

### Recall (did VLM find each ground truth color?)

- Exact match: 130 / 202 (64.4%)
- Similar match: 34 / 202 (16.8%)
- Total found: 164 / 202 (81.2%)
- Missed: 38 / 202 (18.8%)

### Precision (are VLM colors correct?)

- Exact match: 130 / 174 (74.7%)
- Similar match: 33 / 174 (19.0%)
- Total correct: 163 / 174 (93.7%)
- Extra/wrong: 11 / 174 (6.3%)
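The recall and precision percentages above follow directly from the raw counts. A minimal sketch of that arithmetic, assuming each percentage is the count divided by the relevant total and rounded to one decimal place (the function name `summarize` is hypothetical, not taken from the test scripts):

```python
def summarize(exact: int, similar: int, total: int) -> dict:
    """Turn exact/similar match counts into the percentages shown in the report."""
    matched = exact + similar
    unmatched = total - matched
    return {
        "exact_pct": round(100 * exact / total, 1),
        "similar_pct": round(100 * similar / total, 1),
        "matched": matched,
        "matched_pct": round(100 * matched / total, 1),
        "unmatched": unmatched,
        "unmatched_pct": round(100 * unmatched / total, 1),
    }


# Counts from the Gemini run with jersey_prompt.txt.
recall = summarize(exact=130, similar=34, total=202)     # denominator: ground truth colors
precision = summarize(exact=130, similar=33, total=174)  # denominator: VLM unique colors

print(recall)
print(precision)
```

Recall uses the 202 ground truth colors as its denominator and precision uses the 174 colors the VLM reported, which is why the same 130 exact matches give two different percentages.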

### Similar-Match Confusions (expected -> got)

- gray -> grey x9
- navy blue -> blue x7
- dark brown -> brown x5
- dark blue -> blue x5
- gold -> yellow x3
- dark blue -> navy blue x3
- navy -> navy blue x1
- dark blue -> navy x1
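The confusion pairs in this file suggest that a "similar match" is scored when the expected and returned colors fall in the same family. A rough sketch of that idea follows. The groups in `SIMILAR_GROUPS` are inferred only from the pairs listed in these results, and both the table and the `classify` function are hypothetical, not the test script's actual logic:

```python
# Color families inferred from the confusion lists in this file (an assumption).
SIMILAR_GROUPS = [
    {"gray", "grey"},
    {"blue", "navy", "navy blue", "dark blue"},
    {"brown", "dark brown"},
    {"gold", "yellow"},
]


def classify(expected: str, predicted: set) -> str:
    """Label one ground truth color as an exact match, a similar match, or missed."""
    if expected in predicted:
        return "exact"
    for group in SIMILAR_GROUPS:
        # Similar: expected color and at least one predicted color share a family.
        if expected in group and group & predicted:
            return "similar"
    return "missed"


print(classify("gray", {"grey", "white"}))  # same family, different spelling
print(classify("light blue", {"blue"}))     # not in any family, so counted as missed
```

Under these assumed groups, `light blue -> blue` and `maroon -> red` count as misses with an extra color, which is consistent with how those cases appear in the failed-image lists.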

### Most Missed Ground Truth Colors

- gray: 7
- black: 7
- maroon: 5
- blue: 3
- green: 3
- gold: 2
- light blue: 2
- gold|yellow: 2
- red: 2
- teal: 2
- orange: 1
- yellow: 1
- brown: 1

### Most Common Extra/Wrong VLM Colors

- red: 3
- blue: 3
- black: 2
- green: 1
- orange: 1
- dark blue: 1

### Per-Image Verdict

- PASS: 124
- PARTIAL: 19
- FAIL: 18
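The results do not state how a verdict is assigned. In every failed image listed in this file, all of the non-white ground truth colors were missed, so the sketch below assumes FAIL means "nothing found". The PASS and PARTIAL rules here (PASS requires no misses and no extras) are a guess, and the `verdict` function is hypothetical:

```python
def verdict(ground_truth: set, found: set, extra: set) -> str:
    """Assign a per-image verdict from the non-white colors (assumed rule)."""
    missed = ground_truth - found
    if ground_truth and missed == ground_truth:
        return "FAIL"      # every ground truth color was missed
    if missed or extra:
        return "PARTIAL"   # some colors found, but with misses or extras
    return "PASS"          # everything found, nothing extra


# "150 - green_gray.jpg": both colors missed, black reported instead.
print(verdict({"green", "gray"}, found=set(), extra={"black"}))
```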

### Failed Images (18)

- `016 - maroon.jpg` missed: maroon
- `029 -maroon_white.jpg` missed: maroon; extra: red
- `034 - light blue.jpg` missed: light blue; extra: blue
- `046 - green.jpg` missed: green; extra: black
- `048 - red.jpg` missed: red
- `053 - black_white.jpg` missed: black
- `057 - white_gold or yellow.jpg` missed: gold|yellow
- `069 - red_white.jpg` missed: red
- `074 - white_orange.jpg` missed: orange
- `077 - teal_white.jpg` missed: teal; extra: green
- `088 - white_maroon.jpg` missed: maroon
- `129 - blue_white.jpg` missed: blue
- `132 - brown_white.jpg` missed: brown; extra: orange
- `134 - teal_white.jpg` missed: teal; extra: blue
- `138 - maroon.jpg` missed: maroon; extra: red
- `150 - green_gray.jpg` missed: green, gray; extra: black
- `160 - blue_white.jpg` missed: blue
- `161 - light blue_white.jpg` missed: light blue; extra: blue

# Qwen3-VL-8B Model Results (Prompt: jersey_prompt.txt)

## ACCURACY SUMMARY

- Images processed: 161
- Errors: 0
- Total time: 1526.4s (9.5s avg)
- Ground truth colors: 202 (excluding white)
- VLM unique colors: 184 (excluding white)

### Recall (did VLM find each ground truth color?)

- Exact match: 130 / 202 (64.4%)
- Similar match: 26 / 202 (12.9%)
- Total found: 156 / 202 (77.2%)
- Missed: 46 / 202 (22.8%)

### Precision (are VLM colors correct?)

- Exact match: 130 / 184 (70.7%)
- Similar match: 26 / 184 (14.1%)
- Total correct: 156 / 184 (84.8%)
- Extra/wrong: 28 / 184 (15.2%)

### Similar-Match Confusions (expected -> got)

- dark blue -> blue x10
- navy blue -> blue x8
- gold -> yellow x5
- dark brown -> brown x2
- navy -> blue x1

### Most Missed Ground Truth Colors

- light blue: 8
- maroon: 8
- gray: 7
- black: 6
- dark brown: 4
- brown: 3
- blue: 3
- green: 3
- teal: 2
- gold|yellow: 1
- red: 1

### Most Common Extra/Wrong VLM Colors

- blue: 10
- black: 7
- red: 7
- gold: 1
- green: 1
- redolas: 1
- orange: 1

### Per-Image Verdict

- PASS: 117
- PARTIAL: 18
- FAIL: 26

### Failed Images (26)

- `001 -brown_white or dark brown.jpg` missed: brown, dark brown; extra: black
- `013 - light blue.jpg` missed: light blue; extra: blue
- `016 - maroon.jpg` missed: maroon
- `017 - brown_white.jpg` missed: brown; extra: black
- `022 - black_light blue.jpg` missed: black, light blue; extra: blue
- `029 -maroon_white.jpg` missed: maroon; extra: red
- `034 - light blue.jpg` missed: light blue; extra: blue
- `036 - light blue_white.jpg` missed: light blue; extra: blue
- `046 - green.jpg` missed: green; extra: black
- `053 - black_white.jpg` missed: black
- `057 - white_gold or yellow.jpg` missed: gold|yellow
- `063 - dark brown.jpg` missed: dark brown; extra: black
- `069 - red_white.jpg` missed: red
- `077 - teal_white.jpg` missed: teal; extra: green
- `078 - light blue_white.jpg` missed: light blue; extra: blue
- `083 - dark brown_white.jpg` missed: dark brown; extra: black
- `087 - white_light blue.jpg` missed: light blue; extra: blue
- `099 - maroon_white.jpg` missed: maroon; extra: redolas, red
- `129 - blue_white.jpg` missed: blue
- `132 - brown_white.jpg` missed: brown; extra: orange
- `134 - teal_white.jpg` missed: teal; extra: blue
- `138 - maroon.jpg` missed: maroon; extra: red
- `141 - light blue_white.jpg` missed: light blue; extra: blue
- `150 - green_gray.jpg` missed: green, gray; extra: black
- `160 - blue_white.jpg` missed: blue
- `161 - light blue_white.jpg` missed: light blue; extra: blue

# Gemini 3 Flash Results (Prompt: jersey_prompt_capstone.txt)

## ACCURACY SUMMARY (gemini-3-flash-preview)

- Images processed: 161
- Errors: 0
- Total time: 1881.7s (11.7s avg)
- Ground truth colors: 202 (excluding white)
- VLM unique colors: 174 (excluding white)

### Recall (did VLM find each ground truth color?)

- Exact match: 123 / 202 (60.9%)
- Similar match: 35 / 202 (17.3%)
- Total found: 158 / 202 (78.2%)
- Missed: 44 / 202 (21.8%)

### Precision (are VLM colors correct?)

- Exact match: 123 / 174 (70.7%)
- Similar match: 34 / 174 (19.5%)
- Total correct: 157 / 174 (90.2%)
- Extra/wrong: 17 / 174 (9.8%)

### Similar-Match Confusions (expected -> got)

- gray -> grey x10
- navy blue -> blue x6
- dark blue -> blue x6
- dark brown -> brown x5
- dark blue -> navy blue x3
- gold -> yellow x2
- navy blue -> navy x1
- navy -> blue x1
- dark blue -> navy x1

### Most Missed Ground Truth Colors

- maroon: 9
- black: 7
- gray: 6
- green: 4
- gold: 3
- blue: 3
- light blue: 2
- gold|yellow: 2
- red: 2
- teal: 2
- navy blue: 1
- dark brown: 1
- yellow: 1
- brown: 1

### Most Common Extra/Wrong VLM Colors

- red: 7
- black: 4
- blue: 2
- green: 1
- orange: 1
- light blue: 1
- navy: 1

### Per-Image Verdict

- PASS: 118
- PARTIAL: 21
- FAIL: 22

### Failed Images (22)

- `016 - maroon.jpg` missed: maroon
- `019 - maroon_gold.jpg` missed: maroon, gold; extra: red
- `029 -maroon_white.jpg` missed: maroon; extra: red
- `030 - navy blue_white.jpg` missed: navy blue
- `034 - light blue.jpg` missed: light blue; extra: blue
- `036 - light blue_white.jpg` missed: light blue; extra: blue
- `046 - green.jpg` missed: green; extra: black
- `048 - red.jpg` missed: red
- `053 - black_white.jpg` missed: black
- `057 - white_gold or yellow.jpg` missed: gold|yellow
- `069 - red_white.jpg` missed: red
- `077 - teal_white.jpg` missed: teal; extra: green
- `083 - dark brown_white.jpg` missed: dark brown; extra: black
- `088 - white_maroon.jpg` missed: maroon
- `099 - maroon_white.jpg` missed: maroon; extra: red
- `128 - green_white.jpg` missed: green
- `129 - blue_white.jpg` missed: blue
- `132 - brown_white.jpg` missed: brown; extra: orange
- `134 - teal_white.jpg` missed: teal; extra: light blue
- `138 - maroon.jpg` missed: maroon; extra: red
- `150 - green_gray.jpg` missed: green, gray; extra: black
- `160 - blue_white.jpg` missed: blue

# Qwen3-VL-8B Model Results (Prompt: jersey_prompt_capstone.txt)

## ACCURACY SUMMARY

- Images processed: 161
- Errors: 0
- Total time: 1435.7s (8.9s avg)
- Ground truth colors: 202 (excluding white)
- VLM unique colors: 180 (excluding white)

### Recall (did VLM find each ground truth color?)

- Exact match: 133 / 202 (65.8%)
- Similar match: 24 / 202 (11.9%)
- Total found: 157 / 202 (77.7%)
- Missed: 45 / 202 (22.3%)

### Precision (are VLM colors correct?)

- Exact match: 133 / 180 (73.9%)
- Similar match: 24 / 180 (13.3%)
- Total correct: 157 / 180 (87.2%)
- Extra/wrong: 23 / 180 (12.8%)

### Similar-Match Confusions (expected -> got)

- dark blue -> blue x9
- navy blue -> blue x8
- gold -> yellow x3
- dark brown -> brown x2
- navy -> blue x1
- dark blue -> navy x1

### Most Missed Ground Truth Colors

- gray: 9
- maroon: 7
- black: 6
- light blue: 5
- dark brown: 4
- green: 4
- brown: 3
- gold: 2
- blue: 2
- teal: 2
- gold|yellow: 1

### Most Common Extra/Wrong VLM Colors

- black: 7
- blue: 6
- red: 6
- gold: 1
- green: 1
- orange: 1
- navy: 1

### Per-Image Verdict

- PASS: 119
- PARTIAL: 19
- FAIL: 23

### Failed Images (23)

- `001 -brown_white or dark brown.jpg` missed: brown, dark brown; extra: black
- `013 - light blue.jpg` missed: light blue; extra: blue
- `016 - maroon.jpg` missed: maroon
- `017 - brown_white.jpg` missed: brown; extra: black
- `019 - maroon_gold.jpg` missed: maroon, gold; extra: red
- `029 -maroon_white.jpg` missed: maroon; extra: red
- `034 - light blue.jpg` missed: light blue; extra: blue
- `036 - light blue_white.jpg` missed: light blue; extra: blue
- `039 - gray_white.jpg` missed: gray
- `046 - green.jpg` missed: green; extra: black
- `053 - black_white.jpg` missed: black
- `057 - white_gold or yellow.jpg` missed: gold|yellow
- `063 - dark brown.jpg` missed: dark brown; extra: black
- `077 - teal_white.jpg` missed: teal; extra: green
- `083 - dark brown_white.jpg` missed: dark brown; extra: black
- `132 - brown_white.jpg` missed: brown; extra: orange
- `134 - teal_white.jpg` missed: teal; extra: blue
- `138 - maroon.jpg` missed: maroon; extra: red
- `141 - light blue_white.jpg` missed: light blue; extra: blue
- `145 - green_white.jpg` missed: green
- `150 - green_gray.jpg` missed: green, gray; extra: black
- `160 - blue_white.jpg` missed: blue
- `161 - light blue_white.jpg` missed: light blue; extra: blue