jersey_test/accuracy_test_results.md
Rick McEwen 5405d7f7dc Add accuracy test framework, prompts, results, and analysis reports
Includes accuracy test scripts for Qwen (local) and Gemini (cloud API),
three prompt variants (original, capstone, constrained), test results
from all runs, and two analysis reports with an HTML presentation version.
2026-03-03 18:44:49 -07:00


# Gemini 3 Flash Results (Prompt: jersey_prompt.txt):
================================================================================
ACCURACY SUMMARY (gemini-3-flash-preview)
================================================================================
Images processed: 161
Errors: 0
Total time: 2134.4s (13.3s avg)
Ground truth colors: 202 (excluding white)
VLM unique colors: 174 (excluding white)
--- Recall (did VLM find each ground truth color?) ---
Exact match: 130 / 202 (64.4%)
Similar match: 34 / 202 (16.8%)
Total found: 164 / 202 (81.2%)
Missed: 38 / 202 (18.8%)
--- Precision (are VLM colors correct?) ---
Exact match: 130 / 174 (74.7%)
Similar match: 33 / 174 (19.0%)
Total correct: 163 / 174 (93.7%)
Extra/wrong: 11 / 174 (6.3%)
--- Similar-Match Confusions (expected -> got) ---
gray -> grey x9
navy blue -> blue x7
dark brown -> brown x5
dark blue -> blue x5
gold -> yellow x3
dark blue -> navy blue x3
navy -> navy blue x1
dark blue -> navy x1
--- Most Missed Ground Truth Colors ---
gray 7 #######
black 7 #######
maroon 5 #####
blue 3 ###
green 3 ###
gold 2 ##
light blue 2 ##
gold|yellow 2 ##
red 2 ##
teal 2 ##
orange 1 #
yellow 1 #
brown 1 #
--- Most Common Extra/Wrong VLM Colors ---
red 3 ###
blue 3 ###
black 2 ##
green 1 #
orange 1 #
dark blue 1 #
--- Per-Image Verdict ---
PASS 124
PARTIAL 19
FAIL 18
--- Failed Images (18) ---
016 - maroon.jpg
missed: maroon
029 -maroon_white.jpg
missed: maroon
extra: red
034 - light blue.jpg
missed: light blue
extra: blue
046 - green.jpg
missed: green
extra: black
048 - red.jpg
missed: red
053 - black_white.jpg
missed: black
057 - white_gold or yellow.jpg
missed: gold|yellow
069 - red_white.jpg
missed: red
074 - white_orange.jpg
missed: orange
077 - teal_white.jpg
missed: teal
extra: green
088 - white_maroon.jpg
missed: maroon
129 - blue_white.jpg
missed: blue
132 - brown_white.jpg
missed: brown
extra: orange
134 - teal_white.jpg
missed: teal
extra: blue
138 - maroon.jpg
missed: maroon
extra: red
150 - green_gray.jpg
missed: green, gray
extra: black
160 - blue_white.jpg
missed: blue
161 - light blue_white.jpg
missed: light blue
extra: blue
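
The recall and precision percentages in the summary above follow directly from the reported counts. The sketch below reproduces that arithmetic for the Gemini 3 Flash run with `jersey_prompt.txt`. The function name `summarize` and its parameters are hypothetical, chosen for illustration; this is not the code of the actual test script.

```python
def summarize(exact, similar_recall, similar_precision, ground_truth_total, vlm_total):
    """Derive recall/precision figures from raw match counts (illustrative only)."""
    found = exact + similar_recall        # ground truth colors the VLM found
    correct = exact + similar_precision   # VLM colors that match a ground truth color
    return {
        "recall_pct": round(100 * found / ground_truth_total, 1),
        "missed": ground_truth_total - found,
        "precision_pct": round(100 * correct / vlm_total, 1),
        "extra": vlm_total - correct,
    }


# Counts copied from the summary above: 130 exact matches, 34 similar matches
# on the recall side, 33 on the precision side, 202 ground truth colors,
# 174 unique VLM colors.
summary = summarize(130, 34, 33, 202, 174)
print(summary)
```

The recall and precision sides report different similar-match counts (34 versus 33). One possible explanation, not confirmed by the log, is that two ground truth colors were matched to the same VLM color.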
# Qwen3-VL-8B Model Results (Prompt: jersey_prompt.txt):
================================================================================
ACCURACY SUMMARY
================================================================================
Images processed: 161
Errors: 0
Total time: 1526.4s (9.5s avg)
Ground truth colors: 202 (excluding white)
VLM unique colors: 184 (excluding white)
--- Recall (did VLM find each ground truth color?) ---
Exact match: 130 / 202 (64.4%)
Similar match: 26 / 202 (12.9%)
Total found: 156 / 202 (77.2%)
Missed: 46 / 202 (22.8%)
--- Precision (are VLM colors correct?) ---
Exact match: 130 / 184 (70.7%)
Similar match: 26 / 184 (14.1%)
Total correct: 156 / 184 (84.8%)
Extra/wrong: 28 / 184 (15.2%)
--- Similar-Match Confusions (expected -> got) ---
dark blue -> blue x10
navy blue -> blue x8
gold -> yellow x5
dark brown -> brown x2
navy -> blue x1
--- Most Missed Ground Truth Colors ---
light blue 8 ########
maroon 8 ########
gray 7 #######
black 6 ######
dark brown 4 ####
brown 3 ###
blue 3 ###
green 3 ###
teal 2 ##
gold|yellow 1 #
red 1 #
--- Most Common Extra/Wrong VLM Colors ---
blue 10 ##########
black 7 #######
red 7 #######
gold 1 #
green 1 #
redolas 1 #
orange 1 #
--- Per-Image Verdict ---
PASS 117
PARTIAL 18
FAIL 26
--- Failed Images (26) ---
001 -brown_white or dark brown.jpg
missed: brown, dark brown
extra: black
013 - light blue.jpg
missed: light blue
extra: blue
016 - maroon.jpg
missed: maroon
017 - brown_white.jpg
missed: brown
extra: black
022 - black_light blue.jpg
missed: black, light blue
extra: blue
029 -maroon_white.jpg
missed: maroon
extra: red
034 - light blue.jpg
missed: light blue
extra: blue
036 - light blue_white.jpg
missed: light blue
extra: blue
046 - green.jpg
missed: green
extra: black
053 - black_white.jpg
missed: black
057 - white_gold or yellow.jpg
missed: gold|yellow
063 - dark brown.jpg
missed: dark brown
extra: black
069 - red_white.jpg
missed: red
077 - teal_white.jpg
missed: teal
extra: green
078 - light blue_white.jpg
missed: light blue
extra: blue
083 - dark brown_white.jpg
missed: dark brown
extra: black
087 - white_light blue.jpg
missed: light blue
extra: blue
099 - maroon_white.jpg
missed: maroon
extra: redolas, red
129 - blue_white.jpg
missed: blue
132 - brown_white.jpg
missed: brown
extra: orange
134 - teal_white.jpg
missed: teal
extra: blue
138 - maroon.jpg
missed: maroon
extra: red
141 - light blue_white.jpg
missed: light blue
extra: blue
150 - green_gray.jpg
missed: green, gray
extra: black
160 - blue_white.jpg
missed: blue
161 - light blue_white.jpg
missed: light blue
extra: blue
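
The log does not state how the PASS / PARTIAL / FAIL verdicts are assigned. Every image in the failed lists above missed all of its non-white ground truth colors, so one plausible rule is sketched below. The function `verdict` and its rule are assumptions inferred from that pattern, not the test script's actual logic. In particular, treating an image with extra colors but no misses as PARTIAL is a guess.

```python
def verdict(ground_truth, missed, extra):
    """Classify one image (assumed rule, inferred from the failed-image lists).

    ground_truth: set of expected non-white colors
    missed:       subset of ground_truth the VLM did not report
    extra:        VLM colors that matched no ground truth color
    """
    if not missed and not extra:
        return "PASS"      # everything found, nothing spurious
    if ground_truth and len(missed) == len(ground_truth):
        return "FAIL"      # no ground truth color was found
    return "PARTIAL"       # some colors found, or spurious extras (assumption)


# Cases taken from the failed-image list above.
print(verdict({"maroon"}, {"maroon"}, {"red"}))                 # 029 -maroon_white.jpg
print(verdict({"green", "gray"}, {"green", "gray"}, {"black"}))  # 150 - green_gray.jpg
```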
# Gemini 3 Flash Results (Prompt: jersey_prompt_capstone.txt):
================================================================================
ACCURACY SUMMARY (gemini-3-flash-preview)
================================================================================
Images processed: 161
Errors: 0
Total time: 1881.7s (11.7s avg)
Ground truth colors: 202 (excluding white)
VLM unique colors: 174 (excluding white)
--- Recall (did VLM find each ground truth color?) ---
Exact match: 123 / 202 (60.9%)
Similar match: 35 / 202 (17.3%)
Total found: 158 / 202 (78.2%)
Missed: 44 / 202 (21.8%)
--- Precision (are VLM colors correct?) ---
Exact match: 123 / 174 (70.7%)
Similar match: 34 / 174 (19.5%)
Total correct: 157 / 174 (90.2%)
Extra/wrong: 17 / 174 (9.8%)
--- Similar-Match Confusions (expected -> got) ---
gray -> grey x10
navy blue -> blue x6
dark blue -> blue x6
dark brown -> brown x5
dark blue -> navy blue x3
gold -> yellow x2
navy blue -> navy x1
navy -> blue x1
dark blue -> navy x1
--- Most Missed Ground Truth Colors ---
maroon 9 #########
black 7 #######
gray 6 ######
green 4 ####
gold 3 ###
blue 3 ###
light blue 2 ##
gold|yellow 2 ##
red 2 ##
teal 2 ##
navy blue 1 #
dark brown 1 #
yellow 1 #
brown 1 #
--- Most Common Extra/Wrong VLM Colors ---
red 7 #######
black 4 ####
blue 2 ##
green 1 #
orange 1 #
light blue 1 #
navy 1 #
--- Per-Image Verdict ---
PASS 118
PARTIAL 21
FAIL 22
--- Failed Images (22) ---
016 - maroon.jpg
missed: maroon
019 - maroon_gold.jpg
missed: maroon, gold
extra: red
029 -maroon_white.jpg
missed: maroon
extra: red
030 - navy blue_white.jpg
missed: navy blue
034 - light blue.jpg
missed: light blue
extra: blue
036 - light blue_white.jpg
missed: light blue
extra: blue
046 - green.jpg
missed: green
extra: black
048 - red.jpg
missed: red
053 - black_white.jpg
missed: black
057 - white_gold or yellow.jpg
missed: gold|yellow
069 - red_white.jpg
missed: red
077 - teal_white.jpg
missed: teal
extra: green
083 - dark brown_white.jpg
missed: dark brown
extra: black
088 - white_maroon.jpg
missed: maroon
099 - maroon_white.jpg
missed: maroon
extra: red
128 - green_white.jpg
missed: green
129 - blue_white.jpg
missed: blue
132 - brown_white.jpg
missed: brown
extra: orange
134 - teal_white.jpg
missed: teal
extra: light blue
138 - maroon.jpg
missed: maroon
extra: red
150 - green_gray.jpg
missed: green, gray
extra: black
160 - blue_white.jpg
missed: blue
# Qwen3-VL-8B Model Results (Prompt: jersey_prompt_capstone.txt):
================================================================================
ACCURACY SUMMARY
================================================================================
Images processed: 161
Errors: 0
Total time: 1435.7s (8.9s avg)
Ground truth colors: 202 (excluding white)
VLM unique colors: 180 (excluding white)
--- Recall (did VLM find each ground truth color?) ---
Exact match: 133 / 202 (65.8%)
Similar match: 24 / 202 (11.9%)
Total found: 157 / 202 (77.7%)
Missed: 45 / 202 (22.3%)
--- Precision (are VLM colors correct?) ---
Exact match: 133 / 180 (73.9%)
Similar match: 24 / 180 (13.3%)
Total correct: 157 / 180 (87.2%)
Extra/wrong: 23 / 180 (12.8%)
--- Similar-Match Confusions (expected -> got) ---
dark blue -> blue x9
navy blue -> blue x8
gold -> yellow x3
dark brown -> brown x2
navy -> blue x1
dark blue -> navy x1
--- Most Missed Ground Truth Colors ---
gray 9 #########
maroon 7 #######
black 6 ######
light blue 5 #####
dark brown 4 ####
green 4 ####
brown 3 ###
gold 2 ##
blue 2 ##
teal 2 ##
gold|yellow 1 #
--- Most Common Extra/Wrong VLM Colors ---
black 7 #######
blue 6 ######
red 6 ######
gold 1 #
green 1 #
orange 1 #
navy 1 #
--- Per-Image Verdict ---
PASS 119
PARTIAL 19
FAIL 23
--- Failed Images (23) ---
001 -brown_white or dark brown.jpg
missed: brown, dark brown
extra: black
013 - light blue.jpg
missed: light blue
extra: blue
016 - maroon.jpg
missed: maroon
017 - brown_white.jpg
missed: brown
extra: black
019 - maroon_gold.jpg
missed: maroon, gold
extra: red
029 -maroon_white.jpg
missed: maroon
extra: red
034 - light blue.jpg
missed: light blue
extra: blue
036 - light blue_white.jpg
missed: light blue
extra: blue
039 - gray_white.jpg
missed: gray
046 - green.jpg
missed: green
extra: black
053 - black_white.jpg
missed: black
057 - white_gold or yellow.jpg
missed: gold|yellow
063 - dark brown.jpg
missed: dark brown
extra: black
077 - teal_white.jpg
missed: teal
extra: green
083 - dark brown_white.jpg
missed: dark brown
extra: black
132 - brown_white.jpg
missed: brown
extra: orange
134 - teal_white.jpg
missed: teal
extra: blue
138 - maroon.jpg
missed: maroon
extra: red
141 - light blue_white.jpg
missed: light blue
extra: blue
145 - green_white.jpg
missed: green
150 - green_gray.jpg
missed: green, gray
extra: black
160 - blue_white.jpg
missed: blue
161 - light blue_white.jpg
missed: light blue
extra: blue