Rick McEwen 5405d7f7dc Add accuracy test framework, prompts, results, and analysis reports
Includes accuracy test scripts for Qwen (local) and Gemini (cloud API),
three prompt variants (original, capstone, constrained), test results
from all runs, and two analysis reports with an HTML presentation version.
2026-03-03 18:44:49 -07:00

Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.

Directory Structure

jersey_test/
├── scan_utils/
│   ├── jersey_detection.py      # Core detection class using VLM
│   └── llama_cpp_client.py      # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md      # Server setup instructions
├── test_images/                  # Place test images here
├── test_images_output/           # Output directory for annotated images
├── test_jersey_detection.py      # Main test runner
├── analyze_jersey_results.py     # Results analysis script
├── test_all_models.sh            # Batch testing shell script
├── jersey_prompt.txt             # Basic detection prompt
├── jersey_prompt_with_confidence.txt  # Prompt with confidence scoring
└── jersey_detection_results.jsonl     # Historical test results

Prerequisites

  • Python 3.10+
  • llama.cpp server running with a vision-language model
  • Test images with ground truth encoded in filenames

Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

prefix-number1-number2-number3.jpg

Example: game1-23-45-7.jpg contains jerseys with numbers 23, 45, and 7.
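As an illustration, the ground truth can be recovered from a filename with a small parser (the helper below is a hypothetical sketch, not part of the test scripts):

```python
from pathlib import Path

def parse_ground_truth(filename: str) -> tuple[str, list[int]]:
    """Split a test-image filename into its prefix and ground-truth jersey numbers.

    Expects the naming convention prefix-number1-number2-....jpg.
    """
    stem = Path(filename).stem          # "game1-23-45-7"
    prefix, *parts = stem.split("-")    # prefix has no dashes by convention
    return prefix, [int(p) for p in parts]

prefix, numbers = parse_ground_truth("game1-23-45-7.jpg")
# prefix == "game1", numbers == [23, 45, 7]
```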

Running Tests

Single Model Test

python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl

Batch Testing All Models

./test_all_models.sh

Edit variables at the top of the script to configure:

  • IMAGES_DIR - test images directory
  • PROMPT_FILE - prompt file to use
  • SERVER_URL - llama.cpp/llama-swap server URL
  • LLAMA_SWAP_CONFIG - path to llama-swap config for model list

Analyzing Results

python analyze_jersey_results.py jersey_detection_results.jsonl

Options:

  • --csv output.csv - Export results to CSV
  • --filter-model "model_name" - Filter by model name
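For a sense of what the analysis step computes, a minimal JSONL scorer might look like this. The `detected` and `ground_truth` field names are assumptions about the results schema, not confirmed by the scripts:

```python
import json

def score_run(jsonl_path: str) -> float:
    """Micro-averaged F1 over per-image detections in a JSONL results file.

    Assumes each line is a JSON object with "detected" and "ground_truth"
    lists of jersey numbers (hypothetical field names).
    """
    tp = fp = fn = 0
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            detected = set(rec["detected"])
            truth = set(rec["ground_truth"])
            tp += len(detected & truth)   # correct detections
            fp += len(detected - truth)   # spurious detections
            fn += len(truth - detected)   # missed jerseys
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```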

Historical Results

The jersey_detection_results.jsonl file contains results from 6 test runs:

Model                        F1 Score   Avg Time/Image   Avg Confidence
qwen2.5-vl-7b                72.9%      -                -
gemma-3-27b                  72.1%      18.1s            87.1
Mistral-Small-3.2-24B (Q4)   -          14.2s            92.1
Kimi-VL-A3B-Thinking         -          29.1s            88.9

See docs/JERSEY_DETECTION_MODEL_ANALYSIS.md for detailed analysis.

Hallucination Detection

Vision-language models can sometimes "hallucinate" by returning example jersey numbers from the prompt instead of actual detections from the image. To combat this, the detection code filters out known example numbers.

Filtered numbers: 101, 102, 103, 142, 199

These numbers were deliberately chosen as examples in the prompt because real jersey numbers are typically 0-99. Any detection returning these numbers is flagged as a hallucination and excluded from results.
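The filtering idea amounts to a set-membership check. A minimal sketch (the function name and return shape are illustrative, not the actual implementation in scan_utils/jersey_detection.py):

```python
# Example numbers used in the prompt; real jersey numbers are typically 0-99,
# so seeing one of these means the model echoed the prompt rather than the image.
HALLUCINATION_NUMBERS = {101, 102, 103, 142, 199}

def filter_hallucinations(detections: list[int]) -> tuple[list[int], int]:
    """Drop known prompt-example numbers and report how many were filtered."""
    kept = [n for n in detections if n not in HALLUCINATION_NUMBERS]
    return kept, len(detections) - len(kept)
```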

The hallucination filter is implemented in both:

  • scan_utils/jersey_detection.py - Core detection class
  • test_jersey_detection.py - Test runner

Test results track hallucination statistics including:

  • Total hallucinated detections filtered
  • Hallucination rate percentage
  • Per-image hallucination counts

Prompt Files

Two prompt templates are provided for jersey detection:

File                                Description
jersey_prompt.txt                   Basic prompt for jersey detection without confidence scores
jersey_prompt_with_confidence.txt   Enhanced prompt with confidence scoring (0-100 scale)

The confidence prompt includes scoring guidelines:

  • 90-100: Extremely clear and unambiguous
  • 70-89: Clear but minor occlusion/angle issues
  • 50-69: Partially visible or somewhat unclear
  • 30-49: Difficult to read but visible
  • 0-29: Very uncertain, barely visible
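These bands can be expressed as a simple lookup. A sketch (the labels mirror the guideline text; the function itself is hypothetical, not part of the prompt or scripts):

```python
def confidence_band(score: int) -> str:
    """Map a 0-100 confidence score onto the prompt's scoring bands."""
    if score >= 90:
        return "extremely clear"
    if score >= 70:
        return "clear, minor occlusion"
    if score >= 50:
        return "partially visible"
    if score >= 30:
        return "difficult to read"
    return "very uncertain"
```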

Llama-swap Configuration

This project supports llama-swap for automatic model switching during batch testing.

Configuration file: llama-swap-config.yaml

The config includes 8 pre-configured vision-language models:

Model Tag              Parameters   Quantization
lfm2-vl-1.6b           1.6B         F16
gemma-3-4b             4B           F16
kimi-vl-3b             3B           F16
qwen2.5-vl-7b          7B           F16
gemma-3-12b            12B          F16
mistral-small-24b-q8   24B          Q8_K_XL
mistral-small-24b-q4   24B          Q4_K_XL
gemma-3-27b            27B          Q8_0

See docs/LLAMA_SWAP_SETUP.md for server setup instructions.
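For orientation, a llama-swap entry generally pairs a model tag with the llama-server command used to launch it. A minimal sketch, with illustrative paths and flags (see the actual llama-swap-config.yaml for the real entries):

```yaml
models:
  "qwen2.5-vl-7b":
    # llama-swap substitutes ${PORT} with the port it proxies requests to
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen2.5-VL-7B-Instruct-F16.gguf
      --mmproj /models/qwen2.5-vl-7b-mmproj.gguf
```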

Key Findings

  1. Top Recommendation: qwen2.5-vl-7b (72.9% F1 score)
  2. Best Confidence Calibration: gemma-3-27b
  3. Speed Champion: gemma-3-4b (7.9s/img, 63.8% F1)
  4. Confidence threshold of 85+ recommended for filtering uncertain detections