# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.
## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py                 # Core detection class using the VLM
│   └── llama_cpp_client.py                 # Client for the llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md                 # Server setup instructions
├── test_images/                            # Place test images here
├── test_images_output/                     # Output directory for annotated images
├── test_jersey_detection.py                # Main test runner
├── analyze_jersey_results.py               # Results analysis script
├── test_all_models.sh                      # Batch testing shell script
├── jersey_prompt.txt                       # Basic detection prompt
├── jersey_prompt_with_confidence.txt       # Prompt with confidence scoring
└── jersey_detection_results.jsonl          # Historical test results
```
## Prerequisites

- Python 3.10+
- A llama.cpp server running a vision-language model
- Test images with ground truth encoded in their filenames
## Test Image Naming Convention

Test images should follow this naming pattern to encode the ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
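The convention above can be parsed with a few lines of Python. This is an illustrative sketch, not the project's actual parser (which lives in `test_jersey_detection.py`); the function name is hypothetical:

```python
from pathlib import Path


def ground_truth_from_filename(path):
    """Extract ground-truth jersey numbers from a filename.

    'game1-23-45-7.jpg' -> [23, 45, 7]
    Hypothetical helper illustrating the naming convention.
    """
    stem = Path(path).stem        # e.g. 'game1-23-45-7'
    parts = stem.split("-")[1:]   # drop the prefix segment
    return [int(p) for p in parts if p.isdigit()]
```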
## Running Tests

### Single Model Test

```shell
python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl
```
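Under the hood, the test runner sends each image to the llama.cpp server. A minimal sketch of that request, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` endpoint with multimodal input enabled (function names here are illustrative; see `scan_utils/llama_cpp_client.py` for the real client):

```python
import base64
import json
import urllib.request

SERVER_URL = "http://localhost:8080"  # adjust to match --server-url


def build_payload(image_bytes, prompt):
    """Build an OpenAI-style chat request with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic output for repeatable tests
    }


def detect_jerseys(image_path, prompt):
    """Send one image to the server and return the model's text reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    req = urllib.request.Request(
        f"{SERVER_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```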
### Batch Testing All Models

```shell
./test_all_models.sh
```

Edit the variables at the top of the script to configure:

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to the llama-swap config for the model list
## Analyzing Results

```shell
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - export results to CSV
- `--filter-model "model_name"` - filter by model name
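The core of the analysis is a per-model F1 over the JSONL results. A minimal sketch, assuming each line carries `model`, `detected`, and `ground_truth` fields (field names are illustrative; the actual schema is defined by `analyze_jersey_results.py`):

```python
import json
from collections import defaultdict


def f1_by_model(jsonl_path):
    """Aggregate set-based precision/recall/F1 per model over a results file."""
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            det, gt = set(rec["detected"]), set(rec["ground_truth"])
            m = rec["model"]
            tp[m] += len(det & gt)   # correct detections
            fp[m] += len(det - gt)   # hallucinated numbers
            fn[m] += len(gt - det)   # missed numbers
    scores = {}
    for m in tp:
        p = tp[m] / (tp[m] + fp[m]) if tp[m] + fp[m] else 0.0
        r = tp[m] / (tp[m] + fn[m]) if tp[m] + fn[m] else 0.0
        scores[m] = 2 * p * r / (p + r) if p + r else 0.0
    return scores
```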
## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:
| Model | F1 Score | Avg Time/Image | Avg Confidence |
|---|---|---|---|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |
See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.
## Key Findings

- **Top recommendation:** qwen2.5-vl-7b (72.9% F1 score)
- **Best confidence calibration:** gemma-3-27b
- **Speed champion:** gemma-3-4b (7.9 s/image, 63.8% F1)
- A confidence threshold of 85+ is recommended for filtering uncertain detections
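The recommended threshold can be applied as a simple post-processing filter. A sketch, assuming detections are dicts with a 0-100 `confidence` field (the field names are illustrative, not the project's confirmed schema):

```python
def filter_confident(detections, threshold=85):
    """Drop detections below the confidence threshold (0-100 scale).

    Detections missing a confidence score are treated as uncertain
    and discarded.
    """
    return [d for d in detections if d.get("confidence", 0) >= threshold]
```

Raising the threshold trades recall for precision, which suits workflows where a hallucinated jersey number is costlier than a missed one.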