# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.
## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py                 # Core detection class using the VLM
│   └── llama_cpp_client.py                 # Client for the llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md                 # Server setup instructions
├── test_images/                            # Place test images here
├── test_images_output/                     # Output directory for annotated images
├── test_jersey_detection.py                # Main test runner
├── analyze_jersey_results.py               # Results analysis script
├── test_all_models.sh                      # Batch testing shell script
├── jersey_prompt.txt                       # Basic detection prompt
├── jersey_prompt_with_confidence.txt       # Prompt with confidence scoring
└── jersey_detection_results.jsonl          # Historical test results
```
## Prerequisites

- Python 3.10+
- A llama.cpp server running a vision-language model
- Test images with ground truth encoded in their filenames
## Test Image Naming Convention

Test images should follow this naming pattern to encode the ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
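The convention above can be parsed with a few lines of Python. This is an illustrative sketch, not the project's actual parser (which lives in `test_jersey_detection.py`); the function name is hypothetical:

```python
from pathlib import Path


def ground_truth_from_filename(path):
    """Extract ground-truth jersey numbers from a filename.

    'game1-23-45-7.jpg' -> [23, 45, 7]
    Hypothetical helper illustrating the naming convention.
    """
    stem = Path(path).stem        # e.g. 'game1-23-45-7'
    parts = stem.split("-")[1:]   # drop the prefix segment
    return [int(p) for p in parts if p.isdigit()]
```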
## Running Tests

### Single Model Test

```shell
python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl
```
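Under the hood, the test runner sends each image to the llama.cpp server. A minimal sketch of that request, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` endpoint with multimodal input enabled (function names here are illustrative; see `scan_utils/llama_cpp_client.py` for the real client):

```python
import base64
import json
import urllib.request

SERVER_URL = "http://localhost:8080"  # adjust to match --server-url


def build_payload(image_bytes, prompt):
    """Build an OpenAI-style chat request with an inline base64 image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic output for repeatable tests
    }


def detect_jerseys(image_path, prompt):
    """Send one image to the server and return the model's text reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    req = urllib.request.Request(
        f"{SERVER_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```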
### Batch Testing All Models

```shell
./test_all_models.sh
```

Edit the variables at the top of the script to configure:

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to the llama-swap config for the model list
## Analyzing Results

```shell
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - export results to CSV
- `--filter-model "model_name"` - filter by model name
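The core of the analysis is a per-model F1 over the JSONL results. A minimal sketch, assuming each line carries `model`, `detected`, and `ground_truth` fields (field names are illustrative; the actual schema is defined by `analyze_jersey_results.py`):

```python
import json
from collections import defaultdict


def f1_by_model(jsonl_path):
    """Aggregate set-based precision/recall/F1 per model over a results file."""
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            det, gt = set(rec["detected"]), set(rec["ground_truth"])
            m = rec["model"]
            tp[m] += len(det & gt)   # correct detections
            fp[m] += len(det - gt)   # hallucinated numbers
            fn[m] += len(gt - det)   # missed numbers
    scores = {}
    for m in tp:
        p = tp[m] / (tp[m] + fp[m]) if tp[m] + fp[m] else 0.0
        r = tp[m] / (tp[m] + fn[m]) if tp[m] + fn[m] else 0.0
        scores[m] = 2 * p * r / (p + r) if p + r else 0.0
    return scores
```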
## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:
| Model | F1 Score | Avg Time/Image | Avg Confidence |
|---|---|---|---|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |
See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.
## Key Findings

- **Top recommendation:** qwen2.5-vl-7b (72.9% F1 score)
- **Best confidence calibration:** gemma-3-27b
- **Speed champion:** gemma-3-4b (7.9 s/image, 63.8% F1)
- A confidence threshold of 85+ is recommended for filtering uncertain detections
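The recommended threshold can be applied as a simple post-processing filter. A sketch, assuming detections are dicts with a 0-100 `confidence` field (the field names are illustrative, not the project's confirmed schema):

```python
def filter_confident(detections, threshold=85):
    """Drop detections below the confidence threshold (0-100 scale).

    Detections missing a confidence score are treated as uncertain
    and discarded.
    """
    return [d for d in detections if d.get("confidence", 0) >= threshold]
```

Raising the threshold trades recall for precision, which suits workflows where a hallucinated jersey number is costlier than a missed one.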