Initial commit: Jersey detection test suite

Test scripts and utilities for evaluating vision-language models
on jersey number detection using llama.cpp server.
2026-01-20 13:37:01 -07:00
commit 8706edcd13
14 changed files with 3080 additions and 0 deletions

README.md

# Jersey Detection Testing
This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.
## Directory Structure
```
jersey_test/
├── scan_utils/
│ ├── jersey_detection.py # Core detection class using VLM
│ └── llama_cpp_client.py # Client for llama.cpp server
├── docs/
│ ├── JERSEY_DETECTION_MODEL_ANALYSIS.md # Model comparison results
│ └── LLAMA_SWAP_SETUP.md # Server setup instructions
├── test_images/ # Place test images here
├── test_images_output/ # Output directory for annotated images
├── test_jersey_detection.py # Main test runner
├── analyze_jersey_results.py # Results analysis script
├── test_all_models.sh # Batch testing shell script
├── jersey_prompt.txt # Basic detection prompt
├── jersey_prompt_with_confidence.txt # Prompt with confidence scoring
└── jersey_detection_results.jsonl # Historical test results
```
## Prerequisites
- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames
## Test Image Naming Convention
Test images should follow this naming pattern to encode ground truth:
```
prefix-number1-number2-number3.jpg
```
Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
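The convention above is simple to parse in a few lines of Python; a minimal sketch (the helper name `ground_truth_from_filename` is illustrative, not part of the project's code):

```python
from pathlib import Path

def ground_truth_from_filename(path: str) -> list[str]:
    """Extract jersey numbers encoded after the prefix in a test image filename."""
    stem = Path(path).stem          # e.g. "game1-23-45-7"
    parts = stem.split("-")[1:]     # drop the prefix
    return [p for p in parts if p.isdigit()]

print(ground_truth_from_filename("game1-23-45-7.jpg"))  # ['23', '45', '7']
```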
## Running Tests
### Single Model Test
```bash
python test_jersey_detection.py \
--images-dir ./test_images \
--prompt-file jersey_prompt_with_confidence.txt \
--server-url http://localhost:8080 \
--resize 1024 \
--output jersey_detection_results.jsonl
```
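Under the hood, the runner sends each image to the server via `scan_utils/llama_cpp_client.py`. As a rough sketch of what one such request looks like, assuming a recent llama-server build exposing its OpenAI-compatible `/v1/chat/completions` endpoint (the payload shape and helper name here are illustrative, not the project's actual client API):

```python
import base64
import json
import urllib.request

def build_request(image_path: str, prompt: str, server_url: str) -> urllib.request.Request:
    """Build a chat-completions request with the image inlined as a base64 data URI."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic output helps comparisons across runs
    }
    return urllib.request.Request(
        f"{server_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
```

Sending the request with `urllib.request.urlopen(req)` returns the model's JSON response; the test runner additionally handles resizing and result logging.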
### Batch Testing All Models
```bash
./test_all_models.sh
```
Edit variables at the top of the script to configure:
- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to llama-swap config for model list
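The block at the top of the script might look like this (variable names are the ones listed above; the values are illustrative defaults, so edit them to match your environment):

```bash
# Illustrative defaults -- adjust paths and URL for your setup.
IMAGES_DIR="./test_images"
PROMPT_FILE="jersey_prompt_with_confidence.txt"
SERVER_URL="http://localhost:8080"
LLAMA_SWAP_CONFIG="$HOME/.config/llama-swap/config.yaml"
```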
### Analyzing Results
```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```
Options:
- `--csv output.csv` - Export results to CSV
- `--filter-model "model_name"` - Filter by model name
## Historical Results
The `jersey_detection_results.jsonl` file contains results from 6 test runs:
| Model | F1 Score | Avg Time/Image | Avg Confidence |
|-------|----------|----------------|----------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |
See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.
## Key Findings
1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/img, 63.8% F1)
4. A confidence threshold of 85+ is recommended for filtering out uncertain detections
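The F1 scores and the 85+ confidence filter above can be computed as in this sketch (multiset matching of detected vs. ground-truth numbers; the function names and the detection dict shape are assumptions, not the project's actual API):

```python
from collections import Counter

def f1_score(predicted: list[str], truth: list[str]) -> float:
    """Multiset F1: each ground-truth number matches at most one prediction."""
    if not predicted or not truth:
        return 0.0
    overlap = sum((Counter(predicted) & Counter(truth)).values())
    precision = overlap / len(predicted)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

def filter_confident(detections: list[dict], threshold: float = 85.0) -> list[str]:
    """Keep only numbers whose reported confidence meets the threshold."""
    return [d["number"] for d in detections if d.get("confidence", 0) >= threshold]

dets = [{"number": "23", "confidence": 91}, {"number": "8", "confidence": 62}]
print(f1_score(filter_confident(dets), ["23", "45"]))  # ~0.667: one hit, one miss
```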