Initial commit: Jersey detection test suite

Test scripts and utilities for evaluating vision-language models
on jersey number detection using llama.cpp server.
2026-01-20 13:37:01 -07:00
commit 8706edcd13
14 changed files with 3080 additions and 0 deletions

README.md

# Jersey Detection Testing
This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.
## Directory Structure
```
jersey_test/
├── scan_utils/
│ ├── jersey_detection.py # Core detection class using VLM
│ └── llama_cpp_client.py # Client for llama.cpp server
├── docs/
│ ├── JERSEY_DETECTION_MODEL_ANALYSIS.md # Model comparison results
│ └── LLAMA_SWAP_SETUP.md # Server setup instructions
├── test_images/ # Place test images here
├── test_images_output/ # Output directory for annotated images
├── test_jersey_detection.py # Main test runner
├── analyze_jersey_results.py # Results analysis script
├── test_all_models.sh # Batch testing shell script
├── jersey_prompt.txt # Basic detection prompt
├── jersey_prompt_with_confidence.txt # Prompt with confidence scoring
└── jersey_detection_results.jsonl # Historical test results
```
## Prerequisites
- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames
## Test Image Naming Convention
Test images should follow this naming pattern to encode ground truth:
```
prefix-number1-number2-number3.jpg
```
Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
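The convention above is simple to parse in a few lines of Python; a minimal sketch (the helper name `ground_truth_from_filename` is illustrative, not part of the project's code):

```python
from pathlib import Path

def ground_truth_from_filename(path: str) -> list[str]:
    """Extract jersey numbers encoded after the prefix in a test image filename."""
    stem = Path(path).stem          # e.g. "game1-23-45-7"
    parts = stem.split("-")[1:]     # drop the prefix
    return [p for p in parts if p.isdigit()]

print(ground_truth_from_filename("game1-23-45-7.jpg"))  # ['23', '45', '7']
```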
## Running Tests
### Single Model Test
```bash
python test_jersey_detection.py \
--images-dir ./test_images \
--prompt-file jersey_prompt_with_confidence.txt \
--server-url http://localhost:8080 \
--resize 1024 \
--output jersey_detection_results.jsonl
```
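Under the hood, the runner sends each image to the server via `scan_utils/llama_cpp_client.py`. As a rough sketch of what one such request looks like, assuming a recent llama-server build exposing its OpenAI-compatible `/v1/chat/completions` endpoint (the payload shape and helper name here are illustrative, not the project's actual client API):

```python
import base64
import json
import urllib.request

def build_request(image_path: str, prompt: str, server_url: str) -> urllib.request.Request:
    """Build a chat-completions request with the image inlined as a base64 data URI."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        "temperature": 0.0,  # deterministic output helps comparisons across runs
    }
    return urllib.request.Request(
        f"{server_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
```

Sending the request with `urllib.request.urlopen(req)` returns the model's JSON response; the test runner additionally handles resizing and result logging.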
### Batch Testing All Models
```bash
./test_all_models.sh
```
Edit variables at the top of the script to configure:
- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to llama-swap config for model list
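The block at the top of the script might look like this (variable names are the ones listed above; the values are illustrative defaults, so edit them to match your environment):

```bash
# Illustrative defaults -- adjust paths and URL for your setup.
IMAGES_DIR="./test_images"
PROMPT_FILE="jersey_prompt_with_confidence.txt"
SERVER_URL="http://localhost:8080"
LLAMA_SWAP_CONFIG="$HOME/.config/llama-swap/config.yaml"
```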
### Analyzing Results
```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```
Options:
- `--csv output.csv` - Export results to CSV
- `--filter-model "model_name"` - Filter by model name
## Historical Results
The `jersey_detection_results.jsonl` file contains results from 6 test runs:
| Model | F1 Score | Avg Time/Image | Avg Confidence |
|-------|----------|----------------|----------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |
See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.
## Key Findings
1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/img, 63.8% F1)
4. A confidence threshold of 85+ is recommended for filtering out uncertain detections
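The F1 scores and the 85+ confidence filter above can be computed as in this sketch (multiset matching of detected vs. ground-truth numbers; the function names and the detection dict shape are assumptions, not the project's actual API):

```python
from collections import Counter

def f1_score(predicted: list[str], truth: list[str]) -> float:
    """Multiset F1: each ground-truth number matches at most one prediction."""
    if not predicted or not truth:
        return 0.0
    overlap = sum((Counter(predicted) & Counter(truth)).values())
    precision = overlap / len(predicted)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall) if overlap else 0.0

def filter_confident(detections: list[dict], threshold: float = 85.0) -> list[str]:
    """Keep only numbers whose reported confidence meets the threshold."""
    return [d["number"] for d in detections if d.get("confidence", 0) >= threshold]

dets = [{"number": "23", "confidence": 91}, {"number": "8", "confidence": 62}]
print(f1_score(filter_confident(dets), ["23", "45"]))  # ~0.667: one hit, one miss
```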