# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.

## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py               # Core detection class using VLM
│   └── llama_cpp_client.py               # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md               # Server setup instructions
├── test_images/                          # Place test images here
├── test_images_output/                   # Output directory for annotated images
├── test_jersey_detection.py              # Main test runner
├── analyze_jersey_results.py             # Results analysis script
├── test_all_models.sh                    # Batch testing shell script
├── jersey_prompt.txt                     # Basic detection prompt
├── jersey_prompt_with_confidence.txt     # Prompt with confidence scoring
└── jersey_detection_results.jsonl        # Historical test results
```

## Prerequisites

- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames

## Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
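The naming convention can be decoded with a small helper. This is a sketch only; `parse_ground_truth` is a hypothetical name, not a function from this repository's scripts, and it assumes the prefix itself contains no `-`:

```python
from pathlib import Path


def parse_ground_truth(filename: str) -> list[str]:
    """Extract jersey numbers encoded after the prefix in a test image filename.

    Assumes the `prefix-number1-number2-...` convention, with no `-` in the prefix.
    """
    stem = Path(filename).stem  # "game1-23-45-7"
    parts = stem.split("-")
    return parts[1:]            # drop the prefix, keep the numbers


# parse_ground_truth("game1-23-45-7.jpg")  -> ["23", "45", "7"]
```

Keeping the numbers as strings (rather than `int`) preserves any leading zeros a jersey might display.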
## Running Tests

### Single Model Test

```bash
python test_jersey_detection.py \
  --images-dir ./test_images \
  --prompt-file jersey_prompt_with_confidence.txt \
  --server-url http://localhost:8080 \
  --resize 1024 \
  --output jersey_detection_results.jsonl
```

### Batch Testing All Models

```bash
./test_all_models.sh
```

Edit the variables at the top of the script to configure:

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to the llama-swap config providing the model list

### Analyzing Results

```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - export results to CSV
- `--filter-model "model_name"` - filter by model name

## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:

| Model | F1 Score | Avg Time/Image | Avg Confidence |
|-------|----------|----------------|----------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |

See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.

## Key Findings

1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/img, 63.8% F1)
4. A confidence threshold of 85+ is recommended for filtering uncertain detections
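The F1 scores above compare the numbers a model detects in an image against the ground truth encoded in the filename. A minimal sketch of a set-based per-image F1, as one way such a score could be computed (the actual logic in `analyze_jersey_results.py` may differ):

```python
def f1_score(predicted: set[str], truth: set[str]) -> float:
    """Set-based F1 between detected and ground-truth jersey numbers."""
    if not predicted and not truth:
        return 1.0  # nothing to find, nothing found
    tp = len(predicted & truth)                          # correctly detected numbers
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: model found 23 and 45 but missed 7
# f1_score({"23", "45"}, {"23", "45", "7"}) == 0.8
```

Treating detections as a set means duplicate detections of the same number are neither rewarded nor penalized.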