# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.

## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py                 # Core detection class using VLM
│   └── llama_cpp_client.py                 # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md                 # Server setup instructions
├── test_images/                            # Place test images here
├── test_images_output/                     # Output directory for annotated images
├── test_jersey_detection.py                # Main test runner
├── analyze_jersey_results.py               # Results analysis script
├── test_all_models.sh                      # Batch testing shell script
├── jersey_prompt.txt                       # Basic detection prompt
├── jersey_prompt_with_confidence.txt       # Prompt with confidence scoring
└── jersey_detection_results.jsonl          # Historical test results
```

## Prerequisites

- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames
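
The project's actual client lives in `scan_utils/llama_cpp_client.py`; as a point of reference, a minimal standalone sketch against llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint could look like the following. The `SERVER_URL`, both function names, and the JPEG-only assumption are illustrative, not the project's code.

```python
import base64
import json
import urllib.request

SERVER_URL = "http://localhost:8080"  # matches the server URL used in this README


def build_payload(image_path: str, prompt: str, model: str = "default") -> dict:
    """Build an OpenAI-style chat payload with the image inlined as base64."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }


def detect_jerseys(image_path: str, prompt: str) -> str:
    """POST the payload to the server and return the model's text reply."""
    req = urllib.request.Request(
        f"{SERVER_URL}/v1/chat/completions",
        data=json.dumps(build_payload(image_path, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The image travels as a base64 `data:` URI inside the message content, so no file upload step is needed; the server must be started with a multimodal projector for image inputs to work.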

## Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
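
One way to recover that ground truth in Python (the helper name, and the assumption that every hyphen-separated token after the prefix is a jersey number, are illustrative rather than the test runner's actual parsing code):

```python
from pathlib import Path


def ground_truth_from_filename(path: str) -> set[int]:
    """Parse expected jersey numbers from a name like 'game1-23-45-7.jpg'.

    Assumes the first hyphen-separated token is a prefix and every
    remaining numeric token is a jersey number.
    """
    tokens = Path(path).stem.split("-")
    return {int(t) for t in tokens[1:] if t.isdigit()}
```

For example, `ground_truth_from_filename("test_images/game1-23-45-7.jpg")` yields the set `{23, 45, 7}`.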

## Running Tests

### Single Model Test

```bash
python test_jersey_detection.py \
  --images-dir ./test_images \
  --prompt-file jersey_prompt_with_confidence.txt \
  --server-url http://localhost:8080 \
  --resize 1024 \
  --output jersey_detection_results.jsonl
```

### Batch Testing All Models

```bash
./test_all_models.sh
```

Edit the variables at the top of the script to configure:

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to llama-swap config for model list

### Analyzing Results

```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - Export results to CSV
- `--filter-model "model_name"` - Filter by model name
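
The F1 scores reported under Historical Results compare each model's predicted jersey numbers against the filename ground truth. Treating both as sets, the metric can be computed with a short sketch like this (the function is illustrative, not `analyze_jersey_results.py`'s actual API):

```python
def f1_score(predicted: set[int], actual: set[int]) -> float:
    """Set-based F1: harmonic mean of precision and recall over jersey numbers."""
    if not predicted and not actual:
        return 1.0  # nothing to find and nothing predicted counts as perfect
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting `{23}` when the image contains `{23, 45}` gives precision 1.0, recall 0.5, and F1 ≈ 0.67.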

## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:

| Model | F1 Score | Avg Time/Image | Avg Confidence |
|-------|----------|----------------|----------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |

See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.

## Key Findings

1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/img, 63.8% F1)
4. A confidence threshold of 85+ is recommended for filtering uncertain detections
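
Applying that threshold to the results file might look like the sketch below. The `detections`, `confidence`, and `number` field names are assumptions about the JSONL schema; check an actual record in `jersey_detection_results.jsonl` before relying on them.

```python
import json

CONFIDENCE_THRESHOLD = 85  # per finding 4: treat lower scores as uncertain


def filter_detections(jsonl_path: str, threshold: float = CONFIDENCE_THRESHOLD) -> list[dict]:
    """Keep only detections at or above the confidence threshold."""
    kept = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            # 'detections'/'confidence' are assumed field names, not the
            # verified schema of jersey_detection_results.jsonl.
            for detection in record.get("detections", []):
                if detection.get("confidence", 0) >= threshold:
                    kept.append(detection)
    return kept
```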