# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.
## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py                 # Core detection class using VLM
│   └── llama_cpp_client.py                 # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md                 # Server setup instructions
├── test_images/                            # Place test images here
├── test_images_output/                     # Output directory for annotated images
├── test_jersey_detection.py                # Main test runner
├── analyze_jersey_results.py               # Results analysis script
├── test_all_models.sh                      # Batch testing shell script
├── jersey_prompt.txt                       # Basic detection prompt
├── jersey_prompt_with_confidence.txt       # Prompt with confidence scoring
├── llama-swap-config.yaml                  # Llama-swap model configuration
└── jersey_detection_results.jsonl          # Historical test results
```
## Prerequisites

- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames
## Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
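
Ground truth can be recovered from such a filename with a few lines of Python. This is a minimal sketch; the actual parsing lives in `test_jersey_detection.py` and may differ in details:

```python
from pathlib import Path

def ground_truth_from_filename(path: str) -> list[int]:
    """Parse expected jersey numbers from a name like 'game1-23-45-7.jpg'."""
    parts = Path(path).stem.split("-")
    # Everything after the prefix is treated as a jersey number.
    return [int(p) for p in parts[1:] if p.isdigit()]

assert ground_truth_from_filename("game1-23-45-7.jpg") == [23, 45, 7]
```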
## Running Tests

### Single Model Test

```bash
python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl
```
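
Under the hood, each image goes to the llama.cpp server as a single vision request. The sketch below shows the general shape of such a call against the server's OpenAI-compatible endpoint; the project's real client lives in `scan_utils/llama_cpp_client.py`, so treat the function name and parameters here as illustrative:

```python
import base64
import requests

def detect_jerseys(image_path: str, prompt: str,
                   server_url: str = "http://localhost:8080") -> str:
    """Send one image plus the detection prompt to a llama.cpp server."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        f"{server_url}/v1/chat/completions",
        json={
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
            "temperature": 0.0,  # deterministic output for repeatable tests
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```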
### Batch Testing All Models

```bash
./test_all_models.sh
```

Edit the variables at the top of the script to configure the run (a sketch of that block follows the list):

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to llama-swap config for model list
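
The values shown here are hypothetical defaults, not a copy of the actual script:

```bash
IMAGES_DIR="./test_images"
PROMPT_FILE="jersey_prompt_with_confidence.txt"
SERVER_URL="http://localhost:8080"
LLAMA_SWAP_CONFIG="llama-swap-config.yaml"
```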
### Analyzing Results

```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - Export results to CSV
- `--filter-model "model_name"` - Filter by model name
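
At its core, the analysis compares each image's detections against the filename ground truth and aggregates per-model precision, recall, and F1. The following is a hedged sketch of that computation; the JSONL field names `model`, `expected`, and `detected` are assumptions about the record schema:

```python
import json
import sys
from collections import defaultdict

per_model = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
with open(sys.argv[1]) as f:
    for line in f:
        rec = json.loads(line)
        expected, detected = set(rec["expected"]), set(rec["detected"])
        s = per_model[rec["model"]]
        s["tp"] += len(expected & detected)  # correct detections
        s["fp"] += len(detected - expected)  # spurious detections
        s["fn"] += len(expected - detected)  # missed jerseys

for model, s in sorted(per_model.items()):
    precision = s["tp"] / ((s["tp"] + s["fp"]) or 1)
    recall = s["tp"] / ((s["tp"] + s["fn"]) or 1)
    f1 = 2 * precision * recall / ((precision + recall) or 1)
    print(f"{model}: precision={precision:.1%} recall={recall:.1%} F1={f1:.1%}")
```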
## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:

| Model | F1 Score | Avg Time/Image | Avg Confidence (0-100) |
|-------|----------|----------------|------------------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |

See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.
## Hallucination Detection

Vision-language models can sometimes "hallucinate" by returning example jersey numbers from the prompt instead of actual detections from the image. To combat this, the detection code filters out known example numbers.

**Filtered numbers:** `101`, `102`, `103`, `142`, `199`

These numbers were deliberately chosen as examples in the prompt because real jersey numbers are typically 0-99. Any detection returning one of these numbers is flagged as a hallucination and excluded from the results.
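
In essence, the filter is a small set-membership check. A minimal sketch; detection dicts with a `number` key are an assumed shape, and the real code may differ:

```python
# Example numbers deliberately used in the prompt; real jerseys are 0-99.
EXAMPLE_NUMBERS = {101, 102, 103, 142, 199}

def filter_hallucinations(detections: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split detections into (kept, hallucinated) so callers can track stats."""
    kept = [d for d in detections if d["number"] not in EXAMPLE_NUMBERS]
    hallucinated = [d for d in detections if d["number"] in EXAMPLE_NUMBERS]
    return kept, hallucinated
```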
The hallucination filter is implemented in both:

- `scan_utils/jersey_detection.py` - Core detection class
- `test_jersey_detection.py` - Test runner

Test results track hallucination statistics, including:

- Total hallucinated detections filtered
- Hallucination rate percentage
- Per-image hallucination counts
## Prompt Files

Two prompt templates are provided for jersey detection:

| File | Description |
|------|-------------|
| [jersey_prompt.txt](jersey_prompt.txt) | Basic prompt for jersey detection without confidence scores |
| [jersey_prompt_with_confidence.txt](jersey_prompt_with_confidence.txt) | Enhanced prompt with confidence scoring (0-100 scale) |

The confidence prompt includes scoring guidelines:

- **90-100**: Extremely clear and unambiguous
- **70-89**: Clear but minor occlusion/angle issues
- **50-69**: Partially visible or somewhat unclear
- **30-49**: Difficult to read but visible
- **0-29**: Very uncertain, barely visible
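
The exact output schema is defined in the prompt file itself; a hypothetical response on this scale might look like:

```json
{"detections": [{"number": 23, "confidence": 91}, {"number": 7, "confidence": 58}]}
```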
## Llama-swap Configuration

This project supports [llama-swap](https://github.com/mostlygeek/llama-swap) for automatic model switching during batch testing.

**Configuration file:** [llama-swap-config.yaml](llama-swap-config.yaml)

The config includes 8 pre-configured vision-language models:

| Model Tag | Parameters | Quantization |
|-----------|------------|--------------|
| lfm2-vl-1.6b | 1.6B | F16 |
| gemma-3-4b | 4B | F16 |
| kimi-vl-3b | 3B | F16 |
| qwen2.5-vl-7b | 7B | F16 |
| gemma-3-12b | 12B | F16 |
| mistral-small-24b-q8 | 24B | Q8_K_XL |
| mistral-small-24b-q4 | 24B | Q4_K_XL |
| gemma-3-27b | 27B | Q8_0 |
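
For orientation, a llama-swap model entry pairs a model tag with the `llama-server` command used to launch it. The fragment below is a hedged sketch rather than a copy of the project's config; the model and mmproj paths are placeholders:

```yaml
models:
  "gemma-3-4b":
    cmd: >
      llama-server --port ${PORT}
      -m /path/to/gemma-3-4b-it-F16.gguf
      --mmproj /path/to/mmproj-gemma-3-4b-F16.gguf
    ttl: 300  # unload the model after 5 minutes of inactivity
```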
See [docs/LLAMA_SWAP_SETUP.md](docs/LLAMA_SWAP_SETUP.md) for server setup instructions.
## Key Findings

1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/image, 63.8% F1)
4. **Confidence Threshold**: 85+ recommended for filtering uncertain detections