# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.

## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py               # Core detection class using VLM
│   └── llama_cpp_client.py               # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md               # Server setup instructions
├── test_images/                          # Place test images here
├── test_images_output/                   # Output directory for annotated images
├── test_jersey_detection.py              # Main test runner
├── analyze_jersey_results.py             # Results analysis script
├── test_all_models.sh                    # Batch testing shell script
├── jersey_prompt.txt                     # Basic detection prompt
├── jersey_prompt_with_confidence.txt     # Prompt with confidence scoring
└── jersey_detection_results.jsonl        # Historical test results
```

## Prerequisites

- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames

## Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
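The naming convention can be decoded with a small helper. This is a sketch only; `parse_ground_truth` is a hypothetical name, not a function from this repository's scripts, and it assumes the prefix itself contains no `-`:

```python
from pathlib import Path


def parse_ground_truth(filename: str) -> list[str]:
    """Extract jersey numbers encoded after the prefix in a test image filename.

    Assumes the `prefix-number1-number2-...` convention, with no `-` in the prefix.
    """
    stem = Path(filename).stem  # "game1-23-45-7"
    parts = stem.split("-")
    return parts[1:]            # drop the prefix, keep the numbers


# parse_ground_truth("game1-23-45-7.jpg")  -> ["23", "45", "7"]
```

Keeping the numbers as strings (rather than `int`) preserves any leading zeros a jersey might display.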
## Running Tests

### Single Model Test

```bash
python test_jersey_detection.py \
  --images-dir ./test_images \
  --prompt-file jersey_prompt_with_confidence.txt \
  --server-url http://localhost:8080 \
  --resize 1024 \
  --output jersey_detection_results.jsonl
```

### Batch Testing All Models

```bash
./test_all_models.sh
```

Edit the variables at the top of the script to configure:

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to the llama-swap config providing the model list

### Analyzing Results

```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - export results to CSV
- `--filter-model "model_name"` - filter by model name

## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:

| Model | F1 Score | Avg Time/Image | Avg Confidence |
|-------|----------|----------------|----------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |

See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.

## Key Findings

1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/img, 63.8% F1)
4. A confidence threshold of 85+ is recommended for filtering uncertain detections
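The F1 scores above compare the numbers a model detects in an image against the ground truth encoded in the filename. A minimal sketch of a set-based per-image F1, as one way such a score could be computed (the actual logic in `analyze_jersey_results.py` may differ):

```python
def f1_score(predicted: set[str], truth: set[str]) -> float:
    """Set-based F1 between detected and ground-truth jersey numbers."""
    if not predicted and not truth:
        return 1.0  # nothing to find, nothing found
    tp = len(predicted & truth)                          # correctly detected numbers
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: model found 23 and 45 but missed 7
# f1_score({"23", "45"}, {"23", "45", "7"}) == 0.8
```

Treating detections as a set means duplicate detections of the same number are neither rewarded nor penalized.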