# Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.
## Directory Structure

```
jersey_test/
├── scan_utils/
│   ├── jersey_detection.py                 # Core detection class using VLM
│   └── llama_cpp_client.py                 # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md                 # Server setup instructions
├── test_images/                            # Place test images here
├── test_images_output/                     # Output directory for annotated images
├── test_jersey_detection.py                # Main test runner
├── analyze_jersey_results.py               # Results analysis script
├── test_all_models.sh                      # Batch testing shell script
├── jersey_prompt.txt                       # Basic detection prompt
├── jersey_prompt_with_confidence.txt       # Prompt with confidence scoring
├── llama-swap-config.yaml                  # Llama-swap model configuration
└── jersey_detection_results.jsonl          # Historical test results
```
## Prerequisites

- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames
## Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

```
prefix-number1-number2-number3.jpg
```

Example: `game1-23-45-7.jpg` contains jerseys with numbers 23, 45, and 7.
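
Ground truth can be recovered from such a filename with a few lines of Python. This is a minimal sketch; the actual parsing lives in `test_jersey_detection.py` and may differ in details:

```python
from pathlib import Path

def ground_truth_from_filename(path: str) -> list[int]:
    """Parse expected jersey numbers from a name like 'game1-23-45-7.jpg'."""
    parts = Path(path).stem.split("-")
    # Everything after the prefix is treated as a jersey number.
    return [int(p) for p in parts[1:] if p.isdigit()]

assert ground_truth_from_filename("game1-23-45-7.jpg") == [23, 45, 7]
```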
## Running Tests

### Single Model Test

```bash
python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl
```
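
Under the hood, each image goes to the llama.cpp server as a single vision request. The sketch below shows the general shape of such a call against the server's OpenAI-compatible endpoint; the project's real client lives in `scan_utils/llama_cpp_client.py`, so treat the function name and parameters here as illustrative:

```python
import base64
import requests

def detect_jerseys(image_path: str, prompt: str,
                   server_url: str = "http://localhost:8080") -> str:
    """Send one image plus the detection prompt to a llama.cpp server."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        f"{server_url}/v1/chat/completions",
        json={
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
            "temperature": 0.0,  # deterministic output for repeatable tests
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```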
### Batch Testing All Models

```bash
./test_all_models.sh
```

Edit the variables at the top of the script to configure the run (a sketch of that block follows the list):

- `IMAGES_DIR` - test images directory
- `PROMPT_FILE` - prompt file to use
- `SERVER_URL` - llama.cpp/llama-swap server URL
- `LLAMA_SWAP_CONFIG` - path to llama-swap config for model list
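
The values shown here are hypothetical defaults, not a copy of the actual script:

```bash
IMAGES_DIR="./test_images"
PROMPT_FILE="jersey_prompt_with_confidence.txt"
SERVER_URL="http://localhost:8080"
LLAMA_SWAP_CONFIG="llama-swap-config.yaml"
```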
### Analyzing Results

```bash
python analyze_jersey_results.py jersey_detection_results.jsonl
```

Options:

- `--csv output.csv` - Export results to CSV
- `--filter-model "model_name"` - Filter by model name
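
At its core, the analysis compares each image's detections against the filename ground truth and aggregates per-model precision, recall, and F1. The following is a hedged sketch of that computation; the JSONL field names `model`, `expected`, and `detected` are assumptions about the record schema:

```python
import json
import sys
from collections import defaultdict

per_model = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
with open(sys.argv[1]) as f:
    for line in f:
        rec = json.loads(line)
        expected, detected = set(rec["expected"]), set(rec["detected"])
        s = per_model[rec["model"]]
        s["tp"] += len(expected & detected)  # correct detections
        s["fp"] += len(detected - expected)  # spurious detections
        s["fn"] += len(expected - detected)  # missed jerseys

for model, s in sorted(per_model.items()):
    precision = s["tp"] / ((s["tp"] + s["fp"]) or 1)
    recall = s["tp"] / ((s["tp"] + s["fn"]) or 1)
    f1 = 2 * precision * recall / ((precision + recall) or 1)
    print(f"{model}: precision={precision:.1%} recall={recall:.1%} F1={f1:.1%}")
```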
## Historical Results

The `jersey_detection_results.jsonl` file contains results from 6 test runs:

| Model | F1 Score | Avg Time/Image | Avg Confidence (0-100) |
|-------|----------|----------------|------------------------|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |

See `docs/JERSEY_DETECTION_MODEL_ANALYSIS.md` for detailed analysis.
## Hallucination Detection

Vision-language models can sometimes "hallucinate" by returning example jersey numbers from the prompt instead of actual detections from the image. To combat this, the detection code filters out known example numbers.

**Filtered numbers:** `101`, `102`, `103`, `142`, `199`

These numbers were deliberately chosen as examples in the prompt because real jersey numbers are typically 0-99. Any detection returning one of these numbers is flagged as a hallucination and excluded from the results.
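
In essence, the filter is a small set-membership check. A minimal sketch; detection dicts with a `number` key are an assumed shape, and the real code may differ:

```python
# Example numbers deliberately used in the prompt; real jerseys are 0-99.
EXAMPLE_NUMBERS = {101, 102, 103, 142, 199}

def filter_hallucinations(detections: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split detections into (kept, hallucinated) so callers can track stats."""
    kept = [d for d in detections if d["number"] not in EXAMPLE_NUMBERS]
    hallucinated = [d for d in detections if d["number"] in EXAMPLE_NUMBERS]
    return kept, hallucinated
```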
The hallucination filter is implemented in both:

- `scan_utils/jersey_detection.py` - Core detection class
- `test_jersey_detection.py` - Test runner

Test results track hallucination statistics, including:

- Total hallucinated detections filtered
- Hallucination rate percentage
- Per-image hallucination counts
## Prompt Files

Two prompt templates are provided for jersey detection:

| File | Description |
|------|-------------|
| [jersey_prompt.txt](jersey_prompt.txt) | Basic prompt for jersey detection without confidence scores |
| [jersey_prompt_with_confidence.txt](jersey_prompt_with_confidence.txt) | Enhanced prompt with confidence scoring (0-100 scale) |

The confidence prompt includes scoring guidelines:

- **90-100**: Extremely clear and unambiguous
- **70-89**: Clear but minor occlusion/angle issues
- **50-69**: Partially visible or somewhat unclear
- **30-49**: Difficult to read but visible
- **0-29**: Very uncertain, barely visible
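
The exact output schema is defined in the prompt file itself; a hypothetical response on this scale might look like:

```json
{"detections": [{"number": 23, "confidence": 91}, {"number": 7, "confidence": 58}]}
```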
## Llama-swap Configuration

This project supports [llama-swap](https://github.com/mostlygeek/llama-swap) for automatic model switching during batch testing.

**Configuration file:** [llama-swap-config.yaml](llama-swap-config.yaml)

The config includes 8 pre-configured vision-language models:

| Model Tag | Parameters | Quantization |
|-----------|------------|--------------|
| lfm2-vl-1.6b | 1.6B | F16 |
| gemma-3-4b | 4B | F16 |
| kimi-vl-3b | 3B | F16 |
| qwen2.5-vl-7b | 7B | F16 |
| gemma-3-12b | 12B | F16 |
| mistral-small-24b-q8 | 24B | Q8_K_XL |
| mistral-small-24b-q4 | 24B | Q4_K_XL |
| gemma-3-27b | 27B | Q8_0 |
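
For orientation, a llama-swap model entry pairs a model tag with the `llama-server` command used to launch it. The fragment below is a hedged sketch rather than a copy of the project's config; the model and mmproj paths are placeholders:

```yaml
models:
  "gemma-3-4b":
    cmd: >
      llama-server --port ${PORT}
      -m /path/to/gemma-3-4b-it-F16.gguf
      --mmproj /path/to/mmproj-gemma-3-4b-F16.gguf
    ttl: 300  # unload the model after 5 minutes of inactivity
```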
See [docs/LLAMA_SWAP_SETUP.md](docs/LLAMA_SWAP_SETUP.md) for server setup instructions.
## Key Findings

1. **Top Recommendation**: qwen2.5-vl-7b (72.9% F1 score)
2. **Best Confidence Calibration**: gemma-3-27b
3. **Speed Champion**: gemma-3-4b (7.9s/image, 63.8% F1)
4. **Confidence Threshold**: 85+ recommended for filtering uncertain detections