Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.

Directory Structure

jersey_test/
├── scan_utils/
│   ├── jersey_detection.py      # Core detection class using VLM
│   └── llama_cpp_client.py      # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md      # Server setup instructions
├── test_images/                  # Place test images here
├── test_images_output/           # Output directory for annotated images
├── test_jersey_detection.py      # Main test runner
├── analyze_jersey_results.py     # Results analysis script
├── test_all_models.sh            # Batch testing shell script
├── jersey_prompt.txt             # Basic detection prompt
├── jersey_prompt_with_confidence.txt  # Prompt with confidence scoring
└── jersey_detection_results.jsonl     # Historical test results

Prerequisites

  • Python 3.10+
  • llama.cpp server running with a vision-language model
  • Test images with ground truth encoded in filenames

Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

prefix-number1-number2-number3.jpg

Example: game1-23-45-7.jpg contains jerseys with numbers 23, 45, and 7.
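The ground-truth extraction implied by this convention can be sketched as follows. This is a minimal illustration, not the project's actual parser (the real logic lives in test_jersey_detection.py); the function name is hypothetical.

```python
from pathlib import Path

def ground_truth_from_filename(path):
    """Extract jersey numbers encoded after the prefix,
    e.g. 'game1-23-45-7.jpg' -> {'23', '45', '7'}."""
    stem = Path(path).stem          # 'game1-23-45-7'
    parts = stem.split("-")[1:]     # drop the prefix segment
    return {p for p in parts if p.isdigit()}

print(ground_truth_from_filename("game1-23-45-7.jpg"))
```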

Running Tests

Single Model Test

python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl
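Under the hood, the client talks to llama.cpp's OpenAI-compatible chat endpoint, where images are passed inline as base64 data URLs. The sketch below shows what such a request payload looks like; `build_request` is a hypothetical helper, and the actual wire format used by scan_utils/llama_cpp_client.py may differ.

```python
import base64

def build_request(image_bytes, prompt, model="qwen2.5-vl-7b"):
    """Build an OpenAI-style multimodal chat payload for a
    llama.cpp server (assumed /v1/chat/completions endpoint)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

payload = build_request(b"\xff\xd8fake-jpeg-bytes",
                        "List all visible jersey numbers.")
```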

Batch Testing All Models

./test_all_models.sh

Edit variables at the top of the script to configure:

  • IMAGES_DIR - test images directory
  • PROMPT_FILE - prompt file to use
  • SERVER_URL - llama.cpp/llama-swap server URL
  • LLAMA_SWAP_CONFIG - path to llama-swap config for model list

Analyzing Results

python analyze_jersey_results.py jersey_detection_results.jsonl

Options:

  • --csv output.csv - Export results to CSV
  • --filter-model "model_name" - Filter by model name
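The --filter-model behavior can also be reproduced by hand when inspecting the JSONL directly. A minimal sketch, assuming each line carries a "model" field (field names here are illustrative, not confirmed from the script):

```python
import json

def filter_results(lines, model_name):
    """Yield parsed records whose 'model' field matches model_name."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("model") == model_name:
            yield rec

# Usage:
#   with open("jersey_detection_results.jsonl") as fh:
#       rows = list(filter_results(fh, "gemma-3-27b"))
```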

Historical Results

The jersey_detection_results.jsonl file contains results from 6 test runs:

| Model                      | F1 Score | Avg Time/Image | Avg Confidence |
|----------------------------|----------|----------------|----------------|
| qwen2.5-vl-7b              | 72.9%    | -              | -              |
| gemma-3-27b                | 72.1%    | 18.1s          | 87.1           |
| Mistral-Small-3.2-24B (Q4) | -        | 14.2s          | 92.1           |
| Kimi-VL-A3B-Thinking       | -        | 29.1s          | 88.9           |

See docs/JERSEY_DETECTION_MODEL_ANALYSIS.md for detailed analysis.

Key Findings

  1. Top Recommendation: qwen2.5-vl-7b (72.9% F1 score)
  2. Best Confidence Calibration: gemma-3-27b
  3. Speed Champion: gemma-3-4b (7.9s/img, 63.8% F1)
  4. Filtering: a confidence threshold of 85+ is recommended to drop uncertain detections
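To make the scoring and filtering concrete, here is a set-based F1 computation for a single image combined with the 85+ confidence cutoff. This is an illustrative sketch of how the metric works, not the exact logic in analyze_jersey_results.py.

```python
def f1_for_image(predicted, ground_truth):
    """Set-based F1 for one image; inputs are sets of
    jersey-number strings."""
    if not predicted and not ground_truth:
        return 1.0
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Apply the recommended confidence cutoff before scoring:
detections = [("23", 91), ("45", 78), ("7", 95)]  # (number, confidence)
kept = {n for n, conf in detections if conf >= 85}
print(f1_for_image(kept, {"23", "45", "7"}))
```

With the cutoff, "45" (confidence 78) is dropped, so precision is perfect but recall falls to 2/3, giving an F1 of 0.8.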