Rick McEwen 5405d7f7dc Add accuracy test framework, prompts, results, and analysis reports
Includes accuracy test scripts for Qwen (local) and Gemini (cloud API),
three prompt variants (original, capstone, constrained), test results
from all runs, and two analysis reports with an HTML presentation version.
2026-03-03 18:44:49 -07:00

Jersey Detection Testing

This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection tasks using llama.cpp.

Directory Structure

jersey_test/
├── scan_utils/
│   ├── jersey_detection.py      # Core detection class using VLM
│   └── llama_cpp_client.py      # Client for llama.cpp server
├── docs/
│   ├── JERSEY_DETECTION_MODEL_ANALYSIS.md  # Model comparison results
│   └── LLAMA_SWAP_SETUP.md      # Server setup instructions
├── test_images/                  # Place test images here
├── test_images_output/           # Output directory for annotated images
├── test_jersey_detection.py      # Main test runner
├── analyze_jersey_results.py     # Results analysis script
├── test_all_models.sh            # Batch testing shell script
├── jersey_prompt.txt             # Basic detection prompt
├── jersey_prompt_with_confidence.txt  # Prompt with confidence scoring
└── jersey_detection_results.jsonl     # Historical test results

Prerequisites

  • Python 3.10+
  • llama.cpp server running with a vision-language model
  • Test images with ground truth encoded in filenames

Test Image Naming Convention

Test images should follow this naming pattern to encode ground truth:

prefix-number1-number2-number3.jpg

Example: game1-23-45-7.jpg contains jerseys with numbers 23, 45, and 7.
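As an illustration, the ground truth can be recovered from a filename with a small parser (the helper below is a hypothetical sketch, not part of the test scripts):

```python
from pathlib import Path

def parse_ground_truth(filename: str) -> tuple[str, list[int]]:
    """Split a test-image filename into its prefix and ground-truth jersey numbers.

    Expects the naming convention prefix-number1-number2-....jpg.
    """
    stem = Path(filename).stem          # "game1-23-45-7"
    prefix, *parts = stem.split("-")    # prefix has no dashes by convention
    return prefix, [int(p) for p in parts]

prefix, numbers = parse_ground_truth("game1-23-45-7.jpg")
# prefix == "game1", numbers == [23, 45, 7]
```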

Running Tests

Single Model Test

python test_jersey_detection.py \
    --images-dir ./test_images \
    --prompt-file jersey_prompt_with_confidence.txt \
    --server-url http://localhost:8080 \
    --resize 1024 \
    --output jersey_detection_results.jsonl

Batch Testing All Models

./test_all_models.sh

Edit variables at the top of the script to configure:

  • IMAGES_DIR - test images directory
  • PROMPT_FILE - prompt file to use
  • SERVER_URL - llama.cpp/llama-swap server URL
  • LLAMA_SWAP_CONFIG - path to llama-swap config for model list

Analyzing Results

python analyze_jersey_results.py jersey_detection_results.jsonl

Options:

  • --csv output.csv - Export results to CSV
  • --filter-model "model_name" - Filter by model name
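For a sense of what the analysis step computes, a minimal JSONL scorer might look like this. The `detected` and `ground_truth` field names are assumptions about the results schema, not confirmed by the scripts:

```python
import json

def score_run(jsonl_path: str) -> float:
    """Micro-averaged F1 over per-image detections in a JSONL results file.

    Assumes each line is a JSON object with "detected" and "ground_truth"
    lists of jersey numbers (hypothetical field names).
    """
    tp = fp = fn = 0
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            detected = set(rec["detected"])
            truth = set(rec["ground_truth"])
            tp += len(detected & truth)   # correct detections
            fp += len(detected - truth)   # spurious detections
            fn += len(truth - detected)   # missed jerseys
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```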

Historical Results

The jersey_detection_results.jsonl file contains results from 6 test runs:

Model                        F1 Score   Avg Time/Image   Avg Confidence
qwen2.5-vl-7b                72.9%      -                -
gemma-3-27b                  72.1%      18.1s            87.1
Mistral-Small-3.2-24B (Q4)   -          14.2s            92.1
Kimi-VL-A3B-Thinking         -          29.1s            88.9

See docs/JERSEY_DETECTION_MODEL_ANALYSIS.md for detailed analysis.

Hallucination Detection

Vision-language models can sometimes "hallucinate" by returning example jersey numbers from the prompt instead of actual detections from the image. To combat this, the detection code filters out known example numbers.

Filtered numbers: 101, 102, 103, 142, 199

These numbers were deliberately chosen as examples in the prompt because real jersey numbers are typically 0-99. Any detection returning these numbers is flagged as a hallucination and excluded from results.
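The filtering idea amounts to a set-membership check. A minimal sketch (the function name and return shape are illustrative, not the actual implementation in scan_utils/jersey_detection.py):

```python
# Example numbers used in the prompt; real jersey numbers are typically 0-99,
# so seeing one of these means the model echoed the prompt rather than the image.
HALLUCINATION_NUMBERS = {101, 102, 103, 142, 199}

def filter_hallucinations(detections: list[int]) -> tuple[list[int], int]:
    """Drop known prompt-example numbers and report how many were filtered."""
    kept = [n for n in detections if n not in HALLUCINATION_NUMBERS]
    return kept, len(detections) - len(kept)
```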

The hallucination filter is implemented in both:

  • scan_utils/jersey_detection.py - Core detection class
  • test_jersey_detection.py - Test runner

Test results track hallucination statistics including:

  • Total hallucinated detections filtered
  • Hallucination rate percentage
  • Per-image hallucination counts

Prompt Files

Two prompt templates are provided for jersey detection:

File                                Description
jersey_prompt.txt                   Basic prompt for jersey detection without confidence scores
jersey_prompt_with_confidence.txt   Enhanced prompt with confidence scoring (0-100 scale)

The confidence prompt includes scoring guidelines:

  • 90-100: Extremely clear and unambiguous
  • 70-89: Clear but minor occlusion/angle issues
  • 50-69: Partially visible or somewhat unclear
  • 30-49: Difficult to read but visible
  • 0-29: Very uncertain, barely visible
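These bands can be expressed as a simple lookup. A sketch (the labels mirror the guideline text; the function itself is hypothetical, not part of the prompt or scripts):

```python
def confidence_band(score: int) -> str:
    """Map a 0-100 confidence score onto the prompt's scoring bands."""
    if score >= 90:
        return "extremely clear"
    if score >= 70:
        return "clear, minor occlusion"
    if score >= 50:
        return "partially visible"
    if score >= 30:
        return "difficult to read"
    return "very uncertain"
```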

Llama-swap Configuration

This project supports llama-swap for automatic model switching during batch testing.

Configuration file: llama-swap-config.yaml

The config includes 8 pre-configured vision-language models:

Model Tag              Parameters   Quantization
lfm2-vl-1.6b           1.6B         F16
gemma-3-4b             4B           F16
kimi-vl-3b             3B           F16
qwen2.5-vl-7b          7B           F16
gemma-3-12b            12B          F16
mistral-small-24b-q8   24B          Q8_K_XL
mistral-small-24b-q4   24B          Q4_K_XL
gemma-3-27b            27B          Q8_0

See docs/LLAMA_SWAP_SETUP.md for server setup instructions.
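For orientation, a llama-swap entry generally pairs a model tag with the llama-server command used to launch it. A minimal sketch, with illustrative paths and flags (see the actual llama-swap-config.yaml for the real entries):

```yaml
models:
  "qwen2.5-vl-7b":
    # llama-swap substitutes ${PORT} with the port it proxies requests to
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen2.5-VL-7B-Instruct-F16.gguf
      --mmproj /models/qwen2.5-vl-7b-mmproj.gguf
```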

Key Findings

  1. Top Recommendation: qwen2.5-vl-7b (72.9% F1 score)
  2. Best Confidence Calibration: gemma-3-27b
  3. Speed Champion: gemma-3-4b (7.9s/img, 63.8% F1)
  4. Confidence threshold of 85+ recommended for filtering uncertain detections