Jersey Detection Testing
This project contains test scripts, results, and utilities for evaluating vision-language models on jersey number detection using llama.cpp.
Color Testing Files
- test_color_variety.py: named-color test for local llama.cpp VLM
- test_color_variety_gemini.py: named-color test for Gemini 3 Flash API
- test_hex_color_specificity.py: hex color specificity test for Gemini
- test_hex_color_specificity_llama.py: hex color specificity test for local VLM
- jersey_prompt_hex_color.txt: prompt requesting hex color codes
- COLOR_TEST_REPORT.md: analysis report comparing 3 models across 5 tests
- color_test_results.md: raw test output from all runs
Directory Structure
jersey_test/
├── scan_utils/
│ ├── jersey_detection.py # Core detection class using VLM
│ └── llama_cpp_client.py # Client for llama.cpp server
├── docs/
│ ├── JERSEY_DETECTION_MODEL_ANALYSIS.md # Model comparison results
│ └── LLAMA_SWAP_SETUP.md # Server setup instructions
├── test_images/ # Place test images here
├── test_images_output/ # Output directory for annotated images
├── test_jersey_detection.py # Main test runner
├── analyze_jersey_results.py # Results analysis script
├── test_all_models.sh # Batch testing shell script
├── jersey_prompt.txt # Basic detection prompt
├── jersey_prompt_with_confidence.txt # Prompt with confidence scoring
└── jersey_detection_results.jsonl # Historical test results
Prerequisites
- Python 3.10+
- llama.cpp server running with a vision-language model
- Test images with ground truth encoded in filenames
Test Image Naming Convention
Test images should follow this naming pattern to encode ground truth:
prefix-number1-number2-number3.jpg
Example: game1-23-45-7.jpg contains jerseys with numbers 23, 45, and 7.
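A parser for this convention can be sketched as follows. The helper name `ground_truth_from_filename` is hypothetical; the actual test runner may parse filenames differently.

```python
from pathlib import Path

def ground_truth_from_filename(path: str) -> list[int]:
    """Parse jersey numbers encoded in a test-image filename.

    Expected pattern: prefix-number1-number2-number3.jpg
    e.g. "game1-23-45-7.jpg" -> [23, 45, 7]
    """
    stem = Path(path).stem          # "game1-23-45-7"
    parts = stem.split("-")
    # First segment is the prefix; remaining numeric segments are ground truth
    return [int(p) for p in parts[1:] if p.isdigit()]
```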
Running Tests
Single Model Test
python test_jersey_detection.py \
--images-dir ./test_images \
--prompt-file jersey_prompt_with_confidence.txt \
--server-url http://localhost:8080 \
--resize 1024 \
--output jersey_detection_results.jsonl
Batch Testing All Models
./test_all_models.sh
Edit variables at the top of the script to configure:
- IMAGES_DIR - test images directory
- PROMPT_FILE - prompt file to use
- SERVER_URL - llama.cpp/llama-swap server URL
- LLAMA_SWAP_CONFIG - path to llama-swap config for model list
Analyzing Results
python analyze_jersey_results.py jersey_detection_results.jsonl
Options:
- --csv output.csv - Export results to CSV
- --filter-model "model_name" - Filter by model name
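The F1 scores reported below compare detected numbers against the filename-encoded ground truth. A minimal per-image sketch, treating detections and ground truth as sets (the actual `analyze_jersey_results.py` may aggregate differently, e.g. micro-averaging over all images):

```python
def f1_score(predicted: set[int], truth: set[int]) -> float:
    """Per-image F1 over detected jersey numbers, set-matched."""
    tp = len(predicted & truth)     # correctly detected numbers
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```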
Historical Results
The jersey_detection_results.jsonl file contains results from 6 test runs, summarized below:
| Model | F1 Score | Avg Time/Image | Avg Confidence |
|---|---|---|---|
| qwen2.5-vl-7b | 72.9% | - | - |
| gemma-3-27b | 72.1% | 18.1s | 87.1 |
| Mistral-Small-3.2-24B (Q4) | - | 14.2s | 92.1 |
| Kimi-VL-A3B-Thinking | - | 29.1s | 88.9 |
See docs/JERSEY_DETECTION_MODEL_ANALYSIS.md for detailed analysis.
Hallucination Detection
Vision-language models can sometimes "hallucinate" by returning example jersey numbers from the prompt instead of actual detections from the image. To combat this, the detection code filters out known example numbers.
Filtered numbers: 101, 102, 103, 142, 199
These numbers were deliberately chosen as examples in the prompt because real jersey numbers are typically 0-99. Any detection returning these numbers is flagged as a hallucination and excluded from results.
The hallucination filter is implemented in both:
- scan_utils/jersey_detection.py - Core detection class
- test_jersey_detection.py - Test runner
Test results track hallucination statistics including:
- Total hallucinated detections filtered
- Hallucination rate percentage
- Per-image hallucination counts
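The filter described above can be sketched as a simple set-membership check. This is an illustrative version, not the exact code in `scan_utils/jersey_detection.py`; the detection dict shape (`"number"` key) is an assumption:

```python
# Prompt-example numbers deliberately chosen outside the real 0-99 jersey range
EXAMPLE_NUMBERS = {101, 102, 103, 142, 199}

def filter_hallucinations(detections: list[dict]) -> tuple[list[dict], int]:
    """Drop detections that echo known prompt-example numbers.

    Returns (kept_detections, hallucination_count) so callers can
    track per-image hallucination statistics.
    """
    kept = [d for d in detections if d.get("number") not in EXAMPLE_NUMBERS]
    return kept, len(detections) - len(kept)
```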
Prompt Files
Two prompt templates are provided for jersey detection:
| File | Description |
|---|---|
| jersey_prompt.txt | Basic prompt for jersey detection without confidence scores |
| jersey_prompt_with_confidence.txt | Enhanced prompt with confidence scoring (0-100 scale) |
The confidence prompt includes scoring guidelines:
- 90-100: Extremely clear and unambiguous
- 70-89: Clear but minor occlusion/angle issues
- 50-69: Partially visible or somewhat unclear
- 30-49: Difficult to read but visible
- 0-29: Very uncertain, barely visible
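Downstream filtering on these scores can be sketched as below; the default of 85 follows the threshold recommended in Key Findings, and the `"confidence"` key is an assumed field name:

```python
def filter_by_confidence(detections: list[dict], threshold: int = 85) -> list[dict]:
    """Keep only detections at or above the confidence threshold (0-100 scale).

    Detections missing a confidence score are treated as 0 and dropped.
    """
    return [d for d in detections if d.get("confidence", 0) >= threshold]
```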
Llama-swap Configuration
This project supports llama-swap for automatic model switching during batch testing.
Configuration file: llama-swap-config.yaml
The config includes 8 pre-configured vision-language models:
| Model Tag | Parameters | Quantization |
|---|---|---|
| lfm2-vl-1.6b | 1.6B | F16 |
| gemma-3-4b | 4B | F16 |
| kimi-vl-3b | 3B | F16 |
| qwen2.5-vl-7b | 7B | F16 |
| gemma-3-12b | 12B | F16 |
| mistral-small-24b-q8 | 24B | Q8_K_XL |
| mistral-small-24b-q4 | 24B | Q4_K_XL |
| gemma-3-27b | 27B | Q8_0 |
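A llama-swap model entry generally maps a model tag to a launch command and proxy address. The fragment below is a hypothetical sketch only; the real keys, model paths, and flags should be taken from the project's llama-swap-config.yaml:

```yaml
# Hypothetical llama-swap entry; paths and ports are illustrative
models:
  "qwen2.5-vl-7b":
    cmd: >
      llama-server --port 9001
      -m /models/Qwen2.5-VL-7B-Instruct-F16.gguf
      --mmproj /models/qwen2.5-vl-mmproj-f16.gguf
    proxy: http://127.0.0.1:9001
```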
See docs/LLAMA_SWAP_SETUP.md for server setup instructions.
Key Findings
- Top Recommendation: qwen2.5-vl-7b (72.9% F1 score)
- Best Confidence Calibration: gemma-3-27b
- Speed Champion: gemma-3-4b (7.9s/img, 63.8% F1)
- Confidence threshold of 85+ recommended for filtering uncertain detections