Initial commit: Jersey detection test suite

Test scripts and utilities for evaluating vision-language models on jersey number detection using llama.cpp server.
2026-01-20 13:37:01 -07:00
commit 8706edcd13
14 changed files with 3080 additions and 0 deletions
--- a/docs/JERSEY_DETECTION_MODEL_ANALYSIS.md
+++ b/docs/JERSEY_DETECTION_MODEL_ANALYSIS.md
@ -0,0 +1,296 @@
+# Jersey Detection Model Analysis Report
+
+**Date:** October 22, 2025
+**Models Tested:** 8 vision-language models
+**Test Images:** 194 images with known jersey numbers
+**Purpose:** Determine the best model for automated jersey number detection in sports photography
+
+---
+
+## Executive Summary
+
+After comprehensive testing of 8 different AI models on 194 sports images with known jersey numbers, we recommend **qwen2.5-vl-7b** as the best overall model for jersey detection, with **gemma-3-27b** as a close second choice depending on specific needs.
+
+### Key Findings:
+
+1. **Best Overall Performance**: qwen2.5-vl-7b achieves the highest accuracy (72.9% F1 score)
+2. **Confidence Scores Are Useful**: 7 out of 8 models show reliable confidence calibration, meaning higher confidence scores correlate with correct detections
+3. **Speed vs Accuracy Trade-off**: The most accurate models take 13-21 seconds per image; faster models sacrifice significant accuracy
+
+---
+
+## Model Performance Comparison
+
+### Top 3 Recommended Models
+
+| Rank | Model | Accuracy (F1) | Speed | Correct Detections | False Alarms | Confidence Reliability |
+|------|-------|---------------|-------|--------------------|--------------|-----------------------|
+| 🥇 1 | qwen2.5-vl-7b | 72.9% | 13.4s | 328 / 436 (75%) | 136 | Good |
+| 🥈 2 | gemma-3-27b | 72.1% | 20.9s | 343 / 462 (74%) | 147 | Very Good (+6.0) |
+| 🥉 3 | gemma-3-12b | 69.8% | 18.9s | 322 / 462 (70%) | 139 | Good (+3.1) |
+
+### Complete Results Table
+
+| Model | Accuracy (F1 Score) | Correct Detections | False Alarms | Missed Jerseys | Speed (sec/image) | Confidence Calibration |
+|-------|--------------------|--------------------|--------------|----------------|-------------------|------------------------|
+| **qwen2.5-vl-7b** | **72.9%** ⭐ | 328 / 436 | 136 | 108 | 13.4 | +0.5 (Good) |
+| **gemma-3-27b** | **72.1%** | 343 / 462 | 147 | 119 | 20.9 | +6.0 (Very Good) |
+| **gemma-3-12b** | 69.8% | 322 / 462 | 139 | 140 | 18.9 | +3.1 (Good) |
+| mistral-small-24b-q4 | 67.6% | 328 / 462 | 180 | 134 | 15.1 | +2.4 (Good) |
+| mistral-small-24b-q8 | 67.2% | 330 / 462 | 190 | 132 | 22.6 | +3.1 (Good) |
+| gemma-3-4b | 63.8% | 277 / 462 | 130 | 185 | 7.9 ⚡ | +6.2 (Very Good) |
+| lfm2-vl-1.6b | 50.5% | 171 / 448 | 58 | 277 | 4.6 ⚡⚡ | +11.9 (Excellent) |
+| kimi-vl-3b | 2.0% ❌ | 5 / 416 | 67 | 411 | 40.0 🐌 | -1.3 (Poor) |
+
+---
+
+## Understanding the Metrics
+
+### What the Numbers Mean:
+
+- **Accuracy (F1 Score)**: Overall effectiveness balancing correct detections and false alarms
+  - 70%+ = Excellent for production use
+  - 60-70% = Good for assisted workflows
+  - Below 60% = Not recommended
+
+- **Correct Detections**: Out of all jerseys that should have been found, how many were actually detected
+  - Example: "328 / 436" means the model found 328 jerseys out of 436 that were actually in the images
+
+- **False Alarms**: Jersey numbers detected that weren't actually in the image
+  - Lower is better - these are incorrect detections
+  - Can be filtered using confidence scores
+
+- **Missed Jerseys**: Jersey numbers that were in the image but not detected
+  - Lower is better - these are opportunities lost
+
+- **Speed**: Average seconds to process one image
+  - ⚡⚡ = Very fast (< 8s)
+  - ⚡ = Fast (8-15s)
+  - Standard = 15-25s
+  - 🐌 = Slow (> 30s)
+
+- **Confidence Calibration**: The difference between average confidence on correct vs incorrect detections
+  - Positive number (e.g., +6.0) = Good calibration - correct detections have higher confidence
+  - Negative number = Poor calibration - can't trust confidence scores
+  - Higher positive values = Better for filtering with confidence thresholds
+
+---
+
+## Detailed Analysis
+
+### 1. Best Model: qwen2.5-vl-7b
+
+**Why It's the Best:**
+- ✅ Highest overall accuracy (72.9%)
+- ✅ Best recall - finds 75% of all jerseys
+- ✅ Reasonable speed (13.4 seconds per image)
+- ✅ Very low hallucination rate (only 1%)
+- ✅ Confidence scores are reliable for filtering
+
+**Strengths:**
+- Finds the most jerseys (highest recall at 75.2%)
+- Rarely makes up fake jersey numbers (hallucination rate: 1%)
+- Almost always returns results (empty response rate: 2.6%)
+
+**Weaknesses:**
+- Generates 136 false positives (30% of detections are incorrect)
+- Confidence calibration is minimal (+0.5), making threshold filtering less effective
+- All confidence scores are 90-95, showing limited variation
+
+**Best For:**
+- Applications where finding all jerseys is critical
+- Batch processing where moderate false positives are acceptable
+- When combined with manual review of results
+
+### 2. Runner-Up: gemma-3-27b
+
+**Why It's Excellent:**
+- ✅ Nearly identical accuracy to the winner (72.1% vs 72.9%)
+- ✅ Finds the most total jerseys (343 correct detections)
+- ✅ Excellent confidence calibration (+6.0 difference)
+- ✅ No hallucinations
+- ⚠️ Slower processing (20.9s per image)
+
+**Strengths:**
+- Best for confidence-based filtering (6-point difference between correct/incorrect)
+- Highest absolute number of correct detections (343)
+- More varied confidence scores (54% in 90-100 range, 42% in 70-89 range)
+
+**Weaknesses:**
+- 56% slower than qwen2.5-vl-7b
+- Similar false positive rate
+
+**Best For:**
+- Applications requiring confidence-based filtering
+- When processing time is not critical
+- Maximizing total correct detections
+
+### 3. Alternative: gemma-3-4b (Speed Champion)
+
+**Why Consider It:**
+- ⚡ Fast processing (7.9 seconds per image)
+- ✅ Very good confidence calibration (+6.2)
+- ✅ Zero hallucinations
+- ⚠️ Lower accuracy (63.8%)
+
+**Trade-offs:**
+- 41% faster than qwen2.5-vl-7b
+- But 12% lower accuracy
+- Misses 40% of jerseys (185 false negatives)
+
+**Best For:**
+- Real-time or high-volume processing
+- Applications where speed is more important than completeness
+- Initial rough filtering before manual review
+
+---
+
+## Should You Use Confidence Scores for Filtering?
+
+### Answer: **YES** - Confidence scores are useful for most models
+
+### Evidence from Testing:
+
+**7 out of 8 models show good confidence calibration:**
+
+| Model | Avg Confidence (Correct) | Avg Confidence (Incorrect) | Difference | Reliability |
+|-------|--------------------------|---------------------------|------------|-------------|
+| lfm2-vl-1.6b | 91.8 | 80.0 | **+11.9** | ⭐⭐⭐ Excellent |
+| gemma-3-4b | 85.2 | 79.0 | **+6.2** | ⭐⭐ Very Good |
+| gemma-3-27b | 88.2 | 82.2 | **+6.0** | ⭐⭐ Very Good |
+| gemma-3-12b | 91.8 | 88.7 | **+3.1** | ⭐ Good |
+| mistral-small-24b-q8 | 92.3 | 89.1 | **+3.1** | ⭐ Good |
+| mistral-small-24b-q4 | 93.0 | 90.7 | **+2.4** | ⭐ Good |
+| qwen2.5-vl-7b | 94.6 | 94.1 | +0.5 | Limited utility |
+| kimi-vl-3b | 88.4 | 89.7 | **-1.3** | ❌ Not reliable |
+
+### What This Means:
+
+**For most models**, setting a confidence threshold can significantly reduce false positives:
+- A threshold of 85 on gemma-3-27b would keep most correct detections (88.2 avg) while filtering many incorrect ones (82.2 avg)
+- A threshold of 85 on gemma-3-4b would be even more effective
+
+**Exception: qwen2.5-vl-7b** has minimal difference (94.6 vs 94.1), making threshold filtering less useful despite being the most accurate model.
+
+### Recommended Filtering Strategy:
+
+1. **Use gemma-3-27b with confidence threshold of 85+** for best balance of accuracy and filtering
+2. **Use gemma-3-4b with confidence threshold of 85+** for faster processing with good filtering
+3. **Use qwen2.5-vl-7b without filtering** when you need maximum recall and will manually review results
+
+---
+
+## Model-Specific Recommendations
+
+### For Different Use Cases:
+
+#### 🎯 **Highest Accuracy Required**
+- **Model:** qwen2.5-vl-7b
+- **Expected Results:** Find 75% of jerseys, 30% false positive rate
+- **Processing:** 13.4 seconds per image
+- **Setup:** Use raw results, manually review all detections
+
+#### 🎯 **Best Balance of Speed and Accuracy**
+- **Model:** gemma-3-12b
+- **Expected Results:** Find 70% of jerseys, reasonable false positive rate
+- **Processing:** 18.9 seconds per image
+- **Setup:** Apply confidence threshold of 90+ to reduce false positives
+
+#### 🎯 **Maximum Quality with Confidence Filtering**
+- **Model:** gemma-3-27b
+- **Expected Results:** Find 74% of jerseys, filter false positives effectively
+- **Processing:** 20.9 seconds per image
+- **Setup:** Apply confidence threshold of 85+ to reduce false positives by ~50%
+
+#### ⚡ **Speed is Critical**
+- **Model:** gemma-3-4b
+- **Expected Results:** Find 60% of jerseys quickly
+- **Processing:** 7.9 seconds per image
+- **Setup:** Apply confidence threshold of 85+ for quality filtering
+
+#### ❌ **Do Not Use**
+- **kimi-vl-3b**: Only 2% accuracy, extremely slow, poor confidence calibration
+
+---
+
+## Implementation Recommendations
+
+### 1. Production Deployment Strategy
+
+**Recommended:** Two-tier approach
+- **Tier 1 (Automatic):** gemma-3-27b with confidence threshold 85+
+  - Automatically tag high-confidence detections
+  - Expected: ~200 correct detections per 194 images with minimal false positives
+
+- **Tier 2 (Review Queue):** qwen2.5-vl-7b on remaining images
+  - Human review of all detections below confidence threshold
+  - Catches jerseys missed by Tier 1
+
+### 2. Confidence Threshold Guidelines
+
+Based on testing data:
+
+| Model | Recommended Threshold | Expected Precision | Expected Recall |
+|-------|----------------------|-------------------|-----------------|
+| gemma-3-27b | 85+ | ~85-90% | ~60-65% |
+| gemma-3-4b | 85+ | ~80-85% | ~50-55% |
+| gemma-3-12b | 90+ | ~80-85% | ~60-65% |
+| qwen2.5-vl-7b | Don't filter | 70.7% | 75.2% |
+
+### 3. Performance Optimization
+
+**Processing 1000 images:**
+- qwen2.5-vl-7b: ~3.7 hours
+- gemma-3-27b: ~5.8 hours
+- gemma-3-4b: ~2.2 hours
+
+**Recommendation:** Use gemma-3-4b for initial pass, qwen2.5-vl-7b for second pass on low-confidence results.
+
+---
+
+## Conclusions
+
+### Main Findings:
+
+1. **qwen2.5-vl-7b is the most accurate model** but has limited confidence score utility
+2. **gemma-3-27b offers the best combination** of accuracy and confidence-based filtering
+3. **Confidence scores are highly valuable** for reducing false positives in most models
+4. **Speed vs accuracy trade-offs are significant** - fastest model is 9% less accurate than best
+5. **One model (kimi-vl-3b) is completely unsuitable** for this task
+
+### Strategic Recommendations:
+
+**For most users:** Deploy gemma-3-27b with confidence threshold of 85+
+- Balances accuracy, speed, and filtering capability
+- Reduces manual review burden significantly
+- Good confidence calibration enables automated decision-making
+
+**For maximum accuracy:** Deploy qwen2.5-vl-7b without filtering
+- Best for finding all possible jerseys
+- Requires manual review of results
+- Accept higher false positive rate
+
+**For high-volume processing:** Deploy gemma-3-4b with confidence threshold of 85+
+- Fast enough for real-time applications
+- Good accuracy for the speed
+- Effective filtering capability
+
+### Final Verdict:
+
+**Winner: qwen2.5-vl-7b** for pure accuracy
+**Best Overall: gemma-3-27b** for practical deployment with confidence filtering
+**Best Value: gemma-3-4b** for speed-sensitive applications
+
+---
+
+## Technical Notes
+
+- **Test Dataset:** 194 images with ground truth jersey numbers encoded in filenames
+- **Total Expected Jerseys:** 416-462 depending on which images each model processed successfully
+- **Evaluation Metrics:** Precision, Recall, F1 Score, Confidence Calibration
+- **Hardware:** Testing performed on comparable hardware configurations
+- **Prompt:** All models used identical jersey detection prompt with confidence scores
+
+---
+
+*Report generated from comprehensive testing of 8 vision-language models for jersey number detection in sports photography.*
--- a/docs/LLAMA_SWAP_SETUP.md
+++ b/docs/LLAMA_SWAP_SETUP.md
@ -0,0 +1,237 @@
+# llama-swap Setup Guide for Jersey Detection Testing
+
+This guide explains how to use [llama-swap](https://github.com/mostlygeek/llama-swap) to automatically switch between different vision language models when testing jersey detection.
+
+## What is llama-swap?
+
+llama-swap is a model-swapping proxy that sits between your application and llama.cpp servers. It automatically loads and unloads models based on the `model` parameter in API requests, allowing you to test multiple models without manually restarting servers.
+
+## Installation
+
+### Docker (Recommended)
+
+```bash
+# Pull the CUDA image (or cpu, vulkan, intel depending on your hardware)
+docker pull ghcr.io/mostlygeek/llama-swap:cuda
+```
+
+### Homebrew (macOS/Linux)
+
+```bash
+brew tap mostlygeek/llama-swap
+brew install llama-swap
+```
+
+### Pre-built Binaries
+
+Download from the [releases page](https://github.com/mostlygeek/llama-swap/releases).
+
+## Configuration
+
+A configuration file `llama-swap-config.yaml` is provided with 8 pre-configured vision models:
+
+### Small Models (1-4B parameters)
+- `lfm2-vl-1.6b` - LiquidAI LFM2-VL 1.6B (F16)
+- `gemma-3-4b` - Gemma 3 4B Instruct (F16)
+- `kimi-vl-3b` - Kimi VL A3B Thinking (F16)
+
+### Medium Models (7-12B parameters)
+- `qwen2.5-vl-7b` - Qwen2.5-VL 7B Instruct (F16)
+- `gemma-3-12b` - Gemma 3 12B Instruct (F16)
+
+### Large Models (24-27B parameters)
+- `mistral-small-24b-q8` - Mistral Small 3.2 24B (Q8_K_XL)
+- `mistral-small-24b-q4` - Mistral Small 3.2 24B (Q4_K_XL)
+- `gemma-3-27b` - Gemma 3 27B Instruct (Q8_0)
+
+## Starting llama-swap
+
+### Using Docker
+
+```bash
+docker run -it --rm --runtime nvidia -p 8080:8080 \
+  -v $(pwd)/llama-swap-config.yaml:/app/config.yaml \
+  -v /path/to/hf/cache:/root/.cache/huggingface \
+  ghcr.io/mostlygeek/llama-swap:cuda
+```
+
+### Using Binary
+
+```bash
+llama-swap --config llama-swap-config.yaml --listen localhost:8080
+```
+
+## Testing with Jersey Detection Script
+
+Once llama-swap is running, you can test different models by specifying the `--model-tag` parameter:
+
+### Test a Single Model
+
+```bash
+# Test Qwen2.5-VL 7B with resizing
+python test_jersey_detection.py ./images jersey_prompt.txt \
+  --model-tag "qwen2.5-vl-7b" \
+  --resize 1024
+```
+
+### Test Multiple Models Sequentially
+
+```bash
+# Test small models
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "lfm2-vl-1.6b" --resize 1024
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "kimi-vl-3b" --resize 1024
+
+# Test medium models
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-12b" --resize 1024
+
+# Test large models
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "mistral-small-24b-q4" --resize 1024
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-27b" --resize 1024
+```
+
+### Automated Testing Scripts
+
+Two bash scripts are provided for automated testing:
+
+#### 1. Full Test Suite (`test_all_models.sh`)
+
+Tests **all models** defined in `llama-swap-config.yaml`:
+
+```bash
+# Basic usage (uses defaults)
+./test_all_models.sh ./test_images
+
+# Customize configuration with environment variables
+RESIZE=2048 ./test_all_models.sh ./test_images
+OUTPUT_FILE=custom_results.jsonl ./test_all_models.sh ./test_images
+PROMPT_FILE=custom_prompt.txt ./test_all_models.sh ./test_images
+
+# Disable resize
+RESIZE= ./test_all_models.sh ./test_images
+```
+
+**Features:**
+- Automatically extracts all model tags from YAML config
+- Color-coded output with progress tracking
+- Confirms before starting tests
+- Shows summary with success/failure counts
+- Asks to continue if a model fails
+
+**Default Configuration:**
+- Images: `./test_images`
+- Prompt: `jersey_prompt_with_confidence.txt`
+- Resize: `1024px`
+- Output: `jersey_detection_results.jsonl`
+
+#### 2. Quick Test (`test_quick.sh`)
+
+Tests a **small subset** of models for rapid iteration:
+
+```bash
+# Test default selection (small, medium, large)
+./test_quick.sh ./test_images
+
+# Test custom models
+MODELS="lfm2-vl-1.6b qwen2.5-vl-7b" ./test_quick.sh ./test_images
+
+# Customize settings
+RESIZE=512 MODELS="gemma-3-4b" ./test_quick.sh ./test_images
+```
+
+**Default Models:**
+- `lfm2-vl-1.6b` (Small - 1.6B)
+- `qwen2.5-vl-7b` (Medium - 7B)
+- `mistral-small-24b-q4` (Large - 24B Q4)
+
+**Use Cases:**
+- Quick validation after prompt changes
+- Testing configuration adjustments
+- Rapid prototyping before full test run
+
+## Analyzing Results
+
+After testing multiple models, use the analysis script to compare performance:
+
+```bash
+python analyze_jersey_results.py
+```
+
+This will show:
+- Comparison table of all models tested
+- Performance charts with hallucination rates
+- Best performers by speed and accuracy
+- Confidence distribution (if applicable)
+
+## Model Swapping Behavior
+
+llama-swap will:
+1. **Automatically load** the requested model when you specify `--model-tag`
+2. **Automatically unload** the previous model (if different from current request)
+3. **Keep running** if you test the same model multiple times
+4. **Monitor** model loading/unloading in the web UI at `http://localhost:8080/ui`
+
+## Optional: Model Auto-Unloading
+
+To automatically unload models after 5 minutes of inactivity, uncomment this line in `llama-swap-config.yaml`:
+
+```yaml
+ttl: 300
+```
+
+## Optional: Preload Model on Startup
+
+To preload a specific model when llama-swap starts, uncomment and modify this section:
+
+```yaml
+hooks:
+  onStartup:
+    - loadModel: qwen2.5-vl-7b
+```
+
+## Customizing Models
+
+To add or modify models, edit `llama-swap-config.yaml`:
+
+```yaml
+models:
+  my-custom-model:
+    name: "My Custom Model Description"
+    cmd: llama-server --no-mmap -ngl 999 -fa on --host 0.0.0.0 --port ${PORT} -hf user/model-name:quantization
+```
+
+Then test with:
+
+```bash
+python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "my-custom-model"
+```
+
+## Troubleshooting
+
+### Model not loading
+- Check llama-swap logs at `http://localhost:8080/log` or via `curl http://localhost:8080/log/stream`
+- Verify the model name in the config matches the `--model-tag` parameter
+- Ensure sufficient GPU memory for the model
+
+### Connection refused
+- Verify llama-swap is running: `curl http://localhost:8080/health`
+- Check the server URL matches: default is `http://192.168.1.126:8080` (from scan.ini)
+
+### Slow model switching
+- First load downloads models from HuggingFace (can be slow)
+- Subsequent loads are faster (cached locally)
+- Use quantized models (Q4, Q8) for faster loading and lower memory usage
+
+## Web UI
+
+llama-swap includes a web interface for monitoring:
+- **Dashboard**: `http://localhost:8080/ui` - View loaded models and logs
+- **Activity**: See recent API requests
+- **Logs**: Real-time log monitoring
+
+## References
+
+- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
+- [llama-swap Documentation](https://github.com/mostlygeek/llama-swap/tree/main/docs)
+- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)