Test scripts and utilities for evaluating vision-language models on jersey number detection using llama.cpp server.
Jersey Detection Model Analysis Report
- Date: October 22, 2025
- Models Tested: 8 vision-language models
- Test Images: 194 images with known jersey numbers
- Purpose: Determine the best model for automated jersey number detection in sports photography
Executive Summary
After comprehensive testing of 8 different AI models on 194 sports images with known jersey numbers, we recommend qwen2.5-vl-7b as the best overall model for jersey detection, with gemma-3-27b as a close second choice depending on specific needs.
Key Findings:
- Best Overall Performance: qwen2.5-vl-7b achieves the highest accuracy (72.9% F1 score)
- Confidence Scores Are Useful: 7 out of 8 models assign higher average confidence to correct detections than to incorrect ones, although for qwen2.5-vl-7b the gap (+0.5) is too small to help with filtering
- Speed vs Accuracy Trade-off: The most accurate models take 13-21 seconds per image; faster models sacrifice significant accuracy
Model Performance Comparison
Top 3 Recommended Models
| Rank | Model | Accuracy (F1) | Speed | Correct Detections | False Alarms | Confidence Reliability |
|---|---|---|---|---|---|---|
| 🥇 1 | qwen2.5-vl-7b | 72.9% | 13.4s | 328 / 436 (75%) | 136 | Limited (+0.5) |
| 🥈 2 | gemma-3-27b | 72.1% | 20.9s | 343 / 462 (74%) | 147 | Very Good (+6.0) |
| 🥉 3 | gemma-3-12b | 69.8% | 18.9s | 322 / 462 (70%) | 139 | Good (+3.1) |
Complete Results Table
| Model | Accuracy (F1 Score) | Correct Detections | False Alarms | Missed Jerseys | Speed (sec/image) | Confidence Calibration |
|---|---|---|---|---|---|---|
| qwen2.5-vl-7b | 72.9% ⭐ | 328 / 436 | 136 | 108 | 13.4 | +0.5 (Limited) |
| gemma-3-27b | 72.1% | 343 / 462 | 147 | 119 | 20.9 | +6.0 (Very Good) |
| gemma-3-12b | 69.8% | 322 / 462 | 139 | 140 | 18.9 | +3.1 (Good) |
| mistral-small-24b-q4 | 67.6% | 328 / 462 | 180 | 134 | 15.1 | +2.4 (Good) |
| mistral-small-24b-q8 | 67.2% | 330 / 462 | 190 | 132 | 22.6 | +3.1 (Good) |
| gemma-3-4b | 63.8% | 277 / 462 | 130 | 185 | 7.9 ⚡ | +6.2 (Very Good) |
| lfm2-vl-1.6b | 50.5% | 171 / 448 | 58 | 277 | 4.6 ⚡⚡ | +11.9 (Excellent) |
| kimi-vl-3b | 2.0% ❌ | 5 / 416 | 67 | 411 | 40.0 🐌 | -1.3 (Poor) |
Understanding the Metrics
What the Numbers Mean:
- Accuracy (F1 Score): The harmonic mean of precision and recall, so it balances false alarms against missed jerseys
  - 70%+ = Excellent for production use
  - 60-70% = Good for assisted workflows
  - Below 60% = Not recommended
- Correct Detections: Out of all jerseys that should have been found, how many were actually detected
  - Example: "328 / 436" means the model found 328 of the 436 jerseys that were actually in the images
- False Alarms: Jersey numbers detected that weren't actually in the image
  - Lower is better; these are incorrect detections
  - Can be filtered using confidence scores
- Missed Jerseys: Jersey numbers that were in the image but not detected
  - Lower is better; these are missed opportunities
- Speed: Average seconds to process one image
  - ⚡⚡ = Very fast (< 8s)
  - ⚡ = Fast (8-15s)
  - Standard = 15-25s
  - 🐌 = Slow (> 30s)
- Confidence Calibration: The average confidence on correct detections minus the average confidence on incorrect detections
  - Positive number (e.g., +6.0) = Good calibration; correct detections have higher confidence
  - Negative number = Poor calibration; confidence scores can't be trusted
  - Higher positive values = Better for filtering with confidence thresholds
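The relationship between these metrics can be made concrete with a short calculation. The following is an illustrative sketch rather than the project's actual evaluation script; the function name `detection_metrics` is an assumption. It derives precision, recall, and F1 from the three counts reported in the results table, using the qwen2.5-vl-7b row as input.

```python
def detection_metrics(correct: int, false_alarms: int, missed: int) -> dict:
    """Compute precision, recall and F1 from raw detection counts."""
    detected = correct + false_alarms  # everything the model reported
    expected = correct + missed        # everything actually in the images
    precision = correct / detected if detected else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# qwen2.5-vl-7b row: 328 correct, 136 false alarms, 108 missed
m = detection_metrics(328, 136, 108)
print(f"precision={m['precision']:.1%} recall={m['recall']:.1%} f1={m['f1']:.1%}")
# prints: precision=70.7% recall=75.2% f1=72.9%
```

The same three counts reproduce the F1 column for every other row of the table.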
Detailed Analysis
1. Best Model: qwen2.5-vl-7b
Why It's the Best:
- ✅ Highest overall accuracy (72.9%)
- ✅ Best recall - finds 75% of all jerseys
- ✅ Reasonable speed (13.4 seconds per image)
- ✅ Very low hallucination rate (only 1%)
- ⚠️ Confidence scores barely separate correct from incorrect detections (+0.5), so they are of limited use for filtering
Strengths:
- Highest recall of all models tested (75.2% of expected jerseys found)
- Rarely makes up fake jersey numbers (hallucination rate: 1%)
- Almost always returns results (empty response rate: 2.6%)
Weaknesses:
- Generates 136 false positives (about 29% of its 464 detections are incorrect)
- Confidence calibration is minimal (+0.5), making threshold filtering less effective
- All confidence scores are 90-95, showing limited variation
Best For:
- Applications where finding all jerseys is critical
- Batch processing where moderate false positives are acceptable
- When combined with manual review of results
2. Runner-Up: gemma-3-27b
Why It's Excellent:
- ✅ Nearly identical accuracy to the winner (72.1% vs 72.9%)
- ✅ Finds the most total jerseys (343 correct detections)
- ✅ Excellent confidence calibration (+6.0 difference)
- ✅ No hallucinations
- ⚠️ Slower processing (20.9s per image)
Strengths:
- Best for confidence-based filtering (6-point difference between correct/incorrect)
- Highest absolute number of correct detections (343)
- More varied confidence scores (54% in 90-100 range, 42% in 70-89 range)
Weaknesses:
- Takes 56% longer per image than qwen2.5-vl-7b (20.9s vs 13.4s)
- Similar share of false positives (147, about 30% of detections)
Best For:
- Applications requiring confidence-based filtering
- When processing time is not critical
- Maximizing total correct detections
3. Alternative: gemma-3-4b (Speed Champion)
Why Consider It:
- ⚡ Fast processing (7.9 seconds per image)
- ✅ Very good confidence calibration (+6.2)
- ✅ Zero hallucinations
- ⚠️ Lower accuracy (63.8%)
Trade-offs:
- Takes 41% less time per image than qwen2.5-vl-7b (7.9s vs 13.4s)
- But F1 is 9.1 points lower (63.8% vs 72.9%, about 12% lower in relative terms)
- Misses 40% of jerseys (185 false negatives)
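The trade-off figures above follow directly from the results table. A minimal sketch of the arithmetic, with `compare_models` as an illustrative name rather than part of the test scripts:

```python
def compare_models(base_sec, base_f1, other_sec, other_f1):
    """Compare a candidate model against a baseline on speed and F1."""
    time_saved = (base_sec - other_sec) / base_sec  # fraction of time saved
    f1_points = base_f1 - other_f1                  # absolute gap in F1 points
    f1_relative = f1_points / base_f1               # gap relative to baseline F1
    return time_saved, f1_points, f1_relative


# Baseline qwen2.5-vl-7b (13.4 s, 72.9% F1) vs gemma-3-4b (7.9 s, 63.8% F1)
time_saved, f1_points, f1_relative = compare_models(13.4, 72.9, 7.9, 63.8)
print(f"time saved: {time_saved:.0%}, F1 gap: {f1_points:.1f} points ({f1_relative:.0%} relative)")
# prints: time saved: 41%, F1 gap: 9.1 points (12% relative)
```

Keeping the absolute gap (points) and the relative gap (percent) separate avoids confusing the two when comparing models.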
Best For:
- High-volume processing where throughput matters most
- Applications where speed is more important than completeness
- Initial rough filtering before manual review
Should You Use Confidence Scores for Filtering?
Answer: YES - Confidence scores are useful for most models
Evidence from Testing:
7 out of 8 models assign higher average confidence to correct detections than to incorrect ones, and 6 of them do so by more than 2 points:
| Model | Avg Confidence (Correct) | Avg Confidence (Incorrect) | Difference | Reliability |
|---|---|---|---|---|
| lfm2-vl-1.6b | 91.8 | 80.0 | +11.9 | ⭐⭐⭐ Excellent |
| gemma-3-4b | 85.2 | 79.0 | +6.2 | ⭐⭐ Very Good |
| gemma-3-27b | 88.2 | 82.2 | +6.0 | ⭐⭐ Very Good |
| gemma-3-12b | 91.8 | 88.7 | +3.1 | ⭐ Good |
| mistral-small-24b-q8 | 92.3 | 89.1 | +3.1 | ⭐ Good |
| mistral-small-24b-q4 | 93.0 | 90.7 | +2.4 | ⭐ Good |
| qwen2.5-vl-7b | 94.6 | 94.1 | +0.5 | Limited utility |
| kimi-vl-3b | 88.4 | 89.7 | -1.3 | ❌ Not reliable |
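The Difference column is the mean confidence of correct detections minus the mean confidence of incorrect ones. A minimal sketch of that calculation follows; the `(confidence, is_correct)` pair format and the sample values are made up for illustration and are not taken from the test data.

```python
from statistics import mean


def calibration_gap(detections):
    """Mean confidence of correct detections minus that of incorrect ones.

    `detections` is a list of (confidence, is_correct) pairs. Returns None
    when either group is empty, since the gap is undefined in that case.
    """
    correct = [conf for conf, ok in detections if ok]
    incorrect = [conf for conf, ok in detections if not ok]
    if not correct or not incorrect:
        return None
    return mean(correct) - mean(incorrect)


# Toy data (hypothetical): three correct and two incorrect detections
sample = [(95, True), (90, True), (85, False), (80, True), (70, False)]
gap = calibration_gap(sample)
print(f"{gap:+.1f}")  # prints: +10.8
```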
What This Means:
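For most models, setting a confidence threshold can reduce false positives:
- A threshold of 85 on gemma-3-27b sits between the average confidence of its correct detections (88.2) and its incorrect ones (82.2), so it should keep most correct detections while filtering many incorrect ones
- The same threshold on gemma-3-4b also separates the two groups (85.2 vs 79.0), but it sits almost exactly at the average for correct detections, so it is likely to discard more of them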
Exception: qwen2.5-vl-7b has minimal difference (94.6 vs 94.1), making threshold filtering less useful despite being the most accurate model.
Recommended Filtering Strategy:
- Use gemma-3-27b with confidence threshold of 85+ for best balance of accuracy and filtering
- Use gemma-3-4b with confidence threshold of 85+ for faster processing with good filtering
- Use qwen2.5-vl-7b without filtering when you need maximum recall and will manually review results
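Applying a threshold is a one-line filter once each detection carries a confidence score. The sketch below assumes detections are dictionaries with `number` and `confidence` keys; that shape is a guess for illustration, not the output format of the test scripts.

```python
def filter_by_confidence(detections, threshold=85):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


# Hypothetical model output for one image
raw = [
    {"number": "10", "confidence": 92},
    {"number": "7", "confidence": 85},
    {"number": "41", "confidence": 78},
]
kept = filter_by_confidence(raw, threshold=85)
print([d["number"] for d in kept])  # prints: ['10', '7']
```

The comparison is inclusive, matching the "85+" wording used in the recommendations.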
Model-Specific Recommendations
For Different Use Cases:
🎯 Highest Accuracy Required
- Model: qwen2.5-vl-7b
- Expected Results: Find 75% of jerseys; about 29% of detections are false positives
- Processing: 13.4 seconds per image
- Setup: Use raw results, manually review all detections
🎯 Best Balance of Speed and Accuracy
- Model: gemma-3-12b
- Expected Results: Find 70% of jerseys, reasonable false positive rate
- Processing: 18.9 seconds per image
- Setup: Apply confidence threshold of 90+ to reduce false positives
🎯 Maximum Quality with Confidence Filtering
- Model: gemma-3-27b
- Expected Results: Find 74% of jerseys, filter false positives effectively
- Processing: 20.9 seconds per image
- Setup: Apply confidence threshold of 85+ to reduce false positives by ~50%
⚡ Speed is Critical
- Model: gemma-3-4b
- Expected Results: Find 60% of jerseys quickly
- Processing: 7.9 seconds per image
- Setup: Apply confidence threshold of 85+ for quality filtering
❌ Do Not Use
- kimi-vl-3b: Only 2.0% F1, the slowest model tested (40.0s per image), and poor confidence calibration (-1.3)
Implementation Recommendations
1. Production Deployment Strategy
Recommended: Two-tier approach
- Tier 1 (Automatic): gemma-3-27b with confidence threshold 85+
  - Automatically tag high-confidence detections
  - Expected: ~200 correct detections per 194 images with minimal false positives
- Tier 2 (Review Queue): qwen2.5-vl-7b on remaining images
  - Human review of all detections below confidence threshold
  - Catches jerseys missed by Tier 1
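One way to wire the two tiers together is sketched below. It is an interpretation, not a tested pipeline: it assumes "remaining images" means images where Tier 1 produced no detection at or above the threshold, and `primary` and `fallback` are hypothetical callables that wrap the two models and return dictionaries with `number` and `confidence` keys.

```python
def route_image(image_path, primary, fallback, threshold=85.0):
    """Return auto-tagged detections plus those queued for human review."""
    detections = primary(image_path)
    auto_tagged = [d for d in detections if d["confidence"] >= threshold]
    if auto_tagged:
        # Tier 1 succeeded: anything under the threshold still goes to review.
        review = [d for d in detections if d["confidence"] < threshold]
    else:
        # No confident Tier 1 result: hand the image to the Tier 2 model.
        review = fallback(image_path)
    return {"auto_tagged": auto_tagged, "review": review}


# Stand-in detectors with made-up outputs, used only to exercise the routing.
def fake_primary(path):
    if path == "a.jpg":
        return [{"number": "10", "confidence": 92}, {"number": "7", "confidence": 70}]
    return [{"number": "3", "confidence": 60}]


def fake_fallback(path):
    return [{"number": "3", "confidence": 94}]


result_a = route_image("a.jpg", fake_primary, fake_fallback)
result_b = route_image("b.jpg", fake_primary, fake_fallback)
print(result_a)
print(result_b)
```

Passing the detectors in as arguments keeps the routing logic independent of how each model is served.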
2. Confidence Threshold Guidelines
Based on testing data:
| Model | Recommended Threshold | Expected Precision | Expected Recall |
|---|---|---|---|
| gemma-3-27b | 85+ | ~85-90% | ~60-65% |
| gemma-3-4b | 85+ | ~80-85% | ~50-55% |
| gemma-3-12b | 90+ | ~80-85% | ~60-65% |
| qwen2.5-vl-7b | Don't filter | 70.7% | 75.2% |
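Threshold choices like those in the table can be checked by recomputing precision and recall over only the detections that survive a given threshold. The following sketch shows the idea on toy data; the `(confidence, is_correct)` pairs and the expected-jersey count are invented for illustration and do not come from the test set.

```python
def precision_recall_at(detections, expected_total, threshold):
    """Precision and recall over detections at or above `threshold`.

    `detections` is a list of (confidence, is_correct) pairs and
    `expected_total` is the number of jerseys actually present.
    """
    kept = [ok for conf, ok in detections if conf >= threshold]
    true_positives = sum(kept)  # True counts as 1, False as 0
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / expected_total if expected_total else 0.0
    return precision, recall


# Toy data (hypothetical): six detections, six jerseys expected in total
toy = [(95, True), (92, True), (88, False), (86, True), (80, False), (75, True)]
for threshold in (0, 85, 90):
    p, r = precision_recall_at(toy, expected_total=6, threshold=threshold)
    print(f"threshold {threshold:>2}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is the pattern the table's estimates describe.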
3. Performance Optimization
Processing 1000 images:
- qwen2.5-vl-7b: ~3.7 hours
- gemma-3-27b: ~5.8 hours
- gemma-3-4b: ~2.2 hours
Recommendation: Use gemma-3-4b for initial pass, qwen2.5-vl-7b for second pass on low-confidence results.
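The batch estimates above are the per-image times scaled to 1000 images. A minimal sketch of that conversion, assuming images are processed one at a time with no parallelism (`batch_hours` is an illustrative name):

```python
def batch_hours(seconds_per_image, image_count=1000):
    """Estimate wall-clock hours for a sequential batch."""
    return seconds_per_image * image_count / 3600


# Average per-image times from the results table
speeds = {"qwen2.5-vl-7b": 13.4, "gemma-3-27b": 20.9, "gemma-3-4b": 7.9}
for model, sec in speeds.items():
    print(f"{model}: ~{batch_hours(sec):.1f} hours")
# prints:
# qwen2.5-vl-7b: ~3.7 hours
# gemma-3-27b: ~5.8 hours
# gemma-3-4b: ~2.2 hours
```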
Conclusions
Main Findings:
- qwen2.5-vl-7b is the most accurate model but has limited confidence score utility
- gemma-3-27b offers the best combination of accuracy and confidence-based filtering
- Confidence scores are highly valuable for reducing false positives in most models
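- Speed vs accuracy trade-offs are significant: gemma-3-4b, the fastest recommended model, scores 9.1 F1 points below the best, and the fastest model overall (lfm2-vl-1.6b) scores 22.4 points below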
- One model (kimi-vl-3b) is completely unsuitable for this task
Strategic Recommendations:
For most users: Deploy gemma-3-27b with confidence threshold of 85+
- Balances accuracy, speed, and filtering capability
- Reduces manual review burden significantly
- Good confidence calibration enables automated decision-making
For maximum accuracy: Deploy qwen2.5-vl-7b without filtering
- Best for finding all possible jerseys
- Requires manual review of results
- Accept higher false positive rate
For high-volume processing: Deploy gemma-3-4b with confidence threshold of 85+
- Fastest of the recommended models (7.9 seconds per image)
- Good accuracy for the speed
- Effective filtering capability
Final Verdict:
- Winner: qwen2.5-vl-7b for pure accuracy
- Best for Practical Deployment: gemma-3-27b with confidence filtering
- Best Value: gemma-3-4b for speed-sensitive applications
Technical Notes
- Test Dataset: 194 images with ground truth jersey numbers encoded in filenames
- Total Expected Jerseys: 416-462 depending on which images each model processed successfully
- Evaluation Metrics: Precision, Recall, F1 Score, Confidence Calibration
- Hardware: Testing performed on comparable hardware configurations
- Prompt: All models used identical jersey detection prompt with confidence scores
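The report does not spell out how a model's output is matched against the ground truth for each image. One plausible scheme is sketched below: it compares jersey numbers as strings and uses multisets so that a number appearing twice in an image must be detected twice. The function name `score_image` and the sample values are illustrative assumptions.

```python
from collections import Counter


def score_image(expected, predicted):
    """Count correct detections, false alarms and misses for one image."""
    exp, pred = Counter(expected), Counter(predicted)
    correct = sum((exp & pred).values())       # numbers found in both
    false_alarms = sum((pred - exp).values())  # predicted but not present
    missed = sum((exp - pred).values())        # present but not predicted
    return correct, false_alarms, missed


# Hypothetical image with jerseys 10, 23 and 7; the model reports 10, 7 and 17
print(score_image(["10", "23", "7"], ["10", "7", "17"]))  # prints: (2, 1, 1)
```

Summing these three counts over all images gives the inputs for precision, recall, and F1. Note that string comparison treats "07" and "7" as different numbers, so outputs may need normalizing first.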
Report generated from comprehensive testing of 8 vision-language models for jersey number detection in sports photography.