Files

Rick McEwen 8706edcd13 Initial commit: Jersey detection test suite

Test scripts and utilities for evaluating vision-language models
on jersey number detection using llama.cpp server.

2026-01-20 13:37:01 -07:00

11 KiB

Raw Blame History

Jersey Detection Model Analysis Report

Date: October 22, 2025 Models Tested: 8 vision-language models Test Images: 194 images with known jersey numbers Purpose: Determine the best model for automated jersey number detection in sports photography

Executive Summary

After comprehensive testing of 8 different AI models on 194 sports images with known jersey numbers, we recommend qwen2.5-vl-7b as the best overall model for jersey detection, with gemma-3-27b as a close second choice depending on specific needs.

Key Findings:

Best Overall Performance: qwen2.5-vl-7b achieves the highest accuracy (72.9% F1 score)
Confidence Scores Are Useful: 7 out of 8 models show reliable confidence calibration, meaning higher confidence scores correlate with correct detections
Speed vs Accuracy Trade-off: The most accurate models take 13-21 seconds per image; faster models sacrifice significant accuracy

Model Performance Comparison

Top 3 Recommended Models

Rank	Model	Accuracy (F1)	Speed	Correct Detections	False Alarms	Confidence Reliability
🥇 1	qwen2.5-vl-7b	72.9%	13.4s	328 / 436 (75%)	136	Good
🥈 2	gemma-3-27b	72.1%	20.9s	343 / 462 (74%)	147	Very Good (+6.0)
🥉 3	gemma-3-12b	69.8%	18.9s	322 / 462 (70%)	139	Good (+3.1)

Complete Results Table

Model	Accuracy (F1 Score)	Correct Detections	False Alarms	Missed Jerseys	Speed (sec/image)	Confidence Calibration
qwen2.5-vl-7b	72.9% ⭐	328 / 436	136	108	13.4	+0.5 (Good)
gemma-3-27b	72.1%	343 / 462	147	119	20.9	+6.0 (Very Good)
gemma-3-12b	69.8%	322 / 462	139	140	18.9	+3.1 (Good)
mistral-small-24b-q4	67.6%	328 / 462	180	134	15.1	+2.4 (Good)
mistral-small-24b-q8	67.2%	330 / 462	190	132	22.6	+3.1 (Good)
gemma-3-4b	63.8%	277 / 462	130	185	7.9 ⚡	+6.2 (Very Good)
lfm2-vl-1.6b	50.5%	171 / 448	58	277	4.6 ⚡⚡	+11.9 (Excellent)
kimi-vl-3b	2.0% ❌	5 / 416	67	411	40.0 🐌	-1.3 (Poor)

Understanding the Metrics

What the Numbers Mean:

Accuracy (F1 Score): Overall effectiveness balancing correct detections and false alarms
- 70%+ = Excellent for production use
- 60-70% = Good for assisted workflows
- Below 60% = Not recommended
Correct Detections: Out of all jerseys that should have been found, how many were actually detected
- Example: "328 / 436" means the model found 328 jerseys out of 436 that were actually in the images
False Alarms: Jersey numbers detected that weren't actually in the image
- Lower is better - these are incorrect detections
- Can be filtered using confidence scores
Missed Jerseys: Jersey numbers that were in the image but not detected
- Lower is better - these are opportunities lost
Speed: Average seconds to process one image
- ⚡⚡ = Very fast (< 8s)
- ⚡ = Fast (8-15s)
- Standard = 15-25s
- 🐌 = Slow (> 30s)
Confidence Calibration: The difference between average confidence on correct vs incorrect detections
- Positive number (e.g., +6.0) = Good calibration - correct detections have higher confidence
- Negative number = Poor calibration - can't trust confidence scores
- Higher positive values = Better for filtering with confidence thresholds

Detailed Analysis

1. Best Model: qwen2.5-vl-7b

Why It's the Best:

✅ Highest overall accuracy (72.9%)
✅ Best recall - finds 75% of all jerseys
✅ Reasonable speed (13.4 seconds per image)
✅ Very low hallucination rate (only 1%)
✅ Confidence scores are reliable for filtering

Strengths:

Finds the most jerseys (highest recall at 75.2%)
Rarely makes up fake jersey numbers (hallucination rate: 1%)
Almost always returns results (empty response rate: 2.6%)

Weaknesses:

Generates 136 false positives (30% of detections are incorrect)
Confidence calibration is minimal (+0.5), making threshold filtering less effective
All confidence scores are 90-95, showing limited variation

Best For:

Applications where finding all jerseys is critical
Batch processing where moderate false positives are acceptable
When combined with manual review of results

2. Runner-Up: gemma-3-27b

Why It's Excellent:

✅ Nearly identical accuracy to the winner (72.1% vs 72.9%)
✅ Finds the most total jerseys (343 correct detections)
✅ Excellent confidence calibration (+6.0 difference)
✅ No hallucinations
⚠️ Slower processing (20.9s per image)

Strengths:

Best for confidence-based filtering (6-point difference between correct/incorrect)
Highest absolute number of correct detections (343)
More varied confidence scores (54% in 90-100 range, 42% in 70-89 range)

Weaknesses:

56% slower than qwen2.5-vl-7b
Similar false positive rate

Best For:

Applications requiring confidence-based filtering
When processing time is not critical
Maximizing total correct detections

3. Alternative: gemma-3-4b (Speed Champion)

Why Consider It:

⚡ Fast processing (7.9 seconds per image)
✅ Very good confidence calibration (+6.2)
✅ Zero hallucinations
⚠️ Lower accuracy (63.8%)

Trade-offs:

41% faster than qwen2.5-vl-7b
But 12% lower accuracy
Misses 40% of jerseys (185 false negatives)

Best For:

Real-time or high-volume processing
Applications where speed is more important than completeness
Initial rough filtering before manual review

Should You Use Confidence Scores for Filtering?

Answer: YES - Confidence scores are useful for most models

Evidence from Testing:

7 out of 8 models show good confidence calibration:

Model	Avg Confidence (Correct)	Avg Confidence (Incorrect)	Difference	Reliability
lfm2-vl-1.6b	91.8	80.0	+11.9	⭐⭐⭐ Excellent
gemma-3-4b	85.2	79.0	+6.2	⭐⭐ Very Good
gemma-3-27b	88.2	82.2	+6.0	⭐⭐ Very Good
gemma-3-12b	91.8	88.7	+3.1	⭐ Good
mistral-small-24b-q8	92.3	89.1	+3.1	⭐ Good
mistral-small-24b-q4	93.0	90.7	+2.4	⭐ Good
qwen2.5-vl-7b	94.6	94.1	+0.5	Limited utility
kimi-vl-3b	88.4	89.7	-1.3	❌ Not reliable

What This Means:

For most models, setting a confidence threshold can significantly reduce false positives:

A threshold of 85 on gemma-3-27b would keep most correct detections (88.2 avg) while filtering many incorrect ones (82.2 avg)
A threshold of 85 on gemma-3-4b would be even more effective

Exception: qwen2.5-vl-7b has minimal difference (94.6 vs 94.1), making threshold filtering less useful despite being the most accurate model.

Recommended Filtering Strategy:

Use gemma-3-27b with confidence threshold of 85+ for best balance of accuracy and filtering
Use gemma-3-4b with confidence threshold of 85+ for faster processing with good filtering
Use qwen2.5-vl-7b without filtering when you need maximum recall and will manually review results

Model-Specific Recommendations

For Different Use Cases:

🎯 Highest Accuracy Required

Model: qwen2.5-vl-7b
Expected Results: Find 75% of jerseys, 30% false positive rate
Processing: 13.4 seconds per image
Setup: Use raw results, manually review all detections

🎯 Best Balance of Speed and Accuracy

Model: gemma-3-12b
Expected Results: Find 70% of jerseys, reasonable false positive rate
Processing: 18.9 seconds per image
Setup: Apply confidence threshold of 90+ to reduce false positives

🎯 Maximum Quality with Confidence Filtering

Model: gemma-3-27b
Expected Results: Find 74% of jerseys, filter false positives effectively
Processing: 20.9 seconds per image
Setup: Apply confidence threshold of 85+ to reduce false positives by ~50%

⚡ Speed is Critical

Model: gemma-3-4b
Expected Results: Find 60% of jerseys quickly
Processing: 7.9 seconds per image
Setup: Apply confidence threshold of 85+ for quality filtering

❌ Do Not Use

kimi-vl-3b: Only 2% accuracy, extremely slow, poor confidence calibration

Implementation Recommendations

1. Production Deployment Strategy

Recommended: Two-tier approach

Tier 1 (Automatic): gemma-3-27b with confidence threshold 85+
- Automatically tag high-confidence detections
- Expected: ~200 correct detections per 194 images with minimal false positives
Tier 2 (Review Queue): qwen2.5-vl-7b on remaining images
- Human review of all detections below confidence threshold
- Catches jerseys missed by Tier 1

2. Confidence Threshold Guidelines

Based on testing data:

Model	Recommended Threshold	Expected Precision	Expected Recall
gemma-3-27b	85+	~85-90%	~60-65%
gemma-3-4b	85+	~80-85%	~50-55%
gemma-3-12b	90+	~80-85%	~60-65%
qwen2.5-vl-7b	Don't filter	70.7%	75.2%

3. Performance Optimization

Processing 1000 images:

qwen2.5-vl-7b: ~3.7 hours
gemma-3-27b: ~5.8 hours
gemma-3-4b: ~2.2 hours

Recommendation: Use gemma-3-4b for initial pass, qwen2.5-vl-7b for second pass on low-confidence results.

Conclusions

Main Findings:

qwen2.5-vl-7b is the most accurate model but has limited confidence score utility
gemma-3-27b offers the best combination of accuracy and confidence-based filtering
Confidence scores are highly valuable for reducing false positives in most models
Speed vs accuracy trade-offs are significant - fastest model is 9% less accurate than best
One model (kimi-vl-3b) is completely unsuitable for this task

Strategic Recommendations:

For most users: Deploy gemma-3-27b with confidence threshold of 85+

Balances accuracy, speed, and filtering capability
Reduces manual review burden significantly
Good confidence calibration enables automated decision-making

For maximum accuracy: Deploy qwen2.5-vl-7b without filtering

Best for finding all possible jerseys
Requires manual review of results
Accept higher false positive rate

For high-volume processing: Deploy gemma-3-4b with confidence threshold of 85+

Fast enough for real-time applications
Good accuracy for the speed
Effective filtering capability

Final Verdict:

Winner: qwen2.5-vl-7b for pure accuracy Best Overall: gemma-3-27b for practical deployment with confidence filtering Best Value: gemma-3-4b for speed-sensitive applications

Technical Notes

Test Dataset: 194 images with ground truth jersey numbers encoded in filenames
Total Expected Jerseys: 416-462 depending on which images each model processed successfully
Evaluation Metrics: Precision, Recall, F1 Score, Confidence Calibration
Hardware: Testing performed on comparable hardware configurations
Prompt: All models used identical jersey detection prompt with confidence scores

Report generated from comprehensive testing of 8 vision-language models for jersey number detection in sports photography.

11 KiB Raw Blame History

Jersey Detection Model Analysis Report

Executive Summary

Key Findings:

Model Performance Comparison

Top 3 Recommended Models

Complete Results Table

Understanding the Metrics

What the Numbers Mean:

Detailed Analysis

1. Best Model: qwen2.5-vl-7b

2. Runner-Up: gemma-3-27b

3. Alternative: gemma-3-4b (Speed Champion)

Should You Use Confidence Scores for Filtering?

Answer: YES - Confidence scores are useful for most models

Evidence from Testing:

What This Means:

Recommended Filtering Strategy:

Model-Specific Recommendations

For Different Use Cases:

🎯 Highest Accuracy Required

🎯 Best Balance of Speed and Accuracy

🎯 Maximum Quality with Confidence Filtering

⚡ Speed is Critical

❌ Do Not Use

Implementation Recommendations

1. Production Deployment Strategy

2. Confidence Threshold Guidelines

3. Performance Optimization

Conclusions

Main Findings:

Strategic Recommendations:

Final Verdict:

Technical Notes

11 KiB

Raw Blame History