jersey_test/docs/JERSEY_DETECTION_MODEL_ANALYSIS.md
Rick McEwen 8706edcd13 Initial commit: Jersey detection test suite
Test scripts and utilities for evaluating vision-language models
on jersey number detection using llama.cpp server.
2026-01-20 13:37:01 -07:00

Jersey Detection Model Analysis Report

Date: October 22, 2025
Models Tested: 8 vision-language models
Test Images: 194 images with known jersey numbers
Purpose: Determine the best model for automated jersey number detection in sports photography


Executive Summary

After comprehensive testing of 8 different AI models on 194 sports images with known jersey numbers, we recommend qwen2.5-vl-7b as the best overall model for jersey detection, with gemma-3-27b as a close second choice depending on specific needs.

Key Findings:

  1. Best Overall Performance: qwen2.5-vl-7b achieves the highest accuracy (72.9% F1 score)
  2. Confidence Scores Are Useful: 7 out of 8 models show positive confidence calibration, meaning correct detections receive higher average confidence than incorrect ones (for qwen2.5-vl-7b the gap is only +0.5)
  3. Speed vs Accuracy Trade-off: The most accurate models take 13-21 seconds per image; faster models sacrifice significant accuracy

Model Performance Comparison

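| Rank | Model | Accuracy (F1) | Speed | Correct Detections | False Alarms | Confidence Reliability |
|------|-------|---------------|-------|--------------------|--------------|------------------------|
| 🥇 1 | qwen2.5-vl-7b | 72.9% | 13.4s | 328 / 436 (75%) | 136 | Good |
| 🥈 2 | gemma-3-27b | 72.1% | 20.9s | 343 / 462 (74%) | 147 | Very Good (+6.0) |
| 🥉 3 | gemma-3-12b | 69.8% | 18.9s | 322 / 462 (70%) | 139 | Good (+3.1) |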

Complete Results Table

| Model | Accuracy (F1 Score) | Correct Detections | False Alarms | Missed Jerseys | Speed (sec/image) | Confidence Calibration |
|-------|---------------------|--------------------|--------------|----------------|-------------------|------------------------|
| qwen2.5-vl-7b | 72.9% | 328 / 436 | 136 | 108 | 13.4 | +0.5 (Good) |
| gemma-3-27b | 72.1% | 343 / 462 | 147 | 119 | 20.9 | +6.0 (Very Good) |
| gemma-3-12b | 69.8% | 322 / 462 | 139 | 140 | 18.9 | +3.1 (Good) |
| mistral-small-24b-q4 | 67.6% | 328 / 462 | 180 | 134 | 15.1 | +2.4 (Good) |
| mistral-small-24b-q8 | 67.2% | 330 / 462 | 190 | 132 | 22.6 | +3.1 (Good) |
| gemma-3-4b | 63.8% | 277 / 462 | 130 | 185 | 7.9 | +6.2 (Very Good) |
| lfm2-vl-1.6b | 50.5% | 171 / 448 | 58 | 277 | 4.6 | +11.9 (Excellent) |
| kimi-vl-3b | 2.0% | 5 / 416 | 67 | 411 | 40.0 🐌 | -1.3 (Poor) |

Understanding the Metrics

What the Numbers Mean:

  • Accuracy (F1 Score): Overall effectiveness balancing correct detections and false alarms

    • 70%+ = Excellent for production use
    • 60-70% = Good for assisted workflows
    • Below 60% = Not recommended
  • Correct Detections: Out of all jerseys that should have been found, how many were actually detected

    • Example: "328 / 436" means the model found 328 jerseys out of 436 that were actually in the images
  • False Alarms: Jersey numbers detected that weren't actually in the image

    • Lower is better - these are incorrect detections
    • Can be filtered using confidence scores
  • Missed Jerseys: Jersey numbers that were in the image but not detected

    • Lower is better - these are opportunities lost
  • Speed: Average seconds to process one image

    • Very fast = < 8s
    • Fast = 8-15s
    • Standard = 15-25s
    • 🐌 Slow = > 30s
  • Confidence Calibration: The difference between average confidence on correct vs incorrect detections

    • Positive number (e.g., +6.0) = Good calibration - correct detections have higher confidence
    • Negative number = Poor calibration - can't trust confidence scores
    • Higher positive values = Better for filtering with confidence thresholds

Detailed Analysis

1. Best Model: qwen2.5-vl-7b

Why It's the Best:

  • Highest overall accuracy (72.9%)
  • Best recall - finds 75% of all jerseys
  • Reasonable speed (13.4 seconds per image)
  • Very low hallucination rate (only 1%)
  • Confidence calibration is positive, though only slightly (+0.5)

Strengths:

  • Finds the most jerseys (highest recall at 75.2%)
  • Rarely makes up fake jersey numbers (hallucination rate: 1%)
  • Almost always returns results (empty response rate: 2.6%)

Weaknesses:

  • Generates 136 false positives (30% of detections are incorrect)
  • Confidence calibration is minimal (+0.5), making threshold filtering less effective
  • All confidence scores are 90-95, showing limited variation

Best For:

  • Applications where finding all jerseys is critical
  • Batch processing where moderate false positives are acceptable
  • When combined with manual review of results

2. Runner-Up: gemma-3-27b

Why It's Excellent:

  • Nearly identical accuracy to the winner (72.1% vs 72.9%)
  • Finds the most total jerseys (343 correct detections)
  • Excellent confidence calibration (+6.0 difference)
  • No hallucinations
  • ⚠️ Slower processing (20.9s per image)

Strengths:

  • Best for confidence-based filtering (6-point difference between correct/incorrect)
  • Highest absolute number of correct detections (343)
  • More varied confidence scores (54% in 90-100 range, 42% in 70-89 range)

Weaknesses:

  • 56% slower than qwen2.5-vl-7b
  • Similar false positive rate

Best For:

  • Applications requiring confidence-based filtering
  • When processing time is not critical
  • Maximizing total correct detections

3. Alternative: gemma-3-4b (Speed Champion)

Why Consider It:

  • Fast processing (7.9 seconds per image)
  • Very good confidence calibration (+6.2)
  • Zero hallucinations
  • ⚠️ Lower accuracy (63.8%)

Trade-offs:

  • Takes 41% less time per image than qwen2.5-vl-7b (7.9s vs 13.4s)
  • But F1 is 9.1 points lower (63.8% vs 72.9%, about 12% in relative terms)
  • Misses 40% of jerseys (185 false negatives)
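The relative speed and accuracy figures quoted in this analysis follow from the table values by simple arithmetic. The sketch below shows one way to derive them, always taking qwen2.5-vl-7b as the baseline; the helper name `relative_change` is illustrative.

```python
def relative_change(new: float, baseline: float) -> float:
    """Change of `new` relative to `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Values from the results table.
qwen_speed, gemma27_speed, gemma4_speed = 13.4, 20.9, 7.9  # seconds per image
qwen_f1, gemma4_f1 = 72.9, 63.8                            # F1 in percent

slower = relative_change(gemma27_speed, qwen_speed)     # gemma-3-27b vs qwen
time_saved = -relative_change(gemma4_speed, qwen_speed)  # gemma-3-4b vs qwen
f1_drop_rel = -relative_change(gemma4_f1, qwen_f1)       # relative F1 loss
f1_drop_points = qwen_f1 - gemma4_f1                     # absolute F1 loss

print(f"gemma-3-27b is {slower:.0%} slower")             # 56% slower
print(f"gemma-3-4b takes {time_saved:.0%} less time")    # 41% less time
print(f"gemma-3-4b F1 is {f1_drop_rel:.0%} lower "
      f"({f1_drop_points:.1f} points)")                  # 12% lower (9.1 points)
```

Note the difference between the last two numbers: 9.1 is a gap in percentage points, while 12% is that gap expressed relative to the baseline F1.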

Best For:

  • Real-time or high-volume processing
  • Applications where speed is more important than completeness
  • Initial rough filtering before manual review

Should You Use Confidence Scores for Filtering?

Answer: YES - Confidence scores are useful for most models

Evidence from Testing:

7 out of 8 models show positive confidence calibration (correct detections score higher on average than incorrect ones):

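| Model | Avg Confidence (Correct) | Avg Confidence (Incorrect) | Difference | Reliability |
|-------|--------------------------|----------------------------|------------|-------------|
| lfm2-vl-1.6b | 91.8 | 80.0 | +11.9 | Excellent |
| gemma-3-4b | 85.2 | 79.0 | +6.2 | Very Good |
| gemma-3-27b | 88.2 | 82.2 | +6.0 | Very Good |
| gemma-3-12b | 91.8 | 88.7 | +3.1 | Good |
| mistral-small-24b-q8 | 92.3 | 89.1 | +3.1 | Good |
| mistral-small-24b-q4 | 93.0 | 90.7 | +2.4 | Good |
| qwen2.5-vl-7b | 94.6 | 94.1 | +0.5 | Limited utility |
| kimi-vl-3b | 88.4 | 89.7 | -1.3 | Not reliable |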
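The "Difference" column is the mean confidence of correct detections minus the mean confidence of incorrect ones. A minimal sketch of that calculation follows; the `(confidence, is_correct)` pair format and the sample values are assumptions made for illustration, not the test suite's actual data layout.

```python
from statistics import mean


def calibration_gap(detections):
    """Mean confidence of correct detections minus mean of incorrect ones.

    `detections` is a list of (confidence, is_correct) pairs.
    Returns None when either group is empty, since no gap can be computed.
    """
    correct = [conf for conf, ok in detections if ok]
    incorrect = [conf for conf, ok in detections if not ok]
    if not correct or not incorrect:
        return None
    return mean(correct) - mean(incorrect)


# Hypothetical detections, not taken from the report.
sample = [(95, True), (90, True), (85, True), (80, False), (70, False)]
gap = calibration_gap(sample)
print(gap)  # 15: correct detections average 90, incorrect ones average 75
```

A gap near zero, as with qwen2.5-vl-7b, means the two groups are nearly indistinguishable by confidence alone, whatever the absolute scores are.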

What This Means:

For most models, setting a confidence threshold can significantly reduce false positives:

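  • A threshold of 85 on gemma-3-27b falls between its average confidence for correct detections (88.2) and for incorrect ones (82.2), so it should keep most correct detections while filtering many incorrect ones
  • A threshold of 85 on gemma-3-4b works on a slightly larger gap (85.2 vs 79.0), though its correct-detection average sits close to the threshold, so some correct detections will be filtered too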

Exception: qwen2.5-vl-7b has minimal difference (94.6 vs 94.1), making threshold filtering less useful despite being the most accurate model.

Recommendations:

  1. Use gemma-3-27b with confidence threshold of 85+ for best balance of accuracy and filtering
  2. Use gemma-3-4b with confidence threshold of 85+ for faster processing with good filtering
  3. Use qwen2.5-vl-7b without filtering when you need maximum recall and will manually review results

Model-Specific Recommendations

For Different Use Cases:

🎯 Highest Accuracy Required

  • Model: qwen2.5-vl-7b
  • Expected Results: Find 75% of jerseys; about 30% of detections are false positives
  • Processing: 13.4 seconds per image
  • Setup: Use raw results, manually review all detections

🎯 Best Balance of Speed and Accuracy

  • Model: gemma-3-12b
  • Expected Results: Find 70% of jerseys, reasonable false positive rate
  • Processing: 18.9 seconds per image
  • Setup: Apply confidence threshold of 90+ to reduce false positives

🎯 Maximum Quality with Confidence Filtering

  • Model: gemma-3-27b
  • Expected Results: Find 74% of jerseys, filter false positives effectively
  • Processing: 20.9 seconds per image
  • Setup: Apply confidence threshold of 85+ to reduce false positives by ~50%

Speed is Critical

  • Model: gemma-3-4b
  • Expected Results: Find 60% of jerseys quickly
  • Processing: 7.9 seconds per image
  • Setup: Apply confidence threshold of 85+ for quality filtering

Do Not Use

  • kimi-vl-3b: Only 2% accuracy, extremely slow, poor confidence calibration

Implementation Recommendations

1. Production Deployment Strategy

Recommended: Two-tier approach

  • Tier 1 (Automatic): gemma-3-27b with confidence threshold 85+

    • Automatically tag high-confidence detections
    • Expected: ~200 correct detections per 194 images with minimal false positives
  • Tier 2 (Review Queue): qwen2.5-vl-7b on remaining images

    • Human review of all detections below confidence threshold
    • Catches jerseys missed by Tier 1
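The two-tier strategy can be sketched as a routing step over Tier 1 output. This is one possible reading, not the project's implementation: the `Detection` record, the function name, and the choice to re-run every image that has no auto-tagged detection are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    image: str         # image filename
    number: str        # detected jersey number
    confidence: float  # model-reported confidence, 0-100


def route_detections(detections, all_images, threshold=85):
    """Split Tier 1 output into auto-tagged and review items.

    Returns (auto_tagged, review_queue, rerun_images), where rerun_images
    are the images with no auto-tagged detection (assumed Tier 2 input).
    """
    auto_tagged = [d for d in detections if d.confidence >= threshold]
    review_queue = [d for d in detections if d.confidence < threshold]
    tagged = {d.image for d in auto_tagged}
    rerun_images = [img for img in all_images if img not in tagged]
    return auto_tagged, review_queue, rerun_images


# Hypothetical Tier 1 (gemma-3-27b) output for three images.
tier1 = [
    Detection("a.jpg", "12", 93),
    Detection("a.jpg", "7", 78),
    Detection("b.jpg", "34", 81),
]
auto, review, rerun = route_detections(tier1, ["a.jpg", "b.jpg", "c.jpg"])
print([d.number for d in auto])    # ['12']
print([d.number for d in review])  # ['7', '34']
print(rerun)                       # ['b.jpg', 'c.jpg']
```

Keeping the threshold as a parameter makes it easy to apply the per-model values suggested in the guidelines that follow.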

2. Confidence Threshold Guidelines

Based on testing data:

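| Model | Recommended Threshold | Expected Precision | Expected Recall |
|-------|-----------------------|--------------------|-----------------|
| gemma-3-27b | 85+ | ~85-90% | ~60-65% |
| gemma-3-4b | 85+ | ~80-85% | ~50-55% |
| gemma-3-12b | 90+ | ~80-85% | ~60-65% |
| qwen2.5-vl-7b | Don't filter | 70.7% | 75.2% |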
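Thresholds like these can be chosen by sweeping candidate values and recomputing precision and recall at each one. The sketch below illustrates the idea on made-up data; the `(confidence, is_correct)` pairs and the expected-jersey count are hypothetical, and the numbers it prints are not the report's results.

```python
def precision_recall_at(detections, total_expected, threshold):
    """Precision and recall after dropping detections below `threshold`.

    `detections` is a list of (confidence, is_correct) pairs and
    `total_expected` is the number of jerseys actually present.
    """
    kept = [ok for conf, ok in detections if conf >= threshold]
    correct = sum(kept)
    precision = correct / len(kept) if kept else 0.0
    recall = correct / total_expected if total_expected else 0.0
    return precision, recall


# Hypothetical detections for images containing 6 jerseys in total.
detections = [(95, True), (92, True), (88, True), (86, False),
              (84, True), (80, False), (75, False)]

for threshold in (0, 85, 90):
    p, r = precision_recall_at(detections, total_expected=6, threshold=threshold)
    print(f"threshold={threshold:>2}: precision={p:.2f} recall={r:.2f}")
# threshold= 0: precision=0.57 recall=0.67
# threshold=85: precision=0.75 recall=0.50
# threshold=90: precision=1.00 recall=0.33
```

The pattern matches the table: raising the threshold trades recall for precision, and the payoff depends on how well the model's confidence separates correct from incorrect detections.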

3. Performance Optimization

Processing 1000 images:

  • qwen2.5-vl-7b: ~3.7 hours
  • gemma-3-27b: ~5.8 hours
  • gemma-3-4b: ~2.2 hours
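These estimates are the per-image averages scaled to 1000 images. The sketch below reproduces them, assuming images are processed one at a time with no batching or parallelism.

```python
# Average seconds per image, from the results table.
SECONDS_PER_IMAGE = {
    "qwen2.5-vl-7b": 13.4,
    "gemma-3-27b": 20.9,
    "gemma-3-4b": 7.9,
}


def batch_hours(seconds_per_image: float, n_images: int) -> float:
    """Estimated wall-clock hours for sequential processing."""
    return seconds_per_image * n_images / 3600


for model, spi in SECONDS_PER_IMAGE.items():
    print(f"{model}: ~{batch_hours(spi, 1000):.1f} hours")
# qwen2.5-vl-7b: ~3.7 hours
# gemma-3-27b: ~5.8 hours
# gemma-3-4b: ~2.2 hours
```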

Recommendation: Use gemma-3-4b for initial pass, qwen2.5-vl-7b for second pass on low-confidence results.


Conclusions

Main Findings:

  1. qwen2.5-vl-7b is the most accurate model but has limited confidence score utility
  2. gemma-3-27b offers the best combination of accuracy and confidence-based filtering
  3. Confidence scores are highly valuable for reducing false positives in most models
  4. Speed vs accuracy trade-offs are significant - gemma-3-4b, the fastest recommended model, scores 9.1 F1 points below the best (63.8% vs 72.9%)
  5. One model (kimi-vl-3b) is completely unsuitable for this task

Strategic Recommendations:

For most users: Deploy gemma-3-27b with confidence threshold of 85+

  • Balances accuracy, speed, and filtering capability
  • Reduces manual review burden significantly
  • Good confidence calibration enables automated decision-making

For maximum accuracy: Deploy qwen2.5-vl-7b without filtering

  • Best for finding all possible jerseys
  • Requires manual review of results
  • Accept higher false positive rate

For high-volume processing: Deploy gemma-3-4b with confidence threshold of 85+

  • Fast enough for real-time applications
  • Good accuracy for the speed
  • Effective filtering capability

Final Verdict:

Winner: qwen2.5-vl-7b for pure accuracy
Best Overall: gemma-3-27b for practical deployment with confidence filtering
Best Value: gemma-3-4b for speed-sensitive applications


Technical Notes

  • Test Dataset: 194 images with ground truth jersey numbers encoded in filenames
  • Total Expected Jerseys: 416-462 depending on which images each model processed successfully
  • Evaluation Metrics: Precision, Recall, F1 Score, Confidence Calibration
  • Hardware: Testing performed on comparable hardware configurations
  • Prompt: All models used identical jersey detection prompt with confidence scores
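Because the ground truth lives in the filenames, per-image scoring reduces to parsing the expected numbers and comparing them with the model's output. The report does not document the filename convention, so the sketch below assumes a hypothetical one (`<id>_<number>_<number>.<ext>`) and set-based matching; both are illustrative only.

```python
from pathlib import Path


def expected_numbers(filename: str) -> set:
    """Jersey numbers encoded in a filename.

    Assumed (hypothetical) convention: '<id>_<num>_<num>...<ext>',
    e.g. 'img001_12_34.jpg' encodes jerseys 12 and 34.
    """
    tokens = Path(filename).stem.split("_")[1:]  # skip the image id
    return {tok for tok in tokens if tok.isdigit()}


def score_image(expected: set, predicted) -> dict:
    """Count correct detections, false alarms, and missed jerseys."""
    predicted = set(predicted)
    return {
        "correct": len(expected & predicted),
        "false_alarms": len(predicted - expected),
        "missed": len(expected - predicted),
    }


truth = expected_numbers("img001_12_34.jpg")
result = score_image(truth, ["12", "7"])
print(sorted(truth))  # ['12', '34']
print(result)         # {'correct': 1, 'false_alarms': 1, 'missed': 1}
```

Summing these per-image counts over the dataset would give the "Correct Detections", "False Alarms", and "Missed Jerseys" columns of the results table.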

Report generated from comprehensive testing of 8 vision-language models for jersey number detection in sports photography.