
Logo Recognition Model Analysis

Date: January 7, 2026
Purpose: Determine the best model and threshold for logo recognition of logos not currently in the test set.


Executive Summary

| Model | Best Threshold | F1 Score | Precision | Recall | Recommended Use |
|-------|----------------|----------|-----------|--------|-----------------|
| Image-Split Fine-tuned | 0.70-0.75 | 67-68% | 66-80% | 59-68% | Known logos (in reference set) |
| Baseline CLIP | 0.70 | 57-60% | 48-49% | 72-77% | Unknown logos (never seen before) |
| Logo-Split Fine-tuned | 0.76 | 56% | 49% | 64% | Not recommended |
| DINOv2 (small/large) | - | 29-30% | 22-32% | 28-43% | Not suitable |

Winner: Image-Split Fine-tuned Model at threshold 0.70-0.75


Detailed Model Comparison

1. Baseline CLIP (openai/clip-vit-large-patch14)

The pre-trained CLIP model without any fine-tuning.

Threshold Performance:

| Threshold | Precision | Recall | F1 |
|-----------|-----------|--------|----|
| 0.70 | 47.9% | 71.8% | 57.5% |
| 0.80 | 33.0% | 63.1% | 43.4% |
| 0.85 | 26.9% | 43.4% | 33.2% |
| 0.90 | 54.9% | 22.8% | 32.2% |

Similarity Distribution:

  • True Positive mean: 0.854 (range: 0.75-0.95)
  • False Positive mean: 0.846 (range: 0.75-0.95)
  • Problem: TP and FP distributions almost completely overlap

Suggested optimal threshold: 0.756 (predicted F1 = 67.1%)
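A "suggested optimal threshold" like this is typically obtained by sweeping candidate thresholds over scored matches and keeping the F1 maximizer. A minimal sketch of that sweep, using made-up toy scores rather than the real evaluation data:

```python
import numpy as np

def best_f1_threshold(scores, labels, thresholds):
    """Sweep candidate thresholds and keep the one maximizing F1.
    scores: similarity of each candidate match; labels: 1 if the match is correct."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_f1, best_t = 0.0, None
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t

# Toy similarities: correct matches (label 1) tend to score higher
f1, t = best_f1_threshold(
    scores=[0.92, 0.88, 0.85, 0.80, 0.84, 0.79, 0.76, 0.70],
    labels=[1, 1, 1, 1, 0, 0, 0, 0],
    thresholds=[0.70, 0.75, 0.80, 0.85, 0.90],
)
```

Because TP and FP distributions overlap heavily for the baseline model, the F1 curve here is flat near its peak, which is why small threshold changes swing precision and recall sharply.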

Strengths:

  • Good recall at low thresholds
  • Works on completely unseen logos
  • No training required

Weaknesses:

  • Poor separation between correct and incorrect matches
  • High false positive rate

2. Fine-tuned CLIP (Logo-Level Splits)

Trained with contrastive learning, tested on completely unseen logo brands.

Threshold Performance:

| Threshold | Precision | Recall | F1 |
|-----------|-----------|--------|----|
| 0.70 | 25.9% | 67.1% | 37.4% |
| 0.76 | 49.1% | 64.3% | 55.7% |
| 0.82 | 75.7% | 41.4% | 53.5% |
| 0.86 | 88.6% | 28.1% | 42.7% |

Similarity Distribution:

  • True Positive mean: 0.853
  • False Positive mean: 0.787 (better separation than baseline)
  • Missed logos mean: 0.711 (only 43.7% above 0.75)

Suggested optimal threshold: 0.82 (predicted F1 = 71.9%)

Strengths:

  • Better TP/FP separation than baseline
  • Very high precision at high thresholds (88.6% at t=0.86)

Weaknesses:

  • Does not generalize well to unseen logo brands
  • Many correct logos score below threshold (56% of missed logos below 0.75)
  • Worse than baseline at threshold 0.70
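Both fine-tuned variants are described only as "trained with contrastive learning". The snippet below is a generic InfoNCE-style sketch over a batch of matched embedding pairs, shown purely to illustrate the objective; the actual training code, batch construction, and hyperparameters are assumptions:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """Contrastive (InfoNCE) loss over a batch of matched embedding pairs.
    anchors[i] and positives[i] embed the same logo; every other pair in the
    batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                     # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                # matched pair should win its row

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(emb, emb + 0.01 * rng.normal(size=(8, 16)))
loss_random = info_nce_loss(emb, rng.normal(size=(8, 16)))
# matched pairs give a much lower loss than random pairings
```

The objective pulls same-brand embeddings together and pushes different-brand embeddings apart, which explains the improved TP/FP separation; but as the logo-split results show, that separation does not automatically transfer to brands absent from training.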

3. Fine-tuned CLIP (Image-Level Splits) BEST

Trained with contrastive learning, all logo brands seen during training (different images held out for testing).

Threshold Performance:

| Threshold | Precision | Recall | F1 |
|-----------|-----------|--------|----|
| 0.65 | 56.9% | 75.9% | 65.0% |
| 0.70 | 66.3% | 68.3% | 67.3% |
| 0.75 | 79.9% | 59.3% | 68.1% |
| 0.80 | 83.7% | 52.8% | 64.8% |
| 0.85 | 92.4% | 42.8% | 58.5% |
| 0.90 | 98.9% | 24.7% | 39.5% |

Similarity Distribution:

  • True Positive mean: 0.866 (higher than other models)
  • False Positive mean: 0.807
  • TP-FP gap: 0.059 (best separation)
  • At t=0.75: 92 TP vs only 38 FP (excellent ratio)

Suggested optimal threshold: 0.755 (predicted F1 = 85.6%)

Strengths:

  • Best overall F1 score (68.1% at t=0.75)
  • Best precision at any threshold (79.9-98.9%)
  • Excellent TP/FP ratio
  • Highest true positive similarity scores

Weaknesses:

  • Requires logos to be in the reference set during training
  • May not generalize to completely novel logos

4. DINOv2 Models

Tested for comparison but significantly underperformed.

| Model | Precision | Recall | F1 |
|-------|-----------|--------|----|
| DINOv2-small | 22.4% | 42.8% | 29.5% |
| DINOv2-large | 32.2% | 28.5% | 30.2% |

Not recommended for logo recognition tasks.


Recommendations

For Logo Recognition of Known Logos (logos in your reference set)

Use: Image-Split Fine-tuned Model

```shell
# Recommended configuration
python test_logo_detection.py \
    -e models/logo_detection/clip_finetuned_image_split \
    -t 0.70 \
    --matching-method multi-ref \
    --use-max-similarity
```

| Use Case | Threshold | Expected Performance |
|----------|-----------|----------------------|
| Balanced (recommended) | 0.70 | 66% precision, 68% recall, 67% F1 |
| High precision | 0.75 | 80% precision, 59% recall, 68% F1 |
| Very high precision | 0.80 | 84% precision, 53% recall, 65% F1 |
| Maximum precision | 0.85+ | 92%+ precision, <43% recall |

For Logo Recognition of Unknown Logos (completely novel brands)

Use: Baseline CLIP (the fine-tuned models don't generalize well)

```shell
# Recommended configuration
python test_logo_detection.py \
    -e openai/clip-vit-large-patch14 \
    -t 0.70 \
    --matching-method multi-ref \
    --use-max-similarity
```

Expected: ~48% precision, ~72% recall, ~58% F1
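The `--matching-method multi-ref --use-max-similarity` flags suggest a decision rule like the one below: a query crop is assigned to the brand whose best-scoring reference image exceeds the threshold. This is a hypothetical sketch of that rule, not the actual `test_logo_detection.py` implementation:

```python
import numpy as np

def match_logo(query, references, threshold=0.70):
    """Multi-reference matching with max-similarity: compare the query
    embedding against every reference embedding per brand, keep each brand's
    best score, and accept the top brand only if it clears the threshold.
    query: embedding vector; references: {brand: [embedding, ...]}."""
    q = query / np.linalg.norm(query)
    best_brand, best_sim = None, -1.0
    for brand, embs in references.items():
        embs = np.asarray(embs, dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sim = float(np.max(embs @ q))  # max over that brand's references
        if sim > best_sim:
            best_brand, best_sim = brand, sim
    return (best_brand, best_sim) if best_sim >= threshold else (None, best_sim)

# Toy 3-d "embeddings" with hypothetical brand names (real CLIP embeddings are 768-d)
refs = {"acme": [[1, 0, 0], [0.9, 0.1, 0]], "globex": [[0, 1, 0]]}
brand, sim = match_logo(np.array([0.95, 0.05, 0.0]), refs)
```

Taking the max over several reference images per brand is what makes the method tolerant of viewpoint and style variation: only the single closest reference has to clear the threshold.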


Key Findings

1. Image-Level Splits Dramatically Improve Performance

The image-split fine-tuned model outperforms all others because:

  • It learns brand-specific features during training
  • Test images are new, but come from the same brands seen in training
  • Better represents real-world use where you have reference images for logos you want to detect

2. Logo-Level Splits Test True Generalization (but results are poor)

The logo-split model tests whether fine-tuning helps with completely unseen logos:

  • Result: It doesn't help much (56% F1 vs 58% baseline)
  • Contrastive learning doesn't transfer well to novel brands
  • Use baseline CLIP for novel logo detection

3. Threshold Sweet Spot is 0.70-0.75

For the CLIP-based models, the optimal F1 occurs around threshold 0.70-0.75:

  • Lower thresholds: Too many false positives
  • Higher thresholds: Misses too many correct logos
  • At 0.90+: Precision is high but recall drops below 25%

4. Precision-Recall Tradeoff

| Priority | Threshold | Tradeoff |
|----------|-----------|----------|
| Recall | 0.65-0.70 | More matches, more false positives |
| Balanced | 0.70-0.75 | Best F1 score |
| Precision | 0.75-0.80 | Fewer false positives, misses some matches |
| High Precision | 0.85+ | Very few false positives, misses many matches |
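This tradeoff falls directly out of the precision and recall definitions: raising the threshold can only remove predictions, so false positives shrink (precision rises) while true positives are also lost (recall falls). A toy demonstration with made-up scores, not the measured distributions:

```python
import numpy as np

# Toy similarity scores: true matches (label 1) sit higher on average,
# but the two distributions overlap, as in the TP/FP means reported above
scores = np.array([0.90, 0.86, 0.83, 0.78, 0.84, 0.76, 0.72, 0.68])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

def precision_recall(threshold):
    """Precision and recall of thresholded matching on the toy scores."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn)
    return p, r

low = precision_recall(0.70)   # permissive: high recall, more false positives
high = precision_recall(0.85)  # strict: high precision, misses some matches
```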

Conclusion

For production use with known logos:

  • Use Image-Split Fine-tuned Model at threshold 0.70-0.75
  • Expected F1: 67-68%, Precision: 66-80%

For discovering unknown logos:

  • Use Baseline CLIP at threshold 0.70
  • Expected F1: ~58%, Precision: ~48%

The image-split fine-tuning provides a significant improvement (+8-10 F1 points) over baseline for known logos, but does not help with completely novel brands. For a production system, ensure all target logos are included in the training/reference set.