# Logo Recognition Model Analysis

**Date:** January 7, 2026

**Purpose:** Determine the best model and threshold for recognizing logos that are not currently in the test set.

---

## Executive Summary

| Model | Best Threshold | F1 Score | Precision | Recall | Recommended Use |
|-------|----------------|----------|-----------|--------|-----------------|
| **Image-Split Fine-tuned** | 0.70-0.75 | **67-68%** | 66-80% | 59-68% | Known logos (in reference set) |
| Baseline CLIP | 0.70 | 57-60% | 48-49% | 72-77% | Unknown logos (never seen before) |
| Logo-Split Fine-tuned | 0.76 | 56% | 49% | 64% | Not recommended |
| DINOv2 (small/large) | - | 29-30% | 22-32% | 28-43% | Not suitable |

**Winner: Image-Split Fine-tuned Model** at threshold **0.70-0.75**

---

## Detailed Model Comparison

### 1. Baseline CLIP (openai/clip-vit-large-patch14)

The pre-trained CLIP model without any fine-tuning.

**Threshold Performance:**

| Threshold | Precision | Recall | F1 |
|-----------|-----------|--------|-----|
| 0.70 | 47.9% | 71.8% | 57.5% |
| 0.80 | 33.0% | 63.1% | 43.4% |
| 0.85 | 26.9% | 43.4% | 33.2% |
| 0.90 | 54.9% | 22.8% | 32.2% |

**Similarity Distribution:**

- True Positive mean: 0.854 (range: 0.75-0.95)
- False Positive mean: 0.846 (range: 0.75-0.95)
- **Problem:** TP and FP distributions almost completely overlap

**Suggested optimal threshold:** 0.756 (predicted F1 = 67.1%)

**Strengths:**

- Good recall at low thresholds
- Works on completely unseen logos
- No training required

**Weaknesses:**

- Poor separation between correct and incorrect matches
- High false positive rate

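For reference, a minimal sketch of how this kind of similarity thresholding can be applied with the baseline model, assuming CLIP embeddings from the `transformers` library. The image paths and the single-reference comparison are illustrative assumptions, not the actual `test_logo_detection.py` pipeline:

```python
# Minimal sketch (not the actual pipeline): score one candidate crop against
# one reference logo with baseline CLIP and apply the decision threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized CLIP image embedding for one image file."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

# Placeholder paths: a detected logo crop and one known reference logo.
similarity = (embed("candidate_crop.png") @ embed("reference_logo.png").T).item()
is_match = similarity >= 0.70  # baseline operating point from the table above
```
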
---

### 2. Fine-tuned CLIP (Logo-Level Splits)

Trained with contrastive learning, tested on completely unseen logo brands.

**Threshold Performance:**

| Threshold | Precision | Recall | F1 |
|-----------|-----------|--------|-----|
| 0.70 | 25.9% | 67.1% | 37.4% |
| 0.76 | **49.1%** | 64.3% | **55.7%** |
| 0.82 | 75.7% | 41.4% | 53.5% |
| 0.86 | 88.6% | 28.1% | 42.7% |

**Similarity Distribution:**

- True Positive mean: 0.853
- False Positive mean: 0.787 (better separation than baseline)
- Missed logos mean: 0.711 (only 43.7% above 0.75)

**Suggested optimal threshold:** 0.82 (predicted F1 = 71.9%)

**Strengths:**

- Better TP/FP separation than baseline
- Very high precision at high thresholds (88.6% at t=0.86)

**Weaknesses:**

- Does not generalize well to unseen logo brands
- Many correct logos score below threshold (56% of missed logos fall below 0.75)
- Worse than baseline at threshold 0.70

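The training code itself is not part of this report; below is a hypothetical sketch of the kind of contrastive objective described, assuming a CLIP-style symmetric InfoNCE loss over in-batch pairs. The function name and batch layout are assumptions:

```python
# Hypothetical sketch of a CLIP-style contrastive fine-tuning objective:
# symmetric InfoNCE over a batch of (anchor, positive) crops of the same
# brand, with every other pair in the batch serving as a negative.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (batch, dim) embeddings of two crops per brand."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(len(a), device=a.device)  # diagonal pairs match
    # Cross-entropy in both directions, averaged, as in CLIP pre-training.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```
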
---

### 3. Fine-tuned CLIP (Image-Level Splits) ⭐ BEST

Trained with contrastive learning; all logo brands were seen during training, with different images of each brand held out for testing.

**Threshold Performance:**

| Threshold | Precision | Recall | F1 |
|-----------|-----------|--------|-----|
| 0.65 | 56.9% | **75.9%** | 65.0% |
| 0.70 | 66.3% | 68.3% | **67.3%** |
| 0.75 | **79.9%** | 59.3% | **68.1%** |
| 0.80 | 83.7% | 52.8% | 64.8% |
| 0.85 | 92.4% | 42.8% | 58.5% |
| 0.90 | 98.9% | 24.7% | 39.5% |

**Similarity Distribution:**

- True Positive mean: 0.866 (highest of all models)
- False Positive mean: 0.807
- TP-FP gap: 0.059 (versus 0.008 for baseline CLIP)
- At t=0.75: 92 TP vs. only 38 FP (an excellent ratio)

**Suggested optimal threshold:** 0.755 (predicted F1 = 85.6%)

**Strengths:**

- Best overall F1 score (68.1% at t=0.75)
- Highest precision at every tested threshold (79.9-98.9% for t ≥ 0.75)
- Excellent TP/FP ratio
- Highest true positive similarity scores

**Weaknesses:**

- Requires logos to be in the reference set during training
- May not generalize to completely novel logos

---

### 4. DINOv2 Models

Tested for comparison but significantly underperformed.

| Model | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| DINOv2-small | 22.4% | 42.8% | 29.5% |
| DINOv2-large | 32.2% | 28.5% | 30.2% |

**Not recommended** for logo recognition tasks.
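
For completeness, a hypothetical sketch of how a DINOv2 embedding could be plugged into the same cosine-similarity pipeline, using the public `facebook/dinov2-small` checkpoint; the CLS-token pooling choice is an assumption, not necessarily how the comparison above was run:

```python
# Hypothetical sketch: extract a DINOv2 image embedding for the same
# cosine-similarity matching pipeline. CLS-token pooling is an assumption.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
model = AutoModel.from_pretrained("facebook/dinov2-small").eval()

def embed_dino(path: str) -> torch.Tensor:
    """Return an L2-normalized DINOv2 embedding (CLS token)."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0]  # CLS token
    return cls / cls.norm(dim=-1, keepdim=True)
```
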

---

## Recommendations

### For Logo Recognition of Known Logos (logos in your reference set)

**Use: Image-Split Fine-tuned Model**

```bash
# Recommended configuration
python test_logo_detection.py \
    -e models/logo_detection/clip_finetuned_image_split \
    -t 0.70 \
    --matching-method multi-ref \
    --use-max-similarity
```

| Use Case | Threshold | Expected Performance |
|----------|-----------|----------------------|
| Balanced (recommended) | 0.70 | 66% precision, 68% recall, 67% F1 |
| High precision | 0.75 | 80% precision, 59% recall, 68% F1 |
| Very high precision | 0.80 | 84% precision, 53% recall, 65% F1 |
| Maximum precision | 0.85+ | 92%+ precision, <43% recall |

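The `--matching-method multi-ref --use-max-similarity` flags imply that each candidate is compared against every reference image of each brand and the maximum similarity is kept. A hypothetical sketch of that matching step, assuming L2-normalized embeddings and a per-brand reference dictionary (the names and data layout are assumptions, not the actual `test_logo_detection.py` code):

```python
# Hypothetical sketch of multi-reference, max-similarity matching as implied
# by --matching-method multi-ref --use-max-similarity. Embeddings are
# assumed to be L2-normalized, so dot products are cosine similarities.
import numpy as np

def match_logo(query: np.ndarray,
               references: dict[str, np.ndarray],
               threshold: float = 0.70):
    """query: (dim,) embedding; references: brand -> (n_refs, dim) array.
    Returns (best matching brand or None, best similarity)."""
    best_brand, best_sim = None, -1.0
    for brand, refs in references.items():
        sim = float(np.max(refs @ query))  # max over that brand's references
        if sim > best_sim:
            best_brand, best_sim = brand, sim
    if best_sim < threshold:
        return None, best_sim  # below the operating point: report no match
    return best_brand, best_sim
```
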
### For Logo Recognition of Unknown Logos (completely novel brands)

**Use: Baseline CLIP** (the fine-tuned models don't generalize well)

```bash
# Recommended configuration
python test_logo_detection.py \
    -e openai/clip-vit-large-patch14 \
    -t 0.70 \
    --matching-method multi-ref \
    --use-max-similarity
```

Expected: ~48% precision, ~72% recall, ~58% F1

---

## Key Findings

### 1. Image-Level Splits Dramatically Improve Performance

The image-split fine-tuned model outperforms all others because:

- It learns brand-specific features during training
- Test images are new, but they come from brands seen during training
- This setup mirrors real-world use, where you have reference images for every logo you want to detect

### 2. Logo-Level Splits Test True Generalization (but results are poor)

The logo-split model tests whether fine-tuning helps with completely unseen logos:

- Result: it doesn't help much (56% F1 vs. ~58% for baseline)
- Contrastive learning doesn't transfer well to novel brands
- Use baseline CLIP for novel logo detection

The sketch below illustrates how the two split styles differ in construction.

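A hypothetical construction of the two splits with scikit-learn; the toy data, variable names, and split sizes are assumptions for illustration, not the actual experiment setup:

```python
# Hypothetical sketch contrasting the two evaluation splits used above.
from sklearn.model_selection import GroupShuffleSplit, train_test_split

images = [f"img_{i:02d}.png" for i in range(12)]
brands = ["acme"] * 3 + ["globex"] * 3 + ["initech"] * 3 + ["umbrella"] * 3

# Image-level split: every brand appears in both train and test,
# but each individual image lands in exactly one of them.
train_imgs, test_imgs = train_test_split(
    images, test_size=4, stratify=brands, random_state=42)

# Logo-level split: whole brands are held out, so the test set contains
# only brands the model never saw during training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(images, groups=brands))
```
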
### 3. Threshold Sweet Spot is 0.70-0.75

For the CLIP-based models, the optimal F1 occurs around thresholds 0.70-0.76:

- Lower thresholds admit too many false positives
- Higher thresholds miss too many correct logos
- At 0.90+, precision is high but recall drops below 25%

A sketch of the kind of threshold sweep behind the "suggested optimal threshold" figures follows below.

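This is a minimal sketch of such a sweep, assuming a flat array of similarity scores with correct/incorrect labels; it is a simplification that ignores logos that never produced a scored candidate:

```python
# Hypothetical sketch of a threshold sweep: scan candidate thresholds over
# scored (similarity, is_correct) pairs and keep the F1-maximizing one.
import numpy as np

def sweep_thresholds(similarities: np.ndarray, labels: np.ndarray):
    """similarities: cosine scores; labels: 1 where the match is correct."""
    best_t, best_f1 = 0.0, 0.0
    for t in np.arange(0.60, 0.95, 0.005):
        predicted = similarities >= t
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```
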
### 4. Precision-Recall Tradeoff

| Priority | Threshold | Tradeoff |
|----------|-----------|----------|
| Recall | 0.65-0.70 | More matches, more false positives |
| Balanced | 0.70-0.75 | Best F1 score |
| Precision | 0.75-0.80 | Fewer false positives, misses some matches |
| High Precision | 0.85+ | Very few false positives, misses many matches |

---

## Conclusion

**For production use with known logos:**

- Use the **Image-Split Fine-tuned Model** at **threshold 0.70-0.75**
- Expected F1: 67-68%; precision: 66-80%

**For discovering unknown logos:**

- Use **Baseline CLIP** at **threshold 0.70**
- Expected F1: ~58%; precision: ~48%

Image-split fine-tuning delivers a significant gain (+8-10 F1 points) over the baseline for known logos, but it does not help with completely novel brands. For a production system, ensure all target logos are included in the training/reference set.