Logo Detection Test Results Analysis
This document provides analysis of logo detection test results across different matching methods and configurations.
Test Run: CLIP Defaults with All Matching Methods
Date: 2025-12-31
Embedding Model: openai/clip-vit-large-patch14 (default)
Test Configuration
| Parameter | Value |
|---|---|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,350 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |
Results Summary
| Method | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Simple | 751 | 58,221 | 9 | 1.3% | 203.5%* | 2.5% |
| Margin | 60 | 26 | 310 | 69.8% | 16.3% | 26.4% |
| Multi-ref (mean) | 233 | 217 | 170 | 51.8% | 63.1% | 56.9% |
| Multi-ref (max) | 278 | 259 | 136 | 51.8% | 75.3% | 61.4% |
*Recall >100% indicates multiple true positive detections per expected logo (multiple detected regions matching the same logo).
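The percentages in the table can be reproduced from the raw counts. The sketch below assumes recall is computed against the 369 expected logo instances cited later in this section rather than against TP + FN, which is what allows it to exceed 100%. The function name `detection_metrics` is illustrative and not taken from the test framework.

```python
def detection_metrics(tp: int, fp: int, expected: int) -> dict:
    """Return precision, recall, and F1 as percentages (one decimal place).

    Assumption: recall = TP / expected logo instances, so several true
    positive regions for one expected logo push recall above 100%.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "precision": round(100 * precision, 1),
        "recall": round(100 * recall, 1),
        "f1": round(100 * f1, 1),
    }


EXPECTED = 369  # expected logo instances reported for this run

simple = detection_metrics(tp=751, fp=58_221, expected=EXPECTED)
margin = detection_metrics(tp=60, fp=26, expected=EXPECTED)
multi_ref_max = detection_metrics(tp=278, fp=259, expected=EXPECTED)

for name, m in [("Simple", simple), ("Margin", margin), ("Multi-ref (max)", multi_ref_max)]:
    print(f"{name}: {m}")
```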
Analysis by Method
Simple Matching
The simple method returns ALL logos above the similarity threshold without any rejection logic. This serves as a baseline to understand the raw discriminative power of CLIP embeddings.
Observations:
- 58,221 false positives vs 751 true positives (~78:1 ratio)
- At threshold 0.70, CLIP embeddings are not discriminative enough to distinguish between different logos
- The extremely high false positive count indicates that unrelated logo regions frequently produce similarity scores above 0.70
- This method is unsuitable for production use but valuable for understanding the embedding space
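A minimal sketch of the simple method as described above, assuming embeddings are plain lists of floats and that a logo counts as matched when any one of its references reaches the threshold. The names `cosine_similarity` and `simple_match` and the toy vectors are hypothetical; the real pipeline's functions may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def simple_match(query, references, threshold=0.70):
    """Return every logo whose best reference scores at or above threshold.

    references maps a logo name to a list of reference embeddings.
    No rejection logic is applied, so several logos can be returned.
    """
    matches = {}
    for logo, refs in references.items():
        best = max((cosine_similarity(query, r) for r in refs), default=0.0)
        if best >= threshold:
            matches[logo] = best
    return matches


# Toy 2-D embeddings (made up for illustration only).
references = {
    "logo_a": [[1.0, 0.0], [0.9, 0.1]],
    "logo_b": [[0.8, 0.6]],
    "logo_c": [[0.0, 1.0]],
}
matches = simple_match([1.0, 0.2], references)
print(matches)  # both logo_a and logo_b clear 0.70 for this query
```

Because nothing forces a single winner, one detected region can contribute a match for every logo above the threshold, which is consistent with the very high false positive count.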
Margin-Based Matching
The margin method requires the best match to exceed the second-best by a minimum margin (0.05), rejecting ambiguous matches.
Observations:
- Highest precision (69.8%) but very low recall (16.3%)
- Only 60 true positives out of 369 expected
- The margin requirement is too strict when using multiple references per logo
- With 10 refs per logo, references from the SAME logo compete with each other
- Example: If Logo A has refs scoring 0.85 and 0.84, the margin is only 0.01, causing rejection
- This explains why margin matching produces fewer matches than multi-ref methods
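The same-logo competition can be illustrated with a short sketch. It assumes the margin is computed over individual reference scores, as the observations above imply. The second function, which collapses scores to one best value per logo before applying the margin, is a hypothetical variant shown only for contrast and is not something tested in this report.

```python
def margin_match(scores, margin=0.05):
    """scores: list of (logo, similarity) pairs, one per reference.

    Accept the top-scoring entry only if it beats the runner-up by margin.
    """
    ranked = sorted(scores, key=lambda s: s[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return None


def margin_match_per_logo(scores, margin=0.05):
    """Hypothetical variant: keep the best score per logo, then apply the margin."""
    best = {}
    for logo, sim in scores:
        best[logo] = max(sim, best.get(logo, float("-inf")))
    return margin_match(list(best.items()), margin)


# Two references of Logo A score 0.85 and 0.84, as in the example above.
scores = [("logo_a", 0.85), ("logo_a", 0.84), ("logo_b", 0.72)]

per_reference = margin_match(scores)      # rejected: 0.85 - 0.84 < 0.05
per_logo = margin_match_per_logo(scores)  # accepted: 0.85 - 0.72 >= 0.05
print(per_reference, per_logo)
```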
Multi-Ref Matching (Mean Similarity)
Uses the average similarity across all reference images for each logo.
Observations:
- Balanced precision (51.8%) and recall (63.1%)
- F1 score of 56.9%
- False positive ratio approximately 1:1 with true positives (217 FP vs 233 TP)
- Mean aggregation penalizes logos where some references don't match well
- More conservative than max aggregation
Multi-Ref Matching (Max Similarity)
Uses the highest similarity score from any single reference image.
Observations:
- Best F1 score (61.4%) and recall (75.3%)
- Same precision as mean method (51.8%)
- 278 true positives vs 259 false positives (still approximately 1:1)
- Max aggregation is more lenient, improving recall at no precision cost
- Better suited when reference images capture different logo variants
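The difference between the two aggregations can be sketched as follows. This assumes a logo is accepted when at least `min_matching_refs` references reach the threshold and the aggregated score also reaches it; how the actual pipeline combines these checks is not spelled out in this report, so treat the function and the similarity values as illustrative.

```python
from statistics import mean


def multi_ref_match(sims_by_logo, aggregate="max", threshold=0.70, min_matching_refs=3):
    """Return (logo, score) for the best accepted logo, or None.

    sims_by_logo maps a logo name to the similarities of one detected
    region against each of that logo's reference embeddings.
    """
    agg = max if aggregate == "max" else mean
    candidates = {}
    for logo, sims in sims_by_logo.items():
        if not sims:
            continue
        n_matching = sum(1 for s in sims if s >= threshold)
        score = agg(sims)
        if n_matching >= min_matching_refs and score >= threshold:
            candidates[logo] = score
    if not candidates:
        return None
    return max(candidates.items(), key=lambda kv: kv[1])


# Made-up scores: three references match well, three do not.
sims = {"logo_a": [0.91, 0.78, 0.74, 0.55, 0.50, 0.48]}

with_max = multi_ref_match(sims, aggregate="max")    # max = 0.91, accepted
with_mean = multi_ref_match(sims, aggregate="mean")  # mean = 0.66, rejected
print(with_max, with_mean)
```

In this toy case the weak references pull the mean below 0.70 while the max is unaffected, which matches the observation that mean aggregation is the more conservative of the two.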
Key Findings
1. CLIP Embedding Similarity Distribution
The simple matching results reveal a fundamental issue: at threshold 0.70, the CLIP embedding space does not provide sufficient separation between different logos. The 78:1 false positive to true positive ratio indicates that:
- Many unrelated images produce high cosine similarity scores
- The threshold would need to be significantly higher (0.85+) to reduce false positives
- Even then, recall would likely suffer
2. Margin Method Limitation with Multiple References
The margin-based matching method was designed assuming one reference per logo. When using multiple references (10 per logo in this test), references from the same logo compete against each other in the margin calculation. This causes legitimate matches to be rejected when two references from the same logo have similar scores.
3. False Positive Rate Remains High
Even the best-performing method (multi-ref max) produces nearly as many false positives as true positives:
- 278 correct matches
- 259 incorrect matches
- This 1:1 ratio is problematic for production use cases
4. Trade-off Between Precision and Recall
| Goal | Best Method | Trade-off |
|---|---|---|
| Maximize precision | Margin | Very low recall (16.3%) |
| Maximize recall | Multi-ref (max) | Lower precision (51.8%) |
| Balance both | Multi-ref (max) | Best F1 but still ~50% precision |
Deficiencies of This Approach
CLIP Model Limitations
- General-Purpose Training: CLIP was trained on text-image pairs for general visual understanding, not for fine-grained logo discrimination. Logo matching requires distinguishing between visually similar brand marks, which CLIP's training objective doesn't optimize for.
- Embedding Space Density: The cosine similarity scores cluster in a narrow range (0.6-0.9 for most images), making threshold-based discrimination difficult. Small differences in embedding similarity don't reliably indicate visual differences.
- Scale and Context Sensitivity: CLIP embeddings are affected by the context around detected regions. A logo on a busy background may produce different embeddings than the same logo on a clean background.
- No Logo-Specific Features: CLIP doesn't learn features specific to logo recognition, such as:
  - Typography and font shapes
  - Brand-specific color combinations
  - Geometric patterns and symmetry
  - Edge and contour characteristics
Detection Pipeline Issues
- DETR Detection Quality: The pipeline assumes DETR correctly identifies logo regions. Detection errors (missed logos, partial detections, non-logo regions) propagate to the matching stage.
- Cropping Artifacts: Detected regions are cropped and resized before embedding extraction. This may introduce artifacts that affect embedding quality.
- Threshold Sensitivity: The entire system is highly sensitive to the similarity threshold parameter. A 0.05 change in threshold can dramatically alter the precision/recall balance.
Test Run: Threshold Optimization Tests
Date: 2026-01-02
Embedding Model: openai/clip-vit-large-patch14
Matching Method: Multi-ref (max) for all tests
Test Configuration
| Parameter | Value |
|---|---|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| DETR threshold | 0.50 |
| Min matching refs | 3 |
| Random seed | 42 |
Results Summary
| Test | Threshold | Margin | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|
| 1 (baseline) | 0.70 | 0.05 | 265 | 288 | 141 | 47.9% | 71.8% | 57.5% |
| 2 | 0.80 | 0.05 | 233 | 472 | 165 | 33.0% | 63.1% | 43.4% |
| 3 | 0.80 | 0.10 | 187 | 375 | 208 | 33.3% | 50.7% | 40.2% |
| 4 | 0.85 | 0.10 | 160 | 434 | 223 | 26.9% | 43.4% | 33.2% |
| 5 | 0.85 | 0.15 | 163 | 410 | 220 | 28.4% | 44.2% | 34.6% |
| 6 | 0.90 | 0.15 | 84 | 69 | 288 | 54.9% | 22.8% | 32.2% |
Analysis
Counter-Intuitive Results
The most striking finding is that raising the similarity threshold made performance worse in most cases:
| Threshold Change | Effect on FP:TP Ratio |
|---|---|
| 0.70 → 0.80 | 1.09:1 → 2.03:1 (worse) |
| 0.80 → 0.85 | 2.03:1 → 2.71:1 (worse) |
| 0.85 → 0.90 | 2.71:1 → 0.82:1 (better) |
This is the opposite of expected behavior. Normally, raising the threshold should reduce false positives. Instead, false positives increased from 288 at threshold 0.70 to 472 at threshold 0.80.
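The ratios in the table follow directly from the FP and TP counts of tests 1, 2, 4, and 6 in the results summary. Note that those tests also differ in margin, so the threshold is not the only variable changing between rows.

```python
# FP and TP counts copied from the results summary (tests 1, 2, 4, 6).
runs = {
    0.70: {"tp": 265, "fp": 288},  # test 1, margin 0.05
    0.80: {"tp": 233, "fp": 472},  # test 2, margin 0.05
    0.85: {"tp": 160, "fp": 434},  # test 4, margin 0.10
    0.90: {"tp": 84, "fp": 69},    # test 6, margin 0.15
}

# False positives per true positive at each threshold.
fp_tp_ratio = {t: round(c["fp"] / c["tp"], 2) for t, c in runs.items()}

for threshold, ratio in fp_tp_ratio.items():
    print(f"threshold {threshold:.2f}: {ratio}:1")
```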
Why Higher Thresholds Failed
The likely explanation relates to how min_matching_refs interacts with the threshold:
- True positives are penalized more: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the min_matching_refs=3 requirement.
- False positives survive differently: False positive detections may have 1-2 references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches.
- The margin becomes less effective: When most scores are clustered below the threshold, the margin check operates on a smaller pool of candidates, reducing its discriminative power.
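The first hypothesis above can be illustrated with a toy example. The similarity values are invented for a single correct match against 10 references; they are not measurements from these tests, and the sketch covers only the reference-count check, not max aggregation or the margin.

```python
def refs_above(sims, threshold):
    """Count references whose similarity reaches the threshold."""
    return sum(1 for s in sims if s >= threshold)


def passes_ref_count(sims, threshold, min_matching_refs=3):
    """True if enough references clear the threshold."""
    return refs_above(sims, threshold) >= min_matching_refs


# Hypothetical similarities of one correct detection to 10 references.
true_match_sims = [0.88, 0.83, 0.79, 0.76, 0.72, 0.69, 0.66, 0.61, 0.58, 0.52]

results = {t: passes_ref_count(true_match_sims, t) for t in (0.70, 0.80, 0.85, 0.90)}
for t, ok in results.items():
    print(f"threshold {t:.2f}: {refs_above(true_match_sims, t)} refs above, accepted={ok}")
```

With these numbers, five references clear 0.70 but only two clear 0.80, so the same correct detection is dropped as soon as the threshold is raised.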
Threshold 0.90: Different Behavior
At threshold 0.90, behavior finally matches expectations:
- False positives dropped dramatically (69 vs 288-472 in other tests)
- But recall collapsed to 22.8%
- Only 84 true positives out of 369 expected
This suggests that at 0.90, the threshold is finally high enough to filter out most noise, but it's too aggressive and rejects most legitimate matches as well.
The Optimal Threshold Problem
| Threshold | Precision | Recall | F1 | Assessment |
|---|---|---|---|---|
| 0.70 | 47.9% | 71.8% | 57.5% | Best overall F1 |
| 0.80 | 33.0% | 63.1% | 43.4% | Worse than baseline |
| 0.85 | 26.9-28.4% | 43-44% | 33-35% | Much worse |
| 0.90 | 54.9% | 22.8% | 32.2% | Best precision, worst recall |
The lowest threshold tested (0.70) produced the best F1 score. This indicates:
- CLIP embeddings don't provide clean separation at any threshold
- The multi-ref matching with min_matching_refs provides better discrimination than threshold alone
- Raising the threshold hurts true positives more than it helps reject false positives
Margin Parameter Impact
Comparing tests with the same threshold but different margins:
| Threshold | Margin 0.05 | Margin 0.10 | Margin 0.15 |
|---|---|---|---|
| 0.80 | F1: 43.4% | F1: 40.2% | - |
| 0.85 | - | F1: 33.2% | F1: 34.6% |
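The margin's effect was small compared with the threshold's. At threshold 0.80, raising the margin from 0.05 to 0.10 reduced both true positives (233 to 187) and false positives (472 to 375) and lowered F1 by about 3 points. At threshold 0.85, raising it from 0.10 to 0.15 left true positives nearly unchanged (160 to 163), trimmed false positives (434 to 410), and raised F1 by about 1 point.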
Key Findings
- The baseline (threshold=0.70, margin=0.05) was optimal: No threshold/margin combination tested outperformed the defaults for F1 score.
- Threshold tuning alone cannot fix CLIP's limitations: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.
- min_matching_refs appears to matter more than threshold: The requirement for multiple matching references seems to provide better discrimination than the similarity threshold, although min_matching_refs was held at 3 in every run and was not varied directly.
- Precision-recall trade-off is extreme: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.
- The 0.80-0.85 range is a "dead zone": Thresholds in this range give lower precision (27-33%) than either 0.70 (47.9%) or 0.90 (54.9%), without the recall of the baseline.
Implications
These results suggest that improving logo detection accuracy requires:
- A different embedding model with better logo discrimination
- Logo-specific fine-tuning
- Alternative matching strategies beyond threshold-based approaches
- Potentially ensemble methods combining multiple signals
Simply tuning threshold and margin parameters with CLIP is insufficient to achieve acceptable precision/recall balance.
Test Run: Embedding Model Comparison
Date: 2026-01-02
Matching Method: Multi-ref (max) for all tests
Test Configuration
| Parameter | Value |
|---|---|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |
Results Summary
| Model | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |
Analysis
CLIP Significantly Outperforms DINOv2
CLIP ViT-Large achieved approximately 2x the F1 score of either DINOv2 model:
| Model | F1 Score | vs CLIP |
|---|---|---|
| CLIP ViT-Large | 59.9% | baseline |
| DINOv2 Small | 29.5% | -50.8% |
| DINOv2 Large | 30.2% | -49.6% |
This is a substantial performance gap that cannot be closed through parameter tuning.
DINOv2 Model Comparison
Comparing the two DINOv2 variants:
| Metric | DINOv2 Small | DINOv2 Large | Winner |
|---|---|---|---|
| Precision | 22.4% | 32.2% | Large (+44%) |
| Recall | 42.8% | 28.5% | Small (+50%) |
| F1 | 29.5% | 30.2% | Large (+2%) |
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |
DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall. The larger model appears more conservative in its matching, rejecting more candidates overall.
Why DINOv2 Underperforms
- Training Objective Mismatch: DINOv2 uses self-supervised learning optimized for general visual representation, not for discriminating between similar visual objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.
- Embedding Space Characteristics: DINOv2's embedding space may cluster logos differently than CLIP. The 0.70 threshold that works reasonably for CLIP may be entirely wrong for DINOv2's similarity distribution.
- No Text-Image Alignment: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may inadvertently help it distinguish between branded content, even if not explicitly trained for logos.
False Positive Analysis
| Model | FP:TP Ratio | Assessment |
|---|---|---|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
| DINOv2 Small | 3.46:1 | Very high false positives |
| DINOv2 Large | 2.10:1 | High false positives |
DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.
Key Findings
- CLIP remains the best choice: Despite its limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.
- Model size doesn't guarantee better results: DINOv2 Large (304M parameters) performed only marginally better than DINOv2 Small (22M parameters) for F1 score, and actually had worse recall.
- Threshold may need per-model tuning: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 may need a higher threshold.
- Self-supervised models not ideal for this task: The results suggest that self-supervised vision models like DINOv2 are not well-suited for fine-grained logo discrimination without additional fine-tuning.
Recommendations
- Continue using CLIP for this logo detection pipeline unless a logo-specific model becomes available.
- If DINOv2 must be used, conduct threshold optimization tests specifically for DINOv2's embedding space; the optimal threshold is likely different from CLIP's.
- Consider fine-tuning: Training a model specifically on logo discrimination tasks would likely outperform both general-purpose models.
- Explore hybrid approaches: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination.
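A per-model threshold search of the kind recommended above could look like the sketch below. It assumes each candidate match has already been reduced to a similarity score plus a correct/incorrect label, and it ignores min_matching_refs and the margin, so it is a simplification rather than the procedure used in these tests. The function name and the sample data are hypothetical.

```python
def sweep_thresholds(scored, expected, thresholds):
    """Return (best_threshold, best_f1) over the given thresholds.

    scored: list of (similarity, is_correct) pairs for candidate matches.
    expected: number of logo instances that should be found.
    Ties keep the lowest threshold that reached the best F1.
    """
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(1 for s, ok in scored if s >= t and ok)
        fp = sum(1 for s, ok in scored if s >= t and not ok)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up candidate matches for one embedding model.
scored = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.75, False), (0.72, False), (0.60, True),
]
best_t, best_f1 = sweep_thresholds(scored, expected=4, thresholds=[0.70, 0.80, 0.90])
print(best_t, round(best_f1, 3))
```

Running the same sweep separately on CLIP and DINOv2 similarities would show whether the two models prefer different operating points.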
Summary and Recommendations
This section synthesizes findings from all test runs to provide actionable recommendations for logo detection configuration and future improvements.
Best Configuration
Based on all tests conducted, the optimal configuration is:
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Embedding Model | openai/clip-vit-large-patch14 | 2x better F1 than DINOv2 alternatives |
| Matching Method | multi-ref with max similarity | Best F1 (59.9%) and recall (77.0%) |
| Similarity Threshold | 0.70 | Lower thresholds outperform higher ones |
| Margin | 0.05 | Minimal impact; keep low to avoid rejecting valid matches |
| Min Matching Refs | 3 | Provides better discrimination than threshold alone |
| Refs Per Logo | 10 | More references improve robustness |
| DETR Threshold | 0.50 | Standard detection confidence |
Performance Expectations
With the recommended configuration:
| Metric | Expected Value | Interpretation |
|---|---|---|
| Precision | ~49% | About half of detections are correct |
| Recall | ~77% | Finds most logos present in images |
| F1 Score | ~60% | Moderate overall accuracy |
| FP:TP Ratio | ~1:1 | Approximately equal true and false positives |
Important: These results indicate the system is suitable for applications that can tolerate a high false positive rate, such as:
- Initial screening with human review
- Flagging content for further analysis
- Low-stakes logo presence detection
The system is not suitable for high-precision applications without additional filtering or verification steps.
Key Insights from Testing
What Works
- Multi-ref matching with max aggregation consistently outperforms other methods
- Multiple references per logo (10) provides robustness against logo variations
- min_matching_refs=3 is more effective at discrimination than threshold tuning
- CLIP embeddings significantly outperform self-supervised alternatives (DINOv2)
What Doesn't Work
- Raising similarity threshold paradoxically increases false positives in the 0.70-0.85 range
- Margin-only matching fails with multiple references (same-logo refs compete)
- DINOv2 models reach only about half of CLIP's F1 (roughly 30% vs 59.9%) and produce 2-3x as many false positives per true positive
- Simple threshold-based matching produces unacceptable 78:1 FP:TP ratio
Limitations
- ~50% precision ceiling: Even the best configuration produces nearly as many false positives as true positives
- No clean threshold separation: CLIP's embedding space doesn't provide clear decision boundaries for logos
- General-purpose models: Neither CLIP nor DINOv2 is optimized for fine-grained logo discrimination
- Pipeline dependencies: Results depend heavily on DETR detection quality
Recommendations for Future Improvements
Short-Term Improvements
| Improvement | Expected Impact | Effort |
|---|---|---|
| Post-processing filters | Reduce FP by 20-30% | Low |
| Add color histogram matching | Filter matches with wrong colors | |
| Add aspect ratio validation | Reject shape mismatches | |
| Add text detection | Filter if expected text is missing | |
| Reference curation | Improve TP by 10-20% | Low |
| Remove low-quality references | Reduce noise in ref embeddings | |
| Ensure diverse logo variants | Improve coverage | |
| Ensemble scoring | Improve F1 by 10-15% | Medium |
| Combine CLIP + color features | Multi-signal confidence | |
| Weighted voting across refs | More robust aggregation | |
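As one possible shape for the color histogram filter listed above, the sketch below compares coarse RGB histograms with histogram intersection. Everything here is an assumption made for illustration: the bin count, the 0.5 overlap cutoff, the function names, and the representation of a crop as a list of (r, g, b) tuples. None of it has been evaluated against the test set.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram with bins**3 cells.

    pixels: iterable of (r, g, b) tuples with values in 0-255.
    """
    hist = [0.0] * (bins ** 3)
    step = 256 / bins
    count = 0
    for r, g, b in pixels:
        # Clamp so that 255 always lands in the last bin.
        ri, gi, bi = (min(int(c / step), bins - 1) for c in (r, g, b))
        hist[ri * bins * bins + gi * bins + bi] += 1
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_intersection(h1, h2):
    """Overlap between two normalized histograms, from 0.0 to 1.0."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def passes_color_filter(crop_pixels, ref_pixels, min_overlap=0.5):
    """Keep a match only if crop and reference colors overlap enough."""
    overlap = histogram_intersection(
        color_histogram(crop_pixels), color_histogram(ref_pixels)
    )
    return overlap >= min_overlap


# Toy crops: two reddish regions and one bluish region.
red_ref = [(250, 10, 10)] * 10
red_crop = [(240, 20, 5)] * 10
blue_crop = [(10, 10, 250)] * 10

same_color = passes_color_filter(red_crop, red_ref)
wrong_color = passes_color_filter(blue_crop, red_ref)
print(same_color, wrong_color)
```

A filter like this would run after the embedding match, so it can only remove matches and would trade some recall for precision.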
Medium-Term Improvements
| Improvement | Expected Impact | Effort |
|---|---|---|
| Fine-tune CLIP on logos | Improve F1 by 20-40% | Medium |
| Contrastive training on logo pairs | Better embedding separation | |
| Use LogoDet-3K for training data | Domain-specific features | |
| Alternative detection models | Improve detection quality | Medium |
| Test YOLOv8 for logo detection | Faster, potentially more accurate | |
| Train custom detector on logo data | Better region proposals | |
| Learned similarity metric | Improve precision by 30-50% | Medium |
| Train siamese network for logo matching | Replace cosine similarity | |
| Learn logo-specific distance function | Better discrimination | |
Long-Term Improvements
| Improvement | Expected Impact | Effort |
|---|---|---|
| End-to-end logo recognition model | F1 > 85% | High |
| Single model for detection + recognition | Eliminate pipeline errors | |
| Train on large-scale logo dataset | Comprehensive coverage | |
| Logo-specific foundation model | F1 > 90% | High |
| Pre-train on millions of logo images | Domain expertise | |
| Fine-tune for specific brand sets | Production-ready accuracy | |
Decision Framework
Use this framework to choose between precision and recall:
| Use Case | Priority | Recommended Adjustments |
|---|---|---|
| Content moderation | High recall | Use defaults; accept FPs for human review |
| Brand monitoring | Balanced | Use defaults; filter obvious FPs |
| Automated licensing | High precision | Use threshold=0.90; accept low recall |
| Search/discovery | High recall | Lower threshold to 0.65; more refs |
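The framework can be expressed as a set of overrides on top of the recommended defaults. The key names and the `config_for` helper below are hypothetical and do not correspond to a known configuration schema for the pipeline. Two caveats are left explicit in the comments: the only 0.90 run in this report used margin 0.15, and the table does not say how many extra references "more refs" means.

```python
# Recommended defaults from the Best Configuration table.
DEFAULTS = {
    "similarity_threshold": 0.70,
    "margin": 0.05,
    "min_matching_refs": 3,
    "refs_per_logo": 10,
    "detr_threshold": 0.50,
}

# Per-use-case overrides taken from the decision framework.
USE_CASE_OVERRIDES = {
    "content_moderation": {},  # defaults; false positives go to human review
    "brand_monitoring": {},    # defaults; obvious false positives filtered later
    # Only the threshold is named in the table. The 0.90 run in this report
    # used margin 0.15, so the default margin here is untested at 0.90.
    "automated_licensing": {"similarity_threshold": 0.90},
    # "More refs" is not quantified in the table, so refs_per_logo is unchanged.
    "search_discovery": {"similarity_threshold": 0.65},
}


def config_for(use_case: str) -> dict:
    """Merge the overrides for a use case onto the defaults."""
    return {**DEFAULTS, **USE_CASE_OVERRIDES[use_case]}


licensing = config_for("automated_licensing")
moderation = config_for("content_moderation")
print(licensing)
```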
Conclusion
The current DETR + CLIP pipeline with multi-ref matching achieves moderate accuracy (~60% F1) that is suitable for screening applications but falls short of production requirements for automated decision-making. The fundamental limitation is that general-purpose vision models lack the fine-grained discrimination needed for logo recognition.
To achieve production-quality accuracy (>85% F1), the system requires:
- A logo-specific embedding model (fine-tuned or trained from scratch)
- Additional visual features beyond CLIP embeddings
- Potentially an end-to-end architecture designed for logo recognition
The test framework established here provides the foundation for evaluating these future improvements systematically.
Test Run: [Next Test Name]
Results pending...