Combine all test results in a single directory

Rick McEwen
2026-01-07 10:22:54 -05:00
parent 2f28aa6052
commit 440e8fcdb4
10 changed files with 1518 additions and 0 deletions

File diff suppressed because one or more lines are too long


@@ -0,0 +1,29 @@
============================================================
Test Parameters:
Logos: 50, Seed: 42, Threshold: 0.7
Method: multi-ref, Refs/logo: 3, Margin: 0.05
BASELINE (openai/clip-vit-large-patch14):
True Positives (correct matches): 101
False Positives (wrong matches): 104
False Negatives (missed logos): 156
Precision: 0.4927 (49.3%)
Recall: 0.4056 (40.6%)
F1 Score: 0.4449 (44.5%)
FINE-TUNED (models/logo_detection/clip_finetuned):
True Positives (correct matches): 164
False Positives (wrong matches): 414
False Negatives (missed logos): 115
Precision: 0.2837 (28.4%)
Recall: 0.6586 (65.9%)
F1 Score: 0.3966 (39.7%)
------------------------------------------------------------
F1 SCORE COMPARISON:
Baseline: 44.5%
Fine-tuned: 39.7%
------------------------------------------------------------
Full results saved to: comparison_results/



@@ -0,0 +1,124 @@
Logo Detection Comparison Tests
================================
Date: Wed Dec 31 03:43:45 PM MST 2025
Common Parameters:
Reference logos: 20
Refs per logo: 10
Positive samples: 20
Negative samples: 100
Min matching refs: 3
Seed: 42
======================================================================
TEST: SIMPLE MATCHING
Method: Simple (all matches above threshold)
======================================================================
Date: 2025-12-31 16:02:25
Configuration:
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2355
CLIP threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 751
False Positives: 58221
False Negatives: 9
Total Expected: 369
Scores:
Precision: 0.0127 (1.3%)
Recall: 2.0352 (203.5%)
F1 Score: 0.0253 (2.5%)
======================================================================
TEST: MARGIN MATCHING
Method: Margin-based (margin=0.05)
======================================================================
Date: 2025-12-31 16:20:42
Configuration:
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2361
CLIP threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 60
False Positives: 26
False Negatives: 310
Total Expected: 369
Scores:
Precision: 0.6977 (69.8%)
Recall: 0.1626 (16.3%)
F1 Score: 0.2637 (26.4%)
======================================================================
TEST: MULTI-REF MATCHING
Method: Multi-ref (mean, min_refs=3, margin=0.05)
======================================================================
Date: 2025-12-31 16:38:59
Configuration:
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2352
CLIP threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 233
False Positives: 217
False Negatives: 170
Total Expected: 369
Scores:
Precision: 0.5178 (51.8%)
Recall: 0.6314 (63.1%)
F1 Score: 0.5690 (56.9%)
======================================================================
TEST: MULTI-REF MATCHING
Method: Multi-ref (max, min_refs=3, margin=0.05)
======================================================================
Date: 2025-12-31 16:56:49
Configuration:
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2350
CLIP threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 278
False Positives: 259
False Negatives: 136
Total Expected: 369
Scores:
Precision: 0.5177 (51.8%)
Recall: 0.7534 (75.3%)
F1 Score: 0.6137 (61.4%)


@@ -0,0 +1,105 @@
Embedding Model Comparison Tests
=================================
Date: Fri Jan 2 12:47:03 PM MST 2026
Common Parameters:
Matching method: multi-ref (max)
Reference logos: 20
Refs per logo: 10
Positive samples: 20
Negative samples: 100
Min matching refs: 3
Threshold: 0.70
Margin: 0.05
Seed: 42
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.05)
======================================================================
Date: 2026-01-02 13:05:17
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2355
Similarity threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 284
False Positives: 295
False Negatives: 124
Total Expected: 369
Scores:
Precision: 0.4905 (49.1%)
Recall: 0.7696 (77.0%)
F1 Score: 0.5992 (59.9%)
======================================================================
TEST: MULTI-REF MATCHING
Model: facebook/dinov2-small
Method: Multi-ref (max, min_refs=3, margin=0.05)
======================================================================
Date: 2026-01-02 13:19:01
Configuration:
Embedding model: facebook/dinov2-small
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2358
Similarity threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 158
False Positives: 546
False Negatives: 234
Total Expected: 369
Scores:
Precision: 0.2244 (22.4%)
Recall: 0.4282 (42.8%)
F1 Score: 0.2945 (29.5%)
======================================================================
TEST: MULTI-REF MATCHING
Model: facebook/dinov2-large
Method: Multi-ref (max, min_refs=3, margin=0.05)
======================================================================
Date: 2026-01-02 13:39:33
Configuration:
Embedding model: facebook/dinov2-large
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2355
Similarity threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 105
False Positives: 221
False Negatives: 277
Total Expected: 369
Scores:
Precision: 0.3221 (32.2%)
Recall: 0.2846 (28.5%)
F1 Score: 0.3022 (30.2%)




@@ -346,6 +346,131 @@ DINOv2 Small produces over 3x as many false positives as true positives, making
---
## Summary and Recommendations
This section synthesizes findings from all test runs to provide actionable recommendations for logo detection configuration and future improvements.
### Best Configuration
Based on all tests conducted, the optimal configuration is:
| Parameter | Recommended Value | Rationale |
|-----------|-------------------|-----------|
| **Embedding Model** | `openai/clip-vit-large-patch14` | 2x better F1 than DINOv2 alternatives |
| **Matching Method** | `multi-ref` with max similarity | Best F1 (59.9%) and recall (77.0%) |
| **Similarity Threshold** | 0.70 | Lower thresholds outperform higher ones |
| **Margin** | 0.05 | Minimal impact; keep low to avoid rejecting valid matches |
| **Min Matching Refs** | 3 | Provides better discrimination than threshold alone |
| **Refs Per Logo** | 10 | More references improve robustness |
| **DETR Threshold** | 0.50 | Standard detection confidence |
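As a concrete illustration, the recommended matching rule can be sketched as below. This is a hypothetical reconstruction, not the project's actual implementation: `match_logo`, its signature, and the toy 2-D embeddings are invented for illustration; only the parameter values (threshold 0.70, min_refs 3, margin 0.05, max aggregation) come from the table above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_logo(region_emb, refs_by_logo, threshold=0.70, min_refs=3, margin=0.05):
    """refs_by_logo: {logo_name: [ref_embedding, ...]}.
    Returns the matched logo name, or None if no logo qualifies."""
    scores = {}
    for logo, refs in refs_by_logo.items():
        sims = [cosine(region_emb, r) for r in refs]
        above = sum(1 for s in sims if s >= threshold)
        # max aggregation: a logo is scored by its best-matching reference
        scores[logo] = (max(sims), above)
    # A candidate logo must have at least min_refs references above threshold.
    candidates = {logo: s for logo, (s, n) in scores.items() if n >= min_refs}
    if not candidates:
        return None
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_logo, best_sim = ranked[0]
    # Margin test: the winner must beat the runner-up LOGO by at least margin.
    if len(ranked) > 1 and best_sim - ranked[1][1] < margin:
        return None
    return best_logo
```

Note that the margin is applied between logos after per-logo aggregation, which is what lets multiple references of the same logo cooperate rather than compete.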
### Performance Expectations
With the recommended configuration:
| Metric | Expected Value | Interpretation |
|--------|----------------|----------------|
| Precision | ~49% | About half of detections are correct |
| Recall | ~77% | Finds most logos present in images |
| F1 Score | ~60% | Moderate overall accuracy |
| FP:TP Ratio | ~1:1 | Approximately equal true and false positives |
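These figures can be reproduced from the raw counts of the best run reported earlier in this file (TP=284, FP=295, 369 expected matches). One caveat: to match the logged percentages, recall must be computed against the total expected match count rather than as TP/(TP+FN); the sketch below assumes that convention.

```python
def detection_scores(tp, fp, total_expected):
    # Precision: fraction of emitted detections that are correct.
    precision = tp / (tp + fp)
    # Recall divides by the total expected matches, which is the convention
    # the logs above appear to use (not TP / (TP + FN)).
    recall = tp / total_expected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the multi-ref (max) CLIP run in the model-comparison log.
p, r, f1 = detection_scores(tp=284, fp=295, total_expected=369)
```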
**Important**: These results indicate the system is suitable for applications that can tolerate a high false positive rate, such as:
- Initial screening with human review
- Flagging content for further analysis
- Low-stakes logo presence detection
The system is **not suitable** for high-precision applications without additional filtering or verification steps.
### Key Insights from Testing
#### What Works
1. **Multi-ref matching with max aggregation** consistently outperforms other methods
2. **Multiple references per logo** (10) provides robustness against logo variations
3. **min_matching_refs=3** is more effective at discrimination than threshold tuning
4. **CLIP embeddings** significantly outperform self-supervised alternatives (DINOv2)
#### What Doesn't Work
1. **Raising similarity threshold** paradoxically increases false positives in the 0.70-0.85 range
2. **Margin-only matching** fails with multiple references (same-logo refs compete)
3. **DINOv2 models** produce 2-3x worse results than CLIP
4. **Simple threshold-based matching** produces an unacceptable 78:1 FP:TP ratio
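The margin failure in item 2 is easy to reproduce in a toy sketch (the function and scores below are invented for illustration). When all references are ranked individually rather than grouped per logo, two strong references of the same logo trip the margin check and a correct match is discarded:

```python
def margin_match(ref_scores, threshold=0.7, margin=0.05):
    """ref_scores: list of (logo, similarity) over ALL references,
    ungrouped. Returns the top logo or None."""
    ranked = sorted(ref_scores, key=lambda rs: rs[1], reverse=True)
    best_logo, best = ranked[0]
    if best < threshold:
        return None
    # The runner-up may be another reference of the SAME logo, yet the
    # margin check still fires and rejects the match.
    if len(ranked) > 1 and best - ranked[1][1] < margin:
        return None
    return best_logo

# Two refs of "acme" both score high; the margin test rejects a true match.
scores = [("acme", 0.91), ("acme", 0.89), ("globex", 0.40)]
```

Grouping references per logo before comparing logos (as multi-ref matching does) removes this self-competition.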
#### Limitations
1. **~50% precision ceiling**: Even the best configuration produces nearly as many false positives as true positives
2. **No clean threshold separation**: CLIP's embedding space doesn't provide clear decision boundaries for logos
3. **General-purpose models**: Neither CLIP nor DINOv2 are optimized for fine-grained logo discrimination
4. **Pipeline dependencies**: Results depend heavily on DETR detection quality
### Recommendations for Future Improvements
#### Short-Term Improvements
| Improvement | Expected Impact | Effort |
|-------------|-----------------|--------|
| **Post-processing filters** | Reduce FP by 20-30% | Low |
| Add color histogram matching | Filter matches with wrong colors | |
| Add aspect ratio validation | Reject shape mismatches | |
| Add text detection | Filter if expected text is missing | |
| **Reference curation** | Improve TP by 10-20% | Low |
| Remove low-quality references | Reduce noise in ref embeddings | |
| Ensure diverse logo variants | Improve coverage | |
| **Ensemble scoring** | Improve F1 by 10-15% | Medium |
| Combine CLIP + color features | Multi-signal confidence | |
| Weighted voting across refs | More robust aggregation | |
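The color-histogram filter in the first row could look like the following pure-Python sketch. Everything here is illustrative: the 4-bin RGB quantization, the histogram-intersection score, and the 0.3 cutoff are assumptions, not tuned values.

```python
def color_histogram(pixels, bins=4):
    """pixels: iterable of (r, g, b) tuples in 0..255. Returns a
    normalized joint RGB histogram with bins**3 cells."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    n = 0
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
        n += 1
    return [h / n for h in hist] if n else hist

def histogram_intersection(h1, h2):
    # 1.0 for identical distributions, 0.0 for disjoint ones.
    return sum(min(a, b) for a, b in zip(h1, h2))

def passes_color_filter(region_pixels, ref_pixels, min_overlap=0.3):
    # min_overlap is an illustrative cutoff; reject matches whose color
    # distribution diverges too far from the reference logo's.
    return histogram_intersection(color_histogram(region_pixels),
                                  color_histogram(ref_pixels)) >= min_overlap
```

A filter like this runs after CLIP matching and only vetoes candidates, so it can reduce false positives without touching recall for color-consistent logos.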
#### Medium-Term Improvements
| Improvement | Expected Impact | Effort |
|-------------|-----------------|--------|
| **Fine-tune CLIP on logos** | Improve F1 by 20-40% | Medium |
| Contrastive training on logo pairs | Better embedding separation | |
| Use LogoDet-3K for training data | Domain-specific features | |
| **Alternative detection models** | Improve detection quality | Medium |
| Test YOLOv8 for logo detection | Faster, potentially more accurate | |
| Train custom detector on logo data | Better region proposals | |
| **Learned similarity metric** | Improve precision by 30-50% | Medium |
| Train siamese network for logo matching | Replace cosine similarity | |
| Learn logo-specific distance function | Better discrimination | |
#### Long-Term Improvements
| Improvement | Expected Impact | Effort |
|-------------|-----------------|--------|
| **End-to-end logo recognition model** | F1 > 85% | High |
| Single model for detection + recognition | Eliminate pipeline errors | |
| Train on large-scale logo dataset | Comprehensive coverage | |
| **Logo-specific foundation model** | F1 > 90% | High |
| Pre-train on millions of logo images | Domain expertise | |
| Fine-tune for specific brand sets | Production-ready accuracy | |
### Decision Framework
Use this framework to choose between precision and recall:
| Use Case | Priority | Recommended Adjustments |
|----------|----------|------------------------|
| **Content moderation** | High recall | Use defaults; accept FPs for human review |
| **Brand monitoring** | Balanced | Use defaults; filter obvious FPs |
| **Automated licensing** | High precision | Use threshold=0.90; accept low recall |
| **Search/discovery** | High recall | Lower threshold to 0.65; more refs |
### Conclusion
The current DETR + CLIP pipeline with multi-ref matching achieves moderate accuracy (~60% F1) that is suitable for screening applications but falls short of production requirements for automated decision-making. The fundamental limitation is that general-purpose vision models lack the fine-grained discrimination needed for logo recognition.
**To achieve production-quality accuracy (>85% F1), the system requires:**
1. A logo-specific embedding model (fine-tuned or trained from scratch)
2. Additional visual features beyond CLIP embeddings
3. Potentially an end-to-end architecture designed for logo recognition
The test framework established here provides the foundation for evaluating these future improvements systematically.
---
## Test Run: [Next Test Name]
*Results pending...*


@@ -0,0 +1,20 @@
============================================================
THRESHOLD OPTIMIZATION RESULTS
Model: finetuned (models/logo_detection/clip_finetuned)
============================================================
Threshold TP FP FN Prec Recall F1
--------------------------------------------------------------------
0.70 167 477 120 25.9% 67.1% 37.4%
0.72 158 339 116 31.8% 63.5% 42.4%
0.74 150 252 123 37.3% 60.2% 46.1%
0.76 160 166 119 49.1% 64.3% 55.7%
0.78 120 102 147 54.1% 48.2% 51.0%
0.80 110 73 151 60.1% 44.2% 50.9%
0.82 103 33 159 75.7% 41.4% 53.5%
0.84 74 18 180 80.4% 29.7% 43.4%
0.86 70 9 187 88.6% 28.1% 42.7%
--------------------------------------------------------------------
BEST THRESHOLD: 0.76 (F1 = 55.7%)
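The sweep's bookkeeping can be checked with a short script. The recall percentages in the table imply a fixed expected-match denominator of about 249, which is assumed below (the FN column is then not needed for the recomputation); the script itself is illustrative, not the tool that produced the table.

```python
# Assumed from the reported recall figures (e.g. 167 / 0.671 ~= 249).
TOTAL_EXPECTED = 249

# (threshold, TP, FP) rows transcribed from the table above.
rows = [
    (0.70, 167, 477), (0.72, 158, 339), (0.74, 150, 252),
    (0.76, 160, 166), (0.78, 120, 102), (0.80, 110, 73),
    (0.82, 103, 33), (0.84, 74, 18), (0.86, 70, 9),
]

def f1(tp, fp, expected=TOTAL_EXPECTED):
    p = tp / (tp + fp)
    r = tp / expected
    return 2 * p * r / (p + r)

# Pick the threshold with the highest F1 score.
best = max(rows, key=lambda row: f1(row[1], row[2]))
```

Running this reproduces 0.76 as the F1-optimal threshold for the fine-tuned model.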


@@ -0,0 +1,193 @@
Threshold Optimization Tests
=============================
Date: Fri Jan 2 10:11:34 AM MST 2026
Common Parameters:
Matching method: multi-ref (max)
Reference logos: 20
Refs per logo: 10
Positive samples: 20
Negative samples: 100
Min matching refs: 3
Seed: 42
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.05)
======================================================================
Date: 2026-01-02 10:29:26
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2358
Similarity threshold: 0.7
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 265
False Positives: 288
False Negatives: 141
Total Expected: 369
Scores:
Precision: 0.4792 (47.9%)
Recall: 0.7182 (71.8%)
F1 Score: 0.5748 (57.5%)
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.05)
======================================================================
Date: 2026-01-02 10:47:35
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2348
Similarity threshold: 0.8
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 233
False Positives: 472
False Negatives: 165
Total Expected: 369
Scores:
Precision: 0.3305 (33.0%)
Recall: 0.6314 (63.1%)
F1 Score: 0.4339 (43.4%)
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.1)
======================================================================
Date: 2026-01-02 11:05:34
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2357
Similarity threshold: 0.8
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 187
False Positives: 375
False Negatives: 208
Total Expected: 369
Scores:
Precision: 0.3327 (33.3%)
Recall: 0.5068 (50.7%)
F1 Score: 0.4017 (40.2%)
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.1)
======================================================================
Date: 2026-01-02 11:23:33
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2356
Similarity threshold: 0.85
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 160
False Positives: 434
False Negatives: 223
Total Expected: 369
Scores:
Precision: 0.2694 (26.9%)
Recall: 0.4336 (43.4%)
F1 Score: 0.3323 (33.2%)
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.15)
======================================================================
Date: 2026-01-02 11:41:47
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2359
Similarity threshold: 0.85
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 163
False Positives: 410
False Negatives: 220
Total Expected: 369
Scores:
Precision: 0.2845 (28.4%)
Recall: 0.4417 (44.2%)
F1 Score: 0.3461 (34.6%)
======================================================================
TEST: MULTI-REF MATCHING
Model: openai/clip-vit-large-patch14
Method: Multi-ref (max, min_refs=3, margin=0.15)
======================================================================
Date: 2026-01-02 12:00:00
Configuration:
Embedding model: openai/clip-vit-large-patch14
Reference logos: 20
Refs per logo: 10
Total reference embeddings: 189
Positive samples/logo: 20
Negative samples/logo: 100
Test images processed: 2363
Similarity threshold: 0.9
DETR threshold: 0.5
Random seed: 42
Results:
True Positives: 84
False Positives: 69
False Negatives: 288
Total Expected: 369
Scores:
Precision: 0.5490 (54.9%)
Recall: 0.2276 (22.8%)
F1 Score: 0.3218 (32.2%)