|
|
|
|
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## Test Run: Threshold Optimization Tests
|
|
|
|
|
|
|
|
|
|
**Date**: 2026-01-02
|
|
|
|
|
**Embedding Model**: openai/clip-vit-large-patch14
|
|
|
|
|
**Matching Method**: Multi-ref (max) for all tests
|
|
|
|
|
|
|
|
|
|
### Test Configuration
|
|
|
|
|
|
|
|
|
|
| Parameter | Value |
|
|
|
|
|
|-----------|-------|
|
|
|
|
|
| Reference logos | 20 |
|
|
|
|
|
| Refs per logo | 10 |
|
|
|
|
|
| Total reference embeddings | 189 |
|
|
|
|
|
| Positive samples per logo | 20 |
|
|
|
|
|
| Negative samples per logo | 100 |
|
|
|
|
|
| Test images processed | ~2,355 |
|
|
|
|
|
| DETR threshold | 0.50 |
|
|
|
|
|
| Min matching refs | 3 |
|
|
|
|
|
| Random seed | 42 |
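The DETR threshold in the table above applies to the detection stage that produces candidate logo crops before any embedding comparison. As a rough sketch of that stage (the detector checkpoint is not named in this log; `facebook/detr-resnet-50` and the helper below are assumptions):

```python
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

# Assumed checkpoint; this log only records the DETR score threshold (0.50).
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detector = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()

def detect_candidate_crops(image: Image.Image, score_threshold: float = 0.50):
    """Return crops for every detection whose confidence clears the DETR threshold."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=score_threshold, target_sizes=target_sizes
    )[0]
    return [image.crop(tuple(box.tolist())) for box in detections["boxes"]]
```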
|
|
|
|
|
|
|
|
|
|
### Results Summary
|
|
|
|
|
|
|
|
|
|
| Test | Threshold | Margin | TP | FP | FN | Precision | Recall | F1 |
|
|
|
|
|
|------|----------:|-------:|---:|---:|---:|----------:|-------:|---:|
|
|
|
|
|
| 1 (baseline) | 0.70 | 0.05 | 265 | 288 | 141 | 47.9% | 71.8% | 57.5% |
|
|
|
|
|
| 2 | 0.80 | 0.05 | 233 | 472 | 165 | 33.0% | 63.1% | 43.4% |
|
|
|
|
|
| 3 | 0.80 | 0.10 | 187 | 375 | 208 | 33.3% | 50.7% | 40.2% |
|
|
|
|
|
| 4 | 0.85 | 0.10 | 160 | 434 | 223 | 26.9% | 43.4% | 33.2% |
|
|
|
|
|
| 5 | 0.85 | 0.15 | 163 | 410 | 220 | 28.4% | 44.2% | 34.6% |
|
|
|
|
|
| 6 | 0.90 | 0.15 | 84 | 69 | 288 | 54.9% | 22.8% | 32.2% |
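As a sanity check, the precision, recall, and F1 columns can be reproduced directly from the raw counts. Recall here is computed against the 369 expected positive detections cited in the threshold-0.90 analysis below, not as TP/(TP+FN):

```python
# (threshold, margin) -> (TP, FP, FN), copied from the Results Summary table above.
runs = {
    (0.70, 0.05): (265, 288, 141),
    (0.80, 0.05): (233, 472, 165),
    (0.80, 0.10): (187, 375, 208),
    (0.85, 0.10): (160, 434, 223),
    (0.85, 0.15): (163, 410, 220),
    (0.90, 0.15): (84, 69, 288),
}
EXPECTED_POSITIVES = 369  # expected positive detections in this test set

for (threshold, margin), (tp, fp, fn) in runs.items():
    precision = tp / (tp + fp)
    recall = tp / EXPECTED_POSITIVES
    f1 = 2 * precision * recall / (precision + recall)
    print(f"t={threshold:.2f} m={margin:.2f} "
          f"P={precision:.1%} R={recall:.1%} F1={f1:.1%}")
```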
|
|
|
|
|
|
|
|
|
|
### Analysis
|
|
|
|
|
|
|
|
|
|
#### Counter-Intuitive Results
|
|
|
|
|
|
|
|
|
|
The most striking finding is that **raising the similarity threshold made performance worse** in most cases:
|
|
|
|
|
|
|
|
|
|
| Threshold Change | Effect on FP:TP Ratio |
|
|
|
|
|
|------------------|----------------------|
|
|
|
|
|
| 0.70 → 0.80 | 1.09:1 → 2.03:1 (worse) |
|
|
|
|
|
| 0.80 → 0.85 | 2.03:1 → 2.71:1 (worse) |
|
|
|
|
|
| 0.85 → 0.90 | 2.71:1 → 0.82:1 (better) |
|
|
|
|
|
|
|
|
|
|
This is the opposite of expected behavior. Normally, raising the threshold should reduce false positives. Instead, false positives *increased* from 288 at threshold 0.70 to 472 at threshold 0.80.
|
|
|
|
|
|
|
|
|
|
#### Why Higher Thresholds Failed
|
|
|
|
|
|
|
|
|
|
The likely explanation relates to how `min_matching_refs` interacts with the threshold (a sketch of the decision rule follows this list):
|
|
|
|
|
|
|
|
|
|
1. **True positives are penalized more**: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the `min_matching_refs=3` requirement.
|
|
|
|
|
|
|
|
|
|
2. **False positives survive differently**: False positive detections may have 1-2 references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches.
|
|
|
|
|
|
|
|
|
|
3. **The margin becomes less effective**: When most scores are clustered below the threshold, the margin check operates on a smaller pool of candidates, reducing its discriminative power.
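To make this interaction concrete, here is a minimal sketch of the decision rule described above: multi-ref max aggregation gated by the similarity threshold, the margin over the runner-up logo, and `min_matching_refs`. The pipeline's real implementation is not shown in this log, so the function name and exact rule ordering are assumptions:

```python
import numpy as np

def match_logo(crop_emb, refs_by_logo, threshold=0.70, margin=0.05, min_matching_refs=3):
    """refs_by_logo maps a logo name to an (n_refs, dim) array of reference embeddings."""
    crop_emb = crop_emb / np.linalg.norm(crop_emb)
    scores = {}
    for logo, refs in refs_by_logo.items():
        refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
        sims = refs @ crop_emb                                # cosine similarity per reference
        scores[logo] = (float(sims.max()),                    # max aggregation
                        int((sims >= threshold).sum()))       # references clearing the threshold

    ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    best_logo, (best_score, best_hits) = ranked[0]
    runner_up = ranked[1][1][0] if len(ranked) > 1 else 0.0

    if best_score < threshold:           # best similarity must clear the threshold
        return None
    if best_hits < min_matching_refs:    # enough references must agree (this is what
        return None                      # penalizes true positives at high thresholds)
    if best_score - runner_up < margin:  # best logo must beat the runner-up by the margin
        return None
    return best_logo
```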
|
|
|
|
|
|
|
|
|
|
#### Threshold 0.90: Different Behavior
|
|
|
|
|
|
|
|
|
|
At threshold 0.90, behavior finally matches expectations:
|
|
|
|
|
- False positives dropped dramatically (69 vs 288-472 in other tests)
|
|
|
|
|
- But recall collapsed to 22.8%
|
|
|
|
|
- Only 84 true positives out of 369 expected
|
|
|
|
|
|
|
|
|
|
This suggests that at 0.90, the threshold is finally high enough to filter out most noise, but it's too aggressive and rejects most legitimate matches as well.
|
|
|
|
|
|
|
|
|
|
#### The Optimal Threshold Problem
|
|
|
|
|
|
|
|
|
|
| Threshold | Precision | Recall | F1 | Assessment |
|
|
|
|
|
|-----------|-----------|--------|-----|------------|
|
|
|
|
|
| 0.70 | 47.9% | 71.8% | **57.5%** | Best overall F1 |
|
|
|
|
|
| 0.80 | 33.0% | 63.1% | 43.4% | Worse than baseline |
|
|
|
|
|
| 0.85 | 26.9-28.4% | 43-44% | 33-35% | Much worse |
|
|
|
|
|
| 0.90 | 54.9% | 22.8% | 32.2% | Best precision, worst recall |
|
|
|
|
|
|
|
|
|
|
The lowest threshold tested (0.70) produced the best F1 score. This indicates:
|
|
|
|
|
- CLIP embeddings don't provide clean separation at any threshold
|
|
|
|
|
- The multi-ref matching with min_matching_refs provides better discrimination than threshold alone
|
|
|
|
|
- Raising the threshold hurts true positives more than it helps reject false positives
|
|
|
|
|
|
|
|
|
|
#### Margin Parameter Impact
|
|
|
|
|
|
|
|
|
|
Comparing tests with the same threshold but different margins:
|
|
|
|
|
|
|
|
|
|
| Threshold | Margin 0.05 | Margin 0.10 | Margin 0.15 |
|
|
|
|
|
|-----------|-------------|-------------|-------------|
|
|
|
|
|
| 0.80 | F1: 43.4% | F1: 40.2% | - |
|
|
|
|
|
| 0.85 | - | F1: 33.2% | F1: 34.6% |
|
|
|
|
|
|
|
|
|
|
Increasing the margin had only a minor effect: at threshold 0.80 it trimmed both true and false positives, while at 0.85 it slightly reduced false positives with true positives essentially unchanged. In this configuration the margin parameter is far less influential than the threshold.
|
|
|
|
|
|
|
|
|
|
### Key Findings
|
|
|
|
|
|
|
|
|
|
1. **The baseline (threshold=0.70, margin=0.05) was optimal**: No threshold/margin combination tested outperformed the defaults for F1 score.
|
|
|
|
|
|
|
|
|
|
2. **Threshold tuning alone cannot fix CLIP's limitations**: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.
|
|
|
|
|
|
|
|
|
|
3. **min_matching_refs matters more than threshold**: The requirement for multiple matching references provides better discrimination than similarity threshold.
|
|
|
|
|
|
|
|
|
|
4. **Precision-recall trade-off is extreme**: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.
|
|
|
|
|
|
|
|
|
|
5. **The 0.80-0.85 range is a "dead zone"**: Thresholds between the tested extremes produce worse results than either 0.70 or 0.90.
|
|
|
|
|
|
|
|
|
|
### Implications
|
|
|
|
|
|
|
|
|
|
These results suggest that improving logo detection accuracy requires:
|
|
|
|
|
- A different embedding model with better logo discrimination
|
|
|
|
|
- Logo-specific fine-tuning
|
|
|
|
|
- Alternative matching strategies beyond threshold-based approaches
|
|
|
|
|
- Potentially ensemble methods combining multiple signals
|
|
|
|
|
|
|
|
|
|
Simply tuning threshold and margin parameters with CLIP is insufficient to achieve acceptable precision/recall balance.
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## Test Run: Embedding Model Comparison
|
|
|
|
|
|
|
|
|
|
**Date**: 2026-01-02
|
|
|
|
|
**Matching Method**: Multi-ref (max) for all tests
|
|
|
|
|
|
|
|
|
|
### Test Configuration
|
|
|
|
|
|
|
|
|
|
| Parameter | Value |
|
|
|
|
|
|-----------|-------|
|
|
|
|
|
| Reference logos | 20 |
|
|
|
|
|
| Refs per logo | 10 |
|
|
|
|
|
| Total reference embeddings | 189 |
|
|
|
|
|
| Positive samples per logo | 20 |
|
|
|
|
|
| Negative samples per logo | 100 |
|
|
|
|
|
| Test images processed | ~2,355 |
|
|
|
|
|
| Similarity threshold | 0.70 |
|
|
|
|
|
| DETR threshold | 0.50 |
|
|
|
|
|
| Margin | 0.05 |
|
|
|
|
|
| Min matching refs | 3 |
|
|
|
|
|
| Random seed | 42 |
|
|
|
|
|
|
|
|
|
|
### Results Summary
|
|
|
|
|
|
|
|
|
|
| Model | TP | FP | FN | Precision | Recall | F1 |
|
|
|
|
|
|-------|---:|---:|---:|----------:|-------:|---:|
|
|
|
|
|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
|
|
|
|
|
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
|
|
|
|
|
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |
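The only change between these runs is the embedding model; the matching stage is identical. Below is a minimal sketch of how the two model families could produce crop embeddings, assuming the Hugging Face checkpoints `openai/clip-vit-large-patch14`, `facebook/dinov2-small`, and `facebook/dinov2-large` (only the first is named explicitly in this log):

```python
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

def clip_embed(images, name="openai/clip-vit-large-patch14"):
    processor = CLIPProcessor.from_pretrained(name)
    model = CLIPModel.from_pretrained(name).eval()
    with torch.no_grad():
        feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return F.normalize(feats, dim=-1)           # unit-norm so dot product = cosine similarity

def dinov2_embed(images, name="facebook/dinov2-small"):
    # Checkpoint ids are assumptions; the log only says "DINOv2 Small" / "DINOv2 Large".
    processor = AutoImageProcessor.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    with torch.no_grad():
        out = model(**processor(images=images, return_tensors="pt"))
    cls = out.last_hidden_state[:, 0]            # CLS token as the global image descriptor
    return F.normalize(cls, dim=-1)
```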
|
|
|
|
|
|
|
|
|
|
### Analysis
|
|
|
|
|
|
|
|
|
|
#### CLIP Significantly Outperforms DINOv2
|
|
|
|
|
|
|
|
|
|
CLIP ViT-Large achieved approximately **2x the F1 score** of either DINOv2 model:
|
|
|
|
|
|
|
|
|
|
| Model | F1 Score | Relative to CLIP |
|
|
|
|
|
|-------|----------|---------|
|
|
|
|
|
| CLIP ViT-Large | 59.9% | baseline |
|
|
|
|
|
| DINOv2 Small | 29.5% | -50.7% |
|
|
|
|
|
| DINOv2 Large | 30.2% | -49.6% |
|
|
|
|
|
|
|
|
|
|
This is a substantial performance gap that cannot be closed through parameter tuning.
|
|
|
|
|
|
|
|
|
|
#### DINOv2 Model Comparison
|
|
|
|
|
|
|
|
|
|
Comparing the two DINOv2 variants:
|
|
|
|
|
|
|
|
|
|
| Metric | DINOv2 Small | DINOv2 Large | Winner |
|
|
|
|
|
|--------|--------------|--------------|--------|
|
|
|
|
|
| Precision | 22.4% | 32.2% | Large (+44%) |
|
|
|
|
|
| Recall | 42.8% | 28.5% | Small (+50%) |
|
|
|
|
|
| F1 | 29.5% | 30.2% | Large (+2%) |
|
|
|
|
|
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |
|
|
|
|
|
|
|
|
|
|
DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall. The larger model appears more conservative in its matching, rejecting more candidates overall.
|
|
|
|
|
|
|
|
|
|
#### Why DINOv2 Underperforms
|
|
|
|
|
|
|
|
|
|
1. **Training Objective Mismatch**: DINOv2 uses self-supervised learning optimized for general visual representation, not for discriminating between similar visual objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.
|
|
|
|
|
|
|
|
|
|
2. **Embedding Space Characteristics**: DINOv2's embedding space may cluster logos differently than CLIP. The 0.70 threshold that works reasonably for CLIP may be entirely wrong for DINOv2's similarity distribution.
|
|
|
|
|
|
|
|
|
|
3. **No Text-Image Alignment**: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may inadvertently help it distinguish between different brands' visual identities, even though it was never explicitly trained on logos.
|
|
|
|
|
|
|
|
|
|
#### False Positive Analysis
|
|
|
|
|
|
|
|
|
|
| Model | FP:TP Ratio | Assessment |
|
|
|
|
|
|-------|-------------|------------|
|
|
|
|
|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
|
|
|
|
|
| DINOv2 Small | 3.46:1 | Very high false positives |
|
|
|
|
|
| DINOv2 Large | 2.10:1 | High false positives |
|
|
|
|
|
|
|
|
|
|
DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.
|
|
|
|
|
|
|
|
|
|
### Key Findings
|
|
|
|
|
|
|
|
|
|
1. **CLIP remains the best choice**: Despite its limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.
|
|
|
|
|
|
|
|
|
|
2. **Model size doesn't guarantee better results**: DINOv2 Large (304M parameters) performed only marginally better than DINOv2 Small (22M parameters) for F1 score, and actually had worse recall.
|
|
|
|
|
|
|
|
|
|
3. **Threshold may need per-model tuning**: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 may need a higher threshold (see the sketch after this list).
|
|
|
|
|
|
|
|
|
|
4. **Self-supervised models not ideal for this task**: The results suggest that self-supervised vision models like DINOv2 are not well-suited for fine-grained logo discrimination without additional fine-tuning.
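Following up on finding 3, a per-model threshold could be estimated empirically from labelled crop/reference similarities rather than reusing CLIP's 0.70. A minimal sketch; the sweep range and the F1-based scoring choice are assumptions, not values from these tests:

```python
import numpy as np

def pick_threshold(pos_sims: np.ndarray, neg_sims: np.ndarray,
                   candidates=np.arange(0.30, 0.96, 0.01)):
    """Pick the similarity threshold maximizing F1 on a labelled sample.

    pos_sims: best similarities for crops that truly contain the logo.
    neg_sims: best similarities for crops that do not.
    """
    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = int((pos_sims >= t).sum())
        fp = int((neg_sims >= t).sum())
        fn = int((pos_sims < t).sum())
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```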
|
|
|
|
|
|
|
|
|
|
### Recommendations
|
|
|
|
|
|
|
|
|
|
1. **Continue using CLIP** for this logo detection pipeline unless a logo-specific model becomes available.
|
|
|
|
|
|
|
|
|
|
2. **If DINOv2 must be used**, conduct threshold optimization tests specifically for DINOv2's embedding space—the optimal threshold is likely different from CLIP's.
|
|
|
|
|
|
|
|
|
|
3. **Consider fine-tuning**: Training a model specifically on logo discrimination tasks would likely outperform both general-purpose models.
|
|
|
|
|
|
|
|
|
|
4. **Explore hybrid approaches**: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination, as sketched below.
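As one illustration of recommendation 4, an embedding similarity could be blended with a cheap colour cue. The weights and the histogram-intersection choice below are illustrative assumptions, not results from these tests:

```python
import numpy as np
from PIL import Image

def color_histogram(img: Image.Image, bins: int = 8) -> np.ndarray:
    """Coarse, normalized RGB histogram as a cheap colour descriptor."""
    pixels = np.asarray(img.convert("RGB").resize((64, 64))).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / (hist.sum() + 1e-8)

def hybrid_score(clip_sim: float, crop: Image.Image, reference: Image.Image,
                 w_clip: float = 0.8, w_color: float = 0.2) -> float:
    """Blend CLIP cosine similarity with colour-histogram intersection (the latter in [0, 1])."""
    color_sim = float(np.minimum(color_histogram(crop), color_histogram(reference)).sum())
    return w_clip * clip_sim + w_color * color_sim
```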
|
|
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
## Test Run: [Next Test Name]
|
|
|
|
|
|
|
|
|
|
*Results pending...*
|
|
|
|
|
|