Add embedding model comparison analysis (CLIP vs DINOv2)
@ -250,6 +250,102 @@ Simply tuning threshold and margin parameters with CLIP is insufficient to achie

---

## Test Run: Embedding Model Comparison

**Date**: 2026-01-02
**Matching Method**: Multi-ref (max) for all tests

### Test Configuration

| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |

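For context, the multi-ref (max) decision rule these parameters feed can be sketched as below. This is a minimal illustration rather than the pipeline's actual code: it assumes L2-normalized embeddings compared by cosine similarity, and it assumes `margin` means the required gap between the best and second-best logo's top similarity; all names are hypothetical.

```python
def match_logo(crop_emb, ref_embs_by_logo,
               threshold=0.70, margin=0.05, min_refs=3):
    """Multi-ref (max) matching: score each logo by its best
    reference similarity; accept only when enough references agree
    and the best logo clears the runner-up by `margin`.

    crop_emb: (d,) unit-norm NumPy embedding of a detected crop
    ref_embs_by_logo: {logo_name: (n_refs, d) unit-norm NumPy array}
    """
    scores = {}   # logo -> max cosine similarity over its references
    support = {}  # logo -> number of references above threshold
    for logo, refs in ref_embs_by_logo.items():
        sims = refs @ crop_emb  # cosine sims, since vectors are unit-norm
        scores[logo] = float(sims.max())
        support[logo] = int((sims >= threshold).sum())

    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    runner_up = scores[ranked[1]] if len(ranked) > 1 else 0.0

    if (scores[best] >= threshold
            and support[best] >= min_refs
            and scores[best] - runner_up >= margin):
        return best
    return None  # no confident match
```
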
### Results Summary

| Model | TP | FP | FN | Precision | Recall | F1 |
|-------|---:|---:|---:|----------:|-------:|---:|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |

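As a sanity check, the F1 column follows from the precision and recall columns via the standard harmonic mean; for example, for the CLIP ViT-Large row:

```python
def f1(precision, recall):
    """Standard F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# CLIP ViT-Large row of the table above
tp, fp = 284, 295
precision = tp / (tp + fp)   # 284/579, approx. 0.491
recall = 0.770               # as reported
print(f"{f1(precision, recall):.3f}")  # 0.599 -> the table's 59.9%
```
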
### Analysis

#### CLIP Significantly Outperforms DINOv2

CLIP ViT-Large achieved approximately **2x the F1 score** of either DINOv2 model:

| Model | F1 Score | vs CLIP (relative) |
|-------|----------|--------------------|
| CLIP ViT-Large | 59.9% | baseline |
| DINOv2 Small | 29.5% | -50.7% |
| DINOv2 Large | 30.2% | -49.6% |

This is a substantial performance gap that cannot be closed through parameter tuning.

#### DINOv2 Model Comparison

Comparing the two DINOv2 variants:

| Metric | DINOv2 Small | DINOv2 Large | Winner |
|--------|--------------|--------------|--------|
| Precision | 22.4% | 32.2% | Large (+44% relative) |
| Recall | 42.8% | 28.5% | Small (+50% relative) |
| F1 | 29.5% | 30.2% | Large (+2% relative) |
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |

DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall. The larger model appears more conservative in its matching, rejecting more candidates overall.

#### Why DINOv2 Underperforms

1. **Training Objective Mismatch**: DINOv2 uses self-supervised learning optimized for general-purpose visual representation, not for discriminating between visually similar objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.

2. **Embedding Space Characteristics**: DINOv2's embedding space may cluster logos differently from CLIP's. The 0.70 threshold that works reasonably well for CLIP may be entirely wrong for DINOv2's similarity distribution (see the sketch after this list for one way to check).

3. **No Text-Image Alignment**: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may inadvertently help it distinguish between brands, even though it was never explicitly trained on logos.

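One way to test point 2 directly is to compare each model's similarity distributions on known positive and negative crops; if the distributions sit in different ranges per model, no shared 0.70 threshold can serve both. A minimal sketch assuming precomputed, L2-normalized embeddings (all names hypothetical):

```python
import numpy as np

def sim_stats(ref_embs, pos_embs, neg_embs):
    """Summarize cosine-similarity distributions for one embedding model.

    ref_embs: (R, d) reference embeddings, L2-normalized
    pos_embs: (P, d) crops that truly contain a reference logo
    neg_embs: (N, d) crops that do not
    """
    pos_sims = (pos_embs @ ref_embs.T).max(axis=1)  # best ref per positive
    neg_sims = (neg_embs @ ref_embs.T).max(axis=1)  # best ref per negative
    return {
        "pos_mean": pos_sims.mean(),
        "pos_p10": np.percentile(pos_sims, 10),
        "neg_mean": neg_sims.mean(),
        "neg_p90": np.percentile(neg_sims, 90),
    }
```

A workable threshold has to sit between `neg_p90` and `pos_p10`; if that window straddles 0.70 for CLIP but not for DINOv2, the shared threshold, and not the model alone, explains part of the gap.
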
#### False Positive Analysis

| Model | FP:TP Ratio | Assessment |
|-------|-------------|------------|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
| DINOv2 Small | 3.46:1 | Very high false positives |
| DINOv2 Large | 2.10:1 | High false positives |

DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.

### Key Findings

1. **CLIP remains the best choice**: Despite its limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.

2. **Model size doesn't guarantee better results**: DINOv2 Large (304M parameters) achieved only a marginally better F1 score than DINOv2 Small (22M parameters), and actually had worse recall.

3. **Thresholds may need per-model tuning**: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 may need a higher threshold.

4. **Self-supervised models are not ideal for this task**: The results suggest that self-supervised vision models like DINOv2 are not well suited to fine-grained logo discrimination without additional fine-tuning.

### Recommendations

1. **Continue using CLIP** for this logo detection pipeline unless a logo-specific model becomes available.

2. **If DINOv2 must be used**, run threshold optimization specifically for DINOv2's embedding space; the optimal threshold is likely different from CLIP's (see the sweep sketch after this list).

3. **Consider fine-tuning**: A model trained specifically on logo discrimination tasks would likely outperform both general-purpose models.

4. **Explore hybrid approaches**: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination; a sketch follows below.

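A sketch of recommendation 2: sweep the similarity threshold on a held-out labeled set and keep the value that maximizes F1, once per model. Purely illustrative; `evaluate_at` is an assumed stand-in for the pipeline's own evaluation loop:

```python
import numpy as np

def sweep_threshold(evaluate_at, lo=0.50, hi=0.95, step=0.01):
    """Grid-search the similarity threshold for one embedding model.

    evaluate_at(threshold) -> (tp, fp, fn) on a held-out labeled set;
    the held-out set must be disjoint from the images reported above.
    """
    best = (None, -1.0)
    for t in np.arange(lo, hi + 1e-9, step):
        tp, fp, fn = evaluate_at(t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best[1]:
            best = (float(t), f1)
    return best  # (threshold, f1); run once per DINOv2 variant
```
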
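And a sketch of recommendation 4, blending CLIP similarity with a simple color-histogram check. The blend weight and histogram setup are untested assumptions for illustration, not values from this test run:

```python
import cv2  # one possible choice for color histograms

def hybrid_score(clip_sim, crop_bgr, ref_bgr, alpha=0.8):
    """Blend CLIP cosine similarity with a color-histogram similarity.

    clip_sim: cosine similarity already computed by the pipeline
    crop_bgr, ref_bgr: BGR images (NumPy arrays) of crop and reference
    alpha: assumed blend weight, to be tuned on held-out data
    """
    def hist(img):
        # 8x8x8 BGR histogram, normalized for scale invariance
        h = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                         [0, 256, 0, 256, 0, 256])
        return cv2.normalize(h, h).flatten()

    color_sim = cv2.compareHist(hist(crop_bgr), hist(ref_bgr),
                                cv2.HISTCMP_CORREL)  # in [-1, 1]
    return alpha * clip_sim + (1 - alpha) * max(color_sim, 0.0)
```
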
---

## Test Run: [Next Test Name]

*Results pending...*