Add embedding model comparison analysis (CLIP vs DINOv2)

Rick McEwen
2026-01-02 16:26:59 -05:00
parent 2c41549ae0
commit 1551360028


@ -250,6 +250,102 @@ Simply tuning threshold and margin parameters with CLIP is insufficient to achie
---
## Test Run: Embedding Model Comparison
**Date**: 2026-01-02
**Matching Method**: Multi-ref (max) for all tests
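For context, a minimal sketch of how the three embedding models could be loaded and queried with the Hugging Face `transformers` library. The model IDs, pooling choices, and function names are assumptions for illustration; the log does not show the pipeline's actual loading code.

```python
import torch
import torch.nn.functional as F
from transformers import (AutoImageProcessor, AutoModel,
                          CLIPProcessor, CLIPVisionModelWithProjection)

def clip_embed(image, model_id="openai/clip-vit-large-patch14"):
    # CLIP image embedding, projected into the shared text-image space
    processor = CLIPProcessor.from_pretrained(model_id)
    model = CLIPVisionModelWithProjection.from_pretrained(model_id)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model(**inputs).image_embeds
    return F.normalize(emb, dim=-1)  # unit norm: dot product == cosine sim

def dinov2_embed(image, model_id="facebook/dinov2-small"):
    # DINOv2 CLS-token embedding ("facebook/dinov2-large" for the large run)
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(emb, dim=-1)
```

Unit-normalizing both models' outputs lets the matcher treat a plain dot product as cosine similarity, so the same 0.70 threshold can at least be applied to both (whether it is appropriate for both is examined below).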
### Test Configuration
| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |
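The threshold, margin, and min-matching-refs parameters suggest a decision rule along the following lines. This is one plausible reading (best-reference similarity over the threshold, a margin over the runner-up logo, and a minimum count of references clearing the threshold), not the pipeline's confirmed logic:

```python
import numpy as np

def match_logo(crop_emb, ref_embs_by_logo,
               threshold=0.70, margin=0.05, min_refs=3):
    """Multi-ref (max) matching sketch; inputs are unit-norm embeddings.

    ref_embs_by_logo: dict mapping logo name -> (n_refs, dim) array.
    """
    scores = {}
    for logo, refs in ref_embs_by_logo.items():
        sims = refs @ crop_emb                    # cosine similarities
        scores[logo] = (sims.max(), int((sims >= threshold).sum()))
    ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    best_logo, (best_sim, n_refs) = ranked[0]
    runner_up = ranked[1][1][0] if len(ranked) > 1 else 0.0
    if (best_sim >= threshold and n_refs >= min_refs
            and best_sim - runner_up >= margin):
        return best_logo
    return None
```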
### Results Summary
| Model | TP | FP | FN | Precision | Recall | F1 |
|-------|---:|---:|---:|----------:|-------:|---:|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |
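As a sanity check, the precision and F1 columns follow from the standard formulas: precision = TP / (TP + FP), and F1 is the harmonic mean of precision and recall. Recall is taken as reported here, since the FN column does not reproduce it via TP / (TP + FN) and likely reflects a different accounting.

```python
# TP and FP from the Results Summary; recall as reported in the table.
runs = {
    "CLIP ViT-Large": (284, 295, 0.770),
    "DINOv2 Small":   (158, 546, 0.428),
    "DINOv2 Large":   (105, 221, 0.285),
}
for model, (tp, fp, recall) in runs.items():
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{model}: P={precision:.1%}  F1={f1:.1%}")
# Matches the table's precision and F1 columns to within rounding.
```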
### Analysis
#### CLIP Significantly Outperforms DINOv2
CLIP ViT-Large achieved approximately **2x the F1 score** of either DINOv2 model:
| Model | F1 Score | vs CLIP (relative) |
|-------|----------|---------|
| CLIP ViT-Large | 59.9% | baseline |
| DINOv2 Small | 29.5% | -50.7% |
| DINOv2 Large | 30.2% | -49.6% |
This is a substantial performance gap, and one unlikely to be closed by parameter tuning alone.
#### DINOv2 Model Comparison
Comparing the two DINOv2 variants:
| Metric | DINOv2 Small | DINOv2 Large | Winner |
|--------|--------------|--------------|--------|
| Precision | 22.4% | 32.2% | Large (+44%) |
| Recall | 42.8% | 28.5% | Small (+50%) |
| F1 | 29.5% | 30.2% | Large (+2%) |
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |
DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall. The larger model appears more conservative in its matching, rejecting more candidates overall.
#### Why DINOv2 Underperforms
1. **Training Objective Mismatch**: DINOv2 uses self-supervised learning optimized for general visual representation, not for discriminating between similar visual objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.
2. **Embedding Space Characteristics**: DINOv2's embedding space may cluster logos differently than CLIP's. The 0.70 threshold that works reasonably well for CLIP may be entirely wrong for DINOv2's similarity distribution (see the distribution sketch after this list).
3. **No Text-Image Alignment**: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may incidentally help it distinguish between different brands' content, even though it was not explicitly trained on logos.
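One way to test the embedding-space hypothesis is to compare each model's positive and negative similarity distributions directly. A minimal sketch, assuming `pos_sims` and `neg_sims` arrays of best-reference similarities are collected from the existing pipeline (hypothetical inputs, not part of this test run):

```python
import numpy as np

def threshold_headroom(pos_sims, neg_sims):
    """Summarize similarity distributions for one model to judge whether
    any single threshold can separate matches from non-matches."""
    pos_median = np.percentile(pos_sims, 50)  # typical true-match score
    neg_p95 = np.percentile(neg_sims, 95)     # upper tail of non-matches
    headroom = pos_median - neg_p95
    print(f"positive median={pos_median:.3f}  negative p95={neg_p95:.3f}  "
          f"headroom={headroom:.3f}")
    # headroom > 0: a threshold between the two values could work;
    # headroom <= 0: the distributions overlap, and no fixed threshold
    # (0.70 or otherwise) will separate them cleanly.
    return headroom
```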
#### False Positive Analysis
| Model | FP:TP Ratio | Assessment |
|-------|-------------|------------|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
| DINOv2 Small | 3.46:1 | Very high false positives |
| DINOv2 Large | 2.10:1 | High false positives |
DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.
### Key Findings
1. **CLIP remains the best choice**: Despite its limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.
2. **Model size doesn't guarantee better results**: DINOv2 Large (304M parameters) scored only marginally better than DINOv2 Small (22M parameters) on F1, and actually had worse recall.
3. **Threshold may need per-model tuning**: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 may need a higher threshold.
4. **Self-supervised models not ideal for this task**: The results suggest that self-supervised vision models like DINOv2 are not well-suited for fine-grained logo discrimination without additional fine-tuning.
### Recommendations
1. **Continue using CLIP** for this logo detection pipeline unless a logo-specific model becomes available.
2. **If DINOv2 must be used**, conduct threshold optimization tests specifically for DINOv2's embedding space; the optimal threshold is likely different from CLIP's (see the sweep sketch after this list).
3. **Consider fine-tuning**: Training a model specifically on logo discrimination tasks would likely outperform both general-purpose models.
4. **Explore hybrid approaches**: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination.
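For recommendation 2, a per-model threshold sweep over scored pairs would locate DINOv2's own operating point. A sketch assuming the same hypothetical `pos_sims` / `neg_sims` arrays as above:

```python
import numpy as np

def sweep_threshold(pos_sims, neg_sims, lo=0.50, hi=0.95, step=0.01):
    """Return (threshold, f1) maximizing F1 on held-out similarity scores."""
    best_t, best_f1 = None, -1.0
    for t in np.arange(lo, hi, step):
        tp = int((pos_sims >= t).sum())
        fp = int((neg_sims >= t).sum())
        fn = len(pos_sims) - tp
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```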
---
## Test Run: [Next Test Name]
*Results pending...*