Add embedding model comparison analysis (CLIP vs DINOv2)
@ -250,6 +250,102 @@ Simply tuning threshold and margin parameters with CLIP is insufficient to achie

---

## Test Run: Embedding Model Comparison

**Date**: 2026-01-02
**Matching Method**: Multi-ref (max) for all tests

### Test Configuration

| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |

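For context, the multi-ref (max) decision rule these parameters feed can be sketched as below. This is a minimal illustration rather than the pipeline's actual code: it assumes L2-normalized embeddings compared by cosine similarity, and it assumes `margin` means the required gap between the best and second-best logo's top similarity; all names are hypothetical.

```python
def match_logo(crop_emb, ref_embs_by_logo,
               threshold=0.70, margin=0.05, min_refs=3):
    """Multi-ref (max) matching: score each logo by its best
    reference similarity; accept only when enough references agree
    and the best logo clears the runner-up by `margin`.

    crop_emb: (d,) unit-norm NumPy embedding of a detected crop
    ref_embs_by_logo: {logo_name: (n_refs, d) unit-norm NumPy array}
    """
    scores = {}   # logo -> max cosine similarity over its references
    support = {}  # logo -> number of references above threshold
    for logo, refs in ref_embs_by_logo.items():
        sims = refs @ crop_emb  # cosine sims, since vectors are unit-norm
        scores[logo] = float(sims.max())
        support[logo] = int((sims >= threshold).sum())

    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]
    runner_up = scores[ranked[1]] if len(ranked) > 1 else 0.0

    if (scores[best] >= threshold
            and support[best] >= min_refs
            and scores[best] - runner_up >= margin):
        return best
    return None  # no confident match
```
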
### Results Summary

| Model | TP | FP | FN | Precision | Recall | F1 |
|-------|---:|---:|---:|----------:|-------:|---:|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |

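As a sanity check, the F1 column follows from the precision and recall columns via the standard harmonic mean; for example, for the CLIP ViT-Large row:

```python
def f1(precision, recall):
    """Standard F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# CLIP ViT-Large row of the table above
tp, fp = 284, 295
precision = tp / (tp + fp)   # 284/579, approx. 0.491
recall = 0.770               # as reported
print(f"{f1(precision, recall):.3f}")  # 0.599 -> the table's 59.9%
```
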
### Analysis

#### CLIP Significantly Outperforms DINOv2

CLIP ViT-Large achieved approximately **2x the F1 score** of either DINOv2 model:

| Model | F1 Score | vs CLIP (relative) |
|-------|----------|--------------------|
| CLIP ViT-Large | 59.9% | baseline |
| DINOv2 Small | 29.5% | -50.7% |
| DINOv2 Large | 30.2% | -49.6% |

This is a substantial performance gap that cannot be closed through parameter tuning.

#### DINOv2 Model Comparison

Comparing the two DINOv2 variants:

| Metric | DINOv2 Small | DINOv2 Large | Winner |
|--------|--------------|--------------|--------|
| Precision | 22.4% | 32.2% | Large (+44% relative) |
| Recall | 42.8% | 28.5% | Small (+50% relative) |
| F1 | 29.5% | 30.2% | Large (+2% relative) |
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |

DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall. The larger model appears more conservative in its matching, rejecting more candidates overall.

#### Why DINOv2 Underperforms

1. **Training Objective Mismatch**: DINOv2 uses self-supervised learning optimized for general-purpose visual representation, not for discriminating between visually similar objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.

2. **Embedding Space Characteristics**: DINOv2's embedding space may cluster logos differently from CLIP's. The 0.70 threshold that works reasonably well for CLIP may be entirely wrong for DINOv2's similarity distribution (see the sketch after this list for one way to check).

3. **No Text-Image Alignment**: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may inadvertently help it distinguish between brands, even though it was never explicitly trained on logos.

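One way to test point 2 directly is to compare each model's similarity distributions on known positive and negative crops; if the distributions sit in different ranges per model, no shared 0.70 threshold can serve both. A minimal sketch assuming precomputed, L2-normalized embeddings (all names hypothetical):

```python
import numpy as np

def sim_stats(ref_embs, pos_embs, neg_embs):
    """Summarize cosine-similarity distributions for one embedding model.

    ref_embs: (R, d) reference embeddings, L2-normalized
    pos_embs: (P, d) crops that truly contain a reference logo
    neg_embs: (N, d) crops that do not
    """
    pos_sims = (pos_embs @ ref_embs.T).max(axis=1)  # best ref per positive
    neg_sims = (neg_embs @ ref_embs.T).max(axis=1)  # best ref per negative
    return {
        "pos_mean": pos_sims.mean(),
        "pos_p10": np.percentile(pos_sims, 10),
        "neg_mean": neg_sims.mean(),
        "neg_p90": np.percentile(neg_sims, 90),
    }
```

A workable threshold has to sit between `neg_p90` and `pos_p10`; if that window straddles 0.70 for CLIP but not for DINOv2, the shared threshold, and not the model alone, explains part of the gap.
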
#### False Positive Analysis

| Model | FP:TP Ratio | Assessment |
|-------|-------------|------------|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
| DINOv2 Small | 3.46:1 | Very high false positives |
| DINOv2 Large | 2.10:1 | High false positives |

DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.

### Key Findings

1. **CLIP remains the best choice**: Despite its limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.

2. **Model size doesn't guarantee better results**: DINOv2 Large (304M parameters) achieved only a marginally better F1 score than DINOv2 Small (22M parameters), and actually had worse recall.

3. **Thresholds may need per-model tuning**: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 may need a higher threshold.

4. **Self-supervised models are not ideal for this task**: The results suggest that self-supervised vision models like DINOv2 are not well suited to fine-grained logo discrimination without additional fine-tuning.

### Recommendations

1. **Continue using CLIP** for this logo detection pipeline unless a logo-specific model becomes available.

2. **If DINOv2 must be used**, run threshold optimization specifically for DINOv2's embedding space; the optimal threshold is likely different from CLIP's (see the sweep sketch after this list).

3. **Consider fine-tuning**: A model trained specifically on logo discrimination tasks would likely outperform both general-purpose models.

4. **Explore hybrid approaches**: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination; a sketch follows below.

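A sketch of recommendation 2: sweep the similarity threshold on a held-out labeled set and keep the value that maximizes F1, once per model. Purely illustrative; `evaluate_at` is an assumed stand-in for the pipeline's own evaluation loop:

```python
import numpy as np

def sweep_threshold(evaluate_at, lo=0.50, hi=0.95, step=0.01):
    """Grid-search the similarity threshold for one embedding model.

    evaluate_at(threshold) -> (tp, fp, fn) on a held-out labeled set;
    the held-out set must be disjoint from the images reported above.
    """
    best = (None, -1.0)
    for t in np.arange(lo, hi + 1e-9, step):
        tp, fp, fn = evaluate_at(t)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best[1]:
            best = (float(t), f1)
    return best  # (threshold, f1); run once per DINOv2 variant
```
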
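And a sketch of recommendation 4, blending CLIP similarity with a simple color-histogram check. The blend weight and histogram setup are untested assumptions for illustration, not values from this test run:

```python
import cv2  # one possible choice for color histograms

def hybrid_score(clip_sim, crop_bgr, ref_bgr, alpha=0.8):
    """Blend CLIP cosine similarity with a color-histogram similarity.

    clip_sim: cosine similarity already computed by the pipeline
    crop_bgr, ref_bgr: BGR images (NumPy arrays) of crop and reference
    alpha: assumed blend weight, to be tuned on held-out data
    """
    def hist(img):
        # 8x8x8 BGR histogram, normalized for scale invariance
        h = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                         [0, 256, 0, 256, 0, 256])
        return cv2.normalize(h, h).flatten()

    color_sim = cv2.compareHist(hist(crop_bgr), hist(ref_bgr),
                                cv2.HISTCMP_CORREL)  # in [-1, 1]
    return alpha * clip_sim + (1 - alpha) * max(color_sim, 0.0)
```
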
---

## Test Run: [Next Test Name]

*Results pending...*