Add embedding model comparison analysis (CLIP vs DINOv2)

Rick McEwen
2026-01-02 16:26:59 -05:00
parent 2c41549ae0
commit 1551360028


@ -250,6 +250,102 @@ Simply tuning threshold and margin parameters with CLIP is insufficient to achie
---
## Test Run: Embedding Model Comparison
**Date**: 2026-01-02
**Matching Method**: Multi-ref (max) for all tests
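For context, a minimal sketch of how the three embedding models could be loaded and queried with the Hugging Face `transformers` library. The model IDs, pooling choices, and function names are assumptions for illustration; the log does not show the pipeline's actual loading code.

```python
import torch
import torch.nn.functional as F
from transformers import (AutoImageProcessor, AutoModel,
                          CLIPProcessor, CLIPVisionModelWithProjection)

def clip_embed(image, model_id="openai/clip-vit-large-patch14"):
    # CLIP image embedding, projected into the shared text-image space
    processor = CLIPProcessor.from_pretrained(model_id)
    model = CLIPVisionModelWithProjection.from_pretrained(model_id)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model(**inputs).image_embeds
    return F.normalize(emb, dim=-1)  # unit norm: dot product == cosine sim

def dinov2_embed(image, model_id="facebook/dinov2-small"):
    # DINOv2 CLS-token embedding ("facebook/dinov2-large" for the large run)
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(emb, dim=-1)
```

Unit-normalizing both models' outputs lets the matcher treat a plain dot product as cosine similarity, so the same 0.70 threshold can at least be applied to both (whether it is appropriate for both is examined below).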
### Test Configuration
| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |
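The threshold, margin, and min-matching-refs parameters suggest a decision rule along the following lines. This is one plausible reading (best-reference similarity over the threshold, a margin over the runner-up logo, and a minimum count of references clearing the threshold), not the pipeline's confirmed logic:

```python
import numpy as np

def match_logo(crop_emb, ref_embs_by_logo,
               threshold=0.70, margin=0.05, min_refs=3):
    """Multi-ref (max) matching sketch; inputs are unit-norm embeddings.

    ref_embs_by_logo: dict mapping logo name -> (n_refs, dim) array.
    """
    scores = {}
    for logo, refs in ref_embs_by_logo.items():
        sims = refs @ crop_emb                    # cosine similarities
        scores[logo] = (sims.max(), int((sims >= threshold).sum()))
    ranked = sorted(scores.items(), key=lambda kv: kv[1][0], reverse=True)
    best_logo, (best_sim, n_refs) = ranked[0]
    runner_up = ranked[1][1][0] if len(ranked) > 1 else 0.0
    if (best_sim >= threshold and n_refs >= min_refs
            and best_sim - runner_up >= margin):
        return best_logo
    return None
```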
### Results Summary
| Model | TP | FP | FN | Precision | Recall | F1 |
|-------|---:|---:|---:|----------:|-------:|---:|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |
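As a sanity check, the precision and F1 columns follow from the standard formulas: precision = TP / (TP + FP), and F1 is the harmonic mean of precision and recall. Recall is taken as reported here, since the FN column does not reproduce it via TP / (TP + FN) and likely reflects a different accounting.

```python
# TP and FP from the Results Summary; recall as reported in the table.
runs = {
    "CLIP ViT-Large": (284, 295, 0.770),
    "DINOv2 Small":   (158, 546, 0.428),
    "DINOv2 Large":   (105, 221, 0.285),
}
for model, (tp, fp, recall) in runs.items():
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{model}: P={precision:.1%}  F1={f1:.1%}")
# Matches the table's precision and F1 columns to within rounding.
```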
### Analysis
#### CLIP Significantly Outperforms DINOv2
CLIP ViT-Large achieved approximately **2x the F1 score** of either DINOv2 model:
| Model | F1 Score | vs CLIP (relative) |
|-------|----------|---------|
| CLIP ViT-Large | 59.9% | baseline |
| DINOv2 Small | 29.5% | -50.7% |
| DINOv2 Large | 30.2% | -49.6% |
This is a substantial performance gap, and one unlikely to be closed by parameter tuning alone.
#### DINOv2 Model Comparison
Comparing the two DINOv2 variants:
| Metric | DINOv2 Small | DINOv2 Large | Winner |
|--------|--------------|--------------|--------|
| Precision | 22.4% | 32.2% | Large (+44%) |
| Recall | 42.8% | 28.5% | Small (+50%) |
| F1 | 29.5% | 30.2% | Large (+2%) |
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |
DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall. The larger model appears more conservative in its matching, rejecting more candidates overall.
#### Why DINOv2 Underperforms
1. **Training Objective Mismatch**: DINOv2 uses self-supervised learning optimized for general visual representation, not for discriminating between similar visual objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.
2. **Embedding Space Characteristics**: DINOv2's embedding space may cluster logos differently than CLIP's. The 0.70 threshold that works reasonably well for CLIP may be entirely wrong for DINOv2's similarity distribution (see the distribution sketch after this list).
3. **No Text-Image Alignment**: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may incidentally help it distinguish between different brands' content, even though it was not explicitly trained on logos.
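One way to test the embedding-space hypothesis is to compare each model's positive and negative similarity distributions directly. A minimal sketch, assuming `pos_sims` and `neg_sims` arrays of best-reference similarities are collected from the existing pipeline (hypothetical inputs, not part of this test run):

```python
import numpy as np

def threshold_headroom(pos_sims, neg_sims):
    """Summarize similarity distributions for one model to judge whether
    any single threshold can separate matches from non-matches."""
    pos_median = np.percentile(pos_sims, 50)  # typical true-match score
    neg_p95 = np.percentile(neg_sims, 95)     # upper tail of non-matches
    headroom = pos_median - neg_p95
    print(f"positive median={pos_median:.3f}  negative p95={neg_p95:.3f}  "
          f"headroom={headroom:.3f}")
    # headroom > 0: a threshold between the two values could work;
    # headroom <= 0: the distributions overlap, and no fixed threshold
    # (0.70 or otherwise) will separate them cleanly.
    return headroom
```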
#### False Positive Analysis
| Model | FP:TP Ratio | Assessment |
|-------|-------------|------------|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
| DINOv2 Small | 3.46:1 | Very high false positives |
| DINOv2 Large | 2.10:1 | High false positives |
DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.
### Key Findings
1. **CLIP remains the best choice**: Despite its limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.
2. **Model size doesn't guarantee better results**: DINOv2 Large (304M parameters) scored only marginally better than DINOv2 Small (22M parameters) on F1, and actually had worse recall.
3. **Threshold may need per-model tuning**: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 may need a higher threshold.
4. **Self-supervised models not ideal for this task**: The results suggest that self-supervised vision models like DINOv2 are not well-suited for fine-grained logo discrimination without additional fine-tuning.
### Recommendations
1. **Continue using CLIP** for this logo detection pipeline unless a logo-specific model becomes available.
2. **If DINOv2 must be used**, conduct threshold optimization tests specifically for DINOv2's embedding space; the optimal threshold is likely different from CLIP's (see the sweep sketch after this list).
3. **Consider fine-tuning**: Training a model specifically on logo discrimination tasks would likely outperform both general-purpose models.
4. **Explore hybrid approaches**: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination.
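For recommendation 2, a per-model threshold sweep over scored pairs would locate DINOv2's own operating point. A sketch assuming the same hypothetical `pos_sims` / `neg_sims` arrays as above:

```python
import numpy as np

def sweep_threshold(pos_sims, neg_sims, lo=0.50, hi=0.95, step=0.01):
    """Return (threshold, f1) maximizing F1 on held-out similarity scores."""
    best_t, best_f1 = None, -1.0
    for t in np.arange(lo, hi, step):
        tp = int((pos_sims >= t).sum())
        fp = int((neg_sims >= t).sum())
        fn = len(pos_sims) - tp
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```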
---
## Test Run: [Next Test Name]
*Results pending...*