diff --git a/test_results_analysis.md b/test_results_analysis.md
index 1e35f62..f7d3877 100644
--- a/test_results_analysis.md
+++ b/test_results_analysis.md
@@ -250,6 +250,102 @@ Simply tuning threshold and margin parameters with CLIP is insufficient to achie

---

## Test Run: Embedding Model Comparison

**Date**: 2026-01-02
**Matching Method**: Multi-ref (max) for all tests

### Test Configuration

| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| Similarity threshold | 0.70 |
| DETR threshold | 0.50 |
| Margin | 0.05 |
| Min matching refs | 3 |
| Random seed | 42 |

### Results Summary

| Model | TP | FP | FN | Precision | Recall | F1 |
|-------|---:|---:|---:|----------:|-------:|---:|
| CLIP ViT-Large | 284 | 295 | 124 | 49.1% | 77.0% | 59.9% |
| DINOv2 Small | 158 | 546 | 234 | 22.4% | 42.8% | 29.5% |
| DINOv2 Large | 105 | 221 | 277 | 32.2% | 28.5% | 30.2% |

### Analysis

#### CLIP Significantly Outperforms DINOv2

CLIP ViT-Large achieved approximately **2x the F1 score** of either DINOv2 model:

| Model | F1 Score | Relative to CLIP |
|-------|----------|------------------|
| CLIP ViT-Large | 59.9% | baseline |
| DINOv2 Small | 29.5% | -50.7% |
| DINOv2 Large | 30.2% | -49.6% |

This is a substantial performance gap that parameter tuning alone is unlikely to close.

#### DINOv2 Model Comparison

Comparing the two DINOv2 variants (relative differences in parentheses):

| Metric | DINOv2 Small | DINOv2 Large | Winner |
|--------|--------------|--------------|--------|
| Precision | 22.4% | 32.2% | Large (+44%) |
| Recall | 42.8% | 28.5% | Small (+50%) |
| F1 | 29.5% | 30.2% | Large (+2%) |
| FP:TP Ratio | 3.46:1 | 2.10:1 | Large |

DINOv2 Large shows better precision and fewer false positives, but at the cost of significantly lower recall.
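This trade-off falls directly out of the decision rule. The exact rule is not spelled out in this log, so the sketch below is a hypothetical reconstruction of a multi-ref (max) matcher using the configured parameters: it assumes a detected region matches a logo when at least `min_matching_refs` of that logo's reference similarities clear the similarity threshold, and the best logo's maximum similarity beats the runner-up's by at least the margin.

```python
# Hypothetical reconstruction of the multi-ref (max) matching rule; the
# acceptance criteria below are assumptions, not the pipeline's actual code.

def match_logo(sims_by_logo, threshold=0.70, margin=0.05, min_matching_refs=3):
    """sims_by_logo: {logo_name: [cosine sim vs. each reference embedding]}.

    Returns the matched logo name, or None if no logo qualifies.
    """
    # A logo is a candidate only if enough of its references clear the threshold.
    candidates = {
        logo: max(sims)                      # multi-ref (max): score = best ref
        for logo, sims in sims_by_logo.items()
        if sum(s >= threshold for s in sims) >= min_matching_refs
    }
    if not candidates:
        return None
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_logo, best_score = ranked[0]
    # Margin check: the winner must clearly beat the runner-up.
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        return None
    return best_logo

# Made-up similarity scores: "acme" has 3 refs above 0.70, "globex" only 1.
print(match_logo({
    "acme":   [0.82, 0.79, 0.75, 0.41],
    "globex": [0.71, 0.52, 0.48],
}))  # prints: acme
```

Under a rule like this, raising the threshold or margin rejects more candidates, which is consistent with the conservative behavior observed for DINOv2 Large: fewer false positives at the cost of lower recall.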
The larger model appears more conservative in its matching, rejecting more candidates overall.

#### Why DINOv2 Underperforms

1. **Training Objective Mismatch**: DINOv2 uses self-supervised learning optimized for general-purpose visual representation, not for discriminating between similar visual objects. While it excels at semantic understanding, logo matching requires fine-grained visual discrimination.

2. **Embedding Space Characteristics**: DINOv2's embedding space may cluster logos differently than CLIP's. The 0.70 threshold that works reasonably well for CLIP may be entirely wrong for DINOv2's similarity distribution.

3. **No Text-Image Alignment**: Unlike CLIP, DINOv2 has no concept of semantic labels. CLIP's text-image training may inadvertently help it distinguish branded content, even though it was never explicitly trained on logos.

#### False Positive Analysis

| Model | FP:TP Ratio | Assessment |
|-------|-------------|------------|
| CLIP ViT-Large | 1.04:1 | Approximately balanced |
| DINOv2 Small | 3.46:1 | Very high false positives |
| DINOv2 Large | 2.10:1 | High false positives |

DINOv2 Small produces over 3x as many false positives as true positives, making it unsuitable for this task without significant threshold adjustment.

### Key Findings

1. **CLIP remains the best choice**: Despite the limitations documented in earlier tests, CLIP substantially outperforms DINOv2 for logo matching with the current pipeline and parameters.

2. **Model size doesn't guarantee better results**: DINOv2 Large (~304M parameters) scored only marginally better on F1 than DINOv2 Small (~22M parameters), and actually had worse recall.

3. **Thresholds may need per-model tuning**: The 0.70 threshold optimized for CLIP may not be appropriate for DINOv2. The high false positive rates suggest DINOv2 needs a higher threshold.

4. **Self-supervised models not ideal for this task**: The results suggest that self-supervised vision models like DINOv2 are not well suited to fine-grained logo discrimination without additional fine-tuning.

### Recommendations

1. **Continue using CLIP** for this logo detection pipeline unless a logo-specific model becomes available.

2. **If DINOv2 must be used**, run threshold optimization tests specifically for DINOv2's embedding space; the optimal threshold is likely different from CLIP's.

3. **Consider fine-tuning**: A model trained specifically on logo discrimination would likely outperform both general-purpose models.

4. **Explore hybrid approaches**: Combining CLIP's semantic understanding with additional visual features (edges, colors, shapes) might improve discrimination.

---

## Test Run: [Next Test Name]

*Results pending...*