Document margin behavior and update model comparison script
- Add section explaining how margin works differently in multi-ref vs margin-only matching, with examples showing why margin-only fails when using multiple references per logo
- Update run_model_comparison.sh to use optimal threshold (0.70) and margin (0.05) based on test results
- Add DINOv2 Large model test to comparison script
- Add threshold optimization test analysis to results document
@@ -137,6 +137,119 @@ Even the best-performing method (multi-ref max) produces nearly as many false po
---

## Test Run: Threshold Optimization Tests

**Date**: 2026-01-02
**Embedding Model**: openai/clip-vit-large-patch14
**Matching Method**: Multi-ref (max) for all tests

### Test Configuration

| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| DETR threshold | 0.50 |
| Min matching refs | 3 |
| Random seed | 42 |

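For context on the reference counts above, here is a minimal sketch of how a multi-reference embedding index can be built with the CLIP checkpoint named in the header. The `references/` directory layout and all helper names are assumptions for illustration, not the project's code; the total of 189 suggests some logos have fewer than 10 usable references.

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def embed(image: Image.Image) -> torch.Tensor:
    """Return a unit-normalized CLIP image embedding (shape: [768])."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).squeeze(0)

# Hypothetical layout: references/<logo_name>/*.png, up to 10 refs per logo.
ref_index: dict[str, torch.Tensor] = {}
for logo_dir in sorted(Path("references").iterdir()):
    if not logo_dir.is_dir():
        continue
    embs = [embed(Image.open(p).convert("RGB")) for p in sorted(logo_dir.glob("*.png"))]
    if embs:
        ref_index[logo_dir.name] = torch.stack(embs)  # (n_refs, 768)

print(sum(t.shape[0] for t in ref_index.values()))  # e.g. 189 total reference embeddings
```
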
### Results Summary

| Test | Threshold | Margin | TP | FP | FN | Precision | Recall | F1 |
|------|----------:|-------:|---:|---:|---:|----------:|-------:|---:|
| 1 (baseline) | 0.70 | 0.05 | 265 | 288 | 141 | 47.9% | 71.8% | 57.5% |
| 2 | 0.80 | 0.05 | 233 | 472 | 165 | 33.0% | 63.1% | 43.4% |
| 3 | 0.80 | 0.10 | 187 | 375 | 208 | 33.3% | 50.7% | 40.2% |
| 4 | 0.85 | 0.10 | 160 | 434 | 223 | 26.9% | 43.4% | 33.2% |
| 5 | 0.85 | 0.15 | 163 | 410 | 220 | 28.4% | 44.2% | 34.6% |
| 6 | 0.90 | 0.15 | 84 | 69 | 288 | 54.9% | 22.8% | 32.2% |

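A quick consistency check on these numbers: precision matches TP / (TP + FP), but the recall column only reproduces if it is computed against a fixed pool of 369 expected positives (the figure quoted in the threshold-0.90 discussion below) rather than TP / (TP + FN). A sketch, assuming that denominator:

```python
EXPECTED_POSITIVES = 369  # assumed from "84 true positives out of 369 expected" below

def metrics(tp: int, fp: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 as they appear to be computed in this table."""
    precision = tp / (tp + fp)
    recall = tp / EXPECTED_POSITIVES
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = metrics(265, 288)  # test 1 (baseline): TP=265, FP=288
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")  # 47.9% / 71.8% / 57.5%
print(f"FP:TP ratio = {288 / 265:.2f}:1")               # 1.09:1, used in the next section
```
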
### Analysis

#### Counter-Intuitive Results

The most striking finding is that **raising the similarity threshold made performance worse** in most cases:

| Threshold Change | Effect on FP:TP Ratio |
|------------------|----------------------|
| 0.70 → 0.80 | 1.09:1 → 2.03:1 (worse) |
| 0.80 → 0.85 | 2.03:1 → 2.71:1 (worse) |
| 0.85 → 0.90 | 2.71:1 → 0.82:1 (better) |

This is the opposite of the expected behavior: normally, raising the threshold should reduce false positives. Instead, false positives *increased* from 288 at threshold 0.70 to 472 at threshold 0.80.

#### Why Higher Thresholds Failed

The likely explanation relates to how `min_matching_refs` interacts with the threshold (see the sketch after this list):

1. **True positives are penalized more**: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the `min_matching_refs=3` requirement.

2. **False positives survive differently**: False positive detections may have 1-2 references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches.

3. **The margin becomes less effective**: When most scores are clustered below the threshold, the margin check operates on a smaller pool of candidates, reducing its discriminative power.

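The following is a minimal sketch of one plausible decision rule combining these parameters (per-logo max aggregation, a per-reference threshold, `min_matching_refs`, and a best-versus-runner-up margin). It illustrates the interaction described in points 1 and 3 above; it is not the project's actual implementation, whose exact rule may differ:

```python
import numpy as np

def match_logo(
    crop_emb: np.ndarray,              # unit-normalized (dim,) embedding of a DETR crop
    ref_index: dict[str, np.ndarray],  # logo name -> (n_refs, dim) unit-normalized refs
    threshold: float = 0.70,
    margin: float = 0.05,
    min_matching_refs: int = 3,
) -> str | None:
    """One plausible multi-ref (max) decision rule -- a sketch, not the project's code."""
    scored = {}
    for name, refs in ref_index.items():
        sims = refs @ crop_emb  # cosine similarity to every reference of this logo
        scored[name] = (float(sims.max()), int((sims >= threshold).sum()))

    # Point 1: at a higher threshold, fewer references clear the bar, so
    # legitimate logos drop out of this candidate set entirely.
    candidates = {n: s for n, (s, k) in scored.items() if k >= min_matching_refs}
    if not candidates:
        return None

    # Point 3: the margin check only discriminates among surviving candidates.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    if len(ranked) > 1 and best_score - ranked[1][1] < margin:
        return None  # ambiguous between two logos
    return best_name
```
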
#### Threshold 0.90: Different Behavior

At threshold 0.90, behavior finally matches expectations:

- False positives dropped dramatically (69 vs 288-472 in the other tests)
- But recall collapsed to 22.8%
- Only 84 true positives out of 369 expected

This suggests that at 0.90 the threshold is finally high enough to filter out most noise, but it is too aggressive and rejects most legitimate matches as well.

#### The Optimal Threshold Problem

| Threshold | Precision | Recall | F1 | Assessment |
|-----------|-----------|--------|-----|------------|
| 0.70 | 47.9% | 71.8% | **57.5%** | Best overall F1 |
| 0.80 | 33.0% | 63.1% | 43.4% | Worse than baseline |
| 0.85 | 26.9-28.4% | 43-44% | 33-35% | Much worse |
| 0.90 | 54.9% | 22.8% | 32.2% | Best precision, worst recall |

The lowest threshold tested (0.70) produced the best F1 score. This indicates (see the sweep sketch after this list):

- CLIP embeddings don't provide clean separation at any threshold
- Multi-ref matching with `min_matching_refs` provides better discrimination than the threshold alone
- Raising the threshold hurts true positives more than it helps reject false positives

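Operating-point selection over these measured runs is then mechanical. A sketch, reusing the 369-positive denominator assumed earlier; the (TP, FP) pairs are taken from the Results Summary, one row per threshold:

```python
EXPECTED_POSITIVES = 369  # assumed, as above

def f1(tp: int, fp: int) -> float:
    p = tp / (tp + fp)
    r = tp / EXPECTED_POSITIVES
    return 2 * p * r / (p + r)

# (TP, FP) per threshold, from the Results Summary rows above.
runs = {0.70: (265, 288), 0.80: (233, 472), 0.85: (160, 434), 0.90: (84, 69)}
best = max(runs, key=lambda t: f1(*runs[t]))
print(best, round(f1(*runs[best]), 3))  # 0.7 0.575
```
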
#### Margin Parameter Impact

Comparing tests with the same threshold but different margins:

| Threshold | Margin 0.05 | Margin 0.10 | Margin 0.15 |
|-----------|-------------|-------------|-------------|
| 0.80 | F1: 43.4% | F1: 40.2% | - |
| 0.85 | - | F1: 33.2% | F1: 34.6% |

Increasing the margin had only a minor effect: at threshold 0.80 it slightly reduced both true and false positives, and at 0.85 the differences were within noise. The margin parameter is less impactful than the threshold in this configuration.

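The commit message at the top of this page documents why margin-only matching fails with multiple references per logo. A toy illustration (all scores and logo names below are made up): with several references of the same logo in the index, the best and runner-up raw scores usually belong to sibling references, so a margin applied directly to reference scores rejects exactly the strongest true matches, whereas applying it after per-logo max aggregation compares different logos.

```python
import numpy as np

margin = 0.05

# Hypothetical similarities between one crop and the 10 references of the SAME logo.
sims = np.array([0.83, 0.82, 0.78, 0.74, 0.71, 0.69, 0.66, 0.61, 0.58, 0.55])

# Margin-only over raw reference scores: best and runner-up are sibling
# references of one logo, so the true match is rejected as "ambiguous".
print(sims[0] - sims[1] >= margin)  # False -> rejected

# Multi-ref: aggregate per logo first (max), then apply the margin across logos.
per_logo = {"acme": float(sims.max()), "globex": 0.62}  # second logo is made up
top, runner_up = sorted(per_logo.values(), reverse=True)[:2]
print(top - runner_up >= margin)    # True -> accepted
```
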
### Key Findings

1. **The baseline (threshold=0.70, margin=0.05) was optimal**: No threshold/margin combination tested outperformed the defaults for F1 score.

2. **Threshold tuning alone cannot fix CLIP's limitations**: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.

3. **min_matching_refs matters more than the threshold**: The requirement for multiple matching references provides better discrimination than the similarity threshold.

4. **The precision-recall trade-off is extreme**: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.

5. **The 0.80-0.85 range is a "dead zone"**: Thresholds in this middle range produce worse F1 than the 0.70 baseline and worse precision than 0.90.

### Implications

These results suggest that improving logo detection accuracy requires:

- A different embedding model with better logo discrimination
- Logo-specific fine-tuning
- Alternative matching strategies beyond threshold-based approaches
- Potentially ensemble methods combining multiple signals

Simply tuning threshold and margin parameters with CLIP is insufficient to achieve acceptable precision/recall balance.

---

## Test Run: [Next Test Name]

*Results pending...*