Document margin behavior and update model comparison script
- Add section explaining how margin works differently in multi-ref vs margin-only matching, with examples showing why margin-only fails when using multiple references per logo
- Update run_model_comparison.sh to use optimal threshold (0.70) and margin (0.05) based on test results
- Add DINOv2 Large model test to comparison script
- Add threshold optimization test analysis to results document
@@ -224,6 +224,51 @@ This ensures confident matches and reduces ambiguous classifications.
- Margin required: 0.05
- Result: **No match** (0.82 - 0.79 = 0.03 < 0.05)

#### Margin in Multi-Ref vs Margin-Only Matching

The margin parameter applies to both `margin` and `multi-ref` methods, but operates at different levels:

| Method | What Margin Compares |
|--------|---------------------|
| `margin` | Best **reference embedding** vs second-best **reference embedding** |
| `multi-ref` | Best **logo's aggregated score** vs second-best **logo's aggregated score** |

This distinction is critical when using multiple references per logo.

#### The Problem with Margin-Only and Multiple References

In margin-only matching, all individual reference embeddings compete against each other—including references from the **same logo**. This causes legitimate matches to be rejected.

**Example showing the problem:**

Suppose Nike has 3 references and Adidas has 3 references. A detected region produces:

| Reference | Similarity |
|-----------|------------|
| Nike_ref1 | 0.92 |
| Nike_ref2 | 0.91 |
| Nike_ref3 | 0.85 |
| Adidas_ref1 | 0.78 |
| Adidas_ref2 | 0.75 |
| Adidas_ref3 | 0.72 |

**With margin-only matching (margin=0.05):**

- Best reference: Nike_ref1 (0.92)
- Second-best reference: Nike_ref2 (0.91) ← Same logo!
- Margin check: 0.92 - 0.91 = 0.01 < 0.05 → **Rejected**

The match is rejected even though this is clearly a Nike logo! Nike's own references compete against each other and fail the margin test.

**With multi-ref matching (margin=0.05):**

- First, aggregate scores per logo:
  - Nike: max(0.92, 0.91, 0.85) = 0.92
  - Adidas: max(0.78, 0.75, 0.72) = 0.78
- Best logo: Nike (0.92)
- Second-best logo: Adidas (0.78)
- Margin check: 0.92 - 0.78 = 0.14 >= 0.05 → **Accepted**

This is why margin-only matching produces very low recall when using multiple references per logo—it was designed for single-reference scenarios.
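
The two behaviors can be sketched in a few lines of Python. This is an illustrative toy run on the example scores above; the function names are made up for the demo and are not the project's actual API.

```python
# Illustrative sketch of the two matching methods, using the example
# similarity scores from the tables above. Helper names are hypothetical.
sims = {
    "Nike_ref1": 0.92, "Nike_ref2": 0.91, "Nike_ref3": 0.85,
    "Adidas_ref1": 0.78, "Adidas_ref2": 0.75, "Adidas_ref3": 0.72,
}
MARGIN = 0.05

def margin_only_match(sims, margin):
    """All reference embeddings compete, even those from the same logo."""
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    (best_ref, best), (_, second) = ranked[0], ranked[1]
    logo = best_ref.split("_")[0]
    return logo if best - second >= margin else None

def multi_ref_match(sims, margin):
    """Aggregate per logo first (max), then apply the margin across logos."""
    per_logo = {}
    for ref, score in sims.items():
        logo = ref.split("_")[0]
        per_logo[logo] = max(per_logo.get(logo, 0.0), score)
    ranked = sorted(per_logo.items(), key=lambda kv: kv[1], reverse=True)
    (best_logo, best), (_, second) = ranked[0], ranked[1]
    return best_logo if best - second >= margin else None

print(margin_only_match(sims, MARGIN))  # None: 0.92 - 0.91 = 0.01 < 0.05
print(multi_ref_match(sims, MARGIN))    # Nike: 0.92 - 0.78 = 0.14 >= 0.05
```

With multiple references per logo, the second-best *reference* is almost always a sibling of the best one, which is exactly why the margin must be applied after per-logo aggregation.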

---

### 6. Embedding Caching
@@ -13,8 +13,8 @@ REFS_PER_LOGO=10
POSITIVE_SAMPLES=20
NEGATIVE_SAMPLES=100
MIN_MATCHING_REFS=3
-THRESHOLD=0.80
+THRESHOLD=0.70
-MARGIN=0.10
+MARGIN=0.05
SEED=42

# Clear output file and write header
@@ -82,11 +82,29 @@ uv run python "$SCRIPT_DIR/test_logo_detection.py" \
--clear-cache \
--output-file "$OUTPUT_FILE"

echo ""

# Test 3: DINOv2 Large
echo "=== Test 3: DINOv2 Large (facebook/dinov2-large) ==="
uv run python "$SCRIPT_DIR/test_logo_detection.py" \
--num-logos $NUM_LOGOS \
--refs-per-logo $REFS_PER_LOGO \
--positive-samples $POSITIVE_SAMPLES \
--negative-samples $NEGATIVE_SAMPLES \
--matching-method multi-ref \
--min-matching-refs $MIN_MATCHING_REFS \
--use-max-similarity \
--threshold $THRESHOLD \
--margin $MARGIN \
--seed $SEED \
--embedding-model "facebook/dinov2-large" \
--clear-cache \
--output-file "$OUTPUT_FILE"

echo ""
echo "Results saved to: $OUTPUT_FILE"
echo ""
echo "Note: You can also try other models:"
echo " - facebook/dinov2-base"
-echo " - facebook/dinov2-large"
echo " - openai/clip-vit-base-patch32"
echo " - openai/clip-vit-large-patch14-336"
@@ -137,6 +137,119 @@ Even the best-performing method (multi-ref max) produces nearly as many false positives

---

## Test Run: Threshold Optimization Tests

**Date**: 2026-01-02
**Embedding Model**: openai/clip-vit-large-patch14
**Matching Method**: Multi-ref (max) for all tests

### Test Configuration

| Parameter | Value |
|-----------|-------|
| Reference logos | 20 |
| Refs per logo | 10 |
| Total reference embeddings | 189 |
| Positive samples per logo | 20 |
| Negative samples per logo | 100 |
| Test images processed | ~2,355 |
| DETR threshold | 0.50 |
| Min matching refs | 3 |
| Random seed | 42 |

### Results Summary

| Test | Threshold | Margin | TP | FP | FN | Precision | Recall | F1 |
|------|----------:|-------:|---:|---:|---:|----------:|-------:|---:|
| 1 (baseline) | 0.70 | 0.05 | 265 | 288 | 141 | 47.9% | 71.8% | 57.5% |
| 2 | 0.80 | 0.05 | 233 | 472 | 165 | 33.0% | 63.1% | 43.4% |
| 3 | 0.80 | 0.10 | 187 | 375 | 208 | 33.3% | 50.7% | 40.2% |
| 4 | 0.85 | 0.10 | 160 | 434 | 223 | 26.9% | 43.4% | 33.2% |
| 5 | 0.85 | 0.15 | 163 | 410 | 220 | 28.4% | 44.2% | 34.6% |
| 6 | 0.90 | 0.15 | 84 | 69 | 288 | 54.9% | 22.8% | 32.2% |
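
As a sanity check, the baseline row (Test 1) can be recomputed from its raw counts. A small sketch; note the reported recall matches only when computed against the 369 expected positives cited in the analysis, not against TP + FN from the table:

```python
# Recomputing Test 1's metrics from its raw counts.
# Recall uses the 369 expected positives quoted in the analysis, which
# reproduces the reported 71.8% (TP + FN in the table does not).
tp, fp, expected = 265, 288, 369

precision = tp / (tp + fp)
recall = tp / expected
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.1%} recall={recall:.1%} f1={f1:.1%}")
# precision=47.9% recall=71.8% f1=57.5%
```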

### Analysis

#### Counter-Intuitive Results

The most striking finding is that **raising the similarity threshold made performance worse** in most cases:

| Threshold Change | Effect on FP:TP Ratio |
|------------------|----------------------|
| 0.70 → 0.80 | 1.09:1 → 2.03:1 (worse) |
| 0.80 → 0.85 | 2.03:1 → 2.71:1 (worse) |
| 0.85 → 0.90 | 2.71:1 → 0.82:1 (better) |
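
These ratios follow directly from the results-summary counts; a quick sketch (the 0.85 entry uses Test 4's counts, which match the 2.71:1 figure):

```python
# Deriving the FP:TP ratios above from the results-summary table.
# Each entry maps threshold -> (TP, FP); 0.85 uses Test 4's counts.
results = {0.70: (265, 288), 0.80: (233, 472), 0.85: (160, 434), 0.90: (84, 69)}

for threshold, (tp, fp) in results.items():
    print(f"threshold {threshold:.2f}: FP:TP = {fp / tp:.2f}:1")
```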

This is the opposite of expected behavior. Normally, raising the threshold should reduce false positives. Instead, false positives *increased* from 288 at threshold 0.70 to 472 at threshold 0.80.

#### Why Higher Thresholds Failed

The likely explanation relates to how `min_matching_refs` interacts with the threshold:

1. **True positives are penalized more**: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the `min_matching_refs=3` requirement.
2. **False positives survive differently**: False positive detections may have 1-2 references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches.
3. **The margin becomes less effective**: When most scores are clustered below the threshold, the margin check operates on a smaller pool of candidates, reducing its discriminative power.
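
Point 1 can be illustrated with a toy sketch using hypothetical scores (not measured data): a legitimate match whose references cluster in the mid-0.7s clears `min_matching_refs=3` at threshold 0.70 but not at 0.80, while a detection with only two outlier scores never does:

```python
# Toy illustration of point 1 (hypothetical scores): counting how many
# references clear the threshold, as a min_matching_refs gate would.
MIN_MATCHING_REFS = 3

def passes(ref_scores, threshold, min_refs=MIN_MATCHING_REFS):
    """Accept a match only if at least min_refs references clear the threshold."""
    return sum(s >= threshold for s in ref_scores) >= min_refs

true_match = [0.78, 0.76, 0.74, 0.71, 0.65]   # consistent mid-range scores
false_match = [0.83, 0.81, 0.55, 0.52, 0.50]  # two spuriously high outliers

for t in (0.70, 0.80):
    print(t, passes(true_match, t), passes(false_match, t))
# 0.7 True False
# 0.8 False False
```

This only models the true-positive side; the false-positive effect in point 2 depends on max-aggregation details not captured by this gate.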

#### Threshold 0.90: Different Behavior

At threshold 0.90, behavior finally matches expectations:

- False positives dropped dramatically (69 vs 288-472 in other tests)
- But recall collapsed to 22.8%
- Only 84 true positives out of 369 expected

This suggests that at 0.90, the threshold is finally high enough to filter out most noise, but it's too aggressive and rejects most legitimate matches as well.

#### The Optimal Threshold Problem

| Threshold | Precision | Recall | F1 | Assessment |
|-----------|-----------|--------|-----|------------|
| 0.70 | 47.9% | 71.8% | **57.5%** | Best overall F1 |
| 0.80 | 33.0% | 63.1% | 43.4% | Worse than baseline |
| 0.85 | 26.9-28.4% | 43-44% | 33-35% | Much worse |
| 0.90 | 54.9% | 22.8% | 32.2% | Best precision, worst recall |

The lowest threshold tested (0.70) produced the best F1 score. This indicates:

- CLIP embeddings don't provide clean separation at any threshold
- The multi-ref matching with min_matching_refs provides better discrimination than the threshold alone
- Raising the threshold hurts true positives more than it helps reject false positives

#### Margin Parameter Impact

Comparing tests with the same threshold but different margins:

| Threshold | Margin 0.05 | Margin 0.10 | Margin 0.15 |
|-----------|-------------|-------------|-------------|
| 0.80 | F1: 43.4% | F1: 40.2% | - |
| 0.85 | - | F1: 33.2% | F1: 34.6% |

Increasing the margin had minimal effect, slightly reducing both true and false positives. The margin parameter is less impactful than the threshold in this configuration.

### Key Findings

1. **The baseline (threshold=0.70, margin=0.05) was optimal**: No threshold/margin combination tested outperformed the defaults on F1 score.
2. **Threshold tuning alone cannot fix CLIP's limitations**: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.
3. **min_matching_refs matters more than the threshold**: The requirement for multiple matching references provides better discrimination than the similarity threshold alone.
4. **The precision-recall trade-off is extreme**: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.
5. **Thresholds between the 0.70 baseline and 0.90 are a "dead zone"**: Raising the threshold into the 0.80-0.85 range produces worse results than either extreme.

### Implications

These results suggest that improving logo detection accuracy requires:

- A different embedding model with better logo discrimination
- Logo-specific fine-tuning
- Alternative matching strategies beyond threshold-based approaches
- Potentially ensemble methods combining multiple signals

Simply tuning threshold and margin parameters with CLIP is insufficient to achieve an acceptable precision/recall balance.

---

## Test Run: [Next Test Name]

*Results pending...*