diff --git a/logo_detection_test_methodology.md b/logo_detection_test_methodology.md
index 190a52d..1a4a9d5 100644
--- a/logo_detection_test_methodology.md
+++ b/logo_detection_test_methodology.md
@@ -224,6 +224,82 @@ This ensures confident matches and reduces ambiguous classifications.
 - Margin required: 0.05
 - Result: **No match** (0.82 - 0.79 = 0.03 < 0.05)
 
+#### Margin in Multi-Ref vs Margin-Only Matching
+
+The margin parameter applies to both the `margin` and `multi-ref` methods, but operates at different levels:
+
+| Method | What Margin Compares |
+|--------|---------------------|
+| `margin` | Best **reference embedding** vs second-best **reference embedding** |
+| `multi-ref` | Best **logo's aggregated score** vs second-best **logo's aggregated score** |
+
+This distinction is critical when using multiple references per logo.
+
+#### The Problem with Margin-Only and Multiple References
+
+In margin-only matching, all individual reference embeddings compete against each other—including references from the **same logo**. This causes legitimate matches to be rejected.
+
+**Example showing the problem:**
+
+Suppose Nike has 3 references and Adidas has 3 references. A detected region produces:
+
+| Reference | Similarity |
+|-----------|------------|
+| Nike_ref1 | 0.92 |
+| Nike_ref2 | 0.91 |
+| Nike_ref3 | 0.85 |
+| Adidas_ref1 | 0.78 |
+| Adidas_ref2 | 0.75 |
+| Adidas_ref3 | 0.72 |
+
+**With margin-only matching (margin=0.05):**
+- Best reference: Nike_ref1 (0.92)
+- Second-best reference: Nike_ref2 (0.91) ← same logo!
+- Margin check: 0.92 - 0.91 = 0.01 < 0.05 → **Rejected**
+
+The match is rejected even though this is clearly a Nike logo: Nike's own references compete against each other and fail the margin test.
+
+**With multi-ref matching (margin=0.05):**
+- First, aggregate scores per logo:
+  - Nike: max(0.92, 0.91, 0.85) = 0.92
+  - Adidas: max(0.78, 0.75, 0.72) = 0.78
+- Best logo: Nike (0.92)
+- Second-best logo: Adidas (0.78)
+- Margin check: 0.92 - 0.78 = 0.14 >= 0.05 → **Accepted**
+
+This is why margin-only matching produces very low recall when using multiple references per logo—it was designed for single-reference scenarios.
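+
+Below is a minimal sketch of the two checks (hypothetical helper names, not the test harness API; it assumes the per-reference cosine similarities for one detected region are already computed):
+
+```python
+# Hypothetical sketch: margin-only vs multi-ref margin checks.
+# `scores` maps (logo_name, ref_id) -> cosine similarity for one detected region.
+
+def margin_only_match(scores, threshold=0.70, margin=0.05):
+    """Compare the two best individual references, regardless of logo."""
+    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
+    (best_key, best), (_, second) = ranked[0], ranked[1]
+    if best >= threshold and best - second >= margin:
+        return best_key[0]  # logo name of the single best reference
+    return None
+
+def multi_ref_match(scores, threshold=0.70, margin=0.05):
+    """Aggregate scores per logo with max, then compare the two best logos."""
+    per_logo = {}
+    for (logo, _ref), sim in scores.items():
+        per_logo[logo] = max(per_logo.get(logo, 0.0), sim)
+    ranked = sorted(per_logo.items(), key=lambda kv: kv[1], reverse=True)
+    (best_logo, best), (_, second) = ranked[0], ranked[1]
+    if best >= threshold and best - second >= margin:
+        return best_logo
+    return None
+
+scores = {("Nike", 1): 0.92, ("Nike", 2): 0.91, ("Nike", 3): 0.85,
+          ("Adidas", 1): 0.78, ("Adidas", 2): 0.75, ("Adidas", 3): 0.72}
+print(margin_only_match(scores))  # None: 0.92 - 0.91 = 0.01 < 0.05
+print(multi_ref_match(scores))    # Nike: 0.92 - 0.78 = 0.14 >= 0.05
+```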
+
 ---
 
 ### 6. Embedding Caching
diff --git a/run_model_comparison.sh b/run_model_comparison.sh
index 5d831dc..c6eefac 100755
--- a/run_model_comparison.sh
+++ b/run_model_comparison.sh
@@ -13,8 +13,8 @@ REFS_PER_LOGO=10
 POSITIVE_SAMPLES=20
 NEGATIVE_SAMPLES=100
 MIN_MATCHING_REFS=3
-THRESHOLD=0.80
-MARGIN=0.10
+THRESHOLD=0.70
+MARGIN=0.05
 SEED=42
 
 # Clear output file and write header
@@ -82,11 +82,29 @@ uv run python "$SCRIPT_DIR/test_logo_detection.py" \
   --clear-cache \
   --output-file "$OUTPUT_FILE"
 
+echo ""
+
+# Test 3: DINOv2 Large
+echo "=== Test 3: DINOv2 Large (facebook/dinov2-large) ==="
+uv run python "$SCRIPT_DIR/test_logo_detection.py" \
+  --num-logos $NUM_LOGOS \
+  --refs-per-logo $REFS_PER_LOGO \
+  --positive-samples $POSITIVE_SAMPLES \
+  --negative-samples $NEGATIVE_SAMPLES \
+  --matching-method multi-ref \
+  --min-matching-refs $MIN_MATCHING_REFS \
+  --use-max-similarity \
+  --threshold $THRESHOLD \
+  --margin $MARGIN \
+  --seed $SEED \
+  --embedding-model "facebook/dinov2-large" \
+  --clear-cache \
+  --output-file "$OUTPUT_FILE"
+
 echo ""
 echo "Results saved to: $OUTPUT_FILE"
 echo ""
 echo "Note: You can also try other models:"
 echo " - facebook/dinov2-base"
-echo " - facebook/dinov2-large"
 echo " - openai/clip-vit-base-patch32"
 echo " - openai/clip-vit-large-patch14-336"
\ No newline at end of file
diff --git a/test_results_analysis.md b/test_results_analysis.md
index b97b828..1e35f62 100644
--- a/test_results_analysis.md
+++ b/test_results_analysis.md
@@ -137,6 +137,137 @@ Even the best-performing method (multi-ref max) produces nearly as many false po
 
 ---
 
+## Test Run: Threshold Optimization Tests
+
+**Date**: 2026-01-02
+**Embedding Model**: openai/clip-vit-large-patch14
+**Matching Method**: Multi-ref (max) for all tests
+
+### Test Configuration
+
+| Parameter | Value |
+|-----------|-------|
+| Reference logos | 20 |
+| Refs per logo | 10 |
+| Total reference embeddings | 189 |
+| Positive samples per logo | 20 |
+| Negative samples per logo | 100 |
+| Test images processed | ~2,355 |
+| DETR threshold | 0.50 |
+| Min matching refs | 3 |
+| Random seed | 42 |
+
+### Results Summary
+
+| Test | Threshold | Margin | TP | FP | FN | Precision | Recall | F1 |
+|------|----------:|-------:|---:|---:|---:|----------:|-------:|---:|
+| 1 (baseline) | 0.70 | 0.05 | 265 | 288 | 141 | 47.9% | 71.8% | 57.5% |
+| 2 | 0.80 | 0.05 | 233 | 472 | 165 | 33.0% | 63.1% | 43.4% |
+| 3 | 0.80 | 0.10 | 187 | 375 | 208 | 33.3% | 50.7% | 40.2% |
+| 4 | 0.85 | 0.10 | 160 | 434 | 223 | 26.9% | 43.4% | 33.2% |
+| 5 | 0.85 | 0.15 | 163 | 410 | 220 | 28.4% | 44.2% | 34.6% |
+| 6 | 0.90 | 0.15 | 84 | 69 | 288 | 54.9% | 22.8% | 32.2% |
+
+### Analysis
+
+#### Counter-Intuitive Results
+
+The most striking finding is that **raising the similarity threshold made performance worse** in most cases:
+
+| Threshold Change | Effect on FP:TP Ratio |
+|------------------|----------------------|
+| 0.70 → 0.80 | 1.09:1 → 2.03:1 (worse) |
+| 0.80 → 0.85 | 2.03:1 → 2.71:1 (worse) |
+| 0.85 → 0.90 | 2.71:1 → 0.82:1 (better) |
+
+This is the opposite of the expected behavior: normally, raising the threshold should reduce false positives. Instead, false positives *increased* from 288 at threshold 0.70 to 472 at threshold 0.80.
+
+#### Why Higher Thresholds Failed
+
+The likely explanation relates to how `min_matching_refs` interacts with the threshold:
+
+1. **True positives are penalized more**: Correct matches require 3+ references to exceed the threshold. At higher thresholds, fewer references clear the bar, causing legitimate matches to fail the `min_matching_refs=3` requirement (see the sketch after this list).
+
+2. **False positives survive differently**: False positive detections may have a few references that happen to score very high (above the threshold) due to random visual similarities. Since we use max aggregation, these spurious high scores still produce matches.
+
+3. **The margin becomes less effective**: When most scores fall below the threshold, the margin check operates on a smaller pool of candidates. If the second-best candidate is filtered out entirely, the margin check can pass by default, which would let previously rejected false positives through and may explain why false positives *increased* at higher thresholds.
+
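+A toy numerical illustration of the first point (the similarity values are hypothetical, not measured data), assuming a candidate logo needs at least `min_matching_refs` references above the threshold to match:
+
+```python
+# Hypothetical sketch of the threshold / min_matching_refs interaction.
+def matches(ref_sims, threshold, min_matching_refs=3):
+    """A logo matches when enough of its references clear the threshold."""
+    return sum(s >= threshold for s in ref_sims) >= min_matching_refs
+
+true_positive = [0.82, 0.79, 0.77, 0.74, 0.60]   # broad agreement, moderate scores
+false_positive = [0.91, 0.88, 0.86, 0.52, 0.50]  # a few spuriously high scores
+
+for t in (0.70, 0.80, 0.85, 0.90):
+    print(t, matches(true_positive, t), matches(false_positive, t))
+# 0.70: both match. 0.80 and 0.85: the true positive loses its supporting
+# references while the false positive keeps three high scores above the bar.
+# 0.90: both are finally rejected, mirroring the observed recall collapse.
+```
+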
+#### Threshold 0.90: Different Behavior
+
+At threshold 0.90, behavior finally matches expectations:
+- False positives dropped dramatically (69, vs 288-472 in the other tests)
+- But recall collapsed to 22.8%
+- Only 84 true positives out of 369 expected
+
+This suggests that 0.90 is finally high enough to filter out most noise, but it is too aggressive and rejects most legitimate matches as well.
+
+#### The Optimal Threshold Problem
+
+| Threshold | Precision | Recall | F1 | Assessment |
+|-----------|-----------|--------|-----|------------|
+| 0.70 | 47.9% | 71.8% | **57.5%** | Best overall F1 |
+| 0.80 | 33.0% | 63.1% | 43.4% | Worse than baseline |
+| 0.85 | 26.9-28.4% | 43-44% | 33-35% | Much worse |
+| 0.90 | 54.9% | 22.8% | 32.2% | Best precision, worst recall |
+
+The lowest threshold tested (0.70) produced the best F1 score. This indicates:
+- CLIP embeddings don't provide clean separation at any threshold
+- Multi-ref matching with `min_matching_refs` provides better discrimination than the threshold alone
+- Raising the threshold hurts true positives more than it helps reject false positives
+
+#### Margin Parameter Impact
+
+Comparing tests with the same threshold but different margins:
+
+| Threshold | Margin 0.05 | Margin 0.10 | Margin 0.15 |
+|-----------|-------------|-------------|-------------|
+| 0.80 | F1: 43.4% | F1: 40.2% | - |
+| 0.85 | - | F1: 33.2% | F1: 34.6% |
+
+Increasing the margin had only a minor effect: at threshold 0.80 it reduced both true and false positives, and at 0.85 the two margins produced nearly identical results. The margin parameter is less impactful than the threshold in this configuration.
+
+### Key Findings
+
+1. **The baseline (threshold=0.70, margin=0.05) was optimal**: No threshold/margin combination tested outperformed the defaults on F1 score.
+
+2. **Threshold tuning alone cannot fix CLIP's limitations**: The embedding space doesn't provide clear separation points that can be exploited with threshold adjustments.
+
+3. **min_matching_refs matters more than the threshold**: The requirement for multiple matching references provides better discrimination than the similarity threshold alone.
+
+4. **The precision-recall trade-off is extreme**: Achieving 55% precision (at threshold 0.90) requires accepting only 23% recall.
+
+5. **The 0.80-0.85 range is a "dead zone"**: Thresholds in this range are strictly dominated: they lose both precision and recall relative to 0.70, without approaching the precision of 0.90.
+
+### Implications
+
+These results suggest that improving logo detection accuracy requires:
+- A different embedding model with better logo discrimination
+- Logo-specific fine-tuning
+- Alternative matching strategies beyond threshold-based approaches
+- Potentially ensemble methods combining multiple signals
+
+Simply tuning threshold and margin parameters with CLIP is insufficient to achieve an acceptable precision/recall balance.
+
+---
+
 ## Test Run: [Next Test Name]
 
 *Results pending...*