logo_test/logo_detection_test_methodology.md
Rick McEwen 2c41549ae0 Document margin behavior and update model comparison script
- Add section explaining how margin works differently in multi-ref vs
  margin-only matching, with examples showing why margin-only fails
  when using multiple references per logo
- Update run_model_comparison.sh to use optimal threshold (0.70) and
  margin (0.05) based on test results
- Add DINOv2 Large model test to comparison script
- Add threshold optimization test analysis to results document
2026-01-02 14:42:53 -05:00


Logo Detection Test Methodology

This document describes how the logo detection test framework works and the various techniques implemented to improve detection accuracy.

Overview

The system uses a two-stage pipeline:

  1. DETR (DEtection TRansformer) - Detects potential logo regions in images
  2. CLIP (Contrastive Language-Image Pre-training) - Extracts feature embeddings for matching

Test Framework (test_logo_detection.py)

Test Flow

  1. Sample Reference Logos: Randomly select N logos from the database, with multiple reference images per logo
  2. Compute Reference Embeddings: Generate CLIP embeddings for all reference logo images
  3. Build Test Set: For each sampled logo, select:
    • Positive samples: Images known to contain the logo
    • Negative samples: Images known NOT to contain the logo
  4. Run Detection: Process each test image through DETR to find logo regions
  5. Match Against References: Compare detected regions against reference embeddings using the selected matching method (margin-based by default)
  6. Calculate Metrics: Compute precision, recall, and F1 score

Configurable Parameters

General Parameters

| Parameter | Default | Description |
|---|---|---|
| --num-logos | 10 | Number of reference logos to sample |
| --refs-per-logo | 3 | Reference images per logo |
| --positive-samples | 5 | Positive test images per logo |
| --negative-samples | 20 | Negative test images per logo |
| --threshold | 0.7 | CLIP similarity threshold for matching |
| --detr-threshold | 0.5 | DETR detection confidence threshold |
| --seed | None | Random seed for reproducibility |

Matching Method Selection

| Parameter | Default | Description |
|---|---|---|
| --matching-method | margin | Matching method: simple, margin, or multi-ref |
| --margin | 0.05 | Required margin between best and second-best match (applies to margin and multi-ref) |

Multi-Ref Method Parameters (when --matching-method multi-ref)

| Parameter | Default | Description |
|---|---|---|
| --min-matching-refs | 1 | Minimum references that must match above threshold |
| --use-max-similarity | False | Use max similarity instead of mean across references |

Cache Control

| Parameter | Default | Description |
|---|---|---|
| --no-cache | False | Disable embedding cache |
| --clear-cache | False | Clear cache before running |
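
The parameter tables above could be wired up roughly as follows. This is a minimal sketch, not the actual parser in test_logo_detection.py: the function name build_parser is hypothetical, and mapping the short option -n (seen in the usage examples) to --num-logos is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    p = argparse.ArgumentParser(description="Logo detection test (sketch)")
    # General parameters
    p.add_argument("-n", "--num-logos", type=int, default=10)  # -n mapping is assumed
    p.add_argument("--refs-per-logo", type=int, default=3)
    p.add_argument("--positive-samples", type=int, default=5)
    p.add_argument("--negative-samples", type=int, default=20)
    p.add_argument("--threshold", type=float, default=0.7)
    p.add_argument("--detr-threshold", type=float, default=0.5)
    p.add_argument("--seed", type=int, default=None)
    # Matching method selection
    p.add_argument("--matching-method", default="margin",
                   choices=["simple", "margin", "multi-ref"])
    p.add_argument("--margin", type=float, default=0.05)
    # Multi-ref parameters
    p.add_argument("--min-matching-refs", type=int, default=1)
    p.add_argument("--use-max-similarity", action="store_true")
    # Cache control
    p.add_argument("--no-cache", action="store_true")
    p.add_argument("--clear-cache", action="store_true")
    return p


# Parse an explicit argument list so the sketch does not read sys.argv.
args = build_parser().parse_args(["-n", "20", "--matching-method", "multi-ref"])
print(args.num_logos, args.matching_method, args.threshold)
```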

Metrics

  • True Positives: Detected logo correctly matches expected logo
  • False Positives: Detected logo matches wrong logo or image has no logo
  • False Negatives: Expected logo not detected/matched
  • Precision: TP / (TP + FP) - How many detections were correct
  • Recall: TP / Total Expected - How many logos were found
  • F1 Score: Harmonic mean of precision and recall
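
As a rough illustration of the formulas above, the following sketch computes the three metrics from raw counts. It assumes that "Total Expected" equals TP + FN and returns 0.0 when a denominator is zero; the function name compute_metrics is hypothetical and the counts are made up for the example.

```python
def compute_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Assumption: total expected logos = true positives + false negatives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only, not results from the test framework.
metrics = compute_metrics(tp=40, fp=10, fn=10)
print(metrics)
```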

Accuracy Improvement Techniques

1. Non-Maximum Suppression (NMS)

Location: logo_detection_detr.py:214-268

Problem: DETR may produce multiple overlapping bounding boxes for the same logo.

Solution: NMS removes redundant detections by:

  1. Sorting detections by confidence score (descending)
  2. Keeping the highest-scoring box
  3. Removing any remaining boxes with IoU > threshold (default 0.5)
  4. Repeating until no boxes remain

IoU (Intersection over Union) = Area of Overlap / Area of Union

Configuration: nms_iou_threshold parameter (default: 0.5)
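
The four steps above can be sketched in plain Python as follows. This is an illustrative greedy NMS, not the code at logo_detection_detr.py:214-268; the (x1, y1, x2, y2) box format and the function names iou and nms are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score); returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring box still in play
        kept.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


detections = [
    ((10, 10, 110, 110), 0.9),
    ((12, 12, 112, 112), 0.8),    # near-duplicate of the first box
    ((200, 200, 260, 260), 0.7),  # separate region
]
kept = nms(detections)
print([score for _, score in kept])
```

In this example the second box overlaps the first with an IoU well above 0.5, so only the first and third detections survive.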


2. Minimum Box Size Filtering

Location: logo_detection_detr.py:187-191

Problem: Very small detections are often noise or partial logo fragments.

Solution: Filter out detections where width OR height is below a minimum threshold.

Configuration: min_box_size parameter (default: 20 pixels)


3. Confidence Threshold Filtering

Location: logo_detection_detr.py:177-179

Problem: Low-confidence DETR detections are unreliable.

Solution: Only keep detections with confidence score >= threshold.

Configuration: detr_threshold parameter (default: 0.5)
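
Both the confidence filter and the size filter from the previous section amount to simple predicates on each detection. A minimal sketch, assuming detections are (box, score) pairs with boxes as (x1, y1, x2, y2); the function name filter_detections is hypothetical:

```python
def filter_detections(detections, detr_threshold=0.5, min_box_size=20):
    """Keep detections that pass both the confidence and the size check."""
    kept = []
    for box, score in detections:
        x1, y1, x2, y2 = box
        width, height = x2 - x1, y2 - y1
        if score < detr_threshold:
            continue  # low-confidence detection
        if width < min_box_size or height < min_box_size:
            continue  # too small in at least one dimension
        kept.append((box, score))
    return kept


candidates = [
    ((0, 0, 100, 60), 0.90),  # kept
    ((0, 0, 100, 60), 0.30),  # dropped: confidence below 0.5
    ((0, 0, 100, 10), 0.95),  # dropped: height below 20 pixels
]
filtered = filter_detections(candidates)
print(filtered)
```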


4. Multi-Reference Matching

Location: logo_detection_detr.py:397-457 (find_best_match_multi_ref)

Problem: A single reference image may not capture all variations of a logo (different angles, lighting, scales).

Solution: Use multiple reference images per logo and aggregate their similarity scores:

  • Calculate similarity to each reference embedding
  • Count how many references match above threshold
  • Use mean or max similarity as the aggregate score
  • Require a minimum number of references to match

Configuration:

  • refs_per_logo: Number of reference images (default: 3)
  • min_matching_refs: Minimum references that must match above threshold (default: 1)
  • use_max_similarity: Use max instead of mean aggregation (default: False)

Mean vs Max Similarity Aggregation

When comparing a detected region against multiple reference images for the same logo, we need to combine the individual similarity scores into a single aggregate score. The two options are:

Mean Similarity (default, --use-max-similarity NOT set):

  • Calculates the average similarity across ALL reference images
  • More conservative: requires consistent matching across references
  • Better at rejecting false positives where only one reference happens to match

Max Similarity (--use-max-similarity flag):

  • Takes the HIGHEST similarity score from any single reference
  • More lenient: only needs one good match to succeed
  • Better recall when logos have high variability (one reference might be a perfect match)

Detailed Example

Suppose we have 5 reference images for the Nike logo, and a detected region produces these similarity scores:

| Reference | Similarity |
|---|---|
| nike_ref1.png | 0.92 |
| nike_ref2.png | 0.78 |
| nike_ref3.png | 0.85 |
| nike_ref4.png | 0.71 |
| nike_ref5.png | 0.88 |

With Mean Aggregation:

Score = (0.92 + 0.78 + 0.85 + 0.71 + 0.88) / 5 = 0.828

The score reflects the overall consistency of the match. If one reference is an outlier (like nike_ref4 at 0.71), it pulls the average down.

With Max Aggregation:

Score = max(0.92, 0.78, 0.85, 0.71, 0.88) = 0.92

The score reflects the best possible match. The lower-scoring references don't affect the result.

When to Use Each

| Scenario | Recommended | Why |
|---|---|---|
| Logos with consistent appearance | Mean | Penalizes partial matches that only hit one variant |
| Logos with high variability (different colors, orientations) | Max | One reference matching well is sufficient evidence |
| High false positive rate | Mean | More conservative scoring reduces false matches |
| High false negative rate | Max | More lenient scoring catches more true matches |
| Reference images are all similar | Either | Results will be similar |
| Reference images show different logo variants | Max | Each variant should be allowed to match independently |

Combined Example with min_matching_refs

The min_matching_refs parameter works independently of the aggregation method. It counts how many references exceed the threshold, regardless of which aggregation is used for the final score.

Example with threshold=0.80, min_matching_refs=2:

| Reference | Similarity | Above Threshold? |
|---|---|---|
| nike_ref1.png | 0.92 | Yes |
| nike_ref2.png | 0.78 | No |
| nike_ref3.png | 0.85 | Yes |
| nike_ref4.png | 0.71 | No |
| nike_ref5.png | 0.88 | Yes |

  • References above threshold: 3 (nike_ref1, nike_ref3, nike_ref5)
  • min_matching_refs requirement: 2 ✓ (3 >= 2, so we proceed)
  • Mean score: 0.828
  • Max score: 0.92

If only 1 reference were above threshold, the match would be rejected regardless of the aggregated score.
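
The Nike example can be reproduced with a short sketch. It is not the repository's find_best_match_multi_ref: the function name aggregate_logo_score is hypothetical, a reference is assumed to "match" when its similarity is >= threshold, and any further check of the aggregate score against the threshold is left out.

```python
from statistics import mean


def aggregate_logo_score(similarities, threshold, min_matching_refs=1,
                         use_max_similarity=False):
    """Return the aggregate score for one logo, or None if too few refs match."""
    # Assumption: a reference counts as matching when similarity >= threshold.
    matching = sum(1 for s in similarities if s >= threshold)
    if matching < min_matching_refs:
        return None
    return max(similarities) if use_max_similarity else mean(similarities)


nike = [0.92, 0.78, 0.85, 0.71, 0.88]

mean_score = aggregate_logo_score(nike, threshold=0.80, min_matching_refs=2)
max_score = aggregate_logo_score(nike, threshold=0.80, min_matching_refs=2,
                                 use_max_similarity=True)
rejected = aggregate_logo_score(nike, threshold=0.80, min_matching_refs=4)
print(round(mean_score, 3), max_score, rejected)
```

Only three references reach 0.80, so raising min_matching_refs to 4 rejects the match no matter which aggregation is selected.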


5. Margin-Based Matching

Location: logo_detection_detr.py:459-505 (find_best_match_with_margin)

Problem: When multiple logos have similar embeddings, the best match may not be significantly better than alternatives, leading to false positives.

Solution: Require the best match to exceed the second-best match by a minimum margin:

Match only if: best_similarity - second_best_similarity >= margin

This ensures confident matches and reduces ambiguous classifications.

Configuration: --margin parameter (default: 0.05)

Example:

  • Best match: Logo A with similarity 0.82
  • Second best: Logo B with similarity 0.79
  • Margin required: 0.05
  • Result: No match (0.82 - 0.79 = 0.03 < 0.05)
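
The margin rule can be sketched as follows. This is an illustration only, not the repository's find_best_match_with_margin; the function name pick_with_margin, the dict input format, and the ordering of the threshold and margin checks are assumptions.

```python
def pick_with_margin(scores, threshold=0.7, margin=0.05):
    """scores maps label -> similarity; returns the winning label or None."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_label, best = ranked[0]
    if best < threshold:
        return None  # best candidate is not similar enough
    # With a single candidate there is no competitor to beat.
    second = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best - second < margin:
        return None  # ambiguous: runner-up is too close
    return best_label


ambiguous = pick_with_margin({"Logo A": 0.82, "Logo B": 0.79})
confident = pick_with_margin({"Logo A": 0.82, "Logo B": 0.70})
print(ambiguous, confident)
```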

Margin in Multi-Ref vs Margin-Only Matching

The margin parameter applies to both margin and multi-ref methods, but operates at different levels:

| Method | What Margin Compares |
|---|---|
| margin | Best reference embedding vs second-best reference embedding |
| multi-ref | Best logo's aggregated score vs second-best logo's aggregated score |

This distinction is critical when using multiple references per logo.

The Problem with Margin-Only and Multiple References

In margin-only matching, all individual reference embeddings compete against each other, including references from the same logo. This causes legitimate matches to be rejected.

Example showing the problem:

Suppose Nike has 3 references and Adidas has 3 references. A detected region produces:

| Reference | Similarity |
|---|---|
| Nike_ref1 | 0.92 |
| Nike_ref2 | 0.91 |
| Nike_ref3 | 0.85 |
| Adidas_ref1 | 0.78 |
| Adidas_ref2 | 0.75 |
| Adidas_ref3 | 0.72 |

With margin-only matching (margin=0.05):

  • Best reference: Nike_ref1 (0.92)
  • Second-best reference: Nike_ref2 (0.91) ← Same logo!
  • Margin check: 0.92 - 0.91 = 0.01 < 0.05 → Rejected

The match is rejected even though this is clearly a Nike logo! Nike's own references compete against each other and fail the margin test.

With multi-ref matching (margin=0.05):

  • First, aggregate scores per logo (max aggregation shown; mean aggregation also passes the margin check here):
    • Nike: max(0.92, 0.91, 0.85) = 0.92
    • Adidas: max(0.78, 0.75, 0.72) = 0.78
  • Best logo: Nike (0.92)
  • Second-best logo: Adidas (0.78)
  • Margin check: 0.92 - 0.78 = 0.14 >= 0.05 → Accepted

This is why margin-only matching produces very low recall when using multiple references per logo: it was designed for single-reference scenarios.
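
The contrast between the two levels of comparison can be reproduced with the Nike/Adidas numbers. A minimal sketch with hypothetical function names, showing only the margin check (threshold and min_matching_refs checks are omitted):

```python
refs = {
    "Nike": [0.92, 0.91, 0.85],
    "Adidas": [0.78, 0.75, 0.72],
}


def margin_only(refs, margin):
    """Compare the two best individual references, regardless of logo."""
    flat = sorted(((s, logo) for logo, sims in refs.items() for s in sims),
                  reverse=True)
    (best, logo), (second, _) = flat[0], flat[1]
    return logo if best - second >= margin else None


def multi_ref(refs, margin, use_max=True):
    """Aggregate per logo first, then compare the two best logos."""
    agg = {logo: (max(s) if use_max else sum(s) / len(s))
           for logo, s in refs.items()}
    ranked = sorted(agg.items(), key=lambda kv: kv[1], reverse=True)
    (logo, best), (_, second) = ranked[0], ranked[1]
    return logo if best - second >= margin else None


print(margin_only(refs, margin=0.05))  # Nike_ref1 vs Nike_ref2: rejected
print(multi_ref(refs, margin=0.05))    # Nike vs Adidas: accepted
```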


6. Embedding Caching

Location: test_logo_detection.py:49-82 (EmbeddingCache class)

Problem: Computing CLIP embeddings is computationally expensive. Re-running tests would reprocess the same images.

Solution: Cache embeddings to disk using pickle:

  • Reference embeddings keyed by ref:{filename}
  • Detection results keyed by det:{filename}
  • Cache persists between runs (.embedding_cache.pkl)

Configuration:

  • --no-cache: Disable caching entirely
  • --clear-cache: Clear cache before running
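
A pickle-backed cache along these lines might look like the sketch below. It is not the actual EmbeddingCache class: the class name EmbeddingCacheSketch and its method names are hypothetical, and only the key prefixes and the .embedding_cache.pkl file name come from the description above.

```python
import pickle
import tempfile
from pathlib import Path


class EmbeddingCacheSketch:
    """Hypothetical disk cache keyed by 'ref:{filename}' / 'det:{filename}'."""

    def __init__(self, path=".embedding_cache.pkl", enabled=True):
        self.path = Path(path)
        self.enabled = enabled  # False mimics --no-cache
        self.data = {}
        if enabled and self.path.exists():
            with self.path.open("rb") as f:
                self.data = pickle.load(f)

    def get(self, kind, filename):
        return self.data.get(f"{kind}:{filename}") if self.enabled else None

    def put(self, kind, filename, value):
        if self.enabled:
            self.data[f"{kind}:{filename}"] = value

    def save(self):
        if self.enabled:
            with self.path.open("wb") as f:
                pickle.dump(self.data, f)

    def clear(self):
        """Mimics --clear-cache: forget entries and delete the file."""
        self.data = {}
        if self.path.exists():
            self.path.unlink()


# Demo in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / ".embedding_cache.pkl"
    cache = EmbeddingCacheSketch(cache_file)
    cache.put("ref", "nike_ref1.png", [0.1, 0.2, 0.3])
    cache.save()
    reloaded = EmbeddingCacheSketch(cache_file)
    print(reloaded.get("ref", "nike_ref1.png"))
```

Note that pickle should only be used to load cache files the test run itself produced, since unpickling untrusted data can execute arbitrary code.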

7. Normalized Embeddings for Cosine Similarity

Location: logo_detection_detr.py:334-335

Problem: Raw CLIP embeddings have varying magnitudes, which can affect similarity calculations.

Solution: L2-normalize all embeddings before comparison:

features = F.normalize(features, dim=-1)

After normalization every embedding has unit length, so the dot product of two embeddings equals their cosine similarity and scores fall in the range [-1, 1].
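
The effect of L2 normalization can be checked without any deep learning library. The sketch below uses plain Python lists instead of tensors purely to stay dependency-free; the helper names l2_normalize and dot are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([30.0, 40.0])   # same direction, 10x the magnitude
c = l2_normalize([-3.0, -4.0])   # opposite direction

same_direction = dot(a, b)
opposite_direction = dot(a, c)
print(same_direction, opposite_direction)
```

Vectors that differ only in magnitude end up with a similarity of about 1, and opposite vectors with about -1, which is the behavior the matching thresholds rely on.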


Matching Methods Summary

| Method | Test Script Option | Key Feature |
|---|---|---|
| find_all_matches | --matching-method simple | Returns ALL logos above threshold (baseline, most permissive) |
| find_best_match_with_margin | --matching-method margin | Requires margin over second-best match |
| find_best_match_multi_ref | --matching-method multi-ref | Aggregates scores across reference images |

The test script supports simple, margin, and multi-ref matching methods via the --matching-method parameter.


Detection Pipeline Summary

Input Image
    │
    ▼
┌─────────────────────────────────────┐
│  DETR Object Detection              │
│  - Identifies potential logo regions│
│  - Returns bounding boxes + scores  │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Confidence Filtering               │
│  - Remove detections < threshold    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Size Filtering                     │
│  - Remove boxes < min_box_size      │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  CLIP Embedding Extraction          │
│  - Crop each detected region        │
│  - Generate normalized embedding    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Non-Maximum Suppression            │
│  - Remove overlapping detections    │
│  - Keep highest confidence boxes    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Matching (selectable method)       │
│  ┌─────────┬─────────┬────────────┐ │
│  │ simple  │ margin  │ multi-ref  │ │
│  ├─────────┼─────────┼────────────┤ │
│  │ All     │ Require │ Aggregate  │ │
│  │ matches │ margin  │ across     │ │
│  │ above   │ over    │ refs       │ │
│  │ thresh  │ 2nd best│ (mean/max) │ │
│  └─────────┴─────────┴────────────┘ │
└─────────────────────────────────────┘
    │
    ▼
Matched Logo Labels

Tuning Recommendations

For Simple Matching (--matching-method simple)

| Goal | Adjustments |
|---|---|
| Reduce false positives | Increase --threshold (only tuning option for simple method) |
| Reduce false negatives | Decrease --threshold |

Note: Simple matching is primarily used as a baseline. For production use, consider margin or multi-ref.

For Margin-Based Matching (--matching-method margin)

| Goal | Adjustments |
|---|---|
| Reduce false positives | Increase --threshold, increase --margin |
| Reduce false negatives | Decrease --threshold, decrease --margin |

For Multi-Ref Matching (--matching-method multi-ref)

| Goal | Adjustments |
|---|---|
| Reduce false positives | Increase --threshold, increase --margin, increase --min-matching-refs, use mean similarity |
| Reduce false negatives | Decrease --threshold, decrease --margin, decrease --min-matching-refs, use --use-max-similarity |

General Tuning

| Goal | Adjustments |
|---|---|
| Faster processing | Decrease --refs-per-logo, use caching |
| More robust detection | Increase --refs-per-logo, decrease --detr-threshold |
| Higher precision | Increase --detr-threshold, use margin method with high margin |
| Higher recall | Decrease --detr-threshold, use multi-ref with low --min-matching-refs |

Example Usage

# Simple matching (baseline - all matches above threshold)
python test_logo_detection.py -n 20 --matching-method simple --threshold 0.70

# Default margin-based matching
python test_logo_detection.py -n 20 --threshold 0.75 --margin 0.05

# Multi-ref matching with margin (recommended for reducing false positives)
python test_logo_detection.py -n 20 --matching-method multi-ref \
    --refs-per-logo 5 --min-matching-refs 2 --threshold 0.70 --margin 0.05

# Multi-ref matching with max similarity (more lenient)
python test_logo_detection.py -n 20 --matching-method multi-ref \
    --refs-per-logo 5 --min-matching-refs 1 --use-max-similarity --margin 0.03

# Reproducible test with seed
python test_logo_detection.py -n 50 --seed 42 --clear-cache