# Logo Detection Test Framework
A testing framework for evaluating logo detection accuracy using DETR (DEtection TRansformer) and CLIP (Contrastive Language-Image Pre-training) models.
## Burnley Test: Averaged Embeddings with DINOv2
A targeted test using `DetectLogosEmbeddings` to detect two specific logos (barnfield and vertu) in 516 Burnley match images. Reference embeddings are averaged across all images in each reference directory, and matching uses margin-based comparison (margin=0.05).
**Test command:**
```bash
uv run python test_burnley_detection.py -e dinov2 -t 0.7 --margin 0.05 --output-file results_average_embeddings.txt
```
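In outline, the averaged-embedding and margin logic works as follows (a minimal pure-Python sketch assuming unit-normalized embedding vectors; `DetectLogosEmbeddings` internals are not reproduced here and these helper names are illustrative):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def average_reference(embeddings):
    """Average unit-normalized embeddings for one logo, then re-normalize."""
    dim = len(embeddings[0])
    avg = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return normalize(avg)

def match_with_margin(crop, refs, threshold=0.7, margin=0.05):
    """Return the best-matching logo name, or None if it falls below the
    threshold or fails to beat the runner-up by at least `margin`."""
    sims = {name: sum(a * b for a, b in zip(crop, ref))
            for name, ref in refs.items()}
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else -1.0
    return best_name if best >= threshold and best - second >= margin else None
```

A crop that clears the threshold can still be rejected when its two best reference scores are too close, which is how the margin trades recall for precision.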
**Results (DINOv2, threshold 0.70, margin 0.05):**
| Metric | Value |
|--------|-------|
| True Positives | 28 |
| False Positives | 36 |
| False Negatives | 125 |
| Total Expected | 146 |
| **Precision** | **43.8%** |
| **Recall** | **19.2%** |
| **F1 Score** | **26.7%** |
Ground truth is derived from filename prefixes: `vertu_` (vertu logo), `barnfield_` (barnfield logo), `barnfield+vertu_` (both logos). Images without these prefixes are treated as negatives.
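Spelled out, the prefix rule looks like this (an illustrative helper; the test script's actual parsing may differ):

```python
def expected_logos(filename: str) -> set[str]:
    """Derive the ground-truth logo set from a Burnley image filename prefix."""
    if filename.startswith("barnfield+vertu_"):
        return {"barnfield", "vertu"}
    if filename.startswith("barnfield_"):
        return {"barnfield"}
    if filename.startswith("vertu_"):
        return {"vertu"}
    return set()  # no recognized prefix -> negative image
```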
Low recall suggests many logos go undetected by DETR or fall below the similarity threshold. The relatively low precision indicates DINOv2 averaged embeddings struggle to discriminate between the two logos in this domain. Further tuning of thresholds, margin, and embedding model (e.g. CLIP or SigLIP) may improve results.
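For reference, the aggregate metrics above follow from the raw counts, with recall computed against the total expected logo count (a sketch of the arithmetic only, not the test script itself):

```python
def summarize(tp: int, fp: int, total_expected: int):
    """Precision, recall, and F1 from raw counts, as rounded percentages."""
    precision = tp / (tp + fp)          # how many detections were correct
    recall = tp / total_expected        # how many expected logos were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return tuple(round(100 * m, 1) for m in (precision, recall, f1))
```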
---
## Recommended Settings
Based on extensive testing with the LogoDet-3K dataset, these are the optimal settings:
| Parameter | Recommended Value | Notes |
|-----------|-------------------|-------|
| **Matching Method** | `multi-ref` | Best balance of precision and recall |
| **Similarity Aggregation** | `max` (default) | Max outperforms mean aggregation |
| **Embedding Model** | `openai/clip-vit-large-patch14` | Significantly outperforms DINOv2 |
| **CLIP Threshold** | `0.70` | Good precision/recall balance |
| **DETR Threshold** | `0.50` | Default detection confidence |
| **Margin** | `0.05` | Reduces false positives |
| **Refs per Logo** | `7-10` | More references = better accuracy |
| **Preprocessing** | `default` | Best precision; letterbox/stretch hurt precision |
**Example command with recommended settings:**
```bash
uv run python test_logo_detection.py \
  --matching-method multi-ref \
  --refs-per-logo 10 \
  --threshold 0.70 \
  --margin 0.05 \
  --use-max-similarity
```
### Performance Benchmarks
With recommended settings (multi-ref max, threshold 0.70, margin 0.05):
| Refs/Logo | Precision | Recall | F1 Score |
|-----------|-----------|--------|----------|
| 1 | 45.8% | 65.9% | 54.0% |
| 3 | 40.5% | 72.4% | 51.9% |
| 5 | 47.2% | 72.6% | 57.2% |
| 7 | **51.0%** | 79.9% | **62.3%** |
| 10 | 50.2% | **81.6%** | 62.1% |
**Key findings:**
- More reference images per logo consistently improves recall
- 7+ refs provides the best precision/recall balance
- Diminishing returns beyond 10 refs
### Matching Method Comparison
| Method | Precision | Recall | F1 | Use Case |
|--------|-----------|--------|-----|----------|
| `simple` | 1.3% | 203%* | 2.5% | Not recommended (too many FPs) |
| `margin` | 69.8% | 16.3% | 26.4% | High precision, low recall |
| `multi-ref` (mean) | 51.8% | 63.1% | 56.9% | Balanced |
| `multi-ref` (max) | **51.8%** | **75.3%** | **61.4%** | **Best overall** |
*Simple method returns all matches above threshold, causing many duplicates.
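The two multi-ref rows differ only in how per-reference similarities collapse into one score per logo (an illustrative sketch of the aggregation selected by `--use-max-similarity`; similarities are assumed precomputed):

```python
def aggregate(per_ref_sims: list[float], use_max: bool) -> float:
    """Collapse similarities against one logo's references into a single
    score: max rewards one strong match, mean requires broad support."""
    return max(per_ref_sims) if use_max else sum(per_ref_sims) / len(per_ref_sims)

def best_logo(crop_sims: dict, use_max: bool, threshold: float):
    """Score each candidate logo by aggregating its per-reference
    similarities, then keep the winner only if it clears the threshold."""
    scores = {name: aggregate(sims, use_max) for name, sims in crop_sims.items()}
    name, score = max(scores.items(), key=lambda kv: kv[1])
    return name if score >= threshold else None
```

This illustrates why max aggregation lifts recall: one reference image that closely matches the crop is enough, whereas mean aggregation dilutes a strong match with weaker references.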
### Embedding Model Comparison
| Model | Precision | Recall | F1 | Recommendation |
|-------|-----------|--------|-----|----------------|
| `openai/clip-vit-large-patch14` | **49.1%** | **77.0%** | **59.9%** | **Recommended** |
| `facebook/dinov2-small` | 22.4% | 42.8% | 29.5% | Not recommended |
| `facebook/dinov2-large` | 32.2% | 28.5% | 30.2% | Not recommended |
CLIP significantly outperforms DINOv2 for logo matching tasks.
### Preprocessing Mode Comparison
| Mode | Precision | Recall | F1 | Notes |
|------|-----------|--------|-----|-------|
| `default` | **50.2%** | 81.6% | 62.1% | **Recommended** - best precision |
| `letterbox` | 42.4% | 119%* | 62.6% | Higher recall but worse precision |
| `stretch` | 34.5% | 113%* | 52.9% | Not recommended |
*Recall >100% indicates multiple detections per expected logo.
**Recommendation:** Use `default` preprocessing. While letterbox shows marginally higher F1, it has significantly worse precision (more false positives).
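Conceptually, the modes differ in how a detected crop is mapped to the embedding model's square input (illustrative geometry only; the framework's exact resizing code is not shown here):

```python
def letterbox_geometry(w: int, h: int, size: int):
    """Scale a w x h crop to fit inside a size x size square while
    preserving aspect ratio, returning (new_w, new_h, pad_x, pad_y).
    `stretch` would instead resize directly to (size, size), distorting
    the logo; `default` crops are passed through without padding."""
    scale = size / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    return new_w, new_h, (size - new_w) // 2, (size - new_h) // 2
```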
---
## Overview
This project provides tools to:
- Detect logos in images using a fine-tuned DETR model
- Match detected logos against reference images using CLIP embeddings
- Evaluate detection accuracy with precision, recall, and F1 metrics
## Architecture
The system uses a two-stage pipeline:
1. **DETR** - Identifies potential logo regions (bounding boxes) in images
2. **CLIP** - Extracts feature embeddings for each detected region and compares against reference logos
## Installation
Requires Python 3.12+. Uses [uv](https://github.com/astral-sh/uv) for package management.
```bash
# Install dependencies
uv sync
# Or using pip
pip install -r requirements.txt
```
## Usage
### Prepare Test Data
The test framework requires the **LogoDet-3K** dataset. Download it and place it in the project directory:
```
logo_test/
├── LogoDet-3K/          # Dataset directory (required)
│   ├── Clothes/         # Category directories
│   │   ├── Adidas/      # Brand directories with images + XML annotations
│   │   ├── Nike/
│   │   └── ...
│   ├── Electronic/
│   ├── Food/
│   └── ...
```
The dataset should contain images with corresponding Pascal VOC format XML annotation files that define logo bounding boxes.
Then run the preparation script:
```bash
uv run python prepare_test_data.py
```
This script:
1. Scans `LogoDet-3K/` for images and XML annotation files
2. Extracts cropped logo regions using bounding box data → saves to `reference_logos/`
3. Copies full images → saves to `test_images/`
4. Creates `test_data_mapping.db` SQLite database with ground truth mappings
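Step 2 relies on the Pascal VOC annotations; extracting the crop boxes can be sketched as follows (a generic VOC parser, not necessarily `prepare_test_data.py`'s exact code):

```python
import xml.etree.ElementTree as ET

def voc_boxes(xml_text: str):
    """Extract (logo name, (xmin, ymin, xmax, ymax)) pairs from a
    Pascal VOC annotation, as used to crop regions into reference_logos/."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        box = tuple(int(bb.findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, box))
    return boxes
```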
### Run Detection Tests
```bash
# Basic test with default settings (margin-based matching)
uv run python test_logo_detection.py
# Test with more logos and custom threshold
uv run python test_logo_detection.py -n 20 --threshold 0.75
# Use multi-ref matching method
uv run python test_logo_detection.py --matching-method multi-ref \
  --refs-per-logo 5 --min-matching-refs 2
# Reproducible test with seed
uv run python test_logo_detection.py -n 50 --seed 42
```
### Key Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `-n, --num-logos` | 10 | Number of reference logos to sample |
| `-t, --threshold` | 0.7 | Similarity threshold for matching |
| `-d, --detr-threshold` | 0.5 | DETR detection confidence threshold |
| `-e, --embedding-model` | openai/clip-vit-large-patch14 | Embedding model (CLIP or DINOv2) |
| `--matching-method` | margin | Matching method: `simple`, `margin`, or `multi-ref` |
| `--margin` | 0.05 | Margin over second-best match (margin/multi-ref) |
| `--refs-per-logo` | 3 | Reference images per logo |
| `--min-matching-refs` | 1 | Min refs that must match (multi-ref only) |
| `--use-max-similarity` | False | Use max instead of mean similarity (multi-ref only) |
| `--positive-samples` | 5 | Positive test images per logo |
| `--negative-samples` | 20 | Negative test images per logo |
| `-s, --seed` | None | Random seed for reproducibility |
| `--output-file` | None | Append results summary to file (clean output) |
| `--clear-cache` | False | Clear embedding cache before running |
**Matching Methods:**
- `simple` - Returns all logos above threshold (not recommended - too many false positives)
- `margin` - Requires margin over second-best match (high precision, low recall)
- `multi-ref` - **Recommended.** Aggregates scores across multiple reference images per logo
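The `--min-matching-refs` gate from the table above can be sketched as a simple count over per-reference similarities (illustrative only, assuming similarities are precomputed):

```python
def multi_ref_accept(per_ref_sims: list[float], threshold: float,
                     min_matching_refs: int) -> bool:
    """Accept a candidate logo only if at least `min_matching_refs` of its
    reference images individually clear the similarity threshold."""
    return sum(s >= threshold for s in per_ref_sims) >= min_matching_refs
```

Requiring agreement from more than one reference suppresses spurious single-reference matches at a modest cost in recall.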
See `--help` for all options.
### Run Comparison Tests
```bash
# Compare all matching methods
./run_comparison_tests.sh
# Test various threshold/margin combinations
./run_threshold_tests.sh
# Compare embedding models (CLIP vs DINOv2)
./run_model_comparison.sh
# Test different refs-per-logo values
./run_refs_per_logo_test.sh
```
| Script | Purpose | Output File |
|--------|---------|-------------|
| `run_comparison_tests.sh` | Compare matching methods | `test_results/comparison_*.txt` |
| `run_threshold_tests.sh` | Test threshold/margin combinations | `test_results/threshold_*.txt` |
| `run_model_comparison.sh` | Compare CLIP vs DINOv2 models | `test_results/model_comparison_results.txt` |
| `run_refs_per_logo_test.sh` | Test refs-per-logo values | `test_results/refs_per_logo_analysis.txt` |
| `run_preprocess_test.sh` | Compare preprocessing modes | `test_results/preprocessing_comparison.txt` |
## Project Structure
```
logo_test/
├── logo_detection_detr.py             # Core detection library (DetectLogosDETR class)
├── test_logo_detection.py             # Test script for accuracy evaluation
├── prepare_test_data.py               # Script to prepare test database
├── run_comparison_tests.sh            # Compare all matching methods
├── run_threshold_tests.sh             # Test threshold/margin combinations
├── run_model_comparison.sh            # Compare CLIP vs DINOv2 models
├── test_data_mapping.db               # SQLite database with ground truth
├── reference_logos/                   # Reference logo images (not in git)
├── test_images/                       # Test images (not in git)
├── LogoDet-3K/                        # Source dataset (not in git)
├── logo_detection_detr_usage.md       # API usage guide
├── logo_detection_test_methodology.md # Test methodology documentation
└── test_results_analysis.md           # Analysis of test results
```
## Accuracy Improvement Techniques
The framework implements several techniques to improve detection accuracy:
1. **Non-Maximum Suppression (NMS)** - Removes overlapping duplicate detections
2. **Minimum Box Size Filtering** - Filters out noise from tiny detections
3. **Confidence Threshold Filtering** - Removes low-confidence detections
4. **Multiple Reference Images** - Uses multiple refs per logo for robust matching
5. **Margin-Based Matching** - Requires confidence margin over second-best match
6. **Multi-Ref Matching** - Aggregates similarity scores across references
7. **Embedding Caching** - Caches embeddings to avoid recomputation
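The NMS step (technique 1) can be sketched as a standard greedy pass over detections (a generic implementation, not necessarily the framework's exact code):

```python
def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keeping each one only
    if it does not overlap an already-kept box above the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of surviving detections
```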
## Models
### Detection Model
- **DETR**: `Pravallika6/detr-finetuned-logo-detection_v2`
### Embedding Models (selectable via `-e/--embedding-model`)
| Model | Type | Description |
|-------|------|-------------|
| `openai/clip-vit-large-patch14` | CLIP | Default. General-purpose vision-language model |
| `openai/clip-vit-base-patch32` | CLIP | Smaller, faster CLIP variant |
| `facebook/dinov2-small` | DINOv2 | Self-supervised, good for visual similarity |
| `facebook/dinov2-base` | DINOv2 | Larger DINOv2 variant |
| `facebook/dinov2-large` | DINOv2 | Largest DINOv2 variant |
Models are automatically downloaded from HuggingFace on first run and cached in `~/.cache/huggingface/`.
**Note**: When switching between embedding models, use `--clear-cache` to ensure embeddings are recomputed with the new model.
## Documentation
- [API Usage Guide](logo_detection_detr_usage.md) - How to use the DetectLogosDETR class
- [Test Methodology](logo_detection_test_methodology.md) - Detailed explanation of test framework and tuning
## License
MIT