diff --git a/README.md b/README.md
index 7add27b..be77d1e 100644
--- a/README.md
+++ b/README.md
@@ -82,8 +82,9 @@ uv run python test_logo_detection.py -n 50 --seed 42
 | Parameter | Default | Description |
 |-----------|---------|-------------|
 | `-n, --num-logos` | 10 | Number of reference logos to sample |
-| `-t, --threshold` | 0.7 | CLIP similarity threshold |
+| `-t, --threshold` | 0.7 | Similarity threshold for matching |
 | `-d, --detr-threshold` | 0.5 | DETR detection confidence threshold |
+| `-e, --embedding-model` | openai/clip-vit-large-patch14 | Embedding model (CLIP or DINOv2) |
 | `--matching-method` | margin | Matching method: `simple`, `margin`, or `multi-ref` |
 | `--margin` | 0.05 | Margin over second-best match (margin/multi-ref) |
 | `--refs-per-logo` | 3 | Reference images per logo |
@@ -93,6 +94,7 @@ uv run python test_logo_detection.py -n 50 --seed 42
 | `--negative-samples` | 20 | Negative test images per logo |
 | `-s, --seed` | None | Random seed for reproducibility |
 | `--output-file` | None | Append results summary to file (clean output) |
+| `--clear-cache` | False | Clear embedding cache before running |
 
 **Matching Methods:**
 - `simple` - Returns all logos above threshold (baseline, most permissive)
@@ -103,13 +105,22 @@ See `--help` for all options.
 
 ### Run Comparison Tests
 
-To compare all matching methods with consistent parameters:
-
 ```bash
+# Compare all matching methods
 ./run_comparison_tests.sh
+
+# Test various threshold/margin combinations
+./run_threshold_tests.sh
+
+# Compare embedding models (CLIP vs DINOv2)
+./run_model_comparison.sh
 ```
-This runs all four matching configurations (simple, margin, multi-ref mean, multi-ref max) and saves clean results to `comparison_results.txt`.
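+To run a single configuration by hand instead of via the scripts, `test_logo_detection.py` can be invoked directly. The flag values below are illustrative only; all flags are documented in the parameter table above:
+
+```bash
+# Evaluate 20 logos with DINOv2 embeddings, recomputing the embedding cache
+uv run python test_logo_detection.py -n 20 -e facebook/dinov2-base --clear-cache --seed 42
+```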
+| Script | Purpose | Output File |
+|--------|---------|-------------|
+| `run_comparison_tests.sh` | Compare all 4 matching methods | `comparison_results.txt` |
+| `run_threshold_tests.sh` | Test threshold/margin combinations | `threshold_test_results.txt` |
+| `run_model_comparison.sh` | Compare CLIP vs DINOv2 models | `model_comparison_results.txt` |
 
 ## Project Structure
 
@@ -118,13 +129,16 @@ logo_test/
 ├── logo_detection_detr.py          # Core detection library (DetectLogosDETR class)
 ├── test_logo_detection.py          # Test script for accuracy evaluation
 ├── prepare_test_data.py            # Script to prepare test database
-├── run_comparison_tests.sh         # Script to run all matching methods
+├── run_comparison_tests.sh         # Compare all matching methods
+├── run_threshold_tests.sh          # Test threshold/margin combinations
+├── run_model_comparison.sh         # Compare CLIP vs DINOv2 models
 ├── test_data_mapping.db            # SQLite database with ground truth
 ├── reference_logos/                # Reference logo images (not in git)
 ├── test_images/                    # Test images (not in git)
 ├── LogoDet-3K/                     # Source dataset (not in git)
 ├── logo_detection_detr_usage.md    # API usage guide
-└── logo_detection_test_methodology.md  # Test methodology documentation
+├── logo_detection_test_methodology.md  # Test methodology documentation
+└── test_results_analysis.md        # Analysis of test results
 ```
 
 ## Accuracy Improvement Techniques
@@ -141,12 +155,23 @@ The framework implements several techniques to improve detection accuracy:
 
 ## Models
 
-The framework uses:
+### Detection Model
+
 - **DETR**: `Pravallika6/detr-finetuned-logo-detection_v2`
-- **CLIP**: `openai/clip-vit-large-patch14`
+
+### Embedding Models (selectable via `-e/--embedding-model`)
+
+| Model | Type | Description |
+|-------|------|-------------|
+| `openai/clip-vit-large-patch14` | CLIP | Default. General-purpose vision-language model |
+| `openai/clip-vit-base-patch32` | CLIP | Smaller, faster CLIP variant |
+| `facebook/dinov2-small` | DINOv2 | Self-supervised, good for visual similarity |
+| `facebook/dinov2-base` | DINOv2 | Larger DINOv2 variant |
+| `facebook/dinov2-large` | DINOv2 | Largest DINOv2 variant |
 
 Models are automatically downloaded from HuggingFace on first run and cached in `~/.cache/huggingface/`.
 
+**Note**: When switching between embedding models, use `--clear-cache` to ensure embeddings are recomputed with the new model.
+
 ## Documentation
 
 - [API Usage Guide](logo_detection_detr_usage.md) - How to use the DetectLogosDETR class
diff --git a/test_results_analysis.md b/test_results_analysis.md
new file mode 100644
index 0000000..b97b828
--- /dev/null
+++ b/test_results_analysis.md
@@ -0,0 +1,144 @@
+# Logo Detection Test Results Analysis
+
+This document provides analysis of logo detection test results across different matching methods and configurations.
+
+---
+
+## Test Run: CLIP Defaults with All Matching Methods
+
+**Date**: 2025-12-31
+**Embedding Model**: openai/clip-vit-large-patch14 (default)
+
+### Test Configuration
+
+| Parameter | Value |
+|-----------|-------|
+| Reference logos | 20 |
+| Refs per logo | 10 |
+| Total reference embeddings | 189 |
+| Positive samples per logo | 20 |
+| Negative samples per logo | 100 |
+| Test images processed | ~2,350 |
+| Similarity threshold | 0.70 |
+| DETR threshold | 0.50 |
+| Margin | 0.05 |
+| Min matching refs | 3 |
+| Random seed | 42 |
+
+### Results Summary
+
+| Method | TP | FP | FN | Precision | Recall | F1 |
+|--------|---:|---:|---:|----------:|-------:|---:|
+| Simple | 751 | 58,221 | 9 | 1.3% | 203.5%* | 2.5% |
+| Margin | 60 | 26 | 310 | 69.8% | 16.3% | 26.4% |
+| Multi-ref (mean) | 233 | 217 | 170 | 51.8% | 63.1% | 56.9% |
+| Multi-ref (max) | 278 | 259 | 136 | 51.8% | 75.3% | 61.4% |
+
+*Recall is computed as TP over the 369 expected logo instances, so values above 100% indicate multiple true positive detections per expected logo (multiple detected regions matching the same logo).
+
+### Analysis by Method
+
+#### Simple Matching
+
+The simple method returns ALL logos above the similarity threshold without any rejection logic. This serves as a baseline to understand the raw discriminative power of CLIP embeddings.
+
+**Observations**:
+- 58,221 false positives vs 751 true positives (~78:1 ratio)
+- At threshold 0.70, CLIP embeddings are not discriminative enough to distinguish between different logos
+- The extremely high false positive count indicates that unrelated logo regions frequently produce similarity scores above 0.70
+- This method is unsuitable for production use but valuable for understanding the embedding space
+
+#### Margin-Based Matching
+
+The margin method requires the best match to exceed the second-best by a minimum margin (0.05), rejecting ambiguous matches.
+
+**Observations**:
+- Highest precision (69.8%) but very low recall (16.3%)
+- Only 60 true positives out of 369 expected
+- The margin requirement is too strict when using multiple references per logo
+- With 10 refs per logo, references from the SAME logo compete with each other
+  - Example: If Logo A has refs scoring 0.85 and 0.84, the margin is only 0.01, causing rejection
+- This explains why margin matching produces fewer matches than multi-ref methods
+
+#### Multi-Ref Matching (Mean Similarity)
+
+Uses the average similarity across all reference images for each logo.
+
+**Observations**:
+- Balanced precision (51.8%) and recall (63.1%)
+- F1 score of 56.9%
+- False positive ratio approximately 1:1 with true positives (217 FP vs 233 TP)
+- Mean aggregation penalizes logos where some references don't match well
+- More conservative than max aggregation
+
+#### Multi-Ref Matching (Max Similarity)
+
+Uses the highest similarity score from any single reference image.
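+The difference between the two aggregation strategies can be sketched as follows (illustrative NumPy code, not the project's actual implementation):
+
+```python
+import numpy as np
+
+# Cosine similarities between one detected region and each reference
+# embedding of a single logo (hypothetical values, e.g. 4 of the 10 refs)
+sims = np.array([0.72, 0.68, 0.75, 0.55])
+
+mean_score = sims.mean()  # 0.675 - dragged down by the weak 0.55 reference
+max_score = sims.max()    # 0.75  - driven by the single best reference
+
+threshold = 0.70
+print(mean_score >= threshold)  # False: mean aggregation rejects this match
+print(max_score >= threshold)   # True: max aggregation accepts it
+```
+
+This illustrates why max aggregation recovers matches that mean aggregation rejects when some references fit poorly.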
+
+**Observations**:
+- Best F1 score (61.4%) and recall (75.3%)
+- Same precision as mean method (51.8%)
+- 278 true positives vs 259 false positives (still approximately 1:1)
+- Max aggregation is more lenient, improving recall at no precision cost
+- Better suited when reference images capture different logo variants
+
+### Key Findings
+
+#### 1. CLIP Embedding Similarity Distribution
+
+The simple matching results reveal a fundamental issue: at threshold 0.70, the CLIP embedding space does not provide sufficient separation between different logos. The 78:1 false positive to true positive ratio indicates that:
+
+- Many unrelated images produce high cosine similarity scores
+- The threshold would need to be significantly higher (0.85+) to reduce false positives
+- Even then, recall would likely suffer
+
+#### 2. Margin Method Limitation with Multiple References
+
+The margin-based matching method was designed assuming one reference per logo. When using multiple references (10 per logo in this test), references from the same logo compete against each other in the margin calculation. This causes legitimate matches to be rejected when two references from the same logo have similar scores.
+
+#### 3. False Positive Rate Remains High
+
+Even the best-performing method (multi-ref max) produces nearly as many false positives as true positives:
+- 278 correct matches
+- 259 incorrect matches
+- This 1:1 ratio is problematic for production use cases
+
+#### 4. Trade-off Between Precision and Recall
+
+| Goal | Best Method | Trade-off |
+|------|-------------|-----------|
+| Maximize precision | Margin | Very low recall (16.3%) |
+| Maximize recall | Multi-ref (max) | Lower precision (51.8%) |
+| Balance both | Multi-ref (max) | Best F1 but still ~50% precision |
+
+### Deficiencies of This Approach
+
+#### CLIP Model Limitations
+
+1. **General-Purpose Training**: CLIP was trained on text-image pairs for general visual understanding, not for fine-grained logo discrimination. Logo matching requires distinguishing between visually similar brand marks, which CLIP's training objective doesn't optimize for.
+
+2. **Embedding Space Density**: The cosine similarity scores cluster in a narrow range (0.6-0.9 for most images), making threshold-based discrimination difficult. Small differences in embedding similarity don't reliably indicate visual differences.
+
+3. **Scale and Context Sensitivity**: CLIP embeddings are affected by the context around detected regions. A logo on a busy background may produce different embeddings than the same logo on a clean background.
+
+4. **No Logo-Specific Features**: CLIP doesn't learn features specific to logo recognition such as:
+   - Typography and font shapes
+   - Brand-specific color combinations
+   - Geometric patterns and symmetry
+   - Edge and contour characteristics
+
+#### Detection Pipeline Issues
+
+1. **DETR Detection Quality**: The pipeline assumes DETR correctly identifies logo regions. Detection errors (missed logos, partial detections, non-logo regions) propagate to the matching stage.
+
+2. **Cropping Artifacts**: Detected regions are cropped and resized before embedding extraction. This may introduce artifacts that affect embedding quality.
+
+3. **Threshold Sensitivity**: The entire system is highly sensitive to the similarity threshold parameter. A 0.05 change in threshold can dramatically alter precision/recall balance.
+
+---
+
+## Test Run: [Next Test Name]
+
+*Results pending...*
+
+---
\ No newline at end of file