
# llama-swap Setup Guide for Jersey Detection Testing

This guide explains how to use llama-swap to automatically switch between different vision-language models when testing jersey detection.

## What is llama-swap?

llama-swap is a model-swapping proxy that sits between your application and llama.cpp servers. It automatically loads and unloads models based on the `model` parameter in API requests, allowing you to test multiple models without manually restarting servers.
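
As a concrete illustration, a single OpenAI-style request is all it takes to trigger a swap. This is a minimal sketch assuming llama-swap is already listening on port 8080 (see "Starting llama-swap" below):

```bash
# Requesting a model by name makes llama-swap load it first if necessary.
# The "model" value must match a model key in llama-swap-config.yaml.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-vl-7b", "messages": [{"role": "user", "content": "Hello"}]}'
```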

## Installation

### Docker

```bash
# Pull the CUDA image (or the cpu, vulkan, or intel variant, depending on your hardware)
docker pull ghcr.io/mostlygeek/llama-swap:cuda
```

### Homebrew (macOS/Linux)

```bash
brew tap mostlygeek/llama-swap
brew install llama-swap
```

### Pre-built Binaries

Download a binary for your platform from the project's GitHub releases page.

## Configuration

A configuration file, `llama-swap-config.yaml`, is provided with 8 pre-configured vision models:

### Small Models (1-4B parameters)

- `lfm2-vl-1.6b` - LiquidAI LFM2-VL 1.6B (F16)
- `gemma-3-4b` - Gemma 3 4B Instruct (F16)
- `kimi-vl-3b` - Kimi VL A3B Thinking (F16)

### Medium Models (7-12B parameters)

- `qwen2.5-vl-7b` - Qwen2.5-VL 7B Instruct (F16)
- `gemma-3-12b` - Gemma 3 12B Instruct (F16)

### Large Models (24-27B parameters)

- `mistral-small-24b-q8` - Mistral Small 3.2 24B (Q8_K_XL)
- `mistral-small-24b-q4` - Mistral Small 3.2 24B (Q4_K_XL)
- `gemma-3-27b` - Gemma 3 27B Instruct (Q8_0)
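
Once the proxy is running (next section), you can confirm which tags it picked up from the config; llama-swap serves the standard OpenAI model-listing endpoint:

```bash
# Lists every model key defined in llama-swap-config.yaml
curl http://localhost:8080/v1/models
```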

## Starting llama-swap

### Using Docker

```bash
docker run -it --rm --runtime nvidia -p 8080:8080 \
  -v $(pwd)/llama-swap-config.yaml:/app/config.yaml \
  -v /path/to/hf/cache:/root/.cache/huggingface \
  ghcr.io/mostlygeek/llama-swap:cuda
```

### Using Binary

```bash
llama-swap --config llama-swap-config.yaml --listen localhost:8080
```
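
Either way, a quick health check (the same endpoint used under Troubleshooting below) confirms the proxy is reachable before you start testing:

```bash
# Returns a successful response once llama-swap is listening on port 8080
curl http://localhost:8080/health
```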

## Testing with Jersey Detection Script

Once llama-swap is running, you can test different models by specifying the `--model-tag` parameter:

### Test a Single Model

```bash
# Test Qwen2.5-VL 7B with resizing
python test_jersey_detection.py ./images jersey_prompt.txt \
  --model-tag "qwen2.5-vl-7b" \
  --resize 1024
```

### Test Multiple Models Sequentially

```bash
# Test small models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "lfm2-vl-1.6b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "kimi-vl-3b" --resize 1024

# Test medium models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-12b" --resize 1024

# Test large models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "mistral-small-24b-q4" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-27b" --resize 1024
```
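
These one-off runs can also be wrapped in a simple loop. The sketch below just restates the commands above for all eight configured tags; the provided test_all_models.sh (next section) does the same thing more robustly:

```bash
# Run the same test against each configured model tag in turn
for tag in lfm2-vl-1.6b gemma-3-4b kimi-vl-3b \
           qwen2.5-vl-7b gemma-3-12b \
           mistral-small-24b-q8 mistral-small-24b-q4 gemma-3-27b; do
  python test_jersey_detection.py ./images jersey_prompt.txt \
    --model-tag "$tag" --resize 1024
done
```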

## Automated Testing Scripts

Two bash scripts are provided for automated testing:

### 1. Full Test Suite (`test_all_models.sh`)

Tests all models defined in `llama-swap-config.yaml`:

```bash
# Basic usage (uses defaults)
./test_all_models.sh ./test_images

# Customize configuration with environment variables
RESIZE=2048 ./test_all_models.sh ./test_images
OUTPUT_FILE=custom_results.jsonl ./test_all_models.sh ./test_images
PROMPT_FILE=custom_prompt.txt ./test_all_models.sh ./test_images

# Disable resize
RESIZE= ./test_all_models.sh ./test_images
```

Features:

- Automatically extracts all model tags from the YAML config (approximated in the sketch after this list)
- Color-coded output with progress tracking
- Confirms before starting tests
- Shows summary with success/failure counts
- Asks to continue if a model fails
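
The tag-extraction step can be approximated with standard shell tools. This is a minimal sketch, assuming model keys sit exactly two spaces under a top-level `models:` section (as in the provided config); the actual script may do this differently:

```bash
# Print the model tags from llama-swap-config.yaml (assumes two-space
# indentation for keys directly under "models:"; adjust for other layouts)
sed -n '/^models:/,$p' llama-swap-config.yaml |
  grep -E '^  [A-Za-z0-9._-]+:' |
  tr -d ' :'
```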

Default Configuration:

- Images: `./test_images`
- Prompt: `jersey_prompt_with_confidence.txt`
- Resize: 1024px
- Output: `jersey_detection_results.jsonl`

### 2. Quick Test (`test_quick.sh`)

Tests a small subset of models for rapid iteration:

```bash
# Test default selection (small, medium, large)
./test_quick.sh ./test_images

# Test custom models
MODELS="lfm2-vl-1.6b qwen2.5-vl-7b" ./test_quick.sh ./test_images

# Customize settings
RESIZE=512 MODELS="gemma-3-4b" ./test_quick.sh ./test_images
```

Default Models:

- `lfm2-vl-1.6b` (Small - 1.6B)
- `qwen2.5-vl-7b` (Medium - 7B)
- `mistral-small-24b-q4` (Large - 24B Q4)

Use Cases:

- Quick validation after prompt changes
- Testing configuration adjustments
- Rapid prototyping before a full test run

## Analyzing Results

After testing multiple models, use the analysis script to compare performance:

```bash
python analyze_jersey_results.py
```

This will show:

- Comparison table of all models tested
- Performance charts with hallucination rates
- Best performers by speed and accuracy
- Confidence distribution (if applicable)
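
For a quick look at the raw output without the analysis script, a jq one-liner works too. Note that the `model` field name below is an assumption about the JSONL schema, not something confirmed by this guide:

```bash
# Count result records per model in the JSONL output
# ("model" is an assumed field name; adjust to the actual schema
# produced by test_jersey_detection.py)
jq -s 'group_by(.model) | map({model: .[0].model, records: length})' \
  jersey_detection_results.jsonl
```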

## Model Swapping Behavior

llama-swap will:

1. Automatically load the requested model when you specify `--model-tag`
2. Automatically unload the previous model when a request names a different one
3. Keep the current model loaded if you test the same model multiple times
4. Let you monitor model loading and unloading in the web UI at http://localhost:8080/ui
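
If you prefer the command line to the web UI for step 4, llama-swap also reports its state over plain HTTP. The endpoint path below comes from llama-swap's upstream documentation, so treat it as an assumption if your version differs:

```bash
# Ask the proxy which model processes (if any) are currently running
curl http://localhost:8080/running
```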

## Optional: Model Auto-Unloading

To automatically unload models after 5 minutes of inactivity, uncomment this line in `llama-swap-config.yaml`:

```yaml
ttl: 300
```

## Optional: Preload Model on Startup

To preload a specific model when llama-swap starts, uncomment and modify this section:

```yaml
hooks:
  onStartup:
    - loadModel: qwen2.5-vl-7b
```

## Customizing Models

To add or modify models, edit `llama-swap-config.yaml`:

```yaml
models:
  my-custom-model:
    name: "My Custom Model Description"
    cmd: llama-server --no-mmap -ngl 999 -fa on --host 0.0.0.0 --port ${PORT} -hf user/model-name:quantization
```

Then test with:

```bash
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "my-custom-model"
```

## Troubleshooting

### Model not loading

- Check llama-swap logs at http://localhost:8080/log or via `curl http://localhost:8080/log/stream`
- Verify the model name in the config matches the `--model-tag` parameter
- Ensure sufficient GPU memory for the model

### Connection refused

- Verify llama-swap is running: `curl http://localhost:8080/health`
- Check that the server URL in `scan.ini` matches where llama-swap is listening; the default there is http://192.168.1.126:8080

### Slow model switching

- The first load downloads the model from Hugging Face, which can be slow
- Subsequent loads are faster (the model is cached locally)
- Use quantized models (Q4, Q8) for faster loading and lower memory usage

## Web UI

llama-swap includes a web interface for monitoring:

- Dashboard: http://localhost:8080/ui - View loaded models and logs
- Activity: See recent API requests
- Logs: Real-time log monitoring
