Test scripts and utilities for evaluating vision-language models on jersey number detection using llama.cpp server.
7.0 KiB
llama-swap Setup Guide for Jersey Detection Testing
This guide explains how to use llama-swap to automatically switch between different vision language models when testing jersey detection.
What is llama-swap?
llama-swap is a model-swapping proxy that sits between your application and llama.cpp servers. It automatically loads and unloads models based on the model parameter in API requests, allowing you to test multiple models without manually restarting servers.
Installation
Docker (Recommended)
# Pull the CUDA image (or cpu, vulkan, intel depending on your hardware)
docker pull ghcr.io/mostlygeek/llama-swap:cuda
Homebrew (macOS/Linux)
brew tap mostlygeek/llama-swap
brew install llama-swap
Pre-built Binaries
Download from the releases page.
Configuration
A configuration file llama-swap-config.yaml is provided with 8 pre-configured vision models:
Small Models (1-4B parameters)
lfm2-vl-1.6b- LiquidAI LFM2-VL 1.6B (F16)gemma-3-4b- Gemma 3 4B Instruct (F16)kimi-vl-3b- Kimi VL A3B Thinking (F16)
Medium Models (7-12B parameters)
qwen2.5-vl-7b- Qwen2.5-VL 7B Instruct (F16)gemma-3-12b- Gemma 3 12B Instruct (F16)
Large Models (24-27B parameters)
mistral-small-24b-q8- Mistral Small 3.2 24B (Q8_K_XL)mistral-small-24b-q4- Mistral Small 3.2 24B (Q4_K_XL)gemma-3-27b- Gemma 3 27B Instruct (Q8_0)
Starting llama-swap
Using Docker
docker run -it --rm --runtime nvidia -p 8080:8080 \
-v $(pwd)/llama-swap-config.yaml:/app/config.yaml \
-v /path/to/hf/cache:/root/.cache/huggingface \
ghcr.io/mostlygeek/llama-swap:cuda
Using Binary
llama-swap --config llama-swap-config.yaml --listen localhost:8080
Testing with Jersey Detection Script
Once llama-swap is running, you can test different models by specifying the --model-tag parameter:
Test a Single Model
# Test Qwen2.5-VL 7B with resizing
python test_jersey_detection.py ./images jersey_prompt.txt \
--model-tag "qwen2.5-vl-7b" \
--resize 1024
Test Multiple Models Sequentially
# Test small models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "lfm2-vl-1.6b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "kimi-vl-3b" --resize 1024
# Test medium models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-12b" --resize 1024
# Test large models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "mistral-small-24b-q4" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-27b" --resize 1024
Automated Testing Scripts
Two bash scripts are provided for automated testing:
1. Full Test Suite (test_all_models.sh)
Tests all models defined in llama-swap-config.yaml:
# Basic usage (uses defaults)
./test_all_models.sh ./test_images
# Customize configuration with environment variables
RESIZE=2048 ./test_all_models.sh ./test_images
OUTPUT_FILE=custom_results.jsonl ./test_all_models.sh ./test_images
PROMPT_FILE=custom_prompt.txt ./test_all_models.sh ./test_images
# Disable resize
RESIZE= ./test_all_models.sh ./test_images
Features:
- Automatically extracts all model tags from YAML config
- Color-coded output with progress tracking
- Confirms before starting tests
- Shows summary with success/failure counts
- Asks to continue if a model fails
Default Configuration:
- Images:
./test_images - Prompt:
jersey_prompt_with_confidence.txt - Resize:
1024px - Output:
jersey_detection_results.jsonl
2. Quick Test (test_quick.sh)
Tests a small subset of models for rapid iteration:
# Test default selection (small, medium, large)
./test_quick.sh ./test_images
# Test custom models
MODELS="lfm2-vl-1.6b qwen2.5-vl-7b" ./test_quick.sh ./test_images
# Customize settings
RESIZE=512 MODELS="gemma-3-4b" ./test_quick.sh ./test_images
Default Models:
lfm2-vl-1.6b(Small - 1.6B)qwen2.5-vl-7b(Medium - 7B)mistral-small-24b-q4(Large - 24B Q4)
Use Cases:
- Quick validation after prompt changes
- Testing configuration adjustments
- Rapid prototyping before full test run
Analyzing Results
After testing multiple models, use the analysis script to compare performance:
python analyze_jersey_results.py
This will show:
- Comparison table of all models tested
- Performance charts with hallucination rates
- Best performers by speed and accuracy
- Confidence distribution (if applicable)
Model Swapping Behavior
llama-swap will:
- Automatically load the requested model when you specify
--model-tag - Automatically unload the previous model (if different from current request)
- Keep running if you test the same model multiple times
- Monitor model loading/unloading in the web UI at
http://localhost:8080/ui
Optional: Model Auto-Unloading
To automatically unload models after 5 minutes of inactivity, uncomment this line in llama-swap-config.yaml:
ttl: 300
Optional: Preload Model on Startup
To preload a specific model when llama-swap starts, uncomment and modify this section:
hooks:
onStartup:
- loadModel: qwen2.5-vl-7b
Customizing Models
To add or modify models, edit llama-swap-config.yaml:
models:
my-custom-model:
name: "My Custom Model Description"
cmd: llama-server --no-mmap -ngl 999 -fa on --host 0.0.0.0 --port ${PORT} -hf user/model-name:quantization
Then test with:
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "my-custom-model"
Troubleshooting
Model not loading
- Check llama-swap logs at
http://localhost:8080/logor viacurl http://localhost:8080/log/stream - Verify the model name in the config matches the
--model-tagparameter - Ensure sufficient GPU memory for the model
Connection refused
- Verify llama-swap is running:
curl http://localhost:8080/health - Check the server URL matches: default is
http://192.168.1.126:8080(from scan.ini)
Slow model switching
- First load downloads models from HuggingFace (can be slow)
- Subsequent loads are faster (cached locally)
- Use quantized models (Q4, Q8) for faster loading and lower memory usage
Web UI
llama-swap includes a web interface for monitoring:
- Dashboard:
http://localhost:8080/ui- View loaded models and logs - Activity: See recent API requests
- Logs: Real-time log monitoring