# llama-swap Setup Guide for Jersey Detection Testing

This guide explains how to use [llama-swap](https://github.com/mostlygeek/llama-swap) to automatically switch between different vision language models when testing jersey detection.

## What is llama-swap?

llama-swap is a model-swapping proxy that sits between your application and llama.cpp servers. It automatically loads and unloads models based on the `model` parameter in API requests, allowing you to test multiple models without manually restarting servers.

## Installation

### Docker (Recommended)

```bash
# Pull the CUDA image (or cpu, vulkan, intel depending on your hardware)
docker pull ghcr.io/mostlygeek/llama-swap:cuda
```

### Homebrew (macOS/Linux)

```bash
brew tap mostlygeek/llama-swap
brew install llama-swap
```

### Pre-built Binaries

Download from the [releases page](https://github.com/mostlygeek/llama-swap/releases).

## Configuration

A configuration file `llama-swap-config.yaml` is provided with 8 pre-configured vision models:

### Small Models (1-4B parameters)

- `lfm2-vl-1.6b` - LiquidAI LFM2-VL 1.6B (F16)
- `gemma-3-4b` - Gemma 3 4B Instruct (F16)
- `kimi-vl-3b` - Kimi VL A3B Thinking (F16)

### Medium Models (7-12B parameters)

- `qwen2.5-vl-7b` - Qwen2.5-VL 7B Instruct (F16)
- `gemma-3-12b` - Gemma 3 12B Instruct (F16)

### Large Models (24-27B parameters)

- `mistral-small-24b-q8` - Mistral Small 3.2 24B (Q8_K_XL)
- `mistral-small-24b-q4` - Mistral Small 3.2 24B (Q4_K_XL)
- `gemma-3-27b` - Gemma 3 27B Instruct (Q8_0)

## Starting llama-swap

### Using Docker

```bash
docker run -it --rm --runtime nvidia -p 8080:8080 \
  -v $(pwd)/llama-swap-config.yaml:/app/config.yaml \
  -v /path/to/hf/cache:/root/.cache/huggingface \
  ghcr.io/mostlygeek/llama-swap:cuda
```

### Using Binary

```bash
llama-swap --config llama-swap-config.yaml --listen localhost:8080
```
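Once the proxy is up, a quick sanity check saves debugging time later. A minimal sketch, assuming the default `localhost:8080` address from above: `/health` is the same endpoint used in the troubleshooting section below, and `/v1/models` is the OpenAI-compatible model listing, which llama-swap populates from the config file:

```bash
# Should respond once llama-swap is listening
curl http://localhost:8080/health

# List the model tags defined in llama-swap-config.yaml;
# these are the values passed to --model-tag below
curl http://localhost:8080/v1/models
```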
## Testing with Jersey Detection Script

Once llama-swap is running, you can test different models by specifying the `--model-tag` parameter:

### Test a Single Model

```bash
# Test Qwen2.5-VL 7B with resizing
python test_jersey_detection.py ./images jersey_prompt.txt \
  --model-tag "qwen2.5-vl-7b" \
  --resize 1024
```

### Test Multiple Models Sequentially

```bash
# Test small models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "lfm2-vl-1.6b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-4b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "kimi-vl-3b" --resize 1024

# Test medium models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "qwen2.5-vl-7b" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-12b" --resize 1024

# Test large models
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "mistral-small-24b-q4" --resize 1024
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "gemma-3-27b" --resize 1024
```

### Automated Testing Scripts

Two bash scripts are provided for automated testing:

#### 1. Full Test Suite (`test_all_models.sh`)

Tests **all models** defined in `llama-swap-config.yaml`:

```bash
# Basic usage (uses defaults)
./test_all_models.sh ./test_images

# Customize configuration with environment variables
RESIZE=2048 ./test_all_models.sh ./test_images
OUTPUT_FILE=custom_results.jsonl ./test_all_models.sh ./test_images
PROMPT_FILE=custom_prompt.txt ./test_all_models.sh ./test_images

# Disable resize
RESIZE= ./test_all_models.sh ./test_images
```

**Features:**

- Automatically extracts all model tags from the YAML config
- Color-coded output with progress tracking
- Asks for confirmation before starting tests
- Shows a summary with success/failure counts
- Asks whether to continue if a model fails

**Default Configuration:**

- Images: `./test_images`
- Prompt: `jersey_prompt_with_confidence.txt`
- Resize: `1024px`
- Output: `jersey_detection_results.jsonl`

#### 2. Quick Test (`test_quick.sh`)

Tests a **small subset** of models for rapid iteration:

```bash
# Test default selection (small, medium, large)
./test_quick.sh ./test_images

# Test custom models
MODELS="lfm2-vl-1.6b qwen2.5-vl-7b" ./test_quick.sh ./test_images

# Customize settings
RESIZE=512 MODELS="gemma-3-4b" ./test_quick.sh ./test_images
```

**Default Models:**

- `lfm2-vl-1.6b` (Small - 1.6B)
- `qwen2.5-vl-7b` (Medium - 7B)
- `mistral-small-24b-q4` (Large - 24B Q4)

**Use Cases:**

- Quick validation after prompt changes
- Testing configuration adjustments
- Rapid prototyping before a full test run

## Analyzing Results

After testing multiple models, use the analysis script to compare performance:

```bash
python analyze_jersey_results.py
```

This will show:

- A comparison table of all models tested
- Performance charts with hallucination rates
- Best performers by speed and accuracy
- Confidence distribution (if applicable)

## Model Swapping Behavior

llama-swap will:

1. **Automatically load** the requested model when you specify `--model-tag`
2. **Automatically unload** the previous model (if different from the current request)
3. **Keep the current model running** if you test the same model multiple times

You can **monitor** model loading and unloading in the web UI at `http://localhost:8080/ui`.
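The swap is keyed entirely off the `model` field of each incoming request; `--model-tag` simply sets that field. A minimal sketch of the behavior above, using the OpenAI-compatible chat endpoint that llama-swap proxies (model tags are from `llama-swap-config.yaml`; the prompt is a placeholder):

```bash
# First request: llama-swap starts a llama-server backend for qwen2.5-vl-7b
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5-vl-7b", "messages": [{"role": "user", "content": "Hello"}]}'

# A request naming a different model stops the previous backend
# and loads gemma-3-4b before this request is served
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-3-4b", "messages": [{"role": "user", "content": "Hello"}]}'
```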
## Optional: Model Auto-Unloading

To automatically unload models after 5 minutes of inactivity, uncomment this line in `llama-swap-config.yaml`:

```yaml
ttl: 300
```

## Optional: Preload Model on Startup

To preload a specific model when llama-swap starts, uncomment and modify this section:

```yaml
hooks:
  onStartup:
    - loadModel: qwen2.5-vl-7b
```

## Customizing Models

To add or modify models, edit `llama-swap-config.yaml`:

```yaml
models:
  my-custom-model:
    name: "My Custom Model Description"
    cmd: llama-server --no-mmap -ngl 999 -fa on --host 0.0.0.0 --port ${PORT} -hf user/model-name:quantization
```

Then test with:

```bash
python test_jersey_detection.py ./images jersey_prompt.txt --model-tag "my-custom-model"
```

## Troubleshooting

### Model not loading

- Check the llama-swap logs at `http://localhost:8080/log` or via `curl http://localhost:8080/log/stream`
- Verify the model name in the config matches the `--model-tag` parameter
- Ensure there is sufficient GPU memory for the model

### Connection refused

- Verify llama-swap is running: `curl http://localhost:8080/health`
- Check the server URL matches: the default is `http://192.168.1.126:8080` (from scan.ini)

### Slow model switching

- The first load downloads models from HuggingFace (can be slow)
- Subsequent loads are faster (cached locally)
- Use quantized models (Q4, Q8) for faster loading and lower memory usage

## Web UI

llama-swap includes a web interface for monitoring:

- **Dashboard**: `http://localhost:8080/ui` - view loaded models and logs
- **Activity**: see recent API requests
- **Logs**: real-time log monitoring

## References

- [llama-swap GitHub](https://github.com/mostlygeek/llama-swap)
- [llama-swap Documentation](https://github.com/mostlygeek/llama-swap/tree/main/docs)
- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)