Deployment Guide¶

This guide covers deploying LandmarkDiff for production or demo use, including Docker, HuggingFace Spaces, REST API setup, and production considerations.

Local Gradio Demo¶

The simplest deployment is the built-in Gradio interface:

pip install -e ".[app]"
python scripts/app.py
# Open http://localhost:7860

This launches a five-tab demo (single procedure, multi-comparison, intensity sweep, face analysis, multi-angle capture) on port 7860. The first run downloads model weights (~6 GB), which are cached for subsequent launches.

To use a different port:

python scripts/app.py --port 8080

Docker Deployment¶

CPU-only deployment¶

For demos that only need TPS (geometric warping) mode:

# Build
docker build -t landmarkdiff:cpu -f Dockerfile.cpu .

# Run
docker run -p 7860:7860 landmarkdiff:cpu

The CPU Dockerfile uses python:3.11-slim, installs CPU-only PyTorch from https://download.pytorch.org/whl/cpu, and runs the Gradio demo in TPS mode. The resulting image is smaller and does not require any GPU drivers.

GPU deployment¶

For ControlNet and diffusion-based inference:

# Build the GPU image (runtime CUDA, smaller footprint)
docker build -t landmarkdiff:gpu -f Dockerfile.gpu .

# Run with GPU passthrough
docker run --gpus all -p 7860:7860 landmarkdiff:gpu

Dockerfile.gpu uses nvidia/cuda:12.1.1-runtime-ubuntu22.04 with Python 3.11. It requires NVIDIA Container Toolkit on the host.

For detailed GPU prerequisites, VRAM requirements by GPU tier, verification steps, and troubleshooting, see the Docker GPU Setup guide.

Docker Compose¶

The docker-compose.yml defines five services:

# CPU demo (default)
docker compose up app

# GPU demo (runtime image, recommended)
docker compose up gpu

# GPU demo (devel image, for compiling extensions)
docker compose up app-gpu

# Build Sphinx docs
docker compose run docs

# Training (requires GPU)
docker compose --profile training run train

Service details:

Service	Dockerfile	GPU	Port	Description
`app`	Dockerfile.cpu	No	7860	TPS-mode Gradio demo
`gpu`	Dockerfile.gpu	Yes (1 GPU)	7861	GPU inference (runtime image)
`app-gpu`	Dockerfile	Yes (1 GPU)	7860	GPU inference (devel image)
`docs`	python:3.11-slim	No	--	Sphinx documentation builder
`train`	Dockerfile	Yes (1 GPU)	--	ControlNet training

Volumes:

All services mount these volumes:

./data:/app/data: training data, test pairs
./checkpoints:/app/checkpoints: model checkpoints
model-cache:/root/.cache: shared HuggingFace model cache

Create the host directories before running:

mkdir -p data checkpoints

Custom Docker configuration¶

To modify the default inference mode, set the LANDMARKDIFF_MODE environment variable:

docker run -e LANDMARKDIFF_MODE=controlnet --gpus all -p 7860:7860 landmarkdiff

To pre-download model weights during build (so the first inference is fast), add to the Dockerfile:

RUN python -c "from diffusers import ControlNetModel; ControlNetModel.from_pretrained('CrucibleAI/ControlNetMediaPipeFace')"

HuggingFace Spaces¶

Setup¶

Create a new Space at huggingface.co/new-space.
Select Gradio as the SDK.
Choose hardware:
CPU Basic (free): works for TPS mode only
T4 Small: minimum for ControlNet inference
A10G Small: recommended for faster inference
Push the repository contents to the Space.

The scripts/app.py Gradio demo is compatible with HuggingFace Spaces out of the box. It auto-detects the environment and sets share=False (Spaces already provides a public URL).

Space configuration¶

Create an app.py at the repository root that imports and launches the demo, or point the Space to scripts/app.py in your README.md metadata:

---
title: LandmarkDiff
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: scripts/app.py
pinned: false
---

Environment variables¶

Set secrets in the Space settings (Settings > Repository Secrets):

HF_TOKEN: if using gated models
LANDMARKDIFF_MODE: set to tps for CPU Spaces

Persistence¶

HuggingFace Spaces provides /data as a persistent volume. Use it for caching model weights:

export HF_HOME=/data/huggingface_cache

REST API Setup¶

FastAPI server¶

For programmatic access without the Gradio UI, wrap the pipeline in a FastAPI server:

"""LandmarkDiff REST API server."""

import io
import logging
from contextlib import asynccontextmanager
from pathlib import Path

import cv2
import numpy as np
from fastapi import FastAPI, File, HTTPException, Query, UploadFile
from fastapi.responses import StreamingResponse

from landmarkdiff.inference import LandmarkDiffPipeline

logger = logging.getLogger(__name__)

# Pipeline singleton: loaded once at startup
_pipeline = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the pipeline on startup, clean up on shutdown."""
    global _pipeline
    logger.info("Loading LandmarkDiff pipeline...")
    _pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
    _pipeline.load()
    logger.info("Pipeline ready.")
    yield
    _pipeline = None


app = FastAPI(
    title="LandmarkDiff API",
    version="0.2.0",
    lifespan=lifespan,
)

VALID_PROCEDURES = [
    "rhinoplasty",
    "blepharoplasty",
    "rhytidectomy",
    "orthognathic",
    "brow_lift",
    "mentoplasty",
]


@app.post("/predict")
async def predict(
    image: UploadFile = File(...),
    procedure: str = Query("rhinoplasty", enum=VALID_PROCEDURES),
    intensity: int = Query(60, ge=0, le=100),
    seed: int = Query(42, ge=0),
):
    """Generate a surgical prediction.

    Returns the predicted post-operative image as PNG.
    """
    if _pipeline is None:
        raise HTTPException(503, "Pipeline not loaded")

    # Read and decode image
    contents = await image.read()
    nparr = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        raise HTTPException(400, "Could not decode image")

    # Run prediction
    try:
        result = _pipeline.generate(
            img,
            procedure=procedure,
            intensity=intensity,
            seed=seed,
        )
    except Exception as e:
        raise HTTPException(500, f"Prediction failed: {e}")

    # Encode output as PNG
    output_img = result["output"]
    _, buffer = cv2.imencode(".png", output_img)

    return StreamingResponse(
        io.BytesIO(buffer.tobytes()),
        media_type="image/png",
        headers={"Content-Disposition": "inline; filename=prediction.png"},
    )


@app.get("/health")
async def health():
    """Health check endpoint."""
    return {
        "status": "ok",
        "pipeline_loaded": _pipeline is not None,
    }

Save this as scripts/api_server.py and run with uvicorn:

pip install fastapi uvicorn python-multipart

uvicorn scripts.api_server:app --host 0.0.0.0 --port 8000

Testing the API¶

# Health check
curl http://localhost:8000/health

# Run prediction
curl -X POST http://localhost:8000/predict \
    -F "image=@face.jpg" \
    -F "procedure=rhinoplasty" \
    -F "intensity=60" \
    -o prediction.png

Flask alternative¶

If you prefer Flask:

"""Minimal Flask API for LandmarkDiff."""

import cv2
import numpy as np
from flask import Flask, request, send_file
import io

from landmarkdiff.inference import LandmarkDiffPipeline

app = Flask(__name__)

pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
pipeline.load()


@app.route("/predict", methods=["POST"])
def predict():
    file = request.files.get("image")
    if file is None:
        return {"error": "No image provided"}, 400

    procedure = request.form.get("procedure", "rhinoplasty")
    intensity = int(request.form.get("intensity", 60))

    nparr = np.frombuffer(file.read(), np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
    if img is None:
        return {"error": "Could not decode image"}, 400

    result = pipeline.generate(img, procedure=procedure, intensity=intensity)

    _, buffer = cv2.imencode(".png", result["output"])
    return send_file(io.BytesIO(buffer.tobytes()), mimetype="image/png")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

Production Considerations¶

Model caching¶

On first inference, the pipeline downloads ~6 GB of model weights from HuggingFace. For production deployments:

Pre-download models during the Docker build or deployment setup:

export HF_HOME=/persistent/cache
python -c "
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
ControlNetModel.from_pretrained('CrucibleAI/ControlNetMediaPipeFace')
"

Use a persistent volume for the cache so container restarts do not re-download:

docker run -v model-cache:/root/.cache --gpus all -p 7860:7860 landmarkdiff

Set HF_HOME to a stable location:
```
export HF_HOME=/data/huggingface_cache
```

Batch processing¶

For processing many images (e.g., generating training data or running evaluations):

Load the pipeline once and reuse it:

pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
pipeline.load()

for image_path in image_paths:
    result = pipeline.generate(load_image(image_path), ...)

Use the batch inference script for directory-level processing:

python examples/batch_inference.py /path/to/images/ \
    --procedure rhinoplasty \
    --intensity 50 \
    --output output/batch/

For large batches on HPC, use SLURM array jobs:

#!/bin/bash
#SBATCH --array=0-9
#SBATCH --gres=gpu:1

TOTAL=1000
PER_JOB=$((TOTAL / 10))
START=$((SLURM_ARRAY_TASK_ID * PER_JOB))

python scripts/batch_process.py \
    --start $START --count $PER_JOB \
    --input data/images/ --output output/

Resource limits¶

Memory: The full inference pipeline uses ~5.2 GB VRAM and ~4 GB CPU RAM. For a web server, budget at least 8 GB RAM per worker process.

Concurrency: The pipeline is not thread-safe. Use process-based concurrency: - uvicorn: Run with --workers N where N is the number of GPUs. - gunicorn: Use --workers N --worker-class uvicorn.workers.UvicornWorker. - Each worker loads its own pipeline and GPU. Do not share pipeline objects across workers.

Timeouts: ControlNet inference takes 3-15 seconds depending on hardware. Set request timeouts accordingly:

uvicorn scripts.api_server:app --timeout-keep-alive 30

Disk: Model weights take ~6 GB on disk. Temporary files (intermediate images) are cleaned up automatically, but ensure enough temp space for concurrent requests.

Rate limiting¶

For public-facing deployments, add rate limiting to prevent abuse:

from slowapi import Limiter
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/predict")
@limiter.limit("10/minute")
async def predict(...):
    ...

Authentication¶

Never expose the API without authentication in production:

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

async def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    if credentials.credentials != expected_token:
        raise HTTPException(401, "Invalid token")

@app.post("/predict", dependencies=[Depends(verify_token)])
async def predict(...):
    ...

Reverse proxy¶

Put the API behind nginx for SSL termination and additional security:

server {
    listen 443 ssl;
    server_name landmarkdiff.example.com;

    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;

    client_max_body_size 10M;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 60s;
    }
}

Security checklist¶

Use HTTPS for all connections
Authenticate all API endpoints
Rate limit requests to prevent abuse
Validate and sanitize all inputs (file type, file size)
Do not store patient photos without explicit consent
Follow HIPAA guidelines if handling medical data
Set client_max_body_size to a reasonable limit (10 MB)
Log requests for audit trails but do not log image content
Run the container as a non-root user in production

Monitoring¶

Add basic metrics to track API health:

import time
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter("predict_total", "Total prediction requests", ["procedure"])
REQUEST_LATENCY = Histogram("predict_seconds", "Prediction latency in seconds")

@app.post("/predict")
async def predict(...):
    start = time.time()
    # ... run prediction ...
    REQUEST_LATENCY.observe(time.time() - start)
    REQUEST_COUNT.labels(procedure=procedure).inc()

Next Steps¶

Evaluation Guide: Measure model quality
Training Guide: Train your own checkpoint
FAQ: Common questions and troubleshooting