LandmarkDiff¶

Photorealistic facial surgery outcome prediction from standard clinical photography, powered by anatomically-conditioned latent diffusion.

LandmarkDiff takes a patient's pre-operative photograph and a specified surgical procedure, then generates a photorealistic prediction of what that patient will look like post-operatively. The entire pipeline runs on a single 2D photo - no 3D CT scans, no depth sensors, no multi-view capture required.

It works by extracting MediaPipe's 478-point face mesh from the input photo, applying procedure-specific Gaussian RBF deformations calibrated from anthropometric surgical data, rendering the deformed mesh as a tessellation wireframe, and feeding that wireframe into a ControlNet-conditioned Stable Diffusion 1.5 backbone to synthesize the predicted face. The output is composited back onto the original image using Laplacian pyramid blending with feathered surgical masks, then refined through neural face restoration and identity verification.

Paper: "LandmarkDiff: Anatomically-Conditioned Latent Diffusion for Photorealistic Facial Surgery Outcome Prediction," targeting MICCAI 2026.

Quick Links¶

Quickstart notebook: load the pipeline, run predictions, compare procedures
CHANGELOG: what changed between releases
Discussions: questions, ideas, and community discussion
API reference: per-module documentation

Table of Contents¶

Getting started - Why LandmarkDiff - Quick Start - Inference Modes - Gradio Web Demo

Core pipeline - Supported Procedures - How It Works - Clinical Edge Cases - Post-Processing Pipeline

Training and evaluation - Training - Evaluation and Metrics - Benchmarks - Model Zoo

Reference - Demo Outputs - Project Structure - Configuration - Requirements - Docker - Roadmap

Community - Citation - Contributors - Contributing - License - Acknowledgments

Why LandmarkDiff¶

Patients considering facial surgery want to see what they'll look like afterward. Current approaches either require expensive 3D imaging hardware (structured light, CT, MRI) or produce cartoonish 2D warps that don't look real. LandmarkDiff bridges that gap:

2D input only. Works from any standard clinical photograph or even a phone selfie. No 3D scanners, no depth sensors, no multi-view rigs.
Photorealistic output. Diffusion-based generation produces natural skin texture, lighting, and shadows - not just a geometric morph.
Anatomically grounded. Deformations are driven by procedure-specific landmark displacements calibrated from anthropometric surgical literature, not arbitrary pixel pushing.
Identity preserving. ArcFace verification ensures the output still looks like the same patient, not a different person.
Clinically aware. Built-in handling for vitiligo, Bell's palsy, keloid-prone skin, and Ehlers-Danlos syndrome - conditions that affect how facial tissue responds to surgery.
Fair across skin tones. All evaluation metrics are stratified by Fitzpatrick skin type (I through VI) to catch and prevent performance disparities.

Supported Procedures¶

LandmarkDiff ships with four procedure presets, each targeting specific anatomical regions with calibrated displacement vectors.

Rhinoplasty (Nose Reshaping)¶

Targets 24 landmarks across the nasal bridge, tip, and alar base. Key deformations include alar base narrowing (nostril width reduction), tip refinement with upward rotation, and dorsal hump reduction. Uses a 30px Gaussian RBF influence radius for smooth transitions across the nasal region.

Landmark indices: 1, 2, 4, 5, 6, 19, 94, 141, 168, 195, 197, 236, 240, 274, 275, 278, 279, 294, 326, 327, 360, 363, 370, 456, 460

Blepharoplasty (Eyelid Surgery)¶

Targets 28 landmarks around the upper and lower eyelids. Deformations include upper lid elevation (hooded eye correction), medial and lateral canthal tapering, and lower lid tightening. Uses a tighter 15px influence radius to avoid affecting surrounding structures like the brow.

Landmark indices: 33, 7, 163, 144, 145, 153, 154, 155, 157, 158, 159, 160, 161, 246, 362, 382, 381, 380, 374, 373, 390, 249, 263, 466, 388, 387, 386, 385, 384, 398

Rhytidectomy (Facelift)¶

Targets 32 landmarks along the jawline, cheeks, and periauricular region. Deformations include jowl lifting (upward and lateral traction), submental tightening, and gentle temple lifting to simulate tissue redistribution. Uses a wider 40px influence radius for the broad soft tissue mobilization typical of facelifts.

Landmark indices: 10, 21, 54, 58, 67, 93, 103, 109, 127, 132, 136, 150, 162, 172, 176, 187, 207, 213, 234, 284, 297, 323, 332, 338, 356, 361, 365, 379, 389, 397, 400, 427, 454

Orthognathic Surgery (Jaw Repositioning)¶

Targets 47 landmarks across the mandible, maxilla, and chin. Deformations simulate mandibular advancement or setback, chin projection changes, and lateral jaw narrowing. Uses a 35px influence radius. Note that identity loss is disabled for orthognathic predictions because jaw repositioning inherently changes facial proportions more than the other procedures.

Landmark indices: 0, 17, 18, 36, 37, 39, 40, 57, 61, 78, 80, 81, 82, 84, 87, 88, 91, 95, 146, 167, 169, 170, 175, 181, 191, 200, 201, 202, 204, 208, 211, 212, 214, 269, 270, 291, 311, 312, 317, 321, 324, 325, 375, 396, 405, 407, 415

Adding Your Own Procedure¶

You can define custom procedures by specifying which landmarks to move, how far, and in what direction. See tutorials/custom_procedures.md for a step-by-step guide.

How It Works¶

LandmarkDiff is a five-stage pipeline. Each stage is independently testable and swappable.

                         Input Photo (512x512)
                                |
                    [1] MediaPipe Face Mesh
                    478 landmarks, 3D coords
                                |
                    [2] Gaussian RBF Deformation
                    Procedure-specific displacement
                    vectors scaled by intensity (0-100%)
                                |
                    [3] Conditioning Generation
                    2556-edge tessellation wireframe
                    + adaptive Canny edges + mask
                                |
                    [4] ControlNet + Stable Diffusion 1.5
                    CrucibleAI/ControlNetMediaPipeFace
                    conditioned latent space generation
                                |
                    [5] Post-Processing
                    Laplacian pyramid blend (6 levels)
                    + CodeFormer face restoration
                    + Real-ESRGAN background upscale
                    + ArcFace identity verification
                                |
                        Predicted Post-Op Face

Stage 1: Landmark Extraction¶

MediaPipe Face Mesh detects 478 facial landmarks in 3D (x, y, z normalized coordinates) at roughly 30 fps on CPU. The landmarks are grouped into anatomical regions:

Region	Landmark count
Jawline	33
Left eye	16
Right eye	16
Left eyebrow	10
Right eyebrow	10
Nose	25
Lips	22
Left iris	5
Right iris	5
Face oval	37

The extraction runs at the start of every prediction and again on the output for evaluation (NME metric).

Stage 2: Gaussian RBF Deformation¶

Each procedure preset defines a set of DeformationHandle objects, each specifying: - Which landmark to move (index into the 478-point mesh) - How far to move it (pixel displacement vector, scaled by the intensity slider) - How wide the influence is (Gaussian RBF radius in pixels)

The deformation is applied as a smooth, spatially weighted field. Landmarks near the handle move the most; landmarks far away are unaffected. This prevents the jarring discontinuities you get from simple point-to-point warping.

All displacement magnitudes are scaled by the intensity parameter (0 to 100), so you can preview subtle through aggressive versions of the same procedure.

Stage 3: Conditioning Generation¶

The deformed landmarks are rendered into conditioning images for ControlNet:

Tessellation wireframe - The full 2556-edge MediaPipe face mesh drawn on a black canvas. This is the primary conditioning signal. It uses a static anatomical adjacency list (not Delaunay triangulation), so the topology is invariant to landmark displacement.
Adaptive Canny edges - Edge detection with thresholds derived from the image median (low = 0.66 * median, high = 1.33 * median). This adapts to different skin tones without hardcoded thresholds, plus morphological skeletonization to produce 1-pixel edges that ControlNet expects.
Surgical mask - A feathered mask indicating where the procedure affects the face. Built from the convex hull of procedure-specific landmarks, dilated, Gaussian-feathered, then perturbed with Perlin-style boundary noise (2-4px) to prevent visible seam lines.

Stage 4: Diffusion Generation¶

The conditioning images are fed to CrucibleAI's pre-trained ControlNet for MediaPipe Face, which conditions Stable Diffusion 1.5 to generate a face matching the deformed mesh topology. Procedure-specific text prompts emphasize clinical photography qualities (natural appearance, sharp focus, studio lighting).

Stage 5: Post-Processing¶

Six-step refinement: 1. CodeFormer neural face restoration (fidelity weight 0.7 for quality-fidelity balance) 2. Real-ESRGAN background super-resolution (non-face regions only) 3. Histogram matching in LAB color space for robust skin tone transfer from input to output 4. Frequency-aware sharpening on the L channel only (avoids color fringing) 5. Laplacian pyramid blending (6 levels) - low frequencies blend smoothly for lighting continuity, high frequencies transition sharply for texture/pore preservation 6. ArcFace identity verification - flags if the output drifts too far from the input identity (cosine similarity threshold 0.6)

Demo Outputs¶

Sample outputs are in the demos/ directory.

Pipeline visualization (pipeline_abstract.png):

Schematic overview of the five-stage pipeline: Input, 478-point mesh extraction, RBF deformation, ControlNet + SD1.5 synthesis, and predicted result.

Mesh deformation (mesh_deformation.png):

Side-by-side comparison of original and deformed face meshes, showing procedure-specific Gaussian RBF displacement vectors.

ControlNet-generated photorealistic samples will be added after model training completes.

Quick Start¶

Installation¶

git clone https://github.com/dreamlessx/LandmarkDiff-public.git
cd LandmarkDiff-public

# Core (inference only)
pip install -e .

# With training dependencies
pip install -e ".[train]"

# With Gradio demo
pip install -e ".[app]"

# With evaluation metrics
pip install -e ".[eval]"

# Everything
pip install -e ".[train,eval,app,dev]"

Run a single prediction¶

python scripts/run_inference.py /path/to/face.jpg \
    --procedure rhinoplasty \
    --intensity 60 \
    --mode controlnet

This will: 1. Detect the face and extract 478 landmarks 2. Apply rhinoplasty deformation at 60% intensity 3. Generate the ControlNet-conditioned prediction 4. Composite the result back onto the original 5. Save the output to output/result.png

CPU-only mode (no GPU needed)¶

python examples/tps_only.py /path/to/face.jpg \
    --procedure rhinoplasty \
    --intensity 60

TPS mode does pure geometric warping. It runs instantly on CPU and produces a geometrically accurate result, but without the photorealistic texture synthesis that the diffusion modes provide.

Batch processing¶

python examples/batch_inference.py /path/to/image_dir/ \
    --procedure blepharoplasty \
    --intensity 50 \
    --output output/batch/

Inference Modes¶

LandmarkDiff supports four inference modes with different quality-speed-hardware tradeoffs:

Mode	GPU Required	Speed	Quality	Identity Preservation
`tps`	No	Instant (~0.5s)	Geometric only	Perfect (pixel-level)
`img2img`	Yes (6GB)	~5s	Good	Good
`controlnet`	Yes (6GB)	~5s	Best	Good
`controlnet_ip`	Yes (8GB)	~7s	Best	Best

TPS mode - Thin-plate spline warping. No diffusion, no neural network inference. Just mathematically warps the pixels according to landmark displacements. Fast and deterministic, but the output looks like a geometric morph rather than a natural photo. Good for previewing the deformation before committing to a full diffusion run.

img2img mode - Standard Stable Diffusion img2img with the TPS-warped image as input and a feathered mask restricting generation to the surgical region. Faster than ControlNet but less controllable.

ControlNet mode - The primary mode. Uses CrucibleAI's pre-trained ControlNet for MediaPipe Face mesh conditioning. The deformed wireframe directly controls the spatial layout of the generated face, producing the most anatomically accurate results.

ControlNet + IP-Adapter mode - Adds IP-Adapter FaceID on top of ControlNet for stronger identity preservation. Uses face embeddings from the input photo to condition generation, reducing the chance of producing a different-looking person. Slightly slower due to the additional encoder pass.

from landmarkdiff.inference import LandmarkDiffPipeline

pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
pipeline.load()

result = pipeline.generate(
    image,
    procedure="rhinoplasty",
    intensity=60,
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=1.0,
    strength=0.75,
    seed=42,
    postprocess=True,
)

# result dict contains:
# result["output"]              - final composited image
# result["output_raw"]          - raw diffusion output (before compositing)
# result["output_tps"]          - TPS-only geometric warp
# result["conditioning"]        - wireframe fed to ControlNet
# result["mask"]                - surgical mask
# result["landmarks_original"]  - input landmarks
# result["landmarks_manipulated"] - deformed landmarks
# result["identity_check"]      - ArcFace similarity score

Gradio Web Demo¶

python scripts/app.py
# Opens at http://localhost:7860

The demo has five tabs:

Tab 1: Single Procedure¶

Upload a photo, pick a procedure, adjust intensity from 0-100%. The interface shows every intermediate step: extracted landmarks, deformed mesh, wireframe conditioning, surgical mask, TPS warp, and the final result in a side-by-side before/after view. Clinical flags (vitiligo, Bell's palsy with side selector, keloid-prone regions, Ehlers-Danlos) are available as checkboxes.

Tab 2: Multi-Procedure Comparison¶

Set independent intensity sliders for all four procedures and generate them all from the same photo. Useful for showing a patient their options side by side.

Tab 3: Intensity Sweep¶

Pick a procedure and a number of steps (3 to 10). Generates a gallery progressing from 0% to 100% intensity so you can see exactly how the result changes with the intensity parameter.

Tab 4: Face Analysis¶

Upload a photo and get back the detected Fitzpatrick skin type, face view classification (frontal, three-quarter, or profile), yaw and pitch angles in degrees, per-region landmark counts, confidence scores, and an annotated landmark visualization.

Tab 5: Multi-Angle Capture¶

Guides the user through capturing 5 standardized clinical views: frontal (0 degrees), left three-quarter (45 degrees), right three-quarter (45 degrees), left profile (90 degrees), right profile (90 degrees). Validates each photo against the expected yaw range and generates predictions for all views, producing a combined before/after gallery.

Training¶

Training happens in two phases.

Phase A: Synthetic Data (current)¶

Generate TPS-warped face pairs from FFHQ, then fine-tune ControlNet to reconstruct the original face from the deformed wireframe.

# 1. Download FFHQ samples
python scripts/download_ffhq.py --num 50000 --resolution 512

# 2. Generate training pairs (original + TPS-warped + wireframe)
python scripts/generate_synthetic_data.py \
    --input data/ffhq_samples/ \
    --output data/synthetic_pairs/ \
    --num 50000

# 3. Train ControlNet
python scripts/train_controlnet.py \
    --data_dir data/synthetic_pairs/ \
    --output_dir checkpoints/ \
    --num_train_steps 50000

Phase A uses diffusion loss only (MSE between predicted and target noise).

Phase B: Clinical + Combined Loss (planned)¶

Fine-tune further on clinical before/after pairs with the full four-term loss:

Loss	Weight	Purpose
Diffusion (MSE)	1.0	Primary training signal
Landmark L2	0.1	Anatomical accuracy (inside surgical mask only)
Identity (ArcFace)	0.05	Patient identity preservation
Perceptual (LPIPS)	0.1	Texture quality (outside mask, prevents penalizing the TPS warp)

The landmark loss is normalized by inter-ocular distance (landmarks 33 vs 263) for scale invariance. The identity loss uses procedure-dependent face cropping - rhinoplasty crops to the upper face, blepharoplasty uses the full face, rhytidectomy crops above the jawline, and orthognathic disables identity loss entirely since jaw surgery inherently changes proportions.

Training Configuration¶

Default config at configs/training.yaml:

Parameter	Value	Notes
Learning rate	1e-5	With cosine scheduler
Warmup steps	500
Batch size	4	Gradient accumulation 4x, effective batch 16
Mixed precision	bf16	NOT fp16 - activation range exceeded
EMA decay	0.9999
Checkpoint interval	5000 steps
ControlNet scale max	1.2	Sum > 1.2 causes saturation

Important training safeguards: - VAE is always frozen (gradient leak corrupts the latent space) - GroupNorm instead of BatchNorm (batch size 4 makes BN unstable) - TPS warps are precomputed to avoid CPU bottleneck during training - Git LFS required for checkpoints

SLURM (HPC)¶

sbatch scripts/train_slurm.sh

See docs/GPU_TRAINING_GUIDE.md for detailed HPC setup, Apptainer containers, and multi-node configurations.

Evaluation and Metrics¶

Primary Metrics¶

Metric	What it measures	Target	How it's computed
FID	Realism	< 50	Frechet Inception Distance via torch-fidelity (GPU-accelerated)
LPIPS	Perceptual similarity	< 0.15	Learned Perceptual Image Patch Similarity (AlexNet backbone)
SSIM	Structural similarity	> 0.80	Structural Similarity Index between input and output
NME	Landmark accuracy	< 0.05	Normalized Mean Error - L2 distance between predicted and target landmarks, normalized by inter-ocular distance (landmarks 33 vs 263)
Identity Sim	Identity preservation	> 0.85	ArcFace cosine similarity between input and output face embeddings (InsightFace buffalo_l, 512-dim)

Fitzpatrick Stratification¶

Every metric is broken down by Fitzpatrick skin type to ensure equitable performance. Skin type is classified automatically from the input photo using Individual Typology Angle (ITA):

ITA = arctan((L - 50) / b) * (180 / pi)

where L and b come from the LAB color space.

ITA Range	Fitzpatrick Type	Description
> 55	Type I	Very light
41 to 55	Type II	Light
28 to 41	Type III	Intermediate
10 to 28	Type IV	Tan
-30 to 10	Type V	Brown
< -30	Type VI	Dark

This catches cases where the model might work well on lighter skin but degrade on darker skin (or vice versa). Results are reported per-type in evaluation output.

Running Evaluation¶

python scripts/evaluate.py \
    --pred_dir output/predictions/ \
    --target_dir data/targets/ \
    --output eval_results.json

The evaluation harness computes all metrics, stratifies by Fitzpatrick type and by procedure, and writes a JSON report.

Clinical Edge Cases¶

LandmarkDiff handles four clinical conditions that affect how deformations should be applied or how the mask should behave.

Vitiligo¶

Vitiligo causes depigmented patches on the skin that should be preserved, not blended over. LandmarkDiff detects vitiligo patches using LAB luminance thresholding (high L, low saturation), filters by minimum area (200 px squared), and reduces mask intensity over detected patches by a preservation factor of 0.3. This means the surgical region is still modified, but depigmented areas are largely left alone.

Bell's Palsy¶

Bell's palsy causes unilateral facial paralysis. Deforming the paralyzed side produces unrealistic results because the tissue doesn't respond to surgery the same way. LandmarkDiff takes the affected side (left or right) as input and disables all deformation handles on that side. The bilateral landmark groups (eye, eyebrow, mouth corner, jawline) for the affected side are excluded from manipulation.

Keloid-Prone Skin¶

Keloid-prone patients develop raised scars at incision sites. LandmarkDiff identifies keloid-prone regions (specified by anatomical zone, e.g., "jawline", "nose"), creates exclusion masks with margins, and reduces mask intensity by a factor of 0.5 with additional Gaussian blur (sigma 10.0) for softer transitions. This prevents sharp compositing boundaries that would suggest incision lines.

Ehlers-Danlos Syndrome¶

Ehlers-Danlos causes tissue hypermobility - the skin stretches more than typical. LandmarkDiff multiplies the Gaussian RBF influence radius by 1.5 for Ehlers-Danlos patients, producing wider, more gradual deformations that reflect how hypermobile tissue actually responds to surgical manipulation.

Using Clinical Flags¶

from landmarkdiff.clinical import ClinicalFlags

flags = ClinicalFlags(
    vitiligo=True,
    bells_palsy=True,
    bells_palsy_side="left",
    keloid_prone=True,
    keloid_regions=["jawline", "nose"],
    ehlers_danlos=False,
)

result = pipeline.generate(
    image,
    procedure="rhinoplasty",
    intensity=60,
    clinical_flags=flags,
)

In the Gradio demo, these are checkboxes and dropdowns in Tab 1.

Post-Processing Pipeline¶

The raw diffusion output needs refinement before it looks right. The post-processing pipeline runs six steps:

1. CodeFormer Face Restoration¶

Neural face restoration that fixes small artifacts, enhances detail, and sharpens facial features. Uses a fidelity weight of 0.7 (range 0.0 to 1.0) to balance quality enhancement against faithfulness to the diffusion output. Falls back to GFPGAN if CodeFormer is unavailable.

2. Real-ESRGAN Background Enhancement¶

Super-resolution applied only to non-face regions (background, hair, clothing). Prevents the background from looking noticeably lower quality than the restored face.

3. Skin Tone Matching¶

CDF histogram matching in LAB color space transfers the input photo's skin tone to the generated output. LAB matching is more robust than RGB for this because it separates luminance from color, preventing brightness shifts.

4. Frequency-Aware Sharpening¶

Unsharp masking applied to the L channel only (luminance) with a default strength of 0.25. Sharpening only luminance avoids the color fringing artifacts you get from sharpening RGB channels directly.

5. Laplacian Pyramid Blending¶

The compositing step - blends the generated face into the original photo. Uses a 6-level Laplacian pyramid where low-frequency levels blend smoothly (lighting and color continuity) while high-frequency levels transition sharply (texture and pore detail). This prevents the color halos and "pasted on" look that simple alpha blending produces.

6. ArcFace Identity Verification¶

Final sanity check. Extracts ArcFace embeddings from the input and output, computes cosine similarity, and flags if the score drops below 0.6. This catches cases where the diffusion model drifted too far from the patient's appearance.

Project Structure¶

landmarkdiff/                   # Core library
    landmarks.py                #   MediaPipe 478-point face mesh extraction
                                #   FaceLandmarks dataclass, extract_landmarks(),
                                #   render_landmark_image(), LANDMARK_REGIONS dict
    conditioning.py             #   ControlNet conditioning generation
                                #   Tessellation wireframe (2556 edges), adaptive
                                #   Canny edge detection, generate_conditioning()
    manipulation.py             #   Gaussian RBF landmark deformation
                                #   DeformationHandle, PROCEDURE_LANDMARKS,
                                #   apply_procedure_preset(), clinical modifiers
    masking.py                  #   Feathered surgical mask generation
                                #   Convex hull + dilation + Gaussian feather +
                                #   Perlin boundary noise, clinical adjustments
    inference.py                #   Full pipeline (4 modes: tps/img2img/controlnet/
                                #   controlnet_ip), LandmarkDiffPipeline class,
                                #   face view estimation, procedure-specific prompts
    losses.py                   #   Combined loss (diffusion + landmark + identity
                                #   + perceptual), phase A/B control, procedure-
                                #   dependent identity cropping
    evaluation.py               #   Metrics (FID, LPIPS, SSIM, NME, Identity Sim),
                                #   Fitzpatrick ITA classification, per-type and
                                #   per-procedure stratification
    clinical.py                 #   Clinical edge cases: ClinicalFlags dataclass,
                                #   vitiligo patch detection, Bell's palsy side
                                #   exclusion, keloid mask adjustment, Ehlers-Danlos
    postprocess.py              #   Neural + classical post-processing: CodeFormer,
                                #   GFPGAN, Real-ESRGAN, LAB histogram matching,
                                #   Laplacian pyramid blend, ArcFace verification
    synthetic/
        pair_generator.py       #   Training pair generation pipeline
        tps_warp.py             #   Thin-plate spline warping with rigid regions
                                #   (teeth, sclera), smart control point subsampling
                                #   (max 80 from 478), batched evaluation
        augmentation.py         #   Clinical photography augmentations

scripts/                        # CLI tools
    app.py                      #   Gradio web demo (5 tabs)
    run_inference.py            #   Single image inference
    train_controlnet.py         #   ControlNet fine-tuning
    evaluate.py                 #   Automated evaluation harness
    demo.py                     #   CLI demo with visualizations
    download_ffhq.py            #   FFHQ face image downloader
    generate_synthetic_data.py  #   Synthetic training pair generator
    train_slurm.sh              #   SLURM job script (single GPU)
    train_slurm_v2.sh           #   SLURM job script (multi-GPU)
    gen_synthetic_slurm.sh      #   SLURM job for data generation

examples/                       # Runnable example scripts
    basic_inference.py          #   Single image with GPU fallback to TPS
    batch_inference.py          #   Process a directory of images
    tps_only.py                 #   CPU-only TPS warp (no GPU)
    compare_procedures.py       #   Side-by-side all procedures grid
    custom_procedure.py         #   Define a lip augmentation procedure
    landmark_visualization.py   #   Visualize mesh with displacement arrows

benchmarks/                     # Performance benchmarks
    benchmark_inference.py      #   Inference speed across hardware
    benchmark_landmarks.py      #   Landmark extraction throughput
    benchmark_training.py       #   Training steps/hour

configs/                        # Training configuration
    training.yaml               #   Default hyperparameters, loss weights, safeguards

paper/                          # MICCAI 2026 manuscript (Springer LNCS)
docs/                           # Documentation
    tutorials/                  #   quickstart, custom_procedures, training,
                                #   evaluation, deployment
    api/                        #   Per-module API reference (landmarks,
                                #   manipulation, conditioning, inference,
                                #   evaluation, clinical)
    GPU_TRAINING_GUIDE.md       #   HPC setup, Apptainer, SLURM

containers/                     # Apptainer/Singularity container definitions
tests/                          # Unit tests (9 test modules)
demos/                          # Curated sample output images

Configuration¶

Training (configs/training.yaml)¶

The training config controls all hyperparameters, loss weights, and safeguards. Key sections:

model:
  controlnet: CrucibleAI/ControlNetMediaPipeFace
  base_model: runwayml/stable-diffusion-v1-5

training:
  learning_rate: 1.0e-5
  lr_scheduler: cosine
  warmup_steps: 500
  batch_size: 4
  gradient_accumulation_steps: 4  # effective batch = 16
  num_train_steps: 10000
  mixed_precision: bf16
  ema_decay: 0.9999

loss_weights:  # Phase B only
  diffusion: 1.0
  landmark: 0.1
  identity: 0.05
  perceptual: 0.1

Inference Parameters¶

Parameter	Default	Range	Effect
`intensity`	60	0 - 100	How aggressive the deformation is (percentage)
`num_inference_steps`	30	10 - 100	Diffusion denoising steps (more = higher quality, slower)
`guidance_scale`	7.5	1.0 - 20.0	Classifier-free guidance strength
`controlnet_conditioning_scale`	1.0	0.0 - 1.2	How strongly the wireframe controls generation. Max 1.2 to avoid saturation
`strength`	0.75	0.0 - 1.0	img2img denoising strength
`seed`	None	any int	For reproducible results

Benchmarks¶

Inference Speed¶

Hardware	Mode	Time per image
A100 80GB	ControlNet (30 steps)	~3 sec
A100 40GB	ControlNet (30 steps)	~4 sec
RTX 4090	ControlNet (30 steps)	~5 sec
RTX 3090	ControlNet (30 steps)	~7 sec
T4 16GB	ControlNet (30 steps)	~15 sec
M3 Pro (MPS)	ControlNet (30 steps)	~45 sec
Any CPU	TPS only	~0.5 sec

Training Throughput¶

Hardware	Batch size	Grad accum	Effective batch	Steps/hour
A100 80GB	4	4	16	~600
A100 40GB	2	8	16	~400
RTX 4090	2	8	16	~350
RTX 3090	1	16	16	~200

VRAM Usage¶

Component	VRAM
SD 1.5 (FP16)	~2.5 GB
ControlNet (FP16)	~1.5 GB
VAE (FP32)	~0.5 GB
CodeFormer	~0.4 GB
ArcFace	~0.3 GB
Total inference	~5.2 GB
Total training	~25 GB

Run benchmarks yourself:

python benchmarks/benchmark_inference.py --device cuda --num_images 100
python benchmarks/benchmark_landmarks.py --num_images 1000
python benchmarks/benchmark_training.py --device cuda --num_steps 100

Model Zoo¶

See MODEL_ZOO.md for the full list of required and optional models.

Base models (auto-downloaded on first run): - runwayml/stable-diffusion-v1-5 - ~4 GB - CrucibleAI/ControlNetMediaPipeFace - ~1.4 GB - MediaPipe Face Mesh - ~5 MB

Post-processing models (optional, auto-downloaded): - CodeFormer - ~400 MB - GFPGAN v1.4 - ~350 MB - Real-ESRGAN x4 - ~64 MB - ArcFace (InsightFace buffalo_l) - ~250 MB

Requirements¶

Python 3.10+
PyTorch 2.1+ with CUDA (or MPS on Apple Silicon)
~6 GB VRAM for inference (SD 1.5 + ControlNet)
~25 GB VRAM for training (A100 40GB minimum, 80GB recommended)
MediaPipe 0.10.9+
diffusers 0.27.0+, transformers 4.38.0+

Full dependency list in pyproject.toml.

Docker¶

# CPU-only demo (TPS mode, no GPU required)
docker build -t landmarkdiff:cpu -f Dockerfile.cpu .
docker run -p 7860:7860 landmarkdiff:cpu

# GPU-accelerated demo (ControlNet inference)
docker build -t landmarkdiff:gpu -f Dockerfile.gpu .
docker run --gpus all -p 7860:7860 landmarkdiff:gpu

Or with Docker Compose:

docker compose up app       # CPU demo on :7860
docker compose up gpu       # GPU demo on :7861
docker compose --profile training run train  # training (GPU)

GPU passthrough requires NVIDIA Container Toolkit. See Docker GPU Setup for prerequisites, VRAM requirements by GPU tier, and troubleshooting.

For HPC environments using Apptainer/Singularity, see containers/.

Make Targets¶

make help            # show all commands
make install         # install (inference only)
make install-dev     # install with dev tools
make install-train   # install with training deps
make install-app     # install with Gradio
make install-all     # install everything
make test            # run full test suite
make test-fast       # run tests excluding slow ones
make lint            # run ruff linter
make format          # auto-format code
make type-check      # run mypy
make check           # lint + format + type-check
make demo            # launch Gradio demo
make inference       # run single inference
make train           # train ControlNet
make evaluate        # run evaluation
make docker          # build Docker image
make paper           # build MICCAI paper PDF
make clean           # remove build artifacts

Roadmap¶

Current (v0.1 - Spring 2026)¶

[x] Core pipeline: landmark extraction, RBF deformation, ControlNet conditioning, mask compositing
[x] 4 procedure presets (rhinoplasty, blepharoplasty, rhytidectomy, orthognathic)
[x] Synthetic training pair generation via TPS warps
[x] Clinical edge case handling (vitiligo, Bell's palsy, keloid, Ehlers-Danlos)
[x] Neural post-processing (CodeFormer, Real-ESRGAN, ArcFace identity verification)
[x] Gradio demo with multi-angle capture
[x] Fitzpatrick-stratified evaluation protocol
[x] Docker and Apptainer container support
[ ] ControlNet fine-tuning on 50K+ synthetic pairs (in progress)
[ ] Populate results tables in paper

Next (v0.2 - Summer 2026)¶

[ ] FLUX.1-dev backbone upgrade (higher quality generation at 1024x1024)
[ ] IP-Adapter FaceID for stronger identity preservation
[ ] Additional procedure presets (mentoplasty, brow lift, otoplasty)
[ ] Clinical validation with board-certified plastic surgeons
[ ] Hugging Face Spaces interactive demo
[ ] arXiv preprint (target: April 2026)

Future (v1.0)¶

[ ] FLAME 3D morphable model integration for depth-aware deformation
[ ] Multi-view consistency loss across frontal/profile predictions
[ ] Physics-informed tissue simulation (FEM for soft tissue response)
[ ] React Native mobile capture app with standardized clinical photo acquisition
[ ] Cloud deployment with Triton inference server

Publication Targets¶

MICCAI 2026 workshop paper (July 2026 submission)
RSNA 2026 abstract (May 2026)
Full conference paper (CVPR/NeurIPS 2027)

Citation¶

@inproceedings{landmarkdiff2026,
  title={LandmarkDiff: Anatomically-Conditioned Latent Diffusion for
         Photorealistic Facial Surgery Outcome Prediction},
  booktitle={Medical Image Computing and Computer Assisted Intervention
            : MICCAI},
  year={2026},
  publisher={Springer}
}

Machine-readable citation metadata is also available in CITATION.cff.

Contributors¶

We track all contributions and contributors will be acknowledged in the MICCAI 2026 paper. Significant contributions earn co-authorship.

Contribution Level	Recognition
Bug fix or typo	Acknowledged in README
New procedure preset	Acknowledged in paper and README
Feature module (new loss, metric, clinical handler)	Co-author on paper
Clinical validation data	Co-author on paper
Sustained multi-feature contributions	Co-author on paper

Current Contributors¶

GitHub Handle	Contribution
@dreamlessx	Core architecture, training pipeline, paper

To join this list, open a PR or contribute to an issue. See CONTRIBUTING.md for guidelines.

Contributing¶

Contributions welcome. See CONTRIBUTING.md for the full guide, including development setup, coding style, testing requirements, and how to add new procedures.

For bug reports and feature requests, use the issue templates.

For major changes, please open an issue first to discuss the proposed approach.

License¶

MIT License. See LICENSE for details.

Acknowledgments¶

CrucibleAI for the MediaPipe Face ControlNet
MediaPipe for the 478-point face mesh
Stable Diffusion and ControlNet for the diffusion backbone
CodeFormer and Real-ESRGAN for face restoration
InsightFace for ArcFace identity verification
FFHQ for training data

:maxdepth: 2
:caption: Getting Started

getting-started
install
tutorials/quickstart
faq

:maxdepth: 2
:caption: Procedures

procedures/rhinoplasty
procedures/blepharoplasty
procedures/rhytidectomy
procedures/orthognathic
procedures/brow_lift
procedures/mentoplasty

:maxdepth: 2
:caption: Tutorials

tutorials/custom_procedures
tutorials/training
tutorials/evaluation
tutorials/deployment
docker-gpu
GPU_TRAINING_GUIDE

:maxdepth: 2
:caption: API Reference

api/landmarks
api/manipulation
api/conditioning
api/inference
api/evaluation
api/clinical

:maxdepth: 1
:caption: Project

benchmarks
changelog