Model Zoo¶

Pre-trained models and weights for LandmarkDiff.

Base Models (Required)¶

These are third-party models that LandmarkDiff uses. They are downloaded automatically on first run.

Model	Source	Size	Purpose
Stable Diffusion 1.5	runwayml/stable-diffusion-v1-5	~4 GB	Base diffusion backbone
ControlNet MediaPipe Face	CrucibleAI/ControlNetMediaPipeFace	~1.4 GB	Face mesh conditioning
MediaPipe Face Mesh	google/mediapipe	~5 MB	478-point landmark detection

Post-Processing Models (Optional)¶

Model	Source	Size	Purpose
CodeFormer	sczhou/CodeFormer	~400 MB	Face restoration (primary)
GFPGAN v1.4	TencentARC/GFPGAN	~350 MB	Face restoration (fallback)
Real-ESRGAN x4	xinntao/Real-ESRGAN	~64 MB	Background super-resolution
ArcFace	insightface/buffalo_l	~250 MB	Identity verification

Fine-tuned Checkpoints¶

Checkpoint	Dataset	Steps	FID	LPIPS	Status
`phase_a_50k`	50K synthetic pairs	50K	TBD	TBD	Training
`phase_b_clinical`	Clinical + synthetic	TBD	TBD	TBD	Planned

Downloading checkpoints¶

# Phase A (50K steps) - coming soon
# Will be available via Hugging Face Hub

Training Your Own¶

See docs/GPU_TRAINING_GUIDE.md for instructions on training from scratch.

# Generate training data
python scripts/generate_synthetic_data.py --input data/ffhq_samples/ --output data/synthetic_pairs/ --num 50000

# Train Phase A
python scripts/train_controlnet.py --data_dir data/synthetic_pairs/ --output_dir checkpoints/ --num_train_steps 50000

Hardware Requirements¶

Task	Min VRAM	Recommended	Time
Inference (single image)	6 GB	8 GB	~5 sec
Inference (batch of 16)	12 GB	16 GB	~30 sec
Training Phase A (10K steps)	24 GB	40 GB (A100)	~1 hour
Training Phase A (50K steps)	40 GB	80 GB (A100)	~6 hours
Training Phase B	40 GB	80 GB (A100)	~30 hours

Planned Models¶

The following models are on the roadmap as LandmarkDiff moves toward a 3D-native pipeline (phone video scan to interactive 3D surgical preview). None of these are available yet; this section previews what is coming.

Model	Approach	Purpose	Status
3D Face Reconstruction	FLAME-based fitting or neural implicit (NeRF/3DGS)	Reconstruct a textured 3D face mesh from a short phone video scan	Research
3D Deformation Model	Mesh-space surgical simulation	Apply procedure-specific displacements directly on 3D mesh vertices instead of 2D pixel warps	Research
Multi-View Consistency	View-conditioned diffusion or 3DGS rendering	Ensure deformed face renders consistently across arbitrary viewpoints	Research
Mobile-Optimized Inference	Distilled/quantized pipeline	Run landmark detection, reconstruction, and preview on-device with acceptable latency	Planned

3D face reconstruction model¶

Replaces the current single-image 2D pipeline entry point. Given 10-30 frames from a phone video scan (patient rotating their head), reconstructs a FLAME mesh with per-vertex texture. Candidate approaches include DECA-style regression, optimization-based FLAME fitting from MediaPipe landmarks, and feed-forward 3DGS methods.

3D deformation model¶

Operates on the reconstructed mesh rather than on 2D pixel coordinates. Existing procedure presets (rhinoplasty, blepharoplasty, etc.) would be re-expressed as 3D vertex displacement fields, enabling anatomically grounded deformations that look correct from any viewing angle.

Multi-view consistency model¶

Ensures that the deformed 3D representation renders without view-dependent artifacts. This may be handled implicitly by the 3D representation (mesh or 3DGS) or may require an additional consistency loss during training.

Mobile-optimized inference model¶

A distilled or quantized version of the pipeline targeting on-device inference. The goal is real-time landmark tracking and capture guidance, with reconstruction offloaded to a server or run locally on modern phones.