Benchmarks

Approximate performance benchmarks for LandmarkDiff on a range of hardware: inference speed, landmark extraction, training throughput, and memory usage.

Inference Speed

Hardware	Mode	Resolution	Time per image
A100 80GB	ControlNet (30 steps)	512x512	~3 sec
A100 40GB	ControlNet (30 steps)	512x512	~4 sec
RTX 4090	ControlNet (30 steps)	512x512	~5 sec
RTX 3090	ControlNet (30 steps)	512x512	~7 sec
T4 16GB	ControlNet (30 steps)	512x512	~15 sec
M3 Pro (MPS)	ControlNet (30 steps)	512x512	~45 sec
CPU (i9-13900K)	TPS (thin-plate spline) only	512x512	~0.5 sec
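
Per-image times like those above are usually obtained by discarding a few warm-up calls and then averaging wall-clock time over many images. The following is a minimal sketch of such a harness. It makes no claim about how `benchmarks/benchmark_inference.py` is actually written, and `time_per_image` and `dummy_pipeline` are hypothetical names used only for illustration. On a GPU, the call under test would also need to synchronize the device before the timer is read.

```python
import time
from typing import Callable


def time_per_image(run_once: Callable[[], object], num_images: int = 100, warmup: int = 3) -> float:
    """Return the mean wall-clock seconds per call of run_once."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    # Warm-up calls are excluded so one-off costs (model load, caches) do not skew the mean.
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(num_images):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed / num_images


def dummy_pipeline() -> int:
    # Hypothetical stand-in for a single 512x512 inference call.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    seconds = time_per_image(dummy_pipeline, num_images=20)
    print(f"{seconds * 1000:.3f} ms per image, {1.0 / seconds:.1f} images/sec")
```

Replacing `dummy_pipeline` with a closure around the real pipeline call gives a number comparable to the "Time per image" column, provided the step count and resolution match.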

Landmark Extraction

Hardware	Images/sec	Notes
Any modern CPU	~30	MediaPipe runs on CPU

Training Throughput

Hardware	Batch size	Grad accum	Effective batch	Steps/hour
A100 80GB	4	4	16	~600
A100 40GB	2	8	16	~400
RTX 4090	2	8	16	~350
RTX 3090	1	16	16	~200
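
The effective batch column is the per-device batch size multiplied by the gradient accumulation steps, so every configuration above trains with the same effective batch of 16. The short check below recomputes that column from the table and converts steps/hour into seconds per step. It is plain arithmetic on the published figures, and the function names are illustrative rather than part of LandmarkDiff.

```python
# (hardware, batch size, gradient accumulation steps, approx. steps/hour) from the table above
ROWS = [
    ("A100 80GB", 4, 4, 600),
    ("A100 40GB", 2, 8, 400),
    ("RTX 4090", 2, 8, 350),
    ("RTX 3090", 1, 16, 200),
]


def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Samples contributing to one optimizer update."""
    return batch_size * grad_accum


def seconds_per_step(steps_per_hour: float) -> float:
    """Convert a steps/hour figure into seconds per step."""
    return 3600.0 / steps_per_hour


if __name__ == "__main__":
    for hardware, batch_size, grad_accum, steps_per_hour in ROWS:
        print(
            f"{hardware}: effective batch {effective_batch(batch_size, grad_accum)}, "
            f"~{seconds_per_step(steps_per_hour):.1f} s/step"
        )
```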

Memory Usage

Component	VRAM
SD 1.5 (FP16)	~2.5 GB
ControlNet (FP16)	~1.5 GB
VAE (FP32)	~0.5 GB
CodeFormer	~0.4 GB
ArcFace	~0.3 GB
Total inference	~5.2 GB
Total training	~25 GB
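
The total inference figure is the sum of the per-component entries. The sketch below re-adds them and checks the result against a card's VRAM. The `fits_on` helper and its 1 GB headroom default are illustrative assumptions, not part of LandmarkDiff. The table does not break down the ~25 GB training total, so it is not reconstructed here.

```python
# Approximate per-component VRAM in GB, copied from the table above.
COMPONENT_VRAM_GB = {
    "SD 1.5 (FP16)": 2.5,
    "ControlNet (FP16)": 1.5,
    "VAE (FP32)": 0.5,
    "CodeFormer": 0.4,
    "ArcFace": 0.3,
}


def total_inference_gb() -> float:
    """Sum the component figures, rounded to one decimal like the table."""
    return round(sum(COMPONENT_VRAM_GB.values()), 1)


def fits_on(vram_gb: float, headroom_gb: float = 1.0) -> bool:
    """Rough check: do all inference components plus an assumed headroom fit in vram_gb?"""
    return total_inference_gb() + headroom_gb <= vram_gb


if __name__ == "__main__":
    print(f"Total inference: ~{total_inference_gb()} GB")
    print(f"Fits on a 16 GB card: {fits_on(16)}")
```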

Running Benchmarks

# Inference benchmark
python benchmarks/benchmark_inference.py --device cuda --num_images 100

# Landmark extraction benchmark
python benchmarks/benchmark_landmarks.py --num_images 1000

# Training throughput benchmark
python benchmarks/benchmark_training.py --device cuda --num_steps 100