---
title: rgbd-depth
emoji: ๐จ
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
python_version: '3.10'
app_file: app.py
pinned: false
license: apache-2.0
---
# RGBD-Depth: Real-time Depth Refinement

Transform noisy depth camera data into clean, simulation-quality depth maps

Try Online Demo • Quickstart Colab • Installation • Usage • Models

## What is RGBD-Depth?

An optimized Python package for RGB-D depth refinement using Vision Transformer encoders. The implementation is aligned with the ByteDance CDM reference and adds performance optimizations for CUDA, MPS (Apple Silicon), and CPU.
## Performance & Results

Inference speed (RealSense D435, 640×480, M2 Max / RTX 3090):

| Device | Precision | Time | vs Reference |
|---|---|---|---|
| CUDA + xFormers | FP32 | 0.95 s | ~8% faster |
| CUDA + xFormers | FP16 | 0.52 s | ~2× faster |
| Apple M2 Max (MPS) | FP32 | 1.34 s | Native support |
| CPU (16 cores) | FP32 | 13.37 s | No GPU needed |
Quality Metrics:
- Pixel-perfect alignment with the ByteDance reference (0 pixel difference, verified)
- Metric depth accuracy preserved (meters)
- Compatible with all original checkpoints
Real-world improvements:
- Noise reduction: up to 80% cleaner depth maps
- Edge preservation: sharp object boundaries maintained
- Sensor-specific: models trained per camera (D405, D435, L515, ZED 2i, Kinect)
## Try it Online

Try rgbd-depth directly in your browser with our interactive Gradio demo; no installation required. Upload your images and refine depth maps instantly.

Available on Hugging Face Spaces: upload your RGB and depth images, adjust parameters (camera model, precision, resolution), and get the refined depth map back. Models are downloaded automatically from the Hugging Face Hub on first use.
## Overview
Camera Depth Models (CDMs) are sensor-specific depth models trained to produce clean, simulation-like depth maps from noisy real-world depth camera data. By bridging the visual gap between simulation and reality through depth perception, CDMs enable robotic policies trained purely in simulation to transfer directly to real robots.
Original work by ByteDance Research. This package provides an optimized implementation with:
- Pixel-perfect alignment with the reference implementation (verified: 0 pixel difference)
- Device-specific optimizations: xFormers (CUDA), SDPA fallback, torch.compile
- Mixed-precision support: FP16 (CUDA/MPS), BF16 (CUDA)
- Better CLI: device selection, optimization control, precision modes
- Easy installation: a single `pip install` command
## Why This Package?
This is an optimized, production-ready version of ByteDance's Camera Depth Models with several improvements:
| Feature | ByteDance Original | This Package |
|---|---|---|
| Installation | Manual setup | `pip install rgbd-depth` |
| CUDA optimization | Basic | xFormers (~8% faster) + torch.compile |
| Apple Silicon (MPS) | Not optimized | Native support with fallbacks |
| Mixed precision | Manual | Automatic FP16/BF16 with `--precision` flag |
| CLI | Basic | Enhanced, with device selection and optimization control |
| Documentation | Minimal | Comprehensive guides (README + OPTIMIZATION.md) |
| Testing | None | CI/CD with automated tests |
| PyPI package | No | Yes (`rgbd-depth`) |
Choose this package if you want:
- Faster inference on CUDA (xFormers) or Apple Silicon (MPS)
- Easy mixed precision (FP16/BF16) without code changes
- Simple installation via PyPI
- A production-ready CLI with device/precision control
- A maintained codebase with CI/CD and tests
## Key Features
- Metric Depth Estimation: Produces accurate absolute depth measurements in meters
- Multi-Camera Support: Optimized models for various depth sensors (RealSense D405/D435/L515, ZED 2i, Azure Kinect)
- Performance Optimizations: ~8% faster on CUDA with xFormers, automatic backend selection
- Mixed Precision: FP16/BF16 support for faster inference on compatible hardware
- Sim-to-Real Ready: Generates simulation-quality depth from real camera data
## Architecture
CDM uses a dual-branch Vision Transformer architecture:
- RGB Branch: Extracts semantic information from RGB images
- Depth Branch: Processes noisy depth sensor data
- Cross-Attention Fusion: Combines RGB semantics with depth scale information
- DPT Decoder: Produces final metric depth estimation
Supported ViT encoder sizes:
- `vits`: Small (64 features, 384 output channels)
- `vitb`: Base (128 features, 768 output channels)
- `vitl`: Large (256 features, 1024 output channels)
- `vitg`: Giant (384 features, 1536 output channels)

All provided pretrained models are based on `vitl`.
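For orientation, here is a minimal sketch of how these encoder sizes map onto the Python API shown later in this README; the constructor arguments mirror that example (`encoder`, `features`), and anything beyond them should be treated as an assumption about the released package.

```python
from rgbddepth import RGBDDepth

# Feature widths per encoder size, as listed above.
ENCODER_FEATURES = {"vits": 64, "vitb": 128, "vitl": 256, "vitg": 384}

def build_model(encoder: str = "vitl") -> RGBDDepth:
    # Assumes the constructor accepts `encoder` and `features`,
    # matching the Python API example below; real checkpoints
    # may expect additional arguments.
    return RGBDDepth(encoder=encoder, features=ENCODER_FEATURES[encoder])

model = build_model("vitl")  # all released checkpoints use vitl
```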
## Hugging Face Spaces Demo

The easiest way to try rgbd-depth is via Hugging Face Spaces: completely free, no installation needed:
- Open the interactive demo
- Upload an RGB image and a depth map (PNG or JPG)
- Configure camera model, precision, and visualization options
- Click "Refine Depth" and download the result
What happens:
- Models are auto-downloaded from Hugging Face Hub on first use
- Runs on free CPU hardware (inference: ~10-30s)
- GPU hardware available for faster processing (~2-5s)
- All computations are done server-side; your images stay private
Limitations (HF Spaces CPU):
- No xFormers optimization (CUDA-only)
- Inference slower than local GPU
- Perfect for testing and prototyping
For production workflows or faster inference, use the local installation below.
> Note: This README is optimized for GitHub, PyPI, and Hugging Face Spaces. The YAML metadata at the top of the file is auto-detected by HF Spaces and not displayed.
## Use Cases

### Robotics & Manipulation
- Sim-to-Real Transfer: train robot policies in simulation, deploy on real hardware with clean depth
- Grasping: accurate object boundaries for pick-and-place tasks
- Navigation: obstacle detection with metric depth for path planning

### Computer Vision
- AR/VR: real-time depth refinement for mixed reality applications
- 3D Reconstruction: clean depth maps for photogrammetry and SLAM
- Portrait Mode: professional depth-of-field effects on mobile devices

### Research & Development
- Benchmarking: consistent depth quality across camera types
- Dataset Creation: generate clean training data from noisy sensors
- Prototyping: quick iteration with the HuggingFace Spaces demo

### Production Systems
- Quality Control: precise measurements for automated inspection
- Logistics: volume estimation and bin picking
- Medical Imaging: enhanced depth perception for surgical robots
## Installation

### From PyPI (recommended)
Basic installation (core dependencies only):
```bash
pip install rgbd-depth
```
Installation with extras:
```bash
# With CUDA optimizations (xFormers, ~8% faster)
pip install rgbd-depth[xformers]

# With Gradio demo interface
pip install rgbd-depth[demo]

# With HuggingFace Hub model downloads
pip install rgbd-depth[download]

# With development tools (pytest, black, ruff, etc.)
pip install rgbd-depth[dev]

# Install everything (all extras)
pip install rgbd-depth[all]
```
Development installation (editable):
```bash
git clone https://github.com/Aedelon/rgbd-depth.git
cd rgbd-depth
pip install -e ".[dev]"  # or: uv sync --extra dev
```
Requirements:
- Python 3.10+ (Python 3.8-3.9 support dropped in v1.0.2+)
- PyTorch 2.0+ with appropriate CUDA/MPS support
- OpenCV, NumPy, Pillow
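Before installing the CUDA extra, it can help to confirm which accelerator your PyTorch build actually sees. A quick check using standard PyTorch calls (the package's `--device auto` presumably relies on similar availability checks):

```python
import torch

# Report which accelerators this PyTorch build can use.
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
print("PyTorch version:", torch.__version__)
```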
## Quick Start

### Easiest: No Installation (HF Spaces)

Open the interactive demo in your browser → start here!

### Local Installation

After `pip install rgbd-depth`:
```bash
# CUDA (optimizations auto-enabled, FP16 for best speed)
python infer.py --input rgb.png --depth depth.png --precision fp16

# Apple Silicon (MPS)
python infer.py --input rgb.png --depth depth.png --device mps

# CPU (FP32 only)
python infer.py --input rgb.png --depth depth.png --device cpu
```
Example images are provided in `example_data/`. Pre-trained models can be downloaded from Hugging Face.
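To fetch a checkpoint programmatically, here is a minimal sketch using `huggingface_hub`; the repository ID and filename below are placeholders, so substitute the actual model repository listed on the project's Hugging Face page:

```python
from huggingface_hub import hf_hub_download
import torch

# Placeholder repo/filename -- replace with the real checkpoint location.
ckpt_path = hf_hub_download(
    repo_id="your-org/rgbd-depth-d435",  # hypothetical repository ID
    filename="model.pth",                # hypothetical filename
)
state_dict = torch.load(ckpt_path, map_location="cpu")  # load as in the Python API section
```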
## Usage

### Command Line Interface
Basic inference:
```bash
python infer.py \
    --input /path/to/rgb.png \
    --depth /path/to/depth.png \
    --output refined_depth.png
```
CUDA with optimizations (default):
```bash
# FP32 (best accuracy)
python infer.py --input rgb.png --depth depth.png

# FP16 (best speed, ~2× faster)
python infer.py --input rgb.png --depth depth.png --precision fp16

# BF16 (best stability)
python infer.py --input rgb.png --depth depth.png --precision bf16

# Disable optimizations (debugging)
python infer.py --input rgb.png --depth depth.png --no-optimize
```
Apple Silicon (MPS):
```bash
# FP32 (default)
python infer.py --input rgb.png --depth depth.png --device mps

# FP16 (faster)
python infer.py --input rgb.png --depth depth.png --device mps --precision fp16
```
CPU:
```bash
# FP32 only (FP16 not recommended on CPU)
python infer.py --input rgb.png --depth depth.png --device cpu
```
### Command Line Arguments
Required:
- `--input`: Path to RGB input image (JPG/PNG)
- `--depth`: Path to depth input image (PNG, 16-bit or 32-bit)
Optional:
- `--output`: Output visualization path (default: `output.png`)
- `--device`: Device to use: `auto`, `cuda`, `mps`, `cpu` (default: `auto`)
- `--precision`: Precision mode: `fp32`, `fp16`, `bf16` (default: `fp32`)
- `--no-optimize`: Disable optimizations on CUDA (for debugging)
- `--encoder`: Model size: `vits`, `vitb`, `vitl`, `vitg` (default: `vitl`)
- `--input-size`: Input resolution for inference (default: 518)
- `--depth-scale`: Scale factor for depth values (default: 1000.0; see the example after this list)
- `--max-depth`: Maximum valid depth in meters (default: 6.0)
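The `--depth-scale` and `--max-depth` values define how raw sensor depth maps to meters. A minimal sketch of the same conversion in Python, assuming a 16-bit depth PNG that stores millimeters (as RealSense-style sensors typically do); `depth.png` is a placeholder path:

```python
import cv2
import numpy as np

# Load a 16-bit depth PNG (integer units, e.g. millimeters).
raw = cv2.imread("depth.png", cv2.IMREAD_UNCHANGED)

# Same role as --depth-scale: divide by 1000.0 to get meters.
depth_m = raw.astype(np.float32) / 1000.0

# Same role as --max-depth: treat zeros and readings beyond 6.0 m as invalid.
valid = (depth_m > 0) & (depth_m <= 6.0)
print(f"valid pixels: {valid.mean():.1%}, median depth: {np.median(depth_m[valid]):.2f} m")
```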
### Python API

```python
import torch
from rgbddepth import RGBDDepth
import cv2
import numpy as np

# Load model with optimizations
model = RGBDDepth(encoder='vitl', features=256, use_xformers=True)
model.load_state_dict(torch.load('model.pth'))
model.eval()
model = model.to('cuda')  # or 'mps', 'cpu'

# Optional: compile for extra speed on CUDA
model = torch.compile(model)

# Load images
rgb = cv2.imread('rgb.jpg')[:, :, ::-1]  # BGR to RGB
depth = cv2.imread('depth.png', cv2.IMREAD_UNCHANGED) / 1000.0  # convert to meters

# Create similarity depth (inverse depth)
simi_depth = np.zeros_like(depth)
simi_depth[depth > 0] = 1 / depth[depth > 0]

# Run inference with mixed precision
with torch.amp.autocast('cuda', dtype=torch.float16):
    pred_depth = model.infer_image(rgb, simi_depth, input_size=518)
```
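Continuing the snippet above (so `cv2`, `np`, and `pred_depth` stay in scope), here is a small sketch for saving the result; it assumes `pred_depth` comes back as a NumPy array of metric depth in meters and mirrors the 16-bit millimeter convention used for the input:

```python
# Save the refined depth as a 16-bit PNG in millimeters (input convention).
out_mm = np.clip(pred_depth * 1000.0, 0, 65535).astype(np.uint16)
cv2.imwrite('refined_depth.png', out_mm)

# Optional: a quick color visualization for inspection.
vis_u8 = cv2.normalize(pred_depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite('refined_depth_vis.png', cv2.applyColorMap(vis_u8, cv2.COLORMAP_INFERNO))
```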
## Model Training
CDMs are trained on synthetic datasets generated using camera-specific noise models:
- Noise Model Training: Learn hole and value noise patterns from real camera data
- Synthetic Data Generation: Apply learned noise to clean simulation depth
- CDM Training: Train depth estimation model on synthetic noisy data
Training datasets: HyperSim, DREDS, HISS, IRS (280,000+ images total)
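As a purely illustrative sketch of step 2 (not the project's actual noise model, which is learned per camera), applying hole noise and value noise to a clean simulated depth map could look like this:

```python
import numpy as np

def add_sensor_noise(clean_depth: np.ndarray, hole_prob: float = 0.05,
                     value_sigma: float = 0.01, rng=None) -> np.ndarray:
    """Illustrative only: random holes plus Gaussian value noise (meters).
    The real pipeline learns camera-specific hole and value noise from data."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = clean_depth.copy()
    # Hole noise: drop a fraction of pixels to 0 (missing readings).
    holes = rng.random(clean_depth.shape) < hole_prob
    noisy[holes] = 0.0
    # Value noise: perturb the remaining depth values.
    noisy[~holes] += rng.normal(0.0, value_sigma, size=int((~holes).sum()))
    return np.clip(noisy, 0.0, None)
```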
## Supported Cameras

Pre-trained models are currently available for:
- Intel RealSense D405/D435/L515
- Stereolabs ZED 2i (2 modes: Quality, Neural)
- Microsoft Azure Kinect
## File Structure

```
rgbd-depth/
├── app.py                     # Gradio web demo for HuggingFace Spaces
├── infer.py                   # CLI inference script (main entry point)
├── pyproject.toml             # Modern package config (PEP 621, replaces setup.py)
├── setup.py                   # Legacy setuptools build script
├── requirements.txt           # Minimal deps for HuggingFace Spaces
├── uv.lock                    # UV package manager lock file
├── LICENSE                    # Apache 2.0 license
├── README.md                  # This file (GitHub/PyPI/HF Spaces unified)
├── OPTIMIZATION.md            # Performance benchmarks and optimization guide
├── CHANGELOG.md               # Version history and release notes
├── VIRAL_STRATEGY.md          # GitHub/PyPI marketing strategy
│
├── rgbddepth/                 # Main Python package
│   ├── __init__.py            # Public API exports (RGBDDepth, DinoVisionTransformer, __version__)
│   ├── dpt.py                 # RGBDDepth model (dual-branch ViT + DPT decoder)
│   ├── dinov2.py              # DINOv2 Vision Transformer encoder
│   ├── flexible_attention.py  # Cross-attention w/ xFormers + SDPA fallback
│   │
│   ├── dinov2_layers/         # Vision Transformer building blocks (from Meta DINOv2)
│   │   ├── __init__.py
│   │   ├── attention.py       # Self-attention w/ optional xFormers (MemEffAttention)
│   │   ├── block.py           # Transformer encoder block (NestedTensorBlock)
│   │   ├── mlp.py             # Feed-forward network (Mlp)
│   │   ├── patch_embed.py     # Image → patch embeddings (PatchEmbed)
│   │   ├── swiglu_ffn.py      # SwiGLU activation FFN
│   │   ├── drop_path.py       # Stochastic depth regularization
│   │   └── layer_scale.py     # LayerScale normalization
│   │
│   └── util/                  # Utilities
│       ├── __init__.py
│       ├── blocks.py          # DPT decoder blocks (FeatureFusionBlock, ResidualConvUnit)
│       └── transform.py       # Image preprocessing (Resize, PrepareForNet)
│
├── tests/                     # Test suite (42 tests, runs in GitHub Actions)
│   ├── test_import.py         # Basic imports and smoke tests
│   └── test_model.py          # Architecture, forward pass, attention, preprocessing
│
├── example_data/              # Example RGB-D pairs for testing
│   ├── color_12.png           # RGB image sample
│   ├── depth_12.png           # Depth map sample
│   └── result.png             # Expected output
│
└── .github/workflows/         # CI/CD automation
    ├── test.yml               # Run tests on Python 3.10-3.12 (Ubuntu/macOS/Windows)
    ├── publish.yml            # Auto-publish to PyPI on release tags
    └── deploy-hf.yml          # Auto-deploy to HuggingFace Spaces on push to main
```
## Performance

### Accuracy
This implementation achieves pixel-perfect alignment with the ByteDance reference:
- 0 pixel difference between vanilla and optimized inference (verified on test images)
- Identical checkpoint loading (weights are fully compatible)
- Numerical precision preserved (min=0.2036, max=1.1217, exact match)
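A quick way to run this kind of check on your own outputs; a minimal sketch that assumes you have saved the two depth maps to compare (the `.npy` paths are placeholders):

```python
import numpy as np

# Compare two saved depth arrays, e.g. reference vs. optimized inference.
ref = np.load("depth_reference.npy")  # placeholder path
opt = np.load("depth_optimized.npy")  # placeholder path

diff = np.abs(ref - opt)
print("max abs diff:", diff.max(), "| differing pixels:", int((diff > 0).sum()))
```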
CDMs achieve state-of-the-art performance on metric depth estimation:
- Superior accuracy compared to existing prompt-based depth models
- Zero-shot generalization across different camera types
- Real-time inference suitable for robot control (lightweight ViT variants)
Performance optimizations:
- xFormers support on CUDA (~8% faster than native SDPA)
- Mixed precision (FP16/BF16) for faster inference
- Device-specific optimizations (CUDA/MPS/CPU)
For detailed optimization strategies and benchmarks, see OPTIMIZATION.md.
## What's Different from Reference?
This implementation maintains 100% compatibility with ByteDance CDM while adding:
### 1. Performance Optimizations
- xFormers support: ~8% faster attention on CUDA (automatic fallback to SDPA)
- torch.compile: JIT compilation (CUDA only, auto-enabled)
- Mixed precision: FP16/BF16 support via `torch.amp.autocast`
- Device-specific strategies: optimizations only where beneficial
### 2. Better CLI/API

- `--device` flag: force a specific device (`auto`/`cuda`/`mps`/`cpu`)
- `--precision` flag: choose FP32/FP16/BF16
- `--no-optimize` flag: disable optimizations for debugging
- Automatic device detection and optimization selection

### 3. Improved Architecture

- `FlexibleCrossAttention`: inherits from `nn.MultiheadAttention` for checkpoint compatibility
- Automatic backend selection: xFormers (CUDA) → SDPA fallback (see the sketch after this list)
- Device-aware preprocessing: Uses model's device instead of auto-detection
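As an illustration of that backend-selection idea (a sketch only, not the package's actual `FlexibleCrossAttention` implementation):

```python
import torch
import torch.nn.functional as F

try:
    from xformers.ops import memory_efficient_attention
    HAS_XFORMERS = True
except ImportError:
    HAS_XFORMERS = False

def attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim) tensors.
    Uses xFormers' memory-efficient kernel on CUDA when installed,
    otherwise falls back to PyTorch's scaled_dot_product_attention."""
    if HAS_XFORMERS and q.is_cuda:
        # xFormers expects (batch, seq, heads, head_dim)
        out = memory_efficient_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
        )
        return out.transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v)
```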
### 4. Code Quality
- Type hints and better documentation
- Cleaner argument parsing
- Validation for precision/device combinations
- Helpful warnings for incompatible configurations
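For instance, the precision/device validation described above could look roughly like this (illustrative; based on the support matrix stated earlier: FP16 on CUDA/MPS, BF16 on CUDA only, FP32 everywhere):

```python
def validate_precision(device: str, precision: str) -> str:
    """Illustrative check: fall back to fp32 where a precision isn't supported."""
    if precision == "bf16" and device != "cuda":
        print("Warning: bf16 is only supported on CUDA; falling back to fp32.")
        return "fp32"
    if precision == "fp16" and device == "cpu":
        print("Warning: fp16 is not recommended on CPU; falling back to fp32.")
        return "fp32"
    return precision
```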
All changes are backwards compatible with original checkpoints and produce identical numerical results.
## Citation
If you use CDM in your research, please cite:
```bibtex
@article{liu2025manipulation,
  title={Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots},
  author={Liu, Minghuan and Zhu, Zhengbang and Han, Xiaoshen and Hu, Peng and Lin, Haotong and
          Li, Xinyao and Chen, Jingxiao and Xu, Jiafeng and Yang, Yichu and Lin, Yunfeng and
          Li, Xinghang and Yu, Yong and Zhang, Weinan and Kong, Tao and Kang, Bingyi},
  journal={arXiv preprint},
  year={2025}
}
```
## License
This project is licensed under the Apache 2.0 License. See LICENSE for details.