TRELLIS.2 Devlog: Running Microsoft's 3D Diffusion Model on a Single GPU


Running Microsoft's TRELLIS.2 3D Diffusion Model on Consumer Hardware: A Technical Deep Dive

CUDA PyTorch 3D Diffusion ComfyUI TRELLIS.2

Overview

This project documents the technical journey of deploying TRELLIS.2, Microsoft's 4-billion-parameter image-to-3D diffusion model, on consumer hardware (RTX 3090 24GB). What presented as a standard installation task evolved into a multi-week engineering challenge involving dependency resolution, CUDA compilation, and systematic debugging across the PyTorch/CUDA stack.

~60s Per Asset Generation
5 CUDA Extensions Built
12 Bugs Fixed
1,667 Lines in nodes.py
Hardware: RTX 3090 (24GB) / 64GB RAM / Windows 11 + WSL2 Ubuntu
Repository: ComfyUI-TRELLIS.2

The Goal

The objective was to deploy TRELLIS.2's image-to-3D pipeline locally and establish an automation framework for batch processing. A local deployment offers distinct advantages: data sovereignty, predictable per-asset costs, and unrestricted throughput without API rate limits or subscription fees.

Image Placeholder: Sample input concept art and resulting 3D mesh output (before/after comparison of the 2D image to 3D model transformation)

TRELLIS.2 generates full 3D meshes with PBR textures from single 2D images. The implementation complexity stems from its dependency on five custom CUDA extensions that are not distributed as pre-built binaries, requiring compilation from source against specific PyTorch and CUDA versions.

Act I: Dependency Resolution on Windows

The initial deployment target was Windows 11, the standard OS for modeling and development workflows. The Electron-based ComfyUI desktop application served as the starting point.

Step 1: Clone repo into ComfyUI's custom_nodes 
Step 2: Download the 16GB of model weights 
Step 3: Start ComfyUI and... UNKNOWN NODES.

ComfyUI detected the folder but failed to import the module. The dependency chain was broken at the first import statement.

The Missing Dependencies

The embedded Electron environment lacked required Python packages:

ModuleNotFoundError: No module named 'trimesh'
ModuleNotFoundError: No module named 'accelerate'
ModuleNotFoundError: No module named 'rembg'

Resolution required identifying the isolated Python environment path within ComfyUI's embedded distribution and manually installing each dependency into that specific location.
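As a sketch, the installs look like this once the embedded interpreter is located (the path below is hypothetical and varies by ComfyUI desktop version):

```shell
# Hypothetical embedded-interpreter location; find the one inside your ComfyUI install
"C:\Users\you\Documents\ComfyUI\.venv\Scripts\python.exe" -m pip install trimesh accelerate rembg
```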

The critical blocker was Flash Attention, which lacks pre-built Windows wheels and requires compilation from source. The solution was to implement an SDPA fallback directly in the codebase:

import torch

try:
    import flash_attn
    FLASH_ATTN_AVAILABLE = True
except ImportError:
    FLASH_ATTN_AVAILABLE = False

def scaled_dot_product_attention(q, k, v):
    if FLASH_ATTN_AVAILABLE:
        return flash_attn.flash_attn_func(q, k, v)
    else:
        # PyTorch native, works everywhere, just slower
        return torch.nn.functional.scaled_dot_product_attention(q, k, v)

This enabled the first successful generation on Windows. However, operating without Flash Attention consumed excessive VRAM, causing recurrent out-of-memory crashes. The Windows environment imposed memory constraints that made stable operation impossible. A platform migration was necessary.

Act II: The Migration to WSL2

GPU OOM (Out of Memory) crashes persisted under Windows. TRELLIS.2 loads 5 separate DiT models sequentially during inference. Memory profiling revealed the following allocation pattern on a 24GB RTX 3090:

  • DINOv3 (~2GB)
  • Sparse DiT (~4GB)
  • Shape Cascade (~4GB)
  • Texture DiT (~4GB)
  • Decoders (~6GB)
  • KV Cache (~3-4GB)

With ~22GB peak usage, headroom was minimal. Flash Attention (Linux-only) provides 2-4x memory reduction for attention operations. This necessitated migration to WSL2.

Windows 11 (browser UI & file storage on the L: drive)
    ↓ run_wsl.bat
WSL2 Ubuntu (Python 3.12, ComfyUI, Torch 2.5.1)
    ↓ CUDA passthrough
RTX 3090 (inference & custom kernels)

The result: Python and ComfyUI running natively in WSL2 Ubuntu, bypassing Windows memory overhead and providing the Linux environment required for Flash Attention compilation.

Act III: The Compilation Marathon

Server startup was only the first milestone. Standard Python execution was insufficient because TRELLIS.2 relies on custom sparse convolutions and hardware-level memory optimizations. These are C++ and CUDA kernels distributed as source code, not pre-compiled packages.

Flash Attention: The --no-build-isolation Solution

Standard pip install flash-attn fails because Flash Attention requires custom C++ and CUDA compilation. By default, pip isolates the build process and downloads its own PyTorch version for compilation. The resulting binaries are then linked against a different PyTorch version than the runtime environment, causing fatal C++ ABI mismatch crashes on import.

The solution required three coordinated changes:

  1. Downgrade the environment explicitly to PyTorch 2.5.1+cu124, the version against which all five custom extensions compile and link cleanly.
  2. Configure CUDA_HOME in .bashrc to point to the CUDA 12.4 toolkit, ensuring the C++ compiler links against the correct NVIDIA libraries.
  3. Bypass pip's build isolation to compile Flash Attention against the installed PyTorch version:
pip install flash-attn --no-build-isolation
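Taken together, the three changes can be sketched as a shell sequence (the toolkit path is an assumption; adjust it to wherever CUDA 12.4 lives on your system):

```shell
# 1. Pin PyTorch to the 2.5.1 build compiled against CUDA 12.4
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# 2. Point the build at the CUDA 12.4 toolkit (assumed install path)
echo 'export CUDA_HOME=/usr/local/cuda-12.4' >> ~/.bashrc
source ~/.bashrc

# 3. Compile flash-attn against the PyTorch installed above, not a throwaway copy
pip install flash-attn --no-build-isolation
```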

The Custom Extensions

With Flash Attention operational, the TRELLIS.2-specific extensions remained, and each required individual troubleshooting before it would compile and import cleanly.

Act IV: The Bug Marathon

Getting a single successful mesh generation working was only half the battle. The real goal was unattended batch automation: the ability to feed the pipeline 50+ images and let it run overnight without human intervention.

A pipeline that crashes even 10% of the time is useless for batch processing. If job 5 of 50 fails at 2 AM, the entire queue stops. 100% stability was non-negotiable.

The compiled CUDA extensions worked, but the runtime Python code had a cascade of latent bugs. Only after fixing all twelve could the pipeline run truly unattended:

Device Mismatch Crash

The tensor was generated on the CPU while the mesh lived on the GPU.

Fix: Moved tensor to .cuda() explicitly before operations.

Grid Sample Crash

attr_volume and coords on wrong devices inside nodes.py.

Fix: Added explicit device movement for attributes right before the grid sample call.

FlexGEMM Autotuning

Triton's do_bench API changed its signature between versions, breaking flex_gemm's autotuner.

Fix: Patched autotuner.py to match the new Triton signature.

CUDNN Error

Version mismatch between Torch and CUDNN requirements.

Fix: Downgraded to PyTorch 2.5.1+cu124 bundle that matched our CUDNN build.

"Floating Triangles"

Generated geometry was pure garbage due to a torchaudio conflict.

Fix: Removed conflicting torchaudio package that was overriding core tensor math.

The Dummy Cube

When o_voxel failed to import, the pipeline silently returned a primitive 12-face cube instead of raising an error, then textured that cube perfectly, masking the failure.

Fix: Shimmed the missing o_voxel function in fdg_vae.py and forced a hard crash instead of the silent fallback.
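The guard can be sketched as a generic import helper (require_extension is an illustrative name; the actual shim lives in fdg_vae.py and is specific to o_voxel):

```python
import importlib

def require_extension(name: str):
    """Import a compiled extension, or crash loudly instead of letting the
    pipeline fall back to placeholder geometry (the 'dummy cube' failure)."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise RuntimeError(
            f"Required extension '{name}' failed to import; "
            "refusing to substitute dummy geometry."
        ) from exc
```

A hard RuntimeError at load time is far cheaper than discovering a perfectly textured cube in the morning's batch output.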

Invisible Texture

Valid geometry but completely transparent texture map.

Fix: Hardcoded the GLTF material property to OPAQUE during the final export serialization step.

Black Texture Map

Coordinate mismatch in grid_sample_3d (XYZ instead of ZYX formatting).

Fix: Flipped the tensor dimensions before passing them to the grid sampler to match the expected ZYX format.
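The reordering itself is trivial; a list-based illustration (the real fix operates on torch tensors immediately before the grid_sample_3d call):

```python
def xyz_to_zyx(coords):
    """Reverse per-point coordinate order from (x, y, z) to the (z, y, x)
    layout the grid sampler expects."""
    return [(z, y, x) for (x, y, z) in coords]
```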

The Black Cube Edge Case

Coordinates evaluating at exactly 1024.0 went out of bounds and caused a crash.

Fix: Added a clamping function that caps coordinate values at 1023.999, strictly below the bound.
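A minimal sketch of that clamp (function and argument names are illustrative):

```python
def clamp_coords(values, grid_size=1024, eps=1e-3):
    """Keep sample coordinates strictly below the grid upper bound so a
    value of exactly 1024.0 can no longer index out of bounds."""
    upper = grid_size - eps
    return [min(max(v, 0.0), upper) for v in values]
```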

Minecraft Output

Meshes were suddenly blocky due to broken remeshing scripts.

Fix: Disabled the broken custom simplifier and reverted to using the native VAE mesh output.

Empty GLB Files

52KB files with no geometry; the xatlas UV-unwrapping step was failing.

Fix: Forced the repair_non_manifold_edges flag to True so the UV unwrapper wouldn't instantly crash on complex geometry.

Over-Cleaning Geometry

A function designed to remove stray floating triangles was so aggressive that it classified the entire mesh as a floating component and deleted it before saving.

Fix: Adjusted the remove_small_connected_components threshold to only strip components that were less than 5% of the total bounding volume.
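The corrected threshold logic, sketched on a hypothetical component representation (the real remove_small_connected_components works on mesh connectivity):

```python
def filter_small_components(components, min_fraction=0.05):
    """Drop connected components smaller than min_fraction of the total
    volume, instead of treating the whole mesh as 'floating' debris."""
    total = sum(c["volume"] for c in components)
    return [c for c in components if c["volume"] / total >= min_fraction]
```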


Act V: Production Automation

With single-image generation stable, the next phase was building batch_process.py, a Python script that interfaces with ComfyUI's REST API to process concept art queues without manual intervention.

wsl, batch_process.py
Found 47 images to process
================================================
[1/47] Copied: character_warrior.png
Queued: eaf71694-661c-42ca-abe0-a76e8b280c60
COMPLETED in 62s
[2/47] SKIP (already done): prop_crystal.png
Memory: Mem: 31Gi 22Gi 812Mi
Cooling down 15s before next job...
================================================
BATCH PROCESSING COMPLETE
Total: 47 | Success: 44 | Skipped: 3 | Failed: 0

The 15-second cooldown was empirically determined. Without this delay, consecutive jobs triggered OOM errors. The GPU requires time to release CUDA memory after each generation. This behavior was not documented in any official materials.
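A stripped-down sketch of the queue-plus-cooldown loop (ComfyUI's POST /prompt endpoint and its prompt_id response are the real API; run_batch, the submit callback, and the cooldown default are illustrative):

```python
import json
import time
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI address

def queue_prompt(workflow: dict) -> str:
    """Submit a workflow graph to ComfyUI and return its prompt id."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]

def run_batch(images, submit, cooldown_s=15):
    """Submit each image in turn, pausing between jobs so the GPU has time
    to release CUDA memory (the empirically required cooldown)."""
    results = []
    for i, image in enumerate(images, start=1):
        results.append(submit(image))
        if i < len(images):
            time.sleep(cooldown_s)
    return results
```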

Results

Double-clicking run_wsl.bat now initializes a fully functional automated 3D generation pipeline in approximately 10 seconds.

Image Placeholder: Generated 3D model gallery (grid of sample outputs showing various generated meshes)

The codebase includes pragmatic solutions rather than theoretical elegance. Build scripts contain comments like # The copy that cost 6 hours to discover. The pipeline produces near-production-ready PBR meshes from single images in approximately one minute, operating entirely on local hardware. Full production quality requires additional VRAM headroom.

Technical Learnings

Deploying a 4-billion-parameter diffusion model on consumer hardware yielded several engineering insights:

1. Memory Allocation is Architecture

OOM crashes are not merely operational issues; they dictate system design. Profiling revealed the exact VRAM spike patterns during DiT cascading and showed how ComfyUI offloads weights to system RAM when the VRAM ceiling is reached.

2. CUDA Dependencies Require Version Lock

Pre-built wheels for research code are unreliable. Controlling the compilation environment is essential. Matching PyTorch ABI and Triton versions across five distinct C++ extensions was necessary for stability.

3. Batch Pipelining Transforms Capability

Single-image UI interaction is a demonstration. Scripting the ComfyUI REST API to process unattended batches of 50+ images converts a research tool into a production asset pipeline.

Next Steps

The single RTX 3090 has reached its operational limits. A second RTX 3090 has been acquired, moving the project into multi-GPU territory.

With 48GB of unified VRAM across two cards, the technical roadmap includes:

  • Higher Quality Generations: Memory headroom to run TRELLIS.2's resolution and texture decoders at maximum settings without OOM conditions.
  • SAM-3D Integration: Introducing Segment Anything Model (SAM) into the automated workflow for contextual segmentation and isolation of generated 3D objects.
  • Multi-GPU Distribution: Extending the ComfyUI workflow and Python scripts to distribute model loads across two PCIe devices.
The pipeline is operational. The next phase is scaling.
