View Synthesis: From NeRF to 3D Gaussian Splatting

Overview

Photorealistic 3D scene reconstruction from 2D images is one of the most challenging problems in computer vision. Traditional methods reconstruct explicit geometry (meshes, point clouds), whereas modern neural rendering approaches learn the scene representation directly from images by optimizing through a differentiable renderer. This project explores two such techniques, Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), implementing both from scratch with a focus on practical deployment and real-time performance.

The goal was to build a complete pipeline—from multi-view image capture to interactive 3D visualization—that demonstrates the evolution of view synthesis technology and provides hands-on experience with state-of-the-art neural rendering techniques.


The Challenge: Novel View Synthesis

Given a sparse set of images from different viewpoints, how can we synthesize photorealistic images from arbitrary camera positions? This problem, known as novel view synthesis, requires understanding:

  1. Scene Geometry: Where are objects located in 3D space?
  2. Appearance Modeling: How do materials interact with light?
  3. View-Dependent Effects: How do reflections and specularity change with viewpoint?
  4. Computational Efficiency: Can we render in real-time?

Traditional approaches like Structure-from-Motion (SfM) + Multi-View Stereo (MVS) reconstruct explicit geometry but struggle with:

  • Fine detail capture (thin structures, hair, foliage)
  • View-dependent appearance (specularities, reflections)
  • Completeness (holes in reconstruction)

Neural methods address these limitations by learning continuous volumetric representations.


Approach 1: Neural Radiance Fields (NeRF)

Core Concept

NeRF represents scenes as continuous 5D functions that map:

  • Input: 3D position $(x, y, z)$ and viewing direction $(\theta, \phi)$
  • Output: Volume density $\sigma$ and view-dependent RGB color $(r, g, b)$
\[F_\Theta : (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)\]

Where $\Theta$ represents the weights of a Multi-Layer Perceptron (MLP) neural network.

Volume Rendering with Ray Marching

For each pixel in the target image, we:

  1. Cast a ray from the camera through the pixel
  2. Sample points along the ray at intervals $t_i$
  3. Query the MLP at each sample point
  4. Accumulate color using volume rendering:
\[C(\mathbf{r}) = \sum_{i=1}^{N} T_i \cdot \alpha_i \cdot \mathbf{c}_i\]

Where:

  • $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$ is the accumulated transmittance
  • $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ is the opacity contribution
  • $\delta_i = t_{i+1} - t_i$ is the distance between samples
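
The accumulation step can be sketched in a few lines of NumPy. This is a minimal illustration for a single ray, not the project's training code: the function name `composite_ray` is hypothetical, and it assumes the caller passes N+1 sample boundaries so that every sample has a well-defined $\delta_i$.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Discrete volume rendering along one ray.

    sigmas: (N,) densities, colors: (N, 3) RGB values,
    t_vals: (N + 1,) sample boundaries along the ray (an assumption of this sketch).
    """
    deltas = np.diff(t_vals)                       # delta_i = t_{i+1} - t_i
    alphas = 1.0 - np.exp(-sigmas * deltas)        # alpha_i = 1 - exp(-sigma_i * delta_i)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j), i.e. an exclusive cumulative sum
    accum = np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]])
    transmittance = np.exp(-accum)
    weights = transmittance * alphas               # contribution of each sample
    return weights @ colors, weights

# Example: two empty samples followed by two dense red samples.
sigmas = np.array([0.0, 0.0, 50.0, 50.0])
colors = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 0, 0]], dtype=float)
t_vals = np.linspace(2.0, 6.0, 5)
rgb, weights = composite_ray(sigmas, colors, t_vals)
```

In this example the empty samples receive zero weight and almost all of the weight lands on the first dense sample, so the pixel comes out red. The per-sample `weights` are also what the hierarchical sampling stage reuses.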

Positional Encoding

An MLP fed raw 3D coordinates is biased toward smooth, low-frequency functions and tends to blur fine detail. We therefore apply positional encoding to map each input coordinate to a higher-dimensional space:

\[\gamma(p) = \left[\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right]\]

This enables the network to learn fine geometric details and sharp textures.
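
As a rough sketch, the encoding $\gamma$ can be applied to every coordinate of a batch of points as follows. The function name `positional_encoding` is an illustrative choice, and the output ordering (all frequencies of x, then y, then z) is an assumption of this sketch rather than a requirement of the method.

```python
import numpy as np

def positional_encoding(p, num_freqs):
    """Apply gamma(.) to each coordinate of p, where p has shape (..., D)."""
    p = np.asarray(p, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi         # 2^k * pi for k = 0..L-1
    angles = p[..., None] * freqs                         # shape (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # shape (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)                 # shape (..., D * 2L)

position_features = positional_encoding(np.zeros((5, 3)), num_freqs=10)
direction_features = positional_encoding(np.zeros((5, 3)), num_freqs=4)
```

With L=10 a 3D position expands to 3 × 2 × 10 = 60 features, and with L=4 a 3D direction vector expands to 24 features.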

Hierarchical Sampling Strategy

To improve efficiency, NeRF uses two networks:

  • Coarse network: Samples uniformly along the ray
  • Fine network: Focuses sampling on regions with high density (where objects exist)

This concentrates samples where they matter and cuts the computation wasted on empty space by roughly 2-3x.
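
One common way to focus the fine samples is inverse transform sampling from the piecewise-constant distribution defined by the coarse network's weights. The sketch below assumes that approach; `sample_pdf` and its arguments are illustrative names, and the small epsilon is only there to avoid division by zero in empty bins.

```python
import numpy as np

def sample_pdf(bin_edges, weights, num_samples, rng):
    """Draw depths along a ray in proportion to the coarse weights.

    bin_edges: (N + 1,) depth boundaries, weights: (N,) coarse weights per bin.
    """
    pdf = (weights + 1e-5) / np.sum(weights + 1e-5)       # normalize, avoid zero bins
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])         # shape (N + 1,)
    u = rng.uniform(0.0, 1.0, size=num_samples)
    idx = np.searchsorted(cdf, u, side="right") - 1       # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])     # position inside the bin
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])

# Example: the coarse pass found density only in the bin [4.5, 5.0].
edges = np.linspace(2.0, 6.0, 9)
coarse_weights = np.zeros(8)
coarse_weights[5] = 1.0
fine_depths = sample_pdf(edges, coarse_weights, 1000, np.random.default_rng(0))
```

Nearly all fine samples fall inside the occupied bin, which is the behavior the fine network relies on.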

Figure 1: NeRF synthesizing novel views from a learned volumetric representation. Notice the smooth camera transitions and view-dependent lighting effects.


Approach 2: 3D Gaussian Splatting

Motivation: The Need for Speed

While NeRF produces stunning results, it’s prohibitively slow:

  • Training: 24-48 hours on high-end GPUs
  • Rendering: 10-30 seconds per frame

For interactive applications (VR, gaming, robotics), we need real-time rendering (30+ FPS).

Core Innovation: Explicit 3D Gaussians

Instead of an implicit neural field, 3DGS represents scenes as a collection of 3D Gaussian primitives. Each Gaussian is defined by:

Position: Center location $\mu \in \mathbb{R}^3$

Covariance: 3D shape defined by covariance matrix $\Sigma$:

\[\Sigma = R S S^T R^T\]

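Where $R$ is a rotation matrix (stored as a unit quaternion during optimization) and $S$ is a diagonal scaling matrix. This factorization keeps $\Sigma$ symmetric and positive semi-definite no matter how the parameters are updated.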

Opacity: Opacity value $\alpha \in [0, 1]$ (0 is fully transparent, 1 is fully opaque)

Spherical Harmonics: View-dependent color encoded as SH coefficients up to degree 3, capturing view-dependent effects efficiently.
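
To make the covariance parameterization concrete, the following sketch builds $\Sigma = R S S^T R^T$ from a quaternion and three scale values. The helper names and the (w, x, y, z) quaternion ordering are assumptions of this sketch.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix, normalizing it first."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q, scale):
    """Sigma = R S S^T R^T for a quaternion q and per-axis scales."""
    R = quat_to_rotmat(np.asarray(q, dtype=np.float64))
    S = np.diag(np.asarray(scale, dtype=np.float64))
    return R @ S @ S.T @ R.T

axis_aligned = covariance([1.0, 0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
rotated = covariance([0.3, -0.5, 0.7, 0.1], [1.0, 2.0, 3.0])
```

Rotation changes the orientation of the ellipsoid but not its extent: both matrices have eigenvalues 1, 4, and 9 (the squared scales).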

Differentiable Rasterization

The rendering process is fully differentiable:

  1. Project Gaussians to 2D screen space using camera parameters
  2. Sort by depth for correct alpha blending (front-to-back, nearest Gaussian first)
  3. Rasterize using tile-based rendering:

For each pixel, blend overlapping Gaussians:

\[C = \sum_{i \in \mathcal{N}} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j)\]

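Where $\mathcal{N}$ is the set of Gaussians overlapping the pixel, sorted front to back, and $\alpha_i$ is each Gaussian's opacity weighted by its projected 2D falloff at that pixel.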
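
For a single pixel, the blending sum can be written as a short loop. This is a CPU sketch for illustration only (the real rasterizer runs per tile in CUDA); `blend_pixel` is a hypothetical name, and the early-exit threshold is an assumed value.

```python
import numpy as np

def blend_pixel(colors, alphas, depths, min_transmittance=1e-4):
    """Front-to-back alpha blending of the Gaussians covering one pixel."""
    order = np.argsort(depths)                 # nearest Gaussian first
    pixel = np.zeros(3)
    transmittance = 1.0                        # running product of (1 - alpha_j)
    for i in order:
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= 1.0 - alphas[i]
        if transmittance < min_transmittance:  # pixel is effectively opaque; stop early
            break
    return pixel

# Example: an opaque blue Gaussian behind a half-transparent red one.
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
alphas = np.array([1.0, 0.5])
depths = np.array([2.0, 1.0])
pixel = blend_pixel(colors, alphas, depths)
```

The red Gaussian contributes half of the pixel and lets half of the light through, so the result is an even mix of red and blue.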

Key Advantage: This entire pipeline runs on GPU in CUDA, enabling real-time rendering at 30-100 FPS.

Adaptive Density Control

During training, Gaussians undergo densification and pruning:

  • Clone: Duplicate small Gaussians in under-reconstructed regions (high view-space positional gradient)
  • Split: Divide large Gaussians in over-reconstructed regions into smaller ones
  • Prune: Remove Gaussians with low opacity (< 0.005)

This dynamic optimization balances quality and efficiency.
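
The decision logic behind densification and pruning can be sketched as three boolean masks. The gradient and scale thresholds below are illustrative assumptions, not values taken from this project; only the opacity cutoff of 0.005 comes from the list above.

```python
import numpy as np

def density_control_masks(grad_norm, max_scale, opacity,
                          grad_thresh=2e-4, scale_thresh=0.01, min_opacity=0.005):
    """Return (clone, split, prune) masks for a set of Gaussians.

    grad_norm: accumulated positional gradient magnitude per Gaussian,
    max_scale: largest axis scale per Gaussian, opacity: per-Gaussian opacity.
    """
    needs_densify = grad_norm > grad_thresh
    clone = needs_densify & (max_scale <= scale_thresh)   # small Gaussian: duplicate it
    split = needs_densify & (max_scale > scale_thresh)    # large Gaussian: divide it
    prune = opacity < min_opacity                         # nearly invisible: remove it
    return clone, split, prune

grad_norm = np.array([1e-3, 1e-3, 1e-5, 1e-5])
max_scale = np.array([0.001, 0.5, 0.001, 0.5])
opacity = np.array([0.9, 0.9, 0.001, 0.9])
clone, split, prune = density_control_masks(grad_norm, max_scale, opacity)
```

In this toy input the first Gaussian is cloned, the second is split, the third is pruned, and the fourth is left untouched.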


Implementation Pipeline

1. Data Preprocessing with COLMAP

Both methods require multi-view images with known camera poses. We use COLMAP, an SfM pipeline that:

  • Extracts SIFT features from images
  • Matches features across views
  • Estimates camera intrinsics and extrinsics
  • Generates sparse point cloud initialization

# Run full COLMAP pipeline
colmap feature_extractor --database_path database.db --image_path images/
colmap exhaustive_matcher --database_path database.db
# The mapper expects the output directory to exist
mkdir -p sparse
colmap mapper --database_path database.db --image_path images/ --output_path sparse/

2. Training Configuration

NeRF Training Hyperparameters:

  • MLP Architecture: 8 layers, 256 units per layer
  • Positional Encoding: L=10 for position, L=4 for direction
  • Learning Rate: 5e-4 with exponential decay
  • Batch Size: 1024 rays
  • Training Time: ~24 hours on RTX 2060
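
An exponentially decaying learning rate of this kind can be written as a one-line schedule. The decay factor and horizon below are hypothetical values chosen for illustration (a tenfold reduction over 250K steps); only the initial rate of 5e-4 comes from the list above.

```python
def exponential_decay_lr(step, lr_init=5e-4, decay_rate=0.1, decay_steps=250_000):
    """Learning rate after `step` updates; falls by `decay_rate` every `decay_steps`."""
    return lr_init * decay_rate ** (step / decay_steps)

lr_start = exponential_decay_lr(0)
lr_end = exponential_decay_lr(250_000)
```

Under these assumed settings the rate starts at 5e-4 and reaches 5e-5 at the end of the horizon.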

3DGS Training Hyperparameters:

  • Initial Gaussians: ~100K from COLMAP sparse reconstruction
  • Optimization: Adam with custom learning rates per parameter
    • Position: 1.6e-4
    • Opacity: 0.05
    • Scaling: 5e-3
    • Rotation: 1e-3
  • Training Time: ~30 minutes (7K iterations) on RTX 2060

3. Loss Functions

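Both implementations optimize a photometric reconstruction loss of the following form (the original NeRF formulation uses a plain mean squared error on pixel colors, while the combined loss comes from 3DGS):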

\[\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_{SSIM}\]

Where:

  • $\mathcal{L}_1 = \|C_{pred} - C_{gt}\|_1$ measures pixel-wise difference
  • $\mathcal{L}_{SSIM} = 1 - \mathrm{SSIM}(C_{pred}, C_{gt})$ penalizes structural dissimilarity

For 3DGS, we use: $\lambda_1 = 0.8, \lambda_2 = 0.2$
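
The weighted loss can be sketched as below. To keep the example short, it makes two simplifying assumptions: SSIM is computed from global image statistics rather than the usual sliding Gaussian window, and the L1 term is averaged over pixels. The function names are illustrative.

```python
import numpy as np

def global_ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM from whole-image statistics (no sliding window)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def photometric_loss(pred, gt, lambda_1=0.8, lambda_2=0.2):
    """lambda_1 * L1 + lambda_2 * (1 - SSIM) for images with values in [0, 1]."""
    l1 = np.abs(pred - gt).mean()
    d_ssim = 1.0 - global_ssim(pred, gt)
    return lambda_1 * l1 + lambda_2 * d_ssim

rng = np.random.default_rng(0)
gt = rng.uniform(0.0, 1.0, size=(32, 32, 3))
noisy = np.clip(gt + rng.normal(0.0, 0.1, size=gt.shape), 0.0, 1.0)
loss_perfect = photometric_loss(gt, gt)
loss_noisy = photometric_loss(noisy, gt)
```

A perfect reconstruction gives a loss of zero, and any deviation raises both terms.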


Experimental Results

Quantitative Comparison

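| Metric | NeRF | 3D Gaussian Splatting |
|---|---|---|
| PSNR (higher is better) | 28.5 dB | 30.2 dB |
| SSIM (higher is better) | 0.89 | 0.94 |
| LPIPS (lower is better) | 0.12 | 0.08 |
| Training Time | 24 hours | 30 minutes |
| Rendering Speed | 0.1 FPS | 60 FPS |
| Memory (Training) | 8 GB | 12 GB |
| Final Model Size | 5 MB | 500 MB |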

Key Observations:

  • 3DGS achieves 48x faster training and 600x faster rendering
  • Quality: 3DGS produces sharper results with better high-frequency detail
  • Trade-off: Larger model size for 3DGS due to explicit Gaussian storage
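
The speedup figures and the model-size gap follow from simple arithmetic, sketched below. The per-Gaussian layout (position, scale, quaternion, opacity, and degree-3 SH color stored as uncompressed 32-bit floats) is an assumption of this sketch, so the Gaussian count is an estimate rather than a measured value.

```python
def floats_per_gaussian(sh_degree=3):
    """Parameters stored per Gaussian under the assumed layout."""
    sh_coeffs = 3 * (sh_degree + 1) ** 2      # (degree + 1)^2 coefficients per RGB channel
    return 3 + 3 + 4 + 1 + sh_coeffs          # position, scale, quaternion, opacity, SH

def estimated_gaussian_count(model_size_mb, sh_degree=3, bytes_per_float=4):
    """Rough number of Gaussians that fit in a model of the given size."""
    bytes_per_gaussian = floats_per_gaussian(sh_degree) * bytes_per_float
    return int(model_size_mb * 1e6 / bytes_per_gaussian)

training_speedup = (24 * 60) / 30             # 24 hours vs 30 minutes
rendering_speedup = 60 / 0.1                  # 60 FPS vs 0.1 FPS
gaussian_count = estimated_gaussian_count(500)
```

Under these assumptions each Gaussian takes 59 floats (236 bytes), so a 500 MB model corresponds to roughly two million Gaussians, most of that spent on the 48 spherical harmonics coefficients.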

Visual Quality Analysis

Strengths of NeRF:

  • Compact representation (small model size)
  • Smooth interpolation between views
  • No artifacts from discrete primitives

Strengths of 3DGS:

  • Crisp edges and fine details (hair, text, mesh patterns)
  • Accurate view-dependent effects (specularities, reflections)
  • Real-time performance enables interactive applications

Figure 2: 3D Gaussian Splatting rendering quality comparison. The explicit Gaussian representation captures fine details with significantly faster rendering times compared to NeRF.


Real-Time Visualization

Both implementations integrate with SIBR Viewers (System for Image-Based Rendering), providing:

  • Interactive camera navigation (WASD + mouse)
  • Real-time rendering at 30-60 FPS (for 3DGS)
  • Debug visualization modes (depth maps, normals, point clouds)

Controls:

  • W/A/S/D: Camera movement
  • Mouse: Look around
  • Q/E: Vertical movement
  • F: Toggle full-screen
  • Tab: Show/hide UI

Note: A real-time rendering demo video (gs3d_real_time_rendering.mp4) is available in the project repository, showcasing the interactive performance of the 3D Gaussian Splatting implementation at 60+ FPS on an RTX 2060.


Deployment Pipeline

Docker Containerization & Setup

To simplify the complex dependency stack (CUDA, PyTorch, COLMAP, custom CUDA kernels), I created complete Docker-based workflows for both implementations. The repository is available at: github.com/rohitDey23/view_synthesis

3D Gaussian Splatting Setup (Docker)

The 3DGS implementation uses a fully containerized environment with all dependencies pre-configured:

1. Clone and Build:

# Clone the repository and checkout the gaussian_splatting branch
git clone https://github.com/rohitDey23/view_synthesis.git
cd view_synthesis
git checkout gaussian_splatting

# Build Docker image (~10 minutes)
docker build -t view_synthesis .

2. Run Container:

# Navigate to model directory for bind mounting
cd model

# Launch container with GPU support
docker run --rm -it --name view_synth \
    --gpus all \
    -e DISPLAY=host.docker.internal:0 \
    -e LIBGL_ALWAYS_INDIRECT=0 \
    --mount type=bind,src="$(pwd)",dst=/home/user_dev/code_ws/model/ \
    --runtime=nvidia \
    view_synthesis bash

3. Train 3DGS:

# Inside container: activate conda environment
conda activate view_synthesis

# Download dataset (COLMAP format required)
cd data
wget https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/datasets/input/tandt_db.zip
unzip tandt_db.zip && rm tandt_db.zip

# Install custom CUDA submodules (from the workspace root, so the relative paths resolve)
cd /home/user_dev/code_ws/
pip3 install src/submodules/diff-gaussian-rasterization
pip3 install src/submodules/simple-knn

# Train the model
python3 src/train.py -s ./data/train_data -m ./model/

4. Render Results:

# Render test views
python3 src/render.py -s ./data -m ./model/

# Create GIF from renders (optional)
python3 src/create_gif.py <ground/truth/path/> <renders/path/> <output/path/filename.gif> --duration 4

NeRF Setup (Using UV)

The NeRF implementation uses UV (modern Python package manager) for dependency management, providing faster and more reliable installations:

1. Clone and Setup:

# Clone the repository and checkout the nerf branch
git clone https://github.com/rohitDey23/view_synthesis.git
cd view_synthesis
git checkout nerf

# Initialize UV project (UV handles all dependencies)
uv init && uv sync

# Activate the virtual environment
source .venv/bin/activate

2. Train NeRF:

# Training with default configuration
python train.py --config configs/lego.txt

# Training time: ~24 hours on RTX 2060
# Output: Saved in logs/ directory

Key Differences:

  • 3DGS: Docker-based, ~30min training, real-time rendering
  • NeRF: UV-based, ~24hr training, slower rendering but compact model

Both implementations support COLMAP for camera pose estimation and include SIBR viewers for interactive visualization.


Technical Challenges & Solutions

Challenge 1: CUDA Memory Management

Problem: Training crashes with OOM errors on consumer GPUs (6-12 GB VRAM)

Solution:

  • Gradient checkpointing for NeRF (reduces activation memory by roughly 50%, at the cost of extra compute)
  • Dynamic batch sizing based on available memory
  • Mixed precision training (FP16) with gradient scaling
  • Offload optimizer states to CPU when needed
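
Dynamic batch sizing can be as simple as halving the batch whenever a step runs out of memory. The sketch below is framework-agnostic: `run_with_backoff` and `step_fn` are hypothetical names, and it catches Python's built-in `MemoryError` as a stand-in for whatever out-of-memory exception the GPU framework in use raises.

```python
def run_with_backoff(step_fn, batch_size, min_batch_size=64):
    """Call step_fn(batch_size), halving the batch on out-of-memory errors.

    Returns (result, batch_size_that_worked). Re-raises if even the
    smallest allowed batch does not fit.
    """
    while True:
        try:
            return step_fn(batch_size), batch_size
        except MemoryError:
            if batch_size // 2 < min_batch_size:
                raise
            batch_size //= 2

# Example with a fake training step that only fits 300 rays at a time.
def fake_step(num_rays):
    if num_rays > 300:
        raise MemoryError("simulated OOM")
    return f"trained on {num_rays} rays"

result, used_batch = run_with_backoff(fake_step, batch_size=1024)
```

Starting from 1024 rays, the fake step fails at 1024 and 512 and succeeds at 256, which then becomes the batch size for subsequent steps.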

Challenge 2: Gaussian Splatting Instabilities

Problem: Gaussians grow unbounded or collapse during training

Solution:

  • Adaptive learning rate scaling based on Gaussian size
  • Regularization: Limit maximum scale to 10% of scene extent
  • Opacity reset every 3000 iterations (forces re-evaluation)
  • Gradient clipping (maximum gradient norm of 2.0)
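
Two of these safeguards, the scale limit and the periodic opacity reset, are easy to express directly. This is a sketch with hypothetical helper names; the reset ceiling of 0.01 is an assumed value, while the 10% limit and the 3000-iteration interval come from the list above.

```python
import numpy as np

def clamp_scales(scales, scene_extent, max_fraction=0.1):
    """Limit every Gaussian axis scale to a fraction of the scene extent."""
    return np.minimum(scales, max_fraction * scene_extent)

def maybe_reset_opacity(opacity, iteration, interval=3000, ceiling=0.01):
    """Every `interval` iterations, cap opacities so each Gaussian must re-earn its weight."""
    if iteration > 0 and iteration % interval == 0:
        return np.minimum(opacity, ceiling)
    return opacity

scales = np.array([[0.05, 0.2, 3.0], [0.01, 0.01, 0.01]])
clamped = clamp_scales(scales, scene_extent=10.0)
opacity = np.array([0.9, 0.005, 0.5])
after_reset = maybe_reset_opacity(opacity, iteration=3000)
unchanged = maybe_reset_opacity(opacity, iteration=3001)
```

After a reset, Gaussians that still matter regain opacity through optimization, while the rest stay near zero and fall below the pruning threshold.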

Challenge 3: COLMAP Failure on Challenging Scenes

Problem: SfM fails on low-texture, repetitive, or reflective surfaces

Solution:

  • Lower the SIFT peak threshold (or raise the maximum feature count) so that more features are detected on low-texture surfaces
  • Use sequential matching instead of exhaustive (for ordered captures)
  • Mask out problematic regions (mirrors, windows) manually
  • Provide approximate camera poses via ARKit/ARCore when available


Future Directions

Several exciting avenues remain unexplored:

Technical Extensions

  • Dynamic Scenes: Extend to video with temporal consistency (4D Gaussian Splatting)
  • Large-Scale Scenes: City-scale reconstruction using Block-NeRF concepts
  • Faster NeRF Variants: Integrate Instant-NGP for competitive speed
  • Semantic Understanding: Add semantic segmentation for object-level editing

Application Areas

  • VR/AR: Real-time rendering for immersive experiences
  • Robotics Navigation: Photorealistic simulation environments
  • Cultural Heritage: Digital preservation of historical sites
  • E-commerce: Interactive 3D product visualization

Lessons Learned

NeRF Insights

  • Hierarchical sampling is critical: Provides a roughly 2-3x speedup with no visible quality loss
  • Positional encoding frequency matters: L=10 for geometry, L=4 for appearance
  • Convergence is slow but steady: Always train for 200K+ iterations

3D Gaussian Splatting Insights

  • Initialization quality is crucial: Poor COLMAP reconstruction → poor final result
  • Densification timing: Start at iteration 500, stop at 15K to avoid overfitting
  • Opacity reset prevents mode collapse: Essential for stable training
  • View-dependent effects need high SH degree: Degree 3 captures most specularities

General Best Practices

  • Always validate COLMAP results before starting expensive training
  • Use learning rate warmup to stabilize early training
  • Log intermediate renders every 1K iterations for debugging
  • Checkpoint frequently: Training failures are common with custom CUDA ops

Conclusion

This project demonstrates the rapid evolution of neural view synthesis, from the groundbreaking but slow NeRF to the real-time capable 3D Gaussian Splatting. While NeRF remains valuable for its compact representation and theoretical elegance, 3DGS has emerged as the practical choice for applications demanding interactivity.

The field is moving incredibly fast—techniques presented at SIGGRAPH 2023 are already being surpassed by newer methods in 2024. Yet the fundamental principles—differentiable rendering, volumetric scene representations, and multi-view consistency—remain constant and will continue to drive innovation in 3D computer vision.

For researchers and practitioners entering this space, I hope this implementation serves as both a learning resource and a practical starting point for building next-generation view synthesis systems.


Acknowledgments: This work builds upon the excellent open-source implementations from GRAPHDECO Research Group at Inria and the broader neural rendering community. Special thanks to the authors for making their code publicly available.

This post is licensed under CC BY 4.0 by the author.