Yonsei University AI Society YAI

Instant Neural Graphics Primitives with Multiresolution Hash Encoding

_YAI_ 2022. 9. 26. 22:04

https://arxiv.org/abs/2201.05989

 


*This review was written by 조정빈 (9th cohort) of the 3D team.

1. Introduction

Computer graphics primitives are represented by mathematical functions → MLPs are used as neural graphics primitives, e.g. NeRF, Neural Sparse Voxel Fields, DeepSDF, ACORN

  • The inputs of the neural network need to be encoded (mapped) into higher dimensions to extract high approximation quality from compact models
    • [-] heuristic, structural modifications that complicate training → task-specific, limit GPU performance
  • Multiresolution hash encoding
    • Adaptivity
      • Maps a cascade of grids to fixed-size arrays of feature vectors → no structural updates needed
      • coarse resolutions → 1:1 mapping
      • fine resolutions → a spatial hash function automatically prioritizes the sparse areas with the most important fine detail
    • Efficiency
      • $O(1)$ hash table lookup
      • No pointer-chasing
    • Independent of task
      • Gigapixel images, neural SDFs, NRC (neural radiance caching), NeRF

2. Background and Related Work

Frequency Encodings

  • The Transformer introduced an encoding of scalar positions as a multiresolution sequence of $L \in \mathbb{N}$ sine and cosine functions

$$enc(x) = \Bigl( \sin(2^0x),\sin(2^1x),\dots ,\sin(2^{L-1}x), \ \cos(2^0x),\cos(2^1x),\dots ,\cos(2^{L-1}x) \Bigr)$$
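A minimal NumPy sketch of this frequency encoding (the function name and the default $L$ are my own choices, not from the paper):

```python
import numpy as np

def frequency_encoding(x, L=10):
    """Encode positions x component-wise with L sine and L cosine frequencies."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    freqs = 2.0 ** np.arange(L)                    # 2^0, 2^1, ..., 2^(L-1)
    angles = x[:, None] * freqs[None, :]           # shape (len(x), L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()

print(frequency_encoding(0.5, L=4))
# -> [sin(0.5) sin(1.0) sin(2.0) sin(4.0) cos(0.5) cos(1.0) cos(2.0) cos(4.0)]
```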

Parametric Encodings

  • Arrange additional trainable parameters in an auxiliary data structure, e.g. a grid or a tree, and look up and interpolate these parameters depending on the input vector
  • Larger memory footprint for a smaller computational cost
    • For each gradient step, every parameter of the MLP needs to be updated, but only a small number of the trainable encoding parameters are affected (see the sketch after this list)
    • By reducing the size of the MLP, such parametric models can typically be trained to convergence much faster without sacrificing approximation quality
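To illustrate the gradient-sparsity argument above, here is a toy 1D sketch (the grid size `R`, feature dimension `F`, and names are my own): a lookup with linear interpolation touches only two of the `R` trainable rows, so only those rows receive gradients for a given sample, whereas every weight of the MLP is touched by every sample.

```python
import numpy as np

R, F = 64, 2                       # hypothetical 1D feature grid: R entries, F features each
table = np.zeros((R, F))           # trainable encoding parameters (auxiliary data structure)

def lookup(x):
    """Linearly interpolate the feature vector for x in [0, 1)."""
    xs = x * (R - 1)
    i = int(np.floor(xs))
    w = xs - i
    # Only rows i and i+1 participate, so only they receive gradients for this sample.
    return (1.0 - w) * table[i] + w * table[i + 1]

print(lookup(0.42))                # [0. 0.] until the table has been trained
```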

Coordinate Encoder

  • A large auxiliary coordinate encoder neural network (ACORN) is trained to output dense feature grids around $\textbf{x}$

Sparse Parametric Encodings

  • Dense grids → consume a lot of memory and are limited to a fixed resolution
    • Allocate too much memory to empty space
    • Natural scenes exhibit smoothness → motivates a multiresolution decomposition

Multiresolution Hash Encoding

  • A compact spatial hash table whose size $T$ can be tuned; does not rely on a priori knowledge of the data or on pruning during training
  • Multiple separate hash tables indexed at different resolutions

3. Method

| Parameter | Symbol | Value |
| --- | --- | --- |
| Number of levels | $L$ | $16$ |
| Max. entries per level (hash table size) | $T$ | $2^{14}$ to $2^{24}$ |
| Number of feature dimensions per entry | $F$ | $2$ |
| Coarsest resolution | $N_{min}$ | $16$ |
| Finest resolution | $N_{max}$ | $512$ to $524288$ |

Multi-Resolution Hash Encoding

Given a fully connected neural network $m(\textbf{y};\Phi)$, we are interested in an encoding of its inputs $\textbf{y}=\text{enc}(\textbf{x}; \theta)$ that improves the approximation quality and training speed across a wide range of applications.

Procedure

  1. The input coordinate $\textbf{x} \in \mathbb{R}^d$ is scaled by each level’s grid resolution and rounded down and up
    • $\lfloor \textbf{x}_l \rfloor:= \lfloor \textbf{x} \cdot N_l \rfloor$, $\lceil \textbf{x}_l \rceil := \lceil \textbf{x} \cdot N_l \rceil$
  2. $\lfloor \textbf{x}_l \rfloor , \lceil \textbf{x}_l \rceil$ span a voxel with $2^d$ integer vertices
    • Coarse levels: $(N_l + 1)^d \le T$
      • the mapping is 1:1
    • Fine levels: $(N_l + 1)^d > T$
      • a spatial hash function indexes the array
      • $h(\textbf{x})=\Bigl(\bigoplus^{d}_{i=1} x_i \pi_i \Bigr) \mod T$, where $\oplus$ is a bit-wise XOR and the $\pi_i$ are unique, large prime numbers
      • $h : \mathbb{Z}^d \rightarrow \mathbb{Z}_T$
      • No explicit collision handling; the subsequent MLP $m(\textbf{y};\Phi)$ learns to resolve collisions
  3. Feature vectors are $d$-linearly interpolated according to the relative position of $\textbf{x}$ within its hypercube
  4. The interpolated feature vectors of all $L$ levels, as well as auxiliary inputs $\xi \in \mathbb{R}^E$, are concatenated (a sketch of the full lookup follows the output equation below)

$$\text{enc}(\textbf{x}; \theta) \rightarrow\textbf{y} \in \mathbb{R}^{LF+E}$$
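Putting the procedure together, the following NumPy sketch performs the lookup for a single input coordinate. It is illustrative only: the paper’s implementation is a fully fused CUDA kernel, and the table initialization, function names, and the way the growth factor $b$ (defined in the next subsection) is passed in are my own choices; the primes in the hash follow those listed in the paper.

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)  # primes from the paper

def spatial_hash(coords, T):
    """Hash integer grid coordinates of shape (n, d) to indices in [0, T)."""
    h = np.zeros(coords.shape[0], dtype=np.uint64)
    for i in range(coords.shape[1]):
        h ^= coords[:, i].astype(np.uint64) * PRIMES[i]          # XOR of coordinate * prime
    return (h % np.uint64(T)).astype(np.int64)

def encode(x, tables, N_min=16, b=1.5):
    """x: input in [0, 1)^d. tables: list of L arrays of shape (T, F). Returns (L*F,) features."""
    d = x.shape[0]
    # All 2^d corner offsets of a voxel, e.g. (0,0), (0,1), (1,0), (1,1) for d = 2.
    corners = np.array(np.meshgrid(*[[0, 1]] * d, indexing="ij")).reshape(d, -1).T
    feats = []
    for l, table in enumerate(tables):
        T, F = table.shape
        N_l = int(np.floor(N_min * b ** l))            # grid resolution of level l
        xl = x * N_l
        lo = np.floor(xl).astype(np.int64)             # rounded-down corner
        w = xl - lo                                    # relative position inside the voxel
        vertices = lo[None, :] + corners               # the 2^d integer vertices
        if (N_l + 1) ** d <= T:                        # coarse level: 1:1 mapping
            idx = np.ravel_multi_index(vertices.T, dims=(N_l + 1,) * d)
        else:                                          # fine level: spatial hash (collisions allowed)
            idx = spatial_hash(vertices, T)
        corner_feats = table[idx]                      # (2^d, F) looked-up feature vectors
        weights = np.prod(np.where(corners == 1, w, 1.0 - w), axis=1)  # d-linear weights
        feats.append(weights @ corner_feats)           # interpolated (F,) vector for this level
    return np.concatenate(feats)                       # (L*F,); auxiliary inputs would be appended

# Usage: L = 16 levels, T = 2^14 entries, F = 2 features, d = 2 input coordinate.
rng = np.random.default_rng(0)
tables = [rng.uniform(-1e-4, 1e-4, size=(2**14, 2)) for _ in range(16)]
print(encode(np.array([0.3, 0.7]), tables).shape)      # (32,)
```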

Choice of Grid Resolution for Each Level

  • The resolution of each level is chosen to be a geometric progression between the coarsest and finest resolutions $[N_{min},N_{max}]$

$$N_l :=\lfloor N_{min} \cdot b^l \rfloor$$

$$b :=\exp \Bigl( \frac{\ln N_{max}-\ln N_{min}}{L-1} \Bigr)$$

  • $N_{max}$ is chosen to match the finest detail in the training data
  • The growth factor $b$ is usually small, $b\in [1.26,2]$ (a small numeric check follows this list)
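For concreteness, a quick numeric check of these two formulas, assuming the defaults $N_{min}=16$, $L=16$ and picking $N_{max}=512$ from the lower end of the table above:

```python
import numpy as np

N_min, N_max, L = 16, 512, 16
b = np.exp((np.log(N_max) - np.log(N_min)) / (L - 1))    # growth factor, here b ≈ 1.26
N = [int(np.floor(N_min * b ** l)) for l in range(L)]
print(round(float(b), 4), N)    # resolutions grow geometrically from 16 up to ~512
```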

Performance vs Quality

  • Memory grows linearly with $T$, while quality and performance scale sub-linearly

Implicit Hash Collision Resolution

  • Low levels
    • No collisions, since the mapping is 1:1
    • Low resolution, as features are interpolated from a widely spaced grid of points
  • High levels
    • Many collisions → the gradients of colliding points are averaged
      • Points on a visible surface contribute strongly and thus have larger gradients

As a result, the gradients of the more important samples dominate the collision average, and the aliased table entry will naturally be optimized in such a way that it reflects the needs of the higher-weighted point (a toy example appears below).

  • Capture small features thanks to fine grid resolution
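As a toy illustration of this averaging (my own example, not from the paper): suppose two training points collide at a single table entry $e$ and pull it toward targets $t_1$ and $t_2$ with loss weights $w_1 \gg w_2$, e.g. a point on a visible surface versus a point in empty space. Minimizing the combined squared error

$$\mathcal{L}(e) = w_1 (e - t_1)^2 + w_2 (e - t_2)^2 \quad\Rightarrow\quad e^\ast = \frac{w_1 t_1 + w_2 t_2}{w_1 + w_2} \approx t_1$$

shows that the colliding entry ends up reflecting the higher-weighted point, as described above.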

Online Adaptivity

  • If the distribution of inputs $\textbf{x}$ concentrates in a smaller region during training → finer levels experience fewer collisions → a more accurate function can be learned
  • Multiresolution hash encoding automatically adapts to the training data distribution

D-Linear Interpolation

  • To ensure that the encoding and its composition with the MLP are continuous:

$$m(\text{enc}(\textbf{x};\theta) ; \Phi)$$

4. Implementation

  • Hash table entries are stored at half precision (2 bytes per entry), while a master copy of the parameters is maintained at full precision for stable mixed-precision parameter updates (a rough sketch follows this list)
  • Evaluate hash tables level by level
    • the first level of the multiresolution hash encoding is evaluated for all inputs, then the second level, and so on
    • → only a small number of hash tables have to reside in caches at any time
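A rough NumPy sketch of the mixed-precision update pattern described above (a stand-in for the paper’s fused CUDA implementation; the table shape, learning rate, and function name are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Full-precision "master" copy of one hash table; the fp16 copy is what the forward pass reads.
master = rng.uniform(-1e-4, 1e-4, size=(2**14, 2)).astype(np.float32)
table_fp16 = master.astype(np.float16)                   # 2 bytes per entry

def apply_gradients(grad, lr=1e-2):
    """Accumulate gradients into the fp32 master, then refresh the half-precision copy."""
    master[...] -= lr * grad.astype(np.float32)           # stable update in full precision
    table_fp16[...] = master                              # cast back down to half precision
```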

5. Results

6. Discussion and Future Work

Concatenation vs Reduction

Concatenation allows

  • Fully parallel processing of each resolution
  • Keeps the dimensions of every level, which helps the MLP make use of per-level information, compared with reduction

Reduction can be favorable when

  • the MLP that follows is so large that the cost of increasing $F$ is insignificant (a tiny comparison sketch follows)
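A tiny sketch of the two options, assuming $L=16$ levels with $F=2$ features each (my own illustration):

```python
import numpy as np

feats = [np.random.randn(2) for _ in range(16)]    # per-level feature vectors (L = 16, F = 2)
y_concat = np.concatenate(feats)                   # shape (32,): keeps each level's information
y_reduce = np.sum(feats, axis=0)                   # shape (2,): cheaper for the MLP, but lossy
```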