
[Paper Review] Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation

_YAI_ 2022. 9. 26. 21:20

https://arxiv.org/abs/1809.07454

*This review was written by 이진우 (YAI 10th cohort) of the speech team.


1. Introduction

Most previous speech separation approaches have been formulated on the time-frequency (T-F) representation of the mixture signal:

  • The T-F representation is estimated from the waveform using the short-time Fourier transform (STFT).
  • The output (the waveform of each source) is reconstructed using the inverse STFT (iSTFT).
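
For illustration, a minimal sketch of this conventional T-F masking pipeline in PyTorch might look as follows. The window/hop parameters and the mask are placeholders (assumptions for the sketch, not values from the paper); the key point is that the mixture phase is reused at reconstruction time.

import torch

# Minimal sketch of conventional T-F magnitude masking (illustrative only):
# STFT -> apply an estimated magnitude mask -> reuse the mixture phase -> iSTFT
def tf_mask_separate(mixture, mask, n_fft=256, hop=64):
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window, return_complex=True)
    masked = mask * spec          # real-valued mask; the mixture phase is kept as-is
    return torch.istft(masked, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])

# usage sketch: the mask would normally come from a separation network
mixture = torch.randn(1, 16000)
spec = torch.stft(mixture, 256, hop_length=64, window=torch.hann_window(256),
                  return_complex=True)
mask = torch.rand(spec.shape)     # placeholder mask
estimate = tf_mask_separate(mixture, mask)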

Drawbacks of T-F masking

  1. Not necessarily optimal for speech separation
    • So let's replace the STFT with a data-driven representation.
  2. Erroneous estimation of phase
    • Even when the ideal clean magnitude spectrograms are applied to the mixture, reconstruction is imperfect because the mixture phase is reused.
  3. Latency problem
    • The STFT requires a long temporal window, which limits applicability in real-time, low-latency applications.

  → So: no more decoupling of magnitude and phase; let's separate directly in the time domain!

Drawbacks of TasNet

  1. Small kernel size (== length of the waveform segments)
    • The resulting long input sequences make training of the LSTMs unmanageable.
  2. Large number of parameters
    • Increased computational cost.
  3. Long temporal dependencies of the LSTM
    • Inconsistent separation accuracy (e.g., sensitive to where the input starts).

2. Convolutional Time-domain Audio Separation Network

  • Estimating $C$ sources $s_1(t), \ldots, s_C(t) \in \mathbb{R}^{1 \times T}$ from the mixture $x(t) \in \mathbb{R}^{1 \times T}$, where

$$x(t)=\sum_{i=1}^Cs_i(t)$$

 

Encoder

$\mathbf{x}$ (==$\mathbf{x}_k$) is transformed into $\mathbf{w}$ by 1-D convolution

$\mathbf{w}=\mathcal{H}(\mathbf{xU})$

  • $\mathbf{x}$ (==$\mathbf{x}_k$) is a waveform segment of length $L$
  • $\mathbf{U}$ contains $N=512$ vectors (encoder basis functions), each of length $L$
  • $\mathcal{H}(\cdot)$ is an optional nonlinear function
    • Pseudo-inverse, linear, and ReLU variants are compared in the experiments
    • The linear encoder (i.e., no $\mathcal{H}(\cdot)$) is found to perform best
class TasNet(nn.Module):
    def __init__(self, enc_dim=512, feature_dim=128, sr=16000, win=2, layer=8, stack=3, 
                 kernel=3, num_spk=2, causal=False):
        super(TasNet, self).__init__()

...

        # input encoder
        self.encoder = nn.Conv1d(1, self.enc_dim, self.win, bias=False, stride=self.stride)

Separation module

  • consists of stacked 1-D dilated convolutional blocks (a temporal convolutional network, TCN)
  • each layer of the TCN consists of 1-D conv blocks with exponentially increasing dilation factors (see the sketch below)
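
As a rough sketch of the idea (hyperparameters are assumptions, and the actual Conv-TasNet block is a depthwise-separable convolution with residual and skip connections, which is omitted here):

import torch.nn as nn

# Simplified TCN-style stack: 1-D convolutions with exponentially increasing dilation.
# This only illustrates the growing receptive field; it is not the paper's exact block.
class DilatedConvStack(nn.Module):
    def __init__(self, channels=128, kernel=3, num_layers=8):
        super().__init__()
        layers = []
        for i in range(num_layers):
            dilation = 2 ** i                          # 1, 2, 4, ..., 2^(num_layers-1)
            padding = (kernel - 1) * dilation // 2     # keep the time dimension unchanged
            layers += [nn.Conv1d(channels, channels, kernel,
                                 dilation=dilation, padding=padding),
                       nn.PReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, channels, time)
        return self.net(x)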

Decoder

reconstructs the waveform $\mathbf{\hat x}$ from $\mathbf{w}$ by a 1-D transposed convolution

$\mathbf{\hat x} = \mathbf{wV}$

  • $\mathbf{\hat x}$ is the reconstruction of $\mathbf{x}$
  • $\mathbf{V}$ contains $N=512$ vectors (decoder basis functions), each of length $L$
class TasNet(nn.Module):
    def __init__(self, enc_dim=512, feature_dim=128, sr=16000, win=2, layer=8, stack=3, 
                 kernel=3, num_spk=2, causal=False):
        super(TasNet, self).__init__()

...

        # output decoder
        self.decoder = nn.ConvTranspose1d(self.enc_dim, 1, self.win, bias=False, stride=self.stride)

...

How is Separation done?

  • The separation module estimates $C$ mask vectors $\mathbf{m}_i$
    • $C$ == number of speakers
    • $\mathbf{m}_i$ is the output of the separation module
  • $\mathbf{m}_i$ is then multiplied by $\mathbf{w}$ in an element-wise manner
    • $\mathbf{d}_i = \mathbf{w} \odot \mathbf{m}_i$
    • $\mathbf{d}_i$ is the input to the decoder
    • $\mathbf{d}_i$ is then multiplied by $\mathbf{V}$ to produce the estimated waveform: $\mathbf{\hat s}_i = \mathbf{d}_i \mathbf{V}$ (the overall flow is sketched below)
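
Putting the pieces together, the encode → mask → decode flow can be sketched roughly as follows (module names and shapes are illustrative assumptions, not the repository's exact code):

import torch

# Illustrative sketch of the overall separation flow (shapes are assumptions):
#   w:     encoder output,           (batch, N, K)  with N = 512 basis outputs, K frames
#   masks: separation-module output, (batch, num_spk, N, K)
#   d_i:   masked representation, decoded back to a waveform by the transposed conv
def separate(encoder, separator, decoder, mixture, num_spk=2):
    w = encoder(mixture)                   # (batch, N, K)
    masks = separator(w)                   # (batch, num_spk, N, K)
    estimates = []
    for i in range(num_spk):
        d_i = w * masks[:, i]              # element-wise masking
        estimates.append(decoder(d_i))     # (batch, 1, T)
    return torch.stack(estimates, dim=1)   # (batch, num_spk, 1, T)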

3. Experiments

Datasets - WSJ0 (Wall Street Journal)

  • WSJ0-2mix for two-speaker separation
  • WSJ0-3mix for three-speaker separation
  • Generating mixtures (a rough sketch follows this list):
    • audio resampled to 8 kHz
    • utterances from different speakers are chosen at random
    • the relative SNR is chosen at random between -5 and 5 dB
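
A minimal sketch of mixing two utterances at a random SNR (an assumed recipe for illustration, not the exact WSJ0-mix generation script):

import torch

# Sketch: mix two utterances at a random SNR in [-5, 5] dB (illustrative only)
def mix_at_random_snr(s1, s2, low_db=-5.0, high_db=5.0):
    snr_db = torch.empty(1).uniform_(low_db, high_db)
    # scale s2 so that 10 * log10(power(s1) / power(s2_scaled)) == snr_db
    p1 = s1.pow(2).mean()
    p2 = s2.pow(2).mean()
    scale = torch.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    s2_scaled = s2 * scale
    return s1 + s2_scaled, s1, s2_scaled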

Experiment configurations

  • 100 epochs on 4-second segments
  • initial learning rate $10^{-3}$
  • Adam optimizer
  • 50% stride size (== 50% overlap between frames)

Training objective

  • training maximizes the scale-invariant source-to-noise ratio (SI-SNR)
  • SI-SNR is defined below
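
For reference, with $\hat{\mathbf{s}}$ the estimated source and $\mathbf{s}$ the target source (both normalized to zero mean), SI-SNR is defined as:

$$\mathbf{s}_\text{target} = \frac{\langle \hat{\mathbf{s}}, \mathbf{s} \rangle \, \mathbf{s}}{\lVert \mathbf{s} \rVert^2}, \quad \mathbf{e}_\text{noise} = \hat{\mathbf{s}} - \mathbf{s}_\text{target}, \quad \text{SI-SNR} = 10 \log_{10} \frac{\lVert \mathbf{s}_\text{target} \rVert^2}{\lVert \mathbf{e}_\text{noise} \rVert^2}$$

Utterance-level permutation invariant training (uPIT) is used during training to resolve the speaker-permutation ambiguity.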

See also: SDR – Half-baked or Well Done? (https://arxiv.org/pdf/1811.02508.pdf)

Comparison with ideal TF masks

  • Ideal binary mask (IBM)
  • Ideal ratio mask (IRM)
  • Wiener filter-like mask (WFM)
    • all defined on the complex spectrograms $\mathcal{S}_i(f,t) \in \mathbb{C}^{F \times T}$ of the clean sources (formulas below)
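
For reference, these ideal masks are commonly defined from the clean-source magnitudes as follows:

$$\text{IBM}_i(f,t) = \delta\big(|\mathcal{S}_i(f,t)| > |\mathcal{S}_j(f,t)|\big), \quad \forall\, j \neq i$$

$$\text{IRM}_i(f,t) = \frac{|\mathcal{S}_i(f,t)|}{\sum_{j=1}^{C} |\mathcal{S}_j(f,t)|}, \qquad \text{WFM}_i(f,t) = \frac{|\mathcal{S}_i(f,t)|^2}{\sum_{j=1}^{C} |\mathcal{S}_j(f,t)|^2}$$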

  • configurations
    • 32ms window size, Hanning window
    • 8ms hop size

4. Results

Comparison with previous methods

  • the noncausal Conv-TasNet surpasses all three ideal T-F masks

  • the noncausal Conv-TasNet outperforms all previous STFT-based systems

Subjective and objective quality evaluation

  • PESQ: Perceptual Evaluation of Speech Quality
  • aims to predict the subjective quality of speech

5. Questions and Comments

  1. Ideal T-F masks vs. STFT-based systems
    • Does "ideal T-F masks" refer only to traditional signal-processing methods, or does it also include deep-learning approaches?
    • Is it the case that ideal T-F masks $\subset$ STFT-based systems?
  2. What exactly is the metric computed by PESQ?

https://oaji.net/articles/2017/1992-1514892800.pdf
