[Paper Review] Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
_YAI_ 2022. 9. 26. 21:20
https://arxiv.org/abs/1809.07454
*This review was written by 이진우 of YAI's 10th cohort, from the speech team.
1. Introduction
Most previous speech separation approaches have been formulated in the time-frequency (T-F) representation of the mixture signal
- the T-F representation is estimated from the waveform using the short-time Fourier transform (STFT)
- the output (the waveform of each source) is reconstructed using the inverse STFT (iSTFT)
Drawbacks of T-F masking
- Not necessarily optimal for speech separation
  - → then let's replace the STFT with a data-driven representation!
- Erroneous estimation of phase
  - even when the ideal clean magnitude spectrograms are applied to the mixture → imperfect reconstruction
  - → no more decoupling the magnitude and phase! Let's separate directly in the time domain!
- Latency problem
  - the STFT computation uses a long temporal window, which limits applicability in real-time, low-latency applications
Drawbacks of TasNet (the LSTM-based predecessor)
- smaller kernel size (== the length of the waveform segments) → longer input sequences
  - training of the LSTMs becomes unmanageable
- large number of parameters
  - increased computational cost
- long temporal dependencies of the LSTM
  - inconsistent separation accuracy
2. Convolutional Time-domain Audio Separation Network
- Estimating $C$ sources $s_1(t), \ldots, s_C(t) \in \mathbb{R}^{1 \times T}$ from the mixture $x(t) \in \mathbb{R}^{1 \times T}$, where
$$x(t)=\sum_{i=1}^Cs_i(t)$$
Encoder
$\mathbf{x}$ (== $\mathbf{x}_k$, the $k$-th segment) is transformed into $\mathbf{w}$ by a 1-D convolution
$\mathbf{w}=\mathcal{H}(\mathbf{xU})$
- $\mathbf{x}$ is a segmented waveform of length $L$
- $\mathbf{U}$ contains $N=512$ vectors (encoder basis functions), each of length $L$
- $\mathcal{H}(\cdot)$ is an optional nonlinear function
  - pseudo-inverse, linear, and ReLU variants are used in the experiments
  - the linear encoder (without $\mathcal{H}(\cdot)$) is shown to perform best
```python
import torch.nn as nn

class TasNet(nn.Module):
    def __init__(self, enc_dim=512, feature_dim=128, sr=16000, win=2, layer=8, stack=3,
                 kernel=3, num_spk=2, causal=False):
        super(TasNet, self).__init__()
        ...
        # input encoder: a 1-D conv mapping each length-L waveform segment to
        # N = enc_dim basis coefficients (self.win and self.stride are derived
        # from `win` in the elided part of __init__)
        self.encoder = nn.Conv1d(1, self.enc_dim, self.win, bias=False, stride=self.stride)
```
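As a rough shape check for the encoder, here is a minimal sketch; the values $L=16$ (2 ms at 8 kHz) and stride 8 are illustrative assumptions, while $N=512$ and the linear (no $\mathcal{H}(\cdot)$) choice follow the description above:

```python
import torch
import torch.nn as nn

L, N, stride = 16, 512, 8                # hypothetical: 2 ms segments at 8 kHz, 50% overlap
encoder = nn.Conv1d(1, N, kernel_size=L, stride=stride, bias=False)

x = torch.randn(1, 1, 32000)             # a 4-second mixture at 8 kHz: (batch, 1, T)
w = encoder(x)                           # linear encoder, no nonlinearity: (batch, N, K)
print(w.shape)                           # torch.Size([1, 512, 3999])
```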
Separation module
- consists of stacked 1-D dilated convolutional blocks, forming a temporal convolutional network (TCN)
- each layer in the TCN consists of 1-D conv blocks with exponentially increasing dilation factors, as sketched below
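A minimal sketch of such a dilated stack, matching the `layer=8, stack=3, kernel=3` defaults in the code above; the paper's depthwise-separable convolutions and skip-connection paths are omitted for brevity, and `channels=128` is illustrative:

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """One noncausal 1-D conv block with a residual connection."""
    def __init__(self, channels, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) * dilation // 2   # "same" padding, keeps the time length
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation, padding=pad)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))    # residual: input + transformed

# 3 stacks of 8 layers; the dilation doubles each layer: 1, 2, 4, ..., 128
tcn = nn.Sequential(*[DilatedBlock(128, dilation=2 ** i)
                      for _ in range(3) for i in range(8)])
```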
Decoder
reconstructs the waveform $\mathbf{\hat x}$ from $\mathbf{w}$ by a 1-D transposed convolution
$\mathbf{\hat x} = \mathbf{wV}$
- $\mathbf{\hat x}$ is the reconstruction of $\mathbf{x}$
- $\mathbf{V}$ contains $N=512$ vectors (decoder basis functions), each of length $L$
```python
class TasNet(nn.Module):
    def __init__(self, enc_dim=512, feature_dim=128, sr=16000, win=2, layer=8, stack=3,
                 kernel=3, num_spk=2, causal=False):
        super(TasNet, self).__init__()
        ...
        # output decoder: a transposed 1-D conv that overlap-adds the masked
        # representation back into a waveform
        self.decoder = nn.ConvTranspose1d(self.enc_dim, 1, self.win, bias=False, stride=self.stride)
        ...
```
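A quick check, using the same hypothetical $L=16$ and stride 8 as in the encoder sketch, that the transposed convolution maps the representation back to the original waveform length via overlap-add:

```python
import torch
import torch.nn as nn

decoder = nn.ConvTranspose1d(512, 1, kernel_size=16, stride=8, bias=False)

d = torch.randn(1, 512, 3999)      # masked representation d_i: (batch, N, K)
s_hat = decoder(d)                 # (1, 1, 32000): back to the 4-second waveform
print(s_hat.shape)
```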
How is Separation done?
- Estimation from $C$ vectors (== masks) $\mathbf{m}_i$
  - $C$ == the number of speakers
  - $\mathbf{m}_i$ is the output of the Separation module
- $\mathbf{m}_i$ is then multiplied by $\mathbf{w}$ in an element-wise manner
- $\mathbf{d}_i = \mathbf{w} \odot \mathbf{m}_i$
- $\mathbf{d}_i$ is the input to the Decoder
- $\mathbf{d}_i$ is then multiplied by $\mathbf{V}$ to become the estimated waveform: $\mathbf{\hat s}_i = \mathbf{d}_i \mathbf{V}$ (see the sketch below)
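Putting the pieces together, the masking step might look like the following sketch; `masks` is a placeholder standing in for the TCN output (here normalized across sources with a softmax), and the shapes follow the earlier examples:

```python
import torch

C, N, K = 2, 512, 3999
w = torch.randn(1, N, K)                                # encoder output
masks = torch.softmax(torch.randn(1, C, N, K), dim=1)   # placeholder for the TCN output

d = w.unsqueeze(1) * masks     # d_i = w ⊙ m_i, element-wise, one per speaker
# each d[:, i] is then fed to the decoder above to produce the estimate ŝ_i
```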
3. Experiments
Datasets - Wall Street Journal (WSJ0)
- WSJ0-2mix for two-speaker separation
- WSJ0-3mix for three-speaker separation
- mixtures are generated by
  - resampling at 8 kHz
  - randomly choosing utterances from different speakers
  - randomly choosing an SNR between -5 and 5 dB
Experiment configurations
- 100 epochs on 4-second segments
- initial lr = $10^{-3}$
- Adam optimizer
- 50% stride size (== 50% overlap between frames)
Training objective
- maximizes the scale-invariant source-to-noise ratio (SI-SNR)
- SI-SNR, with $\hat s$ the estimate and $s$ the target source (both zero-mean):
$$\text{SI-SNR} = 10\log_{10}\frac{\|s_{\text{target}}\|^2}{\|e_{\text{noise}}\|^2}, \qquad s_{\text{target}} = \frac{\langle \hat s, s \rangle s}{\|s\|^2}, \qquad e_{\text{noise}} = \hat s - s_{\text{target}}$$
- see also: SDR – Half-Baked or Well Done? (https://arxiv.org/pdf/1811.02508.pdf)
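A direct implementation of the SI-SNR definition above (a sketch; both signals are zero-meaned first, and `eps` guards against division by zero):

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant source-to-noise ratio in dB, per the definition above."""
    est = est - est.mean(dim=-1, keepdim=True)   # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # s_target: projection of the estimate onto the reference
    dot = (est * ref).sum(dim=-1, keepdim=True)
    s_target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))
```

In training, the loss is the negative SI-SNR, combined with utterance-level permutation invariant training (uPIT) to resolve the source-ordering ambiguity.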
Comparison with ideal T-F masks
- Ideal binary mask (IBM): $\text{IBM}_i(f,t) = \delta\big(|\mathcal{S}_i(f,t)| > |\mathcal{S}_j(f,t)|,\ \forall j \neq i\big)$
- Ideal ratio mask (IRM): $\text{IRM}_i(f,t) = \frac{|\mathcal{S}_i(f,t)|}{\sum_{j=1}^C |\mathcal{S}_j(f,t)|}$
- Wiener filter-like mask (WFM): $\text{WFM}_i(f,t) = \frac{|\mathcal{S}_i(f,t)|^2}{\sum_{j=1}^C |\mathcal{S}_j(f,t)|^2}$
- where $\mathcal{S}_i(f,t) \in \mathbb{C}^{F \times T}$ are the complex spectrograms of the clean sources
- configurations
  - 32 ms window size, Hanning window
  - 8 ms hop size
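A sketch of how these oracle masks could be computed, given the magnitude spectrograms of the $C$ clean sources stacked along the first dimension (the tensor sizes are illustrative; 32 ms at 8 kHz gives 129 frequency bins):

```python
import torch

mag = torch.rand(2, 129, 400)   # |S_i(f, t)| for C=2 sources: (C, F, T)

ibm = (mag == mag.max(dim=0, keepdim=True).values).float()              # 1 where source i dominates
irm = mag / mag.sum(dim=0, keepdim=True).clamp_min(1e-8)                # magnitude ratio
wfm = mag.pow(2) / mag.pow(2).sum(dim=0, keepdim=True).clamp_min(1e-8)  # power ratio
```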
4. Results
Comparison with previous methods
- the noncausal Conv-TasNet surpasses all three ideal T-F masks
- the noncausal Conv-TasNet outperforms all STFT-based systems
Subjective and objective quality evaluation
- PESQ: Perceptual Evaluation of Speech Quality
- aims to predict the subjective quality of speech
5. Questions and Comments
- Ideal TF masks vs STFT-based systems
  - Do "ideal T-F masks" refer to traditional signal-processing methods, or do they include deep-learning methods as well?
  - Is it the case that ideal T-F masks $\subset$ STFT-based systems?
- What exactly is the metric underlying PESQ?