(Paper reading) Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

A paper published by Google.
[1804.03619] Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

The model generates a noise mask from the speaker's face and extracts only the speech you want to listen to.

Blog post
ai.googleblog.com

Commentary article
tech.d-itlab.co.jp

Model architecture
[Figure: model architecture diagram]

Our network is implemented in TensorFlow

It appears to be implemented in TensorFlow.
The source code has not been released, so I trace the internal processing based on the paper.

[step 1] Input data preparation

One sample is 3 seconds of audio and 3 seconds of video.
Two speakers are assumed.

Video

We resample the face embeddings from all videos to 25 frames-per-second (FPS) before training and inference by either removing or replicating embeddings.
This results in an input visual stream of 75 face embeddings.
When missing frames are encountered in a particular sample, we use a vector of zeros in lieu of a face embedding.

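Clipping 3 seconds of 25 FPS video yields 75 frames.
The "1024" is the dimensionality of the face embedding computed for each frame (the paper calls it a "1024-D face embedding"), not a pixel count, so it does not imply a 32x32 image.

Final tensor shapes, as read from the model architecture diagram.

Video(person A) --> (75 frames x 1024-D embedding x 1)
Video(person B) --> (75 frames x 1024-D embedding x 1)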
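
As a rough illustration of the resampling described in the quote above, here is a minimal NumPy sketch. It is my own reading, not the authors' code: the function name `resample_embeddings`, the use of `None` to mark a missing frame, and the floor-based index mapping are all assumptions.

```python
import numpy as np

def resample_embeddings(embeddings, src_fps, dst_fps=25, n_out=75, dim=1024):
    """Resample per-frame face embeddings to dst_fps by dropping or repeating frames.

    embeddings: list whose items are (dim,) arrays, or None for a missing frame.
    """
    out = np.zeros((n_out, dim), dtype=np.float32)  # zeros stand in for missing frames
    for t in range(n_out):
        src = int(t * src_fps / dst_fps)  # hypothetical mapping: floor to a source frame
        if src < len(embeddings) and embeddings[src] is not None:
            out[t] = embeddings[src]
    return out

# Usage: 3 seconds of 30 FPS video (90 frames), with source frame 6 missing.
frames = [np.full(1024, i + 1, dtype=np.float32) for i in range(90)]
frames[6] = None
video = resample_embeddings(frames, src_fps=30)
print(video.shape)  # (75, 1024)
```

With a 30 FPS source some frames are dropped; with a source below 25 FPS the same mapping repeats frames instead.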

Audio

All audio is resampled to 16kHz, and stereo audio is converted to mono by taking only the left channel.
STFT is computed using a Hann window of length 25ms, hop length of 10ms, and FFT size of 512, resulting in an input audio feature of 257 × 298 × 2 scalars.
Power-law compression is performed with p = 0.3 (A^0.3, where A is the input/output audio spectrogram).
use both the real and imaginary parts of a complex number
power-law compression to prevent loud audio from overwhelming soft audio

The sampling rate is 16 kHz. A Hann window of length 25 ms is applied with a 10 ms hop and an FFT is computed for each window, which yields 298 frames.

Final tensor shape, as read from the model architecture diagram.
Audio --> (298 frames x 2 x 257)
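
To check where 298 and 257 come from, here is a small NumPy sketch of the STFT settings quoted above. It is only a sketch under assumptions: no padding at the signal edges, and power-law compression applied to the magnitude while keeping the phase (the paper excerpt does not spell out either detail). The name `audio_features` is made up for this example.

```python
import numpy as np

SAMPLE_RATE = 16_000
WIN = SAMPLE_RATE * 25 // 1000   # 25 ms window -> 400 samples
HOP = SAMPLE_RATE * 10 // 1000   # 10 ms hop    -> 160 samples
N_FFT = 512                      # -> 512 // 2 + 1 = 257 frequency bins

def audio_features(wave, p=0.3):
    """Return a (frames, 2, 257) array of power-law compressed STFT real/imag parts."""
    n_frames = 1 + (len(wave) - WIN) // HOP          # assumption: no padding at the edges
    window = np.hanning(WIN)
    frames = np.stack([wave[i * HOP:i * HOP + WIN] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=N_FFT, axis=-1)     # zero-pads each frame to 512
    # Assumption: compress the magnitude and keep the phase unchanged.
    compressed = (np.abs(spec) ** p) * np.exp(1j * np.angle(spec))
    return np.stack([compressed.real, compressed.imag], axis=1)

wave = np.random.default_rng(0).standard_normal(3 * SAMPLE_RATE)  # 3 s of dummy audio
features = audio_features(wave)
print(features.shape)  # (298, 2, 257)
```

The frame count follows from 1 + (48000 - 400) // 160 = 298, and the 2 is the real and imaginary channel.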

[step 2] CNN

Video

Note that "spatial" convolutions and dilations in the visual stream are performed over the temporal axis (not over the 1024-D face embedding channel).

The convolutions are applied along the temporal (frame) axis.

To compensate for the sampling rate discrepancy between the audio and video signals, we upsample the output of the visual stream to match the spectrogram sampling rate (100 Hz).

To compensate for the difference in sampling rate between video and audio, the visual stream is upsampled.

Final tensor shapes, as read from the model architecture diagram.
Video1(person A) --> (298, 256)
Video2(person B) --> (298, 256)
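
The upsampling from 75 to 298 time steps could look roughly like the following. This is a sketch that assumes nearest-neighbour repetition along the time axis; the quoted passage does not name the interpolation method, and `upsample_nearest` is a hypothetical helper.

```python
import numpy as np

def upsample_nearest(features, target_len):
    """Stretch (T, C) features along time to (target_len, C) by repeating frames."""
    src_len = features.shape[0]
    idx = np.arange(target_len) * src_len // target_len  # source frame for each output step
    return features[idx]

visual = np.random.default_rng(0).standard_normal((75, 256))  # dummy visual-stream output
upsampled = upsample_nearest(visual, 298)
print(upsampled.shape)  # (298, 256)
```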

Audio

Audio --> (298, 8*257)

[step 3] Fusion

AV fusion.
The audio and visual streams are combined by concatenating the feature maps of each stream

Final tensor shape, as read from the model architecture diagram.
(298, (256*2)+(8*257)) = (298, 2568)
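
The shape arithmetic above can be confirmed with dummy arrays. The concatenation order (person A, person B, audio) is an assumption made for this sketch; only the resulting feature size matters here.

```python
import numpy as np

frames = 298
video_a = np.zeros((frames, 256))      # visual stream, person A
video_b = np.zeros((frames, 256))      # visual stream, person B
audio = np.zeros((frames, 8 * 257))    # audio stream, flattened per frame

fused = np.concatenate([video_a, video_b, audio], axis=1)  # join along the feature axis
print(fused.shape)  # (298, 2568)
```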

[step 4] Bidirectional LSTM

subsequently fed into a BLSTM

Final tensor shape, as read from the model architecture diagram.
(298, 400)

[step 5] Fully connected layers

followed by three FC layers
ReLU activations follow all network layers except for last (mask), where a sigmoid is applied.

The final activation function is sigmoid.

Tensor shape transitions, as read from the model architecture diagram.
(298, 600)
(298, 600)
(2 persons, 298 frames, 2, 257)

The final output consists of a complex mask (two-channels, real and imaginary) for each of the input speakers.
The output of our model is a multiplicative spectrogram mask, which describes the time-frequency relationships of clean speech to background interference.

A noise mask is output for each speaker. This is the output of the model.

The corresponding spectrograms are computed by complex multiplication of the noisy input spectrogram and the output masks.

The complex product of the masks obtained here and the input audio spectrogram is computed (this result is presumably "the enhanced spectrogram" used in the loss function).

The final output waveforms are obtained using ISTFT

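The inverse transform then converts the result back to audio data.
This is the speech of the specific speaker you want to listen to (with the other speakers' speech and the noise already separated out).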
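
The complex multiplication of mask and noisy spectrogram mentioned above can be written out on the two-channel (real, imaginary) representation. This is a sketch of the arithmetic only: `apply_complex_mask` is a hypothetical name, and the constant values are chosen to make the result easy to check by hand, not to resemble real mask values.

```python
import numpy as np

def apply_complex_mask(spec, mask):
    """spec, mask: (frames, 2, freq) arrays; channel 0 is real, channel 1 is imaginary."""
    s_re, s_im = spec[:, 0], spec[:, 1]
    m_re, m_im = mask[:, 0], mask[:, 1]
    out_re = s_re * m_re - s_im * m_im   # real part of (s_re + i s_im)(m_re + i m_im)
    out_im = s_re * m_im + s_im * m_re   # imaginary part
    return np.stack([out_re, out_im], axis=1)

spec = np.stack([np.full((298, 257), 1.0), np.full((298, 257), 2.0)], axis=1)  # 1 + 2i
mask = np.stack([np.full((298, 257), 3.0), np.full((298, 257), 4.0)], axis=1)  # 3 + 4i
enhanced = apply_complex_mask(spec, mask)
print(enhanced[0, :, 0])  # [-5. 10.]
```

Here (1 + 2i)(3 + 4i) = -5 + 10i, which matches the printed pair. With one mask per speaker, the same function would be applied once per speaker before the ISTFT.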

Training settings

Batch normalization is performed after all convolutional layers.
Dropout is not used, as we train on a large amount of data and do not suffer from overfitting.
We use a batch size of 6 samples and train with Adam optimizer for 5 million steps (batches) with a learning rate of 3e−5 which is reduced by half every 1.8 million steps.

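The batch size is 6 samples, the optimizer is Adam, training runs for 5 million steps (batches, not epochs), and the learning rate is 0.00003 (halved every 1.8 million steps).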
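
The learning-rate schedule in the quote can be sketched as a simple step function. This assumes a staircase decay (the rate drops at exact multiples of 1.8 million steps); `learning_rate` is a name I made up, not a TensorFlow API.

```python
def learning_rate(step, base_lr=3e-5, halve_every=1_800_000):
    """Step-wise schedule: halve the learning rate every `halve_every` steps."""
    return base_lr * 0.5 ** (step // halve_every)

for step in (0, 1_800_000, 3_600_000, 4_999_999):
    print(step, learning_rate(step))
```

Over 5 million steps the rate is halved twice, so training ends at a quarter of the initial value.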

The squared error (L2) between the power-law compressed clean spectrogram and the enhanced spectrogram is used as a loss function to train the network.

The loss function is the squared error (L2); training minimizes the discrepancy between "the power-law compressed clean spectrogram" and "the enhanced spectrogram".
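
A minimal sketch of this loss, assuming both spectrograms are already in the compressed domain and stored as (frames, 2, freq) real/imaginary arrays. Whether the paper sums or averages the squared error is not stated in the quote, so the sum used here is an assumption, as is the name `squared_error_loss`.

```python
import numpy as np

def squared_error_loss(clean_compressed, enhanced):
    """Sum of squared differences between the two spectrograms."""
    diff = clean_compressed - enhanced
    return float(np.sum(diff ** 2))

clean = np.zeros((298, 2, 257))
enhanced_spec = np.ones((298, 2, 257))
print(squared_error_loss(clean, clean))          # 0.0
print(squared_error_loss(clean, enhanced_spec))  # 153172.0
```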

ichou1's blog

Mostly about speech recognition, occasionally about data analysis