Audio file feature conversion (Part 5): Mel spectrogram (TensorFlow)

To compute a mel spectrogram in TensorFlow, the function "tf.signal.linear_to_mel_weight_matrix" is provided.
https://www.tensorflow.org/api_docs/python/tf/signal/linear_to_mel_weight_matrix

The function returns a matrix for converting to the mel scale; the mel spectrogram is the matrix product of the STFT output and this matrix.
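For reference, the TensorFlow documentation for this function states that it follows the HTK convention for the mel scale, mel(f) = 2595 * log10(1 + f / 700). Below is a minimal sketch of that mapping in plain Python. The function names "hertz_to_mel" and "mel_to_hertz" are illustrative and are not part of the TensorFlow API.

```python
import math

def hertz_to_mel(freq_hz):
    """HTK-style mel scale: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hertz(mel):
    """Inverse of hertz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The mapping compresses high frequencies more than low ones.
for f in (0.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.1f} Hz -> {hertz_to_mel(f):7.1f} mel")
```

On this scale 1000 Hz lands at roughly 1000 mel, while the 8000 Hz upper edge used in this article lands at roughly 2840 mel.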

The audio data used is a one-second utterance of the word "yes".
f:id:ichou1:20200216111412p:plain

The code is run on TensorFlow 2.X.

python -c 'import tensorflow as tf; print(tf.__version__)'
2.1.0

Code using the TensorFlow 1.X API

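This part runs the short-time Fourier transform (STFT) and then takes the magnitude.
It is the same code as in the previous article, shown again here.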

import tensorflow as tf
from tensorflow.python.ops import io_ops

tf.compat.v1.disable_eager_execution()

# Audio Data
audio_path = 'speech_dataset/yes/0a7c2a8d_nohash_0.wav'

with tf.compat.v1.Session(graph=tf.Graph()) as sess:

    wav_filename_placeholder = tf.compat.v1.placeholder(tf.string, [])
    wav_loader = io_ops.read_file(wav_filename_placeholder)

    # audio: A Tensor of type float32.
    # sample_rate: A Tensor of type int32.
    data, sr = tf.audio.decode_wav(wav_loader,
                                   desired_channels=1)

    # Remove the channel dimension
    data_ = tf.squeeze(data, -1)

    # Add the batch_size dimension
    data__ = tf.expand_dims(data_, axis=0)

    # Input: A Tensor of [batch_size, num_samples] 
    # mono PCM samples in the range [-1, 1]. 
    stfts = tf.signal.stft(data__,
                           frame_length=480,
                           frame_step=160,
                           fft_length=512)

    # Compute the magnitude
    spectrograms = tf.abs(stfts)
    # --> Output shape: (1, 98, 257)
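The output shape can be checked by hand. The sketch below assumes the default pad_end=False of tf.signal.stft and a clip of exactly 16000 samples (one second at 16 kHz). The helper name "stft_output_shape" is hypothetical.

```python
def stft_output_shape(num_samples, frame_length, frame_step, fft_length):
    """Return (frames, bins) for an STFT without end padding."""
    # Only frames that fit entirely inside the signal are kept.
    frames = 1 + (num_samples - frame_length) // frame_step
    # A real-valued FFT yields fft_length // 2 + 1 unique frequency bins.
    bins = fft_length // 2 + 1
    return frames, bins

frames, bins = stft_output_shape(16000, frame_length=480,
                                 frame_step=160, fft_length=512)
print(frames, bins)  # 98 257
```

At 16 kHz, frame_length=480 and frame_step=160 correspond to a 30 ms window and a 10 ms hop.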

Create the matrix for the mel-scale conversion.

    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=128,
        num_spectrogram_bins=257, # FFT size / 2 + 1
        sample_rate=16000,
        lower_edge_hertz=0.0,
        upper_edge_hertz=8000.0
    )
    # --> shape=(257, 128) = (FFT size / 2 + 1, num of mel bins)
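To make the contents of this matrix concrete, here is a NumPy sketch of a triangular mel filterbank built on the HTK mel formula. It is an approximation written for illustration under the assumptions noted in the comments, not TensorFlow's own implementation. The names "hertz_to_mel" and "mel_weight_matrix" are hypothetical.

```python
import numpy as np

def hertz_to_mel(freq_hz):
    # HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def mel_weight_matrix(num_mel_bins, num_spectrogram_bins, sample_rate,
                      lower_edge_hertz, upper_edge_hertz):
    """Triangular mel filterbank, shape (num_spectrogram_bins, num_mel_bins)."""
    # Centre frequency of every linear STFT bin, converted to mel.
    bin_hz = np.linspace(0.0, sample_rate / 2.0, num_spectrogram_bins)
    bin_mel = hertz_to_mel(bin_hz)[:, np.newaxis]

    # num_mel_bins triangles need num_mel_bins + 2 edges, equally spaced in mel.
    edges = np.linspace(hertz_to_mel(lower_edge_hertz),
                        hertz_to_mel(upper_edge_hertz),
                        num_mel_bins + 2)
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]

    # Rising and falling slopes of each triangle; negatives are clipped to zero.
    rising = (bin_mel - lower) / (center - lower)
    falling = (upper - bin_mel) / (upper - center)
    weights = np.maximum(0.0, np.minimum(rising, falling))

    # Assumption: the DC bin is zeroed so it contributes to no mel band.
    weights[0, :] = 0.0
    return weights

weights = mel_weight_matrix(128, 257, 16000, 0.0, 8000.0)
print(weights.shape)  # (257, 128)
```

Each column is one triangular filter over the 257 linear-frequency bins, so multiplying a spectrogram frame by the matrix sums the STFT magnitudes that fall under each triangle.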

"num_mel_bins" is the number of channels in the mel filterbank.
Plotting the matrix gives the following.

Number of filterbank channels: 128

f:id:ichou1:20200314092204p:plain

Specifying "24", which is the HTK default, gives the following.

Number of filterbank channels: 24

f:id:ichou1:20200314091903p:plain

Compute the matrix product.

    mel_spectrograms = tf.tensordot(
        spectrograms,                 # (1, 98, 257) 
        linear_to_mel_weight_matrix,  # (257, 128)
        1)

Set the "shape" property explicitly.

    # tf.tensordot does not support shape inference for this case yet.
    mel_spectrograms.set_shape(
        spectrograms.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:])
    )
    # spectrograms.shape[:-1] : (1, 98)
    # linear_to_mel_weight_matrix.shape[-1:] : (128)

At this point, the output shape is as follows.

    # mel_spectrograms shape: (1, 98, 128)
    # (batch_size, frame, num of mel bins)
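As a shape check, the same contraction can be reproduced with NumPy, where np.tensordot(a, b, 1) likewise contracts the last axis of the first operand with the first axis of the second. The arrays below are random stand-ins, not real spectrogram data.

```python
import numpy as np

rng = np.random.default_rng(0)
spectrograms = rng.random((1, 98, 257))   # (batch_size, frame, stft_bin)
mel_matrix = rng.random((257, 128))       # (stft_bin, mel_bin)

# axes=1 contracts the last axis of the first operand
# with the first axis of the second operand.
mel_spectrograms = np.tensordot(spectrograms, mel_matrix, 1)
print(mel_spectrograms.shape)  # (1, 98, 128)

# Equivalent to applying the same matrix product to every frame.
assert np.allclose(mel_spectrograms, spectrograms @ mel_matrix)
```

The batch and frame axes pass through unchanged; only the frequency axis is mapped from 257 linear bins to 128 mel bins.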

Take the logarithm.
A small value is added first; without it, bins with a value of zero would map to "-inf".

    # Compute a stabilized log to get log-magnitude mel-scale spectrograms.
    log_mel_spectrograms = tf.math.log(mel_spectrograms + 1e-6)
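The offset also fixes the smallest value the feature can take: a mel bin with zero energy maps to log(1e-6). A quick check with the standard library follows. The helper name "stabilized_log" is illustrative.

```python
import math

def stabilized_log(value, offset=1e-6):
    """Natural log with a small offset so that zero input stays finite."""
    return math.log(value + offset)

floor = stabilized_log(0.0)
print(floor)  # about -13.8155
```

This is the same value as the minimum of -13.815511 reported for the clips in this article, which suggests that each clip contains at least one mel bin with zero (or near-zero) energy.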

Run the session.

    feature = sess.run(
        log_mel_spectrograms,
        feed_dict={wav_filename_placeholder: audio_path}
    )

Check the shape of the output.

print('feature shape: ', feature.shape)
print('feature type: ', type(feature))

feature shape:  (1, 98, 128)  # (batch_size, frame, num_mel_bins)
feature type:  <class 'numpy.ndarray'>

Plot the result.

import numpy as np
import librosa

# Drop the batch_size dimension (axis 0)
feature_ = np.squeeze(feature)

# Swap the axes: (frame, mel_index) --> (mel_index, frame)
feature__ = feature_.transpose(1, 0)

# plot
import matplotlib.pyplot as plt
import librosa.display

librosa.display.specshow(feature__,
                         sr=16000,
                         hop_length=160, 
                         y_axis='linear',
                         x_axis='time')

plt.title('yes/0a7c2a8d_nohash_0.wav')
plt.colorbar(format='%+2.0f')
plt.ylim(0, 8000)
plt.tight_layout()
plt.show()

f:id:ichou1:20200301103755p:plain

Check the range of values.

print('Max Value: ', np.max(feature))
print('Min Value: ', np.min(feature))

Max Value:  2.4283957
Min Value:  -13.815511

Plots for some other clips follow.

"down"

speech_dataset/down/0a9f9af7_nohash_2.wav
f:id:ichou1:20200308095058p:plain

Max Value:  2.144659
Min Value:  -13.815511

"off"

speech_dataset/off/3e31dffe_nohash_1.wav
f:id:ichou1:20200308095443p:plain

Max Value:  2.4816704
Min Value:  -13.815511

Code using the TensorFlow 2.X API

import tensorflow as tf
from tensorflow.python.ops import io_ops

# Audio Data
audio_path = 'speech_dataset/yes/0a7c2a8d_nohash_0.wav'

# Load Audio File
def load_data(filename):

    wav_loader = io_ops.read_file(filename)
    data, sr = tf.audio.decode_wav(wav_loader,
                                   desired_channels=1)

    # Remove the channel dimension (only the last axis, as in the 1.X code)
    data_ = tf.squeeze(data, axis=-1)

    # Add the batch_size dimension
    data__ = tf.expand_dims(data_, axis=0)

    return data__, sr

# compute STFT
def get_stft_spectrogram(data):
    # Input: A Tensor of [batch_size, num_samples]
    # mono PCM samples in the range [-1, 1]. 
    stfts = tf.signal.stft(data,
                           frame_length=480,
                           frame_step=160,
                           fft_length=512)

    # Compute the magnitude
    spectrograms = tf.abs(stfts)

    return spectrograms

# compute mel-Frequency
def get_mel(stfts):

    # STFT-bin
    n_stft_bin = stfts.shape[-1]          # --> 257 (= FFT size / 2 + 1)

    linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=128,
        num_spectrogram_bins=n_stft_bin,
        sample_rate=16000,
        lower_edge_hertz=0.0,
        upper_edge_hertz=8000.0
    )
    # --> shape=(257, 128) = (FFT size / 2 + 1, num of mel bins)

    mel_spectrograms = tf.tensordot(
        stfts,                        # (1, 98, 257) 
        linear_to_mel_weight_matrix,  # (257, 128)
        1)
    # --> mel_spectrograms shape: (1, 98, 128)

    return mel_spectrograms


# Load the audio data
audio_data, sr = load_data(audio_path)

# Compute the STFT magnitude feature
stfts = get_stft_spectrogram(audio_data)

# Compute the mel spectrogram feature
mel_spectrograms = get_mel(stfts)

# Take the logarithm
feature = tf.math.log(mel_spectrograms + 1e-6)

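In the 1.X code, the "set_shape" call after "tf.tensordot" followed the sample at the link below.
https://www.tensorflow.org/api_docs/python/tf/tensordot

In 2.X, however, the result is an "EagerTensor" that already has a shape property with the correct values, so the following code was left out.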

    # This code is not needed
    # tf.tensordot does not support shape inference for this case yet.
    mel_spectrograms.set_shape(
        stfts.shape[:-1].concatenate(linear_to_mel_weight_matrix.shape[-1:])
    )