Converting Audio Files to Features (Part 2): Mel Spectrogram

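This continues from the previous post.

Here we look at the log-mel spectrogram (STFT + mel frequency conversion + natural log).

The audio data is a one-second utterance of the word "yes".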
f:id:ichou1:20200216111412p:plain

log-mel spectrogram

Mel frequency (without the log transform)

Convert to the mel scale.

import librosa
import numpy as np

# Audio Data
audio_path = 'speech_dataset/yes/0a7c2a8d_nohash_0.wav'

# Load
data, sr = librosa.load(
    audio_path,
    sr=16000)

# Compute the mel-frequency spectrogram
mel = librosa.feature.melspectrogram(y=data,
                                     sr=sr,
                                     n_mels=128,
                                     n_fft=512,
                                     win_length=480,
                                     hop_length=160)

print(mel.shape)  # --> (128, 101)

# plot
import librosa.display
import matplotlib.pyplot as plt

# Use a mel-scaled y-axis so the tick labels match the mel-spaced rows
librosa.display.specshow(mel,
                         x_axis='time',
                         y_axis='mel',
                         sr=sr,
                         hop_length=160)

plt.colorbar(format='%+2.0f')
plt.title('yes/0a7c2a8d_nohash_0.wav')
plt.ylim(0, 8000)
plt.tight_layout()
plt.show()

f:id:ichou1:20200223101443p:plain

Zoomed in on the range from 1000 Hz to 2000 Hz
f:id:ichou1:20200223102230p:plain
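
The shape `(128, 101)` printed above can be checked by hand. The following is a minimal sketch of that arithmetic, assuming the clip is exactly 16000 samples long and that centered framing (the librosa default) is in effect. The variable names are only illustrative.

```python
# Parameters taken from the melspectrogram call above.
sr = 16000          # sampling rate in Hz
duration_s = 1.0    # assumption: the clip is exactly one second long
n_fft = 512
hop_length = 160
n_mels = 128

n_samples = int(sr * duration_s)

# With centered framing the signal is padded so that frame k is centered
# on sample k * hop_length, which gives one frame per hop plus one.
n_frames = 1 + n_samples // hop_length

# Number of linear-frequency bins the STFT produces before the mel projection.
n_freq_bins = 1 + n_fft // 2

# Spacing of the FFT bins in Hz, and the hop duration in milliseconds.
bin_width_hz = sr / n_fft
hop_ms = 1000 * hop_length / sr

print((n_mels, n_frames))                  # (128, 101)
print(n_freq_bins, bin_width_hz, hop_ms)   # 257 31.25 10.0
```

So each column of the spectrogram advances by 10 ms, and the 257 linear bins (31.25 Hz apart) are folded into 128 mel bands.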

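The conversion from frequency (Hz) to mel frequency is fixed and does not depend on the sampling rate.
It can be checked with the libROSA package as follows
(an example converting 8000 Hz to mel frequency).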

>>> import librosa
>>> librosa.hz_to_mel(8000)
45.245640471924965

>>> librosa.hz_to_mel(8000, htk=True)
2840.023046708319
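
The two values come from two different formulas. As a rough sketch, they can be written out with only the standard library. The function names below are my own; the constants are the usual HTK and Slaney ones, and neither function takes a sampling rate, which is why the conversion is the same for any `sr`.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK formula: a single logarithmic curve."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_mel_slaney(freq_hz):
    """Slaney formula: linear below 1000 Hz, logarithmic above."""
    f_sp = 200.0 / 3.0                # Hz per mel in the linear region
    min_log_hz = 1000.0               # start of the logarithmic region
    min_log_mel = min_log_hz / f_sp   # 15 mel at 1000 Hz
    logstep = math.log(6.4) / 27.0    # log-region step size
    if freq_hz < min_log_hz:
        return freq_hz / f_sp
    return min_log_mel + math.log(freq_hz / min_log_hz) / logstep


print(hz_to_mel_slaney(8000))  # about 45.2456
print(hz_to_mel_htk(8000))     # about 2840.02
```

The two scales use different units, so only values produced by the same formula should be compared with each other.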

Next, check which frequency corresponds to the midpoint of the mel range.

Slaney formula

>>> mel_slaney = librosa.hz_to_mel(8000) / 2
>>> mel_slaney
22.622820235962482
>>> librosa.mel_to_hz(mel_slaney)
1688.90846266178

HTK formula

>>> mel_htk = librosa.hz_to_mel(8000, htk=True) / 2
>>> mel_htk
1420.0115233541594
>>> librosa.mel_to_hz(mel_htk, htk=True)
1767.7925358506134

In both cases the midpoint falls below 2000 Hz, a quarter of 8000 Hz.
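
The same midpoints can be reproduced without libROSA. This is a sketch under the assumption that the inverse formulas below (function names are mine) match the ones the library uses; the two mel values for 8000 Hz are copied from the output shown earlier.

```python
import math

F_SP = 200.0 / 3.0               # Slaney: Hz per mel in the linear region
MIN_LOG_HZ = 1000.0              # Slaney: start of the logarithmic region
MIN_LOG_MEL = MIN_LOG_HZ / F_SP  # 15 mel
LOGSTEP = math.log(6.4) / 27.0   # Slaney: log-region step size


def mel_to_hz_htk(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_to_hz_slaney(mel):
    if mel < MIN_LOG_MEL:
        return mel * F_SP
    return MIN_LOG_HZ * math.exp(LOGSTEP * (mel - MIN_LOG_MEL))


# Mel values for 8000 Hz, copied from the session output above.
mel_8k_slaney = 45.245640471924965
mel_8k_htk = 2840.023046708319

mid_slaney = mel_to_hz_slaney(mel_8k_slaney / 2)
mid_htk = mel_to_hz_htk(mel_8k_htk / 2)

# For HTK, halving the mel value halves log10(1 + f / 700),
# which is the same as taking the square root of (1 + f / 700).
mid_htk_closed = 700.0 * (math.sqrt(1.0 + 8000.0 / 700.0) - 1.0)

print(mid_slaney)      # about 1688.9
print(mid_htk)         # about 1767.8
print(mid_htk_closed)  # same value as mid_htk
```

Because both scales are logarithmic at high frequencies, half of the mel bands cover roughly the lowest 1.7 kHz, and the other half are spread over the remaining 6.3 kHz.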

Checking the filter bank weights (lower channel weights).

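sr = 16000  # sampling rate
# sr and n_fft must be passed as keyword arguments in librosa 0.10 and later
melfb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=128)
print(melfb.shape)  # --> (128, 257) = (n_mels, n_fft // 2 + 1)

np.set_printoptions(suppress=True)  # suppress scientific notation
np.set_printoptions(precision=3)    # show three decimal places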

# Print the first 10 FFT-bin weights of the 10 lowest mel filters
for i in range(10):
    print(melfb[i][:10])

[0.    0.028 0.    0.    0.    0.    0.    0.    0.    0.   ]
[0.    0.014 0.014 0.    0.    0.    0.    0.    0.    0.   ]
[0.    0.    0.029 0.    0.    0.    0.    0.    0.    0.   ]
[0.    0.    0.    0.042 0.    0.    0.    0.    0.    0.   ]
[0.    0.    0.    0.    0.028 0.    0.    0.    0.    0.   ]
[0.    0.    0.    0.    0.015 0.014 0.    0.    0.    0.   ]
[0.    0.    0.    0.    0.    0.029 0.    0.    0.    0.   ]
[0.    0.    0.    0.    0.    0.    0.042 0.    0.    0.   ]
[0.    0.    0.    0.    0.    0.    0.001 0.028 0.    0.   ]
[0.    0.    0.    0.    0.    0.    0.    0.015 0.013 0.   ]

print(melfb[126][245:])
print(melfb[127][245:])

[0.004 0.004 0.003 0.002 0.001 0.    0.    0.    0.    0.    0.    0.   ]
[0.001 0.002 0.003 0.004 0.005 0.005 0.004 0.003 0.003 0.002 0.001 0.   ]
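
To see where these weights come from, here is a NumPy-only sketch of a triangular mel filter bank. It assumes the Slaney mel formula and area normalization (each triangle scaled by 2 / bandwidth), which is what the printed values suggest. The function names are my own and this is not the library's code.

```python
import numpy as np

F_SP = 200.0 / 3.0               # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0              # start of the logarithmic region
MIN_LOG_MEL = MIN_LOG_HZ / F_SP  # 15 mel
LOGSTEP = np.log(6.4) / 27.0     # log-region step size


def hz_to_mel(freq_hz):
    freq_hz = np.asarray(freq_hz, dtype=float)
    linear = freq_hz / F_SP
    # np.maximum keeps the log argument >= 1 for entries in the linear region
    log_part = MIN_LOG_MEL + np.log(np.maximum(freq_hz, MIN_LOG_HZ) / MIN_LOG_HZ) / LOGSTEP
    return np.where(freq_hz < MIN_LOG_HZ, linear, log_part)


def mel_to_hz(mel):
    mel = np.asarray(mel, dtype=float)
    linear = mel * F_SP
    log_part = MIN_LOG_HZ * np.exp(LOGSTEP * (mel - MIN_LOG_MEL))
    return np.where(mel < MIN_LOG_MEL, linear, log_part)


def mel_filterbank(sr, n_fft, n_mels):
    # Center frequency of every FFT bin, 0 .. sr / 2
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels triangles need n_mels + 2 edges, equally spaced on the mel scale
    mel_edges = np.linspace(float(hz_to_mel(0.0)), float(hz_to_mel(sr / 2.0)), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    widths = np.diff(hz_edges)
    ramps = hz_edges[:, None] - fft_freqs[None, :]
    rising = -ramps[:-2] / widths[:-1, None]   # left slope of each triangle
    falling = ramps[2:] / widths[1:, None]     # right slope of each triangle
    weights = np.maximum(0.0, np.minimum(rising, falling))
    # Area normalization: scale each triangle by 2 / (upper edge - lower edge)
    weights *= (2.0 / (hz_edges[2:] - hz_edges[:-2]))[:, None]
    return weights


fb = mel_filterbank(sr=16000, n_fft=512, n_mels=128)
print(fb.shape)
print(np.round(fb[:4, :5], 3))
```

At the low end the mel band edges are about 23.4 Hz apart while the FFT bins are 31.25 Hz apart, so each of the lowest filters overlaps only one or two bins. That is why the first rows above have just one or two non-zero entries.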

Passing data that has already gone through the short-time Fourier transform gives the same result.

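# The audio file is loaded the same way as before

# Short-time Fourier transform
S_F = librosa.stft(data,
                   n_fft=512,
                   win_length=480,
                   hop_length=160)

# Convert to amplitude
amp = np.abs(S_F)

# Convert to power (square the amplitude)
P = amp ** 2

# Compute the mel-frequency spectrogram
# Note: when passing a precomputed power spectrogram, use the `S` parameter instead of `y`
mel = librosa.feature.melspectrogram(S=P, sr=sr)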
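
For reference, the STFT, amplitude, and power steps can also be sketched in plain NumPy. This is only an approximation of what the library does, under several assumptions that I am labeling here: a periodic Hann window, zero padding for centering, and a synthetic 1000 Hz sine in place of the wav file.

```python
import numpy as np

sr = 16000
n_fft = 512
win_length = 480
hop_length = 160

# Hypothetical test signal: one second of a 1000 Hz sine instead of the wav file.
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 1000.0 * t)

# Periodic Hann window of win_length samples (assumed window type),
# zero-padded on both sides up to n_fft.
n = np.arange(win_length)
window = 0.5 - 0.5 * np.cos(2 * np.pi * n / win_length)
left = (n_fft - win_length) // 2
window = np.pad(window, (left, n_fft - win_length - left))

# Center the frames by adding n_fft // 2 zeros at both ends (assumed padding mode).
padded = np.pad(signal, n_fft // 2)

# Slice the padded signal into overlapping frames, one column per frame.
n_frames = 1 + (len(padded) - n_fft) // hop_length
frames = np.stack(
    [padded[i * hop_length: i * hop_length + n_fft] for i in range(n_frames)],
    axis=1,
)

# Window each frame, take the one-sided FFT, then amplitude and power.
stft = np.fft.rfft(frames * window[:, None], axis=0)
amp = np.abs(stft)
power = amp ** 2

print(power.shape)                   # (257, 101)
print(power.mean(axis=1).argmax())   # 32, i.e. 1000 Hz / 31.25 Hz per bin
```

A power array of this shape, `(n_fft // 2 + 1, n_frames)`, is what the `S` parameter expects. The mel spectrogram is then the `(128, 257)` filter bank applied to it along the frequency axis.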