MakeItTalk, Part 1 - ichou1's Blog

Trying out "MakeItTalk".
It generates a lip-synced video that follows an audio file. Impressively, it needs only one audio file and one image as input.

Here is a demo of a lip-sync video that was actually generated.
The image quality appears to drop slightly, but the motion matches the audio.
cedro3.com

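I borrowed the demo image from the site above. It works even when the face is not looking straight ahead, and the face also blinks.
The size of the mouth opening also changes with the volume of the voice.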

This should make producing videos and animation easier.

Let's look at the technology behind it.
Below is an excerpt from the summary on GitHub.

our method first disentangles the content and speaker information in the input audio signal.
The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics.

The "content" (what is being said) appears to drive the lip motion, while the "speaker information" drives the motion of the other facial parts.
(The latter probably refers to things such as the loudness and speed of the voice.)

The "speaker information" is a 256-dimensional feature vector extracted with Resemblyzer.

# audio embedding
from thirdparty.resemblyer_util.speaker_emb import get_spk_emb

au_emb = []  # collects one speaker embedding per input audio file
me, ae = get_spk_emb('<audio file>')
# Embedding dim: 256
# --> me: embeds(MEAN), (256, )
# --> ae: embeds(ALL),  (batch_size, 256)
au_emb.append(me.reshape(-1))
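
To make the two return values more concrete, the sketch below shows how an utterance-level vector like `me` could be obtained from a stack of per-segment vectors like `ae`. It assumes `me` is simply the element-wise mean of the rows of `ae`, which is what the MEAN/ALL comments suggest but which is not confirmed here. `mean_embedding` is a hypothetical helper, and the toy vectors are 4-dimensional stand-ins for the 256-dimensional ones.

```python
from typing import List, Sequence


def mean_embedding(all_embeds: Sequence[Sequence[float]]) -> List[float]:
    """Average per-segment embeddings into one utterance-level vector.

    Hypothetical helper: assumes every row has the same length.
    """
    if not all_embeds:
        raise ValueError("need at least one embedding")
    dim = len(all_embeds[0])
    if any(len(row) != dim for row in all_embeds):
        raise ValueError("all embeddings must have the same dimension")
    n = len(all_embeds)
    # Element-wise mean over the segment axis.
    return [sum(row[d] for row in all_embeds) / n for d in range(dim)]


# Toy stand-in for `ae`: 3 segments, 4 dimensions instead of 256.
ae_toy = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 3.0],
    [4.0, 1.0, 1.0, 3.0],
]
me_toy = mean_embedding(ae_toy)
print(me_toy)
```

With real data the same averaging would run over rows of length 256, giving a single vector with the `(256, )` shape noted in the comments.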

Next comes voice conversion with AutoVC.
(Here, the voice is converted to Obama's voice quality.)

from src.autovc.AutoVC_mel_Convertor_retrain_version import AutoVC_mel_Convertor
c = AutoVC_mel_Convertor('examples')
au_data_i = c.convert_single_wav_to_autovc_input(audio_filename='<audio file>',
                                                 autovc_model_path='ckpt_autovc.pth')

Landmark information is then predicted from the audio.
(For details on landmarks, see the post below.)
work-in-progress.hatenablog.com

''' STEP 4: RUN audio -> landmark network'''
from src.approaches.train_audio2landmark import Audio2landmark_model
model = Audio2landmark_model(opt_parser, jpg_shape=shape_3d)
model.test(au_emb=au_emb)
# <-- input:  landmark fake placeholder
#             examples/dump/random_val_fl.pickle
#                           random_val_au.pickle
#                           random_val_gaze.pickle (rot_trans/rot_quat/anchor_t_shape)
# --> output: landmark network as TEXT
#             examples/pred_fls_<video_name>_<audio_embed_key>.txt
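
Since the predicted landmarks are written out as plain text, they can be inspected without running the model again. The sketch below parses such text into frames of 3D points. It assumes each line holds one frame as whitespace-separated floats in x, y, z order, and that a frame has 68 points (the common facial-landmark convention). Neither assumption is confirmed here, and `parse_landmark_text` is a hypothetical helper.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def parse_landmark_text(text: str, n_points: int = 68) -> List[List[Point3D]]:
    """Parse one frame per line into lists of (x, y, z) points.

    Assumption: each non-empty line has n_points * 3 whitespace-separated floats.
    """
    frames = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        values = [float(v) for v in line.split()]
        if len(values) != n_points * 3:
            raise ValueError(
                f"line {line_no}: expected {n_points * 3} values, got {len(values)}"
            )
        # Group the flat list into consecutive (x, y, z) triples.
        frames.append(
            [(values[i], values[i + 1], values[i + 2]) for i in range(0, len(values), 3)]
        )
    return frames


# Toy input: 2 frames of 2 points each (instead of 68) to keep the example short.
sample = "0.0 0.1 0.2 1.0 1.1 1.2\n0.5 0.6 0.7 1.5 1.6 1.7\n"
frames = parse_landmark_text(sample, n_points=2)
print(len(frames), frames[0])
```

For a real file, the text would come from reading the `pred_fls_...txt` output, and the default `n_points` should be kept only if the 68-point assumption actually holds.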

An image is then generated from the landmark information.

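''' STEP 6: Image2image translation '''
import torch
from src.approaches.train_image_translation import Image_translation_block

model = Image_translation_block(opt_parser, single_test=True)
with torch.no_grad():  # inference only, no gradients needed
    model.single_test(jpg=img,
                      fls=fl,  # landmark network(de-normalized)
                      filename=fls[i],
                      prefix=opt_parser.jpg.split('.')[0])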
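
The comment on `fls` says the landmarks are de-normalized before they are passed to the image translation step. As a rough illustration of what that could mean, the sketch below maps normalized x/y coordinates back to pixel coordinates. It assumes the normalization was `(p + shift) * scale`, so the inverse is `p / scale - shift`. The formula, the `scale` and `shift` values, and the `denormalize` helper are all assumptions made for illustration, not taken from the MakeItTalk code.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def denormalize(points: List[Point3D], scale: float,
                shift: Tuple[float, float]) -> List[Point3D]:
    """Undo an assumed (p + shift) * scale normalization on x and y; z is left as-is."""
    if scale == 0:
        raise ValueError("scale must be non-zero")
    return [(x / scale - shift[0], y / scale - shift[1], z) for x, y, z in points]


# Hypothetical values: pixel coordinates were shifted by (-100, -50), then scaled by 0.5.
normalized = [(10.0, 20.0, 0.0), (30.0, 40.0, 1.0)]
pixels = denormalize(normalized, scale=0.5, shift=(-100.0, -50.0))
print(pixels)
```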

To be continued in the next post.