Deep Learning Strategies for Voice Conversion

  • Published on 04-Aug-2018
Transcript

Slide 1: Title
Deep Learning Strategies for Voice Conversion
CSLU Seminar 03/10/2014
Seyed Hamidreza Mohammadi
Center for Spoken Language Understanding (CSLU)
Oregon Health & Science University (OHSU), Portland, Oregon, USA
February 3, 2015

Slide 2: Voice Conversion
- Voice Conversion (VC): processing the speech of a source speaker so that it sounds like a target speaker.
- Applications:
  - personalized TTS for individuals with disabilities
  - message readers with custom/sender identities
  - movie dubbing
  - interpretive services by human or machine
- Important criteria: speaker recognizability and speech quality.

Slide 3: Generative approaches for VC
- Source-filter speech model (source: vocal cords, filter: vocal tract).
- Compact parametrization of speech as vocoder parameters.
- We assume a parallel sentence corpus of source and target speakers.
- Direct mapping from source parameters x to target parameters y.
- Quality is limited by the parametric vocoder.

Slide 4: LSFs
1. Linear Predictive Coding (LPC) coefficients
   - they model spectral peaks
   - interpolating LPCs may produce unstable filters
2. Line Spectral Frequencies (LSFs) are another representation of LPCs
   - they represent spectral peaks directly
- Main problem of LSFs: a given LSF coefficient does not necessarily represent the same formant.
- For 16 kHz speech: 18 coefficients.
- Two similar spectra may not have similar LSFs.
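The LPC-to-LSF relationship on slide 4 can be illustrated with a short sketch: the Levinson-Durbin recursion yields LPC coefficients from an autocorrelation sequence, and the LSFs are the root angles of the sum and difference polynomials built from the LPC polynomial. This is a minimal illustration, not code from the talk; the function names and the toy first-order example are my own.

```python
import numpy as np

def levinson_durbin(r, order):
    """LPC analysis: solve the Toeplitz normal equations for the
    prediction-error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # residual prediction error
    return a, err

def lpc_to_lsf(a):
    """LSFs: angles in (0, pi) of the roots of the sum and difference
    polynomials P(z) = A(z) + z^-(p+1) A(1/z), Q(z) = A(z) - z^-(p+1) A(1/z)."""
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]
    Q = a_ext - a_ext[::-1]
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    # drop the trivial roots at z = +/-1 (angles 0 and pi)
    return np.sort(ang[(ang > 1e-4) & (ang < np.pi - 1e-4)])

# toy first-order example: r = [1, 0.5] gives a = [1, -0.5]
a, err = levinson_durbin([1.0, 0.5], 1)
lsf = lpc_to_lsf(a)
```

For the talk's setup one would use order 18 on 16 kHz speech; the first-order case is only there to keep the example checkable by hand.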
Slide 5: MCEPs
- Mel Cepstrum (MCEP) coefficients model the spectrum directly and weight peaks and valleys equally.
- Main problem of MCEPs: oversmoothing; averaging many frames leads to wide formants.
- For 16 kHz speech: 24 coefficients.
- Two similar spectra do have similar MCEPs.

Slide 6: LPCs vs MCEPs
[Figure: cep_vs_lpc.png]

Slide 7: Autoencoders
- Deep AutoEncoders (DAEs) have been used for pre-training and feature extraction, especially in the image and text processing literature.
- Idea: compute speech features using a DAE.
- Autoencoder (AE):
  - Encoder: y = f(Wx + b), where x, y, W, and b are the input, output, weights, and bias, respectively.
  - Decoder: x̂ = g(W'y + b').
  - f and g are usually non-linear functions (sigmoid or tanh).
  - Weights are tied: W' = W^T.
- DAE: multiple AEs are trained layer-by-layer and stacked together; the output of the last layer can be treated as a new feature type.

Slide 8: AEs
[Diagram: a single autoencoder]

Slide 9: Deep AutoEncoders (DAEs)
[Diagram: stacked autoencoders]

Slide 10: Features
[Spectrogram with extracted features]
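The tied-weight autoencoder on slide 7 can be written out in a few lines of NumPy. This is a sketch under my own assumptions (sigmoid encoder, linear decoder, squared-error loss, plain gradient descent), not the presenter's implementation; it shows the encoder y = f(Wx + b), the tied decoder using W^T, and one back-propagation step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """Autoencoder with tied weights: the decoder reuses W^T (slide 7: W' = W^T)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias

    def encode(self, x):
        return sigmoid(self.W @ x + self.b)

    def decode(self, y):
        return self.W.T @ y + self.c  # linear decoder, tied weights

    def train_step(self, x, lr=0.1):
        """One gradient-descent step on the squared reconstruction error."""
        y = self.encode(x)
        xhat = self.decode(y)
        g_xhat = 2.0 * (xhat - x)          # dL/dxhat
        gW_dec = np.outer(y, g_xhat)       # decoder path's contribution to dL/dW
        g_pre = (self.W @ g_xhat) * y * (1.0 - y)  # back through the sigmoid
        gW_enc = np.outer(g_pre, x)        # encoder path's contribution
        self.W -= lr * (gW_dec + gW_enc)
        self.b -= lr * g_pre
        self.c -= lr * g_xhat
        return float(np.sum((xhat - x) ** 2))

ae = TiedAutoencoder(4, 2)
x = np.array([0.2, -0.1, 0.4, 0.0])
losses = [ae.train_step(x) for _ in range(200)]
```

Stacking several such AEs trained layer-by-layer, as on slide 9, yields the DAE features used in the experiments below.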
Slide 11: Mapping approaches
- The voice conversion problem, using the generative approach:
  - separate the source signal and the vocal tract features (LSF, MCEP, or AE features)
  - map source speaker vocal tract features x to target features y: y = F(x)
- F is a transformation function; candidates:
  - Frame Selection (FS) [Dutoit08, Sundermann06]
  - Joint Density Gaussian Mixture Model (JDGMM) [Kain98]
  - Artificial Neural Networks (ANN) [Desai08]
  - Deep Neural Networks (DNN)

Slide 12: Frame Selection
- Overall idea similar to unit-selection text-to-speech synthesis (TTS).
- A memory-based approach: keep all training pairs [x, y].
- At conversion time, find the k nearest entries to x_t, denoted C_{m,t}.
- Find the best output sequence ŷ = [ŷ_1, ..., ŷ_T] with a Viterbi search that minimizes target and concatenation costs:
  Cost_concat(C^y_{m,t}, C^y_{m',t+1}) = d(C^y_{m,t}, C^y_{m',t+1})
  Cost_target(x_t, C^x_{m,t}) = d(x_t, C^x_{m,t})
- Overall quality can suffer from highly incomplete coverage.

Slide 13: JDGMM
- Has the potential to generalize to unseen data (unlike FS).
- Let x = [x_1, ..., x_T] and y = [y_1, ..., y_T] be sequences of D-dimensional source and target feature vectors, and let z_t = [x_t; y_t] be the joint feature vector.
- Each mixture component applies a linear transformation of the form W_m x_t + b_m:
  ŷ_t = Σ_{m=1}^{M} P(m | x_t) (W_m x_t + b_m)
  where W_m = Σ^{yx}_m (Σ^{xx}_m)^{-1}, b_m = μ^y_m − Σ^{yx}_m (Σ^{xx}_m)^{-1} μ^x_m,
  and P(m | x_t) is the posterior probability of mixture component m given the input vector x_t.
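Slide 13's JDGMM conversion can be sketched directly from the formulas: the posterior over mixture components comes from the x-marginal of each joint Gaussian, and each component contributes a linear transform W_m x + b_m. The parameter layout (stacked means, block covariances) and the one-component example are illustrative assumptions, not the talk's code.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density, evaluated the direct way."""
    d = len(mean)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def jdgmm_convert(x, weights, means, covs):
    """Convert one source frame x with a joint-density GMM.

    means[m] stacks [mu_x; mu_y]; covs[m] is the (2D x 2D) joint covariance
    with blocks Sxx, Sxy on top and Syx, Syy below.
    """
    D = len(x)
    M = len(weights)
    # posterior P(m | x) from the x-marginal of each component
    lik = np.array([weights[m] * gaussian_pdf(x, means[m][:D], covs[m][:D, :D])
                    for m in range(M)])
    post = lik / lik.sum()
    y = np.zeros(D)
    for m in range(M):
        mu_x, mu_y = means[m][:D], means[m][D:]
        Sxx, Syx = covs[m][:D, :D], covs[m][D:, :D]
        W = Syx @ np.linalg.inv(Sxx)       # W_m = Syx Sxx^-1
        b = mu_y - W @ mu_x                # b_m = mu_y - W_m mu_x
        y += post[m] * (W @ x + b)
    return y
```

With a single component this reduces to the usual conditional-mean regression: for mu = [0; 1], unit variances, cross-covariance 0.5, and input x = 2, the output is 1 + 0.5 * 2 = 2.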
Slide 14: Artificial Neural Networks
- We use two-layered ANNs as the transformation function.
- Each layer computes y = f(Wx + b), where x and y are the input and output of that layer, respectively.
- Each layer applies a linear transformation (weights W and bias b) followed by a non-linear activation function f(·).
- The parameters of each layer are trained with the back-propagation algorithm.

Slide 15: Deep Neural Network
- The DNN consists of the ANN trained on DAE features, connected to the DAE's original hidden layers.
- The DAE hidden layers are duplicated at the top and at the bottom of the network.
- The DNN is thus effectively pre-trained: its top and bottom weights come from the DAE and its middle weights from the ANN.
- The network is then further fine-tuned by back-propagation.
- The AE is trained to be speaker-independent.

Slide 16: Deep Neural Network
[Network diagram]

Slide 17: Trajectory Generation
- The TG algorithm is used to smooth the feature sequence after conversion [Toda07].
[Figure: tg.png]
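The pre-training construction on slide 15 can be sketched as a forward pass: the DAE encoder forms the bottom layers, the ANN mapping sits in the middle, and the mirrored (tied-weight) DAE decoder forms the top. The tanh activations, layer sizes, and random placeholder weights here are my assumptions; in the talk those weights come from the trained DAE and ANN before fine-tuning.

```python
import numpy as np

def tanh_layer(W, b, x):
    return np.tanh(W @ x + b)

def build_dnn_forward(dae_W, dae_b, ann_layers, dec_b):
    """Assemble the pre-trained VC network of slide 15:
    bottom = DAE encoder, middle = ANN mapping, top = tied DAE decoder.
    ann_layers is a list of (W, b) pairs; dec_b is the decoder bias."""
    def forward(x):
        h = tanh_layer(dae_W, dae_b, x)     # DAE encoder: spectrum -> code
        for W, b in ann_layers:             # ANN: source code -> target code
            h = tanh_layer(W, b, h)
        return dae_W.T @ h + dec_b          # DAE decoder: code -> spectrum
    return forward

# illustrative shapes matching slide 19's feature orders: 24-dim MCEPs, 15-dim DAE code
rng = np.random.default_rng(0)
dae_W = rng.normal(0.0, 0.1, (15, 24))
ann = [(rng.normal(0.0, 0.1, (15, 15)), np.zeros(15)) for _ in range(2)]
convert = build_dnn_forward(dae_W, np.zeros(15), ann, np.zeros(24))
y = convert(rng.normal(0.0, 1.0, 24))
```

The whole assembled stack can then be fine-tuned end-to-end with back-propagation, as the slide describes.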
Slide 18: Setup
- Training corpus: 11 speakers.
  - 7 chosen to train the AE (1-2 recordings from each; no need to be parallel)
  - 4 chosen for testing purposes
- Big training set: 70 Harvard sentences from each of the 4 speakers.
- Small training set: two randomly selected sentences from the above.
- Testing sentences: 20 sentences from each of the 4 speakers.
- 4 speakers: two male (M1, M2) and two female (F1, F2).
- 4 conversion pairs: 2 cross-gender (M1-F1, F2-M2) and 2 intra-gender (M2-M1, F1-F2).

Slide 19: Model Parameters
- Feature order: MCEP: 24, LSF: 18, DAE: 15.
- Model sizes per feature type:

                     MCEP   LSF   DMCEP   DLSF
  JDGMM big (H)      32     32    64      32
  ANN big (Q)        64     64    64      64
  ANN small (H)      16     8     16      8
  JDGMM small (Q)    8      2     8       4

Slide 20: Objective Scores
- Metric: mel-warped log spectral distance between target and converted source.
- Averaged over all conversion pairs over all 20 test sentences.
- Large training set, mean (standard deviation):

  feat/map    FS           GMM          ANN          DNN
  LSF         8.14 (0.27)  8.00 (0.29)  7.95 (0.30)  NA
  MCEP        6.83 (0.31)  6.90 (0.31)  6.85 (0.34)  6.83 (0.31)
  DAE-LSF     8.68 (0.32)  8.61 (0.30)  8.63 (0.30)  -
  DAE-MCEP    7.05 (0.28)  6.93 (0.29)  6.89 (0.29)  -

Slide 21: Objective Scores
- Small training set (2 sentences), mean (standard deviation):

  feat/map    FS           GMM          ANN          DNN
  LSF         8.81 (0.36)  9.14 (0.34)  8.23 (0.31)  NA
  MCEP        7.60 (0.35)  8.31 (0.29)  7.58 (0.28)  7.40 (0.30)
  DAE-LSF     9.31 (0.33)  9.56 (0.32)  9.03 (0.30)  -
  DAE-MCEP    7.57 (0.31)  7.90 (0.29)  7.46 (0.26)  -

Slide 22: Future Work
- Soon: a subjective experiment on Amazon Mechanical Turk (AMT), rating speaker similarity and speech quality.
- Include neighboring frames (11 frames?) and work directly on the spectrum (not MCEPs).
  - This requires a huge corpus; speech recognition databases could be used to train the speaker-independent AE.

Slide 23: Thank you! Questions?
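The objective metric behind the score tables on slides 20-21 (mel-warped log spectral distance, in dB) can be sketched as follows. For brevity this version omits the mel warping, and the array shapes are my assumptions about how frames would be stored.

```python
import numpy as np

def log_spectral_distance(S_ref, S_conv, eps=1e-12):
    """Per-frame log-spectral distance in dB, averaged over frames.

    S_ref, S_conv: magnitude spectra, shape (frames, bins).
    (The talk's metric is mel-warped; the warping is omitted here.)
    """
    diff_db = 20.0 * np.log10((S_ref + eps) / (S_conv + eps))
    per_frame = np.sqrt(np.mean(diff_db ** 2, axis=1))
    return float(np.mean(per_frame))
```

A quick sanity check: if one spectrum is exactly twice the other in every bin, the distance is 20*log10(2), about 6.02 dB, independent of the number of frames or bins.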