Convolutional Neural Networks for Speech Recognition
Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu
IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 22, No. 10, October 2014

Organization of the paper
• Introduction
• Deep Neural Networks
• Convolutional Neural Networks
• CNN with limited weight sharing
• Experiments
• Conclusions

Deep Neural Networks
• Generally speaking, a deep neural network (DNN) refers to a feedforward neural network with more than one hidden layer. Each hidden layer has a number of units (or neurons), each of which takes all outputs of the lower layer as input, multiplies them by a weight vector, sums the result, and passes it through a non-linear activation function such as sigmoid or tanh.
• All neuron activations in each layer can be represented in the following matrix form: o^(l) = σ(W^(l) o^(l−1) + b^(l)), where σ(·) is applied element-wise.
• For a multi-class classification problem, the posterior probability of each class can be estimated using an output softmax layer: p(s|o) = exp(a_s) / Σ_{s'} exp(a_{s'}), where a_s is the activation of output unit s.
• In the hybrid DNN-HMM model, the DNN output layer computes the state posterior probabilities, which are divided by the states' priors to estimate the observation likelihoods.
• Because of the increased model complexity of DNNs, a pretraining algorithm is often needed, which initializes all weight matrices prior to back-propagation training.
• One popular method to pretrain DNNs uses the restricted Boltzmann machine (RBM).
• An RBM has a set of hidden units that are used to compute a better feature representation of the input data. After learning, all RBM weights can be used as a good initialization for one DNN layer.

Convolutional Neural Networks
• The convolutional neural network (CNN) can be regarded as a variant of the standard neural network. Instead of using fully connected hidden layers as described in the preceding section, the CNN introduces a special network structure consisting of convolution and pooling layers.
• This section covers:
  • Organization of the Input Data to the CNN
  • Convolution Ply
  • Pooling Ply
  • Learning Weights in the CNN
  • Treatment of Energy Features
  • The Overall CNN Architecture
  • Benefits of CNNs for ASR

Organization of the Input Data to the CNN
• In using the CNN for pattern recognition, the input data need to be organized as a number of feature maps to be fed into the CNN.
• We need to use inputs that preserve locality along both the frequency and time axes.
• MFSC features, along with their deltas and delta-deltas, will be used to represent each speech frame, in order to describe the acoustic energy distribution in each of several different frequency bands.
• As for time, a single window of input to the CNN will consist of a wide amount of context. As for frequency, the conventional use of MFCCs does present a major problem because the discrete cosine transform projects the spectral energies onto a new basis that may not maintain locality. In this paper, we use the log-energy computed directly from the mel-frequency spectral coefficients (i.e., with no DCT), which we denote as MFSC features.
• (Figure: the input features can be arranged either as a number of one-dimensional (1-D) feature maps or as three 2-D feature maps.)

Convolution Ply
• Every input feature map is connected to many feature maps in the convolution ply based on a number of local weight matrices. The mapping can be represented as the well-known convolution operation in signal processing.
• Each unit of one feature map in the convolution ply can be computed as q_{j,m} = σ( Σ_{i=1..I} Σ_{n=1..F} o_{i,n+m−1} w_{i,j,n} + w_{0,j} ), where I is the number of input feature maps and F is the filter size.
• Written in a more concise matrix form: q_j = σ( Σ_{i=1..I} o_i ∗ w_{i,j} ), where ∗ denotes the convolution operation.

Pooling Ply
• The pooling ply is also organized into feature maps, with the same number of feature maps as its convolution ply, but each map is smaller.
• This reduction is achieved by applying a pooling function to several units in a local region. It is usually a simple function such as maximization or averaging; for example, max pooling with pooling size G and shift s computes p_{j,m} = max_{n=1..G} q_{j,(m−1)s+n}.
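To make the convolution and pooling plies concrete, below is a minimal NumPy sketch of their forward computation along the frequency axis only. All shapes and parameter values (number of maps, filter size F, pooling size G, shift s) are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def convolution_ply(O, W, b, sigma=np.tanh):
    """O: (I, B) -- I input feature maps over B frequency bands.
    W: (J, I, F) -- J output maps, each with an I x F local weight matrix.
    b: (J,) -- one bias per output map. Returns Q: (J, B - F + 1)."""
    I, B = O.shape
    J, _, F = W.shape
    Q = np.empty((J, B - F + 1))
    for j in range(J):
        for m in range(B - F + 1):
            # Local weighted sum over all input maps and F adjacent bands.
            Q[j, m] = np.sum(W[j] * O[:, m:m + F]) + b[j]
    return sigma(Q)

def max_pooling_ply(Q, G=2, s=2):
    """Max pooling with pooling size G and shift s along frequency."""
    J, M = Q.shape
    positions = (M - G) // s + 1
    return np.stack([Q[:, m * s:m * s + G].max(axis=1)
                     for m in range(positions)], axis=1)

# Illustrative usage: 3 input maps (static, delta, delta-delta) over 40
# mel-frequency bands, 8 convolution feature maps with filter size 5.
rng = np.random.default_rng(0)
O = rng.standard_normal((3, 40))
W = 0.1 * rng.standard_normal((8, 3, 5))
b = np.zeros(8)
P = max_pooling_ply(convolution_ply(O, W, b))
print(P.shape)  # (8, 18): 36 convolution positions pooled pairwise
```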
Learning Weights in the CNN
• All weights in the convolution ply can be learned using the same error back-propagation algorithm, but some special modifications are needed to take care of the sparse connections and weight sharing.
• To do so, let us first represent the convolution operation above in the same mathematical form as a fully connected ANN layer, so that the same learning algorithm can be similarly applied.
• Since the pooling ply has no weights, no learning is needed there. However, the error signals should be back-propagated to lower plies through the pooling function.
• That is, the error signal reaching the lower convolution ply is obtained by passing each pooling unit's error back through the pooling function; with max pooling, for example, the error is routed only to the unit that produced the maximum.

Treatment of Energy Features
• In ASR, log-energy is usually calculated per frame and appended to the other spectral features. In a CNN, it is not suitable to treat energy the same way, since it is the sum of the energy in all frequency bands and therefore does not depend on frequency. Instead, the log-energy features should be appended as extra inputs to all convolution units.

The Overall CNN Architecture
• In this paper, we follow the hybrid ANN-HMM framework, where we use a softmax output layer on top of the topmost layer of the CNN to compute the posterior probabilities for all HMM states. These posteriors are used to estimate the likelihood of all HMM states per frame by dividing by the states' prior probabilities. Finally, the likelihoods of all HMM states are sent to a Viterbi decoder to recognize the continuous stream of speech units.

Benefits of CNNs for ASR
• The CNN has three key properties: locality, weight sharing, and pooling.
• Locality in the units of the convolution ply allows more robustness against non-white noise where some bands are cleaner than others. This is because good features can be computed locally from the cleaner parts of the spectrum, and only a smaller number of features are affected by the noise.
• Weight sharing can also improve model robustness and reduce overfitting, as each weight is learned from multiple frequency bands in the input instead of just from one single location.
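As a closing illustration of the hybrid scoring step from "The Overall CNN Architecture" above, here is a small Python sketch converting softmax state posteriors into the scaled likelihoods a Viterbi decoder consumes. The function name, the uniform priors, and all shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Turn per-frame state posteriors p(s|x_t) from the softmax layer into
    scaled observation likelihoods p(x_t|s) proportional to p(s|x_t) / p(s),
    computed in the log domain for numerical stability.

    posteriors: (T, S) array of softmax outputs, one row per frame
    priors:     (S,) state prior probabilities, typically estimated from
                state-frequency counts over the training alignments
    returns:    (T, S) scaled log-likelihoods for the decoder"""
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))

# Illustrative usage with random numbers standing in for real CNN outputs:
rng = np.random.default_rng(0)
T, S = 5, 10                                   # 5 frames, 10 HMM states
logits = rng.standard_normal((T, S))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
priors = np.full(S, 1.0 / S)                   # uniform stand-in priors
log_likes = posteriors_to_scaled_likelihoods(posteriors, priors)
print(log_likes.shape)                         # (5, 10), ready for Viterbi
```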