Hadoop and MongoDB

Introduction of OLTP, Hadoop and MongoDB
Presenter: Yubo

Agenda
• OLTP and Traditional RDBMS
• Big Data Challenges for RDBMS
  – Data storage, retrieval and processing challenge
  – Data analysis challenge
• MongoDB
• Hadoop
  – Hadoop Distributed File System
  – MapReduce

OLTP

What is OLTP
Online transaction processing, or OLTP, is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. It involves gathering input information, processing the information and updating existing information to reflect the gathered and processed information.

OLTP systems are generally built on a traditional RDBMS and have been very successful over the past decade. So in many cases when we say OLTP, we mean the traditional RDBMS and the applications built on it.

Use Cases:
• Online banking
• CRM system
• OA system
• Salesforce

Advantages of RDBMS
• There are many mature RDBMS products, such as Oracle, SQL Server and MySQL.
• They have mature algorithms for storing and retrieving data efficiently at small and intermediate data volumes.
• They have built-in ACID properties to ensure the reliability and accuracy of business transactions.
• They have flexible index mechanisms to improve data retrieval.

Challenges for Traditional RDBMS

Data Volume Challenge
In recent years there has been an explosion of data. A variety of newer sources, including Global Positioning Systems (GPS), automated trackers and monitoring systems, are generating a lot of data. These data sets can grow to hundreds of TB, which is hard for an RDBMS to store and process.

Semi-structured Challenge
In parallel with the fast data growth, data is also becoming increasingly semi-structured and sparse. This means that traditional data management techniques built around upfront schema definition and relational references are also being questioned.

The quest to solve these problems (storing, retrieving and processing large, semi-structured data efficiently) led to the emergence of a newer class of database products called NoSQL databases. MongoDB is one database of that kind.

Data Analysis/Computing Challenge
The exponential growth of data also presents a challenge for data analysis. Companies such as Google, Yahoo and Amazon need to go through terabytes and even petabytes of data to figure out which websites, products or campaigns were popular and what kinds of ads appealed to people. Google was the first to publicize MapReduce, a system they had used to scale their data processing needs. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system, called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter.

Challenges for Traditional RDBMS
Now, let's give a brief introduction to Hadoop and MongoDB.

What is Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The project includes these modules:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Requirement: Google wants to classify the search keywords and their frequency.

Result:
KeyWords    Count
Hadoop      4213423
C#          543345
T-SQL       64354
…           …

Source: All the web log files

Challenges:
1. More than 100 billion log files, TB of data: how do we store the data that needs to be analyzed?
2. How do we analyze/compute over such a large amount of data?

Storage Solutions:
• Scale up?
  – A supercomputer? So expensive!
  – Not easy to scale any further
  – IO is still a bottleneck for later processing

Storage Solutions:
• Let's scale out
  – The cluster may have thousands of commodity PCs
  – Split and distribute the data among the cluster
  – How do we handle a corrupt data node?
  – Split each file into blocks, so that distribution is based on file blocks instead of whole files
  – Replicate automatically
  – Where is the metadata?

That's the Hadoop Distributed File System

Advantage:
• Automatically distributes the file blocks among the cluster
• Replication ensures that data is not lost if one data node goes down
• The user needs to know nothing about the distribution and replication
• Easy to scale out by adding more machines to the cluster

Disadvantage:
• Costs more storage to ensure replication
• Single point of failure limitation

Calculation Challenge:
• Move the data to a centralized location and then calculate?
  – IO and network will be the bottleneck
  – The client node that does the calculation will be the bottleneck

Calculation Solution:
• So let's distribute the calculation instead of the data
  – Every data node needs an agent to do the calculation locally
  – There should be a master node to schedule the distribution of the calculation and collect the intermediate results
  – It should be easy to scale out and easy to program

That is MapReduce!!

What is MapReduce:
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It is inspired by the map and reduce primitives present in functional languages.
Map: apply some operation to every item in a list and output another list
Reduce: perform some kind of aggregation on a list and output another list

Input:
Key    Value
1      C#
2      SQL Server
3      Hadoop
4      SQL Server
5      Hadoop
6      Hadoop

After Map:
Key           Value
C#            1
SQL Server    1
Hadoop        1
SQL Server    1
Hadoop        1
Hadoop        1

After Reduce:
Key           Value
C#            1
SQL Server    2
Hadoop        3

Calculate the Search Keywords frequency using MapReduce

Key                Value
Log20120909.txt    File content

Key           Value
C#            4321
SQL Server    54523
H
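The Map and Reduce steps on the small keyword list above can be sketched in a few lines of Python. This is only a single-process illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `keyword_frequency`) and the split of the input into two partitions are assumptions made for the example. On a real cluster each partition would be mapped on the data node that holds it.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records):
    """Map: emit a (keyword, 1) pair for every (key, keyword) input record."""
    return [(keyword, 1) for _key, keyword in records]


def shuffle(pairs):
    """Group the mapped values by keyword so each keyword is reduced once."""
    grouped = defaultdict(list)
    for keyword, count in pairs:
        grouped[keyword].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the list of counts for each keyword into one total."""
    return {keyword: sum(counts) for keyword, counts in grouped.items()}


def keyword_frequency(partitions):
    """Map every partition independently, then shuffle and reduce the results."""
    mapped = chain.from_iterable(map_phase(part) for part in partitions)
    return reduce_phase(shuffle(mapped))


# The six input records from the slide, keyed by position.
records = [
    (1, "C#"), (2, "SQL Server"), (3, "Hadoop"),
    (4, "SQL Server"), (5, "Hadoop"), (6, "Hadoop"),
]

# Hypothetical split into two partitions, standing in for two data nodes.
partitions = [records[:3], records[3:]]

counts = keyword_frequency(partitions)
print(counts)  # {'C#': 1, 'SQL Server': 2, 'Hadoop': 3}
```

The map step never looks at more than one record at a time, which is why it can run locally on each node; only the much smaller (keyword, count) pairs need to move across the network before the reduce step.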