您好,欢迎访问三七文档
JournalofMachineLearningResearch8(2007)131-154Submitted2/05;Revised9/05;Published1/07DistancesbetweenDataSetsBasedonSummaryStatisticsNikolajTattiNTATTI@CC.HUT.FIHIITBasicResearchUnitLaboratoryofComputerandInformationScienceHelsinkiUniversityofTechnology,FinlandEditor:DaleSchuurmansAbstractTheconceptsofsimilarityanddistancearecrucialindatamining.Weconsidertheproblemofdefiningthedistancebetweentwodatasetsbycomparingsummarystatisticscomputedfromthedatasets.Theinitialdefinitionofourdistanceisbasedongeometricalnotionsofcertainsetsofdistributions.Weshowthatthisdistancecanbecomputedincubictimeandthatithasseveralintuitiveproperties.WealsoshowthatthisdistanceistheuniqueMahalanobisdistancesatisfyingcertainassumptions.Wealsodemonstratethatifwearedealingwithbinarydatasets,thenthedistancecanberepresentednaturallybycertainparityfunctions,andthatitcanbeevaluatedinlineartime.Ourempiricaltestswithrealworlddatashowthatthedistanceworkswell.Keywords:dataminingtheory,complexdata,binarydata,itemsets1.IntroductionInthispaperwewillconsiderthefollowingproblem:GiventwodatasetsD1andD2ofdimensionK,defineadistancebetweenD1andD2.Tobemoreprecise,weconsidertheproblemofdefiningthedistancebetweentwomultisetsoftransactions,eachsetsampledfromitsownunknowndistribution.WewilldefineadissimilaritymeasurebetweenD1andD2andwewillrefertothismeasureasCMdistance.Generallyspeaking,thenotionofdissimilaritybetweentwoobjectsisoneofthemostfunda-mentalconceptsindatamining.Ifoneisabletoretrieveadistancematrixfromasetofobjects,thenoneisabletoanalysedatabyusingforexample,clusteringorvisualisationtechniques.Manyrealworlddatacollectionsmaybenaturallydividedintoseveraldatasets.Forexample,ifadatacollectionconsistsofmoviesfromdifferenteras,thenwemaydividethemoviesintosubcollec-tionsbasedontheirreleaseyears.Adistancebetweenthesedata(sub)setswouldprovidemeanstoanalysethemassingleobjects.Suchanapproachmayeasethetaskofunderstandingcomplexdatacollections.LetuscontinuebyconsideringthepropertiestheCMdistanceshouldhave.Firstofall,itshouldbeametric.Themotivationbehindthisrequirementisthatthemetrictheoryisawell-knownareaandmetricshavemanytheoreticalandpracticalvirtues.Secondly,inourscenariothedatasetshavestatisticalnatureandtheCMdistanceshouldtakethisintoaccount.Forexample,considerthatbothdatasetsaregeneratedfromthesamedistribution,thentheCMdistanceshouldgivesmallvaluesandapproach0asthenumberofdatapointsinthedatasetsincreases.ThethirdrequirementisthatweshouldbeabletoevaluatetheCMdistancequickly.Thisrequirementiscrucialsincewemayhavehighdimensionaldatasetswithavastamountofdatapoints.c2007NikolajTatti.TATTITheCMdistancewillbebasedonsummarystatistics,features.Letusgiveasimpleexample:AssumethatwehavedatasetsD1=fA;B;A;AgandD2=fA;B;C;BgandassumethattheonlyfeatureweareinterestedinistheproportionofAinthedatasets.ThenwecansuggestthedistancebetweenD1andD2tobej3=4 1=4j=1=2.TheCMdistanceisbasedonthisidea;however,thereisasubtledifficulty:Ifwecalculateseveralfeatures,thenshouldwetakeintoaccountthecorrelationofthesefeatures?WewilldoexactlythatindefiningtheCMdistance.Therestofthispaperisorganisedasfollows.InSection2wegivethedefinitionoftheCMdistancebyusingsomegeometricalinterpretations.Wealsostudythepropertiesofthedistanceandprovideanalternativecharacterisation.InSection3westudytheCMdistanceandbinarydatasets.InSection4wediscusshowtheCMdistancecanbeusedwitheventsequencesandinSection5wecommentaboutthefeatureselection.Section6isdevotedforrelatedwork.TheempiricaltestsarerepresentedinSection7andweconcludeourworkwiththediscussioninSection8.2.TheConstrainedMinimumDistanceInthefollowingsubsectionwewilldefineourdistanceusinggeometricalintuitionandshowthatthedistancecanbeevaluatedefficiently.Inthesecondsubsectionwewilldiscussvariouspropertiesofthedistance,andinthelastsubsectionwewillprovideanalternativejustificationtothedistance.Theaimofthisjustificationistoprovidemoretheoreticalevidenceforourdistance.2.1TheDefinitionWebeginbygivingsomebasicdefinitions.ByadatasetDwemeanafinitecollectionofsampleslyinginsomefinitespaceΩ.ThesetΩiscalledsamplespace,andfromnowonwewilldenotethisspacebytheletterΩ.ThenumberofelementsinΩisdenotedbyjΩj.ThenumberofsamplesinthedatasetDisdenotedbyjDj.Aswesaidintheintroduction,ourgoalisnottodefineadistancedirectlyondatasetsbutratherthroughsomestatisticsevaluatedfromthedatasets.Inordertodoso,wedefineafeaturefunctionS:Ω!RNtomapapointinthesamplespacetoarealvector.ThroughoutthissectionSwillindicatesomegivenfeaturefunctionandNwillindicatethedimensionoftherangespaceofS.WewillalsodenotetheithcomponentofSbySi.Notethatifwehaveseveralfeaturefunctions,thenwecanjointhemintoonebigfeaturefunction.Afrequencyθ2RNofStakenwithrespecttoadatasetDistheaverageofvaluesofStakenoverthedataset,thatis,θ=1jDj∑ω2DS(ω).WedenotethisfrequencybyS(D).AlthoughwedonotmakeanyassumptionsconcerningthesizeofΩ,someofourchoicesaremotivatedbythinkingthatjΩjcanbeverylarge—solargethateventhesimplestoperation,say,enumeratingalltheelementsinΩ,isnottractable.Ontheotherhand,weassumethatNissuchthatanalgorithmexecutablein,say,O(N3)timeisfeasible.Ino
本文标题:Distances between data sets based on summary stati
链接地址:https://www.777doc.com/doc-3336669 .html