您好,欢迎访问三七文档
AdaptiveSamplingfork-MeansClusteringAnkitAggarwal1,AmitDeshpande2,andRaviKannan21IITDelhizenithankit@gmail.com2MicrosoftResearchIndia{amitdesh,kannan}@microsoft.eduAbstract.WeshowthatadaptivelysampledO(k)centersgiveacon-stantfactorbi-criteriaapproximationforthek-meansproblem,withaconstantprobability.Moreover,theseO(k)centerscontainasubsetofkcenterswhichgiveaconstantfactorapproximation,andcanbefoundus-ingLP-basedtechniquesofJainandVazirani[JV01]andCharikaretal.[CGTS02].BoththesealgorithmsrunineffectivelyO(nkd)timeandex-tendtheO(logk)-approximationachievedbythek-means++algorithmofArthurandVassilvitskii[AV07].1Introductionk-meansisapopularobjectivefunctionusedforclusteringproblemsincomputervision,machinelearningandcomputationalgeometry.Thek-meansclusteringproblemongivenndatapointsasksforasetofkcentersthatminimizesthesumofsquareddistancesbetweeneachpointanditsnearestcenter.Towriteitformally,thek-meansproblemasks:GivenasetX⊆Rdofndatapointsandanintegerk0,findasetC⊆Rdofkcentersthatminimizesthefollowingpotentialfunction.φ(C)=x∈Xminc∈Cx−c2WedenotebyφA(C)=x∈Aminc∈Cx−c2thecontributionofpointsinasubsetA⊆X.LetCOPTbethesetofoptimalkcenters.Intheoptimalsolution,eachpointofXisassignedtoitsnearestcenterinCOPT.ThisinducesanaturalpartitiononXasA1∪A2∪···∪Akintodisjointsubsets.Thereisavariantofthek-meansproblemknownasthediscretek-meansproblemwherethecentershavetobepointsfromXitself.Notethattheoptimaofthek-meansproblemanditsdiscretevariantarewithinconstantfactorsofeachother.Thereareothervariantswheretheobjectiveistominimizethesumofp-thpowersofdistancesinsteadofsquares(forp≥1),ortobemoreprecise,x∈Xminc∈Cx−cp1/p.Thep=1caseisknownasthek-medianproblemandthep=∞caseisknownasthek-centerproblem.Moreover,onecanalsoaskthediscretek-meansproblemoverarbitrarymetricspacesinsteadofRd.I.Dinuretal.(Eds.):APPROXandRANDOM2009,LNCS5687,pp.15–28,2009.cSpringer-VerlagBerlinHeidelberg200916A.Aggarwal,A.Deshpande,andR.Kannan1.1PreviousWorkItisNP-hardtosolvethek-meansproblemexactly,evenfork=2[ADHP09],[Das08,KNV08]andevenintheplane[MNV09].Constantfactorapproximationalgorithmsareknownbasedonlinearprogrammingtechniquesusedforfacilitylocationproblemsbuttheirrunningtimeissuper-linearinn[JV01].Kanugoetal.[KMN+04]givea(9+)-approximationvialocalsearchbutinrunningtimeO(n3−d)thathasexponentialdependenceond.Therearepolynomialtimeapproximationschemeswithrunningtimelinearinnanddbutexponentialorworseink[dlVKKR03,HPM04,KSS04,Mat00,Che09].Suchadependenceonkmaywellbeunavoidable,asshowninthecaseofthediscretek-medianproblem[GI03].Ontheotherhand,themostpopularalgorithmforthek-meansproblemisasimpleiterative-refinementheuristicduetoLloyd[Llo82]:startwithkarbitrary(orrandom)centers,computetheclustersdefinedbythem,definethemeansoftheseclustersasthenewcenters,re-computeclustersandrepeat.Lloyd’smethodisfastinpracticebutisguaranteedtoconvergeonlytoalocaloptimum.Intheory,theworst-caserunningtimeofLloyd’sheuristicisexponentialevenintheplane[Vat09];however,aplausibleexplanationforitspopularitycouldbeitspolynomialsmoothedcomplexity[AMR09].Inattemptstobridgethisgapbetweentheoryandpractice,severalrandom-izedalgorithmshavebeenproposedbasedontheideaofsamplingasubsetofpointsascenterstogetaconstantfactorapproximationintimeeffectivelyO(nkd).ThesecenterscouldthenbeusedtoinitializetheLloyd’smethod.MettuandPlaxton[MP02]andOstrovskyetal.[ORSS06]giveconstantfactorapprox-imationsbuttheirresultsdonotworkunconditionallyforalldatasets.Themostrelevanttoourpaperisarandomizedalgorithmcalledk-means++duetoArthurandVassilvitskii[AV07].Theyproposeasimpleadaptivesam-plingscheme(theycallitasD2sampling):ineachstep,pickapointwithprobabilityproportionaltoitscurrentcost(i.e,itssquareddistancetothenear-estcenterpickedsofar)andadditasanewcenter.Thisissimilartoagreedy2-approximationalgorithmforthek-centerproblemthatpicksapointwiththemaximumcostineachstep[Gon85].ArthurandVassilvitskiishowthatadaptivelysampledkcentersgive,inexpectation,anO(logk)-approximationforthek-meansproblem.Thisalsomeans,byMarkovinequality,thatwegetanO(logk)-approximationwithaconstantprobability.Similarsamplingschemeshaveappearedintheliteratureonclusteringofdatastreams[GMM+03,COP03]andonlinefacilitylocation[Mey01].However,thesesamplingschemesarenotassimpleandtheiranalysisisquitedifferent.ArthurandVassilvitskii’sanalysisoftheirO(logk)-approximationreliesheav-ilyonanon-trivialinductionargument(Lemma3.3of[AV07]).Reverseengi-neeringthesameargument,theyshowalowerboundexamplewhereadaptivelysampledkcentersgiveΩ(logk)-approximation,inexpectation.However,theirlowerboundismisleadinginthesensethateventhoughtheexpectederrorforadaptivesamplingonthisexampleishigh,itgivesanO(1)-approximationwithhighprobability.Thestartingpointforourworkwasthefollowingquestion:DoAdaptiveSamplingfork-MeansClustering17adaptivelysampledkcentersalwaysgiveaconstantfactorapproximation,withaconstantprobability?1.2OurResultsInSection2,weextendtheresultsofArthurandVassilvitskiitoshowthatadaptivelysampledO(k)centersgiveaconstantfactorbi-criteriaapproximationforthek-meansproblem,w
本文标题:快速k均值聚类SCI文献:Adaptive-Sampling-for-k-Means-Cluster
链接地址:https://www.777doc.com/doc-4409720 .html