Fast k-means clustering SCI paper: Adaptive-Sampling-for-k-Means-Cluster

Adaptive Sampling for k-Means Clustering

Ankit Aggarwal (1), Amit Deshpande (2), and Ravi Kannan (2)
(1) IIT Delhi, zenithankit@gmail.com
(2) Microsoft Research India, {amitdesh,kannan}@microsoft.edu

I. Dinur et al. (Eds.): APPROX and RANDOM 2009, LNCS 5687, pp. 15–28, 2009. © Springer-Verlag Berlin Heidelberg 2009

Abstract. We show that adaptively sampled O(k) centers give a constant factor bi-criteria approximation for the k-means problem, with a constant probability. Moreover, these O(k) centers contain a subset of k centers which give a constant factor approximation, and can be found using LP-based techniques of Jain and Vazirani [JV01] and Charikar et al. [CGTS02]. Both these algorithms run in effectively O(nkd) time and extend the O(log k)-approximation achieved by the k-means++ algorithm of Arthur and Vassilvitskii [AV07].

1 Introduction

k-means is a popular objective function used for clustering problems in computer vision, machine learning and computational geometry. The k-means clustering problem on given n data points asks for a set of k centers that minimizes the sum of squared distances between each point and its nearest center. To write it formally, the k-means problem asks: Given a set $X \subseteq \mathbb{R}^d$ of n data points and an integer $k > 0$, find a set $C \subseteq \mathbb{R}^d$ of k centers that minimizes the following potential function.

$$\phi(C) = \sum_{x \in X} \min_{c \in C} \|x - c\|^2$$

We denote by $\phi_A(C) = \sum_{x \in A} \min_{c \in C} \|x - c\|^2$ the contribution of points in a subset $A \subseteq X$. Let $C_{OPT}$ be the set of optimal k centers. In the optimal solution, each point of X is assigned to its nearest center in $C_{OPT}$. This induces a natural partition on X as $A_1 \cup A_2 \cup \cdots \cup A_k$ into disjoint subsets.

There is a variant of the k-means problem known as the discrete k-means problem where the centers have to be points from X itself. Note that the optima of the k-means problem and its discrete variant are within constant factors of each other. There are other variants where the objective is to minimize the sum of p-th powers of distances instead of squares (for $p \geq 1$), or to be more precise,

$$\left( \sum_{x \in X} \min_{c \in C} \|x - c\|^p \right)^{1/p}.$$

The $p = 1$ case is known as the k-median problem and the $p = \infty$ case is known as the k-center problem. Moreover, one can also ask the discrete k-means problem over arbitrary metric spaces instead of $\mathbb{R}^d$.

1.1 Previous Work

It is NP-hard to solve the k-means problem exactly, even for k = 2 [ADHP09], [Das08, KNV08]
and even in the plane [MNV09]. Constant factor approximation algorithms are known based on linear programming techniques used for facility location problems but their running time is super-linear in n [JV01]. Kanungo et al. [KMN+04] give a $(9+\epsilon)$-approximation via local search but in running time $O(n^3 \epsilon^{-d})$ that has exponential dependence on d. There are polynomial time approximation schemes with running time linear in n and d but exponential or worse in k [dlVKKR03, HPM04, KSS04, Mat00, Che09]. Such a dependence on k may well be unavoidable, as shown in the case of the discrete k-median problem [GI03].

On the other hand, the most popular algorithm for the k-means problem is a simple iterative-refinement heuristic due to Lloyd [Llo82]: start with k arbitrary (or random) centers, compute the clusters defined by them, define the means of these clusters as the new centers, re-compute clusters and repeat. Lloyd's method is fast in practice but is guaranteed to converge only to a local optimum. In theory, the worst-case running time of Lloyd's heuristic is exponential even in the plane [Vat09]; however, a plausible explanation for its popularity could be its polynomial smoothed complexity [AMR09].

In attempts to bridge this gap between theory and practice, several randomized algorithms have been proposed based on the idea of sampling a subset of points as centers to get a constant factor approximation in time effectively O(nkd). These centers could then be used to initialize Lloyd's method. Mettu and Plaxton [MP02] and Ostrovsky et al. [ORSS06] give constant factor approximations but their results do not work unconditionally for all data sets.

The most relevant to our paper is a randomized algorithm called k-means++ due to Arthur and Vassilvitskii [AV07]. They propose a simple adaptive sampling scheme (they call it $D^2$ sampling): in each step, pick a point with probability proportional to its current cost (i.e., its squared distance to the nearest center picked so far) and add it as a new center. This is similar to a greedy 2-approximation algorithm for the k-center problem that picks a point with the maximum cost in each step [Gon85]. Arthur and Vassilvitskii show that adaptively sampled k centers give, in expectation, an O(log k)-approximation for the k-means problem. This also means, by Markov's inequality, that we get an O(log k)-approximation with a constant probability.

Similar sampling schemes have appeared in the literature on clustering of data streams [GMM+03, COP03] and online facility location [Mey01]. However, these sampling schemes are not as simple and their analysis is quite different.

Arthur and Vassilvitskii's analysis of their O(log k)-approximation relies heavily on a non-trivial induction argument (Lemma 3.3 of [AV07]). Reverse engineering the same argument, they show a lower bound example where adaptively sampled k centers give Ω(log k)-approximation, in expectation. However, their lower bound is misleading in the sense that even though the expected error for adaptive sampling on this example is high, it gives an O(1)-approximation with high probability. The starting point for our work was the following question: Do adaptively sampled k centers always give a constant factor approximation, with a constant probability?

1.2 Our Results

In Section 2, we extend the results of Arthur and Vassilvitskii to show that adaptively sampled O(k) centers give a constant factor bi-criteria approximation for the k-means problem, w...
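
As an illustration of the potential function and the adaptive sampling scheme described in Section 1.1, the following is a minimal Python sketch and not the authors' implementation. The function names (sq_dist, potential, d2_sample) and the parameter t are hypothetical. Choosing the first center uniformly at random follows the usual k-means++ convention and is an assumption here, since the text above only describes the adaptive steps. Each step costs O(nd), so picking t centers takes O(ntd) time.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def potential(points, centers):
    """phi(C): sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def d2_sample(points, t, rng=None):
    """Pick up to t centers from `points` by adaptive (D^2) sampling.

    Assumption: the first center is drawn uniformly at random; every later
    center is drawn with probability proportional to its current cost.
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    # cost[i] = squared distance from points[i] to its nearest chosen center
    cost = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < t:
        if sum(cost) == 0:
            # Every point already coincides with a chosen center.
            break
        new = rng.choices(points, weights=cost, k=1)[0]
        centers.append(new)
        # Only the new center can lower a point's cost.
        cost = [min(old, sq_dist(x, new)) for old, x in zip(cost, points)]
    return centers
```

Calling d2_sample(points, k) corresponds to the k-means++ seeding step, while a larger t (some constant multiple of k) corresponds to the O(k)-center bi-criteria setting discussed above. In either case the returned centers could be used to initialize Lloyd's method.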
