Web access log statistics with MapReduce in Python

Notes:
1. Give the .py scripts execute permission: chmod +x XXX.py
2. Before running the job, debug locally with a shell pipeline: cat test | map.py | reduce.py

Log format:

    175.44.19.36 - - [29/Sep/2013:00:10:57 +0800] GET /mapreduce-nextgen/client-codes/ HTTP/1.1 200 25470
    112.111.183.57 - - [29/Sep/2013:00:10:58 +0800] POST /wp-comments-post.php HTTP/1.1 302 513
    5.63.145.70 - - [29/Sep/2013:00:11:03 +0800] HEAD / HTTP/1.1 200 221 - checks.panopta.com

3.1 - Count visiting IP addresses

Mapper implementation -- regular expression

    #!/usr/bin/python
    # _*_ coding: utf-8 _*_
    # Filename: mapper_3_1.py
    import re
    import sys

    for line in sys.stdin:
        line = line.strip()
        # The IP address is the dotted quad at the start of the line.
        match = re.match(r'(\d{1,3}\.){3}\d{1,3}', line)
        if match is None:
            continue  # skip blank or malformed lines
        print('%s\t%s' % (match.group(), 1))

Mapper implementation -- string slicing

    #!/usr/bin/python
    # _*_ coding: utf-8 _*_
    # Filename: mapper_3_1_1.py
    import sys

    for line in sys.stdin:
        line = line.strip()
        # The IP address is everything before the first space.
        ip = line[:line.find(' ')]
        if ip:
            print('%s\t%s' % (ip, 1))

The reducer is the same as before.

3.2 - Count accesses per directory (e.g. /mapreduce-nextgen/client-codes/)

Mapper implementation -- printing via filter(lambda)

    #!/usr/bin/python
    # _*_ coding: utf-8 _*_
    # Filename: mapper_3_2.py
    import sys

    for line in sys.stdin:
        line = line.strip()
        # The request path sits between the method name and 'HTTP'.
        if line.find('GET') != -1:
            words = line[line.find('GET') + 3:line.find('HTTP')]
        elif line.find('HEAD') != -1:
            words = line[line.find('HEAD') + 4:line.find('HTTP')]
        elif line.find('POST') != -1:
            words = line[line.find('POST') + 4:line.find('HTTP')]
        else:
            words = ''  # no recognized method on this line
        # filter() drops empty strings, so blank lines emit nothing.
        words = filter(lambda word: word, words.split('\n'))
        for word in words:
            print('%s\t%s' % (word, 1))

Mapper implementation -- tuple printing (cannot handle blank lines)

The listing given for this variant is identical to the filter(lambda) listing above.

The reducer is the same as before.

3.3 - Count, for each IP, how many times it accessed each subdirectory

Example output:

    175.44.30.93	/structure/heap/	8

Take the IP and the path and emit them with a count of 1; whenever the same pair appears again, add 1.

Approach: join the IP and the directory with \t, then use the special sequence \@ to separate that pair from the 1. The reducer splits on \@, compares the IP-and-directory key, and accumulates the counts.

Mapper implementation

    #!/usr/bin/python
    # _*_ coding: utf-8 _*_
    # Filename: mapper_3_3.py
    import sys

    for line in sys.stdin:
        line = line.strip()
        # Key is "IP<TAB>path": the IP precedes the first space,
        # the path sits between the method name and 'HTTP'.
        if line.find('GET') != -1:
            words = line[:line.find(' ')] + '\t' + line[line.find('GET') + 3:line.find('HTTP')]
        elif line.find('HEAD') != -1:
            words = line[:line.find(' ')] + '\t' + line[line.find('HEAD') + 4:line.find('HTTP')]
        elif line.find('POST') != -1:
            words = line[:line.find(' ')] + '\t' + line[line.find('POST') + 4:line.find('HTTP')]
        else:
            words = ''
        words = filter(lambda word: word, words.split('\n'))
        for word in words:
            # Raw string: \@ is a literal backslash followed by @.
            print(r'%s\@%s' % (word, 1))

Reducer implementation

    #!/usr/bin/python
    # _*_ coding: utf-8 _*_
    # Filename: reduce.py
    from operator import itemgetter
    import sys

    word2count = {}
    for line in sys.stdin:
        line = line.strip()
        try:
            # Split the "IP<TAB>path" key from its count at the \@ separator.
            word, count = line.split(r'\@', 1)
            word2count[word] = word2count.get(word, 0) + int(count)
        except ValueError:
            pass  # skip lines with no separator or a non-numeric count
    sorted_word2count = sorted(word2count.items(), key=itemgetter(0))
    for word, count in sorted_word2count:
        print('%s\t%s' % (word, count))
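
The map and reduce steps of section 3.3 can also be checked in a single process before touching the shell pipeline. The following is a minimal sketch, not the scripts above: it swaps the find()-based slicing for one regular expression and the \@ separator for an (ip, path) tuple key. The names LOG_PATTERN, map_line and count_ip_paths are hypothetical, and the pattern assumes the fields are separated by single spaces as in the sample log lines; the optional double quote before the method is an added assumption for logs that quote the request.

```python
import re
from collections import Counter

# Assumed layout: IP, two placeholder fields, [timestamp], method, path, protocol.
# The optional '"' tolerates logs that wrap the request in double quotes.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[[^\]]*\] '
    r'"?(?P<method>GET|POST|HEAD) (?P<path>\S+) HTTP/'
)


def map_line(line):
    """Return the (ip, path) key for one log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group('ip'), match.group('path')


def count_ip_paths(lines):
    """Count occurrences of each (ip, path) pair: the map step followed by the reduce step."""
    counts = Counter()
    for line in lines:
        key = map_line(line)
        if key is not None:  # blank or malformed lines are skipped
            counts[key] += 1
    return counts


if __name__ == '__main__':
    sample = [
        '175.44.19.36 - - [29/Sep/2013:00:10:57 +0800] '
        'GET /mapreduce-nextgen/client-codes/ HTTP/1.1 200 25470',
        '112.111.183.57 - - [29/Sep/2013:00:10:58 +0800] '
        'POST /wp-comments-post.php HTTP/1.1 302 513',
    ]
    # Same tab-separated shape as the reducer output: ip, path, count.
    for (ip, path), count in sorted(count_ip_paths(sample).items()):
        print('%s\t%s\t%s' % (ip, path, count))
```

Keeping the key as a tuple avoids choosing a separator that might also occur in a path; the separate mapper and reducer scripts still need a textual separator because they exchange plain lines over stdin and stdout.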