Getting visitor IPs and their locations with Python on SAE
This article was written back when my blog was hosted on Sina App Engine (SAE); after SAE started charging for just about everything, I moved the blog to GitHub. I am reposting the article here for the record.
Many websites display how many people have visited and where those visitors come from, so I looked into how to implement this for my own site. A search on Google showed that most solutions work by analyzing access logs, and SAE provides an API for reading its logs, so below I describe how to use that API to obtain visitor IPs and their locations.
The rough plan: authenticate to gain permission to read the access logs, fetch the logs over HTTP and extract the fields that record the visiting IP and the number of visits, filter out private IPs, look up where each remaining IP is located, and store the results in a database.
The concrete steps follow.
Getting visitor IPs and visit counts
Authentication
Sina provides an API for identity verification; authentication is performed with the application's ACCESSKEY and SECRETKEY.
The code can be downloaded from: https://raw.githubusercontent.com/sinacloud/sae-python-dev-guide/master/examples/apibus/apibus_handler.py
Alternatively, copy the code below into a Python source file named apibus_handler.py:
#-*-coding: utf8 -*-
"""
SAE API auth handler for urllib2 and requests

urllib2:

    import urllib2
    apibus_handler = SaeApibusAuthHandler(ACCESSKEY, SECRETKEY)
    opener = urllib2.build_opener(apibus_handler)
    print opener.open('http://g.sae.sina.com.cn/log/http/2015-06-18/1-access.log').read()

requests:

    import requests
    print requests.get('http://g.sae.sina.com.cn/log/http/2015-06-18/1-access.log?head/0/10|fields/ /1/2/3/4', auth=SaeApibusAuth(ACCESSKEY, SECRETKEY)).content
"""
import hmac
import base64
import hashlib
import time
import urllib
from urllib2 import BaseHandler

_APIBUS_URL_PREFIX = 'http://g.sae.sina.com.cn/'

class SaeApibusAuthHandler(BaseHandler):
    # the apibus handler must run before the other handlers
    handler_order = 100

    def __init__(self, accesskey, secretkey):
        self.accesskey = accesskey
        self.secretkey = secretkey

    def http_request(self, req):
        orig_url = req.get_full_url()
        if not orig_url.startswith(_APIBUS_URL_PREFIX):
            return req
        timestamp = str(int(time.time()))
        headers = [
            ('x-sae-timestamp', timestamp),
            ('x-sae-accesskey', self.accesskey),
        ]
        req.headers.update(headers)
        method = req.get_method()
        # keep the leading '/' of the resource path
        resource = urllib.unquote(req.get_full_url()[len(_APIBUS_URL_PREFIX)-1:])
        sae_headers = [(k.lower(), v.lower()) for k, v in req.headers.items() if k.lower().startswith('x-sae-')]
        req.add_header('Authorization', _signature(self.secretkey, method, resource, sae_headers))
        return req

    https_request = http_request

try:
    from requests.auth import AuthBase

    class SaeApibusAuth(AuthBase):
        """Attaches SAE apibus authentication to the given Request object."""
        def __init__(self, accesskey, secretkey):
            self.accesskey = accesskey
            self.secretkey = secretkey

        def __call__(self, r):
            timestamp = str(int(time.time()))
            r.headers['x-sae-timestamp'] = timestamp
            r.headers['x-sae-accesskey'] = self.accesskey
            resource = urllib.unquote(r.url[len(_APIBUS_URL_PREFIX)-1:])
            sae_headers = [(k.lower(), v.lower()) for k, v in r.headers.items() if k.lower().startswith('x-sae-')]
            r.headers['Authorization'] = _signature(self.secretkey, r.method, resource, sae_headers)
            return r
except ImportError:
    # requests is not installed; only the urllib2 handler is available
    pass

def _signature(secret, method, resource, headers):
    msgToSign = "\n".join([
        method, resource,
        "\n".join([(k + ":" + v) for k, v in sorted(headers)]),
    ])
    return "SAEV1_HMAC_SHA256 " + base64.b64encode(hmac.new(secret, msgToSign, hashlib.sha256).digest())
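To make the signing scheme concrete: for a GET of /log/http/2015-09-16/1-access.log, _signature receives the method, the resource path (with its leading slash), and the sorted x-sae-* headers joined by newlines, so the message being signed looks like this (the access key and timestamp here are made-up values):

GET
/log/http/2015-09-16/1-access.log
x-sae-accesskey:youraccesskey
x-sae-timestamp:1442448000

Its HMAC-SHA256 digest, keyed with the SECRETKEY and base64-encoded, becomes the Authorization header value after the SAEV1_HMAC_SHA256 prefix.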
Fetching the logs over HTTP
Two access paths are provided, one using the requests package and one using urllib2; the code comes from the reference article linked at the end. Save the code below as sae_log_util.py:
#-*-coding: utf8 -*-
#sae_log_util.py
#sae log utility based on sae apibus_handler
#author blog: http://bookshadow.com
#src date: 2015-09-17

status_code_dict = {200 : 'OK', 206 : 'Partial Content', 400 : 'Bad Request', \
                    500 : 'Internal Server Error', 404 : 'Not Found'}

service_ident_dict = {'http': ['access', 'error', 'alert', 'debug', 'warning', 'notice'], \
                      'taskqueue' : ['error'], \
                      'cron' : ['error'], \
                      'mail': ['access', 'error'], \
                      'rdc' : ['error', 'warning'], \
                      'storage' : ['access'], \
                      'push' : ['access'], \
                      'fetchurl' : ['access']
                     }

_URL_PREFIX = 'http://g.sae.sina.com.cn/log/'

class SaeLogFetcher(object):
    def __init__(self, access_key, secret_key):
        self.access_key = access_key
        self.secret_key = secret_key

    def fetch_log(self, service, date, ident, fop = '', version = 1):
        assert self.access_key, 'access_key should not be empty'
        assert self.secret_key, 'secret_key should not be empty'
        assert service in service_ident_dict, 'invalid service parameter'
        assert ident in service_ident_dict[service], 'invalid ident parameter'
        url = _URL_PREFIX + service + '/' + date + '/' + str(version) + '-' + ident + '.log'
        content = None
        try:
            import requests
            from apibus_handler import SaeApibusAuth
            r = requests.get(url + ('?' + fop if fop else ''), \
                             auth=SaeApibusAuth(self.access_key, self.secret_key))
            status_code, status = r.status_code, status_code_dict.get(r.status_code, 'Unknown')
            if status_code == 200:
                content = r.content
        except ImportError:
            # requests is not installed; fall back to urllib2
            from apibus_handler import SaeApibusAuthHandler
            import urllib, urllib2
            apibus_handler = SaeApibusAuthHandler(self.access_key, self.secret_key)
            opener = urllib2.build_opener(apibus_handler)
            if fop:
                url += '?' + urllib.quote(fop, safe='')
            content = opener.open(url).read()
        return content
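Judging from the usage examples in apibus_handler.py, the fop parameter is a server-side filter expression appended to the log URL as its query string: operations can be chained with '|', head/0/10 takes the first ten lines, and fields/ /2 keeps only the second space-delimited field of each line. A quick sketch of a call (the keys are placeholders):

from sae_log_util import SaeLogFetcher

# fetch the first ten lines of the 2015-09-16 HTTP access log,
# keeping only the client-IP column (field 2)
fetcher = SaeLogFetcher('<<ACCESSKEY>>', '<<SECRETKEY>>')
print fetcher.fetch_log('http', '2015-09-16', 'access', fop='head/0/10|fields/ /2')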
Calling the code above
The code below uses the two modules above to obtain the visiting IPs and their visit counts; it also comes from the reference link. Save it as ip_counter.py:
#-*-coding: utf8 -*-
#ip_counter.py
#ip counter based on sae_log_util
#author blog: http://bookshadow.com
#src date: 2015-09-17

from collections import Counter
from sae_log_util import SaeLogFetcher

date = '2015-09-16'
service = 'http'
ident = 'access'
fop = 'fields/ /2' #fetch ip only
version = 1

ACCESSKEY = '<<ACCESSKEY>>'
SECRETKEY = '<<SECRETKEY>>'

log_fetcher = SaeLogFetcher(ACCESSKEY, SECRETKEY)
result = log_fetcher.fetch_log(service, date, ident, fop, version)
content = result.split('\n')[:-1]
for e, c in Counter(content).most_common():
    print e, c

Replace <<ACCESSKEY>> and <<SECRETKEY>> with the actual values of your SAE application.
Then put the three files above in the same working directory and run ip_counter.py; it prints the visiting IPs together with their visit counts.
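The output is a list of IPs sorted by visit count, something like the following (illustrative values, not real log data); note the 10.x.x.x entry, which the next section gets rid of:

220.181.108.75 23
66.249.79.113 17
10.67.15.26 9
123.125.71.36 4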
Filtering out private IPs
The results above include private IPs, presumably from traffic between SAE's internal servers, for example services such as memcached and mysql running on machines other than the one hosting the application. Whatever the cause, these private IPs are noise we do not want, so the next step removes them.
Private addresses fall into three classes, A, B and C, and the network ID of each class is fixed, as shown below:
Class A: 10.0.0.0/8: 10.0.0.0~10.255.255.255
Class B: 172.16.0.0/12: 172.16.0.0~172.31.255.255
Class C: 192.168.0.0/16: 192.168.0.0~192.168.255.255
def ip_into_int(ip):
    # fold the four dotted-quad octets into one 32-bit integer, e.g. '10.0.0.1' -> 167772161
    return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))
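Building on ip_into_int, here is a minimal sketch of the actual filter; the function name is_internal_ip and the shift-based comparison are one plausible completion, assuming exactly the three ranges listed above:

def is_internal_ip(ip):
    # compare only the network-ID bits: /8 for class A, /12 for class B, /16 for class C
    ip = ip_into_int(ip)
    net_a = ip_into_int('10.255.255.255') >> 24
    net_b = ip_into_int('172.31.255.255') >> 20
    net_c = ip_into_int('192.168.255.255') >> 16
    return ip >> 24 == net_a or ip >> 20 == net_b or ip >> 16 == net_c

With this in place, the counting loop in ip_counter.py can simply skip every entry for which is_internal_ip(e) is true.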
Looking up an IP's location
The location of an IP can be queried through the IP lookup API that Taobao provides; the query code is sketched below.
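A minimal sketch of such a lookup, assuming the getIpInfo.php endpoint at ip.taobao.com that was available at the time; the function name get_ip_location and the exact response fields used are assumptions:

#encoding:utf-8
import json
import urllib2

def get_ip_location(ip):
    # hypothetical sketch: query Taobao's IP lookup service, which answered
    # JSON of the form {"code": 0, "data": {"country": ..., "region": ..., "city": ...}}
    url = 'http://ip.taobao.com/service/getIpInfo.php?ip=' + ip
    try:
        resp = json.loads(urllib2.urlopen(url, timeout=5).read())
    except Exception:
        return None
    if resp.get('code') != 0:  # a non-zero code means the lookup failed
        return None
    data = resp['data']
    return data.get('country'), data.get('region'), data.get('city')

print get_ip_location('220.181.108.75')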
The information obtained for each IP can then be stored in the database together with an update timestamp, which gives you a persistent record of visits to the site for later visualization and analysis.
For convenience, I use SAE's cron service to write the previous day's visit records into the database once a day; a sketch of such a daily task follows.
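To show how the pieces fit together, here is a minimal sketch of that daily job. Everything in it is illustrative rather than the original code: the table visitor_stats and its columns are assumptions, the MySQL credentials are placeholders, and ip_filters is a hypothetical module collecting the is_internal_ip and get_ip_location helpers sketched above.

#-*-coding: utf8 -*-
import datetime
from collections import Counter

import MySQLdb

from sae_log_util import SaeLogFetcher
from ip_filters import is_internal_ip, get_ip_location  # hypothetical module holding the helpers above

def save_yesterday_stats():
    # yesterday's date in the YYYY-MM-DD form the log API expects
    yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
    log = SaeLogFetcher('<<ACCESSKEY>>', '<<SECRETKEY>>').fetch_log(
        'http', yesterday, 'access', 'fields/ /2')
    counts = Counter(log.split('\n')[:-1])
    db = MySQLdb.connect(host='<<MYSQL_HOST>>', user='<<MYSQL_USER>>',
                         passwd='<<MYSQL_PASS>>', db='<<MYSQL_DB>>', charset='utf8')
    cur = db.cursor()
    for ip, cnt in counts.most_common():
        if is_internal_ip(ip):  # drop SAE-internal private addresses
            continue
        location = get_ip_location(ip)  # (country, region, city) or None
        cur.execute('INSERT INTO visitor_stats (ip, cnt, location, updated) VALUES (%s, %s, %s, %s)',
                    (ip, cnt, repr(location), yesterday))
    db.commit()
    db.close()

The cron entry itself then only needs to request, once a day, whatever URL of the application is routed to this function; see SAE's cron documentation for the exact configuration syntax.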