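Wu Liangchao's Study Notes

Getting visitor IPs and their locations with Python on SAE

I wrote this post back when my blog was hosted on Sina Cloud. Sina Cloud later started charging for all sorts of things, so I moved the blog to GitHub. I am reposting the article here as a record.

I often see websites that show how many people have visited and where those visitors are located, so I started wondering how this is done and how to add it to my own site. A Google search showed that most approaches derive it from access logs, and Sina Cloud happens to provide an API for reading logs, so below I describe how to use this API to get the visiting IPs and where they come from.

The rough procedure: first pass authentication to gain access to the logs, then use HTTP requests to extract the visiting IPs from the log records and count the visits per IP. Next, drop the private IPs, look up the location of each remaining IP, and store the results in a database.

The concrete steps follow.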

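Getting the visiting IPs and their visit counts

Authentication

This is an API provided by Sina for authentication; identity is verified with the application's ACCESSKEY and SECRETKEY.

Download link for the code: https://raw.githubusercontent.com/sinacloud/sae-python-dev-guide/master/examples/apibus/apibus_handler.py

Alternatively, copy the code below into a Python source file named apibus_handler.py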

#-*-coding: utf8 -*-

"""
SAE API auth handler for urllib2 and requests

urllib2:

>>> import urllib2
>>> apibus_handler = SaeApibusAuthHandler(ACCESSKEY, SECRETKEY)
>>> opener = urllib2.build_opener(apibus_handler)
>>> print opener.open('http://g.sae.sina.com.cn/log/http/2015-06-18/1-access.log').read()

requests:

>>> import requests
>>> print requests.get('http://g.sae.sina.com.cn/log/http/2015-06-18/1-access.log?head/0/10|fields/ /1/2/3/4', auth=SaeApibusAuth(ACCESSKEY, SECRETKEY)).content
"""


import hmac
import base64
import hashlib
import time
import urllib
from urllib2 import BaseHandler, Request

_APIBUS_URL_PREFIX = 'http://g.sae.sina.com.cn/'

class SaeApibusAuthHandler(BaseHandler):
    # apibus handler must be in front
    handler_order = 100

    def __init__(self, accesskey, secretkey):
        self.accesskey = accesskey
        self.secretkey = secretkey

    def http_request(self, req):
        orig_url = req.get_full_url()
        if not orig_url.startswith(_APIBUS_URL_PREFIX):
            return req

        timestamp = str(int(time.time()))
        headers = [
            ('x-sae-timestamp', timestamp),
            ('x-sae-accesskey', self.accesskey),
        ]
        req.headers.update(headers)

        method = req.get_method()
        resource = urllib.unquote(req.get_full_url()[len(_APIBUS_URL_PREFIX)-1:])
        sae_headers = [(k.lower(), v.lower()) for k, v in req.headers.items() if k.lower().startswith('x-sae-')]
        req.add_header('Authorization', _signature(self.secretkey, method, resource, sae_headers))
        return req

    https_request = http_request

try:
    from requests.auth import AuthBase

    class SaeApibusAuth(AuthBase):
        """Attaches HTTP Basic Authentication to the given Request object."""
        def __init__(self, accesskey, secretkey):
            self.accesskey = accesskey
            self.secretkey = secretkey

        def __call__(self, r):
            timestamp = str(int(time.time()))
            r.headers['x-sae-timestamp'] = timestamp
            r.headers['x-sae-accesskey'] = self.accesskey
            resource = urllib.unquote(r.url[len(_APIBUS_URL_PREFIX)-1:])
            #resource = r.url[len(_APIBUS_URL_PREFIX)-1:]
            sae_headers = [(k.lower(), v.lower()) for k, v in r.headers.items() if k.lower().startswith('x-sae-')]
            r.headers['Authorization'] = _signature(self.secretkey, r.method, resource, sae_headers)
            return r
except ImportError:
    # requests was not present!
    pass

def _signature(secret, method, resource, headers):
    msgToSign = "\n".join([
        method, resource,
        "\n".join([(k + ":" + v) for k, v in sorted(headers)]),
    ])
    return "SAEV1_HMAC_SHA256 " + base64.b64encode(hmac.new(secret, msgToSign, hashlib.sha256).digest())
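
To make the signing step easier to follow, here is a minimal sketch of the same computation as _signature(), written for Python 3 (the handler above is Python 2). The function name sae_signature, the credentials, and the timestamp are placeholders made up for illustration, and the sketch assumes the x-sae-* header names and values are already lower-cased.

```python
import base64
import hashlib
import hmac


def sae_signature(secret, method, resource, sae_headers):
    """Build an Authorization value the same way _signature() does.

    sae_headers is a list of (name, value) pairs for the x-sae-* headers.
    """
    # Headers are sorted and rendered as "name:value", one per line.
    header_lines = "\n".join(k + ":" + v for k, v in sorted(sae_headers))
    # The message is: method, resource path, then the header lines.
    msg_to_sign = "\n".join([method, resource, header_lines])
    digest = hmac.new(secret.encode("utf-8"),
                      msg_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return "SAEV1_HMAC_SHA256 " + base64.b64encode(digest).decode("ascii")


# Placeholder credentials and a fixed timestamp, for illustration only.
example = sae_signature(
    "my-secret-key",
    "GET",
    "/log/http/2015-06-18/1-access.log",
    [("x-sae-timestamp", "1434600000"), ("x-sae-accesskey", "my-access-key")],
)
print(example)
```

Because the headers are sorted before signing, the order in which they are passed in does not change the result.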

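Fetching the logs over HTTP

Two methods are provided, one using the requests package and one using urllib2. The code comes from the reference articles listed at the end; save the code below as sae_log_util.py: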

#-*-coding: utf8 -*-

#sae_log_util.py
#sae log utility based on sae apibus_handler
#author blog: http://bookshadow.com
#src date: 2015-09-17

status_code_dict = {200 : 'OK', 206 : 'Partial Content', 400 : 'Bad Request', \
                    500 : 'Internal Server Error' , 404 : 'Not Found'}

service_ident_dict = {'http': ['access', 'error', 'alert', 'debug', 'warning', 'notice'], \
                      'taskqueue' : ['error'], \
                      'cron' : ['error'], \
                      'mail': ['access', 'error'], \
                      'rdc' : ['error', 'warning'], \
                      'storage' : ['access'], \
                      'push' : ['access'], \
                      'fetchurl' : ['access']
                      }

_URL_PREFIX = 'http://g.sae.sina.com.cn/log/'

class SaeLogFetcher(object):

    def __init__(self, access_key, secret_key):
        self.access_key = access_key
        self.secret_key = secret_key

    def fetch_log(self, service, date, ident, fop = '', version = 1):
        assert self.access_key, 'access_key should not be empty'
        assert self.secret_key, 'secret_key should not be empty'
        assert service in service_ident_dict, 'invalid service parameter'
        assert ident in service_ident_dict[service], 'invalid ident parameter'

        url = _URL_PREFIX + service + '/' + date + '/' + str(version) + '-' + ident + '.log'
        content = None

        try:
            import requests
            from apibus_handler import SaeApibusAuth
            r = requests.get(url + ('?' + fop if fop else ''), \
                             auth=SaeApibusAuth(self.access_key, self.secret_key))
            status_code, status = r.status_code, status_code_dict.get(r.status_code, 'Unknown')
            if status_code == 200:
                content = r.content
        except ImportError:
            # requests was not present!
            from apibus_handler import SaeApibusAuthHandler
            import urllib, urllib2
            apibus_handler = SaeApibusAuthHandler(self.access_key, self.secret_key)
            opener = urllib2.build_opener(apibus_handler)
            if fop:
                url += '?' + urllib.quote(fop, safe='')
            content = opener.open(url).read()
        return content
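
As a quick illustration of what fetch_log() requests, the sketch below only rebuilds the URL and does no network access. The helper name build_log_url and its quote_fop flag are made up for this example; the flag mirrors the fact that the urllib2 branch above percent-encodes fop while the requests branch passes it through as written.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

URL_PREFIX = 'http://g.sae.sina.com.cn/log/'


def build_log_url(service, date, ident, fop='', version=1, quote_fop=False):
    """Rebuild the log URL that fetch_log() asks for (no request is sent)."""
    url = URL_PREFIX + service + '/' + date + '/' + str(version) + '-' + ident + '.log'
    if fop:
        # quote_fop=True imitates the urllib2 branch, False the requests branch.
        url += '?' + (quote(fop, safe='') if quote_fop else fop)
    return url


plain_url = build_log_url('http', '2015-09-16', 'access', 'fields/ /2')
quoted_url = build_log_url('http', '2015-09-16', 'access', 'fields/ /2', quote_fop=True)
print(plain_url)
print(quoted_url)
```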

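Calling the code above

The code below fetches the visiting IPs and their visit counts. It also comes from the reference links; copy it and save it as ip_counter.py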

#-*-coding: utf8 -*-
#ip_counter.py
#ip counter based on sae_log_util
#author blog: http://bookshadow.com
#src date: 2015-09-17

from collections import Counter
from sae_log_util import SaeLogFetcher

date = '2015-09-16'
service = 'http'
ident = 'access'
fop = 'fields/ /2' #fetch ip only
version = 1

ACCESSKEY = '<<ACCESSKEY>>'
SECRETKEY = '<<SECRETKEY>>'

log_fetcher = SaeLogFetcher(ACCESSKEY, SECRETKEY)

result = log_fetcher.fetch_log(service, date, ident, fop, version)

content = result.split('\n')[:-1]

for e, c in Counter(content).most_common():
    print e, c

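Replace <<ACCESSKEY>> and <<SECRETKEY>> in the code with the actual values for your SAE application.

Then put the three files above in the same working directory and run ip_counter.py to get the visiting IPs and their visit counts.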
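
The counting step can be tried without SAE credentials. The sketch below is a small variation on ip_counter.py, assuming the log text has one IP per line; the helper name count_ips and the sample addresses are made up for illustration. Unlike the script above, it also tolerates a missing result (fetch_log returns None when the status is not 200) and a bytes response under Python 3.

```python
from collections import Counter


def count_ips(log_text):
    """Return (ip, hits) pairs, most frequent first."""
    if not log_text:
        # fetch_log() returns None on a non-200 response.
        return []
    if isinstance(log_text, bytes):
        log_text = log_text.decode('utf-8')
    lines = [line.strip() for line in log_text.split('\n')]
    # Skip blank lines, such as the one after the trailing newline.
    return Counter(line for line in lines if line).most_common()


# Made-up sample content in the "one IP per line" shape.
sample = '203.0.113.7\n10.0.0.5\n203.0.113.7\n198.51.100.23\n203.0.113.7\n10.0.0.5\n'
for ip, hits in count_ips(sample):
    print(ip, hits)
```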

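Removing private IPs

The results above include private IPs. My guess is that they come from communication between SAE's internal servers, for example services such as memcached and MySQL running on a different server from the application. Either way, we do not want to see these private IPs, so the following removes them.

Private IPs fall into three classes, A, B, and C, and the net-id of each class is fixed, as shown below: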

Class A: 10.0.0.0/8: 10.0.0.0~10.255.255.255
Class B: 172.16.0.0/12: 172.16.0.0~172.31.255.255
Class C: 192.168.0.0/16: 192.168.0.0~192.168.255.255

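So we can right-shift an IP and compare the result with the net-ids of the three private ranges to decide whether it is a private IP. The implementation is as follows: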

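from functools import reduce  # a builtin in Python 2, but must be imported in Python 3

def ip_into_int(ip):
    # Take 192.168.1.13 as an example: first turn 192.168.1.13 into hexadecimal c0.a8.01.0d, then drop the "." and convert to the decimal number 3232235789.
    # (((((192 * 256) + 168) * 256) + 1) * 256) + 13
    return reduce(lambda x,y:(x<<8)+y,map(int,ip.split('.')))

def is_internal_ip(ip):
    ip = ip_into_int(ip)
    # Shifting the last address of each range leaves only its net-id bits.
    net_a = ip_into_int('10.255.255.255') >> 24
    net_b = ip_into_int('172.31.255.255') >> 20
    net_c = ip_into_int('192.168.255.255') >> 16
    return ip >> 24 == net_a or ip >>20 == net_b or ip >> 16 == net_c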
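
As a sanity check on the shift-based test, the sketch below repeats it in Python 3 with the net-ids written out and compares it against the standard library's ipaddress module. The function names, the sample counts, and the choice of ipaddress as a cross-check are assumptions for illustration, not part of the original code.

```python
import ipaddress
from functools import reduce

PRIVATE_NETS = [ipaddress.ip_network(n)
                for n in ('10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16')]


def ip_into_int(ip):
    # 192.168.1.13 -> (((192 * 256 + 168) * 256) + 1) * 256 + 13
    return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))


def is_internal_ip_shift(ip):
    n = ip_into_int(ip)
    # Keep the top 8, 12, and 16 bits and compare them with each net-id.
    return (n >> 24 == 10
            or n >> 20 == (172 << 4) + 1
            or n >> 16 == (192 << 8) + 168)


def is_internal_ip_stdlib(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE_NETS)


# Made-up (ip, hits) pairs in the shape produced by the counting step.
counts = [('203.0.113.7', 3), ('10.0.0.5', 2), ('172.20.1.1', 1),
          ('172.32.0.1', 1), ('192.168.1.13', 4)]
public = [(ip, hits) for ip, hits in counts if not is_internal_ip_shift(ip)]
print(public)
```

Note that 172.32.0.1 is kept: it sits just outside the 172.16.0.0/12 range, which is why the class B check shifts by 20 bits rather than 16.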

Looking up the location of an IP

The location of an IP can be looked up through an API provided by Taobao. The lookup code is as follows:

#encoding:utf-8
import requests

# The ip argument is a dict of the form {'ip': the actual IP address}
def find_ip_place(ip):
    URL = 'http://ip.taobao.com/service/getIpInfo.php'
    try:
        r = requests.get(URL, params=ip, timeout=3)
    except requests.RequestException as e:
        print(e)
    else:
        json_data = r.json()
        if json_data['code'] == 0:
            print u'Country:',json_data[u'data'][u'country']
            print u'Area:',json_data[u'data'][u'area']
            print u'Province:',json_data[u'data'][u'region']
            print u'City:',json_data[u'data'][u'city']
            print u'ISP:',json_data[u'data'][u'isp']
        else:
            print 'Lookup failed, please try again later!'
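
For storing results rather than printing them, it helps to pull the same fields into a plain dict. The sketch below works on an already-decoded response and makes no network call; the helper name summarize_ip_info and every value in the sample response are made up, and only the keys (code, data, country, area, region, city, isp) are taken from the code above.

```python
FIELDS = ('country', 'area', 'region', 'city', 'isp')


def summarize_ip_info(json_data):
    """Return the location fields as a dict, or None if the lookup failed."""
    if json_data.get('code') != 0:
        return None
    data = json_data.get('data', {})
    # Missing keys become empty strings instead of raising KeyError.
    return {key: data.get(key, '') for key in FIELDS}


# Made-up response in the shape the lookup code expects.
sample_response = {
    'code': 0,
    'data': {
        'ip': '203.0.113.7',
        'country': 'Exampleland',
        'area': 'North',
        'region': 'Example Province',
        'city': 'Example City',
        'isp': 'Example Telecom',
    },
}
print(summarize_ip_info(sample_response))
print(summarize_ip_info({'code': 1, 'data': 'invalid ip'}))
```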

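The information retrieved for each IP can then be stored in the database together with the update time. The database then holds a record of visits to the site, which makes later visualization and analysis easier.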
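
The post does not show the storage code, so the following is only a sketch of one possible shape for it. It uses an in-memory SQLite database as a stand-in for whatever database the site actually uses, and the table name ip_visits, its columns, and the helper save_visit are all hypothetical.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: one row per IP, with its hit count, location, and update time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ip_visits (
    ip TEXT PRIMARY KEY,
    hits INTEGER NOT NULL,
    country TEXT,
    region TEXT,
    city TEXT,
    isp TEXT,
    updated_at TEXT NOT NULL
)
"""


def save_visit(conn, ip, hits, place, updated_at=None):
    """Insert or overwrite the row for one IP."""
    updated_at = updated_at or datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    with conn:  # commits on success, rolls back on error
        conn.execute(
            'INSERT OR REPLACE INTO ip_visits VALUES (?, ?, ?, ?, ?, ?, ?)',
            (ip, hits, place.get('country'), place.get('region'),
             place.get('city'), place.get('isp'), updated_at),
        )


conn = sqlite3.connect(':memory:')
conn.execute(SCHEMA)
place = {'country': 'Exampleland', 'region': 'Example Province',
         'city': 'Example City', 'isp': 'Example Telecom'}
save_visit(conn, '203.0.113.7', 3, place, updated_at='2015-09-17 00:05:00')
rows = conn.execute('SELECT ip, hits, city, updated_at FROM ip_visits').fetchall()
print(rows)
```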

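For convenience, I used SAE's cron service to store the previous day's visit records in the database at a scheduled time every day.

References:
Counting visits per IP with the SAE real-time log API
Notes on using the SAE real-time log API from Python
Detecting private-network IPs in Python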
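
A daily job needs yesterday's date in the YYYY-MM-DD form that the log URL uses. The small sketch below shows one way to compute it; the helper name yesterday_str is made up, and how the job is registered with SAE's cron service is not covered here.

```python
from datetime import date, timedelta


def yesterday_str(today=None):
    """Return yesterday's date formatted like the log URL, e.g. '2015-09-16'."""
    today = today or date.today()
    return (today - timedelta(days=1)).strftime('%Y-%m-%d')


# With a fixed "today" the result is reproducible.
print(yesterday_str(date(2015, 9, 17)))  # 2015-09-16
```

The returned string can be passed as the date argument of fetch_log().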