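I wanted to crawl data from Zhihu, so I used Scrapy for the crawler.
Ordinary question links and their content can be fetched without logging in, but users' follow data is only available after logging in.
So the spider has to log in through Scrapy.
I ran into a few problems during the login process: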
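- Getting the captcha
I first tried to get it with XPath, but after debugging in the Scrapy shell, every selector I tried came back empty. The likely reason is that the element is generated dynamically by JavaScript.
-> Taking an indirect route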
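So I looked at the src address of the captcha image and found the following pattern:
import re
import time

p = re.compile(r'\d{13}')
# time.time() * 1000 is the current time in milliseconds; repr() turns the float into a string
t = repr(time.time()*1000)
# the first run of 13 digits is the integer millisecond timestamp
m = p.findall(t)[0]
# build the captcha URL
captcha_url = "http://m.zhihu.com/captcha.gif?r=" + m + "&type=login"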
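The same URL can probably be built without the regular expression. The following is a minimal sketch, assuming the `r` parameter only needs to be the current time in integer milliseconds, which is what the pattern above suggests. The helper name `build_captcha_url` and its `now` parameter are hypothetical and are not part of the original spider.

```python
import time


def build_captcha_url(now=None):
    """Build the captcha URL from a Unix timestamp in seconds (hypothetical helper)."""
    if now is None:
        now = time.time()
    # Milliseconds since the epoch, truncated to an integer (13 digits for current dates).
    millis = str(int(now * 1000))
    return "http://m.zhihu.com/captcha.gif?r=" + millis + "&type=login"


captcha_url = build_captcha_url()
```

Passing `now` explicitly makes the function easy to check with a fixed timestamp; calling it with no arguments uses the current time, as the original snippet does.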
This was my first time using repr, which here turns the original object into a string. help(repr) returns the following:
repr(...)
    repr(object) -> string

    Return the canonical string representation of the object.
    For most object types, eval(repr(object)) == object.
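To make the behaviour of repr concrete, here is a small sketch using a fixed sample float in place of `time.time() * 1000`. The variable names are illustrative only.

```python
import re

value = 1463456789500.25  # sample float standing in for time.time() * 1000
text = repr(value)        # canonical string form of the float

# For floats, evaluating the repr gives back an equal object.
round_trip = eval(text)

# The first run of 13 digits is the integer-millisecond part.
digits = re.findall(r"\d{13}", text)[0]
```

Here `text` is a plain string, so the regular expression can search it, and `round_trip == value` holds as the help text describes.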
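Manual retrieval: the captcha image is downloaded and typed in by hand.
Before getting this far, I entered the wrong password several times, which held up progress.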
Here is part of the code, up to the login step:
# -*- coding: utf-8 -*-
import scrapy
import urlparse
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join
from ..settings import headers, mode, proxy, email, password, host, port, db
from ..items import ZhihuTry2Item
import time
import os
import re


class TopicSpider(scrapy.Spider):
    name = "topic"
    allowed_domains = ["m.zhihu.com"]
    zhihu_url = "https://m.zhihu.com"
    login_url = "https://m.zhihu.com/login/email"
    start_urls = (
        'http://m.zhihu.com/',
    )

    def start_requests(self):
        # Fetch the home page first to get the _xsrf token and start a cookie session
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=headers,
            meta={
                "proxy": proxy,
                "cookiejar": 1
            },
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        # Get the _xsrf value
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        p = re.compile(r'\d{13}')
        t = repr(time.time()*1000)
        m = p.findall(t)[0]
        # Build the captcha URL
        captcha_url = "http://m.zhihu.com/captcha.gif?r=" + m + "&type=login"
        # Request the captcha image, carrying the cookie jar and _xsrf along
        yield scrapy.Request(
            url=captcha_url,
            headers=headers,
            meta={
                "proxy": proxy,
                "cookiejar": response.meta["cookiejar"],
                "_xsrf": _xsrf
            },
            callback=self.download_captcha
        )

    def download_captcha(self, response):
        # Save the captcha image
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        # Open the captcha image (the open command is macOS-specific)
        os.system('open captcha.gif')
        # Read the captcha typed in by hand
        print "Please enter the captcha:\n"
        captcha = raw_input()
        # Submit the account and password
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=headers,
            formdata={
                "email": email,
                "password": password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            meta={
                "proxy": proxy,
                "cookiejar": response.meta["cookiejar"],
            },
            callback=self.request_zhihu
        )

    def request_zhihu(self, response):
        """
        Now logged in; request the www.zhihu.com page
        """
        print response.body
        # yield scrapy.Request(url=self.zhihu_url,
        #                      headers=self.headers_dict,
        #                      meta={
        #                          "proxy": proxy,
        #                          "cookiejar": response.meta["cookiejar"],
        #                      },
        #                      callback=self.get_question,
        #                      dont_filter=True)
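The `os.system('open captcha.gif')` call in `download_captcha` only works on macOS. Below is a minimal sketch of a more portable way to open the saved image, assuming `xdg-open` is installed on Linux and `cmd` is available on Windows. The helper names `viewer_command` and `show_captcha` are hypothetical and do not appear in the original spider.

```python
import subprocess
import sys


def viewer_command(path, platform=None):
    """Return the command list that opens `path` in the default viewer (hypothetical helper)."""
    if platform is None:
        platform = sys.platform
    if platform == "darwin":
        # macOS, same command as in the spider above
        return ["open", path]
    if platform.startswith("win"):
        # `start` is a cmd.exe builtin; the empty string fills its window-title argument
        return ["cmd", "/c", "start", "", path]
    # Most Linux desktops, assuming xdg-utils is installed
    return ["xdg-open", path]


def show_captcha(path="captcha.gif"):
    """Open the saved captcha image and return the viewer command's exit code."""
    return subprocess.call(viewer_command(path))
```

In the spider, `show_captcha()` could stand in for the `os.system` line. Keeping the command selection in a separate function makes it possible to check each platform branch without launching a viewer.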