Last edited by bags on 2021-10-28 00:53
Today I came across this site on the forum: https://scrape.center/ — I plan to work through its exercises to learn Python web scraping!
SSR1: a movie-data site with no anti-scraping measures; the data is server-side rendered, so it suits basic crawler practice.
[Python]
from requests_html import HTMLSession
import json

def main():
    session = HTMLSession()
    data_all = []
    # range(1, 2) fetches only detail page 1; widen the range to crawl more pages
    for i in range(1, 2):
        try:
            url = 'https://ssr1.scrape.center/detail/{}'.format(i)
            req = session.get(url)
            name = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/a/h2/text()')[0]
            label_all = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[1]/button/span/text()')
            label = "、".join(label_all)
            country = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[2]/span[1]/text()')[0]
            duration = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[2]/span[3]/text()')[0]
            time = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[3]/span/text()')[0]
            introduce = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[4]/p/text()')[0]
            data = {
                'name': name,
                'label': label,
                'country': country,
                'duration': duration,
                'time': time,
                'introduce': introduce
            }
            data_all.append(data)
            print("Done page {}".format(i))
        except Exception:
            print("Page {} failed".format(i))
            with open("./爬虫练习/SSR1/error.txt", "a+", encoding="utf-8") as f:
                f.write("Failed page: " + str(i) + "\n")
    # json.dump serializes the list directly; calling json.dumps first and then
    # json.dump on the resulting string would double-encode the data
    with open("./爬虫练习/SSR1/data.json", "w", encoding="utf-8") as f:
        json.dump(data_all, f, ensure_ascii=False)

if __name__ == '__main__':
    main()
SSR2: a movie-data site with no anti-scraping measures and no valid HTTPS certificate, suitable for practicing HTTPS certificate verification.
I didn't notice any difference here; the SSR1 code worked on it as-is.
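That said, SSR2's stated purpose is certificate handling: if a request ever fails with a certificate error, verification can be skipped. With requests/requests_html that is `session.get(url, verify=False)`; the standard-library equivalent is an unverified SSL context. A minimal sketch (for practice sites only — never disable verification against real, sensitive endpoints):

```python
import ssl
import urllib.request

# SSR2's certificate is invalid, so a default HTTPS request may fail
# verification. An unverified context skips the check entirely.
ctx = ssl.create_default_context()
ctx.check_hostname = False          # do not match the hostname
ctx.verify_mode = ssl.CERT_NONE     # do not verify the certificate chain

# Usage (network access assumed):
# html = urllib.request.urlopen('https://ssr2.scrape.center/detail/1',
#                               context=ctx).read()
```

Note that `check_hostname` must be disabled before setting `verify_mode = CERT_NONE`, or the `ssl` module raises a `ValueError`.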
SSR3: a movie-data site with no anti-scraping measures, but protected by HTTP Basic Authentication; good as an HTTP-auth exercise. Username and password are both admin.
Capturing the request in the browser DevTools shows an extra request header: Authorization: Basic YWRtaW46YWRtaW4=
A first guess is that this is Base64 (an encoding, not real encryption); decoding it gives admin:admin.
Approach: simply add that request header to every request.
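The Base64 guess is easy to verify locally with Python's base64 module instead of an online decoder:

```python
import base64

# Decode the header value captured in DevTools
token = 'YWRtaW46YWRtaW4='
print(base64.b64decode(token).decode('utf-8'))  # admin:admin

# The reverse direction builds the header from the credentials,
# which is exactly what the browser does for Basic auth
creds = base64.b64encode(b'admin:admin').decode('ascii')
header = {'Authorization': 'Basic ' + creds}
print(header['Authorization'])  # Basic YWRtaW46YWRtaW4=
```

requests can also build this header itself via `session.get(url, auth=('admin', 'admin'))`, so hard-coding the Base64 string is optional.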
[Python]
from requests_html import HTMLSession
import json

def main():
    session = HTMLSession()
    data_all = []
    # Basic-auth header captured from DevTools; same for every request
    header = {'Authorization': 'Basic YWRtaW46YWRtaW4='}
    for i in range(1, 3):
        try:
            url = 'https://ssr3.scrape.center/detail/{}'.format(i)
            req = session.get(url, headers=header)
            name = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/a/h2/text()')[0]
            label_all = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[1]/button/span/text()')
            label = "、".join(label_all)
            country = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[2]/span[1]/text()')[0]
            duration = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[2]/span[3]/text()')[0]
            time = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[3]/span/text()')[0]
            introduce = req.html.xpath('//*[@id="detail"]/div[1]/div/div/div[1]/div/div[2]/div[4]/p/text()')[0]
            data = {
                'name': name,
                'label': label,
                'country': country,
                'duration': duration,
                'time': time,
                'introduce': introduce
            }
            data_all.append(data)
            print("Done page {}".format(i))
        except Exception:
            print("Page {} failed".format(i))
            with open("./爬虫练习/SSR3/error.txt", "a+", encoding="utf-8") as f:
                f.write("Failed page: " + str(i) + "\n")
    # Write the list directly; a prior json.dumps followed by json.dump
    # would double-encode the data
    with open("./爬虫练习/SSR3/data.json", "w", encoding="utf-8") as f:
        json.dump(data_all, f, ensure_ascii=False)

if __name__ == '__main__':
    main()
SSR4: a movie-data site with no anti-scraping measures, but every response is delayed by 5 seconds; good for testing slow-site crawling or measuring crawl speed with less interference from network conditions.
I don't have the book, so I haven't worked out what this exercise is really asking; if anyone understands it, let's study it together.
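My reading (a guess, without the book): the fixed 5-second delay lets you test how a crawler behaves on a slow site — whether you set a timeout, and how much concurrency helps. A small sketch that simulates the delay with time.sleep (0.2 s stands in for the real 5 s) shows why threads pay off when each response is slow:

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # stand-in for SSR4's 5-second response delay

def fetch(page):
    # Simulated slow request; the real call would be something like
    # session.get('https://ssr4.scrape.center/detail/{}'.format(page), timeout=10)
    time.sleep(DELAY)
    return page

start = time.time()
# Five workers wait on their "responses" at the same time,
# so total time stays near one DELAY instead of five
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, range(1, 6)))
elapsed = time.time() - start

print(results)   # [1, 2, 3, 4, 5] -- pool.map preserves input order
print(elapsed)   # roughly one DELAY, far below the 1.0 s a serial loop needs
```

The same pattern applies to the real site: a serial loop over 100 pages would spend 500 seconds just waiting, while a thread pool overlaps the waits.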