Writing code for the SSR cases from 《Python3网络爬虫开发实战（第二版）》 (Python 3 Web Crawler Development in Practice, 2nd Edition)

bags posted on 2021-10-27 23:50

Last edited by bags on 2021-10-28 00:53

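Today I came across this site on the forum: https://scrape.center/. I plan to work through its cases to learn web scraping with Python!

SSR1: A movie data site with no anti-scraping measures. The data is rendered on the server, which makes it suitable for basic scraping practice.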
from requests_html import HTMLSession
import json

def main():
    session = HTMLSession()
    data_all = []
    for i in range(1, 2):
        try:
            url = 'https://ssr1.scrape.center/detail/{}'.format(i)
            req = session.get(url)
            name = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/a/h2/text()')
            label_all = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/div/button/span/text()')
            label = "、".join(label_all)
            # One XPath matches every info <span>. Assumption: the spans appear in
            # the order country, "/" separator, duration, release date.
            spans = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/div/span/text()')
            info = [s.strip() for s in spans if s.strip() not in ('', '/')]
            country, duration, time = (info + [None] * 3)[:3]
            introduce = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/div/p/text()')
            data = {
                'name': name,
                'label': label,
                'country': country,
                'duration': duration,
                'time': time,
                'introduce': introduce
            }
            data_all.append(data)
            # i equals the percentage only when all 100 detail pages are crawled
            print("Progress: {}%".format(i))
        except Exception:
            print("Progress: {}%, error".format(i))
            with open("./爬虫练习/SSR1/error.txt", "a+", encoding="utf-8") as f:
                f.write("Failed page: " + str(i) + "\n")

    # Write the list directly; passing a json.dumps() string to json.dump()
    # would encode the data twice.
    with open("./爬虫练习/SSR1/data.json", "w", encoding="utf-8") as f:
        json.dump(data_all, f, ensure_ascii=False)


if __name__ == '__main__':
    main()
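
A note on saving the results: the small check below compares handing json.dump the list itself with handing it a string that json.dumps already produced. It is only a sketch; the sample record and the helper name dump_to_text are made up for illustration and are not part of the scraped data.

```python
import io
import json

# Made-up placeholder record standing in for one scraped movie
data_all = [{"name": ["Example Movie"], "label": "Drama"}]

def dump_to_text(obj):
    """Serialize obj with json.dump into an in-memory buffer and return the text."""
    buf = io.StringIO()
    json.dump(obj, buf, ensure_ascii=False)
    return buf.getvalue()

once = dump_to_text(data_all)                                    # list written directly
twice = dump_to_text(json.dumps(data_all, ensure_ascii=False))   # string written again

# Reading the file back gives a list in the first case and a plain string in the second
print(type(json.loads(once)).__name__)
print(type(json.loads(twice)).__name__)
```

If data.json should load back as a list of movies, writing the list directly is the form to use.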

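SSR2: A movie data site with no anti-scraping measures and no valid HTTPS certificate, suitable for practicing HTTPS certificate verification.
I didn't notice any difference with this one; the SSR1 code works for it as well.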
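
In case the SSR1 code does run into a certificate error against this site in some other environment (an assumption; none showed up here), the sketch below shows one way to skip certificate checks using only the standard library. The helper names make_context and fetch are made up for illustration. With a requests-based session, passing verify=False to get() has the same effect.

```python
import ssl
import urllib.request

def make_context(verify=True):
    """Return an SSL context; with verify=False certificate checks are skipped."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname has to be disabled before verify_mode can be CERT_NONE
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch(url, verify=True):
    """Fetch url and return the decoded body (performs a real network request)."""
    ctx = make_context(verify)
    with urllib.request.urlopen(url, context=ctx, timeout=30) as resp:
        return resp.read().decode("utf-8")

# Example call, not executed here:
# html = fetch("https://ssr2.scrape.center/", verify=False)
```

Skipping verification is only reasonable for a practice site like this one.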

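SSR3: A movie data site with no anti-scraping measures, protected by HTTP Basic Authentication; suitable as an HTTP authentication exercise. Both the username and the password are admin.
Capturing the traffic in the browser console shows that the request headers now include Authorization: Basic YWRtaW46YWRtaW4=
My first guess was Base64 (strictly an encoding, not encryption); running it through a decoder found via Baidu gives admin:admin
Approach: simply add that request header.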
from requests_html import HTMLSession
import json

def main():
    session = HTMLSession()
    data_all = []
    # Base64 of "admin:admin", as captured from the browser request
    header = {'Authorization': 'Basic YWRtaW46YWRtaW4='}
    for i in range(1, 3):
        try:
            url = 'https://ssr3.scrape.center/detail/{}'.format(i)
            req = session.get(url, headers=header)
            name = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/a/h2/text()')
            label_all = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/div/button/span/text()')
            label = "、".join(label_all)
            # One XPath matches every info <span>. Assumption: the spans appear in
            # the order country, "/" separator, duration, release date.
            spans = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/div/span/text()')
            info = [s.strip() for s in spans if s.strip() not in ('', '/')]
            country, duration, time = (info + [None] * 3)[:3]
            introduce = req.html.xpath('//*[@id="detail"]/div/div/div/div/div/div/div/p/text()')
            data = {
                'name': name,
                'label': label,
                'country': country,
                'duration': duration,
                'time': time,
                'introduce': introduce
            }
            data_all.append(data)
            # i equals the percentage only when all 100 detail pages are crawled
            print("Progress: {}%".format(i))
        except Exception:
            print("Progress: {}%, error".format(i))
            with open("./爬虫练习/SSR3/error.txt", "a+", encoding="utf-8") as f:
                f.write("Failed page: " + str(i) + "\n")

    # Write the list directly; passing a json.dumps() string to json.dump()
    # would encode the data twice.
    with open("./爬虫练习/SSR3/data.json", "w", encoding="utf-8") as f:
        json.dump(data_all, f, ensure_ascii=False)

if __name__ == '__main__':
    main()
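
The Authorization value captured in the browser can also be checked and rebuilt locally with the standard library, without an online decoder. This is a minimal sketch; the helper name basic_auth_header is made up for illustration.

```python
import base64

def basic_auth_header(username, password):
    """Build an HTTP Basic Authentication header from a username and password."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}

# Decode the value seen in the captured request
decoded = base64.b64decode("YWRtaW46YWRtaW4=").decode("utf-8")
print(decoded)  # admin:admin

# Rebuild the same header from the credentials
print(basic_auth_header("admin", "admin"))
```

requests-based sessions can also take the credentials directly through the auth parameter, for example auth=('admin', 'admin'), which produces the same header.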

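SSR4: A movie data site with no anti-scraping measures. Every response is delayed by 5 seconds, which makes it suitable for testing crawls of slow sites or for benchmarking crawl speed with less interference from network speed.
Since I don't have the book, I couldn't work out what this exercise is asking for. If anyone knows, let's figure it out together.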
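
One possible reading of this exercise (a guess, not taken from the book): when every response takes a fixed 5 seconds, almost all of the run time is spent waiting, so the site is a convenient target for comparing one-at-a-time fetching with concurrent fetching. The sketch below illustrates that idea with a thread pool. fake_fetch is a hypothetical stand-in that only sleeps, and DELAY is shortened to 0.2 seconds so the comparison runs quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.2  # stand-in for the site's 5-second response delay

def fake_fetch(page):
    """Hypothetical stand-in for a request: wait DELAY seconds, return the page number."""
    time.sleep(DELAY)
    return page

def crawl_sequential(pages):
    """Fetch the pages one after another."""
    return [fake_fetch(p) for p in pages]

def crawl_concurrent(pages, workers=5):
    """Fetch the pages in parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, pages))

def timed(func, pages):
    """Run func(pages) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(pages)
    return result, time.perf_counter() - start

pages = list(range(1, 6))
seq_result, seq_time = timed(crawl_sequential, pages)
con_result, con_time = timed(crawl_concurrent, pages)
print("sequential: {:.2f}s, concurrent: {:.2f}s".format(seq_time, con_time))
```

Replacing fake_fetch with a real request to the site would give the same kind of comparison against the actual 5-second delay.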

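bags posted on 2021-10-28 10:01

a15015073001 posted on 2021-10-28 09:31
In the progress line, i only runs from 1 to 3, so wouldn't it print 1% to 3%?
Shouldn't i be converted first, e.g. i / 3?

The data has 100 pages. I cut it down to 3 pages and forgot to update the progress line.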
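
A small sketch of the conversion suggested in the quoted reply, so the progress line stays correct whatever the page count is. The function name progress and the total of 3 pages are only for illustration.

```python
def progress(done, total):
    """Format done/total as a whole-number percentage string."""
    return "{:.0%}".format(done / total)

total_pages = 3  # illustrative page count
for i in range(1, total_pages + 1):
    print("Progress: " + progress(i, total_pages))
```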

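CCQc posted on 2021-10-28 00:21

Here to read up on hands-on scraping experience. Thanks for sharing.

优秀东 posted on 2021-10-28 00:27

Thanks for sharing, learning from this.

mortai posted on 2021-10-28 00:29

Thanks for sharing, learning from this.

天心阁主 posted on 2021-10-28 00:48

Thanks for sharing this useful knowledge.

Nophy posted on 2021-10-28 07:54

Thanks for sharing the knowledge.

Wapj_Wolf posted on 2021-10-28 08:04

Thanks to the original poster for sharing solid material.

yhgfwly007 posted on 2021-10-28 08:30

Learned something, thanks for sharing!

qianshijiuge posted on 2021-10-28 08:33

Thanks for sharing.

liubing3613 posted on 2021-10-28 08:42

Here to read up on hands-on scraping experience. Thanks for sharing.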


吾爱破解 - 52pojie.cn's Archiver
