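Web scraping day 02: the requests library

淘小欣 posted on 2021-5-25 00:14

Last edited by 淘小欣 on 2021-5-29 22:16

# 02. The requests library for web scraping

## I. Introduction to the requests module

### 1. Overview

+ requests lets you send HTTP requests from Python the way a browser would.
+ It is not only for scrapers; it is also widely used for calls between services.
+ The request headers, request body, and request URL of an HTTP request can all be set through this module.
+ requests is built on top of urllib3 and wraps it in a much simpler API; the lower-level urllib modules are tedious to use directly.

>Note: requests only downloads the page content; it does not execute JavaScript. For data loaded by JS, you have to analyze the target site yourself and send the additional requests it makes.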

### 2. Installation

```bash
pip3 install requests
```

### 3. Request methods

+ Every HTTP method has its own function; the ones used most often are `requests.get()` and `requests.post()`.

```python
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
```

## II. Sending GET requests with requests

### 1. A basic GET request

```python
import requests

# response is the response object; it wraps the HTTP response, including the body
response = requests.get('https://weread.qq.com/')
# the response body decoded to a string
print(response.text)

# Example: send a request to Baidu
res = requests.get('https://www.baidu.com/')
# print(res.text)
# write the fetched data to a file
with open('baidu.html', 'wb') as f:
    f.write(res.content)  # response body as raw bytes
```

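### 2. GET requests with parameters -> params

+ Put the data in the URL.
+ Send `user-agent` (the client type) and `referer` in the request headers (see [HTTP_REFERER](https://baike.baidu.com/item/HTTP_REFERER/5358396?fr=aladdin)).

#### 2.1 Option 1: put the data directly in the URL (Chinese text is generally not URL-encoded this way, which can cause encoding problems)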

```python
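import requests

header = {
    # pretend to be a browser
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    # helps get past hotlink protection; can be omitted if the site does not check it
    'referer': '',
}
res = requests.get('https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3', headers=header)
print(res.text)
```

>**Note**:
>
>If a request returns no data, or the wrong data, the usual reason is that it does not look enough like a browser. Send the request headers and any other data that the browser sends.

#### 2.2 Pass GET parameters with params (Chinese is URL-encoded automatically)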

```python
import requests

header = {
    # pretend to be a browser
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}
res = requests.get('https://www.baidu.com/s', params={'wd': '爸爸打我'}, headers=header)
with open('baidu.html', 'wb') as f:
    f.write(res.content)  # response body as raw bytes
```
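To see what `params` does without sending anything, a request can be prepared locally and its final URL inspected. This is only a minimal sketch; `build_search_url` is a hypothetical helper name, not part of requests.

```python
import requests


def build_search_url(keyword):
    """Return the URL requests would use for a Baidu search, without sending it."""
    req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': keyword})
    # prepare() applies the same URL encoding that requests.get() would apply
    return req.prepare().url


print(build_search_url('我真帅'))
```

The printed URL carries the keyword in percent-encoded UTF-8, so there is no need to encode it by hand.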

#### 2.3 Encoding and decoding Chinese with urllib

```python
from urllib.parse import urlencode, unquote

# convert Chinese to percent-encoded form
res = urlencode({'wd': '我真帅'}, encoding='utf-8')
print(res)  # wd=%E6%88%91%E7%9C%9F%E5%B8%85

# convert percent-encoded form back to Chinese
res = unquote('%E6%88%91%E7%9C%9F%E5%B8%85', encoding='utf-8')
print(res)  # 我真帅
```
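A round trip makes the encoding easier to check. The sketch below uses only the standard library; `encode_query` and `decode_query` are hypothetical helper names, and the `pn` key is just an extra illustrative parameter.

```python
from urllib.parse import parse_qs, quote, urlencode


def encode_query(params):
    """Turn a dict into a percent-encoded query string."""
    return urlencode(params, encoding='utf-8')


def decode_query(query):
    """Turn a query string back into a dict of strings."""
    # parse_qs returns a list per key because a key may repeat; keep the first value
    return {key: values[0] for key, values in parse_qs(query, encoding='utf-8').items()}


encoded = encode_query({'wd': '我真帅', 'pn': 10})
print(encoded)
print(decode_query(encoded))
# quote() encodes a single value rather than a whole dict
print(quote('我真帅'))
```

Note that decoding returns every value as a string, so `10` comes back as `'10'`.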

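#### 2.4 Request headers for GET requests

>You usually need to send request headers with a request; they are the key to passing yourself off as a browser. The most useful ones are:

+ `user-agent`: identifies the client
+ `referer`: large sites often use it to check where a request came from
+ `Cookie`: either a cookie from before login or one from an authenticated session

### 3. Sending cookies with a request

+ Cookies are needed so often that the author of requests made `cookies` a parameter of its own.

#### 3.1 Option 1: put them in the headers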

```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'cookie': 'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3=1596202661; UM_distinctid=173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b;'
}

# Note: do not drop the trailing slash and write 'http://127.0.0.1:8050/index'
response = requests.get('http://127.0.0.1:8050/index/', headers=headers)
print(response.text)
# What the server receives:
# Note: when the cookie is sent in headers, each pair is split at the equals sign:
# the left side becomes the cookie key and the right side the value.
# A piece without an equals sign gets an empty string as its key.
# The cookies parameter accepts a dict or a CookieJar; the cookies you get back
# after a successful login are a CookieJar object.
"""
{
    'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3': '1596202661',
    'UM_distinctid': '173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b'
}
"""
```
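The splitting rule described in the comment above can be sketched in a few lines. This is a simplified illustration of that rule, not the code any particular framework runs; `parse_cookie_header` is a hypothetical name, and real servers also handle quoting and escaping.

```python
def parse_cookie_header(raw):
    """Split a Cookie header value into a dict, following the rule described above."""
    cookies = {}
    for chunk in raw.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece left by a trailing ';'
        if '=' in chunk:
            # split at the first '=' only, so values may contain '=' themselves
            key, value = chunk.split('=', 1)
        else:
            # no '=': the key is an empty string and the whole piece is the value
            key, value = '', chunk
        cookies[key.strip()] = value.strip()
    return cookies


print(parse_cookie_header('a=1; b=x=y; flag;'))
```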

#### 3.2 Option 2: pass them through the cookies parameter

```python
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
}
cookies = {
    'cookie': 'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3=1596202661; UM_distinctid=173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b;'
}

# Note: do not drop the trailing slash and write 'http://127.0.0.1:8050/index'
response = requests.get('http://127.0.0.1:8050/index/', headers=headers, cookies=cookies)
print(response.text)

# What the server receives:
# Note: with the cookies parameter, each dict key becomes a cookie key on the server.
# The value here still contains semicolon-separated pairs, so the server splits it again:
# left of each equals sign is a key, right of it is the value.
# Normally you would pass one dict entry per cookie instead.
'''
{
    'cookie': 'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3=1596202661',
    'UM_distinctid': '173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b'
}
'''
```
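With one dict entry per cookie, requests builds the `Cookie` header itself. The sketch below prepares a request locally to show the header that would be sent; `cookie_header_for` is a hypothetical helper name and nothing goes over the network.

```python
import requests


def cookie_header_for(cookies):
    """Return the Cookie header requests would send for the given dict."""
    req = requests.Request('GET', 'http://127.0.0.1:8050/index/', cookies=cookies)
    return req.prepare().headers.get('Cookie')


header = cookie_header_for({
    'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3': '1596202661',
    'UM_distinctid': '173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b',
})
print(header)
```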

### 4. Summary

```python
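# headers parameter:
1. Pretend to be a browser: user-agent
2. Get past hotlink protection: referer

# Handling URL encoding
1. The params parameter URL-encodes values automatically
2. The urllib module

# Sending cookies
Option 1: put them in headers
   client sends: {'cookie': 'key=value;key1=value1'}
   server gets: {key: value, key1: value1}

Option 2: use the cookies parameter (accepts a dict or a CookieJar object)
   client sends: {key: value, key1: value1}
   server gets: {key: value, key1: value1}

response.text            body as text
response.content         body as bytes
response.iter_content()  iterator over the body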
```

## III. POST requests

### 1. Sending data with a POST request

#### 1.1 Sending urlencoded data

```python
import requests

response = requests.post('http://127.0.0.1:8050/index/', data={'name': 'shawn'})
print(response.text)

# What the server receives
'''
request.body: b'name=shawn'
request.POST: <QueryDict: {'name': ['shawn']}>
'''
```

#### 1.2 Sending JSON data

```python
import requests

response = requests.post('http://127.0.0.1:8050/index/', json={'name': 'shawn'})
print(response.text)

# What the server receives
'''
request.body: b'{"name": "shawn"}'
request.POST: <QueryDict: {}>
'''
```

### 2. Sending cookies automatically

```python
import requests

session = requests.session()  # note: the function is session(), not sessions()
session.post('http://127.0.0.1:8050/login/', json={'username': 'yang', 'password': '123'})  # assume this request logs in
response = session.get('http://127.0.0.1:8050/index/')  # no need to attach the cookie by hand; the session handles it
print(response)
```
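The effect of a session can be seen offline by putting a cookie into the session by hand and preparing a request. This is a sketch under assumptions: the `sessionid` cookie and its value are made up and stand in for whatever a real login response would set.

```python
import requests

session = requests.Session()
# Stand-in for the Set-Cookie a real login response would deliver (hypothetical value)
session.cookies.set('sessionid', 'abc123')

# prepare_request() merges the session's cookies into the outgoing request
request = requests.Request('GET', 'http://127.0.0.1:8050/index/')
prepared = session.prepare_request(request)
print(prepared.headers.get('Cookie'))
```

Every request sent through the same session picks up the stored cookies in this way, which is why the login only has to happen once.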

### 3. Custom request headers

```python
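import requests

# No header specified: with data=, the default Content-Type is
# application/x-www-form-urlencoded
requests.post(url='xxxxxxxx',
              data={'xxx': 'yyy'})

# If we set Content-Type to application/json ourselves but still pass the payload
# with data=, the body is still form-encoded, so the server cannot read the values
requests.post(url='',
              data={'': 1},
              headers={'content-type': 'application/json'})

# With json=, the default Content-Type is application/json
requests.post(url='',
              json={'': 1})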
```
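The three cases above can be compared without a server by preparing each request and looking at its `Content-Type` header and body. A minimal sketch; `preview_post` is a hypothetical helper, and the URL is the local test address used earlier in this post.

```python
import requests

URL = 'http://127.0.0.1:8050/index/'


def preview_post(**kwargs):
    """Prepare a POST locally and return (Content-Type, body as text)."""
    prepared = requests.Request('POST', URL, **kwargs).prepare()
    body = prepared.body
    if isinstance(body, bytes):
        body = body.decode('utf-8')
    return prepared.headers.get('Content-Type'), body


form = preview_post(data={'name': 'shawn'})
mismatch = preview_post(data={'name': 'shawn'},
                        headers={'content-type': 'application/json'})
as_json = preview_post(json={'name': 'shawn'})

for label, result in (('data=', form), ('data= + json header', mismatch), ('json=', as_json)):
    print(label, result)
```

In the mismatched case the header says JSON while the body is still `name=shawn`, which is why the server cannot parse it.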

### 4. Simulating a login to a website

```python
import requests

data = {
    'username': 'your_username',
    'password': 'your_password',
    'captcha': '9eee',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = requests.post('http://www.aa7a.cn/user.php', data=data)
print(res.text)
# {"error":0,"ref":"http://www.aa7a.cn/"} means the login succeeded
# The response carries the cookie of the logged-in session

# CookieJar object -> dict
print(res.cookies.get_dict())

res1 = requests.get('http://www.aa7a.cn/', cookies=res.cookies.get_dict())

# True if the page shows the logged-in user name
print('your_username' in res1.text)
```

**How to send data and how to send cookies**

+ `cookies`: a `CookieJar` or a dict

#### requests.session() sends cookies automatically

```python
import requests

# Get a session object; it is used just like requests itself, except that it handles cookies for you
session = requests.session()
data = {
    'username': 'your_username',
    'password': 'your_password',
    'captcha': '9eee',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = session.post('http://www.aa7a.cn/user.php', data=data)
print(res.text)
# {"error":0,"ref":"http://www.aa7a.cn/"} means the login succeeded
# The response carries the cookie of the logged-in session

# CookieJar object -> dict
print(res.cookies.get_dict())

res1 = session.get('http://www.aa7a.cn/')

# True if the page shows the logged-in user name
print('your_username' in res1.text)
```

### 5. Summary

```python
# Sending data:
JSON data: json={}
urlencoded data: data={}

# Sending cookies automatically:
session = requests.session()
res = session.post(login_url)
res1 = session.get(page_url)

# Custom request headers:
default: application/x-www-form-urlencoded
headers={'content-type': 'application/json'}
```

## IV. Response: the requests response object

```python
# 1. The response object
import requests

response = requests.get('http://www.jianshu.com')
# response attributes
print(response.text)     # body decoded to a string
print(response.content)  # body as raw bytes

print(response.status_code)         # HTTP status code
print(response.headers)             # response headers
print(response.cookies)             # cookies set by the response; after a login these are the login cookies
print(response.cookies.get_dict())  # CookieJar object -> dict
print(response.cookies.items())     # list of (name, value) tuples, like dict.items()

print(response.url)      # URL of the response
print(response.history)  # list of the redirect responses that led here (empty if there was no redirect)

print(response.encoding)  # encoding used to decode response.text (often utf-8)

# For images or videos, save to disk piece by piece:
# loop over response.iter_content() instead of reading response.content in one go
res = requests.get('xxx')
with open('xxx', 'wb') as f:  # 'xxx' is a placeholder for the URL and the output path
    for line in res.iter_content():
        f.write(line)

# 2. Encoding problems (rare; if the text is garbled, set the site's encoding)
response = requests.get('http://www.autohome.com/news')
# The Autohome page is encoded as gb2312, while requests falls back to ISO-8859-1;
# without setting the encoding below, the Chinese text comes out garbled
# response.encoding = 'gbk'
print(response.text)

# 3. Fetching binary content
response = requests.get('https://wx4.sinaimg.cn/mw690/005Po8PKgy1gqmatpdmhij309j070dgj.jpg')

with open('a.jpg', 'wb') as f:
    # f.write(response.content)
    # preferred: write the body piece by piece
    for line in response.iter_content():
        f.write(line)

# 4. Decoding JSON
import json

res = requests.get('https://api.luffycity.com/api/v1/course/actual/?category_id=1')
# print(json.loads(res.text))
print(res.json()['code'])
```
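The garbled-text problem from part 2 can be reproduced without any network access, because `response.text` is simply `response.content` decoded with `response.encoding`. The sketch below uses plain bytes in place of a real response; the sample string is an arbitrary piece of Chinese text.

```python
# Bytes as a gb2312-encoded site would send them
raw = '汽车之家'.encode('gb2312')

# Wrong guess: ISO-8859-1 maps every byte to one character, so nothing fails
# but the result is unreadable
garbled = raw.decode('iso-8859-1')

# Right encoding: the original text comes back
correct = raw.decode('gb2312')

print(repr(garbled))
print(correct)
```

Setting `response.encoding` before reading `response.text` has the same effect as choosing the right codec here.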

### Summary

```python
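# Response object attributes and methods:
response text                        response.text
response body as bytes               response.content
status code                          response.status_code
response headers                     response.headers
cookies as a CookieJar object        response.cookies
cookies as a dict                    response.cookies.get_dict()
cookies as a list of tuples          response.cookies.items()
responses from before the redirect   response.history
response URL                         response.url
response encoding                    response.encoding
iterator over the response body      response.iter_content()

# Fixing the response encoding:
manual: response.encoding = 'the encoding you know the resource uses'
automatic: response.encoding = response.apparent_encoding

# Parsing JSON:
1. with the json module
   json.loads(response.text)
2. with the json() method that requests provides
   response.json()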
```

## V. Examples

### Example 1: downloading from Haokan Video

The video URL, found by inspecting the page:

```python
https://vd2.bdstatic.com/mda-mcbkh5a50wx55wpi/1080p/h264_cae/1620464746505493348/mda-mcbkh5a50wx55wpi.mp4
```

Example code

```python
import requests

res = requests.get(
    'https://vd2.bdstatic.com/mda-mcbkh5a50wx55wpi/1080p/h264_cae/1620464746505493348/mda-mcbkh5a50wx55wpi.mp4')
with open('study.mp4', 'wb') as f:
    # iterating over a Response yields the body in chunks
    for line in res:
        f.write(line)
```
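For large files it may be worth streaming the body instead of loading it into memory first. Called without arguments, `iter_content()` uses a chunk size of 1 byte, so passing a larger `chunk_size` usually speeds things up. The following is a sketch, not the method used above; `save_chunks` and `download` are hypothetical helper names and the 8192-byte chunk size and 10-second timeout are arbitrary choices.

```python
import requests


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return the number of bytes written."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=8192):
    """Stream url to path without holding the whole body in memory."""
    # stream=True defers fetching the body until iter_content() is consumed
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        return save_chunks(response.iter_content(chunk_size=chunk_size), path)
```

Usage would look like `download(video_url, 'study.mp4')`, where `video_url` is the address found above.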

### Example 2: downloading from Pear Video

The URL that lists the videos, found by inspecting the page:

```python
https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=0
```

Example code

```python
import requests
import re

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=0')
# print(res.text)

video_ids = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
# print(video_ids)
for video_id in video_ids:
    video_url = 'https://www.pearvideo.com/' + video_id
    # print(video_url)
    real_video_id = video_id.split('_')[-1]
    # print(real_video_id)
    # Call the ajax endpoint directly; the JSON it returns contains the mp4 URL
    header = {
        # the endpoint checks the Referer header, so send the video page URL
        'Referer': video_url
    }
    res_json = requests.get('https://www.pearvideo.com/videoStatus.jsp?contId=%s' % real_video_id, headers=header)
    # print(res_json.json())
    mp4_url = res_json.json()['videoInfo']['videos']['srcUrl']
    # srcUrl starts its file name with a timestamp; the playable URL has 'cont-<id>' there
    mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % real_video_id)

    print(mp4_url)
    video_res = requests.get(mp4_url)
    name = mp4_url.split('/')[-1]
    # the video/ directory must already exist
    with open('video/%s' % name, 'wb') as f:
        for line in video_res.iter_content():
            f.write(line)

# https://video.pearvideo.com/mp4/third/20210509/cont-1728918-15454898-094108-hd.mp4  plays
# https://video.pearvideo.com/mp4/third/20210509/1621312234758-15454898-094108-hd.mp4  the srcUrl form

```
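The URL rewrite is the step that is easiest to get wrong, so it helps to isolate it in a function that can be checked against the two URLs in the comments above. A minimal sketch; `to_playable_url` is a hypothetical name.

```python
def to_playable_url(src_url, cont_id):
    """Replace the leading timestamp of the file name with 'cont-<id>'."""
    head, _, filename = src_url.rpartition('/')
    # everything after the first '-' of the file name is kept unchanged
    _, _, rest = filename.partition('-')
    return f'{head}/cont-{cont_id}-{rest}'


src = 'https://video.pearvideo.com/mp4/third/20210509/1621312234758-15454898-094108-hd.mp4'
print(to_playable_url(src, '1728918'))
```

Working on the file name alone avoids accidentally replacing the same digits elsewhere in the URL, which `str.replace` on the whole string could do.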

### Example 3: logging in to a website automatically

```python
import requests

# Get a session object; it is used just like requests itself, except that it handles cookies for you
session = requests.session()
data = {
    'username': 'your_username',
    'password': 'your_password',
    'captcha': '9eee',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = session.post('http://www.aa7a.cn/user.php', data=data)
# print(res.text)
# {"error":0,"ref":"http://www.aa7a.cn/"} means the login succeeded
# The response carries the cookie of the logged-in session

# CookieJar object -> dict
# print(res.cookies.get_dict())

res1 = session.get('http://www.aa7a.cn/')

# True if the page shows the logged-in user name
print('your_username' in res1.text)
```

## VI. Advanced usage

### 1. SSL Cert Verification (sending a client certificate; rarely needed)

```python
import requests

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)  # certificate not verified: prints a warning, returns 200

# To use a client certificate, pass it explicitly
response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))
print(response.status_code)
```

### 2. Timeouts

```python
import requests

# timeout accepts a float or a tuple
# timeout = 0.001  # a single float applies to both the connect and the read timeout
timeout = (0.0001, 0.002)  # (connect timeout, read timeout), in seconds
response = requests.get('https://www.baidu.com',
                        timeout=timeout)
print(response.text)
print(response.status_code)
# Note: an exception is raised when the timeout is exceeded.
'''
# Example exception, from a run where the read timeout was 0.0001
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.0001)
'''
```

### 3. Authentication

Official docs: http://docs.python-requests.org/en/master/user/authentication/

```python
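'''
Authentication: some sites pop up a dialog asking for a user name and password
(much like alert); until you authenticate you cannot get the HTML.
Under the hood the credentials are just encoded into a request header:
    r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
Most sites do not use this default scheme and implement their own.
In that case you need to follow the site's scheme and write your own function
similar to _basic_auth_str, then add the resulting string to the request headers:
    r.headers['Authorization'] = func('.....')
'''

# The default scheme, for reference; most sites do not use it
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('xxx', auth=HTTPBasicAuth('user', 'password'))
print(r.status_code)

# HTTPBasicAuth can be shortened to a tuple
r = requests.get('xxx', auth=('user', 'password'))
print(r.status_code)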
```
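HTTP Basic authentication only Base64-encodes `user:password`; it does not encrypt it. The sketch below builds the header value with the standard library to show what ends up on the wire. `basic_auth_header` is a hypothetical name and is not the function requests uses internally.

```python
import base64


def basic_auth_header(user, password):
    """Build the value of an Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f'{user}:{password}'.encode('latin1')).decode('ascii')
    return f'Basic {token}'


header = basic_auth_header('user', 'password')
print(header)
# Anyone who sees the header can decode it again
print(base64.b64decode(header.split(' ', 1)[1]).decode('latin1'))
```

Because the credentials are trivially recoverable, Basic auth should only be used over HTTPS.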

### 4. Exception handling

```python
import requests
# see requests.exceptions for the full list of exception types
from requests.exceptions import ConnectionError, ReadTimeout, RequestException, Timeout

try:
    response = requests.get('https://www.baidu.com', timeout=(0.001, 0.002))
    print(response.status_code)
except ReadTimeout:
    print('Read timed out!')
except ConnectionError:  # network unreachable
    print('Connection failed!')
except Timeout:
    print('Timed out')
except RequestException:
    print('Request error')
except Exception as e:
    print(e)
```
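The order of the `except` clauses matters because these exceptions form a hierarchy: `ReadTimeout` is a `Timeout`, while `ConnectTimeout` is both a `ConnectionError` and a `Timeout`, so in the block above a connect timeout is reported as a connection failure. The sketch below checks the most specific classes first; `describe` is a hypothetical helper name.

```python
from requests.exceptions import (
    ConnectTimeout,
    ConnectionError,
    ReadTimeout,
    RequestException,
    Timeout,
)


def describe(exc):
    """Map an exception to a short label, most specific class first."""
    if isinstance(exc, ConnectTimeout):
        return 'connect timeout'
    if isinstance(exc, ReadTimeout):
        return 'read timeout'
    if isinstance(exc, Timeout):
        return 'timeout'
    if isinstance(exc, ConnectionError):
        return 'connection error'
    if isinstance(exc, RequestException):
        return 'request error'
    return 'other error'


for exc in (ConnectTimeout(), ReadTimeout(), ConnectionError(), RequestException()):
    print(type(exc).__name__, '->', describe(exc))
```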

### 5. Using proxies

HTTP proxy | [Baidu Baike link](https://baike.baidu.com/item/http%E4%BB%A3%E7%90%86/7689519?fr=aladdin)

```python
'''
Proxies: free ones found online (unstable, fine for experiments) and paid ones (stable; companies usually buy them)
Proxy types: high-anonymity proxies hide the visitor's IP
             transparent proxies do not: the real IP is sent in the X-Forwarded-For request header -> read it from META in Django

Use a random proxy for each request:
collect many free proxies into a list and pick one at random each time

Or use an open-source proxy pool written in python + flask to host a free proxy pool yourself: https://github.com/jhao104/proxy_pool

'''
# Server side (Django)
from django.shortcuts import HttpResponse
from django.urls import path


def test_ip(request):
    ip = request.META.get('REMOTE_ADDR')
    return HttpResponse(f'Your IP is {ip}')


def upload_file(request):
    file = request.FILES.get('myfile')

    with open(file.name, 'wb') as f:
        for line in file:
            f.write(line)
    return HttpResponse()


urlpatterns = [
    path('test_ip/', test_ip),
    path('upload_file/', upload_file),
]

# Client side
import requests

ip = requests.get('http://118.24.52.95:5010/get/').json()['proxy']
print(ip)
proxies = {
    'http': ip
}
response = requests.get('http://101.133.225.166:8088/test_ip/', proxies=proxies)
print(response.text)
```
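The "pick a random proxy from a list" idea mentioned in the notes above can be sketched as follows. The addresses are made up (they come from the 203.0.113.0/24 range reserved for documentation), and `pick_proxies` is a hypothetical helper name.

```python
import random

# Hypothetical proxy addresses; replace them with proxies you actually have
PROXY_POOL = ['203.0.113.10:8080', '203.0.113.11:3128', '203.0.113.12:8888']


def pick_proxies(pool, rng=random):
    """Pick one address at random and return it in the form requests expects."""
    address = rng.choice(pool)
    # use the same HTTP proxy for both http:// and https:// targets
    return {'http': f'http://{address}', 'https': f'http://{address}'}


proxies = pick_proxies(PROXY_POOL)
print(proxies)
# A request would then be sent as: requests.get(url, proxies=proxies, timeout=5)
```

Free proxies fail often, so in practice the call would normally be wrapped in a retry that picks another proxy on error.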

### 6. Uploading files

```python
import requests

# the with block closes the file once the upload has finished
with open('1 requests高级用法.py', 'rb') as f:
    response = requests.post('http://101.133.225.166:8088/upload_file/', files={'myfile': f})
print(response.text)
```
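What `files=` produces can be inspected by preparing the request locally. The sketch below assumes a made-up file name and content and uses the local test address from earlier in this post; `preview_upload` is a hypothetical helper and nothing is sent.

```python
import requests


def preview_upload(field, filename, content):
    """Prepare a multipart upload locally and return (Content-Type, body bytes)."""
    req = requests.Request(
        'POST',
        'http://127.0.0.1:8050/upload_file/',
        files={field: (filename, content)},  # (file name, file content) tuple
    )
    prepared = req.prepare()
    return prepared.headers['Content-Type'], prepared.body


content_type, body = preview_upload('myfile', 'demo.txt', b'hello')
print(content_type)
print(body.decode('utf-8'))
```

The field name `myfile` is what the server side reads with `request.FILES.get('myfile')`.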

## VII. Summary

```python
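# SSL verification
verify=False  skip certificate verification
verify=True   verify the certificate; cert=(cert file, key file) sends a client certificate

# Proxies
# HTTP proxy
proxies={'http': 'IP:PORT'}

# SOCKS proxy: requires pip install requests[socks]
proxies={'http': 'socks5://IP:PORT'}

# Timeouts
timeout=(connect timeout, read timeout)
raises: ReadTimeout

# Authentication
from requests.auth import HTTPBasicAuth
auth=HTTPBasicAuth('user', 'password')

# Exception handling
from requests.exceptions import *
ReadTimeout       timed out while reading data
Timeout           timed out (connect or read)
ConnectionError   connection error
RequestException  base class for request errors

# Uploading files
files={key: file, key1: file1}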
```

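mosou posted on 2021-5-25 07:49

Please do a tutorial on processing Excel data with Python.

hagas520 posted on 2021-5-25 08:48

Waiting for a post on processing Excel; it really is in demand right now.

caballarri posted on 2021-5-25 09:01

Learned something, thanks for sharing.

RootPotence posted on 2021-5-25 09:35

Following you right away.

好学 posted on 2021-5-25 09:35

Very nice, thanks to the Python expert.

Gordon_c posted on 2021-5-25 09:52

I have been learning web scraping recently too, and this post is very well written.
A question: what should I do when I cannot understand some of the methods in a Python module?

szwangbin001 posted on 2021-5-25 09:55

Thanks for sharing.

璐璐诺 posted on 2021-5-25 10:20

Here to learn from the expert's post.

不苦小和尚 posted on 2021-5-25 10:31

Very detailed, thanks for sharing.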


吾爱破解 - 52pojie.cn's Archiver
