How do I scrape with requests when the page redirects and I can't get the content? (Python)

hj170520 · posted 2020-5-25 10:47

Last edited by hj170520 on 2020-5-25 12:32

Here is my code:

[Python]

from lxml import etree
import requests
import re
import os

url = 'http://www.effortlessenglish.libsyn.com/'
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        # Note: this Host value only fits www.effortlessenglish.libsyn.com
        'Host': 'www.effortlessenglish.libsyn.com'}

Web = requests.get(url, headers=headers)
Web_html = etree.HTML(Web.text)

# Collect the first link of every post on the page
contain1 = Web_html.xpath('//div[@class="postDetails"]/a/@href[1]')
os.makedirs('./audio', exist_ok=True)

for i in range(len(contain1)):
    # Keep only the links that contain 17069 (the dest-id of the audio URLs)
    if re.search(r'http.*17069', contain1[i]):
        url_data = contain1[i]
        # print(contain1[i])
        # url_data = re.findall(r'http.*17069', contain1[i])[0]  # findall returns a list, so [0] takes the first match
        print(url_data)
        # data = requests.get(url_data, headers=headers)
        # print("Downloading")
        # with open("./audio/" + str(i) + '.mp3', 'wb') as f:
        #     f.write(data.content)

For example, the link I scrape is
http://traffic.libsyn.com/effortlessenglish/Death_To_The_Schools__DESTROY_Limiting_Beliefs_and_Be_HAPPY.mp3?dest-id=17069
but once it is opened, the address becomes
http://hwcdn.libsyn.com/p/c/7/6/c76011bf65265504/Death_To_The_Schools__DESTROY_Limiting_Beliefs_and_Be_HAPPY.mp3?c_id=73623335&cs_id=73623335&destination_id=17069&expiration=1590379908&hwt=1834174b745f61f56b0cd6ee70d59831

A follow-up question: when I open a page with requests and it redirects automatically, how do I get the new address it redirects to?
Thanks everyone, I got it working!

[Python]

for i in range(len(contain1)):
    if re.search(r'http.*17069', contain1[i]):
        url_data = contain1[i]
        # print(contain1[i])
        # url_data = re.findall(r'http.*17069', contain1[i])[0]  # findall returns a list, so [0] takes the first match
        print(url_data)
        # Do not follow the 302; inspect the redirect response itself
        new_url_location = requests.get(url_data, headers=headers, allow_redirects=False)
        print(new_url_location.headers['location'])  # redirect target from the Location header
        print(new_url_location.next.url)  # another way: the prepared follow-up request
        # print("Downloading")
        # with open("./audio/" + str(i) + '.mp3', 'wb') as f:
        #     f.write(data.content)

pzx521521 · posted 2020-5-25 11:13

Last edited by pzx521521 on 2020-5-25 11:15

The reply below has the right answer.

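52pojie_xyf · posted 2020-5-25 11:14

Last edited by 52pojie_xyf on 2020-5-25 11:53

I took a look: requesting this link triggers a 302 redirect, and the redirect target is exactly the URL in the Location header of the 302 response. That is the address you want.

The file can be downloaded from that address.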
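
A minimal sketch of that download step, assuming the server answers with a 3xx status and a Location header as described above. The helper names `filename_from_url` and `download_via_location`, the `./audio` default, and the timeout value are illustrative choices, not something from this thread. No custom Host header is sent, so requests fills in the right one for each host.

```python
import os
from urllib.parse import urljoin, urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of a URL, ignoring the query string."""
    name = os.path.basename(urlsplit(url).path)
    return name or default


def download_via_location(url, out_dir="./audio", timeout=30):
    """Read the Location header of the redirect reply, then download that address."""
    # Third-party package; imported here so filename_from_url works without it.
    import requests

    os.makedirs(out_dir, exist_ok=True)
    first = requests.get(url, allow_redirects=False, timeout=timeout)
    # Location may be relative, so resolve it against the URL that was requested.
    target = urljoin(url, first.headers["Location"]) if first.is_redirect else url
    path = os.path.join(out_dir, filename_from_url(target))
    # Stream the body so a large mp3 is not held in memory all at once.
    with requests.get(target, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    return path
```

Calling `download_via_location(url_data)` inside the loop would save each file under its original name instead of a running number. If the redirect target redirects again, the second `requests.get` follows it automatically, since redirects are followed by default.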

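zldtb19931116 · posted 2020-5-25 11:16

You need to check how the redirect happens. Many sites put the target link in the response headers, in which case you write code to read it from the headers yourself. Others redirect through JavaScript, in which case you have to read the JS code. Work out which kind of redirect it is first.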
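
A rough sketch of that analysis step, assuming only three common cases: an HTTP 3xx reply with a Location header, an HTML meta refresh tag, and a plain `location = "..."` assignment in a script. The function name `find_redirect` and both regular expressions are illustrative heuristics. Real JavaScript redirects are often built dynamically and will not be caught by a pattern this simple.

```python
import re
from urllib.parse import urljoin

# <meta http-equiv="refresh" content="0; url=/new/page">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)
# window.location.href = "..." / document.location = '...' / location = "..."
JS_LOCATION = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def find_redirect(base_url, status_code, headers, body):
    """Return (kind, absolute_target), or (None, None) if nothing is found."""
    location = headers.get("Location")
    if 300 <= status_code < 400 and location:
        return "http", urljoin(base_url, location)
    match = META_REFRESH.search(body)
    if match:
        return "meta", urljoin(base_url, match.group(1))
    match = JS_LOCATION.search(body)
    if match:
        return "javascript", urljoin(base_url, match.group(1))
    return None, None
```

With requests, it could be fed from a response fetched with `allow_redirects=False`, for example `find_redirect(r.url, r.status_code, r.headers, r.text)`.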

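ZB_陈 · posted 2020-5-25 11:17

Last edited by ZB_陈 on 2020-5-25 11:22

data.history contains the responses of every redirect that was followed.
You can also disable redirects by setting the `allow_redirects` parameter to False when making the request.

See the documentation for details: https://requests.readthedocs.io/zh_CN/latest/user/quickstart.html#id9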
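
A small sketch of reading the redirect chain from `history`, assuming the default behaviour where requests follows redirects on a GET. The helper names `redirect_chain` and `fetch_chain` and the timeout value are made up for illustration. The helper only relies on the `history`, `status_code`, and `url` attributes of a response.

```python
def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch_chain(url, timeout=30):
    """Fetch a URL with redirects enabled and list every hop that was taken."""
    # Third-party package; imported here so redirect_chain works on any response-like object.
    import requests

    return redirect_chain(requests.get(url, timeout=timeout))
```

The last entry is the page that was actually downloaded, so `response.url` and `response.content` already belong to the redirected address when redirects are left enabled.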

wkfy · posted 2020-5-25 11:25

r = requests.get('http://traffic.libsyn.com/effortlessenglish/Death_To_The_Schools__DESTROY_Limiting_Beliefs_and_Be_HAPPY.mp3?dest-id=17069', allow_redirects=False)
print(r.next.url)

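hj170520 · posted 2020-5-25 11:35

ZB_陈 wrote on 2020-5-25 11:17
data.history contains the responses of every redirect that was followed.
You can also set the `allow_redirects` parameter to False when making the request to disable ...

That doesn't seem to work for me.
What I need is to scrape the data from the page after the redirect.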

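hj170520 · posted 2020-5-25 11:37

52pojie_xyf wrote on 2020-5-25 11:14
I took a look: requesting this link triggers a 302 redirect, and the redirect target is exactly the URL in the Location header of the 302 response. That is the address you ...

OK, I think I have a lead now! Let me try it first.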

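hj170520 · posted 2020-5-25 11:42

zldtb19931116 wrote on 2020-5-25 11:16
You need to check how the redirect happens. Many sites put the target link in the response headers, in which case you write code to read the target ...

Thanks, I hadn't thought of that approach. Let me try it first.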

hj170520 · posted 2020-5-25 12:31

wkfy wrote on 2020-5-25 11:25
r=requests.get('http://traffic.libsyn.com/effortlessenglish/Death_To_The_Schools__DESTROY_Limiting_B ...

Thanks, it works great!

Thank you!

