How do I scrape a page with requests when it redirects and I can't get the content? (Python)

hj170520 posted on 2020-5-25 10:47

Last edited by hj170520 on 2020-5-25 12:32

The code is as follows:
from lxml import etree
import requests
import re
import os

url = 'http://www.effortlessenglish.libsyn.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Host': 'www.effortlessenglish.libsyn.com'}

Web = requests.get(url, headers=headers)
Web_html = etree.HTML(Web.text)

# Collect every link inside the post detail blocks
contain1 = Web_html.xpath('//div[@class="postDetails"]/a/@href')
book = []
if not os.path.exists('./audio'):
    os.mkdir('./audio')

for i in range(len(contain1)):
    # Keep only the audio links, which end in the id 17069
    if re.search(r'http.*17069', contain1[i]):
        url_data = contain1[i]
        # print(contain1[i])
        # url_data = re.findall(r'http.*17069', contain1[i])  # findall returns a list, so the item has to be taken out of it
        print(url_data)
        # data = requests.get(url_data, headers=headers)
        # print("Downloading")
        # with open("./audio/" + str(i) + '.mp3', 'wb') as f:
        #     f.write(data.content)
        #     f.close()

For example, the link I scraped is
http://traffic.libsyn.com/effortlessenglish/Death_To_The_Schools__DESTROY_Limiting_Beliefs_and_Be_HAPPY.mp3?dest-id=17069
but once it is opened, the address changes to
http://hwcdn.libsyn.com/p/c/7/6/c76011bf65265504/Death_To_The_Schools__DESTROY_Limiting_Beliefs_and_Be_HAPPY.mp3?c_id=73623335&cs_id=73623335&destination_id=17069&expiration=1590379908&hwt=1834174b745f61f56b0cd6ee70d59831

A follow-up question: when I open a page with requests and it redirects automatically, how do I get hold of the new address?
Thanks everyone, I got it working!

for i in range(len(contain1)):
    if re.search(r'http.*17069', contain1[i]):
        url_data = contain1[i]
        # print(contain1[i])
        # url_data = re.findall(r'http.*17069', contain1[i])  # findall returns a list, so the item has to be taken out of it
        print(url_data)
        # Do not follow the redirect, so the 302 response itself can be inspected
        new_url_location = requests.get(url_data, headers=headers, allow_redirects=False)
        print(new_url_location.headers['location'])
        print(new_url_location._next.url)  # another way (uses a private attribute)
        # print("Downloading")
        # with open("./audio/" + str(i) + '.mp3', 'wb') as f:
        #     f.write(data.content)
        #     f.close()
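
The download step that is still commented out above could look roughly like the sketch below. It is only an illustration under some assumptions: `filename_from_url` and `download_audio` are hypothetical helper names, the file is named after the URL instead of the loop index, and the hard-coded `Host` header is left out because requests keeps custom headers on every hop of a redirect, which may confuse the second server.

```python
import os
from urllib.parse import urlsplit, unquote

import requests


def filename_from_url(url, default='audio.mp3'):
    """Return the last path segment of url, ignoring the query string."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or default


def download_audio(url, out_dir='./audio', headers=None, timeout=30):
    """Follow redirects (the default for GET) and stream the body to disk."""
    os.makedirs(out_dir, exist_ok=True)
    with requests.get(url, headers=headers, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        # resp.url is the address after all redirects have been followed
        target = os.path.join(out_dir, filename_from_url(resp.url))
        with open(target, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    return target
```

Inside the loop this would be called as `download_audio(url_data, headers={'User-Agent': headers['User-Agent']})`.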

pzx521521 posted on 2020-5-25 11:13

Last edited by pzx521521 on 2020-5-25 11:15

The reply below has the right answer.

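52pojie_xyf posted on 2020-5-25 11:14

Last edited by 52pojie_xyf on 2020-5-25 11:53

I had a look at this link. The request gets a 302 redirect, and the redirect target is the address in the Location header of the 302 response, which is the address you want. Screenshot: https://pic0.xuxuweizhi.cn/group1/M00/00/00/rBEhoF7LOAGAYhY_AAEFv-yUSZk071.png

The file can be downloaded from that address.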
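
As a rough illustration of reading the target out of a 302 response, here is a minimal sketch. `get_redirect_target` and `fetch_redirect_target` are hypothetical names, and the `urljoin` step is an extra precaution for servers that send a relative Location value.

```python
from urllib.parse import urljoin

import requests


def get_redirect_target(response):
    """Return the absolute redirect target of a response, or None."""
    # is_redirect is True for 301/302/303/307/308 with a Location header
    if not response.is_redirect:
        return None
    return urljoin(response.url, response.headers['Location'])


def fetch_redirect_target(url, timeout=10):
    """Request url without following redirects and return where it points."""
    response = requests.get(url, allow_redirects=False, timeout=timeout)
    return get_redirect_target(response)
```

Calling `fetch_redirect_target` on the `traffic.libsyn.com` link from the first post should return the address the server redirects to.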

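zldtb19931116 posted on 2020-5-25 11:16

It depends on how the redirect happens. In many cases the target link is placed in the response headers, and you have to write code to read it from there. In other cases the redirect is driven by JavaScript, and you need to read the JS code. Work out how the redirect happens first.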
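
For the JavaScript case, requests never runs scripts, so the target has to be dug out of the page source. The sketch below is a heuristic only, assuming the page uses either a `<meta http-equiv="refresh">` tag or a plain `location = "..."` assignment; `find_client_side_redirect` is a hypothetical helper, and real pages may build the address in ways these patterns will not catch.

```python
import re

# <meta http-equiv="refresh" content="0; url=http://example.com/next">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=["\']?([^"\'>\s]+)',
    re.IGNORECASE,
)

# window.location.href = "/next", document.location = '/next', location = "/next"
JS_LOCATION = re.compile(
    r'(?:window\.|document\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def find_client_side_redirect(html):
    """Return the first redirect target found in the page source, or None."""
    for pattern in (META_REFRESH, JS_LOCATION):
        match = pattern.search(html)
        if match:
            return match.group(1)
    return None
```

If this returns a relative address, it still has to be joined with the page URL before it can be requested.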

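ZB_陈 posted on 2020-5-25 11:17

Last edited by ZB_陈 on 2020-5-25 11:22

data.history contains the responses for every redirect along the way.
You can also disable redirects by setting the `allow_redirects` parameter to False when making the request.

See the documentation for details: https://requests.readthedocs.io/zh_CN/latest/user/quickstart.html#id9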
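
A small sketch of what `history` gives you, assuming redirects are left enabled (the default for GET). `redirect_chain` is a hypothetical helper name.

```python
import requests


def redirect_chain(response):
    """List (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch_with_chain(url, timeout=10):
    """Follow redirects, then return the final response and the hops taken."""
    response = requests.get(url, timeout=timeout)
    return response, redirect_chain(response)
```

With redirects followed, the returned response is already the redirected one, so `response.url` is the final address and `response.content` is the data behind it.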

wkfy posted on 2020-5-25 11:25

r=requests.get('http://traffic.libsyn.com/effortlessenglish/Death_To_The_Schools__DESTROY_Limiting_Beliefs_and_Be_HAPPY.mp3?dest-id=17069',allow_redirects=False)
print(r._next.url)
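
`_next` is a private attribute. Requests also exposes the same object through the public `next` property, so a slightly safer variant might look like the sketch below. `next_hop_url` is a hypothetical helper name.

```python
import requests


def next_hop_url(response):
    """Return the URL of the next request in the redirect chain, or None."""
    # response.next is a PreparedRequest when a redirect is pending, else None
    pending = response.next
    return pending.url if pending is not None else None


def peek_redirect(url, timeout=10):
    """Send one request without following redirects and report the next hop."""
    response = requests.get(url, allow_redirects=False, timeout=timeout)
    return next_hop_url(response)
```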

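hj170520 posted on 2020-5-25 11:35

ZB_陈 posted on 2020-5-25 11:17
data.history contains the responses for every redirect along the way.
You can also disable redirects by setting the `allow_redirects` parameter to False ...

That doesn't seem to work for me.
What I need is to scrape the data from the page after the redirect.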

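hj170520 posted on 2020-5-25 11:37

52pojie_xyf posted on 2020-5-25 11:14
I had a look at this link. The request gets a 302 redirect, and the redirect target is the address in the Location header of the 302 response, which is the ...

OK, I think I have an idea now! Let me try it first.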

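hj170520 posted on 2020-5-25 11:42

zldtb19931116 posted on 2020-5-25 11:16
It depends on how the redirect happens. In many cases the target link is placed in the response headers, and you have to write code to read it from the header ...

Thanks, I hadn't thought of this approach. Let me try it first.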

hj170520 posted on 2020-5-25 12:31

wkfy posted on 2020-5-25 11:25
r=requests.get('http://traffic.libsyn.com/effortlessenglish/Death_To_The_Schools__DESTROY_Limiting_B ...

Thanks, the result is great!
Thank you very much.


吾爱破解 - 52pojie.cn's Archiver
