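[Solved] Help needed scraping a certain Wenquan site
Last edited by Zhili.An on 2024-3-26 09:35
I have recently been doing some scraping work on a certain Wenquan site. Each full page image can only be accessed once, but its thumbnail can be accessed repeatedly,
so I set out to scrape the thumbnails as well, and ran into a problem.
Below, "image" refers to the thumbnail.
In the browser the image can always be accessed, whether I open it in a new tab (as long as the cookie is kept) or resend the request.
In Python, however, the response status is 200 but the image I get is corrupted.
The code is as follows: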
import requests

session = requests.session()
url = "https://lib-xjtu.wqxuetang.com/deep/page/imgs/3225567/7?width=160&k=eyJ1IjoiRVpv..."
# Headers copied from the browser request
headers = {
    "Host": "lib-xjtu.wqxuetang.com",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "sec-ch-ua": '"Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"',
    "sec-ch-ua-mobile": "?0",
    "RequestID": "0",
    "sec-ch-ua-platform": "Windows",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0",
    "Accept": "*/*",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://lib-xjtu.wqxuetang.com/deep/read/pdf?bid=3225567",
    "Cookie": "acw_tc=0b6e704617113690830561493e06fd50da72d2bb7ea2f760909e5e54bcaef3; _gid=177223...."
}
try:
    response = session.get(url, headers=headers)
    print(response)
    if response.status_code == 200:
        # Write the raw response body to disk
        with open('1.jpg', 'wb') as f:
            f.write(response.content)
        print('Download complete')
except Exception as e:
    print(e)
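Also, the file Python saves is close in size to the original image, but it cannot be opened.
The cookies are all complete and correct, so this is really baffling. Thanks in advance for any help!

"Accept-Encoding": "gzip, deflate, br" means the response is gzip-compressed, so you may need to decompress it: gzip.decompress(pic_gzip)

import requests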
url = "https://lib-xjtu.wqxuetang.com/deep/page/imgs/3225567/7?width=160&k=eyJ1IjoiRVpv"
# Note: no Accept-Encoding header is sent here
headers = {
    "authority": "lib-xjtu.wqxuetang.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
    "cookie": "acw_tc=0bdd346e17113706029986224edd4465a233e52e8a3b8f17bdbd735f23f69c; SERVERID=f164105ccbc961f51f901041b71e3b0d|1711370603|1711370603; SERVERCORSID=f164105ccbc961f51f901041b71e3b0d|1711370603|1711370603",
    "sec-ch-ua": '"Chromium";v="122", "Not A Brand";v="24", "Google Chrome";v="122"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    # Save the response body as the image file
    with open('1.jpg', 'wb') as file:
        file.write(response.content)
    print("Download complete")
else:
    print("Download failed, status code:", response.status_code)
It downloads now, but the downloaded image is still too blurry to read.

qfxldhw posted on 2024-3-25 20:51
import requests
url = "https://lib-xjtu.wqxuetang.com/deep/page/imgs/322 ...
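What could the reason be? It doesn't look like much is different.

Zhili.An posted on 2024-3-25 21:07
What could the reason be? It doesn't look like much is different.
It's "Accept-Encoding": "gzip, deflate, br", the issue the earlier reply pointed out. Delete that header.

What does "the page image can only be accessed once" mean? Only once per IP?

qfxldhw posted on 2024-3-25 21:10
"Accept-Encoding": "gzip, deflate, br", the issue the earlier reply pointed out. Delete that header.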
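Ohh, I see. Thanks, I'll try it tomorrow.

Time丨Brand posted on 2024-3-25 20:44
"Accept-Encoding": "gzip, deflate, br" means the response is gzip-compressed, so you may need to decompress it: gzip.decompress(pic_gzip)
Got it, I'll give it a try tomorrow. Thanks.

sai609 posted on 2024-3-25 21:11
What does "the page image can only be accessed once" mean? Only once per IP?
For the images that are actually readable, the link works only once and then expires. The thumbnails have no such limit.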
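
A likely explanation for the symptom discussed above, offered as a sketch and not as a confirmed diagnosis: Requests decodes gzip and deflate bodies on its own, but it can only decode br when a Brotli package is installed. Copying the browser's "Accept-Encoding": "gzip, deflate, br" may therefore get a Brotli-compressed body that is written to disk still compressed, which would match a 200 status and a file of plausible size that no viewer can open. For the same reason, an explicit gzip.decompress call is normally not needed with Requests. The helper names below (sniff_image_type, without_accept_encoding, download_image) are hypothetical and do not come from the thread.

```python
# Magic-byte prefixes for a few common image formats.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes):
    """Return a format name if data starts with a known image signature, else None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    # WebP files start with "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def without_accept_encoding(headers: dict) -> dict:
    """Copy headers, dropping any Accept-Encoding entry (case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower() != "accept-encoding"}


def download_image(url: str, headers: dict, path: str) -> bool:
    """Fetch url and save the body only if it looks like an image (hypothetical helper)."""
    # Imported here so the two helpers above work even without Requests installed.
    import requests

    # With no explicit Accept-Encoding, Requests sends its own default value,
    # which lists only the encodings it is able to decode.
    response = requests.get(url, headers=without_accept_encoding(headers), timeout=10)
    response.raise_for_status()

    if sniff_image_type(response.content) is None:
        # Not a recognizable image: report how the server encoded the body.
        print("Not an image; Content-Encoding:", response.headers.get("Content-Encoding"))
        return False

    with open(path, "wb") as f:
        f.write(response.content)
    return True
```

Calling download_image(url, headers, "1.jpg") with the headers from either post above would then either save a file that starts with a real image signature or print the Content-Encoding the server reported. Installing a Brotli package so that Requests can decode br is another possible route, assuming the header has to stay.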