Zhili.An posted on 2024-3-25 20:33

[Solved] Help scraping a certain Wenquan site

Last edited by Zhili.An on 2024-3-26 09:35

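I've recently been doing some scraping on a certain Wenquan site. Each full page image can be accessed only once, but its thumbnail can be accessed repeatedly, so I set out to scrape the thumbnails and ran into a problem.
Below, "image" refers to the thumbnail.
In the browser the image stays accessible, whether I open it in a separate tab (as long as the cookie is kept) or resend the request.
With Python, however, the response status is 200 but the image I get is corrupted.
The code is as follows: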
import requests

session = requests.Session()
url = "https://lib-xjtu.wqxuetang.com/deep/page/imgs/3225567/7?width=160&k=eyJ1IjoiRVpv..."

# Request headers copied from the browser
headers = {
    "Host": "lib-xjtu.wqxuetang.com",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Cache-Control": "no-cache",
    "sec-ch-ua": '"Chromium";v="122", "Not(A:Brand";v="24", "Microsoft Edge";v="122"',
    "sec-ch-ua-mobile": "?0",
    "RequestID": "0",
    "sec-ch-ua-platform": "Windows",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36 Edg/122.0.0.0",
    "Accept": "*/*",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://lib-xjtu.wqxuetang.com/deep/read/pdf?bid=3225567",
    "Cookie": "acw_tc=0b6e704617113690830561493e06fd50da72d2bb7ea2f760909e5e54bcaef3; _gid=177223....",
}

try:
    response = session.get(url, headers=headers)
    print(response)
    if response.status_code == 200:
        # Save the response body as the image file
        with open("1.jpg", "wb") as f:
            f.write(response.content)
        print("Download complete")
except Exception as e:
    print(e)


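Also, the file Python saves is close in size to the original image, yet it cannot be opened.

The cookies are all complete and correct, so this is really baffling. Thanks in advance for any help!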
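
One way to narrow down a "200 but won't open" file is to look at its first few bytes. The sketch below is only a diagnostic aid; the function name `sniff_payload` is made up for illustration and is not part of requests or of the site. It recognises a few common image signatures, the gzip signature, and markup such as an HTML error page. Brotli data has no fixed signature, so it would show up as "unknown".

```python
def sniff_payload(data: bytes) -> str:
    """Guess what a downloaded payload is from its leading bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    if data.startswith(b"\x1f\x8b"):
        # Still gzip-compressed: the body was never decoded
        return "gzip"
    if data.lstrip()[:1] == b"<":
        # Probably an HTML/XML page (login or error) rather than an image
        return "markup"
    # Anything else, including Brotli-compressed bytes
    return "unknown"
```

Reading the saved `1.jpg` in binary mode and passing its contents to `sniff_payload` would show whether the file is a real image or something else with a similar size.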

Time丨Brand posted on 2024-3-25 20:44

"Accept-Encoding": "gzip, deflate, br" means the response is gzip-compressed, so you may need to decompress it: gzip.decompress(pic_gzip)

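A note on the decompression suggestion: requests normally decodes gzip and deflate bodies on its own before exposing `response.content`, and it decodes `br` only when a Brotli package is installed. Manual decoding is therefore mostly relevant when the bytes are still compressed, for example when `br` was advertised without Brotli support. The sketch below picks a decoder from the `Content-Encoding` response header; `decode_body` is a hypothetical helper, and the `brotli` import assumes the third-party `brotli` package is available.

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str) -> bytes:
    """Decode a still-compressed HTTP body according to Content-Encoding."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        return raw
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Standard case: zlib-wrapped deflate stream
            return zlib.decompress(raw)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding == "br":
        import brotli  # third-party package, assumed to be installed
        return brotli.decompress(raw)
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")
```

With requests, the encoding would come from `response.headers.get("Content-Encoding", "")`. If `response.content` has already been decoded, running it through this function again would fail, so it only makes sense on bytes that are known to be compressed.
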
qfxldhw posted on 2024-3-25 20:51

import requests

url = "https://lib-xjtu.wqxuetang.com/deep/page/imgs/3225567/7?width=160&k=eyJ1IjoiRVpv"

headers = {
    "authority": "lib-xjtu.wqxuetang.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
    "cookie": "acw_tc=0bdd346e17113706029986224edd4465a233e52e8a3b8f17bdbd735f23f69c; SERVERID=f164105ccbc961f51f901041b71e3b0d|1711370603|1711370603; SERVERCORSID=f164105ccbc961f51f901041b71e3b0d|1711370603|1711370603",
    "sec-ch-ua": '"Chromium";v="122", "Not A Brand";v="24", "Google Chrome";v="122"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    with open('1.jpg', 'wb') as file:
        file.write(response.content)
    print("Download complete")
else:
    print("Download failed, status code:", response.status_code)

It downloads now, but the downloaded image is still too blurry to read.

Zhili.An posted on 2024-3-25 21:07

qfxldhw posted on 2024-3-25 20:51
import requests

url = "https://lib-xjtu.wqxuetang.com/deep/page/imgs/322 ...

What was the cause? It doesn't look like my version was missing anything.

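qfxldhw posted on 2024-3-25 21:10

Zhili.An posted on 2024-3-25 21:07
What was the cause? It doesn't look like my version was missing anything.


It's the issue mentioned in the earlier reply: remove the "Accept-Encoding": "gzip, deflate, br" entry from the headers.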

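The suggestion above can be sketched as follows. Once the copied `Accept-Encoding` entry is dropped, requests sends its own default value, which in recent versions lists only encodings it is able to decode, so `response.content` should arrive already decompressed. The helper names `without_header` and `fetch_image` are made up for illustration, and the 10-second timeout is an arbitrary choice.

```python
def without_header(headers: dict, name: str) -> dict:
    """Return a copy of headers with `name` removed (case-insensitive match)."""
    return {k: v for k, v in headers.items() if k.lower() != name.lower()}


def fetch_image(url: str, headers: dict, path: str) -> int:
    """Download url to path without the hand-copied Accept-Encoding header."""
    import requests  # third-party; imported here so the helper above has no dependency

    cleaned = without_header(headers, "Accept-Encoding")
    response = requests.get(url, headers=cleaned, timeout=10)
    response.raise_for_status()
    with open(path, "wb") as f:
        f.write(response.content)
    # Number of bytes written
    return len(response.content)
```

Calling `fetch_image(url, headers, "1.jpg")` with the `url` and `headers` from the first post would be one way to try it. Whether the site accepts the request still depends on the cookie and the `k` parameter being valid.
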
sai609 posted on 2024-3-25 21:11

What does "page images can only be accessed once" mean? Can the same IP access them only once?

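Zhili.An posted on 2024-3-25 21:18

qfxldhw posted on 2024-3-25 21:10
It's the issue mentioned in the earlier reply: remove the "Accept-Encoding": "gzip, deflate, br" entry from the headers.

Oh, I see. Thanks, I'll try it tomorrow.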

Mr.Jimmy posted on 2024-3-25 21:19

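Zhili.An posted on 2024-3-25 21:20

Time丨Brand posted on 2024-3-25 20:44
"Accept-Encoding": "gzip, deflate, br" means the response is gzip-compressed, so you may need to decompress it: gzip.decompress(pic_gzip)

Okay, I'll give it a try tomorrow. Thanks!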

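Zhili.An posted on 2024-3-25 21:21

sai609 posted on 2024-3-25 21:11
What does "page images can only be accessed once" mean? Can the same IP access them only once?

For the readable page images, each link can be accessed only once and then expires. Thumbnails have no such restriction.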