Zeaf
Posted on 2020-6-11 23:20
This post was last edited by Zeaf on 2020-6-12 11:01
Original page: https://pixivic.com/illusts/82140995?VNK=418eebdd (the VNK parameter can actually be left out; it is generated automatically)
The real address of the first image (i.e. the image URL you get after clicking through): https://original.img.cheerfun.dev/img-original/img/2020/06/07/00/00/13/82140995_p0.jpg (opening it directly in a browser gives 403)
It can be fetched through the site, so why does direct access fail?
The relevant request info is shown in the screenshot (note that later visits will return 304 and the like; take a look yourselves).
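To reproduce the problem outside the browser, a bare request from Python fails the same way (just a minimal sketch; the example URL above may have expired by now):
[Python]
import requests

# Example image URL from the post above; it may no longer exist when you try this.
img_url = 'https://original.img.cheerfun.dev/img-original/img/2020/06/07/00/00/13/82140995_p0.jpg'
r = requests.get(img_url)  # no Referer and no other extra headers
print(r.status_code)       # 403, same as opening the URL directly in a browser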
My code (pay attention to the user dict below, I did include a Referer!!!):
[Python]
import requests  # HTTP requests
import re        # regular expressions
import os        # file/folder handling

os.system('title pixiv image crawler @Zeaf')  # set the console window title
if not os.path.exists('pixivimg'):  # create the output folder if it does not exist yet
    os.mkdir('pixivimg')
#artistid = input('Enter artist id: ')
#VNK = input('Enter VNK: ')
user = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
}
url = 'https://api.pixivic.com/ranks?page=1&date=2020-06-08&mode=day&pageSize=30'
response = requests.get(url, headers=user)  # fetch the daily ranking from the API
response.encoding = response.apparent_encoding  # avoid mojibake
html = response.text  # response body as text
urls = re.findall('"original":"(https://i.pximg.net/img-original/img/..../../../../../../[0-9]*?_p0.*?g)"', html)  # original-image URLs on this page
names = re.findall('"artistId":.*?,"title":"(.*?)","type"', html)  # illustration titles
ids = re.findall('"original":"https://i.pximg.net/img-original/img/..../../../../../../([0-9]*?)_p0.*?g"', html)  # illustration ids
url = urls[0]   # the url obtained here is the one in the screenshot below, i.e. the real address of the original image
name = names[0]
id = ids[0]
user = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
    'Referer': 'https://pixivic.com/illusts/' + id + '?VNK=418eebdd',
    'Accept': 'image/png, image/svg+xml, image/*; q=0.8, */*; q=0.5',
    'Host': 'original.img.cheerfun.dev',
    'Cache-Control': 'max-age=0',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'Keep-Alive',
    'Accept-Language': 'zh-CN'
}
response = requests.get(url, headers=user)  # request the image itself
print(response.status_code)
with open('pixivimg' + '/' + name + '.jpg', 'wb') as f:
    f.write(response.content)
After analyzing the captured traffic, it turns out the batch of URLs scraped from the API cannot be used directly: the prefix is wrong and has to be swapped... the addresses starting with i.pximg cannot be accessed at all (304).
In short: the URLs I got by scraping the front page are of the form https://i.pximg.net/img-original/img/..., but what I actually need is the form shown in the screenshot, i.e. https://original.img.cheerfun.dev/img-original/img/... The site swaps the prefix behind the scenes and I had not noticed.
Generally speaking, a 403 is usually caused by a missing Referer, so watch out for that (of course it can also be cookies and the like; the main thing is to analyze the request headers yourself and see what is required).
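If you do not want to touch the regex, the same fix can be done by rewriting the prefix on the full URL (a minimal sketch of the prefix swap described above, not the code I actually used in the end):
[Python]
# Sketch: turn the i.pximg.net URL returned by the API into the
# original.img.cheerfun.dev form that actually serves the file.
pximg_url = 'https://i.pximg.net/img-original/img/2020/06/07/00/00/13/82140995_p0.jpg'
mirror_url = pximg_url.replace('https://i.pximg.net/', 'https://original.img.cheerfun.dev/')
print(mirror_url)
# -> https://original.img.cheerfun.dev/img-original/img/2020/06/07/00/00/13/82140995_p0.jpg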
The complete fixed code is below:
[Python]
# -*- coding: utf-8 -*-
"""
Created on Thu Jun 11 19:23:48 2020
@author: Zeaf
"""
import requests   # HTTP requests
import re         # regular expressions
import os         # file/folder handling
import threading  # multithreading (imported for a later version, not used below)

os.system('title pixiv image crawler @Zeaf')  # set the console window title
if not os.path.exists('pixivimg'):  # create the output folder if it does not exist yet
    os.mkdir('pixivimg')
#artistid = input('Enter artist id: ')
#VNK = input('Enter VNK: ')
user = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
}
url = 'https://api.pixivic.com/ranks?page=1&date=2020-06-08&mode=day&pageSize=30'
response = requests.get(url, headers=user)  # fetch the daily ranking from the API
response.encoding = response.apparent_encoding  # avoid mojibake
html = response.text  # response body as text
# capture only the date/id part of the path; the working prefix is added back inside the loop
urls = re.findall('"original":"https://i.pximg.net/img-original/img/(..../../../../../../[0-9]*?_p0.*?g)"', html)
names = re.findall('"artistId":.*?,"title":"(.*?)","type"', html)  # illustration titles
ids = re.findall('"original":"https://i.pximg.net/img-original/img/..../../../../../../([0-9]*?)_p0.*?g"', html)  # illustration ids
for name, url, id in zip(names, urls, ids):
    user = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
        'Referer': 'https://pixivic.com/illusts/' + id + '?VNK=418eebdd',
        'Accept': 'image/png, image/svg+xml, image/*; q=0.8, */*; q=0.5',
        'Host': 'original.img.cheerfun.dev',
        'Cache-Control': 'max-age=0',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'Keep-Alive',
        'Accept-Language': 'zh-CN'
    }
    url = 'https://original.img.cheerfun.dev/img-original/img/' + url  # swap in the working prefix
    name = name.replace('\\', '_')  # make sure the title is usable as a filename
    response = requests.get(url, headers=user)  # request the image with the Referer set
    print(response.status_code)
    with open('pixivimg' + '/' + name + '.jpg', 'wb') as f:
        f.write(response.content)
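The fixed code imports threading but does not use it yet. A rough idea of how the download loop could be moved onto threads (this is only a sketch, not tested; it reuses names, urls and ids from the code above, and leaves out the VNK since it is optional anyway):
[Python]
import threading
import requests

def download(name, path, pid):
    # one download per thread; headers kept minimal on purpose
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://pixivic.com/illusts/' + pid,  # Referer is the part that matters
    }
    img_url = 'https://original.img.cheerfun.dev/img-original/img/' + path
    resp = requests.get(img_url, headers=headers)
    with open('pixivimg/' + name.replace('\\', '_') + '.jpg', 'wb') as f:
        f.write(resp.content)

threads = [threading.Thread(target=download, args=(n, u, i))
           for n, u, i in zip(names, urls, ids)]
for t in threads:
    t.start()
for t in threads:
    t.join()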