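Scraping the pixiv [P站] daily illustration ranking with Python (optimized version)
Last edited by kof21411 on 2020-7-27 16:14. Original thread: https://www.52pojie.cn/thread-1228666-1-1.html
The original author scraped https://www.pixiv.net/ranking.php?mode=daily&content=illust
which only yields fifty images.
I changed it to request the site's JSON endpoint URL directly, so it fetches as many entries as the ranking contains.
Update:
You can now enter a date (pressing Enter defaults to the current day's ranking), and for works with multiple pages every original image is downloaded.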
#!/usr/bin/env python
# -*- coding:utf8 -*-
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
pages = 1
img_num = 0
# Pressing Enter without typing anything gives 0, which means "no date parameter".
# raw_input is Python 2; on Python 3 use input() instead.
dates = int(raw_input("Enter the date like 202007xx : ") or "0")
while (True):
    # Build the URL of the ranking JSON endpoint
    if dates == 0:
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%s&format=json' % pages
    else:
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%s&date=%s&format=json' % (pages, dates)
    rsp = requests.get(url=url, headers=headers, timeout=60, verify=False).text
    rspJson = json.loads(rsp)
    if 'error' not in rspJson:
        for content in rspJson['contents']:
            # Turn the thumbnail URL into the original-image URL prefix (ends in "_p")
            img_urls = content['url']
            img_urls = img_urls.replace('c/240x480/img-master', 'img-original')
            img_urls = img_urls.replace('0_master1200.jpg', '')
            illust_page_count = int(content['illust_page_count'])
            # Forge the Referer header to get past the hotlink restriction
            user = {
                'Referer': 'https://www.pixiv.net/artworks/' + str(content['illust_id'])
            }
            img_num = img_num + 1
            for i in range(0, illust_page_count):
                # Try .jpg first and fall back to .png if the request fails
                img_url = img_urls + str(i) + '.jpg'
                rgid = requests.get(img_url, headers=user)
                if rgid.status_code != 200:
                    img_url = img_url.replace('.jpg', '.png')
                    rgid = requests.get(img_url, headers=user)
                print(img_url)
                # Save the image; multi-page works are named "<rank>-<page>"
                img = rgid.content
                img_type = str(img_url.split(".")[-1])
                if illust_page_count > 1:
                    img_name = str(img_num) + '-' + str(i + 1)
                else:
                    img_name = str(img_num)
                with open('./' + img_name + '.' + img_type, 'wb') as f:
                    f.write(img)
    else:
        # print(rspJson['error'])
        break
    pages = pages + 1
judgecx posted on 2020-7-26 16:16
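Are you around right now? I don't have any free time at the moment.
Could you help me change how the downloaded files are named? I'd like the numbering to run on consecutively from the first image,
like 1-500 ...
With this change the files are now numbered 1-500: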
import requests

for x in range(1, 11):
    url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=' + str(x)
    # Fetch the ranking page once per page rather than once per image
    rg = requests.get(url)
    for i in range(0, 50):
        # Chunk i+1 of each split starts right after the i-th occurrence of the marker
        in_url = 'https://www.pixiv.net' + str(rg.text.split("<div class=\"ranking-image-item\"><a href=\"")[i + 1].split("\"")[0])
        img_id = str(in_url.split("/artworks/")[1])
        img_rank = str(rg.text.split("data-rank-text=\"")[i + 1].split("\"")[0])
        rgi = requests.get(in_url)
        img_url = str(rgi.text.split("original\":\"")[1].split("\"")[0])
        user = {'Referer': in_url}
        rgid = requests.get(img_url, headers=user)
        img = rgid.content
        # Continuous numbering across pages: 1-50 on page 1, 51-100 on page 2, ...
        img_num = ((x - 1) * 50) + i + 1
        with open('./' + str(img_num) + '.' + str(img_url.split(".")[-1]), 'wb') as f:
            f.write(img)
        print(img_url)
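The numbering used above can be checked on its own. A minimal sketch, assuming 50 entries per ranking page as in the loop above; the helper name `global_rank` is made up for illustration:

```python
def global_rank(page, index, per_page=50):
    """Map a 1-based page number and a 0-based position on that page
    to a continuous 1-based file number."""
    return (page - 1) * per_page + index + 1


# First and last entry of page 1, first entry of page 2, last entry of page 10
print(global_rank(1, 0), global_rank(1, 49), global_rank(2, 0), global_rank(10, 49))
# prints: 1 50 51 500
```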
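One more thing: your code extracts the data with regular expressions and string splitting. That approach extends poorly and is costly to maintain, because whenever the site changes its front-end layout you have to change your code too. I'm only offering a suggestion, not knocking your code, so I hope you don't mind.
I've just modified my code as well; it should now download the large images you wanted.
Data from an API endpoint is generally more stable: however the front end changes, the endpoint's data usually stays the same, so the script costs very little to maintain. And because the endpoint returns the source data, the script is also easy to extend; with the source data in hand, you can do whatever you like with it.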
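As a rough illustration of that point, here is a sketch that reads the fields straight out of a JSON payload instead of splitting HTML. The field names `contents`, `url`, `illust_id`, and `illust_page_count` are the ones the scripts in this thread read; the sample payload, its IDs, and its URLs are invented for the example, and `parse_ranking` is a hypothetical helper.

```python
import json

# Invented sample shaped like the fields the scripts in this thread read;
# a real response has more keys and different values.
SAMPLE = '''
{"contents": [
  {"illust_id": 11111111, "illust_page_count": "1",
   "url": "https://i.pximg.net/c/240x480/img-master/img/2020/07/25/00/00/00/11111111_p0_master1200.jpg"},
  {"illust_id": 22222222, "illust_page_count": "3",
   "url": "https://i.pximg.net/c/240x480/img-master/img/2020/07/25/00/00/00/22222222_p0_master1200.jpg"}
]}
'''


def parse_ranking(text):
    """Return (illust_id, page_count, thumbnail_url) tuples,
    or an empty list if the payload carries an 'error' key."""
    data = json.loads(text)
    if 'error' in data:
        return []
    return [(c['illust_id'], int(c['illust_page_count']), c['url'])
            for c in data['contents']]


for illust_id, page_count, thumb in parse_ranking(SAMPLE):
    print(illust_id, page_count)
```

Nothing here depends on how the ranking page is laid out, only on the key names, which is why this style tends to survive front-end redesigns.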
#!/usr/bin/env python
# -*- coding:utf8 -*-
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
pages = 1
img_num = 0
while (True):
    # Build the URL of the ranking JSON endpoint
    url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%s&format=json' % pages
    rsp = requests.get(url=url, headers=headers, timeout=60, verify=False).text
    rspJson = json.loads(rsp)
    if 'error' not in rspJson:
        for content in rspJson['contents']:
            # Turn the thumbnail URL into the original-image URL
            img_url = content['url']
            img_url = img_url.replace('c/240x480/img-master', 'img-original')
            img_url = img_url.replace('_master1200', '')
            # Forge the Referer header to get past the hotlink restriction
            user = {
                'Referer': 'https://www.pixiv.net/artworks/' + str(content['illust_id'])
            }
            rgid = requests.get(img_url, headers=user)
            print(img_url)
            # Save the image under its running number
            img = rgid.content
            img_type = str(img_url.split(".")[-1])
            img_num = img_num + 1
            with open('./' + str(img_num) + '.' + img_type, 'wb') as f:
                f.write(img)
    else:
        # print(rspJson['error'])
        break
    pages = pages + 1
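The URL rewriting in the script above can be pulled out into a small pure function, which makes it easy to check without touching the network. This is only a sketch: `original_candidates` is a hypothetical helper, the sample thumbnail URL is invented, and the `.png` candidate assumes that an original which is not available as `.jpg` was uploaded as `.png`.

```python
def original_candidates(thumb_url):
    """Return the original-image URLs to try, .jpg first and then .png."""
    # Same two replacements as the script above
    url = thumb_url.replace('c/240x480/img-master', 'img-original')
    url = url.replace('_master1200', '')
    # Drop the extension so both candidates can be built from one base
    base = url.rpartition('.')[0]
    return [base + '.jpg', base + '.png']


# Invented thumbnail URL in the shape the replacements expect
thumb = ('https://i.pximg.net/c/240x480/img-master/img/'
         '2020/07/25/00/00/00/11111111_p0_master1200.jpg')
for candidate in original_candidates(thumb):
    print(candidate)
```

A caller would request the first candidate and move on to the second only when the response status is not 200.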
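judgecx posted on 2020-7-26 21:46
Would it be hard to make it scrape a date that the user types in? I can't get that working with my own code.
Just add date=202007xx to the URL.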
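A minimal Python 3 sketch of how the `date` query parameter fits into the request URL. `ranking_url` is a hypothetical helper; the parameter names and their order are taken from the URLs used in this thread, and the date value is only an example.

```python
from urllib.parse import urlencode

BASE = 'https://www.pixiv.net/ranking.php'


def ranking_url(page, date=None):
    """Build the ranking JSON URL; leave out 'date' to get the default ranking."""
    params = [('mode', 'daily'), ('content', 'illust'), ('p', page)]
    if date:
        # date is a string such as '20200725' (YYYYMMDD)
        params.append(('date', date))
    params.append(('format', 'json'))
    return BASE + '?' + urlencode(params)


print(ranking_url(1))
print(ranking_url(2, '20200725'))
```

Keeping the date optional mirrors the "press Enter for the default" behaviour: an empty input simply leaves the parameter out.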
#!/usr/bin/env python
# -*- coding:utf8 -*-
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
pages = 1
img_num = 0
# Pressing Enter without typing anything gives 0, which means "no date parameter".
# raw_input is Python 2; on Python 3 use input() instead.
dates = int(raw_input("Enter the date like 202007xx : ") or "0")
while (True):
    # Build the URL of the ranking JSON endpoint
    if dates == 0:
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%s&format=json' % pages
    else:
        url = 'https://www.pixiv.net/ranking.php?mode=daily&content=illust&p=%s&date=%s&format=json' % (pages, dates)
    rsp = requests.get(url=url, headers=headers, timeout=60, verify=False).text
    rspJson = json.loads(rsp)
    if 'error' not in rspJson:
        for content in rspJson['contents']:
            # Turn the thumbnail URL into the original-image URL
            img_url = content['url']
            img_url = img_url.replace('c/240x480/img-master', 'img-original')
            img_url = img_url.replace('_master1200', '')
            # Forge the Referer header to get past the hotlink restriction
            user = {
                'Referer': 'https://www.pixiv.net/artworks/' + str(content['illust_id'])
            }
            rgid = requests.get(img_url, headers=user)
            # Fall back to .png if the .jpg request fails
            if rgid.status_code != 200:
                img_url = img_url.replace('.jpg', '.png')
                rgid = requests.get(img_url, headers=user)
            print(img_url)
            # Save the image under its running number
            img = rgid.content
            img_type = str(img_url.split(".")[-1])
            img_num = img_num + 1
            with open('./' + str(img_num) + '.' + img_type, 'wb') as f:
                f.write(img)
    else:
        # print(rspJson['error'])
        break
    pages = pages + 1
Awesome, an expert really is an expert!
You didn't even @ me. If I hadn't scrolled past this, I wouldn't have known you had already optimized it.
judgecx posted on 2020-7-26 10:39
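You didn't even @ me. If I hadn't scrolled past this, I wouldn't have known you had already optimized it.
Sorry, I forgot to @ you. I posted it right after finishing the changes; I'll remember next time.
Thanks for sharing!
kof21411 posted on 2020-7-26 10:42
Sorry, I forgot to @ you. I posted it right after finishing the changes; I'll remember next time.
Have you actually run it? When I run it, it overwrites the previous image, so there is only ever one file.
My youth is back! Thanks for sharing.
judgecx posted on 2020-7-26 10:45
Have you actually run it? When I run it, it overwrites the previous image, so there is only ever one file.
The edit has passed moderation now. Take a look and see whether there are any other problems.
Impressive work, I bow to the expert.
Thanks for sharing, showing my support.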