吾爱破解 - 52pojie.cn


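[Help] Python crawler: images downloaded with the synchronous method open fine but the download is slow; images downloaded with the asynchronous method won't open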

AnWenpython posted on 2021-11-17 18:07
Last edited by AnWenpython on 2021-11-18 19:43

First, the code for the synchronous version:
[Python]
# 1. Enter the URL of the gallery to crawl, e.g. https://www.ivsky.com/bizhi/moraine_lake_v48781/
# https://www.ivsky.com/bizhi/moraine_lake_v48781/pic_769150.html    URL of the first picture page
#  wyAmZ1NxpQ3/ljl4PhA7CKEXM04HGkR+mKNM8hKFgkdWRVMaz+QZp1vq+FUCxrkeCcDJYkP4UIlYC5Qh2/w1vTGXChZo
# https://img-picdown.ivsky.com/             URL of another original image
# img/downloadpic/download/fmAmcgttgHinljl4PhA7CKEXM04HGkRmgK9M8hKFgkdWRVMaz+QZp1vq+FUCxpUeFcDBdkP8OI15F4R1O
# https://m.ivsky.com/get_picinfo.php?tn=downloadpic&picurl=/img/bizhi/pic/201804/30/moraine_lake.jpg    URL of the data request
# /img/bizhi/pic/201804/30/moraine_lake.jpg

import requests
from lxml import etree
import re
import os
import json
import time

# Create a session object
# sessionp = requests.Session()
t1 = time.time()
if not os.path.exists('./极简壁纸爬取结果'):
    os.mkdir('./极简壁纸爬取结果')  # create the folder the images are saved to
# 1. GET the gallery URL and collect the URL of each picture page
url = 'https://www.ivsky.com/bizhi/yangzi_v59469/'
html1 = requests.get(url=url)
# print(html1.text)
d = etree.HTML(html1.text)
slist = d.xpath('/html/body/div[3]/div[4]/ul/li/div/a/@href')  # href="/bizhi/moraine_lake_v48781/pic_769150.html"
html1.close()
ec = "var imgURL='(.*?)';var"
# 2. Request each picture page
for li in slist:  # https://www.ivsky.com/bizhi/moraine_lake_v48781/pic_769150.html
    url2 = 'https://www.ivsky.com' + li
    html2 = requests.get(url=url2)
    data_url = 'https://www.ivsky.com/get_picinfo.php?'
    hdes = {
        'cookie': '__yjs_duid=1_1bda7ffae32f47059d094aa036adcf651634380757728; Hm_lvt_a951b469f6e313457f2934c362ed30de=1636204999,1636277027,1636278256; statistics_clientid=me; Hm_lvt_862071acf8e9faf43a13fd4ea795ff8c=1636954225,1637036227,1637054815,1637123760; Hm_lpvt_c13cf8e9faf62071ac13fd4eafaf1acf=1637140041; Hm_lpvt_862071acf8e9faf43a13fd4ea795ff8c=1637140042'
        }
    data = {
        'tn': 'downloadpic',
        'picurl': re.findall(ec, html2.text)[0]  # /img/bizhi/pic/201804/30/moraine_lake.jpg
    }
    # 3. Request the endpoint whose "data" value holds the suffix of the original image URL
    data_html = requests.get(url=data_url, headers=hdes, params=data)
    dit = data_html.json()
    h = dit['data']
    url3 = 'https://img-picdown.ivsky.com/img/downloadpic/download/' + h
    img_name = url3.split('/')[-1]  # image name
    imgPath = './极简壁纸爬取结果/' + img_name + '.jpg'  # path the image is saved to
    with open(imgPath, 'wb') as fp:
        fp.write(requests.get(url=url3).content)
        print(img_name, 'downloaded')

    print(img_name, 'crawled successfully!')
    html2.close()
    data_html.close()

t2 = time.time()
print(t2 - t1)

The code for the asynchronous version:
[Python]
# 1. Enter the URL of the gallery to crawl, e.g. https://www.ivsky.com/bizhi/moraine_lake_v48781/
# https://www.ivsky.com/bizhi/moraine_lake_v48781/pic_769150.html    URL of the first picture page
# https://img-picdown.ivsky.com/             URL of the original image
# img/downloadpic/download/wyAmZ1NxpQ3/ljl4PhA7CKEXM04HGkR+mKNM8hKFgkdWRVMaz+QZp1vq+FUCxrkeCcDJYkP4UIlYC5Qh2/w1vTGXChZo
# https://img-picdown.ivsky.com/             URL of another original image
# img/downloadpic/download/fmAmcgttgHinljl4PhA7CKEXM04HGkRmgK9M8hKFgkdWRVMaz+QZp1vq+FUCxpUeFcDBdkP8OI15F4R1O
#     URL of the data request
# /img/bizhi/pic/201804/30/moraine_lake.jpg

import requests
from lxml import etree
import re
import os
import json
import asyncio
import aiohttp
import aiofiles
import time

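# Create a session object
# sessionp = requests.Session()
"""
1. Synchronous part: request the gallery URL to get every picture page URL, then request those pages to get each data URL (this step involves parsing)
2. Asynchronous part: request the data URLs to get the image URLs, then download all the image content
"""
t1 = time.time()
if not os.path.exists('./极简壁纸爬取结果2'):
    os.mkdir('./极简壁纸爬取结果2')  # create the folder the images are saved to
# 1. GET the gallery URL and collect the URL of each picture page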
url = 'https://www.ivsky.com/bizhi/yangzi_v59469/'
html1 = requests.get(url=url)
d = etree.HTML(html1.text)
slist = d.xpath('/html/body/div[3]/div[4]/ul/li/div/a/@href')  # href="/bizhi/moraine_lake_v48781/pic_769150.html"
html1.close()
# 2. Request each picture page and build its data URL
lest = []  # holds the data URLs


async def yyds(url):
    url = 'https://www.ivsky.com' + url

    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            html = await resp.text()
            ec = "var imgURL='(.*?)';var"
            s = re.findall(ec, html)[0]  # /img/bizhi/pic/201804/30/moraine_lake.jpg
            data_url = f'https://www.ivsky.com/get_picinfo.php?tn=downloadpic&picurl={s}'
            lest.append(data_url)
        print(f'{s} fetched')


async def main():
    task_1 = [asyncio.create_task(yyds(url)) for url in slist]
    await asyncio.wait(task_1)
    print(lest)


loop = asyncio.get_event_loop()
loop.run_until_complete(main())

# Start the asynchronous part
hdrs = {
    'cookie': '__yjs_duid=1_1bda7ffae32f47059d094aa036adcf651634380757728; Hm_lvt_a951b469f6e313457f2934c362ed30de=1636204999,1636277027,1636278256; statistics_clientid=me; Hm_lvt_862071acf8e9faf43a13fd4ea795ff8c=1637036227,1637054815,1637123760,1637141749; Hm_lpvt_c13cf8e9faf62071ac13fd4eafaf1acf=1637141759; Hm_lpvt_862071acf8e9faf43a13fd4ea795ff8c=1637141760'
}


async def aiodownload(h):
    url = 'https://img-picdown.ivsky.com/img/downloadpic/download/' + h
    img_name = url.split('/')[-1]  # image name
    imgPath = './极简壁纸爬取结果2/' + img_name + '.jpg'  # path the image is saved to

    async with aiohttp.ClientSession() as session:
        async with session.get(url=url) as resp:
            async with aiofiles.open(imgPath, mode='wb') as f:
                await f.write(await resp.content.read())
        print(img_name, "download finished")


async def getCatalog(lest):
    tasks = []
    for item in lest:
        data_json = requests.get(url=item, headers=hdrs).json()
        h = data_json['data']
        tasks.append(asyncio.create_task(aiodownload(h)))
        # queue the async task
    print('starting download tasks')
    await asyncio.wait(tasks)


# # asyncio.run(getCatalog(lest)))
loop = asyncio.get_event_loop()
loop.run_until_complete(getCatalog(lest))
t2 = time.time()

print(t2 - t1)
Both versions write the binary data in 'wb' mode.


[Image: result of the synchronous crawl; works very well]

[Image: result of the asynchronous crawl; the files will not open]
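One way to narrow down why the asynchronously saved files will not open is to look at their first bytes. A real JPEG starts with `FF D8 FF` and a PNG with `89 50 4E 47 0D 0A 1A 0A`, whereas an HTML error page or a JSON body saved under a `.jpg` name does not. The helper names below (`sniff_image_type`, `report_folder`) are hypothetical and not part of the original post; this is only a diagnostic sketch, not a fix.

```python
from pathlib import Path

# Well-known file signatures ("magic bytes") of common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes):
    """Return the image type suggested by the leading bytes, or None."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None


def report_folder(folder: str) -> dict:
    """Map each file name in `folder` to its sniffed type (None = not an image)."""
    result = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open("rb") as fh:
                result[path.name] = sniff_image_type(fh.read(16))
    return result


# Hypothetical usage on the folder written by the async script:
# for name, kind in report_folder('./极简壁纸爬取结果2').items():
#     print(name, '->', kind or 'not an image; open it in a text editor')
```

If a file is reported as not an image, opening it in a text editor usually shows what the server actually returned.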


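我叫小月亮 posted on 2021-11-17 23:50
Last edited by 我叫小月亮 on 2021-11-18 01:42

I honestly can't follow this async code; it looks like a lot of hassle. Check whether the image URLs you are getting are wrong: compare the links in your lest list with the actual image links.

Just go with multithreading instead; Thread is simple and blunt.

[Python]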
#-*- coding:utf-8 -*-
# version : python 3.6.5

import requests
from lxml import etree
import os
import time
from threading import Thread

RUN_PATH = (os.path.split(os.path.realpath(__file__))[0])

HEAD = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"}

IMG_PATH = os.path.join(RUN_PATH,'极简壁纸爬取结果')

t1 = time.time()
if not os.path.exists(IMG_PATH):
    os.mkdir(IMG_PATH)  # create the folder the images are saved to

# 1. GET the gallery URL and collect the src of each thumbnail
url = 'https://www.ivsky.com/bizhi/yangzi_v59469'
html1 = requests.get(url=url,headers=HEAD)
d = etree.HTML(html1.text)
slist = d.xpath('/html/body/div[3]/div[4]/ul/li/div/a/img/@src')  # thumbnail src values, not the page hrefs
html1.close()

# Function that downloads one image
def download_img(url, img_save_path):
    try:
        res = requests.get(url, headers=HEAD, timeout=3)
        with open(img_save_path, 'wb') as f:
            f.write(res.content)
        print(f"{img_save_path} downloaded.")
    except Exception as er:
        print(f"{url} download failed, ERROR:{er}")

t_list = []
for li in slist:
    new_li = li.replace("//img.ivsky.com/img/bizhi/t","/img/bizhi/pic")
    data_url = 'https://www.ivsky.com/get_picinfo.php?tn=downloadpic&picurl='+new_li
    hdrs = {
        "accept": "application/json, text/javascript, */*; q=0.01",
        'cookie': f'__yjs_duid=1_40f8d9b42822fb264887754d6cd6f58d1637161518093; Hm_lpvt_c13cf8e9faf62071ac13fd4eafaf1acf={int(time.time())}',
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36",
    }
    try:
        data_html = requests.get(url=data_url, headers=hdrs)
        dit = data_html.json()
        h = dit['data']
        url3 = 'https://img-picdown.ivsky.com/img/downloadpic/download/' + h
        img_name = li.split('/')[-1]  # image name
        imgPath = os.path.join(IMG_PATH, img_name)  # path the image is saved to
        t = Thread(target=download_img, args=(url3, imgPath))
        t.start()
        t_list.append(t)
    except Exception as er:
        print(f"failed to get the image URL, ERROR:{er}")

if len(t_list)>0:
    for t in t_list:
        t.join()

t2 = time.time()
print(t2 - t1)
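
The script above starts one Thread per image. As a variation (not what the reply posted), a bounded pool from `concurrent.futures` caps how many downloads run at once and collects failures in one place. The names `download_all` and `max_workers=8` are assumptions made for this sketch; `download` stands for any callable with the same `(url, path)` signature as `download_img`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(jobs, download, max_workers=8):
    """Run download(url, path) for every (url, path) pair on a bounded pool.

    Returns (succeeded, failed): a list of paths and a list of (url, error) pairs.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which job each future belongs to.
        futures = {pool.submit(download, url, path): (url, path) for url, path in jobs}
        for fut in as_completed(futures):
            url, path = futures[fut]
            try:
                fut.result()  # re-raises any exception raised in the worker
                succeeded.append(path)
            except Exception as err:
                failed.append((url, err))
    return succeeded, failed
```

Note that `download_img` as posted catches its own exceptions, so it would need to re-raise them for failures to show up in the `failed` list.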


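OP | AnWenpython posted on 2021-11-19 11:24
我叫小月亮 posted on 2021-11-17 23:50
I honestly can't follow this async code; it looks like a lot of hassle. Check whether the image URLs you are getting are wrong: compare the links in your lest list with ...

Really slick work, mate.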
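OP | AnWenpython posted on 2021-11-19 12:46
我叫小月亮 posted on 2021-11-17 23:50
I honestly can't follow this async code; it looks like a lot of hassle. Check whether the image URLs you are getting are wrong: compare the links in your lest list with ...

I ran your code, but the downloaded images still won't open. Could it be my computer (Windows 10)? I'd like to get as good at this as you are; could you give me some pointers?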
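OP | AnWenpython posted on 2021-11-19 15:15
AnWenpython posted on 2021-11-19 12:46
I ran your code, but the downloaded images still won't open. Could it be my computer (Windows 10)? I'd like to get as good at this, ...

Whoa, it turns out this site has an anti-crawling mechanism. The first few characters of the data value are not part of the original image's URL, and after refreshing a few times I found the value keeps changing. Now I'm stuck...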
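
Since the data value changes from request to request, one thing that may be worth trying is to request the value and download the image right away inside the same coroutine, and to check the bytes before saving. That the value is short-lived is only an assumption; the thread does not confirm it. The sketch uses only asyncio so it can run offline: `fetch_token`, `fetch_bytes`, and `save` are hypothetical callables supplied by the caller (in practice thin wrappers around aiohttp's `session.get` and a file write), and `max_concurrency=5` is an arbitrary choice.

```python
import asyncio

JPEG_MAGIC = b"\xff\xd8\xff"  # leading bytes of every JPEG file


async def resolve_and_download(item, fetch_token, fetch_bytes, save, limit):
    """Fetch the data value for one picture and download it immediately.

    fetch_token(item) -> str and fetch_bytes(token) -> bytes are coroutines
    supplied by the caller (hypothetical); save(item, data) writes the file.
    """
    async with limit:  # cap the number of pictures in flight
        token = await fetch_token(item)
        data = await fetch_bytes(token)
    if not data.startswith(JPEG_MAGIC):
        return item, False  # not JPEG data: skip instead of saving junk
    save(item, data)
    return item, True


async def crawl(items, fetch_token, fetch_bytes, save, max_concurrency=5):
    """Process all items concurrently; results keep the order of `items`."""
    limit = asyncio.Semaphore(max_concurrency)
    jobs = [resolve_and_download(i, fetch_token, fetch_bytes, save, limit) for i in items]
    return await asyncio.gather(*jobs)
```

Items that come back with `False` are the ones where the server returned something other than JPEG data, which is where further inspection would start.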
Prozacs posted on 2021-11-19 15:38
Referer
OP | AnWenpython posted on 2021-11-19 16:23

Do the headers need a Referer? I tried that and it still doesn't work.
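
For reference, a Referer is normally set to the page that links to the resource, here the picture page. The sketch below builds such headers; whether this site actually checks them is not established anywhere in the thread. `build_headers` is a hypothetical helper, and the User-Agent string is the one used in the threaded script earlier in this thread.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36"
)


def build_headers(page_url: str) -> dict:
    """Headers that mimic a browser arriving from the picture page."""
    return {"User-Agent": USER_AGENT, "Referer": page_url}


page = "https://www.ivsky.com/bizhi/moraine_lake_v48781/pic_769150.html"
headers = build_headers(page)

# The same dict can be passed as headers= to requests.get or to aiohttp's
# session.get. Here it is attached to a stdlib Request object; no network
# traffic happens until the request is actually opened.
req = urllib.request.Request("https://www.ivsky.com/get_picinfo.php", headers=headers)
```

In the async script above, the image request in `aiodownload` sends no custom headers at all, so passing the same dict there is the place to test this.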
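楚子沦i posted on 2021-11-21 20:56
I took a quick look at the site and didn't find any verification.
Also, to get the original image you probably have to download it or click the "view original image" link.
This looks like the correct original-image link: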
https://www.ivsky.com/download_pic.html?picurl=/img/bizhi/pic/201804/30/moraine_lake-001.jpg&pichtml=//www.ivsky.com/bizhi/moraine_lake_v48781/pic_769151.html


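Also, the site itself seems to offer a place to download the original image. You could crawl the "get original image" link on each page and then request it to fetch the original.

Finally, if there is some kind of verification, I'd recommend selenium, which handles pretty much everything.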
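
The download-page link quoted in this reply consists of two query parameters, `picurl` and `pichtml`. A minimal sketch of assembling it for any picture is shown below; `build_download_page_url` is a hypothetical name, and what the page returns (and whether it needs cookies) is not verified here.

```python
from urllib.parse import urlencode


def build_download_page_url(picurl: str, pichtml: str) -> str:
    """Assemble the download_pic.html link in the form quoted above."""
    # safe="/" keeps the slashes readable instead of percent-encoding them.
    query = urlencode({"picurl": picurl, "pichtml": pichtml}, safe="/")
    return "https://www.ivsky.com/download_pic.html?" + query


example = build_download_page_url(
    "/img/bizhi/pic/201804/30/moraine_lake-001.jpg",
    "//www.ivsky.com/bizhi/moraine_lake_v48781/pic_769151.html",
)
print(example)
```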
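OP | AnWenpython posted on 2021-11-29 20:03
我叫小月亮 posted on 2021-11-17 23:50
I honestly can't follow this async code; it looks like a lot of hassle. Check whether the image URLs you are getting are wrong: compare the links in your lest list with ...

Could I add you as a friend to ask some questions? I still don't understand why the cookie can be handled that way.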
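我叫小月亮 posted on 2021-11-30 12:36
AnWenpython posted on 2021-11-29 20:03
Could I add you as a friend to ask some questions? I still don't understand why the cookie can be handled that way.

That cookie seems to be generated based on your IP address; your cookie didn't work on my end. At first I thought it was a timestamp problem, which is why I handled it that way.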
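
If the cookie really is tied to the client's IP address, a cookie copied from another machine cannot be reused. One option to try is to let a cookie jar collect whatever the server sets on the first visit and send it back automatically. This assumes the site sets the cookie with an ordinary Set-Cookie header rather than through JavaScript, which is unverified. `make_cookie_opener` is a hypothetical helper; `requests.Session()`, which appears commented out in the original scripts, keeps cookies between requests in the same way.

```python
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_opener():
    """Return (opener, jar); the jar stores cookies from every response."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_cookie_opener()

# Hypothetical usage (performs network requests, so it is left commented out):
# opener.open("https://www.ivsky.com/bizhi/yangzi_v59469/")  # server may set cookies here
# print([c.name for c in jar])                               # inspect what was received
# Later opener.open(...) calls send the stored cookies back automatically.
```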