吾爱破解 - 52pojie.cn

[Python original] Wallhaven full-size image download, now with deduplication

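s936505608 posted on 2023-8-21 11:15
The names of images that have already been downloaded are stored in log.txt. Before each download the script checks that log and skips any image already listed. Pages are advanced automatically.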
[Python]
import os
import requests
from bs4 import BeautifulSoup
from tkinter import Tk, filedialog
import logging

def get_images_from_url(url):
    # Fetch one listing page and return the thumbnail URLs found on it.
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        lazyload_images = soup.find_all('img', class_='lazyload')
        # Skip any <img> that lacks a data-src attribute instead of raising KeyError.
        return [img['data-src'] for img in lazyload_images if img.get('data-src')]
    return []

def modify_image_url(image_url, formats):
    # Build candidate full-size URLs from a thumbnail URL, one per extension.
    image_name = image_url.split('/')[-1]
    image_id = image_name.rsplit('.', 1)[0]
    for fmt in formats:  # "fmt" avoids shadowing the built-in format()
        yield f"https://w.wallhaven.cc/full/{image_id[:2]}/wallhaven-{image_id}.{fmt}"

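def download_images(image_urls, save_folder, page, formats):
    log_path = os.path.join(save_folder, 'log.txt')
    # basicConfig only configures logging on the first call; later calls are no-ops.
    logging.basicConfig(filename=log_path, level=logging.INFO)
    # Collect the names already recorded in log.txt so they can be skipped.
    existing_images = set()
    if os.path.exists(log_path):
        with open(log_path, 'r') as log_file:
            for line in log_file:
                # Each log line looks like "INFO:root:<name>"; keep the last field.
                image_name = line.strip().split(':')[-1]
                existing_images.add(image_name)
    for url in image_urls:
        image_name = url.split('/')[-1]  # thumbnail file name, used as the dedup key
        stem = image_name.rsplit('.', 1)[0]
        if image_name in existing_images:
            print(f"Image already downloaded, skipping: {image_name}")
            continue
        # Also skip files that are on disk but missing from the log.
        if any(os.path.exists(os.path.join(save_folder, f"{stem}.{fmt}")) for fmt in formats):
            continue
        downloaded = False
        for modified_url in modify_image_url(url, formats):
            response = requests.get(modified_url, timeout=30)
            if response.status_code == 200:
                # Save under the extension that actually matched, not the thumbnail's.
                ext = modified_url.rsplit('.', 1)[-1]
                with open(os.path.join(save_folder, f"{stem}.{ext}"), 'wb') as f:
                    f.write(response.content)
                logging.info(image_name)
                print(f"Downloaded: {image_name}")
                downloaded = True
                break
        if not downloaded:
            print(f"All candidate URLs failed, skipped: {image_name}")
    print(f"Finished page {page}, moving on to the next page...")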

if __name__ == "__main__":
    root = Tk()
    root.withdraw()  # hide the empty root window; only the folder dialog is shown

    save_folder = filedialog.askdirectory(title="Select a folder to save images")
    if save_folder:
        page = 1
        image_formats = ['jpg', 'png', 'gif', 'jpeg']
        while True:
            url = f"https://wallhaven.cc/toplist?page={page}"
            image_urls = get_images_from_url(url)
            if not image_urls:
                # An empty page (or a failed request) ends the crawl.
                break
            download_images(image_urls, save_folder, page, image_formats)
            page += 1
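
The least obvious step in the script is how a thumbnail URL becomes a list of full-size candidates, so here is that string transformation in isolation. This is only a sketch: the function name full_image_candidates and the sample thumbnail URL are made up for illustration, and the URL layout is simply the one the script above assumes.

```python
def full_image_candidates(thumb_url, formats):
    """Return candidate full-size URLs for one thumbnail URL, one per extension."""
    file_name = thumb_url.split('/')[-1]      # e.g. "abc123.jpg"
    image_id = file_name.rsplit('.', 1)[0]    # e.g. "abc123"
    prefix = image_id[:2]                     # first two characters name the subfolder
    return [
        f"https://w.wallhaven.cc/full/{prefix}/wallhaven-{image_id}.{fmt}"
        for fmt in formats
    ]


# Hypothetical thumbnail URL, used only to show the string transformation.
thumb = "https://th.wallhaven.cc/small/ab/abc123.jpg"
for candidate in full_image_candidates(thumb, ["jpg", "png"]):
    print(candidate)
```

The thumbnail's own extension says nothing about the full-size file, which is why the script tries each extension in turn until one returns HTTP 200.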


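龍魂小白 posted on 2023-8-24 01:20
Another approach: set up Redis (writing to an existing key overwrites it), store every scraped image URL there, then read them back at the end, fetch the bytes, and save them.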
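
A rough sketch of that idea, assuming the redis-py client and a Redis hash whose fields are image file names, so a repeated name overwrites the earlier entry. The function names and the key "wallhaven:pending" are hypothetical, and the connection lines are left as comments because they need a running Redis server.

```python
def remember_urls(client, image_urls, key="wallhaven:pending"):
    """Store each URL in a Redis hash keyed by file name; repeats overwrite."""
    for url in image_urls:
        name = url.split('/')[-1]
        client.hset(key, name, url)


def pending_urls(client, key="wallhaven:pending"):
    """Return the deduplicated URLs, sorted for a stable order."""
    return sorted(client.hgetall(key).values())


# Usage with a real server (assumes the redis package is installed):
# import redis
# client = redis.Redis(host="localhost", port=6379, decode_responses=True)
# remember_urls(client, image_urls)
# for url in pending_urls(client):
#     ...  # fetch the bytes and save them, as in the original script
```

Compared with log.txt, this keeps the dedup state outside the download folder, at the cost of running an extra service.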
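AlanLee360 posted on 2023-8-21 11:49
ieee1395 posted on 2023-8-21 12:21
johna posted on 2023-8-21 12:25
Learned something, thanks for sharing!
A582168411 posted on 2023-8-21 12:34
Thanks for sharing
wangwh posted on 2023-8-21 14:26
Thanks for sharing, though I can't follow it
次谐波 posted on 2023-8-21 14:32
No idea what this is
Evan1992 posted on 2023-8-21 15:06
Learned something...
18974881483 posted on 2023-8-21 15:18
Thanks for sharing
john198688 posted on 2023-8-21 15:21
I'll look into it. Thanks, OP.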