[Multithreaded] Beauty wallpaper crawler: look after your health!

话痨司机啊 posted on 2022-5-9 21:15

Last edited by 话痨司机啊 on 2022-9-5 13:21

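This site has more than 20,000 images~~ Please don't knock the site over: I only gave the crawler 5 threads. If you know your way around the code, change that yourself; the packaged build crawls with 5 threads~
Packaged build: https://pan.baidu.com/s/1N5WeMLoGL6a09zUhFv2kVA?pwd=6han extraction code: 6han
Multithreaded crawler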

from collections import namedtuple
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
import os
import re
import threading

import keyboard
import requests
from fake_useragent import UserAgent
from lxml import etree
from rich.console import Console

console = Console()
headers = {'User-Agent': UserAgent().random}
DATA = namedtuple('DATA', ['year', 'month', 'day', 'title', 'href'])
url = 'https://www.vmgirls.com/archives/'
img_list = ['jpg', 'png', 'gif', 'jpeg']
stop_event = threading.Event()  # set once esc has been pressed


def start_requests():
    '''
    Yield one DATA record per post listed on the archive page.
    '''
    res = requests.get(url, headers=headers)
    et = etree.HTML(res.text)
    # All years
    y = et.xpath('//div[@id="archives"]/h4/text()')
    for year in range(1, len(y) + 1):
        # Months of this year
        m = et.xpath(f'//div[@id="archives"]//ul[{year}]/li/span/text()')
        for month in range(1, len(m) + 1):
            # Posts (one per day entry) of this month
            base = f'//div[@id="archives"]//ul[{year}]/li[{month}]/ul/li'
            d = et.xpath(base)
            for day in range(1, len(d) + 1):
                _day = et.xpath(f'{base}[{day}]/text()')
                _href = et.xpath(f'{base}[{day}]/a/@href')
                _title = et.xpath(f'{base}[{day}]/a/text()')
                if not (_day and _href and _title):
                    continue  # skip entries that lack any of the three fields
                # xpath() returns lists, so take the first match of each
                yield DATA(y[year - 1], m[month - 1], _day[0], _title[0], _href[0])


def get_data(yield_func):
    '''
    Pass the records through unchanged.
    '''
    yield from yield_func


def save_img(url, path, title):
    '''
    Save every image linked from one post page.
    '''
    if stop_event.is_set():
        return
    imgcs = requests.get(url, headers=headers)
    et = etree.HTML(imgcs.text)
    IMG = et.xpath('//div[@class="nc-light-gallery"]//@href')
    path = mkdir_path(path)
    # enumerate() visits every link, including the last one
    for i, img_url in enumerate(IMG):
        # Check the flag instead of blocking the worker on keyboard.read_key()
        if stop_event.is_set():
            return
        ext = img_url.split('.')[-1].lower()
        if ext in img_list:
            res = requests.get(img_url, headers=headers)
            with open(f'{path}/{title}_{i}.{ext}', 'wb') as f:
                f.write(res.content)
            nowdate = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            console.print(f'Time: {nowdate}\nSaved to: {path}\nFile: {title}_{i}.{ext} saved!\nTip: press esc to quit')
            console.print('-' * 70)


def mkdir_path(path):
    '''
    Create the directory (whitespace stripped from the path) and return it.
    '''
    path = re.sub(r'[\s]', '', path)
    # exist_ok avoids a race when two threads create the same directory
    os.makedirs(path, exist_ok=True)
    return path


def main():
    '''
    Submit one download task per post to a pool of 5 threads.
    '''
    # Pressing esc sets the flag; workers stop at their next check
    keyboard.add_hotkey('esc', stop_event.set)
    with ThreadPoolExecutor(max_workers=5) as executor:
        try:
            for data in get_data(start_requests()):
                if stop_event.is_set():
                    break
                # data.day[:-2] is kept from the original post; it drops the
                # last two characters of the day text
                path = os.path.join(os.getcwd(), '美女壁纸', data.year, data.month, data.day[:-2])
                executor.submit(save_img, data.href, path, data.title)
        except Exception as e:
            print(e)
            console.print('Exiting!')
            os._exit(0)


if __name__ == '__main__':
    main()
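
A side note on the thread pool: the script hands tasks to `executor.submit` and never looks at the returned futures, so an exception raised inside a worker is stored on the future and nothing is printed. Below is a minimal sketch of one way to keep the 5-thread limit and still see failures. The names `run_all` and `MAX_WORKERS_CAP` are made up for illustration and are not part of the original script.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS_CAP = 5  # hypothetical ceiling, chosen to match the post's 5 threads


def run_all(func, items, max_workers=5):
    """Run func over items in a bounded pool; return (results, errors) dicts."""
    # Clamp the requested size so a careless caller cannot hammer the site
    workers = max(1, min(max_workers, MAX_WORKERS_CAP))
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                # result() re-raises whatever the worker raised
                results[item] = future.result()
            except Exception as exc:
                errors[item] = exc
    return results, errors


if __name__ == '__main__':
    ok, failed = run_all(lambda x: 10 // x, [0, 1, 2, 5], max_workers=50)
    print(ok, failed)
```

In the crawler, `func` would be a wrapper around `save_img`, and the `errors` dict would show which posts failed and why.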

Screenshots

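话痨司机啊 posted on 2022-5-9 22:22

Last edited by 话痨司机啊 on 2022-5-9 22:29

alongzhenggang posted on 2022-5-9 22:00
QQ管家 blocked me
"The code has one line that monitors the esc key." 火绒 doesn't care about it, but QQ管家 flags it. In that case, just run the source code~

Don't shout "trojan" the moment an antivirus tool complains; if there were a real problem, the post would not have passed the forum's review.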
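
For anyone who would rather avoid a keyboard hook entirely, here is a rough sketch of the same "stop on request" idea using only the standard library: workers poll a shared `threading.Event` between downloads. The function `process` and the fake download are invented for this example; in the crawler the event would be set by whatever signals "quit" (the esc key, Ctrl+C, and so on).

```python
import threading


def process(items, stop_event, handle):
    """Handle items one by one until stop_event is set; return what was done."""
    done = []
    for item in items:
        # Cooperative check: never blocks, unlike waiting for a key press
        if stop_event.is_set():
            break
        handle(item)
        done.append(item)
    return done


if __name__ == '__main__':
    stop = threading.Event()

    def fake_download(item):
        # Stands in for one image download; item 2 simulates the quit request
        if item == 2:
            stop.set()

    print(process(range(10), stop, fake_download))  # prints [0, 1, 2]
```

Because the check happens between items, a download that is already in flight finishes before the worker stops.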

alongzhenggang posted on 2022-5-9 21:44

A humble learner's admiration.

tony0727 posted on 2022-5-9 21:48

Tried the packaged build. Works well, thumbs up.

国际豆哥 posted on 2022-5-9 21:50

I think this is a good thing.

9293mcqmyxh posted on 2022-5-9 21:54

Can it be exported?

huduke posted on 2022-5-9 21:58

This is a good thing.

alongzhenggang posted on 2022-5-9 22:00

QQ管家 blocked me.

https://s1.ax1x.com/2022/05/09/OJR9d1.png

https://s1.ax1x.com/2022/05/09/OJRrSU.png

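dunniu posted on 2022-5-9 22:02

Does this only download the first image?!

小楼昨夜东风 posted on 2022-5-9 22:21

I need this. Downloaded.

话痨司机啊 posted on 2022-5-9 22:21

dunniu posted on 2022-5-9 22:02
Does this only download the first image?!

I only tested it for 1 second~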
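
On the missing images, one thing that stands out in the posted code is the loop `for i in range(0,len(IMG)-1)`, which never reaches the last link on a page. Whether that explains what dunniu saw is only a guess. A small sketch of walking every link and filtering by extension follows; `pick_images` and the example.com URLs are made up for illustration.

```python
import os
from urllib.parse import urlparse

IMG_EXTS = {'jpg', 'png', 'gif', 'jpeg'}  # same extensions as img_list in the post


def pick_images(links, allowed=IMG_EXTS):
    """Return (index, link, ext) for every link whose path ends in an allowed extension."""
    picked = []
    # enumerate() covers all links, so the final one is not skipped
    for i, link in enumerate(links):
        # Take the extension from the URL path so a query string does not confuse it
        ext = os.path.splitext(urlparse(link).path)[1].lstrip('.').lower()
        if ext in allowed:
            picked.append((i, link, ext))
    return picked


if __name__ == '__main__':
    demo = [
        'https://example.com/a.jpg',
        'https://example.com/b.PNG?x=1',
        'https://example.com/page.html',
    ]
    for index, link, ext in pick_images(demo):
        print(index, ext, link)
```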


吾爱破解 - 52pojie.cn's Archiver
