Integration: Python multithreaded batch download of the images on a web page
Last edited by ytw6176 on 2022-4-23 16:43. A few days ago I was scrolling Douyin and came across some Python videos; I got hooked, found a few tutorials online, but after watching 5 or 6 I lost the patience to keep going.
So I decided to just work through demos instead, and dug up two Python demos written by fellow members on this site:
demo1: 简易封装的python多线程类 (a simply wrapped Python multithreading class), from the 编程语言讨论求助区 board on 吾爱破解 (www.52pojie.cn)
demo2: python爬虫下载网页图片简例 (a simple example of a Python crawler downloading images from a page), from the 编程语言区 board on 吾爱破解 (www.52pojie.cn)
After poring over them for a while I roughly understood both demos, debugging while looking things up on Baidu as I went. Learning this way feels much faster.
Following the suggestion in reply #2 of the demo2 thread, I tried merging the two into one script. Posting it here for the Python veterans to take a look and help optimize it.
import re
import requests
from threading import Thread
from queue import Queue
import time
from bs4 import BeautifulSoup as bsp

q = Queue(100000)


class FastRequests:
    def __init__(
        self, threads=20, headers={
            'User-Agent': 'Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.188 Safari/537.36 CrKey/1.54.250320 Edg/99.0.4844.74',
            'Cookie': ''
        }
    ):
        self.threads = threads  # number of worker threads, default 20
        self.headers = headers  # request headers
        self.imgUrl = r'https://www.tuba555.net/m/htm9/43816.html'

    # collect the URLs of all matching <img> elements on the page
    def getImg(self):
        eleGet = requests.get(url=self.imgUrl, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"})
        soup = bsp(eleGet.content.decode('gb2312'), 'lxml')
        imgs = soup.find_all('img')
        for i in imgs:
            i = str(i)
            if "tupian_img" in i:  # keep only <img> tags whose class is tupian_img
                kaishi = i.find("src") + 5  # 'src="' is 5 characters long
                jieshu = i.find("jpg") + 3  # 'jpg' is 3 characters long
                img = i[kaishi:jieshu]      # slice the image URL out of the tag string
                q.put(img)

    def run(self):
        for i in range(self.threads):
            t = Consumer(self.headers)
            t.start()


class Consumer(Thread):
    def __init__(self, headers):
        Thread.__init__(self)
        self.headers = headers
        self.size = 0
        self.time = 0

    def run(self):
        while True:
            if q.qsize() == 0:  # stop this worker once the queue is empty
                break
            self.download(q.get())

    def validateTitle(self, title):
        rstr = r"[\/\\\:\*\?\"\<\>\|]"  # characters not allowed in file names: / \ : * ? " < > |
        new_title = re.sub(rstr, "_", title)  # replace them with underscores
        return new_title

    def sizeFormat(self, size, is_disk=False, precision=2):
        formats = ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']
        unit = 1000.0 if is_disk else 1024.0
        if not (isinstance(size, float) or isinstance(size, int)):
            raise TypeError('a float number or an integer number is required!')
        if size < 0:
            raise ValueError('number must be non-negative')
        for i in formats:
            size /= unit
            if size < unit:
                return f'{round(size, precision)}{i}'
        return f'{round(size, precision)}{i}'

    def download(self, info):
        title = info.split(r'/')[-1]
        link = info
        if title == '':
            title = self.validateTitle(link.split('/')[-1])
        start_time = time.time()
        response = requests.get(url=link, headers=self.headers, stream=True).content
        end_time = time.time()
        self.time = end_time - start_time
        self.size += len(response)  # accumulate the bytes downloaded by this worker
        with open("D:\\" + title, 'wb') as f:
            f.write(response)
        print(f'{title} {self.sizeFormat(self.size)} time: {round(self.time, 3)}s')


if __name__ == '__main__':  # only runs when executed as a script, not when imported
    fr = FastRequests()
    fr.getImg()
    fr.run()
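
One thing I noticed while debugging: instead of slicing the tag string with find(), BeautifulSoup can read the src attribute directly. A minimal sketch of that idea as a drop-in replacement for getImg (untested against the real page; it assumes the same tupian_img class and gb2312 encoding as above and reuses requests, bsp and q from the script):

    # Sketch: read the src attribute instead of slicing the tag string.
    def getImg(self):
        eleGet = requests.get(url=self.imgUrl, headers=self.headers)
        soup = bsp(eleGet.content.decode('gb2312'), 'lxml')
        for tag in soup.find_all('img', class_='tupian_img'):  # filter by class directly
            src = tag.get('src')  # the image URL, or None if the attribute is missing
            if src and src.endswith('.jpg'):
                q.put(src)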
homehome replied on 2022-4-23 20:36:
I bet it won't be long before the OP integrates something else into this.
Haha, still learning; once I've integrated more I'll share it again.

Other replies:
Thanks for sharing, this is awesome.
Thanks for sharing, I'll study it.
Thanks, OP :victory:
Impressive, though I can't make sense of it at all.
Marked, will come back and study it later.
Can it only download images from the current page, not from all of the pages?
Thanks for sharing.
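
On the question about only downloading the current page: getImg only ever parses the single self.imgUrl, so yes, it handles one page at a time. A minimal sketch of one way to queue up several pages before starting the threads (the extra page URL below is hypothetical; fill in the real pagination links):

if __name__ == '__main__':
    page_urls = [
        'https://www.tuba555.net/m/htm9/43816.html',
        # 'https://www.tuba555.net/m/htm9/43816_2.html',  # hypothetical second page
    ]
    fr = FastRequests()
    for url in page_urls:
        fr.imgUrl = url  # point getImg at each page in turn
        fr.getImg()      # enqueue that page's image links
    fr.run()             # then start the consumer threads once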