Last edited by OnlineYx on 2021-1-13 14:44
Browsing the forum today I saw a fellow member's script for scraping girl photos that looked pretty good; the original thread is here: https://www.52pojie.cn/thread-1348486-1-1.html
I tried modifying it a bit: I added a loop and a check, so you just enter the starting and ending page IDs and it downloads the photos in bulk.
[Python]
import os
import time
import requests
import re

headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    'Accept-Encoding': 'gzip',
    "Referer": "https://www.baidu.com/"
}

httpnum = int(input("Enter the starting page ID: "))
httpnum1 = int(input("Enter the ending page ID: "))
for i in range(httpnum, httpnum1 + 1):
    httpurl = "https://www.vmgirls.com/{0}.html".format(i)
    response = requests.get(httpurl, headers=headers)
    html = response.text
    # A page that exists contains this fixed fragment in its source
    if "<style></style><meta name=keywords content=" not in html:
        print("Page {0} does not exist".format(i))
        continue
    dir_name = re.findall('<h1 class="post-title h1">(.*?)</h1>', html)[-1]
    if not os.path.exists(dir_name):
        os.mkdir(dir_name)
    urls = re.findall('<a href="(.*?)" alt=".*?" title=".*?">', html)
    for url in urls:
        time.sleep(1)  # throttle requests so we don't hammer the server
        name = url.split('/')[-1]
        response = requests.get("https:" + url, headers=headers)
        print("Downloading " + name)
        with open(dir_name + '/' + name, 'wb') as f:
            f.write(response.content)
    print("Page {0} done".format(i))
print("All downloads finished")
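Two small details in the loop above are worth pulling out: the scraped hrefs are protocol-relative (hence the `"https:" + url`), and the local file name is just the last path segment. A minimal sketch of both as helpers (the function names are my own, and I only prepend the scheme when it is actually missing):

```python
def normalize_url(url):
    # The scraped hrefs look like //static.vmgirls.com/...,
    # so add "https:" only when the scheme is missing.
    if url.startswith('//'):
        return 'https:' + url
    return url

def file_name_from_url(url):
    # Use the last path segment as the local file name,
    # same as url.split('/')[-1] in the script above.
    return url.rsplit('/', 1)[-1]
```

With these, the download line becomes `requests.get(normalize_url(url), headers=headers)` and works whether or not the href already carries a scheme.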
-------------------------------------------------------------------------
Just tweaked the code a little again, prepending the page ID to the directory name:
[Python]
dir_name0 = re.findall('<h1 class="post-title h1">(.*?)</h1>', html)[-1]
dir_name = str(i) + dir_name0
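One thing `os.mkdir` will choke on is a scraped title containing characters that are illegal in a path. A minimal sketch that prefixes the page ID and sanitizes the title in one step (the helper name, the separator, and the character set to strip are my assumptions):

```python
import re

def safe_dir_name(page_id, title):
    # Replace characters illegal in Windows/Unix directory names,
    # trim whitespace, and prefix the page ID so folders sort by ID.
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return "{0}-{1}".format(page_id, cleaned)
```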
-------------------------------------------------------------------------
Improved it again (the regex part).
Scraping images from pages with IDs after 12000 works fine.
Take page ID 12985 as an example:
the image section of its page source looks like this
But on pages with IDs before 12000, the image section of the source looks like this
So an extra check is needed to scrape the images on pages before ID 12000:
[Python]
urls = re.findall('<img alt=".*?" loading=lazy src="(.*?)" alt=""', html)
if len(urls) == 0:
    urls = re.findall('<a href="(.*?)" alt=".*?" title=".*?">', html)
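The same fallback can be written as a list of patterns tried in order, which makes it easy to add a third template if the site changes again. A minimal sketch (the function and constant names are my own; the two patterns are the ones from this post):

```python
import re

# Patterns for the two page templates, newest first;
# the first pattern that matches anything wins.
IMG_PATTERNS = [
    r'<img alt=".*?" loading=lazy src="(.*?)" alt=""',
    r'<a href="(.*?)" alt=".*?" title=".*?">',
]

def extract_image_urls(html):
    for pattern in IMG_PATTERNS:
        urls = re.findall(pattern, html)
        if urls:
            return urls
    return []
```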