Batch-scraping wallhaven wallpapers

gebiafu posted on 2022-11-14 00:09

# Batch-scraping wallhaven wallpapers
1. Import the required modules

```python
import requests
from bs4 import BeautifulSoup
```
2. Fetch the page content

```python
link = 'https://wallhaven.cc/toplist'
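# Request headers: send a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}

# Download the listing page and keep its HTML as a string
res = requests.get(url=link, headers=headers).text
print(res)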
```
The output looks roughly like this:

![Screenshot](https://img-blog.csdnimg.cn/0a252997e063468e94cdd341784fab21.png)
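
The request in step 2 sets no timeout and does not check the HTTP status, so a stalled connection or an error page would pass through silently. A minimal sketch of a more defensive fetch follows. `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption rather than anything from the original post.

```python
import requests

# Same User-Agent string as in the post.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
}


def fetch_html(url, session=None, timeout=10):
    """Return the response body as text, raising on HTTP errors.

    `fetch_html` is a hypothetical helper, not part of requests.
    `session` may be a requests.Session or any object with a compatible get().
    """
    getter = session if session is not None else requests
    response = getter.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text
```

Calling `fetch_html('https://wallhaven.cc/toplist')` would then stand in for the `requests.get(...).text` line, and passing a `requests.Session()` reuses one connection across many requests.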
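3. Use the browser's element inspector to find where the image element sits in the page

![Screenshot](https://img-blog.csdnimg.cn/886a6392c7de488f94a6c35c1f018099.png)
4. Parse the HTML and extract the image link. The screenshot above shows that the link element has `class="preview"`.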

```python
# Parse the HTML
soup = BeautifulSoup(res, 'html.parser')
# find() returns only the first element with class "preview"
items = soup.find(class_='preview')['href']
print(items)
```
The console prints the following:

```powershell
C:\Users\w\AppData\Local\Programs\Python\Python37\python.exe D:\PY\catchVideos\test.py
https://wallhaven.cc/w/zygeko
```
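5. That address is only the wallpaper's preview page, so it has to be fetched and parsed once more, in the same way, to get the image's real path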

```python
# Fetch the preview page, then read the src of its <img id="wallpaper"> element
resUel = requests.get(url=items, headers=headers).text
soup1 = BeautifulSoup(resUel, 'html.parser').find('img', id='wallpaper')['src']
print(soup1)
```
This gives the final image link, which opens the image when clicked:
**(https://w.wallhaven.cc/full/zy/wallhaven-zygeko.jpg)**
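
The chained `find(...)['src']` in step 5 raises a `TypeError` whenever the preview page has no `<img id="wallpaper">`, because `find` returns `None` in that case. The sketch below guards against that. `extract_full_image_url` is a hypothetical helper name, and the HTML strings are made-up stand-ins for a real preview page.

```python
from bs4 import BeautifulSoup


def extract_full_image_url(preview_html):
    """Return the src of <img id="wallpaper">, or None if it is missing.

    `extract_full_image_url` is a hypothetical helper name.
    """
    soup = BeautifulSoup(preview_html, 'html.parser')
    img = soup.find('img', id='wallpaper')
    if img is None:
        return None
    return img.get('src')


# Made-up HTML standing in for a preview page.
sample = ('<main><img id="wallpaper" '
          'src="https://w.wallhaven.cc/full/zy/wallhaven-zygeko.jpg"></main>')
print(extract_full_image_url(sample))
print(extract_full_image_url('<main>no image here</main>'))
```

A caller can then skip a wallpaper when the function returns `None` instead of letting the whole run crash.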

6. Download the image's binary content and save it locally. The link shows that the image is a JPG.

```python
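# Download the image bytes (one request is enough)
soup1_content = requests.get(url=soup1, headers=headers).content
# Save the image. soup1[-20:-4] is the file name without ".jpg", e.g. "wallhaven-zygeko".
# 'wb' overwrites on a re-run; 'ab' would append to an existing file and corrupt it.
with open('e:\\' + soup1[-20:-4] + '.jpg', mode='wb') as f:
    f.write(soup1_content)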
```
![Screenshot](https://img-blog.csdnimg.cn/35482d480d57473faf91c0b6e7433371.png)
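
Slicing the URL with `[-20:-4]` and appending `.jpg` works for the link shown here, but it assumes a fixed name length and a JPG extension. A small sketch that takes the file name straight from the URL path instead is shown below. `local_path_for` and `save_image` are hypothetical helper names, and the standard-library approach is a swapped-in alternative to the slicing used in the post.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(image_url, folder):
    """Build a local path that reuses the file name found in the URL path."""
    name = posixpath.basename(urlparse(image_url).path)
    return Path(folder) / name


def save_image(data, image_url, folder):
    """Write the bytes to folder/<name from URL>, creating the folder if needed."""
    target = local_path_for(image_url, folder)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)  # overwrites any existing file
    return target
```

With this, a PNG wallpaper keeps its `.png` extension, and the target folder does not have to exist beforehand.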

> That only fetches a single image. Next, rework step 4 to fetch images in batches.

```python
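# Parse the HTML (res is the page text fetched in step 2, already a string)
soup = BeautifulSoup(res, 'html.parser')
# find_all() returns every element with class "preview"
items = soup.find_all(class_='preview')

for item in items:
    link = item['href']
    # Open the preview page and read the full-size image URL
    resUel = requests.get(url=link, headers=headers).text
    soup1 = BeautifulSoup(resUel, 'html.parser').find('img', id='wallpaper')['src']
    soup1_content = requests.get(url=soup1, headers=headers).content
    # The e:\img folder must already exist
    with open('e:\\img\\' + soup1[-20:-4] + '.jpg', mode='wb') as f:
        f.write(soup1_content)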
```
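> Here `find_all` returns every matching element, giving a list of elements that each carry a link. Looping steps 5 and 6 over that list saves the images in bulk.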

![Screenshot](https://img-blog.csdnimg.cn/d751c4706c8c438fa6f133881783db67.png)
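
To make the difference between `find` and `find_all` concrete, the sketch below runs both on a made-up HTML fragment. The wallpaper IDs `aaaaaa` and `bbbbbb` are placeholders, not real pages, and the third anchor deliberately lacks an `href` to show one way of skipping incomplete elements.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the listing page.
listing_html = '''
<ul>
  <li><a class="preview" href="https://wallhaven.cc/w/aaaaaa"></a></li>
  <li><a class="preview" href="https://wallhaven.cc/w/bbbbbb"></a></li>
  <li><a class="preview"></a></li>
</ul>
'''

soup = BeautifulSoup(listing_html, 'html.parser')

first = soup.find(class_='preview')          # a single Tag, or None
all_items = soup.find_all(class_='preview')  # a list-like ResultSet of Tags

# Keep only the anchors that actually carry an href attribute.
links = [a['href'] for a in all_items if a.has_attr('href')]

print(first['href'])
print(len(all_items))
print(links)
```

Filtering with `has_attr` avoids a `KeyError` on `item['href']` if an element without a link ever shows up.
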
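> The code above only fetches one page of images. The next rework wraps these steps in a function that takes the link and the page number as parameters, so that several pages can be scraped in one run.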

```python
import requests
from bs4 import BeautifulSoup
# Scrape wallpapers from the listing pages in a loop

def catchImg(url, page):
    # Request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
    }

    res = requests.get(url=url + '?page=' + page, headers=headers)
    # Parse the HTML
    soup = BeautifulSoup(res.text, 'html.parser')
    items = soup.find_all(class_='preview')

    for item in items:
        link = item['href']
        resUel = requests.get(url=link, headers=headers).text
        soup1 = BeautifulSoup(resUel, 'html.parser').find('img', id='wallpaper')['src']
        soup1_content = requests.get(url=soup1, headers=headers).content
        # The e:\img folder must already exist
        with open('e:\\img\\' + soup1[-20:-4] + '.jpg', mode='wb') as f:
            f.write(soup1_content)
        print(soup1)


link = 'https://wallhaven.cc/toplist'
# Fetch 10 pages starting from page 2. The page number has to change on every
# iteration; resetting it inside the loop would fetch page 2 ten times.
for page in range(2, 12):
    catchImg(link, str(page))

```
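Two details of the multi-page loop are easy to get wrong: building the `?page=N` URL and avoiding repeated downloads when listing pages overlap (a reply in this thread reports seeing duplicate links). The sketch below isolates both using only the standard library. `page_url` and `new_links` are hypothetical helper names, and the short strings in the last two calls are placeholder links.

```python
from urllib.parse import urlencode


def page_url(base_url, page):
    """Append ?page=N to a listing URL (the query format used in the post)."""
    return base_url + '?' + urlencode({'page': page})


def new_links(links, seen):
    """Return the links not yet in `seen`, in order, and record them in `seen`."""
    fresh = []
    for link in links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


for page in range(1, 4):
    print(page_url('https://wallhaven.cc/toplist', page))

seen = set()
print(new_links(['a', 'b', 'a'], seen))  # placeholder links
print(new_links(['b', 'c'], seen))
```

Keeping one `seen` set across all pages means each preview page is requested at most once per run.
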
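> Checking the file sizes confirms that these are full-size images, not thumbnails.

![Screenshot](https://img-blog.csdnimg.cn/73682cb8ad214e9590cb7e79755de7d6.png)

> This site is relatively simple to scrape, and the wallpapers really are nice!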

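叫我小白呀 posted on 2022-11-15 11:40

What a coincidence, I wrote one last week too, aimed specifically at favorites collections. The images are large, so scraping is slow; my broadband tops out at 10M downstream.
https://s1.ax1x.com/2022/11/15/zEeEZV.png
https://s1.ax1x.com/2022/11/15/zEZYgs.png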

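gebiafu posted on 2022-11-14 10:17

YSJohnson posted on 2022-11-14 08:53
I only got 24 images and then it stopped. What should I do?
After changing the range in for i in range to 30, some of the links fetched are new and some are duplicates, but ...

There is probably a limit: topList seems to return only 48 images. Try a different link, such as https://wallhaven.cc/hot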

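wjqok posted on 2022-11-14 01:16

I once manually grabbed every wallpaper at 4K and above from this site, and did the same for a few other sites.

XiaoYxx posted on 2022-11-14 01:29

Nice, I'll give it a try too.

adlom0530 posted on 2022-11-14 02:03

Thanks for sharing, I'll give it a try!

long8818 posted on 2022-11-14 04:49

Thanks for sharing, just what I needed.

sxtyys posted on 2022-11-14 07:45

Thanks for sharing, I'll give it a try!

stvk1986 posted on 2022-11-14 07:48

Read it, tried to learn it, still didn't get it.

jinos posted on 2022-11-14 08:04

Good stuff. It makes sense when I read it and falls apart the moment I try it myself.

jiahaoya posted on 2022-11-14 08:06

Worth following along and trying out.

ADMMUU posted on 2022-11-14 08:16

I happen to need wallpapers, so I'll try it.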


吾爱破解 - 52pojie.cn's Archiver
