Sharing a Python Crawler for 4K HD Wallpapers

小阿狸呀 · Posted 2022-7-29 18:46

A simple way to write a crawler; recommended for beginners to learn from.

[Python]

import requests
import re
import os
import time


# Create the output folder
def file_folder():
    # Create the 彼岸花4k壁纸 folder in the current working directory.
    # If it already exists, empty it, delete it, and then create it again.
    pathd = os.path.join(os.getcwd(), '彼岸花4k壁纸')
    if os.path.exists(pathd):  # check whether the folder already exists
        for root, dirs, files in os.walk(pathd, topdown=False):
            for name in files:
                os.remove(os.path.join(root, name))  # delete files
            for name in dirs:
                os.rmdir(os.path.join(root, name))  # delete subfolders
        os.rmdir(pathd)  # delete the folder itself
    os.mkdir(pathd)  # create a new, empty folder
count_1 = 1  # running number used for the saved file names
def data(url):
    global count_1
    response = requests.get(url)
    response.encoding = 'gbk'
    response = response.text
    url_r = '<li><a href="(.*?)" target="_blank"><img.*?><b>.*?</b></a></li>'
    url_name = re.findall(url_r, response)
    del url_name[0]  # drop the first matched link
    for i in url_name:
        # Open the detail page of each wallpaper
        page_url = 'https://pic.netbian.com' + i
        ress = requests.get(page_url)
        ress.encoding = "gbk"
        response = ress.text
        r = '<div class="photo-pic"><a href="" id="img"><img src="(.*?)".*?></a></div>'
        url_list = re.findall(r, response)
        if len(url_list) == 0:
            continue  # no image found on this detail page
        sts = 'https://pic.netbian.com' + url_list[0]
        res = requests.get(sts)
        with open(os.path.join(os.getcwd(), '彼岸花4k壁纸', f'{count_1}.jpg'), "wb") as f:
            f.write(res.content)
        count_1 += 1
        time.sleep(1.5)  # pause between downloads
if __name__ == "__main__":
    # Ask for the list-page URL prefix and the page range
    url = input('Image site: https://pic.netbian.com\nFiles are saved in the 彼岸花4k壁纸 folder\nThe first page cannot be crawled\nEnter the URL to crawl\n(Hint: write the URL only up to index_, e.g. https://pic.netbian.com/4kbeijing/index_ and add nothing after it): ')
    pagestart = input('Enter the start page: ')
    pagestop = input('Enter the end page: ')
    file_folder()
    for i in range(int(pagestart), int(pagestop) + 1):  # + 1 so that the end page is included
        st = url + str(i) + '.html'
        print(f'Start crawling page {i}')
        data(st)
        print('Page', i, 'finished. If no images show up in the folder, the URL was entered incorrectly')
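
The folder reset done by file_folder() can also be written with shutil.rmtree from the standard library, which removes a directory tree in one call. This is only a minimal sketch: the function name reset_folder and its base parameter are made up for the illustration and are not part of the script above.

```python
import os
import shutil


def reset_folder(base, name='彼岸花4k壁纸'):
    """Delete base/name if it exists, then recreate it as an empty folder."""
    path = os.path.join(base, name)
    if os.path.isdir(path):
        shutil.rmtree(path)  # removes the folder and everything inside it
    os.mkdir(path)
    return path
```

Calling reset_folder(os.getcwd()) should give the same result as file_folder(): an empty 彼岸花4k壁纸 folder in the current working directory.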

xxwwcc250 · Posted 2022-8-4 17:44

When I crawl the mobile wallpapers, I get a "list index out of range" error. I'm a beginner and can't debug it; could someone more experienced please advise?
Traceback (most recent call last):
  File "E:/pythonProject/彼岸花爬虫.py", line 61, in <module>
    data(st)
  File "E:/pythonProject/彼岸花爬虫.py", line 32, in data
    del url_name[0]
IndexError: list assignment index out of range

pangziyuan · Posted 2022-8-19 17:01

You only need to change line 26 of the code,
url_r = '<li><a href="(.*?)" target="_blank"><img.*?><b>.*?</b></a></li>'
to the following, and then every category can be downloaded:
url_r = '<li.*?><a href="(.*?)" target="_blank"><img.*?><b>.*?</b></a></li>'
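
To see what the extra `.*?` changes, here is a small offline check. The HTML snippet is made up for the illustration: it assumes that some category pages put an attribute such as a class on the <li> tag, which is the case the changed pattern allows for. It is not copied from the site.

```python
import re

# Hypothetical markup: the second <li> carries a class attribute.
sample = (
    '<li><a href="/tupian/1.html" target="_blank">'
    '<img src="/a.jpg" alt="a"><b>a</b></a></li>'
    '<li class="demo"><a href="/tupian/2.html" target="_blank">'
    '<img src="/b.jpg" alt="b"><b>b</b></a></li>'
)

old_r = '<li><a href="(.*?)" target="_blank"><img.*?><b>.*?</b></a></li>'
new_r = '<li.*?><a href="(.*?)" target="_blank"><img.*?><b>.*?</b></a></li>'

old_links = re.findall(old_r, sample)  # only matches a bare <li>
new_links = re.findall(new_r, sample)  # also matches <li ...attributes...>
print(old_links)  # ['/tupian/1.html']
print(new_links)  # ['/tupian/1.html', '/tupian/2.html']
```

If every item on a page looked like the second one, the old pattern would return an empty list, which would be consistent with the IndexError reported for `del url_name[0]`.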

tfrist · Posted 2022-7-30 00:32

Nice use of regular expressions!

忧郁之子 · Posted 2022-7-30 08:32

Learned something new. Showing my support, thanks for sharing.

zhaoqingdz · Posted 2022-7-30 18:53

Learned a lot! Thanks, OP! I'm taking the code to study.

kkccy · Posted 2022-8-2 08:45

Here to learn.

cyhcuichao · Posted 2022-8-3 22:46

The code doesn't work when I copy it directly?

xxwwcc250 · Posted 2022-8-4 17:46

cyhcuichao posted on 2022-8-3 22:46
The code doesn't work when I copy it directly?

Check what error message it gives you.

zhl00544 · Posted 2022-8-4 17:58

Thanks for sharing!

cyhcuichao · Posted 2022-8-4 18:29

xxwwcc250 posted on 2022-8-4 17:46
Check what error message it gives you.

Traceback (most recent call last):
  File "D:\py_newwork\chao.py", line 61, in <module>
    data(st)
  File "D:\py_newwork\chao.py", line 32, in data
    del url_name[0]
IndexError: list assignment index out of range
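
This IndexError is what `del url_name[0]` raises when re.findall returned an empty list, that is, when the pattern matched nothing on the page. Below is a minimal sketch of a guard. The helper name extract_links is made up (it is not in the posted script), it uses the pattern from pangziyuan's reply, and it works on an HTML string so it can be tried without a network connection.

```python
import re

# Pattern taken from pangziyuan's reply in this thread
URL_R = '<li.*?><a href="(.*?)" target="_blank"><img.*?><b>.*?</b></a></li>'


def extract_links(html):
    """Return the matched links minus the first one, as the posted script does."""
    links = re.findall(URL_R, html)
    if not links:
        # Nothing matched: report it instead of crashing on del links[0]
        print('No links matched; check the URL or the regular expression')
        return []
    return links[1:]
```

Inside data(), the result of extract_links(response) could then stand in for url_name, so a page with no matches is skipped with a message rather than ending the run.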
