Notes on building a web crawler, updated over time (1)

TZ糖纸 · posted on 2023-6-25 17:39

Last edited by TZ糖纸 on 2023-7-10 14:57

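Decide which site and which pages to crawl;
Send a request with a request library and fetch the page's HTML;
Parse the HTML and extract the target data;
Store the target data.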

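[Python]

import requests
from bs4 import BeautifulSoup

# Send the request and fetch the HTML
url = 'https://www.example.com'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text

# Parse the HTML and extract the target data
soup = BeautifulSoup(html, 'html.parser')
# soup.title (or its string) can be None, so fall back to an empty string
title = soup.title.string if soup.title and soup.title.string else ''
# href=True skips <a> tags without an href, which would otherwise yield None
links = [link['href'] for link in soup.find_all('a', href=True)]

# Store the target data
with open('example.txt', 'w', encoding='utf-8') as f:
    f.write(title + '\n')
    for link in links:
        f.write(link + '\n')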
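
For the storage step, a plain text file is enough for a single column of links, but a CSV file keeps the fields of each record separate once more than one value per link is collected. The following is a minimal sketch using only the standard library; the helper name `save_links_csv` and the column names are hypothetical and not part of the original post.

```python
import csv
import io


def save_links_csv(rows, fileobj):
    """Write (text, href) pairs to an open text file object as CSV."""
    writer = csv.writer(fileobj)
    writer.writerow(['text', 'href'])  # header row (hypothetical column names)
    for text, href in rows:
        writer.writerow([text, href])


# Demonstration with an in-memory buffer; a real run would pass
# open('links.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
save_links_csv(
    [('Home', 'https://www.example.com/'),
     ('Docs, v2', 'https://www.example.com/docs')],
    buffer,
)
print(buffer.getvalue())
```

The csv module quotes fields that contain commas, so link text such as "Docs, v2" stays in a single column.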

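The above is a generic example. Next, I will build a simple crawler against 52pojie.

The following example uses Python's Requests and BeautifulSoup libraries to scrape the thread list of the "安全工具" (Security Tools) board on 52pojie: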

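[Python]

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://www.52pojie.cn/forum-24-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
}
response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    # Selectors follow the page structure described in this post; adjust them if the site changes
    container = soup.find('div', class_='bm_c')
    items = container.find_all('table', class_='datatable') if container else []

    for item in items:
        title_link = item.find('a', class_='s xst')
        author_tag = item.find('cite')
        date_tag = item.find('em')
        cells = item.find_all('td')
        # Skip entries that do not have the expected layout instead of crashing on None
        if not (title_link and author_tag and date_tag and len(cells) > 4):
            continue
        # Use a separate name so the list-page url is not overwritten; resolve relative links
        thread_url = urljoin(url, title_link.get('href', ''))
        print('Title:', title_link.get_text(strip=True))
        print('URL:', thread_url)
        print('Author:', author_tag.get_text(strip=True))
        print('Posted:', date_tag.get_text(strip=True))
        print('Replies:', cells[3].get_text(strip=True))
        print('Views:', cells[4].get_text(strip=True))
        print('-' * 30)
else:
    print('Fetch failed, status code:', response.status_code)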

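Code walkthrough: the script first sends a request that imitates a browser and fetches the page source, and continues only if the response status code is 200. It then parses the source with BeautifulSoup and locates the tags that hold the thread list according to 52pojie's HTML structure, here the table tags with class "datatable" inside the div with class "bm_c". For each table it extracts the title, URL, author, post time, reply count, and view count, and prints them. Note that the site has anti-crawler measures, so the code sends a browser-like User-Agent header to avoid being rejected as an obvious script; the header alone does not rule out an IP ban if requests come too frequently.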
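
Since anti-crawler measures come up here, the following is a hedged sketch of two habits that complement the User-Agent header: walking the list pages one at a time and pausing between requests. It assumes the last number in `forum-24-1.html` is the page number, which is a guess from the URL and not confirmed by the post; `board_page_urls` and `Throttle` are hypothetical helpers.

```python
import time


def board_page_urls(board_id, first_page, last_page):
    """Build list-page URLs, assuming the pattern forum-<board>-<page>.html."""
    base = 'https://www.52pojie.cn/forum-{}-{}.html'
    return [base.format(board_id, page) for page in range(first_page, last_page + 1)]


class Throttle:
    """Sleep as needed so calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval):
        self.interval = interval
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(interval=0.2)  # use a few seconds against a real site
for page_url in board_page_urls(24, 1, 3):
    throttle.wait()
    # A real crawler would call requests.get(page_url, headers=headers, timeout=10) here
    print('would fetch:', page_url)
```

The demonstration only prints the URLs it would request, so it can be run without touching the site.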

The following example uses Python's Requests and BeautifulSoup libraries to scrape the search results page for the keyword "美女" on Kuaishou:

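[Python]

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'https://www.kuaishou.com/search/video'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'
}
params = {
    'searchKey': '美女'
}
response = requests.get(url, headers=headers, params=params, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.find_all('div', class_='card-media')
    if not items:
        # Nothing matched: the markup may differ, or the cards may be rendered by JavaScript
        print('No result cards found in the returned HTML')

    for item in items:
        title_tag = item.find('div', class_='title')
        cover_link = item.find('a', class_='cover')
        count_tag = item.find('div', class_='count')
        # Skip cards that do not have the expected layout instead of crashing on None
        if not (title_tag and cover_link and count_tag):
            continue
        # Use a separate name so the search url is not overwritten; resolve relative links
        video_url = urljoin('https://www.kuaishou.com', cover_link.get('href', ''))
        print('Title:', title_tag.get_text(strip=True))
        print('URL:', video_url)
        print('Plays:', count_tag.get_text(strip=True))
        print('-' * 30)
else:
    print('Fetch failed, status code:', response.status_code)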

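Code walkthrough: the script first defines the URL and request headers, passes the search keyword through the params argument so that it is appended to the URL as a query string, and sends a browser-like request. It then parses the source with BeautifulSoup and locates the tags for the search results according to Kuaishou's HTML structure, here the div tags with class "card-media". For each div it extracts the title, URL, and play count, and prints them. This approach only sees elements present in the HTML the server returns; anything the page renders later with JavaScript will not be in response.content.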
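
To make the `params` step concrete, the sketch below builds the query string by hand with the standard library. Requests performs an equivalent encoding internally when given `params`; the `build_search_url` helper is hypothetical and only illustrates what the resulting URL looks like.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append URL-encoded query parameters to base_url."""
    return base_url + '?' + urlencode(params)


search_url = build_search_url(
    'https://www.kuaishou.com/search/video',
    {'searchKey': '美女'},
)
print(search_url)
# https://www.kuaishou.com/search/video?searchKey=%E7%BE%8E%E5%A5%B3
```

The keyword is percent-encoded as UTF-8 bytes. After a real request, `response.url` shows the final URL that Requests actually sent, which is a quick way to check the parameters.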

Crawl an image site, create a folder for each title, and download the images into the corresponding folder:

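[Python]

import os
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Target page
target_url = 'https://www.umei.cc/touxiangtupian/QQtouxiang/'
response = requests.get(target_url, timeout=10)
response.raise_for_status()
# Pass bytes so BeautifulSoup can detect the page encoding itself
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.select('.item.masonry_brick')
base_dir = os.getcwd()  # remember where to return after each item

# Visit each item and collect all of its image links
for item in items:
    # Extract the link and the title
    a_tag = item.select_one('.img a')
    if a_tag is None or a_tag.img is None:
        continue  # skip items without the expected markup
    link = urljoin(target_url, a_tag.get('href', ''))
    title = a_tag.img.get('alt', '')
    print('Processing:', title)

    # Create the folder; replace characters that are not valid in a folder name
    dir_name = re.sub(r'[\\/:*?"<>|]', '_', title).strip(' .') or 'untitled'
    os.makedirs(dir_name, exist_ok=True)
    os.chdir(dir_name)

    try:
        # Open the item page, extract every image link, and download each image
        res2 = requests.get(link, timeout=10)
        soup2 = BeautifulSoup(res2.content, 'html.parser')
        imgs = soup2.select('.tsmaincont-main-cont img')
        for i, img in enumerate(imgs):
            # Resolve the URL before the try block so it is always defined in the error message
            img_url = urljoin(link, img.get('src', ''))
            try:
                res3 = requests.get(img_url, timeout=10)
                res3.raise_for_status()
                with open(f'{i}.jpg', 'wb') as f:
                    f.write(res3.content)
            except Exception as e:
                print('Download failed:', img_url, e)
    except requests.RequestException as e:
        print('Could not open item page:', link, e)
    finally:
        # Done with this item: go back to the parent directory
        os.chdir(base_dir)
    print('Finished:', title)

print('All done!')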

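The code first visits the target page and collects every item (its title and link), then creates a folder for each item and enters it. Next it visits each link, extracts all image links from that page, and downloads the images into the corresponding folder. If an error occurs during a download (for example, a connection error), the code prints the failure and continues with the next image. After each item is finished, the program returns to the parent directory and moves on to the next item. Finally it prints a completion message to show that the run has ended.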
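
When titles become folder names, characters such as `/` or `:` can break folder creation or put files in the wrong place. The following is a minimal sketch of one way to guard against that, assuming it is acceptable to replace such characters with underscores. It also builds full file paths, which is one alternative to changing the working directory for every item. `safe_dir_name` and `image_path` are hypothetical helpers.

```python
import os
import re


def safe_dir_name(title, fallback='untitled'):
    """Replace characters that are not allowed in common file systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip(' .')
    return cleaned or fallback  # an empty result falls back to a fixed name


def image_path(base_dir, title, index, ext='.jpg'):
    """Build the target path for the index-th image of an item."""
    return os.path.join(base_dir, safe_dir_name(title), f'{index}{ext}')


print(safe_dir_name('a/b:c'))
print(image_path('downloads', 'a/b', 0))
```

With paths built this way, the download loop can call `os.makedirs(os.path.dirname(path), exist_ok=True)` and open the file at `path` directly, so a failure in the middle of an item cannot leave the script in the wrong directory.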

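studentguo · posted on 2023-6-25 19:10

Learning...

sk8820 · posted on 2023-6-25 19:48

Following along to learn. Nice post.

earlc · posted on 2023-6-25 20:29

Learning from this. Keep the updates coming.

鹿鸣 · posted on 2023-6-25 20:30

Learned a lot, really nice.

moruye · posted on 2023-6-25 21:55

Notice: the author has been banned or deleted; the content is hidden automatically.

naisitu · posted on 2023-6-25 22:55

Studying hard!
A while ago I ran into a site that requires an account login before the installer can be downloaded, so this is a good chance to learn. We will see whether I can put it to use. Thanks for sharing!

xinxiu · posted on 2023-6-25 23:55

Nice tutorial, though BeautifulSoup is a bit dated.

xiaogao2677 · posted on 2023-6-26 09:43

Learning, keeping up with the progress.

chnman8 · posted on 2023-6-28 18:38

Following along to learn. Thanks for sharing~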
