Tieba Scraper

火火嘛 · Posted 2023-11-16 09:30

[Python]

import requests
import time
from bs4 import BeautifulSoup

def get_content(url):
    '''
    Parse a Tieba list page and return the thread info as a list of dicts.
    '''

    # List that collects the info for every thread
    comments = []
    # Request the URL; the timeout keeps a stalled connection from hanging the script
    html = requests.get(url, timeout=10)

    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(html.text, 'lxml')

    # Find every <li> tag that carries both the 'j_thread_list' and 'clearfix' classes
    li_tags = soup.select('li.j_thread_list.clearfix')

    # Loop over the <li> tags
    for li in li_tags:
        # Dict that holds the info for one thread
        comment = {}
        try:
            # Extract each field and store it in the dict
            title_tag = li.find('a', class_='j_th_tit')
            comment['title'] = title_tag.text.strip()
            # Strip any leading '/' from href so the joined URL has no double slash
            comment['link'] = 'https://tieba.baidu.com/' + title_tag['href'].lstrip('/')
            comment['name'] = li.find('span', class_='tb_icon_author').text.strip()
            comment['time'] = li.find('span', class_='is_show_create_time').text.strip()
            comment['replyNum'] = li.find('span', class_='threadlist_rep_num').text.strip()
            comments.append(comment)
        except (AttributeError, KeyError, TypeError) as e:
            # find() returned None or a tag lacked an attribute: skip this entry
            print(f'Skipped one entry: {e!r}')

    return comments

def Out2File(comments):
    '''
    Append the scraped data to a local file:
    TTBT.txt in the current directory.
    '''
    with open('TTBT.txt', 'a', encoding='utf-8') as f:
        for comment in comments:
            f.write('Title: {} \t Link: {} \t Author: {} \t Posted: {} \t Replies: {} \n'.format(
                comment['title'], comment['link'], comment['name'], comment['time'], comment['replyNum']))
    print('Finished scraping the current page')

def main(base_url, deep):
    # Build the list of URLs to scrape; pn advances by 50 per page
    url_list = [base_url + '&pn=' + str(50 * i) for i in range(deep)]
    # Scrape each page and write its data to the file
    for url in url_list:
        print(f"Scraping: {url}")
        content = get_content(url)
        print(content)
        Out2File(content)
        # Pause between requests
        time.sleep(5)
    print('All data has been saved!')

base_url = 'https://tieba.baidu.com/f?ie=utf-8&kw=亚运会'
# Number of pages to scrape
deep = 3

if __name__ == '__main__':
    main(base_url, deep)
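
As a side note on how main() builds its page URLs, here is a minimal sketch that does the same thing with urllib.parse.urlencode, so the keyword is percent-encoded explicitly and pn is added as an ordinary query parameter. The helper name build_page_urls and its page_size parameter are made up for illustration; the step of 50 is taken from the script above.

```python
from urllib.parse import urlencode

def build_page_urls(keyword, pages, page_size=50):
    """Return one list-page URL per page; pn is the offset used by main() above."""
    base = "https://tieba.baidu.com/f"
    return [
        base + "?" + urlencode({"ie": "utf-8", "kw": keyword, "pn": page_size * i})
        for i in range(pages)
    ]

# Same keyword and page count as in the post
urls = build_page_urls("亚运会", 3)
for u in urls:
    print(u)
```

The resulting list can be looped over the same way url_list is in main().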

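wkdxz · Posted 2023-11-16 10:13

Thanks for sharing. One small suggestion: the generated TXT file is awkward to process further. If you want to analyze the data, export it to an Excel file or put it in a database. For something more visual, generate an HTML file.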
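
The database option could look roughly like the sketch below, which uses the standard-library sqlite3 module. This is only one possible approach, not something from the original post: the function name save_to_sqlite, the file name tieba.db, and the threads table with its column names are all assumptions. The dict keys match the ones get_content() produces, and the sample rows are made up.

```python
import sqlite3

def save_to_sqlite(comments, db_path="tieba.db"):
    """Insert thread dicts into a `threads` table and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS threads ("
                "link TEXT PRIMARY KEY, title TEXT, author TEXT, "
                "posted TEXT, replies TEXT)"
            )
            # Using link as the primary key means a re-scraped thread
            # replaces its old row instead of being stored twice
            conn.executemany(
                "INSERT OR REPLACE INTO threads (link, title, author, posted, replies) "
                "VALUES (:link, :title, :name, :time, :replyNum)",
                comments,
            )
        return conn.execute("SELECT COUNT(*) FROM threads").fetchone()[0]
    finally:
        conn.close()

# Made-up sample rows; the third repeats the first link on purpose
sample = [
    {"title": "A", "link": "https://tieba.baidu.com/p/1", "name": "u1", "time": "11-16", "replyNum": "3"},
    {"title": "B", "link": "https://tieba.baidu.com/p/2", "name": "u2", "time": "11-16", "replyNum": "0"},
    {"title": "A", "link": "https://tieba.baidu.com/p/1", "name": "u1", "time": "11-16", "replyNum": "4"},
]
count = save_to_sqlite(sample, ":memory:")  # in-memory database for the demo
print(count)
```

Calling save_to_sqlite(content) in place of Out2File(content) would write to tieba.db in the current directory.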

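XuJingDaoZhang · Posted 2023-11-16 12:04

On scraping web data: I have tried using ChatGPT for this, and the code it produced did work. To get results closer to what you actually want, you have to keep asking ChatGPT for revisions.
So for simple requirements, anyone with some programming background can implement them easily.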

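火火嘛 · Posted 2023-11-16 11:04

wkdxz posted on 2023-11-16 10:13
Thanks for sharing. One small suggestion: the generated TXT file is awkward to process further. If you want to analyze the data, export it to an Excel file or put it in a ...

OK, thanks. I'll change that in a later version.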

chensheng · Posted 2023-11-16 11:08

Here to learn.

sai609 · Posted 2023-11-16 13:03

Export to CSV instead; it is much more lightweight than xlsx.
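
A minimal sketch of the CSV route, using the standard-library csv module. The function name write_csv and the sample row are made up for illustration; the field names are the dict keys that get_content() builds in the original post.

```python
import csv
import io

# Same keys as the dicts built by get_content() in the original post
FIELDS = ["title", "link", "name", "time", "replyNum"]

def write_csv(comments, fileobj):
    """Write a header row plus one row per thread dict to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(comments)

# Made-up sample row, only to show the output shape
sample = [{"title": "Hello, world", "link": "https://tieba.baidu.com/p/1",
           "name": "someone", "time": "11-16", "replyNum": "3"}]

buf = io.StringIO()
write_csv(sample, buf)
print(buf.getvalue())

# For a real file, open it with newline="" as the csv module expects:
# with open("TTBT.csv", "w", newline="", encoding="utf-8-sig") as f:
#     write_csv(comments, f)
```

The csv module quotes any title that contains a comma, so the columns stay aligned, and the utf-8-sig encoding helps Excel recognize the file as UTF-8.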

fengyingzong · Posted 2023-11-16 13:04

I don't understand it, but I still want to show my support.

superpeo · Posted 2023-11-16 13:23

Supporting this.

csmhdd · Posted 2023-11-16 15:22

Nice, excellent.

shuipen · Posted 2023-11-16 16:57

Supporting this. Scraping, once you learn it well, can supply more raw data for big data and AI work.
