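A novel crawler for a certain book site

wanwfy · Posted 2019-7-25 23:31

There are many tools for parsing scraped data: regular expressions, XPath, BeautifulSoup, and so on. I have heard from experts that BeautifulSoup is the slowest of them, which kept me uninterested in it for a long time. Today, though, I suddenly found that BeautifulSoup has its advantages too. An example:

I once watched a video in which a team shared a novel crawler that parsed its data with regular expressions. I happened to be learning regex at the time, so I practiced along with the code.
But content extracted with regex needs many rounds of cleaning before the text is reasonably clean. For BeautifulSoup, all of that is so easy:
get_text() returns fairly clean content directly.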
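To make the comparison concrete, here is a minimal sketch of both routes on the same markup. The HTML snippet and the two helper names (`extract_with_regex`, `extract_with_bs4`) are invented for illustration and are not taken from the video or the real site; only the `mainContenr` class name comes from the code in this post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical chapter markup, invented for this illustration only.
SAMPLE_HTML = (
    '<div class="mainContenr">'
    "&nbsp;&nbsp;First line.<br />"
    "&nbsp;&nbsp;Second line.<br />"
    "</div>"
)


def extract_with_regex(html):
    """Regex route: grab the div body, then clean it step by step."""
    raw = re.search(r'<div class="mainContenr">(.*?)</div>', html, re.S).group(1)
    text = re.sub(r"<br\s*/?>", "\n", raw)  # turn <br /> tags into newlines
    text = text.replace("&nbsp;", "")       # drop non-breaking-space entities
    return text.strip()


def extract_with_bs4(html):
    """BeautifulSoup route: get_text() drops tags and decodes entities."""
    div = BeautifulSoup(html, "html.parser").find("div", class_="mainContenr")
    # separator and strip are optional get_text() arguments: join the text
    # fragments with newlines and trim whitespace around each fragment.
    return div.get_text(separator="\n", strip=True)


print(extract_with_regex(SAMPLE_HTML))
print(extract_with_bs4(SAMPLE_HTML))
```

On real pages the regex route usually needs more cleaning rules than the two shown here, which is where most of the extra effort goes.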

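I am a beginner teaching myself Python, so please don't laugh, experts. If you have any experience to share, I would be grateful for your advice. Thanks.

The BeautifulSoup version of the code is attached. I wrote a regex version earlier too, but I seem to have lost it.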


import requests
from bs4 import BeautifulSoup


class NovelSpider:
    """Novel crawler for a certain book site."""

    def __init__(self):
        self.session = requests.Session()

    def get_novel(self, url):  # main logic
        """Download a whole novel into a text file."""
        # Download the HTML of the novel's index page
        index_html = self.download(url, encoding="gbk")
        # Title of the novel
        soup = BeautifulSoup(index_html, "html.parser")
        article_title = soup.find('a', class_="article_title").get_text()

        # Extract chapter info as (url, title) pairs
        novel_chapter_infos = self.get_chapter_info(index_html)
        # Create the output file: <novel title>.txt
        # (the with-block closes the file even if a download fails)
        with open(f"{article_title}.txt", "w", encoding="utf-8") as fb:
            # Download the chapters one by one
            for chapter_info in novel_chapter_infos:
                # Write the chapter title
                fb.write(f"{chapter_info[1]}\n")
                # Download and write the chapter body
                content = self.get_chapter_content(chapter_info[0])
                fb.write(f"{content}\n")
                print(chapter_info)

    def download(self, url, encoding):
        """Download the HTML source and return it as decoded text."""
        r = self.session.get(url, timeout=10)
        r.raise_for_status()
        r.encoding = encoding
        # r.encoding only affects r.text; r.content would be the raw bytes
        return r.text

    def get_chapter_info(self, index_html):
        """Extract the chapter list as (url, title) pairs."""
        soup = BeautifulSoup(index_html, "html.parser")
        chapter_num = soup.find('div', class_="chapterNum")
        data = []
        for item in chapter_num.find_all("li"):
            link = item.find('a')
            data.append((link["href"], link.get_text()))

        return data

    def get_chapter_content(self, chapter_url):
        """Download the text of one chapter."""
        chapter_html = self.download(chapter_url, encoding="gbk")

        soup = BeautifulSoup(chapter_html, "html.parser")
        # "mainContenr" is the class name used by the site itself
        content = soup.find("div", class_="mainContenr")
        # Remove the leftover inline script call from the text
        content = content.get_text().replace("style5();", '')
        return content


if __name__ == '__main__':
    novel_url = 'http://www.quanshuwang.com/book/9/9055'
    spider = NovelSpider()
    spider.get_novel(novel_url)
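
A side note on the `encoding="gbk"` argument passed to `download`: in requests, `r.encoding` controls how `r.text` is decoded, while `r.content` always holds the undecoded bytes. As far as I know, requests falls back to ISO-8859-1 for text responses whose headers declare no charset, which is why a GBK page can come out garbled unless the encoding is set. The sketch below shows the effect on plain bytes without touching the network; the heading string is a made-up sample.

```python
# A chapter heading as it might appear on a GBK-encoded page (made-up sample).
heading = "第一章 开始"
raw_bytes = heading.encode("gbk")  # comparable to what r.content would hold

# Decoding with the page's real encoding recovers the original text.
decoded_ok = raw_bytes.decode("gbk")

# Decoding the same bytes as ISO-8859-1 does not raise an error,
# it silently produces mojibake, one character per byte.
decoded_bad = raw_bytes.decode("iso-8859-1")

print(decoded_ok)
print(repr(decoded_bad))
```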

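wanwfy · Posted 2019-8-9 12:02

zimengmeng131 posted on 2019-8-9 10:41
Hi OP, I am also teaching myself at the moment and I am stuck on analyzing web pages. Is there an approach you would recommend? Thanks for the guidance.

Read plenty of worked examples and tutorials. There are many ways to parse a web page; the common ones are regular expressions, XPath, and Beautiful Soup. Personally I think Beautiful Soup is probably the easiest to pick up quickly, followed by XPath and then regex. You can learn one method first and move on to the others once you are comfortable with it.
Personally, I recommend learning XPath first.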
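
For a first taste of XPath, here is a minimal sketch that pulls (url, title) pairs out of a chapter list. It rests on several assumptions: the markup is a made-up, well-formed snippet, the helper name `chapter_links` is hypothetical, and it uses the limited XPath subset in the standard library's `xml.etree.ElementTree`. Real pages are rarely well-formed XML, so in practice a third-party parser such as lxml is the usual choice.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed chapter list; real pages are usually messier.
SNIPPET = """
<div class="chapterNum">
  <ul>
    <li><a href="/book/1.html">Chapter 1</a></li>
    <li><a href="/book/2.html">Chapter 2</a></li>
  </ul>
</div>
"""


def chapter_links(markup):
    """Return (href, title) pairs for every <a> directly inside an <li>."""
    root = ET.fromstring(markup)
    # ".//li/a" selects each <a> that is a child of an <li>,
    # at any depth below the root element.
    return [(a.get("href"), a.text) for a in root.findall(".//li/a")]


for href, title in chapter_links(SNIPPET):
    print(href, title)
```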

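LUOLUOPO · Posted 2019-7-25 23:38

How do I use this code? Is there a ready-made program?

3鼠 · Posted 2019-7-25 23:42

What a pro!

manbajie · Posted 2019-7-25 23:47

Looking forward to even better work. Really well written, OP.

瑞安刷哥 · Posted 2019-7-25 23:55

Every day I see code like this and feel nothing but disdain. Hmph, what is this stuff? Anyway, whatever I can't understand must be wrong.

xp2201 · Posted 2019-7-26 00:08

Thanks for sharing, OP, but I just read novels for free on Biquge....

jidesheng6 · Posted 2019-7-26 00:09

BeautifulSoup really is nice and quite convenient. As for crawler frameworks, I can't quite get the hang of them.

zgydsy · Posted 2019-7-26 00:10

6666666666666

Liang丶少 · Posted 2019-7-26 00:23

It's like reading a book with no words.

碳基猴子 · Posted 2019-7-26 00:29

I'm a beginner, so I can't put bare code to use yet.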
