吾爱破解 - 52pojie.cn


[Python Original] [Crawler] A novel-scraping example

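icer233 posted on 2024-11-16 15:18
I recently read 《万古至尊》 by 太一生水 and quite enjoyed it, so here's a recommendation.
I picked a site more or less at random to download it from.
The crawler source code is below.

If some chapters fail to download because of network problems or the like, just run the script again: it only downloads the chapters that are still missing, so there's no need to worry about wasting time on repeat downloads.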

[Python]
# -*- coding:utf-8 -*-
import requests
from lxml import etree
import os
from multiprocessing.dummy import Pool

def create_path(file_path):
    if not os.path.exists(file_path):
        os.makedirs(file_path)

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}

book_url = 'https://www.bqvvxg8.cc/wenzhang/1/1424/' + 'index.html'
book_detail_content = requests.get(url=book_url, headers=headers)
book_detail_content.encoding = 'gbk'
book_detail_content = book_detail_content.text
book_detail_tree = etree.HTML(book_detail_content)
book_name = book_detail_tree.xpath('//div[@class="book"]/div[@class="info"]/h2/text()')[0]
create_path('./' + book_name)

chapter_dd_list = book_detail_tree.xpath('//div[@class="listmain"]/dl/dd')

def down_chapter(dd):
    chapter_url = 'https://www.bqvvxg8.cc/' + dd.xpath('./a/@href')[0]
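    # Windows file names cannot contain '?', so swap it for the full-width '?'
    # (the replacement as posted was a no-op, presumably garbled by the page).
    chapter_title = dd.xpath('./a/text()')[0].replace('?', '?')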
    chapter_txt_path = './' + book_name +'/' + chapter_title + '.txt'
    if not os.path.exists(chapter_txt_path):
        chapter_content = requests.get(url=chapter_url, headers=headers).text
        chapter_tree = etree.HTML(chapter_content)
        chapter_text = chapter_tree.xpath('//*[@id="content"]/text()')

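        # Save the chapter
        with open(chapter_txt_path, 'a', encoding='UTF-8') as file:
            file.write(chapter_title + '\n')
            # Skip the first text node and the last three
            # (presumably site boilerplate rather than chapter text)
            for i in range(1, len(chapter_text) - 3):
                file.write(chapter_text[i])
        print(chapter_title, "downloaded")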


pool = Pool(20)
pool.map(down_chapter, chapter_dd_list)

pool.close()
pool.join()
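
A note on the "run it again" behaviour: the script decides whether a chapter is missing purely by `os.path.exists(chapter_txt_path)`, so a file left half-written by an interrupted run would be skipped next time. One way to guard against that is to write to a temporary file and rename it only once the write has finished. The following is a minimal sketch, not part of the original script; `save_chapter_atomically` and the `.part` suffix are hypothetical names chosen for illustration.

```python
import os
import tempfile


def save_chapter_atomically(path, title, lines):
    """Write a chapter to a temp file, then rename it into place.

    The final path only ever appears once the whole chapter has been
    written, so an os.path.exists() check can be trusted on a re-run.
    """
    directory = os.path.dirname(path) or '.'
    os.makedirs(directory, exist_ok=True)
    # Create the temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.part')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            f.write(title + '\n')
            for line in lines:
                f.write(line + '\n')
        # os.replace overwrites the destination if it already exists.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Inside `down_chapter`, the `with open(...)` block could then be swapped for a call such as `save_chapter_atomically(chapter_txt_path, chapter_title, chapter_text[1:-3])`, assuming the same slice of text nodes is wanted.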


sunyue2719 posted on 2024-11-22 09:58
icer233 posted on 2024-11-20 20:15
You can use free proxies found online.

[Python]
# -*- coding:utf-8 -*-
import requests
from lxml import etree
import os
from multiprocessing.dummy import Pool
import random

def create_path(file_path):
    if not os.path.exists(file_path):
        os.makedirs(file_path)

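# Proxy list (fetch or update this list as needed).
# requests picks a proxy by URL scheme, and the target site is https,
# so each entry needs an "https" key as well as "http".
proxy_list = [
    {"http": "http://127.0.0.1:7890", "https": "http://127.0.0.1:7890"},
    {"http": "http://127.0.0.1:7891", "https": "http://127.0.0.1:7891"},
]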

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}

book_url = 'https://www.bqvvxg8.cc/wenzhang/1/1424/' + 'index.html'

# Fetch the book's main page using a random proxy
proxy = random.choice(proxy_list)
book_detail_content = requests.get(url=book_url, headers=headers, proxies=proxy)
book_detail_content.encoding = 'gbk'
book_detail_content = book_detail_content.text
book_detail_tree = etree.HTML(book_detail_content)
book_name = book_detail_tree.xpath('//div[@class="book"]/div[@class="info"]/h2/text()')[0]
create_path('./' + book_name)

chapter_dd_list = book_detail_tree.xpath('//div[@class="listmain"]/dl/dd')

def down_chapter(dd):
    chapter_url = 'https://www.bqvvxg8.cc/' + dd.xpath('./a/@href')[0]
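    # Windows file names cannot contain '?', so swap it for the full-width '?'
    # (the replacement as posted was a no-op, presumably garbled by the page).
    chapter_title = dd.xpath('./a/text()')[0].replace('?', '?')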
    chapter_txt_path = './' + book_name + '/' + chapter_title + '.txt'

    if not os.path.exists(chapter_txt_path):
        # Fetch the chapter content using a random proxy
        proxy = random.choice(proxy_list)
        chapter_content = requests.get(url=chapter_url, headers=headers, proxies=proxy).text
        chapter_tree = etree.HTML(chapter_content)
        chapter_text = chapter_tree.xpath('//*[@id="content"]/text()')

        # Save the chapter content
        with open(chapter_txt_path, 'a', encoding='UTF-8') as file:
            file.write(chapter_title + "\n")
            for i in range(1, len(chapter_text) - 3):
                file.write(chapter_text[i] + "\n")
        print(chapter_title, "downloaded")

# Use multithreading to download chapters
pool = Pool(20)
pool.map(down_chapter, chapter_dd_list)

pool.close()
pool.join()
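
Both versions use the chapter title directly as a file name, and `?` is only one of several characters that Windows rejects in file names. A small helper that maps all of them to their full-width look-alikes might look like the sketch below. `safe_filename` is a hypothetical name, and the choice of full-width replacements is an assumption that simply extends the `?` handling in the script.

```python
# Characters that are not allowed in Windows file names.
ILLEGAL_CHARS = '\\/:*?"<>|'

# Full-width forms sit exactly 0xFEE0 code points above their ASCII
# counterparts, e.g. '?' (U+003F) -> U+FF1F.
_TRANSLATION = {ord(c): chr(ord(c) + 0xFEE0) for c in ILLEGAL_CHARS}


def safe_filename(title):
    """Return the title with illegal characters swapped for full-width ones."""
    cleaned = title.translate(_TRANSLATION)
    # Windows also dislikes trailing spaces and dots in file names.
    return cleaned.strip().rstrip('.')
```

In `down_chapter`, the title line could then read `chapter_title = safe_filename(dd.xpath('./a/text()')[0])`.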

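OK, thanks.
Xianhuagan posted on 2024-11-16 15:23
三生沧海踏歌 posted on 2024-11-16 15:27
Yhuo posted on 2024-11-16 16:02
Thanks for sharing, I'll study this.
11Zero posted on 2024-11-16 16:06
Thanks for sharing.
Tick12333 posted on 2024-11-16 16:42
Thanks for sharing.
A11111111 posted on 2024-11-16 16:51
Thanks to the OP for sharing the technique.
s15s posted on 2024-11-16 17:20
Thanks for sharing, expert.
adafsaf posted on 2024-11-16 17:36
I tried it: at a high request rate it throws errors, probably because the server is rate-limiting or could not handle the load. Catching the error, waiting 5 s and retrying fixes it.
zzt5211314 posted on 2024-11-16 17:51
Thanks for sharing.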
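
The wait-and-retry idea from this reply can be sketched roughly as follows. This is an illustration, not adafsaf's actual code: `call_with_retry` and its parameters are hypothetical, and the 5-second default simply mirrors the delay mentioned in the reply.

```python
import time


def call_with_retry(func, retries=3, wait_seconds=5,
                    exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on failure wait and try again, up to `retries` attempts.

    The last failure is re-raised so the caller still sees the error.
    `sleep` is injectable so the delay can be stubbed out in tests.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise
            sleep(wait_seconds)
```

In `down_chapter`, the request could be wrapped as `call_with_retry(lambda: requests.get(url=chapter_url, headers=headers, timeout=10))`. Note that `requests.get` does not raise on HTTP error statuses by itself, so calling `raise_for_status()` inside the lambda's target would be needed for rate-limit responses to trigger a retry. Lowering `Pool(20)` to a smaller number is another way to ease the load on the server.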