吾爱破解 - 52pojie.cn


[Python Original] Multithreaded scraper for a novel site

jaaks posted on 2023-9-17 05:02
[Python]
from bs4 import BeautifulSoup
import os, requests, re, threading, time, json

url_list = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
}
directory = "txt"  # relative path; creates a txt directory under the current working directory
if not os.path.exists(directory):
    os.makedirs(directory)

def get_list(bookid):  # fetch the chapter list
    data = {"bookId": bookid}
    r = requests.post("https://bookapi.zongheng.com/api/chapter/getChapterList", data=data, headers=headers)
    response_data = json.loads(r.text)
    chapter_list = response_data["result"]["chapterList"]
    for chapter in chapter_list:
        for view in chapter["chapterViewList"]:
            chapter_id = view["chapterId"]
            url_list.append(f"https://read.zongheng.com/chapter/{bookid}/{chapter_id}.html")
    return True

def get_text(urls, lock: threading.Lock):  # fetch chapter bodies
    for url in urls:
        # lock.acquire()  # lock (disabled: it would serialize the requests)
        r = requests.get(url, headers=headers)
        # lock.release()
        soup = BeautifulSoup(r.text, 'html.parser')
        name = soup.find(class_="title_txtbox").text   # chapter title
        contents = soup.find('div', class_="content")  # chapter body
        p_text = ""  # reset per chapter; otherwise every earlier chapter leaks into each later file
        for p in contents.find_all("p"):
            p_text += p.text + "\n\n"
        name = re.sub('[?|&]', "", name.strip())  # strip characters that are invalid in filenames
        # write title and content to disk
        file_name = os.path.join("txt", name + ".txt")
        save_file(file_name, p_text)
        time.sleep(1)
        print(name)

def save_file(name, text):
    with open(name, "w", encoding="utf8") as f:
        f.write(text)

chapters_ok = get_list("1249806")  # fetch chapter URLs
lock = threading.Lock()  # thread lock
print("URL count: " + str(len(url_list)))
if chapters_ok:
    num = int(input("Number of threads: "))
    chunk_size = -(-len(url_list) // num)  # ceiling division, so we get exactly `num` chunks
    # split url_list into one sublist per thread (the original sliced in steps of `num`,
    # which spawned len(url_list)/num threads instead of `num`)
    chunks = [url_list[i:i + chunk_size] for i in range(0, len(url_list), chunk_size)]
    for chunk in chunks:
        threading.Thread(target=get_text, args=(chunk, lock)).start()


One thing I don't quite understand: in my tests, running this through a thread pool pegs the CPU at 100%, but spawning threads with threading directly does not.

Another issue is file ordering: the saved files don't sort into chapter order. If I use a thread lock the order is correct, but the other threads block and it becomes as slow as single-threaded.
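A lock isn't needed to get ordered output: carry each URL's position into the worker and zero-pad it into the filename, so a plain filename sort matches chapter order. A minimal sketch with hypothetical helper names and stand-in data (the real download and parsing are omitted):

```python
import threading

def chapter_filename(index, title):
    # zero-padded index => an ordinary filename sort equals chapter order
    return f"{index:04d} - {title}.txt"

results = {}  # index -> filename; per-key dict writes are safe under the GIL

def worker(chunk):
    for index, url in chunk:
        # stand-in for the real requests.get + BeautifulSoup parsing of `url`
        results[index] = chapter_filename(index, f"chapter from {url}")

# enumerate BEFORE chunking so each URL keeps its global position
indexed = list(enumerate(["u0", "u1", "u2", "u3", "u4"]))
chunks = [indexed[i:i + 2] for i in range(0, len(indexed), 2)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Threads can then finish in any order; only the filenames encode the sequence.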


wkdxz posted on 2023-9-17 09:46
I tried it; on my machine 30 threads was about the fastest. Tune the exact thread count yourself.

[Asm]
★  maxWorkers:30        time: 10.26s


[Python]
# -*- coding: utf-8 -*-

import concurrent.futures
import json
import os
import re
import time

import requests
from bs4 import BeautifulSoup

url_list = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
}
directory = "txt"  # relative path; creates a txt directory under the current working directory
if not os.path.exists(directory):
    os.makedirs(directory)


def time_it(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(f"time: {end_time - start_time:.2f}s")
        return result

    return wrapper


def get_list(bookid):  # fetch the chapter list
    data = {"bookId": bookid}
    r = requests.post(
        "https://bookapi.zongheng.com/api/chapter/getChapterList",
        data=data,
        headers=headers,
    )
    response_data = json.loads(r.text)
    chapter_list = response_data["result"]["chapterList"]
    for chapter in chapter_list:
        for view in chapter["chapterViewList"]:
            chapter_id = view["chapterId"]
            url_list.append(
                f"https://read.zongheng.com/chapter/{bookid}/{chapter_id}.html"
            )

    return True


def get_text(url):
    p_text = ""
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    name = soup.find(class_="title_txtbox").text  # chapter title
    contents = soup.find("div", class_="content")  # chapter body
    for p in contents.find_all("p"):
        p_text += p.text + "\n\n"
    name = re.sub("[?|&]", "", name.strip())  # strip characters invalid in filenames
    file_name = os.path.join("txt", name + ".txt")
    save_file(file_name, p_text)
    # print(name)


def save_file(name, text):
    with open(name, "w", encoding="utf8") as f:
        f.write(text)


@time_it
def main(maxWorkers):
    print(f"★  maxWorkers:{maxWorkers}", end="\t ")
    chapters_ok = get_list("1249806")  # fetch chapter URLs
    # print("URL count: " + str(len(url_list)))
    if chapters_ok:
        with concurrent.futures.ThreadPoolExecutor(maxWorkers) as executor:
            executor.map(get_text, url_list)


if __name__ == "__main__":
    main(30)  # thread count
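One more property of the pool approach worth noting for the ordering question in the original post: `executor.map` yields results in input order regardless of which download finishes first, so chapters can be collected in order in the main thread. A minimal sketch with a stand-in fetch function (no real network calls):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for the real requests.get + BeautifulSoup parsing
    return f"text of {url}"

urls = [f"https://read.zongheng.com/chapter/1249806/{i}.html" for i in range(5)]
with ThreadPoolExecutor(max_workers=3) as executor:
    # map() preserves input order even though workers run concurrently
    texts = list(executor.map(fetch, urls))
```

Writing `texts` out sequentially afterwards gives correctly ordered files without any locking.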
zhangxu0529 posted on 2023-9-20 08:56
[Python]
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
}

If I switch to a proxy, how should this be filled in?
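For reference, `requests` takes proxies through the separate `proxies` argument rather than through `headers`. A minimal sketch, assuming a local HTTP proxy at 127.0.0.1:7890 (the address is hypothetical; the request itself is left commented out):

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
}
# hypothetical proxy endpoint; replace with your own
proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
# r = requests.get("https://read.zongheng.com/...", headers=headers, proxies=proxies)
```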
MyModHeaven posted on 2023-9-17 07:44
When I use threading for concurrency, the CPU does max out. I've never used a thread pool.
MyModHeaven posted on 2023-9-17 07:47
Regarding num = int(input("Number of threads: ")) -- could it be that you're just not using enough threads? If the site doesn't ban IPs, I don't really limit the thread count.
ysjd22 posted on 2023-9-17 08:08
Studying the code.
baliao posted on 2023-9-17 10:31
Thanks for sharing! Learned something.
yswd posted on 2023-9-17 19:47
Thanks for sharing, learned a lot.
KevinF posted on 2023-9-17 21:58
Thanks for sharing.
吖力锅 posted on 2023-9-17 23:15
As soon as I gave it a try, the site stopped letting me download.