吾爱破解 - 52pojie.cn

[Python Original] Scraping a novel site in seconds with async

jaaks posted on 2023-9-17 14:00
[Python]
from bs4 import BeautifulSoup
import os, re, json, aiohttp, asyncio

url_list = []  # chapter URLs, filled in by get_list()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36"
}
directory = "txt"  # relative path; the txt directory is created under the current working directory
os.makedirs(directory, exist_ok=True)


async def fetch_post(url, headers, data):
    # POST the form data and return the response body as text
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, data=data) as response:
            return await response.text()


async def fetch_get(url, headers):
    # GET the page and return the response body as text
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as response:
            return await response.text()


async def get_list(bookid):  # fetch the chapter list
    data = {"bookId": bookid}
    r = await fetch_post("https://bookapi.zongheng.com/api/chapter/getChapterList", data=data, headers=headers)
    response_data = json.loads(r)
    chapter_list = response_data["result"]["chapterList"]
    for chapter in chapter_list:
        for chapte in chapter["chapterViewList"]:
            chapterId = chapte["chapterId"]
            url_list.append(f"https://read.zongheng.com/chapter/{bookid}/{chapterId}.html")
    return True


async def get_text(url):  # fetch the chapter body
    r = await fetch_get(url, headers=headers)
    soup = BeautifulSoup(r, 'html.parser')
    name = soup.find(class_="title_txtbox").text  # chapter title
    contents = soup.find('div', class_="content")  # chapter body
    # join the paragraphs, each followed by a blank line
    p_text = "".join(p.text + "\n\n" for p in contents.find_all("p"))
    name = re.sub('[?|&]', "", name.strip())  # strip characters unwanted in the file name
    # write the chapter body to a file named after the title
    file_name = os.path.join(directory, name + ".txt")
    await sava_file(file_name, p_text)
    await asyncio.sleep(2)
    print(name)


async def sava_file(name, text):
    # note: this is ordinary blocking file I/O; async def alone does not make it asynchronous
    with open(name, "w", encoding="utf8") as f:
        f.write(text)


async def main():
    loop = asyncio.get_running_loop()  # not used below
    task = [asyncio.ensure_future(get_text(url)) for url in url_list]
    await asyncio.gather(*task)


Chapter = asyncio.run(get_list("1249806"))  # fetch the chapter list
print("Length: " + str(len(url_list)))
print(url_list)
if Chapter:
    asyncio.run(main())


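Multithreaded version (scraping the same novel site): https://www.52pojie.cn/thread-1834722-1-1.html

This is based on the same source code, only rewritten with async so that it scrapes in seconds. I could not find a good way to deal with blocking network requests, so I learned async.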
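
To illustrate why async helps with waiting on the network, here is a minimal sketch that uses only the standard library. `fake_fetch`, `fetch_all`, `timed_run` and the `example.invalid` URLs are hypothetical stand-ins, not part of the script above: `asyncio.sleep` plays the role of a slow request, so nothing is actually downloaded.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a network request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_all(urls):
    # gather() schedules every coroutine at once and returns results in input order.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


def timed_run(urls):
    # Measure the wall-clock time of the whole batch.
    start = time.perf_counter()
    bodies = asyncio.run(fetch_all(urls))
    return bodies, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.invalid/chapter/{i}.html" for i in range(10)]
    bodies, elapsed = timed_run(urls)
    print(len(bodies), f"{elapsed:.2f}s")
```

Ten simulated requests of 0.1 s each would take about a second one after another. Here the waits overlap, so the batch should finish in roughly the time of a single request.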

joy95611 posted on 2023-9-19 09:03
Studying hard. I ran into this problem:
module 'asyncio' has no attribute 'run'
It turns out my Python is 3.6. I followed the fix described in
"Fixing AttributeError: module 'asyncio' has no attribute 'run' in Python" - wzqwer - cnblogs (博客园)
https://www.cnblogs.com/wzbk/p/14119401.html
The modified code is as follows:

..... same as above ...
async def main():
    #loop = asyncio.get_running_loop()
    loop = asyncio.get_event_loop()
    task = [asyncio.ensure_future(get_text(url)) for url in url_list]
    await asyncio.gather(*task)
#Chapter =  asyncio.run(get_list("1249806"))  # fetch the chapter list
loop = asyncio.get_event_loop()
Chapter = loop.run_until_complete(get_list("1249806"))
print("Length: "+str(len(url_list)))
print(url_list)
print(Chapter)
#loop = asyncio.get_event_loop()
if Chapter:
    result = loop.run_until_complete(main())
    #asyncio.run(main())

The data was scraped successfully!
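
The same workaround can be wrapped in a small helper so that one script runs on both Python 3.6 and 3.7+. This is only a sketch: `run_compat` and `demo` are hypothetical names, and unlike the change above it creates a fresh event loop per call (which is roughly what `asyncio.run` does) instead of reusing one loop.

```python
import asyncio


def run_compat(coro):
    """Run a coroutine to completion, with or without asyncio.run (added in Python 3.7)."""
    if hasattr(asyncio, "run"):
        return asyncio.run(coro)
    # Python 3.6 fallback: create, use and close a dedicated event loop.
    loop = asyncio.new_event_loop()
    try:
        asyncio.set_event_loop(loop)
        return loop.run_until_complete(coro)
    finally:
        loop.close()
        asyncio.set_event_loop(None)


async def demo():
    # Placeholder coroutine standing in for get_list(...) or main().
    await asyncio.sleep(0)
    return True


if __name__ == "__main__":
    print(run_compat(demo()))
```

With such a helper, the two entry points would become `Chapter = run_compat(get_list("1249806"))` and `run_compat(main())`.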
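sssguo posted on 2023-9-17 20:46
daraxi posted on 2023-9-17 21:52
吖力锅 posted on 2023-9-17 23:16
I haven't learned async yet. Learning from you.
lookfeiji posted on 2023-9-18 11:06
Async really is handy, but I don't know how to use it yet.
fengxiaoxiao7 posted on 2023-9-18 14:06
Async really is impressive.
黑金刚 posted on 2023-9-18 16:18
[The remote computer refused the network connection.] Is it scraping too fast?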
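
If the refused connections do come from firing every request at once (the thread does not confirm this), one option is to cap the number of requests in flight with `asyncio.Semaphore`. Note that the `asyncio.sleep(2)` in `get_text` runs after each download, so it does not space the requests out. A minimal sketch follows; `bounded_map`, `InFlightCounter` and the limit of 3 are hypothetical choices, and the fake worker only sleeps instead of touching the network.

```python
import asyncio


async def bounded_map(worker, items, limit=5):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Wait for a free slot before starting the call.
        async with semaphore:
            return await worker(item)

    # Results come back in the same order as `items`.
    return await asyncio.gather(*(run_one(item) for item in items))


class InFlightCounter:
    """Hypothetical fake worker that records how many calls overlap."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    async def fake_fetch(self, url):
        self.current += 1
        self.peak = max(self.peak, self.current)
        await asyncio.sleep(0.01)  # simulated network wait
        self.current -= 1
        return url


if __name__ == "__main__":
    counter = InFlightCounter()
    urls = [f"chapter-{i}" for i in range(20)]
    asyncio.run(bounded_map(counter.fake_fetch, urls, limit=3))
    print("peak concurrency:", counter.peak)
```

In the original script this would amount to replacing the body of `main()` with something like `await bounded_map(get_text, url_list, limit=5)`. The right limit for this site is unknown and would need experimenting.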
sinyzh posted on 2023-9-18 17:32
Works well, works well. Thanks, OP.
qinren051 posted on 2023-9-19 11:54
A question: which part do I change to scrape a different novel?
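
Going by the posted code, the only book-specific value appears to be the string passed to `get_list("1249806")`. The same `bookid` is sent to the chapter-list API and used to build each chapter URL. The sketch below isolates that URL-building step as a pure function. `chapter_urls` is a hypothetical name, and `sample` is made-up data that only mimics the response shape the script reads (`result` → `chapterList` → `chapterViewList` → `chapterId`).

```python
def chapter_urls(bookid, response_data):
    """Build chapter page URLs the same way get_list() does, for any book id."""
    urls = []
    for volume in response_data["result"]["chapterList"]:
        for chapter in volume["chapterViewList"]:
            chapter_id = chapter["chapterId"]
            urls.append(f"https://read.zongheng.com/chapter/{bookid}/{chapter_id}.html")
    return urls


# Made-up payload with the same nesting as the fields used in get_list().
sample = {
    "result": {
        "chapterList": [
            {"chapterViewList": [{"chapterId": 111}, {"chapterId": 222}]},
        ]
    }
}

if __name__ == "__main__":
    for url in chapter_urls("1249806", sample):
        print(url)
```

So switching novels should only require passing another book's id on the same site to `get_list`. The thread does not say where to look that id up.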