Using coroutines to crawl 《明朝那些事》

E式丶男孩 posted on 2023-4-21 21:58

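I've just started learning to write crawlers with coroutines. The tutorial I followed scraped this novel site, so I used it for practice.
The code is as follows: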
```python
import asyncio
import os

import aiofiles
import aiohttp
import requests
from pyquery import PyQuery as q


def get_all_chapter():
    """Fetch the index page and collect every book with its chapter links."""
    response = requests.get("https://www.mingchaonaxieshier.com/")
    pq = q(response.content)
    books = ...  # selector missing from the original post; fill in before running
    books_data = []
    for book in books:
        trs = ...  # selector missing from the original post; fill in before running
        book_chapters = []
        # build chapters of every book
        for i in trs:
            for j in i('a').items():
                book_chapters.append({
                    "title": j.text(),
                    "href": j.attr('href')
                })

        books_data.append({
            "book_title": trs.text(),
            "chapters": book_chapters
        })
    return books_data


def build_tasks(books):
    """Wrap one download coroutine per chapter in a future."""
    task_list = []
    for book in books:
        for chapter in book['chapters']:
            task_list.append(asyncio.ensure_future(download_chapter(chapter, book['book_title'])))
    return task_list


async def download_chapter(chapter, book_title):
    """Download one chapter page and save its text under ./<book_title>/."""
    # exist_ok avoids a race when several tasks create the same folder
    os.makedirs(f"./{book_title}/", exist_ok=True)
    print(f'Downloading: {book_title}/{chapter["title"]}')
    async with aiohttp.ClientSession() as session:
        async with session.get(chapter['href']) as response:
            src = await response.text()
            pq = q(src)
            content__text = pq('div.content p').text()
            async with aiofiles.open(f'./{book_title}/{chapter["title"]}.txt', mode='w', encoding='utf-8') as f:
                await f.write(content__text)


def main():
    books = get_all_chapter()
    tasks = build_tasks(books)
    # Run every download task on the event loop until all have finished
    event_loop = asyncio.get_event_loop()
    event_loop.run_until_complete(asyncio.wait(tasks))


if __name__ == '__main__':
    main()
```
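The crawler output looks like this:
[![](https://s1.ax1x.com/2023/04/21/p9Eb0yR.png)](https://imgse.com/i/p9Eb0yR)
When the tutorial was recorded, the chapter pages on this site had a fairly regular structure, but the new version does not, so the scraped output looks pretty bad. Does anyone have a better idea?
[![](https://s1.ax1x.com/2023/04/21/p9EbGwV.png)](https://imgse.com/i/p9EbGwV)
As the screenshot above shows, the chapter text, the comments, and other content all sit together in one place, so the text is hard to extract, and there are no smaller elements to target.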
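I'm on Python `3.10.11`, and the coroutine code behaves quite differently from the tutorial. In summary:
1. I create tasks with `ensure_future`; tasks created with `create_task`, as in the tutorial, would not run. I don't know whether the newer version changed something.
2. I start the tasks with `event_loop.run_until_complete(asyncio.wait(tasks))`. The tutorial used `asyncio.wait(task_list)`, which does not run on my Python: it prints a warning and nothing executes.

If anyone sees this, I'd appreciate some pointers on the questions above. Thanks!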
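
A likely explanation for both points, sketched without any network access: `asyncio.create_task` only works while an event loop is already running, so calling it from a plain function raises `RuntimeError`, and a bare `asyncio.wait(task_list)` only builds a coroutine object that must still be awaited or handed to a loop (otherwise Python warns that it was never awaited). The sketch below assumes a stand-in coroutine `fake_download` (a hypothetical name, not part of the crawler) and uses `asyncio.run`, available since Python 3.7, to start the loop.

```python
import asyncio


async def fake_download(title: str) -> str:
    # Hypothetical stand-in for download_chapter: sleeps instead of fetching
    await asyncio.sleep(0.01)
    return f"{title}: done"


def try_create_task_without_loop() -> str:
    """Call asyncio.create_task from synchronous code, where no loop is running."""
    coro = fake_download("chapter 0")
    try:
        asyncio.create_task(coro)
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that was never scheduled
        return str(exc)
    return "no error"


async def download_all(titles):
    # Inside a coroutine the loop is running, so create_task works here
    tasks = [asyncio.create_task(fake_download(t)) for t in titles]
    return await asyncio.gather(*tasks)


error_message = try_create_task_without_loop()
results = asyncio.run(download_all(["chapter 1", "chapter 2"]))
print(error_message)
print(results)
```

With this layout the task creation moves into an `async` function, and `asyncio.run` replaces the `get_event_loop()` / `run_until_complete()` pair.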

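话痨司机啊 posted on 2023-4-22 01:10

Last edited by 话痨司机啊 on 2023-4-22 01:24

    def build_tasks(books, event_loop):
        task_list = []
        for book in books:
            for chapter in book['chapters']:
                task_list.append(event_loop.create_task(download_chapter(chapter, book['book_title'])))
        return task_list

`asyncio.create_task` needs an event loop that is already running in the current thread; called from ordinary synchronous code it raises `RuntimeError`. Getting the event loop first and calling its `create_task` method, as above, avoids that.

I'd also suggest spelling out the arguments, `asyncio.wait(aws, timeout=5, return_when=asyncio.ALL_COMPLETED)`, or using `asyncio.wait_for(aw, timeout)` when there is a single awaitable.
Note that `aws` must be an iterable of Task objects.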
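
A minimal sketch of the two calls mentioned above, assuming a stand-in coroutine `fake_download` (hypothetical) that sleeps instead of fetching a page. `asyncio.wait` returns a `(done, pending)` pair and leaves unfinished tasks running when the timeout expires, while `asyncio.wait_for` wraps a single awaitable and raises `asyncio.TimeoutError` after cancelling it.

```python
import asyncio


async def fake_download(title: str, delay: float) -> str:
    # Hypothetical stand-in for a chapter download
    await asyncio.sleep(delay)
    return title


async def main():
    tasks = [
        asyncio.create_task(fake_download("fast", 0.01)),
        asyncio.create_task(fake_download("slow", 5)),
    ]
    # Wait for every task, but give up waiting after 0.2 seconds
    done, pending = await asyncio.wait(
        tasks, timeout=0.2, return_when=asyncio.ALL_COMPLETED
    )
    finished = sorted(task.result() for task in done)

    # asyncio.wait does not cancel leftovers; do it explicitly
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    # wait_for takes ONE awaitable and raises on timeout
    try:
        await asyncio.wait_for(fake_download("slow", 5), timeout=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True

    return finished, len(pending), timed_out


finished, pending_count, timed_out = asyncio.run(main())
print(finished, pending_count, timed_out)
```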

IT大小白 posted on 2023-4-22 11:28

Last edited by IT大小白 on 2023-4-22 11:31

To strip out the comment-section content, replace this:

    content__text = pq('div.content p').text()
    async with aiofiles.open(f'./{book_title}/{chapter["title"]}.txt', mode='w', encoding='utf-8') as f:
        await f.write(content__text)

with this:

    content__text = pq('div.content > p')
    txt = ""
    for i in content__text:
        txt += q(i).text() + "\n\n"

    async with aiofiles.open(f'./{book_title}/{chapter["title"]}.txt', mode='w', encoding='utf-8') as f:
        await f.write(txt)
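
The change above relies on the difference between a descendant selector (`div.content p`, every `<p>` at any depth) and a child selector (`div.content > p`, only `<p>` elements directly under the div). Here is a rough sketch of that difference using the standard library's `xml.etree.ElementTree` rather than pyquery, so it runs without extra packages. The markup, including the nested `comments` div, is made up for illustration and is only an assumption about how the real pages are laid out.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; the real site's markup may differ
SAMPLE = (
    '<html><body><div class="content">'
    "<p>First paragraph.</p>"
    "<p>Second paragraph.</p>"
    '<div class="comments"><p>A reader comment.</p></div>'
    "</div></body></html>"
)

root = ET.fromstring(SAMPLE)
content = root.find(".//div[@class='content']")

# Like 'div.content p': every <p> below the div, at any depth
all_paragraphs = [p.text for p in content.iter("p")]

# Like 'div.content > p': only <p> elements that are direct children
direct_paragraphs = [p.text for p in content.findall("p")]

# Join the way the reply does: a blank line after each paragraph
txt = "".join(text + "\n\n" for text in direct_paragraphs)

print(all_paragraphs)
print(direct_paragraphs)
```

If the comments on the real pages are direct children of `div.content` as well, the child selector alone would not remove them and a more specific selector would be needed.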

szwangbin001 posted on 2023-4-21 22:18

Would using coroutines make it faster?
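
For I/O-bound work such as waiting on many page downloads, usually yes, because the waits overlap instead of adding up. A small sketch of the effect, assuming `asyncio.sleep` as a stand-in for network latency (`fake_download` is a hypothetical name; real speedups depend on the site and the network):

```python
import asyncio
import time


async def fake_download(delay: float) -> None:
    # Hypothetical stand-in for one network request
    await asyncio.sleep(delay)


async def sequential(n: int, delay: float) -> float:
    """Await the downloads one after another and return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        await fake_download(delay)
    return time.perf_counter() - start


async def concurrent(n: int, delay: float) -> float:
    """Run the downloads concurrently and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_download(delay) for _ in range(n)))
    return time.perf_counter() - start


sequential_seconds = asyncio.run(sequential(5, 0.05))
concurrent_seconds = asyncio.run(concurrent(5, 0.05))
print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```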

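30345 posted on 2023-4-21 22:32

Thanks for sharing

zjk414 posted on 2023-4-21 23:05

Thanks for sharing

keber posted on 2023-4-21 23:07

Thanks for sharing!

bigdawn posted on 2023-4-21 23:14

Thanks for sharing!

netpeng posted on 2023-4-21 23:38

Easy to use, thanks for sharing.

mianhuan posted on 2023-4-22 00:09

Impressive, impressive, impressive :lol

lsy832 posted on 2023-4-22 00:46

当年明月 writes really well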


吾爱破解 - 52pojie.cn's Archiver
