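Could someone experienced take a look: why does downloading stop working as soon as I add coroutines? It works fine when I download a single page on its own.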

fatlong posted on 2024-1-14 15:35

import os

import requests
import asyncio
from bs4 import BeautifulSoup
import random
import aiohttp
import aiofiles

def ua():
    headers_list = [
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)'
    ]
    header = {'User-Agent': random.choice(headers_list)}
    return header

def main_page(a):
    main_page_url = a
    # Pass the User-Agent as headers; the second positional argument of requests.get is params
    res = requests.get(a, headers=ua())
    soup = BeautifulSoup(res.text, 'lxml')
    div = soup.find_all('div', attrs={'class': 'mulu-list quanji'})
    url_list = []
    for i in div:
        links = i.find_all('a')
        for q in links:
            href = q['href']
            url_list.append(href)
    return url_list

def get_book_name(book_page):
    book_number = book_page.split('/')[-1].split('.')[0]
    book_chapter_name = book_page.split('/')[-2]
    return book_number, book_chapter_name

async def aio_down_one(chapter_url, signal):
    number, c_name = get_book_name(chapter_url)
    for c in range(10):
        try:
            # Limit the number of concurrent coroutines
            async with signal:
                async with aiohttp.ClientSession() as session:
                    async with session.get(chapter_url) as resp:
                        page_source = await resp.text()
                        soup = BeautifulSoup(page_source, 'html.parser')
                        chapter_name = soup.find('h1').text
                        p_content = soup.find('div', attrs={'class': 'neirong'}).find_all('p')
                        p_tags = soup.find('div', attrs={'class': 'neirong'}).find_all('p')
                        p_contents = [p.get_text(strip=True) for p in p_tags]
                        for content in p_contents:
                            if not os.path.exists(f'{bookname}/{c_name}'):
                                os.makedirs(f'{bookname}/{c_name}')
                            async with aiofiles.open(f'{bookname}/{c_name}/{number}_{chapter_name}.txt', mode="w",
                                                     encoding='utf-8') as f:
                                await f.write(content)
                            print(chapter_url, "download complete!")
                            return ""
        except Exception as e:
            print(e)
            print(chapter_url, "download failed! Retrying.")
    return chapter_url

async def aio_down(parse_url_list):
    tasks = []
    semaphore = asyncio.Semaphore(10)
    for i in parse_url_list:
        tasks.append(asyncio.create_task(aio_down_one(i, semaphore)))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    url = 'https://www.51shucheng.net/daomu/guichuideng/'
    bookname = '鬼吹灯'
    # exist_ok=True avoids an error when the directory already exists
    os.makedirs(bookname, exist_ok=True)
    url_list = main_page(url)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(aio_down(url_list))
    loop.close()

侃遍天下无二人 posted on 2024-1-14 16:26

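Could it be a deadlock? Your nesting is far too complicated. Try adding some print output after each `async` statement and see what happens.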
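
A minimal sketch of what such debug output could look like. `fetch_one` and `run_all` are hypothetical stand-ins for `aio_down_one` and `aio_down`, and the failure is simulated; the point is only to record which stage each task reached and to show the exception type, not just its message.

```python
import asyncio
import traceback


async def fetch_one(name, semaphore, log):
    """Hypothetical stand-in for aio_down_one; `log` collects trace messages."""
    log.append(f"{name}: waiting for semaphore")
    async with semaphore:
        log.append(f"{name}: acquired semaphore")
        try:
            if name == "bad":
                # Simulate the kind of failure a bad file name could cause.
                raise OSError("simulated write failure")
            await asyncio.sleep(0)
            log.append(f"{name}: done")
            return True
        except Exception as e:
            # Record the exception type as well as the message,
            # and print the full traceback to stderr.
            log.append(f"{name}: failed with {type(e).__name__}: {e}")
            traceback.print_exc()
            return False


async def run_all(names):
    log = []
    semaphore = asyncio.Semaphore(2)
    results = await asyncio.gather(*(fetch_one(n, semaphore, log) for n in names))
    return results, log


if __name__ == "__main__":
    results, log = asyncio.run(run_all(["a", "bad", "b"]))
    print("\n".join(log))
```

If a task logs "waiting for semaphore" but never "acquired semaphore", that would point toward a real deadlock; a "failed with ..." line points toward an exception being swallowed by the retry loop.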

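fatlong posted on 2024-1-14 16:34

侃遍天下无二人 posted on 2024-1-14 16:26
Could it be a deadlock? Your nesting is far too complicated. Try adding some print output after each `async` statement and see what happens.

OK, thank you.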

sai609 posted on 2024-1-14 18:39

Run the code piece by piece instead of cramming everything into one function.

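FitContent posted on 2024-1-15 21:15

# Brief explanation

The cause is that OP **uses the scraped chapter title directly as the file name, and some chapter titles contain invalid characters**. Creating the file then fails, which triggers a chain of further problems.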

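- "Invalid characters" here means characters that cannot be used when creating a file or directory. On Windows these are `< > : " / \ | ? *` plus the control characters `\x00` to `\x1F`.

- When the code above scrapes the first chapter, the chapter title is `第1章 : 白纸人和鼠友`, which contains the illegal character `:`. Because the colon is reserved in Windows paths, the program ends up handling `第1章` as if it were a drive letter (on NTFS a colon inside a file name can also be read as the start of an alternate data stream). The scary part is that no error is raised, even though no drive named `第1章` exists on the computer!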

The fix is as follows:

```py
# This function was provided by GPT
import re

def sanitize_filename(filename):
    # Replace illegal characters with an underscore
    sanitized_filename = re.sub(r'[<>:"/\\|?*\x00-\x1F]', '_', filename)
    return sanitized_filename

# Then clean the invalid characters out of the chapter title before writing the file
...
chapter_name = sanitize_filename(chapter_name)
async with aiofiles.open(
    f"{bookname}/{c_name}/{number}_{chapter_name}.txt",
    mode="w",
    encoding="utf-8",
) as f:
    ...
```

I only tested the first few chapters. OP should **sanitize every string that is used as a file or directory name and might contain invalid characters**.
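
One way to apply this to every path component is sketched below. `sanitize_component` and `build_safe_path` are hypothetical helper names, and `some_volume` is a placeholder directory; the character class is the same one used in `sanitize_filename`. Stripping trailing dots and spaces is an extra assumption based on Windows naming rules, and reserved device names such as `CON` are not handled here.

```python
import re
from pathlib import Path

# Same character class as in sanitize_filename above.
ILLEGAL_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1F]')


def sanitize_component(name):
    """Replace characters that Windows rejects in a single file or directory name."""
    cleaned = ILLEGAL_CHARS.sub("_", name)
    # Windows does not allow names ending in a dot or a space, so drop them.
    cleaned = cleaned.rstrip(". ")
    # Never return an empty name.
    return cleaned or "_"


def build_safe_path(*components):
    """Sanitize each component separately, then join them into one path."""
    return Path(*(sanitize_component(c) for c in components))


if __name__ == "__main__":
    # "some_volume" is a placeholder for the directory taken from the chapter URL.
    print(build_safe_path("鬼吹灯", "some_volume", "1_第1章 : 白纸人和鼠友.txt"))
```

Sanitizing component by component matters because `/` and `\` are in the illegal set: cleaning an already joined path would also replace the separators.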

---

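# Gripes

OP's code honestly made my head spin. Leaving the other details aside, at least this loop should be improved: you can use `f.writelines` to write multiple lines at once.

The problem above can be reduced to: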

```py
import aiofiles
import asyncio

async def test():
    for c in 'AB':
        # The file is reopened on every iteration, and in write mode at that, scary ＞﹏＜
        async with aiofiles.open('temp.txt', mode='w') as f:
            await f.write(c)

if __name__ == '__main__':
    asyncio.run(test())

# In the end temp.txt contains only the character B!
```
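
A possible restructuring, sketched with the built-in `open` instead of `aiofiles` so that it runs without third-party packages; the same idea (open once, then write everything) should carry over to `aiofiles.open`. The function names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_each_time(path, paragraphs):
    """Problematic pattern: mode 'w' truncates the file on every open."""
    for text in paragraphs:
        with open(path, mode="w", encoding="utf-8") as f:
            f.write(text)


def write_once(path, paragraphs):
    """Open the file a single time and write every paragraph."""
    with open(path, mode="w", encoding="utf-8") as f:
        # writelines does not add line breaks, so append them here.
        f.writelines(text + "\n" for text in paragraphs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "temp.txt"
        write_each_time(target, ["A", "B"])
        print(repr(target.read_text(encoding="utf-8")))  # 'B'
        write_once(target, ["A", "B"])
        print(repr(target.read_text(encoding="utf-8")))  # 'A\nB\n'
```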

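fatlong posted on 2024-1-15 23:12

FitContent posted on 2024-1-15 21:15
# Brief explanation

The cause is that OP **uses the scraped chapter title directly as the file name, and some chapter titles contain invalid characters**, which ...

Thank you so much, I really can't thank you enough.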


吾爱破解 - 52pojie.cn's Archiver
