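Wenku8 (文库吧) novel scraper: download and package as EPUB

飞龙使者 posted on 2023-2-12 10:47

Last edited by 飞龙使者 on 2023-2-17 18:35

This uses the whole-book download endpoint, so no login is required (you do need to know the novel's ID), and it splits the text into chapters with regular expressions.

Open-sourced on GitHub: https://github.com/apachecn/Book ... dTool/lightnovel.py

Key code: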

```py
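import re
from os import path

from pyquery import PyQuery as pq

# request_retry, fname_escape, safe_mkdir, gen_epub and default_hdrs are
# helpers defined elsewhere in the project and are not shown in this post.

# Chapter boundary marker. Assumed value: the tag literals were stripped
# when this code was posted, so the marker and the <p> tags are reconstructed.
SPLIT = '<!--split-->'

def format_text(text):
    # Collapse runs of line breaks into a single one
    text = re.sub(r'(\r\n)+', '\r\n', text)
    # Drop the first two lines
    text = re.sub(r'^.+?\r\n.+?\r\n', '', text)
    # Drop the last two lines
    text = re.sub(r'\r\n.+?\r\n.+?$', '', text)
    # Mark up titles and paragraphs: lines starting with a space are
    # paragraphs, every other line is treated as a chapter title
    def rep_func(m):
        s = m.group(1)
        return '<p>' + s + '</p>' \
            if s.startswith(' ') else \
            SPLIT + '<h1>' + s + '</h1>'
    # \r? keeps the carriage return out of the captured line
    text = re.sub(r'^(.+?)\r?$', rep_func, text, flags=re.M)
    # Split into chapters and drop empty ones
    chs = filter(None, text.split(SPLIT))
    # Split each chapter into its title and its content
    map_func = lambda x: {
        'title': re.search(r'<h1>(.+?)</h1>', x).group(1),
        'content': re.sub(r'<h1>.+?</h1>', '', x),
    }
    return list(map(map_func, chs))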

def get_info(html):
    # Scrape the update date, download link, title and author from the book page
    root = pq(html)
    dt = root('#content > div:nth-child(1) > table:nth-child(1) tr:nth-child(2) > td:nth-child(4)').text().replace('-', '') or 'UNKNOWN'
    url = root('#content > div:nth-child(1) > div:nth-child(6) > div > span:nth-child(1) > fieldset > div > a').attr('href')
    title = root('#content > div:nth-child(1) > table:nth-child(1) tr:nth-child(1) > td > table tr > td:nth-child(1) > span > b').text()
    author = root('#content > div:nth-child(1) > table:nth-child(1) tr:nth-child(2) > td:nth-child(2)').text()
    # Title and author end up in the output file name, so escape them
    return {'dt': dt, 'url': url, 'title': fname_escape(title), 'author': fname_escape(author)}

def download_ln(args):
    id = args.id
    save_path = args.save_path
    headers = default_hdrs.copy()
    headers['Cookie'] = args.cookie

    # The book page is GBK-encoded
    url = f'https://www.wenku8.net/book/{id}.htm'
    html = request_retry('GET', url, headers=headers).content.decode('gbk')
    info = get_info(html)
    print(info['title'], info['author'], info['dt'])

    # Skip books that have already been downloaded
    ofname = f"{save_path}/{info['title']} - {info['author']} - {info['dt']}.epub"
    if path.exists(ofname):
        print('Already exists')
        return
    safe_mkdir(save_path)

    # The first article is a cover page: book title plus "作者：" (author) line
    articles = [{
        'title': info['title'],
        'content': f"作者：{info['author']}",
    }]
    # Whole-book download endpoint, UTF-8 variant
    url = f'http://dl.wenku8.com/down.php?type=utf8&id={id}'
    text = request_retry('GET', url, headers=headers).content.decode('utf-8')
    chs = format_text(text)
    articles += chs
    gen_epub(articles, {}, None, ofname)
```
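To see the chapter-splitting idea in isolation, here is a minimal standalone sketch. It is not the project's code: `split_chapters`, the `SPLIT` marker value and the sample text are all made up for illustration, and it assumes, as the key code does, that body paragraphs start with a space while any other line is a chapter title.

```python
import re

# Hypothetical chapter marker; any string that cannot occur in the text works.
SPLIT = '<!--split-->'

def split_chapters(text):
    """Turn plain text into a list of {'title', 'content'} dicts."""
    def tag_line(m):
        line = m.group(1)
        # Indented lines are paragraphs; anything else starts a new chapter.
        if line.startswith(' '):
            return '<p>' + line.strip() + '</p>'
        return SPLIT + '<h1>' + line.strip() + '</h1>'

    # Tag every non-empty line; \r? swallows the carriage return of CRLF text.
    tagged = re.sub(r'^([^\r\n]+)\r?$', tag_line, text, flags=re.M)

    chapters = []
    for chunk in filter(None, tagged.split(SPLIT)):
        m = re.search(r'<h1>(.+?)</h1>', chunk)
        if m is None:
            # Text before the first title has no heading; skip it.
            continue
        chapters.append({
            'title': m.group(1),
            'content': chunk[m.end():].strip(),
        })
    return chapters

# Made-up sample in the same shape as the downloaded text.
sample = (
    'Chapter 1\r\n'
    '  First paragraph.\r\n'
    '  Second paragraph.\r\n'
    'Chapter 2\r\n'
    '  Third paragraph.\r\n'
)

for ch in split_chapters(sample):
    print(ch['title'], '->', ch['content'])
```

Putting the marker in front of each title lets one `split` call cut the text into chapters, with every chunk still carrying its own `<h1>` heading for the title lookup.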

It is published on PyPI, so it can be installed and run in one step:

```sh
pip install BookerDownloadTool
dl-tool ln <id>
```
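The `download_ln` function in the key code only reads three attributes from its `args` object: `id`, `save_path` and `cookie`. As a rough sketch of how such an object could be built with `argparse`: the program name `ln-demo`, the option names and the defaults below are hypothetical and are not the real `dl-tool` interface.

```python
import argparse

def build_parser():
    # 'ln-demo' and the option names are hypothetical, not the real dl-tool CLI.
    parser = argparse.ArgumentParser(prog='ln-demo')
    parser.add_argument('id', help='novel ID, as used in the book page URL')
    parser.add_argument('-s', '--save-path', default='out',
                        help='directory the EPUB is written to')
    parser.add_argument('-c', '--cookie', default='',
                        help='value for the Cookie request header, if any')
    return parser

def main(argv=None):
    # argparse stores --save-path under the attribute name save_path.
    args = build_parser().parse_args(argv)
    # A real entry point would now hand the namespace over: download_ln(args)
    return args

args = main(['1234', '--save-path', 'books'])
print(args.id, args.save_path, repr(args.cookie))
```

Passing the argument list explicitly, as in the last lines, makes the parser easy to try out from a Python session without touching the command line.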

Note: the Wenku8 home page has been hidden, so to log in you have to enter the path /login.php by hand.

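charleschai posted on 2023-2-12 12:39

Hasn't wenku8 shut down?

Wapj_Wolf posted on 2023-2-12 11:04

Thanks for sharing. Studying and analyzing it now.

zjh889 posted on 2023-2-12 11:30

Great stuff, thank you. Much appreciated, OP!

likezqc posted on 2023-2-12 12:12

Great stuff, thanks for sharing.

TaoaoLXXXL posted on 2023-2-12 12:59

Thanks for sharing, expert.

yihailanxin posted on 2023-2-12 14:20

How do I use this?

haiyangnanzi posted on 2023-2-12 14:26

Where can I download it? Is there a ready-made build? Asking as a beginner.

infozzz posted on 2023-2-12 14:50

Learning 💪

dizzy0001 posted on 2023-2-12 22:10

Where is the entry point? I don't see a main.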


吾爱破解 - 52pojie.cn's Archiver
