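HadesGiant posted on 2022-11-13 23:23

Python: scraping a novel from a static web page. I've parsed the chapter titles and content URLs with re; how do I download the text and save it locally as TXT?

Background:
Picking up from my last post: I'm a complete beginner. I saw lots of tutorials online about scraping web novels with Python, so I decided to try it myself.
First I downloaded and installed Python, then ran into problems importing third-party libraries, among other things. I finally got the code to run, but now I'm stuck again.
The tutorial I followed uses four tools, requests, re, xpath and css, to fetch and parse the novel. On my machine, though, the xpath and css parsing steps throw errors, possibly because something is wrong with the libraries on my computer.
The same code, typed in exactly, just errors out.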



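I digress. Back to the point.
=======Divider==========
1. Goal
Starting from my existing code, how do I continue so that it fetches each chapter's content from the URLs parsed with re and saves it to a local txt file?
2. What I have so far
Python version: python-3.9.0a1-amd64
Chapter list URL: https://www.yushubo.net/list_other_80726.html
The code only goes as far as the re parsing step. I don't know how to continue because the tutorial didn't cover it.
============Divider===============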

import re
import requests

url = 'https://www.yushubo.net/list_other_80726.html'  # chapter list URL
# Send the request
data = requests.get(url=url).text
print(data)
# Parse with re: findall returns a list of (href, title) tuples
list_url = re.findall('<span><a href="(.*?)">(.*?)</a></span>', data)
for href, name in list_url:
    urls = 'https://www.yushubo.net' + href  # chapter URL
    print(urls, name)  # name is the chapter title

# Sample match: <span><a href="/read_105142_1.html">第一章:刚醒来就要当巫?(求收藏)</a></span>
# []----list
# ()----tuple

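===============Divider===========================
The output of the code above looks like this:

Could some expert continue this for a couple of lines? Logically, I now have every chapter's content URL and title; all that's left is extracting the text and saving it.
But I'm a total beginner. I can copy an example, but anything too theoretical goes over my head. You can either write the rest for me or tell me what to do next; I'll do it and report back on progress.
Think of yourself as a remote supervisor. May good people live in peace and prosper!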
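
For what it's worth, `re.findall` with two capture groups returns a list of tuples, one `(href, title)` pair per match, so each item has to be unpacked before it can be joined to the site address. A minimal sketch, using the sample line from the post as input (`BASE` and `sample` are names made up here for illustration):

```python
import re
from urllib.parse import urljoin

BASE = 'https://www.yushubo.net/'  # site root, taken from the post

# One line of the chapter list, copied from the comment in the post
sample = '<span><a href="/read_105142_1.html">第一章:刚醒来就要当巫?(求收藏)</a></span>'

matches = re.findall('<span><a href="(.*?)">(.*?)</a></span>', sample)
# matches is a list ([]) of tuples (()): [(href, title), ...]
for href, title in matches:
    chapter_url = urljoin(BASE, href)  # joins the parts without doubling the slash
    print(chapter_url, title)
```

Using `urljoin` instead of plain `+` is just one option; it keeps the URL correct whether or not the base ends with a slash.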




天真Aro posted on 2022-11-14 00:48

```
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.yushubo.net/list_other_80726.html'  # chapter list URL
# Send the request
data = requests.get(url=url).text
# print(data)
# Parse with re: each match is a (href, title) tuple
list_url = re.findall('<span><a href="(.*?)">(.*?)</a></span>', data)
for href, name in list_url:
    urls = 'https://www.yushubo.net' + href  # chapter page URL
    # print(urls, name)
    soup = BeautifulSoup(requests.get(urls).text, 'lxml')
    content = soup.find('div', class_='article-con')  # div holding the chapter text
    print('Downloading ' + name)
    # The folder E:/little_story must already exist
    with open(f'E:/little_story/{name}.txt', mode='a+', encoding='utf-8') as f:
        f.write(content.get_text())
    print('Finished ' + name)
```
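
Two things may trip this up on Windows: the target folder has to exist, and chapter titles can contain characters such as `?` or `:` that are not allowed in file names. A small sketch of one way to handle both; `safe_filename` and `save_chapter` are hypothetical helpers, not part of the reply above:

```python
import os
import re


def safe_filename(title):
    """Replace characters that Windows forbids in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'


def save_chapter(folder, title, text):
    """Write one chapter to <folder>/<title>.txt and return the path."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(folder, safe_filename(title) + '.txt')
    with open(path, mode='w', encoding='utf-8') as f:
        f.write(text)
    return path
```

Inside the loop this would be called as something like `save_chapter('E:/little_story', name, content.get_text())`. The sketch opens the file with `'w'` rather than `'a+'`, so running the script twice overwrites a chapter instead of appending a second copy.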

调味包 posted on 2022-11-14 08:35

Learned something new.

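HadesGiant posted on 2022-11-14 08:50

Replying to 天真Aro's post of 2022-11-14 00:48:

Awesome, thank you! I read another article that mentioned using i in place of something in the parsing step, but I couldn't tell which part it replaced. Seeing your code, I think I get it now, haha. I'll try it in a bit. So much positive energy!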

cyy571 posted on 2022-11-14 08:53

Nice. Note that some sites require authentication.

lansemeiying posted on 2022-11-14 09:03

I need this too, haha.

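HadesGiant posted on 2022-11-14 09:11

Replying to kytion's post of 2022-11-14 09:00:
"I can't follow this at all! Impressive! I'll learn from you."

Take it slowly. I'm just copying examples too, starting from zero.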

HadesGiant posted on 2022-11-14 09:27

Replying to 天真Aro's post of 2022-11-14 00:48:

I can't install this library on my machine either. Here is the output:
Microsoft Windows [Version 10.0.19044.2251]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Le'novo>C:\Users\Le'novo\PycharmProjects\venv\Scripts\activate.bat

(venv) C:\Users\Le'novo>pip install BeautifulSoup
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting BeautifulSoup
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/40/f2/6c9f2f3e696ee6a1fb0e4d7850617e224ed2b0b1e872110abffeca2a09d4/BeautifulSoup-3.2.2.tar.gz (32 kB)
Preparing metadata (setup.py) ... error
ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Le'"'"'novo\PycharmProjects\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv = "C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\\setup.py"; __file__="C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\\setup.py";f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Le'"'"'novo\AppData\Local\Temp\pip-pip-egg-info-yyivd7j_'
       cwd: C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\
Complete output (6 lines):
Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\setup.py", line 3
      "You're trying to run a very old release of Beautiful Soup under Python 3. This will not work."<>"Please use Beautiful Soup 4, available through the pip package 'beautifulsoup4'."
                                                                                                      ^
SyntaxError: invalid syntax
----------------------------------------
WARNING: Discarding https://pypi.tuna.tsinghua.edu.cn/packages/40/f2/6c9f2f3e696ee6a1fb0e4d7850617e224ed2b0b1e872110abffeca2a09d4/BeautifulSoup-3.2.2.tar.gz#sha256=a04169602bff6e3138b1259dbbf491f5a27f9499dea9a8fbafd48843f9d89970 (from https://pypi.tuna.tsinghua.edu.cn/simple/beautifulsoup/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1e/ee/295988deca1a5a7accd783d0dfe14524867e31abb05b6c0eeceee49c759d/BeautifulSoup-3.2.1.tar.gz (31 kB)
Preparing metadata (setup.py) ... error
ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Le'"'"'novo\PycharmProjects\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv = "C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_6bd99853e19641d98101a8f96dde0420\\setup.py"; __file__="C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_6bd99853e19641d98101a8f96dde0420\\setup.py";f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Le'"'"'novo\AppData\Local\Temp\pip-pip-egg-info-lge_pehj'
       cwd: C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_6bd99853e19641d98101a8f96dde0420\
Complete output (6 lines):
Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_6bd99853e19641d98101a8f96dde0420\setup.py", line 22
      print "Unit tests have failed!"
            ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?
----------------------------------------
WARNING: Discarding https://pypi.tuna.tsinghua.edu.cn/packages/1e/ee/295988deca1a5a7accd783d0dfe14524867e31abb05b6c0eeceee49c759d/BeautifulSoup-3.2.1.tar.gz#sha256=6a8cb4401111e011b579c8c52a51cdab970041cc543814bbd9577a4529fe1cdb (from https://pypi.tuna.tsinghua.edu.cn/simple/beautifulsoup/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/33/fe/15326560884f20d792d3ffc7fe8f639aab88647c9d46509a240d9bfbb6b1/BeautifulSoup-3.2.0.tar.gz (31 kB)
Preparing metadata (setup.py) ... error
ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Le'"'"'novo\PycharmProjects\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv = "C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_c24f484783094be982f165c3233bf1c9\\setup.py"; __file__="C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_c24f484783094be982f165c3233bf1c9\\setup.py";f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Le'"'"'novo\AppData\Local\Temp\pip-pip-egg-info-wjy50yf8'
       cwd: C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_c24f484783094be982f165c3233bf1c9\
Complete output (6 lines):
Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_c24f484783094be982f165c3233bf1c9\setup.py", line 22
      print "Unit tests have failed!"
            ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?
----------------------------------------
WARNING: Discarding https://pypi.tuna.tsinghua.edu.cn/packages/33/fe/15326560884f20d792d3ffc7fe8f639aab88647c9d46509a240d9bfbb6b1/BeautifulSoup-3.2.0.tar.gz#sha256=0dc52d07516c1665c9dd9f0a390a7a054bfb7b147a50b2866fb116b8909dfd37 (from https://pypi.tuna.tsinghua.edu.cn/simple/beautifulsoup/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement BeautifulSoup (from versions: 3.2.0, 3.2.1, 3.2.2)
ERROR: No matching distribution found for BeautifulSoup
WARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.
You should consider upgrading via the 'C:\Users\Le'novo\PycharmProjects\venv\Scripts\python.exe -m pip install --upgrade pip' command.

(venv) C:\Users\Le'novo>
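
The log above already names the cause: the PyPI package called `BeautifulSoup` is the old Python 2 release, and the Python 3 package is `beautifulsoup4` (imported as `bs4`), so the command to try is `pip install beautifulsoup4`. A quick sketch for checking from Python which modules the active venv can see; `is_installed` is a made-up helper name:

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Import names can differ from pip names: beautifulsoup4 is imported as bs4
for pip_name, module_name in [('requests', 'requests'),
                              ('beautifulsoup4', 'bs4'),
                              ('lxml', 'lxml')]:
    if is_installed(module_name):
        status = 'ok'
    else:
        status = f'missing, try: pip install {pip_name}'
    print(f'{module_name}: {status}')
```

Run it with the same interpreter PyCharm uses for the project, otherwise it reports on a different environment.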

HadesGiant posted on 2022-11-14 09:30

Replying to 天真Aro's post of 2022-11-14 00:48:

I installed the bs4 package. The latest error is as follows:
C:\Users\Le'novo\PycharmProjects\venv\Scripts\python.exe "C:\Users\Le'novo\PycharmProjects\pythonProject\小说下载\小说\天真Aro.py"
Traceback (most recent call last):
File "C:\Users\Le'novo\PycharmProjects\pythonProject\小说下载\小说\天真Aro.py", line 3, in <module>
    from bs4 import BeautifulSoup
File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\__init__.py", line 37, in <module>
    from .builder import (
File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\builder\__init__.py", line 627, in <module>
    from . import _lxml
File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\builder\_lxml.py", line 41, in <module>
    class LXMLTreeBuilderForXML(TreeBuilder):
File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\builder\_lxml.py", line 42, in LXMLTreeBuilderForXML
    DEFAULT_PARSER_CLASS = etree.XMLParser
AttributeError: 'function' object has no attribute 'XMLParser'

Process finished with exit code 1
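
The traceback shows `bs4` failing while it sets up its lxml builder (`etree` turns out to be a function rather than the `lxml.etree` module), so the problem seems to lie with the installed `lxml` rather than with the script. That is only a reading of the traceback, not something confirmed in this thread; the Python in use (3.9.0a1) is an alpha build, which may also play a part. Reinstalling `lxml` is one thing to try. If that doesn't help, the chapter text can be pulled out with the standard library alone. A rough sketch, assuming the text sits in `<div class="article-con">` as in 天真Aro's reply; `DivTextExtractor` and `extract_text` are names invented here:

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> carrying a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside the target div
        self.done = False   # True once the target div has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == 'div':
                self.depth += 1          # nested div inside the target
            elif tag in ('br', 'p'):
                self.parts.append('\n')  # keep paragraph breaks
        elif tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            if self.class_name in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text(html, class_name='article-con'):
    """Return the text of the first div with the given class, or ''."""
    parser = DivTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip()
```

In the download loop this would replace the two BeautifulSoup lines with something like `text = extract_text(requests.get(urls).text)`. It returns an empty string when no matching div is found, so it is worth checking the result before writing the file.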

我今天是大佬 posted on 2022-11-14 09:49

Try Anaconda. It saves a lot of environment trouble.