吾爱破解 - 52pojie.cn


[Python Repost] A Python script that scrapes novels from a certain website

lpdswing posted on 2018-1-8 09:00
Last edited by lpdswing on 2018-1-9 18:33

You need Python and lxml installed.
This code only works for the site below, but in principle it can scrape every novel on that site.
If it stops working, feedback is welcome.
Most of the novels are women's-channel fiction, and many paid ones can be read for free there.

The code:
import requests
from lxml import etree
from urllib.parse import urljoin

BASE_URL = 'http://m.18xs.org/'
INDEX_URL = 'http://m.18xs.org/book_2181/all.html'  # replace with the chapter-index URL of the novel you want
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537'
}

r = requests.get(INDEX_URL, headers=HEADERS)
html_tree = etree.HTML(r.text)
url_list = html_tree.xpath('//div[@id="chapterlist"]/p/a/@href')
# urljoin handles both relative and root-relative hrefs without
# producing the double-slash URLs that naive concatenation can create
url_list_new = [urljoin(BASE_URL, href) for href in url_list[1:-1]]
# print(url_list_new)

def get_text(url):
    r = requests.get(url, headers=HEADERS)
    htree = etree.HTML(r.content)
    text = htree.xpath('//div[@id="chaptercontent"]/text()')
    txt = ''
    for line in text:
        # collapse all whitespace inside a paragraph, one paragraph per line
        txt += ''.join(line.split()) + '\n'
    return txt

for chapter_url in url_list_new:
    t = get_text(chapter_url)
    with open('bkhs.txt', 'a+', encoding='utf-8') as fp:
        fp.write(t)

The scraped text is attached below; corrections welcome!


I originally wrote this code just to scrape one novel, but it can scrape the whole site. The code is fairly simple; if you are learning, you can polish it yourself, for example by using the book's title as the output filename. I write code so that I can read it myself, with simplicity first.
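Following up on the "use the book title as the filename" suggestion above: a minimal sketch. The `//h1/text()` selector is a guess at where this site puts the title, and `safe_filename` is a hypothetical helper, not part of the original script; only the set of characters Windows forbids in filenames is standard.

```python
import re

def safe_filename(title, ext='.txt'):
    # drop characters that Windows forbids in filenames, then trim whitespace;
    # fall back to a default name if nothing is left
    cleaned = re.sub(r'[\\/:*?"<>|]', '', title).strip()
    return (cleaned or 'novel') + ext

# the index page is assumed to carry the title in an <h1>; adjust the
# selector to the real page structure, e.g.:
#   title = html_tree.xpath('//h1/text()')[0]
#   out_name = safe_filename(title)
print(safe_filename('a/b:c'))  # -> 'abc.txt'
```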

Attachment: bkhs.txt (172.49 KB, downloaded 42 times, costs 1 CB)



Dirichlets posted on 2018-12-26 13:45
I get this error:
"D:\Program Files\Python\3.7.0\python.exe" G:/PyCharm_Projects/novel_spider/__init__.py
Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 79, in create_connection
    raise err
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 69, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 956, in send
    self.connect()
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 196, in connect
    conn = self._new_conn()
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 180, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000002DF47DEE048>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='m.18xs.org', port=80): Max retries exceeded with url: //book_3506/2353186.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002DF47DEE048>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:/PyCharm_Projects/novel_spider/__init__.py", line 31, in <module>
    t=get_text(i)
  File "G:/PyCharm_Projects/novel_spider/__init__.py", line 21, in get_text
    r = requests.get(url,headers=headers)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 513, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='m.18xs.org', port=80): Max retries exceeded with url: //book_3506/2353186.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002DF47DEE048>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

Process finished with exit code 1
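WinError 10060 here is an ordinary connection timeout: the site is slow, blocked from your network, or gone. Assuming the site is merely flaky, one workaround is to retry with a timeout instead of letting a single failed request kill the whole run. `fetch_with_retry` is a hypothetical helper, not part of the original script:

```python
import time

def fetch_with_retry(get, url, attempts=3, delay=2, timeout=10):
    # `get` is any requests.get-compatible callable; retry on any
    # exception and re-raise only after every attempt has failed
    for i in range(attempts):
        try:
            return get(url, timeout=timeout)
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# usage with the real library:
#   r = fetch_with_retry(requests.get, chapter_url)
```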
chenyriu posted on 2018-1-8 11:15
easygoingtobe posted on 2018-1-8 09:14
finalcrasher posted on 2018-1-8 09:15
Please post the usage steps.
杀阡陌爱花千骨 posted on 2018-1-8 09:43
No idea what's going on, but it looks impressive. Could you explain in more detail??
某些人 posted on 2018-1-8 09:45
I can't figure out how to use it...
Mcoco posted on 2018-1-8 09:47
Isn't this just simulating a request to fetch the page, then extracting the specified elements from it?
shui5462115 posted on 2018-1-8 09:59
Haha, nice work, man!!


dum333 posted on 2018-1-8 10:04
Thanks, OP.
I'm learning right now, and this makes a good reference.
q137100856 posted on 2018-1-8 10:44
Handling the novel text this way may not be very reader-friendly; there's no formatting at all!
For the content you pull out with xpath, wouldn't it look better with a \t before each paragraph and a newline after?
And I have one question:
are the chapters you scrape this way actually in order?
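On both points: xpath returns links in document order, so as long as the index page lists chapters in order, the output is ordered too. To be safe you can sort by the numeric chapter id in the URL, and add the suggested tab indentation. A sketch; the `NNNN.html` URL pattern is an assumption based on the URLs seen in this thread, and both helpers are hypothetical:

```python
import re

def chapter_key(url):
    # sort key: the numeric chapter id at the end of the URL,
    # e.g. .../book_2181/2353186.html -> 2353186
    m = re.search(r'(\d+)\.html$', url)
    return int(m.group(1)) if m else 0

def format_paragraphs(paragraphs):
    # tab-indent each non-empty paragraph, one per line, as suggested above
    return '\n'.join('\t' + ''.join(p.split()) for p in paragraphs if p.strip())

urls = ['http://m.18xs.org/book_2181/10.html',
        'http://m.18xs.org/book_2181/2.html']
print(sorted(urls, key=chapter_key))  # chapter 2 before chapter 10
```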
tiantangtianma posted on 2018-1-8 10:51
Bump bump bump bump bump bump bump bump bump bump bump bump bump