A Python script for scraping novels from a certain site
You need Python and lxml installed.
This code only works on the site below, though in principle it can scrape every novel on that site.
If it stops working, feedback is welcome.
Most of the novels there are aimed at female readers, and many paid titles can be read for free.
The code is as follows:
import requests
from lxml import etree
from urllib.parse import urljoin

# Replace this with the table-of-contents URL of the novel you want to scrape.
url = 'http://m.18xs.org/book_2181/all.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537'
}

# Fetch the table of contents and collect every chapter link.
r = requests.get(url, headers=headers)
html_tree = etree.HTML(r.text)
url_list = html_tree.xpath('//div[@id="chapterlist"]/p/a/@href')
url_list_new = []
for i in url_list:
    # urljoin avoids the double slash that plain string concatenation
    # produces when the href already starts with '/'.
    url_list_new.append(urljoin('http://m.18xs.org/', i))
# print(url_list_new)

def get_text(url):
    # Fetch one chapter page and return its text, one paragraph per line.
    r = requests.get(url, headers=headers)
    htree = etree.HTML(r.content)
    text = htree.xpath('//div[@id="chaptercontent"]/text()')
    txt = ''
    for i in text:
        i = ''.join(i.split())  # drop all embedded whitespace
        txt = txt + i + '\n'
    return txt

# Append every chapter, in table-of-contents order, to one file.
for i in url_list_new:
    t = get_text(i)
    with open('bkhs.txt', 'a+', encoding='utf-8') as fp:
        fp.write(t)
Attached is the scraped text; corrections are welcome!
When I first wrote this code the goal was only to scrape a single novel, but it turns out the whole site can be scraped. The code is fairly simple, so anyone who is learning can improve it themselves, for example by using the book's title for the output filename (see the sketch below). I write code so that I can understand it myself, keeping it simple.
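Picking up that suggestion, here is a minimal sketch that names the output file after the book instead of hardcoding 'bkhs.txt'. It assumes the table-of-contents page carries the book name in its <title> tag; the exact element, and any cleanup the real title needs, are assumptions about the site's markup, not something from the original post:

    # Hypothetical tweak: derive the output filename from the page title.
    # '//title/text()' is an assumed location for the book name.
    title_nodes = html_tree.xpath('//title/text()')
    book_name = title_nodes[0].strip() if title_nodes else 'novel'
    with open(book_name + '.txt', 'a+', encoding='utf-8') as fp:
        for chapter_url in url_list_new:
            fp.write(get_text(chapter_url))

Opening the file once outside the loop also saves reopening it for every chapter.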
"D:\Program Files\Python\3.7.0\python.exe" G:/PyCharm_Projects/novel_spider/__init__.py
Traceback (most recent call last):
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 79, in create_connection
raise err
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 69, in create_connection
sock.connect(sa)
TimeoutError: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
conn.request(method, url, **httplib_request_kw)
File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1229, in request
self._send_request(method, url, body, headers, encode_chunked)
File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1275, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1224, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1016, in _send_output
self.send(msg)
File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 956, in send
self.connect()
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 196, in connect
conn = self._new_conn()
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 180, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000002DF47DEE048>: Failed to establish a new connection: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 445, in send
timeout=timeout
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info())
File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='m.18xs.org', port=80): Max retries exceeded with url: //book_3506/2353186.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002DF47DEE048>: Failed to establish a new connection: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "G:/PyCharm_Projects/novel_spider/__init__.py", line 31, in <module>
t=get_text(i)
File "G:/PyCharm_Projects/novel_spider/__init__.py", line 21, in get_text
r = requests.get(url,headers=headers)
File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 622, in send
r = adapter.send(request, **kwargs)
File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 513, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='m.18xs.org', port=80): Max retries exceeded with url: //book_3506/2353186.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002DF47DEE048>: Failed to establish a new connection: 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))
Process finished with exit code 1 import requests
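This traceback is a network-level connection failure: requests.get is called with no timeout, so the request hangs until Windows gives up and the whole run dies on the first unreachable chapter. (The double slash in the failing URL is the concatenation issue that urljoin fixes above, but the crash itself is the connection timing out.) A minimal sketch of one way to ride out transient failures; it assumes the headers dict from the script above, and the timeout, retry count, and sleep interval are arbitrary illustrative values, not something from the original post:

    import time
    import requests

    def fetch_with_retry(url, headers, retries=3, wait=5):
        # Retry a few times with a timeout, so a dead host fails fast
        # instead of hanging; give up by re-raising the last exception.
        for attempt in range(retries):
            try:
                r = requests.get(url, headers=headers, timeout=10)
                r.raise_for_status()  # treat HTTP error statuses as failures too
                return r
            except requests.exceptions.RequestException:
                if attempt == retries - 1:
                    raise
                time.sleep(wait)  # brief back-off before the next attempt

get_text could then call fetch_with_retry(url, headers) instead of requests.get, so one flaky chapter no longer kills the whole crawl.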
How do you use this? OP, did you actually scrape a novel successfully?
Please post instructions on how to use it.
I don't quite get it, but it looks impressive. Could you explain in more detail, OP??
I can't figure out how to use it...
Isn't this just simulating a request to fetch the page and then extracting the specified elements from it?
Haha, nice work!
Thanks, OP.
I'm learning this right now; it makes a handy reference.
Handling the novel text this way might not be very reader-friendly; there's no formatting at all!
For the content the XPath extracts, wouldn't it look better with a \t before each piece and a newline after it?
And I have a question:
Are the chapters scraped this way in order?
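On those two replies: the tab suggestion is a one-line change inside get_text. A variant sketch of the loop, not the original author's code:

    for i in text:
        i = ''.join(i.split())
        if i:  # skip fragments that were pure whitespace
            txt = txt + '\t' + i + '\n'  # indent each paragraph with a tab

As for ordering: lxml returns XPath matches in document order, so the chapters come back in the order the table of contents lists them, and the sequential loop writes them to the file in that same order.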