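[Solved] Program to scrape the novel "斗罗大陆4之终极斗罗" (Douluo Dalu 4: Zhongji Douluo); the problem was solved by 1170, and the code is in post #14
Last edited by basfan on 2020-3-28 13:41
I'm a beginner learning Python web scraping, and I wrote a small program to scrape the novel Douluo Dalu 4: Zhongji Douluo.
The page structure is simple and so is the code, but only the first few chapters are fetched and saved correctly.
Starting from chapter 6 it keeps raising an error. Any advice would be appreciated. Thanks!
Source code: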
import re
import requests
from bs4 import BeautifulSoup

def gethtmlcontent(url):
    # download one chapter page and return its HTML as text
    r=requests.get(url)
    return r.text

def savecontent(file,content):
    # append the chapter title and every <p> paragraph to the output file
    soup=BeautifulSoup(content,'html.parser')
    title=soup.h1.contents
    with open(file,'a+') as f:
        f.writelines(title)
        f.writelines('\n')
        for each in soup.findAll('p'):
            f.writelines(each.get_text())
            f.writelines('\n')

def getnexturl(text):
    # return the URL of the rel="next" link, or 0 when there is none
    soup=BeautifulSoup(text,'html.parser')
    temp=soup.find_all(rel="next")
    if temp:
        found=re.findall(r'http://www\.qiushuge\.net/zhongjidouluo/[^"]*?\.html',str(temp))
        return found[0] if found else 0
    else:
        return 0

if __name__=='__main__':
    name='zjdl.txt'
    baseurl='http://www.qiushuge.net/zhongjidouluo/4.html'
    text=gethtmlcontent(baseurl)
    while getnexturl(text):
        savecontent(name,text)
        url=getnexturl(text)
        text=gethtmlcontent(url)
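The first few chapters are saved without problems, but from chapter 6 onward it fails. I could not find the cause, so please help. Thanks!
Error output: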
Traceback (most recent call last):
File "C:\Users\Administrator.SC-201810042154\Desktop\test\爬取网站.py", line 34, in <module>
savecontent(name,text)
File "C:\Users\Administrator.SC-201810042154\Desktop\test\爬取网站.py", line 16, in savecontent
f.writelines(each.get_text())
UnicodeEncodeError: 'gbk' codec can't encode character '\u200b' in position 0: illegal multibyte sequence
This is an encoding problem when writing the file. Change the line with open(file,'a+') as f: to with open(file,'a+', encoding='utf-8') as f:

In reply to 1170 (posted 2020-3-27 23:58):
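To illustrate why the suggested change helps, here is a minimal sketch that does not touch the network. It assumes a sample string containing the zero-width space '\u200b' named in the traceback, and it writes to a temporary directory rather than to the real zjdl.txt; both the sample text and the temporary path are made up for the demonstration.

```python
import os
import tempfile

# Sample text starting with U+200B (zero-width space), the character
# reported in the UnicodeEncodeError. The text itself is illustrative.
text = "\u200b第六章"

# gbk has no mapping for U+200B, so encoding fails, which is what happens
# implicitly when open() falls back to the Windows locale codec.
try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# With an explicit utf-8 encoding the same text is written without error.
path = os.path.join(tempfile.mkdtemp(), "zjdl.txt")  # temporary file, not the real output
with open(path, "a+", encoding="utf-8") as f:
    f.write(text + "\n")

# Read it back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    saved = f.read()

print(gbk_ok, saved == text + "\n")
```

The file must also be opened with encoding='utf-8' when it is read later, otherwise the reader may hit a decoding error of the same kind.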
Thanks, I tried it. It now fetches many chapters, so that problem seems to be fixed.
However, after running for a while it reports:
Traceback (most recent call last):
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connection.py", line 156, in _new_conn
conn = connection.create_connection(
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
raise err
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
sock.connect(sa)
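TimeoutError: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.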
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1230, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1276, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1225, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1004, in _send_output
self.send(msg)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 944, in send
self.connect()
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connection.py", line 184, in connect
conn = self._new_conn()
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
raise NewConnectionError(
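urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x05A93F58>: Failed to establish a new connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.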
During handling of the above exception, another exception occurred:
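I think this is caused by requesting pages too quickly. Adding a pause before each fetch should fix it, but I have not changed the code yet.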
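One way the "pause" idea could look is sketched below. The helper name fetch_with_pause and its parameters (pause, retries, sleep) are hypothetical and not part of the original program; the fetch argument is meant to be any function that takes a URL and returns text, such as gethtmlcontent. The retry on failure is an addition beyond the plain pause described above.

```python
import time

def fetch_with_pause(fetch, url, pause=1.0, retries=3, sleep=time.sleep):
    """Call fetch(url) after a pause, retrying a few times on connection errors.

    fetch   -- any callable taking a URL and returning the page text
    pause   -- base delay in seconds; attempt n waits pause * n
    retries -- maximum number of attempts (assumed to be at least 1)
    sleep   -- injectable so the delay can be replaced in tests
    """
    last_error = None
    for attempt in range(1, retries + 1):
        sleep(pause * attempt)  # wait a little longer before each new attempt
        try:
            return fetch(url)
        except OSError as err:
            # TimeoutError is an OSError subclass; requests' exceptions
            # also derive from IOError, which is an alias of OSError.
            last_error = err
    raise last_error
```

In the main loop, text=gethtmlcontent(url) would then become text=fetch_with_pause(gethtmlcontent, url). Passing a timeout to requests.get inside gethtmlcontent, for example requests.get(url, timeout=10), is a reasonable companion change so that a stalled connection fails quickly instead of hanging.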
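I have only read the first part.
Douluo Dalu actually has a part 4?

In reply to 林夕丶 (posted 2020-3-28 00:12):
It has been out for a long time. It should be the last part!~

Just change this part. See my screenshot here.
The site you are scraping has very few chapters; other sites are already updated to "第975章 龙三" (chapter 975).
They are still milking this series. I read a few chapters and it is the same thing in a new wrapper. Look on Qidian and you will see plenty of people complaining too.

In reply to li217322810 (posted 2020-3-28 01:51):
The site I am scraping updates the fastest. It is already at "第九百七十六章 熟人相遇" (chapter 976), which is why I scrape from it.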