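basfan posted on 2020-3-27 23:43

[Solved] Program to scrape the novel "斗罗大陆4之终极斗罗"; problem solved by 1170, code is in post #14

Last edited by basfan on 2020-3-28 13:41

I'm a beginner learning to write Python crawlers, and I wrote a small program to scrape the novel 斗罗大陆4之终极斗罗.
The page structure is simple and so is the code, but only the first few chapters are fetched and saved successfully.
From chapter 6 onward it keeps raising an error. Any advice from the experts here would be much appreciated. Thanks!
Source code: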

import requests,re
from bs4 import BeautifulSoup

def gethtmlcontent(url):
    r=requests.get(url)

    return r.text

def savecontent(file,content):
    soup=BeautifulSoup(content,'html.parser')
    title=soup.h1.contents
    with open(file,'a+') as f:
        f.writelines(title)
        f.writelines('\n')
        for each in soup.findAll('p'):
            f.writelines(each.get_text())
            f.writelines('\n')

def getnexturl(text):
    soup=BeautifulSoup(text,'html.parser')
    temp=soup.find_all(rel="next")
    if temp:
        return re.findall(r'http://www.qiushuge.net/zhongjidouluo/.*.html',str(temp))
    else:
        return 0


if __name__=='__main__':
    name='zjdl.txt'
    baseurl='http://www.qiushuge.net/zhongjidouluo/4.html'
    text=gethtmlcontent(baseurl)
    while getnexturl(text):
        savecontent(name,text)
        url=getnexturl(text)
        text=gethtmlcontent(url)
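
A side note on the loop above, separate from the reported error: as posted, re.findall returns a list rather than a single URL string, and the while loop exits before saving the page that has no "next" link. Below is a minimal sketch of one alternative, reading the href of the rel="next" tag directly. To stay runnable without extra packages it uses the standard library's html.parser as a stand-in for BeautifulSoup. The names NextLinkParser, get_next_url and crawl, and the fetch/save callables, are hypothetical and not part of the original program.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first tag whose rel attribute contains "next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None:
            return  # keep only the first match
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").split()
        if "next" in rel_values and attr_map.get("href"):
            self.next_href = attr_map["href"]


def get_next_url(html, page_url):
    """Return the absolute URL of the next chapter, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # urljoin also handles relative links such as "5.html"
    return urljoin(page_url, parser.next_href)


def crawl(start_url, fetch, save):
    """Fetch, save, then follow the next link, so the last page is saved too."""
    url = start_url
    while url:
        html = fetch(url)
        save(html)
        url = get_next_url(html, url)
```

Here fetch would be something like the thread's gethtmlcontent and save a wrapper around savecontent. Each page is downloaded and parsed for its next link only once per iteration.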



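The first few chapters are saved without problems, but starting from chapter 6 it fails. I couldn't find the cause, so any pointers are welcome. Thanks!

Error output: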

Traceback (most recent call last):
File "C:\Users\Administrator.SC-201810042154\Desktop\test\爬取网站.py", line 34, in <module>
    savecontent(name,text)
File "C:\Users\Administrator.SC-201810042154\Desktop\test\爬取网站.py", line 16, in savecontent
    f.writelines(each.get_text())
UnicodeEncodeError: 'gbk' codec can't encode character '\u200b' in position 0: illegal multibyte sequence
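
The traceback can be reproduced without any scraping. When open() is called in text mode without an encoding argument, Python 3.8 uses the locale's preferred encoding, which on a Chinese-language Windows install is typically GBK (an assumption about this poster's machine, though the 'gbk' in the error message is consistent with it). The character U+200B (zero-width space) has no GBK representation. A minimal sketch, where can_encode is a hypothetical helper:

```python
import locale

ZERO_WIDTH_SPACE = "\u200b"  # the character named in the traceback


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The encoding open() falls back to when none is given (Python 3.8 behaviour)
print("default text encoding:", locale.getpreferredencoding(False))
print("gbk can encode U+200B:", can_encode(ZERO_WIDTH_SPACE, "gbk"))
print("utf-8 can encode U+200B:", can_encode(ZERO_WIDTH_SPACE, "utf-8"))
```

This would explain why the first chapters saved fine: the write only fails once a page happens to contain a character outside GBK.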

1170 posted on 2020-3-27 23:58

It's an encoding problem when writing the file. Change with open(file,'a+') as f: to with open(file,'a+', encoding='utf-8') as f:
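
The suggested change can be sketched in isolation as follows. The helper names append_lines and read_text and the demo file name are hypothetical. The sketch also uses f.write instead of f.writelines, since writelines on a single string iterates over it character by character.

```python
import os
import tempfile


def append_lines(path, lines):
    """Append each string as its own line, always encoding as UTF-8."""
    with open(path, "a", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def read_text(path):
    """Read the file back with the same explicit encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read()


# Demo on a throwaway file; the second line starts with U+200B
demo_path = os.path.join(tempfile.mkdtemp(), "zjdl_demo.txt")
append_lines(demo_path, ["Chapter title", "\u200bFirst paragraph"])
print(read_text(demo_path) == "Chapter title\n\u200bFirst paragraph\n")
```

Anything that later reads the file has to pass encoding='utf-8' as well, otherwise the platform default is used for decoding.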

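basfan posted on 2020-3-28 09:56

1170 posted on 2020-3-27 23:58
It's an encoding problem when writing the file. Change with open(file,'a+') as f: to with open(file,'a+', encoding='utf-8') as f:

Thanks, I tried it and it now fetches many chapters, so that problem looks solved.
But after running for a while, it reports: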

Traceback (most recent call last):
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connection.py", line 156, in _new_conn
    conn = connection.create_connection(
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
    raise err
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
    sock.connect(sa)
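TimeoutError: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.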

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
    conn.request(method, url, **httplib_request_kw)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1230, in request
    self._send_request(method, url, body, headers, encode_chunked)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1276, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1225, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 1004, in _send_output
    self.send(msg)
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\http\client.py", line 944, in send
    self.connect()
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connection.py", line 184, in connect
    conn = self._new_conn()
File "C:\Users\Administrator.SC-201810042154\AppData\Local\Programs\Python\Python38-32\lib\site-packages\urllib3\connection.py", line 168, in _new_conn
    raise NewConnectionError(
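urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x05A93F58>: Failed to establish a new connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.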

During handling of the above exception, another exception occurred:

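My guess is that the requests are going out too fast. Adding a "pause" before each fetch should fix it, but I haven't changed the code yet.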
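
The "pause" idea could look roughly like the sketch below, which also retries a failed request a few times. It assumes the failures are transient and caused by request rate, which the thread does not confirm. fetch_with_retry and its parameters are hypothetical names; fetch is any callable that downloads a URL and returns its text.

```python
import time


def fetch_with_retry(url, fetch, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), pausing before each attempt and retrying on OSError."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        # The pause grows with each attempt: delay, 2*delay, 3*delay, ...
        sleep(delay * attempt)
        try:
            return fetch(url)
        except OSError as err:
            # Built-in TimeoutError and ConnectionError are OSError subclasses
            last_error = err
    raise last_error
```

With requests, fetch could be lambda u: requests.get(u, timeout=10).text. The exceptions requests raises derive from OSError, so they are caught by the same except clause, and the timeout argument keeps a stalled connection from hanging indefinitely.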

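xytwlh posted on 2020-3-28 00:07

I've only read the first part.

林夕丶 posted on 2020-3-28 00:12

There's actually a 斗罗大陆 4?

s51280131 posted on 2020-3-28 00:40

林夕丶 posted on 2020-3-28 00:12
There's actually a 斗罗大陆 4?

It's been out for a long time. It should be the last part!~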

小涩席 posted on 2020-3-28 00:49

Just change this spot, the part shown in my screenshot.

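li217322810 posted on 2020-3-28 01:51

The site you're scraping has very few chapters. Other sites have already updated to Chapter 975, "龙三".

biftino posted on 2020-3-28 02:04

They still haven't finished milking this stale series. I read a few chapters and it's the same old thing in new packaging. Plenty of people are trashing it on 起点 (Qidian) too.

basfan posted on 2020-3-28 10:00

li217322810 posted on 2020-3-28 01:51
The site you're scraping has very few chapters. Other sites have already updated to Chapter 975, "龙三".

The site I'm scraping updates the fastest. It's already at Chapter 976, "熟人相遇", which is why I scrape from it.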