Novel scraper source code written in Python [Original]

seeyou_shj · posted on 2020-7-27 11:16

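With some free time and no desire to keep studying C++, I heard that Python is popular these days, so I decided to look into it.

While I was at it, I polished up my novel scraper. The source code is listed below; tidy it up yourself if you need it. It is simple, so I won't explain it.

If you find it useful, please show some support with a free enthusiasm-point rating.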

from urllib.request import urlopen

myurl = 'https://www.zwdu.com/book/31855/'
# Scheme + host: everything before the first "/" after "https://"
myhost = myurl[0:myurl.find("/", 10)]
smsg = urlopen(myurl).read().decode('gbk')
tmsg = smsg.find("<dd>")
with open('d:/text.txt', 'w', encoding='gbk') as f:
    # Each <dd>...</dd> entry on the index page holds one chapter link
    while tmsg != -1:
        t = smsg[tmsg:smsg.find("</dd>", tmsg)]
        smsg = smsg[smsg.find("</dd>", tmsg):]
        tmsg = smsg.find("<dd>")
        # The href is the text between the first pair of double quotes
        chapurl = myhost + t[t.find("\"") + 1 : t.find("\"", t.find("\"") + 2)]
        chapname = t[t.find("\">") + 2 : t.find("</a>", t.find("\">") + 6)] + "\n"
        temp = urlopen(chapurl).read().decode('gbk')
        # The chapter text sits inside <div id="content">...</div>
        start = temp.find("<div id=\"content\">") + 18
        content = temp[start:temp.find("</div>", start)] + "\n"
        content = content.replace("\t", "")
        content = content.replace("<br />", "\n")
        f.write(chapname)
        f.write(content)
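
For anyone tracing the slicing above, here is a minimal sketch of the link-extraction step run on a made-up `<dd>` fragment, so it can be tried without touching the network. The sample HTML, the chapter title, and the helper name `parse_dd` are assumptions for illustration; they are not taken from the site.

```python
def parse_dd(fragment, host):
    """Return (chapter_url, chapter_name) from one '<dd><a href="...">name</a>' fragment."""
    q1 = fragment.find('"')            # opening quote of the href value
    q2 = fragment.find('"', q1 + 1)    # closing quote of the href value
    href = fragment[q1 + 1:q2]
    name_start = fragment.find('">') + 2
    name_end = fragment.find('</a>', name_start)
    return host + href, fragment[name_start:name_end]

myurl = 'https://www.zwdu.com/book/31855/'
myhost = myurl[0:myurl.find("/", 10)]   # same host slicing as the script

# Hypothetical fragment in the shape the script expects
sample = '<dd><a href="/book/31855/1.html">Chapter 1</a>'
url, name = parse_dd(sample, myhost)
print(url)
print(name)
```

If the index page's markup differs from this shape (for example, single-quoted attributes), the `find` offsets would need adjusting.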

黑了黑 · posted on 2020-7-27 13:25

Thanks for sharing. Learning from it.

seeyou_shj · posted on 2020-8-19 13:29

I ran into a decoding problem and found a fix after searching. The revised source code is below:
import re
from urllib.request import urlopen

myurl = 'http://www.purepen.com/hlm/'
myhost = myurl
smsg = urlopen(myurl).read()
# Choose the decoding charset from the page's "charset=" declaration
declared = re.search(rb'[a-zA-Z0-9-]*', smsg[smsg.find(b'charset=') + 8:]).group().lower()
if declared == b'gbk':
    charset = 'gbk'
elif declared == b'utf-8':
    charset = 'utf-8'
else:
    # gb2312 is decoded as gb18030; also used as the fallback so charset is always defined
    charset = 'gb18030'
smsg = smsg.decode(charset)  # decode
# The chapter table starts at the cell for chapter one ("第一回")
tmsg = smsg.find("<TD>第一回")
t = smsg[tmsg:smsg.find("</TABLE>", tmsg)]
tmsg = t.find("<A HREF")
with open('d:/mytemp/红楼梦.txt', 'w', encoding='gb18030') as f:
    while tmsg != -1:
        # The link target follows '<A HREF="' (9 characters)
        chapurl = myhost + t[tmsg + 9 : t.find("\"", tmsg + 12)]
        # Drop the processed link and locate the next one
        tmsg = t.find("</A>")
        t = t[tmsg + 6:]
        tmsg = t.find("<A HREF")
        temp = urlopen(chapurl).read().decode('gb18030')
        chapname = temp[temp.find("<b>") + 3 : temp.find("</b>")] + '\n'
        f.write(chapname)
        temp = temp[temp.find("<table>"):temp.find("</table>")]
        content = temp[temp.find("size=\"3\">") + 9 : temp.find("</font>")] + "\n"
        content = content.replace("\t", "")
        content = content.replace("<br />", "")
        content = content.replace(" ", "")
        # Lines shorter than 33 characters are treated as paragraph ends and keep
        # a newline; full-width lines are joined to undo the page's hard wrapping
        for line in content.split('\n'):
            if len(line) < 33:
                line = line + '\n'
            f.write(line)
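
The charset selection can also be pulled out into a small function, which makes it easy to test on sample bytes. This is only a sketch: the name `pick_codec`, the `CODECS` table, the sample pages, and the choice of `gb18030` as the default when nothing is declared are my assumptions, not part of the script above.

```python
import re

# Map declared charset names to the codec used for decoding.
# gb2312 is decoded as gb18030, as in the script above.
CODECS = {b'gb2312': 'gb18030', b'gbk': 'gbk', b'utf-8': 'utf-8'}

def pick_codec(raw, default='gb18030'):
    """Guess a codec from the first 'charset=' declaration in raw bytes."""
    m = re.search(rb'charset=["\']?([a-zA-Z0-9-]+)', raw)
    if m is None:
        return default  # no declaration found: fall back (assumed default)
    # Unknown charset names also fall back to the default
    return CODECS.get(m.group(1).lower(), default)

# Hypothetical page headers for illustration
page = b'<meta http-equiv="Content-Type" content="text/html; charset=GB2312">'
print(pick_codec(page))
print(pick_codec(b'<meta charset="utf-8">'))
print(pick_codec(b'<html>no declaration</html>'))
```

Lower-casing the declared name means `GB2312`, `gb2312`, and mixed-case spellings all take the same branch.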
