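Beginner Python: scraping novels from 精彩阅读网 (jingcaiyuedu.com) - 2 - 吾爱破解 - 52pojie.cn

HaNnnng posted on 2018-11-14 01:22

Beginner Python: scraping novels from 精彩阅读网 (jingcaiyuedu.com) --- 2

Last edited by HaNnnng on 2018-11-14 01:29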

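I previously wrote a post on a procedural version of this crawler: https://www.52pojie.cn/forum.php ... &extra=#pid22524169

Put simply, procedural programming means working out the steps needed to solve a problem, implementing each step as a function, and then calling those functions one after another.

Object-oriented programming instead breaks the problem down into objects. An object is not created to carry out a single step, but to describe how some thing behaves over the course of solving the whole problem.

As I understand it, with procedural code you walk the computer through the work step by step, whereas with object-oriented code you set up a model of the problem and its solution, and the objects then carry out the work themselves.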
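
To make the contrast concrete, here is a minimal sketch of the same small task written both ways. All names in it (fetch, extract_title, run, TitleScraper, SAMPLE_HTML) are made up for illustration and are not part of the crawler in this post, and the download is faked with a fixed string so the snippet runs offline.

```python
import re

# Stand-in for a downloaded page, so the example runs without network access.
SAMPLE_HTML = '<html><head><title>Chapter 1</title></head></html>'


# --- Procedural style: one function per step, called in order ---
def fetch(url):
    """Pretend to download url; always returns the sample page."""
    return SAMPLE_HTML


def extract_title(html):
    """Return the text inside <title>, or None if there is no title tag."""
    match = re.search(r'<title>(.*?)</title>', html)
    return match.group(1) if match else None


def run(url):
    html = fetch(url)            # step 1: get the page
    return extract_title(html)   # step 2: pull out the title


# --- Object-oriented style: an object that models the whole job ---
class TitleScraper:
    def __init__(self, url):
        self.url = url

    def fetch(self):
        """Pretend to download self.url; always returns the sample page."""
        return SAMPLE_HTML

    def title(self):
        match = re.search(r'<title>(.*?)</title>', self.fetch())
        return match.group(1) if match else None


if __name__ == '__main__':
    print(run('http://example.com/1'))                    # procedural
    print(TitleScraper('http://example.com/1').title())   # object-oriented
```

In the procedural half the caller has to know the order of the steps; in the object-oriented half the caller only asks the object for its title.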

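First we split the earlier procedural code into separate functions by responsibility. Below, getHtml fetches a page and get_chapter_info extracts the chapter information. At the end, spider is called to run everything.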
# _*_ coding: utf_8 _*_
__author__ = 'lwh'
__date__ = '2018/11/12 19:56'

import re
import requests


# Fetch a page and return its HTML as text
def getHtml(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = response.text
    return html


# Return the novel title and a list of (chapter_url, chapter_title) pairs
def get_chapter_info(url):
    html = getHtml(url)
    # re.findall returns a list, so take the first match with [0]
    nover_title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
    dl = re.findall(r'<dl class="panel-body panel-chapterlist"><dd class="col-md-3">.*?</dl>', html, re.S)[0]
    chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)
    return nover_title, chapter_info_list


# Fetch one chapter and strip the HTML tags and spaces from its text
def get_chapter_content(chapter_url):
    html = getHtml(chapter_url)
    chapter_content = re.findall(r' <div class="panel-body" id="htmlContent">(.*?)</div> ', html, re.S)[0]
    chapter_content = chapter_content.replace('<br />', '')
    chapter_content = chapter_content.replace('<br>', '')
    chapter_content = chapter_content.replace('<p>', '')
    chapter_content = chapter_content.replace('</p>', '')
    chapter_content = chapter_content.replace(' ', '')

    return chapter_content


def spider(url):
    nover_title, chapter_info_list = get_chapter_info(url)

    # The with block closes the file even if a download fails part-way
    with open('%s.txt' % nover_title, "w", encoding='utf-8') as f:
        # Download each chapter
        for chapter_url, chapter_title in chapter_info_list:
            chapter_url = 'http://www.jingcaiyuedu.com%s' % chapter_url
            chapter_content = get_chapter_content(chapter_url)

            f.write(chapter_title)
            f.write('\n')
            f.write(chapter_content)
            f.write('\n\n\n\n\n')
            print(chapter_url)


if __name__ == '__main__':
    nover_url = 'http://www.jingcaiyuedu.com/book/376655.html'
    spider(nover_url)
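
One detail worth checking in isolation is what re.findall returns: always a list, of strings when the pattern has at most one capture group and of tuples when it has two or more. The sketch below runs the same three patterns on a small hand-written HTML fragment. The fragment (SAMPLE) is an assumption modelled on those patterns, not a copy of the real site's markup.

```python
import re

# Hand-written fragment, assumed to resemble the index page of a novel.
SAMPLE = (
    '<meta property="og:novel:book_name" content="Demo Novel"/>'
    '<dl class="panel-body panel-chapterlist"><dd class="col-md-3">'
    '<a href="/book/1/0.html">Chapter 1</a></dd>'
    '<dd class="col-md-3"><a href="/book/1/1.html">Chapter 2</a></dd></dl>'
)

titles = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', SAMPLE)
# titles is a list even when there is a single match, so index it to get a string.
title = titles[0]

# No capture group here, so each list item is the whole matched <dl>...</dl> text.
dl = re.findall(r'<dl class="panel-body panel-chapterlist"><dd class="col-md-3">.*?</dl>', SAMPLE, re.S)[0]

# Two capture groups give a list of (href, link text) tuples.
chapters = re.findall(r'href="(.*?)">(.*?)<', dl)

print(title)
print(chapters)
```

Passing the list itself to a second re.findall, or calling .replace on it, raises an error, which is why each result needs to be indexed before further string processing.
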
We can also wrap it all up in a class, so that others can call it directly without knowing the details of the internal implementation.

# _*_ coding: utf_8 _*_
__author__ = 'lwh'
__date__ = '2018/11/12 20:10'

import re
import requests


class Spider():
    # Fetch a page and return its HTML as text
    def getHtml(self, url):
        response = requests.get(url)
        response.encoding = 'utf-8'
        html = response.text
        return html

    # Return the novel title and a list of (chapter_url, chapter_title) pairs
    def get_chapter_info(self, url):
        html = self.getHtml(url)
        # re.findall returns a list, so take the first match with [0]
        nover_title = re.findall(r'<meta property="og:novel:book_name" content="(.*?)"/>', html)[0]
        dl = re.findall(r'<dl class="panel-body panel-chapterlist"><dd class="col-md-3">.*?</dl>', html, re.S)[0]
        chapter_info_list = re.findall(r'href="(.*?)">(.*?)<', dl)
        return nover_title, chapter_info_list

    # Fetch one chapter and strip the HTML tags and spaces from its text
    def get_chapter_content(self, chapter_url):
        html = self.getHtml(chapter_url)
        chapter_content = re.findall(r' <div class="panel-body" id="htmlContent">(.*?)</div> ', html, re.S)[0]
        chapter_content = chapter_content.replace('<br />', '')
        chapter_content = chapter_content.replace('<br>', '')
        chapter_content = chapter_content.replace('<p>', '')
        chapter_content = chapter_content.replace('</p>', '')
        chapter_content = chapter_content.replace(' ', '')

        return chapter_content

    def spider(self, url):
        nover_title, chapter_info_list = self.get_chapter_info(url)

        # The with block closes the file even if a download fails part-way
        with open('%s.txt' % nover_title, "w", encoding='utf-8') as f:
            # Download each chapter
            for chapter_url, chapter_title in chapter_info_list:
                chapter_url = 'http://www.jingcaiyuedu.com%s' % chapter_url
                chapter_content = self.get_chapter_content(chapter_url)

                f.write(chapter_title)
                f.write('\n')
                f.write(chapter_content)
                f.write('\n\n\n\n\n')
                print(chapter_url)


if __name__ == '__main__':
    nover_url = 'http://www.jingcaiyuedu.com/book/376655.html'
    Spider().spider(nover_url)

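HaNnnng posted on 2018-11-14 13:18

Last edited by HaNnnng on 2018-11-14 13:27

chantmisaya posted on 2018-11-14 09:52
Is there any way to scrape sites whose content is only visible after logging in or paying?
If it needs a login, simulate the login: submit the data with a POST request and analyze the data.
Start by analyzing the request parameters and working out which ones are dynamic. Then sort them into those that can be parsed out of the page source and those that have to be generated dynamically, such as an encrypted password. If the encryption is done dynamically in JS, extract the JS file that does it. Also pay attention to the parameters in the request headers. After that, making the call is basically all it takes. Having a VIP account makes it simpler.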
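
The steps in this reply could be sketched roughly as follows. Everything specific here is an assumption for illustration: the URLs, the form field names (username, password, token), the hidden-input pattern, and the use of MD5 for the password are all hypothetical, and a real site has to be analyzed to find its own. The session argument is expected to behave like requests.Session, which keeps cookies between the GET and the POST.

```python
import hashlib
import re

# Hypothetical endpoints and headers; replace with what the real site uses.
LOGIN_PAGE = 'http://example.com/login'
LOGIN_POST = 'http://example.com/do_login'
HEADERS = {'User-Agent': 'Mozilla/5.0', 'Referer': LOGIN_PAGE}


def parse_token(html):
    """A parameter parsed from the page source (assumed to be a hidden input)."""
    match = re.search(r'name="token" value="(.*?)"', html)
    return match.group(1) if match else ''


def encrypt_password(password):
    """A parameter that must be generated; MD5 stands in for the site's JS routine."""
    return hashlib.md5(password.encode('utf-8')).hexdigest()


def build_payload(html, username, password):
    """Combine the parsed and the generated parameters into the POST form data."""
    return {
        'username': username,
        'password': encrypt_password(password),
        'token': parse_token(html),
    }


def login(session, username, password):
    """GET the login page, then POST the form through the same session."""
    html = session.get(LOGIN_PAGE, headers=HEADERS).text
    payload = build_payload(html, username, password)
    return session.post(LOGIN_POST, data=payload, headers=HEADERS)
```

With requests installed, login(requests.Session(), 'user', 'secret') would send the form, and later calls on that same session would carry the login cookies.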

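junsky1129 posted on 2018-11-14 03:36

How hard is scraping for someone just starting out with Python?

nightingwish posted on 2018-11-14 03:46

Nice work: the approach is clear and the code is cleanly written.

夏雨未晴 posted on 2018-11-14 07:52

Thanks for sharing, this is really useful.

sdlizj posted on 2018-11-14 08:02

I'd like to scrape some movies to watch.

kongbaiGG posted on 2018-11-14 08:25

cxbb posted on 2018-11-14 08:27

Thanks for sharing!

maokiss posted on 2018-11-14 08:28

Very nice, thanks for sharing!~~

xzxlove posted on 2018-11-14 08:52

What software is good for writing code when learning Python?

ssqhmmm posted on 2018-11-14 08:52

Thanks for sharing, this is exactly the kind of code I needed.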
