Writing a Python novel scraper from scratch

三木猿 posted on 2020-8-31 16:54

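I'm a Java developer teaching myself Python in my spare time. This post walks through how I built a novel scraper; if you spot problems in the code, please point them out.
1. Imports. These two packages are generally all you need. I won't cover how to install them; instructions are easy to find online.
import requests
from bs4 import BeautifulSoup
2. As with a Java scraper, the core of the job is fetching the page for a given URL.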
def download_page(url):
    # Fetch the page and return the raw response bytes
    data = requests.get(url).content
    return data
3. Next comes the scraping logic. Open the page you want to scrape in a browser, find the content you need, and locate it through its parent node.

def parse_html(html):
    # Parse the HTML returned by download_page
    soup = BeautifulSoup(html, 'html.parser')
    # Get the table that holds the search results
    movie_list_soup = soup.find('table')
    # print(movie_list_soup)
    # Book URLs
    movie_list = []
    # Book titles
    movie_name_list = []
    if movie_list_soup is not None:
        i = 1
        # Walk every tr in the table
        for movie_li in movie_list_soup.find_all('tr'):
            # Skip the header row
            if movie_li.find_all('th'):
                continue
            # The a tag inside the td with class "odd" holds the book title and URL
            a_ = movie_li.find('td', attrs={'class': 'odd'}).find('a')
            print(i, '.', a_.text)
            movie_list.append(a_['href'])  # store the book URL
            movie_name_list.append(a_.text)
            i = i + 1
        # The user picks a book by its number
        count = int(input('Enter the book number: ')) - 1
        page = BeautifulSoup(download_page(movie_list[count]), 'html.parser')
        # Each dd on the book page links to one chapter
        dd_s = page.find_all('dd')
        file_handle = open('D:/SanMu/' + movie_name_list[count] + '.txt', mode='w', encoding='utf-8')
        for dd in dd_s:
            beautiful_soup = BeautifulSoup(download_page(dd.find('a')['href']), 'html.parser')
            # Chapter title
            name = beautiful_soup.find('h1').text
            file_handle.write(name)
            file_handle.write('\n')
            # Strip the wrapper tags from the chapter body, then split it into paragraphs
            catalogue_html = str(beautiful_soup.find('div', attrs={'id': 'content'}))
            html_replace = catalogue_html.replace("<div id=\"content\">", "")
            replace = html_replace.replace("\n", "").replace(
                "</div>", "").replace("<p>", "")
            split = replace.split("</p>")
            for p_ in split:
                file_handle.write(p_)
                file_handle.write('\n')
        file_handle.close()
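
The chapter body in parse_html is cleaned up with a chain of string replacements. As an alternative (not what the post itself does), BeautifulSoup can return the text of each p tag directly through get_text. The sketch below is only an illustration: extract_chapter and SAMPLE_CHAPTER are made-up names, and the sample markup is an assumption modelled on the h1 and div id="content" selectors used above, so the real site may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter markup, modelled on the selectors used in parse_html.
SAMPLE_CHAPTER = """
<html><body>
<h1>Chapter 1 The Beginning</h1>
<div id="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract_chapter(html):
    """Return (title, paragraphs) for one chapter page."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    title = heading.get_text(strip=True) if heading else ""
    content = soup.find("div", attrs={"id": "content"})
    if content is None:
        # No chapter body found: return the title with no paragraphs
        return title, []
    paragraphs = [p.get_text(strip=True) for p in content.find_all("p")]
    # Drop paragraphs that are empty after stripping whitespace
    return title, [text for text in paragraphs if text]


title, paragraphs = extract_chapter(SAMPLE_CHAPTER)
print(title)
print("\n".join(paragraphs))
```

Returning an empty list when the content div is missing lets the caller skip a chapter instead of writing the string "None" into the file.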

Finally, just call the two functions:
def main():
    parse_html(download_page("https://www.biquge5200.com/modules/article/search.php?searchkey=" + input("Search: ")))

main()
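
The search URL in main() is built by appending the raw keyword to the query string. A minimal sketch of percent-encoding the keyword first, using urlencode from the standard library: build_search_url and SEARCH_URL are hypothetical names, and it is an assumption (not confirmed by the post) that the site accepts UTF-8 encoded keywords.

```python
from urllib.parse import urlencode

# Search endpoint taken from the post
SEARCH_URL = "https://www.biquge5200.com/modules/article/search.php"


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    # urlencode escapes spaces, '&', '=' and non-ASCII characters
    return SEARCH_URL + "?" + urlencode({"searchkey": keyword.strip()})


print(build_search_url("a b&c"))
```

Encoding matters most when the keyword contains characters such as & or =, which would otherwise be read as part of the query-string syntax.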
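And with that, a novel scraper for 笔趣阁 (Biquge) is done. Simple, isn't it? Leave a comment if you have questions. The complete code follows: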
import requests
from bs4 import BeautifulSoup

def download_page(url):
    data = requests.get(url).content
    return data

def parse_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    movie_list_soup = soup.find('table')
    # print(movie_list_soup)
    movie_list = []
    movie_name_list = []
    if movie_list_soup is not None:
        i = 1
        for movie_li in movie_list_soup.find_all('tr'):
            if movie_li.find_all('th'):
                continue
            a_ = movie_li.find('td', attrs={'class': 'odd'}).find('a')
            print(i, '.', a_.text)
            movie_list.append(a_['href'])
            movie_name_list.append(a_.text)
            i = i + 1
        count = int(input('Enter the book number: ')) - 1
        page = BeautifulSoup(download_page(movie_list[count]), 'html.parser')
        dd_s = page.find_all('dd')
        file_handle = open('D:/SanMu/' + movie_name_list[count] + '.txt', mode='w', encoding='utf-8')
        for dd in dd_s:
            beautiful_soup = BeautifulSoup(download_page(dd.find('a')['href']), 'html.parser')
            name = beautiful_soup.find('h1').text
            file_handle.write(name)
            file_handle.write('\n')
            catalogue_html = str(beautiful_soup.find('div', attrs={'id': 'content'}))
            html_replace = catalogue_html.replace("<div id=\"content\">", "")
            replace = html_replace.replace("\n", "").replace(
                "</div>", "").replace("<p>", "")
            split = replace.split("</p>")
            for p_ in split:
                file_handle.write(p_)
                file_handle.write('\n')
        file_handle.close()

def main():
    parse_html(download_page("https://www.biquge5200.com/modules/article/search.php?searchkey=" + input("Search: ")))

main()
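
The listing builds the output path by joining 'D:/SanMu/' with the book title. A title that contains a character Windows forbids in file names (such as ? or :) would make open() fail. One possible way to guard against that is sketched below; book_path and INVALID_CHARS are hypothetical helpers, not part of the original code.

```python
import re
from pathlib import Path

# Characters that Windows does not allow in file names
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def book_path(title, folder="D:/SanMu"):
    """Return a Path for the book's .txt file inside folder."""
    # Replace forbidden characters; fall back to a fixed name if nothing is left
    safe_title = INVALID_CHARS.sub("_", title).strip() or "untitled"
    return Path(folder) / (safe_title + ".txt")


print(book_path("What If? Part 1/2"))
```

open() in 'w' mode also raises FileNotFoundError when the target folder does not exist, so calling book_path(title).parent.mkdir(parents=True, exist_ok=True) before opening the file avoids that.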

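三木猿 posted on 2020-9-1 12:57

If you're interested, have a look at my other post, which is written in Java:
Scraping comics and saving them to a specified folder
https://www.52pojie.cn/thread-1255895-1-1.html
(Source: 吾爱破解 forum)

三木猿 posted on 2020-9-1 12:59

The multithreading follow-up is now posted:
Writing a Python novel scraper from scratch: scraping 笔趣阁 with multiple threads
https://www.52pojie.cn/thread-1258440-1-1.html
(Source: 吾爱破解 forum)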

三木猿 posted on 2020-8-31 21:41

xiong779 posted on 2020-8-31 18:17
Thanks for sharing, I learned a lot. I ran into a problem and fixed it by adding utf-8:

file_handle = open('D:/SanMu/'+movie_name_list

Right, specifying the encoding is better.

xiong779 posted on 2020-8-31 18:17

Thanks for sharing, I learned a lot. I ran into a problem and fixed it by adding utf-8:

file_handle = open('D:/SanMu/'+movie_name_list+'.txt', mode='w',encoding = 'utf-8')
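
Some background on why the encoding argument helps: without it, open() in text mode falls back to the platform's locale encoding, which on Windows is often not UTF-8, so writing characters outside that encoding can raise UnicodeEncodeError. A small sketch of the round trip with the encoding stated explicitly; the sample text and the temporary file are placeholders for illustration only.

```python
import tempfile
from pathlib import Path

text = "第一章 开始\n"

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "sample.txt"
    # Write and read with the same explicit encoding
    with open(path, mode="w", encoding="utf-8") as file_handle:
        file_handle.write(text)
    with open(path, mode="r", encoding="utf-8") as file_handle:
        restored = file_handle.read()

print(restored == text)
```

Using with blocks also closes the file even if a write fails partway through.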

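是谁呀 posted on 2020-8-31 17:46

Thanks for sharing, I learned a lot.

cctv96 posted on 2020-8-31 18:07

I want to learn PH

Angel泠鸢 posted on 2020-8-31 18:20

Thanks for sharing, I learned a lot.

huaaishangbing posted on 2020-8-31 18:23

Thanks to the OP for sharing.

lsy832 posted on 2020-8-31 18:29

Thanks for sharing!

Zmacro posted on 2020-8-31 18:33

Thanks for sharing, I learned a lot.

绫音 posted on 2020-8-31 18:44

Showing my support. Thanks, OP.

大兵马元帅 posted on 2020-8-31 18:48

Hello, Java bro.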


吾爱破解 - 52pojie.cn's Archiver
