A hands-on guide to scraping comics with Python (every step is commented)

三木猿 posted on 2020-9-2 14:01

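I'm still learning myself, so this post is not very advanced. Corrections are welcome.
When you write a crawler, in any language, the most important first step is locating the content you need within the page.
In other words, we need something that uniquely identifies it, such as a tag's id or class. The difference between the two is that an id is supposed to be unique, so a lookup by id returns a single element, while a class can appear on many elements in one page.
So if you select by class, you have to work out which of the matching elements holds your data. Besides id and class you can also select by tag name, which is broader still. Below we take scraping comics as the example and write a crawler step by step, hands-on (hint, hint).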
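
To make the id / class / tag-name distinction concrete, here is a small sketch. It uses only the standard library's html.parser rather than the BeautifulSoup used later in this post, and the HTML fragment and the TagCollector class are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<div id="header">Site header</div>
<div class="itemBox">Comic A</div>
<div class="itemBox">Comic B</div>
<p>Footer text</p>
"""


class TagCollector(HTMLParser):
    """Records (tag, attributes) for every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(SAMPLE_HTML)

# An id should be unique, so a lookup by id yields a single element.
by_id = [t for t in collector.tags if t[1].get("id") == "header"]
# A class may be shared, so a lookup by class can yield several elements.
by_class = [t for t in collector.tags
            if "itemBox" in (t[1].get("class") or "").split()]
# A tag name is broader still: it matches every div regardless of attributes.
by_tag = [t for t in collector.tags if t[0] == "div"]

print(len(by_id), len(by_class), len(by_tag))
```

With a class lookup you still have to decide which of the matches you want, which is why the steps below loop over every match.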

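1. First, pick the comic site to scrape. I use https://m.gufengmh8.com/ as the example. On its search page the URL is https://m.gufengmh8.com/search/?keywords=完美世界
Whatever follows keywords is the search term, so we can build the URL like this

url = "https://m.gufengmh8.com/search/?keywords=" + str(input("Search comic: "))  # input() lets the user type the search term and returns it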

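Plain string concatenation leaves spaces and non-ASCII characters in the URL unescaped. As a hedged sketch, the same URL can be built with the standard library's urlencode; build_search_url is a hypothetical helper name, and the endpoint is the one quoted in this post.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_URL = "https://m.gufengmh8.com/search/"  # search endpoint taken from the post


def build_search_url(keyword):
    """Return the search URL with the keyword percent-encoded as UTF-8."""
    return SEARCH_URL + "?" + urlencode({"keywords": keyword})


url = build_search_url("完美世界")
print(url)

# Decoding the query string gives back the original keyword.
decoded = parse_qs(urlsplit(url).query)["keywords"][0]
```

Note that urlencode turns a space into "+", which is the usual form for query strings.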
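2. Next we analyze this page. What do we need from it? To keep things simple we only scrape the comic titles and let the user choose; there aren't many comics that share a name anyway (really I'm just lazy).
Press F12 in the browser to open the developer tools, click the element picker (the small arrow), then click one of the search results. You can see that each result is a div with class itemBox, so collecting every div with class itemBox on the page gives us all the information for each comic.
Here we only take the title. Use the picker again on a title and you will see that the title is the text of an a tag. So the logic is clear: find all the itemBox elements, loop over them to get the itemTxt inside each one, then take the first node of that itemTxt, which is the a tag holding the title.

Our code now looks like this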

import math
import threading
import time
import os
import requests
from bs4 import BeautifulSoup

# Fetch the page at url and return its raw content
def get_document(url):
    # print(url)
    try:
        get = requests.get(url)  # open the connection
        data = get.content  # read the content
        get.close()  # close the connection
    except requests.RequestException:  # on a request error, retry
        time.sleep(3)  # sleep 3 seconds to give the site time to respond
        try:  # fetch again
            get = requests.get(url)
            data = get.content
            get.close()
        except requests.RequestException:
            time.sleep(3)
            get = requests.get(url)
            data = get.content
            get.close()
    return data

# Download comics (for now it only lists the search results)
def download_img(html):
    soup = BeautifulSoup(html)  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # loop over itemBox; index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds exactly one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'}).text
        print(str(index+1)+'.'+a)

# download_img(get_document("https://m.gufengmh8.com/search/?keywords="+str(input("Search comic: "))))
download_img(get_document("https://m.gufengmh8.com/search/?keywords=完美世界"))  # no explanation needed here

Running it prints this

1.捕获宠物娘的正确方法
2.吾猫当仙
3.百诡谈
4.完美世界PERFECTWORLD
5.诛仙·御剑行
6.洞仙歌
7.猫仙生
8.完美世界
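
The nested try/except in get_document above retries a failed request up to two more times. As a minimal sketch, the same idea can be written as a loop. fetch_with_retry and flaky_fetch are hypothetical names, and the fetcher is passed in as a function so the sketch runs without touching the network.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=3.0):
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # a real crawler would catch a narrower type
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # give the site time to recover
    raise last_error


# Stand-in fetcher (hypothetical): fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return b"<html>ok</html>"


data = fetch_with_retry(flaky_fetch, "https://m.gufengmh8.com/", delay=0)
print(data, calls["count"])
```

With requests, the fetcher could be something like lambda u: requests.get(u).content. Unlike the nested version, the number of attempts here is a parameter.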
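
3. We now have basic search working, which already counts as a simple crawler. Next we let the user enter a comic's number and then download it.
Open any comic and, using the same approach as before, you will find that the ul with id chapter-list-1 contains all the chapters. Each li in that ul contains an a tag and a span tag, which hold the chapter URL and the chapter name respectively, so we can keep writing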

def download_img(html):
    chapter_url_list = []
    soup = BeautifulSoup(html)  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # loop over itemBox; index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds exactly one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'})
        chapter_url = a['href']
        chapter_url_list.append(chapter_url)  # store every comic's URL
        print(str(index+1)+'.'+a.text)
    number = int(input('Enter the comic number: '))
    chapter_html = BeautifulSoup(get_document(chapter_url_list[number-1]))  # the printed numbers are 1 higher than the list indexes, so subtract 1 to get the chosen comic's URL, then fetch its chapter list page
    ul = chapter_html.find('ul', attrs={'id': 'chapter-list-1'})  # get the ul
    li_list = ul.find_all('li')  # get every li inside it
    for li in li_list:  # loop over the chapters
        li_a_href = li.find('a')['href']  # note that this URL is incomplete (relative): /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
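
Two details in the code above are easy to get wrong: the menu number is 1-based while the list index is 0-based, and the chapter href is relative. A small sketch of both follows; pick_by_number and absolute_url are hypothetical helper names, and the entries in book_urls are made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://m.gufengmh8.com"  # site root used in the post


def pick_by_number(items, number):
    """Map a 1-based menu number onto a 0-based list index."""
    if not 1 <= number <= len(items):
        raise ValueError("number out of range: %d" % number)
    return items[number - 1]


def absolute_url(href):
    """Resolve a relative href against the site root; absolute URLs pass through."""
    return urljoin(BASE_URL, href)


# Made-up entries standing in for the scraped comic URLs.
book_urls = [
    "https://m.gufengmh8.com/manhua/a/",
    "https://m.gufengmh8.com/manhua/b/",
]
print(pick_by_number(book_urls, 2))
print(absolute_url("/manhua/buhuochongwuniangdezhengquefangfa/1000845.html"))
```

Checking the range before indexing avoids silently picking the last comic when the user types 0, since index -1 is valid in Python.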
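
4. Now open any chapter and find where the comic image sits in the page

        chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href))
        chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
        img_src = chapter_content.find('img')['src']

Now we finally have the image's src, but there is one more problem: the chapter is paginated, so...

After some digging I found that when the requested page does not exist, the site shows a default placeholder image (https://res.xiaoqinre.com/images/default/cover.png). So we simply keep looping and stop once the image we get is that placeholder, like this ↓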
        i = 0  # page counter, 0 is the first page
        while True:
            li_a_href_replace = li_a_href
            if i != 0:  # the first page has no -i suffix
                li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))  # for https://m.gufengmh8.com/manhua/wanmeishijiePERFECTWORLD/549627.html, replacing "." with "-1." gives https://m.gufengmh8.com/manhua/wanmeishijiePERFECTWORLD/549627-1.html, which is the second page
            print(li_a_href_replace)
            chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace))
            chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
            img_src = chapter_content.find('img')['src']
            if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':  # placeholder image means we ran past the last page
                break
            i += 1
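
The replace('.', ...) trick above works because the relative chapter href contains only one dot. As a hedged alternative, the page suffix can be inserted before the extension only. page_href is a hypothetical helper name, and the "-N" naming scheme is the one described in this post.

```python
import posixpath


def page_href(chapter_href, page_index):
    """Return the href of the page_index-th page (0-based) of a chapter."""
    if page_index == 0:
        return chapter_href  # the first page has no suffix
    root, ext = posixpath.splitext(chapter_href)  # splits at the last dot only
    return "%s-%d%s" % (root, page_index, ext)


href = "/manhua/wanmeishijiePERFECTWORLD/549627.html"
for index in range(3):
    print(page_href(href, index))
```

This also keeps working if the href is ever a full URL, where str.replace would rewrite the dots in the host name as well.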

5. Now that we can get the src of every comic image, all that is left is downloading them. First create the directory

path = "d:/SanMu/"+book_name+'/'+li.text.replace('\n', '')

if not os.path.exists(path):
    os.makedirs(path)
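
Chapter names scraped from a page can contain characters that are not allowed in Windows file names, such as ":" or "?". The sketch below cleans the name before creating the directory. safe_name and chapter_dir are hypothetical helpers, the names are made up, and a temporary directory stands in for d:/SanMu/.

```python
import os
import re
import tempfile


def safe_name(text):
    """Strip whitespace and replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", text.strip())
    return cleaned or "untitled"


def chapter_dir(root, book_name, chapter_name):
    """Create <root>/<book>/<chapter> if needed and return its path."""
    path = os.path.join(root, safe_name(book_name), safe_name(chapter_name))
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


with tempfile.TemporaryDirectory() as root:  # temporary root, removed afterwards
    path = chapter_dir(root, "Perfect World", "\nChapter 1: Start?\n")
    created = os.path.isdir(path)
    leaf = os.path.basename(path)
    print(leaf, created)
```

os.makedirs(path, exist_ok=True) replaces the separate os.path.exists check.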
Then download it. Simple, right?
open(path+'/'+str(i)+'.jpg', 'wb').write(get_document(img_src))  # saved as d:/SanMu/<comic title>/<chapter name>/0.jpg
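
The one-liner above never closes the file explicitly. A minimal sketch of the same write with a with statement follows; save_image is a hypothetical helper, and the bytes are a placeholder rather than a downloaded image.

```python
import os
import tempfile


def save_image(directory, index, data):
    """Write image bytes to <directory>/<index>.jpg and return the file path."""
    file_path = os.path.join(directory, "%d.jpg" % index)
    with open(file_path, "wb") as handle:  # the file is closed even if write() raises
        handle.write(data)
    return file_path


with tempfile.TemporaryDirectory() as directory:
    fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"  # placeholder bytes, not downloaded data
    saved = save_image(directory, 0, fake_image)
    saved_name = os.path.basename(saved)
    with open(saved, "rb") as handle:
        round_trip = handle.read()
    print(saved_name, len(round_trip))
```

In the crawler, data would be the result of get_document(img_src).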

Finally, here is the complete code
import math
import threading
import time
import os
import requests
from bs4 import BeautifulSoup

# Split ls into `each` groups
def split_list(ls, each):
    groups = []
    eachExact = float(each)
    groupCount = int(len(ls) // each)
    groupCountExact = math.ceil(len(ls) / eachExact)
    start = 0
    for i in range(each):
        if i == each - 1 and groupCount < groupCountExact:  # if there is a remainder, put all remaining elements in the last group
            groups.append(ls[start:])
        else:
            groups.append(ls[start:start + groupCount])
        start = start + groupCount

    return groups

def get_document(url):
    # print(url)
    try:
        get = requests.get(url)
        data = get.content
        get.close()
    except requests.RequestException:
        time.sleep(3)
        try:
            get = requests.get(url)
            data = get.content
            get.close()
        except requests.RequestException:
            time.sleep(3)
            get = requests.get(url)
            data = get.content
            get.close()
    return data

def download_img(html):
    chapter_url_list = []
    soup = BeautifulSoup(html)  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # loop over itemBox; index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds exactly one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'})
        chapter_url = a['href']
        chapter_url_list.append(chapter_url)  # store every comic's URL
        print(str(index+1)+'.'+a.text)
    number = int(input('Enter the comic number: '))
    chapter_html_list = BeautifulSoup(get_document(chapter_url_list[number-1]))  # the printed numbers are 1 higher than the list indexes, so subtract 1 to get the chosen comic's URL, then fetch its chapter list page
    ul = chapter_html_list.find('ul', attrs={'id': 'chapter-list-1'})  # get the ul
    book_name = chapter_html_list.find('h1', attrs={'class': 'title'}).text  # get the comic's title
    li_list = ul.find_all('li')  # get every li inside it
    for li in li_list:  # loop over the chapters
        li_a_href = li.find('a')['href']  # note that this URL is incomplete (relative): /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
        i = 0
        path = "d:/SanMu/"+book_name+'/'+li.text.replace('\n', '')
        if not os.path.exists(path):
            os.makedirs(path)
        while True:
            li_a_href_replace = li_a_href
            if i != 0:
                li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))
            print(li_a_href_replace)
            chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace))
            chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
            img_src = chapter_content.find('img')['src']
            if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':
                break
            open(path+'/'+str(i)+'.jpg', 'wb').write(get_document(img_src))  # saved as d:/SanMu/<comic title>/<chapter name>/0.jpg
            i += 1

download_img(get_document("https://m.gufengmh8.com/search/?keywords="+str(input("Search comic: "))))
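
The split_list helper in the listing above is meant to cut a list into a fixed number of groups, with any remainder going into the last group. Here is a standalone sketch of that behavior; split_into_groups is a hypothetical name and the chapter names are made up.

```python
def split_into_groups(items, group_count):
    """Split items into group_count consecutive groups; the last takes the remainder."""
    if group_count <= 0:
        raise ValueError("group_count must be positive")
    size = len(items) // group_count
    groups = []
    for i in range(group_count):
        start = i * size
        if i == group_count - 1:
            groups.append(items[start:])  # the last group absorbs any leftover items
        else:
            groups.append(items[start:start + size])
    return groups


chapters = ["ch%d" % n for n in range(1, 8)]  # seven made-up chapter names
print(split_into_groups(chapters, 3))
```

With seven chapters and three groups, the first two groups get two chapters each and the last gets three. If there are more groups than items, the earlier groups come out empty.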

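That's it. I wonder whether anyone will read my post... anyone? no one? someone? (hint, hint)

三木猿 posted on 2020-9-2 14:03

Like it? Then bookmark it and drop a free rating. A newcomer would appreciate the support.

三木猿 posted on 2020-9-2 16:10

This post was last edited by 三木猿 on 2020-9-3 13:33

It looks like people like this, so remember to leave a free rating. Here is a multithreaded version, the kind that starts one thread per chapter. (That said, more threads is not always better: some sites have protective measures, and too many threads requesting at once may get your IP banned.)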
import math
import threading
import time
import os
import requests
from bs4 import BeautifulSoup

class myThread(threading.Thread):

    def __init__(self, split_dds, book_name, num):
        threading.Thread.__init__(self)
        self.book_name = book_name  # comic title, used for the download directory
        self.split_dds = split_dds  # the chapters this thread is responsible for
        self.num = num  # thread number as a string, used only for logging

    def run(self):
        print("Thread started: " + self.num)
        save_img(self.split_dds, self.book_name)
        print("Thread finished: " + self.num)

# Split ls into `each` groups
def split_list(ls, each):
    groups = []
    eachExact = float(each)
    groupCount = int(len(ls) // each)
    groupCountExact = math.ceil(len(ls) / eachExact)
    start = 0
    for i in range(each):
        if i == each - 1 and groupCount < groupCountExact:  # if there is a remainder, put all remaining elements in the last group
            groups.append(ls[start:])
        else:
            groups.append(ls[start:start + groupCount])
        start = start + groupCount

    return groups

def get_document(url):
    # print(url)
    try:
        get = requests.get(url)
        data = get.content
        get.close()
    except requests.RequestException:
        time.sleep(3)
        try:
            get = requests.get(url)
            data = get.content
            get.close()
        except requests.RequestException:
            time.sleep(3)
            get = requests.get(url)
            data = get.content
            get.close()
    return data

def save_img(li_list_split, book_name):
    for li in li_list_split:  # loop over the chapters assigned to this thread
        li_a_href = li.find('a')['href']  # note that this URL is incomplete (relative): /manhua/buhuochongwuniangdezhengquefangfa/1000845.html
        i = 0
        path = "d:/SanMu/"+book_name+'/'+li.text.replace('\n', '')
        if not os.path.exists(path):
            os.makedirs(path)
        while True:
            li_a_href_replace = li_a_href
            if i != 0:
                li_a_href_replace = li_a_href.replace('.', ('-' + str(i) + '.'))
            chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace), 'lxml')
            chapter_content = chapter_html.find('div', attrs={'class': 'chapter-content'})
            img_src = chapter_content.find('img')['src']
            if img_src == 'https://res.xiaoqinre.com/images/default/cover.png':
                break
            open(path+'/'+str(i)+'.jpg', 'wb').write(get_document(img_src))  # saved as d:/SanMu/<comic title>/<chapter name>/0.jpg
            i += 1

def download_img(html):
    chapter_url_list = []
    soup = BeautifulSoup(html, 'lxml')  # BeautifulSoup pairs nicely with requests
    itemBox = soup.find_all('div', attrs={'class': 'itemBox'})  # find_all returns a list
    for index, item in enumerate(itemBox):  # loop over itemBox; index is the position in the list, item is the element
        itemTxt = item.find('div', attrs={'class': 'itemTxt'})  # each itemBox holds exactly one itemTxt, so find is enough
        a = itemTxt.find('a', attrs={'class': 'title'})
        chapter_url = a['href']
        chapter_url_list.append(chapter_url)  # store every comic's URL
        print(str(index+1)+'.'+a.text)
    number = int(input('Enter the comic number: '))
    chapter_html_list = BeautifulSoup(get_document(chapter_url_list[number-1]), 'lxml')  # the printed numbers are 1 higher than the list indexes, so subtract 1 to get the chosen comic's URL, then fetch its chapter list page
    ul = chapter_html_list.find('ul', attrs={'id': 'chapter-list-1'})  # get the ul
    book_name = chapter_html_list.find('h1', attrs={'class': 'title'}).text  # get the comic's title
    li_list = ul.find_all('li')  # get every li inside it
    thread_list = []
    thread_count = split_list(li_list, len(li_list))  # one thread per chapter; len(li_list) can be replaced with a fixed number of threads
    for num, li_list_split in enumerate(thread_count):  # create one thread for each sublist
        thread = myThread(li_list_split, book_name, str(num))
        thread_list.append(thread)
    for thread in thread_list:
        thread.start()
    for thread in thread_list:
        thread.join()  # wait for every thread to finish

download_img(get_document("https://m.gufengmh8.com/search/?keywords="+str(input("Search comic: "))))
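
As noted above, one thread per chapter can be too aggressive for some sites. One way to cap concurrency is a thread pool from the standard library. This is a sketch, not the code from this post: MAX_WORKERS is an assumed limit, and download_chapter is a stand-in that sleeps instead of downloading.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

MAX_WORKERS = 4  # assumed cap; tune to what the target site tolerates


def download_chapter(chapter_name):
    """Stand-in for save_img: pretends to download one chapter."""
    time.sleep(0.01)  # simulate network latency
    return chapter_name, threading.current_thread().name


chapters = ["chapter-%d" % n for n in range(1, 21)]  # twenty made-up chapters

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(download_chapter, chapters))  # results keep input order

threads_used = {thread_name for _, thread_name in results}
print(len(results), len(threads_used) <= MAX_WORKERS)
```

The pool reuses at most MAX_WORKERS threads no matter how many chapters there are, and leaving the with block waits for all of them, so no explicit join loop is needed.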

三木猿 posted on 2020-9-3 13:36

GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

This message doesn't actually affect anything; it is a warning, not an error. If it bothers you, pass 'lxml' to the BeautifulSoup constructor (the lxml package must be installed): soup = BeautifulSoup(html, 'lxml')

相信无限活宝 posted on 2020-9-3 12:20

I'm getting an error and have no idea what's going on, haha

GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 74 of the file C:/Users/1/PycharmProjects/untitled2/爬漫画.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

chapter_html = BeautifulSoup(get_document('https://m.gufengmh8.com' + li_a_href_replace))

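kaixin15A posted on 2020-9-2 14:09

Impressive, thanks for sharing, you have my support

星雨星 posted on 2020-9-2 14:19

Thanks for sharing!

z1991627 posted on 2020-9-2 14:39

Impressive, thanks for sharing, you have my support

一人之下123456 posted on 2020-9-2 14:51

Thanks for sharing, I'll study this. Thank you, OP

yingsummery posted on 2020-9-2 15:04

Wow, impressive, though reading it makes my head spin. I'll look into it when I have time

传说中的yang哥 posted on 2020-9-2 15:14

Nice, nice

喵小盼 posted on 2020-9-2 15:18

I'll study this, thanks OP

哭泣滴梦 posted on 2020-9-2 15:21

Reading this gives me a headache. Looks like programmers all need a powerful brain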


吾爱破解 - 52pojie.cn's Archiver
