Crawling every novel on Qisuu (奇书网) with the Scrapy framework: learn Scrapy usage in one post!

huguo002 posted on 2019-9-19 18:29

Last edited by huguo002 on 2019-9-19 18:36

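Scrapy spiders are normally started from the command line, so by default they cannot be run or debugged directly in an IDE. The workaround:
Create a new .py file named entrypoint.py in the project root directory and put the following in it: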
from scrapy.cmdline import execute
execute(['scrapy','crawl','qisuu'])

My IDE here is PyCharm.
To debug, simply run the entrypoint.py file!
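
A slightly more defensive variant of entrypoint.py could look like the sketch below. The helper name `build_argv` and the `scrapy.cfg` check are illustrative assumptions, not part of the original post; the check assumes the IDE starts the script with the project root as its working directory.

```python
import os


def build_argv(spider_name, *extra_args):
    """Build the argument list equivalent to `scrapy crawl <spider_name> ...`."""
    return ['scrapy', 'crawl', spider_name, *extra_args]


def main():
    # Imported here so that merely importing this file does not require Scrapy.
    from scrapy.cmdline import execute
    execute(build_argv('qisuu'))


if __name__ == '__main__':
    # Assumption: the working directory is the project root, where scrapy.cfg lives.
    if os.path.exists('scrapy.cfg'):
        main()
    else:
        print('scrapy.cfg not found: run this file from the Scrapy project root')
```

With this in place, set a breakpoint inside any spider callback and start entrypoint.py under the IDE's debugger.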

Main spider program, qisuu.py

import re

import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request


class Myspider(scrapy.Spider):
    name = 'qisuu'
    allowed_domains = ['qisuu.la']
    bash_url = 'https://www.qisuu.la/soft/sort0'  # category URL prefix
    bashurl = '.html'  # URL suffix

    def start_requests(self):
        # Request the first index page of each of the 10 categories
        for i in range(1, 11):
            url = f'{self.bash_url}{i}/index_1{self.bashurl}'
            yield Request(url, self.parse)

    def parse(self, response):
        # The "尾页" (last page) link carries the highest page number of the category
        max_num = re.findall(r"下一页</a>.+?<a href='/soft/sort0.+?/index_(.+?).html'>尾页</a>", response.text, re.S)
        if not max_num:
            return
        bashurl = str(response.url)[:-6]  # strip the trailing "1.html"
        # re.findall returns a list, so take its first element
        for i in range(1, int(max_num[0]) + 1):
            url = f'{bashurl}{i}{self.bashurl}'
            # dont_filter=True: page 1 was already requested in start_requests,
            # so the duplicate filter would otherwise drop it
            yield Request(url, callback=self.get_name, dont_filter=True)

    def get_name(self, response):
        lis = BeautifulSoup(response.text, 'lxml').find('div', class_="listBox").find_all('li')
        for li in lis:
            novelname = li.find('a').get_text()  # novel title
            novelinformation = li.find('div', class_="s").get_text()  # novel information
            novelintroduce = li.find('div', class_="u").get_text()  # novel synopsis
            novelurl = f"https://www.qisuu.la{li.find('a')['href']}"  # novel link
            yield Request(novelurl, callback=self.get_chapterurl, meta={'name': novelname, 'url': novelurl})

    def get_chapterurl(self, response):
        # novelname = BeautifulSoup(response.text, 'lxml').find('h1').get_text()
        novelname = str(response.meta['name'])
        soup = BeautifulSoup(response.text, 'lxml')
        lis = soup.find('div', class_="detail_right").find_all('li')
        # The list indices were lost in the post; 0-6 assume the <li> order matches the fields below
        noveclick = lis[0].get_text()  # click count
        novefilesize = lis[1].get_text()  # file size
        novefiletype = lis[2].get_text()  # book type
        noveupatedate = lis[3].get_text()  # update date
        novestate = lis[4].get_text()  # serialization status
        noveauthor = lis[5].get_text()  # author
        novefile_running_environment = lis[6].get_text()  # runtime environment
        lis = soup.find('div', class_="showDown").find_all('li')
        novefile_href = re.findall(r"'.+?','(.+?)','.+?'", str(lis[-1]), re.S)  # novel download URL (a list)
        print(novelname)
        print(noveclick)
        print(novefilesize)
        print(novefiletype)
        print(noveupatedate)
        print(novestate)
        print(noveauthor)
        print(novefile_running_environment)
        print(novefile_href)
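
The pagination step in parse() can be tried without Scrapy or a network connection. The following is a minimal sketch using only the standard library; `SAMPLE_HTML`, `last_page_number`, and `page_urls` are hypothetical names, the sample markup is made up to fit the pattern above, and the pattern is tightened to digits as an assumption about the page format.

```python
import re

# Hypothetical fragment of a category index page, shaped to match the pattern used in parse().
SAMPLE_HTML = (
    "<a href='/soft/sort01/index_2.html'>下一页</a> "
    "<a href='/soft/sort01/index_57.html'>尾页</a>"
)

LAST_PAGE_RE = re.compile(
    r"下一页</a>.+?<a href='/soft/sort0\d+/index_(\d+)\.html'>尾页</a>", re.S
)


def last_page_number(html):
    """Return the page number in the "尾页" (last page) link, or None if absent."""
    match = LAST_PAGE_RE.search(html)
    return int(match.group(1)) if match else None


def page_urls(first_page_url, last_page):
    """Expand .../index_1.html into the URLs of pages 1..last_page."""
    prefix = first_page_url.rsplit('_', 1)[0] + '_'
    return [f'{prefix}{i}.html' for i in range(1, last_page + 1)]


last = last_page_number(SAMPLE_HTML)
urls = page_urls('https://www.qisuu.la/soft/sort01/index_1.html', last)
print(last, urls[0], urls[-1])
```

Returning None when the link is missing lets the caller skip a category rather than crash on an empty match.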

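The download link is pulled out of the last `<li>` with a regular expression, and `re.findall` returns a list rather than a string. Below is a small sketch of that step in isolation; the `<li>` markup, the `get_down_url` call inside it, and the example.com address are all hypothetical stand-ins, since the real page may differ.

```python
import re

# Hypothetical <li> markup: three single-quoted arguments, the middle one being the URL.
SAMPLE_LI = (
    "<li><script type=\"text/javascript\">"
    "get_down_url('mirror','https://example.com/files/book.txt','book');"
    "</script></li>"
)


def extract_download_url(li_html):
    """Return the middle quoted argument, or None when nothing matches."""
    matches = re.findall(r"'.+?','(.+?)','.+?'", li_html, re.S)
    return matches[0] if matches else None


print(extract_download_url(SAMPLE_LI))
```
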
Method source: Cui Qingcai (崔庆才), Jingmi (静觅) blog, "Scrapy for Beginners Leveling Up, Part 1" (小白进阶之Scrapy第一篇)
https://cuiqingcai.com/3472.html/3

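If you are interested, give it a try using this as a reference!
You are also welcome to discuss Python with me!
Comments and discussion are welcome!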

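If this helped you, please consider leaving a free rating!
You can rate once a day, and a quick click gives the people who share more motivation!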

ZIfumaker posted on 2019-9-19 20:19

Following along and learning.

cqx754810735 posted on 2019-12-19 19:06

Bumping this thread.

ghoob321 posted on 2019-12-25 09:47

It seems to be incomplete.


吾爱破解 - 52pojie.cn's Archiver
