Crawling every novel on the Qisuu site (奇书网) with the Scrapy framework: learn how to use Scrapy in one post!
By default, Scrapy cannot be debugged directly inside an IDE. To set up debugging:
Create a new .py file named entrypoint.py in the project root and put the following in it:
from scrapy.cmdline import execute
# equivalent to running "scrapy crawl qisuu" on the command line
execute(['scrapy', 'crawl', 'qisuu'])
The IDE I use here is PyCharm.
To debug, just run the entrypoint.py file directly!
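For reference, entrypoint.py sits in the project root next to scrapy.cfg. A sketch of the assumed layout (the outer folder name is my assumption; the rest is the standard structure generated by scrapy startproject):

qisuu_project/              # assumed project folder name
    scrapy.cfg
    entrypoint.py           # the debug entry point created above
    qisuu_project/
        items.py
        pipelines.py
        settings.py
        spiders/
            qisuu.py        # the spider shown below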
The main spider program, qisuu.py:
import re
import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request
class Myspider(scrapy.Spider):
    name = 'qisuu'
    allowed_domains = ['qisuu.la']
    bash_url = 'https://www.qisuu.la/soft/sort0'
    bashurl = '.html'

    def start_requests(self):
        # 10 categories: /soft/sort01/index_1.html ... /soft/sort010/index_1.html
        for i in range(1, 11):
            url = f'{self.bash_url}{i}/index_1{self.bashurl}'
            yield Request(url, self.parse)

    def parse(self, response):
        # read the last page number from the "尾页" (last page) link
        max_num = re.findall(r"下一页</a>.+?<a href='/soft/sort0.+?/index_(.+?).html'>尾页</a>", response.text, re.S)[0]
        bashurl = str(response.url)[:-6]  # strip the trailing "1.html", leaving ".../index_"
        for i in range(1, int(max_num) + 1):
            url = f'{bashurl}{i}{self.bashurl}'
            yield Request(url, callback=self.get_name)

    def get_name(self, response):
        lis = BeautifulSoup(response.text, 'lxml').find('div', class_="listBox").find_all('li')
        for li in lis:
            novelname = li.find('a').get_text()                       # novel title
            novelinformation = li.find('div', class_="s").get_text()  # novel info
            novelintroduce = li.find('div', class_="u").get_text()    # novel synopsis
            novelurl = f"https://www.qisuu.la{li.find('a')['href']}"  # novel detail page
            yield Request(novelurl, callback=self.get_chapterurl, meta={'name': novelname, 'url': novelurl})

    def get_chapterurl(self, response):
        # novelname = BeautifulSoup(response.text, 'lxml').find('h1').get_text()
        novelname = str(response.meta['name'])
        # the detail_right box lists the fields in the order indexed below
        lis = BeautifulSoup(response.text, 'lxml').find('div', class_="detail_right").find_all('li')
        noveclick = lis[0].get_text()                     # click count
        novefilesize = lis[1].get_text()                  # file size
        novefiletype = lis[2].get_text()                  # book type
        noveupatedate = lis[3].get_text()                 # update date
        novestate = lis[4].get_text()                     # serialization status
        noveauthor = lis[5].get_text()                    # author
        novefile_running_environment = lis[6].get_text()  # running environment
        lis = BeautifulSoup(response.text, 'lxml').find('div', class_="showDown").find_all('li')
        novefile_href = re.findall(r"'.+?','(.+?)','.+?'", str(lis[-1]), re.S)  # download link(s)
        print(novelname)
        print(noveclick)
        print(novefilesize)
        print(novefiletype)
        print(noveupatedate)
        print(novestate)
        print(noveauthor)
        print(novefile_running_environment)
        print(novefile_href)
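The spider above only prints the extracted fields. In normal Scrapy usage you would put them into an Item and yield it, so that pipelines (or the -o output option) can store the data. Below is a minimal sketch under that assumption; the class and field names are my own choices, not from the original post:

import scrapy

class QisuuItem(scrapy.Item):
    # field names are assumptions, chosen to match the variables printed above
    name = scrapy.Field()                 # novel title
    click_count = scrapy.Field()          # click count
    file_size = scrapy.Field()            # file size
    file_type = scrapy.Field()            # book type
    update_date = scrapy.Field()          # update date
    state = scrapy.Field()                # serialization status
    author = scrapy.Field()               # author
    running_environment = scrapy.Field()  # running environment
    download_url = scrapy.Field()         # download link

In get_chapterurl you would then replace the print() calls with something like item = QisuuItem(name=novelname, click_count=noveclick, download_url=novefile_href) and yield item.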
Method source: Cui Qingcai (静觅), 小白进阶之Scrapy第一篇 (Scrapy for Beginners, Part 1)
https://cuiqingcai.com/3472.html/3
If you're interested, feel free to follow along and try it yourself!
Always happy to discuss Python together, so leave a comment if you want to talk it over!
If this post helped you, please give it a rating if you can! You can rate once a day, and that quick click gives sharers more motivation to keep posting!