Scrapy is really convenient: grabbing every photo set on a page

mayixb posted on 2021-2-6 14:14

Last edited by mayixb on 2021-2-6 18:40

1. First, create a project:
scrapy startproject firstblood

2. Enter the project directory and create the spider
scrapy genspider first www.xxx.com

3. Code for the spider file
# -*- coding: utf-8 -*-
import os

import scrapy
from firstblood.items import FirstbloodItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.xiurenji.cc/YouWu/']

    def parse(self, response):
        # Select the <a> tag of every photo set on the current page
        page_list = response.xpath('//div[@class="dan"]/a')
        # Loop over the photo set links
        for page in page_list:
            # Absolute URL of the photo set page
            page_url = response.urljoin(page.xpath('./@href').extract_first())
            # Title of the photo set
            title = page.xpath('./@title').extract_first()
            # Build the path and create the local folder
            save_dir = f'./down/{title}'
            os.makedirs(save_dir, exist_ok=True)
            # Request the photo set page manually, passing its folder along in meta
            yield scrapy.Request(url=page_url, callback=self.parse_page, meta={'dir': save_dir})

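    # Parse a photo set page
    def parse_page(self, response):
        # The pager appears twice (above and below the photos), so keep only the first div
        pager = response.xpath('(//div[@class="page"])[1]')
        # Folder passed in from parse(); it has to be handed on to the next callback
        save_dir = response.meta['dir']
        # Loop over every page link of the photo set
        for href in pager.xpath('./a/@href').extract():
            # Absolute URL of this page of the photo set
            url = response.urljoin(href)
            # dont_filter=True disables duplicate filtering here. The first page was already
            # requested once, so without it that page is dropped and a few photos go missing.
            yield scrapy.Request(url=url, callback=self.get_img_url, meta={'dir': save_dir}, dont_filter=True)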

    # Parse one page of a photo set
    def get_img_url(self, response):
        # URLs of all photos on this page
        img_list = response.xpath('//div[@class="img"]//img/@src').extract()
        # Photo set folder passed in from parse_page()
        save_dir = response.meta['dir']
        # Loop over the photo URLs on this page
        for src in img_list:
            # File name of the image
            name = src.split('/')[-1]
            # Join folder and file name to get the local save path of this photo
            path = f'{save_dir}/{name}'
            # Absolute URL of the photo
            url = response.urljoin(src)
            # Request the photo itself, passing the local save path along
            yield scrapy.Request(url=url, meta={'path': path}, callback=self.down_img)

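    def down_img(self, response):
        # Create the item instance
        item = FirstbloodItem()
        # Store the save path and the photo content (the 'url' field holds the raw bytes)
        item['path'] = response.meta['path']
        item['url'] = response.body
        # Hand the item to the pipeline with yield; I used return at first and it was a lot slower
        yield item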
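
The spider above builds the folder name straight from the title and the file name from the last URL segment. A title containing characters such as / or ? would break the folder path on some systems, so here is a minimal sketch of how the path building could be made safer. The helper names safe_dirname and image_save_path are hypothetical and not part of the original code or of Scrapy; only the standard library is used.

```python
import os
import re
from urllib.parse import urlsplit

# Characters that Windows does not allow in file or folder names.
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_dirname(title, fallback="untitled"):
    """Return a folder-safe version of a title (hypothetical helper)."""
    # Replace illegal characters, then trim whitespace and trailing dots.
    cleaned = _ILLEGAL.sub("_", title or "").strip().rstrip(".")
    return cleaned or fallback


def image_save_path(base_dir, title, img_url):
    """Build the local path for one image from its title and URL."""
    # Take the last path segment of the URL, ignoring any query string.
    name = os.path.basename(urlsplit(img_url).path)
    return os.path.join(base_dir, safe_dirname(title), name)
```

In the spider, the result of safe_dirname(title) could take the place of the raw title when the ./down/ folder is created.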

4. Define the fields in the items file
import scrapy


class FirstbloodItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()   # raw image bytes
    path = scrapy.Field()  # local save path

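5. The pipeline file
import time


class FirstbloodPipeline(object):

    def open_spider(self, spider):
        print('Spider started'.center(40, '='))
        # Remember the start time so the elapsed time can be reported at the end
        self.start = time.perf_counter()

    def process_item(self, item, spider):
        # Local save path of this photo
        path = item['path']
        # Write the image bytes to disk
        with open(path, 'wb') as f:
            f.write(item['url'])
        # Return the item so any later pipeline still receives it
        return item

    def close_spider(self, spider):
        print('Spider finished'.center(40, '='))
        # Elapsed time of the crawl in seconds
        print(f'{time.perf_counter() - self.start:.2f} s')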
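
The pipeline opens the target path directly, so it relies on the spider having created the folder first, and it rewrites files that already exist. As a rough sketch of an alternative, the write step could create missing folders itself and skip files that are already on disk. The function write_image is a hypothetical helper, not something from the original post or from Scrapy.

```python
from pathlib import Path


def write_image(path, data, overwrite=False):
    """Write image bytes to path; return True if a file was written."""
    target = Path(path)
    # Skip files that were already downloaded unless overwrite is requested.
    if target.exists() and not overwrite:
        return False
    # Create the parent folder (and any missing ancestors) before writing.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return True
```

Inside process_item, the with open(...) block could then become a single call such as write_image(item['path'], item['url']).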

6. In the settings file, enable the pipeline and set a User-Agent so the requests look like a browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'firstblood.pipelines.FirstbloodPipeline': 300,
}
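
Crawl speed is also governed by settings.py. The names below are standard Scrapy settings, but the values are illustrative assumptions only and are not tuned for this or any other site.

```python
# settings.py (sketch): example values, chosen only for illustration
CONCURRENT_REQUESTS = 32            # upper bound on requests in flight overall
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # upper bound on requests in flight per domain
DOWNLOAD_DELAY = 0.25               # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
```

Higher concurrency can shorten a crawl, at the cost of more load on the target server.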

7. Run this command from inside the project directory
scrapy crawl first

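I only tested it on a single page, since downloading the whole site is too slow. To be honest, I never looked at the downloaded pictures; I just enjoy the process of downloading.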

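Later I found a couple of problems. The pager element appears twice on the page, and using just one of them is enough.
The crawl also came back 3 photos short. The first page of the photo set had already been requested while collecting the page links, so the later request for its photos was filtered out as a duplicate. Adding dont_filter=True allows the repeated request.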
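
To make the duplicate-filter behaviour concrete, here is a toy model written with the standard library only. It is a simplified sketch under my own assumptions, not Scrapy's actual implementation: the class SeenFilter and the example URL are hypothetical.

```python
import hashlib


class SeenFilter:
    """Toy model of a duplicate filter: remembers one fingerprint per URL."""

    def __init__(self):
        self.seen = set()

    def should_fetch(self, url, dont_filter=False):
        """Return True if the request would go out, False if it is dropped."""
        if dont_filter:
            # Bypass the filter entirely, like dont_filter=True on a Request.
            return True
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        return True


# Hypothetical URL of the first page of a photo set.
first_page = "https://www.xiurenji.cc/YouWu/1.html"
flt = SeenFilter()
results = [
    flt.should_fetch(first_page),                    # first request goes out
    flt.should_fetch(first_page),                    # same URL again is dropped
    flt.should_fetch(first_page, dont_filter=True),  # filter bypassed
]
```

The second call is the situation described above: the pager links back to a page that was already fetched, so its photos are never requested unless the filter is bypassed.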

tkingoo posted on 2021-3-15 08:19

mayixb posted on 2021-3-13 18:47
In : response.xpath('//*[@id="head"]/title/text()')
Out: []

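Thanks! One more question. I want to scrape a mobile app, but after installing and trusting the Fiddler certificate on iOS and setting a proxy on the Wi-Fi connection, the phone can no longer get online and every request times out. How can I fix this?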

mayixb posted on 2021-3-13 18:47

tkingoo posted on 2021-3-12 22:44
Could you help me see why this returns an empty result?

In : response.xpath('//*[@id="head"]/title/text()')
Out: [<Selector xpath='//*[@id="head"]/title/text()' data='学而思网校-每天进步一点点'>]

拉比克 posted on 2021-2-6 14:26

How would you scrape a screenshot of each streamer's room on Douyu?

Mr.Gavin posted on 2021-2-6 14:38

I'll give this a try when I get home from work :lol

James0 posted on 2021-2-6 14:58

Thanks for sharing, much appreciated.

王成 posted on 2021-2-6 15:51

I have to say I didn't understand it.

龍謹 posted on 2021-2-6 16:08

Thanks, the comments are really detailed.

dbu00956 posted on 2021-2-6 16:09

"I never looked at the downloaded pictures; I just enjoy the process of downloading."

:handshake

Jack-yu posted on 2021-2-6 16:38

Wow, so this is what motivates you to learn web scraping, right?

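axin1999 posted on 2021-2-6 17:01

拉比克 posted on 2021-2-6 14:26
How would you scrape a screenshot of each streamer's room on Douyu?

You can collect each room number, use selenium to drive a browser and open the room, then take a screenshot with the pillow module and post-process it. That should do it.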

super谦 posted on 2021-2-6 17:25

拉比克 posted on 2021-2-6 14:26
How would you scrape a screenshot of each streamer's room on Douyu?

Use selenium.


吾爱破解 - 52pojie.cn's Archiver
