Scrapy crawler in practice: easily download food, beauty, and avatar images - 吾爱破解 - 52pojie.cn

梦幻嘟嘟 posted on 2019-11-11 18:05

Scrapy crawler in practice: easily download food, beauty, and avatar images

Python: Scrapy crawler in practice
Preparation

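* Python 3 development environment
* The Python packages used:
import scrapy
from PhotoSpider.items import PhotospiderItem
import re
from urllib.request import *
re and urllib.request ship with Python, and PhotoSpider.items is the project's own module; scrapy itself has to be installed separately (for example with pip install scrapy)
* IDE: PyCharm is used here
* Google Chrome: for inspecting pages and network traffic (any other browser works too)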

Analyzing the page structure

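Site: https://www.mn52.com/ (mn52 图库网, a legitimate site, honestly!)

It is an image site whose content is almost entirely pictures, split into many categories. Every category uses the same page layout, so I picked the avatar category (头像集) for the analysis (I'm in an office, so the beauty pictures can wait).

Avatar category: https://www.mn52.com/txj/

First, work out how the image URLs are reached. To get the final image URL, click any of the 4 × 7 image sets on the avatar category page to open its detail page; there you can use the browser's Inspect tool to see the URL of each image.

So the plan is: start at https://www.mn52.com/txj/ → collect the URLs of the 4 × 7 detail pages → visit each detail page → collect the original-image URLs from the row of images on that page → download them.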
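
One detail of this flow is worth checking before writing the spider: links on this site can appear without a scheme (the sample image URL quoted in the spider code starts with //), so each one has to be turned into a full https URL before it can be requested. A minimal sketch using only the standard library; absolute_url is a hypothetical helper and the example.html path is made up for illustration, not a real page:

```python
from urllib.parse import urljoin

LISTING_URL = 'https://www.mn52.com/txj/'


def absolute_url(href, base=LISTING_URL):
    """Resolve an href found on a page into a full URL."""
    return urljoin(base, href)


# Scheme-relative link: the scheme is taken from the base URL
print(absolute_url('//www.mn52.com/txj/example.html'))
# Site-relative link: scheme and host are taken from the base URL
print(absolute_url('/txj/example.html'))
```

Both calls print https://www.mn52.com/txj/example.html.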

Ahem. Next up is setting up the framework...

Setting up the scrapy framework

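Open cmd, cd to your Python working directory, and run the following command to create a project named PhotoSpider (you can of course pick any name)
scrapy startproject PhotoSpider

A PhotoSpider folder now appears in that directory. Still in cmd, run the following to create the Spider class
cd PhotoSpider
scrapy genspider getPhotoSpider mn52.com

getPhotoSpider is the spider we will run. All the files now exist, so what remains is filling in the blanks.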

Full code

Start with items.py, which defines the fields we want to collect
import scrapy

class PhotospiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Just follow the pattern above
    # Image name
    photo_id = scrapy.Field()
    # Image download URL
    photo_url = scrapy.Field()

Next comes the most important file, getPhotoSpider.py, which fetches the pages and filters out what we want
# -*- coding: utf-8 -*-
import scrapy
from PhotoSpider.items import PhotospiderItem
import re

class GetphotospiderSpider(scrapy.Spider):
    name = 'getPhotoSpider'
    allowed_domains = ['mn52.com']
    # URL of the avatar category; change txj to the category you want
    start_urls = ['https://www.mn52.com/txj/']

    # __init__ stores the current page number
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.page_index = 1

    def parse(self, response):
        # Fill in the blanks: use XPath on the listing page to pull out the detail-page URLs
        found = False
        for photo in response.xpath('//div[@class="content"]/div/div'):
            url = photo.xpath('./div/a/@href').extract_first()
            if not url:
                continue
            found = True
            # The href has no scheme, so it cannot be requested as-is;
            # urljoin resolves it against the page URL and adds https:
            url_new = response.urljoin(url)
            # Hand the detail page to parse_detail
            yield scrapy.Request(url_new, callback=self.parse_detail, dont_filter=True)
        # Request the next listing page once per listing page; stop when a page has no entries
        if found:
            self.page_index += 1
            # 12 is the id of the avatar category (txj). When switching category, change
            # both txj and 12 (for food images: mstp and 10)
            next_link = 'https://www.mn52.com/txj/list_12_' + str(self.page_index) + '.html'
            yield scrapy.Request(next_link, callback=self.parse)

    def parse_detail(self, response):
        # Extract photo_url, the real download link of each image
        for photos in response.xpath('//div[@id="originalpic"]/img'):
            # Take the file name without its extension, e.g. 8-1ZZ6094322-53 from
            # //image.mn52.com/img/allimg/190906/8-1ZZ6094322-53.jpg
            pattern = r'([^/]+)\.\w+$'
            # PhotospiderItem is the container defined in items.py
            item = PhotospiderItem()
            item['photo_url'] = photos.xpath('./@src').extract_first()
            item['photo_id'] = re.search(pattern, item['photo_url']).group(1)
            yield item
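
The photo_id extraction can be tried outside Scrapy. The sketch below is an alternative that avoids the regular expression and simply takes the file name without its extension; photo_id_from_url is a hypothetical helper name, not part of the project above:

```python
import posixpath
from urllib.parse import urlsplit


def photo_id_from_url(photo_url):
    """Return the file name of an image URL without its extension."""
    path = urlsplit(photo_url).path
    file_name = posixpath.basename(path)
    stem, _extension = posixpath.splitext(file_name)
    return stem


sample = '//image.mn52.com/img/allimg/190906/8-1ZZ6094322-53.jpg'
print(photo_id_from_url(sample))  # 8-1ZZ6094322-53
```

Using the whole file name keeps images whose names differ only in the trailing number from ending up with the same photo_id and overwriting each other on disk.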

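Then comes pipelines.py, the item-processing step, which downloads the images
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from urllib.request import build_opener, install_opener, urlopen

class PhotospiderPipeline(object):

    def process_item(self, item, spider):
        print('--------------' + item['photo_id'])
        # Prefix the download address with https:, otherwise it cannot be requested
        real_url = 'https:' + item['photo_url']
        # Set the User-Agent header here (anti-anti-scraping). Setting it in a Scrapy
        # middleware had no effect, most likely because urlopen does not go through
        # Scrapy's downloader, so the header is added directly
        opener = build_opener()
        opener.addheaders = [('User-Agent',
                              'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
        install_opener(opener)
        print('Downloading ' + real_url)
        # Save into the images folder under the project directory
        with urlopen(real_url) as result:
            data = result.read()
        # File naming: to keep things simple, everything is saved with a .jpg extension
        with open("images/" + item['photo_id'] + '.jpg', 'wb') as f:
            f.write(data)
        print('Download finished')
        # Return the item so that any later pipeline still receives it
        return item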
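
As an alternative sketch of the download step, the header can be attached to a single request instead of installing a global opener, and the images folder can be created on demand. build_image_request, save_image, the 30-second timeout, and the shortened User-Agent string are illustrative assumptions, not part of the original pipeline:

```python
import os
from urllib.request import Request, urlopen

USER_AGENT = 'Mozilla/5.0'  # placeholder; any browser-like string can be used


def build_image_request(photo_url):
    """Build a request for a scheme-relative image URL with a User-Agent header."""
    return Request('https:' + photo_url, headers={'User-Agent': USER_AGENT})


def save_image(photo_id, photo_url, folder='images'):
    """Download one image and write it to <folder>/<photo_id>.jpg."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    target = os.path.join(folder, photo_id + '.jpg')
    with urlopen(build_image_request(photo_url), timeout=30) as response:
        data = response.read()
    with open(target, 'wb') as f:
        f.write(data)
    return target
```

Inside process_item this would be called as save_image(item['photo_id'], item['photo_url']).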

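The fill-in-the-blank part is basically done. Now register the pipeline in settings.py (the block may be commented out with #; just remove the #. The number 300 only sets the execution order, so its exact value is not important)
ITEM_PIPELINES = {
    'PhotoSpider.pipelines.PhotospiderPipeline': 300,
}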

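That's all the code.
Don't forget to create an images folder under the PhotoSpider project directory, that is, under E:\PycharmProjects\xxxxx\PhotoSpider

Finally, the most important step: running it. Open cmd, cd to the project directory (the same one as above), and run
scrapy crawl getPhotoSpider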

The result:
https://attach.52pojie.cn//forum/201911/11/174808cuiui4hhy93zcece.png?l

https://attach.52pojie.cn//forum/201911/11/175032jxzomqozoucimmc8.png?l

All done!

Things to improve

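Every time you crawl a new category you have to edit start_urls and next_link in getPhotoSpider.py
You can also package the project as an .exe; a web search explains the process better than I can, so I won't package it here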
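
One way to reduce that editing is to derive both URLs from a single category description. A minimal sketch: the two (slug, id) pairs follow the comment in the spider (txj with 12 for avatars, mstp with 10 for food images), while CATEGORIES, listing_url, and the key names are hypothetical:

```python
# Slug and numeric id per category, as described in the spider's comment
CATEGORIES = {
    'avatars': ('txj', 12),
    'food': ('mstp', 10),
}


def listing_url(category, page=1):
    """Return the listing-page URL for a category and a 1-based page number."""
    slug, category_id = CATEGORIES[category]
    base = 'https://www.mn52.com/%s/' % slug
    if page == 1:
        return base
    return '%slist_%d_%d.html' % (base, category_id, page)


print(listing_url('avatars'))   # https://www.mn52.com/txj/
print(listing_url('food', 2))   # https://www.mn52.com/mstp/list_10_2.html
```

In the spider, the category could then be chosen at launch with Scrapy's spider arguments (scrapy crawl getPhotoSpider -a category=food), which are passed to the spider's __init__ as keyword arguments.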

Afterword

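For a more detailed walkthrough, see the thread "美女图片爬虫实战--轻松爬取几万张美女图片" (a crawler walkthrough that collects tens of thousands of images); its author covers far more detail than I do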

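Py破解群众 posted on 2019-11-11 18:13

Python crawlers really are powerful! Keep it up. I need to study hard myself

vethenc posted on 2019-11-11 18:25

Thanks for sharing. If someone had told me about this use earlier, I'd have gotten into it sooner

光之继承者 posted on 2019-11-11 19:51

Crawlers written in Go are impressive, and this Python version is very good too. Learned something.

梦幻嘟嘟 posted on 2019-11-12 09:25

a13529烟雨 posted on 2019-11-11 18:59
Crawlers really have a lot of uses

Learning to write crawlers can easily leave you malnourished~

mikeee posted on 2019-11-13 22:20

Thanks for sharing, following this thread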
