[Beauty Wallpaper Crawler] Scraping the Umei gallery (umei.cc) with the Scrapy framework [full code included]
A few things to note about this example:
1. Scrapy's built-in HTTP/2 support is enabled.
2. Scrapy's images pipeline is used to download the pictures.
**Features:**
1. Starting from a given wallpaper category URL, the spider automatically enters each wallpaper's detail page and follows the pagination to grab every page of the set.
**Full code attachment:**
Images are grouped into folders named after the title shown on the site and saved under the project as Downloads/<wallpaper title>/xxx
(The model photos are a bit too risqué, so the demo here only downloads the other wallpaper categories.)
Below is the code for the core files:
items.py
```python
import scrapy
class UMeiItem(scrapy.Item):
    name = scrapy.Field()
    # To use the images pipeline this field must be called image_urls;
    # otherwise you have to subclass the pipeline and override the relevant methods.
    # It holds the download URLs of the images.
    image_urls = scrapy.Field()
    # images holds the download results filled in by the pipeline
    images = scrapy.Field()
    dirname = scrapy.Field()
```
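For reference, after a successful download the pipeline fills the `images` field with one result dict per URL. A rough sketch of the shape it takes (all values below are made up for illustration):
```python
# Illustrative shape of the `images` field populated by the images pipeline
# after a download finishes; every value here is hypothetical.
downloaded = [
    {
        "url": "https://www.umei.cc/example/full.jpg",   # original image URL
        "path": "壁纸标题/235660_3.jpg",                  # path relative to IMAGES_STORE
        "checksum": "d41d8cd98f00b204e9800998ecf8427e",   # MD5 of the file body
        "status": "downloaded",                           # or "uptodate" / "cached"
    },
]
print(downloaded[0]["path"])
```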
pipelines.py
```python
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
class MyImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        img_name = adapter.get("name")
        dirname = adapter.get("dirname")
        # print(f"{img_name=}")
        return f"{dirname}/{img_name}.jpg"
```
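The stock `ImagesPipeline` would save every file as `full/<SHA1 of the URL>.jpg`; the override above groups files by wallpaper title instead. A minimal sketch of the same path logic outside of Scrapy, using hypothetical item values:
```python
from itemadapter import ItemAdapter

# file_path() receives the item that triggered the image request; ItemAdapter
# lets it read fields the same way for plain dicts and scrapy.Item objects.
# The values below are hypothetical.
item = {"name": "235660_3", "dirname": "壁纸标题"}
adapter = ItemAdapter(item)
print(f"{adapter.get('dirname')}/{adapter.get('name')}.jpg")  # 壁纸标题/235660_3.jpg
```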
settings.py
```python
import os.path
BOT_NAME = "umeiwallpaper"
SPIDER_MODULES = ["umeiwallpaper.spiders"]
NEWSPIDER_MODULE = "umeiwallpaper.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 0.2
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
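# Use Scrapy's built-in HTTP/2 download handler for https requests.
# (HTTP/2 support needs the h2 dependency, e.g. `pip install "Twisted[http2]"`.)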
DOWNLOAD_HANDLERS = {
"https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
"umeiwallpaper.pipelines.MyImagePipeline": 300,
}
# Root directory where the images pipeline stores downloaded files
BASE_DIR = os.path.dirname(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(BASE_DIR, "Downloads")
# You can customize these two field names, but these values are already the defaults
IMAGES_URLS_FIELD = "image_urls"
IMAGES_RESULT_FIELD = "images"
# Allow redirects for media downloads (some http URLs redirect to https; without this you just get a 301 error)
MEDIA_ALLOW_REDIRECTS = True
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 1
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 5
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
```
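One thing worth double-checking: `BASE_DIR` is derived from the location of settings.py, so `IMAGES_STORE` ends up at the project root, next to scrapy.cfg. A quick sketch of the path arithmetic (the path below is purely illustrative):
```python
import os.path

# settings.py sits at <project>/umeiwallpaper/settings.py, so going up two
# directory levels lands at the project root. The path here is illustrative.
settings_file = "/home/me/umeiwallpaper/umeiwallpaper/settings.py"
base_dir = os.path.dirname(os.path.dirname(settings_file))
print(os.path.join(base_dir, "Downloads"))  # /home/me/umeiwallpaper/Downloads
```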
umei.py (the spider)
```python
import scrapy
from scrapy.http import HtmlResponse
from umeiwallpaper.items import UMeiItem
class UmeiSpider(scrapy.Spider):
    name = "umei"
    allowed_domains = ["umei.cc"]
    # start_urls = ["https://www.umei.cc/meinvtupian/meinvmote/235660.htm"]
    start_urls = ["https://www.umei.cc/meinvtupian/meinvmote/"]

    def parse(self, response: HtmlResponse, **kwargs):
        # Each entry on the category page links to one wallpaper set's detail page.
        lists_xpath = "//div[@class='item masonry_brick']/div/div[@class='img']/a"
        selector_list = response.xpath(lists_xpath)
        for selector in selector_list:
            url = selector.xpath("./@href").get()
            # title = selector.xpath("./img/@alt").get()
            print(f"list entry: {url=}")
            # print(f"{title=}")
            yield scrapy.Request("https://www.umei.cc" + url, callback=self.parse_item, dont_filter=False)

    def parse_item(self, response: HtmlResponse):
        # The "last page" link ends with "<id>_<page count>.htm"; pull the page
        # count from that suffix. Tighten this XPath to the pagination block if
        # the generic //a match picks up the wrong link.
        last_page = response.xpath('//a/@href').get()
        if last_page and "_" in last_page:
            count = int(last_page.split("/")[-1].rsplit(".", 1)[0].split("_")[-1])
        else:
            count = 1
        # print(f"{count=}")
        for i in range(count):
            if i == 0:
                # url = self.start_urls
                url = response.url
            else:
                # Pages after the first follow the "<id>_<n>.htm" pattern.
                url = f"{response.url.rsplit('.', 1)[0]}_{i + 1}.htm"
            print(f"{url=}")
            yield scrapy.Request(url, callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response: HtmlResponse):
        img_url = response.xpath("//div[@class='big-pic']/a/img/@src").get()
        # Use the page's file name (without the .htm extension) as the image name.
        name = response.url.split("/")[-1].split(".")[0]
        title = response.xpath('//div/h1/text()').get()
        u_mei_item = UMeiItem()
        u_mei_item["name"] = name
        u_mei_item["dirname"] = title
        # The images pipeline expects a list of URLs.
        u_mei_item["image_urls"] = [img_url]
        # print(u_mei_item)
        yield u_mei_item
```
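To make the pagination logic in `parse_item` concrete: the first page of a set is the plain `<id>.htm` URL, and every following page appends a `_<n>` suffix. A small stand-alone illustration (the URL comes from the commented-out start_urls above; the page count is invented):
```python
# How the per-page URLs in parse_item are built; purely illustrative.
detail_url = "https://www.umei.cc/meinvtupian/meinvmote/235660.htm"
count = 3  # pretend the set has 3 pages
for i in range(count):
    if i == 0:
        print(detail_url)                                     # .../235660.htm
    else:
        print(f"{detail_url.rsplit('.', 1)[0]}_{i + 1}.htm")  # .../235660_2.htm, _3.htm
```
With everything in place, run the spider from the project root with `scrapy crawl umei`; the downloaded images end up under `Downloads/<wallpaper title>/`.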
**Replies**

rengxumiaoshou (2023-6-25 20:31): What else are crawlers even good for besides showing off, folks 😭

xiaorun (2023-6-25 20:32), replying to the above: That's a bit short-sighted, isn't it?

rengxumiaoshou (2023-6-25 20:33): I'd like to turn data into money too, but besides job hunting or selling data through a Taobao shop there's really nothing else I can do with it.

Reply to rengxumiaoshou: The options you just "ruled out" are exactly ways to monetize it, so why rule them out? The applications really are broad; with the skills, almost anything can be turned into money.

Reply to rengxumiaoshou's first comment: Well, this one is a beauty-wallpaper downloader.

Other replies: Thanks for sharing, OP! / Thanks, OP, I'll give it a try.