风景暗色调 posted on 2023-6-25 19:48

[Beauty Wallpaper Crawler] Scraping the UMei gallery (umei.cc) with the Scrapy framework [full code included]

This post was last edited by 风景暗色调 on 2023-6-25 22:13.

A few things to note about this example:

1. It enables Scrapy's built-in HTTP/2 support.
2. It uses Scrapy's images pipeline to download the images.

**What it does:**
1. Given a wallpaper category URL, the spider automatically enters each wallpaper's detail page and follows the pagination to crawl every page of the gallery.


**Full code attachment:**

Images are saved under the project's Downloads directory, using each gallery's title from the site as the subdirectory: Downloads/<wallpaper title>/xxx
(The beauty wallpapers are a bit too racy, so the demo here downloads other wallpapers instead.)



Below is the code for the core files:

items.py

```python
import scrapy


class UMeiItem(scrapy.Item):
    name = scrapy.Field()
    # When using the images pipeline this field must be named image_urls;
    # otherwise you need to subclass the pipeline and override the relevant
    # methods. It holds the download URLs of the images.
    image_urls = scrapy.Field()

    # images receives the download results written back by the pipeline.
    images = scrapy.Field()

    dirname = scrapy.Field()

```
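For reference, after the images pipeline has processed an item, it writes its results back into the `images` field. The values below are made up for illustration, but the dict shape (`url`, `path`, `checksum`, plus `status` in newer Scrapy versions) matches what Scrapy's media pipelines produce:

```python
from umeiwallpaper.items import UMeiItem

# Illustrative shape of the `images` field after the pipeline has run
# (values are hypothetical; `path` is relative to IMAGES_STORE):
item = UMeiItem()
item["images"] = [
    {
        "url": "https://example.com/images/example.jpg",  # original download URL
        "path": "some-title/235660_2.jpg",                # where the file was saved
        "checksum": "a6aa7b8d8e7c0e4ee547d5f8a4a8e1f3",   # MD5 of the image bytes
        "status": "downloaded",                           # or "uptodate" / "cached"
    },
]
```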



pipelines.py

```python
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline


class MyImagePipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        # Build the storage path (relative to IMAGES_STORE) from the item:
        # one directory per gallery title, one file per image name.
        adapter = ItemAdapter(item)
        img_name = adapter.get("name")
        dirname = adapter.get("dirname")
        return f"{dirname}/{img_name}.jpg"

```
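One caveat: the gallery titles come straight from the page and may contain characters that are illegal in file names on some platforms. A minimal sketch of a sanitizing variant; the `sanitize` helper and its regex are my own addition, not part of the original project:

```python
import re

from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline

# Characters that are invalid in Windows file names (hypothetical helper,
# not part of the original code).
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def sanitize(part: str) -> str:
    # Replace illegal characters and fall back to a placeholder if empty.
    return _ILLEGAL.sub("_", part or "").strip() or "untitled"


class SafeImagePipeline(ImagesPipeline):

    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        return f"{sanitize(adapter.get('dirname'))}/{sanitize(adapter.get('name'))}.jpg"
```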



settings.py

```python
import os.path

BOT_NAME = "umeiwallpaper"

SPIDER_MODULES = ["umeiwallpaper.spiders"]
NEWSPIDER_MODULE = "umeiwallpaper.spiders"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/114.0"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 0.2

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

DOWNLOAD_HANDLERS = {
    "https": "scrapy.core.downloader.handlers.http2.H2DownloadHandler",
}
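
# Note: Scrapy's HTTP/2 download handler requires Twisted to be installed
# with its http2 extras (pip install 'Twisted[http2]'); without them,
# Scrapy fails when loading this handler.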

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "umeiwallpaper.pipelines.MyImagePipeline": 300,
}
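
# Note: ImagesPipeline depends on Pillow for image processing
# (pip install Pillow).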

# Root directory where the images pipeline stores downloaded files
BASE_DIR = os.path.dirname(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(BASE_DIR, "Downloads")

# These field names can be customized; the two below are the defaults.
IMAGES_URLS_FIELD = "image_urls"
IMAGES_RESULT_FIELD = "images"

# Allow redirects for media downloads (some http URLs redirect to https;
# without this the download fails with a 301 status code).
MEDIA_ALLOW_REDIRECTS = True

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 1
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 5
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False

REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"

```



The spider, umei.py

```python
import scrapy
from scrapy.http import HtmlResponse

from umeiwallpaper.items import UMeiItem


class UmeiSpider(scrapy.Spider):
    name = "umei"
    allowed_domains = ["umei.cc"]
    # start_urls = ["https://www.umei.cc/meinvtupian/meinvmote/235660.htm"]
    start_urls = ["https://www.umei.cc/meinvtupian/meinvmote/"]

    def parse(self, response: HtmlResponse, **kwargs):
        # Each gallery on the category page is a masonry brick whose <a>
        # leads to the gallery's detail page.
        lists_xpath = "//div[@class='item masonry_brick']/div/div[@class='img']/a"
        selector_list = response.xpath(lists_xpath)

        for selector in selector_list:
            url = selector.xpath("./@href").get()
            # title = selector.xpath("./img/@alt").get()
            print(f"list page: {url=}")

            yield scrapy.Request("https://www.umei.cc" + url, callback=self.parse_item, dont_filter=False)

    def parse_item(self, response: HtmlResponse):
        # Detail pages paginate as .../235660.htm, .../235660_2.htm, ...
        # The last pagination link tells us how many pages the gallery has.
        # NOTE: the XPath below is an assumption about the site's pagination
        # markup; the original selector was too broad to work as intended.
        last_page = response.xpath("//div[@class='NewPages']//a[last()]/@href").get()
        if last_page:
            # e.g. ".../235660_7.htm" -> 7
            count = int(last_page.split("/")[-1].rsplit(".", 1)[0].split("_")[-1])
        else:
            count = 1
        for i in range(count):
            if i == 0:
                url = response.url
            else:
                url = f"{response.url.rsplit('.', 1)[0]}_{i + 1}.htm"
            print(f"{url=}")
            yield scrapy.Request(url, callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response: HtmlResponse):
        img_url = response.xpath("//div[@class='big-pic']/a/img/@src").get()
        # Use the page's file name (e.g. "235660_2") as the image name.
        name = response.url.split("/")[-1].split(".")[0]
        title = response.xpath("//div/h1/text()").get()

        u_mei_item = UMeiItem()
        u_mei_item["name"] = name
        u_mei_item["dirname"] = title
        # image_urls must be a list, even when there is a single image.
        u_mei_item["image_urls"] = [img_url] if img_url else []

        yield u_mei_item

```
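To run it, `scrapy crawl umei` from the project root works as usual. Alternatively, a small runner script can launch the crawl from Python; this is a minimal sketch of my own (not part of the original post), assuming it is saved next to scrapy.cfg so that get_project_settings() can find the project settings:

```python
# run.py - hypothetical helper, not part of the original project.
# Launches the spider with the settings from settings.py.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from umeiwallpaper.spiders.umei import UmeiSpider

if __name__ == "__main__":
    process = CrawlerProcess(get_project_settings())
    process.crawl(UmeiSpider)
    process.start()  # blocks until the crawl finishes
```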



鹿鸣 posted on 2023-6-25 20:42

rengxumiaoshou posted on 2023-6-25 20:33:
I'd like to monetize data too, but besides getting a job or opening a Taobao shop to sell data, there's really nothing else I can do.

The things you just "ruled out" are exactly ways of monetizing it, so why rule them out? That said, the applications are certainly much broader; with real skills you can monetize it almost anywhere.

rengxumiaoshou posted on 2023-6-25 20:31

What can web scraping actually do besides showing off, folks? 😭

鹿鸣 posted on 2023-6-25 20:32

rengxumiaoshou posted on 2023-6-25 20:31:
What can web scraping actually do besides showing off, folks? 😭

That's a bit short-sighted, isn't it?

rengxumiaoshou posted on 2023-6-25 20:33

xiaorun posted on 2023-6-25 20:32:
That's a bit short-sighted, isn't it?

I'd like to monetize data too, but besides getting a job or opening a Taobao shop to sell data, there's really nothing else I can do.

aliya0416 posted on 2023-6-25 20:42

Thanks for sharing, OP! Much appreciated.

wuai4444 posted on 2023-6-25 21:43

Thanks for sharing, OP.

TsL05 posted on 2023-6-25 21:44

I'll give it a try first.

moruye posted on 2023-6-25 21:52

风景暗色调 posted on 2023-6-25 22:03

rengxumiaoshou posted on 2023-6-25 20:31:
What can web scraping actually do besides showing off, folks? 😭

This one is a beauty downloader, though!