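可控核聚变 posted on 2021-1-6 12:18

Photography masters' street shots of lovely girls: I used Python to scrape 魔镜 original photography images.

Last edited by 可控核聚变 on 2021-1-6 22:13

Hi friends, hello everyone! For the past couple of days I have been browsing the 魔镜原创摄影 site. It has loads of beautiful, lovely girls, all high-quality work by street photography masters. Here are a few pictures to look at first. :lol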


   

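But these are all small images, and you have to pay to see the full high-resolution sets. Sadly I'm short of cash and can't afford it (if you like these works, you are welcome to pay and support the photographers). So I settled for scraping the preview images of every series by every photographer. That is probably still a few thousand pictures, enough to feast my eyes on.
No sooner said than done. After hammering out some code I finally located the image URLs and was ready to download. Off we go!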


Hm? Wait, why is what I get back a pile of this? What on earth?



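I thought I had the tag wrong, so I checked it over and over, but the tag was fine. I also tried reading the attributes of the sibling tags at the same level, and those all returned correct data. Only the img src attribute kept giving me this: template/dsvue_black_gold/skin_img/t.gif
So I searched the Elements panel for the string template/dsvue_black_gold/skin_img/t.gif to see where it came from. It turned out to come from here.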


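The moment I saw the words text/javascript I lost my nerve. I have never learned JavaScript and know nothing about it, so all I could do was search Baidu for how to deal with this lazyLoadImg. It turns out this is called lazy loading (I won't explain it here; you can look it up online). Many posts said js_script could solve it, but I don't know how to use that, so what now? I had already typed this many lines of code. Was it all going to end like this? Damn it, I refused to be conquered by her just like that and drink the poison she had hidden.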

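I kept searching, and finally found a guy on CSDN who said you can print the page source and compare the text to find the attribute in the img tag that really holds the image link. Worth a try!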


By comparing against the page source text I finally found the target. It was that simple.
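For anyone who wants to repeat that comparison without eyeballing the raw source, here is a rough sketch. It uses only the standard library html.parser (not lxml, which the script in this post uses) so it runs anywhere. The helper names ImgAttrCollector and find_real_url_attrs are made up for this illustration, and the sample markup is hypothetical; only the placeholder path t.gif is taken from the post. The idea is simply to list every other attribute on the img tags whose src is the placeholder.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img>

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def find_real_url_attrs(html, placeholder):
    """For each <img> whose src is the placeholder, return its other non-empty attributes."""
    collector = ImgAttrCollector()
    collector.feed(html)
    return [
        {k: v for k, v in img.items() if v and v != placeholder}
        for img in collector.images
        if img.get("src") == placeholder
    ]


# Hypothetical markup; only the placeholder path comes from the post
PLACEHOLDER = "template/dsvue_black_gold/skin_img/t.gif"
sample = (
    '<div class="c cl"><a href="thread-1-1-1.html">'
    '<img src="template/dsvue_black_gold/skin_img/t.gif" '
    'data-src="https://example.com/pic/001.jpg" alt="demo"></a></div>'
)
candidates = find_real_url_attrs(sample, PLACEHOLDER)
print(candidates)
```

Whichever leftover attribute looks like an image URL is the one to request in the XPath. On this site that turned out to be data-src, but other lazy-loading scripts may use a different attribute name.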


OK, let's go!


Here is the full code:
import os
import random
import re

import requests
from faker import Faker
from lxml import etree


class MoJing_Spider(object):
    def __init__(self):
        # Generate 30 random User-Agent strings with Faker and pick one for this run
        fake = Faker()
        user_agents = [fake.user_agent() for _ in range(30)]
        self.headers = {'User-Agent': random.choice(user_agents)}
        self.start_url = 'https://www.520mojing.com/forum.php'  # site root URL
        self.main_folder = r'/Volumes/魔镜街拍图'  # main download folder; change it to a path on your machine
        # print(self.headers)

    # Parse the home page
    def data_range(self):
        start_res = requests.get(url=self.start_url, headers=self.headers)
        start_sel = etree.HTML(start_res.content.decode())
        return start_sel

    # Get the list of board URLs
    def get_author_url(self, start_sel):
        main_url_list = start_sel.xpath('//dt[@style="font-size:15px; margin-top:6px"]/a/@href')
        return main_url_list

    # Get the board name, board id and page count
    def create_url(self, author_url):
        author_res = requests.get(url=author_url, headers=self.headers)
        author_sel = etree.HTML(author_res.content.decode())
        try:
            # re.findall() and xpath() both return lists, so take the first item of each
            forum_id = re.findall(r'https://www\.520mojing\.com/forum-(.*?)-1\.html', author_url)[0]
            page_text = author_sel.xpath('//span[@id="fd_page_top"]/div/label/span/text()')[0]
            len_page = int(page_text.replace('/ ', '').replace(' 页', ''))
            name = author_sel.xpath('//div[@class="bm_h cl"]/h1/a/text()')[0]
            print(f'=================Saving images from {name}, {len_page} pages in total=================')
            return name, forum_id, len_page
        except (IndexError, ValueError):
            # The URL or the page did not match the expected layout; the caller should skip this board
            return None
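    # Get the image links on each page
    def get_pciturelinks(self, page_url):
        picture_res = requests.get(url=page_url, headers=self.headers)
        picture_sel = etree.HTML(picture_res.content.decode())
        # The real image URL is in data-src; src only holds the lazy-load placeholder t.gif.
        # xpath() returns an empty list when nothing matches, so no try/except is needed.
        pciture_links = picture_sel.xpath('//div[@class="c cl"]/a/img/@data-src')
        return pciture_links

    # Save one image
    def save_picture(self, name, link, num):
        try:
            # Create the nested folders: main folder / board name / page number
            folder = os.path.join(self.main_folder, name, str(num))
            os.makedirs(folder, exist_ok=True)
            # Download first so a failed request does not leave an empty file behind
            image = requests.get(url=link, headers=self.headers).content
            # File name: second-to-last URL path segment plus the link's extension
            filename = link.split('/')[-2] + os.path.splitext(link)[-1]
            with open(os.path.join(folder, filename), 'wb') as f:
                f.write(image)
        except (requests.RequestException, OSError, IndexError) as e:
            print(f'Save failed: {link} ({e})')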

    def run(self):
        start_sel = self.data_range()
        main_url_list = self.get_author_url(start_sel)
        for author_url in main_url_list:
            result = self.create_url(author_url)
            if result is None:
                # create_url could not parse this board, so skip it
                continue
            name, forum_id, len_page = result
            for num in range(1, len_page + 1):
                page_url = f'https://www.520mojing.com/forum-{forum_id}-{num}.html'
                pciture_links = self.get_pciturelinks(page_url)
                for link in pciture_links:
                    # print(link)
                    self.save_picture(name, link, num)


if __name__ == '__main__':
    MoJing = MoJing_Spider()  # instantiate the spider
    MoJing.run()

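While testing the code I hit the site too many times, so I set up a few random request headers. I don't know whether they help. Before running the code, just change the path in self.main_folder. As long as you're all happy!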
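One thing worth noting about the random headers: the script picks a single User-Agent in __init__ and reuses it for every request in that run, so within one run nothing actually rotates. Below is a rough sketch of choosing a fresh one per request instead. The function pick_headers and the small hard-coded USER_AGENTS pool are assumptions made for this illustration (the script fills its pool with faker), and whether the site pays any attention to the User-Agent at all is unknown.

```python
import random

# Hypothetical pool for the demo; the original script generates its pool with faker
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
]


def pick_headers(pool=USER_AGENTS, rng=random):
    """Build a fresh headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(pool)}


rng = random.Random(0)  # seeded only so the demo is repeatable
seen = {pick_headers(rng=rng)["User-Agent"] for _ in range(50)}
print(len(seen), "different User-Agent values used in 50 calls")
```

In the spider this would mean passing headers=pick_headers() to each requests.get call instead of self.headers. A short time.sleep between requests is probably the kinder way to avoid hammering the site.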

gf7802346 posted on 2021-2-8 20:35

Last edited by gf7802346 on 2021-2-9 00:12


魔镜 API for full-size images without watermarks

api :https://api.huaishu520.com:8080/ ... ges/images/listpage
post:{"page":"1","limit":"500","orderbyType":"id"}

api:https://api.huaishu520.com:8080/ ... ages/frontinfo/2131
get:


https://badman1.oss-cn-beijing.aliyuncs.com/20210126/98a1b87eb7d145e5b32628a95543710e.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/639c9366ed7647cab0d529ada08c57b1.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/34078f35cb6244e7bd4907ef847e8602.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/3fa27770d3bc4530bf23b88713105b88.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/196f9c0fcae2495886f539dd5866b484.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/803e2119d17a42a680f7d6685f56d520.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/b8b1279380fa4b5d982de537f48bbf8e.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/0244e5584e414eacb3f70b3b3cf02ceb.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/a35c1e34c5dc4db7b1d3c2551c5a07dd.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/81445e9d84b84a73b1b609a05c6fff4e.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/52a5f7730e7f475e9142ce3feacb14df.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/b7e5b0243c034e0e82b74a988b80db1c.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/7829bc52d50f48fda22895a7cd49d360.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/3703a45b3d2a437caedbc2f678f82236.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/06fcc1efabdf456c8673f34c030b8274.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/962b777ba88f4bac8303f84af0f01dcb.jpg

可控核聚变 posted on 2021-1-15 23:54

我睡一会再来 posted on 2021-1-15 22:55
Where is the default folder located?

Change the path in the self.main_folder variable to a folder on your own computer.
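To make the folder layout concrete, here is a small sketch of how the save path is put together: main folder, then board name, then page number. The helper target_folder is a made-up name for this illustration, it uses pathlib rather than the string concatenation in the posted script, and a temporary directory stands in for your own path.

```python
import tempfile
from pathlib import Path


def target_folder(main_folder, name, num):
    """Return main_folder/name/num, creating the folders if needed."""
    folder = Path(main_folder) / name / str(num)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# A temporary directory stands in for your own path (for example a folder on drive D:)
with tempfile.TemporaryDirectory() as root:
    folder = target_folder(root, "demo_board", 1)
    created = folder.is_dir()
    relative = folder.relative_to(root).as_posix()
print(created, relative)
```

So whatever you put in self.main_folder becomes the root, and the script creates one subfolder per board and one per page beneath it.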

gf7802346 posted on 2021-2-8 20:45

Last edited by gf7802346 on 2021-2-9 00:13

https://api.huaishu520.com:8080/renren-fast/images/images/frontinfo/2179
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/b85aa0f8e45242a896b09602bb83fdeb.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/3bae9767125c40bfbab3065cc2d6bd71.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/e9d26acff8194f1790eb8c789fdfefb0.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/febe709d7ecd4e9183645187e587c947.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/13b9c288bb0243b399a517e34311d424.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/ec947b4c7cb741778c2b4a57549ea7f2.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/8f78706c4cdb4bbe9285a667077b4afa.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/4e76b46ab0df412c9bd863f8554ad98f.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/982996ed31674f908c676762da249f55.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/d7c3b1729d42426b89d133de0cff5099.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/d4999bad2030438c87dc7de128da91e6.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/eda441e66b464c7995aa9ec31202fa16.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/ed87212e160949aca5c0d4d7c42f5269.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/71df165f1812409ea0fd140517744c5a.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/f86baf0bb4ff4dbc991fe8180f93a8d4.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/2e02baeb17ed4896a07b9a2f7cf5d76c.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/c8263b0babd54b8bb63ffd96319ab5ab.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/017dab8746ef4ac0bed0a06cdca48b9e.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/2cda815daf16431eba1b7ad535e7cc34.JPG

ming_2794 posted on 2021-1-8 22:55

Thanks for sharing, I'll study this!

dzqaww posted on 2021-1-9 19:03

D:\pycharm\pythonProject\venv\Scripts\python.exe D:/pycharm/pythonProject/mm/4.py

Process finished with exit code 0

It just ended right away...

tjf posted on 2021-1-10 17:42

I suspect you are posting racy content, but I have no proof

可控核聚变 posted on 2021-1-10 23:34

dzqaww posted on 2021-1-9 19:03
D:\pycharm\pythonProject\venv\Scripts\python.exe D:/pycharm/pythonProject/mm/4.py

Process finished with exit ...

You need to change the folder path first

fault posted on 2021-1-10 23:48

Studying on the one hand, "studying" on the other

相信无限活宝 posted on 2021-1-12 13:48

This is really nice

可控核聚变 posted on 2021-1-12 18:01

相信无限活宝 posted on 2021-1-12 13:48
This is really nice

Thanks for the support

我睡一会再来 posted on 2021-1-15 22:55

Where is the default folder located?

xjshuaishuai posted on 2021-1-15 23:32

Thanks to the original poster for sharing!