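Master street photographers shoot beautiful, lovely young women: I scraped the MoJing Original Photography (魔镜原创摄影) images with Python.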
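Last edited by 可控核聚变 on 2021-1-6 22:13
Hi, friends! For the past two days I have been browsing the MoJing Original Photography site. It has a huge number of beautiful, lovely young women in it, all high-quality work by master street photographers. Here are a few pictures to look at first.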
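However, these are all small images, and viewing the full high-resolution sets requires payment. Sadly my wallet is empty and I can't afford it (if you like these works, you can pay to support the photographers). So I settled for scraping the preview images of every series by every photographer. I estimate that comes to a few thousand images, which is enough to feast the eyes.
No sooner said than done: after a burst of typing I finally found the image addresses and was ready to download. Off we go!
Hm? Something's wrong. Why did I get back a pile of... what on earth is this?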
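I thought I had picked the wrong tag, so I checked it over and over, but the tag was fine. I also tried reading other attributes of tags at the same level, and they all returned the correct data. Only the img src attribute gave me a pile of this: template/dsvue_black_gold/skin_img/t.gif
So I went to the Elements panel and searched for the string template/dsvue_black_gold/skin_img/t.gif to see where it came from. It turned out to come from here.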
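The moment I saw the words text/javascript I lost my nerve. I have never learned JavaScript and know nothing about it, so all I could do was search Baidu for how to deal with this lazyLoadImg. It turns out this is called lazy loading (I won't explain what it is; you can look it up online). Many posts said js_script could solve it, but I don't know how to use that, so what now? I had already typed this many lines of code, and it was all going to end like this? Damn it, I was not willing to be conquered by her just like that and drink the poison she had hidden.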
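I kept searching and finally found a guy on CSDN who said that if you print the page source and compare the text, you can find the attribute of the img tag that really holds the image link. Worth a try!
By comparing against the page source text I finally found the target: the real link sits in the data-src attribute. It's that simple.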
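The comparison step above can be sketched roughly as follows. This is only an illustration under assumptions: the HTML snippet and the example.com URLs are made up to mimic the lazy-loading pattern, the helper names are hypothetical, and the standard-library HTMLParser is swapped in for lxml so the sketch runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical markup modelled on the lazy-loading pattern described above:
# src holds the placeholder GIF, another attribute holds the real image URL.
SAMPLE_HTML = """
<div class="c cl">
  <a href="thread-1.html">
    <img src="template/dsvue_black_gold/skin_img/t.gif"
         data-src="https://example.com/pic/001.jpg" alt="demo">
  </a>
  <a href="thread-2.html">
    <img src="template/dsvue_black_gold/skin_img/t.gif"
         data-src="https://example.com/pic/002.jpg" alt="demo">
  </a>
</div>
"""

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".webp")


class ImgAttrCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def collect_images(html):
    """Parse the HTML and return one attribute dict per <img> tag."""
    collector = ImgAttrCollector()
    collector.feed(html)
    return collector.images


def find_real_image_attrs(html):
    """Return names of non-src attributes whose value looks like an image URL."""
    names = set()
    for attrs in collect_images(html):
        for name, value in attrs.items():
            if name != "src" and value and value.lower().endswith(IMAGE_SUFFIXES):
                names.add(name)
    return sorted(names)


def extract_links(html, attr_name):
    """Return the value of attr_name for every <img> tag that has it."""
    return [a[attr_name] for a in collect_images(html) if a.get(attr_name)]


if __name__ == "__main__":
    candidates = find_real_image_attrs(SAMPLE_HTML)
    print(candidates)
    print(extract_links(SAMPLE_HTML, candidates[0]))
```

On a real page you would feed in the downloaded source instead of SAMPLE_HTML; the attribute name it reports is then the one to put in the XPath in place of @src.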
OK, let's go!
Here is the full code:
import os
import random
import re

import requests
from faker import Factory
from lxml import etree


class MoJing_Spider(object):
    def __init__(self):
        # Build a pool of 30 fake User-Agent headers and pick one at random
        fake = Factory.create()
        user_agent = [{'User-Agent': fake.user_agent()} for _ in range(30)]
        self.headers = random.choice(user_agent)
        self.start_url = 'https://www.520mojing.com/forum.php'  # site root address
        self.main_folder = r'/Volumes/魔镜街拍图'  # main download folder
        # print(self.headers)

    # Parse the home page
    def data_range(self):
        start_res = requests.get(url=self.start_url, headers=self.headers)
        start_sel = etree.HTML(start_res.content.decode())
        return start_sel

    # Get the list of section URLs
    def get_author_url(self, start_sel):
        main_url_list = start_sel.xpath('//dt[@style="font-size:15px; margin-top:6px"]/a/@href')
        return main_url_list

    # Get the name, forum id and page count of one section
    def create_url(self, author_url):
        author_res = requests.get(url=author_url, headers=self.headers)
        author_sel = etree.HTML(author_res.content.decode())
        try:
            # re.findall() and xpath() both return lists, so take the first item
            forum_id = re.findall(r'https://www\.520mojing\.com/forum-(.*?)-1\.html', author_url)[0]
            len_page = author_sel.xpath('//span[@id="fd_page_top"]/div/label/span/text()')[0]
            len_page = int(len_page.replace('/ ', '').replace(' 页', ''))
            name = author_sel.xpath('//div[@class="bm_h cl"]/h1/a/text()')[0]
        except (IndexError, ValueError):
            # The page does not have the expected layout; the caller skips it
            return None
        print(f'=================Saving images for {name}, {len_page} pages in total=================')
        return name, forum_id, len_page

    # Get the image links on one page
    def get_pciturelinks(self, page_url):
        picture_res = requests.get(url=page_url, headers=self.headers)
        picture_sel = etree.HTML(picture_res.content.decode())
        # The real link is in data-src; src only holds the lazy-load placeholder
        pciture_links = picture_sel.xpath('//div[@class="c cl"]/a/img/@data-src')
        return pciture_links

    # Save one image
    def save_picture(self, name, link, num):
        try:
            # Create nested folders: main folder / section name / page number
            folder = os.path.join(self.main_folder, name, str(num))
            os.makedirs(folder, exist_ok=True)
            # File name: second-to-last URL segment plus the link's extension
            filename = link.split('/')[-2] + os.path.splitext(link)[-1]
            # Download first so a failed request does not leave an empty file
            image = requests.get(url=link, headers=self.headers).content
            with open(os.path.join(folder, filename), 'wb') as f:
                f.write(image)
        except (requests.RequestException, OSError, IndexError) as e:
            print(f'Save failed: {link} ({e})')

    def run(self):
        start_sel = self.data_range()
        main_url_list = self.get_author_url(start_sel)
        for author_url in main_url_list:
            result = self.create_url(author_url)
            if result is None:
                continue  # skip sections that could not be parsed
            name, forum_id, len_page = result
            for num in range(1, len_page + 1):
                page_url = f'https://www.520mojing.com/forum-{forum_id}-{num}.html'
                pciture_links = self.get_pciturelinks(page_url)
                for link in pciture_links:
                    # print(link)
                    self.save_picture(name, link, num)


if __name__ == '__main__':
    MoJing = MoJing_Spider()  # instantiate the spider
    MoJing.run()
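The script picks one random User-Agent in __init__ and then reuses it for every request. A variation that may be gentler on the site is to choose a header per request and pause between requests. The sketch below is only an assumption-laden illustration: the User-Agent strings, the function names random_headers and polite_pause, and the delay values are all made up, and nothing here is confirmed to prevent blocking.

```python
import random
import time

# Hypothetical pool of User-Agent strings; the script above builds its pool
# with the faker library instead.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0",
]


def random_headers(pool=USER_AGENTS, rng=random):
    """Return a headers dict with a User-Agent picked at random from the pool."""
    return {"User-Agent": rng.choice(pool)}


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep, rng=random):
    """Sleep for a random time in [min_seconds, max_seconds]; return the delay."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    # Demonstrate three "requests" with a fresh header and a short pause each
    for _ in range(3):
        print(random_headers())
        polite_pause(0.1, 0.2)
```

In the spider this would mean passing headers=random_headers() to each requests.get call and calling polite_pause() inside the download loop.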
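While testing the code I hit the site too many times, so I set up a few random request headers; I don't know whether they help. Before running the code, just change the self.main_folder path. As long as you're happy, friends!
Last edited by gf7802346 on 2021-2-9 00:12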
MoJing watermark-free full-size image API
api: https://api.huaishu520.com:8080/ ... ges/images/listpage
post: {"page":"1","limit":"500","orderbyType":"id"}
api: https://api.huaishu520.com:8080/ ... ages/frontinfo/2131
get:
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/98a1b87eb7d145e5b32628a95543710e.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/639c9366ed7647cab0d529ada08c57b1.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/34078f35cb6244e7bd4907ef847e8602.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/3fa27770d3bc4530bf23b88713105b88.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/196f9c0fcae2495886f539dd5866b484.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/803e2119d17a42a680f7d6685f56d520.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/b8b1279380fa4b5d982de537f48bbf8e.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/0244e5584e414eacb3f70b3b3cf02ceb.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/a35c1e34c5dc4db7b1d3c2551c5a07dd.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/81445e9d84b84a73b1b609a05c6fff4e.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/52a5f7730e7f475e9142ce3feacb14df.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/b7e5b0243c034e0e82b74a988b80db1c.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/7829bc52d50f48fda22895a7cd49d360.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/3703a45b3d2a437caedbc2f678f82236.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/06fcc1efabdf456c8673f34c030b8274.jpg
https://badman1.oss-cn-beijing.aliyuncs.com/20210126/962b777ba88f4bac8303f84af0f01dcb.jpg
我睡一会再来 posted on 2021-1-15 22:55
Where is the default folder?
Change the path in the self.main_folder variable to a folder path on your own computer.
Last edited by gf7802346 on 2021-2-9 00:13
https://api.huaishu520.com:8080/renren-fast/images/images/frontinfo/2179
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/b85aa0f8e45242a896b09602bb83fdeb.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/3bae9767125c40bfbab3065cc2d6bd71.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/e9d26acff8194f1790eb8c789fdfefb0.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/febe709d7ecd4e9183645187e587c947.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/13b9c288bb0243b399a517e34311d424.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/ec947b4c7cb741778c2b4a57549ea7f2.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/8f78706c4cdb4bbe9285a667077b4afa.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/4e76b46ab0df412c9bd863f8554ad98f.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/982996ed31674f908c676762da249f55.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/d7c3b1729d42426b89d133de0cff5099.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/d4999bad2030438c87dc7de128da91e6.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/eda441e66b464c7995aa9ec31202fa16.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/ed87212e160949aca5c0d4d7c42f5269.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/71df165f1812409ea0fd140517744c5a.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/f86baf0bb4ff4dbc991fe8180f93a8d4.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/2e02baeb17ed4896a07b9a2f7cf5d76c.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/c8263b0babd54b8bb63ffd96319ab5ab.JPG
https://badman1.oss-cn-beijing.aliyuncs.com/20210205/017dab8746ef4ac0bed0a06cdca48b9e.JPG
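https://badman1.oss-cn-beijing.aliyuncs.com/20210205/2cda815daf16431eba1b7ad535e7cc34.JPG
Thanks for sharing, I'll study it!
D:\pycharm\pythonProject\venv\Scripts\python.exe D:/pycharm/pythonProject/mm/4.py
Process finished with exit code 0
It just...
I suspect you're posting something racy, but I have no proof.
dzqaww posted on 2021-1-9 19:03
D:\pycharm\pythonProject\venv\Scripts\python.exe D:/pycharm/pythonProject/mm/4.py
Process finished with exit ...
You need to change the folder path first.
Studying on the one hand, "studying" on the other.
Nice stuff.
相信无限活宝 posted on 2021-1-12 13:48
Nice stuff.
Thanks for the support.
Where is the default folder?
Thanks to the original poster for sharing!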