Earlier posts by others: @OnlineYx @l2430478
https://www.52pojie.cn/thread-1348486-1-3.html
https://www.52pojie.cn/thread-1349446-1-1.html
Seeing that other people's scripts could only scrape one page or a few pages, I wondered whether everything could be scraped. Here's the line of thought:
1. My first idea was to load all the category pages, such as the album overview, but opening them showed duplicate entries; the data there is probably incomplete, or there is some other entry point, so I skipped it.
2. Then I noticed the search button and tried whether SQL injection would work; it didn't.
3. There is a sitemap linked at the bottom of the page; it would work, but going through it would be too tedious, so I skipped that too.
4. Finally I found the right spot: the archives page (archives.html).
The site owner seems to have listed every article there. Judging by the count, there are 1320 entries in total, presumably generated automatically from a database query. So, time to scrape!
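Before committing to the full crawl, it's easy to sanity-check that figure. A minimal sketch (it assumes the same al_post_list structure that the full script below relies on):
[Python]
# Quick check: count the entries listed on archives.html.
import requests
from lxml import etree

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://www.vmgirls.com/archives.html', headers=headers).text
items = etree.HTML(res).xpath('//ul[@class="al_post_list"]/li')
print(len(items))  # around 1320 at the time of writing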
Here is the full code:
[Python]
# requests for HTTP: install first with  pip install requests
# lxml for XPath parsing: install first with  pip install lxml
import os

import requests
from lxml import etree

if __name__ == '__main__':
    base_url = 'https://www.vmgirls.com/'
    archives_url = base_url + 'archives.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3861.400 QQBrowser/10.7.4313.400'
    }
    count = 0           # total images downloaded across all articles
    page_now_count = 1  # index of the article currently being processed
    # The archives page lists every article on the site.
    all_res = requests.get(url=archives_url, headers=headers).text
    html = etree.HTML(all_res)
    item_list = html.xpath('//ul[@class="al_post_list"]/li')
    if item_list:
        if not os.path.exists('girls'):
            os.mkdir('girls')
        page_all_count = len(item_list)
        for item in item_list:
            # Uncomment to resume an interrupted crawl by skipping articles already done.
            # if page_now_count < 1300:
            #     page_now_count += 1
            #     continue
            date = item.xpath('./text()')[0]    # publication date (extracted but unused)
            title = item.xpath('./a/text()')[0]
            item_url = base_url + item.xpath('./a/@href')[0]
            item_res = requests.get(url=item_url, headers=headers).text
            html = etree.HTML(item_res)
            # Articles come in two gallery layouts; try the first, fall back to the second.
            girls = html.xpath('//div[@class="post-content"]/div[@class="nc-light-gallery"]//img')
            if not girls:
                girls = html.xpath('//ul[@class="blocks-gallery-grid"]/li[@class="blocks-gallery-item"]/figure/a/img')
            i = 0  # images downloaded for this article
            if girls:
                for girl in girls:
                    girl_alt = girl.xpath('./@alt')[0]  # caption (extracted but unused)
                    girl_img = 'https:' + girl.xpath('./@src')[0]
                    if not os.path.exists('girls/' + title):
                        os.mkdir('girls/' + title)
                    img_data = requests.get(girl_img, headers=headers).content
                    i += 1
                    with open('girls/' + title + '/' + str(i) + '.jpeg', 'wb') as f:
                        f.write(img_data)
                    # print(title + '-' + str(i) + ' downloaded')
                    count += 1
            else:
                print(item_url, 'no gallery found')
            print('(', str(page_now_count), '/', str(page_all_count), ') [', title, '] done,', str(i), 'images')
            page_now_count += 1
    print('|================= finished (', str(count), ') =================|')
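One caveat worth noting (my own addition, not part of the original script): the article title is used directly as a directory name, so a title containing characters such as / or ? would make os.mkdir fail, especially on Windows. A small hypothetical helper like this could be applied to title before the directories are created:
[Python]
import re

def safe_dirname(name):
    # Replace characters that are not allowed in Windows file names.
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()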
Closing remarks:
I originally planned to use the Scrapy framework, but with only a single listing page to handle there's no need for that much ceremony; plain requests is more comfortable.
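For comparison, here is a rough sketch of what the Scrapy version might look like (untested, structure only; the spider name is mine and the selectors are taken from the script above):
[Python]
import scrapy

class VmgirlsSpider(scrapy.Spider):
    name = 'vmgirls'
    start_urls = ['https://www.vmgirls.com/archives.html']

    def parse(self, response):
        # Follow every article link listed on the archives page.
        for href in response.xpath('//ul[@class="al_post_list"]/li/a/@href').getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Yield image URLs; an ImagesPipeline would handle the actual downloads.
        for src in response.xpath('//div[@class="nc-light-gallery"]//img/@src').getall():
            yield {'image_urls': ['https:' + src]}
For one listing page plus a download loop, that extra machinery buys very little, which is why I stayed with plain requests.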
Partway through the crawl the image selector stopped matching; on closer inspection, some pages aren't built as image galleries but as ordinary articles, so I added a fallback:
girls = html.xpath('//div[@class="post-content"]/div[@class="nc-light-gallery"]//img')
if not girls:
    girls = html.xpath('//ul[@class="blocks-gallery-grid"]/li[@class="blocks-gallery-item"]/figure/a/img')
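If the site ever adds a third layout, the same idea generalizes to a list of selectors tried in order. A small sketch of that pattern (my own generalization, not something the original script needs today):
[Python]
# Candidate gallery selectors, tried in order until one matches.
GALLERY_SELECTORS = [
    '//div[@class="post-content"]/div[@class="nc-light-gallery"]//img',
    '//ul[@class="blocks-gallery-grid"]/li[@class="blocks-gallery-item"]/figure/a/img',
]

def find_gallery_imgs(html):
    for selector in GALLERY_SELECTORS:
        imgs = html.xpath(selector)
        if imgs:
            return imgs
    return []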
Thanks to this detailed write-up on requests: https://www.52pojie.cn/thread-1351042-1-1.html
That's all, the end!