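Python Self-Study Log - Practice 1 - Scraping Beauty Pictures - 吾爱破解 - 52pojie.cn

BoBuo posted on 2021-10-6 20:48

Python Self-Study Log -- Practice 1 -- Scraping Beauty Pictures

Target URL: http://www.netbian.com/

Import modules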
import requests
from lxml import etree

# Set the User-Agent
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 "
                  "Safari/537.36"}

Page = int(input('Enter the number of pages to download: '))
if Page == 1:
    url = 'http://www.netbian.com/meinv/'
    response = requests.get(url, headers=header).text
    html = etree.HTML(response)
    for a in range(4, 21):
        print('Downloading image', a - 3)
        # xpath() returns a list, so take the first match
        response2 = html.xpath('//*[@id="main"]/div/ul/li[' + str(a) + ']/a/@href')[0]
        # Build the URL of the image's detail page
        CQT_url = 'http://www.netbian.com/' + response2

        response2 = requests.get(CQT_url, headers=header).text
        html2 = etree.HTML(response2)
        TP_url = html2.xpath('//*[@id="main"]/div/div/p/a/img/@src')[0]
        TP_name = html2.xpath('//*[@id="main"]/div/div/p/a/img/@alt')[0]
        image = requests.get(TP_url)
        # Save the image; the target folder must already exist
        with open(fr"D:\代码保存\图片\{TP_name}.jpg", "wb") as file:
            file.write(image.content)
elif Page >= 2:
    # Note: this branch starts at page 2, so page 1 is not downloaded
    for page in range(2, Page + 1):
        url = 'http://www.netbian.com/meinv/index_' + str(page) + '.htm'
        response = requests.get(url, headers=header).text
        html = etree.HTML(response)
        for a in range(4, 21):
            print('Downloading page', page, 'image', a - 3)
            response2 = html.xpath('//*[@id="main"]/div/ul/li[' + str(a) + ']/a/@href')[0]
            CQT_url = 'http://www.netbian.com/' + response2
            # print(CQT_url)

            response2 = requests.get(CQT_url, headers=header).text
            html2 = etree.HTML(response2)
            TP_url = html2.xpath('//*[@id="main"]/div/div/p/a/img/@src')[0]
            TP_name = html2.xpath('//*[@id="main"]/div/div/p/a/img/@alt')[0]
            image = requests.get(TP_url)

            with open(fr"D:\代码保存\图片\{TP_name}.jpg", "wb") as file:
                file.write(image.content)

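I couldn't find one URL pattern that covers every page, so I used an if statement. The site also seems to have anti-scraping in place. Each page has 20 images, and the XPath of the third one differs from the others, so I start from the fourth. If any of you seniors have a spare moment, please take a look and give me some pointers. I've attached a screenshot of the difference.
I wanted to post a few of the pictures, but they were too large to upload. Sigh!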
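
On the pagination point, here is a minimal sketch of how the two URL shapes from the post could be folded into one helper, so the download loop would not need separate if/elif branches. The names `BASE` and `page_url` are made up for illustration, and the URL pattern is simply the one used in the code above; it has not been checked against the live site.

```python
# Hypothetical helper: maps a 1-based page number to the list-page URL.
# The pattern (page 1 has no index suffix, later pages use index_N.htm)
# is taken from the post's code, not verified against the site.
BASE = "http://www.netbian.com/meinv/"


def page_url(page):
    """Return the list-page URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if page == 1:
        return BASE
    return BASE + "index_" + str(page) + ".htm"


# One loop now covers page 1 and every later page.
for n in range(1, 4):
    print(page_url(n))
```

With a helper like this, `for page in range(1, Page + 1)` would also pick up page 1, which the elif branch in the post skips.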

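suyaming posted on 2021-10-7 04:29

Last edited by suyaming on 2021-10-7 04:31

I reworked it for you and added threads and a proxy pool; the speed is shown in the screenshot below.

This site's anti-scraping mainly hinges on the yjs_js_security_passport value in the Cookie, and adding it by hand is enough. The proxy pool is there for stability; it may work without one, but I haven't tried.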
import os
import threading
from jsonpath import jsonpath
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


class Netbian:
    def __init__(self):
        self.proxy = {
            'http': 'http://' + '',  # Initial proxy, fill in your own
            'https': 'https://' + ''
        }
        self.count = 0
        self.lock = threading.Lock()  # Guards self.count across worker threads
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
            'Host': 'www.netbian.com',
            'Cookie': 'yjs_js_security_passport=56950c00c92b7532be82ed4ae7f342785ac382dd_1633550649_js; '
        }

    def get_proxy(self):
        url = ''  # Proxy pool API, supply your own
        resp = requests.get(url)
        data = resp.json()
        # jsonpath() returns a list of matches, so take the first one
        ip = jsonpath(data, '$..ip')[0]
        port = jsonpath(data, '$..port')[0]
        proxy = str(ip) + ':' + str(port)
        self.proxy = {
            'http': 'http://' + proxy,
            'https': 'https://' + proxy
        }

    def get_data(self, num=1):
        url = 'http://www.netbian.com/'
        if num != 1:
            url = 'http://www.netbian.com/index_' + str(num) + '.htm'
        response = requests.get(
            url=url,
            headers=self.headers,
            proxies=self.proxy)
        # On a 503, switch to a fresh proxy and retry once
        if response.status_code == 503:
            self.get_proxy()
            response = requests.get(
                url=url,
                headers=self.headers,
                proxies=self.proxy)
        response.encoding = 'gb2312'
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

    def analytical_data(self, num):
        data = self.get_data(num)
        all_li = data.find(attrs={'class': 'list'}).ul.find_all('li')
        os.makedirs('Save Pictrue', exist_ok=True)  # open() fails if the folder is missing
        for i in all_li:
            title = i.a.get('title')
            # Skip list items whose link has no title
            if title is not None:
                picture_url = i.img.get('src')
                with open('Save Pictrue' + '/' + title + '.jpg', "wb") as code:
                    img_content = requests.get(url=picture_url, headers=self.headers)
                    code.write(img_content.content)
                with self.lock:
                    self.count += 1
                    print('Downloaded image ' + str(self.count) + '.')

    def main(self):
        number = input('Start scraping from which page?\n')
        page = input('How many pages to scrape?\n')
        with ThreadPoolExecutor(max_workers=5) as t:
            obj_list = []
            for i in range(int(page)):
                obj = t.submit(self.analytical_data, int(number) + i)
                obj_list.append(obj)
            # Calling result() re-raises any exception from a worker thread
            for obj in obj_list:
                obj.result()


if __name__ == '__main__':
    Netbian().main()

Result screenshot
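
On the threading side, here is a minimal sketch of keeping the futures returned by `submit()` and reading their results, so that an exception raised inside a worker is reported rather than silently dropped. `fake_download` and `run_pages` are hypothetical stand-ins written for this illustration; the stand-in does no network access, and the per-page count of 17 is just a placeholder value.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(page):
    """Hypothetical stand-in for a per-page download job."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return 17  # placeholder: pretend 17 images were saved


def run_pages(start, count, worker=fake_download, max_workers=5):
    """Run worker(page) for `count` pages and collect a result or an error per page."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the page number it was submitted for.
        futures = {pool.submit(worker, start + i): start + i for i in range(count)}
        for fut in as_completed(futures):
            page = futures[fut]
            try:
                results[page] = fut.result()  # re-raises the worker's exception
            except Exception as exc:
                results[page] = exc
    return results


summary = run_pages(0, 3)
for page in sorted(summary):
    print(page, summary[page])
```

Without a call to `result()` (or `exception()`), a failure in one page's job would leave no trace in the console, which makes a stalled scrape hard to diagnose.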

piratedrizzle posted on 2021-10-6 21:29

Supporting this. Working on something concrete like this is more motivating.

wanglinok posted on 2021-10-6 20:58

Supporting this, I'm teaching myself too...

ty1314 posted on 2021-10-6 21:16

What a bunch of LSPs, so this is what you're all learning to program for....

I like it...

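Sandwiches posted on 2021-10-6 21:42

For the anti-scraping, changing the cookies and headers should probably be enough. These are all li tags, so it should be fairly easy, right?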
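
As a rough sketch of the cookie-and-header idea, the snippet below builds a request that carries a User-Agent and the yjs_js_security_passport cookie named elsewhere in this thread, without sending anything. It uses the standard library's `urllib.request.Request` only so it runs without extra packages; `build_request` and the placeholder cookie value are made up for illustration, and whether these two headers are sufficient for the site is an assumption.

```python
from urllib.request import Request


def build_request(url, passport, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request carrying the anti-scraping cookie."""
    headers = {
        "User-Agent": user_agent,
        # Cookie name comes from the thread; the value must be copied
        # from a real browser session.
        "Cookie": "yjs_js_security_passport=" + passport,
    }
    return Request(url, headers=headers)


req = build_request("http://www.netbian.com/", "PASTE_VALUE_FROM_BROWSER")
print(req.get_header("Cookie"))
```

With requests, the same `headers` dict can be passed as `requests.get(url, headers=headers)`.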

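BoBuo posted on 2021-10-6 21:51

Sandwiches posted on 2021-10-6 21:42
For the anti-scraping, changing the cookies and headers should probably be enough. These are all li tags, so it should be fairly easy, right?

One of the 20 li elements has a different tag name inside it.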
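
One way to cope with an odd list item without hard-coding its position is to walk every li and skip the ones that lack the expected child. The sketch below illustrates that on a made-up, well-formed snippet with the standard library's `xml.etree.ElementTree`; the markup in `SAMPLE` and the name `detail_links` are hypothetical, and real pages are usually not valid XML, so an HTML parser such as lxml or BeautifulSoup would be needed there.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The second <li> wraps its link in a
# <div>, standing in for the item whose structure differs.
SAMPLE = """
<ul>
  <li><a href="/desk/1.htm" title="first"><img src="a.jpg"/></a></li>
  <li><div class="box"><a href="/other/"><img src="ad.jpg"/></a></div></li>
  <li><a href="/desk/2.htm" title="second"><img src="b.jpg"/></a></li>
</ul>
"""


def detail_links(markup):
    """Return hrefs of titled <a> elements that are direct children of an <li>."""
    root = ET.fromstring(markup)
    links = []
    for li in root.findall("li"):
        a = li.find("a")  # direct child only, so the odd <li> yields None
        if a is not None and a.get("title"):
            links.append(a.get("href"))
    return links


print(detail_links(SAMPLE))
```

Filtering by structure instead of by index means the loop keeps working if the odd item moves to a different position on another page.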

Cacarot posted on 2021-10-6 21:56

Combining theory with hands-on practice like this works well.

晓渡寒沙 posted on 2021-10-7 06:42

Going to try this right away.

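BoBuo posted on 2021-10-7 09:07

suyaming posted on 2021-10-7 04:29
I reworked it for you and added threads and a proxy pool; the speed is shown in the screenshot below.

This site's anti-scraping mainly hinges on the yjs_js_security_pa ...

Thank you, senior, that's impressive!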
