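My journey scraping certain image websites

Tanyongfeng posted on 2020-7-23 22:51

For some reason I got really interested in Java crawlers today, so following the API docs and other people's tutorials I wrote a program that scrapes http://www.netbian.com/. Here's the source!!!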
package com.cn.utils;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;

/**
* @author tanyongfeng
* @version 1.0.0
*/
public class ImageDownloading {
public static void main(String[] args) {
   ArrayList<String> filePath = new ArrayList<>();// image URLs to download
   ArrayList<String> fileId= new ArrayList<>();// unique IDs, used as file names
   String url;// listing-page URL; customize as you like
   for (int PageIndex = 1 ; PageIndex < 20 ; PageIndex++){// start at 1 so the first page is included
         if (PageIndex == 1){// http://www.netbian.com/1_index.htm returns a 404, so page 1 uses the site root
            url= "http://www.netbian.com/";
         }else{
            url= "http://www.netbian.com/index_"+PageIndex+".htm";
         }
         try {
            Document doc = Jsoup.connect(url).get();
            Elements el = doc.getElementsByClass("list");// elements whose class is "list"
            Elements li = el.select("li").select("a");// the <a> tags inside each <li>
            for (Element element : li){
               String url1 = element.attr("href");// href of the second-level page
               // ad links differ from image links mainly in containing the letter p, so skip them;
               // also skip hrefs too short for the substring call below
               if (url1.length() < 11 || url1.contains("p")){
                     continue;
               }
               String ID = url1.substring(6,11);// the file ID, a unique value
               String url2 = "http://www.netbian.com/desk/"+ID+"-1920x1080.htm";// page holding the 1080P image
               /* fetch the second-level page */
               Document document = Jsoup.connect(url2).get();
               Element ending = document.getElementById("endimg");
               if (ending == null){// page has no image container
                     continue;
               }
               Elements img = ending.getElementsByTag("img");
               if (img.isEmpty()){
                     continue;
               }
               // add the ID and the URL together so the two lists stay aligned
               fileId.add(ID);
               filePath.add(img.get(0).attr("src"));
            }
         } catch (IOException e) {
            System.out.println("Error while fetching pages");
            e.printStackTrace();
         }finally {
            System.out.println("Finished page "+PageIndex);
         }
   }
   try {
         SavePng(filePath,fileId);// download and save using the collected URLs and file names
   } catch (Exception e) {
         System.out.println("Error while saving files");
         e.printStackTrace();
   }finally {
         System.out.println("Saving finished");
   }
}

// Downloads the images to local disk

/**
*
* @param filePath image URLs
* @param fileId file IDs
*/
public static void SavePng(ArrayList<String> filePath,ArrayList<String> fileId){
   // save each image in turn
   for (int index = 0 ; index < filePath.size() ; index++){
         File file = new File("D:/test/"+fileId.get(index)+".jpg");
         // try-with-resources closes both streams after every image, even on failure
         try (InputStream inputStream = new URL(filePath.get(index)).openStream();
              FileOutputStream fileOutputStream = new FileOutputStream(file)) {
            byte[] buffer = new byte[1024];
            int length;
            /* copy the bytes straight to the file */
            while ((length = inputStream.read(buffer)) != -1){
               fileOutputStream.write(buffer,0,length);
            }
            System.out.println("Saved "+file.getName());
         } catch (MalformedURLException e) {
            System.out.println("Malformed URL");
            e.printStackTrace();
         } catch (FileNotFoundException e) {
            System.out.println("Could not create file");
            e.printStackTrace();
         } catch (IOException e) {
            System.out.println("Error while transferring file");
            e.printStackTrace();
         }
   }
}
}
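
A side note on the `substring(6,11)` call in the source above: it only works while every href looks like `/desk/` followed by exactly five digits. Below is a rough sketch of a more forgiving way to pull out the ID with a regular expression. The class name `WallpaperIdSketch`, the helper `extractId`, the assumed href shape `/desk/<digits>.htm` and the sample hrefs are all my own assumptions for illustration, not something checked against the live site.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WallpaperIdSketch {
    // Assumed href shape: "/desk/<digits>.htm". Anything else (ads, external links) is rejected.
    private static final Pattern DETAIL_HREF = Pattern.compile("^/desk/(\\d+)\\.htm$");

    /** Returns the numeric ID from a detail-page href, or empty if the href does not match. */
    static Optional<String> extractId(String href) {
        Matcher m = DETAIL_HREF.matcher(href);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample hrefs, for illustration only.
        System.out.println(extractId("/desk/22773.htm"));
        System.out.println(extractId("http://example.com/promo.htm"));
    }
}
```

With something like this, the "contains the letter p" check and the fixed substring offsets could both be swapped for a single `extractId(href).isPresent()` test, and IDs with more or fewer than five digits would still work.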

I used Jsoup for this,
so you first need to add the dependency:
<dependencies>
   <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.13.1</version>
   </dependency>
</dependencies>

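Also, if you browse the site you can customize the url variable in the source (that is, the image category), for example changing it to http://www.netbian.com/meinv/..... and you will discover a whole new world..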
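
To make swapping the url variable less error-prone, the page URL could be built by one small helper instead of being hard-coded in the loop. This is only a sketch: `ListingUrlSketch` and `pageUrl` are hypothetical names, and it assumes (without having checked) that category sections paginate the same way as the home page, i.e. the section root for page 1 and `index_<n>.htm` after that.

```java
class ListingUrlSketch {
    /**
     * Builds the listing URL for one page of a section.
     * sectionBase is the section root ending in "/", e.g. "http://www.netbian.com/".
     */
    static String pageUrl(String sectionBase, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        // Page 1 is the section root; later pages use index_<n>.htm, as in the original loop.
        return page == 1 ? sectionBase : sectionBase + "index_" + page + ".htm";
    }

    public static void main(String[] args) {
        System.out.println(pageUrl("http://www.netbian.com/", 1));
        System.out.println(pageUrl("http://www.netbian.com/", 3));
    }
}
```

Switching category would then mean changing only the `sectionBase` argument.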
Files are saved to the D:/test directory by default. If you run into problems, feel free to comment!!
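
Since D:/test is hard-coded, the save step fails if that folder does not exist or if you are not on Windows. Here is a minimal sketch of saving into a configurable directory with `java.nio.file.Files`. `SaveDirSketch` and `save` are hypothetical names, and the temp directory and three-byte stream in `main` are stand-ins for the real folder and the real image download.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class SaveDirSketch {
    /** Copies the stream into dir/<id>.jpg, creating dir if needed, and returns the target path. */
    static Path save(InputStream in, Path dir, String id) throws IOException {
        Files.createDirectories(dir); // does nothing if the directory already exists
        Path target = dir.resolve(id + ".jpg");
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wallpapers"); // stand-in for D:/test
        try (InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3})) {
            Path saved = save(in, dir, "demo");
            System.out.println(saved + " (" + Files.size(saved) + " bytes)");
        }
    }
}
```

In the crawler, the `InputStream` would come from `new URL(...).openStream()` and `dir` could be taken from a command-line argument.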
Please toss me a coin, everyone, haha.
Screenshot of a successful run

Tanyongfeng posted on 2020-7-24 07:50

The downloaded images are all 1920*1080, and the image quality is very good.

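Tanyongfeng posted on 2020-7-28 07:12

半夏orz posted on 2020-7-27 22:16
Thanks for the reply
I phrased it badly. What if the download URL can't be obtained directly, but is generated by the backend when you click download, and downloading requires login. Then ...

My source code only works for the site mentioned above. Generally the download URL matches what shows up in the page source. If the site uses redirects or request forwarding, that should not be a problem.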

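半夏orz posted on 2020-7-27 22:16

Tanyongfeng posted on 2020-7-25 21:52
What do you mean by suffixes with no pattern?

Thanks for the reply
I phrased it badly. What if the download URL can't be obtained directly, but is generated by the backend when you click download, and downloading requires login. Wouldn't that be a real hassle?
Even nastier, what if downloading costs points...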

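HelloWorld001 posted on 2020-7-24 00:04

Awesome work, you're a pro

就是那个秋 posted on 2020-7-24 00:12

Are these HD images?

xiaoxi2011 posted on 2020-7-24 00:32

Learned something, thanks for sharing

sunv52pojie posted on 2020-7-24 00:56

Thanks for sharing, OP

云之从 posted on 2020-7-24 01:43

Thanks for sharing. Crawlers really are useful, learned something

Psyber posted on 2020-7-24 02:00

Crawling with Java means importing a lot of packages; with Python you need relatively fewer

YIHAN1008 posted on 2020-7-24 06:17

Learned something; this is the first I've heard of Java crawlers

Tanyongfeng posted on 2020-7-24 07:42

就是那个秋 posted on 2020-7-24 00:12
Are these HD images?

1080P images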

吾爱破解 - 52pojie.cn's Archiver