Java crawler

songjing posted on 2021-9-27 17:54

This post was last edited by songjing on 2021-9-27 18:01

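A while ago I came across a crawler that I believe was written by someone on 52pojie:
https://www.52pojie.cn/thread-1309809-1-1.html
I changed it to fetch the things I wanted; the overall approach is the same.
Standing on the shoulders of giants lets me see further. Thanks to the original author.
This is my first technical post, so please excuse the rough formatting.
Test URL:
https://www.loggly.com/docs-index/log-sources/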
package com.plan.plan;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;

public class pc {
public static void main(String[] args) throws IOException {
   long t1=System.currentTimeMillis();
   // Connect to the target page
   Connection connection1=Jsoup.connect("https://www.loggly.com/docs-index/log-sources/");
   // Fetch the page and parse it into a Document
   Document document1= connection1.get();
   // Select every li.log-list__item inside ul.log-list within the log-sources__main container
   Elements elementLis=document1.select(".log-sources__main .log-list .log-list__item");
   for(Element elementLi:elementLis) {// iterate over every li that was found
         Element elementA=elementLi.selectFirst("a");// the a tag inside the li
         if(elementA==null) continue;// skip list items without a link
         String name=elementA.attr("href");
         Elements elements1log__front=elementA.getElementsByClass("log__front");
         Elements select = elements1log__front.select("img");
         String src2=null;

         // Timestamp used as the output file name
         String divName= String.valueOf(System.currentTimeMillis());

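         for(Element element:select){
            String src=element.attr("abs:src");// absolute URL of the image
            src2=element.attr("src");// raw src value as written in the page (may be relative)
            if(src.isEmpty()) continue;// skip images without a usable URL
            // The site seems to have an anti-crawler check, so a browser-like User-Agent is sent
            // Reference: https://www.cnblogs.com/xijieblog/p/4540026.html
            URL url=new URL(src);
            HttpURLConnection connection = (HttpURLConnection) url.
                     openConnection();
            connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

            // Read from the connection that carries the User-Agent header
            InputStream is=connection.getInputStream();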

            // The output directory must already exist
            FileOutputStream fos=new FileOutputStream("D:\\date\\test\\"+divName+".png");

            byte[] b=new byte[1024];// read buffer; the size is an arbitrary choice
            int count=is.read(b);
            while(count!=-1) {
               fos.write(b,0,count);
               fos.flush();
               count=is.read(b);
            }
            fos.close();
            is.close();
         }
         }

   long t2=System.currentTimeMillis();
   double a=(t2-t1)/1000.0;// divide by 1000.0 so fractions of a second are kept
   System.out.println("Download finished, elapsed: "+a+"s");
}
}
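
For the download step, here is a small standalone sketch that uses only the JDK. It sets the User-Agent on the connection and reads from that same connection, and it uses try-with-resources so both streams are closed even if the copy fails. The class name ImageSaver, its method names and the "Mozilla/5.0" User-Agent string are made up for illustration and are not from the original crawler.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

public class ImageSaver {
    // Copies everything from in to out and returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // buffer size is an arbitrary choice
        long total = 0;
        int count;
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
            total += count;
        }
        return total;
    }

    // Downloads imageUrl to target; the User-Agent is sent on the connection that is read.
    public static long download(String imageUrl, Path target) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(imageUrl).toURL().openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0"); // assumed value
        try (InputStream in = connection.getInputStream();
             OutputStream out = Files.newOutputStream(target)) {
            return copy(in, out);
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory demonstration of the copy loop, no network access needed.
        byte[] data = {1, 2, 3, 4, 5};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(copy(new ByteArrayInputStream(data), out) + " bytes copied");
    }
}
```

In the crawler, download(src, Path.of(...)) would be called once per image inside the loop, with the absolute image URL and a distinct file name for each image.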

Required Maven dependency

<dependency>
   <groupId>org.jsoup</groupId>
   <artifactId>jsoup</artifactId>
   <version>1.13.1</version>
</dependency>

hualonghongyan posted on 2021-9-27 18:32

Not much use. The real skill in crawling is analysing the page's code and finding where the resources come from.
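
One small part of tracing a resource to its source: the src written in an img tag is often relative, so it has to be resolved against the page URL before it can be fetched. A minimal sketch using only the JDK's java.net.URI follows; the image paths and the class name ResolveSrc are made up for illustration.

```java
import java.net.URI;

public class ResolveSrc {
    // Resolves a possibly relative src value against the URL of the page it came from.
    public static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.loggly.com/docs-index/log-sources/";
        // Hypothetical src values: root-relative, page-relative and already absolute.
        System.out.println(resolve(page, "/img/logo.png"));
        System.out.println(resolve(page, "icons/java.png"));
        System.out.println(resolve(page, "https://cdn.example.com/a.png"));
    }
}
```

jsoup does the same resolution when an attribute is read with the abs: prefix, as in attr("abs:src").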

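阿光最帅 posted on 2021-9-27 20:43

Just downloading something from a website isn't hard: right-click and save, or simply copy it. The hard parts are downloading in bulk, getting past the site's anti-crawler measures, and finding the resources you need.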
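
As a rough sketch of the bulk-download part: loop over the collected URLs, pause between requests, and let one failed link be skipped without aborting the rest. The names BatchRunner and Fetcher, the example URLs and the 100 ms delay are assumptions made for illustration; the actual download is left to the callback.

```java
import java.util.List;

public class BatchRunner {
    // Hypothetical callback that downloads one URL; it may throw on failure.
    public interface Fetcher {
        void fetch(String url, int index) throws Exception;
    }

    // Runs the fetcher over every URL with a pause in between; returns how many succeeded.
    public static int run(List<String> urls, long delayMillis, Fetcher fetcher) {
        int succeeded = 0;
        for (int i = 0; i < urls.size(); i++) {
            try {
                fetcher.fetch(urls.get(i), i);
                succeeded++;
            } catch (Exception e) {
                // One bad link should not abort the whole batch.
                System.err.println("skipped " + urls.get(i) + ": " + e.getMessage());
            }
            if (i < urls.size() - 1) {
                try {
                    Thread.sleep(delayMillis); // spacing out requests
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag and stop early
                    break;
                }
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("https://example.com/a.png", "https://example.com/b.png");
        int ok = run(urls, 100,
                (url, i) -> System.out.println("would download " + url + " as " + i + ".png"));
        System.out.println(ok + " of " + urls.size() + " succeeded");
    }
}
```

Using the loop index in the file name keeps each image in its own file.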

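2210075017 posted on 2021-9-27 18:58

hualonghongyan posted on 2021-9-27 18:32
Not much use. The real skill in crawling is analysing the page's code and finding where the resources come from.

"Finding where the resources come from":
could you give an example? I have just started learning about crawlers.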

liwangC posted on 2021-9-27 23:02

Here to learn.

lyj996 posted on 2021-9-27 23:34

Thanks for sharing.

我今天是大佬 posted on 2021-9-28 09:17

Writing a crawler in Java is a bit more hassle; Python is a better fit.

songjing posted on 2021-9-28 09:20

我今天是大佬 posted on 2021-9-28 09:17
Writing a crawler in Java is a bit more hassle; Python is a better fit.

I don't know Python. I'm still a Java beginner.

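songjing posted on 2021-9-28 09:27

2210075017 posted on 2021-9-27 18:58
"Finding where the resources come from":
could you give an example? I have just started learning about crawlers.

It's actually not that hard, I think. Say you want to download a video from a page: first find the video's link address, then look at the collection those addresses sit in. It's the same with these images: work out which div the image is in, and which ul under that div.
Then take all the li elements under the ul, read the link address from the a tag in each li, and download it.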
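
The walk described above (div, then ul, then every li, then the link in its a tag) can be sketched as follows. So that the sketch runs with only the JDK, it parses a small hand-written, well-formed snippet with the built-in XML parser and XPath. This is a stand-in, not the jsoup approach from the first post, and real pages are usually not well-formed XML. The class names in the snippet mirror the ones used in the crawler; the href values and the class name LinkWalk are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkWalk {
    // Hypothetical, well-formed stand-in for the real page.
    public static final String SAMPLE =
          "<div class=\"log-sources__main\">"
        + "<ul class=\"log-list\">"
        + "<li class=\"log-list__item\"><a href=\"/docs/a\"><img src=\"/img/a.png\"/></a></li>"
        + "<li class=\"log-list__item\"><a href=\"/docs/b\"><img src=\"/img/b.png\"/></a></li>"
        + "</ul></div>";

    // div -> ul -> li -> a -> href, written as one XPath expression.
    public static final String LINKS_XPATH =
        "//div[@class='log-sources__main']/ul[@class='log-list']/li/a/@href";

    // Returns the values of all attribute nodes matched by xpath, in document order.
    public static List<String> collect(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse or query the snippet", e);
        }
    }

    public static void main(String[] args) {
        for (String href : collect(SAMPLE, LINKS_XPATH)) {
            System.out.println(href);
        }
    }
}
```

With jsoup the same walk is a single CSS selector over the parsed page, and it also copes with HTML that is not well-formed.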

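我今天是大佬 posted on 2021-9-28 09:27

songjing posted on 2021-9-28 09:20
I don't know Python. I'm still a Java beginner.

I don't see any logical connection between whether you know Python and how good your Java is.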


吾爱破解 - 52pojie.cn's Archiver
