Java Crawler Engine

henry307 posted on 2023-3-22 12:20

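1. This crawler engine is built on HttpClient. It supports both HTTP and HTTPS, custom User-Agent strings, custom headers, and proxies, and it can fetch images as well as HTML. The framework has three parts: WebClient, WebRequest, and ResponseResult. WebClient is the core of the engine and performs the actual resource download. WebRequest is the request side: the custom User-Agent, custom headers, and proxy are all set on the WebRequest. ResponseResult is the response side and holds the response headers, the response stream, the response cookies, and so on.

2. HTML fetch test
// Web page fetch test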
private static void testHTMLSeek() {

   String token = "";
   try {
         String status = "";

         do {

            // Obtain a token using the appkey and seckey
            String appkey = "youappkey";
            String seckey = "youseckey";
            WebRequest wb = new WebRequest();
            //wb.setProxy("122.4.45.43:3937"); // set a proxy
            wb.setMethod("GET");
            wb.setUserAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36");
            wb.setUrl("https://api.qianzhan.com/OpenPlatformService/GetToken?type=JSON&appkey=" + appkey + "&seckey="
                     + seckey);
            ResponseResult sr = WebClient.download(wb);
            String tokenJson = sr.getResponseHtmlStr();
            token = ParseJson(ParseJson(tokenJson, "result"), "token");
            System.out.println("token:" + token);

            // Test the "combined multi-condition search" endpoint
            JSONObject json = new JSONObject();
            json.put("token", token);
            json.put("type", "JSON");
            json.put("companyName", "腾讯"); // search term: Tencent
            json.put("areaCode", "");
            json.put("faRen", "");
            json.put("bussinessDes", "");
            json.put("address", "");
            json.put("gd", "");
            json.put("page", "1");
            json.put("pagesize", "10");

            // Reuse the same request object for the POST call
            wb.setMethod("POST");
            wb.setUrl("https://api.qianzhan.com/OpenPlatformService/CombineIndexSearch");
            wb.setForm(json.toString());
            wb.setContentType("application/json");
            sr = WebClient.download(wb);
            String result = sr.getResponseHtmlStr();
            System.out.println(result);

            status = ParseJson(result, "status");

         // Compare string contents with equals(); == only compares references
         } while ("101".equals(status) || "102".equals(status));

   } catch (Exception e1) {

         e1.printStackTrace();
   }

}
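
The loop above repeats the whole token-plus-search round trip for as long as the API keeps returning status 101 or 102, with no upper bound. A minimal sketch of a bounded version follows. The class name RetrySketch, the method callWithRetry, and the maxAttempts parameter are hypothetical and not part of the engine; only the two status codes come from the loop condition above.

```java
import java.util.Set;
import java.util.function.Supplier;

final class RetrySketch {

    // Status values that trigger another attempt (taken from the do-while condition).
    private static final Set<String> RETRYABLE = Set.of("101", "102");

    /**
     * Invokes the call up to maxAttempts times and returns the first status
     * that is not retryable, or the last status seen once attempts run out.
     * Returns null if maxAttempts is less than 1.
     */
    static String callWithRetry(Supplier<String> call, int maxAttempts) {
        String status = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = call.get();
            // Set.of(...) rejects null lookups, so check for null first.
            // contains() compares with equals(), not with ==.
            if (status == null || !RETRYABLE.contains(status)) {
                return status;
            }
        }
        return status;
    }
}
```

In the test method, the body of the do-while would become the Supplier, and the caller would decide what to do when the returned status is still 101 or 102 after the last attempt.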
3. Image download test
// Image fetch test
// Requires java.io.File and java.io.FileOutputStream
private static void testImageSeek() {
   WebRequest req = new WebRequest();
   req.setUrl("https://img3.qianzhan.com/news/201803/30/20180330-40fe3e684227ed76_250x150.jpg");
   req.setMethod("GET");
   ResponseResult rsp = WebClient.download(req);
   byte[] imageBytes = rsp.getResponseContent();

   String fileName = "test-image.jpg"; // the source URL is a JPEG, so keep the .jpg extension
   String userprofile = System.getenv("USERPROFILE"); // only set on Windows
   File file = new File(userprofile + "\\Desktop\\" + fileName);
   // try-with-resources closes the stream even if write() throws
   try (FileOutputStream fops = new FileOutputStream(file)) {
         fops.write(imageBytes);
         System.out.println("Image written to " + file.getAbsolutePath());
   } catch (Exception e) {
         e.printStackTrace();
   }
}
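
The image test depends on the Windows-only USERPROFILE variable and a hard-coded backslash path. A minimal sketch of a portable alternative using java.nio.file follows. The class name SaveBytesSketch, the save method, and the file name sketch.bin are hypothetical; the user.home system property is standard JDK behavior, but whether a Desktop folder exists under it depends on the machine, which is why the sketch creates the directory if it is missing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class SaveBytesSketch {

    /** Writes content to dir/fileName, creating dir if needed, and returns the target path. */
    static Path save(byte[] content, Path dir, String fileName) throws IOException {
        Files.createDirectories(dir);        // no-op if the directory already exists
        Path target = dir.resolve(fileName); // uses the platform's path separator
        return Files.write(target, content); // creates or truncates, then closes the file
    }

    public static void main(String[] args) throws IOException {
        // user.home is available on every platform, unlike USERPROFILE.
        Path desktop = Paths.get(System.getProperty("user.home"), "Desktop");
        Path written = save(new byte[] {1, 2, 3}, desktop, "sketch.bin");
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

In the image test, the bytes from rsp.getResponseContent() would be passed as the content argument.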
4. Engine source code download link
https://pan.baidu.com/s/1sji6UOhyvOLNv345Vn9EdA?_at_=1679458752261
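
For readers who only want a feel for the request / client / response split described in section 1, the same idea can be approximated with the JDK's built-in java.net.http package (Java 11 or later). This is not the engine's source code, and the engine itself is described as being built on HttpClient rather than on this JDK API. The class name FetchSketch and its three methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class FetchSketch {

    /** Request side: URL, method, and a custom User-Agent header. */
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    /** Client side: pass a null proxyHost for a direct connection. */
    static HttpClient buildClient(String proxyHost, int proxyPort) {
        HttpClient.Builder builder = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10));
        if (proxyHost != null) {
            builder.proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)));
        }
        return builder.build();
    }

    /** Response side: raw bytes work for both HTML and images. */
    static byte[] download(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        return response.body();
    }
}
```

Returning bytes keeps one code path for text and binary content; HTML can be decoded afterwards with new String(bytes, StandardCharsets.UTF_8) when the page is known to be UTF-8.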

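爱吃猫的小鱼儿 posted on 2023-4-1 21:52

Last edited by 爱吃猫的小鱼儿 on 2023-4-1 21:55

This implementation of yours is pretty interesting. I looked into this area a while back, so here is a recommendation, a Java crawler: http://webmagic.io/docs/zh/posts/ch1-overview/architecture.html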

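cl201696084076 posted on 2023-3-22 13:17

Learned something, thanks for sharing.

jianghuai posted on 2023-3-22 13:42

bullsaibullsai posted on 2023-3-22 13:51

Beginner here to learn, thanks for sharing.

dengjunyi posted on 2023-3-22 14:05

Thanks for sharing.

叶隽 posted on 2023-3-22 14:08

Pretty interesting, although crawlers are rarely written in Java these days.

henry307 posted on 2023-3-22 14:10

叶隽 posted on 2023-3-22 14:08
Pretty interesting, although crawlers are rarely written in Java these days.

Yeah, treat it as a learning exercise. I'll write an HTML parser later.

lingwushexi posted on 2023-3-22 14:13

Beginner learning here, thanks for sharing!

ppp936246 posted on 2023-3-22 14:20

Classic do-while.

henry307 posted on 2023-3-22 14:23

ppp936246 posted on 2023-3-22 14:20
Classic do-while.

Haha, the example was thrown together quickly; the engine source code is the main thing to look at.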

吾爱破解 - 52pojie.cn's Archiver