Simple Java crawler, looking for pointers

HK仅輝 · Posted on 2020-11-8 17:02

Last edited by HK仅輝 on 2020-11-8 19:24

package test;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.*;
import java.net.URL;

public class pc {

    public static void main(String[] args) throws IOException {
        long t1 = System.currentTimeMillis();
        // Connect to the target URL
        Connection connection = Jsoup.connect("https://lol.qq.com/data/info-heros.shtml");
        // Once connected, get the Document object (a parsed view of the page; the actual text is read through it)
        Document document = connection.get();

        Element elementUL = document.selectFirst("[class=imgtextlist]"); // find the ul tag with class=imgtextlist
        Elements elementLis = elementUL.select("li"); // find every li inside that ul
        for (Element elementLi : elementLis) { // iterate over all the li tags found
            Element elementA = elementLi.selectFirst("a"); // find the a tag inside the li

            String herURL = elementA.attr("href"); // read the value of the a tag's href attribute
            // Execution fails here with: Exception in thread "main" java.lang.NullPointerException
            // Why can't the content be retrieved?
            // <a href="info-defail.shtml?id=1" title="黑暗之女安妮"></a>

            Element elementP = elementA.selectFirst("p"); // find the p tag inside the a tag
            String innerName = elementP.text(); // read the text inside the p tag
            System.out.println("Downloading " + innerName);
            String path = "https://lol.qq.com/data/" + herURL; // build a new link from the href
            Connection newConnection = Jsoup.connect(path); // connect to the new link
            Document newDocument = newConnection.get();
            Element elementUl = newDocument.selectFirst("[class=defail-skin-bg]"); // find the ul tag with class=defail-skin-bg
            Elements elementLI = elementUl.select("li"); // find the li tags inside that ul
            for (Element elementLIs : elementLI) {
                Element elementImg = elementLIs.selectFirst("img"); // find the img tag inside the li
                String srcURL = elementImg.attr("src"); // read the value of the img tag's src attribute
                String altName = elementImg.attr("alt"); // read the text of the alt attribute
                URL url = new URL(srcURL);

                // Copy the image to disk; try-with-resources closes both streams even on failure
                try (InputStream is = url.openStream();
                     FileOutputStream fos = new FileOutputStream("E://桌面//img//" + altName)) {
                    byte[] b = new byte[1024];
                    int count = is.read(b);
                    while (count != -1) {
                        fos.write(b, 0, count);
                        count = is.read(b);
                    }
                }
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Download finished, elapsed: " + (t2 - t1) + "ms");
    }

}

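lphgor · Posted on 2020-11-8 17:18

Last edited by lphgor on 2020-11-8 21:04

The content is loaded dynamically by JavaScript, so it cannot be read directly from the HTML page.
First get the hero list from hero_list.js, then get the information you need from each hero's own js file.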

https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js
http://game.gtimg.cn/images/lol/act/img/js/hero/1.js
http://game.gtimg.cn/images/lol/act/img/js/hero/2.js
.....
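
A minimal sketch of the approach described above, using only the JDK. It downloads hero_list.js as text, pulls out the hero ids, and builds the per-hero URLs in the pattern listed above. The class name HeroListSketch, the helper names, the idea that each entry carries a "heroId" field, and the UTF-8 decoding are all assumptions made for illustration and are not confirmed in this thread, so check the actual file contents before relying on them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeroListSketch {
    // URLs as quoted in the reply above
    static final String HERO_LIST_URL =
            "https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js";
    static final String HERO_URL_PREFIX =
            "http://game.gtimg.cn/images/lol/act/img/js/hero/";

    // ASSUMPTION: each hero entry contains a field such as "heroId":"1"
    private static final Pattern HERO_ID =
            Pattern.compile("\"heroId\"\\s*:\\s*\"?(\\d+)\"?");

    // Builds one per-hero URL for every distinct id found in the list text
    static List<String> heroDataUrls(String heroListText) {
        List<String> urls = new ArrayList<>();
        Matcher m = HERO_ID.matcher(heroListText);
        while (m.find()) {
            String url = HERO_URL_PREFIX + m.group(1) + ".js";
            if (!urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Downloads a URL and decodes the body as UTF-8 (the encoding is an assumption)
    static String fetchText(String address) throws IOException {
        try (InputStream in = new URL(address).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int count;
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String url : heroDataUrls(fetchText(HERO_LIST_URL))) {
            System.out.println(url);
        }
    }
}
```

Each resulting URL could then be fetched the same way with fetchText and searched for whatever fields you need, such as image addresses.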

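哈_喽 · Posted on 2020-11-8 17:52

A NullPointerException means that element probably does not exist. Print it out, or step through with the debugger and have a look.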
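
One way to do that check without a debugger is to look at the raw HTML the server returns, before any script has run. The sketch below counts how often the link prefix from the question (info-defail.shtml?id=) appears in that HTML. The class name RawHtmlCheck and its method names are hypothetical. A count of 0 would suggest that the links are added later by JavaScript and are simply not present in what the crawler receives.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class RawHtmlCheck {
    // Counts non-overlapping occurrences of needle in text
    static int countOccurrences(String text, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Downloads the page body; the needle is plain ASCII, so a
    // byte-preserving charset is enough and the page encoding does not matter
    static String fetchRaw(String address) throws IOException {
        try (InputStream in = new URL(address).openStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int count;
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        String html = fetchRaw("https://lol.qq.com/data/info-heros.shtml");
        int links = countOccurrences(html, "info-defail.shtml?id=");
        System.out.println("hero links in raw HTML: " + links);
    }
}
```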

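其叶沃若丶 · Posted on 2020-11-8 18:20

The page fills in its content with JavaScript. Your code reads the HTML before any script has run (jsoup does not execute JavaScript), so what you actually get inside the li tags is not what you see in the browser.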

ZhengJL1008 · Posted on 2020-11-8 18:27

You can use a headless browser and read the page's tags only after the page has finished loading.

liujieboss · Posted on 2020-11-8 18:47

I usually use Python. Since you're using Java, I'll bookmark this thread for now.

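HK仅輝 · Posted on 2020-11-8 19:19

Last edited by HK仅輝 on 2020-11-8 19:22

哈_喽 posted on 2020-11-8 17:52
A NullPointerException means that element probably does not exist. Print it out, or step through with the debugger and have a look.

How do I fix it? Please give me a method.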
[Screenshot attachment: E:\桌面\QQ截图20201108191609]

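哈_喽 · Posted on 2020-11-8 21:53

The NullPointerException comes from reading the href attribute of your a tag. First confirm that the site you are crawling really has this structure. If it does, the problem is in your code: most likely the site loads this attribute dynamically, while you are crawling the static page. If you crawl with something like selenium, you need to sleep for a few seconds and wait for the page to finish rendering.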
