This post was last edited by 甘霖之霜 on 2023-4-13 22:00
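Approach
Start from the book's landing page and collect the table of contents for the whole book.
Then follow the table of contents to each chapter's detail page and scrape it.
Let's code
1. Add the dependencies first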
[XML]
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.15.4</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>
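2. Get the table of contents
A note on the parameters: root is the first page of the book's table of contents, next is the next table-of-contents page to fetch, and dir is the list that collects the chapter URLs.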
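[Java]
private static void getDir(String root, String next, List<String> dir) throws Exception {
    Document document = Jsoup.connect(next).get();
    // Every link ending in ".html": chapter links plus the paging links
    Elements elements = document.select("a[href$=\".html\"]");
    List<String> list = elements.eachAttr("href");
    // Drop the first match (presumably not a chapter link on this site)
    list.remove(0);
    // "下一页" means "next page": more table-of-contents pages follow
    if (elements.last().text().equals("下一页")) {
        String nextPage = list.get(list.size() - 1);
        // Keep only the file name and append it to the root URL
        nextPage = root + nextPage.substring(nextPage.lastIndexOf("/") + 1);
        list.remove(list.size() - 1);
        // "上一页" means "previous page"; drop that link too if it is present
        if (elements.get(elements.size() - 2).text().equals("上一页")) {
            list.remove(list.size() - 1);
        }
        dir.addAll(list);
        getDir(root, nextPage, dir);
        return;
    }
    // Last page: only a "previous page" link can trail the chapter links
    if (elements.last().text().equals("上一页")) {
        list.remove(list.size() - 1);
    }
    dir.addAll(list);
}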
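Building the next-page URL by cutting at the last "/" only works while every href sits directly under root. As a minimal sketch, assuming the links may be either relative or site-absolute, java.net.URI can resolve each href against the page it was found on. HrefResolver, resolveAll and the file names index_2.html, index_3.html and 1.html are made up for illustration and are not taken from the site.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class HrefResolver {
    // Resolve every href against the URL of the page it was found on.
    static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String href : hrefs) {
            absolute.add(base.resolve(href).toString());
        }
        return absolute;
    }

    public static void main(String[] args) {
        // Hypothetical table-of-contents page and links, for illustration only
        String page = "https://www.bbiquge.net/book/132488/index_2.html";
        List<String> hrefs = List.of("1.html", "/book/132488/index_3.html");
        for (String url : resolveAll(page, hrefs)) {
            System.out.println(url);
        }
    }
}
```

With absolute URLs in dir, the later root + url concatenation would no longer be needed. jsoup's own Element.absUrl("href") should give the same result when the document was loaded through Jsoup.connect.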
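3. Fetch each chapter from the table of contents and write it to the file
Parameters: dir is the list of chapter URLs collected above, root is the first page of the book's table of contents, and writer writes the book to the file.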
[Java]
private static void getContent(List<String> dir, String root, Writer writer) throws Exception {
    StringBuilder temp = new StringBuilder();
    for (String url : dir) {
        Document document = Jsoup.connect(root + url).get();
        String title = document.select("h1").text() + "\n";
        System.out.println(title);
        Elements content = document.select("div[id=\"content\"]");
        String text = content.toString();
        // Skip everything before the first "&", i.e. the opening <div> tag
        int i = text.indexOf("&");
        if (i != -1) {
            text = text.substring(i);
        }
        // Remove &nbsp; entities, spaces, <br><br> breaks and the closing tag;
        // replace() is enough here because all of these are literals
        text = text.replace("&nbsp;", "").replace(" ", "")
                   .replace("<br><br>", "").replace("</div>", "");
        temp.append(title).append(text);
    }
    IOUtils.write(temp, writer);
    writer.close();
}
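The string surgery above relies on the first "&" in the markup marking the start of the chapter text. A slightly more defensive sketch, using only String methods, splits on the line breaks and strips whatever tags remain. It assumes the chapter body is indented with &nbsp; and separated by <br> tags, which is inferred from the replacements above and not checked against the site. ChapterText and toPlainText are hypothetical names.

```java
class ChapterText {
    // Turn the HTML of the content div into one plain-text line per paragraph.
    static String toPlainText(String html) {
        StringBuilder out = new StringBuilder();
        // Split on <br>, <br/> and <br /> in any letter case
        for (String part : html.split("(?i)<br\\s*/?>")) {
            String line = part.replace("&nbsp;", " ")    // entity form
                              .replace('\u00A0', ' ')    // decoded non-breaking space
                              .replaceAll("<[^>]+>", "") // any remaining tags
                              .trim();
            if (!line.isEmpty()) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample markup, not copied from the site
        String html = "<div id=\"content\">&nbsp;&nbsp;First paragraph.<br><br>"
                    + "&nbsp;&nbsp;Second paragraph.</div>";
        System.out.print(toPlainText(html));
    }
}
```

Unlike the chained replacements, this keeps one line per paragraph, so the paragraph breaks survive in the output file.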
Complete code
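[Java]
import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Soup {
    public static void main(String[] args) throws Exception {
        String url = "https://www.bbiquge.net/book/132488/";
        // Use the book title as the file name, minus any "/" it contains
        String fileName = Jsoup.connect(url).get().select("h1").text();
        fileName = fileName.replace("/", "") + ".txt";
        File file = new File(fileName);
        // Opened in append mode; getContent closes the writer when it is done
        Writer writer = new FileWriter(file, true);
        List<String> dir = new ArrayList<>();
        getDir(url, url, dir);
        getContent(dir, url, writer);
    }

    private static void getDir(String root, String next, List<String> dir) throws Exception {
        Document document = Jsoup.connect(next).get();
        // Every link ending in ".html": chapter links plus the paging links
        Elements elements = document.select("a[href$=\".html\"]");
        List<String> list = elements.eachAttr("href");
        // Drop the first match (presumably not a chapter link on this site)
        list.remove(0);
        // "下一页" means "next page": more table-of-contents pages follow
        if (elements.last().text().equals("下一页")) {
            String nextPage = list.get(list.size() - 1);
            // Keep only the file name and append it to the root URL
            nextPage = root + nextPage.substring(nextPage.lastIndexOf("/") + 1);
            list.remove(list.size() - 1);
            // "上一页" means "previous page"; drop that link too if it is present
            if (elements.get(elements.size() - 2).text().equals("上一页")) {
                list.remove(list.size() - 1);
            }
            dir.addAll(list);
            getDir(root, nextPage, dir);
            return;
        }
        // Last page: only a "previous page" link can trail the chapter links
        if (elements.last().text().equals("上一页")) {
            list.remove(list.size() - 1);
        }
        dir.addAll(list);
    }

    private static void getContent(List<String> dir, String root, Writer writer) throws Exception {
        StringBuilder temp = new StringBuilder();
        for (String url : dir) {
            Document document = Jsoup.connect(root + url).get();
            String title = document.select("h1").text() + "\n";
            System.out.println(title);
            Elements content = document.select("div[id=\"content\"]");
            String text = content.toString();
            // Skip everything before the first "&", i.e. the opening <div> tag
            int i = text.indexOf("&");
            if (i != -1) {
                text = text.substring(i);
            }
            // Remove &nbsp; entities, spaces, <br><br> breaks and the closing tag;
            // replace() is enough here because all of these are literals
            text = text.replace("&nbsp;", "").replace(" ", "")
                       .replace("<br><br>", "").replace("</div>", "");
            temp.append(title).append(text);
        }
        IOUtils.write(temp, writer);
        writer.close();
    }
}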
I wrote this in about ten minutes and it is painfully slow to run, so I'd welcome any optimization suggestions to learn from.
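Most of the running time probably goes into fetching chapters one after another. One possible direction, as a rough sketch only: submit the fetches to a small thread pool and collect the results in submission order, so the chapters still come out in book order. ParallelFetch, fetchAll and the fetchOne parameter are hypothetical. In real use fetchOne would wrap the Jsoup.connect(...).get() call and rethrow its IOException as an unchecked exception, and the thread count should stay small to avoid hammering the site.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelFetch {
    // Fetch all URLs on a fixed-size pool; results keep the order of urls.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetchOne, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetchOne.apply(url);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                try {
                    // get() blocks until this chapter is ready
                    results.add(future.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Stand-in fetcher so the sketch runs without network access
        List<String> urls = List.of("1.html", "2.html", "3.html");
        List<String> chapters = fetchAll(urls, url -> "chapter from " + url, 4);
        chapters.forEach(System.out::println);
    }
}
```

Writing each result to the file as soon as it is collected, instead of appending everything to one StringBuilder first, would also keep memory use flat for long books.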