A novel and comic site built on a web crawler (multi-threaded, already live)
Jar version: https://wws.lanzouj.com/ilcdzght3yj password: 4vmx
If you just want to use it, grab the jar: run java -jar SanMuYuanBook-1.0-SNAPSHOT.jar, then open localhost:80 in your browser.
Exe version: https://wws.lanzouj.com/ioi0qghw26d password: b3z1
My previous post sank without a trace; hopefully this one won't, since I did spend ages on it. The project is hosted on Gitee, so download it if you need it (and do whatever you like with it): https://gitee.com/sen_yang/SanMuYuanBook
1. First, the java directory:
com\aaa\config\SSLHelper.java ignores site SSL certificates. Without it, every crawl of an HTTPS site would first require downloading that site's certificate, which is a hassle, so it simply trusts everything (see the sketch after this list).
com\aaa\config\ThreadExecutorConfig.java configures the thread pool.
com\aaa\controller\BookController.java holds the operations on novels.
com\aaa\controller\CartoonController.java holds the operations on comics.
com\aaa\data is the data-fetching package; the core of the project lives here.
com\aaa\util\DataProcessing.java does data processing; it contains only a method that splits a list (also sketched below).
com\aaa\util\Download.java triggers downloads through the browser.
com\aaa\util\GetDocument.java takes a URL and returns a jsoup Document.
com\aaa\util\ZipUtils.java packs comics into a zip archive.
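Neither helper's source is included in this post, so here are minimal sketches of what they plausibly look like; the real implementations are in the repo and may differ. First, a trust-all SSLHelper, the standard (and deliberately insecure) trick for skipping certificate checks while crawling:

import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import java.security.cert.X509Certificate;

public class SSLHelper {

    // Installs a trust-all SSLContext so HTTPS sites with self-signed or
    // expired certificates can still be crawled. Deliberately insecure:
    // fine for a hobby crawler, never for anything security-sensitive.
    public static void init() {
        try {
            TrustManager[] trustAll = {new X509TrustManager() {
                public void checkClientTrusted(X509Certificate[] chain, String authType) { }
                public void checkServerTrusted(X509Certificate[] chain, String authType) { }
                public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
            }};
            SSLContext sc = SSLContext.getInstance("TLS");
            sc.init(null, trustAll, new java.security.SecureRandom());
            HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
            HttpsURLConnection.setDefaultHostnameVerifier((hostname, session) -> true);
        } catch (Exception e) {
            throw new IllegalStateException("failed to install trust-all SSL context", e);
        }
    }
}

And splitList, inferred from how the catalogue code below calls splitList(dd, 3): it cuts a list into n roughly equal slices keyed 0 to n-1 so each slice can be parsed on its own thread:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataProcessing {

    // Cuts a list into `parts` roughly equal slices keyed 0..parts-1. Every
    // key is present even when a slice is empty, which the reassembly loops
    // in the catalogue code rely on.
    public static <T> Map<Integer, List<T>> splitList(List<T> list, int parts) {
        Map<Integer, List<T>> result = new HashMap<>(parts);
        int chunk = (list.size() + parts - 1) / parts; // ceiling division
        for (int i = 0; i < parts; i++) {
            int from = Math.min(i * chunk, list.size());
            int to = Math.min(from + chunk, list.size());
            result.put(i, new ArrayList<>(list.subList(from, to)));
        }
        return result;
    }
}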
2. Then the front end:
index is the novel home page.
cartoonIndex is the comic home page.
Pages ending in Catalogue are chapter-list pages.
Pages starting with read are reader pages.
The code is all in the project, so have a look for yourself. Screenshots of it running are attached:
[Comic module screenshot]
[Novel module screenshot]
3. On my machine, pages load in about 3 s on average, and in under a second on a good connection. Thread pools are used in many places, though only fixed-size pools; a sketch of that configuration follows.
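The repo's ThreadExecutorConfig.java isn't reproduced in this post; assuming a Spring @Configuration class, a fixed-size pool setup can be as small as this sketch (the pool size of 3 here matches the 3-way split used by the catalogue code below):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ThreadExecutorConfig {

    // A fixed pool caps concurrent page fetches at the pool size, trading
    // peak throughput for predictable resource usage.
    @Bean
    public ExecutorService executorService() {
        return Executors.newFixedThreadPool(3);
    }
}

A fixed pool is a sensible fit for a small cloud server, where an unbounded pool could easily exhaust memory under load.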
4. The project is deployed on a Tencent Cloud server at http://49.235.253.131. Pages load slowly in the cloud, and there's no way around that: I'm just a broke programmer, and even the server was bought at the student discount.
Part of the project's code:
package com.aaa.data;
import com.aaa.config.SSLHelper;
import com.aaa.dto.BookCatalogueDto;
import com.aaa.entity.BookCatalogue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import static com.aaa.util.DataProcessing.splitList;
/**
 * @author 杨森
 * @version 1.0
 * @Title: BookCatalogueDB
 * @date 2020/8/7 15:39
 */
public class BookCatalogueDB {

    private static ExecutorService executorService;

    // Entry point: picks a parser by data-source name. The caller supplies
    // the thread pool used to parse the chapter list in parallel.
    public static List<BookCatalogueDto> setDataSource(String dataSource, String bookCod, ExecutorService executorService) {
        BookCatalogueDB.executorService = executorService;
        SSLHelper.init();
        if ("biquge5200".equals(dataSource)) {
            return biquge5200(bookCod);
        } else if ("biquge".equals(dataSource)) {
            return biquge(bookCod);
        }
        return null;
    }
    private static List<BookCatalogueDto> biquge5200(String bookCod) {
        try {
            // Written to from several pool threads, so use a concurrent map.
            Map<Integer, List<BookCatalogueDto>> bookCatalogueDtoMaps = new ConcurrentHashMap<>(3);
            Pattern pattern = Pattern.compile("<a\\s*href=\"?([\\w\\W]*?)\"?[\\s]*?[^>]>([\\s\\S]*?)(?=</a>)");
            Document document = Jsoup.connect("https://www.biquge5200.com/" + bookCod + "/").get();
            Elements dd = document.getElementsByTag("dd");
            // Split the chapter <dd> elements into 3 slices and parse each on its own thread.
            Map<Integer, List<Element>> integerListMap = splitList(dd, 3);
            CountDownLatch latch = new CountDownLatch(3);
            for (int i = 0; i < 3; i++) {
                final int ins = i;
                executorService.execute(() -> {
                    bookCatalogueDtoMaps.put(ins, get(integerListMap.get(ins), bookCod, document, pattern));
                    latch.countDown();
                });
            }
            latch.await();
            // Reassemble the slices in their original order.
            List<BookCatalogueDto> bookCatalogueDtos = new ArrayList<>(dd.size());
            for (int i = 0; i < 3; i++) {
                bookCatalogueDtos.addAll(bookCatalogueDtoMaps.get(i));
            }
            return bookCatalogueDtos;
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        } finally {
            // The pool is shut down here, so each call needs a fresh ExecutorService.
            executorService.shutdown();
        }
        return null;
    }
    public static List<BookCatalogueDto> get(List<Element> dd, String bookCod, Document document, Pattern pattern) {
        List<BookCatalogueDto> bookCatalogueDtos = new ArrayList<>(dd.size());
        Element imgurl = document.getElementById("fmimg");
        Element intro = document.getElementById("intro");
        Element info = document.getElementById("info");
        Element child = info.child(1);
        String h1 = info.select("h1").text();
        for (int i = 0; i < dd.size(); i++) {
            Element element = dd.get(i);
            BookCatalogueDto bookCatalogueDto = new BookCatalogueDto();
            BookCatalogue bookCatalogue = new BookCatalogue();
            // Take the last non-empty child node: the <a> carrying the chapter link.
            Node node = element.childNode(0);
            for (Node e : element.childNodes()) {
                if (!"".equals(e.toString())) {
                    node = e;
                }
            }
            bookCatalogueDto.setCatalogueName(node.childNode(0).toString());
            String s1 = node.toString();
            Matcher matcher = pattern.matcher(s1);
            if (matcher.find()) {
                // The chapter code is the numeric file name inside the href.
                String nameCodeUrl = matcher.group(1);
                String insStr = nameCodeUrl.substring(nameCodeUrl.lastIndexOf("/") + 1, nameCodeUrl.lastIndexOf("."));
                bookCatalogueDto.setCatalogueCod(Integer.parseInt(insStr));
            }
            bookCatalogueDto.setBookName(h1);
            bookCatalogueDto.setBookIntro(intro.text());
            for (Node n : imgurl.childNodes()) {
                if (n.toString().matches("(.*)img(.*)")) {
                    bookCatalogueDto.setBookImage(imgurl.childNode(0).toString());
                }
            }
            bookCatalogueDto.setBookCod(bookCod);
            bookCatalogueDto.setBookAuthor(child.text().replace("作 者:", ""));
            bookCatalogueDtos.add(bookCatalogueDto);
            if (i + 1 < dd.size()) {
                // Also record the next chapter's code.
                Node node1 = dd.get(i + 1).childNode(0);
                Matcher matcher1 = pattern.matcher(node1.toString());
                if (matcher1.find()) {
                    String nameCodeUrl = matcher1.group(1);
                    String insStr = nameCodeUrl.substring(nameCodeUrl.lastIndexOf("/") + 1, nameCodeUrl.lastIndexOf("."));
                    bookCatalogue.setNextCode(Integer.parseInt(insStr));
                }
            }
        }
        return bookCatalogueDtos;
    }
    private static List<BookCatalogueDto> biquge(String bookCod) {
        try {
            Pattern pattern = Pattern.compile("<a\\s*href=\"?([\\w\\W]*?)\"?[\\s]*?[^>]>([\\s\\S]*?)(?=</a>)");
            Document document = Jsoup.connect("https://www.biquge.com/" + bookCod + "/").get();
            // Same 3-way split-and-parse scheme as biquge5200 above.
            Map<Integer, List<BookCatalogueDto>> bookCatalogueDtoMaps = new ConcurrentHashMap<>(3);
            Elements dd = document.getElementsByTag("dd");
            Map<Integer, List<Element>> integerListMap = splitList(dd, 3);
            CountDownLatch latch = new CountDownLatch(3);
            for (int i = 0; i < 3; i++) {
                final int ins = i;
                executorService.execute(() -> {
                    bookCatalogueDtoMaps.put(ins, get(integerListMap.get(ins), bookCod, document, pattern));
                    latch.countDown();
                });
            }
            latch.await();
            List<BookCatalogueDto> bookCatalogueDtos = new ArrayList<>(dd.size());
            for (int i = 0; i < 3; i++) {
                bookCatalogueDtos.addAll(bookCatalogueDtoMaps.get(i));
            }
            return bookCatalogueDtos;
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            executorService.shutdown();
        }
        return null;
    }
}
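For context, a hypothetical call site; the CatalogueDemo class and the book code "0_1" are invented for illustration (real book codes are whatever path segment the source site uses for a book):

import com.aaa.data.BookCatalogueDB;
import com.aaa.dto.BookCatalogueDto;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CatalogueDemo {
    public static void main(String[] args) {
        // "0_1" is a placeholder book code, not a real one.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<BookCatalogueDto> catalogue = BookCatalogueDB.setDataSource("biquge5200", "0_1", pool);
        // setDataSource shuts the pool down in its finally block, so every
        // call needs a fresh ExecutorService.
        System.out.println(catalogue == null ? "fetch failed" : catalogue.size() + " chapters");
    }
}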
For folks who don't know Java: an earlier post of mine has a tutorial for the Python version of the crawler. I refuse to accept it: a Python crawler I threw together got popular, so why shouldn't a Java site I actually put effort into? Surely there are more Java people here than Python people. If you like it, remember to like and comment.

xxjackiexx posted on 2020-9-17 15:26:
OP, a suggestion, I hope it isn't too much to ask: could the novel source addresses be moved into a config file, so that from then on only the config file needs maintaining ...

That would indeed be cleaner. As it stands the project isn't especially polished: it can't accommodate many data sources, and every new source takes real work to add, which is why there are only two.
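For what it's worth, the commenter's suggestion could look roughly like this in Spring Boot; the BookSourceProperties class and the property keys are invented for illustration:

// In application.properties (illustrative keys, not from the repo):
//   book.sources.biquge5200=https://www.biquge5200.com/
//   book.sources.biquge=https://www.biquge.com/

import java.util.Map;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

@Component
@ConfigurationProperties(prefix = "book")
public class BookSourceProperties {

    // Bound from book.sources.*: source name -> base URL.
    private Map<String, String> sources;

    public Map<String, String> getSources() { return sources; }

    public void setSources(Map<String, String> sources) { this.sources = sources; }
}

The caveat above still applies, though: a config entry only helps when the new site's HTML matches a parser the code already has.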
JAVA is the best language in the world
Thanks for sharing, OP
Can this collect content automatically?

年轻的旅途 posted on 2020-9-4 10:08:
Can this collect content automatically?

It crawls as you read: everything is fetched in real time, and nothing is stored on the server.

Too complicated for me, I can't follow it; a plain crawler is easier.

三木猿 posted on 2020-9-4 10:09:
It crawls as you read: everything is fetched in real time, and nothing is stored on the server.

The data sources are fixed, so you can't choose the site to crawl yourself; I may change that later so users can configure their own data sources.