A Node.js API that fetches a website's source and extracts information with regular expressions

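Mr.Lih posted on 2020-7-17 12:55

This post was last edited by Mr.Lih on 2020-7-17 12:56

Gitee repository: https://gitee.com/lihann/zhuazhua
### A Node.js API that fetches a website's source and extracts information with regular expressions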

```
npm install
node index.js
```

http://localhost:3000/

>GET
>
> Parameters
>
>url: address of the website to scrape
>
>reg: regular expression to apply to the page source

**Purpose: extract information from a website with a regular expression**

Scraping the Xiami Music song list:

**url=https://www.xiami.com/billboard/102**

**reg = <td><img class="logo" src=[\"|\']?(.*?)[\"|\']?\s.*?></td><td><div class="songName-container"><div class="song-name em"><a href="(.*?)">(.*?)</a>**

http://localhost:3000/?url=https://www.xiami.com/billboard/102&reg=<td><img class="logo" src=[\"|\']?(.*?)[\"|\']?\s.*?></td><td><div class="songName-container"><div class="song-name em"><a href="(.*?)">(.*?)</a>
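
The pattern contains characters such as `?`, `"`, `<` and `\` that are not safe to paste into a query string as-is, and a literal `+` in other patterns would typically be decoded as a space. A minimal sketch of building the request URL with both parameters percent-encoded, assuming the service runs at `http://localhost:3000/` as above; `buildRequestUrl` is a hypothetical helper, not part of the project:

```javascript
// Hypothetical helper: build a request URL with both parameters percent-encoded.
function buildRequestUrl(base, url, reg) {
  const params = new URLSearchParams({ url, reg });
  return `${base}?${params}`;
}

// String.raw keeps the backslashes in the pattern exactly as typed.
const reg = String.raw`<td><img class="logo" src=[\"|\']?(.*?)[\"|\']?\s.*?></td><td><div class="songName-container"><div class="song-name em"><a href="(.*?)">(.*?)</a>`;

const requestUrl = buildRequestUrl(
  'http://localhost:3000/',
  'https://www.xiami.com/billboard/102',
  reg
);
console.log(requestUrl);

// Decoding the query string gives back the original pattern unchanged.
console.log(new URL(requestUrl).searchParams.get('reg') === reg); // true
```

The encoded URL is harder to read than the raw one, but it should reach the server with the pattern intact.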

Result

![Result screenshot](https://gitee.com/lihann/pic/raw/master/20200717-114843-0940.png)

> **Scraping every image on a site:**
>
><img.*?src=[\"|\']?(.*?)[\"|\']?\s.*?>
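
To see what this pattern captures without running the server, it can be tried on a small string. The sketch below uses made-up sample markup and the same `gm` flags the service applies; the file names are placeholders:

```javascript
// Hypothetical sample markup, for illustration only.
const html = `<p>demo</p>
<img src="a.png" alt="A">
<img class="x" src='b.jpg' />`;

const pattern = new RegExp(String.raw`<img.*?src=[\"|\']?(.*?)[\"|\']?\s.*?>`, 'gm');

// matchAll yields one match per <img> tag; index 1 holds the captured src value.
const sources = [...html.matchAll(pattern)].map((m) => m[1]);
console.log(sources); // [ 'a.png', 'b.jpg' ]
```

Note that the pattern requires whitespace after the `src` value, so a tag written as `<img src="a.png">` with nothing after the closing quote is not captured cleanly; the capture can run on into the following markup.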

> **Scraping the Kugou song list and links**
>
> url: https://www.kugou.com/yy/rank/home/1-37361.html?from=rank
>
> reg: <a href="https://www.kugou.com/song/(.*?)" data-active="playDwn" data-index="[\s\S].*?" class="pc_temp_songname" title="(.*?)" hidefocus="true">
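
This pattern has two capture groups, the song path and the title, so each match yields a pair. A rough sketch of turning those pairs into objects, run against a made-up anchor tag that only imitates the attributes the pattern expects (the path and title values are placeholders):

```javascript
// Hypothetical sample row, modeled on the attributes the pattern expects.
const html = '<a href="https://www.kugou.com/song/abc123.html" data-active="playDwn" data-index="0" class="pc_temp_songname" title="Some Artist - Some Song" hidefocus="true">';

const pattern = new RegExp(
  String.raw`<a href="https://www.kugou.com/song/(.*?)" data-active="playDwn" data-index="[\s\S].*?" class="pc_temp_songname" title="(.*?)" hidefocus="true">`,
  'gm'
);

// Skip the full match (index 0) and name the two capture groups.
const songs = [...html.matchAll(pattern)].map(([, path, title]) => ({
  link: `https://www.kugou.com/song/${path}`,
  title,
}));
console.log(songs);
```

Because the pattern spells out every attribute in order, it would stop matching if the site changed its markup even slightly.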

> **Scraping a novel:**
>
> url: http://www.yuetutu.com/cbook_23452/1.html
>
> reg: <div id="content">((.|\n)+?)</div>
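
Two details of this pattern are easy to miss: it contains two capturing groups, so each match carries an extra one-character entry, and `(.|\n)` cannot cross a `\r`. The sketch below compares it with `[\s\S]+?`, a common alternative that is suggested here and is not part of the original post; the chapter markup is made up:

```javascript
const original = /<div id="content">((.|\n)+?)<\/div>/;
const alternative = /<div id="content">([\s\S]+?)<\/div>/;

// Hypothetical chapter markup with Unix and Windows line breaks.
const unixHtml = '<div id="content">Line one\nLine two</div>';
const windowsHtml = '<div id="content">Line one\r\nLine two</div>';

const m = original.exec(unixHtml);
console.log(JSON.stringify(m[1])); // "Line one\nLine two"
console.log(m[2]); // "o": the inner group keeps only the last character it matched

// `.` never matches \r or \n, so (.|\n) gets stuck at the \r.
console.log(original.exec(windowsHtml)); // null

// [\s\S] matches any character, including both line-break characters.
console.log(JSON.stringify(alternative.exec(windowsHtml)[1])); // "Line one\r\nLine two"
```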

![Screenshot](https://gitee.com/lihann/pic/raw/master//20200717-122847-0958.png)
Source code:

```
const express = require('express');
// require() works with got 11 and earlier; got 12+ is ESM-only and needs import.
const got = require('got');

const app = express();
app.use(express.json());
app.use(express.urlencoded({
   extended: false
}));

// Allow cross-origin requests from any origin.
app.all('*', function(req, res, next) {
   res.header("Access-Control-Allow-Origin", "*");
   res.header("Access-Control-Allow-Headers", "Origin, X-Requested-With, Content-Type, Accept");
   next();
});

app.use('/', async (req, res) => {
   const { url, reg } = req.query;
   if (!url) {
      res.status(400).json({ error: 'missing "url" parameter' });
      return;
   }
   try {
      const html = await _get(url);
      // Without a pattern, return the raw page source.
      if (!reg) {
         res.json({ data: html });
         return;
      }
      const reg1 = new RegExp(reg, "gm");
      // One entry per match, each holding that match's capture groups in order.
      // String.prototype.matchAll needs Node.js 12 or later.
      const arr = [];
      for (const m of html.matchAll(reg1)) {
         arr.push(m.slice(1));
      }
      res.json({ data: arr });
   } catch (err) {
      // Covers failed downloads as well as invalid patterns.
      res.status(500).json({ error: err.message });
   }
});

app.listen(3000, () => {
   console.log("run 3000");
});

// Download the page and return its body as a string.
async function _get(url) {
   const response = await got(url);
   return response.body;
}
```
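
Because `reg` comes straight from the query string, a malformed pattern makes `new RegExp` throw a `SyntaxError`. One possible way to check the pattern up front is sketched below; `compilePattern` is a hypothetical helper, not something from the project, and a handler could answer with a 400 status when it returns `null`:

```javascript
// Hypothetical helper: compile a user-supplied pattern, or return null if it is malformed.
function compilePattern(source, flags = 'gm') {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    // new RegExp throws SyntaxError for invalid patterns; rethrow anything else.
    if (err instanceof SyntaxError) return null;
    throw err;
  }
}

const good = compilePattern('<title>(.*?)</title>');
const bad = compilePattern('(unclosed');

console.log(good instanceof RegExp); // true
console.log(bad); // null
```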

I'm a beginner and still learning, so feedback from more experienced people is welcome.

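Mr.Lih posted on 2020-7-17 18:20

ych126 posted on 2020-7-17 17:23
I mean the ones on the website

As long as it appears in the site's page source, you can extract whatever you want by writing the right regular expression.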

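Mr.Lih posted on 2020-7-17 14:38

ych126 posted on 2020-7-17 14:11
Can it scrape movies and TV shows?

It only extracts information that appears in the page source; it can't grab the video content itself.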

CN911 posted on 2020-7-17 13:38

This is great, keep it up!

风起回忆怎么潜 posted on 2020-7-17 14:03

Nicely written. These days koa2 is the more commonly recommended choice.

ych126 posted on 2020-7-17 14:11

Can it scrape movies and TV shows?

ych126 posted on 2020-7-17 17:23

I mean the ones on the website

君月栩 posted on 2020-7-17 19:08

Thanks for sharing, OP.

a820922716 posted on 2020-7-17 21:18

Can it scrape a site's backend?

Mr.Lih posted on 2020-7-17 21:25

a820922716 posted on 2020-7-17 21:18
Can it scrape a site's backend?

Anything that appears in the site's page source can be extracted.

吾爱破解 - 52pojie.cn's Archiver
