吾爱破解 - 52pojie.cn

 找回密码
 注册[Register]

QQ登录

只需一步,快速开始

查看: 3853|回复: 22
收起左侧

[其他原创] 【VBA】Word VBA爬取百度文库json源数据半成品

[复制链接]
漁滒 发表于 2020-5-12 15:59
本帖最后由 aiai 于 2020-5-14 11:49 编辑

需求来源:想用Word VBA来爬取百度文库json源数据,然后解析json还原成word源文档

测试链接:https://wenku.baidu.com/view/f6fc53e7bc64783e0912a21614791711cd79797d.html

首先第一步就是获取主页的源代码
[Visual Basic] 纯文本查看 复制代码
    Dim http As Object
    Set http = CreateObject("WinHttp.WinHttpRequest.5.1")
    Dim htmlurl As String
    htmlurl = "https://wenku.baidu.com/view/4ed46924031ca300a6c30c22590102020640f263.html"
    With http
        .Open "GET", htmlurl, False
        .SetRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
        .Send
        Dim body() As Byte
        body = .responseBody
        Dim htmltext As String
        htmltext = encoding(body, "gb2312")
    End With


然后可以通过源代码获得总页数和各个页面的地址
[Visual Basic] 纯文本查看 复制代码
    '获取总页数
    Dim reg As Object
    Dim page As Integer
    Set reg = CreateObject("VbScript.regexp") 
    With reg
        .Global = False 
        .Pattern = "totalPageNum.+," 
        page = CInt(Split(.Execute(htmltext)(0), "'")(2))
    End With
    '获取获取链接
    Dim urllist As String
    With reg
        .Global = False
        .Pattern = "WkInfo.htmlUrls.+(?=')"
        urllist = .Execute(htmltext)(0)
    End With
    urllist = Right(urllist, Len(urllist) - 19)


这里的页面地址还有ascii编码和转义的,写个自定义函数处理一下,就可以得到标准的json数据,再对这个json进行解析
[Visual Basic] 纯文本查看 复制代码
    'ascii解码
    urllist = asciidecode(urllist)
    Dim json As Object
    Set json = jsonpath(urllist, "")
    Dim url
    For Each url In json.json
        With http
            .Open "GET", url.pageLoadUrl, False
            .Send
            body = .responseBody
            htmltext = encoding(body, "gb2312")
        End With
        htmltext = Mid(htmltext, InStr(1, htmltext, "(", vbBinaryCompare) + 1)
        htmltext = Left(htmltext, Len(htmltext) - 1)
        Dim pagejson As Object
        Set pagejson = jsonpath(htmltext, "")
    Next url


这时就已经获得所有json的源数据,但是只有前50页。百度文库每50页分割一次,后50页的地址为
https://wenku.baidu.com/view/f6fc53e7bc64783e0912a21614791711cd79797d.html?pn=51
通过一开始的总页数,可以确定循环的次数,至此就可以获得所有的json源数据

上面需要用到的自定义函数
[Visual Basic] 纯文本查看 复制代码
Public Function encoding(body() As Byte, CodeBase As String) As String
    Dim ado As Object
    Set ado = CreateObject("Adodb.Stream")
    With ado             
        .Type = 1                  
        .mode = 3               
        .Open                         
        .Write body                 
        .Position = 0                 
        .Type = 2                   
        .Charset = CodeBase        
        encoding = .readtext         
    End With
End Function

Public Function asciidecode(body As String) As String
    '处理ascii
    Dim index As Long
    For index = 0 To 255 Step 1
        Dim text As String
        text = Hex(index)
        If Len(text) = 1 Then
            text = "\x0" & text
        Else
            text = "\x" & text
        End If
        body = Replace(body, text, Chr(index))
    Next index
    '处理转义
    asciidecode = ""
    For index = 1 To Len(body) Step 1
        text = Mid(body, index, 1)
        If text = "\" Then
            asciidecode = asciidecode & Mid(body, index + 1, 1)
            index = index + 1
        Else
            asciidecode = asciidecode & text
        End If
    Next index
End Function
Public Function jsonpath(js As String, path As String)
    Dim jsonpa As Object
    Set jsonpa = CreateObject("msscriptcontrol.scriptcontrol")
    jsonpa.Language = "JavaScript"
    jsonpa.AddCode ("var jsons = " & js)    
    If TypeName(jsonpa.eval("jsons" + path)) = "JScriptTypeInfo" Then 
        Set jsonpath = jsonpa.eval("jsons" + path) 
    Else
        jsonpath = jsonpa.eval("jsons" + path)
    End If
End Function


但是这个json源数据看的不太懂,不知道怎么还原为word源文档,请教给为坛友,如果有了解的可以分享一下
网络抓包的结果
TIM截图20200513084935.png

VBA运行的结果
TIM截图20200513085112.png


因为百度的json数据用的是css样式,对于word来说样式很难还原,那么就先还原一下简单的文字
[Visual Basic] 纯文本查看 复制代码
    urllist = asciidecode(urllist)
    Dim json As Object
    Set json = jsonpath(urllist, "")
    Dim jsonlen As Integer
    jsonlen = jsonpath(urllist, ".json.length")
    Dim jsonpng As Integer
    jsonpng = jsonpath(urllist, ".png.length")
    Dim index As Byte
    For index = 0 To jsonlen - 1 Step 1
        '下载图片,不一定存在
        If jsonpng > 0 Then
            Dim pngurl As String
            pngurl = jsonpath(urllist, ".png[" & index & "].pageLoadUrl")
            With http
                .Open "GET", pngurl, False
                .Send
                body = .responseBody
                'Call savebytefile(body, "D:/" & index + 1 & ".png")
            End With
        End If
        '下载主体
        Dim bodyurl As String
        bodyurl = jsonpath(urllist, ".json[" & index & "].pageLoadUrl")
        With http
            .Open "GET", bodyurl, False
            .Send
            body = .responseBody
            htmltext = encoding(body, "gb2312")
        End With
        htmltext = Mid(htmltext, InStr(1, htmltext, "(", vbBinaryCompare) + 1)
        htmltext = Left(htmltext, Len(htmltext) - 1)
        Dim pagejson As Object
        Set pagejson = jsonpath(htmltext, "")
        '仅还原文字
        Dim pagelen As Integer
        pagelen = jsonpath(htmltext, ".body.length")
        Dim index2 As Long
        For index2 = 0 To pagelen - 1 Step 1
            Dim word
            Set word = jsonpath(htmltext, ".body[" & index2 & "]")
            If word.t = "word" Then
                Dim worklen As Byte
                worklen = Len(word.c)
                Selection.TypeText text:=word.c
                Selection.MoveLeft Unit:=wdCharacter, Count:=worklen, Extend:=wdExtend
                Selection.Font.Size = word.p.h / 2
                Selection.MoveRight Unit:=wdCharacter, Count:=1
                If TypeName(word.ps) = "JScriptTypeInfo" Then
                    If jsonpath(htmltext, ".body[" & index2 & "].ps._enter") <> Empty Then
                        Selection.TypeParagraph
                    End If
                End If
            End If
        Next index2
        Selection.InsertBreak Type:=0
    Next index


百度显示的内容
TIM截图20200514114513.png TIM截图20200514114534.png


还原文字的内容
TIM截图20200514114841.png

免费评分

参与人数 8吾爱币 +14 热心值 +8 收起 理由
transpuring + 1 + 1 热心回复!
xhtdtk + 1 + 1 鼓励转贴优秀软件安全工具和文档!
Oo不弃 + 1 + 1 大佬看下15楼~
枫叶荻花 + 1 + 1 小白甚多,还是有个成品更好
ymhld + 1 + 1 大佬弄出成品来给分享呀,为啥不用python?
ww185912 + 1 + 1 热心回复!
苏紫方璇 + 5 + 1 欢迎分析讨论交流,吾爱破解论坛有你更精彩!
JuncoJet + 3 + 1 JSON解析方法不错

查看全部评分

发帖前要善用论坛搜索功能,那里可能会有你要找的答案或者已经有人发布过相同内容了,请勿重复发帖。

Oo不弃 发表于 2020-9-15 16:23
大神,你网课下载那个帖子不能回复了,在这里给你反馈个问题,我下载腾讯课堂收费课程时报的一个错,麻烦帮忙看下:
你选择的是全部下载
Traceback (most recent call last):
  File "m3u8downloadBYYG_x64.py", line 68, in <module>
  File "m3u8downloadmain.py", line 160, in m3u8downloadmain.main
  File "tengxunketang\tengxunketang.py", line 437, in tengxunketang.main
  File "tengxunketang\tengxunketang.py", line 131, in tengxunketang.getm3u8
KeyError: 'exper'
[6408] Failed to execute script m3u8downloadBYYG_x64
 楼主| 漁滒 发表于 2020-9-15 16:53
Oo不弃 发表于 2020-9-15 16:23
大神,你网课下载那个帖子不能回复了,在这里给你反馈个问题,我下载腾讯课堂收费课程时报的一个错,麻烦帮 ...

获取token错误,尝试更新cookie。或者cookie缺少必要参数
523067000 发表于 2020-5-12 16:05
﹏詆調℡ 发表于 2020-5-12 16:57
不觉明历
叫我小王叔叔 发表于 2020-5-12 17:11
牛x。。。。。。。。。。。。。。。。。。。。
lsj123126 发表于 2020-5-12 20:04
学习下 收藏了
MC阿虎 发表于 2020-5-12 22:21
学习打卡中
吾爱007 发表于 2020-5-13 00:16
谢谢楼主分享
熊一只 发表于 2020-5-13 13:15
谢谢分享,收藏了
zhaoxiaohua125 发表于 2020-5-13 20:44
希望楼主继续加油
pwp 发表于 2020-7-6 17:31
膜拜渔哥会写几种语言。
您需要登录后才可以回帖 登录 | 注册[Register]

本版积分规则

返回列表

RSS订阅|小黑屋|处罚记录|联系我们|吾爱破解 - LCG - LSG ( 京ICP备16042023号 | 京公网安备 11010502030087号 )

GMT+8, 2024-11-25 20:17

Powered by Discuz!

Copyright © 2001-2020, Tencent Cloud.

快速回复 返回顶部 返回列表