吾爱破解 - 52pojie.cn

[Help] Can't get the page source when scraping a particular site with Python. What should I do?
Yanpeen posted on 2022-7-31 22:12
Last edited by Yanpeen on 2022-8-5 18:03

The response is not the page source but a block of JS code. How do I get the actual page source? Could everyone take a look?

[Python]
import requests, urllib3
urllib3.disable_warnings()
head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"} 
url = "https://www.tupianzj.com/"
response = requests.get(url, headers=head, verify=False) 
print(response.text)


This is what came back:
[HTML]
<script language="javascript" type="text/javascript">eval(function(p,a,c,k,e,d){e=function(c){return(c<a?"":e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)d[e(c)]=k[c]||e(c);k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1;};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p;}('p b(j){1 7=j+"=";1 a=3.4.o(\';\');u(1 i=0;i<a.9;i++){1 c=a[i].s();f(c.q(7)==0)g c.v(7.9,c.9)}g""}1 6=b("6");1 5=B(b("5"));f(6==""||5==""){D("8=8; ",C)}x{1 k=5-y;3.4="6=; d=e, m l n 2:2:2 h;";3.4="5=; d=e, m l n 2:2:2 h;";3.4="t="+6+";";3.4="r="+k+";";A.8.z(w)}',40,40,'|var|00|document|cookie|secret|token|name|location|length|ca|getCookie||expires|Thu|if|return|UTC||cname|random|Jan|01|1970|split|function|indexOf||trim||for|substring|true|else|100|reload|window|parseInt|3000|setTimeout'.split('|'),0,{}))
</script>


After running it through a JS formatting tool:
[JavaScript]
function getCookie(cname) {
    var name = cname + "=";
    var ca = document.cookie.split(';');
    for (var i = 0; i < ca.length; i++) {
        var c = ca[i].trim();
        if (c.indexOf(name) == 0) return c.substring(name.length, c.length)
    }
    return ""
}
var token = getCookie("token");
var secret = parseInt(getCookie("secret"));
if (token == "" || secret == "") {
    setTimeout("location=location; ", 3000)
} else {
    var random = secret - 100;
    document.cookie = "token=; expires=Thu, 01 Jan 1970 00:00:00 UTC;";
    document.cookie = "secret=; expires=Thu, 01 Jan 1970 00:00:00 UTC;";
    document.cookie = "t=" + token + ";";
    document.cookie = "r=" + random + ";";
    window.location.reload(true)
}



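--------------------------------------------------------------------------------------------
|            Thanks everyone for the helpful answers; I can now get the page source            |
--------------------------------------------------------------------------------------------
Compared with the earlier code, I added the cookies parameter. Why cookies are needed, and which cookies to send, has to be worked out from the JS code returned earlier. Based on everyone's answers, it works roughly like this:
[JavaScript]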
function getCookie(cname) {    // Return the value of the named cookie
    var name = cname + "=";
    var ca = document.cookie.split(';');    // Split the cookie string into individual pairs
    for (var i = 0; i < ca.length; i++) {
        var c = ca[i].trim();
        if (c.indexOf(name) == 0) return c.substring(name.length, c.length)  // Found the cookie: return its value
    }
    return ""  // Cookie not present: return an empty string
}
var token = getCookie("token");    // Read the 'token' cookie
var secret = parseInt(getCookie("secret")); // Read the 'secret' cookie as an integer
if (token == "" || secret == "") {
    setTimeout("location=location; ", 3000)
} else {
    var random = secret - 100;    // The previous 'secret' value minus 100
    document.cookie = "token=; expires=Thu, 01 Jan 1970 00:00:00 UTC;";
    document.cookie = "secret=; expires=Thu, 01 Jan 1970 00:00:00 UTC;";
    document.cookie = "t=" + token + ";";    // 't' reuses the 'token' value from the previous visit
    document.cookie = "r=" + random + ";";    // 'r' carries secret - 100
    window.location.reload(true)
}

Here is the code that successfully gets the page source:
[Python]
import requests, urllib3
urllib3.disable_warnings()
head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
}
url = "https://www.tupianzj.com/"
response = requests.get(url, headers=head, verify=False)    # First request: obtain the cookie values
cookies = {'t':'', 'r':''}
cookies['t'] = response.cookies['token']    # 't' reuses the 'token' cookie value returned by the first request
cookies['r'] = str(int(response.cookies['secret'])-100)    # 'r' is the 'secret' cookie value from the first request minus 100
response = requests.get(url, headers=head, verify=False, cookies=cookies)     # With these cookies supplied, the real page source comes back
print(response.text)
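
The two-request flow above raises a KeyError if the server does not set `token` or `secret` (for example, when it answers with the real page straight away). A minimal sketch that guards against that follows. The helper names `derive_cookies` and `fetch_page` are made up for illustration, `get` is assumed to be `requests.get` or anything with the same call shape, and `verify=False` from the post is left out here.

```python
def derive_cookies(token, secret):
    """Mirror the else-branch of the challenge script: t = token, r = secret - 100."""
    return {"t": token, "r": str(int(secret) - 100)}


def fetch_page(get, url, headers):
    """Fetch url, answering the cookie challenge once if the server issues it.

    `get` is any callable shaped like requests.get (hypothetical injection point,
    so the logic can be exercised without network access).
    """
    first = get(url, headers=headers)
    token = first.cookies.get("token")
    secret = first.cookies.get("secret")
    if token is None or secret is None:
        # No challenge cookies were set: assume this is already the real page.
        return first.text
    second = get(url, headers=headers, cookies=derive_cookies(token, secret))
    return second.text


# Possible usage (needs the requests package and network access):
# import requests
# html = fetch_page(requests.get, "https://www.tupianzj.com/", {"User-Agent": "Mozilla/5.0"})
```

Passing the getter in as a parameter is only a convenience for testing; calling `requests.get` directly inside the function would work the same way.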

--------------------------------------------------------------------------------------------
|          Thanks again, everyone, for the helpful answers. Let's keep each other going!          |
--------------------------------------------------------------------------------------------
Below is how the returned JS was decoded.
Save the following code as an HTML file:
[HTML]
<html>        
    <body>    
        <script>   
                a=62;   
                function encode() {   
                var code = document.getElementById('code').value;   
                code = code.replace(/[\r\n]+/g, '');
                code = code.replace(/'/g, "\\'");
                var tmp = code.match(/\b(\w+)\b/g);
                tmp.sort();   
                var dict = [];   
                var i, t = '';   
                for(var i=0; i<tmp.length; i++) {   
                if(tmp[i] != t) dict.push(t = tmp[i]);   
                }   
                var len = dict.length;   
                var ch;   
                for(i=0; i<len; i++) {   
                ch = num(i);   
                code = code.replace(new RegExp('\\b'+dict[i]+'\\b','g'), ch);
                if(ch == dict[i]) dict[i] = '';   
                }   
                document.getElementById('code').value = "eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)d[e(c)]=k[c]||e(c);k=[function(e){return d[e]}];e=function(){return'\\\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\\\b'+e(c)+'\\\\b','g'),k[c]);return p}("
                + "'"+code+"',"+a+","+len+",'"+ dict.join('|')+"'.split('|'),0,{}))";   
                }   
                  
                function num(c) {   
                return(c<a?'':num(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36));   
                }   
                  
                function run() {   
                eval(document.getElementById('code').value);   
                }   
                  
                function decode() {   
                var code = document.getElementById('code').value;   
                code2 = code.replace(/^eval/, '');   
                //alert(code);   
                document.getElementById('code').value = eval(code2);   
                }   
                </script> 
                  
                <textarea id=code cols=80 rows=20>   
                </textarea>   

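                <input type=button onclick=encode() value=Encode>
                <input type=button onclick=run() value=Run>
                <input type=button onclick=decode() value=Decode>
    </body>
</html>



Double-click the HTML file to open it in a browser.


Then paste the JS into the text box (remove the <script> tags first).


Click Decode.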
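
The same decoding can be done without a browser. What follows is a minimal sketch, not the tool used in this thread: a Python port of the substitution step of the `eval(function(p,a,c,k,e,d){...})` wrapper shown earlier. The names `encode` and `unpack` are made up for illustration, and the caller is assumed to have already pulled the four arguments (payload string, base, count, word list) out of the script by hand and unescaped any `\'` in the payload.

```python
import re

# Digit alphabet used by the wrapper's e() helper: 0-9, a-z, then A-Z for 36..61.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode(n, base):
    """Port of e(c): write n as a string in the given base (at most 62)."""
    prefix = "" if n < base else encode(n // base, base)
    return prefix + ALPHABET[n % base]


def unpack(payload, base, count, words):
    """Replace each encoded token in payload with its dictionary word."""
    table = {}
    for i in range(count):
        token = encode(i, base)
        # An empty or missing dictionary entry means the token stands for itself.
        table[token] = words[i] if i < len(words) and words[i] else token
    # re.ASCII keeps \w and \b limited to ASCII, as in the JavaScript regex.
    return re.sub(r"\b\w+\b", lambda m: table.get(m.group(0), m.group(0)),
                  payload, flags=re.ASCII)


print(unpack("1 0=2;", 62, 3, ["x", "var", "42"]))  # var x=42;
```

For the script in this thread, the base and count are both 40 and the word list is the `'|var|00|document|...'.split('|')` argument.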


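西瓜菠萝糖 posted on 2022-8-1 00:48
"The site you are trying to visit contains a large amount of illegal or non-compliant content....." It probably uses hotlink protection. Try adding browser-like details such as the Referer.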

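OP | Yanpeen posted on 2022-8-1 00:58
西瓜菠萝糖 posted on 2022-8-1 00:48
"The site you are trying to visit contains a large amount of illegal or non-compliant content....." It probably uses hotlink protection. Try adding browser-like details such as the Referer.

Do I add that to the request headers? Which details should I add?
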
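MyModHeaven posted on 2022-8-1 07:07
If you can't get the source, isn't that because the site loads its content dynamically? You would need selenium.

nj2004 posted on 2022-8-1 08:03
Thanks for sharing!

bluerabbit posted on 2022-8-1 08:07
Probably dynamic loading. Try selenium.

ofw posted on 2022-8-1 09:16
Just add the cookie and it works.

jjjzw posted on 2022-8-1 09:17
The request headers that get the JS differ from the ones that get the page source.
Request that gets the JS:
[HTML]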
GET / HTTP/1.1
Host: www.tupianzj.com
Sec-Ch-Ua: "Chromium";v="91", " Not;A Brand";v="99"
Sec-Ch-Ua-Mobile: ?0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: close

Request that gets the page source:
[HTML]
GET / HTTP/2
Host: www.tupianzj.com
Cookie: t=30b5405b54bfe24d1b72f8bcdc13c863; r=4433
Cache-Control: max-age=0
Sec-Ch-Ua: "Chromium";v="91", " Not;A Brand";v="99"
Sec-Ch-Ua-Mobile: ?0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Referer: https://www.tupianzj.com/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: close
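
To see at a glance which headers differ between the two captures, they can be parsed and compared. This is a small sketch with made-up helper names (`parse_headers`, `diff_headers`) and abbreviated copies of the two requests above. It only lists differences; per the other replies in this thread, the `Cookie` header is the one reported to matter.

```python
def parse_headers(raw):
    """Turn a raw request (request line plus 'Name: value' lines) into a dict."""
    headers = {}
    for line in raw.strip().splitlines()[1:]:  # skip the request line
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def diff_headers(before, after):
    """Return {name: (before_value, after_value)} for headers that differ."""
    names = sorted(set(before) | set(after))
    return {n: (before.get(n), after.get(n)) for n in names
            if before.get(n) != after.get(n)}


# Abbreviated copies of the two captures (headers common to both mostly omitted).
challenge_request = """GET / HTTP/1.1
Host: www.tupianzj.com
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
"""
page_request = """GET / HTTP/2
Host: www.tupianzj.com
Cookie: t=30b5405b54bfe24d1b72f8bcdc13c863; r=4433
Cache-Control: max-age=0
Sec-Fetch-Site: same-origin
Referer: https://www.tupianzj.com/
"""

changes = diff_headers(parse_headers(challenge_request), parse_headers(page_request))
for name, (old, new) in changes.items():
    print(f"{name}: {old!r} -> {new!r}")
```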

ofw posted on 2022-8-1 09:18
Just add the cookie and it works.

我今天是大佬 posted on 2022-8-1 09:26
Sending the cookie works. Tested it myself.