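What to do when Python can't get a page's source while scraping a given site
Last edited by Yanpeen on 2022-8-5 18:03
The response is not the page source but a piece of JS code. How can I get the actual page source? Could everyone take a look?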
import requests, urllib3
urllib3.disable_warnings()
head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"}
url = "https://www.tupianzj.com/"
response = requests.get(url, headers=head, verify=False)
print(response.text)
Here is what comes back:
<script language="javascript" type="text/javascript">eval(function(p,a,c,k,e,d){e=function(c){return(c<a?"":e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)d[e(c)]=k[c]||e(c);k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1;};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p;}('p b(j){1 7=j+"=";1 a=3.4.o(\';\');u(1 i=0;i<a.9;i++){1 c=a[i].s();f(c.q(7)==0)g c.v(7.9,c.9)}g""}1 6=b("6");1 5=B(b("5"));f(6==""||5==""){D("8=8; ",C)}x{1 k=5-y;3.4="6=; d=e, m l n 2:2:2 h;";3.4="5=; d=e, m l n 2:2:2 h;";3.4="t="+6+";";3.4="r="+k+";";A.8.z(w)}',40,40,'|var|00|document|cookie|secret|token|name|location|length|ca|getCookie||expires|Thu|if|return|UTC||cname|random|Jan|01|1970|split|function|indexOf||trim||for|substring|true|else|100|reload|window|parseInt|3000|setTimeout'.split('|'),0,{}))
</script>
After running it through a JS formatter:
function getCookie(cname) {
    var name = cname + "=";
    var ca = document.cookie.split(';');
    for (var i = 0; i < ca.length; i++) {
        var c = ca[i].trim();
        if (c.indexOf(name) == 0) return c.substring(name.length, c.length)
    }
    return ""
}
var token = getCookie("token");
var secret = parseInt(getCookie("secret"));
if (token == "" || secret == "") {
    setTimeout("location=location; ", 3000)
} else {
    var random = secret - 100;
    document.cookie = "token=; expires=Thu, 01 Jan 1970 00:00:00 UTC;";
    document.cookie = "secret=; expires=Thu, 01 Jan 1970 00:00:00 UTC;";
    document.cookie = "t=" + token + ";";
    document.cookie = "r=" + random + ";";
    window.location.reload(true)
}
--------------------------------------------------------------------------------------------
| Thanks, everyone, for the helpful answers. I can now get the page source. |
--------------------------------------------------------------------------------------------
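I added the cookies parameter on top of the earlier code. Why cookies are needed, and which cookie values to send, has to be worked out from the JS code that was returned earlier. Based on everyone's answers, it works roughly like this: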
function getCookie(cname) { // return the value of the named cookie
    var name = cname + "=";
    var ca = document.cookie.split(';'); // split the cookie string into individual entries
    for (var i = 0; i < ca.length; i++) {
        var c = ca[i].trim();
        if (c.indexOf(name) == 0) return c.substring(name.length, c.length) // entry matches the name: return its value
    }
    return "" // no matching entry: return an empty string
}
var token = getCookie("token"); // value of the 'token' cookie
var secret = parseInt(getCookie("secret")); // value of the 'secret' cookie, as an integer
if (token == "" || secret == "") {
    setTimeout("location=location; ", 3000)
} else {
    var random = secret - 100; // the previous 'secret' value minus 100
    document.cookie = "token=; expires=Thu, 01 Jan 1970 00:00:00 UTC;"; // delete 'token'
    document.cookie = "secret=; expires=Thu, 01 Jan 1970 00:00:00 UTC;"; // delete 'secret'
    document.cookie = "t=" + token + ";"; // 't' carries the same token value as the previous response
    document.cookie = "r=" + random + ";"; // 'r' carries secret - 100
    window.location.reload(true)
}
Here is the code that successfully gets the page source:
import requests, urllib3
urllib3.disable_warnings()
head = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
}
url = "https://www.tupianzj.com/"
response = requests.get(url, headers=head, verify=False)  # first request: obtain the cookie values
cookies = {'t':'', 'r':''}
cookies['t'] = response.cookies['token']  # reuse the 'token' cookie from the first response unchanged
cookies['r'] = str(int(response.cookies['secret'])-100)  # the 'secret' cookie from the first response, minus 100
response = requests.get(url, headers=head, verify=False, cookies=cookies)  # with these cookies sent, the real page source comes back
print(response.text)
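The code above indexes response.cookies directly, so it raises KeyError whenever the site does not set token or secret on a request. Below is a minimal sketch of a more defensive variant. The names challenge_cookies and fetch_page are made up for illustration, and the sketch assumes the site still uses the token/secret scheme shown in the JS above. The get argument is any callable with the same signature as requests.get, so plain requests.get (no session, as in the original code) can be passed in.

```python
def challenge_cookies(token, secret):
    """Mirror the page script: t = token, r = secret - 100 (as a string)."""
    return {"t": token, "r": str(int(secret) - 100)}


def fetch_page(get, url, **kwargs):
    """Fetch url; if the first response sets token/secret, retry with t/r.

    `get` is assumed to behave like requests.get: it returns an object
    whose .cookies supports .get(name) and whose .text holds the body.
    """
    first = get(url, **kwargs)
    token = first.cookies.get("token")
    secret = first.cookies.get("secret")
    if token is None or secret is None:
        # No challenge cookies were set: assume this is already the real page.
        return first
    return get(url, cookies=challenge_cookies(token, secret), **kwargs)


# Usage (assumes the requests package is installed and `head` is the
# User-Agent dict defined earlier in this post):
#   import requests
#   page = fetch_page(requests.get, "https://www.tupianzj.com/",
#                     headers=head, verify=False)
#   print(page.text)
```

Passing requests.get rather than a requests.Session keeps the second request limited to the t and r cookies, which matches what the page script does when it deletes token and secret before reloading.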
--------------------------------------------------------------------------------------------
| Thanks again, everyone, for the helpful answers. Let's all keep at it!!! |
--------------------------------------------------------------------------------------------
Below is how the returned JS was decoded.
Save the following code as an HTML file:
<html>
<body>
<script>
a=62;
function encode() {
var code = document.getElementById('code').value;
code = code.replace(/[\r\n]+/g, '');
code = code.replace(/'/g, "\\'");
var tmp = code.match(/\b(\w+)\b/g);
tmp.sort();
var dict = [];
var i, t = '';
for(var i=0; i<tmp.length; i++) {
if(tmp[i] != t) dict.push(t = tmp[i]);
}
var len = dict.length;
var ch;
for(i=0; i<len; i++) {
ch = num(i);
code = code.replace(new RegExp('\\b'+dict[i]+'\\b','g'), ch);
if(ch == dict[i]) dict[i] = '';
}
document.getElementById('code').value = "eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--)d[e(c)]=k[c]||e(c);k=[function(e){return d[e]}];e=function(){return'\\\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\\\b'+e(c)+'\\\\b','g'),k[c]);return p}("
+ "'"+code+"',"+a+","+len+",'"+ dict.join('|')+"'.split('|'),0,{}))";
}
function num(c) {
return(c<a?'':num(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36));
}
function run() {
eval(document.getElementById('code').value);
}
function decode() {
var code = document.getElementById('code').value;
code2 = code.replace(/^eval/, '');
//alert(code);
document.getElementById('code').value = eval(code2);
}
</script>
<textarea id=code cols=80 rows=20>
</textarea>
<input type=button onclick=encode() value=Encode>
<input type=button onclick=run() value=Run>
<input type=button onclick=decode() value=Decode>
</body>
</html>
Double-click the HTML file to open it:
https://attach.52pojie.cn//forum/202208/01/144005dds7o5oc87rroos4.jpg?l
https://attach.52pojie.cn//forum/202208/05/180249ophojr3upo1qvpej.jpg?l
Then paste the JS into the text box (remove the <script> tags first):
https://attach.52pojie.cn//forum/202208/01/144455nq1ptjvz3qpv9qfq.jpg?l
https://attach.52pojie.cn//forum/202208/05/180252cwwjoqhk5vqpzjiw.jpg?l
Click the Decode (解码) button:
https://attach.52pojie.cn//forum/202208/05/180254xmurruvgd8urcwcd.jpg?l
https://attach.52pojie.cn//forum/202208/01/144309gkkwnzkncy9n2u2k.jpg?l
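The decode step can also be done without a browser. The following is a rough Python sketch of the same substitution the eval(function(p,a,c,k,e,d){...}) wrapper performs: each word token in the payload is a base-N index into the '|'-separated dictionary. The function names are invented for this example, and the argument-extraction regex assumes a single packed block whose payload only escapes single quotes as \'. Unlike the JS, tokens that are not in the dictionary are left unchanged.

```python
import re
from typing import List

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base(n: int, base: int) -> str:
    """Same encoding as the packer's e(c): 0-9, a-z, then A-Z for digits above 35."""
    if n < base:
        return _DIGITS[n]
    return to_base(n // base, base) + _DIGITS[n % base]


def unpack(payload: str, base: int, count: int, words: List[str]) -> str:
    """Replace every word token in payload with its dictionary entry."""
    lookup = {}
    for i in range(count):
        key = to_base(i, base)
        # An empty dictionary slot means the token stands for itself.
        lookup[key] = words[i] if i < len(words) and words[i] else key
    return re.sub(r"\b\w+\b",
                  lambda m: lookup.get(m.group(0), m.group(0)),
                  payload, flags=re.ASCII)


# Matches: }('<payload>',<base>,<count>,'<words>'.split('|')
_ARGS = re.compile(
    r"\}\s*\(\s*'(.*)'\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*'(.*)'\s*\.split\('\|'\)",
    re.S,
)


def unpack_script(script: str) -> str:
    """Pull the four arguments out of a packed script and unpack it."""
    m = _ARGS.search(script)
    if m is None:
        raise ValueError("no packed payload found")
    payload, base, count, words = m.groups()
    payload = payload.replace("\\'", "'")  # undo the JS escaping of single quotes
    return unpack(payload, int(base), int(count), words.split("|"))
```

Calling unpack_script(response.text) on the first response from the site would be the intended use; the result is unformatted JS, so a formatter is still handy for reading it.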
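"The site you are trying to visit contains a large amount of illegal or non-compliant content....." It probably has hotlink protection. Try adding browser-like request attributes such as the referrer.

Replying to 西瓜菠萝糖 (posted on 2022-8-1 00:48): Do you mean adding them to the request headers? Which attributes should I add?

If you can't get the source, isn't that because the site loads its content dynamically? You need selenium.

Thanks for sharing!

It's probably loaded dynamically. Try selenium.

Just add the cookies and it works.

The request headers for the request that gets the JS and the one that gets the page source are different.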
The request that gets the JS:
GET / HTTP/1.1
Host: www.tupianzj.com
Sec-Ch-Ua: "Chromium";v="91", " Not;A Brand";v="99"
Sec-Ch-Ua-Mobile: ?0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: close
The request that gets the page source:
GET / HTTP/2
Host: www.tupianzj.com
Cookie: t=30b5405b54bfe24d1b72f8bcdc13c863; r=4433
Cache-Control: max-age=0
Sec-Ch-Ua: "Chromium";v="91", " Not;A Brand";v="99"
Sec-Ch-Ua-Mobile: ?0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: same-origin
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Referer: https://www.tupianzj.com/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: close

Just add the cookies and it works.

Send the cookies along. Tested, it works.
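To see exactly where the two header dumps quoted above differ, a small comparison like the following can help. It is only an illustrative sketch: diff_headers is a made-up helper, and the long User-Agent, Accept and Sec-Ch-Ua values, which are identical in both dumps, are left out to keep it short. The request lines also differ (HTTP/1.1 versus HTTP/2), but those are not headers and are not compared here.

```python
def diff_headers(first, second):
    """Report header names present on one side only, or whose values differ."""
    shared = set(first) & set(second)
    return {
        "only_first": sorted(set(first) - set(second)),
        "only_second": sorted(set(second) - set(first)),
        "changed": sorted(k for k in shared if first[k] != second[k]),
    }


# Headers of the request that received the JS challenge.
js_request = {
    "Host": "www.tupianzj.com",
    "Sec-Ch-Ua-Mobile": "?0",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-User": "?1",
    "Sec-Fetch-Dest": "document",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "close",
}

# Headers of the request that received the real page source.
source_request = {
    "Host": "www.tupianzj.com",
    "Cookie": "t=30b5405b54bfe24d1b72f8bcdc13c863; r=4433",
    "Cache-Control": "max-age=0",
    "Sec-Ch-Ua-Mobile": "?0",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
    "Referer": "https://www.tupianzj.com/",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "close",
}

result = diff_headers(js_request, source_request)
print(result)
```

Of the differences it reports, the Cookie header is the one the working code earlier in the thread adds; that code does not send Referer or Cache-Control and still gets the page source.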