Learning PHP recently: fopen on a runoob page returns garbled text

plaodj · posted on 2024-7-17 10:10

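I've recently been learning PHP, along with related topics such as CSS, on the runoob site.

I often write short snippets to test and analyze how a feature works.

When I got to PHP's fopen function and tried it against runoob itself, what came back was garbled text.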

[PHP]

<?php

//$url = 'https://www.runoob.com/cssref/css3-pr-animation-timing-function.html'; // the URL to request
$refurl = 'https://www.runoob.com/cssref/pr-outline-color.html'; // spoofed Referer URL

// Target URL
$url = 'https://www.runoob.com/cssref/css3-pr-animation-timing-function.html';

// Create a new stream context
$opts = stream_context_create(array(
    'http' => array(
        //'header' => 'Referer: ' . $refurl, // set the Referer header
        'accept' =>  'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding' => 'gzip, deflate, br',
        'accept-language' => 'zh-CN,zh;q=0.9',
        'cache-control' => 'no-cache',
        'cookie' => 'runoob-uuid=8b300359-ca76-4deb-8243-4a3f677b4e5f',
        'pragma' => 'no-cache',
        'referer' => 'https://www.runoob.com/cssref/pr-charset-rule.html',
        'sec-fetch-dest' => 'document',
        'sec-fetch-mode' => 'navigate',
        'sec-fetch-site' => 'same-origin',
        'sec-fetch-user' => '?1',
        'upgrade-insecure-requests' => '1',
        'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
)));

// Open the URL with fopen, passing the context
$file = fopen($url, 'r', false, $opts);

// Read the data and print it
$data = fread($file, 500000);
fclose($file);

echo $data;

?>

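Likewise, running curl https://www.runoob.com/cssref/css3-pr-animation-timing-function.html

does not behave normally either: unlike other sites, it does not return the page source.

I also tested in Python, where it works fine: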

[Python]

import requests

url = "https://www.runoob.com/cssref/css-reference.html"

response = requests.get(url)
print(response)
if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to retrieve the webpage: Status code {response.status_code}")

I assume runoob applies some restriction on the server side. How is that implemented? And what should the PHP code do so that the result is not garbled?

boxer · posted on 2024-7-17 10:52

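Garbled output usually has one of two causes: (1) the text was decoded with the wrong charset, e.g. the page is GBK but you decoded it as UTF-8 (or the other way round); (2) the response is compressed.
Looking at your code above, what you received is probably compressed. Try removing the 'accept-encoding' => 'gzip, deflate, br' line.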
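
The two causes above can be told apart by looking at the raw bytes. Below is a minimal Python sketch, assuming the response body is available as bytes. The helper names looks_gzipped and describe are made up for illustration, and the check only covers gzip and UTF-8, not every possible encoding:

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body: bytes) -> bool:
    """Return True if the body starts with the gzip magic number."""
    return body[:2] == GZIP_MAGIC


def describe(body: bytes) -> str:
    """Give a rough guess at why a body might print as garbled text."""
    if looks_gzipped(body):
        return "gzip-compressed"
    try:
        body.decode("utf-8")
        return "valid UTF-8"
    except UnicodeDecodeError:
        return "not UTF-8 (possibly another charset such as GBK)"


sample = "<title>中文</title>"  # illustrative sample text, not taken from runoob
print(describe(gzip.compress(sample.encode("utf-8"))))  # gzip-compressed
print(describe(sample.encode("utf-8")))                 # valid UTF-8
print(describe(sample.encode("gbk")))                   # not UTF-8 (possibly another charset such as GBK)
```

If the first case is reported, the body needs to be decompressed before any charset question comes into play.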

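plaodj · posted on 2024-7-17 13:26

boxer posted on 2024-7-17 10:52
Garbled output usually has one of two causes: (1) the text was decoded with the wrong charset, e.g. the page is GBK but you decoded it as UTF-8 (or the other way round); (2) the response is compressed.
Looking at your ...

I removed it and the result is the same. Judging by how the garbled output looks, it doesn't seem to be an encoding problem; rather, the server seems to impose some restriction.

A simple curl www.baidu.com returns the source of the Baidu home page,

whereas curl https://www.runoob.com/cssref/cs ... iming-function.html does not return the page source.

So what kind of restriction is the server applying? I'm also curious how this sort of anti-scraping technique is implemented.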

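boxer · posted on 2024-7-17 14:26

plaodj posted on 2024-7-17 13:26
I removed it and the result is the same. Judging by how the garbled output looks, it doesn't seem to be an encoding problem; rather, the server seems to impose some restriction.

A simple curl www.baidu. ...

It really is compressed. Try decompressing it yourself and you'll see.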

randomnany · posted on 2024-7-17 14:51

What encoding is the PHP file itself saved in?
Or try converting the encoding once more on output.

njbb888 · posted on 2024-7-17 17:01

Hardly anyone uses PHP anymore, right?

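爱飞的猫 · posted on 2024-7-22 07:37

The server ignores the client's Accept-Encoding setting and always compresses its output.

Passing --compressed to curl makes it decompress the response automatically: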

curl --compressed https://www.runoob.com/cssref/css-reference.html

Or, in a *nix environment, piping the "garbled" content to gzip -d also shows the corresponding HTML source:

curl https://www.runoob.com/cssref/css-reference.html | gzip -d
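
For comparison with the curl commands above, here is a rough Python sketch of the same idea using only the standard library. urllib.request does not decompress responses on its own, so the body is decompressed explicitly. requests, by contrast, decodes gzip and deflate transparently, which is likely why the Python test earlier in the thread printed readable HTML. The function names decode_body and fetch_html are hypothetical, and falling back to UTF-8 when the server declares no charset is an assumption:

```python
import gzip
import urllib.request
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Undo gzip or deflate compression; return other bodies unchanged."""
    encoding = content_encoding.strip().lower()
    # Trust the gzip magic number too, in case the header is missing.
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate
    return raw


def fetch_html(url: str) -> str:
    """Fetch a page and return readable text even if it arrives compressed."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=10) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding", "")
        # Assumption: treat the page as UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
    return decode_body(raw, encoding).decode(charset, errors="replace")
```

Calling fetch_html("https://www.runoob.com/cssref/css-reference.html") should then behave roughly like the curl --compressed command, though this has not been checked against the live site here. Brotli ("br") is deliberately not requested, because the standard library has no decoder for it.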
