Learning PHP recently: fopen on a runoob page returns garbled output

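plaodj posted on 2024-7-17 10:10

I've recently been learning PHP on runoob, along with related topics such as CSS.

I often write a short snippet to test a feature and analyze how it works.

When I got to PHP's fopen function and tried it against runoob itself, what came back was garbled output.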

<?php

//$url = 'https://www.runoob.com/cssref/css3-pr-animation-timing-function.html'; // the URL you want to request
$refurl = 'https://www.runoob.com/cssref/pr-outline-color.html'; // spoofed Referer URL

// Target URL
$url = 'https://www.runoob.com/cssref/css3-pr-animation-timing-function.html';

// Create a new stream context with options
$opts = stream_context_create(array(
'http' => array(
//'header' => 'Referer: ' . $refurl, // set the Referer header
'accept' =>'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding' => 'gzip, deflate, br',
'accept-language' => 'zh-CN,zh;q=0.9',
'cache-control' => 'no-cache',
'cookie' => 'runoob-uuid=8b300359-ca76-4deb-8243-4a3f677b4e5f',
'pragma' => 'no-cache',
'referer' => 'https://www.runoob.com/cssref/pr-charset-rule.html',
'sec-fetch-dest' => 'document',
'sec-fetch-mode' => 'navigate',
'sec-fetch-site' => 'same-origin',
'sec-fetch-user' => '?1',
'upgrade-insecure-requests' => '1',
'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36)',
)));

// Call fopen with the context
$file = fopen($url, 'r', false, $opts);

// Read the data and print it
$data = fread($file, 500000);
fclose($file);

echo $data;

?>

Likewise, running curl https://www.runoob.com/cssref/css3-pr-animation-timing-function.html

also behaves abnormally: unlike other sites, it does not return the page source.

I also tested it in Python, where the result is normal:
import requests
import os
from urllib import parse
from bs4 import BeautifulSoup

url = "https://www.runoob.com/cssref/css-reference.html"

response = requests.get(url)
print(response)
if response.status_code == 200:
    html_content = response.text
    print(html_content)
else:
    print(f"Failed to retrieve the webpage: Status code {response.status_code}")

I suspect runoob applies some restriction on the server side. How is that implemented? And how can PHP code fetch the page without getting garbled output?

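boxer posted on 2024-7-17 10:52

Garbled output usually has two possible causes: (1) the content was decoded incorrectly, e.g. the page is GBK but you decoded it as UTF-8 (or the other way round); (2) the content is compressed.
Looking at your code, what you received is probably compressed. Try removing 'accept-encoding' => 'gzip, deflate, br'.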
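
The two causes above can be told apart by looking at the raw bytes. Below is a minimal sketch in Python, assuming the payload is already in memory; the helper name `diagnose` and the sample markup are made up for illustration. It relies on the fact that every gzip stream starts with the bytes 0x1f 0x8b.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def diagnose(raw: bytes) -> str:
    """Return a rough guess at why `raw` prints as garbage (hypothetical helper)."""
    if raw[:2] == GZIP_MAGIC:
        return "gzip-compressed"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8 (maybe GBK or another charset)"
    return "valid UTF-8"


sample = "<p>中文</p>"  # made-up sample content
print(diagnose(gzip.compress(sample.encode("utf-8"))))  # cause 2: compressed
print(diagnose(sample.encode("gbk")))                   # cause 1: wrong charset
print(diagnose(sample.encode("utf-8")))                 # neither
```

If the first check fires, decompressing the bytes is the next step; if only the second one does, the fix is to decode with the page's real charset.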

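plaodj posted on 2024-7-17 13:26

boxer posted on 2024-7-17 10:52
Garbled output usually has two possible causes: (1) the content was decoded incorrectly, e.g. the page is GBK but you decoded it as UTF-8 (or the other way round); (2) the content is compressed.
Looking at ...

Removing it makes no difference. The garbled output doesn't look like an encoding problem; it looks like the server is applying some restriction.

A plain curl www.baidu.com fetches the source of the Baidu home page,

whereas curl https://www.runoob.com/cssref/css3-pr-animation-timing-function.html does not return the page source.

So what kind of restriction is the server applying? I'm also curious how this sort of anti-scraping technique is implemented.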

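boxer posted on 2024-7-17 14:26

plaodj posted on 2024-7-17 13:26
Removing it makes no difference. The garbled output doesn't look like an encoding problem; it looks like the server is applying some restriction.

A plain curl www.baidu. ...

It is compressed. Try decompressing it yourself and you'll see.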

randomnany posted on 2024-7-17 14:51

What encoding is the PHP file itself saved in?
Or convert the encoding once more on output.

njbb888 posted on 2024-7-17 17:01

Hardly anyone uses PHP anymore, right?

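爱飞的猫 posted on 2024-7-22 07:37

The server ignores the client's Accept-Encoding header and always compresses its output.

Passing `--compressed` to curl makes it decompress the response automatically: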

```sh
curl --compressed https://www.runoob.com/cssref/css-reference.html
```

Alternatively, in a *nix environment, piping the "garbled" content to `gzip -d` also reveals the HTML source:

```sh
curl https://www.runoob.com/cssref/css-reference.html | gzip -d
```
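
The same idea can be sketched in code. The `requests` library decodes gzip and deflate responses transparently, which would explain why the Python test earlier in the thread looked normal, while a lower-level client hands back the raw bytes. The sketch below uses only the Python standard library; `decode_body` and `fetch` are hypothetical helper names, and decoding the page as UTF-8 is an assumption rather than something confirmed in this thread.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Undo gzip/deflate compression on a response body (hypothetical helper)."""
    enc = content_encoding.strip().lower()
    # Trust the Content-Encoding header, but also fall back to the gzip magic
    # bytes in case the header is missing.
    if enc == "gzip" or raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    if enc == "deflate":
        # Assumes the zlib-wrapped form of deflate.
        return zlib.decompress(raw)
    return raw


def fetch(url: str) -> str:
    """Fetch a page and return its text, assuming the page is UTF-8."""
    with urllib.request.urlopen(url) as resp:  # urlopen does not decompress
        raw = resp.read()
        encoding = resp.headers.get("Content-Encoding", "")
    return decode_body(raw, encoding).decode("utf-8", errors="replace")


# Offline round trip standing in for a server that always gzips its output.
page = "<html>ok</html>".encode("utf-8")
print(decode_body(gzip.compress(page), "gzip") == page)
```

Calling `fetch("https://www.runoob.com/cssref/css-reference.html")` would exercise the network path; whether the site serves that default client the same response it serves curl is not something this sketch can guarantee.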


吾爱破解 - 52pojie.cn's Archiver
