Downloading a web page with Python returns 403: am I being blocked?

拨Q posted on 2021-2-28 16:12

Last edited by 拨Q on 2021-2-28 18:30

import urllib.request

def getHtml(url):
    # Fetch the page and return its raw bytes
    html = urllib.request.urlopen(url).read()
    return html

def saveHtml(file_name, file_content):
    # Write the bytes to <file_name>.html, replacing any '/' in the name
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)

print("Download succeeded")

Downloading other news pages works fine, for example https://news.163.com/21/0227/15/G3RPMQT00001899O.html. How should I change the code for this one?

嗜血的蚂蚁 posted on 2021-2-28 16:24

Set the headers.

勇者为王 posted on 2021-2-28 16:29

Add this:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

拨Q posted on 2021-2-28 16:34

Last edited by 拨Q on 2021-2-28 16:36

I'll look into it some more.
It's still the same:

import urllib.request

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

def getHtml(url):
    html = urllib.request.urlopen(url).read()
    return html

def saveHtml(file_name, file_content):
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)

print("Download succeeded")
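A likely reason nothing changed in the attempt above: the headers dict is defined, but urlopen(url) is still called with the bare URL, so the dict is never sent. Here is a minimal sketch for checking this offline, without making any request. The helper names default_user_agent and request_user_agent are made up for illustration, and the "Python-urllib/3.x" value is what urllib normally uses by default, not something confirmed for this particular site.

```python
import urllib.request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/80.0.3987.132 Safari/537.36"
}

def default_user_agent():
    """User-Agent that urlopen(url) sends when no headers are supplied."""
    opener = urllib.request.build_opener()
    # OpenerDirector keeps its default headers in .addheaders
    return dict(opener.addheaders).get("User-agent")

def request_user_agent(url, headers=None):
    """User-Agent carried by a Request object built with the given headers."""
    req = urllib.request.Request(url, headers=headers or {})
    # Request stores header names capitalized, e.g. "User-agent"
    return req.get_header("User-agent")

url = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
print(default_user_agent())              # e.g. Python-urllib/3.x
print(request_user_agent(url))           # None: nothing was attached
print(request_user_agent(url, headers))  # the browser-like string
```

If the first line prints a Python-urllib value, the server can easily tell the request apart from a browser, which may be why it answers 403. The headers only take effect once they are wrapped in a Request and that object is passed to urlopen.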

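youngchoice posted on 2021-2-28 17:11

Right, you need to set headers so the request looks like it comes from a browser; otherwise the site may reject the request or even ban your IP.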
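To confirm that the server is really refusing the request (rather than, say, the network failing), one option is to catch urllib.error.HTTPError and look at its status code. This is a rough sketch only: fetch_or_explain, its opener parameter, fake_forbidden and the example.invalid URL are all made up here for illustration and do not come from the posts in this thread.

```python
import urllib.error
import urllib.request

def fetch_or_explain(url, headers, opener=urllib.request.urlopen):
    """Return (body, None) on success or (None, message) on an HTTP error.

    `opener` is a hypothetical hook so the function can be tried without
    touching the network; by default it is the real urlopen.
    """
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            return resp.read(), None
    except urllib.error.HTTPError as err:
        # err.code is the numeric status (403 when the server refuses)
        return None, f"HTTP {err.code} {err.reason}"

def fake_forbidden(req):
    """Stand-in opener that always answers 403, for offline testing."""
    raise urllib.error.HTTPError(req.full_url, 403, "Forbidden", None, None)

body, problem = fetch_or_explain(
    "https://example.invalid/page.html",
    {"User-Agent": "Mozilla/5.0"},
    opener=fake_forbidden,
)
print(problem)  # HTTP 403 Forbidden
```

With the default opener, the same call hits the real site. A 403 that persists even with a browser-like User-Agent would suggest the site checks something else as well, such as other headers, cookies, or the IP address.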

ldx539 posted on 2021-2-28 17:20

from urllib import request
from urllib.request import urlopen

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

def getHtml(url):
    # Wrap the URL in a Request so the headers are actually sent
    req = request.Request(url, headers=headers)
    html = urlopen(req).read()
    return html

def saveHtml(file_name, file_content):
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)

print("Download succeeded")

ldx539 posted on 2021-2-28 17:24

13422490181 posted on 2021-2-28 17:22
from urllib import request
from urllib.request import urlopen

Adding a small UA (User-Agent) disguise is all it takes.

rose520rain posted on 2021-2-28 17:33

Learned something!

吾爱破解 - 52pojie.cn's Archiver
