Getting a 403 when downloading a web page with Python. Am I being blocked?
Last edited by 拨Q on 2021-2-28 18:30

import urllib.request

def getHtml(url):
    # Fetch the page and return its raw bytes.
    html = urllib.request.urlopen(url).read()
    return html

def saveHtml(file_name, file_content):
    # Write the bytes to <file_name>.html, replacing any "/" in the name.
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)
print("Download succeeded")

Downloading other news pages works fine, for example https://news.163.com/21/0227/15/G3RPMQT00001899O.html. How should I change the code for this one?

You need to set headers. Add this:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

Last edited by 拨Q on 2021-2-28 16:36
I'll keep looking into it.
The result is still the same:

import urllib.request

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

def getHtml(url):
    # Note: headers is defined above but never passed to urlopen(),
    # so the request still goes out with urllib's default User-Agent.
    html = urllib.request.urlopen(url).read()
    return html

def saveHtml(file_name, file_content):
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)
print("Download succeeded")

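A likely reason nothing changed: defining a headers dict does nothing by itself. urllib only sends the headers attached to a Request object. The sketch below makes no network call; it just compares a bare Request with one that carries a browser-like User-Agent, and prints the value urllib falls back to when none is set. The shortened User-Agent string is an illustrative assumption.

```python
import urllib.request

url = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
# Shortened, illustrative User-Agent; any real browser string works the same way.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Request objects only describe a request; nothing is sent here.
bare = urllib.request.Request(url)
disguised = urllib.request.Request(url, headers=headers)

# Request stores header names in capitalized form, e.g. "User-agent".
print(bare.get_header("User-agent"))       # no User-Agent attached explicitly
print(disguised.get_header("User-agent"))  # the browser-like string

# The fallback value an opener adds when the request has no User-Agent of its own.
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_ua)
```

The fallback identifies the client as Python's urllib, which some sites appear to reject with a 403.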
Right, you need to set headers so the request looks like it comes from a browser. Otherwise the site may reject it or even ban your IP.

from urllib import request
from urllib.request import urlopen

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

def getHtml(url):
    # Attach the headers to a Request object, then open that instead of the bare URL.
    req = request.Request(url, headers=headers)
    html = urlopen(req).read()
    return html

def saveHtml(file_name, file_content):
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)
print("Download succeeded")

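If the site still refuses the request, urlopen raises urllib.error.HTTPError and the script stops with a traceback. Here is a minimal sketch of catching that and reporting the status code instead. The names fetch, BROWSER_HEADERS and the opener parameter are hypothetical helpers introduced for illustration, not part of the code posted above.

```python
import urllib.error
import urllib.request

# Hypothetical header set; any current browser User-Agent string would do.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36"
}

def fetch(url, headers=None, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes); the body is empty on an HTTP error."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(req) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 403, 404, 500 and similar responses land here instead of ending the script.
        return err.code, b""

# Example call (performs a real network request, so it is left commented out):
# status, body = fetch("https://blog.artron.net/space-1734003-do-blog-id-1677521.html",
#                      headers=BROWSER_HEADERS)
```

The opener parameter exists only so the function can be exercised without network access; in normal use it can be left at its default.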
13422490181 posted on 2021-2-28 17:22
from urllib import request
from urllib.request import urlopen
Adding a small UA disguise is all it takes.
Learned something!
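For scripts that already call urlopen(url) in many places, one possible alternative is to install a global opener that carries the User-Agent, so the existing calls can stay unchanged. This is a sketch under that assumption; make_browser_opener is a hypothetical helper name.

```python
import urllib.request

def make_browser_opener(user_agent):
    """Build an opener that adds user_agent to requests lacking their own User-Agent."""
    opener = urllib.request.build_opener()
    opener.addheaders = [("User-Agent", user_agent)]
    return opener

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36")
opener = make_browser_opener(ua)

# From here on, plain urllib.request.urlopen(url) calls go through this opener.
urllib.request.install_opener(opener)
```

Passing headers to Request, as in the accepted reply, keeps the disguise local to one call; installing an opener changes the default for the whole process.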