拨Q posted on 2021-2-28 16:12

Downloading a web page with Python returns 403. Is the request being blocked?

This post was last edited by 拨Q on 2021-2-28 18:30

import urllib.request

def getHtml(url):
    # Fetch the page and return its raw bytes
    html = urllib.request.urlopen(url).read()
    return html

def saveHtml(file_name, file_content):
    # Write the bytes to <file_name>.html, replacing any '/' in the name
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)

print("Download succeeded")


Downloading other news pages works fine, for example https://news.163.com/21/0227/15/G3RPMQT00001899O.html. How should I change the code so that this page works too?

嗜血的蚂蚁 posted on 2021-2-28 16:24

You need to set the request headers.

勇者为王 posted on 2021-2-28 16:29

Add these headers:

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

拨Q posted on 2021-2-28 16:34

This post was last edited by 拨Q on 2021-2-28 16:36

I'll look into it some more.
The result is still the same:
import urllib.request

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

def getHtml(url):
    html = urllib.request.urlopen(url).read()
    return html

def saveHtml(file_name, file_content):
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)

aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)

print("Download succeeded")

细水流长 posted on 2021-2-28 16:44

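youngchoice posted on 2021-2-28 17:11

Right, you have to set headers so the request looks like it comes from a browser. Otherwise the site may reject the request or even ban your IP.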
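
As a rough illustration of why the headers matter (a sketch, not something posted in the thread): urllib identifies itself as `Python-urllib/x.y` unless a User-Agent is attached to the request. Defining a `headers` dict without passing it to a `Request`, as in the second attempt above, leaves that default in place. The snippet below only builds the objects and sends nothing over the network; the User-Agent value is just an example string.

```python
import urllib.request

url = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"

# The User-Agent urllib falls back to when the request does not carry one.
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_ua)  # e.g. Python-urllib/3.9

# A Request built from the URL alone carries no User-Agent of its own.
bare = urllib.request.Request(url)
print(bare.has_header("User-agent"))  # False

# The dict only takes effect once it is passed to Request(...).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36"
}
req = urllib.request.Request(url, headers=headers)
# Request normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent") == headers["User-Agent"])  # True
```

Passing `req` instead of the bare URL to `urllib.request.urlopen` is what actually sends the browser-like header.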

ldx539 posted on 2021-2-28 17:20

from urllib import request
from urllib.request import urlopen

# Browser-like User-Agent, sent instead of urllib's default one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36'
}

def getHtml(url):
    # Attach the headers to a Request object, then open that request
    req = request.Request(url, headers=headers)
    html = urlopen(req).read()
    return html


def saveHtml(file_name, file_content):
    # Write the raw bytes to <file_name>.html, replacing any '/' in the name
    with open(file_name.replace('/', '_') + ".html", "wb") as f:
        f.write(file_content)


aurl = "https://blog.artron.net/space-1734003-do-blog-id-1677521.html"
html = getHtml(aurl)
saveHtml("index", html)

print("Download succeeded")
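
If a site still refuses the request, it can help to see the status code instead of a traceback. The following is a minimal sketch under a few assumptions: the helper name `fetch`, the `BROWSER_UA` constant, and the 10-second timeout are all made up for illustration and do not appear in the thread. It catches `urllib.error.HTTPError`, which is what `urlopen` raises for responses such as 403.

```python
import urllib.error
import urllib.request

# Example browser-like User-Agent (an assumption; any current browser string should do).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36"
)

def fetch(url, timeout=10):
    """Return (status_code, body_bytes); body_bytes is None on an HTTP error."""
    req = urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # err.code holds the HTTP status, e.g. 403 when the server refuses the request.
        return err.code, None
```

A caller could then write `status, body = fetch(aurl)` and save the page only when `body` is not `None`. Network failures other than HTTP status errors (for example `urllib.error.URLError`) are deliberately left unhandled here.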

ldx539 posted on 2021-2-28 17:22



ldx539 posted on 2021-2-28 17:24

13422490181 posted on 2021-2-28 17:22
from urllib import request
from urllib.request import urlopen

Adding a simple User-Agent disguise is all it takes.
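
Another way to apply the same disguise, sketched here on the assumption that you would rather leave the original `getHtml` unchanged: install a global opener whose default headers carry a browser-like User-Agent. The helper name `install_browser_user_agent` and the `BROWSER_UA` constant are hypothetical; `build_opener`, `addheaders`, and `install_opener` are standard `urllib.request` features.

```python
import urllib.request

# Example browser-like User-Agent (an assumption; substitute a current browser string).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36"
)

def install_browser_user_agent(user_agent=BROWSER_UA):
    """Make later urllib.request.urlopen(...) calls send `user_agent` by default."""
    opener = urllib.request.build_opener()
    # Replace the default [('User-agent', 'Python-urllib/x.y')] entry.
    opener.addheaders = [("User-Agent", user_agent)]
    # From here on, urllib.request.urlopen uses this opener.
    urllib.request.install_opener(opener)
    return opener
```

After calling `install_browser_user_agent()` once at the top of the script, a plain `urllib.request.urlopen(url)` sends the browser-like header. The trade-off is that this changes behavior globally for the process, whereas passing a `Request` with headers affects only that one call.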

rose520rain posted on 2021-2-28 17:33

Learned something new!