Scraped Chinese content comes out garbled: where's the problem?
I'm learning to scrape novel text with Python, but everything that comes down is mojibake, and after searching for ages I still can't find where the problem is. Sample output:

第ä¸å·å®′æ¡å-è±aæ°ä¸ç»ä1æ©é»å·¾è±éé|ç«å 爬取成功!!!
第äoåÂ·å¼ ç¿¼å¾·æé-ç£é® ä½å½è è°èˉå®|ç« 爬取成功!!!
第ä¸å·议温æè£åå±ä¸åé|éç æèèˉ′åå¸ 爬取成功!!!
第åå·åoæ±å¸éçè·μä½ è°è£è′¼å-å¾·ç®å 爬取成功!!!
第äoå·åç«èˉèˉ¸éåoæ1å ¬ç ′å 3å μä¸è±æåå¸ 爬取成功!!!
import requests
from bs4 import BeautifulSoup

# Scrape all chapter titles and content of "Romance of the Three Kingdoms" from shicimingju
if __name__ == "__main__":
    # Fetch the table-of-contents page
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text
    # print(page_text)
    # Parse the chapter titles and detail-page URLs out of the contents page
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text, 'lxml')
    # 2. Extract the chapter titles and the detail-page URLs
    li_list = soup.select('.book-mulu>ul>li')
    fp = open('./sanguo.txt', 'w', encoding="UTF-8")
    for li in li_list:
        title = li.a.string
        detail_url = 'https://www.shicimingju.com/' + li.a['href']
        # Request the detail page and parse out the chapter content
        detail_page_text = requests.get(url=detail_url, headers=headers).text
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        # Chapter content extracted
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, '爬取成功!!!')

咸鱼灭 replied on 2021-7-7 15:19 with one fix: fetch the raw byte string with content and decode it as UTF-8 explicitly.

import requests
from bs4 import BeautifulSoup

# Same scraper, but decoding the raw bytes as UTF-8 instead of using .text
if __name__ == "__main__":
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    # .content is the raw byte string; decode it explicitly as UTF-8
    page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
    soup = BeautifulSoup(page_text, 'lxml')
    li_list = soup.select('.book-mulu>ul>li')
    fp = open('./sanguo.txt', 'w', encoding="UTF-8")
    for li in li_list:
        title = li.a.string
        detail_url = 'https://www.shicimingju.com/' + li.a['href']
        # Decode the detail page the same way
        detail_page_text = requests.get(url=detail_url, headers=headers).content.decode('utf-8')
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, '爬取成功!!!')
Encoding problem.

Try sending a charset in your request headers.

That isn't random garbage, it should just be an encoding problem. See whether you can get the server to return a UTF-8 response.
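That diagnosis fits the output: some characters survive while the rest turn into Latin accented letters, which is the classic signature of UTF-8 bytes decoded as ISO-8859-1, the fallback requests uses when the server sends no charset in the Content-Type header. A minimal sketch reproducing the symptom (the sample string is just an illustration, not taken from the site):

raw = '第一回'.encode('utf-8')    # the bytes a UTF-8 server actually sends
print(raw.decode('iso-8859-1'))   # mojibake: 'ç¬¬ä¸...' instead of hanzi, like the output above
print(raw.decode('utf-8'))        # '第一回', decoded with the right charset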
FrebEaton replied on 2021-7-7 16:09 with another fix: set the encoding attribute on the response object before reading text.

import requests
from bs4 import BeautifulSoup

# Same scraper again, this time setting the encoding on the Response object
if __name__ == "__main__":
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    req = requests.get(url=url, headers=headers)
    # Tell requests the real charset before touching .text
    req.encoding = "utf-8"
    page_text = req.text
    soup = BeautifulSoup(page_text, 'lxml')
    li_list = soup.select('.book-mulu>ul>li')
    fp = open('./sanguo.txt', 'w', encoding="UTF-8")
    for li in li_list:
        title = li.a.string
        detail_url = 'https://www.shicimingju.com/' + li.a['href']
        req1 = requests.get(url=detail_url, headers=headers)
        req1.encoding = "utf-8"
        detail_page_text = req1.text
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, '爬取成功!!!')
Replying to FrebEaton (2021-7-7 16:09): Thanks for the guidance, I learned another trick.

Replying to 咸鱼灭 (2021-7-7 15:19): Thanks for sharing, now I understand what was going on.

Learned something here.

First off, requests' .text is not reliable; the method's own docstring spells it out:
"""Content of the response, in unicode.
If Response.encoding is None, encoding will be guessed using
``chardet``.
The encoding of the response content is determined based solely on HTTP
headers, following RFC 2616 to the letter. If you can take advantage of
non-HTTP knowledge to make a better guess at the encoding, you should
set ``r.encoding`` appropriately before accessing this property.
"""
In other words, it decodes with whatever chardet guesses. If you want a guaranteed decode, find out what encoding the site actually uses, grab the raw byte string with content, and decode it yourself. Alternatively, keep the response object, set its encoding attribute to the right charset, and then read text.

In short, don't use .text directly.
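If you don't know the site's charset ahead of time, requests can also guess it from the body bytes instead of the HTTP headers, via the apparent_encoding property. A minimal sketch of that variant, reusing the URL from this thread:

import requests

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
resp = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
print(resp.encoding)            # from the headers; often ISO-8859-1 when no charset is sent
print(resp.apparent_encoding)   # chardet's guess from the body bytes themselves
resp.encoding = resp.apparent_encoding  # adopt the guess before reading .text
page_text = resp.text           # now decoded with the detected charset

This is slower than hard-coding 'utf-8', since the body has to be scanned, but it keeps working if the site ever changes its encoding.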