Help: parsing page content with Python

kover posted on 2022-12-15 12:35

Last edited by kover on 2022-12-15 15:39

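Take a WeChat link like this one: https://mp.weixin.qq.com/s?src=11&timestamp=1671077771&ver=4227&signature=EdUzpelUazkqkRVfv3HyP-E9ORc8ruu2R7Os6x3T3FDWNGDMyRVWCoe6aWfwCxre4zokjSqhvWCdjGaE7GTCGNdpBr*97VmwH3Jr0Zo4XbAvoqyqUJGIC4aq*VSWwlct&new=1
When I view the source, it doesn't look like ordinary HTML; the content is broken up by a lot of extra markup.
I want to extract the content inside it. How should I write the code?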
```
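import re
import requests
# today_url and headers are defined elsewhere; the two '关键字' ("keyword") strings are placeholders
rsp = requests.get(today_url, headers=headers)
hot = rsp.content.decode('utf8')
news_list = re.findall('(?<=关键字？).*(?=关键字)', hot)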
```
The approach other people use no longer works here, because this isn't clean HTML like theirs.
A lot of markup like this has been added:
```
<section style="margin-top: 10px;margin-bottom: 10px;max-width: 100%;min-height: 1em;letter-spacing: 0.544px;line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><section data-role="outer" label="Powered by 135editor.com" style="margin-top: 10px;margin-bottom: 10px;white-space: normal;max-width: 100%;font-family: -apple-system-font, BlinkMacSystemFont, "Helvetica Neue", "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.544px;min-height: 1em;background-color: rgb(255, 255, 255);line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><section data-tools="135编辑器" data-id="89202" style="margin-top: 10px;margin-bottom: 10px;max-width: 100%;min-height: 1em;letter-spacing: 0.544px;line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><section style="margin-top: 10px;margin-bottom: 10px;max-width: 100%;min-height: 1em;letter-spacing: 0.544px;line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><section data-role="outer" label="Powered by 135editor.com" style="margin-top: 10px;margin-bottom: 10px;white-space: normal;max-width: 100%;font-family: -apple-system-font, BlinkMacSystemFont, "Helvetica Neue", "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei UI", "Microsoft YaHei", Arial, sans-serif;letter-spacing: 0.544px;min-height: 1em;background-color: rgb(255, 255, 255);line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><section data-tools="135编辑器" data-id="89202" style="margin-top: 10px;margin-bottom: 10px;max-width: 100%;min-height: 1em;letter-spacing: 0.544px;line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><section style="margin-top: 10px;margin-bottom: 10px;max-width: 100%;min-height: 1em;letter-spacing: 0.544px;line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word 
!important;"><p style="margin-top: 10px;margin-bottom: 10px;max-width: 100%;letter-spacing: 0.544px;line-height: 2em;box-sizing: border-box !important;overflow-wrap: break-word !important;"><strong style="max-width: 100%;box-sizing: border-box !important;overflow-wrap: break-word !important;"><span style="max-width: 100%;letter-spacing: 0.544px;color: rgb(0, 0, 0);box-sizing: border-box !important;overflow-wrap: break-word !important;">1、</span></strong><span style="max-width: 100%;letter-spacing: 0.544px;color: rgb(0, 0, 0);box-sizing: border-box !important;overflow-wrap: break-word !important;">
```
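
For reference, markup like the sample above is still regular HTML; the editor only wraps every piece of text in heavily styled `section`, `p`, `strong`, and `span` tags. Below is a minimal sketch of dropping those wrappers with Python's standard-library `html.parser`. The `sample` string is a shortened, made-up stand-in for the real page source, and `TextCollector` and `visible_text` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Shortened, hypothetical stand-in for the WeChat article markup.
sample = (
    '<section style="margin-top: 10px;line-height: 2em;">'
    '<section data-role="outer" label="Powered by 135editor.com">'
    '<p style="margin-top: 10px;">'
    '<strong><span style="color: rgb(0, 0, 0);">1、</span></strong>'
    '<span style="color: rgb(0, 0, 0);">first headline。</span>'
    '</p></section></section>'
)


class TextCollector(HTMLParser):
    """Collect text nodes and ignore every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once for each run of text between tags.
        self.parts.append(data)


def visible_text(markup: str) -> str:
    """Return the text of an HTML fragment with all tags dropped."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts).strip()


print(visible_text(sample))  # prints: 1、first headline。
```

On a full page this would also pick up the contents of `script` and `style` elements, so it is probably best applied only to the article body.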

sunsjw posted on 2022-12-15 13:12

Do you really not know anything about this? Then you should search online for Python web scraping.

know1234 posted on 2022-12-15 14:28

What are you trying to do?

surepj posted on 2022-12-15 14:30

This page has so much nesting that the content really is hard to extract. This is the best I could do:
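import requests
from lxml import etree

url = "https://mp.weixin.qq.com/s?src=11&timestamp=1671077771&ver=4227&signature=EdUzpelUazkqkRVfv3HyP-E9ORc8ruu2R7Os6x3T3FDWNGDMyRVWCoe6aWfwCxre4zokjSqhvWCdjGaE7GTCGNdpBr*97VmwH3Jr0Zo4XbAvoqyqUJGIC4aq*VSWwlct&new=1"
response = requests.get(url)
html = etree.HTML(response.text)
# Collect every text node under the article container
texts = html.xpath('//*[@id="img-content"]//text()')
# Join the nodes, strip whitespace, then split on the Chinese full stop '。'
t2 = (''.join(texts)).replace(' ','').replace('\n','').replace('\t','').split('。')
# Skip the last three pieces (presumably trailing non-news text)
for i, item in enumerate(t2[:-3]):
    if i == 0:
        # The first piece carries leading page text; keep only what follows the last '、'
        print(f"1、{item.split('、')[-1]}")
    elif i == 5:
        # This piece has the account name embedded in it
        print(item.replace('公众号：365资讯简报', ''))
    else:
        print(item)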

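Output (translated from Chinese):
1、National Health Commission: because the actual number can no longer be tracked accurately, data on asymptomatic infections will no longer be published, effective immediately
2、Officials: a second booster dose will be offered to groups at high risk of infection, people aged 60 and over, and others
3、State Post Bureau: reinforcements from outside Beijing are being dispatched urgently to support the city, with priority given to delivering medicines and epidemic-prevention supplies
4、MIIT: antigen test kits are temporarily in short supply in some areas; as production capacity comes online, public demand can be met
5、Beijing: the whole city has been moved to routine prevention and control status, with no high-risk areas
6、Ministry of Education: this year's postgraduate entrance exam will have exam rooms for candidates with a positive nucleic acid test, with at least 1 medical worker at each test site
7、Two ministries: pre-installed phone apps, other than basic-function software, must be uninstallable, effective from 2023
8、The strategic plan outline for expanding domestic demand is released: by 2035, per capita income of urban and rural residents will reach a new level and the middle-income group will expand significantly
9、Guizhou: from the 16th through February 28 next year, 377 A-level scenic areas waive the main entrance ticket, with 50% off self-driving, accommodation, and goods
10、Huanggang, Hubei: starting next year, families having a third child receive a 1,000 yuan annual childcare subsidy and a one-time 10,000 yuan home-purchase subsidy
11、Japan is expected to announce its defense budget this week and plans to designate China as a "threat" in its security documents
12、Heavy rain in the capital of the DR Congo triggered widespread flooding, killing at least 141 people and destroying many houses
13、US lawmakers jointly introduce a bill to pressure the Biden administration, seeking to ban TikTok and other foreign social media from operating in the US
14、The US plans to send Patriot missile defense systems to Ukraine; Russia: if Ukraine receives Patriot missiles, it will strike them
15、World Cup semifinal: France beat Morocco 2-0 and advance to the final against Argentina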
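
The fixed indexes in the loop (`i == 0`, `i == 5`, `t2[:-3]`) are tied to this one article. As a hedged alternative, here is a small sketch that picks out items by their `N、` numbering instead, assuming every item starts with digits followed by `、` and ends at the next `。`. The function name `split_numbered_items` and the `sample` text are made up for illustration; an item that itself contains something like `1、2` would be split incorrectly.

```python
import re
from typing import List


def split_numbered_items(text: str) -> List[str]:
    """Split text like '1、foo。2、bar。' into ['1、foo', '2、bar']."""
    # Remove all whitespace, like the replace() chain in the post above.
    compact = re.sub(r"\s+", "", text)
    # Each item starts with digits + '、' and runs up to the next '。'.
    return re.findall(r"\d+、[^。]+", compact)


# Hypothetical input: header text, three items, footer text.
sample = "header1、alpha。\n2、beta。\t3、gamma。footer"
for item in split_numbered_items(sample):
    print(item)
```

Header and footer text drop out on their own because they do not match the `N、` pattern, so no per-article index has to be hardcoded.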

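kover posted on 2022-12-15 14:50

surepj posted on 2022-12-15 14:30
This page has so much nesting that the content really is hard to extract. This is the best I could do:
import requests ...

Thanks a lot. This page really is a pain to deal with.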

kover posted on 2022-12-15 14:51

know1234 posted on 2022-12-15 14:28
What are you trying to do?

Nothing in particular, I just want to take a look.

ct268gh posted on 2022-12-15 16:32

a.replace(/<[^>]+>/g,"").replace(/。*(\d+、|【|——&)/g,"。\r\n$1")
Would something like this work? Also, how do you get this link automatically?

kover posted on 2022-12-15 16:35

ct268gh posted on 2022-12-15 16:32
a.replace(/<[^>]+>/g,"").replace(/。*(\d+、|【|——&)/g,"。\r\n$1")

You can find the link through Sogou WeChat search.
Is your method just a direct replacement?

ct268gh posted on 2022-12-15 16:38

I can get the result directly with JS in the web page; Python should be able to do it too.
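
A rough Python port of the JS one-liner above, as a sketch only: `re.sub` stands in for `String.replace` with the `g` flag, and `\1` replaces `$1`. The function name `strip_and_break` and the `sample` markup are made up for illustration, and `re.ASCII` is added on the assumption that `\d` should match only ASCII digits, as it does in JS.

```python
import re


def strip_and_break(a: str) -> str:
    """Remove all tags, then start a new line before each item marker."""
    # Same as a.replace(/<[^>]+>/g, ""): delete anything that looks like a tag.
    no_tags = re.sub(r"<[^>]+>", "", a)
    # Same as .replace(/。*(\d+、|【|——&)/g, "。\r\n$1").
    # In a re.sub replacement template, \r and \n become CR and LF, \1 is group 1.
    return re.sub(r"。*(\d+、|【|——&)", r"。\r\n\1", no_tags, flags=re.ASCII)


# Hypothetical, shortened markup in the style of the article.
sample = (
    "<p><span>1、</span><span>alpha。</span></p>"
    "<p><span>2、</span><span>beta。</span></p>"
)
print(strip_and_break(sample))
```

As with the JS version, the first item also gets a `。` and a line break in front of it, so the result may need a final trim.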

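kover posted on 2022-12-15 16:40

ct268gh posted on 2022-12-15 16:38
I can get the result directly with JS in the web page; Python should be able to do it too.

I'll test it in Python when I get back.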


吾爱破解 - 52pojie.cn's Archiver
