Learning Python Web Scraping: Pushing a Daily Morning News Digest to WeChat

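只有午安 posted on 2021-10-16 14:27

Last edited by 只有午安 on 2021-10-19 17:04

I made a fix: one news item was not being sent, which I missed at first. The fix was to the regular expression inside def caijing.
Without further ado: edit the data dict in wx_push, deploy the code to a cloud function yourself, and you get an automatic daily push. Keep improving yourself!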
Libraries used and their documentation
from bs4 import BeautifulSoup
import requests
import re
Beautiful Soup documentation
Requests: Quickstart
re module official documentation

1. Register a WeCom (Enterprise WeChat) account
2. Write the code
3. Deploy the cloud function
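def askURL(url):
This function sets the request headers and returns the body of the page response. The body is decoded as UTF-8 to avoid garbled characters.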
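def getLink():
The news has to be fetched and pushed automatically, and the newest post is always the first one on the listing page, so we take the link of the latest morning report.
The function takes no parameters because the URL of the listing page is fixed; we simply hard-code it inside the function.
Morning report link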
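The getLink code calls re.findall(link, link1), but the post never shows how the pattern link is defined. As a rough sketch only, assuming the anchor tag looks like the made-up sample below, one plausible pattern captures the value of the href attribute. LINK_PATTERN, extract_href and the sample HTML are hypothetical names and data for illustration, not the author's actual pattern.

```python
import re

# Hypothetical pattern (not from the post): capture the href attribute value.
LINK_PATTERN = re.compile(r'href="(.*?)"')


def extract_href(anchor_html):
    """Return the first href value found in an anchor tag string, or '' if none."""
    found = LINK_PATTERN.findall(anchor_html)
    return found[0] if found else ""


# Made-up anchor, shaped like what soup.find("a", class_="media-content") might return.
sample = '<a class="media-content" href="https://example.com/archives/123.html" target="_blank"></a>'
print(extract_href(sample))
```

Returning an empty string when nothing matches keeps the caller from crashing on an unexpected page layout; whether that is the right behaviour for a scheduled push is a design choice.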
def getNews(news_url):
Parameter: news_url is the link returned by getLink.
The latest link is passed to this parsing function, which fetches the page behind it and then extracts the part we need.
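The clean-up step that getNews applies to the extracted HTML can be tried in isolation. This is a minimal sketch on a made-up fragment: strip_br is a hypothetical helper name, and the pattern is the one used in the code, written as a raw string.

```python
import re

# Same pattern as in getNews, as a raw string so that \s is not treated as a string escape.
BR_TAG = re.compile(r"<br(\s+)?/>(\s+)?")


def strip_br(fragment):
    """Remove <br/> tags, and any whitespace right after them, from a fragment."""
    return BR_TAG.sub("", fragment)


# Made-up fragment for illustration.
sample = "【国内头条】<br/>\n1、first item。<br />\n2、second item。"
print(strip_br(sample))
```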
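def text(content):
Parameter: content is the text returned by getNews. (In the code, text stands for the per-section functions luoqi, guonei, haiwai, tiyu and caijing.)
The text is processed with regular expressions from the re library, for example to extract and split it.
Because the length of a single push is limited, the content has to be split and sent in parts. This also makes it easier to maintain: define a function only for the sections you want. See the code for details.
Daily morning report format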
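The splitting idea can also be sketched with a single pattern, assuming every section starts with a 【…】 header as in the morning report. split_sections and the sample text are hypothetical and not part of the original code, which uses one function per section.

```python
import re

# A section runs from one 【...】 header up to (not including) the next header or the end.
SECTION = re.compile(r"【([^】]+)】(.*?)(?=【|$)", re.S)


def split_sections(text):
    """Map each section title to its body text."""
    return {title: body.strip() for title, body in SECTION.findall(text)}


# Made-up sample in the same shape as the report.
sample = "【洛七早报】1、a。2、b。【国内头条】1、c。【财经新闻】1、d。"
sections = split_sections(sample)
for title, body in sections.items():
    print(title, "->", body)
```

With a dictionary like this, choosing which sections to push becomes a matter of listing their titles.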
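def wx_push(newsdata):
Parameter: newsdata is the text produced by the text functions.
This is the push function. Here I use a WeCom API that someone else has re-wrapped; underneath it calls the official WeCom API, so you need to register a WeCom organization. The organization does not need to be verified. The steps are in the link below.
Server酱 alternative: WeCom instant messaging API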
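def main(arg1,arg2):
The entry function. The cloud function platform calls it with two arguments, so arg1 and arg2 must be declared even though our code never uses them.
It calls each of the def text(content) functions.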
Code
def askURL(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/92.0.4515.107 Safari/537.36"
    }
    # Decode the raw bytes as UTF-8 to avoid garbled text
    r = requests.get(url, headers=head).content.decode("utf-8")
    # print(r)
    return r

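# NOTE: the compiled pattern `link` used below is not defined anywhere in the post;
# it has to be supplied (with re.compile) before this function can run.
def getLink():  # get the link of the latest post
    datalist = []
    html = askURL("https://www.pmtown.com/archives/category/%e6%97%a9%e6%8a%a5")
    soup = BeautifulSoup(html, "html.parser")
    link1 = soup.find("a", class_="media-content", target="_blank")
    link1 = str(link1)
    link2 = re.findall(link, link1)
    datalist.append(link2)
    # Turn the list into text. The right-hand side is missing in the archived post;
    # taking the first element is an assumed reconstruction.
    link3 = datalist[0]
    link4 = ''.join(link3)
    # print(link4)
    return link4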

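# ---------------------------------------------Scrape the morning report content-------------------------------------------------#
# NOTE: the compiled pattern `news` used below is not defined anywhere in the post;
# it has to be supplied (with re.compile) before this function can run.
def getNews(news_url):
    news1 = []
    html1 = askURL(news_url)
    soup2 = BeautifulSoup(html1, "html.parser")
    contents = soup2.find("div", class_="post-content")
    contents = str(contents)
    # re.sub needs a string, not a list. The [0] is an assumption: the archived post
    # appears to have dropped bracketed text.
    report = re.findall(news, contents)[0]
    report = re.sub(r"<br(\s+)?/>(\s+)?", "", report)  # remove <br> tags
    report = re.sub("/", "", report)
    report = re.sub(r'【融资收购.*?【', '【', report).replace("【泡面头条】", "【洛七早报】")
    report = str(report)
    news1.append(report.strip())
    return news1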

# -----------------------------------------------text sections-------------------------------------------------------
# Take the content, split it first, then send it
# 洛七早报 (Luoqi morning report)  【洛.*(?=【)
# 国内头条 (domestic headlines)    【国.*(?=【)
# 海外头条 (overseas headlines)    【海.*(?=【)
# 体育竞技 (sports)                【体.*(?=【)
# 财经新闻 (finance news)          【财.*(?=【)
def luoqi(content):
    luoqi = re.compile(r'【洛.*?(?=【)')
    luoqi1 = re.findall(luoqi, str(content))
    # Right-hand side missing in the archived post; using the match list as-is is an assumption.
    luoqi2 = luoqi1
    luoqi3 = ''.join(luoqi2).replace("。", "")
    for n in range(1, 11):  # start each numbered item (1、 to 10、) on its own line
        luoqi3 = luoqi3.replace(f"{n}、", f"\n{n}、")
    # print(luoqi3)
    return luoqi3

def guonei(content):
    guonei = re.compile(r'【国.*?(?=【)')
    guonei1 = re.findall(guonei, str(content))
    # Right-hand side missing in the archived post; using the match list as-is is an assumption.
    guonei2 = guonei1
    guonei3 = ''.join(guonei2).replace("。", "")
    for n in range(1, 11):  # start each numbered item (1、 to 10、) on its own line
        guonei3 = guonei3.replace(f"{n}、", f"\n{n}、")
    # print(guonei3)
    return guonei3

def haiwai(content):
    haiwai = re.compile(r'【海.*?(?=【)')
    haiwai1 = re.findall(haiwai, str(content))
    # Right-hand side missing in the archived post; using the match list as-is is an assumption.
    haiwai2 = haiwai1
    haiwai3 = ''.join(haiwai2).replace("。", "")
    for n in range(1, 11):  # start each numbered item (1、 to 10、) on its own line
        haiwai3 = haiwai3.replace(f"{n}、", f"\n{n}、")
    # print(haiwai3)
    return haiwai3

def tiyu(content):
    tiyu = re.compile(r'【体.*?(?=【)')
    tiyu1 = re.findall(tiyu, str(content))
    # Right-hand side missing in the archived post; using the match list as-is is an assumption.
    tiyu2 = tiyu1
    tiyu3 = ''.join(tiyu2).replace("。", "")
    for n in range(1, 11):  # start each numbered item (1、 to 10、) on its own line
        tiyu3 = tiyu3.replace(f"{n}、", f"\n{n}、")
    # print(tiyu3)
    return tiyu3

def caijing(content):
    # Last section: match from its header to the end of the text
    caijing = re.compile(r'【财经新闻】.*')
    caijing1 = re.findall(caijing, str(content))
    # Right-hand side missing in the archived post; using the match list as-is is an assumption.
    caijing2 = caijing1
    caijing3 = ''.join(caijing2).replace("。", "")
    for n in range(1, 11):  # start each numbered item (1、 to 10、) on its own line
        caijing3 = caijing3.replace(f"{n}、", f"\n{n}、")
    caijing3 = caijing3.replace("']", "")  # drop the tail left over from str() of the list
    # print(caijing3)
    return caijing3

# ---------------------------------------------- Push ------------------------------------------------ #
def wx_push(newsdata):  # WeCom push (this is the one in use)
    data = {
        "corpid": "",  # organization ID
        "corpsecret": "",  # the application's secret
        "agentid": "",  # application ID
        "text": newsdata  # content to push, HTML supported
    }
    wxtalk = 'https://api.htm.fun/api/Wechat/text/'
    response = requests.get(wxtalk, data=data)
    return response

def serverPush(data):  # Server酱 push (not used)
    data1 = {
        "title": "推送",
        "desp": data
    }
    wx_tui = "https://sctapi.ftqq.com/<your key>.send"  # fill in your own key
    response = requests.post(wx_tui, data=data1)
    return response

def qmsgPush(data):  # Qmsg酱 push (not used)
    data2 = {
        "msg": data
    }
    qmsg = "https://qmsg.zendee.cn/send/<your key>"  # fill in your own key
    response = requests.post(qmsg, data=data2)
    return response

def main(arg1, arg2):  # entry function
    link = getLink()
    news = getNews(link)
    # WeChat push
    wx_push(luoqi(news))
    wx_push(guonei(news))
    wx_push(haiwai(news))
    wx_push(tiyu(news))
    wx_push(caijing(news))

Deploying the cloud function
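Tencent Cloud's cloud function runtime does not include the Beautiful Soup library, so the dependencies have to be packaged from the PyCharm project.
The steps are in the link below; method two there is faster. The two packages to bundle are beautifulsoup4-4.10.0.dist-info and bs4.
Put these two packages together with your .py file and add them to a zip. My file is called News.py, so when deploying I enter News.main as the handler. Then click Deploy, then Test, and the result shows up. Cloud functions come with a free monthly quota, which is enough for a daily push.
Packaging dependencies for a Python cloud function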
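The zip can also be built with a short script instead of by hand. The following is only a sketch using the standard zipfile module: build_package is a hypothetical helper, and it assumes the .py file and the package folders sit side by side in one directory.

```python
import zipfile
from pathlib import Path


def build_package(src_dir, names, out_zip):
    """Zip the given files and folders (relative to src_dir) at the archive root."""
    src = Path(src_dir)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            path = src / name
            if path.is_dir():
                # Add every file below the folder, keeping the folder structure.
                for file in sorted(path.rglob("*")):
                    if file.is_file():
                        zf.write(file, file.relative_to(src).as_posix())
            elif path.is_file():
                zf.write(path, name)
            else:
                raise FileNotFoundError(path)
    return out_zip


# Example call (assumes these entries exist in the current directory):
# build_package(".", ["News.py", "bs4", "beautifulsoup4-4.10.0.dist-info"], "News.zip")
```

Keeping everything at the archive root matches the layout described above, where the handler is entered as News.main.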

Result screenshot:
https://s.pc.qq.com/tousu/img/20211016/3754962_1634365513.jpg

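Update 2021-10-19: testing after uploading to the cloud function produced a warning (zip the soupsieve package as well and the warning goes away). I also adjusted the push order to the order I wanted.
It would be great if some expert could help tidy up the code; it looks a bit messy to me (just saying).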
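One possible way to tidy the five near-identical section functions is a single helper plus a regex for the numbering. This is a sketch, not the author's code: extract_section and format_items are hypothetical names, the sample text is made up, and the input is assumed to be a plain string. Unlike the chained .replace calls, which would turn 11、 into 1 followed by 1、 on a new line, the lookbehind keeps multi-digit numbers intact.

```python
import re


def extract_section(content, header_start, last=False):
    """Return the section whose 【...】 header begins with header_start, or ''."""
    # Every section but the last one ends right before the next 【 header.
    tail = r".*" if last else r".*?(?=【)"
    match = re.search("【" + re.escape(header_start) + tail, content, re.S)
    return match.group(0) if match else ""


def format_items(section):
    """Drop full stops and start every numbered item (1、, 2、, 11、 ...) on a new line."""
    section = section.replace("。", "")
    return re.sub(r"(?<!\d)(\d+、)", r"\n\1", section)


# Made-up sample for illustration.
sample = "【体育竞技】1、a。2、b。11、c。【财经新闻】1、d。"
print(format_items(extract_section(sample, "体")))
print(format_items(extract_section(sample, "财经新闻", last=True)))
```

Each push would then be a call such as wx_push(format_items(extract_section(text, "国"))), so adding or removing a section no longer needs a new function.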

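只有午安 posted on 2021-10-18 01:40

Last edited by 只有午安 on 2021-10-19 11:55

The packaged zip can be uploaded to the cloud function as-is.
To run it in PyCharm instead, remove arg1,arg2 from def main(arg1,arg2): so that it becomes def main():
Then add this at the very end of the code:
if __name__ == "__main__":
    main()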
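A single entry point can serve both cases if the two parameters get default values. This is a sketch under that assumption; push_all is a hypothetical stand-in for the real work (getLink, getNews and the wx_push calls).

```python
def push_all():
    """Stand-in for the real work: fetch the link, parse the news, push each section."""
    return "pushed"


def main(arg1=None, arg2=None):
    # The cloud function platform passes two positional arguments;
    # a local run passes none. Neither value is used here.
    return push_all()


if __name__ == "__main__":
    print(main())
```

With defaults in place, the same file works when uploaded and when run from PyCharm, without editing the signature each time.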

david1989 posted on 2021-10-19 13:46

只有午安 posted on 2021-10-19 11:50
...My mistake. Add the two parameters arg1,arg2 to def main, i.e. def main(arg1,arg2)

PS:
You can also try pushing through the official API. Tested OK! For reference:

WeChat push, based on the official API as adapted by other experts:
wxid = '***'
wxsecret = '***'

def wx_push(newsdata):
    # Get an access token for the WeCom application
    wx_push_token = \
        requests.post(url='https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s' % (wxid, wxsecret),
                      data="").json()['access_token']
    wx_push_data = {
        "agentid": 1000002,
        "msgtype": "text",
        "touser": "@all",
        "text": {
            "content": newsdata
        },
        "safe": 0
    }
    # Send the text message to all users of the application
    requests.post('https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token=%s' % wx_push_token,
                  json=wx_push_data)
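Since the length of a single push is limited, a long text can be cut into pieces before calling wx_push. The sketch below assumes a limit of 2048 bytes of UTF-8 per text message; that figure should be checked against the current WeCom documentation. chunk_by_bytes is a hypothetical helper, not part of the posted code.

```python
def chunk_by_bytes(text, max_bytes=2048):
    """Split text into pieces whose UTF-8 encoding fits in max_bytes, breaking at line ends."""
    if max_bytes < 4:
        raise ValueError("max_bytes must hold at least one UTF-8 character (4 bytes)")
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if len((current + line).encode("utf-8")) <= max_bytes:
            current += line
            continue
        if current:
            chunks.append(current)
            current = ""
        # A single over-long line is cut on a character boundary.
        while len(line.encode("utf-8")) > max_bytes:
            head = line.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
            chunks.append(head)
            line = line[len(head):]
        current = line
    if current:
        chunks.append(current)
    return chunks


# Each Chinese character takes 3 bytes in UTF-8, so each line here is 10 bytes.
print(chunk_by_bytes("一二三\n四五六\n", max_bytes=12))
```

Each returned piece could then be passed to wx_push in turn.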

david1989 posted on 2021-10-19 11:33

After packaging the OP's NEWS.ZIP and uploading it successfully, the test shows the following:
START RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b

ERROR RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b Result:{"errorCode":1,"errorMessage":"user code exception caught","stackTrace":"Traceback (most recent call last):\nTypeError: main() takes 0 positional arguments but 2 were given","statusCode":430}

END RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b

Report RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b Duration:1ms Memory:128MB MemUsage:19.769531MB

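xinyangtuina posted on 2021-10-16 14:39

Nice, learned something.

Owliver posted on 2021-10-16 15:06

Learned something, but I get an error and can't tell where the problem is.

jiang12 posted on 2021-10-16 15:07

Putting what you learn to use. Nice.

gusong125 posted on 2021-10-16 15:21

Learned something. Thanks for sharing, OP!

好学 posted on 2021-10-16 15:30

Very nice. This should work with a personal WeChat account too, right?

只有午安 posted on 2021-10-16 15:37

Owliver posted on 2021-10-16 15:06
Learned something, but I get an error and can't tell where the problem is.

Show me the error you are getting.

只有午安 posted on 2021-10-16 15:38

好学 posted on 2021-10-16 15:30
Very nice. This should work with a personal WeChat account too, right?

This pushes to your own regular WeChat through a WeCom application. I haven't seen an API for regular WeChat so far.

kk120305004 posted on 2021-10-16 15:43

Thanks for sharing.

wudiww718 posted on 2021-10-16 15:53

Thanks for sharing, OP.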


吾爱破解 - 52pojie.cn's Archiver
