Learning Python Web Scraping: Pushing a Daily Morning News Briefing to WeChat

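只有午安 · posted 2021-10-16 14:27

Last edited by 只有午安 on 2021-10-19 17:04

I made a fix: one news section was not being sent, which I missed at first. The fix is in the regular expression inside def caijing.
Without further ado: fill in the data dict in wx_push with your own values, deploy the script to a cloud function, and it will push the news automatically every day. Keep at it and keep improving!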
Libraries used and their documentation
from bs4 import BeautifulSoup
import requests
import re
Beautiful Soup documentation
Requests quickstart
re module official documentation

1. Register a WeCom (企业微信) account
2. Write the code
3. Deploy the cloud function
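def askURL(url):
This function sets the request headers and returns the body of the page response. The body is decoded as UTF-8 so that Chinese text does not come out garbled.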
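To see why the explicit UTF-8 decode matters, here is a minimal sketch that needs no network access. The bytes in `raw` are a made-up stand-in for what `requests.get(url).content` would return, and `decode_page` is a hypothetical helper name, not part of the original script.

```python
# Made-up bytes standing in for requests.get(url).content (assumption: the page is UTF-8).
raw = "【洛七早报】1、示例新闻".encode("utf-8")


def decode_page(body):
    """Decode a response body as UTF-8, the same way askURL does."""
    return body.decode("utf-8")


good = decode_page(raw)
# Decoding with the wrong codec does not fail, it silently produces garbage:
# latin-1 turns every single byte into its own character.
garbled = raw.decode("latin-1")

print(good)
print(good == garbled)
```

Decoding `.content` yourself avoids depending on whatever encoding the server declares in its headers.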
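def getLink():
We want to fetch and push the news automatically, and the newest post on that page is always the first one, so we grab the link to the latest morning briefing.
It takes no parameters because the URL of the page that lists the latest briefing is fixed, so we simply hard-code it inside the function.
Morning briefing link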
def getNews(news_url):
Parameter: news_url is the link returned by getLink.
The latest link is passed to this parsing function, which fetches the page behind it and then extracts the part we need.
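def text(content):
Parameter: content is the content returned by getNews.
The content is processed with regular expressions from the re library, e.g. extracting and splitting.
Because the push length is limited, we have to split the content and send the parts separately. This is also easier to maintain (define a function for whichever sections you want). See the code for details.
Daily briefing format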
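The splitting idea can be tried out on its own with a made-up sample. This is only a sketch: `SAMPLE`, `split_sections` and `one_item_per_line` are hypothetical names, and instead of one lookahead pattern per section it uses a single pattern that cuts the text at every 【 header.

```python
import re

# Made-up text in the briefing's format; the real content comes from getNews.
SAMPLE = "【洛七早报】1、甲。2、乙。【国内头条】1、丙。2、丁。【财经新闻】1、戊。"


def split_sections(report):
    """Return one string per section, each starting with its own 【...】 header."""
    # "【" followed by everything up to (not including) the next "【".
    return re.findall(r"【[^【]*", report)


def one_item_per_line(section):
    """Drop the full stops and start every numbered item on a new line."""
    # (?<!\d) keeps multi-digit numbers such as "10、" together.
    return re.sub(r"(?<!\d)(\d+、)", r"\n\1", section.replace("。", ""))


sections = [one_item_per_line(s) for s in split_sections(SAMPLE)]
for section in sections:
    print(section)
    print("---")
```

Each element of `sections` can then be pushed as a separate message, which keeps every message short.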
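def wx_push(newsdata):
Parameter: newsdata is the content processed by text.
This is the push function. Here I use a WeCom (企业微信) API that someone else has re-wrapped; it calls the official WeCom API underneath, so you need to register a WeCom organization. The organization does not need to be verified. The steps are in the link below.
Server酱 alternative: WeCom instant messaging API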
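def main(arg1,arg2):
The entry function. Deploying to a cloud function requires the parameters arg1 and arg2, even though our code never uses them.
It calls each of the section functions that follow the def text(content) pattern (luoqi, guonei, haiwai, tiyu, caijing).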
Code

[Python]

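from bs4 import BeautifulSoup
import requests
import re


def askURL(url):
    # Send a browser-like User-Agent and return the page body decoded as UTF-8
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/92.0.4515.107 Safari/537.36"
    }
    r = requests.get(url, headers=head).content.decode("utf-8")
    # print(r)
    return r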

def getLink():  # get the link to the latest briefing
    html = askURL("https://www.pmtown.com/archives/category/%e6%97%a9%e6%8a%a5")
    soup = BeautifulSoup(html, "html.parser")
    # The newest post is the first matching <a> on the category page
    link1 = soup.find("a", class_="media-content", target="_blank")
    # The posted code ran re.findall(link, str(link1)), but the pattern `link` is never defined.
    # Assumption: reading the href attribute of that <a> gives the link the pattern extracted.
    link4 = link1.get("href", "") if link1 else ""
    # print(link4)
    return link4

# ---------------------------------------------fetch the briefing content-------------------------------------------------#
def getNews(news_url):
    news1 = []
    html1 = askURL(news_url)
    soup2 = BeautifulSoup(html1, "html.parser")
    contents = soup2.find("div", class_="post-content")
    # The posted code ran re.findall(news, str(contents))[0], but the pattern `news` is never defined.
    # Assumption: the plain text of the whole div is what that pattern extracted.
    report = contents.get_text()
    report = re.sub(r"\s*\n\s*", "", report)  # join everything into one line (the original stripped <br> tags)
    # Drop the 融资收购 (funding and acquisitions) section and rename the header
    report = re.sub(r'【融资收购.*?【', '【', report).replace("【泡面头条】", "【洛七早报】")
    news1.append(report.strip())
    return news1

# -----------------------------------------------text sections-------------------------------------------------------
# Take the content, split it into sections first, then send each one
# 洛七早报 (Luoqi briefing)      【洛.*?(?=【)
# 国内头条 (domestic headlines)  【国.*?(?=【)
# 海外头条 (overseas headlines)  【海.*?(?=【)
# 体育竞技 (sports)              【体.*?(?=【)
# 财经新闻 (finance)             【财经新闻】.*   (last section, so it runs to the end)
def format_section(pattern, content):
    # Join all matches, drop the full stops, and start each numbered item
    # ("1、", "2、", ..., "10、") on its own line. (?<!\d) keeps "10、" in one piece.
    matches = re.findall(pattern, str(content))
    section = ''.join(matches).replace("。", "")
    return re.sub(r'(?<!\d)(\d+、)', r'\n\1', section)


def luoqi(content):
    return format_section(r'【洛.*?(?=【)', content)


def guonei(content):
    return format_section(r'【国.*?(?=【)', content)


def haiwai(content):
    return format_section(r'【海.*?(?=【)', content)


def tiyu(content):
    return format_section(r'【体.*?(?=【)', content)


def caijing(content):
    # content is the list from getNews, so str(content) ends with "']"; strip that off
    return format_section(r'【财经新闻】.*', content).replace("']", "")


# ---------------------------------------------- push ------------------------------------------------ #
def wx_push(newsdata):  # WeCom (企业微信) push; this is the one in use
    data = {
        "corpid": "",  # organization (corp) ID
        "corpsecret": "",  # the app's credential secret
        "agentid": "",  # app (agent) ID
        "text": newsdata  # content to push, HTML supported
    }
    wxtalk = 'https://api.htm.fun/api/Wechat/text/'
    response = requests.get(wxtalk, data=data)
    return response


def serverPush(data):  # Server酱 push, not used
    data1 = {
        "title": "推送",
        "desp": data
    }
    wx_tui = "https://sctapi.ftqq.com/YOUR_KEY.send"  # put your own key in place of YOUR_KEY
    response = requests.post(wx_tui, data=data1)
    return response


def qmsgPush(data):  # Qmsg酱 push, not used
    data2 = {
        "msg": data
    }
    qmsg = "https://qmsg.zendee.cn/send/YOUR_KEY"  # put your own key in place of YOUR_KEY
    response = requests.post(qmsg, data=data2)
    return response


def main(arg1,arg2): # entry function
    link = getLink()
    news = getNews(link)
    # WeChat push, one message per section
    wx_push(luoqi(news))
    wx_push(guonei(news))
    wx_push(haiwai(news))
    wx_push(tiyu(news))
    wx_push(caijing(news))

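Deploying the cloud function
Tencent Cloud (tx) cloud functions do not ship with the Beautiful Soup library, so we need to package the dependencies from PyCharm.
The steps are in the guide linked below; its method two is faster. What we need to package are these two: beautifulsoup4-4.10.0.dist-info and bs4.
Put these two together with your .py file and add them to a zip. My .py file is named News.py, so at deployment the execution method is set to News.main. Finally, click Deploy, then click Test, and you should see the result. Cloud functions come with a free monthly quota, which is plenty for a daily push.
Packaging dependencies for a Python cloud function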
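If you would rather build the zip with a script than by hand, something along these lines should work. It is only a sketch under assumptions: `build_package` is a hypothetical helper, and it expects News.py and the package folders to already sit side by side in one folder (for example after copying them out of site-packages).

```python
import os
import zipfile


def build_package(src_dir, out_zip, names):
    """Zip the listed files and folders (paths relative to src_dir) at the archive root."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                zf.write(path, arcname=name)
                continue
            # A folder: add every file below it, keeping the folder structure.
            # A name that does not exist is skipped silently.
            for root, _dirs, files in os.walk(path):
                for filename in files:
                    full = os.path.join(root, filename)
                    zf.write(full, arcname=os.path.relpath(full, src_dir))
    return out_zip


# Example call (folder and file names are assumptions):
# build_package("build", "new.zip",
#               ["News.py", "bs4", "beautifulsoup4-4.10.0.dist-info", "soupsieve"])
```

Keeping News.py at the root of the archive is what lets the execution method be written as News.main.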

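Screenshot of the result:

Updated 2021.10.19: uploading to the cloud function and running the test produced a warning (zip the soupsieve package in as well and the warning goes away). I also adjusted the push order (to the order I want).
It would be great if an expert could help tidy up the code; it looks a bit messy to me (just whispering).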

new.zip (249.54 KB, downloads: 95)

只有午安 · posted 2021-10-18 01:40

Last edited by 只有午安 on 2021-10-19 11:55

The packaged zip can be uploaded to the cloud function as is.
If you want to run it in PyCharm instead, take

[Python]

def main(arg1,arg2):

remove arg1,arg2 so that it becomes

[Python]

def main():

and then add this at the very end of the code

[Python]

if __name__ == "__main__":
    main()
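One way to avoid maintaining two versions is to give both parameters a default value, so the same main works when the cloud function passes two arguments and when you call it with none locally. A minimal sketch; the body here is a placeholder, not the real push logic:

```python
def main(arg1=None, arg2=None):
    """Entry point: the cloud function passes two arguments, a local run passes none."""
    # Placeholder body (assumption). The real main calls getLink, getNews and wx_push.
    return "called with %r, %r" % (arg1, arg2)


if __name__ == "__main__":
    print(main())
```

With this signature the TypeError "main() takes 0 positional arguments but 2 were given" cannot occur, because both call styles are accepted.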

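david1989 · posted 2021-10-19 13:46

只有午安 posted 2021-10-19 11:50
...My mistake. Add the two parameters arg1,arg2 to def main, i.e. def main(arg1,arg2)

PS:
You could also try pushing through the official API. I have tested it and it works. For reference:

The WeChat push below is based on another user's improved version of the official API call:

[Python]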

wxid = '***'
wxsecret = '***'

[Python]

def wx_push(newsdata):
    wx_push_token = \
    requests.post(url='https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=%s&corpsecret=%s' % (wxid, wxsecret),
                data="").json()['access_token']
    wx_push_data = {
        "agentid": 1000002,
        "msgtype": "text",
        "touser": "@all",
        "text": {
            "content": newsdata
        },
        "safe": 0
    }
    requests.post('https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token=%s' % wx_push_token,
                json=wx_push_data)
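Two things may be worth adding around this call: splitting long text before sending, and checking the reply. The sketch below rests on assumptions: the 2048-byte limit for text content and the errcode/errmsg reply fields are my reading of the official documentation and should be checked against its current version, and `split_by_bytes` and `is_ok` are hypothetical helper names.

```python
def split_by_bytes(text, limit=2048):
    """Split text into pieces whose UTF-8 encoding is at most `limit` bytes each.

    Assumption: `limit` is at least as large as the biggest single character
    (4 bytes); a smaller limit would yield pieces that exceed it.
    """
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if current and size + n > limit:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


def is_ok(result):
    """True when the decoded JSON reply reports errcode 0 (assumed success value)."""
    return result.get("errcode") == 0


# Possible use inside wx_push (not executed here):
# for part in split_by_bytes(newsdata):
#     wx_push_data["text"]["content"] = part
#     reply = requests.post(send_url, json=wx_push_data).json()
#     if not is_ok(reply):
#         print(reply)
```

Counting bytes rather than characters matters here because each Chinese character takes three bytes in UTF-8.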

david1989 · posted 2021-10-19 11:33

After packaging the OP's NEWS.ZIP and uploading it successfully, the test reports the following:
START RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b

ERROR RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b Result:{"errorCode":1,"errorMessage":"user code exception caught","stackTrace":"Traceback (most recent call last):\nTypeError: main() takes 0 positional arguments but 2 were given","statusCode":430}

END RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b

Report RequestId:ea7b89a4-25d4-4511-b620-ed0abf2a512b Duration:1ms Memory:128MB MemUsage:19.769531MB

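xinyangtuina · posted 2021-10-16 14:39

Nice, learned something.

Owliver · posted 2021-10-16 15:06

Learned something, but I get an error and don't know what's wrong.

jiang12 · posted 2021-10-16 15:07

Putting what you learn to use. Nice.

gusong125 · posted 2021-10-16 15:21

Learned something. Thanks for sharing, OP!

好学 · posted 2021-10-16 15:30

Very nice. This should work with a personal WeChat account too, right?

只有午安 · posted 2021-10-16 15:37

Owliver posted 2021-10-16 15:06
Learned something, but I get an error and don't know what's wrong.

Show me the error you're getting.

只有午安 · posted 2021-10-16 15:38

好学 posted 2021-10-16 15:30
Very nice. This should work with a personal WeChat account too, right?

This pushes to your own regular WeChat through a WeCom (企业微信) app. I haven't seen an API for regular WeChat so far.

kk120305004 · posted 2021-10-16 15:43

Thanks for sharing

wudiww718 · posted 2021-10-16 15:53

Thanks for sharing, OP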
