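[Python] Automatically scrape the daily weather, trending searches, and a daily sentence, then send them to a group chat through a WeCom (企业微信) bot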

ARtcgb posted on 2021-4-22 10:32

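**Tools required: Python 3 + a WeCom (企业微信) group bot**
**Third-party libraries: requests + BeautifulSoup**
**Demo environment: macOS 11.2.3 + PyCharm 2021**

**Goal:** scrape the weather forecast from [中国天气网 (weather.com.cn)](http://www.weather.com.cn/forecast/); scrape [Baidu trending searches](http://top.baidu.com/buzz?b=1&c=513&fr=topbuzz_b42_c513); fetch the [iCIBA (金山词霸) daily sentence](http://iciba.com); push everything to a group chat through a WeCom bot.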
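# Tutorial & source code
## Scraping the weather forecast from [中国天气网](http://www.weather.com.cn/forecast/)
Open the detailed forecast page of any city; [Beijing](http://www.weather.com.cn/weather1d/101010100.shtml) is used as the example here.
As usual, open the developer tools with F12. A page that updates in real time like this one usually loads its data as JSON over Ajax, so open the Network tab and refresh the page to check.

Sure enough, one of the requested files contains the information we need. Right-click it and open the link: the content shows up as garbled text. That points to an encoding problem, so let's start writing the program.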
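```Python3
import requests

url = "https://d1.weather.com.cn/sk_2d/101010100.html?_=1618886817920"
requests_url = requests.get(url)
```
The `url` here is the URL of the file found above; copy it by right-clicking the request in the Network tab.
You can also pass `requests.get` a request header with a fake User-Agent to reduce the chance of being blocked by anti-scraping measures.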
```Python3
message = json.loads(requests_url.text.encode("latin1").decode("utf8").replace("var dataSK = ", ""))
```
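**Extracting the data**
Here I first encode the text as latin1 and then decode it as UTF-8, which turns out to yield readable text. The `replace` method of `str` then strips the JavaScript part `var dataSK = `, and `json.loads` converts what is left into a Python dict.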
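The round trip presumably works because requests decoded the response bytes as Latin-1 (its fallback when a text response declares no charset), so re-encoding as latin1 recovers the original bytes, which are really UTF-8. Below is a minimal offline sketch of this step. The helper name `parse_data_sk` and the sample payload are made up for illustration, and the real response may contain more fields.

```python
import json


def parse_data_sk(mojibake_text):
    """Undo the Latin-1 mis-decoding, strip the JS assignment, parse the JSON."""
    # Re-encoding as latin1 restores the raw bytes; they are really UTF-8.
    fixed = mojibake_text.encode("latin1").decode("utf-8")
    # Keep only what follows the first "=" in "var dataSK = {...}".
    _, _, payload = fixed.partition("=")
    return json.loads(payload.strip().rstrip(";"))


# Simulate a mis-decoded response (sample data, not a real reply).
raw_bytes = 'var dataSK = {"cityname": "北京", "temp": "18"}'.encode("utf-8")
garbled = raw_bytes.decode("latin1")

print(parse_data_sk(garbled)["cityname"])  # 北京
```

With requests, setting `requests_url.encoding = "utf-8"` before reading `requests_url.text` should give the same readable text without the round trip.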

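From here it is simple: read the values from the dict by key.
```Python3
cityname = message['cityname']  # city name
aqi = int(message['aqi'])       # air quality index
sd = message['sd']              # relative humidity
wd = message['WD']              # wind direction
ws = message['WS']              # wind force
temp = message['temp']          # temperature in degrees Celsius
weather = message['weather']    # weather description
```
Finally, output the values in whatever format you like. To make the rest of the program easier to work with, I wrapped this code in a function.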
```Python3
def get_weather():
    url = "https://d1.weather.com.cn/sk_2d/101010100.html?_=1618886817920"
    requests_url = requests.get(url)
    message = json.loads(requests_url.text.encode("latin1").decode("utf8").replace("var dataSK = ", ""))
    cityname = message['cityname']
    aqi = int(message['aqi'])
    sd = message['sd']
    wd = message['WD']
    ws = message['WS']
    temp = message['temp']
    weather = message['weather']
    # Map the AQI value to its air-quality category
    if aqi <= 50:
        airQuality = "优"
    elif aqi <= 100:
        airQuality = "良"
    elif aqi <= 150:
        airQuality = "轻度污染"
    elif aqi <= 200:
        airQuality = "中度污染"
    elif aqi <= 300:
        airQuality = "重度污染"
    else:
        airQuality = "严重污染"
    return cityname + " " + '今日天气：' + weather + ' 温度：' + temp + ' 摄氏度 ' + wd + ws + ' 相对湿度：' + sd + ' 空气质量：' \
        + str(aqi) + "（" + airQuality + "）"
```
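## Scraping [Baidu trending searches](http://top.baidu.com/buzz?b=1&c=513&fr=topbuzz_b42_c513)
Without further ado, open the developer tools with F12. This time the Network tab has nothing we can use, so we have to scrape the page itself. This is where the BeautifulSoup library comes in.

```Python3
from bs4 import BeautifulSoup

requests_page = requests.get('http://top.baidu.com/buzz?b=1&c=513&fr=topbuzz_b42_c513')
soup = BeautifulSoup(requests_page.text, "lxml")
```
This parses the whole HTML document. Use F12 again to see which tag holds the content we need. Locating a headline on the page shows that it is an `a` tag with `class='list-title'`, which makes things easy.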
```Python3
soup_text = soup.find_all("a", class_='list-title')
```
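Then print them out.
```Python3
for text in soup_text:
    print(text.string)
```
The output is garbled yet again, so let's try `.encode("latin1").decode("GBK")`.
```Python3
for text in soup_text:
    print(text.string.encode("latin1").decode("GBK"))
```
And that actually works. Evidently the two sites share the same kind of encoding problem.
With a small change, the titles are collected into a list, which is easier to work with.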
```Python3
def get_top_list():
    requests_page = requests.get('http://top.baidu.com/buzz?b=1&c=513&fr=topbuzz_b42_c513')
    soup = BeautifulSoup(requests_page.text, "lxml")
    soup_text = soup.find_all("a", class_='list-title')
    top_list = []
    for text in soup_text:
        top_list.append(text.string.encode("latin1").decode("GBK"))
    return top_list
```
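One caveat: BeautifulSoup's `.string` is `None` when a tag contains more than one child node, and calling `.encode` on it would then raise an error. The sketch below pulls the decoding step into a standalone helper that skips such entries and caps the list length. `clean_titles` and its `limit` parameter are hypothetical names, and the sample input is simulated rather than taken from Baidu.

```python
def clean_titles(raw_titles, limit=10):
    """Re-decode Latin-1-garbled GBK titles, skipping empty entries."""
    titles = []
    for raw in raw_titles:
        if raw is None:  # .string gave nothing usable for this tag
            continue
        titles.append(raw.encode("latin1").decode("GBK").strip())
        if len(titles) == limit:
            break
    return titles


# Simulated input: GBK bytes that were wrongly decoded as Latin-1.
garbled = ["热搜一".encode("GBK").decode("latin1"), None,
           "热搜二".encode("GBK").decode("latin1")]
print(clean_titles(garbled))  # ['热搜一', '热搜二']
```

Inside `get_top_list`, the loop could then become `clean_titles(text.string for text in soup_text)`.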
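## Fetching the [iCIBA (金山词霸) daily sentence](http://iciba.com)
I went straight to its [daily sentence endpoint](http://open.iciba.com/dsapi/).
The code follows the same pattern as before and is simple enough to need no explanation.
```Python3
def get_daily_sentence():
    url = "http://open.iciba.com/dsapi/"
    r = requests.get(url)
    r = json.loads(r.text)
    content = r["content"]  # the sentence itself
    note = r["note"]        # the accompanying note
    daily_sentence = content + "\n" + note
    return daily_sentence
```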

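## Assembling the message
Simply call the functions written above and combine their return values into one string, ready to be sent in the next step. The code for this is shown together with the complete listing further down.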
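As a rough illustration of this step, the pieces can be joined with a blank line between sections. The function `build_message` and the sample strings below are placeholders standing in for the return values of the functions above; they are not part of the original code.

```python
def build_message(date_line, weather_line, top_list, daily_sentence):
    """Join the sections with a blank line between them."""
    sections = [
        date_line,
        weather_line,
        "\n".join(top_list),  # one trending title per line
        daily_sentence,
    ]
    return "\n\n".join(sections)


# Placeholder inputs standing in for the scraped results.
message = build_message(
    "今天日期为：2021-04-22 星期四",
    "北京 今日天气：晴",
    ["热搜一", "热搜二"],
    "Sample sentence.\n示例译文。",
)
print(message)
```

Joining a list with `"\n".join` avoids having to turn the list into a string and strip the brackets and quotes afterwards.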

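## Sending through the WeCom bot
First add the bot to the group chat. The steps are not shown here; if you are unsure how, search online or consult the [official documentation](https://work.weixin.qq.com/help?person_id=1&doc_id=13376&from=search&isTopSearch=1&helpType=).

Then get your bot's webhook URL. **(Do not share this URL: anyone who has it can make your bot post messages and flood the group with spam.)**

Sending a POST request to this URL is all it takes to make the bot post a message.
```Python3
url = ""  # put your bot's webhook URL here
headers = {"Content-Type": "application/json"}
data = {
    "msgtype": "text",
    "text": {
        "content": "",  # the text to send; plain-text mode is used here
    }
}
requests_url = requests.post(url, headers=headers, data=json.dumps(data))
```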
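Two pieces around this request can be checked without touching the network: building the request body and interpreting the reply. The sketch below shows both. The helper names are hypothetical, and the reply format `{"errcode":0,"errmsg":"ok"}` is the one the complete code in this post compares against.

```python
import json


def build_text_payload(content):
    """Return the JSON body for a plain-text bot message."""
    data = {"msgtype": "text", "text": {"content": content}}
    # json.dumps escapes non-ASCII characters, so the body is plain ASCII.
    return json.dumps(data)


def is_success(response_text):
    """True when the webhook reply reports errcode 0."""
    try:
        return json.loads(response_text).get("errcode") == 0
    except (ValueError, AttributeError):
        return False  # not JSON, or not a JSON object


print(build_text_payload("你好"))
print(is_success('{"errcode":0,"errmsg":"ok"}'))  # True
```

Parsing the reply instead of comparing raw strings keeps the check working if the server changes whitespace or key order.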
Done.

# Complete code
```Python3
import simplejson as json  # the standard-library json module also works
import requests
import datetime
import fake_useragent  # optional
from bs4 import BeautifulSoup


def get_fake_ua():  # returns headers with a random User-Agent; optional
    location = '/fake_useragent_0.1.11.json'  # the fake_useragent data file I imported; optional
    ua = fake_useragent.UserAgent(path=location)

    headers = {
        'user-agent': ua.random
    }
    return headers


def get_week_day(date):
    week_day_dict = {
        0: '星期一',
        1: '星期二',
        2: '星期三',
        3: '星期四',
        4: '星期五',
        5: '星期六',
        6: '星期天',
    }
    day = date.weekday()  # Monday is 0, Sunday is 6
    return "今天日期为：" + str(date) + ' ' + week_day_dict[day]


def get_weather():
    url = "https://d1.weather.com.cn/sk_2d/101010100.html?_=1618886817920"
    r_url = requests.get(url, headers=get_fake_ua())
    message = json.loads(r_url.text.encode("latin1").decode("utf8").replace("var dataSK = ", ""))
    cityname = message['cityname']
    aqi = int(message['aqi'])
    sd = message['sd']
    wd = message['WD']
    ws = message['WS']
    temp = message['temp']
    weather = message['weather']
    # Map the AQI value to its air-quality category
    if aqi <= 50:
        airQuality = "优"
    elif aqi <= 100:
        airQuality = "良"
    elif aqi <= 150:
        airQuality = "轻度污染"
    elif aqi <= 200:
        airQuality = "中度污染"
    elif aqi <= 300:
        airQuality = "重度污染"
    else:
        airQuality = "严重污染"
    return cityname + " " + '今日天气：' + weather + ' 温度：' + temp + ' 摄氏度 ' + wd + ws + ' 相对湿度：' + sd + ' 空气质量：' \
        + str(aqi) + "（" + airQuality + "）"


def get_top_list():
    requests_page = requests.get('http://top.baidu.com/buzz?b=1&c=513&fr=topbuzz_b42_c513')
    soup = BeautifulSoup(requests_page.text, "lxml")
    soup_text = soup.find_all("a", class_='list-title')
    top_list = []
    for text in soup_text[:10]:  # keep only the first 10 entries
        top_list.append(text.string.encode("latin1").decode("GBK"))
    return top_list


def get_daily_sentence():
    url = "http://open.iciba.com/dsapi/"
    r = requests.get(url, headers=get_fake_ua())
    r = json.loads(r.text)
    content = r["content"]
    note = r["note"]
    daily_sentence = content + "\n" + note
    return daily_sentence


def get_sendContent():
    # Sections are separated by blank lines; one trending title per line
    sendContent = get_week_day(datetime.date.today()) + "\n\n" + get_weather() + "\n\n" \
        + "\n".join(get_top_list()) + "\n\n" + get_daily_sentence()
    return sendContent


def send(content):
    url = ""  # put your webhook URL here
    headers = {"Content-Type": "application/json"}
    data = {
        "msgtype": "text",
        "text": {
            "content": content,
        }
    }
    requests_url = requests.post(url, headers=headers, data=json.dumps(data))
    # A successful reply looks like {"errcode":0,"errmsg":"ok"}
    if requests_url.json().get("errcode") == 0:
        return "发送成功"
    else:
        return "发送失败" + requests_url.text


if __name__ == "__main__":
    print(send(get_sendContent()))
```

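ARtcgb posted on 2021-4-22 10:55

longle posted on 2021-4-22 10:53
What should I do if PyCharm reports an error when downloading BeautifulSoup?

Try installing it with pip (remember to switch to a mirror inside China). If that still fails, download it from the official site and unpack it.

撑一把纸伞 posted on 2021-4-25 12:44

ARtcgb posted on 2021-4-25 06:23
Post your code and the error message and I'll take a look.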

import simplejson as json
import requests
import datetime

from bs4 import BeautifulSoup
import time

def get_week_day(date):
    week_day_dict = {
        0: '星期一',
        1: '星期二',
        2: '星期三',
        3: '星期四',
        4: '星期五',
        5: '星期六',
        6: '星期天',
    }
    day = date.weekday()
    return "今天日期为：" + str(datetime.date.today()) + ' ' + week_day_dict[day]

def get_weather():
    url = "https://d1.weather.com.cn/sk_2d/101190112.html?_=1618886817920"
    r_url = requests.get(url)
    message = json.loads(r_url.text.encode("latin1").decode("utf8").replace("var dataSK = ", ""))
    cityname = message['cityname']
    aqi = int(message['aqi'])
    sd = message['sd']
    wd = message['WD']
    ws = message['WS']
    temp = message['temp']
    weather = message['weather']
    if aqi <= 50:
        airQuality = "优"
    elif aqi <= 100:
        airQuality = "良"
    elif aqi <= 150:
        airQuality = "轻度污染"
    elif aqi <= 200:
        airQuality = "中度污染"
    elif aqi <= 300:
        airQuality = "重度污染"
    else:
        airQuality = "严重污染"
    return cityname + " " + '今日天气：' + weather + ' 温度：' + temp + ' 摄氏度 ' + wd + ws + ' 相对湿度：' + sd + ' 空气质量：' \
        + str(aqi) + "（" + airQuality + "）"

def get_top_list():
    requests_page = requests.get('http://top.baidu.com/buzz?b=1&c=513&fr=topbuzz_b42_c513')
    soup = BeautifulSoup(requests_page.text, "lxml")
    soup_text = soup.find_all("a", class_='list-title')
    i = 0
    top_list = []
    for text in soup_text:
        i += 1
        top_list.append(text.string.encode("latin1").decode("GBK"))
        if i == 10:
            break
    return top_list

def get_daily_sentence():
    url = "http://open.iciba.com/dsapi/"
    r = requests.get(url)
    r = json.loads(r.text)
    content = r["content"]
    note = r["note"]
    daily_sentence = content + "\n" + note
    return daily_sentence

def get_sendContent():
    sendContent = get_week_day(datetime.date.today()) + "\n\n" + get_weather() + "\n\n" + str(get_top_list()).replace(
        "', '", '\n').replace("['", "").replace("']", "") + "\n\n" + get_daily_sentence()
    return sendContent
print(get_sendContent())

Traceback (most recent call last):
  File "C:\Users\14014\Desktop\文档\Python\Python练习\天气.py", line 74, in <module>
    print(get_sendContent())
  File "C:\Users\14014\Desktop\文档\Python\Python练习\天气.py", line 71, in get_sendContent
    sendContent =get_week_day(datetime.date.today()) + "\n\n" + get_weather() + "\n\n" + str(get_top_list()).replace(
  File "C:\Users\14014\Desktop\文档\Python\Python练习\天气.py", line 27, in get_weather
    aqi = int(message['aqi'])
ValueError: invalid literal for int() with base 10: ''
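
The traceback shows `int()` being called on an empty string, which means the response for this city code carried an empty `aqi` field. Below is a minimal sketch of a more tolerant version of the AQI step. `describe_aqi` is a hypothetical helper, the thresholds are the ones from the original `get_weather`, and the fallback text is an arbitrary choice.

```python
def describe_aqi(raw_aqi):
    """Turn the raw 'aqi' field into display text, tolerating missing values."""
    try:
        aqi = int(raw_aqi)
    except (TypeError, ValueError):
        return "暂无数据"  # fallback label when the field is empty or absent
    levels = [
        (50, "优"),
        (100, "良"),
        (150, "轻度污染"),
        (200, "中度污染"),
        (300, "重度污染"),
    ]
    for upper_bound, label in levels:
        if aqi <= upper_bound:
            return f"{aqi}（{label}）"
    return f"{aqi}（严重污染）"


print(describe_aqi("42"))  # 42（优）
print(describe_aqi(""))    # 暂无数据
```

Inside `get_weather`, the expression `str(aqi) + "（" + airQuality + "）"` could then be replaced by `describe_aqi(message.get('aqi'))`.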

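ARtcgb posted on 2021-4-22 10:33

The code is a bit messy, sorry about that. If you have any questions, ask below and I will reply as soon as I can (depending on the circumstances).

dleo posted on 2021-4-22 10:36

Looks like getting a city list to swap out that 101010100 would make this more practical.

ARtcgb posted on 2021-4-22 10:42

dleo posted on 2021-4-22 10:36
Looks like getting a city list to swap out that 101010100 would make this more practical.
https://j.i8tq.com/weather2020/search/city.js
Yes, this JS file contains the code of every city. You can also put the code directly in front of `.shtml` in the page URL.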
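
A sketch of how the city code could be turned into a parameter. `weather_url` is a hypothetical helper. Treating the `_` query parameter as a millisecond timestamp used for cache busting is an assumption based on its value in the tutorial's URL, and the endpoint may well ignore it.

```python
import time


def weather_url(city_code, now=None):
    """Build the real-time weather URL for a weather.com.cn city code."""
    if now is None:
        now = time.time()
    timestamp_ms = int(now * 1000)  # assumed meaning of the "_" parameter
    return f"https://d1.weather.com.cn/sk_2d/{city_code}.html?_={timestamp_ms}"


print(weather_url("101010100", now=1618886817))
# https://d1.weather.com.cn/sk_2d/101010100.html?_=1618886817000
```

Passing the result to `requests.get` in `get_weather` would let the same function serve any city whose code is found in the city list.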

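RoyPenn posted on 2021-4-22 10:46

Impressive, putting what you learned to use.

longle posted on 2021-4-22 10:53

What should I do if PyCharm reports an error when downloading BeautifulSoup?

sdrpsps posted on 2021-4-22 10:53

As a Python beginner I learned something here. Thanks, OP.

阿木木不哭 posted on 2021-4-22 10:53

Looks impressive, I'll study it.

welcome7758521 posted on 2021-4-22 10:55

Nicely done. It's basically a personal daily assistant, pretty good.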


吾爱破解 - 52pojie.cn's Archiver
