【Python beginner crawler】Scrape WeChat Official Account search results and export them to Excel
Foreword: Yesterday I saw a bounty thread on 吾爱, "Looking for someone to write a WeChat Official Account crawler", asking for a script that searches official accounts across the platform by keyword and exports the results to an Excel sheet. I put this crawler together during some slack time at work.
How the crawler works:
一、First log in to the Official Account platform. You can use selenium to drive a browser through the WeChat Official Account login, then save the cookies (the login state). The account-search api also requires a token, which can be cut out of the url the site redirects to after login.
二、With requests plus the saved cookies, build the query parameters and send a get request to fetch the response. The search api is paged, so it has to be called repeatedly: each call fetches count records starting from record number begin, and the loop ends once a call returns fewer than count records. begin starts at 0, count defaults to 5, the query parameter carries the keyword, and the other parameters keep their defaults.
三、Create a new Excel workbook, process each returned record, pull out its fields (fakeid, nickname and so on), turn them into a row, append the row to the sheet, and save the file at the end. A rough sketch of the response payload follows this list.
(You can follow the same idea to fetch a given account's past articles or scrape other information; see the appmsg sketch after the full script.)
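For reference, the search api's response looks roughly like this. The shape is reconstructed from the fields the code below reads; the example values are placeholders, the real payload carries more fields, and WeChat may change it at any time:

{
    "base_resp": {"ret": 0, "err_msg": "ok"},
    "list": [
        {
            "fakeid": "MzI0MTxxxxxxxx==",                  # account id (placeholder)
            "nickname": "某某公众号",                       # display name (placeholder)
            "round_head_img": "http://mmbiz.qpic.cn/...",  # avatar url (placeholder)
            "alias": "example_alias",                      # WeChat alias, may be empty
            "service_type": 1                              # account type
        }
    ],
    "total": 100
}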
Screenshot of the result: (image attachment)
Here is the code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import json
import requests
from openpyxl import Workbook
def savecookies(cookies):
    # Dump the selenium cookies to a local json file
    with open("./cookies.json", "w") as fp:
        json.dump(cookies, fp)
def setcookies(driver):
    # Load the saved cookies back into the browser (defined for completeness;
    # search() below reads cookies.json directly for requests)
    with open("./cookies.json", "r") as fp:
        cookies = json.load(fp)
    for cookie in cookies:
        driver.add_cookie(cookie_dict=cookie)
    return driver
def login(driver):
    loginURL = "https://mp.weixin.qq.com/"
    driver.get(loginURL)
    print("Opening the page...")
    time.sleep(2)
    # Re-check the current url: if it already carries a token, we are logged in
    if str(driver.current_url).find("token=") > 0:
        print("Already logged in")
    else:
        time.sleep(2)
        # Wait for the user to scan the QR code; the url changes once login succeeds
        while driver.current_url == loginURL:
            time.sleep(0.5)
        print("---- login successful")
        # Save the cookies
        savecookies(driver.get_cookies())
    # Fetch the landing page again and cut the token out of the redirect url
    driver.get("https://mp.weixin.qq.com")
    nowURL = driver.current_url
    start = str(nowURL).find("token=") + 6
    token = str(nowURL)[start:]
    return token
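# If the string slicing above ever breaks (say, extra parameters end up after
# the token), the stdlib can parse the query string instead; a sketch:
# from urllib.parse import urlparse, parse_qs
# token = parse_qs(urlparse(driver.current_url).query).get("token", [""])[0]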
def search(word, token, begin, count):
    headers = {
        "HOST": "mp.weixin.qq.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36",
    }
    # Load the saved cookies into a dict for requests
    cookies_dict = dict()
    with open("./cookies.json", "r") as fp:
        cookies = json.load(fp)
    for cookie in cookies:
        if cookie['name'] == 'pgv_si':
            continue
        if cookie['name'] == 'uuid':
            continue
        cookies_dict[cookie['name']] = cookie['value']
    cookiesJar = requests.utils.cookiejar_from_dict(cookies_dict, cookiejar=None, overwrite=True)
    session = requests.session()
    session.cookies = cookiesJar
    session.headers = headers
    # Send the paged search request
    res = session.get("https://mp.weixin.qq.com/cgi-bin/searchbiz?action=search_biz&begin=" + str(begin) + "&count=" + str(count) + "&query=" + word + "&token=" + str(token) + "&lang=zh_CN&f=json&ajax=1")
    print(res.text)
    return json.loads(res.text)
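# Tip: letting requests build the query string also handles url-encoding of
# the Chinese keyword automatically; a sketch against the same endpoint:
# res = session.get("https://mp.weixin.qq.com/cgi-bin/searchbiz",
#                   params={"action": "search_biz", "begin": begin, "count": count,
#                           "query": word, "token": token,
#                           "lang": "zh_CN", "f": "json", "ajax": 1})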
def save(datas):
    # Parse the payload: every record in 'list' becomes one spreadsheet row
    resultList = datas['list']
    for item in resultList:
        row = [item['fakeid'], item['nickname'], item['round_head_img'], item['alias'], item['service_type']]
        # Append the row to the sheet
        sheet.append(row)
if __name__ == '__main__':
    # Create a new workbook and grab the active sheet
    book = Workbook()
    sheet = book.active
    option = Options()
    # Uncomment the following to run the browser without showing a window
    # option.add_argument('--headless')
    # option.add_argument('--disable-gpu')
    # option.add_argument("window-size=1024,768")
    # option.add_argument("--no-sandbox")
    option.add_argument(r"user-data-dir=D:\WeChat")  # persist the browser profile so later runs skip the login
    # Load the options and the driver
    driver = webdriver.Chrome(options=option, executable_path="./chromedriver.exe")
    # Log in and fetch the token
    token = login(driver)
    # Write the header row
    row = ['fakeid', 'nickname', 'round_head_img', 'alias', 'service_type']
    sheet.append(row)
    # Crawl the search results page by page
    begin = 0
    while True:
        # One paged get request
        res = search("笔吧", token, begin, 5)
        # Write this page into the sheet
        save(res)
        # Next page
        begin += 5
        # The last page returns fewer records than count: stop
        if len(res['list']) < 5:
            break
    # Save the workbook
    book.save("./WeChat.xlsx")
    driver.quit()
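As mentioned at the top, the same login state can be reused to pull an account's past articles. A minimal sketch, assuming the commonly reported appmsg endpoint; list_articles is a hypothetical helper, the fakeid comes out of the search results above, and the parameters may differ from whatever WeChat currently serves:

def list_articles(fakeid, token, begin=0, count=5):
    # Reuses the cookies.json written by login(); endpoint and parameters
    # follow the commonly reported appmsg api (assumption, not verified here)
    with open("./cookies.json", "r") as fp:
        cookies_dict = {c['name']: c['value'] for c in json.load(fp)}
    session = requests.session()
    session.cookies = requests.utils.cookiejar_from_dict(cookies_dict)
    res = session.get("https://mp.weixin.qq.com/cgi-bin/appmsg",
                      params={"action": "list_ex", "begin": begin, "count": count,
                              "fakeid": fakeid, "type": 9, "query": "",
                              "token": token, "lang": "zh_CN", "f": "json", "ajax": 1})
    return json.loads(res.text)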
Full code:
https://www.lanzoux.com/i6Un9g3vcgd password: i2tu

Traceback (most recent call last):
File "D:/python/lizi/微信公众号爬取/WeChat.py", line 104, in <module>
driver = webdriver.Chrome(options= option,executable_path="./chromedriver.exe")
File "C:\Users\Administrator.SC-202103111759\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "C:\Users\Administrator.SC-202103111759\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "C:\Users\Administrator.SC-202103111759\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "C:\Users\Administrator.SC-202103111759\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "C:\Users\Administrator.SC-202103111759\AppData\Local\Programs\Python\Python36-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot find Chrome binary
Process finished with exit code 1
Any ideas?? Calling the gurus

ablajan posted on 2021-5-25 13:34:
Traceback (most recent call last):
File "D:/python/lizi/微信公众号爬取/WeC ...
You don't have the chromedriver executable.
Go to http://npm.taobao.org/mirrors/chromedriver/, find the chromedriver that matches your browser version, and drop it in the same directory as the script.
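One more thing worth checking: "cannot find Chrome binary" usually means Chrome itself, not chromedriver, could not be located. If Chrome is installed in a non-standard path, pointing selenium straight at the executable may also help; a minimal sketch, where the path is only an example:

option = Options()
# Example path only, point this at wherever chrome.exe actually lives
option.binary_location = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
driver = webdriver.Chrome(options=option, executable_path="./chromedriver.exe")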
Gotta study this properly

Studying hard

Programmers actually get slack time at work?

Any tips on how to get started with crawlers?

zhengyuqi posted on 2020-8-27 16:38:
Any tips on how to get started with crawlers?

The forum has plenty of beginner tutorials. Work through one from start to finish and you're in.

Picking up some crawler skills

Learned a lot, thanks OP for sharing

zhengyuqi posted on 2020-8-27 16:38:
Any tips on how to get started with crawlers?

Learn the language syntax first, then get a handle on how networks work, then just keep writing code. Beginners might also want to watch some video courses.

Thanks OP for sharing, will give it a study