A Weibo Follow-Network Analysis Tool Based on a Multi-threaded Spider
Last edited by trioKun on 2019-8-27 18:07.

I wrote this little tool out of personal interest and would like to share it with everyone.
Because of network latency and Weibo's anti-scraping measures, the script still runs rather slowly; ideas for improvement are welcome.
A brief introduction to what the script does:

The analyzer's basic idea is similar to Weibo's built-in recommendation, "XX, whom you follow, also follows YY". By crawling users' follow lists, the analyzer uses BFS to descend to any depth of the follow chain, digging up many people you may actually know. Along the way, simple checks filter out big-V (celebrity) accounts and other irrelevant users.

As an example, running the analyzer produces a list of users carrying the information below. Level is the depth in the follow chain: Level=1 means you follow the user directly, Level=2 means someone you follow directly follows them, and so on. Score measures how strongly the user is tied to your own network, and you can customize the weight of each Score factor (a sketch of such a scoring function follows the example).
Nickname: 兴趣使然的英雄
Gender: Male
Region: Beijing, Haidian District
Followers: 638
Tweets: 142
Last Tweet: 2019-05-11 04:06
Home Page: https://weibo.com/1234567890
Relation Level: 3
Relation Score: 90
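The scoring function itself is not part of the excerpt further down; it lives in the full source. Purely as an illustration, a weighted combination over the attributes the analyzer tracks (in_degree, bidirectional_follow, dist) might look like the sketch below. The weights here are made-up placeholders, not the values the tool actually uses.

def scoring(user):
    # Hypothetical weights, not the author's actual values; tune to taste.
    w_in_degree, w_bidirectional, w_level = 30, 50, 20
    return (w_in_degree * getattr(user, "in_degree", 0)
            + w_bidirectional * getattr(user, "bidirectional_follow", 0)
            + w_level / (getattr(user, "dist", 0) + 1))  # dist can be 0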
The complete source code and more details are in the attachment; the full source can also be downloaded from GitHub (trioKun/Weibo-Relation-Analysis-Spider).
The core part of the analyzer is given below:
# Multi-threading mitigates the long response time of each request.
# (User, scoring, calc_days_until_now, except_wrapper_func and
# get_last_tweet_time_fullver are defined elsewhere in the full source.)
import sys
import threading
import time

class GetUser(threading.Thread):
    def __init__(self, uid, pa_ind, ch_ind):
        threading.Thread.__init__(self)
        self.uid = uid
        self.pa_index = pa_ind  # index of the parent (follower) node
        self.ch_index = ch_ind  # index reserved for this new child node

    def run(self):
        new_usr = User(self.uid)
        Analyzer.usr_nodes_mutex.acquire()
        Analyzer.usr_nodes[self.ch_index] = new_usr
        Analyzer.usr_nodes[self.ch_index].dist = Analyzer.usr_nodes[self.pa_index].dist + 1
        Analyzer.usr_nodes_mutex.release()
class Analyzer:
    usr_nodes = []  # contains all detected users
    usr_nodes_mutex = threading.Lock()

    def __init__(self, uid, level=2, child_threads=3):
        """
        :param uid: the user id of the root user
        :param level: search level, 2 or 3 is recommended
        :param child_threads: max number of child threads, depends on the number of your cookies
        """
        self.root_uid = uid
        self.threads = 1 + child_threads  # the main thread plus the child threads
        self.bfs(level)  # construct the relationship graph by BFS
    def bfs(self, level):  # search down to the given level
        cert = list()  # certifications: decide whether a user gets into usr_nodes
        # values of cert:
        Certed = 2 ** level  # certified
        Init_Cert = 0
        Pot_Cert = range(0, Certed)  # potentially certified, but not followed enough yet
        Not_Cert = -1  # not certified: blocked by some filter strategy
        Exist = Certed + 1  # already present in usr_nodes
        uids = list()  # uids[i] <==> cert[i]
        uids.append(self.root_uid)
        self.usr_nodes.append(User(self.root_uid))
        cert.append(Exist)
        self.usr_nodes[0].dist = 0
        self.usr_nodes[0].in_degree = 0
        curr = 0
        last_level = -1
        while True:
            self.usr_nodes_mutex.acquire()
            curr_user = self.usr_nodes[curr]
            self.usr_nodes_mutex.release()
            if curr_user.dist != last_level:
                print("current level is %d" % curr_user.dist)
                last_level = curr_user.dist
            print("\t %d.scanning uid %d..." % (curr, curr_user.usr_id))
            for follow_uid in curr_user.follow_uid_list:
                if follow_uid not in uids and curr_user.dist < level:  # follow_uid is not yet collected
                    uids.append(follow_uid)
                    cert.append(Init_Cert)
                if follow_uid in uids:
                    u_ind = uids.index(follow_uid)
                    if curr_user.dist < level and cert[u_ind] in Pot_Cert:
                        # a follow from a user at distance d contributes 2**(level - d):
                        # with level=2, one follow from the root certifies at once,
                        # while two follows from level-1 users are needed
                        cert[u_ind] += 2 ** (level - curr_user.dist)
                        if cert[u_ind] == Certed:
                            MaxNoTweetDays = 180  # filter: ignore users who haven't tweeted for this many days
                            if calc_days_until_now(self.get_last_tweet_time(follow_uid)) > MaxNoTweetDays:
                                cert[u_ind] = Not_Cert
                            else:
                                cert[u_ind] = Exist
                                print("\t\t found new uid %d" % follow_uid)
                                self.usr_nodes_mutex.acquire()
                                ins_index = len(self.usr_nodes)
                                self.usr_nodes.append(User())  # an empty User object as a placeholder
                                self.usr_nodes_mutex.release()
                                while threading.activeCount() >= self.threads:
                                    time.sleep(0.1)
                                GetUser(follow_uid, curr, ins_index).start()
                    elif cert[u_ind] == Exist:  # update the corresponding attributes
                        self.usr_nodes_mutex.acquire()
                        n_ind = self.index_usr_node(follow_uid)
                        self.usr_nodes_mutex.release()
                        while n_ind == -1:
                            time.sleep(0.1)  # wait until a worker thread fills the slot in
                            self.usr_nodes_mutex.acquire()
                            n_ind = self.index_usr_node(follow_uid)
                            self.usr_nodes_mutex.release()
                        self.usr_nodes_mutex.acquire()
                        # divide by (dist + 1) because dist can be 0
                        self.usr_nodes[n_ind].in_degree += 1 / (curr_user.dist + 1)
                        if curr_user.usr_id in self.usr_nodes[n_ind].follow_uid_list:  # they follow each other
                            self.usr_nodes[curr].bidirectional_follow += 1 / (self.usr_nodes[n_ind].dist + 1)
                            self.usr_nodes[n_ind].bidirectional_follow += 1 / (curr_user.dist + 1)
                        self.usr_nodes_mutex.release()
            curr += 1
            self.usr_nodes_mutex.acquire()
            over = bool(curr >= len(self.usr_nodes))
            self.usr_nodes_mutex.release()
            if over:
                break
            while True:
                self.usr_nodes_mutex.acquire()
                work_out = bool(self.usr_nodes[curr].usr_id != 0)
                self.usr_nodes_mutex.release()
                if work_out:
                    break
                time.sleep(0.1)  # wait until a worker thread fills the slot in
    @staticmethod
    def get_last_tweet_time(uid):
        return except_wrapper_func(get_last_tweet_time_fullver, uid)

    def output(self, file=sys.stdout):
        for user in sorted(self.usr_nodes, key=scoring, reverse=True):
            if user.dist >= 1 and user.usr_id != 0:
                user.show(file=file)
                print("Relation Level: %d" % user.dist, file=file)
                print("Relation Score: %d" % round(scoring(user)), file=file)
                print(file=file)

    def index_usr_node(self, uid):
        """
        :param uid: an int, the target uid
        :return: the index of the User with the given uid, or -1 if not found
        """
        index = 0
        while index < len(self.usr_nodes) and self.usr_nodes[index].usr_id != uid:
            index += 1
        if index < len(self.usr_nodes):
            return index
        else:
            return -1
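For reference, judging from the excerpt alone, a run could be kicked off like this (the uid is a placeholder; the full source may expose a different entry point):

# Constructing the Analyzer runs the BFS; output() dumps the sorted result.
analyzer = Analyzer(uid=1234567890, level=2, child_threads=3)
with open("result.txt", "w", encoding="utf-8") as f:
    analyzer.output(file=f)

Design-wise, the main thread appends an empty User placeholder, hands its index to a GetUser worker, and then polls usr_id != 0 under the shared lock until the worker has filled the slot in; the activeCount() check keeps at most child_threads requests in flight at once.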
Hmm, worth studying and learning from, thanks.

Not really sure what the crawled data can be used for.

追逐太阳 posted on 2019-8-27 17:36:
Not really sure what the crawled data can be used for.

Rereading the original post, I realize I never actually explained what the script can do; my apologies.
I have now added a concrete introduction, which will hopefully help everyone understand.

You could build one that scrapes the comments under big-V accounts; that would be very useful for self-media work.