吾爱破解 - 52pojie.cn

[Python, Repost] A Weibo follow-network analysis tool based on a multithreaded crawler
trioKun posted on 2019-8-27 16:35
Last edited by trioKun on 2019-8-27 18:07

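I wrote this little tool out of interest and would like to share it with everyone.
Because of network latency and Weibo's anti-crawling measures, the script still runs fairly slowly; suggestions for improvement are welcome.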


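A brief introduction to what the script does:
The basic idea of the analyzer is similar to Weibo's built-in recommendation "XX, whom you follow, also follows YY". The analyzer crawls users' follow lists and uses BFS to go down to any level of the follow chain, which surfaces many people you may know. It also applies a simple check to filter out big-V users (verified accounts with very large followings) and other invalid users.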
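
To make the idea concrete, here is a minimal sketch of a level-limited BFS over a follow graph. It is not the analyzer's code: the graph is a hand-made dict standing in for crawled follow lists, the names `bfs_levels`, `follows` and `max_followers` are invented for this example, and the follower-count cutoff is only an assumed stand-in for the big-V filter.

```python
from collections import deque


def bfs_levels(follows, root, max_level, followers=None, max_followers=None):
    """Return {uid: level} for every user reachable within max_level hops of root."""
    followers = followers or {}
    levels = {root: 0}
    queue = deque([root])
    while queue:
        uid = queue.popleft()
        if levels[uid] >= max_level:
            continue  # do not expand users on the deepest requested level
        for nxt in follows.get(uid, []):
            if nxt in levels:
                continue  # already reached through an equal or shorter chain
            if max_followers is not None and followers.get(nxt, 0) > max_followers:
                continue  # assumed big-V filter: too many followers
            levels[nxt] = levels[uid] + 1
            queue.append(nxt)
    del levels[root]  # the root user is not a recommendation
    return levels


# Toy data (hypothetical): "me" follows a and b, who in turn follow others.
follows = {
    "me": ["a", "b"],
    "a": ["c", "star"],
    "b": ["c", "d"],
    "c": ["e"],
}
followers = {"star": 5_000_000}
result = bfs_levels(follows, "me", max_level=2,
                    followers=followers, max_followers=100_000)
print(result)
```

With `max_level=2` the user `e` is never reached, and `star` is dropped by the assumed follower cutoff.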
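As an example, running the analyzer gives you a list of users with the following information. Level is the depth in the follow chain: Level=1 means you follow the user directly, Level=2 means a user you follow directly follows that user, and so on. Score indicates how closely the user is related to your follow network; you can also customize the weight of each factor in the Score.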

Nickname:         兴趣使然的英雄
Gender:           男
Region:           北京 海淀区
Followers:        638
Tweets:           142
Last Tweet:       2019-05-11 04:06
Home Page:        https://weibo.com/1234567890
Relation Level:     3
Relation Score:     90
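
The `scoring` function itself is not part of the excerpt in this post, so the following is only a guess at what a customizable score could look like: a weighted sum over the per-user attributes that do appear in the core code (`in_degree`, `bidirectional_follow`, `dist`). The weights, the `UserStats` class and the `weighted_score` name are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical weights; the real factors and weights are in the full source.
WEIGHTS = {"in_degree": 10.0, "bidirectional_follow": 20.0, "dist": -5.0}


@dataclass
class UserStats:
    """Stand-in for the attributes the analyzer accumulates per user."""
    in_degree: float = 0.0
    bidirectional_follow: float = 0.0
    dist: int = 0


def weighted_score(user, weights=WEIGHTS):
    """Weighted sum of the user's relation factors (illustrative formula only)."""
    return sum(w * getattr(user, name) for name, w in weights.items())


users = [
    UserStats(in_degree=1.5, bidirectional_follow=0.0, dist=2),
    UserStats(in_degree=3.0, bidirectional_follow=1.0, dist=1),
]
# Same pattern as the analyzer's output step: highest score first.
ranked = sorted(users, key=weighted_score, reverse=True)
for u in ranked:
    print(u, round(weighted_score(u)))
```

Changing an entry in `WEIGHTS` is all it takes to re-rank the list, which is the kind of customization described above.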


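The complete source code and further information are in the attachment; the complete source can also be downloaded from GitHub (trioKun/Weibo-Relation-Analysis-Spider).
The core part of the analyzer is given below.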
[Python]
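import sys
import threading
import time

# NOTE: User, scoring, calc_days_until_now, except_wrapper_func and
# get_last_tweet_time_fullver are not shown in this excerpt; see the full source.


# Fetch each user in its own thread to hide the long response time of a request
class GetUser(threading.Thread):
    def __init__(self, uid, pa_ind, ch_ind):
        super().__init__()
        self.uid = uid
        self.pa_index = pa_ind      # index of the parent (the follower) in usr_nodes
        self.ch_index = ch_ind      # index of the placeholder to fill in

    def run(self):
        new_usr = User(self.uid)    # slow network request, done outside the lock
        with Analyzer.usr_nodes_mutex:
            new_usr.dist = Analyzer.usr_nodes[self.pa_index].dist + 1
            Analyzer.usr_nodes[self.ch_index] = new_usr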


class Analyzer:
    usr_nodes = []        # containing all detected users
    usr_nodes_mutex = threading.Lock()

    def __init__(self, uid, level=2, child_threads=3):
        """
        :param uid: the user id of the root user
        :param level: search level, 2 or 3 is recommended
        :param child_threads: max number of child threads, depends on the number of your cookies
        """
        self.root_uid = uid
        self.threads = 1 + child_threads        # main thread plus the allowed child threads
        self.bfs(level)           # construct relationship graph by BFS

    def bfs(self, level):           # search down to the given level
        cert = list()           # certification state, decides whether a user goes into usr_nodes
        # possible values of cert[i]:
        Certed = 2**level         # threshold: the user is certified
        Init_Cert = 0               # newly discovered, no weight accumulated yet
        Pot_Cert = range(0, Certed)      # potentially certified, but not yet followed by enough users
        Not_Cert = -1               # not certified, blocked by a filter strategy
        Exist = Certed + 1         # already present in usr_nodes

        uids = list()        # uids[i] <==> cert[i]
        uids.append(self.root_uid)
        self.usr_nodes.append(User(self.root_uid))
        cert.append(Exist)
        self.usr_nodes[0].dist = 0
        self.usr_nodes[0].in_degree = 0
        curr = 0
        last_level = -1

        while True:
            self.usr_nodes_mutex.acquire()
            curr_user = self.usr_nodes[curr]
            self.usr_nodes_mutex.release()

            if curr_user.dist != last_level:
                print("current level is %d" % curr_user.dist)
                last_level = curr_user.dist
            print("\t %d.scanning uid %d..." % (curr, curr_user.usr_id))

            for follow_uid in curr_user.follow_uid_list:
                if follow_uid not in uids and curr_user.dist < level:     # follow_uid is not yet collected
                    uids.append(follow_uid)
                    cert.append(Init_Cert)

                if follow_uid in uids:
                    u_ind = uids.index(follow_uid)
                    if curr_user.dist < level and cert[u_ind] in Pot_Cert:
                        cert[u_ind] += 2**(level - curr_user.dist)
                        if cert[u_ind] == Certed:
                            MaxNoTweetDays = 180      # filter, ignore users who didn't tweet for a number of days
                            if calc_days_until_now(self.get_last_tweet_time(follow_uid)) > MaxNoTweetDays:
                                cert[u_ind] = Not_Cert
                            else:
                                cert[u_ind] = Exist
                                print("\t\t found new uid %d" % follow_uid)

                                with self.usr_nodes_mutex:
                                    ins_index = len(self.usr_nodes)
                                    self.usr_nodes.append(User())              # an empty User object as a placeholder

                                # throttle: wait until a thread slot is free
                                while threading.active_count() >= self.threads:
                                    time.sleep(0.1)
                                GetUser(follow_uid, curr, ins_index).start()
                    elif cert[u_ind] == Exist:           # increase corresponding attributes
                        self.usr_nodes_mutex.acquire()
                        n_ind = self.index_usr_node(follow_uid)
                        self.usr_nodes_mutex.release()
                        while n_ind == -1:
                            time.sleep(0.1)            # wait until thread works it out
                            self.usr_nodes_mutex.acquire()
                            n_ind = self.index_usr_node(follow_uid)
                            self.usr_nodes_mutex.release()

                        self.usr_nodes_mutex.acquire()
                        # divide (dist+1) because dist could be 0
                        self.usr_nodes[n_ind].in_degree += 1 / (curr_user.dist + 1)
                        if curr_user.usr_id in self.usr_nodes[n_ind].follow_uid_list:  # follow each other
                            self.usr_nodes[curr].bidirectional_follow += 1 / (self.usr_nodes[n_ind].dist + 1)
                            self.usr_nodes[n_ind].bidirectional_follow += 1 / (curr_user.dist + 1)
                        self.usr_nodes_mutex.release()
            curr += 1

            self.usr_nodes_mutex.acquire()
            over = bool(curr >= len(self.usr_nodes))
            self.usr_nodes_mutex.release()
            if over:
                break

            while True:
                self.usr_nodes_mutex.acquire()
                work_out = bool(self.usr_nodes[curr].usr_id != 0)
                self.usr_nodes_mutex.release()
                if work_out:
                    break
                time.sleep(0.1)              # wait until thread works it out


    @staticmethod
    def get_last_tweet_time(uid):
        return except_wrapper_func(get_last_tweet_time_fullver, uid)

    def output(self, file=sys.stdout):
        for user in sorted(self.usr_nodes, key=scoring, reverse=True):
            if user.dist >= 1 and user.usr_id != 0:
                user.show(file=file)
                print("Relation Level:    %d" % user.dist, file=file)
                print("Relation Score:    %d" % round(scoring(user)), file=file)
                print(file=file)

    def index_usr_node(self, uid):
        """
        :param uid: an int, the target uid
        :return: the index of the User with the given uid, or -1 if not found
        """
        for index, node in enumerate(self.usr_nodes):
            if node.usr_id == uid:
                return index
        return -1
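
The certification counter in `bfs` can be read as a weighted vote: a follower at distance `d` from the root adds `2**(level - d)`, and a candidate is admitted once the total reaches `2**level`. So a user followed by the root qualifies at once, while a user followed only by level-1 users needs two of them, and so on. The sketch below isolates just that arithmetic; `vote_weight` and `is_certified` are names made up for this illustration, and it uses `>=` where the original compares with `==`.

```python
def vote_weight(level, follower_dist):
    """Weight added by one follower, as in cert[u_ind] += 2**(level - curr_user.dist)."""
    return 2 ** (level - follower_dist)


def is_certified(level, follower_dists):
    """True once the accumulated weights reach the 2**level threshold."""
    threshold = 2 ** level
    total = 0
    for d in sorted(follower_dists):  # BFS visits followers in non-decreasing distance
        if d >= level:
            continue  # users on the last level are not expanded, so they cast no vote
        total += vote_weight(level, d)
        if total >= threshold:
            return True
    return False


level = 3
print(is_certified(level, [0]))        # True: followed by the root itself (8 >= 8)
print(is_certified(level, [1]))        # False: one level-1 follower gives only 4
print(is_certified(level, [1, 1]))     # True: 4 + 4
print(is_certified(level, [2, 2, 2]))  # False: 2 + 2 + 2 = 6
print(is_certified(level, [1, 2, 2]))  # True: 4 + 2 + 2
```

Because the weights are powers of two added in non-increasing order, the running total never jumps past the threshold, which is why the exact `==` comparison in the original works.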

Attachment: Weibo-Relation-Analysis-Spider-master.zip (10.95 KB)
Complete source code of the analyzer

larry.zhao posted on 2019-8-27 17:18
Nice, worth studying as a reference. Thanks!
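追逐太阳 posted on 2019-8-27 17:36
[OP] trioKun posted on 2019-8-27 17:59
Quoting 追逐太阳 (2019-8-27 17:36):
I'm not quite sure what the crawled data can be used for.

I re-read my original post and realized I hadn't even explained what the script does. My apologies.

I have now added a more detailed introduction; I hope it helps everyone understand it.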
youjian posted on 2019-8-27 18:34
You could make one that scrapes the comments on big-V accounts; that would be very useful for self-media work.