Python text word-frequency counting

罗木魄 posted on 2024-6-4 16:01

Last edited by 罗木魄 on 2024-6-4 16:04

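While learning about word-frequency counting, I came across a video by the Bilibili uploader 马兹n (https://www.bilibili.com/video/BV1f14y1C71Y/?vd_source=15f8328ce49048121394b0da72ff83b1) and decided to try the analysis myself. I made a few improvements to the code from the video.
1. Count keyword frequencies on the text before segmentation. The original video counts frequencies directly on the jieba-segmented text, but I found that jieba fails to recognize some words; for example, "营业收入" (operating revenue) is not segmented as one word. The fix suggested online is to add a cell lexicon (细胞词库), but I could not find a public download address for one. I hope the experts on the forum can help with this part.
2. Changed how the stock code, year, and short name are read. The annual-report files I bought on a second-hand platform are named in the format 代码_年份_简称_文件名_发布日期 (code_year_short name_file name_release date), so I extract the fields with split() according to this format.
3. Added a total word count. The text is segmented with jieba, stop words are removed, and the remaining words are counted.

The code is as follows: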
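The file-name handling in point 2 can be sketched on its own. This is a minimal illustration, assuming every file name has at least the first three underscore-separated fields; the helper name `parse_report_filename` and the sample file name are hypothetical and are not part of the original script.

```python
from typing import NamedTuple


class ReportInfo(NamedTuple):
    stock_code: str
    year: str
    short_name: str


def parse_report_filename(filename: str) -> ReportInfo:
    """Return the first three fields of '代码_年份_简称_文件名_发布日期'."""
    parts = filename.split("_")
    if len(parts) < 3:
        # Fail loudly instead of writing misaligned columns to the sheet
        raise ValueError(f"unexpected file name format: {filename!r}")
    return ReportInfo(parts[0], parts[1], parts[2])


# Hypothetical file name, made up for illustration
info = parse_report_filename("000001_2022_示例公司_年度报告_2023-03-09.txt")
print(info.stock_code, info.year, info.short_name)  # 000001 2022 示例公司
```

Keeping the stock code as a string preserves its leading zeros when it is written to the spreadsheet.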
import os

import jieba
import xlwt

# Load the stop-word list (one word per line) into a set for fast lookup
with open('cn_stopwords.txt', encoding='utf-8') as f:
    stopwords = {line.rstrip() for line in f}


# Scan the .txt files in ThePath, count each keyword, and save the results to Excel
def matchKeyWords(ThePath, keyWords, aim_path):
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('关键词词频统计', cell_overwrite_ok=True)
    # Header row: code, short name, year, total word count, then one column per keyword
    sheet.write(0, 0, '代码')
    sheet.write(0, 1, '简称')
    sheet.write(0, 2, '年份')
    sheet.write(0, 3, '总词数')
    for i, c_word in enumerate(keyWords):
        sheet.write(0, i + 4, c_word)
    index = 0
    files = os.listdir(ThePath)
    for file in files:
        if os.path.splitext(file)[-1] == ".txt":
            txt_path = os.path.join(ThePath, file)
            # File names follow the format 代码_年份_简称_文件名_发布日期
            parts = file.split("_")
            stock_code = parts[0]
            year = parts[1]
            stock_name = parts[2]
            sheet.write(index + 1, 0, stock_code)
            sheet.write(index + 1, 1, stock_name)
            sheet.write(index + 1, 2, year)
            print(f'正在统计{file}')  # progress message: "now counting <file>"
            with open(txt_path, "r", encoding='utf-8', errors='ignore') as fp:
                text = fp.read()
            words_list = list(jieba.cut(text))  # segment with jieba
            words_list = [w for w in words_list if w not in stopwords]  # remove stop words
            total_words = len(words_list)  # total word count
            sheet.write(index + 1, 3, total_words)  # written as a number, not text
            for ind, word in enumerate(keyWords):
                # Count on the raw text so keywords that jieba splits are still found
                word_freq = text.count(word)
                sheet.write(index + 1, ind + 4, word_freq)
            index += 1
    book.save(aim_path)


ThePath = r'G:\年报\年报TXT版'  # folder containing the annual reports
aim_path = r'G:\年报\词频统计'  # folder where the word-frequency results are saved
keywords = ['营业收入', '估值', '资产', '股东', '智能数据分析', '智能机器人', '机器学习', '深度学习']  # keywords to count
matchKeyWords(ThePath, keywords, os.path.join(aim_path, '词频统计.xls'))
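The keyword counts in the script come from `text.count(word)` on the raw text, not from the jieba tokens. The sketch below shows how the two can differ. The token list is a hand-written stand-in for a segmenter's output and is an assumption for illustration, not actual jieba output; the sample sentence is also made up.

```python
from collections import Counter

text = "公司营业收入增长，营业收入主要来自总资产周转。"

# Hypothetical segmentation in which "营业收入" is split into two tokens
tokens = ["公司", "营业", "收入", "增长", "，", "营业", "收入",
          "主要", "来自", "总资产", "周转", "。"]

token_counts = Counter(tokens)

# Token-level counting misses a keyword that the segmenter split apart
print(token_counts["营业收入"])   # 0
# Substring counting on the raw text still finds it
print(text.count("营业收入"))     # 2
# Substring counting also matches inside longer words such as 总资产
print(text.count("资产"))         # 1
print(token_counts["资产"])       # 0
```

Substring counting therefore recovers split keywords, but it also counts a short keyword such as "资产" wherever it appears inside a longer word. Whether that is wanted depends on the analysis.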

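boxer posted on 2024-6-4 18:17

You are only counting words you already know here. Wouldn't it be faster to use a regular expression (pattern matching) directly?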
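A minimal sketch of what a regex-based count could look like, assuming the keywords are plain literal strings. The function name `count_keywords_regex` and the sample sentence are hypothetical. Each keyword gets its own pattern instead of being joined with `|` into one alternation, because a single alternation consumes text as it matches and would undercount a keyword contained in another one. Whether this is actually faster than `str.count` is not measured here.

```python
import re
from typing import Dict, List


def count_keywords_regex(text: str, keywords: List[str]) -> Dict[str, int]:
    """Count non-overlapping literal occurrences of each keyword in text."""
    counts = {}
    for kw in keywords:
        # re.escape keeps any regex metacharacters in the keyword literal
        pattern = re.compile(re.escape(kw))
        counts[kw] = len(pattern.findall(text))
    return counts


# Made-up sample sentence for illustration
sample = "机器学习和深度学习推动智能机器人发展，机器学习应用广泛。"
result = count_keywords_regex(sample, ["机器学习", "深度学习", "智能机器人", "估值"])
print(result)  # {'机器学习': 2, '深度学习': 1, '智能机器人': 1, '估值': 0}
```

For literal keywords this gives the same counts as `str.count`; a regex mainly pays off when the patterns need variation, such as optional characters or alternative spellings.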

捷豹网络丶贱仔 posted on 2024-6-4 19:45

捷豹网络丶贱仔 posted on 2024-6-4 19:46

捷豹网络丶贱仔 posted on 2024-6-4 19:58

捷豹网络丶贱仔 posted on 2024-6-4 19:59

fengzi8388 posted on 2024-6-4 21:09

Leaving a mark here so I can find this again later.

abpyu posted on 2024-6-4 21:11

Add a word cloud on top of this and it's complete.

罗木魄 posted on 2024-6-5 00:26

捷豹网络丶贱仔 posted on 2024-6-4 19:59
I wrote three modified versions for you; see which one suits you.

Thank you for your guidance.


吾爱破解 - 52pojie.cn's Archiver
