Local TTS generation with the latest Kokoro model (great at everything, it just doesn't support Chinese yet)

pyjiujiu · posted on 2025-1-21 00:21

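Preface: I read in the news that a powerful lightweight TTS model had been released, so I tried it out. The results are genuinely good, and I feel I finally no longer need to freeload on edge-tts.
This post is mainly an introduction for everyone, and a chance to compare notes.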

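Notes:
1 English only for now (Chinese is not supported yet and is still in development)
2 Model page: https://hf-mirror.com/hexgrad/Kokoro-82M (a mirror that can be accessed directly)
3 This uses a new third-party helper library, kokoro-onnx. Repository: https://github.com/thewh1teagle/kokoro-onnx

4 The model file kokoro-v0_19.onnx is 329 MB (the fp32 version); download links can be found via hf or GitHub
The model can probably be quantized further, e.g. to fp16 or int8, which is something to look forward to
5 There is also a voices.json file. It is produced by the kokoro-onnx repository itself, converted from the voicepack shipped with the model release (it has to be downloaded from GitHub)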
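
As a rough sanity check on the file size in note 4 and on what quantization might save, here is a back-of-the-envelope sketch. It assumes about 82 million parameters (inferred only from the model name Kokoro-82M) and ignores any non-weight data in the file, so the numbers are estimates rather than measured sizes.

```python
# Rough weight-size estimate at different precisions.
# Assumption: ~82 million parameters, inferred from the name "Kokoro-82M".
PARAM_COUNT = 82_000_000

# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}


def estimated_size_mb(precision, params=PARAM_COUNT):
    """Return the approximate weight size in MB (1 MB = 10**6 bytes)."""
    return params * BYTES_PER_WEIGHT[precision] / 1_000_000


for name in BYTES_PER_WEIGHT:
    print(f"{name}: ~{estimated_size_mb(name):.0f} MB")
```

The fp32 estimate (about 328 MB) lines up with the 329 MB file, which suggests fp16 and int8 builds would land near half and a quarter of that size.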

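---Divider---
This started as a quick test, but with AI assistance it was easy to put together a GUI as well (it is crude and only meant for testing, so please bear with it)

* Install kokoro-onnx first (the script also imports soundfile)
pip install kokoro-onnx soundfile

* Put the two files in the same directory as the script

* A simple interface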
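
Before launching the GUI it can help to confirm that both files are where the script will look for them. The sketch below is an illustration only: `missing_files` is a hypothetical helper, and it checks the current working directory because the script opens the files by relative path, which matches the script directory only when you start it from there.

```python
from pathlib import Path

# File names as used in this post; adjust them if your download differs.
REQUIRED_FILES = ("kokoro-v0_19.onnx", "voices.json")


def missing_files(directory, names=REQUIRED_FILES):
    """Return the names from `names` that are not files inside `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


# Relative paths resolve against the current working directory.
missing = missing_files(Path.cwd())
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All required files found.")
```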

---Code---

import tkinter as tk
from tkinter import ttk, scrolledtext, messagebox
from kokoro_onnx import Kokoro
import soundfile as sf
import threading  # For running TTS in a separate thread
 
import time
from functools import wraps
from datetime import datetime
# Get the current time (evaluated once, when the script starts)
now = datetime.now()
formatted_time = now.strftime('%Y%m%d_%H%M')
 
def timeit(func):
    """
    Decorator that measures how long a function takes to run.

    Args:
        func: The function to decorate.

    Returns:
        A wrapper function that adds timing.
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        execution_time = end_time - start_time
        print(f"Function '{func.__name__}' ran in {execution_time:.2f} s")
        return result
    return wrapper
     
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
]
# Only English is supported for now
LANG_NAME = [
    "en-us",  # English
    "en-gb",  # English (British)
    "fr-fr",  # French
    "ja",  # Japanese
    "ko",  # Korean
    "cmn",  # Mandarin Chinese
]
 
class TTSApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Kokoro TTS GUI")
        self.kokoro = None  # Initialize Kokoro instance
        self.create_widgets()
 
    def create_widgets(self):
        # --- Text Area ---
        ttk.Label(self.root, text="Text to Speak:").grid(row=0, column=0, sticky="w", padx=5, pady=5)
        self.text_area = scrolledtext.ScrolledText(self.root, wrap=tk.WORD, width=60, height=10)
        self.text_area.grid(row=1, column=0, columnspan=3, padx=5, pady=5)
 
        # --- Voice Parameter ---
        ttk.Label(self.root, text="Voice:").grid(row=2, column=0, sticky="w", padx=5, pady=5)
        self.voice_var = tk.StringVar(value="af") # Default value
        self.voice_combobox = ttk.Combobox(self.root, textvariable=self.voice_var, values=VOICE_NAME)
        self.voice_combobox.grid(row=2, column=1, sticky="ew", padx=5, pady=5)
 
        # --- Speed Parameter ---
        ttk.Label(self.root, text="Speed (0.5-2.0):").grid(row=3, column=0, sticky="w", padx=5, pady=5)
        self.speed_var = tk.DoubleVar(value=1.0)  # Default speed
        self.speed_entry = ttk.Entry(self.root, textvariable=self.speed_var)
        self.speed_entry.grid(row=3, column=1, sticky="ew", padx=5, pady=5)
 
        # --- Language Parameter ---
        ttk.Label(self.root, text="Language:").grid(row=4, column=0, sticky="w", padx=5, pady=5)
        self.lang_var = tk.StringVar(value="en-us")
        self.lang_entry = ttk.Entry(self.root, textvariable=self.lang_var)
        self.lang_entry.grid(row=4, column=1, sticky="ew", padx=5, pady=5)
 
 
        # --- Run Button ---
        self.run_button = ttk.Button(self.root, text="Generate Speech", command=self.run_tts)
        self.run_button.grid(row=5, column=0, columnspan=3, pady=10)
        self.root.columnconfigure(1,weight=1) # Make column expand
 
    def run_tts(self):
        text = self.text_area.get("1.0", "end-1c").strip()
        voice = self.voice_var.get()
        try:
            speed = float(self.speed_var.get())
        except (ValueError, tk.TclError):
            # DoubleVar.get() raises tk.TclError when the entry is not a number
            messagebox.showerror("Error", "Invalid speed value.")
            return
        if not 0.5 <= speed <= 2.0:
            messagebox.showerror("Error", "Speed must be between 0.5 and 2.0")
            return
        lang = self.lang_var.get()
        if not text:
            messagebox.showerror("Error", "Please enter text to speak.")
            return

        self.run_button.config(text='processing', state=tk.DISABLED)  # Disable button during processing
        threading.Thread(target=self.perform_tts, args=(text, voice, speed, lang)).start()
 
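    @timeit
    def perform_tts(self, text, voice, speed, lang, audio_format='mp3'):
        # Runs in a worker thread, so GUI work is handed back to the
        # Tk main loop with root.after() instead of being done here.
        try:
            if self.kokoro is None:
                self.kokoro = Kokoro("kokoro-v0_19.onnx", "voices.json")
            samples, sample_rate = self.kokoro.create(
                text,
                voice=voice,
                speed=speed,
                lang=lang
            )
            # Fresh timestamp (with seconds) per run so earlier files are not overwritten
            stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            output_file = f"audio_{voice}_{stamp}.{audio_format}"
            # Writing MP3 needs a soundfile build backed by libsndfile 1.1.0 or newer
            sf.write(output_file, samples, sample_rate)
            self.root.after(0, self.on_success, output_file)
        except Exception as e:
            self.root.after(0, messagebox.showerror, "Error", f"An error occurred: {e}")
        finally:
            # Restore the button label and re-enable it
            self.root.after(0, lambda: self.run_button.config(text='Generate Speech', state=tk.NORMAL))

    def on_success(self, output_file):
        messagebox.showinfo("Success", f"Audio generated successfully: {output_file}")
        # Clear the text area
        self.text_area.delete('1.0', tk.END)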
 
if __name__ == "__main__":
    root = tk.Tk()
    app = TTSApp(root)
    root.mainloop()
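
A note on the threading in the script above: Tkinter widgets are generally best touched only from the main thread, and a common way to arrange that is to let the worker put its result on a `queue.Queue` that the main thread reads. The sketch below shows only that hand-off. `fake_tts` is a hypothetical stand-in for the slow Kokoro call so the example runs without the model, and a real Tkinter app would poll the queue with `root.after(...)` and `get_nowait()` rather than block as done here.

```python
import queue
import threading


def fake_tts(text):
    """Hypothetical stand-in for the slow Kokoro call."""
    return f"audio for {text!r}"


def worker(text, results):
    """Runs in a background thread and never touches GUI widgets."""
    try:
        results.put(("ok", fake_tts(text)))
    except Exception as exc:  # report failures to the main thread as well
        results.put(("error", str(exc)))


def run_and_collect(text):
    """Start the worker, then receive its message on the calling thread."""
    results = queue.Queue()
    thread = threading.Thread(target=worker, args=(text, results), daemon=True)
    thread.start()
    # Blocking get() keeps this demo short; a GUI would poll instead
    # so that its event loop keeps running.
    status, payload = results.get(timeout=5)
    thread.join()
    return status, payload


print(run_and_collect("hello"))
```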

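zhengzhenhui945 · posted on 2025-1-21 01:39

The English this model reads out has a really nice quality to it. A pity there is no Chinese; looking forward to it.

没方向感 · posted on 2025-1-23 17:00

Added playback:
pip install kokoro-onnx soundfile pygame

Download the model data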
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/voices.bin
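
For systems without wget, the same two downloads could be done from Python with the standard library. This is only a sketch: `local_name` and `download_missing` are hypothetical helper names, the URLs are the two from the commands above, and the download is left as a commented-out call because it needs network access and fetches several hundred MB.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# The two URLs from the wget commands in this post.
MODEL_URLS = [
    "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/kokoro-v0_19.onnx",
    "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files/voices.bin",
]


def local_name(url):
    """Return the file name at the end of a URL path."""
    return Path(urlparse(url).path).name


def download_missing(urls=MODEL_URLS, target_dir="."):
    """Download each URL into target_dir unless the file already exists."""
    target = Path(target_dir)
    fetched = []
    for url in urls:
        destination = target / local_name(url)
        if destination.exists():
            continue  # skip files that were already downloaded
        urllib.request.urlretrieve(url, str(destination))
        fetched.append(destination.name)
    return fetched


# Usage (requires network access):
# download_missing()
```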

import tkinter as tk
from tkinter import ttk, scrolledtext, messagebox
from kokoro_onnx import Kokoro
import soundfile as sf
import threading  # For running TTS in a separate thread
 
 
import pygame
 
import time
from functools import wraps
from datetime import datetime
# Get the current time (evaluated once, when the script starts)
now = datetime.now()
formatted_time = now.strftime('%Y%m%d_%H%M')
 
 
VOICE_NAME = [
    'af', # Default voice is a 50-50 mix of Bella & Sarah
    'af_bella', 'af_sarah', 'am_adam', 'am_michael',
    'bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis',
    'af_nicole', 'af_sky',
]
# Only English is supported for now
LANG_NAME =[
   "en-us",  # English
   "en-gb",  # English (British)
   "fr-fr",  # French
   "ja",  # Japanese
   "ko",  # Korean
   "cmn",  # Mandarin Chinese
]
 
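def play_mp3(file_path):
    pygame.mixer.init()  # Initialize the mixer
    pygame.mixer.music.load(file_path)  # Load the MP3 file
    pygame.mixer.music.play()  # Start playback

    while pygame.mixer.music.get_busy():  # Wait until playback finishes
        time.sleep(0.1)  # Sleep briefly instead of spinning the CPU

    pygame.mixer.music.unload()  # Unload the music
    pygame.mixer.quit()  # Shut down the mixer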
     
     
class TTSApp:
    def __init__(self, root):
        self.root = root
        self.root.title("Kokoro TTS GUI")
        self.kokoro = None  # Initialize Kokoro instance
        self.create_widgets()
  
    def create_widgets(self):
        # --- Text Area ---
        ttk.Label(self.root, text="Text to read aloud:").grid(row=0, column=0, sticky="w", padx=5, pady=5)

        self.text_area = scrolledtext.ScrolledText(self.root, wrap=tk.WORD, width=60, height=10)
        self.text_area.grid(row=1, column=0, columnspan=3, padx=5, pady=5)

        # --- Voice Parameter ---
        ttk.Label(self.root, text="Voice:").grid(row=2, column=0, sticky="w", padx=5, pady=5)
        self.voice_var = tk.StringVar(value="af")  # Default value
        self.voice_combobox = ttk.Combobox(self.root, textvariable=self.voice_var, values=VOICE_NAME)
        self.voice_combobox.grid(row=2, column=1, sticky="ew", padx=5, pady=5)

        # --- Speed Parameter ---
        ttk.Label(self.root, text="Speed (0.5-2.0):").grid(row=3, column=0, sticky="w", padx=5, pady=5)
        self.speed_var = tk.DoubleVar(value=1.0)  # Default speed
        self.speed_entry = ttk.Entry(self.root, textvariable=self.speed_var)
        self.speed_entry.grid(row=3, column=1, sticky="ew", padx=5, pady=5)

        # --- Language Parameter ---
        ttk.Label(self.root, text="Language:").grid(row=4, column=0, sticky="w", padx=5, pady=5)
        self.lang_var = tk.StringVar(value="en-us")
        self.lang_entry = ttk.Entry(self.root, textvariable=self.lang_var)
        self.lang_entry.grid(row=4, column=1, sticky="ew", padx=5, pady=5)

        # --- Information ---
        self.info = ttk.Label(self.root, text="")
        self.info.grid(row=5, column=1, sticky="ew", padx=5, pady=5, columnspan=2)

        # --- Run Button ---
        self.run_button = ttk.Button(self.root, text="Generate Speech", command=self.run_tts)
        self.run_button.grid(row=6, column=0, columnspan=3, pady=10)
        self.root.columnconfigure(1, weight=1)  # Make column expand

    def run_tts(self):
        text = self.text_area.get("1.0", "end-1c").strip()
        voice = self.voice_var.get()
        try:
            speed = float(self.speed_var.get())
        except (ValueError, tk.TclError):
            # DoubleVar.get() raises tk.TclError when the entry is not a number
            messagebox.showerror("Error", "Invalid speed value.")
            return
        if not 0.5 <= speed <= 2.0:
            messagebox.showerror("Error", "Speed must be between 0.5 and 2.0")
            return
        lang = self.lang_var.get()
        if not text:
            messagebox.showerror("Error", "Please enter text to speak.")
            return

        self.run_button.config(text='Processing', state=tk.DISABLED)  # Disable button during processing
        threading.Thread(target=self.perform_tts, args=(text, voice, speed, lang)).start()

    def set_info(self, message, button_text=None):
        # Schedule status updates on the Tk main loop (called from the worker thread)
        def update():
            self.info.config(text=message)
            if button_text is not None:
                self.run_button.config(text=button_text)
        self.root.after(0, update)

    def perform_tts(self, text, voice, speed, lang, audio_format='mp3'):
        # Runs in a worker thread; GUI work goes through root.after()
        self.set_info('Processing')
        try:
            if self.kokoro is None:
                self.kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")
            samples, sample_rate = self.kokoro.create(text, voice=voice, speed=speed, lang=lang)
            # Fresh timestamp (with seconds) per run so earlier files are not overwritten
            stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            output_file = f"audio_{voice}_{stamp}.{audio_format}"
            sf.write(output_file, samples, sample_rate)

            self.set_info('Audio generated, now playing...', button_text='Playing')
            play_mp3(output_file)  # Blocks this worker thread until playback ends
        except Exception as e:
            self.root.after(0, messagebox.showerror, "Error", f"An error occurred: {e}")
        finally:
            # Clear the status line, restore the button label and re-enable it
            self.set_info('', button_text='Generate Speech')
            self.root.after(0, lambda: self.run_button.config(state=tk.NORMAL))
  
if __name__ == "__main__":
    root = tk.Tk()
    app = TTSApp(root)
    root.mainloop()

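pyjiujiu · posted on 2025-1-21 18:38

Last edited by pyjiujiu on 2025-1-23 21:31

zhangsan2022 posted on 2025-1-21 09:53
When will the Chinese version be supported? Looking forward to it.

An update first:
1-21: According to the model author, the next version will be released before the end of this month.

Version 0.23, which has not been released yet (an intermediate experimental version), can be tried on Hugging Face (its Chinese is about on par with edge, but it cannot read letters and digits mixed into the text, and it still has many flaws)
Link: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

---Divider---
According to Hugging Face discussion #36 (1-12):
The currently released version is 0.19 (released in December). The author, hexgrad, has actually finished training version 0.23 but is not ready to release it yet (it reportedly already supports Chinese) and now plans to continue training,
because the community keeps supplying him with richer data, and processing that data also takes time.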

- If successful, you should expect the next-gen Kokoro model to ship with more voices and languages, also under an Apache 2.0 license, with a similar 82M parameter architecture.

- If unsuccessful, it would most likely be because the model does not converge, i.e. loss does not go down. That could be because of data quality issues, architecture limitations, overfitting on old data,
underfitting on new data, etc. Rollbacks and model collapse are not unheard of in ML, but fingers crossed it does not happen here—or if they do, that I can address such issues should they come up.

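According to other posts, the 0.19 architecture lacks an encoder (the architecture is based on StyleTTS 2). A version with an encoder is planned, and the author has stated clearly that voice cloning will be implemented (it requires doing your own post-training).
Because the base model has so few parameters, the author is confident this will be the simplest voice cloning implementation.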

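Do_zh · posted on 2025-1-21 08:32

Hoping Chinese support arrives soon.

buybuy · posted on 2025-1-21 08:32

No Chinese, so I'll hold off on trying it for now.

kongson · posted on 2025-1-21 08:48

This looks good. Saved for later, thanks.

SherlockProel · posted on 2025-1-21 09:25

Nice, I'll download it and play around.

13534870834 · posted on 2025-1-21 09:44

TTS services all charge money these days.

zhangsan2022 · posted on 2025-1-21 09:53

When will the Chinese version be supported? Looking forward to it.

mrpizi1221 · posted on 2025-1-21 10:03

Thanks for sharing, showing my support!

rhci · posted on 2025-1-21 10:45

Waiting for Chinese support; bookmarking for now.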
