divisor.acestep.language_segmentation.LangSegment

This file bundles language identification functions.

Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.

Original code: Copyright (c) 2011 Marco Lui <saffsd@gmail.com>. Based on research by Marco Lui and Tim Baldwin.

See LICENSE file for more info. https://github.com/adbar/py3langid

Projects: https://github.com/juntaosun/LangSegment

LICENSE: py3langid - Language Identifier BSD 3-Clause License

Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.

Original code: Copyright (c) 2011 Marco Lui <saffsd@gmail.com>. Based on research by Marco Lui and Tim Baldwin.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import os
import re
import sys
import numpy as np
from collections import Counter
from collections import defaultdict

# import langid
# import py3langid as langid
# pip install py3langid==0.2.2

# Enable normalization of the language prediction probabilities so that scores fall in the 0-1 range.
# langid.py disables probability normalization by default. For command-line usage of langid.py, it can be enabled by passing the -n flag.
# For probability normalization in library use, the user must instantiate their own LanguageIdentifier. An example of such usage is as follows:
from py3langid.langid import LanguageIdentifier, MODEL_FILE

from divisor.acestep.language_segmentation.utils.num import num2str
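
As a rough illustration of what probability normalization means here, the sketch below maps unnormalized log-scores into the 0-1 range with a softmax. It is not py3langid's internal code; the function name `normalize_scores` and the sample scores are hypothetical and only show the general idea.

```python
import math


def normalize_scores(log_scores):
    """Map unnormalized log-scores to probabilities that sum to 1 (softmax)."""
    # Subtract the maximum first so math.exp never overflows.
    top = max(log_scores.values())
    exps = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


# Hypothetical raw scores for three candidate languages.
probs = normalize_scores({"zh": -10.0, "ja": -12.0, "en": -30.0})
best = max(probs, key=probs.get)
```

With normalized scores, a fixed cut-off such as the 0.89 priority threshold used later in this module has a consistent meaning across inputs.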

# -----------------------------------
# Changelog: The new version of the word segmentation is more accurate.
# -----------------------------------


# ===========================================================================================================
# Word segmentation function:
# automatically identify and split the text (e.g. Chinese/English/Japanese/Korean) in an article or sentence by language,
# making it more suitable for TTS processing.
# This code is designed for front-end annotation of mixed multilingual text, and for multilingual mixed training and inference in various TTS projects.
# ===========================================================================================================
# (1) Automatic segmentation: "韩语中的오빠读什么呢?あなたの体育の先生は誰ですか? 此次发布会带来了四款iPhone 15系列机型"
# (2) Manual segmentation: "你的名字叫<ja>佐々木?<ja>吗?"
# The results mainly target (Chinese = zh, Japanese = ja, English = en, Korean = ko), but mixtures of up to 97 different languages can be processed.
# ===========================================================================================================


# ===========================================================================================================
# Manual segmentation tag specification: <language tag>text content</language tag>
# ===========================================================================================================
# For manual segmentation, tags must appear in pairs, such as: "<ja>佐々木<ja>" or "<ja>佐々木</ja>"
# Incorrect example: "你的名字叫<ja>佐々木。" A single unpaired <ja> tag in a sentence is ignored and not processed.
# ===========================================================================================================
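
The pairing rule above can be checked with a minimal sketch. It reuses the regular expression that this module assigns to `LangSegment.SYMBOLS_PATTERN`; the helper name `find_tagged` is hypothetical and not part of the module.

```python
import re

# Same expression as LangSegment.SYMBOLS_PATTERN in this module.
SYMBOLS_PATTERN = re.compile(r"(<([a-zA-Z|-]*)>(.*?)<\/*[a-zA-Z|-]*>)")


def find_tagged(text):
    """Return (language, content) pairs for every paired tag in text."""
    return [(m.group(2), m.group(3)) for m in SYMBOLS_PATTERN.finditer(text)]


# Both closing styles, </ja> and a repeated <en>, count as a pair.
paired = find_tagged("你的名字叫<ja>佐々木</ja>吗?<en>OK<en>")
# A single opening tag has no partner, so nothing is matched.
unpaired = find_tagged("你的名字叫<ja>佐々木。")
```

Because the closing part accepts zero or more slashes, `<ja>...<ja>` and `<ja>...</ja>` are treated the same way.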


# ===========================================================================================================
# Speech Synthesis Markup Language (SSML): only its tags are supported here (not XML).
# Want to support more SSML tags? PRs are welcome!
# Note: In addition to Chinese, it can also be modified to support multi-language SSML, not just Chinese.
# ===========================================================================================================
# Chinese implementation:
# [SSML] <number> = read digits one at a time as Chinese numerals
# [SSML] <telephone> = read digits as a Chinese telephone number, one digit at a time
# [SSML] <currency> = read as a monetary amount.
# [SSML] <date> = read as a date. Supports inputs such as 2024年08月24, 2024/8/24, 2024-08, 08-24, 24.
# ===========================================================================================================
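
The date inputs listed above differ only in their separators and in which fields are present. The sketch below is a simplified, standalone illustration of that normalization step, not the module's `_format_chinese_data`; the name `split_date` is hypothetical, and time-of-day handling is left out.

```python
import re


def split_date(date_str):
    """Normalize separators and return (year, month, day); absent fields are ''."""
    # Treat '/', '.', '_', 年 and 月 as '-', and drop any 日.
    normalized = re.sub(r"[/._年月]", "-", date_str.strip()).replace("日", "")
    parts = [p for p in normalized.split("-") if p]
    year = month = day = ""
    if len(parts) == 3:
        year, month, day = parts
    elif len(parts) == 2:
        # A four-character first field is taken as a year, otherwise as a month.
        if len(parts[0]) == 4:
            year, month = parts
        else:
            month, day = parts
    elif len(parts) == 1:
        if len(parts[0]) == 4:
            year = parts[0]
        else:
            day = parts[0]
    return year, month, day
```

For example, `split_date("2024/8/24")` yields the three fields separately, while `split_date("08-24")` leaves the year empty.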
class LangSSML:
    def __init__(self):
        # Plain digits mapped to Chinese numerals
        self._zh_numerals_number = {
            "0": "零",
            "1": "一",
            "2": "二",
            "3": "三",
            "4": "四",
            "5": "五",
            "6": "六",
            "7": "七",
            "8": "八",
            "9": "九",
        }

    # Standardize 2024/8/24, 2024-08, 08-24, 24 to "year-month-day"
    def _format_chinese_data(self, date_str: str):
        # Normalize the date format
        input_date = date_str
        if date_str is None or date_str.strip() == "":
            return ""
        date_str = re.sub(r"[\/\._|年|月]", "-", date_str)
        date_str = re.sub(r"日", r"", date_str)
        date_arrs = date_str.split(" ")
        if len(date_arrs) == 1 and ":" in date_arrs[0]:
            # The input holds only a time of day
            time_str = date_arrs[0]
            date_arrs = []
        else:
            time_str = date_arrs[1] if len(date_arrs) >= 2 else ""

        def nonZero(num, cn, func=None):
            if func is not None:
                num = func(num)
            return f"{num}{cn}" if num is not None and num != "" and num != "0" else ""

        f_number = self.to_chinese_number
        f_currency = self.to_chinese_currency
        # year, month, day
        year_month_day = ""
        if len(date_arrs) > 0:
            year, month, day = "", "", ""
            parts = date_arrs[0].split("-")
            if len(parts) == 3:  # format: YYYY-MM-DD
                year, month, day = parts
            elif len(parts) == 2:  # format: MM-DD or YYYY-MM
                if len(parts[0]) == 4:  # year-month
                    year, month = parts
                else:
                    month, day = parts  # month-day
            elif len(parts[0]) > 0:  # only a day or a year
                if len(parts[0]) == 4:
                    year = parts[0]
                else:
                    day = parts[0]
            year, month, day = (
                nonZero(year, "年", f_number),
                nonZero(month, "月", f_currency),
                nonZero(day, "日", f_currency),
            )
            year_month_day = re.sub(r"([年|月|日])+", r"\1", f"{year}{month}{day}")
        # hours, minutes, seconds
        time_str = re.sub(r"[\/\.\-:_]", ":", time_str)
        time_arrs = time_str.split(":")
        hours, minutes, seconds = "", "", ""
        if len(time_arrs) == 3:  # H/M/S
            hours, minutes, seconds = time_arrs
        elif len(time_arrs) == 2:  # H/M
            hours, minutes = time_arrs
        elif len(time_arrs[0]) > 0:
            hours = f"{time_arrs[0]}点"  # H
        if len(time_arrs) > 1:
            hours, minutes, seconds = (
                nonZero(hours, "点", f_currency),
                nonZero(minutes, "分", f_currency),
                nonZero(seconds, "秒", f_currency),
            )
        hours_minutes_seconds = re.sub(r"([点|分|秒])+", r"\1", f"{hours}{minutes}{seconds}")
        output_date = f"{year_month_day}{hours_minutes_seconds}"
        return output_date

    # [SSML] number = read digits one at a time as Chinese numerals
    def to_chinese_number(self, num: str):
        pattern = r"(\d+)"
        zh_numerals = self._zh_numerals_number
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                # Digits without a mapping are dropped
                output += "".join(zh_numerals[digit] if digit in zh_numerals else "" for digit in str(item))
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] telephone = read digits as a Chinese telephone number, one digit at a time
    def to_chinese_telephone(self, num: str):
        output = self.to_chinese_number(num.replace("+86", ""))  # zh +86
        output = output.replace("一", "幺")
        return output

    # [SSML] currency = read as a monetary amount.
    # Digital processing from GPT_SoVITS num.py (thanks)
    def to_chinese_currency(self, num: str):
        pattern = r"(\d+)"
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += num2str(item)
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] date = read as a date. Supports inputs such as 2024年08月24, 2024/8/24, 2024-08, 08-24, 24.
    def to_chinese_date(self, num: str):
        chinese_date = self._format_chinese_data(num)
        return chinese_date


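
The digit-by-digit readings used for `<number>` and `<telephone>` can be sketched without the class. This is a minimal standalone illustration under the same digit table; `read_digits` and `read_telephone` are hypothetical names, not the module's API.

```python
import re

# Digit table matching LangSSML._zh_numerals_number.
ZH_DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))


def read_digits(text):
    """Replace every ASCII digit with its Chinese numeral; '.' becomes 点."""
    out = re.sub(r"[0-9]", lambda m: ZH_DIGITS[m.group(0)], text)
    return out.replace(".", "点")


def read_telephone(text):
    """Telephone style: drop a +86 prefix and read 一 as 幺."""
    return read_digits(text.replace("+86", "")).replace("一", "幺")


year_reading = read_digits("2024")
phone_reading = read_telephone("+8613800")
```

The 幺 substitution follows the spoken convention for the digit 1 in Chinese telephone numbers, which is the only difference between the two readings.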
class LangSegment:
    def __init__(self):
        self.langid = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)

        self._text_cache = None
        self._text_lasts = None
        self._text_langs = None
        self._text_waits = None
        self._lang_count = None
        self._lang_eos = None

        # Customizable language matching tags. All of these forms are supported:
        # <zh>你好<zh> , <ja>佐々木</ja> , <en>OK<en> , <ko>오빠</ko>
        self.SYMBOLS_PATTERN = r"(<([a-zA-Z|-]*)>(.*?)<\/*[a-zA-Z|-]*>)"

        # The language filter group function allows you to specify which languages to keep.
        # Languages not in the filter group will be cleared. You can match the languages supported by your TTS system as you like.
        # The earlier a language appears in the list, the higher its priority.

        # System default filter (ISO 639-1 codes given).
        # ----------------------------------------------------------------------------------------------------------------------------------
        # "zh" = Chinese, "en" = English, "ja" = Japanese, "ko" = Korean, "fr" = French, "vi" = Vietnamese, "ru" = Russian
        # "th" = Thai
        # ----------------------------------------------------------------------------------------------------------------------------------
        self.DEFAULT_FILTERS = ["zh", "ja", "ko", "en"]

        # User-defined filters
        self.Langfilters = self.DEFAULT_FILTERS[:]  # make a copy

        # Merge text
        self.isLangMerge = True

        # Experimental: You can customize to add: "fr" French, "vi" Vietnamese.
        # Enable through the API: self.setfilters(["zh", "en", "ja", "ko", "fr", "vi" , "ru" , "th"])

        # Preview feature, automatically enabled or disabled, no settings required
        self.EnablePreview = False

        # In addition to that, it supports abbreviation filters, allowing for any combination of different languages.
        # Example: You can specify any combination to filter

        # Chinese and Japanese language priority threshold (score range is 0 ~ 1): The default threshold is 0.89.
        # When the score is below the threshold (< 0.89), the priority order in the filters is applied.
        # Only the common characters between Chinese and Japanese are processed with confidence and priority.
        self.LangPriorityThreshold = 0.89

        # Langfilters = ["zh"]              # recognize as Chinese
        # Langfilters = ["en"]              # recognize as English
        # Langfilters = ["ja"]              # recognize as Japanese
        # Langfilters = ["ko"]              # recognize as Korean
        # Langfilters = ["zh_ja"]           # mixed Chinese/Japanese
        # Langfilters = ["zh_en"]           # mixed Chinese/English
        # Langfilters = ["ja_en"]           # mixed Japanese/English
        # Langfilters = ["zh_ko"]           # mixed Chinese/Korean
        # Langfilters = ["ja_ko"]           # mixed Japanese/Korean
        # Langfilters = ["en_ko"]           # mixed English/Korean
        # Langfilters = ["zh_ja_en"]        # mixed Chinese/Japanese/English
        # Langfilters = ["zh_ja_en_ko"]     # mixed Chinese/Japanese/English/Korean

        # For more filter combinations, please feel free to combine them as needed.

        # Optional: keep the Chinese digit-pinyin format, which makes it easier for the front end to modify
        # pinyin phonemes and run inference. Off (False) by default.
        # When enabled (True), digit-pinyin notation inside parentheses is preserved and output as "zh" Chinese.
        self.keepPinyin = False

        # DEFINITION
        self.PARSE_TAG = re.compile(r"(⑥\$*\d+[\d]{6,}⑥)")

        self.LangSSML = LangSSML()

    def _clears(self):
        self._text_cache = None
        self._text_lasts = None
        self._text_langs = None
        self._text_waits = None
        self._lang_count = None
        self._lang_eos = None

    def _is_english_word(self, word):
        return bool(re.match(r"^[a-zA-Z]+$", word))

    def _is_chinese(self, word):
        # True if any character falls in the CJK Unified Ideographs block
        for char in word:
            if "\u4e00" <= char <= "\u9fff":
                return True
        return False

    def _is_japanese_kana(self, word):
        # Hiragana (U+3040-U+309F) or Katakana (U+30A0-U+30FF)
        pattern = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]+")
        matches = pattern.findall(word)
        return len(matches) > 0

    def _insert_english_uppercase(self, word):
        # Insert a space before each uppercase letter that is not at a word boundary
        modified_text = re.sub(r"(?<!\b)([A-Z])", r" \1", word)
        modified_text = modified_text.strip("-")
        return modified_text + " "

    def _split_camel_case(self, word):
        return re.sub(r"(?<!^)(?=[A-Z])", " ", word)

    def _statistics(self, language, text):
        # Language word statistics:
        # Chinese characters usually occupy double bytes
        if self._lang_count is None or not isinstance(self._lang_count, defaultdict):
            self._lang_count = defaultdict(int)
        lang_count = self._lang_count
        if "|" not in language:
            lang_count[language] += int(len(text) * 2) if language == "zh" else len(text)
        self._lang_count = lang_count

    def _clear_text_number(self, text):
        if text == "\n":
            return text, False  # Keep Line Breaks
        clear_text = re.sub(r"([^\w\s]+)", "", re.sub(r"\n+", "", text)).strip()
        is_number = len(re.sub(re.compile(r"(\d+)"), "", clear_text)) == 0
        return clear_text, is_number

    def _saveData(self, words, language: str, text: str, score: float, symbol=None):
        # Pre-detection
        clear_text, is_number = self._clear_text_number(text)
        # Merge the same language and save the results
        preData = words[-1] if len(words) > 0 else None
        if symbol is not None:
            pass
        elif preData is not None and preData["symbol"] is None:
            # Punctuation-only or numeric text inherits the previous language
            if len(clear_text) == 0 or is_number:
                language = preData["lang"]
            _, pre_is_number = self._clear_text_number(preData["text"])
            if preData["lang"] == language:
                self._statistics(preData["lang"], text)
                text = preData["text"] + text
                preData["text"] = text
                return preData
            elif pre_is_number:
                text = f"{preData['text']}{text}"
                words.pop()
        elif is_number:
            priority_language = self._get_filters_string()[:2]
            if priority_language in "ja-zh-en-ko-fr-vi":
                language = priority_language
        data = {"lang": language, "text": text, "score": score, "symbol": symbol}
        filters = self.Langfilters
        if (
            filters is None
            or len(filters) == 0
            or "?" in language
            or language in filters
            or language in filters[0]
            or filters[0] == "*"
            or filters[0] in "alls-mixs-autos"
        ):
            words.append(data)
            self._statistics(data["lang"], data["text"])
        return data

    def _addwords(self, words, language, text, score, symbol=None):
        if text == "\n":
            pass  # Keep Line Breaks
        elif text is None or len(text.strip()) == 0:
            return True
        if language is None:
            language = ""
        language = language.lower()
        if language == "en":
            text = self._insert_english_uppercase(text)
        # text = re.sub(r'[(())]', ',' , text) # Keep it.
        text_waits = self._text_waits
        ispre_waits = len(text_waits) > 0
        preResult = text_waits.pop() if ispre_waits else None
        if preResult is None:
            preResult = words[-1] if len(words) > 0 else None
        if preResult and ("|" in preResult["lang"]):
            # Resolve a pending ambiguous language such as "zh|ja"
            pre_lang = preResult["lang"]
            if language in pre_lang:
                preResult["lang"] = language = language.split("|")[0]
            else:
                preResult["lang"] = pre_lang.split("|")[0]
            if ispre_waits:
                preResult = self._saveData(
                    words,
                    preResult["lang"],
                    preResult["text"],
                    preResult["score"],
                    preResult["symbol"],
                )
        pre_lang = preResult["lang"] if preResult else None
        if ("|" in language) and (pre_lang and pre_lang not in language and "…" not in language):
            language = language.split("|")[0]
        if "|" in language:
            # Still ambiguous: hold the segment until the next one arrives
            self._text_waits.append({"lang": language, "text": text, "score": score, "symbol": symbol})
        else:
            self._saveData(words, language, text, score, symbol)
        return False

    def _get_prev_data(self, words):
        data = words[-1] if words and len(words) > 0 else None
        if data:
            return (data["lang"], data["text"])
        return (None, "")

    def _match_ending(self, text, index):
        if text is None or len(text) == 0:
            return False, None
        text = re.sub(r"\s+", "", text)
        if len(text) == 0 or abs(index) > len(text):
            return False, None
        ending_pattern = re.compile(r'([「」“”‘’"\'::。.!!?.?])')
        return ending_pattern.match(text[index]), text[index]

    def _cleans_text(self, cleans_text):
        # Replace runs of non-word characters with a space, then collapse repeated characters
        cleans_text = re.sub(r"(.*?)([^\w]+)", r"\1 ", cleans_text)
        cleans_text = re.sub(r"(.)\1+", r"\1", cleans_text)
        return cleans_text.strip()

    def _mean_processing(self, text: str):
        if text is None or (text.strip()) == "":
            return None, 0.0
        arrs = self._split_camel_case(text).split(" ")
        langs = []
        for t in arrs:
            # Words of three characters or fewer are too short to classify
            if len(t.strip()) <= 3:
                continue
            language, score = self.langid.classify(t)
            langs.append({"lang": language})
        if len(langs) == 0:
            return None, 0.0
        # Majority vote over the per-word predictions
        return Counter([item["lang"] for item in langs]).most_common(1)[0][0], 1.0

    def _lang_classify(self, cleans_text):
        language, score = self.langid.classify(cleans_text)
        # fix: Huggingface is np.float32
        if score is not None and isinstance(score, np.generic) and hasattr(score, "item"):
            score = score.item()
        score = round(score, 3)
        return language, score

    def _get_filters_string(self):
        filters = self.Langfilters
        return "-".join(filters).lower().strip() if filters is not None else ""

    def _parse_language(self, words, segment):
        LANG_JA = "ja"
        LANG_ZH = "zh"
        LANG_ZH_JA = f"{LANG_ZH}|{LANG_JA}"
        LANG_JA_ZH = f"{LANG_JA}|{LANG_ZH}"
        language = LANG_ZH
        regex_pattern = re.compile(r"([^\w\s]+)")
        lines = regex_pattern.split(segment)
        lines_max = len(lines)
        LANG_EOS = self._lang_eos
        for index, text in enumerate(lines):
            if len(text) == 0:
                continue
            EOS = index >= (lines_max - 1)
            nextId = index + 1
            nextText = lines[nextId] if not EOS else ""
            nextPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", nextText)).strip()) == 0
            textPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", text)).strip()) == 0
            # Glue punctuation-only pieces onto the following piece
            if not EOS and (textPunc or (len(nextText.strip()) >= 0 and nextPunc)):
                lines[nextId] = f"{text}{nextText}"
                continue
            number_tags = re.compile(r"(⑥\d{6,}⑥)")
            cleans_text = re.sub(number_tags, "", text)
            cleans_text = re.sub(r"\d+", "", cleans_text)
            cleans_text = self._cleans_text(cleans_text)
            # fix: langid is inaccurate on very short text, so short pieces are joined to the next piece first.
            if not EOS and len(cleans_text) <= 2:
                lines[nextId] = f"{text}{nextText}"
                continue
            language, score = self._lang_classify(cleans_text)
            prev_language, prev_text = self._get_prev_data(words)
            if language != LANG_ZH and all("\u4e00" <= c <= "\u9fff" for c in re.sub(r"\s", "", cleans_text)):
                language, score = LANG_ZH, 1
            if len(cleans_text) <= 5 and self._is_chinese(cleans_text):
                filters_string = self._get_filters_string()
                if score < self.LangPriorityThreshold and len(filters_string) > 0:
                    # Low confidence: prefer whichever of ja/zh comes first in the filters
                    index_ja, index_zh = filters_string.find(LANG_JA), filters_string.find(LANG_ZH)
                    if index_ja != -1 and index_ja < index_zh:
                        language = LANG_JA
                    elif index_zh != -1 and index_zh < index_ja:
                        language = LANG_ZH
                if self._is_japanese_kana(cleans_text):
                    language = LANG_JA
                elif len(cleans_text) > 2 and score > 0.90:
                    pass
                elif EOS and LANG_EOS:
                    language = LANG_ZH if len(cleans_text) <= 1 else language
                else:
                    LANG_UNKNOWN = LANG_ZH_JA if language == LANG_ZH or (len(cleans_text) <= 2 and prev_language == LANG_ZH) else LANG_JA_ZH
                    match_end, match_char = self._match_ending(text, -1)
                    referen = prev_language in LANG_UNKNOWN or LANG_UNKNOWN in prev_language if prev_language else False
                    if match_char in "。.":
                        language = prev_language if referen and len(words) > 0 else language
                    else:
                        language = f"{LANG_UNKNOWN}|…"
            text, *_ = re.subn(number_tags, self._restore_number, text)
            self._addwords(words, language, text, score)

    # ----------------------------------------------------------
    # [SSML] Chinese Number Processing (SSML support)
    # The default here is Chinese, which is used to process SSML Chinese tags. Of course, any language can be supported, for example:
    # Chinese telephone number: <telephone>1234567</telephone>
    # Chinese digit sequence: <number>1234567</number>
    def _process_symbol_SSML(self, words, data):
        tag, match = data
        language = SSML = match[1]
        text = match[2]
        score = 1.0
        if SSML == "telephone":
            # Chinese: telephone number
            language = "zh"
            text = self.LangSSML.to_chinese_telephone(text)
        elif SSML == "number":
            # Chinese: digits read one at a time
            language = "zh"
            text = self.LangSSML.to_chinese_number(text)
        elif SSML == "currency":
            # Chinese: read as a monetary amount
            language = "zh"
            text = self.LangSSML.to_chinese_currency(text)
        elif SSML == "date":
            # Chinese: read as a date
            language = "zh"
            text = self.LangSSML.to_chinese_date(text)
        self._addwords(words, language, text, score, SSML)

    # ----------------------------------------------------------
    def _restore_number(self, matche):
        # Swap a cached placeholder key back for its original matched text
        value = matche.group(0)
        text_cache = self._text_cache
        if value in text_cache:
            process, data = text_cache[value]
            tag, match = data
            value = match
        return value

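    def _pattern_symbols(self, item, text):
        if text is None:
            return text
        tag, pattern, process = item
        matches = pattern.findall(text)
        if len(matches) == 1 and "".join(matches[0]) == text:
            return text
        for i, match in enumerate(matches):
            # Replace each match with a placeholder key and cache the original
            key = f"⑥{tag}{i:06d}⑥"
            text = re.sub(pattern, key, text, count=1)
            self._text_cache[key] = (process, (tag, match))
        return text

    def _process_symbol(self, words, data):
        tag, match = data
        language = match[1]
        text = match[2]
        score = 1.0
        filters = self._get_filters_string()
        if language not in filters:
            self._process_symbol_SSML(words, data)
        else:
            self._addwords(words, language, text, score, True)

    def _process_english(self, words, data):
        tag, match = data
        text = match[0]
        filters = self._get_filters_string()
        priority_language = filters[:2]
        # Preview feature, other language segmentation processing
        enablePreview = self.EnablePreview
        if enablePreview:
            # Experimental: Other language support
            regex_pattern = re.compile(r"(.*?[。.??!!]+[\n]{,1})")
            lines = regex_pattern.split(text)
            for index, text in enumerate(lines):
                if len(text.strip()) == 0:
                    continue
                cleans_text = self._cleans_text(text)
                language, score = self._lang_classify(cleans_text)
                if language not in filters:
                    language, score = self._mean_processing(cleans_text)
                if language is None or score <= 0.0:
                    continue
                elif language in filters:
                    pass  # pass
                elif score >= 0.95:
                    continue  # High score, but not in the filter, excluded.
                elif score <= 0.15 and filters[:2] == "fr":
                    language = priority_language
                else:
                    language = "en"
                self._addwords(words, language, text, score)
        else:
            # Default is English
            language, score = "en", 1.0
            self._addwords(words, language, text, score)

    def _process_Russian(self, words, data):
        tag, match = data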
 629        enablePreview = self.EnablePreview
 630        if enablePreview == True:
 631            # Experimental: Other language support
 632            regex_pattern = re.compile(r"(.*?[。.??!!]+[\n]{,1})")
 633            lines = regex_pattern.split(text)
 634            for index, text in enumerate(lines):
 635                if len(text.strip()) == 0:
 636                    continue
 637                cleans_text = self._cleans_text(text)
 638                language, score = self._lang_classify(cleans_text)
 639                if language not in filters:
 640                    language, score = self._mean_processing(cleans_text)
 641                if language is None or score <= 0.0:
 642                    continue
 643                elif language in filters:
 644                    pass  # pass
 645                elif score >= 0.95:
 646                    continue  # High score, but not in the filter, excluded.
 647                elif score <= 0.15 and filters[:2] == "fr":
 648                    language = priority_language
 649                else:
 650                    language = "en"
 651                self._addwords(words, language, text, score)
 652        else:
 653            # Default is English
 654            language, score = "en", 1.0
 655            self._addwords(words, language, text, score)
 656
 657    def _process_Russian(self, words, data):
 658        tag, match = data
 659        text = match[0]
 660        language = "ru"
 661        score = 1.0
 662        self._addwords(words, language, text, score)
 663
 664    def _process_Thai(self, words, data):
 665        tag, match = data
 666        text = match[0]
 667        language = "th"
 668        score = 1.0
 669        self._addwords(words, language, text, score)
 670
 671    def _process_korean(self, words, data):
 672        tag, match = data
 673        text = match[0]
 674        language = "ko"
 675        score = 1.0
 676        self._addwords(words, language, text, score)
 677
 678    def _process_quotes(self, words, data):
 679        tag, match = data
 680        text = "".join(match)
 681        childs = self.PARSE_TAG.findall(text)
 682        if len(childs) > 0:
 683            self._process_tags(words, text, False)
 684        else:
 685            cleans_text = self._cleans_text(match[1])
 686            if len(cleans_text) <= 5:
 687                self._parse_language(words, text)
 688            else:
 689                language, score = self._lang_classify(cleans_text)
 690                self._addwords(words, language, text, score)
 691
 692    def _process_pinyin(self, words, data):
 693        tag, match = data
 694        text = match
 695        language = "zh"
 696        score = 1.0
 697        self._addwords(words, language, text, score)
 698
 699    def _process_number(self, words, data):  # "$0" process only
 700        """
 701        Numbers alone cannot accurately identify language.
 702        Because numbers are universal in all languages.
 703        So it won't be executed here, just for testing.
 704        """
 705        tag, match = data
 706        language = words[0]["lang"] if len(words) > 0 else "zh"
 707        text = match
 708        score = 0.0
 709        self._addwords(words, language, text, score)
 710
 711    def _process_tags(self, words, text, root_tag):
 712        text_cache = self._text_cache
 713        segments = re.split(self.PARSE_TAG, text)
 714        segments_len = len(segments) - 1
 715        for index, text in enumerate(segments):
 716            if root_tag:
 717                self._lang_eos = index >= segments_len
 718            if self.PARSE_TAG.match(text):
 719                process, data = text_cache[text]
 720                if process:
 721                    process(words, data)
 722            else:
 723                self._parse_language(words, text)
 724        return words
 725
 726    def _merge_results(self, words):
 727        new_word = []
 728        for index, cur_data in enumerate(words):
 729            if "symbol" in cur_data:
 730                del cur_data["symbol"]
 731            if index == 0:
 732                new_word.append(cur_data)
 733            else:
 734                pre_data = new_word[-1]
 735                if cur_data["lang"] == pre_data["lang"]:
 736                    pre_data["text"] = f"{pre_data['text']}{cur_data['text']}"
 737                else:
 738                    new_word.append(cur_data)
 739        return new_word
 740
    def _parse_symbols(self, text):
        TAG_NUM = "00"  # "00" => default channels , "$0" => testing channel
        TAG_S1, TAG_S2, TAG_P1, TAG_P2, TAG_EN, TAG_KO, TAG_RU, TAG_TH = (
            "$1",
            "$2",
            "$3",
            "$4",
            "$5",
            "$6",
            "$7",
            "$8",
        )
        TAG_BASE = re.compile(rf'(([【《((“‘"\']*[LANGUAGE]+[\W\s]*)+)')
        # Get custom language filter
        filters = self.Langfilters
        filters = filters if filters is not None else ""
        # =======================================================================================================
        # Experimental: support for other languages.
        # If characters for a language are missing, contributors familiar with that language
        # are welcome to submit the missing pronunciation symbols.
        # -------------------------------------------------------------------------------------------------------
        # Preview feature, other language support
        enablePreview = self.EnablePreview
        if "fr" in filters or "vi" in filters:
            enablePreview = True
        self.EnablePreview = enablePreview
        # Experimental: French character support.
        RE_FR = "" if not enablePreview else "àáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ"
        # Experimental: Vietnamese character support.
        RE_VI = "" if not enablePreview else "đơưăáàảãạắằẳẵặấầẩẫậéèẻẽẹếềểễệíìỉĩịóòỏõọốồổỗộớờởỡợúùủũụứừửữựôâêơưỷỹ"
        # -------------------------------------------------------------------------------------------------------
        # Basic options:
        process_list = [
            (
                TAG_S1,
                re.compile(self.SYMBOLS_PATTERN),
                self._process_symbol,
            ),  # Symbol Tag
            (
                TAG_KO,
                re.compile(re.sub(r"LANGUAGE", "\uac00-\ud7a3", TAG_BASE.pattern)),
                self._process_korean,
            ),  # Korean words
            (
                TAG_TH,
                re.compile(re.sub(r"LANGUAGE", "\u0e00-\u0e7f", TAG_BASE.pattern)),
                self._process_Thai,
            ),  # Thai words support.
            (
                TAG_RU,
                re.compile(re.sub(r"LANGUAGE", "А-Яа-яЁё", TAG_BASE.pattern)),
                self._process_Russian,
            ),  # Russian words support.
            (
                TAG_NUM,
                re.compile(r"(\W*\d+\W+\d*\W*\d*)"),
                self._process_number,
            ),  # Number words, Universal in all languages, Ignore it.
            (
                TAG_EN,
                re.compile(re.sub(r"LANGUAGE", f"a-zA-Z{RE_FR}{RE_VI}", TAG_BASE.pattern)),
                self._process_english,
            ),  # English words + Other language support.
            (
                TAG_P1,
                re.compile(r'(["\'])(.*?)(\1)'),
                self._process_quotes,
            ),  # Regular quotes
            (
                TAG_P2,
                re.compile(r"([\n]*[【《((“‘])([^【《((“‘’”))》】]{3,})([’”))》】][\W\s]*[\n]{,1})"),
                self._process_quotes,
            ),  # Special quotes, There are left and right.
        ]
        # Extended options: Default False
        if self.keepPinyin:
            process_list.insert(
                1,
                (
                    TAG_S2,
                    re.compile(r"([\(({][^})\)]*?\d[^})\)]*?[})\)])"),
                    self._process_pinyin,
                ),  # Chinese Pinyin Tag.
            )
        # -------------------------------------------------------------------------------------------------------
        words = []
        lines = re.findall(r".*\n*", re.sub(self.PARSE_TAG, "", text))
        for index, text in enumerate(lines):
            if len(text.strip()) == 0:
                continue
            self._lang_eos = False
            self._text_cache = {}
            for item in process_list:
                text = self._pattern_symbols(item, text)
            cur_word = self._process_tags([], text, True)
            if len(cur_word) == 0:
                continue
            cur_data = cur_word[0] if len(cur_word) > 0 else None
            pre_data = words[-1] if len(words) > 0 else None
            if cur_data and pre_data and cur_data["lang"] == pre_data["lang"] and cur_data["symbol"] == False and pre_data["symbol"]:
                cur_data["text"] = f"{pre_data['text']}{cur_data['text']}"
                words.pop()
            words += cur_word
        if self.isLangMerge:
            words = self._merge_results(words)
        lang_count = self._lang_count
        if lang_count and len(lang_count) > 0:
            lang_count = dict(sorted(lang_count.items(), key=lambda x: x[1], reverse=True))
            lang_count = list(lang_count.items())
            self._lang_count = lang_count
        return words
 854
    def setfilters(self, filters):
        # Clear the cache when the filters change.
        if self.Langfilters != filters:
            self._clears()
            self.Langfilters = filters
 863
 864    def getfilters(self):
 865        return self.Langfilters
 866
 867    def setPriorityThreshold(self, threshold: float):
 868        self.LangPriorityThreshold = threshold
 869
 870    def getPriorityThreshold(self):
 871        return self.LangPriorityThreshold
 872
 873    def getCounts(self):
 874        lang_count = self._lang_count
 875        if lang_count is not None:
 876            return lang_count
 877        text_langs = self._text_langs
 878        if text_langs is None or len(text_langs) == 0:
 879            return [("zh", 0)]
 880        lang_counts = defaultdict(int)
 881        for d in text_langs:
 882            lang_counts[d["lang"]] += int(len(d["text"]) * 2) if d["lang"] == "zh" else len(d["text"])
 883        lang_counts = dict(sorted(lang_counts.items(), key=lambda x: x[1], reverse=True))
 884        lang_counts = list(lang_counts.items())
 885        self._lang_count = lang_counts
 886        return lang_counts
 887
 888    def getTexts(self, text: str):
 889        if text is None or len(text.strip()) == 0:
 890            self._clears()
 891            return []
 892        # lasts
 893        text_langs = self._text_langs
 894        if self._text_lasts == text and text_langs is not None:
 895            return text_langs
 896        # parse
 897        self._text_waits = []
 898        self._lang_count = None
 899        self._text_lasts = text
 900        text = self._parse_symbols(text)
 901        self._text_langs = text
 902        return text
 903
 904    def classify(self, text: str):
 905        return self.getTexts(text)
 906
 907
def printList(langlist):
    """Print each entry of the segmentation result list."""
    print("\n===================【打印结果】===================")
    if langlist is None or len(langlist) == 0:
        print("无内容结果,No content result")
        return
    for line in langlist:
        print(line)
 922
 923
def main():
    # -----------------------------------
    # Changelog: word segmentation is more accurate in the new version.
    # -----------------------------------

    # Input example 1 (Japanese and Chinese):
    # text = "“昨日は雨が降った,音楽、映画。。。”你今天学习日语了吗?春は桜の季節です。语种分词是语音合成必不可少的环节。言語分詞は音声合成に欠かせない環節である!"

    # Input example 2 (Japanese and Chinese):
    # text = "欢迎来玩。東京,は日本の首都です。欢迎来玩.  太好了!"

    # Input example 3 (Japanese and Chinese):
    # text = "明日、私たちは海辺にバカンスに行きます。你会说日语吗:“中国語、話せますか” 你的日语真好啊!"

    # Input example 4 (Japanese, Chinese, Korean and English):
    # text = "你的名字叫<ja>佐々木?<ja>吗?韩语中的안녕 오빠读什么呢?あなたの体育の先生は誰ですか? 此次发布会带来了四款iPhone 15系列机型和三款Apple Watch等一系列新品,这次的iPad Air采用了LCD屏幕"

    # Experimental support: "fr" French, "vi" Vietnamese, "ru" Russian, "th" Thai.
    langsegment = LangSegment()
    langsegment.setfilters(["fr", "vi", "ja", "zh", "ko", "en", "ru", "th"])
    text = """
我喜欢在雨天里听音乐。
I enjoy listening to music on rainy days.
雨の日に音楽を聴くのが好きです。
비 오는 날에 음악을 듣는 것을 즐깁니다。
J'aime écouter de la musique les jours de pluie.
Tôi thích nghe nhạc vào những ngày mưa.
Мне нравится слушать музыку в дождливую погоду.
ฉันชอบฟังเพลงในวันที่ฝนตก
"""

    # Segmentation: integrating with a TTS project takes a single call.
    langlist = langsegment.getTexts(text)
    printList(langlist)

    # Language statistics:
    print("\n===================【语种统计】===================")
    # Get the per-language counts, sorted in descending order of character count.
    langCounts = langsegment.getCounts()
    print(langCounts, "\n")

    # The first entry is the main language of the input: (language, character count including punctuation).
    lang, count = langCounts[0]
    print(f"输入内容的主要语言为 = {lang} ,字数 = {count}")
    print("==================================================\n")

    # Segmentation output for input example 4 (lang = language, text = content):
    # ===================【打印结果】===================
    # {'lang': 'zh', 'text': '你的名字叫'}
    # {'lang': 'ja', 'text': '佐々木?'}
    # {'lang': 'zh', 'text': '吗?韩语中的'}
    # {'lang': 'ko', 'text': '안녕 오빠'}
    # {'lang': 'zh', 'text': '读什么呢?'}
    # {'lang': 'ja', 'text': 'あなたの体育の先生は誰ですか?'}
    # {'lang': 'zh', 'text': ' 此次发布会带来了四款'}
    # {'lang': 'en', 'text': 'i Phone  '}
    # {'lang': 'zh', 'text': '15系列机型和三款'}
    # {'lang': 'en', 'text': 'Apple Watch '}
    # {'lang': 'zh', 'text': '等一系列新品,这次的'}
    # {'lang': 'en', 'text': 'i Pad Air '}
    # {'lang': 'zh', 'text': '采用了'}
    # {'lang': 'en', 'text': 'L C D '}
    # {'lang': 'zh', 'text': '屏幕'}

    # ===================【语种统计】===================
    # [('zh', 51), ('ja', 19), ('en', 18), ('ko', 5)]

    # 输入内容的主要语言为 = zh ,字数 = 51
    # ==================================================
    # (The main language of the input content is zh, character count = 51.)


if __name__ == "__main__":
    main()
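# The placeholder technique used by _pattern_symbols and _process_tags above can be tried in isolation.
# A minimal sketch, not the class's implementation: mask_spans and split_masked are hypothetical helper
# names, and keys are numbered by cache size rather than per pattern as in _pattern_symbols.
```python
import re

# Same placeholder shape as PARSE_TAG: ⑥ + tag + zero-padded index + ⑥
PARSE_TAG = re.compile(r"(⑥\$*\d+[\d]{6,}⑥)")


def mask_spans(text, tag, pattern, cache):
    """Replace every match of `pattern` with a unique placeholder key."""

    def repl(m):
        key = f"⑥{tag}{len(cache):06d}⑥"
        cache[key] = (tag, m.group(0))  # remember what the key stands for
        return key

    return pattern.sub(repl, text)


def split_masked(text, cache):
    """Split on placeholder keys and restore the cached (tag, text) pairs."""
    parts = []
    for piece in PARSE_TAG.split(text):
        if not piece:
            continue
        # Untagged text is returned with a tag of None.
        parts.append(cache.get(piece, (None, piece)))
    return parts


cache = {}
korean = re.compile(r"[\uac00-\ud7a3]+")  # Hangul syllables, as in the TAG_KO entry
masked = mask_spans("你好안녕 오빠吗", "$6", korean, cache)
parts = split_masked(masked, cache)
print(parts)
```
# Masking first means later, broader patterns cannot re-match text that an earlier pattern already claimed.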
class LangSSML:
    def __init__(self):
        # Plain digits
        self._zh_numerals_number = {
            "0": "零",
            "1": "一",
            "2": "二",
            "3": "三",
            "4": "四",
            "5": "五",
            "6": "六",
            "7": "七",
            "8": "八",
            "9": "九",
        }

    # Normalize inputs such as 2024/8/24, 2024-08, 08-24 or 24 into a Chinese year/month/day reading.
    def _format_chinese_data(self, date_str: str):
        # Normalize the date separators
        input_date = date_str
        if date_str is None or date_str.strip() == "":
            return ""
        date_str = re.sub(r"[\/\._|年|月]", "-", date_str)
        date_str = re.sub(r"日", r"", date_str)
        date_arrs = date_str.split(" ")
        if len(date_arrs) == 1 and ":" in date_arrs[0]:
            time_str = date_arrs[0]
            date_arrs = []
        else:
            time_str = date_arrs[1] if len(date_arrs) >= 2 else ""

        def nonZero(num, cn, func=None):
            if func is not None:
                num = func(num)
            return f"{num}{cn}" if num is not None and num != "" and num != "0" else ""

        f_number = self.to_chinese_number
        f_currency = self.to_chinese_currency
        # year, month, day
        year_month_day = ""
        if len(date_arrs) > 0:
            year, month, day = "", "", ""
            parts = date_arrs[0].split("-")
            if len(parts) == 3:  # YYYY-MM-DD
                year, month, day = parts
            elif len(parts) == 2:  # MM-DD or YYYY-MM
                if len(parts[0]) == 4:  # year-month
                    year, month = parts
                else:
                    month, day = parts  # month-day
            elif len(parts[0]) > 0:  # a single field: a year or a day
                if len(parts[0]) == 4:
                    year = parts[0]
                else:
                    day = parts[0]
            year, month, day = (
                nonZero(year, "年", f_number),
                nonZero(month, "月", f_currency),
                nonZero(day, "日", f_currency),
            )
            year_month_day = re.sub(r"([年|月|日])+", r"\1", f"{year}{month}{day}")
        # hours, minutes, seconds
        time_str = re.sub(r"[\/\.\-:_]", ":", time_str)
        time_arrs = time_str.split(":")
        hours, minutes, seconds = "", "", ""
        if len(time_arrs) == 3:  # H/M/S
            hours, minutes, seconds = time_arrs
        elif len(time_arrs) == 2:  # H/M
            hours, minutes = time_arrs
        elif len(time_arrs[0]) > 0:
            hours = f"{time_arrs[0]}点"  # H
        if len(time_arrs) > 1:
            hours, minutes, seconds = (
                nonZero(hours, "点", f_currency),
                nonZero(minutes, "分", f_currency),
                nonZero(seconds, "秒", f_currency),
            )
        hours_minutes_seconds = re.sub(r"([点|分|秒])+", r"\1", f"{hours}{minutes}{seconds}")
        output_date = f"{year_month_day}{hours_minutes_seconds}"
        return output_date

    # [SSML] number = read the digits one by one in Chinese.
    def to_chinese_number(self, num: str):
        pattern = r"(\d+)"
        zh_numerals = self._zh_numerals_number
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += "".join(zh_numerals[digit] if digit in zh_numerals else "" for digit in str(item))
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] telephone = read the digits one by one as a Chinese telephone number (一 is spoken as 幺).
    def to_chinese_telephone(self, num: str):
        output = self.to_chinese_number(num.replace("+86", ""))  # drop the +86 country code
        output = output.replace("一", "幺")
        return output

    # [SSML] currency = read as an amount.
    # Digit processing adapted from GPT_SoVITS num.py (thanks)
    def to_chinese_currency(self, num: str):
        pattern = r"(\d+)"
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += num2str(item)
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] date = read as a date. Accepts inputs such as 2024年08月24, 2024/8/24, 2024-08, 08-24, 24.
    def to_chinese_date(self, num: str):
        chinese_date = self._format_chinese_data(num)
        return chinese_date
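# The digit-by-digit readings above can be sketched without the class. A minimal sketch, assuming ASCII
# digits only; read_digits and read_telephone are hypothetical names, not part of LangSSML.
```python
import re

# Map each ASCII digit to its Chinese numeral.
ZH_DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))


def read_digits(text: str) -> str:
    """Read each ASCII digit as a Chinese numeral and "." as 点."""
    out = re.sub(r"[0-9]", lambda m: ZH_DIGITS[m.group(0)], text)
    return out.replace(".", "点")


def read_telephone(text: str) -> str:
    """Drop a +86 prefix, read digit by digit, and say 幺 instead of 一."""
    return read_digits(text.replace("+86", "")).replace("一", "幺")


print(read_digits("3.14"))
print(read_telephone("+8613800138000"))
```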
class LangSegment:
    def __init__(self):
        self.langid = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)

        self._text_cache = None
        self._text_lasts = None
        self._text_langs = None
        self._lang_count = None
        self._lang_eos = None

        # Customizable language tags. All of these forms are supported:
        # <zh>你好<zh> , <ja>佐々木</ja> , <en>OK<en> , <ko>오빠</ko>
        self.SYMBOLS_PATTERN = r"(<([a-zA-Z|-]*)>(.*?)<\/*[a-zA-Z|-]*>)"

        # Language filter group: specifies which languages to keep.
        # Languages not in the filter group are removed, so the filters can be matched
        # to the languages supported by your TTS engine.
        # Filters listed earlier have higher priority.

        # System default filters (ISO 639-1 codes).
        # ----------------------------------------------------------------------------------------------------------------------------------
        # "zh" Chinese, "en" English, "ja" Japanese, "ko" Korean, "fr" French, "vi" Vietnamese, "ru" Russian,
        # "th" Thai
        # ----------------------------------------------------------------------------------------------------------------------------------
        self.DEFAULT_FILTERS = ["zh", "ja", "ko", "en"]

        # User-defined filters
        self.Langfilters = self.DEFAULT_FILTERS[:]  # make a copy

        # Merge adjacent segments of the same language
        self.isLangMerge = True

        # Experimental: you can add "fr" French and "vi" Vietnamese.
        # Enable them through the API: self.setfilters(["zh", "en", "ja", "ko", "fr", "vi" , "ru" , "th"])

        # Preview feature, automatically enabled or disabled, no settings required
        self.EnablePreview = False

        # Abbreviated filters are also supported: any combination of languages can be joined into one filter.
        # Example: you can specify any of the combinations listed further down.

        # Chinese/Japanese priority threshold (score range 0 to 1, default 0.89):
        # when a score falls below the threshold, the priority order in the filters is used.
        # Only characters common to Chinese and Japanese are handled with this confidence and priority rule.
        self.LangPriorityThreshold = 0.89

        # Langfilters = ["zh"]              # Chinese only
        # Langfilters = ["en"]              # English only
        # Langfilters = ["ja"]              # Japanese only
        # Langfilters = ["ko"]              # Korean only
        # Langfilters = ["zh_ja"]           # mixed Chinese and Japanese
        # Langfilters = ["zh_en"]           # mixed Chinese and English
        # Langfilters = ["ja_en"]           # mixed Japanese and English
        # Langfilters = ["zh_ko"]           # mixed Chinese and Korean
        # Langfilters = ["ja_ko"]           # mixed Japanese and Korean
        # Langfilters = ["en_ko"]           # mixed English and Korean
        # Langfilters = ["zh_ja_en"]        # mixed Chinese, Japanese and English
        # Langfilters = ["zh_ja_en_ko"]     # mixed Chinese, Japanese, English and Korean

        # Other filter combinations work the same way.

        # Optional: keep Chinese pinyin written with tone digits, which makes it easier for a front end
        # to edit pinyin phonemes for inference. Off (False) by default.
        # When True, bracketed pinyin containing digits is kept as-is and tagged as "zh".
        self.keepPinyin = False

        # Pattern of the placeholder keys generated in _pattern_symbols
        self.PARSE_TAG = re.compile(r"(⑥\$*\d+[\d]{6,}⑥)")

        self.LangSSML = LangSSML()
338
339    def _clears(self):
340        self._text_cache = None
341        self._text_lasts = None
342        self._text_langs = None
343        self._text_waits = None
344        self._lang_count = None
345        self._lang_eos = None
346
347    def _is_english_word(self, word):
348        return bool(re.match(r"^[a-zA-Z]+$", word))
349
350    def _is_chinese(self, word):
351        for char in word:
352            if "\u4e00" <= char <= "\u9fff":
353                return True
354        return False
355
356    def _is_japanese_kana(self, word):
357        pattern = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]+")
358        matches = pattern.findall(word)
359        return len(matches) > 0
360
361    def _insert_english_uppercase(self, word):
362        modified_text = re.sub(r"(?<!\b)([A-Z])", r" \1", word)
363        modified_text = modified_text.strip("-")
364        return modified_text + " "
365
366    def _split_camel_case(self, word):
367        return re.sub(r"(?<!^)(?=[A-Z])", " ", word)
368
369    def _statistics(self, language, text):
370        # Language word statistics:
371        # Chinese characters usually occupy double bytes
372        if self._lang_count is None or not isinstance(self._lang_count, defaultdict):
373            self._lang_count = defaultdict(int)
374        lang_count = self._lang_count
375        if not "|" in language:
376            lang_count[language] += int(len(text) * 2) if language == "zh" else len(text)
377        self._lang_count = lang_count
378
379    def _clear_text_number(self, text):
380        if text == "\n":
381            return text, False  # Keep Line Breaks
382        clear_text = re.sub(r"([^\w\s]+)", "", re.sub(r"\n+", "", text)).strip()
383        is_number = len(re.sub(re.compile(r"(\d+)"), "", clear_text)) == 0
384        return clear_text, is_number
385
386    def _saveData(self, words, language: str, text: str, score: float, symbol=None):
387        # Pre-detection
388        clear_text, is_number = self._clear_text_number(text)
389        # Merge the same language and save the results
390        preData = words[-1] if len(words) > 0 else None
391        if symbol is not None:
392            pass
393        elif preData is not None and preData["symbol"] is None:
394            if len(clear_text) == 0:
395                language = preData["lang"]
396            elif is_number == True:
397                language = preData["lang"]
398            _, pre_is_number = self._clear_text_number(preData["text"])
399            if preData["lang"] == language:
400                self._statistics(preData["lang"], text)
401                text = preData["text"] + text
402                preData["text"] = text
403                return preData
404            elif pre_is_number == True:
405                text = f"{preData['text']}{text}"
406                words.pop()
407        elif is_number == True:
408            priority_language = self._get_filters_string()[:2]
409            if priority_language in "ja-zh-en-ko-fr-vi":
410                language = priority_language
411        data = {"lang": language, "text": text, "score": score, "symbol": symbol}
412        filters = self.Langfilters
413        if filters is None or len(filters) == 0 or "?" in language or language in filters or language in filters[0] or filters[0] == "*" or filters[0] in "alls-mixs-autos":
414            words.append(data)
415            self._statistics(data["lang"], data["text"])
416        return data
417
    def _addwords(self, words, language, text, score, symbol=None):
        if text == "\n":
            pass  # Keep Line Breaks
        elif text is None or len(text.strip()) == 0:
            return True
        if language is None:
            language = ""
        language = language.lower()
        if language == "en":
            text = self._insert_english_uppercase(text)
        # text = re.sub(r'[(())]', ',' , text) # Keep it.
        text_waits = self._text_waits
        ispre_waits = len(text_waits) > 0
        preResult = text_waits.pop() if ispre_waits else None
        if preResult is None:
            preResult = words[-1] if len(words) > 0 else None
        if preResult and ("|" in preResult["lang"]):
            pre_lang = preResult["lang"]
            if language in pre_lang:
                preResult["lang"] = language = language.split("|")[0]
            else:
                preResult["lang"] = pre_lang.split("|")[0]
            if ispre_waits:
                preResult = self._saveData(
                    words,
                    preResult["lang"],
                    preResult["text"],
                    preResult["score"],
                    preResult["symbol"],
                )
        pre_lang = preResult["lang"] if preResult else None
        if ("|" in language) and (pre_lang and pre_lang not in language and "…" not in language):
            language = language.split("|")[0]
        if "|" in language:
            self._text_waits.append({"lang": language, "text": text, "score": score, "symbol": symbol})
        else:
            self._saveData(words, language, text, score, symbol)
        return False

    def _get_prev_data(self, words):
        data = words[-1] if words and len(words) > 0 else None
        if data:
            return (data["lang"], data["text"])
        return (None, "")

    def _match_ending(self, input, index):
        if input is None or len(input) == 0:
            return False, None
        input = re.sub(r"\s+", "", input)
        if len(input) == 0 or abs(index) > len(input):
            return False, None
        ending_pattern = re.compile(r'([「」“”‘’"\'::。.!!?.?])')
        return ending_pattern.match(input[index]), input[index]

    def _cleans_text(self, cleans_text):
        cleans_text = re.sub(r"(.*?)([^\w]+)", r"\1 ", cleans_text)
        cleans_text = re.sub(r"(.)\1+", r"\1", cleans_text)
        return cleans_text.strip()

    def _mean_processing(self, text: str):
        if text is None or (text.strip()) == "":
            return None, 0.0
        arrs = self._split_camel_case(text).split(" ")
        langs = []
        for t in arrs:
            if len(t.strip()) <= 3:
                continue
            language, score = self.langid.classify(t)
            langs.append({"lang": language})
        if len(langs) == 0:
            return None, 0.0
        return Counter([item["lang"] for item in langs]).most_common(1)[0][0], 1.0

    def _lang_classify(self, cleans_text):
        language, score = self.langid.classify(cleans_text)
        # fix: Huggingface is np.float32
        if score is not None and isinstance(score, np.generic) and hasattr(score, "item"):
            score = score.item()
        score = round(score, 3)
        return language, score

    def _get_filters_string(self):
        filters = self.Langfilters
        return "-".join(filters).lower().strip() if filters is not None else ""

    def _parse_language(self, words, segment):
        LANG_JA = "ja"
        LANG_ZH = "zh"
        LANG_ZH_JA = f"{LANG_ZH}|{LANG_JA}"
        LANG_JA_ZH = f"{LANG_JA}|{LANG_ZH}"
        language = LANG_ZH
        regex_pattern = re.compile(r"([^\w\s]+)")
        lines = regex_pattern.split(segment)
        lines_max = len(lines)
        LANG_EOS = self._lang_eos
        for index, text in enumerate(lines):
            if len(text) == 0:
                continue
            EOS = index >= (lines_max - 1)
            nextId = index + 1
            nextText = lines[nextId] if not EOS else ""
            nextPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", nextText)).strip()) == 0
            textPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", text)).strip()) == 0
            if not EOS and (textPunc == True or (len(nextText.strip()) >= 0 and nextPunc == True)):
                lines[nextId] = f"{text}{nextText}"
                continue
            number_tags = re.compile(r"(⑥\d{6,}⑥)")
            cleans_text = re.sub(number_tags, "", text)
            cleans_text = re.sub(r"\d+", "", cleans_text)
            cleans_text = self._cleans_text(cleans_text)
            # fix: langid is inaccurate on short sentences, so short text is spliced onto the next piece.
            if not EOS and len(cleans_text) <= 2:
                lines[nextId] = f"{text}{nextText}"
                continue
            language, score = self._lang_classify(cleans_text)
            prev_language, prev_text = self._get_prev_data(words)
            if language != LANG_ZH and all("\u4e00" <= c <= "\u9fff" for c in re.sub(r"\s", "", cleans_text)):
                language, score = LANG_ZH, 1
            if len(cleans_text) <= 5 and self._is_chinese(cleans_text):
                filters_string = self._get_filters_string()
                if score < self.LangPriorityThreshold and len(filters_string) > 0:
                    index_ja, index_zh = filters_string.find(LANG_JA), filters_string.find(LANG_ZH)
                    if index_ja != -1 and index_ja < index_zh:
                        language = LANG_JA
                    elif index_zh != -1 and index_zh < index_ja:
                        language = LANG_ZH
                if self._is_japanese_kana(cleans_text):
                    language = LANG_JA
                elif len(cleans_text) > 2 and score > 0.90:
                    pass
                elif EOS and LANG_EOS:
                    language = LANG_ZH if len(cleans_text) <= 1 else language
                else:
                    LANG_UNKNOWN = LANG_ZH_JA if language == LANG_ZH or (len(cleans_text) <= 2 and prev_language == LANG_ZH) else LANG_JA_ZH
                    match_end, match_char = self._match_ending(text, -1)
                    referen = prev_language in LANG_UNKNOWN or LANG_UNKNOWN in prev_language if prev_language else False
                    if match_char in "。.":
                        language = prev_language if referen and len(words) > 0 else language
                    else:
                        language = f"{LANG_UNKNOWN}|…"
            text, *_ = re.subn(number_tags, self._restore_number, text)
            self._addwords(words, language, text, score)

    # ----------------------------------------------------------
    # [SSML] Chinese Number Processing (SSML support)
    # The default here is Chinese, which is used to process SSML Chinese tags. Of course, any language can be supported, for example:
    # Chinese telephone number: <telephone>1234567</telephone>
    # Chinese number: <number>1234567</number>
    def _process_symbol_SSML(self, words, data):
        tag, match = data
        language = SSML = match[1]
        text = match[2]
        score = 1.0
        if SSML == "telephone":
            # Chinese: read as a telephone number
            language = "zh"
            text = self.LangSSML.to_chinese_telephone(text)
        elif SSML == "number":
            # Chinese: read as a number
            language = "zh"
            text = self.LangSSML.to_chinese_number(text)
        elif SSML == "currency":
            # Chinese: pronounce as a currency amount
            language = "zh"
            text = self.LangSSML.to_chinese_currency(text)
        elif SSML == "date":
            # Chinese: pronounce as a date
            language = "zh"
            text = self.LangSSML.to_chinese_date(text)
        self._addwords(words, language, text, score, SSML)

    # ----------------------------------------------------------
    def _restore_number(self, matche):
        value = matche.group(0)
        text_cache = self._text_cache
        if value in text_cache:
            process, data = text_cache[value]
            tag, match = data
            value = match
        return value

    def _pattern_symbols(self, item, text):
        if text is None:
            return text
        tag, pattern, process = item
        matches = pattern.findall(text)
        if len(matches) == 1 and "".join(matches[0]) == text:
            return text
        for i, match in enumerate(matches):
            key = f"⑥{tag}{i:06d}⑥"
            text = re.sub(pattern, key, text, count=1)
            self._text_cache[key] = (process, (tag, match))
        return text

    def _process_symbol(self, words, data):
        tag, match = data
        language = match[1]
        text = match[2]
        score = 1.0
        filters = self._get_filters_string()
        if language not in filters:
            self._process_symbol_SSML(words, data)
        else:
            self._addwords(words, language, text, score, True)

    def _process_english(self, words, data):
        tag, match = data
        text = match[0]
        filters = self._get_filters_string()
        priority_language = filters[:2]
        # Preview feature, other language segmentation processing
        enablePreview = self.EnablePreview
        if enablePreview == True:
            # Experimental: Other language support
            regex_pattern = re.compile(r"(.*?[。.??!!]+[\n]{,1})")
            lines = regex_pattern.split(text)
            for index, text in enumerate(lines):
                if len(text.strip()) == 0:
                    continue
                cleans_text = self._cleans_text(text)
                language, score = self._lang_classify(cleans_text)
                if language not in filters:
                    language, score = self._mean_processing(cleans_text)
                if language is None or score <= 0.0:
                    continue
                elif language in filters:
                    pass  # pass
                elif score >= 0.95:
                    continue  # High score, but not in the filter, excluded.
                elif score <= 0.15 and filters[:2] == "fr":
                    language = priority_language
                else:
                    language = "en"
                self._addwords(words, language, text, score)
        else:
            # Default is English
            language, score = "en", 1.0
            self._addwords(words, language, text, score)

    def _process_Russian(self, words, data):
        tag, match = data
        text = match[0]
        language = "ru"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_Thai(self, words, data):
        tag, match = data
        text = match[0]
        language = "th"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_korean(self, words, data):
        tag, match = data
        text = match[0]
        language = "ko"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_quotes(self, words, data):
        tag, match = data
        text = "".join(match)
        childs = self.PARSE_TAG.findall(text)
        if len(childs) > 0:
            self._process_tags(words, text, False)
        else:
            cleans_text = self._cleans_text(match[1])
            if len(cleans_text) <= 5:
                self._parse_language(words, text)
            else:
                language, score = self._lang_classify(cleans_text)
                self._addwords(words, language, text, score)

    def _process_pinyin(self, words, data):
        tag, match = data
        text = match
        language = "zh"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_number(self, words, data):  # "$0" process only
        """
        Numbers alone cannot accurately identify language.
        Because numbers are universal in all languages.
        So it won't be executed here, just for testing.
        """
        tag, match = data
        language = words[0]["lang"] if len(words) > 0 else "zh"
        text = match
        score = 0.0
        self._addwords(words, language, text, score)

    def _process_tags(self, words, text, root_tag):
        text_cache = self._text_cache
        segments = re.split(self.PARSE_TAG, text)
        segments_len = len(segments) - 1
        for index, text in enumerate(segments):
            if root_tag:
                self._lang_eos = index >= segments_len
            if self.PARSE_TAG.match(text):
                process, data = text_cache[text]
                if process:
                    process(words, data)
            else:
                self._parse_language(words, text)
        return words

    def _merge_results(self, words):
        new_word = []
        for index, cur_data in enumerate(words):
            if "symbol" in cur_data:
                del cur_data["symbol"]
            if index == 0:
                new_word.append(cur_data)
            else:
                pre_data = new_word[-1]
                if cur_data["lang"] == pre_data["lang"]:
                    pre_data["text"] = f"{pre_data['text']}{cur_data['text']}"
                else:
                    new_word.append(cur_data)
        return new_word

    def _parse_symbols(self, text):
        TAG_NUM = "00"  # "00" => default channels , "$0" => testing channel
        TAG_S1, TAG_S2, TAG_P1, TAG_P2, TAG_EN, TAG_KO, TAG_RU, TAG_TH = (
            "$1",
            "$2",
            "$3",
            "$4",
            "$5",
            "$6",
            "$7",
            "$8",
        )
        TAG_BASE = re.compile(rf'(([【《((“‘"\']*[LANGUAGE]+[\W\s]*)+)')
        # Get custom language filter
        filters = self.Langfilters
        filters = filters if filters is not None else ""
        # =======================================================================================================
        # Experimental: Other language support.
        # If characters for a language are missing, contributors familiar with that language
        # are welcome to submit the missing pronunciation symbols.
        # -------------------------------------------------------------------------------------------------------
        # Preview feature, other language support
        enablePreview = self.EnablePreview
        if "fr" in filters or "vi" in filters:
            enablePreview = True
        self.EnablePreview = enablePreview
        # Experimental: French character support.
        RE_FR = "" if not enablePreview else "àáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ"
        # Experimental: Vietnamese character support.
        RE_VI = "" if not enablePreview else "đơưăáàảãạắằẳẵặấầẩẫậéèẻẽẹếềểễệíìỉĩịóòỏõọốồổỗộớờởỡợúùủũụứừửữựôâêơưỷỹ"
        # -------------------------------------------------------------------------------------------------------
        # Basic options:
        process_list = [
            (
                TAG_S1,
                re.compile(self.SYMBOLS_PATTERN),
                self._process_symbol,
            ),  # Symbol Tag
            (
                TAG_KO,
                re.compile(re.sub(r"LANGUAGE", f"\uac00-\ud7a3", TAG_BASE.pattern)),
                self._process_korean,
            ),  # Korean words
            (
                TAG_TH,
                re.compile(re.sub(r"LANGUAGE", f"\u0e00-\u0e7f", TAG_BASE.pattern)),
                self._process_Thai,
            ),  # Thai words support.
            (
                TAG_RU,
                re.compile(re.sub(r"LANGUAGE", f"А-Яа-яЁё", TAG_BASE.pattern)),
                self._process_Russian,
            ),  # Russian words support.
            (
                TAG_NUM,
                re.compile(r"(\W*\d+\W+\d*\W*\d*)"),
                self._process_number,
            ),  # Number words, Universal in all languages, Ignore it.
            (
                TAG_EN,
                re.compile(re.sub(r"LANGUAGE", f"a-zA-Z{RE_FR}{RE_VI}", TAG_BASE.pattern)),
                self._process_english,
            ),  # English words + Other language support.
            (
                TAG_P1,
                re.compile(r'(["\'])(.*?)(\1)'),
                self._process_quotes,
            ),  # Regular quotes
            (
                TAG_P2,
                re.compile(r"([\n]*[【《((“‘])([^【《((“‘’”))》】]{3,})([’”))》】][\W\s]*[\n]{,1})"),
                self._process_quotes,
            ),  # Special quotes, There are left and right.
        ]
        # Extended options: Default False
        if self.keepPinyin == True:
            process_list.insert(
                1,
                (
                    TAG_S2,
                    re.compile(r"([\(({][^})\)]*?\d[^})\)]*?[})\])"),
                    self._process_pinyin,
                ),  # Chinese Pinyin Tag.
            )
        # -------------------------------------------------------------------------------------------------------
        words = []
        lines = re.findall(r".*\n*", re.sub(self.PARSE_TAG, "", text))
        for index, text in enumerate(lines):
            if len(text.strip()) == 0:
                continue
            self._lang_eos = False
            self._text_cache = {}
            for item in process_list:
                text = self._pattern_symbols(item, text)
            cur_word = self._process_tags([], text, True)
            if len(cur_word) == 0:
                continue
            cur_data = cur_word[0] if len(cur_word) > 0 else None
            pre_data = words[-1] if len(words) > 0 else None
            if cur_data and pre_data and cur_data["lang"] == pre_data["lang"] and cur_data["symbol"] == False and pre_data["symbol"]:
                cur_data["text"] = f"{pre_data['text']}{cur_data['text']}"
                words.pop()
            words += cur_word
        if self.isLangMerge == True:
            words = self._merge_results(words)
        lang_count = self._lang_count
        if lang_count and len(lang_count) > 0:
            lang_count = dict(sorted(lang_count.items(), key=lambda x: x[1], reverse=True))
            lang_count = list(lang_count.items())
            self._lang_count = lang_count
        return words

    def setfilters(self, filters):
        # When the filter changes, clear the cache
        if self.Langfilters != filters:
            self._clears()
            self.Langfilters = filters

    def getfilters(self):
        return self.Langfilters

    def setPriorityThreshold(self, threshold: float):
        self.LangPriorityThreshold = threshold

    def getPriorityThreshold(self):
        return self.LangPriorityThreshold

    def getCounts(self):
        lang_count = self._lang_count
        if lang_count is not None:
            return lang_count
        text_langs = self._text_langs
        if text_langs is None or len(text_langs) == 0:
            return [("zh", 0)]
        lang_counts = defaultdict(int)
        for d in text_langs:
            lang_counts[d["lang"]] += int(len(d["text"]) * 2) if d["lang"] == "zh" else len(d["text"])
        lang_counts = dict(sorted(lang_counts.items(), key=lambda x: x[1], reverse=True))
        lang_counts = list(lang_counts.items())
        self._lang_count = lang_counts
        return lang_counts

    def getTexts(self, text: str):
        if text is None or len(text.strip()) == 0:
            self._clears()
            return []
        # lasts
        text_langs = self._text_langs
        if self._text_lasts == text and text_langs is not None:
            return text_langs
        # parse
        self._text_waits = []
        self._lang_count = None
        self._text_lasts = text
        text = self._parse_symbols(text)
        self._text_langs = text
        return text

    def classify(self, text: str):
        return self.getTexts(text)
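
The merge step that `_merge_results` applies when `isLangMerge` is enabled can be tried in isolation. The following is a minimal sketch, not the class's own code: `merge_adjacent` is a hypothetical standalone helper, and unlike `_merge_results` it copies each segment instead of mutating the input list.

```python
def merge_adjacent(segments):
    """Join neighbouring segments that share a language (hypothetical helper)."""
    merged = []
    for seg in segments:
        # Work on a copy and drop the internal "symbol" key, as the listing does.
        seg = {k: v for k, v in seg.items() if k != "symbol"}
        if merged and merged[-1]["lang"] == seg["lang"]:
            merged[-1]["text"] += seg["text"]
        else:
            merged.append(seg)
    return merged


segments = [
    {"lang": "zh", "text": "你好,", "score": 1.0, "symbol": None},
    {"lang": "zh", "text": "世界。", "score": 0.9, "symbol": None},
    {"lang": "en", "text": "Hello ", "score": 1.0, "symbol": None},
]
result = merge_adjacent(segments)
for item in result:
    print(item)
```

As in the listing, the merged entry keeps the score of the first segment in each run.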
langid
SYMBOLS_PATTERN
DEFAULT_FILTERS
Langfilters
isLangMerge
EnablePreview
LangPriorityThreshold
keepPinyin
PARSE_TAG
LangSSML
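
The placeholder technique behind `_pattern_symbols` and `_restore_number` (swap each match for a `⑥…⑥` key, remember the original in a cache, look the key up later) can be sketched on its own. Everything named here is an assumption for illustration: `KEY_PATTERN` stands in for `PARSE_TAG`, whose real definition is not shown in this section, and `tag_matches` uses a substitution callback instead of the listing's `findall` plus `re.sub(..., count=1)` loop.

```python
import re

# Hypothetical key pattern: "$" + one tag digit + a six-digit counter, wrapped in ⑥.
KEY_PATTERN = re.compile(r"⑥\$\d{7}⑥")


def tag_matches(tag, pattern, text, cache):
    """Replace every match of `pattern` with a unique key and cache the original."""
    counter = 0

    def _swap(match):
        nonlocal counter
        key = f"⑥{tag}{counter:06d}⑥"
        cache[key] = match.group(0)
        counter += 1
        return key

    return pattern.sub(_swap, text)


def restore(text, cache):
    """Put the cached originals back in place of their keys."""
    return KEY_PATTERN.sub(lambda m: cache.get(m.group(0), m.group(0)), text)


cache = {}
korean = re.compile(r"[\uac00-\ud7a3]+")  # same Hangul range as the listing
original = "韩语中的안녕读什么呢"
tagged = tag_matches("$6", korean, original, cache)
print(tagged)
print(restore(tagged, cache))
```

Tagging first keeps already-recognised spans out of the way of later, broader patterns, which appears to be why the listing runs its `process_list` in a fixed order.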
def setfilters(self, filters):
def getfilters(self):
def setPriorityThreshold(self, threshold: float):
def getPriorityThreshold(self):
def getCounts(self):
def getTexts(self, text: str):
def classify(self, text: str):
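
The fallback path of `getCounts` (used when no cached count exists) weights each `zh` character twice and ranks languages by total length. A minimal standalone sketch of that arithmetic follows; `count_languages` is a hypothetical helper, and the sample segments are made up for illustration.

```python
from collections import defaultdict


def count_languages(segments):
    """Rank languages by character count, counting "zh" text twice (hypothetical helper)."""
    counts = defaultdict(int)
    for seg in segments:
        length = len(seg["text"])
        counts[seg["lang"]] += length * 2 if seg["lang"] == "zh" else length
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


segments = [
    {"lang": "zh", "text": "你好"},
    {"lang": "en", "text": "Hi!"},
    {"lang": "zh", "text": "再见"},
]
ranking = count_languages(segments)
main_lang, main_count = ranking[0]
print(ranking)                # [('zh', 8), ('en', 3)]
print(main_lang, main_count)  # zh 8
```

The first entry of the ranking is what `main()` unpacks as the main language of the input.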
def printList(langlist):
def printList(langlist):
    """
    Function: Print array results
    """
    print("\n===================【Print results】===================")
    if langlist is None or len(langlist) == 0:
        print("No content result")
        return
    for line in langlist:
        print(line)

Function: Print array results

def main():
def main():
    # -----------------------------------
    # Changelog: The new version of the word segmentation is more accurate.
    # -----------------------------------

    # Input Example 1: (including Japanese, Chinese)
    # text = "“昨日は雨が降った,音楽、映画。。。”你今天学习日语了吗?春は桜の季節です。语种分词是语音合成必不可少的环节。言語分詞は音声合成に欠かせない環節である!"

    # Input Example 2: (including Japanese, Chinese)
    # text = "欢迎来玩。東京,は日本の首都です。欢迎来玩.  太好了!"

    # Input Example 3: (including Japanese, Chinese)
    # text = "明日、私たちは海辺にバカンスに行きます。你会说日语吗:“中国語、話せますか” 你的日语真好啊!"

    # Input Example 4: (including Japanese, Chinese, Korean, English)
    # text = "你的名字叫<ja>佐々木?<ja>吗?韩语中的안녕 오빠读什么呢?あなたの体育の先生は誰ですか? 此次发布会带来了四款iPhone 15系列机型和三款Apple Watch等一系列新品,这次的iPad Air采用了LCD屏幕"

    # Experimental support: "fr" French, "vi" Vietnamese, "ru" Russian, "th" Thai.
    langsegment = LangSegment()
    langsegment.setfilters(["fr", "vi", "ja", "zh", "ko", "en", "ru", "th"])
    text = """
我喜欢在雨天里听音乐。
I enjoy listening to music on rainy days.
雨の日に音楽を聴くのが好きです。
비 오는 날에 음악을 듣는 것을 즐깁니다。
J'aime écouter de la musique les jours de pluie.
Tôi thích nghe nhạc vào những ngày mưa.
Мне нравится слушать музыку в дождливую погоду.
ฉันชอบฟังเพลงในวันที่ฝนตก
"""

    # Segmentation: (only one line of code is required to integrate with a TTS project)
    langlist = langsegment.getTexts(text)
    printList(langlist)

    # Language statistics:
    print("\n===================【Language statistics】===================")
    # Get the results for all languages, sorted in descending order by character count
    langCounts = langsegment.getCounts()
    print(langCounts, "\n")

    # Get the main language of the content from the results (language, character count including punctuation)
    lang, count = langCounts[0]
    print(f"The main language of the input content is = {lang}, word count = {count}")
    print("==================================================\n")

    # Segmentation output: lang = language, text = content
    # ===================【Print results】===================
    # {'lang': 'zh', 'text': '你的名字叫'}
    # {'lang': 'ja', 'text': '佐々木?'}
    # {'lang': 'zh', 'text': '吗?韩语中的'}
    # {'lang': 'ko', 'text': '안녕 오빠'}
    # {'lang': 'zh', 'text': '读什么呢?'}
    # {'lang': 'ja', 'text': 'あなたの体育の先生は誰ですか?'}
    # {'lang': 'zh', 'text': ' 此次发布会带来了四款'}
    # {'lang': 'en', 'text': 'i Phone  '}
    # {'lang': 'zh', 'text': '15系列机型和三款'}
    # {'lang': 'en', 'text': 'Apple Watch '}
    # {'lang': 'zh', 'text': '等一系列新品,这次的'}
    # {'lang': 'en', 'text': 'i Pad Air '}
    # {'lang': 'zh', 'text': '采用了'}
    # {'lang': 'en', 'text': 'L C D '}
    # {'lang': 'zh', 'text': '屏幕'}

    # ===================【Language statistics】===================
    # [('zh', 51), ('ja', 19), ('en', 18), ('ko', 5)]

    # The main language of the input content is = zh, word count = 51
    # ==================================================