divisor.acestep.language_segmentation.LangSegment
This file bundles language identification functions.
Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.
Original code: Copyright (c) 2011 Marco Lui <saffsd@gmail.com>. Based on research by Marco Lui and Tim Baldwin.
See LICENSE file for more info. https://github.com/adbar/py3langid
Projects: https://github.com/juntaosun/LangSegment
LICENSE: py3langid - Language Identifier BSD 3-Clause License
Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.
Original code: Copyright (c) 2011 Marco Lui <saffsd@gmail.com>. Based on research by Marco Lui and Tim Baldwin.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
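The manual language tags that this module accepts can be illustrated in isolation. The following is a minimal sketch, assuming the same regular expression that the source assigns to `SYMBOLS_PATTERN`; the helper name `extract_tagged_segments` is hypothetical and is not part of the module, which routes matches through its own internal processing instead.

```python
import re

# Same expression as the SYMBOLS_PATTERN attribute set in LangSegment.__init__.
SYMBOLS_PATTERN = re.compile(r"(<([a-zA-Z|-]*)>(.*?)<\/*[a-zA-Z|-]*>)")


def extract_tagged_segments(text):
    """Hypothetical helper: return (language, content) pairs for paired tags.

    Each match yields three groups: the whole tagged span, the tag name,
    and the text between the opening and closing tags.
    """
    return [(lang, content) for _whole, lang, content in SYMBOLS_PATTERN.findall(text)]


# Both closing styles are matched: a repeated <zh> or a proper </ja>.
pairs = extract_tagged_segments("<zh>你好<zh> and <ja>佐々木</ja>")
print(pairs)

# A single, unpaired tag does not match, so nothing is extracted.
print(extract_tagged_segments("你的名字叫<ja>佐々木。"))
```

Because the closing part of the pattern accepts any tag name, the sketch does not check that the opening and closing tags agree; it only requires that tags come in pairs.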
"""
This file bundles language identification functions.

Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.

Original code: Copyright (c) 2011 Marco Lui <saffsd@gmail.com>.
Based on research by Marco Lui and Tim Baldwin.

See LICENSE file for more info.
https://github.com/adbar/py3langid

Projects:
https://github.com/juntaosun/LangSegment

LICENSE:
py3langid - Language Identifier
BSD 3-Clause License

Modifications (fork): Copyright (c) 2021, Adrien Barbaresi.

Original code: Copyright (c) 2011 Marco Lui <saffsd@gmail.com>.
Based on research by Marco Lui and Tim Baldwin.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are
permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ``AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
"""

import os
import re
import sys
import numpy as np
from collections import Counter
from collections import defaultdict

# import langid
# import py3langid as langid
# pip install py3langid==0.2.2

# Enable normalization of the language-prediction probabilities so that scores fall in the 0-1 range.
# langid disables probability normalization by default. For command-line use, it can be enabled by passing a flag.
# For probability normalization in library use, the user must instantiate their own identifier (here, LanguageIdentifier).
from py3langid.langid import LanguageIdentifier, MODEL_FILE

from divisor.acestep.language_segmentation.utils.num import num2str

# -----------------------------------
# Changelog: The new version of the word segmentation is more accurate.
# -----------------------------------


# Word segmentation function:
# automatically identify and split the words (Chinese/English/Japanese/Korean) in the article or sentence according to different languages,
# making it more suitable for TTS processing.
# This code is designed for front-end text multi-lingual mixed annotation distinction, multi-language mixed training and inference of various TTS projects.
# The result mainly targets (Chinese = zh, Japanese = ja, English = en, Korean = ko), and can in practice handle mixtures of up to 97 different languages.

# ===========================================================================================================
# Examples:
# (1) Automatic segmentation: “韩语中的오빠读什么呢?あなたの体育の先生は誰ですか?此次发布会带来了四款iPhone 15系列机型”
# (2) Manual segmentation: “你的名字叫<ja>佐々木?<ja>吗?”
# ===========================================================================================================


# Manual word segmentation tag specification: <language tag>text content</language tag>
# ===========================================================================================================
# For manual word segmentation, tags need to appear in pairs, such as: “<ja>佐々木<ja>” or “<ja>佐々木</ja>”
# Error demonstration: "Your name is <ja>佐々木。" A single <ja> tag in this sentence is ignored and not processed.
# ===========================================================================================================


# ===========================================================================================================
# Speech Synthesis Markup Language (SSML): only its tags are supported here (not XML).
# Want to support more SSML tags? PRs are welcome!
# Note: In addition to Chinese, it can also be modified to support multi-language SSML, not just Chinese.
# ===========================================================================================================
# Chinese implementation:
# [SSML] <number> = Chinese numeral reading, one character per digit
# [SSML] <telephone> = digits read as a Chinese telephone number, one character per digit
# [SSML] <currency> = read as an amount of money.
# [SSML] <date> = read as a date. Supports inputs such as 2024年08月24, 2024/8/24, 2024-08, 08-24, 24.
# ===========================================================================================================
class LangSSML:
    def __init__(self):
        # Plain digits
        self._zh_numerals_number = {
            "0": "零",
            "1": "一",
            "2": "二",
            "3": "三",
            "4": "四",
            "5": "五",
            "6": "六",
            "7": "七",
            "8": "八",
            "9": "九",
        }

    # Standardize 2024/8/24, 2024-08, 08-24, 24 to "year-month-day"
    def _format_chinese_data(self, date_str: str):
        # Handle the date format
        input_date = date_str
        if date_str is None or date_str.strip() == "":
            return ""
        date_str = re.sub(r"[\/\._|年|月]", "-", date_str)
        date_str = re.sub(r"日", r"", date_str)
        date_arrs = date_str.split(" ")
        if len(date_arrs) == 1 and ":" in date_arrs[0]:
            time_str = date_arrs[0]
            date_arrs = []
        else:
            time_str = date_arrs[1] if len(date_arrs) >= 2 else ""

        def nonZero(num, cn, func=None):
            if func is not None:
                num = func(num)
            return f"{num}{cn}" if num is not None and num != "" and num != "0" else ""

        f_number = self.to_chinese_number
        f_currency = self.to_chinese_currency
        # year, month, day
        year_month_day = ""
        if len(date_arrs) > 0:
            year, month, day = "", "", ""
            parts = date_arrs[0].split("-")
            if len(parts) == 3:  # format: YYYY-MM-DD
                year, month, day = parts
            elif len(parts) == 2:  # format: MM-DD or YYYY-MM
                if len(parts[0]) == 4:  # year-month
                    year, month = parts
                else:
                    month, day = parts  # month-day
            elif len(parts[0]) > 0:  # only a day or only a year
                if len(parts[0]) == 4:
                    year = parts[0]
                else:
                    day = parts[0]
            year, month, day = (
                nonZero(year, "年", f_number),
                nonZero(month, "月", f_currency),
                nonZero(day, "日", f_currency),
            )
            year_month_day = re.sub(r"([年|月|日])+", r"\1", f"{year}{month}{day}")
        # hours, minutes, seconds
        time_str = re.sub(r"[\/\.\-:_]", ":", time_str)
        time_arrs = time_str.split(":")
        hours, minutes, seconds = "", "", ""
        if len(time_arrs) == 3:  # H/M/S
            hours, minutes, seconds = time_arrs
        elif len(time_arrs) == 2:  # H/M
            hours, minutes = time_arrs
        elif len(time_arrs[0]) > 0:
            hours = f"{time_arrs[0]}点"  # H
        if len(time_arrs) > 1:
            hours, minutes, seconds = (
                nonZero(hours, "点", f_currency),
                nonZero(minutes, "分", f_currency),
                nonZero(seconds, "秒", f_currency),
            )
        hours_minutes_seconds = re.sub(r"([点|分|秒])+", r"\1", f"{hours}{minutes}{seconds}")
        output_date = f"{year_month_day}{hours_minutes_seconds}"
        return output_date

    # [SSML] number = Chinese numeral reading, one character per digit
    def to_chinese_number(self, num: str):
        pattern = r"(\d+)"
        zh_numerals = self._zh_numerals_number
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += "".join(zh_numerals[digit] if digit in zh_numerals else "" for digit in str(item))
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] telephone = digits read as a Chinese telephone number, one character per digit
    def to_chinese_telephone(self, num: str):
        output = self.to_chinese_number(num.replace("+86", ""))  # zh +86
        output = output.replace("一", "幺")
        return output

    # [SSML] currency = read as an amount of money.
    # Digital processing from GPT_SoVITS num.py (thanks)
    def to_chinese_currency(self, num: str):
        pattern = r"(\d+)"
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += num2str(item)
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] date = read as a date. Supports inputs such as 2024年08月24, 2024/8/24, 2024-08, 08-24, 24.
    def to_chinese_date(self, num: str):
        chinese_date = self._format_chinese_data(num)
        return chinese_date


class LangSegment:
    def __init__(self):
        self.langid = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True)

        self._text_cache = None
        self._text_lasts = None
        self._text_langs = None
        self._lang_count = None
        self._lang_eos = None

        # Customizable language matching tags. These forms are all supported:
        # <zh>你好<zh> , <ja>佐々木</ja> , <en>OK<en> , <ko>오빠</ko>
        self.SYMBOLS_PATTERN = r"(<([a-zA-Z|-]*)>(.*?)<\/*[a-zA-Z|-]*>)"

        # The language filter group function allows you to specify which languages to keep.
        # Languages not in the filter group will be cleared. You can combine the languages supported by your TTS (text-to-speech) system as you like.
        # The earlier a language appears in the list, the higher its priority.

        # System default filter. (ISO 639-1 codes given)
        # ----------------------------------------------------------------------------------------------------------------------------------
        # "zh"=Chinese, "en"=English, "ja"=Japanese, "ko"=Korean, "fr"=French, "vi"=Vietnamese, "ru"=Russian
        # "th"=Thai
        # ----------------------------------------------------------------------------------------------------------------------------------
        self.DEFAULT_FILTERS = ["zh", "ja", "ko", "en"]

        # User-defined filters
        self.Langfilters = self.DEFAULT_FILTERS[:]  # make a copy

        # Merge text
        self.isLangMerge = True

        # Experimental: You can customize to add: "fr" French, "vi" Vietnamese.
        # Enable it through the API: self.setfilters(["zh", "en", "ja", "ko", "fr", "vi" , "ru" , "th"])

        # Preview feature, automatically enabled or disabled, no settings required
        self.EnablePreview = False

        # In addition to that, it supports abbreviation filters, allowing for any combination of different languages.
        # Example: You can specify any combination to filter

        # Chinese and Japanese language priority threshold (score range is 0 ~ 1): the default threshold is 0.89.
        # When the score is below the threshold, the priority order in the filters is applied.
        # Only the common characters between Chinese and Japanese are processed with confidence and priority.
        self.LangPriorityThreshold = 0.89

        # Langfilters = ["zh"] # recognize as Chinese
        # Langfilters = ["en"] # recognize as English
        # Langfilters = ["ja"] # recognize as Japanese
        # Langfilters = ["ko"] # recognize as Korean
        # Langfilters = ["zh_ja"] # mixed Chinese/Japanese
        # Langfilters = ["zh_en"] # mixed Chinese/English
        # Langfilters = ["ja_en"] # mixed Japanese/English
        # Langfilters = ["zh_ko"] # mixed Chinese/Korean
        # Langfilters = ["ja_ko"] # mixed Japanese/Korean
        # Langfilters = ["en_ko"] # mixed English/Korean
        # Langfilters = ["zh_ja_en"] # mixed Chinese/Japanese/English
        # Langfilters = ["zh_ja_en_ko"] # mixed Chinese/Japanese/English/Korean

        # For more filter combinations, please feel free to......

        # Optional: keep the Chinese numeric-pinyin format, which makes it easier for a front end to modify pinyin phonemes and run inference. Disabled (False) by default.
        # When enabled (True), numeric-pinyin text inside brackets is kept and identified as "zh" Chinese.
        self.keepPinyin = False

        # DEFINITION
        self.PARSE_TAG = re.compile(r"(⑥\$*\d+[\d]{6,}⑥)")

        self.LangSSML = LangSSML()

    def _clears(self):
        self._text_cache = None
        self._text_lasts = None
        self._text_langs = None
        self._text_waits = None
        self._lang_count = None
        self._lang_eos = None

    def _is_english_word(self, word):
        return bool(re.match(r"^[a-zA-Z]+$", word))

    def _is_chinese(self, word):
        for char in word:
            if "\u4e00" <= char <= "\u9fff":
                return True
        return False

    def _is_japanese_kana(self, word):
        pattern = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]+")
        matches = pattern.findall(word)
        return len(matches) > 0

    def _insert_english_uppercase(self, word):
        modified_text = re.sub(r"(?<!\b)([A-Z])", r" \1", word)
        modified_text = modified_text.strip("-")
        return modified_text + " "

    def _split_camel_case(self, word):
        return re.sub(r"(?<!^)(?=[A-Z])", " ", word)

    def _statistics(self, language, text):
        # Language word statistics:
        # Chinese characters usually occupy double bytes
        if self._lang_count is None or not isinstance(self._lang_count, defaultdict):
            self._lang_count = defaultdict(int)
        lang_count = self._lang_count
        if not "|" in language:
            lang_count[language] += int(len(text) * 2) if language == "zh" else len(text)
        self._lang_count = lang_count

    def _clear_text_number(self, text):
        if text == "\n":
            return text, False  # Keep Line Breaks
        clear_text = re.sub(r"([^\w\s]+)", "", re.sub(r"\n+", "", text)).strip()
        is_number = len(re.sub(re.compile(r"(\d+)"), "", clear_text)) == 0
        return clear_text, is_number

    def _saveData(self, words, language: str, text: str, score: float, symbol=None):
        # Pre-detection
        clear_text, is_number = self._clear_text_number(text)
        # Merge the same language and save the results
        preData = words[-1] if len(words) > 0 else None
        if symbol is not None:
            pass
        elif preData is not None and preData["symbol"] is None:
            if len(clear_text) == 0:
                language = preData["lang"]
            elif is_number == True:
                language = preData["lang"]
            _, pre_is_number = self._clear_text_number(preData["text"])
            if preData["lang"] == language:
                self._statistics(preData["lang"], text)
                text = preData["text"] + text
                preData["text"] = text
                return preData
            elif pre_is_number == True:
                text = f"{preData['text']}{text}"
                words.pop()
        elif is_number == True:
            priority_language = self._get_filters_string()[:2]
            if priority_language in "ja-zh-en-ko-fr-vi":
                language = priority_language
        data = {"lang": language, "text": text, "score": score, "symbol": symbol}
        filters = self.Langfilters
        if filters is None or len(filters) == 0 or "?" in language or language in filters or language in filters[0] or filters[0] == "*" or filters[0] in "alls-mixs-autos":
            words.append(data)
            self._statistics(data["lang"], data["text"])
        return data

    def _addwords(self, words, language, text, score, symbol=None):
        if text == "\n":
            pass  # Keep Line Breaks
        elif text is None or len(text.strip()) == 0:
            return True
        if language is None:
            language = ""
        language = language.lower()
        if language == "en":
            text = self._insert_english_uppercase(text)
        # text = re.sub(r'[(())]', ',' , text) # Keep it.
        text_waits = self._text_waits
        ispre_waits = len(text_waits) > 0
        preResult = text_waits.pop() if ispre_waits else None
        if preResult is None:
            preResult = words[-1] if len(words) > 0 else None
        if preResult and ("|" in preResult["lang"]):
            pre_lang = preResult["lang"]
            if language in pre_lang:
                preResult["lang"] = language = language.split("|")[0]
            else:
                preResult["lang"] = pre_lang.split("|")[0]
            if ispre_waits:
                preResult = self._saveData(
                    words,
                    preResult["lang"],
                    preResult["text"],
                    preResult["score"],
                    preResult["symbol"],
                )
        pre_lang = preResult["lang"] if preResult else None
        if ("|" in language) and (pre_lang and not pre_lang in language and not "…" in language):
            language = language.split("|")[0]
        if "|" in language:
            self._text_waits.append({"lang": language, "text": text, "score": score, "symbol": symbol})
        else:
            self._saveData(words, language, text, score, symbol)
        return False

    def _get_prev_data(self, words):
        data = words[-1] if words and len(words) > 0 else None
        if data:
            return (data["lang"], data["text"])
        return (None, "")

    def _match_ending(self, input, index):
        if input is None or len(input) == 0:
            return False, None
        input = re.sub(r"\s+", "", input)
        if len(input) == 0 or abs(index) > len(input):
            return False, None
        ending_pattern = re.compile(r'([「」“”‘’"\'::。.!!?.?])')
        return ending_pattern.match(input[index]), input[index]

    def _cleans_text(self, cleans_text):
        cleans_text = re.sub(r"(.*?)([^\w]+)", r"\1 ", cleans_text)
        cleans_text = re.sub(r"(.)\1+", r"\1", cleans_text)
        return cleans_text.strip()

    def _mean_processing(self, text: str):
        if text is None or (text.strip()) == "":
            return None, 0.0
        arrs = self._split_camel_case(text).split(" ")
        langs = []
        for t in arrs:
            if len(t.strip()) <= 3:
                continue
            language, score = self.langid.classify(t)
            langs.append({"lang": language})
        if len(langs) == 0:
            return None, 0.0
        return Counter([item["lang"] for item in langs]).most_common(1)[0][0], 1.0

    def _lang_classify(self, cleans_text):
        language, score = self.langid.classify(cleans_text)
        # fix: Huggingface is np.float32
        if score is not None and isinstance(score, np.generic) and hasattr(score, "item"):
            score = score.item()
        score = round(score, 3)
        return language, score

    def _get_filters_string(self):
        filters = self.Langfilters
        return "-".join(filters).lower().strip() if filters is not None else ""

    def _parse_language(self, words, segment):
        LANG_JA = "ja"
        LANG_ZH = "zh"
        LANG_ZH_JA = f"{LANG_ZH}|{LANG_JA}"
        LANG_JA_ZH = f"{LANG_JA}|{LANG_ZH}"
        language = LANG_ZH
        regex_pattern = re.compile(r"([^\w\s]+)")
        lines = regex_pattern.split(segment)
        lines_max = len(lines)
        LANG_EOS = self._lang_eos
        for index, text in enumerate(lines):
            if len(text) == 0:
                continue
            EOS = index >= (lines_max - 1)
            nextId = index + 1
            nextText = lines[nextId] if not EOS else ""
            nextPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", nextText)).strip()) == 0
            textPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", text)).strip()) == 0
            if not EOS and (textPunc == True or (len(nextText.strip()) >= 0 and nextPunc == True)):
                lines[nextId] = f"{text}{nextText}"
                continue
            number_tags = re.compile(r"(⑥\d{6,}⑥)")
            cleans_text = re.sub(number_tags, "", text)
            cleans_text = re.sub(r"\d+", "", cleans_text)
            cleans_text = self._cleans_text(cleans_text)
            # fix: langid is inaccurate on short sentences, so short pieces are spliced onto the next one.
            if not EOS and len(cleans_text) <= 2:
                lines[nextId] = f"{text}{nextText}"
                continue
            language, score = self._lang_classify(cleans_text)
            prev_language, prev_text = self._get_prev_data(words)
            if language != LANG_ZH and all("\u4e00" <= c <= "\u9fff" for c in re.sub(r"\s", "", cleans_text)):
                language, score = LANG_ZH, 1
            if len(cleans_text) <= 5 and self._is_chinese(cleans_text):
                filters_string = self._get_filters_string()
                if score < self.LangPriorityThreshold and len(filters_string) > 0:
                    index_ja, index_zh = filters_string.find(LANG_JA), filters_string.find(LANG_ZH)
                    if index_ja != -1 and index_ja < index_zh:
                        language = LANG_JA
                    elif index_zh != -1 and index_zh < index_ja:
                        language = LANG_ZH
                if self._is_japanese_kana(cleans_text):
                    language = LANG_JA
                elif len(cleans_text) > 2 and score > 0.90:
                    pass
                elif EOS and LANG_EOS:
                    language = LANG_ZH if len(cleans_text) <= 1 else language
                else:
                    LANG_UNKNOWN = LANG_ZH_JA if language == LANG_ZH or (len(cleans_text) <= 2 and prev_language == LANG_ZH) else LANG_JA_ZH
                    match_end, match_char = self._match_ending(text, -1)
                    referen = prev_language in LANG_UNKNOWN or LANG_UNKNOWN in prev_language if prev_language else False
                    if match_char in "。.":
                        language = prev_language if referen and len(words) > 0 else language
                    else:
                        language = f"{LANG_UNKNOWN}|…"
            text, *_ = re.subn(number_tags, self._restore_number, text)
            self._addwords(words, language, text, score)

    # ----------------------------------------------------------
    # [SSML] Chinese Number Processing (SSML support)
    # The default here is Chinese, which is used to process SSML Chinese tags. Of course, any language can be supported, for example:
    # Chinese telephone number: <telephone>1234567</telephone>
    # Chinese number: <number>1234567</number>
    def _process_symbol_SSML(self, words, data):
        tag, match = data
        language = SSML = match[1]
        text = match[2]
        score = 1.0
        if SSML == "telephone":
            # Chinese: telephone number
            language = "zh"
            text = self.LangSSML.to_chinese_telephone(text)
        elif SSML == "number":
            # Chinese: digit-by-digit reading
            language = "zh"
            text = self.LangSSML.to_chinese_number(text)
        elif SSML == "currency":
            # Chinese: read as an amount of money
            language = "zh"
            text = self.LangSSML.to_chinese_currency(text)
        elif SSML == "date":
            # Chinese: read as a date
            language = "zh"
            text = self.LangSSML.to_chinese_date(text)
        self._addwords(words, language, text, score, SSML)

    # ----------------------------------------------------------
    def _restore_number(self, matche):
        value = matche.group(0)
        text_cache = self._text_cache
        if value in text_cache:
            process, data = text_cache[value]
            tag, match = data
            value = match
        return value

    def _pattern_symbols(self, item, text):
        if text is None:
            return text
        tag, pattern, process = item
        matches = pattern.findall(text)
        if len(matches) == 1 and "".join(matches[0]) == text:
            return text
        for i, match in enumerate(matches):
            key = f"⑥{tag}{i:06d}⑥"
            text = re.sub(pattern, key, text, count=1)
            self._text_cache[key] = (process, (tag, match))
        return text

    def _process_symbol(self, words, data):
        tag, match = data
        language = match[1]
        text = match[2]
        score = 1.0
        filters = self._get_filters_string()
        if language not in filters:
            self._process_symbol_SSML(words, data)
        else:
            self._addwords(words, language, text, score, True)

    def _process_english(self, words, data):
        tag, match = data
        text = match[0]
        filters = self._get_filters_string()
        priority_language = filters[:2]
        # Preview feature, other language segmentation processing
        enablePreview = self.EnablePreview
        if enablePreview == True:
            # Experimental: Other language support
            regex_pattern = re.compile(r"(.*?[。.??!!]+[\n]{,1})")
            lines = regex_pattern.split(text)
            for index, text in enumerate(lines):
                if len(text.strip()) == 0:
                    continue
                cleans_text = self._cleans_text(text)
                language, score = self._lang_classify(cleans_text)
                if language not in filters:
                    language, score = self._mean_processing(cleans_text)
                if language is None or score <= 0.0:
                    continue
                elif language in filters:
                    pass  # pass
                elif score >= 0.95:
                    continue  # High score, but not in the filter, excluded.
                elif score <= 0.15 and filters[:2] == "fr":
                    language = priority_language
                else:
                    language = "en"
                self._addwords(words, language, text, score)
        else:
            # Default is English
            language, score = "en", 1.0
            self._addwords(words, language, text, score)

    def _process_Russian(self, words, data):
        tag, match = data
        text = match[0]
        language = "ru"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_Thai(self, words, data):
        tag, match = data
        text = match[0]
        language = "th"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_korean(self, words, data):
        tag, match = data
        text = match[0]
        language = "ko"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_quotes(self, words, data):
        tag, match = data
        text = "".join(match)
        childs = self.PARSE_TAG.findall(text)
        if len(childs) > 0:
            self._process_tags(words, text, False)
        else:
            cleans_text = self._cleans_text(match[1])
            if len(cleans_text) <= 5:
                self._parse_language(words, text)
            else:
                language, score = self._lang_classify(cleans_text)
                self._addwords(words, language, text, score)

    def _process_pinyin(self, words, data):
        tag, match = data
        text = match
        language = "zh"
        score = 1.0
        self._addwords(words, language, text, score)

    def _process_number(self, words, data):  # "$0" process only
        """
        Numbers alone cannot accurately identify language.
        Because numbers are universal in all languages.
        So it won't be executed here, just for testing.
        """
        tag, match = data
        language = words[0]["lang"] if len(words) > 0 else "zh"
        text = match
        score = 0.0
        self._addwords(words, language, text, score)

    def _process_tags(self, words, text, root_tag):
        text_cache = self._text_cache
        segments = re.split(self.PARSE_TAG, text)
        segments_len = len(segments) - 1
        for index, text in enumerate(segments):
            if root_tag:
                self._lang_eos = index >= segments_len
            if self.PARSE_TAG.match(text):
                process, data = text_cache[text]
                if process:
                    process(words, data)
            else:
                self._parse_language(words, text)
        return words

    def _merge_results(self, words):
        new_word = []
        for index, cur_data in enumerate(words):
            if "symbol" in cur_data:
                del cur_data["symbol"]
            if index == 0:
                new_word.append(cur_data)
            else:
                pre_data = new_word[-1]
                if cur_data["lang"] == pre_data["lang"]:
                    pre_data["text"] = f"{pre_data['text']}{cur_data['text']}"
                else:
                    new_word.append(cur_data)
        return new_word

    def _parse_symbols(self, text):
        TAG_NUM = "00"  # "00" => default channels , "$0" => testing channel
        TAG_S1, TAG_S2, TAG_P1, TAG_P2, TAG_EN, TAG_KO, TAG_RU, TAG_TH = (
            "$1",
            "$2",
            "$3",
            "$4",
            "$5",
            "$6",
            "$7",
            "$8",
        )
        TAG_BASE = re.compile(rf'(([【《((“‘"\']*[LANGUAGE]+[\W\s]*)+)')
        # Get custom language filter
        filters = self.Langfilters
        filters = filters if filters is not None else ""
        # =======================================================================================================
        # Experimental: Other language support.
        # If characters for a given language are missing, contributors familiar with that language are welcome to submit the missing pronunciation symbols.
        # -------------------------------------------------------------------------------------------------------
        # Preview feature, other language support
        enablePreview = self.EnablePreview
        if "fr" in filters or "vi" in filters:
            enablePreview = True
        self.EnablePreview = enablePreview
        # Experimental: French character support.
        RE_FR = "" if not enablePreview else "àáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ"
        # Experimental: Vietnamese character support.
        RE_VI = "" if not enablePreview else "đơưăáàảãạắằẳẵặấầẩẫậéèẻẽẹếềểễệíìỉĩịóòỏõọốồổỗộớờởỡợúùủũụứừửữựôâêơưỷỹ"
        # -------------------------------------------------------------------------------------------------------
        # Basic options:
        process_list = [
            (
                TAG_S1,
                re.compile(self.SYMBOLS_PATTERN),
                self._process_symbol,
            ),  # Symbol Tag
            (
                TAG_KO,
                re.compile(re.sub(r"LANGUAGE", f"\uac00-\ud7a3", TAG_BASE.pattern)),
                self._process_korean,
            ),  # Korean words
            (
                TAG_TH,
                re.compile(re.sub(r"LANGUAGE", f"\u0e00-\u0e7f", TAG_BASE.pattern)),
                self._process_Thai,
            ),  # Thai words support.
791 ( 792 TAG_RU, 793 re.compile(re.sub(r"LANGUAGE", f"А-Яа-яЁё", TAG_BASE.pattern)), 794 self._process_Russian, 795 ), # Russian words support. 796 ( 797 TAG_NUM, 798 re.compile(r"(\W*\d+\W+\d*\W*\d*)"), 799 self._process_number, 800 ), # Number words, Universal in all languages, Ignore it. 801 ( 802 TAG_EN, 803 re.compile(re.sub(r"LANGUAGE", f"a-zA-Z{RE_FR}{RE_VI}", TAG_BASE.pattern)), 804 self._process_english, 805 ), # English words + Other language support. 806 ( 807 TAG_P1, 808 re.compile(r'(["\'])(.*?)(\1)'), 809 self._process_quotes, 810 ), # Regular quotes 811 ( 812 TAG_P2, 813 re.compile(r"([\n]*[【《((“‘])([^【《((“‘’”))》】]{3,})([’”))》】][\W\s]*[\n]{,1})"), 814 self._process_quotes, 815 ), # Special quotes, There are left and right. 816 ] 817 # Extended options: Default False 818 if self.keepPinyin == True: 819 process_list.insert( 820 1, 821 ( 822 TAG_S2, 823 re.compile(r"([\(({][^})\)]*?\d[^})\)]*?[})\])"), 824 self._process_pinyin, 825 ), # Chinese Pinyin Tag. 826 ) 827 # ------------------------------------------------------------------------------------------------------- 828 words = [] 829 lines = re.findall(r".*\n*", re.sub(self.PARSE_TAG, "", text)) 830 for index, text in enumerate(lines): 831 if len(text.strip()) == 0: 832 continue 833 self._lang_eos = False 834 self._text_cache = {} 835 for item in process_list: 836 text = self._pattern_symbols(item, text) 837 cur_word = self._process_tags([], text, True) 838 if len(cur_word) == 0: 839 continue 840 cur_data = cur_word[0] if len(cur_word) > 0 else None 841 pre_data = words[-1] if len(words) > 0 else None 842 if cur_data and pre_data and cur_data["lang"] == pre_data["lang"] and cur_data["symbol"] == False and pre_data["symbol"]: 843 cur_data["text"] = f"{pre_data['text']}{cur_data['text']}" 844 words.pop() 845 words += cur_word 846 if self.isLangMerge == True: 847 words = self._merge_results(words) 848 lang_count = self._lang_count 849 if lang_count and len(lang_count) > 0: 850 lang_count = 
dict(sorted(lang_count.items(), key=lambda x: x[1], reverse=True)) 851 lang_count = list(lang_count.items()) 852 self._lang_count = lang_count 853 return words 854 855 def setfilters(self, filters): 856 # 当过滤器更改时,清除缓存 857 # 필터가 변경되면 캐시를 지웁니다. 858 # フィルタが変更されると、キャッシュがクリアされます 859 # When the filter changes, clear the cache 860 if self.Langfilters != filters: 861 self._clears() 862 self.Langfilters = filters 863 864 def getfilters(self): 865 return self.Langfilters 866 867 def setPriorityThreshold(self, threshold: float): 868 self.LangPriorityThreshold = threshold 869 870 def getPriorityThreshold(self): 871 return self.LangPriorityThreshold 872 873 def getCounts(self): 874 lang_count = self._lang_count 875 if lang_count is not None: 876 return lang_count 877 text_langs = self._text_langs 878 if text_langs is None or len(text_langs) == 0: 879 return [("zh", 0)] 880 lang_counts = defaultdict(int) 881 for d in text_langs: 882 lang_counts[d["lang"]] += int(len(d["text"]) * 2) if d["lang"] == "zh" else len(d["text"]) 883 lang_counts = dict(sorted(lang_counts.items(), key=lambda x: x[1], reverse=True)) 884 lang_counts = list(lang_counts.items()) 885 self._lang_count = lang_counts 886 return lang_counts 887 888 def getTexts(self, text: str): 889 if text is None or len(text.strip()) == 0: 890 self._clears() 891 return [] 892 # lasts 893 text_langs = self._text_langs 894 if self._text_lasts == text and text_langs is not None: 895 return text_langs 896 # parse 897 self._text_waits = [] 898 self._lang_count = None 899 self._text_lasts = text 900 text = self._parse_symbols(text) 901 self._text_langs = text 902 return text 903 904 def classify(self, text: str): 905 return self.getTexts(text) 906 907 908def printList(langlist): 909 """ 910 功能:打印数组结果 911 기능: 어레이 결과 인쇄 912 機能:配列結果を印刷 913 Function: Print array results 914 """ 915 print("\n===================【打印结果】===================") 916 if langlist is None or len(langlist) == 0: 917 print("无内容结果,No content result") 918 return 919 for 
line in langlist: 920 print(line) 921 pass 922 923 924def main(): 925 # ----------------------------------- 926 # 更新日志:新版本分词更加精准。 927 # Changelog: The new version of the word segmentation is more accurate. 928 # チェンジログ:新しいバージョンの単語セグメンテーションはより正確です。 929 # Changelog: 분할이라는 단어의 새로운 버전이 더 정확합니다. 930 # ----------------------------------- 931 932 # 输入示例1:(包含日文,中文)Input Example 1: (including Japanese, Chinese) 933 # text = "“昨日は雨が降った,音楽、映画。。。”你今天学习日语了吗?春は桜の季節です。语种分词是语音合成必不可少的环节。言語分詞は音声合成に欠かせない環節である!" 934 935 # 输入示例2:(包含日文,中文)Input Example 1: (including Japanese, Chinese) 936 # text = "欢迎来玩。東京,は日本の首都です。欢迎来玩. 太好了!" 937 938 # 输入示例3:(包含日文,中文)Input Example 1: (including Japanese, Chinese) 939 # text = "明日、私たちは海辺にバカンスに行きます。你会说日语吗:“中国語、話せますか” 你的日语真好啊!" 940 941 # 输入示例4:(包含日文,中文,韩语,英文)Input Example 4: (including Japanese, Chinese, Korean, English) 942 # text = "你的名字叫<ja>佐々木?<ja>吗?韩语中的안녕 오빠读什么呢?あなたの体育の先生は誰ですか? 此次发布会带来了四款iPhone 15系列机型和三款Apple Watch等一系列新品,这次的iPad Air采用了LCD屏幕" 943 944 # 试验性支持:"fr"法语 , "vi"越南语 , "ru"俄语 , "th"泰语。Experimental: Other language support. 945 langsegment = LangSegment() 946 langsegment.setfilters(["fr", "vi", "ja", "zh", "ko", "en", "ru", "th"]) 947 text = """ 948我喜欢在雨天里听音乐。 949I enjoy listening to music on rainy days. 950雨の日に音楽を聴くのが好きです。 951비 오는 날에 음악을 듣는 것을 즐깁니다。 952J'aime écouter de la musique les jours de pluie. 953Tôi thích nghe nhạc vào những ngày mưa. 954Мне нравится слушать музыку в дождливую погоду. 
955ฉันชอบฟังเพลงในวันที่ฝนตก 956""" 957 958 # 进行分词:(接入TTS项目仅需一行代码调用)Segmentation: (Only one line of code is required to access the TTS project) 959 langlist = langsegment.getTexts(text) 960 printList(langlist) 961 962 # 语种统计:Language statistics: 963 print("\n===================【语种统计】===================") 964 # 获取所有语种数组结果,根据内容字数降序排列 965 # Get the array results in all languages, sorted in descending order according to the number of content words 966 langCounts = langsegment.getCounts() 967 print(langCounts, "\n") 968 969 # 根据结果获取内容的主要语种 (语言,字数含标点) 970 # Get the main language of content based on the results (language, word count including punctuation) 971 lang, count = langCounts[0] 972 print(f"输入内容的主要语言为 = {lang} ,字数 = {count}") 973 print("==================================================\n") 974 975 # 分词输出:lang=语言,text=内容。Word output: lang = language, text = content 976 # ===================【打印结果】=================== 977 # {'lang': 'zh', 'text': '你的名字叫'} 978 # {'lang': 'ja', 'text': '佐々木?'} 979 # {'lang': 'zh', 'text': '吗?韩语中的'} 980 # {'lang': 'ko', 'text': '안녕 오빠'} 981 # {'lang': 'zh', 'text': '读什么呢?'} 982 # {'lang': 'ja', 'text': 'あなたの体育の先生は誰ですか?'} 983 # {'lang': 'zh', 'text': ' 此次发布会带来了四款'} 984 # {'lang': 'en', 'text': 'i Phone '} 985 # {'lang': 'zh', 'text': '15系列机型和三款'} 986 # {'lang': 'en', 'text': 'Apple Watch '} 987 # {'lang': 'zh', 'text': '等一系列新品,这次的'} 988 # {'lang': 'en', 'text': 'i Pad Air '} 989 # {'lang': 'zh', 'text': '采用了'} 990 # {'lang': 'en', 'text': 'L C D '} 991 # {'lang': 'zh', 'text': '屏幕'} 992 # ===================【语种统计】=================== 993 994 # ===================【语种统计】=================== 995 # [('zh', 51), ('ja', 19), ('en', 18), ('ko', 5)] 996 997 # 输入内容的主要语言为 = zh ,字数 = 51 998 # ================================================== 999 # The main language of the input content is = zh, word count = 51 1000 1001 1002if __name__ == "__main__": 1003 main()
class LangSSML:
    def __init__(self):
        # Plain digits, read one character at a time
        self._zh_numerals_number = {
            "0": "零",
            "1": "一",
            "2": "二",
            "3": "三",
            "4": "四",
            "5": "五",
            "6": "六",
            "7": "七",
            "8": "八",
            "9": "九",
        }

    # Standardize 2024/8/24, 2024-08, 08-24, 24 to "year-month-day"
    def _format_chinese_data(self, date_str: str):
        # Normalize the date format
        input_date = date_str
        if date_str is None or date_str.strip() == "":
            return ""
        date_str = re.sub(r"[\/\._|年|月]", "-", date_str)
        date_str = re.sub(r"日", r"", date_str)
        date_arrs = date_str.split(" ")
        if len(date_arrs) == 1 and ":" in date_arrs[0]:
            time_str = date_arrs[0]
            date_arrs = []
        else:
            time_str = date_arrs[1] if len(date_arrs) >= 2 else ""

        def nonZero(num, cn, func=None):
            if func is not None:
                num = func(num)
            return f"{num}{cn}" if num is not None and num != "" and num != "0" else ""

        f_number = self.to_chinese_number
        f_currency = self.to_chinese_currency
        # year, month, day
        year_month_day = ""
        if len(date_arrs) > 0:
            year, month, day = "", "", ""
            parts = date_arrs[0].split("-")
            if len(parts) == 3:  # format: YYYY-MM-DD
                year, month, day = parts
            elif len(parts) == 2:  # format: MM-DD or YYYY-MM
                if len(parts[0]) == 4:  # year-month
                    year, month = parts
                else:
                    month, day = parts  # month-day
            elif len(parts[0]) > 0:  # a single field: a year (4 characters) or a day
                if len(parts[0]) == 4:
                    year = parts[0]
                else:
                    day = parts[0]
            year, month, day = (
                nonZero(year, "年", f_number),
                nonZero(month, "月", f_currency),
                nonZero(day, "日", f_currency),
            )
            year_month_day = re.sub(r"([年|月|日])+", r"\1", f"{year}{month}{day}")
        # hours, minutes, seconds
        time_str = re.sub(r"[\/\.\-:_]", ":", time_str)
        time_arrs = time_str.split(":")
        hours, minutes, seconds = "", "", ""
        if len(time_arrs) == 3:  # H/M/S
            hours, minutes, seconds = time_arrs
        elif len(time_arrs) == 2:  # H/M
            hours, minutes = time_arrs
        elif len(time_arrs[0]) > 0:
            hours = f"{time_arrs[0]}点"  # H
        if len(time_arrs) > 1:
            hours, minutes, seconds = (
                nonZero(hours, "点", f_currency),
                nonZero(minutes, "分", f_currency),
                nonZero(seconds, "秒", f_currency),
            )
        hours_minutes_seconds = re.sub(r"([点|分|秒])+", r"\1", f"{hours}{minutes}{seconds}")
        output_date = f"{year_month_day}{hours_minutes_seconds}"
        return output_date

    # [SSML] number = read digits as Chinese numerals, one character per digit
    def to_chinese_number(self, num: str):
        pattern = r"(\d+)"
        zh_numerals = self._zh_numerals_number
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += "".join(zh_numerals[digit] if digit in zh_numerals else "" for digit in str(item))
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] telephone = read digits as a Chinese phone number, one character per digit
    def to_chinese_telephone(self, num: str):
        output = self.to_chinese_number(num.replace("+86", ""))  # zh +86
        output = output.replace("一", "幺")
        return output

    # [SSML] currency = read as an amount.
    # Digital processing from GPT_SoVITS num.py (thanks)
    def to_chinese_currency(self, num: str):
        pattern = r"(\d+)"
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += num2str(item)
            else:
                output += item
        output = output.replace(".", "点")
        return output

    # [SSML] date = read as a date. Supports inputs such as 2024年08月24, 2024/8/24, 2024-08, 08-24, 24.
    def to_chinese_date(self, num: str):
        chinese_date = self._format_chinese_data(num)
        return chinese_date
    def to_chinese_number(self, num: str):
        pattern = r"(\d+)"
        zh_numerals = self._zh_numerals_number
        arrs = re.split(pattern, num)
        output = ""
        for item in arrs:
            if re.match(pattern, item):
                output += "".join(zh_numerals[digit] if digit in zh_numerals else "" for digit in str(item))
            else:
                output += item
        output = output.replace(".", "点")
        return output
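The digit-by-digit reading can be tried outside the class. The following is a minimal standalone sketch, not the library's API: the names `ZH_DIGITS`, `digits_to_chinese` and `telephone_to_chinese` are hypothetical, and the behavior mirrors `to_chinese_number` and `to_chinese_telephone` as listed here (split on digit runs, map each digit, read "." as 点, and for phone numbers drop "+86" and read 1 as 幺).

```python
import re

# Digit-to-numeral table, copied from the mapping in LangSSML.__init__.
ZH_DIGITS = {
    "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}


def digits_to_chinese(text: str) -> str:
    """Read every digit run one character at a time; keep other text as-is."""
    out = []
    # The capturing group makes re.split keep the digit runs in the result.
    for part in re.split(r"(\d+)", text):
        if re.fullmatch(r"\d+", part):
            out.append("".join(ZH_DIGITS.get(ch, "") for ch in part))
        else:
            out.append(part)
    # A decimal point is read aloud as 点.
    return "".join(out).replace(".", "点")


def telephone_to_chinese(text: str) -> str:
    """Phone-number reading: drop a "+86" prefix and read 1 as 幺."""
    return digits_to_chinese(text.replace("+86", "")).replace("一", "幺")


if __name__ == "__main__":
    print(digits_to_chinese("3.14"))
    print(telephone_to_chinese("+8613800"))
```

Splitting with a capturing group keeps the non-digit text in place, so surrounding characters pass through unchanged and only the digit runs are rewritten.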
260class LangSegment: 261 def __init__(self): 262 self.langid = LanguageIdentifier.from_pickled_model(MODEL_FILE, norm_probs=True) 263 264 self._text_cache = None 265 self._text_lasts = None 266 self._text_langs = None 267 self._lang_count = None 268 self._lang_eos = None 269 270 # 可自定义语言匹配标签:カスタマイズ可能な言語対応タグ:사용자 지정 가능한 언어 일치 태그: 271 # Customizable language matching tags: These are supported,이 표현들은 모두 지지합니다 272 # <zh>你好<zh> , <ja>佐々木</ja> , <en>OK<en> , <ko>오빠</ko> 这些写法均支持 273 self.SYMBOLS_PATTERN = r"(<([a-zA-Z|-]*)>(.*?)<\/*[a-zA-Z|-]*>)" 274 275 # 语言过滤组功能, 可以指定保留语言。不在过滤组中的语言将被清除。您可随心搭配TTS语音合成所支持的语言。 276 # 언어 필터 그룹 기능을 사용하면 예약된 언어를 지정할 수 있습니다. 필터 그룹에 없는 언어는 지워집니다. TTS 텍스트에서 지원하는 언어를 원하는 대로 일치시킬 수 있습니다. 277 # 言語フィルターグループ機能では、予約言語を指定できます。フィルターグループに含まれていない言語はクリアされます。TTS音声合成がサポートする言語を自由に組み合わせることができます。 278 # The language filter group function allows you to specify reserved languages. 279 # Languages not in the filter group will be cleared. You can match the languages supported by TTS Text To Speech as you like. 280 # 排名越前,优先级越高,The higher the ranking, the higher the priority,ランキングが上位になるほど、優先度が高くなります。 281 282 # 系统默认过滤器。System default filter。(ISO 639-1 codes given) 283 # ---------------------------------------------------------------------------------------------------------------------------------- 284 # "zh"中文=Chinese ,"en"英语=English ,"ja"日语=Japanese ,"ko"韩语=Korean ,"fr"法语=French ,"vi"越南语=Vietnamese , "ru"俄语=Russian 285 # "th"泰语=Thai 286 # ---------------------------------------------------------------------------------------------------------------------------------- 287 self.DEFAULT_FILTERS = ["zh", "ja", "ko", "en"] 288 289 # 用户可自定义过滤器。User-defined filters 290 self.Langfilters = self.DEFAULT_FILTERS[:] # 创建副本 291 292 # 合并文本 293 self.isLangMerge = True 294 295 # 试验性支持:您可自定义添加:"fr"法语 , "vi"越南语。Experimental: You can customize to add: "fr" French, "vi" Vietnamese. 
296 # 请使用API启用:self.setfilters(["zh", "en", "ja", "ko", "fr", "vi" , "ru" , "th"]) # 您可自定义添加,如:"fr"法语 , "vi"越南语。 297 298 # 预览版功能,自动启用或禁用,无需设置 299 # Preview feature, automatically enabled or disabled, no settings required 300 self.EnablePreview = False 301 302 # 除此以外,它支持简写过滤器,只需按不同语种任意组合即可。 303 # In addition to that, it supports abbreviation filters, allowing for any combination of different languages. 304 # 示例:您可以任意指定多种组合,进行过滤 305 # Example: You can specify any combination to filter 306 307 # 中/日语言优先级阀值(评分范围为 0 ~ 1):评分低于设定阀值 <0.89 时,启用 filters 中的优先级。\n 308 # 중/일본어 우선 순위 임계값(점수 범위 0-1): 점수가 설정된 임계값 <0.89보다 낮을 때 필터에서 우선 순위를 활성화합니다. 309 # 中国語/日本語の優先度しきい値(スコア範囲0〜1):スコアが設定されたしきい値<0.89未満の場合、フィルターの優先度が有効になります。\n 310 # Chinese and Japanese language priority threshold (score range is 0 ~ 1): The default threshold is 0.89. \n 311 # Only the common characters between Chinese and Japanese are processed with confidence and priority. \n 312 self.LangPriorityThreshold = 0.89 313 314 # Langfilters = ["zh"] # 按中文识别 315 # Langfilters = ["en"] # 按英文识别 316 # Langfilters = ["ja"] # 按日文识别 317 # Langfilters = ["ko"] # 按韩文识别 318 # Langfilters = ["zh_ja"] # 中日混合识别 319 # Langfilters = ["zh_en"] # 中英混合识别 320 # Langfilters = ["ja_en"] # 日英混合识别 321 # Langfilters = ["zh_ko"] # 中韩混合识别 322 # Langfilters = ["ja_ko"] # 日韩混合识别 323 # Langfilters = ["en_ko"] # 英韩混合识别 324 # Langfilters = ["zh_ja_en"] # 中日英混合识别 325 # Langfilters = ["zh_ja_en_ko"] # 中日英韩混合识别 326 327 # 更多过滤组合,请您随意。。。For more filter combinations, please feel free to...... 328 # より多くのフィルターの組み合わせ、お気軽に。。。더 많은 필터 조합을 원하시면 자유롭게 해주세요. ..... 
329 330 # 可选保留:支持中文数字拼音格式,更方便前端实现拼音音素修改和推理,默认关闭 False 。 331 # 开启后 True ,括号内的数字拼音格式均保留,并识别输出为:"zh"中文。 332 self.keepPinyin = False 333 334 # DEFINITION 335 self.PARSE_TAG = re.compile(r"(⑥\$*\d+[\d]{6,}⑥)") 336 337 self.LangSSML = LangSSML() 338 339 def _clears(self): 340 self._text_cache = None 341 self._text_lasts = None 342 self._text_langs = None 343 self._text_waits = None 344 self._lang_count = None 345 self._lang_eos = None 346 347 def _is_english_word(self, word): 348 return bool(re.match(r"^[a-zA-Z]+$", word)) 349 350 def _is_chinese(self, word): 351 for char in word: 352 if "\u4e00" <= char <= "\u9fff": 353 return True 354 return False 355 356 def _is_japanese_kana(self, word): 357 pattern = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]+") 358 matches = pattern.findall(word) 359 return len(matches) > 0 360 361 def _insert_english_uppercase(self, word): 362 modified_text = re.sub(r"(?<!\b)([A-Z])", r" \1", word) 363 modified_text = modified_text.strip("-") 364 return modified_text + " " 365 366 def _split_camel_case(self, word): 367 return re.sub(r"(?<!^)(?=[A-Z])", " ", word) 368 369 def _statistics(self, language, text): 370 # Language word statistics: 371 # Chinese characters usually occupy double bytes 372 if self._lang_count is None or not isinstance(self._lang_count, defaultdict): 373 self._lang_count = defaultdict(int) 374 lang_count = self._lang_count 375 if not "|" in language: 376 lang_count[language] += int(len(text) * 2) if language == "zh" else len(text) 377 self._lang_count = lang_count 378 379 def _clear_text_number(self, text): 380 if text == "\n": 381 return text, False # Keep Line Breaks 382 clear_text = re.sub(r"([^\w\s]+)", "", re.sub(r"\n+", "", text)).strip() 383 is_number = len(re.sub(re.compile(r"(\d+)"), "", clear_text)) == 0 384 return clear_text, is_number 385 386 def _saveData(self, words, language: str, text: str, score: float, symbol=None): 387 # Pre-detection 388 clear_text, is_number = self._clear_text_number(text) 389 # Merge the 
same language and save the results 390 preData = words[-1] if len(words) > 0 else None 391 if symbol is not None: 392 pass 393 elif preData is not None and preData["symbol"] is None: 394 if len(clear_text) == 0: 395 language = preData["lang"] 396 elif is_number == True: 397 language = preData["lang"] 398 _, pre_is_number = self._clear_text_number(preData["text"]) 399 if preData["lang"] == language: 400 self._statistics(preData["lang"], text) 401 text = preData["text"] + text 402 preData["text"] = text 403 return preData 404 elif pre_is_number == True: 405 text = f"{preData['text']}{text}" 406 words.pop() 407 elif is_number == True: 408 priority_language = self._get_filters_string()[:2] 409 if priority_language in "ja-zh-en-ko-fr-vi": 410 language = priority_language 411 data = {"lang": language, "text": text, "score": score, "symbol": symbol} 412 filters = self.Langfilters 413 if filters is None or len(filters) == 0 or "?" in language or language in filters or language in filters[0] or filters[0] == "*" or filters[0] in "alls-mixs-autos": 414 words.append(data) 415 self._statistics(data["lang"], data["text"]) 416 return data 417 418 def _addwords(self, words, language, text, score, symbol=None): 419 if text == "\n": 420 pass # Keep Line Breaks 421 elif text is None or len(text.strip()) == 0: 422 return True 423 if language is None: 424 language = "" 425 language = language.lower() 426 if language == "en": 427 text = self._insert_english_uppercase(text) 428 # text = re.sub(r'[(())]', ',' , text) # Keep it. 
429 text_waits = self._text_waits 430 ispre_waits = len(text_waits) > 0 431 preResult = text_waits.pop() if ispre_waits else None 432 if preResult is None: 433 preResult = words[-1] if len(words) > 0 else None 434 if preResult and ("|" in preResult["lang"]): 435 pre_lang = preResult["lang"] 436 if language in pre_lang: 437 preResult["lang"] = language = language.split("|")[0] 438 else: 439 preResult["lang"] = pre_lang.split("|")[0] 440 if ispre_waits: 441 preResult = self._saveData( 442 words, 443 preResult["lang"], 444 preResult["text"], 445 preResult["score"], 446 preResult["symbol"], 447 ) 448 pre_lang = preResult["lang"] if preResult else None 449 if ("|" in language) and (pre_lang and not pre_lang in language and not "…" in language): 450 language = language.split("|")[0] 451 if "|" in language: 452 self._text_waits.append({"lang": language, "text": text, "score": score, "symbol": symbol}) 453 else: 454 self._saveData(words, language, text, score, symbol) 455 return False 456 457 def _get_prev_data(self, words): 458 data = words[-1] if words and len(words) > 0 else None 459 if data: 460 return (data["lang"], data["text"]) 461 return (None, "") 462 463 def _match_ending(self, input, index): 464 if input is None or len(input) == 0: 465 return False, None 466 input = re.sub(r"\s+", "", input) 467 if len(input) == 0 or abs(index) > len(input): 468 return False, None 469 ending_pattern = re.compile(r'([「」“”‘’"\'::。.!!?.?])') 470 return ending_pattern.match(input[index]), input[index] 471 472 def _cleans_text(self, cleans_text): 473 cleans_text = re.sub(r"(.*?)([^\w]+)", r"\1 ", cleans_text) 474 cleans_text = re.sub(r"(.)\1+", r"\1", cleans_text) 475 return cleans_text.strip() 476 477 def _mean_processing(self, text: str): 478 if text is None or (text.strip()) == "": 479 return None, 0.0 480 arrs = self._split_camel_case(text).split(" ") 481 langs = [] 482 for t in arrs: 483 if len(t.strip()) <= 3: 484 continue 485 language, score = self.langid.classify(t) 486 
langs.append({"lang": language}) 487 if len(langs) == 0: 488 return None, 0.0 489 return Counter([item["lang"] for item in langs]).most_common(1)[0][0], 1.0 490 491 def _lang_classify(self, cleans_text): 492 language, score = self.langid.classify(cleans_text) 493 # fix: Huggingface is np.float32 494 if score is not None and isinstance(score, np.generic) and hasattr(score, "item"): 495 score = score.item() 496 score = round(score, 3) 497 return language, score 498 499 def _get_filters_string(self): 500 filters = self.Langfilters 501 return "-".join(filters).lower().strip() if filters is not None else "" 502 503 def _parse_language(self, words, segment): 504 LANG_JA = "ja" 505 LANG_ZH = "zh" 506 LANG_ZH_JA = f"{LANG_ZH}|{LANG_JA}" 507 LANG_JA_ZH = f"{LANG_JA}|{LANG_ZH}" 508 language = LANG_ZH 509 regex_pattern = re.compile(r"([^\w\s]+)") 510 lines = regex_pattern.split(segment) 511 lines_max = len(lines) 512 LANG_EOS = self._lang_eos 513 for index, text in enumerate(lines): 514 if len(text) == 0: 515 continue 516 EOS = index >= (lines_max - 1) 517 nextId = index + 1 518 nextText = lines[nextId] if not EOS else "" 519 nextPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", nextText)).strip()) == 0 520 textPunc = len(re.sub(regex_pattern, "", re.sub(r"\n+", "", text)).strip()) == 0 521 if not EOS and (textPunc == True or (len(nextText.strip()) >= 0 and nextPunc == True)): 522 lines[nextId] = f"{text}{nextText}" 523 continue 524 number_tags = re.compile(r"(⑥\d{6,}⑥)") 525 cleans_text = re.sub(number_tags, "", text) 526 cleans_text = re.sub(r"\d+", "", cleans_text) 527 cleans_text = self._cleans_text(cleans_text) 528 # fix:Langid's recognition of short sentences is inaccurate, and it is spliced longer. 
529 if not EOS and len(cleans_text) <= 2: 530 lines[nextId] = f"{text}{nextText}" 531 continue 532 language, score = self._lang_classify(cleans_text) 533 prev_language, prev_text = self._get_prev_data(words) 534 if language != LANG_ZH and all("\u4e00" <= c <= "\u9fff" for c in re.sub(r"\s", "", cleans_text)): 535 language, score = LANG_ZH, 1 536 if len(cleans_text) <= 5 and self._is_chinese(cleans_text): 537 filters_string = self._get_filters_string() 538 if score < self.LangPriorityThreshold and len(filters_string) > 0: 539 index_ja, index_zh = filters_string.find(LANG_JA), filters_string.find(LANG_ZH) 540 if index_ja != -1 and index_ja < index_zh: 541 language = LANG_JA 542 elif index_zh != -1 and index_zh < index_ja: 543 language = LANG_ZH 544 if self._is_japanese_kana(cleans_text): 545 language = LANG_JA 546 elif len(cleans_text) > 2 and score > 0.90: 547 pass 548 elif EOS and LANG_EOS: 549 language = LANG_ZH if len(cleans_text) <= 1 else language 550 else: 551 LANG_UNKNOWN = LANG_ZH_JA if language == LANG_ZH or (len(cleans_text) <= 2 and prev_language == LANG_ZH) else LANG_JA_ZH 552 match_end, match_char = self._match_ending(text, -1) 553 referen = prev_language in LANG_UNKNOWN or LANG_UNKNOWN in prev_language if prev_language else False 554 if match_char in "。.": 555 language = prev_language if referen and len(words) > 0 else language 556 else: 557 language = f"{LANG_UNKNOWN}|…" 558 text, *_ = re.subn(number_tags, self._restore_number, text) 559 self._addwords(words, language, text, score) 560 561 # ---------------------------------------------------------- 562 # 【SSML】中文数字处理:Chinese Number Processing (SSML support) 563 # 这里默认都是中文,用于处理 SSML 中文标签。当然可以支持任意语言,例如: 564 # The default here is Chinese, which is used to process SSML Chinese tags. 
Of course, any language can be supported, for example: 565 # 中文电话号码:<telephone>1234567</telephone> 566 # 中文数字号码:<number>1234567</number> 567 def _process_symbol_SSML(self, words, data): 568 tag, match = data 569 language = SSML = match[1] 570 text = match[2] 571 score = 1.0 572 if SSML == "telephone": 573 # 中文-电话号码 574 language = "zh" 575 text = self.LangSSML.to_chinese_telephone(text) 576 elif SSML == "number": 577 # 中文-数字读法 578 language = "zh" 579 text = self.LangSSML.to_chinese_number(text) 580 elif SSML == "currency": 581 # 中文-按金额发音 582 language = "zh" 583 text = self.LangSSML.to_chinese_currency(text) 584 elif SSML == "date": 585 # 中文-按金额发音 586 language = "zh" 587 text = self.LangSSML.to_chinese_date(text) 588 self._addwords(words, language, text, score, SSML) 589 590 # ---------------------------------------------------------- 591 def _restore_number(self, matche): 592 value = matche.group(0) 593 text_cache = self._text_cache 594 if value in text_cache: 595 process, data = text_cache[value] 596 tag, match = data 597 value = match 598 return value 599 600 def _pattern_symbols(self, item, text): 601 if text is None: 602 return text 603 tag, pattern, process = item 604 matches = pattern.findall(text) 605 if len(matches) == 1 and "".join(matches[0]) == text: 606 return text 607 for i, match in enumerate(matches): 608 key = f"⑥{tag}{i:06d}⑥" 609 text = re.sub(pattern, key, text, count=1) 610 self._text_cache[key] = (process, (tag, match)) 611 return text 612 613 def _process_symbol(self, words, data): 614 tag, match = data 615 language = match[1] 616 text = match[2] 617 score = 1.0 618 filters = self._get_filters_string() 619 if language not in filters: 620 self._process_symbol_SSML(words, data) 621 else: 622 self._addwords(words, language, text, score, True) 623 624 def _process_english(self, words, data): 625 tag, match = data 626 text = match[0] 627 filters = self._get_filters_string() 628 priority_language = filters[:2] 629 # Preview feature, other language 
segmentation processing 630 enablePreview = self.EnablePreview 631 if enablePreview == True: 632 # Experimental: Other language support 633 regex_pattern = re.compile(r"(.*?[。.??!!]+[\n]{,1})") 634 lines = regex_pattern.split(text) 635 for index, text in enumerate(lines): 636 if len(text.strip()) == 0: 637 continue 638 cleans_text = self._cleans_text(text) 639 language, score = self._lang_classify(cleans_text) 640 if language not in filters: 641 language, score = self._mean_processing(cleans_text) 642 if language is None or score <= 0.0: 643 continue 644 elif language in filters: 645 pass # pass 646 elif score >= 0.95: 647 continue # High score, but not in the filter, excluded. 648 elif score <= 0.15 and filters[:2] == "fr": 649 language = priority_language 650 else: 651 language = "en" 652 self._addwords(words, language, text, score) 653 else: 654 # Default is English 655 language, score = "en", 1.0 656 self._addwords(words, language, text, score) 657 658 def _process_Russian(self, words, data): 659 tag, match = data 660 text = match[0] 661 language = "ru" 662 score = 1.0 663 self._addwords(words, language, text, score) 664 665 def _process_Thai(self, words, data): 666 tag, match = data 667 text = match[0] 668 language = "th" 669 score = 1.0 670 self._addwords(words, language, text, score) 671 672 def _process_korean(self, words, data): 673 tag, match = data 674 text = match[0] 675 language = "ko" 676 score = 1.0 677 self._addwords(words, language, text, score) 678 679 def _process_quotes(self, words, data): 680 tag, match = data 681 text = "".join(match) 682 childs = self.PARSE_TAG.findall(text) 683 if len(childs) > 0: 684 self._process_tags(words, text, False) 685 else: 686 cleans_text = self._cleans_text(match[1]) 687 if len(cleans_text) <= 5: 688 self._parse_language(words, text) 689 else: 690 language, score = self._lang_classify(cleans_text) 691 self._addwords(words, language, text, score) 692 693 def _process_pinyin(self, words, data): 694 tag, match = data 
695 text = match 696 language = "zh" 697 score = 1.0 698 self._addwords(words, language, text, score) 699 700 def _process_number(self, words, data): # "$0" process only 701 """ 702 Numbers alone cannot accurately identify language. 703 Because numbers are universal in all languages. 704 So it won't be executed here, just for testing. 705 """ 706 tag, match = data 707 language = words[0]["lang"] if len(words) > 0 else "zh" 708 text = match 709 score = 0.0 710 self._addwords(words, language, text, score) 711 712 def _process_tags(self, words, text, root_tag): 713 text_cache = self._text_cache 714 segments = re.split(self.PARSE_TAG, text) 715 segments_len = len(segments) - 1 716 for index, text in enumerate(segments): 717 if root_tag: 718 self._lang_eos = index >= segments_len 719 if self.PARSE_TAG.match(text): 720 process, data = text_cache[text] 721 if process: 722 process(words, data) 723 else: 724 self._parse_language(words, text) 725 return words 726 727 def _merge_results(self, words): 728 new_word = [] 729 for index, cur_data in enumerate(words): 730 if "symbol" in cur_data: 731 del cur_data["symbol"] 732 if index == 0: 733 new_word.append(cur_data) 734 else: 735 pre_data = new_word[-1] 736 if cur_data["lang"] == pre_data["lang"]: 737 pre_data["text"] = f"{pre_data['text']}{cur_data['text']}" 738 else: 739 new_word.append(cur_data) 740 return new_word 741 742 def _parse_symbols(self, text): 743 TAG_NUM = "00" # "00" => default channels , "$0" => testing channel 744 TAG_S1, TAG_S2, TAG_P1, TAG_P2, TAG_EN, TAG_KO, TAG_RU, TAG_TH = ( 745 "$1", 746 "$2", 747 "$3", 748 "$4", 749 "$5", 750 "$6", 751 "$7", 752 "$8", 753 ) 754 TAG_BASE = re.compile(rf'(([【《((“‘"\']*[LANGUAGE]+[\W\s]*)+)') 755 # Get custom language filter 756 filters = self.Langfilters 757 filters = filters if filters is not None else "" 758 # ======================================================================================================= 759 # Experimental: Other language support.Thử nghiệm: Hỗ 
trợ ngôn ngữ khác.Expérimental : prise en charge d’autres langues. 760 # 相关语言字符如有缺失,熟悉相关语言的朋友,可以提交把缺失的发音符号补全。 761 # If relevant language characters are missing, friends who are familiar with the relevant languages can submit a submission to complete the missing pronunciation symbols. 762 # S'il manque des caractères linguistiques pertinents, les amis qui connaissent les langues concernées peuvent soumettre une soumission pour compléter les symboles de prononciation manquants. 763 # Nếu thiếu ký tự ngôn ngữ liên quan, những người bạn quen thuộc với ngôn ngữ liên quan có thể gửi bài để hoàn thành các ký hiệu phát âm còn thiếu. 764 # ------------------------------------------------------------------------------------------------------- 765 # Preview feature, other language support 766 enablePreview = self.EnablePreview 767 if "fr" in filters or "vi" in filters: 768 enablePreview = True 769 self.EnablePreview = enablePreview 770 # 实验性:法语字符支持。Prise en charge des caractères français 771 RE_FR = "" if not enablePreview else "àáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ" 772 # 实验性:越南语字符支持。Hỗ trợ ký tự tiếng Việt 773 RE_VI = "" if not enablePreview else "đơưăáàảãạắằẳẵặấầẩẫậéèẻẽẹếềểễệíìỉĩịóòỏõọốồổỗộớờởỡợúùủũụứừửữựôâêơưỷỹ" 774 # ------------------------------------------------------------------------------------------------------- 775 # Basic options: 776 process_list = [ 777 ( 778 TAG_S1, 779 re.compile(self.SYMBOLS_PATTERN), 780 self._process_symbol, 781 ), # Symbol Tag 782 ( 783 TAG_KO, 784 re.compile(re.sub(r"LANGUAGE", f"\uac00-\ud7a3", TAG_BASE.pattern)), 785 self._process_korean, 786 ), # Korean words 787 ( 788 TAG_TH, 789 re.compile(re.sub(r"LANGUAGE", f"\u0e00-\u0e7f", TAG_BASE.pattern)), 790 self._process_Thai, 791 ), # Thai words support. 792 ( 793 TAG_RU, 794 re.compile(re.sub(r"LANGUAGE", f"А-Яа-яЁё", TAG_BASE.pattern)), 795 self._process_Russian, 796 ), # Russian words support. 
            (
                TAG_NUM,
                re.compile(r"(\W*\d+\W+\d*\W*\d*)"),
                self._process_number,
            ),  # Number words, Universal in all languages, Ignore it.
            (
                TAG_EN,
                re.compile(re.sub(r"LANGUAGE", f"a-zA-Z{RE_FR}{RE_VI}", TAG_BASE.pattern)),
                self._process_english,
            ),  # English words + Other language support.
            (
                TAG_P1,
                re.compile(r'(["\'])(.*?)(\1)'),
                self._process_quotes,
            ),  # Regular quotes
            (
                TAG_P2,
                re.compile(r"([\n]*[【《((“‘])([^【《((“‘’”))》】]{3,})([’”))》】][\W\s]*[\n]{,1})"),
                self._process_quotes,
            ),  # Special quotes, There are left and right.
        ]
        # Extended options: Default False
        if self.keepPinyin == True:
            process_list.insert(
                1,
                (
                    TAG_S2,
                    re.compile(r"([\(({][^})\)]*?\d[^})\)]*?[})\])"),
                    self._process_pinyin,
                ),  # Chinese Pinyin Tag.
            )
        # -------------------------------------------------------------------------------------------------------
        words = []
        lines = re.findall(r".*\n*", re.sub(self.PARSE_TAG, "", text))
        for index, text in enumerate(lines):
            if len(text.strip()) == 0:
                continue
            self._lang_eos = False
            self._text_cache = {}
            for item in process_list:
                text = self._pattern_symbols(item, text)
            cur_word = self._process_tags([], text, True)
            if len(cur_word) == 0:
                continue
            cur_data = cur_word[0] if len(cur_word) > 0 else None
            pre_data = words[-1] if len(words) > 0 else None
            if cur_data and pre_data and cur_data["lang"] == pre_data["lang"] and cur_data["symbol"] == False and pre_data["symbol"]:
                cur_data["text"] = f"{pre_data['text']}{cur_data['text']}"
                words.pop()
            words += cur_word
        if self.isLangMerge == True:
            words = self._merge_results(words)
        lang_count = self._lang_count
        if lang_count and len(lang_count) > 0:
            lang_count = dict(sorted(lang_count.items(), key=lambda x: x[1], reverse=True))
            lang_count = list(lang_count.items())
            self._lang_count = lang_count
        return words

    def setfilters(self, filters):
        # When the filter changes, clear the cache
        if self.Langfilters != filters:
            self._clears()
            self.Langfilters = filters

    def getfilters(self):
        return self.Langfilters

    def setPriorityThreshold(self, threshold: float):
        self.LangPriorityThreshold = threshold

    def getPriorityThreshold(self):
        return self.LangPriorityThreshold

    def getCounts(self):
        lang_count = self._lang_count
        if lang_count is not None:
            return lang_count
        text_langs = self._text_langs
        if text_langs is None or len(text_langs) == 0:
            return [("zh", 0)]
        lang_counts = defaultdict(int)
        for d in text_langs:
            lang_counts[d["lang"]] += int(len(d["text"]) * 2) if d["lang"] == "zh" else len(d["text"])
        lang_counts = dict(sorted(lang_counts.items(), key=lambda x: x[1], reverse=True))
        lang_counts = list(lang_counts.items())
        self._lang_count = lang_counts
        return lang_counts

    def getTexts(self, text: str):
        if text is None or len(text.strip()) == 0:
            self._clears()
            return []
        # lasts
        text_langs = self._text_langs
        if self._text_lasts == text and text_langs is not None:
            return text_langs
        # parse
        self._text_waits = []
        self._lang_count = None
        self._text_lasts = text
        text = self._parse_symbols(text)
        self._text_langs = text
        return text

    def classify(self, text: str):
        return self.getTexts(text)
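The per-language entries in process_list are built by substituting a character range for the LANGUAGE placeholder in TAG_BASE.pattern. The sketch below illustrates that substitution only. TEMPLATE is a simplified, hypothetical stand-in for TAG_BASE (whose real pattern is not shown in this section), and build_pattern is an illustrative helper, not part of the module.

```python
import re

# Hypothetical, simplified stand-in for TAG_BASE: one run of characters
# drawn from whatever range replaces the LANGUAGE placeholder.
TEMPLATE = re.compile(r"([LANGUAGE]+)")


def build_pattern(char_ranges):
    """Compile a language-specific pattern from the shared template."""
    return re.compile(re.sub(r"LANGUAGE", char_ranges, TEMPLATE.pattern))


# Character ranges taken from the process_list entries in the source.
KOREAN = build_pattern("\uac00-\ud7a3")
RUSSIAN = build_pattern("А-Яа-яЁё")
LATIN = build_pattern("a-zA-Z")

sample = "안녕 hello Привет"
korean_runs = KOREAN.findall(sample)
russian_runs = RUSSIAN.findall(sample)
latin_runs = LATIN.findall(sample)
```

Keeping one template and swapping only the character range means every language shares the same surrounding match logic, so a change to the template applies to all of them.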
    def getCounts(self):
        lang_count = self._lang_count
        if lang_count is not None:
            return lang_count
        text_langs = self._text_langs
        if text_langs is None or len(text_langs) == 0:
            return [("zh", 0)]
        lang_counts = defaultdict(int)
        for d in text_langs:
            lang_counts[d["lang"]] += int(len(d["text"]) * 2) if d["lang"] == "zh" else len(d["text"])
        lang_counts = dict(sorted(lang_counts.items(), key=lambda x: x[1], reverse=True))
        lang_counts = list(lang_counts.items())
        self._lang_count = lang_counts
        return lang_counts
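When no cached count exists, getCounts tallies characters per language from the segment list, weights "zh" text by two, and sorts in descending order. The standalone sketch below mirrors only that fallback path. count_languages is a hypothetical free function written for illustration; it does not touch the class's cache.

```python
from collections import defaultdict


def count_languages(segments):
    """Tally characters per language; "zh" text counts double, as in getCounts."""
    if not segments:
        # Same default as the source when there is nothing to count.
        return [("zh", 0)]
    totals = defaultdict(int)
    for seg in segments:
        weight = 2 if seg["lang"] == "zh" else 1
        totals[seg["lang"]] += len(seg["text"]) * weight
    # Largest count first; ties keep first-seen order because sorted() is stable.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


segments = [
    {"lang": "zh", "text": "你好"},
    {"lang": "en", "text": "hello"},
    {"lang": "zh", "text": "吗"},
]
counts = count_languages(segments)
main_lang, main_count = counts[0]
```

Here the three Chinese characters contribute 6 and the English word contributes 5, so "zh" ranks first.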
    def getTexts(self, text: str):
        if text is None or len(text.strip()) == 0:
            self._clears()
            return []
        # lasts
        text_langs = self._text_langs
        if self._text_lasts == text and text_langs is not None:
            return text_langs
        # parse
        self._text_waits = []
        self._lang_count = None
        self._text_lasts = text
        text = self._parse_symbols(text)
        self._text_langs = text
        return text
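getTexts re-parses only when the input differs from the previous call; otherwise it returns the stored result. The sketch below isolates that last-input cache. CachedSegmenter and its parse callable are hypothetical names for illustration, and resetting the cache on empty input is an assumption about what _clears() does, since its body is not shown here.

```python
class CachedSegmenter:
    """Remember the most recent input and its parsed result."""

    def __init__(self, parse):
        self._parse = parse          # hypothetical parser: str -> list of dicts
        self._last_text = None
        self._last_result = None
        self.parse_calls = 0         # bookkeeping for the demonstration only

    def get_texts(self, text):
        if text is None or len(text.strip()) == 0:
            # Assumed equivalent of _clears(): drop any cached state.
            self._last_text = None
            self._last_result = None
            return []
        if self._last_text == text and self._last_result is not None:
            return self._last_result  # cache hit: skip parsing
        self.parse_calls += 1
        self._last_text = text
        self._last_result = self._parse(text)
        return self._last_result


segmenter = CachedSegmenter(lambda t: [{"lang": "en", "text": t}])
first = segmenter.get_texts("hello")
second = segmenter.get_texts("hello")   # served from the cache
third = segmenter.get_texts("world")    # different input, parsed again
```

The cache holds a single entry, so alternating between two inputs would parse on every call.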
def printList(langlist):
    """
    Function: Print array results
    """
    print("\n===================【打印结果】===================")
    if langlist is None or len(langlist) == 0:
        print("无内容结果,No content result")
        return
    for line in langlist:
        print(line)
Function: Print array results
def main():
    # -----------------------------------
    # Changelog: The new version of the word segmentation is more accurate.
    # -----------------------------------

    # Input Example 1: (including Japanese, Chinese)
    # text = "“昨日は雨が降った,音楽、映画。。。”你今天学习日语了吗?春は桜の季節です。语种分词是语音合成必不可少的环节。言語分詞は音声合成に欠かせない環節である!"

    # Input Example 2: (including Japanese, Chinese)
    # text = "欢迎来玩。東京,は日本の首都です。欢迎来玩. 太好了!"

    # Input Example 3: (including Japanese, Chinese)
    # text = "明日、私たちは海辺にバカンスに行きます。你会说日语吗:“中国語、話せますか” 你的日语真好啊!"

    # Input Example 4: (including Japanese, Chinese, Korean, English)
    # text = "你的名字叫<ja>佐々木?<ja>吗?韩语中的안녕 오빠读什么呢?あなたの体育の先生は誰ですか? 此次发布会带来了四款iPhone 15系列机型和三款Apple Watch等一系列新品,这次的iPad Air采用了LCD屏幕"

    # Experimental support: "fr" French, "vi" Vietnamese, "ru" Russian, "th" Thai.
    langsegment = LangSegment()
    langsegment.setfilters(["fr", "vi", "ja", "zh", "ko", "en", "ru", "th"])
    text = """
我喜欢在雨天里听音乐。
I enjoy listening to music on rainy days.
雨の日に音楽を聴くのが好きです。
비 오는 날에 음악을 듣는 것을 즐깁니다。
J'aime écouter de la musique les jours de pluie.
Tôi thích nghe nhạc vào những ngày mưa.
Мне нравится слушать музыку в дождливую погоду.
ฉันชอบฟังเพลงในวันที่ฝนตก
"""

    # Segmentation: (only one line of code is required to integrate with a TTS project)
    langlist = langsegment.getTexts(text)
    printList(langlist)

    # Language statistics:
    print("\n===================【语种统计】===================")
    # Get the results for all languages, sorted in descending order by character count
    langCounts = langsegment.getCounts()
    print(langCounts, "\n")

    # Get the main language of the content from the results (language, character count including punctuation)
    lang, count = langCounts[0]
    print(f"输入内容的主要语言为 = {lang} ,字数 = {count}")
    print("==================================================\n")

    # Segmentation output for Input Example 4: lang = language, text = content
    # ===================【打印结果】===================
    # {'lang': 'zh', 'text': '你的名字叫'}
    # {'lang': 'ja', 'text': '佐々木?'}
    # {'lang': 'zh', 'text': '吗?韩语中的'}
    # {'lang': 'ko', 'text': '안녕 오빠'}
    # {'lang': 'zh', 'text': '读什么呢?'}
    # {'lang': 'ja', 'text': 'あなたの体育の先生は誰ですか?'}
    # {'lang': 'zh', 'text': ' 此次发布会带来了四款'}
    # {'lang': 'en', 'text': 'i Phone '}
    # {'lang': 'zh', 'text': '15系列机型和三款'}
    # {'lang': 'en', 'text': 'Apple Watch '}
    # {'lang': 'zh', 'text': '等一系列新品,这次的'}
    # {'lang': 'en', 'text': 'i Pad Air '}
    # {'lang': 'zh', 'text': '采用了'}
    # {'lang': 'en', 'text': 'L C D '}
    # {'lang': 'zh', 'text': '屏幕'}

    # ===================【语种统计】===================
    # [('zh', 51), ('ja', 19), ('en', 18), ('ko', 5)]

    # 输入内容的主要语言为 = zh ,字数 = 51
    # ==================================================
    # The main language of the input content is = zh, word count = 51
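The segment list is a sequence of dictionaries with "lang" and "text" keys, so a downstream TTS step can dispatch on "lang". The sketch below shows one way that hand-off might look. VOICES, the voice names, and plan_synthesis are all hypothetical and not part of LangSegment; the segments are copied from the sample output in the source.

```python
# Hypothetical mapping from language code to a TTS voice identifier.
VOICES = {"zh": "voice_zh", "ja": "voice_ja", "ko": "voice_ko", "en": "voice_en"}


def plan_synthesis(segments, voices, default_voice="voice_default"):
    """Pair each non-blank segment with the voice chosen for its language."""
    plan = []
    for seg in segments:
        if not seg["text"].strip():
            continue  # nothing to speak
        plan.append((voices.get(seg["lang"], default_voice), seg["text"]))
    return plan


# The first four segments of the sample output shown in the source.
segments = [
    {"lang": "zh", "text": "你的名字叫"},
    {"lang": "ja", "text": "佐々木?"},
    {"lang": "zh", "text": "吗?韩语中的"},
    {"lang": "ko", "text": "안녕 오빠"},
]
plan = plan_synthesis(segments, VOICES)
```

Languages missing from the mapping fall back to default_voice rather than raising, which keeps the loop running when a preview language such as "fr" or "vi" appears.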