LLM-jp-Moshi-v1

LLM-jp-Moshi: Japanese Full-duplex Spoken Dialogue Models

🤗 LLM-jp-Moshi-v1 | GitHub

Dialogue Working Group, Research and Development Center for Large Language Models, National Institute of Informatics

LLM-jp-Moshi is built on Moshi, a 7B-parameter full-duplex spoken dialogue model for English, by further training it on Japanese spoken dialogue data. Please refer to our publications for details.

Samples of real-time spoken dialogue between LLM-jp-Moshi-v1 and users

Prompted Dialogue Continuation

20-second dialogue audio samples generated by each of the following models from a 10-second prompt of human-to-human dialogue audio.

Re-synthesis: The actual 20-second dialogue audio, re-synthesized by Mimi, Moshi's audio tokenizer
J-Moshi: Moshi finetuned on J-CHAT, Tabidachi, CSJ, in-house data, and CallHome-Japanese
LLM-jp-Moshi-v1: Moshi finetuned on J-CHAT and LLM-jp-Zoom1

In each of the following audio samples, the first 10 seconds (up to the bell) are the prompt audio, and the 20 seconds after the bell are the audio generated by each model.
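The 10-second prompt and 20-second continuation layout described above amounts to a simple slice over the sample array. The following is a minimal sketch, not the authors' tooling: the function name `split_prompt_and_continuation` and the default 24 kHz sample rate are assumptions made for illustration, and the waveform is a plain sequence of samples rather than a real audio file.

```python
from typing import List, Sequence, Tuple

PROMPT_SECONDS = 10        # prompt length stated on this page
CONTINUATION_SECONDS = 20  # generated length stated on this page


def split_prompt_and_continuation(
    samples: Sequence[float],
    sample_rate: int = 24_000,  # assumed sample rate, for illustration only
) -> Tuple[List[float], List[float]]:
    """Split a 30-second sample into its prompt and its continuation."""
    prompt_len = PROMPT_SECONDS * sample_rate
    total_len = (PROMPT_SECONDS + CONTINUATION_SECONDS) * sample_rate
    if len(samples) < total_len:
        raise ValueError(
            f"expected at least {total_len} samples, got {len(samples)}"
        )
    prompt = list(samples[:prompt_len])
    continuation = list(samples[prompt_len:total_len])
    return prompt, continuation


if __name__ == "__main__":
    sr = 8  # tiny hypothetical rate so the example stays small
    audio = [0.0] * (30 * sr)
    prompt, continuation = split_prompt_and_continuation(audio, sample_rate=sr)
    print(len(prompt) / sr, len(continuation) / sr)
```

Slicing by sample count rather than searching for the bell keeps the split deterministic; it assumes the bell sits exactly at the 10-second mark, as the description above states.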

Evaluation Results

We evaluated the models on a dialogue continuation task, in which a model is given a spoken dialogue as input, generates the dialogue audio that continues it, and is assessed on the validity of that continuation. We conducted both an objective automatic evaluation using LLM-as-a-Judge and a subjective evaluation by crowd workers. Compared with the existing public model J-Moshi (not J-Moshi-ext), LLM-jp-Moshi-v1 performed better in both naturalness and semantic appropriateness (see the tables below). The input audio was taken from held-out test data of LLM-jp-Zoom1, Tabidachi (travel guidance dialogue), and CallHome-Japanese. "Real audio" denotes the actual human-to-human speech that follows the input audio.

NISQA and UT-MOS are automatic speech evaluation metrics, reported on a 5-point scale. LLMAJ denotes scores on a 10-point scale obtained with the LLM-as-a-Judge framework, in which a large language model rates the naturalness and fluency of the dialogue. The evaluation items are Coherence (COH), Naturalness (NAT), Relevance (REL), Instruction Following (INS), Turn Taking (TUR), and Overall (OVE).

Note: Underlined values in the tables indicate the best performance.

LLM-jp-Zoom1: Crowdsourced Subjective Evaluation Results (Naturalness and Semantics, 5-point scale)

	J-Moshi	LLM-jp-Moshi-v1	LLM-jp-Zoom1 (Real Audio)
Pre-training	J-CHAT	J-CHAT	-
Fine-tuning	Tabidachi, CSJ, in-house data, CallHome-Japanese	LLM-jp-Zoom1	-
Naturalness (1-5)	2.43	3.34	4.51
Semantics (1-5)	2.10	2.84	4.50

LLM-jp-Zoom1: Objective Evaluation Results (MOS and LLMAJ)

		J-Moshi	LLM-jp-Moshi-v1	LLM-jp-Zoom1 (Real Audio)
Pre-training		J-CHAT	J-CHAT	-
Fine-tuning		Tabidachi, CSJ, in-house data, CallHome-Japanese	LLM-jp-Zoom1	-
MOS (1-5)	NISQA	3.35	3.72	2.96
MOS (1-5)	UT-MOS	1.82	1.97	1.91
LLMAJ (1-10)	COH	4.12	4.24	6.92
LLMAJ (1-10)	NAT	5.33	5.53	7.73
LLMAJ (1-10)	REL	3.18	3.47	5.82
LLMAJ (1-10)	INS	1.84	2.02	4.16
LLMAJ (1-10)	TUR	4.31	4.67	6.80
LLMAJ (1-10)	OVE	4.07	4.37	6.69

Tabidachi: Crowdsourced Subjective Evaluation Results (Naturalness and Semantics, 5-point scale)

	J-Moshi	LLM-jp-Moshi-v1	Tabidachi (Real Audio)
Pre-training	J-CHAT	J-CHAT	-
Fine-tuning	Tabidachi, CSJ, in-house data, CallHome-Japanese	LLM-jp-Zoom1	-
Naturalness (1-5)	2.57	3.27	4.08
Semantics (1-5)	2.35	2.81	4.27

Tabidachi: Objective Evaluation Results (MOS and LLMAJ)

		J-Moshi	LLM-jp-Moshi-v1	Tabidachi (Real Audio)
Pre-training		J-CHAT	J-CHAT	-
Fine-tuning		Tabidachi, CSJ, in-house data, CallHome-Japanese	LLM-jp-Zoom1	-
MOS (1-5)	NISQA	2.77	3.22	2.98
MOS (1-5)	UT-MOS	1.87	1.83	2.13
LLMAJ (1-10)	COH	4.22	4.45	6.77
LLMAJ (1-10)	NAT	5.10	5.69	6.87
LLMAJ (1-10)	REL	3.35	3.51	5.60
LLMAJ (1-10)	INS	2.49	2.29	4.53
LLMAJ (1-10)	TUR	4.16	4.67	6.36
LLMAJ (1-10)	OVE	3.87	4.27	5.99

CallHome-Japanese: Crowdsourced Subjective Evaluation Results (Naturalness and Semantics, 5-point scale)

	J-Moshi	LLM-jp-Moshi-v1	CallHome-Japanese (Real Audio)
Pre-training	J-CHAT	J-CHAT	-
Fine-tuning	Tabidachi, CSJ, in-house data, CallHome-Japanese	LLM-jp-Zoom1	-
Naturalness (1-5)	1.88	2.99	3.89
Semantics (1-5)	1.55	2.50	3.99

CallHome-Japanese: Objective Evaluation Results (MOS and LLMAJ)

		J-Moshi	LLM-jp-Moshi-v1	CallHome-Japanese (Real Audio)
Pre-training		J-CHAT	J-CHAT	-
Fine-tuning		Tabidachi, CSJ, in-house data, CallHome-Japanese	LLM-jp-Zoom1	-
MOS (1-5)	NISQA	2.41	3.06	2.47
MOS (1-5)	UT-MOS	1.59	2.01	1.60
LLMAJ (1-10)	COH	3.14	3.29	5.56
LLMAJ (1-10)	NAT	4.18	4.59	6.67
LLMAJ (1-10)	REL	2.39	2.57	4.69
LLMAJ (1-10)	INS	1.18	1.37	3.25
LLMAJ (1-10)	TUR	3.37	3.82	5.65
LLMAJ (1-10)	OVE	3.08	3.33	5.45
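
The crowdsourced subjective scores in the tables above can be compared directly by subtracting J-Moshi's score from LLM-jp-Moshi-v1's score on each test set. The sketch below only redoes that arithmetic with values copied from the tables; the dictionary layout and the helper name `gains` are illustrative choices, not part of any released evaluation code.

```python
# Subjective crowdsourcing scores (1-5) copied from the tables on this page,
# stored as (J-Moshi, LLM-jp-Moshi-v1) per test set.
NATURALNESS = {
    "LLM-jp-Zoom1": (2.43, 3.34),
    "Tabidachi": (2.57, 3.27),
    "CallHome-Japanese": (1.88, 2.99),
}
SEMANTICS = {
    "LLM-jp-Zoom1": (2.10, 2.84),
    "Tabidachi": (2.35, 2.81),
    "CallHome-Japanese": (1.55, 2.50),
}


def gains(scores):
    """Return LLM-jp-Moshi-v1 minus J-Moshi per test set, rounded to 2 places."""
    return {name: round(ours - base, 2) for name, (base, ours) in scores.items()}


if __name__ == "__main__":
    for label, table in (("Naturalness", NATURALNESS), ("Semantics", SEMANTICS)):
        for name, gain in gains(table).items():
            print(f"{label:<12} {name:<18} +{gain:.2f}")
```

On these numbers, the gap is largest on CallHome-Japanese (+1.11 naturalness, +0.95 semantics) and smallest on Tabidachi (+0.70 naturalness, +0.46 semantics).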

About the Dialogue Working Group

The Dialogue Working Group is a research group established as part of the research activities of LLM-jp. It is led by Prof. Ryuichiro Higashinaka, Scientific Director of NII/LLMC, and works in close collaboration with Prof. Tetsuji Ogawa of Waseda University and Assoc. Prof. Shinnosuke Takamichi of Keio University.

Citation

@inproceedings{abe2026effects,
    title={Effects of dialogue corpora properties on fine-tuning a {M}oshi-based spoken dialogue model},
    author={Abe, Yuto and Saeki, Mao and Ohashi, Atsumoto and Takamichi, Shinnosuke and Fujie, Shinya and Kobayashi, Tetsunori and Ogawa, Tetsuji and Higashinaka, Ryuichiro},
    booktitle={Proc. International Workshop on Spoken Dialogue Systems (IWSDS)},
    pages={104--108},
    year={2026},
    month={Feb}
}

@inproceedings{abe2026moshi,
    title={Moshi音声対話モデルの日本語ファインチューニングにおける対話データ特性の影響},
    author={阿部雄斗 and 佐伯真於 and 大橋厚元 and 高道慎之介 and 藤江真也 and 小林哲則 and 小川哲司 and 東中竜一郎},
    booktitle={日本音響学会研究発表会講演論文集},
    year={2026},
    month={Mar}
}

Acknowledgments

This work used the AI Bridging Cloud Infrastructure (ABCI) 3.0, provided by AIST and AIST Solutions, with support from "ABCI 3.0 Development Acceleration Use". We are deeply grateful to Kyutai Labs for releasing the base Moshi model and its technical paper, and to the researchers who released the J-CHAT dataset.

This page was adapted from the SoundStorm project page.