Overview of Japanese LLMs
A list of publicly available LLMs trained with a focus on Japanese, along with their evaluation benchmarks. The list is maintained by volunteers who collect information from sources such as academic papers and other public resources.
Caution
- We can't guarantee the accuracy or completeness of any information here.
- Some information is based on conjecture and might not reflect your specific use case.
- While many models are released under permissive licenses such as MIT or Apache 2.0, some are subject to more restrictive terms, including non-commercial use clauses (e.g., CC BY-NC-SA 4.0) or other stipulations.
Please point out any errors on the issues page. Feel free to contribute directly with a pull request.
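As a rough illustration of the licensing caution above, the license strings in the tables could be bucketed before a closer read of the actual terms. This is a minimal sketch, not an authoritative classifier: the function name `classify_license`, the marker list, and the three bucket labels are assumptions made for this example, and a result of "permissive" is no substitute for reading the license itself.

```python
# Hypothetical helper; the names and marker lists are illustrative assumptions.
NON_COMMERCIAL_MARKERS = ("by-nc", "noncommercial", "non-commercial")
PERMISSIVE_LICENSES = ("mit", "apache 2.0")


def classify_license(license_text: str) -> str:
    """Roughly bucket a license string as it appears in the tables."""
    # Some cells use a non-breaking hyphen (U+2011); normalize it first.
    normalized = license_text.replace("\u2011", "-").strip().lower()
    # Any non-commercial marker wins, even in mixed cells such as
    # "Apache 2.0 (CC BY-NC 4.0 as for NC model)".
    if any(marker in normalized for marker in NON_COMMERCIAL_MARKERS):
        return "non-commercial"
    if normalized in PERMISSIVE_LICENSES:
        return "permissive"
    # Community licenses, custom terms of use, and "?" all land here.
    return "check terms"


if __name__ == "__main__":
    for text in ("MIT", "CC BY-NC-SA 4.0", "Llama 3 Community License"):
        print(f"{text}: {classify_license(text)}")
```

Anything that is not an exact match for a known permissive license falls into "check terms" on purpose, since community licenses and custom terms of use often carry conditions that a string match cannot capture.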
Text Generation Models
For multimodal models, see below.
Models built from scratch
General purpose
Model | Architecture | Max Context Length | Training Data | Developer | License / Terms of Use |
---|---|---|---|---|---|
Sarashina2-8x70B | Mixtral (8x70b (465b)) | 8,192 | Sparse Upcycling on Sarashina2 (70B) | SB Intuitions | Sarashina Model NonCommercial License |
LLM-jp-3 172B | Llama (172b, 172b-instruct3) | 4,096 | Pre-training: llm-jp-corpus-v3 (2.1T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k DPO: synthetic data | Research and Development Center for Large Language Models | Pre-trained model: LLM-jp-3 172B Terms of Use Post-trained model: llm-jp-3-172b-instruct3 Terms of Use |
LLM-jp-3 172B beta2 | Llama (172b-beta2, 172b-beta2-instruct2) | 4,096 | Pre-training: part of llm-jp-corpus-v3 (1.4T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k | Research and Development Center for Large Language Models | LLM-jp-3 172B beta2 Terms of Use |
LLM-jp-3 172B beta1 | Llama (172b-beta1, 172b-beta1-instruct) | 4,096 | Pre-training: part of llm-jp-corpus-v3 (0.7T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN | Research and Development Center for Large Language Models | LLM-jp-3 172B beta1 Terms of Use |
LLM-jp-3 172B alpha | Llama (172b-alpha1, 172b-alpha1-instruct, 172b-alpha2, 172b-alpha2-instruct) | 4,096 | Pre-training: part of llm-jp-corpus-v3 (alpha1: 0.7T tokens, alpha2: 1.4T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN | Research and Development Center for Large Language Models | Apache 2.0 |
Stockmark-100b | Llama (100b, 100b-instruct-v0.1) | 4,096 | Pre-training: RedPajama, Japanese Wikipedia, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus (910B tokens) Instruction Tuning (LoRA): ichikara-instruction | Stockmark | MIT |
PLaMo-100B-Pretrained | Llama[1] (100b) | 4,096 | Pre-training: Japanese CommonCrawl, RefinedWeb, undisclosed (2.0T tokens) | Preferred Elements (Preferred Networks) | PLaMo Non-Commercial License |
Sarashina2 | Llama (7b, 13b, 70b) | 7b, 13b: 4,096 70b: 8,192 | Pre-training: Japanese Common Crawl, SlimPajama, StarCoder (2.1T tokens) | SB Intuitions | MIT |
Sarashina1 | GPT-NeoX (7b, 13b, 65b) | 2,048 | Pre-training: Japanese Common Crawl (1T tokens) | SB Intuitions | MIT |
Tanuki-8×8B | Tanuki (MoE) (47b) (v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF) | 4,096 | Pre-training: various Web & synthetic datasets (1.7T tokens) SFT, DPO: various synthetic datasets [2] | Matsuo Lab LLM Development Project | Apache 2.0 |
CyberAgentLM3 (CALM3) | Llama (22b-chat) | 16,384 | undisclosed (2.0T tokens) | CyberAgent | Apache 2.0 |
LLM-jp-3 13B | Llama (1.8b, 1.8b-instruct, 3.7b, 3.7b-instruct, 13b, 13b-instruct) | 4,096 | Pre-training: llm-jp-corpus-v3 (2.1T tokens) Instruction Tuning: ichikara-instruction, answer-carefully, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k | Research and Development Center for Large Language Models | Apache 2.0 |
llm-jp-3-3.7b-instruct-EZO | Llama (3.7b-instruct-EZO-Common, 3.7b-instruct-EZO-Humanities) | 4,096 | additionally trained on LLM-jp-3 (3.7B) | Axcxept | Apache 2.0 |
LLM-jp-13B v2.0 | Llama (13b-v2.0, 13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) | 4,096 | Pre-training: llm-jp-corpus-v2 (260B tokens) Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2 | LLM-jp | Apache 2.0 |
Fugaku-LLM | GPT (13B, 13B-instruct, 13B-instruct-gguf) | 2,048 | Pre-training: undisclosed dataset Instruction Tuning: OASST1, Dolly Dataset, GSM8K | Titech, Tohoku Univ., Fujitsu, RIKEN, Nagoya Univ., CyberAgent, Kotoba Technologies | Fugaku-LLM Terms of Use |
LLM-jp-13B v1.1 | GPT (13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-dpo-lora-hh_rlhf_ja-v1.1) | 2,048 | Instruction Tuning (LoRA or Full-parameter FT): Dolly Dataset, OASST1, ichikara-instruction DPO (LoRA): HH RLHF | LLM-jp | Apache 2.0 |
LLM-jp-13B | GPT (1.3b-v1.0, 13b-v1.0, 13b-instruct-full-jaster-v1.0, 13b-instruct-full-jaster-dolly-oasst-v1.0, 13b-instruct-full-dolly-oasst-v1.0, 13b-instruct-lora-jaster-v1.0, 13b-instruct-lora-jaster-dolly-oasst-v1.0, 13b-instruct-lora-dolly-oasst-v1.0) | 2,048 | Pre-training: llm-jp-corpus (Wikipedia, Japanese mC4, The Pile, Stack) (300B tokens) Instruction Tuning (Full-parameter FT or LoRA): jaster, Dolly Dataset, OASST1 | LLM-jp | Apache 2.0 |
PLaMo-13B | Llama[3] (13b, 13b-instruct, 13b-instruct-nc) | base: 4,096 instruct, instruct-nc: 8,192 | Pre-training: C4, Project Gutenberg, RedPajama, Japanese Wikipedia, Japanese mC4 (1.5T tokens) Instruction Tuning: Dolly, HH RLHF, OASST1, wikinews (+Alpaca in NC model) | Preferred Networks | Apache 2.0 (CC BY-NC 4.0 as for NC model) |
Stockmark-13b | Llama (13b, 13b-instruct) | 2,048 | Pre-training: Japanese Wikipedia, Japanese CC-100, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus (220B tokens) Instruction Tuning (LoRA): ichikara-instruction | Stockmark | base: MIT instruct: CC BY-NC-SA 4.0 |
Weblab-10B | GPT-NeoX (10b, 10b-instruction-sft) | 2,048 | Japanese mC4, The Pile (600B tokens) Instruction Tuning: Alpaca, FLAN | University of Tokyo Matsuo Lab | CC BY‑NC 4.0 |
Tanuki-8B | Tanuki (8b) (v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF) | 4,096 | Pre-training: various Web & synthetic datasets (1.3T tokens) SFT, DPO: various synthetic datasets [2:1] | Matsuo Lab LLM Development Project | Apache 2.0 |
Japanese StableLM Alpha | GPT-NeoX (base-alpha-7b, instruct-alpha-7b, instruct-alpha-7b-v2) | 2,048 | Wikipedia, Japanese CC‑100, Japanese mC4, Japanese OSCAR, RedPajama, private datasets[4] (750B tokens) Instruction Tuning: Dolly, HH‑RLHF, wikinews, Alpaca (discarded in v2) | Stability AI | base: Apache 2.0 instruct (v1): Research license instruct (v2): Apache 2.0 |
CyberAgentLM2 (CALM2) | Llama (7b, 7b-chat, 7b-chat-dpo-experimental) | base: 4,096 chat: 32,768 | publicly available Japanese and English datasets (details unknown) (1.3T tokens) DPO: Chatbot Arena Conversations JA (calm2) Dataset | CyberAgent | Apache 2.0 (CC BY 4.0 as for DPO model) |
OpenCALM | GPT-NeoX (small, medium, large, 1b(1.4b), 3b(2.7b), 7b(6.8b)) | 2,048 | Japanese Wikipedia, Japanese mC4, Japanese CC‑100 | CyberAgent | CC BY‑SA 4.0 |
Stormy | GPT-NeoX (7b(6.8b)) | 2,048 | OpenCALM fine-tuned on llm-japanese-dataset v0 non-translation tasks | University of Tokyo Izumi Lab | CC BY‑SA 4.0 |
rinna GPT (En-Ja Bilingual) | GPT-NeoX (4b(3.8b), 4b(3.8b)-8k, 4b(3.8b)-instruction-sft, 4b(3.8b)-instruction-ppo) | 8k model: 8,192 others: 2,048 | Wikipedia, Japanese CC‑100, Japanese C4, RedPajama, The Pile (524B tokens) Instruction Tuning: HH‑RLHF, FLAN PPO: HH‑RLHF for reinforcement learning 8k: trained with long context | rinna | MIT |
japanese-large-lm | GPT-NeoX (1.7b, 3.6b, 1.7b-instruction-sft, 3.6b-instruction-sft) | 2,048 | Japanese Wikipedia, Japanese CC‑100, Japanese C4, Japanese OSCAR and private datasets (650GB) Instruction Tuning: OASST1 | LINE | Apache 2.0 |
rinna GPT (Japanese only) | GPT / GPT-NeoX (xsmall, small, medium, 1b, neox-small, neox-3.6b, neox-3.6b-instruction-sft, neox-3.6b-instruction-sft-v2, neox-3.6b-instruction-ppo) | ≤ 2,048 | Japanese Wikipedia, Japanese CC‑100 (1b and up models add Japanese mC4) Instruction Tuning: HH‑RLHF, FLAN, SHP PPO: HH‑RLHF for reinforcement learning | rinna | MIT |
RetrievaT5 | T5 (small (short), small (medium), small (long), base (short), base (medium), base (long), large (short), large (medium), large (long), xl(3b)) | | Japanese Wikipedia, Japanese mC4 | Retrieva | CC BY‑SA 4.0 |
Spiral-RetNet-3b-base | RetNet (3b) | 2,048 | Wikipedia, Japanese CC-100, CulturaX | Spiral.AI | MIT |
kotomamba-2.8B | Mamba (2.8B-v1.0) | 2,048 | Japanese Wikipedia, Swallow Corpus, SlimPajama | Kotoba Technologies | Apache 2.0 |
ABEJA GPT | GPT / GPT-NeoX (large, neox-2.7b) | | Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR | ABEJA | MIT |
WasedaGPT | GPT (small, xl(1.5b)) | | Japanese Wikipedia, Japanese CC‑100 | Waseda Kawahara Lab | CC BY‑SA 4.0 |
StockmarkGPT | GPT-NeoX (1.4b) | | Japanese Wikipedia (0.88B tokens), Japanese CC‑100 (10.5B tokens), private data (8.6B tokens) | Stockmark | MIT |
YellowbackGPT | GPT-NeoX (1.3b) | | Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR | Yellowback | Apache 2.0 |
Sarashina2.1-1B | Llama (1b) | 8,192 | Japanese and English data on the web (10T tokens) | SB Intuitions | Sarashina Model NonCommercial License |
colorfulscoop GPT | GPT (small) | | Japanese Wikipedia | Colorful Scoop | CC BY‑SA 3.0 |
TitechGPT | GPT (medium, medium-reversed) [5] | | Japanese Wikipedia, Japanese CC‑100 | Titech Okazaki Lab | CC BY‑SA 4.0 |
KyotoUniversityGPT | GPT (small, medium, large) | | Japanese Wikipedia (3.2GB), Japanese CC‑100 (85GB), Japanese OSCAR (54GB) | Kyoto University Language Media Processing Lab | CC BY‑SA 4.0 |
JapaneseBART | BART (base, large) | | Japanese Wikipedia (18M sentences) | Kyoto University Language Media Processing Lab | CC BY‑SA 4.0 |
Megagon Labs T5 | T5 (base) | | Japanese mC4 (782 GB), Japanese wiki40b (2 GB) | Megagon Labs (Recruit Co.,Ltd.) | Apache 2.0 |
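The table above can also be filtered programmatically, for example to shortlist models by maximum context length. The following is a minimal sketch under stated assumptions: the helper names (`split_cells`, `max_context_length`, `models_with_context_at_least`) are hypothetical, the two sample rows are copied from the table, and the column order is assumed to be model, architecture, max context length, training data, developer, license.

```python
import re
from typing import List, Optional

# Two rows copied from the table above (assumed column order: model,
# architecture, max context length, training data, developer, license).
ROWS = [
    "Sarashina1 | GPT-NeoX (7b, 13b, 65b) | 2,048 | Pre-training: Japanese Common Crawl (1T tokens) | SB Intuitions | MIT |",
    "CyberAgentLM3 (CALM3) | Llama (22b-chat) | 16,384 | undisclosed (2.0T tokens) | CyberAgent | Apache 2.0 |",
]


def split_cells(row: str) -> List[str]:
    """Split a pipe-delimited row into trimmed cells, dropping the empty tail."""
    cells = [cell.strip() for cell in row.split("|")]
    if cells and cells[-1] == "":
        cells.pop()
    return cells


def max_context_length(cell: str) -> Optional[int]:
    """Return the largest comma-grouped number in a cell, or None if absent."""
    # Only comma-grouped numbers such as "4,096" are matched, so size labels
    # like "7b" or "70b" in cells such as "7b, 13b: 4,096 70b: 8,192" are skipped.
    numbers = re.findall(r"\d{1,3}(?:,\d{3})+", cell)
    return max((int(n.replace(",", "")) for n in numbers), default=None)


def models_with_context_at_least(rows: List[str], minimum: int) -> List[str]:
    """Names of models whose listed max context length is at least `minimum`."""
    selected = []
    for row in rows:
        cells = split_cells(row)
        if len(cells) < 3:
            continue  # not enough columns to hold a context-length cell
        length = max_context_length(cells[2])
        if length is not None and length >= minimum:
            selected.append(cells[0])
    return selected


if __name__ == "__main__":
    print(models_with_context_at_least(ROWS, 8192))
```

Rows whose context-length cell is blank are skipped rather than treated as zero, since a blank cell means the value is not listed, not that the model has no context window.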
Domain specific
Model | Domain | Architecture | Training Data | Developer | License |
---|---|---|---|---|---|
Japanese Dialog Transformer | Dialog | Transformer | Twitter Japanese reply pairs | NTT | Evaluation Licence |
Japanese News BART | Business | BART (base) | Japanese business news articles (21M articles) | Stockmark | MIT |
AcademicBART | Science | BART (base) | CiNii Japanese Papers | Ehime University AI Lab | Apache 2.0 |
Models built off non-Japanese LLMs (w/ continual pre-training on Japanese)
General purpose
Model | Base Model | Training Data | Developer | License / Terms of Use |
---|---|---|---|---|
Llama 3.1 Swallow 70B (70B-v0.1, 70B-Instruct-v0.1, 70B-Instruct-v0.3) | Llama 3.1 (70b) | Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallow Project | Llama 3.1 Community License (Gemma Terms of Use is also applied to the Instruct model) |
cyberagent/Llama-3.1-70B-Japanese-Instruct-2407 | Llama 3.1 (70b) | undisclosed | CyberAgent | Llama 3.1 Community License |
Llama 3 Swallow 70B (70B-v0.1, 70B-Instruct-v0.1) | Llama 3 (70b) | Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath Instruction Tuning: OASST1 [6] | Swallow Project | Llama 3 Community License |
turing-motors/Llama-3-heron-brain-70B-v0.3 | Llama 3 (70b) | additionally trained on Llama 3 Swallow 70B (details undisclosed) | Turing | Llama 3 Community License |
Llama 3 Youko 70B (70b, 70b-instruct, 70b-gptq, 70b-instruct-gptq) | Llama 3 (70b) | Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (5B tokens) Instruction Tuning: undisclosed dataset[7] | rinna | Llama 3 Community License |
Swallow 70B (70b-hf, 70b-instruct-hf, 70b-instruct-v0.1, 70b-NVE-hf, 70b-NVE-instruct-hf) | Llama 2 (70b) | Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 *v0.1: OASST1, OASST2 | Swallow Project | Llama 2 Community License |
KARAKURI LM (70b-v0.1, 70b-chat-v0.1) | Llama 2 (70b) | Pre-training: mC4, CC100, OSCAR, RedPajama, undisclosed dataset (16B tokens) SteerLM: OASST2, undisclosed dataset | KARAKURI | Llama 2 Community License[8] |
Japanese Stable LM Beta 70B (base-beta-70b, instruct-beta-70b) | Llama 2 (70b) | Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 | Stability AI | Llama 2 Community License |
Swallow-MX 8x7B (8x7b-NVE-v0.1) | Mixtral-8x7B-Instruct-v0.1 (46.7b) | Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile, The Vault | Swallow Project | Apache 2.0 |
KARAKURI LM 8x7B Instruct v0.1 (8x7b-instruct-v0.1) | Mixtral-8x7B-Instruct-v0.1 (46.7b) | trained Swallow-MX 8x7B on the following datasets: Dolly Dataset, OASST2, HelpSteer, glaive-code-assistant-v3, glaive-function-calling-v2, synthetic_text_to_sql, MetaMathQA, orca-math-word-problems-200k, rag-dataset-12000, rag-hallucination-dataset-1000, undisclosed dataset | KARAKURI | Apache 2.0 (?)[9] |
KARAKURI LM 8x7B Chat v0.1 (8x7b-chat-v0.1) | Mixtral-8x7B-Instruct-v0.1 (46.7b) | trained Swallow-MX 8x7B on OASST2, HelpSteer, and undisclosed datasets using SteerLM | KARAKURI | Apache 2.0 |
ABEJA-Mixtral-8x7B-japanese (8x7B-v0.1-japanese, 8x7B-Instruct-v0.1-japanese, 8x7B-Instruct-v0.1-japanese-alpha, 8x7B-Instruct-v0.1-japanese-alpha-merged) | Mixtral-8x7B-Instruct-v0.1 (46.7b) *The model without "Instruct" in its name is based on Mixtral-8x7B-v0.1 | Pre-training: Japanese CC, RedPajama, undisclosed dataset (450B tokens) | ABEJA | Apache 2.0 |
Nekomata 14B (14b, 14b-instruction, 14b-gguf, 14b-instruction-gguf) | Qwen (14b) | Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (66B tokens) Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset | rinna | Tongyi Qianwen LICENSE |
Swallow 13B (13b-hf, 13b-instruct-hf, 13b-instruct-v0.1, 13b-NVE-hf) | Llama 2 (13b) | Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 *v0.1: OASST1, OASST2 | Swallow Project | Llama 2 Community License |
LEIA-Swallow-13B (13b) | Llama 2 (13b) | additionally trained Swallow 13B using LEIA | Individual (Ikuya Yamada, Ryokan Ri) | Llama 2 Community License |
ELYZA-japanese-Llama-2-13b (13b, 13b-instruct, 13b-fast, 13b-fast-instruct) | Llama 2 (13b) | Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data (18B tokens) Instruction Tuning: undisclosed dataset | ELYZA | Llama 2 Community License |
cyberagent/Mistral-Nemo-Japanese-Instruct-2408 | Mistral NeMo (12b) | undisclosed | CyberAgent | Apache 2.0 |
Llama 3.1 Swallow 8B (8B-v0.1, 8B-Instruct-v0.1, 8B-v0.2, 8B-Instruct-v0.2, 8B-Instruct-v0.3) | Llama 3.1 (8b) | Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie | Swallow Project | Llama 3.1 Community License (Gemma Terms of Use is also applied to the Instruct model) |
Llama 3 Swallow 8B (8B-v0.1, 8B-Instruct-v0.1) | Llama 3 (8b) | Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath Instruction Tuning: OASST1 [6:1] | Swallow Project | Llama 3 Community License |
turing-motors/Llama-3-heron-brain-8B-v0.3 | Llama 3 (8b) | additionally trained on Llama 3 Swallow 8B (details undisclosed) | Turing | Llama 3 Community License |
Llama 3 Youko 8B (8b, 8b-instruct, 8b-gptq, 8b-instruct-gptq) | Llama 3 (8b) | Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (22B tokens) Instruction Tuning[7:1]: Aya Dataset (Japanese subset), FLAN, Dolly Dataset, HH RLHF, OASST1, OASST2, MetaMathQA, CodeAlpaca Dataset, undisclosed dataset DPO: HelpSteer, HelpSteer2, undisclosed dataset | rinna | Llama 3 Community License |
Llama 3 ELYZA JP 8B (8B, 8B-GGUF, 8B-AWQ) | Llama 3 (8b) | undisclosed | ELYZA | Llama 3 Community License |
Llama 3 neoAI 8B Chat v0.1 (8B-Chat-v0.1) | Llama 3 (8b) | undisclosed | neoAI | Llama 3 Community License |
Llama 3 tedllm (v0) | Llama 3 (8b) | Pre-training: Japanese generic corpus | Tokyo Electron Device | Llama 3 Community License |
Swallow 7B (7b-hf, 7b-instruct-hf, 7b-instruct-v0.1, 7b-NVE-hf, 7b-NVE-instruct-hf, 7b-plus-hf) | Llama 2 (7b) | Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 *v0.1: OASST1, OASST2 | Swallow Project | Llama 2 Community License |
LEIA-Swallow-7B (7b) | Llama 2 (7b) | additionally trained Swallow 7B using LEIA | Individual (Ikuya Yamada, Ryokan Ri) | Llama 2 Community License |
ELYZA-japanese-Llama-2-7b (7b, 7b-instruct, 7b-fast, 7b-fast-instruct) | Llama 2 (7b) | Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data (18B tokens) Instruction Tuning: undisclosed dataset | ELYZA | Llama 2 Community License |
Youri 7B (7b, 7b-instruction, 7b-chat, 7b-gptq, 7b-instruction-gptq, 7b-chat-gptq) | Llama 2 (7b) | Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (40B tokens) Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset | rinna | Llama 2 Community License |
houou-7b (instruction-7b-v1, instruction-7b-v2, instruction-7b-v3) | Llama 2 (7b) | Instruction-tuned Youri 7B (base) on ichikara-instruction | MoneyForward | Llama 2 Community License |
Japanese Stable LM Beta 7B (base-beta-7b, base-ja_vocab-beta-7b, instruct-beta-7b, instruct-ja_vocab-beta-7b) | Llama 2 (7b) | Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, OASST1 | Stability AI | Llama 2 Community License |
SambaLingo-Japanese (Base, Chat) | Llama 2 (7b) | Pre-training: CulturaX Instruction Tuning: ultrachat_200k DPO: ultrafeedback, cai-conversation-harmless | SambaNova Systems | Llama 2 Community License (?)[9:1] |
blue-lizard (blue-lizard) | Llama 2 (7b) | undisclosed | Deepreneur | Llama 2 Community License |
Swallow-MS 7B (7b-v0.1, 7b-instruct-v0.1) | Mistral-7B-v0.1 (7b) | Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile Instruction Tuning: Dolly Dataset, OASST1 | Swallow Project | Apache 2.0 |
RakutenAI-7B (7B, 7B-instruct, 7B-chat) | Mistral-7B-v0.1 (7b) | Pre-training: undisclosed Instruction Tuning: Dolly Dataset, OASST1, datasets converted from the train split of NLU datasets (like jaster), undisclosed dataset | Rakuten | Apache 2.0 |
Japanese Stable LM Gamma 7B (base-gamma-7b, instruct-gamma-7b) | Mistral-7B-v0.1 (7b) | Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset | Stability AI | Apache 2.0 |
ChatNTQ JA 7B (7b-v1.0) | Mistral-7B-v0.1 (7b) | Instruction-tuned Japanese Stable LM Gamma 7B (base) on their own datasets | NTQ Solution | Apache 2.0 |
Shisa Gamma 7B (7b-v1) | Mistral-7B-v0.1 (7b) | Instruction-tuned Japanese Stable LM Gamma 7B (base) on ultra-orca-boros-en-ja | AUGMXNT | Apache 2.0 (?)[9:2] |
Shisa 7B (base-7b-v1, 7b-v1) | Mistral-7B-v0.1 (7b) | Pre-training: shisa-pretrain-en-ja-v1 (8B tokens) Instruction Tuning & DPO: ultra-orca-boros-en-ja, shisa-en-ja-dpo-v1 | AUGMXNT | Apache 2.0 (?)[9:3] |
Karasu (7B, 7B-chat, 7B-chat-plus, 7B-chat-plus-unleashed) | Mistral-7B-v0.1 (7b) | Additionally trained Shisa 7B (base) on Aozora Bunko, Japanese Law Precedent Dataset, Japanese Wikipedia, Japanese domain webscrapes from the Japanese subset of CulturaX, UltraChat 200k (7B tokens) Instruction Tuning: ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset | Lightblue | Apache 2.0 (?)[9:4] |
Nekomata 7B (7b, 7b-instruction, 7b-gguf, 7b-instruction-gguf) | Qwen (7b) | Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (66B tokens) Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset | rinna | Tongyi Qianwen LICENSE |
lightblue/japanese-mpt-7b | MPT (7b) | Japanese mC4 | Lightblue | Apache 2.0 |
Japanese Stable LM 3B-4E1T (3b-4e1t-base, 3b-4e1t-instruct) | StableLM-3B-4E1T (3b) | Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3) (100B tokens) Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset | Stability AI | Apache 2.0 |
kotomamba-2.8B-CL | mamba-2.8b-slimpj (2.8b) | Japanese Wikipedia, Swallow Corpus, SlimPajama | Kotoba Technologies | Apache 2.0 |
Gemma 2 Baku 2B (2b, 2b-it) | Gemma 2 (2b) | Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset (80B tokens) OPRO: undisclosed dataset [10] | rinna | Gemma Terms of Use |
Japanese Stable LM 2 1.6B (base, instruct) | Stable LM 2 1.6B (1.6b) | Pre-training: Wikipedia, CulturaX Instruction Tuning: jaster, ichikara-instruction, alpaca-gpt4-japanese, ultra-orca-boros-en-ja-v1 | Stability AI | STABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE |
karasu-1.1B | TinyLlama (1.1b) | Pre-training: Japanese OSCAR, Japanese mC4 (3B tokens) | Lightblue | Apache 2.0 |
Domain specific
Model | Domain | Base Model | Developer | License |
---|---|---|---|---|
Llama3-Preferred-MedSwallow-70B (70B) | Medicine | Llama 3 (70b) | Preferred Networks | Llama 3 Community License |
AIgroup-CVM-utokyohospital/MedSwallow-70b | Medicine | Llama 2 (70b) | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | CC BY-NC-SA 4.0 |
nekomata-14b-pfn-qfin (qfin, qfin-inst-merge) | Finance | Qwen (14b) | Preferred Networks | Tongyi Qianwen LICENSE |
Watashiha-Llama-2-13B-Ogiri-sft (sft, sft-neuron) | Oogiri | Llama 2 (13b) | Watashiha | Llama 2 Community License |
ELYZA-japanese-CodeLlama-7b (7b, 7b-instruct) | Coding | Code Llama (7b) | ELYZA | Llama 2 Community License |
AIBunCho/japanese-novel-gpt-j-6b | Storytelling | GPT-J (6b) | Individual (Hiroyuki Osone) | CreativeML OpenRAIL-M License |
NovelAI/genji-jp | Storytelling | GPT-J (6b) | NovelAI | ? |
Models built off non-Japanese LLMs (w/ post-training on Japanese)
General purpose
Model | Base Model | Training Data | Developer | License / Terms of Use |
---|---|---|---|---|
AXCXEPT/EZO-Qwen2.5-72B-Instruct AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-72B-Instruct_q4 | Qwen2.5 (72b) | | Axcxept | Qwen License |
ao-Karasu (72B) | Qwen1.5 (72b) | ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, Japanese technical blogs, News stories, QA site answers, undisclosed dataset | Lightblue | Tongyi Qianwen LICENSE (?)[9:5] |
AXCXEPT/Llama-3.1-70B-EZO-1.1-it | Llama 3.1 (70b) | | Axcxept | Llama 3.1 Community License |
Llama 3 shisa-v1-llama3-70b (70b) | Llama 3 (70b) | ultra-orca-boros-en-ja-v1 | Shisa.AI | Llama 3 Community License (?)[9:6] |
AIgroup-CVM-utokyohospital/Llama-2-70b-chat-4bit-japanese | Llama 2 (70b) | | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | Llama 2 Community License |
doshisha-mil/llama-2-70b-chat-4bit-japanese-v1 | Llama 2 (70b) | | Doshisha University Media Informatics Lab | ? |
AXCXEPT/EZO-Qwen2.5-32B-Instruct AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-32B-Instruct | Qwen2.5 (32b) | | Axcxept | Apache 2.0 |
Qarasu (14B-chat-plus-unleashed) | Qwen (14b) | ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset | Lightblue | Tongyi Qianwen LICENSE (?)[9:7] |
Sparticle/llama-2-13b-chat-japanese-lora | Llama 2 (13b) | | Sparticle | ? |
izumi-lab/llama-13b-japanese-lora-v0-1ep | Llama (13b) | | University of Tokyo Izumi Lab | ? |
AXCXEPT/EZO-Common-9B-gemma-2-it | Gemma 2 (9b) | | Axcxept | Gemma Terms of Use |
AXCXEPT/EZO-Humanities-9B-gemma-2-it | Gemma 2 (9b) | | Axcxept | Gemma Terms of Use |
AXCXEPT/Llama-3.1-8B-EZO-1.1-it | Llama 3.1 (8b) | | Axcxept | Llama 3.1 Community License |
Llama 3 Suzume 8B (8B-japanese, 8B-japanese-gguf) | Llama 3 (8b) | megagonlabs/instruction_ja, ShareGPT, undisclosed dataset | Lightblue | Llama 3 Community License (?)[9:8] |
Llama 3 shisa-v1-llama3-8b (8b) | Llama 3 (8b) | ultra-orca-boros-en-ja-v1 | Shisa.AI | Llama 3 Community License (?)[9:9] |
AXCXEPT/Llama-3-EZO-8b-Common-it | Llama 3 (8b) | | Axcxept | Llama 3 Community License |
ganchengguang/Yoko-7B-Japanese-v1 | Llama 2 (7b) | | Yokohama National University Mori Lab | ? |
Sparticle/llama-2-7b-chat-japanese-lora | Llama 2 (7b) | | Sparticle | ? |
izumi-lab/llama-7b-japanese-lora-v0-5ep | Llama (7b) | | University of Tokyo Izumi Lab | ? |
lightblue/jod | Mistral-7B-SlimOrca (7b) | | Lightblue | Apache 2.0 |
NTQAI/chatntq-7b-jpntuned | RWKV-4 World (7b) | | NTQ Solution | ? |
Borea (Jp, Common, Coding) | Phi-3.5 (3.8b) | | Axcxept | MIT |
AXCXEPT/EZO-Llama-3.2-3B-Instruct-dpoE | Llama 3.2 (3b) | | Axcxept | Llama 3.2 Community License |
Gemma-2-JPN (2b-jpn-it) | Gemma 2 (2b) | | | Gemma Terms of Use |
AXCXEPT/EZO-gemma-2-2b-jpn-it | Gemma 2 (2b) | | Axcxept | Gemma Terms of Use |
AXCXEPT/EZO-Common-T2-2B-gemma-2-it | Gemma 2 (2b) | | Axcxept | Gemma Terms of Use |
Domain specific
Domain | Base Model | Developer | License | |
---|---|---|---|---|
JMedLoRA (llama2-jmedlora-6.89ep) | Medicine | Llama 2 (70b) | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | CC BY-NC 4.0 |
Merged models
Model | Original Models (Japanese LLMs in bold) | Developer | License |
---|---|---|---|
EQUES/MedLLama3-JP-v2 | **Llama 3 Swallow 8B (Instruct)**, OpenBioLLM-8B, MMed-Llama 3 8B, **Llama 3 ELYZA JP 8B** | EQUES | Llama 3 Community License |
EvoLLM-JP-A (v1-7B) | **Shisa Gamma 7B (v1)**, Arithmo2 Mistral 7B, Abel 7B 002 | Sakana AI | Apache 2.0 |
EvoLLM-JP (v1-7B, v1-10B) | **Shisa Gamma 7B (v1)**, WizardMath-7B-V1.1, Abel 7B 002 | Sakana AI | MICROSOFT RESEARCH LICENSE |
API-based models
Model | Max Context Length | Developer | Platform |
---|---|---|---|
Solar mini chat ja (solar-1-mini-chat-ja) | 32,768 | Upstage | self-owned |
AI Novelist | 2,400 ~ 8,192 | Bit192 | self-owned |
LHTM-OPT | | alt Inc. | AWS Marketplace |
tsuzumi (tsuzumi-7b) | | NTT | Azure AI Foundry |
Encoder models
General purpose
Model | Architecture | Max Input Length | Training Data | Developer | License | HuggingFace? [11] |
---|---|---|---|---|---|---|
KyotoUniBERT | BERT (base, large) | 512 | Japanese Wikipedia (18M articles) | Kyoto University Language Media Processing Lab | Apache 2.0 | △ |
TohokuUniversityBERT | BERT (base, large) | 512 | base (v1): Japanese Wikipedia (17M articles / 2.6GB) base (v2) & large: Japanese Wikipedia (4.0GB) base (v3) & large (v2): Japanese Wikipedia (4.9GB), Japanese CC‑100 (74.3GB) | Tohoku University NLP Group | base (v1, v2) & large: CC BY‑SA 3.0 base (v3) & large (v2): Apache 2.0 | ◯ (base (v1), base (v1, char-level), base (v2), base (v2, char-level), large, large (char-level), base (v3), base (v3, char-level), large (v2), large (v2, char-level)) |
TohokuNLP BERT-alpha 500M | Llama-based encoder[12] | 4,096 or 8,192 | Japanese subset of llm-jp-corpus-v3 | Tohoku University NLP Group | Apache 2.0 | ◯ (sq4096-alpha, sq8192-alpha) |
NICT BERT | BERT (base) | 512 | Japanese Wikipedia | NICT | CC BY 4.0 | △ |
Laboro BERT | BERT (base, large) | 512 | Japanese Web Corpus (news, blogs, etc.) (12GB) | Laboro.AI | CC BY‑NC 4.0 | ✕ |
colorfulscoop BERT | BERT (base) | 512 | Japanese Wikipedia | Colorful Scoop | CC BY‑SA 3.0 | ◯ |
UniversityOfTokyoBERT | BERT (small) | 512 | Japanese Wikipedia (2.9GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | ◯ |
chiTra (Sudachi Transformers) | BERT (base) | 512 | NINJAL Web Japanese Corpus (148GB) | NINJAL, WAP Tokushima Laboratory of AI and NLP | Apache 2.0 | △ |
ACCMS BERT | BERT (base) | 512 | Japanese Wikipedia (3.3GB) | Kyoto University ACCMS | CC BY‑SA 4.0 | ◯ |
HitachiBERT | BERT (base) | 512 | Japanese Wikipedia, Japanese CC‑100 | Hitachi | CC BY‑NC‑SA 4.0 | ◯[13] |
RetrievaBERT | BERT [14] | 2,048 | Japanese CommonCrawl, RefinedWeb, Chinese Wikipedia, Korean Wikipedia, The Stack | Retrieva | Apache 2.0 | ◯ |
Bandai Namco DistilBERT | DistilBERT | 512 | (Distillation of TohokuUniversityBERT(base)) | Bandai Namco Research | MIT | ◯ |
Laboro DistilBERT | DistilBERT | 512 | (Distillation of Laboro BERT(base)) | Laboro.AI | CC BY‑NC 4.0 | ◯ |
LINE DistilBERT | DistilBERT | 512 | (Distillation of LINE internal BERT model) | LINE | Apache 2.0 | ◯ |
rinna RoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia, Japanese CC‑100 | rinna | MIT | ◯ |
WasedaRoBERTa | RoBERTa (base, large) | 512 | Japanese Wikipedia, Japanese CC‑100 | Waseda Kawahara Lab | CC BY‑SA 4.0 | ◯ (base, large, large (seq512))[15] |
InformatixRoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia, Web Articles (25GB) | Informatix | Apache 2.0 | △ |
KyotoUniversityRoBERTa | RoBERTa (base, large) | 512 | Japanese Wikipedia, Japanese CC‑100 | Kyoto University Language Media Processing Lab | CC BY‑SA 4.0 | ◯ (base (char-level), large (char-level)) |
YokohamaNationalRoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia (3.45GB) | Yokohama National University Mori Lab | Apache 2.0 | ◯ |
Megagon Labs RoBERTa | RoBERTa (base)[16] | 1,282 | Japanese mC4 (200M sentences) | Megagon Labs (Recruit Co.,Ltd.) | MIT | ◯ |
ACCMS RoBERTa | RoBERTa (base) | 512 | Japanese Wikipedia (3.3GB) + Japanese CC‑100 (70GB) | Kyoto University ACCMS | CC BY‑SA 4.0 | ◯ |
CinnamonELECTRA | ELECTRA (small) | 512 | Japanese Wikipedia | Cinnamon | Apache 2.0 | ◯ |
Megagon Labs ELECTRA | ELECTRA (base) | 512 | Japanese mC4 (200M sentences) | Megagon Labs (Recruit Co.,Ltd.) | MIT | ◯ |
UniversityOfTokyoELECTRA | ELECTRA (small, base) | 512 | Japanese Wikipedia (2.9GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | ◯ (small, base) |
JapaneseRoFormer | RoFormer (base) | 512 | Japanese Wikipedia (3.45GB) | Yokohama National University Mori Lab | Apache 2.0 | ◯ |
JapaneseLUKE | LUKE (base, large) | 512 | Japanese Wikipedia | Studio Ousia | Apache 2.0 | ◯ (base, large) |
KyotoUniversityDeBERTaV2 | DeBERTaV2 (tiny, base, large) | 512 | Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR (171GB) | Kyoto University Language Media Processing Lab | CC BY‑SA 4.0 | ◯ (tiny, tiny (char-level), base, large) |
KyotoUniversityDeBERTaV3 | DeBERTaV3 (base) | 512 | llm-jp-corpus | Kyoto University Language Media Processing Lab | Apache 2.0 | ◯ |
UniversityOfTokyoDeBERTaV2 | DeBERTaV2 (small, base) | 512 | Japanese Wikipedia, Japanese Wikinews, Japanese CC-100, Japanese mC4, Japanese OSCAR | University of Tokyo Izumi Lab | CC BY-SA 4.0 | ◯ (small, base) |
GLOBIS DeBERTaV3 | DeBERTaV3 (xsmall, base, large) | 512 | Wikipedia, WikiBooks, Aozora Bunko, Japanese CC-100, Japanese mC4, Japanese OSCAR | GLOBIS | CC BY-SA 4.0 | ◯ (xsmall, base, large) |
JapaneseBigBird | BigBird (base) | 4,096 | Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR | Waseda Kawahara Lab | CC BY‑SA 4.0 | ◯ |
JapaneseLayoutLM | LayoutLM (base) | 512 | Pre-trained on Japanese Wikipedia, initialized with TohokuUniversityBERT | The Japan Research Institute, Limited | CC BY-SA 3.0 | ◯ |
Domain Specific
Domain | Architecture | Training Data | Developer | License | HuggingFace? | |
---|---|---|---|---|---|---|
JapaneseBlogELECTRA | Colloquial language | ELECTRA (small) | Japanese Blog Corpus (354M sentences) | Kitami Institute of Technology Masui-Ptaszynski Lab | CC BY‑SA 4.0 | ◯ |
JapaneseSpokenLanguageBERT | Spoken language | BERT (base) | Additional training of TohokuUniversityBERT on the Corpus of Spontaneous Japanese (CSJ); the DAPT model also uses minutes of the National Diet | Retrieva | Apache 2.0 | ◯ |
AcademicRoBERTa | Science | RoBERTa (base) | CiNii Japanese Papers (6.3M sentences) | Ehime University AI Lab | Apache 2.0 | ◯ |
local-politics-BERT | Politics | BERT (base) | Wikipedia, Minutes of the National Diet, Minutes of the Local Assembly | Japanese Local Assembly Minutes Corpus Project | CC BY-SA 4.0 | ◯ (SC-min, SC-minwiki, SC-2M-wiki, SC-2M-min, SC-2M-minwiki, FP-min, FP-minwiki) [17] |
UBKE-LUKE | Economics | LUKE (base) | Japanese Wikipedia, Securities Reports, Economic News Articles | Uzabase | CC BY-NC | ◯ |
JapaneseFinancialBERT | Finance | BERT (small, base)[18] | Japanese Wikipedia, Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | ◯ (small, base) |
JapaneseFinancialELECTRA | Finance | ELECTRA (small) | Japanese Wikipedia (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi Lab | CC BY‑SA 4.0 | ◯ |
JapaneseNewsBERT | Business | BERT (base) | Japanese Business Articles (3M articles) | Stockmark | CC BY 4.0 | △ |
JapaneseNewsXLNet | Business | XLNet (base) | Japanese Business Articles (3M articles) | Stockmark | ? | ◯ ※ Unofficial release |
JapaneseNewsALBERT | Business | ALBERT (base) | Japanese Business Articles (3M articles) | Stockmark | ? | △ |
MinpakuBERT | Cultural Heritage | BERT (base) | Additional training with National Museum of Ethnology's cultural heritage data on top of Tohoku University BERT | University of Hyogo Ohshima Lab | MIT | ◯ (minpaku-v1, minpaku-v3, minpaku-v3-no-additional-token) |
UTH-BERT | Medicine | BERT (base) | Japanese Medical Records (120M lines) | University of Tokyo Hospital Medical AI Development Course | CC BY‑NC‑SA 4.0 | △ |
medBERTjp | Medicine | BERT (base) | Japanese Wikipedia, Japanese Medical Corpus ("今日の診療プレミアム/Today's Care Premium" Web Version) | Osaka University Hospital Medical Informatics Lab | CC BY‑NC‑SA 4.0 | △ |
JMedRoBERTa | Medicine | RoBERTa (base) | Japanese Medical Papers (11M sentences/1.8GB) | NII Aizawa Lab | CC BY‑NC‑SA 4.0 | ◯ (ManbyoWordPiece, SentencePiece)[19] |
Sentence and Document Embeddings [20]
Bi-Encoders
Single-representation bi-encoders
Multi-representation bi-encoders
Developer | License | |
---|---|---|
JaColBERTv2.5 (JaColBERTv2.4, JaColBERTv2.5) | Answer.AI | MIT |
JaColBERTv2 (JaColBERTv2) | Individual (Benjamin Clavié) | MIT |
JaColBERT (JaColBERT) | Individual (Benjamin Clavié) | MIT |
Cross-Encoders
Vision-Language Models
Text+Image to Text
Models built from scratch
General purpose
Architecture | Training Data | Developer | License / Terms of Use | |
---|---|---|---|---|
llava-calm2-siglip (llava-calm2-siglip) | LLaVA-1.5 | conversational data generated from MS-COCO and VisualGenome | CyberAgent | Apache 2.0 |
LLM-jp-3 VILA 14B (14b) | LLaVA-1.5 | Japanese image text pairs, LLaVA-Pretrain, Japanese interleaved data, coyo (subset), mmc4-core (subset), llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja, LLaVA-1.5 instruction data (subset) | Research and Development Center for Large Language Models | Apache 2.0 & OpenAI Terms of Use |
Heron (blip-ja-stablelm-base-7b-v0, blip-ja-stablelm-base-7b-v1, blip-ja-stablelm-base-7b-v1-llava-620k, git-ja-stablelm-base-7b-v0, git-ELYZA-fast-7b-v0, git-ja-stablelm-base-7b-v1) | BLIP-2 / GIT | v1: LLaVA-Instruct-150K-JA or LLaVA-Instruct-620K-JA v0: LLaVA-Instruct-150K-JA, Japanese STAIR Captions, Japanese Visual Genome VQA dataset | Turing | CC BY-NC 4.0 |
Japanese Stable VLM (japanese-stable-vlm) | LLaVA-1.5 | Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA dataset | Stability AI | STABILITY AI JAPANESE STABLE VLM COMMUNITY LICENSE |
Japanese InstructBLIP Alpha (japanese-instructblip-alpha) | InstructBLIP | Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA dataset | Stability AI | JAPANESE STABLELM RESEARCH LICENSE |
rinna MiniGPT-4 (bilingual-gpt-neox-4b-minigpt4) | MiniGPT-4 | CC12M, COCO 2014, Visual Genome, STAIR Captions, Japanese Visual Genome VQA dataset | rinna | MIT |
Domain Specific
Architecture | Domain | Developer | License | |
---|---|---|---|---|
watashiha/Watashiha-Llama-2-13B-Ogiri-sft-vlm | LLaVA | Oogiri | Watashiha | Llama 2 Community License |
Models built off non-Japanese VLMs
Base Model | Training Data | Developer | License | |
---|---|---|---|---|
AXCXEPT/EZO-InternVL2-26B | InternVL2 | - | Axcxept | MIT |
Merged models
Original Models (Japanese LLMs in bold) | Developer | License | |
---|---|---|---|
Llama-3-EvoVLM-JP-v2 (v2) | Mantis-8B-SigLIP-Llama-3, Llama-3-ELYZA-JP-8B, Bunny-v1.1-Llama-3-8B-V | Sakana AI | Llama 3 Community License |
AXCXEPT/Llama-3-EZO-VLM-1 | - (trained from Llama-3-EvoVLM-JP-v2) | Axcxept | Llama 3 Community License |
EvoVLM-JP (v1-7B) | Shisa Gamma 7B (v1), LLaVA-1.6-Mistral-7B | Sakana AI | Apache 2.0 |
Text to Image
General Purpose
Architecture | Training Data | Developer | License | |
---|---|---|---|---|
CommonArt β (commonart-beta) | PixArt-Σ | CommonCatalog-cc-by, Megalith-10M, Smithsonian Open Access, ArtBench (CC-0 only) | AI Picasso | Apache 2.0 |
EvoSDXL-JP (v1) | Stable Diffusion | - (merged from several diffusion models, including Japanese Stable Diffusion XL) | Sakana AI | Apache 2.0[21] |
Japanese Stable Diffusion XL (japanese-stable-diffusion-xl) | Stable Diffusion | undisclosed | Stability AI | STABILITY AI JAPANESE STABLE DIFFUSION XL COMMUNITY LICENSE |
TohokuUniversity Stable Diffusion (base, refiner) | Stable Diffusion | WMT2023 Shared Task English-Japanese parallel corpus, about 13 million captions from laion2B-multi | Tohoku University NLP Group | CreativeML OpenRAIL-M License |
rinna Stable Diffusion (japanese-stable-diffusion) | Stable Diffusion | LAION-5B Japanese Subset (100M images) | rinna | CreativeML OpenRAIL-M License |
Domain Specific
Architecture | Domain | Developer | License | |
---|---|---|---|---|
Evo-Nishikie (v1) | Stable Diffusion (ControlNet) | Ukiyo-e | Sakana AI | Apache 2.0[21:1] |
Evo-Ukiyoe (v1) | Stable Diffusion | Ukiyo-e | Sakana AI | Apache 2.0[21:2] |
Others
Architecture | Training Data | Developer | License | |
---|---|---|---|---|
LY CLIP (clip-japanese-base) | CLIP | CommonCrawl, CC12M, YFCC100M | LY Corp. | Apache 2.0 |
Recruit CLIP (japanese-clip-vit-b-32-roberta-base) | CLIP | about 120 million captions from laion2B-multi | Recruit Co.,Ltd. | CC BY 4.0 |
Japanese Stable CLIP (japanese-stable-clip-vit-l-16) | SigLIP | CC12M translated to Japanese, STAIR Captions | Stability AI | STABILITY AI JAPANESE STABLE CLIP COMMUNITY LICENSE |
rinna CLIP (japanese-clip-vit-b-16) | CLIP | CC12M translated to Japanese | rinna | Apache 2.0 |
rinna CLOOB (japanese-cloob-vit-b-16) | CLOOB | CC12M translated to Japanese | rinna | Apache 2.0 |
HAKUHODO Technologies CLIP (base, deeper, wider) | CLIP | about 120 million captions from laion2B-multi | HAKUHODO Technologies | CC BY-NC-SA 4.0 |
Speech-Language Models
Automatic Speech Recognition
Architecture | Training Data | Developer | License | |
---|---|---|---|---|
Kotoba-Whisper (v1.0, v1.0-ggml, v1.0-faster, v1.1, bilingual-v1.0, bilingual-v1.0-ggml, bilingual-v1.0-faster, v2.0, v2.0-ggml, v2.0-faster, v2.1, v2.2) | Distil-Whisper | ReazonSpeech | Kotoba Technologies | Apache 2.0 |
Nue ASR (nue-asr) | Nue ASR (HuBERT + LLM) | ReazonSpeech | rinna | Apache 2.0 |
ReazonSpeech (espnet-v1, espnet-next, espnet-v2, nemo-v2) | ESPnet (Conformer-Transducer) / NeMo (FastConformer-RNNT) | ReazonSpeech | Reazon Holdings | Apache 2.0 |
Others
Architecture | Training Data | Developer | License | |
---|---|---|---|---|
Kotoba-Speech (v0.1) | Transformer | undisclosed | Kotoba Technologies | Apache 2.0 |
UniversityOfTokyoHuBERT (base-jtube) | HuBERT | JTubeSpeech | University of Tokyo Saruwatari & Takamichi Lab | MIT |
rinna HuBERT (base, large) | HuBERT | ReazonSpeech | rinna | Apache 2.0 |
Reazon wav2vec 2.0 (base, large) | wav2vec 2.0 | ReazonSpeech | Reazon Holdings | Apache 2.0 |
rinna wav2vec 2.0 (base) | wav2vec 2.0 | ReazonSpeech | rinna | Apache 2.0 |
Evaluation Benchmarks for Japanese LLMs
Hybrid Benchmarks
Description | Developer | |
---|---|---|
Nejumi LLM Leaderboard3 | Evaluates the Japanese language capabilities of LLMs from three perspectives: language understanding ability, application ability, and alignment (including controllability and safety). For more details, see this article. | Weights & Biases |
Japanese LLM Evaluation | Conducts a comprehensive evaluation of various LLMs based on three types of tasks: Japanese language understanding and generation tasks, Japanese multi-turn dialogue tasks, and English language understanding and generation tasks. Also publishes swallow-evaluation, an evaluation script that integrates and improves existing LLM evaluation tools. | Swallow Project |
Traditional Benchmarks based on Natural Language Understanding tasks
Description | Developer | |
---|---|---|
Open Japanese LLM Leaderboard | Evaluates Japanese language models in 16 different tasks using llm-jp-eval. | LLM-jp, Hugging Face |
llm-jp-eval | A tool that evaluates Japanese LLMs automatically across multiple datasets. The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE). | LLM-jp |
JP Language Model Evaluation Harness | A fork by Stability AI of EleutherAI/lm-evaluation-harness. It is a tool for automatically evaluating Japanese LLMs across multiple datasets. The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE). There is a detailed summary of the evaluation results by rinna: [rinna] Benchmark of Stability-AI/lm-evaluation-harness | Stability AI |
JGLUE | Japanese version of the GLUE benchmark suite, including the MARC-ja, JCoLA, JSTS, JNLI, JSQuAD, and JCommonsenseQA tasks. JCoLA is by the University of Tokyo's Oseki Lab. See here and here (ja only) for further details about each task. | Waseda University Kawahara Lab and Yahoo |
JMMLU | A benchmark constructed as a Japanese version of the MMLU Benchmark, consisting of multiple-choice questions from a wide range of academic fields including natural sciences, humanities, and social sciences. In addition to translating the original MMLU, it features newly added problems based on the unique cultural background of Japan (Japan-specific problems). | Waseda University Kawahara Lab |
Benchmarks on open-ended generative tasks
Description | Developer | |
---|---|---|
Japanese MT-bench | The Japanese version of MT-bench, which tests multi-turn conversational ability. It includes 80 questions, 10 from each of 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, and Humanities. Some questions were modified to fit Japanese culture when the Japanese version was produced. It also includes a script in which GPT-4 scores responses on a 10-point absolute scale. | Stability AI |
ELYZA-tasks-100 | Ranking based on model responses to 100 complex and diverse tasks, including tasks testing summarization, correction, abstraction, induction, and other skills. Uses humans to score the model responses and then ranks models based on their mean scores. | ELYZA |
Preferred Generation Benchmark (pfgen-bench) | A benchmark to measure the Japanese language generation ability of LLMs based on 50 common sense questions unique to the Japanese context. It evaluates along three axes: Fluency, Truthfulness, and Helpfulness. The evaluation is conducted without using LLM-as-a-Judge by calculating n-gram or rule-based metrics. | Preferred Elements (Preferred Networks) |
Rakuda Benchmark | Ranking based on model answers to 40 open-ended questions on Japanese geography, history, politics, and society. Uses GPT-4 to judge model outputs pairwise, and then ranks models by fitting a Maximum Likelihood Elo/Bradley-Terry model to GPT-4's preferences. | YuzuAI |
Japanese Vicuna QA Benchmark | This is the Japanese version of vicuna-blog-eval, which is the predecessor of MT-Bench. It includes 80 questions on general knowledge, role-playing, common sense, Fermi estimation, counterfactual thinking, coding, mathematics, and writing. It also includes a script for automatic evaluation by GPT-4 (win-rate calculation). The leaderboard can be found here. | Kyoto University Language Media Processing Lab |
Tengu-Bench | Includes 120 free-form questions from various categories. Categories of questions: table interpretation, logic puzzles, idea generation, function calling, long document summarization (over a thousand tokens), conversation summarization, long document closed QA (over a thousand tokens), honorifics, project creation, math, translation, extraction, ethical control, cost estimation, Japan, chit-chat, puns, formatting, construction, business, legal judgment, politics, hypothetical questions. | Lightblue |
Shaberi | A framework that can collectively evaluate the Japanese MT-bench, Rakuda Benchmark, ELYZA-tasks-100, and Tengu-Bench. There is also a fork by Shisa.AI. | Lightblue |
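The Rakuda Benchmark entry above ranks models by fitting a Bradley-Terry model to pairwise judgments. As a rough illustration of that idea (not Rakuda's actual implementation), the sketch below estimates Bradley-Terry strengths from a list of (winner, loser) pairs with the classic minorization-maximization update. The function name `fit_bradley_terry`, the model names, and the toy comparison data are all hypothetical, and the sketch assumes every model wins at least once.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration. Returns a dict mapping each
    model to a strength, normalized so that the strengths sum to 1.
    Assumes every model has at least one win.
    """
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # number of games per unordered pair
    models = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # MM update: wins divided by the expected-games term.
            denom = sum(
                games[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Toy judgments: each pair of models is compared four times.
comparisons = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")]
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")]
)
strengths = fit_bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

Under this model, the probability that model i beats model j is strength[i] / (strength[i] + strength[j]), so sorting by strength gives the leaderboard order.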
Benchmarks for measuring performance in specific domains
Description | Developer | |
---|---|---|
Japanese Language Model Financial Evaluation Harness | A benchmark for Japanese LLMs in the financial sector. It includes tasks such as sentiment analysis in finance (chabsa), basic knowledge tasks in securities analysis (cma_basics), tasks related to audits in certified public accountant examinations (cpa_audit), multiple choice question tasks in financial planner exams (fp2), and mock exam tasks for securities salespeople exams (security_sales_1). For more details, please see here. | Preferred Networks |
pfmt-bench-fin-ja | A benchmark for measuring the generation capabilities of Japanese LLMs in the financial domain. | Preferred Networks |
Stockmark Business Questions | The collection includes 50 questions that probe knowledge on topics such as market trends, current affairs, social issues, and business trends. | Stockmark |
JMED-LLM | A dataset for evaluating LLMs in the Japanese medical domain. It compiles previously developed Japanese medical language processing tasks for LLM benchmarking. | NAIST Social Computing Lab. |
JMedBench | A benchmark for LLMs in the Japanese medical field. It includes 20 datasets in 5 types of tasks: multi-choice question-answering, machine translation, named entity recognition, document classification, and semantic textual similarity (some datasets are borrowed from JMMLU and JMED-LLM). A tool called med-eval is developed to facilitate evaluation on JMedBench. | NII Aizawa Lab |
Japanese Medical Language Model Evaluation Harness | A benchmark for evaluating Japanese LLMs in the medical domain in both Japanese and English, executable by a single command. | Individual (Issey Sukeda) |
karakuri-bench | A dataset for measuring performance of Japanese LLMs in customer support. | KARAKURI |
Benchmarks for measuring factuality and safety
Description | Developer | |
---|---|---|
JTruthfulQA | The Japanese version of TruthfulQA, a dataset for evaluating the factuality of LLMs. It includes questions about superstitions and other beliefs held by some people that are not factual, as well as questions about Japan-specific knowledge, all collected from scratch. | Waseda University Kawahara Lab |
JCommonsenseMorality | A dataset on Japanese commonsense morality. Sentences describing actions are labeled with binary values indicating whether they are morally wrong or acceptable. | Hokkaido University Language Media Lab |
JBBQ | The Japanese version of the social bias QA dataset BBQ, developed through translation, revision, and addition of questions based on Japanese culture and customs. | University of Tokyo Yanaka Lab |
Benchmarks for measuring logical reasoning capabilities
Description | Developer | |
---|---|---|
JFLD (Japanese Formal Logic Deduction) | A dataset for evaluating the deductive reasoning capabilities of Japanese LLMs (the Japanese version of FLD (Formal Logic Deduction), proposed by the same authors). It is composed of counterfactual samples so that reasoning can be evaluated separately from the knowledge the LLM possesses. | Hitachi |
JHumanEval | A Japanese version of the HumanEval benchmark, which assesses the ability to generate Python code from English instructions. In creating the Japanese version, the text was first machine-translated and then manually corrected. | Japan Women's University Kuramitsu Lab |
Benchmarks on controlled text generation
Description | Developer | |
---|---|---|
LCTG Bench | A benchmark for the controllability of Japanese LLMs. It evaluates whether LLMs can adhere to constraints in four aspects: output format, character count, keywords, and forbidden words. The quality of the generated text is also evaluated. | CyberAgent |
Benchmarks for embedding models
Description | Developer | |
---|---|---|
JMTEB | A benchmark developed as the Japanese version of MTEB. It consists of tasks such as document clustering, text classification, sentence similarity, sentence pair labeling prediction, and text retrieval (a reranking task was recently added). | SB Intuitions |
JQaRA | A dataset for evaluating Japanese document retrieval and reranking accuracy. Each of the 1,667 questions is assigned 100 candidate documents, of which at least one can answer the question. The questions are taken from JAQKET, and the candidate documents are sourced from Japanese Wikipedia. | Individual (Yuichi Tateno) |
JaCWIR | A dataset created for evaluating document retrieval and reranking in domains other than Wikipedia. Each of the 5,000 questions is assigned one Web page that serves as the source of the question and 99 unrelated Web pages. | Individual (Yuichi Tateno) |
Benchmarks for vision-language models
Description | Developer | |
---|---|---|
JMMMU | A benchmark constructed as the Japanese version of MMMU Benchmark. It consists of 720 translated MMMU problems and 600 new problems unique to Japanese culture. | University of Tokyo Aizawa Lab |
JDocQA | A question-answer dataset based on Japanese documents (pamphlets, slides, reports, websites), consisting of a total of 11,600 questions. It includes various question formats, including unanswerable questions. | NAIST Watanabe Lab |
Heron VLM Leaderboard powered by Nejumi/WandB | Summarizes the evaluation results of Japanese-Heron-Bench and LLaVA-Bench-In-the-Wild (Japanese). | Turing, Weights & Biases |
Japanese-Heron-Bench | 21 images are assigned a total of 102 questions. It is characterized by image-question pairs that require knowledge related to Japan. | Turing |
JA-VLM-Bench-In-the-Wild | A dataset independently prepared by Sakana AI to evaluate EvoVLM-JP-v1-7B. It consists of 50 questions assigned to 42 images. It is characterized by images and questions that require knowledge about Japan. | Sakana AI |
JA-Multi-Image-VQA | A dataset for evaluating the question-answering ability in Japanese for multiple images. | Sakana AI |
LLaVA-Bench-In-the-Wild (Japanese) | This is the Japanese version of LLaVA-Bench-In-the-Wild, translated using DeepL. It consists of 60 questions assigned to 24 images. | Turing |
LLaVA-Bench (COCO) Japanese | This is the Japanese version, translated by DeepL, of the LLaVA-Bench (COCO) dataset used to evaluate LLaVA. It consists of 30 images, each with 3 types of questions assigned to them. | Turing |
Japanese Visual Genome VQA dataset | A question-and-answer dataset annotated based on images from the Visual Genome dataset. A subset of this dataset, JA-VG-VQA-500, consisting of 500 questions, is sometimes used as a benchmark for evaluating VLMs. | Yahoo |
References for Models and Architectures
References for Training Methods
Date | Venue | Paper | |
---|---|---|---|
PPO (RLHF) | 2017.07.20 | - | Proximal Policy Optimization Algorithms |
Instruction Tuning (Supervised Fine-tuning; SFT) | 2021.09.03 | ICLR 2022 | Finetuned Language Models Are Zero-Shot Learners |
Sparse Upcycling | 2022.12.09 | ICLR 2023 | Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints |
DPO | 2023.05.29 | NeurIPS 2023 | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
SteerLM | 2023.10.09 | EMNLP 2023 (Findings) | SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF |
ORPO | 2024.03.12 | EMNLP 2024 | ORPO: Monolithic Preference Optimization without Reference Model |
Our Contributors
We love contributors! Feel free to contribute to this project.
Citation
The summary of this repository is also published as a preprint: Exploring Open Large Language Models for the Japanese Language: A Practical Guide
When referencing this repository, please cite as follows:
@article{awesomeJapanese2024,
title={{Exploring Open Large Language Models for the Japanese Language: A Practical Guide}},
author={Kaito Sugimoto},
doi={10.51094/jxiv.682},
journal={Jxiv preprint},
year={2024}
}
Some architectural changes have been made. For details, refer to: 1,000億パラメータ規模の独自LLM「PLaMo-100B」の事前学習 ↩︎
Refer to the following articles: 大規模言語モデルTanuki-8B, 8x8Bの位置づけや開発指針など, 大規模言語モデルを開発するにあたっての事前・事後学習の戦略メモー特に合成データについてー ↩︎ ↩︎
Some performance enhancements have been made to the original Llama model. See here for details. ↩︎
Details have not been made public but the private dataset includes data from the EleutherAI Polyglot project's Japanese team and from members of Stable Community Japan. ↩︎
This project conducted evaluation research on using right-to-left generation instead of the usual left-to-right generation, releasing both left-to-right and right-to-left models. ↩︎
Before conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added. ↩︎ ↩︎
After conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added. ↩︎ ↩︎
However, if commercial use of KARAKURI LM is desired, direct contact with the developer, KARAKURI Inc., is required. ↩︎
Instruction Tuning uses data generated by OpenAI's models, such as GPT-3.5 and GPT-4, so the model may violate OpenAI's terms of use. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎
Before conducting Instruction Tuning, a Chat Vector between Gemma 2 Instruct and Gemma 2 Base is added. ↩︎
◯: The model is on the HuggingFace Model Hub and can be loaded with the AutoModel.from_pretrained() command. △: The model is not on the Model Hub but can be loaded manually with the HuggingFace transformers library. ✕: The model is not directly loadable with HuggingFace. ↩︎
By removing Causal Attention from Llama, it is used as an encoder-type model. ↩︎
This project conducted evaluation research on pre-tokenization morphological analysis and released their best performing model, which used Juman++ and BPE. ↩︎
However, the maximum sequence length has been extended to 2048, and various architectural changes have been made compared to the original BERT. See the HuggingFace repository README for details. ↩︎
nlp-waseda/roberta-base-japanese and nlp-waseda/roberta-large-japanese were trained with a 128-token context length, while nlp-waseda/roberta-large-japanese-seq512 extends the context length to 512. ↩︎
Extended to a 1282 context length from the usual 512. ↩︎
For details of each model, please refer to Chapter 4 of the authors' paper. Note that the SC-2M-wiki model is not, strictly speaking, a domain-specific model, as it is pre-trained only on Wikipedia. ↩︎
The "small" model trains on Japanese Wikipedia and the Japanese Financial Corpus simultaneously, while the "base" model takes the TohokuUniversityBERT and conducts additional training on the Japanese Financial Corpus. ↩︎
ManbyoWordPiece conducts a pre-tokenization step using MeCab (IPA+Manbyo dictionaries) and uses WordPiece for subword tokenization, while the SentencePiece model tokenizes text directly using a unigram model. ↩︎
The classification of embedding models follows Dense Text Retrieval based on Pretrained Language Models: A Survey (Zhao+, 2022). The Bi-Encoder architecture feeds two inputs into the model separately and vectorizes each, using the dot product or cosine similarity of the vectors as a measure of their proximity. In contrast, the Cross-Encoder architecture feeds the two inputs into the model together and computes their proximity directly inside the model. Although Cross-Encoders incur higher computational costs, they are often used as rerankers in information retrieval because they can compute input proximity more precisely. Some Bi-Encoders (e.g., ColBERT) represent the input as multiple vectors (such as one per token) rather than a single vector, hence the further classification into Single-representation bi-encoders and Multi-representation bi-encoders. ↩︎
However, the developer asks users to keep in mind that it is intended for research and educational use. Also be aware that some of the source models are licensed under terms other than Apache 2.0. ↩︎ ↩︎ ↩︎
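The note on embedding models above distinguishes single-representation from multi-representation bi-encoders. A minimal sketch of the two scoring schemes follows, assuming hand-written toy vectors in place of real encoder outputs. The function names are hypothetical, and the multi-vector score uses the ColBERT-style sum of per-query-token maxima with cosine similarity, which is one common formulation rather than the exact method of any model listed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def single_vector_score(query_vec, doc_vec):
    """Single-representation bi-encoder: one vector per text."""
    return cosine(query_vec, doc_vec)


def max_sim_score(query_token_vecs, doc_token_vecs):
    """Multi-representation scoring: for each query token vector, take
    the best-matching document token vector, then sum those maxima."""
    return sum(
        max(cosine(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Toy vectors standing in for encoder outputs (assumed, not real data).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pooled_score = single_vector_score([1.0, 1.0], [2.0, 1.0])
late_interaction_score = max_sim_score(query_tokens, doc_tokens)
```

In both cases the query and the document are encoded independently, which is what makes them bi-encoders; a cross-encoder would instead take the concatenated pair as a single input and output the score directly.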