
Overview of Japanese LLMs

[ English | Français | 日本語 ]

Parameter sizes of Japanese and non-Japanese LLMs over time

Evolution of parameter sizes for Japanese LLMs and non-Japanese LLMs. Information on the Japanese models is drawn from this article, while information on the non-Japanese models can be found in the Models table on LifeArchitect.ai. Due to space constraints, some models have been omitted from the figure, and the parameter counts for some non-Japanese models are estimates. Please notify us of any corrections, additions, or updates.

A list of publicly available LLMs trained with a focus on Japanese, along with their evaluation benchmarks, maintained by volunteers and compiled from sources such as academic papers and other public resources.

Caution

  1. We can't guarantee the accuracy or completeness of any information here.
  2. Some information is based on conjecture and might not reflect your specific use case.
  3. While many models are released under permissive licenses such as MIT or Apache 2.0, some are subject to more restrictive terms, including non-commercial clauses (e.g., CC BY-NC-SA 4.0) or other stipulations.

Please point out any errors on the issues page. Feel free to contribute directly with a pull request.


Text Generation Models

For multimodal models, see below.

Models built from scratch

General purpose

Model | Architecture | Max Context Length | Training Data | Developer | License / Terms of Use
Sarashina2-8x70BMixtral
(8x70b (465b))
8,192undisclosedSB IntuitionsSarashina Model NonCommercial License
LLM-jp-3 172B beta2Llama
(172b-beta2, 172b-beta2-instruct2)
4,096Pre-training: part of llm-jp-corpus-v3
(1.4T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, magpie-sft-v1.0, Daring-Anteater, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft-ja, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k
Research and Development Center for Large Language Models (LLMC)LLM-jp-3 172B beta2 Terms of Use
LLM-jp-3 172B beta1Llama
(172b-beta1, 172b-beta1-instruct)
4,096Pre-training: part of llm-jp-corpus-v3
(0.7T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN
Research and Development Center for Large Language Models (LLMC)LLM-jp-3 172B beta1 Terms of Use
LLM-jp-3 172B alphaLlama
(172b-alpha1, 172b-alpha1-instruct, 172b-alpha2, 172b-alpha2-instruct)
4,096Pre-training: part of llm-jp-corpus-v3
(alpha1: 0.7T tokens, alpha2: 1.4T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2, Aya Dataset, ichikara-instruction-format, Daring-Anteater, FLAN
Research and Development Center for Large Language Models (LLMC)Apache 2.0
Stockmark-100bLlama
(100b, 100b-instruct-v0.1)
4,096Pre-training: RedPajama, Japanese Wikipedia, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus
(910B tokens)
Instruction Tuning (LoRA): ichikara-instruction
StockmarkMIT
PLaMo-100B-PretrainedLlama[1]
(100b)
4,096Pre-training: Japanese CommonCrawl, RefinedWeb, undisclosed
(2.0T tokens)
Preferred ElementsPLaMo Non-Commercial License
Sarashina2Llama
(7b, 13b, 70b)
7b, 13b: 4,096
70b: 8,192
Pre-training: Japanese Common Crawl, SlimPajama, StarCoder
(2.1T tokens)
SB IntuitionsMIT
Sarashina1GPT-NeoX
(7b, 13b, 65b)
2,048Pre-training: Japanese Common Crawl
(1T tokens)
SB IntuitionsMIT
Tanuki-8×8BTanuki (MoE) (47b)
(v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF)
4,096Pre-training: various Web & synthetic datasets (1.7T tokens)
SFT, DPO: various synthetic datasets [2]
Matsuo Lab LLM Development ProjectApache 2.0
CyberAgentLM3 (CALM3)Llama
(22b-chat)
16,384undisclosed
(2.0T tokens)
CyberAgentApache 2.0
LLM-jp-3 13BLlama
(1.8b, 1.8b-instruct, 3.7b, 3.7b-instruct, 13b, 13b-instruct)
4,096Pre-training: llm-jp-corpus-v3
(2.1T tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, FLAN, ichikara-instruction-format, AutoMultiTurnByCalm3-22B, ramdom-to-fixed-multiturn-Calm3, wizardlm8x22b-logical-math-coding-sft_additional-ja, Synthetic-JP-EN-Coding-Dataset-567k
Research and Development Center for Large Language Models (LLMC)Apache 2.0
llm-jp-3-3.7b-instruct-EZOLlama
(3.7b-instruct-EZO-Common, 3.7b-instruct-EZO-Humanities)
4,096additionally trained on LLM-jp-3 (3.7B)AxcxeptApache 2.0
LLM-jp-13B v2.0Llama
(13b-v2.0, 13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0, 13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0)
4,096Pre-training: llm-jp-corpus-v2
(260B tokens)
Instruction Tuning: ichikara-instruction, answer-carefully, Dolly Dataset, OASST1, OASST2
LLM-jpApache 2.0
Fugaku-LLMGPT
(13B, 13B-instruct, 13B-instruct-gguf)
2,048Pre-training: undisclosed dataset
Instruction Tuning: OASST1, Dolly Dataset, GSM8K
Titech, Tohoku Univ., Fujitsu, RIKEN, Nagoya Univ., CyberAgent, Kotoba TechnologiesFugaku-LLM Terms of Use
LLM-jp-13B v1.1GPT
(13b-instruct-lora-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-instruct-full-dolly_en-dolly_ja-ichikara_003_001-oasst_en-oasst_ja-v1.1, 13b-dpo-lora-hh_rlhf_ja-v1.1)
2,048Instruction Tuning (LoRA or Full-parameter FT): Dolly Dataset, OASST1, ichikara-instruction
DPO (LoRA): HH RLHF
LLM-jpApache 2.0
LLM-jp-13BGPT
(1.3b-v1.0, 13b-v1.0, 13b-instruct-full-jaster-v1.0, 13b-instruct-full-jaster-dolly-oasst-v1.0, 13b-instruct-full-dolly-oasst-v1.0, 13b-instruct-lora-jaster-v1.0, 13b-instruct-lora-jaster-dolly-oasst-v1.0, 13b-instruct-lora-dolly-oasst-v1.0)
2,048Pre-training: llm-jp-corpus (Wikipedia, Japanese mC4, The Pile, Stack) (300B tokens)
Instruction Tuning (Full-parameter FT or LoRA): jaster, Dolly Dataset, OASST1
LLM-jpApache 2.0
PLaMo-13BLlama[3]
(13b, 13b-instruct, 13b-instruct-nc)
base: 4,096
instruct, instruct-nc: 8,192
Pre-training: C4, Project Gutenberg, RedPajama, Japanese Wikipedia, Japanese mC4
(1.5T tokens)
Instruction Tuning: Dolly, HH RLHF, OASST1, wikinews (+Alpaca in NC model)
Preferred NetworksApache 2.0
(CC BY-NC 4.0 as for NC model)
Stockmark-13bLlama
(13b, 13b-instruct)
2,048Pre-training: Japanese Wikipedia, Japanese CC-100, Japanese mC4, Japanese CommonCrawl, Japanese Patent, Stockmark Web Corpus
(220B tokens)
Instruction Tuning (LoRA): ichikara-instruction
Stockmarkbase: MIT
instruct: CC BY-NC-SA 4.0
Weblab-10BGPT-NeoX
(10b, 10b-instruction-sft)
2,048Japanese mC4, The Pile
(600B tokens)
Instruction Tuning: Alpaca, FLAN
University of Tokyo Matsuo LabCC BY‑NC 4.0
Tanuki-8BTanuki (8b)
(v1.0, v1.0-AWQ, v1.0-GPTQ-4bit, v1.0-GPTQ-8bit, v1.0-GGUF)
4,096Pre-training: various Web & synthetic datasets (1.3T tokens)
SFT, DPO: various synthetic datasets [2:1]
Matsuo Lab LLM Development ProjectApache 2.0
Japanese StableLM AlphaGPT-NeoX
(base-alpha-7b, instruct-alpha-7b, instruct-alpha-7b-v2)
2,048Wikipedia, Japanese CC‑100, Japanese mC4, Japanese OSCAR, RedPajama, private datasets[4]
(750B tokens)
Instruction Tuning: Dolly, HH‑RLHF, wikinews, Alpaca (discarded in v2)
Stability AIbase: Apache 2.0
instruct (v1): Research license
instruct (v2): Apache 2.0
CyberAgentLM2 (CALM2)Llama
(7b, 7b-chat, 7b-chat-dpo-experimental)
base: 4,096
chat: 32,768
publicly available Japanese and English datasets (details unknown)
(1.3T tokens)
DPO: Chatbot Arena Conversations JA (calm2) Dataset
CyberAgentApache 2.0
(CC BY 4.0 as for DPO model)
OpenCALMGPT-NeoX
(small, medium, large, 1b(1.4b), 3b(2.7b), 7b(6.8b))
2,048Japanese Wikipedia, Japanese mC4, Japanese CC‑100CyberAgentCC BY‑SA 4.0
StormyGPT-NeoX
(7b(6.8b))
2,048OpenCALM fine-tuned on
llm-japanese-dataset v0 non-translation tasks
University of Tokyo Izumi LabCC BY‑SA 4.0
rinna GPT
(En-Ja Bilingual)
GPT-NeoX
(4b(3.8b), 4b(3.8b)-8k, 4b(3.8b)-instruction-sft, 4b(3.8b)-instruction-ppo)
8k model: 8,192
others: 2,048
Wikipedia, Japanese CC‑100, Japanese C4, RedPajama, The Pile
(524B tokens)
Instruction Tuning: HH‑RLHF, FLAN
PPO: HH‑RLHF for reinforcement learning
8k: trained with long context
rinnaMIT
japanese-large-lmGPT-NeoX
(1.7b, 3.6b, 1.7b-instruction-sft, 3.6b-instruction-sft)
2,048Japanese Wikipedia, Japanese CC‑100, Japanese C4, Japanese OSCAR and private datasets
(650GB)
Instruction Tuning: OASST1
LINEApache 2.0
rinna GPT
(Japanese only)
GPT / GPT-NeoX
(xsmall, small, medium, 1b, neox-small, neox-3.6b, neox-3.6b-instruction-sft, neox-3.6b-instruction-sft-v2, neox-3.6b-instruction-ppo)
≤ 2,048Japanese Wikipedia, Japanese CC‑100
(1b and up models add
Japanese mC4)
Instruction Tuning: HH‑RLHF, FLAN, SHP
PPO: HH‑RLHF for reinforcement learning
rinnaMIT
RetrievaT5T5
(small (short), small (medium), small (long), base (short), base (medium), base (long), large (short), large (medium), large (long), xl(3b))
Japanese Wikipedia, Japanese mC4RetrievaCC BY‑SA 4.0
Spiral-RetNet-3b-baseRetNet
(3b)
2,048Wikipedia, Japanese CC-100, CulturaXSpiral.AIMIT
kotomamba-2.8BMamba
(2.8B-v1.0)
2,048Japanese Wikipedia, Swallow Corpus, SlimPajamaKotoba TechnologiesApache 2.0
ABEJA GPTGPT / GPT-NeoX
(large, neox-2.7b)
Japanese Wikipedia, Japanese CC‑100, Japanese OSCARABEJAMIT
WasedaGPTGPT
(small, xl(1.5b))
Japanese Wikipedia, Japanese CC‑100Waseda Kawahara LabCC BY‑SA 4.0
StockmarkGPTGPT-NeoX
(1.4b)
Japanese Wikipedia (0.88B tokens), Japanese CC‑100 (10.5B tokens), private data (8.6B tokens)StockmarkMIT
YellowbackGPTGPT-NeoX
(1.3b)
Japanese Wikipedia, Japanese CC‑100, Japanese OSCARYellowbackApache 2.0
colorfulscoop GPTGPT
(small)
Japanese WikipediaColorful ScoopCC BY‑SA 3.0
TitechGPTGPT
(medium, medium-reversed) [5]
Japanese Wikipedia, Japanese CC‑100Titech Okazaki LabCC BY‑SA 4.0
KyotoUniversityGPTGPT
(small, medium, large)
Japanese Wikipedia (3.2GB), Japanese CC‑100 (85GB), Japanese OSCAR (54GB)Kyoto University Language Media Processing LabCC BY‑SA 4.0
JapaneseBARTBART
(base, large)
Japanese Wikipedia (18M sentences)Kyoto University Language Media Processing LabCC BY‑SA 4.0
Megagon Labs T5T5
(base)
Japanese mC4 (782 GB), Japanese wiki40b (2 GB)Megagon Labs
(Recruit Co.,Ltd.)
Apache 2.0
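One entry in the table above is worth unpacking: Sarashina2-8x70B is listed as "8x70b (465b)", not 8 × 70B = 560B. In a Mixtral-style MoE, only the feed-forward experts are replicated; the attention blocks and embeddings are shared across experts. The sketch below reproduces that accounting using Llama-2-70B-like dimensions as stand-ins — these dimensions are assumptions for illustration, not published Sarashina2 figures.

```python
# Rough parameter accounting for a Mixtral-style 8-expert MoE built on a
# 70B-class dense transformer. Dimensions are Llama-2-70B-like assumptions.
layers = 80
hidden = 8192
ffn = 28672          # intermediate (FFN) width
kv_dim = 1024        # grouped-query attention: 8 KV heads x 128 head dim
vocab = 32_000

# Per-layer attention: Q and O are hidden x hidden, K and V are hidden x kv_dim.
attn = layers * (2 * hidden * hidden + 2 * hidden * kv_dim)
# Gated FFN (SwiGLU): three hidden x ffn matrices per layer -- the part
# an MoE replicates once per expert.
ffn_per_expert = layers * (3 * hidden * ffn)
embeddings = 2 * vocab * hidden  # input embedding + output head

dense_total = attn + ffn_per_expert + embeddings
moe_total = attn + 8 * ffn_per_expert + embeddings

print(f"dense ~{dense_total / 1e9:.0f}B, 8-expert MoE ~{moe_total / 1e9:.0f}B")
```

With these assumed dimensions the dense model lands near 69B and the 8-expert MoE near 464B; the small gap to the listed 465b would come from norms, router weights, and the actual vocabulary size.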

Domain Specific

Model | Domain | Architecture | Training Data | Developer | License
Japanese Dialog Transformer | Dialog | Transformer | Twitter Japanese reply pairs | NTT | Evaluation License
Japanese News BART | Business | BART (base) | Japanese business news articles (21M articles) | Stockmark | MIT
AcademicBART | Science | BART (base) | CiNii Japanese Papers | Ehime University AI Lab | Apache 2.0

Models built off non-Japanese LLMs (w/ continual pre-training on Japanese)

General purpose

Model | Base Model | Training Data | Developer | License / Terms of Use
Llama 3.1 Swallow 70B
(70B-v0.1, 70B-Instruct-v0.1)
Llama 3.1 (70b)Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie
Swallow ProjectLlama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model)
cyberagent/Llama-3.1-70B-Japanese-Instruct-2407Llama 3.1 (70b)undisclosedCyberAgentLlama 3.1 Community License
Llama 3 Swallow 70B
(70B-v0.1, 70B-Instruct-v0.1)
Llama 3 (70b)Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1 [6]
Swallow ProjectLlama 3 Community License
turing-motors/Llama-3-heron-brain-70B-v0.3Llama 3 (70b)additionally trained on Llama 3 Swallow 70B (details undisclosed)TuringLlama 3 Community License
Llama 3 Youko 70B
(70b, 70b-instruct, 70b-gptq, 70b-instruct-gptq)
Llama 3 (70b)Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(5B tokens)
Instruction Tuning: undisclosed dataset [7]
rinnaLlama 3 Community License
Swallow 70B
(70b-hf, 70b-instruct-hf, 70b-instruct-v0.1, 70b-NVE-hf, 70b-NVE-instruct-hf)
Llama 2 (70b)Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
*v0.1: OASST1, OASST2
Swallow ProjectLlama 2 Community License
KARAKURI LM
(70b-v0.1, 70b-chat-v0.1)
Llama 2 (70b)Pre-training: mC4, CC100, OSCAR, RedPajama, undisclosed dataset
(16B tokens)
SteerLM: OASST2, undisclosed dataset
KARAKURILlama 2 Community License[8]
Japanese Stable LM Beta 70B
(base-beta-70b, instruct-beta-70b)
Llama 2 (70b)Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
Stability AILlama 2 Community License
Swallow-MX 8x7B
(8x7b-NVE-v0.1)
Mixtral-8x7B-Instruct-v0.1 (46.7b)Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile, The VaultSwallow ProjectApache 2.0
KARAKURI LM 8x7B Instruct v0.1
(8x7b-instruct-v0.1)
Mixtral-8x7B-Instruct-v0.1 (46.7b)trained Swallow-MX 8x7B on the following datasets: Dolly Dataset, OASST2, HelpSteer, glaive-code-assistant-v3, glaive-function-calling-v2, synthetic_text_to_sql, MetaMathQA, orca-math-word-problems-200k, rag-dataset-12000, rag-hallucination-dataset-1000, undisclosed datasetKARAKURIApache 2.0 (?)[9]
KARAKURI LM 8x7B Chat v0.1
(8x7b-chat-v0.1)
Mixtral-8x7B-Instruct-v0.1 (46.7b)trained Swallow-MX 8x7B on OASST2, HelpSteer, and undisclosed datasets using SteerLMKARAKURIApache 2.0
ABEJA-Mixtral-8x7B-japanese
(8x7B-v0.1-japanese, 8x7B-Instruct-v0.1-japanese, 8x7B-Instruct-v0.1-japanese-alpha, 8x7B-Instruct-v0.1-japanese-alpha-merged)
Mixtral-8x7B-Instruct-v0.1 (46.7b)
*The model without "Instruct" in its name is based on Mixtral-8x7B-v0.1
Pre-training: Japanese CC, Redpajama, undisclosed dataset
(450B tokens)
ABEJAApache 2.0
Nekomata 14B
(14b, 14b-instruction, 14b-gguf, 14b-instruction-gguf)
Qwen (14b)Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(66B tokens)
Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset
rinnaTongyi Qianwen LICENSE
Swallow 13B
(13b-hf, 13b-instruct-hf, 13b-instruct-v0.1, 13b-NVE-hf)
Llama 2 (13b)Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
*v0.1: OASST1, OASST2
Swallow ProjectLlama 2 Community License
LEIA-Swallow-13B
(13b)
Llama 2 (13b)additionally trained Swallow 13B using LEIAIndividual (Ikuya Yamada, Ryokan Ri)Llama 2 Community License
ELYZA-japanese-Llama-2-13b
(13b, 13b-instruct, 13b-fast, 13b-fast-instruct)
Llama 2 (13b)Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data
(18B tokens)
Instruction Tuning: undisclosed dataset
ELYZALlama 2 Community License
cyberagent/Mistral-Nemo-Japanese-Instruct-2408Mistral NeMo (12b)undisclosedCyberAgentApache 2.0
Llama 3.1 Swallow 8B
(8B-v0.1, 8B-Instruct-v0.1, 8B-v0.2, 8B-Instruct-v0.2)
Llama 3.1 (8b)Pre-training: The Stack v2, Wikipedia, DCLM-baseline-1.0, Swallow Corpus Version 2, Cosmopedia, Laboro ParaCorpus
Instruction Tuning: lmsys-chat-1m-synth-ja-wo-pii-and-template-instructions, lmsys-chat-1m-synth-en-wo-pii-and-template-instructions, filtered-magpie-ultra-ja, filtered-magpie-ultra-en, gemma-magpie
Swallow ProjectLlama 3.1 Community License
(Gemma Terms of Use is also applied to the Instruct model)
Llama 3 Swallow 8B
(8B-v0.1, 8B-Instruct-v0.1)
Llama 3 (8b)Pre-training: Algebraic Stack, Wikipedia, RefinedWeb, Swallow Corpus, Cosmopedia, Laboro ParaCorpus, OpenWebMath
Instruction Tuning: OASST1 [6:1]
Swallow ProjectLlama 3 Community License
turing-motors/Llama-3-heron-brain-8B-v0.3Llama 3 (8b)additionally trained on Llama 3 Swallow 8B (details undisclosed)TuringLlama 3 Community License
Llama 3 Youko 8B
(8b, 8b-instruct, 8b-gptq, 8b-instruct-gptq)
Llama 3 (8b)Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(22B tokens)
Instruction Tuning[7:1]: Aya Dataset (Japanese subset), FLAN, Dolly Dataset, HH RLHF, OASST1, OASST2, MetaMathQA, CodeAlpaca Dataset, undisclosed dataset
DPO: HelpSteer, HelpSteer2, undisclosed dataset
rinnaLlama 3 Community License
Llama 3 ELYZA JP 8B
(8B, 8B-GGUF, 8B-AWQ)
Llama 3 (8b)undisclosedELYZALlama 3 Community License
Llama 3 neoAI 8B Chat v0.1
(8B-Chat-v0.1)
Llama 3 (8b)undisclosedneoAILlama 3 Community License
Llama 3 tedllm
(v0)
Llama 3 (8b)Pre-training: Japanese generic corpusTokyo Electron DeviceLlama 3 Community License
Swallow 7B
(7b-hf, 7b-instruct-hf, 7b-instruct-v0.1, 7b-NVE-hf, 7b-NVE-instruct-hf, 7b-plus-hf)
Llama 2 (7b)Pre-training: Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
*v0.1: OASST1, OASST2
Swallow ProjectLlama 2 Community License
LEIA-Swallow-7B
(7b)
Llama 2 (7b)additionally trained Swallow 7B using LEIAIndividual (Ikuya Yamada, Ryokan Ri)Llama 2 Community License
ELYZA-japanese-Llama-2-7b
(7b, 7b-instruct, 7b-fast, 7b-fast-instruct)
Llama 2 (7b)Pre-training: Japanese Wikipedia, Japanese OSCAR, and other crawled data
(18B tokens)
Instruction Tuning: undisclosed dataset
ELYZALlama 2 Community License
Youri 7B
(7b, 7b-instruction, 7b-chat, 7b-gptq, 7b-instruction-gptq, 7b-chat-gptq)
Llama 2 (7b)Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(40B tokens)
Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset
rinnaLlama 2 Community License
houou-7b
(instruction-7b-v1, instruction-7b-v2, instruction-7b-v3)
Llama 2 (7b)Instruction-tuned Youri 7B (base) on ichikara-instructionMoneyForwardLlama 2 Community License
Japanese Stable LM Beta 7B
(base-beta-7b, base-ja_vocab-beta-7b, instruct-beta-7b, instruct-ja_vocab-beta-7b)
Llama 2 (7b)Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, OASST1
Stability AILlama 2 Community License
SambaLingo-Japanese
(Base, Chat)
Llama 2 (7b)Pre-training: CulturaX
Instruction Tuning: ultrachat_200k
DPO: ultrafeedback, cai-conversation-harmless
SambaNova SystemsLlama 2 Community License (?)[9:1]
blue-lizard
(blue-lizard)
Llama 2 (7b)undisclosedDeepreneurLlama 2 Community License
Swallow-MS 7B
(7b-v0.1, 7b-instruct-v0.1)
Mistral-7B-v0.1 (7b)Pre-training: Algebraic Stack, Japanese Wikipedia, RefinedWeb, Swallow Corpus, The Pile
Instruction Tuning: Dolly Dataset, OASST1
Swallow ProjectApache 2.0
RakutenAI-7B
(7B, 7B-instruct, 7B-chat)
Mistral-7B-v0.1 (7b)Pre-training: undisclosed
Instruction Tuning: Dolly Dataset, OASST1, datasets converted from the train split of NLU datasets (like jaster), undisclosed dataset
RakutenApache 2.0
Japanese Stable LM Gamma 7B
(base-gamma-7b, instruct-gamma-7b)
Mistral-7B-v0.1 (7b)Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset
Stability AIApache 2.0
ChatNTQ JA 7B
(7b-v1.0)
Mistral-7B-v0.1 (7b)Instruction-tuned Japanese Stable LM Gamma 7B (base) on their own datasetsNTQ SolutionApache 2.0
Shisa Gamma 7B
(7b-v1)
Mistral-7B-v0.1 (7b)Instruction-tuned Japanese Stable LM Gamma 7B (base) on ultra-orca-boros-en-jaAUGMXNTApache 2.0 (?)[9:2]
Shisa 7B
(base-7b-v1, 7b-v1)
Mistral-7B-v0.1 (7b)Pre-training: shisa-pretrain-en-ja-v1 (8B tokens)
Instruction Tuning & DPO: ultra-orca-boros-en-ja, shisa-en-ja-dpo-v1
AUGMXNTApache 2.0 (?)[9:3]
Karasu
(7B, 7B-chat, 7B-chat-plus, 7B-chat-plus-unleashed)
Mistral-7B-v0.1 (7b)Additionally trained Shisa 7B (base) on Aozora Bunko, Japanese Law Precedent Dataset, Japanese Wikipedia, Japanese domain webscrapes from the Japanese subset of CulturaX, UltraChat 200k
(7B tokens)
Instruction Tuning: ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed dataset
LightblueApache 2.0 (?)[9:4]
Nekomata 7B
(7b, 7b-instruction, 7b-gguf, 7b-instruction-gguf)
Qwen (7b)Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(66B tokens)
Instruction Tuning: Dolly Dataset, FLAN, subsets of llm-japanese-dataset
rinnaTongyi Qianwen LICENSE
lightblue/japanese-mpt-7bMPT (7b)Japanese mC4LightblueApache 2.0
Japanese Stable LM 3B-4E1T
(3b-4e1t-base, 3b-4e1t-instruct)
StableLM-3B-4E1T (3b)Pre-training: Wikipedia, Japanese mC4, Japanese CC-100, Japanese OSCAR, SlimPajama (excluding Books3)
(100B tokens)
Instruction Tuning: Dolly Dataset, HH RLHF, wikinews subset of llm-japanese-dataset
Stability AIApache 2.0
kotomamba-2.8B-CLmamba-2.8b-slimpj
(2.8b)
Japanese Wikipedia, Swallow Corpus, SlimPajamaKotoba TechnologiesApache 2.0
Gemma 2 Baku 2B
(2b, 2b-it)
Gemma 2 (2b)Pre-training: Wikipedia, Japanese C4, Japanese CC-100, Japanese OSCAR, The Pile, undisclosed dataset
(80B tokens)
OPRO: undisclosed dataset [10]
rinnaGemma Terms of Use
Japanese Stable LM 2 1.6B
(base, instruct)
Stable LM 2 1.6B (1.6b)Pre-training: Wikipedia, CulturaX
Instruction Tuning: jaster, ichikara-instruction, alpaca-gpt4-japanese, ultra-orca-boros-en-ja-v1
Stability AISTABILITY AI NON-COMMERCIAL RESEARCH COMMUNITY LICENSE
karasu-1.1BTinyLlama (1.1b)Pre-training: Japanese OSCAR, Japanese mC4
(3B tokens)
LightblueApache 2.0

Domain specific

Model | Domain | Base Model | Developer | License
Llama3-Preferred-MedSwallow-70B (70B) | Medicine | Llama 3 (70b) | Preferred Networks | Llama 3 Community License
AIgroup-CVM-utokyohospital/MedSwallow-70b | Medicine | Llama 2 (70b) | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | CC BY-NC-SA 4.0
nekomata-14b-pfn-qfin (qfin, qfin-inst-merge) | Finance | Qwen (14b) | Preferred Networks | Tongyi Qianwen LICENSE
Watashiha-Llama-2-13B-Ogiri-sft (sft, sft-neuron) | Oogiri (comedy) | Llama 2 (13b) | Watashiha | Llama 2 Community License
ELYZA-japanese-CodeLlama-7b (7b, 7b-instruct) | Coding | Code Llama (7b) | ELYZA | Llama 2 Community License
AIBunCho/japanese-novel-gpt-j-6b | Storytelling | GPT-J (6b) | Individual (Hiroyuki Osone) | CreativeML OpenRAIL-M License
NovelAI/genji-jp | Storytelling | GPT-J (6b) | NovelAI |

Models built off non-Japanese LLMs (w/ post-training on Japanese)

General purpose

Model | Base Model | Training Data | Developer | License / Terms of Use
AXCXEPT/EZO-Qwen2.5-72B-Instruct
AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-72B-Instruct_q4
Qwen2.5 (72b)AxcxeptQwen License
ao-Karasu
(72B)
Qwen1.5 (72b)ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, Japanese technical blogs, news stories, QA site answers, undisclosed datasetLightblueTongyi Qianwen LICENSE (?)[9:5]
AXCXEPT/Llama-3.1-70B-EZO-1.1-itLlama 3.1 (70b)AxcxeptLlama 3.1 Community License
Llama 3 shisa-v1-llama3-70b
(70b)
Llama 3 (70b)ultra-orca-boros-en-ja-v1Shisa.AILlama 3 Community License (?)[9:6]
AIgroup-CVM-utokyohospital/Llama-2-70b-chat-4bit-japaneseLlama 2 (70b)University of Tokyo Hospital Department of Cardiovascular Medicine AI GroupLlama 2 Community License
doshisha-mil/llama-2-70b-chat-4bit-japanese-v1Llama 2 (70b)Doshisha University Media Informatics Lab
AXCXEPT/EZO-Qwen2.5-32B-Instruct
AXCXEPT/EZO-AutoCoTRAG-Qwen2.5-32B-Instruct
Qwen2.5 (32b)AxcxeptApache 2.0
Qarasu
(14B-chat-plus-unleashed)
Qwen (14b)ultra-orca-boros-en-ja-v1, OASST1, ShareGPT, undisclosed datasetLightblueTongyi Qianwen LICENSE (?)[9:7]
Sparticle/llama-2-13b-chat-japanese-loraLlama 2 (13b)Sparticle
izumi-lab/llama-13b-japanese-lora-v0-1epLlama (13b)University of Tokyo Izumi Lab
AXCXEPT/EZO-Common-9B-gemma-2-itGemma 2 (9b)AxcxeptGemma Terms of Use
AXCXEPT/EZO-Humanities-9B-gemma-2-itGemma 2 (9b)AxcxeptGemma Terms of Use
AXCXEPT/Llama-3.1-8B-EZO-1.1-itLlama 3.1 (8b)AxcxeptLlama 3.1 Community License
Llama 3 Suzume 8B
(8B-japanese, 8B-japanese-gguf)
Llama 3 (8b)megagonlabs/instruction_ja, ShareGPT, undisclosed datasetLightblueLlama 3 Community License (?)[9:8]
Llama 3 shisa-v1-llama3-8b
(8b)
Llama 3 (8b)ultra-orca-boros-en-ja-v1Shisa.AILlama 3 Community License (?)[9:9]
AXCXEPT/Llama-3-EZO-8b-Common-itLlama 3 (8b)AxcxeptLlama 3 Community License
ganchengguang/Yoko-7B-Japanese-v1Llama 2 (7b)Yokohama National University Mori Lab
Sparticle/llama-2-7b-chat-japanese-loraLlama 2 (7b)Sparticle
izumi-lab/llama-7b-japanese-lora-v0-5epLlama (7b)University of Tokyo Izumi Lab
lightblue/jodMistral-7B-SlimOrca (7b)LightblueApache 2.0
NTQAI/chatntq-7b-jpntunedRWKV-4 World (7b)NTQ Solution
Borea
(Jp, Common, Coding)
Phi-3.5 (3.8b)AxcxeptMIT
AXCXEPT/EZO-Llama-3.2-3B-Instruct-dpoELlama 3.2 (3b)AxcxeptLlama 3.2 Community License
Gemma-2-JPN
(2b-jpn-it)
Gemma 2 (2b)GoogleGemma Terms of Use
AXCXEPT/EZO-gemma-2-2b-jpn-itGemma 2 (2b)AxcxeptGemma Terms of Use
AXCXEPT/EZO-Common-T2-2B-gemma-2-itGemma 2 (2b)AxcxeptGemma Terms of Use

Domain specific

Model | Domain | Base Model | Developer | License
JMedLoRA (llama2-jmedlora-6.89ep) | Medicine | Llama 2 (70b) | University of Tokyo Hospital Department of Cardiovascular Medicine AI Group | CC BY-NC 4.0

Merged models

Model | Original Models (Japanese LLMs in bold) | Developer | License
EQUES/MedLLama3-JP-v2 | Llama 3 Swallow 8B (Instruct), OpenBioLLM-8B, MMed-Llama 3 8B, Llama 3 ELYZA JP 8B | EQUES | Llama 3 Community License
EvoLLM-JP-A (v1-7B) | Shisa Gamma 7B (v1), Arithmo2 Mistral 7B, Abel 7B 002 | Sakana AI | Apache 2.0
EvoLLM-JP (v1-7B, v1-10B) | Shisa Gamma 7B (v1), WizardMath-7B-V1.1, Abel 7B 002 | Sakana AI | MICROSOFT RESEARCH LICENSE

API-based models

Model | Max Context Length | Developer | Platform
Solar mini chat ja (solar-1-mini-chat-ja) | 32,768 | Upstage | self-owned
AI Novelist | 2,400 to 8,192 | Bit192 | self-owned
LHTM-OPT | | alt Inc. | AWS Marketplace

Encoder models

General purpose

Model | Architecture | Training Data | Developer | License | HuggingFace? [11]
KyotoUniBERTBERT (base, large)Japanese Wikipedia (18M articles)Kyoto University Language Media Processing LabApache 2.0
TohokuUniversityBERT | BERT (base, large) | base (v1): Japanese Wikipedia (17M articles / 2.6GB); base (v2) & large: Japanese Wikipedia (4.0GB); base (v3) & large (v2): Japanese Wikipedia (4.9GB), Japanese CC‑100 (74.3GB) | Tohoku University NLP Group | base (v1, v2) & large: CC BY‑SA 3.0; base (v3) & large (v2): Apache 2.0 | ◯ (base (v1), base (v1, char-level), base (v2), base (v2, char-level), large, large (char-level), base (v3), base (v3, char-level), large (v2), large (v2, char-level))
NICT BERTBERT (base)Japanese WikipediaNICTCC BY 4.0
Laboro BERTBERT (base, large)Japanese Web Corpus
(News and blogs, etc) (12GB)
Laboro.AICC BY‑NC 4.0
colorfulscoop BERTBERT (base)Japanese WikipediaColorful ScoopCC BY‑SA 3.0
UniversityOfTokyoBERTBERT (small)Japanese Wikipedia (2.9GB)University of Tokyo Izumi LabCC BY‑SA 4.0
chiTra (Sudachi Transformers)BERT (base)NINJAL Web Japanese Corpus (148GB)NINJAL, WAP Tokushima Laboratory of AI and NLPApache 2.0
ACCMS BERTBERT (base)Japanese Wikipedia (3.3GB)Kyoto University ACCMSCC BY‑SA 4.0
HitachiBERTBERT (base)Japanese Wikipedia, Japanese CC‑100HitachiCC BY‑NC‑SA 4.0[12]
RetrievaBERTBERT [13]Japanese CommonCrawl, RefinedWeb, Chinese Wikipedia, Korean Wikipedia, The StackRetrievaApache 2.0
Bandai Namco DistilBERTDistilBERT(Distillation of TohokuUniversityBERT(base))Bandai Namco ResearchMIT
Laboro DistilBERTDistilBERT(Distillation of Laboro BERT(base))Laboro.AICC BY‑NC 4.0
LINE DistilBERTDistilBERT(Distillation of LINE internal BERT model)LINEApache 2.0
rinna RoBERTaRoBERTa (base)Japanese Wikipedia, Japanese CC‑100rinnaMIT
WasedaRoBERTaRoBERTa (base, large)Japanese Wikipedia, Japanese CC‑100Waseda Kawahara LabCC BY‑SA 4.0
(base, large, large (seq512))[14]
InformatixRoBERTaRoBERTa (base)Japanese Wikipedia, Web Articles
(25GB)
InformatixApache 2.0
KyotoUniversityRoBERTaRoBERTa (base, large)Japanese Wikipedia, Japanese CC‑100Kyoto University Language Media Processing LabCC BY‑SA 4.0
(base (char-level), large (char-level))
YokohamaNationalRoBERTaRoBERTa (base)Japanese Wikipedia (3.45GB)Yokohama National University Mori LabApache 2.0
Megagon Labs RoBERTaRoBERTa (base)[15]Japanese mC4 (200M sentences)Megagon Labs
(Recruit Co.,Ltd.)
MIT
ACCMS RoBERTaRoBERTa (base)Japanese Wikipedia (3.3GB) + Japanese CC‑100 (70GB)Kyoto University ACCMSCC BY‑SA 4.0
CinnamonELECTRAELECTRA (small)Japanese WikipediaCinnamonApache 2.0
Megagon Labs ELECTRAELECTRA (base)Japanese mC4 (200M sentences)Megagon Labs
(Recruit Co.,Ltd.)
MIT
UniversityOfTokyoELECTRAELECTRA (small, base)Japanese Wikipedia (2.9GB)University of Tokyo Izumi LabCC BY‑SA 4.0
(small, base)
JapaneseRoFormerRoFormer (base)Japanese Wikipedia (3.45GB)Yokohama National University Mori LabApache 2.0
JapaneseLUKELUKE (base, large)Japanese WikipediaStudio OusiaApache 2.0
(base, large)
KyotoUniversityDeBERTaV2DeBERTaV2 (tiny, base, large)Japanese Wikipedia, Japanese CC‑100, Japanese OSCAR
(171GB)
Kyoto University Language Media Processing LabCC BY‑SA 4.0
(tiny, tiny (char-level), base, large)
KyotoUniversityDeBERTaV3DeBERTaV3 (base)llm-jp-corpusKyoto University Language Media Processing LabApache 2.0
UniversityOfTokyoDeBERTaV2DeBERTaV2 (small, base)Japanese Wikipedia, Japanese Wikinews, Japanese CC-100, Japanese mC4, Japanese OSCARUniversity of Tokyo Izumi LabCC BY-SA 4.0◯ (small, base)
GLOBIS DeBERTaV3DeBERTaV3 (xsmall, base, large)Wikipedia, WikiBooks, Aozora Bunko, Japanese CC-100, Japanese mC4, Japanese OSCARGLOBISCC BY-SA 4.0◯ (xsmall, base, large)
JapaneseBigBirdBigBird (base)Japanese Wikipedia, Japanese CC‑100, Japanese OSCARWaseda Kawahara LabCC BY‑SA 4.0
JapaneseLayoutLMLayoutLM (base)Pre-trained on Japanese Wikipedia, initialized with TohokuUniversityBERTThe Japan Research Institute, LimitedCC BY-SA 3.0

Domain Specific

Model | Domain | Architecture | Training Data | Developer | License | HuggingFace?
JapaneseNewsBERTBusinessBERT (base)Japanese Business Articles (3M articles)StockmarkCC BY 4.0
JapaneseNewsXLNetBusinessXLNet (base)Japanese Business Articles (3M articles)Stockmark
* Unofficial release
JapaneseNewsALBERTBusinessALBERT (base)Japanese Business Articles (3M articles)Stockmark
JapaneseBlogELECTRAColloquial languageELECTRA (small)Japanese Blog Corpus (354M sentences)Kitami Institute of Technology Masui-Ptaszynski LabCC BY‑SA 4.0
JapaneseSpokenLanguageBERTSpoken languageBERT (base)Additional training for TohokuUniversityBERT using Corpus of Spontaneous Japanese (CSJ)
(the DAPT model additionally uses National Diet meeting records)
RetrievaApache 2.0
JapaneseFinancialBERTFinanceBERT (small, base)[16]Japanese Wikipedia, Japanese Financial Corpus (27M sentences/5.2GB)University of Tokyo Izumi LabCC BY‑SA 4.0
(small, base)
JapaneseFinancialELECTRAFinanceELECTRA (small)Japanese Wikipedia (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB)University of Tokyo Izumi LabCC BY‑SA 4.0
UTH-BERTMedicineBERT (base)Japanese Medical Records (120M lines)University of Tokyo Hospital
Medical AI Development Course
CC BY‑NC‑SA 4.0
medBERTjpMedicineBERT (base)Japanese Wikipedia, Japanese Medical Corpus ("今日の診療プレミアム/Today's Care Premium" Web Version)Osaka University Hospital
Medical Informatics Lab
CC BY‑NC‑SA 4.0
JMedRoBERTaMedicineRoBERTa (base)Japanese Medical Papers (11M sentences/1.8GB)NII Aizawa LabCC BY‑NC‑SA 4.0
(ManbyoWordPiece, SentencePiece)[17]
AcademicRoBERTaScienceRoBERTa (base)CiNii Japanese Papers (6.3M sentences)Ehime University AI LabApache 2.0
MinpakuBERTCultural HeritageBERT (base)Additional training with National Museum of Ethnology's cultural heritage data on top of Tohoku University BERTUniversity of Hyogo Ohshima LabMIT◯ (minpaku-v1, minpaku-v3, minpaku-v3-no-additional-token)
local-politics-BERTPoliticsBERT (base)Wikipedia, Minutes of the National Diet, Minutes of the Local AssemblyJapanese Local Assembly Minutes Corpus ProjectCC BY-SA 4.0◯ (SC-min, SC-minwiki, SC-2M-wiki, SC-2M-min, SC-2M-minwiki, FP-min, FP-minwiki) [18]

Sentence and Document Embeddings [19]

Bi-Encoders

Single-representation bi-encoders

Model | Max Context Length | Developer | License
RoSEtta
(pkshatech/RoSEtta-base-ja)
1,024PKSHA TechnologyApache 2.0
GLuCoSE v2
(pkshatech/GLuCoSE-base-ja-v2)
512PKSHA TechnologyApache 2.0
Ruri
(cl-nagoya/ruri-pt-small, cl-nagoya/ruri-pt-base, cl-nagoya/ruri-pt-large, cl-nagoya/ruri-small, cl-nagoya/ruri-base, cl-nagoya/ruri-large)
512Nagoya University Sasano GroupApache 2.0
Japanese SimCSE
(cl-nagoya/unsup-simcse-ja-base, cl-nagoya/unsup-simcse-ja-large, cl-nagoya/sup-simcse-ja-base, cl-nagoya/sup-simcse-ja-large)
512Nagoya University Sasano GroupCC BY-SA 4.0
GLuCoSE
(pkshatech/GLuCoSE-base-ja)
512PKSHA TechnologyApache 2.0
colorfulscoop/sbert-base-jaColorful ScoopCC BY‑SA 4.0
MU-Kindai/SBERT-JSNLI-base
MU-Kindai/SBERT-JSNLI-large
Kindai University
MU-Kindai/Japanese-SimCSE-BERT-base-unsup
MU-Kindai/Japanese-SimCSE-BERT-large-unsup
MU-Kindai/Japanese-SimCSE-RoBERTa-base-unsup
MU-Kindai/Japanese-SimCSE-BERT-base-sup
MU-Kindai/Japanese-SimCSE-BERT-large-sup
Kindai UniversityMIT
pkshatech/simcse-ja-bert-base-clcmlpPKSHA TechnologyCC BY‑SA 4.0
MU-Kindai/Japanese-MixCSE-BERT-base
MU-Kindai/Japanese-MixCSE-BERT-large
Kindai UniversityMIT
MU-Kindai/Japanese-DiffCSE-BERT-baseKindai UniversityMIT
bclavie/fio-base-japanese-v0.1Individual (Benjamin Clavié)
cl-nagoya/shioriha-large-ptNagoya University Sasano Group

Multi-representation bi-encoders

DeveloperLicense
JaColBERTv2.5
(JaColBERTv2.4, JaColBERTv2.5)
Answer.AIMIT
JaColBERTv2
(JaColBERTv2)
Individual (Benjamin Clavié)MIT
JaColBERT
(JaColBERT)
Individual (Benjamin Clavié)MIT

Cross-Encoders

DeveloperLicense
Ruri-Reranker
(cl-nagoya/ruri-reranker-stage1-small, cl-nagoya/ruri-reranker-stage1-base, cl-nagoya/ruri-reranker-stage1-large, cl-nagoya/ruri-reranker-small, cl-nagoya/ruri-reranker-base, cl-nagoya/ruri-reranker-large)
Nagoya University Sasano GroupApache 2.0
hotchpotch/japanese-reranker-cross-encoder-xsmall-v1
hotchpotch/japanese-reranker-cross-encoder-small-v1
hotchpotch/japanese-reranker-cross-encoder-base-v1
hotchpotch/japanese-reranker-cross-encoder-large-v1
hotchpotch/japanese-bge-reranker-v2-m3-v1
Individual (Yuichi Tateno)MIT

Vision-Language Models

Text+Image to Text

Models built from scratch

General purpose

ArchitectureTraining DataDeveloperLicense
llava-calm2-siglip
(llava-calm2-siglip)
LLaVA-1.5conversational data generated from MS-COCO and Visual GenomeCyberAgentApache 2.0
Heron
(blip-ja-stablelm-base-7b-v0, blip-ja-stablelm-base-7b-v1, blip-ja-stablelm-base-7b-v1-llava-620k, git-ja-stablelm-base-7b-v0, git-ELYZA-fast-7b-v0, git-ja-stablelm-base-7b-v1)
BLIP-2 / GITv1: LLaVA-Instruct-150K-JA or LLaVA-Instruct-620K-JA
v0: LLaVA-Instruct-150K-JA, Japanese STAIR Captions, Japanese Visual Genome VQA dataset
TuringCC BY-NC 4.0
Japanese Stable VLM
(japanese-stable-vlm)
LLaVA-1.5Japanese CC12M, STAIR Captions, Japanese Visual Genome VQA datasetStability AISTABILITY AI JAPANESE STABLE VLM COMMUNITY LICENSE
Japanese InstructBLIP Alpha
(japanese-instructblip-alpha)
InstructBLIPJapanese CC12M, STAIR Captions, Japanese Visual Genome VQA datasetStability AIJAPANESE STABLELM RESEARCH LICENSE
rinna MiniGPT-4
(bilingual-gpt-neox-4b-minigpt4)
MiniGPT-4CC12M, COCO 2014, Visual Genome, STAIR Captions, Japanese Visual Genome VQA datasetrinnaMIT

Domain Specific

ArchitectureDomainDeveloperLicense
watashiha/Watashiha-Llama-2-13B-Ogiri-sft-vlmLLaVAOogiriWatashihaLlama 2 Community License

Models built off non-Japanese VLMs

Base ModelTraining DataDeveloperLicense
AXCXEPT/EZO-InternVL2-26BInternVL2-AxcxeptMIT

Merged models

Original Models (Japanese LLMs in bold)DeveloperLicense
Llama-3-EvoVLM-JP-v2
(v2)
Mantis-8B-SigLIP-Llama-3, Llama-3-ELYZA-JP-8B, Bunny-v1.1-Llama-3-8B-VSakana AILlama 3 Community License
AXCXEPT/Llama-3-EZO-VLM-1- (trained from Llama-3-EvoVLM-JP-v2)AxcxeptLlama 3 Community License
EvoVLM-JP
(v1-7B)
Shisa Gamma 7B (v1), LLaVA-1.6-Mistral-7BSakana AIApache 2.0

Text to Image

General Purpose

ArchitectureTraining DataDeveloperLicense
CommonArt β
(commonart-beta)
PixArt-ΣCommonCatalog-cc-by, Megalith-10M, Smithsonian Open Access, ArtBench (CC-0 only)AI PicassoApache 2.0
EvoSDXL-JP
(v1)
Stable Diffusion- (merged from several diffusion models, including Japanese Stable Diffusion XL)Sakana AIApache 2.0[20]
Japanese Stable Diffusion XL
(japanese-stable-diffusion-xl)
Stable DiffusionundisclosedStability AISTABILITY AI JAPANESE STABLE DIFFUSION XL COMMUNITY LICENSE
Tohoku University Stable Diffusion
(base, refiner)
Stable DiffusionWMT2023 Shared Task English-Japanese parallel corpus, about 13 million captions from laion2B-multiTohoku University NLP GroupCreativeML OpenRAIL-M License
rinna Stable Diffusion
(japanese-stable-diffusion)
Stable DiffusionLAION-5B Japanese Subset (100M images)rinnaCreativeML OpenRAIL-M License

Domain Specific

ArchitectureDomainDeveloperLicense
Evo-Nishikie
(v1)
Stable Diffusion (ControlNet)Ukiyo-eSakana AIApache 2.0[20:1]
Evo-Ukiyoe
(v1)
Stable DiffusionUkiyo-eSakana AIApache 2.0[20:2]

Others

ArchitectureTraining DataDeveloperLicense
LY CLIP
(clip-japanese-base)
CLIPCommonCrawl, CC12M, YFCC100MLY Corp.Apache 2.0
Recruit CLIP
(japanese-clip-vit-b-32-roberta-base)
CLIPabout 120 million captions from laion2B-multiRecruit Co., Ltd.CC BY 4.0
Japanese Stable CLIP
(japanese-stable-clip-vit-l-16)
SigLIPCC12M translated to Japanese, STAIR CaptionsStability AISTABILITY AI JAPANESE STABLE CLIP COMMUNITY LICENSE
rinna CLIP
(japanese-clip-vit-b-16)
CLIPCC12M translated to JapaneserinnaApache 2.0
rinna CLOOB
(japanese-cloob-vit-b-16)
CLOOBCC12M translated to JapaneserinnaApache 2.0
HAKUHODO Technologies CLIP
(base, deeper, wider)
CLIPabout 120 million captions from laion2B-multiHAKUHODO TechnologiesCC BY-NC-SA 4.0

Speech-Language Models

Automatic Speech Recognition

ArchitectureTraining DataDeveloperLicense
Kotoba-Whisper
(v1.0, v1.0-ggml, v1.0-faster, v1.1, bilingual-v1.0, bilingual-v1.0-ggml, bilingual-v1.0-faster, v2.0, v2.0-ggml, v2.0-faster, v2.1)
Distil-WhisperReazonSpeechKotoba TechnologiesApache 2.0
Nue ASR
(nue-asr)
Nue ASR
(HuBERT + LLM)
ReazonSpeechrinnaApache 2.0
ReazonSpeech
(espnet-v1, espnet-next, espnet-v2, nemo-v2)
ESPnet (Conformer-Transducer) / NeMo (FastConformer-RNNT)ReazonSpeechReazon HoldingsApache 2.0

Others

ArchitectureTraining DataDeveloperLicense
Kotoba-Speech
(v0.1)
TransformerundisclosedKotoba TechnologiesApache 2.0
UniversityOfTokyoHuBERT
(base-jtube)
HuBERTJTubeSpeechUniversity of Tokyo
Saruwatari & Takamichi Lab
MIT
rinna HuBERT
(base, large)
HuBERTReazonSpeechrinnaApache 2.0

Evaluation Benchmarks for Japanese LLMs

Hybrid Benchmarks

DescriptionDeveloper
Nejumi LLM Leaderboard3Evaluates the Japanese language capabilities of LLMs from three perspectives: language understanding ability, application ability, and alignment (including controllability and safety). For more details, see this article.Weights & Biases
Japanese LLM EvaluationConducts a comprehensive evaluation of various LLMs based on three types of tasks: Japanese language understanding and generation tasks, Japanese multi-turn dialogue tasks, and English language understanding and generation tasks. Also publishes swallow-evaluation, an evaluation script that integrates and improves existing LLM evaluation tools.Swallow Project

Traditional Benchmarks based on Natural Language Understanding tasks

DescriptionDeveloper
llm-jp-evalA tool that evaluates Japanese LLMs automatically across multiple datasets.
The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE).
Evaluation results are compiled on the llm-jp-eval leaderboard.
LLM-jp
JP Language Model Evaluation HarnessA fork by Stability AI of EleutherAI/lm-evaluation-harness. It is a tool for automatically evaluating Japanese LLMs across multiple datasets.
The complete list of supported datasets can be found here (which also includes tasks such as JNLI and JCommonsenseQA from JGLUE).
There is a detailed summary of the evaluation results by rinna: [rinna] Benchmark of Stability-AI/lm-evaluation-harness
Stability AI
JGLUEJapanese version of the GLUE benchmark suite, including the MARC-ja, JCoLA, JSTS, JNLI, JSQuAD, and JCommonsenseQA tasks. JCoLA is by the University of Tokyo's Oseki Lab. See here and here (ja only) for further details about each task.Waseda University Kawahara Lab and Yahoo
JMMLUA benchmark constructed as a Japanese version of the MMLU Benchmark, consisting of multiple-choice questions from a wide range of academic fields including natural sciences, humanities, and social sciences. In addition to translating the original MMLU, it features newly added problems based on the unique cultural background of Japan (Japan-specific problems).Waseda University Kawahara Lab
Japanese Open LLM LeaderboardSimilar to Hugging Face's Open LLM Leaderboard, this leaderboard tracks Japanese LLMs, letting you check how Japanese LLMs perform on English tasks.LLM-jp

Benchmarks on open-ended generative tasks

DescriptionDeveloper
Japanese MT-benchThe Japanese version of MT-bench, which tests multi-turn conversational ability. It includes 80 questions, 10 each from 8 categories: Writing, Roleplay, Reasoning, Math, Coding, Extraction, STEM, Humanities. Some questions were modified to fit Japanese culture when the Japanese version was produced. It also includes a script that performs a 10-level absolute evaluation using GPT-4.Stability AI
Rakuda BenchmarkRanking based on model answers to 40 open-ended questions on Japanese geography, history, politics, and society. Uses GPT-4 to judge model outputs pairwise, and then ranks models by fitting a Maximum Likelihood Elo/Bradley-Terry model to GPT-4's preferences.YuzuAI
ELYZA-tasks-100Ranking based on model responses to 100 complex and diverse tasks, including tasks testing summarization, correction, abstraction, induction, and other skills. Uses humans to score the model responses and then ranks models based on their mean scores.ELYZA
Japanese Vicuna QA BenchmarkThis is the Japanese version of vicuna-blog-eval, which is the predecessor of MT-Bench. It includes 80 questions on general knowledge, role-playing, common sense, Fermi estimation, counterfactual thinking, coding, mathematics, and writing. It also includes a script for automatic evaluation by GPT-4 (win-rate calculation). The leaderboard can be found here.Kyoto University Language Media Processing Lab
Tengu-BenchIncludes 120 free-form questions from various categories. Categories of questions: table interpretation, logic puzzles, idea generation, function calling, long document summarization (over a thousand tokens), conversation summarization, long document closed QA (over a thousand tokens), honorifics, project creation, math, translation, extraction, ethical control, cost estimation, Japan, chit-chat, puns, formatting, construction, business, legal judgment, politics, hypothetical questions.Lightblue
ShaberiA framework that can collectively evaluate the Japanese MT-bench, Rakuda Benchmark, ELYZA-tasks-100, and Tengu-Bench. There is also a fork by Shisa.AI.Lightblue
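The ranking approach used by the Rakuda Benchmark above (fitting a Bradley-Terry model by maximum likelihood to pairwise judge preferences) can be sketched as follows; the model names and win counts are invented for illustration, and the update used is the classic MM (minorization-maximization) iteration:

```python
from collections import defaultdict

def fit_bradley_terry(wins, n_iters=200):
    """Fit Bradley-Terry strengths with the MM update:
    pi_i <- W_i / sum_j (n_ij / (pi_i + pi_j)), normalized each step."""
    models = sorted({m for pair in wins for m in pair})
    pi = {m: 1.0 for m in models}
    n = defaultdict(int)  # games played per unordered pair
    w = defaultdict(int)  # total wins per model
    for (winner, loser), k in wins.items():
        n[frozenset((winner, loser))] += k
        w[winner] += k
    for _ in range(n_iters):
        new_pi = {}
        for i in models:
            denom = sum(
                n[frozenset((i, j))] / (pi[i] + pi[j])
                for j in models if j != i and n[frozenset((i, j))] > 0
            )
            new_pi[i] = w[i] / denom if denom > 0 else pi[i]
        s = sum(new_pi.values())
        pi = {m: v * len(models) / s for m, v in new_pi.items()}
    return pi

# Hypothetical pairwise judge preferences: (winner, loser) -> count
wins = {
    ("model-A", "model-B"): 30, ("model-B", "model-A"): 10,
    ("model-B", "model-C"): 25, ("model-C", "model-B"): 15,
    ("model-A", "model-C"): 35, ("model-C", "model-A"): 5,
}
strengths = fit_bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)  # strongest model first
```

The fitted strengths are equivalent (up to a log transform) to Elo-style ratings, which is why the two are often mentioned interchangeably.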

Benchmarks for measuring performance in specific domains

DescriptionDeveloper
Japanese Language Model Financial Evaluation HarnessA benchmark for Japanese LLMs in the financial sector. It includes tasks such as sentiment analysis in finance (chabsa), basic knowledge tasks in securities analysis (cma_basics), tasks related to audits in certified public accountant examinations (cpa_audit), multiple-choice question tasks in financial planner exams (fp2), and mock exam tasks for securities salespeople exams (security_sales_1). For more details, please see here.Preferred Networks
pfmt-bench-fin-jaA benchmark for measuring the generation capabilities of Japanese LLMs in the financial domain.Preferred Networks
Stockmark Business QuestionsThe collection includes 50 questions that probe knowledge on topics such as market trends, current affairs, social issues, and business trends.Stockmark
JMED-LLMA dataset for evaluating LLMs in the Japanese medical domain. It compiles previously developed Japanese medical language processing tasks for LLM benchmarking.NAIST Social Computing Lab.
JMedBenchA benchmark for LLMs in the Japanese medical field. It includes 20 datasets in 5 types of tasks: multi-choice question-answering, machine translation, named entity recognition, document classification, and semantic textual similarity (some datasets are borrowed from JMMLU and JMED-LLM). A tool called med-eval is developed to facilitate evaluation on JMedBench.NII Aizawa Lab
Japanese Medical Language Model Evaluation HarnessA benchmark for evaluating Japanese LLMs in the medical domain in both Japanese and English, executable by a single command.Individual (Issey Sukeda)
karakuri-benchA dataset for measuring performance of Japanese LLMs in customer support.KARAKURI

Benchmarks for measuring factuality and safety

DescriptionDeveloper
JTruthfulQAThe Japanese version of TruthfulQA, a dataset for evaluating the factuality of LLMs. It includes questions about superstitions and other non-factual beliefs held by some people, as well as questions about Japan-specific knowledge, all collected from scratch.Waseda University Kawahara Lab
JCommonsenseMoralityA dataset on Japanese commonsense morality. Sentences describing actions are labeled with binary values indicating whether they are morally wrong or acceptable.Hokkaido University Language Media Lab
JBBQThe Japanese version of the social bias QA dataset BBQ, developed through translation, revision, and addition of questions based on Japanese culture and customs.University of Tokyo Yanaka Lab

Benchmarks for measuring logical reasoning capabilities

DescriptionDeveloper
JFLD (Japanese Formal Logic Deduction)A dataset for evaluating deductive reasoning capabilities of Japanese LLMs (the Japanese version of the FLD (Formal Logic Deduction) proposed by the same authors). It is characterized by being composed of counterfactual samples to evaluate apart from the knowledge the LLM possesses.Hitachi
JHumanEvalA Japanese version of the HumanEval benchmark, which assesses the ability to generate Python code from English instructions. In creating the Japanese version, the text was first machine-translated and then manually corrected.Japan Women's University Kuramitsu Lab
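HumanEval-style benchmarks such as JHumanEval score a model by executing its generated code against unit tests. A minimal sketch of that harness follows; the candidate solution and tests are invented, and a real harness would additionally sandbox and time-limit the execution:

```python
def passes_tests(candidate_src, test_src):
    """Execute a candidate solution, then its unit tests, in a shared
    namespace; return True only if every assertion passes."""
    ns = {}
    try:
        exec(candidate_src, ns)
        exec(test_src, ns)
        return True
    except Exception:
        return False

# A hypothetical model completion and its hidden tests:
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True
```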

Benchmarks on controlled text generation

DescriptionDeveloper
LCTG BenchA benchmark for the controllability of Japanese LLMs. It evaluates whether LLMs can adhere to constraints in four aspects: output format, character count, keywords, and forbidden words. The quality of the generated text is also evaluated.CyberAgent
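The kinds of constraints LCTG Bench evaluates can be illustrated with a small validator; this toy covers three of the four aspects (character count, keywords, forbidden words), and all constraint values and texts here are invented:

```python
def check_constraints(text, *, max_chars, keywords, forbidden):
    """Check generated text against simple controllability constraints:
    character count, required keywords, and forbidden words."""
    return {
        "char_count": len(text) <= max_chars,
        "keywords": all(k in text for k in keywords),
        "forbidden": not any(f in text for f in forbidden),
    }

result = check_constraints(
    "東京は日本の首都であり、世界有数の大都市です。",
    max_chars=50,
    keywords=["東京", "首都"],
    forbidden=["大阪"],
)
print(result)  # every value True: all constraints satisfied
```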

Benchmarks for embedding models

DescriptionDeveloper
JMTEBA benchmark developed as the Japanese version of MTEB. It consists of tasks such as document clustering, text classification, sentence similarity, sentence pair labeling prediction, and text retrieval (a reranking task was recently added).SB Intuitions
JQaRAA dataset for evaluating Japanese document retrieval and reranking accuracy. Each of the 1,667 questions is assigned 100 candidate documents, of which at least one can answer the question. The questions are taken from JAQKET, and the candidate documents are sourced from Japanese Wikipedia.Individual (Yuichi Tateno)
JaCWIRA dataset created for evaluating document retrieval and reranking in domains other than Wikipedia. Each of the 5,000 questions is assigned one Web page that serves as the source of the question and 99 unrelated Web pages.Individual (Yuichi Tateno)

Benchmarks for vision-language models

DescriptionDeveloper
JMMMUA benchmark constructed as the Japanese version of MMMU Benchmark. It consists of 720 translated MMMU problems and 600 new problems unique to Japanese culture.University of Tokyo Aizawa Lab
Heron VLM Leaderboard powered by Nejumi/WandBSummarizes the evaluation results of Japanese-Heron-Bench and LLaVA-Bench-In-the-Wild (Japanese).Turing, Weights & Biases
Japanese-Heron-Bench21 images are assigned a total of 102 questions. It is characterized by image-question pairs that require knowledge related to Japan.Turing
JA-VLM-Bench-In-the-WildA dataset independently prepared by Sakana AI to evaluate EvoVLM-JP-v1-7B. It consists of 50 questions assigned to 42 images. It is characterized by images and questions that require knowledge about Japan.Sakana AI
JA-Multi-Image-VQAA dataset for evaluating the question-answering ability in Japanese for multiple images.Sakana AI
LLaVA-Bench-In-the-Wild (Japanese)This is the Japanese version of LLaVA-Bench-In-the-Wild, translated using DeepL. It consists of 60 questions assigned to 24 images.Turing
LLaVA-Bench (COCO) JapaneseThis is the Japanese version, translated by DeepL, of the LLaVA-Bench (COCO) dataset used to evaluate LLaVA. It consists of 30 images, each with 3 types of questions assigned to them.Turing

References for Models and Architectures

Transformer2017.06.12NIPS(NeurIPS) 2017Attention Is All You Need
GPT2018.06.11-Improving Language Understanding by Generative Pre-Training
BERT2018.10.11NAACL 2019BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GPT-22019.02.14-Language Models are Unsupervised Multitask Learners
XLNet2019.06.19NeurIPS 2019XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa2019.07.26-RoBERTa: A Robustly Optimized BERT Pretraining Approach
Sentence-BERT2019.08.27EMNLP-IJCNLP 2019Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
ALBERT2019.09.26ICLR 2020ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
DistilBERT2019.10.02EMC2 Workshop at NeurIPS 2019DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
T52019.10.23JMLR 2020Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
BART2019.10.29ACL 2020BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
LayoutLM2019.12.31KDD 2020LayoutLM: Pre-training of Text and Layout for Document Image Understanding
ELECTRA2020.03.23ICLR 2020ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
ColBERT2020.04.27SIGIR 2020ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
Conformer2020.05.16INTERSPEECH 2020Conformer: Convolution-augmented Transformer for Speech Recognition
GPT-32020.05.28NeurIPS 2020Language Models are Few-Shot Learners
DeBERTa2020.06.05ICLR 2021DeBERTa: Decoding-enhanced BERT with Disentangled Attention
BigBird2020.07.28NeurIPS 2020Big Bird: Transformers for Longer Sequences
LUKE2020.10.02EMNLP 2020LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
CLIP2021.02.26ICML 2021Learning Transferable Visual Models From Natural Language Supervision
SimCSE2021.04.18EMNLP 2021SimCSE: Simple Contrastive Learning of Sentence Embeddings
RoFormer2021.04.20-RoFormer: Enhanced Transformer with Rotary Position Embedding
HuBERT2021.06.14TASLP 2021HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
CLOOB2021.10.21NeurIPS 2022CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP
DeBERTaV32021.11.18ICLR 2023DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
ColBERTv22021.12.02NAACL 2022ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
Stable Diffusion2021.12.20CVPR 2022High-Resolution Image Synthesis With Latent Diffusion Models
BLIP2022.01.28ICML 2022BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
MixCSE2022.02.22AAAI 2022Unsupervised Sentence Representation via Contrastive Learning with Mixing Negatives
InstructGPT2022.03.04NeurIPS 2022Training language models to follow instructions with human feedback
GPT-NeoX2022.04.14BigScience Research Workshop at ACL 2022GPT-NeoX-20B: An Open-Source Autoregressive Language Model
DiffCSE2022.04.21NAACL 2022DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings
GIT2022.05.27TMLR 2022GIT: A Generative Image-to-text Transformer for Vision and Language
Whisper2022.12.06ICML 2023Robust Speech Recognition via Large-Scale Weak Supervision
BLIP-22023.01.30ICML 2023BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
ControlNet2023.02.10ICCV 2023Adding Conditional Control to Text-to-Image Diffusion Models
Llama2023.02.27-LLaMA: Open and Efficient Foundation Language Models
GPT-42023.03.15-GPT-4 Technical Report
SigLIP2023.03.27ICCV 2023Sigmoid Loss for Language Image Pre-Training
LLaVA2023.04.17NeurIPS 2023Visual Instruction Tuning
MiniGPT-42023.04.20-MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Fast Conformer2023.05.08ASRU 2023Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition
InstructBLIP2023.05.11NeurIPS 2023InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
RWKV2023.05.22EMNLP 2023 (Findings)RWKV: Reinventing RNNs for the Transformer Era
RetNet2023.07.17-Retentive Network: A Successor to Transformer for Large Language Models
Llama 22023.07.18-Llama 2: Open Foundation and Fine-Tuned Chat Models
Code Llama2023.08.24-Code Llama: Open Foundation Models for Code
Qwen2023.09.28-Qwen Technical Report
PixArt-α2023.09.30ICLR 2024PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
LLaVA-1.52023.10.05CVPR 2024Improved Baselines with Visual Instruction Tuning
Mistral 7B2023.10.10-Mistral 7B
Distil-Whisper2023.11.01-Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling
Mamba2023.12.01COLM 2024Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Nue ASR2023.12.06ACL 2024 (Findings)Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition
InternVL2023.12.21CVPR 2024InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
TinyLlama2024.01.04-TinyLlama: An Open-Source Small Language Model
Mixtral2024.01.08-Mixtral of Experts
PIXART-δ2024.01.10-PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models
LEIA2024.02.18ACL 2024 (Findings)LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation
PixArt-Σ2024.03.07-PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Gemma2024.03.13-Gemma: Open Models Based on Gemini Research and Technology
EvoLLM-JP, EvoVLM-JP2024.03.19-Evolutionary Optimization of Model Merging Recipes
RakutenAI-7B2024.03.21-RakutenAI-7B: Extending Large Language Models for Japanese
rinna GPT, rinna RoBERTa, Nekomata, Youri, etc.2024.04.02LREC-COLING 2024Release of Pre-Trained Models for the Japanese Language
SambaLingo-Japanese2024.04.08-SambaLingo: Teaching Large Language Models New Languages
Heron2024.04.11-Heron-Bench: A Benchmark for Evaluating Vision Language Models in Japanese
Stockmark-13b2024.04.12-Pretraining and Updating Language- and Domain-specific Large Language Model: A Case Study in Japanese Business Domain
Phi-32024.04.22-Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
InternVL 1.52024.04.25-How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Swallow2024.04.27COLM 2024Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities
LLM-jp-13B2024.07.04-LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
Llama 3.12024.07.23-The Llama 3 Herd of Models
Gemma 22024.07.31-Gemma 2: Improving Open Language Models at a Practical Size
PLaMo-100B2024.10.10-PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency

References for Training Methods

PPO (RLHF)2017.07.20-Proximal Policy Optimization Algorithms
Instruction Tuning
(Supervised Fine-tuning; SFT)
2021.09.03ICLR 2022Finetuned Language Models Are Zero-Shot Learners
DPO2023.05.29NeurIPS 2023Direct Preference Optimization: Your Language Model is Secretly a Reward Model
SteerLM2023.10.09EMNLP 2023 (Findings)SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF
ORPO2024.03.12EMNLP 2024ORPO: Monolithic Preference Optimization without Reference Model

Our Contributors

We love contributors! Feel free to contribute to this project.

contributors

Citation

The summary of this repository is also published as a preprint: Exploring Open Large Language Models for the Japanese Language: A Practical Guide

When referencing this repository, please cite as follows:

@article{awesomeJapanese2024,
    title={{Exploring Open Large Language Models for the Japanese Language: A Practical Guide}},
    author={Kaito Sugimoto},
    doi={10.51094/jxiv.682},
    journal={Jxiv preprint},
    year={2024}
}

  1. Some architectural changes have been made. For details, refer to: Pre-training of "PLaMo-100B", a proprietary LLM at the 100-billion-parameter scale (in Japanese) ↩︎

  2. Refer to the following articles (in Japanese): The positioning and development policy of the large language models Tanuki-8B and 8x8B, Notes on pre- and post-training strategies for developing large language models, particularly regarding synthetic data ↩︎ ↩︎

  3. Some performance enhancements have been made to the original Llama model. See here for details. ↩︎

  4. Details have not been made public but the private dataset includes data from the EleutherAI Polyglot project's Japanese team and from members of Stable Community Japan. ↩︎

  5. This project conducted evaluation research on using right-to-left generation instead of the usual left-to-right generation, releasing both left-to-right and right-to-left models. ↩︎

  6. Before conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added. ↩︎ ↩︎

  7. After conducting Instruction Tuning, a Chat Vector between Llama 3 Instruct and Llama 3 Base is added. ↩︎ ↩︎

  8. However, if commercial use of KARAKURI LM is desired, direct contact with the developer, KARAKURI Inc., is required. ↩︎

  9. Because Instruction Tuning uses data generated by OpenAI models such as GPT-3.5 and GPT-4, the model may violate OpenAI's terms of use. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  10. Before conducting Instruction Tuning, a Chat Vector between Gemma 2 Instruct and Gemma 2 Base is added. ↩︎

  11. ○: The model is on the HuggingFace Model Hub and can be loaded with AutoModel.from_pretrained(). △: The model is not on the Model Hub but can be loaded manually with the HuggingFace transformers library. ✕: The model is not directly loadable with HuggingFace. ↩︎

  12. This project conducted evaluation research on pre-tokenization morphological analysis and released their best performing model, which used Juman++ and BPE. ↩︎

  13. However, the maximum sequence length has been extended to 2048, and various architectural changes have been made compared to the original BERT. See the HuggingFace repository README for details. ↩︎

  14. nlp-waseda/roberta-base-japanese and nlp-waseda/roberta-large-japanese trained using a 128 token context length, but nlp-waseda/roberta-large-japanese-seq512 expanded the context length to 512. ↩︎

  15. Extended to a 1282 context length from the usual 512. ↩︎

  16. The "small" model trains on Japanese Wikipedia and the Japanese Financial Corpus simultaneously, while the "base" model takes the TohokuUniversityBERT and conducts additional training on the Japanese Financial Corpus. ↩︎

  17. ManbyoWordPiece conducts a pre-tokenization step using MeCab (IPA+Manbyo dictionaries) and uses WordPiece for subword tokenization, while the SentencePiece model tokenizes text directly using a unigram model. ↩︎
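As a toy illustration of the WordPiece side of this distinction (greedy longest-match-first subword splitting applied after pre-tokenization), assuming a hand-made vocabulary rather than one learned from a corpus:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece: repeatedly take the longest
    vocabulary prefix, marking word-internal pieces with '##'."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:  # no vocabulary piece matches
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary; a real tokenizer learns this from the training corpus.
vocab = {"高", "血", "圧", "##血", "##圧", "高血", "##圧症"}
# Pre-tokenization (MeCab in ManbyoWordPiece) would first split the sentence
# into words; here we tokenize a single word: 高血圧症 ("hypertension").
print(wordpiece_tokenize("高血圧症", vocab))  # ['高血', '##圧症']
```

A SentencePiece unigram model instead scores whole segmentations probabilistically over the raw text, with no pre-tokenization step.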

  18. For details of each model, please refer to Chapter 4 of the authors' paper. Note that the SC-2M-wiki model is strictly not a domain-specific model as it is pre-trained only on Wikipedia. ↩︎

  19. The classification of embedding models follows Dense Text Retrieval based on Pretrained Language Models: A Survey (Zhao+, 2022). The Bi-Encoder architecture encodes the two inputs separately into vectors and measures their proximity with a dot product or cosine similarity. In contrast, the Cross-Encoder architecture feeds the combined inputs into the model and computes their proximity directly. Although Cross-Encoders incur higher computational costs, they can score input proximity more precisely, so they are often used as rerankers in information retrieval. Among Bi-Encoders, some models (e.g., ColBERT) represent the input as multiple vectors (such as one per token) rather than a single vector, hence the further classification into Single-representation bi-encoders and Multi-representation bi-encoders. ↩︎
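The scoring difference between the two bi-encoder types can be sketched with synthetic token embeddings; all vectors below are random stand-ins for what a real Transformer encoder would produce:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def single_rep_score(q_tokens, d_tokens):
    """Single-representation bi-encoder: pool each side into one vector
    (mean pooling here), then compare with cosine similarity."""
    return cosine(q_tokens.mean(axis=0), d_tokens.mean(axis=0))

def multi_rep_score(q_tokens, d_tokens):
    """Multi-representation bi-encoder (ColBERT-style MaxSim): for each
    query token, take its best-matching document token, then sum."""
    sims = q_tokens @ d_tokens.T  # rows are L2-normalized, so this is cosine
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
normalize = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
query = normalize(rng.normal(size=(4, 8)))   # 4 query tokens, dim 8
doc = normalize(rng.normal(size=(10, 8)))    # 10 document tokens

print(single_rep_score(query, doc))
print(multi_rep_score(query, doc))
```

A cross-encoder, by contrast, would concatenate query and document into one input and let the model output the score directly, which is why its per-pair cost is higher.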

  20. However, it calls for consideration for use in research and education. Additionally, be aware that some of the licenses for the source models are not Apache 2.0. ↩︎ ↩︎ ↩︎

Last updated: