Details of the fine-tuned v1.1 model
Summary
Introduction
I’m Takashi Kodama from the Language and Media Laboratory at Kyoto University (personal page, X account).
LLM-jp released LLM-jp-13B v1.0 on October 20th, 2023. At that time, we released both a pre-trained model and a fine-tuned model, but due to time constraints we were unfortunately not able to sufficiently improve the performance of the tuned model.
This time, we are releasing fine-tuned model v1.1, whose performance has been improved by revisiting the instruction-tuning settings and adding DPO (Direct Preference Optimization). We are also releasing the data and source code. All assets are released under open-source licenses that allow commercial use. Note that only the fine-tuned model is updated this time; the pre-trained model remains v1.0, released in October 2023.
- Model
- Data
- Code
- llm-jp-sft (also released at the time of v1.0)
- llm-jp-dpo
In this article, for the purpose of knowledge sharing, we describe the training settings and the performance evaluation of instruction tuning and DPO.
About the LLM-jp-13B
LLM-jp-13B is a large-scale language model with 13 billion parameters, pre-trained mainly on Japanese and English text. It is characterized by making its models and tools open in order to contribute to research and development in both academia and industry. For details on pre-training, please see the press release from the National Institute of Informatics.
Instruction tuning
First, I will introduce instruction tuning.
Instruction tuning is a fine-tuning method for pre-trained models that aims to generate output in accordance with user instructions. By preparing pairs of user instructions and corresponding outputs for a wide variety of tasks, the model can generate appropriate outputs even for unknown instructions.
Below, we describe the datasets (including how the translated ones were created), the training settings, and the evaluation.
Datasets
The following five datasets are used for instruction tuning:
Dataset | Languages | Number of samples |
---|---|---|
databricks-dolly-15k | English | 15,011 |
databricks-dolly-15k-ja | Japanese | 15,011 |
oasst-21k-en | English | 21,164 |
oasst-21k-ja | Japanese | 21,164 |
ichikara-instruction-003-001 | Japanese | 2,903 |
databricks-dolly-15k is an English instruction dataset. It is a single-turn dataset in which each sample consists of an instruction, an optional input, and a response.
databricks-dolly-15k-ja is a translation of the databricks-dolly-15k data into Japanese using DeepL.
oasst-21k-en is a multi-turn instruction dataset. The original oasst1 includes 35 languages, but we only extracted English dialogues.
oasst-21k-ja is a translation of oasst-21k-en into Japanese using DeepL. In order to take the connections between utterances in multi-turn dialogues into account during translation, we concatenate the utterances and translate each dialogue as a single unit, then split the translation back into individual utterances in post-processing.
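As an illustration, the following sketch uses the official deepl Python client to translate a dialogue as one unit and then split it back into utterances. The separator-based splitting and the client usage shown here are simplifying assumptions for illustration, not the exact pipeline we used.

```python
# Sketch of dialogue-level translation (simplified illustration, not our exact pipeline).
import deepl

translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")  # placeholder API key

def translate_dialogue(turns: list[str]) -> list[str]:
    """Translate a multi-turn dialogue as a single unit so context is kept across turns."""
    joined = "\n\n".join(turns)  # concatenate utterances into one document
    result = translator.translate_text(joined, source_lang="EN", target_lang="JA")
    return result.text.split("\n\n")  # split the translation back into utterances

print(translate_dialogue(["Hello, can you help me plan a trip?",
                          "Of course! Where would you like to go?"]))
```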
ichikara-instruction-003-001 is a single-turn Japanese instruction dataset created by the Japanese instruction dataset creation project for LLMs. Unlike the translated datasets above, it was written from scratch (“ichikara”) in Japanese and is characterized by its high quality.
For reference, the DeepL translation for databricks-dolly-15k-ja and oasst-21k-ja cost approximately 100,000 yen. For training, each dataset is split into training and development data at a ratio of 9:1.
Input prompt
The input prompts given to the model follow the prompt template used by Alpaca, translated into Japanese.
- Single-turn with an “input” field (applies to parts of databricks-dolly-15k and databricks-dolly-15k-ja)
```
以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。

### 指示:
{instruction}

### 入力:
{input}

### 応答:
{output}
```
- Single-turn without an “input” field (applies to parts of databricks-dolly-15k and databricks-dolly-15k-ja, and to ichikara-instruction-003-001)
```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
{instruction}

### 応答:
{output}
```
- Multi-turn (applies to oasst-21k-en and oasst-21k-ja)
```
以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。

### 指示:
{instruction1}

### 応答:
{output1}

### 指示:
{instruction2}

### 応答:
{output2}
```
In pre-training, the model is trained by computing the loss over all tokens. In instruction tuning, however, the goal is to learn to produce the response (output) given the instruction and input, so the loss is computed only over the response part, not over the instruction and input parts. The DataCollatorForCompletionOnlyLM class from trl (Transformer Reinforcement Learning) is useful for implementing this.
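As a minimal sketch of this setup (not the exact llm-jp-sft code, and assuming a trl version in which SFTTrainer still accepts dataset_text_field and max_seq_length directly), completion-only loss masking looks roughly like this:

```python
# Minimal sketch of completion-only loss masking with trl (illustrative, not the llm-jp-sft code).
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_name = "llm-jp/llm-jp-13b-v1.0"  # the pre-trained model used as the starting point
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy example already formatted with the Alpaca-style template shown above.
train_dataset = Dataset.from_list([
    {"text": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
             "### 指示:\n日本の首都はどこですか?\n\n### 応答:\n東京です。"}
])

# Loss is computed only on tokens after "### 応答:"; with instruction_template set,
# every "### 指示:" segment in a multi-turn sample is masked out as well.
collator = DataCollatorForCompletionOnlyLM(
    response_template="### 応答:",
    instruction_template="### 指示:",
    tokenizer=tokenizer,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    data_collator=collator,
    packing=False,  # packing must be disabled when masking by template
)
trainer.train()
```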
Settings for the comparative experiment
Previous research has found that when instruction tuning is performed with English and non-English instruction data simultaneously, question-answering accuracy (XQuAD) in the non-English languages improves. Based on this finding, we examine using English and Japanese instruction datasets at the same time.
In this experiment, we compare settings that vary the combination of the tuning method (which parameters are trained) and the datasets used, as follows:
Model | Tuning method | Datasets used |
---|---|---|
dolly-oasst (lora) | LoRA | dolly, oasst |
dolly-ichikara-oasst (lora) | LoRA | dolly, ichikara, oasst |
ichikara (full) | Full | ichikara |
dolly-oasst (full) | Full | dolly, oasst |
dolly-ichikara-oasst (full) | Full | dolly, ichikara, oasst |
dolly-oasst->ichikara (full) | Full | fine-tune on dolly and oasst, then fine-tune on ichikara |
Regarding the tuning method, we compare LoRA tuning and full-parameter tuning. For the datasets, we compare combinations of dolly (databricks-dolly-15k and databricks-dolly-15k-ja), ichikara (ichikara-instruction-003-001), and oasst (oasst-21k-en and oasst-21k-ja). Finally, dolly-oasst->ichikara (full) is a two-stage setting in which the model is first fine-tuned on dolly and oasst and then further fine-tuned on ichikara.
Hyperparameters
The hyperparameters are set as follows:
Parameters | Values |
---|---|
max_seq_length | 2048 |
batch_size | 64 |
learning_rate | 2e-5 |
warmup_ratio | 0.1 |
num_train_epochs | 5 |
LoRA r | 128 |
LoRA alpha | 256 |
LoRA dropout | 0.05 |
LoRA target modules | c_attn, c_proj, c_fc |
Checkpoints are saved every 10 steps for ichikara (full) and the second stage of dolly-oasst->ichikara (full), and every 100 steps for the other models. In each case, the checkpoint with the lowest loss on the development data is adopted.
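For reference, these settings map onto transformers and peft roughly as follows. This is an illustrative sketch, not the exact llm-jp-sft configuration; the per-device batch size assumes the global batch size of 64 is split evenly across the 8 GPUs mentioned below.

```python
# Illustrative mapping of the hyperparameters above (assumptions noted in comments).
from peft import LoraConfig
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output-sft",
    per_device_train_batch_size=8,   # 8 GPUs x 8 = effective batch size 64 (assumed split)
    learning_rate=2e-5,
    warmup_ratio=0.1,
    num_train_epochs=5,
    evaluation_strategy="steps",
    eval_steps=100,                  # 10 for ichikara (full) and the second tuning stage
    save_strategy="steps",
    save_steps=100,
    load_best_model_at_end=True,     # adopt the checkpoint with the lowest development loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# LoRA settings used for the (lora) runs; full-parameter runs simply omit the adapter.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "c_fc"],
    task_type="CAUSAL_LM",
)
```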
Training
We used 8 A100 40GB GPUs for training.
DeepSpeed ZeRO 2 was used for LoRA tuning, and DeepSpeed ZeRO 3 with parameter offload was used for full-parameter tuning to reduce memory usage. Full-parameter tuning was stopped after about 2.5 epochs because the model started to overfit and the loss on the development data began to increase. Training took about 14 hours for LoRA and about 61 hours for full-parameter tuning.
The Hugging Face trl library is used for the implementation. For more details, see the code here.
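For reference, a ZeRO-3 configuration with parameter and optimizer offload can be passed to TrainingArguments through its deepspeed argument. The keys below are standard DeepSpeed options, but this is only a simplified sketch; the actual configuration used for training may differ.

```python
# Simplified DeepSpeed ZeRO-3 + CPU offload config (sketch; the actual config may differ).
ds_config = {
    "bf16": {"enabled": True},  # assumes bf16 training on A100 GPUs
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
}

# training_args = TrainingArguments(..., deepspeed=ds_config)  # for ZeRO 2: "stage": 2, no offload
```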
Evaluation
The Japanese VicunaQA Benchmark is used for evaluation. This benchmark evaluates LLM performance on open-ended tasks that have no single fixed answer. It uses GPT-4 (gpt-4-0613) to judge LLMs’ responses to Japanese questions and consists of 80 questions across 8 categories, including common sense, mathematics, and role-playing. Note that although automatic evaluation by GPT-4 has been reported to agree with human evaluation to some extent, it still has many issues, such as difficulty in judging the factual accuracy of the information.
Following the Japanese VicunaQA benchmark leaderboard, we report the AdjustedWinRate, the percentage of questions on which the output of the LLM under evaluation is judged to be better than the output of GPT-3.5 (text-davinci-003), with ties counted as half a win:
\[\text{AdjustedWinRate} = \frac{\# \text{Win} + 0.5 \times \# \text{Tie}}{\# \text{Win} + \# \text{Loss} + \# \text{Tie}}\]
The results are as follows. The models trained in this work are shown in bold, alongside the results of other models listed on the Japanese VicunaQA benchmark leaderboard at the time of writing.
Model | AdjustedWinRate (%) |
---|---|
cyberagent/calm2-7b-chat | 77.500 |
**dolly-ichikara-oasst (full)** | 58.750 |
**dolly-oasst->ichikara (full)** | 57.500 |
tokyotech-llm/Swallow-70b-instruct-hf | 51.875 |
**dolly-oasst (full)** | 50.000 |
**dolly-ichikara-oasst (lora)** | 48.750 |
**ichikara (full)** | 46.250 |
**dolly-oasst (lora)** | 45.000 |
llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0 | 33.750 |
rinna/japanese-gpt-neox-3.6b-instruction-ppo | 19.375 |
llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 13.750 |
rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 13.750 |
llm-jp/llm-jp-13b-v1.0 | 13.125 |
Observations: First, regarding the tuning method, the full-parameter models tend to outperform the LoRA models.
Next, regarding the datasets, performance improves when ichikara is used in addition to dolly and oasst. However, the model tuned with ichikara alone performs worse than the model tuned with dolly and oasst. This may be because ichikara contains only 2,903 samples, fewer than the other datasets.
Finally, dolly-oasst->ichikara (full), which uses two-stage tuning, performs slightly worse than dolly-ichikara-oasst (full), which uses dolly, ichikara, and oasst simultaneously, but comes close. For large models, even instruction tuning requires substantial computational resources, so this kind of two-stage tuning can be an effective way to save compute. In fact, the second-stage tuning took only about 2 hours.
DPO
Next, we introduce DPO (Direct Preference Optimization).
DPO is a method for optimizing a model so that it outputs responses that better match human preferences. It has been reported to perform as well as or better than PPO, the method used in InstructGPT, while being superior in training stability and computational efficiency.
In this work, we apply DPO on top of the instruction-tuned model to further improve its performance.
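For reference, the objective from the original DPO paper is shown below, where x is a prompt, y_w and y_l are the preferred and dispreferred responses, π_θ is the model being trained, π_ref is a frozen reference model (here, the instruction-tuned model), β is a temperature hyperparameter, and σ is the logistic sigmoid:

\[\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]\]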
Datasets
For DPO, we use hh-rlhf-12k-ja (llm-jp/hh-rlhf-12k-ja), a subset of hh-rlhf translated into Japanese with DeepL. hh-rlhf is a dataset in which a dialogue between a human and an agent is followed by a pair of agent responses, one preferred and one dispreferred.
The preference judgments follow two criteria, helpfulness and harmlessness, and data is provided for each. hh-rlhf contains approximately 160,000 training samples. For this work, we randomly sampled 9,000 helpfulness examples and 3,000 harmlessness examples and translated them into Japanese.
As with oasst-21k-ja, in order to take the connections between utterances in multi-turn dialogues into account, we concatenated the utterances, translated each dialogue as a single unit, and then split the translation back into individual utterances in post-processing.
The data is split into training and development data at a ratio of 9:1.
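For illustration, each preference example boils down to a prompt with a chosen and a rejected response. The record below is a made-up example in the format expected by trl’s DPOTrainer, not an actual sample from the dataset:

```python
# Hypothetical preference record (illustrative content, not from hh-rlhf-12k-ja).
preference_example = {
    "prompt": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
              "### 指示:\n疲れが取れないときはどうすればいいですか?\n\n### 応答:\n",
    "chosen": "十分な睡眠をとり、軽い運動とバランスの良い食事を心がけると回復しやすくなります。",
    "rejected": "さあ、知りません。",
}
```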
Input prompt
The input prompts are the same as those used for instruction tuning.
Hyperparameters
Hyperparameters are set as follows:
Parameters | Values |
---|---|
max_seq_length | 2048 |
batch_size | 64 |
learning_rate | 5e-7 |
warmup_ratio | 0.1 |
num_train_epochs | 30 |
LoRA r | 128 |
LoRA alpha | 256 |
LoRA dropout | 0.05 |
LoRA target modules | c_attn, c_proj, c_fc |
The original DPO paper likewise uses a small learning rate (1e-6); a relatively small learning rate appears to work better for DPO.
Training
DPO is applied to dolly-ichikara-oasst (full), the best-performing model from instruction tuning, as the base model. Only LoRA tuning was performed, and DeepSpeed ZeRO 2 was used as in instruction tuning. Training used 8 A100 40GB GPUs and completed in about 21 hours.
As with instruction tuning, the trl library is used for the implementation. See the code for more details.
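The following is a minimal sketch of LoRA-based DPO with trl’s DPOTrainer, not the exact llm-jp-dpo code. The model path, the per-device batch size, the β value, and the prompt-length limit are assumptions made for illustration.

```python
# Minimal sketch of LoRA-based DPO with trl's DPOTrainer (illustrative, not the llm-jp-dpo code).
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_model_path = "path/to/dolly-ichikara-oasst-full"  # placeholder for the instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(sft_model_path)
model = AutoModelForCausalLM.from_pretrained(sft_model_path)

# Preference data with "prompt", "chosen", and "rejected" columns (toy single record here).
train_dataset = Dataset.from_list([
    {"prompt": "### 指示:\nこんにちは\n\n### 応答:\n",
     "chosen": "こんにちは!今日はどのようなご用件でしょうか?",
     "rejected": "......"}
])

peft_config = LoraConfig(
    r=128, lora_alpha=256, lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "c_fc"], task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="output-dpo",
    per_device_train_batch_size=8,  # 8 GPUs x 8 = batch size 64 (assumed split)
    learning_rate=5e-7,
    warmup_ratio=0.1,
    num_train_epochs=30,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,          # with a peft_config, the frozen base weights act as the reference
    args=training_args,
    beta=0.1,                # DPO temperature; assumed value, not stated in the article
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=2048,
    max_prompt_length=1024,  # assumed split of the 2048-token budget
)
trainer.train()
```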
Evaluation
We will use the Japanese VicunaQA benchmark again for evaluation.
The results are as follows (the models trained in this work are again shown in bold):
Model | AdjustedWinRate (%) |
---|---|
cyberagent/calm2-7b-chat | 77.500 |
**dolly-ichikara-oasst (full) + DPO** | 67.500 |
**dolly-ichikara-oasst (full)** | 58.750 |
**dolly-oasst->ichikara (full)** | 57.500 |
tokyotech-llm/Swallow-70b-instruct-hf | 51.875 |
**dolly-oasst (full)** | 50.000 |
**dolly-ichikara-oasst (lora)** | 48.750 |
**ichikara (full)** | 46.250 |
**dolly-oasst (lora)** | 45.000 |
llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0 | 33.750 |
rinna/japanese-gpt-neox-3.6b-instruction-ppo | 19.375 |
llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0 | 13.750 |
rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 | 13.750 |
llm-jp/llm-jp-13b-v1.0 | 13.125 |
Observations:
Comparing the responses of dolly-ichikara-oasst (full) and dolly-ichikara-oasst (full) + DPO, we found that dolly-ichikara-oasst (full) + DPO tended to output longer responses: the average response length was 446 characters for dolly-ichikara-oasst (full) and 509 characters for dolly-ichikara-oasst (full) + DPO, roughly 14% longer. We believe DPO contributed to the higher score because the longer responses contain more information.
Finally, we will look at the sample responses.
For the question 「時間管理能力を向上させるにはどうしたらいいですか?」 (“How can I improve my time management skills?”), dolly-ichikara-oasst (full) and dolly-ichikara-oasst (full) + DPO generated the following answers:
dolly-ichikara-oasst (full):
時間管理のスキルを向上させたいなら、まず優先順位をつけることから始めましょう。ToDoリストを作り、緊急度と重要度に基づいて項目を整理しましょう。そして、最も重要な仕事からこなしていくようにしよう。さらに、生産性向上ツールを使ったり、スケジュールやToDoリストをデジタル化したりして、より効率的に時間を管理できるように工夫しましょう。
dolly-ichikara-oasst (full) + DPO:
時間管理のスキルを向上させたいなら、まず優先順位をつけることから始めましょう。やるべき仕事とやりたくない仕事のリストを作るんだ。そして、やるべき仕事が終わったら、リストにチェックを入れる。そして、その仕事を終わらせるために、自分が使える最も重要な時間帯(朝一番とか)にそれらの仕事をこなすようにスケジュールを調整するんだ。もしできたら、ボーナスポイントとして丸いシールを追加してください!
優先順位をつけたら、次のステップは、時間管理能力を向上させるために、タイマーを使って仕事の時間を測ることだ。タイマーを10分、20分、30分と1つずつ増やしていく。タイマーが切れるまで、できるだけ多くの仕事をこなしたほうがいい。タイマーを切った後、どれだけ仕事をこなせたかを見て、自分の進歩を確認しましょう。また、一日中、タイマーで何かを制限しながら作業するという実験もできます。
また、ToDoリストアプリを使うこともできる。この種のアプリは、あなたの進捗状況や、どのタスクにもっと時間をかけたほうがいいかを教えてくれます。また、これらのアプリは、各タスクに対して「完了」ボタンを与えるので、達成感を感じるのに役立ちます。
まとめると、時間管理能力を向上させる最善の方法は、優先順位をつけること、タイマーを使うこと、そして達成感を感じることだ。
Both models output appropriate advice for the question, but dolly-ichikara-oasst (full) + DPO gives more specific time-management advice.
However, one issue common to both models is inconsistent sentence endings (mixing plain and polite Japanese). Some of the datasets used for instruction tuning and DPO were created by machine translation, and we believe the sentence endings are not consistently unified within those datasets.
Possible remedies are to manually correct the machine translations and to use more datasets written in natural Japanese by humans, such as ichikara-instruction, but the cost of creating such datasets is very high. LLM-jp hopes to work on this in the future.
Conclusion
In this article, we introduced the training and evaluation of the fine-tuned models v1.1, whose performance was improved by revisiting the instruction-tuning settings and adding DPO. We are releasing three models: two improved by instruction tuning (one full-parameter model and one LoRA model) and one further improved with DPO.
Model | AdjustedWinRate (%) |
---|---|
dolly-ichikara-oasst (full) + DPO | 67.500 |
dolly-ichikara-oasst (full) | 58.750 |
dolly-ichikara-oasst (lora) | 48.750 |
The previous release, llm-jp/llm-jp-13b-instruct-lora-jaster-dolly-oasst-v1.0, achieved an AdjustedWinRate of 33.750% against GPT-3.5 on Japanese VicunaQA, while the newly released model with DPO achieves 67.500%, a significant improvement.
The models released this time are still at an early stage of LLM-jp’s efforts. They do not yet have sufficient safety measures in place and are not fully aligned with the guidelines and advisories published by the government and related organizations. Please be aware of this and use the models at your own risk.
The ichikara-instruction dataset used for instruction tuning is released under a non-commercial license, but the models trained using this dataset are released with an open-source license under an agreement between RIKEN AIP and NII. The ichikara-instruction dataset is available for commercial use for a fee (contact: RIKEN Center for Integrated Research on Innovative Intelligence, Linguistic Information Access Technology Team).
LLM-jp will continue to develop open-source large language models that are strong in Japanese. We encourage anyone interested in LLM-jp’s activities to join us via this page.
* This article is a translation by Akim Mousterou. The original article is here.