How to Convert JSON to JSONL for OpenAI Fine-Tuning

Fine-tuning OpenAI's models lets you customize a model's behavior to better suit your specific use case. One common task when preparing data for fine-tuning is converting JSON data into a format known as JSONL (JSON Lines). OpenAI’s fine-tuning API expects training data in this format, which stores each training example as a single line so that a file can be read and checked one example at a time.

In this guide, we’ll walk you through the process of converting a JSON dataset into JSONL format using a New York Giants sports team example. This will allow you to create a dataset that can be used to fine-tune a model that provides sports-related information.

What is JSONL?

JSONL stands for JSON Lines, a file format where each line is a separate JSON object. This structure makes it easy to read and process large datasets in a line-by-line fashion, which is perfect for tasks such as model fine-tuning. The OpenAI fine-tuning API expects data in JSONL format, where each line represents a separate interaction between the user and the assistant.
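
As a small illustration of the line-by-line structure, the following sketch writes two records as JSON Lines and reads them back one line at a time. It uses an in-memory buffer in place of a real file, and the sample records are made up for the example.

```python
import io
import json

# Hypothetical sample records; any JSON-serializable dicts would work.
records = [
    {"id": 1, "text": "first entry"},
    {"id": 2, "text": "second entry"},
]

# Write: one compact JSON object per line, each terminated by a newline.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: parse each non-empty line independently of the others.
buffer.seek(0)
parsed = [json.loads(line) for line in buffer if line.strip()]

print(len(parsed))
print(parsed[1]["text"])
```

Because every line is a complete JSON object, a reader never needs to hold the whole file in memory to process it.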

Example Data Structure for Fine-Tuning

When using OpenAI’s fine-tuning API, the data needs to follow a specific structure. The key elements of the JSONL format are:

messages: An array of messages that represent the conversation between the system, user, and assistant.
role: Defines who is sending the message (system, user, or assistant).
content: The content of the message.
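weight (optional): Set on an assistant message to control whether that message is trained on. OpenAI’s documentation lists 0 (skip the message) and 1 (train on it) as the allowed values; an assistant message without a weight is trained on.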

Here’s a typical example of the format:

json

{
  "messages": [
    {"role": "system", "content": "You are a knowledgeable sports assistant who answers questions about teams, players, and sports events."},
    {"role": "user", "content": "Tell me about the New York Giants."},
    {"role": "assistant", "content": "The New York Giants are a professional football team based in East Rutherford, New Jersey. They are part of the NFC East division in the NFL."}
  ]
}
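
Before uploading a file, it can help to check each example against the structure described above. The following is a minimal sketch, not an official validator: the function name validate_example is hypothetical, and it covers only the fields listed in this guide (messages, role, content, weight), assuming the three roles named above and a weight of 0 or 1.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_example(example):
    """Return a list of problems found in one training example (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    has_assistant = False
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {i}: must be a JSON object")
            continue
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if role == "assistant":
            has_assistant = True
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
        # Assumption: weight is only meaningful on assistant messages, as 0 or 1.
        if "weight" in message:
            if role != "assistant":
                problems.append(f"message {i}: 'weight' only applies to assistant messages")
            elif message["weight"] not in (0, 1):
                problems.append(f"message {i}: 'weight' must be 0 or 1")
    if not has_assistant:
        problems.append("example has no assistant message")
    return problems

good = {
    "messages": [
        {"role": "system", "content": "You are a knowledgeable sports assistant."},
        {"role": "user", "content": "Tell me about the New York Giants."},
        {"role": "assistant", "content": "The New York Giants are a professional football team."},
    ]
}
bad = {"messages": [{"role": "bot", "content": "hi"}]}

print(validate_example(good))
print(validate_example(bad))
```

A well-formed example yields an empty list, while the second example is flagged for its unknown role and for having no assistant message.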

Example: Creating a Dataset for the New York Giants

Let’s say you want to create a dataset where users can ask questions about the New York Giants, and the assistant will provide informative answers. Below is an example of the JSON structure that represents interactions between a user and the assistant:

json

{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What year did the New York Giants win the Super Bowl?"
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "The New York Giants won the Super Bowl four times: 1986, 1990, 2007, and 2011."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "The Giants have won several Super Bowls."
    }
  ]
}

In this record, the user asks about the Super Bowl victories of the New York Giants, and two candidate assistant responses are stored: a detailed preferred output and a vaguer non-preferred output.

Converting JSON to JSONL

To fine-tune OpenAI’s models, we need to convert this JSON data into JSONL format. The key is ensuring that each line contains a complete conversation with the necessary system, user, and assistant roles, structured appropriately.

Steps to Convert JSON to JSONL

Identify the Components: The input JSON data contains an array of messages and separate preferred_output and non_preferred_output fields. These need to be combined into a single conversation.
Format Each Entry: Each line in the JSONL file must represent a full conversation, including the system, user, and assistant messages.

Here’s what the converted JSONL file will look like:

json

{"messages": [{"role": "system", "content": "You are a knowledgeable sports assistant who answers questions about teams, players, and sports events."}, {"role": "user", "content": "What year did the New York Giants win the Super Bowl?"}, {"role": "assistant", "content": "The New York Giants won the Super Bowl four times: 1986, 1990, 2007, and 2011.", "weight": 1}]}
{"messages": [{"role": "system", "content": "You are a knowledgeable sports assistant who answers questions about teams, players, and sports events."}, {"role": "user", "content": "What year did the New York Giants win the Super Bowl?"}, {"role": "assistant", "content": "The Giants have won several Super Bowls."}]}

Key Points:

Each line contains a single conversation with a system, user, and assistant message.
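The weight attribute on the preferred_output response is set to 1 so that the response is trained on. According to OpenAI’s documentation, weight accepts only 0 or 1, so it cannot be used to grade responses by quality.
The non_preferred_output is written as a second conversation with the shorter assistant response. Because it carries no weight, it is trained on just like the preferred response; to keep it out of training, leave that line out or set its weight to 0.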
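
To make the effect of weight concrete, the sketch below lists which assistant messages in an example would contribute to training. It assumes the behavior described in OpenAI’s fine-tuning documentation: weight may be 0 or 1, and an assistant message without a weight is treated as 1. The helper name trained_responses is hypothetical, and the system prompt is shortened for brevity.

```python
def trained_responses(example):
    """Return the content of assistant messages that would be trained on."""
    return [
        message["content"]
        for message in example["messages"]
        # Assumption: a missing weight behaves like weight 1.
        if message["role"] == "assistant" and message.get("weight", 1) != 0
    ]

system = {"role": "system", "content": "You are a knowledgeable sports assistant."}
user = {"role": "user", "content": "What year did the New York Giants win the Super Bowl?"}
short_answer = "The Giants have won several Super Bowls."

preferred = {"messages": [system, user, {
    "role": "assistant",
    "content": "The New York Giants won the Super Bowl four times: 1986, 1990, 2007, and 2011.",
    "weight": 1,
}]}
non_preferred = {"messages": [system, user, {"role": "assistant", "content": short_answer}]}
masked = {"messages": [system, user, {"role": "assistant", "content": short_answer, "weight": 0}]}

print(trained_responses(preferred))      # one response
print(trained_responses(non_preferred))  # also one response: no weight means it is trained on
print(trained_responses(masked))         # empty list: weight 0 excludes the response
```

Under these assumptions, the second line of the converted file teaches the model the vaguer answer as well, which is worth keeping in mind when deciding whether to include it.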

Automating the Conversion with Python

If you have a larger dataset, manually converting it to JSONL can be time-consuming. You can automate the process with a Python script. Below is a Python script that reads the input JSON file and converts it into JSONL format:

Python Script for Conversion

python

import json

SYSTEM_PROMPT = "You are a knowledgeable sports assistant who answers questions about teams, players, and sports events."

def convert_json_to_jsonl(input_file, output_file):
    # Read the JSON data
    with open(input_file, 'r', encoding='utf-8') as infile:
        data = json.load(infile)

    # Accept either a single record or a list of records
    records = data if isinstance(data, list) else [data]

    # Open output file for writing JSONL
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for record in records:
            user_content = record["input"]["messages"][0]["content"]
            # Pair each preferred output with its non-preferred counterpart
            for preferred, non_preferred in zip(record["preferred_output"], record["non_preferred_output"]):
                # Write the preferred output as one conversation
                jsonl_entry = {
                    "messages": [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": user_content},
                        {"role": "assistant", "content": preferred["content"], "weight": 1}
                    ]
                }
                outfile.write(json.dumps(jsonl_entry, ensure_ascii=False) + '\n')

                # Write non-preferred output as an alternative conversation
                jsonl_entry["messages"][2] = {"role": "assistant", "content": non_preferred["content"]}
                outfile.write(json.dumps(jsonl_entry, ensure_ascii=False) + '\n')

if __name__ == "__main__":
    input_file = 'input.json'  # Path to your input JSON file
    output_file = 'output.jsonl'  # Path to your output JSONL file
    convert_json_to_jsonl(input_file, output_file)

How to Use the Python Script:

Save the input JSON data in a file named input.json.
Save the script as convert_json_to_jsonl.py.
Run the script using Python:
```bash
python convert_json_to_jsonl.py
```

This script will generate an output.jsonl file, where each line corresponds to a conversation about the New York Giants, complete with the system, user, and assistant messages.
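
After running the script, a quick sanity check is to confirm that every line of the output parses as JSON and contains a messages key. The sketch below is one way to do that; the function name count_valid_lines is hypothetical, and the demo writes a small temporary file as a stand-in for output.jsonl so the example runs on its own.

```python
import json
import os
import tempfile

def count_valid_lines(path):
    """Parse every non-empty line of a JSONL file and return how many were read."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
            if not isinstance(entry, dict) or "messages" not in entry:
                raise ValueError(f"line {line_number}: missing 'messages'")
            count += 1
    return count

# Demo: a temporary file standing in for output.jsonl.
sample = {"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
]}
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write(json.dumps(sample) + "\n")
    tmp.write(json.dumps(sample) + "\n")
    tmp_path = tmp.name

line_count = count_valid_lines(tmp_path)
print(line_count)
os.remove(tmp_path)
```

To check the real file, call count_valid_lines('output.jsonl') and compare the result with the number of conversations you expect.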