# Fine-Tuning LLMs for Function Calling

Agents are one of the most active application areas for LLMs. An agent is a system that carries out specific tasks or services with human-like behavior. LLM-based agents combine the comprehension and generation capabilities of LLMs with planning and function calling to accomplish complex tasks.

In this tutorial, we will demonstrate how to perform Supervised Fine-Tuning (SFT) and Parameter Efficient Fine-Tuning (PEFT) to learn function-calling (tool learning) using NeMo 2.0. NeMo 2.0 introduces Python-based configurations, PyTorch Lightning’s modular abstractions, and NeMo-Run for scaling experiments across multiple GPUs. In this notebook, we will use NeMo-Run to streamline the configuration and execution of our experiments.

# NeMo Tools and Resources

* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)

# Software Requirements

* Access to the latest NeMo Framework NGC containers


# Hardware Requirements

* This playbook has been tested on the following hardware: Single A6000, Single H100, 2xA6000, 8xH100. It can be scaled to multiple GPUs as well as multiple nodes by modifying the appropriate parameters.

#### Launch the NeMo Framework container as follows: 

Depending on the number of GPUs available, you may need to adjust the `--gpus` argument:
```
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:25.02
```

#### Launch Jupyter Notebook as follows: 
```
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''
```

## Step 1. Construct the Dataset

An LLM agent is a system that leverages an LLM as its core engine, and is capable of executing specific tasks by invoking external functions or tools. These tools can be APIs, databases, calculators, etc., allowing the agent to obtain the necessary information from external sources while completing tasks. As shown in an example dataset below, an LLM understands the usage of each tool via three text fields: `name`, `description`, `parameters`.

In [None]:
tool1 = {
 'name': 'strategy_query', 
 'description': 'Check the initial quotes for financial products.', 
 'parameters': {'product': {'type': 'string', 'description': 'Product type.'},
 'term': {'type': 'string', 'description': 'Term.'}}
}
tool2 = {}
tools = [tool1, tool2]
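To make the role of these three fields concrete, here is a minimal sketch of how a tool list might be rendered into prompt text. The `render_tools_prompt` helper and its default header wording are hypothetical and not part of NeMo; the sketch only illustrates that the model sees each tool as plain text.

```python
import json


def render_tools_prompt(tools, header="Here is a list of functions in JSON format that you can invoke."):
    """Render the non-empty tool definitions as a JSON block under a short header."""
    # Skip placeholder entries such as an empty dict
    defined = [t for t in tools if t]
    return header + "\n\n" + json.dumps(defined, indent=4)


tool = {
    "name": "strategy_query",
    "description": "Check the initial quotes for financial products.",
    "parameters": {
        "product": {"type": "string", "description": "Product type."},
        "term": {"type": "string", "description": "Term."},
    },
}

prompt = render_tools_prompt([tool, {}])
print(prompt)
```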

To perform fine-tuning in NeMo 2.0, we should first transform the training dataset into a predefined format in NeMo 2.0. According to the different training strategies you use, there are two types of function-calling dataset formats:

* The first type is a single-turn function-calling dataset. For each piece of data, the assistant only calls the function once, and the conversation does not record the function-calling execution result. We focus on training the LLM to correctly select the function and its parameters.

* The second type is a multi-turn function-calling dataset. For each piece of data, the assistant calls the function once or more than once, and the conversation records the function-calling execution result. The assistant will make the next generation based on the function's return values. This is closer to a real-life, interactive chatbot scenario.

Both types of datasets are supported in NeMo function-calling fine-tuning. You can choose one of them or a combination of them, as long as it's suitable for your target.

#### Option 1: Use the Single-Turn Function-Calling Dataset

Let's take a sample from [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2) as an example. The original data format is shown below.

In [None]:
{
 "system": "SYSTEM: You are a helpful assistant with access to the following functions. Use them if required -\n{\n \"name\": \"generate_password\",\n \"description\": \"Generate a random password\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"length\": {\n \"type\": \"integer\",\n \"description\": \"The length of the password\"\n },\n \"include_symbols\": {\n \"type\": \"boolean\",\n \"description\": \"Whether to include symbols in the password\"\n }\n },\n \"required\": [\n \"length\"\n ]\n }\n}\n",
 "chat": "USER: Hi, I need a new password for my account. Can you generate one for me?\n\n\nASSISTANT: Of course, I can help with that. How long would you like your password to be? And do you want it to include symbols? <|endoftext|>\n\n\nUSER: I would like it to be 12 characters long and yes, please include symbols.\n\n\nASSISTANT: {\"name\": \"generate_password\", \"arguments\": '{\"length\": 12, \"include_symbols\": true}'} <|endoftext|>\n\n\nFUNCTION RESPONSE: {\"password\": \"4&7jB#9L!2Z%\"}\n\n\nASSISTANT: Here is your new password: 4&7jB#9L!2Z%. Please make sure to save it in a secure place. <|endoftext|>\n\n\nUSER: Great, thanks! Can you also help me book a flight to New York?\n\n\nASSISTANT: I'm sorry, but as an AI, I don't have the capability to book flights or perform other external tasks. My primary function is to assist with tasks related to the functions provided to me. In this case, I can help generate passwords. For booking a flight, I recommend using a trusted travel website or contacting your travel agent. <|endoftext|>\n\n\n"
},
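As a rough illustration of how the raw `chat` string can be broken into turns before conversion, the sketch below splits on the role markers visible in the sample (`USER:`, `ASSISTANT:`, `FUNCTION RESPONSE:`) and strips the `<|endoftext|>` marker. The `split_glaive_chat` helper is hypothetical, and it assumes the role markers never appear inside a message body.

```python
import re

# Role markers as they appear in the raw sample above
ROLE_PATTERN = re.compile(r"(USER|ASSISTANT|FUNCTION RESPONSE):")


def split_glaive_chat(chat):
    """Split a raw chat string into a list of {"role", "text"} turns."""
    # With a capturing group, re.split returns [preamble, role1, text1, role2, text2, ...]
    parts = ROLE_PATTERN.split(chat)
    turns = []
    for role, text in zip(parts[1::2], parts[2::2]):
        text = text.replace("<|endoftext|>", "").strip()
        turns.append({"role": role, "text": text})
    return turns


sample_chat = (
    "USER: Hi, I need a new password.\n\n\n"
    "ASSISTANT: Of course. How long? <|endoftext|>\n\n\n"
    'FUNCTION RESPONSE: {"password": "abc"}\n\n\n'
)
turns = split_glaive_chat(sample_chat)
for turn in turns:
    print(turn["role"], "->", turn["text"])
```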

We should transform it into the NeMo chat dataset format, which consists of three fields: `mask`, `system` and `conversations`. 

* `mask`: The role that needs to be masked out to prevent the role from participating in loss calculation.

* `system`: System prompt.

* `conversations`: For each role, the conversation consists of two fields `from` and `value`.
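The effect of `mask` can be pictured at the turn level with a small sketch. This is only an illustration under a simplifying assumption: NeMo applies the mask to tokens inside its data module, whereas the hypothetical `turns_in_loss` helper below just lists which turns would still contribute to the loss.

```python
def turns_in_loss(record):
    """Return the turns whose role is not listed in the comma-separated mask field."""
    masked_roles = {role.strip() for role in record["mask"].split(",")}
    return [turn for turn in record["conversations"] if turn["from"] not in masked_roles]


record = {
    "mask": "User",
    "system": "",
    "conversations": [
        {"from": "User", "value": "I need a 12-character password with symbols."},
        {"from": "Assistant", "value": "[generate_password(length=12, include_symbols=True)]"},
    ],
}

trained_turns = turns_in_loss(record)
print([turn["from"] for turn in trained_turns])
```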

We can transform the original data into the format shown below. Since we're constructing a single-turn function-calling dataset, you can end the conversation at the point of the tool call.

In [None]:
{ 
 "mask": "User", 
 "system": "",
 "conversations": [
 {
 "from": "User", 
 "value": "You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose. If none of the function can be used, point it out. If the given question lacks the parameters required by the function, also point it out. You should only return the function call in tools call sections. Here is a list of functions in JSON format that you can invoke.\n\n{\n \"name\": \"generate_password\",\n \"description\": \"Generate a random password\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"length\": {\n \"type\": \"integer\",\n \"description\": \"The length of the password\"\n },\n \"include_symbols\": {\n \"type\": \"boolean\",\n \"description\": \"Whether to include symbols in the password\"\n }\n },\n \"required\": [\n \"length\"\n ]\n }\n}\n\n{\n \"name\": \"create_task\",\n \"description\": \"Create a new task in a task management system\",\n \"parameters\": {\n \"type\": \"object\",\n \"properties\": {\n \"title\": {\n \"type\": \"string\",\n \"description\": \"The title of the task\"\n },\n \"due_date\": {\n \"type\": \"string\",\n \"format\": \"date\",\n \"description\": \"The due date of the task\"\n },\n \"priority\": {\n \"type\": \"string\",\n \"enum\": [\n \"low\",\n \"medium\",\n \"high\"\n ],\n \"description\": \"The priority of the task\"\n }\n },\n \"required\": [\n \"title\",\n \"due_date\",\n \"priority\"\n ]\n }\n}\n\n\nIf you decide to invoke any of the function(s), put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\nYou SHOULD NOT include any other information in the response.\n\nI need a new password. Can you generate one for me?"
 }, 
 {
 "from": "Assistant", 
 "value": "Of course. How long would you like your password to be? And would you like it to include symbols?"
 }, 
 {
 "from": "User", 
 "value": "I would like it to be 12 characters long and yes, please include symbols."
 }, 
 {
 "from": "Assistant", 
 "value": "[generate_password(length=12, include_symbols=True)]"
 }
 ]
},
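Before training, it can help to check that each converted record has the three fields described above. The following is a minimal sketch; `validate_chat_record` is a hypothetical helper and only checks for the presence of fields, not their content.

```python
def validate_chat_record(record):
    """Return a list of human-readable problems; an empty list means the record looks well-formed."""
    problems = []
    for key in ("mask", "system", "conversations"):
        if key not in record:
            problems.append(f"missing field: {key}")
    for i, turn in enumerate(record.get("conversations", [])):
        for key in ("from", "value"):
            if key not in turn:
                problems.append(f"turn {i} missing field: {key}")
    return problems


good = {
    "mask": "User",
    "system": "",
    "conversations": [
        {"from": "User", "value": "I need a new password."},
        {"from": "Assistant", "value": "[generate_password(length=12, include_symbols=True)]"},
    ],
}
bad = {"mask": "User", "conversations": [{"from": "User"}]}

print(validate_chat_record(good))
print(validate_chat_record(bad))
```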

Using single-turn function-calling datasets like the one described above, we fine-tuned [nvidia/Mistral-NeMo-Minitron-8B-Base](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base) into [nvidia/Mistral-NeMo-Minitron-8B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Instruct), which has general function-calling ability. If you're interested, you can try it through [NIM online](https://build.nvidia.com/nvidia/mistral-nemo-minitron-8b-8k-instruct). If you want to reproduce a model like [nvidia/Mistral-NeMo-Minitron-8B-Instruct](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Instruct) using NeMo, you can refer to the three open-source datasets we used. 
*Note that we also used some internal datasets that are not open-sourced.*

* [nvidia/Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater)

* [glaiveai/glaive-function-calling-v2](https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2)

* [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

We trained with a learning rate of 1e-6 for 3 epochs.


## Step 1.1 Download the Hugging Face Dataset
For the purposes of this tutorial, we are going to use [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k).
First, let's download the dataset from Hugging Face. You need a valid Hugging Face token to access this gated repo.

In [None]:
!huggingface-cli login --token 

In [None]:
from datasets import load_dataset
xlam_ds = load_dataset('Salesforce/xlam-function-calling-60k', split='train')
xlam_ds.to_json('xlam.jsonl')

You should now have the raw `Salesforce/xlam-function-calling-60k` dataset file downloaded as `xlam.jsonl`.

In [None]:
!ls
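The conversion script in the next step reads three fields from each record, `query`, `tools` and `answers`, and treats the last two as JSON-encoded strings. The sketch below decodes one record to show that shape. The sample record is made up for illustration and is not taken from the dataset, and `decode_xlam_record` is a hypothetical helper.

```python
import json


def decode_xlam_record(record):
    """Decode the JSON-encoded `tools` and `answers` fields of one raw record."""
    return {
        "query": record["query"],
        "tools": json.loads(record["tools"]),
        "answers": json.loads(record["answers"]),
    }


# Illustrative record only; the field names match what the conversion script reads
sample = {
    "query": "What is the weather in Paris?",
    "tools": json.dumps([{"name": "get_weather", "description": "Look up the weather.", "parameters": {"city": {"type": "str"}}}]),
    "answers": json.dumps([{"name": "get_weather", "arguments": {"city": "Paris"}}]),
}

decoded = decode_xlam_record(sample)
print(decoded["tools"][0]["name"], decoded["answers"][0]["arguments"])
```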

## Step 1.2 Convert the Dataset to NeMo Chat SFT Dataset
We now convert the raw dataset to the NeMo format using the following data transformation script.
NeMo's `ChatDataModule` requires `dataset_root` to contain a `training.jsonl` file and a `validation.jsonl` file for the training and validation sets.

Let's first define some helper functions:

In [None]:
import ast
import copy
import json
import os
import random

random.seed(1234)

possible_headers = [
 "You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose. If none of the function can be used, point it out. If the given question lacks the parameters required by the function, also point it out. You should only return the function call in tools call sections. Here is a list of functions in JSON format that you can invoke.\n",
 "You are a function-calling assistant. Your task is to identify and execute the appropriate functions from a given list based on the user's question. If no suitable function is available, specify this. If required parameters are missing, indicate this as well. Return only the function call in the specified format. Here is the list of available functions in JSON format",
 "Imagine you are an AI designed to call functions. Given a question and a set of functions, your role is to make the necessary function calls. If a function cannot be used, state this. If parameters are missing, mention it. Here are the available functions\n",
 "You are an AI agent specialized in executing function calls. Your mission is to interpret questions and determine the correct functions to execute from a provided list. If no function applies, or if parameters are missing, you must indicate this. Below are the functions you can call\n",
 "You are an intelligent agent capable of invoking functions based on user queries. Given a question and a list of functions, your task is to identify and execute the appropriate functions. If no function is suitable, specify this. If required parameters are missing, indicate this as well. Return only the function call in the specified format. Here is the list of available functions in JSON format\n",
 "As an AI assistant, you are tasked with determining the appropriate function calls based on a question and a list of available functions. If no function can be used, or if parameters are missing, indicate this. Return only the function calls in the specified format. Functions are detailed in JSON format."
]
rejection_prompts = [
 "I'm sorry, but after reviewing the available tools, I couldn't find a function that suits your request. Please provide more information or specify a different function. If you need assistance with anything else, feel free to ask.",
 "[]"
]
def process_system_turn(j):
    # Randomly decode the tool list so the prompt sometimes shows the raw JSON
    # string and sometimes the Python repr of the parsed list
    if random.choice([0,1]) == 0:
        j["tools"] = json.loads(j["tools"])
    header = random.choice(possible_headers)
    tools = json.dumps(j["tools"], indent=4) if isinstance(j["tools"], dict) else j["tools"]
    if isinstance(tools, list):
        tools = str(tools)
    # Randomly place the output-format instructions after or before the tool list
    if random.choice([0,1]) == 0:
        system = header + "\n" + tools + "" + '\n' + """If you decide to invoke any of the function(s), put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\nYou SHOULD NOT include any other information in the response."""
    else:
        system = header + """If you decide to invoke any of the function(s), put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\nYou SHOULD NOT include any other information in the response.\n""" + "\n" + json.dumps(j["tools"], indent=4) + ""

    return system

def put_system_ito_user(j):
    # Move the system prompt into the first user turn and clear the system field (in place)
    system = j["system"]
    j["system"] = ""
    j["conversations"][0]["value"] = system + '\n\n' + j["conversations"][0]["value"]

    return j

def get_all_functions(jlines, arg="tools"):
    # Collect the tool definitions of every record into one flat list
    functions = []
    for j in jlines:
        f = get_functions(j, arg)
        functions += f
    return functions

def get_functions(j, arg="tools"):
    # The field may hold a Python literal or a JSON string; try both
    try:
        f = ast.literal_eval(j[arg])
    except Exception:
        f = json.loads(j[arg])
    if not isinstance(f, list):
        f = [f]

    return f

def process_function(functions):
    # Answers may be stored as a Python literal or as JSON; try both
    try:
        functions = ast.literal_eval(functions)
    except Exception:
        try:
            functions = json.loads(functions)
        except Exception:
            print(functions)
            return None

    outputs = []
    for function in functions:
        out = ""
        name = function["name"]
        out += name + "("
        try:
            arguments = json.loads(function["arguments"]) if isinstance(function["arguments"], str) else function["arguments"]
            if len(arguments) == 0:
                # No-argument call: record it and keep processing the remaining calls
                outputs.append(out + ")")
                continue

            for arg, v in arguments.items():
                if isinstance(v, str):
                    out += arg + "=" + '"' + str(v) + '", '
                else:
                    out += arg + "=" + str(v) + ', '
            out = out[:-2] + ")"
            outputs.append(out)
        except Exception:
            print("sec error", function)

    return "[" + ", ".join(outputs) + "]"

def write_nemo_datasetfile(json_objects, output_folder, rejection_rate=0.3, train_ratio=0.95):
    all_function = get_all_functions(json_objects, "tools")
    os.makedirs(output_folder, exist_ok=True)
    # augmentation: add more functions to increase difficulty
    for j in json_objects:
        tools = json.loads(j["tools"])
        n = random.choice([0]*10 + [i for i in range(10)])
        aug_f = random.sample(all_function, n)
        diff_f = [f for f in aug_f if f not in tools]
        tools += diff_f
        random.shuffle(tools)
        j["tools"] = json.dumps(tools)

    # augmentation: add rejection
    rejs = []
    for j in random.sample(json_objects, int(rejection_rate * len(json_objects))):
        tools = json.loads(j["tools"])
        n = len(tools)
        aug_f = random.sample(all_function, n)
        diff_f = [f for f in aug_f if f not in tools]
        tools = diff_f
        if len(tools) == 0:
            continue
        new_j = copy.deepcopy(j)
        new_j["tools"] = json.dumps(tools)
        new_j["rejection"] = True
        rejs.append(new_j)

    # Adding the rejections to the list
    json_objects += rejs
    output = []
    for j in json_objects:
        d = {}
        d["system"] = process_system_turn(j)
        d["mask"] = "User"
        if j.get("rejection", False):
            answer = random.choice(rejection_prompts)
        else:
            answer = process_function(j["answers"])

        if answer is None:
            continue
        q = j["query"]
        d["conversations"] = [{"from":"User", "value": q}, {"from":"Assistant", "value": answer}]

        # Keep two variants of each sample: one with a separate system prompt and
        # one with the system prompt folded into the first user turn. The copy
        # prevents put_system_ito_user from modifying the first variant in place.
        output.append(d)
        output.append(put_system_ito_user(copy.deepcopy(d)))

    # Split into train/val set
    split_index = int(len(output) * train_ratio)
    random.shuffle(output)
    train_objects = output[:split_index]
    val_objects = output[split_index:]

    with open(f'{output_folder}/training.jsonl', 'w') as f:
        for obj in train_objects:
            f.write(json.dumps(obj) + '\n')
    with open(f'{output_folder}/validation.jsonl', 'w') as f:
        for obj in val_objects:
            f.write(json.dumps(obj) + '\n')
    print(f'Saved training.jsonl and validation.jsonl to {output_folder}.')


We keep the first 10,000 records of the raw dataset and split them into a training set and a validation set using a 95%/5% ratio:

In [None]:
import os

train_ratio = 0.95
with open("xlam.jsonl") as f_input:
    all_objects = [json.loads(line) for line in f_input][:10000]

write_nemo_datasetfile(all_objects, 'xlam_dataset', train_ratio=train_ratio)



In [None]:
!ls xlam_dataset
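A quick way to sanity-check the generated files is to count the records and the roles they contain. The sketch below does this for any JSONL file in the chat format; `summarize_chat_jsonl` is a hypothetical helper, and the demo writes a one-record temporary file so that it runs on its own. In the notebook you would point it at `xlam_dataset/training.jsonl` instead.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_chat_jsonl(path):
    """Return (number of records, role -> turn count) for a chat-format JSONL file."""
    n_records = 0
    roles = Counter()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            n_records += 1
            roles.update(turn["from"] for turn in record["conversations"])
    return n_records, dict(roles)


# Self-contained demo on a tiny temporary file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "training.jsonl"
    demo_record = {
        "mask": "User",
        "system": "",
        "conversations": [
            {"from": "User", "value": "Fetch the season statistics for player 67890."},
            {"from": "Assistant", "value": "[player_statistics_seasons(player_id=67890)]"},
        ],
    }
    demo_path.write_text(json.dumps(demo_record) + "\n")
    summary = summarize_chat_jsonl(demo_path)

print(summary)
```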

#### Option 2: Use the Multi-Turn Function-Calling Dataset

For multi-turn function-calling dataset construction, the process is similar to constructing a single-turn function-calling dataset. The only difference is that we need to add one more role, 'Function', to represent the function-calling return. Let's take the data below as an example.

In [None]:
{
 "chat": [
 {
 "role": "user",
 "content": "Is there overnight lending available?"
 },
 {
 "role": "assistant",
 "func_call": {
 "function": "strategy_query",
 "params": {
 "term": "overnight"
 }
 }
 },
 {
 "role": "assistant",
 "func_return": {
 "strategy_query": [
 {
 "product": "lending",
 "term": "overnight",
 "amount": "1 billion",
 "interest_rate": "2.0%"
 }
 ]
 }
 },
 {
 "role": "assistant",
 "content": "Yes,1 billion,2.0%. Are you interested?"
 },
 {
 "role": "user",
 "content": "2.0% is too high. I have to think about it."
 },
 {
 "role": "assistant",
 "func_call": {
 "function": "transaction_cancel",
 "params": {}
 }
 },
 {
 "role": "assistant",
 "func_return": {
 "response": "The transaction has been cancelled."
 }
 }
 ]
}

We transform the original data into the NeMo chat format shown below. Note that, compared to the single-turn function-calling dataset, we add the 'Function' role in conversations to record the function-calling execution result. We should also mask out the user and function roles to prevent them from participating in loss calculation.

In [None]:
{
 "mask": "User,Function",
 "system": "Answer the following questions as best you can. You have access to the following tools:\n\ninquiry: Call this tool to interact with the inquiry API. What is the inquiry API useful for? Check the prices of financial products. Parameters: [{'name': 'product', 'type': 'string', 'description': 'Product type', 'required': True},\n {'name': 'term', 'type': 'string', 'description': 'Term', 'required': True},\n {'name': 'amount', 'type': 'string', 'description': 'Transaction amount', 'required': True},\n {'name': 'interest_rate', 'type': 'string', 'description': 'Interest rate', 'required': True}]\nstrategy_query: Call this tool to interact with the strategy_query API. What is the strategy_query API useful for? Query the trading strategy for financial products. Parameters: [{'name': 'product', 'type': 'string', 'description': 'Product type', 'required': True},\n {'name': 'term', 'type': 'string', 'description': 'Term', 'required': True}]\ntransaction_confirm: Call this tool to interact with the transaction_confirm API. What is the transaction_confirm API useful for? Confirm the transaction. Parameters: [{'name': 'product', 'type': 'string', 'description': 'Product type', 'required': True},\n {'name': 'term', 'type': 'string', 'description': 'Term', 'required': True},\n {'name': 'amount', 'type': 'string', 'description': 'Transaction amount', 'required': True},\n {'name': 'interest_rate', 'type': 'string', 'description': 'Interest rate', 'required': True}]\ntransaction_cancel: Call this tool to interact with the transaction_cancel API. What is the transaction_cancel API useful for? Cancel the transaction. Parameters: []\n\nOutput the following format:\n\nAction: the action to take, should be one of [inquiry, strategy_query, transaction_confirm, transaction_cancel]\nAction Input: the input to the action",
 "conversations": [
 {
 "from": "User",
 "value": "Is there overnight lending available?"
 },
 {
 "from": "Assistant",
 "value": "Action: strategy_query\\nAction Input: {\\n \"term\": \"overnight\"\\n}"
 },
 {
 "from": "Function",
 "value": "{\"error\": \"\", \"response\": {\\n \"product\": \"lending\",\\n \"term\": \"overnight\",\\n \"amount\": \"1 billion\",\\n \"interest_rate\": \"2.0%\"\\n}}"
 },
 {
 "from": "Assistant",
 "value": "Yes, 1 billion, 2.0%. Are you interested?"
 },
 {
 "from": "User",
 "value": "2.0% is too high. I have to think about it."
 },
 {
 "from": "Assistant",
 "value": "Action: transaction_cancel\\nAction Input: {}"
 },
 {
 "from": "Function",
 "value": "{\"error\": \"\", \"response\": \"The transaction has been cancelled.\"}"
 }
 ]
}
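The transformation from the raw `chat` list to the NeMo chat format can be sketched as follows. This is a simplified illustration rather than the exact script used to produce the example above: `to_nemo_multi_turn` is a hypothetical helper, it serializes each `func_return` payload as-is instead of wrapping it in the `error`/`response` envelope shown above, and the system prompt is passed in by the caller.

```python
import json


def to_nemo_multi_turn(raw, system):
    """Convert a raw multi-turn sample into the NeMo chat format with a Function role."""
    conversations = []
    for turn in raw["chat"]:
        if "func_call" in turn:
            call = turn["func_call"]
            value = f"Action: {call['function']}\nAction Input: {json.dumps(call['params'])}"
            conversations.append({"from": "Assistant", "value": value})
        elif "func_return" in turn:
            # Simplification: keep the returned payload unchanged
            conversations.append({"from": "Function", "value": json.dumps(turn["func_return"])})
        else:
            role = "User" if turn["role"] == "user" else "Assistant"
            conversations.append({"from": role, "value": turn["content"]})
    # Mask both the user and the function turns so only assistant turns are trained on
    return {"mask": "User,Function", "system": system, "conversations": conversations}


raw_sample = {
    "chat": [
        {"role": "user", "content": "Is there overnight lending available?"},
        {"role": "assistant", "func_call": {"function": "strategy_query", "params": {"term": "overnight"}}},
        {"role": "assistant", "func_return": {"response": "1 billion at 2.0%"}},
        {"role": "assistant", "content": "Yes, 1 billion, 2.0%. Are you interested?"},
    ]
}

converted = to_nemo_multi_turn(raw_sample, system="You have access to the following tools: ...")
print([turn["from"] for turn in converted["conversations"]])
```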

For the rest of the notebook, we will proceed with Option 1, the single-turn function-calling dataset, since the dataset is publicly available on Hugging Face.

## Step 2. Use NeMo-Run with NeMo2 Recipe

After transforming the datasets, we should split and save the datasets into `training.jsonl`, `validation.jsonl` and `test.jsonl` under a folder. We can now start fine-tuning using NeMo-Run and assign the datasets directory path to `dataset_root` in `nemo_run.Config`. NeMo-Run will automatically tokenize the datasets and save the binary under the same data folder. Despite the different dataset formats, whether it is a single-turn function-calling dataset or a multi-turn function-calling dataset, the training script using NeMo-Run remains the same.

For this tutorial, we will showcase function-calling capabilities using the Baichuan2-7B-Base model. You can see the list of all available models and their recipes [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/index.html).
For Baichuan, we need to install `bitsandbytes` in the container.

In [None]:
# Install Baichuan dependency
!pip install bitsandbytes

### Step 2.1: Auto Download and Convert the Baichuan2 7B model to NeMo2
The Baichuan2 7B model can be automatically downloaded and converted to the NeMo2 format with the following script:

In [None]:
%%writefile import_baichuan2_7b.py
from nemo.collections import llm

if __name__ == '__main__':
    llm.import_ckpt(
        model=llm.Baichuan2Model(config=llm.Baichuan2Config7B()),
        source="hf://baichuan-inc/Baichuan2-7B-Base",
        overwrite=True,
    )

In [None]:
!torchrun import_baichuan2_7b.py

The above script 
- Downloads the Baichuan2 7B model from Hugging Face (if not already downloaded).
- Automatically converts it into the NeMo format.

Note:
- The import must run as a standalone Python script rather than directly in a notebook cell, which is why the cell above writes it to a file and launches it with `torchrun`.
- You need to have access to `baichuan-inc/Baichuan2-7B-Base` [repo on Hugging Face](https://huggingface.co/baichuan-inc/Baichuan2-7B-Base).

The conversion will create a `baichuan-inc/Baichuan2-7B-Base` folder in the default `$NEMO_HOME/models` directory. 
`$NEMO_HOME` centralizes and stores all models and datasets used for NeMo training. By default, `$NEMO_HOME` points to `/root/.cache/nemo`.

### Step 2.2: Fine-Tuning Baichuan2 7B Using the Function-Calling Dataset

For this step we use the NeMo 2 predefined recipe. 

First we define the recipe and executor for using NeMo 2. The predefined recipe uses LoRA fine-tuning to run on one A6000 GPU. If you would like to perform full-parameter fine-tuning, you can set `peft_scheme=None`. You can also use larger or smaller models depending on your needs and compute resources.


In [None]:
import nemo_run as run
from nemo.collections import llm

def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.recipes.baichuan2_7b.finetune_recipe(
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme='lora',
    )
    return recipe

def local_executor_torchrun(devices: int = 1) -> run.LocalExecutor:
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")
    return executor

You can learn more about NeMo Executor [here](https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md).



In [None]:
# Instantiate the recipe
# Make sure you set the gpus_per_node as expected
recipe = configure_recipe(gpus_per_node=8) 

Now, we modify the recipe to use our function-calling chat dataset. For this tutorial, we will only train for 40 steps. You can adjust other hyperparameters as needed. We launch training with NeMo-Run's local executor.

In [None]:
recipe.resume.restore_config.path = "nemo://baichuan-inc/Baichuan2-7B-Base"
recipe.data = run.Config(
 llm.ChatDataModule,
 dataset_root="xlam_dataset",
 seq_length=4096,
 micro_batch_size=1,
 global_batch_size=32,
)
recipe.trainer.limit_val_batches = 0
recipe.trainer.max_steps = 40
recipe.log.use_datetime_version = False
recipe.log.explicit_log_dir = 'chat_sft_function_calling_demo'
# adjust other hyperparameters as needed
# for example:
# recipe.optim.config.lr = 1e-6
# recipe.trainer.strategy.tensor_model_parallel_size = 2
# recipe.log.ckpt.save_top_k = 3

executor = local_executor_torchrun(devices=recipe.trainer.devices)
run.run(recipe, executor=executor)

When the training finishes, you should see the logs and find the final checkpoint location:

In [None]:
!ls chat_sft_function_calling_demo/checkpoints/
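The batch-size settings above determine how many samples the run consumes. With `max_steps = 40` and `global_batch_size=32`, training processes 40 × 32 = 1280 samples. The sketch below works through this arithmetic along with the implied gradient accumulation, using the standard relationship that the global batch equals the micro batch times the data-parallel size times the number of accumulation steps. The function names are illustrative, and the example assumes a data-parallel size of 8 (eight GPUs with no tensor or pipeline parallelism), which may differ from your setup.

```python
def gradient_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Number of micro-batches each data-parallel rank processes per optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * data_parallel_size")
    return global_batch_size // samples_per_pass


def consumed_samples(max_steps, global_batch_size):
    """Total number of training samples seen after max_steps optimizer steps."""
    return max_steps * global_batch_size


accumulation = gradient_accumulation_steps(global_batch_size=32, micro_batch_size=1, data_parallel_size=8)
total = consumed_samples(max_steps=40, global_batch_size=32)
print(accumulation, total)
```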

## Step 3. Evaluate the Trained Model

After successfully training a checkpoint, we should evaluate the effectiveness of the trained model. First, as a sanity check, we can quickly check the trained model performance via NeMo in-framework inference. 

### Run NeMo Framework Inference


In [None]:
%%writefile nemo_inference.py

import torch.distributed
from megatron.core.inference.common_inference_params import CommonInferenceParams
import nemo.lightning as nl
import re

strategy = nl.MegatronStrategy(
 tensor_model_parallel_size=1,
 pipeline_model_parallel_size=1,
 context_parallel_size=1,
 sequence_parallel=False,
 setup_optimizers=False,
 store_optimizer_states=False,
)

trainer = nl.Trainer(
 accelerator="gpu",
 devices=1,
 num_nodes=1,
 strategy=strategy,
 plugins=nl.MegatronMixedPrecision(
 precision="bf16-mixed",
 params_dtype=torch.bfloat16,
 pipeline_dtype=torch.bfloat16,
 autocast_enabled=False,
 grad_reduce_in_fp32=False,
 ),
)

source = {
 "mask": "User",
 "system": "",
 "conversations": [
 {
 "from": "User",
 "value": "Imagine you are an AI designed to call functions. Given a question and a set of functions, your role is to make the necessary function calls. If a function cannot be used, state this. If parameters are missing, mention it. Here are the available functions\n\n[{\"name\": \"player_statistics_seasons\", \"description\": \"Fetch the season statistics for a given player using the SofaScores API.\", \"parameters\": {\"player_id\": {\"description\": \"The unique identifier for the player whose statistics are to be fetched.\", \"type\": \"int\", \"default\": \"12994\"}}}, {\"name\": \"matchstreakodds\", \"description\": \"Fetch odds data related to streaks for a specific football match using its ID.\", \"parameters\": {\"is_id\": {\"description\": \"The ID of the match for which the streaks odds data is to be retrieved.\", \"type\": \"int\", \"default\": 10114139}}}]\nIf you decide to invoke any of the function(s), put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\nYou SHOULD NOT include any other information in the response.\n\nFetch the season statistics for player with ID 67890."
 },
 ]
}
special_tokens = {
 "system_turn_start": "",
 "turn_start": "",
 "label_start": "",
 "end_of_turn": "\n",
 "end_of_name": "\n",
 }
from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset import _get_header_conversation_type_mask_role
# Apply prompt template to be the same format as training
header, conversation, data_type, mask_role = _get_header_conversation_type_mask_role(source, special_tokens)
prompts = [conversation]

from nemo.collections.llm import api
results = api.generate(
 path="chat_sft_function_calling_demo/checkpoints/model_name=0--val_loss=0.00-step=39-consumed_samples=1280.0-last",
 prompts=prompts,
 trainer=trainer,
 inference_params=CommonInferenceParams(
 temperature=1.0,
 top_p=0., # greedy decoding
 top_k=1, # greedy decoding
 num_tokens_to_generate=50,
 ),
 text_only=True,
)
if torch.distributed.get_rank() == 0:
    for i, r in enumerate(results):
        print("=" * 50)
        print(prompts[i])
        print("*" * 50)
        match = re.search(r'(.*?)', r, re.DOTALL)
        # Print the matched span only when it is non-empty; otherwise print the full generation
        if match and match.group(0):
            print(match.group(0))
        else:
            print(r)
        print("=" * 50)
        print("\n\n")

In [None]:
!torchrun nemo_inference.py

We can see that the model has correctly generated the function call `[player_statistics_seasons(player_id=67890)]`.
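To compare generated calls with expected ones programmatically, the bracketed output format can be parsed back into function names and arguments. The following is a minimal sketch using Python's `ast` module; `parse_function_calls` is a hypothetical helper, and it assumes plain function names (no dots) and keyword arguments with literal values.

```python
import ast


def parse_function_calls(text):
    """Parse "[f(a=1), g(b='x')]" into a list of {"name", "arguments"} dicts."""
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a bracketed list of calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
            raise ValueError("expected a plain function call")
        # literal_eval only accepts literal values, so nothing is executed
        arguments = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append({"name": node.func.id, "arguments": arguments})
    return calls


parsed = parse_function_calls("[player_statistics_seasons(player_id=67890)]")
print(parsed)
```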

After the initial verification, we can now convert the checkpoint back to a Hugging Face checkpoint to deploy for inference, perform benchmark testing, and verify on downstream tasks.


### Convert NeMo2 Model to HuggingFace Format

If you're satisfied with the trained model's performance, we can continue. For the benchmark and downstream task assessment in the next two steps, the applications we will use only accept OpenAI API format inference requests. Therefore, we should first convert the saved checkpoint to a Hugging Face checkpoint for further deployment.

In [None]:
%%writefile convert_to_hf.py
from pathlib import Path
from nemo.collections import llm

if __name__ == "__main__":
    # Merge LoRA adapters back to model
    llm.peft.merge_lora(
        lora_checkpoint_path="chat_sft_function_calling_demo/checkpoints/model_name=0--val_loss=0.00-step=39-consumed_samples=1280.0-last",
        output_path="chat_sft_function_calling_demo/checkpoints/model_name=0--val_loss=0.00-step=39-consumed_samples=1280.0-last_merged",
    )

    # Export model to HF format
    llm.export_ckpt(
        path=Path("chat_sft_function_calling_demo/checkpoints/model_name=0--val_loss=0.00-step=39-consumed_samples=1280.0-last_merged"),
        target="hf",
        output_path=Path("chat_sft_function_calling_demo/sft_hf"),
        overwrite=True,
    )

In [None]:
!torchrun convert_to_hf.py

Then, we can follow the steps in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps) to set up an OpenAI-compatible API server, allowing us to deploy the model and handle inference requests efficiently.

### Benchmark the Fine-Tuned Model

To benchmark the function-calling ability of the fine-tuned LLM, you can refer to the [berkeley-function-call-leaderboard](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard). On a converged model, you will get benchmark results similar to those shown below.

...

🔍 Running test: multiple

✅ Test completed: multiple. 🎯 Accuracy: 0.89

🔍 Running test: parallel

✅ Test completed: parallel. 🎯 Accuracy: 0.87

🔍 Running test: parallel_multiple

✅ Test completed: parallel_multiple. 🎯 Accuracy: 0.835

...


### Perform Downstream Task Assessment

If you do not yet have your own agent, we provide an agent demo below for a quick assessment. The prompts in this section come from a different dataset on financial Q&A. 

Since the returns of the function calls are hard-coded in this demo, we recommend you use the conversations below and input them in order:

> Do you have overnight call loan?

> If the interest rate can drop to 1.5%, I will proceed.

> Okay, confirm the transaction.

In [None]:
!pip install -U "qwen-agent[gui,rag,code_interpreter,python_executor]"

Note that you should modify the LLM service configuration according to your model name and server address, such as `http://10.123.123.123:8000/v1`. If you want to try LLMs running on [NIM online](https://build.nvidia.com/explore/discover), you need to apply for a free API key and use the server address `https://integrate.api.nvidia.com/v1`.

In [None]:
%%writefile start_app.py

import os
from qwen_agent.agents import ReActChat
from qwen_agent.gui import WebUI

from qwen_agent.tools.base import BaseTool, register_tool
import json

def init_agent_service():
    llm_cfg = {
        'model': 'nvidia/mistral-nemo-minitron-8b-instruct',
        'model_server': 'https://integrate.api.nvidia.com/v1',  # http://10.137.164.245:8000/v1
        'api_key': "nvapi-YOUR-API-KEY",
    }
    tools = ['inquiry', 'strategy_query', 'transaction_confirm', 'transaction_cancel']
    bot = ReActChat(llm=llm_cfg,
                    name='match transaction agent',
                    description='This agent can help to match transaction.',
                    function_list=tools)
    return bot

@register_tool('inquiry')
class Inquiry(BaseTool):
    description = 'After the initial quote, if the customer negotiates, use this tool to check the prices available for financial products.'
    parameters = [{'name': 'product', 'type': 'string', 'description': 'Product type.', 'required': True},
                  {'name': 'term', 'type': 'string', 'description': 'Term.', 'required': True},
                  {'name': 'amount', 'type': 'string', 'description': 'Transaction amount.', 'required': True},
                  {'name': 'interest_rate', 'type': 'string', 'description': 'Interest rate.', 'required': True},]

    def call(self, params: str, **kwargs) -> str:
        return json.dumps({'term': 'overnight', 'amount': '1 billion', 'interest_rate': '1.5%'},
                          ensure_ascii=False)

@register_tool('strategy_query')
class StrategyQuery(BaseTool):
    description = 'Check the initial quotes for financial products.'
    parameters = [{'name': 'product', 'type': 'string', 'description': 'Product type.', 'required': True},
                  {'name': 'term', 'type': 'string', 'description': 'Term.', 'required': True},]

    def call(self, params: str, **kwargs) -> str:
        return json.dumps({'product': 'call loan', 'term': 'overnight', 'amount': '1 billion', 'interest_rate': '1.6%'},
                          ensure_ascii=False)

@register_tool('transaction_confirm')
class TransactionConfirm(BaseTool):
    description = 'Confirm the transaction.'
    parameters = [{'name': 'product', 'type': 'string', 'description': 'Product type.', 'required': True},
                  {'name': 'term', 'type': 'string', 'description': 'Term.', 'required': True},
                  {'name': 'amount', 'type': 'string', 'description': 'Transaction amount.', 'required': True},
                  {'name': 'interest_rate', 'type': 'string', 'description': 'Interest rate.', 'required': True},]

    def call(self, params: str, **kwargs) -> str:
        return json.dumps({'response': 'success'},
                          ensure_ascii=False)

@register_tool('transaction_cancel')
class TransactionCancel(BaseTool):
    description = 'Cancel the transaction.'
    parameters = []

    def call(self, params: str, **kwargs) -> str:
        return json.dumps({'response': 'success'},
                          ensure_ascii=False)


def app_gui():
    bot = init_agent_service()
    chatbot_config = {
        'prompt.suggestions': ['Do you have overnight call loan?', 'If the interest rate can drop to 1.5%, I will proceed.', 'Okay, confirm the transaction.']
    }
    WebUI(bot, chatbot_config=chatbot_config).run(share=True)


if __name__ == '__main__':
    app_gui()


In [None]:
## Please Run When You're Done!
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)