NeMo Retriever Synthetic Data Generation (SDG) is designed to streamline the creation of high-quality evaluation datasets for Text QA retrieval use cases. By leveraging existing enterprise data, this pipeline enables rapid generation of relevant evaluation datasets, facilitating improved model performance.
This version supports the generation of evaluation datasets, creating synthetic benchmark datasets compatible with commonly used evaluation frameworks such as BEIR. Synthetic training dataset generation will be supported in an upcoming version.
NeMo Retriever SDG can be run either from the command line, or using the notebook example provided in this repository. Check the Prerequisites section for instructions on generating an API key and installing libraries. To get started with the notebook, follow the Notebook Quick Start instructions. Otherwise, follow the CLI Quick Start section.
- Quickly generate complex QA datasets from existing text documents for retriever model evaluation.
- Output datasets can be formatted in SQuAD (Stanford Question Answering Dataset) or BEIR (Benchmarking Information Retrieval) format for easy integration into evaluation workflows.
- Designed to integrate seamlessly with NVIDIA NeMo Evaluator microservice, currently in early access.
In order to use NeMo Retriever SDG, you will need access to NVIDIA’s API Catalog. Go to the NGC Personal Key Manager to generate a Personal Key that will allow you to access AI Foundation Models and Endpoints.
To install the required libraries, navigate to the root directory of the project and run the following command in your notebook or command line:
$ pip install -r requirements.txt
Alternatively, you can use the container nvcr.io/nvidia/pytorch:24.01-py3.
$ docker pull nvcr.io/nvidia/pytorch:24.01-py3
$ docker run -it --rm --gpus all --ipc host --network host -v $(pwd):/workspace nvcr.io/nvidia/pytorch:24.01-py3
/workspace# pip install -r requirements.txt
/workspace# jupyter notebook
Navigate to the quick start notebook and follow the instructions.
The pipeline can be run with datasets in either SQuAD or rawdoc (only text and title) format. To test the pipeline, you can use the provided example data at sample_data_rawdoc.jsonl or sample_data_squad.json.
Navigate to the top level of this project directory and run the following command in your command line. It will take roughly 5-10 minutes.
Tip: If you see the following error message:
ModuleNotFoundError: No module named 'nemo_retriever_sdg'
Try adding PYTHONPATH=. to your command.
Rawdoc format
To use rawdoc format, provide your data in a .jsonl file. The structure of the data should follow this format: {"text": <document>, "title": <title>}.
PYTHONPATH=. python scripts/run_pipeline.py \
api_key=<API Key> \
input_file=$(pwd)/data/sample_data_rawdoc.jsonl \
input_format=rawdoc \
output_dir=$(pwd)/outputs/sample_data_rawdoc
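If your documents are not yet in rawdoc format, the sketch below shows one way to produce such a file. It is only an illustration: the helper name `write_rawdoc_jsonl`, the example documents, and the output file name are hypothetical and not part of this repository. The only thing taken from the format description above is that each line is a JSON object with a `text` and a `title` field.

```python
import json
from pathlib import Path


def write_rawdoc_jsonl(documents, path):
    """Write (title, text) pairs as one JSON object per line (hypothetical helper)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for title, text in documents:
            # Each line carries only the two fields named in the rawdoc format.
            record = {"text": text, "title": title}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical example documents; replace with your own data.
docs = [
    ("Resetting a password", "Open the account portal and choose 'Forgot password'."),
    ("VPN setup", "Install the VPN client, then sign in with your work account."),
]
out_path = write_rawdoc_jsonl(docs, "my_docs_rawdoc.jsonl")
```

The resulting file can then be passed as `input_file` together with `input_format=rawdoc`.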
SQuAD format
To use SQuAD format, provide your data in a .json file. For more information about the expected structure of the data, see the quick start notebook.
PYTHONPATH=. python scripts/run_pipeline.py \
api_key=<API Key> \
input_file=$(pwd)/data/sample_data_squad.json \
input_format=squad \
output_dir=$(pwd)/outputs/sample_data_squad
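For orientation, the sketch below builds a small dataset in the standard SQuAD v1.1 layout (articles, then paragraphs, then question-answer entries with character offsets). It assumes the pipeline accepts that standard layout. The quick start notebook remains the authoritative reference for what the pipeline actually expects, and it may use only a subset of these fields. The helper names and the example content are hypothetical.

```python
import json


def make_answer(context, answer_text):
    """Build a SQuAD answer entry; answer_start is a character offset into context."""
    return {"text": answer_text, "answer_start": context.index(answer_text)}


# Hypothetical example content.
context = "Install the VPN client, then sign in with your work account."
squad_like = {
    "version": "1.1",
    "data": [
        {
            "title": "VPN setup",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q-0001",
                            "question": "What should be installed before signing in?",
                            "answers": [make_answer(context, "the VPN client")],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_answer_offsets(dataset):
    """Return ids of questions whose answer_start does not point at the answer text."""
    bad = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if ctx[start:start + len(ans["text"])] != ans["text"]:
                        bad.append(qa["id"])
    return bad


serialized = json.dumps(squad_like, indent=2)
```

Running `check_answer_offsets` on your own file before starting the pipeline is a cheap way to catch offsets that drifted after text cleaning.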
Edit config.yaml to update the configuration. Predefined configuration files can be found in scripts/conf.
To switch to another config file, use --config-name <config file name>. For example:
PYTHONPATH=. python scripts/run_pipeline.py \
--config-name config-nq.yaml \
api_key=<API Key> \
input_file=$(pwd)/data/nq_test.jsonl \
input_format=rawdoc \
output_dir=$(pwd)/outputs/sample_nq
The default config file, config.yaml, should work best for generating synthetic data for the IT Helpdesk domain. If you'd like to improve the quality of the synthetic data or apply the SDG pipeline to other domains, consider the recipes described below.
We recommend engineering the prompt templates to improve the quality of the synthetic data. In particular, we have observed that Chain-of-Thought prompting produces better generations. The additional config files (config-nq.yaml and config-fiqa.yaml) showcase Chain-of-Thought prompting.
They also showcase in-context learning, in which (passage, query) pairs picked from the datasets are used as few-shot examples. Both methods yield good-quality results.
The default configuration specifies both the embedding-model-as-a-judge and its filter threshold value. As a general rule, lowering the filter threshold makes the generated questions harder, and raising it makes them easier. You can experiment with different threshold values to get more challenging or easier questions in your synthetic datasets.
We verified the quality of the pipeline with the default embedding model on multiple datasets, including FiQA, NQ, and other internal datasets. You can also change the embedding-model-as-a-judge to any embedding model from the Hugging Face Model Hub.
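To illustrate why a lower threshold yields harder questions, here is a minimal sketch of a similarity-based filter. It assumes the filter drops a question when the cosine similarity between its embedding and the passage embedding exceeds the threshold. That mechanism, the function names, and the toy two-dimensional vectors are assumptions made for illustration; the pipeline's actual filter and the exact meaning of its threshold value may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_easy_questions(pairs, threshold):
    """Keep pairs whose question-passage similarity is at or below the threshold."""
    return [
        p for p in pairs
        if cosine_similarity(p["q_emb"], p["p_emb"]) <= threshold
    ]


# Toy embeddings standing in for real model outputs (hypothetical values).
pairs = [
    {"question": "near-copy of the passage", "q_emb": [1.0, 0.0], "p_emb": [1.0, 0.1]},
    {"question": "paraphrased question", "q_emb": [1.0, 1.0], "p_emb": [1.0, 0.0]},
    {"question": "indirect question", "q_emb": [0.2, 1.0], "p_emb": [1.0, 0.0]},
]

kept_at_high_threshold = filter_easy_questions(pairs, threshold=0.8)
kept_at_low_threshold = filter_easy_questions(pairs, threshold=0.5)
```

With the higher threshold only the near-copy question is removed; with the lower threshold the paraphrased question is removed as well, leaving just the question that is least similar to its passage.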
For the Answerability Filter, we recommend keeping the choice provided in the default configuration file. We confirmed that the checkbox-style prompt in the default configuration works well for filtering valid questions.
However, the framework is flexible in the choice of LLM-as-a-Judge, and different LLMs with different prompt templates might work better for certain use cases. You can also experiment with Likert-scale prompting if needed.
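As a starting point for Likert-scale experiments, the sketch below shows what such a judge prompt and its response handling might look like. Everything in it is hypothetical: the template wording, the `Rating: <number>` reply convention, the function names, and the cutoff of 4 are illustrative choices, not the prompts or logic shipped in the default configuration.

```python
import re

# Hypothetical Likert-scale judge prompt; adapt the wording to your LLM.
LIKERT_TEMPLATE = (
    "Passage:\n{passage}\n\n"
    "Question:\n{question}\n\n"
    "On a scale of 1 (not answerable) to 5 (fully answerable), how well can the "
    "question be answered using only the passage? Reply with 'Rating: <number>'."
)


def build_likert_prompt(passage, question):
    """Fill the template with one passage-question pair."""
    return LIKERT_TEMPLATE.format(passage=passage, question=question)


def parse_likert_rating(response):
    """Return the 1-5 rating found in the judge's reply, or None if absent."""
    match = re.search(r"Rating:\s*([1-5])\b", response)
    return int(match.group(1)) if match else None


def is_answerable(response, min_rating=4):
    """Treat a question as answerable when the rating meets the cutoff."""
    rating = parse_likert_rating(response)
    return rating is not None and rating >= min_rating
```

Replies that do not contain a rating in the expected form are treated as not answerable, which errs on the side of discarding questions rather than keeping unverified ones.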
