In our quest to decode annual reports (see here), everything begins, as usual, with proper data curation and preparation. The challenge in building more precise question-answering systems, as in any AI-related problem, is the scarcity of labeled data and of the resources needed to produce it.
Traditionally, we’ve needed people to craft and review data manually.
Large Language Models (LLMs) have proven themselves increasingly trustworthy at understanding text and answering questions about it, as shown in this recent popular paper covering RAG and long-context windows. This opens new possibilities for synthetic data generation, as described in this survey article.
In this post, we’ll explore how to leverage an LLM, specifically Anthropic’s Claude Sonnet, to generate question-answer pairs from parts of financial documents. Along the way, we’ll also learn how to benefit from Amazon Bedrock’s Converse API to ensure systematic, structured output.
Key points we’ll cover:
- Splitting a 10-K report into text chunks with langchain
- Prompting Claude Sonnet to generate a question-answer pair per chunk
- Enforcing structured JSON output with Amazon Bedrock’s Converse API and its tool configuration
- Scaling the generation to hundreds of pairs, and the current limitations
These synthetically generated questions will later help us in training the adapter, forming the foundation of our improved financial Q&A system.
Most of the code can be found within this part of the companion repository.
Here is how we are going to proceed.
I implemented the SyntheticDataGenerator class, which covers the general workflow. Let’s review the main parts of the code:
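Before diving in, here is a hypothetical, abridged skeleton of the class, reconstructed from the methods discussed in this post (the actual implementation lives in the companion repository):

class SyntheticDataGenerator:
    """Generates synthetic Q&A pairs from 10-K text chunks (abridged sketch)."""

    def __init__(self, bedrock_client, model_id, chunks):
        self.client = bedrock_client   # boto3 "bedrock-runtime" client (assumed constructor args)
        self.model_id = model_id       # e.g. a Claude Sonnet model id
        self.chunks = chunks           # text chunks extracted from the report

    def build_base_prompt(self, chunk):
        """Builds the user prompt for a given text chunk (shown below)."""

    def atomic_invoke(self, prompt):
        """Single call to the Converse API with the tool configuration."""

    def generate_one_pair(self, chunk):
        """Generates one question-answer pair from a chunk."""

    def validate_output(self, pair):
        """Basic sanity checks on the generated pair."""

    def generate_pairs(self, n):
        """Generates n pairs, prioritizing chunks used less often."""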
For the data, I used the publicly accessible EDGAR system from the U.S. Securities and Exchange Commission to get Amazon’s annual report (10-K) for fiscal year 2023 in PDF format. Nothing fancy, right? :-) Next, how are we going to divide our document into text parts, so that each part can contain a question worth answering?
This is where I leverage langchain, a framework that allows us to rapidly prototype LLM applications.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

def read_10k_earnings(document_path):
    # Load the PDF (text only, no image extraction)
    loader = PyPDFLoader(document_path, extract_images=False)
    # Split on newlines into ~1024-character chunks with a 100-character overlap
    splitter = CharacterTextSplitter(separator="\n", chunk_size=1024, chunk_overlap=100)
    pages = loader.load_and_split(splitter)
    return [p.page_content for p in pages]
For the sake of the example, I used a very basic splitting strategy: every chunk holds roughly 1,024 characters, with a 100-character overlap between consecutive chunks. In our case, this produced 340 text chunks.
Now, for each chunk, we are going to generate at least one question with the Claude Sonnet LLM. We’ll distinguish the system prompt from the user prompt.
The system prompt sets the overall context, role, and behavior for the LLM:
You are a helpful expert in financial information retrieval. You excel at asking questions from 10k earnings that are diverse and know how to answer them with 10k earning documents. You know how to minimize bias. Your goal is to ask realistic and specific questions that can likely be asked for automated financial reporting.
The user prompt, on the other hand, contains the specific input the LLM should respond to:
Based on the text chunk below, generate one question - answer pair that has a question, alongside a relevant passage from the text chunk. <text_chunk> {chunk} </text_chunk> Question needs to be specific.
In Python, it goes with f-strings:
def build_base_prompt(self, chunk):
    # Embed the text chunk in the user prompt with an f-string
    specified_prompt = f"""
    Based on the text chunk below, generate one question - answer pair that has a question, alongside a relevant passage from the text chunk.
    <text_chunk>
    {chunk}
    </text_chunk>
    Question needs to be specific.
    """
    return specified_prompt
Reliability and reproducibility are key. Enforcing JSON output is crucial for ensuring structured, consistent, and easily parsable responses from the LLM. Thanks to Amazon Bedrock’s Converse API, we can now define a specific JSON schema as part of the tool configuration, which guides the model to generate responses in the desired format. Let’s quickly go through our way of enforcing JSON output in our case.
We create a JSON schema for the desired output structure (a question-answer pair):
pair = {
    'type': 'object',
    'properties': {
        'pair': {
            'type': 'object',
            'required': ['question', 'answer'],
            'properties': {
                'question': {
                    'type': 'string',
                    'description': 'A short question that can be answered from the chunk'
                },
                'answer': {
                    'type': 'string',
                    'description': 'a sentence solely from the document that accurately answers the question'
                }
            }
        }
    }
}
Finally, we wrap it in a toolSpec object with a name and a description. In Python, it goes this way, assuming we stored the previous JSON schema in a dictionary variable called pair:
desc = """Generate a pair of question and answer, based on a text chunk. It must include:
- a question that can be answered from a document
- a passage from the text chunk that accurately answers the question, called 'answer'
"""

tool_spec = {
    "toolSpec": {
        "name": "pair_generator",
        "description": desc,
        "inputSchema": {"json": pair}
    }
}
In a nutshell, the rest happens pretty naturally: we pass the tool specification through the toolConfig argument of the converse method… And that’s it! The LLM will now generate a question-answer pair.
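As a minimal sketch of that call (the client setup, model ID, and response parsing here are my assumptions, not necessarily the exact code from the repository):

import boto3

# Assumed client setup and model id; adjust to your region and model access
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"  # assumed Claude Sonnet id

def atomic_invoke(system_prompt, user_prompt):
    # Pass the tool specification through the toolConfig argument of converse,
    # and force the model to call it with toolChoice
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[{"text": system_prompt}],
        messages=[{"role": "user", "content": [{"text": user_prompt}]}],
        toolConfig={
            "tools": [tool_spec],
            "toolChoice": {"tool": {"name": "pair_generator"}},
        },
    )
    # The structured output comes back as a toolUse content block
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]["pair"]
    return None

Forcing toolChoice on the pair_generator tool is what pushes the model to return an input matching our JSON schema, rather than free-form text.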
In my case, I used Anthropic’s Claude Sonnet via Amazon Bedrock’s Converse API. Here is an example of a generated pair:
{
'question': 'What is the time period over which Judith McGrath adopted a trading plan to sell up to 5,760 shares of Amazon.com, Inc. common stock?',
'answer': 'On November 27, 2023, Judith McGrath, Director, adopted a trading plan intended to satisfy Rule 10b5-1(c) to sell up to 5,760 shares of Amazon.com, Inc. common stock over a period ending on March 8, 2024, subject to certain conditions.'
}
Now, what about generating 500 question-answer pairs?
Still within the SyntheticDataGenerator class, the generate_pairs method is designed to create a specified number of question-answer pairs from the given document chunks. Here’s how it works:
- It starts with a target number of pairs to generate (N).
- It keeps track of how many times each chunk has been used to generate questions.
- To ensure diversity, it prioritizes chunks that have been used less often.
- For the selected chunk, it calls generate_one_pair, which in turn uses the atomic_invoke method to create a Q&A pair.
- It has a maximum number of attempts (2N) to avoid getting stuck if generation becomes difficult.
This approach aims to create a diverse set of Q&A pairs while efficiently using all parts of the document and managing potential generation challenges.
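As a rough sketch of that selection loop (the exact bookkeeping may differ from the repository code):

import random

def generate_pairs(self, n):
    # Sketch only: prioritize least-used chunks, cap attempts at 2N
    pairs = []
    counts = [0] * len(self.chunks)   # how many times each chunk has been used
    attempts = 0
    while len(pairs) < n and attempts < 2 * n:
        attempts += 1
        # Pick a chunk among those used the least so far
        least_used = min(counts)
        candidates = [i for i, c in enumerate(counts) if c == least_used]
        idx = random.choice(candidates)
        counts[idx] += 1
        # generate_one_pair calls atomic_invoke under the hood
        pair = self.generate_one_pair(self.chunks[idx])
        if pair is not None:
            pairs.append(pair)
    return pairs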
Here is a non-exhaustive list of limitations:
- The prompt has no memory: build_base_prompt could be extended to include previously asked questions, so the model avoids near-duplicates.
- Question categories could also be added to build_base_prompt, in order to encourage more diverse, targeted questions.
- Output validation is minimal: there is a validate_output method, but the validation should be business-proof (maybe using another LLM as a judge? A minimal sketch follows this list).
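For instance, a minimal LLM-as-a-judge check could reuse the same Converse API; the prompt and parsing below are assumptions on my side, not the repository’s validate_output:

def judge_pair(pair, chunk):
    # Ask a model whether the answer is truly supported by the chunk (assumed prompt)
    prompt = (
        "Answer only YES or NO. Is the following answer fully supported by the text?\n"
        f"<text_chunk>{chunk}</text_chunk>\n"
        f"Question: {pair['question']}\n"
        f"Answer: {pair['answer']}"
    )
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    verdict = response["output"]["message"]["content"][0]["text"]
    return verdict.strip().upper().startswith("YES")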
And that’s a wrap! We’ve successfully generated synthetic question-answer pairs using Anthropic’s Claude Sonnet and Amazon Bedrock’s Converse API. This is a first milestone in our journey to create a more precise financial Q&A system. By leveraging the power of LLMs, we’ve already solved a crucial business problem: labeling data.
But, as we know, the journey doesn’t end here. In the next post, we’ll move on to the main course: the world of embedding adapters, and how we can use these synthetic question-answer pairs to train a more accurate and informative financial Q&A system.
Notes: The information provided in this series is for educational purposes only and should not be considered financial or investment advice. Please read our full Investment Disclaimer for more details.