Long-Document QA

This benchmark assesses how well Large Language Models (LLMs) can correctly answer a natural language query by extracting and analyzing relevant information from extremely long and complex documents, such as SEC filings.

The long-document QA evaluation set can help financial professionals understand the relative performance of LLM applications and corresponding RAG architectures.

Methodology

The long-document QA task within S&P AI Benchmarks by Kensho includes 225 questions derived from publicly available financial documents. Each question has been reviewed by at least one domain expert for both legibility and relevance to real-world tasks.

This benchmark specifically assesses an LLM's ability to answer pertinent natural language questions based on contexts extracted from financial documents that are at least 100 pages long. For instance: how well can the LLM find and select the right paragraph or chunk of text from a 500-page SEC filing, and generate the correct answer to the query? It also tests the model's quantitative and reasoning capabilities, i.e., how well it understands the text, picks the relevant values out of it, and performs the arithmetic operations needed to answer the financial question.
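As a concrete illustration of the quantitative step (all figures below are invented for the example and are not taken from any actual filing), answering a percentage question typically means extracting two values from the document and computing a ratio:

# Invented figures, purely to illustrate the arithmetic step;
# not taken from any actual filing.
segment_programming_rights = 4.2e9  # value extracted for the segment in question
total_programming_rights = 6.0e9    # consolidated total extracted from the filing

pct = segment_programming_rights / total_programming_rights * 100
print(f"{pct:.1f}%")  # -> 70.0%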

Many models, especially open-source models with limited context windows, cannot ingest extremely long documents such as SEC filings; as a result, the LLM alone cannot answer a user's query that requires information from the document. Retrieval-Augmented Generation (RAG) is a common technique used alongside the LLM to surface the most relevant parts of a document, enabling analysis of documents that extend past the context window. However, RAG is not the only way for LLMs to accurately extract information and answer questions from long documents; some models with very large context windows (over 150K tokens) may handle long-document QA effectively without a corresponding RAG system.
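As a minimal sketch of the kind of RAG pipeline described above (illustrative only, not the benchmark's reference implementation): the document is split into overlapping chunks, each chunk is embedded, and the chunks most similar to the query are passed to the LLM as context. The embed and generate callables, chunk size, and top-k value are all assumptions standing in for whatever embedding model and LLM API a system actually uses.

# A minimal RAG sketch (illustrative, not the benchmark's reference
# implementation). `embed` and `generate` are placeholders for any
# embedding model and LLM completion API.
from typing import Callable, List

import numpy as np


def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split a long document into overlapping character chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def retrieve(query: str, chunks: List[str],
             embed: Callable[[str], np.ndarray], k: int = 5) -> List[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    mat = np.stack([embed(c) for c in chunks])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    scores = mat @ q  # cosine similarity against every chunk
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]


def answer(query: str, document: str,
           embed: Callable[[str], np.ndarray],
           generate: Callable[[str], str]) -> str:
    """Retrieve the most relevant chunks, then ask the LLM to answer from them."""
    context = "\n\n".join(retrieve(query, chunk_document(document), embed))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate(prompt)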

The questions in this task are formatted as shown below. Answers for questions in this category are expected to be indices corresponding to the provided options.

{
	"id": str,
	"question": What percentage of programming rights assets are
                attributed to Comcast's CCCL Parent for 2015?,
	"context": str,
}
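
Purely as a sketch of how such records might be consumed (the file name, the JSON Lines layout, and the embed/generate callables are assumptions, not part of the benchmark specification):

import json

# Assumed file name and JSON Lines layout; the real evaluation set
# defines the actual format.
with open("long_document_qa.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # A system under test would run the question and context through
        # a pipeline such as the RAG sketch above, then map the generated
        # text to the index of the matching option.
        # prediction = answer(record["question"], record["context"], embed, generate)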

Ready to find your place on the Long-Document QA Leaderboard?