S&P AI Benchmarks by Kensho
A series of benchmarks that evaluate AI systems, including Large Language Models (LLMs), for business and finance use cases.
S&P AI Benchmarks by Kensho consists of two evaluation sets informed by S&P Global’s world-class data and industry expertise. These benchmarks are designed to assess the ability of LLMs to solve real-world business and finance questions and were developed in collaboration with experts across S&P Global to ensure accuracy and reliability.
Everyone is welcome to sign up and participate, from academic labs and large corporations to independent model developers. The public-facing leaderboards are designed to encourage innovation and collaborative understanding.
Why We Created S&P AI Benchmarks
Although today’s LLMs generally demonstrate strong performance on question-answering (QA) and code generation tasks, they still struggle to reason about quantities and numbers. This limits their usefulness in real-world business and finance applications, which often demand transparent and precise reasoning along with a broad base of technical knowledge.
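To make that difficulty concrete, below is a minimal, hypothetical sketch of the kind of multi-step quantitative question such benchmarks target, paired with a simple relative-tolerance answer check. The question, the 1% tolerance, and the is_correct helper are illustrative assumptions, not Kensho’s actual evaluation harness.

```python
# Hypothetical illustration of a multi-step finance question and a
# relative-tolerance answer check. Not the official S&P AI Benchmarks
# evaluation code; all names and thresholds here are assumptions.

def is_correct(predicted: float, gold: float, rel_tol: float = 0.01) -> bool:
    """Accept a numeric answer within 1% relative tolerance of the gold value."""
    if gold == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - gold) / abs(gold) <= rel_tol

# A CAGR question requires extracting two quantities and chaining
# division and exponentiation -- the kind of arithmetic LLMs often fumble.
question = (
    "Revenue grew from $2.0B in 2019 to $3.2B in 2023. "
    "What was the compound annual growth rate?"
)
gold_answer = (3.2 / 2.0) ** (1 / 4) - 1  # ~0.1247, i.e. ~12.47%

model_answer = 0.125  # e.g., parsed from a model's free-text response
print(is_correct(model_answer, gold_answer))  # True: within 1% of gold
```

Even a short question like this chains several exact operations, so free-text answers are usually parsed to a number and scored against a tolerance rather than matched as strings.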
Existing benchmarks for these domains include tasks such as sentiment analysis, text classification, or named-entity extraction. With S&P AI Benchmarks, we’ve created rigorous and challenging tasks that are rooted in realistic use cases for business professionals. Our goal is to build trustworthy, objective evaluation sets to encourage the development of better models for business and finance.
To learn more, read our latest research papers.
“BizBench: A Quantitative Reasoning Benchmark for Business and Finance” (ACL 2024)
“DocFinQA: A Long-Context Financial Reasoning Dataset” (ACL 2024)