
Challenge Details

The goal of the LiveRAG Challenge is to allow research teams across academia and industry to advance their RAG research and compare the performance of their systems with other teams, on a fixed corpus (derived from the publicly available FineWeb) and a fixed open-source LLM, Falcon3-10B-Instruct.

Challenge Task

Selected teams are expected to build a RAG system, applying their own approach to key elements of the system, such as query rewriting, text retrieval, and prompt generation, and integrating with the challenge LLM (Falcon3-10B-Instruct) for answer generation. A stream of questions to be answered will be released during the Live Challenge Day, and the generated responses must be submitted by participants within a limited time window.
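For illustration, here is a minimal sketch of the answer-generation step with the challenge LLM, assuming the model is loaded locally via Hugging Face Transformers (how the model will actually be served during the challenge is not specified here); the prompt template and generation settings are illustrative assumptions, not a prescribed configuration.

```python
# Minimal sketch: generating an answer with Falcon3-10B-Instruct via Transformers.
# The prompt template and generation settings are illustrative assumptions,
# not the challenge's prescribed configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "tiiuae/Falcon3-10B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_answer(question: str, passages: list[str]) -> str:
    """Build a retrieval-augmented prompt and let Falcon generate the answer."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    messages = [
        {"role": "system", "content": "Answer the question using only the provided passages."},
        {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {question}"},
    ]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```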

Resources

Participants must use the following resources in their RAG system:

  • LLM: Falcon3-10B-Instruct as the common answer generator to be used by all.
  • Dataset: FineWeb-10BT, a 15M-document subset of FineWeb consisting of cleaned and deduplicated English web pages, to be used as the RAG auxiliary repository.

Selected participants can either build their own search indices over the challenge dataset or take advantage of the following prebuilt indices, leveraging their allocated credits:

  • Pinecone dense index: documents are split into sentence-based chunks of 512 tokens each by the LlamaIndex sentence splitter, and each chunk is embedded with the E5-base embedder into a 512-dimensional vector.
  • OpenSearch sparse index: the same chunks are indexed in a BM25-based sparse index on the OpenSearch platform (see the retrieval sketch after this list).
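A minimal sketch of querying the two prebuilt indices follows. Index names, endpoints, credentials, and metadata field names are placeholders, and the exact E5 checkpoint is assumed (the challenge only specifies an "E5-base" embedder); the operational instructions will provide the actual identifiers.

```python
# Minimal sketch: querying the prebuilt dense (Pinecone) and sparse (OpenSearch) indices.
# Index names, hosts, credentials, and metadata field names below are placeholders;
# the E5 checkpoint is an assumption (the challenge only specifies an "E5-base" embedder).
from opensearchpy import OpenSearch
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/e5-base-v2")          # assumed E5-base variant
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
dense_index = pc.Index("fineweb-dense")                        # placeholder index name

os_client = OpenSearch(
    hosts=[{"host": "your-opensearch-endpoint", "port": 443}],  # placeholder endpoint
    http_auth=("user", "password"),
    use_ssl=True,
)

def dense_retrieve(question: str, k: int = 5) -> list[str]:
    """Embed the question with E5 (query prefix) and search the Pinecone index."""
    qvec = embedder.encode("query: " + question).tolist()
    res = dense_index.query(vector=qvec, top_k=k, include_metadata=True)
    return [m["metadata"]["text"] for m in res["matches"]]     # 'text' field assumed

def sparse_retrieve(question: str, k: int = 5) -> list[str]:
    """Run a BM25 match query against the OpenSearch index."""
    res = os_client.search(
        index="fineweb-sparse",                                # placeholder index name
        body={"query": {"match": {"text": question}}, "size": k},
    )
    return [hit["_source"]["text"] for hit in res["hits"]["hits"]]
```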

Participants get free access to TII’s DataMorgana, a configurable synthetic Q&A benchmark generator, for training their systems prior to the live event. The same system will be used to generate an original test set of Q&As at the live event and for automatic evaluation afterwards.


Live Challenge Day

The Live Challenge Day will be hosted on a Hugging Face competition space. Participants must register for one of the sessions of the Live Challenge Day. As their session starts, they will be provided with a list of several hundred questions and will be expected to upload, within a predefined time window, the answers generated by their RAG solution, as well as the list of passages used for prompt augmentation and the associated prompt fed to Falcon to generate each answer (for verification purposes at a later stage).

Responses must be uploaded within the predefined time window, which will be communicated together with the operational instructions. Note that the challenge submission system will offer a “dry” test session a few days before the actual live event to allow participants to check their uploading capabilities.
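The official upload format will be given in the operational instructions; purely as an illustration of the three required elements per question (the answer, the passages used for augmentation, and the prompt sent to Falcon), a submission record might be assembled along these lines, where every field name is hypothetical.

```python
# Purely illustrative: one per-question submission record bundling the three
# required elements. All field names are hypothetical; the official schema
# will be provided with the operational instructions.
import json

def build_record(question_id: str, question: str,
                 passages: list[str], prompt: str, answer: str) -> dict:
    return {
        "question_id": question_id,   # hypothetical field name
        "question": question,
        "passages": passages,         # passages used for prompt augmentation
        "final_prompt": prompt,       # the exact prompt fed to Falcon3-10B-Instruct
        "answer": answer,             # the generated answer to upload
    }

def write_submission(records: list[dict], path: str = "submission.jsonl") -> None:
    """Write all records out, e.g. as JSON Lines, ready for upload."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```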

Evaluation

Answers will be judged according to two metrics:

Relevance

Measures the correctness and relevance of the answer to the question on a four-point scale.

  • 2: The response correctly answers the user question and contains no irrelevant content.
  • 1: The response provides a useful answer to the user question, but may contain irrelevant content that does not harm the usefulness of the answer.
  • 0: No answer is provided in the response (e.g., “I don’t know”).
  • -1: The response does not answer the question whatsoever.

Faithfulness

Assesses whether the response is grounded in the retrieved passages on a three-point scale.

  • 1: Full support; all answer parts are grounded.
  • 0: Partial support; not all answer parts are grounded.
  • -1: No support; none of the answer parts are grounded.

Both relevance and faithfulness will contribute to the final evaluation score.

The evaluation will be performed in two stages:

  1. Auto-eval: we will use an LLM-as-a-judge approach to measure the relevance and faithfulness of participant answers, using Claude-3.5 Sonnet as our judge (a minimal judging sketch follows this list).
  2. Manual-eval: top-scoring solutions will be judged manually by the organizers.
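For intuition, here is a minimal LLM-as-a-judge sketch applying the two rubrics above through the Anthropic API; the model identifier, rubric wording, and output parsing are assumptions, not the organizers' actual evaluation pipeline.

```python
# Minimal sketch of LLM-as-a-judge scoring with Claude-3.5 Sonnet.
# Model identifier, rubric wording, and output parsing are assumptions;
# this is not the organizers' actual evaluation pipeline.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating a RAG system's answer.

Question: {question}
Retrieved passages:
{passages}
Answer: {answer}

Score two metrics:
- relevance: 2 (correct, no irrelevant content), 1 (useful but with harmless
  irrelevant content), 0 (no answer, e.g. "I don't know"), -1 (does not answer).
- faithfulness: 1 (all answer parts grounded in the passages), 0 (partially
  grounded), -1 (not grounded at all).

Reply with JSON only, e.g. {{"relevance": 2, "faithfulness": 1}}."""

def judge(question: str, passages: list[str], answer: str) -> dict:
    """Ask the judge model for relevance and faithfulness scores."""
    prompt = JUDGE_PROMPT.format(
        question=question, passages="\n".join(passages), answer=answer
    )
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",   # assumed model identifier
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text)
```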