The goal of the LiveRAG Challenge is to allow research teams across academia and industry to advance their RAG research and compare the performance of their systems with that of other teams, using a fixed corpus (derived from the publicly available FineWeb) and a fixed open-source LLM, Falcon3-10B-Instruct.
Selected teams are expected to build a RAG system, applying their own approaches to key elements such as query rewriting, text retrieval, and prompt generation, and integrating with the challenge LLM (Falcon3-10B-Instruct) for answer generation. A stream of questions to be answered will be released during the Live Challenge Day, and participants must submit the generated responses within a limited time window.
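As a rough illustration of how such a pipeline might be wired together, the sketch below assumes access to Falcon3-10B-Instruct through the Hugging Face transformers library (model id tiiuae/Falcon3-10B-Instruct). The rewrite_query and retrieve functions are placeholders for participant-specific components, and the output field names are purely illustrative, not part of the challenge infrastructure.

```python
# Minimal RAG pipeline sketch (illustrative only): retrieve passages,
# build an augmented prompt, and generate an answer with Falcon3-10B-Instruct.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "tiiuae/Falcon3-10B-Instruct"  # challenge LLM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def rewrite_query(question: str) -> str:
    # Placeholder for a participant-specific query-rewriting step.
    return question

def retrieve(query: str, k: int = 5) -> list[str]:
    # Placeholder: plug in your own index or one of the prebuilt indices here.
    return ["<passage retrieved from the challenge corpus>"]

def answer(question: str) -> dict:
    passages = retrieve(rewrite_query(question))
    context = "\n\n".join(passages)
    messages = [
        {"role": "system", "content": "Answer the question using only the provided passages."},
        {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {question}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=300)
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Keep the passages and the exact prompt alongside the answer (needed for upload).
    return {"question": question, "passages": passages, "final_prompt": prompt, "answer": text}
```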
Participants must use the following resources in their RAG system:
Selected participants can either build their own search indices over the challenge dataset or take advantage of the following prebuilt indices, using their allocated credits:
Participants get free access to TII’s DataMorgana, a configurable synthetic Q&A benchmark generator, for training their system prior to the live event. The same system will be used to generate an original test set of Q&As at the live event and for automatic evaluation afterwards.
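As a purely illustrative sketch, the snippet below shows how a locally exported synthetic benchmark could be replayed against a RAG pipeline for offline testing. The JSONL layout and field names are assumptions made for this example, not the actual DataMorgana interface.

```python
# Illustrative only: load a locally saved synthetic Q&A benchmark and run a
# RAG pipeline over it. The JSONL fields ("question", "answer") are assumed.
import json

def load_benchmark(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def run_benchmark(path: str, answer_fn) -> list[dict]:
    # answer_fn is any callable mapping a question string to a result dict
    # with an "answer" field (e.g., the answer() sketch above).
    results = []
    for item in load_benchmark(path):
        generated = answer_fn(item["question"])
        results.append({
            "question": item["question"],
            "reference_answer": item.get("answer"),
            "generated_answer": generated["answer"],
        })
    return results
```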
The Live Challenge Day will be hosted on a Hugging Face competition space. Participants must register for one of the sessions of the Live Challenge Day. When their session starts, they will be provided with a list of several hundred questions and, within a predefined time window, will be expected to upload the answers generated by their RAG solution, together with the list of passages used for prompt augmentation and the associated prompt fed to Falcon to generate each answer (for verification purposes at a later stage).
Responses must be uploaded within the predefined time window, which will be communicated together with the operational instructions. Note that a “dry” test session will be offered a few days before the actual live event to allow participants to check their uploading capabilities.
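For illustration only, one way to package each response for upload is sketched below. The field names and JSONL layout are assumptions made for this example; the authoritative submission format will be given in the operational instructions.

```python
# Illustrative only: write one JSON line per question, carrying the answer,
# the supporting passages, and the exact prompt sent to Falcon3-10B-Instruct
# (kept for later verification). Field names are assumed, not official.
import json

def build_submission(records: list[dict], path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps({
                "question": r["question"],
                "answer": r["answer"],
                "passages": r["passages"],
                "final_prompt": r["final_prompt"],
            }, ensure_ascii=False) + "\n")
```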
Answers will be judged according to two metrics:
Relevance: Measures the correctness and relevance of the answer to the question, on a four-point scale.
Faithfulness: Assesses whether the response is grounded in the retrieved passages, on a three-point scale.
Both relevance and faithfulness will contribute to the final evaluation score.
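The exact combination of the two metrics is defined by the organizers; the sketch below only illustrates one hypothetical way to aggregate them, with assumed scale ranges and equal weights.

```python
# Illustrative only: one hypothetical way a combined score could aggregate the
# two judgments. The actual scale values and weighting are set by the organizers.
def combined_score(relevance: int, faithfulness: int,
                   rel_max: int = 3, faith_max: int = 2,
                   w_rel: float = 0.5, w_faith: float = 0.5) -> float:
    # Normalize each judgment to [0, 1] and take a weighted average
    # (assumed equal weights).
    return w_rel * (relevance / rel_max) + w_faith * (faithfulness / faith_max)
```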
The evaluation will be performed in two stages: