Question Mining and Exam Creation

Enhancing an LLM with data from our custom repository.

Retrieval-Augmented Generation (RAG) merges the capabilities of pre-trained language models with information retrieval to enhance text generation. It is designed to leverage a vast corpus of text data, enabling it to produce responses that are not only relevant but also rich in detail and contextually accurate. Here’s an overview of how RAG operates:

  • Preprocessing: Indexes a large dataset as a knowledge base.

  • Query Formation: Converts queries into semantic vectors for retrieval.

  • Document Retrieval: Finds relevant documents using nearest neighbor search algorithms.

  • Context Integration: Augments query with retrieved documents for enriched context.

  • Text Generation: Generates informed responses with a pre-trained language model.
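The steps above can be sketched end to end. This is a toy illustration only: real systems use learned embeddings (e.g. a sentence-encoder model) and an approximate nearest-neighbor index, whereas here bag-of-words vectors and cosine similarity stand in for them, and all names and strings are illustrative.

```python
import math
import re
from collections import Counter

def embed(text):
    """Bag-of-words 'embedding' standing in for a learned encoder."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preprocessing: index a small knowledge base.
corpus = [
    "Soroban is a smart contract platform on the Stellar network.",
    "Retrieval augmented generation combines retrieval with generation.",
]
index = [(doc, embed(doc)) for doc in corpus]

# Query formation: convert the query into a vector.
query = "What is Soroban?"
qvec = embed(query)

# Document retrieval: nearest neighbor by cosine similarity.
best_doc, _ = max(index, key=lambda pair: cosine(qvec, pair[1]))

# Context integration: augment the query with the retrieved document,
# ready to be passed to a pre-trained language model for generation.
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
```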

ICanProveIt will draw from three sources:

  • A curated repository, maintained by academics, containing current and valid material from which to mine questions.

  • A vector database that contains embeddings from the document repository.

  • The document uploaded by the learner.

Human-in-the-Loop Curation and Fine-Tuning

Pre-trained language models are tuned using the documents in the repository to enhance their ability to create questions on a topic that are not only rich in detail but also contextually accurate.

We won't go into our tuning methodology here, but recent frameworks for tuning, including human-in-the-loop approaches, are listed below.

As part of the project we will ingest Soroban documents and add them to our curated repository (Figure 2).

Learners will upload documents on the Soroban blockchain. Questions for the exam will not be generated from the uploaded document but from documents in our curated repository.

Creating the Exam

Advanced semantic analysis is conducted on the documents in our curated repository. Document management uses a pipeline of rerankers to provide enhanced contextual understanding of the curated documents.

Use of a customized reranker model allows for a nuanced understanding of document relevance, enhancing the LLM's context with information about the relative importance of each document. The refined input highlights the ranked relevance of the retrieved context to the uploaded document.
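A sketch of what such a reranked, relevance-annotated input could look like. The score field, prompt layout, and document titles are assumptions for illustration, not the project's actual format or reranker output.

```python
# Hypothetical (title, reranker score) pairs; in practice the scores
# would come from a customized reranker model.
docs = [
    ("Stellar network consensus notes", 0.61),
    ("Soroban contract lifecycle overview", 0.92),
    ("General blockchain glossary", 0.34),
]

# Sort by reranker score, highest relevance first.
ranked = sorted(docs, key=lambda d: d[1], reverse=True)

# Annotate each document with its rank and relevance before it is
# handed to the LLM as context.
context = "\n".join(
    f"[rank {i + 1} | relevance {score:.2f}] {title}"
    for i, (title, score) in enumerate(ranked)
)
prompt = (
    "Use the context below, weighted by relevance, to generate exam "
    "questions on the uploaded document's topic.\n\n" + context
)
```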

In addition to topical matching, we use the well-known pedagogical standard, Bloom's Taxonomy, for categorizing objectives in the curated materials.
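As a minimal sketch of this kind of categorization, question stems can be tagged with a Bloom's Taxonomy level by matching their verbs. The verb lists follow the commonly cited taxonomy levels; the matching function itself is illustrative of the idea, not our pipeline's implementation.

```python
# Characteristic verbs for each Bloom's Taxonomy level (commonly cited
# examples; the lists are not exhaustive).
BLOOM_VERBS = {
    "remember":   {"define", "list", "recall", "state"},
    "understand": {"explain", "summarize", "describe", "classify"},
    "apply":      {"use", "implement", "solve", "demonstrate"},
    "analyze":    {"compare", "contrast", "differentiate", "examine"},
    "evaluate":   {"justify", "critique", "assess", "defend"},
    "create":     {"design", "construct", "formulate", "propose"},
}

def bloom_level(question):
    """Return the first Bloom level whose verb appears in the question."""
    words = set(question.lower().replace("?", "").split())
    for level, verbs in BLOOM_VERBS.items():
        if words & verbs:
            return level
    return "unclassified"
```

For example, `bloom_level("Compare Soroban contracts with Ethereum contracts")` would be tagged as analysis, while `bloom_level("Define RAG")` falls under remembering.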
