Master's Thesis / Bachelor's Thesis
    
    An LLM-based approach for mining reproducibility information from scholarly
          publications
        Completion
2024/12
Research Area
Intelligent Information Management
Advisers
            Dr. Sheeba Samuel
            Prof. Dr.-Ing. Martin Gaedke
Description
Reproducibility of results is a cornerstone of scientific
          research, ensuring that findings can be independently verified and built upon. Scholarly
          publications contain critical information necessary for reproducing results, including
          data presented in text, tables, and images. However, manually extracting this
          reproducibility metadata is a challenging and time-consuming task. Leveraging Large
          Language Models (LLMs) to automate the extraction of tabular information offers a powerful
          solution to enhance efficiency and accuracy in this crucial process. 
This thesis aims to develop an LLM-based approach designed to automatically mine
          reproducibility metadata from text in scholarly publications. The solution must ensure
          robust and accurate extraction of reproducibility information. The thesis will
          specifically aim to extract information on deep learning methods from publications to
          understand the reproducibility factor of these methods, since deep learning methods are
          widely used across various domains, from natural language processing to computer vision.
          The key features of the solution will include Sophisticated Table Recognition, where LLMs
          are utilized to accurately extract data from text, capturing detailed information
          necessary for reproducibility. This will be further used to expand the metadata collected
          from the text. The solution will provide a user-friendly interface that allows users to easily
          input academic papers (PDFs or text files), review extracted reproducibility metadata, and
          make necessary adjustments or annotations. The solution should be designed to be easily
          extendable to accommodate other elements such as images and tables, and to integrate with
          additional LLMs and data extraction techniques.
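As a rough illustration of the extraction step described above, the following minimal Python sketch shows one way such a pipeline could be structured. Everything in it is an assumption made for illustration: the field names in `FIELDS`, the `ReproMetadata` class, the prompt wording, and the `llm` callable (a stand-in for whichever model backend is chosen) are hypothetical and are not prescribed by the thesis topic.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical metadata fields; the thesis description does not fix a schema.
FIELDS = ("model_architecture", "dataset", "hyperparameters", "random_seed", "code_url")


@dataclass
class ReproMetadata:
    """Illustrative container for reproducibility metadata of one paper."""
    model_architecture: Optional[str] = None
    dataset: Optional[str] = None
    hyperparameters: dict = field(default_factory=dict)
    random_seed: Optional[int] = None
    code_url: Optional[str] = None


def build_prompt(paper_text: str) -> str:
    """Ask the model for a single JSON object with the assumed keys."""
    keys = ", ".join(FIELDS)
    return (
        "Extract the following reproducibility fields from the paper text "
        f"and answer with a single JSON object using the keys: {keys}. "
        "Use null for anything the text does not state.\n\n"
        f"Paper text:\n{paper_text}"
    )


def parse_response(raw: str) -> ReproMetadata:
    """Parse the model output, tolerating extra prose around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return ReproMetadata()  # no JSON object found
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return ReproMetadata()  # malformed JSON: fall back to empty metadata
    # Keep only known keys with non-null values; unknown keys are ignored.
    known = {k: data[k] for k in FIELDS if data.get(k) is not None}
    return ReproMetadata(**known)


def extract(paper_text: str, llm: Callable[[str], str]) -> ReproMetadata:
    """Run one extraction; `llm` maps a prompt string to a response string."""
    return parse_response(llm(build_prompt(paper_text)))


# Usage with a stub in place of a real model (assumption: no network access).
def fake_llm(prompt: str) -> str:
    return (
        'Here is the result: {"model_architecture": "ResNet-50", '
        '"dataset": "ImageNet", "hyperparameters": {"learning_rate": 0.1}, '
        '"random_seed": null, "code_url": null}'
    )


meta = extract("We train ResNet-50 on ImageNet with learning rate 0.1.", fake_llm)
```

Passing the model as a plain callable is one possible way to keep the tool open to additional LLMs, and a user interface could present the returned `ReproMetadata` object for review and manual correction.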
The objective of this thesis is to analyze the current state of methods for extracting information from scholarly
          publications, identify existing challenges, and develop a comprehensive LLM-based solution
          to extract reproducibility metadata from publication text. This includes designing and
          implementing the software tool, followed by an experimental evaluation through a pilot
          study to demonstrate its effectiveness and usability. By advancing text extraction for
          reproducibility metadata through this LLM-based system, this thesis aims to significantly
          improve the efficiency and accuracy of data extraction processes, thereby enhancing the
          ability of researchers to verify and build upon published scientific results.
                    

