Master's Thesis
Integrating Open Large Language Models into the Jupyter Notebook Reproducibility Pipeline
Completion
in progress
Research Area
Intelligent Information Management
Students
Erisa Zaimi
Advisers
Dr. Sheeba Samuel
Description
Large Language Models have demonstrated remarkable capabilities in code understanding, error diagnosis, and automated repair, making them promising tools for augmenting reproducibility analysis pipelines. Open-weight models such as DeepSeek, Mistral, and LLaMA offer the additional advantage of local deployment, which is important for research pipelines that process potentially sensitive or proprietary notebook content. Integrating LLMs into the FAIR Jupyter pipeline (https://doi.org/10.4230/TGDK.2.2.4) could enable new capabilities such as automated error explanation, dependency inference, and natural-language summarization of reproducibility findings, tasks that are currently manual or absent entirely.

The current FAIR Jupyter pipeline captures reproducibility outcomes as structured data but provides limited support for interpreting or explaining those outcomes in human-readable terms. Error messages from failed notebook executions are stored but not analyzed, and the pipeline lacks any mechanism for suggesting fixes or generating natural-language summaries of findings. Manual analysis of error logs is time-consuming and requires deep technical expertise, limiting the pipeline's usability for non-expert stakeholders such as journal editors or policy makers.

This thesis aims to investigate the integration of open-weight LLMs into the FAIR Jupyter pipeline for tasks including automated error classification, fix suggestion generation, and natural-language summarization of reproducibility reports. It will benchmark multiple open models (e.g., DeepSeek, Mistral, LLaMA) on a curated set of notebook error logs and evaluate the quality, accuracy, and usefulness of their outputs. The thesis will also design and prototype a pipeline module for LLM-assisted analysis that can be run locally without reliance on external APIs.
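One possible shape for the error-classification task is sketched below: a prompt template that presents a notebook error log to a locally deployed model together with a closed category set, and a parser that maps the model's free-text answer back onto that set. The category taxonomy, function names, and the idea of constraining answers to a fixed label set are illustrative assumptions, not part of the existing FAIR Jupyter pipeline.

```python
# Hypothetical sketch of prompt construction and answer parsing for
# LLM-based notebook error classification. The taxonomy below is an
# assumption; a real module would derive it from annotated error logs.

ERROR_CATEGORIES = [
    "ModuleNotFoundError",  # missing dependency
    "FileNotFoundError",    # missing data file or path
    "SyntaxError",          # code incompatible with the Python version
    "TimeoutError",         # execution exceeded the time budget
    "Other",
]


def build_classification_prompt(error_log: str) -> str:
    """Build a prompt asking a local model for exactly one category."""
    categories = ", ".join(ERROR_CATEGORIES)
    return (
        "You are analysing a failed Jupyter notebook execution.\n"
        f"Classify the following error log into exactly one of: {categories}.\n"
        "Answer with the category name only.\n\n"
        f"Error log:\n{error_log}\n"
    )


def parse_classification(model_output: str) -> str:
    """Map a free-text model answer onto the closed category set.

    Open models do not always answer with the bare label, so match
    case-insensitively and fall back to "Other" for unusable answers.
    """
    answer = model_output.strip().lower()
    for category in ERROR_CATEGORIES:
        if category.lower() in answer:
            return category
    return "Other"
```

Keeping the categories closed makes model outputs directly comparable across DeepSeek, Mistral, and LLaMA, since every answer reduces to one of the same labels regardless of each model's answering style.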
The thesis should deliver a prototype LLM integration module for the FAIR Jupyter pipeline, including prompt templates for error classification and fix suggestion, a benchmarking framework for evaluating LLM outputs against manually annotated ground truth, and a comparative evaluation of at least three open-weight models. It should include documentation for deploying the module in a local compute environment and a discussion of the trade-offs between model capability, resource requirements, and output quality.
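The benchmarking component could, in its simplest form, score each model's predicted error categories against the manually annotated labels. The sketch below assumes classification outputs have already been reduced to labels (names and data shapes are illustrative); a full evaluation would add per-category breakdowns and quality ratings for fix suggestions and summaries.

```python
# Hypothetical sketch of the comparative evaluation: per-model accuracy
# of predicted error categories against manually annotated ground truth.

def evaluate_models(
    predictions: dict[str, list[str]],
    ground_truth: list[str],
) -> dict[str, float]:
    """Return accuracy per model, assuming predictions are aligned
    index-by-index with the annotated error logs."""
    scores: dict[str, float] = {}
    for model, preds in predictions.items():
        if len(preds) != len(ground_truth):
            raise ValueError(f"{model}: prediction count mismatch")
        correct = sum(p == g for p, g in zip(preds, ground_truth))
        scores[model] = correct / len(ground_truth)
    return scores
```

Usage on a toy annotated set might look like `evaluate_models({"mistral": ["SyntaxError", "Other"]}, ["SyntaxError", "TimeoutError"])`, yielding an accuracy of 0.5 for that model.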


