Master's Thesis

Real-Time Speech Translation for Web Videos Using AI-Based Streaming ASR and LLMs

Research Area

Intelligent Information Management

Advisers

Bai Li M.Sc.

researcher

Room: 1/B203

Phone: +49 371 531 31887

Email: bai.li@informatik.tu-chemnitz.de

Description

Online video has become an important medium for sharing information in international higher education, but language barriers still prevent many international users from accessing and fully understanding academic video content. In recent years, advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have made real-time multilingual support for videos possible, improving access to educational content.

As part of the central topic "EduLoom: A Multimodal AI Framework for Integrating Educational Web Resources", this thesis aims to design and implement a system for translating the speech in online videos. The system should capture the audio stream of a video, transcribe it with streaming speech recognition, and translate the recognized text into another language with an AI-based translation model. The resulting translated subtitles and text should support two use cases: real-time display during video playback and review after viewing. Because the system targets a university environment, domain-specific terminology and accessibility requirements should also be considered. The prototype should be evaluated in terms of speech recognition accuracy, translation quality, latency, cost, usability, and support for both real-time use and post-viewing review.

The objective of this thesis is to create a solution, or to combine existing techniques, for real-time video translation with GenAI as described above. This comprises an analysis of the state of the art in speech recognition and translation, as well as a demonstration of the solution through implementation and an appropriate experimental evaluation.
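To make the intended processing chain more concrete, the following is a minimal Python sketch of the steps named above: audio chunks go in, each chunk is transcribed and translated, and the resulting subtitle cues are both shown immediately and stored for later review. It is only an illustration under stated assumptions, not a prescribed design. The names `SubtitleCue`, `translate_stream`, and `run_demo`, the fixed 2-second chunk length, and the dictionary-based stand-ins for the recognizer and translator are all hypothetical; a real streaming ASR component would typically emit partial hypotheses with its own timestamps rather than one result per fixed-length chunk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class SubtitleCue:
    start_s: float         # cue start time in seconds of playback
    end_s: float           # cue end time in seconds of playback
    source_text: str       # recognized speech in the original language
    translated_text: str   # translation shown to the viewer


def translate_stream(
    audio_chunks: Iterable[bytes],
    recognize: Callable[[bytes], str],
    translate: Callable[[str], str],
    chunk_seconds: float = 2.0,  # assumed fixed chunk length
) -> Iterator[SubtitleCue]:
    """Yield one translated cue per audio chunk that contains speech."""
    for index, chunk in enumerate(audio_chunks):
        text = recognize(chunk).strip()
        if not text:
            continue  # nothing recognized in this chunk (e.g. silence)
        start = index * chunk_seconds
        yield SubtitleCue(start, start + chunk_seconds, text, translate(text))


def run_demo() -> List[SubtitleCue]:
    # Hypothetical stand-ins: a real system would call an ASR model
    # and an LLM-based translator here instead of dictionary lookups.
    fake_asr = {b"a": "Guten Morgen", b"b": "", b"c": "Willkommen zur Vorlesung"}
    fake_mt = {
        "Guten Morgen": "Good morning",
        "Willkommen zur Vorlesung": "Welcome to the lecture",
    }

    transcript: List[SubtitleCue] = []  # kept for post-viewing review
    cues = translate_stream(
        [b"a", b"b", b"c"],
        recognize=lambda chunk: fake_asr[chunk],
        translate=lambda text: fake_mt[text],
    )
    for cue in cues:
        # Use case 1: real-time display during playback.
        print(f"[{cue.start_s:.1f}-{cue.end_s:.1f}] {cue.translated_text}")
        # Use case 2: store the cue so the viewer can review it afterwards.
        transcript.append(cue)
    return transcript


if __name__ == "__main__":
    run_demo()
```

Because `translate_stream` is a generator, each cue becomes available as soon as its chunk has been processed, which is what allows live display and transcript storage to share one pipeline. Passing the recognizer and translator as plain callables keeps the sketch independent of any particular ASR or LLM service, so different components could be swapped in and compared during evaluation.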