Master's Thesis
     Automated Metadata Annotation of Research Data with Homonym Disambiguation
        Completion
2024/05
Research Area
Intelligent Information Management
Advisers
 
Jan Haas M.Sc.
Prof. Dr.-Ing. Martin Gaedke
Description
CKAN is a popular repository and framework for research data and research data management. Its core functionality is uploading datasets and annotating them with metadata such as keywords, a description, and a title. When enriching datasets with metadata, finding the most suitable tags from a predefined set of keywords can be difficult, because it requires knowledge of the tags that already exist; without that knowledge, datasets may end up improperly annotated. Additionally, some tags are homonyms which, if not disambiguated, make it hard to classify a dataset correctly.

This thesis explores algorithms and techniques for automatically annotating datasets with plausible tags based on a specified context, e.g., the dataset description, the title, or previous work of the author and their field. The investigated and implemented techniques should also be aware of the semantics of each keyword and thus be capable of automatically disambiguating homonyms.

The objective of this thesis is to research the different approaches to automatic keyword annotation of datasets and their capability to disambiguate homonym tags as described above. To this end, a thorough state-of-the-art analysis must be conducted, and the most applicable approaches must be implemented and their performance evaluated objectively using existing metrics and benchmarks. Additionally, a demonstrator has to be created that shows the capabilities of the implemented algorithms.
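
To make the task concrete, the following is a minimal sketch of context-based tag suggestion with homonym disambiguation. It is not one of the approaches the thesis will evaluate: it assumes a simple Lesk-style overlap between the dataset context and a short gloss per tag sense, scored with bag-of-words cosine similarity. The vocabulary `TAG_SENSES`, the glosses, the function names, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical controlled vocabulary (illustrative only): each tag maps to
# one or more senses, and each sense is described by a short gloss.
TAG_SENSES = {
    "python": {
        "programming language": "programming language software code script library",
        "snake": "snake reptile animal species habitat",
    },
    "mercury": {
        "chemical element": "chemical element metal toxic contamination",
        "planet": "planet orbit solar system astronomy",
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_tags(context, tag_senses=TAG_SENSES, threshold=0.1):
    """Return (tag, sense, score) tuples, best match first.

    For every tag, the sense whose gloss is most similar to the context is
    selected (the disambiguation step). A tag is only suggested if that
    best score reaches the assumed threshold.
    """
    ctx = Counter(tokenize(context))
    suggestions = []
    for tag, senses in tag_senses.items():
        best_sense, best_score = max(
            ((sense, cosine(ctx, Counter(tokenize(gloss))))
             for sense, gloss in senses.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            suggestions.append((tag, best_sense, best_score))
    return sorted(suggestions, key=lambda item: -item[2])


if __name__ == "__main__":
    description = (
        "Field observations of snake species and their habitat "
        "in tropical forests"
    )
    for tag, sense, score in suggest_tags(description):
        print(f"{tag} ({sense}): {score:.2f}")
```

In this sketch the context is only the dataset description; the title or the author's previous work could be concatenated into the same string. Word overlap is a deliberately weak notion of semantics, so a realistic system would likely replace `cosine` over word counts with a learned text representation while keeping the same select-best-sense structure.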
 
                    


