A Novel Approach to Semantic Mapping through a Documentation-Centric Methodology
The dissertation by Dr.-Ing. Andreas Burgdorf deals with the automated assignment of semantic concepts to data sets. He examines the role of textual data documentation and which methods are suitable for making the best possible use of this documentation for the assignment process. Machine learning methods and, in particular, large language models play a central role here.
We asked Andreas about his dissertation:
In what context was your dissertation written? Which projects or other factors particularly influenced your dissertation?
Before I came to the University of Wuppertal, I had already worked on various relatively small projects at RWTH Aachen University, where I focused on natural language processing and, in particular, emerging transformer-based language models. In Wuppertal, I then worked on the Bergisch.smart_mobility project, which involved, among other things, the development of a data marketplace on the topic of mobility in the Bergisch city triangle. Here, many data sets that were already available in open data portals had to be annotated semantically by hand. This raised the question of which methods might be suitable for automating this annotation as far as possible. Based on my past projects in the field of language processing, I was particularly interested in documentation, as it offers the additional advantage that everyone, regardless of their IT background, can describe these data sets in their own words. That is how I discovered the interface between semantic data management and natural language processing.
What contribution does your work make to the field of research?
The dissertation investigates the extent to which textual data documentation is suitable as a source of information for automated semantic annotation. An initial milestone was the creation of a suitable data corpus (VC-SLAM) containing raw data, textual data documentation, and semantic models, as no comparable usable corpus existed at the time. Based on this corpus, I investigated and developed various methods, from heuristics to language models, to optimize the automatic semantic annotation process. The result is the DocSemMap framework, which outperforms other automated methods that do not rely on documentation. Finally, I also focused on large language models (LLMs), investigating how they can further improve the process, either by using them to improve the documentation or by using them directly as an additional method in the annotation process.
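To make the idea of documentation-driven annotation a little more concrete, here is a minimal, purely illustrative Python sketch of the simplest kind of heuristic the answer alludes to: scoring candidate semantic concepts against a column's textual documentation by word overlap. It is not the DocSemMap implementation, and the column documentation, concept labels, and scoring rule are all assumptions made for illustration.

```python
# Minimal sketch (not DocSemMap): a naive heuristic that assigns a semantic
# concept to a dataset column by comparing the column's textual documentation
# with short descriptions of candidate concepts. All names are illustrative.
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts, ignoring punctuation."""
    return Counter(re.findall(r"[a-zäöüß]+", text.lower()))


def overlap(doc_tokens: Counter, concept_description: str) -> int:
    """Number of shared word occurrences between documentation and a concept description."""
    return sum((doc_tokens & tokenize(concept_description)).values())


def annotate(column_doc: str, candidates: dict[str, str]) -> str:
    """Pick the candidate concept whose description best matches the documentation."""
    doc_tokens = tokenize(column_doc)
    return max(candidates, key=lambda concept: overlap(doc_tokens, candidates[concept]))


if __name__ == "__main__":
    # Hypothetical documentation for one column of a mobility dataset.
    column_doc = "Number of free parking spaces at the station, updated every minute."
    candidates = {
        "schema:ParkingFacility": "a parking lot or garage with parking spaces",
        "schema:BusStop": "a stop where buses pick up passengers",
        "schema:Restaurant": "a place where people pay to eat meals",
    }
    print(annotate(column_doc, candidates))  # -> schema:ParkingFacility
```

In the dissertation, such simple heuristics serve only as a baseline; the later stages replace the word-overlap score with learned language models and, eventually, LLM-based methods that read the documentation directly.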
What's next for you and the topic?
Today, I continue to focus in particular on large language models. I am very pleased to have the opportunity to support a large number of staff and students in the AI4BUW project in the development of chatbots and other AI-based assistance systems for teaching, research, and administration, among other things. I am delighted that the use of large language models for the semantic annotation of data sets and, in particular, for the construction of more complex semantic models is becoming increasingly relevant at the chair and in research projects across various fields of application.