Context- and Quality-aware Data Engineering for Scientific Facilities Dealing with Ultralarge Datasets

DASHH Doctoral Researcher: Michael Schuh

Supervisors: Dr. Steve Aplin (EuXFEL), Prof. Walid Maalej (UHH)

Driven by continuous progress in instrumentation and growing scientific ambition, advanced light sources generate ever-increasing amounts of data. For example, a single experiment at European XFEL can produce up to several petabytes of raw data within six days. These data volumes will keep growing over the coming years as detectors gain more sensors and higher resolutions.
Scaling the resources necessary to store, transfer, and process these data is no longer sustainable, from either an economic or an environmental perspective. Instead, only the data that are scientifically valuable and meaningful should be retained, and the rest discarded.
This research therefore focuses on evaluating and improving data management and data engineering methods and algorithms that perform data reduction effectively across the various scientific use cases and experimental techniques.
A key part of this project will be working with domain experts to design a modular framework for running data reduction pipelines and for configuring and controlling data filtering tools. The framework will include interfaces through which data-driven machine learning techniques can be used to train estimators that predict the value of a given subset of experimental data for further analysis, as sketched below.
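As a rough illustration only, the Python sketch below shows one way such a plug-in interface could look: filter stages are built around interchangeable value estimators, and a pipeline keeps or discards data chunks based on their scores. All names here (Chunk, ValueEstimator, run_pipeline, the mean-intensity heuristic) are hypothetical and do not correspond to any existing EuXFEL or DASHH software.

from dataclasses import dataclass
from typing import Callable, Iterable, Protocol

import numpy as np


@dataclass
class Chunk:
    """A unit of experimental data plus bookkeeping metadata."""
    data: np.ndarray
    metadata: dict


class ValueEstimator(Protocol):
    """Anything that scores how valuable a chunk is for further analysis."""
    def score(self, chunk: Chunk) -> float: ...


class MeanIntensityEstimator:
    """Toy estimator: assumes mean pixel intensity correlates with scientific value."""
    def score(self, chunk: Chunk) -> float:
        return float(chunk.data.mean())


def keep_above(estimator: ValueEstimator, threshold: float) -> Callable[[Chunk], bool]:
    """Build a filter stage that keeps chunks whose estimated value exceeds a threshold."""
    return lambda chunk: estimator.score(chunk) > threshold


def run_pipeline(chunks: Iterable[Chunk],
                 stages: list[Callable[[Chunk], bool]]) -> Iterable[Chunk]:
    """Pass each chunk through all filter stages; discard it at the first rejection."""
    for chunk in chunks:
        if all(stage(chunk) for stage in stages):
            yield chunk


if __name__ == "__main__":
    # Synthetic detector frames with different mean photon counts.
    rng = np.random.default_rng(0)
    chunks = [Chunk(rng.poisson(lam, size=(64, 64)), {"frame": i})
              for i, lam in enumerate([0.1, 5.0, 0.2, 8.0])]
    stages = [keep_above(MeanIntensityEstimator(), threshold=1.0)]
    kept = list(run_pipeline(chunks, stages))
    print(f"kept {len(kept)} of {len(chunks)} chunks")  # expect 2 of 4

In a real framework the estimator would be a trained model rather than a fixed heuristic, but the separation between estimators, filter stages, and the pipeline runner is the point of the sketch: it lets domain experts swap in technique-specific reduction logic without changing the surrounding infrastructure.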
The framework will require standardized, well-defined processes and will benefit strongly from sound software engineering practices such as packaging, versioning, documentation, and continuous improvement based on user feedback.