Distributed Self-Healing Infrastructure for High-Speed Scientific Data Processing

Supervisors: Dr. Holger Schlarb (DESY), Prof. Görschwin Fey (TUHH)

The most advanced physics meets highly dependable, high-performance embedded computing in the infrastructure of large-scale research facilities such as the European XFEL or PETRA III/IV. The physics of the experiment defines the requirements and functionality for high-speed, high-performance real-time processing. Reliable operation requires a dependable distributed infrastructure composed of thousands of custom computing nodes for data taking, processing, storage and transfer. The key question is therefore: how can self-diagnosis and self-healing be achieved under tight real-time and high-performance processing demands? Guided by the physicists and engineers at DESY and supervised at TUHH, the prospective PhD student will devise new concepts for self-aware distributed computing that identify and heal faults autonomously at run time. The scientific challenges lie in automatically localizing potential sources of failure, whether hardware defects, radiation or even software bugs, and in mitigating them. A deep understanding of both the computing infrastructure and the experiment physics is mandatory to identify feasible solutions. Model-based approaches combined with online formal reasoning will be the method of choice for advanced self-awareness. Empirical studies will implement and verify the developed concepts on the most recent devices deployed at DESY. This work matches future needs in wider application areas that rely on a myriad of devices combined into virtually autonomous distributed computing infrastructures.