Marc Eisoldt
Massively Parallel Evaluation of an Inverse Sensor Model for 3D Laser Scanners and TSDF Maps on GPUs and FPGAs

Abstract
Global localization in maps is an important problem in the field of autonomous robotics and is required for many indoor and outdoor tasks in this domain. Monte Carlo localization solves this problem by considering a set of pose hypotheses in the environment. While this method has been well studied in the two-dimensional case, global localization in three-dimensional map representations for mobile robotic systems with six degrees of freedom has been neglected in terms of performance. The problem is computationally very expensive, and no mobile robot system presented in the literature so far is able to solve it in real time. The most computationally intensive step of the algorithm is the evaluation of the pose hypotheses using a sensor model, which also offers high potential for parallelization. For this reason, this work investigates the massively parallel evaluation of the particles on TSDF (truncated signed distance function) maps for three-dimensional laser scanners on systems based on embedded GPUs and FPGAs. The goal is to compare the two architectures for implementing real-time capable global localization for mobile robotic systems. Both the performance and the power consumption of the system are of critical importance. Based on previous work on this topic, a mobile robot system was developed for each architecture considered. Since memory access limits the performance of the implementation, a sub-grid structure was used to manage the map. This also allows an efficient implementation, in reconfigurable hardware, of a custom caching strategy developed for the problem. Both implementations were evaluated and compared using simulated scenarios. A Jetson AGX Xavier was used as the platform for the GPU, a Xilinx UltraScale+ XCZU15EG for the FPGA, and an Intel NUC for the CPU. The implementation on the GPU achieves more than ten times the performance and more than twenty times the energy efficiency of the implementation on the CPU.
The implementation on the FPGA achieves more than twice the performance and more than eight times the energy efficiency of the implementation on the CPU. However, in order to use all of the data delivered by the laser scanner in real time, the implementation for the GPU must be further accelerated by a factor of three and the implementation for the FPGA by a factor of sixteen. Based on the significant acceleration of the algorithm by the GPU, as well as the efficiency of the configured processing units and the problem-specific caching in the reconfigurable hardware, it is expected that the evaluation of the pose hypotheses can be accelerated further by developing a domain-specific architecture that combines the GPU and the reconfigurable hardware.
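
To make the evaluation step concrete, the following is a minimal sequential sketch of how pose hypotheses could be weighted against a TSDF map. It is not the implementation evaluated in this work. The Gaussian error model with standard deviation `sigma`, the dictionary-based voxel map, the simplified pose `(x, y, z, yaw)` in place of a full six-degree-of-freedom pose, and all function names are assumptions made for illustration. The idea is that a scan point transformed by a correct pose lands on a surface, where the TSDF value is close to zero.

```python
import math


def transform(pose, point):
    """Apply a simplified pose (x, y, z, yaw) to a scan point.

    Assumption: roll and pitch are omitted to keep the sketch short.
    """
    x, y, z, yaw = pose
    px, py, pz = point
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * px - s * py, y + s * px + c * py, z + pz)


def tsdf_lookup(tsdf, point, resolution, truncation):
    """Return the TSDF value of the voxel containing the point.

    Assumption: voxels missing from the map count as the truncation distance.
    """
    key = tuple(math.floor(c / resolution) for c in point)
    return tsdf.get(key, truncation)


def log_weight(pose, scan, tsdf, resolution, truncation, sigma):
    """Sum the log-likelihoods of all scan points under a Gaussian error model."""
    total = 0.0
    for point in scan:
        d = tsdf_lookup(tsdf, transform(pose, point), resolution, truncation)
        total -= d * d / (2.0 * sigma * sigma)
    return total


def evaluate_particles(particles, scan, tsdf,
                       resolution=0.1, truncation=0.3, sigma=0.1):
    """Return normalized weights, one per pose hypothesis."""
    logs = [log_weight(p, scan, tsdf, resolution, truncation, sigma)
            for p in particles]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(logs)
    weights = [math.exp(v - m) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]
```

Each call to `log_weight` depends only on its own pose, the shared scan, and the read-only map, so the hypotheses can be evaluated independently. This independence is the source of the parallelization potential described above, and the map lookup inside the inner loop corresponds to the memory access that limits performance.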