Topics for Project, · 2020-02-02 · Approximate LUT mapping for FPGAs To program an FPGA, a...

27
Topics for Project, Bachelor’s and Master’s Theses https://www.cs12.tf.fau.eu/teaching

Transcript of Topics for Project, · 2020-02-02 · Approximate LUT mapping for FPGAs To program an FPGA, a...

Topics for Project,Bachelor’s and Master’s Theses

https://www.cs12.tf.fau.eu/teaching

FPGAs in Wetterstationen der Artenvielfalt

Globale Faktoren wie die zunehmende Umweltverschmutzung und Mono-kulturen in der Landwirtschaft fuhren zu einer stetigen Abnahme der Arten-vielfalt der Tierwelt. Nationale und internationale Konsortien wollen diesesArtensterben dokumentieren und damit die Aufmerksamkeit der Offentlich-keit fur diese Entwicklung wecken, aber auch geeignete Gegenmaßnah-men einleiten. Dazu ist geplant, durch automatisierte Sensorstationen dasVorkommnis verschiedener Spezies in weitlaufigen, haufig schwer zugang-lichen Arealen aufzuzeichnen.

Die Herausforderung bei deren Entwurf besteht darin, dass sie einerseitsgroße Datenmengen verarbeiten, speichern und uber Mobilfunk ubertra-gen mussen. Andererseits sind die verfugbaren Ressourcen im Feldeinsatzbeschrankt (Energie basierend auf Solar- oder Windkraft, Speicherresour-cen und Kommunikationsbandbreite). In dieser Arbeit soll fur den Bereichder Bildverarbeitung eine FPGA-basierte Datenverarbeitungseinheit einersolchen Station entworfen werden. Diese soll eine Strategie umsetzen, diedurch Hinzuschalten von Kompressionsverfahren die Datenmenge adaptivund in Abhangigkeit des Batteriestands und freier Speicherkapazitat redu-ziert und dadurch die Datenverfugbarkeit verbessern kann.

Voraussetzungen: Grundkenntnisse Programmierung (C/C++), VHDL oder FPGA-EntwurfArt der Arbeit: Theorie (20%), Konzeption (40%), Implementierung (40%)Betreuung: Stefan Wildermann ([email protected]), Jurgen Teich

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Programmable approximate computing inRISC-V processors by utilizing anytime instructions

Approximate computing can positively influence the execution properties ofapplications like run-time or energy consumption by trading-off computati-on accuracy. This concept is already widely used in accelerators to improvethe performance of application execution in many different fields like digitalsignal and video processing.

The goal of this work is to investigate whether the concept of anytime in-structions that was formulated at our chair can not only be used efficientlyin accelerators but also in general-purpose processors like the RISC-V.Anytime instructions is a form of approximate computing that lets the pro-grammer decide on the degree of accuracy the computation shall havealready by formulating the instruction. The performance, time and energywise, and accuracy of applications run on the modified RISC-V shall be in-vestigated and compared to the execution on a hardware accelerator, alsodeveloped at the chair, that utilizes the same approximate functional units.

Requirements: Knowledge in HDL programming mandatoryType of thesis: Theory (30%), concept (40%), implementation (30%)Supervisor: Marcel Brand ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Approximate LUT mapping for FPGAs

To program an FPGA, a technology mapping of the circuitdescription to the lookup tables (LUTs) of the FPGA has tobe performed (LUT mapping). This process is often modeledas a graph covering problem and solved by finding cuts (orcones) within the circuit.

As LUTs are a finite resource, it is beneficial to use as fewLUTs as possible. In this work, the student is asked to analy-ze whether it is possible to save LUTs by accepting erroneousresults in the realized functionality, i.e. only approximating thecircuit. The solution obtained during this work is to be implemented within the framework of the EPFLlogic synthesis tools.

Goals of this thesis are:• Getting used to the EPFL logic synthesis tools• Analysis of existing LUT mapping algorithms• Development of an approximate LUT mapping algorithm

Prequisites: (good) Programming knowledge in C++, basic knowledge in mathematicsType of Work: Theory (30%), Implementation (0%)Supervisor: Oliver Keszöcze ([email protected])

Lehrstuhl für Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Side-channel evaluation on FPGAs with ChipWhisperer

An attacker of a system is no longer confined to the network connecti-on as single point of entrance. Side-channels are ways for informationto flow outside of standard wires in a computer system. Therefore, anobserver who is not part of the system can eavesdrop on private infor-mation. Known methods to retrieve such information are using poweras well as electromagnetic field measurements that are recorded withan oscilloscope. ChipWhisperer is a platform which offers easily pro-grammable FPGA boards and an open source side-channel analysis(SCA) framework for evaluating measurements and even breaking en-cryptions like AES.

Main tasks for a potential bachelor thesis or student assistant job:◦ Implement an interface for the ChipWhisperer framework to communicate with an oscilloscope.◦ Adapt a predefined measurement procedure for the ChipWhisperer board.◦ Evaluate measurements taken from consumer vs. ChipWhisperer board.

The amount of work can be adjusted to fit a bachelor thesis or assistant job.Please feel free to ask for details!

Prerequisites: Programming skills in Python, C/C++ and being careful when handling dev-boardsType of Work: Theory (15%), Conception (25%), Implementation (60%)Supervisor: Jens Schlumberger ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Using machine learning to apply SCA in a real world scenario

Side-channel analysis (SCA) of embedded hardware devices can beof great help for attackers or forensic investigators on a crime scene.SCA can be used to retrieve data and assess the state of a device.Furthermore it can be used to break encryption algorithms running onsaid device. In a real world scenario this is rather hard, because onecan only monitor a specific time frame that hopefully contains someevidence of an encryption algorithm like AES. Those encryptions willoccur in a bulk and are tough to analyze without separation.Therefore, it is mandatory to split the recorded measurements into sin-gle encryption cycles. This separation could be done by using machine learning (ML) due to significantdifferences in the frequency domain. The goal of this project or thesis is to investigate how well MLworks on finding recurring patterns of encryption algorithms in side-channel measurements.In the future, such an ML algorithm could be a significant part of an FPGA-based analyzer for crypto-devices which could aid forensic investigators at crime scenes to gather valuable data in a useful form.

The amount of work can be adjusted to fit a project or master thesis but preferably both sequentially.Please feel free to ask for details!

Prerequisites: Programming skills in Python or C/C++ (experience with machine learning or signalprocessing is preferred but not mandatory)

Type of Work: Theory (30%), Conception (20%), Implementation (50%)Supervisor: Jens Schlumberger ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Beschleunigung von Entwurfsraumexplorationen durch ma-schinelles Lernen

Bei der Realisierung und Evaluierung von Hardwareimplementierun-

Training

Suchraum

Synthese

gen aus einer gegebenen Spezifikation, stellt haufig die Große desSuchraums ein Problem dar. Um die Vielzahl an Implementierungs-moglichkeiten zu beherrschen, wird deshalb auf eine Entwurfsraumex-ploration zuruckgegriffen. Hierbei wird der Suchraum traversiert, wobeiausgewahlte Implementierungen stetig verbessert werden. Diese Tra-versierung basiert auf dem Erzeugen und Evaluieren vieler Losungen,die stetig verbessert werden. Die Evaluierung benotigt eine Synthese,die wegen ihrer hohen Komplexitat zu sehr hohen Laufzeiten fuhrt.Im Zuge dieser Arbeit wird ein Ansatz untersucht, der diese Laufzeitreduziert, indem nur ein Teil der Losungen tatsachlich synthetisiert undevaluiert wird. Die Evaluierung der anderen Losungen wird durch eine Verfahren ersetzt, das durchmaschinelles Lernen Zusammenhange zwischen charakteristischen Eigenschaften der synthetisiertenLosungen und deren Evaluierungsergebnissen lernt. Dieses Wissen wird dann verwendet, um die Qua-litat anderer Losungen vorherzusagen.

Voraussetzungen: Programmierkenntnisse in C/C++ und JavaArt der Arbeit: Theorie (40%), Konzeption (40%), Implementierung (20%)Ansprechpartner: Peter Brand, Joachim Falk ({peter.brand, joachim.falk}@fau.de)

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Vergleich von simulierten und realen Kommunikationsmus-tern durch maschinelles Lernen

Bei der Entwicklung neuer Kommunikationsalgorithmen und -methoden,Mustererkennen Klassen

bildenModel

Trainieren

Vergleich

MaschinellesLernenReale

Datenströme

SimulierterDatenströme

sind belastbare Evaluierung notig. Eine Evaluation, sofern analytischnicht moglich, erfordert eine Bewertung einer Vielzahl von Testsze-narien, die moglichst alle Auspragungen real moglicher Kommunika-tion umfassen sollten. Da Szenarien in realen Systemen meist wedernachvollziehbar, noch wiederholbar sind und unerwartetes Verhaltenim schlimmsten Fall katastrophale Auswirkungen hat, werden bei derEntwicklung meist Simulationen eingesetzt, die relevantes, reales Sys-temverhalten nachbilden. Dadurch lassen sich nachvollziehbare und wiederholbare Testszenarien er-zeugen, die dynamisch auf das Verhalten des Kommunikationsalgorithmus reagieren. Die Aussagekrafteiner simulativen Evaluation hangt dabei hauptsachlich von der Wirklichkeitstreue des simulierten Ver-haltens ab.Im Zuge dieser Arbeit soll untersucht werden, wie sich die Wirklichkeitstreue vergleichen lasst. Dabeiwerden Methoden des maschinellen Lernens angewandt, um Verhaltensmuster in realen Kommunika-tionsstromen zu erkennen und anschließend mit simulierten Kommunikationsstromen zu vergleichen.

Voraussetzungen: Programmierkenntnisse in C/C++ und JavaArt der Arbeit: Theorie (45%), Konzeption (45%), Implementierung (10%)Ansprechpartner: Peter Brand, Joachim Falk ({peter.brand, joachim.falk}@fau.de)

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

DSE of Simulink Models for the Automotive Domain

Nowadays, embedded systems can reach tremendous complexity. Ty-pically, these systems have a heterogeneous architecture that is com-posed of different types of computing units like CPUs, DSPs, FPGAs,etc. as well as a diverse set of communication infrastructure rangingfrom field buses like CAN to commodity networks like Ethernet. Thus,the challenge is how to select an optimal distribution of the systemfunctionality to these units from the huge design space of the possibledistributions.On the other hand, Simulink has gained a lot of acceptance in the au-tomotive domain due to its intuitive block-based algorithm design, ea-sy simulation, and rapid prototyping capabilities. However, Simulink islacking integration with Desing Space Exploration (DSE) especially fordistributed target architectures.In order to address this issue, the goal of this thesis is the extensionof our Simulink-based DSE tool flow to also support distributed targetarchitectures like Ethernet TSN connected target architectures.

Prerequisites: Programming Skills in C/C++ and Matlab/Simulink knowledge requiredType of Work: Theory (10%), Conception (30%), Implementation (60%)Supervisors: Martin Letras ([email protected]) and Joachim Falk ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Hardware-Centric Run-Time Integrity Checking for PSoCs

Programmable System-on-Chips (PSoCs) have several ad-vantages over other processing platforms through the availabi-lity of hardware acceleration in the form of Field Programma-ble Gate Arrays (FPGAs) that are tightly coupled with integra-ted General Purpose Processors (GPPs). Here, an ongoing re-search topic is to use the FPGA part to enhance the security ofthe entire PSoC platform including processors, memories andperipherals. In this way, cryptographic algorithms can carryout security sensitive integrity and authenticity checks directly in hardware without unsecure softwareintervention. As a consequence, the FPGA acts as a root of trust to verify both the secure execution ofuser applications and also the underlying Operating System (OS). The goal of this thesis is to investigatenovel hardware-centric run-time integrity checks for PSoCs to ensure the integrity and authenticity ofa system configuration without loss of performance and security. The main tasks within this thesis are:

• Acquiring a general understanding of Xilinx Zynq PSoCs.• Implementing custom security routines with hardware and software integrity and authenticity checks

on Xilinx Zynq PSoC.• Evaluation of detection rates based on real security threats.

Prerequisites: Programming Skills in C/C++ and VHDL, interest in PSoC design and OS security

Type of Work: Theory (15%), Conception (35%), Implementation (50%)

Supervisors: Franz-Josef Streit ([email protected]) and Stefan Wildermann ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Automatic Integration of Hardware Accelerators into VirtualPrototypesThe increasing complexity of embedded systems ne-cessitates a rise in the level of abstraction of the de-sign flow. To facilitate this, the SystemC language, aC++ based class library and de-facto industrial stan-dard, is used to model embedded systems at theElectronic System Level (ESL).In this thesis, you will use SystemC to model a sim-ple fractal viewer as a test application at ESL. Basedon this application, you will use the tools devel-oped at the Chair for Hardware/Software Co-Designto generate a virtual prototype using an OpenRISCCPU with the eCos real time operating system.The goal of this thesis is the extension of our tools, such that integration of hardware accelerators can beperformed automatically, hence, increasing the design productivity of the ESL design flow significantly.To get familiar with the topic, you will first integrate a dedicated hardware accelerators for the fractalcomputation by hand.

Prerequisites: Basic C++-knowledge requiredType of Work: Theory (10%), Conception (30%), Implementation (60%)Supervisor: Joachim Falk ([email protected]), Tobias Schwarzer ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

DSL-based Source-to-Source Transformation using Clang

A source-to-source compiler or transpiler typicallygenerates source code in a certain low-level pro-gramming language from a higher-level programdescription. Programming using high-level abstrac-tions enables high productivity and efficiency. A su-preme example is the domain-specific languages inimage processing, such as HIPAcc or Halide. Theyfacilitate efficient code generation for targets such as GPUs, using a very concise algorithm description.

HIPAcc is an internal DSL embedded in C++. Programmers specify an image processing pipeline viathree types of imaging kernels: point, local, and global operators. HIPAcc is responsible for mappingthose kernels to hardware and generating target code such as CUDA. The lowering happens inside thecompiler and is based on Clang AST (Abstract Synxtax Tree). Accurate intermediate representation andefficient analysis is the basis to any optimizations and the backbone of such source-to-source compilers.

The aims of this thesis are: 1) get acquainted with Clang toolings such as AST Visitor or AST Matcher.2) analyze the current workflow for CUDA code generation in HIPAcc, and implement a custom kernelextension within the CUDA code generator. 3) benchmark common image processing applications.

Required skills: Good knowledge of C++, and preferable CUDANature of work: Theory (20 %), Conception (20 %), Implementation (60 %)Contact: Bo Qiao ([email protected]).

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Entwurfsraumexploration fur die Abbildungsmenge von Anwendungen auf MPSoCs

Mithilfe einer Entwurfsraumexploration (DSE) konnen Abbildun-gen (sog. Bindungen) von Tasks einer Anwendung auf Ressour-cen einer MPSoC-Architektur (Multi-Processor System-on-Chip)bestimmt werden, die hinsichtlich Entwurfszielen wie Energie oderLatenz Pareto-optimal sind. Sofern die Ausfuhrungseigenschaf-ten einer Anwendung stark in Abhangigkeit der Eingabedaten dif-ferieren, verschenkt eine feste Zuweisung von Tasks zu Ressour-cen der Architektur jedoch verglichen mit speziell auf die Ein-gabedaten zugeschnittenen Abbildungen Optimierungspotenzi-al. Dies ist beispielsweise bei Anwendungen aus der Bildverar-beitung der Fall, bei denen die Arbeitslastverteilung durch dasjeweilige Eingabebild bedingt ist. Ublicherweise hat eine DSE die Optimierung individueller Abbildungen als Ziel. In dieserArbeit sollen Techniken zur Optimierung einer Menge von Abbildungen untersucht werden, wobei jede dieser Abbildun-gen auf bestimmte Teilmengen der potenziellen Eingabedaten zugeschnitten ist. Ziel ist somit der Entwurf von Systemen,die durch dynamisches Umschalten zwischen Abbildungen zur Laufzeit die Ausfuhrungseigenschaften der Anwendungverbessern konnen.

Voraussetzungen: Programmierkenntnisse in Java

Art der Arbeit: Theorie (20%), Konzeption (40%), Implementierung (40%)

Ansprechpartner: Jan Spieck ([email protected]), Stefan Wildermann ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

PCB Escape Routing for Digital Microfluidic Biochips

Digital microfluidic biochips (DMFBs) are devices that try to massi-vely miniaturize laboratories using conventional printed circuit boards(PCBs). This idea is known as lab-on-a-chip.

To conduct an experiment, liquids are moved over cells on the PCB.These cells need to be connected to external control pins (escape rou-ting). Cells that can be jointly controlled can be connected to the samepin. It can be impossible to find a routing on a single PCB layer due tothe large number of cells. The problem can be solved by using morelayers or pins.

The aim of this work is to determine exactly those escape routingson a three-dimensional routing graph that minimize the use of layersand/or pins. Here, one has to examine the trade-off between addinga new layer or a new control pin. The results should be importable bythe open-source CAD tool KiCad. The topic is suitable as a Bachelor’sReport or as a project.

Prequisites: Knowledge in C++ (Python helpful)Type of Work: Conception (30%), Implementation (70%)Supervisor: Oliver Keszöcze ([email protected])

Lehrstuhl für Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Nachregulierung leichtgewichtiger CNNs auf eingebettetenProzessorfeldern

Neuronale Netze, beispielsweise Convolutional Neural Networks(CNNs) finden haufig in Applikationen in eingebetteten Systemenund Smartphones Anwendung. Dabei werden haufig neuronaleNetze auf einem allgemeinen Datensatz vorab trainiert und an-schließend auf dem Smartphone mit personenspezifischen Datennachjustiert (engl. finetuning), um eine hohere Genauigkeit zu er-reichen.Um den Energiebedarf gering zu halten, werden 1) die Netze mit re-duzierter Genauigkeit und 2) auf zugeschnittenen Beschleunigern(z.B. systolische Prozessorfelder) berechnet.

PE PE PE PE

PE PE PE PE

PE PE PE PE

Training aufallgemeinenDatensatz

Inferenzauf Gerat

Anwender-spezifischeNachjus-tierung

Hersteller

Anwender/-in

In dieser Abschlussarbeit soll analysiert werden, wie sich die Berechnungsgenauigkeit auf das Trainingeines Netzes auswirkt. Anschließend wird ein Konzept erarbeitet um ein CNN auf einem eng gekoppel-ten Prozessorfeld nachzujustieren. Dieser Ansatz wird auf einer existierenden Architektur implementiert.Es wird evaluiert ob sich die Genauigkeit positiv verbessert und wie viele Ressourcen fur eine effizienteUmsetzung notwendig sind.

Voraussetzungen: Programmierkenntnisse in C++, Python; Grundkenntnisse uber Neuronale NetzeArt der Arbeit: Theorie (20%), Konzeption (40%), Implementierung (40%)Ansprechpartner: Christian Heidorn ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

HiWI: Deep Neural Network Optimization

DNNs are at the forefront of the recent advancement in AI systems. However, whendeploying DNNs on the hardware, required memory, latency, throughput, and ener-gy consumption become important challenges.In order to address these challenges at the HSCD chair, typical image and au-dio processing applications utilizing deep learning are compressed, optimized, andcompiled for the target hardware. Open Neural Network Exchange (ONNX) is usedas an intermediate exchange format between compression and compilation stage.A student assistant is needed to assist in this work.

DNN

Compression

ONNX

Compilation

Target Hardware

This student assistant position has the followings tasks

• Understanding of the ONNX format and improving the ability to export and import optimized DNNsusing ONNX.

• Assisting in research related to the optimization of DNNs.

Prerequisites: Knowledge of machine learning/deep learning and Python.Sort of work: Theory (15%), conception (25%), implementation (60%)Supervisor: Muhammad Sabih ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Domain-Specific Library Design in Modern C++

In high-performance computing, applications have to scale to an ever-increasing amount of compute nodes. Furthermore, the domain of dis-tributed and parallel numerical algorithms is highly interdisciplinary. Toensure the productive use of frameworks, parallelization should be ab-stracted from algorithm development.

We are developing HighPerMeshes, a domain-specific language thatis able to distribute algorithms designed for unstructured grids on hun-dreds of compute nodes. HighPerMeshes implements an executor model, which provides an abstractinterface for implementing these algorithms that can then be parallelized with multiple strategies. Ho-wever, we want to further generalize this executor to allow parallelization of (sparse) matrix and vectoroperations.

Your main tasks are to review the current design decisions, which are based on the implementation ofalgorithms on unstructured grids, and find common abstractions that allow distributing (sparse) matrixoperations on a multitude of compute nodes. The amount of work can be adjusted to fit a studentassistant job or as a bachelor thesis.

Required skills: C++ 17 and common design patternsNature of work: Theory (20%), Conception (40%), Implementation (40%)Contact: Stefan Groth ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Bildverarbeitungsketten auf Prozessorfeldern

Bildverarbeitung ist allgegenwärtig, insbesondere in eingebettetenSystemen – von der Hinderniserkennung in Radarsystemen bis zurTumorerkennung in 3D-Bildern eines Tomographen. Prozessorfelderwie Tightly Coupled Processor Arrays (TCPAs) eignen sich beson-ders für deren Beschleunigung, weil sie nicht nur viel Parallelität bie-ten, sondern auch sehr energieeffizient sind.Bildverarbeitung bedeutet oft die Hintereinanderschaltung verschie-dener Filter; man spricht dann von einer Bildverarbeitungskette. Bild-verarbeitungsketten lassen sich etwa durch einen Filtergraphen wiein nebenstehender Abbildung darstellen.In dieser Arbeit soll ein domänenspezifischer Compiler entwickeltwerden, der aus der formalen Beschreibung einer Bildverarbei-tungskette parallelisierten Assembler-Code für TCPAs erzeugt. DieSchwierigkeit: Wie lassen sich die Filter parallelisieren und die Pro-zessorelemente allozieren angesichts der notwendigen Kommunika-tion zwischen den Filtern?

PE PE PE PE PE

PE PE PE PE PE

PE PE PE PE PE

PE PE PE PE PE

Schwellwert

Kanten X Kanten Y

Ecken

Voraussetzungen: Gute Kenntnisse von C++ und AssemblerArt der Arbeit: Theorie (20%), Konzeption (40%), Implementierung (40%)Ansprechpartner: Michael Witterauf ([email protected])

Lehrstuhl für Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Computer Vision with FPGAs and C++

Field Programmable Gate Arrays (FPGAs) areproved to be among the most suitable architectu-res for image processing applications. Describingan application-specific hardware via Verilog HDLor VHDL is time-consuming and error-prone. Re-cent advancements in High-Level Synthesis (HLS)promise to solve this problem. Yet, very few HLS-based libraries exist to enable FPGAs for softwareprogrammers.

This work will focus on development of an image processing library for the Xilinx FPGAs. Thereby, apolicy based C++ template library will developed for Vivado-HLS. Final work should be polished to beshared as an open source project.

The main tasks within this thesis are:a) understanding the basic abstractions of image processing and their efficient FPGA implementationsb) developing a well written C++ libraryc) benchmarking the library with sample applications

Required skills: Self Learning, Good knowledge of C++ and, Understanding of FPGA designNature of work: Theory (30 %), Conception (30 %), Implementation (40 %)Contact: M. Akif Oezkan ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

High Performance Computing with FPGAs and OpenCL

FPGAs are one of the accelerators that are recentlyemerging as more power-efficient alternatives to GPUsin world-class supercomputers. In particularly, FPGAsexcel at stencil computations.

This work focus on implementing a subset of the Rodi-nia benchmark suite for these type of applications. Wetake advantage of High Level Synthesis (HLS) that al-lows FPGAs to be more easily programmed in softwareprogramming languages. Thereby, we use OpenCL and Vivado-HLS to target Intel and Xilinx FPGAs.

We aim to develop an open source library, where FPGA-specific optimizations are hidden to the user.We will benchmark our implementations on state-of-the-art FPGAs that can be accessed from the cloudservers (Amazon AWS).

The main tasks are:a) investigating FPGA/HLS design techniques for Rodinia benchmarkb) developing a template-based library for high-performance computingc) benchmarking the target applications on cloud servers

Required skills: Self-Learning, Enthusiasm on Research, C++, and Understanding of FPGA designNature of work: Theory (30 %), Conception (30 %), Implementation (40 %)Contact: M. Akif Oezkan ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Database Query Placement for FPGA-based Reprogram-mable Data Providers (ReProVide)

ReProVide aims to push down database operators to the table stora-

cba

scansales

[a,b,c,k]

arithmetic

b,c

[+]

f ilterb+c

> [min_price]

GroupByk

aggregatea

[sum]

User UserInterface

Buffer[1]

sum_func

Buffer[2]

Acc2

Acc1

Buffer[3]

Acc0

Buffer[4]

ScanController

ge to reduce execution time and data transfer cost. A system-on-chipconsisting of ARM processors and FPGA logic is used as a reprogram-mable platform to implement database accelerators and novel storagesolutions.A remote host can query preprocessed data from this storage. Thesystem-on-chip executes the respective query execution plan (QEP)and sends back the data. However, this involves mapping all query ope-rations onto the software and reconfigurable hardware resources of thesystem so that e.g. latency/throughput is optimized. This optimizationproblem is called Optimal Query Placement and has to be evaluatedevery time a new query arrives. Therefore, there are time constraints.In this work we would like to investigate heuristics to create optimizedquery placements efficiently.

Prerequisites: Basic knowledge in C++ and VHDLType of Work: Theory (30%), Conception (40%), Implementation (30%)Supervisor: Andreas Becher ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Scheduling DMA Transfers for Processor Arrays

Processor arrays are energy-efficient loop accelerators that leveragethe vast parallelism offered by loops. They, however, have a bottleneck: the more parallel the accelerator works, the higher the databandwidth must be.To overcome this, it is beneficial to couple the processor array tightlyto the data source, for example main memory or sensors. This crea-tes a variety of parallel memory requests (address, number of bytes),which accesses the same external memory over a shared DMA con-troller.The goal of this work is to design a DMA controller in VHDL, whichcan process multiple DMA requests in parallel. Since the nature ofthese requests has a strong influence on the performance of suchsystems, different arbitration schemes shall be investigated regardingthe archieved throughput and the estimated hardware cost. The im-plementation shall be tested on the ZynQ System-on-Chip platform.

PE PE PE PE PE

PE PE PE PE PE

PE PE PE PE PE

DMA Controller Schedule

Bus

Prerequisites: Experience with VHDLType of work: Theory (20%), Design (40%), Implementation (40%)Supervisors: Dominik Walter ([email protected])

Lehrstuhl für Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Loose Enforcement of Timing and Energy Trade-Offs

Applications running on a many-core platform may have a set of up-per/lower bounds on their non-functional properties, e.g., latency, whichmust be met. Run-Time Requirement Enforcement (RRE) specifiestechniques that monitor the application execution and control someplatform parameters, e.g., the operating frequency of the cores, to sa-tisfy the requirements. For example, in the figure, the application hastwo sets of requirements, namely, latency and power consumption.

The goal of this thesis is to investigate the approaches from control en-gineering to implement RREs for timing and energy requirements. He-re, the following topics must be investigated: 1) Investigate algorithmsborrowed from control theory on their capability of controlling timingand energy consumption in tile-based invasive multi-core architectures, 2) Selection of control points, 3)Stability considerations, and 4) Simulative evaluation of energy savings for object detection application.

Prerequisites: Familiarity with control theory and many-core systems, Programming skills in javaType of work: Theory (40%), Conception (20%), Implementation (40%)Contact: Jurgen Teich ([email protected])

Behnaz Pourmohseni ([email protected])Stefan Wildermann ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Efficient Search Heuristics for Many-Core Task Migration

Adaptive management of resources and applications in many-core

RM

× × X

...

systems relies on task migration. In such systems, at run time, a Re-source Manager (RM) entity checks different options for task migra-tion to decide which task(s) to be migrated and to what destinationthey must be migrated, e.g., so that the system performance is ma-ximized. Depending on the system state (resource occupation, loaddistribution, etc.) the RM may have to check several migration opti-ons until it finds a feasible or profitable option. For many systems,e.g., real-time systems, the run-time overhead of this search processcan become too high to be acceptable or even affordable.The goal of this thesis is to investigate search heuristics that reduce this overhead by (i) identifying mi-gration candidates which have a higher chance of being feasible and (ii) prioritizing these candidates inthe check process to reduce the (average) number of options that are checked before finding a feasiblecandidate. Such a prioritization scheme can take into account, e.g., the current load distribution in thesystem or in different sub-systems. In this thesis, you will investigate existing techniques for this problem,develop new heuristics, and evaluate the effectiveness of different approaches via system simulation.

Prerequisites: Programming skills in Java and Python, basic knowledge in embedded systemsType of work: Theory (60%), Implementation (40%)Contact: Behnaz Pourmohseni ([email protected])

Stefan Wildermann ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

A 10G Ethernet Controller for Reprogrammable Data Provi-ders (ReProVide)

ReProVide aims to push down database operators to Database Management System

Global OptimizerHeterogeneity-aware query partitioning

Interconnect (×-Bar)

CPU System

Core Core

Core Core

Network ctrlRDMA

CompressionFormatting DMA DMA

DMA DMA

DataflowController

partialreconf.

area

partialreconf.

area

partialreconf.

area

V-RAM

V-RAM

V-RAM

NV-RAM

NV-RAM

SSD

SSD

HDD

the table storage to reduce execution time and datatransfer cost. A system-on-chip consisting of ARM pro-cessors and FPGA logic is used as a reprogrammableplatform to implement database accelerators and no-vel storage solutions. Our tables are stored in SATAconnected SSDs. A remote host can request proces-sed data from this storage. The system-on-chip execu-tes the respective query and sends back the data usinga processor-local 1GBit network interface.In this work we would like to implement a 10GBit Ether-net interface together with a controller using FPGA lo-gic to relief the ARM processors from the task of resulttransmission.Prerequisites: Basic knowledge in C++ and VHDLType of Work: Theory (20%), Conception (30%), Implementation (50%) (SHK)Supervisor: Andreas Becher ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Feature Engineering for Auto-Tuningof Deep Learning CompilersCompiler optimizations are dependent on the type of application and the type ofhardware architecture. Modern Deep Learning (DL) compilers such as Tensor Vir-tual Machine (TVM) utilize a Machine Learning (ML)-based model for auto-tuningcompiler optimizations. Using the knowledge of a hardware architecture and thedeep learning application, features can be extracted to improve the performance ofML-based auto-tuner.These features can be extracted from the source-code of the deep learning mo-del, from the graph representation such as ONNX, and from dynamic behavior ofthe executed code on target hardware. Given a set of typical deep learning-basedimage and audio processing applications, the tasks for the master’s thesis are asfollows

TrainedDNN

CodeFeatures

ONNXGraph

Features

Compiler DynamicFeatures

Tuning MLmodel

Hardware

• Compile the deep learning models using TVM and identify the features for ML-based auto-tuning.• Evaluate the performance of this ML-based auto-tuning and compare it to the existing TVM auto-

tuner as well as the vendor-specific compilers such as NVIDIA TensorRT.• Optionally, generate the features using graph-based deep learning.

Prerequisites: Knowledge of hardware architecture, ML/DL, Python and C++.Sort of work: Theory (25%), conception (35%), implementation (40%)Supervisor: Muhammad Sabih ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen

Energy reduction of CNN execution by reducingcontrol overhead in hardware accelerators

Tightly Coupled Processor Arrays (TCPAs) are a class of highly customi-zable hardware accelerators that are perfectly suited to accelerate loopapplications like FIR filters, matrix-matrix-multiplications, and also of thepopular Convolutional Neural Network (CNN) applications used in ma-chine learning. A TCPA consists of an array of processors, while eachprocessor can communicate data directly to its neighbor. This designenables fast and low-power execution of such loop applications.

The goal of this work is to investigate ways to reduce the control overhead in the execution of applicationslike CNNs in the TCPA architecture, especially when they are run on low bit widths. For example, forCNN inference, often only 8 or 16 bits or even custom floating-point formats are used. This provides thechance for our TCPAs to be even more efficient by utilizing, e.g., vector processing. Vector processingis a concept that reduces control and instruction overhead by, for example, executing four 8-bit additionsinstead of one 32-bit addition in one instruction on the same hardware unit.

Requirements: Knowledge in VHDL programming mandatoryknowledge in Python programming helpful

Type of thesis: Theory (30%), concept (40%), implementation (30%)Supervisor: Marcel Brand ([email protected])

Lehrstuhl fur Informatik 12Hardware-Software-Co-Design

Cauerstraße 1191058 Erlangen