Jonah Mevert
An FPGA-based Hardware Accelerator for Speech Recognition Tasks Using LSTM Networks
Abstract
Automatic Speech Recognition (ASR) is a ubiquitous problem in Computer Science that has become increasingly important in recent years due to the emergence of personal assistants such as Siri or Alexa. Long Short-Term Memory (LSTM) neural networks are a proven way to approach sequence classification tasks such as ASR, and Connectionist Temporal Classification (CTC) can be used to avoid the need for pre-segmented inputs, reducing the complexity of the problem. However, LSTM networks have more parameters and are computationally more demanding than many other types of neural networks. This makes the use of FPGAs compelling, especially on smaller devices without internet access. In this thesis, a complete workflow is developed to train models for ASR on the DARPA-TIMIT speech corpus using PyTorch and to export them to a Xilinx Zynq XC7Z020 SoC, where tasks are split between the processing system (PS) and the FPGA fabric. The design incorporates quantisation; both the bit widths of the quantised parameters and the degree of parallelisation are parametrised to optimise throughput and resource utilisation on the hardware.