Scalable transform architectures and their implementation for graphic accelerators and communication
Abstract
Most signal processing applications require a large number of computations to be performed in short, predetermined time intervals. An application typically consists of basic digital signal processing (DSP) operations and data movements. Numerous software applications have been developed to run on general-purpose computers to keep up with the speed requirement. For real-time operation, application-specific processors were developed to implement these algorithms. Conventional designs usually have complex control and data exchange. In most cases they are ad hoc solutions that must be redesigned when the throughput requirement changes.
This work presents a systematic approach for designing scalable processors for Discrete Trigonometric Transforms (DTT), with a major focus on the Discrete Cosine Transform (DCT). The approach is based on the theory of regular algorithms for DTT, which allow regular data flow between the processing elements (PEs) without complex data flow controls. The key idea of regular DTT algorithms is to localize the irregularities of the fast transform algorithms in the nodes of the computational graph. The nodes are parameterized so that the PEs of the proposed processor architecture can be reconfigured easily to compute different node functions. The resulting algorithm is similar to the constant geometry version of the Cooley-Tukey FFT. Such structures have a degree of scalability.
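For readers unfamiliar with the constant geometry form mentioned above, the following is a minimal sketch of a radix-2 constant geometry FFT, not the DTT algorithm developed in this work. Every stage reads the pair (j, j + N/2) and writes the pair (2j, 2j + 1), so the interconnection is identical in all stages and only the twiddle factor inside each node changes. The function names `constant_geometry_fft` and `naive_dft` are illustrative assumptions.

```python
import cmath


def naive_dft(x):
    """Reference O(N^2) DFT used only to check the fast version."""
    n = len(x)
    return [
        sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
        for k in range(n)
    ]


def constant_geometry_fft(x):
    """Radix-2 decimation-in-frequency FFT with the same wiring in every stage.

    Assumes len(x) is a power of two. Each stage reads (j, j + N/2) and
    writes (2j, 2j + 1); only the twiddle exponent depends on the stage.
    """
    n = len(x)
    stages = n.bit_length() - 1
    if n < 1 or (1 << stages) != n:
        raise ValueError("length must be a power of two")
    half = n // 2
    data = [complex(v) for v in x]
    for s in range(stages):
        out = [0j] * n
        for j in range(half):
            # Node parameter: the twiddle exponent is the only stage-dependent part.
            exponent = (j >> s) << s
            w = cmath.exp(-2j * cmath.pi * exponent / n)
            a, b = data[j], data[j + half]
            out[2 * j] = a + b
            out[2 * j + 1] = (a - b) * w
        data = out
    # The stages leave the spectrum in bit-reversed order; restore natural order.
    result = [0j] * n
    for p in range(n):
        k = int(format(p, "b").zfill(stages)[::-1], 2) if stages else 0
        result[k] = data[p]
    return result
```

Because the read and write pattern never changes, a hardware realization can keep fixed wiring between stages and move all stage-specific behavior into the node parameters, which is the property the regular DTT algorithms exploit.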
Scalability is achieved by selecting the number of PEs. The application throughput determines the number of PEs required in a processing column. The PEs are stacked vertically in a partial column structure, forming a second pipeline dimension. The proposed scalable structure requires minimal control and memory usage. It supports continuous data processing and leads to area optimization. Each PE has an associated first-in, first-out (FIFO) memory for local data reordering. The building block of the FIFO is a Shift Exchange Unit (SEU), which acts as a delay-switch-delay memory structure. The number of SEUs is a function of the transform size (N) and the number of PEs. A scalable structure for 2-D DCTs based on a transpose memory is also introduced. Different configurations were evaluated and compared in terms of area, delay, and power.
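The role of a transpose memory in a 2-D DCT can be illustrated with the standard row-column decomposition: a 1-D DCT is applied to every row, the intermediate block is transposed, and the same 1-D DCT is applied again. The sketch below is a software model under stated assumptions (unnormalized DCT-II, a direct O(N^2) 1-D transform, hypothetical function names); it does not reproduce the hardware structure described in this work.

```python
import math


def dct_1d(v):
    """Unnormalized DCT-II: X[k] = sum_n v[n] * cos(pi * (2n + 1) * k / (2N))."""
    n = len(v)
    return [
        sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def transpose(block):
    """Software stand-in for the transpose memory: swap rows and columns."""
    return [list(col) for col in zip(*block)]


def dct_2d_row_column(block):
    """2-D DCT as row transforms, a transpose, row transforms, and a transpose."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(row) for row in transpose(rows_done)]
    return transpose(cols_done)


def dct_2d_direct(block):
    """Direct double-sum 2-D DCT-II, used only as a reference for checking."""
    r, c = len(block), len(block[0])
    return [
        [
            sum(
                block[i][j]
                * math.cos(math.pi * (2 * i + 1) * u / (2 * r))
                * math.cos(math.pi * (2 * j + 1) * v / (2 * c))
                for i in range(r)
                for j in range(c)
            )
            for v in range(c)
        ]
        for u in range(r)
    ]
```

The same 1-D transform is reused for both passes, so in hardware only the transpose step needs additional storage between them.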
The scalable microarchitectures were validated by designing various configurations. Scalability was achieved by changing a single parameter, PE, in the control file. The design was shown to have a speedup of M over a single PE, where M is the number of PEs in the processing column. It is shown that the design cost (power, area, delay) can be easily predicted. The structure does not use memory for intermediate results. The structure is extendable to other DTTs (FFT and DST) and allows reuse of modules. The designs are synthesizable for both ASIC and FPGA design flows.
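The speedup of M can be made concrete with an idealized cycle-count model. The sketch below is an assumption for illustration only, not a result from this work: it supposes a radix-2 graph with N/2 node operations per stage and log2(N) stages, one node operation per PE per cycle, and no pipeline fill or memory stalls. The names `column_cycles` and `speedup` are hypothetical.

```python
import math


def column_cycles(n, num_pes):
    """Idealized cycles for one transform of size n on a column of num_pes PEs.

    Assumes n is a power of two, n/2 node operations per stage, log2(n)
    stages, and one node operation per PE per cycle.
    """
    stages = n.bit_length() - 1
    nodes_per_stage = n // 2
    return stages * math.ceil(nodes_per_stage / num_pes)


def speedup(n, num_pes):
    """Ratio of single-PE cycles to cycles with num_pes PEs in the same model."""
    return column_cycles(n, 1) / column_cycles(n, num_pes)
```

In this model `speedup(64, 4)` evaluates to 4.0, matching the factor of M, while a PE count that does not divide N/2 evenly (for example `speedup(8, 3)`) falls short of M because some PEs idle in the last cycle of each stage.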