Ph.D. student at UCLA
470 Eng VI
Computer Science Department
University of California, Los Angeles
jiewang [at] cs.ucla.edu
I am currently a third-year Ph.D. student in the Computer Science Department at the University of California, Los Angeles, advised by Prof. Jason Cong. I am a member of the VLSI Architecture, Synthesis & Technology (VAST) Laboratory. I received my B.S. degree in Electronic Engineering from Tsinghua University, with a double major in Economics.
My research interests lie in parallel/distributed architecture and programming. I am involved in research projects including customized computing for deep learning, standard BLAS libraries, and genomic applications. You can find my CV here.
Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithm, supervised by Prof. Jason Cong
Data movement is increasingly becoming the bottleneck in both performance and energy efficiency in modern computation. There used to be limited freedom for communication optimization on GPUs, as conventional GPUs provided only two methods for inter-thread communication: shared memory and global memory. Starting with the Kepler architecture, Nvidia GPUs introduced a warp shuffle instruction, which enables threads within the same warp to exchange data directly in registers. This has brought new performance optimization opportunities for algorithms with intensive inter-thread communication.
In this work, we deploy register shuffle in the application domain of sequence alignment (or similarly, string matching), and conduct a quantitative analysis of the opportunities and limitations of using register shuffle. We select two sequence alignment algorithms, Smith-Waterman (SW) and PairHMM, from the widely used Genome Analysis Toolkit (GATK) as case studies. Compared with implementations using shared memory, we obtain speedups of 1.2x and 2.1x using register shuffle for SW and PairHMM, respectively. Furthermore, we develop a performance model for analyzing kernel performance, based on shuffle latency measured with a suite of micro-benchmarks. Our model provides CUDA programmers with insights into how best to use shuffle instructions for performance optimization.
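To make the register-shuffle idea concrete, here is a minimal Python sketch that models the data movement of a warp-wide "shuffle up" and uses it for an inclusive prefix sum across one warp. It is only an illustration of the communication pattern, not the SW or PairHMM kernels from this work. The helper names `shfl_up` and `warp_inclusive_scan` are hypothetical, and the model assumes the semantics of CUDA's `__shfl_up_sync`, where lanes with an index below the shift distance keep their own value.

```python
WARP_SIZE = 32  # number of lanes (threads) in one warp on Nvidia GPUs


def shfl_up(regs, delta):
    """Model a shuffle-up: lane i reads the register of lane i - delta.

    Lanes with i < delta have no source lane and keep their own value,
    which is the behavior assumed here for CUDA's __shfl_up_sync.
    """
    return [regs[i - delta] if i >= delta else regs[i]
            for i in range(len(regs))]


def warp_inclusive_scan(regs):
    """Inclusive prefix sum over one warp using only lane-to-lane shuffles."""
    regs = list(regs)
    delta = 1
    while delta < len(regs):
        shifted = shfl_up(regs, delta)
        # Only lanes that received a value from a real source lane accumulate.
        regs = [r + s if i >= delta else r
                for i, (r, s) in enumerate(zip(regs, shifted))]
        delta *= 2
    return regs


if __name__ == "__main__":
    print(warp_inclusive_scan([1] * WARP_SIZE))
```

In this model, each of the log2(32) = 5 steps exchanges values directly between lanes, with no staging buffer standing in for shared memory.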
Automated Generation of High-Performance Large-Scale Matrix Multiplication Accelerator on FPGA, supervised by Prof. Jason Cong
Matrix multiplication (MM) is a key linear algebra routine that is widely used in many application areas. In this work, we present a high-performance single-precision dense MM accelerator for FPGAs, together with an automatic generator that produces a high-throughput, resource-efficient accelerator from hardware and MM workload specifications. The accelerator adopts the linear systolic array as its basic building block and integrates several such blocks in an optimized architecture. The size and the number of blocks are parameterized, allowing the user to search for the optimal design parameters through automatic design space exploration. The accelerator is tested on the Xilinx VC709 evaluation board and shows a peak performance of 198.1 GFLOPS.
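As a rough illustration of how a linear systolic array can compute a matrix product, the sketch below simulates a one-dimensional chain of processing elements (PEs) in Python. It is a toy model under stated assumptions, not the accelerator's actual architecture: each PE j is assumed to hold column j of B, elements of A enter the first PE one per cycle, and every token moves one PE to the right each cycle. The function name `linear_systolic_matmul` is hypothetical.

```python
def linear_systolic_matmul(A, B):
    """Compute C = A x B with a simulated linear (1-D) systolic array.

    Assumed mapping: PE j stores column j of B and accumulates column j of C.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]

    # Tokens (row index, inner index, value) enter PE 0, one per cycle.
    stream = [(i, p, A[i][p]) for i in range(n) for p in range(k)]

    pipe = [None] * m  # pipe[j] is the token currently held by PE j
    for cycle in range(len(stream) + m):
        incoming = stream[cycle] if cycle < len(stream) else None
        # Every token advances to the neighboring PE on its right.
        pipe = [incoming] + pipe[:-1]
        # Each PE performs one multiply-accumulate on the token it holds.
        for j, token in enumerate(pipe):
            if token is not None:
                i, p, a = token
                C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    print(linear_systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because each PE only talks to its neighbor, the number of PEs (here, the number of columns of B) is a natural design parameter of the kind the generator explores.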
Going Deeper with Embedded FPGA Platform for Convolutional Neural Network, supervised by Prof. Yu Wang
In recent years, Convolutional Neural Network (CNN) based methods have achieved great success in a large number of applications and are among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computationally intensive and resource-consuming, and are thus hard to integrate into embedded systems such as smart phones, smart glasses, and robots. FPGAs are among the most promising platforms for accelerating CNNs, but limited bandwidth and on-chip memory size constrain the performance of FPGA-based CNN accelerators.
In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that Convolutional layers are computation-centric and Fully-Connected layers are memory-centric. We then propose a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in CNNs, to improve bandwidth and resource utilization. Results show that our data quantization flow introduces only 0.4% accuracy loss for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on the Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the Convolutional layers and of the full CNN is 187.8 GOP/s and 137.0 GOP/s, respectively, at a 150 MHz working frequency, which significantly outperforms previous approaches.
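The abstract does not spell out the quantization flow, so the following Python sketch shows only one plausible reading of dynamic-precision quantization: fixed-point numbers whose fractional length is chosen separately for each group of values (for example, per layer) to minimize quantization error. The function names, the candidate range of fractional lengths, and the absolute-error criterion are all assumptions made for illustration.

```python
def quantize(values, bits, frac_len):
    """Round values to signed fixed-point with `bits` total bits and
    `frac_len` fractional bits, saturating at the representable range."""
    scale = 2.0 ** frac_len
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) / scale for v in values]


def best_frac_len(values, bits, candidates=range(-8, 17)):
    """Pick the fractional length with the smallest total absolute error.

    The candidate range is an arbitrary assumption for this sketch.
    """
    def total_error(frac_len):
        quantized = quantize(values, bits, frac_len)
        return sum(abs(v - q) for v, q in zip(values, quantized))
    return min(candidates, key=total_error)


if __name__ == "__main__":
    small_layer = [0.5, -0.25, 0.125]   # hypothetical small-magnitude data
    large_layer = [100.0, -50.0]        # hypothetical large-magnitude data
    for name, layer in [("small", small_layer), ("large", large_layer)]:
        fl = best_frac_len(layer, bits=8)
        print(name, fl, quantize(layer, 8, fl))
```

The point of the sketch is that groups with different value ranges end up with different fractional lengths while sharing the same bit width, which is what lets a narrow fixed-point format cover layers of very different magnitudes.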