★Tutorial★ Understand Your FPGA Designs Better:
From Rapid Simulation to On-board Profiling
Understanding FPGA design performance is crucial for optimizing designs and meeting performance targets. Performance is evaluated in two ways: simulated in terms of cycle count, and measured on-board in terms of latency/throughput. Both are challenging to obtain accurately and efficiently.
Simulated performance. Current HLS tools provide estimated or simulated performance metrics, either through HLS synthesis reports, which give estimated clock cycle counts, or through C/RTL co-simulation, which gives relatively accurate clock cycle counts. HLS synthesis reports are computed by static analysis and are fast to obtain (usually within minutes), but they are well known to be inaccurate. C/RTL co-simulation captures the design’s dynamic behavior and is much more accurate, but it is usually slow (potentially taking hours or even days per design point). Obtaining performance numbers that are both accurate and fast is therefore valuable but difficult.
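One common reason a static report and a dynamic simulation disagree is a loop whose trip count depends on the input data. The sketch below is a toy illustration only and is not taken from any HLS tool: it assumes a hypothetical cycle model of one loop iteration per clock cycle, and the names MAX_LEN, static_estimate_cycles, and simulated_cycles are invented for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed worst-case loop bound (hypothetical value for illustration).
constexpr std::size_t MAX_LEN = 1024;

// Toy static estimate: without seeing the data, the only safe answer
// is the worst-case trip count of the loop.
std::uint64_t static_estimate_cycles() {
    return MAX_LEN;
}

// Toy dynamic count: the loop exits at the first zero sentinel, so the
// number of iterations (one cycle each in this model) depends on the data.
std::uint64_t simulated_cycles(const std::int32_t *data, std::size_t len) {
    std::uint64_t cycles = 0;
    for (std::size_t i = 0; i < len && i < MAX_LEN; ++i) {
        ++cycles;                  // one iteration costs one cycle here
        if (data[i] == 0) break;   // data-dependent early exit
    }
    return cycles;
}
```

For an input whose sentinel appears early, the dynamic count is far below the static bound, which is the kind of gap that only a simulation of the actual execution can close.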
To address this challenge, we introduce LightningSim, our fast and accurate simulation tool, which provides highly accurate performance simulation while running orders of magnitude faster than C/RTL co-simulation. LightningSim is open-source and features an easy-to-use, push-button workflow for Vitis HLS projects. It ships as a single Python package, and its built-in command-line tool starts a web server that visualizes the progress of the LightningSim flow. No additional configuration is required to work with an existing Vitis HLS project. We hope that with LightningSim, designers can effortlessly obtain simulation data to inform design decisions throughout the HLS development process.
Measured performance. Even simulated performance often diverges significantly from real on-FPGA performance, which makes it hard to identify the real bottlenecks and optimize designs effectively. We usually observe large discrepancies between co-simulation and on-FPGA cycle counts, especially for complicated designs with frequent off-chip data movement. Existing on-FPGA profiling tools are restricted to probing end-to-end latency, leaving the performance of inner modules a complete black box to designers. Additionally, these tools require cumbersome Verilog inspection, diminishing the convenience and advantages of HLS.
To address this challenge, we introduce RealProbe, a fully automated on-board profiling tool designed to extract real on-FPGA performance simply by annotating the HLS source code. With just one line, #pragma HLS RealProbe, our tool automatically generates all the code necessary to profile the exact cycle counts of an entire function hierarchy on-board. By providing precise and comprehensive on-FPGA cycle performance, we expect RealProbe to offer a user-friendly way for designers to optimize designs based on actual on-board performance.
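As a rough sketch of what such an annotation might look like, the kernel below carries the single pragma line. The function vector_scale, its arguments, and the size N are hypothetical, and placing the pragma at the top of the function body is an assumption based on where Vitis HLS pragmas usually go; the RealProbe documentation should be consulted for the exact placement it expects.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical problem size, chosen only for illustration.
constexpr std::size_t N = 8;

// Hypothetical kernel to be profiled on-board.
void vector_scale(const std::int32_t in[N], std::int32_t out[N],
                  std::int32_t factor) {
// Assumed placement: inside the body of the function to be profiled.
#pragma HLS RealProbe
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = in[i] * factor;  // simple element-wise computation
    }
}
```

A standard C++ compiler ignores the unknown pragma, so the same source can still be compiled and checked functionally in software before synthesis.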
Featured Open-source Projects
LightningSim is an ultra-fast, accurate, trace-based simulator for High-Level Synthesis (HLS) designs. It is 99.9% accurate compared with C/RTL co-simulation while being up to two orders of magnitude faster. It also features incremental FIFO depth optimization, which is up to 500x faster than using HLS default exploration.
RealProbe is an on-FPGA profiling tool for HLS designs. Simply adding one line of “#pragma HLS RealProbe” gives you a full view of your HLS design running on the FPGA, reaching down through the hierarchy into all sub-modules, function calls, and loops. Compared with the AMD Integrated Logic Analyzer (ILA), RealProbe also features high accuracy (100%), small resource overhead (0% BRAM), light runtime overhead (5.7%), and incremental synthesis.
HLSFactory is a framework for HLS design datasets. It provides the facilities to collect and build custom HLS datasets using various frontends, supported HLS tools, and data aggregation. It also provides built-in design dataset sources for users who want to run their experiments out of the box. New users can easily contribute their own HLS designs to the existing design datasets, or extend the existing tool flows to support custom flows such as new frontends for design space sampling and new vendor tools.
Edge-MoE is the first end-to-end FPGA accelerator for multi-task Vision Transformers (ViT) using Mixture-of-Experts (MoE), with a rich collection of architectural innovations, including: (1) a novel reordering mechanism for self-attention, which requires only constant bandwidth regardless of the target parallelism; (2) a fast single-pass softmax approximation; (3) an accurate and low-cost GELU approximation; (4) a unified and flexible computing unit that is shared by almost all computational layers to maximally reduce resource usage; and (5) a novel patch reordering method to eliminate memory access overhead. Edge-MoE achieves 2.24x and 4.90x better energy efficiency compared with GPU and CPU, respectively. A real-time video demonstration is available online, along with our open-source code written using High-Level Synthesis.
FlowGNN is a cutting-edge hardware architecture designed to make Graph Neural Networks (GNNs) faster and more adaptable. GNNs are powerful tools used in fields like drug discovery and high-energy physics, but existing solutions often struggle to keep up with the growing demand for both new models and fast processing speeds. FlowGNN solves this by offering a flexible dataflow architecture that supports a wide range of GNN models, without the need for time-consuming pre-processing of data. This makes it ideal for real-time applications where graph structures change frequently. Tested on advanced hardware, FlowGNN delivers speed improvements of up to 254x and 477x compared to traditional CPU and GPU processing, and it significantly outperforms state-of-the-art GNN accelerators.
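Returning to the FIFO depth optimization mentioned for LightningSim, the toy model below illustrates why FIFO depth changes the total cycle count of a dataflow design. It is not LightningSim's algorithm: it is a minimal sketch assuming a hypothetical producer and consumer connected by a single FIFO, where each token has a fixed amount of work on each side and a token becomes readable one cycle after it is written. The function total_cycles and its parameters are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the cycle at which the consumer reads the last token.
// prod_delay[i]: cycles the producer works before attempting write i.
// cons_delay[i]: cycles the consumer works before attempting read i.
// depth: FIFO capacity in tokens (assumed to be at least 1).
std::uint64_t total_cycles(const std::vector<std::uint64_t> &prod_delay,
                           const std::vector<std::uint64_t> &cons_delay,
                           std::size_t depth) {
    assert(depth >= 1);
    assert(prod_delay.size() == cons_delay.size());
    const std::size_t n = prod_delay.size();
    std::vector<std::uint64_t> wr(n), rd(n);  // write and read times
    for (std::size_t i = 0; i < n; ++i) {
        // The producer is ready once its own work for token i is done...
        std::uint64_t w = (i ? wr[i - 1] : 0) + prod_delay[i];
        // ...but it stalls while the FIFO is full: a slot frees up one
        // cycle after token (i - depth) has been read.
        if (i >= depth) w = std::max(w, rd[i - depth] + 1);
        wr[i] = w;
        // The consumer is ready after its own work, but it stalls until
        // token i is visible (one cycle after the write).
        std::uint64_t r = (i ? rd[i - 1] : 0) + cons_delay[i];
        rd[i] = std::max(r, wr[i] + 1);
    }
    return n ? rd[n - 1] : 0;
}
```

When the producer is fast early and the consumer is fast late, a shallow FIFO forces the producer to stall and the run takes longer, while a sufficiently deep FIFO absorbs the burst. Because only the depth changes between such runs, re-evaluating them quickly is what makes depth exploration practical.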