Hardware Accelerators & High-Level Synthesis (HLS)
2026-05-27 | By Antonio Velasco
With changing priorities and an ever-evolving compute space (mainly with AI’s rise), computing performance isn’t just based on clock speeds or processors. They just aren’t efficient enough, and the room for improvement is shrinking at an alarming pace. Instead, designers are looking towards spending their time on building more specialized chips for specific purposes, in this case, speeds and hardware accelerators. These custom and heavily modified hardware blocks are designed to execute very specific tasks at high speeds and are a result of development reaching very specific constraints. For anybody interested, this article will give a crash course in what exactly constitutes a hardware accelerator and how they can be implemented by anybody.

The Rise of Hardware Accelerators
Historically, you could easily just improve performance by increasing clock frequencies and parallelism. This is why there was a rapid increase in frequency speeds, where in 1978 a processor would be at 5MHz, and in 2010 it ballooned to 3.3GHz, representing over a 600x increase. However, with larger frequencies came larger power consumptions because of the relationship between capacitive load, voltage, and switching frequency. As dynamic power consumption dominates CMOS circuits, further performance scaling through frequency alone became impractical.

Hardware accelerators came up as a solution by trading the flexibility of a general-purpose CPU for efficiency. Instead of having to execute diverse workloads and a ton of other tasks, accelerators are designed to efficiently execute very specific operations. It’s like using a vegetable peeler instead of a knife to skin potatoes—a more specific tool with the same principles, just way more efficient for a certain task. Accelerators exist across a wide spectrum, including instruction-level coprocessors, kernel-level accelerators such as matrix multipliers, and application-level accelerators like deep neural network inference engines.
Understanding Performance Metrics in Accelerator Design
Now, before going deeper into accelerator design, it is crucial to comprehend the key performance parameters that influenced the creation of hardware accelerators. Throughput and latency are two of the most crucial measures.
Throughput is the number of jobs finished in a given amount of time, whereas latency is the amount of time needed to finish a single task. While certain applications prioritize high throughput, like video processing or machine learning training, others, like real-time object identification, demand minimal latency.

You can think of throughput, latency, and bandwidth as properties of a pipe. Bandwidth is simply capacity (a larger pipe means more bandwidth), whereas throughput is the amount of data sent and received in a certain timeframe. Latency is the exact timeframe and measures the speed at which data is sent. Data can be latency-constrained or throughput-constrained, making it a difficult challenge to tackle.
Parallelism, which is just the capacity to carry out several calculations at once by locating and utilizing independent operations within a task, is another crucial idea to comprehend. By taking advantage of parallelism at several levels, including instruction-level, data-level, and task-level parallelism, hardware accelerators frequently increase performance. For instance, hardware designers can clone arithmetic units to analyze many data pieces at once since a basic vector addition loop has independent operations across iterations.
Nevertheless, tradeoffs are introduced by parallelism. While increasing the number of processing units enhances computational throughput, it also raises the requirements for memory bandwidth, area, and power consumption. During accelerator development, designers must carefully weigh these conflicting restrictions.
High-Level Synthesis: Bridging Software and Hardware
Traditionally, hardware design was basically just the engineer describing circuits through RTL (register-transfer level languages), like VHDL or Verilog. While still widespread and very powerful, for more complex systems, it’s extremely hard to scale. For this, a “translating” layer is deployed to allow engineers to describe hardware functionality through higher-level programming languages like C/C++ and then into hardware implementations. This is referred to as High-Level Synthesis (HLS).
The flow starts with the C/C++ algorithm, which is parsed through an HLS tool, which leads to RTL code. This RTL code can now be put through synthesis and into an FPGA or ASIC implementation.
HLS accelerates development by enabling designers to focus on algorithmic optimization rather than low-level hardware implementation details. This is much easier to do with C/C++ as opposed to RTL. However, translating software into efficient hardware requires solving several fundamental design problems: resource allocation and scheduling.
Resource Allocation in High-Level Synthesis
In HLS, hardware components must be assigned to perform these specific operations. Resource allocation determines this flow and how the tasks are portioned out. These resources may include functional units such as adders or multipliers, memory elements such as registers and buffers, and communication components such as buses and multiplexers.
In software, operations are executed sequentially on shared resources (like those on a CPU). In hardware, these shared resources can be replicated to enable these operations to be executed in parallel. For example, implementing multiple arithmetic units allows multiple operations to occur simultaneously, significantly improving performance.
However, similar to how just increasing the clock frequency doesn’t solve the problem, increasing the resources is not plain and simple. More resources increase silicon area and power consumption and may complicate routing and timing closure. To mitigate this, designers often reuse functional units across different operations or execution stages to balance performance with hardware cost. This leads to scheduling and determining when each resource is used.
Scheduling: Determining When Operations Occur
Scheduling sets the order in which operations happen and decides which clock cycle each operation runs on. Scheduling is very important for getting the most out of parallelism while still respecting data dependencies, since hardware allows for concurrent execution.
When scheduling, you need to think about operation latency, resource availability, and how instructions depend on each other. If two operations need the same data, they can't run at the same time. On the other hand, independent operations can often be done at the same time.
When scheduling, optimization techniques include combining independent operations into the same state to make them run in parallel or breaking up complicated operations into several states to make the hardware less complicated.
Memory and Interface Considerations
The performance of an accelerator depends on more than just its computational power. It also depends on how memory is set up and how data moves. When the speed of data transfer limits performance instead of the speed of computation, many applications become memory-bound.
Using on-chip memory to increase data reuse or recomputing intermediate results instead of storing them are two design techniques that can greatly speed up performance. Also, accelerator interfaces need to be built to give the CPU, memory, and accelerator hardware enough bandwidth.

Conclusion
The move toward hardware acceleration is one of the biggest changes in modern computing. High-Level Synthesis makes this change possible by turning high-level algorithmic descriptions into hardware architectures that work best. HLS tools let designers use hardware parallelism to balance performance, power, and area constraints by allocating resources, scheduling tasks, and binding them together.
Engineers who want to build the next generation of high-performance embedded and SoC systems will still need to know about HLS and accelerator design principles as computing becomes more specialized and efficient.