Synthesizing parallel designs from sequential C++ – part 3

A couple of weeks ago, I started explaining how sequential C++ could be parallelized to produce high-performance concurrent hardware. In the first two posts (part 1 and part2) I developed the notions of resource bottlenecks and spatial parallelism; in this third post we will explain the notion temporal parallelism. This concept is essential as it allows parallelizing data dependant operations – a case found in a majority of designs.

As explained previously, two or more operations which do not dependent on each other can be parallelized, provided that sufficient resources are available to process these operations in parallel:

  x = a * b;
  y = c * d;

High-level synthesis can parallelize these two multiplications as 1) there are no variable dependencies and 2) dedicated hardware will be created to match the desired performance, so there are no resource limits.

But what happens when dependencies exist? How can parallel hardware be synthesized in this case? Let’s modify our simple example above by adding a dependency between the two multiplications:

  x = a * b;
  y = c * x;

One intuitively understands that spatial parallelization techniques cannot apply in this case. Indeed, “x”, the result of the first operation must be available in order to compute the second multiplication. In other words, we have cascaded logic and a dependency chain between the two operations.

For the sake of this example, let’s assume that each multiplication takes 1 clock cycle. In the first cycle, we’ll compute x = a * b and in the second cycle, we’ll compute y = c * x, as shown in the example below.

 |  step  1  |  step  2  |
 |-----------|-----------|
 | x = a * b | y = c * x |

At this point, there is no parallelism in our design. To obtain parallelism, we need to introduce a new dimension to our equation; and this is dimension is time.

Spatial parallelism explained in my previous post allows creating parallelism within one iteration of a function. By adding the notion of time, we can now create parallelism across iterations.

Up to now, we have always considered our simple example from the perspective of a single run, producing one result. But by going to hardware, we build a structure which will iterate again and again. Our simple design will not compute just one value of “y”, but many of them, over time.

So let’s add time to our example. Let’s denote a0..y0 the values corresponding to the first iteration, a1…y1 those corresponding to the second iteration, etc…

 |    step 1    |    step 2    |    step 3    |    step 4    |
 |--------------|--------------|--------------|--------------|
 | x0 = a0 * b0 | y0 = c0 * x0 |
                | x1 = a1 * b1 | y1 = c1 * x1 |
                               | x2 = a2 * b2 | y2 = c2 * x2 |

As we see from the diagram above, as we compute “y0″, it is possible to begin the computation of “x1″. So the “x0″ and “y0″ multiplications can’t happen in parallel, but the “x1″ and “y0″ can. Our two multipliers will indeed run concurrently, but as they do, they will process values corresponding to different data sets.

In the hardware designer’s vocabulary, what I just described is called pipelining. Pipelining is the bread and butter of RTL design.

As we did when looking at spatial parallelism, let’s now generalize our example. Let’s replace our simple multiplications by more complicated functionality and let’s encapsulate this functionality in separate functions “foo” and “bar”.

The pseudo-code for our modified C program would look like:

  x = foo(a);
  y = bar(x);

The two functions can be as complex as desired, the same pipelining technique applies.

 |    step 1    |    step 2    |    step 3    |    step 4    |
 |--------------|--------------|--------------|--------------|
 | x0 = foo(a0) | y0 = bar(x0) |
                | x1 = foo(a1) | y1 = bar(x1) |
                               | x2 = foo(a2) | y2 = bar(x2) |

As shown in these two examples, it is perfectly possible to build parallel and pipelined implementations from sequential C++ by leveraging parallelism across design iterations.

This fundamental notion makes the problem of parallelizing C++ in hardware implementations quite different than in software implementations. In software, instructions are executed one by one. In hardware, processes run continuously. Temporal parallelism is about taking advantage of this specific hardware property.

A C synthesis tool like Catapult C is perfectly capable of leveraging this parallelization potential and creating highly efficient parallel multi-block designs. The net benefit for the user is a very simple input source which does not require explicit description of concurrency, thereby making models more abstract, more reusable and easier to create and verify.

This ability to parallelize pure sequential C++ to efficient concurrent hardware is precisely what allowed Hitachi to create 57% of a recent ASIC in 82% less time.

More Blog Posts

Comments (↓ Add Your Own)

One comment on this post

[...] #3 Synthesizing parallel designs from sequential C++ [...]

Add Your Comment

High-Level Synthesis is entering the mainstream of hardware design, bringing tremendous opportunities and creating stimulating new challenges to hardware designers. This blog is about trends, opinions and experiences with going from C++ to RTL, automatically.