Thomas Bollaert’s Blog
Mentor ESL in TSMC Reference Flow 12
One year ago, I wrote about the inclusion of Mentor ESL in the TSMC Reference Flow 11, and why the endorsement of system-level design and high-level synthesis by the world’s leading foundry was a telling sign of maturity for ESL.
Since this first major milestone, TSMC and Mentor have not remained idle. On the contrary, both parties teamed up to take this first ESL flow to a whole new dimension, expanding the use of ESL and functional verification tools from single-block to full SoC design.
Today, this effort is released in the form of the new Mentor ESL flow in TSMC’s Reference Flow 12 targeting TSMC 28nm process technology. The Mentor ESL design and verification flow now addresses full SoC designs with support for transaction level model (TLM) based Virtual Platforms enabling early software validation, power estimation and model reuse and refinement to RTL:
- The Vista platform supports functional validation and power estimation based on TSMC iPPA process node value characterization, and enables OS booting and early validation of application software on a Virtual TLM Platform.
- Certe Testbench Studio provides automated Universal Verification Methodology (UVM) testbench creation, saving time and reducing errors.
- Catapult C supports high-level synthesis from SystemC and incremental synthesis, which is demonstrated on a complete, multi-block, hardware accelerator component. The generated RTL, including AXI interfaces, is combined with the Questa Verification IP and a TLM Virtual Platform running in Vista to provide a hybrid TLM and RTL simulation.
- Questa Ultra provides an ESL to RTL verification flow with UVM that supports TLM platform and model reuse, test plan tracking and accelerated coverage closure.
- Questa Codelink provides HW/SW co-verification to greatly reduce debug time when running system tests on an embedded processor.
If you are at DAC in San Diego, and if you are interested in seeing a full SoC TLM virtual prototype booting Linux, before synthesizing a complete IP subsystem from TLM to RTL and then verifying it with Questa Verification IP, then you may want to stop by the Mentor booth #1542 for a suite session and demo.
And if you are not at DAC, do not despair! The entire Mentor ESL RF12 demo kit is available on-line from the TSMC website: http://online.tsmc.com/online/
Tags: Catapult C, DAC, ESL, high-level synthesis, How-to, TSMC, verification, Vista
48th DAC – Gary’s Magic Formula
Last night, in his traditional DAC-opening presentation, Gary Smith addressed the crowd with a loud and clear message about the cost of doing hardware design. Design costs are steadily increasing and this is draining life and blood out of the industry. When chip design costs reach $25M, VCs stop funding start-ups. When costs reach $50M, continued Gary Smith, even IDMs struggle to afford ASIC developments.
So where do we stand today? According to Smith, too many projects require 100+ hardware engineers to complete a chip, putting design costs way too high to be affordable. Yet today, it is possible to design a 104-million-gate ASIC with 30 engineers for $18.7M. Notice the present tense in the previous sentence: it “is” possible to do this today. These numbers were not pulled out of thin air; these are actual figures from a design house surveyed by Gary.
How does one design a large ASIC for less than $20M? The secret is in reuse and in proper design organization. Assuming 80% RTL reuse, only 20 million gates of your 100-million-gate ASIC need to be designed from scratch. In other words, 5 blocks of 4 million gates each. That’s Gary’s “Magic Formula” and the secret to cost-effective hardware design.
The corollary of this formula is that the minimum capacity for an EDA tool is to be able to handle a 4-million-gate block in an overnight run. This is the strict minimum and anything less than that is not viable.
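The arithmetic behind the formula is easy to check. The short C++ sketch below only restates the figures quoted above (100 million gates, 80% reuse, 4-million-gate blocks); the function names are hypothetical and chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Gates that must be designed from scratch, given the reused share in percent.
std::int64_t new_gates(std::int64_t total_gates, int reuse_percent) {
    assert(reuse_percent >= 0 && reuse_percent <= 100);
    return total_gates * (100 - reuse_percent) / 100;
}

// Number of blocks needed to cover the new logic, rounded up.
std::int64_t block_count(std::int64_t gates, std::int64_t gates_per_block) {
    assert(gates_per_block > 0);
    return (gates + gates_per_block - 1) / gates_per_block;
}
```

Calling `new_gates(100000000, 80)` gives the 20 million gates of new logic, and `block_count(20000000, 4000000)` gives the 5 blocks of the formula.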
The math works out, and when you spice the formula with ESL and High-Level Synthesis, you can build an even more compelling economic case for your next ASIC. That’s good news and that’s certainly why Gary sees the EDA market on a solid growth path and reaching $6.6B within a few years.
Tags: ASIC, Cost, DAC, ESL, Gary Smith, high-level synthesis, How-to
DAC: 9th ESL Symposium
If you are going to DAC this year, then you must attend the 9th Annual ESL Symposium, and not just because there is a free lunch or because you need the new Apple iPad 2. This year, Wally Rhines will moderate a very impressive panel line-up:
- Gadi Singer – Intel
  Vice President, Intel Architecture Group
  General Manager, System-on-Chip Enabling Group
- John Goodenough – ARM
  Vice President of Design Technology and Automation
- Ken Hansen – Freescale Semiconductor
  Sr. Fellow, Vice President and Chief Technology Officer
- Philippe Magarshack – STMicroelectronics
  Group Vice-President, Technology R&D
  General Manager, Central CAD and Design Solutions
- Simon Bloch – Mentor Graphics
  Vice President and General Manager, ESL/HDL Design and Synthesis Division
For sure, this year’s conversation is not to be missed. The panelists will be examining the industry-wide move to ESL by discussing their views and experiences. Anyone from the engineering to executive level interested in Architectural Design, Virtual Prototyping, TLM Verification and High Level Synthesis should grab a lunch, take a seat and participate.
Be sure to pre-register and arrive early for this event as it sells out every year!
http://www.mentor.com/esl/events/9th-annual-esl-symposium-at-dac
Tags: ARM, DAC, ESL, Freescale, high-level synthesis, Intel, STMicroelectronics, User Testimonial
HLS Fundamentals / Part 2
In my last two posts, I introduced the question that proved the most challenging in the HLS Bluebook quiz (here) and presented some fundamental concepts about loop unrolling and loop pipelining and explained why answer 2 was not the right one (here).
Let’s now see what happens in the case of answer 1, when we unroll LOOP0 by 4 and pipeline the design with II=1.
Partially unrolling by 4 means that we transform the loop into a new one which now has only 8/4=2 iterations, and where each iteration of the new loop implements 4 iterations of the original loop. The corresponding C code would look like:
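void acc(int din[8], int &dout)
{
    int tmp = 0;  // the accumulator must start at zero
    // 8/4 = 2 iterations, each covering 4 iterations of the original loop
    LOOP0: for(int i=0; i<8; i+=4) {
        tmp+=din[i+0];
        tmp+=din[i+1];
        tmp+=din[i+2];
        tmp+=din[i+3];
    }
    dout = tmp;
}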
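As a quick sanity check, the partially unrolled form can be compared against a rolled loop in plain C++. This is only a software-level sketch: the names `acc_rolled`, `acc_unrolled4` and `same_result` are hypothetical, and it says nothing about how a synthesis tool performs the transformation internally.

```cpp
#include <cassert>

// Rolled reference: 8 iterations, one accumulation each.
void acc_rolled(int din[8], int &dout) {
    int tmp = 0;
    for (int i = 0; i < 8; i++) {
        tmp += din[i];
    }
    dout = tmp;
}

// Partially unrolled by 4: 2 iterations, 4 accumulations each.
void acc_unrolled4(int din[8], int &dout) {
    int tmp = 0;
    for (int i = 0; i < 8; i += 4) {
        tmp += din[i + 0];
        tmp += din[i + 1];
        tmp += din[i + 2];
        tmp += din[i + 3];
    }
    dout = tmp;
}

// Returns true when both variants produce the same sum for the given inputs.
bool same_result(int din[8]) {
    int rolled = 0;
    int unrolled = 0;
    acc_rolled(din, rolled);
    acc_unrolled4(din, unrolled);
    return rolled == unrolled;
}
```

Unrolling changes how the work is grouped into iterations, not the value that is computed, which is why the two functions agree on every input.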
The schedule for one loop iteration would look as follows:
|RD0|ADD|ADD|
|RD1| | |
|RD2|ADD| |
|RD3| | |
In the first cycle, 4 inputs are read. In the following cycles, these 4 values are summed together, possibly using a balanced adder tree.
If the design were not pipelined, the second iteration of the partially unrolled LOOP0 would start after the end of the first iteration and the schedule would look like:
|RD0|ADD|ADD|RD4|ADD|ADD|
|RD1| | |RD5| | |
|RD2|ADD| |RD6|ADD| |
|RD3| | |RD7| |OUT|
Instead, the design is pipelined with II=1, meaning that the second iteration of LOOP0 should start 1 cycle after the start of the first iteration. Similarly, the next design iteration (to process a new set of inputs) would start 1 cycle after the start of the last LOOP0 iteration. The schedule drawn below shows the two iterations of the partially unrolled LOOP0 corresponding to one design iteration, followed by two more iterations of LOOP0 corresponding to a second design iteration.
|RD0|ADD|ADD| | | |
|RD1| | | | | |
|RD2|ADD| | | | |
|RD3| | | | | |
| |RD4|ADD|ADD| | |
| |RD5| | | | |
| |RD6|ADD| | | |
| |RD7| |OUT| | |
| | |RD0|ADD|ADD| |
| | |RD1| | | |
| | |RD2|ADD| | |
| | |RD3| | | |
| | | |RD4|ADD|ADD|
| | | |RD5| | |
| | | |RD6|ADD| |
| | | |RD7| |OUT|
There are 2 iterations of LOOP0, new iterations start every clock cycle, the output is produced at the end of the last iteration: this implies that the RTL generated with these constraints will produce new results every 2 cycles.
An important point to notice as well is that the throughput of the design is independent of its latency. The above examples were drawn with the assumption that each addition took a full clock cycle. One loop iteration is shown to take 3 clock cycles. Faster adders could have produced a shorter schedule. This would have meant a shorter ramp-up time (time to first output), but the data rate would stay the same (1 output every 2 clock cycles).
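The independence of throughput and latency can be made concrete with a small cycle-count model. The sketch below is a simplified model based only on the reasoning in this post; the struct and function names are hypothetical, and it ignores anything a real scheduler would add, such as I/O handshakes or resource conflicts.

```cpp
#include <cassert>

// Timing parameters of one pipelined loop.
struct LoopTiming {
    int iterations;        // loop iterations left after unrolling
    int ii;                // initiation interval, in cycles
    int iteration_cycles;  // cycles taken by a single loop iteration
};

// Cycles between two consecutive results: depends only on iterations and II.
int result_interval(const LoopTiming &t) {
    return t.iterations * t.ii;
}

// Cycles from the first read to the first result (ramp-up time).
int first_result_latency(const LoopTiming &t) {
    return (t.iterations - 1) * t.ii + t.iteration_cycles;
}
```

With the values used in this post, `LoopTiming{2, 1, 3}`, the first result appears after 4 cycles and a new one follows every 2 cycles. With faster adders, `LoopTiming{2, 1, 2}`, the first result appears after 3 cycles, but the interval between results is still 2.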
So answer 1 is not the correct one. There are only two possible choices left, and I hope that with all these recent explanations, finding the solution should now be easy…
Tags: ANSI C++, Bluebook, C synthesis, high-level synthesis, HLS, How-to, Learning, Loop, Pipelining, Unrolling
HLS Fundamentals: Loop Unrolling and Loop Pipelining
The dust has settled and four winners have emerged from the HLS Bluebook contest, and this week, as promised, I will discuss the question that proved to be the most challenging in the third and final round of the contest.
The culprit was the following question, which only 15% of the contenders answered correctly:

HLS Contest - Round 3, Question 1
This simple C code reads 8 input values (din), sums them in an intermediate variable (tmp) and returns the final result (dout).
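Since the question is shown only as an image, here is a rough reconstruction of the code it describes. It is an assumption pieced together from the description above and from the loop name LOOP0 used in the rest of the discussion, not a copy of the contest question.

```cpp
#include <cassert>

// Reconstruction (assumption): accumulate 8 inputs into a single output.
void acc(int din[8], int &dout) {
    int tmp = 0;                    // intermediate accumulator
    for (int i = 0; i < 8; i++) {   // referred to as LOOP0 in the discussion
        tmp += din[i];              // one read and one accumulation per iteration
    }
    dout = tmp;                     // final result
}
```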
Before going over the various answers and explaining the correct one, let’s review two important concepts covered in this question:
- Loop pipelining provides a way to increase the throughput of a loop by initiating the (i+1)th iteration of the loop before the ith iteration has completed. Overlapping the execution of subsequent iterations of a loop exploits parallelism across loop iterations. The number of cycles between iterations of the loop is called the “Initiation Interval” (II).
- Loop unrolling provides a way to reduce the latency of a loop by reducing the number of its iterations. When a loop is unrolled – either fully or partially – the loop body is duplicated as many times as the loop is unrolled. This exposes parallelism that exists across subsequent iterations of the loop. The number of times a loop is unrolled is called the “Unroll Factor”.
If these definitions sound a bit theoretical, the following explanations and charts should help make them clearer. And an earlier blog post on this topic is still available here for reference.
- Default behavior
Assuming that the accumulation takes one clock cycle, a single iteration of LOOP0 would look as follows:
|RDi|ACC|
This notation indicates that in the first cycle an input is read and in the second cycle, the data is accumulated.
If no special synthesis constraints are applied to this design, loops are neither unrolled nor pipelined. This means that the next loop iteration will start right after the previous loop iteration completes. In this case the schedule of the design would look as follows:
|RD0|ACC|RD1|ACC|RD2|ACC|RD3|ACC|RD4|ACC|RD5|ACC|RD6|ACC|RD7|ACC|
In other words, a very serial design is built where in the first cycle din[0] is read, in the second cycle the data is accumulated, in the third cycle din[1] is read, etc… In this case, the output would come out every 16 cycles.
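The serial schedule and its 16-cycle total can be reproduced with a few lines of C++. This is a toy model under the same assumption as above (one cycle to read, one cycle to accumulate); the function names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Builds the schedule row of a rolled, non-pipelined loop:
// each iteration reads one input (RDi), then accumulates it (ACC).
std::string serial_schedule(int iterations) {
    std::string row = "|";
    for (int i = 0; i < iterations; ++i) {
        row += "RD" + std::to_string(i) + "|ACC|";
    }
    return row;
}

// Total cycles when iterations never overlap.
int serial_cycles(int iterations, int cycles_per_iteration) {
    return iterations * cycles_per_iteration;
}
```

`serial_schedule(8)` returns the row drawn above, and `serial_cycles(8, 2)` returns 16.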
- Analysis of Answer #2
Let’s now see what happens in the case of answer 2, when we leave LOOP0 rolled and pipeline the design with II=3.
As in the default case, the loop is kept rolled, therefore one iteration would look like this:
|RDi|ACC|
But now we pipeline the design with II=3, implying that we are building a design where each new iteration of LOOP0 starts 3 cycles after the beginning of the previous iteration.
|RD0|ACC| | | | | | | | |
| |RD1|ACC| | | | | | | |
| | |RD2|ACC| | | | | | |
| | | |RD3|ACC| | | | | |
| | | | |RD4|ACC| | | | |
| | | | | |RD5|ACC| | | |
| | | | | | |RD6|ACC| | |
| | | | | | | |RD7|ACC| |
|<-3cycles->| | | | | | | |
Consequently, the next design iteration (to process a new set of inputs) would start 3 cycles after the start of the last LOOP0 iteration. There are 8 iterations of LOOP0 and each iteration starts every 3 clock cycles: this implies that the design processes new inputs and produces a new result every 24 cycles.
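The 24-cycle figure follows directly from the start cycle of each iteration. The sketch below lists those start cycles for a rolled loop pipelined with a given II; it is a simplified model of the reasoning above, with hypothetical function names.

```cpp
#include <cassert>
#include <vector>

// Start cycle of each loop iteration when a new one begins every `ii` cycles.
std::vector<int> iteration_starts(int iterations, int ii) {
    std::vector<int> starts;
    for (int i = 0; i < iterations; ++i) {
        starts.push_back(i * ii);
    }
    return starts;
}

// Cycle at which the next design iteration (next set of inputs) can begin.
int next_design_start(int iterations, int ii) {
    return iterations * ii;
}
```

For 8 iterations and II=3, the iterations start at cycles 0, 3, 6, and so on up to 21, and the next set of inputs starts at cycle 24.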
So answer 2 is not the correct one. Have you found the correct answer yet? In my next blog post, I’ll explain what happens when you start unrolling loops.
Tags: ANSI C++, Bluebook, C synthesis, high-level synthesis, HLS, How-to, Learning, Loop, Pipelining, Unrolling
HLS Contest: And the winner is…
Early December, the Catapult team launched the HLS Bluebook Contest. Our intent was to bring the community together around a fun yet challenging event and give people an opportunity to learn about HLS and test their skills in this area.
Today, 4 months, 3 rounds and 15 questions after the grand opening, we are very happy to announce the winners of the contest, the only four individuals who scored perfectly on the third and decisive round.
Congratulations on their outstanding performance to:
- Lee Bradshaw, United States
- Philip Chambers, Australia
- Paul S. D'Urbano, United States
- Jerome Lachaize, France
We started the contest with a first round of 5 questions, and the response was overwhelming with 1,144 initial participants! The average score on this first round was 3.63 and 63% of the participants scored a 4 or a 5 (perfect score).
Given the many positive comments we got through email, it seemed obvious people were game. So we decided to continue with a second round, upping the difficulty level. Everyone who scored 4 or more was invited to participate in this next round.
Round 2 started February 10th. With slightly more difficult questions, the average score was still a solid 3.58, but only 184 people scored 4 or more. From the initial 1,144 contenders, we were now down to the cream of the crop. It was time to hold the final round.
Round 3 started March 9th and 121 finalists lined up for the showdown. Once again, we had raised the bar with even more complicated questions. The moment had come for our final winners to shine! In this final round the average score dropped to 2.47. Only 17 finalists managed to get a 4, tying for second place. And of course the 4 winners emerged, scoring perfectly on this decisive round. Once again, a big bold “Bravo” to them and to all the finalists.

HLS Contest - Round 3 Results
It was also very interesting to see where people tripped. We had designed the last question (n.5) to be more challenging and indeed only 19% got this one correct. But it turned out that the first question was the hardest one as only 15% got a good answer there.

HLS Contest - Round 3 Results by question
I am reproducing this first question below and will expand on it in upcoming blog posts. But before I give the answer and the explanation, what would you have answered?

HLS Contest - Round 3, Question 1
Tags: ANSI C++, Bluebook, C synthesis, Contest, high-level synthesis, HLS, Learning, SystemC
A Designer’s Perspective on ESL Methodologies for an OFDM Modem Design
“In recent times, ESL design methodologies have been the talk of the semiconductor design community and have found increasing acceptance. Most of the recent publications have given information on design flow needs and an high level overview of the (C/C++/SystemC) based high level synthesis design process using a small block level design scenario. Although productivity benefits for ESL methodologies have been acknowledged, there is still little information regarding the scalability, quality of results (QoR), and learning curve of deploying these ESL design methodologies on real and large scale industrial designs.”
This quote is taken from a very thorough DesignCon 2011 paper, in which authors Harvinder Singh, Gagan Midha, Thierry Michel, Roberto Guizzetti, Pascal Urard and Nitin Chawla of STMicroelectronics provide an in-depth description of their experience in designing a complete OFDM modem with an ESL methodology.
What is really interesting about this paper is that it presents the methodology from a designer’s perspective. The design process is explained in the context of a high-throughput, multi-million-gate OFDM modem design. The complete modem has been designed using High Level Synthesis. First the design partitioning process for breaking down the OFDM modem into sub-blocks is explained, along with details on the basic architecture of the modem. The HLS process of block development and the integration of blocks (data and control flow) is illustrated with special focus on challenges and solutions during the design cycle. The block architecture of the major building blocks of an OFDM system, such as Forward Error Correction (FEC) decoders and Fast Fourier Transform (FFT), is discussed and QoR results are presented, demonstrating the advantage of using HLS for design space exploration.
The complete paper can be downloaded here. I highly recommend reading it.

OFDM inter-block interfaces and data flow control
Tags: ANSI C++, C synthesis, Catapult C, control-logic synthesis, DesignCon, FFT, Full-Chip, high-level synthesis, OFDM, RTL, STMicroelectronics, User Testimonial
Catapult C and the 7 Samuraïs
You may have already encountered the expression “Full-Chip High-Level Synthesis” on this blog. I typically define it as the ability to model, verify and synthesize complete IP subsystems starting from C++/SystemC. This obviously encompasses core processing functionality, but also control-logic, memories, hierarchy, complex interfaces and interconnects. In other words, being able to do the “full” thing, really.
A few days ago, “one of the seven samuraïs” posted on John Cooley’s ESNUG the results of his evaluation of Catapult C. As you’ll see from the requirements to handle arbitration logic and point-to-point interfaces on top of algorithmic content, this pretty much means “Full-Chip HLS”. Here is how the report starts:
- “We wanted our test to be rigorous, so we used an existing scaler design. Our scaler was implemented in 90 nm technology. It does down and up scaling of frames from 1×1 to 1024×1024 pixels; each pixel has four 8-bit components. The scale factors are configurable, with an integrated 640 pixel line buffer.”
The rest of it, including the results and conclusions, can be read here.
Tags: C++, Catapult C, control, control-logic synthesis, Cooley, Deepchip, ESNUG, Full-Chip, high-level synthesis, SystemC, User Testimonial
The Why, What and How of HLS @ DATE 2011
Good news for the industry: the DATE (Design, Automation, and Test in Europe) conference is back to growth. And perhaps it is not a surprise given that this year the event is being held in Grenoble. With its great views of the snowy Alps, Grenoble is emerging as the major hub of the electronic and semiconductor industry in Europe.
3D ICs, Low-power, ESL… The rich conference program covers all hot areas of EDA. For those interested in High-Level Synthesis (HLS), one tutorial should be of particular interest. “Electronic System Level Design and Verification” explores the practical application of HLS with presentations from Yvan Desmartin and Bernhard Niemann, two knowledgeable HLS users from STMicroelectronics and Fraunhofer IIS, respectively. In addition, Michael Fingeroff, the author of the High-Level Synthesis Blue Book, will share his guidance on how novices can become experts by simply following the best-practice coding examples shown in the book. This full-day session explores how the promise of HLS is becoming a reality.
Title: Electronic System Level Design and Verification
Date: Mon, March 14th, 2011
Time: 09h30 – 18h00
Speakers:
Thomas Bollaert, Mentor Graphics, US
Yvan Desmartin, STMicroelectronics, FR
Michael Fingeroff, Mentor Graphics, US
Bernhard Niemann, Fraunhofer Institute for Integrated Circuits, DE
Michael Fingeroff has worked as a technical marketing engineer for the Catapult C product line at Mentor Graphics since 2002. His areas of interest include DSP and high-performance video hardware. Prior to working for Mentor Graphics, Michael worked as a hardware design engineer developing real-time broadband video systems. Michael received both his bachelor’s and master’s degrees in electrical engineering from Temple University in 1990 and 1995 respectively.
Yvan Desmartin is a design team leader in the STMicroelectronics Home Entertainment & Display Group / Home Video Division. He brings 14 years of experience in digital design. His current activities mainly focus on graphic, display and network applications for the Set Top Box market. He received his electronic engineering degree from INSA LYON (France).
Bernhard Niemann is a group manager for Multimedia Terminals at Fraunhofer IIS. He has been dealing with C++ based design since 2000, when he developed the SystemC training classes offered by Fraunhofer IIS. Since then he has been involved in various C++ based design activities ranging from executable specification in SystemC to the introduction of C++ based synthesis into the design flow at Fraunhofer IIS. His current activities are focused on chipset design for new hybrid (satellite and terrestrial) broadcasting systems.
Tags: Bluebook, C synthesis, DATE, Fraunhofer, Grenoble, high-level synthesis, HLS, STMicroelectronics, Tutorial, User Testimonial
DVCon: Wally Rhines’s Keynote
“50 years from today, every man, woman and child in India will be required to run an HDL simulator”.
As Wally Rhines explained in his DVCon keynote today, this is the absurd conclusion you reach if you extrapolate data showing that between 2007 and 2010 the average verification team size grew by a whopping 58%. Indeed the conclusion is absurd, but the image is strikingly powerful. Verification complexity grows at a faster pace than we can keep up with, and we’d better do something about it.

Entire India population becomes verification engineers in 50 years?
Reflecting on what happened in the last few years, Rhines observed that the focus has been to increase the “volume” of verification. This has been achieved by the adoption of verification techniques such as assertions, code coverage, functional coverage or emulation. The use of assertions has grown from 37% in 2007 to 69% in 2010. Similarly, functional coverage grew from 40% to 72% over the same period of time.
However, as design complexity also increased during that time, this aggressive adoption of advanced functional verification techniques barely helped contain the problems. 66% of projects are still behind schedule. 45% of chips require two silicon passes and 25% require more than two passes.
Citing more survey results, Rhines highlighted that 52% of chip failures were still due to functional problems. The issue has already been discussed on this blog: RTL design is where most errors are being introduced. But can it be a surprise given that RTL design is mostly a manual effort?
Continuing, Wally Rhines explained that the industry needs to look beyond the mere “volume” of verification, and now needs to improve and emphasize the “velocity” of verification. In other words, accelerating verification closure with techniques such as intelligent testbench automation, transaction-based hardware acceleration and by adopting ESL and higher levels of abstraction.
So for the sake of India’s population, it is time we change the direction of verification and shift from adding cycles of verification to maximizing the verification per cycle.
Tags: DVCon, ESL, verification, Wally Rhines
About Thomas Bollaert’s Blog
High-Level Synthesis is entering the mainstream of hardware design, bringing tremendous opportunities and creating stimulating new challenges to hardware designers. This blog is about trends, opinions and experiences with going from C++ to RTL, automatically.