High-Speed Soft-Processor Architecture
for FPGA Overlays
Charles Eric LaForest
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2015 by Charles Eric LaForest
Abstract

High-Speed Soft-Processor Architecture for FPGA Overlays
Charles Eric LaForest
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto, 2015

Field-Programmable Gate Arrays (FPGAs) provide an easier path than Application-Specific Integrated Circuits (ASICs) for implementing computing systems, and generally yield higher performance and lower power than optimized software running on high-end CPUs. However, designing hardware with FPGAs remains a difficult and time-consuming process, requiring specialized skills and hours-long CAD processing times.
An easier design process abstracts away the FPGA via an “overlay architecture”, which implements a computing platform upon which we construct the desired system. Soft-processors represent the base case of overlays, allowing easy software-driven design, but at a large cost in performance and area. This thesis addresses the performance limitations of FPGA soft-processors as building blocks for overlay architectures.
We first aim to maximize the usage of FPGA structures by designing Octavo, a strict round-robin multi-threaded soft-processor architecture tailored to the underlying FPGA and capable of operating at maximal speed. We then scale Octavo to SIMD and MIMD parallelism by replicating its datapath and connecting Octavo cores in a point-to-point mesh. This scaling creates multi-local logic, which we preserve via logical partitioning to avoid artificial critical paths introduced by unnecessary CAD optimizations.
We plan ahead for larger Octavo systems by adding support for architectural extensions, instruction predication, and variable I/O latency. These features improve the efficiency of hardware extensions, eliminate busy-wait loops, and provide a building block for more efficient code execution.
By extracting the ﬂow-control and addressing sub-graphs out of a program’s Control- Data Flow Graph (CDFG), we can execute them in parallel with useful computations using special dedicated hardware units, granting better performance than fully unrolled loops without increasing code size.
Finally, we benchmark Octavo against the MXP soft vector processor, the NiosII/f scalar soft-processor, equivalent benchmark implementations written in C and synthesized with the LegUp High-Level Synthesis (HLS) tool, and direct Verilog Hardware Description Language (HDL) implementations. Octavo’s higher clock frequency offsets its higher cycle count: it performs roughly on par with MXP, and within an order of magnitude of the performance of the LegUp and Verilog solutions, but with an order-of-magnitude area penalty.
Since I was a teenager, I’ve always wanted to design a computer. You would not be reading this without the help and inﬂuence of many.
Firstly, my wife Joy. Do I need to expound on the love, support, and patience of a spouse over a decade spent in academia, with her own career, while raising a family?
Secondly, to Greg Steﬀan. I can credit pretty much my entire academic success to his ability to pull me back to the big picture after fruitfully letting me get lost in the details, arguing for my own direction. Back in 2008, he proposed I do my MASc on a seemingly small little problem, which grew into a sub-ﬁeld of its own after we published, and won us a signiﬁcant award. Right from the beginning, while he co-supervised my undergraduate BIS thesis in 2006, before he even had tenure, we talked about what kind of computer architectures would work best on an FPGA. The work you are holding right now stems from those conversations, and I think I can say we ﬁgured out a good solution. Neither of us knew how far it would all go, and I am sorry he did not get to see it completed.
Jason Anderson stepped up to carry the supervision torch, and gave his time and eﬀort to make sure the LegUp benchmarks were great. I am lucky to have had his support, unﬂagging enthusiasm, and invaluable thesis-writing feedback.
Guy Lemieux, of UBC, inﬂuenced the later stages of this work enormously. The entire work on execution overhead originates from a single conversation we had. Guy always took the time to help and explain things, criticizing honestly, greatly stimulating my thinking. One of his students, Aaron Severance, gave his time and eﬀort also to provide MXP benchmark data, better than any I could have done myself, and patiently explained how the machinery worked. Any mistakes or omissions are mine alone.
I have had the luck of also working with Vaughn Betz and Tarek Abdelrahman on my thesis committee, both bringing contrasting and broad perspectives on my research.
In fact, every ECE and DCS faculty member I have spoken with has answered my questions and given help freely.
I have also had the pleasure of great student colleagues, too numerous to mention all, but especially Ruediger Willenberg, Xander Chin, Braiden Brousseau, Robert Hesse, Ming G. Liu, Zimo Li, Tristan O’Rourke, Mihai Burcea, Andrew Shorten, Jason Lu, Henry Wong, Mark Jeffreys, Jasmina Vasiljevic, Ioana Baldini, and Charles Chiasson.
Some roots of this work came from conversations at MIT with Jack Dennis and some of Arvind’s grad students in the summer of 2010. Joel Emer of Intel also provided some early feedback. Octavo’s initial design improved a lot after speaking with Patrick Doyle, Mark Stoodley, and other people of IBM Markham.
I am also indebted to Blair Fort (as a student), Martin Labrecque, and Peter Yiannacouras. Their earlier soft-processor work laid a lot of the groundwork for my own. I am still learning things from it.
I got nothing but ﬁrst-class IT tech support from Jaro Pristupa, Eugenia Distefano, and YongJoo Lee, who never blinked at all my software installation requests, and saved my data through RAID failures and my errors. Furthermore, Kelly Chan and Beverly Chu were always available for EECG administrative help, as were Judith Levene and Darlene Gorzo of the ECE Grad Oﬃce, and Jayne Leake who managed my TA assignments.
A lot of computational horsepower (over 24 CPU-years of it!) was provided by the GPC supercomputer at the SciNet HPC Consortium. My special thanks to ChingHsing Yu and Dr. Daniel Gruner for their enthusiasm at helping me scale my computations, even when I unintentionally broke their system a little sometimes. (“What do you mean, I used up all the file descriptors?!?”) The folk at Altera have also provided support in many ways: funding, Quartus licenses, technical support, and hardware. My special thanks to Blair Fort (as Altera staff) for his freely given help.
Further funding came from the ECE Department, the Queen Elizabeth II World Telecommunication Congress Graduate Scholarship in Science and Technology, the Walter C. Sumner Foundation, and NSERC.
If you read this and feel forgotten, my apologies. Call me up and reminisce!
“He was alive, trembling ever so slightly with delight, proud that his fear was under control. Then without ceremony he hugged in his forewings, extended his short, angled wingtips, and plunged directly toward the sea. By the time he passed four thousand feet he had reached terminal velocity, the wind was a solid beating wall of sound against which he could move no faster. He was flying now straight down, at two hundred fourteen miles per hour. He swallowed, knowing that if his wings unfolded at that speed he’d be blown into a million tiny shreds of seagull. But the speed was power, and the speed was joy, and the speed was pure beauty.”
– “Jonathan Livingston Seagull”, Richard Bach

I was born to Rock n’ Roll.
Everything I need.
I was born with the hammer down.
I was built for speed.
– “Built For Speed”, Motörhead

It is by caffeine alone I set my mind in motion.
It is by the beans of Java that thoughts acquire speed, the hands acquire shakes, the shakes become a warning.
It is by caﬀeine alone I set my mind in motion.
– variation on the “Mentat Mantra” – attributed to Mark Stein, Arisia SF Convention, Boston, ca. 1993, Sunday morning.
Soundtrack by deadmau5.
List of Tables

4.1 Speed/Area/Density of Partitioned Meshes with 32 Datapaths
4.2 Speed/Area/Density of Partitioned Meshes with 102 Datapaths
5.1 Octavo’s Instruction Word Format with Extended Write Address Space
6.1 Benchmark cycle count speedup and efficiency improvements
7.1 3-Step Array Reverse (Reverse-3) Measurements
7.2 Sequential Hailstone (Hailstone-S) Measurements
7.3 Accelerated Hailstone (Hailstone-A) Measurements
7.4 Structured FSM (FSM-S) Measurements
7.5 “Assembly” FSM (FSM-A) Measurements
7.6 Array Increment (Increment) Measurements
7.7 Non-Branching Sequential Hailstone (Hailstone-N) Measurements
7.8 4-Step Array Reverse (Reverse-4) Measurements
7.9 8-Tap FIR Measurements
8.1 Octavo Fmax on Various Altera Devices
8.2 Truth table for Empty/Full bit with back-to-back transfers

List of Figures

6.1 Control-Data Flow Graph (CDFG) of the Hailstone benchmark, along with Flow-Control, Addressing, and Useful Work sub-graphs
6.2 Address Offset Module implementation
6.3 Branch Trigger Module implementation
6.4 Octavo Block Diagram with I/O Predication, AOM, and BTM
6.5 Average Octavo Fmax, varying the number of entries per AOM instance and the number of BTM instances
E.1 Branch Trigger Module Altered for I/O Predication
H.1 Simple Floating-Point Number Recognizer
Introduction

We can think of Field-Programmable Gate Arrays (FPGAs) as “Lego-on-a-chip”: a large number of small building blocks which, although not very capable by themselves, we can assemble into larger, useful digital systems. In a nutshell, these building blocks include:
• small Look-Up Tables (LUTs), which can compute any Boolean function of a few inputs (e.g.: 4 to 6) or act as small memories;
• ﬂip-ﬂops, which hold state;
• various arithmetic circuits, such as adders and multipliers;
• larger memory blocks, for eﬃcient bulk storage;
• I/O blocks, which implement low-level oﬀ-chip interfaces;
• and a very large amount of programmable interconnect, which allows us to connect all of these building blocks together.
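As a rough software sketch of the first two building blocks above (not from the thesis — the class names and the majority-function example are purely illustrative), a k-input LUT is simply a 2^k-entry truth table that can therefore realize any Boolean function of its inputs, and a D flip-flop is one bit of clocked state:

```python
class LUT:
    """A k-input Look-Up Table: any Boolean function of k inputs,
    stored as a table of 2**k precomputed output bits."""

    def __init__(self, k, function):
        self.k = k
        # "Configure" the LUT by evaluating the function on every
        # possible input combination, like an FPGA bitstream would.
        self.table = [function(*self._bits(i)) for i in range(2 ** k)]

    def _bits(self, index):
        # Unpack an integer index into a tuple of k input bits.
        return tuple((index >> b) & 1 for b in range(self.k))

    def evaluate(self, *inputs):
        # Look up the stored output for the given input bits.
        index = sum(bit << b for b, bit in enumerate(inputs))
        return self.table[index]


class FlipFlop:
    """A D flip-flop: captures its data input on each clock edge."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        self.q = d
        return self.q


# Example: a 4-input LUT configured as a majority function.
majority = LUT(4, lambda a, b, c, d: int(a + b + c + d >= 3))
```

The point of the sketch is the LUT’s generality: because the function is stored as a table rather than built from fixed gates, the same physical structure can implement any of the 2^(2^k) possible k-input functions, which is what makes the fabric programmable.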
Rather than implement digital systems directly in Application-Specific Integrated Circuits (ASICs), a much longer and more costly process, we can trade off area, power, and performance, and instead implement such systems using an FPGA. Despite the trade-offs relative to ASICs, FPGAs still enable higher-performance and lower-power solutions than even highly-optimized software implementations on high-end CPUs, and unlike ASICs, we can re-configure FPGAs as the application evolves. As power limits and the slowing of Moore’s Law (through scaling limits or increasing costs) cause individual CPU performance to level out after decades of accelerated growth, FPGAs present an avenue for higher performance without all the complications of large-scale parallel programming.
Designing with FPGAs remains a diﬃcult process, if a less costly one than with ASICs, with hours-long CAD processing times after laborious hardware design. At the most basic level, we can use Hardware Description Languages (HDLs) to describe logic fairly directly, but at a low level akin to assembly programming or a stripped-down C language.
Alternately, we can use High-Level Synthesis (HLS) techniques to allow higher-level system descriptions, which greatly eases design implementation, exploration, and veriﬁcation. However, HLS must still translate the description down to HDL, with the same CAD processing time problems, and produces machine-generated “black box” implementations hard to relate to the original source code.
Finally, we can instead abstract away the FPGA itself by using it to implement a more conventional computing platform, and then implement our desired system on this “overlay architecture”. For example, a scalar soft-processor (e.g.: Altera’s NiosII) represents the simplest instance of an overlay.
By using overlays, we pay a further price in performance, power, and area, but gain an even faster development cycle, mostly free from FPGA CAD processing time and amenable to high-level software descriptions. Contrary to conventional CPUs, an overlay still beneﬁts from the FPGA’s re-conﬁgurability, allowing us to incrementally customize the overlay to our application and recoup lost performance.
We propose that overlays already provide a better design process for FPGAs, minimizing the use of HDLs and HLS, and maximizing the return for a given development effort.
That leaves us with the question: how do we improve the performance of overlays?
1.2 Contributions

This dissertation focuses on improving the performance of soft-processors on FPGAs, as building blocks for FPGA overlay architectures. Specifically:
• We construct Octavo, a soft-processor architecture tailored to the underlying FPGA, capable of operating at maximal speed.