ALGORITHMS AND METHODOLOGY TO DESIGN
ASYNCHRONOUS CIRCUITS USING
SYNCHRONOUS CAD TOOLS
Vikas S. Vij
A dissertation submitted to the faculty of
The University of Utah
in partial fulﬁllment of the requirements for the degree of
Doctor of Philosophy
Department of Electrical and Computer Engineering
The University of Utah
December 2013
Copyright © Vikas S. Vij 2013
All Rights Reserved
The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL
The dissertation of Vikas S. Vij
has been approved by the following supervisory committee members:
Kenneth S. Stevens, Chair, Date Approved: 11/5/2013
Chris Myers, Member, Date Approved: 10/31/2013
Erik Brunvand, Member, Date Approved: 11/5/2013
Priyank Kalla, Member, Date Approved: 10/31/2013
Christos Sotiriou, Member, Date Approved:
and by Gianluca Lazzi, Chair of the Department of Electrical and Computer Engineering,
and by David B. Kieda, Dean of the Graduate School.
Synchronous CAD tools must be constrained for them to work with asynchronous circuits. The identification of these constraints and a characterization flow that derives them automatically are presented. The effect of the constraints on the designs and the way they are handled by the synchronous CAD tools are analyzed and reported in this work.
Automating the generation of asynchronous design templates and of their constraints is an important problem. Algorithms that automate the addition of reset to asynchronous circuits, and power and/or performance optimizations applied to the circuits using logical effort, are explored, thus filling an important hole in the automation flow.
Constraints that represent cyclic asynchronous circuits to the CAD tools as directed acyclic graphs (DAGs) are necessary for applying synchronous CAD optimizations such as sizing and path delay optimization, and for using static timing analysis (STA) on these circuits. A thorough investigation of the requirements for cycle cutting while preserving timing paths is presented, along with an algorithm that automates their generation.
A large set of designs for four-phase handshake protocol circuit implementations with early and late data validity is characterized for area, power, and performance. Benchmark circuits, with automated scripts that generate various configurations for better understanding of the designs, are proposed and analyzed. Extensions to the methodology are also described, such as scan insertion using automatic test pattern generation (ATPG) tools to add testability to the datapath in bundled data asynchronous circuit implementations, and timing closure approaches. Energy, area, and performance of purely asynchronous circuits and of circuits with mixed synchronous and asynchronous blocks are explored. Results indicate the benefits that can be derived by generating circuits with asynchronous components using this methodology.
I would like to thank Dr. Ken Stevens for his guidance, understanding, patience, and most importantly, his assistance through my graduate studies at the University of Utah. His mentorship, expertise and humor provided me with the motivation to pursue my doctoral research. His support and encouragement were crucial through this whole experience. I would like to thank Dr. Chris Myers, Dr. Erik Brunvand, Dr. Priyank Kalla, and Dr. Christos Sotiriou for their constructive reviews of my work and for their constant assistance through my graduate studies.
I wish to thank my friends in the research group: William Lee, Shomit Das, Raghu Prasad Gudla, Krishnaji Desai, Dan Gebhardt, and Junbok You, for all the constructive discussions and presentations through which we learned from and taught each other. Their reviews and critical comments on my manuscripts and work helped improve the quality immensely.
I would like to thank the Electrical and Computer Engineering Department at the University of Utah, especially Lori Sather, for guiding me through the administrative matters at the University and always being there to help. The long stay in Salt Lake City would not have been fun without the presence and help of my friends. Special thanks to Sai and Hemang for always being there to motivate me. Last but not least, I thank my parents, my brother, my sister-in-law, and my nephews for sticking by me through all this time and making this whole journey a fun-filled one.
Integrated circuits continue to grow in performance and transistor count, with current designs exceeding a few billion transistors. Distributing the clock to the entire chip in such designs demands significant design effort and energy. Power consumed by the chip increases from generation to generation because total chip area remains constant or grows. This results in power hungry clock drivers that require high slew rates in order to distribute a high frequency clock with limited skew. As a result, in some designs nearly 40 percent of the total on-chip power consumption is due to the clock generation and distribution network [2, 3]. Earlier, performance and area were the metrics around which VLSI designers would build and optimize their chips, but in the last decade power has arguably become the most important metric.
The presence of a large number of transistors has also resulted in different approaches to design, which involve different intellectual property (IP) blocks characterized and optimized for a specific frequency of operation. Integrating these IP blocks to create a chip with multiple clock frequency domains is called multisynchronous design. The overhead of clock domain crossing and the increase in design complexity of the system are becoming significant problems for both power and performance.
Asynchronous circuits are a potential solution to all these problems, as they switch only to do useful work. The frequency of operation of asynchronous circuits is dynamic and depends on the amount of logic in the pipeline as well as on the operating frequency of the pipelines adjacent to it. Since there is no global clock and all the communication is local, there is no need for power hungry low skew drivers. These circuits are based on handshake protocols, which enhance the modularity and composability of the designs and thus assist in supporting multiple frequency designs without the need for synchronization. By operating the asynchronous system at frequencies that best optimize the power and performance of each individual asynchronous module, an overall better design is achieved. One of the better silicon examples of such an architecture is the Pentium Processor front end, which operates at three frequencies: 720 MHz for instruction decode, 3.6 GHz for instruction selection, and 900 MHz for instruction steering and issue. This design was fabricated in the same foundry as its commercial counterpart, and achieved a 17-fold improvement in eτ² (the energy-delay-squared product).
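The eτ² figure of merit mentioned above can be illustrated with a minimal sketch. The function names and the normalized numbers below are hypothetical and are not measurements from the Pentium front-end study; the sketch only shows how the metric weights delay quadratically, so that halving both energy and delay gives an eightfold improvement.

```python
def energy_delay_squared(energy, delay):
    """Return the e*tau^2 figure of merit (lower is better)."""
    return energy * delay ** 2


def improvement(baseline, candidate):
    """Return how many times smaller the candidate metric is than the baseline."""
    return baseline / candidate


# Hypothetical, normalized values chosen only for illustration.
baseline = energy_delay_squared(energy=2.0, delay=1.0)
candidate = energy_delay_squared(energy=1.0, delay=0.5)
ratio = improvement(baseline, candidate)
```

Because delay enters squared, a design that trades a small energy increase for a larger delay reduction can still improve under this metric.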
If the timing models employed in the clocked design style can be leveraged by general multifrequency design, the same tools, languages, and flows can be used with all methods of timing for design and architecture optimization. Relative timing (RT) does just that; it bridges the gap between the incompatible timing used by unclocked design styles by expressing the timing in a form used by commercial clocked electronic design automation (EDA) tools. Once timing compatibility is achieved, design languages, standard cell libraries, and tool flows become common to all design styles. This compatibility enhances productivity, reduces the cost of adopting multifrequency design methodologies, and results in power and performance advantages. A flow based on RT to generate multifrequency designs using commercial clocked EDA tools is described in this dissertation.
1.1 Related Work
The convergence of asynchronous design approaches and the synchronous (clocked) computer aided design (CAD) tools and flows has been an actively researched topic. Based on the different timing models used for asynchronous design, different approaches have been proposed to use the synchronous CAD tools and flows either partially or fully.
Flows for approaches that use delay insensitive (DI) encoding use only a clocked synthesis CAD tool such as Synopsys Design Compiler (DC). The synthesized design output is then mapped to specific DI gate implementations, which preserve the hazard properties of the design. Since timing is inherent in DI systems, there is no requirement to specify timing constraints for functional correctness. The NULL Convention Logic (NCL) designs by Theseus Logic, the Proteus flow by the University of Southern California and Fulcrum Microsystems, and the phased logic approach to designing circuits using level encoded two-phase dual rail (LEDR) encoding follow this partial use of synchronous CAD tools and flows.
Commercial companies working on DI design, such as Silistix, have also attempted to address this problem for network-on-chip (NoC) designs with their toolflow, which requires a precharacterized technology library. The library contains adapters for their IP interface protocols and hard macroblocks for the CHAIN interconnect, which is used to connect the system blocks. The toolflow is named CHAINarchitect, and it converts a network specified in a custom language called the connect specification language (CSL) into an on-chip network implementation. The benefit of this toolflow is its completeness: it covers all the general CAD flow steps such as place and route, testing, and static timing analysis (STA). However, this flow applies only to NoC designs developed using the technology library precharacterized by Silistix. Hence, it is not a general design flow. Also, the synchronous CAD tool optimizations are not applied because of the use of precharacterized hard macroblocks.
The desynchronization approach is the most complete existing method for generating an asynchronous bundled data design using the synchronous CAD flows [11, 12, 13, 14, 15]. It also uses standard library cells and hardware description languages (HDLs) to specify the circuit. This approach is built on marked graph theory and hence proves the liveness, safeness, and flow equivalence properties of the circuits. It accomplishes a direct mapping of the synchronous design into an asynchronous equivalent by removing the clock network and replacing it with an asynchronous handshake network. Postsilicon numbers for the ASPIDA DLX processor and a DES core were published for comparison of the asynchronous design with its synchronous counterpart [11, 12, 13, 14, 15]. Desynchronization provides significant benefits in reduced electromagnetic interference (EMI) and shorter design cycles. However, design results show little or no power improvement over the initial clocked design. The asynchronous design operates at average case speed, as compared to worst case speed for the synchronous design, thus resulting in performance benefits for the fabricated desynchronized DLX processor. The timing constraint specification for this approach starts by dividing each flip-flop into a pair of latches, each individually controlled by a handshake controller. Then, a virtual clock is created to enable each latch for timing. The benefit of this approach is that the representation is entirely in the style of the clocked definition. However, it restricts the design to the clock paradigm, thus preventing the application of asynchronous architectural and design optimizations, which are important for gaining benefits similar to those of the Pentium front-end example.
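The first step of that constraint specification, splitting each flip-flop into a pair of latches with separate enables, can be sketched as a simple netlist transformation. This is an illustrative sketch only: the `FlipFlop` and `Latch` records, the net naming scheme, and the function names are hypothetical and are not taken from the desynchronization tools; handshake controllers and virtual clocks are not modeled beyond naming the enable nets they would drive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipFlop:
    name: str
    d: str  # data input net
    q: str  # data output net


@dataclass(frozen=True)
class Latch:
    name: str
    d: str       # data input net
    q: str       # data output net
    enable: str  # net assumed to be driven by a handshake controller


def split_flip_flop(ff):
    """Replace one flip-flop with a master/slave latch pair."""
    mid = f"{ff.name}_m2s"  # hypothetical internal net between the latches
    master = Latch(f"{ff.name}_master", ff.d, mid, f"{ff.name}_master_en")
    slave = Latch(f"{ff.name}_slave", mid, ff.q, f"{ff.name}_slave_en")
    return master, slave


def desynchronize(flops):
    """Split every flip-flop in the list; each latch keeps its own enable net."""
    latches = []
    for ff in flops:
        latches.extend(split_flip_flop(ff))
    return latches


# Two flip-flops in series become four latches in series.
result = desynchronize([FlipFlop("r0", "din", "n1"), FlipFlop("r1", "n1", "dout")])
```

The data path through each original flip-flop is preserved (input net to master, master to slave, slave to output net), while the single clock pin is replaced by two independent enable nets.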
The application of synchronous synthesis tools to high-level timed asynchronous bundled-data design has also been investigated using channel-level VHDL code. The circuit implementation was shown on a field programmable gate array (FPGA). The asynchronous control circuit for this implementation is derived using the ATACS tool, while the FPGA synthesis tool synthesizes the datapath. The benefits of this approach are the use of an HDL to generate the asynchronous circuit and the concept of using synchronous CAD tools to synthesize the combinational logic in the datapath. The timing and sizing algorithms of the synchronous CAD tools are not used to optimize the asynchronous design; hence, delay elements are created by manually adding buffers based on the delay requirement.
Another approach that addresses the generation of asynchronous bundled data as well as quasi delay insensitive (QDI) Micropipeline designs is the Weaver flow. It modifies the library to make it compatible with DC, thus enabling synthesis of asynchronous circuits. The approach is demonstrated for deterministic as well as data dependent token propagation, which enables its application to a large set of asynchronous circuits. The major drawback of this approach, however, is that it requires modifying the standard cell libraries to make them compatible with the flow. Hence, knowledge of library characterization and modification is required to derive the benefits.
A detailed study of the limitations of the synchronous CAD tools and ﬂows with respect to applying them on an on-chip network is presented in . These limitations are as