IBTIDA: Fully open-source ASIC implementation of Chisel-generated System on a Chip

Abstract: Building a System on Chip (SoC) using a fully open-source toolchain requires the availability of open-source tools for RTL generation, RTL simulation, and GDS-II conversion, along with manufacturable foundry process design kits (PDKs), IP libraries, and I/O blocks. This work presents a methodology for using completely open-source tools and a hardware construction language (HCL) to tape out a RISC-V based SoC, Ibtida. The methodology uses Chisel (Constructing Hardware in a Scala Embedded Language) as the RTL generator, Verilator as the RTL simulator, OpenLANE as the RTL-to-GDS-II converter, and the SkyWater 130 nm open PDK to manufacture the SoC. Ibtida consists of a 5-stage pipelined 32-bit RISC-V (RV32IM) core with 32 GPIOs and separate instruction and data memories. The Ibtida design is embedded in a harness on a physical chip. The harness is equipped with a management SoC that acts as a controller for Ibtida. Before converting the RTL into GDS-II, cycle-accurate simulation using Verilator and FPGA emulation on a Xilinx Arty A7 were performed for verification and regression testing. The FPGA implementation uses 8650 LUTs, 3356 slice registers, 714 flip-flops, and 2.5 block RAMs of 36 Kb. The ASIC implementation occupies an area of 2.5 mm² with a density of 37.44 KGate/mm². Fabrication of this SoC is provided by the Google shuttle program Open MPW (Multi Project Wafer), in association with Efabless and SkyWater Technology. To the best of our knowledge, this is the first RISC-V based SoC generated using Chisel and taped out using fully open-source technologies.


I. INTRODUCTION
Today, Moore's law is slowing down. The trend of increasing computing capability by doubling the number of transistors is coming to a halt [1]. As a result, we are entering a golden age of computer architecture [2], in which the pursuit of higher performance is driven by factors other than miniaturization alone. However, because of the proprietary nature of chip design, innovation has been limited: only big companies could design their own processors. The RISC-V Instruction Set Architecture (ISA) [3] democratized this, enabling startups and communities to work together on chip design. Still, another barrier kept academic researchers, startups, and small companies from actually taping out their processors: the closed nature of the Process Design Kit (PDK). Many open-source Electronic Design Automation (EDA) tools (SPICE, Magic, etc.) have long been available to physical design engineers, but the lack of a completely open-source PDK confined custom hardware design to a handful of large, established companies and well-funded research universities. This problem was resolved in mid-2020, when the SkyWater foundry, together with Google, introduced the first fully open-source PDK for the SKY130 process node [4], which is based on a 130 nm Complementary Metal Oxide Semiconductor (CMOS) technology.

A. The open-source hardware momentum
Since the arrival of the RISC-V ISA, there has been a boom in open-source chip design. It showed that, like open-source software, open-source hardware can be greatly improved through collaboration between small and big companies that complement each other, improving not only the ISA but also the surrounding tool ecosystem required for hardware design [5].
The CHIPS Alliance [6], established in 2019, takes the aim of open-source hardware design even further. It provides a common place for designers to create innovative solutions using open-source tools. Its members include renowned companies working together to develop reusable open-source IPs. It is also focused on providing tools for open-source physical design. The very ambitious open-source OpenROAD project [7] is also part of the CHIPS Alliance; it aims to provide 24-hour, No-Human-In-The-Loop layout design for SoC, package, and PCB with no Power-Performance-Area (PPA) loss, enabling software engineers and people with little physical design knowledge to tape out their own processors.
Even with everything from RTL to EDA tools available as open source, a complete open chip design flow remained out of reach because no completely open-source PDK existed. For over twenty years, PDKs have been kept closed source and have required non-disclosure agreements (NDAs), license servers, and password-protected download sites, leaving the privilege of taping out designs in the hands of big, established companies [8]. The SkyWater foundry opening up its 130 nm process together with Google, and the Efabless/Google collaboration providing free tape-out shuttles, present a huge opportunity for startups, small academic institutes, and even high school students to come up with their own custom designs and actually get them fabricated.

B. Why hardware should learn from software
Because performance no longer scales simply by doubling the number of transistors, the era of domain-specific architectures is booming. The advent of the RISC-V ISA has enabled small teams and startups to develop custom hardware that improves performance and power efficiency. However, the process of designing chips has been painfully long and has followed the same rigid development model that early software development used, known as the waterfall model. Software projects in the early days often ran over budget, missed deadlines, or were abandoned. Making changes to a whole monolithic software project was very difficult as the customer's needs changed. The same goes for hardware projects. In a hardware project, the microarchitecture is specified first, followed by the RTL design, then verification, and finally the complete physical design of the netlist. The physical design is often outsourced to other companies, which further stretches project timelines that usually range from 1 to 3 years, and if the customer's needs change, the whole process has to be repeated. The agile software methodology [9] emphasizes working software over detailed documentation, customer collaboration, and flexibility over rigid specifications. It promotes small teams working iteratively on working-but-incomplete prototypes and enhancing them until the end result is acceptable. Inspired by this agile software approach, researchers at the University of California, Berkeley proposed their own "Agile Hardware Manifesto" [10], with which they taped out eleven processors in a span of five years.
To facilitate this agile hardware development idea by increasing designer productivity, Chisel [11] was created. It is a domain-specific language created on top of Scala which provides all the high-level programming features such as Object-Oriented Programming (OOP) and Functional Programming (FP) to the designer for creating reusable libraries that generate efficient hardware circuits. The idea is to create reusable packages just like in software which provides abstraction and easy-to-use integration opportunities of various verified IPs. Furthermore, the Chisel compiler automatically creates a fast, cycle-accurate C++ software simulator, or low-level synthesizable Verilog that maps to FPGAs or ASIC flows.

C. Previous works
There have been eleven tape-outs based on Chisel using the Rocket Chip generator [12] by the University of California, Berkeley, but they relied on commercial EDA tools and closed PDKs. Also, a family of striVe SoCs was taped out using OpenLANE and the SkyWater 130 nm PDK to prove the viability of fully open-source EDA tools and the PDK [13]. However, those SoCs are written in a traditional low-level hardware description language, Verilog. The Rocket Chip generated tape-outs lacked an open-source backend flow to generate the GDS, while the striVe family SoCs, although mapped onto the open PDK, lacked a frontend design written in a higher-level programming language.
In this paper, we present our contribution: using the abstraction and software-programming feel of Chisel, we taped out a 5-stage pipelined RISC-V RV32IM core and a minimal SoC around it with no prior experience in chip design. We passed the generated RTL (Verilog) to OpenLANE [14] to obtain a completely open-source RTL-to-GDS flow, which was then mapped onto the fully open-source SkyWater 130 nm process design kit through the Google/Efabless MPW Shuttle program [15]. We used Chisel for the ease of programming hardware circuits, which gave us a quicker start with RTL design than low-level Verilog, and showed that the generated Verilog can be taken through a fully open suite of Electronic Design Automation (EDA) tools and fabricated on the SkyWater 130 nm open PDK.

II. DESIGN METHODOLOGY AND SPECIFICATION
To validate the proposed work, we applied a methodology to a design specification and analyzed its implementation and results. The following subsections discuss the methodology and the specification of the design; later sections cover the implementation and analysis of the design in detail.

A. Methodology
Chisel, a domain-specific language embedded in Scala, was used as the frontend of the proposed design. It offers the higher-level features of a programming language for designing circuits, in place of traditional HDLs like Verilog/VHDL [16]. The Chisel frontend generates an intermediate representation called Flexible Intermediate Representation for RTL (FIRRTL), which provides transforms and passes written in Scala [17], running on the Java Virtual Machine (JVM). These allow the same Chisel code to target three different backends: 1) Simulation, 2) FPGA Emulation, and 3) ASIC Implementation.
For simulation, to check the functionality of the design, the Chisel compiler was used to generate a C++ simulator from the emitted Verilog of the SoC through Verilator [18], together with an emitted C++ wrapper that provides stimuli to the compiled simulator. Running the simulator generates a Value Change Dump (VCD) file that can be viewed in the open-source waveform viewer GTKWave [19].
For emulation on the FPGA, the Chisel-generated Verilog was mapped onto the Arty A7 FPGA board using Xilinx's Vivado for synthesis, place and route, and bitstream generation. This is the only closed-source path that was used. Open-source alternatives for FPGA implementation exist as well, such as the SymbiFlow project [20] or OpenFPGA [21], but they are beyond the scope of this paper.
For the ASIC, the generated Verilog and the SkyWater 130 nm PDK were used along with the OpenLANE flow, which comprises various open-source tools for synthesis, floorplanning, Power Distribution Network (PDN) generation, place and route, Design Rule Check (DRC), Layout Versus Schematic (LVS) checks, and GDSII generation.

B. Specification
Ibtida is a minimal System on a Chip designed completely in Chisel using its high-level programming language features. It consists of four basic elements that every computer has: 1) Compute, 2) Communication, 3) Peripherals, and 4) Storage.
The instruction interface has a point-to-point interconnect for fetching instructions, and the data interface has a 1xN interconnect that allows the core to perform loads/stores either to the memory or to the GPIO peripheral. Since there is no non-volatile memory for code storage, a UART controller is designed to accept the program from the host computer and write it into the ICCM memory every time the board is powered on or a new program needs to be uploaded. The details of each element highlighted in figure 2 are described below:
1) Compute: It is a 32-bit 5-stage pipelined core compliant with the RISC-V base integer ISA (I) and the additional M extension that supports multiply/divide instructions, together making it an RV32IM core. It has five pipeline stages: 1) Fetch (F). 2) Decode (D). 3) Execute (E). 4) Memory (M). 5) WriteBack (WB).
a) Fetch: The fetch has a Program Counter (PC) that points to the next instruction to be fetched and an interface to fetch the instructions from the memory. The PC value is updated through a multiplexer that selects the next PC value which can be a simple PC + 4 through an adder or another jump address depending upon the instruction in the Decode stage.
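The next-PC selection described for the fetch stage can be sketched as a small behavioural model. This is an illustrative Python sketch, not the Ibtida RTL; the function name next_pc and the argument names redirect and target are assumptions introduced here.

```python
MASK32 = 0xFFFFFFFF  # the PC is 32 bits wide on an RV32 core


def next_pc(pc: int, redirect: bool, target: int) -> int:
    """Model the next-PC multiplexer of the fetch stage.

    redirect -- hypothetical select signal driven from the Decode stage
    target   -- jump/branch address used when redirect is set
    """
    sequential = pc + 4  # adder path: address of the next sequential instruction
    return (target if redirect else sequential) & MASK32


if __name__ == "__main__":
    pc = 0
    # Two sequential fetches, one redirect to 0x40, then sequential again.
    for redirect, target in [(False, 0), (False, 0), (True, 0x40), (False, 0)]:
        pc = next_pc(pc, redirect, target)
        print(f"pc = 0x{pc:08x}")
```

The mask models the wrap-around of a 32-bit register, which a plain Python integer would not show on its own.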
b) Decode: The decode stage contains a register file with 32 registers, x0 to x31, each 32 bits wide, as described in the RISC-V ISA. It also has an Immediate Generation unit that extracts the encoded immediate values from the instructions, concatenating and padding them to 32 bits. There is a Control Unit as well, which decodes the current instruction using the opcode and enables the control signals that the instruction type requires. There is a Branch Unit that identifies whether the current instruction is a branch and calculates the next PC address if the branch is taken. The Branch Unit was kept in the Decode stage to reduce the penalty of a taken branch to 1 cycle, since the fetch stage must be flushed and the new instruction fetched from the updated PC value. The stage also has a Hazard Detection logic unit that prevents structural hazards, i.e. cases where the register being read by the current instruction is being written at the same time by another instruction in the Write Back stage.
c) Execute: The execute stage has an Arithmetic Logic Unit (ALU) for computation-related tasks and an ALU Control unit that tells the ALU which operation to perform. It also has a forwarding unit that provides the ALU with the proper operands if there are any data hazards in the pipeline.
d) Memory: The memory stage consists of a store/load unit that performs either stores or loads to the memory or the GPIO peripheral.
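As an illustration of what an Immediate Generation unit has to do, the following Python sketch reassembles the immediates of the five RV32I instruction formats. The bit positions follow the RISC-V unprivileged ISA specification, which sign-extends every immediate; the helper names (bits, sign_extend, imm_i and so on) are assumptions made for this sketch and do not come from the Ibtida sources.

```python
def bits(word: int, hi: int, lo: int) -> int:
    """Return word[hi:lo] (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign_bit) - sign_bit


def imm_i(inst: int) -> int:  # loads, ALU-immediate ops, JALR
    return sign_extend(bits(inst, 31, 20), 12)


def imm_s(inst: int) -> int:  # stores
    return sign_extend((bits(inst, 31, 25) << 5) | bits(inst, 11, 7), 12)


def imm_b(inst: int) -> int:  # conditional branches, bit 0 is always zero
    raw = ((bits(inst, 31, 31) << 12) | (bits(inst, 7, 7) << 11)
           | (bits(inst, 30, 25) << 5) | (bits(inst, 11, 8) << 1))
    return sign_extend(raw, 13)


def imm_u(inst: int) -> int:  # LUI, AUIPC
    return sign_extend(bits(inst, 31, 12) << 12, 32)


def imm_j(inst: int) -> int:  # JAL, bit 0 is always zero
    raw = ((bits(inst, 31, 31) << 20) | (bits(inst, 19, 12) << 12)
           | (bits(inst, 20, 20) << 11) | (bits(inst, 30, 21) << 1))
    return sign_extend(raw, 21)


if __name__ == "__main__":
    print(imm_i(0xFFF00093))  # addi x1, x0, -1
    print(imm_b(0xFE000EE3))  # beq x0, x0, -4
```

In hardware the same reassembly is pure wiring plus replication of bit 31, which is why the ISA keeps the sign bit in a fixed position across all formats.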
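The role of the forwarding unit in the Execute stage can be illustrated with the classic textbook forwarding conditions for a 5-stage pipeline. The source does not describe Ibtida's exact forwarding paths, so the sketch below is an assumption-laden model: the pipeline register names (EX/MEM, MEM/WB) and the function forward_select are hypothetical.

```python
def forward_select(rs: int,
                   ex_mem_rd: int, ex_mem_reg_write: bool,
                   mem_wb_rd: int, mem_wb_reg_write: bool) -> str:
    """Choose the source of one ALU operand.

    rs is the source register index of the instruction in Execute.
    Returns "EX_MEM", "MEM_WB", or "REG" (no hazard, use the register file).
    """
    # The most recent producer wins, so the EX/MEM stage is checked first.
    # x0 is hard-wired to zero and must never be forwarded.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX_MEM"
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM_WB"
    return "REG"


if __name__ == "__main__":
    # add x5, x1, x2 is one stage ahead of sub x6, x5, x3:
    # the operand x5 must come from the EX/MEM pipeline register.
    print(forward_select(rs=5, ex_mem_rd=5, ex_mem_reg_write=True,
                         mem_wb_rd=0, mem_wb_reg_write=False))
```

Checking the younger producer first matters when two in-flight instructions write the same register: the ALU must see the latest value.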
e) Write Back: The write back stage consists of a mux that selects the data to be written in the register file which can be either from the ALU output or from the data memory.
2) Communication: The communication mechanism used between the core, peripherals, and memories is the TileLink Uncached Lightweight (TL-UL) bus protocol [22]. TL-UL, a miniature version of TileLink, was used since we did not require cache coherency or other complex communication.
The fetch stage sends a valid request to the TL-UL Master, which communicates with the TL-UL Slave connected to the instruction memory. This forms a point-to-point interconnection between the core's fetch stage and the instruction memory, as shown in figure 2. For loads/stores during the memory stage, a 1xN switch connects a single TL-UL Master with multiple TL-UL Slaves, two in our case: one for the data memory and the other for the GPIO peripheral. The 1xN switch automatically decodes which slave to route the master's request to, depending on the address issued. There is no support for burst accesses. The master can only send one request at a time and must wait for the acknowledgment before sending another request.
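The address-based routing of the 1xN switch and the one-request-at-a-time rule can be sketched as follows. The address map is purely hypothetical (the source does not give base addresses or sizes), and the class and function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Region:
    name: str
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Hypothetical address map: these values are NOT taken from the Ibtida design.
ADDRESS_MAP = (
    Region("dccm", 0x4000_0000, 0x1000),
    Region("gpio", 0x4000_1000, 0x100),
)


def route(addr: int) -> Optional[str]:
    """Decode which slave serves addr, or None if the address is unmapped."""
    for region in ADDRESS_MAP:
        if region.contains(addr):
            return region.name
    return None


class Master:
    """A master that allows a single outstanding request and no bursts."""

    def __init__(self) -> None:
        self.busy = False

    def request(self, addr: int) -> str:
        if self.busy:
            raise RuntimeError("previous request has not been acknowledged")
        target = route(addr)
        if target is None:
            raise ValueError(f"unmapped address 0x{addr:08x}")
        self.busy = True
        return target

    def ack(self) -> None:
        self.busy = False  # the next request may now be issued
```

A load to an address inside the first region is routed to the data memory, and a second request is rejected until ack() has been called, which mirrors the blocking behaviour described above.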
3) Peripherals: The SoC contains only one peripheral that is the GPIO connected to the bus. The GPIO has 30 I/O pads going outside to interact with the outside world. Its control and status registers (CSRs) are accessible via TL-UL bus which can be manipulated by the software program running on the

A. Verilator Simulation
For testing the functionality of the design, Verilator was used to simulate the SoC and each of its individual components. Listing 1 shows how a 2-way mux can be designed in Chisel. A driver class, as shown in listing 3, is used to configure the Scala backend to use Verilator for testing, and an additional flag is used to generate the VCD trace for waveform viewing.
The Scala build tool (sbt) is used to compile the Scala classes and execute them as shown in listing 4, which in turn builds all the Verilator files using the testbench and generates a VCD trace to view.
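A testbench for such a mux needs a reference model to compare the simulated output against. The sketch below is a Python behavioural model, not the Chisel code of listing 1; the port names and the choice that sel = 1 selects in_a (mirroring the argument order of Chisel's Mux(sel, a, b)) are assumptions.

```python
from itertools import product


def mux2(sel: int, in_a: int, in_b: int) -> int:
    """Behavioural 2-way mux: in_a when sel is 1, otherwise in_b."""
    return in_a if sel else in_b


def mux2_gates(sel: int, in_a: int, in_b: int) -> int:
    """The same mux written as 1-bit AND/OR/NOT gates."""
    return (sel & in_a) | ((sel ^ 1) & in_b)


def truth_table():
    """Enumerate (sel, in_a, in_b, out) for every 1-bit input combination."""
    return [(s, a, b, mux2(s, a, b)) for s, a, b in product((0, 1), repeat=3)]


if __name__ == "__main__":
    for row in truth_table():
        print(row)
```

Exhaustive comparison of the two forms is feasible here because a 1-bit mux has only eight input combinations.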
The generated VCD trace can be viewed on GTKWave. In figure 4 the resulting waveform for the mux is depicted.
Similarly, each module within the Ibtida SoC was tested for correct functionality using Chisel-based testbenches and Verilator-based simulation. Table I shows a RISC-V assembly test program that is run on the SoC, and figure 6 shows how the instructions pass through the pipeline, with only the important signals extracted for clarity. The whole test suite run on the Ibtida SoC is available on GitHub [23].
Initially, as shown in figure 6, the UART programmer loads the program into the instruction memory and asserts uart_done high, signaling that the memory is loaded. The fetch stage then sends a valid request with the PC's current value and receives the instruction in the next cycle. Until then, a NOP (no operation) instruction, which does nothing in the pipeline, is sent to the datapath. After this, on each clock cycle a new instruction is fetched and the previous instructions progress through the pipeline. Finally, the registers are loaded with the values coming from the write back stage.
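The cycle-by-cycle progress of instructions through the five stages can be visualised with a small model. The sketch assumes an ideal pipeline with no stalls or flushes, and the three instructions in the usage example are hypothetical; they are not the program of Table I.

```python
STAGES = ("F", "D", "E", "M", "WB")


def pipeline_timeline(instructions):
    """Return {cycle: {stage: instruction}} for a stall-free 5-stage pipeline.

    Instruction i enters Fetch in cycle i and advances one stage per cycle.
    """
    timeline = {}
    for index, instruction in enumerate(instructions):
        for stage_index, stage in enumerate(STAGES):
            timeline.setdefault(index + stage_index, {})[stage] = instruction
    return timeline


if __name__ == "__main__":
    program = ["addi x1, x0, 5", "addi x2, x0, 7", "add x3, x1, x2"]
    for cycle, occupancy in sorted(pipeline_timeline(program).items()):
        cells = [f"{stage}: {occupancy.get(stage, '-')}" for stage in STAGES]
        print(f"cycle {cycle}: " + " | ".join(cells))
```

With n instructions and 5 stages, the last write back happens in cycle n + 4 - 1 when counting from cycle 0, so the timeline has n + 4 entries.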

B. FPGA Emulation
The Verilog generated from Chisel for the Ibtida SoC was mapped onto the Arty A7 FPGA board. It runs at a frequency of 8 MHz with zero total negative slack (TNS) and no failing endpoints. Table II shows the timing report of the implemented design. The MMCM primitive was used as the clock generator to provide the clock to the design. The ICCM and DCCM memories were mapped onto FPGA block RAMs (BRAMs). The DSP units on the FPGA were used for efficient multiplication. The resource utilization of the design is given in Table III. The power consumption of the implementation is given in Table IV.
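To make the timing terms concrete: an 8 MHz clock has a period of 125 ns, and a path meets timing when its delay fits within that period. The following sketch computes worst slack, TNS, and the number of failing endpoints under a simplified model that ignores setup time, clock skew, and uncertainty. The path delays in the usage example are invented for illustration and are not the values of Table II.

```python
def clock_period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def timing_summary(path_delays_ns, freq_mhz: float) -> dict:
    """Summarise slack for a list of endpoint path delays (simplified model)."""
    period = clock_period_ns(freq_mhz)
    slacks = [period - delay for delay in path_delays_ns]
    negative = [slack for slack in slacks if slack < 0]
    return {
        "period_ns": period,
        "worst_slack_ns": min(slacks),
        "tns_ns": sum(negative),            # total negative slack, 0 when all paths pass
        "failing_endpoints": len(negative),
    }


if __name__ == "__main__":
    hypothetical_delays = [100.0, 110.5, 124.0]
    print(timing_summary(hypothetical_delays, freq_mhz=8.0))
```

A design reported with zero TNS and no failing endpoints corresponds to the case where every slack in this list is non-negative.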

C. ASIC Implementation
For the ASIC implementation, the Chisel-generated Verilog was integrated inside a testing harness and then hardened through the OpenLANE flow for generating the GDSII layout.
1) Testing Harness: Caravel [24] is a testing harness that acts as a manager of the Ibtida SoC. It has three parts: 1) Management Area, 2) User Project Area, and 3) Storage Area, as shown in figure 5.
a) Management Area: The management area consists of an SoC built on top of a RISC-V based microprocessor, PicoRV32 [25]. It has some peripherals, including timers, UART, and GPIO. The firmware on the management area can be used
c) Storage Area: It consists of two dual-port SRAMs of size 1 Kbyte generated by OpenRAM [26]. The storage area is only accessible to the management area.
Figure 7 shows the architecture of the Caravel harness. The management area contains peripherals on a Wishbone bus [27], which are written and read by the PicoRV32. There is also Chip LA, a memory-mapped, 128-bit-wide logic analyzer on the Wishbone bus. It can be configured to read data from the User Project Area or provide data to it. There is also a Wishbone slave interface inside the User Project Area, but we used it only to provide the clock and reset to the Ibtida SoC, coming via the Wishbone master interface on the management area. The User Project Area has access to the 38 GPIOs after they are configured to be usable by the firmware running on the management core.
2) Integrating Ibtida inside Caravel: Figure 8 shows the configured Ibtida SoC for integrating inside the Caravel User Project Area. The signals prefixed la are coming from the logic analyzer. The SRAMs were mapped onto technology-specific flip flop based DFFRAMs.
3) OpenLANE: RTL to GDS: OpenLANE is an open-source automated RTL-to-GDSII ASIC design flow built on several components, PDKs (Process Design Kits), and IP (Intellectual Property) libraries, including standard-cell libraries, that performs all steps from RTL synthesis to GDS streaming. It is an aggregation of open-source EDA (Electronic Design Automation) tools, namely OpenROAD, Yosys [28], Magic [29], Netgen [30], OpenPhySyn [31], and SPEF-Extractor [32]. Furthermore, a custom script is used for design exploration and optimization. The completely open-source flow was designed around the open PDK released by Google and SkyWater (the Sky130 PDK) on a 130 nm CMOS technology, but is generalized to support other technologies. The flow performs the full set of ASIC implementation steps from RTL to GDSII, as shown in figure 9:
1) Logic Synthesis 2) Floor-Planning 3) Placement 4) Clock Tree Synthesis (CTS) 5) Routing 6) SPEF-Extraction 7) GDSII Generation 8) Physical Verification
The output of synthesis is a gate-level netlist, which after floor-planning results in a DEF (Design Exchange Format) file containing information about the physical layout, i.e. pin placement, die area, and core area of the design. During later stages of the flow, the DEF file is updated multiple times: as standard cells are placed during placement, the coordinates of their placement are added. The final DEF is generated after routing, where the track information connecting the standard cells is added.
The GDS then gets generated followed by Design Rule Check (DRC) and Layout vs Schematic (LVS) check which is required for the physical verification.
a) Logic Synthesis: The first step towards attaining a hardened Ibtida IP is logic synthesis. This process takes the RTL along with the standard cell library files as input. Before the synthesis of the Ibtida RTL, the OpenLANE environment needs to be set up as shown in listing 5, followed by the commands shown in listing 6. The flow can be executed in interactive mode with the prep -design <design name> command, which sources the design configuration file, config.tcl, and reads the environment variables required for synthesis (VERILOG_INCLUDE_DIRS, SYNTH_STRATEGY, etc.). It merges the relevant Library Exchange Format (LEF) file and the technology LEF file, i.e. the technology-specific information, and passes them along with the generated Verilog RTL as input to Yosys and ABC, which synthesize the logic and map it onto the technology-specific standard cells respectively, as seen in figure 10.
The chip area calculated after synthesis is 1.25 mm². Furthermore, Table V shows the statistics of the generated netlist.
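As a quick plausibility check, the post-synthesis area can be related to the 1620 µm x 1590 µm core area and the 50% core utilization used for floor-planning of the Ibtida SoC. The sketch below only performs this arithmetic; the variable names are chosen here for illustration, and the result is an estimate because the final placed area differs from the synthesis estimate.

```python
CORE_WIDTH_UM = 1620.0       # core width used for floor-planning
CORE_HEIGHT_UM = 1590.0      # core height used for floor-planning
SYNTH_CELL_AREA_MM2 = 1.25   # chip area reported after synthesis


def core_area_mm2(width_um: float, height_um: float) -> float:
    """Convert a width x height in micrometres to an area in square millimetres."""
    return width_um * height_um / 1e6


def utilization(cell_area_mm2: float, core_mm2: float) -> float:
    """Fraction of the core area occupied by standard cells."""
    return cell_area_mm2 / core_mm2


if __name__ == "__main__":
    core = core_area_mm2(CORE_WIDTH_UM, CORE_HEIGHT_UM)
    print(f"core area   = {core:.4f} mm^2")
    print(f"utilization = {utilization(SYNTH_CELL_AREA_MM2, core):.1%}")
```

The core works out to roughly 2.58 mm², so 1.25 mm² of cells corresponds to a utilization a little under 50%, which is consistent with the floor-planning target.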
OpenLANE provides a set of design exploration strategies that enable designers to meet the design specifications in terms of performance and area. OpenLANE offers four design exploration strategies, all of which have been tested for the Ibtida SoC. The strategies provide a trade-off between area and timing. Strategies 0 and 1 (delay) focus on achieving better timing, whereas strategies 2 and 3 (area) focus on achieving a more compact area. The effect of the different design exploration strategies on the Ibtida SoC is shown in figure 11.
b) Floor-planning: Floor-planning in the OpenLANE flow assigns the die area and core area read from config.tcl and generates the number of rows accordingly; it also involves placement of hard macros, if any, in the design space. For the Ibtida SoC, floor-planning required the gate-level netlist generated through Yosys along with a pin_order.cfg file, which includes the names of the pins to be
Floor-planning for the Ibtida SoC was carried out using listing 7, where the init_floorplan command floor-plans the netlist on a core area of 1620 µm x 1590 µm with a core utilization of 50%, i.e. half of the core area is occupied by standard cells. Table VI shows the coordinates of the core and the die for the Ibtida SoC. This is followed by place_io, which places the I/Os around the die as shown in figure 12. The power distribution network (PDN) is then generated using the gen_pdn command, which creates metal1 horizontal rails and metal4 vertical straps, as shown in figure 13.
c) Placement: The global placement command inserts all the standard cells into the core area haphazardly. There is no sequence or order; some standard cells might even overlap each other. The Ibtida SoC after global placement is shown in figure 15. The detailed placement command ensures that every cell is placed properly inside the rows. Legalization issues with overlapping cells are resolved in this step, so that the cells align. The Ibtida SoC after detailed placement is shown in figure 16.
d) Clock Tree Synthesis: The clock tree for the Ibtida SoC was generated using the command shown in listing 9.
    run_cts
Listing 9: Clock Tree Synthesis Script
e) Routing: Routing is the step that follows CTS. In the OpenLANE flow, routing is executed automatically through scripts. The task of the router is to precisely define the paths on the layout surface along which conductors carry electrical signals. The conductors interconnect the pins and the standard cells on the layout, thus forming a routing grid. Since the routing grid is quite large, routing is performed using a divide-and-conquer approach: global routing followed by detailed routing, as shown in listing 10. The global routing command plans routing guides that outline where the actual routes will go, whereas the detailed routing command makes the wires follow those routing guides and establishes the interconnects, as shown in figure 17.
The physical verification step, also termed the sign-off step in the OpenLANE flow, validates the final layout. Throughout the flow, a series of reports and logs are generated, which usually involves checking the DEF file generated at each stage of physical design for design rule violations. This is ensured by the EDA tools FastRoute, which identifies antenna violations, and TritonRoute, which checks for routing violations. The verification step ascertains that the placer and router have correctly placed the cells and routed the grid. The design is checked for overlapping cells and short circuits, and for Layout vs. Schematic (LVS) errors such as unmatched pins or short/open circuits between nets that should have been connected. Common Design Rule Check (DRC) errors related to wire spacing, width, and pitch, as defined in the PDK technology LEF (.tlef) file, also need to be resolved.
Some basic errors are shown in figures 19 and 20.
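The kind of width and spacing rules that a DRC tool evaluates can be illustrated with a toy checker for rectangles on a single metal layer. This is a simplified sketch and not how Magic implements DRC; the rule values in the usage example are hypothetical and are not taken from the SKY130 technology files. Coordinates are integers in nanometres to avoid floating-point comparisons.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rect:
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def width(self) -> int:
        """Smaller of the two side lengths, which is what a width rule constrains."""
        return min(self.x1 - self.x0, self.y1 - self.y0)


def gap_squared(a: Rect, b: Rect) -> int:
    """Squared Euclidean gap between two rectangles (0 if they touch or overlap)."""
    dx = max(b.x0 - a.x1, a.x0 - b.x1, 0)
    dy = max(b.y0 - a.y1, a.y0 - b.y1, 0)
    return dx * dx + dy * dy


def check_layer(rects, min_width: int, min_spacing: int):
    """Return a list of ("width", i) and ("spacing", i, j) violations."""
    errors = [("width", i) for i, r in enumerate(rects) if r.width < min_width]
    for (i, a), (j, b) in combinations(enumerate(rects), 2):
        # Touching or overlapping shapes are treated as one merged shape here.
        if 0 < gap_squared(a, b) < min_spacing * min_spacing:
            errors.append(("spacing", i, j))
    return errors


if __name__ == "__main__":
    wires = [Rect(0, 0, 1000, 140), Rect(0, 250, 1000, 390), Rect(0, 600, 1000, 700)]
    print(check_layer(wires, min_width=140, min_spacing=140))
```

In this example the first two wires are only 110 nm apart and the third wire is only 100 nm wide, so one spacing and one width violation are reported against the assumed 140 nm rules.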
The final DRC and LVS checks on the generated Ibtida SoC layout are performed using listing 13. For the Ibtida SoC design to be considered DRC and LVS clean, it needs to be validated through Magic, where the run_magic_drc command checks the layout for design rule violations and reports any it finds. Furthermore, a hierarchical SPICE netlist is extracted using run_magic_spice_export. The extracted netlist is then validated through the open-source tool Netgen.
In this paper, we gave an overview of a Chisel-generated SoC taped out using a completely open-source toolchain and discussed the different chip design flows involving RTL simulation, FPGA emulation, and ASIC implementation. Furthermore, we discussed how the OpenLANE suite allows automatic place and route of a chip without needing a physical design expert.
Chisel HDL allows software programmers and novice hardware engineers to describe circuits with the feel of a high-level programming language, compared to traditional hardware description languages. The user can abstractly design RTL logic and write
OpenLANE, in turn, allows the design to be taken through the ASIC process to generate a GDSII layout for fabrication.
We believe that with the introduction of a completely open-source PDK (SkyWater 130 nm) and a completely open-source ISA (RISC-V), combined with HDLs hosted in high-level programming languages (Chisel), there is a great opportunity for undergraduate students, academia, and researchers to quickly design, implement, and fabricate chips using an agile methodology analogous to that of the software domain. The lack of such a flow has been a huge bottleneck for innovation in the hardware design industry.