ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration

The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

<table>
<thead>
<tr>
<th>Citation</th>
<th>Kahng, Andrew B. et al. &quot;ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration.&quot; Design, Automation &amp; Test in Europe Conference and Exhibition, DATE’09, 20-24 April, 2009, Nice France.</th>
</tr>
</thead>
<tbody>
<tr>
<td>As Published</td>
<td><a href="http://www.date-conference.com/date09/">http://www.date-conference.com/date09/</a></td>
</tr>
<tr>
<td>Publisher</td>
<td>IEEE Computer Society</td>
</tr>
<tr>
<td>Version</td>
<td>Author's final manuscript</td>
</tr>
<tr>
<td>Accessed</td>
<td>Fri Apr 05 18:48:10 EDT 2019</td>
</tr>
<tr>
<td>Citable Link</td>
<td><a href="http://hdl.handle.net/1721.1/60547">http://hdl.handle.net/1721.1/60547</a></td>
</tr>
<tr>
<td>Terms of Use</td>
<td>Attribution-Noncommercial-Share Alike 3.0 Unported</td>
</tr>
<tr>
<td>Detailed Terms</td>
<td><a href="http://creativecommons.org/licenses/by-nc-sa/3.0/">http://creativecommons.org/licenses/by-nc-sa/3.0/</a></td>
</tr>
</tbody>
</table>
Abstract

As industry moves towards many-core chips, networks-on-chip (NoCs) are emerging as the scalable fabric for interconnecting the cores. With power now the first-order design constraint, early-stage estimation of NoC power has become crucially important. ORION [29] was amongst the first NoC power models released, and has since been fairly widely used for early-stage power estimation of NoCs. However, when validated against recent NoC prototypes – the Intel 80-core Teraflops chip and the Intel Scalable Communications Core (SCC) chip – we saw significant deviation that can lead to erroneous NoC design choices. This prompted our development of ORION 2.0, an extensive enhancement of the original ORION models which includes completely new subcomponent power models, area models, as well as improved and updated technology models. Validation against the two Intel chips confirms a substantial improvement in accuracy over the original ORION. A case study with these power models presented within the COST-DCNoC design space exploration tool [23] confirms the need for, and value of, accurate early-stage NoC power estimation. To ensure the longevity of ORION 2.0, we will be releasing it wrapped within a semi-automated flow that automatically updates its models as new technology files become available.

1 Introduction

Power has become the most critical design constraint. Increasing power consumption and design complexity have led designers to adopt multi-core designs in chip multiprocessors (CMPs) [16, 26] and multiprocessor systems-on-chip (MPSoCs) [36, 21]. As the demand for bandwidth increases in these systems, NoCs have been introduced to meet the increasing communication demand among cores [8]. However, with increasing emphasis on network bandwidth, the power that an interconnection network consumes will also be substantial [13]. The International Technology Roadmap for Semiconductors (ITRS) predicts that future generations of high-end VLSI designs will operate in 10-20 GHz frequency range with the communication between cores in Gbit/s [40]. This requires designers to work within a tight power budget. To aid designers in early stages of design, architectural power models were proposed for rough power estimations (see Section 2).

Architectural power estimation is extremely important in order to (1) verify that power budgets are approximately met by the different parts of the design and the entire design, and (2) evaluate the effect of various high-level optimizations, which have been shown to have much more significant impact on power than low-level optimizations. To tackle this problem, ORION, a set of architectural power models for network routers, was proposed in 2002 and has been fairly widely used for early-stage NoC power estimation in literature and industry. Despite the increase in complexity of today’s designs, ORION’s original power models have not been updated or enhanced. In a comparison between ORION 1.0 and the Intel 80-core Teraflops chip we notice 6.8% difference in reported rated power values (see Section 5). This highlights the need for more accurate architectural power models to aid designers in making early-stage NoC design decisions.

In addition, since architectural design space exploration is typically done for current and future technologies, models must be derivable from standard technology files (e.g., Liberty format [38], LIF [39]), as well as extrapolatable process models (such as, PTM [Predictive Technology Model] [37], or ITRS). ORION 1.0 collects inputs from ad hoc sources to drive its internal power models. We see a clear need for a semi-automated flow (i.e., using shell scripting) to extract technology inputs from reliable sources, to ease the updating of models as new technology files become available.

The above two factors prompted ORION 2.0 and its two key goals: (1) To update and enhance ORION’s power and area modeling accuracy; and (2) To encompass ORION 2.0 within a semi-automated flow so that ORION can be continuously maintained and updated easily. Figure 1 shows the usage model and modeling flow of ORION 2.0 with its main inputs and outputs. In this paper, we substantially improve the original ORION 1.0, to allow it to accurately estimate power for designs beyond the 65nm technology node. ORION 2.0 surgically tackles various deficiencies of ORION 1.0 highlighted through our validation with two Intel chips (Section 5) and our close interactions with both the chip designers and developers of ORION 1.0. Table 1 summarizes the contributions of ORION 2.0 beyond the original ORION 1.0, highlighting the extensive modifications:

**New:**

- Flip-flop and clock power models (both leakage and dynamic) are added. Flip-flop power models will enable the faithful modeling of flip-flop-based FIFOs in addition to the SRAM-based implementation in ORION 1.0. Clock power is a major component of overall chip power especially in high-performance applications [13], but was omitted in ORION 1.0.

- Link power models are added, leveraging accurate models recently developed in [6] as links are a major contributor to NoC power. Prior existing work on link power and delay modeling [11, 12] focus on minimum-delay buffer insertion, whereas we adopt a hybrid solution which minimizes...
a weighted product of delay and power. ORION 1.0 did not have a parameterized link model.

- Virtual-channel allocator microarchitecture in ORION 1.0 is modified to optimize its power consumption. Also, a new VCC allocation model, based on the microarchitecture and pipeline proposed in [17], is added in ORION 2.0.
- Arbiter leakage power, previously not covered in ORION 1.0, is now modeled.
- An accurate area model is added, allowing for detailed router floorplanning which enhances the accuracy of early-stage power estimation.
- An automatic flow for extracting technology parameters from standard technology files (e.g., Liberty format [38], LEF [39]), as well as extrapolatable models of process (e.g., PTM [37], ITRS [40]) is added to allow ORION 2.0 to be easily and continuously updated in the future.

### Improved:

- Application-specific technology-level adjustments (used different $V_{dd}$ flavors and transistor widths) are used in ORION 2.0 to improve power estimation for SoC and high-performance applications. ORION 1.0 used a single set of parameters for all designs at a given technology node.

### Updated:

- Transistor sizes and capacitance values are updated in ORION 2.0 with new process technology files – industry SPICE models and ITF (Interconnect Technology Format) – instead of ad hoc scaling factors as in ORION 1.0.

Our power model is validated against the Intel 80-core Teraflops chip and the Intel Scalable Communications Core, and is within 7% and 11% of the corresponding total power values. We also integrated ORION 2.0 models in COSI-CCC communication synthesis infrastructure and found that accurate models substantially affect the NoC system-level design exploration.

The remainder of this paper is organized as follows. In Section 2 we contrast against prior related work. Section 3 describes ORION 2.0 dynamic and leakage power models, while Section 4 describes our proposed area model. In Section 5 we validate our models against the Intel 80-core chip [13] and the Intel SCC chip [14], and show the impact of the new models on achievable NoC configurations. Finally, Section 6 concludes the paper.

## 2 Related Work

Power modeling can be carried out at different levels of fidelity, trading off modeling time with accuracy, ranging from real-chip power measurements [15], to pre and post-layout transistor-level simulations [34], to RTL power estimation tools [32] to early-stage architectural power models [5, 30, 29, 10]. Low-level power estimation tools, even RTL power estimation, require complete RTL code to be available, and simulate slowly, on the order of hours, while an architectural power model takes on the order of seconds. Circuit-level power estimation tools, though providing excellent accuracy, has even longer simulation times, and require even more substantive development effort. These shortcomings have prompted a plethora of early-stage architectural power models and simulators, such as the widely-used Watcht [5] and SimplePower [30] for uniprocessor power modeling. These models allowed computer architects and designers to factor in power when making early-stage design decisions.

Power models at different abstraction levels have also been proposed for a variety of network fabrics in the past. There have been several RTL-level NoC power models proposed [2, 1]. In [2], a RTL level power model for NoCs was developed by first extracting the SPICE level netlist from the layout and then integrating the characterized values into the VHDL based RTL design. In [1], an accurate power characterization of a range of NoC routers was performed through RTL synthesis and place and route using standard ASIC tool flow. Both studies inherit limitations of RTL-level power simulation: simulation time is slow, and requires detailed RTL modeling, making them unsuitable for early-stage NoC design space explorations. Besides, power models cannot be targeted for future technology nodes.

At the architecture level, Patel et al. [20] first proposed a power model for interconnection networks, deriving its power estimates based on transistor count. As the model is not instantiated with architectural parameters, it cannot be used to explore trade-offs in router microarchitecture design. ORION [29], an early-stage architectural power model for NoCs, was originally proposed and released in 2002, and has since been fairly widely used in academia and incorporated into industry toolchains (Intel, AMD, IBM, Freescale). Bona et al. [3] also presented a methodology for automatically generating the energy models for on-chip communication infrastructure at system level. But the focus is on bus based and crossbar based communication for SoC. Bhat et al. [4] proposed an architecture level regression analysis model for different router components based on energy numbers obtained from simulations using MAGMA tools. Besides the downsides of RTL-based models, the router architecture varieties and process technologies modeled were limited compared to ORION 1.0.

## 3 Power Modeling

In this section, we derive architectural-level parameterized power models for major router building blocks including FIFO buffers, crossbar switches, and arbiters. In addition, we model the power due to clocking of router blocks and links between routers. These power components sum up to 94% of total power consumed in a high-performance router used in the Intel 80-core Teraflop chip [13] in the 65nm technology node. The communication power is significant at 28% of each processing tile’s total power. As shown in Figure 2, clocking power, 33%, is the largest component of router power, with the FIFO buffers the second largest component at 22%. Power due to physical links, crossbar switch, and arbiter come next at 17%, 15% and 7%, respectively. In ORION 2.0 we add (1) clocking and link power models, (2) flip-flop-based FIFO power models, and (3) arbiter leakage model. For the rest of the components we enhance/update current ORION models. In this section we first describe our dynamic power modeling and then present our leakage power modeling with specific analysis of arbiter leakage power model.

### 3.1 Dynamic Power Modeling

Dynamic power consumption in CMOS circuits is formulated as $P = E \cdot f_{clk}$, where energy $E = \frac{1}{2} \cdot f_{clk} \cdot C_{dl} \cdot V_{dd}^2$, with $f_{clk}$ the clock frequency, $C_{dl}$ the switched capacitance, and $V_{dd}$ the supply voltage. We derive detailed parameterized equations for estimating switching capacitance of (1) register-based FIFO buffers, (2) clocking due to routers, and (3) physical links.

#### 3.1.1 Clock

Clock distribution and generation comprise a major portion of power consumption in synchronous designs [9], representing up to 33% of power consumption in a high-performance router [13].

We estimate the term $C_{clk}$, as shown in Equation 1. Throughout our modeling approach it is assumed that all components are built using static CMOS gates. Given that the load of the clock distribution network heavily depends on its topology, we assume an $H$-tree distribution style

$$C_{clk} = C_{ram–fifo} + C_{pipeline–registers} + C_{register–fifo} + C_{wiring}$$

where $C_{ram–fifo}$, $C_{pipeline–registers}$, $C_{register–fifo}$, and $C_{wiring}$ are capacitive loads due to memory structures, pipeline registers, FIFO registers, and clock distribution wiring respectively.

#### Memory structures.

We adapt the original ORION model for SRAM buffers for determining the precharge circuitry capacitive load on the clock network. The pre-charging circuit is just the pre-charging transistor, $T_p$, which commonly is just a single PMOS.

---

$^1$Power breakdown numbers include both dynamic and leakage power values.
Hence, its capacitance, $C_{\text{chg}}$, is due to its gate and drain end capacitances, $C_g(T_e)$ and $C_d(T_e)$ respectively as shown in Equation 2.

In an SRAM FIFO with $B$ buffers and flit size $F$, the total capacitance due to pre-charging circuitry can be derived using Equation 3, with $P_l$ and $P_c$ being the number of read and write ports, respectively.

$$C_{\text{chg}} = C_g(T_e) + C_d(T_e)$$

(2)

$$C_{\text{fifo}} = (P_l + P_c) \cdot F \cdot B \cdot C_{\text{chg}}$$

(3)

**Pipeline registers.** Typical interconnection network routers have different pipeline stages. To advance, each flit must proceed through the steps of: (1) routing computation, (2) virtual-channel (VC) allocation, (3) switch allocation, and (4) switch traversal. We assume DFF as the building block of the pipeline registers. In a router with flit size of $F$ bits and $N_{\text{pipeline}}$ pipeline stages, the capacitive load on the clock due to pipeline registers is

$$C_{\text{pipeline-registers}} = N_{\text{pipeline}} \cdot F \cdot C_{\text{ff}}$$

(4)

where $C_{\text{ff}}$ is the flip-flop capacitance and is extracted from 65nm HP (high-performance) and LP (low-power) libraries.

**Register-based FIFOs.** FIFO buffers can be implemented as a series of flip-flops. We assume simple DFF to construct the FIFO. In a $B$-entry register-based FIFO with flit size of $F$ bits, the capacitive load on the clock can be computed as:

$$C_{\text{register-fifo}} = F \cdot B \cdot C_{\text{ff}}$$

(5)

For the registers we assume D flip-flop (DFF) is used as the building block. We obtain the capacitance value across different drive strengths from TSMC 65nm G and LP standard cell library data sheets. Architectural parameters change the effective loading of each gate in the design. Hence, to use the appropriate drive strength for the registers we use their load capacitance and timing requirements. In this work, we assume minimum-size DFFs are used in all the registers.

**Wiring load.** For a 5-level $H$-tree clock distribution, total wire capacitance is given in Equation 6, where $C_{\text{int}}$ is the per-unit-length wire capacitance and $D$ is the chip dimension.

$$C_{\text{wiring}} = \left( \frac{16}{2} + \frac{1 \times 8}{2} + \frac{2 \times 4}{2} + \frac{4 \times 2}{2} + \frac{8 \times 1}{2} \right) \cdot C_{\text{int}}$$

(6)

3.1.2 Register-Based FIFO Buffers

FIFO buffers consume up to 22% of the total router power in [13]. As mentioned earlier, FIFO buffers can be implemented as either SRAM or shift registers. The ORION 1.0 model supports only the use of SRAM-based FIFOs. We use flip-flops as the building block of the shift registers [28]. Hence, a $B$-entry FIFO buffer can be implemented as a series of $B$ flip-flops (FF).

**Write operation.** The write operation occurs at the tail of the shift register. Assuming the new flit is $f_n$ and the old flit is $f_o$, the number of switched flip-flops is the Hamming distance between them. Therefore, the write energy is:

$$E_{\text{write}} = H(f_n, f_o) E_{\text{switch}}^{\text{ff}}$$

(7)

where $E_{\text{switch}}^{\text{ff}}$ is the energy to switch one bit. To simplify the analysis, let $H$ denote the average switching activity; then, the average write energy is:

$$E_{\text{write}} = H \cdot E_{\text{switch}}^{\text{ff}}$$

(8)

**Read operation.** The read operation has two steps:

1. The flit stored at the header of the buffer is read into the crossbar. Since the header of the buffer is directly connected to the input port of the crossbar, this step does not consume any energy in the buffer.

2. Subsequent flits in the buffer are shifted one position towards the header. If the buffer holds $n$ flits before the read operation, $n-1$ flip-flop writes are performed to shift the data.

Hence, the average read energy is:

$$E_{\text{read}} = (n-1) \cdot E_{\text{write}}$$

(9)

We obtain the capacitance value across different drive strengths from TSMC 65nm G and LP standard cell library data sheets.

3.1.3 Allocators and Arbiters

We modified the separable VC allocator microarchitecture in ORION 1.0 to optimize its power consumption. Instead of two stages of arbiters, we have a single stage of $pV$ arbiters, each governing one specific output VC, where $p$ and $v$ are the number of input physical and virtual channels, respectively. Instead of sending requests to all output VCs of the desired output port, input VCs first check the availability of output VCs, then send a request for any one available output VC. The arbiters will resolve conflicts where multiple input VCs request the same output VC. This design has lower matching probability but does away with an entire stage of arbiters, significantly saving power.

We also added a new VC allocator model in ORION 2.0 which models VC allocation as VC “selection” instead, as was first proposed in [17]. In this approach, the pipeline first goes through switch allocation, before selecting an available VC, effectively simplifying VC allocation to reading from a queue of available VCs. Power consumed thus becomes largely invariant to actual number of VCs, and can be modeled as a constant (derived through RTL synthesis) that will be scaled across process technologies. For example, in a 65nm 5 × 5 router with two VCs, the separable allocators consumes 8.47mW as opposed to 0.167mW consumed by VC selection model with $V_{dd}=0.9V$, $f_{clk}=550MHz$, and $\alpha=1$. The VC selection power value is derived from the synthesis of the entire VC allocator using Synopsys Design Compiler v2007.12-SP5 [33].

3.1.4 Physical Link

The dynamic power of links is primarily due to charging and discharging of capacitive loads (wire and input capacitance of next-stage repeater). Internal power dissipation, arising from charging and discharging of internal capacitances and short-circuit power, is noticeable for repeaters only when the input slew times are extremely large. Link power is a major component of a router total power (i.e., 17% [13]). Previous works such as [12, 11] only use delay as the objective for buffer insertion. This results in large repeaters and significantly increases the power consumption. In this work, we use a hybrid buffering solution that minimizes a linear combination of delay and power. We exhaustively evaluate a given objective function for a given number and size of repeaters, while searching for the optimal (number, size) values. Dynamic power is given by the well-known equations:

$$P_{\text{link}} = \alpha \cdot C_l \cdot V_{dd}^2 \cdot f_{clk}$$

(10)

$$C_l = C_{\text{in}} + C_{\text{gnd}} + C_{\text{cc}}$$

(11)

where $P_{\text{link}}$, $\alpha$, $C_l$, $V_{dd}$ and $f_{clk}$ denote the link dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively. The load capacitance is the sum of the input capacitance of the next repeater, $C_{\text{in}}$, and the ground ($C_{\text{gnd}}$) and coupling ($C_{\text{cc}}$) capacitances of the wire driven. $C_{\text{in}}$ can be reliably obtained from industry Liberty files. The unit $C_{\text{gnd}}$ and $C_{\text{cc}}$ are also extracted from industry LEF files.

3.2 Leakage Power Modeling

As technology scales to deep sub-micron processes, leakage power becomes increasingly important as compared to dynamic power. There is thus a growing need to characterize and optimize network leakage power as well. Chen et al. [7] proposed an architectural methodology for estimating leakage power. However, [7] only considered subthreshold leakage whereas from 65nm
and beyond gate leakage gains importance and becomes a significant portion of the leakage power. This is even more visible for high-performance applications where gate oxides are much thinner (i.e., ~1.5nm in 65nm HP library).

We follow the same methodology proposed in [7] with addition of gate leakage in our leakage analysis. We also use different $V_{th}$ flavors to better represent leakage power consumption for different applications (i.e., high-performance vs. low-power).

To derive an architectural leakage model, we can separate the technology-dependent variables such as transistor width from those that stay invariant for a specific process technology:

$$I_{leak}(i,s) = W(i,s) \cdot (I'_{sub}(i,s) + I'_{gate}(i,s))$$  \hspace{1cm} (12)

where $I_{leak}$ is total leakage current. $I'_{sub}$ and $I'_{gate}$ are subthreshold and gate leakage currents per unit transistor width for a specific technology, respectively. $W(i,s)$ refers to the effective transistor width of component $i$ at state $s$. We measure $I'_{sub}$ and $I'_{gate}$ for a variety of circuit components, input states, operating conditions (i.e., voltage and temperature), and different $V_{th}$ flavors (i.e., HVT, NVT, and LVT).

We compose architectural leakage power model in a bottom-up fashion for each building block [7].

### 3.2.1 Derivation of $I_{leak}$

For each component $i$ and input state $s$, we simulate $I'_{sub}$ and $I'_{gate}$ using HSPICE and 65nm foundry SPICE model with corresponding technology parameters. Table 2 lists $I'_{sub}$ and $I'_{gate}$ simulated for each fundamental circuit component $i$ (leakage currents differ at different states due to stacking and body biasing effects). Circuit structures can then be hierarchically composed from these fundamental circuit components.

### 3.2.2 Arbiter Leakage Modeling

We applied our methodology to the major building blocks of interconnection networks – buffers, crossbar, and arbiter. Here, we walk through our arbiter modeling to demonstrate the methodology. We model three types of arbiters: (1) matrix, (2) round-robin, and (3) queuing. For an arbiter with $R$ requesters, you can represent its priorities by an $R \times R$ matrix, with a 1 in row $i$ and column $j$ if requester $i$ has higher priority than requester $j$, and 0 otherwise. Let $req_i$ be the $i^{th}$ request, $gnt_i$ the $i^{th}$ grant, and $m_{ij}$ the $i^{th}$ row and $j^{th}$ column elements in the matrix. Using these variables [29],

$$gnt_i = \sum_{j=1}^{R} \left[ (\triangleq r_{ij} + m_{ij}) \times \sum_{k=1}^{R} (\triangleq r_{ik} + m_{ik}) \right]$$  \hspace{1cm} (13)

**Input state probabilistic analysis.** We analyze the probability distribution of each input state of a circuit component by examining how architectural units function. Given the $I'_{sub}$ and $I'_{gate}$, and the probabilities of each input state $Prob(i,s)$, the leakage current for a building block is:

$$I_{leak}(Block,t) = \sum_i \sum_s Prob(i,s) \cdot W(i,s) \cdot (I'_{sub}(i,s) + I'_{gate}(i,s))$$  \hspace{1cm} (14)

**Input state simulation.** Input states can also be tracked through network simulation.

$$I_{leak}(Block,t) = \sum_i W(i,s(t)) \cdot (I'_{sub}(i,s(t)) + I'_{gate}(i,s(t)))$$  \hspace{1cm} (15)

where $I_{leak}(Block,t)$ is the leakage current at time $t$, and $s(t)$ is the state of circuit type $i$ at time $t$ within this circuit block. Finally, we can estimate total leakage current of an arbiter (Equation 16) while its power is leakage current multiplied by supply voltage (Equation 17).

$$I_{leak}(arbiter) = I_{leak}(NOR2) \cdot ((2R-1)R) + I_{leak}(INV) \cdot R + I_{leak}(DFF) \cdot \frac{R(R-1)}{2}$$  \hspace{1cm} (16)

$$P_{leak}(arbiter) = I_{leak}(arbiter) \cdot V_{dd}$$  \hspace{1cm} (17)

### 3.2.3 Physical Link Leakage Modeling

The leakage power of links is due to repeaters inserted in them. In repeaters, leakage occurs in both output states. NMOS devices leak when the output is high, while PMOS devices leak when the output is low. This is important for buffers also because on-chip interconnect devices are the primary contributors due to their large sizes. Leakage power has two main components: (1) subthreshold leakage, and (2) gate-tunneling current. Both components depend linearly on device size. Thus, leakage power can be calculated using [6]:

$$P_{leak} = \frac{P_n^p + P_p^p}{2}$$  \hspace{1cm} (18)

$$P_n^p = \kappa_n^p w_n$$  \hspace{1cm} (19)

$$P_p^p = \kappa_p^p w_p$$  \hspace{1cm} (20)

where $P_n^p$ and $P_p^p$ are the leakage power for NMOS and PMOS devices, respectively, and $\kappa_n^p = -6.034$, $\kappa_p^p = 26.561$, $\kappa_n^p = 1.238$ and $\kappa_p^p = 27.082$ are coefficients determined using linear regression against 65nm LP library. State-dependent leakage modeling can also be performed using Equations (19) and (20) separately.

### 3.3 Model Inputs and Technology Capture

ORION 1.0 used a set of fixed transistor sizes and capacitance values across designs with different target frequencies (i.e., high-performance and low-power). In ORION 2.0 we include data for three device types: (1) high-performance (HP), (2) low standby power (LSTOP), and (3) low operating power (LOP) as defined in ITRS [40]. For wires we follow the approach in the [6] comprehending key interconnect circuit and layout design styles, including a power-efficient buffering technique that overcomes realties of previous delay-driven buffering techniques.

In addition, the most critical gap in existing system-level and NoC optimizations has been the lack of well-defined pathways to capture necessary technology and device parameters from the wide range of available sources. Since exploration of the system-level performance and power envelope is typically done for current and future technologies, the models that drive system-level design must be derivable from standard technology files (e.g., Liberty, ITF, LEF) and process models for both transistors and interconnects (e.g., PTM, ITRS). Earlier works on NoC design space exploration and synthesis [22, 24] collect inputs from ad hoc sources to drive internal models of performance, power and area. In contrast, we have developed a semi-automated flow (i.e., using shell scripting) to extract technology inputs from the above reliable sources. In the absence of such sources, we use predictive models and SPICE simulations to generate appropriate scaling factors.

### 4 Area Modeling

With the increase in number of cores on a single design, the area occupied by the communication components such as links and routers increases. As area is an important economic incentive in IC (integrated circuit) design, it needs to be estimated early in the design flow to enable design space exploration. In this section we present accurate models for router and link area.

<table>
<thead>
<tr>
<th>Technology</th>
<th>$V_{th}$ (mV)</th>
<th>$I_{on}$ (mA)</th>
<th>$I_{off}$ (mA)</th>
</tr>
</thead>
<tbody>
<tr>
<td>NMOS</td>
<td>1.097e-07</td>
<td>5.103e-09</td>
<td>3.291e-09</td>
</tr>
<tr>
<td>PMOS</td>
<td>2.297e-09</td>
<td>1.194e-08</td>
<td>1.766e-07</td>
</tr>
<tr>
<td>INV</td>
<td>3.291e-09</td>
<td>1.097e-07</td>
<td>4.622e-09</td>
</tr>
<tr>
<td>NAND2</td>
<td>6.701e-09</td>
<td>3.172e-07</td>
<td>4.622e-09</td>
</tr>
<tr>
<td>NOR2</td>
<td>7.724e-08</td>
<td>6.448e-09</td>
<td>6.448e-09</td>
</tr>
</tbody>
</table>

**Table 2:** $I'_{sub}$ and $I'_{gate}$ (per micron of gate width) for each fundamental circuit component $i$ at different input state $s$, 25°C and for high $V_{th}$ flavor.
4.1 Gate Area
We use a recent model by [31] and the analysis in [27] to estimate the areas of transistors and gates such as inverters, NAND, and NOR gates. This is a fast technique to estimate standard cell characteristics before the cells are laid out. Figure 3 shows the layout model that has been used in [31]. Table 3 shows the process- and technology-level input parameters required by this gate area model. When a transistor exceeds a certain maximum value, the transistor assumed to be folded. Given the width of an NMOS, $W_n$, the number of folded transistors can be calculated as follows:

$$N_{folded-transistor} = \left\lfloor \frac{W_n}{H_{n-diff}} \right\rfloor$$

(21)

The equation for total diffusion width of $N_{stacked}$ transistors when they are not folded is given by the following equation:

$$W_{diffusion-area} = 2(W_{contact} + 2S_{pc}) + N_{stacked}W_{poly} + (N_{stacked} - 1)S_{pp}$$

(22)

total diffusion width of $N_{stacked}$ transistors when they are folded is given by the following equation:

$$W_{diffusion-area} = N_{folded-transistor}(2(W_{contact} + 2S_{pc}) + N_{stacked}W_{poly} + (N_{stacked} - 1)S_{pp})$$

(23)

and finally the height of a gate is calculated using the following equation:

$$H_{gate} = H_{n-diff} + H_{p-diff} + H_{gap-opp} + 2H_{power-rail}$$

(24)

4.2 Router Area
To estimate the router area we basically compute the area of each of the building blocks and sum them up with an addition of 10% (rule of thumb) to account for global whitespace. For each building block we first identify the implementation style of the block and then decompose the block into its basic logical elements (i.e., gate-level netlist). Then, we use the gate area model described earlier to estimate the area of the entire block.

FIFO buffers. Designers typically implement buffers as SRAM arrays. Some on-chip networks, such as RAW microprocessor [25], use shift registers due to less demanding buffer space. We model the area of both implementations, but explain only the SRAM-based model here. We use the same SRAM-based FIFO buffer as in [29]. Equations (25) and (26) compute the word line and bit line lengths of the FIFO, respectively.

4.3 Link Area
The area occupied by links are due to wires and repeaters. We use the described gate area model to estimate the area of repeaters. The area of global wiring can be calculated as

$$Area_{link} = F \times (w_w + s_w) + s_w$$

(28)

where $Area_{link}$ denotes the wire area, $F$ is the flit size in bits, and $w_w$ and $s_w$ are the wire width and spacing computed from the width and spacing of the layer (global or intermediate) on which the wire is routed, and from the design style.

5 Validation and Significance Assessment
Table 7 lists all the architectural parameters used in ORION 1.0 and ORION 2.0 models. $N_{pipeline}$, $A_{pp}$, and $D$ are three new parameters added to ORION 2.0. We now show the validation of the our ORION 2.0 model against the Intel 80-core Teraflops chip [13] targeted for high-performance CMPs. The router in this design is a 5.1GHz 5-port switched router that uses two virtual channels (VC) with 16 buffers within each VC for deadlock-free routing. The flit width is 39 bits. The router uses a 5-stage pipeline with a round robin arbitration scheme. The supply voltage is at 1.2V with the transistor junction temperature at 110°C. The crossbar switch is implemented using the popular multiplexer-tree and the chip is taped out in 65nm technology. Our estimated total power consumption per router is within 7% of the Teraflops post-layout power estimation. Table 6 shows the difference between ORION 1.0 and ORION 2.0 with Intel 80-core Teraflops chip post-layout power and area numbers. We observe that ORION 2.0 has significantly improved the accuracy of the power and area models compared with ORION 1.0.4 Table 4 gives a power breakdown comparison of the Intel 80-core Teraflops chip with ORION 2.0.5 We attribute the deviation of ORION 2.0 estimates from the real chip simulations to: (1) difference in library files (i.e., between Intel’s and those used in our study), (2) our use of 6T SRAM vs. Intel’s use of 8T SRAM buffers, and (3) difference in arbitration scheme.6

We also validate our proposed models with the Intel Scalable Communications Core (SCC) chip [14] targeted for ultra-low power Systems-on-Chip (SoCs). The router in this design has five ports with an arbitrated 5×5 crossbar and flip-flop-based buffers. The router is unpipelined, and thus has only 1 stage. There are two 1-flit buffers per input port with flit size of 32 bits. There are two input buffers per input port and one output buffer at the output port. The operating frequency is 250MHz; the supply

---

Table 3: Process and technology input parameters used in the gate area model [27].

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>$H_{n-diff}$</td>
<td>Maximum height of n-diffusion</td>
</tr>
<tr>
<td>$H_{p-diff}$</td>
<td>Maximum height of p-diffusion</td>
</tr>
<tr>
<td>$H_{gap-same}$</td>
<td>Minimum gap between diffusions of the same type</td>
</tr>
<tr>
<td>$H_{gap-opp}$</td>
<td>Minimum gap between n and p diffusions</td>
</tr>
<tr>
<td>$H_{power-rail}$</td>
<td>Height of $V_{DD}$ and $V_{SS}$</td>
</tr>
<tr>
<td>$W_{poly}$</td>
<td>Minimum width of poly</td>
</tr>
<tr>
<td>$S_{pp}$</td>
<td>Minimum poly-to-poly spacing</td>
</tr>
<tr>
<td>$W_{contact}$</td>
<td>Contact width</td>
</tr>
<tr>
<td>$S_{pc}$</td>
<td>Minimum poly-to-contact spacing</td>
</tr>
</tbody>
</table>

Table 4: Power (mW) breakdown comparison of ORION 2.0 with post-layout simulation results for the Intel 80-core chip.

<table>
<thead>
<tr>
<th>Component</th>
<th>%diff (ORION 2.0 vs. Intel 80-core)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Buffer</td>
<td>14.2</td>
</tr>
<tr>
<td>Crossbar</td>
<td>16.9</td>
</tr>
<tr>
<td>Arbiter</td>
<td>-9.0</td>
</tr>
<tr>
<td>Clock</td>
<td>-20.9</td>
</tr>
<tr>
<td>Link</td>
<td>8.8</td>
</tr>
</tbody>
</table>

$L_{word-line} = F \cdot (w_{cell} + 2(P_{r} + P_{o})d_{w})$  

(25)

$L_{bit-line} = B \cdot (h_{cell} + (P_{r} + P_{o})d_{w})$  

(26)

where $F$, $B$, $w_{cell}$, $h_{cell}$, $d_{w}$, $P_r$, and $P_o$ are flit size in bits, buffer size in flits, memory cell width, memory cell height, wire spacing, number of read ports and number of write ports, respectively. Hence, the total area for a $B$ entry buffer with flit size of $F$ is calculated as follows. In this model, $h_{cell}$ and $w_{cell}$ are computed using the gate area model described earlier.

$$Area_{fio} = L_{word-line} \cdot L_{bit-line}$$  

(27)

For other router components, namely, crossbar and arbiter we first decompose them into their circuit building block (i.e., gate-level netlist). Then, using the gate area model we estimate the area of individual circuit component and compute the area of the entire block.

---

4 We do not model power management mechanisms in ORION 2.0. Hence, for the Intel 80-core Teraflops chips we assume that all components are on.

5 Currently we do not have a power breakdown of the Intel SCC chip.

6 For arbiter power we use the modified separable VC allocator model.
Table 5: Comparison of network power (P), area (A), total number of routers, and hop count, using ORION 1.0 and ORION 2.0 models.

<table>
<thead>
<tr>
<th>SoC</th>
<th>P (mW)</th>
<th>A (mm²)</th>
<th># of routers</th>
<th>max. # of router ports</th>
<th>max. # of hops</th>
</tr>
</thead>
<tbody>
<tr>
<td>ORION 1.0</td>
<td>65nm</td>
<td>0.857</td>
<td>25</td>
<td>6</td>
<td>10</td>
</tr>
<tr>
<td>ORION 2.0</td>
<td>65nm</td>
<td>0.924</td>
<td>25</td>
<td>6</td>
<td>10</td>
</tr>
</tbody>
</table>

Table 6: Comparison of power (mW) and area (μm²) models of ORION 1.0 and ORION 2.0 with post-layout simulation results of Intel 80-core and pre-layout simulation results of Intel SCC.

<table>
<thead>
<tr>
<th>Intel 80-core</th>
<th>Intel SCC</th>
</tr>
</thead>
<tbody>
<tr>
<td>%diff (total power)</td>
<td>-17.3</td>
</tr>
<tr>
<td>%diff (total area)</td>
<td>-23.5</td>
</tr>
</tbody>
</table>

Table 7: Architectural parameters in ORION 1.0 and 2.0.

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>B</td>
<td>No. of buffers</td>
</tr>
<tr>
<td>P</td>
<td>No. of ports</td>
</tr>
<tr>
<td>V</td>
<td>No. of virtual channels</td>
</tr>
<tr>
<td>X</td>
<td>No. of crossbar ports</td>
</tr>
<tr>
<td>tech</td>
<td>Technology node</td>
</tr>
<tr>
<td>fclk</td>
<td>Clock frequency</td>
</tr>
<tr>
<td>Vdd</td>
<td>Supply voltage</td>
</tr>
<tr>
<td>- Npipeline</td>
<td>No. of pipeline stages</td>
</tr>
<tr>
<td>A</td>
<td>Application domain (i.e., HP, LP, etc.)</td>
</tr>
<tr>
<td>D</td>
<td>Chip dimension</td>
</tr>
</tbody>
</table>

References

27. Predictive Technology Model http://www.eas.asu.edu/~ptm/

Acknowledgments

We would like to thank Dr. Sriman Vangal and Dr. Anthony Chen of Intel Corporation for providing us the Intel 80-core Teraflops and the Intel SCC power estimates, Dr. Hangsheng Wang of Freescale for his technical support of the original ORION code, and Tushar Krishna of Princeton University for assisting with validation. This research is supported by the MARCO Gigascale Systems Research Center.