# MinRoot and CMesh: Interconnection Architectures for Network-on-Chip Systems

Mohammad Ali Jabraeil Jamali, Ahmad Khademzadeh

Abstract-The success of an electronic system in a System-on-Chip is highly dependent on the efficiency of its interconnection network, which is constructed from routers and channels (the routers move data across the channels between nodes). Since neither classical bus based nor point to point architectures can provide scalable solutions and satisfy the tight power and performance requirements of future applications, the Network-on-Chip (NoC) approach has recently been proposed as a promising solution. Indeed, in contrast to the traditional solutions, the NoC approach can provide large bandwidth with moderate area overhead. The selected topology of the components interconnects plays prime rule in the performance of NoC architecture as well as routing and switching techniques that can be used. In this paper, we present two generic NoC architectures that can be customized to the specific communication needs of an application in order to reduce the area with minimal degradation of the latency of the system. An experimental study is performed to compare these structures with basic NoC topologies represented by 2D mesh, Butterfly-Fat Tree (BFT) and SPIN. It is shown that Cluster mesh (CMesh) and MinRoot schemes achieves significant improvements in network latency and energy consumption with only negligible area overhead and complexity over existing architectures. In fact, in the case of basic NoC topologies, CMesh and MinRoot schemes provides substantial savings in area as well, because they requires fewer routers. The simulation results show that CMesh and MinRoot networks outperforms MESH, BFT and SPIN in main performance metrics.

*Keywords*—MinRoot, CMesh, NoC, Topology, *Performance Evaluation*.

## I. INTRODUCTION

**S**ESTEM on Chip (SoC) designs are becoming widely used in telecommunication, consumer electronics and multimedia areas. As technology allows greater integration, they are being investigated in greater detail. By the end of this decade SoCs will grow up to four billion transistors. SoCs incorporate a number of components (or modules) including processors, controllers and memory arrays. These components need to communicate to pass data and/or control information. Thus, a successful SoC design largely relies on the ability to interconnect these components to compute a solution efficiently. An approach to the design of such systems in a single hardware chip is to reuse hardware/software IP cores, resulting in a considerable number of autonomous interconnected cores.

Traditional interconnection architectures, such as a single shared bus or a hierarchy of buses, are no longer a solution to support the increasing interconnection complexity and bandwidth demands of such hardware/software platforms due to their poor scalability and shared bandwidth. It is expected that in the future the aggregate communication bandwidth between cores will scale up to values much larger than the Gbytes/s range for many video applications [1]. To overcome these problems, the NoC has been introduced as a new interconnection paradigm able to integrate a number of IP cores while keeping a high communication bandwidth between them [2], [3].

Different topologies, proposed initially for parallel computing field, have been studied and adapted for NoC (e.g., Mesh, BFT and SPIN). The main focus of this paper is the detailed comparative evaluation of a set of recently proposed NoC architectures with CMesh and MinRoot NoC architectures relative to throughput and latency.

The paper is organized as follows. In Section II, related work of different topologies proposed recently for NoC is briefly presented. Section III, gives an overview of NoC architecture. Prototyping and results are presented in Section IV with conclusion given in Section V.

### II. RELATED WORK

Recently, network topologies have been investigated as a major issue in generic NoC architectures. The topology of onchip interconnect specifies the structure in which routers and IPs are connected together. Interconnection networks are usually classified into four major classes based on network topology: shared-medium network, direct network, indirect network, and hybrid network. In shared medium networks, the communication medium is shared by all connected devices. As it was mentioned in abstract the shared bus is an example of this class. Although this architecture is simple, it is not suitable for future NoCs with an increasing number of modules. In direct network, communicating devices are linked to each other by transmission channels. To transmit a message from one device to another, this message needs to traverse through several intermediate devices if the source and destination are not neighboring. On the other hand, an indirect network connects devices by one or more Routers, thus any message exchanging requires information transmitting through

Mohammad Ali Jabraeil Jamali is with Islamic Azad University, Shabestar Branch, Iran (e-mail: m\_jamali@itrc.ac.ir).

Ahmad khademzadeh is with Iran Telecommunication Research Center, Tehran, Iran (e-mail: zadeh@itrc.ac.ir).

one or more Routers. Finally, hybrid network is also possible by using elements of the previous three paradigms. Examples of these architectures are MESH in [4], SPIN [5], BFT in [6].

## A. MESH

Shashi Kumar et al. [7] proposed methodology called CLICHÉ (MESH) - Chip-Level Integration of Communicating Heterogeneous Elements. Each IP unit has a router node. As described before point-to-point connection is supported by two one-directional buses. The router architecture lies in input and output buffers, input and output arbiters, multiplexer, demultiplexer and routing logic.

One physical port could have several virtual channels, but only one virtual could have access to a physical port. The arbiter that contained in each router is based on priority matrix as a result gives grants to virtual channel [7]. Such kind of architecture is scalable and has simple structure despite it is not acceptable for parallel computation, data flow, and digital signal processing. 2D *nxn* Mesh has a number of bidirectional links quals  $3n^2$ -2n, and a diameter equals 2n. In this architecture, the number of routers quals to R=N, where Nis the system size in terms of number of functional (see Figure 1).



## B. SPIN

Guerrier and Greiner [8] proposed a tree like generic interconnect template called SPIN (Scalable, Programmable, Integrated Network) for on-chip packet switched interconnections network. A fat tree architecture is used to interconnect IP blocks. Figure 2 shows the basic SPIN architecture with 16 nodes, representing the number of functional IP blocks in the system. Every node has four children and the parent is replicated four times at any level of the tree. The size of the network grows at (*NlogN*)/8. The functional IP blocks reside at the leaves and the Routers reside at the vertices. In this architecture, the number of routers converges to R = 3N/4, where N is the system size in terms of number of functional. Among all simple architectures SPIN seems to be complex but despite it is cost-efficient for VLSI.

## C. BFT

BFT [9] (figure 3) architecture similar to SPIN belongs to fat-tree architectures and has the same concept: the routers are situated in the nodes of a tree and IP units in the leaves. Despite BFT concept has a difference from SPIN. The number of levels depends on a number of IP's:

The number of routers in current level of such kind architecture can be found the next way:  $R_l = N/2^{l+1}$  where *l* is a current level.In this architecture, the number of Routers converges to R = N/2\* [(N/4+1/2\*N/4+1/4\*N/4+...)].



## III. NETWORK ON CHIP ARCHETECTURE

A NoC consists of a set of routers interconnected according to a certain topology. A router is used to route messages along the topology. Homogeneous interconnection architectures are easily scaled up and facilitate modular design, but are not tailored to the application characteristics. They are probably the best choice for general-purpose computing. However, systems developed for a particular class of applications can benefit from a more heterogeneous communication infrastructure that provides high bandwidth in a localized fashion where it is needed to eliminate bottlenecks [10]. The traditional NoC structure does not take advantage of this property in spite of the physical proximity of the cores. To obtain NoC solutions with lower area overhead and average communication latency, we propose the design of two NoC based on the CMesh architecture and MinRoot architecture of figure 4 and figure 5.



# A. CMesh

There are a number of ways to improve network performance in a NoC such as reducing the maximum distance across the network, as well as increasing the buffer size. To exploit these characteristics, we propose a clustered mesh network design shown in Figure 4, where a larger router services multiple IP's. As this reduces the number of routers on chip, this topology allows each router to consume more area, while still consuming little overall chip space. This allows for larger buffers in the router, as instead of requiring 20 network input buffers for every 4 IP's, only 8 buffers are required (4 inter-router ports and 4 IP ports). It also reduces the maximum effective distance across the network, as each hop on a router now moves the packet two IP's closer to the destination.

Concentration of nodes provides a more compact layout and reduces wire, allowing wider channel widths. Also fewer routers permit a lower hop count without increasing wiring complexity (as opposed to tree-based designs).

In the CMesh architecture, a router can have four local connections with neighbor cores. Therefore, neighbors can exchange data through a single router, reducing communication latency. The router, shown in Figure 6, is the main component of a CMesh, responsible for providing transfer of packets between IPs. The router has a single, centralized control logic and up to eight bi-directional ports: East (E), West (W), North (N), South (S) and 4 Locals (L). Each port has an input buffer for temporary storage of packets.



Fig. 6. CMesh Router

The control logic implements the routing and arbitration algorithms. When a router receives a header flit, arbitration is performed, and if the incoming packet request is granted, an routing algorithm is executed to connect the input port data to the correct output port. If the chosen port is busy, the header flit, as well as all subsequent flits of this packet, will be blocked in the input buffers. The routing request for this packet will remain active until a connection is established in some future execution of the procedure in this router. When the routing algorithm finds a free output port to use, the connection between the input port and the output port is established. After routing all flits of the packet, connection is closed. Arbitration logic is used to grant access to an output port when one or more input ports require a connection at the same time. A round-robin arbitration scheme is used to avoid starvation.

When a flit is blocked in a given router, the performance of the network is affected, since several flits belonging to the same packet may be blocked in several intermediate routers.

To minimize the delay and the required resources, we have used wormhole method for the switching. In this method, a packet is divided to smaller segments called flits (Flow control digit) and then these flits are routed successively until they reach their destination.

## B. MinRoot

MinRoot architecture (see figure 5) similar to BFT belongs to fat-tree architectures and has the same concept: the routers are situated in the nodes of a tree and IP units in the leaves. Despite MinRoot concept has a difference from BFT.

The number of routers in current level of such kind architecture can be found the next way:  $R_l = N/4^l$  where *l* is a current level.

In this architecture, the number of routers converges to P = N(2 + f(A) + h(A + h(

R = N/3 \* [(N/4 + 1/4 \* N/4 + 1/4 \* (1/4 \* N/4) + ...)].

Table 1 compares the number of Routers and the number of links for the five topologies.

| # ROUTERS AND # LINKS FOR THE FIVE TOPOLOGIES |         |       |         |       |         |       |  |  |  |
|-----------------------------------------------|---------|-------|---------|-------|---------|-------|--|--|--|
|                                               | N=16    |       | N=64    |       | N=256   |       |  |  |  |
|                                               | #Router | #Link | #Router | #Link | #Router | #Link |  |  |  |
| MESH                                          | 16      | 40    | 64      | 176   | 256     | 736   |  |  |  |
| SPIN                                          | 8       | 32    | 32      | 128   | 128     | 512   |  |  |  |
| BFT                                           | 6       | 24    | 28      | 112   | 120     | 480   |  |  |  |
| MinRoot                                       | 5       | 20    | 21      | 84    | 85      | 337   |  |  |  |
| CMesh                                         | 4       | 20    | 16      | 88    | 64      | 368   |  |  |  |

TABLE I

## IV. PROTOTYPING AND RESULTS

## A. Router and NoC Area Evaluation

The network area cost including the area of Routers, multiplexers/demultiplexers, buffers, and links are also analyzed as shown in tables 2 and 3. The area costs of the MESH, SPIN and BFT increase rapidly due to the linearly increasing buffer area of the MESH and superlinearly increasing switch fabric area of the MESH, SPIN and BFT. The MESH or the SPIN or the BFT network occupies almost the 30% of the entire chip area. The area consumption of the hierarchical topologies (CMesh and MinRoot) occupies only the 10%-15% of the overall chip area. Considering the area cost, CMesh and MinRoot topologies are the most costeffective topologies in general. We have described the three routers and three NOCs in VHDL and than synthesized using the Leonardo synthesis tool. From the implementation, we concluded that in the worst case it can operate at a frequency around 100 MHz Since the switch of the router is fully connected, the router is able to forward n packets at a time, where n is the number of ports. It means that it has a total bandwidth of b x n, where b is the bandwidth of a single port. For example, a router with eight ports, a data width of 16 bits operating at 100MHz has a total bandwidth of 12.4 Gbps. The areas occupied by routers are resumed in table 2.

TABLE II AREA OF ROUTER

| XC9500XL Mapping |        | ASIC Mapping (0.35µm<br>CMOS )<br>#Gates |        |        |
|------------------|--------|------------------------------------------|--------|--------|
| Router           | #Gates | #Flip Flops                              | Buffer | Switch |
| MESH             | 9069   | 856                                      | 2689   | 1861   |
| MinRoot          | 11390  | 825                                      | 1252   | 2185   |
| CMesh            | 15967  | 1335                                     | 2763   | 2896   |

| TABLE III<br>Area of NoC |         |             |                                          |  |  |  |  |  |
|--------------------------|---------|-------------|------------------------------------------|--|--|--|--|--|
| Type of<br>Topology      | XC95002 | XL Mapping  | ASIC Mapping (0.35µm<br>CMOS )<br>#Gates |  |  |  |  |  |
|                          | #Gates  | #Flip Flops |                                          |  |  |  |  |  |
| MESH                     | 115281  | 110000      | 215817                                   |  |  |  |  |  |
| MinRoot                  | 55359   | 12654       | 95087                                    |  |  |  |  |  |
| CMesh                    | 60600   | 14720       | 101005                                   |  |  |  |  |  |

Table 2 details the area usage of the Router modules for two mappings, FPGA and ASIC. Table 2 contains the number of gates used to implement routers, and FIFOs with a depth of 8 words (word=16 bits). Table 3 details the area usage of the

NOC modules for two mappings, FPGA and ASIC. In this experiments, we consider each system to be consisting of 16 functional IP blocks, i.e., N=16.

## B. Performance Evaluation

There are two kinds of traffic pattern: one is uniform random traffic and the other is localized traffic with a locality factor. The locality factor means a ratio of the intracluster traffic to the overall traffic. There are options of using both Poisson and self-similar message injection distributions. Selfsimilar traffic has been observed in the bursty traffic between on-chip modules in typical MPEG-2 video applications [11] and networking applications [12]. It is obvious that IPs requiring low latency and large bandwidth communication can get more synergetic performance by locating them in the same clster based on their communication locality so that the intracluster traffic is dominant over all of on-chip traffic. The locality factor can represent the localized traffic pattern quantitatively.

We now compare the throughput and latency characteristics of the various NoC architectures. The throughput of the communication infrastructure generally depends on the traffic pattern. Throughput is the maximum traffic accepted by the network and it relates to the peak data rate sustainable by the system. The accepted traffic depends on the rate at which the functional IP blocks are injecting data into the network. Ideally, accepted traffic should increase linearly with this injection load. However, due to the limitation of routing resources (switches and interconnect wires), accepted traffic will saturate at a certain value of the injection load. Similarly to the throughput, the unit of measure for injection load is also flits/cycle/IP.

The plots in Figure 7 also indicate that, under the uniform traffic assumption CMesh, and MinRoot provide a lower throughput than do BFT, MESH and SPIN. This happens due to the fact that SPIN and MESH have more links between a source and a destination pair than do the others. We observe that the accepted traffic increases linearly with the injection load up to the throughput saturation point.



Fig. 7. Network throughput under uniform traffic (poisson).

While these results are as one would expect, the assumption of spatial uniformity of traffic is not very realistic in an SoC environment since different functions will be mapped to different parts of the SoC and they will exhibit highly localized patterns. Hence, we studied the effect of traffic localization on throughput for injection process and considered the illustrative case of spatial localization where local messages travel from a source to the set of the nearest destinations. In the case of BFT, SPIN and MinRoot, localized traffic is constrained within a cluster consisting of a single subtree, while, in the case of MESH and CMesh, it is constrained within the four destinations placed at the shortest Manhattan distance [13]. For example, if the localization factor is 0.6, then 60 percent of the traffic generated by an IP occurs within its cluster, while the rest of the traffic is randomly distributed in the remainder of the entire SoC.



Fig. 8. Network throughput under localized traffic (poisson).

Figure 8 shows the effect of traffic localization on throughput for all the topologies. Localization of traffic does not have much impact on SPIN, but it enhances the throughput of BFT, MESH, CMesh, and MinRoot considerably. Though SPIN have very high throughput for the uniformly distributed traffic, it lack the ability to exploit the traffic localization inherent in SoC architectures.



Fig. 9. Network latency under uniform traffic (poisson).



Fig. 10. Network latency under localized traffic (local factor=0.3) (poisson).

In Figure 9, we show the variation of latency with injection load in the case of Poisson for uniform traffic. The injection load directly affects the average message latency. We considered the effect of traffic localization on the latency. Variation of latency with localization factors of 0.3 and 0.8 is shown in Figures of 10 and 11, respectively.

It is seen from Fig. 11 that, similar to the case of throughput characteristics, traffic localization does not have significant impact on the latency variations for SPIN, but it enhances the latency of BFT and MESH considerably, specially CMesh, and MinRoot.



Fig. 11. Network latency under localized traffic(local factor=0.8) (poisson).

The benefits of traffic localization are evident from these figures: Increasing the amount of traffic localization causes more messages to be injected without increasing the average latency, specially the CMesh, and MinRoot architectures. This happens due to the fact that, on the average, messages will traverse fewer stages and clusters in the case of a greater amount of localization. Consequently, the functional mapping should be performed so as to exploit the advantages of spatial locality, i.e., the blocks that communicate more frequently should be placed close to each other. This will reduce the use of long global paths and the energy dissipation.

From Figures 8-11, we can infer that the architectures with a higher degree of locality like MinRoot and CMesh have greater performance at high degree of locality than the others.

## V. CONCLUSION

The burgeoning adoption of complex SoC architectures, coupled with diminishing feature sizes, has heightened the need for very efficient on-chip interconnects. NoCs have become the dominant choice to fulfill this need. However, their resource-constrained nature has posed several challenges to their optimization in terms of area, power and performance. In this work, we propose a new network topology (CMesh) which injects messages into the on-chip network from four, rather than one, points per node. In the case of MESH network implementations, the proposed CMesh scheme also provides significant area savings, because it requires fewer routers than a generic MESH. This flexibility stems from the fact that multiple PEs can connect to a single router.

Also, we propose a new network topology (MinRoot) which same as CMesh topology provides significant area savings. The study compares these topologies with three NoC architectures: MESH, SPIN and BFT. Selected configurations of these architectures have been constructed and simulated to compare main performance metrics such as throughput, message latency, and network load. Therefore, we can infer that the architectures with a higher degree of locality like MinRoot and CMesh have greater performance at high degree of locality than the others.

#### References

- [1] Guerrier, P., Greiner, A.: A generic architecture for on-chip packetswitched interconnections. DATE (2000).
- [2] Hemani, A., et al.: Network on chip: An arquitectura for billion transistor era. Proceedings of the IEEE NorChip Conference (2000).
- [3] Dally, W., Towles, B.: Route packets, not wires: on-chip interconnection networks. Proceedings of DAC (2001).
- [4] Kumar, S., Jantsch, A., Soininen, J.-P., Forsell, M., Millberg, M., Öberg, J., Tiensyrjä, K., & Hemani, A. (2002). A network on chip architecture and design methodology. In: *Proceedings of int't symp. VLSI (ISVLSI)* (pp. 117–124).
- [5] P. Guerrier, A. Greiner, "A generic architecture for onchip packetswitched interconnections", *Proceedings of DATE*, Paris, France, March, 2000. pp. 250–256.
- [6] P.P. Pande, C. Grecu, A. Ivanov, and R. Saleh, "Design of a Switch for Network on Chip Applications," Proc. Int'l Symp. Circuits and Systems (ISCAS), vol. 5, pp. 217-220, May 2003.
- [7] S. Kumar et al., "A Network on Chip Architecture and Design Methodology," Proc. Int'l Symp. VLSI (ISVLSI), pp. 117-124, 2002.
- [8] J. Hennessey and D. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003.
- [9] P.P. Pande, C. Grecu, A. Ivanov, and R. Saleh, "Design of a Switch for Network on Chip Applications," Proc. Int'l Symp. Circuits and Systems (ISCAS), vol. 5, pp. 217-220, May 2003.
- [10] Benini, L., Micheli, G.: Networks on chips: a new SoC paradigm. SBCCI (2005) Computer 35(1) (2002), pp. 70–78.
- [11] G. Varatkar and R. Marculescu, "Traffic Analysis for On-ChipNetworks Design of Multimedia Applications," Proc. Design Automation Conf. (DAC), pp. 510-517, June 2002.

- [12] D.R. Avresky, V. Shubranov, R. Horst, and P. Mehra, "Performance Evaluation of the ServerNetR SAN under Self-Similar Traffic," Proc. 13th Int'l and 10th Symp. Parallel and Distributed Processing, pp. 143-147, Apr. 1999.
- [13] D. Wingard, "MicroNetwork-Based Integration for SoCs," Proc. Design Automation Conf. (DAC), pp. 673-677, June 2001.