Congestion-aware core mapping for Network-on-Chip based systems using betweenness centrality


Accepted Manuscript

Congestion-aware core mapping for Network-on-Chip based systems using betweenness centrality
Tahir Maqsood, Kashif Bilal, Sajjad A. Madani

PII: S0167-739X(16)30844-5
DOI: http://dx.doi.org/10.1016/j.future.2016.12.031
Reference: FUTURE 3272
To appear in: Future Generation Computer Systems
Received date: 28 February 2016
Revised date: 21 November 2016
Accepted date: 24 December 2016

Please cite this article as: T. Maqsood, K. Bilal, S.A. Madani, Congestion-aware core mapping for Network-on-Chip based systems using betweenness centrality, Future Generation Computer Systems (2016), http://dx.doi.org/10.1016/j.future.2016.12.031

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Congestion-Aware Core Mapping for Network-on-Chip based Systems using Betweenness Centrality

Tahir Maqsood, Kashif Bilal, Sajjad A. Madani
Department of Computer Science, COMSATS Institute of Information Technology, Pakistan
{tmaqsood, kashifbilal, madani}@ciit.net.pk

Abstract: Network congestion has a significant impact on application performance and network throughput in Network-on-Chip (NoC) based systems. Efficient core mapping can significantly reduce network contention and end-to-end latency, leading to improved application performance in NoC based multicore systems. In this work, we propose a Congestion-Aware (CA) core mapping heuristic based on the betweenness centrality metric. The proposed CA algorithm optimizes core mapping using the betweenness centrality of links to alleviate congestion on highly loaded NoC links. We use a modified betweenness centrality metric to identify highly loaded NoC links that are more prone to congestion. In contrast to the traditional betweenness centrality metric, which is generally used to measure the structural/static characteristics of a system, the adapted metric utilizes the volume of communication traversing the edges (NoC links) to capture the operational and dynamic characteristics of the system. The experimental results demonstrate that our proposed algorithm achieves significantly lower average channel load and end-to-end latency than the baseline First Fit (FF) and Nearest Neighbor (NN) core mapping algorithms. In particular, the CA algorithm achieves up to 46% lower channel load and 12% lower end-to-end latency than the FF algorithm. Moreover, the proposed algorithm exhibits an average gain of 32% in reduced network energy consumption compared to the baseline configuration.

Keywords: Congestion-aware core mapping, Network-on-Chip (NoC), Betweenness centrality

1 Introduction

Emerging big data applications are characterized by massive datasets, a high degree of parallelism, and strict power and performance constraints that differ significantly from traditional high performance computing (HPC) and desktop workloads. The business potential of these big data applications, coupled with cloud services, serves as a driving force for the development of innovative computer architectures. For instance, to address the challenges of such emerging applications, Intel developed the Single-chip Cloud Computer (SCC), an experimental system based on a multicore architecture [1]. Similarly, Tilera developed chips with 64 and 100 cores connected through an on-chip network [2, 3]. Due to advancements in integration technologies and silicon chip design, future Multi-Processor System-on-Chip (MPSoC) systems are anticipated to comprise hundreds or even thousands of processing cores, also termed Processing Elements (PEs) [4-6]. Multicore architectures such as the SCC promise tremendous compute power, which can be leveraged by efficiently splitting the workload across multiple processing cores. Ideally, the division of workload should lead to linear speedup, i.e., speedup proportional to the number of cores on the chip. In practice, however, as the number of processing cores increases, the raw compute performance can be overshadowed by overheads such as network communication among interdependent tasks executing on different cores. Therefore, efficient communication mechanisms and interconnect design play a vital role in achieving high performance and energy efficiency in future multicore architectures. For multicore systems with a large number of PEs, Network-on-Chip (NoC) has emerged as a scalable alternative to the traditional bus-based architecture [7, 8].

In traditional MPSoCs, a shared bus or global wires are used for communication among the PEs. NoC based MPSoCs, in contrast, use a network of shared links connected through routers in different topologies, such as mesh, torus, or tree. The advantage of a NoC based architecture is that multiple PEs can communicate simultaneously over different paths. It is highly desirable that parallel communication between different pairs of PEs should not interfere, i.e., the paths should be contention free. However, in practice it is very difficult to achieve totally contention-free paths. Consequently, congestion within the NoC can significantly affect overall application performance, especially network latency and system throughput [9, 10]. Reducing network contention in NoC based multicore systems is crucial, as contention can delay the startup and/or completion of tasks that must wait for data to arrive before they can execute [8]. Network contention generally arises when a communication channel is occupied by one packet and other packets must wait until the channel becomes available. Most previous studies consider mapping a single task onto each processing core and conduct their analysis over a single mesh NoC [4, 10-15]. Moreover, very few works consider both task and core mapping to reduce network contention in NoC based multicore systems. In this work, we consider mapping of multiple tasks onto each core, and we demonstrate the effectiveness and performance of our proposed scheme using extensive simulations with mesh sizes ranging from 5×5 to 10×10. Furthermore, we conduct a congestion analysis of our previously proposed communication- and energy-aware task packing and core mapping algorithms [16], and we propose a congestion-aware core mapping heuristic with the objective of reducing network contention and average end-to-end latency. The proposed algorithm works by first identifying the highly loaded NoC links that can potentially cause congestion within the network. Afterwards, core mapping is optimized to shift network load away from these links, leading to reduced network contention and delay. To identify the highly loaded NoC links that are more prone to congestion, we apply a modified betweenness centrality metric.
In contrast to the traditional betweenness centrality metric, which is generally used to measure the structural characteristics of a system, the adapted edge betweenness centrality metric utilizes the volume of communication traversing the edges (NoC links) to capture and exploit the dynamic behavioral characteristics of the system. For instance, the authors in [17, 18] used betweenness centrality to model the structural behavior of Data Center Networks (DCNs) and showed that legacy structural metrics, such as betweenness centrality, are not sufficient to capture the operational and dynamic behavior of DCNs. Therefore, we utilize a modified betweenness centrality metric to capture the dynamic and operational behavior of NoC based multicore systems. The major contributions of this article are summarized as follows.

• We propose a novel application of the edge betweenness centrality metric to capture the operational and dynamic characteristics of NoC traffic and to identify highly loaded NoC links that can potentially cause congestion.

• We propose a congestion-aware core mapping heuristic that optimizes core mapping so as to alleviate network contention and the resulting delay within the NoC.

• We conduct extensive simulations to analyze the effectiveness of the proposed scheme with NoC sizes ranging from 5×5 to 10×10 mesh. The performance parameters used to evaluate the proposed congestion-aware core mapping heuristic include: (a) channel load, (b) end-to-end latency, (c) NoC energy consumption, and (d) workflow completion time.

The rest of the article is organized as follows. Section 2 provides an overview of prior work closely related to the proposed work. System models and problem formulation are discussed in Section 3. The proposed congestion-aware algorithm is presented in Section 4. Experimental results are presented and discussed in Section 5. Concluding remarks and future directions are provided in Section 6.

2 Related Work

Congestion-aware task and core mapping for NoC based multicore processor systems is a major challenge and has gained significant attention in the recent past; various solutions have been proposed in the scientific literature [4, 10-15, 19-21]. In this section we briefly discuss the proposals that are most closely related to this work.

To reduce network contention, Carvalho et al. presented several congestion-aware task mapping techniques, including Minimum Maximum Channel Load (MMC), Minimum Average Channel Load (MAC), and Path Load (PL) [19]. The authors compared these techniques based on average channel occupation, which reflects congestion within the NoC. Furthermore, they proposed a clustering approach that divides the NoC into multiple clusters for mapping multiple applications. In another work [20], Carvalho and Moraes proposed a task mapping heuristic termed Best Neighbor (BN) that combines the approaches of the Nearest Neighbor (NN) and PL heuristics to improve the execution time of PL. The major drawback of the PL algorithm is that it evaluates all possible paths between source and destination PEs, leading to increased algorithm complexity. To overcome this drawback, the BN heuristic adopts the search mechanism of NN while following the path computation method of PL. However, the authors considered mapping a single task to each PE, and the analysis was conducted on an 8×8 mesh NoC only. Moreover, the number of applications that can be mapped is restricted by the number of virtual clusters, and the authors did not outline a mechanism to handle the scenario where the number of applications exceeds the number of virtual clusters. In [15], Lee et al. proposed a task mapping heuristic to reduce communication energy consumption, together with a specific switch architecture to minimize the congestion caused by concurrent access to shared memory in NoC based MPSoCs. However, the proposed scheme assumes that only a single task can be mapped onto each PE. Moreover, only a single shared memory is considered, and the evaluation was conducted on a 4×4 mesh only. Similarly, to reduce communication overhead, communication energy consumption, and network congestion in NoC based MPSoCs, Singh et al.
proposed a dynamic task mapping heuristic that attempts to map multiple communicating tasks to the same or nearby PEs [14]. However, the proposed technique does not balance the load among the available PEs. Moreover, it does not guarantee global optimization of communication overhead, as only neighboring master and slave nodes are considered for placement on the same PE. The analysis was conducted on an 8×8 mesh NoC with homogeneous PEs. To overcome these drawbacks of [14], Kaushik et al. proposed a pre-processing technique that attempts to reduce communication overhead while evenly balancing the load among the available PEs [13]. The proposed technique improves application execution time, energy consumption, and resource utilization. However, the pre-processing step entails additional computation and results in increased algorithm execution time. Moreover, the analysis was conducted with a limited number of tasks over smaller mesh NoCs. In [10], Chao et al. proposed a congestion-aware task mapping and scheduling algorithm that alleviates congestion in NoC based systems. The algorithm attempts to map communicating tasks to the available processing cores with minimum Manhattan distance while considering the utilization of the NoC links between the communicating tiles. However, the proposed architecture only considers single-task processing cores and could be extended to accommodate multiple tasks per core. Singh et al. proposed two runtime multi-task mapping heuristics that attempt to reduce communication overhead and the resulting congestion within the NoC [21]. The heuristics attempt to map communicating tasks to the same PE or within close proximity of each other inside a specific region. The technique creates virtual clusters within the NoC that are used to map the initial tasks of an application. However, multiple tasks are only supported on hardware resources.
The analysis was conducted on an 8×8 platform only. A dynamic communication-aware task mapping technique for heterogeneous MPSoCs was proposed by Khajekarimi and Hashemi to reduce network congestion [12]. In this technique, the initial mapping is performed using the BN algorithm. Later, based on the communication traffic observed at run-time, the algorithm attempts to map communicating tasks to adjacent PEs considering the current traffic load on the links. However, the analysis was conducted on smaller mesh NoCs with a small number of tasks using only two benchmarks. Moreover, the authors did not consider the migration cost and the impact of moving tasks on the overall performance of the system. Similarly, Wang et al. proposed a dynamic application mapping algorithm that considers NoC contention [4]. The proposed algorithm works in two phases. In the first phase, a rectangular region of the required size is identified for mapping an incoming application. In the second phase, the tasks of the application are allocated to processing cores within the identified region, considering the communication volume of the edges connecting interdependent tasks.

Fig. 1. An example application task graph.

The experimental results show the superiority of the proposed algorithm compared with the FF and NN algorithms. Another dynamic task mapping algorithm, for NoC based many-core processors, was proposed by Fattah et al. with the aim of reducing internal and external congestion within the system [11]. A major drawback of the proposed technique is that the first task selected for mapping is the one with the highest number of edges, while the communication volume of the tasks is not considered in this selection. Furthermore, the order in which the neighbor tasks of the selected first task should be mapped is not addressed. In addition to task mapping, several other techniques have been proposed along different dimensions to alleviate network contention in NoC based systems, including adaptive routing algorithms, NoC architectures, and router/switch architectures [22-24].

3 System Model and Problem Formulation

3.1 Application

An application is generally represented in the form of a Task Graph TG (T, C), as shown in Fig. 1. Each node in TG represents a task ti ∈ T, while edges between nodes represent the volume of data that needs to be exchanged between tasks. The set of all edges is denoted by C, and each edge cj,k ∈ C signifies the communication dependency between tj and tk. The communication volume between tj and tk is captured by the weight wj,k of cj,k. For instance, as shown in Fig. 1, the communication volume between t0 and t1 is 30. In addition to the volume of communication between communicating nodes, we require an additional parameter to specify the rate of communication. Let rj,k represent the rate of communication between tj and tk. The communication rate is used to calculate the network load traversing the NoC links as well as the load on the complete path between any pair of communicating cores.
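As an illustration, the task graph can be held in an ordinary weighted edge map. The following is a minimal Python sketch, not code from the paper; only the weight w0,1 = 30 is taken from Fig. 1, and all other values and names (`task_graph`, `volume`, `rate`) are illustrative.

```python
# Task graph TG(T, C) as a weighted edge map. Only w(t0, t1) = 30 comes
# from Fig. 1; every other value, and all names, are illustrative.
task_graph = {
    # (tj, tk): (volume w_jk, rate r_jk)
    (0, 1): (30, 2.0),
    (0, 2): (10, 1.0),
    (1, 3): (25, 1.5),
}

def volume(j, k):
    """Communication volume w_jk between tasks tj and tk (0 if no edge)."""
    return task_graph.get((j, k), (0, 0.0))[0]

def rate(j, k):
    """Communication rate r_jk between tasks tj and tk (0 if no edge)."""
    return task_graph.get((j, k), (0, 0.0))[1]

print(volume(0, 1))  # -> 30
```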

3.2 NoC Model

In this work we adopt a model similar to the one originally proposed by Schranzhofer et al. [25]. A core graph CG (V, E) is used to describe the 2D N×M mesh based NoC communication infrastructure, where N and M indicate the number of rows and columns, respectively. The core graph CG (V, E) is composed of |V| vertices and |E| edges, where each vertex v ∈ V represents a processing element (PE) and each edge ei,j ∈ E represents a NoC link connecting two PEs i and j. Note that each vertex represents a PE, so from here onwards we use vertex and PE interchangeably; similarly, each edge represents a NoC link, and we use edge and link interchangeably. The NoC model adopted in this work uses deterministic XY routing for packet forwarding. In XY routing, a packet is first directed along the X-axis towards the destination; when the packet reaches the destination column, it is forwarded along the Y-axis towards the destination [10]. We would like to clarify that the core graph is resynthesized after the task mapping phase is complete, because the communication volume among different pairs of PEs can only be determined once the tasks are mapped to their corresponding PEs.
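The XY routing rule described above can be sketched as follows. This is an illustrative Python sketch under the assumption that tiles are addressed by (x, y) grid coordinates; the function name `xy_route` is our own.

```python
def xy_route(src, dst):
    """Return the sequence of (x, y) tiles visited under deterministic
    XY routing: move along the X-axis toward the destination column
    first, then along the Y-axis toward the destination row."""
    (x, y), (xd, yd) = src, dst
    path = [(x, y)]
    step = 1 if xd > x else -1
    while x != xd:                # X phase
        x += step
        path.append((x, y))
    step = 1 if yd > y else -1
    while y != yd:                # Y phase
        y += step
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# -> [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the route is fully determined by the source and destination, every (source, destination) pair loads exactly one sequence of links, which is what makes the per-link load analysis in Section 3.4 tractable.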

3.3 Energy Model

The NoC energy consumption depends on the amount of data shared between pairs of tasks that are mapped on different processing cores. We use the same energy model as in our earlier work on dynamic task mapping for NoC based MPSoCs [26]. The adopted network energy model is based on bit energy consumption and works by calculating the energy consumed by all the data transferred through the NoC. Network energy consumption is proportional to the total number of bits transferred and the number of hops travelled by each bit [27]. The energy consumed to transfer one bit from core i to core j is referred to as BEi,j and is calculated using Eq. 1:

BEi,j = (ηi,j × RE) + ((ηi,j − 1) × LE) + (2 × CE)   (1)

Here RE represents the energy consumed by a NoC router to move one bit from an input port to an output port. LE and CE indicate the energy consumed by a link between any two routers (link energy) and by the links between a router and the sending/receiving PEs (core-to-router links), respectively. The values of RE, LE, and CE are based on the experimental values reported by Ye et al. [28] and are listed in Table 1. For simplicity we assume the same value for LE and CE. Finally, ηi,j represents the number of hops travelled by the bit, also referred to as the minimum Manhattan Distance (MD), from corei to corej. The value of ηi,j is calculated using Eq. 2. Based on Eq. 1 and Eq. 2, the total network energy consumption, represented by NE, is calculated using Eq. 3.

ηi,j = |Xi − Xj| + |Yi − Yj|   (2)

NE = Σ∀i,j∈T: i≠j BEi,j × wi,j   (3)

In Eq. 3, wi,j represents the amount of data transferred between any pair of tasks ti and tj. Note that wi,j is zero if there is no edge between ti and tj, i.e., there is no communication between ti and tj.

Table 1: NoC energy values [28]
Parameter    Bit energy (10^-15 Joule)
RE           431
LE and CE    87
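Eqs. 1-3 can be sketched in Python with the Table 1 values. The helper names and the assumption that cores are addressed by (X, Y) tile coordinates are our own illustrative choices.

```python
# Bit-energy model of Eqs. 1-3 using the Table 1 values.
RE = 431e-15  # router energy per bit, Joules (Table 1)
LE = 87e-15   # inter-router link energy per bit
CE = 87e-15   # core-to-router link energy per bit (assumed equal to LE)

def hops(ci, cj):
    """Minimum Manhattan distance eta_ij between tiles ci and cj (Eq. 2)."""
    return abs(ci[0] - cj[0]) + abs(ci[1] - cj[1])

def bit_energy(ci, cj):
    """Energy to move one bit from core i to core j (Eq. 1)."""
    eta = hops(ci, cj)
    return eta * RE + (eta - 1) * LE + 2 * CE

def network_energy(traffic, placement):
    """Total NoC energy NE (Eq. 3): sum of BE_ij * w_ij over task pairs.
    traffic maps (ti, tj) -> bits w_ij; placement maps task -> (X, Y) tile."""
    return sum(bit_energy(placement[i], placement[j]) * w
               for (i, j), w in traffic.items())

# Two hops: BE = 2*RE + 1*LE + 2*CE = 1123e-15 J per bit.
print(bit_energy((0, 0), (1, 1)))
```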

3.4 Problem Formulation

3.4.1 Bin Packing Problem

In our earlier work, we investigated the problem of jointly optimizing energy consumption and communication overhead through efficient task and core mapping in NoC based multicore systems. We approached this optimization problem by introducing a new class of bin packing problem, the Variable bEnefit Bin pAcking Problem (VEBAP), and transformed VEBAP into a variation of the knapsack optimization problem termed the Variable Benefit Knapsack Problem (VBKP). We term this optimization problem 'variable benefit' because the benefits of packing objects (tasks) into different bins are not fixed but variable. Employing the well-known 0-1 Knapsack only reduces energy consumption, whereas the advantage of the proposed knapsack optimization is that it reduces both energy consumption and network overhead simultaneously. The benefits of the proposed optimization are two-fold. Firstly, the maximum number of tasks is packed onto the available processing elements, leading to better resource utilization and lower energy consumption. Secondly, the communication overhead among interdependent tasks is minimized. To solve VBKP, we developed a dynamic programming based bin packing algorithm. The Knapsack based bin packing algorithm for workload consolidation places tasks in such a manner that the utilization of processing cores is maximized while communication overhead is minimized. Optimized utilization of processing cores leads to a significant reduction in computational energy consumption. Similarly, the decrease in communication overhead not only saves communication energy but also leads to a noteworthy reduction in application execution time.
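For readers unfamiliar with the underlying machinery, the classic 0-1 knapsack dynamic program that VBKP builds on can be sketched as follows. Note that this is the textbook fixed-benefit recurrence, not the variable-benefit formulation of [16]; all names and values are illustrative.

```python
def knapsack(capacity, items):
    """Classic 0/1 knapsack via dynamic programming. items is a list of
    (weight, benefit) pairs; returns the maximum total benefit that fits
    in the given capacity. VBKP extends this idea with placement-dependent
    benefits; this sketch shows only the underlying DP recurrence."""
    best = [0] * (capacity + 1)
    for weight, benefit in items:
        # Iterate capacities downward so each item is used at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + benefit)
    return best[capacity]

print(knapsack(10, [(5, 10), (4, 40), (6, 30), (3, 50)]))  # -> 90
```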

3.4.2 Congestion Aware Core Mapping Problem

To provide a rigorous mathematical formulation of the problem at hand, i.e., congestion aware core mapping, we define the following terminology. Given a core graph CG (V, E), let Ƞ(s, t) represent the total number of shortest paths from source PE s to target PE t. Moreover, let Ƞ(s, t | v) denote the number of shortest paths from s to t that pass through v ∈ V. Similarly, for any edge e ∈ E, let Ƞ(s, t | e) represent the number of shortest paths between s and t that traverse e. Note that Ƞ(s, t | v) = 0 and Ƞ(s, t | e) = 0 if there is no shortest path from s to t that goes through v or e, respectively. Further, Ƞ(s, t | v) = 0 when v ∈ {s, t}.

Definition 1 (vertex betweenness centrality): The betweenness centrality (BC) of a vertex v ∈ V is the sum, over all pairs s, t, of the ratio between the number of shortest paths from s to t that pass through v and the total number of shortest paths from s to t. The betweenness centrality for all vertices v ∈ V of CG (V, E) is defined by Eq. 4:

BCV(v) = Σs,t∈V: v≠s≠t Ƞ(s, t | v) / Ƞ(s, t)   (4)

Definition 2 (edge betweenness centrality): The betweenness centrality for all edges e ∈ E of CG (V, E) is defined by Eq. 5:

BCE(e) = Σs,t∈V: s≠t Ƞ(s, t | e) / Ƞ(s, t)   (5)

Note that the traditional betweenness centrality metric, for both vertices and edges, only accounts for the number of shortest paths going through a given vertex or edge; it does not consider the amount of data or traffic passing through that vertex or edge. Consequently, the standard definition of betweenness centrality captures the structural properties of the underlying network but is unable to depict the behavioral or operational characteristics of the actual network traffic. Therefore, in this work we propose a modified betweenness centrality metric that reflects the operational and dynamic characteristics of the system in terms of the actual amount of network traffic traversing the vertices or edges. To achieve this goal, we incorporate an additional parameter: the amount of data shared between PEs i and j. Let di,j represent the amount of data exchanged between PEs i and j; it is obtained by calculating the amount of data shared between the tasks that are mapped on PEs i and j. Considering the amount of data exchanged between the nodes, the proposed modified betweenness centrality metrics for nodes and edges are calculated using Eq. 6 and Eq. 7, respectively:

BCV(v) = Σs,t∈V: v≠s≠t (Ƞ(s, t | v) × ds,t) / Ƞ(s, t)   (6)

BCE(e) = Σs,t∈V: s≠t (Ƞ(s, t | e) × ds,t) / Ƞ(s, t)   (7)
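Under the deterministic XY routing assumed in Section 3.2, each (s, t) pair has exactly one route, so Ƞ(s, t) = 1 and Eq. 7 reduces to accumulating d_s,t over the links of that route. The following is a minimal Python sketch under that assumption; the function names are ours and the traffic values are illustrative.

```python
from collections import defaultdict

def xy_links(src, dst):
    """Directed links (tile, tile) on the single XY route from src to dst."""
    (x, y), (xd, yd) = src, dst
    links = []
    while x != xd:
        nx = x + (1 if xd > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != yd:
        ny = y + (1 if yd > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def edge_bc(traffic):
    """Modified edge betweenness centrality (Eq. 7). With one route per
    pair, Eta(s, t) = 1, so BCE(e) is simply the total data volume d_st
    of all flows whose route crosses link e."""
    bc = defaultdict(float)
    for (s, t), d in traffic.items():
        for link in xy_links(s, t):
            bc[link] += d
    return dict(bc)

# Two flows share the first hop out of tile (0, 0): 30 + 10 = 40.
bc = edge_bc({((0, 0), (2, 0)): 30.0, ((0, 0), (1, 1)): 10.0})
print(bc[((0, 0), (1, 0))])  # -> 40.0
```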

In this work we are interested in measuring edge betweenness centrality, because the major sources of congestion within the NoC are the edges, which represent NoC links. Congestion occurs when the amount of data traffic passing through a link exceeds the capacity of that link.

Definition 3 (sum of betweenness centrality of a path): Let ρs,t = {e1, e2, …, ez} represent the set of edges that lie on the shortest path between cores s and t. The sum of the betweenness centrality of the edges on the shortest path between any given pair of cores s and t can then be calculated using Eq. 8:

sumBCEs,t = Σ∀e∈ρs,t BCE(e)   (8)

Given the aforementioned models and definitions, the problem is formally stated as: Given a core graph CG (V, E), find a feasible mapping of cores onto the NoC that minimizes (a) congestion within the NoC, by reducing the average channel load, and (b) the average end-to-end latency between communicating tasks.

4 Proposed Heuristic

To address the aforementioned problem, we propose a Congestion-Aware (CA) core mapping algorithm, presented in Table 2. The CA algorithm starts by calculating the edge betweenness centrality for all edges using Eq. 7, considering the communication volume. Next, all edges are sorted by betweenness centrality in descending order. Afterwards, each edge is evaluated in sorted order. For each edge, the algorithm calculates the current network load on the edge by summing all the network traffic that must traverse that edge (line 4). If the current network load is greater than or equal to the bandwidth of the edge (line 6), the edge is a potential candidate for congestion. Consequently, the CA algorithm attempts to optimize the current mapping of cores so that the network load on that edge is reduced. To optimize the mapping of cores, the following steps are repeated until the network load falls below the bandwidth of the edge or all the communicating cores have been evaluated for remapping. First, the algorithm finds the pairs of cores that communicate through the edge being evaluated (line 7). Among all such pairs, the algorithm selects the pair with the highest mutual communication volume (line 9). After selecting this highly communicating pair, the current load on the complete path between the two cores, denoted pLoad, is calculated using Eq. 8 (line 10). Next, among the selected pair, the algorithm identifies the cores with the higher and lower incoming/outgoing communication volume, denoted hCore and lCore, respectively (lines 11 and 12). In the next few steps, the CA algorithm attempts to find a feasible core, denoted mCore, within x hops of hCore. The algorithm iteratively evaluates all x-hop neighbors of hCore, where x varies in the range from 1 to MAX_MD.
MAX_MD is the maximum Manhattan Distance (MD) from hCore to the border of the NoC. Among all the x-hop neighbors of hCore, the feasible mCore is selected as the core with the lowest average incoming/outgoing path load towards hCore. After mCore is selected, the positions of mCore and lCore are swapped, and the network load on the new path between lCore and hCore is recalculated using Eq. 8 (line 16). If the network load on the path between hCore and lCore is reduced compared to the previous path load, the swap is accepted; otherwise the swap is reversed and another feasible mCore is searched for by incrementing the value of x (lines 17-22). The whole process is repeated until the network load on the edge becomes lower than its bandwidth or all the pairs of cores communicating through the edge under consideration have been evaluated. The intuition behind the algorithm is to optimize the core mapping in a manner that alleviates congestion on the edges (NoC links) of the core graph that are most affected by network load. To find and prioritize the evaluation of edges that are more prone to congestion, we adopt a modified betweenness centrality metric. In contrast to the traditional betweenness centrality metric, which only captures a structural property of the system, the proposed metric utilizes the volume of communication traversing the edges to capture the behavioral characteristics of the system. Note that the proposed CA core mapping algorithm is applied on top of a task mapping algorithm, such as the Greedy, Greedy-Swap, Knapsack, or Knapsack-Swap algorithms. The objective of the proposed task mapping algorithms is to jointly optimize computational energy consumption and network load. This is achieved by formulating the problem as a new class of bin packing problems, i.e., the Variable bEnefit Bin pAcking Problem (VEBAP). To solve VEBAP, we developed a Knapsack based bin packing algorithm implemented using a dynamic programming approach. Moreover, we proposed a number of core mapping heuristics that aim to reduce the communication overhead and the resulting energy consumption by placing highly communicating bins close to each other. The proposed core mapping heuristics include the Extended Nearest Neighbor (Ex-NN), Single Cluster (SC), Multiple Cluster (MC), and Bin Swapping (BS) algorithms. For a detailed description of the aforementioned task and core mapping algorithms, readers are referred to [16]. However, in our prior work we did not consider the impact of network contention on application performance. Therefore, in this work we have conducted a congestion analysis of the earlier proposed task and core

mapping algorithms and propose a congestion-aware core mapping algorithm aimed at reducing network contention and end-to-end latency.

Table 2: Congestion-aware core mapping algorithm

Algorithm 1: Congestion-aware core mapping
Inputs: Core graph CG (V, E), a random core mapping
1.  Calculate betweenness centrality of all edges
2.  sEdges := sort edges according to betweenness centrality in decreasing order
3.  For each e in sEdges in sorted order
4.      loade := total network load on e
5.      bandwidthe := bandwidth of e
6.      If loade >= bandwidthe
7.          cSet := set of cores ∀s, t ∈ V : Ƞ(s, t | e) ≠ 0
8.          Do {
9.              corej, corek := pair of cores having highest mutual communication among cSet
10.             pLoad := sumBCEj,k calculated using Eq. 8
11.             hCore := core with higher communication volume among corej and corek
12.             lCore := core with lower communication volume among corej and corek
13.             For x = 1 to MAX_MD
14.                 mCore := core having lowest path load to/from hCore with MD = x
15.                 Swap the locations of mCore and lCore
16.                 nLoad := recalculate sumBCEj,k using Eq. 8
17.                 If nLoad < pLoad Then
18.                     Remove corej and corek from cSet
19.                     break  // exit inner for loop
20.                 Else
21.                     Reverse the swap of lCore and hCore
22.                 End If
23.             End For
24.             Recalculate loade
25.         } While (loade >= bandwidthe AND cSet ≠ ∅)
26.     End If
27. End For
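The core of Algorithm 1 can be rendered as a condensed, illustrative Python sketch under simplifying assumptions: deterministic XY routing, every tile occupied by exactly one core, swap candidates tried in order of Manhattan distance from the heavier endpoint, and a swap kept only if it lowers the pair's path load (cf. lines 13-22). This is a sketch of the idea, not the authors' implementation; all identifiers and the bandwidth value are ours.

```python
from collections import defaultdict

BANDWIDTH = 16.0  # illustrative link capacity in abstract load units

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def xy_links(src, dst):
    """Directed links on the single XY route from tile src to tile dst."""
    (x, y), (xd, yd) = src, dst
    links = []
    while x != xd:
        nx = x + (1 if xd > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != yd:
        ny = y + (1 if yd > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def link_loads(traffic, pos):
    """Load on every link: total traffic of flows whose route crosses it."""
    loads = defaultdict(float)
    for (a, b), d in traffic.items():
        for ln in xy_links(pos[a], pos[b]):
            loads[ln] += d
    return loads

def path_load(loads, s, t):
    """Sum of link loads over the XY path between tiles s and t (cf. Eq. 8)."""
    return sum(loads[ln] for ln in xy_links(s, t))

def ca_remap(traffic, pos):
    """For each overloaded link (heaviest first), take the flows crossing
    it in decreasing volume order and try swapping one endpoint with other
    cores, ordered by Manhattan distance from the opposite endpoint; keep
    a swap only if it lowers the flow's path load."""
    pos = dict(pos)
    loads = link_loads(traffic, pos)
    for link in sorted(loads, key=loads.get, reverse=True):
        if loads[link] < BANDWIDTH:
            break
        crossing = sorted(((d, a, b) for (a, b), d in traffic.items()
                           if link in xy_links(pos[a], pos[b])), reverse=True)
        for d, a, b in crossing:
            before = path_load(loads, pos[a], pos[b])
            for c in sorted((c for c in pos if c not in (a, b)),
                            key=lambda c: manhattan(pos[c], pos[a])):
                pos[b], pos[c] = pos[c], pos[b]           # trial swap
                new_loads = link_loads(traffic, pos)
                if path_load(new_loads, pos[a], pos[b]) < before:
                    loads = new_loads                     # accept the swap
                    break
                pos[b], pos[c] = pos[c], pos[b]           # revert
            if loads[link] < BANDWIDTH:
                break
    return pos

# One heavy flow a->c crosses two links; swapping c next to a relieves one.
print(ca_remap({('a', 'c'): 20.0}, {'a': (0, 0), 'b': (1, 0), 'c': (2, 0)}))
```

The sketch omits the hCore/lCore distinction and the per-distance mCore selection of lines 11-14, collapsing them into a single distance-ordered candidate scan, but it preserves the accept-or-revert swap discipline and the termination conditions of the Do-While loop.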

4.1 Complexity Analysis
The algorithm starts by calculating the betweenness centrality of all edges. For weighted graphs, this takes O(nm + n² log n), where n is the number of cores and m is the number of edges [29]. Next, the edges are sorted, at a cost of O(m log m). Afterwards, the outer for loop of the algorithm is executed m times, and in the worst case the inner while and for loops are executed n and d times, respectively, where d represents the maximum Manhattan distance from any given core to the boundary of the NoC. However, the while loop terminates as soon as the edge load falls below the bandwidth, and a feasible mapping can be found at any distance from 1 to d; therefore, the average-case complexity can be lower than the worst-case complexity. The computational complexity of the CA algorithm is given by Eq. 9:

O(nm + n² log n + m log m + nmd)    (9)
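For intuition about which term of Eq. 9 dominates, a quick numeric check can be run for a 10×10 mesh. The values n = 100 cores, m = 180 bidirectional links, and d = 18 are illustrative assumptions for that topology, not figures taken from the paper.

```python
import math

# Illustrative evaluation of the terms in Eq. 9 for a 10x10 mesh NoC
# (n = 100 cores, m = 180 bidirectional links, d = 18; assumed values).
n, m, d = 100, 180, 18
terms = {
    "n*m":       n * m,
    "n^2 log n": n * n * math.log2(n),
    "m log m":   m * math.log2(m),
    "n*m*d":     n * m * d,
}
dominant = max(terms, key=terms.get)   # the nmd swap-search term dominates
```

Under these assumptions the nmd term (324,000) dwarfs the others, and since both m and d grow with mesh size, the edge count drives the overall cost, consistent with the observation below.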

Eq. 9 indicates that the most influential parameter determining the computational complexity of the CA algorithm is the number of edges. Similarly, the space complexity of the CA algorithm is given by Eq. 10, because we need a list of size n to store data related to the cores and two lists of size m each to store the edges and the betweenness centrality of the edges in sorted order:

O(2m + n)    (10)

In addition, we need a matrix of size N×M to represent the 2D mesh NoC, where N and M indicate the number of rows and columns of the NoC, respectively. A matrix of size T×P is required to represent the task-to-core mapping, where T and P represent the number of tasks and processing cores, respectively.
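A minimal sketch of these bookkeeping structures follows; the sizes and the names `mesh` and `task_map` are illustrative choices, not the paper's.

```python
# Sketch of the data structures described above (illustrative sizes).
# mesh[r][c] holds the id of the core placed at NoC position (r, c);
# task_map[t][p] is 1 iff task t is mapped onto processing core p.
N, M = 3, 3          # mesh rows x columns
T, P = 5, N * M      # tasks and processing cores

mesh = [[r * M + c for c in range(M)] for r in range(N)]   # initial placement
task_map = [[0] * P for _ in range(T)]
task_map[0][4] = 1   # e.g., task 0 runs on core 4 (the centre of a 3x3 mesh)

# once assignment completes, each task row contains exactly one 1
assigned = sum(task_map[0])
```

Swapping two cores in Algorithm 1 then amounts to exchanging two entries of `mesh`, with `task_map` left untouched.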

5 Experimental Evaluation

5.1 Experimental Setup
Performance evaluation of the proposed algorithms has been conducted through extensive simulations using the Heterogeneous Network-on-Chip Simulator (HNOCS) [30]. Table 3 presents the parameter values used for the various NoC components in HNOCS. A large number of simulation scenarios have been generated by varying application parameters such as the number of tasks, the communication-to-computation ratio (CCR), and the number of processing cores. Simulation experiments have been conducted with NoC mesh sizes ranging from 5×5 to 10×10. Moreover, the number of tasks was increased proportionally to the number of cores, varying in the range of 300 to 1,000. The First Fit (FF) [31] algorithm run on top of the Greedy algorithm has been used as the yardstick for performance comparison of the algorithms.

Table 3: HNOCS parameter settings
Parameter        Value
Link Bandwidth   16 Gbps
Link Delay       0.01 ns
Router Delay     2 ns

5.2 Results and Discussions
To evaluate the quality of the proposed algorithm and quantify its benefits, the experimental results are presented and discussed in this section. The performance evaluation is conducted on the basis of the following metrics: (a) channel load, (b) end-to-end latency, (c) NoC energy consumption, and (d) workflow completion time. In the subsequent paragraphs, Greedy-Swap and Knapsack-Swap denote the results of the task swapping algorithm applied on top of the Greedy and Knapsack algorithms, respectively. We selected channel load as a performance metric because the higher the average channel load, the greater the probability of congestion within the network [32]; conversely, a lower channel load indicates a lower probability of congestion. Similarly, congestion in the network increases the average end-to-end latency, and vice versa. In addition to channel load and end-to-end latency, we also analyze the performance of the algorithms on the basis of NoC energy consumption, as well as workflow completion time, which has a significant impact on system throughput and energy consumption. The experimental results are clustered into two groups: (a) generic application graphs and (b) application graphs with a specific communication-to-computation ratio (CCR). The results for each application group are discussed separately in the subsequent paragraphs.

5.2.1 Generic Application Graphs
Fig. 2 presents the average channel load for each of the core mapping algorithms run on top of the different task mapping algorithms. The presented results depict the normalized values of each algorithm averaged over all NoC mesh sizes, i.e., from 5×5 to 10×10. The results have been normalized to the values of the FF algorithm run on top of the Greedy algorithm. The experimental results clearly indicate that the proposed CA algorithm outperforms all the other core mapping algorithms and achieves a lower average channel load in combination with all the task mapping algorithms. In particular, the CA algorithm showed the lowest channel load when run on top of the Knapsack algorithm and a slightly higher channel load with Knapsack-Swap. More specifically, CA with the Knapsack and Knapsack-Swap algorithms reported 46% and 45% lower average channel load, respectively, compared to the baseline FF algorithm run on top of the Greedy algorithm. On average, the CA algorithm exhibited 38% lower channel load than the FF algorithm across all four task mapping algorithms.

[Fig. 2: Normalized channel load]

[Fig. 3: Average end-to-end latency]

[Fig. 4: Normalized NoC energy]

The reason CA achieves a comparatively lower channel load than all the other algorithms is that it attempts to place the highly communicating cores as close as possible while considering the network load on the path between the cores. The rest of the mapping algorithms, i.e., NN-Ex, SC, MC, and BS, only attempt to reduce communication overhead and the resulting NoC energy consumption by placing highly communicating cores as near as possible; they do not consider the network load induced on the overall path between communicating cores by placing highly communicating cores in close proximity. Similarly, the CA algorithm consistently achieved lower average end-to-end latency than the other mapping algorithms, as shown in Fig. 3. The CA algorithm reduced end-to-end latency by up to 12%, with an average of 10%, compared to the FF algorithm. Moreover, the CA algorithm reported the lowest end-to-end delay when run on top of the Greedy-Swap algorithm, with Greedy and Knapsack-Swap closely following at approximately 1% and 2% higher latency, respectively, whereas CA reported comparatively higher latency when run on top of the Knapsack algorithm. Since there is no direct relationship between channel load and end-to-end latency, the proportions of reduction in channel load and average latency cannot be the same, which is also evident from the comparison of the results depicted in Fig. 2 and Fig. 3. We also conducted experiments to analyze the impact of the proposed algorithm on NoC energy consumption. Fig. 4 presents the comparison of network energy consumption. The experimental results reveal that the CA algorithm exhibited the lowest energy consumption of all the mapping algorithms in all cases. On average, the CA algorithm achieved 32% lower network energy consumption than the FF algorithm.
More specifically, CA coupled with the Knapsack and Knapsack-Swap algorithms achieved up to 47% and 50% less energy consumption, respectively, compared to the baseline FF run on top of the Greedy algorithm. The next best results are achieved by the BS and SC mapping algorithms, with average gains of 25% and 23% over FF, respectively.

[Fig. 5: Normalized channel load over mesh sizes. (a) Greedy, (b) Greedy-Swap, (c) Knapsack, and (d) Knapsack-Swap]

[Fig. 6: Normalized end-to-end latency over mesh sizes. (a) Greedy, (b) Greedy-Swap, (c) Knapsack, and (d) Knapsack-Swap]

[Fig. 7: Normalized NoC energy over mesh sizes. (a) Greedy, (b) Greedy-Swap, (c) Knapsack, and (d) Knapsack-Swap]

Experiments were also conducted to observe the behavior of the proposed algorithms over varying mesh NoC sizes. Fig. 5 presents the comparison of channel load over varying mesh sizes for all combinations of task and core mapping algorithms. Experimental results indicate that all the mapping algorithms, including the proposed CA algorithm, exhibit consistent channel load over varying mesh sizes with only a few exceptions. For instance, the NN algorithm shows comparatively higher channel load at the 8×8 mesh size with both the Greedy and Greedy-Swap algorithms. Moreover, it is interesting to note that the majority of the better performing mapping algorithms, such as SC, MC, BS, and CA, achieve their lowest channel load on the 9×9 NoC in almost all cases. On the other extreme, the same algorithms exhibit a noteworthy increase in channel load at the 10×10 mesh size. In conclusion, the majority of the algorithms behave consistently over varying mesh NoC sizes in terms of average channel load. A similar analysis was conducted for the end-to-end latency of the core mapping algorithms over varying mesh NoC sizes, as depicted in Fig. 6. In contrast to the channel load results, the end-to-end latency of the mapping algorithms does not remain consistent over varying mesh sizes; however, the proposed CA algorithm exhibits lower variation than the rest of the mapping algorithms. Experimental results reveal that all core mapping algorithms, including CA, show their highest end-to-end latency at the 9×9 mesh size for all the task mapping algorithms, with only a few exceptions. Moreover, it is interesting to note that there is a sharp decrease in latency at 10×10 compared to the 9×9 mesh size. Fig. 7 presents the comparison of NoC energy consumption of the mapping algorithms over varying mesh sizes.
The overall trend is that all core mapping algorithms, including the proposed CA algorithm, exhibit consistent energy consumption with few exceptions. For instance, the proposed CA algorithm shows comparatively higher NoC energy consumption at the 7×7 mesh size with the Greedy-Swap and Knapsack algorithms. Even so, the CA algorithm still manages to achieve lower NoC energy consumption than the other core mapping algorithms; the only exception is the case of the 7×7 mesh with Greedy-Swap, where CA has slightly higher energy consumption, i.e., less than 1% and 3% higher than SC and BS, respectively. Apart from CA, the next best results are achieved by BS, with MC and SC closely following with slightly higher NoC energy consumption in the majority of cases.

[Fig. 8: Normalized channel load. (a) CCR=0.5, (b) CCR=1, (c) CCR=2]

5.2.2 Application Graphs with Specific CCR
We also conducted experiments to evaluate the performance of the proposed algorithms over application graphs with specific CCR values. The reason for using both generic application graphs and graphs with specific CCR values is to analyze the performance of the proposed algorithms over a varied range of application graphs: since it is hard to predict what types of applications will be executed on the system in the future, we experimented with generic random graphs as well as graphs with specific CCR values to cover a wide range of application scenarios. Fig. 8 presents the normalized channel load over application graphs with three different CCR values. In general, the proposed CA algorithm exhibited the lowest average channel load among the mapping algorithms. In particular, the CA algorithm exhibited 33%, 20%, and 13% lower channel load than the baseline FF mapping, averaged over all task packing algorithms, with CCR 0.5, 1, and 2, respectively. The experimental results clearly show that CA achieved the lowest channel load when run on top of the Knapsack algorithm in all cases, i.e., with CCR 0.5, 1, and 2. The second best results are achieved by CA run in conjunction with Knapsack-Swap. This is due to the fact that Knapsack-Swap attempts to reduce the communication overhead by mapping highly communicating tasks to the same core; however, this may overload certain NoC links that lie on the paths between highly communicating cores. It can be noticed that, as the CCR increases, all mapping algorithms exhibit consistent results except the SC algorithm. The SC algorithm performs better than the rest of the mapping algorithms at CCR 0.5, but its performance degrades for higher communication loads, i.e., CCR 1 and 2. The reason for this behavior is that SC attempts to cluster the highly communicating cores in a single cluster, which performs better for lower communication loads. However, when the communication load increases, the single-cluster approach no longer works, because clustering highly communicating cores can lead to congestion of the links at the center of the NoC.

[Fig. 9: Average end-to-end latency (ms, thousands). (a) CCR=0.5, (b) CCR=1, (c) CCR=2]

[Fig. 10: NoC energy consumption (mJ, thousands). (a) CCR=0.5, (b) CCR=1, (c) CCR=2]

Fig. 9 shows the average end-to-end (E2E) latency over varying communication loads. Average E2E latency is calculated by averaging the end-to-end latency incurred by each flit transferred between any two communicating cores in each simulation scenario. Experimental results show that, on average, CA exhibits lower E2E latency than all the other mapping algorithms in all cases. In particular, CA achieved up to 30% lower E2E latency compared with the baseline FF mapping algorithm run on top of Greedy. CA exhibited the least latency when run on top of the Knapsack algorithm over all CCR scenarios, with Knapsack-Swap closely following. The second best results are achieved by the BS algorithm, which reported slightly higher E2E latency than CA; more specifically, BS reported on average 2% higher latency than the CA algorithm. Interestingly, after CA and BS, the next best results are exhibited by NN-Ex, which outperformed the FF, SC, and MC algorithms. The reason behind CA's lower E2E latency is that CA attempts to map the cores in such a manner that the NoC links are not overloaded. Consequently, the communication load is distributed across the NoC, leading to lower wait times for the packets (flits) traversing the network. Similarly, BS is a probabilistic technique that randomly places the cores to reduce the communication overhead; since BS does not cluster the highly communicating cores in close proximity, it does not overload the NoC links, leading to reduced average E2E latency. Moreover, there is a general trend that with increasing CCR the average E2E latency of all the algorithms also increases. This is mainly because the increased network load leads to increased contention for the same NoC resources, including routers, buffers, and links; consequently, the average E2E latency increases for the traffic traversing the NoC. Fig. 10 presents the results of the energy consumed by NoC components while forwarding data among the communicating cores.
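A hedged sketch of such a hop-based communication-energy estimate, in the spirit of the bit-energy models the paper draws on, is given below; the per-hop energies E_LINK and E_ROUTER are hypothetical values, not the paper's.

```python
# Sketch (assumed model, illustrative constants): communication energy as a
# function of flits forwarded and hops travelled, the two parameters the text
# identifies as determining NoC energy consumption.
E_LINK, E_ROUTER = 0.5, 1.0   # pJ per flit per link / per router traversal

def comm_energy(flows):
    """flows: iterable of (flit_count, hop_count) pairs. A path of h hops
    crosses h links and h + 1 routers."""
    return sum(f * (h * E_LINK + (h + 1) * E_ROUTER) for f, h in flows)

# 100 flits over 2 hops plus 50 flits over 1 hop:
# 100*(2*0.5 + 3*1.0) + 50*(1*0.5 + 2*1.0) = 400 + 125 = 525 pJ
total = comm_energy([(100, 2), (50, 1)])
```

Under this model, halving the average hop count of the heaviest flows, which is what distance-minimizing mappings such as BS pursue, directly reduces the energy total, at the risk of concentrating load on central links.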
The parameters that determine NoC energy consumption are the amount of data (flits) forwarded and the number of hops travelled by the flits. The experimental results show that CA coupled with Knapsack-Swap achieved lower energy consumption than the rest of the algorithms with CCR 0.5 and 1. In the case of CCR 2, the BS mapping algorithm run on top of Knapsack-Swap exhibited the lowest communication energy consumption. The reason for this behavior is that the Knapsack-Swap algorithm places the majority of the highly communicating tasks onto the same core, thereby significantly reducing the communication overhead, whereas the primary objective of the CA algorithm is to place the cores in such a manner that NoC links are not overloaded, so as to avoid network contention. Consequently, the CA algorithm may place some of the communicating cores at distant NoC positions, which increases communication energy consumption. On the other extreme, the BS algorithm attempts to greedily place the cores such that the Manhattan distance (number of hops) between communicating cores is reduced, leading to reduced energy consumption. This is particularly evident in the case of the Knapsack algorithm, where BS outperforms the proposed CA algorithm in terms of communication energy consumption; specifically, for higher communication scenarios, i.e., with CCR 1 and 2, BS achieved lower energy consumption than CA when both were applied on top of the Knapsack algorithm. Fig. 11 presents the results for workflow completion time against various communication loads. The CA algorithm outperforms all the other mapping algorithms in all communication scenarios. In particular, the CA algorithm coupled with Knapsack exhibited the lowest application completion time. Specifically, CA achieved on average 11%, 14%, and 8% lower workflow completion time than the baseline FF algorithm at CCR 0.5, 1, and 2, respectively. The next best results are exhibited by the CA algorithm run on top of Knapsack-Swap.
On average, Knapsack-Swap showed 5% higher completion time when run in conjunction with CA. Among the other core mapping algorithms, BS exhibited lower workflow completion time than the rest, except CA; in particular, BS showed slightly higher workflow completion time than CA (less than 1% higher on average) in all communication scenarios. Experimental results reveal that the bin packing algorithms exhibited consistent performance against varying communication loads, except the Greedy-Swap algorithm. The Greedy-Swap algorithm exhibited a significantly lower gain in average workflow completion time over the Greedy algorithm at CCR 2. More specifically, the average difference between the workflow completion times of Greedy and Greedy-Swap is 9% and 12% at CCR 0.5 and 1, respectively, whereas in the case of CCR 2 the average difference falls below 1%. The reason for this behavior is that in Greedy-Swap the task swapping algorithm is applied on top of the Greedy task mapping algorithm. When the communication load is comparatively low, i.e., with CCR 0.5 and 1, the swapping algorithm achieves a significant reduction in communication overhead, which in turn reduces the application execution time.

[Fig. 11: Workflow completion time (sec). (a) CCR=0.5, (b) CCR=1, (c) CCR=2]

On the other extreme, for higher communication loads (CCR 2), the swapping algorithm run on top of the Greedy algorithm could not achieve a noteworthy reduction in communication overhead, leading to a comparatively lower gain in terms of reduced workflow completion time.

6 Conclusions and Future Work
We proposed a congestion-aware core mapping algorithm aimed at reducing network contention and end-to-end latency. The proposed algorithm utilizes a modified betweenness centrality metric to identify the NoC links that are most prone to network congestion, and optimizes the core mapping so as to offload network traffic from those contention-prone links. Experimental results from extensive simulations showed the effectiveness of the proposed algorithm compared to other core mapping algorithms. Moreover, it was observed that the modified betweenness centrality metric is able to capture the dynamic behavior of the network traffic traversing the NoC and effectively detect overloaded links. In our future work we aim to explore the impact of component failures, including faulty cores and NoC link failures, on various performance parameters, such as throughput, end-to-end latency, and energy consumption. Moreover, the development and evaluation of hybrid NoC architectures, such as optical/wired and wired/wireless NoCs, is another interesting dimension to explore. Similarly, significant research is in progress on 3D stacked mesh NoCs; extending the proposed heuristic to avoid congestion within 3D NoCs is another interesting area to explore.

References
1. Georgiadis, A.-L., S. Xydis, and D. Soudris. Deploying and monitoring hadoop MapReduce analytics on single-chip cloud computer. in Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms. 2016. ACM.
2. Shafique, M. and J. Henkel. Agent-based distributed power management for kilo-core processors. in Proceedings of the International Conference on Computer-Aided Design. 2013. IEEE Press.
3. Xu, J., et al. Towards high-speed real-time HTTP traffic analysis on the Tilera many-core platform. in High Performance Computing and Communications & Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on. 2013. IEEE.
4. Wang, C., et al. A dynamic contention-aware application allocation algorithm for many-core processor. in High Performance Computing and Communications (HPCC/CSS/ICESS), 2015 IEEE 17th International Conference on. 2015. IEEE.
5. Yu, X., et al., Staring into the abyss: An evaluation of concurrency control with one thousand cores. Proceedings of the VLDB Endowment, 2014. 8(3): p. 209-220.
6. Johnson, D.R., et al., Rigel: A 1,024-core single-chip accelerator architecture. IEEE Micro, 2011. 31(4): p. 30-41.
7. Pang, K., et al., Task mapping and mesh topology exploration for an FPGA-based network on chip. Microprocessors and Microsystems, 2015. 39(3): p. 189-199.
8. Han, J.-J., et al., Contention-aware energy management scheme for NoC-based multicore real-time systems. IEEE Transactions on Parallel and Distributed Systems, 2015. 26(3): p. 691-701.
9. Kim, G., et al., An augmented reality processor with a congestion-aware Network-on-Chip scheduler. IEEE Micro, 2014. 34(6): p. 31-41.
10. Chao, H.-L., et al. Congestion-aware scheduling for NoC-based reconfigurable systems. in Proceedings of the Conference on Design, Automation and Test in Europe. 2012. EDA Consortium.
11. Fattah, M., et al. CoNA: Dynamic application mapping for congestion reduction in many-core systems. in Computer Design (ICCD), 2012 IEEE 30th International Conference on. 2012. IEEE.
12. Khajekarimi, E. and M.R. Hashemi. Communication and congestion aware run-time task mapping on heterogeneous MPSoCs. in The 16th CSI International Symposium on Computer Architecture and Digital Systems (CADS 2012). 2012. IEEE.
13. Kaushik, S., A.K. Singh, and T. Srikanthan. Computation and communication aware run-time mapping for NoC-based MPSoC platforms. in SOC Conference (SOCC), 2011 IEEE International. 2011. IEEE.
14. Singh, A.K., et al., Run-time mapping of multiple communicating tasks on MPSoC platforms. Procedia Computer Science, 2010. 1(1): p. 1019-1026.
15. Lee, S.-H., Y.-C. Yoon, and S.-Y. Hwang, Communication-aware task assignment algorithm for MPSoC using shared memory. Journal of Systems Architecture, 2010. 56(7): p. 233-241.
16. Maqsood, T., et al., Energy and communication aware task mapping for MPSoCs. Journal of Parallel and Distributed Computing (JPDC), 2016. Submitted.
17. Bilal, K., et al., On the characterization of the structural robustness of data center networks. IEEE Transactions on Cloud Computing, 2013. 1(1): p. 1-1.
18. Manzano, M., et al., On the connectivity of data center networks. IEEE Communications Letters, 2013. 17(11): p. 2172-2175.
19. Carvalho, E., N. Calazans, and F. Moraes. Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs. in 18th IEEE/IFIP International Workshop on Rapid System Prototyping (RSP'07). 2007. IEEE.
20. Carvalho, E. and F. Moraes. Congestion-aware task mapping in heterogeneous MPSoCs. in System-on-Chip (SOC 2008), International Symposium on. 2008. IEEE.
21. Singh, A.K., et al. Mapping algorithms for NoC-based heterogeneous MPSoC platforms. in Digital System Design, Architectures, Methods and Tools (DSD'09), 12th Euromicro Conference on. 2009. IEEE.
22. Wang, C., et al., Area and power-efficient innovative congestion-aware Network-on-Chip architecture. Journal of Systems Architecture, 2011. 57(1): p. 24-38.
23. Daneshtalab, M., et al., A systematic reordering mechanism for on-chip networks using efficient congestion-aware method. Journal of Systems Architecture, 2013. 59(4): p. 213-222.
24. Wang, C., W.-H. Hu, and N. Bagherzadeh, A load-balanced congestion-aware wireless network-on-chip design for multi-core platforms. Microprocessors and Microsystems, 2012. 36(7): p. 555-570.
25. Schranzhofer, A., J.-J. Chen, and L. Thiele, Dynamic power-aware mapping of applications onto heterogeneous MPSoC platforms. IEEE Transactions on Industrial Informatics, 2010. 6(4): p. 692-707.
26. Maqsood, T., et al., Dynamic task mapping for Network-on-Chip based systems. Journal of Systems Architecture, 2015. 61(7): p. 293-306.
27. Atitallah, R.B., et al. Estimating energy consumption for an MPSoC architectural exploration. in International Conference on Architecture of Computing Systems. 2006. Springer.
28. Ye, T.T., G.D. Micheli, and L. Benini. Analysis of power consumption on switch fabrics in network routers. in Proceedings of the 39th Annual Design Automation Conference. 2002. ACM.
29. Brandes, U., A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 2001. 25(2): p. 163-177.
30. Ben-Itzhak, Y., et al. HNOCS: Modular open-source simulator for heterogeneous NoCs. in Embedded Computer Systems (SAMOS), 2012 International Conference on. 2012. IEEE.
31. Xie, B., et al., An energy-aware online task mapping algorithm in NoC-based system. The Journal of Supercomputing, 2013. 64(3): p. 1021-1037.
32. de Souza Carvalho, E.L., N.L.V. Calazans, and F.G. Moraes, Dynamic task mapping for MPSoCs. IEEE Design & Test of Computers, 2010. 27(5): p. 26-35.

Biographies

Tahir Maqsood received his B.S. in Computer Science from PIMSAT, Pakistan, in 2001, Master in Computer Science from KASBIT, Pakistan, in 2004, and M.S. in Computer Networks from Northumbria University, UK, in 2007. He is currently a Ph.D. candidate in the Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan. His research interests include task and core mapping, energy efficient systems, and network performance evaluation.

Kashif Bilal is an Assistant Professor at COMSATS Institute of Information Technology, Abbottabad, Pakistan. He received his PhD in Electrical and Computer Engineering from North Dakota State University, Fargo, USA in 2014. He also received College of Engineering (CoE) Graduate Student Researcher of the year 2014 award at NDSU. His research interests include energy efficient high speed networks, Green Computing, and robustness in data centers. Currently, he is focusing on exploration of network traffic patterns in real data centers and development of data center network workload generator.

Sajjad A. Madani works at COMSATS Institute of Information Technology as an Associate Professor. He received his M.S. in Computer Science from Lahore University of Management Sciences and his Ph.D. from Vienna University of Technology. His areas of interest include low-power wireless sensor networks and green computing. He has published more than 40 papers in peer-reviewed international conferences and journals.

Major contributions of this article are summarized as follows.

• We propose a novel application of the edge betweenness centrality metric to capture the operational and dynamic characteristics of NoC traffic and to identify highly loaded NoC links that can potentially cause congestion.

• In contrast to the traditional betweenness centrality metric, which is generally used to measure the structural characteristics of a system, the modified edge betweenness centrality metric utilizes the volume of network traffic traversing the edges (NoC links) to capture and exploit the dynamic and operational behavior of the system.

• We propose a congestion-aware core mapping heuristic that optimizes the core mapping so as to alleviate network contention and the resulting delay within the NoC.

• Extensive simulations are conducted to analyze the effectiveness of the proposed scheme with NoC sizes ranging from 5×5 to 10×10 mesh NoCs. The performance parameters used to evaluate the proposed congestion-aware core mapping heuristic include: (a) channel load, (b) end-to-end latency, (c) NoC energy consumption, and (d) workflow completion time.