Dynamic task mapping for Network-on-Chip based systems

Dynamic task mapping for Network-on-Chip based systems

Accepted Manuscript Dynamic Task Mapping for Network-on-Chip based Systems Tahir Maqsood, Sabeen Ali, Saif U.R. Malik, Sajjad A. Madani PII: DOI: Refe...

1MB Sizes 0 Downloads 2 Views

Recommend Documents

No documents
Accepted Manuscript Dynamic Task Mapping for Network-on-Chip based Systems Tahir Maqsood, Sabeen Ali, Saif U.R. Malik, Sajjad A. Madani PII: DOI: Reference:

S1383-7621(15)00058-2 http://dx.doi.org/10.1016/j.sysarc.2015.06.001 SYSARC 1282

To appear in:

Journal of Systems Architecture

Received Date: Revised Date: Accepted Date:

5 February 2015 29 May 2015 7 June 2015

Please cite this article as: T. Maqsood, S. Ali, S.U.R. Malik, S.A. Madani, Dynamic Task Mapping for Networkon-Chip based Systems, Journal of Systems Architecture (2015), doi: http://dx.doi.org/10.1016/j.sysarc.2015.06.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Dynamic Task Mapping for Network-on-Chip based Systems Tahir Maqsood *, Sabeen Ali, Saif U. R. Malik, and Sajjad A. Madani Department of Computer Science, COMSATS Institute of Information Technology, Pakistan

Abstract—Efficiency of Network-on-Chip (NoC) based multi-processor systems largely depends on optimal placement of tasks onto processing elements (PEs). Although number of task mapping heuristics have been proposed in literature, selecting best technique for a given environment remains a challenging problem. Keeping in view the fact that comparisons in original study of each heuristic may have been conducted using different assumptions, environment, and models. In this study, we have conducted a detailed quantitative analysis of selected dynamic task mapping heuristics under same set of assumptions, similar environment, and system models. Comparisons are conducted with varying network load, number of tasks, and network size for constantly running applications. Moreover, we propose an extension to communication-aware packing based nearest neighbor (CPNN) algorithm that attempts to reduce communication overhead among the interdependent tasks. Furthermore, we have conducted formal verification and modeling of proposed technique using high level Petri nets. The experimental results indicate that proposed mapping algorithm reduces communication cost, average hop count, and end-to-end latency as compared to CPNN especially for large mesh NoCs. Moreover, proposed scheme achieves up to 6% energy savings for smaller mesh NoCs. Further, results of formal modeling indicate that proposed model is workable and operates according to specifications. Keywords: Dynamic Task Mapping; Network-on-Chip (NoC); Multiprocessor System-on-Chip (MPSoC)

1. Introduction Advancements in silicon technology and developments in the nanotechnology have enabled the integration of a complete system onto a single chip, referred as System-on-Chip (SoC). The SoC helps in integrating maximum amount of technology in the smallest possible space by placing several processing elements (PEs) on a single chip leading to the development of Multiprocessor Systemon-Chip (MPSoC) [1]. The PEs of MPSoC can be homogenous or heterogeneous each performing varying tasks and possibly operates at different frequencies. MPSoCs are being applied in complex embedded applications because the growing and complex computational requirements of future embedded system applications cannot be fulfilled by a single general purpose processor. Therefore, MPSoCs have been proposed to meet the growing performance requirements of future embedded systems [2]. Future MPSoCs are anticipated to be composed of dozens or even hundreds of processing cores. Integrating large number of cores on a single chip requires a robust and scalable interconnection architecture. Network-on-Chip (NoC) has been proposed as a modular and scalable communication interconnect to support communication among the PEs of MPSoCs [3]. The research issues and challenges for NoC based systems can be classified into four major categories that are [4]: • Communication Infrastructure: Research based on the design and development of efficient communication infrastructure that includes: selection of suitable network topology, router architecture, optimization of buffers, link design, clocking, floor planning, and layout design. • Communication Paradigm: Efforts to find best possible solutions for routing policies, switching techniques, congestion control, power and thermal management, fault tolerance, and reliability. • Evaluation Framework: Development of tools and techniques to get better understanding of achievable throughput, latency, and bandwidth of the network. • Application Mapping: Efforts to find suitable placement of set of tasks of an application onto the PEs of NoC in such a way that optimizes the one or more parameters of interest. First three categories are extensively surveyed but the fourth category, i.e., application mapping is not much dealt with [4]. In application mapping, tasks of the application are mapped onto the suitable PEs depending upon various optimization criteria, such as energy consumption, network overhead, congestion, and latency. Application or task mapping is one of the critical factors that affect the performance and efficiency of NoC based MPSoCs. There are few surveys and comparative studies conducted to evaluate the task mapping algorithms proposed for NoCs [4, 5, 6]. However, these studies either restricted the analysis to smaller mesh sizes or have conducted qualitative comparison of presented algorithms. For instance, in [6] authors have conducted the comparison of static and dynamic task mapping heuristics for 5×5 and 9×9 NoCs. Similarly, the authors in [5] have also presented the analysis of selected dynamic task mapping heuristics over 8×8 mesh NoCs. The authors of [4] have presented a qualitative survey of static and dynamic task mapping techniques for MPSoCs. ————————————————————————————————

* Corresponding author. Tel.: + 92 333 3122065, email: [email protected] Email addresses: [email protected] (T. Maqsood), [email protected] (S. Ali), [email protected] (S. U. R. Malik), [email protected] (S. A. Madani)

In this study, a detailed quantitative analysis of some of the selected dynamic task mapping heuristics is provided considering a wide range of NoC mesh sizes with varying task loads. The experimental results of comparative analysis revealed that communication-aware packing based nearest neighbor (CPNN) algorithm achieves better performance as compared to the other algorithms in terms of reduced energy consumption. However, CPNN exhibits higher communication cost, especially for large mesh NoCs. Moreover, mapping produced by CPNN algorithm results in unbalanced load distribution among the PEs leading to increased energy consumption. Therefore, in this work, we also propose an extension of CPNN algorithm that attempts to reduce communication overhead by migrating communicating tasks to the same PE for constantly executing applications. Major contributions of this article are summarized as follows. • A detailed quantitative analysis of the selected dynamic task mapping heuristics is provided under same environment, using same assumptions, and system models. • An extension to CPNN heuristic is proposed. The proposed heuristic aims to reduce the communication cost and energy consumption by migrating communicating tasks that are already mapped using CPNN from lightly loaded PEs to the other PEs that can accommodate those tasks. • Formal verification and modelling of the proposed technique is provided using high level Petri nets, satisfiability modulo theories, and Z3 solver. The remaining of the paper is organized as follows. Background and some preliminary concepts are presented in Section 2. Prior research work related with our work is reviewed in Section 3. Section 4 provides the details about models used for conducting this study. Section 5 presents the working of algorithms selected for analysis along with the proposed algorithm. Formal verification and modeling is discussed in Section 6. In Section 7, experimental setup and results are presented. Finally, Section 8 concludes the paper with some directions for future work. 2. Preliminaries The future MPSoCs can consist of hundreds or even thousands of processing elements. Therefore, there is need for scalable and robust communication infrastructure to interconnect the large number of processing elements. Traditional SoC technology uses a shared bus interconnection to connect limited number of PEs. However, as the number of PEs increases, the shared bus architecture is unable to scale with the increase in number of PEs [7, 8]. In order to overcome the limitations of shared bus architecture, NoC has been proposed as a modular and scalable alternative [9, 10]. The processing elements within the NoC can be homogeneous or heterogeneous. The processors belonging to same family make the MPSoC platform homogeneous and those belonging from different families make it heterogeneous [11, 12]. In 2007 and 2009, homogeneous MPSoCs were proposed by Intel [13] and Tilera [14] with 80 and 100 processing elements connected through NoC, respectively. Heterogeneous MPSoCs were proposed by IBM, Sony, and Toshiba, that consisted of one manager processor with 8 floating point units [15]. The resources that can be shared by the various PEs in NoC include: memory, input/output, and links [11, 12]. Network-on-Chip may comprise of computing resources, programmable logic blocks, distributed storage resources, programmable I/O, and a switching fabric that connects all the resources. The computing resources can be processor cores or field programmable logic blocks. The switching fabric acts as a communication backbone for connecting computing resources of NoC [16]. Major components of a NoC based MPSoC are depicted in Fig. 1 [18, 19]: • Processing elements (PEs): Can be a general-purpose processor, digital signal processor (DSP), or embedded memory. • Network interface: Interface that connects processing elements (PE) to NoC through routers. • Routers: Nodes for routing data and control flits based on specified routing strategy. • Router-to-router links: Include logical and physical channels that are used for transferring flits between routers.

Fig. 1. NoC Components [17]

3. Related Work To utilize the NoC resources effectively, efficient task mapping is required to place tasks to the suitable processing elements. Task mapping in NoC refers to the process of finding suitable placement for the tasks of an application to the processing elements while considering certain optimization criteria, such as reduction in energy consumption, total execution time, packet latency, congestion, and communication overhead [1]. Task mapping decisions directly reflect on the overall system performance in terms of the aforementioned parameters. Mapping applications onto the NoC architecture is NP-hard [18, 20, 21]. Therefore, various heuristics based solutions have been proposed for task mapping problem. A significant amount of work has been done and is still under process to find efficient task mapping for NoC. In the subsequent paragraphs we briefly mention some of the selected task mapping techniques proposed for NoC based MPSoCs. Moreover, a brief comparison of the discussed techniques is presented in Table 1. Kumar et al. [2] proposed a communication-aware run-time heuristic for mapping large number of applications to NoC based MPSoC platform. The model examines the available resources and maps the communicating tasks on same or nearest PE in left, down, top and right order. Tasks are mapped by considering the communication with the previously mapped tasks on the target PE. Furthermore, the model assigns high priority to the tasks of applications that are near to each other in order to minimize the communication overhead. The major drawback of this technique is that, since it tries to map only the communicating tasks to the same PE, there are chances that some leaf tasks may occupy the PE completely. Such PEs are not shared with any other tasks resulting in under-utilization of certain PEs that may lead to increased energy consumption. Xie et al. [22] proposed an energyaware online task mapping algorithm for NoC. The target of the algorithm was to cut down communication energy consumption. The first step of the algorithm is to analyze the communication status of applications at runtime and to select the task having highest volume of communication. The selected task is mapped first. In the following steps, the neighboring tasks of the selected task are sorted by the volume of communication and mapped to the nearest PEs. As a result of the mapping done by this algorithm, the tasks with heavy volumes of communications are placed closer to each other but lightly communicating tasks may be neglected and placed far apart. Chen-Ling Chou and Radu Marculescu [23] presented a run-time task allocation algorithm while considering user’s behavior. The influence of user’s behavior enables the system to react in a more timely manner to the real-time changes. It also allows the system to adapt dynamically to the required user needs. The user behavior is determined by the sequence of events that a user performs during interaction with the system. The algorithm introduces additional computational overhead as it requires machine learning techniques to process the user behavior. Chen-Ling Chou and Radu Marculescu [24] proposed another run-time task allocation strategy depending upon the users. The user’s behavior is specially considered in resource allocation resulting in a better response to real-time changes. User behavior can be keyed out as a set of overlapping and consecutive events for the time a user interacts with the system or network. The behavior may vary from user to user, therefore, the scheme should be intelligent enough to process the user behavior that requires additional computational requirements. Mehran et al. [25] suggested a dynamic spiral mapping algorithm. The algorithm maps tasks in a spiral manner, starting from the center to the boundary of NoC architecture. The communicating tasks are placed closer to each other. The communication time is likewise shortened by reducing dynamic mapping time, reconfiguration time, and task migration time. Although, the technique maps the task in spiral fashion on-to the NoC architecture but, there are possibilities that the PEs that may provide a better mapping can be ignored during the spiral mapping. A run-time agent based mapping technique was presented by Al Faruque et al. [26]. An agent is a small task that is capable of being executed on any node of NoC. The functionality of agents is such that they can perform resource management, store resources’ state information and negotiate with each other to find a best suitable processing element for a specified task. Agents are of two types: 1) Global agents (GA) and 2) Cluster agents (CA). Cluster agents, have the knowledge about their own clusters and negotiate with global agents that have global information related to all clusters, upon receiving a task. The negotiations help in providing efficient mappings. The proposed technique reduces the overall traffic needed to determine the current state of the network but the computational over head is increased because of the negotiations among agents. For NoC based heterogeneous MPSoCs, Carvalho et al. [27, 28] proposed dynamic task mapping heuristics. The performance of mapping heuristic is observed for dynamic workloads. Initial task mapping phase is accompanied by a dynamic mapping phase. The presented model targets the minimization of congestion and optimized NoC performance. The tasks are mapped at run-time while considering the communication requests and load on the links. This heuristic is applicable exclusively for single-task processors. It can be extended to multi-task processors as well. To minimize the energy consumption during task allocation and scheduling, Huang et al. [29] proposed an energy-aware task allocation algorithm for network-on-chip based heterogeneous multiprocessor systems. An extension of Integer Linear Programming (ILP) was introduced to consider both processing and communication energy and to formulate a global optimal mapping. However, it involves large execution time. To overcome the limitation of ILP based approach, the authors also proposed a Simulated Annealing with Timing Adjustment (SA-TA) scheme that attempted to accelerate the optimization process and achieved performance very close to the global optimum. The major limitation of the proposed model is that it considers a simplified energy model where no leakage power is considered. Chou and Marculescu [30] proposed a contention-aware application mapping. This technique uses integer linear programming (ILP) to minimize the inter-tile network contention. The proposed scheme decreases the packet latency but the communication energy overhead is slightly increased. The scope of this technique was limited to 2-D mesh NoCs with XY routing. Tosun et al. [31] also proposed an application mapping technique using ILP. The technique minimizes energy consumption for different benchmarks,

but the bandwidth constraints are not considered. The computation time of the algorithm is also high. S. Tosun attempted to improve the limitations of [30] by presenting cluster-based application mapping for NoC by using ILP [32]. Several benchmarks were considered showing that the technique produces optimal results. However the execution time of the algorithm was high being an ILP based approach. Table 1: Mapping Heuristics for NoC based Systems Proposed Technique Communication-aware packing based nearest neighbor (CPNN) heuristics for run-time task mapping on NoC-based MPSoC platforms [2]

Classification Dynamic Mapping

Mapping Criteria Maps highly communicating tasks on same or closer PEs. Looks for PEs in left, down, top, right order

Pros Communication overhead is minimized. Energy savings up to 44% can be expected

Cons Some leaf tasks may occupy certain PEs completely. Consequently, such PEs are not shared with any other tasks resulting in underutilization of PEs. Lightly communicating tasks may be neglected and placed far apart.

An energy-aware online task mapping algorithm in NoC-based system [22]

Dynamic Mapping

Analyze communication status of applications at runtime and real-time mapping is carried out online.

Communication energy saving more than 20%.

Run-Time Task Allocation Considering User Behavior in Embedded Multiprocessor Networks-on-Chip [23]

Dynamic Mapping

Influence of user behavior makes the system respond in a better way to realtime changes. Adapt dynamically to the required user needs.

Communication energy savings more than 70%

Computational overhead is increased

User-Aware Dynamic Task Allocation in Networks-on-Chip [24]

Dynamic Mapping

User behavior considered in resource allocation resulting in a more dependable response to real-time changes

60% communication energy savings

Increased computational requirements for processing user behavior.

DSM: A Heuristic Dynamic Spiral Mapping algorithm for network on chip [25]

Dynamic Mapping

Maps tasks in a spiral manner, starting from the center to the boundary with communicating tasks placed closer to each other.

Reduced reconfiguration time, dynamic mapping time and task migration time

Some PEs that can provide better mapping can be ignored during spiral mapping

ADAM: Run-time Agent-based Distributed Application Mapping for on-chip Communication [26]

Dynamic Mapping

Agents can manage resources, store state information of resources and negotiate with each other to select a best suitable processing element for a specified task.

Reduce the overall traffic needed to determine the current state of the network

Negotiations among the agents adds up the computational and communicational overhead

Heuristics for Dynamic Task Mapping in NoC-based Heterogeneous MPSoCs [27]. Dynamic Task Mapping for MPSoCs [28]

Dynamic Mapping

The tasks are mapped at run-time while considering the communication requests and load on the links.

Reduced execution times and congestion. 31% decrease in the channel load and up to 22% smaller packet latency.

This heuristic is applicable only to single-task processors

Energy-Aware Task Allocation for Network-on-Chip Based Heterogeneous Multiprocessor Systems [29]

Static Mapping

An extension of Integer Linear Programming (ILP) and a Simulated Annealing with Timing Adjustment (SA-TA) heuristic.

Accelerate the optimization process and achieve much better performance

Considers a simplified energy model with no consideration of leakage power

Contention-aware Application Mapping for Network-on-Chip Communication Architectures [30]

Static Mapping

Uses integer linear programming (ILP) resulting in minimizing the inter-tile network contention.

Decrease in packet latency was achieved with insignificant communication energy overhead

The scope of this technique was limited to 2-D mesh NoCs. Computation time is high and communication energy overhead is slightly increased

An ILP formulation for application mapping onto Network-on-Chips [31]

Static Mapping

Application mapping technique using integer linear programming (ILP).

The energy consumption is minimized depending upon different benchmarks

The computation time for the algorithm is high and bandwidth constraints are not considered.

Cluster-based application mapping method for Network-on-Chip [32]

Static Mapping

Cluster-based application mapping for NoC by using ILP.

Produces optimal results

Algorithm execution time is high

An application specific NoC mapping for optimized delay [33]

Static Mapping

Delay model for application mapping based on genetic algorithm having minimum average delay.

Minimum average delay and the execution time are also reduced

Slow rate of convergence

Task-Binding Based Branch-andBound Algorithm for NoC Mapping [43]

Static Mapping

Attempts to map the heavily communicating tasks to closer PEs.

Energy consumption and communication cost is reduced and the algorithm run-time is also reduced significantly

Single task is mapped on each PE, i.e., does not consider the multi-task mapping

Zhou et al. [33] presented a delay model for application mapping based on genetic algorithm that has a minimal average delay. In the proposed algorithm, a population relates to the position of the cores in NoC. The initial population is selected randomly. Multipoint crossover is used having crossover points that are randomly chosen for creating next generation. The chromosomes that have a low waiting time can participate easily in crossover. The proposed algorithm offers a minimum average delay and the execution time is likewise shortened. A major weakness of genetic algorithm is that its rate of convergence is quite slow. Carvalho et al. [6] conducted quantitative comparison to find out the pros and cons of static and dynamic mapping algorithms. According to the results of the study, dynamic algorithms are more advantageous than static techniques as they are able to deal with real-time scenarios. However, dynamic algorithms consider only a partial view of the application graph in the sense that it deals with only the communication with its caller task. On the other hand, static algorithms have global view of the complete system and can use complex mechanisms to consider all tasks and resources. Möller et al. [34] provided another comparative analysis of dynamic task mapping heuristics in heterogeneous NoC-based MPSoCs. Eight different heuristics were compared in this research; however, the comparison was performed based on very few parameters and the deadlines were also missed during simulation. There is a need to focus on the deadlines also in order to generate solid results. Zhou et al. [43] proposed an optimized branch and bound (B&B) algorithm that attempts to reduce the communication energy consumption by mapping the heavily communicating tasks to closer PEs. The proposed B&B algorithm binds the heavily communicating together. The advantage of binding the heavily communicating tasks is twofold: (a) tasks that have high communication dependence are mapped on adjacent PEs, and (b) the algorithm run-time is reduced significantly. Experimental results showed that the proposed algorithm is able to reduce the energy consumption approximately 15% and 64% as compared to traditional B&B and simulated annealing (SA) based algorithms, respectively. However, in contrast to our approach, in the proposed algorithm a single task can be mapped on each PE. Moreover, the proposed B&B algorithm is based on design-time task mapping whereas in this work we only consider the run-time or dynamic task mapping heuristics. Table 1 summarizes the aforementioned task mapping heuristics. 4. System Models This section presents the different models used for conducting the study. The application model describes how tasks of an application are represented using application task graphs. NoC model is also discussed along with the application mapping model which represents how different tasks are mapped to the PEs. Energy model is also discussed in this section that is used to calculate the overall energy consumption of the system. 4.1. Application Model The tasks of an application are defined by an Application Task Graph (ATG). The ATG is a directed graph that is denoted by ATG = (T, E). Here, T is the set of tasks of an application and E represents the set of edges that shows the dependence and communication cost between the connected tasks. A task ti ∈ T is represented by three parameters (idi, reqi, sizei), where idi represents the task identifier, reqi represents the task computation requirements in cycles, and sizei indicates the size of ti in bytes including task code and data [44, 45, 46]. The weighted edge em,s ∈ E defines the volume of data to be sent from task m to task s during each period. An edge between two tasks describes the master-slave pair as shown in Fig. 2. A slave task (task 1) will start its execution after receiving the required data from its master task (task 0). Initial task or entry task (task 0) has no master [1, 2, 35].

Fig. 2. Application Task Graph

4.2. NoC Model The NoC architecture can also be represented as a graph G=(P, V), where P is the set of PEs pi ∈ P and vi,j ∈ V represents the links between two PEs pi and pj. Each processing element pi ∈ P is identified by an identifier pid and has a specified computing capacity represented as capi. The link vi,j ∈ V contains the information of channel width in bits and bandwidth available for data transmission [2]. In the proposed MPSoC architecture, each PE has a fixed amount of memory and can support multiple tasks depending on the size of available memory. For, simplicity we assume that the amount of memory required by each task ti ∈ T is same as the computational requirement reqi of the task and the total memory available at each PE pj ∈ P is same as the computing capacity capj. 4.3. Energy Model In NoC based systems, the energy consumption consists of two primary components that are: (a) computational energy and (b) communication energy. Processing elements embedded within the tiles consume the computational energy. It can vary depending

upon the task characteristics and the efficiency of the PEs within a tile [36, 37]. Communication energy refers to the amount of energy consumed by the network components for transferring the required data among the PEs. The communication energy has been calculated based on bit energy consumption model proposed in [36]. The bit energy is defined as the energy needed for transmitting one bit from tilei to tilej. The bit energy can be calculated using Eq. 1: , ,    ,  1   2 

(1)

Here EBitij is the energy needed in NoC for transmitting one bit from tilei to tilej. ERbit represents the dynamic energy needed to transfer a bit through the router (wires, buffers, and logic gates). ELbit represents the dynamic energy consumed to transfer the bit on the links between two neighboring tiles. ECbit represents the bit dynamic energy consumed on the links between router and PE. Finally, , represents the number of routers or hops that the bit passes from tilei to tilej and is calculated using Eq. 2. Consequently, the total communication energy consumed within NoC can be captured using Eq. 3: ,       |    |  !  "#$    "#$  1   2  %

(2) (3)

K is the total number of bits transmitted and represents the average number of hops travelled by all the bits. Computation energy consumption of each PE (Ej) is calculated with the help of Eq. 4:  & '() '()   & *+, *+, 

(4)

Here & '() indicates the number of cycles during which the processor pj is in running state whereas & *+, represents the number of cycles during which pj is in idle state [37, 38]. '() and *+, represent the energy consumption of a processor in running and idle states, respectively. Total computational energy consumption -./ is calculated by taking summation of the energy consumption of all the PEs as shown in Eq. 5. Similarly, the total system energy consumption is calculated by taking sum of computational energy consumption and NoC energy consumption as depicted in Eq. 6. )

-./ 0  3 12

44"+ -./   5

4.4. Migration Cost The cost of migrating a task ti from source PE located at tilex to a destination PE located at tiley is calculated using the Eq. 7. 6 789 :,;

(7)

In Eq. 7, sizei is the size of task in bytes including code and data and :,; is the number of hops between tilex and tiley that is calcualted using Eq. 2. Let us consider that list of all the migrations is stored in a set MSet {m1, m2, …… mn}. Each entry <= ∈ 6>9 has following attributes {tk, src, dst}. Here tk is the task to migrate, src is the source tile, and dst is the destination tile where tk is migrated. Considering the above formualtion, the total migration cost for all the migrations in MSet is calcualted using Eq. 8. 6?744"+

0

∀ .A ∈ BC,4

6= D

Eq. 8 captures the total migration cost in terms of communication cost in number of bytes. To calculate the migration energy consumption we can utilize the Eq. 3 with some modifications as depicted in Eq 9. To calcualte the migration energy, we need to BE4 convert MCosttotal to number of bits that is achieved by the term J FGFHI 8. HKL

6?744"+ .$ 

8   "#$    "#$  1   2   "#$

4.5. Application Mapping In dynamic task mapping, the mapping process is activated when a mapped task needs to communicate with a task that has not been mapped yet [34]. Mapping of a task is represented by a function mpg: ti ∈ T → pi ∈ P [1, 2]. The emphasis of this work is on the continuously running applications. That is, the tasks of applications are executed repeatedly after each interval. 5. Dynamic Task Mapping Algorithms And Proposed Strategy In this work we have conducted a detailed analysis of some of the selected dynamic task mapping algorithms for NoC based systems to determine the performance of mapping algorithms in varying scenarios. The algorithms are examined on different metrics

that include: (a) communication cost, (b) energy consumption, (c) end-to-end delay, and (d) average hop count count. Based on the results of analysis, it was revealed that CPNN algorithm provides best results in terms rms of energy consumption for all variations of mesh sizes and lowest communication cost for smaller mesh sizes. For larger mesh sizes the communication cost of CPNN was higher than the communication energy (CE) algorithm because unlike CPNN, the CE algorithm algorithm considers the volume of communication among the dependent tasks. Therefore, an extension of the CPNN is proposed that attempts to improve the performance of CPNN in terms of communication overhead. The dynamic task mapping algorithms used in this analysis are: • Smart Nearest Neighbour (SNN) • First Fit (FF) Mapping Algorithm • Nearest Neighbour (NN) Mapping Algorithm • Communication Energy (CE) Mapping Algorithm • Communication-Aware Packing-Based Based Nearest N Neighbour (CPNN) 5.1. First Fit Mapping FF is the simplestt mapping algorithm. For task placement, the algorithm looks for the first available available processing element and assigns the task to it.

Fig. 3. Sequential Task Placements in First Fit Mapping Algorithm

The tasks are mapped in a sequential order, starting from the initial position of (0, 0) which represents 0th row and 0th column in the mesh NoC, as shown in Fig. 3.. The algorithm keeps on looking for the available PEs until the position reaches the boundary of NoC. If no PE is found, then it looks for the PEs in the next row.. In every iteration, the mapping is completed only if PE is found or if all the PEs have been evaluated [19]. 5.2. Nearest Neighbor Mapping The NN mapping algorithm assigns the first task to the selected initial PE. Later, the remaining tasks are mapped to the available PEs that have the minimum Manhattan distance to the already mapped PEs, with reference to initial position, as shown in Fig. 4 [19, 39]. The Manhattan distance is calculated based on the hop count starting from the hop count of 0 and going up to the maximum hop distance (hop distance = 0 to max hop count) [1]. The NN algorithm attempts to map the tasks close to each other within a sma small area. ea. We have implemented two variations of NN algorithm, i.e, NN and NN-E. NN E. In the NN algorithm, (0, 0) is taken as the initial position as shown in Fig. 4 whereas in NN-E E central tile is taken as the initial position as shown in Fig. 5.

Fig. 4. Task Placements in Nearest Neighbour (NN) Algorithm

Fig. 5. Task Placements in Enhanced Nearest Neighbour (NN-E) Algorithm

5.3. Smart Nearest Neighbor Although the nearest neighbor algorithm tries to map the tasks closer to each other, but the NN algorithm does not consider the communication cost among the interdependent tasks. tasks Consequently, communicating tasks may be placed on the PEs far apart resulting in increased communication overhead. The communication overhead can be reduced if maximum number of

communicating tasks can be mapped to the same or nearest available PE. In order to achieve the aforementioned goal, Singh et al. have proposed smart nearest neighbor task mapping algorithm. The SNN algorithm is based on the clustering approach. The NoC architecture is divided into different virtual clusters. These clusters have virtual boundaries so that the resources located in different virtual clusters can be shared among multiple applications, if required. Each cluster allows only one initial task to be placed on its central PE [1]. When the initial task starts execution, it requests for its communicating tasks. The first requested task from a master task is loaded and mapped to the PE that has the minimum distance from the PE where the master task is placed. The remaining requested tasks from the same master are stored in an application queue (AQ). As the PEs support multiple tasks, therefore, the requested communicating tasks can be mapped to the same PE if it has the required capacity resulting in reduced communication overhead [1]. An example of the mapping performed by smart nearest neighbor algorithm is illustrated in Fig. 6.

Fig. 6. Task placement using Smart Nearest Neighbour (SNN) Algorithm [1]

5.4. Communication Energy Mapping Algorithm The communication energy (CE) algorithm attempts to map the tasks with heavy communication closer to each other. It finds the task with the highest volume of communication and maps it to the first available processing element. In the next step, neighboring tasks of the mapped task are sorted by the volume of communication. From the sorted tasks, the task with highest communication is chosen and mapped to the PE having smallest Manhattan distance with the previously mapped PE. In each iteration, task with maximum volume of communication, among the neighboring tasks, is selected and mapped to the nearest available PE [19]. Consider the example application task graph (ATG) shown in Fig. 7. In this Fig., the volumes of communication of all the tasks are given at the bottom in the form of an array. The volume of communication is calculated by considering the incoming and outgoing volume of data. In Fig. 8, task 4 has the highest volume of communication having a value of 55, consequently, task 4 is mapped first. Then among its neighbors, i.e., tasks 1, 7 and 8, task 8 has comparatively higher communication with task 4 as shown in the ATG. Therefore, task 8 will be mapped to the nearest PE. From the remaining neighbors 1 and 7, the communication between task 4 and 1 is higher having value of 20, therefore, task 1 will be mapped to the same or nearest PE. In the same way all the remaining tasks are mapped based on the communication volume.

Fig. 7. Task Placements in Communication Energy Algorithm

5.5. Communication-Aware Packing Based Nearest Neighbor Similar to SNN, communication-aware packing based nearest neighbor (CPNN) algorithm also follows the clustering approach. In addition, the CPNN also uses two other concepts that are: (a) the packing strategy and (b) communication aware nearest neighbor mapping. The packing strategy assigns the task to the PEs in left, down, top, and right order freeing the top right corner of the cluster for the tasks of other applications to be mapped if needed as shown in Fig. 8. The initial task is mapped to the central PE in a cluster. For the unmapped tasks a packing order is searched in the left, down, top, and right order. Then the tasks are mapped to the PEs based on the packing order. After determining a suitable PE that can hold the incoming task, the algorithm finds if there are any previously mapped tasks on that PE utilizing the communication aware nearest neighbor mapping strategy. If there are no previously mapped tasks in that PE, then the incoming task is mapped to that PE. If the

PE contains previously mapped tasks then it is checked that whether these tasks communicate with the newly arrived tasks or not. If the mapped and new tasks are communicating, only then the new tasks are mapped to that PE otherwise a new suitable PE is searched.

Fig. 8. Task Placements in Communication-aware Packing based Nearest Neighbour [2]

5.6. Selected Algorithm Extensive simulations under varying network load and NoC sizes were conducted to evaluate the performance of the aforementioned algorithms. The experimental results revealed that, CPNN algorithm showed an improved performance as compared to the other algorithms in terms of reduced energy consumption. However, CPNN algorithm exhibit higher communication cost, especially for large mesh NoCs. Moreover, the mapping produced by the CPNN results in under-utilization of some of the PEs that may affect the performance of NoC. Therefore, in this study we propose an extended version of CPNN that attempts to reduce the communication overhead. The proposed algorithm performs the task migration to place more tasks to the closer PEs while keeping the energy consumption close to the CPNN. CPNN maps the tasks of an application to the PEs in left, down, top, and right order only if the tasks present in those PEs communicate with the newly arrived task. A drawback of CPNN is that as soon as it finds a suitable PE, it stops searching for suitable PEs and assigns the task to the first available PE. However, there can be possibility that there exist some other PEs that can hold the incoming task and further reduce the communication overhead. Moreover, in CPNN algorithm a significant number of PEs remain under-utilized because the algorithm tries to map only the communicating tasks to the same processing elements. This not only increases the communication overhead but also leads to increased energy consumption as some PEs are left under-utilized. 5.7. Proposed Algorithm Keeping in view the aforementioned shortcoming of CPNN algorithm, an extension of the CPNN algorithm is proposed. The primary objective of the proposed algorithm is to reduce the communication overhead. The proposed algorithm attempts to migrate tasks from the lightly loaded PEs to other PEs that can accommodate the tasks. While migrating tasks, communication dependencies among tasks are taken into consideration. Benefits of proposed strategy are two-fold: (a) communication overhead among the interdependent tasks is reduced and (b) the under-utilized PEs can be tuned to low energy mode to reduce energy consumption. Fig. 9 (a) and (b) illustrate the problem in CPNN algorithm and its solution with the proposed algorithm, respectively. In this particular example, consider that each PE can accommodate three tasks. We can see that some of the PEs are lightly loaded as compared to other PEs. In this particular example, task 9 needs to communicate with task 11 which is placed at a distant PE. Moreover, some of the PEs contain only one task although each PE can accommodate three tasks. Consequently, the underutilized PEs can lead to higher energy consumption. Here we would like to emphasize that the target of the proposed work is continuously running applications where the set of tasks are executed iteratively as long as the application is running.

Fig. 9. Problem illustration in CPNN (a) Mapping by CPNN. (b) Task migration by proposed algorithm.

The proposed algorithm attempts to migrate tasks from lightly loaded PEs. For instance, task 11 is migrated to PE containing tasks 3 and 6. While selecting tasks for migration and destination PE for hosting the task, communication dependence among the tasks is also considered. As a result, tasks are migrated to PEs hosting the tasks that communicate with task to be migrated. It must be noted that CAM is applied after initial task placement is done by CPNN. Proposed algorithm takes as input the task mapping produced by CPNN. Consequently, after initial iteration of application tasks, the proposed migration algorithm is applied to optimize the task placement produced by CPNN. The proposed Communication-Aware Migration (CAM) algorithm is presented in Table 2.

Table 2: Pseudo-Code for Proposed Algorithm Communication-Aware Migration (CAM) Algorithm Input: ATG (T, E), threshold, mpg-CPNN (T → P); Output: optimized mapping function: mpg (T → P) 1) 2) 3) 4) 5) 6) 7) 8) 9)

Ps` ← sort all p ∈ P in order of utilization from lower to higher for all the ps ∈ Ps` where utilization of ps < threshold do tasklist ← get the list of tasks that are mapped on ps for all ti ∈ tasklist do

comp_reqi ← computational requirements of task ti ps_tasks ← get list of tasks mapped on ps except ti

cv_psi ← communication volume of ti with ps_tasks for all pt ∈ Ps` in reverse order do if ptcapacity >= comp_reqi then pt_tasks ← get list of tasks mapped on pt if pt_tasks != NULL then

10) 11)

cv_pti ← communication volume of ti with pt_tasks if cv_pti >= cv_psi then

12) 13)

migrate ti ∈ tasklist to pt

14)

Go to line 4 and process next task ∈ tasklist

15)

End if

16)

end if

17)

end if

18)

end for

19) 20)

end for

21)

end for

22)

return mpg;

In the proposed algorithm, initially all processing elements pi ∈ P are sorted in order of utilization from lower to higher and are stored in sorted list Ps`. Next, each PE ps ∈ Ps` that has utilization less than the threshold is evaluated in the sorted order. Initially, algorithm gets the list of tasks mapped on ps. For each task ti mapped on ps, following steps are repeated. Calculate the volume of communication cv_psi exchanged between task ti and other tasks mapped on ps. Next, we evaluate each processing element pt in reverse sorted order, i.e., highest utilization to lowest utilization. For each pt, we first check whether pt has the required available capacity to host task ti. If pt has required capacity then we calculate communication volume cv_pti of ti with the tasks mapped on pt. If communication volume of ti with tasks mapped on pt is larger than communication volume with tasks mapped on ps, then the task ti is migrated to pt otherwise ti is not migrated and next PE is evaluated. In this way, tasks of each underutilized PE, having utilization less than threshold, are evaluated and are migrated if there is any suitable PE available that can better accommodate the tasks. 5.8. Complexity Analysis of Proposed CAM Algorithm 5.8.1. Time Complexity of CAM Algorithm Time complexity of proposed CAM algorithm is calculated as follows. In the first step the algorithm sorts all the PEs according to utilization in descending order. Therefore, time complexity of sorting n number of PEs is P &. R?S &. The outer for loop (line 2) is executed at most n number of times in worst-case scenario when the threshold is set to 100%. However, in practice the threshold can be set to 50% or 60% that can significantly reduce the worst-case time complexity. Let us consider that total number of tasks in the workflow is m. If each PE supports same number of tasks then the average number of tasks, represented by k, mapped on each PE is calculated using Eq. 6. T
(7)

To mitigate the impact of computation time of proposed migration algorithm, the algorithm can be applied in parallel to task execution. Usually a dedicated processor, termed as Manager Processor (MP), is used to run the mapping algorithm and make the

mapping decisions. Therefore, CAM algorithm can be executed in parallel to the execution of application tasks that run on other PEs. Later, at the end of the current iteration the migrations identified through CAM are applied. 5.8.2. Space Complexity of Proposed CAM Algorithm Space complexity of CAM algorithm is calculated using Eq. 8. P&  2T (8) Here, n is the total number of PEs and k is the average number of tasks mapped on each PE, calculated using Eq. 6. The space complexity of CAM is very moderate as we only need a list of size n to store PEs in sorted order and two lists, each of size k, to hold the tasks that are mapped on the current PE under consideration and the opposite PE where the tasks need to be migrated, respectively.

6. Formal Verification In this section we introduce terminologies used in the context of this work that will help the reader to comprehend the content related to formal verification using high level Petri nets, SMT-Lib, and Z3 solver. 6.1. High Level Petri Nets Petri nets are very useful for mathematical and graphical modeling of wide range of systems, such as, distributed, parallel, concurrent, stochastic, non-deterministic, and asynchronous systems [41]. In this wok we have used a variant of conventional Petri nets, termed as High-Level Petri Nets (HLPN) for formal verification of the proposed migration technique. The HLPN is a 7-tuple structure represented as: Z [, \, ], ^, , , 6_  where: 1. 2. 3. 4. 5. 6. 7.

P: refers to a set of places. T: is the set of transitions such that ([ ∩ \ ∅). F: refers to the flow relation such that ] ⊆ [ \ ∪ \ ∪ [ . ^: is a mapping function used to map P to data types. : defines the transition rules used to map T to predicate formulas such that  ∶ \ → ]?f
Information about the structure of Petri net is provided by the variables ([, \, ]) while static semantics are represented by (^, , ). Static semantics specify that the particular information does not change throughout the system. In HLPN, the set P may hold tokens of multiple types and may also be a cross product of two or more basic types. Pre-condition(s) of any transition must be satisfied to trigger the particular transition. Moreover, variables of incoming flows are used to trigger the transition. Similarly, postcondition uses variables pertaining to outgoing flows to trigger the transition. The readers interested to get a more detailed introduction of Petri nets are encouraged to see [41].

Fig. 10: HLPN model for the proposed migration algorithm

Table 3: Data types used for the modeling of task migration Types

Description

Task_id Task_req Task_comm Proc_id Proc_cap Used_cap

An integer type for task ID An integer type for the amount processing capacity required by the task An integer matrix of TxT representing the communication volume of each task with other tasks An integer type for processing core ID An integer type for total processing capacity of processing core An integer type for used capacity of processing core

Places Tasks Proc Unit Task List Sumreq Prev Tasks Migrate ti to P’

Mappings T_ID × T_Req × T_Comm× P_ID P_ID × P_Cap × P_U_Cap× P_T_Map P_ID × T_ID × T_Req P_ID × T_ID × T_Req P_ID × P_T_Map P_ID × P_T_Map

Table 4: Places and Mapping of Task Migration

6.2. SMT-Lib and Z3 Solver Satisfiability Modulo Theories (SMT) is used for automated checking of satisfiability of logical formulas over one or more theories of interest. SMT has been successfully applied in many application areas including computer science. Some of the recent fields of computer science where SMT is applied include: model checking, formal verification, planning, software engineering, and scheduling. SMT-Lib is an international initiative aimed at promoting the development and adoption of common interfaces for SMT solvers. For evaluation of any given system, SMT-Lib provides a standard benchmarking platform that works on common input/output framework. In this work, we used Z3, a high performance theorem solver and satisfiability checker developed by Microsoft research. Moreover, Z3 can also check the satisfiability of the provided set of logical formulas with reference to the builtin SMT-Lib theories. Readers interested to explore the working of Z3 in greater detail are encouraged to see [42]. 6.3. Modeling and Analysis of Proposed Algorithm The HLPN model of task migration process is illustrated in Fig. 10. To begin with the system modeling, we first need to define the set of places P and the corresponding types. It can be seen in the model, depicted in Fig. 10, that there are six places in the model. Table 3 illustrates the data types and their descriptions while the places and their mappings are presented in Table 4. Before going into the details of formal verification and modeling, we would like to give an overview of application, system, and NoC models used for conducting this study. In the proposed architecture, an application is represented as an Application Task Graph ATG (T, E), where T is the set of tasks and E is the set of edges. Each task ti ∈ T has certain computation requirements in number of cycles represented by reqi. Each edge ei,j represents the communication dependency between tasks ti and tj with the volume of data shared between ti and tj. Similarly, the NoC architecture is also represented as a graph ACG = (P, V), where P is the set of tiles hosting processing elements (PEs) and V is the set of links connecting two tiles. Each PE pi ∈ P is identified by an identifier pid and has certain computing capacity represented as capj. The link vi,j ∈ V, connecting pi and pj, contains the information of channel width and bandwidth available for data transmission [2]. The function for mapping tasks to processor is defined by:
(9)

The process of task list aggregation is represented in (9). ( ). The processing units are sort according to the utilization. If the tasks and processing units attributes are not null (which means that there are some tasks and some processing units available), the then take one processing unit and get the list of tasks mapped apped onto it and place it in the “task list” place. j ? ⦁ 7f 1% S 1% ∧ 7f 2% S 2% ∧ 7f 3% S 3% ∧ >′ > ⋃ x  7f 1% , 7f 2%z

(10)

Once the task list is acquired for the current processor, the total computing capacity of each task to run on a single proces processing unit is computed, as is shown in (10). j {9rk  ∀ ?i ∈ Pr ∧ ∀ }i ∈ gfr ∧ ∀ 7f ∈ > ⦁ ?i ?i 2% ~ 7f 3% ∧ ?i 1% n }i 1% ∧ ∀ i ∈ [\ ⦁ i 1% ?i 1% ∧ i 2% ?i 4% ∧ [\′ [\ ⋃ x  i 1% , i 2%z

(11)

In (11), ), for each processing unit (other than the one on which the task is currently mapped on), we checked if the available remaining capacity of the processing unit is greater than the computing capacity required by the task (sum_req). If the afore aforesaid condition is satisfied, then the transition {9_[\ will fire, fi which will get the task list of the other processing unit. j |T_?< ∀ i} ∈ [\ ∧ ∀ } ∈ \ ⦁ ∃ } 2% ∈ i} 2% ∧  €  ⋃ x  i} 1% , i} 2%z 

(12)

Once the task list of other processing unit is acquired, then the communication between the other processing units task and tthe current task is observed, as in (12). ). If the current task communicates with the tasks mapped on the target processing unit (i.e., if the task appears in the “Task_comm”” matrix of the tasks hosted on the target processing unit), and the degree of communication with the target processing unit is greater than the communication degree with the currently hosting processing unit, unit, then migrate this task to the target processing unit. 6.4. Model Verification Property The objective of verification was to ascertain that the proposed technique works according to the outlined specifications and produces the accurate results. The following properties were used to verify the model: • •

A task should be assigned to only one processing unit and no task should be missed, i.e., every task must be assigned to some processing unit (atomicity). The number of tasks assigned to any processing unit must not no exceed the capacity (capacity).

The aforementioned model was translated to SMT-Lib SMT Lib and was verified using the Z3 solver. The scenario that can was used for checking of the system is that a task is assigned to a processing unit whereas its communicating tasks tasks are placed at another processing unit (target processing unit) and the target processing unit has the required capacity to accommodate the task. In that case, the task should be migrated to the target processing otherwise the task should not be migrated. migrated. Results of Z3 solver indicate that the proposed model is workable and runs according to specifications. The results for the execution time over the Z3 solv solver are presented in Fig. 11. According to experimental results, the solver took 0.036 and 0.045 to to execute the proposed model and verify the two properties, i.e., atomicity and capacity, respectively.

Exec. Time (sec)

0.05 0.04 0.03 0.02 0.01 0

Atomicity

Capacity

Fig. 11. Execution Time for the Verification of Model using atomicity and capacity properties

7. Experimental Evaluation 7.1. Experimental set-up Extensive simulations were conducted for varying mesh size NoCs as mentioned in Table 3. The routine of tasks was set from 400 to 12500 tasks based on NoC size. The application task graphs were generated randomly where the computation requirements of the tasks were uniformly distributed between 1 to 20 million instructions. Other experimental settings are presented in Table 3. The table shows different mesh NoCs and the corresponding number of tasks of the workflow to be executed. The scenarios are designed in such a way to ensure that none of the tasks gets missed and all the tasks are mapped to only one of the PEs. The simulations were conducted using the Heterogeneous Network-on-Chip Simulator (HNOCS) simulator [40]. HNOCS is an open source simulator based on OMNET++ and provides a modular, scalable, and fully parameterized framework. Table 5: Experimental settings Mesh Size 5×5 8×8 9×9 15×15 20×20 25×25 30×30 40×40 50×50

No. of Tasks 400 1100 1600 5200 9300 9500 10250 10500 12500

7.2. Performance Metrics Performance analysis of all the algorithms was conducted on the basis of four parameters that include: (a) communication cost, (b) energy consumption, (c) average hop count, and (d) end-to-end latency. The energy consumption for each of the mapping heuristics is analyzed depending upon the total energy consumed by the MPSoC platform and is the sum of communication and computation energies. The communication cost is taken as the total number of flits forwarded along the NoC and is calculated as the product of total number of flits and the average number of hops travelled by all the flits. End-to-End latency is the average time taken, in nanoseconds, by all the flits to reach the destination PEs. 7.3. Simulation Results All of the mapping heuristics used in this study, except FF and NN, attempt to map the communicating tasks closer to each other that directly influence the communication cost and also affects the communication energy consumption. The results obtained from simulations are normalized on the parametric values of FF mapping algorithm because it has highest values in most cases. For better understanding and clarity, the results are grouped in two categories depending upon the number of tasks and the size of mesh NoC. In the first category, referred as smaller mesh NoCs, 5×5, 8×8, 9×9 and 15×15 mesh NoCs are grouped while the second category, referred as larger mesh NoCs, includes 20×20, 25×25, 30×30, 40×40, and 50×50 mesh NoCs. 7.4. Energy Consumption The results of energy consumption for smaller and larger mesh sizes are shown in Fig. 12 (a) and (b), respectively. Experimental results show that the proposed CAM algorithm exhibits least energy consumption as compared to all the other algorithms for smaller mesh NoCs. This is due to the fact that the proposed algorithm attempts to migrate the tasks from under-utilized PEs to reduce the energy consumption. However, with the increasing size of mesh NoCs and increased task load, the energy consumption of the proposed algorithm is slightly increased than CPNN. The reason for such behavior is that focus of the proposed algorithm is to minimize the communication overhead, consequently, while choosing the destination PEs for migrating the tasks, the proposed algorithm only considers the communication cost instead of computational energy consumption. More specifically, for smaller mesh NoCs, energy consumption of the proposed algorithm is 6% less than the CPNN and is 47% less from the FF algorithm. The energy consumption of proposed algorithm is slightly increased, i.e., 1% to 2% compared with CPNN for larger mesh NoCs. The results show that the energy consumed by the CPNN algorithm is significantly less than SNN and CE for all cases. Among SNN and CE, on average energy consumption of CE is less than SNN. However, as the size of mesh NoC increases the difference between energy consumption of SNN and CE is reduced. The mapping techniques of SNN and CPNN are slightly different from each other. The SNN algorithm follows a depth first search (DFS) approach for task mapping wherein it tries to map the child nodes of the requesting node first. On the other hand, CPNN follows breadth first search (BFS) approach for mapping where it attempts to map the first hop neighbor tasks of the requesting task first. Moreover, CE algorithm only considers the communication energy consumption, therefore, it has lower communication cost and energy but increased computation energy consumption as compared to CPNN and CAM.

(a)

(b)

Fig. 12. Comparison of average energy consumption for all algorithms. (a) Smaller mesh NoCs. (b) Larger mesh NoCs

(a)

(b)

Fig. 13. Comparison of communication cost for all algorithms. (a) Smaller mesh NoCs. (b) Larger mesh NoCs

Here we would like to emphasize that the focus of this work is on the continuously running applications where the set of peri periodic tasks are executed repeatedly throughout the lifetime of application. application. The examples of such applications include: audio and video encoders/decoders, software radios, and streaming applications just to name a few [47, 48]. ]. The complete cycle in which all tthe periodic tasks of an application are executed once is also referred as a single iteration. The results presented here are based on average of all the execution cycles. However, to quantify the impact of migration on energy consumption and communication cost we have included the results of CAM-M. It must be noted that CAM represents the results without considering the migration overhead whereas CAM-M M represents the results of CAM algorithm including migration overhead, overhead, in terms of energy and communication cost, for a single iteration. Experimental xperimental results reveal that increase in energy consumption of CAM CAM-M ranges from 1% to 5% with an average of 3% over varying mesh sizes compared with CAM for single iteration. However, ssince the migration is conducted once, e, after the initial iteration of periodic tasks, the migration cost is amortized by the performance gains and reduced energy consumption achieved in subsequent iterations. iterations Specifically, average energy savings of CAM compared to CPNN is 0.5%. Consequently, the migration overhead, in terms of energy consumption, can be amortized within with 6 to 10 subsequent iterations of the application. On the contrary, task to core mapping produced by the CPNN algorithm remains constant throughout the life life-time of application,, i.e., CPNN does not migrate the tasks in subsequent iterations of the application. Therefore, CPNN cannot amortize or further reduce the communication cost because CPNN does not incur any migration cost. 7.5. Communication Cost Placing the interdependent and communicating tasks to the same or nearest PEs can significantly reduce the communication cost. The communication cost for each algorithm is shown in Fig. 13 (a) and (b) for smaller and large NoC sizes, respectively. The FF, NN, and NN-e algorithms exhibit the highest communication costs for both small and large NoC sizes, because these algorithms do not perform any pre-processing processing and do not take into consideration the communication overhead while mapping the tasks tasks. The experimental results reveal that the proposed algorithm is able to reduce the communication cost as compared to CPNN especially in larger mesh NoCs. More ore specifically, as compared to CPNN, the proposed algorithm achieves 3% and 8% less communication overhead on average for small and large mesh NoCs, respectively. Similarly, the proposed algorithm shows a decrease of 40% and 26% in communication cost than FF for smaller and larger mesh NoCs, respectively. The reduction in communication overhead is

Fig. 14: Comparison of average hop count

Fig. 15: Comparison of end-to-end end latency

achieved by placing the frequently communicating tasks to the closer or same PEs by migrating the tasks that are otherwise placed far apart by the CPNN algorithm. The results for communication cost of the CE algorithm show different trend in the communication ion cost at varying task loads. For instance, in case of smaller mesh NoCs, CE exhibits higher communication cost as compared to CPNN and CAM CAM. However, as the size of NoC and number of tasks increases the difference becomes more evident and CE algorithm achieves lower communication cost than all the algorithms for larger mesh NoCs. NoCs The reason behind this is that CPNN, CAM, and SNN map the communicating tasks closer. Whereas, CE also considers onsiders the volume of communication among the tasks.. That is, iinstead of mapping any communicating tasks closer, CE first attempts to map those tasks to the same or closer PE that have highest volume of communication with each other. As a result, majority of the highly communicating tasks are placed at the same or nearest available PEs. The effect of this mapping by CE is not visible for smaller NoCs and less number of tasks. Because, in case of smaller me mesh sizes, the average hop distance between the PEs is very small as compared to the large mesh NoCs. Therefore, in case oof smaller mesh NoCs the mapping performed by the algorithms other than CE exhibit better results. However, for large NoCs with large number mber of tasks, CE algorithm achieves significantly lower communication overhead as compared to other algorithms. The impact of task migration is captured by the CAM-M CAM M that represents the results of CAM including migration overhead. It can be seen that communication unication cost and corresponding energy consumption of CAM-M CAM is increased after considering the communication cost for task migration. Particularly, CAM-M CAM M increases the communication cost up to 23% with an average of 11% compared to CAM for single iteration. However, as mentioned earlier, that effect of migration cost is amortized in subsequen subsequent iterations of the application. Experimental results indicate that CAM can reduce the communication cost up to 13% with an average of 6% as compared to CPNN in the subsequent bsequent iterations. As a result, the communication overhead incurred, for migrating tasks, can be amortized within 2 to 4 iterations of the application. Similarly, the impact of additional computational overhead introduced by applying CAM is also alleviated by the performance gain achieved in the subsequent iterations of the applications. Therefore, the longer the applications are executed, larger the impact of proposed CAM algorithm will be on overall application performance and energy consumption as compared to the mapping produced by CPNN. Experimental results indicate that CAM can significantly reduce the communication cost that not only results in reduced energy consumption but also leads to improved application performance. 7.6. Hop Count The results for comparison on the basis of average hop count are presented in Fig. 14. The average hop count of the proposed algorithm is comparatively less than CPNN because once once the mapping is done by CPNN, the proposed algorithm attempts to migrate the communicating tasks to the same or nearest available PEs thereby reducing the average hop count. More specifically, the proposed algorithm achieves, on average, 4% and 39% decrease in hop count as compared to CPNN for smaller and larger mesh sizes, respectively. The CE algorithm shows lowest values for hop count for larger mesh NoCs. For smaller NoCs, there is no remarkable difference in the hop count values of CE and other algorithms but in case of larger mesh sizes the there is significant difference between CE and other algorithms. 7.7. End-to-End Latency Reduction in end-to-end end latency can lead to significant significant improvement in application throughput and faster execution of tasks because the waiting time, of dependent tasks, for the required data is reduced. The results for comparison of end end-to-end latency are shown in Fig. 15.. The proposed algorithm has lower end-to-end e latency than CPNN in all cases and remains considerably consistent throughout the small and larger mesh sizes. The improvement in latency of CAM as compared to CPNN is more prominent for large NoC sizes. More specifically, CAM shows 6% less end-to-end end latency than CPNN for smaller mesh NoCs and 29% less for larger mesh NoCs. Similarly, latency of CE is significantly less than all algorithms, algorithms, including proposed algorithm algorithm, especially for large mesh NoCs. Because, CE algorithm attempts to place the heavily communicating tasks closer to each other, communication among

the tasks does not traverse through large number of intermediate routers. Consequently, average end-to-end latency of CE algorithm is significantly lower than all the other algorithms. 8. Conclusions and Future Work In this work we have conducted a detailed quantitative evaluation of the selected dynamic task mapping algorithms for NoC based MPSoCs for a wide range of mesh sizes with varying task and communication loads. The experimental results revealed that CPNN algorithm has lowest energy consumption among the evaluated algorithms. However, the task mappings produced by the CPNN algorithm results in higher communication overhead. Therefore, we proposed an extension of the CPNN algorithm that reduces the communication overhead for constantly running applications. The proposed algorithm achieved reduced communication cost as compared to CPNN that is more prominent in large mesh NoCs. In addition, the proposed heuristic is able to reduce the energy consumption for smaller mesh NoCs. Further, we also carried out the formal verification and modeling of proposed technique using high level petri nets. A series of experiments were conducted to evaluate the performance of selected algorithms for different scenarios having different mesh size NoCs and varying task loads. On the basis of experimental results, it can be concluded that the proposed algorithm exhibits better performance in terms of communication cost, hop count, and end-to-end latency compared to CPNN and other algorithms except CE algorithm. CE algorithm has superior performance than all the algorithms, including proposed algorithm, in terms of network parameters, such as communication cost, average hop count, and end-to-end latency especially for large mesh NoCs. However, CE has higher energy consumption than CPNN and the proposed CAM algorithm. There can be several directions for future work. First, currently the proposed technique only focuses on reducing communication overhead. How to simultaneously optimize energy consumption and communication cost is an interesting problem to explore. Second, how to evenly distribute the load among the available PEs in order to balance the heat within the NoC based MPSoC platform is another active area of research. References [1] Kumar Singh, A.; Jigang, W.; Kumar, A.; Srikanthan, T., “Run-time mapping of multiple communicating tasks on MPSoC platforms”, International Conference on Computational Science, ICCS 2010, Volume 1, Issue 1, Pages 1019–1026, May 2010. [2] Kumar Singh, A.; Srikanthan, T.; Kumar, A.; Jigang, W., “Communication-aware heuristics for run-time task mapping on NoC-based MPSoC platforms”, Journal of Systems Architecture, Volume 56, Issue 7, Pages 242-255, 2010. [3] Vincenzo Rana, David Atienza, Marco Domenico Santambrogio, Donatella Sciuto, Giovanni De Micheli, “A Reconfigurable Network-on-Chip Architecture for Optimal Multi-Processor SoC Communication”, 16th IFIP/IEEE International Conference on Very Large Scale Integration, Rhodes, Greece, October 13-15, 2008. [4] Pradip Kumar Sahu, Santanu Chattopadhyay, “A survey on application mapping strategies for Network-on-Chip design”, Journal of Systems Architecture, Volume 59, Issue 1, Pages 60– 76, January 2013. [5] Carvalho, Ewerson, Ney Calazans, and Fernando Moraes. "Heuristics for dynamic task mapping in NoC-based heterogeneous MPSoCs." Rapid System Prototyping, 2007. RSP 2007. 18th IEEE/IFIP International Workshop on. IEEE, 2007. [6] Ewerson Carvalho, César Marcon, Ney Calazans, Fernando Moraes, “Evaluation of Static and Dynamic Task Mapping Algorithms in NoC-Based MPSoCs”, International Symposium on System-on-Chip, Page(s): 087 – 090, 2009. [7] Benini, L.; De Micheli, G., "Networks on chips: a new SoC paradigm," Computer , vol.35, no.1, pp.70,78, Jan 2002. [8] Dally, W.J.; Towles, B., "Route packets, not wires: on-chip interconnection networks," Design Automation Conference, 2001. Proceedings , vol., no., pp.684,689, 2001. [9] Bharati B. Sayankar, S.S. Limaye, “Overview of Network on Chip Architecture”, 2nd National Conference on Information and Communication Technology (NCICT) 2011. [10] Wikipedia, The Free Encyclopedia. (2013) Network on a chip. [Online]. http://en.wikipedia.org/w/index.php?title=Network_on_a_chip&oldid=578090376. [11] S.Subha, “A Scheduling Algorithm for Network On Chip”, International Conference on Advances in Computing, Control, and Telecommunication Technologies, Page(s): 289 – 291, 2009. [12] S. Subha, “An Optimized Algorithm for Network on Chip”, International Journal of Computer Applications”, Volume 3 – No.12, July 2010. [13] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, N. Borkar, An 80-tile 1.28tflops networkon-chip in 65nm cmos, in: Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International, 2007, pp. 98–589. [14] First 100-core processor with the new tile-gx family, http://www.tilera.com/news & events/press release 091026.php (Oct, 2009). [15] M. Kistler, M. Perrone, F. Petrini, Cell multiprocessor communication network: Built for speed, IEEE Micro 26 (3) (2006) 10–23. [16] Ahmed Hemani, Axel Jantsch, Shashi Kumar, Adam Postula, Johnny Öberg, Mikael Millberg, Dan Lindqvist, “Network on a Chip: An architecture for billion transistor era”, 18th NORCHIP Conference, Turku, Finland, Nov. 2000, pp. 166-173. [17] Wen-Chung Tsai, Ying-Cherng Lan, Yu-Hen Hu, and Sao-Jie Chen, “Networks on Chips: Structure and Design Methodologies,” Journal of Electrical and Computer Engineering, vol. 2012, Article ID 509465, 15 pages, 2012. [18] Suleyman Tosun, “Cluster-based application mapping method for Network-on-Chip”, Advances in Engineering Software, Volume 42, Issue 10, Pages 868–874, October 2011. [19] Ashwini Raina, V Muthukumar, “Traffic Aware Scheduling Algorithm for Network on Chip”, Sixth International Conference on Information Technology: New Generations, Pages 877882, 2009. [20] Jingcao Hu, Radu Marculescu, “Energy-Aware Communication and Task Scheduling for Network-on-Chip Architectures under Real-Time Constraints”, Design, Automation and Test in Europe Conference and Exhibition, Page(s): 234 - 239 Vol.1, 2004. [21] Jingcao Hu, Radu Marculescu, “Energy-Aware Mapping for Tile-based NoC Architectures under Performance Constraints”, Page(s): 233 – 239, 2003. [22] Bin Xie, Tianzhou Chen, Wei Hu, Xingsheng Tang , DazhouWang, “An energy-aware online task mapping algorithm in NoC-based system”, The Journal of Supercomputing, Volume 64, Issue 3 , Pages 1021-1037, June 2013. [23] Chen-Ling Chou, Radu Marculescu, “Run-Time Task Allocation Considering User Behavior in Embedded Multiprocessor Networks-on-Chip”, IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, Vol. 29, No. 1, January 2010. [24] Chen-Ling Chou, Radu Marculescu, “User-Aware Dynamic Task Allocation in Networks-on-Chip”, Design, Automation and Test in Europe, DATE '08, Page(s): 1232 – 1237, 2008. [25] Armin Mehran, Ahmad Khademzadeh, Samira Saeidi, “DSM: A Heuristic Dynamic Spiral Mapping algorithm for network on chip”, IEICE Electronics Express, Vol. 5, No. 13, Pages 464-471, 2008. [26] Mohammad Abdullah Al Faruque, Rudolf Krist, and Jörg Henkel, “ADAM: Run-time Agent-based Distributed Application Mapping for on-chip Communication”, Proceedings of the 45th annual Design Automation Conference, Pages 760-765, 2008.

[27] Ewerson Carvalho, Ney Calazans, Fernando Moraes, “Heuristics for Dynamic Task Mapping in NoC-based Heterogeneous MPSoCs”, 18th IEEE/IFIP International Workshop on Rapid System Prototyping, Page(s): 34 – 40, 2007. [28] Ewerson Carvalho, Ney Calazans, Fernando Moraes, “Dynamic Task Mapping for MPSoCs”, Journal: IEEE Design & Test, Volume 27 Issue 5, Pages 26-35, September 2010. [29] Jia Huang, Christian Buckl, Andreas Raabe, Alois Knoll, “Energy-Aware Task Allocation for Network-on-Chip Based Heterogeneous Multiprocessor Systems”, 19th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Page(s):447 – 454, 2011. [30] Chen-Ling Chou, Radu Marculescu, “Contention-aware Application Mapping for Network-on-Chip Communication Architectures”, IEEE International Conference on Computer Design, Page(s): 164 – 169, 2008. [31] S. Tosun, O. Ozturk, M. Ozen, “An ILP formulation for application mapping onto Network-on-Chips”, International Conference on Application of Information and Communication Technologies (AICT), 2009, pp. 1–5. [32] Suleyman Tosun, “Cluster-based application mapping method for Network-on-Chip”, Journal: Advances in Engineering Software, Volume 42, Issue 10, October 2011, Pages 868–874. [33] Wenbiao Zhou, Yan Zhang, Zhigang Mao, “An application specific NoC mapping for optimized delay”, International Conference on Design and Test of Integrated Systems in Nanoscale Technology, Page(s) 184 – 188, 2006. [34] Leandro Möller, Leandro Soares Indrusiak, Luciano Ost, Fernando Moraes, Manfred Glesner, “Comparative Analysis of Dynamic Task Mapping Heuristics in Heterogeneous NoCbased MPSoCs”, International Symposium on System on Chip (SoC), Page(s): 1–4, 2012. [35] Kaushik, S.; Singh, A.K.; Srikanthan, T., "Computation and communication aware run-time mapping for NoC-based MPSoC platforms," SOC Conference (SOCC), 2011 IEEE International, vol., no., pp.185,190, 26-28 Sept. 2011. [36] Atitallah, Rabie Ben, et al. "Estimating energy consumption for an MPSoC architectural exploration." Architecture of Computing Systems-ARCS 2006. Springer Berlin Heidelberg, 2006. 298-310. [37] Loghi, Mirko, Massimo Poncino, and Luca Benini. "Cycle-accurate power analysis for multiprocessor systems-on-a-chip." Proceedings of the 14th ACM Great Lakes symposium on VLSI. ACM, 2004. [38] Ye, T.T.; Benini, L.; De Micheli, G., "Analysis of power consumption on switch fabrics in network routers," Design Automation Conference, 2002. Proceedings. 39th , vol., no., pp.524,529, 2002. [39]Jun Ho Bahn, Overview of Network-on-Chip [Online], http://gram.eng.uci.edu/comp.arch/lab/NoCOverview.htm, Accessed on September 3, 2014. [40] Omnet++. (2011) HNOCS- Network on Chip Simulation Framework. [Online]. http://www.omnetpp.org/omnetpp/doc_details/2231-inoc-network-on-chip-simulation-framework. [41] Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4), 541-580. [42] De Moura, Leonardo, and Nikolaj Bjørner. "Z3: An efficient SMT solver." Tools and Algorithms for the Construction and Analysis of Systems. Springer Berlin Heidelberg, 2008. 337340. [43] Liyang Zhou, Ming-e Jing, Xiaoyang Zeng, Task-Binding Based Branch-and-Bound Algorithm for NoC Mapping, in IEEE International Symposium on Circuits and Systems (ISCAS), 648-651, May 2012. [44] Brião, E. W., Barcelos, D., Wronski, F., & Wagner, F. R. (2007, October). Impact of task migration in NoC-based MPSoCs for soft real-time applications. In Very Large Scale Integration, 2007. VLSI-SoC 2007. IFIP International Conference on (pp. 296-299). IEEE. [45] Quan, W., & Pimentel, A. D. (2014, October). A system-level simulation framework for evaluating task migration in MPSoCs. In Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2014 International Conference on (pp. 1-9). IEEE. [46] Das, A., & Kumar, A. (2012, October). Fault-aware task re-mapping for throughput constrained multimedia applications on NoC-based MPSoCs. In Rapid System Prototyping (RSP), 2012 23rd IEEE International Symposium on(pp. 149-155). IEEE. [47] Wang, Y., Liu, H., Liu, D., Qin, Z., Shao, Z., & Sha, E. H. M. (2011). Overhead-aware energy optimization for real-time streaming applications on multiprocessor system-on-chip. ACM Transactions on Design Automation of Electronic Systems (TODAES), 16(2), 14. [48] Das, A., Singh, A. K., & Kumar, A. (2013, July). Energy-aware dynamic reconfiguration of communication-centric applications for reliable MPSoCs. In Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), 2013 8th International Workshop on (pp. 1-7). IEEE.

Tahir Maqsood received his B.S. in Computer Science from PIMSAT, Pakistan, in 2001, Master in Computer Science from KASBIT, Pakistan, in 2004, and M.S. in Computer Networks from Northumbria University, UK, in 2007. He is currently a Ph.D. candidate in the Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan. His research interests include application mapping, energy efficient systems, and network performance evaluation.

Sabeen Ali received her Bachelor degree in Telecommunication and Networking from COMSATS Institute of Information Technology, Abbottabad in 2012, and her Master degree in Computer Science from the same university in 2014. Her research interests include NoC-based MPSoC systems, and run-time mapping algorithms.

Saif U. R. Malik received the MS degree from the COMSATS Institute of Information Technology, Islamabad, Pakistan in 2009, and the PhD degree from North Dakota State University, Fargo, ND, USA in 2014. His primary research interests include formal verification, modeling, cyber physical systems, and cloud computing systems. Other areas of research interest include data replication, routing protocols, and HPC.

Sajjad A. Madani works at COMSATS Institute of Information technology as Associate Professor. He did M.S.in Computer Sciences from Lahore University of Management Sciences and Ph.D. from Vienna University of Technology. His areas of interest include low power wireless sensor network and green computing. He has published more than 40 papers in peer reviewed international conferences and journals.

Tahir Maqsood

Sabeen Ali

Saif U. R. Malik

Sajjad A. Madani