- Email: [email protected]

Polynomial interpolation and polynomial root ﬁnding on OTIS-mesh Prasanta K. Jana

*

Department of Computer Science and Engineering, Indian School of Mines, Dhanbad 826004, India Received 27 March 2005; received in revised form 25 September 2005; accepted 12 January 2006 Available online 20 March 2006

Abstract OTIS-mesh is an eﬃcient model of optoelctronic parallel computers. In this paper, we develop several parallel algorithms for two important problems of numerical analysis, i.e., polynomial interpolation and polynomial root ﬁnding on this network. The algorithms are based on two schemes of data mapping namely, row–column mapping and group mapping in which the input data elements are arranged to store into row/column and group wise respectively by suitably exploiting the links of the network. We use Lagrange method for polynomial interpolation and Durand–Kerner method [T.L. Freeman, Calculating polynomial zeros on local memory parallel computer, Parallel Comput. 12 (1989) 351–358.] for polynomial root ﬁnding. The optimality/near optimality of the algorithms is shown to achieve with the assumption that the data elements are initially stored following the above mentioned mapping. The scalability of the algorithms is also studied. 2006 Elsevier B.V. All rights reserved. Keywords: Optoelectronic parallel computer; OTIS-mesh; Lagrange interpolation; Durand–Kerner method

1. Introduction Optical transpose interconnection system (OTIS) proposed by Marsden et al. [20] is a model of optoelectronic parallel computer in which the processors are divided into groups. The processors within each group are connected using electronic links and the connections between diﬀerent groups are realized by optical links. Optical interconnects has power, speed and cross talk advantages over the electronic interconnects when the connect distance is more than a few millimeters [8,19]. It is shown in [20] that whenever the number of groups is equal to the number of the processors within each group, the bandwidth of the OTIS-system can be maximized and the power consumption is minimized. OTIS-mesh [24,26] is an eﬃcient model of OTIS family of optoelctronic pﬃﬃﬃﬃ parallel pﬃﬃﬃﬃ computers. In an N · N OTIS-mesh, N2 processors are organized into N groups in the form of N N lattice as shown in Fig. 1 p p for N = 4. Each group is basically an N · N 2D-mesh. Let (G, P) denote the Pth processor of the Gth group. The processors within a group are connected via electronic links following 2D-mesh topology. These *

Tel.: +91 326 2210025; fax: +91 326 2210028. E-mail address: [email protected]

0167-8191/$ - see front matter 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2006.01.001

302

P.K. Jana / Parallel Computing 32 (2006) 301–312

G(1,1)

G(1,2)

1,1

1,2

1,1

1,2

2,1

2,2

2,1

2,2

G(2,2)

G(2,1) 1,1

1,2

1,1

1,2

2,1

2,2

2,1

2,2

Fig. 1. OTIS-mesh of 16 processors.

links are shown by solid arrows in the ﬁgure. The processors of two diﬀerent groups are connected via optical links following OTIS rule, i.e., (G, P) is connected with (P, G) as shown by dashed arrows. We assume that all the links are bi-directional. Let (gx, gy) denote the coordinates of the group G. Then the processor placed in the pﬃﬃﬃﬃ px row and py column within the group G is denoted by P(gx, gy, px, py) for 1 6 gx, gy, px, py 6 N . The coordinates of a processor pﬃﬃﬃﬃ within a group are shown inside the small box in the ﬁgure. The diameter of an N · N OTIS-mesh is 4 N 3 [26]. The simulation of a 4D-mesh by an OTIS-mesh has been shown in [31]. In the recent years, a number of parallel algorithms for various computations have been developed on the OTIS-mesh such as image processing [27], matrix multiplication [28], data sum, consecutive sum, concentrate [29], BPC permutation [30], k–k sorting [21], randomized routing, selection and sorting [22] etc. Polynomial interpolation and polynomial root ﬁnding are two important problems of numerical analysis with variety of real-time, and real-life applications. For example, in automatic control, digital signal processing, very often we require fast extractions of all the roots of a high degree polynomial. Speciﬁc applications include the identiﬁcation of the poles of a digital ﬁlter in which polynomial with degree as high as 100 are sometimes encountered [23]. Polynomial interpolation can be very useful in geological mapping, petroleum explorations, cardiology (heart potential) etc. Several parallel algorithms for these problems have been studied on diﬀerent models of parallel computers. Parallel algorithms for polynomial interpolation based on Lagrange formula have been studied in [3,4,11,15–17,25]. Various parallel algorithms for polynomial root ﬁnding based Durand–Kerner method [6,18] have been reported in the literatures [5,9,10,13,14]. In this paper, we present various parallel algorithms for Lagrange’s interpolation and Durand–Kerner method on an N2 processors OTIS-mesh. The algorithms result from two suitable ways of mapping the initial data elements over the whole network, namely row–column mapping and group mapping. In the row–column mapping, we arrange to distribute the data elements x1, x2, . . . xN row wise and column wise following suitable communication links. In grouppmapping the data elements xpNﬃﬃﬃði1Þþj are arranged to store in all the processors ﬃﬃﬃﬃ of the group G(i, j), 1 6 i, j 6 N , and x1, x2, . . . xN in row major fashion in every group. pﬃﬃﬃﬃ In Section 2, we ﬁrst develop a parallel algorithm for N-point Lagrange interpolation in 8 N 6 electronic moves pﬃﬃﬃﬃ and 3 OTIS moves using the row–column mapping. Then we develop another algorithm that requires 6 N 4 electronic moves and 2 OTIS moves pﬃﬃﬃﬃfollowing the group mapping. In Section 3, we present a parallel algorithm for Durand–Kerner method in 6 N 6 electronic moves p and ﬃﬃﬃﬃ 2 OTIS moves that use row–column mapping followed by another algorithm using group mapping in 4 N 4 electronic moves and one OTIS move. We show that the algorithms can be made optimal/near optimal if we assume that data elements are initially stored in the appropriate mapping as mentioned above. The scalability of the algorithms is also discussed.

P.K. Jana / Parallel Computing 32 (2006) 301–312

303

2. Parallel algorithms for Lagrange interpolation Given a set of tabulated values y1, y2, . . . , yN of the function y = f(x) at some discrete points x1, x2, . . . , xN, the N-point Lagrange interpolation is deﬁned as follows [12]: f ðxÞ ¼ pðxÞ

N X i¼1

yi ðx xi Þp0 ðxi Þ

where pðxÞ ¼

N Y ðx xi Þ; i¼1

p0 ðxi Þ ¼

N Y

ðxi xj Þ;

i ¼ 1; 2; ; N

j¼1;j6¼i

For the parallel algorithms presented in this paper we use a special symbol ‘*’ for the coordinates to denote the set of processors/groups with all possible values. We also assume that each processor has some local registers as per the requirement of the algorithms. Further we assume the SIMD mode operation of the OTIS-mesh in which all the active processors move the data along the same dimension. We consider the row–column mapping of the input data in our ﬁrst Algorithm L1 and group mapping in the next Algorithm L2 which we describe step wise as follows: pﬃﬃﬃﬃ Algorithm pL1. ﬃﬃﬃﬃ /* We assume here that the boundary processors P(i, 1, k, 1), 1 6 i, k 6 N , and P(1, j, 1, l), 1 6 j, l 6 N have only I/O capability. Then the steps 1 through 7 perform the row–column mapping of the initial input data elements */ pﬃﬃﬃﬃ Step 1. For all i, k, 1 6 i, k 6 N do in parallel Input xpNﬃﬃﬃðk1Þþi to A(i, 1, k, 1) Input x to C(i, 1, k, 1) Step 2. Broadcast locally the contents of A and C registers stored in step 1 to all processors along the same row of the group. pﬃﬃﬃﬃ Step 3. For all j, l, 1 6 j, l 6 N do in parallel Input xpﬃﬃNﬃðl1Þþj to B(1, j, 1, l) Step 4. Broadcast locally the contents of B register stored in step 3 to all processors along the same column of the group. Step 5. Perform an OTIS move on A, B and C register contents stored in steps 2 and 4. Step 6. Broadcast locally the contents of A and C registers stored in step 5 to all the processors along the same row of the group. Step 7. Broadcast locally the contents of B registers to all processors stored in step 5 along the same column of the group.

Illustration 1. The contents of A and B registers after this step are, respectively, shown in Figs. 2 and 3 for N = 9 in which input of the initial data is also shown by arrows. pﬃﬃﬃﬃ Step 8. For all i, j, k, l, 1 6 i, j, k, l 6 N do in parallel If A(i, j, k, l) 5 B(i, j, k, l) then Aði; j; k; lÞ Else A(i, j, k, l)

Cði; j; k; lÞ Bði; j; k; lÞ Aði; j; k; lÞ Bði; j; k; lÞ 1

304

P.K. Jana / Parallel Computing 32 (2006) 301–312

x1

x1

x1

x1

x1

x1

x1

x1

x1

x1

x4

x2

x2

x2

x2

x2

x2

x2

x2

x2

x7

x3

x3

x3

x3

x3

x3

x3

x2

x4

x4

x4

x4

x4

x4

x4

x4

x4

x5

x5

x5

x5

x5

x5

x5

x5

x5

x5

x8

x6

x6

x6

x6

x6

x6

x6

x6

x6

x3

x7

x7

x7

x7

x7

x7

x7

x7

x7

x6

x8

x8

x8

x8

x8

x8

x8

x8

x8

x9

x9

x9

x9

x9

x9

x9

x9

x9

x9

x3

x3

Fig. 2. Contents of A-registers and initial input.

x1

x4

x7

x2

x5

x8

x3

x6

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x1

x2

x3

x4

x5

x6

x8

x9

x1

x2

x3

x4

x5

x6

x8

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x1

x2

x3

x4

x5

x6

x7

x8

x9

x7 x7

Fig. 3. Contents of B-registers and initial input.

ð ﬃﬃ Þ for A(i, j, k, l) 5 B(i, j, k, l) or 1 otherwise which ﬃﬃ ﬃﬃ ð Þ for i 5 j or 1 for i = j, respectively.

Illustration 2. After this step A(i, j, k, l) holds is shown in Fig. 4 where xij indicates

xxj xi xj

xxpN ðj1Þþl p x N ði1Þþk xpN ðj1Þþl

Step 9. Form the local product with the contents of the A registers along the same row of each group and pﬃﬃﬃﬃ store it in A(i, j, k, 1), 1 6 i, j, k 6 N .

P.K. Jana / Parallel Computing 32 (2006) 301–312

1

x2-1

305

x1-2 x1-3

x1-4

x1-5 x1-6

x1-7

x1-8 x1-9

x2-3

x2-4

x2-5 x2-6

x2-7

x2-8 x2-9

1

x3-4

x3-5 x3-6

x3-7

x3-8 x3-9

x4-5 x4-6

x4-7

x4-8 x4-9

x5-6

x5-7

x5-8

1

x6-7

x6-8 x6-9

1

x7-8 x7-9

1

x3-1

x3-2

x4-1

x4-2

x4-3

x5-1

x5-2

x5-3

x5-4

x6-1

x6-2

x6-3

x6-4

x6-5

x7-1

x7-2

x7-3

x7-4

x7-5 x7-6

x8-1

x8-2

x8-3

x8-4

x8-5 x8-6

x8-7

1

x8-9

x9-1

x9-2 x9-3

x9-4

x9-5 x9-6

x9-7

x9-8

1

1

1

x5-9

Fig. 4. After step 8.

Step 10. Perform an OTIS move on the contents of the A registers stored in step 9. Step 11. Form the local product with the contents of A registers pﬃﬃﬃﬃstored in step 10 along the same row of each group G(*, 1) and store it in A(i, 1, k, 1) , 1 6 i, k 6 N .

Illustration 3. After step 11, A(i, 1, k, 1) holds ðx x1 Þðx x2 Þ . . . x xpNﬃﬃﬃðk1Þþi1 x xpNﬃﬃﬃðk1Þþiþ1 . . . ðx xN Þ xpNﬃﬃﬃðk1Þþi x1 . . . xpﬃﬃNﬃðk1Þþi xpﬃﬃNﬃðk1Þþi1 xpﬃﬃNﬃðk1Þþi xpﬃﬃNﬃðk1Þþiþ1 . . . xpﬃﬃNﬃðk1Þþi xN

pﬃﬃﬃﬃ Step 12. For all i, k, 1 6 i, k 6 N do in parallel Bði; 1; k; 1Þ y pNﬃﬃﬃðk1Þþi pﬃﬃﬃﬃ Step 13. For all i, k, 1 6 i, k 6 N do in parallel A(i, 1, k, l) A(i, 1, k, 1) * B(i,1, k, 1) pﬃﬃﬃﬃ Step 14. Form the local sum with the contents of the A(i, 1, k, 1), 1 6 i, k 6 N along the same column of pﬃﬃﬃﬃ each group and store them in A(i, 1, 1, 1), 1 6 i 6 N . pﬃﬃﬃﬃ Step 15. Perform an OTIS move on the contents of A(i, 1, 1, 1), 1 6 i 6 N . pﬃﬃﬃﬃ Step 16. Form the ﬁnal result by adding the contents of A(1, 1, k, 1) column wise for 1 6 k 6 N and store it in A(1, 1, 1, 1). Time complexity: Since one piece of data can be moved at a time along an electronic link, the contents of the registers A and C in step 2 can be broadcast in pipelining, i.e., the content of C register trails by that of pA ﬃﬃﬃﬃ register as analyzed by Wang and Sahni for their matrix multiplication on OTIS-mesh [28]. This requires N pﬃﬃﬃﬃ electronic moves. Step 6 also requires N electronic moves by similar pipelining. Each of the steps 4, 7, 9, 11, pﬃﬃﬃﬃ 14 and 16 requires N 1 p electronic moves. Steps 5, 10, 15 each requires a single OTIS move. Therefore the ﬃﬃﬃﬃ above algorithm requires 8 N 6 electronic moves and 3 OTIS moves.

306

P.K. Jana / Parallel Computing 32 (2006) 301–312

The correctness of the above algorithm can be readily seen from the illustrations and the p ﬁgures as follows. ﬃﬃﬃﬃ xx After the step 8, the contents of the A registers are xi xjj for i 5 j or 1 for i = j, 1 6 i , j 6 N as clear from QpNﬃﬃ pﬃﬃﬃﬃ ðxxj Þ Illustration 2 and Fig. 4. After step11, A registers hold QpNﬃﬃj¼1 , 1 6 i 6 N and after step 16, the ﬁnal ðxi xj Þ QN j¼1;i6¼j ðxxj Þ PN j¼1 result i¼1 QN y i is obtained from A register of the processors P(1, 1, 1, 1). j¼1;i6¼j

ðxi xj Þ

Remark pﬃﬃﬃﬃ 1. In the above algorithm, pﬃﬃﬃﬃthe data is fed only through the boundary processors P(i, 1, k, 1), 1 6 i, k 6 N and P(1, j, 1, l), 1 6 j, l 6 N which is then broadcast row wise and column wise. This requires a total of 2N I/O pins and similar as applied to the enumeration sort on mesh of trees [1,2]. If we, however, assume that the data elements x1, x2, . . . xN are initially stored in the twoplocal ﬃﬃﬃﬃ registers namely A and B as shown in Figs. 2 and p 3 then the above algorithm can be achieved in only 4 N 4 electronic moves and 2 OTIS moves ﬃﬃﬃﬃ (a total of 4 N 2 moves) following the steps 8 through 16 and omitting the step 1 through 7. This is almost pﬃﬃﬃﬃ optimal since the diameter of OMULT is 4 N 3. This may be noted that similar assumption has been made for the initial data allocation using group row mapping (GRM) and group submatrix mapping (GSM) for matrix multiplication on OTIS-mesh [28]. Therefore we have the following theorem. Theorem 1. N-Point Lagrange interpolation can be performed by the Algorithm L1 in near optimal time, i.e., in pﬃﬃﬃﬃ 4 N 4 electronic moves and 2 OTIS moves. We now outline how the above algorithm can be modiﬁed when p2 (for p < n) processors will be available as follows. For the sake of simplicity, we assume that N = kp, where k is an integer. It can be noted that the Algorithm L1 ﬁrst stores the input data points x1, x2, . . . , xN in all the rows and all the columns (as indicated in Figs. 3 and 2, respectively) followed by sum of product computation. Now, suppose N input data points are grouped into k = N/p sets: {x1, x2, . . . , xp}, {xp+1, xp+2, . . . , x2p}, . . . , {x(k1)+1, x(k1)+2, . . . , xkp}. For a given input set to the p columns, the rows are successively fed with possible input sets and each time the required product terms are formed. Then the input sets to the columns are successively changed to update the partially computed results and the same procedure is repeated until the ﬁnal interpolated value is generated in processor pﬃﬃﬃﬃ P(1, 1, 1, 1). The time complexity of the modiﬁed algorithm will be Oðk N Þ. Algorithm L2. /* Here, pwe ﬃﬃﬃﬃ assume that the data can be fed only through the boundary processors P(i, j,1,1), 1 6 i, j 6 N . Then steps 1 through 7 perform the group mapping of the initial input data elements */ pﬃﬃﬃﬃ Step 1. For all i, j, 1 6 i, j 6 N do in parallel Input xpNﬃﬃﬃði1Þþj to A(i, j, 1, 1) Input y pﬃﬃNﬃði1Þþj to D(i, j, 1, 1) Input x to B(i, j, 1, 1) Step 2. On each group, broadcast locally the contents of A and B registers stored in step 1 to all the processors of the corresponding group. After this step the contents of A registers is shown in Fig. 5. pﬃﬃﬃﬃ Step 3. For all i, j, k, l, 1 6 i, j, k, l 6 N do in parallel Cði; j; k; lÞ

Aði; j; k; lÞ

Step 4. Perform an OTIS move on the contents of A registers of all groups. The contents of A registers after this step, is shown in Fig. 6. pﬃﬃﬃﬃ Step 5. For all i, j, k, l, 1 6 i, j, k, l 6 N do in parallel If C(i, j, k, l) 5 A(i, j, k, l) then Aði; j; k; lÞ Else A(i, j, k, l)

Bði; j; k; lÞ Aði; j; k; lÞ Cði; j; k; lÞ Aði; j; k; lÞ 1

P.K. Jana / Parallel Computing 32 (2006) 301–312

x1

x2

307

x3

x1

x1

x1

x2

x2

x2

x3

x3

x3

x1

x1

x1

x2

x2

x2

x3

x3

x3

x1

x1

x1

x2

x2

x2

x3

x3

x3

x4

x5

x6

x4

x4

x4

x5

x5

x5

x6

x6

x6

x4

x4

x4

x5

x5

x5

x6

x6

x6

x4

x4

x4

x5

x5

x5

x6

x6

x6

x7

x8

x9

x7

x7

x7

x8

x8

x8

x9

x9

x9

x7

x7

x7

x8

x8

x8

x9

x9

x9

x7

x7

x7

x8

x8

x8

x9

x9

x9

Fig. 5. Contents of A-registers and initial inputs.

x1

x2

x3

x1

x2

x3

x1

x2

x3

x4

x5

x6

x4

x5

x6

x4

x5

x6

x7

x8

x9

x7

x8

x9

x7

x8

x9

x1

x2

x3

x1

x2

x3

x1

x2

x3

x4

x5

x6

x4

x5

x6

x4

x5

x6

x7

x8

x9

x7

x8

x9

x7

x8

x9

x1

x2

x3

x1

x2

x3

x1

x2

x3

x4

x5

x6

x4

x5

x6

x4

x5

x6

x7

x8

x9

x7

x8

x9

x7

x8

x9

Fig. 6. After step 4.

Illustration 4. After this step the contents of A(i, j, k, l) is shown in Fig. 7 where xij bears the same notation, xx i.e., xi xjj for i 5 j or 1 for i = j, respectively. Step 6. Form the local pﬃﬃﬃﬃ product with the contents of the A registers on each group and store it in A(i, j, 1, 1), 1 6 i, j 6 N .

308

P.K. Jana / Parallel Computing 32 (2006) 301–312

x1-2

x1-3

x2-1

x2-3

x3-1

x3-2

x1-4 x1-5

x1-6

x2-4

x2-5 x2-6

x3-4

x3-5 x3-6

x1-7

x1-8

x1-9

x2-7

x2-8 x2-9

x3-7 x3-8 x3-9

x4-1

x4-2

x4-3

x5-1 x5-2 x5-3

x6-1 x6-2 x6-3

x4-5 x4-6

x5-4

x6-4

x6-5

x4-7

x4-8

x4-9

x5-7

x6-7

x6-8 x6-9

x7-1

x7-2

x7-3

x8-1 x8-2 x8-3

x9-1 x9-2

x7-4 x7-5

x7-6

x8-4

x8-5 x8-6

x9-4

x9-5 x9-6

x8-9

x9-7

x9-8

1

1

1

x7-8 x7-9

x8-7

1

1

x5-6

x5-8 x5-9

1

1

1

x9-3

1

Fig. 7. After step 5.

Step 7. For all i, j = 1 to Aði; j; 1; 1Þ

pﬃﬃﬃﬃ N do in parallel Aði; j; 1; 1Þ Dði; j; 1; 1Þ

Step 8. Perform an OTIS move on the contents of the A registers stored in step 7. Step 9. Add the contents of the A registers of the group G(1, 1) to form the ﬁnal result. Time complexity: Step 1 requires constant time. Step 2 can be performed by ﬁrst sending the contents of the A and B registers initially of a group down column one of that group in pipeline fashion (i.e., the content of pﬃﬃﬃﬃ the B register trails the content of the A register by one column processor). This requires N electronic moves. pﬃﬃﬃﬃ Next the contents of A and B registers are broadcast along rows in another N electronic moves using a simpﬃﬃﬃﬃ ilar pipelining. Thus step 2 requires 2 N electronic moves. Further, each of the steps 6 and 9 requires pﬃﬃﬃﬃ (2 N 1) electronics moves and steps 4 and 8 each requires a single OTIS move. Therefore, the above algopﬃﬃﬃﬃ rithm requires 6 N 4 electronics moves plus 2 OTIS moves which lesser than that of the Algorithm L1. The correctness of the above algorithm can be proved as readily seen from the illustrations and the corresponding ﬁgures. Remark 2. In Algorithm L2, we have assumed pﬃﬃﬃﬃthat input/output operations can be performed only through the boundary processors P(i, j, 1, 1), 1 6 i, j 6 N so that there is a need of N I/O pins which is lesser than that of the Algorithm L1. If we assume that the data elements x1, x2, . . . xN arepinitially stored in the two local ﬃﬃﬃﬃ registers as shown in Figs.p5ﬃﬃﬃﬃand 6 then the above algorithm requires only 4 N 4 electronic moves and one OTIS moves (a total of 4 N 3 moves) which is optimal. This leads the following theorem. pﬃﬃﬃﬃ Theorem 2. The Algorithm L2 can perform N-Point Lagrange interpolation in optimal time, i.e., in 4 N 4 electronic moves and one OTIS moves. pﬃﬃﬃﬃ We can also modify the Algorithm L2 similarly as discussed for Algorithm L1 to run it in O(k N Þ time for the case when p2 (p < n) processors. 3. Parallel algorithms for ﬁnding polynomial roots Consider the following Nth degree polynomial equation, P N ðxÞ ¼ a0 xN þ a1 xN 1 þ a2 xN 2 þ þ aN 1 x þ aN ¼ 0

P.K. Jana / Parallel Computing 32 (2006) 301–312

309

Let us assume that the coeﬃcients ai’s, i = 0, 1, 2, . . . , N, are real. The Durand–Kerner iterative scheme for ﬁnding the roots of the above polynomial equation is the following [10]: ðkÞ P N xi ðkþ1Þ ðkÞ ; i ¼ 1; 2; . . . ; N ¼ xi Q xi ðkÞ ðkÞ N x x i j j¼1;j6¼i The Durand–Kerner method is locally convergent with quadratic rate of convergence, respectively [10]. It can be noted that the principal computation of the above method is very similar to that of the Lagrange interpolation. Therefore we develop parallel algorithms for Durand–Kerner method similarly as the Algorithms L1 and L2 as described for Lagrange interpolation in the previous section. Algorithm D1. Steps 1 through 7. Same as Algorithm L1 omitting C registers pﬃﬃﬃﬃ Step 8. For all i, j, k, l = 1 to N do in parallel If A(i, j, k, l) 5 B(i, j, k, l) then Aði; j; k; lÞ Bði; j; k; lÞ

Aði; j; k; lÞ

Else A(i, j, k, l) 1 Step 9. Form the local product with the contents of the A registers along the same row of each group and pﬃﬃﬃﬃ store it in A(i, j, k, 1), 1 6 i, j, k 6 N . Step 10. Perform an OTIS move on the contents of the A registers stored in step 9. Step 11. Form the local product with the contents pﬃﬃﬃﬃof the A registers along the same row of each group G(*, 1) and store it in A(i,p1,ﬃﬃﬃﬃk, 1) , 1 6 i, k 6 N . Step 12. For all i, k = 1 to N do in parallel Input y pNﬃﬃﬃðk1Þþi to D(i, 1, k, 1) Input xpﬃﬃNﬃðk1Þþi p toﬃﬃﬃﬃC(i, 1, k, 1) Step 13. For all i, k = 1 to N do in parallel Aði; 1; k; lÞ

Cði; 1; k; 1Þ

Dði; 1; k; 1Þ Aði; 1; k; 1Þ

pﬃﬃﬃﬃ Time complexity: Each of the steps 2, 4, 6, 7, 9 and 11 requires N 1 electronic moves and pﬃﬃﬃﬃthe steps 5 and 10 each requires a single OTIS move. Therefore, this version of the algorithm requires 6 N 6 electronic moves and 2 OTIS moves. Like the Algorithm L1 if we assume thatpthe ﬃﬃﬃﬃ data elements x1, x2, . . . xN initially exist in the local registers as shown inp Figs. 2 and 3, it requires only 2 N 2 electronic moves and one OTIS ﬃﬃﬃﬃ move (steps 9 and 11 each requiring N 1 electronic moves and the step 10 requiring one OTIS move). The correctness and the scalability of the algorithm can be similarly developed as Algorithm L1. Algorithm D2. Steps 1 through 4. Same as Algorithm L2 omitting B and D registers pﬃﬃﬃﬃ Step 5. For all i, j, k, l = 1 to N do in parallel If C(i, j, k, l) 5 A(i, j, k, l) then Aði; j; k; lÞ

Cði; j; k; lÞ Aði; j; k; lÞ

Else A(i, j, k, l) 1 Step 6. Form the local pﬃﬃﬃﬃ product with the contents of the A registers on each group and store it in A(i, j, 1, 1), 1 6 i, j 6 N . pﬃﬃﬃﬃ Step 7. For all i, j = 1 to N do in parallel Input y pﬃﬃNﬃði1Þþj to D(i, j, 1, 1) Input xpﬃﬃNﬃði1Þþj to C(i, j, 1, 1) pﬃﬃﬃﬃ Step 8. For all i, j = 1 to N do in parallel Aði; j; 1; 1Þ

Cði; j; 1; 1Þ

Dði; j; 1; 1Þ Aði; j; 1; 1Þ

310

P.K. Jana / Parallel Computing 32 (2006) 301–312

Table 1 Comparison between algorithms Algorithm

Mapping

1st version of Lagrange (L1) 2nd version of Lagrange (L2) 1st version of Durand–Kerner (D1) 2nd version of Durand–Kerner (D2)

Row–column Group Row–column Group

Electronic moves pﬃﬃﬃﬃ 8pN ﬃﬃﬃﬃ 6 6pN ﬃﬃﬃﬃ 4 6pN ﬃﬃﬃﬃ 6 4 N 4

OTIS moves 3 2 2 1

pﬃﬃﬃﬃ Time complexity: Each of the steps 2 and 6 requires p (2ﬃﬃﬃﬃ N 1) electronic moves and step 4 requires a single OTIS move. Therefore, the above algorithm requires 4 N 4 electronic moves pﬃﬃﬃﬃ plus one OTIS move. Assuming the data elements are initially stored, this algorithm can be achieved in 2 N 2 electronic moves and one OTIS move. The scalability of the Algorithm D2 can be similarly achieved as Algorithm L2. 4. Conclusion In this paper, we have presented several versions of parallel algorithms for Lagrange interpolation and Durand–Kerner method on an N2 processors OTIS-mesh. The algorithms are based on two ways of mapping the initial data elements namely row–column mapping and group mapping. The results are summarized in Table 1. It can be seen that group mapping is superior to the row column with respect to time complexity and I/O pin requirements. We have also shown that if we assume that data elements are initially pﬃﬃﬃﬃ stored in the corresponding mapping, then the Algorithm L1 is nearpoptimal with the time complexity of 4 N 4 elecﬃﬃﬃﬃ tronic moves and 2 OTIS moves and L2 is optimal with 4 N 4 electronic moves and one p OTIS move. With ﬃﬃﬃﬃ the similar assumption, both the Algorithms D1 and D2 can be shown to run in only 2 N 2 electronic moves and one OTIS move. The scalability of the algorithms has been also studied. It is important to note that parallel computation of Hermite polynomial (as given in the Appendix), which provides more accuracy, is similar to that of the Lagrange polynomial. This is because parallelization of both these methods mainly depend on p(x) and p 0 (xi). Therefore, the parallel algorithms for Lagrange interpolation presented in this paper can be easily modiﬁed to implement the Hermite interpolation with some constant communication moves. Similar remark can also be made another polynomial root ﬁnding method namely, Ehrlich method (Please see Appendix), as it is computationally similar to the Durand–Kerner method. It is important to note that Ehrlich method has higher (cubic) rate of convergency than the Durand–Kerner method (quadratic rate). Acknowledgements The ﬁrst version of this paper appeared in the proceedings of the 6th International Workshop on Distributed Computing (IWDC 2004), December 27–30, 2004, Indian Statistical Institute, Kolkata, India. The author thanks the anonymous reviewers of both IWDC 2004 and Parallel Computing Journal for their valuable comments and suggestions, which greatly helped in improving the results. Appendix Hermite formula [12]: N N X X f ðxÞ ¼ hi ðxÞf ðxi Þ þ hi ðxÞf 0 ðxi Þ i¼1

i¼1

where 2

hi ðxÞ ¼ ½1 2L0i ðxi Þðx xi Þ½Li ðxi Þ ; pðxÞ hi ðxÞ ¼ ðx xi Þ½Li ðxi Þ2 ; Li ðxÞ ¼ 0 ðxi Þ i ¼ 1; 2; . . . ; N p 0 0 and f (xi) and Li ðxi Þ denote the derivative values of f(x) and L(x), respectively at the point x = xi.

P.K. Jana / Parallel Computing 32 (2006) 301–312

311

Ehrlich iterative scheme [7]: ðkÞ

ðkþ1Þ

xi

ai

ðkÞ

¼ xi

ðkÞ

1 ai bðkÞ i

;

i ¼ 1; 2; . . . ; N

where ðkÞ

ai

ðkÞ P N xi ¼ ; ðkÞ P 0N xi

bðkÞ i ¼

N X

j¼1;j6¼i

1 ðkÞ xi

ðkÞ

xj

and dP ðxÞ N ðkÞ P 0N xi ¼ dx

ðkÞ

at x ¼ xi

References [1] S.G. Akl, Parallel Sorting Algorithm, Academic Press, Orlando, FL, 1985. [2] S.G. Akl, The Design Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliﬀs, NJ, 1989. [3] H.S. Azad, et al., An eﬃcient parallel algorithm for Lagrange interpolation and its performance, in: Proc. of High performance Computing in the Asia–Paciﬁc Region, Beijing, China, pp. 593–598, 2000. [4] H.S. Azad et al., Parallel Lagrange interpolation on the star graph, J. Parallel Distr. Comput. 62 (4) (2002) 605–621. [5] M. Cosnard, P. Fraigniaud, Finding the roots of a polynomial on an MIMD multicomputer, Parallel Comput. 15 (1990) 75–85. [6] E. Durand, Solutions Numeriques des Equations Algebriques, Tom I. Paris; Manson, 1960. [7] L.W. Ehrlich, A modiﬁed Newton method for polynomials, Comm. ACM 10 (1967) 107–108. [8] M. Feldman, S. Esener, C. Guest, S. Lee, Comparison between electrical and free space optical interconnects based on power and speed considerations, Appl. Opt. 27 (9) (1988). [9] T.L. Freeman, M.K. Bane, Asynchronous polynomial zero-ﬁnding algorithms, Parallel Comput. 17 (1991) 673–681. [10] T.L. Freeman, Calculating polynomial zeros on local memory parallel computer, Parallel Comput. 12 (1989) 351–358. [11] Ben Goertzel, Lagrange interpolation on a processor tree with ring connections, J. Parallel Distr. Comput. 22 (2) (1994) 321–323. [12] F.B. Hildebrand, Introduction to Numerical Analysis, McGraw-Hill, New York, 1956. [13] P.K. Jana, B.P. Sinha, R. Datta Gupta, Eﬃcient parallel algorithms for ﬁnding polynomial zeroes, in: Proc. of the 6th Int. Conf. on Advance Computing, CDAC, Pune University Campus, India, December 14–16, pp. 189–196, 1999. [14] P.K. Jana, Finding polynomial zeroes on a Multi-mesh of trees (MMT), in: Proc. of the 2nd Int. Conf. on Information Technology, Bhubaneswar, December 20–22, pp. 202–206, 1999. [15] P.K. Jana, B.P. Sinha, Fast parallel algorithm for polynomial interpolation, Comput. Math. Appl. 29 (4) (1997) 85–92. [16] P.K. Jana, B.P. Sinha, Eﬃcient parallel algorithms for Lagrange and Hermite interpolation, Int. J. Appl. Sci. Comput. 4 (2) (1997) 118–136. [17] J.J. Joshep, An Introduction to Parallel Algorithms, Addison Wesley, Reading, Massachusetts, 1992. [18] I.O. Kerner, Ein Gesamtschrittverfahren Zur Berechnung der Nullstetten on polynomen, Numer. Math. 8 (1966) 290–294. [19] A. Kiamilev, P. Marchand, A. Krishnamoorthy, S. Esener, S. Lee, Performance comparison between optoelectronic and VLSI multistage interconnection networks, J. Light Wave Technol. 9 (12) (1991). [20] G. Marsden, P. Marchand, P. Harvey, S. Esner, Optical transpose interconnection system architectures, Opt. Lett. 18 (13) (1993) 1083–1085. [21] A. Osterloh, Sorting on the OTIS–mesh, in: Proc. 14th Int. Parallel and Distributed Processing Symp., IPDPS 2000 (2000) 269–274. [22] S. Rajasekaran, S. Sahni, Randomized routing, selection, and sorting on the OTIS-mesh, IEEETPDS: IEEE Trans. Parallel Distributed Syst. 9 (9) (1998) 833–840. [23] T.A. Rice, L.H. Jamieson, A highly parallel algorithm for root extraction, IEEE Trans. Comput. 38 (3) (1989) 443–449. [24] S. Sahni, Models and algorithms for optical and optoelectronic parallel computers, Int. J. Foundations Comput. Sci. 12 (3) (2001) 249–264. [25] S. Sengupta, D. Das, B.P. Sinha, M. Ghosh, A Fast parallel algorithm for polynomial interpolation using Lagrange’s formula, in: Proc. Int. Conf. on High Performance Computing (HiPC), New Delhi, pp. 701–706, 1995. [26] Chih-Fang Wang, S. Sahni, OTIS optoelectronic computers, in: K. Li, Y. Pan, S.Q. Zhag (Eds.), Parallel Computation Using Optical Interconnection, Kluwer Academic, 1998. [27] Chih-Fang Wang, S. Sahni, Image processing on the OTIS-Mesh optoelectronic computer, IEEE Trans. Parallel Distri. Syst. 11 (2) (2000) 97–109. [28] Chih-Fang Wang, Sartaj Sahni, Matrix multiplication on the OTIS-Mesh optoelectronic computer, IEEE Trans. Comput. 50 (7) (2001).

312

P.K. Jana / Parallel Computing 32 (2006) 301–312

[29] Chih-Fang Wang, Sartaj Sahni, Basic operation on the OTIS-mesh optoelectronic computer, IEEE Trans. Parallel Distri. Syst. 9 (12) (1998). [30] Chih-Fang Wang, Sartaj Sahni, BPC permutations on the OTIS-Mesh optoelectronic computer, in: Proc. on the Fourth Int. Conf. on Massively Parallel Processing Using Optical Transpose Interconnection (MPPOI ’97), pp. 130–135, 1997. [31] F. Zane, P. Marchand, R. Paturi, S. Esener, Scalable network architectures using the optical transpose interconnection system (OTIS), J. Parallel Distri. Comput. 60 (5) (2000) 521–538.