- Email: [email protected]

6

Data Dimensionality Reduction in Materials Science S. Samudrala , K. Rajan† and B. Ganapathysubramanian

Department of Mechanical Engineering Department of Materials Science and Engineering, Iowa State University, Ames, IA, USA

†

1. INTRODUCTION Using data-mining techniques to probe and establish structureprocess property relationships has witnessed a growing interest owing to its ability to accelerate the process of tailoring materials by design. Before the advent of data-mining techniques, scientists used a variety of empirical and diagrammatic techniques (Rabe et al., 1992), like pettifor maps (Morgan et al., 2003), or finite-element methods (van Rietbergen et al., 1995; Langer et al., 2001; Yue et al., 2003 Chawla et al., 2004; Liu et al., 2004) to establish relationships between structure and mechanical properties. Pettifor maps, one of the earliest graphical representation techniques, are exceedingly efficient except that they require a thorough understanding and intuition about the materials. Recent progress in computational capabilities has seen the advent of more complicated paradigms socalled virtual interrogation techniques which span from first principle calculations to multiscale models. These complex multiphysics and/or statistical techniques and simulations (Chawla et al., 2004; Langer et al., 2001; Liu et al., 2004; van Rietbergen et al., 1995; McVeigh and Liu, 2008; Zabaras et al., 2006) result in an integrated set of tools that can predict the relationships between chemical, microstructure, and mechanical properties, producing an exponentially large collection of data. Simultaneously, experimental methods combinatorial materials synthesis (Meredith et al., 2000; Takeuchi et al., 2005), high-throughput experimentation, atom probe tomography allow synthesis and screening of a large number of materials while generating huge amounts of multivariate data. A key challenge is to efficiently probe the data to extract correlations between structures, process, and property. This data explosion motivated the use of data-mining techniques in material science to explore,

Informatics for Materials Science and Engineering DOI: http://dx.doi.org/10.1016/B978-0-12-394399-6.00006-0

© 2013 Elsevier Inc. All rights reserved.

97

98

S. Samudrala et al.

design, and tailor materials and structures. A key stage in this process is to reduce the size of the data, while minimizing the loss of information during this data reduction. This process is called data dimensionality reduction. By definition, dimensionality reduction (DR) is the process of reducing the dimensionality of the given set of (usually unordered) data points and extracting the low-dimensional (or parameter space) embedding with a desired property (e.g. distance, topology, etc.) being preserved throughout the process. Examples of DR methods are principal component analysis (PCA; Lumley, 1967), Isomap (Tenenbaum et al., 2000), Hessian locally linear embedding (hLLE; Donoho and Grimes, 2003), etc. Applying DR methods enables visualization of the high-dimensional data and also estimates the optimal number of dimensions required to represent the data without considerable loss of information. Data dimensionality reduction is not a novel concept. Page (2006) describes different techniques of data reduction and their applicability for establishing structureproperty relationships. Statistical methods like PCA (Rajan et al., 2009) and factor analysis (FA; Brasca et al., 2007) have been used on materials data generated by first principles or experimental methods. However, dimensionality reduction techniques like PCA or FA to establish structureproperty relationships traditionally assume a linear relationship among the variables. This is often not strictly valid; the data usually lies on a nonlinear manifold (or surface). Nonlinear dimensionality reduction (NLDR) techniques can be applied to unravel the nonlinear structure from unordered data. An example of such application for constructing a low-dimensional stochastic representation of property variations in random heterogeneous media is given in Ganapathysubramanian and Zabaras (2008). Another exciting application of data dimensionality reduction is in combination with quantum mechanics-based calculations to predict structure (Curtarolo et al., 2003; Fischer et al., 2006; Morgan et al., 2005). For a more mathematical list of linear and nonlinear DR techniques, the interested reader should consult Lee and Verleysen (2008) and van der Maaten et al. (2009). In this chapter, the theory and mathematics behind various linear and nonlinear dimensionality reduction methods are explained. Algorithms are given and advantages and disadvantages of the methods are discussed. This chapter also discusses and tackles questions pertinent to optimal dimensionality and the model reduction process for different input parameters, such as: What is the optimal dimensionality of the manifold on which the data lies? How well does the elbow in the scree plot reflect the

Data Dimensionality Reduction in Materials Science

99

optimal dimensionality? The mathematical aspects of NLDR are packaged into an easy-to-use software framework called Scalable Extensible Toolkit for Dimensionality Reduction (SETDiR), which (a) provides a userfriendly interface that successfully abstracts users from the mathematical intricacies, (b) allows for easy post-processing of the data, and (c) represents the data in a visual format and allows the user to store the output in the .JPG format, thus making data more tractable and providing an intuitive understanding of the data. We conclude by applying the techniques discussed on two example data sets. The first of these is a set of apatites (Elliott, 1994; Mercier et al., 2005; Pramana et al., 2008; White and Dong, 2003; White et al., 2005) described using several structural descriptors. Apatites (AI4 AII6 ðBO4 Þ6 X2 ) have the ability to accommodate numerous chemical substitutions and hence can be used in the process of detoxification. The second example data set constitutes a set of morphological images of plastic solar cells manufactured under different processing conditions (Wodo et al., 2012). These processing conditions are regulated with the objective to study optimal processing pathways to achieve efficient organic solar cells. Section 2 describes the concepts of DR, Section 3 describes the algorithms of each DR method in detail, and Section 4 describes the dimensionality estimators used to estimate the dimensionality. The software framework, SETDiR, developed to apply DR techniques is described in Section 5. Section 6 discusses the interpretation of low-dimensional results obtained by applying SETDiR to the two different materials data sets. Section 7 outlines supplementary literature and concludes the chapter.

2. DIMENSIONALITY REDUCTION: BASIC IDEAS AND TAXONOMY The problem of dimensionality reduction can be formulated as follows. Consider a set X 5 {x0, x1, . . ., xn21} of n points, where xi A RD, and D . . 1. We are interested in finding a set Y 5 {y0, y1, . . ., yn21}, such that yiARd, d , , D and ’i;j jxi 2 xj jh 5 jyi 2 yj jh . Here, ja 2 bjh denotes a specific norm that captures properties we want to preserve during dimensionality reduction (Lee and Verleysen, 2008). For instance, by defining h as a Euclidean norm we preserve Euclidean distance, thus obtaining a reduction equivalent to the standard technique of PCA (Lumley, 1967). Similarly, defining h to be the angular distance (or conformal distance; Bergman, 1950) results in locally linear embedding

100

S. Samudrala et al.

(LLE; Roweis and Saul, 2000) that preserves local angles between points. In a typical application (Fontanini et al., 2011; Wodo, Ganapathysubramanian, 2012), xi represents a state of the analyzed system, e.g. temperature field, concentration distribution, etc. Such a state description can be derived from the experimental sensor data or may be the result of a numerical simulation. However, irrespective of the source, it is characterized by high dimensionality; that is, D is typically of the order of 106 (Guo et al., 2012; Wodo et al., 2012). While xi represents just a single state of the system, common data acquisition setups deliver large collections of such observations, which correspond to the temporal or parametric evolution of the system (Fontanini et al., 2011). Thus, the cardinality n of the resulting set X is usually large (nB104105). Intuitively, information obfuscation increases with data dimensionality. Therefore, in the process of dimensionality reduction (DR) we seek as small a dimension d as possible, given constraints induced by the norm ja 2 bjh (Lee and Verleysen, 2008). Routinely, d , 4 as this permits, for instance, visualization of the set Y. The key idea underpinning DR can be explained as follows. We encode desired information about X, i.e. topology or distance, in its entirety by considering all pairs of points in X. This encoding is represented as a matrix An 3 n. Next, we subject matrix A to unitary transformation V, i.e. a transformation that preserves the norm of A, to obtain its sparsest form Λ, where A 5VΛVT. Here, An 3 n is a diagonal matrix with rapidly diminishing entries. As a result, it is sufficient to consider only d entries of Λ to capture all the information encoded in A. These d entries constitute the set Y. The above procedure hinges on the fact that unitary transformations preserve the original properties of A (Golub and Van Loan, 1996). Note also that it requires a method to construct matrix A in the first place. Indeed, what differentiates various spectral methods is the way information is encoded in A. Before going further into the details of the functionality of DR methods, a brief taxonomy of these techniques is useful. A classification of DR methods can be carried out in various ways. Based on the topology of the manifold on which the data lies, they can be classified as: 1. Linear DR methods. Linear DR methods assume that the data set lies on a linear manifold. They are efficient when the manifold is linear but fail to retrieve the hidden structure if the manifold is nonlinear, e.g. PCA, Multi-Dimensional Scaling (MDS). 2. Nonlinear DR methods. Nonlinear methods do not assume anything about the linearity of the manifold. Hence, they can extract the

Data Dimensionality Reduction in Materials Science

1. 2.

1.

2.

101

underlying structure of the manifold irrespective of whether the manifold is linear or nonlinear in the embedding space, e.g. Isomap, LLE, Hessian LLE. Based on the property they preserve, they can be classified as: Isometric DR methods. These preserve pairwise distances among all the input vectors in the given data set, e.g. PCA, Isomap. Topology-preserving DR methods. These methods preserve the topology or connectivity of the data set. These methods tend to stretch or twist but do not tear the manifold, e.g. LLE, hLLE, Laplacian eigenmaps. Based on the type of mapping, the classifications are: Explicit mapping methods. Explicit mapping methods are those that cannot map a new test point from embedding space to parameter space without demanding a fresh run of the algorithm, e.g. curvilinear distance analysis (CDA; Lee et al., 2002), geodesic nonlinear mapping (GNLM; Yang, 2004). Implicit mapping methods. Implicit mapping methods are those that do not need a fresh run of the algorithm for mapping a new test point from embedding to parameter space. In most cases they are spectral methods, e.g. PCA, Isomap.

3. DIMENSIONALITY REDUCTION METHODS: ALGORITHMS, ADVANTAGES, AND DISADVANTAGES We focus on four different DR methods: (a) principal component analysis (PCA), a linear DR method; (b) Isomap, a nonlinear isometry-preserving DR method; (c) locally linear embedding (LLE), a nonlinear conformalpreserving DR method; and (d) Hessian LLE, a topology-preserving DR method.

3.1 Principal Component Analysis (PCA) PCA is a powerful and popular DR strategy due to its simplicity and ease of implementation. It is based on the premise that the high-dimensional data is a linear combination of hidden low-dimensional axes. PCA then extracts the latent parameters or low-dimensional axes by reorienting the axes of the high-dimensional space in such a way that the variance of the variables is maximized (Lee and Verleysen, 2008).

102

S. Samudrala et al.

PCA Algorithm 1. Compute the pairwise Euclidean distance matrix [E] from the input matrix [X]. 2. Construct a matrix [W ] such that the elements of [W ] are squares of the elements of the Euclidean distance matrix [E]. 3. Find the dissimilarity matrix [A] by double centering [W ]:[A] 5 [HT] [W ][H]: ! 8 > 1 > > > < 1 2 n ’i 5 j ! : Hij 5 ð6:1Þ > 1 > > > : 2 n ’i 6¼ j 4. Solve for the largest d eigenpairs of [A]:[A] 5 [U][Λ][UT]. 5. Construct the low-dimensional representation in Rd from the eigenpairs: ½Y 5 ½IU½Λ1=2 U½U T : The limitation of PCA is that it assumes the data lies on a linear space, and hence performs poorly on data that lies on a nonlinear manifold. In these cases, PCA tends to overestimate the dimensionality of the data.

3.2 Isomap Isomap relaxes the assumption of PCA that the data lies on a linear space. A classic example of a nonlinear manifold is the Swiss roll (or a set of points on a spiral) as shown in Figure 6.1. Isomap essentially smoothes out the nonlinear manifold into a corresponding linear space and (A)

(B) PCA 3D data

40 20 15

–15

50

–10 0 15

–5

10 5 0

0

10 5

0 –5

10 –10

–15 15

15

–5 5 –10

–5

10

0

–10

Figure 6.1 Comparison of performance of PCA and Isomap on a data set lying on a nonlinear manifold. (From Ganapathysubramanian and Zabaras, 2008 reprinted with permission.)

Data Dimensionality Reduction in Materials Science

103

subsequently applies PCA. This smoothing out can intuitively be understood in the context of the spiral, where the ends of the spiral are pulled out to straighten the spiral into a straight line. Isomap accomplishes this objective mathematically by ensuring that the geodesic distance between data points is preserved under transformations. The geodesic distance is the distance measured along the curved surface on which the points rest (Lee and Verleysen, 2008). Since it preserves (geodesic) distances, Isomap is an isometric (distance-preserving) transformation. The underlying mathematics of the Isomap algorithm assumes that the data lies on a manifold that is convex (but not necessarily linear). Note that both PCA and Isomap are isometric mappings; PCA preserves pairwise Euclidean distances of the points while Isomap preserves the geodesic distance. Isomap Algorithm 1. Compute the pairwise Euclidean distance matrix [E] from the input matrix [X]. 2. Compute the k nearest neighbors from the distance matrix [E]. 3. Compute the pairwise geodesic distance matrix [G] from [E]. 4. Construct a matrix [W ] such that the elements of [W ] are squares of the elements of the geodesic distance matrix [G]. 5. Find the dissimilarity matrix [A] by double centering: ½W :½A

58½H T ½W ! ½H > 1 > > > < 1 2 n ’i 5 j ! : Hij 5 > 1 > > > 2 ’i 6¼ j : n

ð6:2Þ

6. Solve for the largest d eigenpairs of A: [A] 5 [U][Λ][UT]. 7. Construct the low-dimensional representation in Rd from the eigenpairs: ½Y 5 ½IU½Λ1=2 U½U T : The nonlinearity in the data is accounted for by using a geodesic distance metric. Since computation of geodesic distance without the knowledge of the low-dimensional surface is close to impossible, graph distance is used to approximate the geodesic distance (Bernstein et al., 2000). Graph distance between a pair of points in a graph (V, E) is the shortest

104

S. Samudrala et al.

path connecting the two given points. The graph distances are calculated for a given graph using Floyd’s algorithm for shortest distances (Floyd, 1962).

3.3 Locally Linear Embedding In contrast to the PCA and Isomap methods, which preserve distances, locally linear embedding (LLE) preserves the local topology (or local orientation, or angles between data points). The LLE method uses the notion that locally the (nonlinear) manifold on which the data lies is well approximated by a d-dimensional Euclidean space (Rd). In other words, the manifold is locally linear. The algorithm first divides the manifold into patches, reconstructs each point in the patch based on the information (or weights) obtained from its neighbors (i.e. infers how a specific point is located with respect to its neighbors). This process extracts the local topology of the data. Finally, the algorithm reconstructs the global structure by combining individual patches and finds an optimized, lowdimensional representation. Numerically, local topology information is constructed by finding the k nearest neighbors of each data point and reconstructing each point from the information about the weights of the neighbors. The global reconstruction from the local patches is accomplished by assimilating the individual weight matrices to form a global weight matrix [W] and evaluating the eigenvalues of the normalized global weight matrix [A]. LLE Algorithm 1. For (i 5 1:n), for each of the n input vectors from X 5 {x0, x1, . . ., xn21} of n points, where xiARD: a. Find the “k” nearest neighbors of the input vector xi. b. Construct the local covariance or Gram matrix G(i): gr;s ðiÞ 5 ðxi 2vðrÞÞT ðxi 2 vðsÞÞ; where v(r) and v(s) are neighbors of xi. c. Weights can be computed by solving for the linear system: G(i) w(i) 5 1, where 1 is a k 3 1 vector of ones. 2. Knowing the vectors w(i), build the sparse matrix W such that for each ith row W(i, j) 5 0 if xi and xj are not neighbors and the corresponding linear coefficient obtained by solving the linear equation [G(i)] w(i) 5 1 otherwise.

Data Dimensionality Reduction in Materials Science

105

3. From [W] build [A]: [A] 5 (I 2 W)T(I 2 W). 4. Compute the eigenpairs for [A]: [A] 5 [U][Λ][UT]. 5. Compute the low-dimensional points in Rd from the eigenpairs: [Y] 5 n1/2 [U].

3.4 Hessian LLE Hessian locally linear embedding (hLLE; Donoho and Grimes, 2003) is an improvement upon LLE and Laplacian eigenmaps (Belkin and Niyogi, 2003), which replaces the Laplacian (first derivative) operator with a Hessian (second derivative) operator over the given connected graph. hLLE constructs patches, performs a local PCA on each patch, constructs a global Hessian from the eigenvectors thus obtained, and finally finds the low-dimensional representation from the eigenpairs of the Hessian. hLLE is a topology preservation method and assumes that the manifold is locally linear. hLLE Algorithm 1. At each given point xi, construct a k 3 n neighborhood matrix [Mi] such that each row of the matrix represents a point xi 5 xi 2 xi , where j A [0,N] and xi is the mean of the k neighboring points. 2. Perform singular value decomposition (SVD) of the constructed [Mi] to obtain [U], [V], and [D]. 3. Construct the (N d(d 1 1)/2) local Hessian matrix [X]i such that the first column is a vector of all ones and the next d columns are the columns of U followed by the products of all the d columns of [U]. 4. Compute GramSchmidt orthogonalization (Golub and Van Loan, 1996) on the local Hessians [X]i and assimilate the last d(d 1 1)/2 orthonormal vectors of each to construct the global Hessian matrix [H] (Donoho and Grimes, 2003). 5. Compute the eigenpairs of the Hessian matrix: [H] 5 [W][Λ][W]T. 6. Compute the low-dimensional points [Y] in Rd from the eigenpairs: ½Y 5 ½W Uð½W T U½W Þ21=2 :

4. DIMENSIONALITY ESTIMATORS A key step in constructing the low-dimensional points from the data is the choice of the low dimensionality d. Methods like PCA and Isomap have an implicit technique to estimate the low dimensionality (approximately)

106

S. Samudrala et al.

using scree plots. We introduce a graph-based technique that rigorously estimates the latent dimensionality of the data that can be used in conjunction with the scree plot. A scree plot is a plot of the eigenvalues with the eigenvalues arranged in decreasing order of their magnitude. Scree plots obtained from PCA and Isomap (distance-preserving methods) give an estimate of the dimensionality. They constitute a heuristic method of identifying the dimensionality by identifying the elbow in the scree plot. A more quantitative estimate of dimensionality is estimated by choosing a value for δ that ensures a threshold of the minimum percentage variability (pvar). If λ1 . λ2 . . . . λn are the individual eigenvalues arranged in descending order, the percentage variability covered by considering the first d eigenvalues is given by Pd λi pvar 5 100U Pi51 # δ: n i51 λi We have recently utilized a dimensionality estimator based on the BHH theorem (BeardwoodHaltonHammersley theorem; Beardwood et al., 1959). This powerful theorem uses the geodesic minimal spanning tree length (GMST) estimator in expressing the dimensionality of an unordered data set as a function of GMST on the graph. Specifically, the slope of the loglog graph constructed by calculating the GMST with respect to increasing data points provides an estimate of the dimensionality d 5 1/(1 2 m), where m is the slope. Computationally, GMST is computed using Prim’s (1957) algorithm.

5. SOFTWARE These DR techniques can be packaged into a modular, scalable framework for ease of use by the materials science community. We call this package SETDiR (Scalable Extensible Toolkit for Dimensionality Reduction). This framework contains two major components: 1. Core functionality, developed using C1 1. 2. User interface, developed based on Java (Swings). Figure 6.2 describes the scope of the functionality of both modules in SETDiR. Details of the implementation are described in the subsections below.

Data Dimensionality Reduction in Materials Science

107

Common workspace

1. R inp ece uts ive su se r

t

ut

tp u ou

si np

ite s

ad 8.

ut

wr

utp

Re

so

3.

ad

Dimensionality estimator

User input

Re

4

9.

Construct weight matrix

5 6

C++ Executable

Eigen solver

2. Calls the C++ executable

User interface 10. Displays output to users

7

K-Means clustering

Graph lowdimensional points

Figure 6.2 Description of the software.

5.1 Core Functionality Functionality is developed using the object-oriented C1 1 programming language. It implements the following methods: PCA, Isomap, LLE, and dimensionality estimators such as GMST and correlation dimension estimators (Lee and Verleysen, 2008).

5.2 User Interface The user interface (shown in Figure 6.3) is developed using Java Swings components with the following features, which make it user-friendly: 1. Abstracts the user from the functionality and the mathematical intricacies. 2. Postprocessing of results: One of the motivations behind DR techniques is their ability to enhance the visualization of the data. To emphasize this fact sophisticated post-processing tools and graphical representation of the data are incorporated into SETDiR. 3. Displays the results graphically and makes the post-processing of the data easier. 4. The interface also lets the user store the plots in JPEG format. This framework can be downloaded from SETDiR Download. We next showcase the framework and the mathematical strategies on material science data sets.

108

S. Samudrala et al.

Figure 6.3 Snapshot of clustering pattern displayed using SETDiR for apatite data set.

6. ANALYZING TWO MATERIAL SCIENCE DATA SETS: APATITES AND ORGANIC SOLAR CELLS In this section we compare and contrast the algorithms on two interesting data sets with considerable technological and scientific significance. Data dimensionality reduction offers unique insights into the originally intractable data sets by enabling visual clustering and pattern association. The first data set we consider is an apatite database. Apatites have the ability to accommodate numerous chemical substitutions and exhibit a broad range of multi-functional properties. The rich chemical and structural diversity provides a fertile ground for the synthesis of technologically relevant compounds (Elliott, 1994; Mercier et al., 2005; Pramana et al., 2008; White and Dong, 2003; White et al., 2005). Chemically apatites are conveniently described by the general formula AI4 AII6 ðBO4 Þ6 X2 , where AI and AII are distinct crystallographic sites that usually accommodate larger monovalent (Na1, Li1, etc.), divalent (Ca21, Sr21, Ba21, Pb21, etc.) and trivalent (Y31, Ce31, La31, etc.) ions and the X site is occupied by halides (F2, Cl2, Br2), oxides, or hydroxides. Establishing the relationship between the microscopic properties of apatite complexes with those of the macroscopic properties can help to gain understanding and promote the use of apatites in various daily-life applications. For example, information about the relative stability of the apatite complexes can

Data Dimensionality Reduction in Materials Science

109

promote the utilization of apatites as an antidote for lead poisoning (by finding an apatite that is more stable than a lead apatite). DR techniques can be used to establish structureproperty relationship for apatites described using various structural descriptors by enhancing the visualization and understanding of the data. The second data set we consider is a set of morphological images of plastic solar cells manufactured under different processing conditions. Organic solar cells (or plastic solar cells) manufactured from organic blends (i.e. a blend of two polymers) represent a promising low-cost, rapidly deployable strategy for harnessing solar energy. While highly costeffective and flexible, their low power conversion efficiency makes them less competitive on a commercial scale in comparison with conventional inorganic solar cells. A key aspect determining the power conversion efficiency of organic solar cells is the morphological distribution of the two polymer regions in the device. Recent studies reveal that significant additional improvement in power conversion efficiency is possible through better morphological control of the organic thin-film layer during the manufacturing process (Chen et al., 2009; Deibel et al., 2010; Hoppe and Sariciftci, 2006; Park et al., 2009; Peet et al., 2009; Sean et al., 2001; Wodo et al., 2012). High-throughput exploration of the various manufacturing parameters (evaporation rate, blend ratio, substrate patterning frequency, substrate patterning intensity, solvent) can potentially unravel processmorphology relationships that can help tailor processing pathways to obtain enhanced morphologies. It is important to note that such high-throughput analysis results in data sets that are too large to visually look for trends and relationships. A promising approach towards unraveling processmorphology relationships in this high-throughput data is via data dimensionality reduction.

6.1 Apatite Data The crystal structure of a typical P63 =mCaI4 CaII6 ðPO4 Þ6 F2 apatite with hexagonal unit cell is shown in Figure 6.4 with the atoms projected along the (001) axis. The polyhedral representations of AIO6 and BO4 structural units are clearly shown, with the CaII site (pink atoms) and F site (green atoms) occupying the tunnel. The thin black line represents the unit cell of the hexagonal lattice. The sample apatite data set considered consists of 25 different compositions described using 29 structural descriptors. These structural

110

S. Samudrala et al.

Crystal structure of Ca10(PO4)6F2–A sample apatite

Figure 6.4 Crystal structure of a typical P63 =mCaI4 CaII6 ðPO4 Þ6 F2 apatite with a hexagonal unit cell. (From Balachandran, 2011; Balachandran and Rajan, 2012.)

descriptors, when modified, affect the crystal structure of the composition (Balachandran and Rajan, 2012). By establishing the relationship between the crystal structure and these structural descriptors and analyzing the clustering of different compositions, conclusions can be drawn about how the changes in these structural descriptors (defining the microscopic properties) of the crystal structure can affect the macroscopic features (like melting point, Young’s modulus, etc.). The bond length, bond angle, lattice constants, and total energy data are taken from the work of Mercier et al. (2005), the ionic radii data is taken from the work of Shannon (1976), and the electronegativity data is based on the Pauling (1960) scale. The ionic radius of the AI site (rAI) has a coordination number 9; rAII has a coordination number 7 (when the X site is F61) or 8 (when the X site is Cl61 or Br61). Our database describes Ca, Ba, Sr, Pb, Hg, Zn and Cd in the A site, P, As, Cr, V and Mn in the B site, and F, Cl and Br in the X site. The 25 compounds considered in this study belong to the P63/m hexagonal space group. We utilize SETDiR on the apatite data and present some of the results below. A more comprehensive analysis is deferred to a forthcoming publication (Balachandran and Samudrala, 2013). Dimensionality Estimation SETDiR first estimates the dimensionality using the scree plot. A scree plot is a plot of eigenvalue indices versus eigenvalues. The occurrence of an

Data Dimensionality Reduction in Materials Science

Scree plot for apatite data from PCA

Scree plot for apatite data from Isomap 1

1 Normalized Unnormalized

0.8

Normalized eigenvalues

Normalized eigenvalues

111

0.6 0.4 0.2

Normalized Unnormalized

0.8 0.6 0.4 0.2 0

0 0

2

4 6 8 Index of eigenvalues

10

0

2

4 6 8 Index of eigenvalues

10

Figure 6.5 Scree plots for PCA and Isomap normalized vs. unnormalized input. (From Balachandran and Samudrala, 2013.)

elbow (or a sharp drop in eigenvalues) in a scree plot gives an estimate of the dimensionality of the data. Figure 6.5 shows the scree plots when the input vectors {x0, x1, . . ., xn21} are normalized with respect to when they are not normalized. We plot for comparison the eigenvalues obtained from both PCA and Isomap. This plot shows how the second eigenvalue collapses to zero when the input vectors are not normalized and hence emphasizes the importance of normalization of input vectors. It is also interesting to compare the eigenvalues of PCA and Isomap for normalized input: PCA, being a linear method, overestimates the dimensionality as 5 while Isomap estimates it to be 3. SETDiR next uses the geodesic minimal spanning tree method to estimate the dimensionality of the apatite data. This method gives a rigorous estimate of 3, which matches the outcome of the more heuristic scree plot estimate. We finally plot the low-dimensional data extracted using both PCA and Isomap. Figure 6.6 (left) shows the 2D plot between principal components 2 and 3. The reason for showing this classification map is that the PC2PC3 map captures a pattern that is similar to Isomap components 1 and 2. While we find associations among compounds that are exactly the same, as shown in Figure 6.6 (right), the nature of the information is manifested in different ways. This is mainly attributed to the difference in the governing mathematics of the two techniques, where PCA is essentially a linear technique and Isomap is a nonlinear technique. We have further interpreted the hidden information captured by the Isomap classification map by focusing on the three regions separately in a forthcoming paper (Samudrala et al., 2013).

112

S. Samudrala et al.

Figure 6.6 Apatite PCA (left) and Isomap (right) result interpretation. (From Balachandran, 2011; Balachandran and Samudrala, 2013.)

6.2 Unraveling ProcessMorphology Pathways of Organic Solar Cells using SETDiR We showcase the results obtained by applying (a parallel version of) the SETDiR framework (Samudrala et al., 2013) on this pressing scientific problem.1 We particularly focus on using dimensionality reduction to understand the effects of substrate patterning (patterning frequency and intensity) on morphology evolution.2 The data set consists of n 5 75,150 morphologies. Each morphology is a two-dimensional snapshot that is vectorized to have dimensionality D 5 8326. Figure 6.7 shows several representative final morphologies (Wodo, 2012) obtained by varying (a) the patterning frequency, lp, from 0.5 to 1.50, and (b) the intensity of the attraction/repulsion, μ, from 1 1 e26 to 1 1 8e24. In all these cases, the lower surface of the domain is patterned to attract and/or repel specific classes of polymers, thus affecting the morphology. Figure 6.8 plots the first 10 eigenvalues of the data. Note that the first three eigenvalues (and hence the first three principal components of the data) represent B70% of the information content of the entire data. We therefore characterize each morphology in terms of this threedimensional representation.

1

2

We developed an in-house parallel solver PaDRe to apply DR to data sets as large as n 5 105 and D 5 106. We detail the parallel algorithms and associated scaling studies in a separate paper (Samudrala et al., 2013). Nano-tip photolithography patterning of the substrate has shown significant potential to guide morphology evolution (Coffey and Ginger, 2005).

Data Dimensionality Reduction in Materials Science

113

Figure 6.7 Snapshots of microstructures representing final morphologies of 50 different processes under consideration.

Figure 6.8 (A) Scree plot for the 10 largest eigenvalues. (B) Proportional energy covered by the first 10 eigenvalues.

Figure 6.9 represents all the morphologies on this three-dimensional reduced space. In Figure 6.9(A), the points are color coded according to the patterning frequency used, while in Figure 6.9(B), the points are color coded according to the patterning intensity used. This plot provides significant visual insight into the effects of patterning frequency and intensity. There exists a central plane of patterning frequency, where the morphology evolution is highly regulated irrespective of the patterning intensity (lp # 1). This is particularly valuable information as the patterning

114

S. Samudrala et al.

Figure 6.9 Morphology evolution with respect to the first three principal components color coded with respect to: (A) patterning frequency (lp); (B) patterning intensity (μ).

Figure 6.10 Morphology evolution in lp 5 1.50: categorization of parametric space.

frequency is much easier to control than patterning intensity from a manufacturing perspective. For patterning frequencies above lp 5 1, the morphologies are highly sensitive to slight variations in both frequency and intensity. This is also clearly seen in Figure 6.10, where slight variations in the intensity give dramatically different final morphologies. Notice also the key insight that higher intensities do not necessarily give different morphologies. This allows us to preclude further (expensive) exploration of the phase space of increasing patterning intensity. Finally, the low-dimensional plots also illustrate the ability to achieve the same

Data Dimensionality Reduction in Materials Science

115

Figure 6.11 Multiple pathways to morphology evolution.

morphology using different processing conditions. For instance, in Figure 6.11, we isolate the morphology evolution under two processing conditions that result in an identical morphology. Such correlations most sensitive regions, least sensitive regions (Figure 6.10), configurations resulting in identical morphologies are very useful as we tailor processing pathways to achieve designer morphologies. This analysis illustrates the power of parallel data dimensionality reduction methods to achieve this goal.

7. FURTHER READING Dimensionality reduction methods are a necessary and critical component in a broad range of materials science problems and the reader is urged to explore many examples discussed in the other chapters of this book. In this chapter we have introduced the basic principles behind data dimensionality reduction and introduced examples in materials science using both linear and nonlinear methods. The latter approach can reveal new correlations and uncover new structureproperty relationships when linear methods are shown to be limiting. Such nonlinear dimensionality reduction strategies are not only used to study materials science data, but also to analyze large amounts of high-dimensional data in the financial (Back and Weigend, 1997; Ince and Trafalis, 2007), engineering (Azzoni et al., 1996; Sarma et al., 2008), and biological (Balsera et al., 1996; Maisuradze et al., 2009) domains for various purposes such as visualization of high-dimensional data (Dawson et al., 2005; Vlachos et al., 2002),

116

S. Samudrala et al.

image processing (Mudrova and Prochazka, 2005; Weinberger and Saul, 2004), and predictive modeling (Mun˜oz and Felicı´simo, 2009), etc. Specifically in materials science, DR techniques have been extensively used to study high-dimensional data to establish structureprocessproperty relationships in several different contexts such as microscopy (Bonnet, 1998), etc. in the form of PCA (Suh et al., 2002), FA (Brasca et al., 2007), and several other methods mentioned in Section 1. In this chapter, we have detailed a mathematical framework of selected3 nonlinear dimensionality reduction techniques for constructing reduced order models of complicated data sets and discuss key questions involved in data selection. We also describe a rigorous technique (based on graph-theoretic analysis) to estimate the optimal dimensionality of the low-dimensional (or parametric) representation. The techniques are packaged into a modular, computational, scalable software framework with a graphical user interface Scalable Extensible Toolkit for Dimensionality Reduction (SETDiR). This interface helps to separate out the mathematics and computational aspects from the scientific applications, thus significantly enhancing the utility of DR techniques for the scientific community. This framework was built to achieve the following: 1. To make the application of dimensionality reduction methods easier by taking the mathematics out of the application process. 2. To estimate the optimal dimensionality of the high-dimensional data. 3. To obtain topological information about the surface on which the data lies. 4. For enhancing the visualization and thus to obtain an intuitive understanding of the data. 5. For performing postprocessing techniques such as clustering, etc.

REFERENCES Azzoni, P.M., Moro, D., Porceddu-Cilione, C.M., Rizzoni, G., 1996. Misfire detection in a high-performance engine by the principal component analysis approach. SAE Technical Paper 960622. Back, A.D., Weigend, A.S., 1997. A first application of independent component analysis to extracting structure from stock returns. Int. J. Neural Syst. 8 (4), 473484. Balachandran, P.V., 2011. Statistical Learning for Chemical Crystallography. PhD thesis, Iowa State University. 3

A comprehensive catalog of nonlinear dimensionality reduction techniques along with the mathematical prerequisites for understanding dimensionality reduction can be found in Lee and Verleysen (2008).

Data Dimensionality Reduction in Materials Science

117

Balachandran, P.V., Rajan, K., 2012. Structure maps for AI4 AII6 ðBO4 Þ6 X2 apatite compounds via data mining. Acta Cryst. B 68 (1), 2433. Balsera, M.A., Wriggers, W., Oono, Y., Schulten, K., 1996. Principal component analysis and long time protein dynamics. J. Phys. Chem. 100 (7), 25672572. Beardwood, J., Halton, J.H., Hammersley, J.M., 1959. The shortest path through many points. Math. Proc. Cambridge Phil. Soc. 55, 299327. Belkin, M., Niyogi, P., 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 (6), 13731396. Bergman, S., 1950. The Kernel Function and Conformal Mapping. American Mathematical Society. Bernstein, M., De Silva, V., Langford, J.C., Tenenbaum, J.B., 2000. Graph Approximations to Geodesics on Embedded Manifolds. Technical Report. Department of Psychology, Stanford University. Bonnet, N., 1998. Multivariate statistical methods for the analysis of microscope image series: applications in materials science. J. Microsc. 190 (12), 218. Brasca, R., Vergara, L.I., Passeggi, M.C.G., Ferrona, J., 2007. Chemical changes of titanium and titanium dioxide under electron bombardment. Mater. Res. 10, 283288. Chawla, N., Ganesh, V.V., Wunsch, B., 2004. Three-dimensional (3d) microstructure visualization and finite element modeling of the mechanical behavior of SiC particle reinforced aluminum composites. ScriptaMaterialia 51 (2), 161165. Chen, L., Hong, Z., Li, G., Yang, Y., 2009. Recent progress in polymer solar cells: manipulation of polymer:fullerene morphology and the formation of efficient inverted polymer solar cells. Adv. Mater. 21 (1415), 14341449. Coffey, D.C., Ginger, D.S., 2005. Patterning phase separation in polymer films with dippen nanolithography. J. Am. Chem. Soc. 127, 45644565. Curtarolo, S., Morgan, D., Persson, K., Rodgers, J., Ceder, G., 2003. Predicting crystal structures with data mining of quantum calculations. Phys. Rev. Lett. 91, 135503. Dawson, K., Rodriguez, R.L., Malyj, W., 2005. Sample phenotype clusters in highdensity oligonucleotide microarray data sets are revealed using IsoMap, a nonlinear algorithm. BMC Bioinformatics 6 (1), 195. Deibel, C., Dyakonov, V., Brabec, C.J., 2010. Organic bulk-heterojunction solar cells. IEEE J. Sel. Top. Quantum Electron. 16 (6), 15171527. Donoho, D.L., Grimes, C., 2003. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. U.S.A. 100, 55915596. Elliott, J.C., 1994. Structure and Chemistry of the Apatites and other Calcium Orthophosphates, vol. 4. Elsevier, Amsterdam. Fischer, C.C., Tibbetts, K.J., Morgan, D., Ceder, G., 2006. Predicting crystal structure by merging data mining with quantum mechanics. Nat. Mater. 5 (8), 641646. Floyd, R.W., 1962. Algorithm 97: shortest path. Commun. ACM 5 (6), 345. Fontanini, A., Olsen, M., Ganapathysubramanian, B., 2011. Thermal Comparison Between Ceiling Diffusers and Fabric Ductwork Diffusers for Green Buildings. Energy and Buildings. Iowa State University. Ganapathysubramanian, B., Zabaras, N., 2008. A non-linear dimension reduction methodology for generating data-driven stochastic input models. J. Comput. Phys. 227 (13), 66126637. Golub, G.H., Van Loan, C.F., 1996. Matrix Computations. Johns Hopkins University Press. Guo, Q., Rajewski, D., Takle, E., Ganapathysubramanian, B., 2013. Constructing lowdimensional stochastic wind models from meteorology data. Wind Energy. Hoppe, H., Sariciftci, N.S., 2006. Morphology of polymer/fullerene bulk heterojunction solar cells. J. Mater. Chem. 16, 4561.

118

S. Samudrala et al.

Ince, H., Trafalis, T.B., 2007. Kernel principal component analysis and support vector machines for stock price prediction. IIE Trans. 39 (6), 629637. Langer Jr, S.A., Fuller, E.R., Carter, W.C., 2001. OOF: an image-based finite-element analysis of material microstructures. Comput. Sci. Eng. 3 (3), 1523. Lee, J.A., Verleysen, M., 2008. Nonlinear Dimensionality Reduction. Springer. Lee, J.A., Lendasse, A., Verleysen, M., 2002. Curvilinear distance analysis versus IsoMap. Proc. ESANN, 185192. Liu, Z.K., Chen, L.Q., Raghavan, P., Du, Q., Sofo, J.O., Langer, S.A., et al., 2004. An integrated framework for multi-scale materials simulation and design. J. Comp.-Aided Mater. Des. 11, 183199. Lumley, J.L., 1967. The structure of inhomogeneous turbulent flows. In: Yaglom, A.M., Tatarski, V.I. (Eds.), Atmospheric Turbulence and Radio Wave Propagation. Nauka, Moscow, pp. 166178. Maisuradze, G.G., Liwo, A., Scheraga, H.A., 2009. Principal component analysis for protein folding dynamics. J. Mol. Biol. 385 (1), 312329. McVeigh, C., Liu, W.K., 2008. Linking microstructure and properties through a predictive multiresolution continuum. Comp. Methods Appl. Mech. Eng. 197 (4142), 32683290. Mercier, P.H.J., Le Page, Y., Whitfield, P.S., Mitchell, L.D., Davidson, I.J., White, T.J., 2005. Geometrical parameterization of the crystal chemistry of P63/m apatites: comparison with experimental data and ab initio results. Acta Cryst. B 61 (6), 635655. Meredith, J.C., Smith, A.P., Karim, A., Amis, E.J., 2000. Combinatorial materials science for polymer thin-film dewetting. Macromolecules 33 (26), 97479756. Morgan, D., Rodgers, J., Ceder, G., 2003. Automatic construction, implementation and assessment of pettifor maps. J. Phys. Condens. Matter. 15 (25), 4361. Morgan, D., Ceder, G., Curtarolo, S., 2005. High-throughput and data mining with ab initio methods. Meas. Sci. Technol. 16 (1), 296. Mudrova, M., Prochazka, A., 2005. Principal component analysis in image processing. In: Proceedings of the MATLAB Technical Computing Conference, Prague. Mun˜oz, J., Felicı´simo, A´.M., 2009. Comparison of statistical methods commonly used in predictive modeling. J. Vegetation Sci. 15 (2), 285292. Page, Y.L., 2006. Data mining in and around crystal structure databases. MRS Bull. 31, 991994. Park, S.H., Roy, A., Beaupre, S., Cho, S., Coates, N., Moon, J.S., et al., 2009. Bulk heterojunction solar cells with internal quantum efficiency approaching 100%. Nat. Photonics 3, 297302. Pauling, L., 1960. The Nature of the Chemical Bond and the Structure of Molecules and Crystals: An Introduction to Modern Structural Chemistry, vol. 18. Cornell University Press. Peet, J., Heeger, A.J., Bazan, G.C., 2009. Plastic solar cells: self-assembly of bulk heterojunction nanomaterials by spontaneous phase separation. Acc. Chem. Res. 42 (11), 17001708. Pramana, S.S., Klooster, W.T., White, T.J., 2008. A taxonomy of apatite frameworks for the crystal chemical design of fuel cell electrolytes. J. Solid. State Chem. 181 (8), 17171722. Prim, R.C., 1957. Shortest connection networks and some generalizations. Bell Sys. Tech. J. 36 (6), 13891401. Rabe, K.M., Phillips, J.C., Villars, P., Brown, I.D., 1992. Global multinary structural chemistry of stable quasicrystals, high-Tc ferroelectrics, and high-Tc superconductors. Phys. Rev. B 45, 76507676.

Data Dimensionality Reduction in Materials Science

119

Rajan, K., Suh, C., Mendez, P.F., 2009. Principal component analysis and dimensional analysis as materials informatics tools to reduce dimensionality in materials science and engineering. Stat. Anal. Data Min. 1 (6). Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 23232326. Samudrala, S., Zola, J., Aluru, S., Ganapathysubramanian, B., 2013. Parallel framework for dimensionality reduction of large-scale datasets. Journal of Scientific Programming. Preprint. Sarma, P., Durlofsky, L.J., Aziz, K., 2008. Kernel principal component analysis for efficient, differentiable parameterization of multipoint geostatistics. Math. Geosci. 40 (1), 332. Sean, E.S., Christoph, J.B., Niyazi, S.S., Franz, P., Thomas, F., Jan, C.H., 2001. 2.5% efficient organic plastic solar cells. Appl. Phys. Lett. 78 (6), 841843. Shannon, R.D., 1976. Revised effective ionic radii and systematic studies of interatomic distances in halides and chalcogenides. Acta Crystallogr. A 32 (5), 751767. Suh, C., Rajagopalan, A., Li, X., Rajan, K., 2002. The application of principal component analysis to materials science data. Data Sci. J. 1, 1926. Takeuchi, I., Lauterbach, J., Fasolka, M.J., 2005. Combinatorial materials synthesis. Mater. Today 8 (10), 1826. Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 23192323. Van der Maaten, L.J.P., Postma, E.O., van den Herik, H.J., 2009. Dimensionality Reduction: A Comparative Review. Tilburg University Technical Report, TiCC-TR 2009-005. Van Rietbergen, B., Weinans, H., Huiskes, R., Odgaard, A., 1995. A new method to determine trabecular bone elastic properties and loading using micromechanical finite-element models. J. Biomech. 28 (1), 6981. Vlachos, M., Domeniconi, C., Gunopulos, D., Kollios, G., Koudas, N., 2002. Non-linear dimensionality reduction techniques for classification and visualization. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 645651. Weinberger, K.Q., Saul, L.K., 2004. Unsupervised learning of image manifolds by semidefinite programming. In: Computer Vision and Pattern Recognition, CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference, vol. 2. IEEE, p. II-988. White, T.J., Dong, Z.L., 2003. Structural derivation and crystal chemistry of apatites. Acta Cryst. B 59 (1), 116. White, T., Ferraris, C., Kim, J., Madhavi, S., 2005. Apatite an adaptive framework structure. Rev. Mineral. Geochem. 57 (1), 307401. Wodo, O., Tirthapura, S., Chaudhary, S., Ganapathysubramanian, B., 2012. A novel graph based formulation for characterizing morphology with application to organic solar cells. Org. Electron. 13, 11051113. Wodo, Ganapathysubramanian, 2012. Modeling morphology evolution during solventbased fabrication of organic solar cells. Comp. Mater. Sci. 55, 113126. Yang, L., 2004. Sammon’s nonlinear mapping using geodesic distances. In: Pattern Recognition, ICPR 2004, Proceedings of the 17th International Conference, August, vol. 2, pp. 303306. Yue, Z.Q., Chen, S., Tham, L.G., 2003. Finite element modeling of geomaterials using digital image processing. Comput. Geotechnics 30 (5), 375397. Zabaras, N., Sundararaghavan, V., Sankaran, S., 2006. An information-theoretic approach for obtaining property PDFs from macro specifications of microstructural variability. TMS Lett. 3, 12.