
Data Mining in Materials Science and Engineering Chandrika Kamath and Ya Ju Fan Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

1. INTRODUCTION
Data mining techniques are increasingly being applied to data from scientific simulations, experiments, and observations with the aim of finding useful information in these data. These data sets are often massive and can be quite complex, for example structured or unstructured mesh data from simulations, or sequences of images from experiments. They are often multivariate, such as data from different sensors monitoring a process or an experiment. The data can be at different spatial and temporal scales, for example when a material is simulated at several scales to understand its behavior. In the case of experiments and observations, the data often have missing values and may be of low quality, for example images with low contrast or noisy data from sensors.

In addition to the size and the complexity of the data, we may also face challenges in how the data are used, whether for scientific discovery or for decision making. In the former case, the domain scientists may not have a well-formulated question they want addressed, or they may want the data explored to determine whether any insights can be gained from them. In the latter case, in addition to the results of the analysis, scientists may need to know how much they can trust the results, since they want to make decisions based on the analysis.

In this chapter, we provide a brief introduction to the field of data mining, focusing on techniques that are useful in the analysis of data from materials science and engineering applications. Data mining is the process of uncovering patterns, associations, anomalies, and statistically significant structures and events in data. It borrows and builds on ideas from a diverse collection of fields, including statistics, machine learning, pattern recognition, and mathematical optimization, as well as signal, image, and video analysis.
Informatics for Materials Science and Engineering
DOI: http://dx.doi.org/10.1016/B978-0-12-394399-6.00002-3
© 2013 Elsevier Inc. All rights reserved.
Lawrence Livermore National Laboratory, Contract no. DE-AC52-07NA27344

The resulting literature in the area of data analysis techniques is therefore enormous, and solution approaches are often rediscovered in different fields

where they are likely known by different names. In addition, scientists often create their own solutions when they can exploit properties of the data to reduce the time for analysis or improve the accuracy of the analysis. In light of this, we view this chapter as a starting point for someone interested in learning about the field of data mining and understanding the different categories of techniques available to address their data analysis problems in materials science. We begin the chapter by discussing the types of analysis problems often encountered in scientific domains, followed by a brief description of the analysis process. The bulk of the chapter focuses on different categories of analysis algorithms, including image analysis, dimension reduction, and the building of descriptive and predictive models. We conclude with some suggestions for further reading. First, some caveats. This chapter is written from the viewpoint of a data miner, not a materials scientist. The focus therefore is on algorithms rather than the insights into materials science one might gain from the application of these algorithms. Further, for the reasons mentioned earlier, the chapter barely scratches the surface of a broad, multidisciplinary field. Therefore, we recommend that the interested reader learn more about various techniques available to address their problem before selecting one for use with their data.

2. ANALYSIS NEEDS OF SCIENCE APPLICATIONS
There are many ways in which scientific data-mining techniques are being used to analyze data from scientific simulations, experiments, and observations in a variety of domains ranging from astronomy to combustion, plasma physics, wind energy, and materials science. The tasks being addressed in these domains are often very similar, and data analysis problems in materials science can frequently be addressed using approaches developed in the context of a related problem in a different application domain. In the case of experimental data, the data are often in the form of one-dimensional signals or two-dimensional images. The data may be multivariate, for example the signals from several sensors monitoring a process, or may have a time component, such as a sequence of images taken over time. Data in the form of images are often analyzed to extract objects of interest and their characteristics, such as galaxies in astronomical surveys (Kamath et al., 2002) or fragments in images of material fragmentation (Kamath and Hurricane, 2011). Streaming data from sensors monitoring a process may be analyzed to determine if the process is evolving


as expected; if something untoward is about to happen, prompting a shutdown; or if the process is moving from one normal state to another, requiring a change in the control parameters. Data analysis techniques are being used in manufacturing (Harding et al., 2006), in materials development (Morgan and Ceder, 2005), and in the intelligent processing of materials (Wadley and Vancheeswaran, 1998), which integrates advanced sensors, process models, and feedback control concepts.

In the case of simulation data, the output of the simulations can be analyzed to extract information on the phenomenon being modeled. A common task is to identify coherent structures in the data and extract statistics on these structures as they evolve over time. Other tasks include: sensitivity analysis (Saltelli et al., 2009) to understand how sensitive the outputs are to changes in the inputs of the simulation; uncertainty quantification (Committee on Mathematical Foundations of Verification, 2012) to understand how uncertainty in the inputs affects the outputs; and design and analysis of computer experiments (Fang et al., 2005), which uses the simulations to better understand the input space of the phenomenon. For example, if we are interested in creating materials with certain properties, we could use computer simulations to generate the properties for a sample of compounds. We could then analyze the inputs and outputs of these simulations to determine which inputs are more sensitive (and therefore must be sampled more finely), to build a data-driven model to predict the output given the inputs, and to place additional sample points at appropriate input values to create a more accurate predictive model. In some problems, we may need to analyze both simulation and experimental data. This is often the case in validation, where we compare how close a simulation is to an experiment by extracting statistics from both (Kamath and Miller, 2007).
Alternatively, we may want to use the simulations to guide an experiment, or to better understand the results of an experiment. Though there is a wide variety of analysis tasks encountered in scientific data sets, the techniques used in the analysis are often very similar. For example, the methods used to extract information from images may be much the same regardless of whether the images were obtained using a hand-held camera or a scanning electron microscope. Similarly, techniques to identify coherent structures in simulation output may be similar for structured and unstructured grids. We next discuss some of the analysis techniques relevant to problems in materials science and engineering. A detailed discussion of all techniques is beyond the scope of this chapter; instead, brief descriptions are provided, followed by suggestions for further reading.


[Figure 2.1 shows a flowchart of five steps: identify and extract objects → extract features → reduce dimension → pattern recognition → visualization and validation.]

Figure 2.1 Scientific data analysis: an iterative and interactive process.

3. THE SCIENTIFIC DATA-MINING PROCESS
The process of scientific data mining is usually a multistep process, with the techniques used in each step motivated by the type of data and the type of analysis being performed (Kamath, 2009). At a high level, we can consider the process to consist of five steps, as shown in Figure 2.1. The first step is to identify and extract the objects of interest in the data. In some problems this is relatively easy, for example when the objects are chemical compounds or proteins and we are provided data on each of the compounds or proteins. In other cases, it can be more complicated, for example when the data are in the form of images and we need to identify the objects (say, galaxies in astronomical images) and extract them from the background. Once the objects have been identified, we need to describe them using features or descriptors. These should reflect the analysis task; for example, if the task focuses on the structure or the shape of the objects, the descriptors must reflect the structure or the shape, respectively. In many cases, one may extract far more features than are necessary, requiring a reduction in the number of features, that is, the dimension of the problem. These key features are then used in the pattern recognition step, and the patterns extracted are visualized for validation by the domain scientists.

The data analysis process is iterative and interactive; any step may lead to a refinement of one or more of the previous steps, and not all steps may be used during the analysis of a particular data set. For example, in some problems, such as the analysis of coherent structures (Kamath et al., 2009) or images of material fragmentation (Kamath and Hurricane, 2011), the analysis task may be to extract the objects and statistics on them. This would require only the first two steps of the process.
In other problems, we may be given the data set in the form of objects described by a set of features and the task may be to identify the important features using dimension reduction techniques. A few problems may require the complete end-to-end process, for example finding galaxies with a particular shape in astronomical images. In our experience, data analysis is a close collaboration between the analysis expert and the domain scientist who is actively involved in all steps, starting


Figure 2.2 Processing of an image resulting from the fragmentation of a material. (A) Original image. (B) After the application of the Retinex algorithm to make the illumination uniform. (C) After smoothing to reduce the noise. (D) After segmentation to isolate the fragments from the background. (E) A zoomed-in view after cleanup following segmentation. (F) Identifying the skeleton of the gap regions.

from the initial description of the data and the problem, the extraction of potentially relevant features, the identification of the training set where necessary in pattern recognition, and the validation of the results from each step. We next discuss various analysis techniques we have found to be broadly useful in analysis of data from scientific simulations, experiments, and observations.

4. IMAGE ANALYSIS
In this section, we use some of our prior work to describe the tasks in image analysis. We consider the analysis of images obtained from experiments investigating the fragmentation of materials (Kamath and Hurricane, 2007, 2011). Figure 2.2A shows a subset of a larger image of a material as it fragments. The lighter regions are the fragments of the material, while the darker areas are the gaps between the fragments. The images were


analyzed to obtain statistics for both the fragments (such as their size) and the gaps (such as their length and width). The distributions of these characteristics, in the form of histograms, were then used to provide a concise summary of each image. The approach used in the analysis was to first segment the fragments from the background (that is, the gaps) and then extract the statistics on both the fragments and the gaps. This is challenging for several reasons. First, there is a large variation in the intensity, with the top left corner being brighter than the lower right region. The intensity of fragments in the darker regions is similar to the intensity of the gaps in the brighter regions. Some fragments, especially those in the lower right corner, have a range of intensity values. The images are also quite grainy, and there is no clear demarcation between the fragments and the gaps.

To overcome these challenges and make it easier to analyze these images, we formulated a multistep process. First, we used the multiscale Retinex algorithm (Jobson et al., 1997) to make the illumination more uniform. If I(x, y) is the two-dimensional image, the single-scale Retinex output R(x, y) is defined as

$$R(x, y) = \log_e(I(x, y)) - \log_e(F(x, y) * I(x, y)),$$

where the second term represents the illumination and is the convolution of the image with a Gaussian filter F of the form

$$F(x, y) = K \exp\left[-(x^2 + y^2)/\sigma^2\right],$$

where

$$\iint F(x, y)\, dx\, dy = 1,$$

and the integral is taken over the filter. The standard deviation σ of the Gaussian determines the scale. The multiscale Retinex output is a weighted sum of N applications of the single-scale algorithm:

$$MSR(x, y) = \sum_{i=1}^{N} w_i R_i(x, y),$$

where each R_i(x, y) is obtained using a different scale σ_i and the weights sum to 1. For our data, we used two equally weighted scales, one small, with σ = 20, and the other larger, with σ = 80, resulting in Figure 2.2B. Once we have an image with more uniform illumination, we can use standard image analysis techniques (Umbaugh, 2005; Sonka et al., 2007)


to extract the information of interest. First, we reduced the graininess of the image by using simple filtering techniques that smooth the image locally, resulting in Figure 2.2C. Next, we thresholded the image by dropping values below a threshold, as shown in Figure 2.2D, where the black pixels, which represent the gap regions, are values below the threshold. The threshold was selected based on an examination of the pixel intensity values, though other approaches, such as Otsu’s method (Otsu, 1979), are also possible. This task of separating the objects of interest in an image from the background is referred to as image segmentation. There are a number of ways in which this can be accomplished, ranging from the simple one used in our example to more complex methods, and varying from general approaches applicable to a broad range of images to those tailored to exploit specific properties of an image. Once we have segmented the image, we clean it by removing the small regions of black pixels in the fragments and the small regions of white pixels in the gaps. We also use morphological operations (Soille, 1999) to smooth the boundaries of the fragments, resulting in Figure 2.2E, which shows a subset of Figure 2.2D after processing. Such cleanup steps are often required after segmentation; in our problem, it is the graininess of the image combined with the thresholding that prompts the cleanup. We can now extract some simple statistics directly from the processed image, such as the size of the fragments and gaps measured in number of pixels. To obtain the length of the gaps, we obtain a skeleton of the gap regions (Hilditch, 1969), as shown in Figure 2.2F, which can be considered as a graph, with the length of the edges (the part of a skeleton between the nodes or junctions) approximating the length of the gaps. 
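The illumination correction, smoothing, and thresholding steps described above can be sketched in a few lines of Python. This is a minimal sketch assuming NumPy and SciPy; the test image, scales, and threshold are illustrative stand-ins, not the chapter's actual data.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def single_scale_retinex(image, sigma):
    """R(x,y) = log(I(x,y)) - log(F(x,y) * I(x,y)), with F a Gaussian filter."""
    image = image.astype(float) + 1e-6            # avoid log(0)
    illumination = gaussian_filter(image, sigma)  # convolution with F
    return np.log(image) - np.log(illumination)

def multiscale_retinex(image, sigmas=(20, 80)):
    """Equally weighted sum of single-scale outputs (weights sum to 1)."""
    w = 1.0 / len(sigmas)
    return sum(w * single_scale_retinex(image, s) for s in sigmas)

def segment(image, threshold, smoothing=3):
    """Smooth locally to reduce graininess, then threshold: True marks
    fragment pixels, False marks the darker gap regions."""
    smoothed = median_filter(image, size=smoothing)
    return smoothed >= threshold
```

On a uniformly lit region the Retinex output is near zero, since the estimated illumination matches the image; the segmentation mask can then be cleaned up and skeletonized as described in the text.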
In the specific example considered in this section, the extraction of statistics on the objects in the data (that is, the fragments and the gaps in the image) is the objective of the analysis. In other problems, we may start by identifying the objects in the data, extract features or characteristics describing each object, and then perform further analysis as described in the next sections. The features extracted from the data tend to be very problem specific and are not discussed here; see (Kamath, 2009) for some examples.

5. DIMENSION REDUCTION
In some data analysis problems, the data are in the form of a table, where the rows represent the samples or objects and the columns represent the


features describing each object. For example, an object might be a compound created by introducing impurities into a base material, and the features might be the properties of the atomic species that constitute the compound. In other problems, the data may have originally been in the form of images or simulation output, and processed to extract objects and features representing the objects. Often, with each object or sample, there may be an associated "output" variable, such as a property of the compound or a class label indicating the class or group to which the compound belongs.

In this section, we consider the task of dimension reduction. The number of features, d, describing an object in a data set is referred to as the dimension of the problem, as the object can be considered a point in a d-dimensional space. Often, not all the features used to describe an object are relevant to the analysis task, or the data, though described in a high-dimensional space, may naturally lie in a lower-dimensional space. In dimension reduction we try to determine whether there is a lower-dimensional representation of the data, using either feature selection techniques, which identify a subset of the original features, or feature transformation techniques, which transform the data from a higher-dimensional space to a lower-dimensional one. We next briefly describe some of the methods in each category.

5.1 Feature Selection Techniques
The basic idea in feature selection methods is to select a subset of features that are more relevant than others in determining the value of the output variable. These features may be highly correlated with the output, or able to discriminate among objects of different classes. Feature selection methods are applicable when the data set has an output variable or a class label associated with each object. We next briefly describe four feature selection methods to indicate how they differ from feature transformation methods.

Distance Filter
This is an intuitive method based on the class separability of a feature, using the Kullback-Leibler (KL) distance between histograms of feature values. For each feature, there is one histogram for each class. In a two-class problem, if a feature has a large distance between the histograms for the two classes, then it is likely to be an important feature. Suppose we discretize numeric features using $\sqrt{|n|}/2$ equally spaced bins, where $|n|$ is the size of the data. Let $p_j(d = i \mid c = a)$ be an estimate of the


probability that the jth feature takes a value in the ith bin of the histogram, given class a. For each feature j, the class separability is calculated as

$$\Delta_j = \sum_{a=1}^{c} \sum_{b=1}^{c} \delta_j(a, b),$$

where c is the number of classes and δ_j(a, b), the KL distance between the histograms corresponding to classes a and b, is given by

$$\delta_j(a, b) = \sum_{i=1}^{B} p_j(d = i \mid c = a) \log \frac{p_j(d = i \mid c = a)}{p_j(d = i \mid c = b)},$$

where B is the number of bins in the histograms. A larger distance Δ_j means feature j is better at separating the classes.

Chi-Squared Filter
In this method, we first compute the chi-square statistic from a contingency table for every feature. These tables have one row for every class label, and the columns correspond to possible values of the feature (Table 2.1, adapted from Huang (2003)). Numeric features are represented by histograms, so the columns of the contingency table are the histogram bins. The chi-square statistic for feature j is

$$\chi_j^2 = \sum_i \frac{(o_i - e_i)^2}{e_i},$$

where the sum is over all the cells in the r × c contingency table, where r is the number of rows and c is the number of columns; o_i is the observed value (the count of the items corresponding to cell i in the contingency table) and e_i is the expected frequency of items, calculated as

$$e_i = \frac{(\text{column total}) \times (\text{row total})}{\text{grand total}}.$$

Table 2.1 A 2 × 3 contingency table, with observed and expected frequencies (in parentheses) of a fictitious feature f1 that takes three possible values (1, 2, and 3).

Class    f1 = 1       f1 = 2     f1 = 3       Total
0        31 (22.5)    20 (21)    11 (18.5)    62
1        14 (22.5)    22 (21)    26 (18.5)    62
Total    45           42         37           124
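The two filters above can be sketched as follows, assuming NumPy; the histogram smoothing constant is an illustrative choice to keep the KL distance finite when a bin is empty.

```python
import numpy as np

def chi_squared_statistic(observed):
    """Chi-square statistic of an r x c contingency table of counts,
    with e_i = (column total) x (row total) / (grand total)."""
    observed = np.asarray(observed, float)
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0, keepdims=True)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

def kl_histogram_distance(p, q, eps=1e-12):
    """delta_j(a, b): KL distance between two (smoothed, normalized) histograms."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def class_separability(feature, labels):
    """Delta_j: sum of pairwise KL distances between per-class histograms,
    using sqrt(|n|)/2 equally spaced bins as in the text."""
    nbins = max(2, int(np.sqrt(len(feature)) / 2))
    edges = np.histogram_bin_edges(feature, bins=nbins)
    classes = np.unique(labels)
    hists = {c: np.histogram(feature[labels == c], bins=edges)[0]
             for c in classes}
    return sum(kl_histogram_distance(hists[a], hists[b])
               for a in classes for b in classes)
```

Applied to the observed counts in Table 2.1, `chi_squared_statistic` returns roughly 12.6.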


The features are then ranked by sorting them in descending order of their χ² statistics.

Stump Filter
A stump is a decision tree (see Section 6.1) with only the root node. The stump filter ranks features using the same process as the one used to create the root node; that is, the features are ranked according to their optimal impurity measures.

ReliefF
ReliefF (Robnik-Sikonja and Kononenko, 2003) estimates the quality of features by calculating how well they distinguish between instances that are close to each other. It starts by taking an instance i at random and identifying its k nearest hits (H_i) and misses (M_i), which are the closest instances of the same and different classes, respectively. It then computes the quality estimate of a feature s, which for a two-class data set is defined as

$$Q_s = \sum_{i=1}^{n} \left\{ \sum_{m \in M_i} \frac{\| X_{is} - X_{ms} \|}{nk} - \sum_{h \in H_i} \frac{\| X_{is} - X_{hs} \|}{nk} \right\},$$

where X_{is} is the value of feature s for instance i. By increasing the quality estimate when the selected point and its misses have different values of feature s, and decreasing it when the point and its hits have different values of the feature, ReliefF ranks the features based on their ability to distinguish between instances of the same and different classes.
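The two-class quality estimate can be sketched as follows. This is a minimal sketch assuming NumPy; it uses per-instance L1 distances and a full pass over all n instances rather than random sampling, which are simplifying choices, not the reference algorithm.

```python
import numpy as np

def relieff(X, y, k=3):
    """Two-class ReliefF: Q_s grows when an instance's k nearest misses
    differ on feature s, and shrinks when its k nearest hits differ."""
    n, d = X.shape
    Q = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1)  # distance to instance i
        dist[i] = np.inf                      # exclude the instance itself
        same = np.flatnonzero(y == y[i])
        diff = np.flatnonzero(y != y[i])
        hits = same[np.argsort(dist[same])[:k]]    # k nearest hits H_i
        misses = diff[np.argsort(dist[diff])[:k]]  # k nearest misses M_i
        Q += np.abs(X[misses] - X[i]).sum(axis=0) / (n * k)
        Q -= np.abs(X[hits] - X[i]).sum(axis=0) / (n * k)
    return Q
```

On a toy data set where the first feature tracks the class and the second is unrelated, the first feature receives the larger quality estimate.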

5.2 Feature Transformation Techniques
In contrast to feature selection techniques, feature transformation methods convert the original data from a high-dimensional space to a lower-dimensional space through a linear or nonlinear transformation. Many methods belong to this category; we next describe a few, including principal component analysis (PCA), which is a linear method, and three popular nonlinear dimension reduction (NLDR) techniques: Isomap, locally linear embedding (LLE), and Laplacian eigenmaps. These methods share the use of an eigen-decomposition to obtain a lower-dimensional embedding of the data that is guaranteed to be globally optimal.


Principal Component Analysis (PCA)
PCA (Pearson, 1901) is a linear technique that preserves the largest variance in the data while decorrelating the transformed data set. An eigenvalue problem on the data covariance matrix C is formulated as CM = λM. The eigenvectors M_d corresponding to the d largest eigenvalues λ_i, i = 1, . . ., d, form a basis for a linear transformation that maximizes the variance in the data. The low-dimensional representation is given by Y = X M_d, where X is the original data; the eigenvalues are used to determine the reduced dimension d.
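This transformation can be sketched directly from the covariance eigendecomposition, assuming NumPy; the data in the test are illustrative.

```python
import numpy as np

def pca(X, d):
    """Return (Y, eigvals): the projection of X onto the d leading
    eigenvectors of its covariance matrix, and those eigenvalues."""
    Xc = X - X.mean(axis=0)                # center the data
    C = np.cov(Xc, rowvar=False)           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]  # keep the d largest
    Md = eigvecs[:, order]
    return Xc @ Md, eigvals[order]
```

A useful check is that the sample variance of the first transformed coordinate equals the largest eigenvalue of the covariance matrix.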

Isomap
The Isomap method (Tenenbaum et al., 2000) preserves pairwise geodesic distances between data points. It starts by constructing an adjacency graph based on the neighbors of each point in the input space. These neighbors can be either the k nearest neighbors or the points that lie within an ε neighborhood. Next, the geodesic distances (Dijkstra, 1959; Floyd, 1962) between all pairs of points are estimated by computing their shortest-path distances over the graph. Let $D_G = \{d_G(i, j)\}_{i,j = 1, \ldots, n}$ be the matrix of geodesic distances, where d_G(i, j) is the distance between points i and j. Isomap then constructs an embedding in a d-dimensional Euclidean space such that the pairwise Euclidean distances between points in this space approximate the geodesic distances in the input space. Let $D_Y = \{d_Y(i, j)\}_{i,j = 1, \ldots, n}$ be the Euclidean distance matrix, with $d_Y(i, j) = \| Y_i - Y_j \|_2$. The goal is to minimize the cost function

$$\| \tau(D_G) - \tau(D_Y) \|_2,$$

where the function τ performs double centering on the matrix to support efficient optimization. The optimal solution is found by solving the eigen-decomposition of τ(D_G). The Y coordinates are then computed from the d largest eigenvalues and their corresponding eigenvectors.
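The steps above (neighborhood graph, shortest-path geodesics, double centering, eigendecomposition) can be sketched as follows; the k-nearest-neighbor rule and the Floyd-Warshall shortest-path algorithm are one concrete set of choices, assuming NumPy.

```python
import numpy as np

def isomap(X, n_neighbors, d):
    """Embed X in d dimensions, preserving graph-geodesic distances."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # adjacency graph: keep each point's k nearest neighbors
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        for j in np.argsort(D[i])[1:n_neighbors + 1]:
            G[i, j] = G[j, i] = D[i, j]
    # geodesic distances via Floyd-Warshall shortest paths
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # tau: double centering of the squared geodesic distances
    H = np.eye(n) - np.ones((n, n)) / n
    tau = -0.5 * H @ (G ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(tau)
    idx = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))
```

For points lying on a straight line in three dimensions, the one-dimensional embedding reproduces the along-line distances.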

Locally Linear Embedding (LLE)
The LLE method (Roweis and Saul, 2000) preserves the reconstruction weights w_ij that are used to describe a data point X_i as a linear combination of its neighbors $X_j, j \in N(i)$, where N(i) is the set of neighbors of


point i. The optimal weights for each i are obtained by minimizing the cost function

$$\min_{w} \left\| X_i - \sum_{j \in N(i)} w_{ij} X_j \right\|^2 \quad \text{subject to} \quad \sum_{j \in N(i)} w_{ij} = 1.$$

LLE assumes that the manifold is locally linear, and hence that the reconstruction weights are invariant in the low-dimensional space. The embedding Y of LLE is the solution that minimizes the cost function $\sum_i \| Y_i - \sum_j W_{ij} Y_j \|^2$, where W is the reconstruction weight matrix with elements $W_{ij} = 0$ if $j \notin N(i)$ and $W_{ij} = w_{ij}$ otherwise. Y can be obtained from the eigenvectors corresponding to the smallest d nonzero eigenvalues of the embedding matrix $M = (I - W)^T (I - W)$, where I is the identity matrix.

Laplacian Eigenmaps
This method provides a low-dimensional representation in which the weighted distances between a data point and other points within an ε neighborhood (or its k nearest neighbors) are minimized (Belkin and Niyogi, 2003). The distances to the neighbors are weighted using the Laplacian operator (S − W), where

$$W_{ij} = e^{-\frac{\| X_i - X_j \|^2}{t}} \quad \text{and} \quad S_{ii} = \sum_j W_{ij}.$$

Here, t = 2σ², where σ is the standard deviation of the Gaussian kernel. The objective is to find

$$\arg\min_{Y} \left\{ \mathrm{tr}\left(Y^T (S - W) Y\right) \mid Y^T S Y = 1 \right\}.$$

The representation Y is computed by solving the generalized eigenvector problem (S − W)v = λSv. Only the eigenvectors v corresponding to the smallest nonzero eigenvalues λ are used for the embedding.
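The Laplacian eigenmaps construction can be sketched as follows, assuming NumPy and SciPy; the k-nearest-neighbor rule, the kernel width t, and the test data are illustrative choices.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, n_neighbors, d, t=1.0):
    """Heat-kernel weights on a k-nearest-neighbor graph, then the
    generalized eigenproblem (S - W) v = lambda S v."""
    n = len(X)
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(D2[i])[1:n_neighbors + 1]:
            W[i, j] = W[j, i] = np.exp(-D2[i, j] / t)
    S = np.diag(W.sum(axis=1))
    L = S - W                      # the Laplacian operator (S - W)
    eigvals, eigvecs = eigh(L, S)  # generalized problem, ascending eigenvalues
    return eigvecs[:, 1:d + 1]     # skip the trivial constant eigenvector
```

Note that the smallest eigenvalue of a connected graph Laplacian is zero with a constant eigenvector, which carries no information and is discarded.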

5.3 Comparison of Dimension Reduction Methods
Figure 2.3 shows the performance of several dimension reduction methods on a data set from wind power generation. In this problem we are interested in using weather conditions to predict days with ramp events (Kamath and


[Figure 2.3 plots percentage error rate (y-axis, roughly 20 to 34) against number of features (x-axis, 5 to 35) for data set b6020, with curves for the distance filter, chi-squared filter, stump filter, ReliefF (k = 3), Isomap, Laplacian eigenmaps, LLE, and PCA; a horizontal line at 22.92 marks the error rate using all features.]

Figure 2.3 A comparison of the performance of different dimension reduction methods on a data set relating weather variables to days with ramp events in wind power generation. The plot shows how the error in prediction using an ensemble of decision tree classifiers varies as we use just the top k features (k shown on the x-axis) identified by each dimension reduction method.

Fan, 2012). There are several weather variables available, not all of which are expected to be predictive of the ramp events. First, the dimension reduction techniques are used to rank the variables by importance. These are either the original features, identified using feature selection methods, or transformed features, identified using feature transformation methods. Next, we use a classification algorithm based on ensembles of decision trees (see Section 6.1) and evaluate the percentage error obtained when we use only the top k features. The plot in Figure 2.3 shows how the error rate varies as k is changed. The horizontal line in the plot is the error rate using all the features. For the problem under consideration, the plot shows that the feature selection methods perform better than the feature transformation methods, even reducing the error below that obtained using all the features. A comparison of different dimension reduction methods in the context of problems in wind power generation is provided in Kamath and Fan (2012), while a more detailed comparison of the nonlinear dimension reduction methods is given in van der Maaten et al. (2009).
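The evaluation loop behind such a comparison can be sketched as follows. This is a minimal sketch: a leave-one-out nearest-centroid classifier stands in for the decision-tree ensemble used in the chapter, and the data and ranking are illustrative.

```python
import numpy as np

def loo_error(X, y):
    """Leave-one-out error of a nearest-centroid classifier (a simple
    stand-in for the decision-tree ensemble used in the chapter)."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        centroids = {c: X[mask][y[mask] == c].mean(axis=0)
                     for c in np.unique(y[mask])}
        pred = min(centroids,
                   key=lambda c: np.linalg.norm(X[i] - centroids[c]))
        errors += int(pred != y[i])
    return errors / len(X)

def error_vs_top_k(X, y, ranking):
    """Error rate when classifying with only the top-k ranked features,
    for k = 1 .. total number of features (the curves of Figure 2.3)."""
    return [loo_error(X[:, ranking[:k]], y)
            for k in range(1, X.shape[1] + 1)]
```

Any of the feature selection or transformation methods from Section 5 can supply the `ranking` argument.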


6. BUILDING PREDICTIVE AND DESCRIPTIVE MODELS
The task of building models is at the core of many data-mining endeavors. There are two broad categories of models: predictive models, where we use the data obtained thus far to create a model that is then used for prediction, and descriptive models, where we build a model to describe the data. Predictive models are used in problems where we have objects that have been identified as belonging to different categories or classes, for example astronomical objects assigned a label of galaxy or star, or semiconductor materials assigned a band-gap type. Each of these objects is characterized using features. Once a model is built, then, given a new object described by its set of features, we can use the model to assign it a class label. In problems where objects have not been assigned labels, we can create a descriptive model that groups similar objects together. An underlying assumption in both cases is that the features are a good representation of the objects; that is, the features capture what makes an object similar or dissimilar to other objects. Otherwise, it would not be possible to build these predictive and descriptive models.

Before we describe the algorithms used in building the models, we note that the preprocessing done thus far essentially takes the original data, in the form of meshes, images, or sequences of images, and converts it into a form where we focus on the objects in the data. We are usually interested in the patterns among the objects. The preprocessing steps convert information at the low level of pixels or mesh points into higher-level objects. If the objects can be represented accurately using features that reflect the patterns in the data, then we can successfully build the predictive and descriptive models.

6.1 Classification and Regression
There are two categories of algorithms used to build predictive models: classification methods, for the case where the class label is discrete, and regression algorithms, for the case where the output is continuous. These algorithms tend to be very similar. In this section, we describe two algorithms: decision trees for classification and locally weighted learning for regression. Their counterparts are regression trees and nearest-neighbor classifiers, respectively. We selected these algorithms because they are easy to understand, fast, and interpretable. This last quality is important, as it provides insight into how the model arrives at a result and, in the process, provides greater confidence in the use of these techniques.
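Before the details, here is a minimal sketch of the core operation of a decision tree: finding the best split of a single numeric feature under the Gini impurity discussed in this section. It assumes NumPy, and the data in the example are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini index of a set of labels: 1 - sum_i (n_i / |T|)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float((p ** 2).sum())

def best_split(x, y):
    """Evaluate every midpoint (x_i + x_{i+1})/2 of the sorted feature
    values and return (impurity, threshold) for the best split."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)
    best_impurity, best_threshold = np.inf, None
    for i in range(n - 1):
        if xs[i] == xs[i + 1]:
            continue                  # no threshold between equal values
        left, right = ys[:i + 1], ys[i + 1:]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_impurity = impurity
            best_threshold = (xs[i] + xs[i + 1]) / 2
    return best_impurity, best_threshold
```

Applied recursively to the resulting subsets, this split search is exactly how a tree is grown; applied once at the root, it is the stump filter of Section 5.1.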


A decision tree is a structure that is either a leaf, indicating a class, or a decision node that specifies some test to be carried out on a feature (or a combination of features), with a branch and subtree for each possible outcome of the test. Decision tree construction starts at the root node with all the data. The data are then split recursively by examining each feature and finding the split that optimizes an impurity measure. This divides the data into two subsets, and the process is repeated on each subset. The splitting stops when an appropriate condition is satisfied, such as all instances at a node having the same class label, or there being too few instances at a node to make further splitting useful. To search for the optimal split for a numeric feature x, the feature values are sorted (x_1 < x_2 < . . . < x_n) and all intermediate values (x_i + x_{i+1})/2 are evaluated as possible splits using a given impurity measure. A commonly used impurity measure is the Gini index (Breiman et al., 1984), which is based on finding the split that most reduces the node impurity, where the impurity is defined as follows:

$$L_{Gini} = 1.0 - \sum_{i=1}^{k} (L_i / |T_L|)^2, \qquad R_{Gini} = 1.0 - \sum_{i=1}^{k} (R_i / |T_R|)^2,$$

$$\text{Impurity} = \left( |T_L| \cdot L_{Gini} + |T_R| \cdot R_{Gini} \right) / n,$$

where |T_L| and |T_R| are the number of examples, and L_{Gini} and R_{Gini} are the Gini indices, on the left and right sides of the split, respectively.

In addition to decision trees, commonly used classification algorithms include neural networks, k-nearest-neighbor classifiers, and support vector machines (Hand et al., 2001; Hastie et al., 2001). Recent research has shown that instead of using a single classifier, we can obtain more accurate results by using an ensemble of classifiers, where each classifier is created from the same data set using randomization. There are many ways to introduce randomization to create different classifiers from the same data set and to combine their results (Rokach, 2010).

Among regression algorithms, a simple approach is locally weighted regression (LWR) (Cleveland and Devlin, 1988), which combines local models fit to nearby data. It is derived from standard regression procedures for global models. A global regression model fits all data points with one single model that is then used to predict new observations. Large prediction errors occur when the selected model fails to capture the properties of the data, for example when a linear model is used to fit a nonlinear function. In contrast, LWR fits a different regression model everywhere,

32

Chandrika Kamath and Ya Ju Fan

weighting the data points by how close they are to the point of interest. In addition to a regression function, LWR contains three critical parts: distance function, weighting function, and smoothing parameters. The distance function determines the data around the point of interest that should be included in the fitting. Since the observations near the point of interest are supposed to tell us more about the prediction than points which are far from it, we use a weighting function on these observations. A smoothing parameter can be used to adjust the radius of the weighting function and reduce cross-validation error when fitting the data. With the right choices of these three elements, LWR can be quite successful at recovering the underlying nonliner regression function (Atkeson et al., 1997).
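To make the split search concrete, the following sketch (our own minimal illustration, not code from the chapter; the function names `gini` and `best_split` are assumptions) evaluates every midpoint of a sorted numeric feature and returns the split that minimizes the weighted Gini impurity defined above:

```python
def gini(labels):
    """Gini index of a set of class labels: 1 - sum_i (count_i / |T|)^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((m / n) ** 2 for m in counts.values())

def best_split(xs, ys):
    """Evaluate every midpoint (x_i + x_{i+1})/2 of the sorted feature
    values and return (impurity, threshold) for the best split."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # weighted impurity: (|T_L| * L_Gini + |T_R| * R_Gini) / n
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[0]:
            best = (impurity, threshold)
    return best

# Two well-separated classes along one feature
xs = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # a pure split near 0.6 (impurity 0.0)
```

A full decision-tree builder would apply this search recursively to each resulting subset and over all features, keeping the best split at every node.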

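As a concrete illustration of the three components of LWR, the sketch below (a minimal implementation under our own naming; the chapter does not prescribe a specific kernel) uses distance to the query point, a Gaussian weighting function, and a `bandwidth` smoothing parameter to fit a locally weighted line:

```python
import math

def lwr_predict(xs, ys, x0, bandwidth=0.5):
    """Predict y at x0 by fitting a weighted line y = a + b*x, where the
    weights decay with distance from x0 (Gaussian weighting function) and
    `bandwidth` is the smoothing parameter controlling its radius."""
    w = [math.exp(-((x - x0) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    # Weighted least squares for a line: solve the 2x2 normal equations.
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, xs))
    swy = sum(wi * yi for wi, yi in zip(w, ys))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = sw * swxx - swx * swx
    a = (swxx * swy - swx * swxy) / det
    b = (sw * swxy - swx * swy) / det
    return a + b * x0

# Samples of a nonlinear function: a single global line would fit poorly,
# but local linear fits track the curve.
xs = [i / 10.0 for i in range(21)]  # 0.0 .. 2.0
ys = [math.sin(x) for x in xs]
print(lwr_predict(xs, ys, 1.0, bandwidth=0.3))  # close to sin(1.0)
```

Shrinking the bandwidth makes the fit more local (lower bias, higher variance); in practice it would be tuned by cross-validation, as the text notes.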
6.2 Clustering Data clustering tries to group objects such that objects in the same cluster are more similar to each other than they are to objects belonging to different clusters. It is an exploratory method that analyzes the interrelationships among the objects to assess their structure. Partitional algorithms group the data into disjoint subsets. k-means clustering is one of the best-known algorithms: it computes k cluster centers such that the sum of the squared Euclidean distances from all objects to their closest centers is minimized. Similar to k-means clustering are k-medians clustering and k-medoids clustering. Instead of using squared Euclidean distances, k-medians clustering considers 1-norm distances among the objects and calculates the median. As opposed to calculating the cluster centers using the mean or the median, k-medoids clustering tries to find k existing objects to serve as cluster centers such that the sum of the distances from all objects to their closest cluster centers is minimized. These k-means-type problems can be formulated as optimization problems, but their optimal solutions are hard to find. It is common to use the expectation-maximization (EM) algorithm to obtain approximate solutions. Given the number of clusters k and the initial cluster centers, the algorithm alternates between an expectation (E) step and a maximization (M) step until a locally optimal solution is found. In the E step, all objects are assigned to their nearest centers. In the M step, the centers are reevaluated by computing the geometric centroid or median of each cluster, or by choosing the objects that are closest to all others in the same cluster. The algorithm converges when the cluster assignments do not change across iterations.
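The E and M steps described above can be sketched in one dimension as follows (a minimal illustration with hypothetical names; a production implementation would handle empty clusters and center initialization more carefully):

```python
def kmeans(points, centers, max_iter=100):
    """Alternate E and M steps for 1-D k-means until assignments stabilize."""
    assignments = None
    for _ in range(max_iter):
        # E step: assign every point to its nearest center
        new_assignments = [
            min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no assignment changed across iterations
        assignments = new_assignments
        # M step: recompute each center as the mean of its cluster
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, assignments

points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = kmeans(points, centers=[0.0, 10.0])
print(centers, labels)  # centers converge near 1.0 and 5.0
```

Replacing the mean in the M step with the median (or the in-cluster object minimizing total distance) yields the k-medians and k-medoids variants, respectively.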

The k-means clustering is more sensitive to outliers, since the squared Euclidean distance is used in calculating the centers. The k-medoids clustering can be formulated differently by using a precalculated symmetric distance matrix, where the element at the ith row and jth column is the distance between object i and object j. The matrix is used for finding new medoids at each iteration. Since the clusters are assigned using the distance matrix, this gives us the flexibility to choose different proximity metrics to accommodate the data. Instead of producing a single partitioning of the data, hierarchical clustering constructs a hierarchy of clusters that can be combined with each other at certain levels of distance. Such a structure can be represented using a dendrogram. The algorithm is based on the idea that objects are more related to nearby objects than to objects farther away. Strategies for hierarchical clustering fall into two categories: agglomerative and divisive. Agglomerative clustering is a bottom-up approach: it begins by considering each object as one cluster; then, pairs of clusters are merged at certain distance levels as we move up the hierarchy. Divisive clustering is a top-down approach: all observations start in one cluster, which is split recursively at certain distance levels as the process moves down the hierarchy. The distance levels that determine the merges and splits are chosen in a greedy manner. The algorithm is not robust to outliers, which appear as additional clusters or cause other clusters to merge. As a result, the final clusters need to be chosen appropriately from the hierarchy.
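As an illustration of agglomerative clustering driven by a precomputed distance matrix, the following sketch (our own minimal single-linkage variant; names are hypothetical) repeatedly merges the two closest clusters until the requested number remains:

```python
def single_linkage(dist, n_clusters):
    """Agglomerative clustering: start with each object as its own cluster
    and repeatedly merge the pair of clusters with the smallest
    single-linkage (minimum pairwise) distance.  `dist` is a precomputed
    symmetric distance matrix, so any proximity metric can be plugged in."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        # find the pair of clusters with the minimum inter-cluster distance
        best = (float("inf"), 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # greedy merge of b into a
    return clusters

# Distance matrix for four objects: {0, 1} close, {2, 3} close, groups far apart
dist = [
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 8.5, 9.5],
    [9.0, 8.5, 0.0, 1.2],
    [8.0, 9.5, 1.2, 0.0],
]
print(single_linkage(dist, 2))  # -> [[0, 1], [2, 3]]
```

Recording the distance at which each merge occurs would recover the dendrogram from which final clusters are chosen.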

7. FURTHER READING In this chapter, we presented a brief introduction to some of the techniques that can be used to analyze data from simulations and experiments in problems in materials science and engineering. Data mining is a multidisciplinary area, which borrows and builds on techniques from a multitude of disciplines including statistics, machine learning, pattern recognition, mathematical optimization, signal and image processing, and high-performance computing. Given the wealth of analysis techniques in each of these domains, a short chapter is barely sufficient to scratch the surface. We next list some additional references so those interested can explore further. • For general reading in data mining, the books by Theodoridis and Koutroumbas (2003), Duda et al. (2001), and Hastie et al. (2001)
all provide a good overview of algorithms, albeit from different perspectives. • For image processing, the classics by Jain (1989) and by Sonka et al. (2007) provide excellent overviews. • For dimension reduction, the books by Liu and Motoda (1998) and by Zhao and Liu (2011) discuss feature selection methods, while principal component analysis is the focus of the book by Joliffe (2002). A recent survey on dimension reduction (van der Maaten et al., 2009) provides greater insight into the nonlinear methods. • In the area of building predictive and descriptive models, the books mentioned under general reading are an excellent starting point. More details are available in texts on specific topics. For example, the classic by Breiman et al. (1984) provides an in-depth view of some of the early work on decision trees, while the more recent work on ensemble learning is discussed in the book by Rokach (2010). A good overview of clustering is given in Gan et al. (2007). The special issue of the journal Statistical Analysis and Data Mining (Rajan and Mendez, 2009) on materials informatics has some articles at the intersection of data mining and materials science and engineering, while the recent article by Domingos (2012) provides useful information on the practical use of machine learning techniques. Finally, the book by Kamath (2009) provides an overview of scientific data mining. It starts with the different ways in which data-mining techniques are used in a variety of problem domains. It then identifies the tasks that are common to the analysis in these domains. A majority of the book focuses on these tasks, starting with the preprocessing of the data in the form of images or computational mesh data, extraction of objects in the data and features describing them, dimension reduction, and pattern recognition.

ACKNOWLEDGMENTS This work was partially supported through the ARRA-funded SciDAC-e project, MINDES: Data Mining for Inverse Design; we appreciate the support of the Office of Science, US Department of Energy. LLNL-MI-592793: This work was performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

REFERENCES
Atkeson, C.G., et al., 1997. Locally weighted learning. Artif. Intell. Rev. 11, 11–73.
Belkin, M., Niyogi, P., 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 (6), 1373–1396.
Breiman, L., et al., 1984. Classification and Regression Trees. CRC Press, Boca Raton, FL.
Cleveland, W.S., Devlin, S.J., 1988. Locally weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. 83 (403), 596–610.
Committee on Mathematical Foundations of Verification, 2012. Assessing the Reliability of Complex Models: Mathematical and Statistical Foundations of Verification, Validation, and Uncertainty Quantification. The National Academies Press.
Dijkstra, E.W., 1959. A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271.
Domingos, P., 2012. A few useful things to know about machine learning. Commun. ACM 55, 78–87.
Duda, R.O., et al., 2001. Pattern Classification. John Wiley, New York.
Fang, K.-T., et al., 2005. Design and Modeling for Computer Experiments. Chapman & Hall/CRC Press, Boca Raton, FL.
Floyd, R.W., 1962. Algorithm 97: shortest path. Commun. ACM 5, 345.
Gan, G., et al., 2007. Data Clustering: Theory, Algorithms, and Applications. SIAM, Philadelphia, PA.
Hand, D., et al., 2001. Principles of Data Mining. MIT Press, Cambridge, MA.
Harding, J.A., et al., 2006. Data mining in manufacturing: a review. J. Manuf. Sci. Eng. 128, 969–976.
Hastie, T., et al., 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.
Hilditch, C.J., 1969. Linear skeletons from square cupboards. In: Meltzer, B., Mitchie, D. (Eds.), Machine Intelligence. University Press, Edinburgh, pp. 404–420.
Huang, S.H., 2003. Dimensionality reduction on automatic knowledge acquisition: a simple greedy search approach. IEEE Trans. Knowl. Data Eng. 15 (6), 1364–1373.
Jain, A.K., 1989. Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ.
Jobson, D.J., et al., 1997. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Process. 6 (7), 965–976.
Joliffe, I.T., 2002. Principal Component Analysis. Springer, New York.
Kamath, C., 2009. Scientific Data Mining: a Practical Perspective. Society for Industrial and Applied Mathematics (SIAM).
Kamath, C., Fan, Y.J., 2012. Using data mining to enable integration of wind resources on the power grid. Stat. Anal. Data Min. 5 (5), 410–427.
Kamath, C., Hurricane, O., 2007. Analysis of Images from Experiments Investigating Fragmentation of Materials. Technical Report UCRL-TR-234578. Lawrence Livermore National Laboratory, California.
Kamath, C., Hurricane, O.A., 2011. Robust extraction of statistics from images of material fragmentation. Int. J. Image Graph. 11, 377–401.
Kamath, C., Miller, P.L., 2007. Image analysis for validation of simulations of a fluid-mix problem. In: IEEE International Conference on Image Processing, vol. III, pp. 525–528.
Kamath, C., et al., 2002. Classification of bent-double galaxies in the FIRST survey. IEEE Comput. Sci. Eng. 4 (4), 52–60.
Kamath, C., et al., 2009. Identification of coherent structures in three-dimensional simulations of a fluid-mix problem. Int. J. Image Graph. 9, 389–410.
Liu, H., Motoda, H., 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Boston, MA.
Morgan, D., Ceder, G., 2005. Data mining in materials development. In: Yip, S. (Ed.), Handbook of Materials Modeling. Springer, The Netherlands, pp. 395–421.
Otsu, N., 1979. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66.
Pearson, K., 1901. On lines and planes of closest fit to systems of points in space. Philos. Mag. 2 (6), 559–572.
Rajan, K., Mendez, P., 2009. Special issue on materials informatics. Stat. Anal. Data Min. 1 (5).
Robnik-Sikonja, M., Kononenko, I., 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 53, 23–69.
Rokach, L., 2010. Pattern Classification Using Ensemble Methods. World Scientific Publishing, Singapore.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323–2326.
Saltelli, A., et al., 2009. Sensitivity Analysis. Wiley.
Soille, P., 1999. Morphological Image Analysis: Principles and Applications. Springer.
Sonka, M., et al., 2007. Image Processing, Analysis, and Machine Vision, third ed. International Thompson Publishing, Pacific Grove, CA.
Tenenbaum, J.B., et al., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319–2323.
Theodoridis, S., Koutroumbas, K., 2003. Pattern Recognition. Academic Press, San Diego, CA.
Umbaugh, S.E., 2005. Computer Imaging: Digital Image Analysis and Processing. CRC Press, Boca Raton, FL.
van der Maaten, L., et al., 2009. Dimensionality Reduction: a Comparative Review. Technical Report TiCC TR 2009-005. Tilburg University.
Wadley, H.N.G., Vancheeswaran, R., 1998. The intelligent processing of materials: an overview and case study. J. Met. 41 (1), 19–30.
Zhao, Z.A., Liu, H., 2011. Spectral Feature Selection for Data Mining. CRC Press, Boca Raton, FL.