
TUTORIAL X

Classifying Documents with Respect to “Earnings” and Then Making a Predictive Model for the Target Variable Using Decision Trees, MARSplines, Naïve Bayes Classifier, and K-Nearest Neighbors with STATISTICA Text Miner

CONTENTS
Introduction: Automatic Text Classification
Data File with File References
Specifying the Analysis
Processing the Data Analysis
Saving the Extracted Word Frequencies to the Input File
Initial Feature Selection
General Classification and Regression Trees
K-Nearest Neighbors Modeling
Conclusion
Reference

INTRODUCTION: AUTOMATIC TEXT CLASSIFICATION

This example is based on the “classic” Reuters collection of documents. Specifically, 5,000 documents were selected from the Reuters-21578 database, a collection of 21,578 Reuters articles that appeared on the newswires in 1987. The documents were assembled and indexed with categories by personnel from Reuters Ltd. in 1987. Note that the copyright for these articles resides with Reuters Ltd. and Carnegie Group, Inc., and these files are available for research and demonstration purposes only. You can also review Chapter 16 in Manning and Schütze (2002) to learn more about these documents and the specific types of analyses illustrated in this example. The body of each article was placed into an XML (Extensible Markup Language) file. Following is an example of such a file (see Figure X.1).


FIGURE X.1 An XML (Extensible Markup Language) file from the example text files supplied with the STATISTICA software.
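Outside STATISTICA, reading the body text out of one of these XML article files can be sketched in a few lines of Python. This is only an illustration: the element name BODY and the folder name ReutersXML are assumptions, not the actual tags or paths used by the STATISTICA example files.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def read_article_body(path):
    """Return the article text from one XML file.

    The element name 'BODY' is an assumption for illustration; the
    actual Reuters example files may use different tags.
    """
    tree = ET.parse(path)
    body = tree.getroot().find(".//BODY")
    return body.text if body is not None else ""

# Example: read every .xml file in a (hypothetical) folder of articles
texts = [read_article_body(p) for p in sorted(Path("ReutersXML").glob("*.xml"))]
```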

The value of this collection of documents is that it was carefully coded by experts with respect to different content categories. The one of interest for this example is the “Earnings” category; that is, the goal of this text mining project is to derive a simple classifier that enables us to automatically classify the articles as either dealing with earnings or not (see also Manning and Schütze, 2002, p. 579). The general utility of methods that automatically classify large numbers of texts into categories (e.g., of interest or not of interest, or categories that allow documents to be routed automatically to the appropriate offices, departments, etc.) can be immense. Once a good (accurate) classification method has been determined, hundreds or perhaps thousands of human work hours can be saved by implementing an automated system to perform the necessary classification of documents. (Note that the STATISTICA system is ideally suited to implementing such systems because it supports deployment of text mining results and because the system is completely programmable, so it can be seamlessly integrated with existing electronic management systems, such as the STATISTICA Document Management System.)

DATA FILE WITH FILE REFERENCES

To reiterate, the purpose of this analysis is to derive a model that will enable us to automatically determine whether a document is relevant to the Earnings category. The STATISTICA Text Mining and Document Retrieval system includes many options for retrieving documents or references to documents, including web or file crawling. In this case, the example data file Reuters.sta will be used (see Figure X.2), which already contains the necessary information to retrieve all documents.


FIGURE X.2 STATISTICA data file in which column No. 1 contains the file name entered for each case. When a text mining analysis is executed, each of these text files is pulled into the analysis from the folder where it is stored.

The variable File Name contains the actual file names to be explored. The second variable, Topic: Earnings?, records how the experts classified each document (as relevant or not relevant to Earnings). There is also a variable called Training that will later be used during cross-validation of the final model to evaluate its predictive validity and accuracy.
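As a rough analogue outside STATISTICA, the same reference table can be sketched with pandas, assuming a hypothetical CSV export of Reuters.sta with the same three columns and reusing the read_article_body helper sketched above. The file name and the code "Training" used below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical CSV export of Reuters.sta with the same three columns
refs = pd.read_csv("Reuters.csv")  # columns: "File Name", "Topic: Earnings?", "Training"

# Pull each referenced document into memory, mirroring what the text mining
# run does when it follows the file references in the spreadsheet
texts = [read_article_body(path) for path in refs["File Name"]]

labels = refs["Topic: Earnings?"]              # expert coding: relevant or not
is_training = refs["Training"] == "Training"   # True for the model-building sample
```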

SPECIFYING THE ANALYSIS

Begin by opening the example data file Reuters.sta:

Ribbon bar: Select the Home tab. In the File group, click the Open arrow and select Open Examples to display the Open a STATISTICA Data File dialog. Open the Datasets folder. The Reuters.sta data file is located in the TextMiner folder.

Classic menus: Select Open Examples from the File menu to display the Open a STATISTICA Data File dialog. Open the Datasets folder. The Reuters.sta data file is located in the TextMiner folder.

Next, launch STATISTICA Text Miner:

Ribbon bar: Select the Data Mining tab. In the Text Mining group, click Text Mining to display the Text mining Startup Panel.

Classic menus: From the Data Mining menu, select Text & Document Mining to display the Text mining Startup Panel.

On the Quick tab, we need to specify the source of the text data (e.g., from spreadsheet cases, from files, or from files in locations specified in a spreadsheet column). Select the Files option button, and select the Paths in spreadsheet checkbox (see Figure X.3).


FIGURE X.3 STATISTICA Text Miner Quick tab.

Now, click the Document paths button to display a variable selection dialog (see Figure X.4) in which you select the variable File Name (which is the variable containing the complete references to the input document [XML] files).

FIGURE X.4 STATISTICA Text Miner “variable selection” dialog.


Click the OK button to return to the Startup Panel (see Figure X.5).

FIGURE X.5 STATISTICA Text Miner Quick tab showing that the file name variable (the text data source) has been selected.

Next, select the Advanced tab. Change the % of files where word occurs option to 3 in order to filter out infrequent words. Now, select the Words tab (see Figure X.6), and select the Stop words (discarded, excluded from indexing) checkbox. Click the adjacent Select button to display the Open stop-word (text) file dialog. Browse to the EnglishStoplist.txt file (which is in the TextMiner subdirectory of the STATISTICA Text Mining and Document Retrieval installation). Click the Open button to load that file as the default stop list; that is, the words and terms contained in that stop list will be excluded from the indexing that occurs during the processing of the documents.

FIGURE X.6 STATISTICA Text Miner Words tab, showing that the English stop list, EnglishStoplist.txt, has been selected.
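The two settings just made (drop words occurring in fewer than 3 percent of files, and exclude stop words) have a direct analogue in scikit-learn's CountVectorizer, sketched below. The built-in English stop-word list stands in for EnglishStoplist.txt here, which is an assumption; the two lists are not identical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df=0.03 plays the role of "% of files where word occurs" = 3;
# stop_words="english" stands in for EnglishStoplist.txt (an assumption).
vectorizer = CountVectorizer(stop_words="english", min_df=0.03)
word_counts = vectorizer.fit_transform(texts).toarray()   # documents x words
words = vectorizer.get_feature_names_out()                # the indexed vocabulary
print(f"{len(words)} words retained after filtering")
```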


PROCESSING THE DATA ANALYSIS

Next, click the Index button in the Startup Panel to begin the processing of the documents. After a few seconds (or minutes, depending on the speed of your computer hardware), the Results dialog will be displayed, as illustrated in Figure X.7.

FIGURE X.7 STATISTICA Text Miner “Results” dialog.

The options available at this point are described in some detail in the Introductory Overview (see the online Help that is part of STATISTICA Text Miner), as well as in the documentation for the TM results dialog. The primary goal of this analysis is to derive a good classification model for automatically classifying documents (news stories) as relevant or not relevant to Earnings.


SAVING THE EXTRACTED WORD FREQUENCIES TO THE INPUT FILE

The next step is to write the extracted word frequencies back to the input file so we can use these frequencies for further analyses. Select the Save results tab (see Figure X.8). To write the 349 words that were extracted back into the input file, we need to first “make room” in the data file. To do this, enter 349 into the Amount field.

FIGURE X.8 STATISTICA Text Miner Results dialog with the Save results tab selected.

Then click the Append empty variables button. If Reuters.sta was opened as a read-only file, we will be asked to save the file to a different directory (see Figure X.9).


FIGURE X.9 STATISTICA Text Miner spreadsheet with NewVar1 through NewVar349 added, opening up these columns so that the word counts of the 349 selected words can be written back to this spreadsheet.

With this operation, 349 blank new variables will be appended to the input file (see Figure X.9). Next, click the Write back current results (to selected variables) button to display the Assign statistics to variables, to save them to the input data dialog (see Figure X.10). Select all extracted words (variables) in the left pane and all newly created variables in the right pane.

FIGURE X.10 STATISTICA Text Miner “Assign statistics to variables” (save to input spreadsheet) dialog.


Then click Assign (see Figure X.11).

FIGURE X.11 STATISTICA Text Miner “Assign statistics to variables” dialog after clicking the Assign button. The words assigned and the spreadsheet columns to which they are assigned are shown in the lower panel of this dialog.

Next, click OK to complete this operation. The newly added variables will automatically be assigned the appropriate variable names to reflect the respective word that was extracted, and the respective frequency counts will automatically be written to the new variables (see Figure X.12).

FIGURE X.12 STATISTICA Text Miner spreadsheet showing that the word counts have now been written into the spreadsheet, following the assignments made in the previous dialog.
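The equivalent of appending the 349 word-count variables to the data file can be sketched in pandas by turning the document-by-word count matrix from the earlier CountVectorizer sketch into named columns; the output file name is hypothetical.

```python
import pandas as pd

# Turn the document-by-word count matrix into named columns and append them
# to the reference table, mirroring the "write back" step described above.
counts_df = pd.DataFrame(word_counts, columns=words, index=refs.index)
refs_with_counts = pd.concat([refs, counts_df], axis=1)
refs_with_counts.to_csv("Reuters_with_counts.csv", index=False)  # hypothetical output file
```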


These simple steps conclude the text mining-specific portion of this analysis. What remains is to build a good model for predicting the content (Earnings: Yes/No) of the news stories so that we can classify them automatically.

INITIAL FEATURE SELECTION

There are several ways in which we could proceed. As a first step, let’s use the powerful and efficient Feature Selection and Variable Screening facilities to identify a subset of important predictors from the 349 words that were extracted for inclusion in further model building. Technically, this isn’t necessary here because practically all methods for predictive classification available in STATISTICA Data Miner can handle this many predictors. However, to illustrate how quickly models can be built, let’s first use the Feature Selection and Variable Screening methods. Select Feature Selection and Variable Screening (see Figure X.13) from the Data Mining menu. Then select variable “Topic: Earnings?” as the categorical dependent variable and all variables containing the word counts (which we wrote back to the input data) as continuous predictors (see Figure X.13).

FIGURE X.13 STATISTICA Text Miner “Feature Selection and Variable Screening” dialog.

Then click OK on the Feature Selection and Variable Screening dialog to display the FSL Results dialog. Specify to display the best 50 predictors of “Topic: Earnings?” (enter 50 into the Display field) and create the graph of the predictor importance (click the Histogram of importance for best k predictors button as illustrated in Figure X.14).


FIGURE X.14 STATISTICA Text Miner “Importance Plot” generated from “Feature Selection and Variable Importance” computations.

Judging from this plot, it may be sufficient to take only the first 20 or so predictors for final modeling (refer also to the Feature Selection and Variable Screening Overviews). We will use the best 20 variables (words) as the predictors for further model building (see Figure X.15), specifically with Classification and Regression Trees to build a final predictive model. In the Display field, specify to display 20 predictors, and click the Report of best k predictors (features) button to display the list of the best predictors in a report. Copy the 20 predictors (see Figure X.15) to the Clipboard to be used in the General Classification and Regression Trees (GC&RT) analysis.

FIGURE X.15 STATISTICA Data Miner and Text Miner best-predictor list resulting from the Feature Selection process. These predictors can be highlighted, copied, and pasted as the variables to use in additional analyses.
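A rough scikit-learn counterpart of this screening step is sketched below: rank the word-count predictors against the categorical target with a chi-square score and keep the best 20. The chi-square statistic and the "Yes"/"No" coding of the target are assumptions for illustration; STATISTICA's screening statistic may differ.

```python
from sklearn.feature_selection import SelectKBest, chi2

y = (labels == "Yes").astype(int)   # assumes the category is coded "Yes"/"No"

# Rank all word-count predictors against the target and keep the 20 best
selector = SelectKBest(score_func=chi2, k=20)
X_best = selector.fit_transform(word_counts, y)
best_words = [w for w, keep in zip(words, selector.get_support()) if keep]
print("Best 20 predictors:", best_words)
```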


GENERAL CLASSIFICATION AND REGRESSION TREES

Select General Classification/Regression Trees Models from the Data Mining menu. Standard C&RT is selected by default (see Figure X.16).

FIGURE X.16 STATISTICA Data Miner “General Classification and Regression Trees” dialog.

Click the OK button to display the Standard C&RT dialog, select the Categorical response (categorical dependent variable) checkbox, click the Variables button, and select variable “Topic: Earnings?” (see Figure X.17) as the Dependent variable. As the Continuous predictors, select the best 20 predictors (paste them into the variable selection dialog from the Clipboard) derived from the Feature Selection and Variable Screening analysis.

FIGURE X.17 STATISTICA variable selection dialog.


Click OK on the variable selection dialog, which brings back the Standard C&RT window.

FIGURE X.18 STATISTICA Standard C&RT dialog showing that the selected variables are now entered into the model, which will be computed when the OK button is clicked.

Next, on the Validation tab, select the V-fold cross-validation checkbox (to automatically select a robust model), and also specify variable Training as a Test sample variable (see Figure X.19), with the code Training defining the sample from which we will build the model (we will use the remaining cases to test the predictive validity of the model).

FIGURE X.19 STATISTICA C&RT dialog with the Validation tab selected, V-fold cross-validation checked, and the cross-validation window configured with the Training sample set to On.


Now click OK in the Standard C&RT dialog to begin the analysis. After a few seconds, the GC&RT Results dialog will be displayed. Click the Tree graph button on the Summary tab to review the final tree (see Figure X.20).

FIGURE X.20 STATISTICA Data Miner C&RT tree graph results.

The final tree is similar, although not identical, to that shown in Manning and Schütze (2002, Figure 16.1). Nevertheless, select the Classification tab of the GC&RT Results dialog, select the Test set option button to compute the predicted classifications for the (holdout) test sample, and click the Predicted vs. observed by classes button to obtain the confusion (misclassification) matrix illustrated in Figure X.21.

FIGURE X.21 STATISTICA Data Miner C&RT classification matrix (also known as a confusion, or misclassification, matrix) results.
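To make the train/holdout logic concrete, here is a hedged scikit-learn sketch of the same idea: a CART-style decision tree fit on the cases coded Training and evaluated on the remaining cases, continuing from the variables defined in the earlier sketches. The depth limit is an illustrative setting, not the tree size STATISTICA selects by V-fold cross-validation.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

train = is_training.to_numpy()   # True for the model-building (Training) sample
y_arr = y.to_numpy()

# CART-style tree, roughly analogous to Standard C&RT (depth limit is illustrative)
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_best[train], y_arr[train])

# Predicted vs. observed classes on the holdout (non-Training) cases
y_pred = tree.predict(X_best[~train])
print(confusion_matrix(y_arr[~train], y_pred))
print(f"Test accuracy: {accuracy_score(y_arr[~train], y_pred):.2f}")
```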

The matrix in Figure X.21 translates into a classification model with a predictive accuracy rate of 94 percent!

MARSplines example: To make this a faster-running teaching example, we will use just 100 of the Reuters documents (see Figures X.22 and X.23).


FIGURE X.22 Selection of the Reuters text files from the folder where they reside.

FIGURE X.23 Open document files window.

All parameters on all tabs will be left at their defaults except the number of words to be selected, which we will change to 300 (see Figure X.24).


FIGURE X.24 Text Mining dialog “Advanced” tab, where we have selected just 300 words to be returned from the indexing.

Click Index to compute the frequency of words, as illustrated in Figure X.25. Then compute Concepts using the SVD procedure (as explained in other tutorials in this book), and save all results back to the master data file. Thirty-six concepts were extracted.

FIGURE X.25 Results of text mining the subset of just 100 documents.
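The concept extraction step can be sketched with scikit-learn's TruncatedSVD (latent semantic analysis) applied to the document-by-word count matrix built earlier. Using 36 components mirrors the 36 concepts reported, although the exact scaling STATISTICA applies to the SVD scores may differ, and the tutorial's run uses the 100-document, 300-word subset rather than the full collection.

```python
from sklearn.decomposition import TruncatedSVD

# Latent semantic analysis: project the word-count matrix onto 36 "concept"
# dimensions, mirroring the SVD step described above.
svd = TruncatedSVD(n_components=36, random_state=0)
concepts = svd.fit_transform(word_counts)   # one row of 36 concept scores per document
```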


Select MARSplines from the Data Mining pull-down menu (see Figures X.26 and X.27).

FIGURE X.26 MARSplines algorithm selection from the Data Mining pull-down menu.

FIGURE X.27 MARSplines dialog window.


Next, the variables are selected as illustrated in Figure X.28.

FIGURE X.28 Select variables window.

The results of this MARSplines analysis are shown in Figures X.29 and X.30.

FIGURE X.29 Results window for MARSplines.


Confusion matrix, Topic: Earnings? (Reuters_100 cases, for MARS and NBayes); Predicted (rows) x Observed (columns):

Predicted    Observed: No    Observed: Yes
No           67              7
Yes          3               23

FIGURE X.30 The resulting confusion matrix shows that most of the documents were classified correctly with respect to the Earnings target variable.
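scikit-learn itself has no MARS implementation; the contributed py-earth package provides one. The sketch below, continuing from the concept scores and 0/1 target defined earlier, fits a MARS model and thresholds its continuous prediction at 0.5 to obtain a Yes/No classification. The threshold and the max_degree setting are illustrative choices, not the STATISTICA MARSplines procedure.

```python
from pyearth import Earth                      # contributed MARS implementation (py-earth)
from sklearn.metrics import confusion_matrix

# Fit MARS to the concept scores with the 0/1 earnings indicator as the target
mars = Earth(max_degree=1)
mars.fit(concepts, y_arr)

# Threshold the continuous spline prediction to obtain a Yes/No class
y_hat = (mars.predict(concepts) >= 0.5).astype(int)

# Note: scikit-learn's confusion_matrix puts observed classes in the rows,
# whereas Figure X.30 puts predicted classes in the rows.
print(confusion_matrix(y_arr, y_hat))
```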

The Naïve Bayes Classifier will be computed next for comparison with the trees and MARSplines models. The selection of the Naïve Bayes Classifier from the Data Mining pull-down menu is illustrated in the next series of figures, Figures X.31 through X.34.

FIGURE X.31 Selecting the Naïve Bayes Classifier algorithm.

FIGURE X.32 Naïve Bayes Classifiers.


FIGURE X.33 Naïve Bayes dialog window.

FIGURE X.34 Variables selected. We will use only the Concepts for this computation, since the concepts have extracted information from all of the 300 words.


Leaving all tabs of the Naïve Bayes dialog at their defaults, click OK to run the computations. The results are shown in Figure X.35.

FIGURE X.35 Results of Naïve Bayes.

We will not go into all of the results here, but you can go to the companion website, get the “Naïve Bayes Workbook results.stw” file, open it in STATISTICA, and view the specific results obtained by clicking the various results buttons seen in Figure X.35.
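Because the concept scores are continuous, a Gaussian naive Bayes model is the natural scikit-learn analogue of this step; STATISTICA's exact distributional assumptions may differ. A minimal sketch, continuing from the concepts and target defined earlier and using 5-fold cross-validation to estimate accuracy:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Naive Bayes on the 36 concept scores only, mirroring the variable selection
# above (the concepts condense the information carried by the 300 indexed words)
nb = GaussianNB()
scores = cross_val_score(nb, concepts, y_arr, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```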

K-NEAREST NEIGHBORS MODELING

Let’s try one more algorithm on this dataset: K-Nearest Neighbors, as illustrated in Figures X.36 through X.38.


FIGURE X.36 Selecting the K-Nearest Neighbors algorithm in STATISTICA.

FIGURE X.37 Variable selection for the K-Nearest Neighbors computation.


FIGURE X.38 Cross-validation selected on the appropriate tab.

Then click OK to compute the K-Nearest Neighbors results (see Figure X.39).

FIGURE X.39 K-Nearest Neighbors Results.


We won’t go into these K-Nearest Neighbors results in detail. You can go to the DVD and find the K-Nearest Neighbors Results Workbook to examine some of the details. Only one graph will be presented here, showing that this model correctly identified documents more often than not (see Figure X.40).

[Figure X.40 is a histogram titled “Histogram of Topic: Earnings? (Accuracy) (Test),” plotting the number of observations classified Correct versus Incorrect.]

FIGURE X.40 Histogram of prediction of earnings from the Reuters documents, using K-Nearest Neighbors modeling.
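A hedged scikit-learn sketch of the same idea, continuing from the earlier variables: choose the number of neighbors by cross-validation on the Training sample, then tally correct versus incorrect predictions on the holdout cases as in Figure X.40. The candidate grid for the number of neighbors is an assumption, not STATISTICA's default search.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Choose the number of neighbors by cross-validation on the Training sample
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
grid.fit(concepts[train], y_arr[train])

# Tally correct vs. incorrect predictions on the holdout cases (cf. Figure X.40)
y_pred = grid.predict(concepts[~train])
n_correct = int((y_pred == y_arr[~train]).sum())
print(f"Correct: {n_correct}   Incorrect: {len(y_pred) - n_correct}")
```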

CONCLUSION

This example illustrated how the various methods in STATISTICA Text Mining and Document Retrieval, along with STATISTICA Data Miner, can be used to build highly accurate predictive models for classifying text. The STATISTICA system is particularly well suited for this purpose because of the seamless integration of all components of the data and text mining facilities of the system.

REFERENCE

Manning, C. D., and Schütze, H. (2002). Foundations of statistical natural language processing (5th edition). Cambridge, MA: MIT Press.