Text Mining the Federalist Papers

This demonstration illustrates how to use the Text Miner node to identify the author or authors of the 12 unattributed Federalist Papers.

Some of the demonstrations in this course are “pre-cooked”; that is, a project diagram already exists in the course data that contains the complete demonstration. The instructor might use the pre-cooked version, or might choose to re-create the diagram using a different name. Creating and running a diagram can consume valuable class time.

SAS Enterprise Miner requires the user to associate an analysis data set with metadata. The Data Source Wizard guides you through creation of the metadata table. After the metadata table has been created, the data is available as an input data source and can be placed in a SAS Enterprise Miner diagram. Most data sets in this course have already been set up as SAS Enterprise Miner input data sources. Demonstration instructions usually contain a preliminary step requesting, “If necessary, create an input data source for the demonstration data.” Chapter 2 provides details about creating data sets for text collections and setting up input data sources.

Access to SAS Enterprise Miner depends on the nature of your computing environment, whether attending a SAS classroom offering, participating in a Live Web class, or receiving instruction on-site at your organization. These course notes assume that you are using a workstation version of SAS Enterprise Miner with course data stored in the following folder:

Course Data Folder: D:\workshop\dmtxt51

Course SAS Program Folder: D:\workshop\dmtxt51\sassrc

1. Write the locations for your environment below:

Your Course Data Folder:

Your Course Program Folder:

You can access SAS Enterprise Miner in many different ways. For example, the Web-based access method uses a Web browser entry point. Some courses use a virtual lab setting to provide access to SAS Text Miner software. The server platform for the virtual lab contains both client and server installations on a single computer. The course data resides on a hard drive accessed directly by the server; that is, the data does not travel through a network connection. You might have to connect through some Web-based application to access the virtual lab. Your instructor will provide complete instructions along with the necessary logon credentials.

Following is a typical logon window for SAS Enterprise Miner 7.1.

To access the logon window, you typically select Start > Programs > SAS > Analytics > SAS Enterprise Miner. Usually, a shortcut for SAS Enterprise Miner is also on the desktop.

2. For most course environments, including those with a virtual lab, the logon fields will be populated, and you can just click Log On. If necessary, type your user ID and password and click Log On. The welcome window is shown below.

3. The typical SAS classroom configuration has an established set of SAS Enterprise Miner projects. The project for this course is DMTXT51. The project can be accessed by selecting Recent Projects. To create a new project, select New Project.

4. Select Recent Projects. The projects that you see might differ from those shown below. Select the DMTXT51 project, and then click Open. You might be instructed to pick a different project.

The DMTXT51 project opens.

5. As mentioned above, most data sources have already been processed for use by SAS Enterprise Miner. For illustration, the steps for creating a data source are included, but you can skip them if the data source already exists. Select File > New > Data Source. The Data Source Wizard appears.

6. Click Next.

7. Click Browse.

8. Select the table Federalistpapers from the Dmtxt library. Click OK.

9. Click Next.

10. Click Next.

11. Click Next.

12. You need to change the role of the variable Author from Input to Label. Select the row that Author is in and then right-click the word Input. Select Label. Also, change the Level of the variable Target to Nominal.

13. Click Next.

14. Click Next.

15. Click Next.

16. Click Finish.

The Federalist Papers data source is now available for your diagrams.

17. Select File > New > Diagram. Name the diagram Federalist Papers. Click OK.

18. Click the Federalist Papers data source, hold down the mouse button, and drag the data source into the diagram. Release the mouse button when the data source is in the diagram.

19. Right-click the data source node and select Run. When the Run Completed window appears, select OK. The green circle with the white check mark indicates that the node ran successfully at least one time. It does not mean that the node is necessarily up-to-date.

20. Click the Sample tab, and drag a Filter node into the diagram. Attach the input data source node to the Filter node.

The Filter node Properties panel follows:

21. Change the Default Filtering Method property to None for both class and interval variables.

22. Select the Class Variables property. Change the Filtering Method property for the variable Target to User Specified. Click the Generate Summary button. Hold down the Ctrl key and click the bars corresponding to values 2 and 3. Then select Apply Filter.

23. Click OK. Run the Filter node. The Results window reveals that eight documents have been excluded.
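
The effect of the class filter in steps 22 and 23 can be sketched outside the GUI. The following Python fragment is only an illustration: the document IDs, the Target codes other than 2 and 3, and the function name `apply_class_filter` are hypothetical, and the toy table is not the course data.

```python
# Hypothetical stand-in for the course table: (document id, Target code).
# The ids and codes below are made up for illustration only.
documents = [
    ("doc01", 0),
    ("doc02", 1),
    ("doc03", 2),
    ("doc04", 3),
    ("doc05", 1),
    ("doc06", 2),
]

# Class values selected for exclusion in the Filter node (step 22).
EXCLUDED_TARGET_VALUES = {2, 3}


def apply_class_filter(rows, excluded):
    """Split rows into (kept, removed) based on the Target code."""
    kept = [row for row in rows if row[1] not in excluded]
    removed = [row for row in rows if row[1] in excluded]
    return kept, removed


kept, removed = apply_class_filter(documents, EXCLUDED_TARGET_VALUES)
print(f"kept {len(kept)} documents, excluded {len(removed)}")
```

After the filter, only two Target values remain in the kept rows, which is why the next step can change the level of Target to Binary.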

24. On the Utility tab, select the Metadata node and drag it into the diagram. Attach it to the Filter node. Select the property Train. Change the level of the Target variable from Nominal to Binary.

25. Click OK. Run the Metadata node.

26. Attach a Text Parsing node to the Metadata node. Change the Synonyms property to No Data Set to be Specified. Change the Stop List property to DMTXT.FederalistStop. Change the Find Entities property to Standard. Run the Text Parsing node.

27. Attach a Text Filter node to the Text Parsing node. Change the Term Weight property to Inverse Document Frequency. Change the Minimum Number of Documents property to 2. Run the Text Filter node.
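
To make the two Text Filter settings in step 27 concrete, here is a minimal Python sketch of inverse document frequency weighting combined with a minimum document count. The formula used, log2(N / df) + 1, is one common formulation and is an assumption here; check the SAS Text Miner documentation for the exact definition the node applies. The toy documents and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical parsed documents: each is the list of terms kept after parsing.
parsed_docs = [
    ["upon", "government", "power"],
    ["government", "whilst", "power"],
    ["upon", "government", "union"],
    ["government", "power", "senate"],
]

MIN_DOCS = 2  # mirrors the Minimum Number of Documents property in step 27


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df


def idf_weights(docs, min_docs):
    """Return {term: weight} for terms found in at least min_docs documents.

    Assumed formulation: log2(N / df) + 1, where N is the number of documents.
    """
    n_docs = len(docs)
    df = document_frequency(docs)
    return {
        term: math.log2(n_docs / count) + 1
        for term, count in df.items()
        if count >= min_docs
    }


weights = idf_weights(parsed_docs, MIN_DOCS)
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in every document receive the lowest weight, and terms that occur in only one document are dropped before weighting.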

28. If using Text Miner 5.1: Attach a Text Cluster node to the Text Filter node. Set the Exact or Maximum Number property to Exact, and set the Number of Clusters property to 2. Run the Text Cluster node.

If using Text Miner 4.2: Attach a Text Miner node to the Text Filter node. Change the Stop List property to DMTXT.FederalistStop. Change the Synonyms property to No Data Set to be Specified. Change the Find Entities property to Yes. Change the Term Weight property to Inverse Document Frequency. Set the Exact or Maximum Number property to Exact, and set the Number of Clusters property to 2. Specify the number of Descriptive Terms to be 8.

29. Attach a Text Topic node to the Text Cluster node. Change the Number of Multi-term Topics property to 5. Run the Text Topic node.

30. If using Text Miner 4.2, skip this step. If using Text Miner 5.1, select the Utility tab, drag a Metadata node into the diagram, and attach it to the Text Topic node. Change the role of TextCluster_prob1 and TextCluster_prob2 to Rejected, and change the role of _DOCUMENT_ to ID. This action fixes an erroneous choice of variable roles that will be corrected in a later release of the software. An erroneous role of Text can cause a server error for certain downstream actions, and an erroneous role of Input can degrade predictor performance.

31. Run the Metadata node.

32. On the Model tab, drag a Regression node into the diagram and attach it to the Metadata node. Use the default settings, and run the Regression node. Because the target variable is binary, the Regression node will fit a logistic regression model to the data to predict the author of each essay.
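
As a rough illustration of what fitting a logistic regression to a binary target involves, the following Python sketch fits a one-feature model by batch gradient descent. It is not the estimation procedure used by the Regression node. The single numeric feature, the labels, the learning rate, and the function names are all hypothetical.

```python
import math

# Hypothetical training rows: one made-up numeric feature per essay
# (for example, a topic score) and a binary label for the author.
features = [0.1, 0.3, 0.4, 1.6, 1.8, 2.0]
labels = [0, 0, 0, 1, 1, 1]


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit intercept and slope by batch gradient descent on the log-loss."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_intercept = 0.0
        grad_slope = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(intercept + slope * x) - y
            grad_intercept += error
            grad_slope += error * x
        intercept -= learning_rate * grad_intercept / n
        slope -= learning_rate * grad_slope / n
    return intercept, slope


def predict(intercept, slope, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(intercept + slope * x) >= 0.5 else 0


intercept, slope = fit_logistic(features, labels)
predictions = [predict(intercept, slope, x) for x in features]
print(predictions)
```

In the demonstration, the inputs to the model are the variables produced by the upstream text nodes rather than a single hand-made feature.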

The process flow appears below.

33. Select the property Exported Data from the Properties panel of the Regression node. Click the TRAIN data and click the Explore button. Click on the plot wizard icon. In the plot wizard, select Bar (chart) and then click Next. Specify the role Category for variable Target and the role Group for variable I_Target. Click Finish. The following plot appears:

The regression model is 100% accurate in predicting the known authors of the 65 attributed essays, and the model predicts that the author of the unattributed essays is Madison. This result agrees with the conclusion of Mosteller and Wallace.
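
The bar chart built in step 33 is in effect a cross-tabulation of the actual value (Target) against the predicted value (I_Target). A minimal Python sketch of that tabulation follows. The rows are made up and are not the course results; representing an unattributed essay with `None` as its Target is an assumption, as are the function names.

```python
from collections import Counter

# Hypothetical scored rows: (Target, I_Target). None marks an essay whose
# author is unknown. These values are invented for illustration.
scored_rows = [
    ("A", "A"),
    ("A", "A"),
    ("B", "B"),
    ("B", "B"),
    ("B", "B"),
    (None, "B"),
]


def cross_tabulate(rows):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(rows)


def accuracy_on_known(rows):
    """Share of rows with a known actual value that are predicted correctly."""
    known = [(actual, predicted) for actual, predicted in rows if actual is not None]
    correct = sum(1 for actual, predicted in known if actual == predicted)
    return correct / len(known)


table = cross_tabulate(scored_rows)
for (actual, predicted), count in table.items():
    print(f"Target={actual!s:>4}  I_Target={predicted}  count={count}")
print(f"accuracy on known rows: {accuracy_on_known(scored_rows):.0%}")
```

Rows with no known Target are left out of the accuracy figure, because there is nothing to compare their predictions against.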