Text Mining Tutorial

A Tutorial for Text Classification using SQL Server 2005 Beta2 Data Mining

Peter Pyungchul Kim
SQL Business Intelligence
Microsoft Corporation

Introduction

This tutorial presents detailed steps for performing a typical text classification task using SQL Server 2005 Beta2. The sample dataset is obtained from http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html. It is a small subset of USENET newsgroup postings that belong to 5 different groups. The task is to build a mining model that classifies each posting into its group. This tutorial should be accompanied by an import-ready file, NGArticles.txt (or NGArticles.zip).

1 Create a database

1.1 In SQL Server Management Studio, connect to the local SQL server (localhost).

1.2 Create a new database and name it “TDM”.

2 Import News Group Articles to the database

2.1 Right-click the database, TDM, and choose Task, then Import.

    Source: NGArticles.txt (Flat File, unzipped from the provided NGArticles.zip)
    Header row delimiter: @@@@
    Check “Column names in the first data row”

    Row delimiter: @@@@
    Column delimiter: &&&&
    Column property for “ArticleText”: change DataType to DT_NTEXT
    Destination:
        Server: local SQL server (localhost)
        Database: TDM
        Table: NGArticles
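Multi-character delimiters are presumably used because article text can itself contain newlines and commas. As a rough illustration of the file layout described above, the following Python sketch splits such a file into rows. The column order (ID, NewsGroup, ArticleText) and the sample records are assumptions made for the example; they are not taken from NGArticles.txt.

```python
def parse_articles(raw, row_delim="@@@@", col_delim="&&&&"):
    """Split delimited text into a list of dicts keyed by the header row's column names."""
    # Drop empty records, e.g. the one produced by a trailing row delimiter.
    records = [r.strip() for r in raw.split(row_delim) if r.strip()]
    if not records:
        return []
    # The first record holds the column names.
    header = [name.strip() for name in records[0].split(col_delim)]
    rows = []
    for record in records[1:]:
        fields = [f.strip() for f in record.split(col_delim)]
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
        rows.append(dict(zip(header, fields)))
    return rows


# Made-up sample in the assumed layout; not real data from the tutorial file.
sample = (
    "ID&&&&NewsGroup&&&&ArticleText@@@@"
    "1&&&&sci.space&&&&The shuttle launch\nwas delayed again.@@@@"
    "2&&&&rec.autos&&&&My car needs a new clutch.@@@@"
)
articles = parse_articles(sample)
print(len(articles), articles[0]["NewsGroup"])
```

Note that the line break inside the first article's text survives parsing, which a newline-delimited format would not allow.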

3 Build a dictionary

3.1 Start Business Intelligence Development Studio with a new Integration Services project called “TextDataMining”. This will create a solution and an Integration Services project in it, both of which are named “TextDataMining”.

3.2 Rename the Integration Services project as “PrepareArticles” just for convenience.

3.3 Create a new DTS (SSIS) package.

3.4 Rename the package to BuildDictionary.dtsx.

3.5 Go to the Data Flow tab and add a new Data Flow task.

3.6 In the data flow task, add an “OLE DB Source” transform.

    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ArticleText only

3.7 Add a “Term Extraction” transform and connect it from the OLE DB Source transform.

    Term Type: Noun and Noun Phrase
    Score Type: TFIDF
    Parameters: Frequency=10, Length=2

3.8 Add a “Sort” transform and connect it.

    Sort “Term” in ascending order
    Don’t pass through the Score column

3.9 Add an “OLE DB Destination” transform and connect it.

    Use the connection: localhost.TDM
    Click “New” and name it “Dictionary”
    In Mappings, connect the column “Term”

3.10 Execute the package.

    It automatically enters debugging mode
    It may take a few minutes

3.11 Stop debugging
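To give a feel for what the dictionary-building step computes, here is a minimal Python sketch of TF-IDF term scoring. It rests on several assumptions: Frequency=10 is read as a minimum occurrence count and Length=2 as the maximum number of words per term; every word sequence is treated as a candidate term (the transform itself keeps only nouns and noun phrases); and the score uses a common TF-IDF form, frequency times log(rows / rows containing the term), which may not match the transform's exact formula. The function name `build_dictionary` is hypothetical.

```python
import math
import re
from collections import Counter


def build_dictionary(texts, min_frequency=10, max_length=2):
    """Return {term: tfidf} for word n-grams (n <= max_length) seen at least min_frequency times."""
    total = Counter()      # occurrences of each term across all texts
    rows_with = Counter()  # number of texts that contain each term
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        terms = [
            " ".join(words[i:i + n])
            for n in range(1, max_length + 1)
            for i in range(len(words) - n + 1)
        ]
        total.update(terms)
        rows_with.update(set(terms))
    n_rows = len(texts)
    # Assumed score: frequency * log(number of rows / rows containing the term).
    return {
        term: freq * math.log(n_rows / rows_with[term])
        for term, freq in total.items()
        if freq >= min_frequency
    }


# Tiny made-up corpus, with a lowered threshold so that something survives.
texts = ["space shuttle launch", "space station crew", "car engine repair"]
scores = build_dictionary(texts, min_frequency=2, max_length=2)
for term in sorted(scores):  # ascending order, as in step 3.8
    print(term, round(scores[term], 3))
```

With a threshold of 2, only "space" is kept here, since every other term occurs once.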

4 Build term vectors

4.1 Create a new DTS (SSIS) package.

4.2 Rename the package to BuildTermVectors.dtsx.

4.3 Go to the Data Flow tab and add a new Data Flow task.

4.4 In the data flow task, add an “OLE DB Source” transform.

    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ID, ArticleText only

4.5 Add a “Term Lookup” transform and connect it from the previous transform.

    Reference table: Dictionary
    PassThru column: ID
    Lookup input column: ArticleText

4.6 Add a “Sort” transform and connect it.

    Sort “ID” in ascending order, then “Term” in ascending order, with no duplicates

4.7 Add an “OLE DB Destination” transform and connect it.

    Use the connection: localhost.TDM
    Click “New” and name it “TermVectors”
    In Mappings, make sure to connect all columns: “Term”, “Frequency”, “ID”

4.8 Execute the package.

    It automatically enters debugging mode
    It may take a few minutes

4.9 Stop debugging

(Note that the picture doesn’t include the Derived Column transform built in step 4.5.)
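The term-vector step above produces one row per (article, dictionary term) pair with the number of occurrences. The Python sketch below shows that idea under simplifying assumptions: words are matched case-insensitively on letters only, and overlapping terms are each counted independently, which may differ from how the Term Lookup transform treats them. The function name `build_term_vectors` and the sample articles are hypothetical.

```python
import re
from collections import Counter


def build_term_vectors(articles, dictionary):
    """Return sorted (ID, Term, Frequency) rows for dictionary terms found in each article."""
    # Longest dictionary term, in words, bounds the n-grams we need to count.
    max_words = max((len(t.split()) for t in dictionary), default=0)
    rows = []
    for article_id, text in articles:
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(
            " ".join(words[i:i + n])
            for n in range(1, max_words + 1)
            for i in range(len(words) - n + 1)
        )
        rows.extend(
            (article_id, term, counts[term])
            for term in dictionary
            if counts[term] > 0
        )
    # Sort by ID, then Term, mirroring step 4.6.
    return sorted(rows)


# Made-up articles and dictionary, for illustration only.
articles = [(2, "The engine stalled."), (1, "Space shuttle; the shuttle crew.")]
vectors = build_term_vectors(articles, {"shuttle", "space shuttle", "engine"})
for row in vectors:
    print(row)
```

Each output row corresponds to one row of the TermVectors table, which later serves as the nested table of the mining structure.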

5 Prepare train/test samples

5.1 Create a new DTS (SSIS) package.

5.2 Rename the package to PrepareSamples.dtsx.

5.3 Go to the Data Flow tab and add a new Data Flow task.

5.4 In the data flow task, add an “OLE DB Source” transform.

    Connection: create a new one for localhost.TDM
    Table: NGArticles
    Columns: ID, NewsGroup only

5.5 Add a “Percentage Sampling” transform and connect it from the OLE DB Source transform.

    Sampling rate: 70%
    Selected rows: Train sample (70%)
    Unselected rows: Test sample (30%)

5.6 Add two “OLE DB Destination” transforms and connect them from the Percentage Sampling transform (one from the Train sample, the other from the Test sample).

    Use the connection: localhost.TDM
    Click “New” and name them “TrainArticles” and “TestArticles” respectively
    In Mappings, make sure to connect all columns: “ID”, “NewsGroup”

5.7 Execute the package.

    It automatically enters debugging mode

5.8 Stop the debugging mode.
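The 70/30 split above can be pictured as a per-row random decision. The following Python sketch is one simple way to do it, assuming each row is selected independently with probability 0.7; the function name `percentage_split` and the fixed seed are choices made for the example, not settings from the tutorial.

```python
import random


def percentage_split(rows, rate=0.7, seed=0):
    """Send each row to the selected output with probability `rate`, else to the unselected one."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    selected, unselected = [], []
    for row in rows:
        (selected if rng.random() < rate else unselected).append(row)
    return selected, unselected


# Made-up (ID, NewsGroup) rows, for illustration only.
rows = [(i, "group") for i in range(1000)]
train, test = percentage_split(rows)
print(len(train), len(test))
```

With independent per-row selection, the two outputs come out close to, but generally not exactly, 70% and 30% of the input. Every row lands in exactly one of the two outputs, so no article is used for both training and testing.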

6 Build/Test/Refine data mining models

6.1 Add a new Analysis Services project and name it “DataMining”.

6.2 Create a Data Source that refers to the database, TDM, on the local SQL server.

6.3 Create a Data Source View using the data source, TDM. Add the following tables to the DSV: TrainArticles, TestArticles, and TermVectors.

6.4 Create a Mining Structure as follows:

    Algorithm: Microsoft_Decision_Trees
    DSV to use: TDM
    Case table: TrainArticles
    Nested table: TermVectors
    Columns usage:

    Name the structure “NGArticlesDM” and the model “NGArticlesDM_DT”

6.5 Right-click the model, NGArticlesDM_DT, and select “New Mining Model…” to add the following two additional models:

    NGArticlesDM_NB with the Microsoft_Naive_Bayes algorithm
    NGArticlesDM_NN with the Microsoft_Logistic_Regression algorithm

6.6 Right-click each model and set the algorithm parameters as follows:

    NGArticlesDM_DT: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NB: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)
    NGArticlesDM_NN: Disable automatic feature selection (MAXIMUM_INPUT_ATTRIBUTES=0)

6.7 Deploy the project by pressing F5. It may take several minutes to train all three models.

6.8 Select the “Mining Accuracy” tab to see the lift chart, using “TestArticles” and “TermVectors” as the test data, and compare the classification accuracy of the three trained models.
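To make the train-then-evaluate idea concrete outside the tool, here is a small Python sketch of a multinomial naive Bayes classifier over term vectors, scored by plain accuracy on held-out cases. It is a textbook formulation with add-one smoothing, not the Microsoft_Naive_Bayes implementation, and plain accuracy is a simpler measure than the lift chart. All function names, newsgroup labels, and term counts are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(cases):
    """cases: list of (label, {term: frequency}). Returns the counts needed for prediction."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for label, terms in cases:
        label_counts[label] += 1
        term_counts[label].update(terms)
        vocabulary.update(terms)
    return label_counts, term_counts, vocabulary


def predict(model, terms):
    """Return the label with the highest log posterior, using add-one smoothing."""
    label_counts, term_counts, vocabulary = model
    n_cases = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        denom = sum(term_counts[label].values()) + len(vocabulary)
        score = math.log(label_counts[label] / n_cases)
        for term, freq in terms.items():
            if term in vocabulary:  # ignore terms never seen in training
                score += freq * math.log((term_counts[label][term] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, cases):
    """Fraction of cases whose predicted label matches the true label."""
    hits = sum(predict(model, terms) == label for label, terms in cases)
    return hits / len(cases)


# Made-up training and test cases: (NewsGroup, term vector).
train = [
    ("sci.space", {"shuttle": 2, "orbit": 1}),
    ("sci.space", {"orbit": 2, "launch": 1}),
    ("rec.autos", {"engine": 2, "clutch": 1}),
    ("rec.autos", {"clutch": 1, "tire": 2}),
]
test = [
    ("sci.space", {"launch": 1, "shuttle": 1}),
    ("rec.autos", {"engine": 1, "tire": 1}),
]
model = train_naive_bayes(train)
print(accuracy(model, test))
```

The case/nested-table layout of the mining structure maps onto this directly: each case is one article with its label, and its nested rows are the term frequencies.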

6.9 Browse the models. Note that browsing a model’s content may take a considerable amount of time because of the complexity of the models. For example, NGArticlesDM_NB and NGArticlesDM_NN each involve more than 5,000 attributes (scores/coefficients). Browsing NGArticlesDM_NN took 3 minutes on a PC with a 3 GHz Xeon CPU and 2 GB of memory.

7 Deploy data mining models

Not covered in this tutorial at this time.
