SF Data Mining Meetup September 22, 2014


Copyright © 2014 KNIME.com AG

Text Analytics Tutorial

SF Data Mining Meetup

September 22, 2014

Kilian Thiel, Rosaria Silipo, Cathy Pearl

KNIME.com AG, Zurich, Switzerland

www.knime.com

@KNIME


Tool Installation

• Download open source KNIME analytics platform from:

http://www.knime.org/knime-analytics-platform-sdk-download

• Select package for your OS and install

• Open the KNIME application

• In the top menu select “File” or “LOCAL” -> “Install KNIME Extensions”

• Install “KNIME & Extensions” and “KNIME Labs Extensions”


Install KNIME Extensions (incl. Text Processing)


Requirements to import and run Demo Workflows

• KNIME 2.10

• Text Processing Extension from KNIME Labs Extensions

• Distance Matrix from KNIME Extensions

Memory Tip

In the file knime.ini, set the maximum Java heap size (-Xmx) as high as your available memory allows, for example:

• -Xmx3G
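The same change can be scripted. The following is a minimal Python sketch, assuming knime.ini follows the usual Eclipse launcher layout in which JVM options come after a -vmargs line. The function name set_max_heap and the sample file content are hypothetical and not taken from a real KNIME installation.

```python
def set_max_heap(ini_text: str, size: str) -> str:
    """Return ini_text with its -Xmx line set to the given size, e.g. '3G'."""
    new_line = f"-Xmx{size}"  # JVM syntax: no space between the flag and the value
    lines = ini_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            lines[i] = new_line  # overwrite the existing heap setting
            break
    else:
        # No heap setting found: JVM options are assumed to belong after -vmargs.
        if "-vmargs" not in [l.strip() for l in lines]:
            lines.append("-vmargs")
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical minimal file content, for illustration only.
sample = "-vmargs\n-Xmx1024m\n"
print(set_max_heap(sample, "3G"))
```

Back up knime.ini before overwriting it, and restart KNIME so that the new setting takes effect.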


Resources

• The KNIME Website (www.knime.org)

• LEARNING HUB under RESOURCES (www.knime.org/learning-hub)

• Use Cases and White Papers for example workflows

• FORUM for questions and answers

• DOCUMENTATION for documentation, FAQ, change-logs, ...

• LABS for new developments and experimental nodes

• COMMUNITY for development instructions and third party nodes

• Blog for news, tips and tricks (www.knime.org/blog)

• KNIME TV channel on YouTube: Text Mining Webinar http://www.youtube.com/watch?v=tY7vpTLYlIg

• KNIME on Twitter: @KNIME

Resources

eBooks from the KNIME Press:

http://www.knime.org/knimepress

- KNIME Beginner’s Luck

- The KNIME Cookbook

- The KNIME Booklet for SAS Users

Free Beginner’s Guide – use code “meetupsf14”


Text Processing Steps

1. Import Data

2. Enrichment (Tagging)

3. Pre-processing (Filtering, Stemming, …)

4. Transformation (BoW, Frequencies, Document Vector)

5. Classification, Clustering

Document Type Cell

Term Type Cell


Import Demo Workflows

• Download zip file with demo workflows from meetup site

• Open the KNIME application

• In the top menu, select File -> Import KNIME Workflow ...

• Enable option “Select Archive File”

• Browse to zip file

• Import all workflows and data into KNIME


Demo Workflows

0-TripAdvisorCrawling: Importing data from the web

1-Reading: Importing data from text, Word, PDF, Twitter, XML, …

2-Enrichment POS: String to Document and Word Tagging in Document

3-Preprocessing: Filtering and Stemming

4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector

Other workflows for multi-words, clustering, topic extraction, and reporting.


Demo: The KNIME Workbench


Text Processing Category


Demo: TripAdvisor Restaurant Data Set (SF)


Demo: TripAdvisor Data (SF Restaurants)


Reviews about Italian and Chinese restaurants in San Francisco

• Chinese: 272

• Italian: 268


Demo: Goal of this Tutorial


Goal:

• Build a classifier to distinguish between Chinese and Italian restaurants, based on the reviews.

Italian or Chinese Restaurant?


Demo: Final Workflow


1.) Reading

Read/Parse textual data


Demo

Reading

• Read TripAdvisor data (.table file)

• Filter rows with missing restaurant value

• Convert strings to documents

• Filter all but the document column

• Examples of other possible formats to import
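Outside KNIME, the reading steps above can be pictured with a few lines of Python. This is only a sketch: the column names (restaurant, cuisine, review), the Document class, and the sample rows are assumptions for illustration and do not reflect the actual layout of the TripAdvisor table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Simplified stand-in for a KNIME document cell."""
    title: str
    text: str
    category: str


# Hypothetical input rows; the real data comes from the .table file.
rows = [
    {"restaurant": "Trattoria A", "cuisine": "Italian", "review": "Great pasta and friendly staff."},
    {"restaurant": None, "cuisine": "Chinese", "review": "Review without a restaurant name."},
    {"restaurant": "Dumpling B", "cuisine": "Chinese", "review": "The dumplings were excellent."},
]


def to_documents(rows):
    """Drop rows with a missing restaurant value and convert the rest to documents."""
    documents = []
    for row in rows:
        if not row.get("restaurant"):
            continue  # filter rows with missing restaurant value
        documents.append(
            Document(title=row["restaurant"], text=row["review"], category=row["cuisine"])
        )
    return documents  # only the document "column" is kept


documents = to_documents(rows)
print(len(documents), "documents kept")
```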


0.) Web Crawler Workflow

Palladian Extension from:

KNIME Community Contributions – Other


Demo

Reading

• Web Crawler Workflow to get data from the Web

• Palladian Community Contributions Extension

• HtmlParser node

• XPath node
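To illustrate what parsing a page and querying it with XPath involves, here is a small Python sketch using only the standard library. It is not the Palladian implementation. The page markup and the class names are invented, and the snippet is well-formed XHTML because xml.etree.ElementTree cannot parse arbitrary real-world HTML and supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed page fragment standing in for a downloaded review page.
page = """
<html><body>
  <div class="review"><span class="title">Lovely dinner</span><p>Fresh pasta, great wine.</p></div>
  <div class="review"><span class="title">Too salty</span><p>The noodles were too salty.</p></div>
  <div class="ad"><p>Book a table now!</p></div>
</body></html>
"""


def extract_reviews(xhtml: str):
    """Return (title, text) pairs for every div whose class attribute is 'review'."""
    root = ET.fromstring(xhtml.strip())
    reviews = []
    for div in root.findall(".//div[@class='review']"):
        title = div.findtext("span[@class='title']", default="")
        text = div.findtext("p", default="")
        reviews.append((title, text))
    return reviews


for title, text in extract_reviews(page):
    print(title, "|", text)
```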


2.) Enrichment

Enrich documents with semantic information

This assigns a tag to each word:

- Grammar tags (POS)

- Context dependent tags

- Sentiment tags

- Named Entity tags

- Custom tags

Demo

Enrichment / Tagging

• Apply POS Tagger node

• Use Bag of Words node to inspect tagging result

• Show other possible Taggings


3.) Preprocessing

Preprocess documents and filter words


Demo

Preprocessing

• Filter

– Numbers

– Punctuation marks

– Stop Words

• Convert to lower case

• Stemming (Snowball stemmer, chosen because it supports many languages)

• Keep only nouns (NN), verbs (VB), adjectives (JJ)
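The preprocessing chain above can be sketched in Python as follows. The input is assumed to be a list of (word, POS tag) pairs produced by an earlier tagging step. The stop word list is a tiny illustrative sample, and crude_stem is a deliberately simple suffix stripper standing in for the Snowball stemmer, not an implementation of it.

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "at", "of", "very"}  # illustrative only
KEEP_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives


def crude_stem(word: str) -> str:
    """Strip a few common English suffixes. A stand-in, NOT the Snowball algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters, lower-casing, stemming and POS filter in the order listed above."""
    result = []
    for word, tag in tagged_terms:
        if word.isdigit():
            continue  # number filter
        if all(ch in string.punctuation for ch in word):
            continue  # punctuation filter
        if word.lower() in STOP_WORDS:
            continue  # stop word filter
        stem = crude_stem(word.lower())  # lower case, then stem
        if tag.startswith(KEEP_TAG_PREFIXES):
            result.append((stem, tag))  # keep only NN*, VB*, JJ* terms
    return result


tagged = [("The", "DT"), ("noodles", "NNS"), ("were", "VBD"), ("amazing", "JJ"),
          (",", ","), ("served", "VBN"), ("at", "IN"), ("8", "CD"), ("!", ".")]
print(preprocess(tagged))
```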


4.) Transformation

Creation of numerical representation of documents

BoW creates the list of words for each document

TF calculates word frequencies (absolute or relative) in each document

Demo

Transformation

• Transform to bag of words

• Compute TF value for terms

TFrel(word) = n(word) / N

IDF(word) = log(1 + n(docs) / n(word, docs))

TFrel(word) * IDF(word) is used often

ICF(word) = log(1 + n(cat) / n(word, cat))

• Sort output data by frequency
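The frequency measures on this slide can be worked through in a short Python sketch. It reads n(word) as the number of occurrences of the word in a document, N as the number of words in that document, n(docs) as the number of documents, and n(word, docs) as the number of documents containing the word; ICF is the same ratio over categories. These readings, the natural logarithm, and the toy documents are assumptions, since the slide does not spell them out.

```python
import math
from collections import Counter


def tf_rel(term, words):
    """Relative term frequency: occurrences of term divided by document length N."""
    return Counter(words)[term] / len(words)


def inverse_frequency(term, groups):
    """log(1 + number of groups / number of groups containing term).

    With documents as groups this is IDF; with categories as groups it is ICF.
    """
    containing = sum(1 for words in groups.values() if term in words)
    if containing == 0:
        raise ValueError(f"{term!r} does not occur in any group")
    return math.log(1 + len(groups) / containing)  # natural log is an assumption


# Toy bags of (already preprocessed) words, invented for illustration.
docs = {
    "d1": ["pasta", "wine", "pasta"],
    "d2": ["noodl", "tea"],
    "d3": ["pasta", "tea"],
}
categories = {"Italian": docs["d1"] + docs["d3"], "Chinese": docs["d2"]}

# TFrel * IDF for every (document, term) pair, sorted by descending score.
scores = {
    (doc_id, term): tf_rel(term, words) * inverse_frequency(term, docs)
    for doc_id, words in docs.items()
    for term in set(words)
}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for (doc_id, term), score in ranked:
    print(doc_id, term, round(score, 3))
```

A term that occurs in every document still gets a positive IDF of log(2) under this formula, so it is down-weighted rather than removed.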


4.) Transformation

Creation of numerical representation of documents


Demo

Transformation

• Transform to document vectors

• Extract category (class) value
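As a rough picture of what a document vector is, the Python sketch below turns each bag of words into a vector with one position per vocabulary term and keeps the category as a separate class column. The toy documents are invented, and the binary/relative switch is an illustrative assumption rather than a description of the KNIME node's options.

```python
from collections import Counter

# Invented (category, terms) pairs standing in for preprocessed documents.
docs = [
    ("Italian", ["pasta", "wine", "pasta"]),
    ("Chinese", ["noodl", "tea"]),
]

# One vector position per distinct term, in a fixed (sorted) order.
vocabulary = sorted({term for _, terms in docs for term in terms})


def to_vector(terms, vocabulary, binary=True):
    """Return a bit vector (term present or not) or a relative-frequency vector."""
    counts = Counter(terms)
    if binary:
        return [1 if counts[term] else 0 for term in vocabulary]
    return [counts[term] / len(terms) for term in vocabulary]


X = [to_vector(terms, vocabulary) for _, terms in docs]  # document vectors
y = [category for category, _ in docs]                   # extracted class values

print(vocabulary)
for vector, label in zip(X, y):
    print(vector, label)
```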


5.) Classification

Back to classical Data Analytics:

Training of a model (decision tree) and scoring


Demo

Classification

• Append color based on class

• Partition data into training and test set

• Train decision tree model on training data

• Apply decision tree model on test data

• Score model, measure accuracy

• Show cross-validation loop
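The partition, train, apply, and score steps can be sketched in plain Python. To keep the example short it trains a decision stump, that is, a decision tree with a single split, which is a simplification of the tree learner used in the demo. The bit vectors, the vocabulary, and the 70/30 split are invented for illustration, and the cross-validation loop is not covered.

```python
import random
from collections import Counter

FEATURES = ["noodl", "pasta", "tea", "wine"]  # hypothetical vocabulary
DATA = [  # (document bit vector, class), invented for illustration
    ([0, 1, 0, 1], "Italian"), ([0, 1, 1, 1], "Italian"), ([0, 1, 0, 0], "Italian"),
    ([1, 1, 0, 1], "Italian"), ([0, 1, 0, 1], "Italian"),
    ([1, 0, 1, 0], "Chinese"), ([1, 0, 1, 0], "Chinese"), ([0, 0, 1, 0], "Chinese"),
    ([1, 0, 0, 1], "Chinese"), ([1, 0, 1, 0], "Chinese"),
]


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle reproducibly and split into training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows):
    """Pick the single feature whose 0/1 split classifies most training rows correctly.

    Returns (feature index, class if term absent, class if term present).
    """
    fallback = majority([label for _, label in rows])
    best = None
    for j in range(len(rows[0][0])):
        side = {0: [], 1: []}
        for x, label in rows:
            side[x[j]].append(label)
        pred = {v: majority(labels) if labels else fallback for v, labels in side.items()}
        correct = sum(1 for x, label in rows if pred[x[j]] == label)
        if best is None or correct > best[0]:
            best = (correct, j, pred[0], pred[1])
    return best[1:]


def predict(stump, x):
    j, if_absent, if_present = stump
    return if_present if x[j] else if_absent


def accuracy(stump, rows):
    """Fraction of rows whose predicted class matches the true class."""
    return sum(predict(stump, x) == label for x, label in rows) / len(rows)


train, test = partition(DATA)
stump = train_stump(train)
print("split on:", FEATURES[stump[0]], "| test accuracy:", accuracy(stump, test))
```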


Additional Workflows

• Multi Word Tagging

– Detection of frequent Ngrams (Ngram Creator)

– Creation of dictionary from Ngrams

– Applying Dictionary Tagger

• Classification with Multi Words

• Clustering of documents

– hierarchical clustering based on distance matrix

• Topic Extraction

– Topic Extractor (Parallel LDA)
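The multi-word tagging idea listed above (count frequent N-grams, turn them into a dictionary, then tag them as single terms) can be sketched in Python as follows. The function names, the minimum count of 2, and the toy documents are assumptions for illustration and do not mirror the configuration of the KNIME nodes.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(documents, n=2, min_count=2):
    """Dictionary of N-grams that occur at least min_count times across all documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return {" ".join(gram): count for gram, count in counts.items() if count >= min_count}


def tag_multiwords(tokens, dictionary, n=2):
    """Merge dictionary N-grams into single multi-word terms, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        candidate = " ".join(tokens[i:i + n])
        if candidate in dictionary:
            merged.append(candidate)
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Invented tokenized reviews.
documents = [
    ["dim", "sum", "was", "great"],
    ["best", "dim", "sum", "in", "town"],
    ["great", "wine", "list"],
    ["long", "wine", "list"],
]
dictionary = frequent_ngrams(documents)
print(dictionary)
print(tag_multiwords(documents[1], dictionary))
```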


Thank You


Questions

• http://tech.knime.org/forum


Follow us

• Twitter: @KNIME

• LinkedIn: https://www.linkedin.com/groups?gid=2212172

• KNIME Blog: http://www.knime.org/blog