SF Data Mining Meetup September 22, 2014
dataminingreporting.weebly.com/uploads/4/0/9/7/4097240/...SF...
9/23/2014
Copyright © 2014 KNIME.com AG
Text Analytics Tutorial
SF Data Mining Meetup
September 22, 2014
Kilian Thiel, Rosaria Silipo, Cathy Pearl
KNIME.com AG, Zurich, Switzerland
www.knime.com
@KNIME
Tool Installation
• Download open source KNIME analytics platform from:
http://www.knime.org/knime-analytics-platform-sdk-download
• Select package for your OS and install
• Open the KNIME application
• In the top menu select “File” or “LOCAL” -> “Install KNIME Extensions”
• Install “KNIME & Extensions” and “KNIME Labs Extensions”
Install KNIME Extensions (incl. Text Processing)
Requirements to import and run Demo Workflows
• KNIME 2.10
• Text Processing Extension from KNIME Labs Extensions
• Distance Matrix from KNIME Extensions
Memory Tip
In the file knime.ini, set the maximum memory to what is available on your machine, e.g.
• -Xmx3G
Resources
• The KNIME Website (www.knime.org)
• LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
• Use Cases and White Papers for example workflows
• FORUM for questions and answers
• DOCUMENTATION for documentation, FAQ, change-logs, ...
• LABS for new developments and experimental nodes
• COMMUNITY for development instructions and third party nodes
• Blog for news, tips and tricks (www.knime.org/blog)
• KNIME TV channel on YouTube
Text Mining Webinar: http://www.youtube.com/watch?v=tY7vpTLYlIg
• KNIME on Twitter: @KNIME
Resources
eBooks from the KNIME Press:
http://www.knime.org/knimepress
- KNIME Beginner’s Luck
- The KNIME Cookbook
- The KNIME Booklet for SAS Users
Free Beginner’s Guide: use code “meetupsf14”
Text Processing Steps
1. Import Data
2. Enrichment (Tagging)
3. Pre-processing (Filtering, Stemming, …)
4. Transformation (BoW, Frequencies, Document Vector)
5. Classification, Clustering
Document Type Cell
Term Type Cell
Import Demo Workflows
• Download zip file with demo workflows from meetup site
• Open the KNIME application
• In the top menu, select File -> Import KNIME Workflow ...
• Enable option “Select Archive File”
• Browse to zip file
• Import all workflows and data into KNIME
Demo Workflows
0-TripAdvisorCrawling: importing data from web
1-Reading: Importing data from text, Word, PDF, Twitter, XML, …
2-Enrichment POS: String to Document and Word Tagging in Document
3-Preprocessing: Filtering and Stemming
4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector
Other workflows for multi-words, clustering, topic extraction, and reporting.
Demo: The KNIME Workbench
Text Processing Category
Demo: TripAdvisor Restaurant Data Set (SF)
Demo: TripAdvisor Data (SF Restaurants)
Reviews about Italian and Chinese restaurants in San Francisco
• Chinese: 272
• Italian: 268
Demo: Goal of this Tutorial
Goal:
• Build a classifier to distinguish between Chinese and Italian restaurants, based on the reviews.
Italian or Chinese Restaurant?
Demo: Final Workflow
1.) Reading
Read/Parse textual data
Demo
Reading
• Read TripAdvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
• Examples of other possible formats to import
0.) Web Crawler Workflow
Palladian Extension from:
KNIME Community Contributions – Other
Demo
Reading
• Web Crawler Workflow to get data from the Web
• Palladian Community Contributions Extension
• HtmlParser node
• XPath node
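The parse-then-query idea behind the HtmlParser and XPath nodes can be sketched outside KNIME. The snippet below is only an illustration in Python, not the Palladian implementation: the page markup, the `review` class name, and the XPath expression are invented for the example, and the standard-library parser used here only accepts well-formed markup (real web pages usually need a tolerant HTML parser first).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet; real pages need an HTML-tolerant parser.
page = """<html><body>
<div class="review"><p>Great pasta.</p></div>
<div class="ad"><p>Buy now</p></div>
<div class="review"><p>Tasty dumplings.</p></div>
</body></html>"""

# Step 1: parse the markup into a tree.
root = ET.fromstring(page)

# Step 2: select the review paragraphs with an XPath expression
# (ElementTree supports a limited XPath subset).
reviews = [p.text for p in root.findall(".//div[@class='review']/p")]

print(reviews)
```

Only the paragraphs inside the `review` blocks are returned; the advertisement block is skipped by the attribute predicate.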
2.) Enrichment
Enrich documents with semantic information
This assigns a tag to each word:
- Grammar tags (POS)
- Context dependent tags
- Sentiment tags
- Named Entity tags
- Custom tags
Demo
Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
• Show other possible taggings
3.) Preprocessing
Preprocess documents and filter words
Demo
Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming (Snowball stemmer, chosen because it supports many languages)
• Keep only nouns (NN), verbs (VB), adjectives (JJ)
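The preprocessing chain above can be approximated in a few lines of Python. This is a rough sketch, not what the KNIME nodes do internally: the stop-word list is a tiny made-up sample, the input is assumed to be already POS-tagged with Penn Treebank style tags, and `naive_stem` is a crude suffix stripper standing in for a real Snowball stemmer.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "were", "of", "to", "in", "it"}

# Tags kept in the demo: nouns (NN*), verbs (VB*), adjectives (JJ*).
KEPT_TAG_PREFIXES = ("NN", "VB", "JJ")


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real Snowball stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Filter and normalize a list of (term, pos_tag) pairs."""
    result = []
    for term, tag in tagged_terms:
        if re.fullmatch(r"\d+([.,]\d+)*", term):  # drop numbers
            continue
        if re.fullmatch(r"\W+", term):  # drop punctuation marks
            continue
        lowered = term.lower()  # convert to lower case
        if lowered in STOP_WORDS:  # drop stop words
            continue
        if not tag.startswith(KEPT_TAG_PREFIXES):  # keep only NN, VB, JJ
            continue
        result.append(naive_stem(lowered))
    return result


tagged = [("The", "DT"), ("noodles", "NNS"), ("were", "VBD"), ("amazing", "JJ"),
          (",", ","), ("ordered", "VBD"), ("2", "CD"), ("plates", "NNS"), ("!", ".")]
print(preprocess(tagged))
```

The sketch lowercases before the stop-word check so that "The" and "the" are treated alike; the order of the remaining filters does not change the result here.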
4.) Transformation
Creation of numerical representation of documents
BoW creates the list of words for each document
TF calculates word frequencies (absolute or relative) in each document
Demo
Transformation
• Transform to bag of words
• Compute TF value for terms
TFrel(word) = n(word) / N
IDF(word) = log(1 + n(docs) / n(word, docs))
TFrel(word) * IDF(word) is often used
ICF(word) = log(1 + n(cat) / n(word, cat))
• Sort output data by frequency
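As a worked illustration of the frequency formulas, here is a small Python sketch. It rests on assumptions the slide does not spell out: n(word) is read as the count of the word in a document, N as the number of terms in that document, n(docs) as the number of documents, and n(word, docs) as the number of documents containing the word; the natural logarithm is used, and the toy documents are invented. The same helper serves for ICF when each group holds the terms of one category.

```python
import math
from collections import Counter


def tf_rel(doc_terms):
    """Relative term frequency: n(word) / N for one document."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return {term: count / total for term, count in counts.items()}


def inverse_frequency(groups):
    """log(1 + n(groups) / n(word, groups)); groups are documents for IDF
    or per-category term lists for ICF."""
    n_groups = len(groups)
    group_freq = Counter()
    for terms in groups:
        group_freq.update(set(terms))  # count each word once per group
    return {term: math.log(1 + n_groups / freq) for term, freq in group_freq.items()}


# Toy, already preprocessed documents (hypothetical data).
docs = [["pasta", "pizza", "pasta"], ["noodl", "rice"], ["pizza", "wine"]]

idf = inverse_frequency(docs)
for index, terms in enumerate(docs):
    tf = tf_rel(terms)
    tf_idf = {term: tf[term] * idf[term] for term in tf}
    # Sort by score, highest first, as in the last step of the demo.
    ranked = sorted(tf_idf.items(), key=lambda item: item[1], reverse=True)
    print(index, ranked)
```

In the first document "pasta" ranks above "pizza": it occurs twice there and in no other document, while "pizza" also appears in the third document and so gets a lower IDF.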
Demo
Transformation
• Transform to document vectors
• Extract category (class) value
28
5.) Classification
Back to classical Data Analytics:
Training of a model (decision tree) and scoring
Demo
Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model on test data
• Score model, measure accuracy
• Show cross-validation loop
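The partition, train, apply, and score steps can be sketched in plain Python. This is a deliberately small stand-in, not the KNIME Decision Tree Learner: it handles only 0/1 features, splits on weighted Gini impurity, and runs on invented toy data; the 70/30 split, the seed, and the depth limit are arbitrary choices for the example. The coloring and cross-validation steps are left out.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def train_tree(rows, labels, depth=0, max_depth=3):
    """Return a leaf label or a dict {"feature": f, 0: subtree, 1: subtree}."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority
    best = None  # (weighted impurity, feature index)
    for f in range(len(rows[0])):
        left = [lab for row, lab in zip(rows, labels) if row[f] == 0]
        right = [lab for row, lab in zip(rows, labels) if row[f] == 1]
        if not left or not right:
            continue  # this feature does not split the data
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return majority
    f = best[1]
    split = {0: ([], []), 1: ([], [])}
    for row, lab in zip(rows, labels):
        split[row[f]][0].append(row)
        split[row[f]][1].append(lab)
    return {"feature": f,
            0: train_tree(*split[0], depth + 1, max_depth),
            1: train_tree(*split[1], depth + 1, max_depth)}


def predict(tree, row):
    """Walk down the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree[row[tree["feature"]]]
    return tree


def partition(rows, labels, train_fraction=0.7, seed=42):
    """Random split into (train_rows, train_labels), (test_rows, test_labels)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(len(indices) * train_fraction)

    def pick(selected):
        return [rows[i] for i in selected], [labels[i] for i in selected]

    return pick(indices[:cut]), pick(indices[cut:])


# Toy document vectors over the terms ["pasta", "noodl", "good"] (hypothetical).
rows, labels = [], []
for i in range(10):
    rows.append([1, 0, i % 2])
    labels.append("Italian")
    rows.append([0, 1, i % 2])
    labels.append("Chinese")

(train_rows, train_labels), (test_rows, test_labels) = partition(rows, labels)
tree = train_tree(train_rows, train_labels)
predictions = [predict(tree, row) for row in test_rows]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print("accuracy:", accuracy)
```

The toy data is perfectly separable by a single term, so the score says nothing about how well a tree would do on the real reviews; the point is only the order of the steps.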
Additional Workflows
• Multi Word Tagging
– Detection of frequent Ngrams (Ngram Creator)
– Creation of dictionary from Ngrams
– Applying Dictionary Tagger
• Classification with Multi Words
• Clustering of documents
– hierarchical clustering based on distance matrix
• Topic Extraction
– Topic Extractor (Parallel LDA)
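The multi-word tagging idea from the list above (find frequent n-grams, turn them into a dictionary, then tag with that dictionary) can be sketched as follows. This is an illustrative Python approximation, not the Ngram Creator or Dictionary Tagger nodes: the minimum count of 2, the bigram size, and the toy documents are assumptions made for the example.

```python
from collections import Counter


def frequent_ngrams(docs, n=2, min_count=2):
    """Collect word n-grams that occur at least min_count times overall."""
    counts = Counter()
    for terms in docs:
        counts.update(tuple(terms[i:i + n]) for i in range(len(terms) - n + 1))
    return {gram for gram, count in counts.items() if count >= min_count}


def tag_multiwords(terms, dictionary, n=2):
    """Merge n-grams found in the dictionary into single multi-word terms."""
    merged, i = [], 0
    while i < len(terms):
        gram = tuple(terms[i:i + n])
        if gram in dictionary:
            merged.append(" ".join(gram))
            i += n
        else:
            merged.append(terms[i])
            i += 1
    return merged


# Toy tokenized reviews (hypothetical data).
docs = [
    ["dim", "sum", "was", "great"],
    ["best", "dim", "sum", "in", "town"],
    ["great", "pizza"],
]

dictionary = frequent_ngrams(docs)
print(dictionary)
print(tag_multiwords(docs[1], dictionary))
```

After tagging, "dim sum" is treated as one term, so it can be counted and used as a single feature in the later transformation and classification steps.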
Thank You
Questions
• http://tech.knime.org/forum
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog