© 2014 IBM Corporation2 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Tina Groves
IBM Big Data & Analytics Product Strategy team
Product manager, 15+ years
Focus on new product introduction and
innovation areas
Results tied to 1,000s of customers; 1,000,000s
of users and 100s of millions in revenue
Personal
hockey mom, skier, closet Scrabble nerd and
oftentimes analytics geek
© 2014 IBM Corporation3 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
What Makes Text Analytics Challenging?
Company IBM
Annual
Revenue
99,751
Annual
Revenue
Units
Billion
Number of
Employees
432,212
Tone Conservative
Easier: One source; derive attributes Harder: Many sources; infer perception & behaviour
© 2014 IBM Corporation4 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
What Makes Text Analytics Even MORE CHALLENGING?
Culture, Slang, Sarcasm
• Same word, different
meanings
e.g., “Sick”
• Same meaning,
different words
e.g., “daks”, “trousers”,
“pants”
Infrastructure
Increasing
• Volume
• Sources
• Users
• Analytic complexity
“Juan” = “John” or “Jean”
“The lazy brown dog
lazed in the sunshine” =
Multiple languages
怠惰な茶色の犬は太陽
の下でlazed
Setting aside the obvious linguistic challenges…
© 2014 IBM Corporation5 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
The Dilemma with Text Analytics: Skills vs. Need Disconnect
Business
AnalystApplication
Developer
Data
Scientist
Domain
Knowledge
Advanced
Analytics & NLP
Programming
© 2014 IBM Corporation6 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
NLP Engines & tools: Developer and Data Scientist Tools
• NLP market is about 10-15 yrs old
• Highly fragmented, no clear leader
• Many open source or free alternatives
tm
Text
Mining
Free / Open Source
NLP Pure
Plays
Sources:
• A Review of Text Analytics Suppliers, Butler
Analytics, 2014-01
• Text Analytics 2014: User Perspectives on
Solutions and Providers, Seth Grimes, 2014-06
• Who's Who in Text Analytics, Gartner, 2012-09
© 2014 IBM Corporation7 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
“100 Best Jobs” Copyrighted 2014. U.S. News &
World Report. 112878:914JM.
Occupation Software
Developer
Rating 8.4
Upward
Mobility
good
(Average)
Stress Level Fair (Average)
Flexibility Fair (Average)
Let’s look at an example
© 2014 IBM Corporation8 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Rstudio using stringr & hmisc libraries against BigInsights
© 2014 IBM Corporation9 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
# Text Analytics with Open Source R# Strata Conference - October 2014, NYC################################## Clean-up environmentrm(list=ls())
# Load required open source R packageslibrary(stringr)library(Hmisc)
# Loop through all files in a directoryfiles <-list.files(path="/home/biadmin/Desktop/Best-jobs", pattern=".txt", all.files=T, full.names=T)
outDF <-data.frame(InputDocument=character(),
JobTitle=character(), OverallScore=numeric(), Stress=character(), UpWardMobility=character(), Flexibility=character())
for (file in files) {
# Read in the text file of interestf <- readLines(file)
# Text to extract: Occupationcline1 <- f[1]
val1 <- as.character(str_extract(cline1,"[a-zA-Z]+\\s*[a-zA-Z]*"))val1 <- ifelse(is.null(val1) == TRUE, NA, val1)
# Text to extract: Rating (Overall Score)cline2 <- grep("Overall Score", f, value=TRUE)
val2 <- as.numeric(str_extract(cline2,"[0-9]+.[0-9]+"))val2 <- ifelse(is.null(val2) == TRUE, NA, val2)
# Text to extract: Stress Levelcline3 <- grep("Stress Level",f, value=TRUE)val3 <- as.character(substring(cline3, 14))val3 <- first.word(val3)
val3 <- ifelse(is.null(val3) == TRUE, NA, val3)# Text to extract: Upward Mobility cline4 <- grep("Upward Mobility",fvalue=TRUE)
val4 <- as.character(substring(cline4, 17))val4 <- first.word(val4)val4 <- ifelse(is.null(val4) == TRUE, NA, val4)
# Text to extract: Flexibilitycline5 <- grep("Flexibility",f
val5 <- as.character(substring(cline5, 13))val5 <- first.word(val5)val5 <- ifelse(is.null(val5) == TRUE, NA, val5)
fileName <- basename(file)
newRow <-data.frame(InputDocumentJobTitle=val1, OverallScoreStress=val3, UpWardMobility
© 2014 IBM Corporation10 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
# Text to extract: Occupationcline1 <- f[1]
val1 <- as.character(str_extract(cline1,"[a-zA-Z]+\\s*[a-zA-Z]*"))val1 <- ifelse(is.null(val1) == TRUE, NA, val1)
# Text to extract: Stress Levelcline3 <- grep("Stress Level",f, value=TRUE)val3 <- as.character(substring(cline3, 14))val3 <- first.word(val3)val3 <- ifelse(is.null(val3) == TRUE, NA, val3)
© 2014 IBM Corporation11 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Finished Results
Time: a few hours
Considerations
Programming
Text parsing
Multiple files
Missing values resulting in
missing rows
Infrastructure
Dataset size
Single machine vs cluster
© 2014 IBM Corporation12 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
NLP Tool: BigInsights Big Text ExampleNote: In Beta
© 2014 IBM Corporation13 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
But… how to reach the Business Analyst?
Business
AnalystApplication
Developer
Data
Scientist
Domain
Knowledge
Programming
Advanced
Analytics & NLP
© 2014 IBM Corporation14 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
A. Within
Enterprise
Offerings
NLP Engines
B. Niche
Tool
Reaching the Business Analyst with Tools
• Key drivers: ease of use and time-to-results
• Differences from NLP tools
• GUI-driven
• Built-in algorithms
• Multi-language support
• Related technologies Search or Information
Discovery Machine Learning
© 2014 IBM Corporation15 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
“100 Best Jobs” Copyrighted 2014. U.S.
News & World Report. 112878:914JM.
The Difference
1,000
2,000
3,000
4,000
5,000
6,000
7,000
High School orless
2 years postsecondary or
less
Bachelor'sdegree or
higher
Thousands
Projected Job Growth 2020
© 2014 IBM Corporation16 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Business Analyst: IBM Social Media Analytics example
© 2014 IBM Corporation17 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Tools & Engines incorporate Domain Knowledge
Marketplace
Platforms
Point
Solutions
Where’s the Growth?
• Key influence: SaaS
• NLP Engines Solution Platforms
• LOB Tools Point Solutions
• data integration services
Areas trending
• Marketing, fraud, healthcare
NLP Engines
LOB Tools
© 2014 IBM Corporation18 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Business Analyst: IBM Social Media Analytics example
© 2014 IBM Corporation20 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Conclusion
A variety of tools is
needed to reach a
variety of users
1 With a highly
fragmented market,
look for integration.
2
This market is changing.
Don’t be afraid to re-assess.3
© 2014 IBM Corporation22 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
ibmhadoop.challengepost.com
Stop by
IBM Booth
#321 to
learn more!
© 2014 IBM Corporation23 #strataconf #hadoopworldIbm.com/hadoop @tinagroves
Legal Disclaimer
• © IBM Corporation 2014. All Rights Reserved.• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is
provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not
be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any
warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this
presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing
contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
• If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon
many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can
be given that an individual user will achieve results similar to those stated here.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance
characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus® Sametime® Unyte™).
Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for
guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all of
the trademarks that you use in your presentation as follows; delete any not included in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2,
PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other
countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update
and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only.
Top Related