Search, Signals & Sense: An Analytics Fueled Vision
Seth Grimes, @sethgrimes
A Sense Making Story
New York Times, September 30, 2012
New York Times, September 8, 1957
Valium: Starting a Chain of Connections
H.P. Luhn
By H.P. Luhn, in IBM Journal, April 1958
http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf
Modelling Text
“Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
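The idea in the quote can be sketched in a few lines of Python. This is a simplified illustration, not Luhn's published procedure: it scores each sentence by the average corpus frequency of its non-stop words, whereas the paper's own scoring is more involved. The stop-word list and the function names (`split_sentences`, `tokenize`, `auto_abstract`) are assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "it", "that", "for", "on", "with", "as", "by"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def auto_abstract(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Word significance: frequency across the whole text, stop words excluded.
    freq = Counter(w for s in sentences for w in tokenize(s)
                   if w not in STOP_WORDS)

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        # Sentence significance: mean frequency of its significant words.
        return sum(freq[w] for w in tokens if w not in STOP_WORDS) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


if __name__ == "__main__":
    sample = ("Nerve cells send chemical messengers. "
              "Chemical messengers carry nerve signals between nerve cells. "
              "The weather was pleasant.")
    print(auto_abstract(sample, n_sentences=1))
```

Sentences dominated by frequent content words rise to the top; off-topic sentences sink.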
Luhn’s analysis of “Messengers of the Nervous System,” a Scientific American article
http://wordle.net, applied to the NY Times article
New York Times, September 8, 1957
Luhn’s Example
Close Reading
Can Software Make the Connection?
Mark Lombardi, George W. Bush, Harken Energy and Jackson Stephens, c. 1979-90, Detail
There and Back Again: Modelling Text, 2
The text content of a document can be considered an unordered “bag of words.”
Particular documents are points in a high-dimensional vector space.
Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
Modelling Text, 3
We might construct a document-term matrix...
• D1 = “I like databases”
• D2 = “I hate hate databases”
and use a weighting such as TF-IDF (term frequency–inverse document frequency)…
in computing the cosine of the angle between weighted doc-vectors to determine similarity.
      I    like   hate   databases
D1    1    1      0      1
D2    1    0      2      1
http://en.wikipedia.org/wiki/Term-document_matrix
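A minimal sketch of the computation for D1 and D2, in plain Python. The smoothed IDF formula used here, log((1 + N) / (1 + df)) + 1, is one common variant chosen as an assumption for this example. With an unsmoothed log(N / df), terms shared by both documents would get zero weight and the similarity of these two documents would collapse to 0.

```python
import math
from collections import Counter

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Column order follows the slide's matrix: I, like, hate, databases.
vocab = ["i", "like", "hate", "databases"]

# Bag of words: word order is discarded, only counts remain.
bags = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Raw term-frequency rows of the document-term matrix.
raw = {name: [bag[t] for t in vocab] for name, bag in bags.items()}


def idf(term):
    """Smoothed inverse document frequency (an assumed variant)."""
    n_docs = len(bags)
    df = sum(1 for bag in bags.values() if term in bag)
    return math.log((1 + n_docs) / (1 + df)) + 1


# TF-IDF weighted document vectors.
weighted = {name: [bag[t] * idf(t) for t in vocab]
            for name, bag in bags.items()}


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print("raw counts:", raw)
    print("cosine (raw counts):", round(cosine(raw["D1"], raw["D2"]), 4))
    print("cosine (TF-IDF):   ", round(cosine(weighted["D1"], weighted["D2"]), 4))
```

The weighting lowers the similarity relative to raw counts, because the terms that distinguish the documents ("like", "hate") count for more than the terms they share.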
Modelling Text, 4
In the form of query-document similarity, this is Information Retrieval 101.
• See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988.
• A useful basic tech paper: Russ Albright, SAS, “Taming Text with the SVD,” 2004.
Given the complexity of human language, statistical models may fall short.
“Reading from text in general is a hard problem, because it involves all of common sense knowledge.”
-- Expert systems pioneer Edward A. Feigenbaum
From Text to Data: Features
Analytical methods make text tractable.
Latent semantic indexing utilizing singular value decomposition for term reduction / feature selection.
Classification technologies / methods:
• Naive Bayes.
• Support Vector Machine.
• K-nearest neighbor.
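As one illustration of the classification methods listed, here is a small multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written in plain Python. The class name `NaiveBayes`, the whitespace tokenizer, and the four toy training sentences are assumptions for this sketch. A real system would use a tested library and far more data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()               # documents seen per class
        self.term_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()                        # all tokens seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(class) + sum of log P(token | class), smoothed.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Hypothetical toy training set.
    texts = ["I like databases", "I love fast search",
             "I hate hate databases", "I hate slow search"]
    labels = ["pos", "pos", "neg", "neg"]
    model = NaiveBayes().fit(texts, labels)
    print(model.predict("like fast databases"))
    print(model.predict("hate slow databases"))
```

The classifier treats each document as a bag of words, so it inherits both the simplicity and the blind spots of that model.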
Thus the Orb he roam'd
With narrow search; and with inspection deep
Consider'd every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost
“Reading from Text is a Hard Problem”
Eugène Delacroix, St. Michael Defeats the Devil
Data, Search, Analysis, and Discovery
Data Space
For features
Analysis
Intent, Goals
The User Interface
“Search is the UI for data today.”
-- Grant Ingersoll, Chief Scientist, LucidWorks
Quoted by Gil Press in Forbes, “LucidWorks: Bringing Search to Big Data”
http://www.forbes.com/sites/gilpress/2012/09/24/lucidworks-bringing-search-to-big-data/
What’s beyond?
Search and Sensemaking
“It is convenient to divide the entire information access process into two main components: information retrieval through searching and browsing, and analysis and synthesis of results. This broader process is often referred to in the literature as sensemaking. Sensemaking refers to an iterative process of formulating a conceptual representation from a large volume of information. Search plays only one part in this process.”
-- Marti Hearst, 2009
http://searchuserinterfaces.com/
Senseless Search
New but old: Dumb and siloed
Better?
Searcher Supplied Sense
Siloed signals.
More better?
Semantic Search Engines
Meh.
Clustered Clarity
Carrot2 (open source)
Semanticized (Web) Search
Google Knowledge Graph
Search Fronted Analysis & Discovery
Fusions, Signals
Old Search | Sensemaking
Search on: keywords | + identity, history & context
Sources: content/type silos | Unified
Indexed: terms | + metadata (properties)
Returned: hit lists | Categories / clusters / answers first
Relevance: PageRank | (Inferred) intent
Prevalence: plenty of new platforms with old(ish) search | Plenty of established search with new(ish) capabilities, also wanna-bes.
Toward Semantic Search Sensemaking
Platforms and ecosystems.
APIs and services.
Text and content analytics:
Discerns and extracts features, including relationships, from source materials.
Features = entities, key-value pairs, concepts, topics, events, sentiment, etc.
Provide (for) BI on content-sourced data.
Data integration, record linkage, data fusion.
The Back End
Text/content analytics generates semantics to bridge search, BI, and applications, enabling next-generation information systems.
Search, BI, Applications
Search based applications (search + text + apps)
Information access (search + text + BI)
Integrated analytics (text + BI)
Text analytics (inner circle)
Semantic search (search + text)
NextGen CRM, EFM, MR, marketing, …
Text+ Technology Mashups
Analytical Assets (Open Source)
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
http://nltk.org/
tm: Text Mining Package
A framework for text mining applications within R.
A Big Data Analytics Architecture
http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html
http://hpccsystems.com/ (GNU Affero GPL)
Commercial (Non-OS) Solutions Plug In
Drivers and Trends
Social media! … and personal-social-enterprise integration.
Via-API cloud services.
Big Data (even if you don’t like the term).
Volume and velocity mean new analytical approaches.
Variety: new types and a new fusion imperative.
Sentiment: Mood, opinions, emotions, intent.
Question answering.
Text Tech Initiatives
Now and near future.
• Broader & deeper international language support.
• Sentiment analysis, beyond polarity.
  Emotions, intent signals, etc.
• Identity resolution & profile extraction.
  Online-social-enterprise data integration.
• Semantic data integration, Complex Data.
• Speech analytics.
• Discourse analysis.
  Because isolated messages are not conversations.
• Rich-media content analytics.
• Augmented reality; new human-computer interfaces.
http://timoelliott.com/blog/2010/10/sap-businessobjects-augmented-explorer-now-available-resources-to-test-it.html
Personal. Mobile. Intelligent?
A Focus on Information & Applications
Now and near future.
• Signal detection.
  Sentiment, emotion, identity, intent.
• Semanticized applications.
  Linkable, mashable, enrichable.
• Rich information.
  Context sensitive, situational.
Σ = Sensemaking.
Onward… to Q&A