Text Analytics and Graph Text Analytics and Graph Theory By Victoria Loewengart. Agenda...
Embed Size (px)
Transcript of Text Analytics and Graph Text Analytics and Graph Theory By Victoria Loewengart. Agenda...
August 6, 2015
© Analytics Inside 2014-2015
Text Analytics and Graph Theory By Victoria Loewengart
Agenda • Introductions • Brief Overview of Natural Language Processing
– Definitions and terms – Basic concepts – NE, Topics, Sentiment, Semantic, Ontology – Basic processes – Sentence detection, tokenization, stemming, POS tagging,
chunking, parsing – Higher level processes – clustering, classification, importance, co-references
• Usage of graph theory – Using named entity and sematic decomposition to create a graph – Semantic relationships and ontologies
• Overview • Navigation and inference
– Clustering and topic extraction • Distance based clustering • Other methods
– Term frequency – inverse document frequency (tf-idf) – Eigenvectors and Singular Value Decomposition
• Summary and Conclusion
2August 6, 2015 Copyright © 2014-2015 Analytics Inside
• Natural Language Processing (NLP) is understanding, analysis, manipulation, and/or generation of natural (spoken) languages.
• Computational Linguistics is the study of the applications of computers in processing and analyzing language, as in automatic machine translation and text analysis.
• Text analytics is the process of deriving high-quality information from text.
4August 6, 2015 Copyright © 2014-2015 Analytics Inside
• Information Retrieval (IR) refers to the human-computer interaction (HCI) that happens when we use a machine to search a body of information for information objects (content) that match our search query. Depending on the sophistication of the algorithm, a person's query is matched against a set of documents to find a subset of 'relevant' documents.
• Information Extraction (IE) is extraction of specific information such as Named Entities, Events, and Facts.
• Metrics are Precision, Recall, and F-Measure
5August 6, 2015 Copyright © 2014-2015 Analytics Inside
IR – Hubs and Authorities
6August 6, 2015 Copyright © 2014-2015 Analytics Inside
• Hubs are index pages that provide lots of useful links to relevant content pages (or authorities)
• Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic
• Together they form a bipartite graph
• Sentence detection – A bit more difficult than it seems
• Bob gave Mr. Bill a pencil.
• Tokenizing • Stemming – singing -> sing • Part of speech tagging
– Penn Treebank, Treebank II • Tagged copra • Gold standards
– More complex methods
• Chunking and parsing – Finding phrases
• Semantic analysis – Subject, object, other relationships
7August 6, 2015 Copyright © 2014-2015 Analytics Inside
• Proper nouns – James Woods
• Pronouns – she, he, it, they …
• Singular or plural - trees
• Possessive – Jimmy’s
• Conjunctions – Bill and Bob
• Noun phrase – A big green tree
The monkey ate a banana.
8August 6, 2015 Copyright © 2014-2015 Analytics Inside
NP NP NN NN
Monkey ate banana
Named Entities • Start by tagging
Jim Black was the CEO of Medco Enterprises.
• How do we do this? – Rules – Stochastic methods – machine learning, MEMM
• How do we assign more specific meaning? – Dictionaries, thesauri, ML disambiguation, etc.
• How about this one: Jim got Ed James’s book.
9August 6, 2015 Copyright © 2014-2015 Analytics Inside
NNP NNP NNP NNPVBD EX INDT
Rules and Stochastic Methods • Rules specify “productions”
Noun phrase -> Noun
-> Adjective Noun
-> Determiner Adjective Noun
-> Noun “Prepositional Phrase”
Prepositional Phrase -> Preposition “Noun Phrase”
and so on…..
• Stochastic methods specify relationship probabilities – We start by examining a very large number of sentences (from a
– Given a new sentence, if we have found a Determiner, then an adjective, what is the probability that the next word is a noun?
10August 6, 2015 Copyright © 2014-2015 Analytics Inside
11August 6, 2015 Copyright © 2014-2015 Analytics Inside
• Named Entities applicable to most domains: – People names
– Organization names
– Locations (Countries, Cities, Continents/geographic terms)
• Domain specific named entities: – Diseases, diagnoses, procedures, body parts
– Drugs, dosages, and usage
– Identifiers – SSN, Driver’s license, Claim number, Domain name, URL
Text Analytics Processes Named Entity Extraction
Basic Concepts- Named Entities
12August 6, 2015 Copyright © 2014-2015 Analytics Inside
Patient Medical Provider Hospital or Facility
Pharmaceutical Diagnosis / Injury
Medical Report Biometrics
Police Report Coroner Report Arrest Record
Enforcement Agency Alias
Criminal Method …
User ID IP Address
Network Origination Online postings
Social Media Pages Email
Text Messages …
Info Bearing Entity Document
Security log Web log
Asset Asset class HR Report
Encryption Method …
Financial Instrument Event Task
Language Prediction Inference
Account Credit Card
Policy Claim Lien Title
The National Information Exchange Model (NIEM)
Named Entities and More Complex Graphs
Bob was Bill’s friend and he drove a white truck.
Bob drove Bill’s white truck and he wanted one of his own.
Bob was Bill’s friend and he drove his white truck and wanted one of his own.
13August 6, 2015 Copyright © 2014-2015 Analytics Inside
Simple Relationships Semantic relations among words can be extracted from their textual context in natural languages.
• Relationships may occur through communication, friendship, advice, influence, or exchange. The two basic elements of a relationship network are links and nodes.
• Relationship analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers or other information/knowledge processing entities.
Graphs allow us to store the relationships between entities, and algorithms allow us to interrogate these connections.
14August 6, 2015 Copyright © 2014-2015 Analytics Inside
Simple Relationships -Example
15August 6, 2015 Copyright © 2014-2015 Analytics Inside
Simple Relationships - Techniques
• Simple relationships are identified through co- reference.
• Co-reference is the instance of occurrence within a unit of text
• Metadata is relevant too – coauthors.
• Topics are words that are assigned to a document that relate “concepts.”
16August 6, 2015 Copyright © 2014-2015 Analytics Inside
17August 6, 2015 Copyright © 2014-2015 Analytics Inside
• Nouns are parsed into sentence structures – Yields relationships
– Can usually detect compound subjects and various verb inflective forms
– Captures modifiers (adjectives and adverbs) that can be used in sentiment or inversion
• Graph analysis and graph theory now comes into play – When documents and document sets are processed, typically creates a
very large graph
Text Analytics Processes Semantic Named Entity Extraction
Clusters of terms
18August 6, 2015 Copyright © 2014-2015 Analytics Inside
• An Ontology is “a description of things that exist and how they relate to each other” (Chris Welty).
• An Ontology Model is: – the classification of entities and
– modeling the relationships between those entities.
Text Analytics Processes The Importance of Ontologies
Introduction to Ontologies
What is an ontology?
An ontology is a specification of a conceptualization (Thomas Gruber)
Why create ontologies?
Why you should think “semantically” even if you never create a formal ontology
Why you should create ontologies
Ontologies model their domain and models provide:
A common understanding of the structure of information among people and software
Explain and make and make predictions
Enable reuse of domain knowledge
Make domain assumptions explicit and mediates between multiple viewpoints