Natural Language Processing and Machine Learning for Discovery
![Page 1: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/1.jpg)
MSU Law
Electronic Discovery
Fall 2012, Week 9
NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING FOR DISCOVERY
![Page 2: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/2.jpg)
GOALS
• Natural language processing
  • Mathematical and linguistic concepts
  • Models of representation
  • Real-world application
• Machine learning
  • Common pre-processing and learning algorithms
  • Real-world application
• Communicate with software and service vendors!
Understand the BLACK BOX.
© Bommarito Consulting
![Page 3: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/3.jpg)
BLACK BOX
How do we characterize a black box?
[Diagram: a black box labeled with Inputs, Parameters, and Outputs]
![Page 4: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/4.jpg)
?
Secret: Most black boxes are very similar inside.
We’re going to learn to identify the common parts.
![Page 5: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/5.jpg)
NATURAL LANGUAGE PROCESSING
Definition: Dealing with real-world text in an automated, reproducible way.
• Often referred to as NLP.
• Used somewhat interchangeably with computational linguistics.
![Page 6: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/6.jpg)
Let’s start with some text.
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
(Bloomberg article on Sandy)
![Page 7: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/7.jpg)
What kind of questions can we ask?
• Basic
  • What is the structure of the text?
    • Paragraphs
    • Sentences
    • Tokens/words
  • What are the words that appear in this text?
    • Nouns
      • Subjects
      • Direct objects
    • Verbs
• Advanced
  • What are the concepts that appear in this text?
  • How does this text compare to other text?
![Page 8: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/8.jpg)
Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
• Segment types
  • Paragraphs
  • Sentences
  • Tokens
![Page 9: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/9.jpg)
Segmentation and Tokenization
But how does it work?
• Paragraphs
  • Two consecutive line breaks
  • A hard line break followed by an indent
• Sentences
  • Period, except abbreviation, ellipsis within quotation, etc.
• Tokens and words
  • Whitespace
  • Punctuation
Remember what real-world text looks like: think texts and email.
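The rules above can be sketched with a few string operations. This is a minimal illustration under stated assumptions, not how any particular product works: the function names and the abbreviation list are invented for the example, and only the "two consecutive line breaks" paragraph rule is implemented.

```python
import re

# Hypothetical abbreviation list for this sketch; real systems use far larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "inc.", "e.g.", "i.e.", "etc."}


def split_paragraphs(text):
    """Split on two or more consecutive line breaks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """End a sentence at a word ending in ., ! or ?, unless it is a known abbreviation."""
    sentences, current = [], []
    for word in paragraph.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def tokenize(sentence):
    """Split on whitespace, then strip leading and trailing punctuation from each token."""
    stripped = (w.strip(".,;:!?\"'()") for w in sentence.split())
    return [t for t in stripped if t]


text = "Dr. Smith grounded 3,200 flights today. Service resumes tomorrow.\n\nPower is out."
paragraphs = split_paragraphs(text)
sentences = [s for p in paragraphs for s in split_sentences(p)]
tokens = [tokenize(s) for s in sentences]
```

Note that "Dr." does not end a sentence here only because it is on the abbreviation list; informal text and email break rules like these constantly.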
![Page 10: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/10.jpg)
Segmentation and Tokenization
“Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.”
Paragraphs: 2
Sentences: 2
Words: 561.
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
![Page 11: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/11.jpg)
What kind of questions can we ask?
We now have an ordered list of tokens.
['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
• Does the phrase “quote stuffing” occur in the text?
• How many times does “Sandy” occur?
• How often does “outage” occur after “power”?
• What percentage of tokens are numbers?
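Each of these questions reduces to a short operation on the ordered token list. The sketch below is one possible way to answer them; the function names are invented for the example, and matching is case-sensitive for simplicity.

```python
tokens = ["Hurricane", "Sandy", "grounded", "3,200", "flights", "scheduled",
          "for", "today", "and", "tomorrow"]


def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear consecutively in `tokens`."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def count_token(tokens, word):
    """Number of times `word` occurs."""
    return tokens.count(word)


def count_following(tokens, first, second):
    """Number of times `second` appears immediately after `first`."""
    return sum(1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second)


def percent_numeric(tokens):
    """Percentage of tokens that are digits once thousands separators are removed."""
    numeric = [t for t in tokens if t.replace(",", "").isdigit()]
    return 100.0 * len(numeric) / len(tokens)


has_phrase = contains_phrase(tokens, "quote stuffing")
sandy_count = count_token(tokens, "Sandy")
outage_after_power = count_following(tokens, "power", "outage")
numeric_share = percent_numeric(tokens)
```

The second and fourth questions only need counts, while the first and third depend on token order, which matters for the storage choice discussed next.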
![Page 12: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/12.jpg)
An Aside on Storage
Data: The word ‘the’ ten times and the word ‘a’ ten times.
Representation 1 - Ordered List: [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
Representation 2 – Term Frequency: [(‘the’, 10), (‘a’, 10)]
![Page 13: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/13.jpg)
An Aside on Storage
Representation 1 - Ordered List: [‘the’, ‘a’, ‘the’, ‘a’, ‘the’, ‘a’, …]
Representation 2 - Frequency Map: [(‘the’, 10), (‘a’, 10)]
Tradeoffs
• Total space
• Ease of answering certain questions
• Information about context
Not all software makes the same choice!
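The tradeoff can be seen directly in a few lines. This sketch uses Python's standard `collections.Counter` as the frequency map; the variable names are chosen for the example.

```python
from collections import Counter

ordered = ["the", "a"] * 10      # Representation 1: ordered list of 20 tokens
frequency = Counter(ordered)     # Representation 2: term -> count, 2 entries

# Counting is a direct lookup in the frequency map...
the_count = frequency["the"]

# ...but a question about context ("how often does 'a' follow 'the'?")
# can only be answered from the ordered list.
a_follows_the = sum(1 for x, y in zip(ordered, ordered[1:]) if (x, y) == ("the", "a"))
```

The frequency map stores 2 entries instead of 20 tokens, but the ordering needed for phrase and proximity questions is gone.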
![Page 14: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/14.jpg)
Stopwording, Stemming, Parsing, and Tagging
• Stopwording
  • Removing “filler” words like prepositions, auxiliary or infinitive verbs, and conjunctions.
• Stemming
  • Matching declined nouns like dog/dogs or child/children.
  • Matching conjugated verbs like run/ran.
  • Note: suffix-stripping stemmers handle regular forms like dog/dogs; irregular pairs like child/children and run/ran usually require lemmatization.
• Parsing
  • Determining the “structure” of a sentence, typically as represented by a grade school sentence diagram (requires grammar definition; we’ll skip).
• Tagging
  • Identifying the part of speech of each token in a sentence.
![Page 15: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/15.jpg)
Stopwording
Original:
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
After stopwording:
Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain.
System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
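Stopwording itself is a simple filter. The sketch below assumes a deliberately tiny, hypothetical stopword list; production lists typically contain far more words, and the choice of list is itself a parameter worth asking a vendor about.

```python
# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"for", "and", "to", "the", "of", "as", "it", "with", "in", "on", "a", "or"}


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS, keeping the original order."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow".split()
kept = remove_stopwords(tokens)
```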
![Page 16: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/16.jpg)
Stopwording + Stemming
Original:
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
After stopwording and stemming:
Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain.
System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
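To make the idea of stemming concrete, here is a crude suffix-stripping sketch. It is an assumption-laden toy, not the stemmer that produced the output above (which looks like a standard rule-based stemmer with many more rules); the suffix list and the `min_stem` parameter are invented for the example.

```python
# Crude suffix stripping for illustration only; this is NOT a full stemming
# algorithm and will not reproduce the slide's output.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token, min_stem=3):
    """Lowercase the token and strip the first matching suffix,
    as long as at least `min_stem` letters remain."""
    lowered = token.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


stems = [crude_stem(t) for t in ["flights", "grounded", "forced", "dogs", "children", "ran"]]
```

Regular forms collapse as expected ("flights" becomes "flight", "dogs" becomes "dog"), while the irregular forms "children" and "ran" pass through unchanged, which is why suffix rules alone do not match them to "child" and "run".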
![Page 17: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/17.jpg)
Tagging
Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain.
The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.
[('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …]
![Page 18: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/18.jpg)
Back to the black box.
[Diagram: a black box labeled with Inputs, Parameters, and Outputs]
![Page 19: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/19.jpg)
Let’s say that we’re investigating Enron for accounting fraud related to its reserve reporting and transfers.
We want to look for any material that discusses reserves and profits in the same sentence. However, we want cases where these words are used as nouns; we’re not interested in dinner reservations.
Inputs: Memos, Research, Emails, Texts, Transcriptions
Parameters: Stopword: No; Stem: Yes; Tag: Yes; Search: …
Output: Memos, Research, Emails, Texts, Transcriptions
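A search like this can be sketched over sentences that have already been tagged. Everything below is hypothetical: the sentences are hand-tagged with Penn Treebank-style tags, and to keep the example short the accepted word forms are listed explicitly instead of being stemmed.

```python
NOUN_TAGS = {"NN", "NNS"}  # common nouns, singular and plural


def has_noun(tagged_sentence, forms):
    """True if any token is one of `forms` and is tagged as a common noun."""
    return any(tok.lower() in forms and tag in NOUN_TAGS for tok, tag in tagged_sentence)


def matches_query(tagged_sentence):
    """True if 'reserve(s)' and 'profit(s)' both occur as nouns in the same sentence."""
    return (has_noun(tagged_sentence, {"reserve", "reserves"})
            and has_noun(tagged_sentence, {"profit", "profits"}))


# Hypothetical, hand-tagged sentences.
memo = [("The", "DT"), ("reserves", "NNS"), ("boosted", "VBD"), ("profits", "NNS")]
email = [("I", "PRP"), ("reserved", "VBD"), ("a", "DT"), ("table", "NN"),
         ("for", "IN"), ("dinner", "NN")]
note = [("Reserve", "VB"), ("the", "DT"), ("profits", "NNS")]

hits = [s for s in (memo, email, note) if matches_query(s)]
```

Only the memo matches: the email uses "reserved" as a verb and never mentions profits, and the note uses "Reserve" as a verb.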
![Page 20: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/20.jpg)
In general, all document search and discovery software combines the elements discussed above.
• Segment
• Tokenize
• Stopword
• Stem
• Parse
• Tag
• Store
• Search
• Retrieve
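The last three steps (store, search, retrieve) are commonly built around an inverted index, which maps each token to the documents containing it. The sketch below is one simple way to do that, assuming a hypothetical three-document collection and whitespace tokenization; it is not a description of any specific product.

```python
from collections import defaultdict


def build_index(documents):
    """Store: map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token.strip(".,;:!?\"'()")].add(doc_id)
    return index


def search(index, *terms):
    """Search: ids of documents containing every term (boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical collection.
documents = {
    "memo-1": "Reserves were transferred to raise profits.",
    "email-7": "Dinner reservations are at eight.",
    "memo-2": "Profits fell this quarter.",
}
index = build_index(documents)
hits = search(index, "reserves", "profits")
retrieved = [documents[d] for d in sorted(hits)]  # Retrieve: fetch the matching text
```

Stopwording, stemming, and tagging would slot in between tokenizing and indexing; what gets indexed determines what can later be searched.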
![Page 21: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/21.jpg)
How do they differ?
• Interface and ease-of-use
• De-duplication and versioning
• Supported languages
• Optical character recognition (OCR)
• File formats, e.g., Word, WordPerfect, PDF, HTML
• Ability to scale to large databases
![Page 22: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/22.jpg)
MACHINE LEARNING
Definition: Automated classification and prediction on data.
Examples:
• Product recommenders, à la Amazon
• Computer vision: is it a cat?
• Sentiment analysis
• Topic classification
• Document clustering
At least two stages to machine learning:
• Training
• Classification
![Page 23: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/23.jpg)
Learning
Machine learning requires “learning” or “training.”
There are two types of training:
• Supervised
• Unsupervised
The goal of training is to determine a mapping from input features to a set of target classes.
![Page 24: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/24.jpg)
Learning
Imagine a student given a small list of organisms and descriptions. The student is tasked to assign the organisms into groups based on these descriptions. Where do the groups come from?
• Supervised: The teacher provides the answers.
• Unsupervised: The teacher provides nothing.
When the student is done with the task, the teacher checks the student’s responses and decides if the student has learned.
In our example, the teacher will typically provide the “canonical” domains and kingdoms of biology. However, most real-world problem domains are not so well-studied.
![Page 25: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/25.jpg)
Learning
What if the teacher gave the student some of the answers?
This is semi-supervised learning.
• Supervised: The teacher provides the answers.
• Semi-supervised: The teacher provides some answers.
• Unsupervised: The teacher provides nothing.
![Page 26: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/26.jpg)
Classification
The student has now learned to map from an organism’s description to a group. Now, the student is sent out into the field to use their knowledge to classify newly discovered organisms. They observe the organisms and document the features they learned to use. Then, they apply the learned rules to determine the class of organism.
![Page 27: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/27.jpg)
This is exactly how predictive coding works!
• Organisms : Documents
• Descriptions : Natural language features or models
• Semi-supervised : Sample coding
The goal of predictive coding in discovery is to learn to classify documents based on natural language features, typically into relevant/irrelevant or privileged/unprivileged.
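The two stages, training on coded samples and then classifying new documents, can be sketched with a small Naïve Bayes classifier over word counts. This is a minimal illustration, assuming a hypothetical set of four coded documents and whitespace tokenization; real predictive coding tools may use different features and algorithms.

```python
import math
from collections import Counter, defaultdict


def train(coded_samples):
    """Training: count tokens per class from (text, label) pairs coded by reviewers."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in coded_samples:
        tokens = text.lower().split()
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def classify(model, text):
    """Classification: return the class with the highest log-probability."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_tokens = sum(token_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for token in text.lower().split():
            if token in vocabulary:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                count = token_counts[label][token] + 1
                score += math.log(count / (total_tokens + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical sample coding.
samples = [
    ("reserve transfer raised reported profits", "relevant"),
    ("adjust reserve before quarterly profits report", "relevant"),
    ("dinner reservation friday night", "irrelevant"),
    ("lunch friday with the team", "irrelevant"),
]
model = train(samples)
label = classify(model, "profits and reserve transfer")
```

With so few samples the result is only illustrative; the quality of the sample coding drives the quality of the classifier.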
![Page 28: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/28.jpg)
Some Machine Learning Algorithms
• Supervised
  • Statistical models
    • Bayesian, e.g., Naïve Bayes Classification
    • Frequentist, e.g., Ordinary Least Squares
  • Neural Networks (NN)
  • Support Vector Machines (SVM)
  • Random Forests (RF)
  • Genetic Algorithms (GA)
• Semi/unsupervised
  • Neural Networks (NN)
  • Clustering
    • K-means
    • Hierarchical
    • Radial Basis (RBF)
    • Graph
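As one example from the unsupervised side of the list, here is a sketch of k-means clustering in one dimension. The input values are hypothetical (they could be, say, document lengths), and the evenly spaced starting centers are a simplification chosen to make the example deterministic; common implementations usually start from random points.

```python
def k_means_1d(values, k, iterations=10):
    """Cluster numbers into k groups by repeatedly assigning each value
    to its nearest center and moving each center to its group's mean."""
    ordered = sorted(values)
    # Deterministic start for this sketch: k evenly spaced values from the sorted data.
    if k == 1:
        centers = [ordered[0]]
    else:
        centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Keep a center in place if no value was assigned to it.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers, groups


centers, groups = k_means_1d([1, 2, 3, 10, 11, 12], k=2)
```

No labels are supplied; the two groups emerge from the data alone, which is the defining feature of unsupervised learning.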
![Page 29: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/29.jpg)
Notes on Algorithm Diversity
• Not all algorithms return scores; some are binary.
  • True, True, False
  • 0.9, 0.7, 0.1
• Not all algorithms support more than two classes.
  • Cat, Dog, Mouse
  • Cat, Not Cat
• Not all algorithms scale similarly.
  • 1M documents = 1 day
  • 10M documents = {10 days, 100 days, 1000 days}
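The three figures for 10M documents are consistent with running time growing linearly, quadratically, or cubically in the number of documents; that reading is an interpretation, not something the slide states. Under that assumption the arithmetic is a one-line projection, and the scores-versus-binary point can be shown alongside it.

```python
def projected_days(base_docs, base_days, new_docs, exponent):
    """Scale a measured running time, assuming cost grows as
    (number of documents) ** exponent."""
    return base_days * (new_docs / base_docs) ** exponent


# 1M documents = 1 day; project 10M documents under linear, quadratic, and cubic growth.
projections = {exp: projected_days(1_000_000, 1, 10_000_000, exp) for exp in (1, 2, 3)}

# Scores can be reduced to binary calls with a cutoff (0.5 is an arbitrary choice here),
# but binary output cannot be turned back into scores.
scores = [0.9, 0.7, 0.1]
calls = [s >= 0.5 for s in scores]
```

A tenfold increase in documents therefore means anywhere from 10 to 1000 days depending on the algorithm, which is why scaling behavior is worth asking vendors about.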
![Page 30: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/30.jpg)
THANKS!
Michael J Bommarito II
CEO, Bommarito Consulting, LLC
Email: [email protected]
Web: http://bommaritollc.com/
You can get these slides on my blog: http://bommaritollc.com/blog/.
![Page 31: Natural Language Processing and Machine Learning for Discovery](https://reader036.fdocuments.us/reader036/viewer/2022062701/553bb38f550346d4448b4600/html5/thumbnails/31.jpg)
REFERENCES
Books and Wiki Pages
• A Brief Survey of Text Mining. Hotho, Nurnberger, Paaß. http://www.kde.cs.uni-kassel.de/hotho/pub/2005/hotho05TextMining.pdf
• Text Mining: Predictive Methods for Analyzing Unstructured Information. Weiss, Indurkhya, Zhang, Damerau. http://www.amazon.com/Text-Mining-Predictive-Unstructured-Information/dp/0387954333
• The Elements of Statistical Learning. http://www-stat.stanford.edu/~tibs/ElemStatLearn/
• Wiki: Machine Learning. http://en.wikipedia.org/wiki/Machine_learning
• Wiki: Machine Learning Algorithms. http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms
Software
• Natural Language Toolkit (NLTK). http://nltk.org/
• Stanford NLP Group. http://nlp.stanford.edu/software/
• Weka. http://www.cs.waikato.ac.nz/ml/weka/
• R. http://www.r-project.org/
• SAS Predictive Analytics and Data Mining. http://www.sas.com/technologies/analytics/datamining/index.html