Using Amazon Mechanical Turk for Data Collection in Parts of Speech Tag Correction for Patent Claims
David Cinciruk
June 24, 2015
Table of Contents
Overview of Our Problem
Using Amazon Mechanical Turk
  Worker Views
  Requester Views
  Testing HITs
  Difficulties of Obtaining Results
Automatic Corrector of Patent Claim POS Tags
Conclusion
Our Research
- Work is in Patent Processing
  - Mapping patents to other technical documents
  - Automated classification
  - Patent Retrieval
  - Patent Valuation
- Need to find a way to represent patents. NLP offers a solution with dependencies
Dependency Modeling
- Dependencies help to represent words by their relationships
- Formed by traversing the parse tree of a sentence or segment and noting two words and how they are linked together
NLP Parsers Do Not Work Well With Patent Claims!
Obvious Mislabeling of Words
Odd Language of Patent Claims
The holder for use with a razor blade according to claim 1, said recess being semicircular in shape and revealing opposite faces of the razor blade which may be grasped with the fingers of the user.
- Patents are the intersection of technical and legal speech
- Legally, each patent claim must be one very long run-on sentence
- Features particular language whose meanings aren’t the standard meanings in normal English
- Because of these, NLP software struggles to correctly tag patent claims
Trying to Correct the NLP
blade: VBP → NN
said: VBD → JJ
semicircular: VBN → JJ
- By forcing incorrect tags to a corrected tag, NLP software would be forced to reconstruct the parse tree and create dependencies more similar to true speech
- We need a large collection of hand-corrected patent claims in order to train a system to automatically correct patent claim tags
- Amazon Mechanical Turk allows us to gather this collection via crowdsourcing
Artificial Artificial Intelligence
Crowdsourcing Internet Marketplace that enables individuals and organizations to coordinate the use of human intelligence to perform tasks that computers are unable to do.
The Turk
The name comes from a supposed master chess-playing automaton built in 1770 by Wolfgang von Kempelen
In reality, a chess master was hidden in the cabinet, playing the game from underneath
Workers and Requesters
- Worker
  - Also known as Turkers
  - Performs tasks set up by requesters
  - Work their own hours and choose their own tasks
- Requester
  - Creates tasks for workers to do
  - Can set qualifications on tasks to disallow unproven people from performing tasks
  - Can set up tests to determine if people are qualified to work on particular HITs (Human Intelligence Tasks) or not
Our Use of Mechanical Turk
- We want to develop the most accurate dependencies for use in our future systems
- We have patent claims that are automatically segmented at semicolons and colons and then POS tagged
- Tags may be incorrect; we want to eventually be able to automatically correct the tags
- Need to gather lists of corrected tags
- Decided to use Mechanical Turk to gather corrected tags
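The segmentation step mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `segment_claim` and the example claim are hypothetical, and whether the original system keeps or drops the delimiters is not stated on the slide (this sketch drops them).

```python
import re


def segment_claim(claim: str) -> list[str]:
    """Split a claim into segments at semicolons and colons.

    Assumption: delimiters are discarded and empty pieces are skipped.
    """
    parts = re.split(r"[;:]", claim)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical claim text, used only to show the behavior.
claim = ("A holder comprising: a body having a recess; "
         "and a clamp attached to the body.")
for segment in segment_claim(claim):
    print(segment)
```

Each resulting segment would then be POS tagged on its own before being shown to workers.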
Our Use of Mechanical Turk
- Want to create an easy-to-perform HIT that has a high impact on our tagging
- The most common problem with a high impact on the dependencies is words incorrectly tagged as verbs
- Chose the task of checking whether words initially tagged as verbs are actually nouns or adjectives
- Turkers will label each of these words as a noun, adjective, or verb and assume all other tags are correct
The Setup
- Initial:
  The/DT holder/NN for/IN use/NN with/IN a/DT razor/NN blade/VBP according/VBG to/TO claim/NN 1/CD ,/, said/VBD recess/NN being/VBG semicircular/VBN in/IN shape/NN and/CC revealing/VBG opposite/JJ faces/NNS of/IN the/DT razor/NN blade/NN which/WDT may/MD be/VB grasped/VBN with/IN the/DT fingers/NNS of/IN the/DT user/NN ./.
- Corrected:
  The/DT holder/NN for/IN use/NN with/IN a/DT razor/NN blade/NN according/VBG to/TO claim/NN 1/CD ,/, said/JJ recess/NN being/VBG semicircular/JJ in/IN shape/NN and/CC revealing/VBG opposite/JJ faces/NNS of/IN the/DT razor/NN blade/NN which/WDT may/MD be/VB grasped/VBN with/IN the/DT fingers/NNS of/IN the/DT user/NN ./.
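To make the difference between the two taggings explicit, the `word/TAG` strings can be parsed and compared token by token. The sketch below is illustrative only; the helper names `parse_tagged` and `tag_changes` are hypothetical, and it assumes both strings contain the same words in the same order.

```python
def parse_tagged(text: str) -> list[tuple[str, str]]:
    """Turn 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so tokens such as ',/,' parse correctly.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def tag_changes(initial: str, corrected: str) -> list[tuple[str, str, str]]:
    """List (word, old_tag, new_tag) for every token whose tag changed."""
    changes = []
    pairs = zip(parse_tagged(initial), parse_tagged(corrected))
    for (word, old_tag), (_, new_tag) in pairs:
        if old_tag != new_tag:
            changes.append((word, old_tag, new_tag))
    return changes


# A shortened portion of the claim from the slide.
initial = ("a/DT razor/NN blade/VBP according/VBG to/TO claim/NN 1/CD ,/, "
           "said/VBD recess/NN being/VBG semicircular/VBN")
corrected = ("a/DT razor/NN blade/NN according/VBG to/TO claim/NN 1/CD ,/, "
             "said/JJ recess/NN being/VBG semicircular/JJ")
print(tag_changes(initial, corrected))
```

For this portion the changes are exactly the three corrections listed earlier: blade, said, and semicircular.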
Worker Homepage
HIT Search
Our HIT
Requester Homepage
Tasks for Requesters
- Editing HITs
- Publishing Batches
- Reviewing Batches
Editing HITs
Editing HITs - Describing the HIT
Editing HITs - Setting Up Properties
Editing HITs - Setting Worker Requirements
Editing HITs - Designing the HIT
Editing HITs - Previewing the HIT
Publishing HITs
Publishing HITs - Developing the Initial CSV file
- Takes a folder with initial POS-tagged patent claims that were not checked and a folder featuring correctly labeled claim segments
- Makes a CSV file featuring one segment (containing verbs) from the initial patent claim files and one already-corrected segment
- Also outputs a file featuring all the correct segments
- From there we can extend it to a 10-segment version by hand
Publishing HITs - Developing a Larger CSV file
- VBA code that takes a 10-segment version and makes a 20-segment version
- Interleaves the segments of each HIT (including the test segments)
- Randomizes the locations of the two test segments and saves their locations as a hidden variable
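The interleaving and randomization described above were done in VBA. As a rough Python sketch of the same idea, the function below merges two HITs and places the two test segments at random positions. The name `merge_hits` and the assumed layout (nine unchecked segments plus one test segment per 10-segment HIT) are illustrative assumptions, not details taken from the original code.

```python
import random


def merge_hits(hit_a, hit_b, test_a, test_b, rng):
    """Combine two 10-segment HITs into one 20-segment HIT.

    hit_a, hit_b: lists of 9 unchecked segments each (assumed layout).
    test_a, test_b: one already-corrected test segment per HIT.
    Returns (segments, test_positions); the positions would be stored
    in a hidden field so the answers can be checked later.
    """
    # Interleave the unchecked segments: a0, b0, a1, b1, ...
    merged = [segment for pair in zip(hit_a, hit_b) for segment in pair]
    # Pick two distinct slots in the final list for the test segments.
    positions = sorted(rng.sample(range(len(merged) + 2), 2))
    for position, test in zip(positions, (test_a, test_b)):
        merged.insert(position, test)
    return merged, positions


rng = random.Random(2015)  # fixed seed only to make the demo repeatable
segments, test_positions = merge_hits(
    [f"a{i}" for i in range(9)],
    [f"b{i}" for i in range(9)],
    "TEST_A",
    "TEST_B",
    rng,
)
print(len(segments), test_positions)
```

Inserting in ascending position order keeps the first test segment at its chosen index when the second one is added.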
Publishing HITs - Uploading the CSV file
Publishing HITs - Previewing the HITs
Publishing HITs - Confirming the Batch
Reviewing Batches
Reviewing Batches - Summary of Results
Reviewing Batches - Reviewing Results
Reviewing Batches - CSV File
Reviewing Batches - Checking Answers
- Need to approve or reject HITs since answers may or may not be correct
- May opt to give the same HIT to multiple Workers and then pay the majority answer
  - Better for subjective HITs
  - More expensive, since multiple “correct” HITs must be paid for
- May instead put test questions with known answers inside HITs and check those answers
  - Better for objective HITs where accuracy is key
  - Method we used for our HITs
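For comparison with the test-question method, the majority-answer alternative can be sketched as follows. This was not the method used for these HITs; the function names and the tie-handling rule (no one is approved on a tie) are assumptions made for the example.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer, or None if there is no clear winner."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top two answers
    return counts[0][0]


def approve_by_majority(submissions):
    """Map each worker id to True if that worker gave the majority answer."""
    winner = majority_answer(list(submissions.values()))
    return {
        worker: winner is not None and answer == winner
        for worker, answer in submissions.items()
    }


# Hypothetical worker ids and answers for a single question.
print(approve_by_majority({"w1": "noun", "w2": "noun", "w3": "verb"}))
```

The cost concern on the slide follows directly: every question needs at least three paid answers before a majority can exist.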
Reviewing Batches - Automatic Matlab HIT Checker
1. Load CSV file and solutions
2. Determine location of test questions
3. Remove non-alphabetical symbols from CSV file and solutions
4. Compare the test questions for a match in the solutions
5. Write whether to accept or reject in the CSV file
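The original checker was written in Matlab. A minimal Python sketch of the same five steps might look like the following. The column names (`worker`, `test_positions`, `answer_<n>`, `decision`) are hypothetical and do not reflect the actual Mechanical Turk results file layout.

```python
import csv
import io
import re


def normalize(text: str) -> str:
    """Keep letters only, lowercased, so punctuation and spacing are ignored."""
    return re.sub(r"[^A-Za-z]", "", text).lower()


def check_rows(rows, solutions):
    """Add an 'Approve' or 'Reject' decision to each result row.

    rows: dicts with hypothetical columns 'test_positions' (e.g. '2,5')
          and 'answer_<n>' for each question n.
    solutions: set of normalized known-correct answers.
    """
    for row in rows:
        positions = [int(p) for p in row["test_positions"].split(",")]
        passed = all(
            normalize(row[f"answer_{p}"]) in solutions for p in positions
        )
        row["decision"] = "Approve" if passed else "Reject"
    return rows


# In-memory stand-in for the downloaded results file.
results_csv = io.StringIO(
    "worker,test_positions,answer_1,answer_2,answer_3\n"
    "w1,2,blade/NN,said/JJ recess/NN,being/VBG\n"
    "w2,2,blade/NN,said/VBD recess/NN,being/VBG\n"
)
solutions = {normalize("said/JJ recess/NN")}
rows = check_rows(list(csv.DictReader(results_csv)), solutions)
print([(row["worker"], row["decision"]) for row in rows])
```

Stripping non-alphabetical characters before comparing means that stray spaces or punctuation in a worker's answer do not cause a rejection, which appears to be the purpose of step 3.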
Mechanical Turk Sandbox
- You may not know how to develop your HIT or how your results will look
- Sandbox mode allows Requesters to develop HITs without having to pay anyone
- Requesters can make Sandbox Worker accounts to perform their own HITs and test answering them
- Layout for Workers and Requesters is the same as on regular Mechanical Turk
Like Monopoly Money
You have fake money on the Sandbox that you can give out to Sandbox Workers
Worker and Requester Sandbox
Fishing for Results
Workers need to be enticed to do HITs. There are many potential ways to do that, and each lure needs to be weighed against the others
Our Experiences
Initial unsuccessful HIT paid 5 cents for 2 questions: after 19 days, only 86 HITs were completed
Secondary successful HIT paid 45 cents for 10 questions: after 27 days, 657 HITs were completed
Very successful third HIT paid $1.20 for 20 questions: after 11 days, 878 HITs were completed
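The three batches are easier to compare once the figures are converted to pay per question and HITs completed per day. The short calculation below uses only the numbers reported above; the function name `batch_rates` is just a label for the example.

```python
def batch_rates(pay_cents, questions, days, completed):
    """Return (cents per question, HITs completed per day)."""
    return pay_cents / questions, completed / days


batches = [
    # (label, pay in cents per HIT, questions per HIT, days, HITs completed)
    ("first", 5, 2, 19, 86),
    ("second", 45, 10, 27, 657),
    ("third", 120, 20, 11, 878),
]
for label, pay_cents, questions, days, completed in batches:
    cents_per_question, hits_per_day = batch_rates(
        pay_cents, questions, days, completed
    )
    print(f"{label}: {cents_per_question:.1f} cents/question, "
          f"{hits_per_day:.1f} HITs/day")
```

Pay rose from 2.5 to 4.5 to 6 cents per question, while throughput rose from roughly 4.5 to 24 to 80 HITs per day. Since each later HIT also contained more questions, the growth in questions answered per day was larger still.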
Our Experiences
- Higher-paying HITs entice more people
- More available HITs entice more people
- Reducing qualifications allows more people to participate
The Next Step
- The next step after gathering all the data is to run a system to automatically correct patent claim POS tags
- We needed Amazon Mechanical Turk to gather corrected tags from patent segments to serve as ground truth for developing our system
Training Stage
Original POS Tagging
Corrected POS Tagging
Rule-based Changing
Feature Extraction
SVM Training
Sorting Features into Classes
Rule-Based Corrector
- Some words are almost always mislabeled as verbs and can thus be automatically corrected from the very beginning via a rule-like system
- List is ever expanding as new words that are always nouns or adjectives are discovered
- Will also force words in the list that have been correctly tagged to keep their tags
Rule-Based Corrector Examples
- said → JJ (“said recess being semicircular”)
- claim → NN (“as recited in claim 5”)
  - sole exception is at the start of a patent (“In this patent, we claim”)
- means → NN (“including means for continuously conveying”)
- wherein → WRB (“The system of claim 13 wherein”)
- nitride → NN (“silver nitride”)
- boride → NN (“cobalt boride”)
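A rule list like the one above maps naturally onto a lookup table. The sketch below is a hypothetical illustration, not the authors' implementation. In particular, detecting the “we claim” exception by checking whether the preceding word is “we” is an assumption made for this example.

```python
# Word -> forced tag, taken from the examples on the slide.
RULES = {
    "said": "JJ",
    "claim": "NN",
    "means": "NN",
    "wherein": "WRB",
    "nitride": "NN",
    "boride": "NN",
}


def apply_rules(tagged, rules=RULES):
    """Force the tag of every listed word in a list of (word, tag) pairs."""
    corrected = []
    for index, (word, tag) in enumerate(tagged):
        key = word.lower()
        # Assumed exception check: 'we claim' keeps its original verb tag.
        follows_we = index > 0 and tagged[index - 1][0].lower() == "we"
        if key in rules and not (key == "claim" and follows_we):
            tag = rules[key]
        corrected.append((word, tag))
    return corrected


print(apply_rules([("said", "VBD"), ("recess", "NN")]))
print(apply_rules([("we", "PRP"), ("claim", "VBP")]))
```

Because the table overwrites the tag regardless of its current value, a listed word that was already tagged correctly simply keeps its tag, as the slide describes.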
Gathering Trigrams and Dependencies
- Developed Matlab code to gather all the “verbs” along with their trigrams and dependencies and sort them by whether they are actually an adjective, a noun, or still a verb
- Each trigram and set of dependencies focuses on just one “verb”. The other words are represented just by their POS tags
- The open question is how to represent the “verb” while preventing overfitting and allowing similar words to be grouped together
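As a rough illustration of the trigram part of this step (the dependency features are omitted), the sketch below collects one trigram per word tagged as a verb, keeping the word itself and replacing its neighbors with their POS tags. Centering the trigram on the verb and using `<START>` and `<END>` padding markers are assumptions. The original Matlab code may define the window differently.

```python
def verb_trigrams(tagged):
    """For each token tagged as a verb, return (prev_tag, word, next_tag).

    tagged: list of (word, tag) pairs using Penn Treebank style tags,
    where every verb tag starts with 'VB'.
    """
    features = []
    for index, (word, tag) in enumerate(tagged):
        if not tag.startswith("VB"):
            continue
        prev_tag = tagged[index - 1][1] if index > 0 else "<START>"
        next_tag = tagged[index + 1][1] if index + 1 < len(tagged) else "<END>"
        features.append((prev_tag, word.lower(), next_tag))
    return features


segment = [("a", "DT"), ("razor", "NN"), ("blade", "VBP"),
           ("according", "VBG"), ("to", "TO")]
print(verb_trigrams(segment))
```

Keeping the raw word in the middle slot is what raises the overfitting concern in the last bullet, since every distinct word becomes its own feature value.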
Word Vectors
- Representation of words as a vector
- Trained on an input database (for us, patent claims)
- Preserves similar relationships between similar groups of words
Testing Stage
Original POS Tagging
Rule-based Changing
SVM Decision Boundary
Feature Extraction
SVM Test
New Labels for Segments
Repeat
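The “Repeat” box suggests running the testing stage more than once, presumably because each newly corrected tag changes the features of its neighbors. One way to sketch that loop is shown below. The stopping rule (stop when a pass changes nothing, or after `max_rounds`) and the toy `toy_relabel` stand-in for the real classifier are assumptions for illustration only.

```python
def correct_until_stable(tags, relabel, max_rounds=10):
    """Re-apply `relabel` until the tags stop changing.

    Returns (final_tags, number_of_rounds_that_changed_something).
    """
    for round_number in range(1, max_rounds + 1):
        new_tags = relabel(tags)
        if new_tags == tags:
            return tags, round_number - 1
        tags = new_tags
    return tags, max_rounds


def toy_relabel(tags):
    """Stand-in for the classifier: fix at most one 'VB?' placeholder per pass."""
    tags = list(tags)
    if "VB?" in tags:
        tags[tags.index("VB?")] = "NN"
    return tags


print(correct_until_stable(["DT", "VB?", "VB?", "NN"], toy_relabel))
```

In the toy run two passes change a tag and the third confirms that nothing else moves, so the loop reports two effective rounds.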
Multiple Rounds of Testing?
After 4 Iterations: A Truly Corrected Parse Tree
Conclusion
- Patent claims, due to their structure, are notoriously difficult for a computer to properly parse
- An automated system can be made to correct incorrect POS tags so that they can be used in more advanced patent processing systems
- A large hand-corrected dataset needs to be obtained in order to learn how to properly change POS tags
- Amazon Mechanical Turk provides the tools needed to crowdsource this hand labeling
Further Work
- Taking the approved Amazon Mechanical Turk results, determining how correct they actually are, and making a large database of the original tagging and the corrected tagging
- Learning how to make a corpus for use in word2vec and then running word2vec on a database consisting entirely of patent claims
- Putting together all the pieces of code already developed for training and testing our automatic corrector and beginning to experiment with it