Introduction to Corpora@Stanford
Florian Jaeger, tiflo@stanford.edu
For the Methods class,
December 3rd, 2003
Some basic questions
Where are our corpora? Where is the software?
– Is there a list of all the stuff we have?
– How can I access the software?
Where do I start? What information is available where?
– Are there tutorials for the available software?
What kind of corpus work is supported at Stanford?
– Corpora are only for those computational folks … ;-)
And the most important question:
Why bother at all …
Because our (ad hoc) intuitions are often wrong
– linguistic methodology is …
– well, let’s not go there.
While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities.
To illustrate my point, a little case study …
Hagit Borer: “Some notes on the Syntax and Semantics of Quantity”
Talk for the Sem. Workshop, 10/31/2002
Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.”
– 0.5 apples/apple
– 1.0 apples/apple
– 1.5 apples/apple
– zero apples/apple
Hagit Borer’s judgments:
– 0.5 apples/*apple
– 1.0 apples/*apple
– 1.5 apples/*apple
– zero apples/*apple
Google’s count:
– 0.5 apples (120)/*apple (179)
– 1.0 apples (42)/*apple (23,600)
– 1.5 apples (59)/*apple (362)
– zero apples (194)/*apple (124)
This also makes some of the problems clear, so let’s take pears.
Google’s count:
– 0.1 pears (32)/*pear (118)
– 0.5 pears (37)/*pear (50)
– 0.7 pears (9)/*pear (14)
– 1.0 pears (14)/*pear (24,000)
– 1 pears (14)/?pear (7,480)
– One pears (1,130)/?pear (3,060)
– 1.5 pears (28)/*pear (316)
– zero pears (3)/*pear (0)
Conclusion:
– It is amazing how many programs or computer products use fruit names.
– The original judgments seem questionable.
BUT: can we trust Google?
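To compare the pear counts at a glance, the raw hits can be turned into proportions. The following is only a minimal sketch: the numbers are the Google counts quoted on the slide, the name `plural_share` is made up for illustration, and web hit counts are noisy, so the percentages are at best a rough indication.

```python
# Google hit counts copied from the slide: quantifier -> (plural hits, singular hits).
PEAR_COUNTS = {
    "0.1": (32, 118),
    "0.5": (37, 50),
    "0.7": (9, 14),
    "1.0": (14, 24000),
    "1": (14, 7480),
    "one": (1130, 3060),
    "1.5": (28, 316),
    "zero": (3, 0),
}


def plural_share(plural, singular):
    """Return the fraction of hits that use the plural form, or None if there are no hits."""
    total = plural + singular
    if total == 0:
        return None
    return plural / total


if __name__ == "__main__":
    for quantifier, (plural, singular) in PEAR_COUNTS.items():
        share = plural_share(plural, singular)
        # Every row on the slide has at least one hit, so share is never None here.
        print(f"{quantifier:>5} pears/pear: {share:.1%} plural")
```

Even for 0.5 and 0.7, where the judgments star the singular, the singular form accounts for more than half of the hits in these counts.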
Corpora@Stanford
– Home
– Introduction
– Rules & Copyrights
– Account Setup
– Available corpora
– Available software
– Classes & Projects
– Tutorials: Grep Tutorial, Tgrep Tutorial, CQP Tutorial, GSearch Tutorial, …
– Top 10 Info sources on the net
– Local Support: E-list & Corpus TA
– Help for Corpus TAs
– Acknowledgments
– Site Map (to come)
In addition to the indicated structure, all pages offer links to external pages, including corpora, software, tutorials, demos, etc.
Looking for a corpus
Several sites on the web can help you find out whether what you are looking for exists:
– Databases like David Lee’s site (see also our Top 10 list)
– The LDC database
– Our list of corpora (next page)
Email lists, see our site under ‘Support’
– Local: [email protected]
– Global: [email protected]
Types of corpora
Different languages
Different media (speech, video, text)
Different levels of annotation
– No annotation
– Transcribed speech or video
– Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.)
– Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.)
– Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)
Looking for a specific corpus
List of available corpora
– If the corpus is on AFS
– If the corpus is on the Corpus Computer
– If the corpus is on CD
– If the corpus is on the WWW
– If the corpus has special license conditions
– If we don’t have the corpus
Tools & software
General
Where to start:
– Local online tutorials (see also external references and manuals)
– The corpus TA
– [email protected]
Little helpers
A brief look at some tools
BNC Web
– Problem: Superiority “who the hell …”
– Problem: Distribution of “… is like …” – age dependent?
General information
Age (easy export to e.g. Excel)
Crosstabs
TGrep2 and Tgrep
– Tutorial
– Examples:
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)'
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)'
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)'
    tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'
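To see what the first two example patterns pick out, here is a rough Python analogue that runs on toy bracketings instead of the real corpus. It is a sketch under assumptions: the two sentences are invented in Penn-Treebank style, the helper names (`parse`, `subtrees`, `vp_with_adjacent_sisters`) are hypothetical, and "adjacent children of the same VP" is used as a simplified stand-in for the `$.` relation.

```python
def parse(text):
    """Parse a Penn-Treebank-style bracketing into (label, children) tuples; words stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def subtrees(tree):
    """Yield every labelled node in the tree, top down."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def vp_with_adjacent_sisters(tree, first, second):
    """Simplified analogue of 'VP < (first $. second)': some VP has adjacent children first, second."""
    for label, children in subtrees(tree):
        labels = [c[0] if isinstance(c, tuple) else None for c in children]
        if label == "VP" and any(a == first and b == second
                                 for a, b in zip(labels, labels[1:])):
            return True
    return False


# Invented example sentences (not taken from the WSJ corpus).
DOUBLE_OBJECT = "(S (NP (PRP She)) (VP (VBD gave) (NP (PRP him)) (NP (DT a) (NN book))))"
PREPOSITIONAL = "(S (NP (PRP She)) (VP (VBD gave) (NP (DT a) (NN book)) (PP-DTV (TO to) (NP (PRP him)))))"

if __name__ == "__main__":
    for name, text in [("double object", DOUBLE_OBJECT), ("prepositional", PREPOSITIONAL)]:
        tree = parse(text)
        print(name,
              "| NP NP:", vp_with_adjacent_sisters(tree, "NP", "NP"),
              "| NP PP-DTV:", vp_with_adjacent_sisters(tree, "NP", "PP-DTV"))
```

The real queries do the same kind of structural matching over every tree in the corpus file, which is why the two patterns separate double-object from prepositional datives.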
Note: Tgrep is right-headed
The following pattern matches an S that has a child A and another child C, where the A has a child B:
– S < (A < B) < C
However, this pattern means that S has a child A, and that A has children B and C:
– S < ((A < B) < C)
It is equivalent to this:
– S < (A < B < C)
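The difference between the two groupings can be checked on two toy trees. This is a minimal sketch, not TGrep2 itself: trees are plain `(label, children)` tuples, and the function names `has_child`, `matches_flat` and `matches_grouped` are invented for illustration.

```python
# A node is a (label, children) tuple; a leaf has an empty child list.
def has_child(node, label, condition=lambda child: True):
    """True if node has a child with this label that also satisfies condition."""
    return any(child[0] == label and condition(child) for child in node[1])


def matches_flat(node):
    """S < (A < B) < C: both outer links attach to S."""
    return (node[0] == "S"
            and has_child(node, "A", lambda a: has_child(a, "B"))
            and has_child(node, "C"))


def matches_grouped(node):
    """S < (A < B < C): both inner links attach to A."""
    return (node[0] == "S"
            and has_child(node, "A",
                          lambda a: has_child(a, "B") and has_child(a, "C")))


# C is a child of S in the first tree and a child of A in the second.
c_under_s = ("S", [("A", [("B", [])]), ("C", [])])
c_under_a = ("S", [("A", [("B", []), ("C", [])])])

print(matches_flat(c_under_s), matches_flat(c_under_a))
print(matches_grouped(c_under_s), matches_grouped(c_under_a))
```

Each tree satisfies exactly one of the two patterns, so the placement of the parentheses decides which node the second link belongs to.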
Some more TGrep2 syntax
A < B      A is the parent of (immediately dominates) B.
A > B      A is the child of B.
A <N B     B is the Nth child of A (the first child is <1).
A >N B     A is the Nth child of B (the first child is >1).
A <, B     Synonymous with A <1 B.
A >, B     Synonymous with A >1 B.
A <-N B    B is the Nth-to-last child of A (the last child is <-1).
A >-N B    A is the Nth-to-last child of B (the last child is >-1).
A <- B     B is the last child of A (synonymous with A <-1 B).
A >- B     A is the last child of B (synonymous with A >-1 B).
A <` B     B is the last child of A (also synonymous with A <-1 B).
A >` B     A is the last child of B (also synonymous with A >-1 B).
A <: B     B is the only child of A.
A >: B     A is the only child of B.
A << B     A dominates B (A is an ancestor of B).
Some more TGrep2 syntax
A >> B     A is dominated by B (A is a descendant of B).
A <<, B    B is a left-most descendant of A.
A >>, B    A is a left-most descendant of B.
A <<` B    B is a right-most descendant of A.
A >>` B    A is a right-most descendant of B.
A <<: B    There is a single path of descent from A and B is on it.
A >>: B    There is a single path of descent from B and A is on it.
A . B      A immediately precedes B.
A , B      A immediately follows B.
A .. B     A precedes B.
A ,, B     A follows B.
A $ B      A is a sister of B (and A ≠ B).
A $. B     A is a sister of and immediately precedes B.
A $, B     A is a sister of and immediately follows B.
A $.. B    A is a sister of and precedes B.
A $,, B    A is a sister of and follows B.
A = B      The node matched by A is also matched by B.
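A few of these relations can be spelled out as small predicates on a toy tree. This is an illustrative sketch only: the `Node` class, the function names and the example sentence are all invented, and `$.` is approximated as "next child of the same parent".

```python
class Node:
    """Tree node that records its parent, so relations can be checked in both directions."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def is_parent(a, b):                      # A < B
    return b.parent is a


def dominates(a, b):                      # A << B
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def is_nth_child(a, b, n):                # A <N B (n > 0) or A <-N B (n < 0)
    if n == 0 or b.parent is not a:
        return False
    index = n - 1 if n > 0 else n
    kids = a.children
    return -len(kids) <= index < len(kids) and kids[index] is b


def is_only_child(a, b):                  # A <: B
    return len(a.children) == 1 and a.children[0] is b


def are_sisters(a, b):                    # A $ B
    return a is not b and a.parent is not None and a.parent is b.parent


def immediately_precedes_sister(a, b):    # A $. B (approximated by adjacency among children)
    if not are_sisters(a, b):
        return False
    kids = a.parent.children
    position = next(i for i, kid in enumerate(kids) if kid is a)
    return position + 1 < len(kids) and kids[position + 1] is b


# Invented example: "She gave him a book".
gave = Node("gave")
vbd = Node("VBD", gave)
np_him = Node("NP", Node("PRP", Node("him")))
nn = Node("NN", Node("book"))
np_book = Node("NP", Node("DT", Node("a")), nn)
vp = Node("VP", vbd, np_him, np_book)
s = Node("S", Node("NP", Node("PRP", Node("She"))), vp)

print("VP < VBD:", is_parent(vp, vbd))
print("S << NN:", dominates(s, nn))
print("VP <-1 NP(book):", is_nth_child(vp, np_book, -1))
print("NP(him) $. NP(book):", immediately_precedes_sister(np_him, np_book))
```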
The alternative with windows
TigerSearch 2.1; screen shots:
– Grammar search
– Collocation search
The end, my friends
Want to help?
– The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.)
Tschuessi!