Feed Corpus :
An Ever Growing Up to Date Corpus
Akshay Minocha, Siva Reddy, Adam Kilgarriff
Lexical Computing Ltd
Introduction
• Study language change
  o over months, years
• Most web pages
  o no info about when written
• Feeds
  o written then posted
• Same feeds over time
  o we hope:
    identical genre mix
    only factor that changes is time
Method
Feed Discovery
Feed Crawler
Feed Scheduler
Feed Validation
Cleaning, de-duplication, Linguistic Processing
Feed Discovery via Twitter
• Tweets often contain links to posts on feeds
  o bloggers, newswires often tweet "see my new post at http..."
• Twitter keyword searches
  o News, business, arts, games, regional, science, shopping, society, etc.
  o Ignore retweets
  o Every 15 minutes
Sample Search
Aim - To make the most out of the search results
https://twitter.com/search?q=news%20source%3Atwitterfeed%20filter%3Alinks&lang=en&include_entities=1&rpp=100
• Query - News
• Source - twitterfeed
• Filter - Links (return only tweets that contain links)
• Language - en (English)
• Include Entities - extra info such as geo, user, etc.
• rpp - results per page (maximum 100)
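A minimal sketch of how the sample URL above can be assembled from its parts. The function name `build_search_url` and its defaults are illustrative assumptions, not part of the original system, and the slides do not establish whether this search endpoint still accepts these parameters.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://twitter.com/search"  # base URL as shown on the slide


def build_search_url(keyword, source="twitterfeed", lang="en", rpp=100):
    """Assemble a search URL in the format of the slide's example.

    The name and the default values are assumptions for illustration.
    """
    # The keyword, source and link filter all travel inside the q parameter.
    query = f"{keyword} source:{source} filter:links"
    params = {"q": query, "lang": lang, "include_entities": 1, "rpp": rpp}
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote)


print(build_search_url("news"))
```

Calling the function once per keyword (news, business, arts, and so on) gives one URL per search category.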
Feed Validation
• Does the link lead directly to a feed?
  o does metadata contain
    type=application/rss+xml
    type=application/atom+xml
• If yes, good
• If no
  o search for a feed in domain of the link
  o If no, search for feed in (one_step_from_domain)
• If still no
  o link is blacklisted
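The validation cascade can be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not the authors' code: `validate` and `feed_links` are hypothetical names, `fetch` is a caller-supplied function that returns HTML text or `None`, and the "one step from domain" stage is left out because the slides do not define it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect href values of <link> tags that advertise an RSS/Atom feed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() in FEED_TYPES and attr.get("href"):
            self.hrefs.append(attr["href"])


def feed_links(html, base_url):
    """Return absolute feed URLs advertised in the page metadata."""
    parser = FeedLinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def validate(link, fetch, blacklist):
    """Try the tweeted link, then its domain root; blacklist on failure.

    fetch(url) is a hypothetical callable returning HTML text or None.
    """
    parts = urlsplit(link)
    domain_root = f"{parts.scheme}://{parts.netloc}/"
    for candidate in (link, domain_root):
        html = fetch(candidate)
        if html:
            feeds = feed_links(html, candidate)
            if feeds:
                return feeds[0]
    blacklist.add(link)
    return None
```

Passing `fetch` in as an argument keeps the sketch independent of any particular HTTP client and makes it easy to exercise with canned pages.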
Scheduling
• Inputs
  o Frequency of update
    average over last ten feeds
  o Yield Rate
    ratio, raw data input to 'good text' output
    • as in SpiderLing, Suchomel and Pomikálek 2012
• Output
  o priority level for checking the feed
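The slides name the two inputs but not the formula that combines them. The sketch below is therefore only one plausible reading: it assumes update frequency is the mean gap between the last ten post timestamps, computes yield rate as good text divided by raw input, and takes priority as yield rate divided by that gap. All function names and the formula itself are hypothetical.

```python
from statistics import mean


def update_interval(timestamps, window=10):
    """Mean gap (in seconds) between the most recent `window` timestamps."""
    recent = sorted(timestamps)[-window:]
    gaps = [later - earlier for earlier, later in zip(recent, recent[1:])]
    # With fewer than two posts there is no gap to measure.
    return mean(gaps) if gaps else float("inf")


def yield_rate(good_size, raw_size):
    """Share of the downloaded data that survives as 'good text'."""
    return good_size / raw_size if raw_size else 0.0


def priority(timestamps, good_size, raw_size):
    """Hypothetical score: higher for feeds that update often and yield well."""
    return yield_rate(good_size, raw_size) / update_interval(timestamps)
```

Under this reading, an hourly feed outranks a daily feed with the same yield rate, and a feed with a single known post gets priority 0.0 until more history accumulates.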
Feed Crawler
Visit feed at top of queue
• Is there new content?
  o If yes: is it already in corpus?
    • Onion: Pomikálek
  o If no: clean up
    • jusText: Pomikálek
  o add to corpus
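One pass of the crawl loop might look like the sketch below. It is a simplification under stated assumptions: exact SHA-1 hashing stands in for Onion (which detects near-duplicates, something this sketch does not do), the `clean` argument stands in for jusText, and cleaning happens before the duplicate check so that the hash is taken over cleaned text. `crawl_once`, `fetch_entries` and `clean` are hypothetical names.

```python
import hashlib
import heapq


def crawl_once(queue, fetch_entries, clean, seen, corpus):
    """Visit the feed at the top of the queue and add its unseen entries.

    queue: heap of (negated_priority, feed_url) tuples.
    fetch_entries(feed_url): hypothetical callable returning raw HTML entries.
    clean(raw_html): hypothetical stand-in for boilerplate removal.
    seen: set of digests of texts already in the corpus.
    """
    _, feed_url = heapq.heappop(queue)
    added = 0
    for raw_html in fetch_entries(feed_url):
        text = clean(raw_html)
        if not text:
            continue
        # Exact-match stand-in for near-duplicate detection.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        corpus.append(text)
        added += 1
    return feed_url, added
```

Storing negated priorities lets Python's min-heap `heapq` pop the highest-priority feed first.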
Prepare for analysis
• Lemmatise, POS-tag
• Load into Sketch Engine
Initial run: Feb-March 2013
• Raw: 1.36 billion English words
• 300 million words after deduplication, cleaning
• 150,000+ feeds
• Delivered to CUP
  o Keep their corpus up-to-date
• Keywords vs enTenTen12
  o [a-z]{3,}
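A keyword comparison restricted to tokens matching `[a-z]{3,}` could be sketched as below. The slides do not say which keyness score was used; this sketch assumes a smoothed ratio of per-million frequencies in the style of Kilgarriff's "simple maths" score. The smoothing constant of 100 and the choice to normalise by the count of retained tokens are illustrative assumptions, as are the function names.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z]{3,}")  # lowercase alphabetic tokens, 3+ letters


def per_million(tokens):
    """Frequency per million over the tokens that match WORD in full."""
    kept = [t for t in tokens if WORD.fullmatch(t)]
    counts = Counter(kept)
    return {w: c * 1_000_000 / len(kept) for w, c in counts.items()}


def keywords(focus_tokens, ref_tokens, smoothing=100.0, top=10):
    """Rank focus-corpus words by smoothed frequency ratio to the reference."""
    focus = per_million(focus_tokens)
    ref = per_million(ref_tokens)
    scores = {
        w: (f + smoothing) / (ref.get(w, 0.0) + smoothing)
        for w, f in focus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top]
```

Here the feed corpus would be the focus corpus and enTenTen12 the reference, so words that are frequent in recent feeds but rare in the 2012 web corpus rise to the top.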
An earlier version
• maintenance
Future Work
MAINTAIN
• Include "Category Tags"
• Other languages
  o Collection started now
  o Identification by langid.py (Lui and Baldwin 2012)
• "No-typo" material
  o copy-edited subset, so
    newspapers, business: yes
    personal blogs: no
  o method: manual classification of 100 highest-volume feeds
Thank You
http://www.sketchengine.co.uk