Page 1:


School of Computing, FACULTY OF ENGINEERING

Coursework exercise: Localization of English on the World Wide Web

Knowledge Management and Adaptive Systems

Eric Atwell, Language Research Group

Page 2:

English on the World Wide Web

This assignment involves knowledge management and knowledge discovery from real-world data, sourced from the World Wide Web.

This exercise will give you practical experience of

methods in data mining, knowledge discovery, text analytics, the CRISP-DM data-mining methodology,

reporting and explaining technologies for knowledge management to professional “clients” or “customers”.

Page 3:

The imaginary scenario

English is used world-wide as the common language of science, engineering, business, commerce, and education; but in any given part of the world, is there a localized regional variety of English? Advertising and marketing literature may need to be “localized” or adapted to suit local preferences. We want to persuade academics in language-related research that Data Mining and Text Mining are useful for their work. To do this, we are going to produce research reports targeted at language academics, demonstrating that text analytics can provide empirical evidence to help answer a research question:

Are the English-language web-pages from a selected region closer to British English or American English? And are there noticeable region-specific differences?

Page 4:

Deliverable: report

The end result will be a summary of your research aimed at language academics. Your 5-10 minute video summary of your findings should include Introduction, Methods, Results, and Conclusions; you should include screenshots with examples of evidence.

Students can work in pairs, to share the workload; this means you have someone else to discuss the details with, and it should roughly halve the workload of each individual, though there is some extra overhead in collaborating and coordinating the work. However, if you prefer, you can work on your own and produce an individual report.

Page 5:

1) Background research

Read the lecture handouts on this topic, including (Atwell et al 2009, Atwell et al 2007), which describe similar exercises undertaken previously, and (Baroni and Bernardini 2004), which describes their use of BootCat and Google to collect Italian medical documents from the Web and then data-mine them for Italian medical terminology. You should also explore the SketchEngine website, which includes WebBootCat, a simple web-interface to BootCat and Google; and the World Wide English Corpus collected by past students, using WebBootCat and Google.

Page 6:

2) Choose region and audience

Choose a specific region to study, for which you can find WWW-domain data and a target readership.

(i) Select a regional subcorpus sample of texts, covering between 3 and 6 national domains relevant to your chosen region, from the World Wide English corpus website. You may use SketchEngine and WebBootCat to collect additional national English 200,000-word sample(s) for countries not yet included in the World Wide English Corpus; you will need to register for a free trial username, then follow instructions.

(ii) Find a journal with research papers about language use in this region, e.g. from the lists on the website, or from Google searches, or by browsing the Leeds University Library; this should be a journal publishing papers on language and/or advertising and marketing, NOT a computer science journal. The journal website and sample papers should give you an idea of what sort of content to put in your report to interest these “clients”.

Page 7:

SketchEngine WebBootCat Advanced

Page 8:

WebBootCat Advanced options

Page 9:

Getting URLs ...

Page 10:

Harvest your corpus

Tick the URLs you want (default: yes!)

WebBootCat visits URLs, downloads, scrapes text

This may take some time ... Or even stall

When finished, download Horizontal or Vertical

- you want Vertical, one word per line

BUT: check it is 200,000+ words;

If not, try again, maybe with different parameter settings
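As a quick sanity check on the 200,000-word target, a short Python sketch like the one below can count the tokens in a downloaded Vertical file. It assumes the usual vertical convention that structural markup lines (e.g. <doc ...>, <s>) start with "<" and that everything else is one token per line; adjust it if your download looks different.

# Count tokens in a WebBootCat "Vertical" download (one token per line).
# Assumption: markup lines start with '<'; token lines may carry extra
# tab-separated columns, but only the line count matters here.
import sys

def count_tokens(path):
    tokens = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("<"):
                tokens += 1
    return tokens

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. mycorpus.vert (hypothetical file name)
    n = count_tokens(path)
    print(f"{path}: {n} tokens")
    if n < 200_000:
        print("Fewer than 200,000 words - try WebBootCat again with different settings.")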

Page 11:

3) Tell me your plans

Email your choice of 3+ WWW domains, journal TITLE and website URL to [email protected] for verification;

... and if you choose to work as a pair, please include NAMES of both partners.

If I find the set of domains and/or journal unsuitable or someone else has already targeted the same group of countries, I may ask you to choose another.

Page 12:

4) Features: differences UK v US

Question: is your regional English more like UK or US?

Consult expert sources on the English language to identify significant, measurable terms or features which show the difference between British and American English,

eg relative freq of “color” v “colour”, or “center” v “centre”.

You also have to think about how to extract feature-values from the data; do not choose features which are difficult to extract or compute (for example, features based on differences in grammar or meaning; these are hard to extract computationally).

Some possible expert sources are suggested in Wikipedia; also try appropriate Google searches, and/or Leeds Uni Library.

You can try Rayson’s Log-likelihood calculator or Sharoff’s compare-fq-lists.pl to compare two corpus-derived word-frequency lists and highlight words which are noticeably more frequent in one or other corpus.
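If you would rather script the comparison yourself, the sketch below shows the same log-likelihood idea in Python (Rayson's G2 statistic). It assumes frequency lists in the "frq word" or "rank frq word" format described in the compare-fq-lists.pl usage message on the next page; it is an illustration of the calculation, not a replacement for the tools above.

# Minimal sketch: compare two word-frequency lists with log-likelihood scores,
# in the spirit of compare-fq-lists.pl. Input format: "frq word" or "rank frq word".
import math
import sys

def read_freq_list(path):
    freqs = {}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:      # frq word
                frq, word = parts
            elif len(parts) == 3:    # rank frq word
                _, frq, word = parts
            else:
                continue
            freqs[word] = freqs.get(word, 0) + int(float(frq))
    return freqs

def log_likelihood(a, b, n1, n2):
    # G2 for one word: a = count in reference corpus (size n1), b = count in study corpus (size n2)
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

if __name__ == "__main__":
    ref, study = sys.argv[1], sys.argv[2]   # e.g. the UK list and your regional list
    f1, f2 = read_freq_list(ref), read_freq_list(study)
    n1, n2 = sum(f1.values()), sum(f2.values())
    scores = []
    for word in set(f1) | set(f2):
        a, b = f1.get(word, 0), f2.get(word, 0)
        if b > a * n2 / n1:                 # keep words overused in the study corpus
            scores.append((log_likelihood(a, b, n1, n2), word, a, b))
    for ll, word, a, b in sorted(scores, reverse=True)[:30]:
        print(f"{word}\t{a}\t{b}\t{ll:.2f}")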

Page 13:

compare-fq-lists.pl

cslin-gps% perl compare-fq-lists.pl

A tool for comparing frequency lists from Fname2 against Fname1 as the reference corpus

Usage compare-fq-lists.pl -1 Fname -2 Fname2 [-f N] [-l] [-o] [-s] [-t N]

The format of frequency lists is either frq word OR rank frq word

-f N --functionwords Do not count the top N most frequent words (50 is a good approximation)

-i --ignorezero Do not count words not occurring in the reference corpus

-l --latex Output a Latex table

-o --odds Compute log-odds instead of loglikelihood (used by Marco Baroni)

-s --simplified Only words and scores are output

-t N --threshold No more than N words

Page 14:

cslin-gps% perl compare-fq-lists.pl -1 /home/www/eric/db32/uk/ukw -2 /home/www/eric/db32/us/usw | more

/home/www/eric/db32/uk/ukw: 1765060

/home/www/eric/db32/us/usw: 1745843

Comparing two corpora /home/www/eric/db32/uk/ukw (1765060) vs /home/www/eric/db32/us/usw (1745843)

Word Frq1 Frq2 LL-score

à 490.00 7767.00 7751.18

CALL 7.00 1456.00 1940.10

NUMBER 10.00 1438.00 1888.85

shall 290.00 2317.00 1803.39

State 157.00 1943.00 1801.43

or 8300.00 13941.00 1499.97

state 302.00 1922.00 1324.76

Texas 23.00 1068.00 1290.99

Page 15:

Contd...

Texas 23.00 1068.00 1290.99

Mr. 31.00 1088.00 1269.74

program 157.00 1164.00 872.88

PN1997 0.00 581.00 805.44

Posted 184.00 1165.00 800.53

Board 363.00 1516.00 767.93

County 219.00 1164.00 714.21

percent 38.00 672.00 689.69

Video 70.00 772.00 687.77

license 17.00 536.00 615.79

Department 421.00 1438.00 595.40

Program 18.00 523.00 593.15

States 105.00 775.00 579.83

school 532.00 1610.00 576.27

vehicle 50.00 601.00 551.82

Ms. 6.00 433.00 545.62

ORS 0.00 383.00 530.95

Bush 39.00 545.00 524.88

City 281.00 1099.00 523.75

property 243.00 1019.00 518.06

PM 140.00 800.00 515.61

court 74.00 636.00 512.14

Ãç 8.00 416.00 508.99

Center 59.00 588.00 504.09

CA 16.00 442.00 497.08

Home 238.00 952.00 463.50

Commission 88.00 624.00 457.11

department 84.00 603.00 444.63

Oregon 4.00 347.00 443.17

Page 16:

cslin-gps% perl compare-fq-lists.pl -1 /home/www/eric/db32/us/usw -2 /home/www/eric/db32/uk/ukw | more

/home/www/eric/db32/us/usw: 1745843

/home/www/eric/db32/uk/ukw: 1765060

Comparing two corpora /home/www/eric/db32/us/usw (1745843) vs /home/www/eric/db32/uk/ukw (1765060)

Word Frq1 Frq2 LL-score

UK 39.00 1542.00 1823.35

BBC 16.00 1349.00 1716.88

London 40.00 1176.00 1331.52

programme 2.00 601.00 808.89

Wales 6.00 586.00 753.13

aircraft 18.00 641.00 747.30

Experience 45.00 722.00 718.64

Scotland 20.00 575.00 648.63

Ireland 13.00 541.00 643.88

British 97.00 809.00 636.02

aerospace 4.00 478.00 621.53

Page 17:

Contd...

aerospace 4.00 478.00 621.53

it 6977.00 10254.00 595.13

programmes 0.00 402.00 557.29

i 318.00 1217.00 555.58

album 10.00 461.00 555.42

University 579.00 1605.00 493.47

you 6734.00 9646.00 491.91

aero 0.00 350.00 485.20

aviation 3.00 372.00 484.64

Centre 7.00 389.00 478.09

Tel 4.00 347.00 442.51

Scottish 11.00 367.00 423.84

Subjects 3.00 326.00 421.67

Cambridge 37.00 451.00 412.98

research 331.00 1074.00 408.20

DeweyClass 0.00 272.00 377.07

Author 4.00 296.00 373.08

scheme 18.00 352.00 368.07

details 115.00 610.00 367.97

Manchester 5.00 296.00 366.01

colour 6.00 300.00 364.72

Britain 44.00 417.00 347.09

organisation 3.00 267.00 341.08

IRA 10.00 298.00 338.16

Page 18:

Compare-fq-lists.pl results

Many terms are local places, names

US: Texas, state, Bush; UK: Wales, Manchester, IRA

– not really “linguistic” preferences?

Many terms are “noise” due to unrepresentative sample, skewed to a few topics/domains

US: County, Board, vehicle; UK: University, aviation

A few terms are recognisably “linguistic” differences

US: program, center; UK: programme, centre, colour
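One way to lift the "linguistic" differences above the place-name and topic noise is to restrict attention to known US/UK spelling pairs. The sketch below assumes you already have a word-frequency dictionary per corpus (e.g. from read_freq_list in the earlier sketch); the SPELLING_PAIRS list is a small illustrative sample only, so consult the expert sources from step 4 for a fuller, better-justified feature set.

# Minimal sketch: relative frequency of a few known US/UK spelling pairs.
# The pair list is illustrative, not an authoritative resource.
SPELLING_PAIRS = [
    ("color", "colour"),
    ("center", "centre"),
    ("program", "programme"),
    ("organization", "organisation"),
    ("license", "licence"),
]

def variant_profile(freqs):
    # For each pair, report what percentage of the combined uses take the US spelling.
    profile = {}
    for us, uk in SPELLING_PAIRS:
        us_n, uk_n = freqs.get(us, 0), freqs.get(uk, 0)
        total = us_n + uk_n
        pct_us = 100 * us_n / total if total else None   # None = no evidence either way
        profile[us + "/" + uk] = (us_n, uk_n, pct_us)
    return profile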

Page 19:

5) Start your report: intro, methods

“Arabic and Arab English in the Arab World” (Atwell et al 2009) provides a model for what should appear in your report; plan your own report IN YOUR OWN WORDS to show you understand what you are doing, and replace Arab English results with your own results.

Start by drafting the Introduction, to explain why the question and approach are relevant to language researchers in this region (if appropriate, quote from the website statement of aims or target readership of your journal), and your initial assumption about the type of English preferred in the countries you have chosen to investigate (e.g. you might assume that countries in the Commonwealth, mainly ex-British colonies, might prefer British English).

Next, expand on the Methods, explaining your approach to finding empirical evidence, and your choice of features for text-mining.

Page 20:

6) Extract feature values

For each feature you have chosen to use, you need to extract feature-values for each National English.

For example, if you decide that relative frequency of “color” v “colour” is a feature you want to include in your model, you will need to extract values of this feature from each National English sample;

To help you, word-frequency lists are available on the website for each national text sample.

Create data-files formatted for data-mining (eg WEKA .arff files) encoding these features:

for the Gold Standard UK and US samples on the website,

and for your national domain web-as-corpus samples
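A minimal sketch of this feature-extraction step is given below. It writes a WEKA .arff file with the same attributes as the ukus.arff example on the next page, assuming (a) you already have a word-frequency dictionary per national sample, and (b) the "...percent" attributes are the US spelling's percentage share of each pair, which matches most of the example rows but should be verified against the files on the website.

# Minimal sketch: build a ukus-style .arff file from per-sample frequency dicts.
# Assumption: centerpercent = 100 * center / (center + centre), similarly for color.
def pair_features(freqs, us, uk):
    us_n, uk_n = freqs.get(us, 0), freqs.get(uk, 0)
    pct = round(100 * us_n / (us_n + uk_n)) if (us_n + uk_n) else 0
    return us_n, uk_n, pct

def write_arff(path, samples):
    # samples: list of (freq_dict, label) pairs, label "UK" or "US" (your initial assumption)
    with open(path, "w", encoding="utf-8") as out:
        out.write("@relation ukus\n")
        for name in ("center", "centre", "centerpercent",
                     "color", "colour", "colorpercent"):
            out.write("@attribute " + name + " numeric\n")
        out.write("@attribute english {UK,US}\n@data\n")
        for freqs, label in samples:
            row = pair_features(freqs, "center", "centre") \
                + pair_features(freqs, "color", "colour") + (label,)
            out.write(",".join(str(v) for v in row) + "\n")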

Page 21:

/home/www/eric/db32/ukus.arff

@relation ukus

@attribute center numeric

@attribute centre numeric

@attribute centerpercent numeric

@attribute color numeric

@attribute colour numeric

@attribute colorpercent numeric

@attribute english {UK,US}

@data

1,32,3, 0,20,0, UK

0,25,0, 0,12,0, UK

9,27,25, 0,84,0, UK

0,19,0, 0,24,0, UK

0,16,0, 0,14,0, UK

0,16,0, 0,12,0, UK

0,21,0, 0,38,0, UK

0,25,0, 0,34,0, UK

2,26,7, 2,3,40, UK

2,32,5, 1,59,2, UK

31,0,100, 55,0,100, US

61,0,100, 26,0,100, US

24,0,100, 11,0,100, US

12,1,92, 21,4,84, US

8,0,100, 4,2,67, US

10,0,100, 8,0,100, US

19,0,100, 22,0,100, US

14,0,100, 7,0,100, US

14,0,100, 6,0,100, US

8,5,62, 24,0,100, US

Page 22:

/home/www/eric/db32/test.arff

@relation test

@attribute center numeric

@attribute centre numeric

@attribute centerpercent numeric

@attribute color numeric

@attribute colour numeric

@attribute colorpercent numeric

@attribute english {UK,US}

@data

10,5,33, 0,20,0, UK

test.arff has data

(values for 7 features)

For ONE other nation

(UK English assumed)

Page 23:

7) Train and test models

Use these data-mining feature files in a DM tool (eg in WEKA, train data-mining models such as ZeroR, OneR, the JRip rule-based classifier, or the J48 decision-tree) with the UK and US datasets, to produce a model which predicts whether a new dataset is UK or US English; OR just draw a Decision Tree “by hand”

then test this with your selected national domains, to find how they are classified by the model.

Note that if you use WEKA, the arff files must include UK or US as the final feature; this should be your initial assumption of the preference for each country.

If the model classifies the country “correctly” this means WEKA agrees with your assumption; but if the model is “wrong”, this actually signifies that your assumption was wrong.

Make sure you save lots of “snapshots” of evidence from your data-mining experiments, to paste into your video.
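If you do draw a decision tree "by hand", you can still apply it mechanically to your national samples. The sketch below shows one plausible hand-made tree over the centerpercent and colorpercent features; the 50% thresholds are an illustrative assumption, not the output of any trained WEKA model.

# Minimal sketch of a hand-drawn decision tree over centerpercent / colorpercent.
def classify(centerpercent, colorpercent):
    if colorpercent > 50:       # "color" dominates "colour"
        return "US"
    if centerpercent > 50:      # "center" dominates "centre"
        return "US"
    return "UK"

# Example: the test.arff row on the previous page (10,5,33, 0,20,0, UK)
print(classify(centerpercent=33, colorpercent=0))   # -> UK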

Page 24:

8) Localized English words

EXTRA WORK: combine your 3-6 national corpora into a single corpus representing English for the whole region – like the .ARAB Arab English corpus in (Atwell et al 2009). Derive a word-frequency list for your regional English corpus, and use Sharoff’s compare-fq-lists.pl to compare the regional word-frequency list with the UK and/or US word-frequency list. Highlight words which are noticeably more frequent in one or other corpus; and try to think of possible explanations for the differences found.
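A minimal sketch of this merging step, assuming one frequency dictionary per national sample (e.g. from read_freq_list in the earlier sketch); it writes the combined regional list in the "frq word" format that compare-fq-lists.pl accepts.

# Minimal sketch: merge national frequency lists into one regional list.
from collections import Counter

def merge_freq_lists(freq_dicts):
    merged = Counter()
    for freqs in freq_dicts:
        merged.update(freqs)    # Counter.update adds counts
    return merged

def write_freq_list(path, freqs):
    with open(path, "w", encoding="utf-8") as out:
        for word, frq in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
            out.write(str(frq) + " " + word + "\n")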

Page 25:

9) Add to Report: results, conclusions

Extend the research report plan to include

Results (evidence, including a summary of the Data-Mining output)

and Conclusions (summarise the evidence showing which variety of English is dominant in each national domain, and in the region overall; discuss noticeable differences comparing regional English with UK and/or US English; and discuss any limitations of your approach).

(you may also want to “tidy up” intro and methods)

Page 26:

Submit Research Video

upload the movie to YouTube; then EMAIL [email protected] with the URL of your movie, AND state whether you agree to let me add this URL to the course website, to let other students see it.

Deadline: Week 8, Friday 18/11/11

- please do NOT submit your video report directly to a Journal!

Page 27:

References

Atwell, Eric; Al-Sulaiti, Latifa; Sharoff, Serge. 2009. Arabic and Arab English in the Arab World. Proceedings of CL’2009 International Conference on Corpus Linguistics.

Atwell, Eric; Arshad, Junaid; Lai, Chien-Ming; Nim, Lan; Rezapour Ashregi, Noushin; Wang, Josiah; Washtell, Justin. 2007. Which English dominates the World Wide Web, British or American? Proceedings of Corpus Linguistics 2007.

Baroni, Marco; and Bernardini, Silvia. 2004. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC’2004

LREC: Language Resources and Evaluation Conferences website http://www.lrec-conf.org/

World Wide English Corpus collected by past students http://www.comp.leeds.ac.uk/eric/wwe.shtml

Sharoff, Serge. 2008. Tools for Processing Corpora (including compare-fq-lists.pl). http://corpus.leeds.ac.uk/tools/

Page 28:

Marking scheme

Markers will act as if they were “clients” or “customers” for the summary report, and review how well it might convince language researchers of the value of text analytics and text-mining techniques; grades will reflect how much more work is required to make necessary improvements.

All grades will be on a letter scale from A** down to F*. As SIS cannot handle letter-grades, they will be mapped onto a numerical scale from 10 down to 1 to enter into edass/SIS.

Your research report will be graded for quality of

(1) I: Introduction (explaining why the question and approach are relevant to language researchers),

(2) M: Methods (explaining your approach to finding empirical evidence, and your choice of features),

(3) R: Results (evidence, including screenshots showing analytics output),

(4) C: Conclusions (summarising the evidence showing which variety of English is dominant in each national domain, and in the region overall; and discussing any limitations of your approach)

You should devote roughly equal effort to each of these 4 parts, as each carries equal weight; the 4 marks will be conflated into a single overall grade.

Page 29:

Summary: practical experience of …

This assignment involves KM, Information Retrieval and text mining from data sourced from the World Wide Web.

This exercise will give you practical experience of

methods in data mining, knowledge discovery, text analytics, the CRISP-DM data-mining methodology,

and explaining technologies for KM to professional “clients” or “customers” – you are the KM Consultant!