School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Coursework exercise:...

School of somethingFACULTY OF OTHER

School of ComputingFACULTY OF ENGINEERING

Coursework exercise:Localization of English on the World Wide Web

Data Mining and Text Analytics

Eric Atwell, Language Research Group

English on the World Wide Web

This assignment involves knowledge management, data mining and text analytics from real-world data, sourced from the World Wide Web.

This exercise will give you practical experience of

methods in data mining, knowledge discovery, text analytics, the CRISP-DM data-mining methodology,

reporting and explaining technologies for knowledge management to professional “clients” or “customers”.

The imaginary scenario

English is used world-wide as the common language of science, engineering, business, commerce, and education; but in any given part of the world, is there a localized regional variety of English? Advertising and marketing literature may need to be “localized” or adapted to suit local preferences. We want to persuade Canon marketing department that Data Mining and Text Mining are useful for their work. To do this, you are going to produce a short video research report, demonstrating that text analytics can provide empirical evidence to help answer a research question:

Are the English-language web-pages from a selected region closer to British English or American English?

Deliverable: report

The end result will be a summary of your research aimed at Canon marketing managers. Your 5-10 minute video summary of your findings should include Introduction, Methods, Results, and Conclusions; you should include screenshots with examples of evidence.

Students can work in pairs, to share the workload; this means you have someone else to discuss the details with, and also should halve the workload of each individual, though there is some extra overhead in collaborating and coordinating the work. However, if you prefer, you can work on your own and produce an individual report.

1) Background research

Read research papers on this topic, including (Atwell et al 2009, Atwell et al 2007) which describe similar exercises undertaken previously, and (Baroni and Bernadini 2004), which describes their use of BootCat and Google to collect Italian medical documents from the Web, and then data-mine these for Italian medical terminology. You should also explore the SketchEngine website which includes WebBootCat, a simple web-interface to BootCat and Google; and the World Wide English Corpus collected by past students, using WebBootCat and Google.

2) Choose region and audience

Choose a specific region to study, for which you can find WWW data.

Select a regional subcorpus sample of texts, covering between 3 and 6 national domains relevant to your chosen region, from the World Wide English corpus website. You may use SketchEngine and WebBootCat to collect additional national English 200,000-word sample(s) for countries not yet included in the World Wide English Corpus; you will need to register for a free trial username, then follow instructions.

SketchEngine WebBootCat Advanced

WebBootCat Advanced options

Getting URLs ...

Harvest your corpus

Tick the URLs you want (default: yes!)

WebBootCat visits URLs, downloads, scrapes text

This may take some time ... Or even stall

When finished, download Horizontal or Vertical

- you want Vertical, one word per line

BUT: check it is 200,000+ words;

If not, try again, maybe with different parameter settings

3) Tell me your plans

Email your choice of 3+ WWW domains to [email protected] for verification;

... and if you choose to work as a pair, please include NAMES of both partners.

If I find the set of domains unsuitable or someone else has already targeted the same group of countries, I may ask you to choose another.

4) Features: differences UK v US

Question: is your regional English more like UK or US?

Consult expert sources on the English language to identify significant, measurable terms or features which show the difference between British and American English,

eg relative freq of “color” v “colour”, or “center” v “centre”.

You have to also think about how to extract feature-values from the data; do not choose features which are difficult to extract or compute (for example, features based on differences in grammar or meaning; these are hard to extract computationally).

Some possible expert sources are suggested in Wikipedia; also try appropriate Google searches, and/or Leeds Uni Library.

You can try Rayson’s Log-likelihood calculator or Sharoff’s compare-fq-lists.pl to compare two corpus-derived word-frequency lists and highlight words which are noticeably more frequent in one or other corpus.

compare-fq-lists.pl

cslin-gps% perl compare-fq-lists.pl

A tool for comparing frequency lists from Fname2 against Fname1 as the reference corpus

Usage compare-fq-lists.pl -1 Fname -2 Fname2 [-f N] [-l] [-o] [-s] [-t N]

The format of frequency lists is either frq word OR rank frq word

-f N --functionwords Do not count the top N most frequent words (50 is a good approximation)

-i --ignorezero Do not count words not occurring in the reference corpus

-l --latex Output a Latex table

-o --odds Compute log-odds instead of loglikelihood (used by Marco Baroni)

-s --simplified Only words and scores are output

-t N --threshold No more than N words

cslin-gps% perl compare-fq-lists.pl -1 /home/www/eric/db32/uk/ukw -2 /home/www/eric/db32/us/usw | more

/home/www/eric/db32/uk/ukw: 1765060

/home/www/eric/db32/us/usw: 1745843

Comparing two corpora /home/www/eric/db32/uk/ukw (1765060) vs /home/www/eric/db32/us/usw (1745843)

Word Frq1 Frq2 LL-score

Ã 490.00 7767.00 7751.18

CALL 7.00 1456.00 1940.10

NUMBER 10.00 1438.00 1888.85

shall 290.00 2317.00 1803.39

State 157.00 1943.00 1801.43

or 8300.00 13941.00 1499.97

state 302.00 1922.00 1324.76

Texas 23.00 1068.00 1290.99

Contd...

Texas 23.00 1068.00 1290.99

Mr. 31.00 1088.00 1269.74

program 157.00 1164.00 872.88

PN1997 0.00 581.00 805.44

Posted 184.00 1165.00 800.53

Board 363.00 1516.00 767.93

County 219.00 1164.00 714.21

percent 38.00 672.00 689.69

Video 70.00 772.00 687.77

license 17.00 536.00 615.79

Department 421.00 1438.00 595.40

Program 18.00 523.00 593.15

States 105.00 775.00 579.83

school 532.00 1610.00 576.27

vehicle 50.00 601.00 551.82

Ms. 6.00 433.00 545.62

ORS 0.00 383.00 530.95

Bush 39.00 545.00 524.88

City 281.00 1099.00 523.75

property 243.00 1019.00 518.06

PM 140.00 800.00 515.61

court 74.00 636.00 512.14

ÃÃÂ§ 8.00 416.00 508.99

Center 59.00 588.00 504.09

CA 16.00 442.00 497.08

Home 238.00 952.00 463.50

Commission 88.00 624.00 457.11

department 84.00 603.00 444.63

Oregon 4.00 347.00 443.17

cslin-gps% perl compare-fq-lists.pl -1 /home/www/eric/db32/us/usw-2 /home/www/eric/db32/uk/ukw | more

/home/www/eric/db32/us/usw: 1745843

/home/www/eric/db32/uk/ukw: 1765060

Comparing two corpora /home/www/eric/db32/us/usw (1745843) vs /home/www/eric/db32/uk/ukw (1765060)

Word Frq1 Frq2 LL-score

UK 39.00 1542.00 1823.35

BBC 16.00 1349.00 1716.88

London 40.00 1176.00 1331.52

programme 2.00 601.00 808.89

Wales 6.00 586.00 753.13

aircraft 18.00 641.00 747.30

Experience 45.00 722.00 718.64

Scotland 20.00 575.00 648.63

Ireland 13.00 541.00 643.88

British 97.00 809.00 636.02

aerospace 4.00 478.00 621.53

Contd...aerospace 4.00 478.00 621.53

it 6977.00 10254.00 595.13

programmes 0.00 402.00 557.29

i 318.00 1217.00 555.58

album 10.00 461.00 555.42

University 579.00 1605.00 493.47

you 6734.00 9646.00 491.91

aero 0.00 350.00 485.20

aviation 3.00 372.00 484.64

Centre 7.00 389.00 478.09

Tel 4.00 347.00 442.51

Scottish 11.00 367.00 423.84

Subjects 3.00 326.00 421.67

Cambridge 37.00 451.00 412.98

research 331.00 1074.00 408.20

DeweyClass 0.00 272.00 377.07

Author 4.00 296.00 373.08

scheme 18.00 352.00 368.07

details 115.00 610.00 367.97

Manchester 5.00 296.00 366.01

colour 6.00 300.00 364.72

Britain 44.00 417.00 347.09

organisation 3.00 267.00 341.08

IRA 10.00 298.00 338.16

Compare-fq-lists.pl results

Many terms are local places, names

US: Texas, state, Bush; UK: Wales, Manchester, IRA

– not really “linguistic” preferences?

Many terms are “noise” due to unrepresentative sample, skewed to a few topics/domains

US: County, Board, vehicle; UK: University, aviation

A few terms are recognisably “linguistic” differences

US: program, center; UK: programme, centre, colour

5) Start your report: intro, methods

“Arabic and Arab English in the Arab World” (Atwell et al 2009) provides a model for what should appear in your report; plan your own report IN YOUR OWN WORDS to show you understand what you are doing, and replace Arab English results with your own results.

Start by drafting the Introduction, to explain why the question and approach are relevant to language researchers in this region, and your initial assumption about the type of English preferred in the countries you have chosen to investigate (e.g. you might assume that countries in the Commonwealth, mainly ex-British colonies, might prefer British English).

Next, expand on the Methods, explaining your approach to finding empirical evidence, and your choice of features for text-mining.

6) Extract feature values

For each feature you have chosen to use, you need to extract feature-values for each National English.

For example, if you decide that relative frequency of “color” v “colour” is a feature you want to include in your model, you will need to extract values of this feature from each National English sample;

To help you, word-frequency lists are available on the website for each national text sample.

Create data-files formatted for data-mining (eg WEKA .arff files) encoding these features:

for the Gold Standard UK and US samples on the website,

and for your national domain web-as-corpus samples

http://www.comp.leeds.ac.uk/eric/wwe.shtml -> ukus.arff

@relation ukus

@attribute center numeric

@attribute centre numeric

@attribute centerpercent numeric

@attribute color numeric

@attribute colour numeric

@attribute colorpercent numeric

@attribute english {UK,US}

@data

1,32,3, 0,20,0, UK

0,25,0, 0,12,0, UK

9,27,25, 0,84,0, UK

0,19,0, 0,24,0, UK

0,16,0, 0,14,0, UK

0,16,0, 0,12,0, UK

0,21,0, 0,38,0, UK

0,25,0, 0,34,0, UK

2,26,7, 2,3,40, UK

2,32,5, 1,59,2, UK

31,0,100, 55,0,100, US

61,0,100, 26,0,100, US

24,0,100, 11,0,100, US

12,1,92, 21,4,84, US

8,0,100, 4,2,67, US

10,0,100, 8,0,100, US

19,0,100, 22,0,100, US

14,0,100, 7,0,100, US

14,0,100, 6,0,100, US

8,5,62, 24,0,100, US

http://www.comp.leeds.ac.uk/eric/wwe.shtml -> test.arff

@relation test

@attribute center numeric

@attribute centre numeric

@attribute centerpercent numeric

@attribute color numeric

@attribute colour numeric

@attribute colorpercent numeric

@attribute english {UK,US}

@data

10,5,33, 0,20,0, UK

test.arff has data

(values for 7 features)

For ONE other nation

(UK English assumed)

7) Train and test models

Use these data-mining feature files in a DM tool (eg in WEKA, train data-mining models such as ZeroR, OneR, J-rip rule-based classifier, or J48 decision-tree) with the UK and US datasets, to produce a model which predicts whether a new dataset is UK or US English

then test this with your selected national domains, to find how they are classified by the model.

Note that if you use WEKA, the arff files must include UK or US as the final feature; this should be your initial assumption of the preference for each country.

If the model classifies the country “correctly” this means WEKA agrees with your assumption; but if the model is “wrong”, this actually signifies that your assumption was wrong.

Make sure you save lots of “snapshots” of evidence from your data-mining experiments, to paste into your video.

8) Localized English words

EXTRA WORK: combine your 3-6 national corpora into a single corpus representing English for the whole region – like the .ARAB Arab English corpus in (Atwell et al 2009). Derive a word-frequency list for your regional English corpus, and use Sharoff’s compare-fq-lists.pl to compare the regional word-frequency list with the UK and/or US word-frequency list. Highlight words which are noticeably more frequent in one or other corpus; and try to think of possible explanations for the differences found.

9) Add to Report: results, conclusions

Extend the research report plan to include

Results (evidence, including summary of the evidence showing Data-Mining output)

and Conclusions (summarise the evidence showing which variety of English is dominant in each national domain, and in the region overall; discuss noticeable differences comparing regional English with UK and/or US English; and discuss any limitations of your approach).

(you may also want to “tidy up” intro and methods)

Submit Research Video

upload the movie to Youtube; then EMAIL [email protected] with URL of your movie, AND state whether you agree to let me add this URL to the course website, to let other students see it.

- if you want to keep your video private, let me know

- please do NOT submit your video report direct to a Journal! (but AFTER it is marked, if it’s really good I may suggest submitting a paper to the journal...)

References

Atwell, Eric; Al-Sulaiti, Latifa; Sharoff, Serge. 2009. Arabic and Arab English in the Arab World. Proceedings of CL’2009 International Conference on Corpus Linguistics.

Atwell, Eric; Arshad, Junaid; Lai, Chien-Ming; Nim, Lan; Rezapour Ashregi, Noushin; Wang, Josiah; Washtell, Justin. 2007. Which English dominates the World Wide Web, British or American? Proceedings of Corpus Linguistics 2007.

Baroni, Marco; and Bernardini, Silvia. 2004. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC’2004

LREC: Language Resources and Evaluation Conferences website http://www.lrec-conf.org/

World Wide English Corpus collected by past students http://www.comp.leeds.ac.uk/eric/wwe.shtml

Sharoff, Serge. 2008. Tools for Processing Corpora

(including compare-fq-lists.pl ) http://corpus.leeds.ac.uk/tools/

Summary: practical experience of …

This assignment involves NLP, text analytics and text mining from data sourced from the World Wide Web.

This exercise will give you practical experience of

methods in data mining, knowledge discovery, text analytics, the CRISP-DM data-mining methodology,

and explaining technologies for KM to professional “clients” or “customers” – you are the KM Consultant!

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Coursework exercise:...

Documents

Transcript of School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Coursework exercise:...