Forecasting the beginnings of newspaper texts: Some corpus & experimental findings

Michael Hoey, Matthew Brook O’Donnell, Michaela Mahlberg and Mike Scott

BAAL 11-13 September 2008, Swansea University

The Lexical Priming claim

Whenever we encounter a word (or syllable or combination of words), we note subconsciously

the words it occurs with (its collocations),

the grammatical patterns it occurs in (its colligations),

the meanings with which it is associated (its semantic associations),

the pragmatics it is associated with (its pragmatic associations),

"word" collocates with "against" and "a"

"a word against" has a semantic association with sending & receiving communication

(e.g. hear a word against)

send/receive "a word against" has a pragmatic association with denial

(e.g. wouldn’t hear a word against)

denial + send/receive "a word against" has a pragmatic association with hypotheticality

(e.g. wasn’t prepared to say a word against)

The Lexical Priming claim

Whenever we encounter a word (or syllable or combination of words), we also note subconsciously

the grammatical patterns it is associated with (its colligations),

the genre and/or style and/or social situation it is used in,

whether it is used in a context we are likely to want to emulate or not

denial + send/receive "a word against" colligates with modal verbs

denial + send/receive "a word against" also colligates with human subjects and human prepositional objects

The Lexical Priming claim

All the features we notice prime us so that when we come to use the word ourselves, we are likely (in speech, particularly) to use it in the same lexical context, with the same grammar, in the same semantic context, as part of the same genre/style, in the same kind of social and physical context, with a similar pragmatics and in similar textual ways.


Our ability to do this is what it means to know a word.

We are ALL learners, since we never stop being primed.

The only difference between the native speaker and the non-native speaker is the way that they are typically primed.

Creativity is the result of overriding some of one’s primings.

A footnote

Whenever we encounter a word (or syllable or combination of words), we note subconsciously …


the grammatical patterns it occurs in (its colligations),



Whenever we encounter a word (or syllable or combination of words), we also note subconsciously

the positions in a text that it occurs in, e.g. does it like to begin sentences? Does it like to start paragraphs? (its textual colligations),

the genre and/or style and/or social situation it is used in

Research Question

• Do certain words and groups of words exhibit preferences for particular textual positions, such as the beginnings of texts and paragraphs? (Once upon a time is the canonical example)

• If they do, how can these items be discovered in a corpus?

AHRC Textual Priming Project

Using a corpus of Home News articles from the Guardian/Observer newspaper 1998-2004
◦ Approx. 54 million words
◦ 113,288 articles

Each sentence in the body of each article is classified according to its position
◦ TISC – first sentence of first paragraph (Text-Initial Sentence Corpus)
◦ PISC – first sentence of any subsequent paragraph (Paragraph-Initial Sentence Corpus)
◦ NISC – any non-initial sentence (Non-Initial Sentence Corpus)

Thanks to AHRC

More wet weather was predicted across Britain today as experts warned many areas were already saturated with rain.

…

On Wednesday and Thursday a brief respite should see most of the country becoming fine, with heavy rain only expected across parts of Northern Ireland. But by Friday, much of England and Wales will again be hit by storms and further downpours.

…

So far, Britain's recent storms have already claimed the lives of six people. Yesterday, insurers said the cost of the cleanup could run into tens of millions of pounds.

Method: Sentence classification

In the example article above, each sentence is labelled by position: one TISC sentence (the first sentence of the first paragraph), two PISC sentences (the first sentence of each later paragraph) and two NISC sentences (the remaining, non-initial sentences).
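The classification rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, an article is assumed to arrive as a list of paragraph strings, and the regex sentence splitter is a naive stand-in for whatever segmentation the project actually used.

```python
import re

def split_sentences(paragraph):
    # Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def classify_sentences(paragraphs):
    """Label every sentence of an article as TISC, PISC or NISC."""
    paragraphs = [p for p in paragraphs if p.strip()]  # ignore empty paragraphs
    labelled = []
    for p_idx, paragraph in enumerate(paragraphs):
        for s_idx, sentence in enumerate(split_sentences(paragraph)):
            if s_idx > 0:
                label = "NISC"   # any non-initial sentence
            elif p_idx == 0:
                label = "TISC"   # first sentence of first paragraph
            else:
                label = "PISC"   # first sentence of a later paragraph
            labelled.append((label, sentence))
    return labelled

# The three paragraphs of the example article, elisions left out.
article = [
    "More wet weather was predicted across Britain today as experts warned "
    "many areas were already saturated with rain.",
    "On Wednesday and Thursday a brief respite should see most of the country "
    "becoming fine, with heavy rain only expected across parts of Northern "
    "Ireland. But by Friday, much of England and Wales will again be hit by "
    "storms and further downpours.",
    "So far, Britain's recent storms have already claimed the lives of six "
    "people. Yesterday, insurers said the cost of the cleanup could run into "
    "tens of millions of pounds.",
]
labels = [label for label, _ in classify_sentences(article)]
```

Collecting the sentences by label across all articles would then give the three positional subcorpora.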

Guardian Home News 1998-2004

Summary of positional subcorpora

                                      TISC        PISC        NISC
tokens                           3,122,037  12,521,902  19,338,590
types                               58,432     127,038     141,793
tokens per type (given as TTR)       53.43       98.57      136.39
sentences                          113,288     607,125   1,064,493
mean sentence length (in words)         28          21          18
std. dev.                            11.11        9.68        9.88
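The summary figures are simple to recompute for any subcorpus. The sketch below assumes lowercased whitespace tokenisation and a population standard deviation, neither of which is stated in the slides, and `summarise` is a hypothetical helper name. The last lines check that the ratio row in the table equals tokens divided by types, and that the mean sentence length equals tokens divided by sentences.

```python
from statistics import mean, pstdev

def summarise(sentences):
    """Summary figures for a list of sentence strings."""
    tokenised = [s.lower().split() for s in sentences]  # assumption: whitespace tokens
    tokens = [t for sent in tokenised for t in sent]
    lengths = [len(sent) for sent in tokenised]
    n_types = len(set(tokens))
    return {
        "tokens": len(tokens),
        "types": n_types,
        "tokens_per_type": len(tokens) / n_types,
        "sentences": len(sentences),
        "mean_length": mean(lengths),
        "std_dev": pstdev(lengths),  # assumption: population std. dev.
    }

# Consistency check against the TISC column of the table.
tisc_tokens, tisc_types, tisc_sentences = 3_122_037, 58_432, 113_288
tisc_ratio = round(tisc_tokens / tisc_types, 2)   # 53.43, as in the table
tisc_mean = round(tisc_tokens / tisc_sentences)   # 28, as in the table
```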

Method: Intra-textual Key Word Analysis

• Compare the frequency of words and clusters in one section of text with their frequency in another

• For example, "fresh" occurs significantly more frequently in text-initial sentences (TISC) than in non-initial sentences (NISC)

• "fresh" is a text-initial key word

• It also exhibits distinctive patterns in TISC contexts in terms of collocates:
• fresh {row, controversy, embarrassment}
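The slides do not say which statistic decides that a word is key. As an assumption, the sketch below uses the two-term log-likelihood measure that is common in corpus keyword work, with 15.13 as the cut-off (the chi-square critical value for p < 0.0001 at one degree of freedom). The function names are hypothetical, and the counts for "fresh" are invented purely to show the call; only the token totals come from the subcorpus summary.

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-term log-likelihood (G2) for one item in two corpora."""
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def is_key(count_a, total_a, count_b, total_b, threshold=15.13):
    """True if the item is over-represented in corpus A and passes the threshold."""
    over_represented = count_a / total_a > count_b / total_b
    return over_represented and log_likelihood(
        count_a, total_a, count_b, total_b) >= threshold

TISC_TOKENS, NISC_TOKENS = 3_122_037, 19_338_590  # from the subcorpus summary
# Hypothetical counts for "fresh" (NOT from the study): 600 in TISC, 900 in NISC.
fresh_is_text_initial_key = is_key(600, TISC_TOKENS, 900, NISC_TOKENS)
```

Running the same test in both directions for each pair of subcorpora gives positive key items for each side of the comparison.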

Method: Comparative KW lists

Take the pair-wise comparisons for TISC, PISC and NISC and create Key Word and Key Cluster lists:

TISC_NISC

TISC_PISC

PISC_NISC

PISC_TISC

NISC_TISC

NISC_PISC

Method: Key Word/Cluster Matrix

Each word/cluster scored according to whether (Y) or not (N) it is found on each of the six lists:

                  TISC_NISC  TISC_PISC  PISC_NISC  PISC_TISC  NISC_TISC  NISC_PISC
yesterday             Y          Y          Y          N          N          N
said                  N          N          Y          Y          Y          N
also                  N          N          N          Y          Y          N
recall                N          N          N          N          Y          N
it was announced      Y          Y          N          N          N          N
revealed that         Y          N          Y          N          N          N
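The matrix rows can be derived mechanically from membership in the six lists. In this sketch the list contents are limited to the six items shown in the matrix, and `pattern_for` is a hypothetical helper name; the column order follows the matrix header.

```python
# Column order as in the matrix header.
LIST_ORDER = ["TISC_NISC", "TISC_PISC", "PISC_NISC",
              "PISC_TISC", "NISC_TISC", "NISC_PISC"]

# Only the six items from the matrix; the real lists are far longer.
key_lists = {
    "TISC_NISC": {"yesterday", "it was announced", "revealed that"},
    "TISC_PISC": {"yesterday", "it was announced"},
    "PISC_NISC": {"yesterday", "said", "revealed that"},
    "PISC_TISC": {"said", "also"},
    "NISC_TISC": {"said", "also", "recall"},
    "NISC_PISC": set(),
}

def pattern_for(item, lists):
    """Six-letter Y/N pattern recording which key lists contain the item."""
    return "".join("Y" if item in lists[name] else "N" for name in LIST_ORDER)

patterns = {item: pattern_for(item, key_lists)
            for item in ["yesterday", "said", "also", "recall",
                         "it was announced", "revealed that"]}
```

Each item's pattern string is then the key for grouping items into positional categories.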

Categories from patterns

From our corpus there are 18 resulting patterns, covering:
◦ 4,467 words
◦ 50,861 clusters

Here we focus on four patterns:

1. Text-Initial (YYNNNN & YNNNNN)

2. Paragraph-Initial (NNYYNN & NNYNNN)

3. TI and PI (YNYNNN)

4. Non-initial (NNNNYY)
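Grouping by pattern amounts to a lookup from the six-letter string to a category name. The sketch below uses only the patterns listed on this slide; the per-category slides each list one further pattern (for example YYNNNY for text-initial), which could be added to the sets. `categorise` is a hypothetical helper name.

```python
CATEGORY_PATTERNS = {
    "Text-Initial": {"YYNNNN", "YNNNNN"},
    "Paragraph-Initial": {"NNYYNN", "NNYNNN"},
    "TI and PI": {"YNYNNN"},
    "Non-initial": {"NNNNYY"},
}

def categorise(pattern):
    """Return the category for a Y/N pattern, or None if it is not one of the four."""
    for category, patterns in CATEGORY_PATTERNS.items():
        if pattern in patterns:
            return category
    return None  # one of the remaining patterns, not covered here

# Patterns of two clusters from the key word/cluster matrix.
announced = categorise("YYNNNN")   # "it was announced"
revealed = categorise("YNYNNN")    # "revealed that"
```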

Category 1: Text-initial (YYNNNN, YNNNNN & YYNNNY)

                     TISC    PISC    NISC
ONE OF BRITAIN’S    132.0    13.4     9.0
A REPORT BY THE      16.0     6.8     3.1
ARE TO BE           271.9    23.8    24.7
THAT COULD          106.7    41.4    61.7
AFTER BEING         334.4    67.8    54.5

normalized to occurrences per million words

• 1,600 (36%) of our key words and 29,303 (58%) of our key clusters belong to this category
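The per-million normalisation used in the category tables is a single division. The slides give only the normalised rates, so the raw counts below are back-calculated approximations rather than reported figures, and the function names are hypothetical. Token totals are those given for the three positional subcorpora.

```python
TOKENS = {"TISC": 3_122_037, "PISC": 12_521_902, "NISC": 19_338_590}

def per_million(count, corpus_tokens):
    """Occurrences per million words."""
    return count / corpus_tokens * 1_000_000

def approx_raw_count(rate_per_million, corpus_tokens):
    """Invert the normalisation to estimate the underlying raw count."""
    return round(rate_per_million * corpus_tokens / 1_000_000)

# ONE OF BRITAIN'S: 132.0 per million in TISC, 9.0 per million in NISC.
tisc_count = approx_raw_count(132.0, TOKENS["TISC"])  # about 412 occurrences
nisc_count = approx_raw_count(9.0, TOKENS["NISC"])    # about 174 occurrences
```

The comparison shows why normalisation matters: because NISC is roughly six times the size of TISC, raw counts would understate how much more often the cluster appears text-initially.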

Category 2: Paragraph-initial (NNYYNN, NNYNNN & NNNYNN)

                            TISC    PISC    NISC
THE FINDINGS                13.1    43.5    16.7
CAME AS                      5.8    47.9     9.8
IS THE LATEST               10.6    17.2     3.5
GENERAL SECRETARY OF THE    15.4    58.5    11.6
CONFIRMED THAT              43.2    66.5    32.3

• 732 (16%) of our key words and 5,755 (11%) of our key clusters belong to this category


Category 3: Text- & Paragraph-initial (YNYNNN & YNYYNN)

                        TISC    PISC    NISC
THE CONTROVERSY         28.5    17.2     6.6
HEAD OF THE             93.5    89.4    46.8
DECISION TO            151.8   130.8    73.9
SAID YESTERDAY THAT     80.1    67.5    26.6
ISSUED A                51.9    35.9    20.3

• 253 (6%) of our key words and 913 (2%) of our key clusters belong to this category


Category 4: Non-initial (NNNNYY, NNNYYY & NYNNYY)

                TISC     PISC     NISC
HAVE TO        174.2    352.2    616.1
WHILE          530.1    589.3    701.0
BUT           1262.0   4164.3   6068.6
BE ABLE TO      81.0    108.9    165.5
GOING TO        78.2    268.8    494.9

• 486 (11%) of our key words and 3,105 (6%) of our key clusters belong to this category



TISC 79 per million sentences
PISC 5 per million sentences
NISC 22 per million sentences

TISC 117 per 100,000 sentences
PISC 25 per 100,000 sentences
NISC 12 per 100,000 sentences

TISC 48 per thousand sentences
PISC 7 per thousand sentences
NISC 6 per thousand sentences

TISC 17 per thousand sentences
PISC 4 per thousand sentences
NISC 3 per thousand sentences


Theoretical Implications

Confirmation of prediction made by lexical priming theory

Knowing a word includes knowing where it will be used in a text

Clusters are more important than single words in textual positioning (cf. Wray)

Applied Linguistic Implications

Translation

Academic writing

Authentic data

Death (or redefinition) of the topic sentence



Applied Linguistic Implications

Learning a word or phrase includes learning its characteristic textual positioning, or else a learner’s text will read awkwardly

Fabricated texts are unlikely to preserve the natural textual colligations of the language if the intention of these texts is to illustrate other features

Textual colligation is where discourse analysis and dictionaries meet.
