Xavier Blanco


MULTI-LEXEMIC UNITS: AN OVERVIEW


LET ME INTRODUCE MYSELF!

Xavier Blanco - [email protected]

Autonomous University of Barcelona

Laboratory of Phonetics, Lexicology and Semantics

fLexSem

My research areas: Lexicography, Machine Translation

My theoretical background: Lexicon-Grammar, Meaning-Text Theory

MY SPEECH AT A GLANCE

WHY MULTI-LEXEMIC UNITS?

4 TYPES OF MULTI-LEXEMIC UNITS

CONSEQUENCES FOR TRANSLATION

Why are multi-lexemic units so important for MT?

All Machine Translation Software needs dictionaries (i.e. complete linguistic descriptions of its working languages, formalized with identical procedures and criteria and linked by means of translation equivalence relations).

A natural question concerns the nature of the macrostructure elements (lemmas) in these dictionaries, i.e. the lexical units.

Lexicon items are not always just words but very often sequences of words. Many technical terms have been coined to refer to these complex lexical items: compounds, collocations, idioms, frozen expressions...

Why are multi-lexemic units so important for MT?

In fact, multi-lexemic units are important because they are linguistic signs, and linguistic signs are the natural unit of treatment for dictionaries.

What is the main problem?

The main problem is that multi-lexemic units are, or seem, not as easy to identify and classify as mono-lexemic units.

N.B.: This is true only of the segmentation (tokenization) question, not of the polysemy question.

WHAT IS A MULTI-LEXEMIC UNIT?

A multi-lexemic unit is a sequence of word-forms P whose meaning cannot be built (by the general rules of the language L) from the meanings of the constituent lexemes of P, their semantically loaded morphological means (if any) and their combinatorial properties.

FIRST TYPE OF MULTI-LEXEMIC UNIT: “COMPOUND” UNITS

In most electronic dictionaries lexical units are typically classified as nouns, verbs, adjectives and adverbs.

In addition to simple forms, the languages we work with (Romance languages) have, for each of these categories, multi-lexemic items (i.e. sequences of word-forms separated by a blank or a hyphen).

Let’s begin with compound nouns...

“COMPOUND NOUNS”

Examples of compound nouns are:

hard drug

high school

black hole

“COMPOUND NOUNS”

A large number of compound nouns:

• have simple variants or synonyms (demócrata cristiano, demócrata-cristiano, demócratacristiano);

• can (and sometimes must) be translated into simple nouns (high school = lycée, instituto // escalera mecánica, escalier roulant = escalator);

• have acronym variants (acquired immune deficiency syndrome = AIDS).

“COMPOUND NOUNS”

Systematic descriptions of compound nouns have been proposed, ranging from just a few types to over 700. For most applications, it is probably enough to take into account only a dozen or so cases, such as:

N-Adj, Adj-N (Christian name // nombre de pila)

N-Prep-N (quality of life // calidad de vida)

N-N (family doctor // médico de cabecera)

Prep-N (under age // menor de edad)

V-N (washing machine // lavadora)
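
A minimal sketch of how such category patterns could be used to propose candidate compounds in a part-of-speech-tagged text. The tag names, the `COMPOUND_PATTERNS` table and the function `candidate_compounds` are assumptions made for illustration; the input is assumed to be already tagged.

```python
# Pattern inventory read off the list above (tag names are an assumed convention).
COMPOUND_PATTERNS = {
    ("N", "Adj"),
    ("Adj", "N"),
    ("N", "Prep", "N"),
    ("N", "N"),
    ("Prep", "N"),
    ("V", "N"),
}


def candidate_compounds(tagged):
    """Return (start, end, words) for every span whose tag sequence matches a pattern."""
    found = []
    for start in range(len(tagged)):
        for pattern in COMPOUND_PATTERNS:
            end = start + len(pattern)
            window = tagged[start:end]
            tags = tuple(tag for _, tag in window)
            if tags == pattern:
                words = " ".join(word for word, _ in window)
                found.append((start, end, words))
    return sorted(found)


sentence = [("calidad", "N"), ("de", "Prep"), ("vida", "N")]
print(candidate_compounds(sentence))
```

On this input the sketch proposes both "calidad de vida" and the spurious "de vida": patterns only yield candidates, which still have to be checked against a dictionary.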

“COMPOUND NOUNS”

It is often necessary to treat compound nouns as having a recursive structure:

auditory canal = Adj-N => N

external auditory canal = Adj-(Adj-N) => Adj-N => N
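
The recursive reduction above can be sketched as a small rewriting procedure. The `RULES` table and the function `reduce_to_noun` are hypothetical; only the Adj-N rule is taken from the slide, and the N-Prep-N rule is added from the earlier pattern list as an assumption.

```python
# Rewrite rules: a sequence of categories that reduces to a single N.
RULES = {
    ("Adj", "N"): "N",
    ("N", "Prep", "N"): "N",
}


def reduce_to_noun(tags):
    """Apply the rules from the inside out and return every intermediate step."""
    current = list(tags)
    steps = [list(current)]
    changed = True
    while changed:
        changed = False
        # Scan right to left so that Adj-(Adj-N) is bracketed from the innermost pair.
        for start in range(len(current) - 1, -1, -1):
            for pattern, result in RULES.items():
                end = start + len(pattern)
                if tuple(current[start:end]) == pattern:
                    current = current[:start] + [result] + current[end:]
                    steps.append(list(current))
                    changed = True
                    break
            if changed:
                break
    return steps


# external auditory canal
for step in reduce_to_noun(["Adj", "Adj", "N"]):
    print("-".join(step))
```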

“COMPOUND NOUNS”

VERY IMPORTANT:

The number of compound nouns has often been underestimated. Typically, even in large traditional dictionaries, only a very small percentage seems to have caught the lexicographers' attention.

We think that, if we consider technical languages, a few million compounds occur in texts. Any serious dictionary project in this area must therefore be very large.

“COMPOUND NOUNS”

When we say that a compound noun is a linguistic unit, we mean that we are obliged to describe its form, its meaning and its combinatorial properties INDEPENDENTLY of whatever properties its constituent forms may have.

A regular construction like a black sweater is reducible to a predication like black(sweater).

But compounds like a black hole or a black box are not.

“COMPOUND NOUNS”

Compound nouns can be immune to a number of syntactic modifications that similar but regular constructions can undergo:

• MODIFICATION: *a very black hole

• PREDICATIVITY: *this hole is black

• SYNONYMY: a black hole vs a black orifice, a black opening

• COORDINATION: *a black and deep hole

• DELETION: ...a black hole. This hole...

• NOMINALIZATION: *the blackness of the hole

“COMPOUND NOUNS”

Clearly, these tests have varying degrees of precision, depending on the semantic opaqueness of the compound.

It is nevertheless obvious that we need to list them in a dictionary.

Not only high school, but also driving school or private school need to be associated with a specific meaning description. How else would we come to know that a driving school is not a school in which one takes courses while sitting in a car? Or that a private school is not a school for soldiers of that rank?
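
A minimal sketch of what listing compounds in a dictionary means for segmentation: a longest-match lookup that returns a listed compound as one unit. The `COMPOUND_LEXICON` entries and their glosses are toy paraphrases written for this example, not entries from an actual dictionary, and `segment` is a hypothetical helper.

```python
# Toy lexicon: compound (as a tuple of word-forms) -> illustrative gloss.
COMPOUND_LEXICON = {
    ("high", "school"): "secondary-education institution",
    ("driving", "school"): "school where one learns to drive",
    ("private", "school"): "school not run by the state",
    ("black", "hole"): "astronomical object",
}
MAX_LEN = max(len(key) for key in COMPOUND_LEXICON)


def segment(tokens):
    """Group tokens into lexical units, preferring the longest listed compound."""
    units = []
    i = 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in COMPOUND_LEXICON:
                units.append(" ".join(tokens[i:i + length]))
                i += length
                break
        else:
            # No listed compound starts here: keep the single word-form.
            units.append(tokens[i])
            i += 1
    return units


print(segment("She attends a driving school".split()))
```

Only sequences present in the lexicon are grouped; everything else stays a simple form, which is why the coverage of the lexicon matters so much.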

“COMPOUND VERBS”

He flogs a dead horse.

I gave him a taste of his own medicine

I put my shoulder to the wheel

“COMPOUND VERBS”

Compound verbs are typically verbs with frozen arguments (1, 2 or 3).

More often than not their degree of semantic opacity is much higher than in the case of compound nouns.

The bad news: They are more difficult to extract from corpora (e.g. insertions, inflectional patterns...).

The good news: the number of elements in this class is considerably smaller than in the nominal counterpart. It is likely that they can easily be kept far below 100,000.
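
The extraction difficulty mentioned above (inflected verb forms, inserted material) can be illustrated with a small matcher. This is only a sketch: the `LEMMAS` table stands in for a real morphological dictionary, and `max_gap` (the number of inserted tokens tolerated) is an assumed parameter.

```python
# Hypothetical inflected form -> lemma table; a real system would use a full lexicon.
LEMMAS = {"flogs": "flog", "flogged": "flog", "flogging": "flog"}


def matches_compound_verb(tokens, verb, frozen, max_gap=2):
    """True if a form of `verb` is followed, after at most `max_gap` inserted
    tokens, by the frozen word sequence `frozen`."""
    words = [t.lower() for t in tokens]
    frozen = list(frozen)
    for i, word in enumerate(words):
        if LEMMAS.get(word, word) != verb:
            continue
        for gap in range(max_gap + 1):
            start = i + 1 + gap
            if words[start:start + len(frozen)] == frozen:
                return True
    return False


frozen_part = ("a", "dead", "horse")
print(matches_compound_verb("He flogs a dead horse".split(), "flog", frozen_part))
print(matches_compound_verb("He flogged yet again a dead horse".split(), "flog", frozen_part))
```

A plain string search for "flogs a dead horse" would miss the second sentence, which is the sense in which compound verbs are harder to extract than compound nouns.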

“COMPOUND ADVERBS”

on foot

at your own risk

in cold blood

in spite of N

Their number is probably below 10,000, apart from a few productive compound adverb schemas such as from (9 a.m.) till (5 p.m.).

Compound adjectives: out of order, de moda (fashionable)...

Compound determiners: a lot of, a flurry of criticism...

SECOND TYPE OF MULTI-LEXEMIC UNIT: “COLLOCATIONS”

Compounds need to be regarded as units in connection with almost every linguistic operation. They are macrostructural elements of the dictionaries and are typically translated as a whole, without any attempt to maintain either the internal structure or the meaning of their particular parts.

By contrast, collocations involve two linguistic signs: the base of the collocation and the value of the collocation. We are going to discuss only two classes of collocations.

COLLOCATIONS: FROZEN MODIFIERS

to condemn strongly, to endorse heartily, to laugh heartily, to laugh one’s head off...

easy as pie, as 1-2-3...

(smb) thin as a rake...

it rains cats and dogs...

heavy smoker, ______ liar

admirer profondément; aimer passionnément; remercier chaleureusement; surveiller étroitement...

COLLOCATIONS: FROZEN MODIFIERS

aid = valuable

behaviour = excellent

cut = neatly, cleanly

advice = sound

proposal = tempting

struggle = heroic

analysis = fruitful

COLLOCATIONS: FROZEN MODIFIERS

Translation is a good indicator of the frozen status of these constructions. The translation of these expressions must proceed by first identifying the type of modification and then reconstructing that modification in the target language on the basis of the translation of the base term:

miedo cerval -> INT(miedo) = INT(fear) -> mortal fear (*deer fear)

peur bleue -> INT(peur) = INT(fear) -> mortal fear (*blue fear)
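
The two-step procedure just described can be sketched as follows. The three tables hold only the examples from this slide; their layout and the function name `translate_collocation` are assumptions made for illustration.

```python
# Step 0: ordinary translation of the base term.
BASE_TRANSLATION = {("es", "miedo"): "fear", ("fr", "peur"): "fear"}

# Step 1: which type of modification does the source modifier express for this base?
MODIFICATION_TYPE = {
    ("es", "miedo", "cerval"): "INT",
    ("fr", "peur", "bleue"): "INT",
}

# Step 2: how is that modification realized for the target base (English here)?
ENGLISH_VALUE = {("INT", "fear"): "mortal"}


def translate_collocation(lang, base, modifier):
    """Translate a base + frozen modifier pair into English via the modification type."""
    modification = MODIFICATION_TYPE[(lang, base, modifier)]
    target_base = BASE_TRANSLATION[(lang, base)]
    target_modifier = ENGLISH_VALUE[(modification, target_base)]
    return f"{target_modifier} {target_base}"


print(translate_collocation("es", "miedo", "cerval"))
print(translate_collocation("fr", "peur", "bleue"))
```

The modifier itself (cerval, bleue) is never translated word for word; only the base and the modification type cross the language boundary.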

COLLOCATIONS: FROZEN MODIFIERS

There seems to be a restricted number of meanings that are likely to function as values of collocations. Examples of such semantic values are intensity, anti-intensity, praise and anti-praise.

Collocations are not very difficult for a human being to understand, but they are difficult for a non-native speaker to produce.

COLLOCATIONS: FROZEN MODIFIERS

These modifiers need to be coded for each lexical unit separately. Not every lexical unit will have instantiations for every semantic value of a possible frozen modifier and some lexical units will have more than one modifier for a given semantic value.

These frozen modifiers range from highly idiosyncratic to almost regular, but they always need to be explicitly coded.

COLLOCATIONS: SUPPORT VERBS

The main predicate of a sentence can be realized not just by verbs, but also by nouns, adjectives and prepositions.

In the latter cases, an additional lexical element, called a support verb, is usually associated with the real semantic predicate to form the predicational basis of the simple sentence.

Particularly for nouns, these support verbs cannot always be predicted just from the nature of the main predicate.

COLLOCATIONS: SUPPORT VERBS

to play a role

to give advice

to take a look at

to do someone a favor

to put a question

“The man who makes no mistakes does not usually make anything”
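
Since the support verb cannot be predicted from the noun, it has to be stored with it. A minimal sketch of such coding, using only nouns from the examples above; the table `SUPPORT_VERB` and the function `verbal_expression` are hypothetical, and the sketch ignores the rest of the construction (determiner choice, prepositions, extra arguments).

```python
# Predicative noun -> support verb, coded noun by noun.
SUPPORT_VERB = {
    "role": "play",
    "look": "take",
    "question": "put",
    "mistake": "make",
}


def verbal_expression(noun):
    """Build 'to V a N' from the coded entry.

    Raises KeyError for a noun that has not been coded: there is deliberately
    no default verb, because the choice is lexical.
    """
    verb = SUPPORT_VERB[noun]
    return f"to {verb} a {noun}"


print(verbal_expression("role"))
print(verbal_expression("mistake"))
```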

COLLOCATIONS: SUPPORT VERBS

the war broke out

I keep my calm

to reduce to despair

to raise hope in

to draw smb's attention to

COLLOCATIONS: SUPPORT VERBS

to fulfil a promise

to answer a question

to follow advice

his dream came true

COLLOCATIONS: SUPPORT VERBS

The artillery _________ a heavy bombardment over the town.

The artillery __________ the town to a heavy bombardment.

The town _________ a heavy bombardment (of the artillery).

A heavy bombardment (of the artillery) ________ over the town.

THIRD TYPE OF MULTI-LEXEMIC UNIT: “FROZEN SENTENCES”

Proverbs: A bird in the hand is worth two in the bush; A rolling stone gathers no moss.

Pragmatemes: Staff only, Can I help you?

N.B.: Frozen sentences often undergo variations, which can involve creative mechanisms for defreezing the ordinarily accepted patterns.

FOURTH TYPE OF MULTI-LEXEMIC UNIT: “GRAMMATICAL UNITS”

Empirical Grammatical Expressions: has been, could have been, may have been / either... or / if... then...

Theoretical Grammatical Expressions:

<Adj_colour> <clothes>... but blue jeans!

<Noun_animate> drink <beverages>

Here I should present the calculus of Grammatical Meanings, but...

zzz

PISS: PowerPoint-Induced Sleep Syndrome

FINAL OVERVIEW

Let a linguistic sign be an ordered triple

A = <‘A’, /A/, ∑A> where:

• ‘A’ is the signified of A

• /A/ is the signifier of A

• ∑A is the set of combinatorial properties of A

Basic types of linguistic signs are: morphs, modifications, conversions, supramorphs, word-forms, phrasemes and syntagms.

FINAL OVERVIEW

Free sequences: AB = <‘AB’; /A B/; ∑A ∪ ∑B>

Full idioms: AB = <‘C’; /A B/> | ‘A’ ⊄ ‘C’ & ‘B’ ⊄ ‘C’

Quasi-idioms: AB = <‘A B C’; /A B/> | ‘C’ ≠ ‘A’ & ‘C’ ≠ ‘B’

Semi-idioms: AB = <‘A C’; /A B/>

The signified of the semi-idiom includes, intact, the signified of one of its two constituents. A is chosen by the speaker strictly because of its signified. But B is used to express ‘C’ contingent on A; otherwise, B would not be the signifier of ‘C’.
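
The four cases above differ only in how much of the constituents' meanings survives in the meaning of the whole. A rough sketch of that comparison, under a strong simplifying assumption: each meaning is modelled as a set of semantic components. The component labels in the examples are invented for illustration, and `classify` is a hypothetical helper.

```python
def classify(meaning_a, meaning_b, meaning_phrase):
    """Classify a two-constituent sequence by which constituent meanings
    are included in the meaning of the whole."""
    has_a = meaning_a <= meaning_phrase
    has_b = meaning_b <= meaning_phrase
    extra = meaning_phrase - (meaning_a | meaning_b)
    if has_a and has_b:
        # Both meanings present: regular sum, or sum plus an additional 'C'.
        return "quasi-idiom" if extra else "free sequence"
    if has_a or has_b:
        # Exactly one constituent keeps its meaning intact.
        return "semi-idiom"
    return "full idiom"


# Toy meanings; the labels are illustrative, not a semantic analysis.
print(classify({"black"}, {"sweater"}, {"black", "sweater"}))
print(classify({"heavy"}, {"smoker"}, {"smoker", "intense"}))
print(classify({"black"}, {"box"}, {"flight", "recorder"}))
```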

CONCLUSIONS

• Tests with large collections of texts at the LADL and the CIS have shown that at least one-third of any natural language corpus must be analyzed in terms of multi-lexemic units.

• The characteristics of multi-lexemic units are such that there is no alternative to the lexicographic solution.

• The availability of large scale multi-lexemic dictionaries will significantly improve the quality of machine translation systems.