Linguistic Essentials for NLP

Transcript of Linguistic Essentials for NLP

Page 1: Linguistic Essentials for NLP

IDS Lab

Linguistic Essentials for NLP

SNU IDS Lab.

Jamie Seol

Page 2: Linguistic Essentials for NLP

Quiz
• A ( 1 ) is a category of words that have similar syntactic or grammatical behavior.
• ( 2 ) is the study of the regularities and constraints of word order and phrase structure.
• ( 3 ) is the study of the meaning of words, constructions, and utterances. We can divide ( 3 ) into two parts: the study of the meaning of individual words, and the study of how the meanings of individual words combine into the meaning of sentences.
• (bonus point) List the 4 major types of phrases.

Page 3: Linguistic Essentials for NLP

Language (언어)
• The definition of language varies with perspective
  • invariant: it is a system for communication
• There are many, many things we could say about “What is a language?”
• Here, we focus on natural human language as studied in linguistics, which has these properties:
  • productivity
  • syntax
  • recursivity
  • displacement
  • modality independence

Page 4: Linguistic Essentials for NLP

Language - Appendix
• Natural human language is defined as a system for complex communication using signs, gestures, sounds, symbols, etc.
  • the complexity is achieved by double articulation and syntax (note that drawings don’t have a syntax)
• The properties from the previous slide and above may appear individually even in non-human or non-linguistic communication systems like bee signs or baby cries, but natural human language is the only known system that has all of them together
• Natural human language has two major parts: a phonological system and a syntactic system; actually, treating those parts as separate concepts is quite dangerous!
• There are many other properties of language! It’s a very, very sophisticated system we’re talking about

Page 5: Linguistic Essentials for NLP

Language - Appendix
• Examples of various languages
  • formal language: a set of strings of symbols constrained by finite (but possibly recursive) rules, with the potential to construct complete and sound axiomatic systems
  • programming language: an extension of formal language that can define a Turing-complete system
  • baby cries: a typical non-linguistic communication system, which is modality dependent, non-recursive, and lacks displacement
  • bee signs: a typical non-human communication system, modality dependent and non-recursive, but with displacement!
    • surprisingly, bees can precisely tell the location of nectar sources even when they are somewhere out of sight

Page 6: Linguistic Essentials for NLP

Language - Appendix
• Classifying languages is a very, very hard task
• 3 major types of language families
  • dialect continua, isolates, proto-languages
• In terms of morphological structure in typology, there are 4 types:
  • agglutinative: derivation occurs a lot
  • inflectional: inflection occurs a lot
  • isolating: like Chinese; requires positional (word order) information to determine a word’s meaning, and words neither derive nor inflect
  • polysynthetic: long, long words built from concatenated morphemes, each acting almost as a sentence

Page 7: Linguistic Essentials for NLP

Sentence (문장)
• A sentence is a sequence of words that is complete in itself and can make a statement, question, command, etc.
  • a compound of one or more clauses, and complete
• Empirically, we know that (for example, in English) letters → words → phrases → clauses → sentences → paragraphs → documents → … → languages → ?
• But we can’t deal with an infinite concept!
  • For now, we’ll only talk about units up to a sentence
  • semantics and pragmatics can cover cases with multiple sentences

Page 8: Linguistic Essentials for NLP

Clause (절)
• A clause is a sequence of words that has exactly one subject-predicate relationship
  • “because she smiled at her”
    • this is a typical dependent clause
  • if a clause is complete in itself, then it can be a sentence, and we call it an independent clause
    • “놀랍게도 학식이 맛있다” (Korean: “Surprisingly, the cafeteria food is tasty”)
    • “日出” (Chinese: “the sun rises”)
    • “me gustas tú” (Spanish: “I like you”)
• Note that imperatives may omit the subject

Page 9: Linguistic Essentials for NLP

Phrase (구)
• A phrase is a sequence of words that lacks a subject, a predicate, or both
  • “bright sunshine”
  • “닭과 함께” (Korean: “together with a chicken”)
  • so it’s neither a clause nor a sentence, but it can be a component of one!
• 4 major types of phrase
  • noun phrase, prepositional phrase, verb phrase, and adjective phrase
• Then, what is a word?

Page 10: Linguistic Essentials for NLP

Word (단어)
• A word is a single distinct meaningful element
  • this is quite a broad definition; there are many special cases
  • Korean: a 조사 (particle), a typical kind of 의존형태소 (bound morpheme), is also treated as a word (special case!)
  • Japanese: “重い” (“heavy”) is a word, while “重” is not!
    • the standalone version is more like a stem
  • Chinese: ???
  • easy case: in English, Spanish, French, and some similar languages, space-delimited tokens are (almost) precisely words
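The “easy case” can be sketched in a few lines of Python. This is only an illustration: the helper name `simple_tokenize` is made up for this note, and stripping surrounding punctuation is an assumption about what counts as a word. It would not work for Korean, Japanese, or Chinese.

```python
import string

def simple_tokenize(sentence):
    """Split on whitespace and strip surrounding punctuation (hypothetical helper)."""
    tokens = []
    for raw in sentence.split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that consisted only of punctuation
            tokens.append(token)
    return tokens

print(simple_tokenize("Birds can fly, and birds can sing."))
```

The “almost” in the slide matters: contractions, hyphenated compounds, and multiword items like “disk drive” are exactly where this sketch would disagree with a linguist.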

Page 11: Linguistic Essentials for NLP

Word Typology
• by open vs. closed
  • open (lexical categories)
    • like nouns, verbs, and adjectives
    • the dictionary, or lexicon, grows over time
  • closed (functional categories)
    • like determiners and prepositions
    • used for grammatical purposes
• by part of speech
  • …

Page 12: Linguistic Essentials for NLP

Word Typology - Appendix
• if we increase the size of the corpus (n tokens):
  • the number of closed-class word types stays almost constant, except in the very initial part
  • nouns grow linearly
  • verbs and adjectives grow at O(log(log(n))), empirically
    • found by S. Choi and J. Seol
    • more precisely, we found that dw/d(log(n)) = O(1/log(n))
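The two growth statements are consistent with each other, as a short integration shows. This is a sketch under assumptions not on the slide: w(n) is taken to be the number of verb or adjective types seen after n tokens, the bound is treated as an equality with a constant c, and C is an integration constant.

```latex
\[
u = \log n, \qquad
\frac{dw}{du} = \frac{c}{u}
\;\Longrightarrow\;
w = c \log u + C = c \log(\log n) + C = O(\log\log n).
\]
```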

Page 13: Linguistic Essentials for NLP

Part of Speech (POS, 품사)
• A part of speech is a category of words that have similar syntactic or grammatical behavior
  • also typically used as a semantic type
  • these are formally called syntactic or grammatical categories, but we all just say POS
• Examples:
  • noun: people, animals, concepts, and things
  • verb: actions or action-like events
  • adjective: properties of nouns
  • …
• Note that the set of POS classes depends on the language

Page 14: Linguistic Essentials for NLP

Part of Speech (POS, 품사)
• How can we tell whether two different words belong to the same class?
  • use the substitution test!
  • “Birds can fly/sing”
• Note that the POS of a word is not deterministic!
  • that’s what makes it hard to build an accurate POS tagger
  • we need semantic context to determine it
  • extreme example: garden path sentences
    • “The old train the young”
    • ???
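To see why tagging is not deterministic, we can enumerate every tag sequence a dictionary alone would allow for the garden path sentence “The old train the young”. The lexicon and tag names below are assumptions made up for this sketch, not a real tagset or tagger.

```python
from itertools import product

# Toy lexicon (an assumption for illustration): each word lists its possible POS tags.
LEXICON = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],
    "train": ["NOUN", "VERB"],
    "young": ["ADJ", "NOUN"],
}

def candidate_taggings(words):
    """Enumerate every tag sequence the lexicon allows, ignoring context."""
    options = [LEXICON[w.lower()] for w in words]
    return [list(zip(words, tags)) for tags in product(*options)]

sentence = "The old train the young".split()
taggings = candidate_taggings(sentence)
print(len(taggings))  # 1 * 2 * 2 * 1 * 2 = 8 candidate taggings
```

Only one of the eight candidates (DET NOUN VERB DET NOUN) gives a grammatical sentence, and a reader usually guesses ADJ NOUN for “old train” first, which is what makes it a garden path.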

Page 15: Linguistic Essentials for NLP

Morphology (형태론)
• Before we talk about POS in detail, we need to know about morphological processes, or morphology
  • because inflection does not change the word class or meaning significantly, while derivation and compounding can!
  • these morphological processes are systematically related to POS; they are even used to classify languages! (see appendix)
• There are 3 major types
  • inflection
  • derivation
  • compounding

Page 16: Linguistic Essentials for NLP

Morphology (형태론)
• Inflection (굴절): (mostly) generates a new word form through a minor modification that changes features like tense or number (plurality) while remaining in the same category
  • the set of all inflectional forms is called the manifestations of a lexeme
• Derivation (파생): (mostly) generates a new word by adding a prefix or suffix, which can change the category or meaning
• Compounding (합성): (mostly) generates a new word by merging two or more words, and the result can mean something very different
  • the word “disk drive” has a completely different meaning from “disk” or “drive”; even though it is somehow related, we still want to list it in the lexicon

Page 17: Linguistic Essentials for NLP

Morphology (형태론)
• Examples of inflection:
  • “work” / “worked”
    • verbs inflect (mostly) for tense, aspect, and mood
  • “dog” / “dogs”
    • nouns and adjectives inflect (mostly) for gender, plurality, and case
  • “flora” / “florae” / “floram” / “florarum” / …
    • Latin has 272 inflections!
  • “하다” / “했다” (Korean: “do” / “did”)
    • Korean does inflect!

Page 18: Linguistic Essentials for NLP

Morphology (형태론)
• Examples of derivation:
  • “wide” / “widely”
    • an adjective becomes an adverb when we add “-ly”
    • but we don’t have “oldly” or “difficultly”!
    • derivation rules apply to a rather small fraction of words, compared to inflection rules
  • “weak” / “weaken”
  • “understand” / “understandable”
  • “teach” / “teacher”
  • “넓다” / “넓이” / “넓게” (Korean: “wide” / “width” / “widely”)

Page 19: Linguistic Essentials for NLP

Morphology (형태론)
• Is it inflection? Or derivation?
  • this might not be that important in this course
  • but let’s make a checkpoint: we will never confuse the two from now on
• Derivation: stem + affix, where we can determine the boundary between the two, and the stem and affix each usually have a unique and definite purpose
  • and it often changes the POS
• Inflection: the change itself has a clear purpose:
  • tense, aspect, mood / gender, plurality, case
• The truth is, these definitions describe a continuum rather than a strict distinction; think of covalent bonds versus ionic bonds

Page 20: Linguistic Essentials for NLP

Morphology (형태론) - Appendix
Property | Inflection | Derivation
Lexical category | Do not change the lexical category of the word. | Often change the lexical category of the word.
Location | Tend to occur outside derivational affixes. | Tend to occur next to the root.
Type of meaning | Contribute syntactically conditioned information, such as number, gender, or aspect. | Contribute lexical meaning.
Affixes used | Occur with all or most members of a class of stems. | Are restricted to some, but not all, members of a class of stems.
Productivity | May be used to coin new words of the same type. | May eventually lose their meaning and usually cannot be used to coin new terms.
Grounding | Create forms that are fully grounded and able to be integrated into discourse. | Create forms that are not necessarily fully grounded and may require inflectional operations before they can be integrated into discourse.

Page 21: Linguistic Essentials for NLP

Morphological Typology - Appendix
• Agglutinative languages (교착어):
  • clear boundary between stem and ending
  • endings have a definite grammatical function
    • “나는” / “나의” (Korean: “I” / “my”)
    • “나” is the stem; “-는” and “-의” are suffixes (endings)
      • “-는” is clearly a 주격조사 (subject particle)
      • “-의” is clearly a 소유격조사 (possessive particle)
  • mostly uses derivation frequently
  • Korean is very agglutinative
  • Japanese is even more so, very, very agglutinative
    • the boundary is 99.9% clear!

Page 22: Linguistic Essentials for NLP

Morphological Typology - Appendix
• Inflectional languages (굴절어):
  • unclear boundary between stem and ending
  • endings have an indefinite grammatical function
    • “flora” / “florae” / “floram”
    • these are the nominative / possessive / accusative cases, respectively, all singular
    • but in “-am”, which part means accusative? which part carries plurality? is “-m” the ending? or maybe “-am”, so that the stem is “flor”?
  • has many inflection rules
  • Latin is a typical inflectional language, while Sanskrit is the supreme example

Page 23: Linguistic Essentials for NLP

Part of Speech (POS, 품사)
• Let’s head back to POS
• So far we are…
[Figure: roadmap of the lecture. Units: language (언어), sentence (문장), clause (절), phrase (구), word (단어). Fields: morphology (형태론), POS (품사론), syntax (통사론), semantics (의미론), pragmatics (화용론), ?]

Page 24: Linguistic Essentials for NLP

Nouns (명사)
• Nouns (명사): refer to entities in the world like people, animals, and things, and have three types of inflection
  • plurality: singular, plural
  • gender: feminine, masculine, neuter
  • case: nominative, genitive (possessive), dative, accusative
• Pronouns (대명사): act like variables in that they refer to a person or thing, and belong to a functional (closed) category, like anaphors
• Proper nouns (고유명사): refer to particular persons or things
• Adverbial nouns (부사성 명사): can be used without modifiers to give information like time (“tomorrow”) or location (“west”)

Page 25: Linguistic Essentials for NLP

Determiners (한정사) and Adjectives (형용사)
• Determiners (한정사): describe the particular reference of a noun, like articles (관사, “the/a”) and demonstratives (지시사, “this/that”)
• Adjectives (형용사): describe properties of nouns
  • attributive (한정용법) or adnominal (관형용법): modifies a noun, as in “red rose”
  • predicative (서술용법): used as a complement, as in “the rose is red”
  • more finely, there are categories like quantifiers (수량사), interrogative pronouns (의문대명사), interrogative determiners (의문한정사), and so on

Page 26: Linguistic Essentials for NLP

Verbs (동사)
• Verbs (동사): describe actions, activities, and states, and inflect for plurality, case, particles, and
  • tense: past, present, future
  • aspect: completion (perfect), progressive
  • mood: possibility, subjunctive, irrealis
  • voice: active, passive, middle
• Special cases like infinitives (부정사, “to go”), gerunds (동명사, “going”), auxiliaries (조동사, “have”), and modal auxiliaries (법조동사, “should”) may occur
  • periphrastic forms (복합형): things like “have to”; its meaning, “must”, has nothing to do with “have” or “to”

Page 27: Linguistic Essentials for NLP

Other POS
• Adverbs (부사): modify a verb just as adjectives modify a noun; they usually specify place, time, manner, or degree
  • degree adverbs (정도부사): adverbs specialized to modify adjectives and other adverbs, like qualifiers (수식어구, “really bright sunshine”)
• Prepositions (전치사): prototypically express spatial relationships, as in “in the glass”, “on the table”
  • particles (소사): specialized in creating phrasal verbs like “take ~ off”
• Conjunctions (접속사): join words or phrases (coordinating, 등위, “or”), or clauses (subordinating, 종속, “because”)

Page 28: Linguistic Essentials for NLP

Syntax (구문론, 통사론)
• Syntax is the study of the regularities and constraints of word order and phrase structure
• Words are organized into phrases, and phrases are organized into clauses!
  • so phrases are the constituents from which larger units are composed
• We’re currently focusing on phrase structure here
• Paradigmatic relationship (계열관계): if elements can replace each other in a certain syntactic position, they share a paradigm
• Syntagmatic relationship (통합관계): if elements can form a phrase (or syntagma), they are in a syntagmatic relationship
  • very important cases are also called collocations (연어)

Page 29: Linguistic Essentials for NLP

Phrase Typology
• There are 4 major types of phrases:
  • noun phrases (명사구)
    • headed by a noun, usually arguments of verbs
    • can contain clauses such as a relative clause (관계사절, “that I love”)
  • prepositional phrases (전치사구)
    • headed by a preposition, with a noun phrase as its complement
  • verb phrases (동사구)
    • headed by a verb, lacks the subject noun phrase
  • adjective phrases (형용사구)
    • “She was surprisingly beautiful”

Page 30: Linguistic Essentials for NLP

Grammars (문법)
• We now have the constituents of a sentence, namely phrases
• Order usually matters when forming a sentence!
  • sentences have various purposes, like interrogatives (which may involve inversion in English), imperatives (which may omit the subject), declaratives, and many others
  • we need a precise ordering of phrases to make a sentence meaningful!
    • “Mary gave Peter a book”
    • “Peter gave Mary a book”
  • of course, there are languages that don’t really depend on word order

Page 31: Linguistic Essentials for NLP

Grammars (문법)
• Commonly, from the perspective of context-free grammars, we use rewrite rules and a lexicon
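As a rough illustration of rewrite rules plus a lexicon, here is a toy grammar in Python. The rules, the words, and the helper name `expand` are all assumptions made up for this sketch; it simply enumerates every sentence the grammar can derive.

```python
from itertools import product

# Rewrite rules: each nonterminal maps to its possible right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"]],
    "VP": [["V", "NP"]],
}
# Lexicon: each preterminal (POS) maps to the words it can produce.
LEXICON = {
    "DET": ["the"],
    "N": ["children", "cake"],
    "V": ["ate"],
}

def expand(symbol):
    """Return every word sequence derivable from `symbol` (finite for this toy grammar)."""
    if symbol in LEXICON:
        return [[word] for word in LEXICON[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
print(sentences)
```

Note that the grammar happily derives “the cake ate the children” too: the rules only constrain structure, so ruling that one out is a job for semantics.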

Page 32: Linguistic Essentials for NLP

Grammars (문법)
• A more intuitive and visually effective way to represent phrase structure is a parse tree
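A parse tree is easy to hold in code as well. The sketch below assumes a nested-tuple encoding and two made-up helpers, `to_brackets` and `leaves`; the tree itself is a hand-written example, not the output of a parser.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are plain strings.
TREE = (
    "S",
    ("NP", ("DET", "the"), ("N", "children")),
    ("VP", ("V", "ate"), ("NP", ("DET", "the"), ("N", "cake"))),
)

def to_brackets(node):
    """Render a tree in labeled bracket notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "[" + label + " " + " ".join(to_brackets(c) for c in children) + "]"

def leaves(node):
    """Read the sentence back off the tree, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]

print(to_brackets(TREE))
print(" ".join(leaves(TREE)))
```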

Page 33: Linguistic Essentials for NLP

Grammars (문법)
• Note that natural human languages are recursive in their syntax, so we can even do something like this
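Recursion in syntax means a rule can reintroduce the symbol it rewrites. A minimal sketch, assuming the made-up rule S → “Mary thinks that” S | “it rains” (this is not the example from the slide’s figure, which the transcript does not include):

```python
def embed(depth):
    """Apply the recursive rule S -> 'Mary thinks that' S exactly `depth` times."""
    if depth == 0:
        return "it rains"  # base case: the non-recursive alternative
    return "Mary thinks that " + embed(depth - 1)

for d in range(3):
    print(embed(d))
```

Every depth gives a grammatical sentence, so a finite set of rules describes an unbounded set of sentences.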

Page 34: Linguistic Essentials for NLP

Grammars - Appendix
• Quiz: find a regular expression for the following rewrite rules of a context-free grammar
  S → AB
  A → aA | λ
  B → bB | b
• We’re already very familiar with context-free grammars from automata theory, which use a kind of extended formal language
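One candidate answer to the quiz is `a*b+`, and it can be sanity-checked on short strings. The sketch below is a bounded check, not a proof: the helpers `derive_A` and `derive_B` are made up here and unroll each recursive rule by hand up to a length limit.

```python
import re
from itertools import product

def derive_A(max_len):
    """A -> aA | λ : yields '', 'a', 'aa', ... up to max_len symbols."""
    return ["a" * i for i in range(max_len + 1)]

def derive_B(max_len):
    """B -> bB | b : yields 'b', 'bb', ... up to max_len symbols."""
    return ["b" * j for j in range(1, max_len + 1)]

CANDIDATE = re.compile(r"a*b+")

# S -> AB : concatenate every A-string with every B-string.
generated = {a + b for a in derive_A(3) for b in derive_B(3)}
all_match = all(CANDIDATE.fullmatch(s) for s in generated)

# Conversely, every string over {a, b} of length <= 3 that matches is generated.
short_strings = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]
matched = {s for s in short_strings if CANDIDATE.fullmatch(s)}

print(all_match, matched <= generated)
```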

Page 35: Linguistic Essentials for NLP

Grammars - Appendix
• Actually, the syntax of formal languages in mathematical logic is pretty similar, which means this is not that hard a course to learn!
• For example, a mathematical term is defined as
  term → constant
  term → variable
  term → n-ary function of terms
• We all know that an expression like “7cos(3n)” is a typical mathematical term; the definition above is just a bit more formal
• Strictly speaking, the above example is not exactly the same as the general meaning of a term; it’s a term in the sense of a truly formal language, and it needs a model to be resolved to a value!
  • See model theory for further study
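The three rules for terms, and the remark that a term needs a model before it means anything, can be sketched as follows. The tuple encoding, the symbol names `mul` and `cos`, and the helper `evaluate` are assumptions of this sketch; the “model” here is just a table of function interpretations plus a variable assignment.

```python
import math

# A term is a constant, a variable, or an n-ary function applied to terms.
# Encoding (an assumption for this sketch): numbers are constants, strings are
# variables, and tuples ("name", arg, ...) are function applications.
TERM = ("mul", 7, ("cos", ("mul", 3, "n")))  # 7cos(3n)

# The interpretation of the function symbols.
FUNCTIONS = {"mul": lambda x, y: x * y, "cos": math.cos}

def evaluate(term, assignment):
    """Evaluate a term under the function interpretation and a variable assignment."""
    if isinstance(term, (int, float)):
        return term
    if isinstance(term, str):
        return assignment[term]
    name, *args = term
    return FUNCTIONS[name](*(evaluate(a, assignment) for a in args))

value = evaluate(TERM, {"n": 0})
print(value)  # 7 * cos(0) = 7.0
```

Swapping in a different `FUNCTIONS` table gives the same syntactic term a different value, which is the point of separating syntax from the model.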

Page 36: Linguistic Essentials for NLP

Ambiguity
• If a given sentence can be derived with two or more different parse trees, we call it ambiguous
  • “The children ate the cake with a spoon”
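We can count the parse trees of the example sentence with a small CKY-style chart. The grammar and lexicon below are assumptions made up for this sketch (binary rules only), and `count_parses` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

BINARY = [  # (parent, left child, right child)
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "DET", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "the": "DET", "a": "DET",
    "children": "N", "cake": "N", "spoon": "N",
    "ate": "V", "with": "P",
}

def count_parses(words, root="S"):
    """Count distinct parse trees rooted in `root` that cover all of `words`."""
    n = len(words)
    # chart[(i, j)][label] = number of trees with that label over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        chart[(i, i + 1)][LEXICON[word]] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between left and right child
                for parent, left, right in BINARY:
                    chart[(i, j)][parent] += chart[(i, k)][left] * chart[(k, j)][right]
    return chart[(0, n)][root]

print(count_parses("the children ate the cake with a spoon".split()))  # 2
print(count_parses("the children ate the cake".split()))               # 1
```

The two trees differ only in where “with a spoon” attaches: to the verb phrase (eating with a spoon) or to the noun phrase (a cake that has a spoon).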

Page 37: Linguistic Essentials for NLP

Dependency
• Local dependency: a dependency between words that share a syntactic rule; for example, the to-infinitive form requires the base form of a verb
• Non-local dependency: the phenomenon that two words depend on each other syntactically even when they are far apart in a sentence
  • subject-verb agreement: “She ~~~ walks”
  • wh-extraction: “Should Peter buy a book?” versus “Which book should Peter buy?”

Page 38: Linguistic Essentials for NLP

Dependency
• “She saw” - what? who?
• Some particular phrases (more precisely, verbs) require arguments
  • commonly, noun phrases serve as those arguments
  • we classify them by semantic role: agent, patient, instrument
    • note that some languages have voice (active versus passive)
  • alternative: subject, object, recipient (indirect object)

Page 39: Linguistic Essentials for NLP

Subcategorization
• Classification of verbs by the complements they permit
  • intransitive verb (자동사): “The woman walked”
  • transitive verb (타동사): “John loves Mary”
  • ditransitive verb: “Mary gave Peter flowers”
  • intransitive with PP: “I rent in Paddington”
  • transitive with PP: “She put the book on the table”
  • sentential complement: “I know (that) she likes you”
  • transitive with sentential complement: “She told me that Gary is coming on Tuesday”
• subject, object, prepositional phrase, predicative adjective, bare infinitive, … / now we can talk about SV, SVO, SVC, …
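The frames in the list above can be written down as a small lookup table. The frame labels (`NP`, `PP`, `S`) and the helper `permits` are assumptions of this sketch, and each verb gets only the single frame shown in the examples, which is far fewer than a real lexicon would list.

```python
# Toy subcategorization lexicon built from the example sentences above.
FRAMES = {
    "walk": [()],            # intransitive
    "love": [("NP",)],       # transitive
    "give": [("NP", "NP")],  # ditransitive
    "rent": [("PP",)],       # intransitive with PP
    "put": [("NP", "PP")],   # transitive with PP
    "know": [("S",)],        # sentential complement
    "tell": [("NP", "S")],   # transitive with sentential complement
}

def permits(verb, complements):
    """Check whether `verb` allows exactly this sequence of complement types."""
    return tuple(complements) in FRAMES.get(verb, [])

print(permits("put", ["NP", "PP"]))  # "She put the book on the table"
print(permits("put", ["NP"]))        # "*She put the book"
```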

Page 40: Linguistic Essentials for NLP

Semantics (의미론)
• Semantics is the study of the meaning of words, constructions, and utterances
  • lexical semantics
    • semantic relations between words like hypernymy, hyponymy, polysemy, synonymy, etc.
  • compositional semantics
    • semantic information that can be retrieved by looking at the whole phrase or sentence
    • “white wine” is not literally wine that is white in color!
    • idioms take on special meanings that belong to the whole phrase or clause

Page 41: Linguistic Essentials for NLP

Semantics - Appendix
• Remember these from PL (programming languages)? We’ve been using a formal concept of semantics every day!
• Let C = “x = 1; y = x + 1”; then it can be proved by [derivation figure from the slide, not captured in this transcript]
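To give a feel for what a formal semantics assigns to C, here is a tiny interpreter that maps the program to a final state. This is not the derivation from the slide (which the transcript does not include); the AST encoding and the helper names `eval_expr` and `run` are assumptions of this sketch.

```python
# Program C = "x = 1; y = x + 1" as a list of assignments (variable, expression).
# Expressions: ints, variable names, or ("+", left, right).
C = [("x", 1), ("y", ("+", "x", 1))]

def eval_expr(expr, state):
    """Evaluate an expression in the given state (a dict from variables to values)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return state[expr]
    op, left, right = expr
    if op == "+":
        return eval_expr(left, state) + eval_expr(right, state)
    raise ValueError(f"unknown operator: {op}")

def run(program, state=None):
    """Execute assignments in order, threading the state through (sequencing)."""
    state = dict(state or {})
    for var, expr in program:
        state[var] = eval_expr(expr, state)
    return state

print(run(C))  # {'x': 1, 'y': 2}
```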

Page 42: Linguistic Essentials for NLP

Pragmatics (화용론)
• Pragmatics is the study of how knowledge about the world and language conventions interact with literal meaning
  • things like: what did the speaker really mean?
  • we need knowledge about “hurricanes” so that we can identify that “the disaster” refers to “hurricanes” in some context
• It’s something (maybe) way beyond syntax
• It might involve knowledge from sociolinguistics, historical linguistics, phonology, phonetics, and many, many other contexts; very, very sophisticated and complex!

Page 43: Linguistic Essentials for NLP

Overall
• Before we start studying NLP, we need some minimal knowledge of linguistics, since we are dealing with natural language!
• There is actually a lot more to introduce, especially about syntax and semantics; this note couldn’t cover that much
• Anyway, this note covers: what language, sentence, phrase, and word are, the morphology of words, part of speech, phrase structure, syntax, grammar, semantics, and pragmatics
• In a practical NLP task, one may need to know a specific linguistic concept quite deeply, or possibly need absolutely nothing as a prerequisite; which is the funny part
• For more examples of those, I recommend Coursera courses

Page 44: Linguistic Essentials for NLP

References
• http://www-01.sil.org/linguistics/glossaryoflinguisticterms/comparisonofinflectionandderiv.htm
• http://www-personal.umich.edu/~jlawler/Inflection.pdf
• https://www.uio.no/studier/emner/hf/ikos/EXFAC03-AAS/h05/larestoff/linguistics/Chapter%201.(H05).pdf
• https://en.wikipedia.org/wiki/Language
• https://ko.wikisource.org/wiki/대한민국_한글_맞춤법(제2014-39호)
• http://ropas.snu.ac.kr/~kwang/4190.310/11/pl-book-draft.pdf
• Manning, Christopher D., and Hinrich Schütze. “Foundations of Statistical Natural Language Processing.” Cambridge: MIT Press, 1999.
• Sapir, Edward. “Language: An Introduction to the Study of Speech.” New York: Harcourt, Brace & Company, 1921.
• Radev, Dragomir R. “Introduction to Natural Language Processing.” Lecture, University of Michigan, Coursera, 2016.