CS 6120/CS4120: Natural Language Processing

CS 6120/CS4120: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang

Transcript of CS 6120/CS4120: Natural Language Processing

Page 1: CS 6120/CS4120: Natural Language Processing

CS 6120/CS4120: Natural Language Processing

Instructor: Prof. Lu Wang
College of Computer and Information Science

Northeastern University
Webpage: www.ccs.neu.edu/home/luwang

Page 2: CS 6120/CS4120: Natural Language Processing

Time and Location

• Time: Tuesdays and Fridays, 3:25pm - 5:05pm

• Location: West Village H 108

Page 3: CS 6120/CS4120: Natural Language Processing

Course Webpage

• http://www.ccs.neu.edu/home/luwang/courses/cs6120_sp2018/cs6120_sp2018.html

Page 4: CS 6120/CS4120: Natural Language Processing

Prerequisites

• Programming
  • Being able to write code proficiently in some programming language (e.g., Python, Java, C/C++, Matlab)

• Courses
  • Algorithms
  • Some calculus
  • Probability and statistics
  • Linear algebra (optional but highly recommended)

Page 5: CS 6120/CS4120: Natural Language Processing

Prerequisites

• A quiz:
  • This Friday, in class
  • 22 simple questions, 20 of them True or False questions (relevant to probability, statistics, and linear algebra)
  • The purpose of this quiz is to indicate the expected background of students.
  • 80% of the questions should be easy to answer.
  • Not counted in your final score!

Page 6: CS 6120/CS4120: Natural Language Processing

Textbook and References

• Main textbook (and some slides)
  • Dan Jurafsky and James H. Martin, "Speech and Language Processing, 2nd Edition", Prentice Hall, 2009.
  • We will use some material from the 3rd edition when it is available.
    • http://web.stanford.edu/~jurafsky/slp3/
  • YouTube video: https://www.youtube.com/watch?v=s3kKlUBa3b0

• Other reference
  • Chris Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.

• Machine learning textbooks:
  • Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
  • Tom Mitchell, "Machine Learning", McGraw Hill, 1997.

Page 7: CS 6120/CS4120: Natural Language Processing

Topics of the Course (tentative)

• Language Modeling

• Part-of-Speech Tagging

• Text Categorization: Word Sense Disambiguation, Named Entity Recognition

• Syntax: Formal Grammars of English, Syntactic Parsing, Statistical Parsing, Dependency Parsing

• Semantics: Vector-Space, Lexical Semantics, Semantics with Dense Vectors

• Information Extraction

• Question Answering

• Machine Translation

• Summarization

• Sentiment Analysis, Opinion Mining

• NLP and Social Media

• Dialog Systems and Chatbots

Page 8: CS 6120/CS4120: Natural Language Processing

The Goal

• Study fundamental tasks in NLP

• Learn some classic and state-of-the-art techniques

• Acquire hands-on skills for solving NLP problems
  • Even some research experience!

Page 9: CS 6120/CS4120: Natural Language Processing

Grading

• Assignment (30%)
  • 2 assignments, 15% each

• Quiz (5%)
  • 8 in-class tests, 1% each (the three lowest scores are dropped)
  • Tuesdays, starting next week

• Final Exam (35%)

• Project (25%)

• Participation (5%)
  • Classes: ask and answer questions, participate in discussions…
  • Piazza: help your peers, address questions…

Page 10: CS 6120/CS4120: Natural Language Processing

Exam

• Open book
• Time and place: TBD (still scheduling with the college)
• Please don’t make travel arrangements for exam weeks.

Page 11: CS 6120/CS4120: Natural Language Processing

Course Project

• An NLP-related research project

• 2-3 students as a team

Page 12: CS 6120/CS4120: Natural Language Processing

Course Project Grading

• We want to see novel projects!
  • The problem needs to be well-defined, novel, useful, and practical.

• Reasonable results and observations.

• We encourage you to tackle a research-driven problem.

• Option 1: a research project discussed with the instructor
  • A better solution for an existing problem
  • Or a novel problem

• Option 2: The Fake News Challenge

Page 13: CS 6120/CS4120: Natural Language Processing

Sample Projects from Previous Offering

• Project reports are listed here: http://www.ccs.neu.edu/home/luwang/courses/cs6120_fa2017.html
  • Neural Semantic Parsing Natural Language into SQL
  • Short Passages Reading Comprehension and Question Answering
  • Political Promise Evaluation (PPE)
  • Predicting Personality Traits using Tweets
  • STORYNEXT 2.0: A TEXT INSIGHTS/VISUALIZATION TOOL
  • Android Application for Visual QA
  • Novel Summarizer and Keyword Identifier Using TextRank with Sentence Farm Detection
  • Paraphrase Generation
  • Hashtag Similarity based on Tweet Text
  • Stance Detection for the Fake News Challenge
  • Machine Comprehension Using match-LSTM and Answer-Pointer
  • Online Abuse Detection
  • Plagiarism Detection Using FP-Growth Algorithm
  • An Examination of Influential Framing of Controversial Topics on Twitter

Page 14: CS 6120/CS4120: Natural Language Processing

Neural Semantic Parsing Natural Language into SQL

Page 15: CS 6120/CS4120: Natural Language Processing

Short Passages Reading Comprehension and Question Answering

Page 16: CS 6120/CS4120: Natural Language Processing
Page 17: CS 6120/CS4120: Natural Language Processing

Online Abuse Detection

Automating the process of identifying abusive comments would not only save time for social media platforms but would also increase user safety and improve discussions online.

Build a classifier that classifies the test data as either an “Abuse” (aggression, personal attack, or toxic statement) or “Not-Abuse” statement using multiple techniques.

Page 18: CS 6120/CS4120: Natural Language Processing

StoryNext 2: Sentiment Analysis for Documents

Page 19: CS 6120/CS4120: Natural Language Processing

Option 2: The Fake News Challenge

Page 20: CS 6120/CS4120: Natural Language Processing

The Fake News Challenge

• Website: http://www.fakenewschallenge.org/
• Goal: “The goal of the Fake News Challenge is to explore how artificial intelligence technologies, particularly machine learning and natural language processing, might be leveraged to combat the fake news problem. We believe that these AI technologies hold promise for significantly automating parts of the procedure human fact checkers use today to determine if a story is real or a hoax.”

Page 21: CS 6120/CS4120: Natural Language Processing

The Fake News Challenge

• Stage 1: Stance Detection

Page 22: CS 6120/CS4120: Natural Language Processing

The Fake News Challenge

• Data: https://github.com/FakeNewsChallenge/fnc-1

Page 23: CS 6120/CS4120: Natural Language Processing

Headline: “Robert Plant Ripped up $800M Led Zeppelin Reunion Contract”

Page 24: CS 6120/CS4120: Natural Language Processing

Course Project Grading

• We want to see novel projects!
  • The problem needs to be well-defined, novel, useful, and practical.

• Reasonable results and observations.

• We encourage you to tackle a research-driven problem.
  • A better solution for an existing problem
  • Or a novel problem

• Feel free to talk to the instructor or TAs about project topics during office hours.

Page 25: CS 6120/CS4120: Natural Language Processing

Course Project Grading

• Three reports
  • Proposal (3%), due by the end of January
  • Progress, with code (7%)
  • Final, with code (10%)

• One presentation
  • In class (5%)

Page 26: CS 6120/CS4120: Natural Language Processing

Audience Award

• Bonus points!
• All teams vote for their favorite project(s).
• Best project gets 1% as bonus (one best project in each batch, if we need to have more than one batch/lecture for presentation)

Page 27: CS 6120/CS4120: Natural Language Processing

Submission and Late Policy

• Each assignment or report is due at the beginning of class on the corresponding due date.

• Programming language
  • Python (encouraged), Java, C/C++

• Electronic version
  • On Blackboard

Page 28: CS 6120/CS4120: Natural Language Processing

Submission and Late Policy

• An assignment or report turned in late will lose 20 points (out of 100 points) for each late day (i.e., 24 hours).

• Each student has a budget of 5 late days throughout the semester before a late penalty is applied.

• Late days are not applicable to the final presentation.

• Each group member is charged the same number of late days, if any, for their submission.
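
The late policy above can be sketched as a small calculation. This is only an illustration under stated assumptions that the slides do not spell out: a partial day counts as a full late day, remaining budget days are used up before any points are deducted, and the deduction is capped at 100 points. The function and parameter names are hypothetical.

```python
import math

PENALTY_PER_DAY = 20   # points off per late day (from the slide)
LATE_DAY_BUDGET = 5    # penalty-free late days per semester (from the slide)


def late_penalty(hours_late, budget_days_left):
    """Return (points_off, budget_days_left_after) for one submission.

    Assumptions (not stated on the slide): partial days round up,
    budget days are consumed first, and the penalty is capped at 100.
    """
    days_late = math.ceil(hours_late / 24) if hours_late > 0 else 0
    covered = min(days_late, budget_days_left)   # days absorbed by the budget
    charged_days = days_late - covered           # days that cost points
    points_off = min(100, PENALTY_PER_DAY * charged_days)
    return points_off, budget_days_left - covered


# A submission 30 hours late with a full budget uses 2 budget days, no penalty.
print(late_penalty(30, LATE_DAY_BUDGET))
# A submission 50 hours late with 1 budget day left is charged for 2 days.
print(late_penalty(50, 1))
```

For a group submission, the same number of late days would be charged to every member, so the calculation would be run once per member with that member's own remaining budget.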

Page 29: CS 6120/CS4120: Natural Language Processing

How to find us?

• Course webpage:
  • http://www.ccs.neu.edu/home/luwang/courses/cs6120_sp2018/cs6120_sp2018.html

• Office hours
  • Prof. Lu Wang: Tuesdays, from 5:15pm to 6:15pm, or by appointment, 258 WVH
  • TA Liwen Hou
  • TA Tirthraj Maheshkumar Parmar
  • TA Manthan Thakar

• Piazza
  • http://piazza.com/northeastern/sp2018/cs6120/home
  • All course-relevant questions should go here; it is also the best way to reach the instructor and TAs!

Page 30: CS 6120/CS4120: Natural Language Processing

What is Natural Language Processing?

• Allowing machines to communicate with humans

• Natural language understanding + natural language generation

Page 32: CS 6120/CS4120: Natural Language Processing

What does it mean to understand a language?

[Figure: levels of language analysis (Phonology, Morphology, Lexemes, Syntax, Semantics, Pragmatics, Discourse) shown alongside the representations Sound waves, Words, Parse trees, and Meanings]

Page 33: CS 6120/CS4120: Natural Language Processing

What does it mean to understand a language?

[Figure: the levels Phonology, Morphology, Lexemes, Syntax, Semantics, Pragmatics, Discourse, ranging from Shallower Analysis to Deeper Analysis]

Page 34: CS 6120/CS4120: Natural Language Processing

Syntax, Semantics, Pragmatics

• Syntax concerns the proper ordering of words and its effect on meaning.
  • The dog bit the boy.
  • The boy bit the dog.
  • Bit boy dog the the.

• Semantics concerns the (literal) meaning of words, phrases, and sentences.
  • “plant” as a photosynthetic organism
  • “plant” as a manufacturing facility
  • “plant” as the act of sowing

• Pragmatics concerns the overall communicative and social context and its effect on interpretation.
  • The ham sandwich wants another beer.
  • John thinks vanilla.

[Modified from Ray Mooney’s Slides]

Page 38: CS 6120/CS4120: Natural Language Processing

Ambiguity is Ubiquitous

• Speech Recognition
  • “recognize speech” vs. “wreck a nice beach”
  • “youth in Asia” vs. “euthanasia”

• Syntactic Analysis
  • “I ate spaghetti with chopsticks” vs. “I ate spaghetti with meatballs.”

• Semantic Analysis
  • “The dog is in the pen.” vs. “The ink is in the pen.”
  • “I put the plant in the window” vs. “Ford put the plant in Mexico”

• Pragmatic Analysis
  • From “The Pink Panther Strikes Again”:
  • Clouseau: Does your dog bite?
    Hotel Clerk: No.
    Clouseau: [bowing down to pet the dog] Nice doggie. [Dog barks and bites Clouseau in the hand]
    Clouseau: I thought you said your dog did not bite!
    Hotel Clerk: That is not my dog.

Page 39: CS 6120/CS4120: Natural Language Processing

Ambiguity is Explosive

• Ambiguities compound to generate enormous numbers of possible interpretations.
• In English, a sentence ending in n prepositional phrases has over 2^n syntactic interpretations (cf. Catalan numbers).
  • “I saw the man with the telescope”: 2 parses
  • “I saw the man on the hill with the telescope.”: 5 parses
  • “I saw the man on the hill in Texas with the telescope”: 14 parses
  • “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
  • “I saw the man on the hill in Texas with the telescope at noon on Monday”: 132 parses
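
The parse counts listed on this slide (2, 5, 14, 42, 132) match the Catalan numbers C_2 through C_6, so a sentence ending in n prepositional phrases appears to have C_(n+1) parses. The short sketch below reproduces those counts with the closed form C_k = binom(2k, k) / (k + 1) and prints 2^n next to them for comparison. The function names are illustrative only.

```python
from math import comb


def catalan(k):
    """k-th Catalan number, using the closed form binom(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)


def pp_attachment_parses(n_pps):
    """Parse count for a sentence ending in n_pps prepositional phrases,
    assuming the count follows C_(n+1), as the slide's examples suggest."""
    return catalan(n_pps + 1)


for n in range(1, 6):
    print(f"{n} PPs: {pp_attachment_parses(n)} parses (2^n = {2 ** n})")
```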

Page 40: CS 6120/CS4120: Natural Language Processing

Humor and Ambiguity

• Many jokes rely on the ambiguity of language:
  • Policeman to little boy: “We are looking for a thief with a bicycle.” Little boy: “Wouldn’t you be better using your eyes?”
  • Why is the teacher wearing sunglasses? Because the class is so bright.
  • Groucho Marx: One morning I shot an elephant in my pajamas. How he got into my pajamas, I’ll never know.
  • She criticized my apartment, so I knocked her flat.
  • Noah took all of the animals on the ark in pairs. Except the worms, they came in apples.

Page 42: CS 6120/CS4120: Natural Language Processing

Why is Language Ambiguous?

• Having a unique linguistic expression for every possible conceptualization that could be conveyed would make language overly complex and linguistic expressions unnecessarily long.
• Allowing resolvable ambiguity permits shorter linguistic expressions, i.e., data compression.
• Language relies on people’s ability to use their knowledge and inference abilities to properly resolve ambiguities.
• Infrequently, disambiguation fails, i.e., the compression is lossy.

Page 43: CS 6120/CS4120: Natural Language Processing

Some NLP Tasks

Page 44: CS 6120/CS4120: Natural Language Processing

Syntactic Tasks

Page 45: CS 6120/CS4120: Natural Language Processing

Word Segmentation

• Breaking a string of characters into a sequence of words.
• In some written languages (e.g., Chinese) words are not separated by spaces.
• Even in English, characters other than white-space can be used to separate words [e.g. , ; . - : ( ) ]
• Examples from English URLs:
  • jumptheshark.com ⇒ jump the shark .com
  • myspace.com/pluckerswingbar ⇒ myspace .com pluckers wing bar
                                ⇒ myspace .com plucker swing bar
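
The URL example above is ambiguous because more than one sequence of dictionary words covers the same characters. A minimal sketch of dictionary-based segmentation is shown below. It enumerates every way to split a string into known words. The tiny vocabulary is a made-up assumption for illustration, and this is not a method prescribed by the course.

```python
# Hypothetical toy vocabulary, chosen only to cover the slide's examples.
VOCAB = {"jump", "the", "shark", "pluckers", "plucker", "wing", "swing", "bar"}


def segmentations(text, vocab):
    """Return every way to split `text` into a sequence of words from `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], vocab):
                results.append([prefix] + rest)
    return results


print(segmentations("jumptheshark", VOCAB))
print(segmentations("pluckerswingbar", VOCAB))
```

With this vocabulary the second call yields both readings from the slide, which is why a real segmenter also needs some way to rank the candidates, such as a language model.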

Page 46: CS 6120/CS4120: Natural Language Processing

Morphological Analysis

• Morphology is the field of linguistics that studies the internal structure of words. (Wikipedia)
• A morpheme is the smallest linguistic unit that has semantic meaning. (Wikipedia)
  • e.g. “carry”, “pre”, “ed”, “ly”, “s”

• Morphological analysis is the task of segmenting a word into its morphemes:
  • carried ⇒ carry + ed (past tense)
  • independently ⇒ in + (depend + ent) + ly
  • Googlers ⇒ (Google + er) + s (plural)
  • unlockable ⇒ un + (lock + able) ?
               ⇒ (un + lock) + able ?

Page 47: CS 6120/CS4120: Natural Language Processing

Part Of Speech (POS) Tagging

• Annotate each word in a sentence with a part-of-speech.

• Useful for subsequent syntactic parsing and word sense disambiguation.

I    ate  the  spaghetti  with  meatballs.
Pro  V    Det  N          Prep  N

John  saw  the  saw  and  decided  to    take  it   to    the  table.
PN    V    Det  N    Con  V        Part  V     Pro  Prep  Det  N

Page 48: CS 6120/CS4120: Natural Language Processing

Phrase Chunking

• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
  • [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  • [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]

Page 49: CS 6120/CS4120: Natural Language Processing

Syntactic Parsing

• Produce the correct syntactic parse tree for a sentence.

Page 50: CS 6120/CS4120: Natural Language Processing

Semantic Tasks

Page 51: CS 6120/CS4120: Natural Language Processing

Word Sense Disambiguation (WSD)

• Words in natural language usually have a fair number of different possible meanings.
  • Ellen has a strong interest in computational linguistics.
  • Ellen pays a large amount of interest on her credit card.

• For many tasks (question answering, translation), the proper sense of each ambiguous word in a sentence must be determined.
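
One classic way to pick a sense is to compare the words around the ambiguous word with a short definition (gloss) of each sense and choose the sense with the largest overlap. This is a simplified form of the Lesk algorithm, swapped in here purely as an illustration. The two glosses and the stopword list below are hand-written assumptions, not taken from any real dictionary, and they were chosen so the slide's two sentences separate cleanly.

```python
# Hypothetical stopword list and hand-written glosses, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "to", "as", "or",
             "such", "about", "her", "has"}

SENSES = {
    "interest": {
        "curiosity": "a feeling of wanting to learn more about a subject such as linguistics",
        "finance": "money paid for the use of borrowed money on a loan or credit card",
    }
}


def content_words(text):
    """Lowercase the text, strip simple punctuation, and drop stopwords."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def simplified_lesk(word, sentence):
    """Return the sense of `word` whose gloss shares the most words with the sentence."""
    context = content_words(sentence) - {word}
    glosses = SENSES[word]
    return max(glosses, key=lambda sense: len(context & content_words(glosses[sense])))


print(simplified_lesk("interest", "Ellen has a strong interest in computational linguistics."))
print(simplified_lesk("interest", "Ellen pays a large amount of interest on her credit card."))
```

Word overlap of this kind is brittle on real text, where the context often shares no words with the right gloss.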

Page 52: CS 6120/CS4120: Natural Language Processing

Semantic Role Labeling (SRL)

• For each clause, determine the semantic role played by each noun phrase that is an argument to the verb.
  Roles: agent, patient, source, destination, instrument
  • [agent John] drove [patient Mary] from [source Austin] to [destination Dallas] in [instrument his Toyota Prius].
  • [instrument The hammer] broke [patient the window].

• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”

Page 53: CS 6120/CS4120: Natural Language Processing

Semantic Parsing

• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation (logical form).
• For many applications, the desired output is immediately executable by another program.
• Example: Mapping an English database query to Prolog:

How many cities are there in the US?

answer(A, count(B, (city(B), loc(B, C),
                    const(C, countryid(USA))), A))
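
To make "immediately executable" concrete, here is a rough Python analogue of what evaluating the logical form above amounts to: count every B such that B is a city and B is located in the constant country USA. The fact tables are invented toy data, and the function only mirrors this one query. It is not a Prolog interpreter or a semantic parser.

```python
# Hypothetical toy database standing in for the Prolog facts city/1 and loc/2.
CITY = {"austin", "dallas", "boston", "toronto"}
LOC = {
    ("austin", "usa"),
    ("dallas", "usa"),
    ("boston", "usa"),
    ("toronto", "canada"),
}


def count_cities_in(country):
    """Mirror count(B, (city(B), loc(B, C), const(C, countryid(country))), A)."""
    matches = {
        b
        for b in CITY                          # city(B)
        for (place, c) in LOC                  # loc(B, C)
        if place == b and c == country         # const(C, countryid(country))
    }
    return len(matches)


print(count_cities_in("usa"))
```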

Page 54: CS 6120/CS4120: Natural Language Processing

Textual Entailment

• Determine whether one natural language sentence entails (implies) another under an ordinary interpretation.

• E.g., “A soccer game with multiple males playing.” → “Some men are playing a sport.”

Page 55: CS 6120/CS4120: Natural Language Processing

Pragmatics/Discourse Tasks

Page 56: CS 6120/CS4120: Natural Language Processing

Anaphora Resolution/Co-Reference

• Determine which phrases in a document refer to the same underlying entity.
  • John put the carrot on the plate and ate it.
  • Bush started the war in Iraq. But the president needed the consent of Congress.

• Some cases require difficult reasoning.
  • Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite. He will make you take it back."

Page 57: CS 6120/CS4120: Natural Language Processing

More Application-driven Tasks

Page 58: CS 6120/CS4120: Natural Language Processing

Information Extraction (IE)

• Identify phrases in language that refer to specific types of entities and relations in text.
• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  Entity types: people, organizations, places
  • [people Michael Dell] is the CEO of [organizations Dell Computer Corporation] and lives in [places Austin Texas].

• Relation extraction identifies specific relations between entities.
  • Michael Dell [is the CEO of] Dell Computer Corporation and lives in Austin Texas.
  • Michael Dell is the CEO of Dell Computer Corporation and [lives in] Austin Texas.
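
The two steps above can be sketched with a lookup table and one hand-written pattern: first find mentions of known names (a gazetteer), then check the text between adjacent mentions for a relation phrase. The gazetteer, labels, and relation name are assumptions made up for this example. Systems of the kind covered in this course would typically learn these from annotated data instead.

```python
import re

# Hypothetical gazetteer: known names and their entity types.
GAZETTEER = {
    "Michael Dell": "PERSON",
    "Dell Computer Corporation": "ORGANIZATION",
    "Austin Texas": "PLACE",
}


def find_entities(text):
    """Return (start, end, name, label) for each gazetteer match, in text order."""
    mentions = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            mentions.append((match.start(), match.end(), name, label))
    return sorted(mentions)


def ceo_relations(text):
    """Extract (person, 'ceo_of', organization) from 'X is the CEO of Y' patterns."""
    mentions = find_entities(text)
    relations = []
    for left, right in zip(mentions, mentions[1:]):
        between = text[left[1]:right[0]].strip()  # text separating the two mentions
        if (left[3], right[3]) == ("PERSON", "ORGANIZATION") and between == "is the CEO of":
            relations.append((left[2], "ceo_of", right[2]))
    return relations


SENTENCE = "Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas."
print([(name, label) for _, _, name, label in find_entities(SENTENCE)])
print(ceo_relations(SENTENCE))
```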

Page 59: CS 6120/CS4120: Natural Language Processing

Question Answering

• Directly answer natural language questions based on information presented in a corpus of textual documents (e.g., the web).
  • Who is the president of the United States?
    • Donald Trump
  • What is the population of Massachusetts?
    • 6.8 million

Page 60: CS 6120/CS4120: Natural Language Processing

Text Summarization

• Produce a short summary of one or many longer document(s).
  • Article: An international team of scientists studied diet and mortality in 135,335 people between 35 and 70 years old in 18 countries, following them for an average of more than seven years. Diet information depended on self-reports, and the scientists controlled for factors including age, sex, smoking, physical activity and body mass index. The study is in The Lancet. Compared with people who ate the lowest 20 percent of carbohydrates, those who ate the highest 20 percent had a 28 percent increased risk of death. But high carbohydrate intake was not associated with cardiovascular death. …

  • Summary: Researchers found that people who ate higher amounts of carbohydrates had a higher risk of dying than those who ate more fats.

Page 61: CS 6120/CS4120: Natural Language Processing

Spoken Dialogue Systems -- Chatbots

• Q: Is it going to rain today?
• A: It will be mostly sunny. No rain is expected.

Page 62: CS 6120/CS4120: Natural Language Processing

Machine Translation

• Translate a sentence from one natural language to another.
  • 我喜欢汉堡 → I like burgers.

Page 64: CS 6120/CS4120: Natural Language Processing

Ambiguity Resolution is Required for Translation

• Syntactic and semantic ambiguities must be properly resolved for correct translation:
  • “John plays the guitar.” → “John 弹吉他”
  • “John plays soccer.” → “John 踢足球”

• An apocryphal story is that an early MT system gave the following results when translating from English to Russian and then back to English:
  • “The spirit is willing but the flesh is weak.” → “The liquor is good but the meat is spoiled.”
  • “Out of sight, out of mind.” → “Invisible idiot.”

Page 65: CS 6120/CS4120: Natural Language Processing

Resolving Ambiguity

• Choosing the correct interpretation of linguistic utterances requires (commonsense) knowledge of:
  • Syntax
    • An agent is typically the subject of the verb
  • Semantics
    • Michael and Ellen are names of people
    • Austin is the name of a city (and of a person)
    • Toyota is a car company and Prius is a brand of car
  • Pragmatics
    • Social norms, communicative goals
    • Asking a question, expecting an answer
  • World knowledge
    • Credit cards require users to pay financial interest
    • Agents must be animate and a hammer is not animate

Page 66: CS 6120/CS4120: Natural Language Processing

State of the Art

• Learning from large amounts of text data (cf. rule-based methods)
  • Supervised learning or unsupervised learning

• Statistical machine learning-based methods
  • The probabilistic knowledge acquired allows robust processing that handles linguistic regularities as well as exceptions.

• Now mostly neural network-based methods

Page 67: CS 6120/CS4120: Natural Language Processing

Related Fields

• Artificial Intelligence
• Machine Learning
• Linguistics
• Cognitive science
• Logic
• Data science
• Political science
• Education
• …many more

Page 68: CS 6120/CS4120: Natural Language Processing

Relevant Scientific Conferences and Journals

• Association for Computational Linguistics (ACL)
• North American Chapter of the Association for Computational Linguistics (NAACL)
• Empirical Methods in Natural Language Processing (EMNLP)
• International Conference on Computational Linguistics (COLING)
• Conference on Computational Natural Language Learning (CoNLL)
• Transactions of the Association for Computational Linguistics (TACL)
• Journal of Computational Linguistics (CL)