Post on 28-Mar-2018
6.006 Introduction to Algorithms
Lecture 1: Document Distance
Prof. Erik Demaine
Your Professors
Prof. Erik Demaine, Prof. Piotr Indyk, Prof. Manolis Kellis
Your TAs
Kevin Kelley, Joseph Laurendi, Tianren Qi, Nicholas Zehender, David Wen
Your Textbook
Administrivia
• Handout: Course information
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn’t fill out the form already
• Sign up for problem set server: https://alg.csail.mit.edu/
• Sign up for Piazzza account to ask/answer questions: http://piazzza.com/
• Prereqs: 6.01 (Python), 6.042 (discrete math)
• Grades:
  – Problem sets (30%)
  – Quiz 1 (20%; Mar. 8 @ 7.30–9.30pm)
  – Quiz 2 (20%; Apr. 13 @ 7.30–9.30pm)
  – Final (30%)
• Lectures & Recitations; Homework labs; Quiz reviews
• Read collaboration policy!
Today
• Class overview
  – What’s a (good) algorithm?
  – Topics
• Document Distance
  – Vector space model
  – Algorithms
  – Python profiling & gotchas
What’s an Algorithm?
• Mathematical abstraction of computer program
• Well-specified method for solving a computational problem
  – Typically, a finite sequence of operations
• Description might be structured English, pseudocode, or real code
• Key: no ambiguity
http://en.wikipedia.org/wiki/File:Euclid_flowchart_1.png
al-Khwārizmī (c. 780–850)
• “al-kha-raz-mi”
http://en.wikipedia.org/wiki/File:Abu_Abdullah_Muhammad_bin_Musa_al-Khwarizmi_edit.png
http://en.wikipedia.org/wiki/Al-Khwarizmi
al-Khwārizmī (c. 780–850)
• “al-kha-raz-mi”
• Father of algebra
  – The Compendious Book on Calculation by Completion and Balancing (c. 830)
  – Linear & quadratic equations: some of the first algorithms
http://en.wikipedia.org/wiki/File:Image-Al-Kit%C4%81b_al-mu%E1%B8%ABta%E1%B9%A3ar_f%C4%AB_%E1%B8%A5is%C4%81b_al-%C4%9Fabr_wa-l-muq%C4%81bala.jpg
http://en.wikipedia.org/wiki/Al-Khwarizmi
Efficient Algorithms
• Want an algorithm that’s
  – Correct
  – Fast
  – Small space
  – General
  – Simple
  – Clever
• Mainly interested in scalability as problem size grows
Why Efficient Algorithms?
• Save wait time, storage needs, energy consumption/cost, …
• Scalability = win
  – Solve bigger problems given fixed resources (CPU, memory, disk, etc.)
• Optimize travel time, schedule conflicts, …
How to Design an Efficient Algorithm?
1. Define computational problem
2. Abstract irrelevant detail
3. Reduce to a problem you learn here (or 6.046 or algorithmic literature)
4. Else design using “algorithmic toolbox”
5. Analyze algorithm’s scalability
6. Implement & evaluate performance
7. Repeat (optimize, generalize)
Modules & Applications
Module                       Application
1. Introduction              Document similarity
2. Binary Search Trees       Scheduling
3. Hashing                   File synchronization
4. Sorting                   Spreadsheets
5. Graph Search              Rubik’s Cube
6. Shortest Paths            Google Maps
7. Dynamic Programming       Justifying text, packing, …
8. Numbers Pictures (NP)     Computing π, collision detection, hard problem
9. Beyond                    Folding, streaming, bio
Document Distance
• Given two documents, how similar are they?
• Applications:
  – Find similar documents
  – Detect plagiarism/duplicates
  – Web search (one “document” is query)
http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks/
http://www.google.com/
Document Distance
• How to define “document”?
• Word = sequence of alphanumeric characters
• Document = sequence of words
  – Ignore punctuation & formatting
Document Distance
• How to define “distance”?
• Idea: focus on shared words
• Word frequencies:
  – $D(w)$ = # occurrences of word $w$ in document $D$
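As a minimal sketch of the word-frequency idea, assuming a document has already been split into a list of words, Python's `collections.Counter` can play the role of the frequency function. The helper name `word_frequencies` is hypothetical and not taken from the lecture code.

```python
from collections import Counter

def word_frequencies(words):
    """Map each word to the number of times it occurs in the word list."""
    return Counter(words)

freq = word_frequencies(["the", "cat", "the", "dog"])
# A Counter reports 0 for words that never occur.
print(freq["the"], freq["cat"], freq["fish"])  # 2 1 0
```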
Vector Space Model [Salton, Wong, Yang 1975]
• Treat each document as a vector of its words
  – One coordinate for every possible word
• Example:
  – $D_1$ = “the cat”
  – $D_2$ = “the dog”
• Similarity between vectors?
  – Dot product: $D_1 \cdot D_2 = \sum_w D_1(w) \, D_2(w)$
http://portal.acm.org/citation.cfm?id=361220
[Figure: the two example documents drawn as vectors on axes ‘the’, ‘cat’, ‘dog’, with a tick at 1 on each axis]
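The dot product of two document vectors can be sketched as follows, assuming each document is stored as a dictionary from word to count (an illustrative representation; `dot_product` is a hypothetical helper name). Words missing from either document contribute zero, so only shared words need to be visited.

```python
def dot_product(d1, d2):
    """Sum d1[w] * d2[w] over the words w that appear in both dictionaries."""
    return sum(count * d2[word] for word, count in d1.items() if word in d2)

the_cat = {"the": 1, "cat": 1}
the_dog = {"the": 1, "dog": 1}
print(dot_product(the_cat, the_dog))  # 1, from the single shared word "the"
```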
Vector Space Model [Salton, Wong, Yang 1975]
• Problem: Dot product not scale invariant
• Example 1:
  – $D_1$ = “the cat”
  – $D_2$ = “the dog”
  – $D_1 \cdot D_2 = 1$
• Example 2:
  – $D_1$ = “the cat the cat”
  – $D_2$ = “the dog the dog”
  – $D_1 \cdot D_2 = 4$
[Figure: the vectors from Example 1 and Example 2 on axes ‘the’, ‘cat’, ‘dog’, with ticks at 1 and 2]
http://portal.acm.org/citation.cfm?id=361220
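The scale problem can be checked directly. This is a small sketch, assuming whitespace splitting is good enough for the toy examples; the variable names are illustrative only.

```python
from collections import Counter

def dot_product(d1, d2):
    """Sum of products of counts over words present in both documents."""
    return sum(count * d2[word] for word, count in d1.items() if word in d2)

short = dot_product(Counter("the cat".split()), Counter("the dog".split()))
doubled = dot_product(Counter("the cat the cat".split()),
                      Counter("the dog the dog".split()))
# Repeating each document twice quadruples the dot product,
# even though the documents are no more similar than before.
print(short, doubled)  # 1 4
```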
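Vector Space Model [Salton, Wong, Yang 1975]
• Idea: Normalize by # words: $\frac{D_1 \cdot D_2}{|D_1| \, |D_2|}$
• Geometric solution: angle between vectors: $\theta(D_1, D_2) = \arccos \frac{D_1 \cdot D_2}{|D_1| \, |D_2|}$
  – 0 = “identical”, 90° = orthogonal (no shared words)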
[Figure: angle between the two document vectors on axes ‘the’, ‘cat’, ‘dog’]
http://portal.acm.org/citation.cfm?id=361220
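A hedged sketch of the angle computation, assuming documents are word-count dictionaries, both are non-empty, and the normalizing length of a vector is its Euclidean norm, the square root of its dot product with itself. The function names are hypothetical.

```python
import math

def dot_product(d1, d2):
    """Sum of products of counts over words present in both documents."""
    return sum(count * d2[word] for word, count in d1.items() if word in d2)

def vector_angle(d1, d2):
    """Angle in radians between two non-empty word-count vectors."""
    cosine = dot_product(d1, d2) / math.sqrt(dot_product(d1, d1) * dot_product(d2, d2))
    # Clamp to guard against rounding slightly above 1 before calling acos.
    return math.acos(min(1.0, cosine))

the_cat = {"the": 1, "cat": 1}
the_dog = {"the": 1, "dog": 1}
doubled_cat = {"the": 2, "cat": 2}
doubled_dog = {"the": 2, "dog": 2}
# Both pairs give the same angle (about 60 degrees), so the measure is scale invariant.
print(math.degrees(vector_angle(the_cat, the_dog)))
print(math.degrees(vector_angle(doubled_cat, doubled_dog)))
```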
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product
Algorithm
1. Read documents
2. Split each document into words
  – re.findall(r'\w+', doc)
  – But how does this actually work?
3. Count word frequencies (document vectors)
4. Compute dot product
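For reference, the regular-expression approach can be sketched like this. Lowercasing is an added assumption so that “The” and “the” count as one word, and `\w` also matches underscores, which is slightly broader than “alphanumeric”. The name `get_words` is illustrative.

```python
import re

def get_words(doc):
    """Return the maximal runs of word characters in doc, lowercased."""
    return re.findall(r"\w+", doc.lower())

print(get_words("The cat, the dog!"))  # ['the', 'cat', 'the', 'dog']
```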
Algorithm
1. Read documents
2. Split each document into words
  – For each line in document:
      For each character in line:
        If not alphanumeric:
          Add previous word (if any) to list
          Start new word
3. Count word frequencies (document vectors)
4. Compute dot product
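The character-by-character split above can be sketched as follows. This is an illustration rather than the course's docdist code: the function names are hypothetical, lowercasing is an added assumption, and a final flush handles a word that runs to the end of a line.

```python
def get_words_from_line(line):
    """Split one line into lowercase alphanumeric words, one character at a time."""
    words = []
    current = []  # characters of the word being built
    for ch in line:
        if ch.isalnum():
            current.append(ch.lower())
        elif current:
            # Non-alphanumeric character: add the previous word (if any), start a new one.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))  # the line ended in the middle of a word
    return words

def get_words(lines):
    """Collect the words of every line into a single list."""
    words = []
    for line in lines:
        words.extend(get_words_from_line(line))
    return words

print(get_words(["The cat,", "the dog!"]))  # ['the', 'cat', 'the', 'dog']
```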
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
  a. Sort the word list
  b. For each word in word list:
    – If same as last word: Increment counter
    – Else: Add last word and its counter to list; reset counter to 1
4. Compute dot product
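Counting by sorting can be sketched as below. The name `count_frequency_sorted` is hypothetical, the built-in `sorted` stands in for whatever sort the implementation uses, and the counter starts at 1 for each new word because the word itself is its first occurrence.

```python
def count_frequency_sorted(words):
    """Return a sorted list of (word, count) pairs, grouping equal neighbors."""
    pairs = []
    last, count = None, 0
    for word in sorted(words):
        if word == last:
            count += 1
        else:
            if last is not None:
                pairs.append((last, count))
            last, count = word, 1
    if last is not None:
        pairs.append((last, count))  # flush the final group
    return pairs

print(count_frequency_sorted(["the", "cat", "the", "dog"]))
# [('cat', 1), ('dog', 1), ('the', 2)]
```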
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
     For every possible word:
       Look up frequency in each document
       Multiply
       Add to total
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
     For every word in first document:
       If it appears in second document:
         Multiply word frequencies
         Add to total
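With documents stored as lists of (word, count) pairs, this loop can be sketched as a nested scan. The pair-list representation and the name `dot_product_pairs` are assumptions for illustration. Each lookup in the second document is a linear search, so the cost grows with the product of the two list lengths.

```python
def dot_product_pairs(pairs1, pairs2):
    """Dot product of two documents given as lists of (word, count) pairs."""
    total = 0
    for word1, count1 in pairs1:
        for word2, count2 in pairs2:  # linear search for word1 in the second document
            if word1 == word2:
                total += count1 * count2
    return total

doc1 = [("cat", 1), ("the", 2)]
doc2 = [("dog", 3), ("the", 4)]
print(dot_product_pairs(doc1, doc2))  # 8, from "the": 2 * 4
```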
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
  a. Start at first word of each document (in sorted order)
  b. If words are equal:
       Multiply word frequencies
       Add to total
  c. In whichever document has lexically lesser word, advance to next word
  d. Repeat until either document out of words
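The sorted-order walk resembles the merge step of merge sort. A minimal sketch, assuming both documents are lists of (word, count) pairs sorted by word; the name `dot_product_sorted` is hypothetical, and on equal words this version advances in both lists.

```python
def dot_product_sorted(pairs1, pairs2):
    """Dot product of two word-sorted lists of (word, count) pairs in one pass."""
    total = 0
    i = j = 0
    while i < len(pairs1) and j < len(pairs2):
        word1, count1 = pairs1[i]
        word2, count2 = pairs2[j]
        if word1 == word2:
            total += count1 * count2
            i += 1
            j += 1
        elif word1 < word2:
            i += 1  # the first document has the lexically lesser word
        else:
            j += 1  # the second document has the lexically lesser word
    return total

doc1 = [("cat", 1), ("the", 2)]
doc2 = [("dog", 3), ("the", 4)]
print(dot_product_sorted(doc1, doc2))  # 8
```

Each step advances at least one index, so the running time is linear in the total number of distinct words.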
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
  a. Initialize a dictionary mapping words to counts
  b. For each word in word list:
    – If in dictionary: Increment counter
    – Else: Put 1 in dictionary
4. Compute dot product
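The dictionary-based count can be sketched directly. The name `count_frequency` is illustrative; the sketch stores 1 the first time a word is seen so that the final values are occurrence counts.

```python
def count_frequency(words):
    """Return a dictionary mapping each word to its number of occurrences."""
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1  # first occurrence of this word
    return freq

print(count_frequency(["the", "cat", "the", "dog"]))
# {'the': 2, 'cat': 1, 'dog': 1}
```

Dictionary lookups take constant expected time, so no sorting is needed for this step.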
Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
     For every word in first document:
       If it appears in second document:
         Multiply word frequencies
         Add to total
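Putting the four steps together with dictionaries gives a short end-to-end sketch. It is not the course's docdist code: the function names, the lowercasing, the use of `re.findall`, and the Euclidean normalization are assumptions made for illustration, and both inputs are assumed to contain at least one word.

```python
import math
import re

def count_frequency(text):
    """Steps 2 and 3: split text into words and count them in a dictionary."""
    freq = {}
    for word in re.findall(r"\w+", text.lower()):
        freq[word] = freq.get(word, 0) + 1
    return freq

def dot_product(d1, d2):
    """Step 4: for every word in the first document, look it up in the second."""
    total = 0
    for word, count in d1.items():
        if word in d2:
            total += count * d2[word]
    return total

def document_angle(text1, text2):
    """Angle in radians between the word-count vectors of two non-empty texts."""
    d1, d2 = count_frequency(text1), count_frequency(text2)
    cosine = dot_product(d1, d2) / math.sqrt(dot_product(d1, d1) * dot_product(d2, d2))
    return math.acos(min(1.0, cosine))

print(math.degrees(document_angle("the cat", "the dog")))  # about 60 degrees
```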
Python Implementations
Python Profiling
Culprit
Fix
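One way to find a culprit is Python's built-in `cProfile` module. The sketch below profiles a toy pipeline; the functions `get_words`, `count_frequency`, and `run` and the input data are hypothetical stand-ins, not the lecture's docdist programs.

```python
import cProfile
import io
import pstats

def get_words(lines):
    words = []
    for line in lines:
        words = words + line.split()  # list concatenation copies the list each time
    return words

def count_frequency(words):
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq

def run():
    return count_frequency(get_words(["the cat the dog"] * 2000))

profiler = cProfile.Profile()
profiler.enable()
result = run()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats()
print(report.getvalue())
```

The function with the largest cumulative time in the report is the first place to look for a fix.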
Python Implementations
Version    Change                                               Time
docdist1   initial version
docdist2   add profiling                                        192.5 sec
docdist3   replace + with extend                                126.5 sec
docdist4   count frequencies using dictionary                   73.4 sec
docdist5   split words with string.translate                    18.1 sec
docdist6   change insertion sort to merge sort                  11.5 sec
docdist7   no sorting, dot product with dictionary              1.8 sec
docdist8   split words on whole document, not line by line      0.2 sec
Experiments on Intel Pentium 4, 2.8 GHz, Python 2.6.2, Linux 2.6.18.
Document 1 (t2.bobsey.txt) has 268,778 lines, 49,785 words, 3,354 distincts.
Document 2 (t3.lewis.txt) has 1,031,470 lines, 182,355 words, 8,530 distincts.
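The “replace + with extend” step can be illustrated with a small timing sketch. The input data and function names here are hypothetical, and the measured times depend on the machine and Python version, so none are quoted.

```python
import timeit

LINES = ["the cat the dog"] * 2000  # hypothetical input, 4 words per line

def words_with_plus(lines):
    words = []
    for line in lines:
        words = words + line.split()  # builds a brand-new list on every iteration
    return words

def words_with_extend(lines):
    words = []
    for line in lines:
        words.extend(line.split())  # appends to the existing list in place
    return words

plus_time = timeit.timeit(lambda: words_with_plus(LINES), number=3)
extend_time = timeit.timeit(lambda: words_with_extend(LINES), number=3)
print(f"+: {plus_time:.4f} s, extend: {extend_time:.4f} s")
```

Both functions return the same list; the difference is that `+` copies everything collected so far, which makes the loop quadratic in the number of words.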
Don’t Forget!
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn’t already receive a recitation assignment from us
• Sign up for problem set server: https://alg.csail.mit.edu/
• Sign up for Piazzza account to ask/answer questions: http://piazzza.com/