Prof. Erik Demaine -...

6.006 Introduction to Algorithms

Lecture 1: Document Distance
Prof. Erik Demaine

Your Professors

Prof. Erik Demaine, Prof. Piotr Indyk, Prof. Manolis Kellis

Your TAs

Kevin Kelley, Joseph Laurendi, Tianren Qi, Nicholas Zehender, David Wen

Your Textbook

Administrivia
• Handout: Course information
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn’t fill out the form already
• Sign up for problem set server: https://alg.csail.mit.edu/
• Sign up for a Piazzza account to ask/answer questions: http://piazzza.com/
• Prereqs: 6.01 (Python), 6.042 (discrete math)
• Grades:
  – Problem sets (30%)
  – Quiz 1 (20%; Mar. 8 @ 7:30–9:30pm)
  – Quiz 2 (20%; Apr. 13 @ 7:30–9:30pm)
  – Final (30%)
• Lectures & Recitations; Homework labs; Quiz reviews
• Read collaboration policy!

Today
• Class overview
  – What’s a (good) algorithm?
  – Topics
• Document Distance
  – Vector space model
  – Algorithms
  – Python profiling & gotchas

What’s an Algorithm?
• Mathematical abstraction of computer program
• Well-specified method for solving a computational problem
  – Typically, a finite sequence of operations
• Description might be structured English, pseudocode, or real code
• Key: no ambiguity

http://en.wikipedia.org/wiki/File:Euclid_flowchart_1.png
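As one illustration of "a finite sequence of operations" with no ambiguity, here is a minimal sketch of Euclid's greatest-common-divisor algorithm, the subject of the flowchart image linked on this slide. The function name `euclid_gcd` is chosen here for illustration and is not from the lecture.

```python
def euclid_gcd(a: int, b: int) -> int:
    """Greatest common divisor of two non-negative integers (Euclid's algorithm)."""
    while b != 0:
        # Replace (a, b) with (b, a mod b). b strictly decreases,
        # so the loop must stop after finitely many steps.
        a, b = b, a % b
    return a


print(euclid_gcd(1071, 462))  # 21
```

Every step is fully specified: given the same inputs, any reader (or machine) performs exactly the same operations.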

al-Khwārizmī (c. 780–850)
• “al-kha-raz-mi”

http://en.wikipedia.org/wiki/File:Abu_Abdullah_Muhammad_bin_Musa_al‐Khwarizmi_edit.png

http://en.wikipedia.org/wiki/Al‐Khwarizmi

al-Khwārizmī (c. 780–850)
• “al-kha-raz-mi”
• Father of algebra
  – The Compendious Book on Calculation by Completion and Balancing (c. 830)
  – Linear & quadratic equations: some of the first algorithms

http://en.wikipedia.org/wiki/File:Image‐Al‐Kit%C4%81b_al‐mu%E1%B8%ABta%E1%B9%A3ar_f%C4%AB_%E1%B8%A5is%C4%81b_al‐%C4%9Fabr_wa‐l‐muq%C4%81bala.jpg

http://en.wikipedia.org/wiki/Al‐Khwarizmi

Efficient Algorithms
• Want an algorithm that’s
  – Correct
  – Fast
  – Small space
  – General
  – Simple
  – Clever

Efficient Algorithms
• Mainly interested in scalability as problem size grows

Why Efficient Algorithms?
• Save wait time, storage needs, energy consumption/cost, …
• Scalability = win
  – Solve bigger problems given fixed resources (CPU, memory, disk, etc.)
• Optimize travel time, schedule conflicts, …

How to Design an Efficient Algorithm?
1. Define computational problem
2. Abstract irrelevant detail
3. Reduce to a problem you learn here (or 6.046 or algorithmic literature)
4. Else design using “algorithmic toolbox”
5. Analyze algorithm’s scalability
6. Implement & evaluate performance
7. Repeat (optimize, generalize)

Modules & Applications
   Module                    Application
1. Introduction              Document similarity
2. Binary Search Trees       Scheduling
3. Hashing                   File synchronization
4. Sorting                   Spreadsheets
5. Graph Search              Rubik’s Cube
6. Shortest Paths            Google Maps
7. Dynamic Programming       Justifying text, packing, …
8. Numbers Pictures (NP)     Computing π, collision detection, hard problem
9. Beyond                    Folding, streaming, bio

Document Distance
• Given two documents, how similar are they?
• Applications:
  – Find similar documents
  – Detect plagiarism/duplicates
  – Web search (one “document” is query)

http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks/

http://www.google.com/

Document Distance
• How to define “document”?
• Word = sequence of alphanumeric characters
• Document = sequence of words
  – Ignore punctuation & formatting

Document Distance
• How to define “distance”?
• Idea: focus on shared words
• Word frequencies:
  – $D(w)$ = # occurrences of word $w$ in document $D$

Vector Space Model [Salton, Wong, Yang 1975]
• Treat each document as a vector of its words
  – One coordinate for every possible word
• Example:
  – $D_1$ = “the cat”
  – $D_2$ = “the dog”
• Similarity between vectors?
  – Dot product: $D_1 \cdot D_2 = \sum_w D_1(w) \cdot D_2(w)$

http://portal.acm.org/citation.cfm?id=361220

[Figure: document vectors plotted on axes ‘the’, ‘cat’, ‘dog’]
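The vector space model above can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's code: it assumes words are separated by whitespace, and the helper names `word_frequencies` and `dot_product` are chosen here for illustration.

```python
from collections import Counter


def word_frequencies(text: str) -> Counter:
    # Assumption: split on whitespace only; punctuation handling is ignored here.
    return Counter(text.lower().split())


def dot_product(d1: Counter, d2: Counter) -> int:
    # A Counter returns 0 for missing words, so only shared words contribute.
    return sum(count * d2[word] for word, count in d1.items())


d1 = word_frequencies("the cat")
d2 = word_frequencies("the dog")
print(dot_product(d1, d2))  # 1
```

Only the shared word ‘the’ contributes to the sum, so the dot product of the two example documents is 1.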

• Problem: Dot product not scale invariant
• Example 1:
  – $D_1$ = “the cat”
  – $D_2$ = “the dog”
  – $D_1 \cdot D_2 = 1$
• Example 2:
  – $D_1$ = “the cat the cat”
  – $D_2$ = “the dog the dog”
  – $D_1 \cdot D_2 = 4$

[Figure: document vectors plotted on axes ‘the’, ‘cat’, ‘dog’]

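• Idea: Normalize by # words: $\dfrac{D_1 \cdot D_2}{|D_1| \, |D_2|}$
• Geometric solution: angle between vectors: $\arccos \dfrac{D_1 \cdot D_2}{|D_1| \, |D_2|}$
  – 0 = “identical”, 90° = orthogonal (no shared words)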

[Figure: angle between document vectors on axes ‘the’, ‘cat’, ‘dog’]
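To see why the angle fixes the scaling problem, the following sketch computes the arccosine of the dot product divided by the product of the vector lengths. It assumes whitespace-separated words, non-empty documents, and Euclidean length for each vector; the function name `angle` is illustrative, not from the lecture.

```python
import math
from collections import Counter


def angle(doc1: str, doc2: str) -> float:
    """Angle in radians between the word-frequency vectors of two non-empty documents."""
    d1, d2 = Counter(doc1.split()), Counter(doc2.split())
    dot = sum(count * d2[word] for word, count in d1.items())
    norm1 = math.sqrt(sum(c * c for c in d1.values()))
    norm2 = math.sqrt(sum(c * c for c in d2.values()))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.acos(cosine)


# Both pairs from the scale-invariance examples give about 60 degrees.
print(math.degrees(angle("the cat", "the dog")))
print(math.degrees(angle("the cat the cat", "the dog the dog")))
```

Doubling both documents multiplies the dot product by 4 but also multiplies each vector length by 2, so the ratio, and hence the angle, is unchanged.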

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product

Algorithm
1. Read documents
2. Split each document into words
   – re.findall(r'\w+', doc)
   – But how does this actually work?
3. Count word frequencies (document vectors)
4. Compute dot product
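For reference, here is a small usage sketch of the `re.findall` call from step 2. The sample string is made up for illustration. Note that in Python 3, `\w` also matches underscores and non-ASCII letters, which is slightly broader than "alphanumeric".

```python
import re

doc = "The cat, the dog & the bird."
# \w+ matches maximal runs of word characters, so punctuation and spaces act as separators.
words = re.findall(r"\w+", doc)
print(words)  # ['The', 'cat', 'the', 'dog', 'the', 'bird']
```

The call does not change case, so ‘The’ and ‘the’ would count as different words unless the text is lowercased first.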

Algorithm
1. Read documents
2. Split each document into words
   – For each line in document:
       For each character in line:
         If not alphanumeric:
           Add previous word (if any) to list
           Start new word
3. Count word frequencies (document vectors)
4. Compute dot product
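The character-by-character splitting in step 2 could look roughly like the sketch below. Lowercasing each character and flushing a pending word at the end of each line are assumptions added here; the function name `get_words` is illustrative.

```python
def get_words(document_lines):
    """Split an iterable of lines into lowercase alphanumeric words."""
    words = []
    for line in document_lines:
        current = []  # characters of the word being built
        for ch in line:
            if ch.isalnum():
                current.append(ch.lower())  # assumption: case is ignored
            elif current:
                # Non-alphanumeric character ends the previous word (if any).
                words.append("".join(current))
                current = []
        if current:
            # Assumption: a word still pending at the end of a line is complete.
            words.append("".join(current))
    return words


print(get_words(["The cat,", "the dog!"]))  # ['the', 'cat', 'the', 'dog']
```

Collecting characters in a list and joining once per word avoids building each word by repeated string concatenation.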

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
   a. Sort the word list
   b. For each word in word list:
      – If same as last word:
          Increment counter
      – Else:
          Add last word and its counter to list
          Reset counter to 1 (counting the new word)
4. Compute dot product
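A minimal sketch of the sort-then-count approach in step 3 follows. It uses Python's built-in `sorted` rather than a hand-written sort, starts each new word's counter at 1 so that its first occurrence is counted, and flushes the final word after the loop; the function name is illustrative.

```python
def count_frequencies_sorted(words):
    """Return a sorted list of (word, count) pairs."""
    freq = []
    last_word, count = None, 0
    for word in sorted(words):
        if word == last_word:
            count += 1
        else:
            if last_word is not None:
                freq.append((last_word, count))
            # Start counting the new word; this occurrence is its first.
            last_word, count = word, 1
    if last_word is not None:
        freq.append((last_word, count))  # flush the final word
    return freq


print(count_frequencies_sorted(["the", "cat", "the", "dog"]))
# [('cat', 1), ('dog', 1), ('the', 2)]
```

After sorting, equal words are adjacent, so one pass with a single counter is enough.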

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:

   For every possible word:
     Look up frequency in each document
     Multiply
     Add to total

   For every word in first document:
     If it appears in second document:
       Multiply word frequencies
       Add to total

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   a. Start at first word of each document (in sorted order)
   b. If words are equal:
        Multiply word frequencies
        Add to total
   c. In whichever document has lexically lesser word, advance to next word
   d. Repeat until either document out of words
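Steps a through d amount to a merge-style scan over two sorted lists. The sketch below assumes each document is already a sorted list of `(word, count)` pairs, and it advances both positions when the words are equal, a detail the steps leave implicit. The function name is illustrative.

```python
def dot_product_sorted(freq1, freq2):
    """Dot product of two documents given as sorted lists of (word, count) pairs."""
    total = 0
    i = j = 0
    while i < len(freq1) and j < len(freq2):
        word1, count1 = freq1[i]
        word2, count2 = freq2[j]
        if word1 == word2:
            total += count1 * count2
            i += 1  # assumption: on a match, advance in both documents
            j += 1
        elif word1 < word2:
            i += 1  # first document has the lexically lesser word
        else:
            j += 1  # second document has the lexically lesser word
    return total


doc1 = [("cat", 1), ("the", 1)]
doc2 = [("dog", 1), ("the", 1)]
print(dot_product_sorted(doc1, doc2))  # 1
```

Each loop iteration advances at least one position, so the scan takes time linear in the total number of distinct words.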

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
   a. Initialize a dictionary mapping words to counts
   b. For each word in word list:
      – If in dictionary:
          Increment counter
      – Else:
          Put 1 in dictionary (first occurrence)
4. Compute dot product

   For every word in first document:
     If it appears in second document:
       Multiply word frequencies
       Add to total
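The dictionary-based variant of steps 3 and 4 can be sketched as follows. A word seen for the first time is stored with count 1 here, so that every occurrence is counted; the function names are illustrative.

```python
def count_frequencies(words):
    """Map each word to its number of occurrences."""
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1  # first occurrence of this word
    return freq


def dot_product(freq1, freq2):
    """Sum of products of frequencies over words shared by both documents."""
    total = 0
    for word, count in freq1.items():
        if word in freq2:
            total += count * freq2[word]
    return total


f1 = count_frequencies(["the", "cat", "the", "cat"])
f2 = count_frequencies(["the", "dog", "the", "dog"])
print(dot_product(f1, f2))  # 4
```

No sorting is needed, because dictionary lookups find matching words directly.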

Python Implementations

Python Profiling

Culprit

Python Implementations
Version   Change                                              Running time
docdist1  initial version
docdist2  add profiling                                       192.5 sec
docdist3  replace + with extend                               126.5 sec
docdist4  count frequencies using dictionary                  73.4 sec
docdist5  split words with string.translate                   18.1 sec
docdist6  change insertion sort to merge sort                 11.5 sec
docdist7  no sorting, dot product with dictionary             1.8 sec
docdist8  split words on whole document, not line by line     0.2 sec

Experiments on Intel Pentium 4, 2.8 GHz, Python 2.6.2, Linux 2.6.18.
Document 1 (t2.bobsey.txt) has 268,778 lines, 49,785 words, 3,354 distincts.
Document 2 (t3.lewis.txt) has 1,031,470 lines, 182,355 words, 8,530 distincts.
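As a rough illustration of profiling and of the “replace + with extend” change listed for docdist3, the sketch below profiles two ways of accumulating words with the standard-library `cProfile` and `pstats` modules. The function names, the `profile` helper, and the sample input are made up for illustration; this is not the docdist code, and timings will differ from the table.

```python
import cProfile
import io
import pstats


def concat_with_plus(lines):
    words = []
    for line in lines:
        words = words + line.split()  # builds a brand-new list every iteration
    return words


def concat_with_extend(lines):
    words = []
    for line in lines:
        words.extend(line.split())  # appends to the existing list in place
    return words


def profile(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return result, stream.getvalue()


lines = ["the cat and the dog"] * 2000
result_plus, report_plus = profile(concat_with_plus, lines)
result_extend, report_extend = profile(concat_with_extend, lines)
print(report_plus)
print(report_extend)
```

Both functions return the same list. Using `+` copies the whole accumulated list on every line, which is the kind of hidden cost a profile can surface, whereas `extend` only adds the new words.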

Don’t Forget!
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn’t already receive a recitation assignment from us
• Sign up for problem set server: https://alg.csail.mit.edu/
• Sign up for a Piazzza account to ask/answer questions: http://piazzza.com/
