Big Algorithms Made Easy with Microsoft's F#
![Page 1: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/1.jpg)
Joel Pobar Languages Geek DEV450 http://callvirt.net/blog/post/Why-F-(TechEd-09-DEV450).aspx
![Page 2: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/2.jpg)
Agenda
What is it?
F# Intro
Algorithms: Search
Fuzzy Matching
Classification (SVM)
Recommendations
Q&A
![Page 3: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/3.jpg)
All This in 1 hour?
This is an awareness session! Lots of content, very broad, very fast
You’ll get all demos, pointers, and slide deck to take offline and digest
Two takeaways: F# is a great language for data
Smart algorithms aren’t hard – use them, explore more!
![Page 4: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/4.jpg)
F# is
...a functional, object-oriented, imperative and explorative programming language for .NET
What is Functional Programming?
![Page 5: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/5.jpg)
What is Functional Programming?
Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data”
-> Emphasizes functions
-> Emphasizes shapes of data, rather than impl.
-> Modeled on lambda calculus
-> Reduced emphasis on imperative
-> Safely raises level of abstraction
![Page 6: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/6.jpg)
Motivation for Functional
Simplicity in life is good: cheaper, easier, faster, better.
We typically achieve simplicity in software in two ways:
By raising the level of abstraction (and OO was one design to raise abstraction)
Increasing modularity
Better composition and modularity == reuse
Increasing signal to noise is another good strategy:
Communicate more in less time with more clarity
![Page 7: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/7.jpg)
Functional Programming Safer, while still being useful
[Chart: one axis runs from Unsafe to Safe, the other from Not Useful to Useful. Plotted labels: C#, C++, …; V.Next#; Haskell; F#]
![Page 8: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/8.jpg)
Motivation for Functional
Data-driven world: more and more data, so we need higher-order algorithms and techniques to derive value from data
Scalability is king: economies of software scale are changing, and the web requires tools + frameworks + languages that scale to millions
The multi-core (r)evolution: we need more adaptive languages + compilers to scale
Language features matter!
![Page 9: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/9.jpg)
What is F# for?
F# is a general-purpose language
Can be used for a broad range of programming tasks
Superset of imperative and dynamic features
Great for learning FP concepts
Some particularly important domains: Financial modelling
Data mining
Scientific analysis
Academic
![Page 10: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/10.jpg)
Let
Let binds values to identifiers
let helloWorld = "Hello, World"
printfn "%A" helloWorld
let myNum = 12
let myAddFunction x y =
    let sum = x + y
    sum
Type inference. The static typing of C# with
the succinctness of a scripting language
![Page 11: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/11.jpg)
Tuples
Simple, very useful data structure
let site1 = ("msdn.com", 10)
let site2 = ("abc.net.au", 12)
let site3 = ("news.com.au", 22)
let allSites = (site1, site2, site3)
let fst (a, b) = a
let snd (a, b) = b
![Page 12: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/12.jpg)
List, Arrays, Seq, and Options
Lists and Arrays are first class citizens
Options provide a some-or-nothing capability
let list1 = ["Joel"; "Luke"]
let array = [| 2; 3; 5 |]
let myseq = seq [0; 1; 2]
let option1 = Some("Joel")
let option2 = None
![Page 13: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/13.jpg)
Records
Simple concrete type definition
type Person =
    { Name: string
      DateOfBirth: System.DateTime }
let n = { Name = "Joel"
          DateOfBirth = System.DateTime(1981, 4, 13) }
![Page 14: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/14.jpg)
Immutability
Values may not be changed
Data is immutable by default
![Page 15: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/15.jpg)
Discriminated Unions
Great for representing the structure of data
type Make = string
type Model = string
type Transport =
    | Car of Make * Model
    | Bicycle
let me = Car ("Holden", "Barina")
let you = Bicycle
Both of these identifiers are of type “Transport”
![Page 16: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/16.jpg)
Functions
Functions: like delegates, but unified and simple
Deep type inference
(fun x -> x + 1)
let myFunc x = x + 1
val myFunc : int -> int
let rec factorial n =
    if n > 1 then n * factorial (n - 1)
    else 1
let data = [5; 3; 4; 4; 5]
List.sortWith (fun x y -> x - y) data
![Page 17: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/17.jpg)
Pattern Matching
Helps tease apart data and data structures
Works best with Unions and Records
let (fst, _) = ("first", "second")
System.Console.WriteLine(fst)
let switchOnType (a: obj) =
    match a with
    | :? int -> printfn "int!"
    | :? Transport -> printfn "Transport"
    | _ -> printfn "Everything Else!"
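The slide says pattern matching works best with unions, but only shows a type test. A small sketch of matching on the `Transport` union from the Discriminated Unions slide follows; the `describe` function is a hypothetical helper added for illustration:

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// Match on the shape of the union and pull the fields out in one step.
// The compiler warns if a case is left unhandled.
let describe transport =
    match transport with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))   // prints "Car: Holden Barina"
printfn "%s" (describe Bicycle)                      // prints "Bicycle"
```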
![Page 18: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/18.jpg)
F# Interactive
![Page 19: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/19.jpg)
Search
Given a search term and a large document corpus, rank and return a list of the most relevant results…
![Page 20: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/20.jpg)
Blog Crawler
![Page 21: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/21.jpg)
Search
Words: stemming? tokenise
Markup: title/author/date
Links? A sign of strength?
Let’s explore something simple
![Page 22: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/22.jpg)
Search
Simplify: For easy machine/language manipulation
… and most importantly, easy computation
Vectors: nature's own quality data structure
Convenient machine representation (lists/arrays)
Lots of existing vector math algorithms
Sample post: After a loving incubation period, moonlight 2.0 has been released. <a href="whatever">source code</a><br><a href="something else">FireFox binaries</a> …
Resulting term counts: after 2, incubation 1, loving 1, moonlight 6, firefox 4, linux 6, binaries 2
![Page 23: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/23.jpg)
Term Count
Document1 (Linux post): the 9, incubation 1, crazy 1, moonlight 6, firefox 4, linux 6, penguin 2
Document2 (Animal post): the 2, crazy 2, dog 1, penguin 5
Vector space (term order): the, incubation, crazy, moonlight, firefox, linux, dog, penguin
Linux post: 9 1 1 6 4 6 0 2
Animal post: 2 0 2 0 0 0 1 5
![Page 24: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/24.jpg)
Term Count Issues
Query ‘the dog penguin’:
Linux: 9+0+2 = 11
Animal: 2+1+5 = 8
‘the’ is overweight
Enter TF-IDF: Term Frequency Inverse Document Frequency
A weight to evaluate how important a word is to a document within a corpus
i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
Term order: the, incubation, crazy, moonlight, firefox, linux, dog, penguin
Linux post: 9 1 1 6 4 6 0 2
Animal post: 2 0 2 0 0 0 1 5
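The term-count vectors and the naive query scoring above can be sketched as follows. This is an illustration, not the demo code from the talk: `termCounts`, `toVector` and `score` are hypothetical helper names, the tokeniser is deliberately crude (no stemming, no markup stripping), and the counts are typed in from the slides rather than computed from real posts.

```fsharp
open System

// Split text into lower-case words and count how often each one occurs.
let termCounts (text: string) =
    text.ToLowerInvariant().Split([| ' '; '.'; ','; '!'; '?' |], StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy id
    |> Map.ofSeq

// Term counts for the two example posts, as given on the slides.
let linuxPost =
    Map.ofList [ "the", 9; "incubation", 1; "crazy", 1; "moonlight", 6
                 "firefox", 4; "linux", 6; "penguin", 2 ]
let animalPost = Map.ofList [ "the", 2; "crazy", 2; "dog", 1; "penguin", 5 ]

// The shared vector space: one dimension per known term, in a fixed order.
let vocabulary =
    [ "the"; "incubation"; "crazy"; "moonlight"; "firefox"; "linux"; "dog"; "penguin" ]

// Project a document onto the vector space; missing terms count as 0.
let toVector (counts: Map<string, int>) =
    vocabulary |> List.map (fun term -> defaultArg (Map.tryFind term counts) 0)

// Naive relevance: add up the document's count for each distinct query term.
let score (query: string) (counts: Map<string, int>) =
    termCounts query
    |> Map.toSeq
    |> Seq.sumBy (fun (term, _) -> defaultArg (Map.tryFind term counts) 0)

printfn "%A" (toVector linuxPost)
printfn "%A" (toVector animalPost)
printfn "%d" (score "the dog penguin" linuxPost)    // 11
printfn "%d" (score "the dog penguin" animalPost)   // 8
```

The Linux post outranks the Animal post for an animal-flavoured query purely because of 'the', which is the weakness TF-IDF is meant to address.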
![Page 25: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/25.jpg)
TF-IDF
Normalise the term count against the doc: tf = termCount / docWordCount
Measure the importance of the term: idf = log(|D| / termInDocumentCount)
where |D| is the total number of documents in the corpus
tfidf = tf * idf
A high weight is reached by a high term frequency and a low document frequency
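The three formulas translate almost directly into F#. The sketch below assumes documents are already tokenised into lists of words; the three-document `corpus` is a made-up toy example, and the zero guards are an added assumption to avoid dividing by zero:

```fsharp
// tf = termCount / docWordCount
let tf (term: string) (doc: string list) =
    if List.isEmpty doc then 0.0
    else
        let termCount = doc |> List.filter ((=) term) |> List.length
        float termCount / float (List.length doc)

// idf = log (|D| / termInDocumentCount), using the natural logarithm
let idf (term: string) (corpus: string list list) =
    let docsWithTerm = corpus |> List.filter (List.contains term) |> List.length
    if docsWithTerm = 0 then 0.0
    else log (float (List.length corpus) / float docsWithTerm)

let tfidf term doc corpus = tf term doc * idf term corpus

// Toy corpus (illustrative only): 'the' appears in every document.
let corpus =
    [ [ "the"; "linux"; "penguin" ]
      [ "the"; "dog"; "penguin" ]
      [ "the"; "crazy"; "dog" ] ]

let firstDoc = List.head corpus
printfn "the:   %f" (tfidf "the" firstDoc corpus)     // 0: occurs in every document
printfn "linux: %f" (tfidf "linux" firstDoc corpus)   // positive: occurs in one document only
```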
![Page 26: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/26.jpg)
Search in under 10 minutes
![Page 27: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/27.jpg)
Fuzzy Matching
String similarity algorithms: SoundEx; Metaphone
Jaro Winkler Distance; Cosine similarity; Sellers; Euclidean distance; …
We’ll look at Levenshtein Distance algorithm
Defined as: the minimum number of edit operations that transform string1 into string2
![Page 28: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/28.jpg)
Fuzzy Matching
Edit costs: In-place copy – cost 0
Delete a character in string1 – cost 1
Insert a character in string2 – cost 1
Substitute a character for another – cost 1
Transform ‘kitten’ into ‘sitting’:
kitten -> sitten (cost 1 – replace k with s)
sitten -> sittin (cost 1 - replace e with i)
sittin -> sitting (cost 1 – add g)
Levenshtein distance: 3
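The edit costs above map onto the classic dynamic-programming table. This is a textbook sketch rather than the code shown in the talk; the function and variable names are illustrative:

```fsharp
let levenshtein (source: string) (target: string) =
    let m, n = source.Length, target.Length
    // d.[i, j] = cost of turning the first i chars of source into the first j chars of target
    let d = Array2D.create (m + 1) (n + 1) 0
    for i in 0 .. m do d.[i, 0] <- i   // delete everything
    for j in 0 .. n do d.[0, j] <- j   // insert everything
    for i in 1 .. m do
        for j in 1 .. n do
            // In-place copy costs 0, substitution costs 1.
            let substitution = if source.[i - 1] = target.[j - 1] then 0 else 1
            let delete = d.[i - 1, j] + 1
            let insert = d.[i, j - 1] + 1
            let replace = d.[i - 1, j - 1] + substitution
            d.[i, j] <- min (min delete insert) replace
    d.[m, n]

printfn "%d" (levenshtein "kitten" "sitting")   // 3
```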
![Page 29: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/29.jpg)
Fuzzy Matching
Estimated string similarity computation costs:
Hard on the GC (lots of temporary strings created and thrown away), so use arrays if possible.
Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shortest string and ‘k’ is the maximum distance.
Parallelisable – split the set of words to compare across n cores.
Can do approximately 10,000 compares per second on a standard single core laptop.
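A sketch of the two practical points above: keep the working state in arrays, and spread the comparisons across cores. `distance` is a two-row variant of Levenshtein (it does not implement the O(kl) bounded version), `didYouMean` and the four-word dictionary are hypothetical, and `Array.Parallel.map` is one of several ways to parallelise this:

```fsharp
// Two-row Levenshtein: only the previous and current rows are kept, as int arrays,
// so no temporary strings are allocated while comparing.
let distance (a: string) (b: string) =
    let mutable previous = Array.init (b.Length + 1) id
    let mutable current = Array.zeroCreate<int> (b.Length + 1)
    for i in 1 .. a.Length do
        current.[0] <- i
        for j in 1 .. b.Length do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            current.[j] <- min (min (previous.[j] + 1) (current.[j - 1] + 1)) (previous.[j - 1] + cost)
        // Swap the rows instead of allocating a new one.
        let finished = current
        current <- previous
        previous <- finished
    previous.[b.Length]

// Compare one word against every dictionary entry in parallel and keep the closest.
// Assumes a non-empty dictionary (Array.minBy throws on an empty array).
let didYouMean (dictionary: string[]) (word: string) =
    dictionary
    |> Array.Parallel.map (fun candidate -> candidate, distance word candidate)
    |> Array.minBy snd
    |> fst

let dictionary = [| "kitten"; "sitting"; "mitten"; "fitting" |]
printfn "%s" (didYouMean dictionary "sittin")   // sitting
```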
![Page 30: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/30.jpg)
Did You Mean?
![Page 31: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/31.jpg)
Classification
Support Vector Machines (SVM): supervised learning for binary classification
Training inputs: ‘in’ and ‘out’ vectors.
SVM will then find a separating ‘hyperplane’ in an n-dimensional space
Training is costly, but classification is cheap
Can retrain on the fly in some cases
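To illustrate why classification is cheap once training is done: for a linear SVM, classifying an input is a single dot product against the learned hyperplane. The sketch below covers only that decision step. The `LinearModel` type and the weights are hypothetical stand-ins for the output of a real training run, and kernel-based SVMs evaluate a different expression:

```fsharp
// A separating hyperplane: w . x + b = 0
type LinearModel = { Weights: float[]; Bias: float }

// Which side of the hyperplane does the input fall on?
let classify (model: LinearModel) (input: float[]) =
    let activation =
        Array.fold2 (fun acc w x -> acc + w * x) model.Bias model.Weights input
    if activation >= 0.0 then "in" else "out"

// Hypothetical model in 2 dimensions: the line x = y separates the two classes.
let model = { Weights = [| 1.0; -1.0 |]; Bias = 0.0 }

printfn "%s" (classify model [| 3.0; 1.0 |])   // in
printfn "%s" (classify model [| 1.0; 3.0 |])   // out
```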
![Page 32: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/32.jpg)
Classification
![Page 33: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/33.jpg)
SVM Issues
Classification on 2 dimensions is easy, but most input is multi-dimensional
Some ‘tricks’ are needed to transform the input data
![Page 34: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/34.jpg)
SVM Classifier Demo
![Page 35: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/35.jpg)
F# Recommendation Engine
Netflix Prize: $1 million USD
Must beat the Netflix prediction algorithm by 10%
480k users
100 million ratings
18,000 movies
Great example of deriving value out of large datasets
Earns Netflix loads and loads of $$$!
![Page 36: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/36.jpg)
Netflix Data Format
MovieId CustomerId Rating
Clerks 444444 5
Clerks 2093393 4
Clerks 999 5
Clerks 8668478 1
Dogma 2432114 3
Dogma 444444 5
Dogma 999 5
... ... ...
![Page 37: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/37.jpg)
Nearest Neighbour
MovieId CustomerId Rating
Clerks 444444 5
Clerks 2093393 4
Clerks 999 5
Clerks 8668478 1
Dogma 2432114 3
Dogma 444444 5
Dogma 999 5
... ... ...
![Page 38: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/38.jpg)
Nearest Neighbour
Find the best movies my neighbours agree on
CustomerId 302 4418 3 56 732
444444 5 4 5 2
999 5 5 1
111211 3 5 3
66666 5 5
1212121 5 4
5656565 1
454545 5 5
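A rough sketch of the nearest-neighbour idea, using only the seven sample rows from the Netflix Data Format slide. This is not the demo from the talk: the names `byCustomer`, `distance` and `predict` are illustrative, and "nearest" is assumed to mean the smallest Euclidean distance over the movies two customers have both rated:

```fsharp
// (movie, customerId, rating) rows from the data format slide.
let ratings =
    [ "Clerks", 444444, 5
      "Clerks", 2093393, 4
      "Clerks", 999, 5
      "Clerks", 8668478, 1
      "Dogma", 2432114, 3
      "Dogma", 444444, 5
      "Dogma", 999, 5 ]

// customer -> (movie -> rating)
let byCustomer =
    ratings
    |> List.groupBy (fun (_, customer, _) -> customer)
    |> List.map (fun (customer, rows) ->
        let movies = rows |> List.map (fun (movie, _, rating) -> movie, rating)
        customer, Map.ofList movies)
    |> Map.ofList

// Euclidean distance over the movies both customers rated; None if they share none.
let distance (a: Map<string, int>) (b: Map<string, int>) =
    let differences =
        a
        |> Map.toList
        |> List.choose (fun (movie, ra) ->
            Map.tryFind movie b |> Option.map (fun rb -> float (ra - rb)))
    if List.isEmpty differences then None
    else Some (sqrt (differences |> List.sumBy (fun d -> d * d)))

// Predict a rating as the mean rating given by the k nearest neighbours who rated the movie.
let predict k customer movie =
    let mine = Map.find customer byCustomer
    let nearest =
        byCustomer
        |> Map.toList
        |> List.filter (fun (other, theirs) -> other <> customer && Map.containsKey movie theirs)
        |> List.choose (fun (_, theirs) ->
            distance mine theirs |> Option.map (fun d -> d, float (Map.find movie theirs)))
        |> List.sortBy fst
        |> List.truncate k
        |> List.map snd
    if List.isEmpty nearest then None else Some (List.average nearest)

// Customer 2093393 rated Clerks 4; the closest Clerks fans both gave Dogma a 5.
printfn "%A" (predict 2 2093393 "Dogma")   // Some 5.0
```

At the scale quoted for the Netflix Prize, a brute-force scan like this would need indexing or sampling, which this sketch does not attempt.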
![Page 39: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/39.jpg)
Netflix Demo
![Page 40: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/40.jpg)
Vector Math Made Easy
If we want to calculate the distance between A and B, we use Euclidean distance
We can represent the points in the same way using vectors: magnitude and direction.
Having this vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean distance/angle calculations.
A (x1,y1)
B (x2,y2)
C (x0,y0)
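Both calculations generalise to any number of dimensions once points are stored as arrays. A minimal sketch, assuming vectors of equal length (`Array.map2` throws otherwise); the function names are illustrative:

```fsharp
// Straight-line distance between two points in n dimensions.
let euclidean (a: float[]) (b: float[]) =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b |> Array.sum |> sqrt

let dot (a: float[]) (b: float[]) = Array.map2 ( * ) a b |> Array.sum
let magnitude (a: float[]) = sqrt (dot a a)

// Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal.
// Undefined (NaN) when either vector has zero magnitude.
let cosine a b = dot a b / (magnitude a * magnitude b)

printfn "%f" (euclidean [| 0.0; 0.0 |] [| 3.0; 4.0 |])   // 5.000000
printfn "%f" (cosine [| 1.0; 0.0 |] [| 0.0; 1.0 |])      // 0.000000
```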
![Page 41: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/41.jpg)
http://callvirt.net/blog/post/Why-F-(TechEd-09-DEV450).aspx
![Page 42: Big Algorithms Made Easy with Microsoft's F#](https://reader033.fdocuments.us/reader033/viewer/2022052118/554ebe82b4c905de468b4999/html5/thumbnails/42.jpg)
© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.