Data Breaking Bad at Berlin Buzzwords

Michael Hausenblas, MapR Technologies. Berlin Buzzwords 2013, Open Stage Talk. Friday, 7 June 13

Talk given by Michael Hausenblas, Chief Data Engineer EMEA at MapR Technologies, at Berlin Buzzwords 2013 (Open Stage Talk).

Transcript of Data Breaking Bad at Berlin Buzzwords

Page 1: Data Breaking Bad at Berlin Buzzwords

Michael Hausenblas, MapR Technologies

Berlin Buzzwords 2013, Open Stage Talk

Friday, 7 June 13

Page 2: Data Breaking Bad at Berlin Buzzwords

Nope. Not this one.

Page 3: Data Breaking Bad at Berlin Buzzwords

Page 4: Data Breaking Bad at Berlin Buzzwords

things you can influence

things that affect you

try and focus on this stuff

Page 5: Data Breaking Bad at Berlin Buzzwords

The awkward moment when I open the data I got from a customer

Page 6: Data Breaking Bad at Berlin Buzzwords

http://techcrunch.com/2012/11/25/the-big-data-fallacy-data-%E2%89%A0-information-%E2%89%A0-insights/

aka crap in, crap out

Page 7: Data Breaking Bad at Berlin Buzzwords

Some examples …

Page 8: Data Breaking Bad at Berlin Buzzwords

• Encöding hell

• Schema? Sure, I fax you a screenshot

• Dupes and other fakes

• Sampling

Page 9: Data Breaking Bad at Berlin Buzzwords

Encöding hell

application-specific encodings

• URL encoding

• HTML encoding

• Database escaping

non-ASCII?

a%20percent-encoded%20string%20as%20of%20RFC%203986

a <strong>HTML</strong> encoded string
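
The two example strings on this slide can be reproduced with Python's standard library. This is only a minimal sketch of URL and HTML encoding, not something shown in the talk; the variable names are made up for illustration, and database escaping is left out because it depends on the driver in use.

```python
from html import escape, unescape
from urllib.parse import quote, unquote

# URL (percent) encoding as described in RFC 3986: decode the slide's example.
percent_encoded = "a%20percent-encoded%20string%20as%20of%20RFC%203986"
decoded = unquote(percent_encoded)

# HTML encoding: markup characters become entities so they are shown, not rendered.
html_text = "a <strong>HTML</strong> encoded string"
escaped = escape(html_text)

# Non-ASCII input: quote() encodes the text as UTF-8 first, then percent-encodes the bytes.
quoted = quote("Encöding hell")

print(decoded)
print(escaped)
print(quoted)
```

Each layer has to be undone with its own matching decoder (`unquote`, `unescape`); mixing them up is one common way to end up with garbled data.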

Page 10: Data Breaking Bad at Berlin Buzzwords

Encöding hell

• Use Unicode

• Use Unicode

• Use Unicode

http://www.swedishfika.com/2010/01/19/escaping-from-encoding-hell/
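
One way to read "Use Unicode" in practice is to decode incoming bytes explicitly and normalize the result once, at the point where data enters the system. The sketch below assumes the input is either UTF-8 or Windows-1252; that candidate list and the helper name `to_text` are illustrative assumptions, not something taken from the talk.

```python
import unicodedata

def to_text(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Decode bytes by trying each candidate encoding in order, then normalize to NFC."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
        # NFC makes "o" + combining diaeresis and precomposed "ö" compare equal.
        return unicodedata.normalize("NFC", text)
    raise ValueError("none of the candidate encodings could decode the input")

print(to_text("Encöding hell".encode("utf-8")))
print(to_text("Encöding hell".encode("cp1252")))
```

Trying UTF-8 first is deliberate: it rejects most byte sequences that are not valid UTF-8, whereas single-byte encodings accept almost anything and would silently produce wrong characters.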

Page 11: Data Breaking Bad at Berlin Buzzwords

• Encöding hell

• Schema? Sure, I fax you a screenshot

• Dupes and other fakes

• Sampling

Page 12: Data Breaking Bad at Berlin Buzzwords

Schema? Sure, I fax you a screenshot

Page 13: Data Breaking Bad at Berlin Buzzwords

Schema? Sure, I fax you a screenshot

• There is a need for proper, formal documentation

• For humans and machines

• Basis for validation—automate!
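
To illustrate the points above, a schema that is written down as data can serve as documentation for humans and as input to an automated check. The following is a rough sketch using only the standard library; the field names, the `SCHEMA` layout and the `validate` helper are hypothetical and not from the talk.

```python
from datetime import date

# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "customer_id": (int, True),
    "email": (str, True),
    "signup_date": (date, False),
}

def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors = []
    for field, (expected_type, required) in schema.items():
        value = record.get(field)
        if value is None:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    # Fields the schema does not know about are reported too (sorted for stable output).
    for field in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {field}")
    return errors

print(validate({"customer_id": 1, "email": "a@example.com"}))
print(validate({"customer_id": "1", "fax": "yes"}))
```

In a real pipeline an established schema language would likely replace this hand-rolled dictionary, but the idea stays the same: the schema is a machine-readable artifact and every incoming record is checked against it.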

Page 14: Data Breaking Bad at Berlin Buzzwords

• Encöding hell

• Schema? Sure, I fax you a screenshot

• Dupes and other fakes

• Sampling

Page 15: Data Breaking Bad at Berlin Buzzwords

Dupes and other fakes

Page 16: Data Breaking Bad at Berlin Buzzwords

Dupes and other fakes

Page 17: Data Breaking Bad at Berlin Buzzwords

Dupes and other fakes

• Use plots to get an overview

• Watch out for outliers

• Try to establish source for errors and fix

• Document (in any case)
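
Alongside plots, a couple of small checks can flag duplicates and outliers before any analysis starts. The sketch below is one possible approach and not the method from the talk: it counts exact duplicate rows and uses the common 1.5 × IQR rule for outliers. The function names, the sample data and the threshold `k` are assumptions.

```python
from collections import Counter
from statistics import quantiles

def find_duplicates(rows):
    """Return each row that occurs more than once, with its count (rows must be hashable)."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

rows = [("alice", 1), ("bob", 2), ("alice", 1)]
readings = [10, 11, 12, 11, 10, 12, 11, 500]

print(find_duplicates(rows))
print(iqr_outliers(readings))
```

A flagged value is only a candidate: whether it is an error or a genuine extreme still has to be established at the source, and the decision documented either way.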

Page 18: Data Breaking Bad at Berlin Buzzwords

• Encöding hell

• Schema? Sure, I fax you a screenshot

• Dupes and other fakes

• Sampling

Page 19: Data Breaking Bad at Berlin Buzzwords

Sampling

• My data is too big. I can’t check it all.

• Why don’t you sample, then?
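
The slide does not say how to sample. One option that works when the data is too big to hold in memory is reservoir sampling, which draws a uniform random sample of fixed size in a single pass. The sketch below implements the classic "Algorithm R"; the function name and the `seed` parameter are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i (0-based) replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                sample[j] = item
    return sample

picked = reservoir_sample(range(1_000_000), k=5, seed=42)
print(picked)
```

Because the input is consumed as an iterator, the same function can run over a file handle or a database cursor without loading everything first; the sample can then be checked by hand or with the validation steps from the earlier slides.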

Page 20: Data Breaking Bad at Berlin Buzzwords

http://mortardata.com/

Page 21: Data Breaking Bad at Berlin Buzzwords

Page 22: Data Breaking Bad at Berlin Buzzwords

Go and buy this book. Now.
