Data Science & Culture
(Or how to stop worrying and love data-driven culture)
Ícaro Medeiros
Data Science Forum São Paulo, Jun 2017
Inspired by (not limited to)
refs
Big Data
http://www.kdnuggets.com/2017/02/origins-big-data.html
✦ Fundamental building blocks: advances in CS, e.g. distributed systems, databases, massive AI, etc.
✦ Fuzzy concept, ill-defined
✦ Popularized by Gartner (hype-fueled consulting firm)
✦ Big Data no longer considered an emerging technology (pervasive in industry)
✦ Entered Trough of Disillusionment in 2013
https://knowledgeimmersion.wordpress.com/2016/06/22/disillusionment-of-big-data/
http://www.mikelnino.com/2016/03/chronology-big-data.html
Chronology of antecedents
Data science
✦ Statistics (late 19th century)
✦ Computer Science (1950s)
✦ Machine Learning (1950s)
✦ Data Mining (1990s)
✦ Data Science (2010s)
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
yet another hyped term
Beware: controversy
✦ Data science is not all-science
✴ It’s getting more and more engineering-like, a practice
✴ Data storytelling is a creative endeavor
✦ Hyper-inflated expectations, misunderstood concepts and hurry to get value: a dangerous recipe
A new hope
machine learning
big data
https://trends.google.com/trends/explore?date=today%2012-m&geo=US&q=machine%20learning,big%20data
or hype
Hype: not that bad
✦ Haters gonna hate, i.e. don’t fully hate the hype
✴ more practitioners = faster tech and processes evolution
✴ Highly skilled professionals and innovation
✦ Academics sometimes look for difficult unwanted problems
✴ Industry is more pragmatic, especially in tech
https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science
What we need…
✦ Forget about Big Data pokémons
✴ OH so in Big Data we don’t need people to think schemas?
✦ Forget about misunderstood business expectations
✴ OH in deep learning we don’t need people to train models?
✦ You need PEOPLE
✴ Collaborating with shared values
✴ Awesome in tech but more importantly: CREATIVE
Shared values and practices
Culture
Good people
✦ People are more important than ideas
✴ A mediocre team will screw up a good idea
✴ Mediocre idea to great team: they will fix it or rethink it
✦ A good lab: different kinds of autonomous thinkers
✴ Why hire smart people if they can't fix what’s broken?
✦ Prefer a heterogeneous and complementary team instead of looking for unicorns
The mythical 10x professional
https://twitter.com/icaromedeiros/status/838968884023668737
Good communication
✦ Honesty, excellence, originality and self-criticism (values)
✦ Communication structure <> organizational structure
✦ Be ready to hear the truth
✴ Sincerity is only valuable if people are open and willing to give up on ideas that will not work
✦ Braintrust: Leave ego and Jobs outside the door
Power to the people!
✦ Product quality is everyone’s responsibility
✴ Don’t ask permission to take responsibility
✦ Passion and excellence versus autonomy
✦ Good things might overshadow the bad
✴ People hesitate to point out bad things for fear of being called “complainers”
Rebels
http://qaspire.com/2017/05/19/sketchnote-what-rebels-want-from-their-boss/
Destroy data silos!
✦ Without information about data there is no science
✦ Software and data should be a collective property within the company
✦ Knowledge management matters
✦ Communication between areas must be enforced
Data portals
✦ Self-service platforms to publish datasets
✴ Descriptions, schemas, samples, relations between datasets, etc
✦ Open Data initiatives, mostly governments
✦ OSS platforms: CKAN, Airbnb’s Dataportal
✦ Examples: data.gov.uk, dados.gov.br, etc
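As a rough illustration of what one entry in a self-service portal might hold, the sketch below models the metadata mentioned on this slide (description, schema, sample, relations between datasets) in plain Python. The `DatasetEntry` class, its fields, the `search` helper and the two example datasets are hypothetical names made up for illustration; they are not the CKAN or Airbnb Dataportal API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical metadata record for one published dataset."""
    name: str
    description: str
    schema: dict[str, str]                             # column name -> type
    sample: list[dict] = field(default_factory=list)   # a few example rows
    related: list[str] = field(default_factory=list)   # names of related datasets


def search(catalog: list[DatasetEntry], term: str) -> list[str]:
    """Return names of datasets whose name or description mentions the term."""
    term = term.lower()
    return [
        d.name for d in catalog
        if term in d.name.lower() or term in d.description.lower()
    ]


# Made-up catalog with two related datasets.
catalog = [
    DatasetEntry(
        name="orders",
        description="One row per customer order",
        schema={"order_id": "int", "customer_id": "int", "total": "float"},
        sample=[{"order_id": 1, "customer_id": 42, "total": 19.9}],
        related=["customers"],
    ),
    DatasetEntry(
        name="customers",
        description="Customer master data",
        schema={"customer_id": "int", "country": "str"},
    ),
]

print(search(catalog, "customer"))
```

Even a record this small answers the questions a newcomer usually asks first: what the dataset is, which columns it has, and where to look next.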
“When it comes to creative inspiration, job titles and hierarchy are meaningless”
Data storytelling
✦ Explain what the numbers tell in clear, layman’s terms
✦ Make hidden premises clear
✴ Outside data insights
✦ Convince others about actions
✴ Decreases insights-to-value interval
✦ From data to knowledge
https://www.forbes.com/sites/brentdykes/2016/03/31/data-storytelling-the-essential-data-science-skill-everyone-needs
What is creativity
✦ Unexpected connections of concepts and ideas
✦ It's a marathon, it needs rhythm
✦ Creativity must start somewhere, and there’s power in healthy feedback in an iterative process
Visual communication
✦ Clean straightforward graphs > visually appealing
✴ Choose dataviz libs wisely
✦ “Don’t make me think”
✦ The right graph for the right audience
✴ Prefer a language everyone understands
Visual communication 101
Stats are not enough
https://www.autodeskresearch.com/publications/samestats
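A minimal sketch of the same point, using two series from Anscombe's quartet (the classic 1973 example, not the datasets from the publication linked above). The `pearson` helper and the variable names are assumptions made for this illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Anscombe's quartet, datasets I and II: both share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for label, ys in [("roughly linear", y_linear), ("clearly curved", y_curved)]:
    print(label, round(mean(ys), 2), round(pearson(x, ys), 2))
```

Both series print a mean of about 7.5 and a correlation of about 0.82, yet a scatter plot shows a noisy line for the first and a smooth parabola for the second. The summary alone cannot tell them apart.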
Strategy
Avoid egotrip data science
✦ “OH my cluster has 10 Petabytes, I’m awesome”
✦ Fancy ML algorithms are not the goal
✦ The most important V in Big Data is value
https://twitter.com/amyhoy/status/847097034536554497
KPI versus HiPPO
✦ Tech adoption per se is meaningless
✴ Slide-driven Big Data
✴ KPIs should grow from Big Data and data insights initiatives
✦ Poorly defined goals -> bad decisions
✦ Define viable but ambitious goals
✦ Data beats opinion
Set goal, plan and GO!
✦ Business questions can't be like “OH we want to detect things related to millennials”
✦ Clear goals must be set, with actionable metrics
✦ Balance perfect models versus time-to-market
✦ Brad Bird: “Sometimes, as a director, you’re guiding. Sometimes you’re letting the car drive”
https://hbr.org/2017/02/how-chief-data-officers-can-get-their-companies-to-collect-clean-data
The process
✦ The process is not the goal
✴ It has no agenda or taste, it’s just a tool
✦ Quality is the best business plan
✦ Agile is a mindset: not only kanbans or scrum
✦ If the model will become operational, mix scientists and engineers from start
Build vs Buy
✦ If you buy and your core business is not techie, you can be illiterate in tech
✴ Benchmark before buying
✴ Accelerate results and boost internal knowledge
✦ If you build and have a good-enough techie culture, you’re more or less good to go
✴ Assess pros and cons consciously
✦ If you surf the tech hype AND build good systems you’re awesome
https://twitter.com/Doug_Laney/status/847452219641356288
When data goes to vendors…
http://www.louisdorard.com/machine-learning-canvas/
DATA ENGINEERING
Big Data vs Great Data
✦ If your logical models do not make sense
✦ If your most frequently run queries are slow
✦ If you have string-only databases
✦ If you have unused expensive data
✦ Maybe your data lake is a swamp
“The data is a mess”
✦ First step: accelerate human understanding of data
✴ Metadata, context, hidden assumptions
✦ Datasets might serve multiple purposes
✴ Define rationale and context
✴ Data portals and understandable datasets > Dashboards
https://hbr.org/2016/12/why-youre-not-getting-value-from-your-data-science
https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
Data lost in translation
✦ Heterogeneous and siloed databases (and people)
✦ Rethink ESB (microservices network)
✦ State-of-the-art: data workflow
✴ Luigi, Airflow (open source), almost every big tech vendor
✴ Transparency, reusability, reproducibility, traceability
✴ Automation and monitoring all the way!
https://hbr.org/2016/12/why-youre-not-getting-value-from-your-data-science
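To make the data workflow idea concrete, here is a minimal sketch of a dependency-ordered pipeline using only the Python standard library. It is not the Luigi or Airflow API; the task functions, the `PIPELINE` mapping and the `run` helper are hypothetical, and the input values are made up. The point is that declaring dependencies explicitly gives a reproducible execution order and a trace of what ran.

```python
from graphlib import TopologicalSorter  # Python 3.9+


# Hypothetical tasks: each receives the outputs of its upstream tasks.
def extract():
    return [" 10", "20 ", "n/a", "30"]


def clean(raw):
    # Keep only values that are plain integers once whitespace is stripped.
    return [int(v) for v in raw if v.strip().isdigit()]


def aggregate(values):
    return {"count": len(values), "total": sum(values)}


# Task name -> (callable, list of upstream task names).
PIPELINE = {
    "extract": (extract, []),
    "clean": (clean, ["extract"]),
    "aggregate": (aggregate, ["clean"]),
}


def run(pipeline):
    """Run tasks in dependency order; return their results and an execution log."""
    graph = {name: set(deps) for name, (_, deps) in pipeline.items()}
    results, log = {}, []
    for name in TopologicalSorter(graph).static_order():
        func, deps = pipeline[name]
        results[name] = func(*(results[d] for d in deps))
        log.append(name)  # traceability: record what ran, and in which order
    return results, log


results, log = run(PIPELINE)
print(log)
print(results["aggregate"])
```

Real workflow tools add scheduling, retries and monitoring on top of this same dependency graph.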
Beyond relational models
✦ Not all data problems fit well in traditional SQL or DW models
✴ Key-value, columnar, graph-based, inverted index, etc
✦ Models are a framework for problem-solving
✴ Not the ultimate answer
✴ There’s no one-size-fits-all model
Do not forget fluency
✦ Check the company lingua franca
✦ Make it easy for critical decision-makers
✴ Ad hoc SQL queries?
✴ Dashboards?
✴ Reports?
EXPERIMENTATION
Experiments
✦ Missions to discover facts towards understanding
✴ They don’t fail, any result produces new information
✴ If the initial theory was wrong: good
✴ With new facts you can reformulate the question
✦ Get more modeling questions asked more often
✦ Iterative data science
Product experimentation (A/B)
✦ Product experimentation should be hypothesis-driven (not feature-driven)
✦ Define the proper exposed population
✴ No new users, no heavy users only, no early adopters
✦ Understanding effect is essential
https://medium.com/airbnb-engineering/4-principles-for-making-experimentation-count-7a5f1a5268a
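A hypothesis-driven A/B test ends with a check on whether the observed effect could be noise. The sketch below uses a two-sided, pooled two-proportion z-test, which is one common choice and not necessarily the method used in the linked article. The function name and the conversion counts are made up for illustration.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothesis: variant B converts at a different rate than control A.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test only says whether an effect is likely to be real. Choosing the exposed population and understanding why the effect happens remain human work.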
5 stages of A/B tests
https://www.linkedin.com/pulse/ab-testing-which-do-i-pick-sahar-heidari
Some other quick tips
✦ Focus on outcomes (not algorithms or methods)
✦ Design the right metric and evaluation
✦ Good experiments don't produce obvious insights
✦ Mix of data and intuition
https://twitter.com/mrdatascience/status/869957499662860288
Being data driven
✦ Be BAYESIAN - uncertainty is everywhere
✦ Be CURIOUS - keep learning
✦ Be AGILE - Fail fast, not too fast: evidence comes first
https://www.reaktor.com/blog/culture-eats-data-science-for-breakfast/
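One way to read "Be BAYESIAN" in practice: report a distribution over a conversion rate instead of a single number. The sketch below assumes a uniform Beta(1, 1) prior and made-up counts. The function names are hypothetical, and the Monte Carlo estimate is only one way to compare two posteriors.

```python
import random


def posterior_params(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters after observing binomial data."""
    return prior_a + successes, prior_b + trials - successes


def prob_b_beats_a(a_params, b_params, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = sum(
        rng.betavariate(*b_params) > rng.betavariate(*a_params)
        for _ in range(draws)
    )
    return wins / draws


a_params = posterior_params(successes=100, trials=1000)
b_params = posterior_params(successes=150, trials=1000)
print("posterior mean of A:", a_params[0] / sum(a_params))
print("P(B > A) is roughly:", prob_b_beats_a(a_params, b_params))
```

A statement such as "B is very likely better than A" carries its uncertainty with it, which is easier to act on than a bare point estimate.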
Being data driven
✦ Be TRUTHFUL - don’t torture data to please opinions
✦ Be HELPFUL - work across silos, support democracy
✦ Be WISE - know when to be analytical or intuitive
https://www.reaktor.com/blog/culture-eats-data-science-for-breakfast/
With the right people, Democracy, Creativity, Strategy, Big Great Data™ and Experiments there's a good chance to do great
SCIENCE
Take-away message
Ícaro Medeiros
Data Scientist
icaromedeiros