DSSG Speaker Series: Paco Nathan

  • date post

    06-May-2015
  • Category

    Technology

description

An invited talk by Paco Nathan in the speaker series at the University of Chicago's Data Science for Social Good fellowship (2013-08-12) http://dssg.io/2013/05/21/the-fellowship-and-the-fellows.html Learnings generalized from trends in Data Science: a 30-year retrospective on Machine Learning, a 10-year summary of Leading Data Science Teams, and a 2-year survey of Enterprise Use Cases. http://www.eventbrite.com/event/7476758185

Transcript of DSSG Speaker Series: Paco Nathan

  • 1. DSSG Speaker Series, 2013-08-12: Learnings generalized from trends in Data Science: a 30-year retrospective on Machine Learning, a 10-year summary of Leading Data Science Teams, and a 2-year survey of Enterprise Use Cases. Paco Nathan, @pacoid, Chief Scientist, Mesosphere

2. Learnings generalized from trends in Data Science:
    1. the practice of leading data science teams
    2. strategies for leveraging data at scale
    3. machine learning and optimization
    4. large-scale data workflows
    5. the evolution of cluster computing
  DSSG, 2013-08-12

3. Statistical Thinking
  • employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables
  • this approach attempts to understand not just problems and solutions, but also the processes involved and their variances
  • particularly valuable in Big Data work when combined with hands-on experience in physics; roughly 50% of my peers come from physics or physical engineering
  • programmers typically don't think this way; however, both systems engineers and data scientists must
  (diagram labels: Process, Variation, Data, Tools)

4. Modeling
  • back in the day, we worked with practices based on data modeling:
    1. sample the data
    2. fit the sample to a known distribution
    3. ignore the rest of the data
    4. infer, based on that fitted distribution
  • that served well with ONE computer, ONE analyst, ONE model: just throw away annoying extra data
  • circa late 1990s: machine data, aggregation, clusters, etc.
  • algorithmic modeling displaced the prior practices of data modeling, because the data won't fit on one computer anymore

5. Two Cultures
  "A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets."
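The four data-modeling steps listed on slide 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something shown in the talk: the synthetic measurements, the choice of a normal distribution, the sample size, and the function name `data_model_estimate` are all hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev


def data_model_estimate(population, sample_size, threshold, seed=0):
    """Estimate P(X > threshold) the 'data modeling' way (hypothetical helper)."""
    rng = random.Random(seed)
    # 1. sample the data
    sample = rng.sample(population, sample_size)
    # 2. fit the sample to a known distribution (assumed here: a normal)
    fitted = NormalDist(mu=fmean(sample), sigma=stdev(sample))
    # 3. ignore the rest of the data: nothing outside `sample` is used again
    # 4. infer, based on that fitted distribution
    return 1.0 - fitted.cdf(threshold)


# Hypothetical data set: 100,000 synthetic measurements.
data_rng = random.Random(42)
population = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

estimate = data_model_estimate(population, sample_size=500, threshold=70.0)
# For comparison, the fraction computed from all of the data.
actual = sum(x > 70.0 for x in population) / len(population)
print(f"fitted estimate: {estimate:.4f}, full-data fraction: {actual:.4f}")
```

The sketch works here only because the synthetic data really is normal and fits in memory on one machine; the slide's point is that neither assumption held once machine data arrived at scale.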
  Quoted from: Statistical Modeling: The Two Cultures, Leo Breiman, 2001, bit.ly/eUTh9L
  • chronicled a sea change from data modeling (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)
  • which led in turn to the practice of leveraging inter-disciplinary teams

6. What is needed most?
  • approximately 80% of the costs for data-related projects is spent on data preparation, mostly on cleaning up data quality issues: ETL, log files, etc., generally by socializing the problem
  • unfortunately, data-related budgets tend to go into frameworks that can only be used after clean up
  • most valuable skills:
      • learn to use programmable tools that prepare data
      • learn to understand the audience and their priorities
      • learn to socialize the problems, knocking down silos
      • learn to generate compelling data visualizations
      • learn to estimate the confidence for reported results
      • learn to automate work, making process repeatable
  [Figure: sample of application log event names, e.g., UniqueRegistration, CreateNewPet, FeedPet, AddFriend, WebsiteLogin, NUI:ChatMode]

7. Team Process = Needs
  • discovery: help people ask the right questions
  • modeling: allow automation to place informed bets
  • integration: deliver data products at scale to LOB end uses
  • apps: build smarts into product features
  • systems: keep infrastructure running, cost-effective
  (diagram labels: analysts, engineers, inter-disciplinary leadership)

8. Team Composition = Roles
  • Domain Expert: business process, stakeholder
  • Data Scientist: data prep, discovery, modeling, etc.
  • App Dev: software engineering, automation
  • Ops: systems engineering, availability
  • leverage non-traditional pairing among roles, to complement skills and tear down silos
  (diagram labels: data science, introduced capability)

9. Team Composition = Needs × Roles
  [Figure: matrix of needs (discovery, modeling, integration, apps, systems) against roles (Domain Expert, Data Scientist, App Dev, Ops)]

10. Alternatively, Data Roles × Skill Sets
  • Harlan Harris, et al., datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
  • Analyzing the Analyzers, Harlan Harris, Sean Murphy, Marck Vaisman, O'Reilly, 2013, amazon.com/dp/B00DBHTE56

11.
Learning Curves
  • difficulties in the commercial use of distributed systems often get represented as issues of managing complexity
  • much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering conservatism, with highly structured process and strictly codified practices; people learn a few things well, then avoid having to struggle with learning many new things perpetually
  • that anti-pattern leads to big teams, low ROI
  • ultimately, the challenge is about managing learning curves within a social context
  (diagram axes: scale, complexity)

12. Learnings generalized from trends in Data Science:
    1. the practice of leading data science teams
    2. strategies for leveraging data at scale
    3. machine learning and optimization
    4. large-scale data workflows
    5. the evolution of cluster computing
  DSSG, 2013-08-12

13. Business Disruption through Data
  • Geoffrey Moore (Mohr Davidow Ventures, author of Crossing The Chasm) @ Hadoop Summit, 2012:
      • what Amazon did to the retail sector has put the entire Global 1000 on notice over the next decade
      • data as the major force
      • mostly through apps: verticals, leveraging domain expertise
  • Michael Stonebraker (INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc.) @ XLDB, 2012:
      • complex analytics workloads are now displacing SQL as the basis for Enterprise apps

14. Data Categories
  Three broad categories of data, Curt Monash, 2010, dbms2.com/2010/01/17/three-broad-categories-of-data
  • Human/Tabular data: human-generated data which fits into tables/arrays
  • Human/Nontabular data: all other data generated by humans
  • Machine-Generated data
  let's now add other useful distinctions:
  • Open Data
  • Curated Metadata
  • A/D conversion for sensors (IoT)

15. Open Data notes
  • successful apps incorporate three components:
      • Big Data (consumer interest, personalization)
      • Open Data (monetizing public data)
      • Curated Metadata
  • most of the largest Cascading deployments leverage some Open Data components: Climate Corp, Factual, Nokia, etc.
  • consider buildingeye.com, aggregate building permits:
      • pricing data for home owners looking to remodel
      • sales data for contractors
      • imagine joining data with building inspection history, for better insights about properties for sale
  • research notes about Open Data use cases: goo.gl/cd995T

16. Trends in Public Administration
  • late 1880s to late 1920s (Woodrow Wilson): as hierarchy, bureaucracy; only for the most educated, elite
  • late 1920s to late 1930s: as a business, relying on Scientific Method; gov as a process
  • late 1930s to late 1940s (Robert Dale): relationships, behavioral-based; policy not separate from politics
  • late 1940s to 1980s: yet another form of management; less command and control
  • 1980s to 1990s (David Osborne, Ted Gaebler): New Public Management; service efficiency, more private sector
  • 1990s to present (Janet & Robert Denhardt): Digital Age; transparency, citizen-based debugging, bankruptcies
  Adapted from: The Roles, Actors, and Norms Necessary to Institutionalize Sustainable Collaborative Governance, Peter Pirnejad, USC Price School of Policy, 2013-05-02
  Drivers, circa 2013:
  • governments have run out of money, cannot increase staff and services
  • better data infra at scale (cloud, OSS, etc.)
  • machine learning techniques to monetize
  • viable ecosystem for data products, APIs
  • mobile devices enabling use cases

17. Open Data ecosystem
  • municipal departments: e.g., Palo Alto, Chicago, DC, etc.
  • publishing platforms: e.g., Junar, Socrata, etc.
  • aggregators: e.g., OpenStreetMap, WalkScore, etc.
  • data product vendors: e.g., Factual, Marinexplore, etc.
  • end use cases: e.g., Facebook, Climate, etc.
  Data feeds structured for public-private partnerships

18. Open Data ecosystem: caveats for agencies
  • municipal departments: e.g., Palo Alto, Chicago, DC, etc.
  • publishing platforms: e.g., Junar, Socrata, etc.
  • aggregators: e.g., OpenStreetMap, WalkScore, etc.
  • data product vendors: e.g., Factual, Marinexplore, etc.
  • end use cases: e.g., Facebook, Climate, etc.
  Required Focus:
  • respond to viable use cases
  • not budgeting hackathons

19. Open Data ecosystem: caveats for publishers
  • municipal departments: e.g., Palo Alto, Chicago, DC, etc.
  • publishing platforms: e.g., Junar, Socrata, etc.
  • aggregators: e.g., OpenStreetMap, WalkScore, etc.
  • data product vendors: e.g., Factual, Marinexplore, etc.
  • end use cases: e.g., Facebook, Climate, etc.
  Required Focus:
  • surface the metadata cura