Love the Data



Transcript of Love the Data


By Neil Hepburn (Dir. of Education, IRMAC)

    Three Stories About Data Management

Love the Data: Three Stories About Data Management by IRMAC is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Canada License (http://creativecommons.org/licenses/by-nc-sa/2.5/ca/).

Based on a work at wikipedia.org.

http://irmac.ca/

    Speaker Bio and Relevant Experience

    Bio

    Data Architect for Empathica Inc.

    Education:

    Honours Bachelor of Mathematics in Computer Science from the University of Waterloo

    Certified Data Management Professional (Mastery Level)

    PMI Certified

18 years in IS/IT, in both full-time and external consulting capacities, with a focus on Data Management over the past 7 years

GM of Marketing for an innovative iPhone app for Internet radio discovery

    Director of Education for IRMAC (Toronto chapter of DAMA-I)

    Relevant Experience

    Consultant to Bell Mobility assisting in a reboot of their Market Analytics and Intelligence programme

    Developed and implemented Predictive Analytics model for Call Genie, directly advising their CEO and SVP of Marketing

    Technical lead on Business Intelligence Modernization project at Empathica


    Presentation Roadmap

    Why am I giving this presentation?

    The Story of the Whiz Kids

    The Story of the Relational Model

    The Story of Twitter Analytics


    Why am I giving this presentation?

Data Management is an important discipline as we move to an increasingly data-driven society that relies on quality data to make fact-based decisions

Data Management exists at the intersection between technology and business

Requires understanding the underlying meaning of the data and how it relates to the business

Requires mastery of the technology used in the production, transformation, and consumption of information

Most IT personnel have a Computer Science degree or similar educational background

Computer Science and IT programs don't generally teach data management

Data is regarded as little more than a stage prop

    Databases are regarded as bit buckets

"Garbage in, garbage out" is the prevailing attitude in IT departments

    Data management is seen as a techno-bureaucracy


    Story of The Whiz Kids: The World Today

The current wave of "Cultures of Analytics" has begun to capture the popular imagination. In the last three years we have seen the following books released:

    The Numerati (by Stephen Baker)

Competing on Analytics (by Thomas Davenport & Jeanne Harris)

    Supercrunchers (by Ian Ayres)

Data Driven: Profiting from Your Most Important Asset (by Thomas C. Redman)

    The Information (by James Gleick)

Much of the inspiration behind these books originates from Moneyball: The Art of Winning an Unfair Game (by Michael M. Lewis), which documents the success of the Oakland A's through Sabermetrics: taking an analytical approach to team picks and real-time game strategy

It's all good stuff, but really nothing new


    Where did Evidence Based Management Begin?

Some companies were using data analytics to gain a competitive advantage

The very use of analytics was regarded as a secret weapon, and those employed in statistical analysis were warned not to discuss their work

In 1908, William Sealy Gosset was employed by Arthur Guinness

Gosset applied statistics to both farm and brewery to determine the best-yielding varieties of barley

Gosset also invented the Student's t-distribution, which got its name from Gosset's pseudonym "Student"


    Who were The Whiz Kids? (Pt. I)

The Whiz Kids trace their roots back to the US Army Air Forces under the command of Robert A. Lovett (Assistant Secretary of War for Air)

In 1939 Tex Thornton (who was the first Whiz Kid) hired nine other Whiz Kids from the Harvard Business School, including Robert McNamara and Edward Lundy

The team called themselves Statistical Control and committed themselves to a new managerial discipline, basing all decisions on numbers

Statistical Control saved $3.6 billion for the Air Force in 1943 alone, while at the same time improving pilot and troop morale

After WWII, Tex Thornton sold all 10 Whiz Kids as a team to the Ford Motor Co., reporting directly to then-president Henry Ford II

Upon arrival, the Whiz Kids discovered the finance department was designed solely for IRS tax purposes, and was not a tool of management


    Who were The Whiz Kids? (Pt. II)

The Whiz Kids got off to a rocky start when two layers of management were inserted between them and Henry Ford II

    Tex Thornton left the company, going on to head Litton Industries

Were ridiculed as "The Quiz Kids" (after a popular game show)

Nevertheless, through the techniques and discipline learned from Statistical Control, the Whiz Kids were able to provide substantial cost savings, while at the same time growing market share

    After turning Ford around, they were relabelled The Whiz Kids

McNamara was the first to recognize safety as a feature and attempted to introduce seat belts as a standard feature (tragically, this decision was collectively snubbed by the entire auto industry, delaying their introduction)

Ed Lundy transformed finance from an IRS compliance cost centre into a reporting powerhouse, establishing the CFO as the right-hand man of the CEO

By 1960, Robert McNamara had been promoted to president, becoming the first-ever non-family member to run the company

McNamara left the company shortly after to become JFK's Secretary of Defense


    A Tale of Two Whiz Kids

    Jack Reith was a car guy

Robert McNamara saw automobiles as consumer appliances, like a washing machine or refrigerator: simply a means of transportation

Jack Reith took it upon himself to get involved in design decisions with the Ford Edsel, and conceived the Mercury Comet

The Mercury Comet reflected Reith's own convictions about driving as a romantic pastime

Both cars bombed, leading to Reith's departure

    McNamara learned that Volkswagens were gaining market share.

Was common wisdom among auto execs that only beatniks were purchasing Volkswagens

McNamara commissioned a market research study, discovering that customers were often doctors and lawyers

Also learned that buyers purchased Volkswagens due to a design that made them easier to repair in one's own driveway

McNamara commissioned the Ford Falcon, which went on to be a top-selling car

    McNamara continued to rise at Ford, soon becoming president


    Lessons Learned From The Whiz Kids

They had the buy-in and full support of president Henry Ford II

    They were disciplined and forced themselves to adhere to their own principles

As measured by IQ, they were the most intelligent persons Ford had ever hired; Robert McNamara in particular was off the charts

They acted as specialized generalists (i.e. versatilists):

    Were as adept at data collection and statistical analysis as they were at leading and negotiating

Could perform each other's tasks, but were focussed on a particular role

    Continued to learn and seek out best practices

E.g. they implemented some of Peter Drucker's teachings, such as divisional structuring

Their experience in the Air Force infused them with a humility and maturity allowing them to operate effectively within a large organization

In spite of their nickname "The Whiz Kids", they were not prima donnas

They were competitive amongst themselves and were fiercely driven to demonstrate measurable bottom-line improvements


    Smart People will Always Make Bad Decisions

Jonah Lehrer's book How We Decide should be required reading for all analysts

The book explains why even the best of us are prone to make bad decisions

    All too often, good information is wilfully ignored

Even McNamara made some famously bad decisions after he left Ford

As Secretary of Defense during the Vietnam War, Robert McNamara continued to order the use of Agent Orange, in spite of a report from the RAND Corporation showing that it did not help

McNamara disagreed with Edward Lansdale (a general who successfully led a counter-insurgency campaign in the Philippines), and ignored all his unconventional wisdom

McNamara (under LBJ) arguably rationalized the poor decisions he had already made on poor information, and refused to consider any new information

Therefore, if we are to truly act in a rational manner we must above all else embrace humility


    The Story of the Relational Model

Relational databases such as Oracle, DB2, SQL Server, PostgreSQL, MySQL, and Access have all been around for a while, at least since the 1970s

    What came before relational databases?

    Who invented the relational model and why?

Why is there a holy war between the relational purists and object oriented purists?

What are NOSQL (Not Only SQL) databases?

    Why were they invented?


Punched Card Era: pre-magnetic storage

In 1725 punched cards were used in France by Basile Bouchon and Jean-Baptiste Falcon to control textile looms

    Technique was improved by Jacquard in 1801

In 1832 Semyon Korsakov (a Ukrainian working for the Russian government) invented a search system using punched cards

In 1890 Herman Hollerith invented a punched card and tabulating machine for the United States Census

    Size was the same as an 1887 dollar bill

Enough for 80 columns and 12 rows (80x25 still exists in terminals, e.g. the Windows 7 command prompt)

Hollerith left the US government and founded the Tabulating Machine Company in 1896.

This company became IBM


    1930s and 1940s: The Information Age Begins

In 1936 Alan Turing introduced the Universal Turing Machine as a thought experiment

    Demonstrated that all computers are fundamentally the same

Divorcing the concept of computing from all physical implementations

In 1947, at AT&T's Bell Labs, the first working transistor was created

In 1948 Claude E. Shannon, working at Bell Labs, published the seminal paper "A Mathematical Theory of Communication"

Shannon introduced the concept of the bit and showed how all information could be reduced to a stream of bits

Shannon's paper sparked new thinking in practically every domain, and in particular led to huge paradigm shifts in physics, chemistry, biology, psychology, and anthropology

    Randomness = Complexity = Information


Early 1950s: pre-generalization era

Scientific applications dominated the early 1950s, with a shift to business administrative systems by the end of the decade

Standard application packages were rare; most software was written for the customer (the money was in the hardware)

    Payroll was the first killer app

General Electric set the standard for payroll processing in 1954, running on a Univac

    Difficult to deal with special-case handling.

Was more complicated than missile control systems

    Essential Complexity!

Programmers spent much of their time writing low-level data access and manipulation routines

A need to hide the complexity of data manipulation and retrieval from application programmers was well recognized


Late 1950s: pre-DBMS era (Pt. 2)

Software abstraction (known as "generalization") began to take hold

Sorting was one of the first things to be generalized into re-usable code across customer installations

The Report Generation Program was first developed in 1957 by GE's team at the Hanford Nuclear Reservation on its IBM 702

Consumed as input a data dictionary and a file containing the desired report format (including calculated fields)

The SHARE Data Processing Committee (like today's open source communities) first met October 2nd, 1957, chaired by Charles Bachman


1960s: General Electric's IDS

In 1961 Charles Bachman first developed IDS (Integrated Data Store) at General Electric

Was made possible by new random-access disk technology, as opposed to sequential tapes

Developed as the DBMS for a Manufacturing Information and Control System (MIACS) used for GE's High Voltage Switchgear (HVSG) Department

Later sold externally to Mack Trucks and Weyerhaeuser

World's first true transaction-oriented DBMS

Followed a Network Model

Data element relationships were explicitly encoded and had to be explicitly traversed

Application programs had to be modified to take advantage of new indexes

Was later sold to B.F. Goodrich

Was modernized to behave more like an RDBMS and was rebranded IDMS (Integrated Database Management System)

Currently being sold by CA, running on IBM mainframes


1960s: IBM's IMS/360

In 1963 IBM was asked to build a database for the Apollo space mission, to manage parts inventory

IMS (Information Management System) was originally built in collaboration with Rockwell Space Division and released in 1965 for IBM 7000 series hardware

    Utilized a hierarchical data model

In 1966 IMS was moved under the development of OS/360 (under the leadership of Fred "Mythical Man-Month" Brooks); IMS was now rebranded as IMS/360

Available for routine use at Rockwell on August 14th, 1968

IMS/360 led to many changes to OS/360 itself to provide nonstop operation and recovery

IBM also developed an alternative DBMS called GIS (Generalized Information System). GIS supported more flexible querying, but never achieved the success of IMS

IMS 11 currently runs on IBM's System z mainframes, and continues to sell well in telecom, airlines, and finance


1965-1973: DBTG and System/360 years

In 1965 Codasyl (Conference on Data Systems Languages) forms the DBTG (Data Base Task Group)

    Was led by Charles Bachman (inventor of IDS)

DBTG's mission was to create a DBMS standard

Standardized terms such as "record", "set", and "database", and added the term "schema" to describe the logical format of data

Some terms would later change (e.g. "Data Structure Class" is now referred to as a "Data Model")

In 1964 IBM's System/360 was designed to support software compatibility between varying hardware platforms

In 1968 IBM began unbundling software, consulting services, and training services

In 1969, the DBTG published a language specification for a Network Database Model known as the Codasyl Data Model

ANSI and ISO adopted the Codasyl Data Model, calling it the Network Database Language (NDL). The ISO 8907:1987 standard was eventually withdrawn in 1998, having been superseded by SQL standardization

The confluence of the DBTG recommendations, System/360, and IBM's unbundling of software led to an explosion of DBMS vendors

In 1972 there were 82 vendors offering 275 packages for the life insurance industry

Major DBMSs were: IDMS; IMS; Cincom Total; System 2000; Adabas; and Datacom/DB

Fun fact: In 1971, the Data Base Users Group was formed in Toronto (later renamed IRMAC [Information Resource Management Association of Canada]). It went on to become part of DAMA-I, and is still recognized as the first operating chapter of DAMA-I

Fun fact: Tom Nies, CEO of Cincom, is the longest-serving CEO of any IT company.


    1969: Enter the Relational Model

In 1969, Edgar F. Codd, working out of IBM's San Jose Research Laboratory, internally published a paper titled "A Relational Model of Data for Large Shared Data Banks"

The paper was published externally in 1970 in Communications of the ACM

    The Relational Model was grounded in pure mathematics.

    Set Theory (relational algebra) and First Order Logic (relational calculus)

The Relational Model proved to be better aligned with how the business viewed data

Perspective-neutral: shifted the responsibility of specifying relationships between tables from the person designing them to the person querying them

Necessary for establishing large, general-purpose databases shared between different departments and computer systems

Non-procedural (i.e. declarative): tell the RDBMS WHAT you want, not HOW to get the data (see the SQL sketch below)

IBM initially passed on implementing Codd's recommendations for fear of cannibalizing sales of IMS

In 1973 IBM began working on System R, based on Codd's relational model, but the software architects were cut off from Codd and did not entirely understand the relational model

IBM eventually released a relational database, DB2, which is to this date their de facto database solution

Fun fact: Codd was born in England, and moved to the US in 1948 to work for IBM as a programmer. In 1953, fed up with McCarthyism, he moved to Ottawa, Ontario and lived there for a decade before moving back to the US.
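To make the declarative point concrete, here is a minimal SQL sketch (the emp and dept tables are hypothetical, not from the talk): the query states WHAT result is wanted, and the RDBMS optimizer decides HOW, including join order and access paths.

-- Hypothetical tables: emp(emp_id, dept_id, salary), dept(dept_id, dept_name).
-- We state the result we want; the optimizer picks the join strategy.
SELECT d.dept_name, COUNT(*) AS headcount
FROM emp e
JOIN dept d ON d.dept_id = e.dept_id
WHERE e.salary > 50000
GROUP BY d.dept_name;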


    1970s Commercial Implementations (RDMS and INGRES)

The first relational database was RDMS (Relational Data Management System), built at MIT by L.A. Kraning and A.I. Fillat

Written in PL/I for the Multics OS

The relation concept was implemented as a matrix of reference numbers which refer to character-string datums stored elsewhere in distinct dataclass files

In 1973 two scientists at Berkeley, Michael Stonebraker and Eugene Wong, learned of the System R project and sought funding to create a relational database of their own

Stonebraker and Wong already had funding for a geographic database called Ingres (INteractive Graphics REtrieval System). They decided to abandon this project and pursue an RDBMS

Additional funding came from the National Science Foundation, the Air Force Office of Scientific Research, the Army Research Office, and the Navy Electronic Systems Command

INGRES was developed at UC Berkeley by a rotating group of students and staff. An initial prototype was released in 1974.

    Ran on DEC UNIX machines

INGRES was quasi open source. You could purchase the source code for a fee, and build on it.

Used a query language called Quel (as opposed to SQL); see the example below

Many companies released products based on the INGRES source code.

The most successful company was Relational Technology Inc (RTI)

    Robert Epstein was one of the lead developers who went on to found Sybase

Sybase's flagship RDBMS was eventually licensed to Microsoft and lives on as MS SQL Server

range of e is employee
retrieve (comp = e.salary / (e.age - 18))
where e.name = "Jones"
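For comparison, a rough modern-SQL rendering of the Quel example above might be (assuming the employee table with name, salary, and age columns that the Quel snippet implies):

-- Same computation as the Quel example: a derived 'comp' value for Jones.
SELECT e.salary / (e.age - 18) AS comp
FROM employee e
WHERE e.name = 'Jones';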


    1980s Commercial Implementations (Oracle and Informix)

Oracle was founded in 1977 by Larry Ellison, Bob Miner, and Ed Oates.

The original name of the company was Software Development Laboratories (SDL), which became Relational Software Inc (RSI), and eventually was named after their flagship product, Oracle.

Ellison wanted to make a product that was compatible with IBM's System R, although this was not possible, since IBM kept the error codes secret.

Oracle derived early success because it was written in C, and was easier to port to other hardware platforms

Oracle beat out Ingres by 1985 since it had standardized on SQL (as opposed to Ingres' Quel), which was more popular.

SQL was in fact based on IBM System R's non-relational SEQUEL (Structured English Query Language)

Oracle out-marketed Ingres

Informix (INFORMation on unIX) was founded in 1981 by Roger Sippl and Laura King

In 1985 Informix introduced a new product, ISQL, which moved database access code into the query engine (as opposed to requiring the client to perform direct C-ISAM manipulations)

Was a pioneer in, and set the stage for, client-server computing, which came to dominate in the 1990s

Fun fact: The name Oracle comes from the code name of a CIA project which the Oracle founders had all worked on while at the Ampex Corporation.


    The 90s: Object Oriented Databases (OODBMS)

In 1988 Versant became the first company to introduce an OODBMS (object-oriented database management system)

The Object Data Management Group was formed in 1991, and ratified the Object Definition Language (ODL) and Object Query Language (OQL)

Sybase took a different approach and introduced Stored Procedures

Coupling code and data inside the RDBMS, a key OOP principle

ANSI SQL (and RDBMS vendors) continue to add complex datatypes and operators to their offerings:

    Geometric datatypes and operators

    Spatial datatypes and operators

Hierarchy datatypes and operators

Oracle added Binary Large Objects (BLOBs), and recently Microsoft has added FILESTREAM support

    OODBMS have come back in cloud computing

    Salesforce.com / force.com / database.com

    Access 2010 Web DB (running on SharePoint 2010)

SELECT manufacturer, AVG(SELECT part.pc.ram FROM partition part)
FROM PCs pc
GROUP BY manufacturer: pc.manufacturer

Type Date Tuple {year, day, month}
Type year, day, month integer

Class Manager attributes (
    id : string unique,
    name : string,
    phone : string,
    set employees : Tuple {[Employee], Start_Date : Date}
)

Class Employee attributes (
    id : string unique,
    name : string,
    Start_Date : Date,
    manager : [Manager]
)


Codd's 12 Rules and Date's Third Manifesto

Codd observed that no vendor had correctly implemented the relational model. To clarify his intent he published 13 (0 to 12) basic conditions that must be met in order for a DBMS to be considered relational

To this day, no vendor can satisfy all 13 rules. E.g.:

    Updatable VIEWs are nigh impossible to implement

The completeness constraint cannot easily be implemented

In spite of the popularity of RDBMSs, starting in the 1980s and continuing through to the present, Christopher Date (who worked with Codd on the relational model) has maintained that commercial RDBMSs are not truly relational

In 1995 Christopher Date and Hugh Darwen published The Third Manifesto

The major theme of The Third Manifesto is that the relational model is not flawed; rather, RDBMS vendors have not correctly implemented it

    In particular, SQL is flawed

Describes a new language called D to address SQL's shortcomings

    Dataphor is a DBMS implemented with D4 (a later version of D)

Rel is implemented in Java as an interpretation of Date's manifesto

SQL continues to evolve in order to address its deficiencies

D4:
    T group by { A } add { Concat(B, C order by { A, B }) Bs }

Oracle 11.2 SQL:
    select A, listagg(B, C) within group (order by B) as Bs from T group by A
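For reference, MySQL (the RDBMS used in the Twitter story later in this deck) expresses the same aggregation with GROUP_CONCAT; a rough equivalent, assuming a literal comma separator, would be:

-- Concatenate the B values per group of A, mirroring the listagg example.
SELECT A,
       GROUP_CONCAT(B ORDER BY B SEPARATOR ',') AS Bs
FROM T
GROUP BY A;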


    The Object Relational Impedance Mismatch Holy War

Fun fact: The term "object-relational impedance mismatch" is derived from the electrical engineering term "impedance matching".

A philosophical, never-ending, and somewhat imagined debate exists between the relationalists and the object-oriented camp

    Set-oriented vs. Graph-oriented

Thinking in sets vs. thinking in fine-grained, discrete objects

Data models within object oriented programs (e.g. Java, C#) don't align with relational data models

    Much time is spent interfacing with relational databases

ORM (Object Relational Mapping) layers like Hibernate and the ADO.NET Entity Framework allow OOP developers to persist data from their own object models within an RDBMS

    Creates a virtual object database

    Some limitations still exist

Performance issues can arise (especially in joins and batch deletions)

    Often leads to application-centric data models

Key data elements required for downstream reporting are often left out

    ORM-centric languages also exist (e.g. Groovy)

RDBMS-centric people prefer accessing data via stored procedures (see the sketch below)

    Creates clean separation between RDBMS and application

Some RDBMSs support extensions in non-RDBMS languages (e.g. SQL Server allows functions and stored procs to be written in C# or VB.NET, as well as custom scalar and aggregate functions)
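As a sketch of the stored-procedure access style (MySQL syntax; the table and procedure names are hypothetical, not from the deck):

DELIMITER //
CREATE PROCEDURE get_tweets_by_keyword(IN p_keyword VARCHAR(100))
BEGIN
    -- The application calls this procedure rather than issuing ad hoc SQL,
    -- keeping a clean boundary between the RDBMS and the application.
    SELECT tweet_key, tweet_date_utc
    FROM fact_tweets
    WHERE keyword = p_keyword;
END //
DELIMITER ;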


    The Semantic Web and Linked Data

The Relational Model generally operates under the Closed World Assumption: what is not known to be true is assumed to be false

    NULL values are the exception that proves the rule

    Semantic Web is based on the opposite, the Open World Assumption

Because relational databases are centralized, they guarantee data integrity, and users can safely apply First Order Logic to derive new facts

The Semantic Web, which is decentralized, cannot provide the same guarantees of integrity. However, it more closely resembles the organic (warts and all) nature of the Internet, and in turn brings the benefits that come with decentralization

The Semantic Web is a set of technologies under the purview of the W3C. They include:

RDF (Resource Description Framework): a metamodel based on a subject-predicate-object pattern

SPARQL (SPARQL Protocol and RDF Query Language): a SQL-like language for querying RDF data

    Triplestore: database for storing RDF data

    Semantic Web projects:

DBpedia (converting Wikipedia into RDF)

    FOAF (Friend of a Friend)

    Linking Open Data (one project to rule them all)

    Ad hoc integration through web APIs seems to be more popular

PREFIX abc: <...>

SELECT ?capital ?country
WHERE {
    ?x abc:cityname ?capital ;
       abc:isCapitalOf ?y .
    ?y abc:countryname ?country ;
       abc:isInContinent abc:Africa .
}


    The Big Data Challenge and NOSQL

Big Data represents a class of problems which were hitherto seen as unrelated, but can in fact be solved with the same tools

Tracking (geolocation and proximity, ads, RFID, you name it)

    Causal Factor Discovery

    Smart Utility Meters

    Genomics Analysis

Data bag (Entity-Attribute-Value, on-the-fly data modeling; sketched below)

    Two basic approaches:

    Extended RDBMS (e.g. Columnar MPP RDBMS)

    Leverages existing data warehouse tools, skills, and data models

    Slower load times

    Does not work well with unstructured data

    NOSQL, Hadoop/MapReduce

Evolving set of tools, both low-level and high-level

    Can deal with any kind of data, including BLOBs

Still cannot solve the problem of joining 1 billion dimension rows to 1 trillion fact rows

    Other NOSQL DBs

MongoDB, CouchDB: document-oriented (JSON). Support ad hoc data models and flexible querying

Redis, HBase: key-value; real-time analytics, complex event processing

Cassandra, Riak: work well under heavy writes. Cassandra started at Facebook
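The "data bag" (Entity-Attribute-Value) pattern above can be sketched in plain SQL; the table and column names here are illustrative:

-- Each attribute is a row, so new attributes need no schema change.
CREATE TABLE entity_attributes (
    entity_id       BIGINT       NOT NULL,
    attribute_name  VARCHAR(100) NOT NULL,
    attribute_value VARCHAR(255),
    PRIMARY KEY (entity_id, attribute_name)
);

INSERT INTO entity_attributes VALUES (42, 'meter_reading_kwh', '13.7');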


    The Real Challenge of Data Management

Consider the challenge of managing your own personal data and optimizing your own life; everything here is related:

    Finances

    Courses

    Home property (and all your possessions)

    Telephone, Television, Internet

    Personal Computer

    Automobile, expenses, maintenance

    Groceries

    Dependents

    Is an ultra-powerful, ultra-flexible database the solution?

    Maintaining quality data requires tremendous discipline and sacrifices

    Most companies can barely manage their Customer Master Data

    Duplication of data is still commonplace

    The real solutions are unglamorous, but separate the winners from losers:

    Master Data Management

    Metadata Management

    Data Governance

    Enterprise Frameworks and Data Models

Cloud-based RDBMS: a good Swiss Army knife. Even MS Access will do.


    The Story of TopGun Twitter Analytics

Or... how to build a Twitter Data Warehouse from public APIs and open source RDBMS and ETL tools

...and keep the open source code and run your own Twitter monitoring program


Step 1: Choose your Subjects

    Subjects are the most important WHATs

    Always nouns

    Our subjects?

    Tweets

    Twitterers


    The Art of Analytics: Deciding on Facts

In general, it's difficult to know what questions to ask of our subjects; that is the art of Analytics

KPIs (Key Performance Indicators) help us determine which facts (quantitative data) to track

Also help us think about how we would like to pivot around these facts, i.e. what qualitative (dimension) data we wish to also capture

Altimeter Group has some fancy-sounding ones:

share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocate impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact

Let's start simple:

    Follower count, following count, num URL click-thrus

Decide on a partition key (see the sketch below)

    Tweet Date (UTC) is an obvious one

    For now this is not a priority
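A minimal sketch of date-based partitioning in MySQL (the table, partition names, and date boundaries are illustrative, not from the TopGun model):

-- Range-partition the fact table by tweet date so old partitions can be
-- pruned from scans or dropped wholesale.
CREATE TABLE fact_tweets_by_day (
    tweet_key      BIGINT,
    tweet_date_utc DATE,
    url_clickthrus INT
)
PARTITION BY RANGE (TO_DAYS(tweet_date_utc)) (
    PARTITION p2011h1 VALUES LESS THAN (TO_DAYS('2011-07-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);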


    The Art of Analytics: Deciding on Dimensions

Dimensions represent the qualitative attributes pertaining to the subject

    If our subject is a tweet, the following dimensions are useful:

    Keyword searched for to find Tweet

    Time of Tweet (both GMT and local time)

Text of Tweet

Twitterer who tweeted the tweet

    Location of Twitterer

    Software Client used to send out tweet (e.g. TweetDeck)

    Web sites referenced by Tweet

    We can continue to add dimensions, as we see necessary

Once we have our facts and dimensions, we can now create a data model

Denormalized Star Schema is a tried-and-true approach to data warehouse modeling


    The Science of Analytics: Build out our Schema in RDBMS
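(The original slide shows the schema diagram, which did not survive extraction.) A minimal sketch of what such a star schema could look like in MySQL; the table and column names are illustrative, not the actual TopGun model:

-- One dimension table for the Twitterer, one fact table for tweets.
CREATE TABLE dim_twitterer (
    twitterer_key INT AUTO_INCREMENT PRIMARY KEY,
    screen_name   VARCHAR(50),
    location      VARCHAR(100)
);

CREATE TABLE fact_tweets (
    tweet_key      BIGINT AUTO_INCREMENT PRIMARY KEY,
    twitterer_key  INT,
    tweet_date_utc DATETIME,
    keyword        VARCHAR(100),  -- keyword searched for to find the tweet
    client         VARCHAR(100),  -- software client, e.g. TweetDeck
    follower_count INT,           -- fact: follower count at tweet time
    url_clickthrus INT,           -- fact: bit.ly click-thrus
    FOREIGN KEY (twitterer_key) REFERENCES dim_twitterer (twitterer_key)
);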


    The Science of Analytics: Data Definitions

    Always include a plain English definition for every data element

Ideally the data definition is unambiguous, accurate, states what it is (as opposed to what it isn't), and means the same to everybody
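In MySQL, such definitions can travel with the schema itself via COMMENT clauses; an illustrative sketch against the hypothetical fact table above:

-- The column comment records the plain-English definition in the catalog.
ALTER TABLE fact_tweets
    MODIFY url_clickthrus INT
    COMMENT 'Count of clicks on the bit.ly short URL contained in this tweet';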


    The Science of Analytics: Use Staging Tables

    Staging tables are mirrors of your fact tables

    (e.g. Staging_fact_tweets = fact_tweets)

Staging tables allow you to prepare your fact table data without incurring the performance hits that normally occur when manipulating massive tables; a minimal load pattern is sketched below
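A minimal sketch of the load pattern in MySQL, assuming the Staging_fact_tweets mirror named above:

-- Create the mirror once; MySQL copies the column definitions.
CREATE TABLE IF NOT EXISTS staging_fact_tweets LIKE fact_tweets;

-- Each load: prepare rows in the small staging table first...
TRUNCATE TABLE staging_fact_tweets;
-- (ETL inserts and cleans the day's rows here)

-- ...then append to the big fact table in one set-based statement.
INSERT INTO fact_tweets
SELECT * FROM staging_fact_tweets;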


The Science of Analytics: Use an ETL tool to load data

ETL (Extract-Transform-Load) tools are purpose-built for loading data warehouses. Advantages include:

    Easy to write code that runs safely in parallel

    Configuration-oriented: Safer to change in live production environments

Visual metaphor: self-documenting code. Easier for others to understand and support


    Start your engines

    Set up some topics (e.g. NoSQL)

    Enter some keywords for the topic

    Begin running TopGun Twitter Analytics to commence data collection


    Load Data into BI Tool (or just query using SQL)

    Some BI tools may require you to build an OLAP data model

OLAP tools build cubes which contain the aggregation of every fact, for every combination of dimension values

MOLAP tools handle sparsity well, and can achieve excellent compression, even for billions of dimension tuples
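In SQL terms, a cube amounts to precomputing aggregates for every grouping of dimension values; MySQL's WITH ROLLUP gives a small taste of the idea (illustrative, reusing the hypothetical fact table from earlier):

-- Produces per-(keyword, client) counts plus subtotal and grand-total rows.
SELECT keyword, client, COUNT(*) AS tweet_count
FROM fact_tweets
GROUP BY keyword, client WITH ROLLUP;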


    Presentation Over: Download the source code

Includes Pentaho DI ETL source code, the MySQL data model, and a QlikView v10 load script.

Licensed as open source under the GNU General Public License v3.0

Can be downloaded from SourceForge

    https://sourceforge.net/projects/topgun/

NB: Requires a bit.ly developer key (free)

Your IP address is used to rate-limit Twitter user lookups (be aware if you're sharing an IP, or using Twitter for other purposes)

    Questions can be e-mailed to: [email protected]
