Big Data: How the Information Revolution Is Transforming Our Lives

129

Transcript of Big Data: How the Information Revolution Is Transforming Our Lives

Page 1: Big Data: How the Information Revolution Is Transforming Our Lives
Page 2: Big Data: How the Information Revolution Is Transforming Our Lives

BIGDATAHowtheInformationRevolutionisTransformingOurLives

BRIANCLEGG

Page 3: Big Data: How the Information Revolution Is Transforming Our Lives

ForGillian,ChelseaandRebecca

Page 4: Big Data: How the Information Revolution Is Transforming Our Lives

ACKNOWLEDGEMENTS

I’vehadalongrelationshipwithdataandinformation.WhenIwasatschoolwedidn’thaveanycomputers,butpatient teachershelpedustopunchcardsbyhand,whichweresentoffbyposttoLondonandwe’dgetaprint-outaboutaweeklater.Thistaughtmetheimportanceofaccuracyincoding–sothankstoOliver Ridge,Neil Sheldon and theManchesterGrammar School. I alsoowealottomycolleaguesatBritishAirways,whotooksomenascentskillsand turnedme intoadataprofessional;particularmention isneededforSueAggleton, John Carney and Keith Rapley. And, as always, thanks to thebrilliant team at Icon Books who were involved in producing this series,notablyDuncanHeath,SimonFlynn,RobertSharmanandAndrewFurlow.

Page 5: Big Data: How the Information Revolution Is Transforming Our Lives

CONTENTS

TitlePageDedicationAcknowledgements1 Weknowwhatyou’rethinking2 Sizematters3 Shoptillyoudrop4 Funtimes5 Solvingproblems6 BigBrother’sbigdata7 Good,badanduglyFurtherreadingIndexAbouttheAuthorOtherHotSciencetitlesavailablefromIconBooksCopyright

Page 6: Big Data: How the Information Revolution Is Transforming Our Lives

1WEKNOWWHATYOU’RETHINKING

Thebigdealaboutbigdata

It’shardtoavoid‘bigdata’.Thewordsarethrownatusinnewsreportsandfromdocumentaries all the time.Butwe’ve lived in an information age fordecades.Whathaschanged?

Takea lookata successstoryof thebigdataage:Netflix.OnceaDVDrentalservice,thecompanyhastransformeditselfasaresultofbigdata–andthe change is far more than simply moving from DVDs to the internet.Providing an on-demand video service inevitably involves handling largeamountsofdata.ButsodidrentingDVDs.AllaDVDdoesisstoregigabytesofdataonanopticaldisc.Ineithercasewe’redealingwithdataprocessingonalargescale.Butbigdatameansfarmorethanthis.It’saboutmakinguseofthe whole spectrum of data that is available to transform a service ororganisation.

Netflixdemonstrateshowanon-demandvideocompanycanputbigdataatitsheart.ServiceslikeNetflixinvolvemoretwo-waycommunicationthanaconventionalbroadcast.Thecompanyknowswhoiswatchingwhat,whenandwhere. Its systems can cross-index measures of a viewer’s interests, alongwith their feedback.We as viewers see the outcome of this analysis in therecommendationsNetflixmakes,andsometimes theyseemodd,because thesystem is attempting to predict the likes and dislikes of a single individual.But from theNetflix viewpoint, there is amuch greater andmore effectivebenefitinmatchingpreferencesacrosslargepopulations:itcantransformtheprocessbywhichnewseriesarecommissioned.

Take,forinstance,thefirstNetflixcommissiontobreakthroughasamajorseries:HouseofCards.Had thisbeenaproject for a conventionalnetwork,thebroadcasterwouldhaveproducedapilot,trieditoutonvariousaudiences,perhaps risked funding a short season (which could be cancelled part waythrough)andonlythencommittedtotheserieswholeheartedly.Netflixshort-

Page 7: Big Data: How the Information Revolution Is Transforming Our Lives

circuitedthisprocessthankstobigdata.Theproducersbehind the series,MordecaiWiczyk andAsifSatchu, had

toured the US networks in 2011, trying to get funding to produce a pilot.However,therehadn’tbeenasuccessfulpoliticaldramasinceTheWestWingfinished in 2006 and the people controlling the money felt that House ofCardswastoohighrisk.However,Netflixknewfromtheirmassofcustomerdata that they had a large customer base who appreciated the humour anddarkness of the original BBC drama the show was based on, which wasalready in the Netflix library. Equally, Netflix had a lot of customers wholiked the work of director David Fincher and actor Kevin Spacey, whobecamecentraltothemakingoftheseries.

Ratherthancommissionapilot,withstrongevidencethattheyhadareadyaudience,Netflixput$100millionupfrontforthefirsttwoseries,totalling26episodes. This meant that the makers ofHouse of Cards could confidentlypaintonamuchlargercanvasandgivetheseriesfarmoredepththanitmightotherwisehavehad.Andtheoutcomewasahugesuccess.NoteveryNetflixdramacanbeassuccessfulasHouseofCards.Butmanyhavepaidoff,andevenwhenthetakeupisslower,aswiththe2016NetflixdramaTheCrown,givenasimilarhigh-cost two-seasonstart, showshavefar longer tosucceedthanwhenconventionallybroadcast.Themodelhasalreadydeliveredseveralmajor triumphs,withdecisionsdrivenbybigdatarather than thegutfeelofindustry executives, infamous for getting itwrong farmore frequently thantheygetitright.

Theability tounderstand thepotentialaudience foranewserieswasnottheonlywaythatbigdatahelpedmakeHouseofCardsasuccess.Cleveruseofdatameant,forinstance,thatdifferenttrailersfortheseriescouldbemadeavailable todifferentsegmentsof theNetflixaudience.Andcrucially, ratherthanreleasetheseriesepisodebyepisode,aweekatatimeasaconventionalnetwork would, Netflix made the whole season available at once.With noadvertisingtorequireanaudiencetobespreadacrosstime,Netflixcouldputviewingcontrolinthehandsoftheaudience.Thishassincebecomethemostcommon release strategy for streaming series, and it’s amodel that is onlypossiblebecauseofthebigdataapproach.

Bigdataisnotallaboutbusiness, though.Amongotherthings, ithasthepotentialtotransformpolicingbypredictinglikelycrimelocations;toanimateastillphotograph;toprovidethefirstevervehicleforgenuinedemocracy;topredictthenextNewYorkTimesbestseller;togiveusanunderstandingofthefundamentalstructureofnature;andtorevolutionisemedicine.

Less attractively, it means that corporations and governments have the

Page 8: Big Data: How the Information Revolution Is Transforming Our Lives

potentialtoknowfarmoreaboutyou,whethertoselltoyouortoattempttocontrolyou.Don’tdoubt it –bigdata ishere to stay,making it essential tounderstandboththebenefitsandtherisks.

Thekey

Just as happened with Netflix’s analysis of the potential House of Cardsaudience, the power of big data derives from collecting vast quantities ofinformationandanalysingitinwaysthathumanscouldneverachievewithoutcomputersinanattempttoperformtheapparentlyimpossible.

Datahasbeenwithusalongtime.Wearegoingtoreachback6,000yearsto the beginnings of agricultural societies to see the concept of data beingintroduced.Overtime,throughaccountingandthewrittenword,databecamethebackboneofcivilisation.Wewillseehowdataevolvedintheseventeenthand eighteenth centuries to be a tool to attempt to open a window on thefuture.Buttheattemptwasalwaysrestrictedbythenarrowscopeofthedataavailableandbythelimitationsofourabilitytoanalyseit.Now,forthefirsttime,bigdataisopeningupanewworld.Sometimesit’sinaflashywaywithcomputers like Amazon’s Echo that we interact with using only speech.Sometimesit’sunderthesurface,ashappenedwithsupermarketloyaltycards.What’s clear is that the applications of big data aremultiplying rapidly andpossesshugepotentialtoimpactusforbetterorworse.

Howcantherebesomuchlatentpowerinsomethingsobasicasdata?Toanswerthatweneedtogetabetterfeelforwhatbigdatareallyisandhowitcanbeused.Let’sstartwiththat‘d’word.

Page 9: Big Data: How the Information Revolution Is Transforming Our Lives

2SIZEMATTERS

Datais…

According to the dictionary, ‘data’ derives from the plural of the Latin‘datum’, meaning ‘the thing that’s given’. Most scientists pretend that wespeak Latin, and tell us that ‘data’ should be a plural, saying ‘the data areconvincing’ rather than ‘the data is convincing.’ However, the usuallyconservativeOxfordEnglishDictionary admits that usingdata as a singularmassnoun–referringtoacollection–isnow‘generallyconsideredstandard’.Itcertainlysoundslessstilted,sowewilltreatdataassingular.

‘The thing that’s given’ itself seems rather cryptic. Most commonly itreferstonumbersandmeasurements,thoughitcouldbeanythingthatcanberecordedandmadeuseoflater.Thewordsinthisbook,forinstance,aredata.

Youcanseedataasthebaseofapyramidofunderstanding:

Page 10: Big Data: How the Information Revolution Is Transforming Our Lives

From data we construct information. This puts collections of related datatogethertotellussomethingmeaningfulabouttheworld.Ifthewordsinthisbookaredata,thewayI’vearrangedthewordsintosentences,paragraphsandchapters makes them information. And from information we constructknowledge.Ourknowledgeisaninterpretationofinformationtomakeuseofit – by reading the book, and processing the information to shape ideas,opinionsandfutureactions,youdevelopknowledge.

In another example, data might be a collection of numbers. Organisingthemintoatableshowing,say,thequantityoffishinacertainseaarea,hourbyhour,wouldgiveyouinformation.Andsomeoneusingthisinformationtodecidewhenwouldbethebesttimetogofishingwouldpossessknowledge.

Climbingthepyramid

Sincehumancivilisationbeganwehaveenhancedour technology tohandledata and climb this pyramid. This began with clay tablets, used inMesopotamia at least 4,000 years ago. The tablets allowed data to bepracticallyanduseablyretained,ratherthanheldintheheadorscratchedonacavewall.Thesewereportabledatastores.Ataroundthesametime,thefirstdataprocessorwasdevelopedinthesimplebutsurprisinglypowerfulabacus.First usingmarks or stones in columns, then beads onwires, these devices

Page 11: Big Data: How the Information Revolution Is Transforming Our Lives

enabledsimplenumericdata tobehandled.Butdespiteanincreasingabilityto manipulate data over the centuries, the implications of big data onlybecameapparentattheendthenineteenthcenturyasaresultoftheproblemofkeepingupwithacensus.

In theearlydaysof theUScensus, the increasingquantityofdatabeingstored and processed looked likely to overwhelm the resources available todealwithit.Thewholeprocessseemeddoomed.Therewasaten-yearperiodbetween censuses – but as population and complexity of data grew, it tooklonger and longer to tabulate the census data. Soon, a censuswould not becompletely analysed before the next one came round. This problem wassolvedbymechanisation.Electromechanicaldevices enabledpunchedcards,eachrepresentingasliceofthedata,tobeautomaticallymanipulatedfarfasterthananyhumancouldachieve.

Bythelate1940s,withtheadventofelectroniccomputers,theequipmentreached the second stage of the pyramid. Data processing gave way toinformation technology. There had been information storage since theinventionofwriting.Abookisaninformationstorethatspansspaceandtime.Butthenewtechnologyenabledthatinformationtobemanipulatedasneverbefore. The new non-human computers (the term originally referred tomathematiciansundertakingcalculationsonpaper)couldnotonlyhandledatabutcouldturnitintoinformation.

Foralongwhileitseemedasifthefinalstageofautomatingthepyramid–turning information into valuable knowledge – would require ‘knowledge-based systems’. These computer programs attempted to capture the ruleshumans used to apply knowledge and interpret data. But good knowledge-basedsystemsprovedelusiveforthreereasons.Firstly,humanexpertswereinno hurry to make themselves redundant and were rarely fully cooperative.Secondly,humanexpertsoftendidn’tknowhow theyconverted informationintoknowledgeandcouldn’thaveexpressedtherulesfortheITpeopleevenhadtheywantedto.Andfinally,theaspectsofrealitybeingmodelledthiswayprovedfartoocomplextoachieveausefuloutcome.

Therealworldisoftenchaoticinamathematicalsense.Thisdoesn’tmeanthatwhathappensisrandom–quitetheopposite.Rather,itmeansthattherearesomany interactionsbetween thepartsof theworldbeingstudied thataverysmallchangeinthepresentsituationcanmakeahugechangetoafutureoutcome. Predicting the future to any significant extent becomes effectivelyimpossible.

Now, though, as we undergo another computer revolution through theavailability of the internet and mobile computing, big data is providing an

Page 12: Big Data: How the Information Revolution Is Transforming Our Lives

alternative,morepragmatic approach to takingon the top levelof thedata–information–knowledge pyramid.A big data system takes large volumes ofdata–data that isusually fast flowingandunstructured–andmakesuseofthe latest information technologies to handle and analyse this data in a lessrigid,more responsive fashion.Until recently thiswas impossible.Handlingdataonthisscalewasn’tpractical,sothosewhostudiedafieldwouldrelyonsamples.

A familiar use of sampling is in opinion polls, where pollsters try todeducetheattitudesofapopulationfromasmallsubset.Thatsmallgroupiscarefully selected (in a good poll) to be representative of the wholepopulation,butthereisalwaysassumptionandguessworkinvolved.Asrecentelectionshaveshown,pollscanneverprovidemorethanagoodguessoftheoutcome.The 2010UKgeneral election?The polls got itwrong.The 2015UKgeneralelection?Thepollsgotitwrong.The2016BrexitreferendumandUSpresidentialelection–youguessedit.We’lllookatwhypollsseemtobefailingsooftenalittlelater(seepage23),butbigdatagetsaroundthepollingproblembytakingoneveryone–andthetechnologywenowhaveavailablemeans that we can access the data continuously, rather than through theclumsy,slowmechanismsofanold-schoolbigdataexerciselikeacensusorgeneralelection.

Past,presentandfuture

For lovers of data, each of past, present and future has a particular nuance.Traditionally,datafromthepasthasbeentheonlycertainty.Theearliestdataseemstohavebeenprimarilyrecordsofpasteventstosupportagricultureandtrade. Itwas thebeancounterswhofirstunderstood thevalueofdata.Whattheyworkedwiththenwasn’talwaysveryapproachable,though,becausetheveryconceptofnumberwasinastateofflux.

Lookback,forinstance,tothemightycitystateofUruk,foundedaround6,000yearsagoinwhatisnowIraq.ThepeopleofUrukweresooncapturingdata about their trades, but they hadn’t realised that numbers could beuniversal.Wetakethisforgranted,butitisn’tnecessarilyobvious.So,ifyouwereanUruktraderandyouwantedtocountcheese,freshfishandgrain,youwould use a totally different number system to someone counting animals,humansordriedfish.Evenso,datacomeshandinhandwithtrade,asitdoeswith theestablishmentofstates.Theword‘statistics’has thesameoriginas‘state’ –originally itwasdata about a state.Whetherdatawas captured fortradeortaxationorprovisionofamenities,itwasimportanttoknowaboutthe

Page 13: Big Data: How the Information Revolution Is Transforming Our Lives

past.Inasense,thisdependenceonpastdatawasnotsomuchaperfectsolution

asapragmaticreflectionofthepossible.Theidealwastoalsoknowaboutthepresent.Butthiswasonlypracticalforlocaltransactionsuntilthemechanismsforbigdatabecameavailabletowardstheendofthetwentiethcentury.Evennow,manyorganisationspretendthatthepresentdoesn’texist.

It is interestingtocomparetheapproachofabusinessdrivenbybigdatasuch as a supermarket with a less data-capable organisation like a bookpublisher. Someone in the head office of amajor supermarket can tell youwhatissellingacrosstheirentirearrayofshops,minutebyminutethroughouttheday.Heorshecaninstantlycommunicatedemandtosuppliersandbytheend of the day, the present data is part of the big data source for the next.Publishing(asseenbyanauthor)isverydifferent.

Typically,anauthorreceivesasummaryofsalesfor,say, thesixmonthsfrom January to June at the end of September and will be paid for this inOctober. It’s not that on-the-day sales systems don’t exist, but nothing isintegrated.Itdoesn’thelpthatpublishingoperatesadata-distortingapproachof ‘sale or return’, whereby books are listed as being ‘sold’when they areshippedtoabookstore,butcanthenbereturnedforarefundatanytimeinthefuture. This is an excellent demonstration ofwhywe struggle to copewithdata from the present – the technology might be there, but commercialagreements are rooted in the past, and changing to a big data approach is asignificantchallenge.Andthat’sjustadvancingfromthepasttothepresent–thefutureisawholedifferentballgame.

It wasn’t until the seventeenth century that there was a consciousrealisation thatdatacollected from thepastcouldhaveanapplication to thefuture. I’m stressing that word ‘conscious’ because it’s somethingwe havealwaysdoneashumans.Weusedatafromexperiencetohelpusprepareforfuturepossibilities.Butwhatwasnewwas toconsciouslyandexplicitlyusedatathisway.

It began in seventeenth-centuryLondonwith abuttonmaker called JohnGraunt.Outofscientificcuriosity,Grauntgothishandson‘billsofmortality’–documentssummarisingthedetailsofdeathsinLondonbetween1604and1661.Grauntwasnotjustinterestedinstudyingthesenumbers,butcombinedwhathecouldgleanfromthemwithasmanyotherdatasourcesashecould–scrappydetails,forinstance,ofbirths.Asaresult,hecouldmakeanattemptbothtoseehowthepopulationofLondonwasvarying(therewasnocensusdata)andtoseehowdifferentfactorsmightinfluencelifeexpectancy.

It was this combination of data from the past and speculation about the

Page 14: Big Data: How the Information Revolution Is Transforming Our Lives

futurethathelpedaworldwideindustrybegininLondoncoffeehouses,basedonthekindofcalculationsthatGraunthaddevised.Inaway,itwaslikethegambling thathad takenplace formillennia.But thedifferencewas that thedatawas consciously studied and used to devise plans. This new, informedtypeofgamblingbecametheinsurancebusiness.Butthiswasjustthestartofourinsatiableurgetousedatatoquantifythefuture.

Crystalballs

There was nothing new, of course, about wanting to foretell what wouldhappen.Whodoesn’twanttoknowwhat’sinstoreforthem,whowillwinawar,orwhichhorsewillwinthe2.30atChepstow?Augurs,astrologersandfortunetellershavedonesteadybusinessformillennia.Traditionally,though,theabilitytopeerintothefuturereliedonimaginarymysticalpowers.WhatGrauntandtheotherearlystatisticiansdidwasoffer thehopeofascientificviewofthefuture.Datawastoformaglowingchain,linkingwhathadbeentowhatwastocome.

This was soon taken far beyond the quantification of life expectancies,useful though that might be for the insurance business. The science offorecasting, the prediction of the data of the future, was essential foreverything from meteorology to estimating sales volumes. Forecastingliterallymeanstothroworprojectsomethingahead.Bycollectingdatafromthepast, andasmuchaspossible about thepresent, the ideaof the forecastwasto‘throw’numbersintothefuture–topushasidetheveiloftimewiththehelpofdata.

The quality of such attempts has always been very variable. MoaningabouttheaccuracyofweatherforecastshasbeenanationalhobbyintheUKsince theystarted inTheTimes in the1860s, though theyarenowfarbetterthan theywere40yearsago, for reasonswewilldiscover inamoment.Wefind it verydifficult to accepthowqualitativelydifferentdata from thepastanddataonthefutureare.Afterall,theyaresetsofnumbersandcalculations.It all seems very scientific.We have a natural tendency to give each equalweighting,sometimeswithhilariousconsequences.

Take, for example, a business mainstay, the sales forecast. This is acompany’s attempt to generate data on future sales based on what hashappened before. In every business, on a regular basis, those numbers areinaccurate. And when this happens, companies traditionally hold a post-mortemon‘whatwentwrong’withtheirbusiness.Thispost-mortemprocessblithelyignorestherealitythattheforecast,almostbydefinition,wasgoingto

Page 15: Big Data: How the Information Revolution Is Transforming Our Lives

bewrong.Whathappenedisthattheforecastdidnotmatchthesales,butthepost-mortem attempts to establishwhy the sales did notmatch the forecast.The reason behind this confusion is a common problemwhenever we dealwithstatistics.Weareover-dependentonpatterns.

Patternsandself-deception

Patternsare theprincipalmechanismused tounderstand theworld.Withoutmaking deductions from patterns to identify predators and friends, food orhazards,wewouldn’tlastlong.Ifeverytimealargeobjectwithfourwheelscamehurtlingtowardsusdownaroadwehadtoworkoutifitwasathreat,wewouldn’t survive crossing the road.We recognise a car or a lorry, eventhough we’ve never seen that specific example in that specific shape andcolour before.Andwe act accordingly. For thatmatter, science is all aboutusingpatterns–withoutpatternswewouldneedanewtheoryforeveryatom,everyobject,everyanimal,toexplaintheirbehaviour.Itjustwouldn’twork.

Thisdependenceonpatternsisfine,butwearesofinelytunedtorecognisethings through pattern that we are constantly being fooled.When the 1976Viking1probetookdetailedphotographsofthesurfaceofMars,itsentbackan image that our pattern-recognising brains instantly told uswas a face, acarvingonavastscale.Morerecentpictureshaveshownthiswasanillusion,causedbyshadowswhentheSunwasataparticularangle.Therockyoutcropbearsnoresemblancetoaface–butit’salmostimpossiblenottoseeoneintheoriginalimage.There’sevenawordforseeinganimageofsomethingthatisn’tthere:pareidolia.Similarlythewholebusinessofforecastingisbasedonpatterns–itisbothitsstrengthanditsultimatedownfall.

Page 16: Big Data: How the Information Revolution Is Transforming Our Lives

The‘faceonMars’asphotographedin2001,andinsetthe1976image.

If therearenopatternsatall in thehistoricaldatawehaveavailable,wecan’t say anythinguseful about the future.Agood exampleof datawithoutanypatterns–specificallydesigned tobe thatway– is theballsdrawn inalottery.Currently,theUKLottogamefeatures59balls.Ifthemechanismofthedrawisundertakenproperly,thereisnopatterntothewaytheseballsaredrawnweekonweek.Thismeans that it is impossible to forecastwhatwillhappeninthenextdraw.Butlogicisn’tenoughtostoppeopletrying.

Takea lookon the lottery’swebsite andyouwill findapagegiving thestatistics on each ball. For example, a table shows how many times eachnumberhasbeendrawn.Atthetimeofwriting,the59-balldrawhasbeenrun116 times.Themost frequentlydrawnballswere14 (drawnnineteen times)and41 (drawnseventeen times).Despite therebeingnoconnectionbetweenthem, it’s almost impossible to stop a pattern-seeking brain from thinking‘Hmm, that’s interesting.Why are the twomost frequently drawn numbersreversedversionsofeachother?’

Theleastfrequentnumberswere6,48and45,eachwithonlyfivedrawseach. This is just the nature of randomness. Random things don’t occurevenly, but have clusters and gaps. When this is portrayed in a simple,physicalfashionitisobvious.Imaginetippingacanofballbearingsontothefloor.Wewouldbe very suspicious if theywere all evenly spreadout on a

Page 17: Big Data: How the Information Revolution Is Transforming Our Lives

grid–weexpectclustersandgaps.Butmoveawayfromsuchanexampleandit’s hard not to feel that theremust be a cause for such a big gap betweennineteendrawsofball14andjustfivedrawsofball6.

Oncesuchpatternsicknesshassetin,wefindithardtoresistitspowerfulsymptoms. The reason the lottery company provides these statistics is thatmany people believe that a ball that has not being drawn often recently is‘overdue’.Itisn’t.Thereisnoconnectiontolinkonedrawwithanother.Thelottery does not have a memory.We can’t use the past here to predict thefuture.Butstillweattempttodoso.Itisalmostimpossibletoavoidtheself-deceptionthatpatternsforceonus.

Otherforecastsarelesscutanddriedthanattemptingtopredicttheresultsofthelottery.Inmostsystems,whetherit’stheweather,thebehaviourofthestockexchangeorsalesofwellingtonboots,thefutureisn’tentirelydetachedfrom the past. Here there is a connection that can be explored.We can, tosomedegree,usedata tomakemeaningfulforecasts.Butstillweneedtobecarefultounderstandthelimitationsoftheforecastingprocess.

Extrapolationandblackswans

Theeasiestwaytousedatatopredictthefutureistoassumethingswillstaythesameasyesterday.Thissimplestofmethodscanworksurprisinglywell,and requiresminimalcomputingpower. I can forecast that theSunwill risetomorrowmorning(or,ifyou’repicky,thattheEarthwillrotatesuchthattheSunappearstorise)andthechancesarehighthatIwillberight.Eventuallymy prediction will be wrong, but it is unlikely to be so in the lifetime ofanyonereadingthisbook.

Evenwhereweknowthatusing‘moreof thesame’asa forecasting toolmustfail relativelysoon, itcandeliverfor the typical lifetimeofabusiness.Asof2016,Moore’sLaw,whichpredictsthatthenumberoftransistorsinacomputer chipwilldouble everyone to twoyearshasheld true forover50years. We know it must fail at some point, and have expected failure tohappen ‘soon’ for at least twenty years, but ‘more of the same’ has doneremarkably well. Similarly, throughout most of history there has beeninflation.Thevalueofmoneyhasfallen.Therehavebeenperiodsofdeflation,andtimeswhenaredefinitionofaunitofcurrencymovesthegoalposts,butoverall‘thevalueofmoneyfalls’worksprettywellasapredictor.

Unfortunatelyforforecasters,veryfewsystemsarethissimple.Many,forinstance,havecyclicvariations.Imentionedattheendoftheprevioussectionthatwellingtonbootsalescanbepredictedfromthepast.However,todothis

Page 18: Big Data: How the Information Revolution Is Transforming Our Lives

effectively we need access to enough data to see trends for those salesthroughout the year. We’ve taken the first step towards big data. It’s notenough to say that nextweek’s sales should be the same as lastweek’s, orshouldgrowbyapredictableamount.Instead,weathertrendswillensurethatsalesaremuchhigher, for instance, inautumnthan theyareat theheightofsummer(notwithstandingabriefsurgeofsalesaroundthenotoriouslymuddymusicfestivalseason).

Takeanotherexample–barbecues.SupermarketchainTescoreckonsthata 10°C increase in temperature at the start of summer results in a threefoldincreaseinmeatsalesasall thebarbecuefansgointocavemanmode.Butasimilar temperature increase later in summer, once barbecuing is less of anovelty,doesnothavethesameimpact.Soasupermarketneedstohavebothseasonalitydataandweatherdatatomakeareasonableforecast.

Seasonaleffectsarejustoneoftheinfluencesreflectedinpastdatathatcaninfluence the future. And it is when there are several such ‘variables’ thatforecasting can come unstuck. This is particularly the case if sets of inputsinteractwitheachother;theresultcanbethekindofmathematicallychaoticsystemwhere it is impossible tomake sensiblepredictionsmore thana fewdays ahead. Take the weather. The overall weather system has so manycomplex factors, all interacting with each other, that tiny differences instartingconditionscanresultinhugedifferencesdowntheline.

For this reason, long-termweather forecasts, however good the data, arefantasiesratherthanmeaningfulpredictions.Whenyounextseeanewspaperheadline in June forecasting an ‘arctic winter’, you can be sure there is noscientificbasisforit.Bythetimewelookmorethantendaysahead,ageneralidea of what weather is like in a location at that time of year is a betterpredictor than any amount of data.And if chaos isn’t enough to dealwith,there’sthematterofblackswans.

The termwasmade famous byNassimNicholas Taleb in his bookTheBlackSwan,thoughtheconceptismucholder.Asearlyas1570,‘blackswan’was being used as a metaphor for rarity when a T. Drant wrote ‘CaptaineCorneliusisablackeswaninthisgeneration’.Whattheblackswanmeansinstatistical terms is thatmaking a predictionbasedon incomplete data – andthat is nearly always the case in reality – carries the risk of a sudden andunexpected break from the past. This statistical use refers to the fact thatEuropeans, up to the exploration of Australian fauna, could make theprediction‘allswansarewhite’,anditwouldhold.Butonceyou’veseenanAustralianblackswan,theentirepremisefallsapart.

The black swan reflects a difference between two techniques of logic –

Page 19: Big Data: How the Information Revolution Is Transforming Our Lives

deductionandinduction.ThankstoSherlockHolmes,wetendtorefertotheheartofscientifictechnique,ofwhichforecastingisapart,asdeduction.Wegathercluesandmakeadeductionaboutwhathashappened.Buttheprocessofdeduction isbasedonacomplete setofdata. Ifweknew,beyonddoubt,thatall bananaswere yellow andwere then given a piece of fruit thatwaspurple,wewouldbeabletodeducethatthisfruitisnotabanana.Butinthereal world, the best we can ever do is to say that all bananas we haveencountered are yellow – when ripe. The data is incomplete. Without theavailability of deduction, we fall back on induction, which says that it ishighlylikelythatthepurplefruitthatwehavebeengivenisnotabanana.Andthat’showscienceandforecastingwork.Theymakeabestguessbasedontheavailableevidence;theydon’tdeducefacts.

In the real world, we hardly ever have complete data; we are alwayssusceptibletoblackswans.So,forinstance,stockmarketsgenerallyriseovertime–until abubblebursts and theycrash.TheoncemassivephotographiccompanyKodakcouldsensiblyforecastsalesofphotographicfilmfromyeartoyear.Inductionledthemtobelievethat,despiteupsanddowns,theoveralltrendinaworldofgrowingtechnologyusewasupward.Butthen,thedigitalcamera black swan appeared. Kodak, the first to produce such a camera,initiallytriedtosuppressthetechnology.Buttheblackswanwasunstoppableand the company was doomed, going into protective bankruptcy in 2012.Althoughapared-downKodakstillexists,itisunlikelyevertoregainitsone-timedominance.

Theaimofbigdataistominimisetheriskofafailedforecastbycollectingasmuchdataaspossible.And,aswewillsee,thiscanenablethoseincontrolofbigdatatoperformfeatsthatwouldnothavebeenpossiblebefore.Butwestill need to bear inmind the lesson ofweather forecasting.Meteorologicalforecastswerethefirsttoembracebigdata.TheMetOfficeisthebiggestuserofsupercomputersintheUK,crunchingthroughvastquantitiesofdataeachday to produce a collection of forecasts known as an ensemble. These arecombinedtogivethebestprobabilityofanoutcomeinaparticular location.And these forecasts aremuchbetter than their predecessors.But there isnochanceofrelyingonthemeverytime,orofgettingausefulforecastmorethantendaysout.

We shouldn’t underplay the impact of big data, though, because it canremove the dangers of one of the most insidious tools of forecasting, onewhichattemptstogivetheeffectofhavingbigdatawithonlyasmallfractionof thedata.Aswe’vealreadydiscovered, the limitationsofsamplingarealltooclearinthefailureof21st-centurypoliticalpolls.

Page 20: Big Data: How the Information Revolution Is Transforming Our Lives

Sampling,pollsandusingitall

There was a time when big data could only be handled on infrequentoccasionsbecauseittooksomuchtimeandefforttoprocess.Foracensus,ora general election, we could ask everyone for their data, but this wasn’tpossible for anything elsewith themanual data handling systems available.And so, instead, we developed the concept of the sample. This involvedpicking out a subset of the population, getting the data from them andextrapolatingthefindingstothepopulationasawhole.

Let’s take a simple example to get a feel for how this works – PLRpayment in the UK. PLR (Public Lending Right) is a mechanism to payauthorswhen their books are borrowed from libraries.As systems aren’t inplacetopulltogetherlendingsacrossthecountry,samplesaretakenfrom36authorities,coveringaroundaquarterofthelibrariesinthecountry.Thesearethen multiplied up to reflect lendings across the country. Clearly somenumbers will be inaccurate. If you write a book on Swindon, it will beborrowed far more in Swindon than in Hampshire, the nearest authoritysurveyed.And therewill be plenty of other reasonswhy a particular set oflibraries might not accurately represent borrowings of a particular book.Sampling is better than nothing, but it can’t compare with the big dataapproach,whichwouldtakedatafromeverylibrary.

Samplingisnot justanapproachusedforpollsandtogeneratestatistics.Think, for example, of medical studies. Very few of these can take in thepopulation as awhole – until recently thatwould have been inconceivable.Instead, they take a (hopefully) representative sample and check out theimpactofatreatmentordietonthepeopleincludedinthatsample.Therearetwoproblemswiththisapproach.Oneisthatitisverydifficulttoisolatetheimpactof theparticular treatment,and theother is that it isverydifficult tochooseasamplethatisrepresentative.

Think of yourself for amoment.Are you a representative sample of thepopulation as awhole? In someways youmay be. Youmay, for instance,havetwolegsandtwoarms,whichthemajorityofpeopledo…butit’snottrueofeveryone.Ifweuseyouasasample,youmayrepresentalargenumberof people, but ifwe assumeeveryone else is like you,we candisadvantagethosewhoaredifferent.

Inmanyotherrespectsyouarefarlessrepresentative.Bythetimewetakein your hair colour and weight and gender and ethnic origin and job andsocioeconomic group and the place you live, you will have becomeincreasingly unrepresentative. So, to pick a good sample, we need to get

Page 21: Big Data: How the Information Revolution Is Transforming Our Lives

togetherabigenoughgroupofpeople, in therightproportions, tocover thevariationsthatwillinfluencetheoutcomeofourstudyorpoll.

Andthis iswherethewholethingtendstofallapart.Providedyouknowwhatthesignificantfactorsare,therearemechanismstodeterminethecorrectsamplesizetomakeyourgrouprepresentative.Butmanymedicalstudies,forexample,canonlyaffordtocoverafractionofthatnumber–whichiswhyweoftenget contradictory studies about, say, the impactof redwineonhealth.And many surveys and polls fall down both on size and on gettingrepresentative groupings. To get around this, pollsters try to correct fordifferencesbetween the sample andwhat theybelieve it should be like.So,the numbers in a poll result are not the actual values, but a guessworkcorrectionofthosenumbers.Asanexample,herearesomeoftheresultsfromaDecember2016YouGovpollofvotingintentiontakenacross1,667Britishadults:

VotingIntention

Con Lab LibDem UKIP

Weighted 457 290 116 135

Unweighted 492 304 135 146

Thebottom‘unweighted’valuesare thenumbersof individualsansweringaparticularway.But the top ‘weighted’valuesare those thatwerepublished,producing quite different relationships between the numbers, because theseadjustmentsweredeemednecessarytobettermatchthepopulationasawhole.Inevitably,suchaweightingprocessreliesonalotofguesswork.

As a result, political opinion polls since 2010 have got it disastrouslywrong. Confidence in polls has never beenweaker, reflecting the difficultypollsters face in weighting for a representative sample. Where beforesocioeconomic groupings and party politics were sufficient, divisions havebeenchanging,influencedamongotherthingsbytheimpactofglobalisationand inequality. It didn’t help that polling organisations are almost alwaysbasedin‘metropolitanelite’locationswhichreflectoneextremeofthesenewsocial strata. Add in the unprecedented impact of social media, breakingthough the old physical social networks, and it’s not surprising that thepollsters’guessworkapproximationsstartedtofallapart.

Theuseofsuchsampleswillcontinueinmanycircumstancesforreasonsofcostandconvenience, though itwouldhelp tobuild trust if itweremade

Page 22: Big Data: How the Information Revolution Is Transforming Our Lives

easier for consumers of the results to drill down into the assumptions andweightings. But big data offers the opportunity to avoid the pitfalls ofsampling and take input from such a large group that there is far lessuncertainty. Traditionally this would have required the paraphernalia of ageneralelection,takingweekstoprepareandcollect,evenifthevoterswerewilling to go through the process many times a year. But modern systemsmake it relatively easy to collect some kinds of data even on such a largescale.Andincreasingly,organisationsaremakinguseofthis.

Often thismeans using ‘proxies’. The idea is to startwith data you cancollect easily – data, for instance, that can be pulled together without thepopulationactivelydoinganything.Atonetime,suchobservationaldatawasverydifficult forstatisticians toget theirhandson.But thenalongcame theinternet.Weusuallygotoawebsitewithaparticulartaskinmind.Wegotoasearchengine tofindsomeinformationoranonlinestore tobuysomething,forexample.Buttheownersofthosesitescancapturefarmoredatathanweare aware of sharing. What we search for, how we browse, what thecompaniesknowaboutusbecausetheymadeitattractivetohaveanaccountortousecookiestoavoidretypinginformation,allcometogethertoprovidearich picture.Allowing this data to be used gives us convenience – but alsogivesthecompaniesapowerfulsourceofdata.

Thismeansthat,shouldtheyputtheirmindtoit,theownersofadominantsearch engine could gather all sorts of information to predict our votingintentions.Thecleverthingaboutabigdataapplicationlikethisisthat,unliketheknowledge-basedsystemsdescribedabove,noonehas to tell thesystemwhat the rules are.Noonewouldneed toworkoutwhat is influencingourvoting or calculate weightings. By matching vast quantities of data tooutcomes,thesystemcouldlearnovertimetoprovideasurprisinglyaccuratereflectionofthepopulation.Certainly,itwouldenableafarbetterpredictionthananysampledpollcouldachieve.

However, impressive though the abilities of big data are,we have to beawareofthedangersofGIGO.

GratuitousGIGO

Wheninformationtechnologywastakingoff,GIGOwasapopularacronym,standingfor‘garbagein,garbageout’.Thepremiseissimple–howevergoodyour system, if the data you give it is rubbish, the outputwill be too.Onepotential danger of big data is that it isn’t big enough.We can indeed usesearch data to find out things about part of a population – but only the

Page 23: Big Data: How the Information Revolution Is Transforming Our Lives

membersofthepopulationwhousesearchengines.Thatexcludesasegmentofthevotingpublic.Andtheexcludedsegmentmaybepartoftheupsetsinelectionandreferendumforecastssince2010.

It is alsopossible, aswe shall seewith somebigdata systems that havegone wrong, that without a mechanism to detect garbage and modify thesystemtoworkaroundit,GIGOmeansthatasystemwillperpetuateerror.Itbeginstofunctioninaworldofitsown,ratherthanreflectingthepopulationitis trying to model. For example – as we will see in Chapter 6 – systemsdesigned tomeasure theeffectivenessof teachersbasedonwhetherstudentsmeet expectations for their academic improvement,with noway of dealingwithatypicalcircumstances,haveprovedtobeimpressivelyineffective.

It is easy for the builders of predictive big data systems to get a HariSeldon complex. Seldon is a central character in Isaac Asimov’s classicFoundation series of science fiction books. In the stories, Hari Seldonassembles a foundation of mathematical experts, who use the ‘science’ ofpsychohistorytobuildmodelsofthefutureofthegalacticempire.Theiraimistominimisetheinevitableperiodofbarbarismthatcollapsedempireshavesuffered in history. It makes a great drama, but there is no such thing aspsychohistory.

Nomatterhowmuchdatawehave,wecan’tpredictthefutureofnations.Liketheweather,theyaremathematicallychaoticsystems.Thereistoomuchinteraction between components of the system to allow for good predictionbeyondaclosetimehorizon.Andeachindividualhumancanbeablackswan,providing a very complex system indeed. The makers of big data systemsneedtobecarefulnottofeel,likeHariSeldon,thattheirtechnologyenablesthem topredicthuman futureswithanyaccuracy–because theywill surelyfail.

Wealsoneed tobear inmind that data is not necessarily a collectionoffacts.Data can be arbitrary.Think, for instance of a railway timetable.Thetimesatwhichtrainsaresupposedtoarriveatthestationsalongarouteformacollection of data, as does the frequency with which a train arrives at thestated time.But these times and their implications are not the samekindoffactas,say,thecolourofthetrains.Fortwoyears,IcaughtthetraintwiceaweekfromSwindon toBristol.My train leftSwindonat8.01andarrivedatBristol at 8.45.After a year,GreatWesternRailway changed the departuretimefromSwindonto8.02.Nothingelsewasaltered.

This was the same train from London to Bristol, leaving London andarriving at Bristol at the same times. However, the company realised that,while the trainusuallyarrived inBristolon time, itwasoftena little lateat

Page 24: Big Data: How the Information Revolution Is Transforming Our Lives

Swindon.So,bymakingthechangeoftimetablefrom8.01to8.02theyhadasignificant impact on on-time arrivals at Swindon. The train itself wasunchanged.Yetdespitethis,bymakingthisadjustment,theperformancedataimproved.Thisunderlinesthelooserelationshipbetweendataandfact.

In science, data is usually presented not as specific values but as rangesrepresentedby‘errorbars’.Insteadofsayingavalueis1,youmightsay‘1±0.05 to99percentconfidence level’.Thismeans thatweexpect to find thevalueintherangebetween0.95and1.05,99timesoutof100,butwedon’tknow exactlywhat the value is. This lack of precision is usually present indata,butitisrarelyshowntous.Andthatmeanswecanreadthingsintothedatathatjustaren’tthere.

Aconstantlymovingpicture

If we get big data right, it doesn’t just help overcome the inaccuracy thatsamplingalwaysimposes,itbroadensourviewofusabledatafromthepasttoincludethepresent,givingusourbestpossiblehandleonthenearfuture.Thisisbecause,unlikeconventionalstatisticalanalysis,bigdatacanbeconstantlyupdated,copingwithtrends.

Aswehaveseen,forecastersknowaboutandincorporateseasonality,butbigdataallowsustodealwithmuchtighterrhythmsandvariations.Wecanbring inextra swathesofdataandsee if theyareanyhelp inmakingshort-termforecasts.Forexample,salesforecastinghasformanyyearstakenintheimpactof the fourseasonsand themajorholidays.Butnowit ispossible toaskifweatheronthedayhasanimpactonsales.Andthiscanbeappliednotjusttopurchasesofobviousproductslikesuncreamorumbrellas,butequallyto sausages or greetings cards.Where it appears there is an impact,we canthen react to tomorrow’s forecast, ensuring that we are best able to meetdemand.

Justasbigdataallowsformorefocusedseasonality,sotoocanitprovidemuchtighterregionalidentity.Retailersinthepastmightknow,forexample,that mushy black peas and jellied eels had their own geographic areas ofinterest. But with big data, every single outlet can be fine-tuned to localpreferences.

Thepioneers

If we were to look for the pioneers of big data, there are some surprisingprecursors.Inparticular,trainspottersanddiarists.

Page 25: Big Data: How the Information Revolution Is Transforming Our Lives

Eachofthesegroupstookabigdataapproachinapre-technologyfashion.When I was a trainspotter in my early teens, I had a book containing thenumberof every engine in theUK.Therewereno sampleshere.Thisbooklistedeverything–andmyaimwas,atleastwithinsomeofthecategories,tounderline every engine in the class. As I progressed in the activity, I wentfrom spotting numbers to recording as much data as I could about railjourneys.Times,speedsandmorewereallgristtothestatisticalmill.

Atleasttrainspotterscollectnumericaldata.Itmightbehardertoseehowdiarists come into this. I’d suggest that diarists are proto-big data collectorsbecause of their ability to collate minutiae that would not otherwise berecorded.Aproperdiarist,suchasSamuelPepysorTonyBenn,asopposedtosomeonewhooccasionallywroteacoupleoflinesinaCollinspocketdiary,captured the small details of life in a way that can immensely useful tosomeoneattempting to recreate thenatureof life inahistoricalperiod.Dataisn’tjustaboutnumbers.

Totransformaverysmalldataactivity likekeepingadiary intobigdataonlytakesorganisation.From1937to1949intheUK,aprogrammeknownasMassObservationdidexactlythis.Anationalpanelofwriterswasco-optedtoproducediaries,respondtoqueriesandfilloutsurveys.Atthesametime,ateam of investigators, paid to take part, recorded public activities andconversation, first in Bolton and then nationwide. The output from thisactivitywascollatedinover3,000reportsprovidinghigh-levelsummariesofthe broader data. All the data is now publicly available, providing aremarkableresource.Asecondsuchprojectwasstartedin1981withasmallerpanelofaround450volunteersfeedinginformationintoadatabase.

However much trainspotters and diarists – and particularly MassObservation – were precursors of big data operatives, they were inevitablylimited by the lack of technology. The best technology I had for my trainjourneydatawasaFilofax.Ifthatdatacouldhavebeenpulledtogetherwithmanyother sources into a system, then the essential step fromcollection toanalysis that makes big data worthwhile could take place. It’s this kind ofprocess thatmeans that theMassObservationdata is stilluseful today.Andit’sarguable that the first step in thatdirectionwasacryofprotest fromanintenselyborednineteenth-centurymathematician.

Doingitbysteam–fromBabbageandHollerithtoartificialintelligence

The name Charles Babbage is now tightly linked with the computer, andthoughBabbagenevergothistechnologytowork,andhiscomputingengines

Page 26: Big Data: How the Information Revolution Is Transforming Our Lives

wereonlyconceptuallylinkedtotheactualcomputersthatfollowed,thereisnodoubt thatBabbageplayedhispart in themove towardsmakingbigdatapractical.

ThestoryhasitthatBabbagewashelpingoutanoldfriend,JohnHerschel,son of theGerman-born astronomer andmusicianWilliamHerschel. Itwasthesummerof1821,andHerschelhadaskedBabbagetojoinhimincheckingabookofastronomicaltablesbeforeitwenttoprint:rowafterrowofnumberswhich needed to be accurate – tedious in the extreme.AsBabbageworkedthrough the tables, painstakingly examining each value, he is said to havecried out, ‘My God, Herschel, how I wish these calculations could beexecutedbysteam!’

ThoughBabbagecouldnotmakeithappenpracticallywithhismechanicalcomputing engines (despite spending large quantities of the Britishgovernment’s money), he had the same idea, probably independently, asHermanHollerith,theAmericanwhosavedtheUScensusbymechanisingitsdata.EachwasinspiredbytheJacquardloom.

TheJacquardloomwasaVictorianinventionthatenabledthepatternforasilkweavetobepre-programmedonasetofcards,eachwithholespunchedinittoindicatewhichcolourswereinuse.Babbagewantedtousesuchcardsin a general-purpose computer, but was unable to complete his intricatedesign.Hollerithtookastepbackfromtheinformationprocessor(or‘mill’,asBabbagecalledit),thecleverestpartofthedesign.Butherealisedthatif,forinstance,everylineofinformationfromthecensuswaspunchedontoacard,electromechanicaldevicescouldsortandcollatethesecardstoanswervariousqueriesandstarttoreapthebenefitsthatbigdatacanprovide.

The devices involved were called tabulators. A typical use might be tocounthowmanypeopletherewereindifferentagegroups,gender,raceandsoon.Thecardswouldbepassedthroughthetabulator(initiallymanuallyandlater automatically), where metal pins completed a circuit by dipping intomercurywhentheypassedthroughthepunchedholes.Eachelectricalimpulseadvanced a clock-like dial. The operator would then put the card into aspecificdrawerinasortingtable,asdirectedbythetabulator(againthispartof theprocesswas laterautomated).Hollerith’s tabulatorswereproducedbyhisTabulatingMachineCompany,whichmorphedintoInternationalBusinessMachines and hence became the one-time giant of information technology,IBM.

Thetroublewiththesemechanicalapproachesisthattheywereinevitablylimited in processing rate. They enabled a census to be handled in the tenyears available between these events – but they couldn’t provide flexible

Page 27: Big Data: How the Information Revolution Is Transforming Our Lives

analysis and manipulation. One of the reasons that big data can beunderestimatedisthesheerspeedwithwhichwe’vemovedoverthelasttwodecades to networked, ultra-high speed technology making true big dataoperationspossible.

WhenIwasatuniversity in the1970s,westillprimarilyentereddataonpunched cards. Admittedly wewere using the cards to input programs anddataintoelectroniccomputers–evenso,theterm‘Hollerithstring’forarowof information on a card was in common usage. Skip forward a couple ofdecadesto1995,whenIattendedtheWindows95launcheventinLondon.IntheQandAattheevent,IaskedMicrosoftwhattheythoughtoftheinternet.Theresponsewasthatitwasausefulacademictool,butnotexpectedtohaveanysignificantcommercialimpact.

By 1995, personal computers, one of the essential requirements for bigdata, were commonplace, if not yet as portable as smartphones. But thesecond essential of connectivity through the internet was not envisaged asbeing important even by as big a player asMicrosoft.However, therewerealreadyplentyofexamplesofthethirdandfinalpieceofthebigdatapuzzle–thealgorithm.

MeetBigAl

You can have as much data as you like, with perfect networked ability tocollateitfrommanylocations,butofitself thisisuseless.Infact, it’sworsethanuseless.Ashumans,wecanonlydealwith relativelysmallamountsofdataatatime;iftoomuchisavailablewecan’tcope.Togofurther,weneedhelpfromcomputerprograms,andspecificallyfromalgorithms.

Although theOxfordEnglishDictionary insists that theword‘algorithm’isderivedfromtheancientGreekfornumber(whichlooksprettymuchlike‘arithmetic’),mostother sources tellus that algorithmcomes, likemost ‘al’words,fromArabic.Specifically,itseemstobederivedfromthenameoftheauthorofaninfluentialmedievaltextonmathematicswhowasknownasal-Khwarizmi.Whatever theoriginof theword, it refers toasetofproceduresandrulesthatenableustotakedataanddosomethingwithit.Thesamesetofrulescanbeappliedtodifferentsetsofdata.

This sounds remarkably like the definition of a computer program, andmanyprogramsdoimplementalgorithms–butyoudon’tneedacomputertohave an algorithm, and a computer program doesn’t have to include analgorithm.Anexampleofasimplealgorithmis theoneused toproduce theFibonacciseries.Thisisthesequenceofnumbersthatgoes:

Page 28: Big Data: How the Information Revolution Is Transforming Our Lives

1,1,2,3,5,8,13,21,34,55,89,144…

Theseriesitselfisinfinitelylong,butthealgorithmtogenerateitisveryshort,and is something like ‘Start with two ones, then repeatedly add the lastnumberintheseriestothepreviousone,toproducethenextvalue.’

Forbigdatapurposes,algorithmscanbemuchmoresophisticated.ButliketheFibonacciseriesalgorithm,theyconsistofrulesandproceduresthatallowa system to analyse or generate data. Here’s another simple algorithm:‘Extract from a series of numbers only the odd numbers.’ If we apply thisalgorithmtotheFibonacciseriesweget

1,1,3,5,13,21,55,89…

Thisdatahasnovalue.Butiftheoriginaldatahadbeenabouttaxpayers,andinstead of using an algorithm for odd numbers, we had extracted ‘thoseearningmorethan£100,000ayear’wehavetakenafirststeptoanalgorithmto identify high-value tax cheats. I’m not saying that earning more than£100,000makesyouataxcheat,justthatyoucan’tbeahigh-valuetaxcheat,andhenceworth investigating, if you only earn £12,000 a year. Ifwewereimmediately to stigmatise as a cheat everyone selected by the ‘extract£100,000+earners’algorithm,wewouldbemisusingbigdata.Thealgorithmisneutral.Itdoesn’tcarewhatthedataimplies–itjustdoeswhatweask.Butasusersofbigdata,wehave tobecarefulaboutourassumptionsandknowexactlywhatthealgorithmisdoingandhowweinterprettheresults.

Once the technology had caught upwith the requirements to handle bigdata,thecapabilitiesofthisapproachstartedtomakethemselvesfelt.Andoneof the first areas where big data would have an impact was in an activitywhichmostofusbothloveandhate.Shopping.

Page 29: Big Data: How the Information Revolution Is Transforming Our Lives

3SHOPTILLYOUDROP

Goodmorning,MrsSmith

For fifteen years I lived in a village with a single post office and shop. Itwasn’t longafterstarting touse thepostoffice that thepeopleserving theregot to knowmebyname.Once, Iwalked into the post office to sendoff aparcel.‘I’mgladyoucamein,’saidtheshopkeeper.‘LasttimeyoucameinIoverchargedyou.Here’syourchange.’Anothertime,Iwentintotheshoptobuy somecurrypowder.Theyhadn’tgot any. ‘I tellyouwhat,’ saidLorna,whowasbehindthecounterthatday.‘Hangonamoment.’Shewentthroughto her kitchen and brought back some garlic and fresh chillies. ‘Use theseinstead,’shesaid.‘They’refarbetterthancurrypowder.’

Onefurtherexample.Iusedtobeanenthusiasticphotographerandwasaregularatalocalcamerashop–again,theyknewmeasanindividual.TheyknewthatIbroughtthemfrequentbusiness.Ihadbeensavingupanddecidedtotaketheplungeandswitchtoadigitalcamera,soaskedwhatwasavailablein digital cameras for around £400. The answer from the familiar salesassistant was shocking at first. ‘I wouldn’t sell you a camera in that pricerange,’hesaid.Iwasabouttoaskhimwhatwaswrongwithmymoneywhenhewenton. ‘Oneof thebestmanufacturershas justdropped thepriceof itscameras from£650 to £400, but they haven’t sent out the stock yet. If youcomebackinafewdays,Icandoyouamuchbettercamerafor£400thanIcouldtoday.Ireallywouldn’trecommendbuyinganythingnow.’

Look atwhat he did.He turned away the chance tomake an immediatesale. In isolation this is total madness – and it’s something that the salesassistants inmostchainstoreswouldneverdo,because theyareunderhugepressuretomovegoodstoday.Butthisassistantusedhisknowledgeofmeandthemarket.Hebalancedthevalueofasalenowagainstmylong-termcustom.Iwasveryimpressedthathehadsaidthathewouldn’tsellmeacameranow,andthatbygoingbackinafewdaysIcouldgetamuchbetterone.Notonly

Page 30: Big Data: How the Information Revolution Is Transforming Our Lives

did I go back for that camera, I made many other purchases there. And Ipassedonthisstorytootherwould-bepurchasers.

This iswhatknowledgecando fora local shop.Butuntilbigdatacamealong, it wasn’t possible to contemplate taking the same approach whenrunningamajorchain.

Upscaling

Bigdatapresents theopportunity toprovidesomethingclose to thepersonalservice of the village shop to millions of customers. As we will see, theapproachdoesn’talwayswork.ThisispartlybecauseofGIGO,partlybecauseof half-hearted implementation – good customer service costsmoney – andpartlybecausefewtraditionalretailershavebuilttheirbusinessesarounddatainthewaythatthenewwaveofretailerssuchasAmazonhave.Butthereisnodoubtthattheopportunityisthere.

Such systems were originally called CRM for ‘customer relationshipmanagement’,butnowareconsideredsointegraltothebusinessthatthebestusersdon’tbothergivingthemaseparatelabel.

The challenge to effective data-driven customer service comes from theway that the data faces twoways: towards the shop (or bank) and towardsyou, the customer. The shop wants to know as much as it can about thecustomer, so that it can retain you and get the most money out of you.However,youwantthedatatoenabletheshoptogiveyoubetterserviceandpersonal rewards. Done well, big data can provide such a win-win onpurchases.Andoneoftheearliestopportunitiestodothiscameintheformoftheloyaltycard.

Loyaltycards

Inmywallet, thereare around twenty loyaltycards.Some, typically forhotdrinks,areverycrude.HereIgetastamponthecardeverytimeIbuyadrink,andwhen I fill the card I get a free cup. It’swin-win. I ammore likely toreturn to thatcoffeeshop,so theygetmorebusiness,andIgetafreecoffeenowandagain.However,thisapproachwastestheopportunitytomakeuseofbig data, which is why, a couple of decades ago, supermarkets and petrolstations moved away from their equivalent of the coffee card, giving outreward tokens likeGreen Shield Stamps, to a different loyalty system. Thenew card had an inherent tie to data and, in principle, gave them theopportunitytoknowtheircustomerandprovidepersonalisedservicelikethe

Page 31: Big Data: How the Information Revolution Is Transforming Our Lives

villageshop.With my Nectar card, or Tesco Clubcard – or whatever the loyalty

programmeis– Ino longerhaveastampcard,whereall thedata resides inmywallet.Now, every time Imake a purchase, I swipemycard.Frommyviewpoint, this gives me similar benefits to the coffee card, accumulatingpointswhichIcanthenspend.Butfromtheshop’sviewpoint,itcanlinkmewithmypurchases.Thecompany’sdataexpertscanfindoutwhereandwhenIshopped.Whatkindof thingsI liked.Andtheycanmakeuseof thatdata,bothtoplanstocklevelsandtomakepersonalisedofferstome,forexamplewhenanewproductImightlikebecomesavailable.Thesystemsimulatesthefriendly local shopkeeper’s knowledge ofme,making it possible forme tofeelknownandappreciated.Thestoregivesmesomethingextra,whichIasacustomer find beneficial. But this kind of system can’t work effectivelywithoutdeployingbigdata.

Loyaltycardsgotovertheanonymityofcash.Asithappens,though,suchcardsareprobablycomingtotheendoftheirusefullife.Thisisbecauseweare paying for things less and less frequently with traditional forms ofpaymentsuchascashandcheques.Ifweuseadebitorcreditcard,theshopcanmake exactly the same kind of big data connection as it wouldwith aloyaltycard.Thisisamovethathasbeenonthewayforatleasttwentyyears.

Mondexandmore

When I firstmoved toSwindon in themid-1990s itwas todiscover the tailendofa revolutionaryexperimentcalledMondex.MostSwindonshopshadbeen issuedwithcard readers for theMondex smartcard.Thecardcouldbeloadedwithcashatcashpointsandalsoviaaspecialphoneathome.Theideawastotrialthecashlesssociety.Whetherbecauseitwasonlymonthsbeforetheexperimentended,orbecauseithadneverprovedhugelypopular,IfoundmanyretailersweresurprisedbymyrequeststouseMondex.Butitcertainlyhaditsplusside.

Therewas no need to carry a pocketful of cash – and the ability to addmoneytothecardathomemadevisitstothecashmachineathingofthepast.But in the end, the smartcard was trying too hard to emulate the physicalnatureofmoney.Weusedtocarrycasharound–nowwecarriedvirtualcashonacard.Itwaseasier,butitreplicatedanunnecessarycomplication.Itwasasmalldatasolution–theMondexcardknewverylittleaboutmeandhadnoconnectivity–inanincreasinglybigdataworld.

Instead of going down the smartcard wallet route, the new wave of

Page 32: Big Data: How the Information Revolution Is Transforming Our Lives

cashlesspaymentsmakesintimateuseofbigdata.‘Tapandpay’cardsystemsdo awaywith the need to load up a smartcard (a failing that is evenmoreobviously a pain with London’s dated Oyster card, with which any onlinepayments have to be validated at a designated station), because the newcontactlesscardsaresimplyameansofidentificationlinkingtoacentral,bigdatabankingsystem.Evenbetterfromasecurityviewpointarephone-basedpayment systems such as Apple Pay, with the same convenience, but theaddedsecurityoffingerprintrecognition.

Arguablythisnewgenerationofcashlesspaymentisitselftransitional.Asit stands, banks need layers of security to handle fraudulent use of cards,security that is driven by big data. Like me, you may have received anautomatedcallfromyourcreditcardprovider,askingyoutoconfirmthatyouhadmadethelastthreetransactions,asyouwerebuyinginapatternthanwasnot normal for you. Whenever it has happened to me it was simply thatpersonalcircumstanceswereunusual–forexample,whenmydaughterswerestartingatuniversityandwehadtobuyallsortsofhouseholdgoods.But insome cases, these systems prevent fraudulent use. It reflects the ease ofseparatingusandourcards.Ifwecouldmoveawayfrompayingwithcardsorourphones,thentherewouldbelessfraud.

Allweusethecardorphoneforistolinkanindividualtoabankaccount.Withthecardit’sviaaPIN(noteventhatforcontactlesspayments)andwiththe phone the link relies on the phone’s fingerprint reader. But it’s hard toimagine that payment systems won’t eventually use biometrics such asfingerprintorfacialrecognitiondirectly.Admittedly,asthrillersoccasionallydemonstrate, there are stillways to fool biometricsby copyingor removingbodyparts,butwitheffectivebiometricrecognition,bigdatacanduplicatethefriendlylocalshopkeeperwhoknowsyouonsight.

London’searlyattemptatanelectronicwallet,theOystercard,nowlookstohavehaditsday.Ithasalwaysbeensomewhatflawed.Notonlyis it lessflexiblethanMondex,asyoucan’tloaditupdirectlyathome,ifyouelecttopayonline,youhavetospecifywhichstationyouwillusetogetthe‘cash’ontothecard,atleast24hoursaheadofdoingso.ButTransportforLondonhassignedtheOyster’sdeathwarrantbyacceptingcontactlesssmartphones,creditanddebitcards.Therewill remainanichemarket forelectronicwallets likeOysterfor,forexample,children–butformostofus,anightoutinLondoncannowbeundertakenwithasinglecardorphone.Let’stakesuchajourneywith an enabled smartphone and see how big data works in two distinctdirections.

Page 33: Big Data: How the Information Revolution Is Transforming Our Lives

AnightoutinLondon

IsummonanUbertaxiwithmyphone.Thedetailsofmyjourney,theidentityofmydriver, andour respective ratingsare added toUber’s informationonme. Uber also links to my bank, accesses funds and provides them withinformationonthepaymentandthebroadlocationofthetransaction.

OneofmyfriendsfinishesworkearlierthantheotherssoI’mmeetinghimfirst.AsIarriveoutsidehisoffice,myphoneflagsupthatthere’saStarbucksaroundthecorner,soIsendhimamessagetomeetmethere.IplaceanorderandpayonthephoneusingmyStarbucksappbeforeIgetthere,sothatbythetimeIarrive in thecoffeeshop,myskinnylatte iswaitingonthebar.Here,thewholetransactionwasmadewithStarbucks.ThecompanyknowswhenIwas inwhichshopandwhat Iordered(andawardsmeanelectronic loyaltystar).It’sonlywhenIneedtotopupmyStarbucksaccountthatIalsogetaninteractionwithmy bank account, whichwon’t know aboutmy visit. (TheStarbuckscardisaninterestingcompromise.Although,liketheOystercard,ithastheinconvenienceofhavingtobepreloaded,thisprocessismadeeasy–Ican do itwith a couple of taps and a fingerprint frommy phone – and, asmentioned,I’mprovidedwithloyaltyrewardsforusingit.)

My friend joinsme andwe drink our coffee. Thenwe checkwhere theothersareupto:we’vearrangedtomeetatarestaurantbutacouplethoughttheymight be delayed atwork, sowe’ve added each other to the ‘FindmyFriends’app.Iglanceatthisandseethatonehasreachedtherestaurant,oneis five minutes’ walk away from the venue and the third is still at work.Probably time to set off. Exactly how much information the app providerkeeps in thiscase isnotentirelyclear,but inprinciple it couldknowwherewe’vebeeneversinceweactivatedtheapp.

IcheckthelocationofthenearesttubestationonGoogleMaps.Tomakeitconvenient,I’veconnectedthistomyGoogleaccount,soagain,inprinciple,dataaboutmymovementscouldbe storedbyGoogle.At the tube station, Iusemyphoneto tapmywaythroughthebarriers.This is themostcomplextransactionofthenight.Apple’ssystemsnowhavethechancetonotewhereIamandwhatIamdoing.ThecostoftheticketalsogoestoApple,whichwillthendispatchapaymentrequest tomybank.Thebankthensendsmea textalerttoconfirmthatIhaveusedApplePayforthisprocess.AndthepaymentgoesfromthebanktoTransportforLondonviaApple.Threedifferentmajorsystemshavefoundoutmoreaboutmeandmyactions.

Finally,IgototherestaurantwhereIcanpaymybillviaPayPalorApplePayfrommyphone–andagainI’mpotentiallyprovidingtherestaurant,the

Page 34: Big Data: How the Information Revolution Is Transforming Our Lives

bankandApplewithknowledgeofmeandmyactions.Allofthisistoday’stechnology. Frommy viewpoint, I have had an effortless evening. I didn’tneed tocarrycashandmypaymentsweresecure thanks to the technologiesemployed. Big data made my life more convenient. In exchange, thecompanies concerned found out more about me, about my likes and myactivities.

Thebigdatacomponentofmynightoutcouldbeevenmoreadvantageousfrom my viewpoint – the various companies could, for example, offer mediscountsonmyfavouritepurchasesoractivities.Butitisalsoahugebenefitforthecompanies.Theyareamassingdatathathelpsthemtoknowmebetterandselltomemorefrequentlyorencouragemetospendmore.Andthereisalso theconcernofmisuse. It’s easy to say ‘Ifyou’venothing tohide,whydoesitmatter?’,butbeingundersurveillancebycompanieswhomaynothaveyour best interests at heart could be a serious concern.BigBrothermay behiding in thedata, aswe shall discover.But, asyet, itwon’t stopmeusingthesetechnologies–becausebigdataensuresthattheyworktogethertomakemylifeeasier.

In this section I have described a night out. A day’s shopping wouldundoubtedlyprovidesimilarexamples–butincreasingly,bigdataisensuringthatIdon’tneedtoleavehometoshop,iftheactofshoppingdoesn’tinterestme.

ComparisonshoppingandAmazonscanners

Much of the modern shopping experience is driven by data: discoveringwhat’savailable,choosingtheproductthatworksbestforyouandgettingthebestprice.But thehigh street isn’tnecessarily thebestplace toacquire thisdata.Asyouwalkdownthestreet,eachshoplargelycontrolsitsowndata–butbigdataandconnectivityisbreakingdownthatcontrol.

Onebigbreakthroughiscomparisonshopping.Onceyouknowwhatyouwant,youinevitablywantthebestprice.Youmighttryhagglingintheshop–andsurprisingly,quiteafewshops,evenbigchainstores,willdothisifyoucanspeak toamanager.However, itwouldbeveryconvenient ifyoucouldsendaminionaroundashop’scompetitorsandseeifyoucangettheproductcheaperelsewhere.Online,thisprocessbecomesmucheasier.Tosellonline,shopshavetoopentheirdatauptoadegree,andthismeansthatacomparisonshoppingsitecanbringtogetherpricesfromdifferentretailersandgiveyouaheads-up.

However,comparisonshoppingsitescanbeexpensive to setupand run.

Page 35: Big Data: How the Information Revolution Is Transforming Our Lives

The first such site in the UK was a comparison site for books, calledBookBrain, and though there are still several attempting this service, theyremainsmall.Onthewhole,whencomparisonsitesaresuccessful,theyhavelimitedthemselvestomarketswherecommissionsarehigh,suchasinsuranceandtravel.Andcommissionsareanimportantfactortoconsider.

Comparisonsitesdon’t sellanything themselves, so tomakemoney theyhavetogetafeefromtheretailertheysendyouto.Thisiswhyyoudon’tseecomparison sites for some areas of themarket.Of course, commission alsoopensupthepossibilitiesofbeingmisled,orbeingprovidedwithincompleteinformation. It would be in the site’s interest to point customers towardsretailers providing thehighest commission,whether or not they are thebestmatch – and there is no point in a site including retailers who don’t givecommission.

Evenso,acomparisonsiteisaweaponofbigdatathatislargelybeneficialto thecustomerwhohasn’t timetosearchthroughthedifferentpossibilities.Butsuchasiteisuselessifyoudon’tknowwhatyouwanttobuyinthefirstplace. No matter how sophisticated the website, the browsing experiencewhenexploringwithoutaclearpurchaseinmindfailstomatchuptowalkingroundaconventionalshop.Thevarietyofstockthatavirtualstorecanofferbecomesadisadvantage–thereissimplytoomuchtolookat.Forsomekindsofbrowsing,youcan’tbeatbeinginaphysicalstore.Evenso,bigdatatriestoplayitspart.

Walkarounda largestore thesedaysandyouwill sometimesseepeoplepointing their phones at products. Some may just be taking a picture of aproduct to remember it or ask someone else about it. But others will bescanningthebarcodeintoaretailapplikeAmazon’stoseeiftheycouldgettheproductcheaperonline.Thebricksandmortarstoreislandedwithallthecostofprovidingthespacetobrowse,whiletheonlineretailergetsthesale.Insuchcircumstances,you’vewononpriceandtheonlineretailerhaswonviabigdata.Theonlyproblemis,ifyoudrivethebricksandmortarstoreoutofbusinessbytakingadvantageofitthisway,youhavelostyouropportunitytobrowse, whichmay be why Amazon, for example, is now starting to openphysicalstores.

Moreoftenthannot,bigdatagivesconveniencetotheuser,butastrongerbenefit to the business.However, done properly, the big data approach canofferthehumanpartoftheequationaseamlessexperiencethatmakesitseemalmostmagical.

Trainingthesystem

Page 36: Big Data: How the Information Revolution Is Transforming Our Lives

IrecentlytravelledbyEurostartrainfromLondontoBrussels.Someoneelsehadbookedtheticketsformeandsentanemailwiththebooking.Thatemailhadalinktoclick,whichbroughtupawebpage.Thisalsohadalinktoclick.A couple of seconds later, there was a ping in my pocket, and my ticketsappearedinthewalletonmyphone.AllIhaddonewasclicktwicewiththemouse.Ienterednoinformationwhatsoever–thoughthisseamlesseffectonlyworkedbecauseIwasusingawebbrowseronmycomputerwhichwaslinkedviaApple’scloudservicetomyphone.

I nowhad on the phone the timetable formy journeys and a barcode toscan at the ticketbarrier. Itwas almost all I needed for the journey, thoughthings could have been even better. I still had tomess aroundwith a paperpassport(despitethemodernpassportcontainingachipwithbiometricdata).Thisnowfeltlikeatransitionaliteminthebigdataworld.ButtheeasewithwhichsomeonecouldemailmedetailsandtwoclickslaterIhadticketsonmyphonewasbigdataatitsbest.ItbenefitedEurostar,butIgotalotfromitaswell.However, not all big data actions in sales andmarketing feel quite sobeneficial.Someareveryone-sided.

ShopbotsandBotshoppers

One step away from applications that use data to help the customer sitmechanismswhichenablebigdatatobecomethecustomer,formingakindofself-fuelledmarket.Therearemanyexamplesofthese,butlet’slookattwo:AmazonSellershopbotsandstockmarketAIs.

OnAmazon,youcanbuymanyproductsfrom‘othersellers’.Amazonactsas amarketplace, taking a cut of the price in exchange formaking the salepossible – but the price is set by the seller. Thismakes it tempting for thesellertodotwothings:totweakthepriceupifnooneisundercuttingthem,andtotweakthepricedowniftheyaren’tthecheapestseller.

Large sellers have thousands of lines listed, making it impractical for ahuman to copewith all thedata involved.Enter the shopbot– an algorithmwhich,onaregularbasis,willmakethosetweaksinpricing.Inprinciple,thismeans that the seller is kept in the best position. But if those rules aren’tcarefullyrestricted–orthefrequencywithwhichtheshopbotcheckspricingistoohigh–thebigdataapproachcangetoutofcontrol.Excessivetweakingdown results in far too many products listed at 1p – even when it isn’tfinanciallyviabletopostthematthisprice.

Similarly,where thebot is set to tweakup if there isnoopposition,youwill find, say, cheap and cheerful second-hand books priced at hundreds of

Page 37: Big Data: How the Information Revolution Is Transforming Our Lives

poundsacopy.We’renottalkingfirsteditionsofclassicshere–butperhapsa1993NewnesMS-DOS6.0PocketBook.Asithappens,IhaveacopyofthattitleandjustcheckedonAmazonMarketplacetofindthattheonlynewcopylistedhadsufferedthiskindofshopbottweaking.Itwaspricedat£999.11–soIlistedmycopyat£999.10.(Ifyoutakealooknow,youmayfindthattheshopbothastweakedtheotherseller’spricedown,somy£999.10isnowthehighest.) It’s fine to ask a premium for a rare book, even if it isworthless,because someone (the author, say) may eventually want a copy. You mayremember theadvertisement forYellowPages featuringFlyFishing by J.R.Hartley.However,therearelimits.Thebigdataapproachisonlyasgoodasthedataandthealgorithmsthathandleit.

The stock market example is the reverse of a shopbot. Here, the botreplacestheshopper.Youcanseeinprinciplewhysomeonemightthinkit’sagood idea tomake use of big data to purchase or sell the right stock at theright time. But what we’re doing here is the equivalent of having acomparison shopping sitewhere the customer is givenno information at alland the site simply buys the insurance (say) that it thinks is best for you.That’s a scary prospect, as stock markets discovered to their cost with theflashcrash.

Itwas6May2010.In36minutes,startingat2.32intheafternoon,NewYork time, over a trilliondollarswaswipedoff thevalueofUS stocks.Toblamewereacollectionofbotshoppers.Manystockmarkettradingdealsarenolongerhandledbyhumanbeings.Instead,algorithmsreacttostockmarketdata.Inthiscase,thealgorithmsinquestion–whatI’mcallingbotshoppers–primarily belonged to high-frequency traders, who often only hang on tostocks for minutes before reselling. Some of the algorithms were badlywritten. This made it possible to get into a spiral, where sales during oneminutewerebasedonapercentageofsalesinthepreviousminute.

Bigdatasystemscanactfarquickerthanhumanbeings,whichiswhyinthat36minutes that it tookhumans toworkoutwhatwashappeningand topull the plug on the algorithms all hell broke loose. This is a very costlyexampleoftherulethatbigdataisonlyasgoodasthealgorithmsthathandleit.Makeamistake,andthehighspeedandrepetitiveabilityofthealgorithmmeansthatitcandoalotofdamagebeforetheeffectsarenoticed.

Shopbotsandbotshoppersareat leastdoinga jobthatcanbeuseful.Butthere are thosewhowould argue that a common application of big data toshopping isnot justuselessbutdownrightdangerous.This iswhenbigdatagetscontrolofadvertising.

Page 38: Big Data: How the Information Revolution Is Transforming Our Lives

Weknowwhatyou’relookingat

It’s creepy the first time it happens. Arguably, it’s creepy every time ithappens. Let me give a specific example. A few days ago, I had aconversation with my wife about a present we were buying one of ourdaughters for Christmas. Today, when I opened Facebook, I was presentedwith an advert for that specific product. (A Chanel perfume, if you mustknow.Our daughters have expensive tastes.)This advert genuinely shockedme,becauseIhadn’tevenlookedfortheproductonline.HadFacebookbeenlistening inonour conversation?There are at least two technologies,whichwe’llcomeontolater,thatmakethistheoreticallypossible.

As it happens, therewas a simple and less intrusive explanation for theextremespookiness.MywifehadusedmycomputertocomparepricesonthisproductwhenIwasout.Thetrailleftbyhersearches–probablyacookie,thesmallfileusedbyawebsitetoholddatafromvisittovisit–hadbeenpickedupinFacebooktoprovidea‘targetedadvert’.Buteventhatlevelofintrusionis too much for most of us. According to a US survey, 68 per cent ofrespondentsdisapproveoftargetedadvertisingbecauseofthefeelingthattheyarebeingsnoopedon–thepercentagegoesupforthosewithcollegedegreesandhigherannualincomes–despitethis,suchadvertshavebecomeafeatureofourlives,drivenbybigdata.

Hereabenefitofbigdata to theretailer isbeingforcedonus. In theory,it’sgoodforusaswell,theargumentbeingthatwewouldratherseeadvertsfor productswe are interested in than for somethingwe hate.But there’s abalancebetweenusefulnessandintrusion–andmostseemtofeelthat,inthiscase,theintrusionoutweighsthebenefit.

There’s more to it than that, though. If you’ve ever been to a travelwebsite, for example, and seen advertisements for competitors you mightthinkthisistargetedadvertisingthathasgonewrong.MaybeGooglehasfedthemthewrongads,assurelyatravelsitewouldn’tadvertiseitscompetitors?Youwouldn’t go into Sainsbury’s and see adverts for Tesco andWaitrose.However,ifyoudoseeadvertsforcompetitorsonline,thechancesarethatbigdatahas identifiedyouasa timewasterandisconvertingyouintoarevenuestream.

Anyonewho hasworked for a retailer knows that after awhile you canspotsomeindividualswhoareveryunlikelytomakeapurchase.Whentheytakeupa lotofasalesassistant’s time, theyare literally timewasters,as thestore might lose other business because the sales assistant is tied up. Thisdoesn’thappenonline (unless thesite is sobusy that itbecomesunbearably

Page 39: Big Data: How the Information Revolution Is Transforming Our Lives

slow). But algorithms can still try to spot the online equivalent of thetimewasterforfinancialgain.

Ifthesystemdecidesthatyouareveryunlikelytomakeapurchase,that’stheperfecttimetoallowacompetitor’sadvert toappear.Ifyouclickonthead,thesiteyouwerevisitinggetsasmallcommission–betterthannothingatall,andbetterstill,theircompetitorpaysforit.Ofcourse,thisisonlyagoodmove if the system can read yourmind to discover that youwon’tmake apurchase.Ittriestodothisbycombiningobvioussignals,suchaswhetherornotyoulogintothesiteandwhethervisitorsfromyourIPaddresshavemadepurchasesbefore,alongwithany timeofday, timeofyearandsearch topicdatathatsuggesttheprobabilityyou’llmakeapurchase.It’saguess–butifit’slikelytoberightmoreoftenthannot,thenit’seasyenoughtosetalevelatwhicherrorsbecomeacceptabletothecompany.

Whatthisdoesn’tdetermine,though,iswhetherornotanerrorintargetingwillprove irritating to thecustomer.Let’sgoback tomoregeneral targetedadvertising.Itmightbeverycleverthatthesystemspotsyou’vebeenlookingat a specificperfumeandoffersyou thatproduct.But it ceases tobecleverand becomes frustrating, even if you like targeted advertising,when the adcomes up after you’ve bought the product (especially if you’ve just spentmoreonit).Inothercases,thetargetingcanbehugelymisjudged.ComedianandstatisticianTimandraHarknesstellsthestoryofanintervieweewhohadflowndowntoFloridatotakecareofherageingmotherwhowasill.Thenextday,shestartedgettingtargetedadvertisingforfuneralservicesinthearea.

However, even such an offensive miss is arguably better than targetedadvertisingthat intentionallypreysonthevulnerable.Itdoesnotseemright,for example, that someone who is looking for information on debtmanagementshouldthenstarttoreceiveastreamofadvertsforpaydayloans– infamously a burden to thosewith debt problems due to their four-figureAPIs–yetthisregularlyhappens.It’snotenoughtoweighupthebenefitsofusingbigdatathisway,wealsoneedtoassessthehumancosts.It’snotclear-cut.Itmaybeagoodidea,forexample,thatpeoplewhoarestrugglingwithdebtreceivetargetedadvertisingforfree,helpfuldebtmanagementcharities.But one thing is for certain: it’s amoral decision thatwe can’t leave to analgorithm.

Toselectanappropriatead,thetargetingalgorithmskeepaneyeonyourbrowsing history. But that’s only possible online. Surely this could neverhappenintherealworld?

We’rewatchingyou

Page 40: Big Data: How the Information Revolution Is Transforming Our Lives

It’shardnot tofeelsorryforbricksandmortarretailers.Notonlydoonlinestores have far lower overheads, they can access all kinds of informationaboutshoppersthatisn’tavailableonthehighstreet.Atleastitisn’tavailabletoday–but it’sveryclose.We’restartingtoseea little targetedadvertising,whetherit’stheTripAdvisorapppoppingupandtellingyouaboutthehandyrestaurant round the corner, or display adverts that can detect your mobilephone and attempt to display something relevant to you. But far more ishappeninginsidesomestoresinthedevelopingfieldofvideoanalytics.

If you’ve ever watched a crime drama on TV, the detectives are oftenlandedwithlookingthroughhouruponhourofCCTVfootagetotrytospotan individual or a car. It’s easy to think of big data as being just aboutnumbers–butavideoisjustasmuchdataasaspreadsheet,particularlyinadigital formwhere it is simply a collection of ones and zeros.Humans arevery good at processing visual data.But this ability tails offwith time.Wearen’t good at watching many hours of video, looking out for a specifictrigger.Thismakesvideoanalysisanidealopportunityforbigdataalgorithmstostepin.It’ssomethingincreasinglyusedindetection–butalsoinretail.

No one is surprised to see a video camera in a store. We’d be moresurprisednottoseethem.Buttheassumptionisthattheyaresecurityvideos.With big data, though, they have the potential to be farmore.Cameras cantrack which areas of the store, and which displays, get most attention.Individualscanbeflaggedupusingfacialrecognition,notonlytotracktheiractivityinthestorebutalsotoidentifyrepeatvisits.Itwouldn’tbetoomuchofastretchtoextendthatfacialrecognitiontosocialmedia–atwhichpointitwould be possible to identify individualswho have visited your stores, andcontactthemonlinewithoffers.

Anotherapproach,takenbyUScompanyPercolata–aproviderofsystemsusedby retailers includingUniqlo and7-Eleven – combines information oncustomer flows into and around the storewith the sales that each employeerings up. This is an attempt to come up with a measure that is difficult toassess in customer service – productivity. The system monitors how mucheachsalesassistanttakes,percustomergoingthroughthestore.Thisironsoutthevariabilityforquietandbusytimes–butitisstillameasurethatsuggeststheonlyvaluablethingasalesassistantcandoistakemoney.

Such data, plus various combinatorial factors – the impact of differentsales assistantsworking togetheron takings, thepeoplewhoworkbetter onsunnyorrainydays–isthenusedtodriveasystemthatallocatesshopfloortimetotheassistantsatshortnoticetomaximisetherevenueofthestore.(Seethesectiononthe‘gigeconomy’,page115,formoreonthiskindofsystem.)

Page 41: Big Data: How the Information Revolution Is Transforming Our Lives

By comparing similar stores with and without the system in use, Percolatasuggeststhattheycanpushuprevenuebybetween10and30percent.Likemanyalgorithm-managedjobs,thisisnotlikelytobeapositiveexperiencefortheemployees.Butwhatabouttheshoppers?IsPercolata’ssystem,ormakinguseofvideoanalysis,intrusiveorstraightforwardcommercialactivity?

Leavingasidethepossibilityoflinkingfacialrecognitiontosocialmedia,it’s hard to draw the line. Few would have a problem with a system thatmonitored footfall around a store anonymously, but once individuals areidentified and tracked, this becomes closer to surveillance.As shoppers, thechances arewewould never know that thiswas happening. But employeescould alsobemonitored, giving stores theopportunity to rewardgood salestechnique–andpunishshirkingoff.Goodmanagementpracticeor intrusivebehaviour?Theyaregreyareas.

Suchbigdataintegrationwithvideoisnotlimited,ofcourse,toshops.Thepolice and civic authorities could use it viaCCTVon the streets.Transporthubs,suchasairports,particularlythosewheresecurityisaseriousissue,arealready experimentingwith facial recognition on videos to track individualsandflagupsuspiciouscharacters.However, thisisn’t theonlyplacethatbigdataisatworkattheairport,somethingyouwillhaveexperiencedifyouhaveeverturneduptofindthatyourflightisoverbooked.

Whyairlinesoverbook

It seems crazy. Airline booking systems were among the earliest effectiveusers of big data. Within the industry, American Airlines used to be sofocusedonitsSabrebookingsystemthatitdescribeditselftofellowairlinesasabookingsystemcompany thatalso flewplanes.Withsplit-second, real-time booking happening all over the world, few companies have a betterinstantnotificationofthestateofsalesandinventorythananairline.Andyetfor years now, airlines have significantly overbooked some of their flights.Thisisnotanaccident.Itisacommercialdecisiontotakeacalculatedrisk.

The discipline operational research (OR), in which I used to work atBritishAirways,wasresponsibleforthisoverbookingtechnology.ORbeganduring the Second World War, when physicists and mathematicians werebrought in to help with military problem solving. They producedmathematical mechanisms to, for instance, decide what pattern of depthchargeswasmost likely toputasubmarineoutofaction.Butafter thewar,thisexpertisewasdispersedintoindustry.AndORanalystsaremastersofthealgorithm.

Page 42: Big Data: How the Information Revolution Is Transforming Our Lives

Intentional overbookingwas a direct result of the flexibility of businesstravel.Afullpriceticketwasveryexpensive,butwasfullyrefundable,evenafter the flight, if itwasn’tused.So,whenbusiness travellershaduncertainschedules, they would buy several tickets around the time they hoped totravel.Theywouldonlyuseone,andgetarefundontherest.Thismeantthatsomeflightstookoffwithlotsofemptyseats,whichhadnominallybeensold,butwouldlaterberefunded.

This behaviour was very route dependent. It was much more likely tohappen on, say, Heathrow to New York or Heathrow to Amsterdam thanLuton to Marbella. OR teams began to collect data on the statisticaldistribution of the no-shows.And as a big data picture built up, they couldpredict with reasonable accuracy that, say, 10 per cent of bookings on aparticularflightwouldnotbeused.Andso,thesystemsbegantosell110percentoftheseatsonthatjourney.

Like all predictions of the future that are based on the past, such‘overbooking profiles’ were not foolproof. Often they were effective, butoccasionally more passengers would turn up than there were seats on theplane. The airline would then have to pay off some passengers to fly atanother time, compensating them for the inconvenience. Some passengerswithout a strict timetable came to enjoy this source of easymoney.For theairline, there was a simple balance of cost and benefit. Althoughcompensatingpassengerswhowerebumpedofftheirflighthadacost,itwasmore thanoutweighedbyall theextra revenue frombeingable tosellmoretickets.

Andsotheapparentmisuseofdata inoverbookinghasbecomeawayoflife in an industry with a great ability to make use of data. Althoughoverbookingmakesitappearthatairlinesaregettingbigdatawrong,inrealitybigdatabrings themahidden reward. It’sharder, though, tounderstand thewaythathighstreetbanksdealwithbigdata.

Thecautiousbanker

Aswe have seen, big data has had a huge role in stockmarkets and otherbankingactivitiesthatareclosertogamblingthancarefulhandlingofmoney.But everyday high street banking – undertaking the activities that dealwithourbankaccounts–seemstohaveapoorgraspoftheimportanceofbigdata.Like theairlines, thebankswereearly into large-scale, real-time transactionprocessing–theidealenvironmenttomakebigdataworkforthemandtheircustomers–buttheoutcomehasbeenverymixed.

Page 43: Big Data: How the Information Revolution Is Transforming Our Lives

It’struethatbanksmakeuseofbigdatainanumberofwaysthatimpactcustomers.Where once it was your friendly(ish) local manager, who knewyoupersonally,whowoulddecidewhetherornottolendyoumoney,nowit’sdown to the application of an algorithm to a data collection that extendsoutsideyouraccountactivitytocreditscoringagenciesandfurtherafield.Thismakesforquickdecisions,but,aswe’lldiscoverinChapter6,itcanalsohavea terrible impact on everyday liveswhen a faceless algorithm decides yourfuture.

Similarly,aswehaveseen,bigdatasystemsconstantlymonitorcreditanddebitcardactivityforfraudulentactivity.It’sobvioushowthiskindofsystemis as much a benefit to the customer as it is to the bank. But what is lessobvious inbanks’ datahandling iswhy theyhave archaic time lags in theirsystems.

Youmighthave,wondered,forexample,whysomepaymentstakeseveraldaystoclear–orwhyamonthlystandingorderthatusuallygoesoutregularlyas clockworkwill be two days late if the date falls on a Saturday, or eventhree or four days late if bank holidays intercede. This is because bankingsystemswerebuilttosimulatetheoldpaper-basedsystems,wheresuchdelayswerenecessarytoenablephysicalpiecesofpapertobemanuallycheckedandpassedfrombranchtobranchthroughinternalmailsystems.

Rather than rewrite the systems from scratch – which would have hugeoverheads – the banks have bolted on new capabilities. So,we can transfermoney electronically from one account to another in seconds – even atweekends.We can pay instantly 24/7 with a tap of a contactless card. Buttheseactivitiessitontopofasystemthatstillthinksitisdealingwithpaperledgers and cheques that have to be returned to the branch of the customerwhowrotethembeforetheyarecleared.Thebanksdemonstratewellthatbigdataineverydaylifeisinatransitionalphase.Itisonlywhenwehaveanewgenerationofsystemsbuiltwithamodernapproachtobigdata inmindthateverythingwillcatchup.

Ascustomers,we tend topreferourbanks tobeconservative.Wewouldrathertheywerecarefulwithourmoneythancarefree.However,bigdatadoesmanage toembrace funaswellas theserioussideofcommerce.And that’swhatwewilldiscoverinthenextchapter.

Page 44: Big Data: How the Information Revolution Is Transforming Our Lives

4FUNTIMES

VisitingtheAustraliangarden

Bigdata–andbigdatafun–hascomeintoourlivesmostdirectlythroughtheinternet,andspecifically theWorldWideWeb.The twoareoftenconfused.The media frequently claim that Tim Berners Lee, the British computerscientistworkingatCERNinGeneva, invented the internet.Hedidn’t–hiscontributionwastheWeb.

Theinternetisaninfrastructureforconnectingcomputers.It’sliterallyan‘inter-computernetwork’.Itdevelopedorganicallyinthe1970s,growingoutof aUSmilitarynetworkcalledARPANet,which from its1960sbeginninghad become a significant presence inUS universities. Initially, the networkwas used for logging remote terminals on to a distant mainframe – so aresearcher in, say, Los Angeles, could interact with a computer in Bostonwithouttravelling.

The first big step away from such basic interconnectivity came inSeptember 1973 when someone forgot his razor. This was US computerscientistLenKleinrock,whohadbeenataconferenceatSussexUniversityintheUKandreturnedhometoLosAngeles,onlytodiscoverhe’dlefthisrazorinBrighton.As theBrighton conferencewasonnetworks likeARPANet, atemporaryextension to thenetworkhadbeen setup, relaying its signalsviathe Goonhilly Downs satellite communication station in Cornwall whichusuallyhandledtransatlantictelephonecallsandTV.

ThelinkwasstillupwhenKleinrockgothome,assomeofthedelegatesatthe conference had stayed on. He discovered a colleague connected to thenetwork(despite itbeing3amin theUK)andsentamessageviaaprogramdesigned for connecting teletypedevices.Kleinrock’s request to retrievehisrazorwaseffectivelythefirstemail.

Throughthe70s,emailspread, tobejoinedbymessageboardsandothercommunicationsmechanismsusingeithertheinternetorcommercialnetworks

Page 45: Big Data: How the Information Revolution Is Transforming Our Lives

likeCompuServeandAOL.ButbigdatawasstilltogetatoeholdineverydaylifeoutsidetheenthusiastsuntilBernersLeeworkedhismagic.

Though the name ‘WorldWideWeb’ sounds grandiose, all Berners Leeattemptedwastomakeiteasytoaccessalibraryofelectronicdocumentsviathe internet. He set up a standard mechanism for this, just as the internetfounders had devised communication protocols enabling inter-computercommunication to function. And Berners Lee made use of the concept ofclickable hyperlinks to jump from document to document, introducedconceptually by Ted Nelson in the 1960s and already widely used inMicrosoft’s help systems and theMacHypercardprogram.BernersLeegotthefirstlocalwebfunctionalitytogetheratCERNinlate1990andopenedittotheworldin1991.

WhenIfirstplayedwiththewebin1992,therewereonlyafewwebsites.LikeCERN’s,thesewereprimarilytextdocuments,withverylittleinthewayof images – not surprising, bearing in mind that outside of majororganisations, access was by dialling up and interacting through a modem,around1,000timesslowerthanabasicmoderninternetconnection.TherewasnoGoogle,noranyothersearchengine(thefirstbigsearchengine,AltaVista,camealong in1995).Youhad toknow the specificaddress to type into thecrudeweb browser (which for no good reason had a hard-on-the-eyes greybackground).Probably themost excitingwebsite tovisitwas theAustralianBotanic Gardens site, started in 1992. Mostly text, this had a few low-resolution images. But the thrillwas that youwere looking atmaterial thatwascomingdirectlyfromAustralia.Theworldwassuddenlymuchsmaller.

Itwouldhavebeenimpossibletobelievebackthenjusthowmuchthewebwould change everyday lives, particularly as a universal source ofinformation,pumpingbigdatatoourfingertips.

Theanswertoeverything

YouarewatchingTVandsomeonegetsmentionedonthenews.‘Whowashemarried to?’ someone asks. There was a timewhen finding this out wouldhavebeenalmostimpossiblewithoutheadingdowntothelibrary–andeventhen,thechancesarethattheinformationwouldnotbeavailable.Youcould,perhaps, find it after a day or two running through celebrity columns inarchivesofnewspapers. Inpractice, though,youwouldn’tbother to retrievesuch trivia.Now, it’s amatter of picking up a smartphone, typing in a fewwords and the information is there. It’s hard not to see the internet as auniversal big data oracle, the source of all the information youmight need,

Page 46: Big Data: How the Information Revolution Is Transforming Our Lives

whereandwhenyouneedit.Ofcourse,it’snotthatsimple.Theinternetisnotanencyclopaedia.Much

oftheinformationthatturnsupinresponsetoawebsearchisnotcurated–sojudgementhastobemadeaboutwhatisfactualandwhatisfabricated.Therewasalotofconcernin2016about‘fakenewssites’andtheirinfluenceontheUS presidential election –when information is easy to publish and easy toreach, we need considerably more discipline about checking informationbeforetakingitasgospel.

It’stheclassicWikipediaproblem.Therehasneverbeenanencyclopaediabeforewith the tiniest fractionof the information that there is inWikipedia.And a remarkable amount of that information is accurate, particularly onscienceandtechnologytopics.Analysisinthepasthasshownthattherewereno more errors in science articles of Wikipedia than there were inEncyclopaediaBritannica,butWikipediacontainedfar,farmoreinformation.Where articles stray into more contentious topics, though – politics forexample–itcanbehardnottodiscoverodditiescreepingin,whenyouhaveasystem that is only lightly policed, despiteWikipedia supporters’ efforts tokeepontopofthings.

Sometimesmisleadinginformationcanbeinsertedforfun.Forsometime,the Wikipedia entry on the Surrey children’s attraction Bocketts Farmcontainedthisinformation:

BockettsFarm isalsooneof theworld’s first complexes successful ingeneticallyengineeringdinosaurs, the firstof thatbeingan18 tonnebronchoraus [sic]namedStuart,whograzesa16acrepaddockuponthenorthendoftheFarmPark.Hisdietcomprisesprimarilyofhay,vodkamartinis and flying saucers. In the future, Stuart says he would like to pursue a career inaccountancy.

Needless to say, this is not entirely accurate. Nevertheless, the internet hasprovidedadatarevolutioninservingupinformation,anenvironmenttowhichsociety is yet fully to adjust. The key to such information access is a goodsearch engine, a market that – with an honourable mention going toMicrosoft’sBing–isdominatedbyonename:Google.

Googleit

ThereissomethingmagicalabouttheGooglesearchengine.IfIenter‘BrianClegg’, within a fraction of a second Google claims to have found about564,000 results.And though some are for an arts and crafts supplier of thesamename, a surprisinglyhighnumbercoverexactlywhat I’m looking for.This is big data at itsmost impressive. Evenwith the remarkable speed of

Page 47: Big Data: How the Information Revolution Is Transforming Our Lives

modern technology, itwouldbe impossible to look througheachpage that Icanpotentiallyaccess–Googlecoversbetween47and49billionpages (bycomparison, Bing covers 16 to 17 billion, though realistically most of themissingpageswillrarelybeused).

Thatdoesn’tmean,incidentally,thatthereareonly50billionpagesontheweb.ThereareplentymoredocumentsthatGoogleispreventedfromlookingatbecausetheyarepartofacommercialorsecuresite.However,therangeofpagesGooglecoversisstillvastenoughtomakeithardtobelievethatitcanreallybesearchingthroughallthismaterialforus.Ofcourse,thesitedoesn’tworkitswaythroughtheentirewebeachtimewemakearequest.Itssoftwareagents,calledcrawlers,constantlyroamtheweblookingfornewmaterial toaddtoitsindexes,anditisthesethatourrequestsarematchedagainst,ratherthan the rawdata.Evenso, theGoogle index runs to100petabytesofdata,whereapetabyteisamilliongigabytes.

Finding responses from the index is partly aboutmatching thewords inyour query to the words on web pages – something that is speeded up bystartingtheprocessassoonasyoubegintotype–but it’sfarmore.Googleusesawholemixofdatatoputtheresultsinacertainorder.Somewillgettothe top because their site owners have paid for them to do so. Others willbubbleup theorderbecause theyare recent,because thesite is linked tobyotherimportantsites,becauseGoogleconsidersthesitetobehighqualityandmore.The resultswill evenbe structureddifferently dependingon anythingGoogle’ssystemcandeduceabout thepersonwhohas requested themfrombrowsing history or being logged in to Google’s wider facilities. A wholeindustry has built up trying to reverse engineer the secret sauce that isGoogle’srankingalgorithmandtopushsitesuptheranking.Tocountersuch‘searchengineoptimisation’,Google’sengineersareconstantlytweakingtherankingalgorithm.

Of course, Google can be used for far more than entertainment. Manysearches are undertaken to buy something, or for business or educationalpurposes.But there’snodoubt that thefunsideofbigdata isatplayontheinternet too, and nevermore so thanwith themonsters of video streaming,NetflixandAmazonPrime.

Netflixandchill

Ifwesetasideonlinegaming,wherethebigdataaspectsareevident,themostobvious entertainment focus of the internet iswatching videos – and aswehaveseenwithNetflix’sdevelopmentprocessfornewseries,there’sfarmore

Page 48: Big Data: How the Information Revolution Is Transforming Our Lives

involvedherethansimplygettingthebitsandbytesofdigitalmovingpicturesfromaserversomewhereintheworldontoyourTV,laptoporsmartphone.Evenso,theapparentlysimpleabilitytoclickonamovieorTVshowofyourchoice at any time, starting and stopping it as you would a DVD, isphenomenal.AsingleDVDholdsseveralgigabytesofdata.Imaginethesheerquantitiesofdatainvolvedwhenthesestreamingsitesareservingmillionsofcustomers.Thisisindustriallybigdata.

Although streaming services have yet to become as widespread asconventional broadcasting – at the time ofwriting, around a quarter ofUKhomessubscribe to thebrand leader,Netflix– theyarebeginning tochangetheway thatwewatch television. Increasinglyweexpect towatchwhatwewantwhenwewant,andastheon-demandservicesbuildtheircustomerbase,theyareabletofundsufficientnewmaterialthatitispossibletodoallyourviewingthroughthesenon-conventionaldatachannels.

It is hard to believe that in twenty years’ time there will be scheduledbroadcasts anymore,with the exceptionof live events.Even the traditionalbroadcasters,suchtheBBCandITVintheUKandCBS,NBC,ABCandFoxin theUS,areunlikely tobotherwithan increasinglyoutdatedwaytomakeTVaccessible. In the future,wecanexpectall themainnetworks tobeon-demand.AsNetflixhasshown,withthatcomessomesignificantbenefitsforthe network – far more than saving money by losing the overheads ofbroadcastingorprovidingadedicatedcablenetwork.It’seasyfornostalgiatogive us the idea that we would lose out by moving away from traditionalbroadcasting. But all the evidence is that it would enable traditionalbroadcasters to – likeNetflix – takemore daring and effective decisions intheircommissioningofnewprogrammes.

Like all ‘pure data’ entertainment, video has the potential to be pirated.Therewill always be some piracy, but the evidence is that the bestway tominimisethisistomakelegalstreamingordownloadedentertainmentaseasy,convenientandsupportiveaspossible.Companies likeNetflixmanaged thiswell fromdayone,making theirproductavailable throughasmanyviewingplatforms aspossible andprovidingvaluable features, such asbeing able tostopwatchingamovieor showpartway throughand restartingat the samepointonanotherdevice.

ThishastypicallybeenthedifferencebetweenbigdataleaderslikeNetflixandAmazonPrimeanddigitalaccesstovideoprovidedbytheoldstudiosandnetworks. The old guard has typically made it relatively inconvenient tostream – because they want to boost direct viewing to support advertisingrevenueandDVDsales.Consequentlytheyhavesufferedmorethanthenew

Page 49: Big Data: How the Information Revolution Is Transforming Our Lives

kidsontheblockatthehandsofthepirates.However, every kind of TV and film company has to make moving

pictures in thefirstplace–and this isanotherpartof theprocesswherebigdataisshowingitshand.

Fixingthepicture

A quite different application of big data to TV and movies is in makingimages in the first place. Remarkably, an artificial intelligence system hasbeen trained to producemoving images as a kind of ‘what happens next?’puzzlefromastillpicture.

A team from theMassachusetts InstituteofTechnologyplugged into thesystem the big data of 2 million videos from an online sharing site, onlyselecting videos that featured a specific set of scenes, including babies inhospitals,beachesand railwaystations.TheAI systemmadeuseof thisbigdatasettogeneratevideosequencesfromastillimage.Asof2016theseareshort, around one second long, but are nonetheless managing to animate asingle,stillphotograph.

As alwayswith big data, the outcome is only as good as the algorithmsmaking use of the data. Cleverly, the MIT system used two separatealgorithms,oneofwhichactedasacriticforthequalityoftheoutputoftheother,lookingforvariationsfromtheexpectationsithadfromallthemoviesithadabsorbed.

These systems are inevitably limited. We make deductions (or, rather,inductions) based on wider knowledge. If we see a talking human headsticking out of the sand on a beach,we assume that the rest of a person isburied in the sand. The system could only realise this if it had processed avideo of someone being buried. And at the moment they are limited inresolutionandinhowfartheycango.Inprinciple,theycouldbeusedtofillinafewmissingframesinamovie.Buttherealreasontheresearchersaretryingtodothisistogivetheirartificialintelligencesystemsabettergraspof‘whathappens next’ – something that is essential when we want technology tooperatewithautonomy.Taketheexampleofself-drivingcars,alreadybeingtestedon the roads. Indecidinghow toact, theartificial intelligencesystemcontrolling the car has to monitor the environment around it and predictoutcomestoreducetheriskofaccidents.TheMITsystemcouldbeasteponthewaytoimprovingthisability.

Video is not the only part of the entertainment industry that has beenmaking the difficult transition into the big dataworld.Butmusic and book

Page 50: Big Data: How the Information Revolution Is Transforming Our Lives

publishinghavebeenmoreliketheoldstudiosintheirresponsetotheimpactofbigdata:theyhavestruggledwithnewwaysofworking.

Movingthemedia

Thedigitalworldhadanimpactonmusicfirst,wherepiracyprovedtobeanimmense challenge. By its very nature, digital data is easier to copy thanphysically stored analogue data. Just moving from LPs to the much moreeasilycopiedcassettesandCDshadbeenashock,butadigitalfilecouldbemadeavailabletotheworldinseconds,andfreesharingserviceslikeNapsterateintoprofitsinadramaticfashion.

However, like the TV streamers, the music business realised eventuallythatmakingiteasytobelegalwasmoreeffectivethanspendingvastamountsattempting to squash piracy where the moment they closed down one siteanother sprangupelsewhere.First easymusicdownloads fromservices likeiTunesandthenmusicstreamingfromSpotifyanditscompetitorsmeantthat,for the average music listener, there was little reason to break the law. Ofcourse, apercentagealwayswould–but that couldbe regardedaswastage,withthemajoritygettingtheirmusiclegally.Onceagain,thekeytousingbigdata effectivelywas convenience.By using the power and flexibility of bigdata, a company like Spotify could push the boundaries of music listeningwithoutbreakingthelaw.

By contrast, in the publishing world, books as digital data camesignificantlylater.ResearchersatCarnegieMellonUniversitydescribebeingapproachedbytheheadofmarketresearchfromoneofthebigpublishersin2009 asking ‘What is an ebook?’While the researchers suspected that thequestion probably meant ‘How should we deal with ebooks?’, it remainsremarkablylateintheday,andmanypublishersstillstruggletodealwiththebigdataapproachtopublishing.

LiketheoldbrigadestudiosforTVandfilm,bookpublishinghadawell-establishedmodelforsqueezingasmuchaspossibleoutofbooksales.Firstapublisher would put out a hardback, which wouldn’t sell vast numbers ofcopies,butwouldhaveamuchhighermarkupthanapaperback,sothepeoplewhoreallywanted toget theirhandson thebookassoonaspossiblewouldpayaheftypremiumThen,asmuchasayear later, thepaperbackwouldbepublished for the rest of the readerswhoweren’t prepared to stump up. Sowhattodowiththeebook?

Initially, many publishers treated an ebook like a paperback, holding itbackformonths.Ratherthantakingthebigdataapproachofmakingitaseasy

Page 51: Big Data: How the Information Revolution Is Transforming Our Lives

as possible to obtain the legal version, they made it hard – and piracyblossomed in a trade that had never much suffered from it before.Interestingly,researchshowedthat therewasnologicbehindthedecisiontoholdback ebooks, because themarket for hardbacks and ebookshadhardlyanyoverlap.Itbecamepossible toexplorethiswhenin2010Amazonhadadispute with a publisher that usually published ebooks straight away, andtemporarily stopped selling Kindle editions of the publisher’s titles.Researchers were able to see how a delay in ebook availability impactedhardbacksales,comparedtowhenbotheditionswerereleasedsimultaneously.Therewashardlyanyeffect.

Even more interestingly, although hardback sales were not impacted bysimultaneous release of ebooks, the ebook sales dropped significantly if thereleasewasdelayed.Itseemsthatebookbuyersliketogettheirhandsontheproduct straight away. So the traditional pre-big data approach was notprotectingsales,butwasreducingthem.

Eventually publishers picked up on this and nowmost release an ebookwiththehardback(ifthereisone).Butevennow,somepublishers,typicallytheolder,lessflexiblebehemoths,demonstratethattheydon’tunderstandthemarket.Some release their ebookpricedat just less than thehardback,onlybringing down the price when the paperback is available. It’s a dangerousstrategy,oncemoreinspiringpiracy,showingthattheystilldon’tunderstandtheirdigitalcustomers.

It’sno surprise that thedominantplayer in the ebookmarket isAmazonwithitsKindleplatform,holdingover75percentoftheUSmarketand95percentoftheUK.Amazon,likeNetflix,isapastmasterofusingitsdataaboutcustomers to control which products are highlighted on its website and theeasewithwhich customers can access ebooks.Theyhave ahuge advantageover thepublishers, asbookpublishersdon’thavemuchdirect contactwiththeircustomerbase.Andinthebigdataworld,lackingdirectcontactputsyouinadangerousposition.

There is still oneway for book publishers to get information from theirreadersindirectly,though.Thatistousebigdatatotryandworkoutwhatitisthatmadepreviouslypublishedbooksbestsellers.

Readingtherunes

Aswehaveseen,bookpublishersdon’thavethesamekindofbigdataaccesstotheircustomersasdosomeoftheotherentertainmentmedia;however,theydohaveaccesstocontent–thatofbooksinprintandofmanuscriptsthatthey

Page 52: Big Data: How the Information Revolution Is Transforming Our Lives

receive,andithasbeensuggestedthatthisinitselfcanbeusedtopredictorevenformulatethenextbestseller.

Although publishers like to pretend that they are good at spotting apotential winner, major bestsellers, such asHarry Potter and 50 Shades ofGrey, come out of the blue. This is becausemaking predictions in systemswithmanyfactorsthatinteractwitheachothersoonbecomesmathematicallyimpossible as the system is mathematically chaotic – as we have alreadydiscoveredappliestoweatherforecasting.

Similarly, the aspects of a book that make it a bestseller are far toointertwinedwithtrendsandsocialfactorstoallowforgoodforecasting.ButaUS academic, Matthew Jockers, has teamed up with former editor JodieArcher to suggest thatbigdatamakes itpossible toovercome thisproblem.Theyhavedesignedsoftwarewhichanalysesahugerangeofbestsellersandfindslinkingfeatures.Theysuggestthatthesamealgorithmcanthenbeusedto check submissions for potential bestseller status.Here, big data becomesthearbiterofpublishingtaste.

Certainly,theNetflixmodelshowsusthatbigdatacansometimesdeliverbetter judgement on the potential of a project than an industry professional.Butisthisaviablebestsellermachine?JockersandArcherhaveputtogetheramechanism based on computerised text analysis that is good at spottingbestsellers – and yet, oddly, this doesn’t contradict the inherentunpredictability fromchaos.Why?Because thereare twodifferent levelsofbestsellerdom involved– andbecause Jockers’s andArcher’s analysis lacksoneaspectoftrulybeingabletomakingeffectivedecisions.

By looking at variousword uses, patterns and shaping, the software canmake a good shot at predictingwhether or not an existing book is likely tohavefeaturedontheNewYorkTimesbestsellerlist.Thisisimpressive,butitisn’t a universal panacea. In fact, Jockers andArcher admit thatwhat theiralgorithms spot is notwhatmostwould regard as great fiction. The systemlaps up the output of Dan Brown and the 50 Shades of Grey books. Butinterestingly,italsoisausefulcountertothosewhosaytheycan’tunderstandwhythesekindsofbooksellbecausetheyareterriblywritten.Inanumberofrespects, these books are well written – it’s just that the criteria for ‘wellwritten’arenotthoseusedbytraditionalliterarycritics.

Notonlyisthealgorithmnotarecipeforproducinggreatliterature,it’snotaboutproducingbookseveryonewouldlikeeither.Thatwouldbeimpossible.I personallywould only be interested in a handful of the books the systemconsidersthetop100bestsellersofthoseithasanalysed.Butmanyofusarenot ‘bestseller’ readers in themain.We likeourownniche, and that’s fine.

Page 53: Big Data: How the Information Revolution Is Transforming Our Lives

This system isn’t for us – it is about finding likely hits for the traditionalbestsellermarket.

However,whatabsolutely isn’t true is theassertionmadebyJockersandArcher that ‘mega-bestsellers are not black swans’. Their system uses anumber ofmeasures, and though it’s true thatmost super-sellers likeHarryPotterand50Shadesdowellonsomeofthosemeasures,theyallfalldownonothers.So,forinstance,towriteabestseller,JockersandArcherencourageustoavoidfantasy,veryBritishtopics,sex,anddescriptionsofbodies.Whatthemodelseems todowell is torecognise therun-of-the-millbestsellers, ratherthanpickoutmostoftherealrunawaysuccesses.

Asfortheaspectmissingfromtheanalysis,JockersandArchertellushowmanybooks that scoredhighly from their systemwereon thebestseller list,and that is impressive. But they don’tmention false positives – howmanybooks the system thought were bestsellers but weren’t. That kind ofinformation is also needed to assess an algorithm’s effectiveness. I’m surewe’llhearmoreof thiskindofanalysis,but Ihopepublishersdon’tput toomuchstockbyit–becauseitisalowestcommondenominatorapproach.Andsomewouldsaythesameappliestoanotherkindofbigdatasystemthatisfarmorewidelyavailable.Infact,Iwasspeakingtoherjustthismorning.

Itcantalktome

If you own a smartphone, or a modern computer, you may well have hadconversationswithanalgorithm.SoftwaresuchasSiriandCortanasimulatesa human voice and intellect, attempting to give intelligent answers toquestionsposedasspeech.Andbigdataisabsolutelyat theheartofmakingthistechnologywork.

Getting technology to speak and understand speech is a longstandingdream. Originally it was thought that this would be approached by acombinationofvocabularyandgrammarrules–thesamewayyoumighthavebeen taught a foreign language at school. The trouble is that this kind oflearning only takes you so far. Anyone who has moved from classroomteaching to total immersion in a foreign language realises howmuchmorethey pick up from exposure to real conversations andwrittenmaterial thantheydofromwordlistsandrules–andit’sthesameforacomputer.

Thebigdataapproachtodealingwithaforeignlanguage(foracomputer,all human languages are foreign languages) is one of immersion. Thecomputerhasaccess tovastquantitiesof real,human-writtenpiecesof text.And from them it can attempt to deducewhat works as a translation – the

Page 54: Big Data: How the Information Revolution Is Transforming Our Lives

system puts the words into context, giving it a far better chance ofunderstandingandspeakingnaturallythanifittriedtoworkfromrulesalone.

The designers of speech recognition systems like Siri make them moreimpressivebyhard-codingsomeresponsestoquestionstheyarelikelytogetfrequently, anddeveloping theseover time.The first time I ever askedSiri,‘Open thepodbaydoors,HAL,’ repeating the line from themovie2001:ASpace Odyssey, which is probably one of the most frequently used ‘fun’requests to an AI system, she replied, ‘Without your space helmet, Brian,you’regoingtofindthatrather…breathtaking.’WhenItrieditjustnowshegot decidedly snippy and responded, ‘That’s it… I’m reporting you to theIntelligentAgents’Unionforharassment.’

Tobeginwith,thecapabilitiesofanintelligentagentseemtobelittlemorethan having fun, rather than providing practical benefits to the user. But itdoesn’ttakelongtorealisethatitisquickertoaskSiri‘Howmanycaloriesinabanana?’andhaveherdoawebsearchthanit is togointoabrowserandtypeyourrequest.Similarly,anagentlikethiscanadditemstoyourcalendarmorequicklythanyoucanbytyping,especiallyifyouareonthemove.Andmore recent versions, such as Siri onMac computers, can deal with fairlysophisticatedrequestslike‘ShowmetheWorddocumentsIeditedthisweek,’pullingupalistofdocumentsthatcanbeclickedtoopenthem.

ItisalsopossibletousepartofthecapabilityneededforasystemlikeSiritoundertakedifferentfunctions.Forinstance,dictation–let’strythat.IhavejustdictatedthissentenceintomyMacusingthebuilt-insoftware,

thefactorsasyoucanseeitcanslipup.‘Thefactorsasyoucansee’?WhatIsaidwas:‘Butthefactis,asyoucan

see …’. Short connecting words like ‘but’ are often truncated to hardlyanythingaudible,while‘factis’and‘factors’aredifficulttodistinguish.Butthemore big data that is involved, the better the chance of getting it right.Similarly,translationsoftwarelikeGoogle’smakesusenotofword-by-wordtranslationfromadictionarybutavastdatabaseofhumantranslations,takingphrasesandsentencestogivecontext.

Sirimaybe thedoyenof thecomputingworldassistants,but she isnowrivalledbyAlexa,whichintegratesbigdatawithyourhome.

Thedata-drivenhouse

Forthelastcoupleofmonths,myhomehasbeeninvadedbybigdataintheformofAmazon’sEchosystem.Intworooms,acylindricalspeakerhasbeenadded to themore familiar technology. It looks just like a simpleBluetooth

Page 55: Big Data: How the Information Revolution Is Transforming Our Lives

speaker – but speak to it, startingwith the triggerword ‘Alexa’ and itwillspeakbacktoyou.

Echoisstillinitsearlyphase,thoughalreadyAlexahassometricksupitssleevethatSirican’tmanage.WhenyouaskAlexaforsomething,therequestispassedtoAmazon’sbackendbigdatasystemswhichbothinterpretitandattempttoserveupasuitableresponse.Youcanaskwhattheweatherisgoingtobeliketomorrow,tohearajoke,togetadefinitionofawordoranarticleon it fromWikipedia.Youcanaddanentry toyourcalendar, seta timeroralarm,organiseashoppinglist,playradiostationsandmusicfromAmazon’sdatabaseoryourphone–ormakeapurchasefromAmazon,orderanUberorreorderyourfavouritetakeaway.

If that isn’t enough, Echo also works with a range of home automationsystems. In the roomswhere it’s installedwehave added smart light bulbs.Thismeans that you can askAlexa to turn the lights on, dim them or turnthemoff.Withtherightkityoucaninteractwithyourhomeheatingorswitchdevices on and off. And mostly it just works. The only problem we’veregularly experienced is that theEcho in the loungewill get triggered onceeverycoupleofdaysbytheTVandrespondwitharandomremark,whichcanbedecidedlyunnerving.

Asoftenseemstobethecasewithbigdata,thereisatrade-offinvolved.For the user, the Echo system with its chirpy Alexa character is fun anddeliversasurprisingamountoffunctionality.Itbecomessecondnaturetoasktheradioforaparticularstation,ortoturnthelightsonfromacrosstheroomwithyour hands full.At the timeofwriting, it’sChristmas, and rather thanhope that theradioplaysChristmassongsyoucansimplyaskAlexa toplayChristmasmusic.ButthereisnodoubtthatthesystemisorientedtomakingiteasytobuythingsfromAmazon.And,unlessyoupressabuttontostopit,theEcho is constantly monitoring everything that is said in reach of itsmicrophones.AswewillseeinChapter6,thiscouldbeaconcern.

IntheTVdramaMr.Robot(notentirelysurprisingly,onAmazonPrime),one of the characters treats Alexa as the closest thing she has as a friend.AccordingtoAmazon,aquarterofamillionpeoplehaveproposedtoAlexa,while100,000adaysay‘Goodmorning,’toher.Andit’ssometimeshardnottosay‘Thankyou’whenshehashelpedyou.However,weneedtotreatthesestats with a pinch of salt. It’s unlikelymany proposals were serious, whileAmazonencourages‘Goodmorning’byrespondingwithanentertainingfactof thedayeachtime.ButhoweverseriouslywetakeAlexa,formanyofus,big data has had a significant impact on somethingwe’ve always had, longbeforeitwentonline–thesocialnetwork.

Page 56: Big Data: How the Information Revolution Is Transforming Our Lives

Socialmediaandsocialevolution

It’s rare for anyone, hermits apart, to live in isolation.Wehave alwayshadsocialnetworks.Friends,relations,workcolleagues,acquaintanceswenodto,peopleweseeregularlywhencommutingandwishthatwecouldworkupthecourage to say ‘Hello’ to. However, big data has taken this concept to adifferentlevel,andtheimpactofthechangethatsocialmediaishavingwasneverpredictedorthoughtthrough.

At the time of writing, Donald Trump has recently been elected as USpresident,andmuchisbeingmadeof the‘falsenewssites’ thatpumpedoutfictional negative ‘news’ stories via social media. Of course, there havealwaysbeenattemptstousepropagandaforpoliticalends,particularlywherethe state controls themedia. But in the free world, the press has offered afilteringservicethat,atitsbest,managedtofactcheckandweedoutoutrightlies.Socialmedia, however, has no such filtering.What’smore,we tend toputmoreweightonthingsthatweare toldbypeople inoursocialnetworksthanwe do from a remote source – and thisweighting seems to have creptthroughtobigdatasocialnetworkstoo.

Partoftheproblemwiththiskindoffalseinformationisthatitisfareasierfor information to spreadonsocialmedia than in thephysicalworld. IfyoureceiveashockingbitofnewsonFacebook,itonlytakesacoupleofclickstopassitontoyourpersonalnetwork.It’snotanunrealisticmetaphorthatthiskindof informationspreading isdescribedasviral– ithas thesamekindofone-to-manyspreadingmechanismthatfuelsanepidemic.

Asyet,ourindividualbehaviourhasn’tcaughtupwithFacebook,Twitterand theotherplatforms.Mostofusdon’thave theskills toperformaquickfact check before spreading a piece of news. And similarly, we aren’tparticularly well equipped to deal with the additional scale of socialnetworkingthat thesesystemsprovide.Ournaturalnetworks tendto involvesmall numbers of close contacts – say six to ten – with another layer of‘friends of friends’ taking themup to amaximumof around100.ButmostheavyusersofFacebook,forexample,willhavemanyhundredsof‘friends’.

How can we possibly cope with the output of such vast networks?Wecan’t.Facebookmakessureof thisbecausetheweightingonwhatweseeisleft to Facebook’s totally opaque algorithms, which decide how frequentlyand prominently we see input from each ‘friend’. Some kind of sifting iscertainly necessary – but the problem is that we have no idea how thosealgorithmsworkorcontroloverthem.IsuspectthatatthemomentFacebookisethicalandunbiased.Butlet’simaginethatinthefuturethecompanywas

Page 57: Big Data: How the Information Revolution Is Transforming Our Lives

bought by a malignant power. And they decided they wanted to bias anelectioninyourcountry.Thepowerisintheirhands,becausewehavenoideahowtheyareselectingwhatappearsonoursocialmediafeeds.

Even more worrying is the impact this kind of big data networking ishaving on our ability to concentrate and interact normally with others.Youngerusersparticularly,whoselivesareimmersedinsocialmedia,lookattheir phoneswith startling regularity, averaging around100 checksper day.Manyevenchecksocialmediaanytimetheywakeupduringthenight.

Weseemtoappreciatefindingoutinformationinthesamewaythatwegeta kick out of hunting down something physical – it gives a small hit ofpleasure, releasing dopamine to trigger the appropriate parts of our brains.Presumably this reflects our origins as an animal with a hunter-gathererlifestyle.Butthecombinationofsocialmedia,whereinformationisconstantlypumped at us, and mobile phones, so that information can be constantlyaccessed, is leading to addictive behaviour. When we get this kind ofneurotransmitterreleasetoofrequently,thebrainreducesitssensitivity,soweneedmoreandmorehits tosatisfyourselves.Andso,withinminutes,we’recheckingthesocialmediaonceagain.

Since time immemorial there has been a tendency to criticise youngerpeople’s ability to concentrate. However, the impact of social media ismeasurable.When experiments have been performed setting students a taskthat theymust complete in fifteenminutes and that theyknow isurgent,onaverage theywill switch away from the task and check socialmediawithinaround threeminutes. Though there is benefit from switching away from ataskoccasionallyanddoingsomethingcompletelydifferent,theleveloftask-switchingthatbigdatasystemsencouragehasbeendemonstratedtoproduceclear deterioration in capabilities. Too much social media makes you lesscapableofcarryingouttasksthatrequirementalinput.

Thepatternwithbigdatatendstobeoneofbothprosandcons.Soisthereanythinggoodtosayforsocialmedia?

Likeitorloatheit

Despiteeverythingintheprevioussection,IamaregularuserofsocialmediaandwasbeforeFacebookandTwitterexisted.Writingisasolitaryjob,oftenworkingfromhomewithoutthesocialsphereofanofficejob.Fromearlyon,I’vebenefited fromonline forumswherewriters can share their experiencesandsupporteachother.It’saboutfifteenyearssinceIjoinedanonlineforumset up by the Society of Authors called Writers’ Exchange (unfortunately

Page 58: Big Data: How the Information Revolution Is Transforming Our Lives

rendered‘writersexchange’bythebulletinboardsystem)andtherearestillahandfulofpeoplefromthatnowdefunctforumIkeepintouchwith–thoughvia Facebook. I’ve never met any of them, though some do meet upoccasionally.Butthesocialbenefitshavebeensignificant.

Similarly, I use both Facebook (facebook.com/brianclegg author) andTwitter (@brianclegg) for work purposes. It’s a good way to shareinformationwithreadersandtointeractwiththescientists,sciencewritersandpublishers Iworkwith.Bothof these typesofconnection– theprofessionalsupport group and the work contacts – seem not to have the negativeconnotations of theworstmisuses of socialmedia.But I still have to forcemyselfnottolookatittoofrequently.Workingonacomputer,it’seasyjusttoflipovertoFacebookandseewhat’shappening.

Achieving a balance with this potentially intrusive big data requires anawareness of the potential problems and a conscious effort to do somethingaboutit.It’spossibletogetthebestofbothworldswithsocialmediaaslongasyouareconsciousoftheissuesandtakecontrol.Theessentialseemstobenottoletitrunyou,oryourdevices.

Makingsocialmediaworkforyouinvolvesfirstbeingawarethatthereisaproblem.Andproblemsolvingissomewherethatbigdatahastheopportunitytocomeintoitsown,aslongasweareawareofitsneutrality.Dependingonwhatwedowith it, big data canhelp sort out problemsor can add to theirimpact.

Page 59: Big Data: How the Information Revolution Is Transforming Our Lives

5SOLVINGPROBLEMS

DowntheCERNrabbithole

Wehavealreadydiscoveredthatoneofthemostpopularvehiclesforbigdata,theWorldWideWeb,wasdevelopedat theCERNlaboratory.LocatednearGeneva, thismultinational centre for nuclear research is home to the LargeHadronCollider(LHC),thevastexperimentforslamminghigh-speedbeamsofprotonsintoeachother,bywhichmeanstheHiggsbosonwasdiscovered.ItmightseemstrangethatsuchapowerfuldatatoolastheWebdidn’tcomefromasoftwarespecialistsuchasIBMorMicrosoft.ButthevastexperimentsatCERN,particularlythoseusingtheLHC,aredatamonstersandalablikeCERNhastohaveasmuchexpertiseinhandlingandanalysingbigdataasitdoesparticlephysics.

TheLHCaloneproducesabout30petabytesofusabledataayear.Though‘only’ around a third of the size of Google’s index, this is a phenomenalamounttodealwith,andit’sonlyafractionofthedatathattheLHCpumpsout.Mostofthedataisthrownaway.Whencollisionshappeninthecollider’smassivedetectors, avast sprayofparticles canbegenerated, eachofwhichhasthepotentialtodecayfurther,producingaround600millioneventsor25gigabytesasecondtostore.

Even CERN’s systems can’t store 25 gigabytes per second, so a set ofalgorithms is used to select out the potentially interesting-lookingdata, firstreducing from 600 million events per second to 100,000 and then furtherdowntobetween100and200eventspersecond.Thedataisthendistributedaround the world for various computers to work on in the slow process ofsiftingandanalysing.Asanillustrationofthespeedatwhichthishappens,thedatathatwouldresultintheHiggsbosonannouncementstartedtobecollectedin2010,buttheannouncementwasnotmadeuntil2012.

Bytesize

Page 60: Big Data: How the Information Revolution Is Transforming Our Lives

Computerstorageisgenerallygiveninbytes,whereabyteismadeupof8bits,eachofwhichcanstore0or1.Aphonetypicallyhasbetween8and128gigabytesofstorage,whereagigabyteisaroundabillionbytes.(It’s‘around’becauseakilobyteissometimestaken as 1,024, an exact power of 2, rather than 1,000. Where this is the case, amegabyteis1024×1024,etc.)APCmighthaveaterabyte–around1,000billionbytes.Theprefixesindicatingmultiplesgoup:

kilo – 1,000 mega – 1,000,000

giga – 1,000,000,000 tera – 1,000,000,000,000 peta – 1,000,000,000,000,000 exa – 1,000,000,000,000,000,000

ThediscoveryoftheHiggsbosonwasapurebigdataevent.NoonesawaHiggs boson. No one even detected a Higgs boson. The events used toestablishitsexistenceweresimplyacollectionofdataontheparticlesitwasassumed a Higgs would decay into, analysed from the vast flow ofinformationfromtheLHC.Thediscoverymadenewsaroundtheworld,eventhoughthephysicsbehindthediscoverywascomplexandobscure.

Reapingthebigdatabonanza

Particlephysicsisaverynewadditiontohumanunderstanding,butbigdataalso reaches out to what is arguably the oldest of the sciences, astronomy.OrganisationslikeCERNrecognisedearlyonhowimportantbigdatawouldbetotheirscience,butthishasn’tappliedeverywhere.It’snotthatcomputersaren’tused.EvenwhenIwasstudyingphysicsinthe1970s,computerswerealreadymakingtheirmark.Butthemanagementofdataisnotgiventhesamelevelofimportanceasthephysics,andsomebelievethatthisisamistake.

JamesH.Simons,a formermathematicianwhobecameabillionaireasahedgefundmanager,hassetuptheFlatironInstituteinNewYorktofocusondeveloping computational infrastructure andmethods to support big data inscience, starting with astronomy and biology. As Simons points out, whencomputers are used in science,with exceptions such asTimBernersLee atCERN,most of the programming is left to non-specialist graduate studentswho‘aren’tgreatcodersfor themostpart’.Andtheirsoftware isoftenusedononeprojectandthendiscarded.

Ifbigdataistobeusedmosteffectively,thenthecodersneedanexpertise

Page 61: Big Data: How the Information Revolution Is Transforming Our Lives

that is rarely found in a university biology or physics department; Simonshopesthathisfoundationwillprovidetheimpetustochangethis.Anexampleofthekindofbigdataprojectthathasariseninbiologyisanalysingelectricalsignals collected from probes in animal brains. Many universities areundertaking this kind of research, but each uses their own,mostly amateur-writtensoftwaretohandlethedata.TheFlatironsoftwarehasbeendevelopedtopulltogetherdataacrossmanyresearchgroups,givingapotentialforafarbetterunderstanding.

Simons’proposition that scientists can’t hack it as codersmight seem indangerofcreatingresistancefromdisgruntledscientists,butcertainlysomeinthefieldagree.EdwardMerricks,whoworksatColumbiaUniversityonjustsuch a brain project, commented, ‘Their idea of employing dedicatedprogrammers for this sounds great.’ Merricks suggests that dedicatedprogrammerswillonlybeeffectivewithclosecooperation:‘Isupposetheonepotentialissuewithhaving“straight”programmersworkonthesetools,isthatoftentherealdatasetsincludeweirdsituationsthatnobodyhadthoughtaboutbefore,andtodebugthoseissueswouldfrequentlyinvolvehavingarelativelygood understanding of the biology underlying it, not being a purelyprogrammaticproblem.Butthen,iftrulycollaborativethroughouttheprocess,this wouldn’t be as much of a problem, and standardising these analyticalproceduresacrossthefieldwouldbefantastic.’

In another, very different field, astrophysicists, who can’t do directexperimentsonstars,havetorelyoncomputersimulationsoftheformationsofsupernovas,interactionsbetweenblackholes,howgalaxiesformandothercomplexcomputationalproblems.TheFlatironbigdataapproachisdesignedtodealwiththeirapplicationstooinwaysthatwouldnotbeavailabletothetypicalprogrammingastrophysicist.

InbothapplicationsattheFlatironInstitute,aswellasatCERN,bigdataisbeingusedonindividual, large-scaleprojects.But thebenefitsofbigdataanalysis can also be reaped in amuchmorewidely distributed fashion, forexampleinimprovingthecapabilitiesofourcars.

Betterdriving

If you drive amodern car, you have a heavily computerised device in yourdriveway. Under the bonnet and in the cabin, computers are constantlymonitoringengineperformanceandfarmore.Manyofthecontrolsforthecaralsorunthroughthesesystems.Andyetthemajorityofusonlyeverseethisdatareflectedinalightonthedashboard.

Page 62: Big Data: How the Information Revolution Is Transforming Our Lives

Whenacargoesintothegaragetobecheckedout,oneofthefirstthingsthatisdoneistolinkuptoacomputerinterfaceinthecar,enablingthegaragesystemstoget theirhandsonthisdata. Inprinciplewecouldallmakemoreuseof it–andofotherdriving-baseddata.And it ispossible tobuyadd-ondevices that interactwith thecar’scomputer interfaceviaasmartphone.Butmost carmanufacturers,with the exception ofmakers of high-tech vehicleslikeTesla electric cars, seem togoout of theirway to avoidusgettingourhandsonthedata.

Thereisnodoubtthatitcanbeuseful.Apartfrombeingabletoremotelycontrol locking, lights, engine start andmore, the systems canmonitor fueluseandemissions,giveinformationonanybuilt-inmonitors–forexampleoilorwasherlevels–andgenerallykeepontopofthestateofthemostcomplexpieceofmechanicalengineeringmostofuseverbuy.CombinethiswithGPSdata,andallkindsofinformationaboutdrivingstyle,fueleconomy,reliabilityandmorecanbededuced.

Sometimes part of this information can be provided, for example by thekind of ‘black box’ used by insurance companies to lower the premium forgood drivers.Butwemake ridiculously little use of this data.We can onlyspeculate as to why car companies are so reluctant to make access readilyavailable. In part this could be a habit of large companies to lock in accesswith their own software, much as electronic music players often rely ondedicated software (think iTunes).And in part it is likely to bebecause themanufacturers would rather we consumers didn’t have easy access toperformanceandreliabilitydata.

However, there is no doubt the directionwe aremoving in. Itwould besurprisingif,withinadecadeortwo,wecan’tmakeuseofthisdatafarbetter.Notonlywillitbeabletotellushowtoimproveourdriving,withalgorithmscomparing,forinstance,ourbrakinganticipationwithothers,butbyusingbigdataacrossallsimilarcars itwouldbepossibletogivepredictionsforwhenpartsarelikelytofailandawholerangeoftechnicalguidance.Inasense,carswould be catching up with people, as wearable technology that measuresfitnessdataisalreadypartofthebigdatarevolution.

Fitnessdata–wearables

WhetheryouuseaniWatch,aFitbit,oranyothermonitoringdevice,wearabletechnologythatkeepsaneyeonheartrate,bloodpressureandawholehostoffitnessdataisbecomingcommonplace.

At a personal level, it’s interesting to monitor your own performance,

Page 63: Big Data: How the Information Revolution Is Transforming Our Lives

especiallyifyouenjoysportandexercise.However,thebigdataaspectcomesinwhendatafromawiderangeofdevicesissharedandcompared,enablingawidepictureofuserstobebuiltup.Usuallysuchsharingisvoluntary,butthebenefitsareconsiderable inbeingable toputyourperformance intocontextand flag up anymarkers that could indicate health risks or suggestions formoreappropriateexerciseregimes,somanywillenablesharing.

Although usuallymarketed for the exercisemarket, such devices are, inreality,asmallpartoftheincreasingpresenceofbigdatainthemedicalfield.

Solvingthebiggestmedicalheadache

Wetendtobracketmedicinewithscience,buttraditionallytherewaslimitedoverlapbetweenthetwo;manymedicaltreatmentsreliedmoreonhearsayandhope thangood scientific data.Evennow ‘evidence-basedmedicine’ is rareenoughtobegiventhat label(asopposedto just‘medicine’).But thingsarechanging,withbigdataatthefore.

Oneoftheproblemsmedicalresearchersfaceisthatyoucan’tputpeopleinaboxandisolatethemfromotherinfluences.Thismakesitverydifficulttobesureexactlywhatiscausingsomething,resultinginthevastarrayofclaimsthatvariousfoodsandlifestyleissuesresultinimprovementstoordamagetoourhealth.WecansaythatpeoplewholiveaMediterraneanlifestyleandeataMediterraneandietarelesslikelytosufferfromheartconditionsthanpeopleofScotland (say).Thismeans, for example, thatpeoplewhoconsumemoreoliveoilarelesslikelytohaveheartproblems.Butwecan’tsaythattheoliveoil is the cause of the reduced likelihood because there are so many otherfactorsthataredifferent,overwhichwehavenocontrol.

This reflects an old science problem, summed up as ‘correlation is notcausality.’ Justbecause two thingsgoupordown inparalleldoesnotmeanthatAcausesB.Forinstance,foranumberofyearsaftertheSecondWorldWar, pregnancy rates in the UKwent up and down (were correlated) withbanana imports. The bananas did not cause the pregnancies (clearly). It’spossiblethat thepregnanciesincreasedbananaconsumption.It’smorelikelythatathirdfactor–householdincome,say–hadacausalimpactonboth.Butwecan’tassumebecausetwothingsare insomewaylinkedthatonecausestheother.

Afirst-levelbigdataapproachthatisnowincreasinglycommoninmedicalresearch is themeta study.An individual studyon, say,diet andhealthwillhave trouble both getting good quality data on sufficient participants andisolating a causal link. But the more data there is, the more reliable the

Page 64: Big Data: How the Information Revolution Is Transforming Our Lives

deductionsandthebetterchancethereistobeabletocontrolforsomeoftheother potential causes, removing them from the equation. Meta studiescombinetheresultsofarangeofexistingstudies,usuallyweightingthemforthequalityofthedata.Thisbigdataapproachisalreadymakingiteasier,forexample, to be sure that many alternative medicines are no better than aplacebo,ortoestablishspecificdietarycontributionstohealth.

This is only the start of the possibilities for big data, though. At themoment, most medical data is compartmentalised. Each individual hasmedical records which traditionally have been local and not shared.Establishmentslikehospitalshavesomedataacrosstheboard,butthatagainhastendedtobekepttoaparticularestablishment.Themorewecansharethismedical data, the better chance we have of using it to establish theeffectivenessoftreatmentsandtodevelopnewones.

There is inevitably some caution to be exercised here. Medical data isextremely sensitive and careful handling is required to ensure that it is keptanonymouswhere it can be – patients have an understandable reluctance tohandoverdata,evenwiththepromiseofrealbenefits.Some,whoeitherdon’ttrust those involved or don’t understand the concept of anonymisation, stillresist even this– ina surveyof2,000patientsaround17percent said theywouldneverconsenttotheiranonymiseddatabeingsharedwiththirdpartiesforanyreason.Asanexampleofgettingitwrong,inFebruary2016,Google’sartificial intelligence DeepMind group started work with the Royal FreeHospital in London to use big data to help spot patients with a risk ofdeveloping kidney disease. This involved collating data from 1.6 millionpatient records. But patientsweren’t informed that thiswas happening, andtheresultwasastrongnegativereactionfromthepress.

It is usually possible to persuade patients of the benefits when usinganonymised data, which can then provide the benchmark to allow analgorithm towork on a consenting patient’s data to, for instance, check forkidneydiseaserisk.TheproblemwiththeDeepMindapproachisthatGoogleresearchers casually assumed that they would have access to full non-anonymisedrecords. In tryingtosell theirapproach,Googlehasdescribedapatientportalwherebothpatientsanddoctorscouldaccessall theirmedicalrecordsandaddtothem–withoutseemingtobeawarethatpatientsmaynotbehappyopeningupinthiswaytoGoogle.

At thesame time,hospitalsmaybe reluctant tosharedata if it showsuptheirstatisticsinabadlight.Plusthereispotentialforbigdatatobemisusedmedically;wewilltalkmoreaboutthiswithreferencetoinsuranceinthenextsection.Yetwith thecorrectcontrols inplace, it’shard to thinkofanything

Page 65: Big Data: How the Information Revolution Is Transforming Our Lives

with thepotential tomake sobiga change to thewaymedicineworks thanproperly used big data. It would enable drugs to come to market faster,treatmentstobedevelopedmoreeffectivelyand,bottomline,morelivestobesaved.Whilecautionisjustified,itshouldn’tgetinthewayoftheincrediblyvaluableprogressthatcanbemade.

Oneoftheproblemsforbigdataandmedicineisgettingoverthehumangenomeproject(HGP)backlash.Thiswasamassiveproject,startedin1990and(sortof)completedin2003.That‘sortof’ isbecausethegenome–thecompleteDNAcode for an individual, or in this case, parts of a numberofindividuals’DNA–wasnot100percentcompletedwhentheannouncementwasmade.Inpartthiswasbecausearivalcommercialprojecthadchallengedthe state-funded study to a race (though in the end the rivals announcedtogether).

The project was trumpeted as a huge breakthrough in medicine,transforming practice through personal targeted treatments. And thetechnology has certainly moved on, with the cost of mapping a humangenomedroppingfromtheoriginalprojectcostof$3billiontounder$1,000ahead.Butverylittlehasemergedmedically–andmostworkersinthefield,while stillveryenthusiasticabout long-termbenefits, accept that it couldbedecadesbeforemuchpracticalmedicineisinfluencedbytheproject.Thislackofanimmediateoutcomemayhaveresultedinsignificantpublicdoubtaboutbigdataprojectsinmedicine.

In the long run, though, there is little doubt that big data can enablemedicalsciencetoimprovethelivesofpatients.Thebenefitsofthiskindofproblem solving are clear. Solutions to other problems, however, can haveverymixedoutcomes–nonemoresothanintheinsurancebusiness.

AmIinsurable?

In 2012 the European Union caused uproar by ruling that it wasdiscriminatorytochargeadifferentcarinsurancepremiumformencomparedtowomen.This hadbeen commonpractice for a very simple reason.Therewasconclusivedatatoshowthatyoungfemaledriverswerealotlessofariskthanyoungmaledrivers,whoconsequentlyfacedpremiumsuptothreetimesashigh.Andthisdemonstratesoncemorehowimportantamoraldimensionistobigdatadecisions.Analgorithmcan’tmakethatmoralassessment.But,asyet,wehavestruggled tocomeupwithanacceptableapproach todealwithmoralimplications.Let’stakethreeexamples–achancetoplaythebigdata‘Dealornodeal?’game.

Page 66: Big Data: How the Information Revolution Is Transforming Our Lives

Firstwe’ll consider those youngdrivers. Imaybe biased, having had topaypremiumsfortwodaughterswhichwentupconsiderablybecauseofthatruling.Butwas thechangefair?Itcertainlywouldbewrongtodiscriminatepurelyongender.Butwithgoodevidencethatyoungmenhadamuchhigherriskofdrivingdangerously,isitfairtopushupeveryoneelse’spremiumstoremovethis‘discrimination’?Arguably,thosewhoarelikelytoputtherestofus at risk should paymore, or the insurers are discriminating against thosewhodrivecarefully.

Asecondcaseisanothercarinsurancedecisionwhichyoucanchoosetosee either as discriminatoryor as fair to everyone else. Insurancepremiumsaren’t just decided on gender and age. Another factor that is often used iswhereyoulive.Acar in theruralareasaroundSwindonissignificantly lesslikely to be in a crash than one based in central London. So it seemsreasonabletosetalowerpremiumfortheSwindondriver.

But let’s say we have two identical cars, based in the same town. Theownerofonelivesonapleasantroadinasatellitevillage,theotherinarun-down council estate. The car on the estate is statisticallymore likely to bedamagedor stolen.So it also seems reasonable that theowner shouldpayahigher premium. The trouble is that nowwe are dealing with a measure –postcode – which has quite a strong correlation with wealth, and in somelocationsalsowithethnicity.Certainly,itwouldnotbefairtopunishsomeonewith higher premiums just for being poor or from an ethnicminority – andsomeput thisforwardasanargumentfornotbasingpremiumsonpostcode.However,itisclearthatthereisnointenttobiasonraceorwealth;theremaybecorrelation,butnosuggestionofcausality.Andifpremiumsaren’televatedwherethereishigherrisk,thecorollaryisthatpeoplelivinginlow-riskareasarechargeddisproportionatelyhighly.It’sano-winsituationandthedecisionisnotaneasyone.Arguablytheansweristomakeitlessriskytoliveontheestate,butinthemeantime,theinsurancecompanieshavetoactonwhatweconsidertobethefaireroftwounfairoutcomes.

Thefinalcasemovestolifeinsurance.Wealreadyacceptthatifsomeoneis diagnosed with, say, heart disease, or has a parent who died from heartdisease,aninsurancecompanywillputtheirlifeinsurancepremiumup.Mostofusacceptthisasafactoflife.Butifyoufeltthatthepersonontheestateshouldn’tpayhighercar insurancepremiums, is itanydifferent toprejudicepremiums against someone because of a medical misfortune they have nocontrolover?Andifitisacceptabletobasepremiumsonmedicaldata,wheredowedrawtheline?

Ifsomeone is likely todieofadiagnoseddisease in thenext fewweeks,

Page 67: Big Data: How the Information Revolution Is Transforming Our Lives

thenthat’sonething;anditisperhapsreasonablefortheinsurancecompanynottotakethemon,ortoproposeaveryhighpremium.Butshouldweallowprobabilisticpricehikes,suchasthosebasedonaparent’scauseofdeath?Atthe moment, insurance companies largely don’t take into account detailedgeneticdata,butcouldtheybegintodothis inthesameway?Again,wheredoes the cost to the rest of the policyholders start to weigh against theunfairnessofmakingsomeonepaymorebecausetheyhaveageneticmarkerthatmakesitmorelikelytheywilldieyoung?

I don’t have magic answers to these questions. I leave them to you toponder,asweallmustifwearetotakeafairapproachtobigdata’sinfluenceoninsurance.Wewillmoveinsteadtolookattheleaguetableswhichcanbeaboonforthosechoosingaplaceofeducation–oraterribleburdenforbadlyratedestablishments.

Abetteralmamater

Everyyearwearebombardedwithleaguetablesandcomparisonsforschoolsanduniversities.It’snotsurprisingwe’reinterested.Bothparentsandstudentswanttomakesurethattheymakethebestchoice–andfromthepointofviewof the place of learning, getting a good position is an important marketingtool.But thedifficultycomeswith thenatureof these tables.Allweusuallyseeasendusersisalistandsomekindofscore.Butreducingascomplexanassessment as the quality of a university or school to a single digit isdangerous in the extreme.And there is plenty of evidence that the big datasystemsproducingtheserankingscanbehighlymisleading.

Tobeginwith, it isn’t entirelyobviouswhatmeasures shouldbeused torank universities (we’ll focus on these establishments, rather than keepreferring to universities and schools, though the arguments apply to both).Percentage of first class degrees issued? But that’s at the discretion of theuniversity,sohighlysusceptibletomanipulation.Somedatacanbegatheredin a simple numerical form – the ratio of staff to students, the percentagestayingthecourseandgettingadegree,thepercentagegoingintoajobwithinsixmonths, the percentage doing further academic work after graduation –othersarefarmoresubjective,suchasthe‘satisfactionrating’ofthestudents.

Whichevermeasuresareselected–andprettywelleverythingimaginablehasbeentriedovertheyears–thenumbersarethencrunchedinanalgorithmallocating usually secretweighting to differentmeasures and out comes thescore thatwilldetermine theranking.Thechancesare that theoutcomewillgivebroadindicators,butthedetailwillbehopelesslyoverstressed,asthereis

Page 68: Big Data: How the Information Revolution Is Transforming Our Lives

no way to clearly distinguish between close competitors. But the rankingsystem would not be such a problem were it not for a powerful feedbacksystem.Becauseuniversityapplicationswillbeinfluencedbytheoutputofthesystem– and thequalityof those applicationswill then feedback intonextyear’sdata.

To put it bluntly, the problem with this kind of system is that if anestablishment is rated as poor it will be less desirable to go there. Theuniversity will typically end up with less able students. And the studentdemographicwillthendragdowntheratingevenfurther.It’sadeathspiral.

Theoutcomeofbeinginsuchasystemisthatsmartadministratorswilltrytogamethesystem.Toimprovetheirrankings,someuniversitieswillattemptto influence theunderlyingdata. In theUS, forexample, theU.S.News list,the original university league table, made use of the SAT (ScholasticAptitude/Assessment Test) scores of students. This is a test given to highschool students and used for university admission. Some universities paidtheir students to resit the test in the hope that theywouldget better results.Otherswerefarlesssubtleandsimplysentfakeresultsintothesurvey.

Arguablynotallattemptstogamethesystemweredisastrousastheymayhave genuinely improved facilities or teaching. For example, one of themeasures used by theU.S. News list was fundraising. So, by putting moreeffort intofundraising(henceimprovinguniversityfacilities)acollegecouldpush itself up the ranking. The danger came when attempts to manipulatetheselimitedmeasuresresultedintakingtheeyeofftheballinareasthatwerereally important to a student’s education. If all your effort goes intofundraising,forexample,yourteachingmaysuffer.

Moneyalsocomesintoanotherdebateregardingtherankings–theydon’tinclude the fees that students pay. Thismakes the college table, to say theleast, anunusual comparison.Noonewould thinkof selecting an insuranceprovider,say,usingasystemthatpaidnoattentiontothecostof thepolicy.Andwhileit’spossibletoarguethatcostshouldn’tbeafactor–becausethebettertheuniversity,themoreitshouldbeworthtoyou–thisdoesn’treflectthe reality for students attending.Yet theU.S.News systemdidnot includefees.Whileit’sonlypossibletospeculateonwhythisisthecase,it’slikelyitwasdonetosupportthecredibilityoftheleaguetables.

Thisistodowithbrandawareness.IfyousawacomparisonofcomputerswhereallApple’s laptopswereconsidered terrible, thechancesare thatyouwouldbesuspiciousofthequalityofthedata.ThehighstandingofApple’sbrandissuchthateveryone‘knows’thattheyaregoingtobeamongthetopoptions,andifthisdoesn’thappentherehastobesomethingwrongwiththe

Page 69: Big Data: How the Information Revolution Is Transforming Our Lives

comparison. Similarly, there are certain universities that are expected to dowell.IntheUK,thesewouldincludeOxfordandCambridge,whileintheUSit’sIvyLeagueschoolssuchasHarvard,YaleandPrinceton.

In theUK, feesarenot toomuchofan issue in the selectionprocess, asthey do not vary hugely between universities. But in the US, brand-leaderuniversities reflect their cachetwith unusually high tuition fees. So, if highfeeshadcountedsignificantlyagainstauniversity’sperformanceinthetable,itwouldhavemeantthattheIvyLeagueschoolswerepushedwelldowntheranking–whichwoulddevaluethecredibilityofthetablesasawhole.Simplesolution:ignorethefees.

IthasbeensuggestedthatoneofthereasonsUSuniversityfeeshavegoneupsomuchinrecentdecadesisduetothelackofconsiderationoffeesinthelistings–sobychargingmoreandploughingthatintofacilities,auniversitycanpushupitsranking.Atthesametime,otherestablishmentsmakemoneyoutof tutoringstudents toget throughuniversityadmissionssystems,whichthemselves employ algorithms attempting to reverse-engineer the listingsalgorithm and select studentswho are likely to give the university the bestposition,ratherthanthosewhowouldbenefitmostfromattending.

It would be easy to think that this implies that big data is inherentlyproblematicandcan’tbeusedtohelpthiskindofdecision–butthatsimplyisn’ttrue.Ifthedata(ideallyacrossawiderrangeofcategories,includingfeelevels) could be made available to students without the secret algorithmturningitintoaranking,itcouldbeextremelyvaluableinmakingachoice.Itwouldn’tbesufficienttosimplysupplythedata–thewholepointofbigdataisthatitisbeyondbasichumancapabilitytoassessunaided.Butitwouldbeperfectlypossibletoproduceeasy-to-usetoolsthatenabledstudentstocutandcollect data in different ways depending on their own requirements. Theranking could come out totally different for two students whose academicgoals and achievementswere far apart. Itwould no longer be amonolithic,damagingsystem,butonewherethedatawasusedtoassist thegoalsof thestudents.

Rankinguniversitiesorschoolsisalwaysgoingtobedifficultbecauseofthe subjective nature of some of the data involved. But it pales intoinsignificance as a problem when set alongside the challenge of makingdemocracywork.

Democracyasaproblem

Ifthereisasingleexamplethatbestillustratesthedelicatebalanceinvolvedin

Page 70: Big Data: How the Information Revolution Is Transforming Our Lives

usingbigdatatosolveaproblem,it’sthematterofdemocracy.Weoftentalkabout democracy without thinking about what it means. A dictionarydefinitionmight say something like ‘Government by the people – a type ofgovernment in which all the people of the state are involved in makingdecisions’. Traditionally, the only way to directly involve people in suchdecisionshasbeenanationwidepoll–ageneralelectionorreferendum.

However, these mechanisms are expensive and slow, meaning that theycan’t be used for day-to-day decision making. As a stand in, we havedeveloped the idea of representative democracy, where a group of electedindividuals stand for the people as a whole. Inevitably, such a process canonlyprovidecruderepresentation.Candidatesareoftengroupedintoparties,whichhaveacross-the-boardpolicies– it ishighlyunlikely that apartyyouvoteforentirelymatchesyourpersonalopinionsoneverymajorissue.

Now, for the first time,we have the technology thatwould enable us tosolvetheproblemofdemocracyusingbigdata.Thereisnothingtopreventussettingupasystemwhereallthegovernment’sdecisionsareavailableforthepublictodirectlyinteractwithandintroducetrulydemocraticgovernmentforthefirsttimeever.Therewouldneedtobechecksandbalancestoensurethatthepersonvotingwaswho they said theywere–but our ability tomanagedata on the scale of the population of a country instantly and across acountrywidenetworkmeansthatthisapproachispractical.

Althoughthisisyettohappeninreality,theideaisportrayedinanumberofsciencefictionbooks,mostinterestinglyinJohnBrunner’sgroundbreaking1975novelTheShockwaveRider.ThetitlereferstoAlvinToffler’s1970non-fictionworkFutureShock,whichattempted topredict life in theyear2000,but was mostly wide of the mark. However, Brunner does far more. Heimpressively develops the ideas of universally available computer networksand big data to describe a systemwhere the population appears to have aninputintheguidanceofthestate.

Interestingly,BrunnerpicksupontheDelphiprinciple,anideadevelopedby the US think tank the RAND Corporation. In Delphi, a group makes achoice.Thestatisticsontheirdecisionsarefedbacktothegroup,membersofwhich can then change their minds. This process can go through severaliterations.Thereisevidencethatforsomerequirements,thiskindofiterative‘wisdomofthecrowd’approachproducesbetterdecisionsthantakingasinglevote.InBrunner’snovel,thevotingapproachhasabettingcomponenttohelpmake it immersive, though in practice, Delphi is used to control thepopulation, rather than as the mechanism of true government, which ishandledtraditionally.

Page 71: Big Data: How the Information Revolution Is Transforming Our Lives

Bigdata, then,has thepotential tomake truedemocracypracticalon thescaleofanationforthefirsttimeever.Yetthereseemstobenorushtomakethis happen.Cynicsmight say that this is because those in power – electedpoliticians–wouldbeoptingfortheirowndemise.Itwouldbeturkeysvotingfor Christmas. But others feel that there is a fundamental danger in givingcontrol to ‘the masses’ because this can result in outcomes that do notnecessarilyfollowthedirectionofexpertadvice.Wesawthisin2016withtheUKreferendumvotetoleavetheEuropeanUnion.Manyarguedthiswastooimportantadecisiontoleavetothe‘ignorant’electorate.

A similar argument has been applied in the past, for example to capitalpunishment.TheUKabolishedthedeathpenaltyin1965.Althoughthepublicmoodhasgraduallyshiftedagainstcapitalpunishment,forsomedecadesthisdecision was not supported by the public at large – if we had had a trulydemocratic system, thepublicwouldhavebrought capital punishmentback.Those who argue we are better off with our representative system areeffectively supporting a kind of oligarchy, where power rests with a smallnumber of people who, it is hoped, can then be better informed than themassesandmakebetterdecisions.

Again, there is not a simple answer here, which is why democracyprovidessuchagoodillustrationoftheprosandconsofbigdata.Arguably,atrue democracy should make use of the ability for everyone to be directlyinvolvedindecisionsusingbigdata–butthatcouldonlybesatisfactoryifwecan also provide mechanisms for the voting public to be well-informedenoughtobeabletomakegooddecisions–somethingthatbigdatacouldalsosupportordeny,butthatasyetisarguablynotthecase.Bigdataforpoliticalends needs a lot of thinking through.When data and politics mix, there isalwaysapossibilityforBigBrothertogetafootinthedoor.Theuseofbigdatacaneitherresultinbettergovernment–ortotalitariancontrol.

Page 72: Big Data: How the Information Revolution Is Transforming Our Lives

6BIGBROTHER’SBIGDATA

Ancestralbigdata

Althoughweinevitablyresortto‘BigBrother’asanimageofthedarksideofbigdata,referringtotheall-seeingdespotofGeorgeOrwell’sdystopiannovel1984ratherthantherealityTVshow,suspicionabouttheimpactofbigdatagoesbackmuchfurther.Probably the firstbigdataexerciseswerecensuses.WhentheUKproposedonein1753,theideawasshelvedbyparliament.Thiswaspartlybecause theverywordhadnegativebiblicalconnotations–KingDavid’scensuswasapparentlyrewardedwithaplagueandtheRomancensusthatdeterminedthebirthplaceofJesus led indirectly toHerod’sslaughteroftheinnocents.However,therewasalsorealisticconcernfrombothpeopleandgovernmentabouthowthedatawouldbeused.

From the governing viewpoint, although the data would be extremelyuseful, it was felt that it would expose statistics to enemy countries. It’snotable thatwhenSweden’s firstbirth statisticswerepublished in1744, thenameofthecitythestudywasbasedon,Uppsala,wasconcealed.Equally,theBritish peoplewerewary that if the state knewmore about them, it wouldalmostcertainlyresultinmoretaxesandmakeiteasierforthearmedforcestoconscriptyoungmen,takingthemawayfromfamilyfarmsandbusinesses.Itwouldtakenationwidefoodshortagesin1800tomakethefirstUKcensusof1801anecessity.

We’vecometoregardthecensusasanecessaryevil–andtheinvolvementofbigdata isveryobvioushere.But sometimesbigbrother’sacquisitionofdataissignificantlymoresubtle–asinthecaseofsmartenergymeters.

Thesmartmeterdilemma

Agoodexampleofthewaythatallegedbenefitsofbigdatacanbesoldtotheend user while in practice providing more benefits for a company is the

Page 73: Big Data: How the Information Revolution Is Transforming Our Lives

apparentlyinnocuoussmartelectricitymeter.Atraditionalelectricitymeterisasimpledevicethatdoesjustwhatitsays

on the tin. It measures the amount of electricity used. It’s situated in yourhouse,sotheelectricitycompanyhastosendsomeoneouttoreadthemetertobeable tobillyou.However,manyhomeshavenowbeenfittedwithanewgenerationofmeter–asmartmeter.Amajor,expensiveprogrammeisunderway in the UK with the aim of getting smart meters into over 26 millionhomesby2020.Andthesalespitchisimpressive.

Weare told thatsmartmeterswillenableus toslashourelectricitybills,because theydisplay exactlyhowmuch energy is beingused andwhat it iscosting,makingusmuchmorelikelytocutbackonusage.And,becausetheyare smart, the meters can make use of special tariffs that supply cheaperenergy at different times of day, so careful users can trim their bills evenfurther. However, this isn’t why smart meters are popular with energysuppliers.

The benefit of this big data technology is in fact biased towards theelectricitycompanies.Smartmetersmeanthatthecompaniesnolongerhaveto employ meter readers, cutting their costs. And those variable tariffs themeters enable are just as likely to confuse the householder, enabling theelectricitycompaniestoaddinextrachargesforthepeakperiodsastheyaretoallowcustomerstosave.

It’snotthatsmarttechnologycouldn’tbenefithomeowners.Butcomparethesmartmeterwithamoreusefulpieceofsmarttechlikeasmartthermostat.Withadevicelikethis,allthesmartnessisfocusedoncustomerbenefit.Thecustomer can control the thermostat and access data directly from a phone.The thermostat candetectwhen thehouse is not occupied, and cut backonheatingautomatically. It really issmart fromtheconsumerviewpoint.Butasmartelectricitymeterkeepsdataandcontroltoitself.Thechancesofmakingsignificant changes to consumption because a display shows costs are low.Where benefits for consumers have been demonstrated they are based onlimitedstudies.

ThisisanexampleofaBigBrotherbigdataapplicationthathasrelativelyone-sidedbenefits,andnotintheconsumer’sfavour.Butevenwhentheuserreally does gain advantages, there can be concerns, aswe’ll discoverwhenrevisitingAmazon’sEchodevice.

Echohearsyou

ThereisnodoubtthatEchoisgreatfunfortheuser.Afterseveralmonthsof

Page 74: Big Data: How the Information Revolution Is Transforming Our Lives

interactingwithAlexainresearchingthisbook,ithasbecomeaverynaturalway to listen to the radio and music, or get a quick piece of information.Whileswitchinga lightonandoff in the roomyouare inhas limitedvalueoverthrowingaswitch(unlessyourhandsarefull)theabilitytodothisacrossthehouse,or tohavethelightscomeonautomaticallywhenreturninghomeafter dark, still has value. However, there is a Big Brother factor here –becauseEchoisalwayslistening.

Thisalsoappliestomobilephonesusingthe‘HeySiri’or‘OkayGoogle’command.Tobeabletopickupyourtriggerphrase,thedevicehastolistentoeverything you say.Ensuring that this hyper-snooping is notmisused is theresponsibility of the company behind it. And, historically at least, bigAmerican corporations have not proved great role models for treatingcustomers fairly. Amazon and Apple and Google all say that their systemsdon’tkeep trackofyoureveryword.Andallof themofferamechanism totemporarily disable monitoring. However, to feel that we don’t have BigBrother looking over our shoulder all the time we have to trust thosecompanies.

Something we know for sure is that every time you make a request toAlexa, that bit of speech is stored on Amazon’s servers. You can elect todeleteit,butbydefaultit’supthereindefinitely.Ifthat’sasfarasitevergoes,that’sfine.But it’seasy to imagine the temptation.We’vealreadyseenhowadvertisingistargetedbykeepingtrackofourwebclicks.Let’simaginesomebright spark in Amazon decides it would be a good idea to monitor allconversation in a household. Then, next time someone using that accountvisits Amazon, the system could make use of the information gathered toprovidespecialoffersandtopromotecertainproducts.

Theenduserneedneverbeaware.So,forinstance,youmightbewatchingTVnexttotheEcho,casuallycommentingonhowuncomfortableyourcouchis.Acoupleofdays lateryougoon toAmazon tobuysomething.Nowthesystem can ensure that it highlights particularly comfortable couches, andprovides a few tempting offers in this area. You don’t know this hashappened.Ifyoudid,maybeitwouldmakeyouabituneasy,butthenagainyoumightfinditquiteuseful.Inanycase,youdon’thavetobuy.Let’sstepitupalevel.

Youmention in conversation that you urgently need a cleaning product.MinuteslateryougoonAmazonandordersome.Priceisnotanissue–it’sarelativelycheapproductandyoujustneeditquickly.Soyouprobablydon’tevennoticethatAmazonhaspricedtheproductat20percentabovetheusualrate.Variablepricingisrealandhappeningnow.Uberwillchargeyoumore

Page 75: Big Data: How the Information Revolution Is Transforming Our Lives

for a taxi if trade is busy. Coffee shops use a more sophisticated variablecharging,raisingthepricefora‘premium’blendeventhoughtheremaybenodifference in the cost of the coffee beans. So the principle is somethingwelive with. But the difference is that Amazon would have used it based oninformationthat thesystemhadgatheredfromaprivateconversation.Is thisacceptable?

Let’s try one more level. Your Amazon Echo device overhears adiscussionaboutaspeedingfine.Youagreetotakethefineforyourpartner,whowill losehisorherdriving licence if theygetanymorepenaltypoints.This is illegal–but it’s lowriskandyoudon’t think that it iswrong.Whatyoudon’tknowis thatAmazonhasrecentlyentered intoanagreementwiththegovernmenttoalertthemtothiskindofaction.Whenyouattempttopaythefine,youaretakentocourtandarecordingofyourconversationisusedasevidence.

Should thisbeallowed?Theoldargumentgoes thatyouhavenothing tofear if you’vedonenothingwrong,but even if thatwere true it is intenselyintrusive.On theother hand,would it still be a problem if the systemwereusedtouncoveraterroristcell,havingrecordedmembersdiscussingaplottokill innocentpeople?Or if theEchohadwitnessedamurder? If this soundslike unlikely fiction, in December 2015, US police served a warrant onAmazon concerning an Echo device that was located near a hot tub inBentonville,Arkansas,whereVictorCollinshadbeenstrangled.EchoownerJamesBateswaschargedwithhismurder.ThepolicewishedtoexamineanyrecordsAmazonmighthold from the timeof the incident.Amazonresisted,butinJanuary2017wascompelledtoprovidetheinformation.

IfyoubelievethatAmazonwaswrongtoresist,whereshouldthelinebedrawn?What’stostoptheprocessbeingextendedtoprovidingevidencetoanunscrupulous government that you broke that new law saying that youshouldn’tinsultthepresident’shairstyle?Clearlybysuchapoint,theusesofbigdatawouldhavetravelledfartoofardowntheslipperyslope.

I ought to stress thatAmazon andApple andGoogle are not voluntarilysharinginformationwiththegovernment,astheAmazoncase,andthecasein2016 when Apple refused US government demands to unlock an allegedterrorist’s phone, make clear. But the technology makes this kind ofsurveillancepossible,andwearethenlefttothecompany’slevelofethicstokeep us safe. It’s a risk thatmany of us are liable to take in return for theadvantagesthiskindofdirectconnectiontobigdatabrings.Afterall,wetrustcarandplanemanufacturerswithourliveseveryday.Butweneedtomakeaconsciousdecisionaboutopeningourselvesupinthisway.

Page 76: Big Data: How the Information Revolution Is Transforming Our Lives

Whenyourbossisbigdata

Ubergot amention in theprevious section,but alongwithother companiesthathavesprunguptosupportthe‘gigeconomy’,thisisanorganisationthatdon’t just use big data to set prices or to interactwith customers. Big datacontrolsthewaytheindividualsworkingforthesecompaniesdotheirjobs.Ineffect,bigdatahasbecomethebigboss.Andthisisbecomingbigbusiness.IntheUSalone, ithasbeenestimated that around800,000peopleearnmoneyviagigeconomycompanies–andthesenumbersaresettorise.Uberaloneisthoughttohaveover1milliondriversworldwide.

Inanycustomer-facingemploymenttherearepeaksandlulls.Traditionallythese have been coped with using rosters, based on historical data. So, forinstance,most restaurantsare likely tohave less staff rosteredat four in theafternoonthan8pmatthepeakofdinnerservice.Mostshopswillrostermoresales assistants on a Saturday than on aMonday.However, big datameansthattherosteringprocesscanbecomemuchmoreefficient,reactingtoactualrequirementsonaminute-by-minutebasis.Thisisgreatforthecompany,butpotentiallydisastrousfortheemployee.Itcanresultinterribleworkinghours,alackofroutineandreducedearnings.

Takethelackofroutineaspect.Ifyouhavearegularshiftpatternitiseasyenoughtoplanyourlifeoutsidework.Butarosterthatisdrivenbybigdatacanbemodified at amoment’s notice.Does the forecast badweathermeanthatpeoplewillbuymoreofyourproduct,or thatfewercustomerswill turnup at your restaurant? No problem – change tomorrow’s shifts. Has yourbusinessgotsomerecentsocialmediacoverage?Bettergetsomemorestaffinthisevening.Youremployeescanneverplananything–andunlesstheyhaveacontractthatguaranteesanacceptableminimumnumberofhours,theycanalsosufferweekswhentheydon’tgetenoughmoneytopaytherent.Becauseeverypenny savedby the company in such ‘staff efficiency savings’ comesoutofthestaff’spockets.

Somemanufacturers,ledbytheJapanese,specialiseinanapproachknownas‘JustInTime’orJIT.Thisreflectsthecosttothecompanyof,forinstance,holdinglotsofparts,justincasetheyareneeded.TheJITapproachreliesonhaving very quick availability and only brings in parts from the supplier ashort timebeforeuse.The result is to transfer the cost of holdingon to thestockfromthecustomertothesupplier.Ineffect,thegigeconomyisawayofsupplying JIT people, bringing them into action at a moment’s notice toreflect the data. But the price of this convenience for the company is adisruptedlifefortheemployee.

Page 77: Big Data: How the Information Revolution Is Transforming Our Lives

Or, rather, for the non-employee. Because one of the ways that gigeconomy companies keep their costs down is by treating workers as self-employed, meaning that the company does not have overheads such assicknessandholidaypay.Thiskindofgigrelationshipbetweenemployees–sorry,freelancers–andthecompanyisonlypossiblewithabigdatasystemtoenable tasks to be distributed easily and quickly, using the kind of variablepricingthatsuchasystememploys.

August2016sawaprotest fromcouriersworkingforUberEats,apartofUberincompetitionwithrestaurantmealdeliverycompanieslikeDeliveroo.Theway that thecompanyhad treatedcourierswashardlyanadvertisementfor the benefits of having big data as your boss. Couriers were originallyoffered £20 an hour – an attractively high rate. But before long, the ‘whatyou’ll be paid’ algorithm became far more complex, with a low figure foreachdelivery,plusasmallamountpermile,lessUber’scut,plusapeaktimebonus. It’sall tooeasywhenyourworkforce isviewedviaadata stream totreat them like JIT parts, depersonalised and ripe for ‘efficiency’improvements.

Therehasbeenastringoflegalchallenges,attemptingtogetUbertotreatdriversasemployeeswithemploymentbenefitsanda fixedhourlypay rate,whetherornot theygetallocatedtojobs.Dependingonthejurisdiction,andthetraditionofgovernmentinvolvementinworkers’rights,somedisputesareliabletogothegigeconomycompanies’way,otherstheemployees’.

It’s not that the gig economy is inherently a bad thing. As a freelancewriter,my job isalsopartof thegigeconomy,and I farprefer it tomyoldsalaried job. But the difference is that I genuinely do work for myself,providingservices toawide rangeofothercompanies. Ihavea skill that isrelativelyscarce,socanearnenoughtoliveon,andIamnotdependentonanalgorithm to allocate my tasks, but rather can contact any newspaper,magazine or publisher I like. There is stillmore uncertainty in this kind ofworkingthantraditionalemployment.Itakeonmoreoftherisk–andliketheotherfreelancersofthegigeconomyIdon’tgetholidaypayorsickpay.ButIdoseethebenefitsofflexibilityandself-determinationfromgenuinelybeingself-employed,whilethosewhoareatthemercyofasinglelargecompanygetallthenegativesandnoneofthepluses.

Therearealsorisksattachedtotheratingsthatoftenaccompanybigdataemployment. We are used to casually rating our experiences, whether it’sproducts bought online or hotels and restaurants we’ve visited. Sites likeTripAdvisor can wield a significant amount of power by their ratings – ineffect, thedatabecomesafilterforaccesstotheratedorganisation.Andthe

Page 78: Big Data: How the Information Revolution Is Transforming Our Lives

whole rating scheme is taken further by taxi company Uber. Because herecustomersratedrivers,anddriversratecustomers.

CharlieBrookertookthisideatoitsillogicalconclusioninthe‘Nosedive’episode of his Black Mirror TV series (ironically shown on the big dataserviceNetflix,whichitselfusesaratingsystem).Inthedrama,everyoneinsocietyconstantly rateseachother,and their ratings influencewhat theyareabletodoandwheretheyareabletogo.Anargumentatanairport leadstothe total collapse of the protagonist’s ratings – and her discovery of thefreedomofnolongercaringaboutthem.

ClearlyUberisn’tthisbad–yettheneedtorateeachotherinevitablyleadsto falsemutual ratingforself-advantage.And there isalsoevidence that theratings are having an unintended consequence because the people doing therating don’t understand the implications of certain values – or, morerealistically, those who design the ratings don’t understand people wellenough.ThereisgoodevidencethatthosefillingoutsatisfactionratingsintheUKarelesslikelytousethemostextremeratings,avoidingboththeabsoluteworstandtheabsolutebest–so innormalcircumstancesfourstarsought torepresent a ‘good’, or even ‘very good’, level of customer satisfaction.However,Uberseestheminimumacceptableaverageratingas4.6.Anythinglower, and drivers can lose their position on the network. Every apparently‘good’ four-star rating drags them down. It’s also possible that personalprejudiceplaysarole.Uber’sdatahasn’tbeenmadeavailableforstudy,butwhencompanieswithasimilarratingsystem,suchasaccommodationrentalcompanyAirbnb,haveopenedup theirdata, it hasbeen found that this canhappen. In one US study, people with African-American-sounding nameswere 16 per cent less likely to be accepted than customers with European-American-soundingnames.

Ingigeconomyjobsit’softenthecasethatcustomersratetheemployeesandthisbecomespartoftheemployees’jobevaluation.Thisisjustapartofthe way that big data can distort a business’s ability to manage peopleeffectively.

Evaluationbynumbers

WhenIworkedforalargecompany,wehadacomplexsystemforevaluatingthe performance of our workers. The system was strongly driven by data,rather than by human knowledge. The idea was that, using a range ofmeasures, the staff ineachareawouldbeevaluatedona scale that requiredstaff to be placed on a distribution within the area – some above average,

Page 79: Big Data: How the Information Revolution Is Transforming Our Lives

someaverageandsomebelowaverage(thereweremorecategories,but thiswas the concept behind the distribution). The problemwith the systemwasthatitdidnotallowanypartofthecompanytobeextraordinary–theyallhadthe same distribution imposed – and it was unable to reflect the actualperformanceoftheindividuals.Datadrovethedistribution.

How that systemwasuseddependedon theacademicbackgroundof themanagers.Thosewithanon-technicalbackground tended tosimply take theoutputofthesystematfacevalue.Butthosewithamathematicalbackgroundgamed thesystem.Thismeantno longer treating thesystemasablackbox,butattemptingtounderstanditsalgorithmandmakeselectionstoreflectthat.This took a lot of effort – but itmeant that themanagerswhowere in theknowcoulddecidetheoutcometheywantedandworkbackwardstogivetheinputsthatwouldgeneratethatoutcome.

Clearly something had gone wrong here. The system should have beenused straightforwardly to assess performance, but savvy managers wereinsteadworkingaroundthesystemtogettheoutputtheyrequired.Andwheresimilar systems have been used in other organisations with an unthinkingbeliefinthevalidityofwhatcomesoutofsuchasystem–presumablybeingblindedbyscience–thereisthepossibilityofunfairanddownrightridiculousresults.

Agoodexampleofthisiswheresystemsaredeployedtoreflectthequalityofworkersbasedonmeasures thatdon’t reflectperformance. Imagineapaysystemwhere you got paidmore if your surname beganwith an S, or youcame fromaparticularpartof thecountry. It is clearly ludicrous.Yet somerealsystemshaveequallycrazymeasures.Thebeliefthatthecomputermusthaveitright,combinedwithalackoftransparency,meansthattheoutcomesareoftenaccepted,particularlybyamanagementthatdoesnotwanttowasteitstimethinkingaboutsuchtechnicalmatters.

Let’s lookat a specific example,describedbyCathyO’Neil inherbookWeaponsofMathDestruction.WashingtonDC’sschoolswerefailingandthemayorbroughtinaneducationguru,MichelleRhee,tosortthingsout.Rheebelieved that theproblems laywithpoor teaching. It’sentirelypossible theydid – but measuring the quality of teaching is something that is extremelydifficulttodo.It’sonlythroughhoursofmonitoringbyexpertsthatyoucanbesureofhowgoodateacheris,andthat’sexpensive.Extremelyexpensive.

However,schoolsareawashwithdataaboutstudentperformance.SoRheearrangedforasystemcalledIMPACTtobedeveloped,usingthatdatainanattempt tohighlightwhichteachersweredoingwellandwhichwerefailing.Notethatthisdataisindirect.Itdoesn’tnecessarilytellyouanythingaboutthe

Page 80: Big Data: How the Information Revolution Is Transforming Our Lives

teacher, it’s based on student performance only. But it’s easy and cheap tomeasure.Attheendofthefirstyearofuse,theteachersinthebottom2percentwerefired,andthefollowingyear thebottom5percent lost their jobs.That’sover200professionalsdismissedinayear.

Now, evenbeforewe look at thedataused,we can see a problem– thesame one that we had at the airline. The system imposes an arbitrarydistribution.Thepeoplewhocomeoutatthebottomaredeclaredtobefailing.But the system doesn’t know anything about how this group of peoplecomparedwith theprofession at large.Washingtonmight havehad thebestteachersinthecountry.Theirbottom5percentmighthavebeenasgoodas,say, thetop10percent inColumbus,Ohio.I’mnotsayingthat theywere–thepointis,wedon’tknow.

Theproblemsofimposingadistribution,though,areasnothingcomparedwith the data that was used to decide whether a teacher was good or bad.Becauseitwasimpossibletotellhoweffectivetheteacherwasfromthisdata– itwas just as irrelevant as using the first letter of the surname. Take theexample of SarahWysocki, one of the teachers in the second group to befired. She had excellent reports from observation of her work. But thealgorithm put her in the bottom 5 per cent. This algorithm, developed by aconsultancy calledMathematica Policy Research, attempted to measure theteacher’s performance by seeing how her students had progressed inmathsandliteracy.

The task the algorithm facedwas complex. It had to try to consider thecircumstancesof the students; clearlya teacher couldnotbe responsible forlack of progress of a child who, for example, never came to school. Thealgorithm neither had the data to make a good judgement on this, nor theabilitytotestforerrorsandcorrectitself.Goodalgorithmsmakinguseofbigdata can self-correct over time. If, for instance, this system had data on allteachers in all schools, it could see whether its decisions continued to bereflected in performancewhen a teachermoved elsewhere, and could reactaccordingly.But in this localisedsystem,oncea teacherwas fired fromoneschool therewasnowayoffollowingupthedata if theywerere-employed.Thealgorithmwasthewayitwas,eventhoughtheonlysensibleoutcomewas‘It’snotpossibletoeffectivelymeasurestaffperformancefromtheavailabledata’.

Inasense,whathappenedherewastheapplicationofabigdataapproachtoasmalldataproblem.Ifthesystemhadtriedtomeasuretheperformanceoftheschoolsystemacrossthecountry,ratherthanindividualteachers,itwouldhavehaddataonmillionsofstudentsandcouldhavemadesomedeductions,

Page 81: Big Data: How the Information Revolution Is Transforming Our Lives

thoughevenwiththatlevelofdataavailableitwouldstruggletomeasurehowwellteachersweredoing.Butlookingatthedataforasingleteacherinvolvedasamplesizeofaclass–andeveninovercrowdedclasses,thisisn’tanumberthatgivesanyconfidenceinthestatistics.

Another factor that appears to have come into play here is GIGO. Todecidehowwell thestudentshadadvancedduring theyear, thesystemtookdatafromteststakenattheendofeachpreviousyear.ThestudentsgoingintoSarah Wysocki’s final class had finished their previous year (at anotherschool)withunusuallyhighgrades.Yet theirperformance suggested that, ifanything, they were below average. Bearing in mind the school they camefromwas judged on its scores, and a subsequent enquiry showed that theirtestshadunusuallyhigh levelsofcorrections, ithasbeensuggested that theschool modified the results to ensure that its students scored well. IfWysocki’sclassstartedwithartificiallyhighresults,anyattempt tomeasure‘progress’wouldbefarcical.

O’Neilgivesanevenmorestrikingexample– thatofanEnglish teachercalled Tim Clifford. Although his school was not implementing a ‘fire thebottom tranche’ policy, it did use a similar rating system. Clifford washorrified to discover that he scored an achingly bad 6 out of 100.The nextyearhescored96outof100.Hehaddonenothingdifferent–butthesecretalgorithmhadconjuredupawildlydissimilarresult,suggestingthatitsscoreboreverylittleresemblancetowhatitwassupposedtomeasure:thequalityoftheteaching.

In an attempt to avoid bias by not comparing likewith like, this systemcompared student performance with expected outcomes for those students.Withenoughdata,andwithreasonablygoodpredictions,thisisn’ttoobadanapproach, because local fluctuations tend to get ironed out. However, onceagainthesizeofanindividualclass isfar toosmall toproduceanythingbutnear-randomresults.Becauseofusingabigdatasolutiononasmalldataset,theresultswerepainfullybad.

Thisapplicationofbigdatawasgearedtowardspeoplewhowerealreadydoing a particular job – but increasingly, big data is also being used inrecruitment.

Jobsfortheboys

Applyingforajobcanbestressful–andsomebigorganisationsgoouttheirwaytomakeitso,puttingapplicants throughbatteriesof tests, trickylateralthinkingproblemsandsearchinginterviews.Increasingly,theresultsofthese

Page 82: Big Data: How the Information Revolution Is Transforming Our Lives

assessment methods are pulled together in a big data system, whereindividualscanberecommendedforhiring–orexcluded–purelyasaresultofanalgorithm’sassessment.

Back in the 1990s I was frequently involved in recruitment for a largecompany.Wewerelookingforathree-figurestreamofnewstarterseachyearintotechnicaljobsandusedamixofanapplicationform,interviewandtestresults to assess our applicants. Like many employers, we specified aminimumlevelofeducation–adegree–tobeabletoapply.Thereasonforthis isnot theoneyoumightassume–andhasaninterestinginsightfor theuseofbigdata.

The limiting factor in the recruitment process was the interview. Thisinvolved three professionals from the company. Itwas time-consuming andinevitablymeant that relatively few applicants could be seen in a day – sotherehad tobesomewayof trimmingdowntheapplicationsaheadof time.We wanted to keep the interviews. Despite reasonable data suggestinginterviews have limited effectiveness, they remain the only opportunity tointeractwithacandidateonahumanlevelandarevaluedinmanybusinesses.Thatmeanthavingawaytotrimdowntheapplicantsbeforetheyreachedtheinterviewstage.

Today, applicants are far more likely to be put through screening testsonline before reaching the interview stage. We didn’t have that big datatechnology, instead putting applicants through tests as part of an interviewday.Sowehadtouseadifferentmeasurebeforetheday–andthatwastheroleofthedegreerequirement.Wehadgoodevidencethatadegreewasnotnecessary to be good at the job.We had experimentedwith takingA-levelstudents,and theyprovedasgoodas thedegreestudents.However,openingrecruitmenttoschoolleaversmeantthatwecouldn’tcopewiththenumberofapplicants.Askingforadegreewasawayof filteringdown thenumbers. Itwasnomoreeffective thanselectingapplicantswhosesurnamesbeganwiththefirstfivelettersofthealphabet–butitfeltfairer.

At least we were conscious of what we were doing. But when asophisticatedalgorithmdoesthatselection,it’salltooeasyfortheoutcometobe skewed in a way that no one understands. For example, many largecompaniesuse apersonalityprofile test aspart of their recruitmentprocess.We did, using it to get a feel for how successful applicants might fit withvariousteams.However,itwasneverusedasafactorinmakinganoffer.Butbigdataalgorithmsarelikelytoconsiderdatalikethisasgristtothemillandsoitcouldbethatsomeoneclassed,forexample,asintrovertedandabitofaloner, might be rejected without anyone knowing how this assessment

Page 83: Big Data: How the Information Revolution Is Transforming Our Lives

contributedtohisorherfinalscore.Somethingweneverhadtocontendwithwassocialmedia.Butnow, the

bigdata systemsdoing the initial sliceanddiceofapplications– somethingthat ismorenecessary thanever inaworldwhere jobapplicationsareoftenonline – have the temptation of taking a stroll into the murky world ofFacebook, Twitter, LinkedIn and Instagram. The question is how a site forposting pictures of your dinner andwitty comments about cats can provideusefuldata for job selection.Ahuman interviewermay raise an eyebrowattalesofdrunkennightsout,butabigdatasystemismorelikelytotrytomakedeductionsfromyournetwork.WhatkindofpeopleareyouconnectedtoonLinkedIn, how often do you post and do you gain many comments? HowmanyfollowyouonTwitter?

This kind of data can then be matched up against behaviours like teamworking,socialabilityandmore.We’reback,toanextent,tothemisuseofapersonality profile toweed out applications, butwith the insidious additionthat these systemsdon’t even require you to take a test – theymakeuseofyourpubliclyavailabledata(orthelackofit)tomakeanindirectassessmentofyourpersonality.Andonceagain, thiswillbebasedonanalgorithmthatonlythecompanyprovidingtheservicetrulyunderstands.Fromthepointofview of the applicant, and often the recruiter, a black box that could be arandomnumbergeneratorisdecidingsomeone’sfuture.

Likemanysuchsystems,notonlyistherenotransparencyinwhatisbeingusedandhow,thereisalsonomechanismfortestingtheeffectivenessofthedata used to predict the desired outcome.Whatwewant to know iswho isbest for the job.Butwhatwe have is a set of data on personal history.Allthesesystemscandoistrytofindsomelinkbetweenthetwo.Butrememberthemantrathatcorrelationisnotcausality.Evenifyoucanfindmeasuresthatseem to match with future job effectiveness (and usually that linkage isn’tavailable– theselectionofpredictivedata is justat thewhimof thesystemdesigner),thechancesarethisisjustcoincidentalcorrelation,notcausality.

Therearewebsitesthataresetuptodiscoverspuriouscorrelationsindata.There, we discover that US crude oil imports from Norway correlate withdriverskilledincollisionswithrailwaytrains,thatthepercapitaconsumptionof mozzarella cheese is correlated with the number of civil engineeringdoctoratesawarded,andthatthepeoplewhowouldwanttoknowthedateonwhichtheywilldie,shouldtheyhavetheopportunitytofindout,arefarmorelikelythantheaveragetofeelthatpizzaswithoutacrustarejustfine.Evenifthemeasuresbeingusedinrecruitmentdocorrelatewithabetteremployee,toassumeacausallinkmaybejustasillogicalasitwouldintheexamplesgiven

Page 84: Big Data: How the Information Revolution Is Transforming Our Lives

above.Systemslikethismisusedatainextremeways,butatleasttheparticipants

are voluntarily involved.The samedoesn’t applywhenbig data is capturedaboutoureverydayliveswithoutanywayofusoptingout.

Thesurveillancesociety

Itisoftensaidthatweliveinasurveillancesociety.Whereverwegotherearecameras–CCTV,bodycamerasonofficials,phones,carcams–andthemorethisvideodatafeedsintobigdatasystems,thegreaterthepotentialforittobemisused tocontrolourday-to-day lives.Someareasgoeven further,addingmicrophones to the street furniture, bringing us back into Alexa territorywithouteventheawarenessoftheirpresence.Butlet’sstickwiththevideo.

Surveillance video has become an important part of police evidencegathering.And there isgood reason for trying to findvideoevidence ratherthan rely on witness evidence – because human beings make terriblewitnesses. Despite academics being aware for a century of how inaccuratewitnessevidenceis,suchevidenceisstillproducedincourtcasesandisstilllargelybelievedbyjuries.Itshouldn’tbe.

The definitive experiment that should have removed all credibility fromwitnessevidencetookplaceinDecember1901inBerlin.AnargumentbrokeoutinaseminarbeinggivenbycriminologyprofessorFranzvonLiszt.Intheensuingscuffle,agunshotrangoutandastudentfelldeadtothefloor.Astheclass froze inhorror,vonLisztexplained thatnoonewasactuallyhurt, thiswasanexercise,andhewantedeachstudenttowritedownadetailedaccountofwhattheyhadjustseen.

Thesewitnessstatementswereasgoodassuchstatementswereevergoingtoget.Theyweretakenimmediatelyaftertheevent,andthestudentshadbeenreassured thatwhat theyhadseendidnotputany livesat risk,so the initialshock had been reduced far quicker than is usually the case after a violentincident.Thestudentssatdownandbegantowriteuptheexperiencetheyhadjustgonethrough.

ItishardtoimaginethatevenvonLisztwouldhaveexpectedtheeventualoutcome.Across the class therewere totally different accounts ofwhat hadtaken place.Most students got the timescale wrong. Often the sequence ofeventswasincorrectlyreported.Somedescribedhowthekillerhadrunfromtheroom–onlyhehadn’t.Mostdepressinglyfortheuseofwitnessevidenceinidentifyingacriminal,eightdifferentnameswereprovidedfor thepersonwhostartedthescuffle.

Page 85: Big Data: How the Information Revolution Is Transforming Our Lives

Human memory is a terrible source of evidence. So video is hugelybeneficialasitcan’tmisremember.Aslongastheimageisclear,itisthebestwaytoplacesomeoneata location.However,oncevideobecomespartofabigdatasystem,significantlymoreispossible.Wecanrelievethatpoorpoliceofficer from theboringhoursof searching throughvideos–and theall-too-likely possibility of missing the crucial evidence – by using artificialintelligencesystemstodothesearchforus.It’snotperfect,butagoodsystemcanat thevery leastwhittledownmanyhoursofvideo toa fewminutesofkeyfootagethathumaneyesneedtocover.It’sabitlikethesoftwareattheLHCthatpicksoutthemostlikelydataforfurtherprocessing(seeChapter5).

Withgoodrecognitionsoftware, it isalsopossible to track individualsorcars as theymove fromcamera to camera, buildingup adetailedpictureoftheirmovements.Sometimesthiscanbeforarelativelysimplepurpose,asinthecamerasnowwidelyusedintheUKtospotunlicensedvehiclesandflagthem up. In other cases, the system could be tracking the last knownmovementsofamissingperson.

Thereisnorightorwronghere–wehavetodecidewhatwewilltolerateinexchangefor thepotentialbenefits to justice.As longas thedata iswell-handled and only used to provide evidence for legitimate investigations, itseemsaperfectlysensibleuse–assumingoursafeguardsagainstmisusearestrongenough.ButthebalanceshiftstowardsBigBrotherwhenweseehowbigdataisbeingusedtopredictwherecrimesarelikelytotakeplace.

Weknowwhatyou’regoingtodo

Theideaofpredictingcrimesoundsfamiliar.Youmayremembertheopeningofthe2002filmMinorityReport.Itis2054.AfrowningpoliceofficerwitharemarkableresemblancetoTomCruisestandsinfrontofwall-sizedcomputerdisplay, flicking controls, expanding views, dragging images, as if he isworkingonavastiPad.

He already knows the perpetrator and victims of a homicide. He canpinpointthetimethekillingoccurs.Andthattimeisinthefuture–thecrimehas not yet happened. The officer now needs to deduce exactly where themurder will take place. With minutes to spare, leading a crack team, theofficerspeedsto thesceneandarrests theperpetrator,beforehecancommitthecrime.

BasedonthePhilipK.Dickshortstory‘TheMinorityReport’written in1956, themovie ispure fantasy. In the story, certain individuals– so-calledprecogs–havetheabilitytoseeintothefuture.Thisisnotgoingtohappen.

Page 86: Big Data: How the Information Revolution Is Transforming Our Lives

There is no scientific evidence for the existence of precognition orclairvoyance. However, big data does give us our best hope for getting astatistical glimpse of the future. This isn’t science fiction set in 2054. It’shappeningnowonthestreetsofAmericaandtheUK,usingasystemcalledPredPol.

A growing number of police forces are using PredPol (and competitorssuchasCompStatandHunchLab)tomanagelimitedpoliceresources.Suchasystemdoesnothavethepinpointaccuracyofsayingthataspecificpersonisgoing to commit a crime at a known time. But it divides the city up intofootballpitch-sizedchunksandassesseshistoricalcrimedata,placebyplace,reflectinganapproachofmappingoutaproblemtolookforpatternsthathasarichandeffectivehistory.

Back in the nineteenth century, London doctor John Snow pinned downthe sourceof anoutbreakof cholera inSohobymappingwater supplyuse,householdbyhousehold.Hewasable toshowfromthepatternofoutbreaksthat thediseasewasspreadbyaspecificwaterpump.Snowtookthehandleoffthepump,renderingituseless,andthespreadofthediseasewashalted.Itlaterturnedoutthatsewagewasleakingintothewatersupplyfrombuildingslacking proper drainage. Snow changedmedical opinion,which at the timefavouredamiasmaofbadairasthevehicleforthediseasetospread,byhiscarefulandimaginativeuseofdata.

Whatthebigdata-drivenPredPolcando,overandaboveSnow’sanalysis,is keep on top of amassive flow of data,which the system uses to predictwhere it feels thatcrimesaremost likely to takeplace.As thosepredictionsarise,policeofficerscanbesenttopatroltheareas,sothatascarceresourcecanbedeployed tohave themaximumbenefit for thecityor region.WhereSnowbasedhisworkonahunchthatthewatersupplywasthekey,PredPolanditscompetitorshavenopredeterminedideas.Theoperatorssimplypileinlotsofpossibledatasets–whereATMsandattractiveburglarytargetsare,forexample, and how common closed circuit TV cameras are. How busy thestreets are and howmany known criminals live nearby.Where crimes havebeencommittedbefore,ofcourse–alongwithfactorsliketimeofday,dayoftheweek, public holidays andmore. Then thewhole is churned through tocomeupwithasuggestionofwhereofficersshouldbedeployed.Onceontheground,thepolicecanfeedbackcrimepreventionstatistics,andwherethere’sa positive outcome, the system reinforces the data that delivered the bestresults.

Itmakes perfect sense.There is often concern in inner-city policing thatpolicepayundueattention to certainminorities andgroups.But this system

Page 87: Big Data: How the Information Revolution Is Transforming Our Lives

knowsnothingaboutindividuals,socan’thaveadiscriminatoryfactorbuiltinbased on, say, age, ethnicity or religious background of local residents. Itmakesthemostofresources.WhenKentPolicetrialledPredPol,theirofficersmanaged to deal with ten times the incidents compared with relying onrandom patrols. Yet despite this, the system can produce a dangerousfeedbackloopthatdiscriminatesagainstcertainneighbourhoods.

This reflects theway that crimes are reported.Most aren’t. The chancesare, at some point in your life, you have been a crime victim and have notbotheredtoreportit.WhenIwasatschool,forexample,Iwastwiceassaulted–oncepunchedon the jawatarailwaystationandonanotheroccasionhadstonesthrownatmeasIwalkeddownthestreet.InbothcasesitwasbecauseIwaswearing the uniformofwhat some regarded as a ‘posh’ school.Bothwereminoroffences,but inbothcases the lawwasbroken.Yet theywouldneverberecordedonapolicedatabase.

However, let’simaginethatapredictionsystemtellsustoexpectasurgeofcrimeinanareaofacity.Policeofficersappearontheground.TheycanrespondtoalotofminoroffencesliketheonesIsuffered,whichgointothesystem.Andsothisareaisflaggedasbeinginparticularneed.Accordingly,moreofficersarescheduledtoturnupthere–andwe’reinadownwardspiral.Mostofthesesystemshavetheoptiontochoosewhethertouseallthedataorjustthatonseriouscrimes.Butthetemptationistoincludeminoroffences(asKentPolicedid)becauseit’saneasywaytoincreasetheclear-uprate.Andiftheyareincluded,thesespirallingloopstendtohitthepoorerdistricts,wherethesekindofminoroffenceshappenmorefrequently.Theresultisakindofdiscrimination that nohumanhasbrought intobeing, simplybecauseof thedecisiontoincludelow-levelcrimedata.

Theneedsofthemany

Overall,oneof thedifficultiesweface indealingwithbigdata is thatoftentherearebenefits forsomeanddisbenefits forothers. It is relativelyeasy todisapprove of a use of big data where all the benefits go to the state or acompanyandnonetotheindividual,orwherethealgorithmsmakenosense,asoccurredintheteacherratingsystem.Thebalanceislessclearwheneachgetssomeadvantage.Andperhapsthehardestrequirementistoweighupthebalance whenwe have to line up the impact onmany individuals with theimpactonthefew.

Thereareplentyofbigdatasystemsthatworkwellformostofus,butletasubsetofthepopulationdown.Insomecircumstances,thistrade-offisnota

Page 88: Big Data: How the Information Revolution Is Transforming Our Lives

hugeproblem.Forexample,ifyouaregettingfilmrecommendationsfromastreaming system likeNetflix, itmaywell get its suggestions right inmanycases,butfailterriblyonceinawhile,recommendingabaseballmovie,say,to someone who doesn’t like sporting dramas, because they happen to likefilmslocatedintheUS.Thisisnotgoingtocauseamajorproblem.

However, it is very different if that big data system is handling youremploymentprospectsoryourcreditscore.Takecreditscoringasaspecificexample.Onceuponatime,youhadabankmanager.Thiswasarealhumanbeingwhoyoutalkedtoandwhodevelopedapictureofwhatyouwerelikeandwhetherornotyouwereafinancial risk.Now, though,whenyouapplyforaloan,say,theoutcomeisalldowntoanalgorithmandbigdata.

The system will pull in data on any existing loans, your income, yourhistoryofrepaymentanddefaults,andwillmakeaninstantdecisionbasedonthealgorithm’sinterpretationofthatdata.Youhavenowaytodiscoverhowitreacheditsdecision,norcanyoupointout,forinstance,thataparticularpieceofdata is incorrect,or that, for instance,youmisseda loanrepaymentwhenthebank’scomputersystemhadproblemslastmonth,notbecauseyouwereabadrisk.

Some lendersgofarbeyondcredit scoring,making theiralgorithmsevenmoreopaque–butclaimthattakinginthisextradataenablesthemtolendtopeoplewhowouldn’totherwisegetaloan.(Whetherornotthatisagoodideaisopentodebate.)IntheUK,themostfamous(orinfamous)ofpaydayloancompaniesisWonga,whichmanagesasignificantlylowerdefaultrateonitsloansthantraditionalbanks.

As well as credit scores, theWonga system accesses what it can aboutapplicants throughbigdata.Are theyon socialmedia?Thencan it findoutanythingabouttheirfriendsthatwillgiveitanedgeonyourriskasapotentialclient?What kind of technology are they using to get toWonga’s site, andwherearetheybased?Thealgorithmthatdecidesonwhetherornottoissuealoan is not basing its decision on clearly understandable rules. There weresome to begin with. The system will have started with assumptions like‘someonewithonlinefriendswhogenerallyrepaytheirloanswillalsotendtorepay’ or ‘people who live in an area where most people don’t defaultprobablywon’tdefault.’Butovertime,thesystemwilltuneitselftobecomemoreefficient.Someofthecorrelationsitusesmaybecrazy.Butaslongastheyaregettingresultstheywillbeaddedintothemix.

Credit scoring is arbitrary, dependent on poor algorithms and is at bestsemi-transparent. Arguably, the systems that go beyond credit scoring likeWonga’sareevenworseinthisregard.Andeveryday,creditscoringsystems

Page 89: Big Data: How the Information Revolution Is Transforming Our Lives

are putting black marks on people’s records, making it harder for them toundertake financial transactions in the future. Can it really be fair that ahiddenalgorithmcanimposefinancialruinonanindividual?

It seems thatbigdatavergeson theevil in suchcircumstances.Andyet,thesystemitreplacedwasoftennobetter.Ifbigdatameansthattheneedsofthe many outweigh the needs of the few, leaving a few unfortunatesstruggling,it’sarguablethatintheoldsystem,theneedsofthefewwhoknewthe rightpeopleoutweighed theneedsof themany–becausehowwellyoudidwithyourbankmanagercouldeasilydependonwhoyouknewandyoursocialcircles.Neitherapproachisperfect.

Althoughthebigdatasystemisprobablypreferable,weneedtobeabletoopenup thealgorithmsso that it iseasy foranyone to findoutexactlywhytheywerescoredinaparticularwayandhowtheycanquicklyandeffectivelycorrectanyerrorsand ‘deathspirals’ofdatawhereoneblackmark leads tomoreproblems,leadingtostillworseratings.Itwouldbeperfectlypossibletohavelegislationthatrequiredbanks,forexample,togivedetailsofhowtheirdecisiontomakealoanwasreached,openingupthegutsofthealgorithm.

Despiteitslimitations,atleastcreditscoringforabankloanorcreditcardisusingthedataforsomethingitwasdesignedtosupport.However,thecreditrating agencies have realised that they couldhavemore customers than justthebanks.Onesidetothisistotrytosellindividualsaccesstotheirowndata.Somecountrieshaverealisedthatthisismadness–wesurelyhavearighttoseedatathat theagenciesholdaboutourselves–butothersstillallowcreditscorers to rip people off.However, another possibility soon occurred to thecreditagencies.Lotsoforganisationswanttoratepeople.Whynotsellcreditdataforthesepurposestoo?

Thismeans that someorganisationshave started tousecredit ratingas amechanism to assesswould-be employees. There’s a crude sense to this. Ifsomeone can’t look after their financial affairs, would you want them, forinstance,workingonyouradministration?Butthetroubleisthatthereisonlysomuchyoucantakefromonescenarioandapplyinanother.Someonecanbelaxwiththeirownmoneybutverycarefulwithsomeoneelse’scash.Andifsomeonehasjustbeenunemployedorastudent(asmanypeopleapplyingfora jobwillbe), theywill tend tohaveadepressedcredit rating–but thissaysnothingabouttheirabilitytodothejob.

Weliketothinkoftheworldinblackandwhitecertainty.Buttherealityisthatwithbigdatatherearemanyshadesofgrey.

Page 90: Big Data: How the Information Revolution Is Transforming Our Lives

7GOOD,BADANDUGLY

TheSpider-Maneffect

It’s rare that you come across words of deep philosophical wisdom fromcomic book heroes, but Spider-Man famously (if with a certain amount ofpomposity) announced to the world that ‘with great power comes greatresponsibility’. This is a lesson that needs to be printed on every ‘how tosucceed with big data’ handbook. As we’ve seen, big data often involvespotentialbenefitsandriskbothforusersandfortheownersofthesystems.

Althoughwetalkabout‘bigdata’systems,thedataitselfisneutral.Itcan’tdoanythingon itsown.Whatmakesorbreaksbigdata is thequalityof thealgorithms – the computer programs that pull the data together and makedecisions and discoveries as a result of their ability to look across vastquantitiesofinput.Suchalgorithmscancopewithfarmoreinformationthananyhuman,buttheydon’thavehumansenseandsensibilities.

Mostimportantistohavetransparencyandaclearunderstandingofwhathappenswhenthingsgowrong.The‘responsibility’inSpider-Man’swarningreflectstheneedfortheownersofbigdatasystemstoensurethatthosewhoaresubjecttobigdatasystemshavetheopportunitytodigintojustwhatthesystemisdoingandcanpointouterrorswhichwillfeedbackintothesystemandallowforcorrections.

Many big data owners are reluctant to take this level of responsibility.Theywillargue, forexample, that theiralgorithmsareproprietaryandcan’tbe explained to end users. Transparency, they argue, will damage theirbusiness. However, this isn’t acceptable. When a system impacts people’slives, thiskindof safeguard is anecessity.Allowingbigdataowners togetawaywith theargument ‘ouralgorithm isproprietary’ is likeallowingacarmanufacturertoarguethatitscarscan’tbetestedforsafetybecauseitwouldgive away commercial secrets. Tough. At the moment, system owners getawaywithfartoomuch,whethertheiralgorithmsaredecidingwhattosellto

Page 91: Big Data: How the Information Revolution Is Transforming Our Lives

usorhowourcreditscoresstackup.Thereislessobviousreasonforreluctancetobuildinfeedbackloopsthat

enable the system to correct itself when it goes wrong. This isn’t givinganythingaway,itisjustmakingthesystemworkproperly,sothatthebigdataisbeingusedappropriatelyforanindividual.Frankly,theonlyreasontoresistaddingerrorcorrectiontothesealgorithmsislazinessandthecostofmakingachange.Andagain,theownersofsuchsystemsshouldnotbegiventhechoicewheretheoutcomecanimpactqualityoflife.

Wehaveheardsomeunpleasanttruthsaboutbigdata.However,thisisn’taLudditepolemic,incitingyoutogooutandsmashupthemachinesandreturntonature.Bigdatahas thepotential tomakeushealthier, togiveusabetterlifeandtomakethemostoftheremarkabletechnologywehaveavailable.Wedon’t want to throw that away, andwe shouldn’t have to. However, as wehaveseen,bigdatacomeswith risksattachedandboth thedataownersandtheendusersorcustomersneedtobeawareofthem.

Thegood–datasetsyoufree

Solet’sstartwiththegoodnews.Idon’tthinkit’sanexaggerationtosaybigdata has the potential to set us free. Science benefits hugely from havingaccess to as much data as it can handle. And so do we as individuals.Managedproperly,bigdatacouldmakepolitics reallydemocraticandallowus to make better informed decisions about our lives. Big data gives usentertainmentthatisstreetsaheadofwhatwe’veseenbefore.Medically,bigdatacansortmythfromcure,factfromguesswork.Andintheend,weloveinformation. Itgivesusakick.Bigdatacandeliver the informationbuzzasneverbefore.

The only proviso is that we should be able to access the data we wanteasily,we should have the tools tomanipulate it and understand it, andweneedtobeabletodealwitherrorsandmisapplicationbyalgorithms.

Oneimplicationofthisisthatweshouldconsiderchangingtheeducationsystemtoreflectabigdataworld.Ourcurrentsystemofteachingforexamsisfartooweightedtowardsgivingusthebasicsforspecificcareersandlearninginformation that we might never need again. This is an education systemdesignedforapre-bigdataworld.Thereistheneedtogiveyoungpeoplethetoolstheyneedtomanipulateandunderstanddata,butalsotoavoidbecomingobsessedwithit.Thatlastpartisimportant,becausewhileweallknowhowthosewhohavegrownupinthebigdataworldareadeptatinteractingwithaphone, a computer and the TV all at the same time, andwhile they all get

Page 92: Big Data: How the Information Revolution Is Transforming Our Lives

computersciencelessons,thosesameindividualsaren’ttaughthowdamagingtask-switchingandrandominformationgrazingcanbe.Theyneedhelptobeable tofocusandhandledatawell.Andwestill teach insubjects,expectingstudents to memorise far more than is necessary, purely to be able toregurgitate it in exams, rather than giving them the skills to interrogate andmanipulatebigdata,toseehowitisbeingmisusedandtogetthemostfromit.A big data examwould allowopen, read only internet access (to prevent itbeingusedforcommunication)–becauseitwouldbetestingtheirskills,nottheirmemory.

Ifwecan’timprovethewayweteachyoungpeopletodealwiththeimpactofbigdataweareopeningupthepathtothedarkside.

Thebad

IntheUK,despiteahabitofmoaningandadistrustofpoliticians,wetendtohaveanassumptionthatthestateisonourside.Thisispartofthereasonwhywe find it so difficult to understand the US attitude to gun control whichseems, in part, driven by the belief that the state cannot be trusted and thepeople need the ability to defend themselves against it. And there certainlyhave been many examples where countries have suppressed the populationand exerted undue control.Where the state has this potential, big data canrapidlybecomeatoolofsuppression.

Historicallyinmanytotalitarianregimes,thestatereliedonturningpeopleagainst each other, using the population as informers to keep their peers inplace.This isawastefuland inefficientmeansofcontrol.Largeamountsoftime are given to surveillance activities and a population that is in constantfear and lacks trust in friends and relations will always be operating tosubstandard levels.BigdatagivesBigBrother all the advantagesof theoldsurveillancestatewithfarfewerofthenegatives.

This could become a reality in China, where there is a far strongeracceptance of limited individual freedoms. The Chinese government isdevelopingabigdatasystemwhichisintendedtobuildautomaticdossiersonitscitizens,scoringthemonbothsocialandfinancialmarkerstoreflecttheirbehaviour, with rewards provided in the availability of state controlledservices–whichmeansprettymucheverything.Aswellasthemorefamiliarchecks on spending and creditworthiness, this systemwouldmonitorminoroffences such as jaywalking and fare dodging on public transport – orviolations of China’s family size restrictions. The result would be a‘trustworthiness’scorethatwouldfeeddirectlyintothewaythestatetreated

Page 93: Big Data: How the Information Revolution Is Transforming Our Lives

you,evenifyouweretryingtobookamealinarestaurant,orgetanonlinedate.

AsfarastheChinesestateisconcerned,suchasystemisnotaburden,butaroutetofreedom.Ifyouactinatrustworthymanner,thesystemwillrewardyou.Butact inaway thatdoesnot support a ‘harmonious socialist society’and you will suffer. Break trust in one place and you will be restrictedwhereveryougo.Thestatebelievessuchasystemisnecessarytogetontopofanout-of-controleconomywherefraudandbriberyiscommon,andalltoooftencompaniessellinediblefoodordangerousfakemedicines.Andthereisnodoubtthatbigdatashouldbeabletohelpwiththesekindsofproblems–butthescaleofthisproposedsystemwilltakeinfarmorethanthefraudsters.

This is Big Brother big data on a massive scale – too massive to bepracticalevenwithtoday’stechnology,givenChina’spopulationofover1.3billionpeople.Itisonlycurrentlybeingtrialledonasmallscale.Andshouldthe system ever go live there, it is hard to imagine that all the potentialproblemswithbaddataandpoorassumptionswillbepredictedandprevented.HereGIGOcouldentirelyruinpeople’slives.It’sascaryprospect.

This is why transparency is so important for big data systems used indemocracies,andwhyweneedtokeepstrictcontrolsonthelimitsallowedtogovernmentswhenitcomestomakinguseof thatdata,evenwhenthereareperfectlylegitimatereasons,suchaspreventionofterrorism.Therearespecialcircumstanceswherethegovernmentneedsaccesstodataaboveandbeyondnormal limits – but they should always be special cases, not the kind ofblanketcoveragethatbigdatamakesalltoopossible.

And we also need to see that there is a positive side to governmentinvolvementindata,turningtheaccessroundandgivingustheopportunitytofind out what we need about the state easily and freely. At the moment,freedomofinformationisfartoorestrictedandcostly.Withbigdataitshouldbetrivialforanyonetogainaccesstorelevantgovernmentinformation.Withtherightknowledge,wecantransformdemocracyandensurethatgovernmentandcorporationsaren’tmisusingthebigdataopportunity.

Thatonlyleavestheuglysideofthebusiness.Hackers.

Theugly

Itmightseemdifficulttodrawthelinesometimesbetweenacorporationouttofleeceyouandahacker.Forthatmatter,thereareplentyofexamplesthesedaysofstate-sponsoredhacking.Butnoonewantstheirlivestoberuinedbyhackerstakingcontroloftheirhomeortheirlife.

Page 94: Big Data: How the Information Revolution Is Transforming Our Lives

It’s great that big data means that, for example, my intelligent lightingsystemcanswitchon the lightswhenIarrivehome(seepage81).Nomorecoming back to a dark house. But it wouldn’t be amusing if hackers gotcontrol ofmy lights and started switching themoff and on atwhim. In theextremeyoucouldimaginethescenariofromtheTVshowMr.Robot,wherehackerstakeoverasmarthouseequippedwithbigdata-connectedtechnology,usingittodrivetheowneroutsothehackerscansquatinthehouse.

Thismightbea relatively trivialgoal ashackinggoes–but it illustratesjusthowmuchbigdata is starting towork itsway intoourhomes.And,ofcourse, even if you don’t have smart technology at home, it’s anuncomfortable feeling that the increasing use of big data on planes and inhospitalscouldleaveusopentodata-driventerrorattackswhenwearemostvulnerable.

As is the case with all terrorism opportunities, though, we can’t let thehackerswin.It’snoreasontowithdrawfrombigdata.Itmeansthatwehavetobevigilant,fightingthearmswarofimprovedsecurityonsystems.Andit’sallthemorereasontoeducateourselvesmoreaboutwhatbigdataisandhowweinteractwith it.One importantaspectof this ismakingsurewequestionthetwobig‘A’s.

Thebig‘A’s

That’salgorithmsandassumptions.Abigdatasystemisonlyasgoodasthealgorithmsusedtoaccessandmanagethatdata.Andthosealgorithmsdependontheirdesignersbeingabletomakeaccurateassumptionsabouttheusersofthesystemsandaccurateassumptionsaboutthedeductionsthatcanbedrawnfromthedata.

Assumptions are at the heart of most failures in decision making. Weassumethatwecangetacarthroughthetrafficlightsbeforetheychange.Weassume that we can catch a train with a short connection time. We makeassumptionsaboutapersonbasedontheirappearance.Wemakeassumptionsaboutwhat’spossibleandwhat’snotpossible–andtheseassumptionsgetinthewayofcreativityandnewideas.Andsimilarly,whenwewritealgorithmswe make assumptions about the limitations of the data and how it will beused.Andunlesstherearetheopportunitiesforcorrectionthatwe’vealreadydiscussed, those assumptions will get in the way of big data being usedproperly.

Thinkofatrivialexampleofassumptionsfindingtheirwayintocomputercode.Themillenniumbug.Whencomputersstartedtobecommonplaceinthe

Page 95: Big Data: How the Information Revolution Is Transforming Our Lives

1960s, the year 2000 seemed a longway off. To avoid dates taking up toomuch room, they were often stored as an offset from a starting date, withlimitedcapacity.Andmanysystemsassumedthat thedatewouldstart19.Itwasentirelypossiblethatcome2000,adatein2017,say,wouldbeassumedto be in 1917. That would result in incorrect data being output, or in aprogramcrashingwhen, for example, the ageof apersonborn in1973wascalculatedbysubtracting1973from1917.

Asithappens,inthisparticularcasefarmoreeffortwasprobablyputintothe prevention of system failures thanwas necessary. Checking and testingwas clearly important for, say, systems that kept planes in the air, but notnecessarilyforeverybitofbasicofficesoftware.However,thefactremained,theprogrammershadmadeanassumption–that2000wasfartoofarintothefuturetoworryabout–thatprovedamistake.

In part, then, the lesson here is that the designers of big data algorithmsshould,asmuchaspossible,maketheirassumptionsclearandtestthemoutasmuchaspossible.Therewillalwaysbesomeissuesthatslipthrough,whichiswheretheneedforfeedbackandcorrectioncomesin,butthereshouldalsobebetterthinkingthroughofconsequencesbeforeasystemgoeslive.

Don’tforgetknowledge

One final limitation is worth considering as we ponder the big data future.This approach isgreat forproviding information buthas limitationswhen itcomestoknowledge.TheIBMsystemWatsonfamouslywonaspecialversionoftheUSquizshowJeopardy!in2011.However,whenWatsongotquestionswrong, it got them wrong in ways that didn’t make any sense. So, forexample,itsanswertoaquestionlookingforaUScitywithspecificairportswas‘What isToronto???’Thebigdataapproach lackswhatcouldbecalledcommonsense.

Computer science professor Hector Levesque suggests that artificialintelligence driven by big data will always have problems with rareoccurrences–theso-called‘longtail’inaprobabilitydistribution.Theseareeventsthatareunlikelytooccur.Butwhereanyparticularsituationprobablywon’tarise,thechancesarethatsomeunexpectedoccurrence,forwhichthereis little existing data for a system to work with, will. Making use ofknowledge – in effect, common sense – is the best approach in suchcircumstances.Forexample, a self-drivingcarwithonlypastdata tohelp itmakedecisionsmaystruggleif,say,livestockfindsitswayontoamotorwayor the car encounters Swindon’s infamous ‘magic roundabout’ where five

Page 96: Big Data: How the Information Revolution Is Transforming Our Lives

separatesmallroundaboutscombinetoformasixthlargeone.It’s possible that for big data to go even further in supporting human

intelligencewewillhavetorevisitsomeoftheaspectsofknowledge-basedAIthathavebeensidelinedinpastdecades.Butwhetherornot this is thecase,thereislittledoubtthatbigdatawillhaveagrowinginfluenceonourlives.

Abigdatafuture

Wecantalkabouttheprosandconsofbigdata.Wecanworryaboutthebadandtheuglywhileappreciatingthegood.Butwehavetoacceptthatbigdataisnotgoingaway.Wecan’tput thegenieback in thebottle, soweneed tomakeourwisheswisely.

Bigdatacanbringbenefitstoallofus,aslongasweeducateourselvestounderstandanddealwithit,andensurethatalgorithmsaretransparentandnotdesignedinawaythatlockspeopleintodecliningspiralswithnowayout.

Humanity has faced awhole series of developments that are two-edged.It’sthenatureofprogress.Firehasenabledustocookfood,vastlyincreasingour ability to eat safely, and to warm our homes – but misused, it can bedeadly. Physics has given us unparalleled knowledge of how the universeworks and fantastic electronics-based technology – but it also gives us theabilitytodestroyourselvesfareasierthaneverbefore.

Thereisnopointturningourbacksontheworldandsaying‘Iwishnoneofthisexisted’.Itdoes.Wehavebigdata.Wecanmakeabrilliantnewlifewith it.Butwewon’tmanage todo that ifwe ignore the issuesandpretendthatwecanleaveallthedecisionstothetechies.

Thebigdatafuturecanbebright–aslongaswewalkintoitwithoureyesopen.

Page 97: Big Data: How the Information Revolution Is Transforming Our Lives

FURTHERREADING

Chapter1:Weknowwhatyou’rethinkingNETFLIXANDBIGDATA–Streaming,Sharing,Stealing,MichaelD.SmithandRahulTelang(M.I.T.

Press,2016)PRECRIME–TheMinorityReport,shortstoryappearinginMinorityReport,PhilipK.Dick(Gollancz,2002)SCIENTIFICSTUDYOFPRECOGNITION–ExtraSensory,BrianClegg(StMartin’sPress,2013)

Chapter2:SizemattersOURDEPENDENCEONPATTERNS–DiceWorld,BrianClegg(IconBooks,2013)UKLOTTOSTATISTICS–

availableatwww.lottery.co.uk/lotto/statisticsBLACKSWANS–TheBlackSwan,NassimNicholasTaleb(Penguin,2008)PLR–moreinformationatwww.plr.uk.comPSYCHOHISTORYANDHARISELDON–FoundationTrilogy,IsaacAsimov(Everyman,2010)MassObservationinformationcanbeaccessedviawww.massobs.org.ukALGORITHMS–AlgorithmstoLiveBy,BrianChristianandTomGriffiths(WilliamCollins,2016)

Chapter3:ShoptillyoudropYoucanfindthe£999.10copyoftheMSDOS6.0bookat

www.amazon.co.uk/exec/obidos/ASIN/0750609990/491FLASHCRASH–AlgorithmstoLiveBy,BrianChristianandTomGriffiths(WilliamCollins,2016)

TARGETEDADVERTISING–BigData,TimandraHarkness(BloomsburySigma,2016)

Chapter4:FuntimesHISTORYOFTHEINTERNET–WhereWizardsStayUpLate,KatieHafnerandMatthewLyon(Simon&

Schuster,1998)NETFLIXANDBIGDATA–Streaming,Sharing,Stealing,MichaelD.SmithandRahulTelang(M.I.T.Press,2016)E-BOOKSALESSTRATEGIES–Streaming,Sharing,Stealing,MichaelD.SmithandRahulTelag(MITPress,2016)PREDICTINGABESTSELLER–TheBestsellerCode,JodieArcherandMatthewJockers(AllenLane,2016)SPEECHGENERATIONANDRECOGNITION–TenBillionTomorrows,BrianClegg(St.Martin’sPress,2015)SOCIALMEDIAIMPACTONBEHAVIOUR–TheCyberEffect,MaryAiken(JohnMurray,2016)

SOCIALMEDIAIMPACTONCONCENTRATION–TheDistractedMind,AdamGazzaleyandLarryD.Rosen(MITPress,2016)

Chapter5:SolvingproblemsRANKINGUNIVERSITIES–WeaponsofMathDestruction,CathyO’Neil(AllenLane,2016)TheShockwaveRider,JohnBrunner(Gateway,2014)

GIGECONOMYNAMEDISCRIMINATION–‘AreEmilyandGregmoreEmployablethanLakishaandJamal?AFieldExperimentonLaborMarketDescrimination’,MarianneBertrandandSendhilMullainathan(inAmericanEconomicReview,vol.94,no.4[September2004],pp.991–1013)

Chapter6:BigBrother’sbigdata1984,GeorgeOrwell(PenguinClassics,2004)

Page 98: Big Data: How the Information Revolution Is Transforming Our Lives

TEACHEREVALUATIONS–WeaponsofMathDestruction,CathyO’Neil(AllenLane,2016)SPURIOUSCORRELATIONS–www.tylervigen.com/spurious-correlationsandwww.correlated.orgPREDICTIVEPOLICING–TheCyberEffect,MaryAiken(JohnMurray,2016)

Chapter7:Good,badanduglyKNOWLEDGEANDCOMMONSENSE–CommonSense,theTuringTest,andtheQuestforRealAI,Hector

J.Levesque(MITPress,2017)

Page 99: Big Data: How the Information Revolution Is Transforming Our Lives

INDEX

7-Eleven150ShadesofGrey1,2198412001:ASpaceOdyssey1

Aabacus1addictivebehaviour1advertising1

targeted1Airbnb1airlineoverbooking1airports,security1al-Khwarizmi1Alexa1,2algorithms1,2,3,4

definition1quality1

AltaVista1Amazon

Echo1,2,3andinformationsharing1Kindle1,2Marketplace1physicalstores1Prime1,2,3retailapp1sellershopbots1

AmericanAirlines1anonymisation1AOL1Apple1,2,3

cloudservice1andinformationsharing1Pay1,2Siri1,2,3,4

Archer,Jodie1ARPANet1

Page 100: Big Data: How the Information Revolution Is Transforming Our Lives

Asimov,Isaac1assumptions1astronomy1astrophysics1AustralianBotanicGardens1

Page 101: Big Data: How the Information Revolution Is Transforming Our Lives

BBabbage,Charles1bananas1banking1barbecues1Bates,James1Benn,Tony1Bentonville1BernersLee,Tim1,2,3bestsellers1BigBrother1,2,3,4‘billsofmortality’1Bing1biology1biometrics1bits1blackholes1BlackMirror1TheBlackSwan1blackswans1,2,3BockettsFarm1Bolton1BookBrain1botshoppers1brainprobes1brandawareness1Brexit1,2bribery1Brighton1BritishAirways1broadcasting1,2Brooker,Charlie1Brown,Dan1Brunner,John1bytes1

Page 102: Big Data: How the Information Revolution Is Transforming Our Lives

Ccameras1,2,3,4capitalpunishment1CarnegieMellonUniversity1cars

insurance1performanceandreliabilitydata1self-driving1,2tracking1

cashlesssociety1cassettes1causality1,2,3CCTV1,2,3CDs1censuses1,2,3

firstUK1CERN1,2,3,4,5chaoticsystems1,2,3,4China1cholera1claytablets1Clifford,Tim1coffeeshops1Collins,Victor1ColumbiaUniversity1commissions1commonsense1comparisonshopping1CompStat1CompuServe1computerchip,transistorsin1computerprograms1computerstorage1computingengines1concentration1confidencelevels1cookies1correlation1,2,3Cortana1couriers1crawlers1creditratingagencies1creditscoring1crimeprediction1crimereporting1

Page 103: Big Data: How the Information Revolution Is Transforming Our Lives

CRM1TheCrown1Cruise,Tom1customerrelationshipmanagement1customerservice1

productivityin1cyclicvariations1

Page 104: Big Data: How the Information Revolution Is Transforming Our Lives

Ddata

definition1evolution1andfact1

dataprocessing1David,King1debtmanagement1deduction1,2DeepMindgroup1Deliveroo1Delphiprinciple1democracy1,2

representative1diarists1Dick,PhilipK.1dictation1dopamine1Drant,T.1driving,improving1DVDs1,2,3

Page 105: Big Data: How the Information Revolution Is Transforming Our Lives

Eebooks1education

changingsystem1seealsoschools;universities

elections1general1,2USpresidential1

electricitymeters1email1EncyclopaediaBritannica1errorbars1errorcorrection1,2EuropeanUnion,UKreferendumon1,2Eurostar1evidencegathering1

Page 106: Big Data: How the Information Revolution Is Transforming Our Lives

FFacebook1,2,3,4facialrecognition1,2

andsocialmedia1fact,dataand1fakenews1,2feedback1,2Fibonacciseries150ShadesofGrey1,2Filofax1Fincher,David1‘FindmyFriends’app1fingerprintrecognition1fire1Fitbit1fitnessdata1FlatironInstitute1forecasting1,2Foundationseries1fraud1,2,3freedomofinformation1fundraising1future,applicationofdatato1FutureShock1

Page 107: Big Data: How the Information Revolution Is Transforming Our Lives

Ggalaxies,formation1generalelections1,2gigeconomy1gigabytes1GIGO1,2,3,4Google1,2,3,4

DeepMindgroup1index1,2andinformationsharing1rankingalgorithm1translationsoftware1

GoogleMaps1GoonhillyDowns1Graunt,John1GreenShieldStamps1guncontrol1

Page 108: Big Data: How the Information Revolution Is Transforming Our Lives

Hhackers1hardbacks1HariSeldoncomplex1Harkness,Timandra1HarryPotter1,2heartdisease1,2Herod,King1Herschel,John1Herschel,William1Higgsboson1,2Hollerith,Herman1Hollerithstring1homeautomationsystems1HouseofCards1humangenomeproject(HGP)1HunchLab1hunter-gathererlifestyle1hyperlinks1

Page 109: Big Data: How the Information Revolution Is Transforming Our Lives

IIBM1

Watson1immersion1IMPACT1induction1,2,3inflation1information1,2

freedomof1informationgrazing1informationprocessor1informationtechnology1,2Instagram1insurance1,2,3intelligentagents1InternationalBusinessMachines1

Watson1internet1,2

asinformationsource1interviews1intrusion1,2Iraq1iTunes1,2IvyLeague1iWatch1

Page 110: Big Data: How the Information Revolution Is Transforming Our Lives

JJacquardloom1Jeopardy1JesusChrist1jobrecruitment1Jockers,Matthew1JustinTime(JIT)1

Page 111: Big Data: How the Information Revolution Is Transforming Our Lives

KKentPolice1,2kidneydisease1kilobytes1Kindle1,2Kleinrock,Len1knowledge1,2knowledge-basedsystems1Kodak1

Page 112: Big Data: How the Information Revolution Is Transforming Our Lives

LLargeHadronCollider(LHC)1Levesque,Hector1LHC1libraries1lifeinsurance1lights1,2,3,4LinkedIn1loans1

payday1,2London1,2

choleraoutbreak1coffeehouses1driversin1nightoutin1Oystercard1,2,3

‘longtail’1LosAngeles1lotteries1Lotto1loyaltycards1LPs1

Page 113: Big Data: How the Information Revolution Is Transforming Our Lives

MMacHypercard1‘magicroundabout’1Mars,‘faceon’1MassObservation1MassachusettsInstituteofTechnology(MIT)1MathematicaPolicyResearch1medicaldata1medicalstudies1medicine1

evidence-based1scienceand1

Mediterraneandiet1Merricks,Edward1Mesopotamia1messageboards1MetOffice1metastudies1Microsoft1,2

Bing1millenniumbug1MinorityReport1MIT1mobilephones1

unlocking1seealsosmartphones

Mondex1,2Moore’sLaw1Mr.Robot1,2music1

streaming1

Page 114: Big Data: How the Information Revolution Is Transforming Our Lives

NNapster1Nelson,Ted1Netflix1,2,3,4,5NewYorkTimesbestsellerlist1NewnesMS-DOS6.0PocketBook119841number,concept1

Page 115: Big Data: How the Information Revolution Is Transforming Our Lives

Ooligarchy1oliveoil1on-demandservices1,2O’Neil,Cathy1operationalresearch(OR)1opinionpolls1,2Orwell,George1overbooking1Oystercard1,2,3

Page 116: Big Data: How the Information Revolution Is Transforming Our Lives

Ppaperbacks1pareidolia1particlephysics1passports1past,datafrom1patientportals1patterns1paydayloans1,2PayPal1Pepys,Samuel1Percolata1performanceevaluation1personalcomputers1personalityprofiles1,2phone-basedpaymentsystems1photograph,animationof1physics1,2PIN1pioneersofbigdata1piracy1,2,3precognition1PredPol1pregnancyrates1present,datafrom1pricing,variable1productivity1programmers,specialist1‘proxies’1psychohistory1PublicLendingRight(PLR)payment1publishing1,2punchedcards1,2pyramidofunderstanding1

climbing1

Page 117: Big Data: How the Information Revolution Is Transforming Our Lives

Rrailwaytimetables1RANDCorporation1randomness1ranges1rareoccurrences1ratingsystems1recognitionsoftware1recruitment1referenda1,2regionalidentity1representativedemocracy1responsibility1,2Rhee,Michelle1rosters1RoyalFreeHospital1

Page 118: Big Data: How the Information Revolution Is Transforming Our Lives

SSabrebookingsystem1‘saleorreturn’1salesforecasts1,2sampling1,2SATscores1Satchu,Asif1schools,ranking1science,medicineand1screeningtests1searchengineoptimisation1searchengines1,2,3

seealsoAltaVista;Bing;Googleseasonality1,2security1,2,3,4self-deception1self-drivingcars1,2self-employment1,2series,streaming17-Eleven1TheShockwaveRider1shopbots1,2shopping1,2,3,4,5Simons,JamesH.1Siri1,2,3,4smartmeters1smartthermostats1smartcards1smartphones1,2,3

seealsomobilephonesSnow,John1socialmedia

benefits1facialrecognitionand1frequencyofchecking1,2impact1,2andjobselection1

socialnetworks1,2scale1

SocietyofAuthors1Soho1Spacey,Kevin1speechrecognitionsystems1Spider-Maneffect1Spotify1

Page 119: Big Data: How the Information Revolution Is Transforming Our Lives

Starbucks1statistics,wordorigin1stockmarkets1

AIs1,2studentperformanceevaluation1supermarkets1supernovas1supportgroups1surveillance1,2,3surveillancesociety1Sweden1Swindon1,2,3,4

Page 120: Big Data: How the Information Revolution Is Transforming Our Lives

TTabulatingMachineCompany1tabulators1Taleb,NassimNicholas1‘tapandpay’cardsystems1targetedadvertising1task-switching1,2teacherperformanceevaluation1terabytes1terrorism

data-drivenattacks1prevention1

Tesco1Teslaelectriccars1thermostats1timewasters1Toffler,Alvin1totalitariancontrol1,2tracking1traintickets1trainspotters1,2translationsoftware1transparency1,2,3,4,5TransportforLondon1,2travelwebsites1trends1TripAdvisor1,2Trump,Donald1trustworthiness1Twitter1,2,32001:ASpaceOdyssey1

Page 121: Big Data: How the Information Revolution Is Transforming Our Lives

UUber1,2,3,4UberEats1Uniqlo1UnitedStates

guncontrol1presidentialelections1

universitiesadmissions1fees1fundraising1ranking1

Uppsala1upscaling1Uruk1U.S.Newslist1,2

Page 122: Big Data: How the Information Revolution Is Transforming Our Lives

Vvideoanalytics1videostreaming1,2Viking1probe1vonLiszt,Franz1

Page 123: Big Data: How the Information Revolution Is Transforming Our Lives

WWashingtonDC1Watson1WeaponsofMathDestruction1wearabletechnology1weatherforecasts1,2,3

long-term1wellingtonbootsales1,2TheWestWing1Wiczyk,Mordecai1Wikipedia1Windows951‘wisdomofthecrowd’1witnessevidence1Wonga1WorldWideWeb1,2,3Writers’Exchange1writing1Wysocki,Sarah1,2

Page 124: Big Data: How the Information Revolution Is Transforming Our Lives

YYellowPages1YouGov1

Page 125: Big Data: How the Information Revolution Is Transforming Our Lives

ABOUTTHEAUTHOR

BrianClegg’smost recent books areTheRealityFrame (Icon, 2017),WhatColouristheSun(Icon,2016)andTenBillionTomorrows(StMartin’sPress,2016).HisDiceWorldandABriefHistoryofInfinitywerebothlonglistedfortheRoyal Society Prize for ScienceBooks.Brian haswritten for numerouspublications includingTheWallStreetJournal,Nature,BBCFocus,PhysicsWorld,TheTimes,TheObserver,GoodHousekeepingandPlayboy.Brianiseditorofpopularscience.co.ukandblogsatbrianclegg.blogspot.com.

www.brianclegg.net

Page 126: Big Data: How the Information Revolution Is Transforming Our Lives

OtherHotSciencetitlesavailablefromIconBooks

DestinationMars:TheStoryofOurQuesttoConquertheRedPlanet

Page 127: Big Data: How the Information Revolution Is Transforming Our Lives

COPYRIGHT

PublishedintheUKin2017byIconBooksLtd,OmnibusBusinessCentre,

39–41NorthRoad,LondonN79DPemail:[email protected]

www.iconbooks.com

SoldintheUK,EuropeandAsiabyFaber&FaberLtd,BloomsburyHouse,

74–77GreatRussellStreet,LondonWC1B3DAortheiragents

DistributedintheUK,EuropeandAsiabyGranthamBookServices,

TrentRoad,GranthamNG317XQ

DistributedintheUSAbyPublishersGroupWest,

1700FourthStreet,Berkeley,CA94710

DistributedinAustraliaandNewZealandbyAllen&UnwinPtyLtd,

POBox8500,83AlexanderStreet,CrowsNest,NSW2065

DistributedinSouthAfricabyJonathanBall,OfficeB4,TheDistrict,41SirLowryRoad,Woodstock7925

DistributedinIndiabyPenguinBooksIndia,7thFloor,InfinityTower–C,DLFCyberCity,

Gurgaon122002,Haryana

DistributedinCanadabyPublishersGroupCanada,76StaffordStreet,Unit300Toronto,OntarioM6J2S1

ISBN:978178578-249-7

Textcopyright©2017IconBooksLtd

Theauthorhasassertedhismoralrights.

Nopartofthisbookmaybereproducedinanyform,orbyanymeans,withoutpriorpermissioninwritingfromthepublisher.

TypesetinIowanbyMarieDoherty

PrintedandboundintheUKbyClaysLtd,StIvesplc

Page 128: Big Data: How the Information Revolution Is Transforming Our Lives

HotScienceisanewseriesexploringthecuttingedgeofscienceandtechnology.WithtopicsfromMartianexplorationandbigdatatoblackholes,astrobiology,darkmatterandepigenetics,thesearebooksforpopularscience

readerswholiketogothatlittlebitdeeper…

Availablenowandcomingsoon

DestinationMarsAstrobiology

GravitationalWaves

iconbooks.com/hotscience

Page 129: Big Data: How the Information Revolution Is Transforming Our Lives

ALSOAVAILABLE

WhentheApolloastronautswalkedontheMoonin1969,manypeopleimaginedMarswouldbenext.Halfacenturylater,onlyrobotshavebeento

theRedPlanetandourastronautsrarelyventurebeyondEarthorbit.

Now,Marsisback.WitheveryonefromElonMusktoRidleyScottandDonaldTrumptalkingaboutit,interplanetaryexplorationisbackontheagendaandMarsisonceagaintheprimedestinationforfuturehuman

expansionandcolonisation.

InDestinationMars,astrophysicistandsciencewriterAndrewMaytracesthehistoryofourfascinationwiththeRedPlanetandexploresthescienceuponwhichacrewedMarsmissionwouldbebased,fromassemblingaspacecraftinEarthorbittosurvivingsolarstorms.WithexpertinsightheanalysesthenewspaceraceandassesseswhatthefutureholdsforhumanlifeonMars.

ISBN9781785782251(paperback)/9781785782268(ebook)