How to Handle and Analyse Large Datasets
BENVGEE7 'Methods of Environmental Analysis'
Ed Sharp
21st February 2012

Introduction

• Me
– BSc Geography
– Worked at SABSCO Ltd, a niche power station construction contractor
– MSc GIS
– MRes Energy Demand Studies
– PhD: The spatiotemporal patterns of energy demand and supply in the UK
• Recent interest and research into large datasets, including a major piece of research into the effects of disparate, inaccurate datasets on energy demand forecast models
• Email: [email protected]
• Web
– LinkedIn: http://www.linkedin.com/pub/ed-sharp/43/2b4/b1b
– UCL: http://www.bartlett.ucl.ac.uk/energy/people/students/ed-sharp
– LoLo: http://www.lolo.ac.uk/profilepreview/view/id/102

Today's Lecture

• Three distinct sections

1. Theory: Describe how to handle and analyse large datasets

2. Practice: Run an exercise outlining some pervasive issues

3. Case Study: Demonstrate these within the context of some existing research

• Slides are available on Moodle with web and literature references in full; colour denotes the section.

Part 1: What is a large dataset?

Two types:

• Large volumes of data
– Millions of entries
– Many terabytes
– Computationally intensive
– Past 10 years x 1m

• Varied sources of data
– Same variables
– Different sources
– Separate set of issues causing problems with handling and analysis

Some issues are common to both types; others are specific to one.

Examples

• Volumes
– Census (http://census.ac.uk/)
– Home Energy Efficiency Database (HEED, http://www.energysavingtrust.org.uk/Professional-resources/Existing-Housing/Homes-Energy-Efficiency-Database)
– Time series datasets, e.g. energy production/consumption
– Remotely sensed data
– Geographic datasets
– Climate reanalyses

• Sources
– Population
– Economic variables (GDP, GVA etc.)
– Socio-demographic variables (population, employment etc.)

Sources including repositories and search engines:

• Data.gov: www.data.gov.uk
• GoGeo: www.gogeo.ac.uk
• ShareGeo: www.sharegeo.ac.uk
• Eurostat: http://epp.eurostat.ec.europa.eu/portal/page/portal/eurostat/home/
• IEA: www.iea.org
• National Statistics: www.statistics.gov.uk
• Odyssee: http://www.odyssee-indicators.org/
• OECD: www.oecd.org
• UNECE: www.unece.org
• World Bank: www.worldbank.org
• ADS, Archaeology Data Service: archaeologydataservice.ac.uk
• BADC, British Atmospheric Data Centre: badc.nerc.ac.uk
• BODC (oceanographic): www.bodc.ac.uk
• CDS, Chemical Database Service: cds.dl.ac.uk
• EBI, European Bioinformatics Institute: www.ebi.ac.uk
• ESDS, Economic and Social Data Service: www.esds.ac.uk
• NCDR, National Cancer Data Repository: www.ncin.org
• NGDC, National Geo-science Data Centre: www.ngdc.noaa.gov
• UKSSDC, UK Solar System Data Centre: www.ukssdc.ac.uk
• Office for National Statistics: www.ons.gov.uk
• UK Data Archive (UKDA): www.data-archive.ac.uk
• Casweb (census): casweb.mimas.ac.uk
• DFT: www.dft.gov.uk
• EEA: www.eea.europe.eu
• World Energy Council: www.worldenergy.org
• Florida Solar Energy Centre: www.fsec.ucf.edu/
• EDINA: edina.ac.uk
• Mapcruzin: www.mapcruzin.com
• Guardian datastore: www.guardian.co.uk/data
• London Air Quality Network: www.londonair.org.uk
• OpenStreetMap: www.openstreetmap.org
• UK Borders: edina.ac.uk/ukborders
• Met Office: www.metoffice.gov.uk
• DECC: www.decc.gov.uk
• Etc.

Highlighted examples should be the most relevant to EDE.

Poll: Has anyone used “large datasets” before?

1. Yes
2. No

Responses, in chart order: 88%, 12%

Poll: Does anyone think they will use it in the future?

1. Yes
2. No
3. Don’t know

Responses, in chart order: 44%, 19%, 38%

Likely encounters

• Access is predominantly through the web
• Some may require sign-in through the university
• Fees are sometimes waived for academic use (always worth asking)
• Verify copyright and licensing
• Used in
– Research
– Modelling
– Pervasive in the environmental domain
– Property
– Finance
• Volume and complexity are increasing (e.g. Facebook, Flickr)
• McKinsey concluded that the analysis of this kind of dataset will become increasingly important in influencing business decisions, therefore skills in this area will be valuable

McKinsey: “Big data: The next frontier for innovation, competition, and productivity” Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation

Storage

• Very large datasets require their own servers, especially those which require security, e.g. HEED and OpenStreetMap
• Parallel storage allows download simultaneously with simulation, visualisation and analysis
• Hardware development means all but the very biggest can be stored and transported on portable hard drives
• Most can be downloaded via the internet or, in special cases, requested on a CD (e.g. Ordnance Survey MasterMap)
• Effective backup is necessary, especially once analysis begins
• Bespoke data architecture exists (e.g. financial databases)
• This primarily requires knowledge of SQL
• Most data that you encounter will be accessible through some sort of graphical interface
– Example on next slide

Figure: graphical interface and SQL script
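To give a feel for what an SQL script does behind a graphical interface, here is a minimal sketch using Python's built-in sqlite3 module. The table name `buildings`, its columns and the rows are invented for illustration; they are not taken from HEED or any other real database.

```python
import sqlite3

# Hypothetical example rows: (building id, sector, floor area in square metres)
ROWS = [
    (1, "Education", 1200.0),
    (2, "Health", 800.0),
    (3, "Education", 450.0),
    (4, "Retail", 300.0),
]


def total_area_by_sector(rows):
    """Load rows into an in-memory database and sum floor area per sector."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE buildings ("
            "id INTEGER PRIMARY KEY, sector TEXT, floor_area_m2 REAL)"
        )
        conn.executemany("INSERT INTO buildings VALUES (?, ?, ?)", rows)
        query = (
            "SELECT sector, SUM(floor_area_m2) "
            "FROM buildings GROUP BY sector ORDER BY sector"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for sector, area in total_area_by_sector(ROWS):
        print(sector, area)
```

The same SELECT ... GROUP BY statement is roughly what a point-and-click "summarise by sector" option would issue on your behalf.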

Software and data format

• Use whatever you are comfortable with
• Excel is OK for the majority of operations and good graphically
– Limited to 1,048,576 rows and 16,384 columns per worksheet (beware when importing data)
• For larger datasets or more sophisticated operations consider a statistical package
– SAS is very good for large datasets but requires programming skill
– SPSS is almost as powerful, with a better interface
– Works well in conjunction with Field (2009)
• Microsoft Access allows handling of large, complicated databases
• All of these are available through cluster machines or for home use from http://www.ucl.ac.uk/isd/common/software
• Alternatives include: R, Mathematica, Statistica and RapidMiner

Formats

• Excel (.xls, .xlsx)
• Access (.mdb, .dbf)
• SAS and SPSS have proprietary formats but can export to Excel
• A common format used for exchange is comma-separated values (.csv, .txt)
• Others include: XML (machine readable), CDF (NASA), NeXus, OpenMath, PDS, SAIF, SDTS, VICAR etc. (these require some kind of specialist knowledge)

Field, A. P. 2009. Discovering statistics using SPSS, SAGE Publications Ltd.
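Before opening a comma-separated file in Excel it can help to check whether it will fit. The sketch below streams the file once with Python's built-in csv module, so it never holds the whole dataset in memory. The limits are those of Excel 2007 and later (1,048,576 rows by 16,384 columns per worksheet); the sample data and column names are made up.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit, Excel 2007 and later
EXCEL_MAX_COLS = 16_384     # worksheet column limit, Excel 2007 and later


def fits_in_excel(lines):
    """Stream CSV lines once and return (row count, widest row, fits?)."""
    n_rows = 0
    n_cols = 0
    for row in csv.reader(lines):
        n_rows += 1
        n_cols = max(n_cols, len(row))
    fits = n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS
    return n_rows, n_cols, fits


if __name__ == "__main__":
    # Hypothetical sample; for a real file use: open(path, newline="")
    sample = io.StringIO(
        "year,region,demand_gwh\n2010,North,12.5\n2010,South,14.1\n"
    )
    print(fits_in_excel(sample))
```

If the check fails, that is a signal to move to a database or a statistical package rather than to truncate the data on import.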

Data Handling: First steps

1. Metadata
– Data about data
– Attached in different ways
– Varies in form and content
– Should follow standards, e.g. INSPIRE http://inspire.jrc.ec.europa.eu/

2. Identify methods of collection
– Are these uniform across data sources?
– May require reading supporting documentation

3. Identify contributors
– Are they reliable?

4. Identify alternative sources
– Case study will show that divergence is possible

Data Handling: Second steps

5. Identify data gaps
– First do this visually
– Genuine gaps should not skew subsequent analysis
– If a gap has been replaced by, for example, NULL or 0.0 it may cause problems and should be investigated
– If several datasets are used, the treatment of gaps should be harmonised
– Follow a convention that is obvious to you and acceptable to the software

6. Identify duplicates
– More than one value for a data point
– Possibly valid
– E.g. shortened labels can falsely group values
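Steps 5 and 6 can be partly automated. The sketch below is one possible approach in plain Python: it flags placeholder values that stand in for gaps, flags zeros for manual inspection, and lists keys that appear more than once. The list of placeholder markers, the field names and the records are assumptions made for illustration.

```python
from collections import Counter

# Placeholders a source might use instead of leaving a gap empty (assumed list)
MISSING_MARKERS = {"", "NULL", "NA", "N/A", "-"}


def find_gaps(records, field):
    """Return indices of records whose field is absent or a placeholder."""
    gaps = []
    for i, rec in enumerate(records):
        value = rec.get(field)
        if value is None or str(value).strip().upper() in MISSING_MARKERS:
            gaps.append(i)
    return gaps


def find_zeros(records, field):
    """Return indices where the field is zero: a real value or a disguised gap?"""
    return [i for i, rec in enumerate(records)
            if rec.get(field) in (0, 0.0, "0", "0.0")]


def find_duplicates(records, key_fields):
    """Return each key (as a tuple) that occurs in more than one record."""
    counts = Counter(tuple(rec.get(f) for f in key_fields) for rec in records)
    return [key for key, n in counts.items() if n > 1]


if __name__ == "__main__":
    records = [  # hypothetical data
        {"region": "North", "year": 2010, "demand": 12.5},
        {"region": "South", "year": 2010, "demand": "NULL"},
        {"region": "North", "year": 2010, "demand": 12.9},
        {"region": "East", "year": 2010, "demand": 0.0},
    ]
    print(find_gaps(records, "demand"))
    print(find_zeros(records, "demand"))
    print(find_duplicates(records, ["region", "year"]))
```

The functions only report candidates. Whether a duplicate is valid, or a zero is genuine, still has to be decided by reading the metadata.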

Data Handling: Second steps continued

7. Note precision
– Data should be stored at a reasonable precision
– For example: beware of the dataset that tries to depict population to the nearest person
– Harmonise between datasets
– Can affect comparability to other data

8. Identify spurious data
– Many rows and columns may not be needed
– Discard to make analysis simple
– Note changes
– Keep copies of the original

9. Harmonise headings
– Ensure that they make sense to you and the software
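Steps 7 and 9 lend themselves to small helper functions. This is a minimal sketch, not a prescribed convention: the heading style (lower case with underscores) and the rounding step are choices made here for illustration, and the population figure is invented.

```python
import re


def harmonise_heading(heading):
    """Lower-case a heading and replace runs of other characters with '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", heading.strip().lower())
    return cleaned.strip("_")


def round_to_precision(value, step):
    """Round to the nearest multiple of step, e.g. step=1000 for thousands."""
    return round(value / step) * step


if __name__ == "__main__":
    print(harmonise_heading("Floor Area (m2)"))
    print(harmonise_heading("  GDP per capita "))
    # A hypothetical population count reported "to the nearest person"
    print(round_to_precision(61_792_341, 1000))
```

Applying the same two functions to every dataset before merging keeps headings and precision consistent across sources.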

Graphical representation and statistical analysis

• The above steps can be carried out by looking through the data
• However, techniques exist to automate them and therefore reduce time
• The first step in any analysis should be to create graphs
• These can reveal patterns as well as highlighting duplicates, gaps and errors
• After this is done it may be useful to clean your data again
• Excel is fine, but more complex and repeatable operations are available with other software and some programming

Some examples

• A simple graph

Tufte (1983) and McCandless (2009)

• Something more complex

• Some better looking examples

Statistical tests

• Another automated analysis technique is statistical
• Simple metrics such as the mean, median, mode and standard deviation are useful, as is looking at the distribution
• These can be combined in a box plot, conveying statistics graphically
• The t-test is also useful
• More sophisticated analysis is available through e.g. SPSS, GIS
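The simple metrics above, together with the five numbers a box plot draws (minimum, lower quartile, median, upper quartile, maximum), can be computed with Python's built-in statistics module. This is a sketch under assumptions: the data are invented, and the "inclusive" quartile method is one of several conventions, so other software may report slightly different quartiles.

```python
import statistics


def summarise(values):
    """Return simple metrics plus the five numbers shown in a box plot."""
    # Quartiles by the 'inclusive' method; other packages may use other rules
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
    }


if __name__ == "__main__":
    sample = [2, 4, 4, 5, 7, 9, 11]  # hypothetical measurements
    for name, value in summarise(sample).items():
        print(name, value)
```

A mean that sits far from the median, or a maximum far beyond the upper quartile, is often the first hint of an error or a disguised gap in the data.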

Advanced analysis, simulation and visualisation

• These methods vary based on purpose and available data
• If you have purely statistical intentions then something like SPSS or SAS is ideal, especially in conjunction with Field (2009)
• A multitude of tests exist which will suit your needs; beware that these depend on data type, collection etc.
• The internet, along with books and lecturers, is a good source for deciding which to choose
• GIS is a good program for visualisation, provided that you have spatially related data
• Some examples of output that I have produced are on the next slide; again there is an abundance of web and literature resources

GIS

Part 2: Exercise

• Attempt to calculate the floor area of Central House (this building) in pairs
• Stay in the room but use whatever techniques you have at your disposal
• No use of the internet (it will be obvious)
• Write your answer down on a piece of paper
• 10 minutes
• Be prepared to answer some questions using the poll system
• We will declare a floor area champion at the end

Poll: What units did you use?

1. Acres
2. Hectares
3. Square mile
4. Square kilometre
5. Square metre
6. Square foot

Responses, in chart order: 0%, 0%, 0%, 100%, 0%, 0%

Why?

• Although the standard is m², you should not assume that data you are given uses this standard
• Always check the metadata to confirm which units have been used
• Remember that data from the United States often does not use the metric system, and a large volume of data originates there
• Other units could well be correct, but ensure that you use the data properly
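Once the metadata has told you which unit a dataset uses, converting everything to square metres before analysis removes one source of divergence. The sketch below is one way to do it; the unit labels used as dictionary keys are hypothetical and would need to match whatever your metadata actually says.

```python
# Square metres per unit. 1 ft = 0.3048 m and 1 mile = 1609.344 m exactly.
SQUARE_METRES_PER_UNIT = {
    "m2": 1.0,
    "ft2": 0.3048 ** 2,
    "km2": 1_000_000.0,
    "hectare": 10_000.0,
    "acre": 4046.8564224,
    "mile2": 1609.344 ** 2,
}


def to_square_metres(value, unit):
    """Convert an area to square metres; reject units we do not recognise."""
    try:
        return value * SQUARE_METRES_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"Unknown area unit: {unit!r}") from None


if __name__ == "__main__":
    # Hypothetical records from sources that report in different units
    for value, unit in [(2.0, "hectare"), (5000.0, "ft2"), (0.5, "acre")]:
        print(value, unit, "=", round(to_square_metres(value, unit), 1), "m2")
```

Raising an error for an unrecognised unit is deliberate: silently assuming square metres is exactly the mistake the slide warns against.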

Poll: Did you include the basement in your calculations?

1. Yes
2. No

Responses, in chart order: 0%, 100%

Why?

• Floor area can be defined as usable floor area; in this case the basement is in use, but someone creating a larger database would not have this information
• This can cause divergence between real data and the data you are provided with
• Check the metadata
• And, if necessary, check at source

Poll: Did you attempt to subtract the floor area of interior walls?

1. Yes
2. No

Responses, in chart order: 100%, 0%

Why?

• Alongside different ways of defining floor area (semantics), there are different ways of calculating it
• A dataset may have been derived from an Ordnance Survey outline, which would include interior walls, or from a building survey, which would not
• Neither is wrong, but transparency is essential

Poll: How many floors did you allow for?

1. 3 floors
2. 4 floors
3. 5 floors
4. 6 floors
5. 7 floors
6. 8 floors
7. 9 floors
8. More

Responses, in chart order: 0%, 0%, 0%, 16%, 0%, 42%, 16%, 26%

Why?

• The correct number is eight but this may not be clear from plans

• Is the basement included in this?

Poll: Did you allow for the light well in the centre of the building?

1. Yes
2. No

Responses, in chart order: 71%, 29%

Why?

• One method of calculating this would be to work out the area of the bottom floor and multiply it by the number of floors
• If you were unaware of the gap this may skew the result
• This type of error is common not only in floor area calculations but in others that you may come across
• It is important to investigate and understand these sources of error

Poll: What was your final answer in metres squared?

1. 0 – 750
2. 750 – 1500
3. 1500 – 2250
4. 2250 – 3000
5. 3000 – 3500
6. 3500 – 4000
7. 4000 – 4500
8. 4500 – 5000
9. More

Responses, in chart order: 11%, 0%, 0%, 0%, 21%, 26%, 11%, 5%, 26%

Conclusion

• The “real” answer was 3,658 m²
– 39,376 sq ft
– 0.003658 km²
– 0.903949 acre
– 0.365815 hectare
– 0.001412 mile²
• Interestingly there is no DEC here, so the figure is taken from the internet
• Different ways of defining the floor area have been used here, as is the case for real datasets
• The reality is that the data you have created is probably as good an estimate of the floor area as is publicly available
• Errors would be multiplied if applied to, for example, the whole country, which is “a large dataset”

Part 3: Research Case Study: Assessing the availability and quality of data for tertiary sector energy demand forecast models

• Large number of separate datasets
• Divergence responsible for error of up to 100%

Figure: Data sources (UK only)

Results - Classification schemes

NACE (Tertiary) | ISIC (Commercial)
Wholesale & Retail Trade; repair of motor vehicles and motorcycles | Wholesale and Retail Trade; Repair of Motor Vehicles, Motorcycles and Personal and Household Goods
Accommodation and food service activities | Hotels and Restaurants
Financial, insurance and real estate activities | Real Estate, Renting and Business Activities
Administrative and support service activities | Post and telecommunication, Financial Intermediation
Education | Education
Human health and social work activities | Health
Other NACE activities | Miscellaneous
 | Public administration and defence
 | Agriculture, Forestry and Fishery (as separate sub sectors)

NACE: Nomenclature statistique des Activités économiques dans la Communauté Européenne (Eurostat, 2008)
ISIC: United Nations International Standard Industrial Classifications (UNIDO, 2010)
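One quick way to see why the two schemes diverge is to compare their category labels directly. The sketch below uses the labels from the table above and checks which ones match once case and spacing are ignored. It is only an illustration of the mismatch, not an official concordance between NACE and ISIC, and the function name is made up.

```python
NACE_TERTIARY = [
    "Wholesale & Retail Trade; repair of motor vehicles and motorcycles",
    "Accommodation and food service activities",
    "Financial, insurance and real estate activities",
    "Administrative and support service activities",
    "Education",
    "Human health and social work activities",
    "Other NACE activities",
]

ISIC_COMMERCIAL = [
    "Wholesale and Retail Trade; Repair of Motor Vehicles, Motorcycles "
    "and Personal and Household Goods",
    "Hotels and Restaurants",
    "Real Estate, Renting and Business Activities",
    "Post and telecommunication, Financial Intermediation",
    "Education",
    "Health",
    "Miscellaneous",
    "Public administration and defence",
    "Agriculture, Forestry and Fishery",
]


def compare_labels(scheme_a, scheme_b):
    """Return (shared, only in a, only in b) after normalising the labels."""
    def normalise(label):
        return " ".join(label.lower().split())

    a = {normalise(label) for label in scheme_a}
    b = {normalise(label) for label in scheme_b}
    return a & b, a - b, b - a


if __name__ == "__main__":
    shared, only_nace, only_isic = compare_labels(NACE_TERTIARY, ISIC_COMMERCIAL)
    print("Shared labels:", sorted(shared))
    print("NACE only:", len(only_nace), "ISIC only:", len(only_isic))
```

Only one label matches exactly, so joining datasets on category names alone would silently drop or misassign most of the sector.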

Results - Floor space in the sector

Figure annotations: Entire non-domestic stock; “Tertiary sector”; All commercial and public buildings; Questionable difference

Results - Energy consumption in the sector

Figure annotations: Values from the ISIC scheme; Values from the NACE scheme; Declining range

Results - Population

Results - Employee numbers in the sector

Figure annotations: Values from the ISIC scheme; Values from the NACE scheme; Declining range

Same patterns as seen with the energy consumption data

Results - Gross Domestic Product

Clearly wrong (would this be obvious in isolation?)

Results - Gross value added

Figure annotations: Values from the ISIC scheme; Values from the NACE scheme

Conclusions

Research case study conclusions:

• Majority of error caused by lack of standard classification methodology
• Semantic differences exist but can be resolved
• Artefacts of harmonisation require care to eradicate
• Lack of transparency is pervasive
• Precision inextricably varies
• Variables with associated established methodology can be relied upon
• Many issues could be resolved through the setting up of a centralised repository
• Data is dangerous

Theory conclusions:

• Data exists in many and varied forms
• Handling and analysis skills will become increasingly important
• There is a set of standard steps which should be followed in an initial exploration of any dataset
• Foremost in your mind should be viewing a dataset critically
• Visualisation is key to understanding
• Graphs etc. are generally the best way of communicating information

References:

– Field, A. P. 2009. Discovering statistics using SPSS, SAGE Publications Ltd.
– Witten, I. H. & Frank, E. 2005. Data Mining: Practical machine learning tools and techniques, Morgan Kaufmann.
– McCandless, D. 2009. Information is beautiful, Collins.
– Tufte, E. R. & Howard, G. 1983. The visual display of quantitative information, Graphics Press, Cheshire, CT.
– McKinsey. 2011. Big data: The next frontier for innovation, competition, and productivity. Available from: http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
– Infrastructures, D. S. D. 2000. The SDI Cookbook. GSDI/Nebert. (for those interested in data infrastructure)
– See also slide detailing data sources