BIG DATA
Definition
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
How big?
ABC of BIG DATA
Analytics. This solution area focuses on providing efficient analytics for extremely large datasets. Analytics is all about gaining insight, taking advantage of the digital universe, and turning data into high-quality information, providing deeper insights about the business to enable better decisions.
Bandwidth. This solution area focuses on obtaining better performance for very fast workloads. High-bandwidth applications include high-performance computing: the ability to perform complex analyses at extremely high speeds; high-performance video streaming for surveillance and mission planning; and video editing and play-out in media and entertainment.
Content. This solution area focuses on the need to provide boundless secure scalable data storage. Content solutions must enable storing virtually unlimited amounts of data, so that enterprises can store as much data as they want, find it when they need it, and never lose it.
3 V’s of BIG DATA
Volume: Not only can each data source contain a huge volume of data, but also the number of data sources, even for a single domain, has grown to be in the tens of thousands.
Velocity: As a direct consequence of the rate at which data is being collected and continuously made available, many of the data sources are very dynamic.
Variety: Data sources (even in the same domain) are extremely heterogeneous both at the schema level regarding how they structure their data and at the instance level regarding how they describe the same real-world entity, exhibiting considerable variety even for substantially similar entities.
Examples
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations
Big data analysis played a large role in Barack Obama's successful 2012 re-election campaign.
eBay.com uses two data warehouses, at 7.5 petabytes and 40 PB, as well as a 40 PB Hadoop cluster for search, consumer recommendations, and merchandising (see “Inside eBay’s 90PB data warehouse”).
Amazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based, and as of 2005 Amazon had the world’s three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress.
Facebook handles 50 billion photos from its user base.
Big data Integration
A lot of data growth is happening around so-called unstructured data types, such as text, images, and social media posts. Big data integration is all about automating the collection, organization, and analysis of these data types.
The importance of big data integration has led to a substantial amount of research over the past few years on the topics of schema mapping, record linkage, and data fusion.
Structured data vs Unstructured data
Big data vs Traditional Data Integration
The number of data sources, even for a single domain, has grown to be in the tens of thousands.
Many of the data sources are very dynamic, as huge amounts of newly collected data are continuously made available.
The data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities.
The data sources are of widely differing quality, with significant differences in the coverage, accuracy, and timeliness of the data provided.
Schema Mapping
Schema mapping in a data integration system refers to
(i) creating a mediated (global) schema, and
(ii) identifying the mappings between the mediated (global) schema and the local schemas of the data sources, to determine which (sets of) attributes contain the same information.
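The two steps above can be sketched in a few lines of Python. The schema and attribute names below ("crm", "webshop", "full_name", and so on) are illustrative assumptions, not part of any real system: each local source's attributes are mapped onto a small mediated schema so that equivalent attributes line up.

```python
# Mediated (global) schema shared by all sources -- an illustrative example.
MEDIATED_SCHEMA = ["name", "email", "city"]

# Mappings from each hypothetical local source schema to the mediated schema.
SOURCE_MAPPINGS = {
    "crm": {"full_name": "name", "mail": "email", "town": "city"},
    "webshop": {"customer": "name", "email_addr": "email", "city": "city"},
}

def to_mediated(source, record):
    """Translate a local-source record into the mediated schema."""
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[attr]: value
            for attr, value in record.items()
            if attr in mapping}

crm_row = {"full_name": "Ada Lovelace", "mail": "ada@example.com", "town": "London"}
print(to_mediated("crm", crm_row))
# {'name': 'Ada Lovelace', 'email': 'ada@example.com', 'city': 'London'}
```

Once every source is translated this way, downstream record linkage and fusion can work against a single, uniform schema.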
Example
Entities, such as people (customers, employees), companies (the enterprise itself, competitors, partners, suppliers), and products (those owned by the enterprise and by its competitors).
Defined relationships among these entities.
Activities with one or more entities as actors and/or subjects; documents can represent these activities.
Record Linkage
Record linkage (RL) refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases).
Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number), as may be the case due to differences in record shape, storage location, and/or curator style or preference.
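A minimal sketch of record linkage without a shared identifier: match records from two hypothetical sources by string similarity on the name field. The 0.8 similarity threshold and the record layouts are illustrative assumptions; real systems compare multiple fields and tune thresholds carefully.

```python
from difflib import SequenceMatcher

def similar(a, b):
    """String similarity in [0, 1] between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_records(source_a, source_b, threshold=0.8):
    """Return id pairs of records judged to refer to the same entity."""
    links = []
    for ra in source_a:
        for rb in source_b:
            if similar(ra["name"], rb["name"]) >= threshold:
                links.append((ra["id"], rb["id"]))
    return links

a = [{"id": 1, "name": "Jon Smith"}, {"id": 2, "name": "Mary Jones"}]
b = [{"id": "x", "name": "John Smith"}, {"id": "y", "name": "Bob Brown"}]
print(link_records(a, b))  # [(1, 'x')]
```

Note the nested loops: comparing every pair is quadratic in the number of records, which is exactly the scaling problem that the MapReduce and blocking techniques below are meant to address.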
Challenge in BDI
In BDI, (i) data sources tend to be heterogeneous in their structure and many sources (e.g., tweets, blog posts) provide unstructured data, and
(ii) data sources are dynamic and continuously evolving.
To address the volume dimension, new techniques have been proposed to enable parallel record linkage using MapReduce.
Adaptive blocking is another technique being used to overcome this.
MapReduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
The model is inspired by the map and reduce functions commonly used in functional programming.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting, and a Reduce() procedure that performs a summary operation.
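The model can be illustrated with the classic word-count example, sketched below as a single-process Python program. A real framework such as Hadoop would run the map and reduce phases in parallel across a cluster; the shuffle step that groups values by key is simulated here with a dictionary.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Reduce: summarize one key's values -- here, sum the counts.
    return (key, sum(values))

def map_reduce(documents):
    groups = defaultdict(list)          # shuffle: group emitted values by key
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(map_reduce(["big data is big", "data is everywhere"]))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

For parallel record linkage, the same pattern applies: the mapper emits records keyed by a blocking key, and the reducer compares only the records that land in the same group.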
Adaptive Blocking
Blocking methods alleviate the scalability problem of record linkage by efficiently selecting approximately similar object pairs for subsequent distance computations, leaving out the remaining pairs as dissimilar.
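The idea can be sketched as follows: records are grouped by a blocking key, and pairwise comparisons happen only within a block. Adaptive blocking would learn good blocking keys from training data; the fixed key used below (first letter of the surname) is a simplifying assumption for illustration.

```python
from itertools import combinations

def blocking_key(record):
    # Illustrative fixed blocking key; adaptive blocking learns this instead.
    return record["surname"][0].lower()

def candidate_pairs(records):
    """Group records into blocks, then compare pairs only within each block."""
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs

people = [
    {"surname": "Smith"}, {"surname": "Smyth"},
    {"surname": "Jones"}, {"surname": "Johnson"},
]
# Only 2 within-block comparisons instead of all 6 possible pairs.
print(len(candidate_pairs(people)))  # 2
```

Even on this toy input the number of comparisons drops from 6 to 2; on millions of records the savings are what make record linkage feasible at all.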
Data fusion
Data fusion refers to resolving conflicts from different sources and finding the truth that reflects the real world.
Its motivation is exactly the veracity of data: the Web has made it easy to publish and spread false information across multiple sources.
Data integration might be viewed as set combination, in which the larger set is retained, whereas fusion is a set-reduction technique.
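A minimal data-fusion sketch: conflicting values for the same attribute, reported by different sources, are resolved by majority vote. Real truth-discovery methods also weight sources by their estimated accuracy; the uniform vote and the source names below are simplifying assumptions.

```python
from collections import Counter

def fuse(claims):
    """claims: list of (source, value) pairs for one attribute of one entity.
    Returns the majority value as the fused 'truth'."""
    counts = Counter(value for _, value in claims)
    value, _ = counts.most_common(1)[0]
    return value

claims = [
    ("site_a", "Mumbai"),
    ("site_b", "Mumbai"),
    ("site_c", "Bombay"),   # stale value propagated by one source
]
print(fuse(claims))  # Mumbai
```

This is the set-reduction step described above: three conflicting claims are reduced to the single value most likely to reflect the real world.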
Data fusion model
Level 0: Source Preprocessing
Level 1: Object Assessment
Level 2: Situation Assessment
Level 3: Impact Assessment
Level 4: Process Refinement
Level 5: User Refinement
Advantages
Real-time rerouting of transportation fleets based on weather patterns
Customer sentiment analysis based on social postings
Targeted disease therapies based on genomic data
Allocation of disaster relief supplies based on mobile and social messages from victims
Cars driving themselves.
Conclusion
This seminar gives a basic insight into what big data is and reviews the state-of-the-art techniques for data integration in addressing the new challenges raised by big data, including the volume and number of sources, velocity, variety, and veracity. It also lists the advantages of harnessing the potential of big data.