Computational intelligence for big data analytics bda 2013

Big Data Analytics: Challenges and What Computational Intelligence Techniques May Offer

Ah-Hwee Tan (http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University

Big Data Analytics Symposium
London, UK

13 September 2013

Outline

Big Data Analytics

Computational Intelligence Techniques

Web Data Analytics

Flexible Organizer for Competitive Intelligence (FOCI)

Web Information Fusion and Associative Discovery

Analytics for Active Living for Elderly

The Era of Big Data

Big data refers to a collection of data sets so large and complex that they exceed the competence of commonly used IT systems in terms of processing space and/or time.

Sources of Big Data

• Traditionally, mostly produced in scientific fields such as astronomy, meteorology, genomics, physics, biology, and environmental research.

• With the rapid development of IT technology and the consequent decrease in the cost of collecting and storing data, big data has been generated from almost every industry and sector as well as governmental departments, including retail, finance, banking, security, audit, electric power, and healthcare.

• Recently, big data over the Web (big Web data for short) has emerged, which includes all the context data, such as user generated contents, browser/search log data, deep web data, etc.

Examples of Big Data (Source: Wikipedia)

• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.

• Facebook handles 50 billion photos from its user base.

• FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.

• Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.

Examples of Big Data (Source: Wikipedia)

• NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.

• Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will handle yottabytes of information collected by NSA over the Internet.

Value     Metric
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte
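As a rough illustration of the decimal prefixes in the table, the sketch below formats a raw byte count using powers of 1000. The function name `format_bytes` is hypothetical and introduced only for illustration. Note that the Walmart figure quoted earlier (2.5 petabytes given as 2560 terabytes) follows the binary convention of 1024 terabytes per petabyte rather than the decimal one.

```python
# Decimal (power-of-1000) unit symbols, as listed in the table.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def format_bytes(n_bytes: float) -> str:
    """Return a human-readable size using decimal prefixes (hypothetical helper)."""
    value = float(n_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


if __name__ == "__main__":
    # NCCS example from the slide: 32 petabytes.
    print(format_bytes(32 * 1000**5))
    # Binary convention behind "2.5 petabytes (2560 terabytes)".
    print(2.5 * 1024, "TB")
```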

Money of Big Data (Source: Wikipedia)

• "Big data" has increased the demand for information management specialists.

• Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing in data management and analytics.

• In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.

Market of Big Data (Source: Wikipedia)

• Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet.

• The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007 [14], and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013 [5].

Big Data Market Segments (Report by Transparency Market Research)

• Segmentation of the big data market by components, by applications and by geography.

• The different components included are software and services, hardware and storage.

• The software and services segment dominates the components market, whereas the storage segment will be the fastest growing segment for the next 5 years owing to the perpetual growth in the data generated.

Big Data Market Segment by Applications

• Covers eight applications in the application segment, namely financial services, manufacturing, healthcare, telecommunication, government, retail, media & entertainment, and others.

• Financial Services, healthcare and the government sector are the top three contributors of the big data market and together held more than 55% of the big data market in 2012.

• The media and entertainment and the healthcare sectors will grow at a high CAGR of nearly 42% from 2012 to 2018. The growth in data in the form of video, images, and games is driving the media and entertainment segment.

Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ
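To give a feel for what a compound annual growth rate (CAGR) of about 42% over 2012 to 2018 implies, the sketch below computes the cumulative growth factor and, conversely, recovers the CAGR from a start and end value. The function names are hypothetical, and the 42% rate is the only figure taken from the report; no market sizes are assumed.

```python
def growth_factor(cagr: float, years: int) -> float:
    """Cumulative multiplier after compounding `cagr` for `years` years."""
    return (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """CAGR that turns `start_value` into `end_value` over `years` years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    factor = growth_factor(0.42, 2018 - 2012)
    # A segment growing at 42% per year is roughly eight times larger after six years.
    print(f"growth factor over 6 years: {factor:.2f}")
    print(f"implied CAGR: {implied_cagr(1.0, factor, 6):.2%}")
```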

Challenges of Big Data

• Volume
– Size in the order of petabytes, exabytes, …
• Velocity
– Time sensitive data, data that grow exponentially or even in rates that overwhelm the well-known Moore's Law
• Variety
– From structured data into semi-structured and completely unstructured data of different types, such as text, image, audio, video, click streams, log files

Value     Metric
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte

Deeper Issues of Big Data (The additional 3Vs)

• Validity
– Is the data correct and accurate for the intended usage?
• Veracity
– Are the results meaningful for the given problem space?
• Volatility
– How long do you need to look/store this data?

Computational Intelligence

• Neural Networks (IJCNN)
– Brain-like mathematical models for pattern recognition, memory, and association discovery
– Examples: Perceptron, BP, SVM, SOM, ART, …

• Fuzzy Systems (IEEE-FUZZ)
– Fuzzy operators for handling non-discrete reasoning
– Examples: FNN, Fuzzy C-Means, …

Computational Intelligence

• Evolutionary Computing (CEC)
– Classes of heuristic algorithms that repeatedly search for good solutions by mimicking the process of natural evolution

– Commonly used for optimization and search problems

– Examples: Genetic Algo, Memetic Algo,
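The evolutionary search loop described above can be sketched with a very small genetic algorithm. This is a minimal illustration on the toy "OneMax" problem (maximize the number of 1 bits), not a method taken from the talk; the function names, population size, mutation rate, and selection scheme are all assumptions chosen for brevity.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1 bits in the candidate."""
    return sum(bits)


def genetic_algorithm(n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    """Evolve bit strings toward all ones; returns (best, best-fitness history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=one_max, reverse=True)
        history.append(one_max(population[0]))
        elite = population[:2]  # elitism: carry the two best over unchanged
        children = []
        while len(children) < pop_size - len(elite):
            # Tournament selection: the fittest of three random candidates.
            parent_a = max(rng.sample(population, 3), key=one_max)
            parent_b = max(rng.sample(population, 3), key=one_max)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = elite + children
    best = max(population, key=one_max)
    history.append(one_max(best))
    return best, history


if __name__ == "__main__":
    best, history = genetic_algorithm()
    print("best fitness per generation:", history)
```

Because the two best candidates are always kept, the best fitness never decreases from one generation to the next, which is the usual reason for adding elitism.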

Flagship Events of Computational Intelligence

• World Congress on Computational Intelligence (Australia 2012, Beijing 2014)
• IEEE Symposium on Computational Intelligence (Singapore 2013, Florida, USA 2014)
• IEEE Symposium on Computational Intelligence in Big Data (IEEE CIBD'2014)

Examples of Use of CI in Big Data

• Data size and feature space adaptation
• Uncertainty modeling in learning from big data
• Distributed learning techniques in uncertain environment
• Uncertainty in cloud computing
• Distributed parallel computation
• Feature selection/extraction in big data
• Sample selection based on uncertainty
• Incremental Learning
• Manifold Learning on big data
• Uncertainty techniques in big data classification/clustering
• Imbalance learning on big data
• Active learning on big data
• Random weight networks on big data
• Transfer learning on big data

Self-Organizing Neural Networks for Personalized Web Intelligence

Towards Personalized Web Intelligence
Ah-Hwee Tan, Hwee-Leng Ong, Hong Pan, Jamie Ng, Qiu-Xiang Li
Knowledge and Information Systems 18 (2004) 297-306

Workflow for Web Data Analytics

• Search
– Getting the information
• Organize (clustering/categorizing)
– Putting things in perspective
• Analyze (data mining)
– Discover hidden knowledge
• Share (knowledge management)
– Saving for reference and sharing
• Track
– Constant monitoring

Approaches to Organizing/Analyzing

• Clustering
– Organizing information into groups based on similarity functions and thresholds
– e.g. BullsEye, NorthernLight, Vivisimo
• Categorization
– Organizing information into a “predefined” set of classes
– e.g. Yahoo!, Autonomy Knowledge Server
• Which is better?

Clustering

• Pros
– Unsupervised/self-organizing, requires no training or predefinition of classes
– Able to identify new themes
• Cons
– Users have no control
– Ever changing cluster structure
– Difficult to navigate and track

Categorization

• Pros
– Good control on classes
– Every info assigned to one or more classes of interests
• Cons
– Requires learning (supervised) and/or definition of classification rules/knowledge
– Every info has to be assigned to one or more classes
– Good control but lacks flexibility to handle new information

User-configurable Clustering (Tan & Pan, PAKDD 2002)

• Information organization and content management
• Online incremental clustering + user-defined structure (preferences)
• Reduces to a clustering system if no user indication given
• Allows personalization in a direct, intuitive, and interactive manner
• Control + flexibility

ARAM for Personalized Information Management

[Figure: ARAM architecture. Category field F2 (information clusters) connected to two input fields: F1a, receiving the information vector A, and F1b, receiving the preference vector B.]

Flexible Organizer for Competitive Intelligence (FOCI)

• A platform for gathering, organizing, tracking, analyzing, and sharing competitive information
• Natural way of turning raw search results into personalized CI portfolios
– Multilingual enabled – with Multilingual Efficient Analyzer
– Domain localization (Technology)
• Patented and licensed to many companies

FOCI User Interface

FOCI Architecture

[Figure: FOCI architecture. Components: Intranet/Internet, Content Gathering, Content Management, Content Analysis, Content Publishing, Domain-Specific Knowledge, Visualization Front End, User's CI Portfolio.]

Personalized Content Management

• Portfolio created through Search
• Unsupervised clustering (ARAM Pattern Channel A)
• Loop
– Personalization by users (ARAM Pattern Channel B)
– Reorganization of clusters (ARAM Pattern Channel A&B)
• Saving of personalized portfolio
• Tracking of new information

Personalization Functions

• Marking/labeling (selected) clusters
– Personal interpretation
• Inserting clusters
– Indicate preferences on groupings
• Merging clusters
– Indicate preferences on similarities
• Splitting clusters
– Indicate preferences on differences
• ...

Information Clustering

• A portfolio created by a meta-search of 4 search engines with a query on “Text Mining”

A Personalized Portfolio

after <=19 personalization operations (mainly labeling and creating clusters)

Organizing New Information

[Figure: two panels comparing the organization of new documents: "Without the Personalized Portfolio" and "Based on Personalized Portfolio".]

42 new documents from DirectHit, Netscape, and BusinessWire

Summary

• A fusion neural network algorithm, called fusion ART, has been proposed for integrating clustering and categorization.
• It has been applied to competitive intelligence on the web.
• Compared with existing work, fusion ART has advantages in
– Personalization – fusion ART performs analysis and organization of data based on user preferences
– Low time complexity – fusion ART performs real-time search and match of patterns, resulting in a linear time complexity
– Incremental clustering – fusion ART may adapt to a dynamic web multimedia data set by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.
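The incremental clustering behaviour summarized above can be illustrated with the textbook single-channel fuzzy ART procedure (complement coding, choice function, vigilance test, and template update). This is a simplified sketch and not the fusion ART or ARAM implementation from the talk: there is only one input channel and no preference vector, and the class name and the parameter values (`rho`, `alpha`, `beta`) are assumptions.

```python
def complement_code(a):
    """Complement coding: [a, 1 - a] for features scaled to [0, 1]."""
    return list(a) + [1.0 - x for x in a]


def fuzzy_and(x, w):
    """Element-wise minimum (the fuzzy AND operator)."""
    return [min(xi, wi) for xi, wi in zip(x, w)]


class FuzzyART:
    """Single-channel fuzzy ART; each pattern is seen once and never stored."""

    def __init__(self, rho=0.75, alpha=0.001, beta=1.0):
        self.rho = rho      # vigilance: minimum match required to join a cluster
        self.alpha = alpha  # choice parameter
        self.beta = beta    # learning rate (1.0 gives fast learning)
        self.weights = []   # one template vector per cluster

    def learn(self, a):
        x = complement_code(a)

        def choice(j):
            w = self.weights[j]
            return sum(fuzzy_and(x, w)) / (self.alpha + sum(w))

        # Try clusters in order of decreasing choice value.
        for j in sorted(range(len(self.weights)), key=choice, reverse=True):
            overlap = fuzzy_and(x, self.weights[j])
            if sum(overlap) / sum(x) >= self.rho:  # vigilance (resonance) test
                self.weights[j] = [self.beta * o + (1.0 - self.beta) * w
                                   for o, w in zip(overlap, self.weights[j])]
                return j
        # No existing cluster is similar enough: create a new one.
        self.weights.append(x)
        return len(self.weights) - 1


if __name__ == "__main__":
    art = FuzzyART(rho=0.8)
    stream = [[0.1, 0.1], [0.12, 0.08], [0.9, 0.9], [0.88, 0.92]]
    print([art.learn(p) for p in stream])
```

Each call to `learn` scans the existing cluster templates once, so the cost per pattern grows with the number of clusters rather than with the amount of data already processed, and the number of clusters does not have to be fixed in advance.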

Heterogeneous Data Co-clustering for Social Media Data Theme Discovery and Mining

Lei Meng, Ah-Hwee Tan and Dong Xu
IEEE Transactions on Knowledge and Data Engineering, 2013

Introduction

• The popularity of social websites leads to a great increase in web multimedia documents
– Massive number – Billions of images and articles online
– Diversity – Diverse content and booming emerging topics
– Multi-modal descriptors – images, text, category, tags, comments

[Figure: example image with category "Birds" and keywords from surrounding text: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]

Introduction

• Clustering of web multimedia data is challenging
– Scalability to big data
– Difficulty in integrating multi-modal feature data
– Ambiguity in deciding the number of categories
– Rich but noisy meta-information – semantic gap of images, noisy tags

[Figure: two example images with their tags. "Birds": wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano. "Beach": ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]

Problem Statement

We define the theme discovery of web multimedia data as a heterogeneous data co-clustering problem, which identifies the semantic categories of data patterns through the fusion and recognition of multiple types of features.

[Figure: the query "Apple" with multiple descriptions (category, tag, user description, surrounding text) leading to different categories: Fruits, Products, Movies.]

Proposed Approach

• A self-organizing neural network approach to Heterogeneous Data Co-clustering
– Based on Fusion Adaptive Resonance Theory (Fusion ART)
– Fuses an arbitrary number of feature modalities
– Adaptively tunes the weights for different feature modalities
– Two different learning functions, one for primary data such as images and articles and one for meta-information, to handle short and noisy text
– Incremental fast learning
– Does not need to be given the number of clusters

Experiments

• NUS-WIDE data set
– 36784 images of 18 categories
– Visual features: grid color moment, edge direction histogram, and wavelet texture
– Textual features of surrounding text: 1142 words (7 words per image on average)
• 20 Newsgroups data set
– 12826 text documents of 10 categories
– Textual features of document content: over 60k words (800 words per document on average)
– Textual features of category: 3 labels per document on average

Experiments on NUS-WIDE Data Set

• Evaluation on weight adaptation across channels for visual and textual features
– Performance comparison with fixed weight values
• GHF-ART with the adaptively tuned weight values γ_SA achieves the best performance in 5 classes and in overall performance, and achieves performance close to the best results obtained with fixed weight values


– Tracking of the change in weight values of γ_SA
• Textual features of surrounding text are assigned higher weights than visual features
• The value of γ_SA stabilizes in [0.7, 0.8] with the increase of patterns
• Big fluctuations may result from the generation of new clusters


• Clustering performance comparison with existing algorithms in terms of weighted average precision, cluster entropy (H_cluster), class entropy (H_class), purity and Rand index (RI)
• GHF-ART achieves the best performance in terms of all the evaluation measures
• With supervisory information, GHF-ART(SS) consistently obtains better performance
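Two of the evaluation measures named above, purity and Rand index, can be computed as in the sketch below. These are the common textbook definitions and may differ in detail from the exact variants used in the reported experiments; the function names and the toy labels are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, classes):
    """Fraction of items that belong to the majority class of their cluster."""
    total = 0
    for c in set(clusters):
        members = [cls for clu, cls in zip(clusters, classes) if clu == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(clusters)


def rand_index(clusters, classes):
    """Fraction of item pairs on which the clustering and the classes agree."""
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (classes[i] == classes[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    # Toy example: six items, two predicted clusters, two true classes.
    predicted = [0, 0, 0, 1, 1, 1]
    truth = ["a", "a", "b", "b", "b", "b"]
    print(f"purity = {purity(predicted, truth):.3f}")
    print(f"Rand index = {rand_index(predicted, truth):.3f}")
```

The pairwise Rand index loop is quadratic in the number of items, so for data sets of the size discussed here a contingency-table formulation would be preferable in practice.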


• Time complexity analysis
– GHF-ART and Fusion ART incur a very small increase in time cost
– For 23284 images, GHF-ART completes the clustering process in 10 seconds

Experiments on 20 Newsgroups Data Set

• Clustering performance comparison using document content and category information
– Both GHF-ART and GHF-ART(SS) outperform other algorithms in all the evaluation measures
– GHF-ART has a 5% gain over Fusion ART in terms of Average Precision, Purity and Rand Index
– Compared with other unsupervised algorithms, GHF-ART achieves around 80% in Average Precision, Purity and Rand Index while other algorithms typically obtain less than 75%

Summary

• A heterogeneous data co-clustering algorithm, called GHF-ART, is proposed to discover the themes of web multimedia data via their rich but heterogeneous descriptors.
• Compared with existing work, GHF-ART has advantages in
– Strong noise immunity – a learning function for meta-information is proposed to handle noise;
– Adaptive channel weighting – a well-defined weighting algorithm is proposed to identify the important feature modalities for a better fusion of multi-modal features in the overall similarity measure;
– Low time complexity – GHF-ART performs real-time search and match of patterns, resulting in a linear time complexity for big data;
– Incremental clustering – GHF-ART may adapt to a dynamic web multimedia data set by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.

Research Centre of Excellence in Active LIving for the elderLY (LILY)

Aging in Place: Opportunities and Challenges

Ah-Hwee Tan (http://www.ntu.edu.sg/home/asahtan)

School of Computer Engineering
Nanyang Technological University

JOINT UBC-NTU RESEARCH CENTRE

Aging in Place

“the ability to live in one's own home and community safely, independently, and comfortably, regardless of age, income, or ability level” - Center for Disease Control, Dec 2011

Motivation

• Global aging population creates silver challenges
• Most adults would prefer to age in place
– 78 percent of adults between the ages of 50 and 64 report that they would prefer to stay in their current residence as they age
• Growing elderly population will be living independently in own homes
• Vital to transform future homes into intelligent human-centered environments for the elderly
• Golden opportunities for innovating assistive technologies for aging in place

A Basic Scenario of Tender Care for Aging-in-place

[Figure: scenario pipeline. Components: Unobtrusive Sensing, Social Signal Processing, Context Aware Auto Tagging, Cognitive Analysis, Social Cognitive Network.]

• Unobtrusive sensing device detects: the elder keeps walking around at an irregular pace.
• Social signal processing indicates: the elder has been silent for an unusually long time.
• Cognitive analysis result: "Your mother may be feeling anxious now"
• "I need to call my mother now…"

Silver Challenges

Vision

To enable the elderly to maintain an active, healthy and engaging lifestyle in their own homes, supported by an age-friendly intelligent environment providing all-round comprehensive tender care
• Round-the-clock day-to-day health and wellness monitoring
• Cognitive support and recommendation of products and services
• Companionship and emotional support
• Support for maintaining/stimulating social interaction

Design Consideration and Challenges

• How to perform unobtrusive monitoring?
- Mobile sensing, activity tracking
• How to provide all-around comprehensive care?
- Physical, cognitive, emotional, social, sustainability
• How to maintain ubiquitous access and interaction?
- Cross platform, multimedia, multimodal
• How to provide friendly, personal touch?
- Adaptive user modeling, mood detection
- Proactive, natural interaction

Approach and Methodology

To support active living of the elderly through an intelligent multi-agent environment with ubiquitous access, natural interface, and all-rounded comprehensive care

Key Technologies
• Unobtrusive sensing and social signal processing
• Activity pattern and user modeling
• Information and service recommendation
• Proactive stimulation and natural interaction

A Multi-Agent Collaborative Care Environment

[Figure: three care agents: Isabel (Personal Nurse), Alfred (The Butler), and Frank (Robot Dog). Functions shown: small talk, recommendations for healthcare products and services, user modeling, social and travel advisory, activity sensing, pattern modeling.]

Why Multi-Agent?

• Unobtrusive sensing and monitoring – agents of different characteristics and capabilities
• Ubiquitous access to information and services – agents in different platforms and locations
• Comprehensive tender care – agents with different domain knowledge and functions
• “Three’s a party” – more opportunities for cognitive stimulation and social interaction

Comprehensive Tender Care

• Physical Support – activity tracking, safety and wellness monitoring
• Cognitive Support – information and recommendation on (healthcare) products, services, skills and activities
• Emotional Support – mood detection, affective support, small talk
• Social Support – companionship and connection to family and friends (old and new) through SMS, emails, Facebook, etc.

Unobtrusive Sensing and Ubiquitous Access to Services

• Unobtrusive in-home real-time data collection and contextual social signal processing - essential to better understand and cater to the elderly’s needs.
• Sensing – bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
• Cross Platform – large screen interactive display, mobile handheld devices, physical robots
• Multimedia – text, audio, video

Adaptive User Modelling

• Identity and profile
• Interests and preferences
• Behaviour model: time, space, activity
• Knowledge and skills
• Social network: family and friends

Methods for Model Building
• Explicit: user specification
• Implicit: user actions, choices, conversation

Cognitive Support: Product/Service Recommendation

• Domain knowledge: Healthcare, Travel, Cooking
• Delivery modes:
- Question & Answer
- Proactive recommendation
- Conversation
• Personal Touch: personalized, context sensitive, small talk

Challenges in Big Living Analytics

• Volume – huge amount of data through bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
• Velocity – 24x7 real time sensing, sense making, decision making, service recommendation
• Variety – information integration and knowledge sharing from cross platform, multimedia unstructured data - text, audio, video, gestures

Research Centre of Excellence in Active LIving for the elderLY (LILY)

Thank you!

JOINT UBC-NTU RESEARCH CENTRE
