Computational intelligence for big data analytics bda 2013
Big Data Analytics: Challenges and What Computational Intelligence Techniques May Offer
Ah-Hwee Tan (http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
Big Data Analytics Symposium
London, UK
13 September 2013
Outline
• Big Data Analytics
• Computational Intelligence Techniques
• Web Data Analytics
• Flexible Organizer for Competitive Intelligence (FOCI)
• Web Information Fusion and Associative Discovery
• Analytics for Active Living for Elderly
The Era of Big Data
Big data refers to a collection of data sets so large and complex that they exceed the competence of commonly used IT systems in terms of processing space and/or time.
Sources of Big Data
• Traditionally, mostly produced in scientific fields such as astronomy, meteorology, genomics, physics, biology, and environmental research.
• With the rapid development of IT and the consequent decrease in the cost of collecting and storing data, big data has been generated by almost every industry and sector as well as governmental departments, including retail, finance, banking, security, audit, electric power, and healthcare.
• Recently, big data has also come from the Web (big Web data for short), which includes all the context data, such as user-generated content, browser/search log data, deep web data, etc.
Examples of Big Data (Source: Wikipedia)
• Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
• Facebook handles 50 billion photos from its user base.
• FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
• Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
Examples of Big Data (Source: Wikipedia)
• NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
• Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will handle yottabytes of information collected by NSA over the Internet.

Value     Metric
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte
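The decimal prefixes in the table can be applied mechanically. The sketch below is only an illustration of that arithmetic; the function names `to_bytes` and `convert` are made up for this example and are not from the talk. Note that the "2560 terabytes" quoted for 2.5 petabytes in the Walmart example matches the binary convention (1 PB = 1024 TB), whereas the table uses powers of 1000.

```python
# Decimal (SI) byte units as listed in the table: unit k is 1000**k bytes.
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a quantity such as (2.5, "PB") to bytes using powers of 1000."""
    power = UNITS.index(unit) + 1  # kB -> 1, MB -> 2, ..., YB -> 8
    return value * 1000 ** power


def convert(value, src, dst):
    """Convert between two decimal units, e.g. petabytes to terabytes."""
    return to_bytes(value, src) / to_bytes(1, dst)


if __name__ == "__main__":
    print(convert(2.5, "PB", "TB"))   # decimal convention
    print(2.5 * 1024)                 # binary convention, for comparison
```

Under the decimal convention 2.5 PB is 2500 TB; the 2560 figure only appears when each step is taken as 1024.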
Money of Big Data (Source: Wikipedia)
• "Big data" has increased the demand for information management specialists.
• Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing in data management and analytics.
• In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.
Market of Big Data (Source: Wikipedia)
• Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet.
• The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007, and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.
Big Data Market Segments (Report by Transparency Market Research)
• Segmentation of the big data market by components, by applications and by geography.
• The different components included are software and services, hardware and storage.
• The software and services segment dominates the components market, whereas the storage segment will be the fastest growing segment for the next 5 years owing to the perpetual growth in the data generated.
Big Data Market Segment by Applications
• Covered eight applications in the application segment, namely financial services, manufacturing, healthcare, telecommunication, government, retail, media & entertainment, and others.
• Financial services, healthcare and the government sector are the top three contributors to the big data market and together held more than 55% of the big data market in 2012.
• The media and entertainment and the healthcare sectors will grow at a high CAGR of nearly 42% from 2012 to 2018. The growth in data in the form of video, images, and games is driving the media and entertainment segment.
Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ
Challenges of Big Data
• Volume
– Size in the order of petabytes, exabytes, …
• Velocity
– Time sensitive data, data that grow exponentially or even at rates that overwhelm the well-known Moore's Law
• Variety
– From structured data into semi-structured and completely unstructured data of different types, such as text, image, audio, video, click streams, log files,

Value     Metric
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte
Deeper Issues of Big Data (The additional 3Vs)
• Validity
– Is the data correct and accurate for the intended usage?
• Veracity
– Are the results meaningful for the given problem space?
• Volatility
– How long do you need to look/store this data?
Computational Intelligence
• Neural Networks (IJCNN)
– Brain-like mathematical models for pattern recognition, memory, and association discovery
– Examples: Perceptron, BP, SVM, SOM, ART, …
• Fuzzy Systems (IEEE-FUZZ)
– Fuzzy operators for handling non-discrete reasoning
– Examples: FNN, Fuzzy C-Means, …
Computational Intelligence
• Evolutionary Computing (CEC)
– Classes of heuristic algorithms that repeatedly search for good solutions by mimicking the process of natural evolution
– Commonly used for optimization and search problems
– Examples: Genetic Algo, Memetic Algo,
Flagship Events of Computational Intelligence
• World Congress on Computational Intelligence (Australia 2012, Beijing 2014)
• IEEE Symposium on Computational Intelligence (Singapore 2013, Florida, USA 2014)
• IEEE Symposium on Computational Intelligence in Big Data (IEEE CIBD'2014)
Examples of Use of CI in Big Data
• Data size and feature space adaptation
• Uncertainty modeling in learning from big data
• Distributed learning techniques in uncertain environment
• Uncertainty in cloud computing
• Distributed parallel computation
• Feature selection/extraction in big data
• Sample selection based on uncertainty
• Incremental Learning
• Manifold Learning on big data
• Uncertainty techniques in big data classification/clustering
• Imbalance learning on big data
• Active learning on big data
• Random weight networks on big data
• Transfer learning on big data
Self-Organizing Neural Networks for Personalized Web Intelligence
Towards Personalized Web Intelligence
Ah-Hwee Tan, Hwee-Leng Ong, Hong Pan, Jamie Ng, Qiu-Xiang Li
Knowledge and Information Systems 18 (2004) 297-306
Workflow for Web Data Analytics
• Search
– Getting the information
• Organize (clustering/categorizing)
– Putting things in perspective
• Analyze (data mining)
– Discover hidden knowledge
• Share (knowledge management)
– Saving for reference and sharing
• Track
– Constant monitoring
Approaches to Organizing/Analyzing
• Clustering
– Organizing information into groups based on similarity functions and thresholds
– e.g. BullsEye, NorthernLight, Vivisimo
• Categorization
– Organizing information into a “predefined” set of classes
– e.g. Yahoo!, Autonomy Knowledge Server
• Which is better?
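To make "similarity functions and thresholds" concrete, here is a minimal sketch of threshold-based incremental clustering. It is a generic leader-style scheme chosen for illustration, not the method used by any of the systems named above; the cosine similarity, the `threshold` value, and the function names are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def leader_cluster(vectors, threshold=0.8):
    """Assign each vector to its most similar cluster leader if the
    similarity reaches the threshold; otherwise start a new cluster."""
    leaders = []  # first member of each cluster acts as its representative
    labels = []
    for v in vectors:
        best, best_sim = -1, threshold
        for i, leader in enumerate(leaders):
            s = cosine(v, leader)
            if s >= best_sim:
                best, best_sim = i, s
        if best < 0:
            leaders.append(v)
            best = len(leaders) - 1
        labels.append(best)
    return labels


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(leader_cluster(docs))  # -> [0, 0, 1, 1]
```

No classes are defined up front: groups emerge from the data and the threshold, which is the trade-off against categorization into a predefined set of classes.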
Clustering
• Pros
– Unsupervised/self-organizing, requires no training or predefinition of classes
– Able to identify new themes
• Cons
– Users have no control
– Ever-changing cluster structure
– Difficult to navigate and track
Categorization
• Pros
– Good control on classes
– Every info assigned to one or more classes of interests
• Cons
– Requires learning (supervised) and/or definition of classification rules/knowledge
– Every info has to be assigned to one or more classes
– Good control but lacks flexibility to handle new information
User-configurable Clustering (Tan & Pan, PAKDD 2002)
• Information organization and content management
• Online incremental clustering + user-defined structure (preferences)
• Reduces to a clustering system if no user indication given
• Allows personalization in a direct, intuitive, and interactive manner
• Control + flexibility
ARAM for Personalized Information Management
[Figure: ARAM architecture. Field F2 (Information Clusters) is connected to fields F1a and F1b, which receive input A (Information Vector) and input B (Preference Vector).]
Flexible Organizer for Competitive Intelligence (FOCI)
• A platform for gathering, organizing, tracking, analyzing, and sharing competitive information
• Natural way of turning raw search results into personalized CI portfolios
– Multilingual enabled – with Multilingual Efficient Analyzer
– Domain localization (Technology)
• Patented and licensed to many companies
FOCI User Interface
FOCI Architecture
[Figure: FOCI architecture. Components shown: Intranet/Internet, Content Gathering, Content Management, Content Publishing, Content Analysis, Domain-Specific Knowledge, Visualization Front End, User's CI Portfolio.]
Personalized Content Management
• Portfolio created through Search
• Unsupervised clustering (ARAM Pattern Channel A)
• Loop
– Personalization by users (ARAM Pattern Channel B)
– Reorganization of clusters (ARAM Pattern Channel A&B)
• Saving of personalized portfolio
• Tracking of new information
Personalization Functions
• Marking/labeling (selected) clusters
– Personal interpretation
• Inserting clusters
– Indicate preference on groupings
• Merging clusters
– Indicate preferences on similarities
• Splitting clusters
– Indicate preferences on differences
• ...
Information Clustering
• A portfolio created by a meta-search of 4 search engines with a query on “Text Mining”
A Personalized Portfolio
after <=19 personalization operations (mainly labeling and creating clusters)
Organizing New Information
42 new documents from DirectHit, Netscape, and BusinessWire, organized (a) without the personalized portfolio and (b) based on the personalized portfolio
Summary
• A fusion neural network algorithm, called fusion ART, has been proposed for integrating clustering and categorization.
• Has been applied to competitive intelligence on the web.
• Compared with existing work, fusion ART has advantages in
– Personalization: fusion ART performs analysis and organization of data based on user preferences
– Low time complexity: fusion ART performs real-time search and match of patterns, resulting in a linear time complexity
– Incremental clustering manner: fusion ART may adapt to dynamic web multimedia data sets by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.
Heterogeneous Data Co-clustering for Social Media Data Theme Discovery and Mining
Lei Meng, Ah-Hwee Tan and Dong Xu
IEEE Transactions on Knowledge and Data Engineering, 2013
Introduction
• The popularity of social websites leads to a great increase in web multimedia documents
– Massive number – Billions of images and articles online
– Diversity – Diverse content and booming emerging topics
– Multi-modal descriptors – images, text, category, tags, comments
[Figure: an example image with its category (Birds) and keywords from surrounding text: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]
Introduction
• Clustering of web multimedia data is challenging
– Scalability to big data
– Difficulty in integrating multi-modal feature data
– Ambiguity in deciding the number of categories
– Rich but noisy meta-information – semantic gap of images, noisy tags
[Figure: two example images with their tags. Birds: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano. Beach: ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]
Problem Statement
We define the theme discovery of web multimedia data as a heterogeneous data co-clustering problem, which identifies the semantic categories of data patterns through the fusion and recognition of multiple types of features.
[Figure: an "Apple" example with multiple descriptions (category, tag, user description, surrounding text) and candidate categories Fruits, Products, Movies.]
Proposed Approach
• A self-organizing neural network approach to Heterogeneous Data Co-clustering
– Based on Fusion Adaptive Resonance Theory (Fusion ART)
– Fuses an arbitrary number of feature modalities
– Adaptively tunes the weights for different feature modalities
– Two different learning functions, one for primary data such as images and articles, and one for meta-information, to handle short and noisy text
– Incremental fast learning
– Does not need to be given the number of clusters
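The following is a minimal sketch of multi-channel clustering in the style of Fusion ART, using the commonly published fuzzy ART operations (weighted choice function, per-channel vigilance test, and template learning). It is an illustration under stated assumptions, not the GHF-ART algorithm from the talk: the channel weights `gammas` are fixed here rather than adaptively tuned, a single learning rule is used for all channels, and the class name, parameter values, and toy inputs are hypothetical.

```python
def complement_code(v):
    """Complement coding commonly used with fuzzy ART: [v, 1 - v]."""
    return list(v) + [1.0 - a for a in v]


def fuzzy_and(x, w):
    """Element-wise minimum (the fuzzy AND operator)."""
    return [min(a, b) for a, b in zip(x, w)]


class FusionART:
    def __init__(self, gammas, rhos, alpha=0.01, beta=1.0):
        self.gammas = gammas  # per-channel contribution weights (fixed here)
        self.rhos = rhos      # per-channel vigilance thresholds
        self.alpha = alpha    # choice parameter
        self.beta = beta      # learning rate; 1.0 means fast learning
        self.clusters = []    # each cluster holds one weight vector per channel

    def _choice(self, xs, ws):
        # T_j = sum_k gamma_k * |x_k ^ w_jk| / (alpha + |w_jk|)
        return sum(
            g * sum(fuzzy_and(x, w)) / (self.alpha + sum(w))
            for g, x, w in zip(self.gammas, xs, ws)
        )

    def _resonates(self, xs, ws):
        # Every channel must satisfy |x_k ^ w_jk| / |x_k| >= rho_k.
        for rho, x, w in zip(self.rhos, xs, ws):
            nx = sum(x)
            if nx > 0 and sum(fuzzy_and(x, w)) / nx < rho:
                return False
        return True

    def learn(self, xs):
        """Present one pattern (a list of channel vectors); return its cluster."""
        order = sorted(
            range(len(self.clusters)),
            key=lambda j: self._choice(xs, self.clusters[j]),
            reverse=True,
        )
        for j in order:
            ws = self.clusters[j]
            if self._resonates(xs, ws):
                self.clusters[j] = [
                    [(1 - self.beta) * wi + self.beta * min(xi, wi)
                     for xi, wi in zip(x, w)]
                    for x, w in zip(xs, ws)
                ]
                return j
        # No existing cluster passes vigilance: create a new one.
        self.clusters.append([list(x) for x in xs])
        return len(self.clusters) - 1


if __name__ == "__main__":
    model = FusionART(gammas=[0.5, 0.5], rhos=[0.7, 0.7])
    patterns = [
        ([0.9, 0.1], [1.0, 0.0]),  # (visual, textual) toy features
        ([0.8, 0.2], [1.0, 0.0]),
        ([0.1, 0.9], [0.0, 1.0]),
    ]
    labels = [model.learn([complement_code(v), complement_code(t)])
              for v, t in patterns]
    print(labels)  # -> [0, 0, 1]
```

Each pattern is processed once and compared only against the current cluster templates, so the number of clusters is not specified in advance and earlier data never needs to be revisited.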
Experiments
• NUS-WIDE data set
– 36784 images of 18 categories
– Visual features: Grid color moment, Edge direction histogram, and wavelet texture
– Textual features of surrounding text: 1142 words (7 words per image on average)
• 20 Newsgroups data set
– 12826 text documents of 10 categories
– Textual features of document content: over 60k words (800 words per document on average)
– Textual features of category: 3 labels per document on average
Experiments on NUS-WIDE Data Set
• Evaluation on weight adaptation across channels for visual and textual features
– Performance comparison with fixed weight values
• GHF-ART with the adaptively tuned weight values γ_SA achieves the best performance in 5 classes and in overall performance, and achieves performance close to the best results obtained by fixed weight values
Experiments on NUS-WIDE Data Set
– Tracking of the change in weight values of γ_SA
• Textual features of surrounding text are assigned higher weights than visual features
• The value of γ_SA stabilizes in [0.7, 0.8] with the increase of patterns
• Big fluctuations may result from the generation of new clusters
Experiments on NUS-WIDE Data Set
• Clustering performance comparison with existing algorithms in terms of weighted average precision, cluster entropy (H_cluster), class entropy (H_class), purity and Rand index (RI)
• GHF-ART achieves the best performance in terms of all the evaluation measures
• With supervisory information, GHF-ART(SS) consistently obtains better performance
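Two of the measures named above, purity and the Rand index, have standard textbook definitions that can be sketched briefly. The code below follows those general definitions and is only an assumption about how the reported numbers were computed; the exact formulas for weighted average precision, cluster entropy, and class entropy used in the experiments are not reproduced here. Function names and the toy labels are illustrative.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, []).append(t)
    majority = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which the two partitions agree
    (both place the pair together, or both place it apart).
    Enumerates all pairs, so it is quadratic in the number of items."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    pred = [0, 0, 0, 1]
    print(purity(truth, pred))      # -> 0.75
    print(rand_index(truth, pred))  # -> 0.5
```

Both measures lie in [0, 1], with higher values indicating closer agreement between the clustering and the ground-truth classes.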
Experiments on NUS-WIDE Data Set
• Time complexity analysis
– GHF-ART and Fusion ART incur a very small increase in time cost
– For 23284 images, GHF-ART completes the clustering process in 10 seconds
Experiments on 20 Newsgroups Data Set
• Clustering performance comparison using document content and category information
– Both GHF-ART and GHF-ART(SS) outperform other algorithms in all the evaluation measures
– GHF-ART has a 5% gain over Fusion ART in terms of Average Precision, Purity and Rand Index.
– Compared with other unsupervised algorithms, GHF-ART achieves around 80% in Average Precision, Purity and Rand Index while other algorithms typically obtain less than 75%
Summary
• A heterogeneous data co-clustering algorithm, called GHF-ART, is proposed to discover the themes of web multimedia data via their rich but heterogeneous descriptors.
• Compared with existing work, GHF-ART has advantages in
– Strong noise immunity: a learning function for meta-information is proposed to handle noise;
– Adaptive channel weighting: a well-defined weighting algorithm is proposed to identify the important feature modalities for a better fusion of multi-modal features for overall similarity measure;
– Low time complexity: GHF-ART performs real-time search and match of patterns, resulting in a linear time complexity for big data;
– Incremental clustering manner: GHF-ART may adapt to dynamic web multimedia data sets by incrementally clustering new patterns based on the learnt cluster structure without referring to the old data.
Research Centre of Excellence in Active LIving for the elderLY (LILY)
Aging in Place: Opportunities and Challenges
Ah-Hwee Tan (http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
JOINT UBC-NTU RESEARCH CENTRE
Aging in Place
“the ability to live in one's own home and community safely, independently, and comfortably, regardless of age, income, or ability level” - Center for Disease Control, Dec 2011
Motivation
• Global aging population creates silver challenges
• Most adults would prefer to age in place
– 78 percent of adults between the ages of 50 and 64 report that they would prefer to stay in their current residence as they age
• Growing elderly population will be living independently in own homes
• Vital to transform future homes into intelligent human-centered environments for the elderly
• Golden opportunities for innovating assistive technologies for aging in place
A Basic Scenario of Tender Care for Aging-in-place
[Figure: scenario diagram. Components shown: Unobtrusive Sensing, Social Signal Processing, Context Aware Auto Tagging, Cognitive Analysis, Social Cognitive Network.]
Unobtrusive sensing device detects: the elder keeps walking around at an irregular pace.
Social signal processing indicates: the elder has been silent for an unusually long time.
Cognitive analysis result: "Your mother may be feeling anxious now"
Response: "I need to call my mother now…"
Silver Challenges
Vision
To enable the elderly to maintain an active, healthy and engaging lifestyle in their own homes, supported by an age-friendly intelligent environment providing all-round comprehensive tender care
• Round-the-clock day-to-day health and wellness monitoring
• Cognitive support and recommendation of products and services
• Companionship and emotional support
• Support for maintaining/stimulating social interaction
Design Consideration and Challenges
• How to perform unobtrusive monitoring?
- Mobile sensing, activity tracking
• How to provide all-around comprehensive care?
- Physical, cognitive, emotional, social, sustainability
• How to maintain ubiquitous access and interaction?
- Cross platform, multimedia, multimodal
• How to provide friendly, personal touch?
- Adaptive user modeling, mood detection
- Proactive, natural interaction
Approach and Methodology
To support active living of the elderly through an intelligent multi-agent environment with ubiquitous access, natural interface, and all-rounded comprehensive care
Key Technologies
• Unobtrusive sensing and social signal processing
• Activity pattern and user modeling
• Information and service recommendation
• Proactive stimulation and natural interaction
A Multi-Agent Collaborative Care Environment
[Figure: three collaborating agents: Isabel (Personal Nurse), Alfred (The Butler), and Frank (Robot Dog). Functions shown: small talk, recommendations for healthcare products and services, user modeling, social and travel advisory, activity sensing, pattern modeling.]
Why Multi-Agent?
• Unobtrusive sensing and monitoring – agents of different characteristics and capabilities
• Ubiquitous access to information and services – agents in different platforms and locations
• Comprehensive tender care – agents with different domain knowledge and functions
• “Three’s a party” – more opportunities for cognitive stimulation and social interaction
Comprehensive Tender Care
• Physical Support – Activity tracking, safety and wellness monitoring
• Cognitive Support – information and recommendation on (healthcare) products, services, skills and activities
• Emotional Support – mood detection, affective support, small talk
• Social Support – companionship and connection to family and friends (old and new) through SMS, email, Facebook, etc.
Unobtrusive Sensing and Ubiquitous Access to Services
Unobtrusive in-home real-time data collection and contextual social signal processing - essential to better understand and cater to the elderly’s needs.
• Sensing – bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
• Cross Platform – Large screen interactive display, mobile handheld devices, physical robots
• Multimedia – text, audio, video
Adaptive User Modelling
• Identity and profile
• Interests and preferences
• Behaviour model: Time, space, activity
• Knowledge and skills
• Social network: Family and friends
Methods for Model Building
• Explicit: User specification
• Implicit: User actions, choices, conversation
Cognitive Support: Product/Service Recommendation
• Domain knowledge: Healthcare, Travel, Cooking
• Delivery modes:
- Question & Answer
- Proactive recommendation
- Conversation
• Personal Touch: Personalized, context sensitive, small talks
Challenges in Big Living Analytics
• Volume – huge amount of data through bio sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking
• Velocity – 24x7 real time sensing, sense making, decision making, service recommendation
• Variety – information integration and knowledge sharing from cross platform, multimedia unstructured data - text, audio, video, gestures
Research Centre of Excellence in Active LIving for the elderLY (LILY)
Thank you!
JOINT UBC-NTU RESEARCH CENTRE