Current Advances in Data Mining: Multimedia Data Mining and ...
1
Information Management and Data Mining
Presented by: Dr. Herna L Viktor
Others: Dr. Iluju Kiringa, Dr. Thomas Tran, Dr. Liam Peyton
2
Information overload: “The amount of knowledge in the world has doubled in the past ten (10) years and is doubling every 18 months” (American Society of Training and Documentation, ASTD)
Massive petabyte-scale (2^50 bytes) data repositories: e.g. it is estimated that Google maintains 4 petabytes of RAM.
E-Commerce and the Web: a digital marketplace; eHealth
Data sharing: data must be available “anywhere, any time, and in almost any form”
The “Digital Rosetta Stone”: our digital heritage is in danger of being lost due to the silent obsolescence of current technology
OUR RESEARCH:
  How do we share/store/preserve this data?
  What information can we use to improve our decision making?
  How do we obtain/extract and explore the hidden knowledge?
3
Information Management and Data Mining Research: Five Themes
Data/Information Management:
  (T1) Dr. Iluju Kiringa: Data Sharing
  (T2) Dr. Herna L Viktor: Relational and multimedia data mining
  (T3) Dr. Thomas Tran: Software agents for e-Commerce
  (T4) Dr. Herna L Viktor: Long-term preservation of data
  (T5) Dr. Liam Peyton: Accessible data warehousing for e-health
4
(T1) Data Sharing: Dr. Iluju Kiringa
Data must be available “anywhere, any time, and in almost any form”; thus we must cope with
  very large networks of data sources
  complex heterogeneity among the sources
  inconsistent data across the sources
  data sharing and exchange between the sources
  etc.
Several applications illustrate this need
  Genomic data
  E-health
  Enterprise alliances
5
Background and Goals: Dr. Iluju Kiringa
Background: data sharing on peer-to-peer networks
  P2P networks are open-ended networks of distributed computational nodes (peers)
  Each peer can directly exchange data and/or services with a set of other peers
  Peers act autonomously, including for joining/leaving
  Peers are not subject to global control in the form of global registries, global services, global resource management, or a global schema and data repository
  Mostly used for sharing files (plain text, songs, movies, video, etc.); some examples are
    Napster, Gnutella, Kazaa: file sharing applications
    SETI@home: distributed computing application
Research Goal: Enhance data sharing on P2P networks to offer the same high-quality access to data that classical distributed relational DBMSs offer
6
Data Sharing Research Issues: Dr. Iluju Kiringa
Heterogeneity management
  Interoperability of peer databases
  Syntactic and semantic heterogeneity
Dynamics and scale management
  Protocols for peer databases to join/leave networks
Query processing via propagation
  Query propagation through the network
  Query optimization
Data coordination using
  update propagation
  distributed triggers
Transaction processing
  Design non-classical transaction models and correctness criteria
  Implement the models
Service-oriented architecture
  Design and compare several possible architectures for a peer DBMS
  Implement some of these architectures
  Deploy a real network
Applications
Theory behind data sharing
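Query propagation through a peer network, one of the research issues listed on this slide, can be illustrated with a minimal sketch. It assumes a hypothetical in-memory network in which each peer holds a few local records and knows only its direct acquaintances. The names Peer and propagate_query, the hop limit ttl, and the sample records are illustrative assumptions, not part of the peer DBMS described in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Hypothetical peer: a name, local records, and direct acquaintances."""
    name: str
    records: list
    # Excluded from repr/compare because acquaintance links are cyclic.
    neighbours: list = field(default_factory=list, repr=False, compare=False)


def propagate_query(start, predicate, ttl=3):
    """Flood a query outwards from `start`, visiting each peer at most once.

    `ttl` bounds how many hops the query may travel. There is no global
    registry: a peer only forwards the query to peers it knows directly.
    """
    seen = {start.name}
    frontier = [start]
    results = []
    hops = 0
    while frontier:
        next_frontier = []
        for peer in frontier:
            # Each peer answers from its own local data only.
            results.extend((peer.name, r) for r in peer.records if predicate(r))
            if hops < ttl:
                for n in peer.neighbours:
                    if n.name not in seen:
                        seen.add(n.name)
                        next_frontier.append(n)
        frontier = next_frontier
        hops += 1
    return results


# Three peers in a chain: a - b - c (sample data is invented).
a = Peer("a", [{"gene": "BRCA1"}])
b = Peer("b", [{"gene": "TP53"}])
c = Peer("c", [{"gene": "BRCA1"}])
a.neighbours = [b]
b.neighbours = [a, c]
c.neighbours = [b]

hits = propagate_query(a, lambda r: r["gene"] == "BRCA1", ttl=2)
one_hop = propagate_query(a, lambda r: r["gene"] == "BRCA1", ttl=1)
```

With a hop limit of 2 the query reaches peer c through b; with a limit of 1 it stops at b, so only the local match on a is returned. Choosing that limit is one simple instance of the scale trade-off the slide raises.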
7
(T2) Data Mining: Dr. Herna L Viktor
Multi-relational data mining and link mining
Aim to directly mine a relational database, without extensive preprocessing or “flattening”
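[Figure: entity-relationship diagram of a medical database. Entities: Doctor, Patient, Medicine, Illness. Relationships: consult, has, gives, takes. Attributes shown include name, sin, patientid, address, MedId and Illid.]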
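Why “flattening” is a problem can be shown with a small sketch. It assumes a hypothetical three-table fragment loosely named after the diagram (patient, medicine, and a takes link table); the column names and rows are invented for illustration. It does not implement the multi-relational mining method itself. It only contrasts a flattened join with a query that keeps one row per patient.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient  (patientid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE medicine (medid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE takes    (patientid INTEGER, medid INTEGER);
    INSERT INTO patient  VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO medicine VALUES (10, 'aspirin'), (11, 'insulin'), (12, 'statin');
    INSERT INTO takes    VALUES (1, 10), (1, 11), (1, 12), (2, 10);
""")

# Flattening: one wide table with one row per (patient, medicine) pair,
# so a patient with three medicines is repeated three times.
flat = conn.execute("""
    SELECT p.patientid, p.name, m.name
    FROM patient p
    JOIN takes t    ON t.patientid = p.patientid
    JOIN medicine m ON m.medid = t.medid
""").fetchall()

# Following the link instead: still one row per patient, with the
# related tuples summarised rather than duplicated.
per_patient = conn.execute("""
    SELECT p.patientid, p.name, COUNT(t.medid)
    FROM patient p
    LEFT JOIN takes t ON t.patientid = p.patientid
    GROUP BY p.patientid, p.name
    ORDER BY p.patientid
""").fetchall()

print(len(flat), "flattened rows for", len(per_patient), "patients")
```

In the flattened table the patient is no longer the unit of analysis, and patients with many related tuples are over-represented. Mining the relations directly avoids this.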
8
Data Mining: Dr. Herna L Viktor
Multimedia (2D and 3D) data mining
  Searching for similarities in multimedia databases
  Locating clusters of images, 3D objects
  Classifying images, 3D objects within a cluster
Application
  Anthropometry (poster)
  Health care
  Cultural Heritage
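Searching for similarities in a multimedia database is often done by comparing feature vectors extracted from each object. The sketch below assumes that each image or 3D object has already been reduced to a short numeric vector; the feature values, the object names and the choice of cosine similarity are assumptions for illustration. The slides do not say which similarity measure is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def most_similar(query, database, k=2):
    """Return the names of the k stored objects most similar to `query`."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical feature vectors for three stored objects.
features = {
    "scan_a": [0.9, 0.1, 0.0],
    "scan_b": [0.8, 0.2, 0.1],
    "scan_c": [0.0, 0.1, 0.9],
}
nearest = most_similar([1.0, 0.0, 0.0], features, k=2)
```

Here scan_a and scan_b rank above scan_c because their vectors point in nearly the same direction as the query. The same pairwise measure could also feed a clustering step.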
9
(T3) Software Agents in E-Commerce: Dr. Thomas Tran
The concept of an agent provides a convenient and powerful way to describe a complex software entity that is capable of acting with a certain degree of autonomy in order to accomplish tasks on behalf of its user.
An agent is defined in terms of its behavior.
10
Supporting Decision Making: Dr. Thomas Tran
Designing Intelligent Business Software Agents for E-Commerce
Modeling Trust and Reputation in E-Commerce
Developing Agent-Based Frameworks for Mobile Business
Designing Recommender Systems for E-Commerce
11
(T4) Long-term preservation of data: Dr. Herna L Viktor
The “Digital Rosetta Stone”:
  The lifetime of a digital file is only a few decades
  We might need the digital file in 50+ years
  Our repositories may become “data morgues”, containing data which are in formats that cannot be interpreted by present and future generations.
Towards a solution…
12
Long-term preservation of data: Dr. Herna L Viktor
Research issues
  scalability of information and infrastructure
  managing heterogeneous data sources
  handling updating of hardware and software
  transparent storage, management and retrieval
“to investigate effective ways to store, maintain and analyze digital objects over a very long period of time (50 years +)”
Approach:
  Detachment from original media
  Transparent migration to new technologies
  Emulate old software on new technologies
13
Long-term preservation of data: Dr. Herna L Viktor
Architectural framework
[Figure: architectural framework. Labels: Data acquisition; Archiving; Retrieval, Trend Analysis; Visualization, Exploration; Build metadata (index); Store object and metadata; Retrieve object, metadata (index); Generate visual interface; Generate data store; DBA Agent; IBM DB2 Data Warehouse.]
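The steps named in the framework (build metadata, store object and metadata, retrieve object and metadata) can be sketched in a few lines. This is a minimal illustration under stated assumptions: an in-memory dictionary stands in for the data warehouse, and the class Archive, the function build_metadata and the metadata fields are hypothetical names, not the actual system. The checksum is added here as one common way to detect silent corruption over time.

```python
import hashlib


def build_metadata(name, payload, fmt):
    """Build an index entry describing a digital object (fields are assumed)."""
    return {
        "name": name,
        "format": fmt,  # needed later to pick a migration or emulation path
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


class Archive:
    """In-memory stand-in for the object store and its metadata index."""

    def __init__(self):
        self._objects = {}
        self._index = {}

    def store(self, name, payload, fmt):
        """Store the object together with its metadata."""
        meta = build_metadata(name, payload, fmt)
        self._objects[name] = payload
        self._index[name] = meta
        return meta

    def retrieve(self, name):
        """Return (payload, metadata) after verifying the stored checksum."""
        payload = self._objects[name]
        meta = self._index[name]
        if hashlib.sha256(payload).hexdigest() != meta["sha256"]:
            raise ValueError(f"checksum mismatch for {name}")
        return payload, meta


archive = Archive()
archive.store("scan-001", b"\x00\x01\x02", fmt="raw-3d-scan")
payload, meta = archive.retrieve("scan-001")
```

Keeping the format in the metadata rather than inferring it from the file is what allows a later migration step to find every object written in an obsolete format.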
14
(T5) Evolving E-Health Business Processes Around Accessible Data Warehouses: Dr. Liam Peyton
Goals
  Process improvement to take advantage of e-technologies and Data warehouse (DW)
  Methodology to specify, automate, manage, and analyze DW-oriented e-health processes
  Addresses privacy, confidentiality, quality, and consent, as well as heavy legacy (and often manual) processes and regulatory environments
Activities
  Simulation of Ottawa Hospital Data Warehouse and environment
  Business Intelligence prototype – Infection control data mart, Discharge process data mart
  Quality Assurance Framework and Portal
15
Assessment Framework Tied to Operational Systems, Performance Mgt & Data Warehouse Strategy
[Figure: assessment framework. Labels: Stakeholders; Goals; Tasks; Use Case Maps; Business Systems & Processes; Performance Mgt Systems & Processes; Data Warehouse; Reports; PIQ.]
PIQ measures the effectiveness of reports in measuring how well the organization is meeting its goals.
16
In Summary: Vast, evolving repositories…
17
Google in 2003 had between 2 and 5 petabytes of hard-disk storage. A more recent calculation, dated June 27, 2006, suggests that the Google cluster may now have 4 petabytes of RAM, on the same order of magnitude as the quantity of hard disk space that was estimated only three years earlier.
As of October 15, 2005, all the files being shared on Kazaa totaled around 54 petabytes.
15 petabytes of data will be generated each year in particle physics experiments using CERN’s Large Hadron Collider, due to be launched in 2007.
In 2007, NOAA maintains approximately 1 petabyte of climate data. NOAA expects that its Comprehensive Large Array-data Stewardship System (CLASS) library will hold 20 petabytes of data by 2011 and 140 petabytes by 2020.
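The NOAA figures imply very different growth rates for the two periods, which a short calculation makes concrete. The sketch assumes smooth compound growth between the quoted years, which is a simplification, and uses the binary convention of 2^50 bytes per petabyte.

```python
PETABYTE = 2 ** 50  # bytes, binary convention


def annual_growth(start, end, years):
    """Compound annual growth rate that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years) - 1


# Figures quoted on the slide, in petabytes.
g_2007_2011 = annual_growth(1, 20, 2011 - 2007)
g_2011_2020 = annual_growth(20, 140, 2020 - 2011)
bytes_2020 = 140 * PETABYTE

print(f"2007-2011: {g_2007_2011:.0%} per year")
print(f"2011-2020: {g_2011_2020:.0%} per year")
print(f"Projected 2020 holdings: {bytes_2020:,} bytes")
```

Under this assumption the archive slightly more than doubles every year until 2011, then grows by roughly a quarter per year until 2020.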
18
In Summary: Vast, evolving repositories…
Our research aims to develop new, efficient ways to manage, share and analyze such data
19
Graduate students: Dr. Thomas Tran
Grad Students:
  Richong Zhang (PhD)
  Zhiyong Weng (MCS)
  Vikas Kumar (MCS)
  Xiaoguang Ma (MCS)
  Tapu Kumar Ghose (MCS)
  Catherine Cormier (MSc)
  Hong Chen (MSc)
  Bo Zhan (MCS, co-supervised with Prof. Liam Peyton)
  Yao Gu (MCS, part time)
20
Graduate students and their projects: Dr. Herna L Viktor
  Hongyu Guo (PhD): Multi-view learning
  Rana Awada (PhD): XML database mining (prelim)
  Nadia Azam (M.Sc.): Link-based clustering
  Bo Wang (M.Sc.): A storage resource broker agent for long-term preservation
  Divine Muhivu (M.Sc.): Data integration through link mining
  Isis Pena Sanchez (M.Sc.): Interestingness measurements for data mining
  Minjie Shao (M.Sc.): Mining the adverse effects of medication
  Xiaomei Xia (M.Sc.): Distributed data warehouse query processing
Joining us: Julie Doyle, PhD - Long-term preservation of data
Collaborations: NRC, Faculty of Management
21
Graduate students: Dr. Liam Peyton
Masters Students:
  Sepideh Ghanavati
  Pierre Seguin
  Bo Zhan
Collaboration with
  Prof. Daniel Amyot (Ottawa)
  Prof. Greg Richards (Ottawa)
  Prof. Michael Weiss (Carleton)
  Dr. Alan Forster (Ottawa Hospital)
22
Graduate students and collaborations: Dr. Iluju Kiringa
Have implemented an experimental peer DBMS
This is joint work with
  Renee Miller (Toronto)
  John Mylopoulos (Toronto & Trento)
  Vasiliki Kantere (Athens -- NTUA)
  Anastasios Kementsietsidis (Edinburgh)
  Several students in Toronto
    Lei Jiang
    Dan Zhao
    Patricia Rodriguez
  and Ottawa:
    Mehedi Masud
    Anisur Rahman
    Irfan Maki
  Several alumni …
More (strong) students are needed!!!!!
Here is a link to visit: http://www.cs.toronto.edu/db/hyperion