Updated TOP List of Data Mining IEEE Project DotNet and JAVA 2016-17 for ME/MTech,BE/BTech Final...
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a
computationally intensive task with a large range of applications such as knowledge discovery or data
mining. However, as the volume and the dimension of data increase, only distributed approaches can
perform such costly operation in a reasonable time. Recent works have focused on implementing
efficient solutions using the MapReduce programming model because it is suitable for distributed large
scale data processing. Although these works provide different solutions to the same problem, each one
has particular constraints and properties. In this paper, we compare the different existing approaches
for computing kNN on MapReduce, first theoretically, and then by performing an extensive
experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN
computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze
each step from load balancing, accuracy and complexity aspects. Experiments in this paper use a variety
of datasets, and analyze the impact of data volume, data dimension and the value of k from many
perspectives like time and space complexity, and accuracy. The experimental part brings new
advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this
is the first paper that compares kNN computing methods on MapReduce both theoretically and
experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based
practical problems in the context of big data.
ETPL
DM - 001 K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical
and Experimental Analysis
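As a concrete reference point for the kNN operation compared in this entry, here is a minimal single-machine sketch of the per-block step that a mapper would perform; the names and structure are illustrative assumptions, not code from any of the surveyed MapReduce solutions.

```java
import java.util.*;

// Hypothetical sketch of the per-partition step of a kNN join: a mapper holding
// one block of S emits, for a query point q, the k closest points in its block;
// a reducer would then merge these partial candidate lists into the global kNN.
class KnnBlock {
    // Euclidean distance between two points of equal dimension.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Returns the indices of the k nearest points to q within this block of S.
    static List<Integer> localKnn(double[][] block, double[] q, int k) {
        Integer[] idx = new Integer[block.length];
        for (int i = 0; i < block.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> dist(block[i], q)));
        return Arrays.asList(idx).subList(0, Math.min(k, idx.length));
    }
}
```

The quadratic cost of this brute-force step is exactly what motivates the data partitioning strategies the paper compares.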
High utility itemsets (HUIs) mining is an emerging topic in data mining, which refers to discovering
all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However,
setting min_util appropriately is a difficult problem for users. Generally speaking, finding an
appropriate minimum utility threshold by trial and error is a tedious process for users. If min_util is set
too low, too many HUIs will be generated, which may cause the mining process to be very inefficient.
On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we
address the above issues by proposing a new framework for top-k high utility itemset mining, where k
is the desired number of HUIs to be mined. Two types of efficient algorithms named TKU (mining
Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) are proposed for mining
such itemsets without the need to set min_util. We provide a structural comparison of the two
algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and
synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal
case of state-of-the-art utility mining algorithms.
ETPL
DM - 002 Efficient Algorithms for Mining Top-K High Utility Itemsets
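The border-raising idea behind top-k mining without a preset min_util can be illustrated with a small sketch (an assumption-level illustration, not the authors' TKU/TKO implementations): a min-heap of the k best utilities seen so far yields an internal threshold that only rises as mining proceeds.

```java
import java.util.*;

// Hypothetical sketch: maintain the k highest candidate utilities in a min-heap;
// the heap minimum serves as an automatically raised border min_util that can be
// used to prune candidate itemsets, so the user never sets a threshold manually.
class TopKBorder {
    private final PriorityQueue<Integer> heap = new PriorityQueue<>();
    private final int k;

    TopKBorder(int k) { this.k = k; }

    // Offer a candidate utility; returns the current internal border threshold.
    int offer(int utility) {
        if (heap.size() < k) heap.add(utility);
        else if (utility > heap.peek()) { heap.poll(); heap.add(utility); }
        // Border stays 0 until k candidates have been seen, then equals the k-th best.
        return heap.size() < k ? 0 : heap.peek();
    }
}
```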
Textual documents created and distributed on the Internet are ever changing in various forms. Most
existing works are devoted to topic modelling and the evolution of individual topics, while sequential
relations of topics in successive documents published by a specific user are ignored. In this paper, in
order to characterize and detect personalized and abnormal behaviours of Internet users, we propose
Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential
Topic Patterns (URSTPs) in document streams on the Internet. Such patterns are rare on the whole but
relatively frequent for specific users, and so can be applied in many real-life scenarios, such as real-time
monitoring of abnormal user behaviours. We present a group of algorithms to solve this innovative mining
problem through three phases: pre-processing to extract probabilistic topics and identify sessions for
different users, generating all the STP candidates with (expected) support values for each user by
pattern-growth, and selecting URSTPs by making user-aware rarity analysis on derived STPs.
Experiments on both real (Twitter) and synthetic datasets show that our approach can indeed discover
special users and interpretable URSTPs effectively and efficiently, which significantly reflect users’
characteristics.
ETPL
DM - 003 Mining User-Aware Rare Sequential Topic Patterns in Document
Streams
Sequence classification is an important task in data mining. We address the problem of sequence
classification using rules composed of interesting patterns found in a dataset of labelled sequences and
accompanying class labels. We measure the interestingness of a pattern in a given class of sequences
by combining the cohesion and the support of the pattern. We use the discovered patterns to generate
confident classification rules, and present two different ways of building a classifier. The first classifier
is based on an improved version of the existing method of classification based on association rules,
while the second ranks the rules by first measuring their value specific to the new data object.
Experimental results show that our rule based classifiers outperform existing comparable classifiers in
terms of accuracy and stability. Additionally, we test a number of pattern feature based models that use
different kinds of patterns as features to represent each sequence as a feature vector. We then apply a
variety of machine learning algorithms for sequence classification, experimentally demonstrating that
the patterns we discover represent the sequences well, and prove effective for the classification task.
ETPL
DM - 004 Pattern Based Sequence Classification
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional
discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our
proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit
abnormal patterns. In many applications this can lead to better understanding of the nature of the
atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case
where the atypical patterns manifest on only a small (salient) subset of the very high-dimensional feature
space. Individual AD techniques and techniques that detect anomalies using all the features typically
fail to detect such anomalies, but our method can detect such instances collectively, discover the shared
anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we
focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on
topic models. Results of our experiments show that our method can accurately detect anomalous topics
and salient features (words) under each such topic in a synthetic data set and two real-world text corpora
and achieves better performance compared to both standard group AD and individual AD techniques.
All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.
ETPL
DM - 005 ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
Some important data management and analytics tasks cannot be completely addressed by automated
processes. These “computer-hard” tasks such as entity resolution, sentiment analysis, and image
recognition, can be enhanced through the use of human cognitive ability. Human Computation is an
effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd).
Thus, crowdsourced data management has become an area of increasing interest in research and
industry. There are three important problems in crowdsourced data management. (1) Quality Control:
Workers may return noisy results and effective techniques are required to achieve high quality; (2)
Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency
Control: The human workers can be slow, particularly in contrast to computing time scales, so latency-
control techniques are required. There has been significant work addressing these three factors for
designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing
plans of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies
on crowdsourced data management. Based on this analysis we then outline key factors that need to be
considered to improve crowdsourced data management.
ETPL
DM - 006 Crowdsourced Data Management: A Survey
Since Jeff Howe introduced the term Crowdsourcing in 2006, this human-powered problem-solving
paradigm has gained a lot of attention and has been a hot research topic in the field of Computer
Science. Even though a lot of work has been conducted on this topic, so far we do not have a
comprehensive survey on most relevant work done in crowdsourcing field. In this paper, we aim to
offer an overall picture of the current state of the art techniques in general-purpose crowdsourcing.
According to their focus, we divide this work into three parts, which are: incentive design, task
assignment and quality control. For each part, we start with different problems faced in that area
followed by a brief description of existing work and a discussion of pros and cons. In addition, we also
present a real scenario on how the different techniques are used in implementing a location-based
crowdsourcing platform, gMission. Finally, we highlight the limitations of the current general-purpose
crowdsourcing techniques and present some open problems in this area.
ETPL
DM - 007 A Survey of General-Purpose Crowdsourcing Techniques
General health examination is an integral part of healthcare in many countries. Identifying the participants
at risk is important for early warning and preventive intervention. The fundamental challenge of learning a
classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected
dataset. Particularly, the unlabeled data describes the participants in health examinations whose health
conditions can vary greatly from healthy to very-ill. There is no ground truth for differentiating their states
of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health
(Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively
developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed
and the proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
ETPL
DM - 008 Mining Health Examination Records — A Graph-based Approach
Twitter has become one of the largest microblogging platforms for users around the world to share
anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers
a surge of relevant tweets within a short period of time, which often reflects important events of mass
interest. How to leverage Twitter for early detection of bursty topics has therefore become an important
research problem with immense practical value. Despite the wealth of research work on topic
modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real-time. As existing
methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper
TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time
detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experiment
results show both the efficiency and the effectiveness of our approach. In particular, we demonstrate that
TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is
on the same scale as the total number of daily tweets on Twitter, and can present bursty events at finer
granularity.
ETPL
DM - 009 TopicSketch: Real-time Bursty Topic Detection from Twitter
The development of a topic in a set of topic documents is constituted by a series of person interactions
at a specific time and place. Knowing the interactions of the persons mentioned in these documents is
helpful for readers to better comprehend the documents. In this paper, we propose a topic person
interaction detection method called SPIRIT, which classifies the text segments in a set of topic
documents that convey person interactions. We design the rich interactive tree structure to represent
syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based
convolution kernel to identify interactive segments. Experiment results based on real world topics
demonstrate that the proposed rich interactive tree structure effectively detects the topic person
interactions and that our method outperforms many well-known relation extraction and protein-protein
interaction methods.
ETPL
DM - 010 SPIRIT: A Tree Kernel-based Method for Topic Person Interaction
Detection
The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the
detection of spatial events when smartphone users move around in their daily lives. However, the
credibility of those detected events can be negatively impacted by unreliable participants with low-
quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., to
discover true events from diverse and noisy participants' reports. This problem is uniquely distinct from
its online counterpart in that it involves uncertainties in both participants' mobility and reliability.
Decoupling these two types of uncertainties through location tracking will raise severe privacy and
energy issues, whereas simply ignoring missing reports or treating them as negative reports will
significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised
models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events
(PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths
of events, and three-way participant reliability in a unified framework. In PTSE, we further model
personal location visit tendencies. These proposed models are capable of effectively handling various
types of uncertainties and automatically discovering truths without any supervision or location
tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed
models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing
environment.
ETPL
DM - 011 Truth Discovery in Crowdsourced Detection of Spatial Events
Feature selection is a challenging problem for high dimensional data processing, which arises in many
real applications such as data mining, information retrieval, and pattern recognition. In this paper, we
study the problem of unsupervised feature selection. The problem is challenging due to the lack of
label information to guide feature selection. We formulate the problem of unsupervised feature
selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the
selected features not only preserve the local structure of the original data space via graph regularization,
but also approximately reconstruct each data point via linear combination. Therefore, the graph
regularized data reconstruction error becomes a natural criterion for measuring the quality of the
selected features. By minimizing the reconstruction error, we are able to select the features that best
preserve both the similarity and discriminant information in the original data. We then develop an
efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the
performance of our proposed algorithm on text clustering. The extensive experiments demonstrate the
effectiveness of our proposed approach.
ETPL
DM - 012 Graph Regularized Feature Selection with Data Reconstruction
The last few years have witnessed the emergence and evolution of a vibrant research stream on a large
variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical users
among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help
solve many problems in social computing in both theory and applications. Since public profiles can be
duplicated and easily impersonated by users with different purposes, most current user identification
resolutions, which mainly focus on text mining of users’ public profiles, are fragile. Some studies have
attempted to match users based on the location and timing of user content as well as writing style.
However, the locations are sparse in the majority of SMNs, and writing style is difficult to discern from
the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online
SMNs are quite symmetric, existing user identification schemes based on network structure are not
effective. The real-world friend cycle is highly individual and virtually no two users share a congruent
friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform
SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we
propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a
match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are
considered as identical users. We also developed two propositions to improve the efficiency of the
algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current
network structure-based algorithms.
ETPL
DM - 013 Cross-Platform Identification of Anonymous Identical Users in Multiple
Social Media Networks
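The match-degree notion at the heart of FRUI can be sketched as follows; the data structures and method names here are hypothetical, not the paper's implementation.

```java
import java.util.*;

// Hypothetical illustration of the core FRUI idea: the match degree of a
// candidate pair (a, b) counts how many of a's friends in network A are
// already matched to friends of b in network B. Pairs with top-ranked
// degrees are then accepted as identical users.
class MatchDegree {
    static int matchDegree(Set<String> friendsA, Set<String> friendsB,
                           Map<String, String> matchedAtoB) {
        int degree = 0;
        for (String f : friendsA) {
            String mapped = matchedAtoB.get(f);  // f's identity in network B, if already matched
            if (mapped != null && friendsB.contains(mapped)) degree++;
        }
        return degree;
    }
}
```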
Taxonomy learning is an important task for knowledge acquisition, sharing, and classification as well
as application development and utilization in various domains. To reduce human effort to build a
taxonomy from scratch and improve the quality of the learned taxonomy, we propose a new taxonomy
learning approach, named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy.
First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph
representing how such concepts are associated together based on their co-occurrences. As the key
method in TaxoFinder, we propose a method for measuring associative strengths among the concepts,
which quantify how strongly they are associated in the graph, using similarities between sentences and
spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a
graph analytic algorithm. TaxoFinder aims to build a taxonomy that maximizes the overall associative
strengths among the concepts in the graph. We evaluate
TaxoFinder using gold-standard evaluation on three different domains: emergency management for
mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder
with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach
significantly outperforming the subsumption method.
ETPL
DM - 014 TaxoFinder: A Graph-Based Approach for Taxonomy Learning
As more and more applications produce streaming data, clustering data streams has become an
important technique for data and knowledge engineering. A typical approach is to summarize the data
stream in real-time with an online process into a large number of so called micro-clusters. Micro-
clusters represent local density estimates by aggregating the information of many data points in a
defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline
step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-
clusters are used as pseudo points with the density estimates used as their weights. However,
information about density in the area between micro-clusters is not preserved in the online process and
reclustering is based on possibly inaccurate assumptions about the distribution of data within and
between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-
cluster-based online clustering component that explicitly captures the density between micro-clusters
via a shared density graph. The density information in this graph is then exploited for reclustering
based on actual density between adjacent micro-clusters. We discuss the space and time complexity of
maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets
highlight that using shared density improves clustering quality over other popular data stream
clustering methods which require the creation of a larger number of smaller micro-clusters to achieve
comparable results.
ETPL
DM - 015 Clustering Data Streams Based on Shared Density between Micro-
Clusters
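The online micro-clustering step described above can be sketched in simplified form (names are assumptions; weight decay over time and the shared density graph that distinguish DBSTREAM are omitted): a new point is absorbed by the nearest micro-cluster within radius r, otherwise it seeds a new micro-cluster.

```java
import java.util.*;

// Simplified micro-cluster maintenance sketch in the spirit of stream clustering:
// each micro-cluster is a weighted center acting as a local density estimate.
class MicroClusters {
    static class MC {
        double[] center; double weight;
        MC(double[] c) { center = c; weight = 1; }
    }

    final List<MC> clusters = new ArrayList<>();
    final double r;  // micro-cluster radius

    MicroClusters(double r) { this.r = r; }

    void insert(double[] p) {
        MC best = null; double bestD = Double.MAX_VALUE;
        for (MC mc : clusters) {
            double s = 0;
            for (int i = 0; i < p.length; i++) { double d = p[i] - mc.center[i]; s += d * d; }
            double dist = Math.sqrt(s);
            if (dist < bestD) { bestD = dist; best = mc; }
        }
        if (best != null && bestD <= r) {
            best.weight += 1;                        // absorb: bump the density estimate
            for (int i = 0; i < p.length; i++)       // drift the center toward the point
                best.center[i] += (p[i] - best.center[i]) / best.weight;
        } else {
            clusters.add(new MC(p.clone()));         // seed a new micro-cluster
        }
    }
}
```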
Social media networks are dynamic. As such, the order in which network ties develop is an important
aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal
Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network
analysis. The proposed model focuses on how the nodal attributes of a network affect the order in
which the network ties develop. Temporal patterns in social media networks are modeled based on the
nodal attributes of individuals and the time information of network ties. Using social media data
collected from a knowledge sharing community, empirical tests were conducted to evaluate the
performance of the NATERGM on identifying the temporal patterns and predicting the characteristics
of the future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing
capability and an increased prediction accuracy of network characteristics compared to benchmark
models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation
process of dynamic networks.
ETPL
DM - 016 NATERGM: A Model for Examining the Role of Nodal Attributes in
Dynamic Social Media Networks
Graph classification aims to learn models to classify structure data. To date, all existing graph
classification methods are designed to target one single learning task and require a large number of
labeled samples for learning good classification models. In reality, each real-world task may only have
a limited number of labeled samples, yet multiple similar learning tasks can provide useful knowledge
to benefit all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG)
problem, where multiple graph classification tasks are jointly regularized to find discriminative
subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that with a limited
number of training samples, subgraph features selected for one single graph classification task tend to
overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize
multiple tasks to explore high quality subgraph features for graph classification. To achieve this goal,
we formulate an objective function which combines multiple graph classification tasks to evaluate the
informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task
learning process is further proposed to incrementally select subgraph features for graph classification.
Experiments on real-world multi-task graph classification datasets demonstrate significant
performance gain.
ETPL
DM - 018 Joint Structure Feature Exploration and Regularization for Multi-Task
Graph Classification
Resource Description Framework (RDF) has been widely used in the Semantic Web to describe
resources and their relationships. The RDF graph is one of the most commonly used representations
for RDF data. However, in many real applications such as the data extraction/integration, RDF graphs
integrated from different data sources may often contain uncertain and inconsistent information (e.g.,
uncertain labels or that violate facts/rules), due to the unreliability of data sources. In this paper, we
formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies
and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-
aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves
subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and
with high quality scores (considering both consistency and uncertainty). In order to efficiently answer
QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and
quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an
effective index to facilitate our proposed pruning methods, and propose an efficient approach for
processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our
proposed approaches through extensive experiments.
ETPL
DM - 017 Quality-Aware Subgraph Matching Over Inconsistent Probabilistic
Graph Databases
In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The
proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently
unifies both textual and semantic features into an ER blocking process. In order to understand how
similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity
metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of
records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces.
In doing so, the proposed framework can support efficient similarity searches on records in both textual
and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the
proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking
techniques. Our experimental study shows that the combination of semantic similarity and textual
similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic
nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking
for performing entity resolution tasks in a large-scale data environment.
ETPL
DM - 020 Semantic-Aware Blocking for Entity Resolution
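The LSH side of such a blocking framework can be sketched with minhash banding over token sets (textual similarity only; the semantic similarity spaces the paper integrates are not modeled here, and all names are assumptions): records whose token sets produce the same band key land in the same block and are compared, and everything else is skipped.

```java
import java.util.*;

// Minhash-based blocking sketch: similar token sets are likely to share a
// minimum hash under a random-ish per-seed hash, so equal band keys group
// probable duplicates into the same ER block.
class LshBlocking {
    static int minhash(Set<String> tokens, int seed) {
        int min = Integer.MAX_VALUE;
        for (String t : tokens) {
            int h = (t + "#" + seed).hashCode();  // cheap per-seed hash; a real system uses stronger hashes
            if (h < min) min = h;
        }
        return min;
    }

    // Band key: concatenation of several minhashes; equal keys => same block.
    static String bandKey(Set<String> tokens, int bandSize) {
        StringBuilder sb = new StringBuilder();
        for (int s = 0; s < bandSize; s++) sb.append(minhash(tokens, s)).append('|');
        return sb.toString();
    }
}
```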
Introducing recent advances in the machine learning techniques to state-of-the-art discrete choice
models, we develop an approach to infer the unique and complex decision making process of a
decision-maker (DM), which is characterized by the DM’s priorities and attitudinal character, along
with the attributes interaction, to name a few. On the basis of exemplary preference information in the
form of pairwise comparisons of alternatives, our method seeks to induce a DM’s preference model in
terms of the parameters of recent discrete choice models. To this end, we reduce our learning function
to a constrained non-linear optimization problem. Our learning approach is a simple one that takes into
consideration the interaction among the attributes along with the priorities and the unique attitudinal
character of a DM. The experimental results on standard benchmark datasets suggest that our approach
is not only intuitively appealing and easily interpretable but also competitive to state-of-the-art
methods.
ETPL
DM - 021 On Learning of Choice Models with Interactive Attributes
In many applications, there is a need to identify to which of a group of sets an element x belongs, if
any. For example, in a router, this functionality can be used to determine the next hop of an incoming
packet. This problem is generally known as set separation and has been widely studied. Most existing
solutions make use of hash-based algorithms, particularly when a small percentage of false positives
is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can
require several memory accesses, a significant limitation for some implementations. We propose an
approach using Block Bloom Filters, where each element is first hashed to a single memory block that
stores a small Bloom filter that tracks the element and the set or sets the element belongs to. In a naïve
solution, when an element x in a set S is stored, it necessarily increases the false positive probability
for finding that x is in another set T. In this paper, we introduce our One Memory Access Set Separation
(OMASS) scheme to avoid this problem. OMASS is designed so that for a given element x, the
corresponding Bloom filter bits for each set map to different positions in the memory word. This
ensures that the false positive rates for the Bloom filters for element x under other sets are not affected.
In addition, OMASS requires fewer hash functions compared to the naïve solution.
ETPL
DM - 022 OMASS: One Memory Access Set Separation
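The word-level separation idea attributed to OMASS can be illustrated with a simplified sketch (parameters and layout are assumptions, not the paper's design): each set owns a disjoint group of bit positions within the memory word, so inserting an element for one set can never set bits in another set's region.

```java
// Simplified one-memory-access set-separation sketch: an element hashes to one
// 64-bit word; within that word, set s owns its own 8-bit region, so the false
// positive rate of x for set T is unaffected by storing x for set S.
class WordBloom {
    static final int WORD_BITS = 64;
    static final int BITS_PER_SET = 8;   // each set owns an 8-bit region (supports 8 sets)

    long[] words = new long[1024];

    // Two filter bits for element x under a given set, both inside the set's region.
    long mask(Object x, int set) {
        int h = x.hashCode() * (31 + set);
        int base = (set * BITS_PER_SET) % WORD_BITS;
        int b1 = base + (h & (BITS_PER_SET - 1));
        int b2 = base + ((h >>> 8) & (BITS_PER_SET - 1));
        return (1L << b1) | (1L << b2);
    }

    void add(Object x, int set) {
        words[Math.floorMod(x.hashCode(), words.length)] |= mask(x, set);
    }

    boolean mightContain(Object x, int set) {
        long w = words[Math.floorMod(x.hashCode(), words.length)];
        long m = mask(x, set);
        return (w & m) == m;
    }
}
```

A lookup touches exactly one word of memory, which is the property the scheme is named for.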
Items shared through Social Media may affect more than one user's privacy—e.g., photos that depict
multiple users, comments that mention multiple users, events in which multiple users are invited, etc.
The lack of multi-party privacy management support in current mainstream Social Media
infrastructures makes users unable to appropriately control to whom these items are actually shared or
not. Computational mechanisms that are able to merge the privacy preferences of multiple users into a
single policy for an item can help solve this problem. However, merging multiple users’ privacy
preferences is not an easy task, because privacy preferences may conflict, so methods to resolve
conflicts are needed. Moreover, these methods need to consider how users would actually reach an
agreement about a solution to the conflict in order to propose solutions that are acceptable to all of
the users affected by the item to be shared. Current approaches are either too demanding or only
consider fixed ways of aggregating privacy preferences. In this paper, we propose the first
computational mechanism to resolve conflicts for multi-party privacy management in Social Media
that is able to adapt to different situations by modelling the concessions that users make to reach a
solution to the conflicts. We also present results of a user study in which our proposed mechanism
outperformed other existing approaches in terms of how many times each approach matched users’
behaviour.
ETPL
DM - 023 Resolving Multi-party Privacy Conflicts in Social Media
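As a rough illustration of merging multi-party preferences, the sketch below models one simple concession rule: a denying user concedes only if their stated sensitivity for the item is low. This is our own toy rule, not the paper's mechanism, and all names and thresholds are hypothetical.

```python
# Toy multi-party policy merge: on a conflict over a viewer, every user
# who voted "deny" must be willing to concede (sensitivity below a
# threshold) for the item to be shared with that viewer.

def merge_policies(preferences, sensitivity, concede_below=0.5):
    """preferences: {user: {viewer: 'allow' | 'deny'}}
    sensitivity: {user: float in [0, 1]} -- how much the user cares."""
    viewers = {v for prefs in preferences.values() for v in prefs}
    policy = {}
    for v in viewers:
        votes = [prefs.get(v, "allow") for prefs in preferences.values()]
        if all(vote == "allow" for vote in votes):
            policy[v] = "allow"
        else:
            # conflict: check whether every denier would concede
            deniers = [u for u, prefs in preferences.items()
                       if prefs.get(v) == "deny"]
            if all(sensitivity[u] < concede_below for u in deniers):
                policy[v] = "allow"
            else:
                policy[v] = "deny"
    return policy
```

The paper's contribution is precisely that it adapts the concession model to the situation rather than fixing one aggregation rule as this sketch does.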
Data exchange is the process of generating an instance of a target schema from an instance of a source
schema such that source data is reflected in the target. Generally, data exchange is performed using
a schema mapping representing high-level relations between the source and target schemas. In this paper,
we argue that data exchange solely based on schema level information limits the ability to express
semantics in data exchange. We show that such schema-level mappings may not only result in entity
fragmentation but are also unable to resolve some ambiguous data exchange scenarios. To address this
problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on
data and schema mapping that employs similarities between relation trees of source and target relations
to find the best relations that can host source instances. Our experiments show SEDEX outperforms
other methods in terms of quality and scalability of data exchange.
ETPL
DM - 024 SEDEX: Scalable Entity Preserving Data Exchange
Despite recent advances in distributed RDF data management, processing large amounts of RDF data
in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes
rich and complex graphs mixing both instance and schema-level data. Sharding such data using
classical techniques or partitioning the graph using traditional min-cut algorithms leads to very
inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud,
an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous
approaches, DiploCloud runs a physiological analysis of both instance and schema information prior
to partitioning the data. In this paper, we describe the architecture of DiploCloud, its main data
structures, as well as the new algorithms we use to partition and distribute data. We also present an
extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster
than state-of-the-art systems on standard workloads.
ETPL
DM - 025 DiploCloud: Efficient and Scalable Management of RDF Data in the
Cloud
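One of the partitioning problems the abstract mentions can be shown with a tiny sketch: if all triples of a subject are colocated on one worker, star-shaped query patterns need no distributed joins. This is only a simplified illustration of subject colocation; DiploCloud itself derives its partitions from combined instance and schema analysis.

```python
# Colocate all triples of a subject on one worker so star queries on that
# subject can be answered locally, without cross-worker joins.

def hash_s(s):
    return sum(s.encode())  # toy deterministic string hash

def partition_triples(triples, num_workers):
    workers = [[] for _ in range(num_workers)]
    for s, p, o in triples:
        workers[hash_s(s) % num_workers].append((s, p, o))
    return workers

def star_query(worker, subject):
    # all triples of `subject` live on a single worker
    return [(p, o) for (s, p, o) in worker if s == subject]
```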
The rapid advance of location acquisition technologies boosts the generation of trajectory data, which track
the traces of moving objects. A trajectory is typically represented by a sequence of timestamped
geographical locations. A wide spectrum of applications can benefit from trajectory data mining.
While bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this
paper, we survey various applications of trajectory data mining, e.g., path discovery, location
prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive
collection of existing trajectory data mining techniques and discusses them in a framework of trajectory
data mining. This framework and the survey can be used as a guideline for designing future trajectory
data mining solutions.
ETPL
DM - 026 A Survey on Trajectory Data Mining: Techniques and Applications
In this paper, we consider a new insider threat for the privacy preserving work of distributed kernel-
based data mining (DKBDM), such as distributed support vector machine. Among several known data
breaching problems, those associated with insider attacks have been rising significantly, making this
one of the fastest growing types of security breaches. Once considered a negligible concern, insider
attacks have risen to be one of the top three central data violations. Insider-related research involving
the distribution of kernel-based data mining is limited, resulting in substantial vulnerabilities in
designing protection against collaborative organizations. Prior works often fall short by addressing a
multifactorial model that is more limited in scope and implementation than addressing insiders within
an organization colluding with outsiders. A faulty system allows collusion to go unnoticed when an
insider shares data with an outsider, who can then recover the original data from message transmissions
(intermediary kernel values) among organizations. This attack requires only accessibility to a few data
entries within the organizations rather than requiring the encrypted administrative privileges typically
found in distributed data mining scenarios. To the best of our knowledge, we are the first to
explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of
insider data necessary to launch the insider attack. Finally, we follow up by introducing several
proposed privacy-preserving schemes to counter the described attack.
ETPL
DM - 027 Insider Collusion Attack on Privacy-Preserving Kernel-Based Data
Mining Systems
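The leak described above is easiest to see with a *linear* kernel, where the sketch below specialises it: the shared kernel values k(x, b_i) = ⟨x, b_i⟩ are linear equations in the unknown point x, so an outsider colluding with an insider who knows a few points b_i can solve for x directly. This is our own minimal 2-D demonstration, not the paper's attack construction.

```python
# With a linear kernel, leaked kernel evaluations against known points are
# linear equations in the victim's data point.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recover_2d(b1, b2, k1, k2):
    # solve <x, b1> = k1 and <x, b2> = k2 by Cramer's rule (2-D case)
    det = b1[0] * b2[1] - b1[1] * b2[0]
    x0 = (k1 * b2[1] - k2 * b1[1]) / det
    x1 = (b1[0] * k2 - b2[0] * k1) / det
    return (x0, x1)

# the "victim" point, never transmitted directly
secret = (3.0, -2.0)
# insider's known points and the observed intermediary kernel values
b1, b2 = (1.0, 0.0), (1.0, 1.0)
k1, k2 = dot(secret, b1), dot(secret, b2)
recovered = recover_2d(b1, b2, k1, k2)
```

In d dimensions the same recovery needs only d linearly independent insider points, which matches the abstract's point that very few data entries suffice.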
Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the
algorithm is used in many other areas like bioinformatics, chemistry, and market basket analysis.
Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we
present a novel parallel algorithm for mining of frequent sequences based on a static load-balancing.
The static load-balancing is done by measuring the computational time using a probabilistic algorithm.
For instances of reasonable size, the algorithms achieve speedups of up to P, where P is the number of
processors. In the experimental evaluation, we show that our method performs significantly better than
the current state-of-the-art methods. The presented approach is highly general: it can be used for static
load-balancing of other pattern mining algorithms such as itemset/tree/graph mining algorithms.
ETPL
DM - 028 Probabilistic Static Load-Balancing of Parallel Mining of Frequent
Sequences
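The general pattern behind static load balancing can be sketched as follows: estimate each subproblem's cost first (in the paper, via a probabilistic timing algorithm), then assign subproblems to processors before the parallel phase starts. The greedy longest-processing-time rule below is a standard stand-in for the assignment step, not the paper's algorithm.

```python
import heapq

def static_balance(costs, num_procs):
    """costs: {task: estimated_time}; returns {proc: [tasks]}."""
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # place the most expensive tasks first (LPT heuristic), always onto
    # the currently least-loaded processor
    for task in sorted(costs, key=costs.get, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[p].append(task)
        heapq.heappush(heap, (load + costs[task], p))
    return assignment
```

The quality of the balance hinges entirely on the cost estimates, which is why the paper invests in a probabilistic estimator for the mining subproblems.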
As more and more applications produce streaming data, clustering data streams has become an
important technique for data and knowledge engineering. A typical approach is to summarize the data
stream in real-time with an online process into a large number of so called micro-clusters. Micro-
clusters represent local density estimates by aggregating the information of many data points in a
defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline
step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-
clusters are used as pseudo points with the density estimates used as their weights. However,
information about density in the area between micro-clusters is not preserved in the online process and
reclustering is based on possibly inaccurate assumptions about the distribution of data within and
between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-
cluster-based online clustering component that explicitly captures the density between micro-clusters
via a shared density graph. The density information in this graph is then exploited for reclustering
based on actual density between adjacent micro-clusters. We discuss the space and time complexity of
maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets
highlight that using shared density improves clustering quality over other popular data stream
clustering methods which require the creation of a larger number of smaller micro-clusters to achieve
comparable results.
ETPL
DM - 029 Clustering Data Streams Based on Shared Density between Micro-
Clusters
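A minimal sketch of the shared-density idea follows; it is a drastic simplification of DBSTREAM (fixed radius, no decay, no center updates), with all names and parameters our own. A point falling inside the radius of two micro-clusters increments a shared-density counter for that pair, and reclustering merges pairs whose counter is high.

```python
RADIUS = 1.0

def assign(point, centers, shared):
    """centers: list of (x, y); shared: dict[(i, j)] -> overlap count."""
    hits = [i for i, c in enumerate(centers)
            if (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 <= RADIUS ** 2]
    if not hits:
        centers.append(point)          # open a new micro-cluster
    elif len(hits) >= 2:               # point lies in an overlap region
        for a in hits:
            for b in hits:
                if a < b:
                    shared[(a, b)] = shared.get((a, b), 0) + 1
    return hits

def recluster(n, shared, min_shared=2):
    # union micro-clusters connected by strong shared density
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for (a, b), cnt in shared.items():
        if cnt >= min_shared:
            parent[find(a)] = find(b)
    return [find(i) for i in range(n)]
```

The point of the shared-density graph is that the merge decision uses the density actually observed *between* micro-clusters, instead of a distributional assumption about it.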
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic
parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to
this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the
MapReduce programming model. To achieve compressed storage and avoid building conditional
pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP
trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial
third MapReduce job, the mappers independently decompose itemsets, while the reducers perform
combination operations by constructing small ultrametric trees and then mine these trees
separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster
is sensitive to data distribution and dimensions, because itemsets with different lengths have different
decomposition and construction costs. To improve FiDoop's performance, we develop a workload
balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD,
an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis.
Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution
is efficient and scalable.
ETPL
DM - 030 FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
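The map/reduce decomposition style used above can be shown with a toy in-process stand-in for the frequency-counting job (the first of the three jobs described); the real FiDoop runs on Hadoop and follows this with the ultrametric-tree jobs.

```python
from collections import defaultdict

def map_phase(transactions):
    # mapper: emit (item, 1) for every item occurrence
    for tx in transactions:
        for item in tx:
            yield item, 1

def reduce_phase(pairs, min_support):
    # reducer: sum counts per item, keep only frequent items
    counts = defaultdict(int)
    for item, n in pairs:
        counts[item] += n
    return {item: c for item, c in counts.items() if c >= min_support}
```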
Mining communities or clusters in networks is valuable in analyzing, designing, and optimizing many
natural and engineering complex systems, e.g. protein networks, power grid, and transportation
systems. Most of the existing techniques view the community mining problem as an optimization
problem based on a given quality function (e.g., modularity); however, none of them is grounded in
a systematic theory to identify the central nodes in the network. Moreover, how to reconcile the mining
efficiency and the community quality still remains an open problem. In this paper, we attempt to
address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable
influence factor is proposed to measure the leadership of each node; the nodes with the highest local
leadership can be viewed as candidate central nodes. Then, we use a discrete-time dynamical system
to describe the dynamical assignment of community membership, and formulate several conditions
to guarantee the convergence of each node’s dynamic trajectory, by which the hierarchical community
structure of the network can be revealed. The proposed dynamical system is independent of the quality
function used, so it can also be applied in other community mining models. Our algorithm is highly
efficient: the computational complexity analysis shows that the execution time is nearly linearly
dependent on the number of nodes in sparse networks. We finally give demonstrative applications of
the algorithm to a set of synthetic benchmark networks and also real-world networks to verify the
algorithmic performance.
ETPL
DM - 031 Fast and accurate mining the community structure: integrating center
locating and membership optimization
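The center-locating step can be sketched under simple assumptions: score each node with a kernel over its neighbourhood and keep the local maxima as candidate centers. The concrete kernel below (a fixed degree-based exponential) is our own toy choice; the paper's kernel has a tunable influence factor.

```python
import math

def leadership(adj, alpha=1.0):
    # toy kernel: each neighbour u contributes exp(-alpha / deg(u))
    return {v: sum(math.exp(-alpha / len(adj[u])) for u in nbrs)
            for v, nbrs in adj.items()}

def candidate_centers(adj, alpha=1.0):
    # keep nodes whose leadership is a local maximum over their neighbours
    score = leadership(adj, alpha)
    return [v for v in adj
            if all(score[v] >= score[u] for u in adj[v])]
```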
In mobile communication, spatial queries pose a serious threat to user location privacy because the
location of a query may reveal sensitive information about the mobile user. In this paper, we study
approximate k nearest neighbour (kNN) queries where the mobile user queries the location-based
service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current
location. We propose a basic solution and a generic solution for the mobile user to preserve his location
and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier
public-key cryptosystem and can provide both location and query privacy. To preserve query privacy,
our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k
nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic
solution can be applied to multiple discrete type attributes of private location-based queries. Compared
with existing solutions for kNN queries with location privacy, our solution is more efficient.
Experiments have shown that our solution is practical for kNN queries.
ETPL
DM - 032 Practical Approximate k Nearest Neighbour Queries with Location and
Query Privacy
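The property of the Paillier cryptosystem that such solutions rely on is additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server can combine encrypted values without decrypting them. The sketch below is a textbook toy Paillier with tiny primes, insecure and for illustration only; it is not the paper's protocol.

```python
import math, random

p, q = 17, 19                   # toy primes -- NOT secure
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)            # decryption constant; valid since g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # L(x) = (x - 1) // n, then multiply by mu modulo n
    return ((pow(c, lam, n2) - 1) // n * mu) % n
```

Because `Enc(a) * Enc(b) mod n²` decrypts to `a + b`, an LBS server can, for instance, accumulate components of an encrypted distance computation without learning the user's coordinates.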
With advances in geo-positioning technologies and geo-location services, there is a rapidly growing
amount of spatio-textual objects collected in many applications such as location-based services and
social networks, in which an object is described by its spatial location and a set of keywords (terms).
Consequently, the study of spatial keyword search which explores both location and textual description
of the objects has attracted great attention from commercial organizations and research
communities. In this paper, we study two fundamental problems in spatial keyword queries: top k
spatial keyword search (TOPK-SK), and batch top k spatial keyword search (BTOPK-SK). Given a set
of spatio-textual objects, a query location and a set of query keywords, the TOPK-SK retrieves the
closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing
of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel
index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit
both spatial and keyword based pruning techniques to effectively reduce the search space. An efficient
algorithm is then developed to tackle top k spatial keyword search. To further enhance the filtering
capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal
with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based
on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree
technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic
data clearly demonstrate the efficiency of our methods.
ETPL
DM - 033 Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
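The baseline semantics of TOPK-SK is easy to state in code: keep the objects containing all query keywords, then take the k nearest to the query location. The brute-force sketch below is what the IL-Quadtree computes *without* scanning every object; the index's contribution is pruning this search space.

```python
def topk_sk(objects, qloc, qkeys, k):
    """objects: list of (x, y, set_of_keywords); qloc: (x, y)."""
    qkeys = set(qkeys)
    # keyword filter: object must contain ALL query keywords
    matches = [(x, y, kw) for (x, y, kw) in objects if qkeys <= kw]
    # rank by (squared) distance to the query location
    matches.sort(key=lambda o: (o[0] - qloc[0]) ** 2 + (o[1] - qloc[1]) ** 2)
    return matches[:k]
```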
We propose TrustSVD, a trust-based matrix factorization technique for recommendations. TrustSVD
integrates multiple information sources into the recommendation model in order to reduce the data
sparsity and cold start problems and their degradation of recommendation performance. An analysis of
social trust data from four real-world data sets suggests that not only the explicit but also the implicit
influence of both ratings and trust should be taken into consideration in a recommendation model.
TrustSVD therefore builds on top of a state-of-the-art recommendation algorithm, SVD++ (which uses the
explicit and implicit influence of rated items), by further incorporating both the explicit and implicit
influence of trusted and trusting users on the prediction of items for an active user. The proposed
technique is the first to extend SVD++ with social trust information. Experimental results on the four
data sets demonstrate that TrustSVD achieves better accuracy than ten other counterpart
recommendation techniques.
ETPL
DM - 034 A Novel Recommendation Model Regularized with User Trust and Item
Ratings
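The shape of a TrustSVD-style prediction can be sketched from the description above: SVD++'s implicit-feedback term over rated items is complemented by an analogous term over trusted users. The function below uses toy latent factors and is a sketch of the prediction rule only, not the paper's trained model or its regularization.

```python
def predict(mu, bu, bj, qj, pu, rated_feedback, trust_feedback):
    """mu: global mean; bu, bj: user/item biases; qj, pu: latent factors.
    rated_feedback / trust_feedback: lists of implicit-factor vectors
    for the user's rated items and trusted users, respectively."""
    acc = list(pu)
    for vecs in (rated_feedback, trust_feedback):
        if vecs:
            norm = len(vecs) ** -0.5   # |I_u|^{-1/2} resp. |T_u|^{-1/2}
            for v in vecs:
                for d in range(len(acc)):
                    acc[d] += norm * v[d]
    return mu + bu + bj + sum(q * a for q, a in zip(qj, acc))
```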
Although the matrix completion paradigm provides an appealing solution to the collaborative filtering
problem in recommendation systems, some major issues, such as data sparsity and cold-start problems,
still remain open. In particular, when the rating data for a subset of users or items is entirely missing,
commonly known as the cold-start problem, the standard matrix completion methods are inapplicable
due to the non-uniform sampling of available ratings. In recent years, there has been considerable interest
in dealing with cold-start users or items that are principally based on the idea of exploiting other sources
of information to compensate for this lack of rating data. In this paper, we propose a novel and general
algorithmic framework based on matrix completion that simultaneously exploits the similarity
information among users and items to alleviate the cold-start problem. In contrast to existing methods,
our proposed recommender algorithm, dubbed DecRec, decouples the following two aspects of the
cold-start problem to effectively exploit the side information: (i) the completion of a rating sub-matrix,
which is generated by excluding cold-start users/items from the original rating matrix; and (ii) the
transduction of knowledge from existing ratings to cold-start items/users using side information. This
crucial difference prevents the error propagation of completion and transduction, and also significantly
boosts the performance when appropriate side information is incorporated. The recovery error of the
proposed algorithm is analyzed theoretically and, to the best of our knowledge, this is the first algorithm
that addresses the cold-start problem with provable guarantees on performance. Additionally, we also
address the problem where both cold-start user and item challenges are present simultaneously. We
conduct thorough experiments on real datasets that complement our theoretical results. These
experiments demonstrate the effectiveness of the proposed algorithm in handling the cold-start
users/items problem and mitigating the data sparsity issue.
ETPL
DM - 036 Cold-Start Recommendation with Provable Guarantees: A Decoupled
Approach
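The decoupling described above can be sketched with toy stand-ins for both steps: step (i) completes the warm-user submatrix (here naively, by column means, where the real method uses matrix completion), and step (ii) transduces each cold-start user's row as a side-information-weighted average of the completed warm rows. All names are ours and hypothetical.

```python
def complete_warm(rows):
    # step (i): complete the warm submatrix (toy: fill misses with the
    # column mean; DecRec would run a matrix-completion solver here)
    n_items = len(rows[0])
    means = []
    for j in range(n_items):
        vals = [r[j] for r in rows if r[j] is not None]
        means.append(sum(vals) / len(vals))
    return [[means[j] if r[j] is None else r[j] for j in range(n_items)]
            for r in rows]

def transduce_cold(warm_completed, sim):
    # step (ii): cold user's similarities to warm users (from side
    # information) weight the completed warm rows
    total = sum(sim)
    n_items = len(warm_completed[0])
    return [sum(s * row[j] for s, row in zip(sim, warm_completed)) / total
            for j in range(n_items)]
```

Because step (ii) only ever sees the *completed* warm matrix, errors from the two steps do not compound through the cold-start rows, which mirrors the error-propagation argument in the abstract.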