A New Identification Method for Visualizing Trends in Transactional Data

Dr. S. Subramanian, Principal, Sri Krishna College of Engg. & Tech., Coimbatore – India

Mr. B. Gopinathan, Research Scholar, Adhiyamaan College of Engg, Hosur – India

R. Sajula Robin, P.G Scholar, Adhiyamaan College of Engg, Hosur – India


Abstract - Organizations now capture increasing amounts of data about their customers, suppliers, competitors, and business environment. Most of this data is multiattribute and temporal in nature. Many data mining and business intelligence techniques are used to discover patterns in such data; however, visualizing and analyzing it can be extremely difficult because it can have numerous attributes. Hence, a new technique is needed to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities. This paper proposes a new data analysis and visualization technique that presents complex multiattribute temporal data in a cohesive graphical manner by building on well-established data mining methods. It introduces Cluster-based Temporal Representation of EveNt Data (C-TREND), a system that implements the temporal cluster graph construct, which maps multiattribute temporal data to a two-dimensional directed graph that identifies trends in dominant data types over time. C-TREND gives an end user the ability to generate graphs from data and adjust graph parameters.

Keywords— Clustering, data and knowledge visualization, data mining, interactive data exploration and discovery, temporal data mining, trend analysis.

I. INTRODUCTION

Business intelligence applications represent an important opportunity for data mining techniques to help firms gather and analyze information about their performance, customers, competitors, and business environment. Knowledge representation and data visualization tools constitute one form of business intelligence techniques that present information to users in a manner that supports business decision-making processes.

Business intelligence tools gain their strength by supporting decision-makers. The research field of data mining has developed a number of methods for identifying patterns in data to provide insights and decision support to users. Data mining and business intelligence approaches are often used for class identification and data visualization in knowledge management systems. Increasingly, knowledge discovery in data (KDD) techniques are providing new analytical structures that complement and sometimes replace existing human-expert-based techniques to provide improved support for decision making. Identifying and visualizing temporal relationships (e.g., trends) in data constitutes an important problem that is relevant in many business, scientific, and academic settings.

Additionally, it is often desirable to aggregate over the temporal dimension (e.g., by day, month, quarter, or year) to match corporate reporting standards. Hence, a new technique is needed to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities. The intuitive analytical construct onto which the multidimensional temporal data are mapped is known as a temporal cluster graph.

In this paper the approach used for addressing these types of issues is to mine the data according to specific time periods and then compare the data mining results across time periods to discover similarities. The main contribution of this paper is to develop a novel and useful approach for visualization and analysis of multiattribute transactional data based on a new temporal cluster graph construct, and to implement this approach as the Cluster-based Temporal Representation of EveNt Data (C-TREND) system.


Fig 1. Reducing multiattribute temporal complexity by partitioning data into time periods and producing a temporal cluster graph.

Consider the plot of a retailer’s customers by age and income over three months in Fig. 1. Xs represent customers in the first month, triangles represent customers in the second month, and circles represent customers in the third month. An analyst may be tasked with the job of discovering trends in customer type over these three months.

In Fig. 1a, the data from all three months are plotted together, and identifying patterns in the data and relationships over time is difficult.

In Fig. 1b, partitioning the data by time period leads to the identification of clusters within each period.

In Fig. 1c, the multidimensional temporal data are mapped into an intuitive analytical construct known as a temporal cluster graph.

II. RELATED WORK

Some techniques related to this paper are discussed below.

A. Temporal Data Mining

Temporal data mining is concerned with data mining of large sequential data sets. By sequential data, we mean data that is ordered with respect to some index. For example, time series constitute a popular class of sequential data, where records are indexed by time.

Temporal data mining is an important extension as it has the capability of mining activity rather than just states and, thus, inferring relationships of contextual and temporal proximity, some of which may also indicate a cause-effect association. Moreover, temporal data mining has the ability to mine the behavioral aspects of (communities of) objects as opposed to simply mining rules that describe their states at a point in time; i.e., there is the promise of understanding why rather than merely what [1].

Many studies in conventional data mining distinguish two strategic goals for the discovery process: 1) the description of the characteristics of a population and 2) the prediction of its evolution in the future. In the context of the discovery of similarities in temporal data, we categorize data mining research across three dimensions: data type, mining paradigm, and temporal ordering [7].

Temporal data mining approaches depend on the nature of the event sequence being studied. Probably the most common form of temporal data mining, time series analysis, is used to mine sequences of continuous real-valued elements and is often regression based, relying on the prespecified definition of a model. Moreover, standard time series analysis techniques typically are examples of supervised learning; in other words, they estimate the effects of a set of independent variables on a dependent variable [5].

Another common area of temporal data mining research is sequence analysis. Sequence analysis is often used when the sequence is composed of a series of nominal symbols. A sequential pattern is a subsequence that appears frequently in a sequence database. Sequential pattern mining, which finds the set of frequent subsequences in sequence databases, is an important data mining task and has broad applications, such as business analysis, web mining, security, and bio-sequence analysis [6], [9].

B. Data Visualization

Visual data exploration [4] aims at integrating the human in the data exploration process, applying human perceptual abilities to the large data sets available in today's computer systems. The basic idea of visual data exploration is to present the data in some visual form, allowing the human to get insight into the data, draw conclusions, and directly interact with the data. Visual data exploration is especially useful when little is known about the data and the exploration goals are vague. Visual data exploration usually follows a three-step process: overview, zoom and filter, and details-on-demand.

First, the user needs to get an overview of the data. In the overview, the user identifies interesting patterns and focuses on one or more of them. For analyzing the patterns, the user needs to drill down and access details of the data. Visualization technology may be used for all three steps of the data exploration process.

Visualization techniques are useful for showing an overview of the data, allowing the user to identify interesting subsets. In this step, it is important to keep the overview visualization while focusing on the subset using another visualization technique. An alternative is to distort the overview visualization in order to focus on the interesting subsets. To further explore the interesting subsets, the user needs a drill-down capability in order to get the details about the data. Visualization technology not only provides the base visualization techniques for all three steps, but also bridges the gaps between the steps.

The techniques can be classified based on three criteria: 1) the data to be visualized, 2) the visualization technique, and 3) the interaction and distortion technique used.

Both scientific visualization and information visualization create graphical models and visual representations from data that support direct user interaction for exploring and acquiring insight into useful information embedded in the underlying data. In scientific visualization, the graphical models are typically constructed from measured or simulated data representing objects or concepts associated with phenomena from the physical world. As such, the data and, hence, its derived visual representations represent objects that exist in a 1D (one-dimensional), 2D, or 3D object space.[2]


Eventually, data will also include a temporal dimension, and the presence of spatial and temporal dimensions is a determining factor in deriving visual representations from the data. In information visualization, the graphical models may represent abstract concepts and relationships that do not necessarily have a counterpart in the physical world, e.g., information describing user accesses to pages of an Internet portal or records describing selected properties of different car brands and models. Typically, each data unit describes multiple related attributes (usually more than four) that are not of a spatial or temporal nature. Interaction techniques give the user the ability to dynamically change the visual representation and can strengthen the user's perception of information. A comprehensive framework for user interface techniques used in visualization systems includes: 1) interactive filtering and 2) interactive zooming.

Interactive Filtering: A number of interaction techniques have been developed to improve interactive filtering in data exploration. An example of an interactive tool which can be used for interactive filtering is Magic Lenses. The basic idea of Magic Lenses is to use a tool like a magnifying glass to support filtering the data directly in the visualization. The data under the magnifying glass is processed by the filter and the result is displayed differently than the remaining data set.

Interactive Zooming: Zooming is a well-known technique which is widely used in a number of applications. In dealing with large amounts of data, it is important to present the data in a highly compressed form to provide an overview of the data, but, at the same time, allow a variable display of the data at different resolutions. Zooming means not only displaying the data objects larger, but also changing the data representation automatically to present more details at higher zoom levels.

III. TEMPORAL CLUSTER GRAPHS

The temporal cluster graph is a new data mining technique for identifying and visualizing trends in multiattribute temporal data. It lets the user adjust and visualize the clustering solution for each partition. Hierarchical and graph-based techniques are used by the temporal cluster graph to provide interactive filtering and zooming capabilities for visualization. The temporal cluster graph is a directed graph that consists of a set of nodes V = {V1, V2, …, Vt}, where each subset Vi corresponds to a data partition and contains ki nodes.

A. Temporal Cluster Graph Definition

To obtain the graph several steps are required:

1. The transactional data set D is partitioned based on time periods into t data subsets D1, …, Dt (indexed chronologically), and each Di is a multiattribute data subset containing records with m attributes.

2. Data within each partition is then clustered using the Apriori algorithm.

3. The node Vi,j ∈ Vi is the jth node in the ith partition. Nodes are labeled with the size of the cluster they represent (i.e., the number of data points in that cluster).

4. Edges connect nodes in adjacent partitions and are labeled with a distance value between the two nodes, thus representing the similarity between the clusters connected by the edge.
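The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the C-TREND implementation: it assumes each record already carries its time period and a cluster label from whatever clustering step was run, and it assumes the edge label is the Euclidean distance between cluster centroids (the text only says "a distance value"). The function name `build_temporal_cluster_graph` and the record layout are hypothetical.

```python
from collections import defaultdict
from math import dist


def build_temporal_cluster_graph(records):
    """records: iterable of (period, cluster_label, attribute_vector) tuples.

    Returns (nodes, edges):
      nodes[(period, label)] = {"size": int, "centroid": tuple}
      edges[((p, a), (p_next, b))] = distance between the two centroids
    """
    # Steps 1-2 (assumed done upstream): group records by partition and cluster.
    members = defaultdict(list)
    for period, label, vector in records:
        members[(period, label)].append(vector)

    # Step 3: one node per cluster, labeled with the cluster size.
    nodes = {}
    for key, vectors in members.items():
        centroid = tuple(sum(col) / len(vectors) for col in zip(*vectors))
        nodes[key] = {"size": len(vectors), "centroid": centroid}

    # Step 4: edges only between nodes in adjacent partitions.
    periods = sorted({period for period, _ in nodes})
    edges = {}
    for p, p_next in zip(periods, periods[1:]):
        for a in (k for k in nodes if k[0] == p):
            for b in (k for k in nodes if k[0] == p_next):
                # Assumption: Euclidean centroid distance as the edge label.
                edges[(a, b)] = dist(nodes[a]["centroid"], nodes[b]["centroid"])
    return nodes, edges


# Toy data: (month, cluster label, (age, income in thousands)).
records = [
    (1, "a", (20.0, 30.0)), (1, "a", (22.0, 34.0)), (1, "b", (50.0, 80.0)),
    (2, "a", (21.0, 35.0)), (2, "b", (52.0, 82.0)), (2, "b", (54.0, 86.0)),
]
nodes, edges = build_temporal_cluster_graph(records)
```

In this toy example a small edge value (such as the one between cluster "a" in month 1 and cluster "a" in month 2) marks two clusters that are similar across periods.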

B. Graph Parameters

Three graph parameters are proposed for displaying information at different levels of analysis:

1. Partition Zoom

2. Within-Period Trend Strength

3. Cross-Period Trend Strength

C. Partition Zoom

The zoom feature dynamically changes the size of the clustering solution in a data partition. It allows users to apply their domain expertise by adjusting in real time the underlying clustering solution used to build a trend graph and to interactively evaluate multiple trend views.

Each data partition Di has a corresponding ki value, where ki refers to the number of clusters estimated in the clustering solution for that partition. For example, a value of ki = 5 corresponds to the clustering solution for the ith partition that contains exactly five clusters.

D. Within-Period Trend Strength

Within-period trend strength is a user-specified parameter that can be used to determine if nodes generated by the clustering solution are “strong” enough to be included in the trend analysis.

Within-period trend strength is denoted by parameter α. Each data partition utilizes the same value of α. The nodes in the trend graph are created by clustering each data partition. For every data partition Di, the clustering solution contains ki clusters, and some of these clusters can be filtered out based on the within-period trend strength parameter α.
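A small sketch may help make the node filter concrete. The text does not state the exact filtering rule, so the following assumes one plausible reading: a cluster is kept when its share of the records in its partition is at least α. Both this rule and the name `alpha_filter` are assumptions made for illustration.

```python
def alpha_filter(cluster_sizes, alpha):
    """Return the labels of clusters that pass the within-period filter.

    cluster_sizes: {label: number of records} for ONE data partition.
    Assumed rule: keep a cluster if size >= alpha * (records in partition).
    """
    total = sum(cluster_sizes.values())
    return {label for label, size in cluster_sizes.items() if size >= alpha * total}


# One partition with three clusters covering 100 records.
sizes = {"c1": 60, "c2": 35, "c3": 5}
strong = alpha_filter(sizes, alpha=0.1)  # "c3" holds only 5% of the records
```

Because every partition uses the same α, the same call would be repeated for each Di with that partition's cluster sizes.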

E. Cross-Period Trend Strength

Cross-period trend strength is a user-specified parameter that can be used to filter out spurious edges based on their weight. Cross-period trend strength is denoted by parameter β.

An edge is included in the output graph if it satisfies two criteria:

1. The edge is incident to two nodes that are both included in the output graph (as determined by the clustering solution and within-period trend strength α).

2. The edge weight is less than or equal to a threshold η that depends on the cross-period trend strength β.

The edge threshold η is calculated by taking the average of the weights of all the possible edges among the nodes in two adjacent data partitions (say, partitions i and i+1) and adjusting it by the user-specified β parameter. Only edges with weights below average are included in the graph.
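The two edge criteria can be sketched as follows. The text says the average weight is "adjusted" by β without giving the formula, so this sketch assumes the simplest form, η = β × (mean weight of all possible edges between the two partitions). The function name `filter_edges` and the node identifiers are hypothetical.

```python
def filter_edges(edges, kept_nodes, beta):
    """Apply both edge criteria to ONE pair of adjacent partitions.

    edges: {(u, v): weight} for every possible edge between the partitions.
    kept_nodes: nodes that survived the clustering solution and alpha filter.
    Assumed threshold: eta = beta * mean(all edge weights).
    Returns (kept_edges, eta).
    """
    if not edges:
        return {}, 0.0
    eta = beta * sum(edges.values()) / len(edges)
    kept = {
        (u, v): w
        for (u, v), w in edges.items()
        # Criterion 1: both endpoints displayed. Criterion 2: weight <= eta.
        if u in kept_nodes and v in kept_nodes and w <= eta
    }
    return kept, eta


edges = {
    ("1a", "2a"): 3.0,
    ("1a", "2b"): 61.0,
    ("1b", "2a"): 53.0,
    ("1b", "2b"): 5.0,
}
kept, eta = filter_edges(edges, kept_nodes={"1a", "1b", "2a", "2b"}, beta=1.0)
```

Under this assumption, β = 1 keeps exactly the edges whose weight does not exceed the average, and smaller values of β keep only the closest cluster pairs.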

V. C-TREND IMPLEMENTATION

A. C-TREND Overview

C-TREND (Cluster-based Temporal Representation of EveNt Data) is a new method for discovering and visualizing trends and temporal patterns in transactional attribute-value data that builds upon standard data mining clustering techniques. The C-TREND technique consists of two major processes: 1) offline preprocessing of the data and 2) online interactive analysis and visualization of the trends.

Fig 2. The C-TREND Process

B. Offline Preprocessing of the Data

In the preprocessing phase, the data set is partitioned based on time periods, and each partition is clustered using one of many traditional clustering techniques such as a hierarchical approach. The results of the clustering for each partition are used to generate two data structures: the node list and the edge list.

Creating these lists in the preprocessing phase allows for more effective (real-time) visualization updates of the C-TREND output graphs. Based on these data structures, graph entities (nodes and edges) are generated and rendered as a temporal cluster graph in the system output window.

B1. Data Clustering

C-TREND can be implemented with several standard clustering algorithms (e.g., agglomerative or divisive hierarchical clustering or partition-based clustering) and could be expanded to include new efficient clustering techniques such as clustering by passing messages between data points. Specifically, the Apriori algorithm is utilized, and the clustering is performed separately for each partition of data.

APRIORI ALGORITHM

The Apriori steps are as follows:

1) Count item occurrences to determine the frequent item sets of size one.

2) Generate candidate item sets of the next size from the frequent item sets found so far.

3) Count the support of the candidates. The pruning process ensures that every subset of a candidate is already known to be a frequent item set.

4) Use the frequent item sets to generate the desired rules.

Algorithm 1. Apriori

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
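Algorithm 1 can be turned into a short runnable sketch. The version below is a generic textbook rendering of Apriori in Python, not code from the C-TREND system; `min_support` is taken to be an absolute transaction count, and the basket data are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep those meeting min_support.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: unions of two frequent k-item sets that have k + 1 items.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Count the support of the surviving candidates over the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
```

With a minimum support of three transactions, the example yields four frequent single items and four frequent pairs; the only triple that survives pruning, {bread, milk, diapers}, appears in just two baskets and is dropped.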

C-TREND produces a dendrogram for each data partition and utilizes a global input value N that represents the maximum-sized cluster solution maintained for each data partition. A useful solution will consist of a set of N << n clusters (where n is the number of data points in partition i) and, therefore, C-TREND has to store only 2N-1 nodes per partition.
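The 2N-1 figure follows from the shape of a binary merge tree: N leaves (the clusters of the largest stored solution) plus the N-1 merges above them. The sketch below shows how any k-cluster solution with k ≤ N can be read off such a truncated dendrogram. The merge-list representation and the name `cut_dendrogram` are assumptions chosen for illustration, not the paper's data structure.

```python
def cut_dendrogram(n_leaves, merges, k):
    """Return the ids of the k dendrogram nodes forming the k-cluster solution.

    n_leaves: N, the number of clusters in the largest stored solution
              (leaf ids 0 .. N-1).
    merges:   list of N-1 (left_id, right_id) pairs in merge order;
              merge i creates node id n_leaves + i.
    """
    if not 1 <= k <= n_leaves:
        raise ValueError("k must be between 1 and N")
    active = set(range(n_leaves))
    # Replaying the first N - k merges leaves exactly k active clusters.
    for i, (left, right) in enumerate(merges[: n_leaves - k]):
        active -= {left, right}
        active.add(n_leaves + i)
    return active


# N = 4 leaves (ids 0-3) and N - 1 = 3 merges (ids 4-6): 2N - 1 = 7 nodes.
merges = [(0, 1), (2, 3), (4, 5)]
three_clusters = cut_dendrogram(4, merges, k=3)  # nodes 2, 3 and merged node 4
```

Changing k only replays a prefix of the stored merges, which is why zooming a partition does not require clustering the raw data again.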

VI. INTERACTIVE DATA VISUALIZATION

Interactive analysis includes the presentation of output graphs in a graphical user interface (GUI) that allows the user to adjust k for each partition and the α and β parameters, which prompts C-TREND to redraw the output graph based on these new values in real time.

C-TREND utilizes a series of validation flags to maintain and update the displayed state of the output trend graph. Combinations of the validation flags are used to determine whether or not each possible edge and node should be displayed in the graph, and as these flags change, the displayed components of the graph also change.

Each cluster in the node list (dendrogram data structures) possesses two flags: k-pass and α-pass. These flags are used to indicate whether the cluster should be included in the output graph based on the ki value and the α value, respectively. Specifically, when ki is changed, the dendrogram data structure is updated so that only the clusters that should be extracted for the clustering solution of size ki have a valid k-pass flag. Similarly, when α is changed, the dendrogram data structure is updated so that only the clusters that are large enough to pass the node filter based on α are assigned a valid α-pass flag.

The nodes that have both valid k-pass and α-pass flags make up the set of nodes that are both large enough and in the desired clustering solution and therefore are included in the output graph.
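The flag mechanism can be illustrated with a small sketch. The class `ClusterNode`, the helper names, and the use of a plain minimum size for the α test are all hypothetical; the point is only that each parameter change rewrites one flag, and visibility is the conjunction of the two.

```python
from dataclasses import dataclass


@dataclass
class ClusterNode:
    node_id: int
    size: int
    k_pass: bool = False      # in the clustering solution of size ki
    alpha_pass: bool = False  # large enough for the alpha node filter

    @property
    def visible(self) -> bool:
        # A node is drawn only when both flags are valid.
        return self.k_pass and self.alpha_pass


def set_k_pass(nodes, solution_ids):
    """Called when ki changes: flag the clusters in the new solution."""
    for node in nodes:
        node.k_pass = node.node_id in solution_ids


def set_alpha_pass(nodes, min_size):
    """Called when alpha changes; min_size is an assumed size threshold."""
    for node in nodes:
        node.alpha_pass = node.size >= min_size


# Seven dendrogram nodes for one partition (four leaves, three merges).
sizes = [40, 10, 30, 20, 50, 50, 100]
nodes = [ClusterNode(i, s) for i, s in enumerate(sizes)]
set_k_pass(nodes, {2, 3, 4})        # a three-cluster solution
set_alpha_pass(nodes, min_size=25)  # node 3 (size 20) is too small
shown = [n.node_id for n in nodes if n.visible]
```

Adjusting α later touches only the α-pass flags, so the k-pass state from the last zoom operation is reused as is.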

VII. CONCLUSION

By harnessing computational techniques of data mining, we have developed a new temporal clustering technique for discovering, analyzing, and visualizing trends in multiattribute temporal data.

The proposed technique is versatile, and the implementation of the technique as the C-TREND system gives significant data representation power to the user—domain experts have the ability to adjust parameters and clustering mechanisms to fine-tune trend graphs.

The C-TREND implementation is scalable: the time required to adjust trend parameters is quite low even for larger data sets, which provides for real-time visualization capabilities.

Furthermore, the proposed temporal clustering analysis technique is applicable in many different data analysis contexts and can provide insights for analysts performing historical analyses and generating forecasts.

REFERENCES

[1] C.M. Antunes and A.L. Oliveira, “Temporal Data Mining: An Overview,” Proc. ACM SIGKDD Workshop Data Mining, pp. 1-13, Aug. 2001.

[2] M.C.F. de Oliveira and H. Levkowitz, “From Visual Data Exploration to Visual Data Mining: A Survey,” IEEE Trans. Visualization and Computer Graphics, vol. 9, no. 3, pp. 378-394, July-Sept. 2003.

[3] A. Jain, M. Murty, and P. Flynn, “Data Clustering: A Review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.

[4] D.A. Keim, “Information Visualization and Visual Data Mining,” IEEE Trans. Visualization and Computer Graphics, vol., no. 1, pp. 1-8, 2002.

[5] E. Keogh and S. Kasetty, “On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration,” Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 349-371, 2003.

[6] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 10, pp. 1-17, Oct. 2004.

[7] J. Roddick and M. Spiliopoulou, “A Survey of Temporal Knowledge Discovery Paradigms and Methods,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 4, pp. 750-767, July/Aug. 2002.

[8] S.F. Roth and J. Mattis, “Data Characterization for Intelligent Graphics Presentations,” Proc. Conf. Human Factors in Computing Systems (CHI ’90), pp. 193-200, 1990.

[9] M. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, vol. 42, no. 1-2, pp. 31-60, 2001.
