
Heatflip: Temporal-Spatial Sampling for Progressive Heat Maps on Social Media Data

Niklas Stoehr, Johannes Meyer, Volker Markl
Database Systems and Information Management Group
Technical University of Berlin, Berlin, Germany

{n.stoehr, j.meyer, volker.markl}@mail.tu-berlin.de

Qiushi Bai, Taewoo Kim, De-Yu Chen, Chen Li
Department of Computer Science
University of California, Irvine, Irvine, CA, USA

{baiqiushi, taewok2, teyuc, chenli}@ics.uci.edu

Abstract— Keyword-based heat maps are a natural way to explore and analyze the spatial properties of social media data. In large datasets there may be many different keywords, which makes offline pre-computation very hard. Interactive frameworks that exploit database sampling can address this challenge. We present a novel middleware technique called Heatflip, which issues diametrically opposed samples into the temporal and spatial dimensions of the data stored in an external database. Spatial samples provide insights into the temporal distribution and vice versa. The progressive exploration approach benefits from adaptive indexing and combines the retrieval and visualization of the data in a middleware layer. Without any a priori knowledge of the underlying data, the middleware can generate accurate heat maps in 85% shorter processing times than conventional systems. In this paper, we discuss the analytical background of Heatflip, showcase its scalability, and validate its performance when visualizing large amounts of social media data.

Keywords-Data exploration; Social media analytics; Big Data visualization; Database sampling; Progressive computing

I. INTRODUCTION

A. Interactive Big Data Analytics

Social media not only generates a vast amount of data but also inseparably makes use of it. This emerging trend creates a demand for scalable data stores and interactive analytics. The data storage is handled by big data management systems. Operating on top of these large databases, analytical systems visualize the data and bring complex patterns to the fore. If the data is retrieved and then analyzed in two decoupled steps, this approach can entail a duplication of functionality and disregard opportunities for cross-optimization [1]. In particular, if the data set is large and the data exploration session [2] includes several queries to the database, data analysis loses its interactive character due to long processing times. This trade-off between interactivity and performance represents a key challenge in the era of big data.

B. Middleware Layer for Data Visualization

Keyword-based analysis is a very natural way to explore and understand social media data. A typical scenario is the visualization of a heat map on a large amount of social media data such as Twitter. The user may want to gain insight into all tweets that include keywords such as “summer”, tweeted in the United States during the year 2017. In this case, the user specifies selection conditions (e.g., keyword “summer”, interval “2017-01-01 to 2017-12-31”, and region “USA”) in a frontend system. Since datasets are large and constantly growing, it is difficult to obtain a subset of records offline that represents the distributions of all keywords. There is enormous scope for improvement by using a middleware layer in between the visualization system and the database [2]. The middleware acts as a mediator between the frontend and the database, taking charge of the user request by either directly returning a result from its internal cache or by translating the request into a time-efficient sampling task. In this paper, we develop a novel middleware technique called Heatflip to support real-time analytics and visualizations on very large data sets. Figure 1 shows an architecture that includes a Heatflip-enabled middleware layer between a backend database and a frontend visualization interface.

Figure 1: Heatflip architecture

The middleware can operate on top of any database without prior knowledge of the stored data and can optimize data exploration through query optimization and sampling. It features a built-in view manager and is able to cache query results. It even supports a non-static environment of continuous data ingestion as long as the stored data features spatial and temporal properties. The system can create indexes on any data property to accelerate database sampling. Heatflip draws samples or subsets of the data by issuing “mini-queries”. Mini-queries are executed much faster than queries on the full database. Typically, the results from previous mini-queries influence the formulation of the following ones. By issuing a sequence of mini-queries or returning cached results, the middleware aims to provide accurate data analytics and visualizations in a short amount of time.


C. Contributions

In this work, we present the details of Heatflip. The technique can tailor a sequence of mini-queries to the specific task of accurate and quick heat map visualization and has the following unique capabilities:
(a) It progressively generates keyword-based heat maps by incrementally unfolding knowledge and shifting computationally expensive steps to the backend database.
(b) In a ping-pong-like fashion, Heatflip draws samples alternately from the temporal and spatial dimensions. It learns about the spatial distribution by retrieving spatial information from a very short time interval. In addition, by performing spatial sampling, it gains insights into the temporal distribution.
(c) We analyze how to draw representative samples from the temporal and spatial dimensions and answer the questions of how many data entries are needed to generate an accurate heat map and how to validate the quality of a heat map.
(d) Finally, we present an evaluation of Heatflip and its scalability when visualizing heat maps on large social media data from different countries.

D. Related Work

There exist several systems that can be used to support visualization on large data sets. For instance, Superset [3] offers a rich set of visualizations while running on top of a backend system such as Druid [4]. Amazon provides a visualization service called QuickSight [5] using their proprietary backend engine SPICE. VerdictDB [6] uses a middleware architecture that requires no changes to the backend database. A main difference between Heatflip and these existing systems is that Heatflip mainly focuses on keyword-based heat map visualization, which is not a focus of earlier studies. In addition, the methodology of Heatflip, i.e., alternating sampling between two dimensions, can be adopted by these solutions as well. The commercial software Tableau [7] operates in two modes: data can either be stored internally, constrained by limited capacity, or in an external database. Tableau’s database connectors, however, lack sampling capabilities. HadoopViz [8] targets the rendering of high-resolution geographic images but lacks flexibility for interactive analysis. M4 [9] is a query rewriting system that considers neither geospatial nor textual data properties. Most frameworks are constrained by their scope of application (Taghreed [10]), their scalability (MapD [11], Kite [12]), or their interoperability (DVMS [13]). Hence, they either miss the opportunity for co-optimization or integrate the data storage and visualization framework into a coherent but inflexible system that only serves one specific purpose. Heatflip and Drum [14] both emerged from the open source Cloudberry system [15], [16] and share a number of commonalities. While both target interactivity and usability by progressively generating a sequence of mini-queries, their overall objectives are completely different. Drum incrementally retrieves tweets in reverse chronological order, does not perform sampling, and solely uses temporal components for optimization.

II. PROBLEM FORMULATION

A. Heat Map Visualization

In this section, we introduce our terminology alongside the conventional concepts of heat map visualization. We then point out room for improvement and motivate our progressive sampling approach. A heat map is a graphical visualization of a matrix. It is built from two discrete numerical attributes x and y that index the matrix position and one continuous attribute z depicting the value within the matrix. The matrix position is denoted by a two-dimensional location tag. We assume a table of tweets with spatial, temporal, and textual attributes. As outlined in Figure 2, we refer to the matrix as a grid, to the entries as cells, and to the entry values as cell values. A conventional approach to heat map visualization on a dataset of tweets involves the steps of “query”, “transmit”, “normalize and allocate”, and “smoothen and visualize”. During the steps of “query” and “transmit”, all data items are retrieved from the database and transmitted to an external visualization interface. The list of items is parsed and normalized, and each spatial point is allocated to a cell in the grid according to its location. Per allocation, the cell value of the corresponding cell is incremented by one. In the visualization phase, the cell values are translated into colors according to a color-value mapping scheme.

Figure 2: Heat map generation
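To make the frontend steps concrete, the following minimal Python sketch shows “normalize and allocate” and the color mapping of “smoothen and visualize”. The grid size, attribute layout, and linear color scaling are our assumptions, not the paper's implementation.

import numpy as np

def build_grid(lons, lats, bbox, n=128):
    # "normalize and allocate": count tweets per cell of an n x n grid
    west, south, east, north = bbox
    grid, _, _ = np.histogram2d(
        lats, lons, bins=n, range=[[south, north], [west, east]])
    return grid

def to_colors(grid):
    # "smoothen and visualize": map cell values to [0, 1] color intensities
    return grid / grid.max() if grid.max() > 0 else grid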

B. Query Formulation for Heat Map Visualization

We consider a method to pre-compute the grid of the heat map in the backend instead of the frontend in order to substantially reduce the frontend processing time. A conventional approach retrieves a long list of tweets that has to be parsed in the frontend for grid creation. In contrast, the query proposed in Figure 3 retrieves a pre-structured grid of aggregated cell values. Having a fixed size defined by the number of cells (e.g., 128×128), the grid is much smaller than the list of tweets. In the frontend, the grid can be translated directly into a heat map by applying a color-value mapping scheme without any further iterations. In other words, “querying” as well as “allocating and normalizing” are all completed on the backend database. Using the query suggested in Figure 3, all tweets are selected and their spatial location is translated into a corresponding grid cell using the “spatial_cell” function. The “group by” operator then


aggregates all the tweets that correspond to the same cell. These operations can benefit from an inverted index on the textual attribute, a B-tree index on the temporal attribute, and an R-tree index on the spatial attribute. The middleware autonomously sends requests to the backend to create these indexes. The query is executed in a distributed fashion: the “group by” operator is processed in parallel on each individual node of the distributed database in order to deliver the result in a timely manner. In addition, the system intelligently creates and maintains materialized views. Materialized views are data subsets storing aggregations of spatial, temporal, and textual properties. Heatflip preserves a materialization of the aggregated cell values of all tweets, which improves the performance of the “group by” statement.

Figure 3: Computing heat map using SQL on the backend
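Figure 3 is not reproduced here; the sketch below is a hedged reconstruction of such a backend aggregation query from the description above. The table and attribute names (tweets, text, create_at, coordinate) and the exact signature of the “spatial_cell” function are assumptions.

# Hedged reconstruction of the backend grid query (Figure 3).
# All identifiers are assumptions based on the description in the text.
GRID_QUERY = """
SELECT spatial_cell(t.coordinate, 128, 128) AS cell,
       COUNT(*) AS cell_value
FROM tweets t
WHERE contains(t.text, 'summer')
  AND t.create_at >= '2017-01-01' AND t.create_at < '2018-01-01'
GROUP BY spatial_cell(t.coordinate, 128, 128)
"""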

C. Data Retrieval as the Computational Bottleneck

The querying of the data demands the major share of the total processing time. Therefore, we focus on lowering the query time. We achieve this goal by progressively sampling on the database and issuing mini-queries. A mini-query adds range predicates on temporal or spatial attributes of the data and can benefit from the indexing capabilities of the database. Since it only retrieves a representative subset of the baseline data, it is executed much faster than the original query on the entire table. For example, retrieving all data from the entire temporal range would require too much time. To save time, we could instead sample a shorter time interval, e.g., 2017-01-01 to 2017-01-08, and visualize a heat map on this subset from only one week. Figure 4 shows an example mini-query:

Figure 4: Mini-query with a temporal predicate (temp. sample)
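A temporal mini-query keeps the aggregation of Figure 3 but narrows the temporal predicate. The following sketch builds such a query for an arbitrary interval; the identifiers are assumptions carried over from the sketch above.

# Build a temporal mini-query (Figure 4): full spatial extent,
# short time interval. Identifiers are assumptions.
def temporal_mini_query(keyword: str, start: str, end: str) -> str:
    return f"""
    SELECT spatial_cell(t.coordinate, 128, 128) AS cell,
           COUNT(*) AS cell_value
    FROM tweets t
    WHERE contains(t.text, '{keyword}')
      AND t.create_at >= '{start}' AND t.create_at < '{end}'
    GROUP BY spatial_cell(t.coordinate, 128, 128)
    """

# e.g., sample one week instead of the whole year:
q = temporal_mini_query("summer", "2017-01-01", "2017-01-08")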

D. Limitations of Temporal Sampling

Initially, we have no knowledge of the temporal distribution of the tweets. How can we select appropriate data subsets that reproduce a representative heat map visualization? In other words, which days or weeks should we sample? Without further insights into the temporal distribution, we could take a naïve approach that draws samples randomly. However, this approach has several shortcomings. When sampling with temporal range predicates, there will always be periods of data about which we have no information at all. Suppose we visualize a heat map on a keyword that is tweeted very non-uniformly over time. For instance, “superbowl” only peaks during a short time window and is sparsely tweeted during the rest of the year. Performing random temporal sampling, we may never uncover this peak. Consequently, the heat map visualization can be highly inaccurate. To do better, we would like to make sampling decisions based on an approximation of the whole temporal distribution. How can we obtain insights about the whole temporal distribution without issuing the time-consuming original query displayed in Figure 3? We will study this problem in the following sections.

III. PROGRESSIVE HEAT MAP VISUALIZATION

A. Motivating Temporal-Spatial Sampling

As discussed above, when performing temporal sampling alone, there will always be time periods about which we have no information. For this reason, we suggest an interplay of temporal and spatial samples. Our concept is to issue a mini-query with a spatial range predicate, as displayed in Figure 5.

Figure 5: Mini-query with a spatial predicate (spatial sample)
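Analogously, a hedged sketch of the spatial mini-query: the range predicate moves to the spatial attribute, and the result is grouped by day to expose the temporal distribution. The geo-predicate and date functions are assumptions, not the paper's exact syntax.

# Build a spatial mini-query (Figure 5): small area, entire time range,
# grouped by day to obtain a temporal histogram. Identifiers assumed.
def spatial_mini_query(keyword: str, west: float, south: float,
                       east: float, north: float) -> str:
    return f"""
    SELECT date_trunc('day', t.create_at) AS day,
           COUNT(*) AS cnt
    FROM tweets t
    WHERE contains(t.text, '{keyword}')
      AND spatial_intersect(t.coordinate,
                            rectangle({west}, {south}, {east}, {north}))
    GROUP BY date_trunc('day', t.create_at)
    """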

In contrast to the temporal mini-query, this spatial mini-query retrieves a subset from a small spatial area and contains a representation of the whole temporal dimension. The more representative the small spatial area, the more instructive the insights into the temporal distribution. These insights can help in deciding which time interval to sample and represent an abstraction of the whole temporal distribution. Sampling a small area costs far less processing time than the original query (Figure 3).

Figure 6: Temporal and spatial properties of sample keywords

Most keywords show a recurrent pattern with respect to their temporal and spatial occurrences, as illustrated in Figure 6. An accurate heat map might not need to consider all available data, but rather make use of sampling. For consistency, we focus on the example keywords “shooting”, “summer”,


“superbowl”, and “trump” throughout this paper. All four keywords show distinctive temporal-spatial patterns. For instance, the keyword “summer” is tweeted inconsistently in time with an annual peak in summer. In contrast, the keyword “trump” is tweeted more consistently throughout the year. Spatially, “trump” also occurs in less regionally clustered patterns than “superbowl”, “summer”, or “shooting”. The keyword “trump” is tweeted not only with a high temporal frequency but also country-wide across the US, which makes it a popular keyword. When generating heat maps, the spatial and the temporal dimensions are inseparably linked. On the one hand, it is unacceptable to simplify the temporal distribution by assuming temporal uniformity, because temporal anomalies represent a characteristic trait of a keyword. On the other hand, simplifying the spatial distribution and assuming spatial uniformity in every time step is not a good option either. Spatial uniformity means that the relative number of tweets between any two areas is constant over time. In this case, any infinitely small temporal subset would perfectly reproduce the entire heat map.

B. Heatflip Temporal-Spatial Sampling

Spatial samples provide insights into the temporal distribution and vice versa. We propose an interplay between both dimensions by issuing diametrically opposed samples. In this section, we analyze how to deploy a sequence of mini-queries with the goal to maximize the accuracy of the visualization. Our approach applies two steps iteratively:
(a) Temporal sampling: retrieve a small representative time interval from the entire spatial area (Figure 4);
(b) Spatial sampling: retrieve the entire time range from a small representative spatial area (Figure 5).

As illustrated in Figure 7, Heatflip constructs mini-queries upon both dimensions in a ping-pong-like fashion; a simplified loop skeleton is sketched below. The user first defines a keyword, as well as the spatial and temporal range, kicking off the visualization process. Heatflip starts off by sampling the temporal dimension. Since we have no insights into the data yet, the technique randomly chooses a starting interval on the time axis. It does so by issuing a mini-query to the database and retrieving tweets of a short time interval. From the sampled data, Heatflip generates an initial heat map and gains information about the spatial distribution. Since the data has been selected from a short time window only, Heatflip has no knowledge about the entire temporal distribution yet. For this reason, it switches over to spatial sampling. It uses the rough picture of the spatial distribution to formulate a sampling decision. It follows the objective to retrieve a small spatial area that is highly representative with respect to the entire spatial distribution. The more representative the small spatial area is, the more reliable the assumption on the temporal distribution, and hence the more reliable the following sampling decisions. After retrieving the entire time range from a small representative spatial area, Heatflip switches back to (a) temporal sampling. Exploiting the temporal distribution retrieved from a spatial sample, Heatflip selects a more representative temporal interval for its next temporal sample.
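The following skeleton renders this alternation in Python. It is deliberately simplified: all helper callables are hypothetical and injected as parameters, and the single stopping threshold stands in for the per-dimension stopping condition of Section III.C.

# Simplified skeleton of the alternating ("ping-pong") sampling loop.
# run_query executes a query string against the backend; the helper
# names and the single threshold are our assumptions.
def heatflip_loop(run_query, keyword, pick_interval, pick_area,
                  build_temporal_q, build_spatial_q, update, threshold=0.01):
    interval = pick_interval(None)             # random start: no insights yet
    spatial_dist, temporal_dist = None, None
    while True:
        # (a) temporal sample: short interval, entire area -> spatial insights
        grid = run_query(build_temporal_q(keyword, *interval))
        spatial_dist, delta_s = update(spatial_dist, grid)
        # (b) spatial sample: small area, entire time range -> temporal insights
        area = pick_area(spatial_dist)
        hist = run_query(build_spatial_q(keyword, *area))
        temporal_dist, delta_t = update(temporal_dist, hist)
        interval = pick_interval(temporal_dist)
        if max(delta_s, delta_t) < threshold:  # knowledge accurate enough
            return spatial_dist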

C. Stopping Condition of Heatflip

With every mini-query, the iterative process reveals more information about the underlying temporal distribution η(t) and spatial distribution η(s). We use η(t) to denote the temporal distribution, giving the number of tweets per smallest considered temporal unit t, e.g., day. Conversely, η(s) formulates the spatial distribution, defined as the number of tweets per small spatial area s. To provide a specific example, we explain how the change of the spatial distribution is tracked. Performing temporal sampling, we draw a sample from the temporal dimension. Therefore, we select an interval τ_i of consecutive days, which we slice from the complete interval τ*. The subscript i denotes the i-th mini-query in the overall sampling procedure. Using the tweets from τ_i, we obtain a spatial sample distribution η_i(s). From previous samples, we obtained a first approximation of the spatial distribution η'_i(s) = Σ_j η_j(s), where the control variable j aggregates all spatial

Figure 7: Heatflip sampling architecture


sample distributions from previous mini-queries. Comparing the old approximation η'_i(s) and the new aggregate sample distribution η_agg,i(s) = η'_i(s) + η_i(s), we are concerned with the relative change of the distribution instead of the absolute change; to this end, we perform a normalization step. Our approach to normalizing a distribution is to divide every spatial area s in η(s) by the total number of tweets Σ_s η(s) in the said distribution. Applied to our example, this division (Eq. 1) scales the new aggregate sample distribution to a normalized sample distribution η̂_i(s). The relative change between the old and the new normalized aggregate distribution, η̂'_i(s) and η̂_i(s), can be computed with the element-wise Manhattan distance (Eq. 2). The change represents the gain of information per mini-query. As in a typical “hill-climbing problem”, this gain is expected to decrease over time because the sampling decisions gradually become smarter.

$\hat{\eta}_i(s) = \eta_i(s) \,/\, \textstyle\sum_s \eta_i(s)$   (1)

$\delta(\hat{\eta}_i, \hat{\eta}'_i) = \tfrac{1}{2} \textstyle\sum_s \left|\hat{\eta}_i(s) - \hat{\eta}'_i(s)\right|$   (2)

The value of δ lies between 0 (distributions equal up to a constant factor) and 1 (non-overlapping distributions) and is a good measure of the overall sampling progress. It may be used as an indirect measure of the accuracy of the distribution. Heatflip switches back and forth between both dimensions until the change measure δ drops below a certain threshold. When the additional information gain of a mini-query is smaller than this threshold, Heatflip assumes that its knowledge of the distribution is accurate enough and stops issuing samples into that dimension. It invests all further mini-queries into sampling the other distribution.
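A runnable sketch of Eqs. 1 and 2 and the stopping check; the toy histograms and the threshold value are assumptions.

import numpy as np

def normalize(hist: np.ndarray) -> np.ndarray:
    # Eq. 1: scale a count histogram to a distribution summing to 1
    total = hist.sum()
    return hist / total if total > 0 else hist

def change_measure(old_agg: np.ndarray, new_sample: np.ndarray) -> float:
    # Eq. 2: half the element-wise Manhattan distance between the old
    # and the updated normalized aggregate distribution (delta in [0, 1])
    new_agg = old_agg + new_sample
    return 0.5 * np.abs(normalize(new_agg) - normalize(old_agg)).sum()

# usage: stop sampling this dimension once delta falls below a threshold
old = np.array([120., 40., 900., 3.])     # running spatial histogram
sample = np.array([10., 2., 85., 0.])     # histogram from the new mini-query
delta = change_measure(old, sample)
stop = delta < 0.01                       # threshold is a tunable assumption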

Launching Heatflip’s sampling process as outlined in Figure 7, the technique issues the initial sample into the temporal dimension for the following reasons. Firstly, a temporal mini-query yields the entire map from a short time window. In the unlikely event that the first query exceeds the maximum running time, we want to end up with at least an initial heat map visualization of the entire spatial area instead of the small spatial sub-area that we would get from a spatial sample. Secondly, spatial sampling is at higher risk of failing by retrieving a sparse area. The query may select an area covering only an ocean or a desert, ending up with zero tweets. Conversely, it is less likely to hit a time window that does not show evidence of any tweets.

D. Optimization Objective

Heatflip operates towards two goals that can be expressed as an optimization problem. The user may either set a minimum threshold of accuracy or a maximum time budget. Heatflip approximates the progress by keeping track of the relative changes to the temporal and spatial distributions using the change measure δ. If the changes become marginal, the system stops the progressive visualization process. Maximizing the heat map accuracy under the constraint of a limited time budget is another option.

IV. SAMPLING STRATEGY

A. Heat Map Quality Evaluation

So far, we introduced how Heatflip uses a sequence of mini-queries to slice samples alternately from the temporal and spatial dimensions. Next, we introduce how the quality of a heat map is validated, by looking at the spatial distribution more closely. Our validation approach compares the aggregated sample distribution η_agg,i(s) with a reference distribution η_ref(s) obtained from the total time range τ*. It is important to stress that the reference distribution is not known to Heatflip during actual operation time. We only make use of the reference distribution for an analytical validation during the research phase and the experiments. We obtain a measure of distance using the method developed in Section III.C, but we will not limit ourselves to the Manhattan distance; therefore we denote the metric used in Eq. 3 by “dist”. As sampling proceeds and the sample intervals cover more and more of the complete time interval, we expect the distance between the normalized aggregate distribution and the normalized reference distribution to converge to 0. The aggregate sample distribution η̂_agg,i(s) denotes the sampling progress after the i-th mini-query:

$\lim_{\sum_i \tau_i \to \tau^*} \mathrm{dist}\left(\hat{\eta}_{\mathrm{agg},i}(s),\, \hat{\eta}_{\mathrm{ref}}(s)\right) = 0$   (3)

In the following, we use the Earth Mover’s distance (EMD) [17] as a measure of distance between two probability distributions over a region and denote the result as “accuracy”. Compared to straight-line distances such as the Euclidean distance, EMD comes with the advantage of cross-cell distance measurement. A good metric penalizes errors in sparse regions more than in dense regions. If we happen to find a lot of unexpected tweets (e.g., 10,000) in a usually unpopulated area, this anomaly is more prone to leading to a wrong heat map than finding the same number of unexpected tweets in a metropolitan area. For this reason, we add a penalty term to the distance function that takes account of the population frequency p by multiplying the distance by log(p+1); since the logarithm grows only sublinearly, the per-tweet penalty decreases as the frequency gets large.
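As an illustration, the following sketch computes an EMD-based accuracy for two small grids using the POT optimal-transport library. The mapping of the EMD value to an “accuracy” score and the omission of the log(p+1) penalty are our simplifications, not the authors' exact formula.

# Sketch: EMD between two normalized heat-map grids via optimal
# transport (pip install POT). Practical only for small grids, since
# the ground-distance matrix has (n*m)^2 entries.
import numpy as np
import ot

def emd_accuracy(sample_grid: np.ndarray, reference_grid: np.ndarray) -> float:
    a = (sample_grid / sample_grid.sum()).ravel()
    b = (reference_grid / reference_grid.sum()).ravel()
    n, m = reference_grid.shape
    ys, xs = np.meshgrid(np.arange(n), np.arange(m), indexing="ij")
    pts = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    # Euclidean ground distance between cell centers
    cost = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    emd = ot.emd2(a, b, cost)       # exact EMD via optimal transport
    return 1.0 - emd / cost.max()   # map to [0, 1]; this scaling is assumed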

B. Temporal Sampling

In this section, we analyze the sampling decisions on the temporal dimension and answer the question “where to place the mini-queries”. We quantify and analyze the temporal distribution with the goal of finding sampling points that yield accurate heat maps. Figure 8 shows the temporal distribution of the keyword “summer”, plotting the number of tweets per day over the course of one year. As expected, the number of


tweets associated with “summer” peaks over the summer. To quantify the distributions, we refer to the average number of tweets per day as a measure of popularity, and to the 0.75- as well as the 0.25-percentile of the distribution (dotted lines) as a measure of volatility or non-uniformity. In other words, the day with the maximum number of tweets is referred to as 1.0. We want to answer the question of “where to sample” by focusing only on the heat map accuracy.

Figure 8: Temporal distribution of keyword “summer”

Heatflip gives a higher priority to those days close to the 0.5 percentile. Intuitively, on an “average day”, the distribution of tweets should be most representative. Experiments on different datasets verify this intuition. For instance, we split up the time axis into equally sized intervals of one week. From every week, we retrieve the same, fixed number of tweets, namely the number of tweets present at the 0.125 percentile. This way, we validate the representativeness of each week’s tweets irrespective of the absolute number of tweets. We cancel out the effect that more tweets naturally hold more information because we are only concerned with the accuracy at every potential sampling interval. The dynamic adjustment of the interval length will later regulate the absolute number of tweets by balancing expected accuracy and query time. From the weekly sets of tweets, we generate sample heat maps that we compare with a reference heat map from the whole year. The results of this experiment are outlined in Figure 9. Consistently with our intuition, querying a day close to the 0.5 percentile leads to the most accurate sample heat map (accurate spatial distribution of tweets).

Figure 9: Temporal sampling point

On the contrary, during an anomaly like a temporal high, the heat map may be unrepresentative because users from regions other than the usual ones are unexpectedly tweeting. For instance, throughout the year, “superbowl” is relatively sparsely tweeted, mainly by permanent fans spread over the whole US. On the day of the Super Bowl game, however, an abnormally high number of tweets in urban areas is registered, affecting the year-round distribution. In Figure 9, we observe different accuracy levels for different keywords. Sampling popular keywords such as “trump” and “summer” generally produces more accurate heat maps than less popular keywords such as “shooting” or “superbowl”. The higher the mean number μ of tweets per day, the higher the accuracy, because more tweets provide more information. The spatial distribution of people tweeting “trump” is relatively stable over time. More importantly, querying days with more tweets than the mean number yields higher accuracy values than undershooting the mean number. This becomes very obvious for the keyword “shooting”: the accuracy of the sample heat map taken from the 0.25-percentile amounts to 78.2%, while the 0.75-percentile yields an 81.8% accuracy. Lastly, we observe a higher accuracy volatility for the keywords “superbowl” and “summer”, which relates to the high volatility of the keywords’ popularity. Conversely, “trump” is tweeted more constantly throughout the year, and the accuracy difference between days of different percentiles is insignificant.

C. Spatial Sampling

Analogous to analyzing the optimal temporal sampling point, we wonder “where the most representative spatial area lies”. While we evaluate the temporal samples by looking at their resulting spatial heat maps, we conversely assess a spatial area by looking at the temporal distribution.

Figure 10: Analysis of the most representative spatial area

We randomly slice fixed-size areas from different regions of the map, hitting dense areas with many tweets and sparse areas with only few tweets. We quantify the spatial areas according to their quantiles in the distribution of tweet counts per spatial area. An extremely dense area with the maximum number of tweets is referred to as 1.0, while the 0.5 quantile denotes a spatial area featuring an


average number of tweets. Areas at 0.0 may be sparse areas such as oceans or deserts. Again, to enable a fair accuracy comparison of the different areas independently of the absolute number of tweets, we sample a fixed number of tweets from every area (the number of tweets at the 0.125 percentile). Using the fixed number of tweets of every area, we can generate temporal distribution histograms as introduced in Figure 8 and check how representative the small spatial areas are by comparing their temporal distributions with the temporal distribution of the entire spatial map. Essentially, this is the same procedure as carried out in the temporal-sampling case, where we retrieved a fixed number of tweets from a small temporal time window, generated a heat map (a sample spatial distribution), and compared it to a reference heat map (spatial distribution). In spatial sampling, the quality of a sample is validated by comparing the sample temporal distribution with a reference temporal distribution. Figure 10 outlines this comparison procedure, showing the distribution percentile values of the spatial sample areas. In search of the most representative area, we compare the temporal distribution of sample areas with the reference area, calculate the similarity using EMD, and express the results as accuracy. The correlation between the tweet density of a region, expressed in distribution percentiles, and the accuracy of the region’s temporal distribution is shown in Figure 11.

Figure 11: Spatial sampling point

The results show that the accuracy peaks neither at very dense areas (percentile 1.0) nor at sparse ones (percentile 0.0). This finding coincides with our previous intuition and the temporal sampling case. Dense areas are not the most accurate since they mainly represent urban trends, while sparse areas on the other hand represent rural trends. In between, there is a “sweet spot” at medium-populated, suburban areas, which feature an average number of tweets around the 0.5 percentile.

D. Single- and Multi-Query Sampling

Performing sampling, we conceptually face a trade-off between two sampling incentives. On the one hand, we would like to sample at points where we expect a high accuracy for a given number of tweets. On the other hand, we foster explorative sampling, where we try to get a good picture of the whole distribution by switching between different sampling points. We express the two sampling preferences as sampling incentives (a) and (b), which are equally applied in Heatflip’s sampling strategy. Incentive (a) takes the single-query case and (b) the multi-query case into account. To achieve a good balance of the two incentives, we use a non-deterministic approach that bases the sampling decisions on an adaptive probability distribution function. We assume that we have sampled tweet counts of a certain keyword over a time period (Figure 8) or a spatial area (Figure 10). Based on the present temporal or spatial distribution, we want to come to a sampling decision.

As an example, we present the temporal sampling strategy in the following. We have obtained a temporal distribution η(t) from a small spatial area and will use this temporal distribution to make the next temporal sampling decision. To this end, we choose a sampling point t_p, e.g., a certain day, from a probability distribution function (pdf). The pdf is generated according to the current distribution of tweets η(t). Let the mean number of tweets over η(t) be μ, defined as μ = (1/τ*) Σ_t η(t), where τ* denotes the length of the interval used to obtain η, which is the complete time range. To construct the said probability density function (pdf), we apply a transformation function that assigns a sampling probability to every potential sampling point t in the complete time range τ*. The transformation function (Eq. 5) is based on previous findings on the correlation between sampling accuracy and distribution percentiles. We deduce sampling incentive (a) for the single-query case:

(a) Sampling at points where η(t) ≈ μ, thus at points featuring approximately the average number of tweets, yields more accurate results than choosing random sampling points.

To have a unified basis for the transformation of the input distribution η(t), we normalize it to the mean (Eq. 4) and obtain a normalized distribution η̂(t). The normalized distribution is then put through a transformation function (Eq. 5), which translates η̂(t) to a pdf and assigns sampling probabilities to every sampling point t.

$\hat{\eta}(t) = \left(\eta(t) - \mu\right) / \mu$   (4)

$\omega(x) = 1 \,/\, \left(M + A \tanh(|x|)\right)$   (5)

Figure 12: Transformation function ω(x)


In Figure 12, the transformation function (Eq. 5) is shown for A = 7.7 and M = 0.3. The parameter A controls the minimum of the transform for values far from the mean, which is 1/(M + A). The parameter M controls the height of the peak at the mean, which is 1/M. Regarding the size of the sample interval, we choose to include the same fixed number of tweets from both sides around the sample point that was yielded by the pdf. Sampling incentive (a) takes the single-query case into account, satisfying the preference to place samples at points where a good accuracy is expected. For the multi-query case, we introduce a second incentive:

(b) The aggregation of samples from different points leads to higher accuracy than remaining at a single point. Consequently, we are incentivized to avoid sampling close to old sampling points and prefer sampling points located far from previous ones.

Incorporating (b), we multiply the distribution with an inverted Gaussian function at every old sampling point t_j (Eq. 6). We choose the standard deviation of the Gaussian to be half of the sampling interval length. The parameter G determines the probability left at the old sampling point. Finally, we normalize the distribution to get a viable pdf (Eq. 7):

$q(t) = \omega(\hat{\eta}(t)) \cdot \prod_j \left[\,1 - (1 - G)\,\mathrm{Gauss}(t - t_j)\,\right]$   (6)

$p(t) = q(t) \,/\, \textstyle\sum_{t'} q(t')$   (7)

An example pdf for the keyword “summer” is shown in Figure 13. In the top graph, we see the number of tweets per day plotted over the time axis, as previously introduced in Figure 8. The dotted line represents the mean number of tweets per day. The second graph illustrates the sampling probability distribution p(t) for the single-sample case. Following incentive (a), the days featuring approximately the mean number of tweets are assigned a higher sampling probability and are more likely to be sampled. The sampling point is highlighted by a solid orange line, and the sampling interval borders by dotted orange lines. The bottom graph displays the pdf after 2 samples have been taken. The previous samples at points 1 and 2 are still visible in a less intense tone. According to sampling incentive (b), the probability at these points has been lowered and the probability of more distant points has been increased. This is why sample 3 is placed at a point that was previously assigned a low probability. Since the input to the transformation is a vector of arbitrary size and dimension, our approach naturally generalizes to multidimensional input arrays. Only the dimension of the Gaussian, which is used to cancel out the old sampling points, has to be adjusted. For this reason, the sampling strategy can be transferred to a 1-dimensional temporal and a 2-dimensional spatial framework. Figure 14 presents pseudo code of the entire sampling strategy.

Figure 14: Pseudo code of entire sampling strategy
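The pseudo code of Figure 14 is not reproduced here; the following runnable Python sketch implements Eqs. 4-7 for the temporal case. A = 7.7 and M = 0.3 are taken from the text; the value of G, the interval length, and the toy data are assumptions.

# Runnable sketch of the sampling strategy (cf. Figure 14), Eqs. 4-7.
import numpy as np

A, M, G = 7.7, 0.3, 0.1   # A, M from the text; G = 0.1 is assumed

def omega(x):
    # Eq. 5: peak 1/M at the mean, floor 1/(M+A) far from it
    return 1.0 / (M + A * np.tanh(np.abs(x)))

def sampling_pdf(eta, old_points, sigma):
    # Build the pdf p(t) over sampling points from tweet counts eta(t)
    mu = eta.mean()
    eta_hat = (eta - mu) / mu                        # Eq. 4
    q = omega(eta_hat)                               # incentive (a)
    t = np.arange(len(eta), dtype=float)
    for tj in old_points:                            # incentive (b), Eq. 6
        gauss = np.exp(-0.5 * ((t - tj) / sigma) ** 2)
        q *= 1.0 - (1.0 - G) * gauss
    return q / q.sum()                               # Eq. 7

# choose the next temporal sampling point from the pdf
rng = np.random.default_rng(0)
eta = rng.poisson(100, size=365).astype(float)       # toy daily tweet counts
p = sampling_pdf(eta, old_points=[40, 200], sigma=3.5)  # sigma = tau/2, tau = 7 days assumed
next_point = rng.choice(len(eta), p=p)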


Figure 13: Tweet distribution and resulting pdf


V. EXPERIMENTS

A. Setup and Dataset

We put Heatflip to the test by visualizing heat maps on different keywords. We contrast the conventional approach of a full database scan (as introduced in Section II), random sampling, and Heatflip in terms of total processing time and scalability. The social media data used in the evaluation has been collected through the public Twitter API. Our total number of geo-tagged tweets is 170 million, all gathered in the period from January 2017 to December 2017, spanning a time range of 12 months. To put the general validity to the test, we use Twitter data from 3 countries with different user behaviors and population distributions: the US (115M), Japan (25M), and Germany (30M). All tweets feature textual, temporal, and spatial attributes. When referring to a keyword, we mean the occurrence of a word in the “text” attribute, which includes both the pure tweet and its corresponding hashtags. The “create_at” tag denotes the date and time when the tweet was published. A geo-tagged tweet is associated with a “bounding_box”, a rectangular space defined by two coordinate points. All data is stored in the Apache AsterixDB system running on a cluster of five Intel NUC machines with 16GB working memory and four cores each. Each server has a 500GB SSD. With about 4KB per tweet, the 170 million tweets occupy 680GB of storage space.

B. Processing Time with Minimum Accuracy Threshold

When comparing the performance of the integrated Heatflip approach with random sampling and the conventional, decoupled approach, we find a significant difference in performance. Figure 15 (left) outlines the total running time under the premise of at least 95% accuracy for different keywords. The accuracy was measured by comparing the sample heat map to a ground-truth heat map generated from all tweets, using EMD as introduced in Section IV.A. Since the accuracy cannot be verified during the processing time, it has been validated after termination. Firstly, we find that irrespective of the cell size, shifting the heat map generation to the backend database yields shorter computing times. The querying of the data demands a major share of the total processing time. For this reason, the contributions of Heatflip are centered around lowering the query time and performing temporal-spatial sampling. Secondly, the larger the number of tweets to visualize, the more effective are techniques that rely on sampling. In fact, if the number of tweets grows infinitely large, random sampling may perform better than Heatflip. This is due to the additional spatial samples that do not directly contribute to building the heat map but rather to gaining meta information on temporal characteristics. Conversely, Heatflip shines on keywords with distinct temporal or spatial anomalies, where brute-force random sampling requires more time until sufficient accuracy is reached. Thirdly, “the bigger the data, the more effective Heatflip”. We find that the visualization of the keyword “summer” (480K tweets) terminates in about 85% less time compared to a full database scan when using Heatflip. In contrast, visualizing the 90K tweets associated with “superbowl”, our new approach only yields an 80% time saving. This may be traced back to the fact that every single database query incurs an overhead time irrespective of the keyword popularity, and more data holds more information.

Figure 16: Heat map visualization comparison

Figure 15: Heat map visualization evaluation


Figure 16 shows two heat map visualizations of the keyword “summer”. The heat maps represent the US in the time from January 2017 to December 2017, with the conventional approach on the left and Heatflip on the right. From the highlighted areas in the map, one may identify the contour of the United States with its two coastlines. Both heat maps visualize the same time range of one year. For the evaluation, we normalize both heat maps as introduced in Section III.C (Eq. 1) and then measure their similarity using the Earth Mover’s distance (EMD) [17] (Eq. 3). The resulting visualizations are 97.4% identical, even though the right-side heat map is based on only 12.5% of all tweets featuring “summer”. Heatflip makes extensive use of sampling and manages to visualize an accurate heat map using data from only 46 days, thereby cutting the processing time from 91 seconds to 13 seconds.

C. Scalability

When generating a heat map for a highly popular keyword, sampling becomes extremely efficient. We shed light on this observation in Figure 15 (right). In order to analyze Heatflip’s scalability, we consider all tweets from the US, Japan, and Germany, irrespective of the keywords. We take shares (subsets) of the total datasets, generate heat maps, and measure the time. The findings confirm our previous intuition on the scalability of the approach. Visualizing half of the entire US dataset (57.5M tweets) takes only 20% more time than visualizing half of the Japan dataset (12.5M tweets), even though it contains more than four times as much data.

VI. CONCLUSIONS AND FUTURE WORK

With Heatflip, we make heat map visualization progressive and interactive without sacrificing an unreasonable amount of visualization accuracy. The technique does not impose any unrealistic constraints: it neither requires any a priori knowledge of the data, nor is its application restricted to visualizing social media data only. The great strength of the technique lies in the uncompromising adaptation to the heat map problem. The analytical findings and the technique may be adopted by other systems, as they hold general validity.

We would like to point out accompanying findings and room for improvement. A more dynamic sampling model should incorporate an adaptive interval length that balances the trade-off between the best-possible heat map accuracy and the shortest-possible query time. Our idea is: if the marginal increase in accuracy becomes smaller than the marginal increase in query time, we should not further extend the interval length. Drum [14] integrates a dynamic interval length that considers the total running time as well as the smoothness of the result. IncVisage [18] conceptually places the trade-off between accuracy and processing time at the center of the discussion.

REFERENCES

[1] Y. Jia, Z. Zhang, and M. Sarwat, “BABYLON: An End-to-End Visual Analytics System for Massive-Scale Geospatial Data,” 2017.

[2] S. Idreos, O. Papaemmanouil, and S. Chaudhuri, “Overview of Data Exploration Techniques,” 2015, pp. 277–281.

[3] “Apache Superset (incubating) — Apache Superset documentation.” [Online]. Available: https://superset.incubator.apache.org/. [Accessed: 10-Jul-2018].

[4] F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli, “Druid: a real-time analytical data store,” 2014, pp. 157–168.

[5] “Amazon QuickSight - Cloud Based Business Intelligence.” [Online]. Available: https://aws.amazon.com/quicksight/. [Accessed: 18-Aug-2018].

[6] Y. Park, B. Mozafari, J. Sorenson, and J. Wang, “VerdictDB: Universalizing Approximate Query Processing,” 2018, pp. 1461–1476.

[7] “Business Intelligence und Analytics | Tableau Software.” [Online]. Available: https://www.tableau.com/de-de. [Accessed: 10-Jul-2018].

[8] A. Eldawy, M. Mokbel, and C. Jonathan, “HadoopViz: A MapReduce framework for extensible visualization of big spatial data,” 2016, pp. 601–612.

[9] U. Jugel, Z. Jerzak, G. Hackenbroich, and V. Markl, “M4: A Visualization-Oriented Time Series Data Aggregation,” presented at the PVLDB, 2014, vol. 7, pp. 797–808.

[10] A. Magdy et al., “Taghreed: a system for querying, analyzing, and visualizing geotagged microblogs,” 2014, pp. 163–172.

[11] “MapD Immerse,” 2018. [Online]. Available: https://www.mapd.com/demos/taxis/#/dashboard?_k=f0sumg. [Accessed: 04-Feb-2018].

[12] A. Magdy and M. F. Mokbel, “Demonstration of Kite: A Scalable System for Microblogs Data Management,” 2017, pp. 1383–1384.

[13] E. Wu, F. Psallidas, Z. Miao, H. Zhang, and L. Rettig, “Combining Design and Performance in a Data Visualization Management System,” presented at the 8th Biennial Conference on Innovative Data Systems Research (CIDR ‘17), Chaminade, California, USA, 2017.

[14] J. Jia, C. Li, and M. J. Carey, “Drum: A Rhythmic Approach to Interactive Analytics on Large Data,” presented at the IEEE Big Data, Irvine, United States, 2017.

[15] “Cloudberry, UC Irvine,” 2018. [Online]. Available: http://cloudberry.ics.uci.edu/. [Accessed: 09-Jan-2018].

[16] J. Jia, C. Li, X. Zhang, C. Li, M. J. Carey, and S. Su, “Towards interactive analytics and visualization on one billion tweets,” 2016, pp. 1–4.

[17] S. T. Rachev, “The Monge-Kantorovich mass transference problem and its stochastic applications,” Theory of Probability and its Applications, vol. XXIX (4), pp. 647–676, 1984.

[18] S. Rahman et al., “I’ve seen ‘enough’: incrementally improving visualizations to support rapid decision making,” Proceedings of the VLDB Endowment, vol. 10, no. 11, pp. 1262–1273, Aug. 2017.
