
ALSO INSIDE

Do All Roads Lead Back to SQL? >>

Applying the Lambda Architecture >>

From the Vault: Easy Real-Time Big Data Analysis Using Storm >>

www.drdobbs.com

Dr. Dobb's Journal, November 2013

Really Big Data

Understanding What Big Data Can Deliver


CONTENTS

COVER ARTICLE
8 Understanding What Big Data Can Deliver
By Aaron Kimball
It's easy to err by pushing data to fit a projected model. Insights come, however, from accepting the data's ability to depict what is going on, without imposing an a priori bias.

GUEST EDITORIAL
3 Do All Roads Lead Back to SQL?
By Seth Proctor
After distancing themselves from SQL, NoSQL products are moving towards transactional models as "NewSQL" gains popularity. What happened?

FEATURES
15 Applying the Big Data Lambda Architecture
By Michael Hausenblas
A look inside a Hadoop-based project that matches connections in social media by leveraging the highly scalable lambda architecture.

23 From the Vault: Easy Real-Time Big Data Analysis Using Storm
By Shruthi Kumar and Siddharth Patankar
If you're looking to handle big data and don't want to traverse the Hadoop universe, you might well find that using Storm is a simple and elegant solution.

6 News Briefs
By Adrian Bridgwater
Recent news on tools, platforms, frameworks, and the state of the software development world.

7 Open-Source Dashboard
A compilation of trending open-source projects.

34 Links
Snapshots of interesting items on drdobbs.com, including a look at the first steps to implementing Continuous Delivery and developing Android apps with Scala and Scaloid.


More on DrDobbs.com

Jolt Awards: The Best Books
Five notable books every serious programmer should read.
http://www.drdobbs.com/240162065

A Massively Parallel Stack for Data Allocation
Dynamic parallelism is an important evolutionary step in the CUDA software development platform. With it, developers can perform variable amounts of work based on divide-and-conquer algorithms and in-memory data structures such as trees and graphs — entirely on the GPU without host intervention.
http://www.drdobbs.com/240162018

Introduction to Programming with Lists
What it's like to program with immutable lists.
http://www.drdobbs.com/240162440

Who Are Software Developers?
Ten years of surveys show an influx of younger developers, more women, and personality profiles at odds with traditional stereotypes.
http://www.drdobbs.com/240162014

Java and IoT In Motion
Eric Bruno was involved in the construction of the Internet of Things (IoT) concept project called "IoT In Motion." He helped build some of the back-end components, including a RESTful service written in Java with some database queries, and helped a bit with the front-end as well.
http://www.drdobbs.com/240162189


Do All Roads Lead Back to SQL?

After distancing themselves from SQL, NoSQL products are moving towards transactional models as "NewSQL" gains popularity. What happened?

By Seth Proctor

Much has been made in the past several years about SQL versus NoSQL and which model is better suited to modern, scale-out deployments. Lost in many of these arguments is the raison d'être for SQL and the difference between model and implementation. As new architectures emerge, the question is why SQL endures and why there is such a renewed interest in it today.

Background

In 1970, Edgar Codd captured his thoughts on relational logic in a paper that laid out rules for structuring and querying data (http://is.gd/upAlYi). A decade later, the Structured Query Language (SQL) began to emerge. While not entirely faithful to Codd's original rules, it provided relational capabilities through a mostly declarative language and helped solve the problem of how to manage growing quantities of data.

Over the next 30 years, SQL evolved into the canonical data-management language, thanks largely to the clarity and power of its underlying model and transactional guarantees. For much of that time, deployments were dominated by scale-up or "vertical" architectures, in which increased capacity comes from upgrading to bigger, individual systems. Unsurprisingly, this is also the design path that most SQL implementations followed.

The term "NoSQL" was coined in 1998 by a database that provided relational logic but eschewed SQL (http://is.gd/sxH0qy). It wasn't until 2009 that this term took on its current, non-ACID meaning. By then, typical deployments had already shifted to scale-out or "horizontal" models. The perception was that SQL could not provide scale-out capability, and so new non-SQL programming models gained popularity.

Fast-forward to 2013 and, after a period of decline, SQL is regaining popularity in the form of NewSQL (http://is.gd/x0c5uu) implementations. Arguably, SQL never really lost popularity (the market is estimated at $30 billion and growing), it just went out of style. Either way, this new generation of systems is stepping back to look at the last 40 years and understand what that tells us about future design by applying the power of relational logic to the requirements of scale-out deployments.

Why SQL?

SQL evolved as a language because it solved concrete problems. The relational model was built on capturing the flow of real-world data. If a purchase is made, it relates to some customer and product.




If a song is played, it relates to an artist, an album, a genre, and so on. By defining these relations, programmers know how to work with data, and the system knows how to optimize queries. Once these relations are defined, then other uses of the data (audit, governance, etc.) are much easier.

Layered on top of this model are transactions. Transactions are boundaries guaranteeing the programmer a consistent view of the database, independent execution relative to other transactions, and clear behavior when two transactions try to make conflicting changes. That's the A (atomicity), C (consistency), and I (isolation) in ACID. To say a transaction has committed means that these rules were met, and that any changes were made Durable (the D in ACID). Either everything succeeds or nothing is changed.

Transactions were introduced as a simplification. They free developers from having to think about concurrent access, locking, or whether their changes are recorded. In this model, a multithreaded service can be programmed as if there were only a single thread. Such programming simplification is extremely useful on a single server. When scaling across a distributed environment, it becomes critical.
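To make "either everything succeeds or nothing is changed" concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and amounts are purely illustrative and no particular database product discussed in this piece is implied. The two updates inside the transaction either both commit or, when the second step fails, both roll back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, amount, fail_midway=False):
    # 'with conn:' opens a transaction: commit on success, rollback on any exception.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        if fail_midway:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))

try:
    transfer(conn, 70, fail_midway=True)
except RuntimeError:
    pass

# Neither update took effect; the database is unchanged.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]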

With these features in place, developers building on SQL were able to be more productive and focus on their applications. Of particular importance is consistency. Many NoSQL systems sacrifice consistency for scalability, putting the burden back on application developers. This trade-off makes it easier to build a scale-out database, but typically leaves developers choosing between scale and transactional consistency.

Why Not SQL?

It's natural to ask why SQL is seen as a mismatch for scale-out architectures, and there are a few key answers. The first is that traditional SQL implementations have trouble scaling horizontally. This has led to approaches like sharding, passive replication, and shared-disk clustering. The limitations (http://is.gd/SaoHcL) are functions of designing around direct disk interaction and limited main memory, however, and not inherent in SQL.

A second issue is structure. Many NoSQL systems tout the benefit of having no (or a limited) schema. In practice, developers still need some contract with their data to be effective. It's flexibility that's needed — an easy and efficient way to change structure and types as an application evolves. The common perception is that SQL cannot provide this flexibility, but again, this is a function of implementation. When table structure is tied to on-disk representation, making changes to that structure is very expensive; whereas nothing in Codd's logic makes adding or renaming a column expensive.

Finally, some argue that SQL itself is too complicated a language for today's programmers. The arguments on both sides are somewhat subjective, but the reality is that SQL is a widely used language with a large community of programmers and a deep base of tools for tasks like authoring, backup, or analysis. Many NewSQL systems are layering simpler languages on top of full SQL support to help bridge the gap between NoSQL and SQL systems. Both have their utility and their uses in modern environments.




To many developers, however, being able to reuse tools and experience in the context of a scale-out database means not having to compromise on scale versus consistency.

Where Are We Heading?

The last few years have seen renewed excitement around SQL. NewSQL systems have emerged that support transactional SQL, built on original architectures that address scale-out requirements. These systems are demonstrating that transactions and SQL can scale when built on the right design. Google, for instance, developed F1 (http://is.gd/Z3UDRU) because it viewed SQL as the right way to address concurrency, consistency, and durability requirements. F1 is specific to the Google infrastructure, but it is proof that SQL can scale and that the programming model still solves critical problems in today's data centers.

Increasingly, NewSQL systems are showing scale, schema flexibility, and ease of use. Interestingly, many NoSQL and analytic systems are now putting limited transactional support or richer query languages into their roadmaps in a move to fill in the gaps around ACID and declarative programming. What that means for the evolution of these systems is yet to be seen, but clearly, the appeal of Codd's model is as strong as ever 43 years later.

— Seth Proctor serves as Chief Technology Officer of NuoDB Inc. and has more than 15 years of experience in the research, design, and implementation of scalable systems. His previous work includes contributions to the Java security framework, the Solaris operating system, and several open-source projects.


News Briefs

By Adrian Bridgwater


Progress Pacific PaaS Is a Wider Developer's PaaS

Progress has used its Progress Exchange 2013 exhibition and developer conference to announce new features in the Progress Pacific platform-as-a-service (PaaS) that allow more time and energy to be spent solving business problems with data-driven applications and less time worrying about technology and writing code. This is a case of cloud-centric, data-driven software application development supporting workflows that are engineered around Real Time Data (RTD) from disparate sources, other SaaS entities, sensors, and points within the Internet of Things. For developers, these workflows must be functional for mobile, on-premise, and hybrid apps where minimal coding is required, so that the programmer is isolated to a degree from the complexity of middleware, APIs, and drivers.
http://www.drdobbs.com/240162366

New Java Module in SOASTA CloudTest

SOASTA has announced the latest release of CloudTest with a new Java module that enables developers and testers of Java applications to test any Java component as they work to "easily scale" it. New Direct-to-Database testing capabilities support Oracle, Microsoft SQL Server, and PostgreSQL databases, which is important for end-to-end testing for enterprise developers; CloudTest users can now directly test the scalability of the most popular enterprise and open-source SQL databases. Also, additional in-memory processing enhancements make dashboard loading faster for in-test analytics.
http://www.drdobbs.com/240162292

HBase Apps and the 20 Millisecond Factor

MapR Technologies has updated its M7 edition to improve HBase application performance with throughput that is 4-10x faster while eliminating latency spikes. HBase applications can now benefit from MapR's platform to address one of the major issues for online applications, consistent read latencies in the "less than 20 millisecond" range, as they exist across varying workloads. Differentiated features here include architecture that persists table structure at the filesystem layer; no compactions (I/O storms) for HBase applications; workload-aware splits for HBase applications; direct writes to disk (vs. writing to an external filesystem); disk and network compression; and a C++ implementation that does not suffer from the garbage collection problems seen with Java applications.
http://www.drdobbs.com/240162218

Sauce Labs and Microsoft Whip Up BrowserSwarm

Sauce Labs and Microsoft have partnered to announce BrowserSwarm, a project to streamline JavaScript testing of Web and mobile apps and decrease the amount of time developers spend on debugging application errors. BrowserSwarm is a tool that automates testing of JavaScript across browsers and mobile devices. It connects directly to a development team's code repository on GitHub. When the code gets updated, BrowserSwarm automatically executes a suite of tests using common unit testing frameworks against a wide array of browser and OS combinations. BrowserSwarm is powered on the backend by Sauce Labs and allows developers and QA engineers to automatically test Web and mobile apps across 150+ browser/OS combinations, including iOS, Android, and Mac OS X.
http://www.drdobbs.com/240162298



Open-Source Dashboard


Top Open-Source Projects

Trending this month on GitHub:

jlukic/Semantic-UI (JavaScript)
https://github.com/jlukic/Semantic-UI
Creating a shared vocabulary for UI.

HubSpot/pace (CSS)
https://github.com/HubSpot/pace
Automatic Web page progress bar.

maroslaw/rainyday.js (JavaScript)
https://github.com/maroslaw/rainyday.js
Simulating raindrops falling on a window.

peachananr/onepage-scroll (JavaScript)
https://github.com/peachananr/onepage-scroll
Create an Apple-like one page scroller website (iPhone 5S website) with One Page Scroll plugin.

twbs/bootstrap (JavaScript)
https://github.com/twbs/bootstrap
Sleek, intuitive, and powerful front-end framework for faster and easier Web development.

mozilla/togetherjs (JavaScript)
https://github.com/mozilla/togetherjs
A service for your website that makes it surprisingly easy to collaborate in real-time.

daviferreira/medium-editor (JavaScript)
https://github.com/daviferreira/medium-editor
Medium.com WYSIWYG editor clone.

alvarotrigo/fullPage.js (JavaScript)
https://github.com/alvarotrigo/fullPage.js
fullPage plugin by Alvaro Trigo. Create full-screen pages fast and simple.

angular/angular.js (JavaScript)
https://github.com/angular/angular.js
Extend HTML vocabulary for your applications.

Trending this month on SourceForge:

Notepad++ Plugin Manager
http://sourceforge.net/projects/npppluginmgr/
The plugin list for Notepad++ Plugin Manager, with code for the plugin manager.

MinGW: Minimalist GNU for Windows
http://sourceforge.net/projects/mingw/
A native Windows port of the GNU Compiler Collection (GCC).

Apache OpenOffice
http://sourceforge.net/projects/openofficeorg.mirror/
An open-source office productivity software suite containing word processor, spreadsheet, presentation, graphics, formula editor, and database management applications.

YTD Android
http://sourceforge.net/projects/rahul/
Files Downloader is a free, powerful utility that will help you to download your favorite videos from YouTube. The application is platform-independent.

PortableApps.com
http://sourceforge.net/projects/portableapps/
Popular portable software solution.

Media Player Classic: Home Cinema
http://sourceforge.net/projects/mpc-hc/
This project is based on the original Guliverkli project, and contains additional features and bug fixes (see the complete list on the project's website).

Anti-Spam SMTP Proxy Server
http://sourceforge.net/projects/assp/
The Anti-Spam SMTP Proxy (ASSP) Server project aims to create an open-source, platform-independent SMTP proxy server.

Ubuntuzilla: Mozilla Software Installer
http://sourceforge.net/projects/ubuntuzilla/
An APT repository hosting the Mozilla builds of the latest official releases of Firefox, Thunderbird, and Seamonkey.


Understanding What Big Data Can Deliver

It's easy to err by pushing data to fit a projected model. Insights come, however, from accepting the data's ability to depict what is going on, without imposing an a priori bias.

By Aaron Kimball

With all the hype and anti-hype surrounding Big Data, the data management practitioner is, in an ironic turn of events, inundated with information about Big Data. It is easy to get lost trying to figure out whether you have Big Data problems and, if so, how to solve them. It turns out the secret to taming your Big Data problems is in the detail data. This article explains how focusing on the details is the most important part of a successful Big Data project.

Big Data is not a new idea. Gartner coined the term a decade ago, describing Big Data as data that exhibits three attributes: Volume, Velocity, and Variety. Industry pundits have been trying to figure out what that means ever since. Some have even added more "Vs" to try and better explain why Big Data is something new and different than all the other data that came before it.

The cadence of commentary on Big Data has quickened to the extent that if you set up a Google News alert for "Big Data," you will spend more of your day reading about Big Data than implementing a Big Data solution. What the analysts gloss over and the vendors attempt to simplify is that Big Data is primarily a function of digging into the details of the data you already have.

Gartner might have coined the term "Big Data," but they did not invent the concept. Big Data was just rarer then than it is today. Many companies have been managing Big Data for ten years or more. These companies may not have had the efficiencies of scale that we benefit from currently, yet they were certainly paying attention to the details of their data and storing as much of it as they could afford.

A Brief History of Data Management

Data management has always been a balancing act between the volume of data and our capacity to store, process, and understand it.

The biggest achievement of the On Line Analytic Processing (OLAP) era was to give users interactive access to data, which was summarized across multiple dimensions. OLAP systems spent a significant amount of time up front to pre-calculate a wide variety of aggregations over a data set that could not otherwise be queried interactively. The output was called a "cube" and was typically stored in memory, giving end users the ability to ask any question that had a pre-computed answer and get results in less than a second.



Big Data is exploding as we enter the era of plenty — high bandwidth, greater storage capacity, and many processor cores. New software, written after these systems became available, is different than its forebears. Instead of highly tuned, high-priced systems that optimize for the minimum amount of data required to answer a question, the new software captures as much data as possible in order to answer as-yet-undefined queries. With this new data captured and stored, there are a lot of details that were previously unseen.

Why More Data Beats Better Algorithms

Before I get into how detail data is used, it is crucial to understand at the algorithmic level the signal importance of detail data. Since the former Director of Technology at Amazon.com, Anand Rajaraman, first expounded the concept that "more data beats better algorithms," his claim has been supported and attacked many times. The truth behind his assertion is rather subtle. To really understand it, we need to be more specific about what Rajaraman said, then explain in a simple example how it works.

Experienced statisticians understand that having more training data can improve the accuracy of and confidence in a model. For example, say we believe that the relationship between two variables — such as number of pages viewed on a website and percent likelihood to make a purchase — is linear. Having more data points would improve our estimate of the underlying linear relationship. Compare the graphs in Figures 1 and 2, showing that more data will give us a more accurate and confident estimation of the linear relationship.

A statistician would also be quick to point out that we cannot increase the effectiveness of this pre-selected model by adding even more data. Adding another 100 data points to Figure 2, for example, would not greatly improve the accuracy of the model.


Figure 1: Using little data to estimate a relationship.
Figure 2: The same relationship with more data.


The marginal benefit of adding more training data in this case decreases quickly. Given this example, we could argue that having more data does not always beat more-sophisticated algorithms at predicting the expected outcome. To increase accuracy as we add data, we would need to change our model.

The "trick" to effectively using more data is to make fewer initial assumptions about the underlying model and let the data guide which model is most appropriate. In Figure 1, we assumed the linear model after collecting very little data about the relationship between page views and propensity to purchase. As we will see, if we deploy our linear model, which was built on a small sample of data, onto a large data set, we will not get very accurate estimates. If instead we are not constrained by data collection, we could collect and plot all of the data before committing to any simplifying assumptions. In Figure 3, we see that additional data reveals a more complex clustering of data points.

By making a few weak (that is, tentative) assumptions, we can evaluate alternative models. For example, we can use a density estimation technique instead of using the linear parametric model, or use other techniques. With an order of magnitude more data, we might see that the true relationship is not linear. For example, representing our model as a histogram as in Figure 4 would produce a much better picture of the underlying relationship.

Figure 3: Even more data shows a different relationship.
Figure 4: The data in Figure 3 represented as a histogram.


Linear regression does not predict the relationship between the variables accurately because we have already made too strong an assumption that does not allow for additional unique features in the data to be captured — such as the U-shaped dip between 20 and 30 on the x-axis. With this much data, using a histogram results in a very accurate model. Detail data allows us to pick a nonparametric model — such as estimating a distribution with a histogram — and gives us more confidence that we are building an accurate model.

If this were a much larger parameter space, the model itself, represented by just the histogram, could be very large. Using nonparametric models is common in Big Data analysis because detail data allows us to let the data guide our model selection, especially when the model is too large to fit in memory on a single machine. Some examples include item similarity matrices for millions of products and association rules derived using collaborative filtering techniques.
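The figures are not reproduced here, but the contrast is easy to recreate on synthetic data. The following sketch (illustrative numbers only; it assumes NumPy is installed and is not code from the author) fits a single straight line to a relationship that actually contains a dip, then computes simple binned means, the programmatic equivalent of the histogram-style, nonparametric view described above.

import numpy as np

rng = np.random.default_rng(42)

# Synthetic "detail data": pages viewed (0-50) vs. purchase likelihood (%),
# with a deliberate dip between 20 and 30 pages viewed.
pages = rng.uniform(0, 50, 5000)
likelihood = 2.0 * pages - 40.0 * ((pages > 20) & (pages < 30)) + rng.normal(0, 5, pages.size)

# Strong a priori assumption: one straight line for the whole relationship.
slope, intercept = np.polyfit(pages, likelihood, 1)
print("linear fit: likelihood ~ %.1f * pages + %.1f" % (slope, intercept))

# Weak assumption: just average the observations within each 5-page bin (a "regressogram").
bins = np.arange(0, 55, 5)
which_bin = np.digitize(pages, bins)
binned_means = [round(likelihood[which_bin == b].mean(), 1) for b in range(1, len(bins))]
print("binned means:", binned_means)
# The binned means reveal the dip between 20 and 30 that the single line smooths away.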

One Model to Rule Them All

The example in Figures 1 through 4 demonstrates a two-dimensional model mapping the number of pages a customer views on a website to the percent likelihood that the customer will make a purchase. It may be the case that one type of customer, say a homemaker looking for the right style of throw pillow, is more likely to make a purchase the more pages they view. Another type of customer — for example, an amateur contractor — may only view a lot of pages when doing research. Contractors might be more likely to make a purchase when they go directly to the product they know they want. Introducing additional dimensions can dramatically complicate the model; and maintaining a single model can create an overly generalized estimation.

Customer segmentation can be used to increase the accuracy of a model while keeping complexity under control. By using additional data to first identify which model to apply, it is possible to introduce additional dimensions and derive more-accurate estimations. In this example, by looking at the first product that a customer searches for, we can select a different model to apply based on our prediction of which segment of the population the customer falls into. We use a different model for segmentation based on data that is related yet distinct from the data we use for the model that predicts how likely the customer is to make a purchase. First, we consider a specific product that they look at, and then we consider the number of pages they visit.
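As a concrete illustration of that two-stage idea, here is a minimal sketch in which one lightweight rule assigns a segment from the first product searched, and a separate per-segment model then maps pages viewed to purchase likelihood. All names, rules, and numbers are hypothetical; real segment models would be trained on detail data.

def assign_segment(first_product_searched):
    # Stage 1: pick a segment from related but distinct data (the first search).
    product = first_product_searched.lower()
    if "pillow" in product:
        return "homemaker"
    if "lumber" in product:
        return "contractor"
    return "unknown"

# Stage 2: a different pages-viewed model per segment.
# These callables stand in for models trained on each segment's detail data.
SEGMENT_MODELS = {
    "homemaker": lambda pages: min(100.0, 3.0 * pages),        # more browsing, more likely to buy
    "contractor": lambda pages: max(5.0, 60.0 - 2.0 * pages),  # long sessions look like research
    "unknown": lambda pages: 1.5 * pages,                      # overly general fallback
}

def purchase_likelihood(first_product_searched, pages_viewed):
    segment = assign_segment(first_product_searched)
    return SEGMENT_MODELS[segment](pages_viewed)

print(purchase_likelihood("blue throw pillow", 12))  # homemaker-style estimate
print(purchase_likelihood("2x4 lumber", 12))         # contractor-style estimate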

Demographics and Segmentation No Longer Are Sufficient

Applications that focus on identifying categories of users are built with user segmentation systems. Historically, user segmentation was based on demographic information. For example, a customer might have been identified as a male between the ages of 25-34 with an annual household income of $100,000-$150,000 and living in a particular county or zip code. As a means of powering advertising channels such as television, radio, newspapers, or direct mailings, this level of detail was sufficient. Each media outlet would survey its listeners or readers to identify the demographics for a particular piece of syndicated content, and advertisers could pick a spot based on the audience segment.

With the evolution of online advertising and Internet-based media, segmentation started to become more refined.




Instead of a dozen demographic attributes, publishers were able to get much more specific about a customer's profile. For example, based on Internet browsing habits, retailers could tell whether a customer lived alone, was in a relationship, traveled regularly, and so on. All this information was available previously, but it was difficult to collate. By instrumenting customer website browsing behavior and correlating this data with purchases, retailers could fine-tune their segmenting algorithms and create ads targeted to specific types of customers.

Today, nearly every Web page a user views is connected directly to an advertising network. These ad networks connect to ad exchanges to find bidders for the screen real estate of the user's Web browser. Ad exchanges operate like stock exchanges except that each bid slot is for a one-time ad to a specific user. The exchange uses the user's profile information or their browser cookies to convey the customer segment of the user. Advertisers work with specialized digital marketing firms whose algorithms try to match the potential viewer of an advertisement with the available ad inventory and bid appropriately.

Real-Time Updating of Data Matters (People Aren't Static)

Segmentation data used to change rarely, with one segmentation map reflecting the profile of a particular audience for months at a time; today, segmentation can be updated throughout the day as customers' profiles change. Using the same information gleaned from user behavior that assigns a customer's initial segment group, organizations can update a customer's segment on a click-by-click basis. Each action better informs the segmentation model and is used to identify what information to present next.

The process of constantly re-evaluating customer segmentation has enabled new dynamic applications that were previously impossible in the offline world. For example, when a model results in an incorrect segmentation assignment, new data based on customer actions can be used to update the model. If presenting the homemaker with a power tool prompts the homemaker to go back to the search bar, the segmentation results are probably mistaken. As details about a customer emerge, the model's results become more accurate. A customer that the model initially predicted was an amateur contractor looking at large quantities of lumber may in fact be a professional contractor.

By constantly collecting new data and re-evaluating the models, online applications can tailor the experience to precisely what a customer is looking for. Over longer periods of time, models can take into account new data and adjust based on larger trends. For example, a stereotypical life trajectory involves entering into a long-term relationship, getting engaged, getting married, having children, and moving to the suburbs. At each stage in life, and in particular during the transitions, one's segment group changes. By collecting detailed data about online behaviors and constantly reassessing the segmentation model, these life transitions are automatically incorporated into the user's application experience.




Instrument Everything

We've shown examples of how detail data can be used to pick better models, which result in more accurate predictions. And I have explained how models built on detail data can be used to create better application experiences and adapt more quickly to changes in customer behavior. If you've become a believer in the power of detail data and you're not already drowning in it, you likely want to know how to get some.

It is often said that the only way to get better at something is to measure it. This is true of customer engagement as well. By recording the details of an application, organizations can effectively recreate the flow of interaction. This includes not just the record of purchases, but a record of each page view, every search query or selected category, and the details of all items that a customer viewed. Imagine a store clerk taking notes as a customer browses and shops or asks for assistance. All of these actions can be captured automatically when the interaction is digital.

Instrumentation can be accomplished in two ways. Most modern Web and application servers record logs of their activity to assist with operations and troubleshooting. By processing these logs, it is possible to extract the relevant information about user interactions with an application. A more direct method of instrumentation is to explicitly record actions taken by an application into a database. When the application, running in an application server, receives a request to display all the throw pillows in the catalog, it records this request and associates it with the current user.
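What explicitly recording actions might look like in practice: a minimal sketch that appends one event per user action to an append-only log file (a database table would work just as well). The field names and the record_event helper are illustrative, not part of any particular product or of the author's code.

import json
import time

def record_event(log_path, user_id, session_id, action, **details):
    # Append a single user action (page view, search, category click, ...) as one JSON line.
    event = {
        "timestamp": time.time(),
        "user_id": user_id,
        "session_id": session_id,
        "action": action,
        "details": details,
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(event) + "\n")

# For example, the application server handling a catalog request:
record_event("events.log", user_id="u-1234", session_id="s-9876",
             action="view_category", category="throw pillows", results_shown=48)
record_event("events.log", user_id="u-1234", session_id="s-9876",
             action="search", query="outdoor pillow", results_shown=12)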

Test Constantly

The result of collecting detail data, building more accurate models, and refining customer segments is a lot of variability in what gets shown to a particular customer. As with any model-based system, past performance is not necessarily indicative of future results. The relationships between variables change, customer behavior changes, and of course reference data such as product catalogs change. In order to know whether a model is producing results that help drive customers to success, organizations must test and compare multiple models.

A/B testing is used to compare the performance of a fixed number of experiments over a set amount of time. For example, when deciding which of several versions of an image of a pillow a customer is most likely to click on, you can select a subset of customers to show one image or another. What A/B testing does not capture is the reason behind a result. It may be by chance that a high percentage of customers who saw version A of the pillow were not looking for pillows at all and would not have clicked on version B either.


Automatic Data Collection

Some data is already collected automatically. Every Web server records details about the information requested by the customer's Web browser. While not well organized or obviously usable, this information often includes sufficient detail to reconstruct a customer's session. The log records include timestamps, session identifiers, the client IP address, and the request URL including the query string. If this data is combined with a session table, a geo-IP database, and a product catalog, it is possible to fairly accurately reconstruct the customer's browsing experience.
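As a rough sketch of the first step described in the sidebar, the snippet below pulls the timestamp, client IP, request URL, and a session identifier out of access-log lines so that sessions can be reconstructed downstream. The exact log format and the session field are assumptions; real server configurations vary.

import re
from urllib.parse import urlparse, parse_qs

# Assumed line format: IP - - [timestamp] "GET /path?query HTTP/1.1" status size "session=<id>"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+ "session=(?P<session>[^"]*)"'
)

def parse_line(line):
    match = LOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    parsed = urlparse(fields["url"])
    fields["path"] = parsed.path
    fields["query"] = parse_qs(parsed.query)  # e.g. {'category': ['pillows']}
    return fields

sample = ('203.0.113.7 - - [21/Oct/2013:10:14:02 -0700] '
          '"GET /catalog?category=pillows HTTP/1.1" 200 5123 "session=s-9876"')
print(parse_line(sample))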


An alternative to A/B testing is a class of techniques called Bandit algorithms. Bandit algorithms use the results of multiple models and constantly evaluate which experiment to run. Experiments that perform better (for any reason) are shown more often. The result is that experiments can be run constantly and measured against the data collected for each experiment. The combinations do not need to be predetermined, and the more successful experiments automatically get more exposure.
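A minimal sketch of one common bandit strategy, epsilon-greedy (the choice of strategy and all numbers here are mine for illustration; the article does not prescribe a particular algorithm): most traffic goes to the best-performing experiment so far, while a small fraction keeps exploring the alternatives, so better performers automatically earn more exposure.

import random

class EpsilonGreedy:
    def __init__(self, experiment_names, epsilon=0.1):
        self.epsilon = epsilon
        self.shown = {name: 0 for name in experiment_names}
        self.successes = {name: 0 for name in experiment_names}

    def choose(self):
        untried = [name for name, count in self.shown.items() if count == 0]
        if untried:
            return random.choice(untried)           # make sure every experiment gets some data
        if random.random() < self.epsilon:
            return random.choice(list(self.shown))  # explore
        # Exploit: the experiment with the best observed click-through rate so far.
        return max(self.shown, key=lambda name: self.successes[name] / self.shown[name])

    def record(self, name, clicked):
        self.shown[name] += 1
        self.successes[name] += int(clicked)

bandit = EpsilonGreedy(["pillow_image_A", "pillow_image_B"])
for _ in range(1000):
    choice = bandit.choose()
    # Hypothetical true click rates; in production this feedback comes from real users.
    clicked = random.random() < {"pillow_image_A": 0.04, "pillow_image_B": 0.07}[choice]
    bandit.record(choice, clicked)

print(bandit.shown)  # with these rates, image B typically ends up shown far more often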

Conclusion

Big Data has seen a lot of hype in recent years, yet it remains unclear to most practitioners where they need to focus their time and attention. Big Data is, in large part, about paying attention to the details in a data set. The techniques available historically have been limited to the level of detail that the hardware available at the time could process. Recent developments in hardware capabilities have led to new software that makes it cost effective to store all of an organization's detail data. As a result, organizations have developed new techniques around model selection, segmentation, and experimentation. To get started with Big Data, instrument your organization's applications, start paying attention to the details, let the data inform the models — and test everything.

— Aaron Kimball founded WibiData in 2010 and is the Chief Architect for the Kiji project. He has worked with Hadoop since 2007 and is a committer on the Apache Hadoop project. In addition, Aaron founded Apache Sqoop, which connects Hadoop to relational databases, and Apache MRUnit, for testing Hadoop projects.


Applying the Big Data Lambda Architecture

A look inside a Hadoop-based project that matches connections in social media by leveraging the highly scalable lambda architecture.

By Michael Hausenblas

Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He's the driving force behind Storm (see the Storm article later in this issue) and at Twitter he led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.

Marz and his team described the underlying motivation for building systems with the lambda architecture as:

• The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.

• To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.

• The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.

• The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.

From a bird's eye view, the lambda architecture has three major components that interact with new data coming in and respond to queries, which in this article are driven from the command line:


Figure 1: Overview of the lambda architecture.


Essentially, the Lambda Architecture comprises the following components, processes, and responsibilities:

• New Data: All data entering the system is dispatched to both the batch layer and the speed layer for processing.

• Batch layer: This layer has two functions: (i) managing the master dataset, an immutable, append-only set of raw data, and (ii) pre-computing arbitrary query functions, called batch views. Hadoop's HDFS (http://is.gd/Emgj57) is typically used to store the master dataset and perform the computation of the batch views using MapReduce (http://is.gd/StjZaI).

• Serving layer: This layer indexes the batch views so that they can be queried ad hoc with low latency. To implement the serving layer, technologies such as Apache HBase (http://is.gd/2ro9CY) or ElephantDB (http://is.gd/KgIZ2G) are usually utilized. The Apache Drill project (http://is.gd/wB1IYy) provides the capability to execute full ANSI SQL 2003 queries against batch views.

• Speed layer: This layer compensates for the high latency of updates to the serving layer, due to the batch layer. Using fast and incremental algorithms, the speed layer deals with recent data only. Storm (http://is.gd/qP7fkZ) is often used to implement this layer.

• Queries: Last but not least, any incoming query can be answered by merging results from batch views and real-time views (a minimal sketch of this merge follows the list).
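In code, that query-side merge is conceptually tiny. A sketch follows, using simple dictionaries and a page-view count as a stand-in; in a real deployment the batch view would be read from the serving layer (e.g., HBase or ElephantDB) and the real-time view from whatever the speed layer maintains.

def query_page_views(url, batch_view, realtime_view):
    # Combine the precomputed batch view with the speed layer's view of recent data.
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

batch_view = {"/products/42": 10000}   # computed by the batch layer up to the last run
realtime_view = {"/products/42": 37}   # incremental counts for data that arrived since then
print(query_page_views("/products/42", batch_view, realtime_view))  # 10037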

Scope and Architecture of the Project

In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple command-line user interface. The guts, however, use the lambda architecture.

It's easiest to think about USN in terms of two orthogonal phases:

• Build-time, which includes the data pre-processing, generating the master dataset, as well as creating the batch views.

• Runtime, in which the data is actually used, primarily via issuing queries against the data space.

The USN app architecture is shown below in Figure 2:


Figure 2: High-level architecture diagram of the USN app.


The following subsystems and processes, in line with the lambda architecture, are at work in USN:

• Data pre-processing. Strictly speaking, this can be considered part of the batch layer. It can also be seen as an independent process necessary to bring the data into a shape that is suitable for the master dataset generation.

• The batch layer. Here, a bash shell script (http://is.gd/smhcl6) is used to drive a number of HiveQL (http://is.gd/8qSOSF) queries (see the GitHub repo, in the batch-layer folder at http://is.gd/QDU6pH) that are responsible for loading the pre-processed input CSV data into HDFS.

• The serving layer. In this layer, we use a Python script (http://is.gd/Qzklmw) that loads the data from HDFS via Hive and inserts it into an HBase table, hence creating a batch view of the data. This layer also provides the query capabilities needed in the runtime phase to serve the front-end.

• Command-line front end. The USN app front-end is a bash shell script (http://is.gd/nFZoqB) interacting with the end-user and providing operations such as listings, lookups, and search.

This is all there is from an architectural point of view. You may have noticed that there is no speed layer in USN, as of now. This is due to the scope I initially introduced above. At the end of this article, I'll revisit this topic.

The USN App Technology Stack and Data

Recently, Dr. Dobb's discussed Pydoop: Writing Hadoop Programs in Python (http://www.drdobbs.com/240156473), which will serve as a gentle introduction to setting up and using Hadoop with Python. I'm going to use a mixture of Python and bash shell scripts to implement the USN. However, I won't rely on the low-level MapReduce API provided by Pydoop, but rather on higher-level libraries that interface with Hive and HBase, which are part of the Hadoop ecosystem. Note that the entire source code, including the test data and all queries as well as the front-end, is available in a GitHub repository (http://is.gd/XFI4wY), which you will need in order to follow along with this implementation.

Before I go into the technical details, such as the concrete technology stack used, let's have a quick look at the data transformation happening between the batch and the serving layer (Figure 3).


Figure 3: Data transformation from batch to serving layer in the USN app.


As hinted in Figure 3, the master dataset (left) is a collection of atomic actions: either a user has added someone to their networks, or the reverse has taken place and a person has been removed from a network. This form of the data is as raw as it gets in the context of our USN app and can serve as the basis for a variety of views that are able to answer different sorts of queries. For simplicity's sake, I only consider one possible view that is used in the USN app front-end: the "network-friends" view, per user, shown in the right part of Figure 3.

Raw Input Data

The raw input data is a Comma Separated Value (CSV) file with the following format:

timestamp,originator,action,network,target,context
2012-03-12T22:54:13-07:00,Michael,ADD,I,Ora Hatfield, bla
2012-11-23T01:53:42-08:00,Ted,REMOVE,I,Marvin Garrison, meh
...

The raw CSV file contains the following six columns:

• timestamp is an ISO 8601 formatted date-time stamp that states when the action was performed (range: January 2012 to May 2013).

• originator is the name of the person who added or removed a person to or from one of his or her networks.

• action must be either ADD or REMOVE and designates the action that has been carried out. That is, it indicates whether a person has been added or removed from the respective network.

• network is a single character indicating the respective network where the action has been performed. The possible values are: I, in-real-life; T, Twitter; L, LinkedIn; F, Facebook; G, Google+.

• target is the name of the person added to or removed from the network.

• context is a free-text comment, providing a hint why the person has been added/removed or where one has met the person in the first place.

There are no optional fields in the dataset. In other words: Each row is completely filled. In order to generate some test data to be used in the USN app, I've created a raw input CSV file from generatedata.com in five runs, yielding some 500 rows of raw data.

Technology Stack

USN uses several software frameworks, libraries, and components, as I mentioned earlier. I've tested it with:

• Apache Hadoop 1.0.4 (http://is.gd/4suWof)
• Apache Hive 0.10.0 (http://is.gd/tOfbsP)
• Hiver for Hive access from Python (http://is.gd/OXujzB)
• Apache HBase 0.94.4 (http://is.gd/7VnBqR)
• HappyBase for HBase access from Python (http://is.gd/BuJzaH)

I assume that you're familiar with the bash shell and have Python 2.7 or above installed. I've tested the USN app under Mac OS X 10.8, but there are no hard dependencies on any Mac OS X specific features, so it should run unchanged under any Linux environment.

Building the USN Data Space

The first step is to build the data space for the USN app, that is, the master dataset and the batch view. Then we will have a closer look behind the scenes of each of the commands.


First, some pre-processing of the raw data, generated earlier:

$ pwd
/Users/mhausenblas2/Documents/repos/usn-app/data
$ ./usn-preprocess.sh < usn-raw-data.csv > usn-base-data.csv
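The pre-processing script itself lives in the repository and is not reproduced here. As a rough idea of the kind of transformation involved, the sketch below rewrites the comma-separated raw file into the pipe-delimited form that the Hive table shown later expects; whether usn-preprocess.sh does exactly this, and nothing more, is an assumption on my part.

import csv
import sys

def preprocess(raw_lines):
    # Turn raw comma-separated rows into '|'-delimited rows for the Hive LOAD step.
    reader = csv.reader(raw_lines)
    next(reader)  # drop the header row (timestamp,originator,action,network,target,context)
    for row in reader:
        yield "|".join(field.strip() for field in row)

if __name__ == "__main__":
    for line in preprocess(sys.stdin):
        print(line)

# Usage, mirroring the shell pipeline above (script name is hypothetical):
#   python preprocess_sketch.py < usn-raw-data.csv > usn-base-data.csv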

Next, we want to build the batch layer. For this, I first need to make sure that the Hive Thrift service is running:

$ pwd
/Users/mhausenblas2/Documents/repos/usn-app/batch-layer
$ hive --service hiveserver
Starting Hive Thrift Server...

Now, I can run the script that executes the Hive queries and builds our USN app master dataset, like so:

$ pwd
/Users/mhausenblas2/Documents/repos/usn-app/batch-layer
$ ./batch-layer.sh INIT
USN batch layer created.
$ ./batch-layer.sh CHECK
The USN batch layer seems OK.

This generates the batch layer, which is in HDFS. Next, I create the serving layer in HBase by building a view of the relationships to people. For this, both the Hive and HBase Thrift services need to be running. Below, you see how you start the HBase Thrift service:

$ echo $HBASE_HOME
/Users/mhausenblas2/bin/hbase-0.94.4
$ cd /Users/mhausenblas2/bin/hbase-0.94.4
$ ./bin/start-hbase.sh
starting master, logging to /Users/...
$ ./bin/hbase thrift start -p 9191
13/05/31 09:39:09 INFO util.VersionInfo: HBase 0.94.4

Now that both the Hive and HBase Thrift services are up and running, I can build the batch view by running the serving-layer Python script (http://is.gd/Qzklmw) in the respective directory, wherever you've unzipped or cloned the GitHub repository.

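The serving-layer script is likewise available in the repository rather than reproduced here. The following is a minimal sketch of the HBase side of that step using HappyBase, assuming the (username, network, friend, note) rows have already been fetched from the Hive usn_friends table; the row-key layout and the 'a:' column qualifiers are my assumptions based on Figure 4.

import happybase

def load_batch_view(rows, host="localhost", port=9191):
    # Write (username, network, friend, note) rows into the usn_friends HBase table.
    connection = happybase.Connection(host, port=port)
    table = connection.table("usn_friends")
    for i, (username, network, friend, note) in enumerate(rows):
        row_key = "{0}-{1}-{2}".format(username, network, i)  # assumed key layout
        table.put(row_key, {"a:friend": friend, "a:note": note})
    connection.close()

# rows would come from the Hive usn_friends table built by the batch layer, e.g.:
# load_batch_view([("Michael", "I", "Ora Hatfield", "bla")])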

Now, let's have a closer look at what is happening behind the scenes of each of the layers in the next sections.

The Batch Layer

The raw data is first pre-processed and loaded into Hive. In Hive (remember, this constitutes the master dataset in the batch layer of our USN app) the following schema is used:

CREATE TABLE usn_base (
  actiontime STRING,
  originator STRING,
  action STRING,
  network STRING,
  target STRING,
  context STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

To import the CSV data and build the master dataset, the shell script batch-layer.sh executes the following HiveQL commands:

LOAD DATA LOCAL INPATH '../data/usn-base-data.csv' INTO TABLE usn_base;


DROP TABLE IF EXISTS usn_friends;

CREATE TABLE usn_friends AS
SELECT actiontime, originator AS username, network,
       target AS friend, context AS note
FROM usn_base
WHERE action = 'ADD'
ORDER BY username, network, username;

With this, the USN app master dataset is ready and available in HDFS, and I can move on to the next layer, the serving layer.

The Serving Layer of the USN App

The batch view used in the USN app is realized via an HBase table called usn_friends. This table is then used to drive the USN app front-end; it has the schema shown in Figure 4.

After building the serving layer, I can use the HBase shell to verify if the batch view has been properly populated in the respective table usn_friends:

$ ./bin/hbase shell
hbase(main):001:0> describe 'usn_friends'
...
{NAME => 'usn_friends', FAMILIES => [{NAME => 'a',
DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '3',
COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '-1',
KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
IN_MEMORY => 'false', ENCODE_ON_DISK => 'true',
BLOCKCACHE => 'false'}]}
1 row(s) in 0.2450 seconds

You can have a look at some more queries used in the demo user interface on the Wiki page of the GitHub repository (http://is.gd/7v0IXz).

Putting It All Together

After the batch and serving layers have been initialized and launched, as described, you can launch the user interface. To use the CLI, make sure that HBase and the HBase Thrift service are running and then, in the main USN app directory, run:

$ ./usn-ui.sh
This is USN v0.0

u ... user listings, n ... network listings, l ... lookup,
s ... search, h ... help, q ... quit


Figure 4: HBase schema used in the serving layer of the USN app.


Figure 5 shows a screen shot of the USN app front-end in action. The main operations the USN front-end provides are as follows:

• u ... user listing: lists all acquaintances of a user
• n ... network listing: lists acquaintances of a user in a network
• l ... lookup: lists acquaintances of a user in a network and allows restrictions on the time range (from/to) of the acquaintanceship
• s ... search: provides search for an acquaintance over all users, allowing for partial match

An example USN app front-end session is available at the GitHub repo (http://is.gd/c3i6FW) for you to study.

What's Next?

I have intentionally kept USN simple. Although fully functional, it has several limitations (due to space restrictions here). I can suggest several improvements you could have a go at, using the available code base (http://is.gd/XFI4wY) as a starting point.

• Bigger data: The most obvious point is not the app itself but the data size. Only a laughable 500 rows? This isn't Big Data, I hear you say. Rightly so. But no one stops you from generating 500 million rows or more and trying it out. Certain processes, such as pre-processing and generating the layers, will take longer, but no architectural changes are necessary, and this is the whole point of the USN app.

• Creating a full-blown batch layer: Currently, the batch layer is a sort of one-shot, while it should really run in a loop and append new data. This requires partitioning of the ingested data and some checks. Pail (http://is.gd/sJAKGN), for example, allows you to do the ingestion and partitioning in a very elegant way.

• Adding a speed layer and automated import: It would be interesting to automate the import of data from the various social networks. For example, Google Takeout (http://is.gd/Zy0HcB) allows exporting all data in bulk mode, including G+ Circles. For a stab at the speed layer, one could try and utilize the Twitter firehose (http://is.gd/xVroGO) along with Storm.

• More batch views: There is currently only one view (friend list per network, per user) in the serving layer. The USN app might benefit from different views to enable different queries most efficiently, such as time-series views of network growth or overlaps of acquaintanceships across networks.


Figure 5: Screen-shot of the USN app command line user interface.


I hope you have as much fun playing around with the USN app and extending it as I had writing it in the first place. I'd love to hear back from you on ideas or further improvements, either directly here as a comment or via the GitHub issue tracker of the USN app repository.

Further Resources

• A must-read for the Lambda Architecture is the Big Data book by Nathan Marz and James Warren from Manning (http://is.gd/lPtVJS). The USN app idea actually stems from one of the examples used in this book.

• Slide deck on a real-time architecture using Hadoop and Storm (http://is.gd/nz0wD6) from FOSDEM 2013.

• A blog post about an example "lambda architecture" for real-time analysis of hashtags using Trident, Hadoop, and Splout SQL (http://is.gd/ZTJarF).

• Additional batch layer technologies, such as Pail (http://is.gd/sJAKGN) for managing the master dataset and JCascalog (http://is.gd/i7jf1W) for creating the batch views.

• Apache Drill (http://is.gd/wB1IYy) for providing interactive, ad-hoc queries against HDFS, HBase, or other NoSQL back-ends.

• Additional speed layer technologies, such as Trident (http://is.gd/Bxqt9j), a high-level abstraction for doing real-time computing on top of Storm, and MapR's Direct Access NFS (http://is.gd/BaoE0l) to land data directly from streaming sources such as social media streams or sensor devices.

— Michael Hausenblas is the Chief Data Engineer EMEA, MapR Technologies.


From the Vault: Easy Real-Time Big Data Analysis Using Storm

Conceptually straightforward and easy to work with, Storm makes handling big data analysis a breeze.

By Shruthi Kumar and Siddharth Patankar


Today, companies regularly generate terabytes of data in their daily operations. The sources include everything from data captured from network sensors, to the Web, social media, transactional business data, and data created in other business contexts. Given the volume of data being generated, real-time computation has become a major challenge faced by many organizations. A scalable real-time computation system that we have used effectively is the open-source Storm tool, which was developed at Twitter and is sometimes referred to as "real-time Hadoop." However, Storm (http://storm-project.net/) is far simpler to use than Hadoop in that it does not require mastering an alternate universe of new technologies simply to handle big data jobs.

This article explains how to use Storm. The example project, called "Speeding Alert System," analyzes real-time data and, when the speed of a vehicle exceeds a predefined threshold, raises a trigger and stores the relevant data in a database.

Storm

Whereas Hadoop relies on batch processing, Storm is a real-time, distributed, fault-tolerant computation system. Like Hadoop, it can process huge amounts of data — but does so in real time — with guaranteed reliability; that is, every message will be processed. Storm also offers features such as fault tolerance and distributed computation, which make it suitable for processing huge amounts of data on different machines. It has these features as well:

T

From the Vault

Page 24: DDJ_102113

Hadoop’s Zookeeper for cluster coordination makes it scalablefor large cluster sizes.

• It guarantees processing of every message.• Storm clusters are easy to manage.• Storm is fault tolerant: Once a topology is submitted, Storm runs

the topology until it is killed or the cluster is shut down. Also, ifthere are faults during execution, reassignment of tasks is han-dled by Storm.

• Topologies in Storm can be defined in any language, althoughtypically Java is used.

To follow the rest of the article, you first need to install and set up Storm. The steps are straightforward:

• Download the Storm archive from the official Storm website (http://storm-project.net/downloads.html).

• Unpack the bin/ directory onto your PATH and make sure the bin/storm script is executable.

Storm Components

A Storm cluster mainly consists of a master node and worker nodes, with coordination done by Zookeeper.

• Master Node: The master node runs a daemon, Nimbus, which is responsible for distributing the code around the cluster, assigning the tasks, and monitoring failures. It is similar to the JobTracker in Hadoop.

• Worker Node: The worker node runs a daemon, Supervisor, which listens to the work assigned and runs the worker process based on requirements. Each worker node executes a subset of a topology. The coordination between Nimbus and several supervisors is managed by a Zookeeper system or cluster.

Zookeeper

Zookeeper is responsible for maintaining the coordination service between the supervisor and master. The logic for a real-time application is packaged into a Storm “topology.” A topology consists of a graph of spouts (data sources) and bolts (data operations) that are connected with stream groupings (coordination). Let’s look at these terms in greater depth.

• Spout: In simple terms, a spout reads the data from a source for use in the topology. A spout can either be reliable or unreliable. A reliable spout makes sure to resend a tuple (which is an ordered list of data items) if Storm fails to process it. An unreliable spout does not track the tuple once it’s emitted. The main method in a spout is nextTuple(). This method either emits a new tuple to the topology or it returns if there is nothing to emit.

• Bolt: A bolt is responsible for all the processing that happens in a topology. Bolts can do anything from filtering to joins, aggregations, talking to files/databases, and so on. Bolts receive the data from a spout for processing, and may further emit tuples to another bolt in case of complex stream transformations. The main method in a bolt is execute(), which accepts a tuple as input. In both the spout and bolt, to emit the tuple to more than one stream, the streams can be declared and specified in declareStream().

• Stream Groupings: A stream grouping defines how a stream should be partitioned among the bolt’s tasks. There are built-in stream groupings (http://is.gd/eJvL0f) provided by Storm: shuffle grouping, fields grouping, all grouping, one grouping, direct grouping, and local/shuffle grouping. A custom implementation can also be added by using the CustomStreamGrouping interface.
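The following is a minimal sketch (ours, not code from the example project) of how such groupings are declared when wiring a topology. The class names are the spout and bolts used later in this article; the assumption that the threshold bolt re-emits a vehicle_number field, and the use of the Storm 0.8/0.9-era backtype.storm API, are ours:

// shuffleGrouping spreads tuples evenly across tasks;
// fieldsGrouping routes tuples with the same field value to the same task.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new FileListenerSpout(), 1);
builder.setBolt("thresholdBolt", new ThresholdCalculatorBolt(), 2)
       .shuffleGrouping("spout");
builder.setBolt("dbWriterBolt", new DBWriterBolt(), 1)
       .fieldsGrouping("thresholdBolt", new Fields("vehicle_number"));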

Implementation

For our use case, we designed one topology of a spout and bolts that can process a huge amount of data (log files) and trigger an alarm when a specific value crosses a predefined threshold. Using a Storm topology, the log file is read line by line and the topology monitors the incoming data. In terms of Storm components, the spout reads the incoming data. It not only reads the data from existing files, but it also monitors for new files. As soon as a file is modified, the spout reads this new entry and, after converting it to tuples (a format that can be read by a bolt), emits the tuples to the bolt to perform threshold analysis, which finds any record that has exceeded the threshold. The next section explains the use case in detail.

Threshold Analysis

In this article, we will be mainly concentrating on two types of threshold analysis: instant threshold and time series threshold.

• Instant threshold checks if the value of a field has exceeded the threshold value at that instant and raises a trigger if the condition is satisfied. For example, it raises a trigger if the speed of a vehicle exceeds 80 km/h.

• Time series threshold checks if the value of a field has exceeded the threshold value for a given time window and raises a trigger if the condition is satisfied. For example, it raises a trigger if the speed of a vehicle exceeds 80 km/h more than once in the last five minutes. (Both checks are sketched in the code below.)
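A minimal, self-contained sketch of the two checks (ours, not the article’s ThresholdCalculatorBolt, whose actual code appears in Listing Five), using the 80 km/h limit and five-minute window from the examples above:

// Illustrative only: instant vs. time-series threshold checks on a speed value.
import java.util.ArrayDeque;
import java.util.Deque;

public class SpeedThresholdChecks
{
    private static final int SPEED_LIMIT = 80;             // km/h
    private static final long WINDOW_MS = 5 * 60 * 1000L;  // five minutes
    private final Deque<Long> violationTimes = new ArrayDeque<>();

    // Instant threshold: trigger as soon as a single reading exceeds the limit.
    public boolean instantThresholdExceeded(int speed)
    {
        return speed > SPEED_LIMIT;
    }

    // Time-series threshold: trigger if the limit is exceeded more than once
    // within the last five minutes.
    public boolean timeSeriesThresholdExceeded(int speed, long timestampMs)
    {
        if (speed > SPEED_LIMIT)
            violationTimes.addLast(timestampMs);
        // Drop violations that have fallen out of the window.
        while (!violationTimes.isEmpty()
                && timestampMs - violationTimes.peekFirst() > WINDOW_MS)
            violationTimes.removeFirst();
        return violationTimes.size() > 1;
    }
}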

Listing One shows a log file of the type we’ll use, which contains vehicle data information such as vehicle number, speed at which the vehicle is traveling, and location in which the information is captured.

Listing One: A log file with entries of vehicles passing through the checkpoint.

AB 123, 60, North city
BC 123, 70, South city
CD 234, 40, South city
DE 123, 40, East city
EF 123, 90, South city
GH 123, 50, West city

A corresponding XML file is created, which consists of the schema for the incoming data. It is used for parsing the log file. The schema XML and its corresponding description are shown in Table 1.


Table 1.


The XML file and the log file are in a directory that is monitored by the spout constantly for real-time changes. The topology we use for this example is shown in Figure 1.

As shown in Figure 1, the FileListenerSpout accepts the input log file, reads the data line by line, and emits the data to the ThresholdCalculatorBolt for further threshold processing. Once the processing is done, the contents of the line for which the threshold is calculated are emitted to the DBWriterBolt, where they are persisted in the database (or an alert is raised). The detailed implementation for this process is explained next.

Spout Implementation

The spout takes a log file and the XML descriptor file as the input. The XML file consists of the schema corresponding to the log file. Let us consider an example log file, which has vehicle data information such as vehicle number, speed at which the vehicle is traveling, and location in which the information is captured. (See Figure 2.)

Listing Two shows the specific XML file for a tuple, which specifies the fields and the delimiter separating the fields in a log file. Both the XML file and the data are kept in a directory whose path is specified in the spout.

Listing Two: An XML file created for describing the log file.

<TUPLEINFO>
   <FIELDLIST>
      <FIELD>
         <COLUMNNAME>vehicle_number</COLUMNNAME>
         <COLUMNTYPE>string</COLUMNTYPE>
      </FIELD>
      <FIELD>
         <COLUMNNAME>speed</COLUMNNAME>
         <COLUMNTYPE>int</COLUMNTYPE>
      </FIELD>
      <FIELD>
         <COLUMNNAME>location</COLUMNNAME>
         <COLUMNTYPE>string</COLUMNTYPE>
      </FIELD>
   </FIELDLIST>
   <DELIMITER>,</DELIMITER>
</TUPLEINFO>


Figure 1: Topology created in Storm to process real-time data.

Figure 2: Flow of data from log files to Spout.


An instance of the spout is initialized with constructor parameters of Directory, Path, and a TupleInfo object. The TupleInfo object stores necessary information related to the log file, such as fields, delimiter, and type of field. This object is created by deserializing the XML file using XStream (http://xstream.codehaus.org/); a small loading sketch appears after the list below. Spout implementation steps are:

• Listen to changes on individual log files. Monitor the directory for the addition of new log files.

• Convert rows read by the spout to tuples after declaring fields for them.

• Declare the grouping between spout and bolt, deciding the way in which tuples are given to the bolt.
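For reference, loading the TupleInfo from the descriptor file might look like the following sketch. The article does not show this code, so the helper name, the alias call, and the assumption that TupleInfo is a plain bean mirroring the <TUPLEINFO> document are ours:

// Hypothetical helper for deserializing the XML descriptor with XStream.
private TupleInfo loadTupleInfo(String xmlFilePath) throws java.io.IOException
{
    com.thoughtworks.xstream.XStream xstream = new com.thoughtworks.xstream.XStream();
    xstream.alias("TUPLEINFO", TupleInfo.class);   // map the root element to the bean
    try (java.io.FileReader reader = new java.io.FileReader(xmlFilePath))
    {
        return (TupleInfo) xstream.fromXML(reader);
    }
}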

The code for spout is shown in Listing Three.

Listing Three: Logic in the open(), nextTuple(), and declareOutputFields() methods of the spout.

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector)
{
    _collector = collector;
    try
    {
        fileReader = new BufferedReader(new FileReader(new File(file)));
    }
    catch (FileNotFoundException e)
    {
        System.exit(1);
    }
}

public void nextTuple()
{
    // The original listing declared a ListenFile() helper inside nextTuple(),
    // which is not valid Java; here nextTuple() simply delegates to it.
    listenFile(new File(file));
}

protected void listenFile(File file)
{
    Utils.sleep(2000);
    RandomAccessFile access = null;
    String line = null;
    try
    {
        // Assumed initialization; the original listing never assigned 'access'.
        access = new RandomAccessFile(file, "r");
        while ((line = access.readLine()) != null)
        {
            if (line != null)
            {
                String[] fields = null;
                if (tupleInfo.getDelimiter().equals("|"))
                    fields = line.split("\\" + tupleInfo.getDelimiter());
                else
                    fields = line.split(tupleInfo.getDelimiter());
                if (tupleInfo.getFieldList().size() == fields.length)
                    _collector.emit(new Values(fields));
            }
        }
    }
    catch (IOException ex) { }
}

public void declareOutputFields(OutputFieldsDeclarer declarer)
{
    String[] fieldsArr = new String[tupleInfo.getFieldList().size()];
    for (int i = 0; i < tupleInfo.getFieldList().size(); i++)
    {
        fieldsArr[i] = tupleInfo.getFieldList().get(i).getColumnName();
    }
    declarer.declare(new Fields(fieldsArr));
}

declareOutputFields() decides the format in which the tuple is emitted, so that the bolt can decode the tuple in a similar fashion.
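On the receiving side, a bolt can pull values back out of such a tuple by the declared field names. A brief illustration (ours, not the article’s bolt), using the field names from Listing Two and the Tuple/BasicOutputCollector types from the backtype.storm packages:

public void execute(Tuple input, BasicOutputCollector collector)
{
    // Field names match those declared by the spout in declareOutputFields().
    String vehicleNumber = input.getStringByField("vehicle_number");
    int speed = Integer.parseInt(input.getStringByField("speed"));  // emitted as a String
    String location = input.getStringByField("location");
    // ... threshold logic would follow here ...
}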


The spout keeps listening for data added to the log file; as soon as data is added, it reads the data and emits it to the bolt for processing.

Bolt Implementation

The output of the spout is given to the bolt for further processing. The topology we have considered for our use case consists of two bolts, as shown in Figure 3.

Figure 3: Flow of data from Spout to Bolt.

ThresholdCalculatorBolt

The tuples emitted by the spout are received by the ThresholdCalculatorBolt for threshold processing. It accepts several inputs for the threshold check. The inputs it accepts are:

• Threshold value to check
• Threshold column number to check
• Threshold column data type
• Threshold check operator
• Threshold frequency of occurrence
• Threshold time window

A class, shown in Listing Four, is defined to hold these values.

Listing Four: ThresholdInfo class.

public class ThresholdInfo implements Serializable
{
    private String action;
    private String rule;
    private Object thresholdValue;
    private int thresholdColNumber;
    private Integer timeWindow;
    private int frequencyOfOccurence;
}
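The article does not show how a ThresholdInfo instance is populated. Assuming conventional setters (hypothetical; only the corresponding getters appear in Listing Five), configuring the instant 80 km/h check from the earlier example might look like this:

// Hypothetical configuration; setter names mirror the getters used in Listing Five.
ThresholdInfo thresholdInfo = new ThresholdInfo();
thresholdInfo.setAction(">");              // comparison operator used by the bolt
thresholdInfo.setThresholdValue(80);       // speed limit in km/h
thresholdInfo.setThresholdColNumber(2);    // 'speed' is the second column in the log
thresholdInfo.setFrequencyOfOccurence(0);  // trigger on the first violation
thresholdInfo.setTimeWindow(null);         // no time window: instant threshold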

Based on the values provided in these fields, the threshold check is made in the execute() method, as shown in Listing Five. The code mostly consists of parsing and checking the incoming values.

Listing Five: Code for the threshold check.

public void execute(Tuple tuple, BasicOutputCollector collector)
{
    if (tuple != null)
    {
        List<Object> inputTupleList = (List<Object>) tuple.getValues();
        int thresholdColNum = thresholdInfo.getThresholdColNumber();
        Object thresholdValue = thresholdInfo.getThresholdValue();
        String thresholdDataType =
            tupleInfo.getFieldList().get(thresholdColNum - 1).getColumnType();
        Integer timeWindow = thresholdInfo.getTimeWindow();
        int frequency = thresholdInfo.getFrequencyOfOccurence();

        if (thresholdDataType.equalsIgnoreCase("string"))
        {
            String valueToCheck = inputTupleList.get(thresholdColNum - 1).toString();
            String frequencyChkOp = thresholdInfo.getAction();
            if (timeWindow != null)
            {
                long curTime = System.currentTimeMillis();
                // Milliseconds to minutes; the original listing divided by 1000 only.
                long diffInMinutes = (curTime - startTime) / (1000 * 60);
                if (diffInMinutes >= timeWindow)
                {
                    if (frequencyChkOp.equals("=="))
                    {
                        if (valueToCheck.equalsIgnoreCase(thresholdValue.toString()))
                        {
                            count.incrementAndGet();
                            if (count.get() > frequency)
                                splitAndEmit(inputTupleList, collector);
                        }
                    }
                    else if (frequencyChkOp.equals("!="))
                    {
                        if (!valueToCheck.equalsIgnoreCase(thresholdValue.toString()))
                        {
                            count.incrementAndGet();
                            if (count.get() > frequency)
                                splitAndEmit(inputTupleList, collector);
                        }
                    }
                    else
                        System.out.println("Operator not supported");
                }
            }
            else
            {
                if (frequencyChkOp.equals("=="))
                {
                    if (valueToCheck.equalsIgnoreCase(thresholdValue.toString()))
                    {
                        count.incrementAndGet();
                        if (count.get() > frequency)
                            splitAndEmit(inputTupleList, collector);
                    }
                }
                else if (frequencyChkOp.equals("!="))
                {
                    if (!valueToCheck.equalsIgnoreCase(thresholdValue.toString()))
                    {
                        count.incrementAndGet();
                        if (count.get() > frequency)
                            splitAndEmit(inputTupleList, collector);
                    }
                }
            }
        }
        else if (thresholdDataType.equalsIgnoreCase("int")
                 || thresholdDataType.equalsIgnoreCase("double")
                 || thresholdDataType.equalsIgnoreCase("float")
                 || thresholdDataType.equalsIgnoreCase("long")
                 || thresholdDataType.equalsIgnoreCase("short"))
        {
            String frequencyChkOp = thresholdInfo.getAction();
            if (timeWindow != null)
            {
                long valueToCheck =
                    Long.parseLong(inputTupleList.get(thresholdColNum - 1).toString());
                long curTime = System.currentTimeMillis();
                // Milliseconds to minutes; the original listing divided by 1000 only.
                long diffInMinutes = (curTime - startTime) / (1000 * 60);
                System.out.println("Difference in minutes=" + diffInMinutes);
                if (diffInMinutes >= timeWindow)
                {
                    if (frequencyChkOp.equals("<"))
                    {
                        if (valueToCheck < Double.parseDouble(thresholdValue.toString()))
                        {
                            count.incrementAndGet();
                            if (count.get() > frequency)
                                splitAndEmit(inputTupleList, collector);
                        }
                    }
                    else if (frequencyChkOp.equals(">"))
                    {
                        if (valueToCheck > Double.parseDouble(thresholdValue.toString()))
                        {
                            count.incrementAndGet();
                            if (count.get() > frequency)
                                splitAndEmit(inputTupleList, collector);
                        }
                    }
                    else if (frequencyChkOp.equals("=="))
                    {
                        if (valueToCheck == Double.parseDouble(thresholdValue.toString()))
                        {
                            count.incrementAndGet();
                            if (count.get() > frequency)
                                splitAndEmit(inputTupleList, collector);
                        }
                    }
                    else if (frequencyChkOp.equals("!="))
                    {
                        // . . . (remaining operator checks elided in the original listing)
                    }
                }
            }
            else
                splitAndEmit(null, collector);
        }
        else
        {
            System.err.println("Emitting null in bolt");
            splitAndEmit(null, collector);
        }
    }
}   // closing brace added; the original listing was truncated here

The tuples emitted by the threshold bolt are passed to the next corresponding bolt, which in our case is the DBWriterBolt.

DBWriterBolt

The processed tuple has to be persisted for raising a trigger or for further use. DBWriterBolt does the job of persisting the tuples into the database. The creation of a table is done in prepare(), which is the first method invoked by the topology. Code for this method is given in Listing Six.

Listing Six: Code for creation of tables.

public void prepare(Map stormConf, TopologyContext context)
{
    try
    {
        Class.forName(dbClass);
    }
    catch (ClassNotFoundException e)
    {
        System.out.println("Driver not found");
        e.printStackTrace();
    }

    try
    {
        // The '=' and the DriverManager capitalization were missing in the original listing.
        connection = DriverManager.getConnection("jdbc:mysql://" + databaseIP + ":"
                + databasePort + "/" + databaseName, userName, pwd);
        connection.prepareStatement("DROP TABLE IF EXISTS " + tableName).execute();

        StringBuilder createQuery = new StringBuilder(
                "CREATE TABLE IF NOT EXISTS " + tableName + "(");
        for (Field fields : tupleInfo.getFieldList())
        {
            if (fields.getColumnType().equalsIgnoreCase("String"))
                createQuery.append(fields.getColumnName() + " VARCHAR(500),");
            else
                createQuery.append(fields.getColumnName() + " " + fields.getColumnType() + ",");
        }
        createQuery.append("thresholdTimeStamp timestamp)");
        connection.prepareStatement(createQuery.toString()).execute();

        // Insert query
        StringBuilder insertQuery = new StringBuilder("INSERT INTO " + tableName + "(");
        for (Field fields : tupleInfo.getFieldList())
        {
            insertQuery.append(fields.getColumnName() + ",");
        }
        insertQuery.append("thresholdTimeStamp").append(") values (");
        for (Field fields : tupleInfo.getFieldList())
        {
            insertQuery.append("?,");
        }
        insertQuery.append("?)");
        prepStatement = connection.prepareStatement(insertQuery.toString());
    }
    catch (SQLException e)
    {
        e.printStackTrace();
    }
}

Insertion of data is done in batches. The logic for insertion is provided in execute(), as shown in Listing Seven, and consists mostly of parsing the variety of different possible input types.

Listing Seven: Code for insertion of data.

public void execute(Tuple tuple, BasicOutputCollector collector)
{
    batchExecuted = false;
    if (tuple != null)
    {
        List<Object> inputTupleList = (List<Object>) tuple.getValues();
        int dbIndex = 0;
        for (int i = 0; i < tupleInfo.getFieldList().size(); i++)
        {
            Field field = tupleInfo.getFieldList().get(i);
            try
            {
                dbIndex = i + 1;
                if (field.getColumnType().equalsIgnoreCase("String"))
                    prepStatement.setString(dbIndex, inputTupleList.get(i).toString());
                else if (field.getColumnType().equalsIgnoreCase("int"))
                    prepStatement.setInt(dbIndex,
                        Integer.parseInt(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("long"))
                    prepStatement.setLong(dbIndex,
                        Long.parseLong(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("float"))
                    prepStatement.setFloat(dbIndex,
                        Float.parseFloat(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("double"))
                    prepStatement.setDouble(dbIndex,
                        Double.parseDouble(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("short"))
                    prepStatement.setShort(dbIndex,
                        Short.parseShort(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("boolean"))
                    prepStatement.setBoolean(dbIndex,
                        Boolean.parseBoolean(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("byte"))
                    prepStatement.setByte(dbIndex,
                        Byte.parseByte(inputTupleList.get(i).toString()));
                else if (field.getColumnType().equalsIgnoreCase("Date"))
                {
                    Date dateToAdd = null;
                    if (!(inputTupleList.get(i) instanceof Date))
                    {
                        DateFormat df = new SimpleDateFormat("yyyy-MM-dd hh:mm:ss");
                        try
                        {
                            dateToAdd = df.parse(inputTupleList.get(i).toString());
                        }
                        catch (ParseException e)
                        {
                            System.err.println("Data type not valid");
                        }
                    }
                    else
                    {
                        dateToAdd = (Date) inputTupleList.get(i);
                    }
                    // Moved out of the else branch so parsed dates are also written;
                    // the original listing skipped setDate() for parsed values.
                    java.sql.Date sqlDate = new java.sql.Date(dateToAdd.getTime());
                    prepStatement.setDate(dbIndex, sqlDate);
                }
            }
            catch (SQLException e)
            {
                e.printStackTrace();
            }
        }
        Date now = new Date();
        try
        {
            prepStatement.setTimestamp(dbIndex + 1, new java.sql.Timestamp(now.getTime()));
            prepStatement.addBatch();
            counter.incrementAndGet();
            if (counter.get() == batchSize)
                executeBatch();
        }
        catch (SQLException e1)
        {
            e1.printStackTrace();
        }
    }
    else
    {
        long curTime = System.currentTimeMillis();
        // Milliseconds to seconds; the original listing divided by 60*1000.
        long diffInSeconds = (curTime - startTime) / 1000;
        if (counter.get() < batchSize && diffInSeconds > batchTimeWindowInSeconds)
        {
            try
            {
                executeBatch();
                startTime = System.currentTimeMillis();
            }
            catch (SQLException e)
            {
                e.printStackTrace();
            }
        }
    }
}

public void executeBatch() throws SQLException
{
    batchExecuted = true;
    prepStatement.executeBatch();
    counter = new AtomicInteger(0);
}


Once the spout and bolts are ready to be executed, a topology is built by the topology builder to execute them. The next section explains the execution steps.

Running and Testing the Topology in a Local Cluster

Define the topology using TopologyBuilder, which exposes the Java API for specifying a topology for Storm to execute:

• Using StormSubmitter, we submit the topology to the cluster. It takes the name of the topology, the configuration, and the topology itself as input.

• Submit the topology.

Listing Eight: Building and executing a topology.

public class StormMain
{
    public static void main(String[] args) throws AlreadyAliveException,
            InvalidTopologyException, InterruptedException
    {
        ParallelFileSpout parallelFileSpout = new ParallelFileSpout();
        ThresholdBolt thresholdBolt = new ThresholdBolt();
        DBWriterBolt dbWriterBolt = new DBWriterBolt();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", parallelFileSpout, 1);
        builder.setBolt("thresholdBolt", thresholdBolt, 1).shuffleGrouping("spout");
        builder.setBolt("dbWriterBolt", dbWriterBolt, 1).shuffleGrouping("thresholdBolt");

        // The original listing referenced this.argsMain from a static method and used
        // conf before declaring it; the args parameter and a single Config are used here.
        Config conf = new Config();
        if (args != null && args.length > 0)
        {
            conf.setNumWorkers(1);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        }
        else
        {
            conf.setDebug(true);
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("Threshold_Test", conf, builder.createTopology());
        }
    }
}
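When running in local mode as in Listing Eight, you may also want to stop the run explicitly once testing is done. A small teardown sketch (ours; the sleep duration is arbitrary):

// Let the local topology run for a while, then stop it and the cluster.
Utils.sleep(60000);                       // backtype.storm.utils.Utils
cluster.killTopology("Threshold_Test");
cluster.shutdown();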

After building the topology, it is submitted to the local cluster. Once the topology is submitted, it runs until it is explicitly killed or the cluster is shut down, without requiring any modifications. This is another big advantage of Storm.

This comparatively simple example shows the ease with which it’s possible to set up and use Storm once you understand the basic concepts of topology, spout, and bolt. The code is straightforward and both scalability and speed are provided by Storm. So, if you’re looking to handle big data and don’t want to traverse the Hadoop universe, you might well find that using Storm is a simple and elegant solution.

— Shruthi Kumar works as a technology analyst and Siddharth Patankar is a software engineer with the Cloud Center of Excellence at Infosys Labs.


This Month on DrDobbs.com

Items of special interest posted on www.drdobbs.com over the past month that you may have missed.

IF JAVA IS DYING, IT SURE LOOKS AWFULLY HEALTHY
The odd, but popular, assertion that Java is dying can be made only in spite of the evidence, not because of it.
http://www.drdobbs.com/240162390

CONTINUOUS DELIVERY: THE FIRST STEPS
Continuous delivery integrates many practices that in their totality might seem daunting. But starting with a few basic steps brings immediate benefits. Here’s how.
http://www.drdobbs.com/240161356

A SIMPLE, IMMUTABLE, NODE-BASED DATA STRUCTURE
Array-like data structures aren’t terribly useful in a world that doesn’t allow data to change because it’s hard to implement even such simple operations as appending to an array efficiently. The difficulty is that in an environment with immutable data, you can’t just append a value to an array; you have to create a new array that contains the old array along with the value that you want to append.
http://www.drdobbs.com/240162122

DIJKSTRA’S 3 RULES FOR PROJECT SELECTION
Want to start a unique and truly useful open-source project? These three guidelines on choosing wisely will get you there.
http://www.drdobbs.com/240161615

PRIMITIVE VERILOG
Verilog is decidedly schizophrenic. There is part of the Verilog language that synthesizers can commonly convert into FPGA logic, and then there is an entire part of the language that doesn’t synthesize.
http://www.drdobbs.com/240162355

DEVELOPING ANDROID APPS WITH SCALA AND SCALOID: PART 2
Starting with templates, Android features can be added quickly with a single line of DSL code.
http://www.drdobbs.com/240162204

FIDGETY USB
Linux-based boards like the Raspberry Pi or the Beagle Bone usually have some general-purpose I/O capability, but it is easy to forget they also sport USB ports.
http://www.drdobbs.com/240162050


INFORMATIONWEEK

Rob Preston VP and Editor In Chief, [email protected] 516-562-5692

Chris Murphy Editor, [email protected] 414-906-5331

Lorna Garey Content Director, Reports, [email protected] 978-694-1681

Brian Gillooly, VP and Editor In Chief, [email protected]

INFORMATIONWEEK.COM

Laurianne McLaughlin Editor [email protected] 516-562-5336

Roma Nowak Senior Director, Online Operations and Production [email protected] 516-562-5274

Joy Culbertson, Web Producer, [email protected]

Atif Malik Director, Web Development [email protected]

MEDIA KITS

http://createyournextcustomer.techweb.com/media-kit/business-technology-audience-media-kit/

UBM TECH

AUDIENCE DEVELOPMENT
Director, Karen McAleer, (516) 562-7833, [email protected]

SALES CONTACTS—WEST

Western U.S. (Pacific and Mountain states) and Western Canada (British Columbia, Alberta)

Sales Director, Michele Hurabiell, (415) 378-3540, [email protected]

Strategic Accounts

Account Director, Sandra Kupiec, (415) 947-6922, [email protected]

Account Manager, Vesna Beso, (415) 947-6104, [email protected]

Account Executive, Matthew Cohen-Meyer, (415) 947-6214, [email protected]

MARKETING

VP, Marketing, Winnie Ng-Schuchman, (631) 406-6507, [email protected]

Marketing Director, Angela Lee-Moll, (516) 562-5803, [email protected]

Marketing Manager, Monique Luttrell, (949) 223-3609, [email protected]

Program Manager, Nicole Schwartz, (516) 562-7684, [email protected]

SALES CONTACTS—EAST

Midwest, South, Northeast U.S. and Eastern Canada (Saskatchewan, Ontario, Quebec, New Brunswick)

District Manager, Steven Sorhaindo, (212) 600-3092, [email protected]

Strategic Accounts

District Manager, Mary Hyland, (516) 562-5120, [email protected]

Account Manager, Tara Bradeen, (212) 600-3387, [email protected]

Account Manager, Jennifer Gambino, (516) 562-5651, [email protected]

Account Manager, Elyse Cowen, (212) 600-3051, [email protected]

Sales Assistant, Kathleen Jurina, (212) 600-3170, [email protected]

BUSINESS OFFICE

General Manager,

Marian Dujmovits

United Business Media LLC

600 Community Drive

Manhasset, N.Y. 11030

(516) 562-5000

Copyright 2013.

All rights reserved.

November 2013 35www.drdobbs.com

UBM TECH
Paul Miller, CEO
Robert Faletra, CEO, Channel
Kelley Damore, Chief Community Officer
Marco Pardi, President, Business Technology Events
Adrian Barrick, Chief Content Officer
David Michael, Chief Information Officer
Sandra Wallach, CFO
Simon Carless, EVP, Game & App Development and Black Hat
Lenny Heymann, EVP, New Markets
Angela Scalpello, SVP, People & Culture
Andy Crow, Interim Chief of Staff

UNITED BUSINESS MEDIA LLC
Pat Nohilly, Sr. VP, Strategic Development and Business Administration
Marie Myers, Sr. VP, Manufacturing

UBM TECH ONLINE COMMUNITIES
Bank Systems & Tech

Dark Reading

DataSheets.com

Designlines

Dr. Dobb’s

EBN

EDN

EE Times

EE Times University

Embedded

Gamasutra

GAO

Heavy Reading

InformationWeek

IW Education

IW Government

IW Healthcare

Insurance & Technology

Light Reading

Network Computing

Planet Analog

Pyramid Research

TechOnline

Wall Street & Tech

UBM TECH EVENT COMMUNITIES
4G World

App Developers Conference

ARM TechCon

Big Data Conference

Black Hat

Cloud Connect

DESIGN

DesignCon

E2

Enterprise Connect

ESC

Ethernet Expo

GDC

GDC China

GDC Europe

GDC Next

GTEC

HDI Conference

Independent Games Festival

Interop

Mobile Commerce World

Online Marketing Summit

Telco Vision

Tower & Cell Summit

http://createyournextcustomer.techweb.com

Andrew Binstock, Editor in Chief, Dr. Dobb’s, [email protected]

Deirdre Blake, Managing Editor, Dr. Dobb’s, [email protected]

Amy Stephens, Copyeditor, Dr. Dobb’s, [email protected]

Jon Erickson, Editor in Chief Emeritus, Dr. Dobb’s

CONTRIBUTING EDITORS
Scott Ambler, Mike Riley, Herb Sutter

DR. DOBB’S EDITORIAL
751 Laurel Street #614, San Carlos, CA 94070 USA

UBM TECH
303 Second Street, Suite 900, South Tower, San Francisco, CA 94107, 1-415-947-6000


Entire contents Copyright © 2013, UBM Tech/United Business Media LLC, except where otherwise noted. No portion of this publication may be reproduced, stored, transmitted in any form, including computer retrieval, without written permission from the publisher. All Rights Reserved. Articles express the opinion of the author and are not necessarily the opinion of the publisher. Published by UBM Tech/United Business Media, 303 Second Street, Suite 900 South Tower, San Francisco, CA 94107 USA, 415-947-6000.