
Data Enrichment using Web APIs

Karthik Gomadam, Peter Z. Yeh, Kunal Verma
Accenture Technology Labs
50 W. San Fernando St, San Jose, CA
{karthik.gomadam, peter.z.yeh, k.verma}@accenture.com

Abstract

As businesses seek to monetize their data, they are leveraging Web-based delivery mechanisms to provide publicly available data sources. Also, as analytics becomes a central part of many business functions such as customer segmentation, competitive intelligence, and fraud detection, many businesses are seeking to enrich their internal data records with data from these data sources. As sources with varying degrees of accuracy and quality proliferate, it is a non-trivial task to effectively select which sources to use for a particular enrichment task. The old model of statically buying data from one or two providers becomes inefficient because of the rapid growth of new forms of useful data, such as social media, and the lack of dynamism to plug sources in and out. In this paper, we present the data enrichment framework, a tool that uses data mining and other semantic techniques to automatically guide the selection of sources. The enrichment framework also monitors the quality of the data sources and automatically penalizes sources that continue to return low-quality results.

Introduction

As enterprises become more data and analytics driven, many businesses are seeking to enrich their internal data records with data from data sources available on the Web. Consider the example where a company's consumer database might have the name and address of its consumers. Being able to use publicly available data sources such as LinkedIn, White Pages, and Facebook to find information such as employment details and interests can help the company collect features for tasks such as customer segmentation. Currently, this is done in a static fashion where a business buys data from one or two sources and statically integrates them with its internal data. However, this approach has the following shortcomings:

1. Inconsistencies in input data: It is not uncommon to have different sets of missing attributes across records in the input data. For example, for some consumers, the street address information might be missing, and for others, information about the city might be missing. To address this issue, one must be able to select the data sources at a granular level, based on the input data that is available and the data sources that can be used.

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

2. Quality of a data service may vary, depending on the input data: Calling a data source with some missing values (even though they may not be mandatory) can result in poor-quality results. In such a scenario, one must be able to find the missing values before calling the source, or use an alternative.

In this paper, we present an overview of a data enrichment framework which attempts to automate many sub-tasks of data enrichment. The framework uses a combination of data mining and semantic technologies to automate various tasks such as calculating which attributes are more important than others for source selection, selecting sources based on information available about a data record and the past performance of the sources, using multiple sources to reinforce low-confidence values, monitoring the quality of sources, and adapting the use of sources based on past performance. The data enrichment framework makes the following contributions:

1. Granular, context-dependent ordering of data sources: We have developed a novel approach to ordering the data sources that are called, based on the data that is available at that point.

2. Automated assessment of data source quality: Our data enrichment algorithm measures the quality of the output of a data source and its overall utility to the enrichment process.

3. Dynamic selection adaptation of data sources: Using the data availability and utility scores from prior invocations, the enrichment framework penalizes or rewards data sources, which affects how often they are selected.

The rest of the paper is organized as follows: we first motivate the need for data enrichment through real-world examples, then present an overview of the data enrichment algorithm, describe the system architecture, discuss our challenges and experiences, and close with related work.

Motivation

We motivate the need for data enrichment through three real-world examples gathered from large Fortune 500 companies that are clients of Accenture.

Creating comprehensive customer models: Creating comprehensive customer models has become a holy grail in the consumer business, especially in verticals like retail and healthcare. The information that is collected impacts key decisions like inventory management, promotions and rewards, and the delivery of a more personal experience.

1. Personalized and targeted promotions: The more information a business has about its customers, the better it can personalize deals and promotions. Further, a business could also use aggregated information (such as the interests of people living in a neighborhood) to manage inventory and run local deals and promotions.

2. Better segmentation and analytics: The provider may need more information about their customers, beyond what they have in the profile, for better segmentation and analytics. For example, a certain e-commerce site may know a person's browsing and buying history on its site and have some limited information about the person, such as their address and credit card information. However, understanding the person's professional activities and hobbies may yield more features for customer segmentation that the site can use for suggestions or promotions.

3. Fraud detection: The provider may need more information about their customers for detecting fraud. Providers typically create detailed customer profiles to predict customer behaviors and detect anomalies. Having demographic and other attributes, such as interests and hobbies, helps build more accurate customer behavior profiles. Most e-commerce providers are under a lot of pressure to detect fraudulent activities on their sites as early as possible, so that they can limit their exposure to lawsuits, compliance violations, or even loss of reputation.

Often, businesses engage customers by asking them to register for programs like reward cards, or connect with customers using social media. This limited initial engagement gives them access to basic information about a customer, such as name, email, address, and social media handles. However, in a vast majority of cases, such information is incomplete, and the gaps are not uniform. For example, for a customer John Doe, a business might have the name, street address, and a phone number, whereas for Jane Doe, the available information might be name, email, and a Twitter handle. Leveraging the basic information and filling the gaps, also called creating a 360-degree customer view, is a significant challenge. Current approaches to addressing this challenge largely revolve around subscribing to data sources like Experian. This approach has the following shortcomings:

1. The enrichment task is restricted to the attributes provided by the one or two data sources that they buy from. If they need some other attributes about the customers, it is hard to get them.

2. The selected data sources may have high-quality information about some attributes, but poor-quality information about others. Even if the e-commerce provider knows about other sources that have those attributes, it is hard to manually integrate more sources.

3. There is no good way to monitor whether there is any degradation in the quality of the data sources.

Using the enrichment framework in this context would allow the e-commerce provider to dynamically select the best set of sources for a particular attribute in a particular data enrichment task. The proposed framework can switch sources across customer records if the most preferred source does not have information about some attributes for a particular record. For low-confidence values, the proposed system uses reconciliation across sources to increase the confidence in the value. The framework also continuously monitors and downgrades sources if there is any loss of quality.

Capital Equipment Maintenance: Companies within the energy and resources industry have significant investments in capital equipment (e.g., drills, oil pumps). Accurate data about this equipment (e.g., manufacturer, model) is paramount to operational efficiency and proper maintenance.

The current process for capturing this data begins with manual entry, followed by periodic manual “walk-downs” to confirm and validate the information. However, this process is error-prone and often results in incomplete and inaccurate data about the equipment.

This does not have to be the case. A wealth of structured data sources (e.g., from manufacturers) exists that provides much of the incomplete, missing information. Hence, a solution that can automatically leverage these sources to enrich existing, internal capital equipment data can significantly improve the quality of that data, which in turn can improve operational efficiency and enable proper maintenance.

Competitive Intelligence: The explosive growth of external data (i.e., data outside the business, such as Web data and data providers) can enable businesses to gather rich intelligence about their competitors. For example, companies in the energy and resources industry are very interested in competitive insights such as where a competitor is drilling (or planning to drill); disruptions to drilling due to accidents, weather, and so on; and more.

To gather these insights, companies currently purchase relevant data from third-party sources (IHS and Dodson are just two examples of third-party data sources that aggregate and sell drilling data) and manually enrich existing internal data to generate a comprehensive view of the competitive environment. However, this process is a manual one, which makes it difficult to scale beyond a small handful of data sources. Many useful open (publicly accessible) data sources (e.g., sources that provide weather data based on GIS information) are omitted, resulting in gaps in the intelligence gathered.

A solution that can automatically perform this enrichment across a broad range of data sources can provide more in-depth, comprehensive competitive insight.

Overview of Data Enrichment Algorithm

Our Data Enrichment Framework (DEF) takes two inputs: 1) an instance of a data object to be enriched, and 2) a set of data sources to use for the enrichment. It outputs an enriched version of the input instance.

DEF enriches the input instance through the following steps. DEF first assesses the importance of each attribute in the input instance. This information is then used by DEF to guide the selection of appropriate data sources to use. Finally, DEF determines the utility of the sources used, so it can adapt its usage of these sources (either in a favorable or unfavorable manner) going forward.

Preliminaries

A data object D is a collection of attributes describing a real-world object of interest. We formally define D as {a_1, a_2, ..., a_n}, where a_i is an attribute.

An instance d of a data object D is a partial instantiation of D; i.e., some attributes a_i may not have an instantiated value. We formally define d as having two elements, d_k and d_u. d_k consists of attributes whose values are known (i.e., instantiated), which we define formally as d_k = {<a, v(a), k_a, k_v(a)>, ...}, where v(a) is the value of attribute a, k_a is the importance of a to the data object D that d is an instance of (ranging from 0.0 to 1.0), and k_v(a) is the confidence in the correctness of v(a) (ranging from 0.0 to 1.0). d_u consists of attributes whose values are unknown and hence are the targets for enrichment. We define d_u formally as d_u = {<a, k_a>, ...}.
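To make these definitions concrete, the sketch below shows one possible in-memory representation, assuming a Python implementation; the class and field names are illustrative, not part of DEF.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class KnownAttribute:
    value: object        # v(a): the instantiated value
    importance: float    # k_a, in [0.0, 1.0]
    confidence: float    # k_v(a), in [0.0, 1.0]

@dataclass
class DataObjectInstance:
    # d_k: attributes with known values, keyed by attribute name
    known: Dict[str, KnownAttribute] = field(default_factory=dict)
    # d_u: attributes to enrich, mapped to their importance k_a
    unknown: Dict[str, float] = field(default_factory=dict)

# Example: a Customer instance with a known name and an unknown occupation
d = DataObjectInstance(
    known={"name": KnownAttribute("John Smith", importance=0.9, confidence=1.0)},
    unknown={"occupation": 0.7},
)
```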

Attribute Importance Assessment

Given an instance d of a data object, DEF first assesses (and sets) the importance k_a of each attribute a to the data object that d is an instance of. DEF uses the importance to guide the subsequent selection of appropriate data sources for enrichment (see the next subsection).

Our definition of importance is based on the intuition that an attribute a has high importance to a data object D if its values are highly unique across all instances of D. For example, the attribute e-mail contact should have high importance to the Customer data object because it satisfies this intuition. However, the attribute Zip should have lower importance to the Customer object because it does not; i.e., many instances of the Customer object have the same zipcode.

DEF captures the above intuition formally with the following equation:

$$k_a = \sqrt{\frac{X^2}{1 + X^2}} \qquad (1)$$

where

$$X = H_{N(D)}(a) \left( \frac{U(a, D)}{|N(D)|} \right) \qquad (2)$$

and

$$H_{N(D)}(a) = -\sum_{v \in a} P_v \log P_v \qquad (3)$$

U(a, D) is the number of unique values of a across all instances of the data object D observed by DEF so far, and N(D) is the set of all instances of D observed by DEF so far. H_{N(D)}(a) is the entropy of the values of a across N(D), and serves as a proxy for the distribution of the values of a.

We note that DEF recomputes k_a as new instances of the data object containing a are observed. Hence, the importance of an attribute to a data object will change over time.
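A minimal sketch of the importance computation, assuming the observed values of each attribute are tracked as a simple list (the function name and the natural-log entropy base are illustrative choices, since the paper does not fix them):

```python
import math
from collections import Counter

def importance(values):
    """Compute k_a from the values of attribute a observed across all
    instances N(D) of the data object, per Equations 1-3."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    # H_N(D)(a): entropy of the observed value distribution (Eq. 3)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # U(a, D) / |N(D)|: fraction of unique values (Eq. 2)
    x = entropy * (len(counts) / n)
    # Squash X into [0, 1) (Eq. 1)
    return math.sqrt(x ** 2 / (1 + x ** 2))

# A highly unique attribute (e-mail) scores higher than a repetitive one (zip)
print(importance(["a@x.com", "b@x.com", "c@x.com", "d@x.com"]))  # ~0.81
print(importance(["95113", "95113", "95113", "95112"]))          # ~0.27
```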

Data Source Selection

DEF selects data sources to enrich the attributes of a data object instance d whose values are unknown. DEF will repeat this step until either there are no attributes in d whose values are unknown or there are no more sources to select.

DEF considers two important factors when selecting the next best source to use: 1) whether the source will be able to provide values if called, and 2) whether the source targets unknown attributes in d_u (especially attributes with high importance). DEF satisfies the first factor by measuring how well the known values of d match the inputs required by the source. If there is a good match, then the source is more likely to return values when it is called. DEF also considers the number of times a source was called previously (while enriching d) to prevent “starvation” of other sources.

DEF satisfies the second factor by measuring how many high-importance, unknown attributes the source claims to provide. If a source claims to provide a large number of these attributes, then DEF should select the source over others. This second factor serves as the selection bias.

DEF formally captures these two considerations with the following equation:

$$F_s = \frac{1}{2^{M-1}} B_s \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|} + \frac{\sum_{a \in d_u \cap O_s} k_a}{|d_u|} \qquad (4)$$

where B_s is the base fitness score of the data source s being considered (this value is randomly set between 0.5 and 0.75 when DEF is initialized), I_s is the set of input attributes to the data source, O_s is the set of output attributes from the data source, and M is the number of times the data source has been selected in the context of enriching the current data object instance.

The data source with the highest score F_s that also exceeds a predefined minimum threshold R is selected as the next source to use for enrichment.
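To make the selection step concrete, here is a sketch of Equation 4 in Python; it reuses the illustrative DataObjectInstance from the Preliminaries sketch, and the Source record (inputs I_s, outputs O_s, base fitness B_s, call count M) is an assumed representation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Source:
    name: str
    inputs: List[str]     # I_s: attributes the source takes as input
    outputs: List[str]    # O_s: attributes the source claims to provide
    base_fitness: float   # B_s: randomly initialized between 0.5 and 0.75
    calls: int = 0        # M: times selected while enriching this instance

def fitness(s: Source, d: "DataObjectInstance") -> float:
    # Input match: mean confidence of the known values the source requires
    input_term = sum(d.known[a].confidence for a in s.inputs if a in d.known) / len(s.inputs)
    # Starvation damping 1/2^(M-1), treating the first selection as M = 1
    damping = 1.0 / (2 ** max(s.calls - 1, 0))
    # Selection bias: importance mass of the unknown attributes the source covers
    output_term = sum(d.unknown[a] for a in s.outputs if a in d.unknown) / max(len(d.unknown), 1)
    return damping * s.base_fitness * input_term + output_term

def select_source(sources: List[Source], d: "DataObjectInstance", R: float) -> Optional[Source]:
    """Pick the source with the highest F_s, provided it exceeds threshold R."""
    scored = [(fitness(s, d), s) for s in sources]
    best_score, best = max(scored, key=lambda t: t[0], default=(0.0, None))
    return best if best is not None and best_score > R else None
```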

For each unknown attribute a' enriched by the selected data source, DEF moves it from d_u to d_k and computes the confidence k_v(a') in the value provided for a' by the selected source. This confidence is used in subsequent iterations of the enrichment process, and is computed using the following formula:

$$k_{v(a')} = \begin{cases} e^{\left(\frac{1}{|V_{a'}|} - 1\right) W} & \text{if } k_{v(a')} = \text{Null} \\ e^{\lambda \left(k_{v(a')} - 1\right)} & \text{if } k_{v(a')} \neq \text{Null} \end{cases} \qquad (5)$$

where

$$W = \frac{\sum_{a \in d_k \cap I_s} k_{v(a)}}{|I_s|} \qquad (6)$$

W is the confidence over all input attributes to the source, and V_{a'} is the set of output values returned by a data source for an unknown attribute a'.

This formula captures two important factors. First, if multiple values are returned, then there is ambiguity, and hence the confidence in the output should be discounted. Second, if an output value is corroborated by output values given by previously selected data sources, then the confidence should be further increased. The λ factor is the corroboration factor (< 1.0), and defaults to 1.0.
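A sketch of the confidence assignment in Equation 5, assuming W has already been computed per Equation 6; the function and parameter names are illustrative:

```python
import math

def value_confidence(prior_confidence, num_values, W, lam=1.0):
    """Confidence k_v(a') for a value returned by the selected source (Eq. 5).

    prior_confidence: existing k_v(a') if a previously selected source already
    supplied this value (None otherwise); num_values: |V_a'|, the number of
    candidate values returned; W: mean confidence over the source's inputs (Eq. 6).
    """
    if prior_confidence is None:
        # Fresh value: discount by ambiguity (more candidates, lower confidence)
        return math.exp((1.0 / num_values - 1.0) * W)
    # Corroborated value: move the existing confidence toward 1.0
    return math.exp(lam * (prior_confidence - 1.0))

print(value_confidence(None, 1, W=0.9))   # single unambiguous value -> 1.0
print(value_confidence(None, 3, W=0.9))   # ambiguous (3 candidates) -> ~0.55
print(value_confidence(0.55, 1, W=0.9))   # corroborated -> ~0.64, increased
```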


In addition to selecting appropriate data sources to use, DEF must also resolve ambiguities that occur during the enrichment process. For example, given the following instance of the Customer data object:

(Name: John Smith, City: San Jose, Occupation: NULL)

a data source may return multiple values for the unknown attribute of Occupation (e.g., Programmer, Artist).

To resolve this ambiguity, DEF will branch the original instance (one branch for each returned value), and each branched instance will be subsequently enriched using the same steps above. Hence, a single data object instance may result in multiple instances at the end of the enrichment process.

DEF will repeat the above process until either d_u is empty or there are no sources whose score F_s exceeds R. Once this process terminates, DEF computes the fitness for each resulting instance using the following equation:

$$\frac{\sum_{a \in d_k} k_{v(a)}\, k_a}{|d_k \cup d_u|} \qquad (7)$$

and returns the top K instances.
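The branching and ranking steps can be sketched as follows, reusing the illustrative DataObjectInstance and KnownAttribute classes from the Preliminaries sketch; confidence_fn stands in for the Equation 5 computation:

```python
import copy
from typing import Callable, List

def branch(d: "DataObjectInstance", attribute: str, candidate_values: List[object],
           confidence_fn: Callable[[object], float]) -> List["DataObjectInstance"]:
    """Create one branch per returned value; each branch is then enriched
    independently using the same selection/enrichment steps."""
    branches = []
    for v in candidate_values:
        b = copy.deepcopy(d)
        importance = b.unknown.pop(attribute)  # move a' from d_u to d_k
        b.known[attribute] = KnownAttribute(v, importance, confidence_fn(v))
        branches.append(b)
    return branches

def instance_fitness(d: "DataObjectInstance") -> float:
    """Eq. 7: confidence-weighted importance of the known values,
    normalized by the total number of attributes in the instance."""
    total = len(d.known) + len(d.unknown)
    return sum(k.confidence * k.importance for k in d.known.values()) / total

# Rank all branches produced during enrichment and keep the top K:
# top_k = sorted(all_branches, key=instance_fitness, reverse=True)[:K]
```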

Data Source Utility Adaptation

Once a data source has been called, DEF determines the utility of the source in enriching the data object instance of interest. Intuitively, DEF models the utility of a data source as a “contract”; i.e., if DEF provides a data source with high-confidence input values, then it is reasonable to expect the data source to provide values for all the output attributes that it claims to target. Moreover, these values should not be generic and should have low ambiguity. If these expectations are violated, then the data source should be penalized heavily.

On the other hand, if DEF did not provide a data source with good inputs, then the source should be penalized minimally (if at all) if it fails to provide any useful outputs.

Alternatively, if a data source is able to provide unambiguous values for unknown attributes in the data object instance (especially high-importance attributes), then DEF should reward the source and give it more preference going forward.

DEF captures this notion formally with the following equation:

$$U_s = W \left[ \frac{1}{|O_s|} \left( \sum_{a \in O_s^+} e^{\frac{1}{|V_a|} - 1} \, k_a^{P_T(v(a))} - \sum_{a \in O_s^-} k_a \right) \right] \qquad (8)$$

where

$$P_T(v(a)) = \begin{cases} P_T(v(a)) & \text{if } |V_a| = 1 \\ \arg\min_{v(a) \in V_a} P_T(v(a)) & \text{if } |V_a| > 1 \end{cases} \qquad (9)$$

O_s^+ are the output attributes from a data source for which values were returned, O_s^- are the output attributes from the same source for which values were not returned, and P_T(v(a)) is the relative frequency of the value v(a) over the past T values returned by the data source for the attribute a. W is the confidence over all input attributes to the source, and is defined in the previous subsection.

The utilities of a data source U_s from the past n calls are then used to adjust the base fitness score of the data source. This adjustment is captured with the following equation:

$$B_s = B_s + \gamma \frac{1}{n} \sum_{i=1}^{n} U_s(T - i) \qquad (10)$$

where B_s is the base fitness score of the data source s, U_s(T - i) is the utility of the data source i time steps back, and γ is the adjustment rate.
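A sketch of the utility computation (Equation 8) and the base fitness adjustment (Equation 10); the returned/missing bookkeeping and the default adjustment rate are assumed representations, not the paper's notation:

```python
import math
from typing import Dict, List, Tuple

def source_utility(W: float, num_outputs: int,
                   returned: Dict[str, Tuple[float, int, float]],
                   missing: Dict[str, float]) -> float:
    """Eq. 8. `returned` maps each attribute in O_s+ to
    (importance k_a, |V_a|, relative frequency P_T of the value);
    `missing` maps each attribute in O_s- to its importance k_a."""
    reward = sum(
        math.exp(1.0 / num_vals - 1.0) * (ka ** p_t)
        for ka, num_vals, p_t in returned.values()
    )
    penalty = sum(missing.values())
    return W * (reward - penalty) / num_outputs

def adapt_base_fitness(base: float, past_utilities: List[float],
                       gamma: float = 0.1) -> float:
    """Eq. 10: shift B_s by gamma times the mean utility over the past n calls."""
    return base + gamma * sum(past_utilities) / len(past_utilities)
```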

System Architecture

Figure 1: Design overview of enrichment framework

The main components of DEF are illustrated in Figure 1. The task manager starts a new enrichment project by instantiating and executing the enrichment engine. The enrichment engine uses the attribute computation module to calculate attribute relevance. The relevance scores are used in source selection. Using the HTTP helper module, the engine then invokes the connector for the selected data source. A connector is a proxy that communicates with the actual data source and is a RESTful Web service in itself. The enrichment framework requires every data source to have a connector, and each connector must have two operations: 1) a return schema GET operation that returns the input and output schema, and 2) a get data POST operation that takes the input for the data source as POST parameters and returns the response as JSON. For internal databases, we have special connectors that wrap queries as RESTful endpoints. Once a response is obtained from the connector, the enrichment engine computes the output value confidence, applies the necessary mapping rules, and integrates the response with the existing data object. In addition, the source degradation factor is also computed. The mapping, invocation, confidence, and source degradation value computation steps are repeated until either all values for all attributes are computed or all sources have been invoked. The result is then written into the enrichment database.
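To illustrate the connector contract (a schema GET and a data POST that returns JSON), here is a minimal sketch using Flask; the route paths, field names, and the canned people-search response are all hypothetical:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/schema", methods=["GET"])
def return_schema():
    """Advertise the source's input (I_s) and output (O_s) attributes."""
    return jsonify({
        "inputs": ["name", "city"],
        "outputs": ["phone", "occupation"],
    })

@app.route("/data", methods=["POST"])
def get_data():
    """Take the source's inputs as POST parameters; return the response as JSON."""
    name = request.form.get("name")
    city = request.form.get("city")
    # A real connector would proxy the call to the underlying data source
    # (or wrap a query against an internal database); this stub is canned.
    return jsonify({"phone": ["555-0100"], "occupation": ["Programmer", "Artist"]})

if __name__ == "__main__":
    app.run(port=5001)
```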


In designing the enrichment framework, we have adopted a service-oriented approach, with the goal of exposing the enrichment framework as a “platform as a service”. The core tasks in the framework are exposed as RESTful endpoints. These include endpoints for creating data objects, importing datasets, adding data sources, and starting an enrichment task. When the “start enrichment task” resource is invoked and a task is successfully started, the framework responds with a JSON document that contains the enrichment task identifier. This identifier can then be used to GET the enriched data from the database. The framework supports both batch GET and streaming GET using the comet pattern.
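Interacting with the framework's RESTful endpoints might then look like the following; the paths and payloads are assumptions for illustration, since the paper does not publish the API:

```python
import requests

BASE = "http://localhost:8080"  # hypothetical framework endpoint

# Start an enrichment task; the response JSON carries the task identifier
resp = requests.post(f"{BASE}/enrichment-tasks", json={"dataset": "customers"})
task_id = resp.json()["taskId"]

# Batch GET of the enriched data once the task has completed
enriched = requests.get(f"{BASE}/enrichment-tasks/{task_id}/results").json()
print(enriched)
```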

Data mapping is one of the key challenges in any data integration system. While an extensive research literature exists on automated and semi-automated approaches to mapping, it is our observation that these techniques do not guarantee the high level of accuracy required in enterprise solutions. So we currently adopt a manual approach, aided by a graphical interface for data mapping. The source and target schemas are shown to the user as two trees, one on the left and one on the right. Users can select attributes from the source schema and draw a line between them and attributes of the target schema. Currently, our mapping system supports assignment, merge, split, numerical operations, and unit conversions. When the user saves the mappings, the maps are stored as mapping rules. Each mapping rule is represented as a tuple containing the source attributes, target attributes, mapping operations, and conditions. Conditions include merge and split delimiters and conversion factors.
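A mapping rule as described above might be represented as follows; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MappingRule:
    source_attributes: List[str]
    target_attributes: List[str]
    operation: str                   # "assign", "merge", "split", "convert", ...
    condition: Optional[str] = None  # e.g., merge/split delimiter or conversion factor

# Merge the source's first/last name fields into the target's full name,
# using a space as the merge delimiter
rule = MappingRule(["first_name", "last_name"], ["name"], "merge", condition=" ")
```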

Challenges and Experiences

• Data services and APIs are an integral part of the system.

• They have been around for a while, but have gained minimal traction in the enterprise.

• API rate limiting, and a lack of SLA-driven API contracts.

• Poor API documentation, often leading the developer in circles; minimalism in APIs might not be a good idea.

• Changes are not “pushed” to developers, who often find out only when an application breaks.

• Inconsistent auth mechanisms across different APIs; mandatory auth makes it expensive (even when pulling information that is open on the Web).

• It is easy to develop prototypes, but the APIs do not instill enough confidence to develop a deployable client solution.


Related Work

Web APIs and data services have helped establish the Web as a platform. Web APIs have enabled the deployment and use of services on the Web using standardized communication protocols and message formats. Building on Web APIs, data services have allowed access, in a standardized manner, to vast amounts of data that were hitherto hidden in proprietary silos. A notable outcome is the development of mashups, or Web application hybrids. A mashup is created by integrating data from various services on the Web using their APIs. Although early mashups were consumer-centric applications, their adoption within the enterprise has been increasing, especially in addressing data-intensive problems such as data filtering, data transformation, and data enrichment.
