
TAR - Enabling Fine-Grained Targeted Advertising in Retail Stores

Xiaochen Liu
University of Southern California
[email protected]

Yurong Jiang
LinkedIn*
[email protected]

Puneet Jain
Google*
[email protected]

Kyu-Han Kim
Hewlett-Packard Labs
[email protected]

CCS CONCEPTS
• Information systems → Information systems applications; • Networks → Mobile networks; Location based services; • Human-centered computing → Mobile computing;

ACM Reference Format:
Xiaochen Liu, Yurong Jiang, Puneet Jain, and Kyu-Han Kim. 2018. TAR - Enabling Fine-Grained Targeted Advertising in Retail Stores. In MobiSys '18: The 16th Annual International Conference on Mobile Systems, Applications, and Services, June 10–15, 2018, Munich, Germany. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3210240.3210342

Abstract
Mobile advertisements influence customers' in-store purchases and boost in-store sales for brick-and-mortar retailers. Targeting mobile ads has become significantly important for competing with online shopping. The key to enabling targeted mobile advertisements and services is to learn shoppers' interests during their stay in the store. Precise shopper tracking and identification are essential to gain these insights. However, existing sensor-based or vision-based solutions are neither practical nor accurate; no commercial solution today can be readily deployed in a large store. On the other hand, we recognize that most retail stores already have surveillance cameras installed, and most shoppers carry Bluetooth-enabled smartphones. Thus, in this paper, we propose TAR to learn shoppers' in-store interests via accurate multi-camera people tracking and identification. TAR leverages widespread camera deployments and Bluetooth proximity information to accurately track and identify shoppers in the store. TAR is composed of four novel design components: (1) deep neural network (DNN) based visual tracking, (2) user trajectory estimation using the shopper's visual and BLE proximity traces, (3) identity matching and assignment to recognize the shopper's identity,

*The work was done at Hewlett-Packard Labs. Research was sponsored by Hewlett-Packard Labs and the Army Research Laboratory with the Cooperative Agreement Number W911NF-09-2-0053 (the ARL Network Science CTA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MobiSys '18, June 10–15, 2018, Munich, Germany
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5720-3/18/06...$15.00
https://doi.org/10.1145/3210240.3210342

and (4) a cross-camera calibration algorithm. TAR carefully combines these components to track and identify shoppers in real time. TAR achieves 90% accuracy in two different real-life deployments, which is 20% better than the state-of-the-art solution.
Keywords: Shopping, Computer Vision, Mobile Sensing, Tracking, Bluetooth, Edge Computing

1 INTRODUCTION
Digital interactions influence 49% of in-store purchases, and over half of them take place on mobile devices [12]. With this growing trend, brick-and-mortar retailers have been evolving their campaigns to effectively reach people on mobile devices, showcase products, and ultimately influence their in-store purchases. Among these efforts, sending targeted advertisements (ads) to users' mobile devices has emerged as a frontrunner [30].

To send well-targeted information to a shopper, retailers (and advertisers) must correctly understand the customer's interests. The key to learning these interests is to accurately track and recognize the customer during her stay in the store. Therefore, retailers need a practical shopper tracking and identification system with real-time performance and high accuracy. For example, a retailer's advertising system would require aisle-level or meter-level tracking accuracy to infer a customer's dwell time at a certain aisle. Moreover, the advertising should reflect changes in the customer's position quickly, because people often stay at, or walk by, a specific shelf for just a few seconds.

Several sensor-based indoor tracking methods have been proposed, such as Wi-Fi localization [14, 51], Bluetooth localization [59, 84], stereo cameras [22, 26, 38, 80], and thermal sensors [28]. However, such approaches are either expensive in hardware cost or inaccurate for retail scenarios. Some commercial solutions [2, 13] send customers the entire store's information when they enter the store zone. Such promotions are coarse-grained and can hardly trigger customers' interest.

Recently, live video analytics has become a promising approach for accurate shopper tracking. Companies like Amazon Go [17] and Standard Cognition [11] use closed-source algorithms to identify customers and track their in-store movement. The open-source community has also proposed many accurate methods for people (customer) identification and tracking.

For people (re)identification, there are two mainstream approaches: face recognition and body-feature classification. Today's face recognition solutions ([55, 66, 87]) can reach up to 95% precision on public datasets, thanks to advances in deep neural networks


(DNN). However, the customer's face is not always visible to the camera, and the face image may be blurry and dark due to poor lighting and long distance. Body-feature-based solutions [40, 47, 92] do not deliver high accuracy (< 85%) and also suffer from poor video quality.

For people tracking, the retailer needs both single-camera tracking and cross-camera tracking to understand each customer's walking path. Recent algorithms for single-camera tracking [39, 53, 67, 88] leverage both a person's visual features and past trajectory to track her position in subsequent frames. However, such algorithms do not perform well in challenging environments, e.g., similar clothes, long occlusions, and crowded scenes. Existing cross-camera tracking algorithms [46, 75, 79, 86] use the camera network's topology to estimate cross-camera temporal-spatial similarity and match each customer's trace across cameras. Such solutions face challenges like unpredictable people movement between the surveillance zones.

In this paper, we propose TAR to overcome the above limitations. As summarized above, existing indoor localization solutions are not accurate enough in practice and usually require the deployment of complicated and expensive equipment. Instead, this paper proposes a practical end-to-end shopper tracking and identification system. TAR is based on two fundamental ideas: Bluetooth proximity sensing and video analytics.

To infer a shopper's identity, TAR looks into the Bluetooth Low Energy (BLE) signal broadcast from the user's device. BLE has recently gained popularity with numerous emerging applications in the industrial Internet of Things (IoT) and home automation. Proximity estimation is one of the most common use cases of BLE beacons [4]. Apple iBeacon [18], Android Eddystone [24], and the open-source AltBeacon [16] are available options. Several retail giants (e.g., Target, Macy's) have already deployed them in stores to create a more engaging shopping experience by identifying items in proximity to customers [27, 59, 84].

TAR takes a slightly different perspective from the above scenario: shoppers carry BLE-equipped devices, and TAR utilizes their BLE signals to enhance tracking and identify shoppers. At a high level, TAR achieves identification by attaching the sensed BLE identity to a visually tracked shopper. TAR exploits the pattern similarity between a shopper's BLE proximity trace and her visual movement trajectory. The identification problem therefore converts to a trace matching problem.

In solving this matching problem, TAR encounters four challenges. First, pattern matching in real time is challenging due to different coordinate systems and noisy trace data. TAR transforms both traces into the same coordinates with camera homography projection and BLE signal processing. Then, TAR devises a probabilistic matching algorithm based on Dynamic Time Warping (DTW) [42] to match the patterns. To enable real-time matching, TAR applies a moving window to match trace segments and uses a cumulative confidence score to judge the matching result.

Next, assigning the highest-ranked BLE identity to a visual trace is often incorrect. Factors like short visual traces can significantly increase the assignment uncertainty. To solve this problem, TAR uses a linear-assignment-based algorithm to correctly determine the BLE identity. Moreover, instead of focusing on a single trace, TAR looks at all visual-BLE pairs (i.e., a global view) and assigns IDs for all visual traces in a camera.

Third, a single user's visual tracking trace can frequently break upon occlusion. To solve this issue, TAR implements a rule-based scheme that differentiates ambiguous visual tracks during the assignment process and connects broken tracks with respect to each BLE ID.

Finally, it is non-trivial to track people across cameras with different positions and angles. Existing works [75, 79, 82, 89] either work offline or require overlapping camera coverage to handle a transition from one camera to the other. However, overlapping coverage is not guaranteed in most shops. To overcome this issue, TAR proposes an adaptive probabilistic model that tracks and identifies shoppers across cameras with few constraints.

We have deployed TAR in an office and a retail store environment and analyzed TAR's performance under various settings. Our evaluation results show that the system achieves 90% accuracy, which is 20% better than the state-of-the-art multi-camera people tracking algorithm. Meanwhile, TAR achieves a mean speed of 11 frames per second (FPS) per camera, which enables live video analytics in practice.

The main contributions of our work are listed below:
• development of TAR, a system for multi-camera shopper tracking and identification (Sec. 3). TAR can be seamlessly integrated with existing surveillance systems, incurring minimal deployment overhead;
• introduction of four key elements to design TAR (Sec. 3);
• a novel vision and mobile device association algorithm with multi-camera support; and
• implementation, deployment, and evaluation of TAR. TAR runs in real-time and achieves over 90% accuracy (Sec. 4).

2 MOTIVATION
Retail trends: While the popularity of e-commerce continues to surge, offline in-store commerce still dominates today's market. Studies in [23, 25] show that 91% of purchases are made in physical shops. In addition, [15] indicates that 82% of Millennials prefer to shop in brick-and-mortar stores. As online shopping evolves rapidly, it is crucial for offline shops to adapt and offer a better shopping experience. Therefore, it is essential for offline retailers to understand shoppers' demands for better service, given that today's customers are more informed about the items they want.
The need for shopper tracking and identification: By observing where a shopper is and how long she visits each area, retailers can identify the customer's shopping interest and hence provide a customized shopping experience for each person. For example, many large retail stores (e.g., Nordstrom [5], Family Dollar, Mothercare [6]) are already adopting shopper tracking solutions (e.g., Wi-Fi localization). These retailers then use the gathered data to help design store layouts, product placements, and product promotions.
Existing solutions: Several companies [35, 72, 80] provide solutions for shopper behavior tracking, primarily using surveillance camera feeds. The solutions include features like shopper counting, the spatial-temporal distribution of customers, and shoppers' aggregated trajectories. However, they are not capable of understanding per-shopper insight (or identity). Services like Facebook [2, 13] offer targeted advertisement for retail stores. They leverage coarse-grained


location data and the shopper's online browsing history to identify store-level information (which store the customer is visiting) and push relevant advertisements. Therefore, such solutions can hardly recognize the shopper's in-store behavior.

Camera-based tracking with face recognition can be used to infer shoppers' indoor activities, but it introduces several concerns: privacy, availability, and accuracy. First, the face is privacy-sensitive information, and collecting it might increase users' privacy concerns (or even violate the law). Second, the face is sometimes unavailable in surveillance footage due to varying camera angles and body poses. Moreover, face recognition algorithms are known to be vulnerable to factors like image quality. Finally, face recognition requires the user's face image to train the model, which adds overhead for shoppers; asking them to submit a good face image and verifying the photo's authenticity (e.g., offline ID confirmation) are not easy.
Our proposed approach: TAR adopts a vision-based tracking approach but extends it to enable shopper identification with BLE. We exploit the availability of BLE signals from the shopper's smartphone and combine the signals with vision-based technologies to achieve good tracking accuracy across cameras.

Modern smartphones are equipped with Bluetooth Low Energy (BLE) chips, and many BLE-based applications and hardware have been developed. A typical use of BLE is to act as a beacon, which broadcasts a BLE signal at a particular frequency. The beacon can serve as a unique identifier for the device and can be used to estimate its proximity to the receiver [69]. Our approach assumes the availability of BLE signals from shoppers, and this assumption is becoming increasingly realistic via incentive mechanisms (e.g., mobile apps for coupons).

Therefore, in addition to our customized vision-based detection and tracking algorithms, we carefully integrate them with BLE proximity information to achieve high accuracy for tracking and identification across cameras.

In designing the system, we aim to achieve the following goals:

• Accurate: TAR should outperform the accuracy of existing multi-camera tracking systems. It should also be precise in distinguishing people's identities.

• Real-time: TAR should recognize each customer's identity within a few seconds, since a shopper might be highly mobile across multiple cameras. Meanwhile, TAR should detect the appearance of people at a high frame rate (FPS).

• Practical: TAR should not require any expensive hardware or complex deployment. It should leverage existing surveillance camera systems and the user's smartphone.

3 THE DESIGN
This section presents the design of TAR. We begin with an overview of TAR's design and its motivating use cases. Then, we explain the detailed components that address technical challenges specific to the retail environment.

3.1 Design Overview
Figure 1 depicts the design of TAR, which consists of two major parts: 1) a mobile Bluetooth Low Energy (BLE) library that enables

Figure 1—System Overview for TAR

smart devices to act as BLE beacons in the background, and 2) a server backend that collects real-time BLE signals and video data and performs customer tracking and identification. First, we assume customers usually carry their smartphones with a store application installed [34]. The store app includes TAR's mobile library, which broadcasts a BLE signal in a background thread. Note that the BLE protocol is designed with minimal battery overhead [21], and the broadcasting process does not require the customer's intervention. Next, TAR's server backend includes several hardware and software components. We assume each surveillance camera is equipped with a BLE receiver for BLE sensing. Both the camera feed and the BLE sensing data are sent to TAR for real-time processing.

TAR is composed of several key components to enable accurate tracking and identification. It has a deep neural network (DNN) based tracking algorithm (Sec. 3.3) to track users visually, and then incorporates a BLE proximity algorithm to estimate the user's movement (Sec. 3.4). In addition, TAR adopts a probabilistic matching algorithm based on Dynamic Time Warping (DTW) [42] to associate the vision and BLE data and find the user's identity (Sec. 3.5). However, external factors such as occlusion can harm the accuracy of the sensed data, and relying solely on the matching algorithm usually results in errors. To handle this issue, TAR uses a stepwise matching algorithm based on a cumulative confidence score. After that, TAR devises an ID assignment algorithm to determine the correct identity from a global view (Sec. 3.5.2). As vision-based traces might frequently break, sewing them together is essential to learning user interests. We propose a rule-based scheme to identify ambiguous user traces and properly connect them (Sec. 3.5.3). Finally, the start of the probabilistic matching process encounters more uncertainty due to the limited trace length. Therefore, TAR considers each user's cross-camera temporal-spatial relationship and carefully initializes the confidence level to improve identification accuracy (Sec. 3.5.4).

3.2 A Use Case
Figure 2 illustrates an example of how TAR works. A grocery store is equipped with two video cameras that cover different aisles, as shown in Figure 2(a). Assume a customer with her smartphone enters the store and the app starts broadcasting a BLE signal. The customer is looking for snacks and finally finds the snack aisle. During her stay, the two cameras capture her trajectory. Briefly, camera-1 (bottom) sees the user first and senses her BLE signals. TAR then starts matching the user's visual trace to the estimated BLE proximity trace. TAR maintains a confidence score for the tracked customer's BLE identity. When the user exits the camera-1 zone and enters


Figure 2—A Targeted Ad Working Example in Store

the camera-2 region (top), TAR considers various factors including the temporal-spatial relationship and visual feature similarity, and then adjusts the initial confidence score for the customer in the new zone. Then, camera-2 starts its own tracking and identification process and concludes the customer's identity (7FD4 in Figure 2(b)). TAR then continuously learns her dwell time and fine-grained trajectory at each shelf.

In the following sections, we detail the core components of TAR that realize the features above and other use cases of fine-grained tracking and identification.

3.3 Vision-based Tracking (VT)
We design a novel vision-based tracking method (VT) that consists of three components: people detection and visual feature extraction, visual object tracking, and physical trajectory estimation.

3.3.1 People Detection and Deep Visual Feature. Recent developments in DNNs provide accurate and fast people detectors. A detector locates people in each frame and marks the detected positions with bounding boxes. Among various proposals, we choose Faster-RCNN [71] as TAR's people detector because it achieves high accuracy at a reasonable speed. We evaluate its performance against other options in Sec. 4.

In addition to detection, TAR extracts and uses the visual features of each detected bounding box to improve inter-frame people tracking. Briefly, once a person's bounding box is detected, TAR extracts its visual feature using a DNN. An ideal visual feature accurately distinguishes each person under different poses and lighting conditions. Recently proposed DNN-based feature extractors outperform other features (e.g., color histograms, SIFT [64]) in classification accuracy. We have evaluated state-of-the-art feature extractors, including CaffeNet, ResNet, VGG16, and GoogLeNet [31], and identified that the convolutional neural network (CNN) version of GoogLeNet [95] delivers the best tradeoff between speed and accuracy. We then further trained the model with two large-scale people re-identification datasets together (MARS [93] and DukeReID [54]), comprising over 1,100,000 images of 1,261 pedestrians.

3.3.2 People Tracking in Consecutive Video Frames. The tracking algorithm in TAR is inspired by DeepSort [88], a state-of-the-art online tracker. At a high level, DeepSort combines each person's visual feature with a standard Kalman filter, which matches objects based on squared Mahalanobis distance [49]. DeepSort optimizes the matching process by minimizing the cosine distance between deep features. However, it often fails when multiple people collide in a video. The cause is that a detection bounding box covering colliding people becomes large, and the deep visual feature calculated from that box cannot accurately represent the person inside.

To overcome this problem, TAR leverages the geometric relationship between objects. When multiple people are close to each other and their bounding boxes have a large intersection-over-union (IOU) ratio, TAR does not differentiate those persons using DNN-generated visual features. Instead, those people's visual traces are regarded as "merged" until some of them start leaving the group. When the bounding boxes' IOU values fall below a threshold (set to 0.3), they are regarded as "departed" and TAR resumes visual-feature-based tracking.
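For illustration only, the following Python sketch shows one way to implement this merge/depart rule. It is not TAR's released code; the track objects (with box, id, and a merged_with set) are hypothetical, while the 0.3 threshold is the value quoted above.

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

IOU_DEPART_THRESHOLD = 0.3   # threshold quoted in the text

def update_merge_state(tracks):
    # Mark pairs of tracks as "merged" while their boxes overlap heavily,
    # and as "departed" once the overlap drops below the threshold.
    for i, ta in enumerate(tracks):
        for tb in tracks[i + 1:]:
            if iou(ta.box, tb.box) >= IOU_DEPART_THRESHOLD:
                ta.merged_with.add(tb.id)
                tb.merged_with.add(ta.id)
            else:
                ta.merged_with.discard(tb.id)
                tb.merged_with.discard(ta.id)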

The hybrid approach above still faces challenges. When two users wearing similarly colored clothes cross paths, the matching algorithm sometimes fails because the users' tracking IDs get switched. To avoid this error, we add a kinematic verification component to our matching algorithm. The idea is that a person's movement is likely to be nearly constant over a short period. Therefore, we compute the velocity and the relative orientation of each detected object in the current frame and compare them to the velocity and orientation of existing tracked objects. This component serves as a verification module that triggers matching only for objects whose kinematic conditions are similar. TAR avoids the confusion above because the two users show different velocities and orientations.
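A minimal sketch of such a kinematic check is shown below. The speed-ratio and heading thresholds are illustrative placeholders (not values reported in the paper), and the track fields last_pos, speed, and heading are hypothetical.

import math

def kinematics(prev_pos, cur_pos, dt):
    # Speed and heading implied by a displacement over time dt.
    dx, dy = cur_pos[0] - prev_pos[0], cur_pos[1] - prev_pos[1]
    return math.hypot(dx, dy) / max(dt, 1e-6), math.atan2(dy, dx)

def kinematically_compatible(track, detection_pos, dt,
                             max_speed_ratio=2.0, max_heading_diff=math.pi / 3):
    # Only allow matching when the implied motion is close to the track's
    # recent speed and heading.
    det_speed, det_heading = kinematics(track.last_pos, detection_pos, dt)
    diff = det_heading - track.heading
    heading_diff = abs(math.atan2(math.sin(diff), math.cos(diff)))  # wrap to [-pi, pi]
    return (det_speed <= max_speed_ratio * max(track.speed, 1e-6)
            and heading_diff <= max_heading_diff)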

The people tracking algorithm in TAR synthesizes the temporal-spatial relationship and visual feature distance to track each person (i.e., her ID) accurately. First, it adopts a Kalman filter to predict the moving direction and speed of each person (called a track), and then predicts the track's position in the next frame. In the next frame, TAR computes the distance between the predicted position and each detection box's position. Second, TAR calculates each bounding box's intersection area with the last few positions of each track; a larger IOU ratio means a higher matching probability. Third, TAR extracts a deep visual feature (see Sec. 3.3.1) of the detected object and compares it with the track's feature. Here, TAR first filters out tracks with the kinematic verification, and then applies all three matching metrics. Finally, it assigns each detection to a track. If a detection cannot match any track with enough confidence, TAR searches one frame backward to find a matching track. On the other hand, if a track is not matched for a long time (a person moves out of a camera's view), it is regarded as "missing" and is deleted.
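Putting the three cues together, the per-frame association cost could be composed as sketched below. This is our own simplification, not TAR's implementation: the weights and gating behaviour are illustrative, kalman_predict() and the detection/track fields are hypothetical, features are assumed L2-normalized, and iou() / kinematically_compatible() are the helpers sketched earlier.

import math
import numpy as np

def association_cost(track, detection, dt, w_pos=0.3, w_iou=0.3, w_feat=0.4):
    # Gate out kinematically implausible pairs, then combine position error,
    # IOU overlap, and deep-feature (cosine) distance into one cost.
    if not kinematically_compatible(track, detection.position, dt):
        return float("inf")
    pred = track.kalman_predict(dt)                      # predicted (x, y) of the track
    pos_err = math.hypot(pred[0] - detection.position[0],
                         pred[1] - detection.position[1])
    overlap = iou(track.box, detection.box)              # larger overlap -> lower cost
    feat_dist = 1.0 - float(np.dot(track.feature, detection.feature))
    return w_pos * pos_err + w_iou * (1.0 - overlap) + w_feat * feat_dist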

3.3.3 Physical Trajectory Estimation. Once visual tracking finishes, TAR converts the results to physical trajectories by applying a homography transformation [33]. Specifically, TAR infers people's physical locations by using both the visual tracking results and several camera parameters. Assuming the surveillance cameras are stationary and well calibrated, TAR can estimate the height and facing direction of detected objects in world coordinates. Moreover, these cameras can provide information about their current resolution and angle of view. With these calibration properties, TAR calculates a projective transformation matrix [33]


H that maps each pixel in the frame to a ground location in world coordinates. As a person (or track) moves, TAR associates her distance change with timestamps, yielding a physical trajectory.

However, the homography mapping process introduces a unique challenge: it needs to project the pixels in a detected bounding box (bbox) to estimate physical distance, but the bbox size may vary frame by frame. For example, a person's bbox may cover her entire body in one frame and include only the upper body in the next. Moreover, transforming all pixels in the bbox imposes an extra computational burden. To deal with this challenge, TAR chooses a single reference pixel for each detected person while ensuring spatial consistency of the reference pixel even as the bbox changes. Specifically, TAR picks the pixel where the bbox meets the ground, i.e., the person's foot position, and uses the bottom-center pixel of the bbox as the reference pixel. One may argue that the bbox's bottom may not always be a foot position (e.g., when the customer's lower body is blocked). TAR leverages the fact that a person's width and height show a ratio of around 1:3. With this intuition, TAR checks whether a detected bbox is "too short" (i.e., blocked) and, if so, extends the bottom side of the bbox based on the ratio. Our evaluation shows that TAR's physical trajectory estimation achieves less than 10% error, even in a crowded area.
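The reference-pixel projection can be sketched as follows, assuming H is the calibrated pixel-to-ground homography; the 0.8 "too short" cutoff around the 1:3 ratio heuristic is an illustrative placeholder.

import numpy as np

def reference_pixel(bbox, width_to_height=1.0 / 3.0):
    # bbox = (x1, y1, x2, y2); return the bottom-center pixel, extending the
    # bottom edge when the box looks "too short" (feet likely occluded).
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    expected_h = w / width_to_height        # height implied by the ~1:3 ratio
    if h < 0.8 * expected_h:                # 0.8 is an illustrative cutoff
        y2 = y1 + expected_h
    return ((x1 + x2) / 2.0, y2)

def to_ground(H, pixel):
    # Project an image pixel onto the ground plane with homography H.
    q = H @ np.array([pixel[0], pixel[1], 1.0])
    return q[:2] / q[2]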

3.4 People Tracking with BLE
In addition to VT, TAR relies on BLE proximity to accurately estimate people's trajectories. We first introduce BLE beacons and then explain TAR's proximity estimation algorithm.
BLE background. A BLE beacon represents a class of BLE devices. It periodically broadcasts its identifier to nearby devices. A typical BLE beacon is powered by a coin cell battery and can have a 1-3 year lifetime. Today's smartphones support the Bluetooth 4.0 protocol, so they can operate as a BLE beacon (transmitter). Similarly, any device that supports Bluetooth 4.0 can be used as a BLE receiver. TAR's mobile component turns a customer's smartphone into a BLE beacon. This component is designed as a library, so other applications (e.g., a store app) can easily integrate it and run it as a background process.
Proximity Trace Estimation. The BLE proximity trace is estimated by collecting a BLE beacon's time-series proximity data. Through our extensive evaluation, we select the proximity algorithm in [16] to estimate the distance from the BLE beacon to the receiver. There are two ways to calculate proximity using the BLE Received Signal Strength (RSS): (1) d = exp((E − RSS) / 10n), where E is the transmission power (dBm) and n is the fading coefficient; (2) the beacon's transmission power ts defines the expected RSS for a receiver one meter away from the beacon. Denoting the actual RSS as rs, we compute the ratio rt = rs / ts. The distance is then estimated as d = rt^10 if rt < 1.0, and d = c1 * rt^c2 + c3 otherwise, where c1, c2, and c3 are coefficients obtained from data regression. We implement both algorithms and compare their performance on the collected data. We find that the second option is more sensitive to movement and therefore reflects the movement pattern more promptly and accurately.
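Since TAR adopts the second (ratio-based) model, a minimal sketch of it is given below. The coefficient defaults are placeholders: the paper regresses c1, c2, and c3 per receiver model rather than publishing fixed values.

def estimate_proximity(rss, tx_power_1m, c1=0.9, c2=7.7, c3=0.1):
    # rss and tx_power_1m are in dBm (typically negative), so rt is positive.
    # d = rt^10 if rt < 1, else c1 * rt^c2 + c3.
    rt = rss / float(tx_power_1m)
    if rt < 1.0:
        return rt ** 10
    return c1 * (rt ** c2) + c3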

In practice, these coefficients depend on the receiver device manufacturer. For example, the Nexus 4 and Nexus 5 use the same Bluetooth chip from LG, so they share the same parameters. In TAR, we have full knowledge of our receivers, so we regress the coefficients accordingly. Since TAR also controls the beacon side, the transmission power of each beacon is known to TAR. Note that BLE RSS readings are inherently noisy, so we apply an RSS noise-filtering strategy similar to [44] to the raw signal and then calculate the current rs with the above formula. TAR takes the time series of BLE proximity as the BLE trace for each device and its owner. We assume each customer carries one device with TAR installed; the case where one user carries multiple devices or another person's device is left for future work.

Figure 3—Relationship between BLE proximity and physical distance

3.5 Real-time Identity Matching
The key to learning the user's interests and pushing ads is accurate user tracking and identification. By tracking the customer, we know where she visits and what she is interested in. By identifying the user, we know who she is and whom to send the promotion to. In practice, it is unnecessary to know the user's real identity; recognizing the smart device carried by the user achieves the same goal. We find that the BLE universally unique ID (UUID) can serve as the identifier for the device. If we associate the BLE UUID with the visually tracked user, we can identify her and learn her specific interests by looking back at her trajectories. On the other hand, we notice that a particular user's BLE proximity trace usually correlates with her physical movement trajectory and her visual movement. Figure 3 shows example traces of a customer and illustrates the relationship between BLE proximity and physical distance. Therefore, TAR aims to associate the user's visually tracked trace with the sensed BLE proximity trace. Inspired by this observation, we propose a similarity-based association algorithm with movement pattern matching for TAR.

3.5.1 Stepwise Trace Matching. In the matching step, we first need to decide how the traces should be matched. We notice that BLE proximity traces are usually continuous, but visual tracks can easily break, especially under occlusion. With this observation, we use the visual tracking trace to match against BLE proximity traces. The BLE trace continuity, in turn, helps correct the real-time visual tracking. To match the time-series data, we devised our algorithm based on Dynamic Time Warping (DTW). DTW matches each sample in one pattern to another using dynamic programming. If two patterns change together, their matched sample points have a shorter distance, and vice versa. Therefore, a shorter DTW distance means a higher similarity between two traces. Based on the DTW distance, we define a confidence score to quantify the similarity. Mathematically, if dt_ij is the DTW distance between visual track v_i and BLE proximity trace b_j, the confidence score is f_ij = exp(−dt_ij / 100).
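Assuming the Python fastdtw package (an implementation choice consistent with the FastDTW algorithm cited in Sec. 4.1), the per-window confidence score can be computed as in this sketch:

import math
from fastdtw import fastdtw

def confidence_score(visual_segment, ble_segment):
    # Both inputs are 1-D sequences of (differenced) distance samples.
    dt_ij, _path = fastdtw(visual_segment, ble_segment)
    return math.exp(-dt_ij / 100.0)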


Figure 4—(a) Example of a visual trace; (b) Sensed BLE proximity traces; (c) DTW cost matrix for successful matching; (d) Matching Process Illustration.

There are other ways to compare trace similarity, such as Euclidean distance and cosine distance. We compare the effectiveness and efficiency of these choices in Sec. 4.

DTW is a family of algorithms for aligning and measuring the similarity between two time series. There are three challenges in applying DTW to synchronize the BLE proximity and visual traces. First, DTW normally processes the two traces offline, whereas in TAR both traces extend continuously in real time. Second, DTW computes a two-dimensional warping cost matrix whose size grows quadratically with the number of samples. Considering the BLE data's high frequency and the nearly 10 FPS video processing speed, the computation overhead can increase dramatically over time. Finally, DTW aligns the two sequences using their absolute values, but the physical movement estimated from the BLE proximity and the vision-based tracking trace are inconsistent and inherently noisy. Computing DTW directly on their absolute values adversely affects matching.

First, to deal with the negative effect of absolute-value input, TAR adopts a data-differencing strategy similar to [44, 81]. We filter out the high-frequency points in each trace and compute the differential of the current data point by subtracting the prior point and dividing by the time difference. Through this operation, each data sequence becomes independent of the absolute value, and the two can be compared directly.
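As a small sketch of this differencing step (the (timestamp, value) trace layout is an assumption, and low-pass filtering is assumed to have happened already):

def difference(trace):
    # trace: list of (timestamp, value) pairs, assumed already low-pass filtered.
    # Returns the time-normalized first difference, keeping the later timestamp.
    return [(t1, (v1 - v0) / (t1 - t0))
            for (t0, v0), (t1, v1) in zip(trace, trace[1:])]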

Second, a straightforward way to reduce the computation overhead is to minimize the input data size. TAR follows this path and designs a moving-window algorithm to prepare the input for DTW. More concretely, we set a sliding window of three seconds and update the windowed data every second. We choose this window size to balance latency and accuracy: if the window is too short, the BLE trace and visual trace will be too short to be correctly matched; if it is too long, we may miss some short tracks. As the window moves, TAR performs the matching in real time, which resolves DTW's offline issue. The matching is triggered whenever the current time window is updated. Although we obtain a confidence score per window, connecting the matching windows for a specific visual track remains an issue. For example, a visual track vi may have higher confidence matching BLE ID-1 in window 1 but BLE ID-2 in window 2. To deal with this, TAR uses a cumulative confidence score to connect the windows of a visual track: it accumulates the confidence scores over consecutive windows of a visual trace and uses the accumulated score as the current confidence score for ID assignment.
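The moving-window matching loop can be sketched as follows, reusing confidence_score() from the earlier sketch. The window() helper and the trace layout (dicts of (timestamp, value) lists keyed by track or BLE ID) are simplifying assumptions.

from collections import defaultdict

WINDOW_SEC = 3.0   # sliding-window length from the text
STEP_SEC = 1.0     # window update interval from the text

def window(trace, t_end, length=WINDOW_SEC):
    # Keep only the values whose timestamps fall within the last `length` seconds.
    return [v for (t, v) in trace if t_end - length <= t <= t_end]

def stepwise_match(visual_tracks, ble_traces, t_now, cumulative=None):
    # cumulative[(track_id, ble_id)] holds the running confidence score.
    cumulative = cumulative if cumulative is not None else defaultdict(float)
    for track_id, vtrace in visual_tracks.items():
        v_win = window(vtrace, t_now)
        if not v_win:
            continue
        for ble_id, btrace in ble_traces.items():
            b_win = window(btrace, t_now)
            if b_win:
                cumulative[(track_id, ble_id)] += confidence_score(v_win, b_win)
    return cumulative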

ID        1      2      ...   n
Track-a   p_a1   p_a2   ...   p_an
Track-b   p_b1   p_b2   ...   p_bn
Track-c   p_c1   p_c2   ...   p_cn

Table 1—ID-matching matrix

We use Figure 4 as an example to demonstrate the algorithm. In this case, a customer's moving trace is shown at the top of Figure 4(a). Due to aisle occlusion and pose change, our vision tracking algorithm obtains two visual tracks for him. Figure 4(b) shows the sensed BLE proximity during this period. TAR tries to match the visually tracked trace to one of those BLE proximity traces. Figure 4(c) shows the DTW calculation for a visual track and a BLE proximity trace, where the path goes almost diagonally. To illustrate the confidence score calculation, Figure 4(d) shows the computation process for this example: the x-axis shows time, the left y-axis shows the DTW score (solid lines) for each moving window, and the right y-axis shows the cumulative confidence score (dotted lines). We can see that BLE trace 2 has a better confidence score at the beginning, but falls behind the correct BLE trace 1 after four seconds.

3.5.2 Identity Assignment. To identify the user, TAR needs to match the BLE proximity trace to the correct visual track. Ideally, for each trace, the best cumulative confidence score decides the correct matching. However, there are two problems. First, as stated earlier, BLE proximity estimation is not accurate enough to differentiate some users; in practice, we sometimes see two BLE proximity traces that are too similar to assign either to a user confidently. Second, visual tracks break easily in challenging scenarios, which often results in short tracks. For example, the visual track of the user in Figure 4(a) breaks in the middle, leading to two separate track traces. Although deep-feature similarity can help in some scenarios, it fails when the view angle or body pose changes. As TAR intends to learn the user's interests, it needs a way to connect these intermittent visual track traces.
ID Assignment. To tackle the first challenge, TAR proposes a global ID assignment algorithm based on linear assignment [61]. TAR computes the confidence score for every track-BLE pair. At any time, for one camera, all visible tracks and their candidate BLE IDs form a matrix called the ID-matching matrix, where row i stands for track i and column j for BLE ID j. The element (i, j) of the matrix is Prob(BLE_ij). Table 1 shows the matrix structure. Note that each


candidate ID only belongs to some of the tracks, so its matching probability with the other tracks is zero.

When the matrix is ready, TAR assigns one BLE ID to the track in each row. The goal of the assignment is to maximize the total sum of confidence scores. We use the Hungarian algorithm [63] to solve the assignment problem in polynomial time. The assigned ID is treated as the track's identity in the current time slot. As visual tracks and BLE proximity traces change with the time window (Sec. 3.5.1), TAR updates the assignment with the updated matrix accordingly. If a track is not updated in the current window, it is temporarily removed from the matrix along with its candidates. When a track stops updating for a long time (> 20 sec), the system treats the track as "terminated" and archives the last BLE ID assigned to it.
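A sketch of the global assignment over the ID-matching matrix of Table 1, using SciPy's Hungarian solver (an implementation choice assumed here, not stated in the paper):

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_ids(score_matrix, track_ids, ble_ids):
    # Rows are visual tracks, columns are candidate BLE IDs. The Hungarian
    # solver minimizes cost, so negate the scores to maximize their sum.
    cost = -np.asarray(score_matrix, dtype=float)
    rows, cols = linear_sum_assignment(cost)
    return {track_ids[r]: ble_ids[c]
            for r, c in zip(rows, cols)
            if score_matrix[r][c] > 0}   # skip impossible (zero-score) pairs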

3.5.3 Visual Track Sewing. The identity matching process is still insufficient for identity tracking in practice: vision-based tracking is vulnerable enough that one person's visual track may break into multiple segments. For example, after a long period of occlusion, one person's trajectory in a camera may be split into several short tracks (see Figure 4(a)). Another case is when the customer appears in the camera for only a very short time (enters the view and quickly leaves). These short traces make the ID assignment result ambiguous, as the physical distance pattern can be similar to many BLE proximity traces over such a short period.

TAR proposes a two-way strategy to handle this. First, TAR tries to recognize "ambiguous" visual tracks in real time. In our design, a track is considered "ambiguous" when it meets either of two rules: 1) its duration has not reached three seconds; or 2) the distinction among its candidate BLE IDs' confidence scores is vague. Explicitly, two candidates are considered similar when the rank-2 score is ≥ 80% of the rank-1 score.

When there is an ambiguous track in the assignment, TAR first considers whether the track belongs to an inactive track due to occlusion. To verify this, TAR searches the inactive local tracks (not matched in the current window but active within the last 20 seconds) and checks whether their assigned IDs are also top-ranked candidate IDs of the ambiguous track. If TAR cannot find such an inactive track, the current track has no connection with previous tracks, so it is treated as a regular track to be identified through the ID assignment process.

When a qualified inactive track is found, TAR checks whether the two tracks have a spatial conflict, meaning the two temporally-neighboring segments are located too far from each other. For example, with the same assigned BLE ID, one track v1 ends at position P1 and the next track v2 starts at position P2. Suppose the gap time between the two tracks is t, and the average moving speed of v1 is v. In TAR, v1 and v2 have a spatial conflict if |P1 − P2| > 5v·t. The intuition is that a person cannot travel too fast from one place to another.
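The spatial-conflict test maps directly to a few lines of code; a sketch under the stated |P1 − P2| > 5v·t rule:

import math

def spatial_conflict(end_pos_1, start_pos_2, gap_time, avg_speed_1):
    # Two temporally-adjacent tracks conflict if the gap distance exceeds 5*v*t.
    gap_dist = math.hypot(start_pos_2[0] - end_pos_1[0],
                          start_pos_2[1] - end_pos_1[1])
    return gap_dist > 5.0 * avg_speed_1 * gap_time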

If the conflict check passes, TAR connects the inactive track with the ambiguous track. The trace during the gap time between the two tracks is automatically filled in using the average speed: the system assumes the person moves from the first track's endpoint to the second track's starting point at constant velocity during the occlusion. The combined track then replaces the ambiguous track in the assignment matrix. After linear assignment, TAR checks whether

the combined track receives the same ID previously assigned to the inactive track. If yes, the track combination is successful and the ambiguous track is an extension of the inactive track. Otherwise, TAR tries to combine the ambiguous track with other qualified inactive tracks until an ID assignment succeeds. If no combination wins, the ambiguous track is treated as a regular track in the ID assignment process.

3.5.4 Multi-camera Calibration. One problem with the matching process discussed above for a single camera is that the confidence score can be inaccurate when tracks are short, due to the limited amount of visual track data and the large set of candidate BLE IDs. For each visual track, we should try to minimize the number of candidate BLE IDs: more candidates not only increase the processing time but also decrease the ID assignment accuracy. Therefore, TAR proposes Cross-camera ID Selection (CIS) to prepare the list of valid BLE IDs for each camera.

The task of CIS is to determine which BLE IDs are currently visible in each camera. First, we observe that 15 meters is usually the maximum distance from the camera to a detectable device, so TAR ignores beacons with BLE proximity larger than 15 meters. However, the 15-meter scope can still cover more than 20 IDs in real scenarios: the BLE receiver senses devices in all directions while the camera has a fixed view angle, so some non-line-of-sight beacon IDs can pass the proximity filter. For example, consider two cameras mounted on the two sides of a shelf (which is common in real shops). They sense very similar BLE proximity to nearby customers, while a customer can only appear in one of them.

To solve this problem, TAR leverages the camera positions and the shop's floorplan to abstract the camera connectivity into an undirected graph. In the graph, a vertex represents a camera, and an edge means customers can travel from one camera to another. Figure 5(a) shows a sample topology where four cameras cover all possible paths within the area. A customer ID must be sensed hop-by-hop. With this knowledge, TAR filters ID candidates with the following rules: 1) At any time, the same person's track cannot show up in different cameras if the cameras do not have overlapping views; in this case, if an ID is already associated with a track in one camera with high confidence, it cannot be used as a candidate in other cameras (Figure 5(b)). 2) A customer's graph trajectory cannot "skip" a node. For example, an unknown customer sensed by cam-2 must have shown up in cam-1 or cam-3, because cam-2 lies in the middle of the path from cam-1 to cam-3 and there is no other available path (Figure 5(c)). 3) The travel time between two neighboring cameras cannot be too short; we set the lower bound of travel time to 1 second (Figure 5(d)).

CIS runs as a separate thread on the TAR server. In every moving window, it collects all cameras' BLE proximity traces and visual tracks. CIS checks each BLE ID in each camera's candidate list and removes the ID if it violates any of the rules above. The filtered ID list is sent back to each camera module for ID assignment.
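A simplified sketch of the CIS filtering loop is shown below. The topology map, assignments, and last_seen bookkeeping are hypothetical data structures; the 15 m and 1 s constants come from the rules above.

MAX_PROXIMITY_M = 15.0   # rule constant from the text
MIN_TRAVEL_SEC = 1.0     # rule constant from the text

def filter_candidates(camera, candidates, topology, assignments, last_seen, now):
    # candidates:  list of (ble_id, proximity_in_meters) sensed at `camera`
    # topology:    camera -> set of neighboring cameras (undirected graph)
    # assignments: ble_id -> camera currently holding it with high confidence
    # last_seen:   ble_id -> (camera, timestamp) of the most recent sighting
    valid = []
    for ble_id, proximity in candidates:
        if proximity > MAX_PROXIMITY_M:
            continue                                    # too far to be in view
        held_by = assignments.get(ble_id)
        if held_by is not None and held_by != camera:
            continue                                    # rule 1: held elsewhere
        seen = last_seen.get(ble_id)
        if seen is not None and seen[0] != camera:
            prev_cam, t_prev = seen
            if camera not in topology[prev_cam]:
                continue                                # rule 2: cannot skip a node
            if now - t_prev < MIN_TRAVEL_SEC:
                continue                                # rule 3: travel too fast
        valid.append(ble_id)
    return valid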

4 EVALUATION
In this section, we describe how TAR works in real scenarios and evaluate each of its components.


Figure 5—(a) Camera Topology; (b) One ID cannot show in two cameras; (c) BLE ID must be sensed sequentially in the network; (d) It takes time to travel between cameras

4.1 Methodology and Metrics
TAR Implementation. Our implementation of TAR contains three parts: BLE broadcasting and sensing, live video detection and tracking, and identity matching.

BLE broadcasting is designed as a mobile library supporting both iOS and Android. TAR implements the broadcasting library with the CoreBluetooth framework on iOS [7] and AltBeacon [16] on Android. The BLE RSS sensing module sits in the backend. In our experiments, we use a Nexus 6 and an iPhone 6 for BLE signal receiving. The Bluetooth data is pushed from the devices to the server through a TCP socket. In TAR, we set both the broadcasting and sensing frequency to 10 Hz.

The visual tracking module (VT) consists of a DNN-based people detector and a DNN feature extractor. One VT processes the video from one camera. TAR uses the TensorFlow version of Faster-RCNN [48, 71] as the people detector and our modified GoogLeNet [31] as the deep feature extractor. We train the Faster-RCNN model with the VOC image dataset [52] and train the GoogLeNet with two pedestrian datasets: Market-1501 [94] and DukeMTMC [96]. The detector returns a list of bounding boxes (bbox), which are fed to the feature extractor. The extractor outputs a 512-dimensional feature vector for each bbox. We choose FastDTW [76] as the DTW algorithm; its code can be downloaded from [3].

Since each VT needs to run two DNNs simultaneously, we cannot support multiple VTs on a single GPU. To ensure performance, we dedicate one GPU to each VT instance in TAR, leaving further scalability optimization to future work. The tracking algorithm and identity matching algorithm are implemented in Python and C++. To ensure real-time processing, all modules run in parallel through multi-threading.

Our server is equipped with an Intel Xeon E5-2610 v2 CPU and an Nvidia Titan Xp GPU. At runtime, TAR occupies 5.3 GB of GPU memory and processes video at around 11 FPS. Running two VT instances on one GPU does not overflow the memory but reduces the FPS by around half.

As cross-camera tracking and identification require collaboration among different cameras, TAR shares data by designating one machine as the master server and running a Redis cache on it. Each VT machine accesses the cache to upload its local BLE proximity and tracking data. The server runs cross-camera ID selection with the cached data and writes the filtered ID list to each VT's Redis space.
TAR Experiment Setup. We evaluate TAR's performance by deploying the system in two different environments: an office building

Figure 6—Experiment Deployment Layout: (a) Office; (b) Retail store.

Figure 7—Same person's figures under different camera views (office).

(Figure 6(a)) and a retail store (Figure 6(b)). We use Reolink IP cameras (RLC-410S) in our setup. The test area for the office deployment is 50m × 30m with an average path width of 3.5m, while the retail store is 20m × 30m.

We deploy six cameras in the office building, as shown in the layout, and three cameras in the retail store. All cameras are mounted at about 3m height, pointing 20°-30° down toward the walkway. There are 20 participants in the test: 12 in the office deployment and 8 in the retail store deployment. Besides the recruited volunteers, TAR also records other pedestrians, capturing up to 27 people simultaneously across the cameras. Each participant has TAR installed on their device and walks around freely based on their interest. To quantify TAR's performance, we record all trace data in both deployment scenarios for later comparison. We collected around one hour of data for each deployment, including 30 GB of video data and 10 MB of BLE RSS logs. Figure 7 shows the same person's appearance in different cameras. Some snapshots are dark and blurry, which makes it hard to identify people with a vision-only approach.

For cross-camera tracking and identification, we mainly use the IDF1 score [73], a standard metric for evaluating the performance of multi-camera tracking systems. IDF1 is the ratio of correctly identified detections over the average number of ground-truth detections, i.e., (correctly identified people in all frames) / (all identified people in all frames). For example, suppose one camera records three people A, B, and C, and an algorithm returns two traces: one on A with ID=A, and another on C with ID=B. In this case, only one person is correctly identified, so IDF1 = 33%.

4.2 TAR Runtime
Before discussing our trace-based evaluation, we show the benefits of TAR's matching algorithm and optimizations at runtime.


Figure 8—Screenshots for cross-camera calibration (Camera 2 and Camera 4).

We first show TAR's ID assignment process (video: https://vimeo.com/246368580). In the beginning, with only detection and bbox tracking, we cannot tell the user's identity. We then consider the user movement estimated from the visual track together with the BLE proximity traces and apply our stepwise matching algorithm. After that, we use our ID assignment algorithm to report the user's likely identity. Although the reported identity is incorrect at first, the true identity emerges as the time window advances, which demonstrates the effectiveness of our identity matching algorithm.

We also demonstrate how TAR's track sewing works at runtime (video: https://vimeo.com/246388147). As the first part of the video shows, when a visual track breaks, the user may not be correctly identified after the break. With our track sewing algorithm, the user's tracks are correctly recognized much faster. TAR's track sewing algorithm therefore benefits such scenarios.

Figure 8 shows screenshots from two cameras in the office setting at different times. In this trace, one user (orange bbox) walks from camera 2 to camera 4 while around 7 BLE IDs are sensed. When the user enters camera 4, TAR uses the temporal-spatial relationship and the deep feature distance to filter out unqualified BLE IDs, and then assigns the highest-ranked identity to the user. As camera 4's screenshot shows, the user is correctly identified.
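The following sketch gives a rough picture of this filter-then-rank step; the reachability test, the helper functions (travel_time, feature_distance), and the data layout are illustrative assumptions rather than TAR's exact algorithm.

def filter_and_rank(candidates, now, entry_camera, travel_time, feature_distance):
    """candidates: dict mapping BLE ID -> {'last_cam': ..., 'last_seen': ...}.
    travel_time(a, b): plausible walking time from camera a to camera b.
    feature_distance(ble_id): deep-feature distance between the new track and
    the track previously associated with this BLE ID."""
    qualified = []
    for ble_id, info in candidates.items():
        elapsed = now - info["last_seen"]
        # Temporal-spatial check: the person must plausibly have had time to
        # walk from where the ID was last seen to the entry camera.
        if elapsed >= travel_time(info["last_cam"], entry_camera):
            qualified.append(ble_id)
    # Rank the surviving IDs by appearance similarity; the top-ranked identity
    # is assigned to the newly appeared track.
    return sorted(qualified, key=feature_distance)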

4.3 TAR Performance

4.3.1 Comparing with Existing Multi-Camera Tracking Strategies. Figure 9(a) shows the accuracy of TAR; the y-axis represents IDF1 accuracy. As a comparison, we also evaluate the IDF1 of existing state-of-the-art algorithms from the vision community:

(1) MCT+ReID: We use DeepCC [75], an open-sourced algorithm that reaches top accuracy in the MOT Multi-Camera Tracking Challenge [8]. The solution uses DNN-generated visual features for people re-identification (ReID) and uses single-camera tracking and cross-camera association algorithms for identity tracking. The single-camera part of DeepCC runs a multi-cut algorithm over detections in recent frames and computes the best traces to minimize the assignment cost. For cross-camera identification, it considers not only visual feature similarity but also the estimated movement trajectory of each person in the camera topology when associating two tracks, which is similar in spirit to TAR's cross-camera ID selection.

(2) MCT-Only: We also test MCMT [73], the predecessor of DeepCC [75], which shares similar tracking logic with DeepCC (both single-camera and multi-camera) but does not use a DNN for people re-identification.

(3) ReID-Only: We directly run DeepCC's DNN to extract each person's visual feature in each frame and classify each person as one of the registered users. This shows the accuracy of tracking with re-identification only.
Analysis: TAR outperforms the best existing offline algorithm (MCT+ReID) by 20%. We therefore analyze the failures in both TAR and MCT+ReID to understand why TAR achieves much higher accuracy. There are two types of failures: erroneous single-camera tracking and wrong re-identification. Note that re-identification corresponds to BLE-vision matching in TAR's case.

As Figure 9(b) shows, the two failure types contribute similarly in TAR. In the vision-only scenario, most errors come from the re-identification process. We further break down the re-identification failures for MCT+ReID into three types: (1) multi-camera error: a person is consistently recognized as someone else across cameras after his first appearance; (2) single-camera error: a customer is falsely identified in one camera; (3) part-of-track error: a person is wrongly recognized for part of her track in one camera. Figure 9(b) shows that more than half of the ReID problems are of the cross-camera type, which is due to the MCT module optimizing identity assignment across cameras: once a person is assigned an ID, she is more likely to receive the same ID in subsequent traces.

The root cause of vision-based identification failures is the imperfect visual feature, which cannot accurately distinguish one person from another in some scenarios. From our observations, there are three cases in which the feature extractor easily fails: (1) blurry images; (2) partial occlusion; (3) similar appearance. Figure 9(c) demonstrates each failure case, where two people are recognized as the same customer by TAR, and shows each case's share of all failures in the test results. Blurry and low-contrast images cause nearly half of the errors, and the other two types account for about 40% of the failed cases.

4.3.2 Importance of Different Components in TAR. Next, we analyze each component used in TAR.
People Detection. The people detector may fail in two ways: false positives, which recognize a non-person object as a person, and false negatives, which fail to recognize a real person. False positives can be filtered out in TAR's vision-BLE matching process. For false negatives, people whose bodies are occluded by more than 80% are rarely detected by the detection model; such cases are handled by TAR's tracking algorithm and track sewing mechanism, which we also evaluate. We evaluate current state-of-the-art open-sourced people detectors on our dataset, with results shown in Figure 11. Besides Faster-RCNN (used by TAR), we also test Mask-RCNN [56], YOLO-9000 [70], and OpenPose [43]. YOLO and OpenPose have lower accuracy although they are fast, whereas Mask-RCNN is very accurate but too slow to meet TAR's requirement.


Figure 9—(a) Multi-cam tracking comparison against state-of-the-art solutions (*offline solution); (b) Error statistics of TAR and MCT+ReID; (c) Error statistics of re-identification in MCT+ReID and example images. [IDF1 (%), Office/Shop: TAR 91.7/88.2; MCT+ReID 70.3/66.9; MCT-Only 53.5/45.3; ReID-Only 58.0/53.1. Error breakdown: TAR — Tracking 53.4%, ReID 46.6%; MCT+ReID — ReID (Multi) 43.8%, ReID (Single) 21.1%, Tracking 19.6%, ReID (Part) 15.4%. ReID failure causes: Blurry/Low Contrast 45.5%, Partial Occlusion 23.1%, Similar Figure 16.4%, Other 15.0%.]

Figure 10—Importance of Tracking Components in TAR (*offline solution). [IDF1 (%), Office/Shop: TAR 91.7/88.2; TAR w/ DeepSORT 80.6/75.5; TAR w/ LMP 92.5/90.3.]

Figure 11—Recall, precision, and FPS of state-of-the-art people detectors (Mask-RCNN, Faster-RCNN, OpenPose, YOLO-9000).

Trace matching. DTW plays the key role in matching BLE traces to vision traces, so we need to understand its effectiveness in TAR's scenario. In the experiment, we compute the similarity between one person's walking trace and all nearby BLE traces and select the one with the highest similarity. The association succeeds if the ground-truth trace is matched; otherwise it fails. We count the number of correct matches across the whole dataset and compute the successful linking ratio. Besides DTW, we also test other metrics, including Euclidean distance, cosine distance, the Pearson correlation coefficient [9], and Spearman's rank correlation [10]. The average matching ratio of each method is shown in Table 2; DTW achieves the best accuracy.
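A minimal sketch of this association experiment is shown below, assuming each trace is a 1-D sequence of proximity values; the helper names and data layout are illustrative, not the evaluation code itself.

from fastdtw import fastdtw

def associate(visual_trace, ble_traces):
    """ble_traces: dict mapping BLE ID -> proximity sequence.
    Returns the BLE ID whose trace has the smallest DTW distance."""
    scores = {ble_id: fastdtw(visual_trace, trace, dist=lambda a, b: abs(a - b))[0]
              for ble_id, trace in ble_traces.items()}
    return min(scores, key=scores.get)

def linking_ratio(samples):
    """samples: list of (visual_trace, ble_traces, ground_truth_id) tuples."""
    correct = sum(associate(v, b) == gt for v, b, gt in samples)
    return correct / len(samples)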

Similarity Metric       Accuracy (%)
DTW (used in TAR)       95.7
Euclidean Distance      88.0
Cosine Distance         84.9
Pearson Correlation     66.4
Spearman Correlation    72.5

Table 2—Accuracy (ratio of correct matches) of different trace similarity metrics.

Visual Tracking. Visual tracking is crucial for estimating visual traces. Since TAR builds its visual-tracking algorithm on DeepSORT [88], we want to measure TAR's improvement over existing state-of-the-art tracking algorithms. To this end, we replace our visual tracking algorithm with DeepSORT and with LMP [85], which achieves the best tracking accuracy in the MOT16 challenge. Like DeepSORT, LMP uses a DNN for people re-identification, but it works offline, so it can leverage a posteriori knowledge of people's movement and use a lifted multi-cut algorithm to assign traces globally.

We report the IDF1 of each choice in Figure 10. The first group of bars shows that TAR's visual tracking algorithm clearly outperforms DeepSORT by 10%, because TAR's algorithm applies several optimizations such as kinematic verification and thus reduces ID switches. Moreover, TAR performs similarly to the variant that uses LMP as the tracker, which shows that our online tracking method is comparable to the current state-of-the-art offline solution. LMP itself is not feasible for TAR since it works offline and slowly (0.5 FPS), while our usage scenario requires real-time processing.

We next compare the performance of the following modules by removing each of them from TAR and showing the change in system accuracy in Figure 12.
ID Assignment. An alternative to our ID assignment algorithm is to always choose the most confident (top-1) candidate for each track. We therefore compare our ID assignment to the top-1 scheme in the second group of Figure 12; the top-1 scheme is almost 20% worse than TAR. The reason is that top-1 assignment often suffers from conflict errors, where different visual tracks are assigned to the same ID. TAR, on the other hand, enforces one-to-one matching, which reduces such conflicts.
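The toy example below contrasts greedy top-1 assignment with one-to-one matching solved as an assignment problem (here via SciPy); the cost matrix is illustrative and not TAR's actual matching cost.

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: matching cost between visual track i and BLE identity j.
cost = np.array([[0.2, 0.3, 0.9],
                 [0.1, 0.4, 0.8],
                 [0.7, 0.6, 0.5]])

# Greedy top-1: each track independently picks its cheapest identity,
# so tracks 0 and 1 collide on identity 0.
greedy = cost.argmin(axis=1)              # -> array([0, 0, 2])

# One-to-one matching: every identity is used at most once, removing conflicts.
rows, cols = linear_sum_assignment(cost)  # -> tracks [0, 1, 2] get IDs [1, 0, 2]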


Figure 12—Importance of Identification Components in TAR

Track Sewing. If we remove the track sewing optimization, a person's fragmented tracks take much longer to be recognized, and some of them may be matched to wrong BLE IDs. Figure 12's third group confirms this: removing track sewing drops the accuracy by nearly 25% in the retail store dataset, which has frequent occlusion. In the evaluation, we find that the average number of distinct tracks of the same person is 1.8, and the maximum is 5.
BLE Proximity. Incorporating BLE proximity is fundamental to tracking and identifying users. To quantify its effectiveness, we calculate the accuracy with the BLE matching components removed, so that TAR relies only on cross-camera association and deep visual features to identify and track each user. Figure 12's fourth group shows that the accuracy drops by up to 35% without BLE's help.
Deep Feature. The deep feature is one of the core improvements in the visual tracking algorithm. Figure 12's fifth group shows that the accuracy drops by nearly 30%, because removing the deep feature causes high-frequency ID switches in tracking; such errors are hard to compensate for even with our other optimizations.
Cross Camera Calibration. Our cross-camera calibration combines the temporal-spatial relationship and the deep feature similarity across cameras. To understand its impact, we remove the component and evaluate TAR on the same dataset. Figure 12's rightmost group shows a 10% accuracy drop. Without cross-camera calibration, the matching algorithm struggles to differentiate BLE proximity traces, which in some cases exhibit similar patterns as people move around. For example, in the retail scenario, TAR tries to recognize one user seen in camera 1 who is leaving the store, while another user is also moving out, in a different direction, seen in camera 2. In this case, their BLE proximity traces are hard to distinguish with camera 1's information alone.

4.3.3 Robustness. Robustness is essential for any surveillance or tracking system because parts of the system may fail, e.g., one or more cameras or BLE receivers stop working. This can happen for many reasons, such as power outages, camera damage, or bad lighting conditions. How do such failures affect the overall performance? We focus on the system accuracy under node failures. Note that either a BLE failure or a video failure causes a node failure, because TAR needs both kinds of information for customer tracking. We therefore remove affected nodes randomly from TAR's network to simulate runtime failures. Figure 13 shows how performance degrades with the fraction of failed nodes: TAR still maintains more than 80% accuracy with half of the nodes down.

Figure 13—Accuracy of TAR with different ratios of node failures (purple lines show the measured error).

Figure 14—Relationship between the tracking accuracy and the number of concurrently tracked people.

The system is robust because each healthy node can identify and track customers by itself; the only loss from a failed node is the cross-camera part, which uses the temporal-spatial relationship to filter out invalid BLE IDs.

We also evaluate the relationship between the number of concurrently tracked people and the tracking accuracy (Figure 14). As the results show, TAR's accuracy drops as more people are tracked and stabilizes around 85% with 20 or more people. This is because no "new" trace patterns appear once all possible paths in each camera view are fully occupied; adding more people therefore does not introduce additional uncertainty into trace matching.


5 RELATED WORK
Mobile Coupon Deliveries. Sending location-aware ads and coupons to mobile devices has become a critical strategy for retailers [1]. Many startups [20, 29, 32, 36, 37] have been working on improving the shopping experience with mobile coupons. Among them, Urban Airship [37] is closest to TAR: it uses "point-to-radius" coordinates or indoor locations based on Wi-Fi triangulation to locate nearby users. However, Wi-Fi and other indoor localization methods require new infrastructure and cannot guarantee accuracy because they rely on proximity only. In contrast, TAR requires little modification to existing infrastructure while providing accurate relative user positions.
Indoor Localization. There is a large body of work on indoor localization. Researchers have used various devices and environmental landmarks to improve indoor localization accuracy [58, 62, 65, 91, 97], but many of these designs require additional infrastructure to achieve reasonable accuracy. Relying on indoor localization to learn user interest and send ads has two significant problems. First, most indoor localization requires a dedicated set of infrastructure. Second, indoor localization aims to help users who actively use their device find their own location. Moreover, current indoor localization schemes are vulnerable to complex indoor environments. Unlike these mechanisms, TAR does not require infrastructure beyond existing surveillance cameras. It works passively: users do not need to use their devices actively, and are only notified once TAR determines their shopping interests.
Tracking Technologies. Many technologies are available for people tracking. Stereo video systems [22, 26, 38, 80] use camera pairs to sense 3D information about the surroundings, but the equipment is usually costly and hard to deploy. Thermal sensors [28] detect the presence and position of people, but their tracking accuracy is also affected by occlusion, which makes it hard to count people. Laser and structured-light systems [19] accurately infer people's shapes (usually at the centimeter level), making them the most accurate solution for people counting; however, their short scanning range prevents continuous tracking, so supporting solutions such as cameras are still needed. Euclid Analytics [51] and Cisco Meraki [14] rely on Wi-Fi MAC addresses to track customers entering and exiting stores, but this technology requires customers to enable Wi-Fi and suffers from poor location accuracy. Swirl [84] and InMarket [59] use BLE beacons to count customers, but the proximity-based approach falls far short of the accuracy required to track shoppers. Jiang and Yin [60] combine vision tracking with dead reckoning, using the smartphone IMU (inertial measurement unit) to estimate the user's walking speed and direction for better tracking accuracy; however, the approach works offline and only on single-camera tracking. Different from the above approaches, TAR combines vision and BLE proximity not only to track shoppers at scale but also to identify them.
Vision-Based Tracking. Recent advances in object detection such as Faster-RCNN [71], YOLO [70], and Mask-RCNN [56] have enabled accurate online detection, and tracking-by-detection has emerged as the leading strategy for multiple object tracking. Prior works use global optimization that processes entire

video batches to find object trajectories; network flow formulations [50, 77, 78], probabilistic graphical models [45, 57, 85], and large temporal windows (e.g., [74, 83, 90]) are three popular frameworks. However, due to their batch-processing nature, they are not applicable to real-time processing, where no future information is available.

Recent online tracking algorithms [39, 53, 67, 88] track multiple people by matching targets in the current frame to those in the previous frame using DNN-generated visual features. These algorithms work well for high-quality video, where deep features are distinguishable; for low-light video, the visual features become hard to distinguish and performance degrades significantly. Among these approaches, we chose to build our tracking algorithm on DeepSORT [88] because it reaches top accuracy and runs fast (>15 FPS), which is crucial for TAR's scenario. Different from DeepSORT, TAR applies several optimizations described in Sec. 3.3 to increase robustness against detection false negatives and occlusions.
Multi-Camera Tracking. Some algorithms [41, 68, 82, 89] perform multi-camera object tracking given the positions, orientations, and intrinsic parameters of all cameras, and they also require the camera views to overlap. For most shops, the scene-overlapping condition is not satisfied. In contrast, TAR has no such requirements: it uses various context information as well as the Bluetooth signal to re-identify objects across cameras. Other multi-camera works target non-overlapping cases. Such systems [46, 75, 79, 86] leverage the visual and spatial-temporal similarity between tracks from different camera views to find the best global matching with minimum cost. However, they need global trajectories for the best tracking accuracy, which is infeasible for online tracking. Moreover, their accuracy relies entirely on accurate individual tracking information, i.e., people trajectories, and is therefore affected by unreliable trackers, which are common under dense occlusion and in crowded scenes.

6 CONCLUSION
We have presented TAR, a system that utilizes existing surveillance cameras and ubiquitous BLE signals to precisely identify and track shoppers and enable targeted advertising in retail stores. In TAR, we first designed a single-camera tracking algorithm that accurately tracks people and then extended it to the multi-camera scenario to recognize people across distributed cameras. TAR leverages BLE proximity information, cross-camera movement patterns, and the single-camera tracking algorithm to achieve highly accurate multi-camera, multi-people tracking and identification. We have implemented and deployed TAR in two realistic settings and conducted extensive experiments with more than 20 people. Our evaluation demonstrates that TAR delivers high accuracy (around 90%) and serves as a practical solution for people tracking and identification.

Acknowledgements. We thank our shepherd Matthai Philipose andthe anonymous reviewers for their valuable feedback that improvedthe paper’s quality.

REFERENCES
[1] 3 Ways to Drive In-store Sales With Mobile. https://www.mobify.com/insights/3-ways-drive-store-sales-mobile/.


[2] Facebook Location Targeting. https://www.facebook.com/business/a/location-targeting.
[3] FastDTW. https://pypi.python.org/pypi/fastdtw.
[4] How Beacons Will Influence Billions in US Retail Sales. http://www.businessinsider.com/beacons-impact-billions-in-reail-sales-2015-2.
[5] How Nordstrom Uses Wifi To Spy On Shoppers. https://www.forbes.com/sites/petercohan/2013/05/09/how-nordstrom-and-home-depot-use-wifi-to-spy-on-shoppers.
[6] How Retail Stores Track You Using Your Smartphone. https://lifehacker.com/how-retail-stores-track-you-using-your-smartphone-and-827512308.
[7] iOS Core Bluetooth. https://developer.apple.com/documentation/corebluetooth.
[8] MTMCT on MOT Challenge. https://motchallenge.net/data/DukeMTMCT/.
[9] Pearson Correlation Coefficient. https://en.wikipedia.org/wiki/Pearson_correlation_coefficient.
[10] Spearman's Rank Correlation Coefficient. https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient.
[11] Standard Cognition. https://www.standardcognition.com/.
[12] What's Mobile's Influence In-Store. https://www.marketingcharts.com/industries/retail-and-e-commerce-65972.
[13] Twitter Mobile Ads. https://business.twitter.com/en/advertising/mobile-ads-companion.html, 2013.
[14] Cisco Meraki. https://meraki.cisco.com/, 2017.
[15] Accenture. https://www.accenture.com, 2017.
[16] AltBeacon. http://altbeacon.org/, 2017.
[17] Amazon Go. http://amazongo.com/, 2017.
[18] Apple iBeacon. https://developer.apple.com/ibeacon/, 2017.
[19] Bea Inc. https://www.beainc.com/en/technologies/, 2017.
[20] Best Advisor. https://www.bestadvisor.com/, 2017.
[21] Bluetooth LE: Broadcast. https://www.bluetooth.com/what-is-bluetooth-technology/how-it-works/le-broadcast, 2017.
[22] Brickstream. http://www.brickstream.com/, 2017.
[23] Deloitte. https://www2.deloitte.com, 2017.
[24] Eddystone Beacon. https://developers.google.com/beacons/, 2017.
[25] Forrester. https://go.forrester.com/, 2017.
[26] Hella. http://www.hella.com/microsite-electronics/en/Sensors-94.html, 2017.
[27] How Beacons Can Reshape Retail Marketing. https://www.thinkwithgoogle.com/articles/retail-marketing-beacon-technology.html, 2017.
[28] Irisys. http://www.irisys.net/, 2017.
[29] Moasis. http://moasis.com/, 2017.
[30] Mobile Ads. https://www.technologyreview.com/s/538731/how-ads-follow-you-from-phone-to-desktop-to-tablet/, 2017.
[31] Person Re-identification. https://github.com/D-X-Y/caffe-reid, 2017.
[32] Point Inside. https://www.pointinside.com/, 2017.
[33] Projective Transformations (Homographies). http://www-prima.imag.fr/jlc/Courses/2010/ENSI3.FAI/ENSI3.FAI.S2.EN.pdf, 2017.
[34] Shopping Easier with Store App. https://corporate.target.com/article/2017/06/sean-murphy-target-app, 2017.
[35] Skyrec. http://www.skyrec.cc, 2017.
[36] Thumbvista. https://thumbvista.com/, 2017.
[37] Urban Airship. https://www.urbanairship.com/, 2017.
[38] Xovis. https://www.xovis.com/en/xovis/, 2017.
[39] BAE, S.-H., AND YOON, K.-J. Confidence-based Data Association and Discriminative Deep Appearance Learning for Robust Online Multi-object Tracking. IEEE transactions on pattern analysis and machine intelligence 40, 3 (2018), 595–610.

[40] BAI, S., BAI, X., AND TIAN, Q. Scalable Person Re-identification on Supervised Smoothed Manifold. In CVPR (2017).
[41] BERCLAZ, J., FLEURET, F., TURETKEN, E., AND FUA, P. Multiple Object Tracking using K-shortest Paths Optimization. IEEE transactions on pattern analysis and machine intelligence 33, 9 (2011), 1806–1819.
[42] BERNDT, D. J., AND CLIFFORD, J. Using Dynamic Time Warping to Find Patterns in Time Series. In KDD workshop (1994), vol. 10, Seattle, WA, pp. 359–370.
[43] CAO, Z., SIMON, T., WEI, S.-E., AND SHEIKH, Y. Realtime Multi-person 2d Pose Estimation using Part Affinity Fields. In CVPR (2017).
[44] CHEN, D., SHIN, K. G., JIANG, Y., AND KIM, K.-H. Locating and Tracking Ble Beacons with Smartphones.
[45] CHEN, J., SHENG, H., ZHANG, Y., AND XIONG, Z. Enhancing Detection Model for Multiple Hypothesis Tracking. In Conf. on Computer Vision and Pattern Recognition Workshops (2017), pp. 2143–2152.
[46] CHEN, W., CAO, L., CHEN, X., AND HUANG, K. An Equalized Global Graph Model-based Approach for Multicamera Object Tracking. IEEE Transactions on Circuits and Systems for Video Technology 27, 11 (2017), 2367–2381.
[47] CHEN, W., CHEN, X., ZHANG, J., AND HUANG, K. Beyond Triplet Loss: a Deep Quadruplet Network for Person Re-identification. In Proc. CVPR (2017).
[48] CHEN, X., AND GUPTA, A. An Implementation of Faster Rcnn with Study for Region Sampling. arXiv preprint arXiv:1702.02138 (2017).
[49] DE MAESSCHALCK, R., JOUAN-RIMBAUD, D., AND MASSART, D. L. The Mahalanobis Distance. Chemometrics and intelligent laboratory systems 50, 1 (2000), 1–18.
[50] DEHGHAN, A., AND SHAH, M. Binary Quadratic Programing for Online Tracking of Hundreds of People in Extremely Crowded Scenes. IEEE transactions on pattern analysis and machine intelligence 40, 3 (2018), 568–581.
[51] Euclid Analytics. http://euclidanalytics.com/, 2017.
[52] EVERINGHAM, M., VAN GOOL, L., WILLIAMS, C. K. I., WINN, J., AND ZISSERMAN, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[53] FAGOT-BOUQUET, L., AUDIGIER, R., DHOME, Y., AND LERASLE, F. Improving Multi-frame Data Association with Sparse Representations for Robust Near-online Multi-object Tracking. In European Conference on Computer Vision (2016), Springer, pp. 774–790.
[54] GOU, M., KARANAM, S., LIU, W., CAMPS, O., AND RADKE, R. J. Dukemtmc4reid: A Large-scale Multi-camera Person Re-identification Dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (July 2017).
[55] HAYAT, M., KHAN, S. H., WERGHI, N., AND GOECKE, R. Joint Registration and Representation Learning for Unconstrained Face Identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).
[56] HE, K., GKIOXARI, G., DOLLÁR, P., AND GIRSHICK, R. Mask R-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on (2017), IEEE, pp. 2980–2988.
[57] HENSCHEL, R., LEAL-TAIXÉ, L., CREMERS, D., AND ROSENHAHN, B. Improvements to Frank-wolfe Optimization for Multi-detector Multi-object Tracking. CoRR (2017).
[58] ILIEV, N., AND PAPROTNY, I. Review and Comparison of Spatial Localization Methods for Low-power Wireless Sensor Networks. IEEE Sensors Journal 15, 10 (2015), 5971–5987.
[59] Inmarket. https://inmarket.com/, 2017.
[60] JIANG, W., AND YIN, Z. Combining Passive Visual Cameras and Active Imu Sensors to Track Cooperative People. In 2015 18th International Conference on Information Fusion (Fusion) (2015), IEEE, pp. 1338–1345.
[61] JONKER, R., AND VOLGENANT, A. A Shortest Augmenting Path Algorithm for Dense and Sparse Linear Assignment Problems. Computing 38, 4 (1987), 325–340.
[62] KEMPKE, B., PANNUTO, P., CAMPBELL, B., AND DUTTA, P. Surepoint: Exploiting Ultra Wideband Flooding and Diversity to Provide Robust, Scalable, High-fidelity Indoor Localization. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM (2016), ACM, pp. 137–149.
[63] KUHN, H. W. The Hungarian Method for the Assignment Problem. Naval research logistics quarterly 2, 1-2 (1955), 83–97.
[64] LOWE, D. G. Distinctive Image Features From Scale-invariant Keypoints. International journal of computer vision 60, 2 (2004), 91–110.
[65] MA, Y., HUI, X., AND KAN, E. C. 3d Real-time Indoor Localization Via Broadband Nonlinear Backscatter in Passive Devices with Centimeter Precision. In Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking (2016), ACM, pp. 216–229.
[66] MASI, I., RAWLS, S., MEDIONI, G., AND NATARAJAN, P. Pose-aware Face Recognition in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
[67] MILAN, A., REZATOFIGHI, S. H., DICK, A. R., REID, I. D., AND SCHINDLER, K. Online Multi-target Tracking Using Recurrent Neural Networks. In AAAI (2017), pp. 4225–4232.
[68] NITHIN, K., AND BRÉMOND, F. Globality–locality-based Consistent Discriminant Feature Ensemble for Multicamera Tracking. IEEE Transactions on Circuits and Systems for Video Technology 27, 3 (2017), 431–440.

[69] BLE Proximity Technologies. http://community.silabs.com/t5/Official-Blog-of-Silicon-Labs/How-to-Determine-Bluetooth-BLE-Beacon-Proximity/ba-p/173638, 2017.
[70] REDMON, J., AND FARHADI, A. Yolo9000: Better, Faster, Stronger. arXiv preprint (2017).
[71] REN, S., HE, K., GIRSHICK, R., AND SUN, J. Faster R-cnn: Towards Real-time Object Detection with Region Proposal Networks. In Advances in neural information processing systems (2015), pp. 91–99.
[72] Retail Next. https://retailnext.net/en/home/, 2017.
[73] RISTANI, E., SOLERA, F., ZOU, R., CUCCHIARA, R., AND TOMASI, C. Performance Measures and a Data Set for Multi-target, Multi-camera Tracking. In European Conference on Computer Vision (2016), Springer, pp. 17–35.
[74] RISTANI, E., AND TOMASI, C. Tracking Multiple People Online and in Real Time. In Asian Conference on Computer Vision (2014), Springer, pp. 444–459.
[75] RISTANI, E., AND TOMASI, C. Features for Multi-target Multi-camera Tracking and Re-identification. arXiv preprint arXiv:1803.10859 (2018).
[76] SALVADOR, S., AND CHAN, P. Toward Accurate Dynamic Time Warping in Linear Time and Space. Intelligent Data Analysis 11, 5 (2007), 561–580.
[77] SCHULTER, S., VERNAZA, P., CHOI, W., AND CHANDRAKER, M. Deep Network Flow for Multi-object Tracking. arXiv preprint arXiv:1706.08482 (2017).
[78] SHITRIT, H. B., BERCLAZ, J., FLEURET, F., AND FUA, P. Multi-commodity Network Flow for Tracking Multiple People. IEEE transactions on pattern analysis and machine intelligence 36, 8 (2014), 1614–1627.
[79] SHIVA KUMAR, K., RAMAKRISHNAN, K., AND RATHNA, G. Inter-camera Person Tracking in Non-overlapping Networks: Re-identification Protocol and On-line Update. In Proceedings of the 11th International Conference on Distributed Smart Cameras (2017), ACM, pp. 55–62.
[80] Shoppertrak. https://www.shoppertrak.com, 2017.
[81] SHU, Y., SHIN, K. G., HE, T., AND CHEN, J. Last-mile Navigation using Smartphones. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (2015), ACM, pp. 512–524.
[82] SOLERA, F., CALDERARA, S., RISTANI, E., TOMASI, C., AND CUCCHIARA, R. Tracking Social Groups Within and Across Cameras. IEEE Transactions on Circuits and Systems for Video Technology (2016).
[83] SOLERA, F., CALDERARA, S., RISTANI, E., TOMASI, C., AND CUCCHIARA, R. Tracking Social Groups Within and Across Cameras. IEEE Transactions on Circuits and Systems for Video Technology 27, 3 (2017), 441–453.
[84] Swirl. http://www.swirl.com/, 2017.
[85] TANG, S., ANDRILUKA, M., ANDRES, B., AND SCHIELE, B. Multiple People Tracking by Lifted Multicut and Person Re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 3539–3548.
[86] TESFAYE, Y. T., ZEMENE, E., PRATI, A., PELILLO, M., AND SHAH, M. Multi-target Tracking in Multiple Non-overlapping Cameras using Constrained Dominant Sets. arXiv preprint arXiv:1706.06196 (2017).
[87] TRAN, L., YIN, X., AND LIU, X. Disentangled Representation Learning Gan for Pose-invariant Face Recognition. In CVPR (2017), no. 6.
[88] WOJKE, N., BEWLEY, A., AND PAULUS, D. Simple Online and Realtime Tracking with a Deep Association Metric. arXiv preprint arXiv:1703.07402 (2017).
[89] XU, Y., LIU, X., QIN, L., AND ZHU, S.-C. Cross-view People Tracking by Scene-centered Spatio-temporal Parsing. In AAAI (2017), pp. 4299–4305.
[90] YANG, E., GWAK, J., AND JEON, M. Multi-human Tracking using Part-based Appearance Modelling and Grouping-based Tracklet Association for Visual Surveillance Applications. Multimedia Tools and Applications 76, 5 (2017), 6731–6754.
[91] YANG, Z., WANG, Z., ZHANG, J., HUANG, C., AND ZHANG, Q. Wearables Can Afford: Light-weight Indoor Positioning with Visible Light. In Proceedings of the 13th Annual International Conference on Mobile Systems, Applications, and Services (2015), ACM, pp. 317–330.
[92] ZHAO, H., TIAN, M., SUN, S., SHAO, J., YAN, J., YI, S., WANG, X., AND TANG, X. Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).


[93] ZHENG, L., BIE, Z., SUN, Y., WANG, J., SU, C., WANG, S., AND TIAN, Q. Mars: A Video Benchmark for Large-scale Person Re-identification. In European Conference on Computer Vision (2016), Springer, pp. 868–884.
[94] ZHENG, L., SHEN, L., TIAN, L., WANG, S., WANG, J., AND TIAN, Q. Scalable Person Re-identification: A Benchmark. In Computer Vision, IEEE International Conference on (2015).
[95] ZHENG, Z., ZHENG, L., AND YANG, Y. A Discriminatively Learned Cnn Embedding for Person Re-identification. arXiv preprint arXiv:1611.05666 (2016).
[96] ZHENG, Z., ZHENG, L., AND YANG, Y. Unlabeled Samples Generated by Gan Improve the Person Re-identification Baseline in Vitro. In Proceedings of the IEEE International Conference on Computer Vision (2017).
[97] ZHU, S., AND ZHANG, X. Enabling High-precision Visible Light Localization in Today's Buildings. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services (2017), ACM, pp. 96–108.