Efficiency and Reliability of the Transit Data Lifecycle A study of multimodal migration, storage, and retrieval techniques for public transit data Presented by: Matthew Ahrens Faculty Mentor: Dr. Uma Shama

Page 1:

Efficiency and Reliability of the Transit Data Lifecycle

A study of multimodal migration, storage, and retrieval techniques

for public transit data

Presented by: Matthew Ahrens

Faculty Mentor: Dr. Uma Shama

Page 2:

Overview - Background

• GeoGraphics Lab

  • Maintains public transit data for Regional Transit Authorities (RTAs) in the Commonwealth of Massachusetts.

• Services

  • Digitizing of static schedule data

  • Dynamic and real-time vehicle location data

  • Consultation and expert advice role

Page 3:

Overview - Background

• This project

  • Interdisciplinary between Mathematics and Computer Science

  • Focus on real-world / business applications of data analysis

• Time Span

  • Spring 2013

    • Exploratory analysis

  • Summer 2013 and ATP summer grant

    • Modeling experiments

  • Fall 2013

    • Implementation and integration

Page 4:

Overview - Background

• This project (cont.)

  • Evolved through several iterations

  • Original purpose: spatial analysis on ridership and vehicle location data

  • Four issues arose, changing the focus of the project over time

    • 1. Concepts were unclear among authorities

    • 2. Data collection tools were inconsistent for historical analysis purposes

    • 3. Development on systems affected core features

    • 4. Documentation for systems was in code, with no clear point of injection

Page 5:

Overview - Outline

• Four sections

  • Abstraction and modeling of transit data

  • Analysis of design patterns and algorithms with comparison to existing systems

  • The design and implementation of a context-free data model

  • The design and implementation of a multimodal, application-level interface

Page 6:

Abstraction

• Research Questions

  • How can the different transit data protocols be described to compromise between conflicting definitions and structures?

  • Is there a compromise that can be reached that is still purposeful and clear?

• Purpose

  • Comparison of three authorities

    • GTFS / GTFS-realtime

    • TCIP

    • Proprietary (various)

Page 7:

Abstraction

• GTFS Example

• Pros:

  • Descriptive, data type or storage inclusive.

  • Separates what is required for a definition from optional metadata

• Cons:

  • Written from the perspective of the transit user

  • Many definitions do not have explicit relationships
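
A minimal sketch of the points above, assuming two tiny in-memory GTFS files. The rows are invented for illustration; only the column names (`stop_id`, `stop_name`, `stop_lat`, `stop_lon`, `trip_id`, `stop_sequence`, and so on) are standard GTFS fields. It shows that the link between a trip and its stops is only implied by matching `stop_id` values, so the consumer has to perform the join:

```python
import csv
import io

# Hypothetical sample rows; only the column names come from GTFS.
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Main St,41.99,-70.97
S2,Campus Center,41.98,-70.96
"""
STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,S1,1
T1,08:10:00,08:10:00,S2,2
"""

def read_table(text):
    """Parse one GTFS text file into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

stops = {row["stop_id"]: row for row in read_table(STOPS_TXT)}
stop_times = read_table(STOP_TIMES_TXT)

# Nothing in the files declares the relationship; we join on stop_id ourselves.
trip_stops = [
    stops[st["stop_id"]]["stop_name"]
    for st in sorted(stop_times, key=lambda st: int(st["stop_sequence"]))
    if st["trip_id"] == "T1"
]
print(trip_stops)
```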

Page 8:

Abstraction

• GTFS-Realtime Example

• Pros:

  • Descriptive, data type or storage inclusive.

  • Separates what is required for a definition from optional metadata

• Cons:

  • Defined as a feed, with no distinction or limitation of rate

  • Optional fields serve no purpose for a minimum definition or structure.

Page 9:

Abstraction

• TCIP Example

• Pros:

  • Complete, covers every aspect of transit

• Cons:

  • Vague

  • Concerned with relationships between data systems

  • Specifies medium over message: requires XML/XSD format but does not clearly define data elements

Page 10:

Abstraction

• Proprietary Example - ESRI

• Pros:

  • Shows relationships between geospatial definitions

  • Standard leader for GIS protocols (GML, OpenGeo)

• Cons:

  • Concerned with GIS and use definitions over technical definitions

  • Missing most transit data concepts

Page 11:

Abstraction

• Methodology

  • Create an understandable, unambiguous definition for common transit concepts

  • Use as few primitives as possible to ease implementation

  • Use composition to aggregate data

• Two options considered

  • Define an object-method relationship

  • Define a set-theoretical model of transit data structures

Page 12:

Abstraction

• Methodology

  • Remove implementation-specific and use-specific context from transit data structures

  • Find minimum required composition

  • Acknowledge commonly attributed metadata

  • Define data by the rate of its production mechanism

Page 13:

Abstraction

• Disambiguation

  • Real-time

    • Produced frequently in real-time

    • Best represented as a signal or a message stream

  • Dynamic

    • Infrequent but unknown rate of production

    • Best represented as a feed

  • Static

    • Infrequent, known interval rate of production

    • File system or other static resource
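
The three categories above can be sketched as a small classification, assuming a hypothetical helper `classify` and two yes/no inputs that are not part of the original slides. The representation strings are taken from the bullets:

```python
from enum import Enum

class ProductionRate(Enum):
    REAL_TIME = "real-time"  # produced frequently, in real time
    DYNAMIC = "dynamic"      # infrequent, unknown rate of production
    STATIC = "static"        # infrequent, known interval of production

# Representation suggested for each category in the list above.
BEST_REPRESENTATION = {
    ProductionRate.REAL_TIME: "signal or message stream",
    ProductionRate.DYNAMIC: "feed",
    ProductionRate.STATIC: "file system or other static resource",
}

def classify(frequent: bool, interval_known: bool) -> ProductionRate:
    """Hypothetical helper: classify a source by how its data is produced."""
    if frequent:
        return ProductionRate.REAL_TIME
    return ProductionRate.STATIC if interval_known else ProductionRate.DYNAMIC

# A schedule published on a known cycle would count as static data.
print(BEST_REPRESENTATION[classify(frequent=False, interval_known=True)])
```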

Page 14:

Abstraction

• Results

  • Data flow model influenced the decision

Page 15:

Abstraction

• Results

  • Set-Theoretical Model

• Description

  • Define implementation-independent primitives

  • Compose transit data structures from those primitives

  • Define complex data structures as supersets of simple structures

Page 16:

Abstraction

• Commonly used examples

  • Primitives

    • Geolocation

    • Datetime

    • Unique, index-friendly ID (numeric, simple text)

  • Simple structure

    • Stop

    • Trip

  • Composite Structures

    • AVL

    • ETA

Page 17:

Abstraction

• Composition Example
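
The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal sketch of composing structures from primitives. The field choices for `Stop`, `Trip`, `AVL`, and `ETA` are assumptions made for illustration, not the definitions used in the study:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Geolocation:       # primitive
    lat: float
    lon: float

@dataclass(frozen=True)
class Stop:              # simple structure: ID plus a geolocation
    stop_id: str
    location: Geolocation

@dataclass(frozen=True)
class Trip:              # simple structure: ID only (assumed)
    trip_id: str

@dataclass(frozen=True)
class AVL:               # composite: which vehicle was where, and when
    vehicle_id: str
    location: Geolocation
    timestamp: datetime

@dataclass(frozen=True)
class ETA:               # composite: when a trip is expected at a stop
    stop: Stop
    trip: Trip
    arrival: datetime

stop = Stop("S1", Geolocation(41.99, -70.97))
eta = ETA(stop, Trip("T1"), datetime(2013, 11, 1, 8, 10))
print(eta.stop.location.lat, eta.arrival.isoformat())
```

Each composite holds only primitives or simpler structures, so metadata can be added later without changing the minimum definition.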

Page 18:

Data Migration

• Research Questions

  • What technologies, techniques, or models most efficiently and reliably move transit data from producer to consumer?

  • Which of those best embody the concepts of reuse and extensibility?

  • Which ones resist the need for modification and internal maintenance?

Page 19:

Data Migration

• Purpose

  • Perform exploratory work to set standards for handling data transit

  • Which of those best embody the concepts of reuse and extensibility?

  • Which ones resist the need for modification and internal maintenance?

Page 20:

Data Migration

• Methodology

  • Study of BusLocator, the current data migration technology for AVL and route-specific data

  • Duplication of the timer-event concurrency model for real-time data

  • Pull design pattern vs. push design pattern

  • Approximation algorithms

Page 21:

Data Migration

• BusLocator

  • C# Microsoft solution in two parts

  • Windows Service using timer-event concurrency

    • Pulls AVL data every 30 minutes

    • Pulls route data every 5 minutes

    • Sends via SOAP to WCF service

  • WCF

    • Web service endpoint

    • Accepts data

    • Parses and stores in SQL tables

Page 22:

Data Migration

• Graphical Depiction

Page 23:

Data Migration

• Major bottlenecks

  • Event timer

    • Problems

      • Pulls too slowly to deliver real-time produced data for real-time consumption

      • Pulls over a timeframe, so duplicates are sent over the wire

      • Does not scale or load balance

  • SOAP XML message is large and metadata heavy

    • Not optimal for real-time
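
The timer problems above can be illustrated with a small deterministic sketch. It is not BusLocator's code: the function names, the production times, and the "each pull returns the latest record" behavior are all assumptions chosen to show how a fixed-interval pull both delays data and re-sends records that have not changed:

```python
def poll_fixed_interval(production_times, interval, horizon):
    """Return (poll_time, produced_time) for each timer tick that finds data."""
    deliveries = []
    for poll_time in range(interval, horizon + 1, interval):
        available = [t for t in production_times if t <= poll_time]
        if available:
            # Assumed behavior: every pull returns the latest known record.
            deliveries.append((poll_time, max(available)))
    return deliveries

def count_duplicates(deliveries):
    """Count pulls that re-send a record an earlier pull already delivered."""
    seen = set()
    duplicates = 0
    for _, produced in deliveries:
        if produced in seen:
            duplicates += 1
        seen.add(produced)
    return duplicates

# Hypothetical production times for one vehicle, polled every 10 time units.
deliveries = poll_fixed_interval([4, 13, 35], interval=10, horizon=40)
print(deliveries)
print(count_duplicates(deliveries))
```

Here the record produced at time 13 is delivered at both the 20 and 30 ticks, and every record waits several time units before it is consumed.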

Page 24:

Data Migration

• Effort to duplicate for ETA

  • Pull from ETA feed as REST service via XML

Page 25:

Data Migration

• Effort to duplicate for ETA

  • Purposes

    • Analytical use of AVL data as static resource, not real-time

    • Made easier to organize by set-theory model

    • Able to composite ETA from other sources

    • Able to automate analysis

Page 26:

Data Migration

• Effort to duplicate for ETA

  • Problems

    • AVL not complete for historical use

      • Led to development of a clear definition of AVL and other transit data structures

    • Showed need for new system

      • Replace BusLocator

      • Define development framework for transit applications

      • Eliminate pull or approximate push design pattern

Page 27:

Data Migration

• Pull vs. Push

  • Pull design pattern

    • A.k.a. request-response, on-demand

    • Client (unknown) sends request to Server/Source (known)

    • Server processes and responds

  • Push design pattern

    • Subscription pattern

    • Client establishes connection to Server

    • Server pushes response to client upon local event
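
A minimal in-process sketch of the two patterns, with no networking involved. The `Source` class and its method names are hypothetical. `request` stands in for pull (the client asks, the server answers) and `subscribe` for push (the client registers once, the server calls back on each local event):

```python
from typing import Callable, List, Optional

class Source:
    """Hypothetical data source that supports both consumption patterns."""

    def __init__(self) -> None:
        self._latest: Optional[str] = None
        self._subscribers: List[Callable[[str], None]] = []

    def request(self) -> Optional[str]:
        """Pull: respond on demand with whatever is current."""
        return self._latest

    def subscribe(self, callback: Callable[[str], None]) -> None:
        """Push: remember the client so it can be notified later."""
        self._subscribers.append(callback)

    def produce(self, record: str) -> None:
        """Local event: store the record and push it to every subscriber."""
        self._latest = record
        for callback in self._subscribers:
            callback(record)

source = Source()
received: List[str] = []
source.subscribe(received.append)
source.produce("bus 12 at stop A")
source.produce("bus 12 at stop B")
print(received)          # every record, delivered as it was produced
print(source.request())  # only the most recent record
```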

Page 29:

Data Migration

• Pull best use cases

  • When data is not consumed as a stream

  • Need the most recent data once or on demand

• Example

Page 30:

Data Migration

• Push approximating

  • Push is appropriate for real-time produced data

• Goal

  • Minimize time between production and availability for use

• Problem

  • Push not supported by all web communication

• Solution

  • Approximate push using pull

Page 31:

Data Migration

• Approximation 1: timer-event approximation

• Goal

  • Predict the rate of production using historical data

• Method

  • Exponential Moving Average

    • Use previous history and predictions to make future predictions

    • Keep tabs of average interval between data updates

    • Take proportion of history for accuracy

    • Take proportion of predictions for smoothing
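
The method above matches the standard exponential moving average, next = alpha * observed + (1 - alpha) * previous prediction. The sketch below applies it to the interval between updates. The class name, the `alpha` value, and the sample timestamps are assumptions for illustration, and this is not the pseudocode from the presentation:

```python
from typing import Optional

class IntervalPredictor:
    """Predict the next update interval with an exponential moving average."""

    def __init__(self, alpha: float, initial_interval: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha                  # proportion taken from history
        self.prediction = initial_interval  # current predicted interval
        self._last_update: Optional[float] = None

    def observe(self, update_time: float) -> float:
        """Record one data update and return the new predicted interval."""
        if self._last_update is not None:
            observed = update_time - self._last_update
            # Blend the observed interval (accuracy) with the prior
            # prediction (smoothing).
            self.prediction = (self.alpha * observed
                               + (1.0 - self.alpha) * self.prediction)
        self._last_update = update_time
        return self.prediction

predictor = IntervalPredictor(alpha=0.5, initial_interval=10.0)
for t in (0.0, 20.0, 40.0):
    print(predictor.observe(t))
```

A timer driven by `predictor.prediction` would then schedule its next pull that far after the last observed update, rather than on a fixed interval.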

Page 32:

Data Migration

• Exponential Moving Average example

  • Real data was hard to monitor, so a simulation was created

    • Simulate 10 vehicles

    • 10% chance of packet drop

  • Measurement criteria

    • Minimize difference between production time and consumption time

    • Minimize redundant data packets

    • Minimize dropped packets

Page 33:

Data Migration

• Exponential Moving Average example

  • Cache-free model was developed

    • Emulating current system

    • Adaptable to batch query and changing vehicle configuration

    • Measure average previous interval

Page 34:

Data Migration

• Exponential Moving Average example

• Pseudocode

Page 35:

Data Migration

• Exponential Moving Average example

• Results

Page 36:

Implementation: GLaaS Model and API

• Goals

  • Taking the knowledge gained so far, implement and document a framework that exhibits best practices

    • Avoid anti-patterns

    • Choose the best medium for the job

    • Separate data, metadata, and implementation data

    • Keep business logic separate from data management

    • Migrate data near production rate

    • Multimodal retrieval and consumption mechanisms

Page 37:

Implementation: GLaaS Model and API

• Considerations

  • Security

    • Closed Pipe vs. Open Pipe

    • Authentication

      • Access level

    • Differential Privacy

      • Analysis protection

  • Reusability

  • Maintenance

  • Scalability

  • Documentation and Training

Page 38:

GLaaS Model

• Database Schema

  • Feature oriented

    • Consider transit data primitives as features

    • Make set-defined elements required fields

    • Make metadata optional fields

  • Design iterations

    • Trigger-based trickle-down model

      • Purpose

        • Fight over-index anti-pattern

        • Minimize select time purposefully

    • Output chain, batch-oriented

Page 39:

GLaaS Model

• Structure

  • Tables

    • Primary

      • Insert entry point

      • Guaranteed for analysis use

      • Acts as contract and definition of feature

    • Trigger

      • On insert, pushes and updates specific tables

    • Specific

      • Select / update point

      • Only accessible by stored procedure

    • Info

      • Metadata chainable by indexed fields

Page 40:

GLaaS Model

• Refactoring

  • Triggers did not work the way intended

    • Appearance

      • Separate files, separate queries

      • Resemble event handling

        • Simple and concurrent in imperative languages

    • Function

      • Appended to the insert query

        • Not concurrent

        • Artificial dependency

      • Traced

        • One failure invalidates the entire insert, including the original
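
The last point can be reproduced in a small sketch. It uses SQLite through Python's `sqlite3` module purely as a stand-in for the database the project used, and the table and trigger names are hypothetical. The trigger body runs as part of the INSERT that fires it, so a failure inside the trigger also undoes the original row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE primary_avl (vehicle_id TEXT, lat REAL, lon REAL);
CREATE TABLE specific_latest (
    vehicle_id TEXT, lat REAL NOT NULL, lon REAL NOT NULL
);
-- The trigger body executes inside the INSERT statement that fires it.
CREATE TRIGGER push_to_specific AFTER INSERT ON primary_avl
BEGIN
    INSERT INTO specific_latest VALUES (NEW.vehicle_id, NEW.lat, NEW.lon);
END;
""")

def count(table):
    """Row count for one of the two fixed tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

error = None
try:
    # lat is NULL: accepted by primary_avl, rejected inside the trigger.
    conn.execute("INSERT INTO primary_avl VALUES (?, ?, ?)",
                 ("bus-12", None, -70.97))
except sqlite3.IntegrityError as exc:
    error = str(exc)

rows_after_failure = (count("primary_avl"), count("specific_latest"))

conn.execute("INSERT INTO primary_avl VALUES (?, ?, ?)",
             ("bus-12", 41.99, -70.97))
rows_after_success = (count("primary_avl"), count("specific_latest"))
print(error, rows_after_failure, rows_after_success)
```

After the failed insert both tables are empty, even though the row was valid for the primary table on its own.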

Page 41:

GLaaS Model

• Output variable

  • Represents inserted data, similar to a trigger

  • Called from an insert in the primary stored procedures

  • Calls down the chain, separated by query delimiter

  • Enforces statically declared batching

  • Concurrent: lets the SQL environment make dependency decisions

  • Responsible for populating specific tables

Page 42:

GLaaS Model

• Results, integrity and protocol

Page 43:

GLaaS Model

• Explicit use of API and Stored Procedures

• No direct application level queries

• API only approved access point

• Explicit enforcement of authentication by function not by data type

• Eliminates need for application specific tables

• Fights SQL injection
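
As a rough illustration of a single approved access point, the sketch below uses a parameterized query in SQLite (via Python's `sqlite3`) in place of a stored procedure, since SQLite has none. The table, the data, and the function `get_vehicle_position` are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE avl (vehicle_id TEXT PRIMARY KEY, lat REAL, lon REAL)"
)
conn.executemany(
    "INSERT INTO avl VALUES (?, ?, ?)",
    [("bus-12", 41.99, -70.97), ("bus-7", 41.70, -70.30)],
)

def get_vehicle_position(vehicle_id):
    """Hypothetical API function: the only sanctioned way to read avl."""
    # The value is bound as a parameter and never spliced into the SQL text.
    return conn.execute(
        "SELECT lat, lon FROM avl WHERE vehicle_id = ?", (vehicle_id,)
    ).fetchone()

print(get_vehicle_position("bus-12"))
# An injection attempt is treated as a literal (and unknown) vehicle ID.
print(get_vehicle_position("bus-12' OR '1'='1"))
```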

Page 44:

GLaaS API

• Multimodal approach to consumption

  • Mechanism for static, on-demand, and real-time consumption

    • File system and known URI

      • Similar to GTFS-realtime implementation

      • Application-specific feed format

    • Request-Response

      • REST in several mediums

        • Binds to specific URI and HTTP verb

        • Eliminates need for expensive header

      • SOAP backwards compatibility

    • Subscription model via push pattern

      • WebSocket

Page 45:

GLaaS API

• SOAP vs. REST

  • SOAP

    • XML-defined package

    • URIs surrogate for endpoints

      • 1 URI per service

    • Message header contains definitions and method bindings

      • RPC

    • Message data contains payload

Page 46:

GLaaS API

• SOAP vs. REST

  • SOAP definition example for AVL

Page 47:

GLaaS API

• SOAP vs. REST

  • REST

    • URI multiplexing via routes

      • URI structure relative to root is bound to the request definition

      • Request object definition and HTTP verb bind to method and response

    • Request messages

      • Only contain data needed for functionality

      • No SOAP-style message header, lightweight

      • JSON, XML, URI-embedded, any custom data organization
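
The binding of a URI and HTTP verb to a method can be sketched without any web framework. Everything here (the `route` decorator, the `/avl` path, the handler names, the sample coordinates) is hypothetical and only illustrates the idea. It is not the GLaaS API:

```python
import json
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]
ROUTES: Dict[Tuple[str, str], Handler] = {}

def route(verb: str, path: str) -> Callable[[Handler], Handler]:
    """Bind an HTTP verb and a URI path to a handler function."""
    def register(handler: Handler) -> Handler:
        ROUTES[(verb, path)] = handler
        return handler
    return register

@route("GET", "/avl")
def get_avl(request: dict) -> dict:
    # The request carries only the data this method needs.
    return {"vehicle_id": request["vehicle_id"], "lat": 41.99, "lon": -70.97}

@route("POST", "/avl")
def post_avl(request: dict) -> dict:
    return {"stored": True, "vehicle_id": request["vehicle_id"]}

def dispatch(verb: str, path: str, body: str = "{}") -> str:
    """Look up the handler for (verb, path) and return a JSON response."""
    handler = ROUTES.get((verb, path))
    if handler is None:
        return json.dumps({"error": "not found"})
    return json.dumps(handler(json.loads(body)))

print(dispatch("GET", "/avl", '{"vehicle_id": "bus-12"}'))
print(dispatch("DELETE", "/avl"))
```

The same path serves two methods, distinguished only by verb, which is the multiplexing the slide contrasts with SOAP's one URI per service.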

Page 48:

GLaaS API

• SOAP vs. REST

  • REST

Page 49:

GLaaS API

• Goals

  • Maintenance

    • Dynamically generated usage documentation

    • Compartmentalized object definition

      • Requests

      • Responses

    • Global entry point

      • Configuration

      • Application-level authentication

    • Service definition

Page 50:

GLaaS API

• Goals

  • Extensibility

    • Add data functionality to feature

      • Add specific tables

      • Add metadata-specific data columns

    • Add application-level functionality

      • Add request, response DTOs

      • Add service method bindings

  • Replication

    • Feature encapsulates protocol-defined parts

    • Replicate abstraction model and appropriate retrieval mechanisms for new feature

Page 51:

GLaaS API

• Results

  • Reusability of features and data mechanisms

    • Tools, algorithms and methodologies reusable between applications

    • Persistent data

    • Design patterns built in for popular transit data techniques

  • Example

    • AVL as a service

    • Polyline Encoding
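
The slide does not say which polyline encoding is meant. The sketch below assumes Google's Encoded Polyline Algorithm Format, which is widely used for route shapes. The function names are hypothetical:

```python
from typing import Iterable, List, Tuple

def _encode_value(value: int) -> str:
    """Encode one signed integer delta as printable ASCII characters."""
    # Left-shift to make room for the sign, and invert negative values.
    value = ~(value << 1) if value < 0 else (value << 1)
    chunks: List[str] = []
    # Emit 5-bit chunks, least significant first; 0x20 marks "more follows".
    while value >= 0x20:
        chunks.append(chr((0x20 | (value & 0x1F)) + 63))
        value >>= 5
    chunks.append(chr(value + 63))
    return "".join(chunks)

def encode_polyline(points: Iterable[Tuple[float, float]],
                    precision: int = 5) -> str:
    """Encode (lat, lon) pairs as deltas from the previous point."""
    factor = 10 ** precision
    out: List[str] = []
    prev_lat = prev_lon = 0
    for lat, lon in points:
        lat_i, lon_i = round(lat * factor), round(lon * factor)
        out.append(_encode_value(lat_i - prev_lat))
        out.append(_encode_value(lon_i - prev_lon))
        prev_lat, prev_lon = lat_i, lon_i
    return "".join(out)

print(encode_polyline([(38.5, -120.2), (40.7, -120.95), (43.252, -126.453)]))
```

Because only deltas between consecutive points are stored, a dense AVL trace or route shape compresses into a short string that is cheap to send to a client.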

Page 52:

Acknowledgements

• Thank you

  • Dr. Uma Shama, Larry Harman, and the GeoGraphics Lab for this research opportunity.

  • Dr. Gross, my honors committee, and my proofreaders / co-workers for their advice and help.

  • CCRTA, their vehicles, and their riders for their data mechanisms and the inspiration for this study.

• Future work

  • Integration of these results and implementations for current GeoLab projects

  • Future service-oriented software design in my graduate career.