Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Cataloging The Art & Science of it...
Utkarsh, Principal Architect @ Flipkart.com

Transcript of Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Page 1: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Cataloging The Art & Science of it...

Utkarsh, Principal Architect @ Flipkart.com

Page 2: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Art vs Science

Art: Imaginative, Creative, Free Form

Science: Measurable, Methodical, Formulative, Set Patterns

Page 3: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

What is Cataloging?

• Catalog

A list or itemized display usually including descriptive information or illustrations.

• Cataloging

a. To list or include in a catalog

b. To classify according to a categorical system

We define it as:

Cataloging is the process of managing the inventory of products through the entire lifecycle of creating, updating, de-provisioning/re-provisioning and deletion.
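
The lifecycle in this definition can be sketched as a small state machine. This is only an illustration: the state names, the action names and the set of allowed transitions are assumptions made here, not something the slides specify.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DEPROVISIONED = "de-provisioned"
    DELETED = "deleted"


# (action, current state) -> next state. None means "not in the catalog yet".
# The allowed moves are an assumption made for this sketch.
TRANSITIONS = {
    ("create", None): State.ACTIVE,
    ("update", State.ACTIVE): State.ACTIVE,
    ("deprovision", State.ACTIVE): State.DEPROVISIONED,
    ("reprovision", State.DEPROVISIONED): State.ACTIVE,
    ("delete", State.ACTIVE): State.DELETED,
    ("delete", State.DEPROVISIONED): State.DELETED,
}


def apply_action(state, action):
    """Return the next state, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(action, state)]
    except KeyError:
        raise ValueError(f"cannot {action} a product in state {state}") from None


def run(actions):
    """Apply a sequence of actions to a product that does not exist yet."""
    state = None
    for action in actions:
        state = apply_action(state, action)
    return state
```

For example, `run(["create", "deprovision", "reprovision"])` ends in `State.ACTIVE`, while any action on a deleted product is rejected.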

Page 4: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Why is the problem interesting?

• Ever growing - “size”

• Dynamic nature of the Metadata - “elasticity”

• Association(s) between data elements - “flexibility”

• Flux of changes - “variability”

• De-coupled systems & Data Ownership - “data duplication”

Page 5: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

How do we solve it?

• Be Comprehensive & Imaginative

• Be Methodical & Flexible

• Work with Patterns & Create new Patterns

• Be a Composer, be an artist (blend where required)

Page 6: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


What do we solve?

• Identify Data Elements

• Identify Relationships b/w Data Elements

• Identify Data Usage patterns (Query patterns)

• Create an ideal representation: Logical Model

• Characterize the Data Store(s)

• Architect the Catalog Data Cluster

• Define Views/Interface(s)

Page 7: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Identify Data Elements

Product Biblio

Product Variants

Supplier

Pricing?

Contributors

Product Images

Category

Taxation

Product SLAs

Stock Sellers

Be Comprehensive ; Be Imaginative !!

Page 8: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Identify Relationships

[Diagram: entities Book, Author, Genre, Compilation 1, Compilation 2, Year, Physical Product and an open “?” node, connected by edges labelled “is A”, “has A” and “belongs to”]

Be Comprehensive ; Be Imaginative !!
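
One way to make such relationships concrete is to record them as labelled edges. The sketch below is a minimal illustration; the transcript preserves only the entity names and the edge labels, so which entities each edge connects is an assumption made here.

```python
from collections import defaultdict

# (subject, relation, object) triples. The pairing of entities with edge
# labels is an assumption; only the names and labels come from the slide.
EDGES = [
    ("Book", "is A", "Physical Product"),
    ("Book", "has A", "Author"),
    ("Book", "has A", "Year"),
    ("Book", "belongs to", "Genre"),
    ("Book", "belongs to", "Compilation 1"),
    ("Book", "belongs to", "Compilation 2"),
]


def build_index(edges):
    """Group edge targets by (subject, relation) for direct lookup."""
    index = defaultdict(list)
    for subject, relation, obj in edges:
        index[(subject, relation)].append(obj)
    return index


def related(index, subject, relation):
    """Return the entities reachable from subject over the given relation."""
    return list(index.get((subject, relation), []))
```

With `index = build_index(EDGES)`, `related(index, "Book", "belongs to")` lists the genre and both compilations. A new relationship type is just a new label, which is where the "be imaginative" advice applies.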

Page 9: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Identify Data Query Patterns

• Is the querying real-time or offline (customer perspective)?

• Is the query “Id”-based, or does it use filters (ad hoc or pre-defined)?

• Does the query link multiple data elements?

• Understand: Query SLAs at ever increasing scale

• Question: why is the client writing such a query

E.g.:

• Book with a specific title: “Secret of the Nagas”

• Books by Chetan Bhagat published in 2012

• Books which are Thrillers, published post 2005, written in Hindi, and published by Rupa Publications
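
The three query shapes above (lookup by id, equality filters, and a multi-attribute ad hoc filter) can be sketched over an in-memory list. Everything in this sketch is hypothetical: the `Book` fields, the helper names and the placeholder records are assumptions for illustration, not real catalog data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    book_id: str
    title: str
    author: str
    year: int
    genre: str
    language: str
    publisher: str


# Placeholder records; every value is made up for illustration.
CATALOG = [
    Book("b1", "Title A", "Author X", 2012, "Fiction", "English", "Publisher P"),
    Book("b2", "Title B", "Author X", 2009, "Thriller", "Hindi", "Publisher Q"),
    Book("b3", "Title C", "Author Y", 2011, "Thriller", "Hindi", "Publisher Q"),
    Book("b4", "Title D", "Author Y", 2003, "Thriller", "Hindi", "Publisher Q"),
]

# "Id"-based access: one dictionary lookup, independent of catalog size.
BY_ID = {book.book_id: book for book in CATALOG}


def get_by_id(book_id):
    return BY_ID.get(book_id)


def find(catalog, **equals):
    """Pre-defined filter: keep books whose fields equal the given values."""
    return [b for b in catalog if all(getattr(b, k) == v for k, v in equals.items())]


def find_where(catalog, predicate):
    """Ad hoc filter: any predicate over a book; scans the whole catalog."""
    return [b for b in catalog if predicate(b)]
```

`find(CATALOG, author="Author X", year=2012)` mirrors the second example, and the third becomes `find_where(CATALOG, lambda b: b.genre == "Thriller" and b.year > 2005 and b.language == "Hindi" and b.publisher == "Publisher Q")`. The full scan in `find_where` is the part that stops scaling, which is why the query pattern should drive the choice of indexes and stores.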

Page 14: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Identification is Non Trivial

Example “Book”

Identification -->

“Title”

“Title” + “Publisher”

“Title” + “Publisher” + “Edition”

“Title” + “Publisher” + “Edition” + “Variant”

“Title” + “Publisher” + “Edition” + “Variant” + ??

Be Imaginative - an Artist’s brush stroke !!
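
A small sketch of why each extra attribute is needed: counting how many records collide when identity is built from progressively more fields. The `BookIdentity` class, the reading of “Variant” as binding, and the sample records are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookIdentity:
    title: str
    publisher: str
    edition: str
    variant: str  # assumed here to mean binding, e.g. paperback vs hardcover


def count_collisions(records, key_fields):
    """Count records whose key, built from key_fields, is already taken."""
    seen = set()
    collisions = 0
    for record in records:
        key = tuple(getattr(record, field) for field in key_fields)
        if key in seen:
            collisions += 1
        seen.add(key)
    return collisions


# Placeholder records; every value is made up for illustration.
RECORDS = [
    BookIdentity("Title A", "Publisher P", "1st", "Paperback"),
    BookIdentity("Title A", "Publisher P", "1st", "Hardcover"),
    BookIdentity("Title A", "Publisher P", "2nd", "Paperback"),
    BookIdentity("Title A", "Publisher Q", "1st", "Paperback"),
]
```

On these four records, keying by title alone gives 3 collisions, adding publisher gives 2, adding edition gives 1, and only the full four-field key gives 0. The trailing “??” on the slide is a reminder that a real catalog may need yet another attribute.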

Page 15: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Logical Model: Schema

Entities as Tables

Relationships as Constraints

Queries supported through indexes and joins

+ Rich Query Support

+ Built-in support for Relationships

+ Indexes

- Elasticity

* Frequent addition/deletion of columns

* Growing secondary indexes

- Not optimized for some use-cases

* Key-Values

* Data Blobs / Graphs

Relational Databases:

* MySQL, Oracle, Postgres et al
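
A minimal sketch of the schema approach, using Python's built-in `sqlite3` module as a stand-in for the relational databases named above. The table names, columns and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Entities as tables, the relationship as a constraint, plus a secondary index.
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        year INTEGER NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    CREATE INDEX idx_book_year ON book(year);
""")

# Placeholder rows; every value is made up for illustration.
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Author X')")
conn.executemany(
    "INSERT INTO book (title, year, author_id) VALUES (?, ?, ?)",
    [("Title A", 2012, 1), ("Title B", 2009, 1)],
)


def titles_by_author_and_year(conn, name, year):
    """A query linking two entities, answered with a join."""
    rows = conn.execute(
        "SELECT b.title FROM book b JOIN author a ON a.id = b.author_id "
        "WHERE a.name = ? AND b.year = ? ORDER BY b.title",
        (name, year),
    )
    return [title for (title,) in rows]
```

The join and the constraint come for free, which is the "+" side of the slide. The "-" side shows up when every new product attribute means another column or another secondary index on a growing table.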

Page 16: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Logical Model: Semi-Schema

Blobs (Documents) of Data

Linkages between Documents

Queries supported through document identifiers and document references

+ Flexibility: “Documents” are less rigid

+ Query Language to retrieve based on content of “Document”

- Complex Relationships are non-trivial

- “Linked” Document Queries may not be optimized

Document Stores:

* MongoDB, Couchbase et al
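
The document approach can be sketched with plain dictionaries standing in for a document store. No real MongoDB or Couchbase API is used; the document ids, field names and helper functions are assumptions for illustration.

```python
# Documents keyed by id. Each book carries whatever attributes it needs,
# and links to its author through a reference field.
DOCS = {
    "book:1": {"type": "book", "title": "Title A", "author_ref": "author:1",
               "attributes": {"binding": "Paperback"}},
    "book:2": {"type": "book", "title": "Title B", "author_ref": "author:1",
               "attributes": {"language": "Hindi", "pages": 320}},
    "author:1": {"type": "author", "name": "Author X"},
}


def get(doc_id):
    """Lookup by document identifier."""
    return DOCS.get(doc_id)


def find_by_content(doc_type, path, value):
    """Return ids of documents of doc_type whose nested field at path equals value."""
    hits = []
    for doc_id, doc in DOCS.items():
        if doc.get("type") != doc_type:
            continue
        node = doc
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        if node == value:
            hits.append(doc_id)
    return hits


def book_with_author(book_id):
    """Resolve a linked document: one lookup for the book, a second for the author."""
    book = get(book_id)
    if book is None:
        return None
    author = get(book["author_ref"])
    return {"title": book["title"], "author": author["name"] if author else None}
```

The two books have different attribute sets without any schema change, which is the flexibility gain. `book_with_author` needs two lookups for one logical answer, which illustrates why linked-document queries may not be optimized.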

Page 17: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Logical Model: No Schema

Data Blobs

Rules/Relationship definitions

Queries supported through data “views”, indexes, search based on inverted (reverse) indexes, etc.

+ Elasticity

* Variability of data format

* Secondary Indices

+ Tunable performance

- Relational data is a force-fit (sub-optimal)

+/- Querying models are specific to Stores

Other NoSQL Stores:

* HBase, Riak, Cassandra, et al
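
A rough sketch of the no-schema approach: the store holds opaque blobs by key, and any query other than "get by key" goes through an index the application maintains itself. The `BlobStore` class below is a hypothetical in-memory stand-in, not the API of HBase, Riak or Cassandra.

```python
import json
from collections import defaultdict


class BlobStore:
    """Opaque blobs by key, plus an application-maintained inverted index."""

    def __init__(self, indexed_fields):
        self._blobs = {}
        self._indexed_fields = tuple(indexed_fields)
        self._index = defaultdict(set)  # (field, value) -> keys holding that value

    def put(self, key, record):
        self.delete(key)  # drop stale index entries when overwriting
        self._blobs[key] = json.dumps(record).encode("utf-8")
        for field in self._indexed_fields:
            if field in record:
                self._index[(field, record[field])].add(key)

    def get(self, key):
        blob = self._blobs.get(key)
        return None if blob is None else json.loads(blob)

    def delete(self, key):
        old = self.get(key)
        if old is None:
            return
        for field in self._indexed_fields:
            if field in old:
                self._index[(field, old[field])].discard(key)
        del self._blobs[key]

    def lookup(self, field, value):
        """Secondary lookup through the inverted index; no scan of the blobs."""
        return sorted(self._index.get((field, value), set()))
```

Records with different shapes can sit side by side (elasticity), but keeping the index consistent on every write is now the application's job, and each store has its own way of expressing such queries.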

Page 18: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Catalog Data Cluster

[Diagram: Catalog Data cluster made up of Product Data, Biblio Data, UGC on Products, Compliance Data, and “? Pricing/Accounting”]

- “View”/“Data” Partitions

- Blend multiple data stores

- Interfaces provide view to the underlying data

- Scale uniformly for data elements

Page 19: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Data Store Characterization

• Data characteristics:

- Reliability (availability and redundancy)

- Consistency

• Querying capability

- Support for indexes

- Filters; secondary indexes

- Linkages/relationships

• Elasticity

- increase in scale

- evolving catalog definitions

• SLAs

- Volumes

- Throughput

- Latencies

Be Comprehensive; be Methodical but be unbounded by choices - a Scientist who has a palette of colors in hand !!

Page 20: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Data Store Characterization

• CAP: which 2 do we pick? Can the data store help configure any 2?

• Operational ease (monitoring, reporting, config mgmt ..)

• Pluggability with Distributed Computing platforms

[Diagram: CAP triangle with vertices C, A and P]

Page 21: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh

Define Views & Interfaces

• Cataloging has multiple use-cases which are business centric

• Use-cases evolve, and so do the “views” of the data

• “Views” are multiple interpretations of the data

• De-coupled from the underlying data

• Underlying data form has to be elastic

• Overlaid views have to be adaptive

[Diagram: Data 1, Data 2, Data 3, Data 4; Data Access Interface; View Layer containing Precomputed View(s) and Dynamic View(s)]
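
The layering can be sketched as follows: stores sit behind one access interface, and views are built on top of that interface rather than on the stores directly. All class and function names here are hypothetical, and plain dictionaries stand in for the data stores.

```python
class DataAccess:
    """Single entry point over several stores (plain dicts stand in for them)."""

    def __init__(self, **stores):
        self._stores = stores

    def fetch(self, store, key):
        return self._stores[store].get(key)

    def keys(self, store):
        return list(self._stores[store])


def dynamic_product_view(access, product_id):
    """Computed on every read, so it always reflects the current data."""
    biblio = access.fetch("biblio", product_id) or {}
    pricing = access.fetch("pricing", product_id) or {}
    return {"id": product_id, "title": biblio.get("title"), "price": pricing.get("price")}


class PrecomputedView:
    """Materialized ahead of time: cheap to read, stale until refreshed."""

    def __init__(self, access):
        self._access = access
        self._rows = {}

    def refresh(self):
        self._rows = {
            pid: dynamic_product_view(self._access, pid)
            for pid in self._access.keys("biblio")
        }

    def read(self, product_id):
        return self._rows.get(product_id)
```

Because both kinds of view depend only on `DataAccess`, a store can be swapped or reshaped without touching them. The trade-off between the two is freshness against read cost: the dynamic view pays on every read, the precomputed one pays at refresh time.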

Page 22: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Architect for Scale & Performance

• Identify Usage Patterns

• Right Tools for Job

• Right Abstractions

• Pluggable Solution Stacks

• Decoupled Data

• Offline Processing

Page 23: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Measure, Monitor & Evolve

• SLAs change; system has to be adaptive

• Start off with established goals; benchmark and meet the initial set goals

• Changes are gradual; plan at the first symptom

• Watch for system(s) that are not coping

• Always work towards incremental changes; an entire overhaul of the systems will be counterproductive
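
As one possible reading of "benchmark against goals and plan at the first symptom", the sketch below compares an observed latency percentile with a target and raises a warning before the target is actually breached. The 99th percentile, the 80% warning band and the function names are assumptions for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, kept in integers to avoid rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def sla_status(latencies_ms, target_ms, pct=99, warn_ratio=0.8):
    """Return 'ok', 'warn' (first symptom) or 'breach' for the observed percentile."""
    observed = percentile(latencies_ms, pct)
    if observed > target_ms:
        return "breach"
    if observed > warn_ratio * target_ms:
        return "warn"
    return "ok"
```

A "warn" result is the cue to plan an incremental change while the system still meets its goal, rather than waiting for a breach.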

Be Curious, have doubts, deeply introspect - be the ultimate Scientist !!

Page 24: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Change is constant ... adapt

• Requirements evolve

• Business introduces flux

• Data interpretations grow

• Be flexible, adaptive, imaginative ... work as a Scientist who appreciates Art !!

Page 25: Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh


Thank you !

My Co-ordinates: [email protected]
