Slash n: Tech Talk Track 1 – Art and Science of Cataloguing - Utkarsh
Cataloging: The Art & Science of it...
Utkarsh, Principal Architect @ Flipkart.com
Art vs Science
Art: Imaginative, Creative, Free Form
Science: Measurable, Methodical, Formulative, Set Patterns
3
What is Cataloging?
• Catalog
A list or itemized display usually including descriptive information or illustrations.
• Cataloging
a. To list or include in a catalog
b. To classify according to a categorical system
We define it as:
Cataloging is the process of managing the inventory of products through the entire lifecycle of creating, updating, de-provisioning/re-provisioning and deletion.
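The lifecycle named in this definition can be sketched as a small state machine. This is only an illustration: the state names and the set of allowed transitions are assumptions, since the slide lists the stages but not the rules between them.

```python
from enum import Enum


class State(Enum):
    CREATED = "created"
    ACTIVE = "active"
    DEPROVISIONED = "deprovisioned"
    DELETED = "deleted"


# Hypothetical transition table; the talk names the stages, not these rules.
TRANSITIONS = {
    State.CREATED: {State.ACTIVE, State.DELETED},
    State.ACTIVE: {State.ACTIVE, State.DEPROVISIONED},   # ACTIVE -> ACTIVE models an update
    State.DEPROVISIONED: {State.ACTIVE, State.DELETED},  # back to ACTIVE models re-provisioning
    State.DELETED: set(),                                # terminal state
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = transition(State.CREATED, State.ACTIVE)   # product goes live
state = transition(state, State.DEPROVISIONED)    # temporarily withdrawn
state = transition(state, State.ACTIVE)           # re-provisioned
```

Keeping the rules in one table means a new lifecycle stage is a data change rather than a code change.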
4
Why is the problem interesting?
• Ever growing - “size”
• Dynamic nature of the Metadata - “elasticity”
• Association(s) between data elements - “flexibility”
• Flux of changes - “variability”
• De-coupled systems & Data Ownership - “data duplication”
5
How do we solve it?
• Be Comprehensive & Imaginative
• Be Methodical & Flexible
• Work with Patterns & Create new Patterns
• Be a Composer, be an artist (blend where required)
6
What do we solve?
• Identify Data Elements
• Identify Relationships b/w Data Elements
• Identify Data Usage patterns (Query patterns)
• Create an ideal representation: Logical Model
• Characterize the Data Store(s)
• Architect the Catalog Data Cluster
• Define Views/Interface(s)
7
Identify Data Elements
Product Biblio
Product Variants
Supplier
Pricing?
Contributors
Product Images
Category
Taxation
Product SLAs
Stock Sellers
Be Comprehensive ; Be Imaginative !!
8
Identify Relationships
[Diagram: Book connected to Author, Genre, Compilation 1, Compilation 2, Year, Physical Product and an open “?” element; edges labeled “is a”, “has a” and “belongs to”]
Be Comprehensive ; Be Imaginative !!
9
Identify Data Query Patterns
• Is the querying real-time or offline (customer perspective)
• Is the query “Id” based or use of filters (adhoc or pre-defined)
• Is the query linking multiple data elements
• Understand: Query SLAs at ever increasing scale
• Question: why is the client writing such a query
Eg:
• Book with a specific title Secret of the Nagas
• Books by Chetan Bhagat published in 2012
• Books which are Thrillers, published post 2005 written in Hindi and published by Rupa Publications
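The three example queries can be sketched against a tiny in-memory list to show how they differ: a single-field lookup, a two-field equality filter, and a mixed equality-plus-range filter. The `Book` fields, the `find` helper and every sample row below are made up for illustration; they are not real bibliographic data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    book_id: str
    title: str
    author: str
    year: int
    genre: str
    language: str
    publisher: str


# Illustrative rows only; authors, years and publishers are placeholders.
BOOKS = [
    Book("b1", "Secret of the Nagas", "Author X", 2011, "Fiction", "English", "Publisher P"),
    Book("b2", "Sample Title A", "Chetan Bhagat", 2012, "Non-fiction", "English", "Publisher Q"),
    Book("b3", "Sample Title B", "Author Y", 2008, "Thriller", "Hindi", "Rupa Publications"),
    Book("b4", "Sample Title C", "Author Z", 2003, "Thriller", "Hindi", "Rupa Publications"),
]


def find(books, **equals):
    """Equality filter: every keyword must match the named field exactly."""
    return [b for b in books if all(getattr(b, k) == v for k, v in equals.items())]


# 1. Single-field lookup by title.
by_title = find(BOOKS, title="Secret of the Nagas")

# 2. Two equality filters combined.
by_author_year = find(BOOKS, author="Chetan Bhagat", year=2012)

# 3. Equality filters plus a range condition (published after 2005).
hindi_thrillers = [
    b for b in find(BOOKS, genre="Thriller", language="Hindi", publisher="Rupa Publications")
    if b.year > 2005
]
```

A linear scan like this is fine for a sketch; at catalog scale each of the three shapes would typically need its own index or store, which is why the slide asks what the client is really trying to do.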
10
Identification is Non Trivial
Example “Book”
Identification -->
“Title”
“Title” + “Publisher”
“Title” + “Publisher” + “Edition”
“Title” + “Publisher” + “Edition” + “Variant”
“Title” + “Publisher” + “Edition” + “Variant” + ??
Be Imaginative - an Artist’s brush stroke !!
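One way to picture the growing identifier is as a composite key built from the fields listed on the slide. The sketch below assumes four fields and a simple whitespace/case normalization; both are illustrative choices, and the slide's trailing “??” is a reminder that a real key may need more.

```python
from typing import NamedTuple


class BookKey(NamedTuple):
    title: str
    publisher: str
    edition: str
    variant: str


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not split identities."""
    return " ".join(text.lower().split())


def make_key(title: str, publisher: str, edition: str, variant: str) -> BookKey:
    return BookKey(*(normalize(part) for part in (title, publisher, edition, variant)))


catalog = {}  # BookKey -> record


def register(record: dict) -> bool:
    """Store the record unless an item with the same identity already exists."""
    key = make_key(record["title"], record["publisher"], record["edition"], record["variant"])
    if key in catalog:
        return False
    catalog[key] = record
    return True


first = register({"title": "Sample Title", "publisher": "Publisher P",
                  "edition": "2nd", "variant": "Paperback"})
duplicate = register({"title": "  sample   title ", "publisher": "PUBLISHER P",
                      "edition": "2nd", "variant": "paperback"})
other_variant = register({"title": "Sample Title", "publisher": "Publisher P",
                          "edition": "2nd", "variant": "Hardcover"})
```

Here the second call is rejected as a duplicate while the hardcover is accepted as a distinct item, which is the behavior each added field on the slide is meant to buy.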
15
Logical Model: Schema
Entities as Tables
Relationships as Constraints
Queries supported through indexes and joins
+ Rich Query Support
+ Built-in support for Relationships
+ Indexes
- Elasticity
* Frequent addition/deletion of columns
* Growing secondary indexes
- Not optimized for some use-cases
* Key-Values
* Data Blobs / Graphs
Relational Databases:
* MySQL, Oracle, Postgres et al
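A minimal sketch of the schema approach, using Python's built-in `sqlite3` module as a stand-in for the relational databases listed. The table names, columns and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Entities as tables, the relationship as a foreign-key constraint, plus one index.
conn.executescript("""
CREATE TABLE author (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE book (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    year      INTEGER NOT NULL,
    author_id INTEGER NOT NULL REFERENCES author(id)
);
CREATE INDEX idx_book_year ON book(year);
""")

conn.execute("INSERT INTO author (id, name) VALUES (1, 'Author X')")
conn.execute("INSERT INTO book (id, title, year, author_id) VALUES (10, 'Sample Title A', 2012, 1)")

# Query supported through a join, filtered on the indexed column.
rows = conn.execute(
    "SELECT b.title, a.name FROM book b JOIN author a ON a.id = b.author_id WHERE b.year = ?",
    (2012,),
).fetchall()

# The constraint rejects a book that points at a missing author.
try:
    conn.execute("INSERT INTO book (id, title, year, author_id) VALUES (11, 'Orphan', 2013, 99)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

The elasticity cost from the slide shows up as soon as a new attribute is needed: it means an `ALTER TABLE`, and often another index.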
16
Logical Model: Semi-Schema
Blobs (Documents) of Data
Linkages between Documents
Queries supported through document identifiers and document references
+ Flexibility: “Documents” are less rigid
+ Query Language to retrieve based on content of “Document”
- Complex Relationships are non-trivial
- “Linked” Document Queries may not be optimized
Document Stores:
* MongoDB, Couchbase et al
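The document approach can be sketched with plain dictionaries standing in for a document store; no real MongoDB or Couchbase API is used here. The document ids, the `author_ref` field and the helper names are hypothetical.

```python
# Documents keyed by id; books link to authors through a reference field.
documents = {
    "author:1": {"type": "author", "name": "Author X"},
    "book:10": {"type": "book", "title": "Sample Title A", "year": 2012,
                "author_ref": "author:1", "formats": ["paperback", "ebook"]},
    # Documents need not share the same fields: this one has no "formats".
    "book:11": {"type": "book", "title": "Sample Title B", "year": 2008,
                "author_ref": "author:1"},
}


def get(doc_id: str) -> dict:
    """Retrieve by document identifier."""
    return documents[doc_id]


def find_by_content(**criteria) -> list:
    """Retrieve ids of documents whose top-level fields match all criteria."""
    return sorted(
        doc_id for doc_id, doc in documents.items()
        if all(doc.get(k) == v for k, v in criteria.items())
    )


def book_with_author(book_id: str) -> dict:
    """Resolve a linked document: a second lookup that the application performs itself."""
    book = dict(get(book_id))
    book["author"] = get(book["author_ref"])
    return book


books_2012 = find_by_content(type="book", year=2012)
resolved = book_with_author("book:10")
```

The extra lookup in `book_with_author` is the “linked document” cost from the slide: following references is the application's job, so deep relationships multiply round trips.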
17
Logical Model: No Schema
Data Blobs
Rules/Relationship definitions
Queries supported through data “views”, indexes, search based on reverse indexing etc ...
+ Elasticity
* Variability of data format
* Secondary Indices
+ Tunable performance
- Relational data is a force-fit (sub-optimal)
+/- Querying models are specific to Stores
Other NoSQL Stores:
* HBase, Riak, Cassandra, et al
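A sketch of the no-schema approach: values are opaque blobs behind a key, and search works through a separately maintained reverse (inverted) index. This models the idea in plain Python and does not reflect the API of HBase, Riak or Cassandra; all names are illustrative.

```python
import json
from collections import defaultdict

store = {}                   # key -> opaque blob (bytes); the store does not interpret it
inverted = defaultdict(set)  # (field, value) -> set of keys, maintained on write


def put(key: str, record: dict, indexed_fields: list) -> None:
    """Write the blob and update the reverse index for the chosen fields.

    Sketch limitation: overwriting a key does not remove its old index entries.
    """
    store[key] = json.dumps(record).encode("utf-8")
    for field in indexed_fields:
        if field in record:
            inverted[(field, record[field])].add(key)


def get(key: str) -> dict:
    """Key-based read: the only access path the store itself offers."""
    return json.loads(store[key])


def search(field: str, value) -> list:
    """Lookup through the reverse index instead of scanning blobs."""
    return sorted(inverted.get((field, value), set()))


# Records with different shapes live side by side.
put("p1", {"title": "Sample Title A", "language": "Hindi"}, ["language"])
put("p2", {"title": "Sample Title B", "language": "Hindi", "pages": 320}, ["language"])
put("p3", {"name": "Sample Gadget", "color": "black"}, ["color"])

hindi_keys = search("language", "Hindi")
```

Which fields get indexed is decided per write, which is the tunability the slide credits to these stores; the price is that every query shape has to be planned for up front.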
18
Catalog Data Cluster
[Diagram: Catalog Data made up of Product Data, Biblio Data, UGC on Products, Compliance Data and, with a question mark, Pricing/Accounting]
- “View”/“Data” Partitions
- Blend multiple data stores
- Interfaces provide view to the underlying data
- Scale uniformly for data elements
19
Data Store Characterization
• Data characteristics:
- Reliability (availability and redundancy)
- Consistency
• Querying capability
- Support for indexes
- Filters; secondary indexes
- Linkages/relationships
• Elasticity
- increase in scale
- evolving catalog definitions
• SLAs
- Volumes
- Throughput
- Latencies
Be Comprehensive; be Methodical but be unbounded by choices - a Scientist who has a palette of colors in hand !!
20
Data Store Characterization
• CAP: which two do we pick? Can the data store help configure any two?
• Operational ease (monitoring, reporting, config mgmt ..)
• Pluggability with Distributed Computing platforms
[Diagram: CAP triangle with corners C, A and P]
21
Define Views & Interfaces
• Cataloging has multiple use-cases which are business centric
• Use-cases evolve; and so do the “views” of the data
• “Views” as multiple interpretations of the data
• De-coupled from the underlying data
• Underlying data form has to be elastic
• Overlaid views have to be adaptive
[Diagram: Data 1, Data 2, Data 3, Data 4; Data Access Interface; View Layer with Precomputed View(s) and Dynamic View(s)]
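The split between a data access interface and a view layer can be sketched as follows. The class, the two views and the sample data are all hypothetical; the point is only that views depend on the interface, never on how the underlying stores are laid out.

```python
class DataAccess:
    """Single interface over several underlying stores (plain dicts in this sketch)."""

    def __init__(self, products: dict, prices: dict):
        self._products = products
        self._prices = prices

    def product(self, pid: str) -> dict:
        return self._products[pid]

    def price(self, pid: str) -> int:
        return self._prices[pid]

    def product_ids(self) -> list:
        return list(self._products)


def listing_view(data: DataAccess, pid: str) -> dict:
    """Dynamic view: assembled from the stores at read time, so it is always current."""
    product = data.product(pid)
    return {"id": pid, "title": product["title"], "price": data.price(pid)}


def build_titles_by_category(data: DataAccess) -> dict:
    """Precomputed view: built ahead of time, cheap to read, must be rebuilt after changes."""
    view = {}
    for pid in data.product_ids():
        product = data.product(pid)
        view.setdefault(product["category"], []).append(product["title"])
    return view


data = DataAccess(
    products={"p1": {"title": "Sample Title A", "category": "Books"},
              "p2": {"title": "Sample Gadget", "category": "Electronics"}},
    prices={"p1": 250, "p2": 1999},
)

titles_by_category = build_titles_by_category(data)  # computed once, reused by readers
listing = listing_view(data, "p1")                    # computed on every request
```

Moving prices to a different store would change only `DataAccess`; both views keep working unchanged, which is the decoupling the slide asks for.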
22
Architect for Scale & Performance
• Identify Usage Patterns
• Right Tools for Job
• Right Abstractions
• Pluggable Solution Stacks
• Decoupled Data
• Offline Processing
23
Measure, Monitor & Evolve
• SLAs change; system has to be adaptive
• Start off with established goals; benchmark and meet the initial set goals
• Changes are gradual; plan at the first symptom
• Watch for system(s) that are not coping
• Always work towards incremental changes; an entire overhaul of the systems will be counterproductive
Be Curious, have doubts, deeply introspect - be the ultimate Scientist !!
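Checking measurements against established goals can be sketched as a percentile comparison. The nearest-rank percentile method, the latency samples and the goal values below are assumptions chosen for illustration, not figures from the talk.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def breaches(samples_ms: list, goals_ms: dict) -> dict:
    """Return {percentile: observed value} for every goal that is exceeded."""
    result = {}
    for pct, goal in goals_ms.items():
        observed = percentile(samples_ms, pct)
        if observed > goal:
            result[pct] = observed
    return result


# Hypothetical latencies (ms) from one benchmark run, and hypothetical goals.
latencies_ms = [12, 15, 11, 14, 13, 90, 16, 12, 13, 14]
goals_ms = {50: 20, 95: 50}

violations = breaches(latencies_ms, goals_ms)
```

With these numbers the median is well inside its goal but the 95th percentile is not: a single slow outlier is the kind of first symptom the slide says to plan around.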
24
Change is constant ... adapt
• Requirements evolve
• Business introduces flux
• Data interpretations grow
• Be flexible, adaptive, imaginative ... work as a Scientist who appreciates Art !!