Data modeling and metadata
DATA MODELING AND METADATA: From graphs to graphs
Metadata
  Full metadata: relational schemas
  Self-defining data: XML, key/value, key/document
  No metadata: untagged images, video, audio
  Parallel metadata: tagged images, video, audio
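As a rough illustration of the four categories above, the sketch below holds similar information under each approach. All names and values (`schema`, `rows`, `tags`, the blob id) are hypothetical and chosen only for this example.

```python
import json

# 1. Full metadata: the schema is declared once, separately from the rows.
schema = ("subscriber_id", "name")   # hypothetical relation heading
rows = [(1, "Ann"), (2, "Bob")]      # tuples carry values only

# 2. Self-defining data: every item carries its own field names,
#    so one item may have fields that another lacks.
document = json.loads('{"subscriber_id": 1, "name": "Ann", "nickname": "A"}')

# 3. No metadata: an opaque blob; only a program knows how to interpret it.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # placeholder bytes, not a real image

# 4. Parallel metadata: the blob is untouched, tags live beside it.
tags = {"photo-001": ["beach", "sunset"]}      # hypothetical blob id -> tags


def field(row, name):
    """Look a value up by consulting the separate schema."""
    return row[schema.index(name)]
```

In the first case a program needs the schema to make sense of a row, while in the second the document can be read on its own.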
Full schema metadata
  Origins: semantic networks in AI
    Metadata mixed in with data
    Objects (nodes in graph), has-a (arcs in graph), is-a (arcs in graph), types (nodes), subtypes (nodes)
    Essentially a network with metadata and all instances of the metadata
    Goal was to model knowledge of the real world, not to manage volumes of data
Early databases
  Slow to adopt data structuring abstractions because speed of access was the focus
  Hierarchical and network databases
    Links between records of one file to records of another
      E.g., each claim record is linked to a subscriber record
    Also, sets of records and sets of links
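The claim/subscriber example can be sketched with in-memory records that point at each other directly. This is only an illustration of navigating by links; the class and function names are hypothetical and do not correspond to any particular hierarchical or network database product.

```python
class Subscriber:
    def __init__(self, name):
        self.name = name
        self.claims = []          # the "set" of member records owned by this record


class Claim:
    def __init__(self, amount, owner):
        self.amount = amount
        self.owner = owner        # link from the claim record to its subscriber record


def add_claim(subscriber, amount):
    """Create a claim record and wire up the links in both directions."""
    claim = Claim(amount, subscriber)
    subscriber.claims.append(claim)
    return claim


ann = Subscriber("Ann")
c1 = add_claim(ann, 120.0)
c2 = add_claim(ann, 80.0)

# Access is navigational: start at one record and follow links,
# rather than declaring what is wanted in a query.
total = sum(c.amount for c in c1.owner.claims)
```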
Relational databases
  First true abstraction of metadata separated from data
  Minimal structure in order to accommodate fast retrieval of tuples
  Abstractions
    Relation
    Attribute
    Tuple
    Primary keys (PKs), candidate keys (CKs), foreign keys (FKs), null/not null
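A minimal sketch of these abstractions, using Python's built-in `sqlite3` module with an in-memory database. The table and column names are made up for illustration. The last statement reads the schema back through `PRAGMA table_info`, which shows the metadata being stored and queried separately from the tuples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE subscriber (
    subscriber_id INTEGER PRIMARY KEY,    -- primary key
    member_no     TEXT NOT NULL UNIQUE,   -- a second candidate key
    name          TEXT NOT NULL
);
CREATE TABLE claim (
    claim_id      INTEGER PRIMARY KEY,
    subscriber_id INTEGER NOT NULL
                  REFERENCES subscriber(subscriber_id),  -- foreign key
    amount        REAL                    -- nullable attribute
);
""")

conn.execute("INSERT INTO subscriber VALUES (1, 'M-001', 'Ann')")
conn.execute("INSERT INTO claim VALUES (10, 1, 120.0)")

# A claim that points at a missing subscriber violates the foreign key.
try:
    conn.execute("INSERT INTO claim VALUES (11, 99, 5.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Metadata is itself queryable, independently of the rows.
columns = [row[1] for row in conn.execute("PRAGMA table_info(claim)")]
```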
Concurrent with relational database development: “semantic” databases
  Like semantic networks (quite deliberately), but with metadata separated from data
  Not object-oriented
    No object IDs
    No classes instantiated from types
  A wide variety of competing models, with “the” Semantic Model being one of them
Semantic databases, continued
  Other modeling notions
    Components or aggregates that are necessary parts of an object and cannot be changed, like the day you were born or the VIN of a car
    Versus properties or attributes that can be changed, like your name or the transmission in a car
    Cause and effect relationships, such as a sales visit leading to a sale
    And many other specialized relationships
  Interestingly, no query facilities and no commercial systems that were successful
Persistent programming languages
  Not necessarily object-oriented
  Host language is the only language
  Data can be persistent or not, often selectively
  Strong notion of metadata as programming data types
Object-oriented databases
  Strong notion of object ID and object identity
  Types/subtypes and classes
  Strong sense of metadata separate from data
  Behavioral encapsulation
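Object identity and behavioral encapsulation can be illustrated in any object-oriented language. The sketch below uses plain Python objects rather than a real object database, and the `Part` class is hypothetical. It shows two objects with equal state but different identities, and state that is changed only through a method.

```python
class Part:
    """Hypothetical class: state is reached only through methods."""

    def __init__(self, name, weight):
        self._name = name
        self._weight = weight

    def reweigh(self, weight):        # behavior packaged with the data
        self._weight = weight

    def same_state(self, other):
        return (self._name, self._weight) == (other._name, other._weight)


a = Part("bolt", 12.0)
b = Part("bolt", 12.0)
alias = a                             # a second reference to the same object

equal_state = a.same_state(b)         # same values...
same_object = a is b                  # ...but two distinct identities
alias.reweigh(12.5)                   # a change made through either reference
still_equal = a.same_state(b)         # now the states differ; identities are unchanged
```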
Object-relational databases
  Objects in the small
  User-defined data types for attribute domains
  No behavioral encapsulation
One-of-a-kind semantically rich databases
  Engineering/CAD data
    Complex objects
    Lots of singleton types, but with strict notion of metadata
    Complex constraints
    Far-reaching component and constraint relationships
One-of-a-kind scientific/medical/financial databases
  Managing type-based, voluminous data with little internal structure (imaging)
  Managing textual data with some structure and lots of domain-based terminology
  Often there are real-time demands made on distributed databases, which is a very difficult problem
    Addressed by putting timing constraints on specific parts of the data processing code
Self-defining data
  Inspired by need to stream data live and process it in one pass
  Also inspired by the need to vary the structure of individual pieces of data, like documents and other items that don’t really have a shared type construct
  XML developed as a shared language model for semi-structured (or self-defining) data
    Developed in part to assist the construction of the semantic web
  Data is streamed on the Internet or from sensors
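One-pass processing of self-defining data can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. The XML content below (sensor readings with differing fields) is invented for illustration, and an in-memory byte stream stands in for a live feed.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a live stream; the two readings have different structures.
stream = io.BytesIO(
    b"<readings>"
    b"<reading sensor='t1'><temp>21.5</temp></reading>"
    b"<reading sensor='h1'><humidity>40</humidity><unit>%</unit></reading>"
    b"</readings>"
)

seen = []
for event, elem in ET.iterparse(stream, events=("end",)):
    if elem.tag == "reading":
        # The tags travel with the data, so no external schema is consulted.
        fields = {child.tag: child.text for child in elem}
        seen.append((elem.get("sensor"), fields))
        elem.clear()  # release the element's children once it has been handled
```

Each reading is handled as soon as its closing tag arrives, so the whole document never needs to be held in its original form.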
Self-defining data, continued
  NoSQL databases that store extremely high volumes of loosely structured data
    Documents with internal structure
    Values with no meaning within the database
  Usually no formal query language, as data is interpreted programmatically (either partially or fully); sometimes there is a library of common query templates
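A toy sketch of this style of storage, with a plain dictionary standing in for a key/document store. The keys, documents, and the `find_where` helper are all hypothetical; the helper plays the role of one entry in a library of query templates, while any other interpretation is left to application code.

```python
store = {
    "doc:1": {"type": "invoice", "total": 250, "lines": [{"sku": "A1", "qty": 2}]},
    "doc:2": {"type": "memo", "body": "Call the supplier"},
    "blob:9": b"\x00\x01\x02",   # an opaque value the store cannot interpret
}


def find_where(store, field, value):
    """Hypothetical query template: keys of documents whose field equals value."""
    return sorted(
        key
        for key, doc in store.items()
        if isinstance(doc, dict) and doc.get(field) == value
    )


invoices = find_where(store, "type", "invoice")

# Anything beyond the template is interpreted programmatically.
quantity = sum(line["qty"] for line in store["doc:1"]["lines"])
```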
No-metadata databases
  Early blob and continuous data
    Images
    Video
    Audio
    Flash
  All processing of data takes place in complex programs that do not retrieve metadata or insert metadata in the data
    E.g., image processing, facial searching, language searching
Recent blob/continuous data
  Development of parallel metadata databases that contain low-level and semantically rich tagging
  Only the metadata database is actively searched
  Searching can be enhanced by downloading small samples
  Feedback loops to improve tag interpretation
  Tags taken from shared namespaces
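The idea of searching only the parallel metadata can be sketched as follows. The blob ids, the `ex:` tag prefix (standing in for a shared namespace), and both functions are assumptions made for this example.

```python
# The blobs themselves are never inspected by the search.
blobs = {"img-1": b"\x00" * 8, "vid-7": b"\x01" * 8}   # placeholder bytes

# Parallel metadata database: blob id -> set of namespaced tags.
tag_db = {
    "img-1": {"ex:format/jpeg", "ex:scene/beach", "ex:person/joe"},
    "vid-7": {"ex:format/mp4", "ex:scene/beach"},
}


def search(required_tags):
    """Return ids of blobs whose tag set contains every required tag."""
    required = set(required_tags)
    return sorted(blob_id for blob_id, tags in tag_db.items() if required <= tags)


def confirm_tag(blob_id, tag, is_correct):
    """Feedback step: keep a confirmed tag, drop a rejected one."""
    if is_correct:
        tag_db[blob_id].add(tag)
    else:
        tag_db[blob_id].discard(tag)


hits = search(["ex:scene/beach"])
confirm_tag("vid-7", "ex:scene/beach", is_correct=False)  # a user rejects the tag
hits_after = search(["ex:scene/beach"])
```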
Assertion-based databases
  Usually use triples (assertions)
  Triples are chained together to make new inferences
  Metadata is treated like data
    Joe owns a Ford
    Fords are cars
  SQL-like, triple-hopping query languages
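The Joe/Ford example can be sketched as a small set of triples with one chaining rule. The predicate names `owns_a` and `is_a`, the extra `Vehicle` fact, and the `infer` function are assumptions for illustration; real triple stores and their query languages are considerably richer.

```python
facts = {
    ("Joe", "owns_a", "Ford"),     # instance-level assertion
    ("Ford", "is_a", "Car"),       # type-level assertion, stored the same way
    ("Car", "is_a", "Vehicle"),
}


def infer(triples):
    """Chain triples through is_a until no new assertion appears."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                # (s p o) and (o is_a o2) together imply (s p o2)
                if p2 == "is_a" and o == s2 and p in ("owns_a", "is_a"):
                    new.add((s, p, o2))
        if new <= known:
            return known
        known |= new


def objects(triples, subject, predicate):
    """A single 'hop': everything the subject relates to via the predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


closure = infer(facts)
joe_owns = objects(closure, "Joe", "owns_a")
```

Because "Fords are cars" sits in the same store as "Joe owns a Ford", the rule treats metadata and data uniformly.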
Graph databases
  Networks of objects that blur the boundary between data and metadata
  Support levels of connectivity orders of magnitude bigger than in the network and hierarchical databases of old
  Have a purpose that is reminiscent of network/hierarchical databases: to represent the fluid and highly interconnected nature of complex data, such as that collected from social media
  Use graph-like query and programming interfaces
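A graph-style query is usually phrased as a traversal. The sketch below runs a breadth-first search over a tiny, invented "follows" graph to find everyone within a given number of hops; the adjacency dictionary and function name are assumptions, not the interface of any graph database.

```python
from collections import deque

# Hypothetical social graph: person -> people they follow.
edges = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": [],
    "eve": ["ann"],
}


def within_hops(start, max_hops):
    """Breadth-first traversal: node -> hop count, for nodes up to max_hops away."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue                      # do not expand beyond the hop limit
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]                       # report only the other nodes
    return dist


reachable = within_hops("ann", 2)
```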
Graphics/animation/gaming data
  Shares a lot of properties with scientific and engineering data
  Innately mathematical
    Straight and curved line 2D geometry used in 3-space
      Bezier and NURBS for curves
    Matrix mathematics for 3D manipulation
      Translate, Scale, Rotate
    Mapping to pixel-based data for presentation
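Two of the mathematical building blocks named above can be sketched in a few lines of plain Python: 3D manipulation with 4x4 homogeneous matrices, and evaluation of a Bezier curve by de Casteljau's algorithm. The function names and sample points are chosen for this example only; production code would normally use a linear algebra library.

```python
import math


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translate(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def scale(sx, sy, sz):
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]


def rotate_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def apply(m, point):
    """Apply a 4x4 transform to an (x, y, z) point using w = 1."""
    v = (point[0], point[1], point[2], 1.0)
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])


def bezier(points, t):
    """de Casteljau: repeatedly interpolate between neighboring control points."""
    pts = [tuple(p) for p in points]
    while len(pts) > 1:
        pts = [tuple((1 - t) * a + t * b for a, b in zip(p, q))
               for p, q in zip(pts, pts[1:])]
    return pts[0]


# Scale first, then rotate 90 degrees about z, then translate.
m = matmul(translate(1, 1, 0), matmul(rotate_z(math.pi / 2), scale(2, 2, 2)))
moved = apply(m, (1.0, 0.0, 0.0))

# Midpoint of a quadratic Bezier curve with three control points.
mid = bezier([(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)], 0.5)
```

Composing the three matrices into one means every vertex of a model needs only a single matrix multiplication per frame.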
Graphics/animation/gaming, continued
  For real-time rendering, low-polygon objects and bounding-box collision mathematics are used
  Creates the most aggressive demands on processing and graphics card technology
  Often no notion of metadata at all
  Even non-real-time animation demands low-quality interactive rendering
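Bounding-box collision is cheap because it replaces a mesh-against-mesh test with a few comparisons. A minimal sketch for axis-aligned boxes follows; the `Box` class and the sample objects are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    lo: tuple   # (x, y, z) minimum corner
    hi: tuple   # (x, y, z) maximum corner


def overlaps(a, b):
    """Two axis-aligned boxes collide only if their extents overlap on every axis."""
    return all(a.lo[i] <= b.hi[i] and b.lo[i] <= a.hi[i] for i in range(3))


player = Box((0.0, 0.0, 0.0), (1.0, 2.0, 1.0))
crate = Box((0.5, 0.0, 0.5), (1.5, 1.0, 1.5))
wall = Box((5.0, 0.0, 0.0), (6.0, 3.0, 1.0))

hit_crate = overlaps(player, crate)
hit_wall = overlaps(player, wall)
```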
Procedural data
  Used heavily in photo/video processing
    Focusing, removing objects, adding color effects, changing lighting, etc.
    There are standalone apps and plugin products
  Used heavily in animation
    Procedural textures and materials that don’t need to be tiled
    Environment procedures (often sun and sky)
    Cloning to make crowds
    Lighting and camera objects
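A procedural texture is a function rather than a stored image, which is why it never needs tiling: it can be evaluated at any coordinate and any resolution. The checkerboard below is a deliberately simple, hypothetical example of the idea.

```python
import math


def checker(u, v, squares=8):
    """Return 1.0 or 0.0 for texture coordinate (u, v); defined for any real u, v."""
    return float((math.floor(u * squares) + math.floor(v * squares)) % 2)


def render(width, height, squares=8):
    """Sample the procedure onto a pixel grid only when pixels are needed."""
    return [[checker(x / width, y / height, squares) for x in range(width)]
            for y in range(height)]


small = render(4, 2, squares=2)
```

The "data" here is the procedure plus its parameters; the pixel grid is produced on demand.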
Metadata for procedural data
  Big problem
    Difficult to crisply define the “meaning” of procedural data
    Often, the reason procedural data exists is that the task is too complex
    This sort of data is often inherently non-declarative
    The marketplace is filled with competing, varying products, each with its own interface, and they are too powerful to scrap
Procedural data, continued
  Mathematical packages used for minding
    Almost ironically, these are somewhat easier to package declaratively, since the mathematics can be so complex that its foundation is used in a black-box fashion