
It's All About Data Classification and Searching

I don't know if this has been discussed elsewhere, but I felt like I had an epiphany, so there. The way I see it, in a decade or two the most important technologies regarding data will be data classification and search.

Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it is simply too expensive to buy the fastest disks, and even if you do buy them, they're smaller than the slower-spinning drives.

Imagine if speed and size were not issues. I know that's a big assumption, but let's play along for a second... (let's just say that there are plenty of revolutionary advances in the storage space coming our way within, say, 10-20 years, that will make this concept not seem that far-fetched).


Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of extra copies of everything, kept forever if needed and replicated to multiple locations (this is already happening; it's just expensive, so it's not common). Indeed, everyone would just let all kinds of data accumulate, and scrubbing would not be nearly as frequent as it is now. Multiple storage islands would also be clustered seamlessly so that they present a single, coherent space, compounding the problem further.

Within such a chaotic architecture, the only real problems are data classification and mining - i.e., figuring out what you have and actually getting at it. The "where" is not much of an issue - nobody cares where the data lives, as long as they can get to it in a timely fashion.

I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem (WinFS) for Vista/Longhorn that would really be SQL on top of NTFS, with files stored as BLOBs. It got delayed, so we didn't get it, but they're saying it should be out in a few years (there were issues with scalability and speed).

Let's forget about the Microsoft-specific implementation and just think about the concept instead (I'd use something like a decent database on raw disk rather than NTFS, for instance). No more real file structure as we know it - just one huge database occupying the entire drive.

Think of the advantages (a rough sketch of the idea follows the list):

Far more resilient to failures
Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
Replication via log shipping
Amazing indexing
Easy expandability
The potential for great performance, if done right
Lots of tuning options (maybe too many for some)
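
To make the concept concrete, here's a minimal sketch in Python, using the built-in sqlite3 module as a stand-in for a proper database engine sitting on raw disk. The table layout and names are my own illustration, not WinFS or any real product:

# A toy version of "the whole drive is one database": files are rows,
# and the file contents live in a BLOB column. Illustrative only.
import sqlite3

db = sqlite3.connect("drive.db")  # in reality, the database IS the drive
db.execute("""
    CREATE TABLE IF NOT EXISTS files (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        data    BLOB NOT NULL,
        created TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# "Writing a file" is just inserting a row (and the engine's redo log
# is what gives you the rollback/rebuild/log-shipping benefits above).
with db:
    db.execute("INSERT INTO files (name, data) VALUES (?, ?)",
               ("proposal.doc", b"document bytes here"))

# "Reading a file" is a query; no directory tree is ever walked.
(data,) = db.execute("SELECT data FROM files WHERE name = ?",
                     ("proposal.doc",)).fetchone()
print(len(data), "bytes")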

With such a technology, you need a lot more metadata for each file, so you can present it in different ways and also search for it efficiently. Let's consider a simple text document: you're trying to sell some storage, so you write a proposal for a new client. You could have metadata on:

Author
Filename
Client name
Type of document - proposal
Project name
Excerpt
Salesperson's name
Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
Document revision (possibly automatically generated)

A lot of these fields can already be found in the properties of any MS Word document.
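
Continuing the sketch, such metadata could live in its own table alongside the files. The column names below simply mirror the list above; they're assumptions for illustration, not any real filesystem's schema:

# Hypothetical metadata table for the proposal example. Every field
# from the list becomes a queryable, indexable column.
import sqlite3

db = sqlite3.connect("drive.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS metadata (
        file_id     INTEGER REFERENCES files(id),
        author      TEXT,
        filename    TEXT,
        client      TEXT,
        doc_type    TEXT,     -- e.g. 'proposal'
        project     TEXT,
        excerpt     TEXT,
        salesperson TEXT,
        keywords    TEXT,     -- e.g. 'EMC DMX, McData switches'
        revision    INTEGER   -- possibly auto-generated
    )
""")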

The database would index the metadata, at the very least when the file is created and any time the metadata changes. Searches would be possible on any of the fields. Then, a virtual directory structure could be created (sketched in code after these examples):

Create a virtual directory with all files pertaining to a specific client (the most common way people would organize it)
Show all the material for a specific project
Show all proposals that have to do with a specific salesperson
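
In the sketch, each of these virtual directories is nothing more than a saved query over indexed metadata; the client, project, and salesperson values below are of course made up:

# Index the fields people actually search on...
import sqlite3

db = sqlite3.connect("drive.db")
db.execute("CREATE INDEX IF NOT EXISTS idx_client ON metadata (client)")
db.execute("CREATE INDEX IF NOT EXISTS idx_project ON metadata (project)")
db.execute("CREATE INDEX IF NOT EXISTS idx_sales ON metadata (salesperson)")

# ...then a "virtual directory" is just a query. All files for one client:
for (name,) in db.execute(
        "SELECT filename FROM metadata WHERE client = ?", ("Acme Corp",)):
    print(name)

# All material for a specific project:
project_files = db.execute(
    "SELECT filename FROM metadata WHERE project = ?",
    ("SAN refresh",)).fetchall()

# All proposals tied to a specific salesperson:
proposals = db.execute(
    "SELECT filename FROM metadata "
    "WHERE doc_type = 'proposal' AND salesperson = ?",
    ("J. Smith",)).fetchall()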

Virtual folders exist now in Mac OS X (they can be created after a Spotlight search), Vista (saved searches) and even GNOME 2.14, but the underlying engine is simply not as powerful as what I just described. Normal searches are used, and metadata is not that extensive for most files anyway (MP3 files being an exception, since metadata creation is almost forced on you when you rip a CD).

It should be obvious by now that to enable this kind of functionality properly, you need really good ways of classifying and indexing your data, and of actually creating all the metadata that needs to be there, as automatically as possible. Future software will probably force you to create the metadata in some way, of course. Existing software that does this kind of classification is fairly poor, in my opinion. Please correct me if I'm wrong.
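
As for what "automatic" metadata creation might look like, here's a deliberately naive sketch: scan a document's text against known client and keyword lists and fill in whatever fields match. The word lists, function name, and rules are all invented for illustration:

# A naive auto-classifier: recover metadata from raw text with simple
# keyword scanning. Real classification software would do far more.
KNOWN_CLIENTS = {"Acme Corp", "Globex"}
KNOWN_KEYWORDS = {"EMC DMX", "McData", "Brocade"}

def auto_classify(text: str) -> dict:
    """Return whatever metadata a dumb keyword scan can recover."""
    meta = {"client": None, "doc_type": None, "keywords": []}
    lowered = text.lower()
    for client in KNOWN_CLIENTS:           # exact-match client detection
        if client in text:
            meta["client"] = client
            break
    for kw in KNOWN_KEYWORDS:              # case-insensitive keyword tags
        if kw.lower() in lowered:
            meta["keywords"].append(kw)
    if "proposal" in lowered:              # crude document-type guess
        meta["doc_type"] = "proposal"
    return meta

print(auto_classify("Proposal for Acme Corp: EMC DMX with McData switches"))
# -> {'client': 'Acme Corp', 'doc_type': 'proposal',
#     'keywords': ['EMC DMX', 'McData']}  (keyword order may vary)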