Profile Serialization IIPC GA 2015

Archive Profile Serialization | Sawood Alam @ibnesayeed Computer Science Department, Old Dominion University Norfolk, Virginia - 23529

Transcript of Profile Serialization IIPC GA 2015

Page 1: Profile Serialization IIPC GA 2015

Archive Profile Serialization

Sawood Alam @ibnesayeed

Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529

Page 2: Profile Serialization IIPC GA 2015

Archive Profile

High-level digest of an archive
Predicts presence of mementos of a URI-R in an archive
Provides various statistics about the holdings
Small in size
Publicly available
Easy to update and partially patch
Useful for Memento query routing and other things
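As a rough illustration of query routing, the sketch below picks the archives whose profiles contain at least one key derived from the requested URI-R. The archive names, the `route` function, and the idea of ranking by the most specific matching key are assumptions made for this example, not something the slides specify.

```python
def route(profiles, candidate_keys):
    """Return archive names whose profile holds at least one candidate key,
    most specific match first (hypothetical ranking rule)."""
    matches = []
    for archive, keys in profiles.items():
        # Longest candidate key present in this archive's profile, if any.
        hit = max((k for k in candidate_keys if k in keys), key=len, default=None)
        if hit is not None:
            matches.append((len(hit), archive))
    return [archive for _, archive in sorted(matches, key=lambda m: (-m[0], m[1]))]


# Hypothetical profiles, reduced to their sets of sub-URI keys.
profiles = {
    "archive_a": {"uk)/", "uk,co)/", "uk,co,bbc)/"},
    "archive_b": {"uk)/"},
    "archive_c": {"com)/"},
}
targets = route(profiles, ["uk)/", "uk,co)/", "uk,co,bbc)/"])
```

An aggregator could then send the Memento request only to the archives in `targets` and skip the rest.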

Page 3: Profile Serialization IIPC GA 2015

Profile Contents

How to organize contents?
What goes in it?
How to serialize it?

Page 4: Profile Serialization IIPC GA 2015

Flat Organization

{
  "...": {},
  "stats": {
    "suburi": {
      "edu)/": {"urim": {"max": 3, "min": 1, "total": 32}, "urir": 12},
      "edu,harvard)/": {"urim": {"max": 1, "min": 1, "total": 2}, "urir": 2},
      "edu,harvard,law,blogs)/": {"urim": {"max": 1,
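The keys in the flat organization look like reversed host names. A minimal sketch of how such keys and the per-key "urir" counts could be derived is shown below; `host_prefix_keys` and `flat_counts` are hypothetical helpers, and the actual key policy of the profiler (which prefixes it keeps, how paths are handled) is not described on this slide.

```python
from urllib.parse import urlsplit


def host_prefix_keys(uri):
    """Return one key per prefix of the reversed host name, e.g. 'edu,harvard)/'."""
    host = urlsplit(uri).hostname
    if not host:
        return []
    labels = host.lower().split(".")[::-1]  # blogs.law.harvard.edu -> edu, harvard, law, blogs
    return [",".join(labels[:i]) + ")/" for i in range(1, len(labels) + 1)]


def flat_counts(uris):
    """Count distinct URI-Rs under every key (the 'urir' field of the flat layout)."""
    counts = {}
    for uri in set(uris):
        for key in host_prefix_keys(uri):
            counts[key] = counts.get(key, 0) + 1
    return counts


keys = host_prefix_keys("http://blogs.law.harvard.edu/a")
counts = flat_counts(["http://harvard.edu/", "http://blogs.law.harvard.edu/x"])
```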

Page 5: Profile Serialization IIPC GA 2015

Grouped Organization

{
  "...": {},
  "stats": {
    "tld": {
      "com)/": {"urim": {"max": 10, "min": 2, "total": 72}, "urir": 34},
      "edu)/": {"urim": {"max": 3, "min": 1, "total": 32}, "urir": 12},
      "...": {}
    },
    "domain": {

Page 6: Profile Serialization IIPC GA 2015

Nested Organization

{
  "...": {},
  "stats": {
    "tld": {
      "com)/": {
        "domain": {
          "com,adobe)/": {"urim": {"max": 3, "min": 3, "total": 6}, "urir": 2},
          "...": {}
        },
        "urim": {"max": 3, "min": 1, "total": 17},
        "urir": 13
      },

Page 7: Profile Serialization IIPC GA 2015

Frequency Metrics

{
  "...": {},
  "stats": {
    "suburi": {
      "com)/": {
        "urim": {
          "1stqu": 4.2,
          "3rdqu": 7.13,
          "max": 12,
          "mean": 6.52,
          "median": 8,
          "min": 1,
          "sd": 4.18,
          "total": 86
        },
        "urir": 15
      },
      "...": {}
    },
    "...": {}
  }
}
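The metric names above can be computed from the number of mementos held for each URI-R under a key. The sketch below uses Python's `statistics` module; the choice of inclusive quartiles and the sample standard deviation is an assumption, since the slide does not say which definitions the profiler uses, and `urim_metrics` is a hypothetical name.

```python
import statistics


def urim_metrics(memento_counts):
    """Summarize per-URI-R memento counts using the metric names from the slide."""
    counts = sorted(memento_counts)
    if len(counts) < 2:
        raise ValueError("need at least two URI-Rs to compute quartiles and sd")
    # Assumption: inclusive quartiles and sample standard deviation.
    q1, _, q3 = statistics.quantiles(counts, n=4, method="inclusive")
    return {
        "1stqu": q1,
        "3rdqu": q3,
        "max": counts[-1],
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": counts[0],
        "sd": statistics.stdev(counts),
        "total": sum(counts),
    }


# Five URI-Rs with 1 to 5 mementos each (made-up numbers).
metrics = urim_metrics([3, 1, 5, 2, 4])
```

The matching "urir" value would simply be the number of counts passed in.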

Page 8: Profile Serialization IIPC GA 2015

JSON Serialization

Can have complex nested data structure
JSON-LD for linked data
No partial key lookup
Unsuitable for text processing tools
Allows processing only when fully loaded
A single malformed character makes it unparsable
Difficult to patch
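The point about malformed characters can be seen with a small experiment. The sketch below parses a tiny made-up profile, then corrupts a single character and tries again; the document content and the `try_load` helper are illustrative assumptions.

```python
import json

# A tiny made-up profile serialized as one JSON document.
profile_text = '{"stats": {"suburi": {"edu)/": {"urir": 12}, "com)/": {"urir": 34}}}}'


def try_load(text):
    """Return the parsed document, or None if it cannot be parsed at all."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


intact = try_load(profile_text)
# Corrupt one character: turn the first colon into a semicolon.
damaged = try_load(profile_text.replace(":", ";", 1))
```

With a standard parser, the damaged document yields nothing at all, even though every sub-URI record after the corrupted character is still well formed.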

Page 9: Profile Serialization IIPC GA 2015

Sample JSON Profile

{
  "@context": "https://oduwsdl.github.io/context/archprofile.jsonld",
  "@id": "http://www.webarchive.org.uk/ukwa/",
  "about": {
    "accesspoint": "http://www.webarchive.org.uk/wayback/",
    "mechanism": "http://oduwsdl.github.io/terms/mechanism#cdx",
    "name": "UKWA 1996 Collection",
    "profile_updated": "2015-01-20T17:25:30Z",
    "suburi_class": "http://oduwsdl.github.io/terms/suburi#H3P1",
    "more_meta_data": "..."
  },
  "stats": {
    "language": {
      "en-US": {"urim": {"max": 13, "min": 1, "total": 47529}, "urir": 25621},
      "more_languages": "..."
    },

Page 10: Profile Serialization IIPC GA 2015

CDXJSON Serialization

Fusion of CDX and JSON file formats
A key followed by strict single line JSON value
Unlike CDX, values can have arbitrary attributes
Text processing tool friendly
No single root node or single document restrictions
Enables binary search
Enables partial key lookup
Error resilient
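Because each record is a single line that starts with its key, sorted CDXJSON lines can be searched with a binary search and queried by key prefix. The sketch below does this in memory with `bisect`; the `CdxjIndex` class is hypothetical, the sample values are reduced to a single field, and a real tool would more likely seek within the file instead of loading it.

```python
import json
from bisect import bisect_left


class CdxjIndex:
    """In-memory index over CDXJSON lines of the form: key SPACE json."""

    def __init__(self, lines):
        # Sort by code point so that keys sharing a prefix are contiguous.
        pairs = sorted(line.rstrip("\n").partition(" ")[::2] for line in lines)
        self.keys = [key for key, _ in pairs]
        self.values = [value for _, value in pairs]

    def prefix(self, prefix):
        """Return {key: parsed value} for every key starting with prefix."""
        i = bisect_left(self.keys, prefix)  # binary search for the first candidate
        found = {}
        while i < len(self.keys) and self.keys[i].startswith(prefix):
            # Only the matching lines are ever parsed as JSON.
            found[self.keys[i]] = json.loads(self.values[i])
            i += 1
        return found


index = CdxjIndex([
    'uk,co,bbc)/images {"urim": {"total": 3}}',
    'uk)/ {"urir": 867817}',
    'com)/ {"urir": 34}',
    'uk,co,bbc)/ {"urir": 115}',
    'uk,co)/ {"urir": 378686}',
])
bbc = index.prefix("uk,co,bbc)/")
```

Only the lines under the requested prefix are decoded, which is what makes partial key lookup cheap compared with loading a whole JSON document.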

Page 11: Profile Serialization IIPC GA 2015

Sample CDXJSON Profile

Key String SPACE Single Line JSON NEWLINE

@context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
@id "http://www.webarchive.org.uk/ukwa/"
@about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
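A reader for this layout can split each line at the first space and parse the remainder as JSON, skipping any line that fails. The sketch below is a minimal version of that idea; `parse_cdxjson` is a hypothetical name, and the sample lines are adapted from the slide with trailing commas dropped and one line left cut off on purpose.

```python
import json


def parse_cdxjson(text):
    """Parse 'key SPACE json' lines; return (records, numbers of bad lines)."""
    records, bad_lines = {}, []
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(" ")
        try:
            if not sep:
                raise ValueError("missing value")
            records[key] = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            bad_lines.append(number)  # skip this line, keep going
    return records, bad_lines


sample = "\n".join([
    '@id "http://www.webarchive.org.uk/ukwa/"',
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686',
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}',
])
records, bad_lines = parse_cdxjson(sample)
```

The cut-off third line is reported and skipped, while the records before and after it remain usable.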

Page 12: Profile Serialization IIPC GA 2015

Conclusions and Future Work

CDXJSON offers scalability and failure resilience
Reduces the profile size as it allows partial key lookup
TODO: Update profiler script to output in CDXJSON
TODO: Formalize CDXJSON format
Implementation code is available at:

GitHub: /oduwsdl/suburi_generator
GitHub: /oduwsdl/archive_profiler