Archive Profile Serialization
Sawood Alam @ibnesayeed
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529
Archive Profile
- High-level digest of an archive
- Predicts presence of mementos of a URI-R in an archive
- Provides various statistics about the holdings
- Small in size
- Publicly available
- Easy to update and partially patch
- Useful for Memento query routing and other things
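As a rough illustration of how a profile could support Memento query routing, the sketch below picks the archives whose profile lists at least one lookup key derived from a URI-R. The function name `candidate_archives`, the archive names, and the in-memory shape of `profiles` are hypothetical and not taken from the slides.

```python
def candidate_archives(lookup_keys, profiles):
    """Return the names of archives whose profile contains any lookup key.

    lookup_keys: keys derived from the requested URI-R (hypothetical input).
    profiles: mapping of archive name -> set of keys found in its profile.
    """
    return sorted(
        name
        for name, keys in profiles.items()
        if any(key in keys for key in lookup_keys)
    )


# Illustrative profiles only; real profiles would also carry statistics.
profiles = {
    "archive-a": {"edu)/", "edu,harvard)/"},
    "archive-b": {"uk)/", "uk,co)/", "uk,co,bbc)/"},
}

print(candidate_archives(["uk,co,bbc)/", "uk,co)/", "uk)/"], profiles))
```

Only the archives returned here would need to be queried; the rest can be skipped for this URI-R.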
Profile Contents
- How to organize contents?
- What goes in it?
- How to serialize it?
Flat Organization
    {
      "...": {},
      "stats": {
        "suburi": {
          "edu)/": {"urim": {"max": 3, "min": 1, "total": 32}, "urir": 12},
          "edu,harvard)/": {"urim": {"max": 1, "min": 1, "total": 2}, "urir": 2},
          "edu,harvard,law,blogs)/": {"urim": {"max": 1,
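The flat organization keys every record by a sub-URI string such as `edu,harvard)/`, with host labels reversed and joined by commas. A minimal sketch of how such keys might be derived from a URI follows. The function `suburi_keys`, the `www.` stripping, and the limits `max_host` and `max_path` are assumptions made for illustration, not the behavior of the author's generator.

```python
from urllib.parse import urlsplit


def suburi_keys(uri, max_host=3, max_path=1):
    """Build progressively longer sub-URI keys for a URI (illustrative only)."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    if not host:
        return []
    if host.startswith("www."):
        host = host[4:]  # assumption: treat "www." as noise
    labels = host.split(".")[::-1]  # "bbc.co.uk" -> ["uk", "co", "bbc"]
    segments = [s for s in parts.path.split("/") if s]

    keys = []
    # One key per host depth, up to max_host labels.
    for n in range(1, min(len(labels), max_host) + 1):
        keys.append(",".join(labels[:n]) + ")/")
    # Assumption: path segments are added only when the full host fits.
    if len(labels) <= max_host:
        base = ",".join(labels) + ")/"
        for m in range(1, min(len(segments), max_path) + 1):
            keys.append(base + "/".join(segments[:m]))
    return keys


print(suburi_keys("http://www.bbc.co.uk/images/logo.png"))
```

Each returned key can be used as a lookup key in a flat profile, from the least specific (`uk)/`) to the most specific.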
Grouped Organization
    {
      "...": {},
      "stats": {
        "tld": {
          "com)/": {"urim": {"max": 10, "min": 2, "total": 72}, "urir": 34},
          "edu)/": {"urim": {"max": 3, "min": 1, "total": 32}, "urir": 12},
          "...": {}
        },
        "domain": {
Nested Organization
    {
      "...": {},
      "stats": {
        "tld": {
          "com)/": {
            "domain": {
              "com,adobe)/": {"urim": {"max": 3, "min": 3, "total": 6}, "urir": 2},
              "...": {}
            },
            "urim": {"max": 3, "min": 1, "total": 17},
            "urir": 13
          },
Frequency Metrics
    {
      "...": {},
      "stats": {
        "suburi": {
          "com)/": {
            "urim": {
              "1stqu": 4.2,
              "3rdqu": 7.13,
              "max": 12,
              "mean": 6.52,
              "median": 8,
              "min": 1,
              "sd": 4.18,
              "total": 86
            },
            "urir": 15
          },
          "...": {}
        },
        "...": {}
      }
    }
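The frequency metrics above summarize how many mementos (URI-Ms) each URI-R under a key has. The sketch below shows one way such a record might be computed with Python's `statistics` module, assuming the input is a list with at least two per-URI-R memento counts. The function name `urim_metrics` and the rounding to two decimals are illustrative choices, and the quartile method is the module's default, which may differ from whatever the original profiler used.

```python
import statistics


def urim_metrics(counts):
    """Summarize per-URI-R memento counts in the shape of the sample record."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(counts, n=4)
    return {
        "urim": {
            "1stqu": round(q1, 2),
            "3rdqu": round(q3, 2),
            "max": max(counts),
            "mean": round(statistics.mean(counts), 2),
            "median": statistics.median(counts),
            "min": min(counts),
            "sd": round(statistics.stdev(counts), 2),  # sample standard deviation
            "total": sum(counts),
        },
        "urir": len(counts),  # number of distinct URI-Rs under the key
    }


print(urim_metrics([1, 2, 3, 4, 5]))
```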
JSON Serialization
- Can have complex nested data structure
- JSON-LD for linked data
- No partial key lookup
- Unsuitable for text processing tools
- Allows processing only when fully loaded
- A single malformed character makes it unparsable
- Difficult to patch
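The fragility point can be seen with Python's standard `json` module: a document that is missing a single character fails to parse as a whole, so none of its records are recoverable through the parser. The helper `try_parse` and the sample strings are illustrative.

```python
import json


def try_parse(text):
    """Return the parsed document, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


good = '{"stats": {"suburi": {"edu)/": {"urir": 12}}}}'
bad = good[:-1]  # drop one closing brace to simulate a damaged file

print(try_parse(good) is not None)  # the intact document parses
print(try_parse(bad) is None)       # the damaged one is rejected entirely
```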
Sample JSON Profile
    {
      "@context": "https://oduwsdl.github.io/context/archprofile.jsonld",
      "@id": "http://www.webarchive.org.uk/ukwa/",
      "about": {
        "accesspoint": "http://www.webarchive.org.uk/wayback/",
        "mechanism": "http://oduwsdl.github.io/terms/mechanism#cdx",
        "name": "UKWA 1996 Collection",
        "profile_updated": "2015-01-20T17:25:30Z",
        "suburi_class": "http://oduwsdl.github.io/terms/suburi#H3P1",
        "more_meta_data": "..."
      },
      "stats": {
        "language": {
          "en-US": {"urim": {"max": 13, "min": 1, "total": 47529}, "urir": 25621},
          "more_languages": "..."
        },
CDXJSON Serialization
- Fusion of CDX and JSON file formats
- A key followed by strict single line JSON value
- Unlike CDX, values can have arbitrary attributes
- Text processing tool friendly
- No single root node or single document restrictions
- Enables binary search
- Enables partial key lookup
- Error resilient
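Because every CDXJSON line is a key followed by a self-contained single-line JSON value, a reader can process the file line by line and skip a damaged line without losing the rest. The sketch below illustrates that idea. The function `parse_cdxjson` and its skip-and-count policy are assumptions, since the slides do not specify how errors should be handled.

```python
import json


def parse_cdxjson(lines):
    """Parse 'KEY SPACE JSON' lines, skipping any line that is malformed."""
    records, skipped = {}, 0
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        key, sep, value = line.partition(" ")
        try:
            if not sep:
                raise ValueError("missing value")
            records[key] = json.loads(value)
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            skipped += 1
    return records, skipped


lines = [
    '@id "http://www.webarchive.org.uk/ukwa/"',
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686',  # damaged
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115}',
]
records, skipped = parse_cdxjson(lines)
print(sorted(records), skipped)
```

Here the damaged `uk,co)/` line is counted and skipped while the other three records remain usable.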
Sample CDXJSON Profile
Line format: Key String SPACE Single Line JSON NEWLINE
    @context "https://oduwsdl.github.io/contexts/archiveprofile.jsonld"
    @id "http://www.webarchive.org.uk/ukwa/"
    @about {"name": "UKWA 1996 Collection", "type": "suburi#H3P1", "...":
    uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}, "urir": 867817},
    uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}, "urir": 378686
    uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}, "urir": 115},
    uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}, "urir":
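When the lines of a CDXJSON profile are kept in sorted order, binary search and partial key lookup become straightforward. The sketch below uses Python's `bisect` module on an in-memory list of lines. The function `prefix_lookup` is hypothetical, and the sample values are shortened from the sample above.

```python
import bisect


def prefix_lookup(sorted_lines, prefix):
    """Return all lines whose key starts with prefix, using binary search."""
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    # In a sorted list, all lines sharing a prefix are contiguous.
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches


# Lines must be sorted; ")" sorts before ",", so shorter hosts come first.
profile = sorted([
    'uk)/ {"urim": {"max": 8, "min": 1, "total": 932432}}',
    'uk,co)/ {"urim": {"max": 8, "min": 1, "total": 410979}}',
    'uk,co,bbc)/ {"urim": {"max": 2, "min": 1, "total": 128}}',
    'uk,co,bbc)/images {"urim": {"max": 1, "min": 1, "total": 3}}',
])

for line in prefix_lookup(profile, "uk,co,bbc)/"):
    print(line)
```

Appending a space to the prefix (for example `"uk,co,bbc)/ "`) narrows the search to an exact key match. For a profile stored on disk, the same idea could be applied by seeking within the file instead of loading every line.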
Conclusions and Future Work
- CDXJSON offers scalability and failure resilience
- Reduces the profile size as it allows partial key lookup
- TODO: Update profiler script to output in CDXJSON
- TODO: Formalize CDXJSON format
- Implementation code is available at:
  GitHub: /oduwsdl/suburi_generator
  GitHub: /oduwsdl/archive_profiler