Metadata Services on the GRID
University of Coimbra
Nuno Santos
ACAT’05 May 25th, 2005

Contents
  Metadata on the GRID
  ARDA-gLite Metadata Interface
  The ARDA Implementation
  Performance study: SOAP vs TCP Streaming

Metadata on the GRID
  Metadata is data about data
  Metadata on the GRID
    Mainly information about files
    Other information necessary for running jobs
    Usually living on DBs
  Need simple interface for Metadata access
  Advantages
    Easier to use by clients - no SQL, only metadata concepts
    Common interface - clients don’t have to reinvent the wheel
  Must be integrated in the File Catalogue
  Also suitable for storing information about other resources

ARDA-gLite Metadata Interface
  ARDA proposed an interface for Metadata access on the GRID
    Designed jointly with the gLite/EGEE team
    Incorporates feedback from GridPP
    Endorsed by the EGEE standards committee (PTF)
    Being implemented in gLite File Catalog (FiReMan)
  Interface concepts
    Metadata - Key-value pairs
    Entry - Entities to which metadata is attached
    Attribute – Holds information about an entry
    Schema – A collection of attributes
    Type – The type (int, float, string,…)
    Name/Key – The name of the attribute
    Value - Value of an entry's attribute
  Entries are associated with schemas
  Think of schemas as tables, attributes as columns, entries as rows

Interface Operations
  Schema management
    void createSchema(String schemaName, Attribute[] attributes)
    void dropSchema(String schemaName)
    void removeSchemaAttributes(String schemaName, String[] attributeNames)
    void addSchemaAttributes(String schemaName, Attribute[] attributes)
  Entry management
    void createEntry(MDEntry[] entries, String[] schemas)
    void removeEntry(String query)
    int setAttributes(String query, Attribute[] attributes)
    Attribute[] listAttributes(String entry)
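
To make the schema and entry operations above concrete, here is a minimal in-memory sketch in Python. It is an illustration under stated assumptions, not the ARDA or gLite implementation: the class name MetadataCatalogue and its dictionaries are hypothetical, entries are selected by plain name rather than by a query string, each entry belongs to a single schema, and all values are kept as strings.

```python
class MetadataCatalogue:
    """Toy in-memory stand-in for a metadata server (illustration only)."""

    def __init__(self):
        self.schemas = {}   # schema name -> {attribute name: type}
        self.entries = {}   # entry name -> (schema name, {attribute name: value})

    def create_schema(self, schema_name, attributes):
        if schema_name in self.schemas:
            raise ValueError(f"schema {schema_name!r} already exists")
        self.schemas[schema_name] = dict(attributes)

    def add_schema_attributes(self, schema_name, attributes):
        self.schemas[schema_name].update(attributes)

    def remove_schema_attributes(self, schema_name, attribute_names):
        for name in attribute_names:
            del self.schemas[schema_name][name]
        # Drop the stored values of the removed "columns" as well
        for schema, values in self.entries.values():
            if schema == schema_name:
                for name in attribute_names:
                    values.pop(name, None)

    def drop_schema(self, schema_name):
        del self.schemas[schema_name]
        self.entries = {entry: (schema, values)
                        for entry, (schema, values) in self.entries.items()
                        if schema != schema_name}

    def create_entry(self, entry, schema_name):
        if schema_name not in self.schemas:
            raise KeyError(f"unknown schema {schema_name!r}")
        self.entries[entry] = (schema_name, {})

    def set_attributes(self, entry, values):
        schema_name, stored = self.entries[entry]
        unknown = set(values) - set(self.schemas[schema_name])
        if unknown:
            raise KeyError(f"attributes not in schema: {sorted(unknown)}")
        stored.update(values)

    def list_attributes(self, entry):
        schema_name, stored = self.entries[entry]
        # One (name, type, value) triple per schema attribute; unset values are None
        return [(name, type_, stored.get(name))
                for name, type_ in self.schemas[schema_name].items()]


catalogue = MetadataCatalogue()
catalogue.create_schema("jobs", {"events": "int", "site": "string"})
catalogue.create_entry("run001", "jobs")
catalogue.set_attributes("run001", {"events": "1500"})
listed = catalogue.list_attributes("run001")
print(listed)
```

The schema plays the role of a table definition and each entry the role of a row, which is the analogy the interface concepts suggest.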

Interface Operations
  Searching and retrieving entries
    MDResult query(MDQuery query)
    MDResult nextQuery(String token, MDQuery query)
    void endQuery(String token)
    Allows either stateful or stateless server implementations
  Datatypes
    MDEntry {
      String entry
      Attribute[] attributes
    }
    MDResult {
      MDEntry[] entries
      String token
      Boolean done
    }
    MDQuery {
      String query
      String queryType
    }
    Attribute {
      String schema
      String name
      String type
      String value
    }
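
One way to read the token and done fields of MDResult is as a paging contract: the client keeps calling nextQuery until done is true. The Python sketch below shows a stateless variant, in which the token simply encodes the offset at which the previous chunk ended, so the server keeps nothing between calls. The token format, the substring-match "query", the chunk size and the toy data are all assumptions made for illustration; the slides do not specify them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MDEntry:
    entry: str
    attributes: list


@dataclass
class MDResult:
    entries: List[MDEntry]
    token: str
    done: bool


ENTRIES = [MDEntry(f"file{i}", []) for i in range(5)]   # toy data set
CHUNK = 2                                               # entries per response


def _page(matches, offset):
    chunk = matches[offset:offset + CHUNK]
    end = offset + len(chunk)
    # The token carries the resume position, so no server-side session is needed
    return MDResult(chunk, token=str(end), done=end >= len(matches))


def query(pattern):
    matches = [e for e in ENTRIES if pattern in e.entry]
    return _page(matches, 0)


def next_query(token, pattern):
    # Stateless: the query is evaluated again and the token says where to resume
    matches = [e for e in ENTRIES if pattern in e.entry]
    return _page(matches, int(token))


def fetch_all(pattern):
    """Client-side loop: keep asking for chunks until the server reports done."""
    result = query(pattern)
    names = [e.entry for e in result.entries]
    while not result.done:
        result = next_query(result.token, pattern)
        names.extend(e.entry for e in result.entries)
    return names


all_names = fetch_all("file")
print(all_names)
```

A stateful server would instead keep an open cursor per token and ignore the repeated query; the client loop stays the same in both cases.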

ARDA Prototype
  Validate proposed interface
  Architecture:
    Metadata organized in a hierarchy
    Schemas can contain sub-schemas
      Can inherit attributes
    Analogy to file system:
      Schema ↔ Directory; Entry ↔ File
  Stability with large responses
    Send large responses in chunks
      Otherwise preparing large responses could crash server
    Stateful server
      DB → Server – Data streamed using DB cursors
      Server → Client – Response sent in chunks
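
The idea of streaming a large response in chunks through a database cursor can be sketched with Python's standard sqlite3 module. This is only an illustration of the technique: the table, the chunk size of 100 rows and the function name stream_rows are assumptions, and the prototype's own server is not written this way.

```python
import sqlite3


def stream_rows(conn, sql, chunk_size=100):
    """Yield lists of at most chunk_size rows, never building the full result."""
    cursor = conn.execute(sql)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows   # each chunk could be sent to the client before the next is read


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (name TEXT, events INTEGER)")
conn.executemany("INSERT INTO entries VALUES (?, ?)",
                 [(f"file{i}", i * 10) for i in range(250)])

chunk_sizes = [len(chunk) for chunk in stream_rows(conn, "SELECT * FROM entries")]
print(chunk_sizes)   # [100, 100, 50]
```

Because only one chunk is held in memory at a time, the cost of preparing a response no longer grows with the size of the result set.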

ARDA Implementation
  Backends
    Currently: Oracle, PostgreSQL, SQLite
  Two frontends
    TCP Streaming
      Chosen for performance
    SOAP
      Formal requirement of EGEE
    Compare SOAP with TCP Streaming
  Also implemented as standalone Python library
    Data stored on filesystem
[Figure: architecture of the ARDA implementation. Labels: Client (twice), SOAP and TCP Streaming frontends, Metadata Server (MDServer), PostgreSQL, Oracle and SQLite backends; Python Interpreter, Metadata Python API, filesystem.]

TCP Streaming Frontend
  Text based protocol (like SMTP, POP3,…)
  Data streamed to client in single connection
  Implementation
    Server – C++, multiprocess
    Clients – C++, Java, Python, Perl, Ruby
  Client: listattr entry
  Server: 0entryvalue1value2…<EOT>
[Figure: sequence diagram "Streaming" between Client, Server and Database. The client sends <operation>, the server creates a DB cursor, and [data] blocks are streamed from the database through the server to the client.]
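
The request/response exchange above can be sketched as a small text protocol in Python. The field separators of the real protocol are not recoverable from this transcript, so the framing used here is an assumption: one field per line, a numeric status on the first line, and the byte 0x04 standing in for <EOT>. The helper names send_response and read_response are likewise hypothetical.

```python
import socket

EOT = b"\x04"   # assumed end-of-transmission marker


def send_response(sock, status, fields):
    # Assumed framing: status code first, then one field per line, then EOT
    sock.sendall(f"{status}\n".encode())
    for field in fields:
        sock.sendall(f"{field}\n".encode())   # fields are written as they become available
    sock.sendall(EOT)


def read_response(sock):
    """Read from the connection until the EOT marker, then split into fields."""
    buffer = b""
    while not buffer.endswith(EOT):
        data = sock.recv(4096)
        if not data:
            raise ConnectionError("connection closed before <EOT>")
        buffer += data
    lines = buffer[:-1].decode().splitlines()
    return int(lines[0]), lines[1:]


# A connected socket pair stands in for a real client/server connection
server, client = socket.socketpair()
client.sendall(b"listattr entry\n")
command = server.recv(4096).decode().strip()
send_response(server, 0, ["entry", "value1", "value2"])
status, fields = read_response(client)
print(command, status, fields)
server.close()
client.close()
```

The whole exchange uses one connection, and the client needs no length prefix because the terminator marks the end of the response.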

SOAP Frontend
  Most operations in interface implemented as simple SOAP calls
  query() - based on iterators
    Initial request – create session
      Open cursor on DB
      Return initial chunk of data and session token
    Subsequent requests
      Client calls nextQuery() using session token
    Termination – session closed when:
      End of data
      Client calls endQuery()
      Client timeout
  Implementations
    Server – gSOAP (C++)
    Clients – Tested WSDL with gSOAP, ZSI (Python), Axis (Java)
[Figure: sequence diagram "SOAP with iterators" between Client, Server and Database. The client's query makes the server create a DB cursor; [data] is streamed from the database to the server and returned to the client in chunks, one per query or nextQuery call.]
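
The session lifecycle described above (create on query, continue with nextQuery, close on end of data, endQuery or timeout) can be sketched as a small Python class. This is a hedged illustration, not the gSOAP server: the class name QuerySessions, the random hex token, the 30 second timeout and the use of a plain Python iterator in place of a database cursor are all assumptions.

```python
import itertools
import time
import uuid


class QuerySessions:
    """Keeps one open 'cursor' per session token (illustration only)."""

    def __init__(self, chunk_size=2, timeout=30.0, clock=time.monotonic):
        self.chunk_size = chunk_size
        self.timeout = timeout
        self.clock = clock
        self.sessions = {}   # token -> (row iterator, time of last use)

    def _expire(self):
        # Client timeout: forget sessions that have been idle for too long
        now = self.clock()
        stale = [t for t, (_, used) in self.sessions.items()
                 if now - used > self.timeout]
        for token in stale:
            del self.sessions[token]

    def _next_chunk(self, token):
        rows, _ = self.sessions[token]
        chunk = list(itertools.islice(rows, self.chunk_size))
        # A short chunk means the cursor is exhausted. If the data ends exactly
        # on a chunk boundary, the client receives one final empty chunk.
        done = len(chunk) < self.chunk_size
        if done:
            del self.sessions[token]                  # end of data closes the session
        else:
            self.sessions[token] = (rows, self.clock())
        return chunk, token, done

    def query(self, rows):
        self._expire()
        token = uuid.uuid4().hex
        self.sessions[token] = (iter(rows), self.clock())   # stands in for a DB cursor
        return self._next_chunk(token)

    def next_query(self, token):
        self._expire()
        if token not in self.sessions:
            raise KeyError("unknown or expired session token")
        return self._next_chunk(token)

    def end_query(self, token):
        self.sessions.pop(token, None)


sessions = QuerySessions(chunk_size=2)
chunk, token, done = sessions.query(["a", "b", "c"])
received = list(chunk)
while not done:
    chunk, token, done = sessions.next_query(token)
    received.extend(chunk)
print(received, len(sessions.sessions))
```

Each nextQuery call is an independent request, so the token is the only thing tying the calls of one query together.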

Current Uses of the ARDA prototype
  Evaluated by LHCb-bookkeeping
    Migrated bookkeeping metadata to ARDA prototype
      20M entries, 15 GB
    Feedback valuable in improving interface and fixing bugs
    Interface found to be complete
    ARDA prototype showing good scalability
  Ganga (LHCb, ATLAS)
    User analysis job management system
    Stores job status on ARDA prototype
    Highly dynamic metadata

Performance Study
  SOAP increasingly used as standard protocol for GRID computing
    Promising web services standard - Interoperability
  Some potential weaknesses
    XML encoding increases message size (4x to 10x typical)
    XML processing is compute and memory intensive
  How significant are these weaknesses?
  What is the cost of using SOAP?
  ARDA metadata implementation ideal for comparing SOAP with a traditional RPC protocol

Benchmark Description
  Protocols
    TCP-S – TCP Streaming
    SOAP – Clients with gSOAP (C++), Axis (Java) and ZSI (Python)
  Operations
    ping – A null RPC
    add – Adds an entry
    get – Gets all attributes of an entry
    get (bulk) – Gets all attributes of several entries in a single operation
  Entries
    60 attributes (ints, floats and strings)
    700 bytes on average
  HTTP Keepalive/Persistent connections
    HTTP Keepalive increases HTTP performance. Should improve SOAP performance.
    gSOAP supports Keepalive. Axis and ZSI don’t.
    TCP-S uses persistent TCP connections to compare with HTTP Keepalive

SOAP Data Overhead
  Measure size overhead of XML encoding
  Ping
    1000 requests
    Minimal payload – less than 5 bytes per request
    SOAP overhead around 8 times
  Get attributes in bulk
    Retrieve 1000 entries
      Around 800KB of application data
    Streaming in TCP
    Iterators with SOAP – 4KB average SOAP packet payload
      With keepalive
    SOAP overhead around 2.5 times

  Total data transferred (in KB)
                           TCP-S   SOAP   Overhead
  Ping                       151   1200        7.9
  Get 1000 Attrs (bulk)      820   2128        2.6
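
The overhead column is simply the ratio of the two totals. The short Python check below recomputes it from the table's figures and also expresses it as a percentage of extra data; the variable names are illustrative.

```python
# Total data transferred in KB, taken from the table: (TCP-S, SOAP)
measurements_kb = {
    "Ping": (151, 1200),
    "Get 1000 Attrs (bulk)": (820, 2128),
}

overhead = {}
for test, (tcp_s, soap) in measurements_kb.items():
    overhead[test] = soap / tcp_s                    # how many times more data SOAP sends
    extra_percent = (soap - tcp_s) / tcp_s * 100     # the same figure as a percentage increase
    print(f"{test}: {overhead[test]:.1f}x, {extra_percent:.0f}% more data than TCP-S")
```

Even in the more favourable bulk case SOAP transfers more than twice the data, that is, an increase of more than 100%.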

SOAP Toolkits performance
  Test protocol performance
    No work done on the backend
    Switched 100 Mbit/s LAN
  Language comparison
    TCP-S with similar performance in all languages
    SOAP performance varies strongly with toolkit
  Protocols comparison
    Keepalive improves performance significantly
    On Java and Python, SOAP is several times slower than TCP-S
[Figure: bar chart "1000 pings". Execution time [s] (0 to 25) for C++ (gSOAP), Java (Axis) and Python (ZSI); series: TCP-S no KA, TCP-S KA, SOAP no KA, SOAP KA.]

Single client results (LAN)
  Compare performance of different operations
    C++ clients (gSOAP)
  When backend must do work, differences between gSOAP and TCP-S are small
  Bulk operations very important for performance
    getBulk 4x faster than get
[Figure: bar chart "1000 pings/1000 Entries". Execution time [s] (0 to 25) for ping, add, get and get Bulk; series: TCP-S no KA, TCP-S KA, gSOAP no KA, gSOAP KA.]

Single client results (WAN)
  Client CERN, server Taiwan
    ≈300 ms latency
  Results dominated by latency
    Execution time at server irrelevant
  Large performance boost from latency hiding techniques:
    keepalive – fewer TCP handshakes
    bulk operations – fewer client/server interactions
[Figure: bar chart "1000 pings/1000 Entries". Execution time [s] (0 to 1400) for ping, add, get and get Bulk; series: TCP-S no KA, TCP-S KA, gSOAP no KA, gSOAP KA; annotation "x5".]
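
Why keepalive and bulk operations help so much on a high-latency link can be seen from a back-of-the-envelope model, sketched here in Python. It is a simplification and not a reproduction of the measurements: it assumes the ≈300 ms figure is a round-trip time, charges one round trip per request/response and one per TCP handshake, and ignores server time, bandwidth and connection teardown.

```python
RTT = 0.3   # seconds; assumes the quoted latency of about 300 ms is a round trip


def estimated_time(operations, keepalive, bulk=False):
    """Rough wall-clock estimate for a latency-dominated client."""
    interactions = 1 if bulk else operations
    # Without keepalive every interaction opens a new connection (one extra round trip)
    handshakes = 1 if keepalive else interactions
    return (interactions + handshakes) * RTT


estimates = {
    "1000 gets, no keepalive": estimated_time(1000, keepalive=False),
    "1000 gets, keepalive": estimated_time(1000, keepalive=True),
    "1 bulk get, keepalive": estimated_time(1000, keepalive=True, bulk=True),
}
for label, seconds in estimates.items():
    print(f"{label}: about {seconds:.1f} s")
```

Under these assumptions keepalive roughly halves the time, while a single bulk request removes almost all of it, which matches the qualitative point that the number of client/server interactions dominates.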

Scalability with Multiple Clients - Pings
  Measure scalability of protocols
    Switched 100 Mbit/s LAN
  TCP-S 3x faster than gSOAP (with keepalive)
  Poor performance without keepalive
    Around 1,000 ops/sec (both gSOAP and TCP-S)
[Figure: plot "1000 pings". Average throughput [calls/sec] (1000 to 10000) against # clients (1 to 100); series: TCP-S no KA, TCP-S KA, gSOAP no KA, gSOAP KA; annotation "Client ran out of sockets".]

Scalability with Multiple Clients - getAttr
  Measure scalability with realistic payload
    Switched 100 Mbit/s LAN
    All tests with keepalive
  Smaller difference between gSOAP and TCP-S
    TCP-S 2x faster (1000 vs 500 entries/sec)
  Poor performance of non-bulk operations
    100 entries/sec
[Figure: plot "1000 entries". Average throughput [entries/sec] (100 to 1000) against # clients (1 to 100); series: TCP-S Single KA, TCP-S Bulk KA, gSOAP Single KA, gSOAP Bulk KA.]

Conclusions
  A common Metadata Interface was developed by ARDA and gLite
    Endorsed by the EGEE standards committee
  Interface validated by ARDA prototype
    Prototype in use by LHCb (bookkeeping, Ganga) and ATLAS (Ganga)
  SOAP performance studied using ARDA implementation
    Toolkit performance varies widely
    Large SOAP overhead (over 100%)