Lily documentation

August 16, 2012

Table of Contents

1 Lily Documentation (1.2) 13

1.1 What is Lily? 13

2 Running Lily 14

2.1 About 14

2.2 Linux, Mac OS X, Windows 14

2.3 Java 1.6 14

2.4 Downloading Lily 14

2.5 Starting Lily 15

2.6 Create Field & Record Types 15

2.7 Define An Index 15

2.8 Loading Records Into Lily 16

2.9 Querying The Solr Index 16

2.10 REST interface 16

2.11 Rebuilding The Index 16

2.12 Next steps 17

2.13 Installing A Lily Cluster 18

2.13.1 Network configuration 18

2.13.2 Installing Hadoop, HBase and ZooKeeper 18

2.13.3 Installing Solr 18

2.13.4 The Lily Server Process 19

2.13.4.1 Configuring Lily to connect to your HBase, Hadoop & ZooKeeper 19

2.13.4.2 Running The Lily Server Process 19

2.14 Upgrade from Lily 1.1 20

2.14.1 During upgrading 20

3 Architecture 22

3.1 Distribution 22

3.2 Main components 24

3.2.1 HBase 26

3.2.2 HDFS 26

3.2.3 The repository 26

3.2.4 The Write Ahead Log 27

3.2.5 The Message Queue 27

3.2.6 The Indexer 27

3.2.6.1 Denormalization 27

3.2.7 Solr 28

3.2.8 ZooKeeper 28

4 Repository 29

4.1 Repository Model 29

4.1.1 Basic concepts and terminology 29

4.1.2 No hierarchy 30

4.1.3 Record identification 31

4.1.4 Records, field scopes, versions 31

4.1.4.1 Records 31

4.1.4.2 Field scopes & versions 31

4.1.5 Field types 32

4.1.5.1 Value types 33

4.1.6 Record types 35

4.1.6.1 Record type versioning 35

4.1.6.2 Mixins 35

4.1.7 The record – record type relationship 36

4.1.8 Record type as a guide rather than a straightjacket 36

4.1.9 Variants 37

4.1.9.1 Why variants? 37

4.1.9.2 Cross-variant data 38

4.1.10 Operations 38

4.2 How To Create A Schema 38

4.3 How To Create Records 38

5 Indexer 40

5.1 Setting Up A Generic Index 41

5.1.1 Start Solr with the dynamic_solr_schema.xml 41

5.1.1.1 Using standalone Solr 41

5.1.1.2 Using launch-test-lily 42

5.1.2 Define the index in Lily 42

5.1.3 And we are done 42

5.2 Indexer Tutorial 42

5.2.1 Overview 42

5.2.2 The Lily schema 43

5.2.3 Version Tags 43

5.2.4 Indexer configuration sample 43

5.2.5 Solr configuration sample 45

5.2.6 Declaring an index 45

5.2.7 Triggering indexing 46

5.2.8 Committing the index 46

5.2.9 Querying 46

5.2.10 Debugging indexing 47

5.2.11 Further information 47

5.3 Managing Indexes 47

5.3.1 About multiple indexes 47

5.3.2 Index states 48

5.3.2.1 The general state 48

5.3.2.2 The update state 48

5.3.2.3 The batch build state 49

5.3.3 Performing common index actions 49

5.3.3.1 General notes 49

5.3.3.2 Knowing what indexes exist 50

5.3.3.3 Creating an index 50

5.3.3.4 Updating the indexer configuration of an index 51

5.3.3.5 Updating other index properties 51

5.3.3.6 Deleting an index 51

5.3.3.7 Performing a batch build (rebuilding an index) 51

5.3.3.8 Interrupting a batch build 52

5.4 Indexer Configuration 53

5.4.1 Indexerconf: Version Tag Based Views 54

5.4.2 Indexerconf: Records 55

5.4.2.1 Evaluation of the record rules 55

5.4.2.2 matchVariant expression 55

5.4.2.3 Version tags 56

5.4.3 Indexerconf: Formatters 56

5.4.3.1 Built-in formatter 56

5.4.4 Indexerconf: Fields 57

5.4.4.1 Correspondence between Lily LIST-type fields and Solr multi-value fields 57

5.4.4.2 Index field name 57

5.4.4.3 Order is important 58

5.4.4.4 Determination of the relevant index Fields for an input record 58

5.4.4.5 Content extraction 58

5.4.4.6 Index fields that use a value from the current record 58

5.4.4.7 Index fields that use a value from a nested record or that dereference links towards other records 58

5.4.4.8 Index fields that dereference towards less-scoped variants of the same record 59

5.4.4.9 Denormalized information and index updating 59

5.4.5 Indexerconf: Dynamic Index Fields 60

5.4.5.1 Matching fields 60

5.4.5.2 The name 62

5.4.6 Indexerconf: Indexing The RecordType 63

5.5 Required Fields In The Solr Schema 64

5.6 Solr Index Sharding 65

5.6.1 Introduction 65

5.6.2 Shard selection 65

5.6.2.1 Sharding configuration (shard selection configuration) 66

5.6.3 Example usage 67

5.7 Solr Versions 68

5.7.1 Using Solr 1.4(.1) 68

5.8 Indexer Error Handling 68

5.8.1 Solr unreachable 68

5.8.2 Solr misconfiguration 68

5.8.3 Indexerconf misconfiguration 69

5.8.4 General indexer errors 69

5.9 Indexer Architecture 69

5.9.1 The indexer model 69

5.9.2 The indexer engine 70

5.9.3 The indexer worker 70

5.9.4 The indexer master 70

5.9.5 The batch build MapReduce job 71

5.9.6 The link index 71

6 Tools 72

6.1 Import Tool 72

6.1.1 The import JSON format 72

6.2 mbox Import Tool 73

6.2.1 About 73

6.2.2 Mail usage run-through 74

6.2.2.1 Get some mbox files 74

6.2.2.2 Run HBase & Lily 74

6.2.2.3 Create the schema 74

6.2.2.4 Run SOLR and define an index 74

6.2.2.5 Run the import 75

6.3 Tester Tool 75

7 REST (HTTP+JSON) API 77

7.1 REST Interface Tutorial 77

7.1.1 Abstract 77

7.1.2 Creating a schema 77

7.1.2.1 Creating the name field type 77

7.1.2.2 Creating the price field type 78

7.1.2.3 Creating the product record type 79

7.1.3 Creating records 80

7.1.3.1 Create record using POST, server assigns record ID 80

7.1.3.2 Creating a record using PUT, assigning the record ID yourself 81

7.1.4 Reading records 82

7.1.5 Creating a record with a blob field 83

7.1.6 Creating A Record With A Complex Field 84

7.1.7 Scanning Over Records 86

7.2 REST API Reference 87

7.2.1 About the REST interface 87

7.2.2 JSON Formats 87

7.2.2.1 About JSON 87

7.2.2.2 Content-Type 88

7.2.2.3 Namespaces 88

7.2.2.4 Field type format 88

7.2.2.5 Record type format 89

7.2.2.6 Record format 89

7.2.2.7 List format 91

7.2.2.8 POST format 91

7.2.2.9 Record Scan Format 91

7.2.2.10 Filter Format 93

7.2.3 REST Protocol 94

7.2.3.1 Nodes / connecting / load balancing 94

7.2.3.2 Error responses 94

7.2.3.3 Method tunneling 95

7.2.3.4 Resources for field types 95

7.2.3.5 Resources for record types 96

7.2.3.6 Resources for records 98

7.2.3.7 Resources for blobs 102

7.2.3.8 Resources for scanners 103

7.2.3.9 Resources for index management 103

7.2.3.10 Resources for the rowlog 104

8 Java Developers 106

8.1 Repository API Tutorial 106

8.1.1 Before reading this 106

8.1.2 API design 106

8.1.3 API tutorial code 106

8.1.4 API reference 106

8.1.5 API run-through 106

8.1.5.1 Project set-up 106

8.1.5.2 Connecting to Lily 107

8.1.5.3 Prerequisites 108

8.1.5.4 Creating a record type 108

8.1.5.5 Updating a record type 109

8.1.5.6 Creating a record 110

8.1.5.7 Creating a record with a user-specified ID 111

8.1.5.8 Updating a record 112

8.1.5.9 Updating a record via read 112

8.1.5.10 Updating versioned-mutable fields 113

8.1.5.11 Updating a record conditionally 113

8.1.5.12 Reading a record 114

8.1.5.13 Working with blob fields 114

8.1.5.14 Creating variants 116

8.1.5.15 Link fields 117

8.1.5.16 Complex Fields 118

8.2 Creating Records And Schema Using The Builder API 120

8.2.1 Introduction 120

8.2.2 Creating A Schema 120

8.2.3 Creating Records 123

8.3 Scanning Records And Record Locality 126

8.3.1 Records are stored in order of record ID 126

8.3.2 Scanning over records 127

8.3.2.1 Full table scan 127

8.3.2.2 Start and stop record ID 127

8.3.2.3 Filters 128

8.3.2.4 Returning a subset of fields 128

8.3.2.5 Scanner Caching 128

8.3.2.6 Scanners directly read from HBase region servers 129

8.3.2.7 Scanners: summary 129

8.3.2.8 Using the CLI tool lily-scan-records 129

8.3.2.9 Variants and scanners 129

8.3.3 Record ID as your primary index 129

8.3.4 Scanners And MapReduce 130

8.4 Setup New Maven Project From Archetype 130

8.5 Importing A Schema From JSON Programmatically 130

8.6 Writing Test Cases Against Lily 131

8.6.1 First Steps 132

8.6.1.1 Maven Settings 132

8.6.1.2 Write A Test Class 134

8.6.1.3 Run The Test With Lily Stack Embedded 135

8.6.1.4 Create LilyProxy On The Class Level 135

8.6.1.5 Connect To Independently Launched Lily 136

8.6.2 Service Configuration 137

8.6.2.1 General remarks 137

8.6.2.2 Solr Schema 137

8.6.2.3 Lily Conf & Plugins 137

8.6.3 Utilities 138

8.6.3.1 Index Schema 138

8.6.3.2 WAL and MQ processed and Solr Index commited 138

8.6.3.3 Launching A Batch Index Build 139

8.6.4 Advanced 139

8.6.4.1 User defined storage directory 139

8.6.5 More On The Lily Test Framework 139

8.7 MapReduce Integration 140

8.7.1 Using Lily As Input For MapReduce Jobs 140

8.7.2 Using Lily As Output For MapReduce Jobs 141

8.7.3 Getting Started Writing A Lily MapReduce Job 141

9 Repository (lily-server) plug-ins 143

9.1 Repository Decorators 143

9.1.1 Overview 143

9.1.1.1 What 143

9.1.1.2 Deployment 143

9.1.1.3 The Interface 144

9.1.2 Creating A Repository Decorator 144

9.1.3 Your First Decorator 144

9.1.3.1 Generate A Project 144

9.1.3.2 Implement RepositoryDecorator 145

9.1.3.3 Disable other sample plugins 145

9.1.3.4 Build 145

9.1.3.5 Deploy 145

9.1.3.6 Edit Lily Configuration 146

9.1.3.7 Restart Lily Server 146

9.1.3.8 Next Steps 146

9.2 Record Update Hooks 147

9.2.1 Overview 147

9.2.1.1 What 147

9.2.1.2 The Interface 147

9.2.2 Creating a RecordUpdateHook 148

9.3 Lily Server Plugin Mechanism 148

10 Bulk Imports 150

11 Admin 152

11.1 Table creation settings 152

11.2 Optimizing HBase Request Load Balancing 152

11.2.1 Record & linkindex tables 153

11.2.2 Rowlog tables (rowlog-mq and rowlog-wal) 153

11.2.3 Fixing bad region assignment 153

11.3 Metrics 153

11.3.1 JMX 154

11.3.2 Ganglia 154

11.4 ZooKeeper Connectionloss And Session Expiration Behavior 155

12 Glossary 156

12.1 index entry 156

13 Lily Hackers 157

13.1 Getting Started 157

13.1.1 Lily Source Code 157

13.1.1.1 Getting the sources 157

13.1.1.2 Building Lily 157

13.1.1.3 Running Lily 157

13.1.1.4 Building a binary distribution 158

13.1.2 Repository Model To HBase Mapping 158

13.1.2.1 Records 158

13.1.2.2 Record types & field types 160

13.1.3 Blobstore 160

13.1.3.1 General 160

13.1.3.2 API and usage 160

13.1.3.3 Design 161

13.2 Releasing 164

13.2.1 Building A Lily Release 164

13.2.1.1 Pre-release checks 164

13.2.1.2 Change versions 164

13.2.1.3 Configure Lily repository access 165

13.2.1.4 Run Maven release:prepare 165

13.2.1.5 Building the distribution 167

13.2.1.6 Post-release work 167

13.2.2 Publishing The Lily Maven Site (javadocs) 169

13.2.3 Branching the docs 169

13.2.4 Pre-Release Verifications 170

13.3 Guidelines 171

13.3.1 Code Style 171

13.3.1.1 Java Code style 171

13.3.1.2 Non-Java source files 174

13.3.2 Programming Guidelines 174

13.3.2.1 InterruptedException 174

13.3.2.2 ZooKeeper 174

13.4 Lily Maven Repository Access 175

13.5 Incompatible changes (by commit) 175

13.6 Creating Snapshots Of 3d Party Projects 177

13.6.1 Building HBase Snapshot 177

13.6.1.1 Check out HBase 177

13.6.1.2 Change HBase version number 177

13.6.1.3 Build 178

13.6.1.4 Test 178

13.6.1.5 Deploy 178

13.6.1.6 Make binary build available 178

13.6.1.7 Revert version number changes 179

13.6.2 Building Kauri Snapshot 179

13.6.2.1 Check out Kauri 179

13.6.2.2 Change Kauri version number 179

13.6.2.3 Deploy 179

13.6.2.4 Revert version number changes 180

13.6.3 Deploying SOLR war To Maven 180

1 Lily Documentation (1.2)

1.1 What is Lily?

Lily is a scalable repository for storing, searching and retrieving records (or content items, documents, objects, ...). It is a distributed server application that fuses Apache HBase and Solr, and is designed to be used by front-end applications (CMS, DMS, DAM, ...) through the Lily API (Java or REST).

Getting started

To install Lily and give it a quick spin, see Running Lily (page 14). To get an overview of all available documentation, have a look at our sitemap1.

Printing tip: to print an individual document, change the .html extension in the URL to .pdf. To print a collection of documents, choose 'Document Basket' in the Tools menu, select 'Select documents from the navigation tree', select the documents you want to print, and then choose 'Get documents aggregated as PDF'.

This is the documentation for Lily [unresolved variable: version]. The documentation for other releases can be found through our documentation service2.

Notes

1. ../../../../lily-docs-current/ext/toc/

2. ../../../../

2 Running Lily

2.1 About

This guide will take you through a first Lily experience, with a sample schema about books and authors. This will only take a few minutes, but makes use of built-in versions of Hadoop/HBase, which means your data won't be saved between server restarts. It's a good way to familiarize yourself with the deployment of Lily before running it on a real install of Hadoop/HBase/ZooKeeper (page 18).

If at any point you run into problems, please let us know1 on the Lily mailing list.

2.2 Linux, Mac OS X, Windows

Linux is the only supported production platform for Hadoop.

For development purposes, you can also use other Unix-variants like Mac OS X.

Windows is not supported.

2.3 Java 1.6

You need to have Sun/Oracle Java 1.62 installed. An environment variable JAVA_HOME should point to where it is installed.

If everything is fine, you should be able to execute:

$JAVA_HOME/bin/java -version

and it should show something like:

java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
Java HotSpot(TM) Server VM (build 17.0-b16, mixed mode)

2.4 Downloading Lily

Download the Lily binary distribution: lily-1.2.1.tar.gz3.

2.5 Starting Lily

For testing purposes, Lily ships with a command called launch-test-lily which starts Lily and all its dependent services in one JVM. The started services are: HDFS, HBase, MapReduce's JobTracker and TaskTracker, ZooKeeper, Solr and Lily-server itself.

So start this now as follows:

bin/launch-test-lily -s samples/books/books_sample_solr_schema.xml -c 5

The -s option specifies the Solr schema we need for our demo; the -c option specifies that the Solr index will be auto-committed every 5 seconds.

Wait a few moments for it to be started completely, until you see this:

-----------------------------------------------
Lily is running

This setup will store its data in a temporary directory which is lost each time you stop or restart launch-test-lily.

See further on for running against a 'real' HBase & co.

2.6 Create Field & Record Types

Before putting content in Lily, you need to create some field types and record types (page 29).

For the purpose of this first run, we will upload some types for managing books and authors using the import tool (page 72):

bin/lily-import -s samples/books/books_sample.json

The -s option specifies that we only want to upload the schema at this point (the JSON file contains records too).

Behind the scenes, this command connects to ZooKeeper to find out the available Lily servers and picks one of them at random to talk to.

2.7 Define An Index

Define an index using:

bin/lily-add-index -n books -c samples/books/books_sample_indexerconf.xml -s shard1:http://localhost:8983/solr

The books_sample_indexerconf.xml file is the configuration for the indexer (page 40): it describes what records should be indexed and how the fields of the records should be mapped to Solr fields.

The lily-add-index command will modify the configuration of indexes stored in ZooKeeper. In response to this, the Lily server(s) will put everything in place to keep the index up to date: they register a message queue subscription and start the indexing processes.

2.8 Loading Records Into Lily

Use the import tool to upload some records into Lily:

bin/lily-import samples/books/books_sample.json

2.9 Querying The Solr Index

Browse to

http://localhost:8983/solr/admin/

Type 'frankenstein' in the input box and press search; you should get a result with one document in it. In some browsers you need to view the page source to see the XML result.

As mentioned above, it can take up to 5 seconds for the new records to become visible in the index, so if you were very fast you may have to retry.
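If you prefer to script this check, you can query Solr's standard search handler over HTTP. Below is a minimal Java sketch using only JDK classes; the URL and port assume the default launch-test-lily setup from this walk-through.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SolrQueryCheck {
    public static void main(String[] args) throws Exception {
        // Solr's standard search handler; 'q' is the query string.
        URL url = new URL("http://localhost:8983/solr/select?q=frankenstein");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // prints the raw XML response
        }
        reader.close();
    }
}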

2.10 REST interface

There are two protocols available to talk to Lily: an RPC-style binary one based on Avro, which is used when you use the client Java API (page 106), and a REST-style API (page 77) (HTTP+JSON).

The port on which the REST interface is listening is printed on repository startup; by default it is 12060:

Protocol [HTTP/1.1] listening on port 12060

For example, here is how you can access one of the records created earlier by the import:

http://localhost:12060/repository/record/USER.mary_shelley
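To fetch this record programmatically, a minimal Java sketch (plain JDK classes; the record ID and port are taken from the example above) could look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestReadRecord {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:12060/repository/record/USER.mary_shelley");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json"); // ask for the JSON record format
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // the record as JSON
        }
        reader.close();
        conn.disconnect();
    }
}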

2.11 Rebuilding The Index

Usually an index is kept up to date incrementally by listening to repository events. Sometimes it can be useful to rebuild the index: when the configuration has changed, when the index was defined after content was already loaded into Lily, or when the Solr index is lost. It is also possible to disable incremental index updating completely, and only update the index through batch rebuilds.

Let's quickly run through how to trigger a batch index build.

A batch index build is triggered by changing the batch build state of an index to BUILD_REQUESTED, as follows:

bin/lily-update-index -n books --build-state BUILD_REQUESTED

In response to this state change, Lily will launch a Hadoop job to perform the index build, and change the batch build state to BUILDING. This can be observed by running lily-list-indexes:

bin/lily-list-indexes

which shows output like this:

books
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: BUILDING
  + Queue subscription ID: IndexUpdater_books
  + Solr shards:
    + shard1: http://localhost:8983/solr
  + Active batch build:
    + Hadoop Job ID: job_20101105103522869_0001
    + Submitted at: 2010-11-05T10:38:33.913+01:00
    + Tracking URL: http://localhost:45989/jobdetails.jsp?jobid=job_20101105103522869_0001

Notice it also shows the ID of the Hadoop Job and a tracking URL which will take you to a web UI that displays more information about the progress of the job.

After a little while the job will be finished, and when you run lily-list-indexes again, the batch build state will be INACTIVE and information about the last run batch build will be available:

books
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_books
  + Solr shards:
    + shard1: http://localhost:8983/solr
  + Last batch build:
    + Hadoop Job ID: job_20101105103522869_0001
    + Submitted at: 2010-11-05T10:38:33.913+01:00
    + Success: true
    + Job state: succeeded
    + Tracking URL: http://localhost:45989/jobdetails.jsp?jobid=job_20101105103522869_0001
    + Map input records: 2
    + Launched map tasks: 1
    + Failed map tasks: 0
    + Index failures: 0

2.12 Next steps

Now you know the basics of running Lily. Next steps include:

• Read about the Lily architecture (page 22) and the repository model (page 29).

• Install a real Hadoop & HBase setup (page 18), either standalone or a cluster.

• Write code which performs CRUD operations on Lily: see Repository API tutorial (page 106)

• Learn more about indexes, such as configuring the mapping, launching batch rebuilds, defining multiple indexes, etc: see Indexer (page 40)

• If you want to work with Lily trunk or hack on its sources, see Lily sources (page 157).

• Try the mail archives sample, see the mbox import tool (page 73).

As mentioned before, the HBase, Hadoop, ZooKeeper and Solr instances launched using launch-hadoop and launch-solr store their data in a temporary directory which is lost when you stop them.

2.13 Installing A Lily Cluster

For instructions on how to install HBase, Hadoop, ZooKeeper and Solr, we refer to the installation guides of these individual products. Below we give advice on what versions to use and how to configure Lily to connect to your installation.

Lily Enterprise4 includes comprehensive tools for installation, administration and cluster deployments, and Debian/RPM-packaged versions of Lily and related software. It is also extensively tested against the Cloudera Distribution of Hadoop5.

To provide the comfort of tested and supported releases of the Hadoop stack, we have selected Cloudera's Hadoop distribution6. Similar HBase (0.90+) and Hadoop versions should also work, as long as the RPC interface is compatible.

2.13.1 Network configuration

Make sure inter-host name resolving is set up correctly: on each server, the local hostname should resolve to the IP address of the network interface (eth0), and reverse resolving that IP address should give the same hostname again (not localhost, and not the hostname with some domain suffix appended to it). It is OK to fix this using /etc/hosts instead of changing DNS, but in that case it should be done consistently on each node so that the nodes know each other by name.
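A quick way to verify this on a node is the small Java sketch below, which uses the standard InetAddress API; run it on each node and check that forward and reverse resolution agree.

import java.net.InetAddress;

public class HostnameCheck {
    public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        // Forward resolution: local hostname -> IP address
        System.out.println("Hostname: " + local.getHostName());
        System.out.println("Address:  " + local.getHostAddress());
        // Reverse resolution: IP address -> hostname;
        // this should match the hostname above and not be 'localhost'
        InetAddress byAddress = InetAddress.getByName(local.getHostAddress());
        System.out.println("Reverse:  " + byAddress.getCanonicalHostName());
    }
}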

2.13.2 Installing Hadoop, HBase and ZooKeeper

We recommend using the versions from Cloudera CDH3u3, available from Cloudera downloads7.

We refer to the Cloudera documentation, and the generic Hadoop, HBase and ZooKeeper documentation, for more information on how to set up a Hadoop/HBase cluster.

HBase: deploy extra jar

The following jar should be copied from the Lily distribution to the hbase lib directory on each of the HBase nodes:

lib/org/lilyproject/lily-hbase-ext/[unresolved variable: version]/lily-hbase-ext-[unresolved variable: version].jar

In principle, this should only be necessary if you make use of Lily's blob fields.

2.13.3 Installing Solr

We have developed against Solr [unresolved variable: solrVersion]. Other versions should work as long as the REST interface and the javabin format are compatible. In particular, for Solr 1.4 you will need to switch to the XML format, as the javabin format is not compatible; this is explained in Solr Versions (page 68).

Download from the Solr website8.

2.13.4 The Lily Server Process

Lily consists of a lily-server process which you can run on any number of nodes. The Lily clients will talk to this Lily server process, which in turn will make use of HBase and Solr. In contrast to HBase and Solr, the Lily server process is lightweight in terms of memory requirements. Typically, you will run a lily-server along with each HBase region server.

2.13.4.1 Configuring Lily to connect to your HBase, Hadoop & ZooKeeper

To configure Lily you need to know:

• the hostname(s) and port number(s) (typically 2181) of the ZooKeeper ensemble.

• the hostname and port number (typically 9001) of the MapReduce job tracker.

• the hostname and port number (typically 8020) of the HDFS name node.

Then adjust the following files:

• conf/general/hbase.xml

• conf/general/mapreduce.xml

• conf/general/zookeeper.xml

• conf/repository/repository.xml

Note that you have to specify the ZooKeeper information twice: once for HBase, and once for Lily. You can use the same or different ZooKeeper installations for them.

As with Hadoop/HBase, you need to make sure these configuration changes are deployed to all your Lily nodes.

2.13.4.2 Running The Lily Server Process

Lily's server process is launched either by using the following shell script:

bin/lily-server

or by using the Java service wrapper (recommended):

service/lily-service start

The very first time you start Lily, startup will be a bit slower since the tables still need to be created.

When Lily is started, you will see a line like this logged in logs/lily-server:

[INFO ] <2011-10-14 16:33:22,386> (org.kauriproject.runtime.info): Kauri Runtime started [October 14, 2011 4:33:22 PM CEST]

In case you started Lily using the shell script, this line will also be printed to standard out.

If starting Lily via the service wrapper fails, be sure to check logs/lily-wrapper.log.

If the lily-server JVM is running and the last line printed in the log is the following

[INFO ] <2011-10-14 16:33:19,281> (org.kauriproject.runtime.info): Starting module general - /../lily-general-module-[unresolved variable: version].jar

then Lily is trying to connect to ZooKeeper. At startup it will retry this for an extended amount of time (configurable through conf/zookeeper/zookeeper.xml) to cope with services not being started in order.

2.13.4.2.1 Identifying the lily-server process

Using the Java command jps you can see an overview of the running Java processes (you might have to run this via sudo).

Depending on whether you start Lily via the shell script or the service wrapper, you will see a different class name.

$ jps -l

# in case of the lily-server shell script
24044 org.kauriproject.launcher.RuntimeCliLauncher

# in case of the service wrapper
23431 org.tanukisoftware.wrapper.WrapperSimpleApp

2.14 Upgrade from Lily 1.1

ATTENTION: the upgrade tool doesn't upgrade the linkindex table. If you use the Lily LINK value type and want to upgrade between Lily 1.1 and 1.2, please contact us.

2.14.1 During upgrading

After installing the Lily 1.2 software, but before launching Lily 1.2 for the first time, the following should be done.

Convert record IDs

The way record IDs are encoded into HBase row keys has changed slightly since Lily 1.1. This change is backwards incompatible and requires an upgrade of the record table. The upgrade works by copying all your existing records into a new table (in the new format); then the old record table is dropped and the new one is renamed to take its place.

The creation of the new table is done using the upgrade tool called lily-upgrade-from-1.1. Renaming/dropping the record table is something you perform yourself using commands on the HBase shell, as described below.

Use the -h option to list all options and the usage of the command.

Performs upgrade of the HBase storage format from Lily 1.1 to Lily 1.2

Be sure to read the Lily documentation on how to use this tool!

usage: lily-upgrade-from-1.1 [-confirm] [-dumplog] [-h] [-log <config>]
       [-tn <tablename>] [-to <filename>] [-v] [-wtw] [-z <connection-string>]
 -confirm                            Confirm you want to start the upgrade.
 -dumplog                            Dump default log4j configuration
 -h,--help                           Shows help
 -log <config>                       log4j config file (.properties or .xml)
 -tn,--table-name <tablename>        Destination table name, default record_lily_1_2
 -to,--table-options <filename>      Table creation options file, like conf/general/tables.xml
 -v,--version                        Shows the version
 -wtw,--write-to-wal                 Enable write to WAL, off by default.
 -z,--zookeeper <connection-string>  ZooKeeper connection string: hostname1:port,hostname2:port,...

WARNING! Lily should not be running when executing this upgrade tool. Only HBase, Hadoop and Zookeeper should be running.

In the instructions below all the default settings are used: table-name = record_lily_1_2, write-to-wal = off, -z = localhost:2181

• Your Lily cluster should not be running yet.

• Run the conversion tool:

$LILY_HOME/bin/lily-upgrade-from-1.1

This will ask you if you are sure. If you are, run it with the -confirm option:

$LILY_HOME/bin/lily-upgrade-from-1.1 -confirm

• When the task has completed, the tool prints some commands that must be executed on the HBase shell. We've listed them here for your convenience. Open up the HBase shell:

$HBASE_HOME/bin/hbase shell

> disable 'record_lily_1_2'

> disable 'record'

> drop 'record'

• Run the following command from the command line:

$HBASE_HOME/bin/hbase org.jruby.Main $HBASE_HOME/bin/rename_table.rb record_lily_1_2 record

• Once more in the HBase shell:

> disable 'record'

> enable 'record'

Now you can restart your Lily servers.

You can do a little test that outputs a number of your records to make sure everything ran properly:

$LILY_HOME/apps/scan-records/target/lily-scan-records -p -l 10

Notes

1. http://groups.google.com/group/lily-discuss

2. http://www.oracle.com/technetwork/java/javase/downloads/index.html

3. http://lilyproject.org/release/1.2/lily-1.2.1.tar.gz

4. http://www.lilyproject.org/lily/about/enterprise.html

5. http://www.cloudera.com/hadoop/

6. http://www.cloudera.com/hadoop/

7. http://www.cloudera.com/downloads/

8. http://lucene.apache.org/solr/

3 Architecture

3.1 Distribution

Lily has a distributed architecture. This distribution is manifested in two ways. First, there are nodes (= systems, servers) that perform different functions, causing a functional layering. Second, there are multiple nodes that perform the same function, for purposes of scalability and fault-tolerance.

This is illustrated in the following figure.

In this diagram, the Lily node serves as a black box for different components, which are described further on.

Not every box in this diagram necessarily corresponds to a physical server. While multiple processes of the same kind should be run on different servers, you can run e.g. a Lily node, an HBase region node and an HDFS data node on the same server.

While the diagram shows three nodes of every kind, the actual numbers can differ for each type of node, depending on the needs.

For some kinds of nodes, it does not matter which node to connect to. For example, each client can connect to any arbitrary Lily node. For others, the node to connect to depends on the one that hosts the data. For example, a Lily node that wants to read a row from HBase will have to connect to the HBase node that hosts this row.

A Lily client does not connect to one fixed endpoint. It decides itself which Lily node to connect to, and directly talks to HDFS and HBase nodes when appropriate.

3.2 Main components

The diagram below shows the main components of the Lily content repository and the connections between them. For clarity, this figure shows only one instance of each component, but remember that there can be any number of them.

What we referred to as “Lily node” in the above section on distribution consists of different independent components such as the repository, the indexer, and the message queue. These could be run as different processes or in one process; this is of little importance for our discussion here.

3.2.1 HBase

Lily uses Apache HBase1 for the storage of fine-grained data. HBase is modeled after Google's BigTable. HBase has little in common with the SQL databases everyone knows: it does not offer much querying, nor transactions. Instead, it offers scalability to very large amounts of data (billions of rows) by adding more hardware as needed. No manual repartitioning of the data is necessary. It also handles failing nodes automatically.

HBase has a special data model, whereby rows can contain very large numbers of columns, and columns without a value do not take space, so it is ideal for sparse data structures. The BigTable people concisely call it “a sparse, distributed, persistent multidimensional sorted map”. HBase does not know data types; it handles everything as bytes.

HBase stores its data on HDFS, described next.

3.2.2 HDFS

HDFS2 is the Hadoop distributed file system, thus a file system that spans across nodes. It is modeled after GFS, the Google File System. A file in HDFS is stored multiple times in the cluster (by default 3 times), so that if a node fails, the data is still available elsewhere. HDFS is best used for the storage of larger files. The namespace of the file system (the link between the names of the files and where they are stored) is maintained on one system, called the name node. The number of files one can store in HDFS is limited by the amount of memory in that system. Practically this means you can still store millions of files on it, but in Lily we will store smaller blobs in HBase, to avoid quickly hitting this limit. HDFS has a focus on high throughput rather than low latency.

3.2.3 The repository

The repository provides the basic record CRUD functionality. Clients connect to the repository using an Avro-based protocol. Avro3 is an efficient binary serialization system. The repository connects to HBase using the HBase Java API, which talks HBase RPC, also based on an efficient binary serialization.

The basic entity managed by the repository is called a record; see the repository model description (page 29). When reading a record, a client can specify to read only some fields, and when updating a record, a client only needs to communicate the changed fields.

The Java API exposed by the repository is based on simple data objects and service-style interfaces. This API approach makes “playing” with the data objects straightforward.

The ID of the record can either be assigned by the user when creating the record, or is automatically assigned by the repository, in which case it is a UUID.

Fields in a record can be blobs. These blobs are stored either in HBase or on HDFS, depending on a size-based strategy. Smaller blobs like HTML pages can be stored in HBase, while bigger blobs that should be handled as streams are stored on HDFS (see also the discussion on HDFS above).

One record, which can contain multiple versions, maps onto (page 158) one row in HBase. This makes a record the unit of atomic manipulation.

3.2.4 The Write Ahead Log

When creating or updating a record, often secondary actions (= post-update actions) will need to happen, the most common example of which is keeping indexes up to date.

If we naively updated the row in HBase, and then updated the corresponding indexes, there would be a possibility that the indexes would not be updated if the repository process dies.

In more traditional architectures, transactions are used to assure that multiple actions happen as one atomic operation. For our use-cases, full transaction support is not needed. We do not need atomicity, nor do we need rollback. All secondary actions are considered to be subordinate actions which should succeed, and if they fail, they should not invalidate the operation on the record.

The solution we use in Lily is a write-ahead log, or WAL for short. Before performing an action on the repository, we write our intention to do this to the WAL. Then we update the repository, and confirm this to the WAL. Then the secondary actions are performed, each time confirming to the WAL. If at any point the process is interrupted, upon restart the WAL can be checked to see how far we got and to perform any remaining actions.

Lily's WAL is unrelated to the HBase WAL. It is also conceptually different, since Lily does not write the data of a record update to its WAL. Its only purpose is to guarantee the execution of the secondary actions.
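To make the sequence concrete, here is a schematic Java sketch of the protocol just described. All types and method names here are hypothetical illustrations of the idea, not Lily's actual rowlog/WAL API.

import java.util.List;

// Hypothetical types, for illustration only.
interface SecondaryAction { void execute() throws Exception; }

interface Wal {
    long writeIntent(String recordId) throws Exception; // declare what we are about to do
    void confirmUpdate(long entryId) throws Exception;  // main repository update done
    void confirmAction(long entryId) throws Exception;  // one secondary action done
    void remove(long entryId) throws Exception;         // everything done, discard the entry
}

class WalProtocolSketch {
    void updateRecord(Wal wal, Runnable repositoryUpdate, String recordId,
                      List<SecondaryAction> actions) throws Exception {
        long entry = wal.writeIntent(recordId); // 1. write our intention to the WAL
        repositoryUpdate.run();                 // 2. update the record row in HBase
        wal.confirmUpdate(entry);               // 3. confirm the repository update
        for (SecondaryAction action : actions) {
            action.execute();                   // 4. e.g. push a message onto the queue
            wal.confirmAction(entry);           // 5. confirm each secondary action
        }
        wal.remove(entry);                      // 6. done; on restart, unconfirmed entries
                                                //    are replayed to finish remaining actions
    }
}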

3.2.5 The Message Queue

Updating indexes does not need to happen synchronously with the update of the record, while the client is waiting for its response. Rather, this can be done asynchronously. The usual solution to this is to make use of a message queue. Pushing a message onto the queue is a kind of secondary action that needs to happen when updating a record, and our WAL will assure that the message is pushed to the queue even if the repository dies before it gets to that.

Now, rather than bringing an existing queue technology into our system, which would have its own persistence, admin needs, failover solution, etc., Lily uses a lightweight message queue that reuses HBase for persistence.

3.2.6 The Indexer

The role of the Indexer is to keep the Solr index up to date when records are created, updated or deleted. For this purpose, the Indexer listens to the message queue.

The indexer maps Lily records onto Solr documents, by deciding (based on configuration) which records and what fields of the records need to be indexed. For blob fields, it can perform content extraction using the Tika4 library.

3.2.6.1 Denormalization

Lily records can contain link fields. Link fields are links to other records. During indexing, you can include information from linked records within the index of the current record. This is called denormalization. Information can be denormalized by following links multiple levels deep. Denormalization at index time is an alternative for SQL-join-like functionality at query time. Join queries are not available in Lucene, and complicated to do with sharded databases in general. Denormalization makes querying faster and easier, but complicates indexing.

Denormalization assumes you know beforehand (= when indexing) what sort of queries you will want to do on linked content.

A consequence of denormalization is that when a record is updated, the index entries of other records might also become invalid, when they contain information from the updated record. The Lily Indexer will automatically update such index entries. For this, it makes use of another component, the LinkIndex, which maintains an index (based on the hbase indexing library) of all links between records.

3.2.7 Solr

Solr5 is a search server based on Lucene, the well-known excellent text-search library. It provides powerful search functionality including full-text search (with spell check, search suggestions, and so on), fielded search and faceted navigation. The configuration can be tweaked, e.g. with regards to text analysis, to provide an optimal search experience. It supports distributed querying across a set of Solr nodes (to support data sets that do not fit on a single server), and Solr nodes can be replicated (to support many concurrent search requests).

3.2.8 ZooKeeper

ZooKeeper6 provides some basic services for the coordination of distributed applications, like distributed synchronization, leader election and configuration. As these things are hard to get right, it is a good thing that many applications re-use ZooKeeper for this purpose. ZooKeeper is used by Lily and HBase, and is also starting to make an appearance in Solr.

Notes

1. http://hadoop.apache.org/hbase/

2. http://hadoop.apache.org/hdfs/

3. http://hadoop.apache.org/avro/

4. http://lucene.apache.org/tika/

5. http://lucene.apache.org/solr/

6. http://hadoop.apache.org/zookeeper/

4 Repository

4.1 Repository Model

Lily's repository model is designed for content management applications. Compared to more data-oriented applications, this means we offer rich field types like multi-value fields, versioning, a flexible schema, and variants (such as for different languages).

4.1.1 Basic concepts and terminology

Lily manages records. A record is a set of fields. Records adhere to a record type which specifies the field types that are allowed within the record. Field types define the kind of value that can be stored in the field (string, long, decimal, link, ...) and the scope of the field. The scope determines if the field is versioned or not. Versioned fields are immutable: upon each change of a versioned field a new version is created within the record.

The diagram below shows the relation between these concepts, and some more that we will discuss further on in detail.

4.1.2 No hierarchy

Quite a few content repositories use a file system metaphor for the structure of their repository, whereby content is put in a hierarchical namespace. For example, the Java Content Repository (JCR) API uses such a hierarchical model. Such models force users to think about a primary organization of the content, and require deciding where in the hierarchy to store each created entity.

In Lily, there is no such hierarchy. The repository is one big bag of records. This saves users from having to think about where to store things in a primary hierarchy.

Lily does not have tables either: there is just one set of records.

4.1.3 Record identification

A record is uniquely identified by its ID. The ID can be assigned by the user, or can be generated by the system, in which case it is a UUID.

In case you choose to assign record IDs yourself, be sure to adjust the initial table region settings (page 152)!

More precisely, the record ID consists of two components: the master record ID and a set of variant properties. It is the combination of these two which uniquely identifies a record. However, the variant properties are optional, and we will discuss them in detail later on.

Record re-creation

When a record is deleted in Lily, a deletion marker flag is set to true and all historical data (record type, record type version, field data) that existed for the record is cleared. The current version number is however kept. When a record is later created with the same record ID, this is regarded as a record re-creation. The record is created (as for a normal create), but the version numbering of the record will continue from where it was when it was deleted (e.g. if the version number was 4 when the record was deleted, the re-created record will get version number 5). For more information on the reasoning behind this, see Repository Model To HBase Mapping (page 158).

4.1.4 Records, field scopes, versions

4.1.4.1 Records

A record is the core entity managed by the Lily repository. All data you store in Lily is in the form of records.

A record is the unit of atomic modification in Lily, thus the granularity of a read, update or delete operation. Since no concurrent operations can happen on a row, there is a limit to the number of updates a row can receive in a unit of time.

A record contains a set of fields. A field is a pair {field type id, value}.

Besides a pointer to its record type, a record has no built-in properties (like “last modified”, “owner”, ...), so there is no unwanted overhead of these.

4.1.4.2 Field scopes & versions

Records can have versions, so that older data stays available, but versioning is optional.

Fields can reside in three scopes: the non-versioned scope, the versioned scope, and the versioned-mutable scope. We respectively speak of non-versioned fields, versioned fields and versioned-mutable fields.

Fields that belong to the non-versioned scope are, as the name implies, not versioned. If a record has only fields in the non-versioned scope, the record will have no versions. If the record does have versions while it also has non-versioned fields, then you can consider the non-versioned fields as fields whose value counts for any version (= cross-version fields). If, for such records, you modify only a non-versioned field, no new version will be created.

Fields that belong to the versioned scope are (obviously) versioned: each time a record is updated with new values for such fields, a new version will be created in the record (the fields are not versioned individually). Fields in the versioned scope are immutable after creation: you cannot modify their value in existing versions.

Fields that belong to the versioned-mutable scope are somewhat special: these fields are part of versions like the versioned fields, but they stay mutable (modifiable) in existing versions. They are ideal for metadata about a version, like the version's review status, a version comment, and the like.

Typically, you will either choose to use versioning or not to use versioning, and most fields will fall in one of these scopes. Still it can be useful to have non-versioned fields when using versioning, e.g. for a field which determines the access permissions to the record, as you will want this to affect all versions.

Versions can currently not be deleted.

4.1.5 Field types

The fields in a record are not free name-value pairs: each field in a record has to be defined by a corresponding field type. For each field type, there can be at most one value in a (version of a) record.

The field type fixes some important aspects of a field:

• its scope: non-versioned, versioned or versioned-mutable

• its value type, indicating the (Java) type of the values that can be stored in the field (see below for more details)

• its ID: this is a system-generated ID. It is this ID which is used internally when storing field values within a record (see HBase mapping (page 158)).

• its qualified name: a user-assigned name consisting of a namespace and an actual name.
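Put together, creating such a field type through the Java API looks roughly like the sketch below. It assumes a Repository instance obtained as in the repository API tutorial (page 106); the method names follow the TypeManager javadoc, but treat the exact signatures as something to verify there.

import org.lilyproject.repository.api.*;

// Sketch: create a field type named {mynamespace}title, value type STRING,
// in the versioned scope.
TypeManager typeManager = repository.getTypeManager();
FieldType title = typeManager.newFieldType(
        typeManager.getValueType("STRING"),  // value type
        new QName("mynamespace", "title"),   // qualified (namespaced) name
        Scope.VERSIONED);                    // scope
title = typeManager.createFieldType(title);  // returned object carries the generated ID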

Except for the name, a field type is immutable after creation.

The name of a field type should be unique within the repository.

To illustrate what it means for the name of a field type to be unique, let's compare this with SQL databases. In these, the name of a field is unique within the table, but not across tables. In Lily, field types are defined independently from record types. The same field type can be added to many different record types. This has the advantage that all records which have a field of some type can be treated in a uniform way. For example, if we add a field type "name" to all record types, we will be able to use that name in listings containing records of different types. The same could have been achieved through mixins (see later) or a record type hierarchy. However, the reason we made field types independent entities is not primarily because of this, but rather so that there would be a fixed {field name, scope} relation, and because a record can have a different record type per scope.

The name (namespace + simple name) can be changed after creation. It is the name users (= developers) will use for identifying fields. However, we expect name changes to be rare; they will typically happen as part of a redesign/refactoring or because of a typo.

If Lily allowed changing the value type of a field, it would fail on reading existing field values. Allowing the scope to change would also lead to difficulties reading and writing records.

If you would like to change the scope or type of a field, the solution is to make a new field. You could then run a task which converts all existing records to copy the value from the old field to the new one. Or, sometimes better, you make the application cope with both the old and new field when reading a record, and perform the conversion when an update is performed on a record. Note that since field types can change name, you can rename the old field type and give the new field type the name of the old one, so that it is virtually replaced.

Field types can currently not be deleted.

4.1.5.1 Value types

The value type indicates the (Java) type of the values that can be stored in the field. Lily has some built-in value types which are listed in the javadoc of the TypeManager1, method getValueType.

4.1.5.1.1 Basic value types

The basic value types include:

Name Java Type

STRING java.lang.String

INTEGER java.lang.Integer

LONG java.lang.Long

DOUBLE java.lang.Double

DECIMAL java.math.BigDecimal

BOOLEAN java.lang.Boolean

DATE org.joda.time.LocalDate

DATETIME org.joda.time.DateTime

BLOB org.lilyproject.repository.api.Blob

URI java.net.URI

BYTEARRAY org.lilyproject.bytes.api.ByteArray

4.1.5.1.2 Parametrized value types

Some value types are more complex and can be parametrized with extra information. When referring to these value types, their name is extended with a parameter between angle brackets: < >. A short sketch of how these are obtained follows the list below.

These parametrized value types include:

LIST: a list value type represents a java.util.List

• The values of the list can be of one of the value types (including LIST again). It is required to indicate this value type by placing it between brackets in the name of the value type.

• Example: LIST<STRING>

• Note: this replaces the 'multivalue' indication we had in Lily 1.0.

PATH: a path value type represents a org.lilyproject.repository.api.Hierarchy

• Similar to the list value type it is required to indicate the value type of the values of the path.

• Example: PATH<LONG>

• Note: this replaces the 'hierarchical' indication we had in Lily 1.0.

LINK: a link value type represents a org.lilyproject.repository.api.Link

• This is a pointer to another record. It can be specified which type of record the link can point to. This is done by adding the record type name after the value type name, between brackets.

• Example: LINK<{aNamespace}aRecordTypeName>

• This record type is purely informative (it is not validated) and is optional.

RECORD: a record value type represents a org.lilyproject.repository.api.Record

• This is a record that can be stored in the value of a field of another record. The record type of this record can be indicated by adding the record type name after the value type name, between brackets.

• Example: RECORD<{aNamespace}aRecordTypeName>

• This record type is optional. But if it is omitted, the record that is being put in the field is required to indicate the record type itself.

• A record that is used as a field value has some differences from top-level records:

• it has no ID

• it has no versions. The version property is null. All its fields behave the same regardless of their scope.

• only the record type of the non-versioned scope is used

• only fields that are defined in the record type are allowed
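As promised above, here is a short sketch of how such parametrized value types are obtained and used. It continues the field type example from earlier (the typeManager and record variables come from there); getValueType is the TypeManager method mentioned above, and the QNames are made-up examples.

// Parametrized value types are looked up by their full name:
ValueType stringList = typeManager.getValueType("LIST<STRING>");
ValueType longPath   = typeManager.getValueType("PATH<LONG>");
ValueType authorLink = typeManager.getValueType("LINK<{mynamespace}Author>");

// A field of type LIST<STRING> then accepts a java.util.List of String values:
record.setField(new QName("mynamespace", "tags"),
        java.util.Arrays.asList("gothic", "novel"));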

4.1.6 Record types

A record type is a named set of field types. Each record is associated with a record type, and in this way it is defined what fields a record should contain (the 'should' is explained later on).

A record type consists of the following:

• a set of associations with field types; as part of the association, certain things can be defined for each field type:

• mandatory: is the field optional or not within a record

• (currently no other properties)

• a list of mixins: these are references to other record types which are imported within the current record type.

• an ID: like for field types

• a namespaced name: like for field types

• a record type has versions: upon each change to it, a new version is created.

4.1.6.1 Record type versioning

In contrast to field types, record types can be modified: you can e.g. add and remove field types. On each such change, a new version of the record type is created. Records always point to a specific version of a record type. This way, the state of the record type at the time of record creation or update is preserved.

The name of a record type is not versioned, thus changing the name affects all its versions.

When a record is updated, it will by default move to the last version of the record type.

4.1.6.2 Mixins

Mixins allow easy reuse of a set of fields in various record types.

The mixins of a record type are a list of references to other record types, or more correctly to a {record type, record type version} (references to record types are always to a specific version of the record type). In other words, mixins provide a way to include or import record types within other record types.

Different mixin record types might contain the same field types. This is no problem: the duplicates will be ignored. The association attributes are merged. The behavior for the mandatory attribute is that a field is mandatory as soon as it is mandatory in at least one mixin.

Mixins work recursively: we can mix in a record type which itself mixes in other record types. If there is a loop within the mixins, this will be detected and the recursion will stop.
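As an illustration, creating a record type with one mandatory field and a mixin through the Java API might look like the sketch below, continuing the field type example from earlier. Method names such as addFieldTypeEntry and addMixin follow the record type API, but verify the exact signatures in the javadoc; otherRecordType is a made-up, previously created record type.

// Sketch: a Book record type with a mandatory title field,
// mixing in the fields of another record type.
RecordType book = typeManager.newRecordType(new QName("mynamespace", "Book"));
book.addFieldTypeEntry(title.getId(), true); // true = mandatory
book.addMixin(otherRecordType.getId());      // import the fields of another record type
book = typeManager.createRecordType(book);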

4.1.7 The record – record type relationship

We mentioned earlier that each record is associated with a (version of a) record type.

This is only part of the truth. Each record is actually associated with three record types, one per scope: non-versioned, versioned, versioned-mutable.

The record type of the non-versioned scope is the main record type of a record.

When a version is created, as part of the version we store the reference to the current record type at the moment of version creation. This way, when older versions are consulted, we can know what their record type was at that time (= the reference to the record type itself is also like an immutable, versioned field). New versions are always created with the same record type as the one of the non-versioned scope.

Lastly, the versioned-mutable scope also stores its own pointer to the record type, corresponding to the non-versioned record type at the moment the versioned-mutable data was modified.

4.1.8 Record type as a guide rather than a straightjacket

A record type defines the fields that should be used within a record. When saving a record, the record is validated with respect to the record type: all mandatory fields should be present, fields that are not in the record type are not allowed, and the value of the fields should correspond to the value type of the field types.

However, this validation is optional and can be disabled (when storing a record). When it is disabled, you can add any field you like to a record, and the repository will store it. As such, technically a record is just a set of fields, and the record type an optional guide defining the structure of a record. The repository does not need the record type to be able to read or write the record.

Disabling validation is currently not yet implemented.

Let's contrast this to XML. An XML document is self-describing and can be parsed without a schema (if we forget about DTDs for a moment). An XML Schema can be used as an optional layer to be sure the XML document conforms to a certain structure. Lily records are somewhat the same but also somewhat different. While Lily does not need access to the record type, Lily does need access to the field types to be able to read and write records. This could have been avoided if we stored the value type along with each value. The reason we did not go this way is because of the scopes. Without a fixed {field name, scope} relation the user would have to specify the scope each time she wants to get or set a field, since the name alone would not uniquely identify a field across scopes. Now this is enforced in Lily because the scope is part of the field type, and field types have a globally unique name.

This being said, usually validation will be left enabled. Disabling it can be useful e.g. for system processes that do not want to care about the structural validity of records.

4.1.9 Variants

As mentioned in the section on record identification, the record ID consists of two components:

• the master record ID

• a set of variant properties

The most common use-case for variants is to maintain different language variants of the same record. They can also be useful for other purposes, such as for source-control-like branches.

The variant properties can be empty, in which case the record ID is equal to the master record ID. A record which has such an ID is called a master record.

The variant properties are a free set of name-value pairs. For example: {lang=en, branch=dev} (this syntax is just an informal notation used here).

The names of the name-value pairs are sometimes called the variant dimensions.

4.1.9.1 Why variants?

Without variants, different languages of the same document would need to be created as different records, hence with different IDs, in the repository. The problem is these would then not have a shared identity. This can be annoying with respect to the links between these records. Suppose you have some records in one language, which have links between them (in link fields or in HTML blobs); these links point to other records by means of their ID. If you would now want to translate these records to some other language, you would create a new set of records, and these will have different IDs. When copying the content from the original records to the new records (as a start for translating them), you will have to adjust all the links to point to the new record IDs.

When using variants, the links can be based on just the master record ID, and the variant properties can be resolved from the context (= are the same as those of the document that contains the link). For more information about context-dependent resolving of links, see the javadoc of the Link2 class, especially its resolve method.

4.1.9.2 Cross-variant data

It is possible to have variants of a record for the different languages, while at the same time also having a variant without the language dimension. So, supposing a master ID of 'record1', we could have these variants:

• record1 (the master record)

• record1 {lang=en}

• record1 {lang=fr}

• record1 {lang=es}

In such cases, the master record can be used to store fields that should not be translated, such as numbers or dates. If these were stored within each language variant, they would need to be updated in each of them when they change. The same ideas can of course be applied to variants with more dimensions.

4.1.10 Operations

The granularity of CRUD operations is on the level of a record. So one record at a time can be atomically created, read, updated or deleted. A single update operation can update fields in all the scopes. A record read can be limited to the fields you are interested in. When updating, only fields that are modified need to be communicated.
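
As a rough illustration of the last two points, here is a sketch of a partial read and update through the Java API; treat the method signatures as assumptions and see the repository API tutorial (page 106) for the authoritative version:

import org.lilyproject.repository.api.*;

public class PartialUpdateSketch {
    // A minimal sketch, assuming a Repository obtained elsewhere
    // (e.g. from a LilyClient, as in the repository API tutorial).
    public void renameBook(Repository repository, RecordId id) throws Exception {
        QName title = new QName("org.lilyproject.bookssample", "title");

        // Read only the field we are interested in
        Record book = repository.read(id, title);

        // Communicate only the modified field back to the repository
        book.setField(title, "A better title");
        repository.update(book);
    }
}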

4.2 How To Create A Schema

To create a schema (field types and record types) in Lily you have several options:

• use the Java API as explained in the repository API tutorial (page 106)

• use the REST API as explained in the REST interface tutorial (page 77)

• describe the schema in a JSON file and upload it with the import tool (page 72) (which internally uses the Java API)

The last option of using the JSON file is often the most convenient. Not only does it avoid writing lots of API calls, it is also smarter: it will detect when a field type or record type already exists and update it if necessary.

When you write an application that needs to set up some fixed schema, you can also use the import tool as a library within your application. For example, the mbox import tool (page 73) contains a JSON file describing its schema and calls the import tool programmatically to create or update the schema. Have a look at its source code to see how this can be done.

4.3 How To Create Records

To add, update or delete records in Lily you have several options:


• use the Java API as explained in the repository API tutorial (page 106)

• use the REST API as explained in the REST interface tutorial (page 77)

• describe the records in a JSON file and upload them using the import tool (page 72) (which internally uses the Java API)

In contrast to the creation of a schema (page 38), to create records you will typically use one of the APIs rather than the import tool.

Notes

1. javadoc:org.lilyproject.repository.api.TypeManager

2. javadoc:org.lilyproject.repository.api.Link


5 Indexer

The Indexer is the component responsible for keeping the Solr index up to date. In essence, it takes records from the Lily repository and puts them into Solr. It does this in reaction to asynchronously-processed events produced by the repository.

The mapping of repository records onto Solr is more than just forwarding data: the Indexer offers features such as denormalization, indexing of multiple versioned views, and blob content extraction.

Denormalization: data from one record can be stored in the index entry of another record. This is useful since Solr is not able to do joins like a SQL database, and since the index can be partitioned over many nodes. Denormalization makes searching simpler, but indexing more difficult: when a record is updated, possibly the index entry of other records needs updating too. The Indexer takes care of this.

Indexing of multiple versions of one record: tags can be assigned to versions, and you can configure for which tags an index should be maintained. A version tag is like a snapshot of the record state across records.

Incremental index updating: upon each record change, Lily produces a message queue event (see the rowlog component). The indexer can subscribe to these events to incrementally update an index as changes are happening. If you run multiple Lily nodes, the indexers will run in each of the nodes and each perform a part of the work. Also within one Lily node, the indexer will run on multiple threads.


Batch index building: when you create a new index, change the configuration of an existing index, or for some reason your index got lost, you can trigger a batch index build job. This job executes as a map-phase-only MapReduce task which runs over all records in Lily and re-indexes them. The number of map tasks is equal to the number of HBase regions.

Blob content extraction, using the Tika library1. Many common formats are supported, such as HTML, PDF, Microsoft Office, OpenOffice and OpenDocument format, RTF, and more.

Sharding towards multiple Solr instances: if you have too much data to fit into one Solr instance, you can shard it over multiple ones.

5.1 Setting Up A Generic Index

The quickest way to get your content indexed in Solr is to avoid having to write configuration files first. For this purpose, Lily comes with a sample configuration, based on dynamic field rules, which will index any content.

This is a quick way to get started, but you'll soon want to customize things. For this, the Indexer Tutorial (page 42) will help you.

There are basically two steps:

1. Start Solr with the generic schema

2. Define the index in Lily

5.1.1 Start Solr with the dynamic_solr_schema.xml

The Solr schema to be used can be found in:

{lily}/samples/dynamic_indexerconf/dynamic_solr_schema.xml

5.1.1.1 Using standalone Solr

Assuming you have just downloaded Solr, you can put the schema in place using:

cp {lily}/samples/dynamic_indexerconf/dynamic_solr_schema.xml \
  {solr}/example/solr/conf/schema.xml

And then start Solr using:

cd {solr}/example
java -jar start.jar

5.1.1.2 Using launch-test-lily

When using launch-test-lily, you can specify the schema for the Solr instance using the -s parameter:

launch-test-lily -s {lily}/samples/dynamic_indexerconf/dynamic_solr_schema.xml

Tip: you might also want to use the -c argument to auto-commit the index, e.g. "-c 60" will commit it every minute.

5.1.2 Define the index in Lily

The indexer configuration to be used can be found in:

{lily}/samples/dynamic_indexerconf/dynamic_indexerconf.xml

To define the index in Lily, execute the following command. If you're using a real cluster rather than running everything on localhost, you will need to adjust the host name of ZooKeeper (-z option) and of Solr (-s option).

lily-add-index \
  -z localhost \
  -c samples/dynamic_indexerconf/dynamic_indexerconf.xml \
  -n genericindex \
  -s shard1:http://localhost:8983/solr

5.1.3 And we are done

If you add any new content now, it will be indexed. If you have existing content in Lily, you can launch a batch index build to re-index it.

If you made any errors in the parameters to lily-add-index, you can change them using lily-update-index.

Before you can find your content in Solr, you need to commit the index, e.g. using:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<commit/>'

You can perform queries via Solr's admin console:

http://localhost:8983/solr/admin/

5.2 Indexer Tutorial

5.2.1 Overview

Getting documents indexed into Solr requires the following steps:


1. write an indexer configuration; this specifies which records to index and how to map the Lily fields onto Solr fields

2. write a matching Solr schema, and launch a Solr instance that makes use of this schema

3. declare an index in Lily that makes use of this configuration

5.2.2 The Lily schema

Before setting up an index, you should already have a schema with field types and record types, since in your indexer configuration you will refer to these types. We assume you are already familiar with this part.

5.2.3 Version Tags

A record in Lily can have one or more versions, or it can have no versions at all. This depends on the scope (versioned, non-versioned) of the fields in the record. A record which has only non-versioned fields will have no versions.

To index records, we need some way to identify what version(s) of the records we want to be indexed. The mechanism for this is version tags. A version tag (often shortened to vtag) is a named pointer to a specific version of a record. For example, you could define a tag called 'live' which points to the version that contains the ready-for-publishing content. In one record, this live tag could point to version 5; for another record, it could point to version 3, etc.

You can have multiple version tags, and have the versions corresponding to all those tags indexed. When searching, you can then limit your search to the versions having some tag.

To make records without versions fit in this system, a special version '0' is supported: version 0 is essentially a pointer to the set of non-versioned fields of a record. This also works for records that do have versions.

Technically, a version tag is just another field in the record. A version tag field should be non-versioned, single-valued, and of type long integer. Version tag fields should be in the namespace org.lilyproject.vtag.

The vtag 'last'

To make things easier, Lily comes with a built-in virtual vtag that is automatically defined for all records. This vtag is called 'last' and always points to the last version of the record, or to the '0' version for records without versions. This vtag is not actually stored as a field in the record.

So in case you simply want to index the last content, or when you are not using versioning at all, all you need is the 'last' vtag.
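
For example, assuming you have defined your own vtag field named 'live' in the org.lilyproject.vtag namespace (as described above), a record rule in the indexer configuration can request that both the last and the live versions be indexed:

<record matchNamespace="b" matchName="Book" vtags="last,live"/>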

5.2.4 Indexer configuration sample

Here is a sample indexer configuration:

<?xml version="1.0"?>
<indexer xmlns:b="org.lilyproject.bookssample"
         xmlns:sys="org.lilyproject.system">

  <records>
    <record matchNamespace="b" matchName="Book" vtags="last"/>
    <record matchNamespace="b" matchName="Author" vtags="last"/>
  </records>

  <fields>
    <field name="title" value="b:title"/>
    <field name="authors" value="b:authors=>b:name"/>
    <field name="name" value="b:name"/>
    <field name="recordType" value="sys:recordType"/>
  </fields>

</indexer>

There are two parts to this configuration: the 'records' and the 'fields'.

Records

The 'records' section defines what records should be indexed. This decision is made based on the record type of the record; this is specified using the matchNamespace and matchName attributes. As the names of these attributes suggest, these can contain wildcard expressions; refer to the reference documentation for full details on this. If a record matches one of these rules, then the vtags attribute is used to define what versions of the record should be indexed. This can contain a comma-separated list; here we only used the built-in vtag 'last'.

Fields

The 'fields' section defines all the fields that can be sent to Solr, and their binding to Lily record fields. The fields are all global; they are not grouped per record type or so. If an index field has no value for some record, it will obviously not be added to the index. For example, the author records have a name but no title field, so for authors no title will be added to the Solr document. If for some record there are no index fields that produce a value, the record will not be added to the index.

In the example above, the title and name fields map straight to the Lily field of the same name. For the authors field we do something special. The authors field is a LIST<LINK> field pointing to the authors of a book. The "=>" symbol is the dereference operator. The expression "b:authors=>b:name" tells the indexer this: follow the link(s) in the b:authors field to the author records they point to, and from those records, take the b:name field. These dereference expressions work both for single-valued and multi-valued (LIST) links, and can follow links multiple levels deep.

Record type indexing

A common need is to index the record type of the record, so that you can limit your queries to records of a certain type. The record type information can be addressed like any other field, through a special system namespace. Notice how the sys prefix maps to the namespace org.lilyproject.system. The sys:recordType will index the record type in the format "{namespace}name". There are other possibilities: to index the namespace and name separately, to index the mixins, etc. This is explained in the indexer configuration reference (page 53).
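
For example, once the record type is indexed this way, you can limit a search to books with a Solr filter query (the colon, quotes and braces of the "{namespace}name" format are URL-encoded here):

curl 'http://localhost:8983/solr/select/?q=something&fq=recordType%3A%22%7Borg.lilyproject.bookssample%7DBook%22'

in which the filter query reads recordType:"{org.lilyproject.bookssample}Book".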

Dynamic field mappings

If you have many fields or frequently make changes to the schema, you might desire some way to define generic field mapping rules. This is possible, and is again covered in the indexer configuration reference (page 53) (look for 'dynamic fields').


5.2.5 Solr configuration sample

The following is a snippet from a Solr configuration that matches the above indexer configuration:

<schema name="example" version="1.2">

  <types>
    [snipped: see Solr's example schema]
  </types>

  <fields>
    <!-- Fields which are required by Lily -->
    <field name="lily.key" type="string" indexed="true" stored="true" required="true"/>
    <field name="lily.id" type="string" indexed="true" stored="true" required="true"/>

    <!-- Fields which are required by Lily, but which are not required
         to be indexed or stored -->
    <field name="lily.vtagId" type="string" indexed="true" stored="true"/>
    <field name="lily.vtag" type="string" indexed="true" stored="true"/>
    <field name="lily.version" type="long" indexed="true" stored="true"/>

    <!-- Your own fields -->
    <field name="title" type="text" indexed="true" stored="true" required="false"/>
    <field name="authors" type="text" indexed="true" stored="true" required="false" multiValued="true"/>
    <field name="name" type="text" indexed="true" stored="true" required="false"/>
    <field name="recordType" type="string" indexed="true" stored="true"/>
  </fields>

  <!-- Lily requires the uniqueKey to be set to lily.key -->
  <uniqueKey>lily.key</uniqueKey>

  <defaultSearchField>title</defaultSearchField>

  <solrQueryParser defaultOperator="OR"/>

</schema>

We have left out the Solr field type definitions, as the ones we use here are those from Solr's example schema.

There are two sets of fields you need:

• some system fields required by Lily: these are all those whose name starts with 'lily.'

• fields corresponding to each of the index fields defined in the indexer configuration

Note that after changing the Solr configuration, Solr should be restarted. For some kinds of changes, the full index might need to be rebuilt.

It is required to set the uniqueKey property to lily.key.

5.2.6 Declaring an index

Once you have launched Solr with an appropriate schema configured, and have written an indexer configuration, you can add an index in Lily with the lily-add-index command:

lily-add-index \
  -n indexName \
  -c indexerconf.xml \
  -s shard1:http://localhost:8983/solr \
  -z zookeeperhost


The lily-add-index command has three required arguments:

• a name for the index, when lacking inspiration just use 'index1' or so

• the location of an indexer configuration file

• the URL where Solr is listening, prefixed with a shard name (see index sharding (page 65); just use 'shard1' for now)

The -z option specifies the ZooKeeper connection string. By default 'localhost' is used. Since this and other indexer CLI commands are short-running, it is not really required to specify the full ZooKeeper connection string; just one host name will do.

Once defined, you can update or delete the index; this is described in more detail in managing indexes (page 47).

While Lily supports adding multiple indexes, many users will only need one index. The ability to have multiple indexes is not for functional separation, but rather for technical reasons, as explained in managing indexes (page 47). You should (typically) not have multiple indexes that point to the same Solr instance!

5.2.7 Triggering indexing

Indexing is triggered by events generated by the repository. Thus when you create, update or delete records, the Indexer will be triggered.

You can also re-index the existing records in the repository through a batch index build. It is also possible to disable incremental indexing and only use batch index building. Or you can temporarily pause incremental indexing (the events will be queued). All this is described in managing indexes (page 47).

5.2.8 Committing the index

Suppose you have defined an index, added some records, and now try to find them in Solr. This will not give any results, unless the Solr index has first been committed. This is because Solr buffers updates and only after a while flushes this buffer into a new, searchable, index segment.

You can configure Solr to commit the index automatically at the interval of your choice, or you can also trigger the commit manually, as follows:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<commit/>'
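
For the automatic option, an autoCommit section in Solr's solrconfig.xml (inside the updateHandler element) does the job; a minimal sketch, with maxTime expressed in milliseconds:

<autoCommit>
  <maxTime>60000</maxTime>
</autoCommit>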

5.2.9 Querying

To query the index, directly make use of Solr. Consult the Solr documentation or a book on Solr for more information on this.

For example, a simple query on the word 'something' is done like this:

curl 'http://localhost:8983/solr/select/?q=something'

If you prefer to work with JSON, like in Lily's REST interface, use:

curl 'http://localhost:8983/solr/select/?q=something&wt=json'


If you use multiple vtags, you will most often want to limit your search to one vtag-view. This can be done by adding a condition on the lily.vtag field. For example, we could enforce this condition through Solr's filter query feature, as follows:

curl 'http://localhost:8983/solr/select/?q=something&fq=%2Blily.vtag%3Alast'

in which %2B is a plus sign and %3A a colon, so the filter query is "+lily.vtag:last".

5.2.10 Debugging indexing

By enabling debug logging for the category org.lilyproject.indexer.engine you will see information about what the Indexer is doing.

If you are simply launching Lily from the command line (e.g. in a development setup), you can enable logging to standard out with the -l and -m options:

lily-server -l debug -m org.lilyproject.indexer.engine

You can as well edit the lily-log4j.properties file. The above is just a shortcut to temporarily enable logging to stdout for some category. You can also change the logging configuration at runtime through JMX (jconsole).

Among other things, this will output lines like this when a record actually gets pushed to the index:

[Thread-8] DEBUG org.lilyproject.indexer.engine.Indexer - Record UUID.6ce28c20-bcb4-41f9-af97-63a774242208, vtag live: indexed

5.2.11 Further information

With the above you should have a basic understanding of the Indexer. You can also read about:

• how to add, update and delete indexes, trigger batch index rebuilding, and when to use more than one index, in managing indexes (page 47).

• everything you can do with the indexer configuration, in its reference documentation (page 53).

• the implementation architecture (page 69) of the indexer

5.3 Managing Indexes

5.3.1 About multiple indexes

Lily allows you to define multiple indexes. Each of these indexes should point to a different Solr instance (or to a different Solr core). For many uses, having just one index will suffice.

When is it useful to have more than one index?

• when you want to put in place a new index which is incompatible with the previous one. The new index can be added as a second index, and be populated through a batch build job. During all this, your existing index stays unaffected and usable. Once done, and you have verified everything is fine, you can switch over to the new index and drop the old one.

• when you have entities with very different indexing needs. For example, you might have a relatively small set of 'news' records whose index should be very frequently committed. It can make sense to put these in a separate index.

5.3.2 Index states

Each index has three kinds of states:

• general state: the general state of the index

• update state: this state tells whether the index should be updated incrementally or not

• batch build state: this describes the state of a batch build

The states are read-write, though certain values can only be assigned by the system; in other words, certain state transitions can only be performed by the system.

5.3.2.1 The general state

The general state can be one of:

• ACTIVE

• DISABLED

• DELETE_REQUESTED

• DELETING

• DELETE_FAILED

The ACTIVE and DISABLED states are not used by Lily at this time; you can use them to indicate whether some index is still intended to be used.

When you want to delete an index, you change its general state to DELETE_REQUESTED. The system will pick this up by moving the state to DELETING. After this, the index will either disappear or change to DELETE_FAILED. Deleting an index only deletes the definition of the index in Lily; the actual Solr instance is left untouched.

5.3.2.2 The update state

The update state is about the incremental updating of the index. It can be one of:

• SUBSCRIBE_AND_LISTEN: a message queue subscription should be taken for this index and the listeners for this message queue should be started to perform the incremental indexing.

• SUBSCRIBE_DO_NOT_LISTEN: a message queue subscription should be taken for this index, but the listeners should not be started. This means the queue will fill up since no listeners consume the messages. This can be useful occasionally, e.g. if you are going to take down your Solr servers on purpose. If you want to temporarily disable indexing and plan on doing a batch index build later on, rather use the state DO_NOT_SUBSCRIBE. Attention: only use SUBSCRIBE_DO_NOT_LISTEN for buffering relatively small amounts of messages (millions, not hundreds of millions, though it pretty much depends on the number of rowlog shards you have configured).

• DO_NOT_SUBSCRIBE: no message queue subscription will be taken; there will be no incremental updating of this index. This is useful if you want to update your index always through batch jobs.
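
For example, to stop incremental updating of an existing index and rely on batch builds only, you could change its update state as follows (a sketch, assuming lily-update-index accepts the same --update-state option as lily-add-index, described further on):

lily-update-index -n indexName --update-state DO_NOT_SUBSCRIBE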

5.3.2.3 The batch build state

The batch build state can be one of:

• BUILD_REQUESTED

• BUILDING

• INACTIVE

By default the batch build state is INACTIVE. When you want to launch a batch build job for this index, you change its state to BUILD_REQUESTED. The system will react to this state change and launch the batch build job; it will then move the state to BUILDING. The system will then follow up on the state of this batch build job and move the state back to INACTIVE when done. Information such as the ID of the job, and whether it succeeded or not, is stored in other index properties, as described further on.

5.3.3 Performing common index actions

5.3.3.1 General notes

5.3.3.1.1 Command line clients and programmatic access

All index related actions are performed through a set of command line utilities. These utilities internally make use of the API provided by the lily-indexer-model project. You could write your own clients using this API, for example to provide another user interface or to integrate certain actions as part of a bigger system. If you would like to do this, we recommend looking into the source code of the indexer admin utilities (in the source tree: cr/indexer/admin-cli).

Information about the indexes can also be retrieved from the REST interface, though at the time of this writing the changes that could be made through it were limited to changing the index states.

5.3.3.1.2 ZooKeeper connect string

All command line utilities need to know one common setting: the ZooKeeper connect string. This is specified using the -z option, something like:

lily-list-indexes -z zookeeper1:2181,zookeeper2:2181,zookeeper3:2181

By default, localhost:2181 is used. Since the CLI utilities are short-running, you can get away with specifying just one of the ZooKeeper hosts rather than the full connection string.

5.3.3.1.3 Getting help

Use the -h option to get information on the full set of available options of each utility.


5.3.3.1.4 Forgot the name of an index

Most commands require you to specify the name of an index. If you forgot the name, use lily-list-indexes to get a list of the defined indexes.

5.3.3.2 Knowing what indexes exist

Perform the following command:

lily-list-indexes

If you have three indexes, this will show something like this:

index1
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index1
  + Solr shards:
    + shard1: http://solr:8983/solr/core1
index2
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index2
  + Solr shards:
    + shard1: http://solr:8983/solr/core2
index3
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index3
  + Solr shards:
    + shard1: http://solr:8983/solr/core3

5.3.3.3 Creating an index

Creating an index is done through the lily-add-index command:

lily-add-index -n indexName -c indexerconf.xml -s shard1:http://localhost:8983/solr

The above shows the three arguments minimally required:

• the name of the index

• an indexer configuration file, see the configuration reference (page 53)

• the URL of a Solr instance, prefixed with a name for the (one) shard.

5.3.3.3.1 Incremental indexing is enabled by default

When a new index is added, it is by default created with the update state SUBSCRIBE_AND_LISTEN, which means incremental updating will be immediately enabled. You can create it with a different initial state through the option --update-state.
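
For example, to create an index that will only ever be updated through batch builds:

lily-add-index -n index1 -c indexerconf.xml \
  -s shard1:http://localhost:8983/solr --update-state DO_NOT_SUBSCRIBE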

5.3.3.3.2 Using multiple shards

If you want to use more than one shard, specify a comma-separated list of Solr URLs, prefixing each with a name:


-s shard1:http://solr1:8983/solr,shard2:http://solr2:8983/solr,shard3:http://solr3:8983

Lily has a built-in default strategy for assigning records to shards, but you can provide a custom configuration too. Sharding is explained in more detail in Solr index sharding (page 65).

5.3.3.4 Updating the indexer configuration of an index

If you make a change to the indexer configuration, you can update an existing index with:

lily-update-index -n indexName -c indexerconf.xml

Do not forget that when you have added new index fields, you need to add them to the Solr schema too. Also, do not forget that existing content will not be automatically re-indexed: you need to start a batch build job for that.

If you no longer have the indexerconf.xml file, you can retrieve it as follows:

lily-get-indexerconf -n indexName -o indexerconf.xml

5.3.3.5 Updating other index properties

Similar to the indexer configuration, you can also update other index properties, such as the Solr shard URLs.

5.3.3.6 Deleting an index

Deleting an index is done by updating its general state to DELETE_REQUESTED:

lily-update-index -n indexName --state DELETE_REQUESTED

This will remove the message queue subscription (if any). If a batch build job is running, it will be killed. If all this is successful, the index will be deleted. Otherwise, it will move to the state DELETE_FAILED. You can check up on this using lily-list-indexes. In case of failures, check the log file of the Lily server that is running the indexer master.

Note that this only deletes the definition of the index in Lily; the Solr index itself is not dropped, as it is not managed by Lily.

5.3.3.7 Performing a batch build (rebuilding an index)

A batch index build will (re-)index all records in the repository. A batch build can be, but is not required to be, run concurrently with incremental index updating, so that any changes happening after the batch build is started are also reflected in the index.

A batch index build will not first delete the Solr index, so if you want to re-index from a blank slate, you first have to delete the Solr index yourself. See also this Solr FAQ entry2, which suggests doing a query-based deletion. Alternatively, you can simply clear out the Solr index directory while Solr is shut down.
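
Such a query-based deletion of all documents can be done with the same curl pattern as used earlier (do not forget to commit afterwards):

curl http://localhost:8983/solr/update -H 'Content-type:text/xml' --data-binary '<delete><query>*:*</query></delete>'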

To start a batch (re)build of an index, execute:

lily-update-index -n nameOfYourIndex --build-state BUILD_REQUESTED

This change in state will be picked up by Lily, which will launch a MapReduce job.


You can follow up on the progress via lily-list-indexes; its output will be similar to this:

index1
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: BUILDING
  + Queue subscription ID: IndexUpdater_index1
  + Solr shards:
    + shard1: http://localhost:8983/solr
  + Active batch build:
    + Hadoop Job ID: job_20101021170619294_0001
    + Submitted at: 2010-10-21T18:41:38.677+02:00
    + Tracking URL: http://localhost:43835/jobdetails.jsp?jobid=job_20101021170619294_0001

Note that the batch build state is now BUILDING and that a section 'Active batch build' appeared. Following the tracking URL will bring you to the Hadoop JobTracker web UI.

If there is a failure starting the batch job, for example because the JobTracker is unreachable, the batch build state will immediately move to INACTIVE, and the last batch build information will indicate that the job failed to start. It will also mention on which Lily node you should check the log files to see what the error was. In the log file, pay attention to messages for the log category org.lilyproject.indexer.master.IndexerMaster.

Once finished, the batch build state will become INACTIVE and the info about the last run batch build is shown below 'Last batch build':

index1
  + General state: ACTIVE
  + Update state: SUBSCRIBE_AND_LISTEN
  + Batch build state: INACTIVE
  + Queue subscription ID: IndexUpdater_index1
  + Solr shards:
    + shard1: http://localhost:8983/solr
  + Last batch build:
    + Hadoop Job ID: job_20101021170619294_0001
    + Submitted at: 2010-10-21T18:41:38.677+02:00
    + Success: true
    + Job state: succeeded
    + Tracking URL: http://localhost:43835/jobdetails.jsp?jobid=job_20101021170619294_0001
    + Map input records: 163
    + Launched map tasks: 1
    + Failed map tasks: 0
    + Index failures: 0

It is important to watch that the 'Index failures' property is 0. There can be indexing errors even when the MapReduce job as a whole succeeded. When a record fails to be indexed, we do not abort the map task but only augment this counter. More details about the errors that occurred can be found in the Hadoop log files.

5.3.3.8 Interrupting a batch build

To stop a batch build prematurely, kill it directly in Hadoop:

hadoop job -kill {id}

Depending on the configuration of Hadoop, you can kill it through the JobTracker web UI too. See the property webinterface.private.actions in your Hadoop's core-site.xml.


Lily will notice that the job was killed, and update the index state accordingly.

5.4 Indexer Configuration

The indexer configuration defines how a Lily record should be mapped to a Solr document. You can configure what records, and what variants and versions of those records, need to be indexed. You can use link dereferencing to denormalize data in the index.

Besides Lily's indexer configuration, you also need to configure Solr's schema.xml. This might seem like double work, but the purpose of the two files is different, and allowing manual Solr configuration gives maximum flexibility. Of course, nothing prevents you from generating both configurations from a common definition; maybe at some point Lily will itself include this as a feature.

It is possible to have generic rules in the configuration, so that not every record type or field type needs to be mapped individually. You can even go so far as to make an indexer configuration that basically tells the indexer to index everything; see Setting Up A Generic Index (page 41).

The listing below gives an overview of the syntax of the indexer configuration. More details are given in the next sections (online version: see navigation or click the links in the listing).

<indexer xmlns:prefix="...">

  <records>  <!-- page 55 -->
    <record matchNamespace="..." matchName="..." matchVariant="..." vtags="..."/>
  </records>

  <formatters default="...">  <!-- page 56 -->
    <formatter name="{unique name}"
               class="{name of formatter class that can format this kind of value}"/>
  </formatters>

  <fields>  <!-- page 57 -->
    <field name="{solr field name, not necessarily unique}"
           value="{prefix:name or dereference expression}"
           [formatter="{formatter name}"]
           [extractContent="true|false"]/>
  </fields>

  <dynamicFields>  <!-- page 60 -->
    <dynamicField matchNamespace="..." matchName="..."
                  matchType="{type pattern}"
                  matchScope="versioned|non_versioned|versioned_mutable"
                  name="{solr field name}"
                  extractContent="true|false"
                  continue="true|false"
                  formatter="{formatter name}"/>
  </dynamicFields>

</indexer>


5.4.1 Indexerconf: Version Tag Based Views

Version tags are used to determine what versions of a record should be indexed.

A record in Lily can have one or more versions, or it can have no versions at all. This depends on the scope (versioned, non-versioned) of the fields in the record. A record which has only non-versioned fields will have no versions.

Typically, it is not necessary to index all versions of a record, since many versions will be draft versions or old, archived versions. In some cases, it is fine to simply index the last version. But in other cases, versions need to undergo some review workflow, and hence the published version might not be the last one.

The solution offered by Lily's indexer is based on version tags. A version tag is a label assigned to a version. The set of records having a version with a particular version tag attached to it forms a particular view on the repository.

An alternative to the tag-based system is to use time-based views. In this case, a 'point in time' determines the version used for each record, thus allowing you to query the state of the repository as it was at some point in time. Lily currently does not support point-in-time based views; if you are interested in this, please contact us.

Technically, a version tag is just another field in the record. The value of the field is a version number; the name of the field is the tag. A version tag field should be non-versioned, single-valued, and of type long. Version tag fields should be in the namespace org.lilyproject.vtag.

Within a record, there can only be one version having a particular version tag (in other words, a particular vtag can point to at most one version). However, multiple version tags can point to the same version.

You can request any number of version tags to be indexed. Thus the same record might be indexed multiple times, in multiple version-tag views.

To make records without versions fit in this system, a special version '0' is supported: version 0 is essentially a pointer to the set of non-versioned fields of a record. This also works for records that do have versions.

The built-in version tag: last

To make things easier, Lily comes with a built-in virtual vtag that is automatically defined for all records. This vtag is called 'last' and always points to the last version of the record, or to the '0' version for records without versions. This vtag is not actually stored as a field in the record.

So in case you simply want to index the last content, or when you are not using versioning at all, all you need is the 'last' vtag.

Version tags & denormalization

As described further on, denormalization (= retrieving fields from linked records to store them within the index entry (page 156) of the current record) also honors the version-tag views.


Non-versioned fields & version tags

A record can contain both versioned and non-versioned fields at the same time. When non-versioned fields are indexed, they are stored within the index entry of each indexed version. When a non-versioned field changes, the index entries for all indexed versions will be rebuilt.

5.4.2 Indexerconf: Records

<records>
  <record matchNamespace="..." matchName="..." matchVariant="..." vtags="..."/>
</records>

The 'records' section determines whether a particular record should be indexed or not. This is done based on:

• the record type of the record (the one of the non-versioned scope)

• matchNamespace: the record type namespace. Can contain either the full namespace or a namespace prefix (if the specified string is a prefix, then it will be substituted by the full namespace). The string can have a wildcard (*) at the start or the end. If the matchNamespace attribute is absent, it means any namespace will be matched.

• matchName: the record type name. This can again have a wildcard at the start or the end. A missing matchName attribute means any name will be matched. When both matchNamespace and matchName are missing, records of any record type will be matched.

• the variant properties of the record

• an empty or missing matchVariant attribute means only records whose record id has no variant properties (the master record id) will be matched

• otherwise, it specifies a comma-separated list of variant properties the record should have, optionally specifying their value. See the section below for full details.

• a list of version tags that need indexing (the vtags attribute). This attribute is required. In case you have not defined your own version tags, you can use the built-in version tag 'last'.

5.4.2.1 Evaluation of the record rules

The list of <record> rules is evaluated in order; the first one for which the record type and variant expression matches counts, and thus will be used to determine the vtags to be indexed.

If there is no matching rule for a record, it will not be indexed. However, it might be that information from this record is denormalized into the index entries (page 156) of other records. Even if this record itself is not indexed, its denormalized information will still be updated.
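
As an illustration of this ordering (reusing the informal 'b' prefix from the tutorial sample), the following configuration indexes the 'last' and 'live' versions of Book records that carry a lang variant property, and falls back to indexing only the 'last' version of any other record, whatever its variant properties:

<records>
  <record matchNamespace="b" matchName="Book" matchVariant="lang" vtags="last,live"/>
  <record matchVariant="*" vtags="last"/>
</records>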

5.4.2.2 matchVariant expression

The matchVariant expression is quite simple, and best explained with some samples:

• matchVariant="" matches only the master record.

• matchVariant="prop1" matches records which have the variant property prop1 (whatever its value), not more, not less


• matchVariant="prop1,prop2" matches records which have exactly the variant properties prop1 and prop2 (with any value)

• matchVariant="prop1,prop2=foo" matches records that have the variant property prop1 with any value and the variant property prop2 with the value foo.

• matchVariant="*" matches records with any number of variant properties, including the master record

• matchVariant="*,prop2=foo" matches records that have the variant property prop2 with value foo, and optionally any number of other variant properties.

5.4.2.3 Version tags

The version tags are specified as a comma-separated list of version tag names. This is the same name as the field type name of the version tag, but without the namespace. For example: vtags="last,live,in-review".

5.4.3 Indexerconf: Formatters

Currently it is not possible to register custom formatter implementations, so you can ignore the formatters for now.

All values transmitted to Solr are strings. This means that non-string values need to be formatted (serialized) as strings. This is made possible by the formatters.

Lily has a built-in formatter for all kinds of values, making the configuration of formatters completely optional.

The available formatters are declared in a section as follows:

<formatters default="...">
  <formatter name="{unique name}"
             class="{name of formatter class that can format this kind of value}"/>
</formatters>

The attribute default should match the name of one of the formatters. It is optional; Lily's built-in default formatter is used as a fallback.

A formatter needs to implement the following interface (part of lily-indexer-model):

org.lilyproject.indexer.model.indexerconf.Formatter

To use a specific formatter for some field, specify the formatter attribute on the field tag:

<field name="..." value="..." formatter="{name of formatter}"/>

A formatter can return one or more strings, irrespective of whether the Lily field is a list or not. Likewise, the formatter handling a list value might return just one string.

While string values do not need any further formatting, they are still passed through the same mechanics.

5.4.3.1 Built-in formatter

When no formatters are configured or no formatter matches the attributes specified, a built-in default formatter is used. For most kinds of values it uses the "toString()" representation.


Date-times are formatted as ISO8601 in the UTC time zone, which Solr is able to handle. Date fields are formatted the same, but with "/DAY" appended to them (cf. Solr date math).

PATH-type values are formatted with slashes between the elements, for example: "value1/value2/value3".

The first level of LIST-type values maps onto Solr multi-valued fields. For deeper nested lists, the individual items are formatted and then concatenated into one space-separated string. For example, a LIST<LIST<STRING>> value [[a, b], [c]] becomes a Solr multi-valued field with the values "a b" and "c".

RECORD-type values are formatted by formatting each of their individual fields and then concatenating everything together into one space-separated string. This works recursively.

5.4.4 Indexerconf: Fields

The 'fields' section of the indexerconf defines the fields that should end up in the index (= the fields that are sent to Solr). Let's call these index fields, to avoid confusion with record fields.

<fields>
  <field name="{solr field name, not necessarily unique}"
         value="{prefix:name or dereference expression}"
         [formatter="{formatter name}"]
         [extractContent="true|false"]/>
</fields>

The field mapping is independent of the records to which the fields belong. This matches well with the fact that field types in Lily are independent from record types; record types are only sets of field types.

Each index field is bound to some record field.

The value of an index field can be:

• the value of a field of the record being indexed

• the value of a field from a nested record (= a record stored in a RECORD-type field)

• the value of a field from another record, found by following links, or by taking another variant of the same master record. This is called link dereferencing or denormalization. Below we sometimes refer to this as 'deref values'.

5.4.4.1 Correspondence between Lily LIST-type fields and Solr multi-value fields

As you can guess, Lily LIST-type fields map to Solr multi-value fields.

In Lily you can nest LIST fields, for example LIST<LIST<STRING>>. In such a case, the first list level maps onto the Solr multi-value, while further nested lists will be formatted as a string (space separated).

5.4.4.2 Index field name

There can be multiple index fields with the same name. If these each produce a value for a certain record, the result will be that a multi-value will be sent to Solr.

There are other ways of producing multi-values towards Solr: the most obvious is an index field mapped to a Lily LIST-type field. A formatter can also produce multiple values from a single input value.
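
For example, the following two mappings (using the 'b' prefix from the tutorial; b:subtitle is a hypothetical field here) both feed the Solr field 'title', so a record having values for both Lily fields produces a multi-valued 'title':

<field name="title" value="b:title"/>
<field name="title" value="b:subtitle"/>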


Index field names starting with 'lily.' are reserved for internal uses.

5.4.4.3 Order is important

The index fields will be added to the Solr document in the order specified in the indexer configuration. This can be important for multi-valued fields, for which Solr maintains the order.

5.4.4.4 Determination of the relevant index Fields for an input record

Not all fields will be sent to Solr for all records, but (obviously) only fields for which the value is non-null.

For non-deref values, this will usually be the case if the field exists within the record, though the formatter might strip it to null.

For deref values, if the first field in the expression exists in the current record, it might still very well be that further on something evaluates to null. In such a case, the index field will not be added to the Solr document.

5.4.4.5 Content extraction

Content extraction is performed using the Tika library. While Tika can extract both content and metadata, it is only the content we are interested in here. Metadata extraction should probably be handled when storing content in Lily, and mapped onto Lily fields.

<field name="..." value="..." extractContent="true"/>

When extractContent is true, no formatter will be used to format the field value (if any is specified, it will be ignored); rather, content extraction will be performed. The field value has to be a blob.

If the value is a blob but extractContent is not true, the blob value will be handled by a formatter instead (the default formatter will not do anything useful).

LIST<BLOB> and nested lists are supported.

Tika uses the AutoDetectParser with its default configuration. The amount of data extracted from a single blob is limited to 500K. If a blob contains more content, an info-level message is logged and the first 500K will be sent to Solr. It should be noted that Solr also has a maxFieldLength (see solrconfig.xml), which by default is 10000 tokens (not characters).
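
For reference, that Solr-side limit is set via the maxFieldLength element in solrconfig.xml, for example:

<maxFieldLength>10000</maxFieldLength>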

5.4.4.6 Index fields that use a value from the current record

There is not much to say about this; the syntax is as follows:

<field name="..." value="prefix:value"/>

5.4.4.7 Index fields that use a value from a nested record or that dereference links towards other records

RECORD and LINK type fields are actually quite similar: they both lead to another record. Therefore they are handled in the same way in the indexer. You can navigate through them using the symbol "=>", which is called the dereference operator.

Examples:


<field name="..." value="prefix:value1=>prefix:value2"/>
<field name="..." value="prefix:value1=>prefix:value2=>prefix:value3"/>

Each field to the left of the '=>' operator should be a LINK-type or a RECORD-type field. The expression is evaluated from left to right, and the dereferencing can go multiple levels deep. The last field in the list can be of any type, of course.

Dereferencing through LIST<LINK> or LIST<RECORD> fields also works, at any level in the follow-field chain. The order of the values is maintained. Dereferencing through nested lists, such as LIST<LIST<LINK>>, is not supported.

If somewhere in the chain a field evaluates to null (either the link field has no value or it points to a non-existing record), the deref expression as a whole is null and hence no index field will be added. In the LIST case, if one entry in the list points to a non-existing record, it will be dropped from the list. Obviously, if all entries are dropped from the list, evaluation of the dereference stops.

The actual value from the field at the end of the chain will be handled as for non-dereferenced field values: a formatter will be applied, or content extraction will be performed.

Dereferencing happens in a certain vtag-based view. So when we are indexing vtag X of a record, any information dereferenced from other records will also be taken from the version bearing vtag X. If the target record does not have a version with tag X, then the deref evaluates to null, even if the dereferenced field is a non-versioned field for which the version does not really matter.

5.4.4.8 Index fields that dereference towards less-scoped variants of the same record

Next to dereferencing via link fields, it is also possible to dereference towards less-dimensioned variants of a record. The general idea behind this is that the data which is specified on the less-dimensioned variants applies to all the more-dimensioned ones (for example, imagine the case of non-translatable content when working with language variants).

The syntax still uses the dereference operator, =>, but instead of a field name you can use:

• the word 'master', which is interpreted as a link to the master record of the current record (the record whose ID has no variant properties)

• an expression of the kind '-x,-y', that is, a comma-separated list of the variant properties which should be removed from the current variant to go to the target variant. For clarity, the names of the properties are prefixed with a minus.

Examples:

<field name="..." value="master=>prefix:field1"/>
<field name="..." value="-x,-y=>prefix:field"/>

This can also be combined with field dereferencing:

<field name="..." value="prefix:field1=>master=>prefix:field2"/>

5.4.4.9 Denormalized information and index updating

Denormalizing information in the index is a powerful feature, but you should be aware of what is involved in maintaining the denormalized information.


On each change of a record, regardless of whether the record itself needs indexing, the Indexer needs to check, for each index field that uses a deref-value, whether it is possible that the deref-value points to the current record. This is done by querying an index of links between records; we call this index the link index. If you have ten index fields that use a deref-value, this means at least ten queries on the link index for each record create, update or delete operation.

Deref-values should not be used for many-to-one links where the many is a large number. Suppose you have a million records that all have a link to the same record, and all these records store in their index entry (page 156) a field from this record. When this field is updated, all million records will have to be re-indexed.

5.4.5 Indexerconf: Dynamic Index Fields

If you have lots of fields, or when you often make changes to the schema, it would be impractical to map each field individually in the indexer configuration.

Therefore, it is also possible to define dynamic rules, similar to the dynamicFields in Solr.

Let's repeat the syntax:

<dynamicFields>
  <dynamicField matchNamespace="..." matchName="..."
                matchType="{type pattern}"
                matchScope="versioned|non_versioned|versioned_mutable"
                name="{solr field name}"
                extractContent="true|false"
                continue="true|false"
                formatter="{formatter name}"/>
</dynamicFields>

You can define any number of dynamic fields.

The evaluation is as follows:

• for each field in the record that is being indexed:

• run over all the dynamic fields in the order defined in the configuration

• the first one (if any) of which all the match attributes evaluate to true is used. The remaining dynamic fields are then ignored, except if the continue attribute is true (its default value is false).

5.4.5.1 Matching fields

Each of the match attributes defines some condition to which the field must adhere to match this rule. All of the match attributes are optional; a <dynamicField> rule without any match attribute will match any field. So it only makes sense to have such a rule as the last one (unless it has continue="true").

Here is what you can do with each of the match attributes:

• matchNamespace: matches the namespace of the field type. This can be a full namespace string or a prefix. It can start or end with a wildcard (*).

• matchName: matches the name of the field type. This can start or end with a wildcard (*).

• matchType: this attribute contains a type pattern to match against the value type of the field.


• matchScope: a comma-separated list of scope names: versioned, non_versioned, versioned_mutable (can be specified in upper case too)

The matchType pattern: basics

In its simplest form, this contains one type name, or a comma-separated list of type names. For example:

matchType="STRING,LONG,INTEGER"

The type name can contain a wildcard at the start or the end of the expression, so you could write:

matchType="STR*"

to match all types which start with STR, which is only STRING.

The matchType pattern: matching types with arguments

If you specify this:

matchType="LIST"

this will never match any type, since LIST has an obligatory type parameter. Specifying the literal type name, including the type parameter, does work:

matchType="LIST<STRING>"

Note that the above is not valid XML; we need to escape the less-than symbol:

matchType="LIST&LT;STRING>"

Since this is rather unreadable, you can replace the angle brackets with round ones:

matchType="LIST(STRING)"

You might want to match any kind of list. This can be done with:

matchType="LIST(*)"

This will match LIST<STRING> or LIST<INTEGER>, but not nested lists such as LIST<LIST<STRING>>.

The following special constructs are available for matching the type argument:

<*>    matches types without argument or with one argument, but not deeper nested arguments. In case the pattern is LIST<*>, this matches LIST and LIST<STRING>, but not LIST<PATH<STRING>>.

<**>   matches types without argument or with arguments nested to any depth. In case the pattern is LIST<**>, this matches LIST, LIST<STRING> and also LIST<PATH<STRING>>.

<+>    same as <*> but the type argument is required.

<++>   same as <**> but the type argument is required.

The distinction between <*> and <+> does not matter for types like LIST, which always have a type argument. It can be useful for RECORD, where the type argument (a record type name) is optional.

In the type pattern, you can of course also list the type argument in full (as already shown above with LIST(STRING)), and you are allowed to use the star wildcard both in the name of the type and in its argument (the wildcard only works at the start or end of the string).


The following matches all RECORD types with a record type in the namespace "foo":

RECORD<{foo}*>

Other examples to show what is syntactically possible:

LIST<STR*>
LIST<LIS*<**>>

5.4.5.2 The name

The name of the Solr field can be defined using an expression (a template). This expression is a string in which the following constructs can be embedded:

Expression                     Notes

${namespace}
${name}
${baseType}                    gives the type name without parameters, in lowercase. For "STRING" this gives "string". For "LIST<STRING>" this gives "list".
${nestedBaseType}              gives the type name of the nested type, without parameters, in lowercase. If there is no nested type, gives the base name of the type itself. For "LIST<STRING>" this gives "string". For "STRING" this gives "string".
${type}                        the type name, followed by the names of any nested types, separated by underscores. For "LIST<STRING>" this gives "list_string". For "LIST<LIST<STRING>>" this gives "list_list_string". For "RECORD<{dc}Title>" this gives "record" (the argument of RECORD is not a nested type).
${nestedType}                  similar to ${type}, but then for the nested type, falling back to the current type if there is no nested type.
${deepestNestedBaseType}       gives the base name of the deepest nested type. For "LIST<LIST<LIST<STRING>>>" this gives "string", while ${nestedBaseType} would give "list" in this case.
${list}                        true or false, depending on whether the field is a LIST (regardless of its type argument).
${nameMatch}                   if the matchName expression contained a wildcard, this is the text matched by that wildcard.
${namespaceMatch}              similar to ${nameMatch}.
${list?yesvalue:falsevalue}    allows to conditionally insert a string when the field is of type LIST. The falsevalue is optional: ${list?yesvalue}.

Examples:


<!-- The name is a literal string -->
<dynamicField matchNamespace="my.namespace" matchName="field1" name="field1"/>

<!-- Use the text matched by the wildcard -->
<dynamicField matchNamespace="my.namespace" matchName="f*" name="something_${nameMatch}"/>

<!-- A dynamic field without any match attribute: will match anything.
     We embed the type in the name, so that we can have matching
     dynamicField rules in Solr. List (multi-value) fields are
     suffixed with '_mv'. -->
<dynamicField name="${name}_${nestedBaseType}${list?_mv}"/>
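
On the Solr side, the output of that last rule could be matched with dynamicField declarations along these lines (a sketch; the field types are assumed to come from Solr's example schema):

<dynamicField name="*_string" type="text" indexed="true" stored="true"/>
<dynamicField name="*_string_mv" type="text" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="*_long" type="long" indexed="true" stored="true"/>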

On the dynamic field you can also specify the formatter and extractContent attributes. It is allowed to specify the extractContent attribute even if the dynamic field might match fields other than blob fields: the attribute only has significance in case the field is a blob field.

Dynamic fields do not support link dereferencing.

Dynamic fields are evaluated after the classic, static field mappings. The only significance of this is for the order of multi-values, in case the same Solr field name occurs in both. A field that has been used in a static field mapping will still be considered in the evaluation of the dynamic fields.

5.4.6 Indexerconf: Indexing The RecordType

Lily does not by default index the record type of a record (e.g. as one of the built-in 'lily.' fields), because there are many options for indexing the record type: you might want to index only the record type or also the mixins, you might want to index the namespace separately to be able to search across everything in a namespace, etc.

To allow indexing record type information, a set of system fields is available in the indexer configuration that can be used in normal <field> mappings.

To use these fields, define the following namespace:

xmlns:sys="org.lilyproject.system"

And then refer to them like any other field:

<field name="recordType" value="sys:recordType"/><field name="recordTypeWithVersion" value="sys:recordTypeWithVersion"/>

The table below contains the full list of available system fields.

System field              Data type    Notes

recordType                string       The namespace and the name of the record type, in the following format: {namespace}name

recordTypeName            string       Just the name of the record type.

recordTypeNamespace       string       Just the namespace of the record type.

recordTypeVersion         long

recordTypeWithVersion     string       Namespace, name, and version of the record type, in the following format: {namespace}name:version

mixins                    mv string    The mixins of the record type, without the record type itself, in the same syntax as recordType

mixinsWithVersion         mv string

mixinNames                mv string

mixinNamespaces           mv string    If there are duplicates (likely), they are not indexed.

recordTypes               mv string    The mixins and the record type, in the same syntax as recordType

recordTypesWithVersion    mv string

recordTypeNames           mv string

recordTypeNamespaces      mv string
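For example, to make the record type and all its mixins searchable through one multi-valued field, you could pair a mapping on sys:recordTypes with a matching multi-valued declaration in the Solr schema. A sketch; the Solr field name "recordTypes" is freely chosen for this illustration:

<!-- indexerconf: index the record type and its mixins -->
<field name="recordTypes" value="sys:recordTypes"/>

<!-- schema.xml: matching multi-valued Solr field -->
<field name="recordTypes" type="string" indexed="true" stored="false" multiValued="true"/>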

Technically, it is possible to index the record type of some other record by following a link field:

<field name="recordType_deref" value="ns:linkfield=>sys:recordType"/>

However, you should be aware that this information will not be updated automatically when the type of the other record changes (which is a rare case anyway).

When the name of a record type changes (which should be an infrequent event, except during project development), the index will not be automatically updated, since this could affect many records. Rather, perform a manual batch index build.

When another update to a record type happens, e.g. its mixins change, no index updating needs to happen, since each record points to a specific version of a record type.

5.5 Required Fields In The Solr Schema

The following field declarations MUST be included in every Solr schema file:

<!-- Fields which are required by Lily -->
<field name="lily.key" type="string" indexed="true" stored="true" required="true"/>
<field name="lily.id" type="string" indexed="true" stored="true" required="true"/>

<!-- Fields which are required by Lily, but which are not required to be indexed or stored -->
<field name="lily.vtagId" type="string" indexed="true" stored="true"/>
<field name="lily.vtag" type="string" indexed="true" stored="true"/>
<field name="lily.version" type="long" indexed="true" stored="true"/>

The unique key field MUST be set as follows:

<uniqueKey>lily.key</uniqueKey>

This is the meaning of each of the built-in fields:

Field           Notes

lily.key        the unique identification of the Solr document; it is the combination of the Lily record ID and the ID of the version tag (the ID of the field type of the version tag)

lily.id         the record ID

lily.vtagId     the ID of the version tag

lily.vtag       the name of the version tag (without namespace). For example, the string 'last'.

lily.version    the version of the record, thus the version the vtag pointed to at the time the record was indexed.
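Since a record indexed under multiple version tags yields one Solr document per vtag, queries typically restrict results to a single vtag via lily.vtag. A sketch, assuming an index built for the vtag 'last' and a Solr field 'name' defined by your indexerconf:

curl 'http://localhost:8983/solr/select?q=name:butter&fq=lily.vtag:last'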

5.6 Solr Index Sharding

5.6.1 Introduction

When your index is too large to be managed by a single Solr instance on one node, you can shard your index.

Index sharding is not the solution for handling high query traffic from many users: for that you should rather use Solr replication. The same holds for high availability: use replication rather than sharding.

For sharding, you need to set up multiple Solr instances. Typically these should all have the same configuration (especially the schema.xml).

When you add an index to Lily, you can specify multiple Solr shards by specifying their URLs. You also give each shard a logical name (which does not have to be unique across indexes). For example, you could name them “shard1”, “shard2”, and so on. See managing indexes (page 47).

Shards cannot be added or removed on the fly: if you decide you want more or fewer shards, you need to define a new index and re-index your content into that new index. Nonetheless, Lily allows changing the sharding configuration of existing indexes on the fly without complaining. When doing this, working indexers will be restarted to take the new configuration into account (a running index re-building job would be unaffected). It is up to you to consider whether the changes you make are sensible without rebuilding the index.

5.6.2 Shard selection

Index updates for a certain record should always go to the same Solr shard. The decision of what shard to use for what record can only be based upon the record ID.

While it might be interesting to allow selecting a shard based on the value of a field of a record, this is difficult in case the record has been deleted. The value of the field on which the sharding is based should also never change, something which Lily does not help with, leaving more responsibility to the user.


Lily can use a default sharding strategy (based on the hash of the master record ID modulo the number of available shards), or you can customize it through a configuration, specified when creating the index.

5.6.2.1 Sharding configuration (shard selection configuration)

In many situations the default sharding behavior will suffice. It is only if you really care about which record goes to which shard, probably based upon a variant property, that you need a custom configuration.

Below you find the structure of the sharding configuration. It consists of two main parts: the definition of the value to shard on (the sharding key) and the mapping of this sharding key onto a shard.

{
  shardingKey: {
    value: {
      source: "recordId|masterRecordId|variantProperty",
      property: "prop name"  /* only if source = variantProperty */
    },
    type: "long|string",
    hash: "md5",   /* optional, only if you want the value to be hashed */
    modulus: 3     /* optional, only possible if type is long */
  },
  mapping: {
    type: "list|range",

    /* in case of list: */
    entries: [
      { shard: "shard1", values: [0, 1, 2] },  /* values in array should be long or */
      { shard: "shard2", values: [3, 4, 5] }   /* string according to type          */
    ]

    /* in case of range: */
    entries: [
      { shard: "shard1", upTo: 1000 },  /* upTo value is exclusive         */
      { shard: "shard2" }               /* upTo is optional for last shard */
    ]
  }
}

The "shard1" and "shard2" are the logical shard names specified when creating the index.

Suppose you have a variant property "language" and want to shard based upon language; then you could use something like the following configuration:

{
  shardingKey: {
    value: {
      source: "variantProperty",
      property: "language"
    },
    type: "string"
  },
  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: ["en", "it"] },
      { shard: "shard2", values: ["nl", "de", "es"] }
    ]
  }
}
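A range mapping works analogously. The following sketch assumes a numeric variant property "year" (a hypothetical property name, chosen for illustration):

{
  shardingKey: {
    value: {
      source: "variantProperty",
      property: "year"
    },
    type: "long"
  },
  mapping: {
    type: "range",
    entries: [
      { shard: "shard1", upTo: 2005 },  /* records with year before 2005 */
      { shard: "shard2" }               /* all remaining records */
    ]
  }
}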


5.6.3 Example usage

If you want to play a bit with multiple shards, here is how to get started. These instructions are only for playing on your local machine.

First set up multiple Solr instances. For example, using the launch-solr tool you know from Running Lily, you can do:

launch-solr -s schema.xml -p 8984
launch-solr -s schema.xml -p 8985

If you are starting Solr via its start.jar, make two copies of the Solr home dir and start with:

java -Djetty.port=8984 -Dsolr.solr.home=solr1 -Dsolr.data.dir=solr1/data -jar start.jar
java -Djetty.port=8985 -Dsolr.solr.home=solr2 -Dsolr.data.dir=solr2/data -jar start.jar

Create a two-sharded index without specifying a sharding configuration (here using the mbox-import sample):

lily-add-index \
  -n mail \
  -s shard1:http://localhost:8984/solr/,shard2:http://localhost:8985/solr/ \
  -c samples/mail/mail_indexerconf.xml

This will use the default sharding configuration, which is generated on the fly depending on the number of shards you have. For our situation here, the configuration will be like the following:

{
  shardingKey: {
    value: { source: "masterRecordId" },
    type: "long",
    hash: "md5",
    modulus: 2
  },
  mapping: {
    type: "list",
    entries: [
      { shard: "shard1", values: [0] },
      { shard: "shard2", values: [1] }
    ]
  }
}

Suppose you save this in a file called shardingconfig.json; then you can specify it as follows when creating the index:

lily-add-index \
  -n mail \
  -s shard1:http://localhost:8984/solr/,shard2:http://localhost:8985/solr/ \
  -c samples/mail/mail_indexerconf.xml \
  -p shardingconfig.json

Now you are ready to start creating records. If you keep an eye on the consoles of your Solr instances, you will see both of them being called.


5.7 Solr Versions

Lily is built against the client libraries of Solr [unresolved variable: solrVersion], and by default uses the javabin format (rather than XML) to communicate with Solr.

5.7.1 Using Solr 1.4(.1)

Since Solr's javabin format changed in incompatible ways, you have to configure Lily to use the XML format in case you want Lily to talk to Solr 1.4.

This is done by editing the configuration file conf/indexer/indexer.xml, and adjusting the value of the following two properties to the values shown here:

<requestWriter>org.apache.solr.client.solrj.request.RequestWriter</requestWriter>
<responseParser>org.apache.solr.client.solrj.impl.XMLResponseParser</responseParser>

This configuration change has to be done on each of the Lily nodes.

When using Lily Enterprise, be sure to edit the central template configuration and redeploy the configuration in order to apply it across the cluster.

5.8 Indexer Error Handling

5.8.1 Solr unreachable

When Solr, or one of the Solr instances when using sharding, is unreachable, the incremental indexers will block indefinitely until Solr becomes reachable again. The operation will be retried at regular intervals; each time it fails, an error message will be logged to the category org.lilyproject.indexer.solrconnection.

The following kinds of errors are all in category 'Solr unreachable':

• Connection failure: likely Solr is not running, or a network failure

• Unknown host name

• HTTP 404 Not Found response: most likely an incorrectly configured path in the Solr URL (e.g. missing '/solr')

In order for administrators to be aware of this, the following metric (page 153) is incremented: solrClient.{indexname}_{shardname}.retries. It is recommended that administrators be notified when there is a change in this metric, especially if it keeps increasing for anything longer than a short time.

When these kinds of errors happen, the indexers will retry until the indexing succeeds; this means that no index updates will be lost (in contrast to, e.g., simply logging the error and processing the next message, as happens for unexpected errors, explained below).

5.8.2 Solr misconfiguration

When there is an error in the Solr configuration, for example a missing field in the schema.xml, the error will be logged and the same metric as for generic unexpected errors will be incremented, see below.


The indexing of the record will hence have been skipped, so the index will not be up to date.

These kinds of errors usually happen during project development. After the situation has been corrected, a batch index build can be performed to make sure the index is up to date.

5.8.3 Indexerconf misconfiguration

When using lily-add-index or lily-update-index, various checks are performed on the 'indexerconf.xml' configuration to minimize the possibility of runtime errors: the structure of the configuration is validated, as well as the existence of all referenced field types. These validations can however be skipped with the '--force' option.

If for some reason the loading of the configuration would still fail, the message queue listener(s) that should perform the indexing will fail to start, and hence no indexing will be performed. An error will be logged. (At this time, no metric is incremented for this failure.)

Note that once they have failed to start, the MQ listeners (indexers) will not retry starting until a change to the index definition happens. To trigger this without actually changing anything, use the 'lily-touch-index' command.
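A sketch of such a touch, assuming lily-touch-index selects the index with the same -n option as lily-add-index:

lily-touch-index -n mail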

5.8.4 General indexer errors

The indexer gracefully handles all sorts of errors that are bound to happen, such as receiving a message for a record which has meanwhile been deleted, or link fields pointing to non-existing records. These kinds of problems are not logged.

When an unexpected error occurs, the error will be logged, and the following metric (page 153) will be incremented: indexUpdater.{indexname}.errors. It is recommended that administrators be notified when there is a change in the value of this metric.

The indexing of the record will be skipped, and processing will continue with the next message from the message queue. So in such cases, the index might not be completely up to date.

5.9 Indexer Architecture

Here we briefly discuss the main components of the indexer. This can be helpful for a better understanding of how things work, or as an introduction for people who want to dive into the source code.

5.9.1 The indexer model

This is a library that offers an API to query and modify the definition of the indexes. Other components that want access to the definition of the indexes always do so through this library. This includes components within the Lily server (those discussed further on) as well as, for example, the command-line utilities such as lily-add-index.

Basically, the information managed by the indexer model is what you see when you execute lily-list-indexes.

When you want access to the definition of the indexes, you do not need to talk to one of the Lily nodes, but only need to make use of this library, which only needs access to ZooKeeper. This means the indexer model can also be manipulated while no Lily nodes are running.


The Lily nodes register change listeners on the indexer model to react dynamically as the model changes (this is implemented through ZooKeeper watchers).

All information about an index is stored within the data of one znode (ZooKeeper node); this includes the indexer configuration. Storing it within one znode makes it easy to atomically modify it and watch it.

5.9.2 The indexer engine

The indexer engine contains two parts:

• it performs the mapping of Lily records to Solr documents based on the indexer configuration (the “indexerconf.xml”)

• it contains the algorithm for performing the incremental updating of the index. This includes finding out what denormalized information needs to be updated, based on the link index.

5.9.3 The indexer worker

The indexer worker is a component that runs on each Lily node and registers one or more message queue listeners (whose implementation is provided by the indexer engine) for each index for which incremental indexing is enabled (update state: SUBSCRIBE_AND_LISTEN).

This happens dynamically: the message queue listeners are added or removed as indexes are added or removed, or when their update state changes.

In the future, we might add the possibility to enable or disable the indexer worker for selected Lily nodes. You could for example have some Lily nodes which are dedicated to indexing, and others which serve client CRUD requests. Let us know if you are interested in this.

5.9.4 The indexer master

The indexer master is a component which is active on only one of the Lily nodes, based on ZooKeeper-based leader election. If the Lily node on which it runs dies, another node will take over the role.

The tasks of the indexer master include:

• when an index is in the state SUBSCRIBE_AND_LISTEN or SUBSCRIBE_DO_NOT_LISTEN, it registers a message queue subscription for it, if that has not already been done.

• when an index is in the state DO_NOT_SUBSCRIBE and there is a message queue subscription for it, the indexer master will unregister the subscription.

• when an index is in the batch build state BUILD_REQUESTED and the batch job has not yet started, the indexer master starts it. It also monitors the execution of any batch build jobs for indexes that are in the state BUILDING, and updates the batch build info of the index.

• when an index has the general state DELETE_REQUESTED, it removes the message queue subscription for it (if any), kills the batch build job if any is running, and then deletes the definition of the index.


All these tasks are very lightweight and hence should not have much influence on the Lily node on which the indexer master runs.

5.9.5 The batch build MapReduce job

The batch build MR job is a MapReduce job (map-only) that takes as input the row keys of all records stored in HBase, and calls the indexer engine for each of these records.

It makes use of the HBase-provided MapReduce support, which means that the input will be split into as many parts as there are HBase regions in the records table.

There is no reduce part to this job, and neither does the map task produce any output key-values: it simply calls Solr directly. This approach is used since it allows running the batch build concurrently with an ongoing incremental update of the index.

The map task does not talk to external Lily nodes to retrieve the records, but rather uses an embedded repository.

Since the map task spends time waiting on IO (as it reads records from HBase and sends documents to Solr), it uses multiple threads to perform the indexing.

5.9.6 The link index

Conceptually, the link index is unrelated to the indexer, but as its main use is currently for the indexer, we discuss it here too.

The link index is an index based on the hbaseindex library, a generic library (that is part of Lily) for creating HBase-based secondary indexes.

The index is maintained by a secondary action, that is, an action which is guaranteed to run after each update to a record. It is executed before the message related to this update is put onto the message queue; thus it is guaranteed that the link index will be updated before any indexers receive events about the related change (putting the message onto the message queue is itself also performed as a secondary action).



6 Tools

6.1 Import Tool

The import tool allows loading a JSON file describing field types, record types and records into Lily.

For basic usage options, execute

lily-import -h

6.1.1 The import JSON format

The JSON format is basically the same as that of Lily's REST interface, but allows for multiple field types, record types and records to be described within one JSON structure.

The general structure of the JSON import file is as follows:

{
  namespaces: {
    ...
  },
  fieldTypes: [
    ...
  ],
  recordTypes: [
    ...
  ],
  records: [
    ...
  ]
}

The import tool accepts relaxed JSON without quoted property names and with comments in /* ... */ format.

For the format of the field types, record types and records to be embedded within the arrays, we refer to the documentation of the REST Interface (page 87). The only difference is that the namespaces are declared once at the top, instead of repeating them within each individual object.

The order of the sections in the import file is important: first namespaces, then fieldTypes, then recordTypes, then records. This is because the file is processed in order, to avoid having to read it entirely into memory.

The import tool works in a "create or update" mode, basically the same as when you do a PUT in the REST interface. For field types and record types, the identification is always performed based on their name; for records, based on their ID. For example, if a record type with the given name already exists, it will be updated (if necessary). If there is a conflict (e.g. a field type with a different scope), an error will occur. If records do not specify an ID, they will be recreated with a different ID upon each import.

Below is a sample import file describing a Person record type. For more examples, see also the samples directory of the Lily distribution.

{
  namespaces: {
    "org.sample.person": "p"
  },
  fieldTypes: [
    { name: "p$name", valueType: "STRING", scope: "versioned" },
    { name: "p$birthDay", valueType: "DATE", scope: "non_versioned" }
  ],
  recordTypes: [
    {
      name: "p$Person",
      fields: [
        { name: "p$name", mandatory: true },
        { name: "p$birthDay", mandatory: true }
      ]
    }
  ],
  records: [
    {
      type: "p$Person",
      fields: {
        "p$name": "Anonymous Coward",
        "p$birthDay": "1978-10-13"
      }
    }
  ]
}
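Assuming the sample above is saved as person.json (a file name chosen for this example), it can be loaded with:

lily-import person.json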

6.2 mbox Import Tool

6.2.1 About

The mbox import tool allows importing mbox mail archive files into Lily. This provides an easy way to load some 'real' content into Lily.

The import uses a simple model: for each mail message, one "Message" record is created, and for each part in the MIME message, a "Part" record is created. The content of each part is stored in a blob field of the Part record. The Message record only holds global fields like from, to and subject. The import tool currently handles all the parts equally, and does not attempt to select one as the main body of the mail.

+----------------------+            +------------------+
|                      | 1        * |                  |
|       Message        |------------|       Part       |
|                      |            |                  |
+----------------------+            +------------------+

Usage instructions are included within the mbox tool itself; execute:


lily-mbox-import -h

Below we run through the concrete steps to get it working, including indexing.

6.2.2 Mail usage run-through

6.2.2.1 Get some mbox files

One source of mbox files is the Apache mailing list archives, which can be found at:

http://{top level project}.apache.org/mail/{list name}

You can for example download them using curl:

curl -f http://hadoop.apache.org/mail/mapreduce-user/[2008-2010][01-12].gz -o "#1#2.gz"
curl -f http://cocoon.apache.org/mail/dev/[2000-2010][01-12].gz -o "#1#2.gz"

Other mbox sources:

• Gmane allows getting mbox archives (but warns about overusing the service): http://gmane.org/export.php

• Linux kernel list archives: http://userweb.kernel.org/~akpm/lkml-mbox-archives/

6.2.2.2 Run HBase & Lily

As explained in the Running Lily guide, you can run a test HBase instance with the command below, or you can use your own HBase installation.

bin/launch-hadoop

Start the Lily server:

bin/lily-server

6.2.2.3 Create the schema

If you run the import tool with the -s option, it will just create the schema.

bin/lily-mbox-import -s

If you need to connect to a ZooKeeper different from 'localhost:2181', use the -z option to specify the connection string.

6.2.2.4 Run SOLR and define an index

This step is optional and can be skipped.

A sample SOLR schema configuration is provided in the file samples/mail/mail_solr_schema.xml.

To run a test SOLR instance with this configuration, use:

bin/solr-launcher -s samples/mail/mail_solr_schema.xml


Now define an index using:

bin/lily-add-index -n mail -s shard1:http://localhost:8983/solr/ -c samples/mail/mail_indexerconf.xml

6.2.2.5 Run the import

You can import one file at a time or a complete directory. Files ending in ".gz" will be decompressed on the fly.

lily-mbox-import -f {file name or directory name}

Again, use -z to specify the ZooKeeper connection string:

lily-mbox-import -z localhost:2181 -f {file name or directory name}

6.3 Tester Tool

The tester tool can run a configurable scenario of CRUD operations against Lily.

It features the following:

• through a configuration file, you can specify:

• the field types and record types to be used by the test

• the scenario to play: how many CRUD operations of each record type

• it will fill fields with random data; for most field types other options are available. For example, for strings you can specify a certain number of words to take from a dictionary.

• it can create links between records in two ways:

• parent-child relations: a child record is created when the parent is created, and linked from the parent

• arbitrary relations: the link field is completed with a record ID taken from the set of earlier created records

• in the future, we will extend this tool with more tasks such as performing SOLR queries

Performance metrics are generated while the tester is running; do a "tail -f Tester-metrics" to see them. These metrics can include Lily and HBase system metrics if you launch the tester with the -lm and -hm options. This requires that you have enabled JMX access for Lily and HBase on all nodes. For this, comment out the respective lines in hbase-env.sh and lily/service/wrapper.conf. Afterwards you can generate charts from these metrics using the lily-metrics-report tool.

If errors occur, they are logged to the file failures.log.

A default configuration can be generated by running:

lily-tester -d

Here's an example configuration file, config.json, also containing explanations of the different configuration settings.

To run this configuration, execute:


lily-tester -c config.json

More usage information is available via:

lily-tester -h


7 REST (HTTP+JSON) API

7.1 REST Interface Tutorial

7.1.1 Abstract

This is a quick introduction to the REST interface. The full details are described in the referencedocumentation (page 87).

For demo purposes we will use the curl tool and assume the usage of a unix-like shell. The URIs used in the samples assume you have a Lily node running on localhost, listening on port 12060.

7.1.2 Creating a schema

Before we can create any records in Lily, we need to define our schema. For the purpose of this example, let's create two field types called name and price, and combine them into a record type called product.

7.1.2.1 Creating the name field type

You can create the field type by entering (or copy-pasting) the following command on the shell. Since we end the first line with a quote, the shell will ask for more lines until we close the quote.

curl -XPOST localhost:12060/repository/schema/fieldType \
  -H 'Content-Type: application/json' \
  -d '{
  action: "create",
  fieldType: {
    name: "n$name",
    valueType: "STRING",
    scope: "versioned",
    namespaces: { "my.demo": "n" }
  }
}' -D -

To create a field type we POST to the resource representing the collection of field types.

The server needs to know what kind of content we are submitting; this is specified using the -H option.

The JSON we submit follows a structure that recurs for all usage of the POST method: it is an object specifying an action and an actual object, here a fieldType. The REST interface is liberal in what it accepts: the submitted JSON does not need to have property names quoted, even though this is required by the JSON specification.


For the field type, we specify its essential properties: the name, the value type and the scope.

Names of field types are namespaced. Similar to XML, the namespace is not embedded directly into the name but associated with a prefix. So in this example the namespace is "my.demo" and the associated prefix is "n". In contrast to XML, the prefix and local name are not separated with a colon but rather with a dollar sign. The reason for this is that the same syntax is used in URIs, where the colon is a reserved character. This saves us from escaping it each time.

The namespace mapping is declared such that namespaces are mapped onto prefixes. It is done this way because when you read an entity (like a field type or a record), you are usually interested in finding out what prefix is used for a particular namespace, rather than the other way around. However, the map can easily be reversed, since each namespace occurs only once and is bound to a different prefix.

Finally, we specify the option "-D -" to dump the response headers to standard out. This is useful to see things like the status code and the Location header.

The response you get when executing the above command will be similar to this:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Thu, 28 Apr 2011 13:43:12 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/schema/fieldType/n$name?ns.n=my.demo
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "04359728-6824-4e64-8e41-f7e496148c03",
  "name": "ns1$name",
  "scope": "versioned",
  "valueType": "STRING",
  "namespaces": { "my.demo": "ns1" }
}

The JSON will however not contain any whitespace and newlines, but rather appear as one long line. We added the whitespace here for readability.

The Location response header shows where the newly created field type can be retrieved from; you can try that as well:

curl -XGET http://localhost:12060/repository/schema/fieldType/n\$name?ns.n=my.demo

Next to the resource /repository/schema/fieldType, under which field types are addressed by name, there is also the resource /repository/schema/fieldTypeById, under which field types are addressed by ID. This resource behaves the same: we could as well have created the field type by POSTing to this resource. The only difference is that the Location header in the response would then be set to:

http://localhost:12060/repository/schema/fieldTypeById/04359728-6824-4e64-8e41-f7e496148c03

7.1.2.2 Creating the price field type

Just for illustration, we will create the price field type in a different way: using the PUT method. PUT will either update or create the field type, depending on whether it already exists.

The command is as follows:


curl -XPUT localhost:12060/repository/schema/fieldType/n\$price?ns.n=my.demo \
  -H 'Content-Type: application/json' \
  -d '{
  name: "n$price",
  valueType: "DECIMAL",
  scope: "versioned",
  namespaces: { "my.demo": "n" }
}' -D -

Here you see how a namespaced name is represented in a URI: again using a prefix, which is mapped onto a namespace using a request parameter starting with "ns.". The \ before the $ sign is only necessary here because $ has a special meaning in the shell.

In contrast to when using POST, we now submit just the field type, without the wrapper object specifying an action.

The field type name occurs in both the URI and the submitted entity, which might leave you wondering which one will be used. The one specified in the URI is used to retrieve the existing field type, if any. The one specified in the body is used to update the name of the field type, or when creating the field type.

When you execute this curl command the first time, the response status will be "201 Created". If you execute it a second time, the status will be "200 OK", since the field type already exists. The field type will have been updated if necessary to correspond with the submitted JSON. So the PUT operation behaves as "create or update". In contrast, if you retry the POST operation that we used to create the name field type, it will respond with "409 Conflict". Hence, you would use POST if you want to avoid updating an existing field type.

7.1.2.3 Creating the product record type

Creating a record type is done in the same way as a field type. You again have the choice between using POST (if you want to be sure to be creating something) or PUT (if you want to either update or create the record type).

The submitted JSON format is of course a bit different: for a record type we specify the list of fields it should contain.

The command is as follows:

curl -XPOST localhost:12060/repository/schema/recordTypeById \
  -H 'Content-Type: application/json' \
  -d '{
  action: "create",
  recordType: {
    name: "n$product",
    fields: [
      { name: "n$name", mandatory: true },
      { name: "n$price", mandatory: true }
    ],
    namespaces: { "my.demo": "n" }
  }
}' -D -

Just for illustration, this time we posted to the 'ById' resource.

The response is:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Thu, 28 Apr 2011 14:03:20 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/schema/recordTypeById/3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac",
  "name": "ns1$product",
  "fields": [
    { "id": "0d096b72-826b-481b-970a-0097e987b066", "mandatory": true },
    { "id": "04359728-6824-4e64-8e41-f7e496148c03", "mandatory": true }
  ],
  "version": 1,
  "mixins": [],
  "namespaces": { "my.demo": "ns1" }
}

Again, you can retrieve this record type using the URI found in the Location header:

curl -XGET http://localhost:12060/repository/schema/recordTypeById/3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac

Record types are versioned; specific versions can be retrieved as follows:

curl -XGET http://localhost:12060/repository/schema/recordTypeById/3f8f3526-0c90-4cbf-aecd-1d0f014bf5ac/version/1

7.1.3 Creating records

7.1.3.1 Create record using POST, server assigns record ID

Let's create a product record:

curl -XPOST localhost:12060/repository/record \
  -H 'Content-Type: application/json' \
  -d '{
  action: "create",
  record: {
    type: "n$product",
    fields: {
      n$name: "Bread",
      n$price: 2.11
    },
    namespaces: { "my.demo": "n" }
  }
}' -D -

The response is:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Thu, 28 Apr 2011 14:05:33 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/record/UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0",
  "version": 1,
  "type": { "name": "ns1$product", "version": 1 },
  "versionedType": { "name": "ns1$product", "version": 1 },
  "fields": {
    "ns1$name": "Bread",
    "ns1$price": 2.11
  },
  "namespaces": { "my.demo": "ns1" }
}

The response JSON is more extensive than what we submitted:

• it contains the assigned id

• it contains the number of the created version

• it contains the record type for each scope that exists in the record (see the property versionedType). You can also see that the type property is now an object containing a name and a version, while in the submitted JSON we simply specified the name: Lily automatically took the latest version.

7.1.3.2 Creating a record using PUT, assigning the record ID yourself

Another way to create a record is to PUT to the resource /repository/record/{id}. This is different from the example with POST above in two important ways:

• it requires that you choose an ID for the record yourself

• PUT is not a pure create, but a "create or update": if a record with the selected ID already exists, Lily will not complain that the record already exists, but simply update it.

The PUT method has the advantage that if an update fails because of some IO error (a network problem, or the Lily node dying while handling the request), you can simply retry the operation. The end result will be the same: there will be a record in the repository with the given ID and the specified field values.

So the first thing we now have to do is decide on a record ID. Lily allows you either to invent your own custom record ID, which can be an arbitrary string, or to use UUIDs. To use a custom record ID, simply use a string of the form "USER.something". To use a UUID, the string should be of the form "UUID.{valid uuid string following RFC 4122}".

For this example, let's use a UUID. In Linux, you can generate one with the command uuidgen:

$ uuidgen -r
a7166289-eb7a-4715-8c8e-3c997d752926

Now let's post our record:

curl -XPUT localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926 \
  -H 'Content-Type: application/json' \
  -d '{
  type: "n$product",
  fields: {
    n$name: "Butter",
    n$price: 4.25
  },
  namespaces: { "my.demo": "n" }
}' -D -

The response is as before:

HTTP/1.1 201 Created
Content-Type: application/json; charset=UTF-8
Date: Fri, 29 Apr 2011 08:30:05 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926
Server: Restlet-Framework/2.1snapshot
X-Kauri-ModuleInfo: rest (version: 1.0)
Transfer-Encoding: chunked

{
  "id": "UUID.a7166289-eb7a-4715-8c8e-3c997d752926",
  "version": 1,
  "type": { "name": "ns1$product", "version": 1 },
  "versionedType": { "name": "ns1$product", "version": 1 },
  "fields": {
    "ns1$name": "Butter",
    "ns1$price": 4.25
  },
  "namespaces": { "my.demo": "ns1" }
}

7.1.4 Reading records

Reading a record is very simple: just perform a GET operation on its URL. The following URL was simply copied from the 'Location' header in the response of the previous example:

curl http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926 | json_reformat

You can of course also use your web browser to view the record at this URL.

The GET operation supports a request parameter schema=true to include the schema information of each of the requested fields. This can be useful for generic applications that have no baked-in knowledge about the field types:

curl http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926?schema=true | json_reformat

This gives:

{
  "id": "UUID.a7166289-eb7a-4715-8c8e-3c997d752926",
  "version": 1,
  "type": { "name": "ns1$product", "version": 1 },
  "versionedType": { "name": "ns1$product", "version": 1 },
  "fields": {
    "ns1$name": "Butter",
    "ns1$price": 4.25
  },
  "schema": {
    "ns1$name": {
      "id": "2f03a71a-1c94-4005-b56a-12db8d58c1e6",
      "scope": "versioned",
      "valueType": "STRING"
    },
    "ns1$price": {
      "id": "f2c4dedb-145a-4a7e-9580-980cf07c5928",
      "scope": "versioned",
      "valueType": "DECIMAL"
    }
  },
  "namespaces": { "my.demo": "ns1" }
}

If you are only interested in a subset of the fields of a record, you can specify the fields to return with a request parameter called fields. So suppose we only want the name:

curl 'http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926?fields=n$name&ns.n=my.demo' | json_reformat

The URL has been put between quotes so that the shell ignores special characters like $.

Other things you can do are retrieving a specific version of a record, retrieving the list of versions, etc. All this should be straightforward: see the reference (page 87).
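For example, a specific version can be retrieved by appending /version/{number} to the record URI; this is a sketch assuming the version resource follows the same pattern as shown earlier for record types:

curl http://localhost:12060/repository/record/UUID.a7166289-eb7a-4715-8c8e-3c997d752926/version/1 | json_reformat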

7.1.5 Creating a record with a blob field

Something which might take a bit more time to figure out is how to create a record with a blob field.

Before we can do this, we need a blob field type and a record type containing this field. For the purpose of this sample, we will create a field called data and a record type called file.

The command to create the data field type is:

curl -XPOST localhost:12060/repository/schema/fieldType \
  -H 'Content-Type: application/json' \
  -d '{
  action: "create",
  fieldType: {
    name: "n$data",
    valueType: "BLOB",
    scope: "versioned",
    namespaces: { "my.demo": "n" }
  }
}' -D -

The command to create the file record type is:

curl -XPOST localhost:12060/repository/schema/recordType \
  -H 'Content-Type: application/json' \
  -d '{
  action: "create",
  recordType: {
    name: "n$file",
    fields: [
      { name: "n$data", mandatory: true }
    ],
    namespaces: { "my.demo": "n" }
  }
}' -D -

Creating a record with a blob field happens in two steps:

1. upload the blob(s)

2. create the record with a reference to the blob

A blob is uploaded by POSTing it to the /repository/blob resource. It is required to specify the Content-Length header, which curl does automatically for you, and the Content-Type header. In the following command, I am uploading a file which I had lingering on my disk: zookeeper-3.3.1.tar.gz:


curl -XPOST localhost:12060/repository/blob \
  --data-binary @zookeeper-3.3.1.tar.gz \
  -H 'Content-Type: application/x-gzip' -D -

In response, this gives some JSON:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Date: Mon, 30 Aug 2010 14:48:15 GMT
Accept-Ranges: bytes
Server: Restlet-Framework/2.0snapshot
Content-Length: 91

{
  "value": "AAAAA0RGU07g2WL00E0Ol7Fg3xroOWo",
  "mimeType": "application/x-gzip",
  "size": 10279804
}

It is exactly this piece of JSON that you need to use as the value for the record field, as follows:

curl -XPOST localhost:12060/repository/record \
  -H 'Content-Type: application/json' \
  -d '{
  action: "create",
  record: {
    type: "n$file",
    fields: {
      n$data: {
        "value": "AAAAA0RGU07g2WL00E0Ol7Fg3xroOWo",
        "mimeType": "application/x-gzip",
        "size": 10279804
      }
    },
    namespaces: { "my.demo": "n" }
  }
}' -D -

That's it: we created a record with a blob field.

To download the blob, you can access it via the resource /repository/record/{id}/field/{name}/data, in this case (you can find the record ID in the output of the previous command):

curl localhost:12060/repository/record/UUID.1248605b-0c8d-40a9-a684-7c01c94a5c0c/field/n\$data/data?ns.n=my.demo \
  --output download.tar.gz

7.1.6 Creating A Record With A Complex Field

Sometimes you might want to store a more complex value in a field: not a simple value like a string, but a complex value which is itself composed of multiple fields. In Lily this is possible by creating fields of type RECORD. These are fields in which you can put Record objects. These are not real records with their own identity; it is just a re-use of the top-level Record data structure as a value within the field of another record. Since any record object can have fields which by themselves can again contain records (or lists of records), this allows for modeling arbitrarily complex structures.

Before you use complex fields, you should always ask yourself whether you want to use complex fields or rather link fields (which contain pointers to other records). Both enable you to store the same kinds of nested/complex structures. In the case of complex fields, the nested structures (nested records) are all stored within one record, so they don't have their own identity and are hence not separately retrievable or indexable. Link fields pointing to other records give each part of the nested structure its own identity, but at the cost of having to create/read multiple records, and losing the atomicity of the create operation.


Since complex fields are modeled in Lily by creating field types with RECORD as value type, they are also called record-type fields.

In the following example, we will create articles which have authors. Each author has a name and email attribute. For the sake of this example, we are going to store the authors within the article, in a complex field. So there will be no re-use of the same author records across articles.

For this example, we will create the schema using the import tool. Save the following in a file called schema.json:

{
  namespaces: {
    "article": "a"
  },
  fieldTypes: [
    { name: "a$name", valueType: "STRING" },
    { name: "a$email", valueType: "STRING" },
    { name: "a$title", valueType: "STRING" },
    { name: "a$authors", valueType: "LIST<RECORD<{article}author>>" },
    { name: "a$body", valueType: "STRING" }
  ],
  recordTypes: [
    {
      name: "a$author",
      fields: [
        { name: "a$name", mandatory: true },
        { name: "a$email", mandatory: true }
      ]
    },
    {
      name: "a$article",
      fields: [
        { name: "a$title", mandatory: true },
        { name: "a$authors", mandatory: true },
        { name: "a$body", mandatory: true }
      ]
    }
  ]
}

And then import it using:

lily-import schema.json

Now we can create an article, with authors nested in it, as follows:

curl -XPUT localhost:12060/repository/record/USER.my_article \
  -H 'Content-Type: application/json' \
  -d '{
  type: "a$article",
  fields: {
    a$title: "Title of the article",
    a$authors: [
      {
        type: "a$author",
        fields: {
          a$name: "Author X",
          a$email: "[email protected]"
        }
      },
      {
        type: "a$author",
        fields: {
          a$name: "Author X",
          a$email: "[email protected]"
        }
      }
    ],
    a$body: "Body text of the article"
  },
  namespaces: { "article": "a" }
}' -D -

The authors field contains a list in which each entry again follows the same structure as a top-level record: you specify its type and its fields.

7.1.7 Scanning Over Records

Scanners allow you to run sequentially over all or part of the records stored in the repository. For an introduction to scanners, see Scanning Records And Record Locality (page 126).

To start, you need to create a scanner, giving the parameters for the scanner in the body:

curl -XPOST localhost:12060/repository/scan \
  -H 'Content-Type: application/json' \
  -d '{
  recordFilter: {
    "@class": "org.lilyproject.repository.api.filter.RecordTypeFilter",
    recordType: "{my.demo}product"
  }
}' -D -

In the above example, we run over all records and use a filter so that only records of the desired type are returned. If you just want to run over all records, post an empty JSON object, { }. There are many other options available, see JSON Formats (page 91).

The response will contain the URL of the created scanner in the Location header:

HTTP/1.1 201 Created
Content-Length: 0
Content-Type: application/octet-stream; charset=UTF-8
Date: Thu, 15 Mar 2012 13:54:43 GMT
Accept-Ranges: bytes
Location: http://localhost:12060/repository/scan/7832142591753684320

We can now query this scanner to return the next record(s). By default, just one record is returned; use the batch parameter to retrieve multiple records.

curl 'http://localhost:12060/repository/scan/7832142591753684320?batch=10' | json_reformat

This gives our two products:

{
  "results": [
    {
      "id": "UUID.0a6e8ca6-ab06-4c7e-bcc5-33c1f048a4d9",
      "version": 1,
      "type": { "name": "ns1$product", "version": 1 },
      "versionedType": { "name": "ns1$product", "version": 1 },
      "fields": {
        "ns1$name": "Bread",
        "ns1$price": 2.11
      },
      "namespaces": { "my.demo": "ns1" }
    },
    {
      "id": "UUID.a7166289-eb7a-4715-8c8e-3c997d752926",
      "version": 1,
      "type": { "name": "ns1$product", "version": 1 },
      "versionedType": { "name": "ns1$product", "version": 1 },
      "fields": {
        "ns1$name": "Butter",
        "ns1$price": 4.25
      },
      "namespaces": { "my.demo": "ns1" }
    }
  ]
}

You can repeatedly call GET on the scanner resource, until the scanner has reached the end. At that point, it will respond with '204 No Content':

curl --dump-header - 'http://localhost:12060/repository/scan/2764478081058669015'

HTTP/1.1 204 No Content
Content-Type: application/json; charset=UTF-8
Date: ...

When done with the scanner, delete it to free up the resources:

curl -XDELETE 'http://localhost:12060/repository/scan/2764478081058669015'

Scanners only live on the server where you created them. So all requests related to a single scanner should go to the same Lily server.

7.2 REST API Reference

7.2.1 About the REST interface

For an introduction to the REST interface, see REST Interface: Getting Started (page 77).

The REST API reference documentation consists of two parts:

• JSON Formats (page 87)

• REST Protocol (page 94)

7.2.2 JSON Formats

7.2.2.1 About JSON

Lily's REST interface is liberal in the JSON it accepts: it supports unquoted property names andcomments.


7.2.2.2 Content-Type

The REST interface supports only JSON as content type. Requests that submit JSON should have a header “Content-Type: application/json”.

7.2.2.3 Namespaces

7.2.2.3.1 Namespaced names

You have two options for specifying namespaced names: either you specify them in full, or you use a namespace prefix.

7.2.2.3.1.1 Specifying namespaced names in full

To specify the name in full, use the following syntax:

{namespace}name

Thus, the namespace is specified between curly braces, followed by the name.

7.2.2.3.1.2 Use prefixes

Alternatively, for shorter typing when your namespaces are long, you can bind them to prefixes.

The syntax for the names then becomes:

prefix$name

Thus a prefix, followed by a dollar sign, followed by the non-namespaced name.

The prefix can be freely chosen, and is bound to the actual namespace as described next.

The prefixes used in entities retrieved from Lily will usually be different from those you use when submitting entities: Lily does not remember the prefixes, only the namespaces.

7.2.2.3.1.3 Declaring namespaces

In each format, a property called namespaces can be present containing namespace declarations.

The format for namespaces is as follows:

{
  "namespace1": "prefix1",
  "namespace2": "prefix2"
}

Since the namespace is used as the key, each namespace can be mapped to just one prefix. This makes it easier to read, e.g., fields in a record: just find out what the prefix for the namespace is, and use that to retrieve the name. This would be more complicated if different prefixes could map to the same namespace.

Obviously, each namespace should be mapped to a different prefix.

7.2.2.4 Field type format

{
  id: "string",  [not required upon submit]
  name: "prefix$name",
  valueType: "STRING|INTEGER|LONG|...",
  scope: "versioned|non_versioned|versioned_mutable",  [default = non_versioned]
  namespaces: { ... }
}

The full list of available value types can be found in the section on the record format.

7.2.2.5 Record type format

{
  id: "string",  [not required upon submit]
  name: "prefix$name",
  version: long,
  fields: [
    {
      id: "string",          [upon submit, you can specify either id or name]
      name: "prefix$name",   [not present upon retrieval]
      mandatory: true|false  [default = false]
    }
  ],
  mixins: [
    {
      id: "string",          [upon submit, you can specify either id or name]
      name: "prefix$name",   [not present upon retrieval]
      version: long
    }
  ],
  namespaces: { ... }
}

7.2.2.6 Record format

{
  id: "string",
  type: "prefix$name" or { name: "prefix$name", version: long },
  versionedType: { name: "prefix$name", version: long },         [only when applicable, ignored upon submit]
  versionedMutableType: { name: "prefix$name", version: long },  [only when applicable, ignored upon submit]
  version: long,  [only when the record has versions, ignored upon submit]
  fields: {
    "prefix$name": value  [format for the value: described below]
  },
  fieldsToDelete: [ "prefix$name", ... ],
  schema: {  [ignored upon submit]
    "prefix$name": { field type json }
  },
  namespaces: { ... }
}

This format deserves some more explanation.

7.2.2.6.1 The record ID

The record ID can be either:

• a custom string. In this case it should start with "USER.". For example, "USER.foobar".

• a UUID. In this case it should start with "UUID.", followed by the string representation of a UUID as defined in RFC 4122. For example, "UUID.458ae835-7e00-42f0-9366-1caec2472a3b".


Besides the core ID, the record ID can also contain variant properties. The format for these is described in the Javadoc of RecordId.toString().

7.2.2.6.2 Formatting of value types

The following table shows the names of the value types, and what JSON type should be used for their values.

Value type name       JSON type      Example / details

STRING                string

INTEGER               number

LONG                  number

DOUBLE                number

DECIMAL               number

BOOLEAN               boolean

DATE                  string

DATETIME              string

URI                   string         The string should be acceptable by the constructor of the java.net.URI class.

LINK                  string         A Lily record ID, as obtained by, and described in the Javadoc of, RecordId.toString(). The LINK type can optionally be qualified with a record type name: LINK<{namespace}name>

BLOB                  object         A JavaScript object with the following properties:
                                     • size: integer number
                                     • mimeType: string
                                     • value: as returned in the response when creating the blob (POST on /repository/blob)
                                     • name: string [optional]

LIST<sometype>        array          An array of values, which can in their turn be arrays again in case the nested value type is a LIST or PATH. The LIST type needs to be qualified with the kind of types in the list, for example: LIST<STRING> or LIST<LINK>.

PATH<sometype>        array          An array of values, which can in their turn be arrays again in case the nested value type is a LIST or PATH. The PATH type needs to be qualified with the kind of types in the path, for example: PATH<STRING>

RECORD,
RECORD<recordtype>    record json    JSON representation of a record; some properties will be ignored though: version, versionedType, versionedMutableType, fieldsToDelete. The RECORD type can optionally be qualified with a record type name: RECORD<{namespace}name>

BYTEARRAY             base64         Base64-encoded representation of the bytes

7.2.2.7 List format

Some resources return a list of records, record types or field types.

The format for these is:

{
  results: [
    { json of a record, record type or field type },
    ...
  ]
}

7.2.2.8 POST format

For POST requests, we use a generic format which is as follows:

{
  action: "update|create|...",
  entityName: { JSON object for the kind of entity }
}

in which entityName is one of: fieldType, recordType, record.

7.2.2.9 Record Scan Format

For a general introduction to scanners, see Scanning Records And Record Locality (page 126). Scanners allow you to run sequentially over all or part of the records stored in the repository. Scanners can efficiently jump to the specified startRecordId, but from there access each record sequentially until the scan stops at the stopRecordId, or when a filter indicates the scanning should stop; otherwise it runs until the very end of the table. The filtering ability of scanners is not based on indexes: when you specify a filter, the scan still runs over each record and evaluates the filter for it.

The syntax:

{
  startRecordId: "UUID.something or USER.something",
  stopRecordId: "...",
  rawStartRecordId: "...",
  rawStopRecordId: "...",
  recordFilter: { /* see syntax below */ },
  returnFields: { /* see syntax below */ },
  caching: integer,
  cacheBlocks: true|false
}

All properties are optional; if none are given, the scan will run over all records (and without caching, which is off by default).

Some further explanation of the properties:

• startRecordId and stopRecordId: specify from which record ID (inclusive) to which record ID (exclusive) the scan should go. This is mostly useful with user-specified record IDs. Note that the record IDs don't need to exist in the repository: for example, the scan will start from the first record ID which is equal to or larger than the given startRecordId. So you can specify "USER.a" to start the scan at the first record whose ID starts with the character a, which might be "USER.albert". (A combined example follows after this list.)

• rawStartRecordId and rawStopRecordId: you will most likely never need these. They are only relevant if you want to specify keys that are not full valid record IDs. They specify the record ID as raw bytes, here encoded as base64.

• recordFilter: specifies a filter to filter out certain records. There are multiple filters available; their syntax is described in the section below. You could as well filter in your client, but specifying a filter has the advantage that the data doesn't need to be transported to the client.

• returnFields: the fields that should be returned for each record, see the syntax below

• caching: the number of records to fetch in one go. By default, this is '-1', which disables caching. It is strongly recommended to enable this. Either set it to about the number of records you expect, or, if you do a full table scan, set it to a high value, e.g. 500.

• cacheBlocks: this property is relevant when you are doing full table scans; in that case it can make sense to hint the server not to put the data in the cache, since you are going through it just once.
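Putting these properties together, a scan over a key range with a filter and caching enabled could look like this (the record IDs are illustrative):

{
  startRecordId: "USER.a",
  stopRecordId: "USER.b",
  recordFilter: {
    "@class": "org.lilyproject.repository.api.filter.RecordTypeFilter",
    recordType: "{my.demo}product"
  },
  caching: 500,
  cacheBlocks: false
}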

returnFields

The syntax for the returnFields property is:

{
  type: "NONE|ALL|ENUM",
  fields: [ "field qname" ]
}

The fields property is only relevant when the type is ENUM.


7.2.2.10 Filter Format

7.2.2.10.1 General

Each filter contains an attribute @class identifying the type of filter. The other properties are filter-dependent.

{
  "@class": "..."
}

7.2.2.10.2 Record Type Filter

Only lets through records of the given record type.

{
  "@class": "org.lilyproject.repository.api.filter.RecordTypeFilter",
  recordType: "record type qname",
  version: integer  (optional)
}

7.2.2.10.3 Field Value Filter

Only lets through records for which the given field equals (or does not equal) the given value.

{
  "@class": "org.lilyproject.repository.api.filter.FieldValueFilter",
  field: "field qname",
  fieldValue: ...,
  compareOp: "EQUAL|NOT_EQUAL",
  filterIfMissing: true|false
}

The field value should be specified in the same syntax as used in records.

filterIfMissing: if false (the default is true) and the record does not have the field, the record will be let through.
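For example, reusing the product schema from the tutorial, the following only lets through records whose name field equals "Bread":

{
  "@class": "org.lilyproject.repository.api.filter.FieldValueFilter",
  field: "{my.demo}name",
  fieldValue: "Bread",
  compareOp: "EQUAL",
  filterIfMissing: true
}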

7.2.2.10.4 Record ID Prefix Filter

Only lets through records whose ID starts with the given record ID. For example, specifying "USER.a" will let through "USER.afoo" and "USER.abar" but not "USER.b". This filter causes the scanning process to stop as soon as a key is encountered which is larger than the given prefix, since no further record could then be a match.

When using this filter, you will usually set the startRecordId of the scan to the same record ID.

{
  "@class": "org.lilyproject.repository.api.filter.RecordIdPrefixFilter",
  recordId: "..."
}
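Combined with a matching startRecordId (the prefix is illustrative), a scan definition could look like:

{
  startRecordId: "USER.a",
  recordFilter: {
    "@class": "org.lilyproject.repository.api.filter.RecordIdPrefixFilter",
    recordId: "USER.a"
  }
}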

7.2.2.10.5 Filter List

Combines multiple filters. A filter list can itself again contain a filter list, allowing you to create arbitrary hierarchies of filters.

{
  "@class": "org.lilyproject.repository.api.filter.RecordFilterList",
  operator: "MUST_PASS_ALL|MUST_PASS_ONE",
  filters: [ ]
}

7.2.3 REST Protocol

7.2.3.1 Nodes / connecting / load balancing

The REST interface is exposed by each individual Lily node. Ideally, clients should load-balance their requests over the set of available Lily nodes. Right now, Lily does not offer a standard solution for this.

The port on which the REST interface listens is configured in conf/kauri/connectors.xml.

7.2.3.2 Error responses

Whether a request was successful or not can be detected through the HTTP status code: all status codes in the 2xx range indicate success.

Failures can be due to the client (e.g. a syntax error in the URI or the submitted JSON), or can be due to failures in Lily itself.

The most common error responses are:

400 Bad Request

404 Not Found

500 Internal Server Error

For most errors, the entity is a JSON structure with the following format (look at the Content-Type header of the response).

{
  status: long, [the HTTP status code repeated]
  description: "description of the status code",
  causes: [
    {
      message: "string",
      type: "fully qualified java class name"
    }
  ],
  stackTrace: "complete java stack trace"
}

The causes array contains the message and type of the exception that happened, and of all its causes.

Here is a sample error response. The request tried to submit a record containing an undefined field name. The stack trace has been snipped for the most part.

{ "status": 500, "description": "Internal Server Error", "causes": [ { "message": "Error reading submitted JSON.", "type": "org.lilyproject.rest.ResourceException" }, { "message": "FieldType '{my.demo}someNonExistingField' could not be found.", "type": "org.lilyproject.repository.api.FieldTypeNotFoundException" } ],

Page 96: Book

Lily documentation 95

"stackTrace": "org.lilyproject.re [snipped] odyReader.java:63)\n\t... 65 more\n"}

For some errors we currently have no control over the formatting, and the response will depend on the framework. See issue 1043.

7.2.3.3 Method tunneling

Some HTTP clients are not able to perform methods like PUT and DELETE. In such cases, you can tunnel these methods over the POST method.

Use a request header X-HTTP-Method-Override

With curl, you would do this as follows:

curl -XPOST -H 'X-HTTP-Method-Override: PUT' ...

Use a request parameter 'method'

Example:

curl -XPOST localhost:8888/repository/record/USER.foobar?method=PUT

7.2.3.4 Resources for field types

7.2.3.4.1 /repository/schema/fieldType/{prefix$name}?ns.prefix=namespace

7.2.3.4.1.1 GET

Gets a field type by name.

As a reminder, the name should be a namespaced name and the namespace should be bound to a prefix declared in a request parameter. Example:

http://myhost/repository/schema/fieldType/p$title?ns.p=my.namespace

If the namespace is a URL, it should be properly escaped.

7.2.3.4.1.2 PUT

Create or update a field type.

The field type name specified in the URI is used to determine which field type to update (if it already exists). After the update, the name of the field type will be changed to what is in the submitted JSON (if it is different).

In case of a created or a renamed field type, the response Location header will point to /repository/schema/fieldType/{prefix$name}?ns.prefix=namespace.

In case a field type is renamed, the response will be "301 Moved Permanently" rather than "200 OK".

The only property of a field type that you can update is its name. If you try to change other properties, such as the value type or the scope, you will get a response status of 409 Conflict.


7.2.3.4.2 /repository/schema/fieldTypeById/{id}

7.2.3.4.2.1 GET

Gets a field type by ID.

7.2.3.4.2.2 PUT

Update a field type. You cannot create a field type this way, since the ID is assigned by the system.

If you try to update immutable properties, you will get a 409 Conflict response.

7.2.3.4.3 /repository/schema/fieldType

7.2.3.4.3.1 GET

Gets the list of all field types. The returned entity is in the list format (page 91).

7.2.3.4.3.2 POST

Creates a new field type. The advantage of this method (over PUT on /repository/schema/fieldType/{name}) is that you are sure you are performing a create, not an update.

The posted entity should be a field type embedded in the following structure:

{ action: "create", fieldType: {}}

The Location header in the response will point to /repository/schema/fieldType/{prefix$name}?ns.prefix=namespace.
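For illustration, a create request could look like this (a sketch only: the host and port, the namespace, and the field type properties shown are illustrative; the exact field type entity format is described in the section on entity formats):

curl -XPOST -H 'Content-Type: application/json' \
  -d '{ "action": "create",
        "fieldType": {
          "name": "p$title",
          "valueType": "STRING",
          "scope": "versioned",
          "namespaces": { "my.namespace": "p" }
        } }' \
  http://localhost:8888/repository/schema/fieldType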

7.2.3.4.4 /repository/schema/fieldTypeById

7.2.3.4.4.1 GET

Same as for /repository/schema/fieldType.

7.2.3.4.4.2 POST

Same as for /repository/schema/fieldType.

The Location header in the response will point to /repository/schema/fieldTypeById/{id}.

7.2.3.5 Resources for record types

7.2.3.5.1 /repository/schema/recordType/{prefix$name}?ns.prefix=namespace

7.2.3.5.1.1 GET

Gets a record type by its name; returns the latest version of the record type.


7.2.3.5.1.2 PUT

Creates or updates a record type.

The record type name specified in the URI is used to determine which record type to update (if it already exists). After the update, the name of the record type will be changed to what is in the submitted JSON (if it is different).

In case of a created or a renamed record type, the response Location header will point to /repository/schema/recordType/{prefix$name}?ns.prefix=namespace.

In case a record type is renamed, the response will be "301 Moved Permanently" rather than "200 OK".

7.2.3.5.2 /repository/schema/recordTypeById/{id}

7.2.3.5.2.1 GET

Gets a record type by its ID, returns the latest version of the record type.

7.2.3.5.2.2 PUT

Update a record type. You cannot create a record type this way, since the ID is assigned by the system.

7.2.3.5.3 /repository/schema/recordType

7.2.3.5.3.1 GET

Gets the list of all record types. The returned entity is in the list format (page 91).

7.2.3.5.3.2 POST

Creates a new record type. The advantage of this method (over PUT on /repository/schema/recordType/{name}) is that you are sure you are performing a create, not an update.

The posted entity should be a record type embedded in the following structure:

{ action: "create", recordType: {}}

The response Location header will point to /repository/schema/recordType/{prefix$name}?ns.prefix=namespace.
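For illustration (a sketch, under the same assumptions as the field type example earlier: host, port, namespace and names are illustrative):

curl -XPOST -H 'Content-Type: application/json' \
  -d '{ "action": "create",
        "recordType": {
          "name": "p$Book",
          "fields": [ { "name": "p$title", "mandatory": true } ],
          "namespaces": { "my.namespace": "p" }
        } }' \
  http://localhost:8888/repository/schema/recordType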

7.2.3.5.4 /repository/schema/recordTypeById

7.2.3.5.4.1 GET

Same as for /repository/schema/recordType.

7.2.3.5.4.2 POST

Same as for /repository/schema/recordType.


The response Location header will point to /repository/schema/recordTypeById/{id}.

7.2.3.5.5 /repository/schema/recordType/{prefix$name}/version/{version}?ns.prefix=namespace

7.2.3.5.5.1 GET

Gets a specific version of a record type.

7.2.3.5.6 /repository/schema/recordTypeById/{id}/version/{version}

7.2.3.5.6.1 GET

Gets a specific version of a record type.

7.2.3.6 Resources for records

7.2.3.6.1 Common stuff

7.2.3.6.1.1 Specify fields to return

For operations which return a record, you can specify the fields which should be returned using a request parameter fields. For example:

/repository/record/{id}?fields=p$field1,p$field2&ns.p=namespace

7.2.3.6.2 /repository/record

7.2.3.6.2.1 POST

Allows you to create a record. Create a record this way (rather than using PUT on /repository/record/{id}) when:

• you want the server to assign an ID to the record. Note that with this method, you can also assign the record ID yourself by putting it in the message body.

• you want to be sure it is a create operation. In contrast, the PUT on /repository/record/{id} behaves in a "create-or-update" way.

The posted entity should be a record embedded in the following structure:

{ action: "create", record: {}}

7.2.3.6.3 /repository/record/{id}

7.2.3.6.3.1 GET

Gets a record.


7.2.3.6.3.2 PUT

Creates or updates a record. For create, this assumes you assign the ID yourself. You can use the POST method on the /repository/record resource if you want Lily to assign the ID. If you want to update a record without 'risking' creating it, use the POST method on this resource.

The set of submitted fields can be sparse: you only need to specify the fields which you want to update. Missing fields will not be deleted; to delete fields, specify them in the fieldsToDelete property.
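For example, the following PUT body (a sketch: the field names are illustrative) updates one field and deletes another, leaving all other fields untouched:

{
  fields: { "b$title": "New title" },
  fieldsToDelete: [ "b$summary" ],
  namespaces: { "my.namespace": "b" }
}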

TODO: the returned record currently contains the same fields as the submitted record. See issue 1004.

An update might cause the creation of a new version. The response to a successful update is, however, always 200.

7.2.3.6.3.3 POST

Using POST to update a record

The posted entity should be a record embedded in the following structure:

{ action: "update", record: {}}

Using POST to conditionally update a record

It is possible to update a record only if certain conditions are satisfied. This is typically used for optimistic concurrency control.

The conditions are specified as an extra property 'conditions' next to the record itself.

Example syntax:

{ action: "update", record: {}, conditions: [ { field: 'prefix$name', value: value or null, operator: 'less|less_or_equal|equal|not_equal|greater_or_equal|greater', allowMissing: true|false }, [ more conditions ] ], namespaces: {}}

You can specify one or more conditions; all conditions must be satisfied for the update to go through.

For each condition, you can specify:

• the name. This is required. It is possible to check on the version of the record by using a special field name: the namespace 'org.lilyproject.system' and name 'version'.

• the value. This is required, but can be set to null. The value should be specified in the same way as a record field value. When specifying null as value, you can check for the field to be missing, or to be present with any value (when using the 'not_equal' operator).

• the operator. This is optional; the default is equal. Operators other than equal or not_equal are not supported by all field types.


• allowMissing. This is optional; the default is false. When specifying a non-null value, setting this to true means the condition is also satisfied when the field is missing.

Since you need to use qualified field names in the conditions, the namespaces must be visible at that level, and hence declared outside of the record (they do not need to be repeated inside the record).

If the update cannot be performed because one of the conditions is not satisfied, the response status will be 409 Conflict. The response body will contain the stored record state.

Below is a full example of an update with two conditions. One of the conditions checks on the record version through the special system namespace.

{ action: "update", record: { fields: { 'p$field1': 'value2' } }, conditions: [ { name: 'p$field1', value: 'value1' }, { name: 's$version', value: 1 } ], namespaces: { 'my.namespace': 'p', 'org.lilyproject.system': 's' }}

Using POST to delete a record

To do a normal delete of a record, use the DELETE method on this resource.

Deleting via POST allows you to specify conditions, and thus to do a conditional delete, similar to updates.

The posted entity should follow this syntax:

{ action: "delete", conditions: [], namespaces: {}}

Conditions are optional, and hence so are namespaces.

A successful delete reports 204 No Content. In case the conditions are not satisfied, the response status is 409 Conflict, and the body will contain a record snapshot containing the fields of the record that were used in the conditions (as far as they exist in the record).
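For illustration, a conditional delete might look as follows (a sketch only: host, port, record ID and condition values are illustrative):

curl -XPOST -H 'Content-Type: application/json' \
  -d '{ "action": "delete",
        "conditions": [ { "name": "p$field1", "value": "value1" } ],
        "namespaces": { "my.namespace": "p" } }' \
  http://localhost:8888/repository/record/USER.foobar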

7.2.3.6.3.4 DELETE

Deletes a record.

In case of success, this will report 204 No Content.


7.2.3.6.4 /repository/record/{id}/version/{version}

7.2.3.6.4.1 GET

Retrieves a specific version of a record.

The returned entity is a normal record JSON with the version attribute set to the specific version.

7.2.3.6.4.2 PUT

Use this to update the versioned-mutable fields of an existing version. It cannot be used to update versioned or non-versioned fields; any such fields will be ignored.

7.2.3.6.5 /repository/record/{id}/vtag/{vtag}

7.2.3.6.5.1 GET

Retrieves a version of a record identified by version tag.

The vtag namespace (org.lilyproject.vtag) is implied; the vtag should not contain a prefix.

For example:

http://myhost/repository/record/USER.foobar/vtag/last

7.2.3.6.6 /repository/record/{id}/version

7.2.3.6.6.1 GET

Gets information from multiple versions of a record in one call.

The following request parameters are used to specify the set of versions to be returned:

• start-index: default 1

• max-results: default 10

As when retrieving individual records, you can use the request parameter fields to specify what fields to return.

The returned entity is in the list format (page 91).
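For example (a hypothetical request; the record ID, field name and namespace are illustrative):

http://myhost/repository/record/USER.foobar/version?start-index=1&max-results=5&fields=p$title&ns.p=my.namespace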

7.2.3.6.7 /repository/record/{id}/variant

7.2.3.6.7.1 GET

Gets the list of variants of this record.

The response format is the list format (page 91). The only record property that will be assigned is the id; no fields are returned.


7.2.3.7 Resources for blobs

7.2.3.7.1 Introduction

To create a record with blobs, you first need to upload the blobs by POSTing them to the blob resource. This gives you back a JSON blurb, which is exactly the value you should provide for the blob field in the record.

7.2.3.7.2 /repository/blob

7.2.3.7.2.1 POST

Creates a new blob.

The request must specify the headers Content-Type and Content-Length, as these might be used to determine the storage location for the blob.

The blob content itself should be the submitted entity (without any wrapping or encoding).

If successful, this will respond with 200 OK. It does not respond with "201 Created" and a Location header because no accessible resource is created at this point. You need to associate the blob with a record in order for it to become accessible.

The response body will contain something of the following form:

{ value: "string (encoded byte array, identifying the blob)", size: long, mimeType: "string"}

This JSON is what you need to put in the value of the blob field when creating or updating a record.
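For illustration, uploading an HTML fragment as a blob might look as follows (a sketch only: host, port and file name are illustrative; curl derives Content-Length from the file):

curl -XPOST -H 'Content-Type: text/html' \
  --data-binary @description.html \
  http://localhost:8888/repository/blob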

7.2.3.7.3 /repository/record/{id}/field/{fieldName}/data

7.2.3.7.3.1 GET

Retrieves the blob from the specified field, from the latest version of the record.

Remember that the fieldName is a namespaced name (page 88):

http://myhost/repository/record/USER.foobar/field/n$myBlobField/data?ns.n=org.my.namespace

If the blob field is of value type list or path, you can specify which blob to retrieve using the request parameter indexes, which is a comma-separated list of integers. The indexes are zero-based.

7.2.3.7.4 /repository/record/{id}/version/{version}/field/{fieldName}/data

7.2.3.7.4.1 GET

Similar to the previous one.


7.2.3.8 Resources for scanners

Scanners allow you to run sequentially over all or part of the records stored in the repository. For an introduction to scanners, see Scanning Records And Record Locality (page 126).

In contrast to other resources in Lily's REST interface, scanners are stateful (in the sense that the scan resource encapsulates runtime application state). They only exist on the server where they were created. This means that requests for a given scanner always need to go to the same server, and thus cannot be arbitrarily load-balanced.

7.2.3.8.1 /repository/scan

7.2.3.8.1.1 POST

Creates a new scanner.

The body should contain the definition of the scanner as described in the record scan format (page 91).

The response Location header will point to the created scan: /repository/scan/{scan-id}.

When it is no longer needed, a scan should be deleted. An expiry mechanism cleans up scans after some delay (default 1 hour), but for regular use you should not rely on this.

7.2.3.8.2 /repository/scan/{scan-id}

7.2.3.8.2.1 GET

Gets the next record(s) from the scanner. You can retrieve multiple records at once using the request parameter batch:

/repository/scan/{scan-id}?batch=100

The returned entity is in the list format (page 91).

If there are no more (or simply no) records, a 204 No Content response is given.

If the scan does not exist, a 404 Not Found response is given.

7.2.3.8.2.2 DELETE

Deletes (cleans up) a scan.
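A typical scanner lifecycle, sketched with curl (host, port, scan definition and batch size are illustrative):

# Create the scanner; the Location response header contains /repository/scan/{scan-id}
curl -i -XPOST -H 'Content-Type: application/json' \
  -d '{ "caching": 200, "returnFields": { "type": "ALL" } }' \
  http://localhost:8888/repository/scan

# Fetch records in batches of 100 until a 204 No Content response is returned
curl 'http://localhost:8888/repository/scan/{scan-id}?batch=100'

# Clean up the scanner when done
curl -XDELETE http://localhost:8888/repository/scan/{scan-id}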

7.2.3.9 Resources for index management

These resources give access to the definition of the SOLR indexes, similar to what you can do with commands like lily-list-indexes (see Managing Indexes (page 47)).

The update functionality is currently limited to updating the index state flags.

7.2.3.9.1 /index

7.2.3.9.1.1 GET

Gives information about all the indexes.

Sample output, here just one index is defined:


[ { "name": "index1", "configuration": "PD94bWwgdmVy...", "generalState": "ACTIVE", "batchBuildState": "INACTIVE", "updateState": "SUBSCRIBE_AND_LISTEN", "activeBatchBuildInfo": null, "lastBatchBuildInfo": null, "solrShards": { "shard1": "http://localhost:8983/solr" }, "shardingConfiguration": null, "queueSubscriptionId": "IndexUpdater_index1", "zkDataVersion": 1 }]

The configuration, which is cut off here, contains the XML bytes encoded as base64.

7.2.3.9.2 /index/{name}

7.2.3.9.2.1 GET

Gives information about one specific index.

7.2.3.9.2.2 PUT

Allows you to update the state flags of the index. This is done by submitting the same JSON as returned by the GET operation; in fact, it should contain just one or more of these attributes: generalState, updateState, buildState.
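For example, to change only the update state (a sketch: host and port are illustrative; the state value shown is taken from the sample output above):

curl -XPUT -H 'Content-Type: application/json' \
  -d '{ "updateState": "SUBSCRIBE_AND_LISTEN" }' \
  http://localhost:8888/index/index1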

7.2.3.9.3 /index/{name}/config

7.2.3.9.3.1 GET

Returns just the indexer configuration for this index (still embedded within the JSON structure).

7.2.3.10 Resources for the rowlog

7.2.3.10.1 /rowlog

7.2.3.10.1.1 GET

Gives the list of rowlogs, with information about their subscriptions and listeners.

7.2.3.10.2 /rowlog/{id}

7.2.3.10.2.1 GET

Same as /rowlog, but only returns the information of one specific rowlog.

Notes

1. javadoc:org.lilyproject.repository.api.RecordId

2. javadoc:org.lilyproject.repository.api.RecordId


3. http://dev.outerthought.org/trac/outerthought_lilyproject/ticket/104

4. http://dev.outerthought.org/trac/outerthought_lilyproject/ticket/100


8 Java Developers

8.1 Repository API Tutorial

8.1.1 Before reading this

Before reading this, it is recommended to first go through the repository model documentation (page 29).

8.1.2 API design

In the design of Lily's repository API we chose to use dumb data objects (objects which are pure data structures) in combination with a few service-style interfaces. The use of these data objects means that there is no difference between Record objects that you instantiate yourself and Record objects that you retrieve from the repository.

The repository API consists mostly of interfaces, even for the data objects. As you will see in the examples below, the consequence is that these objects are instantiated via factory methods.

The API classes are defined in a separate project, lily-repository-api, independent from any implementation.

8.1.3 API tutorial code

All the code used in this tutorial can also be found in the class TutorialTest of the project repository-api-tutorial.

8.1.4 API reference

See the Javadoc-based API documentation1.

8.1.5 API run-through

8.1.5.1 Project set-up

For programming against the API, you only need a dependency on the project lily-repository-api.

For actually talking to Lily, you need a bunch of implementation classes too. Basically, you need the lily-client project and all its dependencies. If you use Maven to build your project and take a dependency on lily-client, everything you need is automatically pulled in.


Below we show a Maven pom you can use to get started. Note that this assumes you have actually built Lily from source so that the Lily artifacts are installed in your local Maven repository.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>org.mydomain</groupId>
  <artifactId>myproject</artifactId>
  <version>1.0-dev</version>
  <name>My Lily-based project</name>

  <build/>

  <dependencies>
    <dependency>
      <groupId>org.lilyproject</groupId>
      <artifactId>lily-client</artifactId>
      <version>[unresolved variable: artifactVersion]</version>
    </dependency>
  </dependencies>

</project>

8.1.5.2 Connecting to Lily

In all the examples below, we will assume you have already obtained access to a Lily Repository object.

Here we explain how you can get access to that.

The Lily nodes publish their availability and address to ZooKeeper. The class LilyClient uses this information to provide you with Repository objects.

In your code, you can get access to the Repository as follows:

import org.lilyproject.client.LilyClient;
import org.lilyproject.repository.api.Repository;

...

LilyClient lilyClient = new LilyClient("localhost:2181", 20000);
Repository repository = lilyClient.getRepository();

The first argument to the LilyClient constructor is the ZooKeeper connection string; the second is a timeout in milliseconds.

For each method you call on the Repository object, it will pick one of the available Lily servers at random to perform the operation against. If a method fails due to an IO-related exception (for example, a Lily node went down or there was a temporary network hiccup), it will automatically be retried. When an IO exception occurs, we cannot know if the server already got our request and performed it. For an operation like 'create record', this means that when the operation is retried it could fail because the record was created in the previous, 'failed', request. Or, if you let the server assign record IDs, it can mean that two records are created. Therefore, create operations are by default only retried when we are sure the request was not yet initiated. This behavior can be configured by manipulating the RetryConf object obtained via Repository.getRetryConf().


8.1.5.3 Prerequisites

To avoid a bit of boilerplate code in the code listings, we make the following assumptions.

A variable typeManager is available, which is obtained from the Repository as follows:

TypeManager typeManager = repository.getTypeManager();

A variable BNS (book namespace) is available, which is the namespace for the schema types, and can be declared as follows:

String BNS = "book";

8.1.5.4 Creating a record type

Before we can create any records in the repository, we need to create a schema: a record type and some field types. For the purpose of this tutorial, we will make a Book record type.

// (1)
ValueType stringValueType = typeManager.getValueType("STRING");

// (2)
FieldType title = typeManager.newFieldType(stringValueType,
    new QName(BNS, "title"), Scope.VERSIONED);

// (3)
title = typeManager.createFieldType(title);

// (4)
RecordType book = typeManager.newRecordType(new QName(BNS, "Book"));
book.addFieldTypeEntry(title.getId(), true);

// (5)
book = typeManager.createRecordType(book);

// (6)
PrintUtil.print(book, repository);

It is useful to explain this piece of code in detail, as the same patterns will be repeated in the remainder of the code samples.

• (1) Get a reference to the value type we want to use for our field. The value type is specified as a string (rather than an enum), as many variations are possible and the value types are designed to be extensible.

• (2) Create the field type object. Since FieldType is an interface, we cannot instantiate it directly. We want to keep our code implementation-independent; therefore we will not directly instantiate an implementation but use the factory method newFieldType(). This method does nothing more than instantiate a field type object: at this point nothing changes yet in the repository. The same holds for all methods in the API that start with "new".

• (3) Create the field type in the repository. This method returns an updated field type object, in which, in this case, the ID of the field type has been assigned.

• (4) Create the record type object. This is pretty much the same as step (2): it creates an object, but nothing yet in the repository. The field type is added to the record type. The boolean argument specifies if the field type is mandatory.


• (5) Create the record type. Similar to step (3), here the record type is actually created in the repository; the updated record type object is returned.

• (6) The PrintUtil class is used to dump the record type to the screen; its output is shown below.

Steps (1) to (3) can actually be done in a single statement; we will do that in the next example.

Output:

Name = {book}Book
ID = d716e794-213c-4ffe-be11-359cb52e017b
Version = 1
Fields:
  Versioned scope:
    Field
      Name = {book}title
      ID = 93bca82e-0b93-496f-9a67-19ded2b3740b
      Mandatory = true
      ValueType = STRING

8.1.5.5 Updating a record type

We will now update the previously created record type with some more fields. We use a variety of value types. The full list of built-in value types can be found in the Javadoc of the TypeManager2, method getValueType.

FieldType description = typeManager.createFieldType("BLOB", new QName(BNS, "description"), Scope.VERSIONED);FieldType authors = typeManager.createFieldType("LIST<STRING>", new QName(BNS, "authors"), Scope.VERSIONED);FieldType released = typeManager.createFieldType("DATE", new QName(BNS, "released"), Scope.VERSIONED);FieldType pages = typeManager.createFieldType("LONG", new QName(BNS, "pages"), Scope.VERSIONED);FieldType sequelTo = typeManager.createFieldType("LINK", new QName(BNS, "sequel_to"), Scope.VERSIONED);FieldType manager = typeManager.createFieldType("STRING", new QName(BNS, "manager"), Scope.NON_VERSIONED);FieldType reviewStatus = typeManager.createFieldType("STRING", new QName(BNS, "review_status"), Scope.VERSIONED_MUTABLE);

RecordType book = typeManager.getRecordTypeByName(new QName(BNS, "Book"), null);

// The order in which fields are added does not matter
book.addFieldTypeEntry(description.getId(), false);
book.addFieldTypeEntry(authors.getId(), false);
book.addFieldTypeEntry(released.getId(), false);
book.addFieldTypeEntry(pages.getId(), false);
book.addFieldTypeEntry(sequelTo.getId(), false);
book.addFieldTypeEntry(manager.getId(), false);
book.addFieldTypeEntry(reviewStatus.getId(), false);

// Now we call updateRecordType instead of createRecordType
book = typeManager.updateRecordType(book);

PrintUtil.print(book, repository);

Output:

Name = {book}Book
ID = d716e794-213c-4ffe-be11-359cb52e017b
Version = 2
Fields:
  Non-versioned scope:
    Field
      Name = {book}manager
      ID = bd9e6764-222f-4b82-bb4d-d1bc72a2c0bb
      Mandatory = false
      ValueType = STRING
  Versioned scope:
    Field
      Name = {book}authors
      ID = 9cfb5c07-dec5-469e-bdcd-436c299badd1
      Mandatory = false
      ValueType = LIST<STRING>
    Field
      Name = {book}description
      ID = 5ce3bdc0-319d-4a57-b052-54bedc77b145
      Mandatory = false
      ValueType = BLOB
    Field
      Name = {book}pages
      ID = cfb90755-f78e-43f8-8c5e-e46397b10296
      Mandatory = false
      ValueType = LONG
    Field
      Name = {book}released
      ID = 50f283d5-30b3-45d3-92d4-0245d8068902
      Mandatory = false
      ValueType = DATE
    Field
      Name = {book}sequel_to
      ID = 57431a4a-bab9-41c5-a340-19c5c2bf537c
      Mandatory = false
      ValueType = LINK
    Field
      Name = {book}title
      ID = 93bca82e-0b93-496f-9a67-19ded2b3740b
      Mandatory = true
      ValueType = STRING
  Versioned-mutable scope:
    Field
      Name = {book}review_status
      ID = 295c9568-43bf-406d-b241-e9582a62d5b0
      Mandatory = false
      ValueType = STRING

The version of the Book record type is now 2.

8.1.5.6 Creating a record

Now that we have a record type, let's create a record.

// (1)
Record record = repository.newRecord();

// (2) (the second argument is the record type version, see the explanation below)
record.setRecordType(new QName(BNS, "Book"), null);

// (3)
record.setField(new QName(BNS, "title"), "Lily, the definitive guide, 3rd edition");

// (4)
record = repository.create(record);

// (5)
PrintUtil.print(record, repository);

This calls for some more explanation:

Page 112: Book

Lily documentation 111

• (1) First we create a Record object. Again, this creates nothing in the repository yet; this is only a factory method, since Record is an interface.

• (2) We set the record type for the record. The second argument is the version, specified as a Long object. Setting it to null will cause the last version of the record type to be used, which is usually what you want. This argument is optional and is shown here only to explain it; in further examples we will leave it off.

• (3) We set a field on the record. This is done by specifying its name and its value. The value argument is an Object; the actual type of value required depends on the value type of the field type.

• (4) Create the record in the repository. The updated record is returned, which will contain the record ID and version assigned by the repository.

• (5) We use the PrintUtil to dump the record to the screen; the output is shown below.

In the PrintUtil output for records, the namespaces of the fields are listed once at the top, and in the remainder of the output a prefix is used, like n1, n2, ... This is only a feature of PrintUtil; the record itself knows nothing about these prefixes.

Output:

ID = UUID.dc799aca-bb4b-4e02-8fc0-8569d26368b5
Version = 1
Non-versioned scope:
  Record type = {book}Book, version 2
Versioned scope:
  Record type = {book}Book, version 2
  {book}title = Lily, the definitive guide, 3rd edition

8.1.5.7 Creating a record with a user-specified ID

In the previous example, the record ID was assigned by the repository. You can also assign it yourself. If you assign an ID that already exists within the repository, a RecordExistsException will be thrown.

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");Record record = repository.newRecord(id);record.setDefaultNamespace(BNS);record.setRecordType("Book");record.setField("title", "Lily, the definitive guide, 3rd edition");record = repository.create(record);

PrintUtil.print(record, repository);

The self-assigned record IDs will never clash with those generated by the repository: they are in a different namespace.

8.1.5.7.1 Use setDefaultNamespace to avoid QName

In the example above you will notice another difference: the setField() and setRecordType() methods take a simple string as argument instead of a QName object. This is possible because we first called setDefaultNamespace() on the record. Internally, the QName object is still created. The default namespace is just a volatile helper attribute on the record: it is not stored in the repository.

Output:


ID = USER.lily-definitive-guide-3rd-edition
Version = 1
Non-versioned scope:
  Record type = {book}Book, version 2
Versioned scope:
  Record type = {book}Book, version 2
  {book}title = Lily, the definitive guide, 3rd edition

8.1.5.8 Updating a record

Updating a record consists of calling repository.update() with a record object of which the ID has been set to that of an existing record. If the record does not exist, a RecordNotFoundException will be thrown.

We use the repository.newRecord() method, even though what we are doing is updating an existing record. Remember that this method is used to instantiate a record object, not to create a record. When updating a record, you only need to set the fields in the record that you actually want to change. Fields that are not set will not be deleted; deleting fields is done by calling record.delete(fieldName, true) or record.addFieldsToDelete().

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.newRecord(id);
record.setDefaultNamespace(BNS);
record.setField("title", "Lily, the definitive guide, third edition");
record.setField("pages", Long.valueOf(912));
record.setField("manager", "Manager M");
record = repository.update(record);

PrintUtil.print(record, repository);

When updating a record, its record type will automatically move to the last version of the record type, unless you specify a specific version. The record type of each scope in which fields were modified will be set to this record type, in addition to the record type of the non-versioned scope, which is always updated since it is considered to be the reference record type.

In the output, you will notice that the version has been incremented to 2:

ID = USER.lily-definitive-guide-3rd-edition
Version = 2
Non-versioned scope:
  Record type = {book}Book, version 2
  {book}manager = Manager M
Versioned scope:
  Record type = {book}Book, version 2
  {book}pages = 912
  {book}title = Lily, the definitive guide, third edition

8.1.5.9 Updating a record via read

Besides updating a record by creating a record object via newRecord and setting the updated field values on it, you can also read an existing record, modify that object, and supply it to the repository.update() method.

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.read(id);
record.setDefaultNamespace(BNS);
record.setField("released", new LocalDate());
record.setField("authors", Arrays.asList("Author A", "Author B"));
record.setField("review_status", "reviewed");
record = repository.update(record);

PrintUtil.print(record, repository);

The authors field is a LIST-type field; its value should be specified as a List object.

Output:

ID = USER.lily-definitive-guide-3rd-edition
Version = 3
Non-versioned scope:
  Record type = {book}Book, version 2
  {book}manager = Manager M
Versioned scope:
  Record type = {book}Book, version 2
  {book}authors =
    [0] Author A
    [1] Author B
  {book}pages = 912
  {book}released = 2012-01-10
  {book}title = Lily, the definitive guide, third edition
Versioned-mutable scope:
  Record type = {book}Book, version 2
  {book}review_status = reviewed

As you can see, this record meanwhile has 3 versions. Each time one or more versioned fields are updated, a new version is created. If in a certain update operation you only change non-versioned fields, then no new version will be created. If you create a new record with only non-versioned fields, it will not have any versions (TODO: at the time of this writing, this is not true; a dummy version 1 is created).

8.1.5.10 Updating versioned-mutable fields

Normal versioned fields are immutable after creation. After all, the purpose of versions is to see the history of previous edits, and hence it should not be possible to rewrite that history. Versioned-mutable fields are versioned fields which can be updated for existing versions. This is useful for metadata about the version.

[TODO: example of this; a tentative sketch follows.]
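The sketch below is an assumption, not a verified example: it relies on an update(record, updateVersion, useLatestRecordType) overload of Repository, where updateVersion = true targets the versioned-mutable fields of the version set on the record. Check the Repository javadoc for the exact signature before using it.

// A sketch only: the version number and field value are illustrative.
RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.newRecord(id);
record.setDefaultNamespace(BNS);
record.setVersion(2L);                          // the existing version to modify
record.setField("review_status", "in_review");  // a versioned-mutable field
record = repository.update(record, true, true); // updateVersion = true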

8.1.5.11 Updating a record conditionally

It is possible to let an update of a record only go through if the current record state satisfies some conditions. This is useful for optimistic concurrency control.

The example below shows how to update the manager field to "Manager P", but only if the current value is "Manager Z" (which it is not).

List<MutationCondition> conditions = new ArrayList<MutationCondition>();
conditions.add(new MutationCondition(new QName(BNS, "manager"), "Manager Z"));

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.read(id);
record.setField(new QName(BNS, "manager"), "Manager P");
record = repository.update(record, conditions);

System.out.println(record.getResponseStatus());


When the conditions are not satisfied, as is the case here, the update() method will not throw an exception; rather, the responseStatus field of the record object will be set to ResponseStatus.CONFLICT.

If you supply multiple MutationConditions, they all need to be satisfied for the update to go through. MutationConditions allow for operators other than simple equals checks: for checking whether a field is null or not-null, for checking on the record version, etc.

8.1.5.12 Reading a record

Let's have a look at the different options for reading a record.

RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");

// (1)
Record record = repository.read(id);
String title = (String)record.getField(new QName(BNS, "title"));
System.out.println(title);

// (2)
record = repository.read(id, 1L);
System.out.println(record.getField(new QName(BNS, "title")));

// (3)
record = repository.read(id, 1L, Arrays.asList(new QName(BNS, "title")));
System.out.println(record.getField(new QName(BNS, "title")));

• (1) If we just supply an ID when reading a record, the latest version of the record is fully read. The Record.getField() method returns the value of the field (here again, you could make use of setDefaultNamespace to avoid using the QName objects). The signature of this method declares a return type of Object, so you need to cast it to the expected type.

• (2) We can specify a version number as the second argument to read a specific version of the record.

• (3) It is also possible to read just the fields of the record that we are interested in. This way, the others do not need to be decoded and transported to us.

Output:

Lily, the definitive guide, third edition
Lily, the definitive guide, 3rd edition
Lily, the definitive guide, 3rd edition

8.1.5.13 Working with blob fields

Blob fields ("binary large object") are fields for storing arbitrary binary data. Since thiscould be a large amount of data, the content of blobs is not simply transported as part of therepository.read() or repository.update() calls. Instead, blobs are read and written as streams.

On the level of the Record object, the value of a blob field is a Blob object. This object holds some metadata such as a mime-type (this identifies the type of content, such as "text/html" or "image/png"), the size of the blob, and an optional name which is often used as a suggested filename in case a user downloads the blob to the desktop.

The actual data of a blob can be stored in different ways, depending upon configuration:


• it can be stored in HDFS; this is used for somewhat larger to huge blobs.

• it can be stored in an HBase table (separate from the records); this is ideal for smaller blobs like an HTML page.

• it can be stored within the HBase row of the record itself. This is good for tiny blobs.

As a repository API user, you are not really aware of these different stores.

Below is some example code:

//
// Write a blob
//

String description = "<html><body>This book gives thorough insight into Lily, ...</body></html>";
byte[] descriptionData = description.getBytes("UTF-8");

// (1)
Blob blob = new Blob("text/html", (long)descriptionData.length, "description.xml");
OutputStream os = repository.getOutputStream(blob);
try {
    os.write(descriptionData);
} finally {
    os.close();
}

// (2)
RecordId id = repository.getIdGenerator().newRecordId("lily-definitive-guide-3rd-edition");
Record record = repository.newRecord(id);
record.setField(new QName(BNS, "description"), blob);
record = repository.update(record);

//
// Read a blob
//
InputStream is = null;
try {
    is = repository.getInputStream(record, new QName(BNS, "description"));
    System.out.println("Data read from blob is:");
    Reader reader = new InputStreamReader(is, "UTF-8");
    char[] buffer = new char[20];
    int read;
    while ((read = reader.read(buffer)) != -1) {
        System.out.print(new String(buffer, 0, read));
    }
    System.out.println();
} finally {
    if (is != null) is.close();
}

(1) To store a blob in the repository, you first create a Blob object. You need to specify the size of the blob; the repository will use this to determine where to store it. Then you request an output stream to upload the blob via repository.getOutputStream(blob), and write all the data to it. Finally, the output stream is closed; at that moment the repository updates the Blob object with a reference to the storage location that sits behind the output stream.

(2) Once the blobs are uploaded, you can create the record object as usual, setting the blob field (here description) with the blob object, and then call repository.update() (or repository.create() for a new record) to store the record in the repository.

If the operation had been abandoned between the previous two steps, there would be an orphan blob in the repository. You do not need to worry about this: it will automatically expire and be removed (by default, after 1 hour).


Reading the blob is done by using the repository.getInputStream() method, specifying the record and field from which to read the blob. Instead of passing the record object to the getInputStream method, you can also specify the record ID, so it is not required to first retrieve the record. But if you have already retrieved the record anyway, then passing the record object will allow for optimized retrieval of blobs which are stored inline in the record (which is the case for small blobs).

Above we wrote a custom while loop to retrieve the data from the InputStream, but we recommend using the IOUtils class from the Apache commons-io project instead.

8.1.5.14 Creating variants

Creating a variant record is the same as creating a record; you just have to use an ID that contains variant properties.

In the example below we use variants to create records about the same book in two languages (en - English, nl - Dutch). The two records will share the same master record ID.

// (1)
IdGenerator idGenerator = repository.getIdGenerator();
RecordId masterId = idGenerator.newRecordId();

// (2)
Map<String, String> variantProps = new HashMap<String, String>();
variantProps.put("language", "en");

// (3)
RecordId enId = idGenerator.newRecordId(masterId, variantProps);

// (4)
Record enRecord = repository.newRecord(enId);
enRecord.setRecordType(new QName(BNS, "Book"));
enRecord.setField(new QName(BNS, "title"), "Car maintenance");
enRecord = repository.create(enRecord);

// (5)
RecordId nlId = idGenerator.newRecordId(enRecord.getId().getMaster(),
    Collections.singletonMap("language", "nl"));
Record nlRecord = repository.newRecord(nlId);
nlRecord.setRecordType(new QName(BNS, "Book"));
nlRecord.setField(new QName(BNS, "title"), "Wagen onderhoud");
nlRecord = repository.create(nlRecord);

// (6)
Set<RecordId> variants = repository.getVariants(masterId);
for (RecordId variant : variants) {
    System.out.println(variant);
}

Some more explanation:

• (1) We generate a master ID that we will use for the two variants.

• (2) We create the variant properties for the English language variant. This is simply a map.

• (3) We create the record ID for the English variant, consisting of the master record ID and the variant properties.

• (4) We create the actual record.

• (5) We do the same for the Dutch language variant. Just as an illustration, we get the master record ID by retrieving it from the English variant. A shortcut notation is used to create the variant properties map.


• (6) We use the getVariants method to get the list of all variants sharing the same master record ID, and print them out.

Output:

UUID:d947dda0-cadb-4e84-b1bc-38567d05fb56VARIANT:language,nl
UUID:d947dda0-cadb-4e84-b1bc-38567d05fb56VARIANT:language,en

While not shown in this example, it is also possible to create the record that corresponds to the plain master record ID, which could be used to store information shared by all the variants. In this example, that could be information that does not need to be translated.

Other than the shared identity between variant records, the repository itself does not have special functionality around variants. Rather, it is the indexer and the front-end that add this, for example by aggregating information from different variants.

8.1.5.15 Link fields

One of the field value types supported by Lily is the link type. We usually simply speak of link fields (just as we use 'string fields', 'long fields', etc. for the other value types). A link field allows you to store a link to another record in a field.

The following example illustrates this.

// (1)
Record record1 = repository.newRecord();
record1.setRecordType(new QName(BNS, "Book"));
record1.setField(new QName(BNS, "title"), "Fishing 1");
record1 = repository.create(record1);

// (2)
Record record2 = repository.newRecord();
record2.setRecordType(new QName(BNS, "Book"));
record2.setField(new QName(BNS, "title"), "Fishing 2");
record2.setField(new QName(BNS, "sequel_to"), new Link(record1.getId()));
record2 = repository.create(record2);

PrintUtil.print(record2, repository);

// (3)
Link sequelToLink = (Link)record2.getField(new QName(BNS, "sequel_to"));
RecordId sequelTo = sequelToLink.resolve(record2.getId(), repository.getIdGenerator());
Record linkedRecord = repository.read(sequelTo);
System.out.println(linkedRecord.getField(new QName(BNS, "title")));

In this example, we created a record about a book "Fishing 2" which is a sequel to the book "Fishing 1". We link them via the sequel_to field. The value that should be assigned to a link field is a Link object. In its simplest form, a link is basically a RecordId. The RecordId of a record can be obtained via the Record.getId() method.

Now suppose we had read record2 outside of this context, without knowing what it was a sequel to. In that case, we could find out what book preceded Fishing 2 by reading its sequel_to field. This gives a Link object, which needs to be resolved in the context of the record it occurs in; see the resolve call. The resolve method returns a RecordId which can be used to fetch the record from the repository, as shown in step (3).

The output is obviously:

Fishing 1


As for variants, the repository itself does not do anything fancy with link fields, but, for example, the indexer can denormalize information from linked records to search on it.

8.1.5.15.1 Link versus RecordId

In the example above, the Link could just as well have been the RecordId, and the resolve step was not really necessary. However, it is also possible to have relative links which need to be resolved against the record they occur in. For example, a link can inherit the variant properties of the record it occurs in. For more information on this, see the javadoc of the Link class.

8.1.5.16 Complex Fields

Sometimes you might want to store a more complex value in a field: not a simple value like a string, but a complex value which is itself composed of multiple fields. In Lily this is possible by creating fields of type RECORD. These are fields in which you can put Record objects. These are not real records with their own identity; it is just a re-use of the top-level Record data structure as a value within the field of another record. Since any record object can have fields which themselves can again contain records (or lists of records), this allows for modeling arbitrarily complex structures.

Before you use complex fields, you should always ask yourself whether you want to use complex fields or rather link fields (which contain pointers to other records). Both enable you to store the same kinds of nested/complex structures. In the case of complex fields, the nested structures (nested records) are all stored within one record, so they don't have their own identity and are hence not separately retrievable or indexable. Link fields pointing to other records give each part of the nested structure its own identity, but at the cost of having to create/read multiple records, and losing the atomicity of the create operation.

Since complex fields are modeled in Lily by creating field types with RECORD as their value type, they are also called record-type fields.

In the following example, we will create articles which have authors. Each author has a name and email attribute. For the sake of this example, we are going to store the authors within the article, in a complex field. So there will be no re-use of the same author records across articles.

Here's the code:

final String ANS = "article"

// (1)FieldType name = typeManager.createFieldType("STRING", new QName(ANS, "name"), Scope.NON_VERSIONED);FieldType email = typeManager.createFieldType("STRING", new QName(ANS, "email"), Scope.NON_VERSIONED);

RecordType authorType = typeManager.newRecordType(new QName(ANS, "author"));authorType.addFieldTypeEntry(name.getId(), true);authorType.addFieldTypeEntry(email.getId(), true);authorType = typeManager.createRecordType(authorType);

// (2)FieldType title = typeManager.createFieldType("STRING", new QName(ANS, "title"), Scope.NON_VERSIONED);FieldType authors = typeManager.createFieldType("LIST<RECORD<{article}author>>", new QName(ANS, "authors"), Scope.NON_VERSIONED);FieldType body = typeManager.createFieldType("STRING", new QName(ANS, "body"), Scope.NON_VERSIONED);

RecordType articleType = typeManager.newRecordType(new QName(ANS, "article"));articleType.addFieldTypeEntry(title.getId(), true);articleType.addFieldTypeEntry(authors.getId(), true);

Page 120: Book

Lily documentation 119

articleType.addFieldTypeEntry(body.getId(), true);articleType = typeManager.createRecordType(articleType);

// (3)Record author1 = repository.newRecord();author1.setRecordType(authorType.getName());author1.setField(name.getName(), "Author X");author1.setField(email.getName(), "[email protected]");

Record author2 = repository.newRecord();author2.setRecordType(new QName(ANS, "author"));author2.setField(name.getName(), "Author Y");author2.setField(email.getName(), "[email protected]");

// (4)Record article = repository.newRecord();article.setRecordType(articleType.getName());article.setField(new QName(ANS, "title"), "Title of the article");article.setField(new QName(ANS, "authors"), Lists.newArrayList(author1, author2));article.setField(new QName(ANS, "body"), "Body text of the article");article = repository.create(article);

PrintUtil.print(article, repository);

Explanation:

• At (1) we create the field types and record type for storing an author.

• At (2) we create the field types and record type for storing an article. Note the definition of the authors field: its value type is "LIST<RECORD<{article}author>>". By this we say we want the field to contain a list of records of type author. Specifying the type for the record is optional, so you can also simply use LIST<RECORD>, which then allows any kind of record (the same list could contain different kinds of records). Of course, not all complex fields need to be lists; you can simply use "RECORD" as value type as well.

• At (3) we create two author record objects. Attention: we create just objects, nothing is persisted in the repository! It is not necessary to call repository.create() for these objects.

• At (4) we create an article record. We set the value of the authors field to an ArrayList containing the author1 and author2 objects. Lists.newArrayList() is a utility method provided by the Guava library.

Finally, we dump the record, which gives the following output:

ID = UUID.141a11c3-66b8-4c2a-a0d7-c01aa38c33fa
Version = null
Non-versioned scope:
  Record type = {article}article, version 1
  {article}authors =
    [0] Record of type {article}author, version null
      {article}name = Author X
      {article}email = [email protected]
    [1] Record of type {article}author, version null
      {article}name = Author Y
      {article}email = [email protected]
  {article}body = Body text of the article
  {article}title = Title of the article

We see the authors field is a list containing two entries, each of which is a record of type author.

Suppose that authors had been a link field, and that each author was stored as a separate record in its own right. Then the dump would have looked like this:


ID = UUID.141a11c3-66b8-4c2a-a0d7-c01aa38c33fa
Version = null
Non-versioned scope:
  Record type = {article}article, version 1
  {article}authors =
    [0] UUID.fa1cd18b-ab5b-43f5-95ce-1c1bcced603a
    [1] UUID.95f2cf92-814a-46a6-a815-7dc26e1b3b52
  {article}body = Body text of the article
  {article}title = Title of the article

8.2 Creating Records And Schema Using The Builder API

8.2.1 Introduction

Besides the core repository API, Lily offers an alternative API using builder objects. This API makes use of method call chaining to make for a more fluent way of writing code, a small internal DSL if you like. It avoids having to declare intermediate variables to keep references to things. It also allows more combinations for setting the parameters of the objects to be created, since each parameter is typically set with a different method.

Here we provide a tutorial on getting started with the builder API. You are free to choose between Lily's core API and the builder API; just use what fits best for your situation and taste. One disadvantage is that, because of method chaining, very long statements are created, which sometimes makes it harder to track down what part of the statement caused an error.

If you have a custom object model that you want to map onto Lily, you might want to check out FrogPond3, a POJO-Lily mapper.

If you already have a schema and just want to create records, you can skip directly to the section Creating Records.

8.2.2 Creating A Schema

Before we start

If your schema is static, then rather than writing code statements to create the schema, you are better off describing it in the JSON format and importing that. You can also do the import programmatically (page 130). Having the schema in JSON rather than code has its advantages: it can be easily transformed, it is isolated, it can be easily shared with non-Java programmers, etc.

Classic API

Let's start with a very simple schema, and first look how it is created using the classic API:

String NS = "my_namespace";

FieldType field1 = typeManager.createFieldType("STRING", new QName(NS, "field1"), VERSIONED);FieldType field2 = typeManager.createFieldType("STRING", new QName(NS, "field2"), VERSIONED);

RecordType recordType = typeManager.newRecordType(new QName(NS, "recordtype1"));recordType.addFieldTypeEntry(field1.getId(), false);recordType.addFieldTypeEntry(field1.getId(), true);recordType = typeManager.createRecordType(recordType);

Page 122: Book

Lily documentation 121

System.out.println(recordType.getId());

Create the record type using a builder

Now, assuming the field types are already created, let's change the creation of the record type to make use of the builder API:

(1) RecordType recordType = typeManager
(2)     .recordTypeBuilder()
(3)     .defaultNamespace(NS)
(4)     .name("recordtype1")
(5)     .field(field1.getId(), false)
(6)     .field(field2.getId(), true)
(7)     .create();

Let's discuss this code in some detail:

(1) and (2): we create a builder by calling TypeManager.recordTypeBuilder().

(3) we set a default namespace. This namespace will be used for all further names, removing the need to supply QName objects, though you can still use QName as well.

(4) we set the name for the record type, simply as a string. The default namespace set on line (3) will be used to construct the QName.

(5) and (6) we add field type entries to the record type. We refer to the previously created field type objects to fetch their ID. The boolean argument is the mandatory flag.

(7) we create the record type in the repository. This method returns a RecordType object, while all the previous methods returned the builder itself. You can also use other operations: update(), createOrUpdate(), or build(). The build() method will just create the RecordType object without modifying anything in the repository.

Add the field entries using a builder

Some more flexibility in adding field type entries is available through a sub-builder, as illustrated in the next example.

(1) RecordType recordType = typeManager
(2)     .recordTypeBuilder()
(3)     .defaultNamespace(NS)
(4)     .name("recordtype1")
(5)     .fieldEntry().name("field1").add()
(6)     .fieldEntry().name("field2").mandatory().add()
(7)     .create();

(5) and (6) The method fieldEntry() returns a different builder object, on which you set the properties for the field type entry. Calling add() on it will add the field type entry to the record type and return the record type builder.

The field entry builder allows you to set the identity of the field in different ways: using its name (either relying on the default namespace or by supplying a QName), using its ID, or by supplying the field type object. The mandatory flag is set by calling a different method (by default, mandatory is false).

Create the fields while creating the record

While previously we relied on the field types already being created, you can also create them inline, as shown in the following example.


( 1) RecordType recordType = typeManager
( 2)     .recordTypeBuilder()
( 3)     .defaultNamespace(NS)
( 4)     .name("recordtype1")
( 5)
( 6)     .fieldEntry()
( 7)         .defineField()
( 8)             .name("field1").type("STRING").scope(VERSIONED)
( 9)             .create()
(10)         .add()
(11)
(12)     .fieldEntry()
(13)         .defineField()
(14)             .name("field2").type("STRING").scope(VERSIONED)
(15)             .create()
(16)         .mandatory()
(17)         .add()
(18)
(19)     .create();

(7) By calling defineField(), a different builder object is returned that allows you to create a new field type. On (8) we set the options for the field, on (9) we call create(), which creates the field type in the repository and returns the field entry builder.

The second field is very similar, except that we also set the mandatory option (16).

This code seems longer than how we created field types before, but that's in part because here we spread it over multiple lines on purpose. Since each option for the field type is set using a different method, it allows more variation in how parameters are specified. Specifying the type and scope is optional: the default type is STRING, and the default scope is NON_VERSIONED, though that can be changed by calling defaultScope() on the record type builder.
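For example, a minimal sketch relying on those defaults, assuming defaultScope() works as just described (so the per-field type and scope calls can be omitted):

RecordType recordType = typeManager
    .recordTypeBuilder()
    .defaultNamespace(NS)
    .defaultScope(VERSIONED)
    .name("recordtype1")
    .fieldEntry().defineField().name("field1").create().add()
    .fieldEntry().defineField().name("field2").create().mandatory().add()
    .create();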

Make the schema code re-executable through createOrUpdate()

If we ran any of the above examples twice against the same repository, they would fail on the second run because the types already exist.

What you really want is to create the schema only if it does not exist yet, update it in case it differs, or fail when it is incompatible (cf. the immutable value type and scope properties of field types). This behavior is obtained by calling createOrUpdate() instead of create(), both for the field types and for the record types, as in the example below.

RecordType recordType = typeManager
    .recordTypeBuilder()
    .defaultNamespace(NS)
    .name("recordtype1")

    .fieldEntry()
        .defineField()
            .name("field1").type("STRING").scope(VERSIONED)
            .createOrUpdate()
        .add()

    .fieldEntry()
        .defineField()
            .name("field2").type("STRING").scope(VERSIONED)
            .createOrUpdate()
        .mandatory()
        .add()

    .createOrUpdate();

Switching the default namespace

At any time, you can switch the default namespace.

( 1) RecordType recordType = typeManager
( 2)     .recordTypeBuilder()
( 3)     .defaultNamespace("namespace1")
( 4)     .name("recordtype1")
( 5)     .fieldEntry().name("field1").add()
( 6)     .fieldEntry().name("field2").add()
( 7)     .defaultNamespace("namespace2")
( 8)     .fieldEntry().name("field1").add()
( 9)     .fieldEntry().name("field2").add()
(10)     .createOrUpdate();

On line (3) we set the default namespace to 'namespace1'. This namespace will be used for the record type name and the first two fields that are added. Then on line (7) we change the default namespace to 'namespace2'. We again add fields called field1 and field2, but now these will be in namespace2, so they are different fields from the ones added on lines (5) and (6).

Field type builder

Besides the record type builder, there is also a field type builder. Since creating a field is already a one-liner with the classic API, its use is somewhat limited, though it allows you to work in the same style as for creating record types.

8.2.3 Creating Records

Classic API

Let's first look at how a record is created using the classic Lily API.

Record record = repository.newRecord();
record.setRecordType(new QName(NS, "recordtype1"));
record.setField(new QName(NS, "field1"), "value 1");
record.setField(new QName(NS, "field2"), "value 2");
record = repository.create(record);

Instead of instantiating all those QNames, you can also set a default namespace, as shown in the following example. The default namespace is just an ephemeral attribute of Record: it is not stored in the repository, but is just an aid when setting fields or the record type.

Record record = repository.newRecord();
record.setDefaultNamespace(NS);
record.setRecordType("recordtype1");
record.setField("field1", "value 1");
record.setField("field2", "value 2");
record = repository.create(record);

Create a record using the builder API

Now let's look at how the same record is created using the builder API.

(1) Record record = repository
(2)     .recordBuilder()
(3)     .defaultNamespace(NS)
(4)     .recordType("recordtype1")
(5)     .field("field1", "value 1")
(6)     .field("field2", "value 2")
(7)     .create();

(2) we obtain the builder by calling repository.recordBuilder()

(3) we set the default namespace. This is optional; you can also use QNames.

(4-6) the record type and fields are set

(7) we call create(). This creates the record in the repository. This method returns a Record object, while all the previous methods returned the builder itself. Other operations are also available: update(), createOrUpdate() and build(). Calling build() will just instantiate the record object without modifying anything in the repository.

Switching the default namespace

If you need to create fields in several namespaces, then it is useful to know you can switch the default namespace at any time, as illustrated in the following example.

( 1) Record record = repository
( 2)     .recordBuilder()
( 3)     .defaultNamespace("namespace1")
( 4)     .recordType("recordtype1")
( 5)     .field("field1", "value 1")
( 6)     .field("field2", "value 2")
( 7)     .defaultNamespace("namespace2")
( 8)     .field("field1", "value 1")
( 9)     .field("field2", "value 2")
(10)     .create();

In this example we set fields named field1 and field2 twice, but they are in different namespaces, so they are different fields.

Using createOrUpdate

Lily offers a "create-or-update" operation which is useful if you don't care whether the record already exists or not. More importantly, it has the advantage that this method allows automatic retrying in case of IO exceptions, because it is idempotent. Because of this, it requires that the ID is assigned by the client.

(1) Record record = repository
(2)     .recordBuilder()
(3)     .assignNewUuid()
(4)     .defaultNamespace(NS)
(5)     .recordType("record_type")
(6)     .field("field1", "value 1")
(7)     .field("field2", "value 2")
(8)     .createOrUpdate();

On line (3) we assign a new UUID as the ID. You can also use a user-defined ID via the method id(String).

On line (8) we call createOrUpdate().
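For instance, a minimal sketch of the same call with a client-chosen USER ID via id(String); the ID value itself is just a hypothetical naming scheme:

Record record = repository
    .recordBuilder()
    .id("product-12345")
    .defaultNamespace(NS)
    .recordType("record_type")
    .field("field1", "value 1")
    .field("field2", "value 2")
    .createOrUpdate();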

Creating a nested record

In Lily, the type of a field can be RECORD, which means that within the field value of a record, you can store another record. To create such a nested record, you could again use repository.recordBuilder() to construct it, but for this specific case there is a shortcut.

( 1) Record record = repository
( 2)     .recordBuilder()
( 3)     .defaultNamespace(NS)
( 4)     .recordType("some_record_type")
( 5)     .field("field1", "value 1")
( 6)     .recordField("record_field")
( 7)         .recordType("embedded_record_type")
( 8)         .field("field_r", "value r")
( 9)         .field("field_s", "value s")
(10)         .set()
(11)     .create();

On line (5) we set a normal field as usual.

On line (6) we use the method recordField() to create a nested record. As argument we give the name of the field. The important difference now is that this method does not return the same record builder, but a new one intended to create the nested record. The nested record builder is initialized with the default namespace from the current builder, so you do not need to repeat that.

When you are done creating the nested record, you call set(), see line (10). Calling set() returns the original record builder.

Creating a LIST<RECORD> field

Similar to the previous case, there is also a convenient way of filling up LIST<RECORD> fields. The following example illustrates this.

( 1) Record record = repository
( 2)     .recordBuilder()
( 3)     .defaultNamespace(NS)
( 4)     .recordType("some_record_type")
( 5)     .field("field1", "value 1")
( 6)     .recordListField("list_of_records")
( 7)         .recordType("embedded_record_type")
( 8)         .field("field_r", "value r1")
( 9)         .field("field_s", "value s1")
(10)         .add()
(11)         .field("field_r", "value r2")
(12)         .field("field_s", "value s2")
(13)         .add()
(14)         .field("field_r", "value r3")
(15)         .field("field_s", "value s3")
(16)         .endList()
(17)     .create();

You start by calling recordListField(), see line (6).

Then each time you have filled in a record, you call add(), see lines (10) and (13). After each add() call, a new record builder is returned to create the next record. However, this builder is already initialized with the default namespace and the record type of the previous one, so you do not need to repeat those.

After the last item, you call endList() instead of add(), see line (16).

Creating a record with link fields

For creating linked records, there is no special support yet as there is for nested records, so you need to call repository.recordBuilder() again to create another record, as shown in the following example.

Record record = repository
    .recordBuilder()
    .defaultNamespace(NS)
    .recordType("some_record_type")
    .field("link_field", new Link(repository
        .recordBuilder()
        .defaultNamespace(NS)
        .recordType("some_other_record_type")
        .field("field1", "value1")
        .create().getId()))
    .create();

Creating records with common fields -- reusing the builder

There is nothing that prohibits you from reusing the same builder to create several records. This could be useful if you want to create several records that share some common fields.

In the following example, we create 5 records which each have the same value for field1 and field2, but a different value for field3.

RecordBuilder builder = repository
    .recordBuilder()
    .defaultNamespace("namespace")
    .recordType("record_type")
    .field("field1", "value1")
    .field("field2", "value2");

for (int i = 0; i < 5; i++) {
    builder.field("field3", new Long(i)).create();
}

The RecordBuilder also has a reset() method to clear its state, which is equivalent to creating a new RecordBuilder.
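A short sketch of reset(), under the assumption (per the sentence above) that a reset builder behaves exactly like a freshly created one:

RecordBuilder builder = repository.recordBuilder();

builder.defaultNamespace("namespace")
       .recordType("record_type")
       .field("field1", "value1")
       .create();

// After reset(), defaults such as the namespace and record type are gone
// and must be set again, just as on a brand new builder.
builder.reset();
builder.defaultNamespace("namespace")
       .recordType("other_record_type")
       .field("field2", "value2")
       .create();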

8.3 Scanning Records And Record Locality

8.3.1 Records are stored in order of record ID

In Lily (as in HBase), it is possible to influence which records are stored next to each other. This is possible because records are stored sorted by their record ID. A typical way to use this is to group records by giving them a common prefix in their record ID.

For example, consider the following record IDs:

USER.bar-1
USER.bar-2
USER.foo-1
USER.foo-2

Since the records are stored in order of their record IDs, it is not possible for, e.g., one of the 'bar' records to be stored in between the 'foo' records.

When you are using the default UUID record IDs, the order will be random. Still, in that case the use of variant record IDs allows you to influence record locality. Variant record IDs (or simply variants) are record IDs that share a common master record ID and are extended with a set of free key-value pairs.

For example, the following 3 records share the same master record ID and are extended with an 'item' property:

UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=1
UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=2
UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=3

The properties are separated from the master record ID by a dot, and if there are multiple key-value pairs, they are separated by a comma and always sorted by their key (this is the string syntax; on the storage level a binary encoding is used).
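For instance, a hypothetical variant with two properties would be written as follows; note that the keys appear in sorted order:

UUID.65d268d0-ccbb-4fff-9073-e5642a9144e0.item=1,lang=en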

Variants can be used for both USER and UUID record IDs. As we have seen, when using USER record IDs, you can also bring structure and grouping into record IDs yourself. Using variants has the advantage that Lily has knowledge of this internal record ID structure, which it can exploit in the indexer for dereferencing between variants.

8.3.2 Scanning over records

Scanning is sequentially running over the records stored in the repository. Since records are stored ordered by record ID, this means a scan will run over the records in order of their record ID.

The alternative to scanning would be an ordinary (multi-)read operation. However, a scan is more efficient for retrieving multiple records than read or multi-read operations. Also, with a scan you don't have to know the record IDs up front.

The fact that scans run sequentially over records does not mean they are only useful in batch scenarios: the record ID can be exploited as a primary index to jump straight to the relevant subset of records.

In the sections below we will give Java code examples. The REST interface supports scanners as well, see its tutorial and reference documentation for more details.

8.3.2.1 Full table scan

The following example shows how to scan over all records in the repository. This is something you will only do in batch settings, often through MapReduce, since your repository can contain a massive amount of records.

RecordScan scan = new RecordScan();
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}
scanner.close();

Note that scanners should be closed when you're done with them in order to release resources.

8.3.2.2 Start and stop record ID

A scan can run over all the records in the repository (a "full table scan") or a subset. To run over a subset, you can specify a start and stop record ID. Lily (relying on HBase) is able to efficiently jump to the record specified by the start record ID. The start record ID does not have to really exist: if it doesn't exist, the scan will position itself at the first record with a larger record ID. The scan then runs sequentially over the records, until it reaches the stop record ID (exclusive) or until the very last record, whichever condition is reached first. Both start and stop record ID are optional.

// Scan over all records whose ID starts with K up to right before
// those who start with M
RecordScan scan = new RecordScan();
scan.setStartRecordId(idGen.newRecordId("USER.K"));
scan.setStopRecordId(idGen.newRecordId("USER.M"));
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}

8.3.2.3 Filters

A scan can skip certain records server-side, never returning them to the client, based on some conditions. This is called filtering. For example, you can filter records based on record type or field value. A filter is not a search however: the repository will still run over each record and evaluate the filter for each record. Additionally, a filter is able to direct the repository to stop the scan.

8.3.2.3.1 Example: record type filter

The following example will only return records of type Book.

RecordScan scan = new RecordScan();
scan.setRecordFilter(new RecordTypeFilter(new QName(NS, "Book")));
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}

8.3.2.3.2 Example: record ID prefix filter

The RecordIdPrefixFilter passes through all records whose record ID has a given prefix. Recall the example earlier in the section about record locality of records starting with a prefix such as "foo-" or "bar-". If you would like all records starting with "foo-", you would do it like this:

RecordScan scan = new RecordScan();
scan.setStartRecordId(idGenerator.newRecordId("foo-"));
scan.setRecordFilter(new RecordIdPrefixFilter(idGenerator.newRecordId("foo-")));
RecordScanner scanner = repository.getScanner(scan);
...

We set the start record ID to jump efficiently to the first relevant record. We don't have a stop record ID (what would we set it to?), but rather the RecordIdPrefixFilter will abort the scan once it encounters a record ID with a larger prefix.

8.3.2.4 Returning a subset of fields

By default a scan will return full record objects, that is, records with all fields loaded. If you don't need all fields, you can gain performance by specifying the fields you are interested in. This is done via setReturnFields. It is possible to read no fields at all using ReturnFields.NONE, in which case only the record ID and record type will be loaded.

RecordScan scan = new RecordScan();
scan.setReturnFields(new ReturnFields(qname1, qname2));
RecordScanner scanner = repository.getScanner(scan);
for (Record record : scanner) {
    PrintUtil.print(record, repository);
}

8.3.2.5 Scanner Caching

By default, each time the next record is requested from a scanner, a call to the server will be made. It is more efficient to request a bunch of records from the server at once. This can be done using the caching setting. The following example will instruct the scanner to retrieve up to 100 records at once:

scan.setCaching(100);

8.3.2.6 Scanners directly read from HBase region servers

This is an implementation detail, but interesting nonetheless. When using the Lily Java API, the LilyClient executes scans directly on the HBase region server, without going through a Lily server node. This avoids an extra hop and avoids pulling all data through one Lily node.

8.3.2.7 Scanners: summary

Here are some important things to remember about scanners:

• Start record ID allows you to jump efficiently to some record (index-based)

• Filters

• filters are not index based, but evaluated for every record (potentially skipping billions of records!)

• filters are evaluated within the HBase region servers (close to the data)

• Scans can stop by:

• reaching end of table

• reaching stop record ID

• filter instruction

• Caching and ReturnFields can dramatically improve scanner performance

• When done, you should close a scan to release its resources

8.3.2.8 Using the CLI tool lily-scan-records

You can execute a scan without any programming using the lily-scan-records tool. This tool can work in two modes: count or print. In count mode, it will only count the records; in print mode it will dump them to standard out. The lily-scan-records tool also allows you to configure all options such as start record ID and filters. Run 'lily-scan-records -h' for more information.

8.3.2.9 Variants and scanners

The Repository method getVariants allows you to retrieve all the variants for some master record ID. Internally, this is based on scanners, with a technique similar to the record ID prefix filter. You could use a custom scan operation as well, which will offer more flexibility.
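A minimal sketch of what such a call could look like; getMaster() on RecordId and the exact getVariants signature (returning a set of variant record IDs) are assumptions here, not confirmed API:

// 'record' is some previously read variant record
RecordId master = record.getId().getMaster();
Set<RecordId> variants = repository.getVariants(master);
for (RecordId variant : variants) {
    System.out.println(variant);
}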

8.3.3 Record ID as your primary index

As you have learned from all the above, Lily (by means of HBase) offers much more than a "distributed hash map" kind of storage: by storing the records in record ID order and offering a scan operation, the record ID can be exploited as a powerful index for accessing your data.

8.3.4 Scanners And MapReduce

Scanners can be used as input for MapReduce jobs. See MapReduce Integration (page 140).

8.4 Setup New Maven Project From Archetype

You can quickly set up the structure for a new Lily-based project by executing the following archetype command. It will allow you to change some settings, such as the artifactId. It will then create a subdirectory named after this artifactId and put all files below that.

mvn archetype:generate \
    -DarchetypeGroupId=org.lilyproject \
    -DarchetypeArtifactId=lily-archetype-basic \
    -DarchetypeVersion=[unresolved variable: artifactVersion] \
    -DarchetypeRepository=http://lilyproject.org/maven/maven2/deploy/

If you are using a Lily version whose artifacts are not deployed in Lily's Maven repository, you can change the pointer to the repository as follows: -DarchetypeRepository=file:///path_to_lily_home/lib

Before making any changes to it, you might want to verify that it compiles by changing to the created directory and executing:

mvn install

8.5 Importing A Schema From JSON Programmatically

If you want to set up a schema for your application, a good approach is to use the import tool's JSON format (page 72) to describe the schema.

Besides running the import tool manually, you might want to create the schema from your application or testcase.

Here are the steps to do that.

Add a dependency on lily-import to the pom.xml:

<dependency>
  <groupId>org.lilyproject</groupId>
  <artifactId>lily-import</artifactId>
  <version>${version.lily}</version>
</dependency>

Add the schema.json file itself to the resources of your application, i.e. in src/main/resources/{package-name}.

Then use the following code to import the schema:

import org.lilyproject.tools.import_.cli.JsonImport;

...

System.out.println("Importing schema");
InputStream is = YourClass.class.getResourceAsStream("schema.json");
JsonImport.load(repository, is, false);
is.close();
System.out.println("Schema successfully imported");

8.6 Writing Test Cases Against Lily

When developing a project on top of Lily, you will want to write tests that perform stuff against Lily. For this purpose, Lily offers the ability to launch an embedded Lily stack within your test case, to make it independent of any external setup. The data of this embedded Lily stack is stored in temporary directories.

Launching Lily embedded is however rather slow; therefore we also offer an alternative: an easy way to launch a standalone Lily stack, and let the testcases talk to that. At the start of each test case, the state of that Lily will be cleared, which can take a few seconds, but this should be much faster than launching everything embedded. Your test cases will be written in a way that is agnostic of which Lily they talk to.

We will refer to these two cases as embed mode and connect mode.

We don't support running the test cases against an arbitrary, custom cluster setup. This is because the lily-test-launcher offers some specific features such as the ability to reset the state and the ability to change the Solr schema.

8.6.1 First Steps

In these instructions, we will assume Maven is used. There is nothing Maven-specific to the approach, you should be able to translate it to other environments as well.

8.6.1.1 Maven Settings

We will start by adding some stuff to the pom of your project.

When you generate a project from the project archetype (page 130), these settings will already be set up for you.

8.6.1.1.1 Add Dependencies

Make sure you have these dependencies:

<project>
  <dependencies>

    <dependency>
      <groupId>org.lilyproject</groupId>
      <artifactId>lily-server-test-fw</artifactId>
      <version>[unresolved variable: artifactVersion]</version>
      <scope>test</scope>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.8.1</version>
      <scope>test</scope>
    </dependency>

  </dependencies>
</project>

8.6.1.1.2 Configure Surefire Plugin

For reasons that will be explained later, we need to enable forkMode=always for the surefire plugin. We also pass on the system property "lily.lilyproxy.mode", which is set via the connect profile defined in the next section, and the properties "lily.config.customdir" and "lily.plugin.dir", which are discussed further on.

<project>
  <build>
    <plugins>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.5</version>
        <configuration>
          <forkMode>always</forkMode>
          <systemPropertyVariables>
            <lily.lilyproxy.mode>${lily.lilyproxy.mode}</lily.lilyproxy.mode>
            <lily.config.customdir>${lily.config.customdir}</lily.config.customdir>
            <lily.plugin.dir>${lily.plugin.dir}</lily.plugin.dir>
          </systemPropertyVariables>
        </configuration>
      </plugin>

    </plugins>
  </build>
</project>

8.6.1.1.3 Configure "connect" Profile

The connect profile will be used to easily switch between launching Lily embedded and connecting to an external Lily.

<project>
  <profiles>

    <profile>
      <id>connect</id>
      <properties>
        <lily.lilyproxy.mode>connect</lily.lilyproxy.mode>
      </properties>
    </profile>

  </profiles>
</project>

8.6.1.1.4 Add lily-kauri-plugin, resolve-project-dependencies goal

This plugin will make sure that all the dependencies required to launch Lily are available in your local Maven repository.

The reason why this is needed is as follows. Lily runs on top of the Kauri runtime. Kauri launches all modules listed in the conf/kauri/wiring.xml. It finds these modules in a Maven-style repository. When launched in test cases, this is your local Maven repository. Kauri is not able to download dependencies itself; it assumes access to a file-system based repository that contains them. Not all modules listed in the wiring.xml are Maven dependencies, so Maven will not download them. In addition, Kauri module jars contain a classpath definition in the file KAURI-INF/classloader.xml, listing the classpath needs of that module, again searched for in Maven-style repositories. Even if we listed all modules in the pom too, the version resolving of the jars might be different, causing jars not to be found. The following plugin makes sure that all listed dependencies are downloaded and available in your local Maven repository.

<project>
  <build>
    <plugins>

      <!-- This plugin makes sure that all Lily/Kauri runtime dependencies are
           available in the local repository (required for lily-server-test-fw) -->
      <plugin>
        <groupId>org.lilyproject</groupId>
        <artifactId>lily-kauri-plugin</artifactId>
        <version>[unresolved variable: artifactVersion]</version>
        <configuration>
          <wiringXmlResource>org/lilyproject/lilyservertestfw/conf/kauri/wiring.xml</wiringXmlResource>
        </configuration>
        <executions>
          <execution>
            <phase>compile</phase>
            <goals>
              <goal>resolve-project-dependencies</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

    </plugins>
  </build>
</project>

We need to tell Maven where it can download this plugin using:

<project>

  <pluginRepositories>
    <pluginRepository>
      <id>lilyproject-plugins</id>
      <name>Lily Maven repository</name>
      <url>http://lilyproject.org/maven/maven2/deploy/</url>
    </pluginRepository>
  </pluginRepositories>

</project>

8.6.1.1.5 Configuring Repositories

If you have an existing project, you will usually already have this. If not, make sure to add the Lily repository:

<project>
  <repositories>

    <repository>
      <id>lilyproject</id>
      <name>Lily Maven repository</name>
      <url>http://lilyproject.org/maven/maven2/deploy/</url>
    </repository>

  </repositories>
</project>

8.6.1.2 Write A Test Class

As usual in Maven, put your test classes below src/test/java

Example:

import org.junit.Test;
import org.lilyproject.lilyservertestfw.LilyProxy;
import org.lilyproject.repository.api.*;

public class MyTest {
    @Test
    public void testMe() throws Exception {
        LilyProxy proxy = new LilyProxy();

        // Depending on mode, this will:
        // - start Hadoop, ZooKeeper, HBase, Solr, Lily embedded
        // - connect to locally running instance launched by launch-test-lily
        //   and clear its state
        proxy.start();

        Repository repository = proxy.getLilyServerProxy().getClient();

        // Do stuff with the repository
        // ...

        proxy.stop();
    }
}

8.6.1.3 Run The Test With Lily Stack Embedded

Execute

mvn install

Because of all the services which are launched on the fly, this will take some time, like half a minute.

If all is well, it will end with a BUILD SUCCESSFUL message.

8.6.1.4 Create LilyProxy On The Class Level

In the previous test class example, we created the LilyProxy within a test method. This approach is not recommended for two reasons:

• It takes quite some time to launch LilyProxy (especially in embedded mode).

• It is not possible to call start-stop on LilyProxy multiple times within the same JVM. This is because when calling LilyProxy.stop(), not all threads are properly shut down and not all resources properly cleaned up. In some cases, threads are interrupted but not joined to wait for them to finish. Since this is inside Hadoop & HBase, this is out of our control. (This can be observed by doing a thread dump after calling LilyProxy.stop().)

Therefore, the usual approach is to create LilyProxy on the test class level, and to set the forkMode of the Maven surefire plugin (the plugin which executes test cases) to 'always', causing a new JVM to be launched per test class.

Of course, this approach requires the tests to be written in such a way that they don't conflict with each other: each test method in a test class should be able to run independently of the others, without assumptions about state.

Here is an example of how to start LilyProxy at the class level:

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import org.lilyproject.client.LilyClient;
import org.lilyproject.lilyservertestfw.LilyProxy;

public class MyTest {
    private static LilyProxy LILY_PROXY;

    @BeforeClass
    public static void setUpBeforeClass() throws Exception {
        LILY_PROXY = new LilyProxy();
        LILY_PROXY.start();
    }

    @AfterClass
    public static void tearDownAfterClass() throws Exception {
        LILY_PROXY.stop();
    }

    @Test
    public void testOne() throws Exception {
        LilyClient lilyClient = LILY_PROXY.getLilyServerProxy().getClient();
        // Do stuff
    }

    @Test
    public void testTwo() throws Exception {
        LilyClient lilyClient = LILY_PROXY.getLilyServerProxy().getClient();
        // Do stuff
    }
}

You can again check with "mvn install" that this runs.

8.6.1.5 Connect To Independently Launched Lily

Up to now when launching tests, the Lily stack was launched within the test JVM, which is rather slow.

We can speed this up by launching an independent Lily instance, and letting the testcases connect to that one. When LilyProxy.start() is called, a reset trigger will be sent to this Lily instance, causing it to clear all Lily state in HBase, HDFS and ZooKeeper, to delete all documents in Solr, and to restart (within the JVM) the Lily server.

To use this, we first need to launch the standalone Lily stack using this command:

./bin/launch-test-lily

Wait until it is completely started; this will be clearly visible by a series of informational (non-log) messages being printed to standard out.

Then start the build using:

mvn -Pconnect install

The -Pconnect flag activates the connect profile we added to the pom.xml earlier, which then causes the system property lily.lilyproxy.mode=connect to be passed to the JVMs executing the tests, putting LilyProxy in connect mode.

You cannot run tests without -Pconnect while launch-test-lily is running, as these start the same services listening on the same port numbers.

When you stop launch-test-lily (use Ctrl+C), the temporary directory in which the data was stored will be deleted. If you want to retain the data for a future run, specify a custom storage directory using the -d option.

8.6.2 Service Configuration

8.6.2.1 General remarks

We provide limited control over the configuration of the various services.

All services use the default TCP port numbers and this cannot be changed. This means that if you have any of these services running locally, the embedded variants will clash with them.

8.6.2.2 Solr Schema

By default, Solr will be launched with the example schema that ships with Solr. This is not very useful; you will typically want to specify your own schema.

You can specify your own schema by passing an argument to LilyProxy.start(). For example, if the schema is available as a resource in the same package as your test class, you can do it like this:

import org.apache.commons.io.IOUtils;

...

byte[] solrSchemaData = IOUtils.toByteArray(
        MyTest.class.getResourceAsStream("solr_schema.xml"));
LilyProxy.start(solrSchemaData);

If LilyProxy is in embedded mode, this schema will be used directly when Solr is started. When LilyProxy is in connect mode, and the current schema is different from the one supplied, the current schema will be overwritten and Solr will be reloaded.

You can also change the schema after LilyProxy was started, using this method:

LILY_PROXY.getSolrProxy().changeSolrSchema(solrSchemaData);

8.6.2.3 Lily Conf & Plugins

Because LilyProxy supports switching between an embedded or externally launched Lily stack, we do not support dynamically defining the configuration of Lily in the testcase. Thus, typically every launched Lily instance in a testcase in your project should use the same configuration and plugins. This is because when using connect mode, we can't change the configuration of the externally launched Lily.

8.6.2.3.1 Connect

When running in connect mode, Lily should be launched using the ./bin/launch-test-lily command as described before. Any changes to configuration files or adding additional configuration files should be done in the $LILY_HOME/conf folder before starting launch-test-lily.

Similarly, plugins should be put in the $LILY_HOME/plugins folder.

8.6.2.3.2 Embedded

In the embedded mode, by default the standard Lily configuration is used. This configuration is included within the test framework jar and extracted to a temporary folder so that Kauri/Lily can read it.

An additional configuration folder (which will take precedence over the default configuration) can be given by setting its path in the system property lily.conf.customdir.

Similarly for the plugins, the plugin folder can be set with the system property lily.plugin.dir.

Example:

mvn -Dlily.conf.customdir=/home/user/custom/conf \
    -Dlily.plugin.dir=/home/user/custom/plugins install

Since these are properties you will typically not modify on a run-to-run basis, you can configure them directly in the pom.xml (see the surefire plugin).

8.6.3 Utilities

8.6.3.1 Index Schema

For data to be indexed into your Solr index, an index should be defined first. To be able to do this from your tests, some utility methods are provided on the LilyServerProxy.

The methods addIndexFromFile(String indexName, String indexerConf, long timeout) and addIndexFromResource(String indexName, String indexerConf, long timeout) add an index, defined respectively in a file or a resource, to the indexer under the given indexName. These methods will wait until the information about this new index has propagated to the indexer as well as the rowlog. Only then can one be sure that events about record creates or updates will be picked up by the MQ rowlog and handed to the Indexer, which puts the necessary data in the Solr index. The timeout is the maximum amount of time the methods will wait. If the timeout is exceeded, they will return false.

Example:

Assert.assertTrue("Adding index took too long", LILY_PROXY.getLilyServerProxy().addIndexFromResource("testIndex", "org/lilyproject/mylilyproject/my_indexerconf.xml", 60000L));

Variants of these methods are also available with booleans to indicate whether the call should wait for the information about the new index to propagate to the indexer and the rowlog: addIndexFromFile(String indexName, String indexerConf, long timeout, boolean waitForIndexerModel, boolean waitForMQRowlog). If these are set to false, it is possible that a record create (for example) would not result in an update on the Solr index, since the indexer and/or rowlog were not yet aware of the newly defined index.
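For illustration, a hedged sketch of calling the variant with the wait flags, following the signature quoted above (the indexer configuration path is hypothetical):

LILY_PROXY.getLilyServerProxy().addIndexFromFile("testIndex",
    "/path/to/my_indexerconf.xml", 60000L,
    true /* waitForIndexerModel */, true /* waitForMQRowlog */);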

8.6.3.2 WAL and MQ processed and Solr index committed

With the above calls to add an index and wait for it to be fully operational, one can be sure that any changes to records will eventually be reflected in the Solr index. This does not mean that these changes will be visible immediately. First, messages need to be processed by the WAL and MQ before an update is performed on the Solr index, and then this Solr index needs to be committed for its changes to become visible. When writing a test it is useful to know whether all record updates are reflected in the Solr index as well. We've provided some utility methods to help with this.

On LilyProxy, the waitWalAndMQMessagesProcessed(long timeout) method waits for all messages of the WAL and MQ to be processed and then commits the Solr index. When the given timeout expires before all messages have been processed, the call will return false. A variant of this method with a boolean to indicate whether the Solr index should be committed or not is also available.

Example:

Assert.assertTrue("Processing messages took too long", LILY_PROXY.waitWalAndMQMessagesProcessed(60000L));

It is also possible to explicitly commit the Solr index by calling commit() on the SolrProxy.
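For example, to force such an explicit commit in a test (commit() on the SolrProxy as mentioned above):

LILY_PROXY.getSolrProxy().commit();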

8.6.3.3 Launching A Batch Index Build

A convenience method is available to perform a batch index build. This method will launch the build and block until it is finished. If it does not finish successfully, an exception is thrown. If it does not finish within the expected timeout, it returns false.

Example:

Assert.assertTrue("Batch index build took too long", LILY_PROXY.getLilyServerProxy().batchBuildIndex("testIndex", 60000L * 10));

8.6.4 Advanced

8.6.4.1 User defined storage directory

By default the embedded mode will create a temporary directory in which to store the data and log files. This directory is cleared at shutdown. The parent directory in which the temporary directory is created is defined by the system property java.io.tmpdir.

Instead of creating a temporary directory, it is possible to use a fixed directory location. This directory can be set by using the system property lily.lilyproxy.dir. By default this directory is still cleared at shutdown. If the data stored in this directory should be kept in order to use it at a next run, the system property lily.lilyproxy.clear should be set to false.

Example:

mvn -DargLine="-Dlily.lilyproxy.dir=/home/user/mydir -Dlily.lilyproxy.clear=false" install

8.6.5 More On The Lily Test Framework

Lily's test framework consists of three separate projects.

Maven project name    Services                           Class for embedded launching   Abstraction between         System property to
                                                                                        embedded/connect mode       set connect mode
hadoop-test-fw        HDFS, MapReduce, ZooKeeper, HBase  HBaseTestingUtility            HBaseProxy                  lily.hbaseproxy.mode
                                                         (this is part of HBase)
solr-test-fw          Solr inside Jetty                  SolrTestingUtility             SolrProxy                   lily.solrproxy.mode
lily-server-test-fw   Lily Server Node                   LilyServerTestingUtility       LilyServerProxy             lily.lilyserverproxy.mode
                                                                                        LilyProxy                   lily.lilyproxy.mode

If you write a project that only needs HBase and/or Solr, you can immediately use the corresponding projects, without having to launch Lily as well.

Switching between connect and embed mode

The system properties in the last column can be set to the value 'connect' or 'embed'. If not specified, embed is the default.

LilyProxy

LilyProxy combines HBaseProxy, SolrProxy and LilyServerProxy. When using LilyProxy, the single property lily.lilyproxy.mode will set the embed/connect mode for all of the proxies. Mixing different modes is not possible since the reset state functionality requires all services to run together.

launch-test-lily (LilyLauncher)

The launch-test-lily script (LilyLauncher class) basically creates, in one JVM, an HBaseTestingUtility, a SolrTestingUtility and a LilyServerTestingUtility.

LilyLauncher exposes through JMX the operation "resetLilyState", which performs the following actions:

• it stops the Lily Server (the Kauri Runtime)

• it clears all tables on HBase. For some of the tables, we need to force a compaction, and wait for it to finish, because of the way Lily uses the HBase timestamp dimension.

• it deletes the blobs stored on HDFS

• it deletes the /lily node in ZooKeeper

• it performs a 'delete all' query (and commit) on Solr

• it starts up Lily again

When in connect mode, each time LilyProxy.start() is called, this resetLilyState operation will be called.

The launch-test-lily script opens JMX access on port 10102.

8.7 MapReduce Integration

8.7.1 Using Lily As Input For MapReduce Jobs

Lily has an InputFormat for Hadoop which enables you to efficiently run over the records in the repository.

The InputFormat is based on the Lily scanner feature (page 126), thus (within each input split) runs sequentially over all or a subset of the records, possibly with some filter(s), and with all or a selection of fields loaded.

This InputFormat is conceptually quite similar to HBase's TableInputFormat. The number of splits, thus the number of map tasks launched, equals the number of regions of the record table.

Lily scanners directly access HBase, by-passing the Lily server nodes, and hence should be fast. A hint is passed to Hadoop so that the map task for a certain input split can be co-located with the region server where the corresponding region is hosted, reducing network traffic.

To see some example code, generate a project using the archetype, as described further on.

8.7.2 Using Lily As Output For MapReduce Jobs

To write to Lily from MapReduce jobs, just use the usual LilyClient class.

There would be little added value in providing a Hadoop OutputFormat for Lily. Having access to the repository from within the map or reduce method gives more flexibility: you can choose which method to use (create, update or createOrUpdate) and you could read before write, use conditional updates, etc.

As with any MapReduce task which has side-effects, be sure to be careful with the behavior of re-execution of failed tasks, of multiple reducers, and of speculative execution.

Create the LilyClient in the setup method, and close it in the cleanup method. We provide utility functions:

import org.lilyproject.mapreduce.LilyMapReduceUtil;
import org.lilyproject.util.io.Closer;

...

public class YourClass extends Mapper { // or Reducer
    private LilyClient lilyClient;
    private Repository repository;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);
        this.lilyClient = LilyMapReduceUtil.getLilyClient(context.getConfiguration());
        this.repository = lilyClient.getRepository();
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        Closer.close(lilyClient);
        super.cleanup(context);
    }
}
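To tie this to the idempotency caution above, here is a hedged sketch of a map() method that could be added to the class above; it writes through createOrUpdate() with an ID derived from the input, so that a re-executed task overwrites instead of duplicating. The ID scheme, namespace, record type and field name are all hypothetical:

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    try {
        repository.recordBuilder()
            // Deriving the ID from the input key makes the write idempotent.
            .id("line-" + key.get())
            .defaultNamespace("mynamespace")
            .recordType("Line")
            .field("content", value.toString())
            .createOrUpdate();
    } catch (RepositoryException e) {
        throw new IOException(e);
    }
}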

8.7.3 Getting Started Writing A Lily MapReduce Job

The quickest way to get started writing a MapReduce job is to set up a project using the Maven archetype:

mvn archetype:generate \
    -DarchetypeGroupId=org.lilyproject \
    -DarchetypeArtifactId=lily-archetype-mapreduce \
    -DarchetypeVersion=[unresolved variable: artifactVersion] \
    -DarchetypeRepository=http://lilyproject.org/maven/maven2/deploy/

This generates a classic word-count style MapReduce job based on Lily. See the README.txt in the generated project for more information on how to try it out.

Notes

1. javadoc:root

2. javadoc:org.lilyproject.repository.api.TypeManager

3. https://bitbucket.org/calmera/frogpond

9 Repository (lily-server) plug-ins

9.1 Repository Decorators

9.1.1 Overview

9.1.1.1 What

Repository decorators are hooks added to Lily server nodes that allow you to do things before or after any (CRUD) operation on the repository. A typical use case is auto-assignment of record state, such as generated metadata or calculated fields.

Client requests arrive at the first decorator in the chain. Clients are unaware of the existence of decorators; they don't know the requests pass through them. A decorator will then typically call the next in the chain, called the delegate, until the request arrives at the repository itself. Then the call returns through the call chain, allowing things to be done after the operation as well.

Since decorators are put in front of the repository, they don't influence the behavior of the repository internally; they can only manipulate what goes in and out.

The term "interceptor" is often used for these kind of components. We opted fordecorator instead, since this is same terminology as used by CDI (the Java Contextand Dependency Injection specification), where the term interceptor is rather used fororthogonal concerns.

Decorators are not applied when records are read by the batch index builder.

9.1.1.2 Deployment

Repository decorators are packaged in a jar and have to be deployed on all Lily server nodes. They have to be explicitly activated through configuration as well, which allows you to control the order in which the decorators are called, is a safe-guard against lingering jar files, and allows the extension jar to stay loaded in case it also offers other functionality.

There is no smart distributed deployment or management of decorators within Lily itself. Configuration and setup of nodes is usually managed in a central location (cfr. Lily Enterprise), making this a non-issue. This approach also has the advantage that it is possible to have differently configured nodes, such as during a rolling upgrade.

9.1.1.3 The Interface

A repository decorator needs to implement the interface RepositoryDecorator, which is defined as follows:

public interface RepositoryDecorator extends Repository {
    void setDelegate(Repository repository);
}

As you can see, this extends from Repository, so all Repository methods can be decorated. The setDelegate() method is called by the framework to provide your implementation with the delegate it should call.
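As an illustration, a minimal sketch of a decorator that logs around the create() operation and forwards to its delegate. Only one method is shown; a real implementation must implement (and usually simply forward) all the other Repository methods as well:

import org.lilyproject.repository.api.*;

public class LoggingRepositoryDecorator implements RepositoryDecorator {
    private Repository delegate;

    @Override
    public void setDelegate(Repository repository) {
        this.delegate = repository;
    }

    @Override
    public Record create(Record record) throws RepositoryException, InterruptedException {
        System.out.println("Before create");
        Record result = delegate.create(record);
        System.out.println("After create of record " + result.getId());
        return result;
    }

    // ... all other Repository methods simply forward to the delegate ...
}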

9.1.2 Creating A Repository Decorator

The steps to make a repository decorator and have it in production are:

1. make an implementation of the RepositoryDecorator interface

2. write code to register your RepositoryDecorator implementation with the PluginRegistry

3. package it as a Kauri module (a jar file)

4. deploy it to your Lily server node(s)

5. activate it in the Lily configuration

6. restart the Lily server(s)

Steps 2 and 3 are specific to the mechanics of how to get an extension running in Lily. Fortunately, you don't have to worry too much about them since we have a template project that takes care of these. We will first walk through the steps to get your first decorator running; afterwards we'll provide some more background on steps 2 and 3.

9.1.3 Your First Decorator

9.1.3.1 Generate A Project

Open a shell, go to a directory where you want the project to be located (a single sub-directory will be created for you), and generate a project using the following command:

mvn archetype:generate \
    -DarchetypeGroupId=org.lilyproject \
    -DarchetypeArtifactId=lily-archetype-lily-server-plugin \
    -DarchetypeVersion=[unresolved variable: artifactVersion] \
    -DarchetypeRepository=http://lilyproject.org/maven/maven2/deploy/

This will ask you to confirm the settings for some parameters:

Confirm properties configuration:
groupId: com.mycompany
artifactId: my-lily-server-plugin
version: 1.0-SNAPSHOT
package: com.mycompany
Y:

It is recommended to answer N and change the values appropriately. The version number (1.0-SNAPSHOT) is the version of your decorator, not of Lily.

9.1.3.2 Implement RepositoryDecorator

The generated project contains a decorator implementation at the following path:

src/main/java/com/mycompany/SampleRepositoryDecorator.java

This sample decorator prints a message before and after record creation calls. For now, we can just continue with this sample decorator; you can come back to it later and adjust it to implement the desired functionality.

9.1.3.3 Disable other sample plugins

Edit the following file:

src/main/kauri/spring/services.xml

Remove or comment out the sections related to samples of other types of plugins; at the time of this writing this was only the SampleRecordUpdateHook:

<!-- Comment this bean out or delete it
<bean id="updateHook" class="com.mycompany.SampleRecordUpdateHook">
  <constructor-arg ref="pluginRegistry"/>
</bean>
-->

9.1.3.4 Build

To build the project, execute:

mvn assembly:assembly

9.1.3.5 Deploy

The build will have created a tarball at

target/my-lily-server-plugin-1.0-SNAPSHOT.tar.gz

This bundles the plugin (jar file and wiring.xml), together with all its dependencies, in a format which can be deployed with Lily Enterprise.

Here we are just interested in local testing, so we extract this again:

tar xvzf target/my-lily-server-plugin-1.0-SNAPSHOT.tar.gz

And then we copy the wiring.xml file to the plugins directory:

cp my-lily-server-plugin-1.0-SNAPSHOT/plugins/load-before-repository/wiring.xml \
    $LILY_HOME/plugins/load-before-repository

And copy the library files to Lily's lib dir:

cp -r my-lily-server-plugin-1.0-SNAPSHOT/lib/* \
    $LILY_HOME/lib

Tip: It is not very tidy to copy our own extensions directly into Lily's lib dir. This is in fact not necessary: when starting Lily using the 'bin/lily-server' script, you can define an environment variable LILY_MAVEN_REPO to point to additional lib dirs. You could make this point to the lib dir of the plugin. When using the service wrapper, see wrapper.conf.

Be productive: what's even easier during plugin development is to let LILY_MAVEN_REPO point directly to ~/.m2/repository. Then each time you rebuild using "mvn install", you don't have to deploy anything, just restart lily-server.

9.1.3.6 Edit Lily Configuration

Edit the file

$LILY_HOME/conf/repository/repository.xml

In that file, you will see a <decorators> element. Within that element, you need to list all the decorators that should be active:

<decorators>
  <decorator>com.mycompany.my-lily-decorator</decorator>
</decorators>

The decorator name can be found in the SampleRepositoryDecorator.java file mentioned above, in the NAME member.

9.1.3.7 Restart Lily Server

Now restart the Lily server.

During startup, two lines will be logged related to the decorator (depending on the log configuration, but this should be the case by default).

First you will see a line indicating that the decorator plugin jar is being loaded:

[INFO ][snipped] Starting module plugin-my-lily-decorator-1.0-SNAPSHOT -
    /[snipped path]/plugins/load-before-repository/my-lily-decorator-1.0-SNAPSHOT.jar

A bit later a line will be printed showing the active repository decorators, which correspond exactly to those configured in repository.xml:

[INFO ][snipped] The active repository decorators are: [com.mycompany.my-lily-decorator]

If you now create records, messages will be printed to standard out.

9.1.3.8 Next Steps

Now that you know how to get a decorator running, you can adjust the decorator implementation to suit your own needs.

For some more insight into how the plugins are packaged and deployed, see Lily Server Plugin Mechanism (page 148).

9.2 Record Update Hooks

9.2.1 Overview

9.2.1.1 What

A record update hook is an extension mechanism of the lily-server process. It is called before a record is updated but after the record has been locked for updating and the original record state has been read.

Compared to a Repository Decorator (page 143), you would use it when decorating the update method would require you to read the original record. Since the repository implementation reads the previous record state anyway, this avoids doing that HBase-involving work twice. Possibly more important: since the record is locked, you can be sure the record state won't change anymore between the read and the update.

The hook is called before the conditional update checks are evaluated.

9.2.1.2 The Interface

public interface RecordUpdateHook {
    void beforeUpdate(Record record, Record originalRecord, Repository repository,
            FieldTypes fieldTypes) throws RepositoryException, InterruptedException;
}

The hook is provided with:

• the record object supplied by the user

• the record object read from HBase (immutable)

• the repository

• the field types snapshot
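As an illustration, here is a minimal sketch of a hook that stamps a field on every update; the namespace and field name are hypothetical, and it is assumed that modifying the passed-in record object is how changes are applied:

import org.lilyproject.repository.api.*;

public class StampingUpdateHook implements RecordUpdateHook {
    private static final QName LAST_UPDATE_NOTE =
            new QName("my_namespace", "last_update_note");

    @Override
    public void beforeUpdate(Record record, Record originalRecord,
            Repository repository, FieldTypes fieldTypes)
            throws RepositoryException, InterruptedException {
        // originalRecord is the immutable state read from HBase; here we
        // simply stamp the record that is about to be written.
        record.setField(LAST_UPDATE_NOTE, "updated by hook");
    }
}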

9.2.2 Creating a RecordUpdateHook

The steps to create a RecordUpdateHook are very similar to those for creating a Repository Decorator (page 143), so have a look over there. The archetype which is used there to generate a sample project also contains a sample RecordUpdateHook.

9.3 Lily Server Plugin Mechanism

TODO: this was cut and paste from the decorators document, needs some rewording.

Lily runs on a platform called Kauri1. What Kauri basically does is start a number of modules. A module is a jar file with two things added to it:

• a spring container definition (this is what makes it 'active', i.e. startable)

• a classpath definition

Besides this, modules can also export or import services; this allows for wiring services between modules.

The decorator we created above is also such a Kauri module.

The Spring container definition is in the source tree at src/main/kauri/spring/services.xml. The Maven build is configured such that the file ends up in the jar at KAURI-INF/spring/services.xml.

The classpath definition is generated as part of the build by a Maven plugin called kauri-genclassloader-plugin. It ends up in the jar at KAURI-INF/classloader.xml.

Let's have a closer look at what is in the Spring container definition:

<kauri:import-service id="pluginRegistry"
    service="org.lilyproject.plugin.PluginRegistry"/>

<bean id="decorator" class="com.mycompany.SampleRepositoryDecorator">
  <constructor-arg ref="pluginRegistry"/>
</bean>

The special tag kauri:import-service will make the PluginRegistry service (provided by another Kauri module) available within this Spring container.

The <bean> tag causes the SampleRepositoryDecorator to be instantiated when the module is started.

Inside its constructor, the decorator will register itself with the PluginRegistry:

public SampleRepositoryDecorator(PluginRegistry pluginRegistry) {
    this.pluginRegistry = pluginRegistry;
    pluginRegistry.addPlugin(RepositoryDecorator.class, NAME, this);
}

To deploy our module, we copied it to the directory $LILY_HOME/plugins/load-before-repository. Kauri knows what modules to start due to the configuration in conf/kauri/wiring.xml. In that file, you will see a line that tells Kauri to load all the jar files inside that directory:

<directory id="plugin" path="${lily.plugin.dir}${file.separator}load-before-repository"/>

As you can see, the actual plugin directory location is provided by a system property, lily.plugin.dir.

The purpose of the subdirectory load-before-repository is that it contains modules that will be started before the actual repository. For the decorators, it is important that they are registered before the repository is created, so that there is no window during startup in which the repository can get called without the decorators being active.

The above should have given you a basic insight into how this all works (if not, don't hesitate to ask questions on the mailing list).

If you want to register multiple decorators, it is not necessary to put each of them in a separate Kauri module; you can just add more implementations in the same project, and add <bean> tags for each of them to the Spring container.

Notes

1. http://www.kauriproject.org/

10 Bulk Imports

Lily has no special support for bulk uploads, but below we provide some tips.

Disable indexes during import

Disabling incremental index updating during import will usually give an important performance advantage, especially if you make use of link dereferencing. You can then batch-build the index once the import is done.

If you have not already defined an index, simply wait to create your index until after the import. Otherwise, you can disable the incremental updating using:

lily-update-index -n nameOfTheIndex --update-state DO_NOT_SUBSCRIBE

Afterwards, re-enable it using:

lily-update-index -n nameOfTheIndex --update-state SUBSCRIBE_AND_LISTEN

Trigger a batch index build using:

lily-update-index -n nameOfTheIndex --build-state BUILD_REQUESTED

And follow up on its status using:

lily-list-indexes

See managing indexes (page 47) for more details.

Run multiple clients in parallel

Be sure to run multiple clients in parallel, or write a multi-threaded client, even if your "cluster" would only contain a single node.

Configure initial region splits

When starting out on a blank Lily install and planning to do some bulk loading, be sure to increase the number of initial table splits for these tables: records, links-forward, links-backward. For example, set each to 10 times the number of servers you have (e.g. 60 for 6 nodes).

See Table creation settings (page 152) for more details. Note that these initial region split settings only work upon initial creation of the table. If you use custom record IDs you will have to assign appropriate split keys yourself, or if unsure leave it at 1 initial split. Also with custom record IDs, make sure they are not monotonically increasing or you will be hitting the same region of the record table all the time.

Disable link index maintenance

Lily keeps an index of all links between records. This is used to keep denormalized data in the Solr index up to date, or it can also be used for custom purposes.

If you are not interested at all in the link index, either because you don't have any link-type fields in your records, or because you have no need for denormalized data in the index, then you can disable the updating of the link index. This can gain quite a bit in performance, since otherwise for each record create/update this index has to be kept up to date, which involves reading the record and querying the existing state of the index.

To disable the link index, edit the configuration file rowlog/rowlog.xml, and set the following flag to false:

<linkIndexUpdater enabled="true"/>

This needs to be done on all Lily nodes, and the Lily server needs to be restarted after this change.

Change HBase flush settings

While not recommended for general Lily use, you could temporarily relax the HBase flush settings.

This is done with the following properties:

• hbase.regionserver.flushlogentries

• hbase.regionserver.optionallogflushinterval

More information on these properties can be found in HBase's hbase-default.xml
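
For illustration, such a temporary relaxation could look like this in hbase-site.xml (the values shown are arbitrary examples, not recommendations; check hbase-default.xml for the defaults and exact semantics):

<property>
  <name>hbase.regionserver.flushlogentries</name>
  <value>100</value>
</property>
<property>
  <name>hbase.regionserver.optionallogflushinterval</name>
  <value>10000</value>
</property>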

General HBase tuning

Reduce HDFS replication

On small clusters (say, < 8 nodes), it is recommended to reduce the HDFS replication factor (the dfs.replication property) to 2. Don't make the replication factor equal to or larger than the number of nodes, else HDFS/HBase will complain that it can't reach the needed replication factor. The replication setting should be configured in hbase-site.xml rather than in HDFS's configuration, as it is the HBase client which sets the replication level for each file it creates.
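
For example, in hbase-site.xml:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>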

Other

See the HBase Book on the HBase website for other tuning tips, including memory configuration, GC settings, LZO compression, etc. Be sure to keep an eye on the metrics.


11 Admin

11.1 Table creation settings

The very first time Lily is launched, it will create the necessary tables on HBase. Some settings for these tables can be configured through the configuration file conf/general/tables.xml. Once the tables are created, modifying this file will not have any effect anymore.

Initial region splits

A table in HBase is divided into a number of partitions, called regions. Initially each table starts out with one region; when a region reaches a certain size, it is split into two.

On an empty cluster, there will be only one region for each table, which means that all updates will go to that one region, and hence the load will be unevenly spread among the servers in your cluster. Therefore, HBase allows you to define initial table splits when creating a table.

Lily creates certain tables, such as the records table, with initial splits. Each split is defined by a start key and an end key; these need to be selected such that the created records will spread more or less evenly over the various regions.

If you create records with UUIDs as record IDs, then Lily can automatically calculate the appropriate start and end keys, given a certain number of regions. In case you assign the record IDs yourself, you will need to define the splits yourself, or, simpler, set the number of initial regions to 1.

11.2 Optimizing HBase Request Load Balancing

To get the maximum out of your cluster, the request load of each of the HBase region servers should be similar. For example, if one server were processing 300 requests/sec and another one 2000 requests/sec, then you would be making far from optimal use of your cluster's resources.

While we won't explain HBase regions in detail here, the important thing is that the regions of one table should be spread as equally as possible over all the region servers. For example, if you have two tables with 10 regions each and two region servers, then rather than putting all 10 regions of one table on one region server, it is better to put 5 regions of each table on each region server.


11.2.1 Record & linkindex tables

All Lily records are stored in one big table called records. If you are making use of Lily-generated UUID record IDs, then load balancing will be optimal. If you are making use of your own IDs, make sure to choose them such that they are not sequentially increasing.

On an empty cluster, it might take a while for the record table to grow to a good number of splits. Therefore, it is possible to pre-split the record table. See Table creation settings (page 152).

11.2.2 Rowlog tables (rowlog-mq and rowlog-wal)

The rowlog tables are system tables used by Lily. They contain the time-ordered sequence of events happening to the records. Since it is time-ordered, load balancing would normally be bad, since we would always be touching the same region server. However, the rowlog tables can be created with a number of splits, and Lily will 'salt' the timestamps so they are equally divided over the splits. We call this "rowlog sharding".

To configure the number of splits, see the shardCount parameter in conf/rowlog/rowlog.xml. By default, 1 split is created. Important: this parameter should not be changed after the initial Lily startup, and it should have the same value on all Lily nodes.

There is no dynamic way to change the shardCount after initial Lily startup, though if necessary you can do it with the procedure described next. This procedure involves dropping the rowlog-mq & rowlog-wal tables, so only do this if either they are empty or there is nothing in them that you care about (e.g. you will do a full index rebuild and you don't need the linkindex). The procedure is: stop all Lily servers, drop the rowlog-mq & rowlog-wal tables, change the shardCount setting (on all servers), and start the Lily servers. On startup, Lily will create the rowlog-mq & rowlog-wal tables again, with the newly configured number of splits.
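
For reference, dropping the tables can be done from the HBase shell using the standard commands (make sure all Lily servers are stopped first):

disable 'rowlog-mq'
drop 'rowlog-mq'
disable 'rowlog-wal'
drop 'rowlog-wal'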

11.2.3 Fixing bad region assignment

In case for some reason the regions in your cluster are not well-balanced, you can tell HBase to reassign the regions in a round-robin fashion by adding the following configuration to the hbase-site.xml of the HBase master:

<property>
  <name>hbase.master.startup.retainassign</name>
  <value>false</value>
</property>

After changing this setting, you need to restart your HBase cluster. The usual HBase startup behavior is that it will try to redeploy each region on the same server, as this assures good data locality with the HDFS data nodes, but the above setting will enable a fresh assignment. Don't forget to remove it again afterwards.

11.3 Metrics

Lily makes available some metrics by making use of Hadoop's metrics package. Metrics give information about the average time a certain operation takes, the number of operations done per second, and the like.

Some of the tools like the tester (page 75) and the mbox-import (page 73) also report metrics.


The metrics can be consulted via JMX or can be reported to Ganglia. Ganglia can collect metrics data from multiple nodes, and uses RRDtool to store the data and make graphs of it.

11.3.1 JMX

The JMX metrics are enabled by default.

You can for example consult them using jconsole. For local processes, look for the class name org.kauriproject.launcher.RuntimeCliLauncher.

The values are updated every 15 seconds; to modify this, see conf/general/metrics.xml.

11.3.2 Ganglia

For Ganglia you can use either version 3.0.x or 3.1.x.

The Ganglia metrics need to be enabled by editing the file conf/general/metrics.xml.

For example in that file you will see:

<attribute name="rowlog.class" value="org.apache.hadoop.metrics.spi.NullContextWithUpdateThread"/> <attribute name="rowlog.period" value="15"/> <!-- <attribute name="rowlog.class" value="org.apache.hadoop.metrics.ganglia.GangliaContext31"/> <attribute name="rowlog.servers" value="localhost:8649"/> -->

For Ganglia you would then change this to:

<attribute name="rowlog.period" value="15"/>

Page 156: Book

Lily documentation 155

<attribute name="rowlog.class" value="org.apache.hadoop.metrics.ganglia.GangliaContext31"/> <attribute name="rowlog.servers" value="localhost:8649"/>

If you use Ganglia 3.0.x, drop the "31" at the end of the class name.

11.4 ZooKeeper Connection Loss And Session Expiration Behavior

ZooKeeper is the central service for coordination among and configuration of the Lily processes.

An application such as Lily that makes use of ZooKeeper needs to decide how it deals with situations where the connection with ZooKeeper is lost or its ZooKeeper session expires.

For Lily, it works as follows:

• Upon startup, Lily waits for the connection to ZooKeeper to come up before continuing. If this takes longer than the session timeout, the process exits.

• When a Lily server process loses its ZooKeeper connection, it immediately shuts down any leader-election based services, such as the indexer master and the rowlog processor (= the message queue message dispatcher). This is to avoid that these services could run on two Lily servers at the same time. When the connection with ZooKeeper is re-established within the session timeout, the Lily server will still have its leader position for these services and will restart them.

• When the ZooKeeper session of a Lily server is expired, the Lily server process shuts itself down.

• When the connection is lost and fails to re-establish itself within twice the session timeout, we proactively assume the session will be expired and shut down the Lily server process. This is because a Lily server which has lost the connection to ZooKeeper will often not be able to do much useful work anymore; additionally, we prefer not to leave a Lily node working when it possibly has out-of-date configuration information.

Per Lily server, there are two connections with ZooKeeper:

• one for Lily itself

• and one established by HBase


12 Glossary

12.1 index entry

In the context of Lily's indexer, an index entry is the entry in the index for a certain Lily record, thus the Solr document corresponding to a certain Lily record, or more correctly, to a specific version of a Lily record. Thus there can be multiple index entries for each record, in case multiple versions of a record are indexed.


13 Lily Hackers

This section of the documentation contains information intended for people working on (rather than with) Lily.

13.1 Getting Started

13.1.1 Lily Source Code

13.1.1.1 Getting the sources

Use:

svn co http://dev.outerthought.org/svn_public/outerthought_lilyproject/trunk/ lily-trunk

13.1.1.2 Building Lily

See the README.txt in the root of the source tree.

In short, if you have Maven installed, do:

mvn -Pfast install

The -Pfast option skips the test cases. Some of the tests will by default launch an embedded Hadoop/HBase, which takes time. This can be sped up by running against an existing HBase install; this is all explained in the README.txt.

13.1.1.3 Running Lily

During development, you can run Lily similarly to how you run the binary distribution (see Running Lily (page 14)); the only difference is that the commands are in different locations.

To run launch-test-lily, you do

cd cr/standalone-launcher
./target/launch-test-lily

To run Lily, you do

cd cr/process/server
./target/lily-server


Tip: when you make changes to the Lily source code, after building with Maven you can directly restart Lily. There is no packaging or deploying to do. This is because the Kauri Runtime platform on which Lily runs directly loads the project dependencies (= constructs the classpath) using your local Maven repository (~/.m2/repository).

To run the indexer related commands like lily-add-index, lily-list-indexes:

cd cr/indexer/admin-cli
./target/lily-add-index
./target/lily-list-indexes
...

To run the import tool:

cd apps/import
./target/lily-import

13.1.1.4 Building a binary distribution

To build a .tar.gz like the one you can download from the Lily website, see the instructions in dist/README.txt.

13.1.2 Repository Model To HBase Mapping

Here we describe how Lily stores records, record types and field types in HBase.

We assume you are familiar with HBase: you know about tables, rows, row keys, column families, column qualifiers, and timestamps.

13.1.2.1 Records

13.1.2.1.1 One table for all records

All records are stored within one HBase table.

13.1.2.1.2 One record = one HBase row

A record, including all its versions, is stored in one row.

13.1.2.1.3 Row key = Record ID

The row key is the binary representation of the ID of the record as produced by the RecordId.toBytes() method. This byte encoding is such that it starts with the master record ID, so a search for all variants of a record can be done by prefix-scanning on the master record ID.
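
As an illustration, such a variant lookup can be sketched as a plain HBase prefix scan. The table name "records" is as described below; obtaining the master record ID bytes via toBytes() is as described above, while the masterRecordId variable itself is a placeholder:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;

HTable table = new HTable(HBaseConfiguration.create(), "records");
byte[] masterIdBytes = masterRecordId.toBytes(); // master record ID = row key prefix
Scan scan = new Scan(masterIdBytes);
scan.setFilter(new PrefixFilter(masterIdBytes));
ResultScanner scanner = table.getScanner(scan);
try {
    for (Result row : scanner) {
        // each row is one variant of the master record
    }
} finally {
    scanner.close();
}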

13.1.2.1.4 Column families and version numbering

Lily uses two column families:

• one called "data" which contains all the record data, both system fields (such as record type,current version number, delete marker, lock field) and user (record) fields

• one called "rowlog" which contains the columns related to the WAL and MQ rowlog


The system and user fields are distinguished by means of a prefix byte in the column key: see LilyHBaseSchema.RecordColumn.SYSTEM_PREFIX and DATA_PREFIX.

The data column family is configured to keep all versions (by default, HBase only keeps the 3 most recent versions and throws away the others).

For versioned data we make use of the time dimension of HBase. As timestamp we use the version number: 1, 2, 3, ... Non-versioned data is always stored at timestamp 1.
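
As a sketch of what this means at the HBase API level (the family name "data" is as described above; the recordIdBytes, fieldTypeIdBytes, fieldValueBytes and table variables are placeholders):

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

long version = 3; // the Lily record version, used as the HBase timestamp
Put put = new Put(recordIdBytes);
put.add(Bytes.toBytes("data"), fieldTypeIdBytes, version, fieldValueBytes);
table.put(put);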

If a value is not changed from one version to another, it is not stored a second time but the value is 'inherited' from the previous version (cfr. the sparseness of HBase tables). If a field is deleted in a version, its value should not be inherited; this is done by storing a 'deleted' marker as value. This also brings the advantage that we can do a delete as part of an HBase Put, so that all updates to the row are done as one atomic unit (in HBase, Put and Delete are both atomic, but separate actions).

13.1.2.1.4.1 Version numbering and record re-creates

This section gives some more background information on the version numbering wrt record deletes and re-creates.

When a record is deleted in Lily, the deleted marker flag is set to true and all historical data (record type, record type version, field data) that existed for the record is cleared. The current version number is however kept. When later a record is created with the same record id, this is regarded as a record re-create. The record is created (as for a normal create), but the version numbering of the record will continue from where it was when it was deleted (e.g. if the version number was 4 when the record was deleted, the re-created record will get version number 5).

There are a number of reasons why this has been designed and implemented like this, and not for instance with an HBase row-delete:

1. First of all there is the way HBase behaves wrt row-deletes. When a row is deleted in HBase, a tombstone is written. When a major compaction happens (which can take as long as 24 hours), the tombstone and everything older than the tombstone will be removed. As long as the tombstone is present, reading data from HBase will ignore everything that is older than the tombstone. However, if we write information after the row was deleted, while the tombstone is present and with a timestamp (version) older than the tombstone (e.g. our non-versioned data), this data will still be ignored and even removed when a major compaction happens. If the major compaction had already happened (and thus the tombstone was removed), then writing and reading new data would succeed. This is inconsistent (non-idempotent) behaviour. Issues HBASE-2847, HBASE-2256 and HBASE-2856 relate to this, and as long as those are not solved, this is a problem.

2. The row in HBase representing our record does not only contain our record's data, but also rowlog related information like the row-local table (see HBase Rowlog Library1). This information is for instance used to update the link-index and is still needed even after the record has been deleted. Removing the whole HBase row would thus also remove this information.

3. If we would hide from the Lily user that the version numbering continues from where it was when the record was deleted, some mapping would be needed to map record version numbers onto an internal (increasing) version numbering of the record. This however introduces more complex (and thus slower) read/write paths, which we like to avoid as much as possible.


13.1.2.1.5 Fields = columns

The fields are stored as columns, thus one column per field. This is also true for LIST or RECORD fields: these are encoded into one column's value. The byte-encoding of a field value is provided by the ValueType interfaces.

The column qualifier (= the name of the column) is the system-generated field type ID.

13.1.2.2 Record types & field types

The repository schema, i.e. the record types and field types, is also stored in an HBase table.

The details of their mapping onto HBase are currently not documented here.

13.1.3 Blobstore

In this document we describe the API and design of the blob store: how blobs are stored in the repository and how they relate to records and fields.

13.1.3.1 General

The general idea is that, to enable introducing record-level access control in the future, blobs should only be accessed through the record they are used in (via the repository API) and not directly using their blob key.

Only in the very initial phase, where blobs are uploaded to the blobstore, can they exist without being part of a record. Before a blob can be used in a record, it must have been uploaded to the blobstore. During a certain amount of time (e.g. 1 hour) the uploaded blob can then be used in a record. If after that time the blob was not used in a record, it becomes unavailable and will be removed from the blobstore.

Blobs can be re-used, but only within different versions (also non-sequential ones) of the same field of the same record. Blobs cannot belong to multiple records or multiple fields at the same time.

13.1.3.2 API and usage

13.1.3.2.1 Writing

Repository : OutputStream getOutputStream(Blob blob) throws BlobException, InterruptedException;

To upload a blob to the blobstore, an OutputStream must be requested from the Repository. After uploading the blob and closing the OutputStream, the blob will be updated with information that allows the repository to find and retrieve the blob's data in the blobstore.
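
A minimal usage sketch (the Blob constructor taking media type, size and name, and the import paths, are assumptions based on the Lily 1.x API; the repository variable is a placeholder):

import java.io.OutputStream;
import org.lilyproject.repository.api.Blob;

byte[] data = "hello".getBytes("UTF-8");
Blob blob = new Blob("text/plain", (long)data.length, "hello.txt");
OutputStream os = repository.getOutputStream(blob);
try {
    os.write(data);
} finally {
    os.close(); // after this, the blob object points to the stored data
}
// now set the blob as the value of a BLOB-type field and create/update the record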

13.1.3.2.2 Reading

Repository : BlobInputStream getInputStream(RecordId recordId, QName fieldName, Long version, Integer multivalueIndex, Integer hierarchyIndex)
(+ variants of this method with only the essential parameters)

To retrieve a blob, an InputStream must be requested from the Repository. The InputStream can only be retrieved by giving the 'location' of the blob within a record: the record's id, the fieldName, the version of the record (or null if the latest record version should be used, or if it is not applicable, as is the case for non-versioned fields) and the multivalueIndex or hierarchyIndex (e.g. 0 for the first position) of the blob in case the field is multivalue or hierarchical or both.

When finished reading, this InputStream must be closed just like any other InputStream.

The returned InputStream is a subclass of InputStream, BlobInputStream, which offers one additional method to return the Blob metadata object:

Blob BlobInputStream.getBlob()

The purpose is to get access to this metadata (size, content type) without having to do an additional record read, which already happens as part of the getInputStream implementation. One immediate application is the ability to set the appropriate response headers in the REST interface.
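
A corresponding read sketch, passing null for the version and indexes as described above (import paths assumed; recordId and fieldName are placeholders):

import org.lilyproject.repository.api.Blob;
import org.lilyproject.repository.api.BlobInputStream;

BlobInputStream is = repository.getInputStream(recordId, fieldName, null, null, null);
try {
    Blob blob = is.getBlob(); // metadata: size, media type, name
    // ... read the blob data from the stream ...
} finally {
    is.close();
}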

13.1.3.2.3 Referring

A blob can be referred to by using it in a record create or update operation. If the record is not allowed to refer to the blob, the create or update operation will throw an InvalidRecordException.

13.1.3.2.4 Deleting

A blob cannot be removed explicitly.

A blob will however be removed from the blobstore in three situations:

1. The blob was uploaded to the blobstore, but was not used in a record within the defined timeout. In other words, the blob upload was not followed by a create or update operation of a record referring to the blob.

2. An update or delete operation (of a non-versioned field or a mutable field) on a record can cause the blob to no longer be referred to. The blob will then be removed from the blobstore. Note that if an older version of the record (or a newer version, in case of an update of a mutable field) still refers to the blob, the blob will not be deleted.

3. When a record is deleted, all its fields will be cleared. As a consequence, any referred blobs will be deleted as well.

13.1.3.3 Design

13.1.3.3.1 Repository

The Repository provides the methods getOutputStream and getInputStream to write and read blobs (cfr. the API above). The usage of blobs within records is managed through the normal record CRUD operations of the repository.

13.1.3.3.2 BlobManager

The Repository uses a BlobManager component to manage the state of blobs. The BlobManager manages the HBase table: BlobIncubatorTable.


13.1.3.3.3 Blob Incubator Table

The BlobIncubatorTable is used to store references to blobs that have just been uploaded. When a blob is then used in a record create or update operation, this table is checked to see if the blob is indeed available to be used in a record. Before a blob is used, it is 'reserved' so that no other records can use the blob at the same time. Reserving a blob is done by adding the recordId next to the blob reference with a checkAndPut on HBase.

After a blob has been used in a record, its reference is removed from the BlobIncubatorTable.

13.1.3.3.3.1 Table layout:

• Table name = 'blobincubator'

• Rowkey = blobKey

• ColumnFamily = 'ref'

• Column1 = 'record' : contains -1 if the blob has just been incubated, or the recordId (bytes) if the blob has been reserved

• Column2 = 'field' : contains the fieldId (bytes) if the blob has been reserved
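
Based on this layout, the reservation can be sketched as a standard HBase checkAndPut; the byte encoding of the -1 marker shown here is an assumption for illustration, and blobIncubatorTable, blobKey and recordIdBytes are placeholders:

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

byte[] ref = Bytes.toBytes("ref");
byte[] record = Bytes.toBytes("record");
Put put = new Put(blobKey);
put.add(ref, record, recordIdBytes);
// Succeeds only if the 'record' column still holds the incubation marker,
// so two records can never reserve the same blob concurrently.
boolean reserved = blobIncubatorTable.checkAndPut(blobKey, ref, record, Bytes.toBytes(-1L), put);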

13.1.3.3.4 Blob Incubator Monitor

When a blob is uploaded, a reference is put in the BlobIncubatorTable and it can be used in a record. If the blob were never used in a record, it would remain forever in the blobstore, and a reference to it would stay in the BlobIncubatorTable forever. To avoid this, a BlobIncubatorMonitor scans the table on a regular basis and removes any blobs that were uploaded more than a minimal amount of time ago. This minimal time ('minimalAge') can be configured (default = 1 hour).

This monitor is a process that runs on only one Lily node (cfr. leader election), and it should run at a sufficiently low pace ('monitorDelay') so as not to use too many system resources and influence the other operations (default = 1 check/minute).

13.1.3.3.5 Workflows

13.1.3.3.5.1 Create record

1. A blob is uploaded using the OutputStream received from the repository.

2. Upon closing the OutputStream:

1. The BlobManager is requested to put the reference to the blob in the BlobIncubatorTable

2. A reference to the blob (blobId + where it is stored) is generated and stored in the value of the blob

3. The repository is asked to create a record containing blobs

1. For each blob in the record :

1. The BlobManager is requested to reserve the blob

1. A check is done to see if the blob reference is available in the BlobIncubatorTable (cfr. checkAndPut). If not, the create is not allowed.

2. The blob reference in the blobIncubator is updated to include the recordId of the record where it is going to be used.


1. The blob is reserved for this record

2. No other records can use the blob

2. Create record in HBase

1. For each blob the BlobManager is requested to remove the blob reservation

13.1.3.3.5.2 Update record

1. Either a new blob is uploaded using the OutputStream (see above), or a blob will be used that is already used by another version of the same field in the same record.

2. The repository is asked to update a record containing blobs

1. For each (to-be-updated) blob in the record :

1. The BlobManager is requested to reserve the blob

1. A check is done to see if the blob reference is available in the BlobIncubatorTable, and whether it is reserved. If not, a check is done to see if the blob was already used in another version of the same field in the record. If not, the update is not allowed. Note: for the common case where a blob field was not changed with respect to the previous version, the field will already have been removed since it was not modified, and hence there is no additional overhead.

2. Perform the update of the record

3. The BlobManager is requested to remove any reservations made

4. In case of non-versioned fields or an update of a mutable field (either by putting a new blob or deleting a field), it is possible that some blobs are no longer referred to by the record

1. For each of these blobs, the BlobManager is requested to remove it

1. The BlobManager will put a reference to these blobs in the BlobDeleteTable, and the BlobDeleteMonitor will pick these up and delete them

Note that for inline blobs no incubation or reservation is done. Inline blobs can always be used in a record, no matter whether another field or record uses the 'same' inline blob.

13.1.3.3.5.3 Delete record

When a record is deleted, all fields are cleared. For each blob that was referred to by the record, the BlobManager is requested to delete the blob.

13.1.3.3.6 Failure scenarios

13.1.3.3.6.1 Failure after blob reservation

If a failure occurs in a record create or update operation after the step where the blob has been reserved, the blob would remain marked as reserved. If the create or update operation is retried, this reservation can be re-used, but only if the record id is known and corresponds to the record id in the existing reservation.

If the operation is not retried, there will be a reservation that refers to a record that either does not exist, or does not use the blob. The BlobIncubatorMonitor will clean up the reservation after the defined timeout, but before removing the blob from the blobstore an extra check is done to see if the blob has indeed not been used by the record, cfr. the next failure scenario.

13.1.3.3.6.2 Failure before removing the blob reservation

If the record create or update succeeded but removing the blob reservation failed, the reservation will remain in the BlobIncubatorTable. The BlobIncubatorMonitor will encounter this reservation and clean it up. Before removing the blob from the blobstore, it will check if the referred record exists and if it indeed uses the blob.

13.1.3.3.6.3 Failure before removing the blob

If a blob needs to be removed due to an update or delete operation, the BlobManager is requested to remove it. If a failure occurs just before (or during) doing this, the blob will never be removed. To avoid this we could introduce a secondary action on the WAL, but we chose not to do this since it would introduce a slowdown for all operations for a corner case which should happen very infrequently.

If needed, a BlobJanitor could still be implemented which scans all blobs in the blobstore and checks whether they are still used in some record.

13.2 Releasing

13.2.1 Building A Lily Release

These are the steps to perform an official Lily release.

13.2.1.1 Pre-release checks

13.2.1.1.1 Verify a clean Maven build works

This is to verify that a "real clean build" scenario would work, thus that the pom's don't reference something which is only available in your local repository.

Depending on how much you value your local Maven repository, you can either just throw away your local repository, or temporarily use another location as the local repository:

EM2R=/tmp/EMPTY_MAVEN2_REPO; rm -rf ${EM2R}; mkdir -p ${EM2R}
echo "<settings><localRepository>${EM2R}</localRepository></settings>" > emptyrepo.xml

mvn install -s emptyrepo.xml

13.2.1.2 Change versions

13.2.1.2.1 in wiring.xml

Edit the following file:

cr/process/server/conf/kauri/wiring.xml

and remove the "-SNAPSHOT" suffix from the versions. Commit this.


13.2.1.2.2 in README.txt

Adjust the documentation link to point to the correct version of the docs.

13.2.1.2.3 in dist README.txt

ditto in dist/src/main/resources-filtered/README.txt

13.2.1.2.4 in scm pom section

When using git, no scm configuration changes are needed. However, make sure you have read http://maven.apache.org/scm/git.html

In the root pom.xml, check the scm section.

For releases from trunk, it should contain:

<scm>
  <connection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/trunk</connection>
  <developerConnection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/trunk</developerConnection>
  <url>https://dev.outerthought.org/svn/outerthought_lilyproject</url>
</scm>

For releases from a branch, it should contain something similar to this (modify the branch name as applicable):

<scm>
  <connection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/branches/BRANCH_1_1_X</connection>
  <developerConnection>scm:svn:https://dev.outerthought.org/svn/outerthought_lilyproject/branches/BRANCH_1_1_X</developerConnection>
  <url>https://dev.outerthought.org/svn/outerthought_lilyproject</url>
</scm>

13.2.1.3 Configure Lily repository access

See Lily Maven repository access (page 175).

13.2.1.4 Run Maven release:prepare

Maven release:prepare performs the steps documented here2, most importantly:

• updates the version numbers in the pom.xml's to the release version number, and commits them

• tags the sources

• updates the version numbers in the pom.xml's to the next development version, and commits them

This does not yet deploy anything.

It is strongly recommended (read: for official releases, obliged) to do this on a fresh git checkout to avoid non-clean situations:


rm -rf lilyproject
git clone git@github.com:NGDATA/lilyproject.git
cd lilyproject

Then first do a dry run of release:prepare:

mvn -Pfast release:prepare -DautoVersionSubmodules=true -DpreparationGoals="clean install" -DdryRun=true

As long as the effective mvn release:prepare has not been performed, you can back out with mvn release:clean.

We do not need the preparationGoals parameter; it was taken from the Kauri build instructions, and since we use Kauri in Lily it is likely that we will need it eventually. Here is the original reason why the preparationGoals parameter is needed: by default the release plugin only executes the 'verify' phase, not install, but Kauri requires the artifacts to be installed in the local repository for Kauri Runtime based test cases to run.

Maven will interactively ask for:

• the release artifact version numbers: examples: 1.2 (not 1.2.0), 1.2.1

• the tag name: examples: RELEASE_1_2, RELEASE_1_2_1

• the next development version numbers: example: 1.3-SNAPSHOT

If this finished successfully, you can proceed for real:

mvn -Pfast release:prepare -DautoVersionSubmodules=true -DpreparationGoals="clean install"

If the above fails with a build failure like "The svn tag command failed. ... File ... already exists.", do an "svn up" and run the above command again. Apparently this is a problem starting from Subversion 1.5.1.

To deploy the artifacts to the repository:

• optionally increase your maven memory setting:

export MAVEN_OPTS=-Xmx2048m

• execute (-Pfast to skip the tests):

mvn release:perform -Dgoals=deploy -Pfast

• to clean up:

cd ..


rm -rf lily-trunk

13.2.1.5 Building the distribution

See Outerthought-internal procedure (lily-packages repository).

13.2.1.6 Post-release work

13.2.1.6.1 Change versions in wiring.xml

Edit the following file:

cr/process/server/conf/kauri/wiring.xml

and change the version numbers to those of the current development release: x.y-SNAPSHOT.

Also reverse the changes done to the README.txt's earlier.

13.2.1.6.2 Deploy javadoc

Check out the tagged sources (this is important so that the version in the pom is correct, as this determines the directory in which the javadoc will be deployed):

git clone git@github.com:NGDATA/lilyproject.git lily-release
cd lily-release
git checkout RELEASE_X_Y_Z # NOTE: this will put you in 'detached head' state. Use git checkout -b release-x-y-z if you want to.
mvn site-deploy

Once I had the problem that this gave the error "ArtifactNotFoundException: The skin does not exist: Unable to determine the release version". This was solved by bringing the versions of maven-site-plugin and maven-project-info-reports-plugin in Lily's root pom.xml in sync with the versions listed on http://maven.apache.org/plugins/

Verify the result is ok by surfing to:

http://lilyproject.org/maven-site/X.Y(.Z)/

And then relink the 'current' javadoc:

ssh lilyproject.org
cd /var/www/lilyproject.org/maven-site
rm current
ln -s {current version} current

13.2.1.6.3 Make new doc site

We need to make a Daisy site for the documentation of the new release.

Check out from outerthought svn the directory projects/outerthought/ot_dpt/trunk

Make a directory for the new site based on lily-docs-trunk:

cd site/src/main/dsy-wiki/sites
cp -r lily-docs-trunk lily-docs-{version}
cd lily-docs-{version}


find -name .svn -exec rm -rf {} \;

Have a look at siteconf.xml & skinconf.xml to change version dependent things.

The branch can stay at lilydocs-trunk until actual work on the docs for the next version starts. This avoids having to make edits in two versions for changes that happen shortly after the release. But do not forget to branch it + change the branch configuration in lily-docs-{version} once necessary, see Branching the docs (page 169).

When done, commit to svn:

svn add lily-docs-{version}
svn commit -m "Adding docs site for new Lily release" lily-docs-{version}

Retarget the link lily-docs-current. (The version should be in the same style as the others, with underscore, e.g. 1_2.)

(I don't know how to retarget links in subversion, the below is my quick hack)
svn delete lily-docs-current
svn commit -m "retargetting lily-docs-current link: remove existing link" lily-docs-current
ln -s lily-docs-1_0 lily-docs-current
svn add lily-docs-current
svn commit -m "retargetting lily-docs-current link: link to new target" lily-docs-current

Log in on lilyproject.org

ssh lilyproject.org
sudo su - daisy
cd ot_dpt/site
svn up
mvn daisy:init-wiki

Verify the new site works (it can take up to 10 seconds for Daisy to refresh the site information):

http://docs.outerthought.org/lily-docs-{version}

Check that lily-docs-current now shows the documentation of this new release:

http://docs.outerthought.org/lily-docs-current/

With the 0.2 release, it seemed like Daisy was not able to detect that lily-docs-current was changed, even though the timestamp of the siteconf.xml was surely changed. This was solved by restarting Daisy: /etc/init.d/ot-sites-wiki restart

13.2.1.6.4 Other things

• Change link to javadoc in the navigation of the documentation

• Adjust variables: http://docs.outerthought.org/lily-docs-current/variables - choose Edit linknext to "Lily Documentation Variables".

• Change download link in 'Running Lily' document (414-lily)

• Add release to releases table (457-lily) & change download link on site

• Mark milestone as done in trac

• Update docs.outerthought.org homepage

• Announce


13.2.2 Publishing The Lily Maven Site (javadocs)

The Maven-generated site (containing the javadoc) is available at http://lilyproject.org/maven-site.

Set up the Lily Maven repository access (page 175) if not already done.

Then execute the following command in the root of the source tree:

MAVEN_OPTS="-Xmx2500m" mvn site-deploy

The memory increase is because it seems to make site-deploy run much faster (ymmv).

13.2.3 Branching the docs

The release instructions cover how to set up a documentation site for the release, but assume the docs are not yet branched immediately. Here we describe how to branch the docs.

Branch the docs

Log in to Daisy, switch to the Administrator role and go to the Administration screen.

Create a branch called lilydocs-M_m (where M_m is the version number, e.g. 1_4)

Choose Tools, Document Tasks and create a new document task

select documents using a query:

select name where collections='lilydocs' and branch='lilydocs-trunk'

Move to next screen

Enter as description "Branching lilydocs for M.m"

Choose as type of task 'Simple actions'

Choose as task 'Create variant', choose branch lilydocs-M_m and as language en

Start the task, verify it finished successfully.

Update the site definition

Check out from outerthought svn the directory projects/outerthought/ot_dpt/trunk

Edit the siteconf file for the release:

cd site/src/main/dsy-wiki/sites
vi lily-docs-M_m/siteconf.xml

Change the content of the branch tag from lilydocs-trunk to lilydocs-M_m

Now apply the update to the live site:

ssh lilyproject.org
sudo su - daisy
cd ot_dpt/site
svn up

Go to the site and check that the correct branch is used (by looking at the Variants menu or using the info icon).


Edit the homepage of the trunk site to update references of the old version number to 'nextversion-dev' (or 'trunk' if the next version number is unknown).

Modify variables

Go to http://docs.outerthought.org/lily-docs-trunk/variables

Choose the Edit link next to "Lily Documentation Variables". Edit the content of this documentappropriately.

13.2.4 Pre-Release Verifications

The goal of this section is to collect things that are useful to verify before doing a release. Some of these could be automated, though it is often useful to do some manual observations too.

So, the things to check:

The real basics.

• Go through the Running Lily (page 14) scenario.

Are there no HBase/ZooKeeper connection leaks?

• This should be verified both for lily-server and lily-client

• Update: as of Lily 1.2, there is an integration test which checks this for lily-client.

• An easy way to check is to use jps to find the process and jstack to dump its threads, grepping for threads with "EventThread" in the name (these are from ZooKeeper; there is one per ZooKeeper client). A more advanced way is to use jprofiler.

• In the lily-server process, there should be one ZooKeeper client for Lily, and, at the time of this writing, 2 for HBase (one for HTable and one for HBaseAdmin)

Can LilyClient survive restarts of the lily-server process?

• Start global/hadoop-test-fw/target/launch-hadoop

• Start cr/process/server/target/lily-server

• Run a client process which does repeated createOrUpdate operations on Lily: e.g. lily-mbox-import or lily-tester

• Stop the lily-server process

• The client process should now keep retrying the operation for a while

• Start the lily-server process

• The client process should continue where it left off. What might go wrong, for example, is that it stays blocked.

Can the lily-server be stopped within a reasonable time?

Stopping lily-server (e.g. using ctrl+c if started in a console) should finish within a reasonable amount of time, and not take many dozens of seconds, or minutes, or hang forever.


Verify this also while a client process is continuously doing create/update operations, and with an index defined (to be sure the rowlog and indexer processes are interrupted correctly).

Are there no important memory leaks, and especially thread leaks, when using resetLilyState a repeated number of times?

You need to run the whole Lily stack with cr/standalone-launcher/target/launch-test-lily.sh. (The resetLilyState operation only exists in that case.)

There is a script in cr/standalone-launcher/resetLilyState_duration_test to help with this; observe thread counts and memory with jconsole, and let it run for at least 200 resetLilyState iterations.

Run integration tests

Make sure to also run the integration tests:

mvn -Pintegration

Batch index build on real clusters

Batch index builds should be tested on a real cluster, not only in combination with launch-hadoop or launch-test-lily, since there can be classpath differences in the launched task VMs.

13.3 Guidelines

13.3.1 Code Style

13.3.1.1 Java Code style

The goal of the code style guidelines is that the code looks the same throughout the code base. This improves both readability and writeability (you don't have to decide how to write your code).

The style proposed here is one that is followed by many open source projects.

13.3.1.1.1 Whitespace

13.3.1.1.1.1 Indenting

Indenting is done using 4 spaces, not tabs. The tab character should not occur in source files.

Indenting should, obviously, increase as nesting increases. Each increment should be 4 spaces.

Bad:

for (int i = 0; i < 10; i++) {
  System.out.println(i);
      total += i;
}
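
Good (each nesting level indented by 4 spaces):

for (int i = 0; i < 10; i++) {
    System.out.println(i);
    total += i;
}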

13.3.1.1.1.2 In-line indenting

13.3.1.1.1.3 Spacing

Expressions are written with whitespace between them, rather than sticking everything together.


Bad:

String foo="bar";
int y=3*5+(64/8);
void doSomething( String x,String y ){

Good:

String foo = "bar";
int y = 3 * 5 + (64 / 8);
void doSomething(String x, String y) {

Casts are written without space after them:

Object object = "hello";String hello = (String)object;

13.3.1.1.1.4 Trailing spaces

Configure your editor or IDE to drop trailing spaces.

13.3.1.1.1.5 Newlines

Between methods there should be one blank line. Between instance variables there should be no blank lines, except for grouping related variables.

13.3.1.1.2 Bracket placement

Opening brackets are not placed on a new line.

Good:

if (x < 3) {
    ...
} else {
    ...
}

13.3.1.1.3 Line length

The maximum line length should be (about) 120 characters. Code should not be written such that everything is chopped at 80 characters.

13.3.1.1.4 Names

Names are important, think about them.

13.3.1.1.4.1 Use camel-case

Follow the Java guidelines.

Static final variables should be all uppercase.

13.3.1.1.4.2 Use descriptive names

Thus use "image" rather than "im", especially for method names & arguments.


13.3.1.1.4.3 Single-letter variables

Do not use single-letter variable names, except for loop indices and maybe in short, complex algorithms where long names would complicate reasoning about the code.

13.3.1.1.5 Comments

Besides documenting APIs, comments should certainly be used for anything unusual, so that people, including yourself, do not end up wondering a few months later why something was done in a particular way.

13.3.1.1.5.1 Write HTML-formatted Javadoc

When writing Javadoc comments, make sure they contain the necessary markup to be readable in the generated javadoc. Most importantly, start new paragraphs with a <p>. In HTML, it is not necessary to add the closing </p>. The first sentence up to the first dot is used by Javadoc to show in overviews, so make sure it exists and is meaningful.

Example:

/**
 * Thing to do stuff.
 *
 * <p>Blah blah blah ...
 *
 * <p>Blah blah blah...
 */

13.3.1.1.5.2 Drop meaningless comments

Sometimes IDEs generate standard javadoc with @param declarations for all parameters. Source files are then sometimes full of empty comments just listing these parameters. These are extra lines to read, and they are never maintained anyway as parameters are added and removed, so it is better to drop them altogether. In summary, only leave meaningful comments.

13.3.1.1.5.3 Do not use designer comments

Do not use things like:

// =============================================

// ~~~~~~~ begin methods ~~~~~~~~~~~

/******************************************************/

13.3.1.1.5.4 TODO and FIXME comments

TODO and FIXME comments can be used.

TODO comments can be useful markers during development, but we encourage you to fix as many of them as possible before committing a change set, since otherwise many of these TODO's stay around a long time. Have the discipline to write code in production style immediately.


13.3.1.1.6 For loops

The new-style for loops are preferred over the old-style.
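
For example (an illustrative snippet; names stands for any Collection<String>):

// Old style:
for (Iterator<String> it = names.iterator(); it.hasNext();) {
    String name = it.next();
    System.out.println(name);
}

// New style:
for (String name : names) {
    System.out.println(name);
}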

13.3.1.2 Non-Java source files

13.3.1.2.1 XML

XML is indented with 2 spaces.

The XML declaration should always be present on the first line: <?xml version="1.0"?>

No spaces are used around the = sign of attributes.

Empty tags are written with the closing marker stuck to the tag name or last attribute:

Bad:

<foo /><foo x="y" />

Good:

<foo/><foo x="y"/>

13.3.2 Programming Guidelines

13.3.2.1 InterruptedException

If you get an InterruptedException, after handling it if necessary, always throw it further. If you can't throw it further because you are implementing an interface which does not have it declared (such as Runnable), set the Thread.interrupted flag again.

By adding InterruptedException to the throws clause of a method, you are indicating to the caller that it is an operation which can be interrupted (typically because it is blocking/waiting, but it could also be an interruptible loop).
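
A sketch of the Runnable case (doBlockingWork is a placeholder for any interruptible operation):

public void run() {
    try {
        doBlockingWork();
    } catch (InterruptedException e) {
        // Cannot rethrow from Runnable.run(): restore the interrupt flag
        // so that callers further up can still detect the interruption.
        Thread.currentThread().interrupt();
    }
}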

See this article by Brian Goetz3.

13.3.2.2 ZooKeeper

• In the Lily server node we make use of one common ZooKeeper instance, rather than instantiating multiple ones here and there.

• Beware that all ZooKeeper events are delivered to watchers by a single thread. Therefore, do not do anything time-consuming or blocking in them, so that the other watchers can also react timely to events. Obviously, also do not do anything which by itself might again wait for ZK events.

• Be sure your code behaves correctly when the connection is temporarily lost. Thus correctly handle Disconnected & SyncConnected events.

• Any ZK call can throw a ConnectionLossException, in which case you are not sure whether your operation succeeded or not, and whether your requested watcher has been installed or not.

• Possibly make use of a retry-on-disconnected loop, see the ZkUtil class.


• Handle the exceptions relevant to the operation you are doing (e.g. a NoNodeException for a delete operation), and throw the other ones further on (see also the section on InterruptedException).

• Currently we give up on the ZK expired event (shut down the application), thus being ableto recover from that is not necessary.

13.4 Lily Maven Repository Access

Here we explain what to set up to be able to deploy artifacts to the Lily Maven repository.

Maven settings

Configuring your Maven settings is important so that the permissions of the deployed files are correct; otherwise you'll have to fix them manually afterward (or, most likely, you won't notice it, and the next person trying to deploy might have problems).

In the following file (create it if it does not exist):

~/.m2/settings.xml

make sure the following server entries are included:

<settings>
  <servers>
    <server>
      <id>org.lilyproject.maven-deploy</id>
      <directoryPermissions>775</directoryPermissions>
      <filePermissions>664</filePermissions>
    </server>

    <server>
      <id>org.lilyproject.maven-snapshot</id>
      <directoryPermissions>775</directoryPermissions>
      <filePermissions>664</filePermissions>
    </server>

    <server>
      <id>org.lilyproject.website</id>
      <directoryPermissions>775</directoryPermissions>
      <filePermissions>664</filePermissions>
    </server>
  </servers>
</settings>

Passwordless login

To avoid entering your password many times during the deployment of the artifacts to the public repository, you should add your public key to the ~/.ssh/authorized_keys2 file on lilyproject.org. If you are unfamiliar with this, stop reading here and find out how to do this. It will take you less time than entering your password a gazillion times.

13.5 Incompatible changes (by commit)

Here we list incompatible changes that happen to Lily. This can be changes to data format, configuration, API, scripts, etc. The changes are listed by commit, so that when using Lily trunk you can check if any incompatible changes happened between now and the last time you fetched the sources.

Revision 5114 (October 11, 2011)

It is no longer allowed to use 'null' as namespace in a QName. An upgrade tool for existing repositories is available.

Revision 5096 (October 4, 2011)

The configuration for dynamic fields in the indexerconf changed to cope with the refactored value types.

Changes concerning the matching of fields:

• matchMultivalue dropped

• matchHierarchical dropped

• matchType now contains a new kind of expression, though this is in fact backwards compatible. You can now do things like LIST<*>, which matches a list with any kind of arg, or RECORD<{namespace}*>, which matches any record type within the given namespace.

Changes to the expression for producing the Solr field name:

• primitiveType & primitiveTypeLC have been dropped, replacements are:

• type

• baseType

• nestedType

• nestedBaseType

• deepestNestedBaseType

• hierarchical dropped

• multivalue kept, though it should be considered deprecated. list has been added as a replacement (with the same semantics)

Revision 5082 (October 3, 2011)

The syntax for declaring formatters in the indexerconf.xml changed, as well as how this configuration is interpreted. The Formatter interface changed as well.

These changes were done to cope with the change from primitive value types to the generified value types.

Since it was not possible to register custom formatters, and since there was only one default formatter available, this change should not affect you.

Revision 5073 (September 26, 2011)

In the REST interface, the index request parameters mvIndex and hIndex for getting blobs are replaced by the 'indexes' parameter, which is a comma-separated list of integers.


Revision 5071 (September 26, 2011)

Changed the JSON format for the lily-tester in the same way as for the lily-importer and REST interface in revisions 5066 and 5067 (see below).

Revisions 5066 and 5067 (September 22, 2011)

The JSON format for the lily-importer and REST interface has changed so as to support the new value types: List, Path, Record and Link.

In a FieldType, the ValueType should be represented by just a string, and no longer by an object with the primitive, multivalue and hierarchical properties.

The string for the ValueType represents the full name of the value type: valuetype = BLOB | BOOLEAN | DATETIME | DATE | DECIMAL | DOUBLE | INTEGER | LONG | STRING | URI | LIST<valuetype> | PATH<valuetype> | LINK[<rtNamespace$rtName>] | RECORD[<rtNamespace$rtName>]
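
As an illustration (the exact property names are based on the description above and may differ from the actual JSON; this is a sketch, not the authoritative format), where a field type definition previously contained an object such as:

"valueType": { "primitive": "STRING", "multivalue": true, "hierarchical": false }

it now contains a plain string:

"valueType": "LIST<STRING>"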

13.6 Creating Snapshots Of 3rd Party Projects

13.6.1 Building HBase Snapshot

Here we explain how to deploy an HBase SVN snapshot version to Lily's Maven repository. This is used in case we want to make use of a non-released HBase version in Lily.

13.6.1.1 Check out HBase

13.6.1.1.1 Existing checkout

• do 'svn status' to be sure there are no dirty files

• do 'svn up' and write down the SVN revision number

13.6.1.1.2 No existing checkout

Fetch a copy of the source using:

svn export http://svn.apache.org/repos/asf/hbase/trunk hbase-trunk

When this finishes, a revision number will be printed; write it down.

13.6.1.2 Change HBase version number

HBase trunk has a version number of the style X.Y.Z-SNAPSHOT. As we want to know exactly what sources we are using, we will rename this to something that includes the SVN revision number.

It seems like a newer, unreleased version of the Maven release plugin has a special command for this (release:update-versions). But since we cannot use this yet, we revert to a simpler mechanism: find and sed. The following command relies on the fact that the HBase version number is unique, i.e. that no other dependency uses the same version number. Adjust the version strings to match the current HBase version and the SVN revision number determined earlier.


find -name pom.xml -exec sed -i 's/<hbase.version>0.89.0-SNAPSHOT<\/hbase.version>/<hbase.version>0.89.0-r917988<\/hbase.version>/g' {} \;

Do an 'svn diff' to verify that this made the correct changes.

13.6.1.3 Build

Execute:

mvn -DskipTests clean install

13.6.1.4 Test

At this point, before going on with the deploy, you will probably want to change the HBase (and Hadoop) version in Lily to try out whether this HBase build works fine.

13.6.1.5 Deploy

First set up Lily Maven repository access (page 175) if not already done.

Execute:

mvn deploy -DaltDeploymentRepository=org.lilyproject.maven-deploy::default::scp://lilyproject.org/var/www/lilyproject.org/maven/maven2/deploy

Note: For now, use Maven 2. Using Maven 3 here gives a "No connector available to access repository" error.

13.6.1.6 Make binary build available

For when someone wants to run Lily against an installed HBase cluster of this same version, make available a binary distribution of HBase like this:

mvn -DskipTests=true package assembly:assembly

scp target/hbase-0.89.0-r{revision number}-bin.tar.gz lilyproject.org:/var/www/lilyproject.org/files/hbase

The matching Hadoop version should also be provided.

Currently (June 30, 2010) HBase trunk uses the Hadoop "branch-0.20-append" branch, which can be obtained as follows:

svn co -r {revision found in hbase pom} http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ hadoop-common

(please update the above if not correct anymore)

And then to build:

In theory:

ant tar

In practice, to succeed, I used:

ant -Djava5.home=/usr/lib/jdk1.5/ -Dforrest.home=/path/to/apache-forrest-0.8 tar


Note: forrest does need java5 (it will give sitemap validation errors with java6), and it seems like building the docs cannot be skipped (even though it does seem to be intended to be skipped if forrest.home is not set).

At the end of the build, the path to the created hadoop tar file will be printed.

scp hadoop-0.20.3-append-r{revision}.tar.gz lilyproject.org:/var/www/lilyproject.org/files/hadoop

13.6.1.7 Revert version number changes

If you have an HBase checkout (rather than an export), revert the changed version numbers using:

svn revert -R .

13.6.2 Building Kauri Snapshot

These are the instructions to build a versioned Kauri release from Subversion. This is for the case where we want to use a non-released Kauri version in Lily.

13.6.2.1 Check out Kauri

13.6.2.1.1 Existing checkout

• do 'svn status' to be sure there are no dirty files

• do 'svn up' and write down the SVN revision number

13.6.2.1.2 No existing checkout

Fetch a copy of the source using:

svn export https://dev.outerthought.org/svn/outerthought_kauri/trunk kauri-trunk

When this finishes, a revision number will be printed; write it down.

13.6.2.2 Change Kauri version number

Look in Kauri's pom.xml for the current version number.

Execute the following command (from within the kauri-trunk directory) to replace the version numbers. Adapt the version strings: the first one should be equal to Kauri's current development version number (as found in the pom.xml), the second should be the same but with the word 'SNAPSHOT' replaced with the Subversion revision number noted above.

find -name pom.xml -exec sed -i 's/<version>0.4-dev-SNAPSHOT<\/version>/<version>0.4-r1538<\/version>/g' {} \;

13.6.2.3 Deploy

Make sure the repository org.lilyproject.maven-deploy is configured in your ~/.m2/settings.xml, as described in Lily Maven Repository Access (page 175).

Execute:


mvn deploy -DaltDeploymentRepository=org.lilyproject.maven-deploy::default::scp://lilyproject.org/var/www/lilyproject.org/maven/maven2/deploy

13.6.2.4 Revert version number changes

If you have a Kauri checkout (rather than an export), revert the changed version numbers using:

svn revert -R .

13.6.3 Deploying SOLR war To Maven

SOLR does not publish its war in Maven (though see SOLR-1218), but for use in Lily's testcases it is convenient that it is available in Maven. Here we describe how to publish the Solr war to Lily's Maven repository.

First, because Maven 3 does not have scp support by default, you need to create a dummy pom.xml file containing:

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>dummy</groupId>
  <artifactId>dummy</artifactId>
  <version>1.0-SNAPSHOT</version>
  <build>
    <extensions>
      <extension>
        <groupId>org.apache.maven.wagon</groupId>
        <artifactId>wagon-ssh</artifactId>
        <version>2.0</version>
      </extension>
    </extensions>
  </build>
</project>

Then, it can be published into Lily's repository as follows:

mvn deploy:deploy-file \
  -Dfile=/path/to/apache-solr-1.4.1/dist/apache-solr-1.4.1.war \
  -Durl=scp://lilyproject.org/var/www/lilyproject.org/maven/maven2/deploy \
  -DgroupId=org.apache.solr \
  -DartifactId=solr-webapp \
  -Dversion=1.4.1 \
  -Dpackaging=war

Notes

1. http://www.lilyproject.org/lily/about/playground/hbaserowlog.html

2. http://maven.apache.org/plugins/maven-release-plugin/examples/prepare-release.html

3. http://www.ibm.com/developerworks/java/library/j-jtp05236.html

4. changeset:5114

5. changeset:5096

6. changeset:5082

7. changeset:5073

8. changeset:5071

9. changeset:5066

10. changeset:5067