SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore
The Fifth Elephant 2013, Bangalore, 12th July 2013
SolrCloud and NoSQL
Anshum Gupta
Who am I?
• Anshum Gupta
• Search and related stuff for around 8 years now
• Apache Lucene since 2006, Solr since 2010
• Currently:
• Helped launch the first AWS search service, CloudSearch.
• Places I’ve worked at:
Big Data
• Real Value = Process + Store + Search
• Search
- No longer expensive
- Affordable
- Necessity
- Can get as complicated as you’d want it to get.
[Figure: many “Loads of Data” sources feeding into Data, and from there into Search]
NoSQL Databases
• Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
• Non-traditional data stores
• Don’t use / aren’t designed around SQL
• May not give full ACID guarantees
- Offer other advantages, such as greater scalability, as a tradeoff
• Distributed, fault-tolerant architecture
DB Rankings: Overall
Source: http://db-engines.com/en/ranking
Search Engine Rankings
Source: http://db-engines.com/en/ranking/search+engine
MongoDB
• Data Model: BSON
• Distributed Model: Sharded, with asynchronous master-slave replication.
• Consistency: Per table write lock.
• Search:
- Built-in full-text search, but with large gaps compared to the dedicated ‘search’ players.
- Popular alternative: use another search solution, such as Solr, alongside MongoDB. This brings consistency issues and more.
Cassandra
• Data Model: Column-based data store.
• Distributed Model: Uses consistent hashing for distributed updates.
• Consistency: Timestamps for consistency.
• Search
- Lucandra: Lucene-based search.
- Solandra: Solr-based search.
Riak
• Implements principles from the Amazon Dynamo paper.
• Riak Search
- Distributed index and full-text search engine.
- Merge Index: Storage backend used by Riak Search. It’s a pure Erlang storage format that, among other things, draws on the Apache Lucene file format.
- Riak Solr: Adds a subset of Apache Solr HTTP capabilities to Riak Search.
• Yokozuna
- “next generation of Riak Search that marries Riak with Apache Solr”.
- Sits alongside Riak.
The story so far…
• Different approaches for:
- Data Model
- Distributed Update handling
- Consistency management
• Work reasonably well on different fronts as far as storage is concerned.
• Search:
- There’s barely anything native and in the core.
- (Almost) Everyone is trying to fuse together with Lucene/Solr.
Adding Search to NoSQL
• To begin with, wasn’t built for that
• Compromises
• Integration is the buzzword.
• Lucandra, Solandra… No strong contender yet.
Adding NoSQL to Search
• Already store documents
• With growing data, more intuitive for this to happen
• More intuitive = makes more sense = easier (perhaps)
• No key player as yet.
Apache Solr 4 at a glance
• Document Oriented NoSQL Search Server
- Data-format agnostic (JSON, XML, CSV, binary)
- Schema-less options (more coming soon)
• Distributed
- Multi-tenanted
• Fault Tolerant
- HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the “SolrCloud” architecture
SolrCloud Design Goals
• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency
SolrCloud
• Distributed Indexing designed from the ground up to accommodate desired features
• CAP Theorem
- Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
- Reality: Must handle P; the real choice is tradeoffs between C and A
• Ended up with a CP system (roughly)
- Value Consistency over Availability
- Eventual consistency is incompatible with optimistic concurrency
- Closest to MongoDB in architecture
• We still do well with Availability
- All N replicas of a shard must go down before we lose writability for that shard
- For a network partition, the “big” partition remains active (i.e. Availability isn’t “on” or “off”)
SolrCloud
[Figure: a query to http://.../solr/collection1/query?q=awesome is fanned out as load-balanced sub-requests to shard1 and shard2, each with replica1, replica2 and replica3; a ZooKeeper quorum of five ZK nodes sits alongside]

ZooKeeper tree:
/configs
  /myconf
    solrconfig.xml
    schema.xml
/clusterstate.json
/aliases.json
/livenodes
  server1:8983/solr
  server2:8983/solr
/collections
  /collection1 configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr

ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each collection
• Shards in each collection
• Replicas in each shard
• Collection aliases
Distributed Indexing
[Figure: two shards, Shard1 (Replica1, Replica2) and Shard2 (Replica3, Replica4), each with a leader and a non-leading replica; a document update arrives at http://.../solr/collection1/update]
• Update sent to any node
• Solr determines what shard the document is on, and forwards to shard leader
• Shard Leader versions document and forwards to all other shard replicas
• HA for updates (if one leader fails, another takes its place)
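The update path can be sketched as a toy model. This is a minimal in-memory illustration, not Solr code: the `Replica`, `Shard` and `route` names are made up for the example, the routing hash is a stand-in for Solr's real hashing, and versions are a simple counter.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Replica:
    name: str
    docs: dict = field(default_factory=dict)  # doc id -> (version, doc)


@dataclass
class Shard:
    replicas: list  # toy rule: the first replica in the list is the leader
    versions: count = field(default_factory=lambda: count(1))

    @property
    def leader(self):
        return self.replicas[0]

    def update(self, doc):
        # The leader assigns a version, then every replica stores the doc.
        version = next(self.versions)
        for replica in self.replicas:
            replica.docs[doc["id"]] = (version, doc)
        return version

    def fail_leader(self):
        # HA for updates: the next replica takes the failed leader's place.
        self.replicas.pop(0)


def route(shards, doc):
    # Stand-in for hash-based routing: any node can work out the owning shard.
    return shards[sum(doc["id"].encode()) % len(shards)]


shards = [Shard([Replica("Replica1"), Replica("Replica2")]),
          Shard([Replica("Replica3"), Replica("Replica4")])]
doc = {"id": "doc5", "title": "hello"}
shard = route(shards, doc)
v1 = shard.update(doc)
shard.fail_leader()
v2 = shard.update(doc)  # still writable with the new leader
```

The point of the sketch is that a write survives the loss of a leader because the document had already been forwarded to the other replica.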
Optimistic Concurrency
• Conditional update based on document version
[Figure: exchange between client and Solr]
1. /get document
2. Modify document, retaining _version_
3. /update resulting document
4. Go back to step #1 if fail code=409
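The four steps amount to a read-modify-write retry loop. Below is a rough sketch against an in-memory stand-in: `FakeSolr`, `VersionConflict` and `update_with_retry` are hypothetical names, and the version check is simplified to "the supplied _version_ must match the stored one", which is narrower than what a real Solr server does.

```python
class VersionConflict(Exception):
    """Plays the role of the HTTP 409 response."""
    code = 409


class FakeSolr:
    """In-memory stand-in for the /get and /update handlers."""

    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def update(self, doc):
        current = self._docs.get(doc["id"])
        if current is not None and doc.get("_version_") != current["_version_"]:
            raise VersionConflict(doc["id"])
        self._clock += 1
        self._docs[doc["id"]] = {**doc, "_version_": self._clock}
        return self._clock


def update_with_retry(solr, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = solr.get(doc_id)        # 1. /get document
        modify(doc)                   # 2. modify, retaining _version_
        try:
            return solr.update(doc)   # 3. /update resulting document
        except VersionConflict:
            continue                  # 4. back to step 1 on a 409
    raise RuntimeError("gave up after repeated version conflicts")


def bump(doc):
    doc["views"] += 1


solr = FakeSolr()
solr.update({"id": "doc5", "views": 0})
stale = solr.get("doc5")              # a second client reads the same version
update_with_retry(solr, "doc5", bump)
try:
    solr.update(stale)                # its write now carries an old _version_
    conflict = False
except VersionConflict:
    conflict = True
```

The stale write is rejected rather than silently overwriting the newer document, which is the behaviour the retry loop relies on.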
Distributed Query Requests
Distributed query across all shards in the collection
http://localhost:8983/solr/collection1/query?q=foo
Explicitly specify node addresses to load-balance across
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- Equivalent nodes are separated by “|”
- Different phases of the same distributed request use the same node
Specify logical shards to search across
shards=NY,NJ,CT
Specify multiple collections to search across
collection=collection1,collection2
public CloudSolrServer(String zkHost)
- ZK aware SolrJ Java client that load-balances across all nodes in cluster
- Calculates where a document belongs and sends it directly to the shard leader (new)
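As an illustration of how the `shards` parameter is put together, here is a small sketch using only the Python standard library. The helper names `shards_param` and `query_url` are invented for this example; the hosts and parameter values are the ones from the slide.

```python
from urllib.parse import urlencode


def shards_param(groups):
    # Equivalent nodes are joined with "|", distinct shards with ",".
    return ",".join("|".join(nodes) for nodes in groups)


def query_url(base, q, **extra):
    # safe="/:|," keeps the node list readable instead of percent-encoding it.
    return f"{base}/query?" + urlencode({"q": q, **extra}, safe="/:|,")


base = "http://localhost:8983/solr/collection1"
groups = [["localhost:8983/solr", "localhost:8900/solr"],
          ["localhost:7574/solr", "localhost:7500/solr"]]

explicit_nodes = query_url(base, "foo", shards=shards_param(groups))
logical_shards = query_url(base, "foo", shards="NY,NJ,CT")
multi_collection = query_url(base, "foo", collection="collection1,collection2")
```

Each inner list is one shard's set of interchangeable nodes, so the request can be load-balanced within a group while still covering every shard.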
Document Routing
[Figure: Hash Ring. With numShards=4 and router=compositeId, the 32-bit hash space is divided among shard1 to shard4 in four ranges: 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff]
Indexing: id = BigCo!doc5 → (MurmurHash3) → 9f27 3c71 → shard1
Querying: q=my_query&shard.keys=BigCo! → (hash) → 9f27 0000 to 9f27 ffff → shard1
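The routing arithmetic can be sketched as follows. The example does not compute MurmurHash3: `prefix_hash` and `doc_hash` are placeholder values chosen so that the relevant 16 bits match the slide (9f27 and 3c71). Only shard1's range is implied by the slide; the assignment of the other three ranges to shard2, shard3 and shard4 is an assumption.

```python
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),  # the slide's example lands here
    "shard2": (0xC0000000, 0xFFFFFFFF),  # assumed assignment
    "shard3": (0x00000000, 0x3FFFFFFF),  # assumed assignment
    "shard4": (0x40000000, 0x7FFFFFFF),  # assumed assignment
}


def composite_hash(prefix_hash, doc_hash):
    # Top 16 bits come from the prefix ("BigCo"), bottom 16 from the rest ("doc5").
    return (prefix_hash & 0xFFFF0000) | (doc_hash & 0x0000FFFF)


def prefix_range(prefix_hash):
    # Every id sharing the prefix falls inside this inclusive range.
    low = prefix_hash & 0xFFFF0000
    return low, low | 0x0000FFFF


def shard_for(hash_value):
    for shard, (low, high) in SHARD_RANGES.items():
        if low <= hash_value <= high:
            return shard
    raise ValueError(f"hash {hash_value:#x} is outside the 32-bit ring")


def shards_for_range(low, high):
    # A shard is needed if its range overlaps the queried range.
    return sorted(s for s, (lo, hi) in SHARD_RANGES.items()
                  if lo <= high and low <= hi)


prefix_hash = 0x9F27A1B2  # placeholder, only the top 16 bits matter
doc_hash = 0x5C4D3C71     # placeholder, only the bottom 16 bits matter

doc_position = composite_hash(prefix_hash, doc_hash)
target_shard = shard_for(doc_position)
query_low, query_high = prefix_range(prefix_hash)
query_shards = shards_for_range(query_low, query_high)
```

Because all documents with the same prefix share their top 16 bits, a query restricted to that prefix only has to visit the shards overlapping one narrow range, here a single shard.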
Durable Writes
• Lucene flushes writes to disk on a “commit”
- Uncommitted docs are lost on a crash (at Lucene level)
• Solr 4 maintains its own transaction log
- Contains uncommitted documents
- Services real-time get requests
- Recovery (log replay on restart)
- Supports distributed “peer sync”
• Writes forwarded to multiple shard replicas
- A replica can go away forever w/o collection data loss
- A replica can do a fast “peer sync” if it’s only slightly out of date
- A replica can do a full index replication (copy) from a leader.
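A toy model of the transaction log idea is shown below. It is a simplification under stated assumptions: the `Index` class is hypothetical, everything lives in memory, and the log is simply emptied on commit, which is not how a real log is managed.

```python
class Index:
    def __init__(self):
        self.committed = {}  # stands for segments flushed to disk on commit
        self.pending = {}    # in-memory only, lost on a crash
        self.tlog = []       # stands for the durable transaction log

    def add(self, doc):
        self.tlog.append(doc)          # logged before it is searchable
        self.pending[doc["id"]] = doc

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()

    def realtime_get(self, doc_id):
        # Uncommitted documents are served from the log, newest first.
        for doc in reversed(self.tlog):
            if doc["id"] == doc_id:
                return doc
        return self.committed.get(doc_id)

    def crash(self):
        self.pending.clear()           # the log and committed data survive

    def recover(self):
        for doc in self.tlog:          # log replay on restart
            self.pending[doc["id"]] = doc


idx = Index()
idx.add({"id": "a", "v": 1})
idx.commit()
idx.add({"id": "b", "v": 2})
rt = idx.realtime_get("b")             # visible before any commit
idx.crash()
lost_in_memory = "b" not in idx.pending
idx.recover()
idx.commit()
```

After the simulated crash the uncommitted document is gone from memory, but replaying the log brings it back so that the next commit includes it.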
Collections API
Create a new document collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
Actions: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
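Since every Collections API call is an action name plus query parameters, the URLs can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical `collections_url` helper and reusing only the parameters shown in this deck:

```python
from urllib.parse import urlencode


def collections_url(base, action, **params):
    # The action goes first, followed by its parameters in the order given.
    return f"{base}/admin/collections?" + urlencode({"action": action, **params})


base = "http://localhost:8983/solr"
create = collections_url(base, "CREATE", name="mycollection",
                         numShards=4, replicationFactor=3)
split = collections_url(base, "SPLITSHARD", collection="mycollection",
                        shard="Shard2")
```

Sending the request (for example with `urllib.request.urlopen(create)`) would need a running Solr node, so the sketch stops at building the URL.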
Solr 4.3: Seamless Online Shard Splitting
[Figure: Shard1, Shard2 and Shard3, each with a leader and a replica; updates to Shard2 are forwarded to the new sub-shards Shard2_0 and Shard2_1]
1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards created in “construction” state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
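Steps 2 to 5 can be walked through with a toy model. Everything here is an assumption made for illustration: the parent's hash range, the even split into two halves, documents keyed directly by their hash, and the `split_range` and `split_shard` helpers themselves.

```python
def split_range(low, high):
    # Split an inclusive hash range into two contiguous halves.
    mid = low + (high - low) // 2
    return (low, mid), (mid + 1, high)


def split_shard(name, hash_range, index, updates_during_split):
    # 2. New sub-shards are created in "construction" state.
    subs = [{"name": f"{name}_{i}", "range": r, "state": "construction",
             "buffer": [], "docs": {}}
            for i, r in enumerate(split_range(*hash_range))]

    def owner(h):
        return next(s for s in subs if s["range"][0] <= h <= s["range"][1])

    # 3. The leader forwards applicable updates; sub-shards only buffer them.
    for h, doc in updates_during_split:
        owner(h)["buffer"].append((h, doc))
    # 4. The leader's index is split and installed on the sub-shards.
    for h, doc in index.items():
        owner(h)["docs"][h] = doc
    # 5. Buffered updates are applied, then the sub-shards become active.
    for sub in subs:
        for h, doc in sub["buffer"]:
            sub["docs"][h] = doc
        sub["buffer"].clear()
        sub["state"] = "active"
    return subs


parent_range = (0xC0000000, 0xFFFFFFFF)  # hypothetical range for Shard2
index = {0xC0000001: "old-a", 0xF0000000: "old-b"}
updates = [(0xC0000001, "new-a"), (0xE0000005, "new-c")]
first, second = split_shard("Shard2", parent_range, index, updates)
```

Buffering matters because an update that arrives mid-split must be applied after the split index is installed, otherwise the older copy from the parent index would overwrite it.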
Solr 4.4: Schemaless
• “Schemaless” usually just means that the client(s) have an implicit schema.
• “No Schema” is impossible for anything based on Lucene
- A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
- Only pre-define types of fields, not fields themselves
- No guessing. Any field name ending in _i is an integer
• “Guessed Schema” or “Type Guessing”
- For previously unknown fields, guess using JSON type as a hint
- Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
- Lose ability to catch field naming errors
- Can’t optimize based on types
- Guessing incorrectly means having to start over
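The difference between dynamic fields and type guessing, and the cost of a wrong guess, can be shown with a small sketch. Only the `_i` suffix comes from the slide; the other suffixes, the type names and the `GuessedSchema` class are assumptions for illustration and do not reflect how Solr implements either feature.

```python
import json

# Only "_i" is from the slide; the remaining suffixes are assumed conventions.
DYNAMIC_SUFFIXES = {"_i": "int", "_s": "string", "_f": "float", "_b": "boolean"}


def dynamic_field_type(name):
    # Convention over configuration: the name alone decides the type.
    for suffix, field_type in DYNAMIC_SUFFIXES.items():
        if name.endswith(suffix):
            return field_type
    return None


def guess_type(value):
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


class GuessedSchema:
    def __init__(self):
        self.fields = {}

    def index(self, json_text):
        for name, value in json.loads(json_text).items():
            field_type = dynamic_field_type(name) or guess_type(value)
            known = self.fields.setdefault(name, field_type)
            if known != field_type:
                # A field must be indexed the same way across documents.
                raise TypeError(f"{name}: guessed {known}, now saw {field_type}")


schema = GuessedSchema()
schema.index('{"id": "1", "price": 10, "count_i": 3}')
try:
    schema.index('{"id": "2", "price": 9.99}')
    wrong_guess = False
except TypeError:
    wrong_guess = True
```

The first document makes `price` look like an integer type, so the second document's decimal price no longer fits, which is the "guessing incorrectly means having to start over" problem in miniature.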
Bangalore Apache Lucene/Solr Meetup
1 meetup already
Almost 150 members
Another one coming up soon…
Join us at: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
Twitter: @anshumgupta
LinkedIn: http://www.linkedin.com/in/anshumgupta
Blog: http://www.anshumgupta.net
Thanks!