Redis Indices (#RedisTLV)

1. Redis Indices 127.0.0.1:6379> CREATE INDEX _email ON user:*->email @itamarhaber / #RedisTLV / 22/9/2014

2. A Little About Myself A Redis Geek and Chief Developers Advocate at Redis Labs (redislabs.com). I write at http://redislabs.com/blog and edit the Redis Watch newsletter at http://redislabs.com/redis-watch-archive

3. Motivation
o Redis is a Key-Value datastore, so fetching by (primary) key is always fast
o Searching for keys is expensive: SCAN (or, god forbid, the "evil" KEYS command)
o Searching for values in keys requires a full (hash) table scan & sending the data to the client for processing

4. https://twitter.com/antirez/status/507082534513963009

5. antirez is Right
o Redis is a "database SDK"
o Indices imply some kind of schema (and there's none in Redis)
o Redis wasn't made for indexing
... But despite the Creator's humble opinion, sometimes you still need a fast way to search :)

6. So What is an Index? "A database index is a data structure that improves the speed of data retrieval operations" Wikipedia, 2014 Space-Time Tradeoff

7. What Can be Indexed? Data Index Key -> Value Value -> Key Values can be numbers or strings Can be derived from "opaque" values: JSONs, data structures (e.g. Hash), functions,

8. Index Operations Checklist
1. Create index from existing data
2. Update the index on
   a. Addition of new values
   b. Updates of existing values
   c. Deletion of keys (and also RENAME/MIGRATE)
3. Drop the index
4. If needed, do index housekeeping
5. Access keys using the index

9. A Simple Example: Reverse Lookup Assume the following database, where every user has a single unique email address: HMSET user:1 id "1" email "[email protected]" How would you go about efficiently fetching the user's ID given an email address?

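10. Reverse Lookup (Pseudo) Recipe

    import redis  # assumption: the redis-py client and a local server
    r = redis.Redis(decode_responses=True)

    def idxEmailAdd(email, id):  # 2.a
        # SETNX only sets the key if it does not exist yet
        if not r.setnx("_email:" + email, id):
            raise Exception("INDEX_EXISTS")

    def idxEmailCreate():  # 1
        for u in r.scan_iter(match="user:*"):
            id, email = r.hmget(u, "id", "email")
            idxEmailAdd(email, id)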

11. Reverse Lookup Recipe, more admin

    def idxEmailDel(email):  # 2.c
        r.delete("_email:" + email)

    def idxEmailUpdate(old, new, id):  # 2.b
        idxEmailDel(old)
        idxEmailAdd(new, id)

    def idxEmailDrop(): ...  # similar to Create

12. Reverse Lookup Recipe, integration

    def addUser(json):
        ...
        idxEmailAdd(email, id)
        ...

    def updateUser(json):
        ...

13. Reverse Lookup Recipe, usage

    def getUser(id):
        return r.hgetall("user:" + id)

    def getUserByEmail(email):  # 5
        return getUser(r.get("_email:" + email))

TA-DA!
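
The whole recipe can be exercised end to end without a server. What follows is only a sketch: TinyStore is a hypothetical in-memory stand-in (not a real library) that mimics just the handful of commands the recipe needs, and the email addresses are made up. With a real client the same calls would go to Redis.

```python
class TinyStore:
    """Hypothetical in-memory stand-in for the few Redis commands used here."""

    def __init__(self):
        self.data = {}

    def setnx(self, key, value):
        # Like SETNX: set only if the key does not exist yet.
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def get(self, key):
        return self.data.get(key)

    def delete(self, key):
        return 1 if self.data.pop(key, None) is not None else 0

    def hset(self, key, mapping):
        self.data.setdefault(key, {}).update(mapping)

    def hgetall(self, key):
        return dict(self.data.get(key, {}))


r = TinyStore()


def idx_email_add(email, user_id):            # 2.a
    if not r.setnx("_email:" + email, user_id):
        raise ValueError("INDEX_EXISTS")


def idx_email_del(email):                     # 2.c
    r.delete("_email:" + email)


def idx_email_update(old, new, user_id):      # 2.b
    idx_email_del(old)
    idx_email_add(new, user_id)


def add_user(user_id, email):
    r.hset("user:" + user_id, {"id": user_id, "email": email})
    idx_email_add(email, user_id)


def get_user_by_email(email):                 # 5
    user_id = r.get("_email:" + email)
    return r.hgetall("user:" + user_id) if user_id is not None else None


add_user("1", "alice@example.com")
print(get_user_by_email("alice@example.com"))
```

Note that the update is two separate steps: if adding the new value fails, the old entry is already gone, so real code would have to check first or roll back.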

14. Reverse Lookup Recipe, Analysis
Asymptotic computational complexity:
o Creating the index: O(N), N is no. of values
o Adding a new value to the index: O(1)
o Deleting a value from the index: O(1)
o Updating a value: O(1) + O(1) = O(1)
o Deleting the index: O(N), N is no. of values
What about memory? Every key in Redis takes up some extra space...

15. Hash Index _email = { "[email protected]": 1, "[email protected]": 2 ... } Small lookups (e.g. countries) single key Big lookups partitioned to "buckets" (e.g. by email address hash value) More info: http://redis.io/topics/memory-optimization
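
One possible way to partition a big lookup into buckets is sketched here. The bucket count, the choice of CRC32 and the key naming (_email:<bucket number>) are all assumptions made for illustration; the slide only says to bucket by a hash of the email address.

```python
import zlib

NUM_BUCKETS = 1024  # assumed bucket count; tune it so that each Hash stays small


def bucket_key(email):
    """Return the name of the Hash that would hold this email's index entry."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(email.encode("utf-8")) % NUM_BUCKETS
    return "_email:%d" % bucket


key = bucket_key("alice@example.com")
# The entry would then be written with: HSET <key> <email> <id>
# and read back with:                   HGET <key> <email>
print(key)
```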

16. Always Remember That You Are Absolutely Unique (Just Like Everyone Else)

17. Uniqueness The lookup recipe makes the assumption that every user has a single email address and that it's unique (i.e. 1:1 relationship). What happens if several keys (users) have the same indexed value (email)?

18. Non-Uniqueness with Lists
Use Lists instead of Redis' strings/hashes. To add:
    r.lpush("_email:" + email, id)  # 2.a
Simple. What about accessing the list for writes or reads? Naturally, getting all of the list's members is O(N) but...

19. What?!? WTF do you mean O(N)?!? Because a Redis List is essentially a linked list, traversing it requires up to N operations (LINDEX, LRANGE). That means that updates & deletes are O(N) Conclusion: suitable when N (i.e. number of duplicate index entries) is smallish (e.g. < 10)

20. OT: A Tip for Traversing Lists Lists don't have LSCAN, but with RPOPLPUSH you easily can do a circular list pattern and go over all the members in O(N) w/o copying the entire list. More at: http://redis.io/commands/rpoplpush
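
The circular list pattern can be illustrated without a server. In this sketch a Python deque stands in for the Redis List (an assumption made so the example runs anywhere), and rpoplpush_same mimics calling RPOPLPUSH with the same key as source and destination.

```python
from collections import deque


def rpoplpush_same(lst):
    """Mimic `RPOPLPUSH key key`: move the tail to the head and return it."""
    item = lst.pop()
    lst.appendleft(item)
    return item


ids = deque(["1", "2", "3"])  # stand-in for a List such as _email:<address>

# One call per member visits every element exactly once...
visited = [rpoplpush_same(ids) for _ in range(len(ids))]
print(visited)

# ...and leaves the list in its original order afterwards.
print(list(ids))
```

Against a live server the number of iterations would come from LLEN, and the list may change while it is being walked, so the traversal is best-effort.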

21. Back to Non-Uniqueness - Hashes
Use Hashes to store multiple index values, one field per key (the field's value is unused):
    r.hset("_email:" + email, id, "")  # 2.a
Great - still O(1). How about deleting?
    r.hdel("_email:" + email, id)  # 2.c
Another O(1).

22. Non-Uniqueness, Sets Variant
    r.sadd("_email:" + email, id)  # 2.a
Great - still O(1). How about deleting?
    r.srem("_email:" + email, id)  # 2.c
Another O(1).

23. List vs. Hash vs. Set for NUIVs*
* Non-Unique Index Value
Memory: List ~= Set ~= Hash (N < 100)
Performance: List < Set, Hash
Unlike a List's elements, Set members and Hash fields are:
o Unique - meaning you can't index the same key more than once (makes sense)
o Unordered - a non-issue for this type of index
o SCANable
Forget Lists, use Sets or Hashes.

24. Forget Hashes, Sets are Better Because of the Set operations: SUNION, SDIFF, SINTER Endless possibilities, including matchmaking: SINTER _interest:devops _hair:blond _gender:...
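
To make the intersection idea concrete, here is a small simulation with Python sets standing in for Redis Sets. The index contents and the _city:tlv key are hypothetical; sinter only mimics what SINTER would return on the server.

```python
# Each index key maps to the set of user ids that carry that attribute value.
index = {
    "_interest:devops": {"1", "2", "5"},
    "_hair:blond": {"2", "3", "5"},
    "_city:tlv": {"5", "7"},  # hypothetical extra index
}


def sinter(*keys):
    """Mimic SINTER: members present in every named Set (a missing key is empty)."""
    sets = [index.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()


print(sorted(sinter("_interest:devops", "_hair:blond")))
print(sorted(sinter("_interest:devops", "_hair:blond", "_city:tlv")))
```

Each additional attribute narrows the result, which is why one Set per indexed value composes so well.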

25. [This Slide has No Title] NULL means no value and Redis is all about values. When needed, arbitrarily decide on a value for NULLs (e.g. "") and handle it appropriately in code.

26. Index Cardinality (~= unique values) High cardinality/no duplicates -> use a Hash Some duplicates -> use Hash and "pointers" to Sets _email = { "[email protected]": 1, "[email protected]": "*" ...} _email:[email protected] = { 2, 3 } Low cardinality is, however, another story...
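
The Hash-plus-pointers layout could work roughly as follows. This is a sketch under assumptions: plain dicts stand in for the _email Hash and the per-value Sets, the function names are invented, and ids are assumed never to equal the "*" marker.

```python
email_index = {}  # stand-in for the Hash _email
email_sets = {}   # stand-in for the Sets _email:<value>


def index_add(value, key_id):
    current = email_index.get(value)
    if current is None:
        email_index[value] = key_id            # first occurrence: store the id inline
    elif current == "*":
        email_sets[value].add(key_id)          # already promoted to a Set
    elif current != key_id:
        email_sets[value] = {current, key_id}  # second distinct id: promote
        email_index[value] = "*"


def index_lookup(value):
    current = email_index.get(value)
    if current is None:
        return set()
    return set(email_sets[value]) if current == "*" else {current}


index_add("alice@example.com", "1")
index_add("shared@example.com", "2")
index_add("shared@example.com", "3")
print(index_lookup("alice@example.com"), index_lookup("shared@example.com"))
```

Unique values cost a single Hash field, and only the duplicated ones pay for a separate Set.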

27. Low Cardinality
When an indexed attribute has a small number of possible values (e.g. Boolean, gender...):
o If the distribution of values is 50:50, consider not indexing it at all
o If the distribution is heavily unbalanced (5:95), index only the smaller subset and full scan the rest
o Use a bitmap index if possible

28. Bitmap Index
Assumption: key names are ordered
How: a Bitset where a bit's position maps to a key and the bit's value is the indexed value:
_isLoggedIn = /100/
o first bit -> dfucbitz is online
o second bit -> foo isn't logged in

29. Bitmap Index, cont.
More than 2 values? Use n Bitsets, where n is the number of possible indexed values, e.g.:
_isFromTerah = /100.../
_isFromEarth = /010.../
Bonus: BITOP AND / OR / XOR / NOT
BITOP NOT _ET _isFromEarth
BITOP AND onlineET _isLoggedIn _ET
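
The two BITOP commands can be traced by hand with a small simulation. Here bytearrays stand in for Redis strings, and setbit, getbit, bitop_and and bitop_not are illustrative helpers that mimic SETBIT, GETBIT and BITOP; they are not a client API. As in Redis, bit 0 is the most significant bit of the first byte.

```python
def setbit(bitmap, offset, value):
    """Mimic SETBIT on a bytearray, growing it with zero bytes as needed."""
    byte, bit = divmod(offset, 8)
    if byte >= len(bitmap):
        bitmap.extend(b"\x00" * (byte - len(bitmap) + 1))
    mask = 0x80 >> bit
    if value:
        bitmap[byte] |= mask
    else:
        bitmap[byte] &= ~mask & 0xFF


def getbit(bitmap, offset):
    byte, bit = divmod(offset, 8)
    return (bitmap[byte] >> (7 - bit)) & 1 if byte < len(bitmap) else 0


def bitop_and(a, b):
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\x00"), b.ljust(n, b"\x00")  # shorter input is zero-padded
    return bytearray(x & y for x, y in zip(a, b))


def bitop_not(a):
    return bytearray(~x & 0xFF for x in a)


is_logged_in = bytearray()
setbit(is_logged_in, 0, 1)       # _isLoggedIn = /100/
is_from_earth = bytearray()
setbit(is_from_earth, 1, 1)      # _isFromEarth = /010/

et = bitop_not(is_from_earth)                # BITOP NOT _ET _isFromEarth
online_et = bitop_and(is_logged_in, et)      # BITOP AND onlineET _isLoggedIn _ET
print([getbit(online_et, i) for i in range(3)])
```

NOT also flips the padding bits past the last real key, so a NOT result is best combined with AND against a bitmap that only has bits set for existing keys.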

30. Interlude: Redis Indices Save Space Consider the following: in a relational database you need "x2" space: for the indexed data (stored in a table) and for the index itself. With most Redis indices, you don't have to store the indexed data -> space saved :)

31. Numerical Ranges with Sorted Sets Numerical values, including timestamps (epoch), are trivially indexed with a Sorted Set: ZADD _yearOfBirth 1972 "1" 1961 "2"... ZADD _lastLogin 1411245569 "1" Use ZRANGEBYSCORE and ZREVRANGEBYSCORE for range queries

32. Ordered "Composite" Numerical Indices Use Sorted Sets scores that are constructed by the sort (range) order. Store two values in one score using the integer and fractional parts: user:1 = { "id": "1", "weightKg": "82", "heightCm": "218", ... } score = weightKg + ( heightCm / 1000 )

33. "Composite" Numerical Indices, cont.
For more "complex" sorts (up to 53 bits of precision), you can construct the score like so:
user:1 = { "id": "1", "weightKg": "82", "heightCm": "218", "IQ": "100", ... }
score = weightKg * 1000000 + heightCm * 1000 + IQ
Adapted from: http://www.dr-josiah.com/2013/10/multi-column-sql-like-sorting-in-redis.html
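
The score construction is easy to check in code. This sketch packs and unpacks the three fields from the slide; the function names and the 0..999 limit on the trailing fields are assumptions that follow from giving each of them three decimal digits.

```python
def composite_score(weight_kg, height_cm, iq):
    """Pack three small non-negative integers into one sortable score."""
    # Each trailing field gets three decimal digits, so it must stay in 0..999.
    if not (0 <= height_cm < 1000 and 0 <= iq < 1000):
        raise ValueError("height_cm and iq must be in 0..999")
    score = weight_kg * 1_000_000 + height_cm * 1_000 + iq
    # Sorted Set scores are doubles: integers are exact only below 2**53.
    if score >= 2 ** 53:
        raise ValueError("score no longer fits a double exactly")
    return score


def split_score(score):
    """Recover the original fields from a composite score."""
    weight_kg, rest = divmod(int(score), 1_000_000)
    height_cm, iq = divmod(rest, 1_000)
    return weight_kg, height_cm, iq


score = composite_score(82, 218, 100)
print(score, split_score(score))
```

Sorting by this score orders users by weight first, then height, then IQ, so a single ZRANGEBYSCORE can express a range over the leading field.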

34. Full Text Search (Almost) (v2.8.9+) ZRANGEBYLEX on Sorted Set members that have the same score is handy for prefix searches (trailing wildcard), e.g. dfuc*, a-la autocomplete: http://autocomplete.redis.io/ Tip: by storing the reversed string (gnirts) you can also do suffix searches (leading wildcard), e.g. *terah.net, just as easily.
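
The reversed-string tip can be tried out with a sorted Python list standing in for a Sorted Set whose members all share one score. lex_prefix is an illustrative helper that approximates a ZRANGEBYLEX range query; Redis compares raw bytes while Python compares code points, which makes no difference for the made-up ASCII addresses used here.

```python
import bisect


def lex_prefix(sorted_members, prefix):
    """Approximate a ZRANGEBYLEX range that covers every member starting with prefix."""
    lo = bisect.bisect_left(sorted_members, prefix)
    hi = bisect.bisect_left(sorted_members, prefix + chr(0x10FFFF))
    return sorted_members[lo:hi]


emails = ["bob@example.com", "carol@terah.net", "dave@terah.net"]

# Index the reversed strings so that a suffix query becomes a prefix query.
reversed_index = sorted(e[::-1] for e in emails)


def ends_with(suffix):
    matches = lex_prefix(reversed_index, suffix[::-1])
    return sorted(m[::-1] for m in matches)


print(ends_with("terah.net"))
```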

35. Another Nice Thing With Sorted Sets
By combining the use of two of these, it is possible to map ranges to keys (or just data). For example, which range does 5 fall in?
ZADD min 1 "low" 4 "medium" 7 "high"
ZADD max 3 "low" 6 "medium" 9 "high"
ZREVRANGEBYSCORE min 5 -inf LIMIT 0 1
ZRANGEBYSCORE max 5 +inf LIMIT 0 1
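
The two lookups can be simulated with sorted lists and bisect to see why they answer the question. This is a sketch: classify is an invented helper, and treating a value as "in range" only when both lookups agree is an interpretation of the pattern rather than something the slide spells out.

```python
import bisect

mins = [(1, "low"), (4, "medium"), (7, "high")]  # ZADD min ...
maxs = [(3, "low"), (6, "medium"), (9, "high")]  # ZADD max ...


def classify(value):
    """Return the range name containing value, or None if no range contains it."""
    min_scores = [score for score, _ in mins]
    max_scores = [score for score, _ in maxs]
    # Largest lower bound <= value (like ZREVRANGEBYSCORE min <value> -inf LIMIT 0 1).
    i = bisect.bisect_right(min_scores, value) - 1
    # Smallest upper bound >= value (like ZRANGEBYSCORE max <value> +inf LIMIT 0 1).
    j = bisect.bisect_left(max_scores, value)
    if i < 0 or j >= len(maxs):
        return None
    by_min, by_max = mins[i][1], maxs[j][1]
    return by_min if by_min == by_max else None


print(classify(5))
print(classify(3.5))
```

A value that falls in a gap between ranges makes the two lookups disagree, which is how the second Sorted Set earns its keep.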

36. Binary Trees Everybody knows that binary trees are really useful for searching and other stuff. You can store a binary tree as an array in a Sorted Set: (Happy 80th Birthday!)
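
One common array layout for a binary tree, assumed here because the slide's picture did not survive, keeps the root at index 0 and the children of node i at 2i+1 and 2i+2. In a Sorted Set the index would be the score and the node value the member, so members must be unique. The sketch uses a plain list in place of the Sorted Set.

```python
def children(i):
    """Array indices of the left and right child of node i."""
    return 2 * i + 1, 2 * i + 2


def parent(i):
    return (i - 1) // 2 if i > 0 else None


# Level-order layout of a small binary search tree (hypothetical values).
# On a server this could be: ZADD _tree 0 "8" 1 "3" 2 "10" 3 "1" 4 "6" 5 "9" 6 "14"
tree = [8, 3, 10, 1, 6, 9, 14]


def contains(tree, target):
    """Walk down from the root; each step would be one score lookup on the server."""
    i = 0
    while i < len(tree):
        if target == tree[i]:
            return True
        left, right = children(i)
        i = left if target < tree[i] else right
    return False


print(contains(tree, 6), contains(tree, 7))
```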

37. Why stop at binary trees? BTrees! @thinkingfish from Twitter explained that they took the BSD implementation of BTrees and welded it into Redis (open source rulez!). This allows them to do efficient (speed-wise, not memory) key and range lookups. http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html

38. Index Atomicity & Consistency
In a relational database the index is (hopefully) always in sync with the data. You can strive for that in Redis, but:
o Your code will be much more complex
o Performance will suffer
o There will be bugs/edge cases/extreme uses

39. The Opposite of Atomicity & Consistency
On the other extreme, you could consider implementing indexing with:
o A periodic process (lazy indexing)
o A Producer/Consumer pattern (i.e. queue)
o Keyspace notifications
You won't have any guarantees, but you'll be offloading the index creation from the app.
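
A minimal sketch of the producer/consumer variant, assuming in-memory stand-ins for everything: dicts replace the user Hashes and the index keys, and a deque replaces the Redis List that would serve as the queue. All names are illustrative.

```python
from collections import deque

users = {}         # stand-in for the user:* Hashes
email_index = {}   # stand-in for the _email:* keys
pending = deque()  # stand-in for a Redis List used as a work queue


def add_user(user_id, email):
    """Producer: write the data and enqueue the index work instead of doing it."""
    users[user_id] = {"id": user_id, "email": email}
    pending.append((user_id, email))


def run_indexer():
    """Consumer: drain the queue and apply the index updates."""
    processed = 0
    while pending:
        user_id, email = pending.popleft()
        email_index[email] = user_id
        processed += 1
    return processed


add_user("1", "alice@example.com")
print("alice@example.com" in email_index)  # the index lags behind the data
run_indexer()
print(email_index["alice@example.com"])
```

The window between the write and the indexer run is exactly the missing guarantee: lookups during it miss the new user.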

40. Indices, Lua & Clustering Server-side scripting is an obvious consideration for implementing a lot (if not all) of the indexing logic. But ... in a cluster setup, a script runs on a single shard and can only access the keys there -> no guarantee that a key and an index are on the same shard.

41. Don't Think Copy-Paste! For even more "inspiration" you can review the source code of popular ORM libraries for Redis, for example: https://github.com/josiahcarlson/rom https://github.com/yohanboniface/redis-limpyd
