
  • MEAP Edition Manning Early Access Program

    Elasticsearch in Action Version 8

    Copyright 2014 Manning Publications

    For more information on this and other Manning titles go to www.manning.com

    Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.

    http://www.manning-sandbox.com/forum.jspa?forumID=871

  • Welcome

Thanks for purchasing the MEAP for Elasticsearch in Action. We're excited to see the book reach this stage, and we're looking forward to its continued development and eventual release. This is an intermediate-level book, designed for anyone writing applications using Elasticsearch or responsible for managing Elasticsearch in a production environment.

    We've worked to make the content both approachable and meaningful, and to explain not just how to do things with Elasticsearch, but also why things are done the way they are. We feel it's important to know Elasticsearch's foundations prior to diving into all of the different ways you can leverage it.

We're releasing the first five chapters to start. Chapter 1 explains what Elasticsearch is: what a typical use case looks like, the tasks it's good at, and the challenges it faces. By the end of chapter 1, you'll know whether Elasticsearch is likely to be a good fit for you and what you can expect from it. You will also learn various ways to install it.

    Chapter 2 is about getting your hands dirty with Elasticsearch's functionality. You will understand how data is organized, both logically and physically. By the end, you'll know how to index and search for documents, as well as how to configure Elasticsearch.

Chapter 3 is all about indexing, updating and deleting documents. You will understand the types of fields you can have in your documents, all of which goes into your mapping definition, as well as how updating and deleting data works under the covers.

Chapter 4 is all about searching. You will understand most types of queries and when to use each. You will also get a clear idea of the difference between queries and filters: how they work and where you'd use each.

Chapter 5 explains analysis, which is a big part of how Elasticsearch makes your searches fast and returns the right results in the right order. Analysis is the process of making tokens out of your text, and by the end you'll understand how it works and how to configure it for your use case.

    Looking ahead in part 2, other chapters will deal with more advanced functionality: how to make searches more relevant where you'll discover some more query types; how to work with relational data, and how to extract statistics from your data in real-time, using aggregations. We'll also show how to make Elasticsearch perform to your production standards: we'll talk about scaling out, tuning indexing and search performance, as well as administering your cluster.

As you're reading, we hope you'll take advantage of the Author Online forum. We'll be reading your comments and responding to them. We appreciate any feedback, as it's very helpful in the development process. Thanks again!

    Matthew Lee Hinman and Radu Gheorghe


  • brief contents

    PART 1: CORE ELASTICSEARCH FUNCTIONALITY

    1 Introducing Elasticsearch

    2 Diving into the functionality

    3 Indexing, updating and deleting data

    4 Searching your data

    5 Analyzing your data

    6 Searching with relevancy

    7 Relations among documents

    8 Exploring data with Aggregation

    PART 2: ADVANCED FUNCTIONALITY

    9 Scaling out

    10 Indexing and searching fast

    11 Administering your cluster

    Appendix A Working with geo-spatial data


  • 1 Introducing Elasticsearch

    This chapter covers

What search engines are, and what issues they address
How Elasticsearch fits in the context of search engines
Typical scenarios for using Elasticsearch
Features Elasticsearch provides
Getting and installing Elasticsearch

    We use search everywhere these days. And that's a good thing, because search helps you finish tasks quickly and easily. Whether you're buying something from an online shop or visiting a blog, you expect to have a search box somewhere to help you find what you're looking for, without scanning the whole website. Maybe it's just me, but when I wake up in the morning, I wish I could just enter the kitchen and type in bowl somewhere and have my favorite bowl highlighted.

We've also come to expect a lot from those search boxes. I don't want to have to type the whole word bowl; I expect the search box to come with suggestions. And I don't want results and suggestions to come at me in some random order: I want the search to be smart and give me the most relevant results first, guessing what I want if that's possible. For example, if I'm on an online shop searching for laptop, and I have to scroll through laptop accessories to get to an actual laptop, I'm likely to go somewhere else after the first page.

And this is not only because we're in a hurry and spoiled with good search interfaces; it's also because there's increasingly more stuff to choose from. A while ago, a friend asked me to help her buy a new laptop. I couldn't just type best laptop for my friend in the search box of an online store with thousands of laptops. Good keyword search is often not enough: you need some aggregated data, so you can narrow the results down to what you're interested in. I could select the size of the screen, the price range and so on, until I had to choose between five laptops or so.

Finally, there's the matter of performance, because nobody wants to wait. I've seen websites where you search for something and get the results in a few minutes. Minutes! For a search.

    If you want to provide search for some of your data, you'll have to deal with what we've just talked about: return relevant results to searches, return some statistics, and do all that quickly. And this is where search engines come into play, because they're built exactly to meet those challenges.

    You can deploy a search engine on top of a relational database, to create indexes and speed up the SQL queries. Or, you can index data from your NoSQL data-store, to add search capabilities there. You can do that with Elasticsearch, and it works especially well with document-oriented stores like MongoDB, because data is represented in Elasticsearch as documents, too.

Modern search engines like Elasticsearch also do a good job of storing your data, so you can use it as a NoSQL data store with powerful search capabilities. Elasticsearch is open-source and distributed, and it's built on top of Apache Lucene, an open-source search engine library that allows you to implement search functionality in your own Java application. Elasticsearch takes this functionality and extends it to make storing, indexing and searching faster, easier and, as the name suggests, elastic. And your application doesn't need to be written in Java to work with Elasticsearch: you can send data over HTTP, in JSON, to index, search and manage your Elasticsearch cluster.

We'll expand on these features in the rest of this chapter, and you'll learn to use them throughout this book. If any of these topics appears overwhelming, don't worry: we'll take each one and explain it thoroughly in the following chapters. First, let's take a closer look at the challenges search engines are typically confronted with, and at Elasticsearch's approach to solving them.

1.1 Elasticsearch as a search engine

One of the main questions you might have at this point is how Elasticsearch fits your use case. To get a better idea, let's look at an example that illustrates how Elasticsearch works. Imagine yourself working on a website that hosts lots of blogs, and you want to let users search for posts across the whole site. The first thing you need to do is make keyword searches work: for example, if someone searches for elections, you'd better return all posts containing that word.

A search engine will do that for you, but for a good search experience you'll need more than that: results need to come back quickly, and they need to be relevant. It's also nice to have features that help users search even when they don't know the exact words for what they're looking for: detecting typos, providing suggestions and breaking down results into categories are some of those features.


TIP In most of this chapter, you'll get an overview of Elasticsearch's features. If you want to get practical and install it right away, skip ahead to section 1.5. You'll probably find the installation procedure surprisingly easy, and you can always come back to the rest of this chapter for the high-level overview.

1.1.1 Searches need to be quick

If you have a huge number of posts on your site, searching through all of them for the word elections can take a long time, and you don't want your users to wait. That's where Elasticsearch helps, because it uses Lucene's high-performance search engine library to index all your data by default.

An index is a data structure you create along with your data, meant to allow faster searches. You can add indexes to fields in pretty much any database, and there are several ways to do it. Lucene does it with inverted indexing: it creates a data structure that keeps a list of where each word appears. For example, an inverted index for blog tags might look like the following table.

    Table 1.1 Inverted index for blog tags

Raw data                        Index data
Blog Post ID   Tags             Tags        Blog Post IDs
1              elections        elections   1, 3
2              peace            peace       2, 3, 4
3              elections, peace
4              peace

If you now want to search for blog posts with the elections tag, it's much faster to look at the index than to scan each word of each blog post: you only look at the entry where the tag is elections, and you get all the corresponding blog posts. This speed gain makes a lot of sense in the context of a search engine, because in the real world you rarely search for just one word. For example, searching for Elasticsearch in Action implies three word lookups, so the speed gain multiplies accordingly.
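To make the lookup concrete, here's a minimal sketch in Python (not Elasticsearch code) that builds and queries the inverted index from table 1.1:

```python
# Toy inverted index over the blog tags from table 1.1.
posts = {
    1: ["elections"],
    2: ["peace"],
    3: ["elections", "peace"],
    4: ["peace"],
}

# Build the index: map each tag to the IDs of the posts carrying it.
inverted = {}
for post_id, tags in posts.items():
    for tag in tags:
        inverted.setdefault(tag, []).append(post_id)

# A lookup is now a single dictionary access
# instead of a scan through every post.
print(inverted["elections"])  # [1, 3]
print(inverted["peace"])      # [2, 3, 4]
```

Lucene's actual index is far more elaborate (it stores positions, frequencies and so on), but the core idea is the same: the word is the key, and the list of matching documents is the value.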

    All this may seem a bit complex at this point, but we'll discuss indexing in great detail in chapter 3 and searching in chapter 4.

An inverted index is appropriate for a search engine when it comes to relevance, too. For example, when you're looking up a word like peace, not only will you see which documents match, but you get the number of matching documents for free. This is important, because if a word occurs in most documents, it's probably less relevant. Let's say you search for Elasticsearch in Action, and a document contains the word in, as do a million other documents. At this point you know that in is a common word, and the fact that this document matched doesn't say much about how relevant it is to your search. Instead, if it contains Elasticsearch, along with a hundred others, you know you're getting closer. Although it's not you who has to know that: Elasticsearch does it for you. You'll learn all about tuning it for relevancy in chapter 7.

That said, there is a tradeoff for this improved search performance and relevancy. The index takes up disk space, and adding new blog posts is slower, because you also have to update the index after adding the data itself. On the upside, you can do a lot of tuning to make Elasticsearch faster, both when it comes to indexing and when it comes to searching. We'll discuss these topics in great detail in chapters 12 and 13, respectively.

1.1.2 Results have to be relevant

Then there's the hard part: how do you make the blog posts that actually are about elections appear before the ones that merely contain the word? With Elasticsearch, you have a few algorithms for calculating the relevancy score, which is used by default to sort the results. The relevancy score is a number associated with each document matching your search criteria, indicating how relevant the given document is to those criteria. For example, if a blog post contains elections more times than another, it's more likely to be about elections.

By default, the algorithm used is tf-idf. We'll discuss scoring and tf-idf more in chapters 4 and 7, which are about searching and relevancy, but here's the basic idea: tf-idf stands for term frequency-inverse document frequency, and the relevancy score is influenced by two factors. Term frequency means the more often the words you're looking for appear in a document, the higher the score. Inverse document frequency means the weight of each word is higher if the word is uncommon across other documents. For example, if you're looking for bicycle race on a cyclist's blog, the word bicycle will count much less toward the score than race. But the more times either word appears in a document, the higher that document's score.
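The intuition can be sketched in a few lines of Python. This is not Lucene's exact scoring formula, which adds smoothing and normalization; it's only the bare tf-idf idea, applied to a made-up four-document corpus:

```python
import math

# Toy corpus from a cyclist's blog: "bicycle" shows up in most
# documents, while "race" is comparatively rare.
docs = [
    "bicycle race in the mountains",
    "my bicycle maintenance tips",
    "a new bicycle review",
    "race results and photos",
]

def tf_idf(term, doc, corpus):
    tf = doc.split().count(term)                  # term frequency in this doc
    df = sum(term in d.split() for d in corpus)   # how many docs contain it
    idf = math.log(len(corpus) / df)              # rarer term -> higher weight
    return tf * idf

doc = docs[0]  # contains both words once
print(tf_idf("bicycle", doc, docs))  # small: "bicycle" is in 3 of 4 docs
print(tf_idf("race", doc, docs))     # larger: "race" is in only 2 of 4 docs
```

Even with both terms appearing once in the same document, the rare term contributes more to the score, which is exactly the bicycle race example above.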

Besides choosing the algorithm, you have lots of ways built into Elasticsearch to influence the relevancy score to suit your needs. For example, you can boost the score of a particular field, like the title of a post, to be more important than the body. This gives higher scores to documents that match your search criteria in the title, compared to similar documents that match only in the body. You can make exact matches count more than partial matches, and you can even use a script to add custom criteria to the way the score is calculated. For example, if you let users like posts, you can boost the score based on the number of likes, or you can make newer posts score higher than similar, older posts. You don't have to worry about any of this right now; we'll discuss relevancy in great detail in chapter 7. This chapter is only about what you can do with Elasticsearch and when you'd want to use those features.


1.1.3 Searching beyond exact matches

Finally, with Elasticsearch you have a lot of options that make your searches intuitive and go beyond exact matches of what the user types in. That's good when the user makes a typo, or uses a synonym or a word derived from what you have stored. It's also good when the user doesn't know exactly what to search for in the first place.

TYPOS

For the first situation, you can configure Elasticsearch to be tolerant of variations instead of looking only for exact matches. A fuzzy query can be used so that a search for the typo bicycel will match a blog post about bicycles. We'll explore the fuzzy query and other features that make your searches relevant in chapter 7.
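To give a flavor of what such a request looks like, here's a sketch of a fuzzy query body, built as a Python dict and serialized to the JSON you'd send over HTTP. The message field is just the example field used elsewhere in this chapter, not anything Elasticsearch requires:

```python
import json

# Body of a fuzzy search request. "message" is a hypothetical field
# name from this chapter's log example.
query = {
    "query": {
        "fuzzy": {
            "message": "bicycel"   # the typo can still match "bicycle"
        }
    }
}

# This is the JSON payload an HTTP client would send to a search endpoint.
body = json.dumps(query, indent=2)
print(body)
```

How tolerant the match is can be tuned with additional parameters, which we'll get to when we cover the fuzzy query properly.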

DERIVATIVES

You can also use analysis, which will be covered in great detail in chapter 5, to make Elasticsearch understand that a blog post with bicycle in its title should also match queries that mention bicyclist or cycling.

STATISTICS

When users don't know what to search for, a number of things can help. One of them is to present some statistics via facets, which we'll cover in chapter 6. Imagine users entering your blog and seeing, on the right-hand side, a list of popular topics. One of them may be cycling; those interested in cycling would click it to narrow the results down. You might then have another facet that separates cycling posts into bicycle reviews, cycling events and so on.

SUGGESTIONS

Once users start typing, you can help them discover popular searches and popular results. You can use suggestions to predict their searches as they type, like most web search engines do. You can also show some popular results as they type, with special query types that match prefixes, wildcards or regular expressions.

    We'll discuss all these features: most types of searches will be covered in chapter 4, and chapter 7 is all about relevance. For now, let's look at how Elasticsearch is typically used in production.

1.2 Typical setups using Elasticsearch

We've already established that Elasticsearch is good for storing and indexing your data, in order to get quick and relevant results for your searches. But in the end it's just a search engine, and you'll never use it on its own. Like any other data store, you'll need a way to feed data into it, and you'll probably need some sort of interface for the users searching that data.

To get an idea of how Elasticsearch might fit into a bigger system, let's talk about three typical scenarios:

As the primary back end for your website. Like we've discussed, you may have a website that allows people to write blog posts, and you also want to search through them. You can use Elasticsearch to store all the data related to these posts and serve queries as well.

Adding search to a bigger system. You may be reading this book because you already have a system that's crunching data, and you simply want to add search. We'll look at a couple of overall designs for how that might be done.

As the back end of a ready-made solution built around it. Because Elasticsearch is open-source and offers a straightforward HTTP interface, there's a big ecosystem around it that you can use. For example, Elasticsearch is popular for centralizing logs: there are tools that can write to Elasticsearch and tools that can read from it, so you don't need to develop anything; you just configure those tools to work the way you want.

1.2.1 One stop shop for storing, searching and statistics

Traditionally, search engines have been deployed on top of well-established data stores, to provide fast and relevant search alone. That's because search engines typically didn't offer durable storage or a lot of other features that are often needed, such as statistics.

Elasticsearch is one of those modern search engines that provide durable storage, statistics and many other features you've come to expect from a data store. Especially if you're starting a new project, it's a good idea to consider whether you can use Elasticsearch as the only data store, to keep your design as simple as possible. You can also use it on top of another data store, as we'll discuss later.

    Let's take the blog example: when someone writes a new blog post, it can be stored in Elasticsearch. Similarly, you can use Elasticsearch to retrieve, search or do statistics through all that data, as you can see in figure 1.1.

    Figure 1.1 Elasticsearch as the only data store


Do you need fault tolerance, replicating your data to different servers? Elasticsearch offers this as well, as you'll see in chapter 11, which is all about scaling out. There are lots of features that make Elasticsearch a tempting NoSQL data store. It can't be great for everything, but before including another data store in your design, it's worth checking whether Elasticsearch alone can do the job.

1.2.2 Plug in search in a complex system

There are situations where Elasticsearch doesn't provide all the functionality you need from a data store. For example, transaction support and complex relationships are features that, at least in version 1.0, aren't present. If you need those features, you might want to use Elasticsearch along with a different data store.

Or you might already have a complex system that works, except you want to add search. Re-architecting the whole system for the sole purpose of using Elasticsearch alone might be quite a risky step; you might want to do that over time, though.

Either way, you'll have to find a way to keep the two data stores synchronized. Depending on what your primary data store is and how your data is laid out, you can deploy an Elasticsearch plugin to keep the two entities synchronized, as illustrated in figure 1.2. We'll discuss plugins more in chapter 10.

    Figure 1.2 Elasticsearch in the same system with another data store

Typically, when you do this synchronizing, Elasticsearch will also store the original data. For example, let's assume you have a retail store with lots of products stored in an SQL database. You need some fast and relevant searching, so you install Elasticsearch. Next, to index data, you'll need to deploy something that will pull all the data corresponding to each product and index it in Elasticsearch. Each product will be a document, and when a user types in some search criteria, a number of product documents will come back as results.

1.2.3 Use it with existing tools

In some use cases, you don't have to write a single line of code to get the job done with Elasticsearch, because there are lots of tools built to work with it; you don't have to write your own from scratch.

For example, let's say you want to deploy a large-scale logging framework, where you want to store, search and analyze a large number of events. As you can see in figure 1.3, you can use logging tools that can process logs and output to Elasticsearch, such as rsyslog[1], Logstash[2] or Apache Flume[3]. Then, to search and analyze those logs in a visual interface, you can use Kibana[4].

    Figure 1.3 Elasticsearch in a system of logging tools that supports it out-of-the box

The fact that Elasticsearch is open-source (under the Apache 2 license, to be precise) isn't the only reason there are lots of tools that support working with it. Even though Elasticsearch is written in Java, there's more than just a Java API that lets you work with it: it also exposes an HTTP API, which can be used by any application, no matter what programming language it was written in.

[1] http://www.rsyslog.com/
[2] http://www.elasticsearch.org/overview/logstash/
[3] https://flume.apache.org/
[4] http://www.elasticsearch.org/overview/kibana/


What's more, the HTTP requests and replies typically come in JSON, or JavaScript Object Notation: every request has its payload in JSON, and every reply is a JSON document as well.

JSON and YAML

JSON is a format for expressing data structures. A JSON object typically contains keys and values, where a value can be a string, a number, true/false, null, another object or an array. For more details about the JSON format, visit http://json.org

JSON is easy for applications to parse and generate. YAML (YAML Ain't Markup Language) is also supported for the same purpose, and you can activate it by adding the format=yaml parameter to the HTTP request. For more details on YAML, visit http://yaml.org

While JSON is typically used for HTTP communication, configuration files are usually written in YAML. In this book, we'll stick with the popular formats: JSON for HTTP communication and YAML for configuration.

    For example, a log event might look like this when you index it in Elasticsearch:

{
  "message": "logging to Elasticsearch for the first time",   #A
  "timestamp": "2013-08-05T10:34:00Z"                         #B
}

#A A field with a string value
#B A string value that Elasticsearch will detect as a date and evaluate automatically

    NOTE Throughout this book, you will see that JSON field names are in blue and their values are in red, to make the code easier to read.

    And a search request for log events with first in the message field would look like this:

{
  "query": {
    "match": {             #A
      "message": "first"   #B
    }                      #A
  }
}

#A The value of the query field is an object containing the match field
#B The match object specifies that first is the value to search for in the message field

Sending data and running queries by sending JSON objects over HTTP makes it easy to extend anything, from a syslog daemon like rsyslog to a connector framework like ManifoldCF, to interact with Elasticsearch. And if you're building a new application from scratch, or just want to add search to an existing one, the REST API is one of the features that makes Elasticsearch appealing. In the next section, we'll look at other such features.

1.3 Data structure and interaction

When you want to start using Elasticsearch from your application, or when you're evaluating it, you might have two sets of essential questions:

    How do I index, search and run statistics on a set of data? What are the actual features that let me do that, and how do they work?

    How should I organize my data? What are my options in terms of data structure?

When it comes to indexing and searching, a lot of Elasticsearch's functionality maps to that of Apache Lucene, since that's the search engine library on which it's built. Elasticsearch exposes most of this functionality through its REST API, extends it in a lot of places, and implements some of the concepts found in Lucene differently.

    In terms of structuring your data, the main unit of indexing and searching is a document. A document can be a blog post with all its metadata, a user with all its metadata, or any other type of data you plan to search for. We say it's document-oriented, like many NoSQL data stores. Those documents are organized in document types, and document types in indices. In SQL terms, you can roughly think of an Elasticsearch index as a database, and a document type as a table in that database.

Finally, there's the matter of extending the existing functionality. Whether you want to synchronize Elasticsearch with an external data store, such as MongoDB, or you want some additional features, there are lots of plugins you can choose from. We'll talk more about plugins in chapter 10; for now, we'll look at the core functionality.

1.3.1 Indexing and search functionality

Being a search engine library, Lucene can be used by anyone to implement search in their own Java projects, and if you don't work with Java, there are ports of Lucene in other languages. Lucene does all the heavy lifting for you, from indexing and storing documents to searching them in various ways.

So why would you need Elasticsearch when you can use Lucene directly? The answer: Elasticsearch offers functionality you'll want to use in your application that's outside Lucene's scope. For example, Elasticsearch has lots of caching features, has an HTTP API you can use from applications written in any language, and is also backwards-compatible.

    Elasticsearch makes most of Lucene's features available for your application through the REST API we've talked about earlier. You can index documents over HTTP, as you'll see in chapter 3, and you can run searches the same way, as you'll see in chapter 4. But we've already established that Elasticsearch offers much more than good keyword search. So we'll discuss statistics via facets in chapter 6, and you'll see various ways to make searches more relevant in chapter 7.


What about Solr?

If you've heard about Lucene before, you've probably also heard about Apache Solr, which is also an open-source, distributed search engine based on it. In fact, Lucene and Solr merged as Apache projects in 2010. So you might wonder how Elasticsearch compares with Solr.

Both search engines provide similar functionality, and features evolve quickly from one version to the next. You can search the web for comparisons, but we'd recommend taking them with a grain of salt. Besides being tied to particular versions, which makes such comparisons obsolete in a matter of months, few of their authors have in-depth knowledge of both solutions.

That said, there are a few historical facts that might explain the origins of the two products. Solr was created in 2004 and Elasticsearch in 2010. When Elasticsearch came around, its distributed model, which is discussed later in this chapter, made it much easier to scale out than any of its competitors, which suggests the elastic part of the name. In the meantime, Solr added sharding with version 4.0, which makes the distributed argument debatable, like many other aspects.

At the time of this writing, both Elasticsearch and Solr have features that the other one doesn't, and choosing between them may come down to the actual functionality you need at that point in time. For many use cases, all the functionality you need is covered by both and, as is often the case with competitors, choosing between them becomes a matter of taste.

ANALYSIS

Another key feature of any Lucene-based search engine is analysis. Through analysis, the words from the text you're indexing become terms in Elasticsearch. For example, if you index the text bicycle race, analysis may produce the terms bicycle, race, cycling, and racing. When you later search for any of those terms, the corresponding document is included in the results.

The same applies when you search, as illustrated in figure 1.4. If you enter bicycle race, you probably don't want to search only for the exact match; a document that contains both of those words somewhere will probably do.

The default analyzer breaks text into words by looking for common word separators, such as a space or a comma. Then it lowercases those words, so that Bicycle Race generates the terms bicycle and race. We'll talk more about analysis in chapter 5.
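To get an intuition for what the default analyzer does, here's a rough sketch using standard UNIX tools, lowercasing the text and splitting on spaces. This is only a crude stand-in; the actual Lucene analyzer handles many more separators and languages:

```shell
# Lowercase the text, then split it into one term per line
# (a crude stand-in for the default analyzer):
echo "Bicycle Race" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n'
```

This prints bicycle and race on separate lines, which are the terms that would end up in the inverted index.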

    Figure 1.4 Analysis breaks text into words, both when you're indexing and searching


At this point you might want to know more about what's in that indexed data box of figure 1.4, because it sounds quite vague. As we'll see next, data is organized in documents. By default, Elasticsearch stores your documents as they are, and it also puts all the terms resulting from analysis into the inverted index, to enable the all-important fast and relevant searches. We'll go into more detail on indexing and storing data in chapter 3.

1.3.2 Structuring your data in Elasticsearch

Unlike relational databases, which store data in records or rows, Elasticsearch stores it in documents. To some extent, the two concepts are similar: in a table, you have columns, and for each column, each row has a value. Documents have keys and values in much the same way.

The difference is that a document is more flexible, mainly because, in Elasticsearch at least, a document can be hierarchical. For example, just as you would associate a key with a string value like author:Joe, you can have an array of strings, like tags:[cycling, bicycles], or even key-value pairs again, like author:{first_name:Joe, last_name:Smith}.
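For illustration, such a hierarchical document could be indexed over HTTP like this. This is a sketch that assumes a node running on localhost, and it uses a hypothetical index named blog with a type named posts; with the default settings, Elasticsearch creates both automatically on first use:

```shell
# Index a hierarchical document under index "blog", type "posts", ID 1
# (index and type names here are hypothetical):
curl -XPUT 'http://localhost:9200/blog/posts/1' -d '{
  "title": "Bicycle Race",
  "author": {
    "first_name": "Joe",
    "last_name": "Smith"
  },
  "tags": ["cycling", "bicycles"]
}'
```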

This flexibility is important because it encourages you to keep all the data that belongs to a logical entity in the same document, as opposed to different rows in different tables. For blogs, the easiest and probably fastest way of storing articles is to keep all the data that belongs to a post in the same document. This way, searches are fast because you don't need joins or any other relational work.

If you have an SQL background, you might miss the ability to use joins. Unfortunately, they're not supported, at least in version 1.0, because it's difficult to do complex relational work in a distributed environment. When data starts going back and forth between many nodes, query time increases dramatically.

There are ways to define relationships in Elasticsearch (we explore these in chapter 8). While this functionality is limited compared to what you'd expect from a relational database, it can still prove very useful. For example, if you index access logs from your HTTP server, you can have a relationship between each log entry and the blog post that was accessed. This way, you can let users search for blog posts matching certain criteria, and show the most accessed ones first.

But how would this data actually be organized? Let's say you put all the posts of each particular blog in its own index, and you have two blogs: one about cycling and one about elections, as in figure 1.5. An index is much like a database, in the sense that it can have its own completely different settings, such as the way it's stored on disk or whether it's read-only. Indices are stored on disk in different sets of files, which makes them physically separate, although you can search across multiple indices in the same way you search a single index.


Figure 1.5 Data organized in multiple indices

    Inside each index, each document has to go into a type. The purpose of types is to logically separate different kinds of documents. In our case, let's assume we have blog posts and access logs. In the following figure, we'll define those two types in each index.


    Figure 1.6 Types provide logical separation within the same index


Types are also used to specify relationships between documents. You can define a one-to-many relationship between documents of one type and documents of another, which we'll cover in chapter 8. In this case, we'll make posts the parent of logs, because each post can have many corresponding access logs (children), while an access log can only hit one post (parent). The resulting data structure is illustrated in the following figure.

    Figure 1.7 Data organized in indices and types
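As a preview of chapter 8, a relationship like the one between posts and logs is declared in the mapping of the child type. Here's a sketch, assuming the hypothetical blog index with its posts and logs types from above; the exact syntax is covered in chapter 8:

```shell
# Declare that documents of type "logs" have a parent of type "posts"
# (index and type names are hypothetical):
curl -XPUT 'http://localhost:9200/blog/logs/_mapping' -d '{
  "logs": {
    "_parent": { "type": "posts" }
  }
}'
```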

Next, let's look at how this data organization maps onto the physical servers running Elasticsearch. Because it's distributed, your data can be spread across multiple such servers, which together form a cluster. We'll look next at how that works, and at what it means for operating Elasticsearch.

1.4 Scaling and performance

Performance is often the main reason you want a search engine. If you have lots of data, you need to be able to index it quickly and get results from your searches quickly. If you have many users, you need your search engine to support many concurrent searches.

    With Elasticsearch, you can do lots of tuning to make indexing operations fast, from sending many of them in the same bulk request, to changing the way your data is stored on disk; we'll discuss these techniques in detail in chapter 12. Searches can also be tuned for speed: for example, some search types are faster than others, depending on the use-case, and we'll show you how to make your searches faster in chapter 13.

As your data keeps growing, an important feature is the ability to split it across multiple servers, also known as sharding. Fortunately, Elasticsearch is sharded by default, as you'll come to understand, making it easy to spread your data across a cluster of multiple instances. You can also store multiple copies of the same data on multiple machines, which is good for availability.

Another feature, especially useful when you have multiple instances, is the ability to change the configuration on the fly through the REST API. The same API that you normally use for indexing and searching can be used to change most of the configuration options. Those changes can apply to all nodes of a cluster by running a single command.

1.4.1 Scaling, sharding and performance

A cluster is made up of one or more nodes, where a node is an Elasticsearch process that typically runs on a separate server. Elasticsearch is clustered by default: when you start your first process, you have a cluster of one node, which can be expanded without making any configuration changes.

    This works because Elasticsearch divides every index into chunks called shards. By default, each index has 5 shards. When you start your first node and index your first blog posts, your one-node cluster works as illustrated in the following figure.

    Figure 1.8 One node cluster with one index divided into five shards

How does this help, you might ask? If you stay on one node forever, it doesn't: you might as well have only one shard. But if you ever add new nodes to your cluster, Elasticsearch can move shards from your initial node to the new ones. Figure 1.9 shows how a three-node cluster might look, with your initial five shards spread across the available nodes. Now the load of indexing and searching is spread between more machines, giving you more performance and capacity. Each node can receive all kinds of requests, so from the application's point of view, talking to any node is like talking to the single entity that is your Elasticsearch cluster.


Figure 1.9 Five shards spread on a three-node cluster
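How does Elasticsearch decide which shard each document goes to? Conceptually, it hashes the document's ID and takes the remainder modulo the number of primary shards. A rough sketch of that idea, using cksum as a stand-in for Elasticsearch's own internal hash function:

```shell
# Pick a shard for a document ID: hash the ID, then take the
# remainder modulo the number of primary shards (5 by default).
id="post-42"
num_shards=5
hash=$(printf '%s' "$id" | cksum | cut -d' ' -f1)
echo "document $id goes to shard $((hash % num_shards))"
```

This is also why the number of primary shards can't be changed after an index is created: existing documents would no longer map to the shard they were stored in.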

    Shards can be moved around, allowing you to expand or shrink your cluster at any time. By default, Elasticsearch will try to balance the number of shards across your nodes, so the load is evenly spread.

    The problem with the setup in figure 1.9 is that if a node goes down, part of your data will become unavailable. To increase availability, you can create one or more copies for each of your initial shards. Initial shards are called primaries and the copies are called replicas.

    Primaries differ from replicas because they are the first to receive new documents. Other than that, they're pretty much the same: both index the same documents eventually, and both can serve searches. This helps in two ways:

Searches run on replicas as well, so you can serve more concurrent searches by simply adding more nodes and replicas to your cluster.

Replicas index documents just like primary shards, so they can be easily promoted to primaries, which is exactly what Elasticsearch does when a node hosting a primary shard goes down.

The number of replicas per shard can be changed on the fly. Figure 1.10 shows how the cluster from figure 1.9 would look with one replica added for each shard.


Figure 1.10 The three-node cluster with a set of replicas

    With this cluster, if a node goes down, you'll still have a complete set of shards, because of the replicas. From the application's point of view, the cluster continues to work as expected. In the background, Elasticsearch automatically promotes the needed replicas to primaries, and creates additional replicas, to get back to the configuration you requested. You'll learn more about this process in chapter 11, which is all about scaling.
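Because the number of replicas is just an index setting, changing it on the fly goes through the same REST API used for everything else. A sketch, assuming a hypothetical index named blog on a local node:

```shell
# Ask for one replica per primary shard of the "blog" index
# (the index name is hypothetical):
curl -XPUT 'http://localhost:9200/blog/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'
```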

1.5 Getting started

Let's begin by installing your own single-server Elasticsearch cluster. To do that, you need to have at least Java 6 installed. Once that is in place, you're typically just a download away from getting Elasticsearch ready to start.

1.5.1 Getting Java

If you don't have a Java runtime environment already, you'll have to install it first. Any Java runtime environment should work, but typically you'd install the one from Oracle (https://www.java.com/en/download/index.jsp) or the open-source implementation, OpenJDK (http://download.java.net/openjdk/).


Troubleshooting "no Java found" errors

With Elasticsearch, as with other Java applications, it might happen that you have downloaded and installed Java, but the application still refuses to start, complaining that it can't find Java.

Elasticsearch's script looks for Java in two places. The first is the JAVA_HOME environment variable. You can check whether it's set by using the env command on UNIX-like systems and the set command on Windows.

The second place is the system path. To check whether Java is in your path, you can run the following command:

% java -version

If it works, then you have Java in your path. If it doesn't, you have to either configure JAVA_HOME or add the Java binary to your path. The Java binary is typically found wherever you installed Java (which should be JAVA_HOME), in the bin directory.
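For example, on a UNIX-like system you might add lines like the following to your shell profile. The installation path below is hypothetical; adjust it to wherever your Java installation actually lives:

```shell
# Point JAVA_HOME at the Java installation directory (example path),
# then make the java binary visible on the system path:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH=$PATH:$JAVA_HOME/bin
```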

1.5.2 Downloading and starting Elasticsearch

With Java set up, you need to get Elasticsearch and start it. Typically, you'll download the package that best fits your environment from http://www.elasticsearch.org/download/.

ANY UNIX-LIKE OPERATING SYSTEM

If you're running on Linux, Mac, or any other UNIX-like operating system, you can get Elasticsearch from the tar.gz package. Then, you can simply unpack it and start Elasticsearch with the shell script from the archive:

% tar zxf elasticsearch-*.tar.gz
% cd elasticsearch-*
% bin/elasticsearch -f

The -f parameter starts Elasticsearch in the foreground. This lets you see what it's doing, and you can easily stop it with CTRL+C. If you omit the -f, you'll have to look for which process to kill. And if you need to look at the logs, you can find them in the logs/ directory within the unpacked package.

MAKING ELASTICSEARCH A SYSTEM SERVICE

In production, you'd probably want to run Elasticsearch as a service instead of starting it manually. To do that, you need to install the service wrapper, which is available on GitHub. First, download it:

    % git clone https://github.com/elasticsearch/elasticsearch-servicewrapper.git

Then copy the service/ directory into the bin/ directory of your Elasticsearch installation:

    % cp -r elasticsearch-servicewrapper/service/ bin/


Finally, create your init script as a symbolic link to the Elasticsearch service wrapper script:

    % ln -s `pwd`/bin/service/elasticsearch /etc/init.d/elasticsearch

Now you're done! You can restart Elasticsearch by running:

% /etc/init.d/elasticsearch restart

MAC

If you need an easier way to install Elasticsearch on your Mac, you can use Homebrew. Instructions for installing it can be found at http://brew.sh.

With Homebrew installed, getting Elasticsearch is a matter of running:

    % brew install elasticsearch

    Then you'd start it in a similar way to the tar.gz archive:

% elasticsearch -f

WINDOWS

If you're running on Windows, you can download the ZIP archive. Unpack it, then run elasticsearch.bat from the bin/ directory, much like you would run elasticsearch -f on UNIX:

    % bin\elasticsearch.bat

RPM PACKAGE

If you're running on Red Hat Linux, CentOS, SUSE, or anything else that works with RPMs, you can get the RPM package and install it:

    % rpm -Uvh elasticsearch-*.noarch.rpm

    And that's it! Elasticsearch is installed and started. If you need to restart it, you'd do it like you would with any other service. For example:

    % systemctl restart elasticsearch.service

DEB PACKAGE

If you're running on Debian, Ubuntu, or anything else that works with DEBs, download the DEB package and install it:

    % dpkg -i elasticsearch-*.deb

    And again, that's it! If you need to restart it, you can use the init script:

    % /etc/init.d/elasticsearch restart


If you need to see what Elasticsearch is doing, you can look at the logs in /var/log/elasticsearch/. The same applies to the RPM package.

1.5.3 Verifying that it works

Now that you have Elasticsearch installed and started, let's take a closer look at what's going on. We'll examine the logs generated during startup, and we'll connect to the REST API for the first time.

STARTUP LOGS

When you first run elasticsearch -f, you should see a number of log lines telling you what's going on. Let's take a look at some of those lines and what they mean:

    [node] [Basilisk] version[1.0.0.Beta1], pid[6011], build[77bc5d5/2013-11-06T14:40:44Z]

The first line typically tells you some details about the node you just started. By default, Elasticsearch gives your node a random name, in this case Basilisk, which can be changed in the configuration. You can also see details on the particular version you're running, along with the PID of the Java process that was started.

    [plugins] [Basilisk] loaded [], sites []

    Plugins are loaded during initialization, and no plugins are included by default. We'll talk some more about plugins in chapter 10.

    [transport] [Basilisk] bound_address {inet[/0.0.0.0:9300]}, publish_address {inet[/192.168.1.8:9300]}

Port 9300 is used by default for inter-node communication, called transport. If you need to use the native Java API instead of the HTTP API, this is the port you need to connect to.

    [cluster.service] [Basilisk] new_master [Basilisk][YPHC_vWiQVuSX-ZIJIlMhg][inet[/192.168.1.8:9300]], reason: zen-disco-join (elected_as_master)

    A master node was just elected and it's the node you've just started. We'll talk more about master election in chapter 11, which is about scaling out. The basic idea is that each cluster has a master node, responsible for knowing which nodes are in the cluster and where all the shards are located. Each time the master is unavailable, a new one is elected. In this case, we've just started the first node in the cluster, so this will be our master.

    [http] [Basilisk] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.168.1.8:9200]}

Port 9200 is used for HTTP communication by default. This is where applications using the REST API connect.

    [node] [Basilisk] started


Your node is now started. At this point you can connect to it and start issuing requests.

    [gateway] [Basilisk] recovered [0] indices into cluster_state

The gateway is the component of Elasticsearch responsible for persisting your data to disk, so that you don't lose it when the node goes down. When you start your node, the gateway checks the disk to see if any data is saved, so it can restore it. In this case, there's no index to restore.

A lot of what we've looked at here is configurable, from the node name to the gateway settings. We'll talk about configuration options, and the concepts around them, as the book progresses. You can expect a lot of such configuration options to appear in part III, which is all about performance and administration. Until then, you don't need to configure much, because the defaults are developer-friendly.

USING THE REST API

The easiest way to connect to the REST API is by opening your browser and going to http://localhost:9200. If you didn't install Elasticsearch on your local machine, you should be able to reach it by replacing localhost with the IP address of the remote machine. By default, Elasticsearch listens for incoming HTTP requests on port 9200 of all interfaces. If the request works, you should get a JSON reply, like the one in the following figure:

Figure 1.11 Checking out Elasticsearch from your browser

    Let's look at each field of the JSON and see what it's about:

Let's look at each field of the JSON and see what it's about:

ok: when it's true, the request was successful.
status: the HTTP status code of the reply; 200 means OK.
name: the name of your Elasticsearch instance, which you can also see in the logs.
version: a field that contains an object, demonstrating the hierarchical nature of documents that we discussed earlier. The object has a number of fields that tell you about the version you have installed: the version number, its build hash, the time it was built, whether it's an official release or a build from a snapshot of a branch on GitHub, and the underlying Lucene version.
tagline: this contains the first tagline of Elasticsearch: You Know, for Search.
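You can get the same reply from the command line, assuming a node is listening on the default port of your local machine:

```shell
# Fetch the node's banner reply over the REST API:
curl http://localhost:9200
```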

Now that you're all set up, let's have another look at what we went through in this chapter, and then move on to chapter 2, where you'll get to know Elasticsearch even better by indexing and searching some real data.

1.6 Summary

Here you've met Elasticsearch: an open-source, distributed search engine built on top of Apache Lucene. It's good for indexing lots of data and doing full-text searches on that data afterwards. It's very powerful, as well as fast when it comes to searching, which lets you implement all sorts of functionality, from sorting results by relevance to statistics and search suggestions. Getting started is a matter of a couple of commands, regardless of which operating system you're on. You can work with it through the REST API exposed via HTTP, by sending JSON data and getting JSON replies.

In addition to the full-text search functionality, you can also look at Elasticsearch as a NoSQL data store. It's document-oriented and scalable: if one server isn't enough to hold your indexing or search load, you can always solve that by adding more instances. Elasticsearch is clustered by default, and automatically balances your data between instances. This makes it very good for cloud environments, because it lets you add or remove instances in a fast and straightforward way.


Elasticsearch in Action MEAP V08
Copyright
Welcome
Table of Contents
Chapter 1: Introducing Elasticsearch
1.1 Elasticsearch as a search engine
  1.1.1 Searches need to be quick
  1.1.2 Results have to be relevant
  1.1.3 Searching beyond exact matches
1.2 Typical setups using Elasticsearch
  1.2.1 One stop shop for storing, searching and statistics
  1.2.2 Plug in search in a complex system
  1.2.3 Use it with existing tools
1.3 Data structure and interaction
  1.3.1 Indexing and search functionality
  1.3.2 Structuring your data in Elasticsearch
1.4 Scaling and performance
  1.4.1 Scaling, sharding and performance
1.5 Getting started
  1.5.1 Getting Java
  1.5.2 Downloading and starting Elasticsearch
  1.5.3 Verifying that it works
1.6 Summary