ElasticSearch
Savvy is an online community for the world’s product enthusiasts. Our communities are the product trendsetters that the rest of the world follows. Across the site, our users are able to compare products, ask and answer product questions, share product reviews, and generally share their product interests with one another. Savvy1.com boasts a vibrant community that saves products on the site at a rate of one product every second. We wanted to provide a search bar that can search across the various entities in the system - users, products, coupons, collections, etc. - and return results in a timely fashion.
The search server should satisfy the following requirements:
- Full Text Search: The ability to not only return documents that contain the exact keywords, but also documents that contain words that are related or relevant to the keywords.
- Clustering: The ability to distribute data across multiple nodes for load balancing and efficient searching.
- Horizontal Scalability: The ability to increase the capacity of the cluster by adding more nodes.
- Read and Write Efficiency: Since our application is both read and write heavy, we need a system that allows for high write loads and efficient read times on heavy read loads.
- Fault Tolerant: The loss of any node in the cluster should not affect the stability of the cluster.
- REST API with JSON: The server should support a REST API using JSON for input and output.
At the time, we looked at Sphinx, Solr and ElasticSearch. The only system that satisfied all of the above requirements was ElasticSearch, and, to sweeten the deal, ElasticSearch provided a way to efficiently ingest and index data from our MongoDB database via the River API, so we could get up and running quickly.
Heads Up: The river API is being deprecated in ElasticSearch. You shouldn't use it anymore in production systems; see Data Ingestion Strategies for an alternative and scalable way to push data into ElasticSearch.
Savvy1.com was built with Ruby on Rails on the backend, jQuery and Handlebars on the client side, and MongoDB as the backing store. We are using ElasticSearch as the search index, while MongoDB is the main database and the source of truth. The general architecture is as follows:
The WWW cluster serves as both the API server and the UI server. Every node on the cluster is stateless, and so is capable of serving any load balanced request. The cluster talks to MongoDB for performing CRUD operations, and to the ElasticSearch cluster for querying information requested from the front end. It acts as the single entry point for the whole application.
The ElasticSearch cluster consists of 6 nodes — 3 data nodes, 2 dedicated master nodes and 1 search load balancer node. We deployed 2 dedicated master nodes to prevent the famous split brain problem with ElasticSearch.
In a classic 3-node deployment of ElasticSearch in the EC2 environment, all nodes act as master nodes and data nodes by default. In EC2, the network connection between nodes is sometimes lost, even when the nodes are deployed in the same region. When that happens, each node in the cluster assumes the master role. As a result, data that is indexed on one node will not be replicated to the other nodes, resulting in index corruption and cluster failure. This is referred to as the split brain problem. To prevent the split brain issue from happening, it is recommended that a set of dedicated master nodes be deployed.
The data from MongoDB is pulled into ElasticSearch via ElasticSearch MongoDB river. This river plugin follows the operations log (oplog.rs) on MongoDB and updates the ElasticSearch indices directly. Finally, the API layer abstracts out the search DSL and exposes endpoints that can be invoked from the client side. Because it is very risky to open port 9200 to the internet, we use the API to manage the validation, authentication and authorization of incoming requests. This also allows us to keep the query DSL isolated in the API layer.
Pro Tip: If the search load is not high and the data being indexed is small, you can deploy a 3-node cluster and scale it up later. To prevent the split brain issue, make sure that the minimum number of master-eligible nodes is set to (n/2) + 1, rounded down, e.g. discovery.zen.minimum_master_nodes: 2 for a cluster with 3 nodes.
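For reference, the node roles described above map to a handful of elasticsearch.yml settings. This is a minimal sketch assuming a layout like the 6-node cluster described earlier; adjust the counts and the quorum value to your own deployment:
# elasticsearch.yml on the dedicated master nodes
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2   # (2/2) + 1 = 2 master-eligible nodes
# elasticsearch.yml on the data nodes
node.master: false
node.data: true
# elasticsearch.yml on the search load balancer (client) node
node.master: false
node.data: false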
Let's build an application that emulates Savvy1.com, but only provides search functionality, from scratch. We will first add a naive search system, then optimize it to improve search performance. The tools we will use for this tutorial are listed below; please follow the setup instructions on each tool's page.
- Node.js NPM for installing dependencies.
- Yeoman, Bower, and Grunt for automating the boilerplate code generation.
- AngularJS for building the UI and displaying the data retrieved.
- Dropwizard for building the API layer.
- ElasticSearch for setting up the search cluster.
- ESClient for loading data into ElasticSearch.
- generator-angular-dropwizard for generating the boilerplate code for Dropwizard and AngularJS.
- Test Data to import into ElasticSearch. Download and unzip the data.
Note: The source code is in my GitHub repository.
Download the latest distribution of ElasticSearch (1.3.2 at the time of writing).
Extract the contents and start the server:
$ tar -xzvf elasticsearch-1.3.2.tgz
$ cd elasticsearch-1.3.2
$ bin/elasticsearch
In a different terminal window, unzip the test data.
Use the esimport tool to import the data into ElasticSearch:
$ esimport -u http://localhost:9200/ -f collections-anon.txt
$ esimport -u http://localhost:9200/ -f products-anon.txt
$ esimport -u http://localhost:9200/ -f users-anon.txt
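To confirm the import worked, you can list the indices and check a document count (a quick sanity check; the index names come from the imported files):
$ curl "http://localhost:9200/_cat/indices?v"
$ curl "http://localhost:9200/products/_count"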
4.2 Generate boilerplate code
$ mkdir es-tutorial
$ cd es-tutorial/
$ yo angular-dropwizard
The angular-dropwizard generator creates the directory structure, the required files and modules to enable us to start development. It generates 3 dropwizard apps - tutorial-client, tutorial-api and tutorial-service. For this tutorial we will be using only the tutorial-service app. You can import the whole project, starting with the pom.xml in the es-tutorial directory, to your favorite IDE and start hacking.
Open 2 terminal windows, one for grunt and the other for mvn. Run grunt server in one terminal and mvn compile exec:exec -pl es-tutorial in the other.
We use both the grunt server and the Dropwizard server to enable rapid development. For Dropwizard to serve modified HTML, JS and CSS files, those files need to be copied to the target directory that contains the compiled classes, and recompiling the Java code every time just to see front-end changes would slow down development.
Modify the pom.xml file of tutorial-service and add the following dependencies:
<dependency>
<groupId>com.bazaarvoice.dropwizard</groupId>
<artifactId>dropwizard-redirect-bundle</artifactId>
<version>0.3.0</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>1.3.2</version>
</dependency>
Add a new Java class SearchResource.java to src/main/java/tutorials/elasticsearch/resources/:
@GET
@Produces(MediaType.APPLICATION_JSON)
@Timed
public Response get(@Context UriInfo uriInfo, @Context HttpServletRequest request) {
    // Read the query parameters; "sort" is parsed here but not yet applied to the query.
    MultivaluedMap<String, String> qp = uriInfo.getQueryParameters();
    String sort = qp.getFirst("sort");
    // Fetch the first 20 products from the "products" index.
    SearchRequestBuilder searchRequestBuilder = esClient.prepareSearch("products")
            .setTypes("product")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setSize(20);
    SearchResponse response = searchRequestBuilder.execute().actionGet();
    SearchHits hits = response.getHits();
    // Collect the source documents of the hits.
    ArrayList<Map<String, Object>> list = new ArrayList<Map<String, Object>>();
    for (SearchHit hit : hits) {
        list.add(hit.getSource());
    }
    try {
        // Serialize the results to JSON and return them.
        ObjectWriter ow = mapper.writer().withDefaultPrettyPrinter();
        String json = ow.writeValueAsString(list);
        return Response.ok(json, MediaType.APPLICATION_JSON_TYPE).build();
    } catch (JsonProcessingException ex) {
        LOG.error(String.format("Could not process the returned doc: %s. %s", ex.getMessage(), getStackTrace(ex)));
    }
    return Response.serverError().build();
}
...
This route fetches the product data from ElasticSearch and returns the results.
The main service class for the server is TutorialService.java. Let's configure it as follows:
Make the class implement the Managed interface so that it can be started and stopped along with the application's lifecycle.
public class TutorialService extends Application<TutorialConfiguration> implements Managed {
Update the initialize method, adding the redirects and the assets bundle for serving static files.
@Override
public void initialize(Bootstrap<TutorialConfiguration> bootstrap) {
bootstrap.addBundle(new RedirectBundle(
new UriRedirect("/favicon.ico", "/assets/app/favicon.ico"),
new UriRedirect("/", "/app/"),
new UriRedirect("/index.html", "/app/index.html")
));
bootstrap.addBundle(new AssetsBundle("/assets/app/", "/app"));
bootstrap.addBundle(hibernateBundle);
mapper = bootstrap.getObjectMapper();
mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
//this.esNode = nodeBuilder().client(true).node();
this.esClient = new TransportClient().addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
}
Configure Cross-Origin Resource Sharing (CORS):
private void configureCors(Environment environment) {
Dynamic filter = environment.servlets().addFilter("CORS", CrossOriginFilter.class);
filter.addMappingForUrlPatterns(EnumSet.allOf(DispatcherType.class), true, "/*");
filter.setInitParameter(CrossOriginFilter.ALLOWED_METHODS_PARAM, "GET,PUT,POST,DELETE,OPTIONS");
filter.setInitParameter(CrossOriginFilter.ALLOWED_ORIGINS_PARAM, "*");
filter.setInitParameter(CrossOriginFilter.ACCESS_CONTROL_ALLOW_ORIGIN_HEADER, "*");
filter.setInitParameter("allowedHeaders", "Content-Type,Authorization,X-Requested-With,Content-Length,Accept,Origin");
filter.setInitParameter("allowCredentials", "true");
}
Add the SearchResource object to the environment. This line is needed so that the environment knows which route to respond to and how to respond to it.
@Override
public void run(TutorialConfiguration configuration,
Environment environment) throws Exception {
// Register the route. Without this, the route will not be recognized by the framework
environment.jersey().register(new SearchResource(configuration, esClient, mapper));
configureCors(environment);
}
Compile the code with mvn compile.
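With both servers running, you can sanity-check the new route from the command line (assuming the resource is mounted at /api/v1/search, the same path the front end uses later):
$ curl "http://localhost:8080/api/v1/search"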
Add images to src/main/resources/assets/app/images directory.
Add CSS files to src/main/resources/assets/app/css directory.
Edit src/main/resources/assets/app/js/app.js, which creates the main application object:
'use strict';
//create the cseApp module. We will use this object to attach controllers
var cseapp = angular.module('cseApp', [
'ngCookies',
'ngResource',
'ngSanitize',
'ngRoute'
]);
cseapp.config(['$routeProvider', '$locationProvider', configureApp]);
function configureApp($routeProvider,$locationProvider) {
$routeProvider
.when('/', {
templateUrl: 'views/home/main.html',
controller: 'MainCtrl'
})
.otherwise({
redirectTo: '/'
});
}
Allow the app to be loaded:
<html lang="en" ng-app="cseApp">
Replace the CSS and font:
<link rel="stylesheet" href="http://fonts.googleapis.com/css?family=Roboto:400,700,500|Roboto+Condensed:400,700" type="text/css">
<link rel="stylesheet" href="css/website.css">
<link rel="stylesheet" href="css/cse.css">
Add the top bar code:
<!--[if lt IE 9]>
<script src="js/es5-shim/es5-shim.js"></script>
<script src="js/json3/lib/json3.min.js"></script>
<![endif]-->
<!-- Add your site or application content here -->
<div class="header">
<section class="top-bar" style="min-width: 960px;">
<div class="row">
<a class="left logo" href="/"><img src="/images/logo.png" alt=""></a>
<ul class="inline-list actions-list left">
<li class="autocomplete" ng-controller="AutocompleteCtrl">
<form ng-submit="submit()">
<div style="position: absolute; top: 0px;" class="search left icon icon-magnifier">
<input ng-model="typeahead" class="left wide" autocomplete="off" type="text" placeholder="Search on Savvy">
</div>
</form>
<div class="submenu {{searchSubmenuView}}" id="autoSearchResultsContainer" style="background: #fff; z-index: 9999999999999999; padding: 10px; position: absolute; top: 40px; width: 460px;">
<div auto-complete="typeahead"></div>
</div>
</li>
<!--
<div ng-controller="DemoController">
Date format: <input ng-model="format"> <hr/>
Current time is: <span my-type-ahead="format"></span>
</div>
-->
</ul>
<ul class="inline-list actions-list right">
<form style="margin: 0px;padding:0px;height:0px;width:0px;" id="logoutForm" method="post" action="/users/sign_out">
<input type="hidden" name="authenticity_token" value="XrRJvLAnihfO1xNADm1v0Gy4rZEKqlvizOKwpr5BEvI=">
</form>
</ul>
</div>
</section>
</div>
<div class="container" id="mainContainer" ng-cloak class="ng-cloak" ng-view=""></div>
Add the main controller:
<script src="js/home/main.js"></script>
Add the controller code to src/main/resources/assets/app/js/home/main.js:
'use strict';
//retrieve a module object named cseApp
var cseapp = angular.module('cseApp');
cseapp.controller('MainCtrl', ['$scope', '$http', '$location', function($scope, $http, $location){
$scope.categories = [];
var url = 'http://localhost:8080/api/v1/search';
var sortingMenu = {
"oldest_first" : "",
"newest_first" : "",
"highest_rated" : "current",
"lowest_rated" : ""
};
console.log(url);
$http({method: 'GET', url: url})
.success(function (response){
var res = response;
console.log(res);
$scope.products = res;
$scope.sortingMenu = sortingMenu;
$scope.urlPrefix = $location.path();
});
}]);
Add the template to src/main/resources/assets/app/views/home/main.html:
<div role="main">
<header class="page-header">
<h1>Latest Products</h1>
</header>
<div class="row">
<div class="column">
<div class="row results-list block-view vertical-grid product-listing">
<ul id="pgrid" class="large-block-grid-4" style="height: 2429px;">
<li ng-repeat="product in products" class="productSnippet" style="width: 230px; float: left" id="product-snippet-">
<article class="box" style="padding-bottom: 0px;">
<a href="/#/products/{{product._id}}">
<figure style="text-align:center;">
<img style="max-height: 160px; max-width: 100%; width: auto; height: auto;" src="{{product.image_url}}" alt="">
</figure>
<h4 style="max-height:90px;overflow: hidden;">{{product.description}}</h4>
</a>
<footer>
<div class="price-box cf">
<div class="price">{{product.price}}</div>
<div class="buy">
<a class="button tiny radius tertiary" data-source="{{product.source}}" data-productid="5260fa3c971c41fc53000032" data-price="{{product.price}}" data-description="{{product.description}}" data-image="{{product.image_url}}" href="{{product.source}}">See It</a>
</div>
</div>
</footer>
</article>
</li>
</ul>
</div>
</div>
</div>
</div>
So far we have built a system that pulls the top products from the ElasticSearch index and displays them on a webpage. But ElasticSearch is meant for searching, so let's build a search box and wire it up to pull search results from the server and display them.
We'll implement a MultiSearchResource.java route that queries multiple indices in ElasticSearch and returns the aggregated results. We search each index separately, aggregate all the results in the response object and return them. We do this instead of running one query across all indices because we want results from every index, not just the overall top hits. The equivalent query DSL is shown after the code below.
private HashMap<String,Object> getResultsMap(String index, String kws){
    // Wrap the per-index results together with the index name so the
    // client can tell which section of the response they belong to.
    ArrayList<Map<String,Object>> list = getResultsList(index, kws);
    HashMap<String,Object> map = new HashMap<>();
    map.put("index", index);
    map.put("results", list);
    return map;
}
private ArrayList<Map<String, Object>> getResultsList(String index, String kws){
    // Match the keywords against the _all field, requiring every term to match.
    SearchRequestBuilder searchRequestBuilder = esClient.prepareSearch(index)
            //.setTypes("product", "collection", "user")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setQuery(QueryBuilders.matchQuery("_all", kws).operator(MatchQueryBuilder.Operator.AND))
            .setHighlighterRequireFieldMatch(true)
            .setSize(5);
    // For collections, only return those that contain more than one product.
    if(index.equals("collections")){
        FilterBuilder filterBuilder = FilterBuilders.rangeFilter("count").gt(1);
        searchRequestBuilder.setPostFilter(filterBuilder);
    }
    SearchResponse response = searchRequestBuilder.execute().actionGet();
    SearchHits hits = response.getHits();
    ArrayList<Map<String,Object>> list = new ArrayList<Map<String,Object>>();
    for (SearchHit hit : hits) {
        Map<String,Object> result = hit.getSource();
        // Make sure every result carries its document id.
        if(!result.containsKey("_id")){
            result.put("_id", hit.getId());
        }
        // Fall back to a default avatar for users without profile pictures.
        if(index.equals("users")){
            ArrayList<String> pics = (ArrayList<String>)result.get("pics");
            if(pics.size() == 0){
                pics.add("http://www.gravatar.com/avatar/00000000");
            }
        }
        list.add(result);
    }
    return list;
}
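For reference, the request that getResultsList builds for the collections index is roughly equivalent to the following query DSL (a sketch; "samsung" is just an example keyword):
curl -XGET "http://localhost:9200/collections/_search?search_type=dfs_query_then_fetch" -d '
{
  "size": 5,
  "query": {
    "match": { "_all": { "query": "samsung", "operator": "and" } }
  },
  "post_filter": {
    "range": { "count": { "gt": 1 } }
  }
}'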
We need to create an autocomplete.js controller to control the search autocomplete behavior in the application.
First, add a simple controller which hides and displays the search autocomplete dropdown:
'use strict';
//retrieve a module object named cseApp
var cseapp = angular.module('cseApp');
cseapp.controller('AutocompleteCtrl', ['$scope', '$http', '$rootScope', '$compile',
function($scope, $http, $rootScope, $compile){
$scope.categories = [];
$scope.searchSubmenuView ="hidden";
$scope.typeahead = null;
}
]);
Now, add a directive that watches the search box for entered text. The code does a few things: it declares a directive called autoComplete, specifies the dependency-injected parameters, and defines a link method that is invoked on the element. In the link method, we set up a $watch on the attributes of the autoComplete directive. Think of watch methods as onChange listeners: when text is entered in the search box, the $watch method is invoked and does the following:
- Calls the multi-search API endpoint to fetch the results.
- Fetches the template.
- Renders the template with the search result data.
- Inserts the rendered template as the child of the element.
//setup a new directive for fetching the results in typeahead
cseapp.directive('autoComplete', [ '$http', '$compile', '$rootScope',
function( $http, $compile, $rootScope) {
function link(scope, element, attrs) {
scope.$watch(attrs.autoComplete, function(value) {
if(value === null || value === "" ){
scope.searchSubmenuView = "hidden";
return;
}
var url = 'http://localhost:8080/api/v1/multi-search?q='+value;
$http({method: 'GET', url: url})
.success(function (response){
var res = response;
//console.log(res);
scope.multiResults = res;
scope.products = res.products;
scope.coupons = res.coupons;
scope.collections = res.collections;
scope.users = res.users;
scope.searchSubmenuView = "";
scope.keywords=value;
}).then(function (response) {
scope.keywords = value;
console.log(JSON.stringify($rootScope.appData));
// get the menu bar template from the server.
$http.get('/views/home/search_menu.html').success(function(response){
console.log(response);
element.replaceWith($compile(angular.element(response))(scope));
});
});
});
}
return {
link: link
};
}]);
What happens when a user enters text in the search box and immediately presses enter? In the above code we don't handle that situation. When that occurs, we need to display the search results page immediately, so let's add a method to handle the submit event of the form:
cseapp.controller('AutocompleteCtrl', ['$scope', '$http', '$compile',
function($scope, $http, $compile){
$scope.categories = [];
$scope.searchSubmenuView ="hidden";
$scope.typeahead = null;
// the submit method handles the case where the user presses enter
// immediately after typing the text
$scope.submit = function(){
$scope.searchSubmenuView ="hidden";
var element = angular.element('#mainContainer');
// fetch the template from the server
$http.get('/views/home/search.html').success(function(response){
console.log(response);
element.replaceWith($compile(angular.element(response))($scope));
});
}
}
]);
<ul class="inline-list actions-list left">
<li class="autocomplete" ng-controller="AutocompleteCtrl">
<form ng-submit="submit()">
<div style="position: absolute; top: 0px;" class="search left icon icon-magnifier">
<input ng-model="typeahead" class="left wide" autocomplete="off" type="text" placeholder="Search on Savvy">
</div>
</form>
<div class="submenu {{searchSubmenuView}}" id="autoSearchResultsContainer" style="background: #fff; z-index: 9999999999999999; padding: 10px; position: absolute; top: 40px; width: 460px;">
<div auto-complete="typeahead"></div>
</div>
</li>
</ul>
This setup gives us a naive search implementation. We simply dumped the data from MongoDB, loaded it into ElasticSearch, added a search box, and then wrote code to retrieve results for the keywords entered. Currently, we can only match on full words and on certain fields. We can certainly improve this; let's say we have the following requirements:
- Partial Word Matching: The query must match not only on full words, but also on substrings. So, typing sam should return the results containing samsung.
- Multiple Search Fields: The query must match across several fields. So, typing camera should match on [ "name", "description", "tags" ], etc.
Before we can dive into how to improve, we need to learn about a few concepts.
Analyzers in ElasticSearch are used to break a document up into the terms that are used for indexing.
When documents are indexed in ElasticSearch, it builds an inverted index. An inverted index is basically a dictionary (lookup table) that maps the terms appearing in documents to references to the documents that contain them. Each analyzer in ElasticSearch is composed of one tokenizer and zero or more token filters. When a query is performed, the words in the query are also analyzed, and the resulting tokens are used to look up matching documents in the inverted index.
Tokenizers break a string down into a stream of terms or tokens. The default tokenizer splits the string on punctuation and whitespace.
Token filters transform the stream of tokens coming from the tokenizer. A filter can add, modify or delete tokens in the stream. For example, a Synonym Filter adds synonyms of the tokens to the stream, a Lowercase Token Filter lowercases all characters in each token, and a Stop Token Filter removes stop words from the stream.
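You can see a tokenizer and a token filter in action with the _analyze API (an illustrative example; the parameter style shown is the one supported by the 1.x API):
$ curl -XGET "http://localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase" -d 'The Samsung Galaxy'
The response lists the tokens the, samsung and galaxy along with their positions and offsets.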
An nGram is a sequence of characters constructed by taking substrings of the string being evaluated. For example, nGram analysis of the string Samsung will yield a set of nGrams like ['S', 'Sa', 'Sam', 'Sams', 'Samsu', 'Samsun', 'Samsung', 'am', 'ams', 'amsu', 'amsun', 'amsung', 'ms', 'msu', 'msun', 'msung', 'su', 'sun', 'sung', 'un', 'ung', 'ng']. Indexing these strings enables partial matches. For our application, we need typeahead autocomplete search, and a typical user will enter a string like Samsung in the form:
S
Sa
Sam
Sams
Samsu
Samsun
Samsung
So, storing all nGrams of the string would be wasteful. For this reason we use a special form of nGram called Edge nGram. Edge nGram analysis produces the set of nGrams starting at the leftmost edge. For example, Edge nGram analysis of Samsung produces ['S', 'Sa', 'Sam', 'Sams', 'Samsu', 'Samsun', 'Samsung'], which is identical to what a typical user would type.
Now, let's set up the search index to support typeahead autocomplete. Since analyzers are applied to documents when they are indexed, we need to drop the existing index, create a new one with the analyzers defined in its settings, and then reload the documents.
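Dropping the old index first (assuming it was created by the earlier import):
$ curl -XDELETE "http://localhost:9200/products"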
curl -XPUT "http://localhost:9200/products " -d'
{
"settings": {
"analysis": {
"filter": {
"edgeNGram_filter": {
"type": "edgeNGram",
"min_gram": 2,
"max_gram": 20,
"side" : "front"
}
},
"analyzer": {
"edge_nGram_analyzer": {
"type": "custom",
"tokenizer": "edge_ngram_tokenizer",
"filter": [
"lowercase",
"asciifolding",
"edgeNGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"tokenizer" : {
"edge_ngram_tokenizer" : {
"type" : "edgeNGram",
"min_gram" : "2",
"max_gram" : "5",
"token_chars": [ "letter", "digit" ]
}
}
}
},
"mappings": {
...
}
}'
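With the index created, you can verify what the custom analyzer produces (a quick check; the analyzer name matches the settings above):
$ curl -XGET "http://localhost:9200/products/_analyze?analyzer=edge_nGram_analyzer" -d 'Samsung'
This should return edge nGrams such as sa, sam, sams and samsu, reflecting the tokenizer's max_gram of 5 and the filter's min_gram of 2.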
In the index settings above, we specify 2 analyzers:
WhiteSpace Analyzer: The whitespace_analyzer splits the text on whitespace and then applies two token filters: the lowercase token filter, which converts all tokens to lowercase, and the asciifolding token filter, which keeps characters in the Basic Latin Unicode block (U+0000 - U+007F) as-is and converts other characters to their ASCII equivalents where such equivalents exist, for example:
+ ß (German Sharp S) ⇒ ss
+ æ (Latin AE Ligature) ⇒ ae
+ ł (Latin Small L with a Stroke) ⇒ l
+ ɰ (Latin Small Letter Turned M With Long Leg) ⇒ m
Edge nGram Analyzer: The edge_nGram_analyzer does everything the whitespace_analyzer does and then applies the edgeNGram_filter to the token stream. The edgeNGram_filter is what generates all of the substrings that will be used in the index lookup table. It is a token filter of "type": "edgeNGram"; "min_gram": 2 and "max_gram": 20 set the minimum and maximum length of the substrings that are generated and added to the lookup table.
Now let's set up the mapping for the product type:
curl -XPUT "http://localhost:9200/products/_mapping/product" -d '
{
"product": {
"_all": {
"index_analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"average_rating": {
"type": "double",
"index": "not_analyzed"
},
"bid": {
"type": "long",
"index": "not_analyzed"
},
"brand_id": {
"type": "long",
"index": "not_analyzed"
},
"category": {
"type": "string",
"include_in_all": true
},
"collection_id": {
"type": "string",
"index": "not_analyzed"
},
"comments_count": {
"type": "long",
"index": "not_analyzed"
},
"country": {
"type": "string",
"index": "not_analyzed"
},
"created_at": {
"type": "date",
"format": "dateOptionalTime",
"include_in_all": false
},
"description": {
"type": "string",
"include_in_all": true
},
"featured": {
"type": "long",
"index": "not_analyzed"
},
"hashtags": {
"properties": {
"hashtag": {
"type": "string",
"include_in_all": true
},
"indices": {
"type": "long",
"index": "not_analyzed"
}
}
},
"image_attrs": {
"properties": {
"height": {
"type": "long",
"index": "not_analyzed"
},
"width": {
"type": "long",
"index": "not_analyzed"
}
}
},
"image_s3_id": {
"type": "string",
"index" : no
},
"image_url": {
"type": "string",
"index": "not_analyzed"
},
"likes": {
"type": "long",
"index": "not_analyzed"
},
"mentions": {
"properties": {
"indices": {
"type": "long",
"index": "not_analyzed"
},
"screen_name": {
"type": "string",
"index": "not_analyzed"
}
}
},
"original": {
"type": "long",
"index": "not_analyzed"
},
"pid": {
"type": "long",
"index": "not_analyzed"
},
"price": {
"type": "string",
"index": "not_analyzed"
},
"product_id": {
"type": "long",
"index": "not_analyzed"
},
"rating": {
"properties": {
"1": {
"type": "long",
"index": "not_analyzed"
},
"2": {
"type": "long",
"index": "not_analyzed"
},
"3": {
"type": "long",
"index": "not_analyzed"
},
"4": {
"type": "long",
"index": "not_analyzed"
},
"5": {
"type": "long",
"index": "not_analyzed"
}
}
},
"ratings_count": {
"type": "long",
"index": "not_analyzed"
},
"resavers": {
"type": "string",
"index": "not_analyzed"
},
"resaves": {
"type": "long",
"index": "not_analyzed"
},
"root_id": {
"type": "string",
"index": "not_analyzed"
},
"source": {
"type": "string",
"index": "not_analyzed"
},
"updated_at": {
"type": "date",
"format": "dateOptionalTime",
"include_in_all": false
},
"urls": {
"properties": {
"indices": {
"type": "long",
"index": "not_analyzed"
},
"url": {
"type": "string",
"index": "not_analyzed"
}
}
},
"user": {
"properties": {
"fb_id": {
"type": "string",
"index": "not_analyzed"
},
"id": {
"type": "string",
"index": "not_analyzed"
},
"name": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
There are several things to notice here:
- The _all field is the easiest way to set up search across multiple fields. By setting "include_in_all": false on a field we can exclude it from the _all field, and hence from these search results. We specify the edge_nGram_analyzer as the index analyzer, so all documents that are indexed are passed through this analyzer. We also specify the whitespace_analyzer as the search analyzer, which means that the search query is passed through the whitespace analyzer before the terms are looked up in the inverted index.
- "index": "no" instructs ElasticSearch to not even bother indexing the field. If you don't anticipate searching on a particular field, you can exclude it from the index to save space.
- "index": "not_analyzed" instructs ElasticSearch to index the field without analyzing it. This is useful when you are searching for complete words or phrases and want an exact match. For example, let's say you want to retrieve all products saved by a user with a given user id. If the user id were analyzed, it would be broken up into multiple tokens and indexed, so there could be multiple partial matches on the id. If we set it to not_analyzed, we retrieve only exact matches (see the example queries below).
Our index definition now satisfies our requirements for multi-field, partial-word matching for autocomplete. By reloading the documents into the new index, we improve both search performance and relevance.
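Reloading can be done with the same import tool used earlier (assuming the test data downloaded at the start is still on hand):
$ esimport -u http://localhost:9200/ -f products-anon.txt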
Previously, we discussed how to ingest data from MongoDB using the now-deprecated river plugin. River APIs are designed as singleton objects in the system; at any given point in time, only one node in the cluster is running the river.
Singleton objects are inherently unscalable, though the ElasticSearch core team did try to shoehorn some fault tolerance features into the API. For example, when the node running the river dies, a surviving node in the cluster restarts the river. However, this design introduced a number of issues and complexities for the scalability of a distributed system, hence the deprecation. So what is the right way to ingest data into ElasticSearch? I am glad you asked. There are 2 ways to do it:
- Have the application index only the fields needed for search, writing to the database and to the search index in a two-stage commit.
- Use another tool/server that follows MongoDB's oplog.rs and pushes the data to the ElasticSearch nodes.
The architecture of such a system is as shown below. The API and UI servers push the data to both the database and the message queue simultaneously. Logstash subscribes to the channel to which the API server is writing, gets notified of incoming messages, then finally pushes the messages to ElasticSearch. This is similar to adding data to Memcache for fast access and then persisting the same data in the database. The advantages of this design are:
- Horizontal Scalability: Assuming the API server is stateless and the queue is distributed and supports exactly-once semantics, we can scale out by adding queue server nodes (RabbitMQ, SQS) and queue processors (Logstash).
- Loose Coupling: Each system can scale independently of the others. Let's say the search query load spikes due to some event in your industry; by adding more search and data nodes to the ElasticSearch cluster, query response times can be improved.
The disadvantages are:
- New application-specific code needs to be written to index only the data required for search.
- The access patterns and search patterns need to be known up front so that only the data that is needed gets indexed.
- If new fields need to be indexed, new code needs to be developed and deployed to production.
For the second approach, you can use Mongo Connector, a tool that is designed to push data from MongoDB into ElasticSearch. Although this setup is much simpler, it does introduce a single point of failure, compromising the system's resilience and fault tolerance: if the Mongo Connector process dies, the data in the ElasticSearch cluster becomes stale. This can be a reasonable starting point if you are testing the waters with ElasticSearch, but using such a setup in production is not recommended.
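As a rough sketch, wiring a local MongoDB replica set to the local ElasticSearch node with Mongo Connector looks something like this (flag names vary between mongo-connector versions, so treat it as illustrative rather than exact):
$ mongo-connector -m localhost:27017 -t localhost:9200 -d elastic_doc_manager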
References:
- ElasticSearch Reference Guide
- ElasticSearch: The Definitive Guide
- Multi-field Partial Word Autocomplete in Elasticsearch Using nGrams, Sloan Ahrens