Elasticsearch

Elasticsearch is a powerful search engine, and tokenization is at the heart of how it works. Elasticsearch ships with many built-in analyzers and tokenizers, and edge n-gram is the one that best fits an autocomplete feature.

The edge n-gram tokenizer cuts the text into n-grams anchored at its beginning. For example, house is tokenized into 5 tokens: h, ho, hou, hous, house. Autocomplete matches a prefix of the text in exactly the same way, which is why Elasticsearch with the edge n-gram tokenizer suits the autocomplete feature so well.
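You can check the tokenization with the _analyze API (a quick sketch, using the same min_gram / max_gram values as the index settings below):

// POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 20
  },
  "text": "house"
}

The response lists exactly the 5 tokens above.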

Write data

Writing data stage is the index stage of Elasticsearch, and settings and mappings are very important, they would be related to reading data.

Settings

// PUT /autocomplete_index/_settings
{
  "analysis": {
    "tokenizer": {
      "autocomplete_tokenizer": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete_analyzer": {
        "tokenizer": "autocomplete_tokenizer"
      }
    }
  }
}

This defines a custom analyzer autocomplete_analyzer built from a custom tokenizer autocomplete_tokenizer of type edge_ngram, with min_gram set to 1 and max_gram set to 20, plus the built-in lowercase token filter so that the indexed tokens match a lowercase keyword such as a bl. Limiting the n-gram length is a good choice: the keyword a user types is rarely longer than 20 characters, so the limit speeds up queries and saves disk space.
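Note that analysis settings can only be changed while the index is closed, so it is often easier to define them when creating the index (a sketch, assuming autocomplete_index does not exist yet):

// PUT /autocomplete_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}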

Mappings

// PUT /autocomplete_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "autocomplete_analyzer",
      "search_analyzer": "keyword"
    }
  }
}

When mapping the fields, the field (name here) needs autocomplete_analyzer set as its analyzer. For example, A black horse and A white house are tokenized very differently by the classic analyzer and by autocomplete_analyzer (you can reproduce this with the _analyze request after the table).

Word: A black horse

classic analyzer:
  • a
  • black
  • horse

autocomplete_analyzer:
  • a
  • a  (with trailing space)
  • a b
  • a bl
  • a bla
  • a blac
  • a black
  • a black  (with trailing space)
  • a black h
  • a black ho
  • a black hor
  • a black hors
  • a black horse

Word: A white house

classic analyzer:
  • a
  • white
  • house

autocomplete_analyzer:
  • a
  • a  (with trailing space)
  • a w
  • a wh
  • a whi
  • a whit
  • a white
  • a white  (with trailing space)
  • a white h
  • a white ho
  • a white hou
  • a white hous
  • a white house
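You can verify the autocomplete_analyzer tokens with the _analyze API once the settings are in place (a quick check against autocomplete_index):

// GET /autocomplete_index/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "A black horse"
}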

The actual indexing requests are:

// POST /autocomplete_index/_doc
{
  "name": "A black horse"
}

// POST /autocomplete_index/_doc
{
  "name": "A white house"
}

Read data

The reading stage is the search stage in Elasticsearch. The index already stores every edge n-gram token, so the keyword the user types only needs to match one of those tokens.

// GET /autocomplete_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "a bl"
          }
        }
      ]
    }
  }
}

The term query inside a filter does not analyze the keyword at all (and the mapping sets search_analyzer to keyword for the same reason), so the keyword stays a single string instead of being tokenized. If the user types a bl the query finds A black horse, and a wh finds A white house.
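For comparison, a match query would run the keyword through the search_analyzer (keyword), so a bl also stays a single token there and matches the same document, only with a relevance score (a sketch):

// GET /autocomplete_index/_search
{
  "query": {
    "match": {
      "name": "a bl"
    }
  }
}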

A filter clause can also be cached, because filters do not calculate a relevance score. If users keep searching a bl, later searches get faster because the filter result is kept in memory.

Sort by character length

// GET /autocomplete_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "a bl"
          }
        }
      ]
    }
  },
  "sort": {
    "_script": {
      "script": "doc['name'].value.length()",
      "type": "number",
      "order": "asc"
    }
  }
}

A Painless script computes the length of the name value and sorts the results in ascending order, so the shortest suggestions come first.
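Note that a script sort reads doc values, and a text field does not provide them by default, so doc['name'] may fail with a fielddata error. One workaround (a sketch; the name.keyword sub-field is an assumption, it is not part of the original mapping) is to add a keyword sub-field and sort on doc['name.keyword'].value.length() instead:

// PUT /autocomplete_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "autocomplete_analyzer",
      "search_analyzer": "keyword",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

Documents indexed before this change need to be reindexed (or updated) before the sub-field is populated.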

Sort by frequency

// GET /autocomplete_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "a bl"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "pageview": {
        "order": "desc"
      }
    }
  ]
}

If the original data stores a frequency field (pageview here), the regular Elasticsearch sort syntax handles it easily.
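For this to work, each document needs the pageview field in the mapping and in the indexed data (a sketch; the field name and the integer type are assumptions, they are not part of the mapping shown earlier):

// PUT /autocomplete_index/_mapping
{
  "properties": {
    "pageview": {
      "type": "integer"
    }
  }
}

// POST /autocomplete_index/_doc
{
  "name": "A black horse",
  "pageview": 42
}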