Elasticsearch
Elasticsearch is a powerful search engine, and tokenization is at the heart of how it searches. Elasticsearch ships with many built-in analyzers and tokenizers, and the edge-ngram tokenizer is the one that fits an autocomplete feature best.
An edge-ngram tokenizer cuts a word into n-grams anchored at its beginning. For example, `house` is tokenized into 5 tokens: `h`, `ho`, `hou`, `hous`, `house`. Autocomplete works the same way, matching a partially typed word, so Elasticsearch with an edge-ngram tokenizer is very well suited to an autocomplete feature.
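You can try this directly with the `_analyze` API by passing an inline edge-ngram tokenizer definition (a minimal sketch; the parameters mirror the settings used below):

// POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 20
  },
  "text": "house"
}

The response lists the five tokens `h`, `ho`, `hou`, `hous`, `house`.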
Write data
Writing data corresponds to the index stage of Elasticsearch. The `settings` and `mappings` are very important here, because they determine how the data can later be read.
Settings
// PUT /autocomplete_index/_settings
{
  "analysis": {
    "tokenizer": {
      "autocomplete_tokenizer": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "autocomplete_analyzer": {
        "type": "custom",
        "tokenizer": "autocomplete_tokenizer",
        "filter": ["lowercase"]
      }
    }
  }
}
This defines a custom analyzer `autocomplete_analyzer`. It uses a custom tokenizer `autocomplete_tokenizer` of type `edge_ngram`, with `min_gram` set to `1` and `max_gram` set to `20`, plus a `lowercase` filter so that the lowercased keywords sent at search time can match the indexed tokens. Capping the n-gram length at 20 is a good choice: the keyword a user types is almost always shorter than 20 characters, and the limit speeds up queries and saves disk space.
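Note that analysis settings can only be changed on a closed index. If the index does not exist yet, it is simpler to supply the same `analysis` block when creating it (a sketch using the same index name):

// PUT /autocomplete_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}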
Mappings
// PUT /autocomplete_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "autocomplete_analyzer",
      "search_analyzer": "keyword",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}
When mapping the fields, the field (`name` here) needs `autocomplete_analyzer` as its index-time analyzer; a `keyword` sub-field is also added so the scripted sort shown later can read the raw value. For example, `A black horse` and `A white house` are tokenized very differently by the classic analyzer and by `autocomplete_analyzer`:
Word | classic analyzer | autocomplete_analyzer |
---|---|---|
A black horse | `a`, `black`, `horse` | every prefix of `a black horse`, from `a` up to `a black horse` (13 tokens) |
A white house | `a`, `white`, `house` | every prefix of `a white house`, from `a` up to `a white house` (13 tokens) |
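The `autocomplete_analyzer` column can be checked with the `_analyze` API once the index exists (a sketch):

// GET /autocomplete_index/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "A black horse"
}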
The actual indexing requests are:
// POST /autocomplete_index/_doc
{
  "name": "A black horse"
}

// POST /autocomplete_index/_doc
{
  "name": "A white house"
}
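The same two documents can also be written in a single request with the bulk API (a sketch):

// POST /autocomplete_index/_bulk
{ "index": {} }
{ "name": "A black horse" }
{ "index": {} }
{ "name": "A white house" }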
Read data
Reading data corresponds to the search stage of Elasticsearch. The index already contains every edge-ngram token, so the keyword the user types only has to match one of those tokens.
// GET /autocomplete_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "a bl"
          }
        }
      ]
    }
  }
}
The search side relies on the `keyword` analyzer and a `filter` clause. The `keyword` analyzer does not tokenize the input (and a `term` query skips analysis entirely), so the whole keyword is matched as-is against the indexed edge-ngram tokens. If the user types `a bl`, it finds `A black horse`; typing `a wh` finds `A white house`.
A `filter` clause can also be cached, because it does not calculate a relevance score. If users keep searching for `a bl`, the results come back faster, since the filter result is kept in memory.
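For comparison, a `match` query behaves similarly here because the `keyword` search analyzer keeps the input as a single token, but it also scores every hit, so the cacheable `filter` form above is cheaper (a sketch):

// GET /autocomplete_index/_search
{
  "query": {
    "match": {
      "name": "a bl"
    }
  }
}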
Sort by character length
// GET /autocomplete_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "a bl"
          }
        }
      ]
    }
  },
  "sort": {
    "_script": {
      "type": "number",
      "script": {
        "lang": "painless",
        "source": "doc['name.keyword'].value.length()"
      },
      "order": "asc"
    }
  }
}
A Painless script computes the length of each value, and the results are sorted in ascending order. The script reads the `name.keyword` sub-field, because scripts cannot access a plain `text` field unless fielddata is enabled.
Sort by frequency
// GET /autocomplete_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "name": "a bl"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "pageview": {
        "order": "desc"
      }
    }
  ]
}
If the original data stores a frequency counter (`pageview` here), the ordinary Elasticsearch sort syntax does the job.
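For example, assuming each document also carries a numeric `pageview` field (the field name and value are only illustrative):

// POST /autocomplete_index/_doc
{
  "name": "A black horse",
  "pageview": 42
}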