Redis
Redis is a powerful in-memory database, it includes common structures (set, list, hash) when developing. One of them called Sorted set (zset) can apply to autocomplete feature.
Store no duplicate data sort by lexicographic to set, as the name implies. So don’t extra work when store autocomplete data.
Tokenization
You must to breaks it up into individual characters before write text to zset. E.g. “house” and “horse” can divide into tokens.
Text | Result |
---|---|
|
|
Sort by character length - simple version
Because of the scenario doesn’t use score, can use the simple version of zset r/w strategy.
Write data
To write all tokens to zset, we use
ZADD
.
1. ZADD autocomplete_index 0 h
2. ZADD autocomplete_index 0 ho
3. ZADD autocomplete_index 0 hou
4. ZADD autocomplete_index 0 hous
5. ZADD autocomplete_index 0 house
6. ZADD autocomplete_index 0 house\x00
7. ZADD autocomplete_index 0 h # <--- duplicated, write data fail
8. ZADD autocomplete_index 0 ho # <--- duplicated, write data fail
9. ZADD autocomplete_index 0 hor
10. ZADD autocomplete_index 0 hors
11. ZADD autocomplete_index 0 horse
12. ZADD autocomplete_index 0 horse\x00
The
0
in the middle position of command represents the score is
0
when writing data. So the arrangement result will sort by lexicographic order. The sort algorithm is order by ascending from head to tail character sequences. The result looks like following.
position = 0 => h
position = 1 => ho
position = 2 => hou
position = 3 => hous
position = 4 => house
position = 5 => house\x00 # <--- real search result
position = 6 => hor
position = 7 => hors
position = 8 => horse
position = 9 => horse\x00 # <--- real search result
The position 6 and 9 add
0x00
(ascii code, means
NULL
) at tail. We can use
0x00
when reading data.
Read data
It has two steps when reading data from zset. First of you need use
ZRANK
to indicate the position when key-in ‘ho’. The result is
1
.
position = ZRANK autocomplete_index ho
The second step that reading data from position
1
to
n
(n is 50 in here) via
ZRANGE
, you can read position from 1 to 9. The final step filter by
\x00
at tail, and output ‘house’ and ‘horse’. And if you key-in ‘hor’, will output ‘horse’.
ZRANGE autocomplete_index 2 52 # <--- 2 is 1+1,52 is 1+1+50
Sort by frequency - complicated version
We need to use another zset r/w strategy because the scenario has external score.
Write data
When writing data, we still use
ZADD
to write all tokens to zset. But we add token at index name tail and real score additionally.
1. ZADD autocomplete_index:h 100 house
2. ZADD autocomplete_index:ho 100 house
3. ZADD autocomplete_index:hou 100 house
4. ZADD autocomplete_index:hous 100 house
5. ZADD autocomplete_index:house 100 house
6. ZADD autocomplete_index:h 240 horse
7. ZADD autocomplete_index:ho 240 horse
8. ZADD autocomplete_index:hor 240 horse
9. ZADD autocomplete_index:hors 240 horse
10. ZADD autocomplete_index:horse 240 horse
The
100
in the middle position of command represents the score is
100
when writing data. So the arrangement result will sort by lexicographic order. The sort algorithm is order by ascending from head to tail character sequences. The result looks like following.
# autocomplete_index:h
autocomplete_index:h 100 house
autocomplete_index:h 240 horse
# autocomplete_index:ho
autocomplete_index:ho 100 house
autocomplete_index:ho 240 horse
# autocomplete_index:hou
autocomplete_index:hou 100 house
# autocomplete_index:hous
autocomplete_index:hous 100 house
# autocomplete_index:house
autocomplete_index:house 100 house
# autocomplete_index:hor
autocomplete_index:hor 240 horse
# autocomplete_index:hors
autocomplete_index:hors 240 horse
# autocomplete_index:horse
autocomplete_index:horse 240 horse
Read data
You don’t need
ZRANK
to indicate the position, only use
ZRANGE
can read data. But you need some parameters (
BYSCORE
and
REV
), it represents using score to ordering them.
ZRANGE autocomplete_index:ho +INF -INF BYSCORE REV LIMIT 0 50
Comparison between simple and complicated version
Whether sort by length or by frequency, the
complicated
version can be suitable them. But the
complicated
version has some problem (e.g. maintenance)
Comparison | Simple version | Complicated version |
---|---|---|
Write data | We will write every n-gram of keyword, in addition to add a complete keyword with delimiter | When writing, we will append every n-gram of keyword at key tail, and value is complete keyword |
Read data |
Use
ZRANK
to indicate position then use
ZRANGE
to retrieve data, the final step is filter out unnecessary value. It can developed by lua
|
Use
ZRANGE
to retrieve data directly
|
Delete data |
Use
DEL
to delete all data
|
Use
SCAN
iterating to delete data
|
Data size (key prefix is
autocomplete_index
, and data are
house
and
horse
)
|
68 bytes | 242 bytes |
Summary
-
To save the data storage: Use
simple
version, but more code in the backend in order to pre-processing and post-processing -
Sort by frequency: You must use
complicated
version, because ifsimple
version can’t sort directly. -
Simple
version can’t ensure the count of result, because theZRANGE
of result includes unnecessary data. It maybe contains no data after filtering. -
When maintaining data, redis need more extra 1x data size. The new data use different key prefix when writing. The final step use
RENAME
to rename key prefix.-
If you use
complicated
version, it will consume more time whenRENAME
process.
-
If you use