Redis

Redis is a powerful in-memory database, it includes common structures (set, list, hash) when developing. One of them called Sorted set (zset) can apply to autocomplete feature.

Store no duplicate data sort by lexicographic to set, as the name implies. So don’t extra work when store autocomplete data.

Tokenization

You must to breaks it up into individual characters before write text to zset. E.g. “house” and “horse” can divide into tokens.

Text Result
  • h
  • ho
  • hou
  • hous
  • house
  • h
  • ho
  • hor
  • hors
  • horse

Sort by character length - simple version

Because of the scenario doesn’t use score, can use the simple version of zset r/w strategy.

Write data

To write all tokens to zset, we use ZADD .

1. ZADD autocomplete_index 0 h
2. ZADD autocomplete_index 0 ho
3. ZADD autocomplete_index 0 hou
4. ZADD autocomplete_index 0 hous
5. ZADD autocomplete_index 0 house
6. ZADD autocomplete_index 0 house\x00
7. ZADD autocomplete_index 0 h # <--- duplicated, write data fail
8. ZADD autocomplete_index 0 ho # <--- duplicated, write data fail
9. ZADD autocomplete_index 0 hor
10. ZADD autocomplete_index 0 hors
11. ZADD autocomplete_index 0 horse
12. ZADD autocomplete_index 0 horse\x00

The 0 in the middle position of command represents the score is 0 when writing data. So the arrangement result will sort by lexicographic order. The sort algorithm is order by ascending from head to tail character sequences. The result looks like following.

position = 0 => h
position = 1 => ho
position = 2 => hou
position = 3 => hous
position = 4 => house
position = 5 => house\x00 # <--- real search result
position = 6 => hor
position = 7 => hors
position = 8 => horse
position = 9 => horse\x00 # <--- real search result

The position 6 and 9 add 0x00 (ascii code, means NULL ) at tail. We can use 0x00 when reading data.

Read data

It has two steps when reading data from zset. First of you need use ZRANK to indicate the position when key-in ‘ho’. The result is 1 .

position = ZRANK autocomplete_index ho

The second step that reading data from position 1 to n (n is 50 in here) via ZRANGE , you can read position from 1 to 9. The final step filter by \x00 at tail, and output ‘house’ and ‘horse’. And if you key-in ‘hor’, will output ‘horse’.

ZRANGE autocomplete_index 2 52 # <--- 2 is 1+1,52 is 1+1+50

Sort by frequency - complicated version

We need to use another zset r/w strategy because the scenario has external score.

Write data

When writing data, we still use ZADD to write all tokens to zset. But we add token at index name tail and real score additionally.

1. ZADD autocomplete_index:h 100 house
2. ZADD autocomplete_index:ho 100 house
3. ZADD autocomplete_index:hou 100 house
4. ZADD autocomplete_index:hous 100 house
5. ZADD autocomplete_index:house 100 house
6. ZADD autocomplete_index:h 240 horse
7. ZADD autocomplete_index:ho 240 horse
8. ZADD autocomplete_index:hor 240 horse
9. ZADD autocomplete_index:hors 240 horse
10. ZADD autocomplete_index:horse 240 horse

The 100 in the middle position of command represents the score is 100 when writing data. So the arrangement result will sort by lexicographic order. The sort algorithm is order by ascending from head to tail character sequences. The result looks like following.

# autocomplete_index:h
autocomplete_index:h 100 house
autocomplete_index:h 240 horse

# autocomplete_index:ho
autocomplete_index:ho 100 house
autocomplete_index:ho 240 horse

# autocomplete_index:hou
autocomplete_index:hou 100 house

# autocomplete_index:hous
autocomplete_index:hous 100 house

# autocomplete_index:house
autocomplete_index:house 100 house

# autocomplete_index:hor
autocomplete_index:hor 240 horse

# autocomplete_index:hors
autocomplete_index:hors 240 horse

# autocomplete_index:horse
autocomplete_index:horse 240 horse

Read data

You don’t need ZRANK to indicate the position, only use ZRANGE can read data. But you need some parameters ( BYSCORE and REV ), it represents using score to ordering them.

ZRANGE autocomplete_index:ho +INF -INF BYSCORE REV LIMIT 0 50

Comparison between simple and complicated version

Whether sort by length or by frequency, the complicated version can be suitable them. But the complicated version has some problem (e.g. maintenance)

Comparison Simple version Complicated version
Write data We will write every n-gram of keyword, in addition to add a complete keyword with delimiter When writing, we will append every n-gram of keyword at key tail, and value is complete keyword
Read data Use ZRANK to indicate position then use ZRANGE to retrieve data, the final step is filter out unnecessary value. It can developed by lua Use ZRANGE to retrieve data directly
Delete data Use DEL to delete all data Use SCAN iterating to delete data
Data size (key prefix is autocomplete_index , and data are house and horse ) 68 bytes 242 bytes

Summary

  1. To save the data storage: Use simple version, but more code in the backend in order to pre-processing and post-processing
  2. Sort by frequency: You must use complicated version, because if simple version can’t sort directly.
  3. Simple version can’t ensure the count of result, because the ZRANGE of result includes unnecessary data. It maybe contains no data after filtering.
  4. When maintaining data, redis need more extra 1x data size. The new data use different key prefix when writing. The final step use RENAME to rename key prefix.
    • If you use complicated version, it will consume more time when RENAME process.