blog

ブログ

SoCal developer building with Ruby, JavaScript and Python.


Autocomplete at Scale: Building a Prefix Tree for the English Dictionary...

04 Mar 2026

Autocomplete at Scale: Building a Prefix Tree for the English Dictionary

Introduction

Autocomplete is one of those things we (sometimes frustratingly) interact with multiple times a day. We’ve all accidentially sent “duck” when we meant… well, the other word 🤣

While sometimes inaccurate, that autocomplete is blazingly fast. The naive solution, scanning every word in a vocabulary list every time you type a character, doesn’t scale. With a 235,000-word English dictionary, that means checking every single word on every single keystroke.

That’s why, for a recent course I took on data structures and algorithms, I decided to try building an autocomplete system backed by a prefix tree which makes search time independent of vocabulary size and proportional only to the length of the prefix typed. In this post, I’ll dive into what a prefix tree is, how it works conceptually, how I implemented it in Python, and why the performance gains are so so cool!


Before getting into prefix trees it’s important to get a baseline with the naive implementation.

A simple autocomplete with a plain list and linear search can look something like this:

def autocomplete(prefix, word_list):
    return [word for word in word_list if word.startswith(prefix)]

It’s clean and readable Python. And frankly it works fine for small vocabularies. But no matter what prefix you search for, it always scans every word in the list. And for a language like English, with 235,976 words, every single query visits all 235,976 entries. Typing “a”, “au”, “aut”, and “auto” each trigger a full scan of the entire list. In algorithm terms, this is O(n): linear in the size of the vocabulary. Double the vocabulary, double the search time.


What is a Prefix Tree?

A prefix tree (or trie, pronounced ‘try’) is a tree data structure that stores strings by breaking them into individual characters. Different from a list which would store each word in its entirety. With the prefix tree, each node represents a single character and paths from the root to terminal nodes spell out complete words.

The performance gains from a prefix tree come from the fact that words which share a common prefix share a common path in the tree. Thus the words “apple”, “apply”, and “apt” all start with ‘a’ and ‘p’, so they share the same first two nodes before diverging.

Here’s how a prefix tree looks after inserting apple, apply, apt, and banana:

           root
          /    \
         a      b
         |      |
         p      a
        / \     |
       p   t*   n
       |        |
       l        a
      / \       |
     e*  y*     n
                |
                a*

Nodes marked with * are terminal nodes. They mark where a complete word ends.

Each node in the tree stores three things:

  • A character (the letter it represents)
  • A dictionary of children (mapping each next character to its child node)
  • A terminal flag (True if a complete word ends at this node)

Using a dictionary for children affords a O(1) lookup for any child character regardless of how many children a node has.


How Inserts Work

Inserting a word walks down the tree one character at a time. If a node for the current character already exists, follow it. If not, create one. When the final character is reached, set the terminal flag. A pseudocode implementation of this looks like:

insert(string):
    current = root
    for each character in string:
        if current has a child for character:
            current = that child
        else:
            create a new node for character
            add it as a child of current
            current = new node
    mark current as terminal
    increment tree size

Here’s how inserting “apply” might look for a tree that already contains the word “apple”:

  • ‘a’ → exists, follow it
  • ‘p’ → exists, follow it
  • ‘p’ → exists, follow it
  • ‘l’ → exists, follow it
  • ‘y’ → does not exist -> create new node, mark it as terminal

Through this insertion, only one new node was created! The shared prefix “appl”, four characters, costs nothing extra since it already exits! And this is the prefix tree’s core efficiency: shared structure for shared prefixes. This shared structured reveals its benefits at search time…


How Search Works

Finding all completions for a prefix is a two-step process:

Step 1. Walk the prefix: traverse the tree following each character in the prefix. If no child exists for a character, the prefix doesn’t exist in the tree so return an empty list.

Step 2. Collect completions: from the node where the prefix ends, perform a depth-first traversal of the entire subtree beneath. Every terminal node encountered during this traversal is a valid completion.

Here’s how searching looks for the prefix "ap" in the tree from before along with some pseudocode:

           root
          /    \
         a      b          <- Step 1: follow 'a'
         |      |
         p      a          <- Step 1: follow 'p' — prefix matched!
        / \     |
       p   t*   n          <- Step 2: traverse subtree, collect terminals
       |        |
       l        a
      / \       |
     e*  y*     n          <- collect 'apple' *, collect 'apply' *
                |
                a*

Completions found: ['apple', 'apply', 'apt']
                    ^        ^        ^
                    |        |        t* is also in the subtree
def complete(self, prefix):
    completions = []
    node, depth = self._find_node(prefix)
    if depth == len(prefix):   # full prefix was matched
        self._traverse(node, prefix, completions.append)
    return completions

def _traverse(self, node, prefix, visit):
    if node.is_terminal():
        visit(prefix)
    for character, child_node in node.children.items():
        self._traverse(child_node, prefix + character, visit)

The _traverse function here is the recursive depth-first search. It takes in a visit callback, for searching that’d be completions.append, which makes sure we capture all the complete words.


Implementation

The tree is built from two classes: PrefixTreeNode and PrefixTree.

PrefixTreeNode stores a character, a dictionary of children, and a terminal flag. For example:

class PrefixTreeNode:
    def __init__(self, character=None):
        self.character = character
        self.children = {}    # char -> PrefixTreeNode
        self.terminal = False

PrefixTree manages the root node and exposes the public interface. The root stores an empty string which represents the entry point before any character has been matched. For example:

class PrefixTree:
    def __init__(self, strings=None):
        self.root = PrefixTreeNode('')
        self.size = 0
        if strings is not None:
            for string in strings:
                self.insert(string)

The Autocomplete Application

For my autocomplete application I built a CLI tool allowing users to type in a prefix and see matching words!

It’s data source is the full English dictionary, 235,976 words. At the start of the program all those words are loaded into the tree:

vocabulary = get_lines('/usr/share/dict/words')
tree = PrefixTree(vocabulary)

Then any prefix query can return results in a few microseconds:

completions = tree.complete('auto')   # returns 478 words
completions = tree.complete('axl')    # returns 4 words: axle, axled, axlesmith, axletree

My CLI tool has three modes which are triggered by flags.

Single lookup - search with either algorithm and see timing:

python3 autocomplete.py <prefix> --algorithm trie

Benchmark mode - run both algorithms on the same prefix and compare:

python3 autocomplete.py <prefix> --benchmark

Interactive mode - load the dictionary once into a prefix tree, then type freely:

python3 autocomplete.py --interactive
Loading 235,976 words into prefix tree...
Ready in 1.50 sec. Type a prefix to autocomplete. Press Enter to quit.

> auto
478 completions: autobahn, autobus, autoclave, autocracy, autocrat, ...

> axl
4 completions: axle, axled, axlesmith, axletree

Algorithm Analysis

Linear Search: O(n)

With the naive implementation of autocomplete, every query scans the full vocabulary. With n = 235,976, every keystroke costs the same amount of work, regardless of the prefix.

Prefix Tree Search: O(k + m)

With the prefix tree, our searching is completely removed from the size of the full vocabulary. The vocabulary could grow to 10 million words and the search time wouldn’t change, only the start-up time to build the tree (which is a one-time cost).

  • k = length of the prefix (time to walk down to the prefix node)
  • m = number of completions (time to traverse the result subtree)

Benchmark Results

These numbers were measured on the full 235,976-word system dictionary running my application with the --benchmark flag:

Prefix Completions Linear Search Prefix Tree Speedup
a 14,545 0.007667 sec 0.012010 sec ~1x
pre 3,018 0.007820 sec 0.002673 sec ~3x
auto 478 0.007317 sec 0.000546 sec ~13x
axl 4 0.007458 sec 0.000021 sec ~355x

While the linear search time is nearly constant at ~7.5ms across all prefixes, it always scans all 235,976 words. On the other hand, The prefix tree gets dramatically faster as the prefix gets longer and the result set shrinks.

Setup Time Tradeoff

Algorithm Setup Time
Linear Search ~0.000001 sec
Prefix Tree ~1.48 sec

There is one got-ya with the prefix tree implementation and that is its setup time. The prefix tree takes 1.48 seconds to build. Wheras the linear search implementation has no build time. In a script run once and discarded, that cost can feel steep. But in reality, any real-world deployment for instance a web server, search backend, etc. will build the tree once at startup and then handle thousands of queries per second. A single query saved 7ms so after just 212 queries the setup cost has paid for itself.


Conclusion

A prefix tree is one of the most elegant applications of tree data structures to a real-world problem and it works wonderfully for something like autocomplete. By storing characters along shared paths it eliminates redundant work at every level of search. Two words that share a prefix share the entire cost of storing and searching that prefix which is a huge advantage over a simple linear search!

Adding benchmark capabilitiy to this CLI tool reveals a compelling story: for targeted queries that define real autocomplete usage, such as longer, more specific prefixes, the prefix tree is orders of magnitude faster than a linear scan. A 355x speedup on a 4-letter prefix against a 235,000-word dictionary is not just a marginal improvement, it’s the difference between a feature that scales and one that doesn’t.

As vocabularies grow and users expect instant results, the prefix tree is a near perfect tool for the job!


Sources

  • Vaidehi Joshi (2017). Trying to Understand Tries. Medium
  • Wikipedia contributors. (2024). Trie. Wikipedia. https://en.wikipedia.org/wiki/Trie
  • macOS system dictionary: /usr/share/dict/words — 235,976 words