Solving Crosswords with Information Retrieval

Dec 16, 2020

James Worthington

As a new crossword enthusiast, both as a solver and a writer, I’ve become acquainted with some common trends among puzzles’ clues and answers. To accommodate the grid format, certain words have become commonplace because they are short and full of vowels or other common letters. To list a few examples, “EON”, “ASP”, and “MAO” are familiar to anyone who dabbles in the world of crossword puzzles.

Seeing these commonalities, I was drawn to the possibility of creating a system that could search for crossword answers based on the provided clue and the letter pattern of the word (e.g. “a five-letter word starting with R”).

To undertake this task, I needed an adequate amount of relevant data. First and foremost, I collected crossword clues and answers from hundreds of previous New York Times crosswords. To augment this data, I also collected dictionary definitions, to ensure that my database had a plethora of words to select from. After all the parsing was said and done, I had over 450,000 words in my database, each paired with definitions and clues from previous crosswords. For a sample, let’s see the database entries for those three common crossword answers I mentioned before: eon, asp, and Mao.

To create a viable system, the program needs to look at all relevant words and see how well they match the user-provided clue. If the user enters A?? (a three-letter word starting with ‘A’) as the word’s pattern and ‘Egyptian biter’ as the clue, the program looks only at words of the appropriate length starting with ‘A’ and compares their former clues and definitions against the user-provided clue ‘Egyptian biter’.
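The pattern-filtering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the project: the function names `pattern_to_regex` and `candidates` and the tiny `WORDS` list are hypothetical, and `?` is assumed to stand for exactly one unknown letter.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a crossword pattern such as 'A??' into a compiled regex.

    '?' stands for one unknown letter; any other character must match as-is.
    """
    parts = ["[A-Z]" if ch == "?" else re.escape(ch.upper()) for ch in pattern]
    return re.compile("".join(parts))


def candidates(pattern: str, words: list[str]) -> list[str]:
    """Keep only the words whose length and known letters fit the pattern."""
    regex = pattern_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w.upper())]


# Hypothetical miniature word list standing in for the real database.
WORDS = ["asp", "eon", "Mao", "ant", "apple", "ale"]

print(candidates("A??", WORDS))
```

Only the words that survive this filter need to be scored against the clue, which keeps the expensive comparison step small.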

Naively, we can just check for the presence of the words ‘Egyptian’ and ‘biter’ in the previous clues, and the word that has the most matches is our best suggestion. Of course, this has a number of obvious limitations. Variants of the words in the user-provided clue (‘Egypt’ instead of ‘Egyptian’ or ‘bites’ instead of ‘biter’) would not contribute towards the relevance score of the target word. To address this issue, I use spaCy’s word similarity functionality to determine how similar two texts are. Without getting into details, this allows us to determine how similar two terms are in a much more robust manner. Synonyms have high similarity scores, as do different tenses and pluralities of the same word, which allows for more accurate determinations of whether a word is relevant or not.
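To make the difference concrete, here is a minimal sketch of a scoring function with a pluggable similarity measure. Everything in it is an assumption for illustration: the names `score`, `exact_match`, and `shared_prefix` and the entries in `PAST` are hypothetical. `exact_match` reproduces the naive word-presence count, while `shared_prefix` is a deliberately crude stand-in for a real similarity measure. It is not spaCy's vector-based similarity, but a function built on spaCy could be passed in as `sim` in the same way.

```python
import re
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def exact_match(a: str, b: str) -> float:
    """Naive measure: two words are either identical or unrelated."""
    return 1.0 if a == b else 0.0


def shared_prefix(a: str, b: str) -> float:
    """Crude stand-in for a real similarity measure: the share of the
    longer word that is covered by the two words' common prefix."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b))


def score(clue: str, past_texts: list[str],
          sim: Callable[[str, str], float] = exact_match) -> float:
    """For each word in the clue, take its best similarity to any word in
    the candidate's past clues and definitions, then sum those values."""
    past_words = [w for text in past_texts for w in tokenize(text)]
    if not past_words:
        return 0.0
    return sum(max(sim(q, w) for w in past_words) for q in tokenize(clue))


# Hypothetical past clues, not real database entries.
PAST = {
    "ASP": ["Snake that bites in Egypt"],
    "ANT": ["Picnic pest"],
}

for word, texts in PAST.items():
    print(word,
          score("Egyptian biter", texts),
          score("Egyptian biter", texts, shared_prefix))
```

With exact matching, neither candidate scores anything for ‘Egyptian biter’, because ‘Egypt’ and ‘bites’ are not identical to the clue's words. With a graded measure, ‘ASP’ pulls ahead, which is the effect a proper word-similarity model is meant to deliver more reliably.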

Using this method, our system returns ‘ASP’ as its top suggestion for the provided clue. When tested on clues from real crossword puzzles, not included in the database, the program returns the correct answer in its top ten suggestions 49.35% of the time, or 64.29% of the time if one letter in the word’s pattern is provided, a common case when solving crosswords.
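For readers who want to reproduce this kind of measurement, a top-ten hit rate can be computed along the following lines. The function name `top_k_accuracy` and the toy data are hypothetical, and the numbers below are made up for the example; they are unrelated to the results reported above.

```python
def top_k_accuracy(ranked_suggestions: list[list[str]],
                   answers: list[str], k: int = 10) -> float:
    """Fraction of test clues whose true answer appears among the
    first k suggestions returned for that clue."""
    if not answers or len(ranked_suggestions) != len(answers):
        raise ValueError("need one non-empty ranked list per answer")
    hits = sum(
        1
        for ranked, answer in zip(ranked_suggestions, answers)
        if answer in ranked[:k]
    )
    return hits / len(answers)


# Toy data: each inner list is the system's ranked output for one clue.
suggestions = [
    ["ASP", "ANT", "ALE"],   # true answer ranked first
    ["ERA", "EON", "ELM"],   # true answer ranked second
    ["MAT", "MAP", "MAN"],   # true answer missing
]
truth = ["ASP", "EON", "MAO"]

print(top_k_accuracy(suggestions, truth, k=10))
print(top_k_accuracy(suggestions, truth, k=1))
```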

Of course, the joy of crosswords is not in finishing them, but in figuring out cryptic clues to intersecting answers. This program is an interesting exercise in the underlying structure of crosswords, but I don’t plan on using it to help solve them. Nonetheless, we were able to leverage past data and similarity measurements to solve an interesting problem in the realm of information retrieval.

Written by Jworthy