In this post, I explain how I made the program “SubWordsFinder” that takes a string, and finds all real English words that contain that string, then removes the string from the match, and sees if the remaining characters form a real English word or not.
It started with me stumbling across the following “truism” on Facebook the other day:
“If you take the PI out of OPINION, you’re left with an ONION“
My first thought was “gee, these guys have too much time on their hands. My second thought was: “Hmm, maybe I can do that as well?”
i.e. if I run SubWordsFinder tea, I expect the results to include:
”protease –> prose”
and SubWordsFinder ox should include
”toxic –> tic”
Finding all the words
So I started to search for “all English words download” on Bing and quickly found The English Open Word List (EOWL), which I then downloaded and extracted into a folder:
The ZIP file included all English words in both a CSV format as well as LF separated entries in a series of textfiles. I chose to attack the latter, and leave the CSV format alone.
Loading the files
I opted to create class WordsFinder that takes the path to the folder containing all the text files as its input:
Next, I created a LoadWords() method that loads all the files in parallel. I used a ConcurrentDictionary<string, IEnumerable<string>> in order to be able to work in parallel and not having to worry about locking issues:
I did however, need to lock the variable finalWordCount in order to avoid writing to it from two separate threads.
Finding the SubString
Next challenge was to identify all words containing the substring that I was after. Turns out, this was really easy as well. Because I had split all the words into a dictionary where the first letter of the word is the key, and because I had them all in a ConcurrentDictionary, I could parallelize the search to make it blazing fast:
Again, using Concurrent collections makes writing to them thread-safe, so a ConcurrentBag is where I stored all my matches. In my main program, I store all matches in the variable wordsContainingMatch.
Ok, so now I have a way of finding all words in the English dictionary that contains any substring. Next up is to remove that substring from the match:
The idea here is simply to create pairs of words containing the substring, and the corresponding suggestion for a word without it. These aren’t yet verified as English words, they’re simply strings with that substring removed. You can see that the form of the words is a string with a colon ‘:’ where the original match is on the left, and the suggested word without that match is on the right.
Last piece of the puzzle
With this new collection, wordsWithMatchRemoved, It was simply a matter of figuring out if the right side of the colon is a real English word or not, by checking against our dictionary:
I extract the string to the right of the colon, and see if that string exists in my dictionary using the Contains method:
As long as the string exists as an English word, it is added to the collection actualWordsWithMatchRemoved. I know, I could probably come up with better names…
Running it
Putting this all together in a running console application, I added a few System.Diagnostics.Stopwatch instances to time the program. Here’s running against the string PI:
Searching for lad gave me these:
So I could say “If you take the lad away from the paladin, you’ll end up with pain”
I know, weird humor. But a fun way to spend 50 minutes on a saturday morning.
Contact me if you want the full source code, it’s so ugly that I’m not going to post it on github
Pedro