Fun with english words

In this post, I explain how I made the program “SubWordsFinder” that takes a string, and finds all real English words that contain that string, then removes the string from the match, and sees if the remaining characters form a real English word or not.

It started with me stumbling across the following “truism” on Facebook the other day:

“If you take the PI out of OPINION, you’re left with an ONION

My first thought was “gee, these guys have too much time on their hands. My second thought was: “Hmm, maybe I can do that as well?”

i.e. if I run SubWordsFinder tea, I expect the results to include:
”protease –> prose”

and SubWordsFinder ox should include
”toxic –> tic”

Finding all the words

So I started to search for “all English words download” on Bing and quickly found The English Open Word List (EOWL), which I then downloaded and extracted into a folder:
image
The ZIP file included all English words in both a CSV format as well as LF separated entries in a series of textfiles. I chose to attack the latter, and leave the CSV format alone.

Loading the files

I opted to create class WordsFinder that takes the path to the folder containing all the text files as its input:
image

Next, I created a LoadWords() method that loads all the files in parallel. I used a ConcurrentDictionary<string, IEnumerable<string>> in order to be able to work in parallel and not having to worry about locking issuesNyah-Nyah:
image
I did however, need to lock the variable finalWordCount in order to avoid writing to it from two separate threads.

Finding the SubString

Next challenge was to identify all words containing the substring that I was after. Turns out, this was really easy as well. Because I had split all the  words into a dictionary where the first letter of the word is the key, and because I had them all in a ConcurrentDictionary, I could parallelize the search to make it blazing fast:
image
Again, using Concurrent collections makes writing to them thread-safe, so a ConcurrentBag is where I stored all my matches. In my main program, I store all matches in the variable wordsContainingMatch.

Ok, so now I have a way of finding all words in the English dictionary that contains any substring. Next up is to remove that substring from the match:
image

The idea here is simply to create pairs of words containing the substring, and the corresponding suggestion for a word without it. These aren’t yet verified as English words, they’re simply strings with that substring removed. You can see that the form of the words is a string with a colon ‘:’ where the original match is on the left, and the suggested word without that match is on the right.

Last piece of the puzzle

With this new collection, wordsWithMatchRemoved, It was simply a matter of figuring out if the right side of the colon is a real English word or not, by checking against our dictionary:
image
I extract the string to the right of the colon, and see if that string exists in my dictionary using the Contains method:
image

As long as the string exists as an English word, it is added to the collection actualWordsWithMatchRemoved. I know, I could probably come up with better names…

Running it

Putting this all together in a running console application, I added a few System.Diagnostics.Stopwatch instances to time the program. Here’s running against the string PI:

image
Searching for lad gave me these:

image

So I could say “If you take the lad away from the paladin, you’ll end up with pain

I know, weird humor. But a fun way to spend 50 minutes on a saturday morning.

Contact me if you want the full source code, it’s so ugly that I’m not going to post it on github Smile

 

Pedro

About digitaldias

Software Engineer during the day, photographer, videographer and gamer in the evening. Also a father of 3. Pedro has a strong passion for technology, and gladly shares his findings with enthusiasm.

View all posts by digitaldias →

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.