Gibberish Detector

v1.2 BETA, 2021-01-27 by Robert Giordano

What is gibberish? An example would be "Thyxqkhsitopj" entered as input in a form or messaging app. While it's easy for humans to see that's gibberish, it turns out to be much more difficult for software to detect this kind of thing.

A gibberish detector does NOT simply compare an input to a list of words. First of all, the list would have to be huge, which could cause performance issues. Second, valid input would still be rejected if there were spelling errors or new words not found in the current list.

A few people have done extensive research on detecting gibberish and they have come up with some fairly good functions. However, all of them have some issues. As usual, I wanted to build something better. You can test my latest version here:

Enter a word or gibberish:

Letters only. No numbers, hyphens, accents, or punctuation

Goals for this Project:

I started this project on December 27, 2020 and I set the following goals:

Accept 99.9% of the words in a "Target List" of real words.
Reject 98.0% of random, software-generated letter sequences (gibberish).
Reject 99.0% of random letter sequences from physically mashing keys on a keyboard (other people's algorithms have trouble with this for some reason).
Accept words that are not listed in any dictionary or list, as long as they "appear" to be similar to real words.
Fast enough to check at least 50,000 words/second.
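
The first three goals come down to two rates: the share of real words accepted and the share of gibberish rejected. Here is a minimal sketch of how those could be measured. The isGibberish function below is a hypothetical stand-in (a toy vowel check), NOT the algorithm described on this page, and the function names are made up for illustration.

```javascript
// Hypothetical stand-in detector: flags words that contain no vowels.
// It is NOT the algorithm described on this page, only a placeholder.
function isGibberish(word) {
  return !/[aeiouy]/i.test(word);
}

// Percentage of real words that the detector accepts.
function acceptRate(words, detector) {
  const accepted = words.filter((w) => !detector(w)).length;
  return (accepted / words.length) * 100;
}

// Percentage of gibberish strings that the detector rejects.
function rejectRate(strings, detector) {
  const rejected = strings.filter((s) => detector(s)).length;
  return (rejected / strings.length) * 100;
}

// Tiny illustrative samples; real test lists hold hundreds of thousands of entries.
const realWords = ["keyboard", "detector", "rhythm", "banana"];
const gibberish = ["xkcdqz", "pqrstv", "thyxqkhsitopj", "zzzz"];

console.log(acceptRate(realWords, isGibberish).toFixed(1) + "% accepted");
console.log(rejectRate(gibberish, isGibberish).toFixed(1) + "% rejected");
```

The toy detector lets "thyxqkhsitopj" through because it contains vowels, which is exactly why a naive check is not good enough.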

Performance Tests:

Exactly one month later, I have obtained the following results:

Word List	Total Words	Accepted	Rejected	Accuracy*
Target List, Unix Words	232,345	232,345	0	100.00%
Scrabble, NASPA 2018	192,111	192,110	1	100.00%
Scrabble, OTCWL 2016	188,738	188,734	4	100.00%
Random generated strings	250,000	143	249,857	99.94%
Random keyboard mashes	800	0	800	100.00%

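*Rounded to 2 decimal places, so 99.997 is shown as 100.00. For the word lists, accuracy is the share of words accepted; for the random lists, it is the share rejected.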
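
The rounding in the Accuracy column can be checked with a few lines of arithmetic. This sketch uses only the totals from the table above; the accuracy helper is a made-up name for illustration.

```javascript
// Accuracy as a percentage string, rounded to 2 decimal places.
function accuracy(correct, total) {
  return ((correct / total) * 100).toFixed(2);
}

// [list, total, correct] taken from the table above. For word lists,
// "correct" means accepted; for random strings it means rejected.
const rows = [
  ["Scrabble, NASPA 2018", 192111, 192110],
  ["Scrabble, OTCWL 2016", 188738, 188734],
  ["Random generated strings", 250000, 249857],
];

for (const [name, total, correct] of rows) {
  console.log(name + ": " + accuracy(correct, total) + "%");
}
```

Before rounding, the three rows work out to roughly 99.9995%, 99.9979%, and 99.9428%.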

I wrote the main algorithm and all supporting functions in client-side JavaScript, with no external libraries. On a 2012 MacBook Pro (2.3 GHz Intel Core i7) running Firefox 78.7.0, it checked all 234,345 words in the Target List in 2.42 seconds, or 96,836 words/sec. This page uses a PHP version of my original algorithm. And yes, I'm STILL using my 2012 MacBook that I photographed here.
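
For anyone who wants to run a similar speed test, here is a rough sketch of a timing loop. The wordsPerSecond helper and the placeholder detector are assumptions for illustration, not the code used for the numbers above. The last lines simply recompute the quoted rate from the quoted word count and time.

```javascript
// Times a detector over a word list and reports words per second.
// `detector` is any function that takes a word; the caller supplies it.
function wordsPerSecond(words, detector) {
  const start = Date.now();
  for (const word of words) detector(word);
  const seconds = (Date.now() - start) / 1000;
  // Very short runs can finish in under a millisecond, so guard the division.
  return { seconds, rate: seconds > 0 ? words.length / seconds : Infinity };
}

// Placeholder detector so the sketch runs on its own.
const result = wordsPerSecond(["alpha", "beta", "gamma"], (w) => w.length === 0);
console.log(result);

// The figure quoted above, recomputed: 234,345 words in 2.42 seconds.
const quotedRate = Math.floor(234345 / 2.42);
console.log(quotedRate + " words/sec");
```

Date.now() only has millisecond resolution, so a meaningful measurement needs a list large enough to take a noticeable amount of time.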

Notes

TARGET LIST
Gibberish algorithms (mine and others) all have to be "taught" what GOOD words look like. Different lists of words can be selected to train the algorithm, depending on your specific needs. For example, if you're only dealing with names, you would find a large list of names to use for training. I wanted this page to be general purpose so I gave it around 250,000 words, including some common names, slang, and abbreviations.

The word lists I used for training add up to around 4 MB, but the data file generated from the training session is less than 250 KB. This small data file only needs to be loaded once when checking a list of words.
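
To give a feel for what "training" can mean, here is a minimal sketch of the generic 2-gram idea: count which letter follows which in the word list, then score new input against those counts. This is NOT the algorithm behind this page, and the names train and score are made up for illustration. It does show why trained data can be much smaller than the training lists: the table size depends on the alphabet, not on the number of words.

```javascript
const A = 26; // letters a-z only

// Builds a 26 x 26 table of log-probabilities for "letter j follows letter i".
function train(words) {
  // Start every count at 1 (add-one smoothing) so unseen pairs never give log(0).
  const counts = Array.from({ length: A }, () => new Array(A).fill(1));
  for (const word of words) {
    const w = word.toLowerCase();
    for (let k = 0; k + 1 < w.length; k++) {
      const i = w.charCodeAt(k) - 97;
      const j = w.charCodeAt(k + 1) - 97;
      if (i >= 0 && i < A && j >= 0 && j < A) counts[i][j]++;
    }
  }
  return counts.map((row) => {
    const total = row.reduce((sum, c) => sum + c, 0);
    return row.map((c) => Math.log(c / total));
  });
}

// Average log-probability per letter pair; higher means more word-like.
function score(model, word) {
  const w = word.toLowerCase();
  let sum = 0;
  let pairs = 0;
  for (let k = 0; k + 1 < w.length; k++) {
    const i = w.charCodeAt(k) - 97;
    const j = w.charCodeAt(k + 1) - 97;
    if (i >= 0 && i < A && j >= 0 && j < A) {
      sum += model[i][j];
      pairs++;
    }
  }
  return pairs > 0 ? sum / pairs : -Infinity;
}

// Toy training list; a real one would hold hundreds of thousands of words.
const model = train(["there", "three", "other", "north", "earth", "heart"]);
console.log(score(model, "throne") > score(model, "xqzjkv"));
console.log(JSON.stringify(model).length + " characters when saved as JSON");
```

A real detector would also need a cutoff score that separates good words from gibberish, chosen by testing against lists like the ones in the table above.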

SCRABBLE WORDS
In the Official 2018 North American Scrabble Players Association (NASPA) word list, the only word not accepted by my algorithm is "qajaqs". In the previous 2016 Official Tournament and Club Word List (OTCWL), the four words not accepted by my algorithm are "cazher", "cazhest", "drekkish", and "qajaqs". The first three are no longer present in the 2018 word list. For more info, see wikipedia.org/wiki/NASPA_Word_List.

RANDOM GENERATED STRINGS
I wrote a simple PHP function to generate random strings from 8 to 15 characters, using only the letters a-z. Next, I sorted the list alphabetically and removed any duplicates. When generating a large number of random strings, there will always be a few real words in the list, simply due to chance.

Therefore, you will never achieve 100% rejection against randomly generated strings. Even so, many of the strings that get accepted are not actual dictionary words, just something close to one. The trick is to reject as many of these "borderline" words as possible, WITHOUT rejecting more real words from the Target List.
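
The original generator was a PHP function that isn't shown on this page. As a rough JavaScript sketch of the same steps (random a-z strings of 8 to 15 characters, duplicates removed, sorted alphabetically), with randomString and randomList as made-up names:

```javascript
// Returns one random string of lowercase a-z letters, minLen to maxLen long.
function randomString(minLen, maxLen) {
  const len = minLen + Math.floor(Math.random() * (maxLen - minLen + 1));
  let out = "";
  for (let i = 0; i < len; i++) {
    out += String.fromCharCode(97 + Math.floor(Math.random() * 26));
  }
  return out;
}

// Generates `count` strings, removes duplicates, and sorts alphabetically.
function randomList(count, minLen = 8, maxLen = 15) {
  const unique = new Set();
  for (let i = 0; i < count; i++) unique.add(randomString(minLen, maxLen));
  return [...unique].sort();
}

const sample = randomList(1000);
console.log(sample.length, sample.slice(0, 3));
```

Because duplicates are dropped, the final list can be slightly shorter than the number of strings requested.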

RANDOM KEYBOARD MASHES
I don't know if there's a better term for it, but this is where you just mash a bunch of keys on the keyboard to make random words. Apparently, some of the other gibberish detectors have trouble with this, rejecting only 79% of 1,000 gibberish words. For my current list of 800 words, I not only mashed random sections of my mechanical keyboard but also typed in many of the linear sequences of keys, for example werty, ertyu, rtyui, tyuio, yuiop, etc. I did it both left to right and right to left.

I did NOT include "qwerty", which is a real dictionary word and a valid Scrabble word. It is also quite difficult to produce "qwerty" by randomly mashing keys. Try it, you'll see.
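
The linear sequences were typed by hand, but they could also be generated. Here is a small sketch that slides a window along each letter row of a QWERTY keyboard in both directions. The window length of 5 is an assumption based on the examples above, and the names ROWS, EXCLUDE, and linearSequences are made up for illustration.

```javascript
// The three letter rows of a QWERTY keyboard.
const ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];

// Real words that should never end up in a gibberish test list.
const EXCLUDE = new Set(["qwerty"]);

// Every run of `size` adjacent keys in each row, read left to right
// and right to left, minus anything in EXCLUDE.
function linearSequences(size) {
  const out = [];
  for (const row of ROWS) {
    for (let i = 0; i + size <= row.length; i++) {
      const forward = row.slice(i, i + size);
      const backward = [...forward].reverse().join("");
      for (const run of [forward, backward]) {
        if (!EXCLUDE.has(run)) out.push(run);
      }
    }
  }
  return out;
}

const runs = linearSequences(5);
console.log(runs.length);
console.log(runs.slice(0, 4));
```

With a window of 6 the generator would produce "qwerty" itself, which is why the exclusion set is there.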

UPCOMING BLOG
Maybe I'm the only one who found this journey fascinating, but I'll be writing a detailed blog post about it. I will show how my algorithm uses a different approach than the others I found on Google and on GitHub. I have a lot of data to share and I can also share the word lists I used for testing (I cannot share the Scrabble lists because they are not public domain). Stay tuned.

Acknowledgments

Here is a list of gibberish detection projects I found. While I wrote all of my own code from scratch, some of these projects served as inspiration and benchmarks for me to measure against. I'll be exploring each one of these in my upcoming blog. I've ordered them by date because some are based on the work of the previous author.

github.com/rrenaud/Gibberish-Detector
Jun 2011, updated Aug 2015. This was the earliest attempt at gibberish detection I could find. It's written in Python, a language I've never used.
github.com/buggedcom/Gibberish-Detector-PHP
Oct 2011, updated Sep 2016. They took rrenaud's code and translated it to PHP. I installed it on my server but it was too slow to check my Target List. I had to make some slight modifications in order to get comparable percentages. I'll detail all of the results in my blog.
github.com/casics/nostril
Aug 2018, updated Nov 2019. Mike Hucka from Caltech seems to have done a lot of research on gibberish detection and I'm using his list of Unix Words as my Target List (/nostril/training/word-corpora/nltk/3.2.2/nltk_data/corpora/words/). I haven't looked at his code yet either, because it's also in Python. On his GitHub page he has a chart of Performance Tests, and since we share the same Target List, it is probably the best comparison so far.
github.com/gschoppe/Gibberish-Detector
May 2019, updated May 2019. This author has a PHP version that sounds like the same approach as rrenaud but I haven't compared his code to buggedcom's PHP code. I'll probably examine it at some point. He is one of the few with a live demo here.
dataanalyticsruan.com/2019/09/27/gibberish-detection-using-brown-corpus-and-nlp-techniques/
Sep 2019, updated Sep 2019. A blog post with more Python code. However, this is the only person in this list who talks about using BOTH "bigrams" and "trigrams" (both are kinds of n-grams, i.e. 2-grams and 3-grams) for analysis. Everyone else in this list only uses 2-grams. The problem, of course, is that there are only 676 possible combinations of 2 letters (26 x 26 = 676 2-grams), whereas there are 17,576 possible 3-grams and 456,976 possible 4-grams. He also talks about using the Brown Corpus, which has over a million words. I felt that was overkill for my algorithm.
github.com/gaganmanku96/Gibberish-Detection
Jul 2020, updated Jul 2020. This author mentions rrenaud and has some really nice looking charts but the project itself looks very similar to rrenaud's. Like the others, it's also written in Python. I don't know if he's added anything.

Why is everyone writing this stuff in Python? Why is everyone using the same 2-dimensional array? HINT: My algorithm looks at 2-grams, 3-grams, AND 4-grams, but not quite like anyone here is doing.
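
For anyone new to n-grams, the counts quoted in the list above are just powers of 26, and pulling the n-grams out of a word is a short loop. A quick sketch (the names possibleNgrams and ngrams are made up for illustration, and this says nothing about how my algorithm actually uses them):

```javascript
// Number of distinct n-grams over the 26 letters a-z.
function possibleNgrams(n) {
  return 26 ** n;
}

// All overlapping n-grams in a word, in order of appearance.
function ngrams(word, n) {
  const out = [];
  for (let i = 0; i + n <= word.length; i++) out.push(word.slice(i, i + n));
  return out;
}

console.log([2, 3, 4].map(possibleNgrams)); // [676, 17576, 456976]
console.log(ngrams("gibberish", 2));
console.log(ngrams("gibberish", 3));
```

A 2-gram table fits in a 26 x 26 array, which is the "2-dimensional array" mentioned above. Longer n-grams need many more cells, so a table for them is much larger and much sparser.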

I'll explore all of this in my as-yet-unwritten blog. I'll post the link here once it exists. If you made it this far, thanks for your interest!!

❤ Robert