v1.2 BETA, 2021-01-27 by Robert Giordano
What is gibberish? An example would be "Thyxqkhsitopj" entered as input in a form or messaging app.
While it's easy for humans to see that's gibberish, it turns out to be much more difficult for
software to detect this kind of thing.
A gibberish detector does NOT simply compare an input to a list of words. First of all, the list would
have to be huge, which could cause performance issues. Second, valid input would still be rejected if there were
spelling errors or new words not found in the current list.
A few people have done extensive research on detecting gibberish and they have come up with some fairly good
functions. However, all of them have some issues. As usual, I wanted to build something better.
You can test my latest version on this page. Input must be letters only: no numbers, hyphens, accents, or punctuation.
Goals for this Project:
I started this project on December 27, 2020 and I set the following goals:
- Accept 99.9% of the words in a "Target List" of real words.
- Reject 98.0% of random, software generated letter sequences (gibberish).
- Reject 99.0% of random letter sequences from physically mashing keys on a keyboard (other people's algorithms
have trouble with this for some reason)
- Accept words that are not listed in any dictionary or list, as long as they "appear" to
be similar to real words.
- Fast enough to check at least 50,000 words/second.
Exactly one month later, I have obtained the following results:
- Target List, Unix Words
- Scrabble, NASPA 2018
- Scrabble, OTCWL 2016
- Random generated strings
- Random keyboard mashes
*rounded to 2 decimal places, so 99.997 is rounded to 100.00
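Accept and reject percentages like the ones targeted above could be measured with a small harness such as the following Python sketch. The function names are assumptions made for illustration, and `toy_is_gibberish` is a hypothetical stand-in that only checks for vowels; it is not the algorithm described on this page.

```python
from typing import Callable, Iterable


def percent(outcomes: Iterable[bool]) -> float:
    """Share of True outcomes as a percentage, rounded to 2 decimal places."""
    results = list(outcomes)
    if not results:
        raise ValueError("the test list is empty")
    return round(100.0 * sum(results) / len(results), 2)


def acceptance_rate(is_gibberish: Callable[[str], bool], words: Iterable[str]) -> float:
    """Percentage of words the detector lets through (use on real-word lists)."""
    return percent(not is_gibberish(w) for w in words)


def rejection_rate(is_gibberish: Callable[[str], bool], words: Iterable[str]) -> float:
    """Percentage of words the detector flags (use on gibberish lists)."""
    return percent(is_gibberish(w) for w in words)


def toy_is_gibberish(word: str) -> bool:
    # Hypothetical stand-in detector: any word without a vowel counts as gibberish.
    return not any(ch in "aeiouy" for ch in word.lower())


real_words = ["apple", "tree", "rhythm", "nth"]
mashes = ["sdfgh", "xkcd", "qwrtp", "aeiou"]
print(f"accepted: {acceptance_rate(toy_is_gibberish, real_words):.2f}%")
print(f"rejected: {rejection_rate(toy_is_gibberish, mashes):.2f}%")
```

Rounding to 2 decimal places inside `percent` matches the footnote: a raw value of 99.997 is reported as 100.00.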
On a 2012 MacBook Pro (2.3 GHz Intel Core i7) using Firefox 78.7.0, it checked all 234,345 words in the Target List
in 2.42 seconds, or 96,836 words/sec. This page uses a PHP version of my original algorithm. And yes, I'm STILL using
my 2012 MacBook that I photographed here.
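The words/sec figure is the list size divided by the elapsed time. A minimal Python sketch of that kind of measurement follows; `words_per_second` and its `check` parameter are hypothetical names, and this is not the timing code used for the numbers above.

```python
import time
from typing import Callable, Sequence


def words_per_second(check: Callable[[str], object], words: Sequence[str]) -> float:
    """Run check() on every word once and return the throughput."""
    start = time.perf_counter()
    for word in words:
        check(word)
    elapsed = time.perf_counter() - start
    return len(words) / elapsed if elapsed > 0 else float("inf")


# The reported figure: 234,345 words checked in 2.42 seconds.
reported_rate = 234_345 / 2.42
print(int(reported_rate))          # whole words per second
print(reported_rate >= 50_000)     # compared against the 50,000 words/second goal
```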
- TARGET LIST
Gibberish algorithms (mine and others) all have to be "taught" what GOOD words look like. Different lists of words can be
selected to train the algorithm, depending on your specific needs. For example, if you're only dealing with names, you
would find a large list of names to use for training. I wanted this page to be general purpose so I gave it
around 250,000 words, including some common names, slang, and abbreviations.
The lists of words I used for training add up to around 4 MB. But the data file generated from the training session is
less than 250 KB. This small data file only needs to be loaded once when checking a list of words.
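To illustrate why trained data can be much smaller than the word lists it came from, here is a minimal Python sketch of the common 2-gram approach, where training reduces any number of words to a fixed 26 x 26 table. This is an assumption-laden illustration of the general technique, not the algorithm behind this page; `train`, `score`, and the `smoothing` parameter are hypothetical names.

```python
import json
import math
from typing import Iterable, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def train(words: Iterable[str], smoothing: int = 1) -> List[List[float]]:
    """Return a 26x26 table of log P(next letter | current letter)."""
    if smoothing <= 0:
        raise ValueError("smoothing must be positive to avoid log(0)")
    counts = [[smoothing] * 26 for _ in range(26)]
    for word in words:
        letters = [INDEX[c] for c in word.lower() if c in INDEX]
        for a, b in zip(letters, letters[1:]):
            counts[a][b] += 1
    table = []
    for row in counts:
        total = sum(row)
        table.append([math.log(c / total) for c in row])
    return table


def score(word: str, table: List[List[float]]) -> float:
    """Average log-probability per letter pair (higher looks more word-like)."""
    letters = [INDEX[c] for c in word.lower() if c in INDEX]
    pairs = list(zip(letters, letters[1:]))
    if not pairs:
        return float("-inf")
    return sum(table[a][b] for a, b in pairs) / len(pairs)


table = train(["there", "then", "other", "three", "the"])
data = json.dumps(table)  # what a training session could write to disk
print(len(table) * len(table[0]))  # 676 entries, however large the word list
print(score("there", table) > score("xqzjk", table))
```

A real detector would train on a far larger list and pick an accept/reject threshold from the scores of known good and bad inputs.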
- SCRABBLE WORDS
In the Official 2018 North American Scrabble Players Association (NASPA) word list, the only word not accepted by my algorithm
is "qajaqs". In the previous 2016 Official Tournament and Club Word List (OTCWL), the four words not accepted by my algorithm
are "cazher", "cazhest", "drekkish", and "qajaqs". The first three are no longer present in the 2018 word list.
For more info, see wikipedia.org/wiki/NASPA_Word_List.
- RANDOM GENERATED STRINGS
I wrote a simple PHP function to generate random strings from 8 to 15 characters, using only the letters a-z. Next, I sorted the
list alphabetically and removed any duplicates. When generating a large number of random strings, there will always be a few
real words in the list, simply due to chance.
Therefore, you will never achieve 100% rejection against randomly generated strings. Even so, many of
the strings that get accepted will not be actual dictionary words, just something close. The trick is to reject as many of these "borderline"
words as possible, WITHOUT rejecting more real words from the Target List.
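The generation step described above (random a-z strings of 8 to 15 characters, sorted, duplicates removed) can be sketched as follows. This is a Python approximation of the idea, not the original PHP function; `random_strings` and its `seed` parameter are hypothetical.

```python
import random
import string
from typing import List, Optional


def random_strings(count: int, min_len: int = 8, max_len: int = 15,
                   seed: Optional[int] = None) -> List[str]:
    """Generate random lowercase strings, then sort and de-duplicate them."""
    rng = random.Random(seed)
    generated = {
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(min_len, max_len)))
        for _ in range(count)
    }
    # The set removes duplicates, so the result may be shorter than `count`.
    return sorted(generated)


sample = random_strings(1000, seed=42)
print(len(sample), sample[:3])
```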
- RANDOM KEYBOARD MASHES
I don't know if there's a better term for it or not, but this is where you just mash a bunch of keys on the keyboard to make
random words. Apparently, some of the other gibberish detectors have trouble with this, rejecting only 79% of 1000 gibberish words.
In my current list of 800 words, not only did I randomly mash sections of my mechanical keyboard, I also typed in many of the
linear sequences of keys, for example werty, ertyu, rtyui, tyuio, yuiop, etc. I did it both left to right and right to left.
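The linear sequences can also be generated rather than typed. The Python sketch below slides a fixed-size window along each keyboard row in both directions. The US QWERTY row strings, the window length of 5, and the `exclude` parameter (for leaving out real words such as "qwerty") are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence

KEYBOARD_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")  # assumed US QWERTY layout


def linear_sequences(rows: Sequence[str] = KEYBOARD_ROWS, length: int = 5,
                     exclude: Iterable[str] = ("qwerty",)) -> List[str]:
    """Return every run of `length` adjacent keys, left to right and right to left."""
    sequences = set()
    for row in rows:
        for start in range(len(row) - length + 1):
            forward = row[start:start + length]
            sequences.add(forward)
            sequences.add(forward[::-1])
    return sorted(sequences - set(exclude))


mashes = linear_sequences()
print(len(mashes))
print([s for s in mashes if s.startswith("w") or s.startswith("y")])
```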
I did NOT include "qwerty", which is a real dictionary word and a valid Scrabble word. It is also quite difficult to produce
"qwerty" by randomly mashing keys. Try it, you'll see.
- UPCOMING BLOG
Maybe I'm the only one who found this journey fascinating but I'll be writing a detailed blog post about it. I will show
how my algorithm takes a different approach from the others I found on Google and on GitHub. I have a lot of data to
share and I can also share the word lists I used for testing (I cannot share the Scrabble lists because they are not public
domain). Stay Tuned.
Here is a list of gibberish detection projects I found. While I wrote all of my own code from scratch, some
of these projects served as inspiration and benchmarks for me to measure against. I'll be exploring each one of these in my upcoming
blog. I've ordered them by date because some are based on the work of the previous author.
Jun 2011, updated Aug 2015. This was the earliest attempt at gibberish detection I could find. It's written in
Python, a language I've never used.
Oct 2011, updated Sep 2016. They took rrenaud's code and translated it to PHP. I installed it on my server but it
was too slow to check my Target List. I had to make some slight modifications in order to get comparable percentages.
I'll detail all of the results in my blog.
Aug 2018, updated Nov 2019. Mike Hucka from Caltech seems to have done a lot of research on gibberish detection and I'm using
his list of Unix Words as my Target List (/nostril/training/word-corpora/nltk/3.2.2/nltk_data/corpora/words/). I
haven't looked at his code yet either because it's also in Python.
On his GitHub page he has a chart of Performance Tests and since we share the same Target List, it is
probably the best comparison so far.
May 2019, updated May 2019. This author has a PHP version that sounds like the same approach as rrenaud's, but I haven't compared
his code to buggedcom's PHP code. I'll probably examine it at some point. He is one of the few with a live demo.
Sep 2019, updated Sep 2019. A blog post with more Python code. However, this is the only person in this list who talks
about using BOTH "bigrams" and "trigrams" for analysis (these are n-grams, specifically 2-grams and 3-grams).
Everyone else in this list only uses 2-grams. The problem, of course, is that there are only 676 possible
combinations of 2 letters (26 x 26 = 676 2-grams), but there are 17,576 possible 3-grams and 456,976 possible 4-grams. He also talks
about using the Brown Corpus, which has over a million words.
I felt that was overkill for my algorithm.
Jul 2020, updated Jul 2020. This author mentions rrenaud and has some really nice looking charts but the project itself looks
very similar to rrenaud's. Like the others, it's also written in Python. I don't know if he's added anything.
Why is everyone writing this stuff in Python? Why is everyone using the same 2-dimensional array?
HINT: My algorithm looks at 2-grams, 3-grams, AND 4-grams, but not quite like anyone here is doing.
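For readers new to the terminology, an n-gram is simply a run of n consecutive letters. The Python sketch below extracts the 2-, 3-, and 4-grams of a word and reproduces the counts quoted above. It only illustrates the terms; how this page's algorithm actually uses n-grams is not shown here, and the function names are hypothetical.

```python
from typing import List


def ngrams(word: str, n: int) -> List[str]:
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def possible_ngrams(n: int, alphabet_size: int = 26) -> int:
    """Number of distinct n-grams over an alphabet of the given size."""
    return alphabet_size ** n


for n in (2, 3, 4):
    print(n, possible_ngrams(n), ngrams("gibberish", n))
```

A table indexed by n-gram therefore grows by a factor of 26 with each step up in n: 676 entries for 2-grams, 17,576 for 3-grams, and 456,976 for 4-grams.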
I'll explore all of this in my yet unwritten blog. I'll post the link here once it exists. If you made it this
far, thanks for your interest!!