It's an unanticipated result, but it's a unanimous result...


This page determines whether "a" or "an" should precede a word. It does this using the method described in this Stack Overflow response. The dataset used is the wikipedia-article-text dump. Additional preprocessing with regular expressions removed as much wiki-markup as possible and extracted only text vaguely resembling sentences. If the word following "a" or "an" started with a quote or parenthesis, that initial quote or parenthesis was ignored. The resulting prefix list, together with the code to query it, is less than 10KB in size; excluding the actual counts would reduce the size still further.
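To make the idea concrete, here is a minimal sketch of prefix counting and lookup. It is not the code behind this page: the names `PAIR_RE`, `count_prefixes`, `choose_article` and `SAMPLE`, the `max_len` cutoff, and the toy corpus are all assumptions made for illustration, and the real prefix list is built from the Wikipedia dump and is presumably pruned far more aggressively.

```python
import re
from collections import defaultdict

# Hypothetical pattern: "a" or "an", whitespace, then a word. One leading
# quote or parenthesis before the word is skipped, as described above.
PAIR_RE = re.compile(r"\b(an?)\s+[\"'(]?([A-Za-z0-9][\w-]*)", re.IGNORECASE)


def count_prefixes(text, max_len=5):
    """Count how often each word prefix follows "a" versus "an"."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [a_count, an_count]
    for article, word in PAIR_RE.findall(text):
        idx = 0 if article.lower() == "a" else 1
        # The trailing space marks the end of the word, so a short word
        # such as "hour" gets its own entry, distinct from longer words.
        padded = word + " "
        for n in range(min(max_len, len(padded)) + 1):
            counts[padded[:n]][idx] += 1
    return counts


def choose_article(word, counts):
    """Pick the article favoured by the longest prefix that has a majority."""
    padded = word.lstrip("\"'(") + " "
    for n in range(len(padded), -1, -1):
        entry = counts.get(padded[:n])
        if entry and entry[0] != entry[1]:
            return "a" if entry[0] > entry[1] else "an"
    return "a"  # assumed fallback when no prefix is decisive


# Toy corpus, invented for this sketch only.
SAMPLE = (
    "It was an unanticipated result. It was a unanimous vote. "
    "She had an umbrella and a unicorn. He made an unusual choice "
    "with a unique style. An hour later a house appeared."
)

counts = count_prefixes(SAMPLE)
for w in ["unanticipated", "unanimous", "university", "umpire", "hour"]:
    print(choose_article(w, counts), w)
```

Because the lookup falls back to shorter prefixes, words absent from the corpus (such as "university" or "umpire" here) are still classified from the nearest known prefix. Dropping the counts and keeping only the winning article per prefix would shrink the table further, which is the size trade-off mentioned above.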


You may use, modify, redistribute and do whatever you want with the data+script used on this page, but please don't misrepresent its source (license: Apache 2.0). If you've any questions, you can mail me at <firstname>@<lastname>.org.


The implementations are efficient: on a single thread of a 3.88GHz i7-4770k, a benchmark classifying all words of an English dictionary achieves about 22.5 million words per second; that's roughly 173 clock cycles per word. The JavaScript implementations were benchmarked on Chrome 35, Firefox 32.0a1 (2014-05-22), IE 11, and Opera (12 and 21), and are all around 10 times slower, at approximately 4-5 million classifications per second.

--Eamon Nerbonne