A vs An - Determine the English indefinite article

It's an unanticipated result, but it's a unanimous result...


Details:

This page determines whether "a" or "an" should precede a word. It does this using the method described in this Stack Overflow response. The dataset used is the Wikipedia article-text dump. Some additional preprocessing removed as much wiki markup as possible and used regular expressions to extract only text vaguely resembling sentences. If the word following "a" or "an" started with a quote or parenthesis, that initial quote or parenthesis was ignored. The resulting prefix list, together with the code to query it, is less than 10KB in size; excluding the actual counts would reduce the size further.
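To make the prefix-list idea concrete, here is a minimal sketch of one plausible reading of it: look up the longest prefix of the word that has statistics, and pick whichever article was seen more often for that prefix. This is not the AvsAn.js source. The names `PREFIX_COUNTS` and `chooseArticle` are hypothetical, and the counts in the table are invented for illustration rather than taken from the Wikipedia data.

```javascript
// Toy prefix table. NOT the real dataset: every count here is invented.
// Keys are case-sensitive word prefixes; values count how often "a" or
// "an" preceded a word starting with that prefix.
const PREFIX_COUNTS = new Map([
  ["",       { a: 100, an: 10 }], // fallback: most words take "a"
  ["u",      { a: 20,  an: 60 }], // "an umbrella"
  ["unani",  { a: 9,   an: 1  }], // "a unanimous"
  ["uni",    { a: 40,  an: 5  }], // "a unicorn"
  ["unidio", { a: 0,   an: 3  }], // "an unidiomatic"
  ["hon",    { a: 2,   an: 12 }], // "an honest"
  ["honey",  { a: 8,   an: 0  }], // "a honeysuckle"
  ["NS",     { a: 0,   an: 4  }], // "an NSA"
]);

// Returns "a" or "an" for the given word using the longest known prefix.
function chooseArticle(word) {
  // Assumption: drop leading quotes or parentheses at lookup time,
  // mirroring the preprocessing described above.
  const cleaned = word.replace(/^["'(]+/, "");
  for (let len = cleaned.length; len >= 0; len--) {
    const counts = PREFIX_COUNTS.get(cleaned.slice(0, len));
    if (counts) {
      return counts.an > counts.a ? "an" : "a";
    }
  }
  return "a"; // only reached if the table lacks the empty prefix
}

for (const w of ["unanticipated", "unanimous", "honest", "honeysuckle", "NASA", "NSA"]) {
  console.log(chooseArticle(w), w);
}
```

A table that stores only the winning article per prefix, instead of both counts, would support the same lookup, which is presumably why dropping the counts shrinks the data.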

Try...

unanticipated result
unanimous vote
honest decision
honeysuckle shrub
0800 number
xmas tree
unidirectional beam
unidiomatic phrase
NASA scientist
NSA analyst
FIAT car
FAA policy

You may use, modify, redistribute, and do whatever you want with the data and script used on this page, but please don't misrepresent its source (license: Apache 2.0). If you have any questions, you can mail me at <firstname>@<lastname>.org.

Downloads:

.NET: NuGet package AvsAn (.NET 2.0 or later); you can also manually download the binaries from NuGet if you prefer.
JS: variant including counts of a's and an's: AvsAn.js (12043 bytes; 6333 bytes minified and gzipped)
JS: variant including only which article is more common: AvsAn-simple.js (4441 bytes; 2299 bytes minified and gzipped)
Source code: on GitHub: EamonNerbonne/a-vs-an. Contributions, bug reports, and pull requests welcome!
Ruby gem: by Marten Veldthuis, GitHub: marten/a_vs_an.
Node.js package: (alternative implementation by Chad Kirby) GitHub: uplake/Articles

The implementations are efficient: on a single thread of a 3.88GHz i7-4770K, a benchmark classifying all words of an English dictionary achieves about 22.5 million words per second; that's just 173 clock cycles per word. The JavaScript implementations were benchmarked on Chrome 35, Firefox 32.0a1 (2014-05-22), IE 11, and Opera (12 and 21); all are around 10 times slower, at approximately 4-5 million classifications per second.

--Eamon Nerbonne