How it works

In 1984, a small speech recognition team at I.B.M. labs presented an experimental system. It was extremely simple: a person read several sentences aloud, leaving long pauses between words. Every word was pronounced slowly and distinctly, nothing like the way callers speak to call centers today. The system made about 1 mistake in every 10 words, which for the time was an immense success. Today computers process billions of words spoken by many different users, sometimes in very noisy environments. This became possible only because of advances in computing speed, the size of databases, and software.
The whole process works as follows. A person's speech is recorded and converted to digital form. The resulting patterns are then compared with digital examples of the consonant and vowel sounds that make up words, spoken by thousands of people of different sex, origin and speaking ability. Next, the system calculates the statistical probability that a given cluster of digital bits represents one word rather than another. Using special databases, it also learns which words are likely to appear near each other in a sentence, and it raises the probability of those words accordingly.
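The two kinds of evidence described above, how well a sound matches a word and how likely two words are to appear together, can be combined in a few lines. The following is only a minimal sketch: the words, the probability tables ACOUSTIC and BIGRAM, and the UNSEEN floor value are all invented for illustration, and real systems use far larger models and more efficient search than trying every combination.

```python
import math
from itertools import product

# Hypothetical acoustic scores: for each spoken segment, the probability
# that the recorded sound matches each candidate word (invented numbers).
ACOUSTIC = [
    {"recognize": 0.6, "wreck a nice": 0.4},
    {"speech": 0.5, "beach": 0.5},
]

# Hypothetical word-pair probabilities P(next word | previous word),
# standing in for the large text databases mentioned in the article.
BIGRAM = {
    ("recognize", "speech"): 0.3,
    ("recognize", "beach"): 0.01,
    ("wreck a nice", "speech"): 0.01,
    ("wreck a nice", "beach"): 0.2,
}

UNSEEN = 1e-6  # assumed floor probability for word pairs missing from the table


def score(words):
    """Log-probability of a word sequence: sound match plus word-pair context."""
    total = sum(math.log(ACOUSTIC[i][w]) for i, w in enumerate(words))
    for prev, nxt in zip(words, words[1:]):
        total += math.log(BIGRAM.get((prev, nxt), UNSEEN))
    return total


def best_sequence():
    """Try every combination of candidate words and keep the most probable one."""
    candidates = product(*(segment.keys() for segment in ACOUSTIC))
    return max(candidates, key=score)


print(best_sequence())  # ('recognize', 'speech')
```

In this toy example the sound alone cannot decide between "speech" and "beach" (both score 0.5), so the word-pair table settles the choice.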
In the early 1980s, about 100 hours of recorded speech was the result of a very successful project. Microsoft has calculated that its voice recognition software, used in call centers, Web search and automobiles, now fields 11 billion speech requests a year. Google, while developing Google Voice and building a model of the English language, compiled a 240 billion-word database of commonly used words, terms and phrases.
Such large databases, together with the overall speed of data processing, allow the systems to work more accurately. They produce more refined calculations of word usage, capturing patterns that we, as speakers who use our language intuitively, are not aware of.