Baidu's DeepSpeech is a pretty okay place to start. If you are just starting out, Mozilla maintains a TensorFlow implementation (with a link to the original paper) here:

To reduce it you need a ton of good data. If you are just playing around, you should look into the LibriSpeech set available here:

The LibriSpeech data set is made from license-free audio books, so it is a highly biased data set (e.g. no noise). For starters, you can always add noise yourself (by mixing in other sounds, distorting the audio, playing with speed and volume, etc.), but if you want really good results on arbitrary audio, you need data from many different sources.
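To make the "add noise yourself" idea concrete, here is a minimal sketch in Python with NumPy. It assumes your audio is already loaded as 1-D float arrays scaled to [-1, 1]. The function names (`mix_at_snr`, `change_speed`, `change_volume`, `augment`) and the parameter ranges are made up for illustration; they are not part of DeepSpeech or LibriSpeech. The speed change is a crude linear-interpolation resample, so it shifts pitch too.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at (roughly) the requested signal-to-noise ratio in dB."""
    # Repeat or trim the noise clip so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        return speech.copy()  # nothing sensible to scale against
    # Pick a scale so that speech_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def change_speed(audio, rate):
    """Resample by linear interpolation; rate > 1 makes the clip shorter (faster)."""
    n_out = int(round(len(audio) / rate))
    old_idx = np.arange(len(audio))
    new_idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(new_idx, old_idx, audio)

def change_volume(audio, gain_db):
    """Apply a gain in decibels (+6 dB is roughly double the amplitude)."""
    return audio * 10 ** (gain_db / 20)

def augment(speech, noise, rng):
    """Randomly perturb speed and volume, then mix in noise. Ranges are guesses."""
    out = change_speed(speech, rng.uniform(0.9, 1.1))
    out = change_volume(out, rng.uniform(-6, 6))
    out = mix_at_snr(out, noise, rng.uniform(5, 20))
    return np.clip(out, -1.0, 1.0)  # keep samples in the valid range
```

You would call something like `augment(clip, noise_clip, np.random.default_rng())` on each training example, ideally with a different noise clip each time, so the network rarely sees exactly the same audio twice.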

