Sometimes you already have a large amount of historical data along with precise ground truth for each data point. In that case your dataset is already labelled, and all you need to do is clean, normalize, sub-sample, and analyze it, train a model, and then iterate until you achieve a good evaluation.
But more often, all you have is a big bucket of raw unlabelled data, and manually building a consistent ground truth might be the most painful phase of your machine learning workflow. Some of these scenarios are well covered by companies and services that provide subject matter expertise about your specific context (linguistics, semantics, statistics, etc.), usually at a very high cost. Other contexts, such as multimedia annotation, are much harder to handle, and it turns out that crowdsourcing might be a great way to cut down both costs and time.
Mechanical Turk – or MTurk – is a crowdsourcing marketplace where you (as a Requester) can publish and coordinate a wide set of Human Intelligence Tasks (HITs), such as classification, tagging, surveys, and transcriptions. Other users (as Workers) can choose your tasks and earn a small amount of money for each completed task.
How to build a model from Mechanical Turk results
Amazon Mechanical Turk will notify you when your results are ready and you will finally have a labelled dataset. In some cases, a few records might not have reached any consensus among Workers. You could either improve your task instructions and resubmit those records or, if the remaining dataset is large and representative enough to train a useful model, simply discard them.
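As a rough illustration of the consensus step, here is a minimal sketch of majority-vote aggregation in Python. It is not something MTurk prescribes: the function name `aggregate_labels`, the `(record_id, worker_id, label)` tuple layout, the sample data, and the 60% agreement threshold are all assumptions made for this example, and your exported results will need to be mapped into this shape first.

```python
from collections import Counter, defaultdict


def aggregate_labels(assignments, min_agreement=0.6):
    """Majority-vote aggregation of crowdsourced labels (illustrative only).

    assignments: iterable of (record_id, worker_id, label) tuples.
    Returns (labelled, unresolved): labelled maps record_id -> winning label,
    unresolved lists the record_ids with a tie or too little agreement.
    """
    # Group every submitted label by the record it refers to.
    votes = defaultdict(list)
    for record_id, _worker_id, label in assignments:
        votes[record_id].append(label)

    labelled, unresolved = {}, []
    for record_id, labels in votes.items():
        counts = Counter(labels).most_common(2)
        top_label, top_count = counts[0]
        # A tie between the two most frequent labels means no consensus.
        tied = len(counts) > 1 and counts[1][1] == top_count
        if tied or top_count / len(labels) < min_agreement:
            unresolved.append(record_id)
        else:
            labelled[record_id] = top_label
    return labelled, unresolved


# Hypothetical results: three Workers labelled each image.
results = [
    ("img-1", "w1", "cat"), ("img-1", "w2", "cat"), ("img-1", "w3", "dog"),
    ("img-2", "w1", "dog"), ("img-2", "w2", "cat"), ("img-2", "w3", "bird"),
    ("img-3", "w1", "dog"), ("img-3", "w2", "dog"), ("img-3", "w3", "dog"),
]
labelled, unresolved = aggregate_labels(results)
print(labelled)    # records that reached consensus
print(unresolved)  # records to resubmit or discard
```

Records that end up in `unresolved` are the ones you would either resubmit with clearer instructions or drop from the training set. The threshold is a tuning knob: raising it trades dataset size for label reliability.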
Amazon Mechanical Turk and other crowdsourcing platforms can be very useful in helping you to build your machine learning model from an unlabelled dataset.
Other solutions could involve unsupervised learning techniques, such as clustering and some neural network approaches, which are pretty good at identifying patterns and structures in unlabelled data. However, for most tasks, they are still far behind human intelligence. “Low-tech” solutions involving real humans will probably bring much higher accuracy, with an acceptable trade-off between cost, complexity, and speed.