Most of the time when we talk about machine learning projects, we talk about building predictive models that generate insights that directly help someone make a decision: for example, using churn prediction to identify the customers that a retention team should contact, using fraud detection to decide which cases to investigate, or using dosage prediction to help a doctor decide the treatment plan for a patient. We always refer to the DATA – INSIGHT – DECISION triplet and use the image below to illustrate this. In these types of applications machine learning is very much in the foreground.


From Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples and Case Studies

There is, however, another group of applications in which machine learning is very much in the background. We refer to these as data enrichment applications because in these cases machine learning models are used to add useful tags or other information to existing data so that the data can be used more effectively. In these instances you might say that machine learning is applied at the earliest part of the diagram above, before the arrow labelled data.

Augmenting data collected by large-scale sky survey telescopes (such as the Sloan Digital Sky Survey) is a good example of data enrichment. Astrophysicists (as well as astronomers, cosmologists, and others) use data collected by these telescopes to answer fundamental questions about how the universe works, how it started, and where it is headed. In order to use sky survey data to answer these big questions, however, the data needs to be augmented with information identifying the sky objects contained in each image and capturing their characteristics (many of which cannot be directly measured). It is interesting to hear astrophysicists talk about this type of work, as they are very blunt that collecting and processing data does not count as science: the science starts only after this is all done and they start trying to answer their questions. Given the big questions they are answering, we’ll allow them this conceit for now.

The volume of data collected by modern sky survey telescopes is so large (hundreds of thousands of images per night), however, that it is not feasible to manually add the kind of cataloguing information described above. Instead, modern sky survey projects turn to machine learning for these tasks. Given a set of labelled training images, models can be trained to recognise different types of sky objects and their characteristics, and the raw image data can be enriched with this information (which can be seen as a kind of metadata). Astrophysicists can then use this extra information to retrieve examples in which they are particularly interested or to separate data for analytical tests.
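To make the enrichment idea concrete, here is a minimal Python sketch. It is not how any real sky survey pipeline works: the two numeric features, the record ids, and the function names are all invented for illustration, and a simple nearest-centroid classifier stands in for whatever model a real project would train on image data. The point is only the shape of the workflow: train on labelled examples, tag the unlabelled records, then filter on the new tag.

```python
from statistics import mean

# Hypothetical labelled training set: (feature vector, galaxy type).
# The two features are invented stand-ins for measurements that would be
# extracted from an image; they are not real survey attributes.
TRAINING = [
    ((0.9, 0.1), "elliptical"),
    ((0.8, 0.2), "elliptical"),
    ((0.2, 0.9), "spiral"),
    ((0.3, 0.8), "spiral"),
]


def train_centroids(examples):
    """Compute the mean feature vector (centroid) for each label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def enrich(records, centroids):
    """Return copies of the records with a predicted 'galaxy_type' tag added."""
    return [
        {**record, "galaxy_type": predict(centroids, record["features"])}
        for record in records
    ]


centroids = train_centroids(TRAINING)

# Unlabelled survey records (hypothetical ids and feature values).
survey = [
    {"id": "obj-001", "features": (0.85, 0.15)},
    {"id": "obj-002", "features": (0.25, 0.95)},
    {"id": "obj-003", "features": (0.10, 0.70)},
]

enriched = enrich(survey, centroids)

# Downstream use: retrieve only the spirals via the added metadata.
spiral_ids = [r["id"] for r in enriched if r["galaxy_type"] == "spiral"]
print(spiral_ids)
```

Note that the model's output here is not the end product. The predicted tag simply becomes another column that later queries and analyses can rely on, which is what keeps machine learning in the background.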

Morphological classification of galaxies is a nice example of this. Edwin Hubble developed a classification of different galaxy types based on their shapes (or morphology). This is commonly referred to as the Hubble Tuning Fork. The basic level of separation is into spiral or elliptical galaxies, with these high-level categories breaking down into further subcategories. The images below show some examples.


From Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples and Case Studies and the Sloan Digital Sky Survey.

Machine learning models can be built that accurately recognise what type of galaxy is in an image, tell other sky objects apart, and accurately predict values for key attributes such as redshift. One of the freely available sample chapters of our book, Fundamentals of Machine Learning for Predictive Data Analytics, covers in detail how to do this type of classification using machine learning models. Current state-of-the-art approaches, for example Dieleman et al., use convolutional neural networks to do this to very high levels of accuracy.

The astronomical example is, however, just one of many examples of using machine learning to enrich datasets for later processing. Other examples include recognising the course a recipe best suits, classifying documents, and adding personality characteristics to social network accounts. It is worth thinking about this when deciding how best to apply machine learning to your data.