The process of signing up for a Twitter account is interesting for its brevity. All that users need to provide is an email address, a password, and an optional full name. So, on the day that you join Twitter, they know very little about you.


This is in stark contrast with the amount of detail that Twitter claim to know about their users when advertisers are choosing the target audience for new campaigns. Advertisers can choose to put their ads in front of Twitter users located in specific regions, of specific genders, and with very particular interests (see below). So how do they go from knowing just a user’s email address and full name to knowing that they are a woman located in Ireland and interested in off-road vehicles? Twitter, like most similar companies, make extensive use of profiling to learn these nuggets of demographic and preference information about their users. Earlier this year Facebook, for example, released a list of 98 personal data points that they use for this type of profiling.

[Screenshots: Twitter ad targeting options]

A lot of the profiling that companies like Twitter and Facebook perform is based on machine learning models that extract these pieces of profile data from patterns in the content we post, the content we engage with (like, follow or share) and the networks of friends and followers we create. On Twitter, for example, we have the profile information we explicitly provide (username and optional full name and description), the tweets we post, the tweets we share and favorite, the tweets posted by the people we follow, the network of people we follow and people who follow us, and all of the metadata recording when, where and on which platforms we interact with Twitter. This is a very rich seam of data from which demographic, preference, and even intention data can be mined.


For a demonstration of how this type of profiling works, we recently set out to build a system that could extract Twitter users’ genders and main interests from the data that is publicly available through the Twitter API (all of the data described above). The lesson in this post comes from the attempt to identify users’ genders. We first attempted to build a text mining solution by collecting a large set of Twitter usernames that we knew belonged to men, collecting a similarly sized set that we knew belonged to women, and training a naive Bayes classifier to recognize the difference between them based on the presence or absence of certain words in the tweets they posted and those posted by the people they followed. This didn’t really work. So, next we tried a similar approach based on the accounts present in the network of people that a user followed. This didn’t really work either.
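To make the first attempt concrete, here is a minimal sketch of a naive Bayes classifier that works on the presence or absence of words. It is not the code we used: the function names, the neutral labels "a" and "b" (standing in for the two sets of known accounts) and the four tiny word sets are all invented for illustration, and a real system would work from thousands of tweets per account.

```python
import math

def train(docs, labels):
    """Fit a presence/absence (Bernoulli) naive Bayes model.

    docs is a list of word sets, labels a parallel list of class labels.
    """
    vocab = set().union(*docs)
    model = {"prior": {}, "p_word": {}}
    for c in sorted(set(labels)):
        class_docs = [d for d, label in zip(docs, labels) if label == c]
        n = len(class_docs)
        model["prior"][c] = n / len(docs)
        # Laplace smoothing: (docs containing the word + 1) / (docs in class + 2)
        model["p_word"][c] = {
            w: (sum(w in d for d in class_docs) + 1) / (n + 2) for w in vocab
        }
    return model

def predict(model, words):
    """Return the class with the highest log-probability for a set of words."""
    words = set(words)
    scores = {}
    for c, prior in model["prior"].items():
        score = math.log(prior)
        for w, p in model["p_word"][c].items():
            # Both presence and absence of each vocabulary word count as evidence.
            score += math.log(p if w in words else 1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Invented toy data: each set stands for the words seen in one account's tweets.
docs = [
    {"football", "match", "goal"},
    {"match", "league", "goal"},
    {"recipe", "garden", "bake"},
    {"garden", "bake", "flowers"},
]
labels = ["a", "a", "b", "b"]
model = train(docs, labels)
```

Calling `predict(model, {"goal", "league"})` picks class "a" on this toy data. On real tweets the vocabularies of the two groups overlap far more than they do here, which is consistent with how poorly this approach worked for us.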

At this point someone made a very good suggestion. Most Twitter users supply a full name, and a person’s first name provides a very good indication of their gender. In fact, the national statistics office in many countries (in Ireland, the Central Statistics Office) publishes statistics on the most popular names for baby boys and girls each year. Using this data it is very easy to calculate the probability that someone with a particular first name is male or female. There are even nice APIs that provide this classification as a service based on a name and a country, for example Gender API. This name-based approach performs much better than the much more complicated machine learning based approaches.
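The name-based calculation is simple enough to sketch in a few lines. The counts below are made up purely for illustration (they are not real statistics), and the function names are hypothetical; in practice the table would be loaded from published baby-name data for the relevant country.

```python
from collections import defaultdict

# Made-up rows in the shape of a baby-name table: (first name, sex, births).
NAME_COUNTS = [
    ("aoife", "F", 980),
    ("aoife", "M", 2),
    ("conor", "M", 1200),
    ("conor", "F", 3),
    ("alex", "M", 600),
    ("alex", "F", 400),
]

def build_table(rows):
    """Aggregate births per first name and sex."""
    table = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, births in rows:
        table[name.lower()][sex] += births
    return dict(table)

def probability_female(full_name, table):
    """Return P(female | first name), or None if the name is missing or unknown."""
    parts = full_name.strip().split()
    if not parts:
        return None
    counts = table.get(parts[0].lower())
    if counts is None:
        return None
    total = counts["F"] + counts["M"]
    if total == 0:
        return None
    return counts["F"] / total

TABLE = build_table(NAME_COUNTS)
```

With these invented counts, `probability_female("Alex Murphy", TABLE)` gives 0.4, while a name that is missing from the table returns `None` rather than a guess. Ambiguous names like this one are where the approach is weakest, so a real system would probably only assign a gender above some confidence threshold.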

I think this is a great example of why it is important to always look for simple solutions before diving straight into more complicated ones. Or, as William of Ockham more eloquently put it in what came to be known as Ockham’s Razor:

Frustra fit per plura quod potest fieri per pauciora
(It is futile to do with more things that which can be done with fewer)

We should always remember this when doing data analytics projects. We need to resist the temptation to attack a problem with the latest, most sophisticated technique we have read about and try the simple solutions first. Simple solutions are likely to be more robust, to require less effort and maintenance, and will certainly be easier to explain to clients who are not data analytics experts. Only if the simple solutions don’t work as needed should we move to more sophisticated ones.