Let’s start today’s post in space. The image below shows an astronaut on the surface of the moon (this image, and many of the others in this post, come from the excellent book Digital Image Processing by Gonzalez & Woods). The image, however, has been corrupted with noise, so it is very hard to make much sense of it. Luckily, noise removal is one of the core jobs in image processing, so we should be able to clean this image up.

screen-shot-2016-12-05-at-21-21-12
Source: Digital Image Processing by Gonzalez & Woods

If we zoom right in on this image, we can see that the basic data representation for a grey-scale image like this is a grid of pixels, each containing a shade of grey.

zoomedinpixels
Source: Digital Image Processing by Gonzalez & Woods

Using this representation we can perform a simple convolution operation on the image to attempt to remove the noise. This is a standard technique in image processing and the main tool in the toolbox for working with pixel representations like this. Unfortunately, as shown below, although the image is cleaned up, it is not quite as sharp as we might like.

screen-shot-2016-12-05-at-21-21-20
Source: Digital Image Processing by Gonzalez & Woods
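
To make the idea of a convolution concrete, here is a minimal sketch of one of the simplest smoothing kernels, a 3x3 averaging (mean) filter. We are not claiming this is the kernel used to produce the image above; the function name `mean_filter` and the toy 5x5 image are our own assumptions for illustration.

```python
import numpy as np

def mean_filter(image, size=3):
    """Smooth a 2-D grey-scale image with a size x size averaging kernel (size must be odd)."""
    pad = size // 2
    # Repeat the edge pixels so the output has the same shape as the input.
    padded = np.pad(image.astype(float), pad, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    # Add up one shifted copy of the image per kernel cell, then average.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (size * size)

# Hypothetical toy image: a flat grey patch with a single noisy pixel.
noisy = np.full((5, 5), 100.0)
noisy[2, 2] = 190.0

smoothed = mean_filter(noisy)
print(smoothed[2, 2])  # the spike is averaged with its eight neighbours
```

Every output pixel is an average of its neighbourhood, which is why this kind of filter suppresses noise but also blurs genuine detail.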

We could work more on trying out different types of convolution kernels, and this would probably lead to some improvement, but we will hit a wall in terms of what can be achieved with this type of representation. Rather than trying to push against this wall, we can instead change the data representation. The fast Fourier transform is another standard tool in the digital image processing toolbox. It transforms the representation of an image from a set of pixel values into a set of frequencies; in other words, it converts the image into the frequency domain.

Below we show our image of the astronaut on the surface of the moon represented in the frequency domain. The details of exactly what is going on here are not terribly important (although Chapter 4 of Digital Image Processing by Gonzalez & Woods goes into this in great detail and is fascinating stuff). What is important is that when we look at the frequency domain image below, a set of bright points arranged in a ring around the image jumps out. The noise that is corrupting this image is sinusoidal in nature (most likely caused by the camera being near a generator or other equipment like that) and so really pops out once the representation of the image is changed to the frequency domain.

screen-shot-2016-12-05-at-21-20-58
Source: Digital Image Processing by Gonzalez & Woods
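
A small experiment shows why sinusoidal noise stands out so clearly in the frequency domain. The sketch below uses a synthetic stand-in image (a smooth gradient we made up, not the astronaut photograph) and adds a sine wave with 8 cycles across its width; the image size and noise amplitude are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical stand-in for the photograph: a smooth 64x64 gradient.
n = 64
y, x = np.mgrid[0:n, 0:n]
clean = (x + y) / (2.0 * (n - 1))

# Add sinusoidal interference with 8 cycles across the image width.
noisy = clean + 0.5 * np.sin(2 * np.pi * 8 * x / n)

# Move to the frequency domain; fftshift puts the zero frequency in the centre.
spectrum = np.fft.fftshift(np.fft.fft2(noisy))
magnitude = np.abs(spectrum)

# Ignore the centre (DC) term, which only encodes the average brightness.
centre = n // 2
magnitude[centre, centre] = 0.0

# Locate the brightest remaining point in the spectrum.
peak = np.unravel_index(np.argmax(magnitude), magnitude.shape)
print(peak, "is", abs(peak[1] - centre), "columns from the centre")
```

The brightest point sits 8 samples from the centre of the spectrum, exactly the frequency of the injected sine wave. Noise that is smeared over every pixel of the image collapses to a couple of isolated points once the representation changes.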

Even better, removing the noise is now very easy. We can apply what is called a band-reject filter, which removes particular frequencies from the image, in this case the frequencies represented by the bright spots in the ring. After applying the band-reject filter we get the lovely, sharp, clean image below.

screen-shot-2016-12-05-at-21-21-36
Source: Digital Image Processing by Gonzalez & Woods
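
Removing a ring of frequencies can be sketched in a few lines. The example below is an idealised filter that simply zeroes a thin ring of the spectrum, applied to the same kind of synthetic image (a made-up gradient plus a sine wave); the ring radius and width are assumptions that match our synthetic noise, and real filters are usually designed with smoother edges.

```python
import numpy as np

# Hypothetical stand-in image plus sinusoidal noise (8 cycles across the width).
n = 64
y, x = np.mgrid[0:n, 0:n]
clean = (x + y) / (2.0 * (n - 1))
noisy = clean + 0.5 * np.sin(2 * np.pi * 8 * x / n)

# Frequency-domain representation with the zero frequency in the centre.
spectrum = np.fft.fftshift(np.fft.fft2(noisy))

# Distance of every frequency sample from the centre of the shifted spectrum.
centre = n // 2
radius = np.hypot(y - centre, x - centre)

# Ideal ring mask: zero out frequencies near radius 8, keep everything else.
mask = np.ones((n, n))
mask[np.abs(radius - 8) <= 1.0] = 0.0

# Apply the mask and transform back to pixel values.
filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

err_before = np.abs(noisy - clean).max()
err_after = np.abs(filtered - clean).max()
print(err_before, err_after)
```

The worst-case pixel error drops sharply after filtering. It does not reach zero, because the ring also removes the small part of the clean image that happens to live at those frequencies.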

Now, that was a lot of image processing for a post about data analytics, but the lesson carries directly over to analytics projects. Changing our data representation is often the most powerful thing we can do to generate better insights from our data. Here is a really simple example from the world of sabermetrics. The small dataset below shows the number of passes attempted and the number of passes completed in a season by 30 American football quarterbacks. Also shown is the value that the press corps placed on each quarterback at the end of the season.

| Player | Team | Passes Attempted | Passes Completed | Player Value |
|---|---|---|---|---|
| D.Culpepper | MIN | 548 | 379 | 110.9 |
| D.McNabb | PHI | 469 | 300 | 104.7 |
| B.Griese | TAM | 336 | 233 | 102.5 |
| M.Bulger | STL | 485 | 321 | 93.7 |
| B.Favre | GBP | 540 | 346 | 92.4 |
| J.Delhomme | CAR | 533 | 310 | 87.3 |
| K.Warner | NYG | 277 | 174 | 86.5 |
| M.Hasselbeck | SEA | 474 | 279 | 83.1 |
| A.Brooks | NOS | 542 | 309 | 79.5 |
| T.Rattay | SFX | 325 | 198 | 78.1 |
| M.Vick | ATL | 321 | 181 | 78.1 |
| J.Harrington | DET | 489 | 274 | 77.5 |
| V.Testaverde | DAL | 495 | 297 | 76.4 |
| P.Ramsey | WAS | 272 | 169 | 77.8 |
| J.McCown | ARI | 408 | 233 | 74.1 |
| P.Manning | IND | 497 | 336 | 113.1 |
| D.Brees | SDC | 400 | 262 | 104.8 |
| B.Roethlisberger | PIT | 295 | 196 | 98.1 |
| T.Green | KAN | 556 | 369 | 95.2 |
| T.Brady | NEP | 474 | 288 | 92.6 |
| C.Pennington | NYJ | 370 | 242 | 91.0 |
| B.Volek | TEN | 357 | 218 | 87.1 |
| J.Plummer | DEN | 521 | 303 | 84.5 |
| D.Carr | HOU | 466 | 285 | 83.5 |
| B.Leftwich | JAC | 441 | 267 | 82.2 |
| C.Palmer | CIN | 432 | 263 | 77.3 |
| J.Garcia | CLE | 252 | 144 | 76.7 |
| D.Bledsoe | BUF | 450 | 256 | 76.6 |
| K.Collins | OAK | 513 | 289 | 74.8 |
| K.Boller | BAL | 464 | 258 | 70.9 |

 

It would be useful to understand how the statistics we can measure about a player influence the value the press corps places on that player. The images below show scatter plots illustrating the relationship between player value and passes attempted, and between player value and passes completed. The correlation coefficients between each of these player statistics and player value are 0.15 and 0.453 respectively, which suggests pretty weak associations.

playervaluecompletedpasses   playervaluecompletedpasspercent

A simple change to the data representation, however, can add significantly more value to this dataset. Simply dividing a player’s passes completed statistic by their passes attempted statistic yields the player’s percentage of passes completed.

| Player | Team | Passes Attempted | Passes Completed | Pass Completion % | Player Value |
|---|---|---|---|---|---|
| D.Culpepper | MIN | 548 | 379 | 69 | 110.9 |
| D.McNabb | PHI | 469 | 300 | 64 | 104.7 |
| B.Griese | TAM | 336 | 233 | 69 | 102.5 |
| M.Bulger | STL | 485 | 321 | 66 | 93.7 |
| B.Favre | GBP | 540 | 346 | 64 | 92.4 |
| J.Delhomme | CAR | 533 | 310 | 58 | 87.3 |
| K.Warner | NYG | 277 | 174 | 63 | 86.5 |
| M.Hasselbeck | SEA | 474 | 279 | 59 | 83.1 |
| A.Brooks | NOS | 542 | 309 | 57 | 79.5 |
| T.Rattay | SFX | 325 | 198 | 61 | 78.1 |
| M.Vick | ATL | 321 | 181 | 56 | 78.1 |
| J.Harrington | DET | 489 | 274 | 56 | 77.5 |
| V.Testaverde | DAL | 495 | 297 | 60 | 76.4 |
| P.Ramsey | WAS | 272 | 169 | 62 | 77.8 |
| J.McCown | ARI | 408 | 233 | 57 | 74.1 |
| P.Manning | IND | 497 | 336 | 68 | 113.1 |
| D.Brees | SDC | 400 | 262 | 66 | 104.8 |
| B.Roethlisberger | PIT | 295 | 196 | 66 | 98.1 |
| T.Green | KAN | 556 | 369 | 66 | 95.2 |
| T.Brady | NEP | 474 | 288 | 61 | 92.6 |
| C.Pennington | NYJ | 370 | 242 | 65 | 91.0 |
| B.Volek | TEN | 357 | 218 | 61 | 87.1 |
| J.Plummer | DEN | 521 | 303 | 58 | 84.5 |
| D.Carr | HOU | 466 | 285 | 61 | 83.5 |
| B.Leftwich | JAC | 441 | 267 | 61 | 82.2 |
| C.Palmer | CIN | 432 | 263 | 61 | 77.3 |
| J.Garcia | CLE | 252 | 144 | 57 | 76.7 |
| D.Bledsoe | BUF | 450 | 256 | 57 | 76.6 |
| K.Collins | OAK | 513 | 289 | 56 | 74.8 |
| K.Boller | BAL | 464 | 258 | 56 | 70.9 |

 

Percentage of passes completed is a much more useful measure when trying to determine player value. This is evident in the scatter plot below and in the correlation coefficient of 0.87 between player value and percentage of passes completed.

playervalueattemptedpasses
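
For readers who want to check the effect of this change of representation themselves, here is a minimal sketch that recomputes the three correlations from the table above using only the Python standard library. The `pearson` helper is our own, and we assume the figures quoted in this post are Pearson correlation coefficients; because we use unrounded percentages here, the printed values may differ slightly from the ones quoted.

```python
from math import sqrt

# (passes attempted, passes completed, player value), copied from the table above.
rows = [
    (548, 379, 110.9), (469, 300, 104.7), (336, 233, 102.5),
    (485, 321, 93.7), (540, 346, 92.4), (533, 310, 87.3),
    (277, 174, 86.5), (474, 279, 83.1), (542, 309, 79.5),
    (325, 198, 78.1), (321, 181, 78.1), (489, 274, 77.5),
    (495, 297, 76.4), (272, 169, 77.8), (408, 233, 74.1),
    (497, 336, 113.1), (400, 262, 104.8), (295, 196, 98.1),
    (556, 369, 95.2), (474, 288, 92.6), (370, 242, 91.0),
    (357, 218, 87.1), (521, 303, 84.5), (466, 285, 83.5),
    (441, 267, 82.2), (432, 263, 77.3), (252, 144, 76.7),
    (450, 256, 76.6), (513, 289, 74.8), (464, 258, 70.9),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

attempted = [a for a, _, _ in rows]
completed = [c for _, c, _ in rows]
value = [v for _, _, v in rows]

# The change of representation: one derived column instead of two raw ones.
percent = [100.0 * c / a for a, c, _ in rows]

r_attempted = pearson(attempted, value)
r_completed = pearson(completed, value)
r_percent = pearson(percent, value)

print(f"attempted vs value:    {r_attempted:.2f}")
print(f"completed vs value:    {r_completed:.2f}")
print(f"completion % vs value: {r_percent:.2f}")
```

The derived column carries no information that was not already in the two raw columns, yet it correlates far more strongly with player value than either of them.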

This is a pretty simple example, and it is unlikely that any American football teams will be beating down our door based on this insight. Determining good metrics in sports, however, is serious business. Moneyball is one of our favourite movies (and books) here at The Analytics Store, and one of our favourite scenes is when Billy Beane repeatedly turns to Peter Brand in the scouting meeting for the refrain “because he gets on base”. How often a player got on base was the key metric for capturing the value of a player.

Although he probably never actually said it, the following quote is often attributed to Albert Einstein:

“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”

In data analytics we say:

“If I had an hour to analyse a dataset I’d spend 55 minutes working on the data representation and 5 minutes running the analysis.”

 

 
