Estimating missing values through linear regression

How I fixed a bad edge case which escalated quickly when our D3 graphs went wild with missing data in our APIs

published 8 months ago

Introduction

At work, we were implementing this new feature that would allow our users see how a certain dataset was changing with time. To put simply, it was an innocent line graph. This was a fairly simple task. All we had to do was fetch the data points for the graph and plot them using D3 on the page.

The problem grew a bit more complex when the dataset started lacking consistency. When we tried to use our innocent way of plotting the graph, things escalated. Real quick.

The Problem

Screenshot 2020-09-13 at 04.50.03.jpg

For a data set for the years 2000 - 2003, values [2400, 816, 1415, 2496] would generate a graph like above.

Our problem soon came as edge cases for when there would be missing data in the list as we integrated more and nore datasets in our db. Soon, our graphs would go either completely off the charts or would stop drawing different parts of the line, or worse, parse undefined or null as 0.

To give you an example, here, the value for 2001 is set to null and the graph renders this value as 0.

Screenshot 2020-09-13 at 04.58.29.jpg

This became a huge problem since according to the data, it is supremely misleading. The value must be around 816. Imagine if this was a case if you were seeing the trend in currencies. Worse, if you were comparing 2 currencies.

We never saw this coming.

When Jerry, our UX Lead, saw this, he came to decision that it's better to show approximated values on the graph so that our users are not mislead and use grey to show to demarcate actual from approximated ones.

UX wise, this was a nifty trick. Programming wise, I had to go back to my college Stats classes to find a way.

The Math

The inter-webs was filled with examples on how people used the d3-interpolate library function to accomplish something similar. But we're not supposed to use interpolation because it never does approximations.

Interpolation vs Regression

Interpolation is a way to find a line 𝐹(𝑥) such that for all values of 𝑥, there is an exact 𝑦. Regression is a way to find a line 𝐹(𝑥) such that when 𝐹(𝑥) is the closest to all values of 𝑥 at once.

The difference between Interpolation and Regression is the problem that they try to solve.

interpolation-without-line-without-missing.png

For simpler cases, we probably do interpolation without even realizing it. For example, by looking at this chart above, if we were to find the value of 𝐹(𝑥) (or in other words 𝑦) when 𝑥 is 35, our brain would probably draw conclusions in seconds because the human brain understands visuals better and come to to a conclusion.

interpolatioin-with-line-without-missing.png

It would first try to imagine a line passing exactly through all the dots.

interpolation-with-line-with-missing.png

and then come to a conclusion of 3.5. The process of finding this line which passes through all these points is interpolation. We see best through visuals. Computers see best in mathematics. There is a separate post I will write on the math behind all of this since it's seriously way out of scope for this one.

regression-no-line.png

But what if the dots look like this and we want to estimate the value of 𝑦 when the value of 𝑥 is 25? By looking at it, It could very well be any value between 3.8 or 4.8. Imagine if this dataset was about saving the cost of a dying multinational company. Imagine if the scales and stakes would go up and our guestimation was dwindling between 38 million USD or 48 million USD.

Linear Regression

By looking at this graph, we are trying to find a similar line like we did in while interpolating. But we do not want it to pass exactly through the points. In fact, it should be as close as possible to all the points.

regression-with-line-without-missing-wiith-least-squares.png

One way to do this is if we would start drawing the smallest squares from each dot so that if there is a line which touches the other end of the square, it is straight. This method is called the least squares method.

regression-with-line-with-missing-data.png

Now, if we were to estimate the value of 𝑦 when the value of 𝑥 is 25, or in other words, if we were to estimate 𝐹(25), we get to a conclusion that it should be 4.226699.

I needed the computer to do this. Find the value of 𝑦 when the second number in the dataset was null. Programmatically.

The Code

Our objective is to make the function of this line. Let's code along. We know that the a straight line follows the equation y = mx + c. Where y is our function, m is the slope or gradient of the line with respect to the webpage's bottom edge (horizontal axis) and x is what we need to solve for. Here's the dataset that built the scatterplot above.

dataset-x-y.png

Loading...

The expression for finding the slope m is

(N Σ(xy) − Σx Σy)/ (N Σ(x sq.) − (Σx) sq)

where N is the number of datapoints. Which means our function will also need all the datapoints to calculate the sum of all x square, xy, x, and y . dataset-sumation.png

Loading...
m = (N Σ(xy) − Σx Σy)/ (N Σ(x sq.) − (Σx) sq)
c = (Σy − m Σx)/N
Loading...

Conclusion

Of course, we were dealing with a [time, value] pairs instead of [value,value] pairs but our predictor function was similar. This use case was also solved quickly because the conditions to solve the problem met the criteria to use this technique.

Screenshot 2020-09-17 at 13.21.16.png

Linear Regression should only be used when the data value pairs are somehow related. There are also outliers and residual plots we need to be aware of. This is where things start going supremely data-sciency where I won't be able to keep up.

Interesting Readings:

More stuff to come soon! Cheers! :D