Some time in May I started a script on the web server that collects weather information about selected cities. It is launched every hour and saves data like temperature, wind direction, wind speed and general weather condition (cloudy/sunny/etc). So it has been slowly doing its job for several months now.

Last night right before going to sleep I remembered this script and decided that I should do something interesting with the data that it has collected so far and will collect in the future. So this is what I did:

1. First, the data is collected from a single weather site, chosen mainly because it seems to get the tricky UK weather right more often than the others.

2. On the page of each city there is also a link to an RSS feed, which lends itself quite nicely to information mining. I use PHP and MagpieRSS to extract the relevant information and save it to a MySQL database. The script is launched once every hour through a cron job.
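To make the extract-and-store step a bit more concrete, here is a rough sketch in Python rather than the PHP and MagpieRSS actually used. Everything specific in it is an assumption for illustration: the layout of the feed (`SAMPLE_RSS`), the `readings` table and its columns, the timestamp, the Celsius unit, and SQLite standing in for MySQL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical feed snippet; the real feed's layout is not shown in this post.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Cambridge</title>
  <item>
    <title>Current conditions</title>
    <description>Temperature: 14 C, Wind: SW 12 mph, Condition: cloudy</description>
  </item>
</channel></rss>"""


def parse_reading(rss_text):
    """Split the first item's description into a dict of lower-cased fields."""
    root = ET.fromstring(rss_text)
    description = root.find("./channel/item/description").text
    fields = {}
    for part in description.split(","):
        key, _, value = part.partition(":")
        fields[key.strip().lower()] = value.strip()
    return fields


def save_reading(conn, city, observed_at, fields):
    """Store one hourly reading (table and column names are made up here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(city TEXT, observed_at TEXT, temperature_c REAL, wind TEXT, sky TEXT)"
    )
    temperature_c = float(fields["temperature"].split()[0])
    conn.execute(
        "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
        (city, observed_at, temperature_c, fields["wind"], fields["condition"]),
    )
    conn.commit()


# In-memory database and a placeholder timestamp, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
save_reading(conn, "Cambridge", "2000-01-01 12:00", parse_reading(SAMPLE_RSS))
```

In a real setup the cron job would fetch the feed over HTTP once an hour and call something like `save_reading` with the current timestamp.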

3. Since my two main locations are Tallinn and Cambridge, it would be interesting to visually display all the information about those cities. I use the JPGraph library to draw a graph and here is the result:

4. The rapid fluctuations are due to changes in temperature between day and night. This is useful for seeing the extreme conditions in both cities, but let’s clean it up a bit and draw a graph of temperatures averaged over 24 hours:
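One simple way to get such a smoothed curve from hourly readings is a trailing moving average. The Python sketch below assumes that interpretation (each point is averaged with the 23 readings before it); the function name `moving_average` is made up for this example, and averaging per calendar day would be an equally valid reading of "averaged over 24 hours".

```python
def moving_average(values, window=24):
    """Average each reading with up to window - 1 readings before it."""
    averaged = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the reading that just fell out of the window.
            running_sum -= values[i - window]
        count = min(i + 1, window)  # fewer points are available at the start
        averaged.append(running_sum / count)
    return averaged


# Tiny example with a window of 2 instead of 24 to keep it readable.
smoothed = moving_average([1, 2, 3, 4], window=2)
```

With hourly data and `window=24`, the day/night swings mostly cancel out and only the slower trend remains.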

5. Ok, enough of drawing information from the past – could we also predict the weather using this data? Here is my rudimentary attempt. Each point in the graph can be represented using a feature vector (a sequence of properties) consisting of previous temperatures. For example, if I have a sequence of temperatures (10, 11, 14, 14, 13), then I could say that I’ll represent the last point (13) using the previous values (10, 11, 14, 14).
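As a small illustration of that representation, the Python sketch below turns a temperature sequence into (feature vector, target) pairs. The helper name `build_examples` and its `size` parameter are invented for this example; the real script is written in PHP.

```python
def build_examples(temperatures, size=4):
    """Pair every reading with the `size` readings that came right before it."""
    examples = []
    for i in range(size, len(temperatures)):
        features = tuple(temperatures[i - size:i])
        examples.append((features, temperatures[i]))
    return examples


examples = build_examples([10, 11, 14, 14, 13], size=4)
# examples == [((10, 11, 14, 14), 13)]
```

A longer history simply produces one such pair for every reading that has enough predecessors.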

Now when I try to predict 1 hour into the future, the system has no idea what the temperature will be but it will know the feature vector of that (yet unseen) point. So I compare the feature vector with the feature vectors of all the other points I have from the past – if I find one that is very similar, then there is a good chance the future temperature will also be similar to the known one.

I construct feature vectors consisting of 8 previous measurements and average over 3 points from the past that are most similar. Since the feature vector can be thought of as a coordinate in 8 dimensions, I can use Euclidean distance to find the similarity/dissimilarity.
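Put together, this is a nearest-neighbour lookup. The sketch below is a Python approximation of the idea (8 previous measurements, plain Euclidean distance, average of the 3 closest matches); the names `euclidean` and `predict_next` are hypothetical and the actual PHP code may differ in its details.

```python
import math


def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_next(temperatures, size=8, k=3):
    """Predict the next reading from the k most similar moments in the past."""
    if len(temperatures) < size + 1:
        raise ValueError("need at least size + 1 readings")
    # Feature vector of the not-yet-seen point: the latest `size` readings.
    query = temperatures[-size:]
    candidates = []
    for i in range(size, len(temperatures)):
        features = temperatures[i - size:i]
        candidates.append((euclidean(query, features), temperatures[i]))
    # Keep the k past points whose history looks most like the present.
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(target for _, target in nearest) / len(nearest)


# A perfectly periodic toy series: the best matches are all followed by 1.
forecast = predict_next([1, 2, 3, 4] * 6)
```

Each prediction scans the whole history once, so the cost grows linearly with the amount of collected data.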

However, I found that a modified version of this, where more recent measurements are weighted higher, gave better results. Mathematically it is not really a proper distance any more, but I guess it can be thought of as a temporal version:
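The exact weighting is not reproduced in the text here, so the Python sketch below only shows one possible choice, not necessarily the one used for these graphs: linearly increasing weights that sum to 1, with the newest measurement counting most. Both `recency_weights` and `weighted_distance` are names invented for this example.

```python
import math


def recency_weights(size):
    """Linearly increasing weights that sum to 1; the newest slot weighs most."""
    total = size * (size + 1) / 2
    return [(i + 1) / total for i in range(size)]


def weighted_distance(a, b, weights):
    """Euclidean-style distance where each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


weights = recency_weights(4)
base = (10, 11, 14, 14)
# The same 2-degree difference counts for more in the newest slot than in the oldest.
old_mismatch = weighted_distance(base, (12, 11, 14, 14), weights)
new_mismatch = weighted_distance(base, (10, 11, 14, 16), weights)
```

Swapping a function like this in for the plain Euclidean distance leaves the rest of the nearest-neighbour search unchanged.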

Here is the result for Cambridge and Tallinn:

6. And that’s it. The graphs are updated automatically every hour, so it is possible to follow the progress in almost real time :). I guess we’ll see whether such a basic algorithm can predict anything and whether it gets better over time. Of course there are much better machine learning and data modeling methods out there, but this is just meant to be a lightweight exercise. There are also some restrictions due to the web server: the algorithm is implemented in PHP and has to finish in under 30 seconds. If anyone has ideas on how to improve this, let me know.
