Since early July, I have been working on a project to predict snowdays using machine learning. In this post I will explain how I built the neural network that is the backbone of this project. You can find it on GitHub here. The next step is to create a web app that uses the network and the weather forecast to give you predictions. You can track this app’s development here, and I intend to finish it before school starts in September.
Process
Collecting the Raw Data
Before starting the project, adequate data had to be collected. The data needed to consist of hundreds of examples of snowdays, 2hr delays, and early closings. To mitigate the effect of variables such as state, region, and school type, only suburban, public districts in the capital region of New York State were selected. These districts were East Greenbush (Columbia), Bethlehem, Saratoga, Schodack, and Shenendehowa. The data was retrieved from the districts’ Facebook pages or by requesting it from each district’s superintendent, and it goes back about a decade.
Adding Regular Days
In order for the network to distinguish normal days from “events” such as snowdays, 2hr delays, and early closings, regular days had to be added to the data. Most of the regular days added fall between November and March, since events outside this range are rare and the network most needs to learn the difference between a regular winter day and one on which an event occurs. Adding these days increased the size of the data by about 250%. This high number was chosen because events are much rarer than regular days, and the training data needs to reflect that.
Retrieving the Weather Data
Once the days the network would train on were set, the weather data for each one had to be retrieved to create the training data. This was done for each day using Dark Sky’s API. For each day, 10 weather measurables were added on an hourly basis over 25 hours, from 2pm the day before to 3pm the day of. This window was chosen because the weather the day before can influence a district’s decision to close, while the weather after school has ended does not. The measurables were wind gust, temperature, dew point, humidity, apparent temperature, pressure, wind speed, visibility, precipitation intensity, and precipitation probability. Each of these 10 measurables over 25 hours, plus the day and month, gives 10 × 25 + 2 = 252 data points for each day.
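To make the 252-value layout concrete, here is a minimal sketch of how one day could be flattened into a single input vector. The key names, the grouping order, and the function name build_feature_vector are my own assumptions for illustration; they are not Dark Sky's field names and not necessarily how the project's code arranges the data.

```python
# The ten hourly measurables named in the post. These key spellings are
# illustrative and are not claimed to match Dark Sky's response fields.
MEASURABLES = [
    "wind_gust", "temperature", "dew_point", "humidity", "apparent_temperature",
    "pressure", "wind_speed", "visibility", "precip_intensity", "precip_probability",
]
HOURS = 25  # hourly readings from the afternoon before through the afternoon of


def build_feature_vector(hourly, day, month):
    """Flatten 25 hourly readings plus the date into one 252-value list.

    `hourly` is a list of 25 dicts, one per hour, keyed by MEASURABLES.
    The ordering (grouped by measurable, date last) is an assumption.
    """
    if len(hourly) != HOURS:
        raise ValueError(f"expected {HOURS} hourly readings, got {len(hourly)}")
    features = []
    for name in MEASURABLES:
        # All 25 values of one measurable are kept together.
        features.extend(float(reading[name]) for reading in hourly)
    features.extend([float(day), float(month)])
    return features


# Dummy readings only, to check the shape of the result.
fake_hours = [{name: 0.0 for name in MEASURABLES} for _ in range(HOURS)]
vector = build_feature_vector(fake_hours, day=14, month=2)
print(len(vector))  # 10 * 25 + 2 = 252
```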
Data Pre-Processing
Before the data was fed through the network for training, it had to be normalized. Each weather measurable was scaled to between 0 and 1 based on a pre-defined range. This way, training data, testing data, and prediction data could all be normalized on the same scale. Finally, the data was shuffled randomly so that the network would not drift in a particular direction while training on one stretch of similar examples. With that, the training data was ready.
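A rough sketch of this step is below. The ranges in RANGES are made-up placeholders, since the real pre-defined ranges are not listed in this post, and clamping out-of-range values is my own assumption. The helper names and the shuffle seed are also just for illustration.

```python
import random

# Hypothetical pre-defined (low, high) ranges; the project's real ranges
# are not given in the post.
RANGES = {
    "temperature": (-40.0, 110.0),  # degrees Fahrenheit, assumed
    "humidity": (0.0, 1.0),
    "wind_speed": (0.0, 100.0),     # miles per hour, assumed
}


def normalize(value, low, high):
    """Scale value into [0, 1] using a fixed range, clamping anything outside it."""
    scaled = (value - low) / (high - low)
    return min(1.0, max(0.0, scaled))


def shuffle_examples(examples, seed=0):
    """Return a shuffled copy; a fixed seed keeps the order reproducible."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


low, high = RANGES["temperature"]
print(normalize(35.0, low, high))   # 0.5
print(normalize(-60.0, low, high))  # 0.0 (clamped)
```

Because the range is fixed ahead of time rather than computed from the data, a forecast fetched months later lands on the same scale as the training examples.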
Testing Data
In order to test the network, all of East Greenbush’s data from 2017 was taken out of the training data. After training, the network could then be tested on every day in East Greenbush in 2017 to assess its ability on days it had never seen.
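The hold-out idea can be sketched in a few lines. The record layout here (dicts with "district" and "year" keys) and the function name split_holdout are hypothetical; the project's actual data format is not shown in this post.

```python
def split_holdout(examples, district="East Greenbush", year=2017):
    """Move every example from one district and year into a separate test set."""
    train, test = [], []
    for example in examples:
        if example["district"] == district and example["year"] == year:
            test.append(example)
        else:
            train.append(example)
    return train, test


# Tiny made-up records, only to show the split.
records = [
    {"district": "East Greenbush", "year": 2017, "label": "snowday"},
    {"district": "East Greenbush", "year": 2016, "label": "regular"},
    {"district": "Bethlehem", "year": 2017, "label": "2hr delay"},
]
train_set, test_set = split_holdout(records)
print(len(train_set), len(test_set))  # 2 1
```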
Network Training
The neural network used in this project is a feedforward multilayer perceptron. It consists of 4 layers: an input layer of size 252 corresponding to the 252 data points for each day, two hidden layers with 90 neurons each, and an output layer of size 4 corresponding to the 4 possible outcomes of a day. The 449 example days in the training data are run through 151 epochs with a batch size of 32. Finally, the learning rate is set to 0.02.
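For a sense of scale, the arithmetic below works out how many trainable parameters a network of this shape has and how many weight updates the training run performs. It assumes fully connected layers with bias terms, one update per batch, and that the final partial batch of each epoch is kept; none of those details are stated above, so treat the numbers as a sketch.

```python
import math

LAYER_SIZES = [252, 90, 90, 4]  # input, two hidden layers, output (from the post)
EXAMPLES, EPOCHS, BATCH_SIZE = 449, 151, 32


def count_parameters(sizes):
    """Weights plus biases for fully connected layers (bias terms are assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


# 449 / 32 is a little over 14, so the last batch of each epoch is partial.
batches_per_epoch = math.ceil(EXAMPLES / BATCH_SIZE)

print(count_parameters(LAYER_SIZES))  # 31324
print(batches_per_epoch)              # 15
print(batches_per_epoch * EPOCHS)     # 2265
```

With roughly 31,000 parameters and only 449 training examples, the network has far more weights than examples to learn from.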
Results
With the network structure described above and a random seed of 1234, the network reaches an accuracy of 89% on the training data after the final epoch, with a loss of 0.348. On the 2017 testing data, the network achieved an accuracy of 99.5%. While that number is impressive, it should be noted that an accuracy of 97.5% could be reached by simply guessing a regular day for every day of the year. More impressive is the network’s event accuracy, which is 80%. This is the accuracy measured over only those days on which the network predicted an event or an event actually occurred. It shows that the network is accurate when tasked with classifying difficult winter days. The network correctly classified 4 snowdays and 4 2hr delays, and made only two mistakes: it classified one snowday as a 2hr delay and one regular day as a 2hr delay.
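Here is a small sketch of how event accuracy differs from overall accuracy, rebuilt from the counts reported above. It assumes all 365 calendar days of 2017 were tested; the function names and label strings are mine.

```python
REGULAR = "regular"


def overall_accuracy(pairs):
    """Fraction of (actual, predicted) pairs that match."""
    return sum(actual == predicted for actual, predicted in pairs) / len(pairs)


def event_accuracy(pairs):
    """Accuracy over days where the prediction or the actual outcome is an event."""
    relevant = [(a, p) for a, p in pairs if a != REGULAR or p != REGULAR]
    return sum(a == p for a, p in relevant) / len(relevant)


# (actual, predicted) pairs matching the results described in the post.
pairs = (
    [("snowday", "snowday")] * 4
    + [("2hr delay", "2hr delay")] * 4
    + [("snowday", "2hr delay")]   # snowday mistaken for a delay
    + [(REGULAR, "2hr delay")]     # regular day mistaken for a delay
)
# Assumption: the rest of the 365 days were regular and predicted as regular.
pairs += [(REGULAR, REGULAR)] * (365 - len(pairs))

# Accuracy of always guessing "regular".
baseline = sum(actual == REGULAR for actual, _ in pairs) / len(pairs)

print(round(event_accuracy(pairs), 3))    # 0.8
print(round(overall_accuracy(pairs), 3))  # 0.995
print(round(baseline, 3))                 # 0.975
```

The 80% comes from 8 correct calls out of only 10 relevant days, so a single extra mistake would move it by ten percentage points.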
Doubts About Accuracy
While the results above are very good, there are several reasons why they may not be representative of the network’s true accuracy. It is possible that the network just happens to be good at classifying the days in 2017 due to random chance. There are only 9 events in 2017 behind the 80% event accuracy, and changing the random seed can drastically reduce that figure, so chance may well play a large part in the results. A more reliable way to test the network’s accuracy would be to train on all but one example and test on that one example, then repeat this process for every single example in the data (a technique known as leave-one-out cross-validation). This would make use of the maximum amount of training and testing data, so that a more trustworthy accuracy could be measured.
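The train-on-all-but-one procedure described above is usually called leave-one-out cross-validation. Below is a minimal sketch of the loop. To keep it self-contained, it uses a stand-in "model" that always predicts the most common label in its training data; in the real project, train_fn and predict_fn (both hypothetical names) would train and query the neural network instead.

```python
from collections import Counter


def leave_one_out_accuracy(examples, train_fn, predict_fn):
    """Train on all examples but one, test on the one left out, repeat for each."""
    correct = 0
    for i, (features, label) in enumerate(examples):
        training = examples[:i] + examples[i + 1:]
        model = train_fn(training)
        correct += predict_fn(model, features) == label
    return correct / len(examples)


def train_majority(training):
    """Stand-in for network training: remember the most common label."""
    return Counter(label for _, label in training).most_common(1)[0][0]


def predict_majority(model, features):
    """Stand-in for prediction: ignore the features, return the stored label."""
    return model


# Toy (features, label) examples, not real weather data.
toy = [([0.1], "regular"), ([0.2], "regular"), ([0.3], "regular"), ([0.9], "snowday")]
print(leave_one_out_accuracy(toy, train_majority, predict_majority))  # 0.75
```

The catch is cost: the model is retrained once per example, so this takes hundreds of full training runs instead of one.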
Web App
I am currently working on a web app to make the network interactive for everyone. In it, users will be able to enter their location and get the network’s predictions based on the weather forecast. Its GitHub repository can be found here, and I intend to finish it before school starts in early September.