July 05, 2017
In this article I will go through what we consider to be the single most important preprocessing step when analysing time series data from the Internet of Things (IoT): linear resampling. I will describe the challenges of resampling IoT sensor data and introduce a simple solution we devised to address them.
In the Internet of Things (IoT), sensors and other connected devices collect data. These data are sent to the cloud to be processed, analysed, or modelled, with the goal of building smart applications.
Because IoT devices usually don’t emit data at regular intervals, IoT time series data are highly irregular in their sampling rate, both within a single device and across devices. This can make processing and analysis cumbersome, especially when data from multiple sensors are combined and analysed together.
Therefore, resampling irregular, unevenly spaced time series data to a consistent and regular frequency is a crucial preprocessing step in IoT that greatly simplifies all subsequent handling of the data.
At WATTx, a Berlin-based venture builder, we specialise in deep tech, including IoT. In one of our latest projects, we built a prototype for energy optimisation in a building in which we installed hundreds of sensors. These sensors collected information on temperature, humidity, luminosity, and motion. Below, you can see one of the floor maps illustrating the setup:
Smart Office IoT setup
A particularly interesting finding from analysing the data coming from these sensors in the office is visualised here:
Clusters of averaged daily temperature curves
The plot illustrates clusters of averaged temperature curves and shows that overheating, during the night as well as during the day, occurs in a large number of rooms, indicating suboptimal usage of the HVAC system and therefore wasted energy.
Before running analyses similar to the one above, a crucial preprocessing step is to convert irregular time series data to a regular frequency, consistently across all sensors. In doing so, we remove the pain of having to deal with irregular and inconsistent cross-sensor timestamps in later analysis processes.
Let’s take a look at a simple example of a simulated temperature curve with unevenly spaced datapoints:
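For readers who want to follow along, here is a minimal sketch of how such a curve could be simulated. The date, the number of readings, the shape of the daily cycle, and the noise level are all arbitrary assumptions made for illustration; they are not the values behind the plots in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical simulation: one day of temperature readings at irregular times.
rng = np.random.default_rng(seed=0)

# Draw 40 distinct random offsets (in seconds) within 24 hours and sort them,
# so that the resulting timestamps are unevenly spaced.
offsets = np.sort(rng.choice(24 * 60 * 60, size=40, replace=False))
index = pd.Timestamp("2017-07-05") + pd.to_timedelta(offsets, unit="s")

# A smooth daily cycle (degrees Celsius) plus a little measurement noise.
hours = offsets / 3600.0
values = (
    21.0
    + 2.0 * np.sin(2 * np.pi * (hours - 9.0) / 24.0)
    + rng.normal(0.0, 0.1, size=len(offsets))
)

tseries = pd.Series(values, index=index, name="temperature")

# The time between consecutive readings varies from one reading to the next.
print(tseries.index.to_series().diff().describe())
```

The name `tseries` matches the variable used in the snippets in this article, so the simulated series can be passed straight into them.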
Now, let’s say we want to resample this temperature curve to a frequency of 50 minutes (arbitrary choice) using the mean as aggregation function and linear interpolation:
tseries.resample('50min').mean().interpolate()
As illustrated above, the resampled time series does not match the original data exactly but shows considerable shifts. This is mainly because pandas’ resampling function is, under the hood, a group-by-time operation: every reading is assigned to a time bin, and the aggregate of each bin is stamped with the bin label. This makes the result very sensitive to the exact sampling times of the original data and to the chosen target frequency.
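The binning behaviour can be seen in a tiny example. This is only an illustrative sketch with two made-up readings, relying on pandas' default settings (bins closed and labelled on the left, anchored at midnight of the first day):

```python
import pandas as pd

# Two hypothetical readings that fall into the same 50-minute bin.
tseries = pd.Series(
    [20.0, 22.0],
    index=pd.to_datetime(["2017-07-05 00:10", "2017-07-05 00:40"]),
)

# Group the readings into 50-minute bins and average each bin.
binned = tseries.resample("50min").mean()
print(binned)
```

Both readings land in the bin that starts at 00:00, so the output contains a single value of 21.0 stamped at 00:00, a time at which nothing was measured and before the temperature had even reached 20.0. Shifts of this kind are what show up in the resampled curve.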
To overcome these inconsistencies, we came up with a solution that solves this problem across all our IoT sensors that require linear interpolation.
This solution consists of a two-step resample process in which we (1) first upsample the original time series to a high frequency (for instance 1 minute or 1 second, depending on the resolution of the initial data) using the mean as aggregation function and linear interpolation and (2) then downsample the series to the desired goal frequency using a simple forward fill aggregation.
Applied to the above example curve, the code looks like this:
tmp = tseries.resample('1min').mean().interpolate()
resampled = tmp.resample('50min').ffill()
Let’s take a look at the result:
Evidently, the resulting curve of this two-step resampling process matches the underlying data much better, and it does so consistently across all our IoT sensors that require linear resampling.
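The two-step process can be wrapped in a small helper. The following is a minimal sketch rather than the code from our gist; the function name `resample_linear` and its parameters are hypothetical, and the three readings are made up so that the result is easy to check by hand.

```python
import pandas as pd


def resample_linear(series, freq="50min", fine_freq="1min"):
    """Two-step linear resampling (hypothetical helper for illustration)."""
    # Step 1: upsample to a fine, regular grid and fill it by linear interpolation.
    fine = series.resample(fine_freq).mean().interpolate()
    # Step 2: pick the value found at each target timestamp on the fine grid.
    return fine.resample(freq).ffill()


# Made-up readings: 20 degrees at 00:00, 21 at 01:00, 23 at 03:00.
tseries = pd.Series(
    [20.0, 21.0, 23.0],
    index=pd.to_datetime(
        ["2017-07-05 00:00", "2017-07-05 01:00", "2017-07-05 03:00"]
    ),
)

resampled = resample_linear(tseries)
print(resampled)
```

At 00:50 the helper returns roughly 20.83, the value on the straight line between the readings at 00:00 and 01:00, whereas the one-step version would put the plain bin average of 20.0 there.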
An additional situation we wanted to account for when resampling IoT sensor data is the occurrence of large gaps of missing data in the time series. This is, for instance, the case when sensors go offline for a few hours.
Here is an example of a simulated temperature curve that illustrates a gap of a few hours in which a sensor was offline during the night, roughly from 10pm to 6am:
If we resample this time series, the resulting datapoints look like this:
Clearly, the resampled data do not match the underlying true temperature curve for the period in which a big chunk of data was missing. To cover such cases we introduced an additional parameter that specifies the maximum gap of missing data that will still be interpolated. If a gap exceeds this threshold (we found that around 1 hour is a good choice for most of our cases), datapoints inside it are not interpolated but instead returned as NaN (Not a Number) values.
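One way to implement such a threshold is sketched below. This is an illustration under our own assumptions, not the code from the gist: the function name `resample_with_max_gap`, the parameter `max_gap`, and the example readings are hypothetical, and gap lengths are measured on the fine grid, so they are only accurate to the fine frequency.

```python
import numpy as np
import pandas as pd


def resample_with_max_gap(series, freq="50min", fine_freq="1min", max_gap="1h"):
    """Two-step resampling that leaves long gaps as NaN (hypothetical helper)."""
    # Step 1: upsample to a fine grid; minutes without readings become NaN.
    fine = series.resample(fine_freq).mean()
    interpolated = fine.interpolate()

    # For each fine-grid timestamp, find the observed timestamps on either side
    # and measure the distance between them.
    stamps = pd.Series(fine.index, index=fine.index)
    observed = stamps.where(fine.notna())
    gap = observed.bfill() - observed.ffill()

    # Discard interpolated points that lie inside a gap longer than max_gap.
    interpolated[gap > pd.Timedelta(max_gap)] = np.nan

    # Step 2: pick the value found at each target timestamp on the fine grid.
    return interpolated.resample(freq).ffill()


# Made-up readings with the sensor offline between 22:00 and 06:00.
tseries = pd.Series(
    [22.0, 21.5, 21.0, 19.0, 19.5],
    index=pd.to_datetime(
        [
            "2017-07-05 20:00",
            "2017-07-05 21:00",
            "2017-07-05 22:00",
            "2017-07-06 06:00",
            "2017-07-06 07:00",
        ]
    ),
)

result = resample_with_max_gap(tseries)
print(result)
```

In this example the points between 22:00 and 06:00 come back as NaN, while the one-hour stretches between the remaining readings are still interpolated because they do not exceed the threshold.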
Applying this extra functionality, we get the result that we want: regularly sampled data with null values for periods in which no data was transmitted:
In this blog post we demonstrated the challenges of resampling irregular IoT time series data, as well as a simple solution we devised to overcome them. The code for our simple resampling solution for IoT sensor data is available in this GitHub gist.
For more exciting content from our work at WATTx, check out Statice — the project I’m currently working on, in which we develop technologies for enabling privacy-preserving data science.
This article is based on a talk I gave at PyData Berlin. If you want to see the slides from that talk you can find them here.
If you’re interested, you can also see my talk from PyData Amsterdam in which I go through in more detail what we’ve learned while working with IoT.