- Jump to section
- Weather data is big data
- Tables versus multidimensional arrays
- An array-native infrastructure
- Closing thoughts
The transition to renewables relies heavily on accurate weather forecasting. Wind farms and solar parks produce energy when the wind blows and the sun shines, and unforeseen weather patterns often lead to volatile power prices and high balancing costs.
The good news: Numerical weather prediction (NWP) models have seen a ‘quiet revolution’ – significant, steady progress rarely headlined in the news. Following many advances such as launching satellites, building new weather stations, and improving data infrastructure, we can nowadays do amazing things with atmospheric simulations.
The bad news: The incredible performance of NWP models also produces an immense amount of data to store and manage. Weather data is high volume, and storing it is difficult, mainly because mainstream data storage systems were not designed around its specific requirements.
In this article, I map out the difficulties of storing large weather data, share insights into the custom solution we developed at Dexter Energy, and explain how this approach benefits our customers. Ultimately, I propose a shift to array-native storage, a tailored infrastructure designed to handle weather data complexities efficiently.
Weather data is big data
Let’s start by understanding what makes weather data unique – and high-volume. NWP models simulate the atmosphere and oceans by solving physical equations (i.e., the flow of fluids) and return predictions for hundreds of meteorological variables, ranging from the ocean to the upper atmosphere.
The image below provides an example output from the ECMWF high-resolution model. Picture for a second the type and amount of data that lies behind it. Note that, for every ‘piece’ of the atmosphere, there is a cell holding a type of information across:
- The position on Earth, i.e., spatial dimensions in latitude, longitude, and height;
- The precise timestep;
- Different meteorological variables, e.g., irradiation, wind speed, temperature, pressure, and humidity.
The amount of output data is amplified by the resolution. The finer the spatial resolution and the higher the temporal frequency, the more data.
This variety of measurements taken at different locations and times accounts for the multidimensional nature of weather information. Consequently, it results in a staggering volume of data, with modern high-resolution NWP models covering millions of data points.
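To make the volume concrete, here is a back-of-envelope sizing of a single model run. The grid, step, and variable counts below are illustrative assumptions for a generic global model, not any particular provider's real specifications:

```python
# Back-of-envelope sizing for a single, hypothetical global NWP run.
# All numbers below are illustrative assumptions, not a provider's real specs.
lat_points = 721        # 0.25-degree global grid, latitude
lon_points = 1440       # 0.25-degree global grid, longitude
levels = 13             # vertical (pressure) levels
timesteps = 85          # e.g. hourly steps out to ~3.5 days
variables = 10          # wind, temperature, pressure, humidity, ...
bytes_per_value = 4     # 32-bit floats

total_bytes = (lat_points * lon_points * levels
               * timesteps * variables * bytes_per_value)
print(f"{total_bytes / 1e9:.1f} GB per model run")  # 45.9 GB per model run
```

Multiply a figure like this by several runs per day and multiple providers, and the storage problem becomes obvious.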
At Dexter, we currently ingest data from all major weather data providers daily, from large global institutes to local providers, both private and commercial. We also store two, three, and sometimes five years of historical forecasts, all accessible to our data scientists at a moment’s notice. This enables us to develop our machine learning products and to offer our customers the most reliable backtests.
What does it amount to? Our historical NWP forecast data adds up to more than 500 terabytes (half a petabyte), and we store more than one terabyte of new forecasts daily. If you showed these forecasts on a high-definition TV, you’d have 2,500 hours of footage – every day.
Tables versus multidimensional arrays
Most modern big data infrastructures are not ideal for storing large amounts of weather data, primarily because these technologies evolved to address the specific needs and challenges of the business domain.
The ever-growing volume of structured business data, such as customer records or financial transactions, is well suited for representation in tabular formats. Thus, we’ve seen developments in relational databases and SQL-based querying tailored to this type of information.
But unlike business data, weather information is, as outlined, inherently multidimensional, encompassing various spatial and temporal dimensions with numerous meteorological variables. While it too can be represented in tabular form, a storage format that natively ‘understands’ this multidimensional nature is more effective, allowing proper access to the data across its different dimensions.
Therefore, our paradigm shift involves moving away from the conventional tabular mindset and organizing weather data as dense multidimensional arrays. Imagine going from a spreadsheet to a sophisticated 3D model where each layer holds information about specific meteorological variables.
This shift is something the geospatial community understands well, and we make use of many open-source packages for data analysis and visualization, such as the excellent xarray package.
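The difference between the two mindsets is easy to demonstrate with a toy example, using plain NumPy here as a stand-in for richer tools like xarray. The same values can be stored either as a dense cube or as a long table, but the cube gives direct access along each dimension:

```python
import numpy as np

# Hypothetical toy grid: 3 timesteps x 4 latitudes x 5 longitudes of one variable.
temps = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)

# Array-native access: one slice per dimension, no scanning or joins.
snapshot = temps[1]            # the full spatial field at timestep 1
point_series = temps[:, 2, 3]  # the time series at one grid cell

# The same data as a "long" table needs one row per (time, lat, lon) value ...
rows = [(t, la, lo, temps[t, la, lo])
        for t in range(3) for la in range(4) for lo in range(5)]

# ... and recovering the time series means filtering every row.
table_series = [v for (t, la, lo, v) in rows if la == 2 and lo == 3]

assert list(point_series) == table_series
print(len(rows), "table rows vs array of shape", temps.shape)
```

At toy scale both work; at petabyte scale, the row-filtering pattern on the tabular side becomes the bottleneck.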
However, there is no perfect and readily available database technology for efficient large-scale storage and retrieval of this data. At least, not one that meets our needs of being able to retrieve years of data in seconds for our research and production workloads.
An array-native infrastructure
To address the limitations of existing big data systems and leverage the benefits of dense multidimensional arrays, we developed an in-house, array-native storage solution on top of a carefully selected storage format. This custom infrastructure, likely unique in our industry, further aligns with cloud usage and benefits our customers.
The approach
We started with a conventional big-data-optimized database with a tabular structure and concluded that it is not optimal for storing terabytes upon terabytes of weather data. Transitioning to array databases, we explored several file formats before settling on an embedded, cloud-optimized database that fully decouples the storage and compute layers.
There were three decisive criteria in our selection. The storage format had to:
- Support safe, parallel writes. In our work, we want to process and save raw weather data for different days without risking data corruption. With most array data formats, concurrent writes risk two processes overwriting the same file.
- Permit incremental writes but efficient queries over extended time periods. This is important because we write meteorological data per day, but we need to query it for very long periods, since patterns and trends over time are often of interest. Using consolidation features, we can efficiently query large arrays.
- Work well with cloud object (blob) storage. This modern form of storage is several times cheaper and much more scalable than disk (block) storage but offers fewer guarantees.
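To illustrate the first two criteria, here is a minimal, hypothetical sketch of chunked ('tiled') storage over a key-value store. A plain dict stands in for a cloud object store, and the layout only loosely mimics cloud-optimized array formats such as Zarr or TileDB; it is not the format we actually use:

```python
import numpy as np

# A plain dict standing in for a cloud object (blob) store.
store = {}
CHUNK = 24  # one chunk per day of hourly data

def write_day(day, values):
    """Each day maps to its own key, so concurrent writers for different
    days never touch the same object -- no risk of overwriting each other."""
    assert len(values) == CHUNK
    store[f"temperature/{day}"] = np.asarray(values, dtype=np.float32)

def read_range(start_day, end_day):
    """Queries over long periods just concatenate the relevant chunks."""
    return np.concatenate(
        [store[f"temperature/{d}"] for d in range(start_day, end_day)]
    )

# Incremental daily writes ...
for day in range(365):
    write_day(day, np.full(CHUNK, 15.0 + day % 10))

# ... but efficient reads across months.
year = read_range(0, 365)
print(year.shape)  # (8760,)
```

Real formats add compression, metadata consolidation, and multidimensional tiling on top of this basic idea, but the key property is the same: disjoint writes, range reads.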
Once we selected our storage format, we built a layer on top of it to translate to and from weather data. This was necessary because weather data comes with many different geographical projections, resolutions, time-filtering methods, and other peculiarities that require translation.
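As one small, hypothetical example of the kind of normalization such a translation layer performs: some providers publish longitudes in [0, 360), others in [-180, 180), and a single convention must be enforced before data from different sources can be aligned:

```python
import numpy as np

def normalize_lon(lon):
    """Map longitudes from either convention to [-180, 180)."""
    return (np.asarray(lon) + 180.0) % 360.0 - 180.0

# 270 degrees East in the [0, 360) convention is 90 degrees West.
print(normalize_lon([0, 90, 270, 359]))
```

Projections, grid resolutions, and time axes each need analogous, and usually far more involved, translations.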
The benefits
This approach offers several advantages, including enhanced analysis, better compression, faster retrieval, and improved compatibility with mathematical and statistical operations. Overall, it is a much easier format to work with, allowing for very powerful postprocessing.
But beyond making our lives easier, we are able to offer our customers better forecasting products because we can:
- Better leverage AI. AI models rely on a lot of historical training data coupled with the ability to do extensive feature engineering, and our weather forecast storage paradigm supports this.
- Make precise backtests and tune the models just right.
- Quickly process the large volume of weather forecasts that arrive continuously and have them available for forecasting.
The result
Our data pipelines run 24/7 to extract all necessary information to create the best forecasts. Some of our AI models use a single time series of three years for one point, while others use the gridded forecast for thousands of points across an entire continent. Additionally, we have our data scientists doing research and our meteorologists plotting weather forecasts to gain insights and inform better modeling of phenomena.
Below is a simple visualization of some of the forecast data: a time series from the high-resolution forecast model ICON-D2 and one of the corresponding 2D images. Thanks to the division into tiles, we can quickly query along both the time and space dimensions, satisfying all our AI models’ and data scientists’ hunger for data.
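The two access patterns this enables can be sketched on a toy forecast cube; the shapes below are illustrative, not real grid sizes:

```python
import numpy as np

# Toy forecast cube with dimensions (time, lat, lon).
cube = np.random.default_rng(0).random((72, 100, 120))

# Pattern 1: a long time series at a single grid point (e.g. one wind farm).
series = cube[:, 40, 55]

# Pattern 2: one full 2D field at a single timestep (e.g. a weather map).
field = cube[10, :, :]

# Tiling the storage along both time and space means either query only
# touches the tiles it overlaps, instead of reading the whole cube.
print(series.shape, field.shape)  # (72,) (100, 120)
```

Without tiling, one of these two patterns is always served poorly: a purely time-ordered layout makes spatial reads expensive, and vice versa.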
The footprint
Last but not least, since our company’s reason for existing is to have a positive impact on the climate, we cannot neglect the carbon emissions resulting from computationally intense data operations. Our decoupling of storage and compute, together with efficient compression of array data, contributes to a reduced carbon footprint.
Closing thoughts
Weather forecast data storage and processing is a constant challenge, mirroring the chaotic interactions in the atmosphere it represents. In this article, I aimed to match the problem with a solution. The main takeaways I’d like to leave our readers with are:
- Weather data is high volume due to its inherently multidimensional nature;
- The tabular format in most modern big data infrastructures is not ideal for weather data;
- Organizing weather data as dense, tiled multidimensional arrays allows for more powerful postprocessing.
The approach I presented here is the result of a three-year journey at Dexter Energy. This transition to array-native storage has empowered us to store extensive historical forecasts and offer on-the-fly querying for our machine learning products. Ultimately, it has allowed us to offer better AI-powered forecasts and reliable backtests.
Interested in a demo of our Power Forecast? Let us know.