title: Open-Access Data and Toolbox for Tracking COVID-19 Impact on Power Systems
authors: Ruan, Guangchun; Yu, Zekuan; Pu, Shutong; Zhou, Songtao; Zhong, Haiwang; Xie, Le; Xia, Qing; Kang, Chongqing
date: 2021-12-10

Abstract: Intervention policies against COVID-19 have caused large-scale disruptions globally, and led to a series of pattern changes in power system operation. Analyzing these pandemic-induced patterns is imperative for identifying the potential risks and impacts of this extreme event. To this end, we developed an open-access data hub (COVID-EMDA+), an open-source toolbox (CoVEMDA), and several evaluation methods to explore what U.S. power systems experienced during COVID-19. These resources can be broadly used for research, public policy, and educational purposes. Technically, our data hub harmonizes a variety of raw data, including generation mix, demand profiles, electricity prices, weather observations, mobility, and confirmed cases and deaths. Typical methods are reformulated and standardized in our toolbox, including baseline estimation, regression analysis, and scientific visualization. The fluctuation index and probabilistic baseline are proposed here for the first time to account for data fluctuation and estimation uncertainty. On this basis, we conduct three empirical studies on U.S. power systems, and share new solutions and unexpected findings that address issues of public concern. This conveys a more complete picture of the pandemic's impacts, and also opens up several attractive topics for future work. Python and Matlab source code and user manuals are all publicly shared in a Github repository.

A. Background
The COVID-19 pandemic is a once-in-a-century crisis for the globe, causing 288.7 million infections and 5.4 million deaths by the end of 2021 [1]. Governments worldwide took a wide range of non-pharmaceutical interventions in response to the pandemic [2], and as a result, these restrictions and lockdowns significantly changed electricity consumption patterns and had a domino effect throughout entire power systems. Although the power sector has long prepared against a few predictable threats [3], large-scale, long-term, high-intensity interference of this kind is still quite unusual. A systematic [4] and empirical [5] perspective is critical to understand the pandemic's impacts on power systems. For power system operators, COVID-19 has opened up an opportunity to assess the abnormal operation patterns, and to identify future pathways for a sustainable recovery [6].

1) Data Issue: Preparing data from scratch repeatedly consumes the research time of every team, and even worse, potential academic communication might be interrupted or blocked. Reference [7] collected power consumption and meteorological data from the French system operator RTE and Meteo-France, but the cleaned dataset was not shared with the public. Similarly, references [8], [9], [11] did not directly share their datasets either. To date, two of the most popular data sources are the U.S. Energy Information Administration (EIA) [12] and the European Network of Transmission System Operators for Electricity (ENTSO-E) [22]. But users are still required to get familiar with complex data categories and storage rules, and to implement all the data preprocessing steps by themselves.
Additionally, it is even harder to access some rarer data sources, e.g., the Indian electricity market data [23] or the Swedish building standards and statistics [24]. Another finding is that most scholars (including those above) have rarely expanded their data categories to include cross-domain data that may inspire interdisciplinary studies.

2) Model Issue: Quite different methods, models, and criteria are applied across publications, but very few of them provide an open-source license. Benchmarking is so challenging under this condition that one may take a long time to reproduce even a basic function. This is, of course, not friendly to the public, students, or scholars in other fields. For example, an ordinary least squares model was used in [13] to analyze online survey data in California; since only a few household characteristics and respondent demographics were mentioned, extra effort would be needed to specify the detailed expressions. A joinpoint regression was applied in [25] to assess electricity load trends in Brazil and its geographic regions, but the discussion of model details was somewhat limited. Reference [26] developed three time-series models to determine the impacts on the Spanish electricity market, and reference [27] used a five-year moving average to establish a non-pandemic scenario. Unfortunately, neither work [26], [27] shared its code for public use. Another trend is that machine learning approaches are becoming increasingly popular in analyzing the potential impacts on the operation or resilience of power systems [28]. Reference [29] used five classical machine learning approaches for electric load forecasting in India. Reference [30] established a random-forest-bagging broad learning system for estimating daily confirmed cases. Learning models of other kinds were also effective, including deep learning models [31], capsule networks [32], and domain adaptation [33]. Although powerful, these models make it hard to reproduce a reported result because of their increasing complexity [34].

There is also an urgent need for methodological development in different use cases. For example, reliable baselines are critical for many applications, but limited contributions have been made toward stabilizing the estimation or accounting for uncertainty [35]. This becomes more challenging when the focus zooms into sub-categories, such as buildings [36] or electric vehicles [37], which have more complex operational dynamics. It has always been difficult to understand causality in fluctuating variables such as prices, and our empirical knowledge here is limited. New methods are expected to validate and demonstrate any unconventional factors [38], and to avoid degrading the data granularity in distributional calculations like [39].

3) Open Source Efforts: The open-source community has been actively involved in combating COVID-19 [40]. Perhaps the most prominent efforts in tracking the pandemic's impacts and sharing open data were made by Johns Hopkins University [1] and Oxford University [2]. In reference [1], an interactive dashboard was developed to cover all affected countries in real time, and the Oxford COVID-19 Government Response Tracker (OxCGRT) was established in [2] to assess the policy responses of over 180 countries and subnational jurisdictions.
Other efforts include CovidCounties (a public health data tracker at the level of U.S. counties) [41], COVID-ResNet (a radiography screening framework) [42], and OpenABM (an agent-based model for non-pharmaceutical interventions) [43]. All of these works, however, were mainly conducted in the public health field, with a special focus on confirmed cases, deaths, government responses, and so on. One of the few examples from the energy community is reference [44], where the authors made both their data and code available on Github. This is a positive step forward, but those resources were only collected for five months without timely updates. Dynamic data aggregation is generally needed, and the resources should preferably be fully free for use, unlike NRGStream (a paid service) used in [45]. To the best of our knowledge, we are the only team that develops and constantly upgrades open-source resources (data, methods, and a toolbox) to track the pandemic's impacts on power systems, and we have additionally collected extensive cross-domain data for interdisciplinary studies.

This paper makes a special effort to evaluate the COVID-19 impacts on power system operation and resilience. Here, the major contributions of our work are summarized as follows:
• The proposed data hub and toolbox have unique value for data-driven analysis of power system operation and resilience. A variety of data categories, from the power sector to public health, are collected, dynamically updated, and quality-controlled by a support team. The toolbox has two standalone versions, developed in Python and Matlab, to support diverse users such as scholars, policy makers, and educators.
• Typical methods are collected and fully standardized to be applicable to a wide range of use cases. The fluctuation index and probabilistic baselines are proposed for the first time to account for data fluctuation and confidence intervals in baseline estimation, so that more detailed dynamics and uncertainty features can be effectively detected.
• Three empirical studies are conducted on U.S. power systems to share several new perspectives, solutions, and unexpected findings. This reflects the high potential value and broad applicability of the proposed resources.

Note that we established a Github repository [46] to launch our data hub and toolbox online before finalizing this paper. This repository has attracted special attention from the power community, and has supported over 40 research groups or individuals so far, including a team from Florida State University and New York University [47]. In addition, it has been successfully applied in two graduate courses at Texas A&M University and Tsinghua University. Several analytical reports and open webinars have also featured and introduced our work.

The remainder of this paper is organized as follows: Section II introduces the overall framework, main features, and several quick start guides. Section III demonstrates the details of the data, models, and algorithms, and Section IV discusses the implementation in Python and Matlab. Three empirical studies are conducted in Section V. Finally, Section VI gives the concluding remarks.

This paper creates a Github repository [46] that consists of an open-access data hub (COVID-EMDA+) and an open-source toolbox (CoVEMDA). One can access these resources from the directories "data release/" and "toolbox/" respectively. Note that COVID-EMDA+ is the abbreviation for "Coronavirus Disease - Electricity Market Data Aggregation+", and CoVEMDA for "CoronaVirus - Electricity Market Data Analyzer".
Fig. 1 demonstrates the entire framework and workflow of the proposed data hub and toolbox. As shown, the backend system routinely runs the data formatter and quality controller to update the data hub. Outliers and missing data are largely handled with backup data or historical trends, and we also prepare a data quality report to record any highly problematic data. Fig. 1 lists the three built-in functions of the toolbox: baseline estimation, regression analysis, and scientific visualization. Users can run this toolbox from Python or Matlab consoles, and generate a variety of graphic and statistical results for further empirical studies. In addition, external data and user-defined models are both supported, providing great flexibility for special or advanced extensions. Readers may refer to the online manual for further details and quick start examples.

The whole system, including the data hub and toolbox, is maintained by a support team from Texas A&M University and Tsinghua University. Routine maintenance includes making regular backups, fixing bugs, handling feedback, upgrading the online systems, logging, and so on. We summarize the main features of the data hub (COVID-EMDA+) and toolbox (CoVEMDA) as follows:

A. Data Sources
Our data hub collects raw data from multiple sources: (i) electricity data from all regional system operators (e.g., CAISO for California, NYISO for New York), along with backup data from EIA and the EnergyOnline company, (ii) public health data from Johns Hopkins University, (iii) meteorological data from Iowa State University, (iv) mobile device location data (mobility data) from the Safegraph company, and (v) satellite image data from NASA (for visualization only). Readers may find the detailed links for all these sources on Github [46].

Most data records in our data hub can be expressed as $X_{ymdt}$. Here, $X$ is a placeholder for some variable, and the indices collectively specify a time point: year $y$, month $m$, day $d$, and hour $t$. We often use $X_{ymd}$ or $X_{ym}$ to represent different kinds of mean values, for example:
$$X_{ym} = \frac{1}{N_{\{d,t\}}} \sum_{d} \sum_{t} X_{ymdt} \qquad (1)$$
where $X_{ym}$ denotes the mean value of month $m$ in year $y$. It is derived by averaging $X_{ymdt}$ along the axes $d$ and $t$, and $N_{\{d,t\}}$ denotes the corresponding number of records.

One barrier to merging multiple data sources is their inconsistent data formats and structures, which can be further categorized into: (i) inconsistent data file formats and record frequencies, (ii) different definitions and abbreviations, and (iii) diverse data quality control policies. This motivates us to clean and standardize these messy data into wide data frames (well-known and efficient). Fig. 2 shows the proposed data structure in detail. Here, a wide data frame refers to a kind of unstacked table that has more columns than a long frame. This structure enables a more compact way to store data, and both the row-wise and column-wise operations have clear physical meanings. Besides, a variety of basic operational functions (e.g., filtering, resampling, and statistical computing) are available in Python and Matlab to handle such a matrix-like structure. We store the raw data (e.g., $\{X_{ymdt}\ \forall y, m, d, t\}$) as a wide data frame by assigning a date index (combining axes $y$, $m$, and $d$) to the rows and an hour index (axis $t$) to the columns. Fig. 2 also demonstrates how to finalize the released data after several preprocessing steps.
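To make the wide-frame layout concrete, the following minimal sketch (our own illustration with made-up column names, not a call into the released toolbox) pivots hourly long-format records into the date-by-hour structure described above, assuming pandas as the backend:

```python
import pandas as pd

# Illustrative long-format records: one row per (date, hour) observation.
long_df = pd.DataFrame({
    "date": ["2020-06-01", "2020-06-01", "2020-06-02", "2020-06-02"],
    "hour": [0, 1, 0, 1],
    "load_mw": [11520.5, 11013.2, 11876.0, 11340.8],
})

# Pivot to a wide frame: date index on the rows, hour index on the columns.
wide_df = long_df.pivot(index="date", columns="hour", values="load_mw")

# Row-wise and column-wise operations now carry clear physical meanings:
daily_mean = wide_df.mean(axis=1)   # average load of each day
hourly_mean = wide_df.mean(axis=0)  # average load of each hour across days
```

In this layout, resampling, filtering, and statistical computing all reduce to standard matrix-style operations on the frame.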
These clean and regularly updated data can either be downloaded from the Github repository [46] as offline files, or retrieved online directly through the toolbox functions. In fact, all the preprocessing steps have been automated by our backend system, which consists of a few web crawlers, a set of automation modules, a workflow controller, a quality controller, and a logging module. This backend system is scheduled to run periodically, and in each run, 31 raw data files from 25 sources are extracted and cleaned to update 73 spreadsheets. Here, outliers and missing data are efficiently detected and handled by analyzing historical trends or backup data, with different rules specialized for different variables. We further record some problematic data (very rare) in a quality control report for ease of reference.

Fig. 2. Demonstration of the proposed data structure and preprocessing steps. This procedure is automated and executed by the backend system.

Baselines refer to the reference status used for comparison. We focus on estimating a counterfactual baseline that assumes the absence of COVID-19; the difference between a counterfactual outcome and an actual observation then naturally substantiates the pandemic's impacts. In this respect, baseline estimation is recognized as the first and foremost step for many impact assessments, and a low-quality baseline may distort our judgment of the influence's intensity and duration. We next summarize a collection of five built-in methods that are applicable to most applications. The last two methods, proposed here for the first time, effectively capture the uncertainty of the estimations.

1) Date- and Week-Aligned Estimation: This method is simple but effective for many use cases where the only or major influential factor is time. The main idea is to choose proper historical records as the baselines. A date-aligned estimator selects the same date in the previous year or several years before:
$$X_{ymdt} \rightarrow X_{y'mdt}, \quad y' \le y - 1 \qquad (2)$$
where the annotated arrow links an observation (left) with its baseline (right). A week-aligned estimator instead selects a historical date that shares the same week-weekday index as the current date:
$$X_{ymdt} \rightarrow X_{y'm'd't} \qquad (3)$$
where the two dates should satisfy:
$$f_{\rm d2w}(y, m, d) = f_{\rm d2w}(y', m', d') \qquad (4)$$
In equation (4), the function $f_{\rm d2w}(\cdot)$ calculates the week number and weekday for a given date. For example, $f_{\rm d2w}(2020, 6, 1) = f_{\rm d2w}(2019, 6, 3)$ because both dates are Mondays of the 22nd week.
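As an illustration, the week-alignment rule in (4) can be reproduced with the ISO calendar. This is our own minimal sketch, not toolbox code; the function name f_d2w is borrowed from the text, and the paper's exact week-numbering convention may differ from ISO numbering (the alignment itself is what matters):

```python
from datetime import date

def f_d2w(y: int, m: int, d: int) -> tuple:
    """Return a (week number, weekday) index for a given date (ISO calendar)."""
    iso = date(y, m, d).isocalendar()
    return (iso[1], iso[2])  # ISO week number, weekday (1 = Monday)

# The paper's example: 2020-06-01 and 2019-06-03 share the same
# week-weekday index (both are Mondays of the same week of the year),
# so the latter can serve as the week-aligned baseline of the former.
assert f_d2w(2020, 6, 1) == f_d2w(2019, 6, 3)
```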
2) Trend and Detrend Estimation: This method is designed to extract or eliminate the impact of trends, and thus leads to a better estimation result. Here, the trend can be estimated by either of the following formulas:
$$T_{ymdt} = f^{\rm trend}_{w}(X_{ymdt}) \qquad (5)$$
$$T_{ymdt} = \hat{f}^{\rm trend}_{w}(X_{ymdt};\, \theta^{\rm trend}) \qquad (6)$$
where $T_{ymdt}$ is the trend series, $f^{\rm trend}_{w}(\cdot)$ and $\hat{f}^{\rm trend}_{w}(\cdot)$ are two estimation functions, $w$ is a given length of the sliding window, and $\theta^{\rm trend}$ denotes the model parameters to be calibrated. For illustration, a weekly moving average is an instance of (5), while other, more advanced models may follow the format of (6). A trend estimator and a detrend estimator calculate the baselines differently:
$$X_{ymdt} \rightarrow T_{ymdt} \qquad (7)$$
$$X_{ymdt} \rightarrow X_{ymdt} - T_{ymdt} \qquad (8)$$
The baselines in (7) use the trend to remove potential noise, while the baselines in (8) detrend the original data to expose any additional changes, e.g., extra increments.

3) Backcast Estimation: This method has a more complicated expression based on machine learning, so more data and computation are required to calibrate the unknown parameters. It was originally used to analyze electricity consumption with a great improvement in accuracy. Here, a backcast estimation can be described as follows:
$$B_{ymdt} = \hat{f}^{\rm back}_{w}(X_{y'm'd't'}, Y_{y'm'd't'}, \cdots;\, \theta^{\rm back}) \qquad (9)$$
where $B_{ymdt}$ is the backcast outcome calculated by a machine learning model $\hat{f}^{\rm back}_{w}(\cdot)$, and $\theta^{\rm back}$ denotes the corresponding model parameters (often high-dimensional). Here, $X$ and $Y$ represent different input variables. It is intuitive to extend (9) to an ensemble backcast model by averaging the outputs of multiple base models (indexed by $i$):
$$B_{ymdt} = \frac{1}{N_i} \sum_{i} B^{(i)}_{ymdt} \qquad (10)$$
A backcast estimation can often largely mitigate the adverse impacts of non-pandemic factors and thereby establish a reliable baseline:
$$X_{ymdt} \rightarrow B_{ymdt} \qquad (11)$$
Note that one distinct advantage of this method is its flexibility, because there are many possible options and combinations for the base models.

4) Distribution-based Estimation: This method provides a new distributional perspective for understanding the underlying patterns. The key point is to monitor distributions rather than the raw data when handling a fluctuating variable such as the electricity price. A classical metric is given as follows:
$$S_{ym} = D\big(F_{w,ym},\, F_{w,y'm}\big) \qquad (12)$$
where $S_{ym}$ is a monthly metric of distributional distance, $D(\cdot,\cdot)$ denotes a classical distance between distributions (e.g., the Kolmogorov-Smirnov distance), and $F_{w,ym}$ and $F_{w,y'm}$ describe the cumulative distributions for month $m$ in years $y$ and $y'$. In both functions, the sliding window $w$ takes the length of one month. The key disadvantage of $S_{ym}$ is that it significantly reduces the granularity of the original time-series data; for instance, using (12) degrades the data granularity from an hourly to a monthly frequency. To overcome this issue, we develop a novel fluctuation index that captures the distributional features while maintaining the original data granularity. Technically, this index evaluates the possibility that an observation might be abnormal:
$$I_{ymdt} = f^{\rm fluc}_{w}(X_{ymdt}) = \big| 2 F_{w}(X_{ymdt}) - 1 \big| \qquad (13)$$
where $I_{ymdt}$ is the proposed fluctuation index, and $f^{\rm fluc}_{w}(\cdot)$ is an estimation function formulated from the cumulative distribution function $F_{w}(\cdot)$. Fig. 3 offers a graphic illustration of the fluctuation index from two aspects: the highlighted distance and the shaded area. By definition, $0 \le I_{ymdt} \le 1$, and $I_{ymdt} \ge 0.7$ rarely happens, so it is possible to evaluate abnormal dynamics by monitoring this index. A distribution-based estimator offers a baseline in either of the following ways:
$$X_{ymdt} \rightarrow I_{ymdt} \qquad (14)$$
$$X_{ymdt} \rightarrow I_{ym} \qquad (15)$$
where $I_{ym}$ is the monthly average derived similarly to (1), and other frequencies are also allowed.
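For illustration, here is a minimal sketch of the fluctuation index in (13) using an empirical CDF. This is our own example, not the toolbox implementation; the form |2F_w(x) - 1| is an assumption consistent with the 2-sigma threshold of 0.9544 used in Section V:

```python
import numpy as np

def fluctuation_index(window: np.ndarray, x: float) -> float:
    """Fluctuation index I = |2 F_w(x) - 1|, with F_w the empirical CDF
    of the sliding-window samples. I is close to 1 for values deep in
    either tail, and close to 0 near the median."""
    cdf = float(np.mean(window <= x))  # empirical cumulative probability
    return abs(2.0 * cdf - 1.0)

# Example: hourly prices over a one-month sliding window (made up).
rng = np.random.default_rng(0)
window = rng.normal(loc=30.0, scale=5.0, size=24 * 30)
print(fluctuation_index(window, 30.0))  # near the median -> close to 0
print(fluctuation_index(window, 45.0))  # 3 sigma above   -> close to 1

# An observation beyond the 2-sigma interval has an index above 0.9544.
is_extreme = fluctuation_index(window, 45.0) > 0.9544
```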
5) Probabilistic Baseline Estimation: This method is intended to handle the uncertainty problem in baseline estimation, a key knowledge gap that has not been filled by existing works. In detail, most related works focus on point estimation, but we extend this to consider the baseline uncertainty by developing quantiles in a probabilistic framework, which also naturally meets the practical need of constructing confidence intervals. A probabilistic baseline estimation includes a collection of quantile estimators, each of which is a parameterized model chosen from the previous subsections, such as:
$$\tilde{T}_{ymdt}(q) = \hat{f}^{\rm trend}_{w}(X_{ymdt};\, \theta^{\rm trend}_{q}) \qquad (16)$$
$$\tilde{B}_{ymdt}(q) = \hat{f}^{\rm back}_{w}(X_{y'm'd't'}, \cdots;\, \theta^{\rm back}_{q}) \qquad (17)$$
where $\tilde{T}_{ymdt}$ and $\tilde{B}_{ymdt}$ are the probabilistic versions of those defined in (6) and (9), given $q$ as a quantile value. These models are configured with a pinball loss function to consider the following five indexed quantile levels: $q_1 = 10\%$, $q_2 = 25\%$, $q_3 = 50\%$ (the median), $q_4 = 75\%$, and $q_5 = 90\%$. The unique feature of this method is that the baselines are now estimated by a collection of sub-models:
$$X_{ymdt} \rightarrow \{\tilde{T}_{ymdt}(q_1), \cdots, \tilde{T}_{ymdt}(q_5)\} \qquad (18)$$
$$X_{ymdt} \rightarrow \{\tilde{B}_{ymdt}(q_1), \cdots, \tilde{B}_{ymdt}(q_5)\} \qquad (19)$$
This is more informative than any single deterministic estimator, and one can further establish a 50% confidence interval by picking out the results of $q_2$ and $q_4$, or an 80% confidence interval from $q_1$ and $q_5$.

Regression is widely used in empirical analysis to explore the potential relationships between different factors. In particular, regression allows us to answer questions about correlation or causality during COVID-19. We have collected two popular regression models, along with several useful statistical tests.

1) Ordinary Least Squares Regression (OLS): This method offers multiple expressions to check the underlying correlation or causality. Supported formulations include linear expressions as well as a few nonlinear expressions with quadratic, interaction, or logarithmic terms. An OLS model can be formulated as follows:
$$Y_{ymdt} = \theta^{\rm ols}_{1} X_{ymdt} + \theta^{\rm ols}_{2} Z_{ymdt} + \cdots + \epsilon^{\rm ols}_{ymdt} \qquad (20)$$
where $X$, $Y$, $Z$ are placeholders for a group of correlated variables, $\theta^{\rm ols}_{1}$, $\theta^{\rm ols}_{2}$ denote the regression coefficients, and $\epsilon^{\rm ols}_{ymdt}$ represents the error term. Note that (20) is highly flexible and can represent a series of possible formulations. For example, if all the correlated variables share the same time index, the regression simply considers the correlation between variables rather than over time, while temporal coupling can be considered by adding historical terms. Also, the ellipsis in (20) indicates that other regression terms (linear or nonlinear) are fully allowed. We calibrate an OLS model by determining the set of regression coefficients that minimizes the regression residuals:
$$\min_{\theta^{\rm ols}} \sum_{y,m,d,t} \big( \epsilon^{\rm ols}_{ymdt} \big)^2 \qquad (21)$$
After calibration, an OLS model can be further validated by running a few statistical tests, including the t-test, F-test, and normality test. R-squared and adjusted R-squared are also informative for evaluating the goodness of fit.
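As a usage sketch of the OLS calibration (21) and the accompanying statistical tests, the following example uses made-up data; statsmodels is one possible backend for this step, not necessarily what the toolbox itself uses:

```python
import numpy as np
import statsmodels.api as sm

# Made-up illustrative data: Y regressed on X and Z as in (20).
rng = np.random.default_rng(1)
X = rng.normal(size=200)
Z = rng.normal(size=200)
Y = 2.0 * X - 0.5 * Z + rng.normal(scale=0.3, size=200)

design = sm.add_constant(np.column_stack([X, Z]))  # intercept + regressors
model = sm.OLS(Y, design).fit()                    # least-squares calibration

print(model.summary())  # t-tests, F-test, R-squared, adjusted R-squared
print(model.params)     # calibrated coefficients
```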
2) Vector Autoregression (VAR): This method is specialized for capturing the complicated correlations between multiple time series. One can extend it to a restricted vector autoregression when some regression coefficients are imposed to be zero. Both models are powerful and widely adopted in empirical studies. A VAR model combines all the variables together and uses the following formula to model their evolution over time:
$$X_{ymdt} = \theta^{\rm var}_{0} + \theta^{\rm var}_{1} X_{ymd(t-1)} + \cdots + \theta^{\rm var}_{p} X_{ymd(t-p)} + \epsilon^{\rm var}_{ymdt} \qquad (22)$$
where $X_{ymdt}$ should be interpreted as one variable or a concatenation of several variables, $p$ is called the order of the VAR model, and the lag terms of the last $p$ periods are considered. Besides, $\theta^{\rm var}_{0}, \cdots, \theta^{\rm var}_{p}$ are regression coefficients, and $\epsilon^{\rm var}_{ymdt}$ denotes the error term.

The establishment of a VAR model can be divided into four steps: pre-estimation preparation, model calibration, model verification, and post-estimation analysis. First, we conduct an Augmented Dickey-Fuller (ADF) test, a cointegration test, and a Granger causality test to analyze the conditions of stationarity, cointegration, and potential causality, respectively. Second, the regression coefficients are determined by a series of minimization problems, each similar to (21); for a $p$-order VAR model (22), one should run a total of $p$ optimizations. Third, another ADF test checks whether the residual series is stationary, while a Ljung-Box test and a Durbin-Watson test inspect the underlying endogeneity and autocorrelation. A robustness test is then recommended to demonstrate the model's performance against coefficient perturbations. Finally, the calibrated VAR model can provide further insights through impulse response analysis and forecast error variance decomposition. Note that any other regression models, beyond the above two, can be implemented and used as toolbox extensions.

Scientific visualization is one of the most intuitive ways to exhibit empirical findings, but the methods turn out to be highly diverse across applications. We therefore specialize the methods for several classical use cases.

A line chart and a scatter chart are useful for showing a series of changing data, such as the raw data $X_{ymdt}$, aggregated data like $\sum_{t} X_{ymdt}$, or filtered data like $X_{ymdt}\ (m \le 6)$. When the x-axis represents dates, our toolbox further supports highlighting the dates of major events during COVID-19.

A stacked bar chart makes comparisons between different categories. Visually, the bars (representing the categories) are stacked end-to-end and assigned different colors for distinction. Assume the raw data $X_{ymdt}$ can be divided into several sub-categories $X^{k}_{ymdt}\ \forall k$; then the corresponding proportion of $X^{k}_{ymdt}$ is calculated as:
$$P^{k}_{ymdt} = X^{k}_{ymdt} \Big/ \sum_{k} X^{k}_{ymdt} \qquad (23)$$

A histogram describes the distribution or frequency features of a group of fluctuating data. This is helpful for handling a large number of observations and detecting possible outliers. In particular, our toolbox supports visualizing the cumulative distribution function and the probability density.

A box plot is designed to graphically display groups of data through their quantiles. It can effectively handle a data matrix by calculating the quantiles of each column and visualizing these quantiles with color bands. The following five quantiles are of interest:
$$F^{-1}(q_1),\ F^{-1}(q_2),\ F^{-1}(q_3),\ F^{-1}(q_4),\ F^{-1}(q_5) \qquad (24)$$
where $F(\cdot)$ denotes the cumulative distribution function of one column, and $q_1$-$q_5$ have been defined before, just after (17).

A heat map is often used to show how a series of data are clustered or vary over space. In such a figure, it is easy to understand the spatial correlations and to visually discover typical patterns. Our toolbox implements a two-dimensional heat map in which each element is measured by the Pearson correlation coefficient. It should be noted that a few more visualization schemes can be considered as toolbox extensions.
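As one illustration, a two-dimensional heat map of Pearson correlation coefficients can be sketched in a few lines of matplotlib. This is our own minimal example (the region names and data are made up), not a call into CoVEMDA:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data matrix: columns are regional demand series (made up).
rng = np.random.default_rng(2)
data = rng.normal(size=(365, 4))
regions = ["CAISO", "ERCOT", "NYISO", "PJM"]

corr = np.corrcoef(data, rowvar=False)  # Pearson correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(regions)), labels=regions)
ax.set_yticks(range(len(regions)), labels=regions)
fig.colorbar(im, ax=ax, label="Pearson correlation")
plt.show()
```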
We get started with a discussion of the high-level architecture used to implement the models and algorithms of Section III. Both standalone versions of our toolbox, developed in Python and Matlab, follow this architecture.

1) Folder Structure: Fig. 4 shows a concise folder structure in its left part. All the archived data and pre-trained models are prepared in the "data/" folder, and the source code can be found in the "lib/" folder. Beginners may get started by reading the user manual or the quick start examples.

2) Programming Structure: Fig. 4 also illustrates a programming structure (on the right) that classifies the entire function family into three levels: basic operations, low-level functions, and high-level functions. It breaks large, user-oriented tasks down into small, data-oriented activities, and helps clarify the calling relationships and dependencies between functions.

3) Object-Oriented Design: Fig. 5 elaborates how the major classes and their inheritance relationships are organized to realize the proposed methods. There are four base classes, namely a baseline estimator class, a regressor class, a visualizer class, and an area class; they mainly build up the fundamental properties and some key components. A few high-level classes are then established to specify the algorithm details, and these are finally integrated into the RTO and City classes for ease of use. As for extensions, users are allowed to develop their own classes based on the predefined ones; external data sources, special parsers, and user-defined functions can all be included in such a new class to support further development.

For the Python version, we follow the folder structure in Subsection IV-A to organize the script files (.py files), and relevant classes and functions are collected in the same file for readability. For the Matlab version, functions are carefully assigned to different abstraction levels (Fig. 4), and most functions share the same or similar names as their Python counterparts; using the folder structure in Subsection IV-A, the Matlab script files (.m files) are collected in three folders according to their function level.

The COVID-19 pandemic is exacerbating uncertainty and causing a series of unexpected outcomes in power systems. Among these, we pick three issues of public concern:
• How much and for how long has COVID-19 influenced the operation of U.S. power systems?
• How were electricity prices influenced by COVID-19 and the gas price collapse in 2020?
• How can load forecast models be adapted to capture the lockdown patterns during COVID-19?
Answers to these issues will deepen our understanding of the operational risks induced by extreme events of this kind. To fill the knowledge gap, this section conducts empirical studies to share and demonstrate several new findings. The first issue covers a broad range of topics, but we focus on three specific aspects that have been less discussed.

1) Peak Demand Changes among Different Regions: For a given region, the reduction of peak demand is assessed for each day and averaged over the whole month:
$$R_{ym} = \frac{1}{N_d} \sum_{d} \frac{B_{ymd} - D^{\rm peak}_{ymd}}{B_{ymd}} \qquad (25)$$
where the peak demand $D^{\rm peak}_{ymd}$ has a backcast baseline $B_{ymd}$, which can be derived by running the pre-trained backcast model (a deep learning model) in the toolbox. The monthly results show clear peak demand reductions in both April and May. According to the average reduction rates, the situation in June was largely alleviated for all seven regions, but NYISO appeared to recover much more slowly. Fig. 6 then shows the spatial correlation of peak demand changes between regions. Observing the lightest colors in the figure, it is clear that CAISO (California) and ERCOT (Texas) exhibit different patterns, and ERCOT might be the most distinctive one. We also calculate the probabilistic baselines in Table II to understand the situation in ERCOT more clearly. One may check whether the quantiles pass through the zero point for probabilistic validation. Here, we are 50% sure that the peak demand dropped from April to June, but the picture is unclear for March. Larger uncertainty is found in May, 4.71% within the 25-75% interval and 8.24% within the 10-90% interval, which reflects an underlying pattern change in that month. As peak demand is strongly related to resource adequacy, validating the change and locating the pattern transition could help get the system operator well prepared.
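To sketch the computation in (25) (our own illustration; the array names are hypothetical, and the real toolbox wraps this in its baseline-estimation functions):

```python
import numpy as np

def monthly_peak_reduction(actual_peak: np.ndarray,
                           baseline_peak: np.ndarray) -> float:
    """Average daily peak-demand reduction rate over one month, as in (25).
    Both inputs are daily series (one value per day of the month)."""
    daily_reduction = (baseline_peak - actual_peak) / baseline_peak
    return float(daily_reduction.mean())

# Hypothetical daily peaks (MW) for one month and their backcast baselines.
actual = np.array([820.0, 790.0, 805.0, 830.0])
baseline = np.array([900.0, 880.0, 860.0, 905.0])
print(f"{monthly_peak_reduction(actual, baseline):.2%} average reduction")
```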
2) Price Distribution Shift in Chicago: We then apply the fluctuation index to evaluate the price distributions in Chicago. The results confirm that the price distribution changed during COVID-19. We pay further attention to weekly statistics for more detailed information about the changing dynamics. Table III detects the extreme prices in each week that fall beyond the 2-sigma interval (fluctuation indexes beyond 0.9544) and counts their total number; note that the week-alignment rule is applied here. As shown in Table III, 2020 witnessed more extreme prices in March and April; the situation largely relieved after mid-May, but soon rebounded in June. The average number of extreme price occurrences indicates that it was more than five times more likely to experience an abnormal price (often an unexpectedly low price) in 2020 than in 2019. System operators and power plant managers should take care with these price collapses, which may affect the profitability of gas-fired power plants.

3) Duck Curves and Renewable Energy Share in California: A duck curve, also known as the residual demand, is derived by calculating the difference between electricity consumption and solar generation:
$$D^{\rm duck}_{ymdt} = D_{ymdt} - G^{\rm solar}_{ymdt} \qquad (26)$$
A larger peak-valley difference or peak-valley ratio calls for more flexible resources for power and energy balancing, and system operators should pay close attention to this new situation. The share of renewable energy is calculated as follows:
$$S^{\rm renew}_{ym} = \sum_{d,t} G^{\rm renew}_{ymdt} \Big/ \sum_{d,t} G^{\rm total}_{ymdt} \qquad (27)$$
We consider the monthly proportions in California, and apply an ARIMA model for trend estimation. This model is configured by grid-searching the best hyper-parameters, and the best configuration turns out to be ARIMA(2,0,1). Results show that the observed share of renewable energy during March-June is 34.88% on average, while the ARIMA model estimates a slightly larger baseline of 34.90%. This tiny difference, much smaller than the demand drop, clearly contradicts the claim that renewables might enjoy extra benefits during COVID-19 because of their low marginal costs. A possible explanation for this finding is the conservative dispatch strategies that take system safety into consideration.
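To sketch the trend-estimation step (our own example with synthetic data; statsmodels is an assumed backend, and the actual toolbox implementation may differ), an ARIMA(2,0,1) baseline for a monthly share series could be fitted as follows:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly renewable-share series (fractions), pre-pandemic history.
rng = np.random.default_rng(3)
history = 0.30 + 0.002 * np.arange(36) + rng.normal(scale=0.01, size=36)

# Fit ARIMA(2,0,1) on the pre-pandemic months, then project the baseline
# for the next four months (e.g., March-June) as a counterfactual.
model = ARIMA(history, order=(2, 0, 1)).fit()
baseline = model.forecast(steps=4)
print(baseline)  # counterfactual shares to compare with observations
```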
There is an open debate on the underlying drivers of the low electricity prices during COVID-19, because the pandemic overlaps in time with the 2020 gas price collapse. Taking Boston as an example, the Pearson correlation coefficient between the electricity price and the gas price is 0.213, and the coefficient between the electricity price and the confirmed case count is -0.187, both statistically significant. Spurious correlation may arise in this case, so we have to conduct a rigorous regression analysis to figure out the potential causality.

The first step is selecting proper variables and data for the electricity prices and pandemic situations. We calculate the logit value of the fluctuation index, denoted by $LoI_{ymd}$, to describe the abnormality of the electricity price observations:
$$LoI_{ymd} = \ln\!\left(\frac{I_{ymd}}{1 - I_{ymd}}\right) \qquad (28)$$
where $I_{ymd}$ is defined similarly to (13). Note that $LoI_{ymd}$ has no lower or upper bound (important for an unbiased OLS regression), and a smaller $LoI_{ymd}$ means a more normal price. In practice, we regard $0.75 \le LoI_{ymd} \le 3$ as unusual, and $LoI_{ymd} > 3$ as highly unusual. We also construct a gas price variable $\lambda^{\rm gas}_{ymd}$ by importing and organizing data from an external source. As for the pandemic modeling, we come up with two options: one is the logarithm of daily confirmed cases $C_{ymd}$, and the other is a binary dummy variable $\delta_{ymd}$ that indicates the absence ($\delta_{ymd} = 0$) or presence ($\delta_{ymd} = 1$) of the pandemic. With the above variables, two OLS regression models are designed as follows:
$$LoI_{ymd} = \theta_1 \lambda^{\rm gas}_{ymd} \delta_{ymd} + \theta_2 \lambda^{\rm gas}_{ymd} + \theta_3 \delta_{ymd} + \theta_4 \qquad (29)$$
$$LoI_{ymd} = \theta_5 \lambda^{\rm gas}_{ymd} + \theta_6 C_{ymd} + \theta_7 \lambda^{\rm gas}_{ymd} C_{ymd} + \theta_8 \qquad (30)$$
The basic idea of (29) and (30) is to control for the effect of gas prices when assessing the pandemic's impacts; we are also curious about the interaction between these two factors.

Table IV illustrates the results of model calibration and statistical tests. (Note for Table IV: "Coeff" is the coefficient value and "Std" is the standard deviation; the top part shows the results for (29) and the bottom part for (30); rows whose coefficients are statistically significant are highlighted.) Four coefficients are statistically significant: $\theta_1$, $\theta_3$, $\theta_6$, and $\theta_7$. The pandemic's impact is validated to exist by the strong statistical evidence that $\theta_1$ and $\theta_3$ in (29) are nonzero: $LoI_{ymd}$ indeed depends on $\delta_{ymd}$. In fact, the influence of COVID-19 is critical because $\theta_3 > 1$ and $\theta_6 > 1$, meaning that the abnormality of prices is quite sensitive to the pandemic-related variables. Another finding is that there may exist an offsetting relationship between the impacts of COVID-19 and gas prices; one piece of supporting evidence is the negative sign of $\theta_1$, which is further corroborated by the negative $\theta_7$ in (30). While the impacts of the two factors are synergistic rather than simply additive (because $\theta_7 \neq 0$), it is at least statistically clear that COVID-19 has truly caused more abnormal electricity prices (because $\theta_6 > 0$). Note that the above findings still hold when we run further robustness tests with variable substitution and data resampling. Policy makers may need to take effective financial actions to help power companies suffering revenue losses during this special time.

One severe outcome of COVID-19 is the rapid drop in electricity consumption. Even worse, most load forecast models fail to capture this sudden break caused by the lockdown policies. This calls for an improved forecasting strategy that can quickly adapt to the new situation and make more accurate predictions. We will show that using mobility data to enhance the load forecast models can be an effective solution.

This case considers the day-ahead hourly load prediction task. Three popular models are tested: a neural network (NN), a random forest (RF), and a support vector machine (SVM). The inputs of these models include calendar variables, meteorological variables, and the previous load. We also carefully grid-search the hyper-parameters of each kind of model. In most cases, the above models cannot capture the novel load patterns during COVID-19, so we improve them in two ways: one is fine-tuning the model with new observations, and the other is using mobility data to enhance the results. Technically, the latter idea can be described as follows:
$$\hat{D}_{ymdt} = f^{\rm pred}(\cdots) + \Delta f^{\rm enh}(\cdots;\, \theta^{\rm enh}) \qquad (31)$$
where the improved result $\hat{D}_{ymdt}$ has an enhancement term $\Delta f^{\rm enh}(\cdot)$ that takes the previous mobility data as its input, and $f^{\rm pred}(\cdot)$ is exactly the original model (we avoid listing all its inputs with the ellipsis). For simplicity, we only consider a linear regression formula for $\Delta f^{\rm enh}(\cdot)$, and we calibrate its parameters $\theta^{\rm enh}$ on the residual error series during COVID-19 (very few samples are needed).
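A minimal sketch of this residual-correction idea follows; the variable names and data are hypothetical, and the linear form mirrors the description above, while the actual toolbox implementation may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# base_pred: day-ahead predictions of the original model during COVID-19.
# actual:    observed loads; mobility: previous-day mobility features.
base_pred = np.array([95.0, 96.5, 94.0, 92.0])         # hypothetical (GW)
actual = np.array([88.0, 90.0, 87.5, 85.0])
mobility = np.array([[0.62], [0.66], [0.60], [0.55]])  # hypothetical index

# Fit the enhancement term on the residual errors (very few are needed).
residual = actual - base_pred
enh = LinearRegression().fit(mobility, residual)

# Enhanced forecast = original model output + mobility-based correction.
next_base = 93.0                     # hypothetical next-day base prediction
next_mobility = np.array([[0.58]])
enhanced = next_base + enh.predict(next_mobility)[0]
```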
We focus on the forecast task in New York City starting March 21, two weeks after the state-of-emergency order of March 7, 2020. The main focus is the prediction performance (measured by the mean absolute percentage error, or MAPE) of the different models in the remaining days before mid-2020. We also examine the performance gaps between the normal period (January 1-March 21) and the lockdown period (March 21-June 30).

Table V gives the comparison results for the different models. They are of five kinds in total: the original models, the models updated with new observations (denoted by *-Updated), and the adaptive models enhanced by price, confirmed-case, or mobility data (denoted by *-Price, *-Case, and *-Mobility). (Note for Table V: "NN" is neural network, "RF" is random forest, and "SVM" is support vector machine; within each part, the original NN/RF/SVM model serves as the baseline for calculating the improvement, and the best estimator of each part is highlighted.) Here, the models that use price or confirmed-case data share a formulation similar to (31). It may not be surprising that the performance gaps between the normal and lockdown periods exceed 5%, with some errors almost tripled. Also, there is nearly no difference when fine-tuning these models with new observations; e.g., RF and its updated version RF-Updated show the same error of 8.20%. The major message of Table V is that using mobility data can improve the forecast performance by nearly 25-40%, or 2-4 percentage points. This result can be further improved by obtaining more abnormal observations or increasing the size of the enhancement model.

Another interesting finding concerns the superiority of mobility data integration: in all tests, the mobility-enhanced models work better than those enhanced by price or confirmed-case data. Taking the neural network as an example, mobility enhancement reduces the error level by 3.38%, which is 0.93 and 2.09 percentage points more than the other two enhancements. We conclude that mobility is a good and stable indicator for tracking changes in electricity consumption, better than price or confirmed-case data. The above solution could help system operators effectively improve their load forecasting models. Our idea, derived mainly from a cross-domain data perspective, is compatible with any model-side improvements and can thus enjoy extra performance benefits.

Evaluating the COVID-19 impacts on real-world power systems is critical to understanding the potential risks as well as the abnormal operation patterns, but up to now there has been a lack of reliable and ready-to-use data, methods, and toolboxes for empirical studies. This paper overcomes this difficulty by developing an open-access data hub, an open-source toolbox, and a full collection of methods for users with diverse backgrounds, such as scholars, policy makers, and educators. The toolbox is implemented in both Python and Matlab with three key functions: baseline estimation, regression analysis, and scientific visualization. The fluctuation index and probabilistic baseline are highlighted because they are general and powerful enough to provide reliable estimations. We also conduct several empirical studies with practical evidence to demonstrate new findings and methodologies on three issues of public concern.

Typical use cases of the proposed data and toolbox include, but are not limited to: research on power system operation and resilience, cross-domain studies on energy-related topics, causality analysis of pandemic-induced risks, method design for adaptive forecasting, policy impact evaluation, and relevant university courses or webinars. However, our methodologies and toolbox do not include many specialized methods for spatial correlation analysis, because geographical information often lacks documentation and appears in messy formats. Instead, we mainly focus on temporal relationships, which are often informative enough for pandemic-related problems.
Since the world is confronting an uncertain future induced by COVID-19, this paper will hopefully advance our understanding of the ongoing situation and guide preparations through this difficult time.

REFERENCES
[1] An interactive web-based dashboard to track COVID-19 in real time
[2] A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker)
[3] Electric power grids under high-absenteeism pandemics: History, context, response, and opportunities
[4] Impacts of COVID-19 on energy demand and consumption: Challenges, lessons and emerging opportunities
[5] Quantitative assessment of U.S. bulk power systems and market operations during the COVID-19 pandemic
[6] What opportunities could the COVID-19 outbreak offer for sustainability transitions research on electricity and mobility?
[7] Adaptive methods for short-term electricity load forecasting during COVID-19 lockdown in France
[8] Effects of COVID-19 pandemic on the Italian power system and possible countermeasures
[9] Ancillary services in Great Britain during the COVID-19 lockdown: A glimpse of the carbon-free future
[10] A cross-domain approach to analyzing the short-run impact of COVID-19 on the US electricity sector
[11] Analysis of the electricity demand trends amidst the COVID-19 coronavirus pandemic
[12] Impact of the COVID-19 pandemic on the U.S. electricity demand and supply: An early view from data
[13] Exploring the effects of California's COVID-19 shelter-in-place order on household energy practices and intention to adopt smart home technologies
[14] Machine learning model to project the impact of COVID-19 on US motor gasoline demand
[15] COVID-19 assistance needs to target energy insecurity
[16] Navigating the clean energy transition in the COVID-19 crisis
[17] The short-run and long-run effects of COVID-19 on energy and the environment
[18] Effects of the COVID-19 pandemic on energy systems and electric power grids: A review of the challenges ahead
[19] Readiness of small energy markets and electric power grids to global health crises: Lessons from the COVID-19 pandemic
[20] Impacts of COVID-19 on Ontario's electricity market
[21] Flexibility enhancement measures under the COVID-19 pandemic: A preliminary comparative analysis in Denmark, the Netherlands, and Sichuan of China
[22] How did the German and other European electricity systems react to the COVID-19 pandemic?
[23] Experience of Indian electricity market operation and other events during COVID-19 pandemic
[24] A preliminary simulation study about the impact of COVID-19 crisis on energy demand of a building mix at a district in Sweden
[25] Trend analyses of electricity load changes in Brazil due to COVID-19 shutdowns
[26] The impact of COVID-19 on the electricity sector in Spain: An econometric approach based on prices
[27] Impact analysis of COVID-19 responses on energy grid dynamics in Europe
[28] Review of learning-assisted power system optimization
[29] Impact of COVID-19 on electricity load in Haryana (India)
[30] Random-Forest-Bagging broad learning system with applications for COVID-19 pandemic
[31] Artificial intelligence and COVID-19: Deep learning approaches for diagnosis and treatment
[32] CapsCovNet: A modified capsule network to diagnose COVID-19 from multimodal medical imaging
[33] Cross-site severity assessment of COVID-19 from CT images via domain adaptation
[34] Benchmarking methodology for selection of optimal COVID-19 diagnostic model based on Entropy and TOPSIS methods
[35] Prediction-based analysis on power consumption gap under long-term emergency: A case in China under COVID-19
[36] Energy consumption in commercial buildings in a post-COVID-19 world
[37] Impacts of COVID-19 on electric vehicle charging behavior: Data analytics, visualization, and clustering
[38] Nexus of COVID-19 and carbon prices in the EU emission trading system: Evidence from multifractal and the wavelet coherence approaches
[39] Impact on electricity consumption and market pricing of energy and ancillary services during pandemic of COVID-19 in Italy
[40] Involvement of the open-source community in combating the worldwide COVID-19 pandemic: A review
[41] CovidCounties is an interactive real time tracker of the COVID-19 pandemic at the level of US counties
[42] COVID-ResNet: A deep learning framework for screening of COVID-19 from radiographs
[43] OpenABM-Covid19: An agent-based model for non-pharmaceutical interventions against COVID-19 including contact tracing
[44] Impact of COVID-19 measures on short-term electricity consumption in the most affected EU countries and USA states
[45] Canadian electricity markets during the COVID-19 pandemic: An initial assessment
[46] The home page for COVID-EMDA+ and CoVEMDA
[47] On the feasibility of load-changing attacks in power systems during the COVID-19 pandemic