title: TE-ESN: Time Encoding Echo State Network for Prediction Based on Irregularly Sampled Time Series Data
authors: Sun, Chenxi; Hong, Shenda; Song, Moxian; Zhou, Yanxiu; Sun, Yongyue; Cai, Derun; Li, Hongyan
date: 2021-05-02

Prediction based on Irregularly Sampled Time Series (ISTS) is of wide concern in real-world applications. For more accurate prediction, methods should grasp more data characteristics. Unlike ordinary time series, ISTS is characterized by irregular time intervals within a series and different sampling rates across series. However, existing methods yield suboptimal predictions because, when modeling these two characteristics, they artificially introduce new dependencies within a time series and learn the relations among time series in a biased way. In this work, we propose a novel Time Encoding (TE) mechanism. TE can embed time information as time vectors in the complex domain. It has the properties of absolute distance and relative distance under different sampling rates, which helps to represent both irregularities of ISTS. Meanwhile, we create a new model structure named Time Encoding Echo State Network (TE-ESN). It is the first ESNs-based model that can process ISTS data. Besides, TE-ESN can incorporate long short-term memories and series fusion to grasp horizontal and vertical relations. Experiments on one chaos system and three real-world datasets show that TE-ESN performs better than all baselines and has better reservoir properties.

Prediction based on Time Series (TS) widely exists in many scenarios, such as healthcare management and meteorological forecasting [Xing et al., 2010]. Many methods, especially Recurrent Neural Networks (RNNs), have achieved state-of-the-art results [Fawaz et al., 2019]. However, in real-world applications, TS is usually Irregularly Sampled Time Series (ISTS) data. For example, the blood samples of a patient during hospitalization are not collected at a fixed time of day or week. This characteristic limits the performance of most methods. A comprehensive learning of the characteristics of ISTS contributes to the accuracy of the final prediction [Hao and Cao, 2020]; for example, the state of a patient is related to a variety of vital signs. ISTS exhibits two kinds of irregularity, from the intra-series and the inter-series perspectives:

• Intra-series irregularity is the irregular time intervals between observations within a time series. As a patient's health status changes, the relevant measurement requirements change as well: in Figure 1, the intervals between the blood sample collections of a COVID-19 patient can be 1 hour or as long as 7 days. Uneven intervals change the dynamic dependency between observations, and large time intervals add a time sparsity factor [Jinsung et al., 2017].

• Inter-series irregularity is the difference in sampling rates among time series. Because vital signs have different rhythms and sensors have different sampling times, for the COVID-19 patient in Figure 1, heart rate is measured in seconds while blood samples are collected over days. Differing sampling rates complicate data preprocessing and model design [Karim et al., 2019].

However, grasping both irregularities of ISTS is challenging. In real-world applications, a model usually has multiple time series as input.
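To make the two irregularities concrete, the following minimal sketch stores each series as its own (timestamp, value) pairs rather than on a shared grid; the series names, timestamps and values are hypothetical, chosen only to mirror the Figure 1 example.

```python
import numpy as np

# Hypothetical ISTS in the spirit of Figure 1: each series keeps its own
# (timestamp, value) pairs instead of being aligned to a common grid.
# Timestamps are in hours; all numbers are illustrative, not from the paper.
heart_rate = {                      # high sampling rate (seconds apart)
    "t": np.array([0.000, 0.001, 0.002, 0.004]),
    "v": np.array([78.0, 80.0, 79.0, 82.0]),
}
blood_sample = {                    # low sampling rate (hours to days apart)
    "t": np.array([0.0, 1.0, 169.0]),   # 1 hour, then about 7 days later
    "v": np.array([4.1, 4.3, 5.0]),
}

# Intra-series irregularity: uneven gaps within one series.
print(np.diff(blood_sample["t"]))   # [  1. 168.] -> widely differing intervals

# Inter-series irregularity: sampling rates differ across series.
rate = lambda s: len(s["t"]) / (s["t"][-1] - s["t"][0])
print(rate(heart_rate), rate(blood_sample))
```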
If the input is treated as a single multivariate time series, data alignment via up/down-sampling and imputation is required. But this artificially introduces new dependencies while omitting some original ones, causing suboptimal prediction [Sun et al., 2020a]. If the input is treated as multiple separate time series whose dependencies are adjusted according to time intervals, the method encounters a bias problem: it embeds stronger short-term dependencies in highly sampled time series because their time intervals are smaller. This is not necessarily valid; for example, although blood pressure is measured less frequently than heart rate in clinical practice, its values have a strong diurnal correlation [Virk, 2006].

To get rid of the above dilemmas and achieve more accurate prediction, all irregularities of each ISTS should be modeled without introducing new dependencies. The premise, however, is that ISTS cannot be interpolated, which makes alignment impossible; this in turn makes batch gradient descent over multivariate time series hard to implement and aggravates the non-convergence and instability of error Back-Propagation RNNs (BPRNNs), the basis of existing methods for ISTS [Sun et al., 2020a]. Echo State Networks (ESNs) are a simple type of RNN that avoid non-convergence and expensive computation by training the readout as a least-squares problem [Jaeger, 2002] (a minimal sketch is given below). But ESNs can only process uniform TS, assuming equally distributed time intervals, and have no mechanism to model ISTS. To solve all the difficulties mentioned above, we design a new structure that enables ESNs to handle ISTS data, where a novel mechanism compensates for the missing ability to learn irregularity. The contributions are summarized as:

• We introduce a novel mechanism named Time Encoding (TE) to learn both the intra-series and the inter-series irregularity of ISTS. TE represents time points as dense vectors and extends to the complex domain for more representation options. TE injects the absolute- and relative-distance properties, based on time interval and sampling rate, into the time representations, which helps model the two ISTS irregularities at the same time.

Existing methods for ISTS can be divided into two categories. The first is based on the perspective of missing data: it discretizes the time axis into non-overlapping intervals, and points without data are considered missing data points. Multi-directional RNN (M-RNN) [Jinsung et al., 2017] handled missing data by operating on the time series forward and backward. Gated Recurrent Unit with Decay (GRU-D) [Che et al., 2018] used a decay rate to weigh the correlation between missing data and other data. However, data imputation may artificially introduce dependencies that never occurred naturally, beyond the original relations, and it entirely ignores modeling the ISTS irregularities. The second is based on the perspective of raw data: it constructs models that can directly receive ISTS as input. Time-aware Long Short-Term Memory (T-LSTM) [Baytas et al., 2017] used an elapsed-time function to model irregular time intervals. Interpolation-Prediction Network [Shukla and Marlin, 2019] used three time perspectives to model different sampling rates. However, these methods only perform well on univariate time series; for multiple time series they must apply alignment first, creating missing data at some time points and falling back to the defects of the first category. These defects stem from adapting to the training requirements of BPRNNs.
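The following is a minimal, self-contained sketch of that least-squares training recipe for a generic ESN (not the TE-ESN of this paper): the input and reservoir weights stay fixed, and only the readout is fit in closed form by ridge regression. All sizes, the spectral-radius value and the toy task are illustrative assumptions.

```python
import numpy as np

# Generic ESN sketch: fixed random input/reservoir weights, readout trained
# in closed form by ridge regression instead of back-propagation.
rng = np.random.default_rng(0)
D, N, T = 3, 100, 500                            # input dim, reservoir size, steps

W_in = rng.uniform(-0.5, 0.5, (N, D))            # fixed after initialization
W_res = rng.uniform(-0.5, 0.5, (N, N))
W_res *= 0.9 / np.abs(np.linalg.eigvals(W_res)).max()   # spectral radius 0.9

U = rng.standard_normal((T, D))                  # toy input series
Y = U.sum(axis=1, keepdims=True)                 # toy target: sum of inputs

X = np.zeros((T, N))                             # collected reservoir states
x = np.zeros(N)
for t in range(T):
    x = np.tanh(W_in @ U[t] + W_res @ x)         # state transition
    X[t] = x

lam = 1e-6                                       # regularization parameter
W_out = Y.T @ X @ np.linalg.inv(X.T @ X + lam * np.eye(N))  # closed-form readout
print("train MSE:", np.mean((X @ W_out.T - Y) ** 2))
```

Because only the readout is trained, no gradient is propagated back through time, which is why the convergence and stability problems of BPRNNs do not arise.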
ESNs, with a strong theoretical grounding, are practical, easy to implement, and avoid non-convergence [Gallicchio and Micheli, 2017; Sun et al., 2020c]. Many state-of-the-art ESN designs can predict time series well. [Jaeger et al., 2007] designed a classical reservoir structure using leaky integrator neurons (Leaky-ESN) and mitigated the noise problem in time series. [Gallicchio et al., 2017] proposed a stacked-reservoir structure (DeepESN) to pursue the conciseness of ESNs together with the effectiveness of deep learning (DL). [Zheng et al., 2020] proposed a long short-term reservoir structure (LS-ESN) that considers the relations of time series over different time spans. But there are no ESNs-based methods for ISTS. The widely used RNN-based methods, especially ESNs, only model the order relation of time series by assuming the time distribution is uniform. We design the Time Encoding (TE) mechanism (Section 3.2) to embed the time information and help ESNs learn the irregularities of ISTS (Section 3.3).

First, we give two new definitions used in this paper. ISTS has two irregularities: irregular time intervals within a series and different sampling rates across series.

Definition 1 (ISTS Prediction). For prediction tasks, one-step-ahead forecasting uses the observed data $u_{1:t}$ to predict the value of $u_{t+1}$, and continues over time; early prediction uses the observed data $u_{1:t}$ ($t < t_{pre}$) to predict the classes or values at time $t_{pre}$.

Definition 2 (Time Encoding, TE). The time encoding mechanism designs methods to embed and represent the information of every time point on a time line.

The TE mechanism extends the idea of Positional Encoding (PE) from natural language processing. PE was first introduced to represent word positions in a sentence [Gehring et al., 2017]. The Transformer model [Vaswani et al., 2017] used a set of sinusoidal functions discretized by each relative input position, shown in Equation 1:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad (1)$$

where $pos$ indicates the position of a word and $d_{model}$ is the embedding dimension. Meanwhile, a recent study encoded word order in complex embeddings: a word indexed $j$ at position $pos$ is embedded as $g_{pe}(j, pos) = r e^{i(\omega_j pos + \theta_j)}$, where $r$, $\omega$ and $\theta$ denote the amplitude, frequency and primary phase, respectively; all of them are parameters to be learned by the deep learning model.

First, we introduce how the Time Vector (TV) perceives the irregular time intervals of a single ISTS with a fixed sampling rate. Then, we show how Time Encoding (TE) embeds the time information of multiple ISTS with different sampling rates. The properties and proofs are summarized in the Appendix.

In Equation 2, each time vector has $d_{TV}$ embedding dimensions, and each dimension corresponds to a sinusoid:

$$TV(t, 2i) = \sin(c_i t), \quad TV(t, 2i+1) = \cos(c_i t), \quad (2)$$

where the wavelengths of the sinusoids form a geometric progression from $2\pi$ to $M_T \pi$, and $M_T$ is the largest wavelength, defined by the maximum number of input time points.

Without considering the different sampling rates across series, for a single ISTS, TV can capture the time intervals between two observations through its properties of absolute distance and relative distance.

Property 1 (Absolute Distance Property). For two time points with distance $p$, the time vector at time point $t + p$ is a linear combination of the time vector at time point $t$.

Property 2 (Relative Distance Property). The product of the time vectors of two time points $t$ and $t + p$ is negatively correlated with their distance $p$: the larger the interval, the smaller the product, and the smaller the correlation.
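The sketch below instantiates the time vector of Equation 2 and checks Property 2 numerically. The geometric wavelength schedule is our reading of the description above, and the values of $d_{TV}$ and $M_T$ are illustrative.

```python
import numpy as np

# Time vector TV of Equation 2: TV(t, 2i) = sin(c_i t), TV(t, 2i+1) = cos(c_i t),
# with wavelengths in a geometric progression from 2*pi to M_T*pi.
# d_tv and M_T below are illustrative choices.
def time_vector(t, d_tv=16, M_T=1000):
    i = np.arange(d_tv // 2)
    wavelengths = 2 * np.pi * (M_T / 2) ** (i / (d_tv // 2 - 1))  # 2*pi .. M_T*pi
    c = 2 * np.pi / wavelengths                                   # frequencies c_i
    tv = np.empty(d_tv)
    tv[0::2] = np.sin(c * t)
    tv[1::2] = np.cos(c * t)
    return tv

# Property 2: the dot product of TV(t) and TV(t+p) shrinks as the gap p grows.
t = 5.0
for p in [1, 10, 100]:
    print(p, round(time_vector(t) @ time_vector(t + p), 3))
```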
For a computing model, if its inputs include the time vectors of the time points corresponding to each observation, then the additions and multiplications inside the model will take the characteristics of different time intervals into account through the above two properties, improving the recognition of long-term and short-term dependencies in ISTS. Meanwhile, since no new data is imputed, the natural relations and dependencies within ISTS are more likely to be learned.

When the input is multi-series, the other irregularity of ISTS, different sampling rates, shows up. Using the time vector introduced above leads to a bias problem: by Property 2 it embeds more associations between observations with a high sampling rate, as they have smaller time intervals. But we cannot simply conclude that the correlation between the values of a time series with a low sampling rate is weak. Thus, we design an advanced version of the time vector, called Time Encoding (TE), to encode time within multiple ISTS.

TE extends TV to the complex-valued domain. For a time point $t$ in the $d$-th ISTS with sampling rate $r_s(d)$, the time code is given in Equation 5:

$$TE(d, t) = e^{i \omega t}, \quad (5)$$

where $\omega$ is the frequency. Compared with TV, TE has two advantages. The first is that TE not only keeps Properties 1 and 2 but also incorporates the influence of the frequency $\omega$, making the time codes consistent across different sampling rates. $\omega$ reflects the sensitivity of an observation to time: a large $\omega$ leads to more frequent changes of the time codes and more difference between the representations of adjacent time points. For the relative distance property, a large $\omega$ makes the product large when the distance $p$ is fixed.

Property 3 (Relative Distance Property with $\omega$). The product of the time encodings of two time points $t$ and $t + p$ is positively correlated with the frequency $\omega$.

In TE, we set $\omega = \omega_d \cdot r_s^{-1}(d)$, where $\omega_d$ is the frequency parameter of the $d$-th sampling rate. TE fuses the sampling-rate term $r_s^{-1}(d)$ to avoid the bias of the time vector caused by considering only the effect of the distance $p$. The second advantage is that each time point can be embedded into $d_{TE}$ dimensions with more frequency options by setting different $\omega_{j,k}$ in Equation 7; in TE, $\omega_{j,k}$ means that the time vector in dimension $j$ has $K$ frequencies, whereas in Equation 2 of TV the frequency of the time vector in dimension $i$ is fixed at $c_i$.

Time encoding with different sampling rates is related to the time vector with a fixed sampling rate and to a general complex expression:

• TV is a special case of TE. If we set $\omega_{k,j} = c_i$, then $TE(d, t) = TV(t, 2i+1) + i \cdot TV(t, 2i)$.

• TE is a special case of the fundamental complex expression $r \cdot e^{i(\omega x + \theta)}$. We set $\theta = 0$ since we focus more on the relations between different time points than on the value of the first point; we regard the term $r$ as the representation of the observations and leave it to be learned by computing models.

Besides, TE inherits the properties of position-free offset transformation and boundedness.

An echo state network is a fast and efficient recurrent neural network. A typical ESN consists of an input layer $W^{in} \in \mathbb{R}^{N \times D}$, a recurrent layer called the reservoir, $W^{res} \in \mathbb{R}^{N \times N}$, and an output layer $W^{out} \in \mathbb{R}^{M \times N}$. The connection weights of the input layer and the reservoir are fixed after initialization, and the output weights are trainable. $u(t) \in \mathbb{R}^{D}$, $x(t) \in \mathbb{R}^{N}$ and $y(t) \in \mathbb{R}^{M}$ denote the input value, reservoir state and output value at time $t$, respectively. The state transition equation is:

$$x(t) = \tanh\!\left(W^{in} u(t) + W^{res} x(t-1)\right), \quad y(t) = W^{out} x(t).$$
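The following sketch implements the single-frequency time code of Equation 5 with the sampling-rate-aware frequency $\omega = \omega_d \cdot r_s^{-1}(d)$; the sampling rates and the value of $\omega_d$ are hypothetical.

```python
import numpy as np

# Single-frequency time code of Equation 5: TE(d, t) = exp(i * omega * t),
# with omega = omega_d / r_s(d). omega_d and the sampling rates r_s below
# are hypothetical values for illustration.
def te(t, omega_d, r_s):
    return np.exp(1j * (omega_d / r_s) * t)

# Re(TE(t) * conj(TE(t+p))) = cos(omega * p): for the same gap p, a highly
# sampled series (large r_s) gets a smaller omega, so it is not automatically
# assigned a stronger correlation than a sparsely sampled one (Property 2 bias).
t, p, omega_d = 10.0, 2.0, 1.0
for name, r_s in [("heart rate", 60.0), ("blood sample", 0.5)]:
    corr = (te(t, omega_d, r_s) * np.conj(te(t + p, omega_d, r_s))).real
    print(name, round(float(corr), 3))
```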
Equation 11 is the calculation formula for the readout weights, obtained in training by solving the least-squares problem with regularization parameter $\lambda$:

$$W^{out} = Y^{target} X^{\top} \left( X X^{\top} + \lambda I \right)^{-1}, \quad (11)$$

where $X$ collects the reservoir states and $Y^{target}$ the training targets. Algorithm 1 shows the process of using TE-ESN for prediction. Lines 1-11 obtain the solution of the readout weights $W^{out}$ of TE-ESN using the training data. Lines 12-18 show how to predict the values of the test data. Assuming that the reservoir size of TE-ESN is fixed at $N$, the maximum time $M_T$ is $T$, and the input has $D$ time series, the complexity is determined by $N$, $T$ and $D$.

• MG [Mackey and Glass, 1977]. The Mackey-Glass system is a classical chaotic system, often used to evaluate ESNs:

$$y(t+1) = y(t) + \delta \left( a \, \frac{y(t - \tau/\delta)}{1 + y(t - \tau/\delta)^{n}} + b \, y(t) \right).$$

We initialized $\delta, a, b, n, \tau, y(0)$ to $0.1, 0.2, -0.1, 10, 17, 1.2$, and $t$ increases randomly with irregular time intervals (a generation sketch is given at the end of this section). The task is one-step-ahead forecasting over the first 1000 time points.

• USHCN. We extracted the records of temperature, snowfall and precipitation for New York, Connecticut, New Jersey and Pennsylvania. Each TS has irregular time intervals, from 1 to 7 days. Sampling rates differ among the TS, from 0.33 to 1 per day. The task is to early predict the temperature of New York in the next 7 days.

• COVID-19 [Yan L, 2020]. The COVID-19 patients' blood sample dataset was collected between 10 Jan. and 18 Feb. 2020 at Tongji Hospital, Wuhan, China, and contains 80 features from 485 patients with 6877 records. Each TS has irregular time intervals, from 1 minute to 12 days. Sampling rates differ among the TS, from 0 to 6 per day. The tasks are early prediction of in-hospital mortality 24 hours in advance and one-step-ahead forecasting for each biomarker.

The code of the 9 baselines in 3 categories is available at https://github.com/PaperCodeAnonymous/TE-ESN.

• BPRNNs-based: 3 methods designed for ISTS data with BP training: M-RNN [Jinsung et al., 2017], T-LSTM [Baytas et al., 2017] and GRU-D [Che et al., 2018], each introduced in Section 2.

• ESNs-based: 4 methods designed based on ESNs: ESN [Jaeger, 2002], Leaky-ESN [Jaeger et al., 2007], DeepESN [Gallicchio et al., 2017] and LS-ESN [Zheng et al., 2020].

We use Genetic Algorithms (GA) [Zhong et al., 2017] to optimize the hyper-parameters shown in Table 5. Prediction accuracy is measured with $r^2$, the squared correlation coefficient. We show the results from five perspectives below. The conclusions drawn from the experimental results are shown in italics. More experiments are in the Appendix.

Prediction results of the methods, shown in Table 1: (1) TE-ESN outperforms all baselines on the four datasets. This means that learning the two irregularities of ISTS helps prediction, and TE-ESN has this ability. (2) TE-ESN is better than TV-ESN on the multivariate time series datasets (e.g., USHCN), which shows the effect of the sampling-rate-aware property of TE.

Analysis of TE: (1) Figure 3 shows the relation between the TE dot product and the time distance; using multiple frequencies enhances the monotonicity of the negative correlation between the dot product and the distance. (2) Table 3 shows the prediction results under different TE settings; the results show that using multiple frequencies can improve the prediction accuracy.

We test the effects of TE, LS and SF, which are introduced in Section 3.3, by removing the TE term, setting $\gamma_l = 1$ and setting $\gamma_f = 1$, respectively. The results in Table 4 show that all three of these mechanisms of TE-ESN contribute to the final prediction tasks. TE has the greatest impact on COVID-19; the reason may be that the medical dataset has the strongest irregularity compared with the other datasets. LS has the greatest impact on USHCN and SILSO: as there are many long time series, it is necessary to learn the dependencies over different time spans. SF has a relatively small impact, and the results do not change on SILSO and MG, as they are univariate.
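As referenced in the MG dataset description above, the following sketch generates the underlying Mackey-Glass trajectory with the stated parameters; the paper's irregular sampling of time points is not reproduced here, and the step count is an illustrative choice.

```python
import numpy as np

# Discretized Mackey-Glass system with the stated parameters
# (delta=0.1, a=0.2, b=-0.1, n=10, tau=17, y(0)=1.2). Only the underlying
# regular-grid trajectory is generated; the irregular sampling used in the
# paper is omitted.
delta, a, b, n, tau, y0 = 0.1, 0.2, -0.1, 10, 17, 1.2
steps = 10000
hist = int(tau / delta)                 # delay expressed in grid steps
y = np.full(steps + hist, y0)           # constant history as initial condition
for t in range(hist, steps + hist - 1):
    y_tau = y[t - hist]
    y[t + 1] = y[t] + delta * (a * y_tau / (1 + y_tau ** n) + b * y[t])
series = y[hist:]
print(series[:5])
```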
In TE-ESN, each time series has its own reservoir, and the reservoir settings can differ. Figure 4 shows the COVID-19 mortality prediction results when changing the spectral radius $\rho$ and the time skip $k$ of LDH and hs-CRP. Setting uniform hyper-parameters or different hyper-parameters for each reservoir has little effect on the prediction results; thus, we set all reservoirs with the same hyper-parameters for efficiency. Table 5 shows the best hyper-parameter settings.

We also evaluate the memory capacity (MC) of the reservoir [Gallicchio et al., 2018]. Table 6 shows that TE-ESN obtains the best MC, and that the TE mechanism can increase the memory capability.

In this paper, we propose a novel Time Encoding (TE) mechanism in the complex domain to model the time information of Irregularly Sampled Time Series (ISTS). It can represent both the intra-series and the inter-series irregularity. Meanwhile, we create a novel Time Encoding Echo State Network (TE-ESN), which is the first method that enables ESNs to handle ISTS. Besides, TE-ESN can model both the longitudinal long short-term dependencies within time series and the horizontal influences among time series. We evaluate the method and give several model-related analyses on two prediction tasks over one chaos system and three real-world datasets. The results show that TE-ESN outperforms the existing state-of-the-art models and has good properties. Future work will focus more on the dynamic reservoir properties and the hyper-parameter optimization of TE-ESN, and will incorporate deep structures into TE-ESN for better prediction accuracy.

References:

The international sunspot number. Int. Sunspot Number Monthly Bull., online catalogue.
[Jaeger et al., 2007] Optimization and applications of echo state networks with leaky-integrator neurons.
Model-free prediction of spatiotemporal dynamical systems with recurrent neural networks: role of network spectral radius. CoRR, abs/1910.04426.
Predicting COVID-19 disease progression and patient outcomes based on temporal deep learning.
[Sun et al., 2020c] A review of designs and applications of echo state networks.
[Virk, 2006] Diurnal blood pressure pattern and risk of congestive heart failure.
TENER: adapting transformer encoder for named entity recognition. CoRR, abs/1911.04474.
[Zhong et al., 2017] Genetic algorithm optimized double-reservoir echo state network for multi-regime time series prediction.