Data-Driven Option Pricing using Single and Multi-Asset Supervised Learning

Anindya Goswami, Sharan Rajani, Atharva Tanksale

August 2, 2020

Abstract. We propose three different data-driven approaches for pricing European-style call options using supervised machine-learning algorithms. The proposed approaches are tested on two stock market indices, NIFTY50 and BANKNIFTY, from the Indian equity market. Although neither historical nor implied volatility is used as an input, the results show that the trained models have been able to capture the option pricing mechanism better than, or at least as well as, the Black-Scholes formula in all the experiments. Our choice of scale-free inputs and outputs allows us to train models using the combined data of multiple different assets from a financial market. This not only allows the models to achieve far better generalization and predictive capability, but also addresses the paucity of data, the primary limitation of using machine learning techniques. We also illustrate the performance of the trained models during the period leading up to the 2020 stock market crash (Jan 2019 to April 2020).

1. Introduction. Fair pricing of financial instruments is at the heart of market stability. Mispricing securities may cause traders to incur massive losses and can also indirectly affect the financial health of a market. It is thus vital to be able to derive the fair price of tradable financial instruments. The seminal paper [3] laid the foundation of the theory of no-arbitrage option pricing, following which the scope of the theory has been extended by several authors. However, the fair price of an option contract depends on the current anticipation of the future dynamics of the underlying asset. This is why the authors of [15] argued that the success or failure of theoretical option pricing and hedging is closely tied to the success in capturing the dynamics of the underlying asset's price movements. Since this is a hard problem, the adoption of data-driven approaches to pricing option contracts is gaining attention with the advent of superior computational power and advancements in statistical learning techniques. In this manuscript, we propose data-driven approaches for prescribing the fair price of an option contract without assuming any particular theoretical law of the underlying asset dynamics. We also propose and illustrate the use of data drawn from multiple assets/sources to train these data-driven option pricing models. This offers a way to mitigate the possible paucity of data available for training models. We would like to emphasize that the work presented in this study does not attempt to emulate the Black-Scholes formula or any other theoretical option pricing model. In the past, several authors have investigated the possibility of building a data-driven option pricing model; we give a brief overview of the existing literature. In [20], the authors conveyed their belief that the trading process of option contracts itself may reveal analytical models. The data-driven investigations in [15] and [20] were based on option contracts on the S&P 500. While the former used only the moneyness parameter (the ratio of spot to strike values) and the time to maturity as inputs to their learning model, the latter also used historical volatility, the interest rate, and lagged prices of the underlying asset and the option contract.
The authors of [18] obtained better prediction performance than [15] by including the open interest in addition to all the non-lagged inputs of [20]. On the other hand, in [19], S&P 100 data was used to predict the implied volatility instead of the option price, using past volatilities and option-contract parameters. In [17], a variant of implied volatility was used as an input to predict the deviation of the actual market price from the Black-Scholes price of the option contract; the model performance was illustrated on All Ordinaries SPI Index options. If the log returns of the underlying asset are independent of the stock price level, the formula for the fair price of an option is homogeneous of degree one in both the spot and the strike. The authors of [11] implemented this relation in the structure of the neural network and built a model using option contract data of the S&P 500 Index. The authors of [5] discuss how a technique named profiling could be used to select the optimal neural network structure to predict the implied volatility. This technique was illustrated on USD/DEM exchange rate options, and the model took various contract parameters as inputs. The authors of [22] argued that option contract data should be partitioned according to moneyness in order to improve the accuracy of pricing options, and they illustrated this performance improvement using Nikkei 225 Index option contracts. In [13], the authors exhibited the effectiveness of cross validation, Bayesian regularization, early stopping, and bagging in preventing overfitting and improving generalization, in the process of pricing S&P 500 call options using an artificial neural network (ANN). The author of [1] attempted to predict the bid-ask spread of options on the OMX Stockholm 30 Index, using multiple lagged asset prices and their sample standard deviations. In [2], the authors used the dividend rate in addition to Black-Scholes-based features to price option contracts on the FTSE 100 Index; the model performance was compared with the Black-Scholes-Merton price that incorporates dividends. The authors of [12] used S&P 500 option contract data and developed a "modular" ANN model for option price prediction. In particular, they divided the dataset into 9 disjoint parts, or modules, according to the moneyness and the time-to-maturity parameters of the contracts. A similar modularity is adopted in [7], where the authors build a hybrid model using BANKNIFTY option contracts. Some of the previously mentioned papers have prescribed data-driven option hedging strategies, while some others have also demonstrated success in predicting the price of exotic options using their model outputs. The above survey is not meant to be exhaustive but conveys the broadly accepted methodologies for developing supervised learning models to price options. This manuscript borrows aspects like the homogeneity hint and modularity from the existing literature. In this manuscript, we propose three different approaches to generate feature sets from the market data, each of which yields 17 to 22 features. Each feature set is then used to train two models, using an ANN and the XGBoost algorithm respectively. None of the approaches include measures of volatility as features. However, we assume that the statistical distribution of the underlying asset's returns is independent of the level of the stock price. This implies that the option price function is homogeneous of degree one in both the spot price (S) and the strike price (K).
In view of this, we construct feature sets using the underlying asset's log returns, moneyness (S/K), and time to maturity. Furthermore, the output variable has been constructed using the ratio (100 × C/K) of the option price (C) to the strike price (K). The fair price of an option contract must depend on the anticipated statistical distribution of the future price of the underlying asset. We try to incorporate this principle using a non-parametric approach, wherein we consider a fixed number of consecutive order statistics of the log returns of the underlying's daily close prices as features. We compare the performance of this approach with another approach, wherein the feature set consists of only the first two moments of the log returns of the underlying asset's daily Open-High-Low-Close prices. Both approaches appear to be equally effective. Finally, we compare these two approaches with a third approach that augments the second feature set with a few additional features derived from the historical option price data. This particular approach outperforms the previous two, as the option price data contains significant additional information relevant to the present-day option price. To the best of our knowledge, option pricing models using these feature sets have not been reported in the literature so far. In the proposed data-driven approaches, disjoint consecutive intervals of the option contract price are set as the output instead of a single predicted option price, as we believe that no real market is complete. In other words, a random payoff such as an option contract may have multiple fair prices, and a single predicted price is more confusing than convincing. Hence we define the output variable in a manner that conveys the range of fair prices. We measure and compare the performance of the models described in the manuscript using two different error metrics. The first proposed error metric attempts to mimic the mean absolute error (MAE), while the second metric gives the inaccuracy in predicting the option price to lie within a certain neighborhood of the actual option price. We also compare the performance of the proposed models with the theoretical Black-Scholes option pricing model. It is observed that the models constructed using the third approach outperform the Black-Scholes pricing formula in terms of the above-mentioned metrics, whereas the other proposed models perform equivalently, if not better. Again, we would like to emphasize that neither historical nor implied volatility is used as an input in any of the proposed models. We would also like to emphasize that none of the features were selected based on an importance analysis, as the process of determining feature importance essentially depends on the particular choice of training data. Despite maintaining such indifference, the success in predicting option prices indicates that perhaps these data-driven models are capable of learning certain universal rules of option pricing. We also ensure that the inputs and outputs of the models are scale free, which allows us to investigate whether models could be trained on option contract data from two different assets/sources. This, in principle, would allow us to construct models that can capture the option pricing mechanism for a broader range of underlying asset dynamics. Our experiments show that the models trained using data from multiple assets/sources possess superior option pricing capabilities compared to the models trained on individual assets/sources.
These experiments have been performed using NIFTY50 and BANKNIFTY option price data. However, since we have not experimented with a sufficiently broad class of assets, the complete scope and the limitations of this technique (referred to as combined training) are still unclear. Nevertheless, we propose a methodology to gain a deeper understanding of the combined-training effect than what the error metrics offer. In this method, for a trained model, we perform a family of tests using simulated Black-Scholes option price data with varying volatility. The results show that the simple idea of combined training produces models that predict the option price fairly well for a wide range of underlying asset price dynamics. In other words, we observe domain adaptability for a wide variety of simulation data, clearly indicating the effectiveness of the combined training technique. Drawing from the modularity approach proposed by [22], [12] and [7], we choose to train our models on a particular subset of the contract data. To elaborate, we perform our experiments on a "filtered" dataset comprising only near-ATM (at-the-money) contracts. The "filtered" dataset also excludes option contracts that have either too short or too long time-to-maturity values. We believe that including a full range of modularity, as in [22], [12], and [7], would complicate the exposition of this paper with too many experiments, as we study six different models constructed using three approaches and two algorithms, on two different assets/sources. This paper is organized into eight sections. The second section briefly presents the basics of supervised learning, and explains the two supervised learning algorithms used to construct the models. Section 3 contains details about the data under consideration. The inputs and outputs of the learning models are explained in Section 4. In Section 5 we report the performance of the trained models. An analysis of the combined-trained models' performance is presented in Section 6. The performance of the models on 2019-2020 data is given in Section 7. Finally, we comment on future research directions in the last section. Attempts to develop algorithms that are capable of performing a task without explicitly specifying the expected outcome have led to the development of the field of Machine Learning. This manuscript leverages a specific subset of machine learning algorithms, known as supervised learning algorithms. These algorithms take in labelled data as input and "learn" the task at hand. The term "learn" implies that the algorithms construct abstract representations of the data with the aim of capturing patterns that are fundamental to the task at hand. In the following subsections, we briefly describe two supervised learning algorithms, namely Extreme Gradient Boosting (XGBoost) and the Artificial Neural Network (ANN). These algorithms are used in the later sections of this manuscript. Before studying the specifics of the algorithms, it is instructive to understand the general premise of supervised learning algorithms. Consider a finite labelled dataset represented as {(X_1, Y_1), (X_2, Y_2), (X_3, Y_3), . . . , (X_J, Y_J)}, where the vector X_j is associated with a label Y_j. The algorithms attempt to find a mapping f : X_j ↦ Y_j such that the mapping obtained is the "best" out of all the possible mappings. A qualitative assessment of the mapping (also referred to as a model) is made possible by an "objective" function (also known as a "loss function").
The specifics of the objective function and the strategy used to create the mappings vary with the choice of the algorithm.

2.1. Extreme Gradient Boosting. Developed by Tianqi Chen in 2016 (refer [6]), Extreme Gradient Boosting combines two powerful techniques, namely "boosting" and "gradient descent". It builds upon the gradient boosting decision tree algorithms developed by Friedman in 2001 (refer [9]) and 2002 (refer [10]). Gradient boosting involves constructing an ensemble of "weak" learners, which in the case of XGBoost are decision trees. These "weak" learners are combined in an iterative fashion to obtain a "strong" learner. A "weak" learner is a model whose prediction accuracy is only slightly better than that of a model making random predictions. Refer to [8] for more details on how "weak" learners can be combined to create "strong" learners. A typical classification task involves categorizing an input to its label (or class). Successfully performing classification requires the model to determine a close approximation of the true conditional probabilities of the classes, given an input. The XGBoost algorithm, for a set of N output classes, assigns a score F_i(x) to the i-th class for the input x. We define F(x) as (F_1(x), F_2(x), \ldots, F_N(x)). The scores obtained are then used to calculate the probability of each class to be the predicted class by using the softmax function P(x), defined as

P_i(x) = \frac{e^{F_i(x)}}{\sum_{k=1}^{N} e^{F_k(x)}}, \quad i = 1, \ldots, N. \quad (1)

The XGBoost algorithm then computes the "objective" (or loss) function value for each input x by determining how far the distribution of the predicted values is from the true distribution. This is done using Categorical Cross Entropy (CE), a loss function defined as

CE(z, P(x)) = -\sum_{i=1}^{N} z_i \log P_i(x), \quad (2)

where z := (z_1, z_2, \ldots, z_N) is a given p.m.f. of the true outputs. The XGBoost algorithm seeks to minimize the value of this loss function over all possible F(x) based on the training set of J input-output pairs, {(x_j, y_j) | j = 1, 2, 3, \ldots, J}. These pairs are used to compute the value of z^{(j)}, for each j, such that z^{(j)}_i = 1 if y_j = i and z^{(j)}_i = 0 otherwise. At the m-th iteration, the negative gradient of the loss with respect to the current scores gives the pseudo-residuals

r_j^{(m)} = -\left[ \frac{\partial\, CE(z^{(j)}, P(x_j))}{\partial F(x_j)} \right]_{F = F^{(m-1)}}.

The weak learner h^{(m)}(x) is then fit to the training dataset {(x_j, r_j^{(m)})}_{j=1}^{J}. The algorithm then computes the multiplier \alpha^{(m)} using the equation

\alpha^{(m)} = \operatorname{argmin}_{\alpha} \sum_{j=1}^{J} CE\left( z^{(j)}, \operatorname{softmax}\big( F^{(m-1)}(x_j) + \alpha\, h^{(m)}(x_j) \big) \right).

This multiplier \alpha^{(m)} is then used to update the model/score as given by the scheme

F^{(m)}(x) = F^{(m-1)}(x) + \alpha^{(m)} h^{(m)}(x).

The XGBoost algorithm thus results in a strong learner by combining M weak learners in order to obtain a close approximation to the true probability distribution. The reader is encouraged to consult the references cited, as this exposition is not meant to be comprehensive.
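To make the above iteration concrete, the following is a minimal sketch of how such a multi-class classifier could be set up using the open-source xgboost package; the hyperparameter values and the placeholder data are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np
import xgboost as xgb

# Placeholder data: J = 1000 contracts, 22 features each (e.g., Approach I),
# and integer bin labels in {0, ..., N-1} with N = 50 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 22))
y = rng.integers(0, 50, size=1000)

model = xgb.XGBClassifier(
    objective="multi:softprob",  # softmax scores -> class probabilities, as in Equation (1)
    n_estimators=200,            # M: number of boosted weak learners (trees)
    learning_rate=0.1,           # shrinkage applied to each update step
    max_depth=6,                 # depth of each decision-tree weak learner
)
model.fit(X, y)                  # minimizes the categorical cross entropy of Equation (2)

probs = model.predict_proba(X[:5])   # P(x): one probability per class
pred_bins = probs.argmax(axis=1)     # predicted bin = class with the highest probability
```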
2.2. Artificial Neural Network. Developments in the field of machine learning led to the advent of algorithms that sought to mimic biological neural networks. These algorithms (referred to as ANNs) attempt to harness the ability of biological networks to learn patterns within data. This manuscript presents a brief overview of a special type of ANN known as the feed-forward neural network. We use feed-forward neural networks for the experiments proposed in the later sections to classify structured data inputs. The reader may refer to [14] for a comprehensive study of ANNs. [Figure 1: structure of the feed-forward network used; refer to Table 1 for details.] A neural network is a set of "neurons" that interact with each other to "learn" the representation space of the input data. Figure 2 shows the structure of a neuron. As can be seen in Figure 2, the output η of a neuron is given by

\eta = f\left( \sum_{i=1}^{n} w_i \psi_i + b \right),

where ψ = (ψ_1, ψ_2, \ldots, ψ_n) are the inputs to the neuron, w_i is the weight associated with each input ψ_i, and b is the overall bias associated with the neuron; the function f is called the activation function and is used to impart non-linearity to the neural network. As evident from Figure 1, a feed-forward neural network consists of a number of "layers" of stacked neurons. Each neuron in a layer is connected to every neuron in the next layer. Thus the outputs of the neurons in the preceding layer act as the inputs to the neurons in the next layer. As stated earlier, each "connection" between any pair of neurons has a weight w associated with it. The optimal number of layers in a neural network and the number of neurons in each layer are to be determined for a given problem, and are referred to as the architecture of the neural network. Along with this, it is also necessary to determine the appropriate activation functions for each of the neurons, as well as the optimization scheme to be used. The architecture of the ANN used in the present study is given in Table 1.

Table 1. Architecture of the neural network used
  Layer 1: 128 neurons, ReLU activation
  Layer 2: 64 neurons, ReLU activation
  Layer 3: 50 neurons, softmax activation

The activation function used for each layer is indicated in Table 1. The ReLU activation function is defined as f(x) = max(0, x). The softmax function, as explained previously (refer Equation (1)), gives the class probabilities. We use the categorical cross entropy loss function (refer Equation (2)) to determine how far the true probability distribution is from the distribution of the predicted values. In order to "learn" a given task, the weights that minimize the loss function are to be found, as this corresponds to a higher prediction accuracy by the neural network. This is achieved by optimizing the weights using an optimization scheme (commonly known as training the network). In the present study, we use the Adam optimiser, an advancement of the stochastic gradient descent optimizer (refer [16]).
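For illustration, the architecture in Table 1 can be realized, for instance, with the Keras API as follows; the feature count, class count, and training hyperparameters below are placeholders rather than the exact settings used in our experiments.

```python
import numpy as np
from tensorflow import keras

num_features = 22   # e.g., Approach I; Approaches II and III use fewer/different features
num_bins = 50       # output classes, one per price bin (see Section 4.1)

# Architecture from Table 1: 128 ReLU -> 64 ReLU -> 50 softmax
model = keras.Sequential([
    keras.layers.Input(shape=(num_features,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(num_bins, activation="softmax"),
])

# Adam optimizer; the sparse variant of categorical cross entropy
# accepts integer bin labels directly.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# X_train: feature rows; y_train: integer bin labels in {0, ..., num_bins - 1}
X_train = np.random.normal(size=(1000, num_features))   # placeholder data
y_train = np.random.randint(0, num_bins, size=1000)
model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=0)
```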
We aim to model the pricing mechanism of option contracts that are traded in a financial market. NSE, an Indian stock exchange, facilitates the trading of option derivatives on stocks and stock indices in high volumes. Markets with high trading volumes generally imply a high level of trader participation, which in turn implies a lower chance of the market being imperfect (i.e., the market is efficient). This also allows us to consider the traded price of the derivative as the "fair" price. Persistent high trading volumes for a particular range of option contracts give us a better chance to "learn" the pricing mechanism of those option contracts. Some of the NSE-based stock indices that have high option contract trade volumes are the NIFTY50 and BANKNIFTY. For our experimentation, we extract the daily contract price data of call options for both NIFTY50 and BANKNIFTY. Data is extracted for the years 2015-2018 (four years of data) from the contract-wise archive section of the NSE website. It is then ensured that the dataset obtained is purged of contracts that are not traded. For reasons related to the construction of the models, we add a new column to the filtered dataset that records the close price of the same option on the previous day. If the option contract did not exist on the previous day, we report the value 0 in this new column. We subsequently screen the data to remove all rows that have a zero in the new column. We then add more columns to the data array to include the "Open", "High", "Low" and "Close" prices of the underlying asset for the past 20 days corresponding to each row. Furthermore, we add an additional column that represents the three-month government bond yield (see Section 4.2). We then select the option contracts that are in the vicinity of at-the-money (ATM) contracts. To be more precise, we only select those contracts for which the quantity |1 − S/K| is not more than the pre-decided value of 0.04, where K and S are the strike and the spot prices respectively. We refer to such contracts as near-ATM option contracts. It has been observed that numerous near-ATM option contracts are traded every day with identical or different times to maturity. However, significantly low trading volume is observed for contracts with very large or very small times to maturity. Hence we choose to study only those contracts whose time-to-maturity values are not more than 45 days and not less than 3 days. Figure 3 is an indicative sample of the NIFTY50 option contract dataset that we obtain from the NSE. In order to build a predictive model using the algorithms described in Section 2, the dataset needs to be split into separate datasets that are used to train and evaluate the trained models. Most supervised learning algorithms, when trained with time series data, necessitate splitting the dataset linearly, as the individual observations are not independent. In the same vein, we split the dataset in two parts according to the timestamp. The first 33 months, i.e., data from Jan 2015 to Sept 2017, forms the training dataset, and the succeeding data, i.e., from Oct 2017 to Dec 2018, forms the test dataset for evaluating the proposed models. Table 2 shows the number of datapoints we deal with at every step of the model building and evaluation process.
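For concreteness, the contract filters and the chronological split described above can be implemented along the following lines using pandas; the file name and column names are hypothetical placeholders for the extracted dataset's schema.

```python
import pandas as pd

df = pd.read_csv("nifty50_options.csv", parse_dates=["date"])  # hypothetical file/schema

# Keep only traded contracts that also traded on the previous day
df = df[(df["volume"] > 0) & (df["prev_close"] > 0)]

# Near-ATM filter: |1 - S/K| <= 0.04
df = df[(1.0 - df["spot"] / df["strike"]).abs() <= 0.04]

# Time-to-maturity filter: between 3 and 45 calendar days
df = df[(df["tau_days"] >= 3) & (df["tau_days"] <= 45)]

# Chronological split: Jan 2015 - Sept 2017 for training, Oct 2017 - Dec 2018 for testing
train = df[df["date"] <= "2017-09-30"]
test = df[(df["date"] >= "2017-10-01") & (df["date"] <= "2018-12-31")]
```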
As mentioned previously, this study aims to develop supervised learning models that can "learn" the market-perceived pricing of option contracts, and give us the fair price of an option contract in accordance with past market behaviour. In order to develop supervised machine learning models (refer Section 2), we need to train the models with a set of "inputs" and "outputs". Sections 4.2, 4.3 and 4.4 describe the different feature sets, each of which we intend to use as inputs to the supervised learning algorithms. These feature sets are derived from the information available to market participants. Before describing each of the feature sets, we explain the desired format of the output variable, which is kept uniform across all the approaches.

4.1. Categorical Output Variable. As for the output of the proposed data-driven option pricing models, using the option contract prices obtained directly from the market would not be prudent. This is because, for contracts with a fixed value of moneyness, the magnitude of contract parameters like the "Strike" and "Spot" prices may vary over the years. It makes much more sense to create an output variable that is scale free. We therefore define the "output" as the ratio, expressed in percentage, of the close price (C) to the strike price (K) of the contract, i.e., we designate 100 × C/K as the output variable. This ratio serves as a scale-free proxy for the contract price. Since the "output" variable is continuous, it is natural to formulate the problem using a regression model. However, since no real market is complete, a single predicted price of an option contract is more confusing than convincing. Indeed, the fair price could be anything in a certain interval. Determining this interval is a hard problem from both the theoretical and the empirical aspects. Instead of finding such an interval of the fair price, selecting the most likely interval from a pre-determined set of non-overlapping consecutive intervals is fairly straightforward. One can divide the range of outputs into non-overlapping "bins" and select the "embracing" bin as the output variable. However, a major hurdle in this approach is determining the width of the bins. The larger the width of each bin, the less useful the model, due to lack of precision. On the other hand, a finer binning confuses the model due to the presence of a certain degree of indocile uncertainty in the option trading price, which can be attributed to the lack of completeness in the market. The most straightforward way to tackle this quandary would be to formulate an optimization of an appropriate loss function. Instead of adopting such an objective approach, which essentially depends on the type of data and the model used, we first introduce a binning-insensitive performance measure for the models. We refer to this measure as the EM; Subsection 5.1 (refer Equation (5)) gives a description of the proposed metric. We then study the values of EM obtained for different bin widths, for a fixed dataset and a fixed model type. Depending on the persistent stability of EM and the gain of precision, we decide the bin width to be used. Figure 4 shows the results of the procedure used to determine the interval width. We observe that for bin widths larger than 0.1, the supposedly bin-insensitive measure drastically decreases. This is expected, as larger bin intervals imply a smaller number of classes, which makes classification easier for the models due to increased imprecision. For bin widths smaller than 0.075, a certain monotonicity appears. But for bin widths roughly between 0.075 and 0.1, the EM value is insensitive to binning. This manuscript uses a bin width of 0.1 and partitions the entire range of output values in the manner explained in the following paragraph. The interval ((n − 1)w, nw] is set as the n-th bin, where n is a natural number and w (here w = 0.1) is the bin width. This creates a set of equispaced bins, allowing us to map option contracts to their respective bins by computing the value of 100 × C/K for the particular contract and assigning the corresponding integer-valued bin number to it as its label. These labels are then considered as the ordinal output variables and are used to train and test the constructed models. We illustrate this binning in Figure 5. The figure is a histogram (plotted using 0.1 as the bin width) of the 100 × C/K values for the filtered NIFTY50 contract dataset. It is evident from the plot that there are just enough data points per bin, and yet enough categories, i.e., bins, to make the model robust. The above procedure of binning is rather subjective and is not meant to be precise, for a vital reason: any precise data-driven optimization depends upon the choice of the data and the model. On the other hand, we wish to fix the binning regardless of the choice of the model or the dataset, since the binning defines the output variable in the training and test datasets for each model and we wish to keep all the models comparable. Without identical binning, it would not be possible to combine or compare models trained on different datasets.
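The binning rule admits a one-line implementation: with bin width w, the value 100 × C/K lies in bin n exactly when n = ⌈(100 × C/K)/w⌉. A minimal sketch:

```python
import numpy as np

def to_bin(close_price, strike_price, w=0.1):
    """Map an option close price C and strike K to its ordinal bin label.

    Bin n covers the half-open interval ((n-1)*w, n*w] of 100*C/K values.
    Floating-point edge cases at exact bin boundaries are ignored here.
    """
    ratio = 100.0 * close_price / strike_price
    return int(np.ceil(ratio / w))

# Example: C = 150, K = 10000 gives 100*C/K = 1.5, which lies in (1.4, 1.5], i.e. bin 15
assert to_bin(150.0, 10000.0) == 15
```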
4.1.1. Remark. The following subsections (4.2, 4.3 and 4.4) describe three separate, independent "approaches" used to generate feature sets that serve as inputs to the supervised learning algorithms described in Section 2. Here the term "approach" is used to convey the motivation/idea behind generating the feature sets (see Table 3 for a summary).

4.2. Approach I. From the very definition of an option contract, it is known that the fair price of a call option must depend on the values of the option contract parameters (like the strike price (K) and the time to maturity (τ)), the risk-free interest rate (r), the spot price (S), and the anticipated statistical behavior of the future dynamics of the underlying asset. The closest real-world approximation of the value of r is the government bond yield. Amongst the parameters that are available to the practitioner, it is natural to hypothesize that the most important determinants of the option contract's value are the present value of the underlying security and the price dynamics it followed over the past few days. Directly using the past asset price data as features would make the values scale dependent, especially so when data over many years is to be considered for model training. As a means to resolve the scale dependency, the log returns of the time series (henceforth referred to as LR) are considered, the values of which are given by

LR_i = \log\left( \frac{S_i}{S_{i-1}} \right),

where S_i is the i-th term of a time series S. In order to obtain a non-parametric inference of the recent distribution of log returns, we calculate the order statistics of the log returns. This is done by computing the log returns of the daily close prices of the underlying asset for a window of the past 20 trading days, as this corresponds to approximately a calendar month excluding all holidays. Following this, the order statistics are computed by simply arranging the log returns in ascending order for each sample. To be more precise, if x_(i) denotes the i-th order statistic of a sample of distinct real values (x_1, x_2, x_3, \ldots, x_n), then x_(i) = x_j for some j = 1, \ldots, n, and x_(1) < x_(2) < \cdots < x_(n) hold. In view of the preceding discussions, we calculate the order statistics of the historical log returns for each of the near-ATM option contracts, resulting in a row of 22 features as given below (see the sketch after this list):
(1) The 19 log return order statistics.
(2) The time to maturity (τ) of the option contract.
(3) The interest rate r: we use the 3-month sovereign bond yield rates as an approximation of the risk-free interest rates.
(4) Moneyness: this quantity is computed as S/K (the ratio of spot to strike prices).
A collection of such rows is what constitutes the train/test dataset.
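The Approach I feature row can be assembled as in the following sketch; a window of 20 close prices yields 19 log returns, whose sorted values are the order statistics. The function and variable names are ours, chosen for illustration.

```python
import numpy as np

def approach_one_features(close_window, tau, r, spot, strike):
    """Build the 22 Approach I features for one near-ATM contract.

    close_window: the underlying's last 20 daily close prices (oldest first).
    """
    close_window = np.asarray(close_window, dtype=float)
    log_returns = np.diff(np.log(close_window))      # 19 log returns
    order_stats = np.sort(log_returns)               # 19 order statistics
    extras = [tau, r, spot / strike]                 # time to maturity, rate, moneyness
    return np.concatenate([order_stats, extras])     # 19 + 3 = 22 features

row = approach_one_features(np.linspace(100, 102, 20), tau=30, r=0.065,
                            spot=101.5, strike=100.0)
assert row.shape == (22,)
```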
4.3. Approach II. This subsection proposes a feature set that takes into account the market participant's access to other facets of the asset price data. Intuitively, a lot more information on the asset dynamics can be gleaned by taking into account the values of "Open", "High", and "Low" along with the values of "Close" (refer to Figure 6, a cross-section of the underlying asset price dataset). However, this intuitive anticipation deserves a quantitative backing. Let us first understand the need for a completely new "approach". The previous subsection attempted to generate a feature set that captures the empirical distribution of the "Close" price data of the underlying asset. The present subsection seeks to remedy the fact that the asset price data obtained from the market consists of multiple facets that have not been accounted for in Approach I. The joint distribution of these four time series ("Open", "High", "Low" and "Close") cannot be inferred from the order statistics of each individual time series, as the series are not independent. This renders a direct mimicking of Approach I ineffective. Moreover, a direct extension of Approach I would lead to a feature set with 19 × 4 = 76 features. This bloating up of the feature set prevents any meaningful comparison between different models. It is therefore prudent to adopt a moments-based approach to generate a feature set that is sensitive to all facets of the underlying asset data. Instead of trying to obtain an empirical distribution of the multivariate time series, we measure the central tendency and the dispersion using the first raw moment and the covariance matrix Σ of the component-wise log returns of the vector-valued series. The feature set is then built using these statistics. As Σ is symmetric, the six entries of the upper triangular part are repeated in the lower part. We include a square-root transform of the entries of Σ in the feature set after discarding the repetitions. Thus we build the second feature set using the following 17 features:
(1) The means of the log return series: μ_O, μ_H, μ_L, and μ_C.
(2) The ten statistics Σ_ij / \sqrt{|Σ_ij|}, 1 ≤ i ≤ j ≤ 4, where Σ_ij is the (i, j)-th element of Σ, using the convention x/\sqrt{|x|} = 0 iff x = 0.
(3) Features (2)-(4) from Approach I.

4.4. Approach III. Approaches I and II primarily utilize the underlying asset price data to derive the set of features. However, a market participant also has access to the historical option contract trade prices. It would be imprudent not to develop an approach that factors in this key aspect. In fact, including the historical option contract trade prices in an appropriate form would help the supervised learning algorithms develop abstract representations of market factors like implied volatility, allowing them to predict the option contract price more accurately. We would like to stress that the intent of Approach III is to build upon the progress made in Approaches I and II. We cannot use an extension of Approach I for the reasons mentioned previously. Instead, we augment the feature set developed in Approach II by adding the features listed below:
(1) Previous option price (scaled): this is computed as C_{t−1}/K, where C_{t−1} is the previously reported close price of the option contract under study and K is the strike price of the contract. Including this feature helps account for any auto-regressive characteristics that might be present in the option price data.
(2) Mean moneyness: computed as S̄/K, where S̄ is the mean of the underlying asset prices (over a window of the past 20 trading days) and K is the strike price of the contract.
Table 3 summarizes the features used by the three approaches described in this section (an overview of the feature sets for all the approaches). Figure 7 presents an overview of the steps that constitute the process of model building.
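The Approach II statistics admit a compact implementation; note the signed square root x/√|x| (set to 0 when x = 0) applied to the covariance entries, which keeps these features on the scale of the log returns. A sketch, with illustrative names:

```python
import numpy as np

def signed_sqrt(x):
    """x / sqrt(|x|), with the convention that it equals 0 iff x == 0."""
    return np.sign(x) * np.sqrt(np.abs(x))

def approach_two_features(ohlc_window, tau, r, spot, strike):
    """Build the 17 Approach II features from a 20-day OHLC price window.

    ohlc_window: array of shape (20, 4) with Open, High, Low, Close columns.
    """
    log_returns = np.diff(np.log(np.asarray(ohlc_window, dtype=float)), axis=0)
    mu = log_returns.mean(axis=0)                  # 4 means: mu_O, mu_H, mu_L, mu_C
    cov = np.cov(log_returns, rowvar=False)        # 4x4 covariance matrix Sigma
    upper = cov[np.triu_indices(4)]                # 10 distinct entries of Sigma
    extras = [tau, r, spot / strike]
    return np.concatenate([mu, signed_sqrt(upper), extras])   # 4 + 10 + 3 = 17

window = np.abs(np.random.default_rng(1).normal(100, 1, size=(20, 4)))
assert approach_two_features(window, 30, 0.065, 101.5, 100.0).shape == (17,)
```

Approach III simply appends the scaled previous option price C_{t−1}/K and the mean moneyness S̄/K to this vector, giving 19 features.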
Once a model is trained, it is imperative to test the performance of the model on data that has not been used for training (i.e., the test dataset) and study the quality of the predictions. The most common way to evaluate predictions of nominal variables is to compute the value of the accuracy metric A, defined as

A = \frac{C}{T},

where C is the number of correct predictions and T is the total number of predictions. It is, however, not ideal to use the accuracy metric for an ordinal output variable having a wide range. In such cases, one can examine the quality of the incorrect predictions by measuring the distance between the actual and the predicted classes. Doing so is meaningful because it is desirable for a good model to predict a class identical or very close to the actual class. In contrast, the accuracy metric treats all incorrect predictions in the same manner, regardless of whether the predicted class is close to or far from the actual class. It is therefore important to come up with a metric that does a better job of informing us about the quality of the predictions. We define the error metric (EM) as

EM := \frac{w}{T} \sum_{i=1}^{T} |C_i - P_i|, \quad (5)

where w denotes the bin width, T is the number of contracts in the test dataset, and the ordinal variables C_i and P_i denote the actual and the model-predicted bin numbers respectively. As mentioned in Subsection 4.1, we set the value of w as 0.1. Multiplying the bin-number difference by the bin width makes EM asymptotically insensitive to binning. We illustrate the implication of EM in Figure 8, which gives an example of the case where the distance between the actual and the predicted classes is 2. It can easily be proved that EM converges to the Mean Absolute Error (MAE) as the bin width tends to 0. However, the MAE metric is known to be sensitive to outliers. Hence, in order to get a better insight into the performance of the models, we also consider an additional metric, the "inaccuracy metric", that is robust to outliers. The "inaccuracy metric" (ρ) gives the probability of the predicted and actual bins lying more than 2 bins apart. In other words, the metric ρ gives the probability that the model fails to include the actual price bin (labelled C_i) in a band of five consecutive bins with the predicted bin (labelled P_i) in the middle. Henceforth we refer to the above-mentioned band as the predicted band (see Figure 9). The ρ metric is defined as

\rho := \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}\{ |C_i - P_i| > 2 \}. \quad (6)

While EM is a measure of prediction imprecision, the empirical quantiles of the error C_i − P_i give a confidence interval for C_i using the prediction P_i. In particular, 1 − ρ denotes the confidence of C_i being in [P_i − 2, P_i + 2].
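Both metrics are straightforward to compute from the actual and predicted bin labels; a minimal sketch assuming NumPy arrays of integer bin numbers:

```python
import numpy as np

def evaluation_metrics(actual_bins, predicted_bins, w=0.1):
    """Return (EM, rho) as defined in Equations (5) and (6)."""
    actual = np.asarray(actual_bins, dtype=float)
    predicted = np.asarray(predicted_bins, dtype=float)
    em = w * np.mean(np.abs(actual - predicted))       # Equation (5)
    rho = np.mean(np.abs(actual - predicted) > 2)      # Equation (6)
    return em, rho

em, rho = evaluation_metrics([10, 12, 15, 20], [10, 14, 15, 17])
# |differences| = 0, 2, 0, 3  ->  EM = 0.1 * 5/4 = 0.125;  rho = 1/4 (only 3 exceeds 2)
```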
NIFTY50 Index option data. Table 4 lists the EM and ρ values for all models that are trained and tested using NIFTY50 data. The results reported in Table 4 (model evaluation metrics for models trained and tested on NIFTY50 option contract price data) convey that all trained models perform at par with or better than the pricing formula of the Black-Scholes model (we use the historical volatility observed over a window of the past 20 trading days to compute the Black-Scholes price). We also note that, in comparison to XGBoost, the use of the ANN results in lower values of EM and ρ. Table 4 also shows that the values of the metrics do not differ significantly between Approach I and Approach II. This indicates that the two supervised learning algorithms were unable to extract more information on the asset dynamics from the first two moments of the Open-High-Low-Close (OHLC) data than from the close price data alone. From the results, it is also clear that the performance of Approach III is far superior to that of Approaches I and II, which indicates that the historical option price data contains valuable information relevant to the current option price. It is evident from Table 4 that in all cases the EM value is less than 0.19. Loosely speaking, this implies that, on average, the predicted value of 100 × C/K is no further than 0.19 from the actual value (refer to Equation (5)). In other words, the difference between the actual and predicted option prices is, on average, less than 0.0019 × K. A more precise statement in terms of confidence intervals can be made using the empirical quantiles (refer to Figure 10). The 2% and 98% quantiles of C_i − P_i obtained using the Approach I ANN model for NIFTY50 data are −5 and 5 respectively. This implies that the actual price bin is within 5 neighbouring bins of the predicted bin with 96% probability for the test dataset. Similarly, from other quantile values we can also deduce that the actual price bin is within 2 neighbouring bins of the predicted bin with 74% confidence. Figure 10 illustrates this using a plot of the empirical CDF of C_i − P_i. Indeed, the ρ metric is useful in this regard. To be more precise, the difference between the actual and predicted option price intervals is less than 2K/1000 with probability 1 − ρ (refer to Equation (6)). Thus an interval of length 5K/1000 (the predicted band) can be obtained from a model prediction, which succeeds in containing the close price of the option (having strike price K) with probability 1 − ρ (refer to Figure 9). We recall from Figure 5 that this predicted band width is less than one tenth of the full range of option prices for the NIFTY50 data under consideration. We consider the Approach III ANN models to further illustrate the implication of the predicted bands. For this, we first identify the upper and lower limit option prices of the band and compute the corresponding daily implied volatility values for each contract. From these values, we obtain the daily averaged predicted implied volatility band. We then compute the average market-realized implied volatility for each day using near-ATM options data and compare it with the predicted implied volatility band. A time series plot of that comparison is presented in Figure 11 (average empirical IV and the predicted IV band, plotted for the NIFTY50 test dataset). The figure shows that 90% of the time, the market-realized implied volatility lies within the predicted band. It is not surprising that this band prediction error is only 0.10, a value much smaller than the ρ value for Approach III in Table 4. The main reason behind the observed error reduction is the averaging present in the computation. This indicates the possibility of building a superior hybrid model by exploiting such an averaging effect. However, we do not attempt to build such models in the present study.
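The band-to-implied-volatility conversion described above amounts to numerically inverting the Black-Scholes formula at the lower and upper price edges of the predicted band. A sketch follows; the root-search bracket and the parameter values in the example are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def bs_call(spot, strike, tau, r, sigma):
    """Standard Black-Scholes price of a European call (tau in years)."""
    d1 = (np.log(spot / strike) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return spot * norm.cdf(d1) - strike * np.exp(-r * tau) * norm.cdf(d2)

def implied_vol(call_price, spot, strike, tau, r):
    """Invert Black-Scholes in sigma by root finding over an assumed bracket."""
    return brentq(lambda s: bs_call(spot, strike, tau, r, s) - call_price,
                  1e-4, 5.0)

# Band edges: bins P-2 to P+2, converted back to prices via C = K * (bin * w) / 100
w, pred_bin, strike, spot, tau, r = 0.1, 15, 10000.0, 10050.0, 20 / 252, 0.065
lo_price = strike * (pred_bin - 3) * w / 100   # lower edge of bin P-2
hi_price = strike * (pred_bin + 2) * w / 100   # upper edge of bin P+2
iv_band = (implied_vol(lo_price, spot, strike, tau, r),
           implied_vol(hi_price, spot, strike, tau, r))
```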
BANKNIFTY Index option data. Table 5 lists the performance of the models that were trained and tested on BANKNIFTY Index data. The data processing, feature-set generation, and train-test splitting for the BANKNIFTY options dataset are done in exactly the same way as for the NIFTY50 Index option data, in accordance with the methodologies laid down in Sections 3 and 4. It can clearly be seen that the values of EM and ρ are the lowest for the Approach III models. The evaluation metrics for Approach III are also lower than those for the Black-Scholes formula. Using the trained models and the results shown in Table 5 (model evaluation metrics for models trained and tested on BANKNIFTY option contract price data), an analysis similar to the one performed for the NIFTY50-trained models can be carried out, but we avoid repetitive explanation. From the results shown in Table 5 and Table 4, it is evident that the Approach III ANN models perform significantly better than all other proposed models. Furthermore, they are far more accurate than what the Black-Scholes formula can prescribe. Having said so, it is also important to recall that no measure of volatility has been fed into any of the proposed models. We also present a set of experiments that shows the promise of ensemble modeling.

5.3. Ensemble Models. The predictions of the two pricing models obtained using the ANN and XGBoost for each approach can be averaged to obtain a new prediction. We refer to this as the prediction of a simple ensemble model. The rationale behind this approach is straightforward. It is plausible that, for a particular approach, the XGBoost model learns a subset of the representation space very well, but does not learn some other subsets well enough. The ANN model could hypothetically learn those missed subsets of the representation space better than the XGBoost model. By averaging the predictions of the models, we seek to minimize the number of subsets over which the individual models perform poorly. Averaging the model predictions allows us to leverage the well-learnt portions of the representation spaces of both models at the same time. We evaluate the performance of the ensemble models by computing the EM and ρ values for the test sets. Tables 6 and 7 present the model evaluation metric values for the ensemble models trained and tested on NIFTY50 and BANKNIFTY contracts respectively. The results in Table 6 and Table 7 show a marked improvement in the EM values for all the approaches when compared to the results in Table 4 and Table 5 respectively. Remark: it is important to note that the "prediction" (P) of the ensemble model need not be an integer class label but could instead be an integer multiple of 1/2. However, no change is needed in the computation scheme of the model evaluation metrics.
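The ensemble prediction is then just the arithmetic mean of the two bin labels; as noted in the remark, the result may be a half-integer, which the evaluation metrics handle without modification. A minimal sketch:

```python
import numpy as np

def ensemble_bins(ann_bins, xgb_bins):
    """Average the ANN and XGBoost bin predictions for each contract."""
    return (np.asarray(ann_bins, dtype=float) + np.asarray(xgb_bins, dtype=float)) / 2.0

# Example: predictions 14 and 15 average to the half-integer "bin" 14.5
p = ensemble_bins([14, 20], [15, 20])
assert list(p) == [14.5, 20.0]
```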
5.4. Trained with Multiple Sources. Since the features and the output variable used are scale free, models trained on one asset should be able to give reasonable option price predictions for another asset, provided their log return distributions are not too different from each other. This anticipation hinges on our assumption that, for a given financial market, two assets having the same return distribution should have the same option pricing mechanism. Conversely, there is a possibility that the prediction quality may be inferior even though the training and test datasets belong to the same asset, as the return dynamics of the underlying asset may have changed drastically. This subsection presents some experiments in this direction. We first carry out an empirical investigation of the asset portability of the models. In order to do this, we consider all six models trained on NIFTY50 option contracts, and test them with data from BANKNIFTY-based contracts on non-overlapping time intervals. The results of this experiment are given in Table 8. It is crucial to note that these two indices are sufficiently independent and have contract parameters with vastly different magnitudes. We present the Q-Q plot (Figure 12) of the "Close" price log returns of the two underlying assets in order to compare their log return distributions. Figure 12 shows a moderate mismatch between the return distributions of these two assets. Thus, although we do not expect the predictive performance to be equivalent to that on NIFTY50 test sets, we expect the error metrics to be decently small in magnitude. Our experiment supports this anticipation. Moreover, a quick comparison of our results (Table 8: model evaluation metrics for models trained on NIFTY50 contract data and tested on BANKNIFTY contracts) with Table 5 shows that the NIFTY50-trained models outperform the BANKNIFTY-trained models on the BANKNIFTY test set. This gives evidence of the fact that a model trained on a different asset/source can outperform a model trained on the target asset/source. The results of the above experiment encourage us to train the models using contract data from two or more assets/sources. In principle, this should broaden the range of features and allow the models to achieve far better generalization and predictive capability. We investigate this by training all six models (one XGBoost and one ANN for each of the three approaches) using the combined data of NIFTY50 and BANKNIFTY contracts and then performing out-of-sample tests for each asset. The EM and ρ values of the respective experiments are given in Table 9. A comparison of the metrics given in Table 9 (model evaluation metrics for models trained on both NIFTY50 and BANKNIFTY contract data) with those in Tables 4 and 5 clearly shows that the combined-trained models have better option pricing capabilities than the models trained on the respective assets individually. Each of the combined-trained models also outperforms the price prescription of the Black-Scholes formula. The performance of the option price prediction can be better perceived using a scatter plot of the actual and predicted option prices, which we present in Figure 13. Since the proposed models predict a bin, in order to plot the graph (Figure 13) we take the midpoint of the predicted bin to get a single predicted price. The prices obtained using the midpoints of the bins are plotted along the horizontal axis, and the actual prices are plotted along the vertical axis in the scatter plot. The scatter plot shown in Figure 13 is constructed using the predictions given by the Approach III ANN model (trained using combined data). To the plot, we add the line y = x (dashed red) and the orthogonal regression line (dashed green). The proximity of these two lines validates the absence of bias in the model. In principle, such scatter plots can be constructed for all the proposed models. The success of the above experiment warrants an in-depth explanation. In the next section, we use the concept of domain adaptation to design a methodology that provides a deeper understanding of the combined-training effect.

This section brings to the fore an interesting application of the models constructed using Approach I in Sections 5.2 and 5.4. We test the pre-trained models (obtained using Approach I) with simulated Black-Scholes option price data. A family of such tests is conducted by varying the volatility parameter of the Geometric Brownian motion that is used to generate the simulated asset price time series; these time series datasets are then augmented with the option prices prescribed by the Black-Scholes formula. We recall from Section 4.2 that the Approach I based models use the order statistics of the log returns of the underlying asset's daily close prices as their primary inputs.
Thus Approach I can be directly used to generate the simulated test datasets by treating the simulated time series data as "Close" prices. Approach II and Approach III, however, cannot be used directly, as they use the "Open", "High" and "Low" time series along with the "Close" series to generate the features, and simulating the corresponding "Open", "High" and "Low" series is not straightforward. Hence we only use Approach I based models for the experiments described in this section. We simulate Geometric Brownian motion with the drift parameter set at μ = 0.1 and vary the volatility parameter from 1% to 20% in increments of 1%. Daily data is simulated for each value of the volatility parameter, such that we obtain a test set that represents a trading session of 500 days. This test data is augmented with the prices of several near-ATM option contracts (with time-to-maturity values in {10, 25, 40} days) using the Black-Scholes formula. We then find the prediction error of the models for each variant of the test data and plot it against the volatility parameter. We do this using the XGBoost/ANN models trained on NIFTY50, BANKNIFTY and the combined dataset respectively. The purpose of this exercise is to explain the results detailed in Section 5.4. It has not been done to judge the performance of the trained models on data derived from theoretical models (as option contract prices obtained using theoretical models involve certain mathematical assumptions that render the pricing obtained dissonant from reality). The volatility value minimizing EM provides a class of theoretical asset dynamics whose option prices are best predicted by the trained model. We call it the "Error Minimizing Volatility", or EMV, of a given option price dataset corresponding to the learning model; the EMV of each model can be read off Figures 14 and 15. From Figures 14 and 15 it is evident that the EM plots obtained for the combined-trained models give a lower and flatter V-shaped curve. This implies that models trained on the combined dataset result in lower EM values for a wide range of test sets having varying σ values. This hints at the possibility of domain adaptability of predictive models trained on datasets derived from multiple assets/sources. It also hints at the existence of a common representation space for datasets with similar log return distributions. Such an application of domain adaptability can be a very powerful method, as it could potentially aid research in areas where data is scarce.
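A sketch of the simulated test bed follows: a GBM close-price path is generated for each volatility level, and Black-Scholes call prices are attached for near-ATM contracts at the three maturities. The daily step size of 1/252 years and the illustrative interest rate are our assumptions.

```python
import numpy as np
from scipy.stats import norm

def gbm_path(s0, mu, sigma, n_days, rng, dt=1.0 / 252):
    """Simulate daily close prices of a geometric Brownian motion."""
    z = rng.normal(size=n_days)
    steps = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    return s0 * np.exp(np.cumsum(steps))

def bs_call(spot, strike, tau, r, sigma):
    """Black-Scholes price of a European call option (tau in years)."""
    d1 = (np.log(spot / strike) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return spot * norm.cdf(d1) - strike * np.exp(-r * tau) * norm.cdf(d2)

rng = np.random.default_rng(42)
simulated_tests = {}
for sigma in np.arange(0.01, 0.21, 0.01):        # volatility swept from 1% to 20%
    path = gbm_path(100.0, mu=0.1, sigma=sigma, n_days=500, rng=rng)
    for tau_days in (10, 25, 40):                # maturities used for the near-ATM contracts
        strikes = np.round(path)                 # ATM-ish strikes, one contract per day
        prices = bs_call(path, strikes, tau_days / 252.0, r=0.065, sigma=sigma)
        simulated_tests[(round(float(sigma), 2), tau_days)] = prices
# Each price series would then be binned (Section 4.1) and the Approach I
# features built from the simulated "Close" path, exactly as for market data.
```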
During the period from January 2020 to April 2020 of the COVID-19 pandemic, the dynamics of the NIFTY50 Index were radically different from its usual dynamics. A Q-Q plot comparison of the log return distributions of the NIFTY50 Index during the periods Oct '19 - Dec '19 and Jan '20 - Mar '20 is shown in Figure 16. It is evident from the Q-Q plot that there is almost no match between the price dynamics of these two time intervals. Therefore, for option contracts based on the NIFTY50 Index, we cannot expect the models trained on 2015-2017 data to perform well on 2019-2020 data. Table 11 presents the values of the performance metrics when the pre-trained Approach III models (constructed in Sections 5.2 and 5.4) are tested on 2019-2020 data for the NIFTY50 Index. We consciously choose the models constructed using Approach III as the benchmark for testing the 2019-2020 dataset, as these models have given us the best predictive capability. Table 11 (model evaluation metrics for models trained on 2015-2017 NIFTY50 contract data and tested on 2019-2020 NIFTY50 contract data) makes it evident that the error in predicting option prices for the 2019-2020 NIFTY50 test data is significantly larger in comparison to the prediction error for the 2017-2018 NIFTY50 test data reported in Table 4. It must be noted, however, that the performance of the models on the recent data is far better than what the Black-Scholes formula prescribes. The large values of the evaluation metrics for the Black-Scholes pricing formula imply a large gap between the historical and implied volatilities. This is typically observed when drastic changes occur in a financial market. We also observe a significant improvement in the case of the combined-trained models as compared to the individually trained NIFTY50 models. This reaffirms the power of combined training. In addition to the above experiments, we plot the empirical IV and the predicted IV band in Figure 17, in a manner similar to the plot reported in Figure 11. The band prediction error for the 2019-2020 dataset (Figure 17) is 25%, which is lower than the value of ρ observed in Table 11. Figure 17 helps us identify regions of the test dataset where the model does not perform well. It is observed that when the implied volatility of the underlying asset changes sharply, the prediction bands deviate from the actual values. These abrupt changes are usually caused by rapid changes in market sentiment (in this case due to the COVID-19 pandemic), an aspect that is not represented in the data used to train the models.

In this paper, we have presented three data-driven approaches to building option pricing models using supervised learning algorithms. These approaches are illustrated for two different assets/sources (NIFTY50 and BANKNIFTY), and we use two different learning algorithms to build a range of models. Upon evaluating the performance of the models on out-of-sample data, it was seen that the Approach I and II based models performed better than the Black-Scholes option pricing formula in most cases, while the Approach III based models performed significantly better than all comparative models. Since Approach III uses features derived from the historical option price data that are not present in the Approach I and II feature sets, the performance improvement clearly indicates the vitality of including such information. The results also highlight the superior performance of the ANN-based models in comparison to the XGBoost-based models. In this paper, we have also built averaging ensemble models for each data source, the results of which clearly show a marked improvement in the accuracy of pricing option contracts. Lastly, we have investigated the effect of multi-asset combined training for each of the proposed approaches. It was observed that the multi-asset trained models gave a significant improvement in prediction quality when compared to the single-asset trained models. We have further examined this performance enhancement using the concept of domain adaptation. The success of the multi-asset trained models makes us optimistic about the viability of building a non-asset-specific data-driven option pricing model. Such a model, once trained on data from multiple assets belonging to a particular financial market, would be capable of predicting the fair price of any European-style call option on any asset belonging to the same financial market with a high degree of precision.
However, in our paper, we have examined the combined-training effect using only two assets/sources. Extensive experimentation is required to determine the limitations and the scope of such non-asset-specific models. Readers may refer to [21], which reports a similarly extensive experiment studying other universal, non-asset-specific relations captured by a deep learning model. Further research to develop and validate the existence of such models is planned by the authors. The code used in this study can be made available on request.

Acknowledgments. We are grateful to Arkaprava Sinha and Prof. Amit Mitra (IIT Kanpur) for some useful discussions.

References:
- A Neural Network versus Black-Scholes: a comparison of Pricing and Hedging Performances
- Black-Scholes versus artificial neural networks in pricing FTSE 100 options. Intelligent Systems in Accounting
- The pricing of Options and Corporate Liabilities
- Classification and regression trees
- Profiling Neural Networks for Option Pricing
- XGBoost: A Scalable Tree Boosting System. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- A new hybrid parametric and machine learning model with homogeneity hint for European-style index option pricing
- A decision-theoretic generalization of on-line learning and an application to boosting
- Greedy function approximation: a gradient boosting machine
- Stochastic gradient boosting
- Pricing and Hedging Derivative Securities with Neural Networks and a Homogeneity Hint
- Option Pricing with Modular Neural Networks
- Pricing and Hedging Derivative Securities with Neural Networks: Bayesian Regularization, Early Stopping and Bagging
- A Nonparametric Approach to Pricing and Hedging Derivative Securities via Learning Networks
- Adam: A Method for Stochastic Optimization
- A hybrid neural network approach to the pricing of options
- Option pricing using Artificial Neural Networks: the case of S&P 500 Index Call Options. Neural Networks in Financial Engineering
- Using neural networks to forecast the S&P 100 implied volatility
- Universal features of price formation in financial markets: perspectives from deep learning. Quantitative Finance
- Option Price Forecasting using Neural Networks