title: Using AntiPatterns to Avoid MLOps Mistakes
authors: Muralidhar, Nikhil; Muthiah, Sathappah; Butler, Patrick; Jain, Manish; Yu, Yu; Burne, Katy; Li, Weipeng; Jones, David; Arunachalam, Prakash; McCormick, Hays 'Skip'; Ramakrishnan, Naren
date: 2021-06-30

We describe lessons learned from developing and deploying machine learning models at scale across the enterprise in a range of financial analytics applications. These lessons are presented in the form of antipatterns. Just as design patterns codify best software engineering practices, antipatterns provide a vocabulary to describe defective practices and methodologies. Here we catalog and document numerous antipatterns in financial ML operations (MLOps). Some antipatterns are due to technical errors, while others are due to not having sufficient knowledge of the surrounding context in which ML results are used. By providing a common vocabulary to discuss these situations, our intent is that antipatterns will support better documentation of issues, rapid communication between stakeholders, and faster resolution of problems. In addition to cataloging antipatterns, we describe solutions, best practices, and future directions toward MLOps maturity.

The runaway success of machine learning models has given rise to a better understanding of the technical challenges underlying their widespread deployment [28, 34]. There is now a viewpoint [16] encouraging the rethinking of ML as a software engineering enterprise. MLOps (Machine Learning Operations) refers to the body of work that focuses on the full lifecycle of ML model deployment, performance tracking, and ensuring stability in production pipelines. At the Bank of New York Mellon (a large-scale investment banking, custodial banking, and asset servicing enterprise) we have developed a range of enterprise-scale ML pipelines, spanning areas such as customer attrition forecasting, predicting treasury settlement failures, and balance prediction. In deploying these pipelines, we have encountered several recurring antipatterns [5] that we wish to document in this paper. Just as design patterns codify best software engineering practices, antipatterns provide a vocabulary to describe defective practices and methodologies. Antipatterns often turn out to be commonly used approaches that are actually bad, in the sense that their consequences outweigh any benefits. Using antipatterns to describe what is happening helps ML teams get past blamestorming and arrive at a refactored solution more quickly. While we do not provide a complete formal antipattern taxonomy, our intent here is to support better documentation of issues, rapid communication between stakeholders, and faster resolution of problems. Our goals are similar to the work of [34], which argues for the study of MLOps through the lens of hidden technical debt. While many of the lessons from [34] dovetail with our own conclusions, our perspective here is complementary: we focus less on software engineering and more on data pipelines, how data is transduced into decisions, and how feedback from decisions can (and should) be used to adjust and improve the ML pipeline. In particular, our study recognizes the role of multiple stakeholders (beyond ML developers) who play crucial roles in the success of ML systems.
Our main contributions are: (1) We provide a vocabulary of antipatterns that we have encountered in ML pipelines, especially in the financial analytics domain. While many appear obvious in retrospect, we believe cataloging them here will contribute to greater understanding and maturity of ML pipelines. (2) We argue for a new approach that rethinks ML deployment not just in terms of predictive performance but in terms of a multi-stage decision making loop involving humans. This leads to a more nuanced understanding of ML objectives and how evaluation criteria dovetail with deployment considerations. (3) Finally, similar to Model Cards [25], we provide several recommendations for documenting and managing MLOps at an enterprise scale. In particular, we describe the crucial role played by model certification authorities in the enterprise.

Capital markets have long suffered from a nagging problem: every day, roughly 2% of all U.S. Treasuries and mortgage-backed securities set to change hands between buyers and sellers do not end up with their new owners by the time they are supposed to arrive. Such 'fails' happen for many reasons, e.g., unique patterns in trading, supply and demand imbalances, speediness of given securities, operational hiccups, or credit events. After the collapse of Lehman Brothers, which led to an increase in settlement fails, the Treasury Market Practices Group (TPMG) in our organization recommended daily penalty charges on fails to promote better market functioning. The failed-to party generally requests and recoups the TPMG fails charge from the non-delivering counterparty. After broad industry adoption, according to the Federal Reserve, the prevailing rate of settlement fails has fallen considerably. In the middle of the COVID-19 market crisis, demand for cash and cash-like instruments such as Treasuries was drastically higher than normal, compounding the issue of settlement fails. Stress in the Treasury market prompted the Fed to step in and buy more of the securities to restore calm.

We have developed a machine learning service that uses intraday metrics and other signals as early indicators of liquidity issues in specific sets of bonds to forecast settlement failures by 1:30 p.m. New York time each day. The service also takes into account elements like the velocity of trading in a given security across different time horizons, the volume of bonds circulating, a bond's scarcity, the number of trades settled every hour, and any operational issues, such as higher-than-normal cancellation rates. Fig. 1b showcases the daily failure rate dynamics (per hour) and characterizes the complexity of the task that the aforementioned machine learning service is modeling. The resulting predictions help our clients, including bond dealers, to monitor their intraday positions much more closely, manage down their liquidity buffers for more effective regulatory capital treatment, and offset their risks of failed settlements. Through this and other ML services we have gained significant insight into MLOps issues that we aim to showcase here.

In developing and deploying this application, we encounter issues such as: (1) Does the data processing pipeline have unintended side-effects on modeling due to data leakage or HARKing [13]? (Sections 3.1, 3.6) (2) What happens when models 'misbehave' in production? How is this misbehavior measured? Are there compensatory or remedial pipelines? (Sections 3.2, 3.8) (3) How often are models re-trained and what is the process necessary to tune models?
Is the training and model tuning reproducible? (Section 3.3) (4) How is model performance assessed and tracked to ensure compliance with performance requirements? (Sections 3.4, 3.5) (5) What constitutes a material change in the MLOps pipeline? How are changes handled? (Section 3.7) (6) Where does the input data reside and how is it prepared on a regular basis for input to an ML model? (Section 3.9) Any organization employing ML in production needs to grapple with (at least) each of the questions above. In the process of doing so, they might encounter several antipatterns, as we document below. For the most part, we present our antipatterns (summarized in Table 1) in a supervised learning or forecasting context. In a production ML context, there is typically a model that has been approved for daily use. Over time, such a model might be replaced by a newer (e.g., more accurate) model, retrained on more recent data (but keeping existing hyperparameters or ranges fixed), or retrained with a new search for hyperparameters in addition to retraining with recent data. In this process, we encounter a range of methodological issues leading to several antipatterns, which we identify below.

The separation of training and test data, while extolled in every ML101 course, can sometimes be violated in insidious ways. Data leakage refers broadly to scenarios wherein a model makes use of information that it is not supposed to have or would not have available in production. Data leakage leads to overly optimistic model performance estimates and poses serious downstream problems upon model deployment (especially in high-risk applications). Leakage can sometimes happen unintentionally when feature selection is driven by model validation or test performance, or due to the presence of (typically unavailable) features highly correlated with the label. Samala et al. [33] discuss the hazards of leakage in more detail, paying particular attention to medical imaging applications. In our domain of financial analytics, increasingly complex features are constantly developed, such that their complexity masks underlying temporal dependencies, which are often the primary causes of leakage. Below are specific leakage antipatterns we have encountered.

Peek-a-Boo AntiPattern. Many source time-series datasets are based on reporting that lags the actual measurement. A good example is jobs data, which is reported in the following month. Modelers who are simply consuming this data may not be cognizant that the date of availability lags the date of the data, and may unwittingly include it in their models inappropriately.

Temporal Leakage AntiPattern. When constructing training and test datasets by sampling, the process by which such sampling is conducted can cause leakage and thus lead to training and test sets that are not truly independent. In forecasting problems especially, temporal leakage happens when the training and test split is not carried out sequentially, thereby leading to high correlation (owing to temporal dependence and causality) between the two sets.

Oversampling Leakage AntiPattern. An egregious form of leakage can be termed oversampling leakage, seen in situations involving a minority class. A well-known strategy in imbalanced classification is to perform minority over-sampling, e.g., via an algorithm such as SMOTE [8]. In such situations, if oversampling is performed before splitting into training and test sets, then there is a possibility of information leakage.
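As a concrete illustration, the following minimal sketch (in Python, using scikit-learn and imbalanced-learn on synthetic data, not our production pipeline) contrasts the leaky ordering, where oversampling precedes the split, with the safe ordering, where oversampling is applied only to the training split.

```python
# Minimal sketch of oversampling leakage (assumes scikit-learn and imbalanced-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Leaky: synthetic minority samples are generated from the full dataset,
# so information about (future) test points bleeds into training.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_res, y_res, random_state=0)
leaky = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("leaky F1:", f1_score(y_te, leaky.predict(X_te)))  # overly optimistic

# Safe: split first, then oversample the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_tr_res, y_tr_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
safe = RandomForestClassifier(random_state=0).fit(X_tr_res, y_tr_res)
print("safe F1:", f1_score(y_te, safe.predict(X_te)))  # honest estimate
```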
Due to the subtle nature of this type of leakage, we showcase illustrations and performance characterizations of oversampling leakage, in the context of customer churn detection in banking transactions, in Fig. 2.

Metrics-from-Beyond AntiPattern. This type of antipattern can also be seen as pre-processing or hyper-parameter leakage. Oftentimes, due to carelessness in pre-processing data, training and test datasets are grouped and standardized together, leading to leakage of test data statistics. For example, when using standard normalization, if test and train datasets are normalized together, then the sample mean and variance used for normalization are functions of the test set, and thus leakage has occurred.

Once models are placed in production, we have seen that predictions are sometimes used as-is without any filtering, updating, reflection, or even periodic manual inspection. This is an issue especially in situations where we see 1) concept drift (discussed in Section 3.7), 2) irrelevant or easily recognizable erroneous predictions, and 3) adversarial attacks. It is important to have systems in place that can monitor, track, and debug deployed models. For instance, under such situations it can be productive to have a meta-model that evaluates every model prediction and deems whether it is trustworthy (or of the required quality) to be delivered. For example, Ramakrishnan et al. [31] describe a meta-model called the fusion and suppression system, which is responsible for generating the final set of alerts from an underlying alert stream originating from multiple ML models. The fusion and suppression system performs duplicate detection, fills in missing values, and is also used to fine-tune precision / recall by suppressing alerts deemed to be of low quality. A second solution could be to inspect model decisions further by employing explanation frameworks like LIME [32]. Fig. 3 characterizes model decisions inspected using such explanation frameworks.

Hyper-parameter values often prove to be significant drivers of model performance; they are expensive to tune and mostly task specific. Hyper-parameters play such a crucial role in modeling architectures that entire research efforts are devoted to developing efficient hyper-parameter search strategies [3, 14, 27, 29, 37]. The set of hyper-parameters differs for different learning algorithms. For instance, even a simple classification model like the decision tree classifier has hyper-parameters such as the maximum depth of the tree, the minimum number of samples required to split an internal node, and the criterion used to estimate either the impurity (Gini) or the information gain (entropy) at each node. Ensemble models like random forest classifiers and gradient boosting machines have additional parameters governing the number of estimators (trees) to include in the model. Another popular classifier, the support vector machine, is a maximum-margin classifier requiring the specification of hyper-parameters that govern the type of kernel used (polynomial, radial basis function, linear, etc.) as well as the penalty for misclassification, which in turn governs the margin of the decision boundary learned. For an exhaustive analysis of the effect of hyper-parameters, please refer to [37], wherein the authors perform a detailed analysis of the important hyper-parameters (along with appropriate prior distributions for each) for a wide range of learning models.
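To make the preceding enumeration concrete, such search spaces can be written down explicitly as configuration and the selected values persisted alongside the model artifact. The sketch below is a minimal illustration using scikit-learn; the specific grids, the seed, and the file name hparam_record.json are assumptions for illustration, not the settings used in our pipelines.

```python
# Illustrative sketch: declare hyper-parameter search spaces explicitly as
# configuration and persist the chosen values (grids are illustrative only).
import json
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEARCH_SPACES = {
    "decision_tree": {
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 10, 50],
        "criterion": ["gini", "entropy"],   # impurity vs. information gain
    },
    # Analogous grids would be declared for random forests (n_estimators, ...)
    # and SVMs (kernel, C, ...).
}

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      SEARCH_SPACES["decision_tree"], scoring="f1", cv=5)
search.fit(X_tr, y_tr)

# Record the space searched and the configuration chosen, next to the model.
with open("hparam_record.json", "w") as fh:
    json.dump({"search_space": SEARCH_SPACES["decision_tree"],
               "best_params": search.best_params_,
               "cv_f1": float(search.best_score_)}, fh, indent=2)
```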
The resurgent and recently popular learning methodology employing deep neural networks also has hyper-parameters such as the hidden size of intermediate layers, the types of units to employ in the network architecture (fully connected, recurrent, convolutional), the types of activation functions (TanH, ReLU, Sigmoid), and the types of regularization to employ (dropout layers, batch normalization, strided convolutions, pooling, and norm-based regularization terms). In the context of deep learning models, this area of research is termed neural architecture search [11].

[Figure 2 (caption fragment): model performance appears better in (a) (Table 2a) than in (b) (F1 score for the Attrited class in Table 2b) due to leakage of information from over-sampling before selecting the test set.]
[Figure 3: Inspecting model decisions using explanation frameworks, demonstrated on a churn (i.e., attrition) detection application using the LIME [32] framework. (a) Feature importance characterization; (b) single-instance explanation using LIME.]

Hyper-parameter optimization has been conducted in multiple ways; thus far, a combination of manual tuning of hyper-parameters with grid-search approaches has proven most effective [15, 20, 21] in searching over the space of hyper-parameters. In [2], the authors propose that random search (within a manually assigned range) of the hyper-parameter space yields a computationally cheap and equivalent, if not superior, alternative to grid-search-based hyper-parameter optimization. Yet other approaches pose the hyper-parameter search as a Bayesian optimization problem [17, 36] over the search space. Fig. 4 characterizes the optimization process on the learning task of detecting "churn" (customer attrition) from customers' activity patterns in banking transactions. The figures therein yield an analysis of the hyper-parameter optimization process, characterizing the relative importance of each hyper-parameter employed in the learning pipeline. As hyper-parameters play such a crucial role in learning (e.g., we notice from the statistics in Fig. 4a that an XGBoost model with hyper-parameter tuning achieves a 3.5% improvement in the F1 score of detecting attrited customers relative to an XGBoost variant without hyper-parameter tuning, i.e., Fig. 4b), it is imperative that the part of a learning pipeline concerned with hyper-parameter optimization be explicitly and painstakingly documented so as to be reproducible and easily adaptable.

Like many applied scientific disciplines, machine learning (ML) research is driven by the empirical verification and validation of theoretical proposals. Novel contributions to applied machine learning research comprise (i) validation of previously unverified theoretical proposals, (ii) new theoretical proposals coupled with empirical verification, or (iii) effective augmentations to existing learning pipelines that yield improved empirical performance. Sound empirical verification requires a fair evaluation of the proposed approach w.r.t. previously proposed approaches. However, it is quite often the case that empirical verification of newly proposed ML methodologies is insufficient, flawed, or found wanting. In such cases, the reported empirical gains are actually just an occurrence of the Perceived Empirical SuperioriTy (PEST) antipattern. For example, in [14], the authors question claimed advances in reinforcement learning research due to the lack of significance metrics and the variability of results.
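One lightweight guard against this antipattern is to report variability explicitly when comparing a simple baseline against a more complex model. The minimal sketch below runs both models over several seeds on synthetic data and reports mean and standard deviation of the F1 score; the choice of models, number of seeds, and data are assumptions for illustration only.

```python
# Minimal sketch: report variability across seeds when comparing a simple
# baseline against a more complex model (synthetic data; illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def f1_over_seeds(make_model, seeds=range(5)):
    scores = []
    for seed in seeds:
        X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                                   random_state=seed)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                                  random_state=seed)
        model = make_model(seed).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te)))
    return np.mean(scores), np.std(scores)

# Report mean +/- std rather than a single (possibly lucky) number.
for name, factory in [
    ("logistic regression (baseline)", lambda s: LogisticRegression(max_iter=1000)),
    ("gradient boosting", lambda s: GradientBoostingClassifier(random_state=s)),
]:
    mean, std = f1_over_seeds(factory)
    print(f"{name}: F1 = {mean:.3f} +/- {std:.3f}")
```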
In [24], the authors argue that many years of claimed superiority in empirical performance in the field of language modeling are unfounded, and show that the well-known stacked LSTM architecture (with appropriate hyperparameter tuning) outperforms more recent and more sophisticated architectures. In [26], the authors highlight a flaw in many previous research works (in the context of Bayesian deep learning) wherein a well-established baseline (Monte Carlo dropout), when run to completion (i.e., when learning is not cut off preemptively after a specified number of iterations), achieves results similar or superior to the very models that claimed superiority over it when introduced. The authors thereby motivate the need for identical experimental pipelines for the comparison and evaluation of ML models. In [35], the authors conduct an extensive comparative analysis of supposed state-of-the-art word embedding models against a 'simple-word-embedding-model' (SWEM) and find that the SWEM model yields performance comparable or superior to the previously claimed (and more complicated) state-of-the-art models. In our financial analytics context, we have found the KISS principle useful in encouraging developers to try simple models first and to conduct an exhaustive comparison of models before advocating for specific methodologies. Recent benchmark pipelines like the GLUE and SQuAD benchmarks [30, 38] are potential ways to address the PEST antipattern.

Another frequent troubling trend in ML modeling is the failure to appropriately identify the source of performance gains in a modeling pipeline. As the peer-review process encourages technical novelty, research work quite often focuses on proposing empirically superior, complicated model architectures. Such empirical superiority is explained as a function of the novel architecture, while it is most often the case that the performance gains are in fact a function of clever problem formulations, data preprocessing, hyperparameter tuning, or the application of existing well-established methods to interesting new tasks, as detailed by [22]. Whenever possible, it is imperative that effective ablation studies highlighting the performance gains of each component of a newly proposed learning model be included as part of the empirical evaluation. There must also be a concerted effort to train and evaluate baselines and the proposed model(s) in comparable experimental settings. Finally, as noted in [22], if ablation studies are infeasible, quantifying the error behavior [19] and robustness [9] of the proposed model can also yield significant insights about model behavior.

Usually, modeling projects begin as curiosity-driven iterations to explore for potential traction. The measure of traction is calculated somewhat informally, without formal third-party review or validation. While not a problem at first, if the data science team continues this practice long enough, they build confidence in results that have never been validated and that cannot be compared against other methods. To avoid this antipattern, testing and evaluation data should be sampled independently and, for a robust performance analysis, should be kept hidden until model development is complete and used only for final evaluation.
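One simple way to enforce this discipline is to carve out the evaluation set once, fingerprint exactly what it contains, and treat it as a lockbox that is opened only for the final report. A minimal sketch follows, with hypothetical file names and synthetic data rather than our production setup.

```python
# Minimal sketch of a 'lockbox' test set: created once, fingerprinted, and
# never touched during model development (file names are hypothetical).
import hashlib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
idx_dev, idx_test = train_test_split(np.arange(len(y)), test_size=0.2, random_state=0)

np.save("dev_indices.npy", idx_dev)       # used freely for training / validation
np.save("lockbox_indices.npy", idx_test)  # opened only for the final evaluation

# Fingerprint the lockbox rows so any later tampering or silent regeneration
# of the split is detectable at audit time.
fingerprint = hashlib.sha256(X[idx_test].tobytes() + y[idx_test].tobytes()).hexdigest()
print("lockbox fingerprint:", fingerprint)
```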
In practice, it is not uncommon for model developers to have access to the final test set and, by repeatedly testing against it, modify their models to improve performance on this known test set. This practice, called HARKing (Hypothesizing After the Results are Known), has been detailed by Gencoglu et al. [13]; it leads to implicit data leakage. Cawley et al. [6] discuss the potential effects of not having a statistically 'pure' test set, such as over-fitting and selection bias in performance evaluation. The refactored solution here is not simple, but it is essential for effective governance and oversight. Data science teams must establish an independent 'Ground Truth system' with APIs to receive and catalog all forecasts and the data that were used to make them. This system can provide a reliable date stamp that accurately reflects when any data object or forecast was actually made or made available, and can help track independent third-party metrics that will stand up to audit.

A core assumption of machine learning (ML) pipelines is that the data generating process being sampled from (for training and when the model is deployed in production) generates samples that are independent and identically distributed (i.i.d.). ML pipelines predominantly adopt a 'set & forget' mentality toward model training and inference. However, it is quite often the case that the statistical properties of the target variable that the learning model is trying to predict change over time (concept drift [40]). Decision support systems governed by data-driven models are required to handle concept drift effectively and still yield accurate decisions. The primary technique for handling concept drift is learning rule sets using decision trees and similar interpretable tree-based approaches. Domingos et al. [10] proposed a model based on Hoeffding trees. Klinkenberg et al. [18] propose sliding-window and instance-weighting methods to keep the learning model consistent with the most recent (albeit drifted) data. Various other approaches based on rule sets and Bayesian modeling have been developed for detecting and adapting to concept drift; details can be found in [12, 23, 39]. An example of model drift adaptation can be seen in Chakraborty et al. [7] for forecasting protest events. This work provides a use case wherein changes in surrogates can be used to detect change points in the target series with lower delay than using the target's history alone.

Most ML pipelines are tuned to generate predictions, but little attention is paid to ensuring that the model can sufficiently communicate information about its own uncertainty. A well-calibrated model is one whose predicted probabilities accurately reflect the true likelihood of outcomes, as measured by, e.g., the Brier score. When poorly calibrated models are placed in production, it becomes difficult to introduce compensatory or remedial pipelines when something goes wrong. Characterizing model uncertainty is thus a paramount feature for large-scale deployment. Recent work [4] shows that, in addition to explainability, conveying uncertainty can be a significant contributor to ensuring trust in ML pipelines.
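As a quick pre-deployment sanity check, a classifier's calibration can be measured directly on held-out data. The sketch below uses scikit-learn's Brier score and reliability curve; it is a minimal illustration on synthetic data, not our production monitoring code.

```python
# Minimal calibration check on held-out data (scikit-learn).
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lower Brier score indicates better-calibrated probabilistic forecasts.
print("Brier score:", brier_score_loss(y_te, proba))

# Reliability curve: predicted probability vs. observed frequency per bin.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```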
The development of models using data that has been manually extracted and hygiened, without recording the extraction or hygiene steps, creates a massive data preparation challenge for later attempts to validate (or even deploy) ML models. This is often the result of 'sensitive' data that is selectively sanitized for the modelers by some third-party data steward organization that cannot adequately determine the risk associated with direct data access. The data preparation steps are effectively swept under the carpet and must be completely reinvented later, often with surprising impact on the models because the new pipeline ends up producing different data. The refactored solution here is to: (i) ensure that your enterprise sets up a professional data engineering practice that can quickly build and support new data pipelines that are governed and resilient; (ii) use assertions to track data as they move through the pipeline; and (iii) track the pedigree and lineage of all data products, including intermediaries. We have found graph databases to be ideal for maintaining linkages between data objects and the many assertions that must be tracked.

Machine learning (ML) models are usually evaluated with metrics (e.g., precision, recall, and confusion matrices in a classification setting) that are solely focused on characterizing the performance of the core learning model. However, production systems are often decision guidance systems with additional notification (e.g., a process that raises an alert when the core learning model yields a particular prediction) and intervention (e.g., a process that carries out an appropriate action based on the results of the notification system) layers built on top of the core learning layer. Fig. 5a showcases the outcomes in a traditional (ML-focused) evaluation pipeline wherein an ML model predicts a transaction to have a Favorable or Unfavorable outcome. Depending on the application, the definition of what is considered Favorable or Unfavorable may differ. For illustration, consider a fraud detection application, wherein an unfavorable outcome is a fraudulent transaction and a legitimate transaction is a favorable outcome. An ML model tasked with detecting fraudulent transactions would predict whether each transaction is Favorable or Unfavorable. In this context, Fig. 5a indicates that the deployed ML pipeline may enter four possible states during its operational life-cycle. However, Fig. 5b showcases a slightly more realistic ML pipeline wherein notification (send alerts) and intervention (take appropriate action) layers are added on top of the ML model decisions, to raise alerts or to intervene and arrest the progress of a potentially fraudulent transaction detected by the ML model. The addition of these alerting and notification mechanisms, which are imperative and ubiquitous in enterprise ML settings, increases the number of possible states the ML pipeline may enter during its operation. These new states create more nuanced situations, with dilemmas that are not highlighted by a simplistic evaluation approach like the one indicated in Fig. 5a. For example, if the ML model predicts a transaction to be fraudulent (i.e., unfavorable), the notification pipeline fails to notify the client of the model decision, and the transaction is in fact fraudulent, the resulting situation is fraught with ethical ramifications.
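The extended state space is easy to write down explicitly. The sketch below enumerates prediction/notification/outcome combinations and attaches penalties; the particular state labels and penalty values are illustrative assumptions, not calibrated costs from our application.

```python
# Enumerate the extended pipeline states (prediction x notification x true outcome)
# and attach penalties. Penalty values are illustrative assumptions.
from itertools import product

PREDICTIONS   = ["favorable", "unfavorable"]
NOTIFICATIONS = ["alert_sent", "no_alert"]
OUTCOMES      = ["legitimate", "fraudulent"]

def penalty(pred, notified, outcome):
    if outcome == "fraudulent" and pred == "unfavorable" and notified == "no_alert":
        return 100.0   # model caught the fraud but the client was never told
    if outcome == "fraudulent" and pred == "favorable":
        return 50.0    # missed fraud entirely
    if outcome == "legitimate" and notified == "alert_sent":
        return 5.0     # false alarm: nuisance cost
    return 0.0

for state in product(PREDICTIONS, NOTIFICATIONS, OUTCOMES):
    print(state, "penalty =", penalty(*state))
```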
This exhaustive state representation of the ML decision pipeline in Fig. 5c allows us to explicitly attach high penalties to such states, so that the ML, notification, and intervention models can be trained cognizant of these penalties, essentially allowing fine-grained control of the learning and decision process. A more rigorous approach is to use a reinforcement learning formulation to track decision making and actions as models are put in production.

How do we make use of these lessons learned and operationalize them in a production financial ML setting? Specific recommendations include: (1) Use the antipatterns presented here to document a model management process and avoid costly but routine mistakes in model development, deployment, and approval. (2) Use assertions to track data quality across the enterprise. This is crucial since ML models can be so dependent on faulty or noisy data, and suitable checks and balances can ensure a safe operating environment for ML algorithms. (3) Document data lineage along with transformations to support the creation of 'audit trails', so that models can be situated back in time and in specific data slices for re-training or re-tuning. (4) Use ensembles to maintain a palette of models, including remedial and compensatory pipelines in the event of errors, and track model histories through the lifecycle of an application. (5) Ensure human-in-the-loop operational capability at multiple levels, using the model for rethinking ML deployment presented in Section 4 as a basis to support interventions and communication opportunities.

Overall, the model development and management pipeline in our organization supports four classes of stakeholders: (i) the data steward (who holds custody of datasets and sets performance standards); (ii) the model developer (an ML person who designs algorithms); (iii) the model engineer (who places models in production and tracks performance); and (iv) the model certification authority (a group of professionals who ensure compliance with standards and risk levels). In particular, as ML models continue to make their way into more financial decision making systems, the model certification authority within the organization is crucial to ensuring regulatory compliance from performance, safety, and auditability perspectives. Bringing such multiple stakeholder groups together ensures a structured process wherein the benefits and risks of ML models are well documented and understood at all stages of development and deployment.

BNY Mellon is the corporate brand of The Bank of New York Mellon Corporation and may be used to reference the corporation as a whole and/or its various subsidiaries generally. This material does not constitute a recommendation by BNY Mellon of any kind. The information herein is not intended to provide tax, legal, investment, accounting, financial or other professional advice on any matter, and should not be used or relied upon as such. The views expressed within this material are those of the contributors and not necessarily those of BNY Mellon. BNY Mellon has not independently verified the information contained in this material and makes no representation as to the accuracy, completeness, timeliness, merchantability or fitness for a specific purpose of the information provided in this material. BNY Mellon assumes no direct or consequential liability for any errors in or reliance upon this material.
References
[1] Algorithms for hyper-parameter optimization.
[2] Random search for hyper-parameter optimization.
[3] Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures.
[4] Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty.
[5] AntiPatterns: Refactoring Software, Architectures, and Projects in Crisis.
[6] On over-fitting in model selection and subsequent selection bias in performance evaluation.
[7] Hierarchical quickest change detection via surrogates.
[8] SMOTE: Synthetic minority over-sampling technique.
[9] Are all languages equally hard to language-model?
[10] Mining high-speed data streams.
[11] Neural architecture search: A survey.
[12] A survey on concept drift adaptation.
[13] HARK side of deep learning: From grad student descent to automated machine learning.
[14] Deep reinforcement learning that matters.
[15] A practical guide to training restricted Boltzmann machines.
[16] You can't escape hyperparameters and latent variables: Machine learning as a software engineering enterprise.
[17] Hyperparameter tuning for big data using Bayesian optimisation.
[18] Learning drifting concepts: Example selection vs. example weighting.
[19] Scaling semantic parsers with on-the-fly ontology matching.
[20] An empirical evaluation of deep architectures on problems with many factors of variation.
[21] Efficient backprop.
[22] Troubling trends in machine learning scholarship.
[23] Learning under concept drift: A review.
[24] On the state of the art of evaluation in neural language models.
[25] Model cards for model reporting.
[26] On the importance of strong baselines in Bayesian deep learning.
[27] Bayesian optimization for iterative learning.
[28] Challenges in deploying machine learning: A survey of case studies.
[29] Tunability: Importance of hyperparameters of machine learning algorithms.
[30] SQuAD: 100,000+ questions for machine comprehension of text.
[31] 'Beating the news' with EMBERS: Forecasting civil unrest using open source indicators.
[32] Explaining the predictions of any classifier.
[33] Hazards of data leakage in machine learning: A study on classification of breast cancer using deep neural networks.
[34] Hidden technical debt in machine learning systems.
[35] Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms.
[36] Practical Bayesian optimization of machine learning algorithms.
[37] Hyperparameter importance across datasets.
[38] GLUE: A multi-task benchmark and analysis platform for natural language understanding.
[39] Characterizing concept drift.
[40] Learning in the presence of concept drift and hidden contexts.