key: cord-102850-0kiypige
authors: Huang, C.-C.; Lai, J.; Cho, D.-Y.; Yu, J.
title: A Machine Learning Study to Improve Surgical Case Duration Prediction
date: 2020-06-12
journal: nan
DOI: 10.1101/2020.06.10.20127910
sha: 
doc_id: 102850
cord_uid: 0kiypige

Accurate prediction of surgical case duration plays a critical role in reducing the cost of operating room (OR) utilization. The most common approaches used by hospitals rely on historic averages for a specific surgeon or a specific procedure type obtained from electronic medical record (EMR) scheduling systems. However, the low predictive accuracy of EMR-based averages has negative impacts on patients and hospitals, such as rescheduling and cancellation of surgeries. In this study, we aim to improve prediction of operation case duration with advanced machine learning (ML) algorithms. We obtained a large data set containing 170,748 operation cases (from Jan 2017 to Dec 2019) from a hospital. The data covered a broad variety of details on patients, operations, specialties and surgical teams. A more recent data set with 8,672 cases (from Mar to Apr 2020) was also available for external evaluation. We computed surgeon-specific and procedure-specific historic averages from the EMR and used them as baseline models for comparison. Subsequently, we developed our models using linear regression, random forest and extreme gradient boosting (XGB) algorithms. All models were evaluated with R-square (R^2), mean absolute error (MAE), and the percentages of overage (case duration > prediction + 10 % & 15 mins), underage (case duration < prediction - 10 % & 15 mins) and within (otherwise). The XGB model was superior to the other models, with a higher R^2 (85 %) and percentage within (48 %) as well as a lower MAE (30.2 mins). The total prediction errors computed for all the models showed that the XGB model had the lowest inaccurate percentage (23.7 %). As a whole, this study applied ML techniques in the field of OR scheduling to reduce the medical and financial burden for healthcare management. It revealed the importance of operation and surgeon factors in operation case duration prediction.
This study also demonstrated the importance of performing an external evaluation to better validate performance of ML models.

It has become increasingly important for clinics and hospitals to manage resources for critical care during the COVID-19 pandemic. Statistics show that approximately 60 % of patients admitted to the hospital will need to be treated in the operating room (OR) [11], and the average cost of the OR is up to 2,190 dollars per hour in the United States [1, 6]. Hence, the OR is considered one of the highest hospital revenue generators and accounts for as much as 42 % of a hospital's revenue [6, 10]. Based on these statistics, good OR scheduling and management is not only critical to patients who are in need of elective, urgent and emergent operations, but is also important for surgical teams to be prepared. Owing to the importance of the OR, improving OR efficiency has high priority, so that the cost and time spent on the OR are minimized while OR utilization is maximized to increase the number of surgical cases and patient access [15].

In a healthcare system, numerous factors affect OR efficiency, for example patient expectation and satisfaction, interactions between different professional specialties, unpredictability during operations and surgical case scheduling, among others [20]. Although the OR process is complex and involves multiple parties, one way to enhance OR efficiency is to increase the accuracy of predicted surgical case duration. Over- or under-utilization of OR time often leads to undesirable consequences such as idle time, overtime, and cancellation or rescheduling of surgeries, which may have a negative impact on patients, staff and the hospital [21]. In contrast, high efficiency in OR scheduling not only contributes to better arrangement of operating room usage and resources, but can also lead to cost reduction and increased revenue since more surgeries can be performed.
Currently, most hospitals schedule surgical case duration using estimates from surgeons and/or averages of historical case durations, and studies show that both of these methods have limited accuracy [14, 17]. For case lengths estimated by surgeons, factors such as patient conditions and anesthetic issues might not be taken into consideration. Moreover, underestimation of case duration often occurs because surgeon estimates are usually made with a bias towards maximizing block scheduling to account for potential cancellations and cost reduction. Furthermore, operations with higher uncertainty and unexpected findings during the operation add difficulty to case length estimation [14]. Historic averages of case duration for a specific surgeon or a specific type of operation obtained from electronic medical record (EMR) scheduling systems have also been used in hospitals. However, these methods have been shown to produce low accuracy due to large variability and the lack of the same combination in the preoperative data available for the case being performed [25].

In order to improve predictability, researchers have utilized linear statistical models, such as regression, or simulation for surgical duration prediction and for evaluating the importance of input variables [8, 12, 13]. However, a common shortcoming of these studies is that relatively few input variables or features were used in their models, due to the limitation of statistical techniques in handling too many input variables.

Similarly, we combined categories of primary surgeon's ID, specialty, anesthesia type and room number that had fewer than 50 cases into the category 'Others'.
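The grouping of rare categories described above can be sketched as follows. This is a minimal illustration, not the authors' implementation (the study itself used R); the function names, the surgeon IDs and the toy counts are hypothetical, and only the threshold of 50 cases and the label 'Others' come from the text.

```python
from collections import Counter

MIN_CASES = 50  # threshold stated in the text


def fit_frequent_levels(values, min_cases=MIN_CASES):
    """Return the set of category levels seen at least `min_cases` times."""
    counts = Counter(values)
    return {level for level, n in counts.items() if n >= min_cases}


def collapse_rare(values, frequent_levels, other_label="Others"):
    """Replace any level outside `frequent_levels` with `other_label`."""
    return [v if v in frequent_levels else other_label for v in values]


# Toy data with hypothetical surgeon IDs: "S1" has 60 cases, "S2" has 3.
train_ids = ["S1"] * 60 + ["S2"] * 3
levels = fit_frequent_levels(train_ids)
train_grouped = collapse_rare(train_ids, levels)

# A level never seen in training ("S9") is mapped to 'Others' as well.
new_grouped = collapse_rare(["S1", "S9"], levels)
```

Learning the frequent levels on the training data only, and reusing them at prediction time, is one way to make sure that unseen levels fall into the same 'Others' bucket.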

In addition, since operation case duration can be related to the performance of surgeons, and surgeons' performance is affected by their working time, we also analysed the surgical minutes performed by the same primary surgeon on the same day as well as within the last 7 days. The number of urgent and emergent operations performed by the same surgeon prior to the case in question was also included in the analysis.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted June 12, 2020.

Figure 1. The workflow of model training for this study. The data set used for model training falls within the time range of Jan 1, 2017 to Dec 31, 2019. From this data set, about 17 % of the cases were excluded based on the following criteria: patients with two or more surgical procedures performed at the same time, emergent and urgent cases, surgeons aged under 28, patients aged younger than 20, pregnant patients, procedure duration longer than 10 hours or shorter than 10 minutes, and cases with missing values. The total number of cases included in the data set for model building was 142,448. This data set was then split into training (80 %) and validation (20 %) subsets for model development. Machine learning and linear regression models were developed on the training data set and validated on the validation data set using R-square and mean absolute error. The percentage of cases whose actual duration fell within 10 % and 15 minutes of the predicted procedure duration was also computed. Finally, the models were further evaluated on the most recent surgical cases (from Mar 1 to Apr 30, 2020), which were not included in the original data set for model training.
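The exclusion criteria listed in the Figure 1 caption can be expressed as a simple filter. The sketch below is illustrative only: the field names and the `Case` record are hypothetical, since the source does not give the column names of the data set, and the boundary handling (for example, whether exactly 10 minutes is kept) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    # All field names are hypothetical; None stands for a missing value.
    n_procedures: Optional[int]
    urgency: Optional[str]  # "elective", "urgent" or "emergent"
    surgeon_age: Optional[int]
    patient_age: Optional[int]
    pregnant: Optional[bool]
    duration_min: Optional[float]


def is_included(c: Case) -> bool:
    """Apply the exclusion criteria from the Figure 1 caption to one case."""
    values = (c.n_procedures, c.urgency, c.surgeon_age,
              c.patient_age, c.pregnant, c.duration_min)
    if any(v is None for v in values):
        return False  # cases with missing values are excluded
    return (
        c.n_procedures < 2                 # single procedure only
        and c.urgency == "elective"        # no urgent or emergent cases
        and c.surgeon_age >= 28            # surgeons under 28 excluded
        and c.patient_age >= 20            # patients younger than 20 excluded
        and not c.pregnant                 # pregnant patients excluded
        and 10 <= c.duration_min <= 600    # 10 minutes to 10 hours
    )


cases = [
    Case(1, "elective", 45, 60, False, 95.0),
    Case(1, "urgent", 45, 60, False, 95.0),
    Case(1, "elective", 45, None, False, 95.0),
]
kept = [c for c in cases if is_included(c)]
```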

Together, 24 predictor variables were included for predictive model building in this study. These predictors can be categorised into 5 groups: patient, surgical team, operation, facility and primary surgeon's prior events (see Table 1).
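As an illustration of the "primary surgeon's prior events" group, the following sketch computes, for each case, the surgical minutes the same surgeon had already performed earlier on the same day and during the preceding 7 days. It rests on assumptions not stated in the source: the dictionary keys are hypothetical, "last 7 days" is read as the 7 calendar days before the case date (excluding the same day), and only cases that started earlier are counted.

```python
from datetime import datetime, timedelta


def prior_workload(cases):
    """Return (same_day_minutes, last7_minutes) for each case, in input order.

    Each case is a dict with hypothetical keys 'surgeon', 'start' (datetime)
    and 'minutes'. The nested loop is quadratic; it is kept for clarity only.
    """
    out = []
    for c in cases:
        same_day = 0.0
        last7 = 0.0
        for p in cases:
            # Count only earlier cases by the same surgeon.
            if p["surgeon"] != c["surgeon"] or p["start"] >= c["start"]:
                continue
            gap = c["start"].date() - p["start"].date()
            if gap == timedelta(0):
                same_day += p["minutes"]
            elif gap <= timedelta(days=7):
                last7 += p["minutes"]
        out.append((same_day, last7))
    return out


toy_cases = [
    {"surgeon": "A", "start": datetime(2019, 5, 1, 8, 0), "minutes": 120},
    {"surgeon": "A", "start": datetime(2019, 5, 6, 8, 0), "minutes": 90},
    {"surgeon": "A", "start": datetime(2019, 5, 6, 13, 0), "minutes": 60},
    {"surgeon": "B", "start": datetime(2019, 5, 6, 9, 0), "minutes": 200},
]
workload = prior_workload(toy_cases)
```

For the third toy case, the surgeon has already operated for 90 minutes that morning and for 120 minutes five days earlier, so the two features are 90 and 120.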

Model development and training

We applied multiple ML methods for operation case duration prediction. Operation case duration (in minutes) is the total period from the time the patient enters the OR to the time the patient exits the OR. Surgeon-specific and procedure-specific historic averages of case durations from EMR systems were used as baseline models for comparison. We first performed multivariate linear regression (Reg) to predict operation case duration. However, the distribution of operation case duration was observed to be skewed to the right (Fig. 2). We therefore performed a logarithmic transformation on operation case duration to reduce the skewness. The model built from log-transformed multivariate linear regression (logReg) outperformed Reg in all evaluation indexes. Subsequent ML algorithms were also trained using the log-transformed case duration as the target.
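The idea of regressing on the log-transformed duration and converting predictions back to minutes can be sketched with a single predictor. This is a simplified, assumed setup: the study used multivariate regression in R with 24 predictors, whereas the toy feature `x`, the helper `fit_simple_ols` and the synthetic durations below are purely illustrative.

```python
import math


def fit_simple_ols(x, y):
    """Ordinary least squares for y = b0 + b1 * x with one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic, right-skewed durations (minutes) that are exactly log-linear in x.
x = [1, 2, 3, 4, 5]
durations = [30 * math.exp(0.4 * xi) for xi in x]

# Fit on the log scale, as in the logReg model.
log_durations = [math.log(d) for d in durations]
b0, b1 = fit_simple_ols(x, log_durations)


def predict_minutes(xi):
    """Back-transform the log-scale prediction to minutes."""
    return math.exp(b0 + b1 * xi)
```

Because the fit is done on the log scale, errors are penalised in relative rather than absolute terms, which is one reason such a transform can help with a long right tail.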

The first ML algorithm that we tested was random forest (RF), a tree-based supervised learning algorithm. RF uses bootstrap aggregation, or bagging, for regression by constructing a multitude of decision trees based on the training data and outputting the mean predicted value from the individual trees [19]. Bagging is less prone to over-fitting; in other words, it reduces the variance without increasing the bias. Tree-based techniques were suitable for our data since the data include a large number of categorical variables, e.g. ICD code and procedure type, most of which were sparse. The number of trees was set to 50 in this study. Extreme Gradient Boosting (XGB) is the other supervised ML algorithm that was tested for comparison with RF. XGB has recently gained popularity within the data science community due to its ability to overcome the curse of dimensionality as well as to capture the interaction of variables [18].
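The bagging step described above (fit each tree on a bootstrap sample, then average the tree predictions) can be sketched in a few lines. This is plain bagging with one-split regression trees on a single toy feature; it is not the randomForest package and it omits the random feature selection that a full random forest adds at each split. All names and data are hypothetical; only the choice of 50 trees comes from the text.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single numeric feature."""
    best = None
    # The largest value is excluded so that both sides of a split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        m_left, m_right = mean(left), mean(right)
        sse = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, m_left, m_right)
    if best is None:  # all x identical: predict the overall mean
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, m_left, m_right = stump
    return m_left if x <= threshold else m_right


def fit_bagged(xs, ys, n_trees=50, seed=0):
    """Fit `n_trees` stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of the individual trees."""
    return mean(predict_stump(s, x) for s in stumps)


# Toy data: short cases for small x, long cases for large x (minutes).
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [30, 32, 31, 29, 120, 118, 122, 121]
model = fit_bagged(xs, ys)
pred_short = predict_bagged(model, 2)
pred_long = predict_bagged(model, 12)
```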

XGB is also a decision tree-based algorithm but is more computationally efficient for real-time implementation than RF. The XGB and RF algorithms differ in the way the trees are built. It has been shown that XGB performs better than RF if parameters are tuned carefully; otherwise it is more likely to over-fit if the data are noisy [3, 9]. We adopted a 5-fold cross-validation strategy to tune out the best

A data-splitting strategy was used in the training of all the models to prevent over-fitting. We randomly separated the data into training and testing subsets at a ratio of 4:1. The training data were used to build the different predictive models as well as to extract important predictor variables. The testing data were used for internal evaluation of the models. In addition to internal evaluation, external evaluation of all the models was performed using data from Mar 1 to Apr 30, 2020. The surgeon- or procedure-specific averages calculated from EMR were also evaluated on the same internal and external testing sets to ensure fair and uniform comparison across all models. Data processing and cleaning as well as model development in this study were performed using R software. The packages "xgboost" and "randomForest" were used to implement the XGB and RF algorithms in R [4, 5].

R^2 is the coefficient of determination; it represents the proportion of the variance in the actual case duration that is explained by the predictor variables in our models.

Mean Absolute Error (MAE) measures the average of the absolute errors between the actual case durations and the predicted case durations.
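For reference, the two metrics can be computed as in the sketch below, using the standard textbook definitions (R^2 as one minus the ratio of residual to total sum of squares). The source does not state which exact formula or R function was used, so this is an assumption, and the toy durations are made up.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted durations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy case durations in minutes.
actual = [60, 90, 120, 150]
predicted = [70, 80, 130, 140]

r2 = r_squared(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```

In this toy example every prediction is off by 10 minutes, so the MAE is 10 minutes, while R^2 additionally depends on how spread out the actual durations are.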

On the other hand, the average model based on a specific procedure had lower percentages of underage and overage compared to the surgeon-specific model. These differences were due to the extensive procedure classification in the procedure-specific model. However, the percentage underage was still quite high. Since no information other than the durations of past operation cases is taken into consideration in the average model, prediction bias and low accuracy usually result from it.

We first fitted the Reg model by including all the input variables shown in Table 1.

A model is considered to be over-fitting when its performance is good on the training set but poor on the testing set.

When we log-transformed operation case duration and re-ran the regression model (i.e. logReg), the performance of the logReg model improved and outperformed the Reg model. [12, 23]. Again, in the logReg model, the results of all the evaluation metrics were close for the training, internal testing and external testing sets, so the model was not over-fitting. Although the performance of the logReg model was not bad, an assumption of linear

Performance of the XGB model was better than that of the RF model on the training set but did not improve much compared to the RF model on the internal and external testing sets.

Since XGB was more computationally efficient than RF, the XGB model was chosen as the best model and was used in subsequent analysis.

In addition to the three key metrics, we studied the inaccuracy of the different models using the external testing set. We calculated the total prediction error (in minutes) and the corresponding inaccurate percentage for all the models. The results are reported in

In Fig. 3, we plotted scatter plots of actual versus predicted duration on the external testing set for the surgeon-specific and procedure-specific average models and the XGB model. A straight line indicating the theoretical perfect relationship, i.e. predicted and actual procedure durations are identical, was added as a reference in each scatter plot.

The data points of the XGB model were aligned closer to the straight line. Therefore, the XGB model showed a higher correlation between predicted and actual duration compared to the two average models. Fig. 4 shows the density plot of differences between actual and predicted case durations for the two average models and the XGB model. It clearly demonstrates that the error distribution of the XGB model was narrower and closer to 0. As a result, the XGB model is more accurate than the other models in predicting operation case duration.

Figure 4. Density plot of differences between the actual operation case durations and predicted case durations. The density plot obtained from the XGB model (light blue color) was narrower and centered more at 0 than those obtained from the average models (pink and cyan colors). In the average models, previous operation case durations, averaged either for a specific surgeon (cyan color) or for a specific procedure (pink color), were used as predictions.

how important the variable is in making a branch of a decision tree purer [5, 22]. A higher WFG percentage indicates that the variable is more important. The top 15 important variables are shown in Table 4. One thing worth noting is that 3 of the top 4 important variables are attributed to operation information. Moreover, three of the features which we computed from surgeons' data (i.e. total surgical minutes performed by the surgeon within the last 7 days and on the same day, and number of

Accurate prediction of operation case duration is vital in elevating OR efficiency and reducing cost. This study not only helps to improve the accuracy of OR case prediction, but also has novelty in the following aspects. First, the data set used in this study contained more than 140,000 cases and more than 400 different types of surgical procedures, which sets a new benchmark in terms of volume and diversity. The maximal number of cases used in other studies was in the range of 40,000 to 60,000 [2, 21].

Second, OR events were modeled as dependent events instead of independent ones. To this end, we extracted some additional information from surgeons' data, e.g. previous

April 2020 as external testing data for model evaluation. Fourth, though urgent and emergent surgeries were excluded from the data, the number of urgent and emergent operations performed by the same surgeon prior to the case in question was included as an input variable to account for its effect on operation case duration.

Currently, surgical cases at CMUH are scheduled according to estimates made by primary surgeons. However, surgeon estimates rely heavily on the prior experience of the surgeons, and many unexpected factors will not be taken into consideration.

Since there is no formal record of surgeon estimates, we used averages calculated for a specific surgeon or procedure type on the testing set as our baseline models.

The performance of these two average models, as reported in Table 2, clearly showed that these models were poor in predicting operation case duration. They also tended to under-predict operation case duration according to their scatter plots of actual versus predicted duration and density plots of the differences between actual and predicted durations (see Fig. 3 and 4). When 24 feature variables (Table 1) were included in our model development, R^2, MAE, and the percentages of underage, overage and within improved greatly compared to the baseline models. We applied 15 minutes as the tolerance threshold for percentage underage, overage and within because ± 15 minutes is an acceptable range at CMUH for a booking to be considered accurate. To avoid having too stringent a standard and to better compare our outcomes with other studies [2, 24], a tolerance threshold of 10 % was also applied.
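One way to read the combined tolerance is sketched below. It assumes that a case counts as overage (or underage) only when the actual duration exceeds (or falls short of) the prediction by more than both 10 % of the prediction and 15 minutes, so the effective tolerance is the larger of the two. This interpretation and the function name `classify` are assumptions; the source states the thresholds but not the exact rule for combining them.

```python
def classify(actual, predicted, pct=0.10, minutes=15.0):
    """Label one case as 'overage', 'underage' or 'within'.

    Assumed rule: the tolerance is the larger of `pct` of the prediction
    and `minutes`, so both thresholds must be exceeded to leave 'within'.
    """
    tolerance = max(pct * predicted, minutes)
    if actual > predicted + tolerance:
        return "overage"
    if actual < predicted - tolerance:
        return "underage"
    return "within"


def percentages(actuals, predictions):
    """Share of cases (in percent) falling into each of the three labels."""
    labels = [classify(a, p) for a, p in zip(actuals, predictions)]
    return {name: 100.0 * labels.count(name) / len(labels)
            for name in ("overage", "underage", "within")}


# Toy cases (minutes): for a 100-minute prediction the tolerance is 15 minutes,
# for a 300-minute prediction it is 30 minutes.
shares = percentages([112, 116, 84, 320], [100, 100, 100, 300])
```

Under this reading, the 10 % rule relaxes the 15-minute rule for long cases, which matches the stated aim of avoiding too stringent a standard.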

By using regression and ML approaches, we were able to decrease the total prediction error (Table 3) of operation case durations at CMUH. Among all the models, the performance of the XGB model was considered the best because it was more computationally efficient and had the lowest inaccuracy. Moreover, even though the evaluation metrics of the RF model were similar to those of the XGB model, the XGB model was still able to reduce the total prediction error from 223,686 to 218,415 minutes. In other words, the XGB model was able to save more than 5,000 minutes of idle or delay time compared with the RF model. Since most ORs usually have multiple cases scheduled per day, the total prediction error represents the cumulative effect over all OR cases in the 2-month period of Mar to April 2020. This cumulative effect may eventually translate into a significant financial advantage by allowing an additional operation case to be scheduled [7]. This would also lead to a significant cost reduction and increase in revenue because ORs are utilized appropriately and efficiently.
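The arithmetic behind the comparison above, together with one possible definition of the two quantities, is sketched here. The definitions are assumptions: the total prediction error is taken as the sum of absolute differences between actual and predicted durations, and the inaccurate percentage as that sum divided by the total actual minutes. The source reports the totals but does not spell out the formulas; only the two reported totals below come from the text.

```python
def total_prediction_error(actual, predicted):
    """Assumed definition: sum of absolute errors, in minutes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def inaccuracy_percent(actual, predicted):
    """Assumed definition: total error as a percentage of total actual minutes."""
    return 100.0 * total_prediction_error(actual, predicted) / sum(actual)


# Toy example (minutes).
toy_actual = [100, 200, 120]
toy_predicted = [90, 220, 130]
toy_total = total_prediction_error(toy_actual, toy_predicted)
toy_pct = inaccuracy_percent(toy_actual, toy_predicted)

# Totals reported in the text for the external testing set.
rf_total_minutes = 223_686
xgb_total_minutes = 218_415
saved_minutes = rf_total_minutes - xgb_total_minutes
```

The difference between the two reported totals is 5,271 minutes, consistent with the statement that the XGB model saves more than 5,000 minutes relative to the RF model.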

It has been reported in past studies that primary surgeons contributed the largest variability in operation case duration prediction compared to other factors attributed to patients [2, 16, 23]. These studies provide evidence and rationale that more factors relating to the primary surgeon should be added as input variables in the training of ML models. Moreover, extensive feature engineering usually improves the quality of an ML model, independently of the modeling technique itself. As a result, in addition to the primary surgeon's identifier, gender and age, we computed the previous working time and number of previous surgeries performed by the same primary surgeon within the last 7 days and on the same day. We also counted the number of urgent and emergent operations performed by the same primary surgeon prior to the case in question. These variables extracted from the primary surgeon's data were significantly (p < 0.05) correlated with operation case duration (see Table 5 in Appendix). The correlation coefficients of these variables also revealed that the duration of an operation case performed by a primary surgeon may decrease as he or she becomes more familiar with the surgical procedure but may increase if his or her total surgical minutes are too long. Although performing a surgery multiple times on different patients may help a primary surgeon to be more efficient in his or her next operation, long working time may also lead to lethargy and affect the primary surgeon's performance.

In the methodology of data processing, for predictor variables which contained a lot of categories, we grouped categories that had fewer than 50 cases into a category named 'Others'. In addition to reducing data dimensionality for categorical features, this may aid the generalization of our model: the model will still be able to predict case duration even for operations that are rare. Moreover, our model can be applied to new primary surgeons, who were not included in the training set during model development, by setting their ID to 'Others' for case duration prediction. However, there is still a need to update our model after a while, for example, when the number of operation cases performed by a new primary surgeon has increased beyond a certain number. In terms of timing, we recommend updating the model annually using operation cases performed in the most recent 3 years as training data.

One limitation of this study is that we selected only predictor variables which could be extracted from preoperative data. Our ML model still needs to be improved in order to predict surgical case duration dynamically. For example, blood loss during an operation may affect case duration, as an unexpected increase in blood loss may cause surgeons to take longer to complete the surgery. Therefore, it would be better if intra-operative data were incorporated during ML model development so that the prediction made by the ML model could be updated during the operation. One common issue in all ML studies predicting operation case duration, including ours, is that the ML models were developed using data from a single site. These ML models have difficulty generalizing, since surgical teams, facilities and patient populations differ across entities. A model has to be custom-made for a given organization using training data containing its patients, procedures, surgeons, medical staff and the facility itself. As a result, the exact same ML model is not meant to, and will not, perform well when applied to another organization or hospital. Another interesting issue in applying ML or artificial intelligence to operation estimation is that medical technologies evolve fast. Hence, how frequently an ML or artificial intelligence model needs to be updated still remains to be answered.

The XGB model was superior in predictive performance compared to the average, Reg and logReg models. The total inaccuracy of the predicted outcomes of the XGB model was the lowest among all the models developed in this study. Although the performance of the RF model was close to that of the XGB model, the XGB model was more computationally efficient than the RF model in that it took a shorter time to complete the training process. The coefficient of determination (R^2) of the XGB model built in this study was higher, and its percentages of under- and over-prediction were lower, than in other ML studies [2, 21, 24]. Moreover, this model improves on the current OR scheduling method, which is based on estimates made by surgeons at CMUH.

We propose extracting additional information from operation and surgeons' data to be used as predictor variables for ML algorithm training, since their importance was high in the XGB model. Moreover, we validated the model types using an external testing set in addition to the internal testing set split from the original data used in model training. This helped us to validate and test the models in a more stringent and rigorous way. Therefore, we suggest that external evaluation should be used as a tool to better validate the predictive power of ML models in the future.

Appendix

Table 5. Correlation coefficient, standard error, t-value and p-value of predictor variables extracted from primary surgeons' data. This information was obtained from the log-transformed multivariate regression (logReg) model.


Optimization and planning of operating theatre activities: An original definition of pathways and process modeling

Improving Operating Room Efficiency: Machine Learning Approach to Predict Case-Time Duration

A Comparative Analysis of XGBoost

Package 'randomForest': Breiman and Cutler's Random Forests for Classification and Regression

Package 'xgboost': Extreme Gradient Boosting

Understanding costs of care in the operating room

Decrease in Case Duration Required to Complete an Additional Case During Regularly Scheduled Hours in an Operating Room Suite

Predicting the unpredictable: A new prediction model for operating room times using individual characteristics and the surgeon's estimate

Greedy function approximation: A gradient boosting machine

Factors that influence the expected length of operation: Results of a prospective study

Surgical unit time utilization review: Resource utilization and management implications

Surgical Duration Estimation via Data Mining and Predictive Modeling: A Case Study

Use of simulation to assess a statistically driven surgical scheduling system

Improving predictions of pediatric surgical durations with supervised learning

The Surgical Scheduling Problem: Current Research and Future Opportunities

Tree Boosting With XGBoost - Why Does XGBoost Win "Every" Machine Learning Competition?

Newer classification and regression tree techniques: Bagging and random forests for ecological prediction

Operating room efficiency

Improved Prediction of Procedure Duration for Elective Surgery

Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry

Surgeon and type of anesthesia predict variability in surgical procedure times

A Machine Learning Approach to Predicting Case Duration for Robot-Assisted Surgery

Relying solely on historical surgical times to estimate accurately future surgical times is unlikely to reduce the average length of time cases finish late

The authors would like to thank Shu-Cheng Liu, Jhao-Yu Huang and Min-Hsuan Lu for providing feedback during the progress of this study.