key: cord-103788-sxw4l9tt authors: Talbot, Steven R.; Struve, Birgitta; Wassermann, Laura; Heider, Miriam; Weegh, Nora; Knape, Tilo; Hofmann, Martine C. J.; von Knethen, Andreas; Jirkof, Paulin; Keubler, Lydia; Bleich, André; Häger, Christine title: One score to rule them all: severity assessment in laboratory mice date: 2020-06-24 journal: bioRxiv DOI: 10.1101/2020.06.23.166801 sha: doc_id: 103788 cord_uid: sxw4l9tt

Animal welfare and the refinement of experimental procedures are fundamental aspects of biomedical research. They provide the basis for robust experimental designs and reproducibility of results. In many countries, the determination of welfare is a mandatory legal requirement and implies the assessment of the degree of severity that an animal experiences during an experiment. However, an effective severity assessment requires an objective and exact approach. In light of these demands, we have developed the Relative Severity Assessment (RELSA) score. This comprehensive composite score was established on the basis of physiological and behavioral data from a surgical mouse study. Body weight, the Mouse Grimace Scale score, burrowing behavior, and the telemetry-derived parameters heart rate, heart rate variability, temperature, and general activity were used to investigate how well each indicates severity during postoperative recovery. The RELSA scores not only revealed individual severity levels but also allowed a comparison of severity in distinct mouse models addressing colitis, sepsis, and restraint stress using a k-means clustering approach with the maximum achieved RELSA scores. We discriminated and classified data from sepsis nonsurvivors into the highest relative severity level. Data from mice after intraperitoneal transmitter implantation and from sepsis survivors were located in the next lower cluster, while data from mice subjected to colitis and restraint stress were placed in the lowest severity cluster.
Analysis of individual variables and their combinations revealed model- and time-dependent contributions to severity levels. In conclusion, we propose the RELSA score as a validated tool for objective real-time applicability in severity assessment and as a first step towards a unified and accessible risk assessment tool in biomedical research. As an effective severity assessment system, it will fundamentally improve animal welfare, as well as data quality and reproducibility.

Good science and high-quality data derived from animal experiments in basic and translational research require good animal welfare. Consequently, researchers are obligated to ensure the best possible welfare of research animals, in line with the refinement principle in the 3Rs [1,2]. Therefore, determination of the welfare of animals under scientific procedures is embedded in many international animal protection guidelines and acts, e.g., the Guide for the Care and Use of

assessment in a realistic scenario, a selection of the best performing variables for a given model is desirable. We can clearly show that some single variables or combinations outperform others in this study. However, this outperformance differs from day to day. Each variable can assume a state of 0 (not chosen) or 1 (chosen). Since there are eight variables, a total of 2^8 = 256 combinations were analyzed. We therefore tested the possible variable combinations within the pooled TM data and calculated the RELSA scores for each day. The RELSA scores were summed in a 30 by 256 matrix during the iteration steps. Finally, the individual sums were averaged using the total number of analyzed animals (n=13). The resulting RELSA performance score for each variable combination across postoperative days is shown in Fig. 5a.
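The exhaustive search over variable combinations can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the input layout (one dictionary of weights per animal) and the toy values are assumptions made for illustration, and the per-combination score is assumed to be the root mean square of the selected weights, as described in the Methods.

```python
from itertools import product
from math import sqrt

# The eight variable abbreviations used in the text.
VARIABLES = ["bwc", "mgs", "bur2h", "burON", "hr", "hrv", "temp", "act"]


def all_combinations(variables):
    """Enumerate every 0/1 (not chosen/chosen) assignment of the variables."""
    combos = []
    for mask in product((0, 1), repeat=len(variables)):
        combos.append(tuple(v for v, chosen in zip(variables, mask) if chosen))
    return combos


def rms(weights):
    """Root mean square of a list of weights; an empty selection scores 0."""
    if not weights:
        return 0.0
    return sqrt(sum(w * w for w in weights) / len(weights))


def mean_score_per_combination(animals, variables):
    """Average the per-animal score of each combination for one day.

    animals: list of dicts mapping variable name -> weight (hypothetical input).
    """
    scores = {}
    for combo in all_combinations(variables):
        per_animal = [rms([a[v] for v in combo if v in a]) for a in animals]
        scores[combo] = sum(per_animal) / len(per_animal)
    return scores


combos = all_combinations(VARIABLES)
print(len(combos))  # 2**8 = 256, including the empty selection

# Toy weights for two animals and two variables (illustrative values only).
toy = [{"bur2h": 1.0, "temp": 0.0}, {"bur2h": 0.5, "temp": 0.0}]
ranking = sorted(mean_score_per_combination(toy, ["bur2h", "temp"]).items(),
                 key=lambda item: item[1], reverse=True)
print(ranking[0])  # the best-scoring combination for the toy data
```

Repeating the averaging step for each of the 30 reporting days would give a days-by-combinations table comparable in shape to the 30 by 256 matrix mentioned above.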
On the post-op day, the best performing RELSA scores in the present data were bur2h (0.96), bur2h/act (0.93), bur2h/hrv (0.9), bur2h/hrv/act (0.9) and bur2h/hr (0.89). The five worst performers on post-op day 1 were burON/temp/mgs (0.49), bwc/temp/mgs (0.48), mgs (0.29), temp/mgs (0.25) and temp (0.1) (Table 2). Using only bur2h, the best performing variable on the post-op day, to display individual RELSA scores revealed a similar grading of the TM vs sham groups regarding the maximum values (Fig. 5b). This underscores the rationale for selecting the most informative parameters for a given model. However, this parameter alone does not reflect the welfare state over the whole time course of this study.

The clustering of RELSA_max scores reveals objective severity levels. In addition to the data for building the RELSA reference set from TM-implanted mice, we evaluated RELSA performance as a tool for severity comparisons between models by including data from three additional animal studies (colitis, stress, sepsis). All included studies recorded data for the following five variables: heart rate, heart rate variability, temperature, activity, and body weight. Each study was analyzed using the RELSA methodology and was therefore referenced against the data set from the TM-implanted mice. This way, the overall context allowed the comparison of studies in terms of general model severity. For this, we used the individual RELSA_max values as previously described to assess the maximum achieved severity for each animal in these studies. With these data, we used a k-means cluster analysis to segment the ordered univariate RELSA_max outputs into distinct clusters. We estimated the number of clusters heuristically as k=4 using scree analysis (Fig. 6a, b). The resulting borders of the clusters are shown as dashed lines in Fig. 6c.

We used data from mice suffering from colitis induced by dextran sulfate sodium (DSS), from colitis + stress, where the animals received DSS and were additionally subjected to immobilization stress on 10 consecutive days for 1 h per day, and from corresponding colitis control animals treated only with water. Furthermore, we used data from mice submitted to cecal ligation and puncture (CLP) surgery for sepsis induction and the corresponding sham-operated animals (CLP sham). Here, the data were divided into CLP survivors and nonsurvivors. The above-described cluster levels enabled a ranking of the respective animal models regarding the severity that was experienced. Cluster analysis revealed the highest severity level for CLP nonsurvivors, followed by a cluster of TM-implanted animals (which were the RELSA reference set) and CLP survivors. Lower severity clusters were formed by data from animals suffering from colitis and stress, colitis alone and CLP sham-operated animals. Data from colitis control animals were allocated to the lowest RELSA cluster (Fig. 6c, d). Furthermore, we investigated how stable the RELSA_max distributions were in terms of their mean values and cluster positions. Some studies or subgroups involved small sample sizes. Therefore, we applied 10000-fold bootstrapping to assess the 95% confidence intervals of the RELSA_max centroids. Except for the colitis + stress study, the confidence intervals remained within their relative k-means cluster levels. The confidence interval for the colitis control group did not overlap with any other higher-level confidence interval.

Model-specific parameter contributions to the general RELSA severity estimation. RELSA curves from the individual animals over time displayed the generalized biological variation that occurs during severity monitoring. Individual animals deviated from the group mean (Fig. 7a, b).
This enabled individual severity monitoring. We used radar charts to quantify the contribution of single parameters to the RELSA_max scores (Fig. 7c). For data from the TM-implanted animals (Fig. 7b), it became obvious that immediately after surgery, all variables except for temperature contributed to the overall detected severity. Over time, some parameters returned to their baseline positions, but hrv and act remained contributors to an elevated RELSA score. In the case of the CLP model, the RELSA score was dominated by the large differences in the temperature variable. However, the other parameters, except for body weight, also contributed to the overall RELSA score (Fig. S3). For the CLP study, the time variable is hours and not days (Fig. S4a, b). Therefore, the body weight variable was not flexible enough to indicate rapid impairment of the animals in a manner similar to, e.g., the temperature. Interestingly, in CLP sham animals (Fig. S4c), activity was the strongest contributor, but temperature and heart rate also contributed to the RELSA score. In animals suffering from colitis with (Fig. S4d) and without stress (Fig. S4e), activity was the dominating variable over the first days, but on day 7, body weight became more relevant. As expected, radar charts with data from colitis control mice showed no relevant changes within any of the observed variables (Fig. S4f).

Evidence-based severity assessment is increasingly becoming indispensable in animal research. From a researcher's point of view, it enables the best possible monitoring of the welfare state. From an ethical point of view, it is the prerequisite for a refinement of experimental procedures leading to a minimal burden for animals and, at the same time, provides a basis for high-quality data. From a legal point of view, ensuring animal welfare and severity assessment is mandatory in many countries, e.g., in all EU member states [4].
The large number and diversity of animal models and the lack of validated methods hinder clear definitions of severity categories [16]. This has multiple consequences, ranging from legal uncertainties for scientists and authorities to a potential bias in rating the prospective severity of the animals in their studies.

We have developed a tool that enables evidence-based severity assessment. With the algorithm presented, an arbitrary number of outcome variables can be used to compute a composite score for welfare assessment and severity grading [17-19]. To our knowledge, this is the first attempt in preclinical science to combine phenotypical data using matrices of standardized differences to weigh variable contributions as a means for obtaining a measure for relative severity grades. This contrasts with current standards using human judgment to generate numerical scores for assessing welfare. Using the approach presented, we have also shown that variables differ in performance and sensitivity, which strengthens the concept of a multimodal severity assessment. Finally, the RELSA algorithm enabled the quantitative comparison of distinct animal models with regard to severity levels, which leads to the speculation that it will do so in human patients as well.

When developing RELSA, we aimed at a quantitative grading of severity, whereas the methods at hand are characterized by qualitative scoring. The principle of composite scoring is based on systems utilized for clinical monitoring and risk assessment in human medicine. One example is the Acute Physiology And Chronic Health Evaluation (APACHE II) score, which was first reported in 1985. The APACHE II score comprises 12 physiological and laboratory parameters with an additional weighting for age and preadmission health status to predict the risk of death [20,21].
In contrast, the Sequential Organ Failure Assessment (SOFA) score, which was established in 1996, consists of 6 different scores assessing distinct organ dysfunction and failure [22,23]. The score describes the status of morbidity and critical illness but does not predict the outcome. Currently, the SOFA score is being used in the severity assessment of COVID-19 patients to characterize mortality among intensive care unit (ICU) patients [24].

In veterinary medicine and laboratory animal science, there are various composite scores available, e.g., the clinical severity index for acute pancreatitis in canines [25], composite behavior scores for pain assessment in rodents [26,27], or composite measure schemes for rat epilepsy models [28]

To create a more generalized severity assessment score, we used a system that can potentially combine any measurement or variable from the clinical and behavioral examination. This takes into account the multidimensional nature of severity, reflecting not only pain and distress but also affective emotional states. Therefore, the chosen parameters for severity assessment should be multimodal [30]. This concept is supported by growing evidence in the literature. In a study assessing severity during a chronic pancreatitis model, it was shown that the combination of multiple variables improved the sensitivity of read-out parameters [14]. In the present study, we used a comprehensive panel of methods to monitor the welfare of animals after various experimental procedures, with TM implantation as a use case. To exclude selection bias, we calculated the models' severity level with a full set of available variables: body weight change, burrowing behavior, MGS score and telemetry-derived parameters including hr, hrv and temperature.
These parameters were selected based on increasing evidence of their suitability in various model systems as well as several round-table discussions within our German Research Foundation (DFG)-funded research consortium 2591, which focuses on severity assessment in animal-based research (www.severity-assessment.de) [7,8,31].

We observed that even though some variables showed high sensitivity towards the implantation procedure, they only showed strong changes over a short time frame. The most prominent example here is the bur2h variable. Burrowing is a highly motivated behavior of mice and is known to be impaired under painful conditions or in mouse models of anxiety and schizophrenia [32,33]. In this study, burrowing was highly sensitive in detecting changes in welfare but only immediately after TM implantation. Likewise, bwc sensitively indicated the impact of TM surgery but quickly recovered within 4 to 6 days after the operation. Body weight is considered one of the most critical parameters in classic clinical scoring in rodents [12]. However, monitoring body weight as a severity assessment parameter was shown to be model-specific, and body weight should be used in combination with other parameters [12]. Similarly, using mgs, only short-term effects were detected within 180 minutes after implantation (not shown). On a daily scale, the mgs variable played no role in indicating severity.

In contrast, the telemetry-derived parameters hr and hrv showed strong changes on the post-op day but also indicated a longer-lasting impact on animals, suggesting an extended recovery period (up to day 14). Telemetry is a frequently used method in biomedical research. It has been shown that hr and hrv are parameters indicating distress and pain [34,35], and hr and body temperature serve as critical parameters in sepsis studies [36].
This leads to the assumption that the various parameters reflected different facets of severity (e.g., pain) better than others or that the animals did not experience the particular facets after a while. However, this question remains open, and the results of the present study underscore the need for a combination of parameters, including physiological parameters, to fully assess the severity situation.

variance, the resulting RELSA_max values can be used, e.g., in animal model comparisons (Figs. 3 and 4). Comparing the RELSA_max values revealed that TM implantation exhibited higher severity than sham operations. However, sham operation also shows some level of severity. The small peak in the RELSA score on day 14 after TM implantation (Fig. 3a, b) demonstrates how sensitive the algorithm is towards value changes. Here, the RELSA score was not zero like the rest of the variables, but it was slightly elevated due to some minor variation in the hr variable (RELSA_hr,14 = 0.021 (SD 0.06)). If there are changes in the measured values, the RELSA score will adequately reflect this. An overall effect of the chosen analgesics was not observed, leaving the search for an ideal treatment for future studies.

To validate the RELSA algorithm, we used data from models with different forms and grades of impairments. An acute DSS-colitis model, an acute DSS-colitis model combined with repeated restraint stress and a CLP sepsis model were assessed. Fig. 4c shows that the RELSA_max scores remained within the moderate frame of the 4 k-means cluster levels and did not exceed the RELSA level of 1, with the exception of the CLP nonsurvivors. The colitis RELSA_max values reliably clustered in level 2, indicating a lower severity for the DSS-colitis model compared to the TM-implantation study. However, in the colitis study, 9 animals had to be euthanized because the humane endpoint (a maximum of 20% weight loss) was reached. This endpoint had been set to ensure that animals experience a maximum of moderate severity levels according to the project authorization. Although the RELSA values indicated increased suffering, they also imply that the animals may have been euthanized too early, challenging the use of a 20% loss of body weight as an objective endpoint to ensure moderate severity levels. Even though the humane endpoint for a single variable was reached, the remaining variables did not support a general increase in overall suffering in relation to the reference set.

Data from the CLP study revealed very high RELSA scores for the animals that did not survive the procedure (RELSA_max ≥ 2.60) and lower values for the surviving and sham animals (RELSA_max < 1). The main factor responsible for the high scores was a large decrease in temperature, but hrv and act also indicated increases in severity. Here, more than one variable points towards increased suffering and therefore to an increased impairment in well-being.

RELSA enables scientists to quantify severity. In addition, it can be used to classify animals and models in qualitative frameworks, e.g., mild, moderate, and severe. For qualitative grading, data from a predefined reference set are needed, and subsequently, the severity context can be extrapolated. Trivially, the extrema (min/max) of each variable serve as ranges for the given severity context. One caveat is that researchers must provide some sort of estimation about the quality of severity for the reference set, a step that involves human judgment. However, once defined, a new experiment can be used for severity quantification with regard to the reference set. This concept is new to the field and allows an evidence-based comparison of models within actual statutory provisions and guidelines.
In addition to providing context, the reference set has another purpose: it regularizes the possible ranges of the input variables. This can prove essential, as variables behave differently when animals are negatively affected. For example, a loss of 17% in body weight is generally recognized as a threat to animal health [12]. At the same time, burrowing behavior may drop to zero. In this case, a difference of 17% in one variable is equivalent to 100% in the other variable. For an optimal representation of this bias, we calculated individual RELSA weights (R_w) as effect sizes for each variable and day, which were then used in the final score calculation.

Individual variables contribute to the final RELSA score as RELSA weights (R_w). These weights can be considered a special form of effect size that is somewhat related to Glass' Δ [37]. The difference is that the R_w values are not standardized to the standard deviation in the control group but rather to the difference of the respective variable to its maximum deviation in the reference set. This approach allows an estimation of within-animal effect sizes and measurements of a particular variable's importance.

For the generalization of the weights in a final score, we concluded that variables with larger deviations should have more impact, while smaller deviations mostly represent noise and effects that are less prominent within a cohort. In statistics, this idea is captured by the root mean square (RMS) concept, e.g., in error and regression analysis. In contrast to a pure sum score, the RMS has the advantage that it directly translates to the scale of the individual weights and is considered to be more accurate in showing the best fit.

Another important issue is the sampling and measurement frequency.
Body weight is recorded at fixed times (e.g., once per day in the morning) and burrowing behavior after a certain interval (e.g., after 2 h or overnight). The sampling rates in these cases are a) not equal and b) not frequent enough to catch minute-by-minute changes. Transient changes in some variables thus appear as "all-or-nothing" parameters. They change much faster than the sampling rates so that the exact development over time cannot be seen. Although the sampling rate cannot be corrected with RELSA, the skewness in distribution can be adjusted to a certain degree by including extreme values of a reference model with known severity into the calculation. To be comparable, the RELSA algorithm requires the same reporting frame (e.g., day) in all input variables even though this can mean that the integration times are different (e.g., bur2h).

RELSA was designed to assess the multidimensional severity an animal experiences under impaired welfare conditions using multivariate data. The combination of objective variables into a composite score has the advantage of unbiased severity assessment without the need for interpretation or analysis. We have shown that such a composite model can be built, tested and validated. In the future, a comparison of more animal models will lead to a severity map that can then be used to obtain a better understanding of the multivariate severity context. This will not only make severity assessment much clearer but also enable the ranking of animal models with regard to their impairment of welfare. Finally, this may also reveal more generalized or specific variables for

preoperative 5 mg/kg carprofen s.c. and postoperative 2.5 mg/kg s.c. every 12 h until day 3. The mice that underwent additional colitis or stress induction were treated using the metamizole analgesia regimen.

In the CLP study, mice aged 12 to 14 weeks were anesthetized via s.c. injection of 120 mg/kg in 10 ml/kg ketamine (Ketaset®, Zoetis Deutschland GmbH, Berlin, Germany) and 8 mg/kg in 10 ml/kg xylazine (Rompun®, Bayer Vital GmbH, Leverkusen, Germany). Perioperative management was the same as described above. The blood pressure catheter was placed in the left carotid artery and positioned so that the gel-filled sensing region of the catheter was approximately 2 mm in the aortic arch. The telemetry transmitter device body was placed along the lateral flank between the forelimb and hindlimb, close to the back midline. Biopotential ECG leads were tunneled subcutaneously to achieve positioning analogous to lead II in human ECG.

Burrowing behavior. One week before intraperitoneal transmitter implantation or the corresponding sham surgery, the mice were housed pairwise in type II macrolon cages filled with aspen bedding material (AsBewood GmbH, Buxtehude, Germany) and two compressed cotton nesting pads (AsBewood GmbH, Buxtehude, Germany). On days five and four before surgery, the burrowing apparatus was provided to the animals to train burrowing behavior [39]. Baseline measurements were taken on days two and one before surgery. A 250-ml plastic bottle with a length of 15 cm, a diameter of 5.5 cm and a port diameter of 4 cm was used as a burrowing apparatus. It was filled with 140 g ± 1.5 g of the standard diet pellets of the mice (Altromin1324, Lage, Germany).

For burrowing testing after surgeries (1st, 2nd, 3rd, 5th and 7th night after surgery), mice were single housed in a type II macrolon cage with autoclaved hardwood shavings. The burrowing bottles were placed in the left corner. In the right corner, half of the used nesting material from the home cage was provided as a shelter. The tests started three hours before the dark phase, and after two hours, the content of the burrowing bottles was weighed (bur2h).
The bottles containing the remaining pellets were placed back into the cages and weighed again the next morning (burON).

AstraZeneca GmbH, Wedel, Germany) was used. The depth of anesthesia was checked by means of the corneal and eyelid reflex. During the entire period of anesthesia, the mice were on a heating pad at 37.0 ± 1.0°C. The abdominal cavity was aseptically opened via a midline laparotomy incision of approximately 3 cm, and the cecum was exposed. Subsequently, the cecum was 2/3 ligated (Nylon Monofilament Suture 6/0, Fine Science Tools GmbH, Heidelberg, Germany) distal to the ileocecal valve, while care was taken that the intestinal continuity was maintained. The exposed cecum was punctured twice, "through-and-through", with a 21-gauge needle. Next, sufficient pressure was applied to the cecum to extrude fecal material from each puncture site (~1 mm). The cecum was returned to the abdominal cavity and placed in the upper central abdomen. Following this procedure, the peritoneum was closed with three knot fissures with nonresorbable sterile suture material (Nylon Monofilament Suture 7/0, Fine Science Tools GmbH, Heidelberg, Germany), and the upper skin layer was stapled with sterile clips (Michel Suture Clips 7.5 x 1.75 mm, Fine Science Tools GmbH, Heidelberg, Germany). For the mice undergoing a sham laparotomy, the same procedure was performed without CLP. After fully recovering from the anesthesia, the mice were put back into their home cage, after which the continuous data acquisition of all physiological parameters began immediately. The mice received 0.1 mg/kg buprenorphine s.c. three hours after surgery and subsequently every 8 h for the rest of the experiment. At the end of the experiments, mice were anaesthetized deeply with isoflurane and killed by cervical dislocation.

Colitis induction and restraint stress. After intraperitoneal transmitter implantation and 28 days of postoperative recovery, the female C57BL6/J mice were exposed to 0% (control; receiving water only) or 1% DSS (colitis; mol wt 36000-50000; MP Biomedicals, Eschwege, Germany) in drinking water for 5 consecutive days to induce intestinal inflammation. The mice were weighed daily, and the telemetry-derived parameters hr, hrv, activity, and temperature were recorded. A third group of mice was subjected to restraint stress (colitis + stress) in addition to DSS treatment. The mice were inserted into restraint tubes on 10 consecutive days (d1-d10) for 60 minutes (from 09:00 to 10:00 am). The restraint tubes (23-mm internal diameter, 93-mm length) consisted of clear acrylic glass with ventilation holes (8 mm diameter) and a 7-mm-wide opening spanning the whole length of the upper side of the tube. The ends of the tube were sealed on one side by a piece of acrylic glass with a slot for the mouse tail and on the other end by a solid plastic ring that screwed into place. The mice were able to rotate around their axis but could not move horizontally.

Data characterization. Before analysis, the data were brought into the tabular format required for RELSA analysis (Table S5). Eight variables were used in the calculations: body weight change (bwc), Mouse Grimace Scale (mgs), 2 h of burrowing started 3 h before the dark phase (bur2h), burrowing overnight (burON), heart rate (hr), heart rate variability (hrv), body temperature (temp) and activity (act). For each variable, the data were pooled, and the effective ranges were determined (Table 1). Furthermore, Cohen's d was calculated for each variable and each day, i.e., post-op (0), 1, 4 and 7, to compare the resulting RELSA scores with an independent measurement of effect.

Principal component analysis (PCA). PCA was conducted using the factoextra package in R.
PCA requires complete data; the present data were therefore limited to the following days: baseline, post-op and postoperative days 1 to 4. For PCA calculations, all variables were scaled and centered. The principal components of the first two dimensions for all respective days were plotted, as well as the factor loadings and variable contributions.

Relative Severity Assessment (RELSA) score calculation. The principal methodology of the RELSA calculation is depicted in Fig. 3. Quantitative input data were normalized to the range [0; 100]% with 100% as starting values (based on physiological or baseline conditions, e.g., on pre-op day (-1); Table S5).

The RELSA methodology requires a reference set. If this set has a qualitative severity attribute, the calculated scores will be in reference to that category. According to Annex VIII of the EU directive, surgical interventions under general anesthesia, such as the TM implantations or sham surgery, are categorized as "moderate" in terms of severity. Thus, the RELSA reference set quantitatively reflected this category. It uses the respective extrema of the monitored variables, thereby establishing the context within the referential severity category. For each time point (t), data differences from the normalized baseline for each contributing variable (i) were calculated. To establish the severity context, the differences were divided by the normalized maximum-reached differences in the respective variables of the reference set to yield weights (R_w, see formula 1). For this measure, absolute differences were used. Each R_w is an expression of the similarity of an actual data point to the maximum-reached value observed in the reference set at any observed time point. This step also regularized differences in variable contributions at any given level of severity so that different scales do not skew the results.
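The weighting step, together with the root-mean-square aggregation used for the final score, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names and toy numbers are hypothetical, impairment is assumed to show as a drop below the 100% baseline (variables where an increase signals impairment would need to be inverted first), and the final score is read as the square root of the mean squared weight over the available variables.

```python
from math import sqrt


def relsa_weights(sample, reference_max_diff, baseline=100.0):
    """Return the weight R_w per variable for one animal at one time point.

    sample: dict of variable -> value normalized to baseline (%), None if missing.
    reference_max_diff: dict of variable -> largest deviation from baseline (%)
        observed in the reference set.
    Assumption: impairment is a drop below baseline; values at or above
    baseline receive a weight of zero.
    """
    weights = {}
    for var, value in sample.items():
        if value is None:
            continue  # missing variables do not contribute
        diff = max(baseline - value, 0.0)
        weights[var] = diff / abs(reference_max_diff[var])
    return weights


def relsa_score(weights):
    """Root mean square of the available weights (assumed reading of formula 2)."""
    if not weights:
        return None  # no data, no score
    n = len(weights)
    return sqrt(sum(w * w for w in weights.values()) / n)


# Toy reference extrema and one toy observation (illustrative numbers only).
reference_max_diff = {"bwc": 17.0, "bur2h": 100.0, "temp": 5.0, "hr": 20.0}
sample = {"bwc": 91.5, "bur2h": 0.0, "temp": 101.0, "hr": None}

w = relsa_weights(sample, reference_max_diff)
print(w)
print(relsa_score(w))
```

In this sketch, an animal whose variables all sit at the reference extrema gets every weight equal to 1 and therefore a score of 1, and values beyond the reference extrema push the score above 1.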
To give larger differences more weight, the final RELSA score was calculated as the root mean square (RMS) of the available R_w, i.e., the square root of the sum of the squared R_w divided by the number of contributing variables (N) (see formula 2). Missing variables did not contribute to the RELSA score, whereas values equal to or above baseline level contributed with values of zero. Furthermore, levels of severity in the reference data were calculated using a k-means algorithm 13. The number of clusters was determined heuristically with a scree plot (Fig. 4a). A RELSA score of 1 means that all contributing variables for a test animal reached the same values as the largest observed deviations in the reference set with the defined level of severity (here, "moderate").

References
Happy animals make good science
The principles of humane experimental technique
Guide for the Care and Use of Laboratory Animals: Eighth Edition
on the protection of animals used for scientific purposes
Operational Details of the Five Domains Model and Its Key Applications to the Assessment and Management of
Impulse for animal welfare outside the experiment
Assessing Affective State in Laboratory Rodents to Promote Animal Welfare-What Is the Progress in Applied Refinement Research? Animals: an open access journal from MDPI 9
How can we assess their suffering? German research consortium aims at defining a severity assessment framework for laboratory animals
Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research
Reproducibility: Respect your cells
Reproducibility in science: improving the standard for basic and preclinical research
Defining body-weight reduction as a humane endpoint: a critical appraisal
Running in the wheel: Defining individual severity levels in mice
A novel multi-parametric analysis of non-invasive methods to assess animal distress during chronic pancreatitis
Grading Distress of Different Animal Models for Gastrointestinal Diseases Based on
Scientific assessment of animal welfare
Assessment of unnecessary suffering in animals by veterinary experts
Guidelines on severity assessment and classification of genetically altered mouse and rat lines
Classification and reporting of severity experienced by animals used in scientific procedures: FELASA/ECLAM/ESLAV Working Group report
APACHE-acute physiology and chronic health evaluation: a physiologically based classification system
APACHE II: a severity of disease classification system
The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. On behalf of the Working Group on Sepsis-Related Problems of the European Society of Intensive Care Medicine
Use of the SOFA score to assess the incidence of organ dysfunction/failure in intensive care units: results of a multicenter, prospective study
Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
Development of a clinical severity index for dogs with acute pancreatitis
Automated analysis of postoperative behaviour: assessment of HomeCageScan as a novel method to rapidly identify pain and analgesic effects in mice
Comparative effects of vasectomy surgery and buprenorphine treatment on faecal corticosterone concentrations and behaviour assessed by manual and automated analysis methods in C57 and C3H mice
Design of composite measure schemes for comparative severity assessment in animal-based neuroscience research: A case study focussed on rat epilepsy models
Sickness Behavior Score Is Associated with Neuroinflammation and Late Behavioral Changes in Polymicrobial Sepsis Animal Model. Inflammation
Where are we heading? Challenges in evidence-based severity assessment
Where are we heading? Challenges in evidence-based severity assessment
Burrowing and nest building behavior as indicators of well-being in mice
Do GluA1 knockout mice exhibit behavioral abnormalities relevant to the negative or cognitive symptoms of schizophrenia and schizoaffective disorder?
Assessment of post-laparotomy pain in laboratory mice by telemetric recording of heart rate and heart rate variability
Implantation of radiotelemetry transmitters yielding data on ECG, heart rate, core body temperature and activity in free-moving laboratory mice
Use of Biotelemetry to Define Physiology-Based Deterioration Thresholds in a Murine Cecal Ligation and Puncture Model of Sepsis
Meta-analysis and the integration of research in special education
FELASA recommendations for the health monitoring of mouse, rat, hamster, guinea pig and rabbit colonies in breeding and experimental units
Burrowing in rodents: a sensitive method for detecting behavioral dysfunction
Improvement of the Mouse Grimace Scale set-up for implementing a semi-automated Mouse Grimace Scale scoring (Part 1)
Semi-automated generation of pictures for the Mouse Grimace Scale: A multi-laboratory analysis (Part 2)
Coding of facial expressions of pain in the laboratory mouse

Statistics. Data distributions were tested against the hypothesis of normality using the Shapiro-Wilk test. Where normality was rejected, nonparametric methods were used: the Kruskal-Wallis test for group comparisons and the Mann-Whitney U-test for pairwise tests. Parametric analyses were performed using analysis of variance (ANOVA) or single t-tests (with Welch's correction in case of unequal variance). For multiple comparisons, the resulting p-values were adjusted using the Bonferroni correction. The RELSA_max values and cluster centroids were bootstrapped 10000-fold to yield mean values as well as 95% bias-corrected and accelerated (BCa) confidence intervals. With either method, the resulting p-values were considered to be significant at the following levels: 0.05 (*), 0.01 (**), 0.001 (***) and 0.0001 (****).

Software and Packages. The algorithm was developed in R (version 3.6.0).
In addition to RELSA, the following packages were used for analysis: ggplot2, factoextra, effsize, plyr, and boot. Radar charts were realized using the fmsb package. The RELSA algorithm and the raw data are available as an R package with full documentation on GitHub: https://github.com/mytalbot/relsa.

Test data require the same variables that are included in the reference set. Both data sources are normalized to their baseline values, followed by the calculation of the individual RELSA weights (R_w) as standardized effect sizes with regard to the maximal observed changes in the reference set. The final RELSA score is calculated as a root mean square from the available R_w. The RELSA score can be calculated for single values and for multiple time points.
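The calculation summarized above can be illustrated with a minimal sketch. The Python snippet below is not the authors' implementation (that is the R package referenced above); it is an independent illustration in which all function names, variable names, and example numbers are hypothetical. It assumes that every variable is already normalized to a 100% baseline and oriented so that a decrease indicates a worse condition, and that the reference set is represented only by its per-variable extrema.

```python
import math

BASELINE = 100.0  # normalized baseline in percent


def relsa_weight(value, ref_extreme, baseline=BASELINE):
    """Weight R_w for one variable at one time point (hypothetical helper).

    value:       normalized test value in percent, or None if missing
    ref_extreme: extremum of this variable in the normalized reference set
    """
    if value is None:
        return None  # missing variables do not contribute
    if value >= baseline:
        return 0.0  # values at or above baseline contribute zero
    max_dev = abs(baseline - ref_extreme)
    if max_dev == 0:
        raise ValueError("reference extremum equals baseline")
    return abs(baseline - value) / max_dev


def relsa_score(values, ref_extremes, baseline=BASELINE):
    """RMS of the available weights; N counts only non-missing variables."""
    weights = [
        relsa_weight(values.get(name), extreme, baseline)
        for name, extreme in ref_extremes.items()
    ]
    available = [w for w in weights if w is not None]
    if not available:
        return None  # no usable variable at this time point
    return math.sqrt(sum(w * w for w in available) / len(available))


# Hypothetical reference extrema (percent of baseline), not data from the study
reference = {"bw": 80.0, "burrowing": 20.0, "hr": 90.0}
# Hypothetical test animal at one time point; heart rate is missing
animal = {"bw": 90.0, "burrowing": 60.0, "hr": None}

print(relsa_score(animal, reference))  # 0.5
```

In this example both available variables have deviated half as far as the reference extrema, so the score is 0.5. An animal that matches every reference extremum obtains a score of 1, which corresponds to the interpretation of a RELSA score of 1 given in the methods.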