title: Chest X-ray Analysis With Deep Learning-Based Software as a Triage Test for Pulmonary Tuberculosis: An Individual Patient Data Meta-Analysis of Diagnostic Accuracy
authors: Tavaziva, Gamuchirai; Harris, Miriam; Abidi, Syed K; Geric, Coralie; Breuninger, Marianne; Dheda, Keertan; Esmail, Aliasgar; Muyoyeta, Monde; Reither, Klaus; Majidulla, Arman; Khan, Aamir J; Campbell, Jonathon R; David, Pierre-Marie; Denkinger, Claudia; Miller, Cecily; Nathavitharana, Ruvandhi; Pai, Madhukar; Benedetti, Andrea; Ahmad Khan, Faiz
journal: Clin Infect Dis
date: 2021-07-21
DOI: 10.1093/cid/ciab639

BACKGROUND: Automated radiologic analysis using computer-aided detection software (CAD) could facilitate chest X-ray (CXR) use in tuberculosis diagnosis. There is little to no evidence on the accuracy of commercially available deep learning-based CAD in different populations, including patients with smear-negative tuberculosis and people living with human immunodeficiency virus (HIV; PLWH). METHODS: We collected CXRs and individual patient data (IPD) from studies evaluating CAD in patients self-referring for tuberculosis symptoms, with culture or nucleic acid amplification testing as the reference. We reanalyzed CXRs with three CAD programs (CAD4TB version (v) 6, Lunit v3.1.0.0, and qXR v2). We estimated sensitivity and specificity within each study and pooled them using IPD meta-analysis. We used multivariable meta-regression to identify characteristics modifying accuracy. RESULTS: We included CXRs and IPD of 3727/3967 participants from 4/7 eligible studies. 17% (621/3727) were PLWH. 17% (645/3727) had microbiologically confirmed tuberculosis. Despite using the same threshold score for classifying CXR in every study, sensitivity and specificity varied from study to study.
The software had similar unadjusted accuracy (at 90% pooled sensitivity, pooled specificities were: CAD4TBv6, 56.9% [95% confidence interval {CI}: 51.7–61.9]; Lunit, 54.1% [95% CI: 44.6–63.3]; qXRv2, 60.5% [95% CI: 51.7–68.6]). Adjusted absolute differences in pooled sensitivity between PLWH and HIV-uninfected participants were: CAD4TBv6, −13.4% [−21.1, −6.9]; Lunit, +2.2% [−3.6, +6.3]; qXRv2, −13.4% [−21.5, −6.6]; between smear-negative and smear-positive tuberculosis they were: CAD4TBv6, −12.3% [−19.5, −6.1]; Lunit, −17.2% [−24.6, −10.5]; qXRv2, −16.6% [−24.4, −9.9]. Accuracy was similar to human readers. CONCLUSIONS: For CAD CXR analysis to be implemented as a high-sensitivity tuberculosis rule-out test, users will need threshold scores identified from their own patient populations and stratified by HIV and smear status.

The evidence base remains small for deep learning-based CAD, an artificial intelligence method that is highly effective for image recognition [4]. Given the limited evidence, and given that most users will have no field experience with this novel technology, it is important to know whether accuracy varies between populations and by patient characteristics. There are no published data on whether human immunodeficiency virus (HIV) infection affects the accuracy of deep learning-based CXR analysis, and effects of other patient characteristics on sensitivity and specificity were reported in only 1 study [6]. We performed an individual patient data (IPD) meta-analysis to address gaps in the evidence base on the diagnostic accuracy of CXR analysis with CAD for detecting tuberculosis. We focused on the use of CXR to evaluate individuals self-referring for symptoms of tuberculosis. In this context, chest radiography functions as a triage test: when the CXR is abnormal, sputum microbiologic tests are required to diagnose active pulmonary tuberculosis, whereas a normal CXR is sufficient to rule out active disease [7]. Our reporting follows PRISMA-IPD recommendations [8].
We sought to estimate the diagnostic accuracy of CXR analyzed by deep learning-based, commercially available CAD software for the detection of culture- or nucleic acid amplification test (NAAT)-confirmed pulmonary tuberculosis in symptomatic, self-referred individuals. Our secondary objective was to identify patient characteristics that modify diagnostic accuracy. Eligible studies were identified through published systematic reviews [4, 5]. We added 1 study prior to its publication, through referral by its principal investigator (author F. A. K.) [6]. Eligible studies consecutively enrolled individuals self-referring for medical care due to symptoms of pulmonary tuberculosis, estimated the diagnostic accuracy of any commercially available CAD program for the detection of pulmonary tuberculosis, and used either NAAT or mycobacterial culture as the reference test. For eligible studies to be included, investigators had to share de-identified clinical data and digital CXR images. Exclusion criteria are in the Supplementary materials (page 1). Investigators provided data on age, sex, HIV status, prior tuberculosis history, smear status, and results of culture and/or NAAT, as well as CXR DICOM files. Data management is described in the Supplementary materials (page 1). We included participants who had available CXRs and conclusive microbiological results. We excluded participants without CXR images, those whose CXR could not be analyzed by all 3 CAD programs, and those with growth of nontuberculous mycobacteria in culture. One reviewer (F. A. K.) performed a quality assessment of each study by adapting the QUADAS-2 tool [9]. Investigators had local approvals to share data and CXR images. The IPD meta-analysis was approved by the Research Ethics Board of the McGill University Health Centre. We analyzed each CXR with 3 commercially available deep learning-based CAD programs: CAD4TB version 6 (Delft, Netherlands), Lunit INSIGHT version 3.1.0.0
(Lunit, South Korea), and qXR version 2 (qure.ai, India). Each software package was installed and run at the Research Institute of the McGill University Health Centre. After analyzing a CXR image, each software outputs an abnormality score (CAD4TB and Lunit, on a scale of 0 to 100; qXRv2, 0.00 to 1.00). A threshold score is selected for categorization: if the abnormality score is below the threshold, the CXR is classified as sufficient to rule out pulmonary tuberculosis; otherwise, the CXR is categorized as consistent with pulmonary tuberculosis. Sensitivity and specificity thus depend on the threshold score. We classified participants as having pulmonary tuberculosis if at least 1 sputum specimen demonstrated Mycobacterium tuberculosis in culture or NAAT (Xpert MTB/RIF). Among participants not meeting criteria for pulmonary tuberculosis, those with at least 1 sputum specimen negative by culture or NAAT were categorized as not having pulmonary tuberculosis. We classified sputum specimens that grew exclusively nontuberculous mycobacteria as indeterminate. We generated within-study and pooled receiver operating characteristic (ROC) curves and estimated the area under the ROC curve (AUC). To estimate pooled AUC, we used 1-step parametric linear mixed effects meta-analysis [10, 11], specifying common random intercepts. We used 3 approaches to select threshold scores for estimating sensitivity and specificity. From an implementation perspective, it would be much easier if software came with a recommended threshold score for universal application. Lunit and qXRv2 come with such developer-recommended threshold scores, whereas CAD4TBv6 does not [12]. To estimate sensitivity and specificity using a universal threshold, we applied: (1) for Lunit and qXRv2, developer-recommended threshold scores; (2) for all 3 software, the threshold scores needed to reach a pooled sensitivity of 90%, which we refer to as "meta-analysis-derived threshold scores."
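The threshold-based classification described above, and the sensitivity and specificity that follow from it, can be sketched in a few lines. This is a minimal illustration in Python (the study's own analyses were done in SAS and R); the scores, labels, and threshold below are made up for demonstration, not taken from the study data:

```python
def classify(score, threshold):
    """Classify a CXR abnormality score against a threshold.

    Scores below the threshold rule out pulmonary tuberculosis;
    scores at or above it are read as consistent with tuberculosis.
    """
    return "consistent_with_tb" if score >= threshold else "tb_ruled_out"


def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity of thresholded abnormality scores.

    labels: 1 = microbiologically confirmed tuberculosis, 0 = not tuberculosis.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

Raising the threshold reclassifies borderline CXRs as normal, trading sensitivity for specificity; this is why each choice of threshold yields a different operating point on the ROC curve.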
Our third approach to threshold selection was an alternative to using universal threshold scores: (3) the use of threshold scores tailored to each site, which we refer to as "study-specific threshold scores." We identified study-specific threshold scores by using each study's ROC curve to find the score with a sensitivity of 90% (the minimum recommended by WHO for a tuberculosis triage test) [13]. We first estimated, for each study separately, sensitivity and specificity using developer-recommended, meta-analysis-derived, and study-specific threshold scores. We used forest plots to investigate between-study heterogeneity. Next, using 2-step bivariate random-effects meta-analysis [14, 15], we estimated pooled sensitivity and pooled specificity for developer-recommended and meta-analysis-derived threshold scores. We did not estimate pooled sensitivity and specificity using study-specific threshold scores. We estimated pooled negative and positive likelihood ratios, across a range of threshold scores, using a bivariate modelling approach [14]. In addition to estimating unadjusted accuracy, we estimated sensitivity and specificity within predefined subgroups of sex, HIV status, sputum smear status, prior tuberculosis history, and age (details in Supplementary materials, page 1). We first identified associations in univariable analyses, within each study and pooling data across studies. To determine whether associations remained after adjusting for covariates, we performed generalized linear IPD multivariable meta-regression. In these models, parameter estimates are the absolute difference in sensitivity, or specificity, between subgroups, adjusted for the other variables in the model. We judged absolute differences as statistically significant if 95% confidence intervals (95% CI) excluded 0.
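Finding a study-specific threshold amounts to walking down the candidate thresholds until sensitivity first reaches the 90% target; choosing the largest such threshold preserves as much specificity as possible. A minimal sketch, assuming scores and labels for a single study (illustrative data only; the study itself did this via ROC curve analysis in SAS/R):

```python
def threshold_for_target_sensitivity(scores, labels, target=0.90):
    """Largest threshold whose within-study sensitivity is >= target.

    Candidate thresholds are the observed scores themselves. Sensitivity
    rises as the threshold falls, so the first (largest) threshold that
    reaches the target also maximizes specificity at that sensitivity.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    n_pos = len(positives)
    for t in sorted(set(scores), reverse=True):
        sens = sum(1 for s in positives if s >= t) / n_pos
        if sens >= target:
            return t  # first threshold reaching the target sensitivity
    return None  # no threshold reaches the target
```

As the paper notes for its study-specific approach, when no threshold hits exactly 90%, the selected score is the one whose sensitivity first exceeds it.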
We estimated diagnostic outcomes at varying prevalence of tuberculosis (5%, 17%, 20%) using the meta-analysis-derived threshold scores in hypothetical cohorts of 1000 patients undergoing CXR analysis with these software, stratified by the characteristics that were associated with sensitivity. In a post hoc analysis, we sought to compare the accuracy of CXR analysis by CAD to interpretation by human readers. For these analyses, we used the categorical CXR readings by human readers provided in the original study data to classify images as Abnormal (if any abnormality was present) or Normal; we chose this classification as it is known to maximize the sensitivity of human-read CXR for tuberculosis (details in Supplementary materials, page 2) [16]. To compare the accuracy of CAD software to human CXR interpretation, we visualized the sensitivity and specificity of human readers on plots of the ROC curves of the 3 software. In another post hoc analysis, we repeated our search of the literature on 24 April 2020 to identify potentially eligible studies published since our original search. Statistical analyses were performed with SAS software (version 9.4, SAS Institute, Cary, North Carolina, USA) [17] and R statistical software (RStudio, version 1.2.5033) [18], using the packages diagmeta [11] and mada [14]. The funding source had no input on the design, conduct, analysis, or reporting. The selection of studies for inclusion from the previously published systematic review [4] into the IPD meta-analysis is depicted in Supplementary Figure 1 (Supplementary materials, page 3). Of 54 full-text articles reviewed, 7 met inclusion criteria, of which 3 were excluded because IPD could not be obtained. The IPD meta-analysis thus includes 4 studies [6, 19–22].
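The hypothetical-cohort projections described in the Methods above follow directly from sensitivity, specificity, and prevalence. A minimal sketch (the sensitivity/specificity values in the test below are illustrative round numbers, not the study's pooled estimates):

```python
def cohort_outcomes(n, prevalence, sensitivity, specificity):
    """Expected counts in a cohort of n patients triaged by CXR.

    Returns expected true/false positives and negatives, plus the
    negative and positive predictive values implied by them.
    """
    tb = n * prevalence                 # expected patients with tuberculosis
    no_tb = n - tb
    tp = sensitivity * tb               # detected, referred for sputum testing
    fn = tb - tp                        # missed by the triage test
    tn = specificity * no_tb            # correctly ruled out
    fp = no_tb - tn                     # unnecessarily referred for testing
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp,
            "NPV": tn / (tn + fn), "PPV": tp / (tp + fp)}
```

This makes explicit why a triage test's value shifts with prevalence: at 5% prevalence the same sensitivity and specificity yield far fewer missed cases per 1000 but proportionally more unnecessary confirmatory tests per detected case than at 20%.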
Of 3967 participants for whom IPD were provided, we included 3727/3967 (94%) and excluded 240 (6%) for the following reasons (Supplementary Table 1, Supplementary materials, page 4): 20 were missing CXRs; 20 had CXRs that could not be analyzed; 140 had nontuberculous mycobacteria in sputum culture; and 60 were without reference standard results. Risks of bias in the QUADAS-2 patient selection and flow and timing domains were low for all studies; in the reference standard domain, risk was low in 3/4 studies and unclear in 1/4 (Supplementary Figure 2, Supplementary materials, page 5). Applicability concerns were low for 4/4. The included studies were undertaken in Pakistan [6], South Africa [21, 22], Tanzania [19], and Zambia [20] (Table 1). Age, sex, prior tuberculosis, and smear data were unavailable for 1 study [21, 22]. Women accounted for 47% (1583/3352) of participants, and PLWH for 17% (621/3695). NAAT- or culture-confirmed tuberculosis was diagnosed in 17% (645/3727). Smear-positive disease accounted for 73% (417/573) of confirmed tuberculosis. The software had similar pooled AUC estimates with overlapping confidence intervals: CAD4TBv6, 0.83 (95% CI: .82–.84); Lunit, 0.83 (95% CI: .79–.86); qXRv2, 0.85 (95% CI: .83–.88). For each software, when the same threshold score was applied in every study, sensitivity and specificity varied from study to study. This between-study variability was observed when using developer-recommended threshold scores (Figure 1A) and also with meta-analysis-derived threshold scores (Figure 1B). When study-specific threshold scores were applied, between-study variability in specificity persisted (Figure 1C). The threshold score needed to achieve a sensitivity of 90% varied between studies.
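The pooled AUCs above were estimated with a 1-step parametric mixed-effects model, which is beyond a short sketch; for a single study, however, the empirical (nonparametric) AUC has a simple interpretation worth making concrete: the probability that a randomly chosen tuberculosis-positive CXR scores higher than a randomly chosen negative one, with ties counting half. A minimal sketch with made-up scores:

```python
def empirical_auc(scores, labels):
    """Empirical AUC for one study's abnormality scores.

    Equals the probability that a random TB-positive participant scores
    higher than a random TB-negative participant (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because AUC summarizes the whole ROC curve, two software with similar AUCs (as seen here) can still differ in specificity at the particular high-sensitivity operating point a triage test requires.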
At a threshold score close to the maximum abnormality score (95 for CAD4TBv6 and Lunit, and 0.95 for qXRv2), positive likelihood ratios were modest for CAD4TBv6 (5.4, 95% CI: 3.9–7.3) and Lunit (6.3, 95% CI: 3.8–10.2), and high for qXRv2 (20.7, 95% CI: 13.5–30.5). In within-study univariable analyses (Supplementary Tables 3–5 and Figures 3–5, Supplementary materials, pages 9–21), for all 3 approaches to threshold score selection, in at least 1 study, sensitivity was lower among women versus men, lower among PLWH versus HIV-uninfected participants, and lower for smear-negative versus smear-positive disease. In at least 1 study, specificity was lower among men versus women, among PLWH versus HIV-uninfected participants, among participants with prior tuberculosis, and in the highest age tertile. Adjusted absolute differences in pooled sensitivity and pooled specificity between subgroups, estimated using multivariable IPD meta-regression, are shown in Table 4.

Figure 1. Within-study sensitivity and specificity of chest X-ray analysis with deep learning-based software as a tuberculosis triage test in self-referred, symptomatic individuals. A, Developer-recommended threshold scores. B, Meta-analysis-derived threshold scores. C, Study-specific threshold scores. *We used the prespecified developer-recommended threshold score to classify chest X-rays as either consistent with tuberculosis or tuberculosis ruled out. †For each software, the following threshold scores were applied in all studies: CAD4TBv6, 54; Lunit, 16.68; qXRv2, 0.44. These threshold scores were chosen as each was required to reach a pooled sensitivity of 90%. ‡For each particular study, we identified the threshold score needed to reach a within-study sensitivity of 90% and estimated its within-study specificity. When no threshold score reached a sensitivity of exactly 90%, we selected the score achieving sensitivity >90%.
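The likelihood ratios quoted above were pooled with a bivariate model, but their interpretation rests on the standard point formulas relating them to sensitivity and specificity at a given threshold. A minimal sketch (the values in the test are illustrative, not the study's pooled estimates):

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios at one threshold.

    LR+ = sens / (1 - spec): how much an abnormal CXR raises the odds of TB.
    LR- = (1 - sens) / spec: how much a normal CXR lowers those odds.
    """
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg
```

This makes the trade-off at a triage operating point visible: at 90% sensitivity and roughly 57% specificity, LR− is small (a normal CXR strongly argues against tuberculosis, the rule-out property), while LR+ stays modest unless the threshold is pushed near the top of the score range, as observed for qXRv2.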
For all 3 software, in multivariable IPD meta-regression models that included sex, HIV status, prior TB, and age group, specificity was significantly lower in men versus women (e.g., with meta-analysis-derived threshold scores, CAD4TBv6: −6.7%). Table 5 provides expected outcomes of CXR analysis in hypothetical cohorts at varying tuberculosis prevalence. Results of human CXR readers were available for 2/4 studies, so we did not pool results. In Tanzania, 3 human readers interpreted each CXR (sensitivity range: 83%–97.3%; specificity, 12.0%–58.6%) [19]. In Zambia, a single reader was used (sensitivity, 96.8%; specificity, 48.8%) [20]. In both settings, confidence intervals for each human reader's sensitivity and specificity intersected with each software's ROC curves (Supplementary Figure 6, Supplementary materials, page 22), indicating that the accuracy of CAD and human readers was similar. As was seen for CAD in within-study analyses, sensitivity and specificity of human CXR reading were modified by sex, HIV, sputum smear, prior tuberculosis, and age (Supplementary Table 6, Supplementary materials, page 23).

Pooled sensitivity and specificity were estimated using bivariate random-effects 2-step individual patient data meta-analysis. Point estimates are not always equivalent to division of numerator by denominator as they were estimated via meta-analysis. Abbreviation: CI, confidence interval. a Each software's threshold score, which we applied in all studies, was prespecified by the software developers. b Each software's threshold score, which we applied in all studies, was identified using meta-analysis as the one that reached an unadjusted pooled sensitivity of 90%: CAD4TBv6, 54; Lunit, 16.68; qXRv2, 0.44.
On 24 April 2020, we repeated our search strategy to identify relevant studies published since our initial search in February 2019. We identified 570 unique records, excluded 557 based on title and abstract screening and 12 after full-text screening, leaving 1 study [23] that would have been eligible for inclusion had IPD been available (study selection is summarized in Supplementary Figure 8, Supplementary materials, page 24). The study by Qin et al retrospectively estimated the diagnostic accuracy of CAD4TBv6, Lunit, and qXRv2 against a reference of a single sputum specimen tested by NAAT [23]. Data originated from 2 TB referral centers in Nepal and Cameroon. Among 1196 individuals, 38 (3.2%) were PLWH, and 109 (9.1%) had NAAT-positive sputum, of whom 76/109 (69.7%) were sputum smear-positive. AUCs were higher than those we report: CAD4TBv6 (0.92, 95% CI: .90–.95), Lunit (0.94, 95% CI: .93–.96), and qXRv2 (0.94, 95% CI: .92–.97). Sensitivity and specificity stratified by smear and HIV status were not reported. Similar to our study, the authors found that application of the same threshold score resulted in sensitivity and specificity differing between study sites. Through meta-analysis of data from 3727 individuals self-referring for tuberculosis symptoms, we evaluated the diagnostic accuracy of CXR analyzed by commercially available, deep learning-based CAD software as a triage test for NAAT- or culture-confirmed tuberculosis. For each software, applying the same threshold in all studies resulted in sensitivity and specificity varying from study to study. In adjusted analyses, sensitivity was associated with HIV status for CAD4TBv6 and qXRv2, and with sputum smear status for all 3 software. For all 3 software, specificity was associated with age, sex, prior tuberculosis, and HIV status. In the 2 studies where human interpretation of CXR was reported, the accuracies of human readers and CAD software were comparable, and patient characteristics similarly affected human reading.
Differences with confidence intervals (CI) that exclude the null value are shown in bold. For sensitivity, N = 567; for specificity, N = 2742. For sensitivity we used fixed-effects individual patient data multivariable meta-regression, and for specificity we used random-effects individual patient data meta-regression. Abbreviations: HIV, human immunodeficiency virus; PLWH, people living with HIV; TB, tuberculosis. a Model for differences in sensitivity with qXRv2 did not converge when sex was included. Estimates are the absolute difference in sensitivity (or specificity) comparing subgroups, after adjusting for the other covariates in the model. For example, the estimate for Lunit with its developer-recommended threshold score for HIV status means that Lunit sensitivity was 1.0% lower among PLWH compared to the HIV-uninfected, after adjusting for sex and smear status, but the CI and P value indicate that the difference was not statistically significant.

The high sensitivity and moderate to low specificity of CXR analysis by these software, and our observed associations between certain patient characteristics and software accuracy, are similar to what has been reported for human-read CXR [19, 24–30]. Hence, the evidence suggests that CXR analysis by CAD software would lead to diagnostic outcomes for tuberculosis similar to those of CXR interpretation by humans. Based in part on our findings, WHO recently issued new guidance supporting the use of CAD as a replacement for human readers when analyzing CXR for tuberculosis. Use of CAD software will improve the reliability of CXR analysis by eliminating the intra-reader [31] and inter-reader [25, 31–37] variability that occur with human reading, and would also eliminate problems tied to human reader fatigue [38, 39]. However, implementation should take into consideration three important limitations of current software.
First, we found that the accuracy of threshold scores varied between studies, such that users will face uncertainty about the sensitivity and specificity achieved in their particular setting. To reduce this uncertainty, users will need estimates of sensitivity and specificity at different threshold scores, identified using ROC curve analysis of data from individuals sampled from their own patient population. Developers and other implementation partners will need to provide all de novo users with the resources, protocols, and tools to undertake these analyses. Given the importance of minimizing bias when estimating diagnostic accuracy [40], WHO provides a draft protocol for new CAD users to undertake threshold selection for their patient populations [41]. In areas where HIV-associated or smear-negative tuberculosis are epidemiologically important, users should be provided with estimates of accuracy at different thresholds within strata of these variables. We recognize that this represents an important implementation challenge. However, a cautious approach is warranted considering that this is a novel technology, and ours is not the only study to report between-setting heterogeneity [23]. A second implementation consideration is that these software do not provide differential diagnoses as radiologists would. Third, none of the software are validated for use in infants and young children. Taken together, these limitations of existing software mean that they cannot yet fully replace human CXR readers. This study has a number of strengths. First, the quality of the included studies reduces the likelihood of selection or measurement bias. Second, we substantially expanded the evidence base for deep learning-based CAD by reanalyzing CXR images from studies that had initially reported on older, non-deep learning-based programs.
Third, through IPD meta-analysis we identified patient characteristics modifying diagnostic accuracy and were able to estimate the associated absolute changes in sensitivity and specificity, which have not previously been reported. Fourth, we conducted our study independently of companies who have a commercial interest in this field. Finally, our evaluation was based on CXRs that had not been used for software training; evaluation on training images could have overestimated accuracy [4]. Some limitations should also be considered. First, we did not have data from 3 [25, 27, 42] of 6 eligible studies identified in our initial literature search, nor from the 1 study [23] identified in the updated search. However, we think their inclusion would not have changed our main results or conclusions, for a number of reasons. Two of the studies for which data could not be obtained [25, 42] were conducted by the same investigators and in the same 2 countries as 2 studies that we did include [6, 20]. Another study we could not include [27] evaluated an older CAD4TB version in Bangladesh, against a single NAAT as the reference and without sputum smear data. In a preprint [43], AUCs of CAD4TBv6, Lunit, and qXR on the Bangladesh data set were similar to our estimates from Pakistan (both low HIV prevalence settings). Importantly, our finding of between-site variability in diagnostic accuracy was also reported by Qin et al [23]. A second limitation of our study is that we did not have data on CD4 counts, which could have further explained heterogeneity amongst PLWH. Another limitation is that over 1 year has passed since the updated literature search. Future research should focus on reducing between-population heterogeneity in accuracy, validation for use in childhood tuberculosis, and re-evaluation in the era of COVID-19, which could reduce CAD specificity for tuberculosis due to shared radiographic manifestations.
In summary, among individuals self-referring for pulmonary tuberculosis symptoms, CXR analysis with these deep learning-based CAD software can serve as a high-sensitivity rule-out test. Moreover, tuberculosis diagnostic outcomes when using these software will be similar to those achieved with human CXR readers. However, to reduce uncertainty related to diagnostic heterogeneity, developers should provide de novo users with threshold scores and estimates of accuracy derived from their own patient populations, stratified by HIV and smear status.

References
The Use of X-Ray examinations in pulmonary tuberculosis
Recent advances in chest radiography
Artificial intelligence in radiology
A systematic review of the diagnostic accuracy of artificial intelligence-based computer programs to analyze chest x-rays for pulmonary tuberculosis
Computer-aided detection of pulmonary tuberculosis on digital chest radiographs: a systematic review
Deep learning-based chest X-ray analysis software as triage tests for pulmonary tuberculosis: a prospective study of diagnostic accuracy for culture-confirmed disease
Chest Radiography in Tuberculosis Detection: Summary of current WHO recommendations and guidance on programmatic approaches
Preferred reporting items for systematic review and meta-analyses of individual participant data: the PRISMA-IPD statement
QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies
Modelling multiple thresholds in meta-analysis of diagnostic test accuracy studies
diagmeta: meta-analysis of diagnostic accuracy studies with several R packages
World Health Organization. High-priority target product profiles for new tuberculosis diagnostics: report of a consensus meeting
Meta-analysis of diagnostic accuracy with mada. R Packages
Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews
Systematic screening for active tuberculosis: principles and recommendations
SAS Institute Inc.
Base SAS® 9.4 procedures guide: statistical procedures
R: a language and environment for statistical computing
Diagnostic accuracy of computer-aided detection of pulmonary tuberculosis in chest radiographs: a validation study from sub-Saharan Africa
The sensitivity and specificity of using a computer aided diagnosis program for automatically scoring chest X-rays of presumptive TB patients compared with Xpert MTB/RIF in Lusaka Zambia
An automated tuberculosis screening strategy combining X-ray-based computer-aided detection and clinical information
Automated chest-radiography as a triage for Xpert testing in resource-constrained settings: a prospective study of diagnostic accuracy and costs
Using artificial intelligence to read chest radiographs for tuberculosis detection: a multi-site evaluation of the diagnostic accuracy of three deep learning systems
The validity of classic symptoms and chest radiographic configuration in predicting pulmonary tuberculosis
Detection of tuberculosis using digital chest radiography: automated reading vs interpretation by clinical officers
Scoring systems using chest radiographic features for the diagnosis of pulmonary tuberculosis in adults: a systematic review
An evaluation of automated chest radiography reading software for tuberculosis screening among public- and private-sector patients
Predictors of smear-negative pulmonary tuberculosis in HIV-infected patients
The role and performance of chest X-ray for the diagnosis of tuberculosis: a cost-effectiveness analysis in
Diagnostic strategy for pulmonary tuberculosis in a low-incidence country: results of chest X-ray and sputum cultured for Mycobacterium tuberculosis
Chest radiograph abnormalities associated with tuberculosis: reproducibility and yield of active cases
Variability in interpretation of chest radiographs among Russian clinicians and implications for screening programmes: observational study
How reliable is chest radiography? In: Frieden T. Toman's tuberculosis case detection, treatment, and monitoring: questions and answers
Evaluation of a chest radiograph reading and recording system for tuberculosis in a HIV-positive cohort
Development of a simple reliable radiographic scoring system to aid the diagnosis of pulmonary tuberculosis
A simple, valid, numerical score for grading chest x-ray severity in adult smear-positive pulmonary tuberculosis
Intra-observer and overall agreement in the radiological assessment of tuberculosis
Understanding and confronting our mistakes: the epidemiology of error in radiology and strategies for error reduction
Tired in the reading room: the influence of fatigue in radiology
Computer-aided reading of tuberculosis chest radiography: moving the research agenda forward to inform policy
Evaluation of the diagnostic accuracy of computer-aided detection of tuberculosis on chest radiography among private sector patients in
Can artificial intelligence (AI) be used to accurately detect tuberculosis (TB) from chest x-ray? a multiplatform evaluation of five AI products used for TB screening in a high TB-burden setting

Acknowledgments. The authors thank Delft, Lunit, and qure.ai for providing technical support with the local installation of the software used in this study. The authors thank Marcel Behr of the McGill International TB Centre for critical feedback on the manuscript.

Financial support. L'Observatoire International Sur Les Impacts Sociétaux de l'Intelligence Artificielle (Fonds de recherche Québec). The funder had no role in the collection, analysis, and interpretation of the data; in the writing of the report; or in the decision to submit the article for publication.

Potential conflicts of interest. M. B. reports grants from the European and Developing Countries Clinical Trials Partnership during the conduct of the study. A. J. K. has had financial interests in the company Alcela, of which qure.ai is a client, and has been in discussions with qure.ai for additional business development since October 2019. A. J. K. helped conceive and design the included study from Pakistan in 2016 to 2017 but was never directly involved in data collection, analysis, or reporting of that study, and his relationship with Alcela arose after the completion of data collection for that study. A. J. K. was not involved in the design, analysis, reporting, writing, editing, or decision to submit the work reported in the present manuscript. C. D. reports working for FIND until April 2019. FIND is a not-for-profit foundation whose mission is to find diagnostic solutions to overcome diseases of poverty in LMICs. Since leaving FIND, C. D. continues to hold a collaborative agreement with FIND. M. P. reports that he serves on the Scientific Advisory Committee (SAC) of FIND, Geneva. M. P. reports no financial or industry conflicts. F. A. K. reports grants from the Fonds de Recherche du Québec and the Canadian Institutes of Health Research, both publicly funded, government-run research agencies. F. A. K. has no financial or industry conflicts. All other authors report no potential conflicts. All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Conflicts that the editors consider relevant to the content of the manuscript have been disclosed.

Protocol registration: PROSPERO (CRD42018073016).

Table 5 notes: A prevalence of 17% is shown as this was the prevalence of pulmonary tuberculosis in the present study. Prevalences of 5% and 20% are shown by convention. The same threshold score was applied in each group. Estimates were calculated using the pooled sensitivity and specificity of the meta-analysis-derived threshold score (CAD4TBv6, 54; Lunit, 16.68; qXRv2, 0.44). Abbreviations: CAD, computer-aided detection software; HIV, human immunodeficiency virus; PLWH, people living with HIV.