key: cord-0052818-is9kbik7
authors: James, Hannah K.; Pattison, Giles T. R.; Griffin, James; Fisher, Joanne D.; Griffin, Damian R.
title: Assessment of technical skill in hip fracture surgery using the postoperative radiograph: pilot development and validation of a final product analysis core outcome set
date: 2020-09-24
journal: Bone Jt Open
DOI: 10.1302/2633-1462.19.bjo-2020-0101.r1
sha: e3d05436406d9c099c1782137c270cfa7bd799b5
doc_id: 52818
cord_uid: is9kbik7

AIMS: To develop a core outcome set of measurements from postoperative radiographs that can be used to assess technical skill in performing dynamic hip screw (DHS) and hemiarthroplasty, and to validate these against Van der Vleuten’s criteria for effective assessment. METHODS: A Delphi exercise was undertaken at a regional major trauma centre to identify candidate measurement items. The feasibility of taking these measurements was tested by two of the authors (HKJ, GTRP). Validity and reliability were examined using the radiographs of operations performed by orthopaedic resident participants (n = 28) of a multicentre randomized controlled educational trial (ISRCTN20431944). Trainees were divided into novice and intermediate groups, defined as having performed < ten or ≥ ten cases each for DHS and hemiarthroplasty at baseline. The procedure-based assessment (PBA) global rating score was assumed as the gold standard assessment for the purposes of concurrent validity. Intra- and inter-rater reliability testing were performed on a random subset of 25 cases. RESULTS: In total, 327 DHS and 248 hemiarthroplasty procedures were performed by 28 postgraduate year (PGY) 3 to 5 orthopaedic trainees during the 2014 to 2015 surgical training year at nine NHS hospitals in the West Midlands, UK. Overall, 109 PBAs were completed for DHS and 80 for hemiarthroplasty. Expert consensus identified four ‘final product analysis’ (FPA) radiological parameters of technical success for DHS: tip-apex distance (TAD); lag screw position in the femoral head; flushness of the plate against the lateral femoral cortex; and eight-cortex hold of the plate screws. Three parameters were identified for hemiarthroplasty: leg length discrepancy; femoral stem alignment; and femoral offset. Face validity, content validity, and feasibility were excellent. 
For all measurements, performance was better in the intermediate compared with the novice group, and this was statistically significant for TAD (p < 0.001) and femoral stem alignment (p = 0.023). Concurrent validity was poor when measured against global PBA score. This may be explained by the fact that they are measuring different facets of competence. Intra- and inter-rater reliability were excellent for TAD, moderate for lag screw position (DHS), and moderate for leg length discrepancy (hemiarthroplasty). Use of a large multicentre dataset suggests good generalizability of the results to other settings. Assessment using FPA was time- and cost-effective compared with PBA. CONCLUSION: Final product analysis using post-implantation radiographs to measure technical skill in hip fracture surgery is feasible, valid, reliable, and cost-effective. It can complement traditional workplace-based assessment for measuring performance in the real-world operating room. It may have particular utility in competency-based training frameworks and for assessing skill transfer from the simulated to live operating theatre. Cite this article: Bone Joint Open 2020;1-9:594–604.

Introduction. Post-implantation radiographs are routinely used in orthopaedic practice to assess the success of hip fracture surgery and to predict risk of fixation failure. The position of the implant is widely believed to be an important factor in predicting clinical outcome [1] [2] [3] [4] and has been repeatedly shown in the simulation laboratory to be influenced by the technical skill of the surgeon. [5] [6] [7] [8] [9] [10] With the notable exception of tip-apex distance (TAD) for dynamic hip screw (DHS), 4 there is a paucity of published evidence on the relationships between postoperative implant position, patient outcome following hip fracture surgery, and technical skill of the surgeon in the real-world clinical environment. In the absence of accepted criteria, judgement as to the satisfactory position of the implant in DHS and hemiarthroplasty appears to be made in everyday clinical practice using a global, qualitative 'expert eye' judgement, refined through experience.

The need to define a core radiological outcome set to assess a technically successful hip fracture operation is driven by the requirement in both the surgical training and educational research settings for a technical skills outcome measure that is clinically relevant and objectively measurable, reproducible, and reliable. The use of patient-centred outcome measures is key to being able to demonstrate the highest level of evidence of learning according to Kirkpatrick's hierarchy (level 4; patient results). 11 This is important for two reasons. First, in an increasingly competency-based training climate, residents must objectively demonstrate attainment of surgical skill in core procedures. 12 Second, demonstrating skills transfer to the operating theatre with resultant patient benefit is necessary to justify financial investment decisions around simulation provision. Measurement of technical skill in the real-world orthopaedic theatre is fraught with methodological challenges, and a recent systematic review showed that none of the technical skills assessment tools in current use in orthopaedic training around the world satisfy the Norcini criteria for effective assessment. 13 Most have not been validated beyond the simulated environment, and are unsuitable for use in real life due to reliance on simulator-derived metrics for assessment of technical skill.

There is growing interest in the use of 'final product analysis' (FPA) to objectively assess the real-world technical skill of the trainee orthopaedic surgeon. There is an increasing body of evidence to show that FPA in the simulation laboratory is face, 14 content, 5, 14 construct, [5] [6] [7] 9, [14] [15] [16] [17] and concurrent 14 valid, and educationally impactful 5, 18 for orthopaedic surgery. There is no evidence to date of the utility of FPA in the clinical setting using real patient operations. Postoperative radiographs are an attractive candidate for FPA in the real-world clinical setting as they are objective, proximate to the time of surgery, non-invasive, and routinely collected as part of usual care. They are a useful surrogate measure where measurement of traditional gold-standard clinical outcomes such as revision rate and mortality is impractical. The ideal radiological measures for this purpose are those that are easily perceptible on a radiograph, that are clinically relevant, and which have sufficiently high resolution to be responsive to small incremental changes in technical skill.

This study is the first investigation into the real-world utility of using postoperative patient radiographs for technical skills assessment of junior surgeons. Our hypothesis is that postoperative patient radiographs can be used to measure technical skill in junior residents performing hip fracture surgery on real patients in the operating theatre, and that this will satisfy four of the domains of effective assessment described by Van der Vleuten: 19 validity; reliability; feasibility; and cost-effectiveness.

National research ethics approval was granted for this study by the NHS Research Authority South Birmingham Research Ethics Committee (15/WM/0464). Confidentiality Advisory Group approval was granted for accessing radiological data without patient consent (16/CAG/0125).

Phase 1: Consensus exercise to define core outcomes. An informal scoping literature review was undertaken to identify current evidence for assessing technical skill using postoperative radiographs in hip fracture surgery. When none was found, the focus of the scoping review was moved to look for evidence of radiological features of DHS and hemiarthroplasty that predict clinical outcome. No studies were found relating hemiarthroplasty implant position to clinical outcome, and so the total hip arthroplasty literature was used. A list of candidate measurements was developed from the scoping literature search and externally checked with internationally recognized experts in the field to ensure that they were in line with leading opinion. An e-Delphi exercise was undertaken to systematically combine expert opinion and achieve consensus where none currently exists. Consensus was determined to have been reached when there was ≥ 75% panel agreement, which is a widely accepted benchmark in the consensus-setting literature. 20 The consultant orthopaedic surgeon cohort in a major regional trauma centre in the UK was invited to participate (n = 39). Nineteen consultants completed all three survey rounds (49%). All consultant orthopaedic surgeons were invited regardless of subspecialism or involvement with the on-call trauma service, as hip fracture operations are a basic core trauma procedure in which independent competence is required before completion of surgical training. 21 The Delphi panel demographic information is shown in Supplementary Table I.

The survey was built using an online survey platform (SurveyMonkey Inc., San Mateo, California, USA) and administered in three rounds. An overview of the Delphi process is shown in Figure 1. In round 1, participants were presented with the candidate measurements and given binary yes/no answer options to indicate if they believed each of the proposed measures was important for assessing the skill of DHS or hemiarthroplasty. There was free text space to record opinion. In round 2, items that had achieved consensus were re-presented with proposed cut-off thresholds of acceptability, in a binary yes/no format, and free text space was provided for elaboration. Items that had not reached consensus were re-presented with the level of participant agreement (expressed as a percentage) with additional details of supporting literature evidence. In round 3, items that had still not achieved consensus were re-presented with new acceptability threshold proposals in line with panel opinion from round 2, along with relevant supporting published evidence where appropriate. Items that failed to reach consensus after three rounds were abandoned.

Figure 1: Overview of the consensus process.

Phase 2: Feasibility testing. The feasibility of obtaining the measurements identified in phase 1 was assessed by two of the authors (HKJ and GTRP). Measurements were taken within the hospital electronic Picture Archiving and Communication System (PACS) using the inbuilt user interface tools. Intraoperative image intensification (II) images were used for DHS, as postoperative radiographs for DHS are not routinely taken in UK orthopaedic practice. The II pictures are not autocalibrated within PACS and so were manually scaled using a known fixed implant dimension (the outer thread diameter of the DHS lag screw).

Phase 3: Validity and reliability testing. Radiographs of operations performed by orthopaedic trainee participants of a multicentre randomized controlled educational trial 22 (ISRCTN20431944) were used for validity and reliability testing. Cases were identified from the electronic surgical logbooks of operations performed by trial participants. The corresponding radiographs were retrieved from the hospital servers.
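
The manual scaling step described under Phase 2 amounts to a simple proportion: a feature of known true size that is visible in the image fixes the millimetres-per-pixel factor for that view. The sketch below illustrates this, together with the conventional definition of tip-apex distance as the sum of the magnification-corrected tip-to-apex distances on the anteroposterior and lateral views. It is an illustration under stated assumptions, not the measurement procedure used within PACS in this study; the function names, the 12.5 mm thread diameter, and all pixel values are hypothetical placeholders.

```python
def mm_per_pixel(known_diameter_mm: float, measured_diameter_px: float) -> float:
    """Scale factor for one view, derived from a feature of known true size."""
    if known_diameter_mm <= 0 or measured_diameter_px <= 0:
        raise ValueError("diameters must be positive")
    return known_diameter_mm / measured_diameter_px


def calibrated_length_mm(length_px: float, known_diameter_mm: float,
                         measured_diameter_px: float) -> float:
    """Convert an on-screen length in pixels to millimetres for one view."""
    return length_px * mm_per_pixel(known_diameter_mm, measured_diameter_px)


def tip_apex_distance_mm(ap_tip_apex_px: float, ap_screw_px: float,
                         lat_tip_apex_px: float, lat_screw_px: float,
                         known_diameter_mm: float) -> float:
    """Sum of the calibrated tip-to-apex distances on the AP and lateral views."""
    ap = calibrated_length_mm(ap_tip_apex_px, known_diameter_mm, ap_screw_px)
    lat = calibrated_length_mm(lat_tip_apex_px, known_diameter_mm, lat_screw_px)
    return ap + lat


# Hypothetical example: every number below is a placeholder, not study data.
THREAD_DIAMETER_MM = 12.5  # assumed nominal value; use the real implant dimension
tad = tip_apex_distance_mm(
    ap_tip_apex_px=40.0, ap_screw_px=50.0,     # AP view: 0.25 mm per pixel
    lat_tip_apex_px=30.0, lat_screw_px=62.5,   # lateral view: 0.20 mm per pixel
    known_diameter_mm=THREAD_DIAMETER_MM,
)
print(f"TAD = {tad:.1f} mm")
```

Each view is calibrated separately because magnification can differ between the two projections.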

Face and content validity were addressed in phase 1. Construct validity, the ability of an assessment instrument to discriminate between experience levels, was measured by comparing novice and intermediate-level trainee performance over the same time period and setting(s). Novice trainees were defined as having performed < ten DHS or hemiarthroplasty cases at baseline, and intermediate trainees as having performed ≥ ten DHS or hemiarthroplasty cases at baseline. Classification of trainee experience was independently assessed for each procedure. A threshold of ten cases was chosen because previous learning curve analysis of trainees performing simulated hip fracture osteosynthesis suggests that around ten repetitions are required for performance to stabilize in the associative learning phase. 23 For continuous outcomes, we compared the means between both groups using a t-test, and tested whether the difference between groups was zero. For categorical outcomes, we conducted a chi-squared test of association, or Fisher's exact test if cell counts were less than five. Concurrent validity, the performance of an assessment instrument against the current gold standard, was determined by comparing performance as measured by implant position on the radiographs against the global rating scale component of the procedure-based assessment (PBA) scores. The PBA is the current gold standard summative assessment tool used in higher orthopaedic surgical training in the UK. 24 PBAs are collected routinely during training, although not mandated for every case. We assessed the same outcome measures for both procedures described above.

All primary measurements were taken by one author (HKJ, orthopaedic trainee). An adequate reliability testing sample size was determined to be 25 cases. A randomly selected subset of 25 DHS and 25 hemiarthroplasty cases was re-measured on two occasions one week apart to determine intra-rater reliability, and on one occasion by a second rater (GTRP, attending orthopaedic surgeon) to determine inter-rater reliability.

We conducted both intra- and inter-rater reliability analyses to assess the reliability of the primary rater, and the comparability of measures with the independent rater, respectively. For continuous measures, we drew Bland-Altman plots to assess differences in measures and then calculated intraclass correlation coefficients, with accompanying 95% confidence intervals, to describe how strongly associated the scores were. For categorical outcomes, we used an equivalent measure for assessing agreement, the Cohen's kappa statistic, and the crude percentage agreement in absolute terms.
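
Two of the agreement statistics named above can be written out in a few lines. The sketch below computes Cohen's kappa for paired categorical ratings, and the bias and 95% limits of agreement that underlie a Bland-Altman plot for paired continuous measurements. It is illustrative only: the function names and example ratings are hypothetical, the intraclass correlation coefficient is not covered, and returning `None` when chance agreement equals 1 is a design choice of this sketch rather than a description of how the study's software behaved.

```python
from collections import Counter
from statistics import mean, stdev


def cohens_kappa(rater_a: list, rater_b: list):
    """Chance-corrected agreement between two sets of paired categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be paired and non-empty")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    if expected == 1.0:
        return None  # both raters used a single identical category: kappa undefined
    return (observed - expected) / (1 - expected)


def bland_altman(first: list, second: list) -> tuple[float, float, float]:
    """Mean difference (bias) and 95% limits of agreement for paired measurements."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need at least two paired measurements")
    differences = [x - y for x, y in zip(first, second)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)
    return bias, bias - spread, bias + spread


# Hypothetical ratings (1 = plate flush, 0 = not flush) and measurements in mm.
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
print(bland_altman([10.0, 12.0, 14.0], [9.0, 12.0, 15.0]))
```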

Overall, 28 core trainee 1 (CT1) to specialty trainee 3 (ST3) trainees performed 327 DHS operations and 248 hemiarthroplasty operations during one surgical training year (August 2014 to August 2015) in nine regional NHS hospitals in the UK. There were 109 PBAs completed for DHS and 80 for hemiarthroplasty in the study population. Baseline demographics of the trainee participants are shown in Table I. Only operations coded as 'supervised trainer scrubbed' or 'supervised trainer unscrubbed' were included in the analysis, to ensure that the included operations were actually performed by the trainee participants. Operations coded as 'performed' were excluded as these are unsupervised, non-training operations and therefore there would not be a corresponding PBA completed.

Face and content validity. Face validity (that a tool is fit for purpose) and content validity (that a tool tests appropriate domains) can both be demonstrated through expert consensus-setting exercises. Candidate items were externally checked by recognized international experts in hip fracture surgery. The items that achieved consensus > 75% through the e-Delphi process, with descriptors and acceptability thresholds, are shown in Table II.
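
The consensus rule applied in each Delphi round reduces to a proportion check per item. The sketch below applies the ≥ 75% agreement benchmark stated in the methods to binary yes/no votes. The function name, item labels, and vote counts are hypothetical and do not reproduce the panel's actual responses.

```python
def delphi_round(votes: dict[str, list[bool]],
                 threshold: float = 0.75) -> dict[str, tuple[float, bool]]:
    """Return, for each item, the share of 'yes' votes and whether consensus was reached."""
    results = {}
    for item, ballots in votes.items():
        if not ballots:
            raise ValueError(f"no votes recorded for {item!r}")
        agreement = sum(ballots) / len(ballots)
        results[item] = (agreement, agreement >= threshold)
    return results


# Hypothetical ballots from a 19-member panel.
example_votes = {
    "item A": [True] * 15 + [False] * 4,   # 15 of 19 agree
    "item B": [True] * 14 + [False] * 5,   # 14 of 19 agree
}
for item, (agreement, reached) in delphi_round(example_votes).items():
    print(f"{item}: {agreement:.0%} agreement, consensus reached: {reached}")
```

Items that fall short of the threshold would be carried forward to the next round, as described in the methods.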

Four FPA radiological parameters were identified for DHS: tip-apex distance; lag screw position in the femoral head with reference to Cleveland's zones; 25 flushness of the plate against the lateral femoral cortex; and eight-cortex hold of the plate screws (Figures 2 and 3). Three radiological parameters were identified for hemiarthroplasty: leg length discrepancy; femoral stem alignment; and femoral offset (Figure 4). Rejected items were 'cortical screws perpendicular to plate' for DHS, which was rejected with 68% of the panel voting against in round one, and 'cement thickness' for hemiarthroplasty, which failed to reach consensus after three rounds. A schematic diagram of the measurement parameters is shown in Figures 2 to 4.

Construct validity. Construct validity (the discriminant ability of a test instrument to distinguish between experience levels) was evaluated by comparing between-group differences for the various metrics. Results of construct validity testing are shown in Table III (DHS) and Table IV (hemiarthroplasty). For DHS, TAD as a continuous variable was found to be significantly different between experience levels, with the intermediate group having a lower mean TAD, signifying a technically superior result (18.3 mm for novices compared with 15.7 mm for intermediates; p < 0.001). Tip-apex distance < 25 mm as a dichotomous variable was not discriminant between the two groups (p = 0.222). Mean PBA scores for DHS were seen to improve significantly between the novice group, with a mean global rating score of 2.4, and the intermediate group, with a mean score of 2.8 (p < 0.001). There was no difference seen in lag screw position in the femoral head between the two groups (p = 0.393). There were fewer plates flush to the lateral femoral cortex in the novice group (58%) as compared with the intermediate group (66%), but this was not statistically significant (p = 0.153). Similarly, there were slightly more procedures that failed to demonstrate eight-cortex hold in the novice group (4%) compared with the intermediate group (1%), but this difference was again not significant.

For hemiarthroplasty, femoral stem alignment was found to be significantly better in the intermediate group than in the novice group, with a mean deviation from neutral of 3.1° for novices and 2.6° for intermediates; p = 0.023. Leg length discrepancy and femoral offset difference were both found to be better in the intermediate group as compared with the novices, but these differences were not statistically significant.

The intermediate group achieved a significantly higher mean global PBA score than the novice group for hemiarthroplasty; the mean score was 2.4 for novices and 2.9 for intermediates; p < 0.001.

Concurrent validity. Concurrent validity was measured by examining differences between global PBA scores (treated as categorical variables) and each of the radiological measurements for DHS and hemiarthroplasty (Table IV to Table V). No significant association between PBA global rating score and any of the seven tested radiological parameters was found using the chi-squared test.

Reliability. For DHS, both intra- and inter-rater reliability were found to be excellent for TAD (Cohen's kappa 0.84 and 0.76), and moderate for position of lag screw (Cohen's kappa 0.47 for both intra- and inter-rater reliability) (Table VI). Intra-rater reliability was found to be poor for assessing whether or not the plate was flush to the lateral cortex of the femur (Cohen's kappa 0.12). The kappa statistic could not be calculated for intra- and inter-rater reliability for eight-cortex hold and for inter-rater reliability for plate flush to femur due to one rater having no variation in measurement.

For hemiarthroplasty, the intra-rater reliability was found to be moderate for leg length discrepancy and femoral stem alignment (Cohen's kappa 0.57 and 0.59, respectively), and excellent for femoral offset difference (Cohen's kappa 0.79). The inter-rater reliability was moderate for leg length discrepancy (Cohen's kappa 0.54), fair for femoral stem alignment (Cohen's kappa 0.33), and poor for femoral offset difference (Cohen's kappa 0.18) (Table VII).

Cost-effectiveness. The radiographs that were measured were collected as a routine part of intra/postoperative care, and therefore represented no extra cost burden from an educational assessment or clinical care point of view. Assessor time in taking the measurements was recorded as a mean of 45 seconds per case for DHS and 57 seconds per case for hemiarthroplasty. This is substantially lower than the recommended average time to complete a PBA form of ten to 15 minutes. 24

To be sure that the case mix encountered by the two groups was comparable, we classified the hip fractures into simple/moderate/complex for DHS and simple/complex for hemiarthroplasty, based around the AO classification system. 26 We found no differences in the fracture complexity between the novice and intermediate groups for either DHS or hemiarthroplasty (classification matrix, and table in Supplementary Tables II and III).
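
The size of the time saving reported under cost-effectiveness can be checked with simple arithmetic. The sketch below divides the recommended PBA completion time (ten to 15 minutes) by the mean FPA measurement times reported above (45 and 57 seconds). The timings come from the text; the variable and function names are illustrative only.

```python
FPA_SECONDS = {"DHS": 45, "hemiarthroplasty": 57}   # mean measurement time per case
PBA_SECONDS = (10 * 60, 15 * 60)                    # recommended range: 10 to 15 minutes


def fold_reduction(fpa_seconds: float) -> tuple[float, float]:
    """How many times longer a PBA takes than an FPA measurement (low, high)."""
    low, high = PBA_SECONDS
    return low / fpa_seconds, high / fpa_seconds


ratios = {procedure: fold_reduction(seconds)
          for procedure, seconds in FPA_SECONDS.items()}
for procedure, (low, high) in ratios.items():
    print(f"{procedure}: PBA takes {low:.1f} to {high:.1f} times as long")
# DHS: 13.3 to 20.0 times as long; hemiarthroplasty: 10.5 to 15.8 times as long
```

Even the least favourable pairing (a ten-minute PBA against a 57-second hemiarthroplasty measurement) gives a ratio above ten.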

The ultimate goal of surgical training is to produce safe surgeons who perform good quality operations for their patients. The use of FPA to assess surgical skill using patient radiographs may help bridge the perceived gap between educational assessment and real-world clinical performance. Postoperative radiographs are a promising resource for real-world FPA as part of the move towards competency assessment in training, and also to measure transfer of skills from the simulated environment.

This is the first study to explore the use of patient radiographs for FPA assessment of technical skill, and we have systematically addressed the key domains of effective assessment: face, content, construct, and concurrent validity; feasibility; reliability; and cost-effectiveness.

Our results showed reasonable face and content validity of the radiological outcome measures within the limits of a Delphi exercise. As we are assessing the role of radiological FPA as a surrogate for clinical outcome in an educational assessment setting, it is difficult to show comprehensiveness and comprehensibility as is traditionally required of content validity. It is reasonable to say that this outcome set appears to be relevant for the population (trainee surgeons) and context of interest (hip fracture surgery). The construct validity picture was mixed, with one of four measures for DHS (TAD) and one of three measures for hemiarthroplasty (femoral stem alignment) demonstrating statistically significant differences between groups divided by experience level. The trend across all measurements showed improvement between the novice and intermediate groups, suggesting evidence of construct validity, although we cannot claim construct validity of the tool as a whole based on this pilot study. The construct validity of the global rating scale for PBA was found to be excellent for both DHS and hemiarthroplasty. The concurrent validity with PBA was explored by comparing the radiological parameters for operations which scored a PBA level 2 or 3 (table viiia ivb). We did not show evidence of concurrent validity to our gold standard. The use of PBA as gold standard, as opposed to a clinical outcome, is a significant limitation of this study. None of the parameters demonstrated a statistically significant relationship with PBA global rating scale score. This is the first investigation of the association between PBA and the quality of the outcome of the operation as measured by the radiograph. This finding might reflect the fact that they are assessing different things; the PBA global rating scale is designed to assess the overall ability of the trainee to perform the procedure without supervision, rather than to assess the quality of the operation or technical skill in doing so. Hence the PBA and our radiological outcome measurements are assessing two different facets of competence that are not directly comparable, which may explain the observed lack of concurrent validity.

The intra-rater reliability was generally excellent for both DHS (with the exception of plate flush to femur) and hemiarthroplasty, and the inter-rater reliability was generally excellent for DHS, but moderate to poor for hemiarthroplasty. This finding might be explained by the fact that the measurement technique is more readily standardized for DHS, whereas there is greater scope for subjectivity in deciding on appropriate landmarks for measuring leg length discrepancy and offset. This is likely to be compounded by the fact that the postoperative films were often of poor quality, supine and rotated, in contrast to standing films seen in the elective arthroplasty setting. The feasibility was excellent, with the measurements easy and quick to obtain using readily accessible technology. The cost-effectiveness was also superficially excellent, with no additional cost associated with the radiographs other than the assessors' time. Time-to-assess per case was lower for FPA using postoperative radiographs than for PBA completion by a factor of at least ten.

A strength of our study is that we have systematically assessed nearly 600 real operations performed by 28 trainees across nine hospital sites over one surgical training year, which is a much larger sample with longer follow-up than most educational studies. The large sample size and the multicentre nature of the data suggest that the generalizability of our results is good and the chance of a type 2 error small. This study has several weaknesses. Our scoping review was informal, and therefore it is possible some outcomes could have been missed. We only considered four of the five of Van der Vleuten's utility domains of effective assessment, as we have excluded 'educational impact'. This decision was taken because separate, qualitative assessment of the educational impact of using radiological measurements for learning would be required and this analysis was conducted retrospectively. Previous work has shown that the morning trauma meeting, where radiographs are displayed and discussed, is educationally valuable for trainees. 27 Other simulation-based studies have shown FPA to be an educationally valuable assessment method in orthopaedic surgery. 5, 18 It is therefore not unreasonable to assume that FPA, with appropriately delivered feedback, would be educationally impactful, although we did not specifically seek to address this in our study. Our gold standard, the PBA, may not have been the best comparator. Ideally, we would have compared the radiological outcomes with clinical outcomes.

With the probable exception of TAD, given the weight of evidence supporting its clinical relevance and clearly significant construct validity demonstrated in our results, the measurements we have defined here are unlikely to be useful in isolation for assessing competence. Rather, they may be most useful as an adjunct to traditional technical skills assessment in the workplace, to help overcome the well-recognized limitations of these. The pilot FPA outcome sets we have described here may also be useful in developing competency thresholds for simulation-based training. It is possible that the radiological metrics we have investigated in this study could be combined into a composite score, and further work is needed to ascertain appropriate weightings for the individual items and to pilot test these.

It is feasible to measure technical skill in orthopaedic trainees performing hip fracture surgery using intra- or postoperative patient radiographs, and this is probably cost-effective, and appears to be face and content valid. Performance was widely observed to be better in the intermediate than in the novice group, suggestive of construct validity, and this was statistically significant for TAD in DHS and femoral stem alignment in hemiarthroplasty. Improvement in these measures with increased experience suggests that they are responsive to small incremental changes in technical skill. Concurrent validity was poor when measured against the PBA global rating scale score, but this may be because the PBA is not designed to assess technical skill. Procedure-based assessment may not be the best gold standard measure. Intra- and inter-rater reliability were variable, and found to be excellent for TAD, and moderate for lag screw position (DHS) and leg length discrepancy (hemiarthroplasty). Use of a large, longitudinal, multicentre educational trial dataset suggests the generalizability of these results is good. FPA using patient radiographs is likely to be most useful as part of a battery of assessments of technical skill, and may have a role in complementing traditional workplace-based assessment in determining technical skill in the real-world OR. It may have particular utility in competency-based training frameworks, and for assessing skill transfer from the simulated to live operating theatre. These results should be regarded as provisional, and until further validation evidence is provided, the PBA remains the best current tool for assessing technical skill in surgical trainees.

- Post/intra-operative radiographs can be used to assess technical skill in hip fracture surgery. This can complement traditional workplace-based assessment for measuring operative performance and may have particular value in competency-based surgical training.


Tables showing demographics of Delphi panel, fracture complexity by surgeon experience level, and fracture complexity codes by AO classification.

Total hip arthroplasty following failed fixation of proximal hip fractures

Patient outcomes after screw fixation of hip fractures

Radiological predictive factors in the healing of displaced intracapsular hip fractures. A clinical study of 404 cases

The value of the tip-apex distance in predicting failure of fixation of peritrochanteric fractures of the hip

Construct validation of a novel hip fracture fixation surgical simulator

Teaching basic trauma: validating FluoroSim, a digital fluoroscopic simulator for guide-wire insertion in hip surgery

Training safer orthopedic surgeons. Construct validation of a virtual-reality simulator for hip fracture surgery

Surgical simulators and hip fractures: a role in residency training?

Training femoral neck screw insertion skills to surgical trainees: computer-assisted surgery versus conventional fluoroscopic technique

Virtual reality training improves trainee performance in total hip arthroplasty: a randomized controlled trial

Techniques for evaluating training programmes

Measuring the educational impact of simulation training in Trauma & Orthopaedics

Analysis of tools used in assessing technical skills and operative competence in trauma & orthopaedic surgical training

Role of Visuohaptic Surgical Training Simulator in Resident Education of Orthopedic Surgery

Development of a surgical skills curriculum for the training and assessment of manual skills in orthopedic surgical residents

Significance of Preoperative Planning Simulator for Junior Surgeons' Training of Pedicle Screw Insertion

Evaluating internal fixation skills using surgical simulation

How accurately do novice surgeons place thoracic pedicle screws with the free hand technique? Spine

The assessment of professional competence: developments, research and practical implications

Defining consensus: a systematic review recommends methodologic criteria for reporting of Delphi studies

Cadaveric simulation vs standard training for postgraduate trauma & orthopaedic surgical trainees: protocol for the CAD:TRAUMA study multi-centre randomised controlled educational trial

COVID-19. ISCP (Intercollegiate Surgical Curriculum Programme)

A ten-year analysis of Intertrochanteric fractures of the femur

Orthopedic trainees' perceptions of the educational value of daily trauma meetings