Diagnostic Test Accuracy of the YEARS Algorithm for Pulmonary Embolism: A systematic review and meta-analysis

This systematic review and meta-analysis aimed both to evaluate the diagnostic test accuracy of the YEARS algorithm in excluding pulmonary embolism and to compare the advanced imaging utilisation rate of YEARS against standard practice. Published studies were selected across several databases from July 2017 to September 2022 using Joanna Briggs Institute methodology for systematic reviews of diagnostic accuracy. The analysis included ten studies with nearly 14,000 participants. YEARS showed a sensitivity of 96% (95% CI 93-98%) and specificity of 50% (95% CI 33-67%). The risk ratio for advanced imaging was 0.78 (95% CI 0.67-0.90), showing an overall reduction. YEARS is an effective means of safely managing patients with suspected pulmonary embolism.


Introduction
Pulmonary Embolism (PE) carries an incidence of 39-115 per 100,000 people and is the third most common cardiovascular cause of death worldwide. 1,2 The presentation of PE, though sometimes causing overt cardiovascular compromise, can often be non-specific. 4,5 Evidence shows a recent rise in the use of emergency resources to investigate PE compared to previous decades. 6,7 Despite this increase in investigation and diagnosis of PE, the overall mortality has not significantly changed. 3,8 To help with diagnosing patients with suspected PE whilst avoiding the unnecessary use of Advanced Imaging (AI), several algorithms have been developed.
The YEARS algorithm was first published in the Lancet in May 2017. 9 The algorithm uses an abbreviated version of the WELLS algorithm consisting of three criteria: clinical signs and symptoms of Deep Vein Thrombosis (DVT), PE as the most likely diagnosis, and the presence of haemoptysis. Patients are classified as having either a low (no criteria present) or high (one or more criteria present) pre-test probability of PE. Based on this pre-test probability, a higher (1000 ng/ml) or lower (500 ng/ml) D-dimer decision threshold, respectively, is used to categorize patients as low risk (not requiring AI) or high risk (requiring AI). YEARS is presented in Figure 1.
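The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a clinical tool: the function and parameter names are hypothetical, and treating a D-dimer exactly at the threshold as positive is our assumption rather than something stated in the text.

```python
def years_requires_imaging(
    dvt_signs: bool,
    pe_most_likely: bool,
    haemoptysis: bool,
    d_dimer_ng_ml: float,
) -> bool:
    """Return True when the YEARS rule would send the patient for advanced imaging.

    Hypothetical helper for illustration only.
    """
    criteria_present = sum([dvt_signs, pe_most_likely, haemoptysis])
    # No criteria -> low pre-test probability -> higher threshold (1000 ng/ml).
    # One or more criteria -> high pre-test probability -> lower threshold (500 ng/ml).
    threshold_ng_ml = 1000.0 if criteria_present == 0 else 500.0
    # Assumption: a value at or above the threshold counts as positive.
    return d_dimer_ng_ml >= threshold_ng_ml
```

Under these assumptions, a D-dimer of 800 ng/ml would not trigger imaging when no criteria are present, but would when haemoptysis alone is present.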
The algorithm had a sensitivity of 98%, specificity of 55%, positive predictive value of 25% and negative predictive value of 99%. 9 The gold standard in AI for diagnosing PE is Computed Tomography Pulmonary Angiography (CTPA). 10,11 This method carries several risks to the patient, such as cancer development secondary to radiological exposure, 12 as well as the risk of nephrotoxicity and anaphylaxis from the required intravenous contrast. 13 Additionally, Computed Tomography (CT) scans are not always readily available, and a reduction in their use would likely lead to budgetary savings and better use of emergency resources. 14 Alternative AI modalities exist which can modify these risks but do not negate them entirely. 15 This calls into question the liberal use of AI on patients with a low or insignificant chance of having PE. 11 Thus, we pose the following question: what is the diagnostic test accuracy and utility of the YEARS algorithm at excluding PE?
A preliminary search on CINAHL Plus and Medline revealed one systematic review on this topic. The review analysed four different algorithms (one of which was YEARS) within four specific patient sub-groups. 16 Furthermore, this previously published review retrospectively implemented its own study protocol on cohorts from various studies and compared three different clinical decision rules to YEARS. 16 In light of this evidence, we aimed to evaluate the accuracy of YEARS in excluding PE and compare YEARS' AI utilisation rate against standard practice. Our review proposes to broaden this theme, as our included studies differed in cohort population and study design. Furthermore, our outcome metrics differed from their singular use of missed PE and included sensitivity, specificity, likelihood ratios and predictive values.
When evaluating an algorithm designed to exclude PE whilst avoiding AI use, patient safety is fundamental. This will be demonstrated by estimating both the likelihood of missed PE and the incidence of AI exposure. When analysing the utilisation rate of AI, a reference test will be used to demonstrate either an increase or decrease. The reference test will be termed standard practice and may include any alternative algorithms for investigating suspected PE (for example the WELLS algorithm). The objectives of this study are to evaluate the accuracy of YEARS in excluding PE, defined as the sensitivity, specificity, likelihood ratios and predictive values, and to compare the AI utilisation rate of YEARS against standard practice via calculation of the associated risk ratio.
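The accuracy metrics named in the objectives all derive from a 2x2 table of index test result against reference standard. The sketch below shows the standard definitions; the class and function names are hypothetical, the counts in the usage note are invented for illustration, and no guard is included for empty cells (which would cause a division by zero).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of index test result against reference standard."""
    tp: int  # test positive, PE present
    fp: int  # test positive, PE absent
    fn: int  # test negative, PE present (missed PE)
    tn: int  # test negative, PE absent


def accuracy_metrics(t: TwoByTwo) -> dict:
    """Standard diagnostic accuracy metrics from a 2x2 table."""
    sensitivity = t.tp / (t.tp + t.fn)
    specificity = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": t.tp / (t.tp + t.fp),
        "npv": t.tn / (t.tn + t.fn),
        "lr_positive": sensitivity / (1.0 - specificity),
        "lr_negative": (1.0 - sensitivity) / specificity,
    }
```

For instance, `accuracy_metrics(TwoByTwo(tp=90, fp=200, fn=10, tn=700))` describes an invented cohort in which 10 of 100 PE cases are missed. Pooled values from a hierarchical meta-analysis will generally not equal the result of applying these formulas to pooled sensitivity and specificity.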

Methods
We followed the Joanna Briggs Institute (JBI) methodology for systematic reviews of diagnostic test accuracy. 17 Moreover, the study was reported using the Preferred Reporting Items for Systematic review and Meta-Analysis (PRISMA) guidelines and the PRISMA extension for diagnostic test accuracy. 18,19

Inclusion Criteria
The Population, Index test, Reference test and Diagnosis of interest (PIRD) model was utilised to develop the inclusion and exclusion criteria. 17,19
Population: Suspected PE in individuals aged 16 years or older of any ethnicity or gender, living in any geographical location. Studies conducted in ED, inpatient, and outpatient departments were included. Studies involving pregnant participants were excluded.
Index Test: The original YEARS algorithm with D-dimer decision thresholds of 500 ng/ml or 1000 ng/ml, depending on the YEARS score.
Reference Test: A reference test is required to detect true/false PE. This will include either unilateral validation via AI or, alternatively, prospective implementation of YEARS with three-month participant follow-up if no AI is ordered (as is commonly utilised in the Venous Thromboembolism [VTE] literature). Alternative PE algorithms were used as a reference test to assess the rate of AI utilisation.

Diagnosis of Interest: Any thromboembolism within the pulmonary circulatory tree as diagnosed on AI/autopsy. This review shall not delineate between Isolated Sub-segmental Pulmonary Embolism (ISPE) and PE, to improve homogeneity across all studies. Subsequently, DVT (without PE) will not be counted as missed PE.
Study Design: Randomised controlled trials, case-control studies, cross-sectional studies, and retrospective or prospective cohort studies published in peer-reviewed journals were included. Studies which conducted post-hoc analysis upon literature already included for review were excluded to avoid repetition of synthesised outcomes.

Search Strategy
The search strategy was developed with the input of the librarian and the following terms were applied to the databases: ("pulmonary embol*" OR "pulmonary thromboembol*" OR "fibrin split product*" OR "Fibrin degradation product*" OR "D-dimer") [TI, AB] AND ("years score" OR "years study" OR "years algorithm" OR "years tool" OR "years criteria" OR "years rule" OR "years clinical decision" OR "years diagnostic") [TX]
An initial limited search (Step 1) was performed in Medline (EBSCO host, Birmingham, Alabama, USA) in order to identify key words and indexed terms. This informed the development of a comprehensive search strategy (listed above) which was adapted for each database searched (Step 2). The databases used to collate studies were CINAHL Plus (EBSCO host, Birmingham, Alabama, USA), AMED (EBSCO host, Birmingham, Alabama, USA), Medline (EBSCO host, Birmingham, Alabama, USA), and EMBASE (Ovid Technologies, Inc., New York, USA).
All studies published in English between July 2017 (the original publication date of the YEARS algorithm) and September 2022 (date of database search) were included. 9 Data was managed using Rayyan™ reference manager and Microsoft Excel™ software 20,21 and was assessed by two independent authors. Disagreements were resolved through discussion.

Assessment of Methodological Quality
Selected studies were critically appraised for methodological quality using the JBI Critical Appraisal Tool (CAT) checklist for diagnostic test accuracy studies. This instrument is based on 10 "signalling questions" from the revised Quality Assessment of Diagnostic Studies (QUADAS-2) critical appraisal tool for diagnostic test accuracy. 22 The four domains used to assess risk of bias were patient selection (questions 1 to 3), index tests (questions 4 and 5), reference standard (questions 6 and 7), and flow and timing (questions 8 to 10). This provided objective appraisal of potential bias present within the included studies. All studies were included regardless of methodological quality. Of the included studies, 20% were quality assured by two authors, with the remainder critiqued by a single researcher.

Data Extraction
Data from selected studies was extracted by one author utilising a version of the Standards for Reporting Diagnostic accuracy studies (STARD) checklist modified to fit this review's aims and objectives. 23 Two authors completed independent data extraction of a minimum of 20% of studies as quality assurance. Where required data was not reported, we calculated the values from what was provided. If this was not possible, authors of included studies were contacted to provide additional data.

Data Analysis and Synthesis
Data was synthesized using meta-analysis as described below. When meta-analysis was not performed, the synthesis without meta-analysis reporting guidelines were utilised. 24 We performed meta-analysis of the test accuracy in terms of sensitivity and specificity as per the JBI methodology for systematic reviews of diagnostic test accuracy. 17 The meta-analysis results are presented in a forest plot and a summary receiver operating characteristic (ROC) curve. 19,24,25 A hierarchical random-effects logit model was employed using the Stata package metadta. 26 In addition, we were interested in the impact of YEARS in terms of reducing scans compared to other algorithms. We collected data from studies and performed a meta-analysis of the risk ratio of YEARS incurring imaging. The profile likelihood method was adopted as suggested by Kontopantelis and Reeves, 27 using the 'metan' package in Stata 18. 28 The result was again presented in a forest plot. In addition, subgroup analysis was performed to assess the heterogeneity between study designs (prospective and retrospective) regarding both outcomes: diagnostic test accuracy and impact on AI utilisation. Statistical analysis mirrored the protocols described above and was presented via forest plots. This was reasoned to be important as the difference in study design may be a significant contributing factor to heterogeneity between articles.
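To make the risk ratio meta-analysis more concrete, the sketch below pools per-study log risk ratios under a random-effects model. It is not the method used in this review: the analysis above used the profile likelihood approach in Stata's 'metan', whereas this illustration substitutes the simpler DerSimonian-Laird moment estimator of between-study variance. Function names are hypothetical, the 1.96 multiplier assumes a normal approximation, and any counts supplied to it would need to come from the included studies.

```python
import math


def log_rr_and_variance(events_a, total_a, events_b, total_b):
    """Log risk ratio of arm A versus arm B and its large-sample variance."""
    log_rr = math.log((events_a / total_a) / (events_b / total_b))
    variance = 1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b
    return log_rr, variance


def pool_random_effects_dl(log_rrs, variances):
    """DerSimonian-Laird pooled risk ratio with an approximate 95% CI.

    Returns (pooled_rr, ci_low, ci_high, tau_squared).
    """
    weights = [1 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, log_rrs)) / sum_w
    # Cochran's Q measures dispersion of study estimates around the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_rrs))
    df = len(log_rrs) - 1
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau_squared = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each study variance.
    re_weights = [1 / (v + tau_squared) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, log_rrs)) / sum(re_weights)
    se = math.sqrt(1 / sum(re_weights))
    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * se),
        math.exp(pooled + 1.96 * se),
        tau_squared,
    )
```

Each study contributes one `(log_rr, variance)` pair from `log_rr_and_variance`, where arm A is the number of scans indicated under YEARS and arm B the number under the comparator algorithm. Profile likelihood intervals are typically wider than those from this estimator when few studies are pooled.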

Study Inclusion
In total, 226 studies were retrieved and their abstracts read. Twenty-five papers were assessed by full-text reading, of which 15 were excluded with reasons. Of these, six used YEARS as a screening tool on patients diagnosed with another disease, including chronic obstructive pulmonary disease (COPD), sickle cell disease and coronavirus disease. This failed to meet the specified inclusion criteria of 'suspected PE' and is not how YEARS was intended to be applied. Four combined YEARS with additional investigations, such as the Pulmonary Embolism Rule-out Criteria (PERC) rule or C-reactive protein studies, and failed to meet the original YEARS protocol. This left 10 papers included for the systematic review [supplementary material Figure 1].

Characteristics of Included Studies
In total, 13,993 participants were included across 10 studies 9,29-37 with no one study making up more than 25% of the total review cohort. Participants were recruited internationally across 11 countries by way of 39 different hospitals. In total, 1.4% of the participants were lost to follow-up in two studies. 9,31 Most participants were recruited within the ED.
The incidence of PE varied significantly between studies, with an average incidence of 17.2% (SD ±9.9%). Only one study 35 included participants aged 16-17 years old, compared to all other studies, which opted for 18 years or older. Conversely, one study 29 excluded all participants under the age of 50 years. The key characteristics of the included studies are presented in Table 1.

Methodological Quality
All studies 9,29-37 received critical appraisal, with the average result equalling 8.5/10. No studies were at high risk of bias and only one study 35 appeared to be at moderate risk of bias: 7/10. The critical appraisal of the included studies is presented in Table 2. A description of how individual studies were scored for each of the 10 questions within the JBI critical appraisal tool is given below.
All studies 9,29-37 incorporated consecutive enrolment of participants, avoided a case-control design, used the original YEARS and D-dimer decision threshold, interpreted the reference test without knowledge of the index test and allowed for a suitable time between index and reference tests (questions 1, 2, 5, 7, and 8). One study 32 allowed for using venous compression ultrasonography when investigating the presence of a DVT. However, this did not alter the original YEARS algorithm, hence this study was scored favourably on question 5.
In question 3, one study 29 implemented inappropriate exclusions by omitting all participants under 50 years old. In question 4, three studies 9,30,31 interpreted the index test without knowledge of the reference test. One study 9 did not blind clinicians to the D-dimer result before participants had YEARS applied to them. This posed a risk of bias as clinicians knew the result of YEARS before recruitment into the study. 25 Nevertheless, as the D-dimer is separate to the reference test, this study 9 was scored favourably (yes) for question 4.
One study 35 was identified as having the potential for missing PE upon review of their stated reference test (question 6). This study retrospectively reviewed D-dimers ordered for suspected PE and utilised a three-month follow-up for patients who did not receive AI at their initial visit. Three-month follow-up was performed by reviewing for re-presentation to the same ED or further AI ordered. This sub-group made up 73.7% of the cohort. This follow-up was considered at higher risk of missing PE as direct patient follow-up was not performed. Additionally, re-presentation to an alternative ED in the area, of which several were available, was not discussed.
In question 9, participants in six of the studies 29,30,33,34,36,37 uniformly received the same reference test, as these were retrospective chart reviews of CTPA scans ordered for suspected PE. One study 31 had a significant number of participants lost to follow-up who did not undergo AI upon the index visit: 11% of the total cohort (question 10). An additional study 9 also documented participants lost to follow-up; however, this number was minuscule compared to the cohort size (0.1%) and thus was scored positively.

Review Findings
Upon review, three studies 9,30,32 required recalculation of their data to align with the protocol of this review (see Table 1). In total, 25 ISPE were identified, four of which were negative via the YEARS algorithm. Further sub-group analysis was not possible due to insufficient data. However, characteristics documented were malignancy, heart failure, history of VTE, syncope, lower respiratory tract infections (including coronavirus disease 2019), asthma, hormone replacement therapy and Chronic Obstructive Pulmonary Disease (COPD).

Heterogeneity was observed in relation to the two different reference tests used for diagnosis: either a mix of AI or three-month follow-up, or unilateral use of AI. Both strategies were deemed adequate to detect missed PE. Two studies 9,31 prospectively utilised a mix of AI or three-month follow-up depending on the result of YEARS, which, despite causing heterogeneity, held high value as it produced data within the live clinical environment. 25,38 Sub-group analysis was performed to compare prospective vs. retrospective studies (two groups) and it was found that there was little evidence of heterogeneity across the study designs. This is suggested by the highly overlapping pooled 95% confidence intervals for sensitivity and specificity between the two groups (supplementary material Figure 2). The difference in sensitivity was small, whereas greater deviation was observed for specificity between different study types. Pooled outcomes observed upon sub-group analysis of the efficacy of YEARS in terms of the risk ratio of advanced imaging utilisation were very similar between the two groups (supplementary material Figure 3). Again, little or no evidence of heterogeneity by study design was observed.
Given the lack of heterogeneity across the study designs, meta-analysis based on all studies was presented in the main body of our paper. Figure 2 shows a forest plot demonstrating the meta-analysis results of sensitivity and specificity. 26 Figure 3 shows the summary ROC plot with effect-analysis. 26 The overall outcome metrics as per the first primary objective were calculated: sensitivity = 96% (95% CI 93-98%) and specificity = 50% (95% CI 33-67%). The sensitivity calculated via meta-analysis held a reassuringly narrow confidence interval, suggesting good between-study reproducibility for this metric. This was not the case for the specificity, which held a wide confidence interval and was inconsistent. This is shown within the summary ROC plot, where the prediction region suggests significant heterogeneity between studies despite an identical decision threshold being universally applied. Further pooled statistics were calculated: positive and negative predictive values = 29% and 99%, and positive and negative likelihood ratios = 2.35 and 0.06. 39
Six categories of reference tests were identified to compare rates of AI utilisation: Dichotomized WELLS (D-WELLS), altered D-WELLS, three-tier WELLS, age-adjusted three-tier WELLS, age-adjusted D-WELLS and clinical gestalt. Five studies 9,29,33,35,36 utilised the D-WELLS, though one study, 33 despite inferring AI was reduced, did not supply statistical data for this. The author was contacted; however, we were unable to resolve this query. 33 The most commonly used reference test was D-WELLS, whereas the one which consistently fared the strongest against YEARS was age-adjusted D-WELLS.
Published online letters 40,41 reported that the D-WELLS score used as a reference test in one study 30 considered a result positive only if both a score greater than four and a D-dimer level above 500 ng/ml were present. This deviates from any known version of WELLS and would theoretically produce a lower rate of AI utilisation via the 'threshold effect'. 42 A published response to this letter from the authors 41 confirmed they do not support the use of this altered D-WELLS in clinical practice. Because of this, data regarding AI utilisation from this study 30 was not used for meta-analysis. In the case of both prospective studies, 9,31 the reference test was retrospectively applied to the same sample population.
Figure 4 shows the risk ratio of AI being required between YEARS and the reference tests available. 26 The combined risk ratio of AI utilisation attributed to the use of YEARS was 0.78 (95% CI 0.67-0.90). This indicates that YEARS reduced the relative risk of AI being required by 22%. The mean reduction of AI utilisation without effect analysis was 11%. Only one study demonstrated a minimal increase in AI utilisation. As demonstrated, the results between studies were varied despite the overall gross reduction of scans, which is signified by the relatively wide confidence interval seen upon meta-analysis. Despite this, the confidence interval of the combined data lay outside of the null effect, indicating statistical significance.
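The interpretation of the pooled risk ratio above rests on two small pieces of arithmetic: converting a risk ratio into a relative reduction, and checking whether the confidence interval excludes the null value of 1. A minimal sketch, using hypothetical helper names and the pooled figures reported in this review:

```python
def relative_reduction_pct(risk_ratio: float) -> float:
    """Relative risk reduction, in percent, implied by a risk ratio."""
    return (1.0 - risk_ratio) * 100.0


def excludes_null(ci_low: float, ci_high: float, null_value: float = 1.0) -> bool:
    """True when the confidence interval does not contain the null value."""
    return not (ci_low <= null_value <= ci_high)


# Pooled estimate and 95% CI as reported in the text.
pooled_rr, ci_low, ci_high = 0.78, 0.67, 0.90

point = round(relative_reduction_pct(pooled_rr))
# The upper CI bound gives the smallest reduction, the lower bound the largest.
smallest = round(relative_reduction_pct(ci_high))
largest = round(relative_reduction_pct(ci_low))
significant = excludes_null(ci_low, ci_high)
```

With these inputs the point estimate corresponds to a relative reduction of about 22%, with the interval spanning roughly 10% to 33%, and the interval excludes 1. This is a relative reduction rather than a change in percentage points, which is why it differs from the 11% mean reduction quoted without effect analysis.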

Discussion
This systematic review evaluated the diagnostic test accuracy of the YEARS algorithm on nearly 14,000 patients. All participants were recruited using a probability sampling strategy by way of 48 different sampling events internationally (including sites used more than once within a different time period). Malignancy, respiratory or cardiac disease, respiratory tract infections, previous VTE, syncope and hormone replacement therapy: these were among the characteristics of the diverse cohort recruited and represent common challenges when investigating PE due to their increased risk of VTE and/or similar clinical presentations. 1
Upon review of the YEARS algorithm, the combined sensitivity and specificity were demonstrated to be 96% and 50%. The confidence intervals shown in the forest plot suggest the sensitivity to be largely consistent between studies. Several risks of potential bias were noted within studies, risking over-representation of the sensitivity and under-representation of the specificity. These included a failure to blind clinicians to D-dimer levels in one study 9 and seven studies 29,30,32-34,36,37 being retrospective chart reviews of CTPA requests. It is unknown to what degree this review was affected by these variables, if at all.
In this review, the diagnostic test accuracy of YEARS has been shown to be effective for use in the clinical environment for safely excluding disease in suspected PE. In fact, if the missed ISPE were excluded from the false negatives, which emerging evidence may support, the miss rate would be even lower than what was demonstrated in this review. 43,44 When analysing the ability to correctly identify patients without disease, on the other hand, the specificity was largely inconsistent and low. This was similar to the original YEARS study, which found the specificity to be only 5 percentage points higher than this review. 9 Despite this, it can reasonably be proposed that the ability to avoid missing true PE is more valuable than the specificity.
The fear of missing PE has been acknowledged, at least in part, to be one of the driving factors of over-utilising AI and the avoidance of using clinical decision rules and algorithms. 7,45 One of the prospective studies 31 demonstrated a large proportion of patients where AI was requested against the YEARS protocol. This highlights the presence of mistrust felt by clinicians during clinical use. In relation to this, YEARS did hold a reassuringly high sensitivity and a low negative likelihood ratio of 0.06. In fact, the rate of missed PE within the combined cohort was only 0.5%. This falls well below the generally accepted miss rate for PE of 2%, indicating YEARS is likely safe for patient use when considering the risk of missed PE. 46 In addition to a low miss rate, YEARS must also reduce unnecessary AI ordering. In this regard YEARS also appeared to hold value, as it demonstrated a 22% relative reduction in the risk of AI being required (risk ratio 0.78). This reduction is statistically significant. These results suggest that YEARS is effective at reducing AI utilisation compared to several different forms of alternative PE algorithms. As is demonstrated in the current literature on PE, over-investigation with AI causes increased risk to both the patient and the health care system. 6,7,12,13
No significant selection bias was observed within the participant exclusion criteria listed across all studies. Common exclusions were the presence of YEARS exclusion criteria such as pregnancy, incomplete participant data, recent use of anticoagulants or a life expectancy of less than three months. An exception to this, though minimal in our opinion, was the exclusion of participants aged 50 years or less in one small study. 29 Two sub-groups of patients appeared at risk of falling below the acceptable level of reference testing for PE and made up 12.6% of the total cohort. These were either participants lost during three-month follow-up or participants who did not receive AI within one study, 35 due to the concerns discussed during critical appraisal.
Regarding prospective vs. retrospective studies, retrospective analysis is often chosen in studies of diagnostic test accuracy due to data being readily available. 25 This can present risks of error when implementing a protocol compared to prospective implementation. 25 For instance, one study 33 decided on whether PE was the most likely diagnosis retrospectively from chart review, depending on whether the patient had a known disease which would explain breathlessness (e.g. COPD). In practice, however, the clinical acumen needed for this decision is more complex. In spite of this, sub-group meta-analysis by study design demonstrated minimal differences regarding outcomes.
Another point for consideration was studies which conducted sampling via retrospective data of CTPA ordered for suspected PE. Such studies may have recruited a proportionally higher cohort of individuals who were at high risk of PE compared to the 'typical' patient with suspected PE. To elaborate, it could be surmised that patients who received CTPA, ordered according to the local protocols, were more likely to have PE compared to patients who had PE excluded without CTPA (and thus were not recruited). This could risk the results overestimating sensitivity and underestimating specificity. 25,38
Interestingly, out of the nine studies 9,29-31,33-37 which included comparative data of YEARS versus an alternative algorithm, seven 9,29,31,33-36 indicated the YEARS algorithm did not produce the lowest rate of missed PE. It was not in the scope of this review to compare the diagnostic accuracy of YEARS against alternative algorithms; therefore, no comment can be made on the superiority or inferiority of YEARS concerning the diagnostic test accuracy within this review.
This review has several limitations. Studies published in languages other than English were excluded. Furthermore, grey literature was not included. 47 A single researcher conducted most of the data extraction and critical appraisal. To mitigate this risk, two authors were consulted throughout the process and 20% of the included studies received calibration exercises of critical appraisal and data extraction to moderate against error and discuss discrepancies. 48

Conclusion
This review aimed to evaluate the diagnostic test accuracy of YEARS when assessing patients presenting with suspected PE. This review concluded that the YEARS algorithm holds a sufficiently high sensitivity to avoid missing true PE. The specificity suggests YEARS has poor accuracy at ruling in PE (without AI). However, despite the relatively poor specificity, the use of AI was reduced compared to other reference tests analysed. It was demonstrated that the studies synthesised included a wide range of ages, demographics, and genders, with variable medical histories and clinical presentations, in varied clinical settings. This suggests that the results from this review can be applied to a wide range of patient demographics seen in clinical practice. Further research on the implementation of YEARS prospectively is needed to accurately demonstrate its outcomes on patient care during live clinical use. As was discussed, the limitations of this review predominantly stemmed from the use of retrospective study methodology. Investigating patients presenting with suspected PE remains a common dilemma for clinicians. The YEARS algorithm has been shown to constitute a possible means of safely managing this patient demographic.

Figure 1: The YEARS Algorithm
Legend: DVT = Deep Vein Thrombosis. PE = Pulmonary Embolism. AI = Advanced Imaging. Adapted from Van Der Hulle T, et al. 9

Figure 2: Forest plot of meta-analysis of sensitivity/specificity

Figure 3: Summary Receiver Operating Characteristics of meta-analysis of diagnostic test accuracy

Figure 4:

Table 2: Critical appraisal of included studies
Legend: Y = Yes. N = No. U = Unclear. T = Total score out of 10
1. Was a consecutive or random sample of patients enrolled?
2. Was a case control design avoided?
3. Did the study avoid inappropriate exclusions?
4. Were the index test results interpreted without knowledge of the results of the reference standard?
5. If a threshold was used, was it pre-specified?
6. Is the reference standard likely to correctly classify the target condition?
7. Were the reference standard results interpreted without knowledge of the results of the index test?
8. Was there an appropriate interval between index test and reference standard?
9. Did all patients receive the same reference standard?
10. Were all patients included in the analysis?