Scientific Article | Volume 8, Issue 6, 101234, November 2023

Augmenting Quality Assurance Measures in Treatment Review with Machine Learning in Radiation Oncology

Open Access | Published: April 06, 2023 | DOI: https://doi.org/10.1016/j.adro.2023.101234

      Purpose

Pretreatment quality assurance (QA) of treatment plans often requires a high cognitive workload and considerable time expenditure. This study explores the use of machine learning to classify pretreatment chart check QA for a given radiation plan as difficult or less difficult, thereby alerting physicists to apply increased scrutiny to difficult plans.

      Methods and Materials

      Pretreatment QA data were collected for 973 cases between July 2018 and October 2020. The outcome variable, a degree of difficulty, was collected as a subjective rating by physicists who performed the pretreatment chart checks. Potential features were identified based on clinical relevance, contribution to plan complexity, and QA metrics. Five machine learning models were developed: support vector machine, random forest classifier, adaboost classifier, decision tree classifier, and neural network. These were incorporated into a voting classifier, where at least 2 algorithms needed to predict a case as difficult for it to be classified as such. Sensitivity analyses were conducted to evaluate feature importance.

      Results

      The voting classifier achieved an overall accuracy of 77.4% on the test set, with 76.5% accuracy on difficult cases and 78.4% accuracy on less difficult cases. Sensitivity analysis showed features associated with plan complexity (number of fractions, dose per monitor unit, number of planning structures, and number of image sets) and clinical relevance (patient age) were sensitive across at least 3 algorithms.

      Conclusions

      This approach can be used to equitably allocate plans to physicists rather than randomly allocate them, potentially improving pretreatment chart check effectiveness by reducing errors propagating downstream.

      Introduction

Radiation treatment plans are created using a complex, iterative process. The patient's clinical history, cancer diagnosis, and normal and malignant tissue anatomy are all considered to create a personalized plan. A multidisciplinary team of professionals, including physicians, physicists, and dosimetrists, is involved, and every step of the process must be verified and checked multiple times. According to current quality assurance (QA) recommendations, the physicist or dosimetrist manually performs “chart checks” (ie, plan QA) of multiple metrics for each patient's radiation plan before it can be made deliverable.
This plan evaluation also involves checking many data elements and documents, including (but not limited to) the treatment plan images, fused images, contours, simulation documents, treatment prescription, treatment plan parameters/reports, and prior radiation records. It can be a time-intensive process involving a high cognitive workload, and, although it is one of the most important safety barriers, it is estimated to be only 60% effective in detecting high-severity incidents.
      Recent American Association of Physicists in Medicine task groups have suggested methods for risk analysis, but there are currently very few concrete guidelines outlining the plan quality control process.
      Instead, most departments have developed institution-specific chart check standards of practice.
      However, the standards of practice vary widely among institutions.
Although the majority use written procedures or checklists, the details of what is reviewed or checked are heterogeneous.
      Automation of the plan evaluation process has been successfully used before, typically using rules- or atlas-based approaches.
      This type of approach, however, is limited in its ability and is not easily adaptable to changes in the treatment planning process that will inevitably occur over time as technology improves.
At our institution, in-house QA tools of this type currently augment the chart checking process by automating standardized second checks, thus improving efficiency and reducing cognitive workload.
Artificial intelligence (AI) and machine learning (ML) are increasingly used to improve QA processes in 4 broad areas: machine QA, patient-specific QA, treatment plan review, and QA of contours. In a recent review, Luk et al
      have reported that despite the importance of treatment plan review, fewer studies have explored the application of AI and ML models in assisting treatment plan review. Azmandian et al
developed an outlier detection model that clustered a large set of treatment plans; when a new treatment plan was checked, its parameters were tested to determine whether they belonged to the established clusters. Plans that did not belong to these clusters were identified as “outliers” and brought to the attention of human chart checkers. Although the k-means clustering algorithm helps identify problematic plans, it does not provide any information on the factors contributing to treatment plan complexity. Kalet et al
      and Luk et al
      developed an error detection Bayesian network that mimics human reasoning processes to improve the detection of errors during the treatment plan review. Triaging radiation treatment plans as difficult and less difficult before treatment plan review is likely to optimize physicists’ cognitive workload and reduce potential errors during plan review.
      To the best of our knowledge, no previous ML studies have examined a comprehensive array of factors to determine the degree of difficulty to check radiation treatment plans.
      The research objective herein is to use machine learning to identify and flag difficult cases that require additional scrutiny by the physicist, potentially leading to fewer errors evading this check and propagating downstream in the clinical workflow to affect patient treatment. We used the clinical research framework presented by Park et al
      to frame our study. Park et al
      described clinical research in 5 phases, and our study illustrates phase 0 (ie, user needs and workflow assessment, data quality check, algorithm development and performance evaluation, prototype design) and phase 1 (ie, in silico algorithm performance optimization). The analysis demonstrates a classification algorithm that categorizes the radiation treatment plans as difficult or less difficult to QA via the physics pretreatment chart check.

      Methods and Materials

      Data collection

We collected data from physics pretreatment chart checks for treatment plans of each patient, encompassing all cancer sites. Data used in this analysis were collected from July 2018 to October 2020. The study was designed with an interdisciplinary team of engineers, clinicians, and physicists who conducted a user needs and workflow assessment and selected data attributes to be considered for inclusion by the machine learning models. Data quality checks were performed, and the selected data attributes were based on clinical relevance, contribution to plan complexity, and QA metrics (Table 1). The outcome variable, the degree of difficulty of each treatment plan, was collected as a subjective rating by physicists on a scale of 1 to 10, normalized across the data set, after they completed their pretreatment chart checks. Each plan was checked by a single physicist, and the degree of difficulty was rated by one of 16 physicists with an average of approximately 10 years of experience in pretreatment plan review, supporting their credibility as subject matter experts. Multiple other machine learning studies have used expert ratings as an outcome variable.
      The study was determined “not human subjects research” by the UNC-Chapel Hill Institutional Review Board (IRB# 19-2984).
Table 1. Attributes considered for inclusion by machine learning models.

Feature category | Feature name | Feature details, mean (standard deviation)
Clinical relevance | Patient age | Mean: 62 y (SD, 15 y)
Clinical relevance | Patient sex | Male: 52.73%; Female: 47.27%
Clinical relevance | Site name | 1. Head and neck (21.5%); 2. Brain (8.2%); 3. Spine (6.1%); 4. Right breast (6.9%); 5. Left breast (6.5%); 6. Chest (6.3%); 7. Abdomen (3.9%); 8. Esophagus (0.6%); 9. Lung (10.7%); 10. Pelvis (11.6%); 11. Prostate bed (2.6%); 12. Prostate (6.8%); 13. Extremities (6.0%); 14. Missing (0.5%)
Plan complexity | No. of isocenters | Mean: 1 (SD, 1)
Plan complexity | No. of fractions | Mean: 16 (SD, 1)
Plan complexity | No. of beam sets | Mean: 2 (SD, 1)
Plan complexity | No. of image sets | Mean: 5 (SD, 6)
Plan complexity | No. of planning structures | Mean: 33 (SD, 24)
Plan complexity | Dose per monitor unit (MU) | Mean: 0.64 (SD, 0.33)
Plan complexity | Pacemaker (yes/no) | Yes: 0.01%; No: 99.9%
Plan complexity | Pregnant (yes/no) | Yes: 0.8%; No: 99.2%
Plan complexity | Previous treatment (yes/no) | Yes: 14.0%; No: 86.0%
Quality assurance | No. of doctors | Mean: 1 (SD, 1)
Quality assurance | No. of dosimetrists | Mean: 1 (SD, 1)
Quality assurance | Accelerated schedule (yes/no) | Yes: 27.2%; No: 72.3%; Missing: 0.5%
Quality assurance | Physicist ID (physicist on plan) | No. of physicists: 16
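The attributes in Table 1 mix continuous values (eg, age, number of fractions) with categorical values (eg, site name, physicist ID), and the article notes later that one-hot-encoded vectors were used for the neural network inputs. As an illustration only, the following scikit-learn sketch shows one plausible way to encode such a feature table; the column names and the ColumnTransformer pipeline are assumptions, not the authors' code.

```python
# Illustrative preprocessing sketch; column names and the ColumnTransformer pipeline
# are assumptions, not the authors' exact code.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names mirroring the Table 1 attributes
numeric_cols = ["age", "n_isocenters", "n_fractions", "n_beam_sets",
                "n_image_sets", "n_planning_structures", "dose_per_mu"]
categorical_cols = ["sex", "site_name", "pacemaker", "pregnant",
                    "previous_treatment", "accelerated_schedule", "physicist_id"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # scale continuous features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot encode categories
])

# X = preprocessor.fit_transform(df)   # df is a pandas DataFrame of the Table 1 attributes
```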

      Data set

The data set comprised 973 patient plans; a patient could have multiple plans. All analyses were performed using Python scikit-learn (sklearn) packages.
The outcome variable (degree of difficulty of plans) was normalized to the physicist performing the chart check; the top 30% (8-10 on the scale) were labeled as difficult and the bottom 70% (1-7 on the scale) as less difficult. The 30% threshold was chosen through discussions with physicists and clinicians during the user needs assessment. The purpose of the classification was to identify the cases that were most difficult to check. More complex plans are more error prone, and checking them imposes the greatest cognitive workload; therefore, the class boundaries were set by differentiating the bottom 70% (less difficult) from the top 30% (difficult) of cases. The data set was split such that 778 cases were used as a training set and 195 cases were used as the test set. The test set was set aside to evaluate performance after all model training and parameter tuning. The training set was further split into a train set (622 cases) and a development set (156 cases). The development set was used for hyperparameter tuning and voting procedure selection, as described later. The train set was randomly oversampled to balance the number of difficult and less difficult cases for training. Random oversampling consisted of randomly duplicating cases in the difficult class. Before oversampling, the train set contained 443 less difficult cases and 179 difficult cases. After oversampling, the set contained 443 less difficult cases and 358 difficult cases (Fig. 1A).
Figure 1. (A) Data set preparation. (B) Current clinical workflow is shown in the flow diagram at the top of the figure. Proposed input from machine learning to the current clinical workflow is shown below.
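The split and oversampling described above can be sketched as follows; the per-physicist normalization is not specified in detail in the article, so the percentile-rank labeling, column names, and random seeds below are illustrative assumptions.

```python
# Sketch of the data set preparation described above; labeling details, column names,
# and random seeds are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def label_difficult(df: pd.DataFrame) -> pd.Series:
    # Normalize the 1-10 rating within each physicist and flag the top 30% as difficult (1)
    pct = df.groupby("physicist_id")["difficulty_rating"].rank(pct=True)
    return (pct > 0.70).astype(int)

def split_and_oversample(X, y, seed=0):
    # 973 cases -> 778 train+dev / 195 test, then 622 train / 156 dev (as reported in the text)
    X_trdev, X_test, y_trdev, y_test = train_test_split(X, y, test_size=195, random_state=seed)
    X_tr, X_dev, y_tr, y_dev = train_test_split(X_trdev, y_trdev, test_size=156, random_state=seed)

    # Random oversampling: randomly duplicate difficult-class training cases
    # (the article reports 179 difficult cases becoming 358 after oversampling)
    difficult = X_tr[y_tr == 1]
    extra = resample(difficult, replace=True, n_samples=len(difficult), random_state=seed)
    X_bal = pd.concat([X_tr, extra])
    y_bal = pd.concat([y_tr, pd.Series(1, index=extra.index)])
    return X_bal, y_bal, X_dev, y_dev, X_test, y_test
```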

      Classification

A voting classifier was used for binary classification to predict the degree of difficulty of radiation treatment plans by classifying them as less difficult (bottom 70%) or difficult (top 30%). A voting classifier is a collection of algorithms that are trained in parallel to take advantage of the unique strengths of each and mitigate their weaknesses. The voting classifier consisted of 5 supervised algorithms: support vector machine (SVM), decision tree, random forest classifier, adaboost, and neural network. As all the algorithms learn differently, a voting procedure was selected based on the results from training with the development set. The number of algorithms in agreement was selected to maximize the number of difficult cases accurately predicted. For a case to be marked as difficult, at least 2 algorithms needed to predict that the case was difficult.
      A decision tree classifier uses a tree structure to create rules for features and to classify cases. For example, a rule could be if a patient is female, continue through the tree to ask if the patient is pregnant or not. Decision tree classifiers are interpretable, allowing end users to understand how classifier decisions were made. A random forest classifier creates multiple decision trees, where each decision tree will predict the outcome individually, and the classifier will select the most frequent prediction as the output. Random forest classifiers have quick runtimes and are typically well-performing but can be prone to overfitting. Similarly, adaboost, short for adaptive boosting, uses the boosting method to allow algorithms to learn from their mistakes. A decision tree was used as the base for adaboost, meaning that after training the first decision tree, adaboost created a second tree that could reduce the errors of the first; this process continued until the maximum number of models was created. Adaboost combines weak learners to build a strong classifier but can be sensitive to noisy data and outliers. The data set used in this study does not have many outliers, making adaboost a suitable option for classification. Other linear and unsupervised approaches were tested and not included in the study because they did not perform well on the data set. Including 3 tree-based algorithms may lead to bias in the voting classifier's prediction but including SVM and a multilayer neural network can mitigate the bias. SVM is an algorithm that takes each data point and plots it in an n-dimensional space where n is equal to the number of features included in the model. It then functions by drawing a division between data points of the 2 outcome categories, thereby classifying them. SVM works well with high-dimensional data but does not perform well when target classes overlap. The class distinction in this study makes the boundaries clear between difficult and less difficult cases, thereby overcoming this weakness. The fifth algorithm used was a multilayer neural network, which takes a collection of fully connected nodes and learns weights for each connection to predict an output. Neural networks can be considered black box algorithms, but they are typically high performing and can generalize well given new data. Details of the parameter values used to train the algorithms are given in the Appendix.
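A minimal sketch of this 5-algorithm ensemble is shown below. scikit-learn's built-in VotingClassifier applies majority voting, so the 2-of-5 rule described above is implemented here by summing the individual hard predictions; the hyperparameters are placeholders, not the tuned values reported in the Appendix.

```python
# Sketch of the 5-algorithm voting ensemble with a 2-of-5 "difficult" threshold;
# hyperparameters are placeholders, not the tuned values from the Appendix.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

models = {
    "svm": SVC(kernel="rbf"),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "adaboost": AdaBoostClassifier(),                      # decision-tree base learner by default
    "neural_network": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000),
}

def fit_all(X_train, y_train):
    for model in models.values():
        model.fit(X_train, y_train)

def predict_difficult(X, votes_needed=2):
    # Flag a case as difficult (1) when at least `votes_needed` of the 5 algorithms agree
    votes = np.sum([model.predict(X) for model in models.values()], axis=0)
    return (votes >= votes_needed).astype(int)
```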

      Feature selection

      Feature selection consists of identifying and retaining attributes that meaningfully contribute to prediction. Each model determines relationships between the features and the output differently, so each algorithm individually determined the meaningful features (Fig. 1B). Forward feature elimination was used with cross validation to iteratively select features for algorithms with a feature importance parameter to generate the best feature subsets for classification. Two algorithms, SVM and neural network, did not have a feature importance parameter in sklearn, so all features were used.
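One way to realize this cross-validated forward selection for the models that expose feature importances is scikit-learn's SequentialFeatureSelector, sketched below as an assumed equivalent of the procedure described rather than the authors' exact implementation.

```python
# Sketch of forward, cross-validated feature selection for the tree-based models;
# an assumed equivalent of the procedure described, not the authors' exact code.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

selector = SequentialFeatureSelector(
    DecisionTreeClassifier(),   # repeated analogously for random forest and adaboost
    direction="forward",        # add one feature at a time
    scoring="accuracy",
    cv=5,                       # cross-validation within the training data
)
# selector.fit(X_train, y_train)
# selected_mask = selector.get_support()   # boolean mask of retained features
```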

      Sensitivity analysis and feature explanations

      Sensitivity analysis was conducted on the test set to assess how much each feature contributed to predicting degree of difficulty. For each algorithm, feature sensitivity was tested by changing feature values to the minimum value of the feature and then to the maximum value of the feature. For example, if the minimum age was 16 years and the maximum age was 94 years, all age values were first changed to 16, and the model was run. Then, all age values were changed to 94 and the model was run. This process was repeated for each feature. Features were considered more sensitive if model accuracies, after changing all feature values, fell outside the prediction accuracy confidence interval for each model. The directional influences of features on decision tree and support vector machine predictions on the test set were examined using SHapley Additive exPlanations (SHAP) plots.
SHAP is an explainability approach based on game theory that sums features' individual contributions to each of a model's predictions. SHAP is a widely used, model-agnostic approach that allows users to understand the model rationale behind predictions. Other explainability approaches exist, such as Local Interpretable Model-Agnostic Explanations (LIME), which provides local explanations for predictions, but SHAP was selected because it can also provide global, aggregate explanations that summarize feature contributions to predictions.
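The min/max perturbation procedure described at the start of this subsection can be sketched as follows; the feature and model names are placeholders, and the confidence-interval comparison is indicated only in a comment.

```python
# Sketch of the min/max perturbation sensitivity analysis; names are placeholders.
from sklearn.metrics import accuracy_score

def feature_sensitivity(model, X_test, y_test, feature):
    """Test accuracy after setting `feature` to its minimum value and to its maximum value."""
    accuracies = {}
    for bound in ("min", "max"):
        X_perturbed = X_test.copy()
        value = X_test[feature].min() if bound == "min" else X_test[feature].max()
        X_perturbed[feature] = value                      # overwrite the entire column
        accuracies[bound] = accuracy_score(y_test, model.predict(X_perturbed))
    return accuracies

# A feature is then called "sensitive" for a model if either perturbed accuracy falls
# outside that model's prediction-accuracy confidence interval.
```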

      Results

      Feature selection

      Feature selection was conducted for algorithms that support feature importance in sklearn: decision tree, random forest, and adaboost.
      All features were used to train SVM and the neural network. The most meaningful features selected by each algorithm are shown in Table 2.
Table 2. Most meaningful features in prediction per model in algorithms with feature importance parameters, organized by feature category.

Feature category | Decision tree | Random forest | Adaboost
Clinical features | Age | Age; site name | Age; site name
Plan features | No. of fractions; No. of image sets; No. of planning structures; Dose per MU | No. of fractions; No. of image sets; No. of planning structures; Dose per MU | No. of fractions; No. of planning structures; Dose per MU
QA features | Physicist ID | Physicist ID | Physicist ID
Abbreviation: QA = quality assurance.

      Classification

      The 5 algorithms were trained using the oversampled train set and development set, and performance was evaluated on the test set (Fig. 1A, Table 3). Overall accuracy, class accuracies, positive predictive value (PPV) (ie, precision) and sensitivity (ie, recall) were used as metrics for evaluation. The highest algorithm performance for each metric was compared against the voting classifier performance. The adaboost classifier had the highest overall accuracy of 81.54% on the test set, and the voting classifier performed with 77.44% overall accuracy. Adaboost attained the highest less difficult class accuracies in development and testing. For the difficult class, the voting classifier outperformed all individual algorithms with accuracies of 86.27% in the development set and 76.47% in the test set.
Table 3. Overall and class accuracies, PPV, and sensitivity for all algorithms and the voting classifier across development and test sets.

Accuracy
Algorithm | Dev overall | Dev less difficult | Dev difficult | Test overall | Test less difficult | Test difficult
SVM | 85.26% | 86.67% | 82.35% | 70.77% | 73.88% | 63.93%
Decision tree | 80.13% | 84.76% | 70.59% | 79.49% | 88.06% | 60.66%
Random forest | 83.33% | 86.67% | 76.47% | 77.95% | 86.57% | 59.02%
Adaboost | 80.77% | 89.52% | 62.75% | 81.54% | 94.03% | 54.10%
Neural network | 75.64% | 77.14% | 72.55% | 74.87% | 79.10% | 65.57%
Voting | 87.18% | 82.86% | 86.27% | 77.44% | 78.36% | 76.47%

PPV and sensitivity
Algorithm | Dev PPV | Dev sensitivity | Test PPV | Test sensitivity
SVM | 83.00% | 84.51% | 67.26% | 68.91%
Decision tree | 77.40% | 77.68% | 76.45% | 74.36%
Random forest | 80.97% | 81.57% | 74.47% | 72.80%
Adaboost | 78.80% | 76.13% | 81.15% | 74.06%
Neural network | 72.96% | 74.85% | 71.14% | 72.34%
Voting | 81.76% | 84.57% | 74.42% | 76.88%

Abbreviations: PPV = positive predictive value; SVM = support vector machine.
PPV and sensitivity across all algorithms are shown in Table 3. On the development set, SVM had the highest PPV of 83.00%. The voting classifier and random forest performed similarly, with PPVs of 81.76% and 80.97%, respectively. The voting classifier had the highest sensitivity of 84.57% on the development set, followed closely by SVM with a sensitivity of 84.51%. On the test set, adaboost had the highest PPV of 81.15%, followed by decision tree with a PPV of 76.45%. The voting classifier had the highest sensitivity of 76.88%, followed by decision tree with a sensitivity of 74.36%. The voting classifier had the highest sensitivity for both the development and test sets, while individual algorithms performed better for PPV. The areas under the receiver operating characteristic curves, used to evaluate performance for SVM, decision tree, random forest, and adaboost (Fig. 2), were 0.79, 0.74, 0.85, and 0.84, respectively, on the test set. We used one-hot-encoded vectors for the input and output labels of the neural network and therefore were not able to perform a receiver operating characteristic curve analysis on it.
Figure 2. Receiver operating characteristic curves from prediction on both test sets.
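The metrics reported in Table 3 and Figure 2 can be computed with standard scikit-learn functions, as sketched below; the binary labels (1 = difficult) and fitted models come from the earlier sketches, and the authors' exact averaging conventions may differ slightly.

```python
# Sketch of the reported evaluation metrics (assumes binary labels with 1 = difficult);
# the authors' exact averaging conventions may differ.
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_pred, y_score=None):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    metrics = {
        "overall_accuracy": accuracy_score(y_true, y_pred),
        "less_difficult_accuracy": tn / (tn + fp),   # accuracy on the less difficult class
        "difficult_accuracy": tp / (tp + fn),        # accuracy on the difficult class
        "ppv": precision_score(y_true, y_pred),      # positive predictive value (precision)
        "sensitivity": recall_score(y_true, y_pred), # recall
    }
    if y_score is not None:   # eg, decision_function(X) or predict_proba(X)[:, 1]
        metrics["roc_auc"] = roc_auc_score(y_true, y_score)
    return metrics
```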

      Sensitivity analysis and feature explanations

To further evaluate the selected features, sensitivity analysis was conducted for all algorithms. The most sensitive features, those whose perturbed accuracies fell outside the confidence interval of each model's accuracy, and the magnitude of the differences in prediction accuracy after changing values to the feature minimum and maximum are shown in Appendix E1. SVM had 3 sensitive features for which changing the feature values resulted in large differences in model accuracy; changing the number of beam sets from 8 to 1 produced the greatest change, from 31.79% to 68.20% (36.41% difference). Decision tree had 5 sensitive features; changing the number of fractions (its most sensitive feature) from 50 to 1 produced accuracies of 64.10% and 76.92% (12.82% difference). Random forest had 2 sensitive features, the most sensitive of which, the number of planning structures (changed from 107 to 3), affected model accuracy minimally with a difference of 4.10%. Adaboost had 6 sensitive features; changing site name had the largest effect on model performance, with accuracies of 73.85% and 79.49% (5.64% difference). The neural network had 10 sensitive features; its most sensitive feature, the number of beam sets (changed from 8 to 1), changed model accuracy by 11.34%.
Directional influences of the included features on test set predictions from decision tree (Fig. 3A) and SVM (Fig. 3B) were examined with SHAP plots. Number of planning structures, number of image sets, dose per monitor unit (MU), age, and number of fractions were the features with directional influence in the decision tree model. Physicist ID is not shown in the SHAP plot because it is not a continuous variable. Consistent with the associated literature, low numbers of planning structures, image sets, and fractions, high doses per MU, and older ages were associated with less difficult predictions. All features were included for SVM, and Fig. 3B shows the directional influence of each feature as well as which features were meaningful for the model; features are listed in descending order of their effect on the model. The numbers of planning structures, image sets, beam sets, and fractions; site name; dose per MU; accelerated schedule; age; physicist ID; and sex were meaningful, whereas the remaining features had effects near 0. Lower numbers of planning structures, image sets, beam sets, and fractions were associated with less difficult predictions. Higher doses per MU and accelerated plans were associated with the less difficult class. The association of accelerated plans with less difficulty goes against our intuition, and further examination is required to understand this model behavior. Sex is a binary variable, with males denoted as 0 and females denoted as 1, and the plot shows that plans for female patients were associated with less difficulty. Patient sex was balanced among cases labeled less difficult (51% female), and 15% of less difficult plans were breast plans. There may be other factors not accounted for in the model that led to female patients being associated with less difficult plans to check. Site name and physicist ID are categorical variables with multiple categories, so we were unable to evaluate how meaningful they were with SHAP; however, physicist ID may have been meaningful because physicists have different perceptions of plan difficulty.
Figure 3. (A) SHapley Additive exPlanations (SHAP) summary plot for difficult class predictions from decision tree on the held-out test set. (B) SHAP summary plot for predictions from support vector machine on the held-out test set. Each point in the SHAP beeswarm summary plot represents a SHAP value and an instance, where higher feature values are represented with dark colors and lower feature values are represented with light colors. Along the y-axis, features are ranked in order of importance, and along the x-axis, instances are plotted by SHAP value. Overlapping points along the x-axis show the point distribution per feature.
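A beeswarm summary plot like the ones in Figure 3 can be produced with the shap package, as sketched below; the model and data names come from the earlier sketches, and the exact shape of the returned SHAP values varies across shap versions for binary classifiers.

```python
# Sketch of SHAP summary (beeswarm) plots for the decision tree and SVM;
# model/data names are assumptions, and the shap return shape varies by version.
import shap

# Decision tree (Fig 3A): TreeExplainer computes exact SHAP values for tree models
tree_explainer = shap.TreeExplainer(models["decision_tree"])
tree_shap_values = tree_explainer.shap_values(X_test)
# For binary classifiers, some shap versions return one array per class;
# select the array for the "difficult" class before plotting if needed.
shap.summary_plot(tree_shap_values, X_test)

# SVM (Fig 3B): KernelExplainer is the model-agnostic fallback (slow; uses a background sample)
# svm_explainer = shap.KernelExplainer(models["svm"].decision_function, shap.sample(X_train, 100))
# svm_shap_values = svm_explainer.shap_values(X_test)
# shap.summary_plot(svm_shap_values, X_test)
```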

      Discussion

      Machine learning methods are increasingly being adopted in radiation oncology, particularly within the QA space.
      The results presented herein add to the growing body of literature demonstrating the utility of machine learning algorithms with QA tasks. Specifically, we have shown that various machine learning algorithms operating within a voting structure can be used to classify radiation treatment plans and flag the plans that may be more difficult for physics pretreatment chart check.
      The first key aspect of our approach is that it enables the use of multiple algorithms. This allows us to make the most of the strengths of each algorithm, while mitigating the effects of individual weaknesses. For instance, the adaboost algorithm was tuned in such a way that it did not perform as well on the difficult cases but outperformed the other algorithms on the less difficult cases. Although this raises concerns for overfitting, this weakness is reduced within the voting structure. On the other hand, the neural network was tuned in such a way that it did not perform as well on the less difficult cases but outperformed the other algorithms on the difficult cases (Table 3). This resulted in less overall accuracy, the effect of which is mitigated within the voting structure. The neural network's above average accuracy on the difficult cases plays a pivotal role in boosting the overall accuracy on the difficult cases.
      Another key aspect of this approach is that it allows us to select a voting schema that maximizes the combined efforts of each algorithm, tailored to the solution we are seeking, which is to maximize performance on classifying difficult cases. We recognize that mislabeling a less difficult case as difficult (false positive) is a more acceptable error than mislabeling a difficult case as less difficult (false negative). Therefore, we are particularly interested in maximizing the sensitivity of our algorithm. By using a development set, we were able to evaluate different voting schemas. We found that by labeling a case as difficult with a vote of 2 or more out of 5, we would achieve the best possible accuracy on the difficult cases in the development set. Difficult cases comprise a minority of the total cases and for the purposes of this model, we defined difficult cases to be the top 30% most difficult cases. Classification algorithms commonly struggle to predict cases that fall within a minority group. This can be seen clearly with our model as each algorithm performs better on the less difficult cases than on the difficult cases. Oversampling the training set brought the distribution of difficult and less difficult cases closer to 50/50 for training purposes, but this inherent difficulty was still apparent at the time of testing. Choosing a voting schema in which 2 out of 5 votes classified a case as difficult helped our overall model perform better on difficult cases than any individual algorithm could perform alone. This does come with reduced accuracy on the less difficult cases and, as a result, overall reduced accuracy. The tradeoff we make is increased sensitivity (ie, recall) at the price of decreased specificity. In short, the adaptability of the voting structure allowed us to pick the best schema for our task.
      The capabilities and scope of machine learning solutions are often misunderstood. It is easy to identify a problem or area that needs assistance and seek a solution through machine learning methods. These algorithms, however, can fall short and in these situations, it is easy to blame the machine learning method, the lack of data, or the quality of the data, when the project design and intent may be at fault. Often, the most reliable solutions do not actually solve the problem but reduce its scope. Such is the case with our voting algorithm. It does not fully automate the chart checking process (which would be a massive undertaking), but it does assist in a meaningful way by flagging the difficult cases which may require more cognitive scrutiny.
      Finally, our approach to classifying radiation treatment plans and flagging difficult plans has practical applications at the departmental and individual physicist level. This machine learning approach can enable directors/administrators of medical physics to equitably allocate difficult and less difficult cases to multiple physicists in a large academic medical center with a view to optimizing cognitive load. At the individual level, in this project we attempted to create a classification algorithm that identifies difficult cases and alerts a physicist to devote more time and attentional resources for difficult plans. The intended effect of the tool is for a physicist to take the suggested difficulty level of a plan and plan their time accordingly, but there are potential risks and benefits to using the tool. If a physicist is shown a false positive, the risk is that they may think a plan requires more scrutiny and therefore more time to check. If a physicist is shown a false negative, the risk is that the physicist may have planned to check more plans in a given time, and the suspected less difficult plan really being difficult would disrupt their schedule. We do not expect physicists to overspend or underspend effort based on model predictions; we expect them to plan their time according to predictions. To mitigate these risks, we may design the interface such that physicists will be able to see other important information (ie, most important features) along with the model prediction to help them understand how to plan their time (eg, patient age, site name). We expect that triaging treatment plans as difficult and less difficult is likely to increase physicists’ situational awareness, improve their overall performance and reduce potential errors during chart review.
The next step for this project is to implement the solution in our department. It will be used to guide our physicists in prioritizing their activities to reduce cognitive workload and improve the effectiveness of pretreatment chart checks. We plan to implement this by creating a back-end script that extracts the plan features once the dosimetrist is ready for a physics pretreatment check. Once the features have been extracted, another program, using the voting library of trained algorithms, will process the plan features and assign a difficulty rating to the plan that will be available to the physicist in the quality checklist. Before deployment, we plan to conduct usability testing, and because introducing the tool carries a possible risk of more undetected errors, we will test it with a controlled evaluation of algorithm performance with physicists (phase 2). After implementation, we intend to study its effectiveness by evaluating changes in near errors/errors reported to our department's incident learning system, which catches downstream errors (phase 3). Future directions may also include validating the model at our community sites, where we will evaluate generalizability.

      Limitations

      Although the use of multiple algorithms has its strengths, it introduces a fair amount of complexity, which increases the amount of time and computation power required during training. Using random oversampling to balance the training set also increased the chances of overfitting and limited generalization to the test set. Using a subjective rating by physicists rather than departmental criteria may be considered a limitation, but we highlight that the physicists have years of experience evaluating the difficulty of plans, making them the subject matter experts. The study was conducted in a single academic medical center with data from the institutional database, so we were unable to evaluate generalizability. Future testing at community sites will mitigate this limitation.

      Conclusion

      We present an approach to predict the difficulty of radiation therapy treatment plans using a voting classifier with 5 different machine learning algorithms. This approach can be used to potentially improve the treatment plan QA process effectiveness at both the system level (by equitably, rather than randomly, allocating pretreatment checks to physicists to level cognitive workload) and at the individual level (by alerting the physicist to devote more attentional resources to difficult plans).

      Appendix. Supplementary materials

      References

Ford EC, Terezakis S, Souranis A, Harris K, Gay H, Mutic S. Quality control quantification (QCQ): A tool to measure the value of quality control checks in radiation oncology. Int J Radiat Oncol Biol Phys. 2012;84:e263-e269.

de los Santos EF, Evans S, Ford EC, et al. Medical Physics Practice Guideline 4.a: Development, implementation, use and maintenance of safety checklists. J Appl Clin Med Phys. 2015;16:37-59.

Tracton GS, Mazur LM, Mosaly P, Marks LB, Das S. Developing and assessing electronic checklists for safety mindfulness, workload, and performance. Pract Radiat Oncol. 2018;8:458-467.

Younge KC, Naheedy KW, Wilkinson J, et al. Improving patient safety and workflow efficiency with standardized pretreatment radiation therapist chart reviews. Pract Radiat Oncol. 2017;7:339-345.

Hoopes DJ, Dicker AP, Eads NL, et al. RO-ILS: Radiation Oncology Incident Learning System: A report from the first year of experience. Pract Radiat Oncol. 2015;5:312-318.

Kisling KD, Ger RN, Netherton TJ, et al. A snapshot of medical physics practice patterns. J Appl Clin Med Phys. 2018;19:306-315.

Potters L, Ford E, Evans S, Pawlicki T, Mutic S. A systems approach using big data to improve safety and quality in radiation oncology. Int J Radiat Oncol Biol Phys. 2016;95:885-889.

Fong de los Santos L, Dong L, Greener A, et al. TU-D-201-02: Medical physics practices for plan and chart review: Results of AAPM task group 275 survey. Med Phys. 2016;43:3743.

Furhang EE, Dolan J, Sillanpaa JK, Harrison LB. Automating the initial physics chart-checking process. J Appl Clin Med Phys. 2009;10:129-135.

Pillai M, Adapa K, Das SK, et al. Using artificial intelligence to improve the quality and safety of radiation therapy. J Am Coll Radiol. 2019;16:1267-1272.

Kalet AM, Luk SM, Phillips MH. Radiation therapy quality assurance tasks and tools: The many roles of machine learning. Med Phys. 2020;47:e168-e177.

Luk SMH, Ford EC, Phillips MH, Kalet AM. Improving the quality of care in radiation oncology using artificial intelligence. Clin Oncol (R Coll Radiol). 2022;34:89-98.

Azmandian F, Kaeli D, Dy JG, et al. Towards the development of an error checker for radiotherapy treatment plans: A preliminary study. Phys Med Biol. 2007;52:6511.

Kalet AM, Gennari JH, Ford EC, Phillips MH. Bayesian network models for error detection in radiotherapy plans. Phys Med Biol. 2015;60:2735-2749.

Campbell AM, Mattoni M, Yefimov MN, Adapa K, Mazur LM. Improving cognitive workload in radiation therapists: A pilot EEG neurofeedback study. Front Psychol. 2020;11:571739.

Mazur LM, Mosaly PR, Hoyle LM, et al. Relating physician's workload with errors during radiation therapy planning. Pract Radiat Oncol. 2014;4:71-75.

Park Y, Jackson GP, Foreman MA, et al. Evaluating artificial intelligence in medicine: Phases of clinical research. JAMIA Open. 2020;3:326-331.

Syed K, Sleeman W, Hagan M, Palta J, Kapoor R, Ghosh P. Automatic incident triage in radiation oncology incident learning system. Healthcare (Basel). 2020;8:272.

Taylor-Weiner A, Pokkalla H, Han L, et al. A machine learning approach enables quantitative measurement of liver histology and disease monitoring in NASH. Hepatology. 2021;74:133-147.

Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825-2830.

Lundberg SM, Erion G, Chen H, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020;2:56-67.

Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30:1-10.

Pillai M, Adapa K, Shumway JW, et al. Feature engineering for interpretable machine learning for quality assurance in radiation oncology. Stud Health Technol Inform. 2022;290:460-464.