Division of Radiation Oncology, European Institute of Oncology IRCCS, Milan, Italy and Department of Oncology and Hemato-oncology, University of Milan, Milan, Italy
Investigate the ability of machine learning (ML) models to use treatment plan dosimetry for prediction of clinician approval of treatment plans (no further planning needed) for left-sided whole breast radiotherapy (WBRT) with boost.
Methods
Investigated plans were generated to deliver a dose of 40.05 Gy to the whole breast in 15 fractions over 3 weeks, with the tumor bed simultaneously boosted to 48 Gy. In addition to the manually generated clinical plan of each of 120 patients from a single institution, a second, automatically generated plan was included for each patient to increase the number of study plans to 240. The treating clinician retrospectively scored all 240 plans in random order as 1) approved without further planning to seek improvement, or 2) further planning needed, while blinded to the type of plan generation (manual or automated). In total, 2 × 5 classifiers were trained and evaluated for their ability to correctly predict the clinician's plan evaluations: random forest (RF) and constrained logistic regression (LR) classifiers, each trained on five different sets of dosimetric plan parameters (feature sets, FS). The importance of the included features for the predictions was investigated to better understand the clinician's choices.
Results
Although all 240 plans were in principle clinically acceptable for the clinician, only for 71.5% no further planning was required. For the most extensive FS, Accuracy, AUC and Cohen's κ for generated RF/LR models for prediction of approval without further planning were 87.2±2.0/86.7±2.2, 0.80±0.03/0.86±0.02 and 0.63±0.05/0.69±0.04, respectively. In contrast to LR, RF performance was independent of the applied FS. For both RF and LR, PTV40.05Gy was the most important structure for predictions, with importance factors of 44.6% and 43% respectively, and with D95% as most important parameter in most cases.
Conclusions
The investigated use of ML to predict clinician approval of treatment plans is highly promising. Including also non-dosimetric parameters could possibly further increase classifiers’ performances. The tool could potentially become useful for aiding treatment planners in generating plans with a high probability of being directly approved by the treating clinician.
Introduction
Radiotherapy (RT) treatment planning for breast cancer focuses on reducing radiation exposure to healthy tissues (whole heart, left anterior descending coronary artery (LAD), lungs and contralateral breast (CB)), while ensuring adequate target coverage. Two Phase III studies have shown significant toxicity reductions with Intensity-Modulated Radiation Therapy (IMRT) compared to 3D-conformal RT (3DCRT) [1-2]. Apart from regular C-arm linacs, static beam IMRT for breast cancer patients can also be delivered with TomoDirect™ (TD), an IMRT modality delivered with TomoTherapy® (Accuray, Madison, WI, USA) [3-6].
In standard clinical practice, treatment plans are generated by planners and presented to the treating clinician for approval. Often, the final approved plan is the product of an iterative procedure in which an initial plan is enhanced stepwise to best satisfy the clinician's requirements. Although this process can help avoid human errors [7], it is also time-consuming and workload-intensive.
Automated planning has been proposed to enhance plan quality and reduce workload [8,9]. However, several studies with blinded plan comparisons have shown that clinicians do not always prefer the automatically generated plan [10-12]. Recently, Cagni et al. [13] have systematically investigated differences in plan scoring among planners and treating clinicians in a single department. Large differences in plan quality assessments were observed.
In this study, we investigated the ability of random forest (RF) and constrained logistic regression (LR) classifiers to use treatment plan dosimetry to correctly predict a clinician's plan evaluations for left-sided whole-breast radiotherapy (WBRT) with boost as 1) approved without further planning to seek improvement, or 2) further planning needed. The study was based on treatment plans of previously treated patients. To enhance the statistical power of the study, both the manually generated clinical plan and an additional automatically generated plan were included for each patient. For study purposes, the involved clinician retrospectively labeled all clinical and automatically generated study plans, in random order, as 1) approved without further planning, or 2) further planning needed, while blinded to the type of plan generation (manual or automated).
Both for RF and LR, five different dosimetric feature sets (FS) were investigated (2 × 5 investigated classifiers in total) to assess dependence of prediction quality on selected plan parameters. ML predictions for plans that were labeled ‘approved without further planning’ were considered correct in case of a predicted probability P (approved without further planning) > 0.5.
For each of the investigated 2 × 5 classifiers, nested cross validation was used both to establish hyperparameters and to assess model performance, using the same data set [14]. The importance of the included features for the predictions was investigated to better understand what matters in the clinician's plan evaluations.
To the best of our knowledge, this study is the first attempt to use machine learning (ML) with dosimetric plan parameters as input to predict clinicians’ plan evaluations. In a hypothesized future clinical application, a planner could first assess the probability that the clinician would approve a generated plan. If this probability is low, the planner could try to further improve the plan before presenting it to the clinician, thereby minimizing the time clinicians spend on plan evaluations.
Materials and Methods
Patient selection and treatment planning
A total of 120 patients who received adjuvant left-sided WBRT after breast-conserving surgery at the XXX Institute between 2019 and 2020 were randomly selected from the institutional database. The study was notified to the Ethical Committee of the XXX Institute and obtained IRB approval with identification number UID2433. RT was delivered with TomoDirect™ on a TomoTherapy® Hi-Art System (Accuray, Sunnyvale, CA).
Clinical plans were manually generated with the VOLO™ treatment planning system (Accuray Inc, Sunnyvale, USA) version 2.1.6, applying a jaw width of 2.5 cm, a pitch of 0.25 and modulation factors of 1.8–2.0 to keep the delivery time within the range of 10–15 min. Breast and tumor bed were contoured based on ESTRO guidelines for early breast cancer [15]. Isotropic 5 mm expansions were added to create the corresponding planning target volumes (PTVs). Organs at risk (OARs) included left and right lung, CB, heart and LAD [16]. In line with the RTOG 1005 study protocol [17], 40.05 Gy was delivered to the whole breast in 15 fractions over 3 weeks with a simultaneously integrated boost to the tumor bed that resulted in a total dose of 48 Gy. Dose objectives mainly followed those used in the above-mentioned protocol, see table 1.
Table 1. Dose volume histogram constraints for clinical planning; recommended and maximum acceptable values for all considered targets and OARs. In addition to the obtained values for the constraints, the obtained values for parameters in the table without recommended and maximum acceptable values were also used in this study.
For each of the 120 study patients, automated plan generation was performed for the same planning CT and structures as in the clinical plan. Autoplanning was performed with a breast-adapted version of the Guided Planning System (GPS) [10] in the RayStation TPS, version 11A (RaySearch, Stockholm, Sweden). This autoplanning module was not specifically tuned to generate the highest-quality plans for the treatment approaches and traditions of the center where the included patients were treated, as comparing autoplanning with manual planning was not a study aim (see also the Introduction section).
Collected data
The labelling of all 240 involved plans as 1) approved without need for further planning to seek for improvement, or 2) further planning needed was performed by a senior XXX radiation oncologist with more than 20 years of experience in breast cancer treatment (XXX).
The following 24 dosimetric plan parameters were gathered for all 240 plans: D0.03cc, D30%, D50%, D95%, Conformity Index (CI, defined as the ratio between the ROI volume covered by the 95% isodose and the total patient volume covered by ≥95% of the prescribed dose) and Homogeneity Index (HI, defined as D95% / D5%) for the whole breast excluding boost PTV (PTV40.05Gy); D0.03cc, D5%, D95%, CI and HI for the boost PTV (PTV48.0Gy); V20Gy, V8Gy and Dmean for the heart; Dmean and D1% for the LAD; V16Gy, V8Gy, V4Gy, and Dmean for the left lung; V4Gy for the right lung; D0.03cc, D5% and Dmean for the CB. See Table 1 for an overview.
Apart from the above dosimetric plan parameters, composite dosimetric scores (CPS) were collected for OARs and PTVs, as previously proposed by XXX investigators [18]. In this scoring system, the involved 5 OARs and 2 PTVs each get a score of 0, 0.5 or 1, depending on the fulfilment of planning constraints reported in table 1: 1 point was given if all dose constraints were within recommended values, 0.5 point if at least one dose constraint was respected, and no points otherwise. Parameters in table 1 without acceptable values were not considered in this scoring system.
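The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `composite_score`, the structure names and the boolean flags are hypothetical, and the sketch assumes that each dose constraint has already been reduced to a met / not met flag against the values in Table 1.

```python
def composite_score(constraints_met):
    """Score one structure from a list of booleans (one per dose constraint)."""
    if not constraints_met:
        raise ValueError("at least one constraint is required")
    if all(constraints_met):
        return 1.0  # every constraint within its recommended value
    if any(constraints_met):
        return 0.5  # at least one constraint respected
    return 0.0      # no constraint respected


# Hypothetical flags for one plan (illustrative only, not study data).
plan_flags = {
    "heart": [True, True, True],
    "left_lung": [True, False, True, True],
    "LAD": [False, False],
}
cps = {structure: composite_score(flags) for structure, flags in plan_flags.items()}
```

In the study, seven such scores (5 OARs and 2 PTVs) were collected per plan.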
Prior to classifier trainings, the 240 values for each dosimetric feature were first centered around zero by subtracting the mean value, and the values were scaled to unit variance.
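As a minimal sketch of this standardization step, the function below centers one feature column and scales it to unit variance using only the Python standard library. The function name `standardize` and the example values are hypothetical. The sketch uses the population standard deviation, which, as far as we are aware, is also the default behavior of scikit-learn's `StandardScaler`; whether the authors used that class is an assumption.

```python
import statistics


def standardize(column):
    """Center a feature column at zero and scale it to unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (ddof = 0)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical values of one dosimetric feature (illustrative only).
heart_dmean = [1.0, 2.0, 3.0, 4.0]
scaled = standardize(heart_dmean)
```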
The full data set then consisted of 240 rows (one for each plan) and 32 columns (24 dosimetric parameters, 7 composite scores, and the clinician's binary score (approved or not)). The Python scikit-learn library [19] was used for all data analyses and model developments.
Machine Learning models and training
The five investigated dosimetric feature sets (FS) used to train both the RF and LR classifiers (2 × 5 classifiers in total) were:
• FS1: 24 dosimetric parameters defined in section 2.2
• FS2: 7 CPS defined in section 2.2
• FS3: 24 differences between dosimetric parameters and their objectives, as indicated in the column “Recommended value” of table 1. If this was missing (left lung Dmean for example), the original value was maintained.
• FS4: FS2 + FS3
• FS5: FS2 + FS1
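The construction of FS3 from FS1 can be sketched as follows. The function name `to_fs3`, the parameter names and all numerical values are hypothetical, and the sign convention (obtained value minus recommended value) is an assumption, since the text only speaks of "differences".

```python
def to_fs3(parameters, recommended):
    """Replace each parameter by its difference from the recommended value.

    Parameters without a recommended value keep their original value.
    """
    return {
        name: value - recommended[name] if name in recommended else value
        for name, value in parameters.items()
    }


# Hypothetical plan parameters and objectives (illustrative only).
fs1_row = {"heart_Dmean": 3.0, "left_lung_Dmean": 5.5}
objectives = {"heart_Dmean": 4.0}  # no recommended value for left lung Dmean
fs3_row = to_fs3(fs1_row, objectives)
```

FS4 and FS5 would then be obtained by joining the seven composite scores of FS2 with the FS3 or FS1 columns, respectively.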
For each of the 2 × 5 investigated classifiers (RF and LR, each combined with FSi, i = 1-5), model building was performed with nested cross-validation consisting of an outer and an inner loop. The applied procedure is extensively described in [14] and schematically presented in Figure 1. In brief, for the outer loop, the 240 available plans were randomly and equally distributed over ten folds of 24 plans. Each of the ten folds then served as the test set for a model trained on the remaining 216 (240 − 24) plans. Prior to each such training, an inner loop 5-fold cross validation was performed to establish model hyperparameters, such as the number and type of trees for RF, and the solver, penalty, and regularization strength for LR. Inner loop cross validations were performed using only the training set of the corresponding outer loop (fig. 1). For each of the 2 × 5 classifiers, the ten outer loop models were used to assess the prediction performance. The inner loop models served only to establish model hyperparameters.
Figure 1. Schematic explanation of the applied nested cross validation, consisting of 10-fold outer loop cross validation and 5-fold inner loop cross validation. Each of the ten outer loop model buildings is preceded by a paired 5-fold inner loop cross validation to establish hyperparameters using only the training patients of the corresponding outer loop model. Nested cross validation was performed for each of the 2 × 5 classifiers investigated in this study. For each classifier, the ten outer loop models were used to evaluate prediction performance.
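The data partitioning behind this nested scheme can be illustrated at the level of plan indices. The sketch below is not the authors' code: the function names, the random seed and the plain (non-stratified) shuffling are assumptions, and no model is trained. It only shows that every inner loop split is drawn from the training plans of its outer loop fold.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_plans=240, outer_k=10, inner_k=5, seed=0):
    """Return, per outer fold, the test plans, training plans and inner splits."""
    rng = random.Random(seed)
    splits = []
    for test_fold in k_fold_indices(range(n_plans), outer_k, rng):
        held_out = set(test_fold)
        outer_train = [i for i in range(n_plans) if i not in held_out]
        inner = []
        # Inner folds are built from the outer training plans only.
        for validation_fold in k_fold_indices(outer_train, inner_k, rng):
            validation = set(validation_fold)
            inner_train = [i for i in outer_train if i not in validation]
            inner.append((inner_train, validation_fold))
        splits.append(
            {"outer_train": outer_train, "outer_test": test_fold, "inner": inner}
        )
    return splits


splits = nested_cv_splits()
```

With 240 plans this yields ten outer test folds of 24 plans, each paired with 216 training plans that are further divided into five inner folds for hyperparameter selection.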
The function “GridSearchCV” of the Python scikit-learn library [19] was used in the inner loops to select optimal hyperparameters. For each of the 2 × 5 classifiers, prediction performance was assessed by calculating mean values and standard errors of the Accuracy, area under the receiver-operator characteristic curve (AUC) and Cohen's kappa coefficient (κ) [21] for the ten outer loop models.
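For readers who want to see how these quantities are obtained, the following standard-library sketch computes Accuracy and Cohen's κ from thresholded probabilities, and a mean with standard error across folds. All function names and example numbers are hypothetical. Using the sample standard deviation divided by the square root of the number of folds as the standard error is an assumption, and AUC is omitted for brevity.

```python
import math
import statistics


def predicted_label(probability, threshold=0.5):
    """Label a plan as approved (1) when P(approved) exceeds the threshold."""
    return 1 if probability > threshold else 0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined when p_e equals 1; this sketch does not handle that case.
    """
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    p_expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n) for label in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)


def mean_and_standard_error(values):
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    return statistics.fmean(values), statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical clinician labels and model probabilities (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6]
y_pred = [predicted_label(p) for p in probabilities]
fold_accuracy = accuracy(y_true, y_pred)
fold_kappa = cohen_kappa(y_true, y_pred)
```

In the study, such per-fold values from the ten outer loop models were summarized as mean ± standard error.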
For LR, feature importance was quantified as Euler's number raised to the power of the feature's coefficient, i.e. exp(β) [23]. For RF classifiers, feature importance was computed as Gini Importance, also known as Mean Decrease in Impurity (MDI) [24]. For each of the 2 × 5 classifiers, final importances of the included features were calculated as averages of the importance values in the ten outer loop models. The importances of all considered features always sum to 100%.
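A possible reading of the LR importance calculation is sketched below. The coefficients, feature names and function names are hypothetical, and rescaling the exp(β) values by their total so that they sum to 100% is an assumption: the text states that importances add up to 100% but does not spell out the normalization. The second function sums feature importances per anatomical structure, as done for the structure-level comparison in the Results.

```python
import math


def lr_importances(coefficients):
    """Convert LR coefficients to percentage importances via exp(beta)."""
    raw = {name: math.exp(beta) for name, beta in coefficients.items()}
    total = sum(raw.values())
    return {name: 100.0 * value / total for name, value in raw.items()}


def sum_by_structure(importances, structure_of):
    """Add up the importances of all features belonging to each structure."""
    totals = {}
    for name, value in importances.items():
        structure = structure_of[name]
        totals[structure] = totals.get(structure, 0.0) + value
    return totals


# Hypothetical coefficients of one outer loop model (illustrative only).
coefficients = {"PTV40.05Gy_D95%": 1.2, "heart_Dmean": -0.4, "heart_V8Gy": 0.1}
structure_of = {
    "PTV40.05Gy_D95%": "PTV40.05Gy",
    "heart_Dmean": "heart",
    "heart_V8Gy": "heart",
}
importances = lr_importances(coefficients)
per_structure = sum_by_structure(importances, structure_of)
```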
One-way ANOVA tests were used for detecting differences among FS in terms of Accuracy, AUC and κ values, while t-tests were used for analyzing performance differences between RF and LR classifiers.
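To make the ANOVA comparison concrete, the sketch below computes the one-way ANOVA F statistic for several groups of values, for example the per-fold accuracies obtained with different feature sets. The function name and the example numbers are hypothetical. Converting F to a p-value requires the F distribution, which is not available in the Python standard library; a routine such as `scipy.stats.f_oneway` could be used for that, although the text does not state which implementation the authors used.

```python
import statistics


def one_way_anova_f(groups):
    """Return the F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = statistics.fmean([x for group in groups for x in group])
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2 for group in groups
    )
    ss_within = sum(
        (x - statistics.fmean(group)) ** 2 for group in groups for x in group
    )
    ms_between = ss_between / (k - 1)        # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)    # N - k degrees of freedom
    return ms_between / ms_within


# Hypothetical scores for three feature sets (illustrative only).
f_statistic = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

For the example above, the between-group mean square is 3 and the within-group mean square is 1, giving F = 3.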
Results
The clinician considered all evaluated 240 plans clinically acceptable. Nevertheless, only 92 out of 120 clinical plans (77%) were approved without further planning, and the remaining 28 not. Of the autoplans, 79 (66%) were judged approved and 41 not.
Model performances in terms of Accuracy, AUC and Cohen's κ are presented and compared in Table 2. Accuracy, AUC and κ for generated RF/LR models for the most extensive feature set (FS4) were 87.2±2.0/86.7±2.2, 0.80±0.03/0.86±0.02 and 0.63±0.05/0.69±0.04, respectively. Accuracies of 87.2/86.7% and AUCs of 0.80/0.86 are at the high end when compared to many published predictive modelling studies in radiotherapy. According to the interpretation by Landis and Koch, κ values of 0.63/0.69 point at ‘substantial agreement’ between clinician plan labeling and ML prediction (see [22] and M&M section).
Table 2. Accuracy, AUC and κ for the RF and LR models for the five investigated feature sets (FS1 … FS5). Average values and standard errors were calculated from the 10 folds of the outer loop of the nested cross-validation. The last three columns show p-values for comparisons between RF and LR. The last row shows p-values for ANOVA tests across the models for FS1 … FS5. Superscripts refer to feature sets (FS) that give statistically different results; e.g., for FS1, the LR model has an AUC of 0.75 ± 0.02 with superscripts 2 and 4, which indicate statistically significant differences between the LR model for FS1 and the LR models based on FS2 and FS4, respectively. Bold p-values indicate statistical significance. Last three columns: a: RF superior, b: LR superior.
The last three columns in Table 2 show that performance differences between RF and LR were mostly not statistically significant, depending on considered performance parameter and applied FS. RF had superior Accuracy for two FS, with for one of the two also a superior κ. LR was superior in AUC for three FS.
Achieved Accuracy, AUC and κ for the RF classifiers were independent of the FS (columns 2, 3 and 4 in Table 2, including P(ANOVA) in the last row), implying that there was no evidence that adding the FS3 features to FS2 (= FS4, section 2.3) or adding FS1 to FS2 (= FS5) resulted in better predictions. In contrast, for LR, dependences on the applied FS were observed (columns 5, 6 and 7 in Table 2, including P(ANOVA) in the last row), with FS4 and FS5 overall performing best.
Figure 2 shows, for the evaluated PTVs and OARs, the summed importances of the corresponding features for the five investigated FS (left panel: RF, right panel: LR). For both RF and LR, PTV40.05Gy was clearly the most important structure for the predictions, independent of the applied FS. The right lung was always of minor importance, while the most important OARs were the heart for the RF classifiers and the LAD for the LR classifiers. Figure 3 shows importances of various PTV40.05Gy features: for FS1 and FS3, which do not contain the CPS, D95% clearly has the highest importance in the predictions, while for FS4 and FS5 the most important feature is the CPS for the LR models and D95% for the RF models.
Figure 2. For all considered structures (OARs and PTVs), importances for all feature sets FS1-FS5. For each FS, the values for the seven structures add up to 100%. For each structure, the bar for each feature set represents the sum of the importances of all features related to that structure.
Discussion
In the complex landscape of large amounts of data, machine learning (including deep learning) offers unique opportunities for improving the overall quality and efficiency of the modern RT workflow [25-28]. The aim of our study was to investigate whether machine learning (ML) models could potentially become useful for aiding treatment planners in presenting to clinicians only treatment plans that have a high probability of being approved without further planning. The applied data set consisted of 240 treatment plans for left-sided whole breast radiotherapy (WBRT) with boost, each of them retrospectively labeled by a clinician as either ‘approved without further planning to improve’ or ‘further planning needed’. In total, 2 × 5 classifiers were investigated: random forest (RF) and constrained logistic regression (LR), both trained with 5 different sets of dosimetric plan features. For a given treatment plan, each of the 2 × 5 classifiers predicted the probability that the clinician would approve the plan without further planning. For plans labeled ‘approved without further planning to improve’, a probability >0.5 was considered a correct prediction. Likewise, for plans with the label ‘further planning needed’, a probability <0.5 was considered correct.
The use of five different feature sets (FS) allowed us to investigate the sensitivity of RF and LR to the choice of applied dosimetric features. FS1 and FS3 both consisted of 24 dosimetric parameters that could be directly calculated from the dose distributions. The much smaller FS2 (seven parameters) contained, for each of the seven involved anatomical structures, a composite score that was derived from related dosimetric parameters, as previously proposed [18]. FS4 and FS5 were the largest FS, consisting of FS2+FS3 and FS2+FS1, respectively.
For FS4, Accuracy, AUC and Cohen's κ for generated RF/LR models for prediction of approval without further planning were rather high: 87.2±2.0/86.7±2.2, 0.80±0.03/0.86±0.02 and 0.63±0.05/0.69±0.04, respectively. RF performance was basically independent of the applied FS (Table 2), meaning that FS2 with only seven features performed as well as FS1 and FS3 with 24 features and FS4 and FS5 with 31 features. For LR, a dependency on FS was observed, with the large FS4 and FS5 performing best overall. The ability of RF models to use non-linear combinations of the available dosimetric features may compensate for the smaller number of dosimetric features in FS2.
Clinicians’ plan evaluations are not only based on plan parameters, but consider also the full 3D dose distribution. This study shows that not explicitly taking into account the full 3D dose in the 2 × 5 investigated classifiers could still result in high quality predictions of clinician's plan evaluations.
As mentioned above, all 240 study plans were retrospectively labeled by the clinician as ‘approved without further planning’ or ‘further planning needed’. Apart from this labeling, which was used in the study, the clinician also assessed plan acceptability. Although only 71.5% of plans were labeled as ‘approved without further planning’, the clinician found 100% of plans in principle acceptable for treatment. Apparently, for a large number of plans the clinician wished to further explore plan improvement even though the plan was in principle acceptable. This reflects the complex decision making that was modeled in this paper; the label ‘further planning needed’ was not related to unacceptable constraint violations, but to more subtle desires for plan improvement.
For all investigated 2 × 5 classifiers, PTV40.05Gy was by far the most important anatomical structure for predictions, reflecting the importance given to it by the clinician (figure 2), with D95% as most important parameter for most classifiers having PTV40.05Gy D95% as feature (figure 3).
In this study, all 240 available labeled treatment plans could be used for training, validation and testing (classifier performance assessment), owing to the applied nested cross validation ([14], figure 1). With this procedure, inner loop cross validation was used to establish the hyperparameters used for model training in the outer loop cross validation.
A limitation of this study is that the analyses were performed for a single clinician. Generalizability of these prediction models for use by more clinicians is a topic of future research. Developing a single model for all clinicians in the center could result in higher consistency of the treatments delivered in the study center. Another limitation is the lack of non-dosimetric patient data in the performed analyses, including age, performance status, previous or concomitant treatments, surgery results, co-morbidities, etc. Future investigations will include such factors, which could potentially further enhance the reliability of the predictive models.
Conclusion
We have investigated several Machine Learning approaches for prediction of clinician approval of treatment plans for left-sided whole-breast radiotherapy plus boost, based on plan dosimetry. Results are encouraging for future workflows in which treatment planners will only present treatment plans to treating clinicians if they have a high probability of being directly approved i.e., without a further round of planning and plan evaluation.
Funding Statement
The authors received no financial support for the research, authorship, and publication of this article.
Data Availability Statement for this Work
Research data are stored in an institutional repository and will be shared upon request to the corresponding author.
References
1. Pignol JP, Olivotto I, Rakovitch E, et al. A multicenter randomized trial of breast intensity-modulated radiation therapy to reduce acute radiation dermatitis. J Clin Oncol 2008;26:2085–92.
2. Mukesh MB, Barnett GC, Wilkinson JS, et al. Randomized controlled trial of intensity-modulated radiotherapy for early breast cancer: 5-year results confirm superior overall cosmesis. J Clin Oncol 2013;31:4488–95.
3. Franco P, Catuzzo P, Cante D, et al. TomoDirect: an efficient means to deliver radiation at static angles with tomotherapy. Tumori 2011;97:498–502.
4. Murai T, Shibamoto Y, Manabe Y, et al. Intensity-modulated radiation therapy using static ports of tomotherapy (TomoDirect): comparison with the TomoHelical mode. Radiat Oncol 2013;8:68.
5. Franco P, Zeverino M, Migliaccio F, et al. Intensity-modulated adjuvant whole breast radiation delivered with static angle tomotherapy (TomoDirect): a prospective case series. J Cancer Res Clin Oncol 2013;139:1927–36.
6. Dicuonzo S, Leonardi MC, Raimondi S, et al. Acute and intermediate toxicity of 3-week radiotherapy with simultaneous integrated boost using TomoDirect: prospective series of 287 early breast cancer patients. Clin Transl Oncol. 2021 Feb 3.
7. Kisling K, Johnson JL, Simonds H, et al. A risk assessment of automated treatment planning and recommendations for clinical deployment. Med Phys. 2019 Jun;46(6):2567-2574.
8. Marrazzo L, Meattini I, Arilli C, et al. Auto-planning for VMAT accelerated partial breast irradiation. Radiother Oncol. 2019 Mar;132:85-92.
9. Redapi L, Rossi L, Marrazzo L, et al. Comparison of volumetric modulated arc therapy and intensity-modulated radiotherapy for left-sided whole-breast irradiation using automated planning. Strahlenther Onkol. 2022 Mar;198(3):236-246.
10. Fiandra C, Rossi L, Alparone A, et al. Automatic genetic planning for volumetric modulated arc therapy: A large multi-centre validation for prostate cancer. Radiother Oncol. 2020 Jul;148:126-132.
11. Rossi L, Sharfo AW, Aluwini S, et al. First fully automated planning solution for robotic radiosurgery - comparison with automatically planned volumetric arc therapy for prostate cancer. Acta Oncol. 2018 Nov;57(11):1490-1498.
12. Heijmen B, Voet P, Fransen D, et al. Fully automated, multi-criterial planning for Volumetric Modulated Arc Therapy - An international multi-center validation for prostate cancer. Radiother Oncol. 2018 Aug;128(2):343-348.
13. Cagni E, Botti A, Rossi L, et al. Variations in Head and Neck Treatment Plan Quality Assessment Among Radiation Oncologists and Medical Physicists in a Single Radiotherapy Department. Front Oncol. 2021 Oct 12;11:706034.
14. Cawley GC, Talbot NLC. On over-fitting in model selection and subsequent selection bias in performance evaluation. J Mach Learn Res 2010;11:2079-2107.
15. Offersen BV, Boersma LJ, Kirkove C, et al. ESTRO consensus guideline on target volume delineation for elective radiation therapy of early stage breast cancer. Radiother Oncol. 2015 Jan;114(1):3-10.
16. Duane F, Aznar MC, Bartlett F, et al. A cardiac contouring atlas for radiotherapy. Radiother Oncol 2017;122:416-422.
17. RTOG 1005. A phase III trial of accelerated whole breast irradiation with hypofractionation plus concurrent boost versus standard whole breast irradiation plus sequential boost for early-stage breast cancer. Available at (15/06/2020): www.rtog.org › ProtocolTable › StudyDetails
18. XXXXX.
19. Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 2011;12:2825-2830.
20. Hastie T, Tibshirani R, Friedman JH. The elements of statistical learning: Data mining, inference, and prediction. New York: Springer; 2001.
21. Cohen J. A Coefficient of Agreement for Nominal Scales. Educ Psychol Measurement 1960;20:37–46.
22. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-74.
23. König G, Molnar C, Bischl B, Grosse-Wentrup M. Relative feature importance. Proceedings - International Conference on Pattern Recognition 2020; art. no. 9413090, pp. 623-630.
24. Qi Y. Random forest for bioinformatics. In: Ensemble machine learning. Springer; 2012.
25. Sadeghnejad Barkousaraie AS, Ogunmolu O, Jiang S, Nguyen D. A fast deep learning approach for beam orientation optimization for prostate cancer treated with intensity-modulated radiation therapy. Med Phys. 2019;47:880–897.
26. Granville DA, Sutherland JG, Belec JG, La Russa DJ. Predicting VMAT patient-specific QA results using a support vector classifier trained on treatment plan characteristics and linac QC metrics. Phys Med Biol. 2019 Apr 29;64(9):095017.
27. Li J, Wang L, Zhang X, Liu L, et al. Machine Learning for Patient-Specific Quality Assurance of VMAT: Prediction and Classification Accuracy. Int J Radiat Oncol Biol Phys. 2019 Nov 15;105(4):893-902.
28. Hussein M, Heijmen BJM, Verellen D, Nisbet A. Automation in intensity modulated radiotherapy treatment planning-a review of recent innovations. Br J Radiol. 2018 Dec;91(1092):20180270.
Declaration of Competing Interest
None of the authors has any conflict of interest with the published data and the software used in the present work. Stefania Zara is an employee of the company “Tecnologie Avanzate TA Srl” that distributes the software RayStation in Italy; this company supports the group in terms of collection of data.
Acknowledgments
The authors would like to thank the editor and reviewers for providing very helpful inputs that served to better clarify some key points in the paper.
Article info
Publication history
Published online: March 28, 2023
Accepted: March 15, 2023
Received: January 19, 2023
Footnotes
Author responsible for statistical analysis: Fiandra, Christian, Department of Oncology, University of Turin, Turin, Italy e-mail: [email protected]; Fariselli, Piero, Department of Medical Sciences, University of Torino, Turin, Italy