Review of Deep Learning Based Autosegmentation for Clinical Target Volume: Current Status and Future Directions

Purpose: Manual contouring for radiation treatment planning takes significant time to ensure volumes are accurately delineated. Deep learning based autosegmentation (DLAS) models have emerged in recent years to alleviate this workload. They are used for organs at risk contouring with significant consistency in performance and time saving. The purpose of this study was to evaluate the performance reported in currently published data for DLAS of clinical target volume (CTV) contours, identify areas of improvement, and discuss future directions.
Methods and Materials: A literature review was performed by using the key words “deep learning” AND (“segmentation” OR “delineation”) AND “clinical target volume” in an indexed search of PubMed. A total of 154 articles based on the search criteria were reviewed. The review considered the DLAS model used, disease site, targets contoured, guidelines used, and the overall performance.
Results: Of the 53 articles investigating DLAS of CTV, only 6 were published before 2020. Publications have increased in recent years, with 47 articles published between 2020 and 2023. The cervix (n = 19) and the prostate (n = 12) were studied most frequently. Most studies (n = 43) involved a single institution. Median sample size was 130 patients (range, 5-1052). The most common metrics used to measure DLAS performance were Dice similarity coefficient followed by Hausdorff distance. Dosimetric performance was seldom reported (n = 11). There was also variability in specific guidelines used (Radiation Therapy Oncology Group [RTOG], European Society for Therapeutic Radiology and Oncology [ESTRO], and others). DLAS models had good overall performance for contouring CTV volumes for multiple disease sites, with most studies showing Dice similarity coefficient values >0.7. DLAS models also delineated CTV volumes faster compared with manual contouring. However, some DLAS model contours still required at least minor edits, and future studies investigating DLAS of CTV volumes require improvement.
Conclusions: DLAS demonstrates capability of completing CTV contour plans with increased efficiency and accuracy. However, most models are developed and validated by single institutions using guidelines followed by the developing institutions. Publications about DLAS of the CTV have increased in recent years. Future studies and DLAS models need to include larger data sets with different patient demographics, disease stages, validation in multi-institutional settings, and inclusion of dosimetric performance.


Sources of support: This work had no specific funding.
*Corresponding author: Sushil Beriwal, MD, MBA; Emails: sushilberiwal@gmail.com, sushil.beriwal@varian.com
https://doi.org/10.1016/j.adro.2024.101470
2452-1094/© 2024 The Author(s). Published by Elsevier Inc. on behalf of American Society for Radiation Oncology. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Introduction
Radiation treatment planning involves a multistep, complex process requiring the use of CT simulations with manual segmentation of the gross tumor volume (GTV), clinical target volume (CTV), and organs at risk (OARs).1,2 Despite the expertise of radiation oncologists, manual segmentation remains time-consuming and can have large intraobserver and interobserver variability.3,4 The introduction of autosegmentation methods allows for uniformity and time efficiency. Early methods of autosegmentation included atlas-based methods using reference images with accompanying segmentation annotations to segment real-world clinical images.6,7 The limitations of these methods require physicians to spend more time editing contours. More recently, the implementation of DLAS has gained acceptance among radiation oncologists because of its superior performance and time savings.
DLAS uses artificial intelligence to perform autosegmentation on images by using a series of neural networks and architectures to analyze data.8 Convolutional neural networks (CNNs) are a class of deep learning artificial neural networks that assume their inputs are images, from which contour outputs can be produced.9 CNNs have been frequently studied and assessed in DLAS studies. When assessing the performance of DLAS, metrics such as Hausdorff distance (HD) and Dice similarity coefficient (DSC) are considered. HD measures the distance between the ground truth segmentation and the autosegmentation or manual segmentation; in its standard form, it is the largest distance from a point on one contour to the nearest point on the other. A lower HD value indicates segmentation of higher quality.10 DSC compares the spatial overlap between 2 sets of contours.11 A DSC value ranges from 0 to 1, where 0 indicates no spatial overlap between 2 sets of binary segmentation results and 1 indicates complete overlap. The greater the overlap, the better the performance of the DLAS model. Specifically, DSC values greater than 0.700 are considered to indicate good overlap.12 Studies also commonly report subjective metrics regarding DLAS contours, including physician satisfaction and rating scales. Dosimetric outcomes are another way to evaluate DLAS model contours. Extensive literature has been published on autosegmentation of OARs, which is more common in clinical practice today. Limited studies have investigated DLAS of the CTV, which includes expansions of the GTV to account for microscopic disease as well as prophylactic nodal regions.
Further investigation into the use of DLAS of the CTV is warranted to fully delineate its role in the future of radiation oncology. The goal of our review manuscript was to summarize and analyze the current literature reporting on the efficiency and performance of DLAS across different disease sites to delineate areas of strength and areas for improvement.
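
To make the two geometric metrics concrete, the following is a minimal sketch, not taken from any of the reviewed studies, that computes DSC and the standard (maximum) symmetric HD for two toy contours represented as sets of pixel coordinates. The function names `dice` and `hausdorff` and the toy contours are illustrative assumptions, and distances are in pixel units rather than millimeters.

```python
from math import dist

def dice(a, b):
    """Dice similarity coefficient between two sets of pixel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty masks are treated as perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    def directed(p, q):
        # Largest distance from a point in p to its nearest neighbour in q.
        return max(min(dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))

# Toy example: a 4x4 "manual" contour and an "auto" contour shifted by one pixel.
manual = {(x, y) for x in range(4) for y in range(4)}
auto = {(x, y) for x in range(1, 5) for y in range(4)}

print(dice(manual, auto))       # 0.75
print(hausdorff(manual, auto))  # 1.0
```

In this toy case a one-pixel shift leaves the DSC above the 0.7 threshold cited above, while the HD reports the size of the worst local disagreement. Published studies often report variants such as the 95th percentile HD, which this sketch does not implement.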

Methods and Materials
A literature review was performed by using the key words "deep learning" AND ("segmentation" OR "delineation") AND "clinical target volume" in an indexed search of PubMed. A total of 154 articles based on the search criteria were reviewed. The factors considered in our review were the disease site, whether only the primary tumor was contoured or lymph nodes were also included, the guidelines used, the DLAS model implemented, the type of imaging used, whether the DLAS model was developed in-house or commercially, whether dosimetric data were reported, and the reported outcomes of CTV segmentation from the model. Articles focusing on other methods of autosegmentation, or on the GTV and OARs without the CTV, were excluded. A PRISMA chart13 detailing article inclusion and exclusion methods is shown in Fig. 1.
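
For readers who wish to rerun a comparable search programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint using the search string quoted above. This is an assumption about one possible way to reproduce the search; the review does not state how the PubMed search was executed, and `build_search_url` and the `retmax` value are hypothetical. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Search string as reported in the Methods and Materials section.
QUERY = '"deep learning" AND ("segmentation" OR "delineation") AND "clinical target volume"'

# NCBI E-utilities ESearch endpoint (assumption: not named in the review).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_search_url(query, retmax=200):
    """Return an ESearch URL that would list PubMed IDs matching `query`."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"

url = build_search_url(QUERY)
print(url)
```

Result counts from such a query will drift over time as new articles are indexed, so they should not be expected to match the 154 articles screened here.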

Results
There were 53 articles that met criteria for this review from 2017 to 2023. Of these, 47 were published from 2020 to April 2023 (Fig. 2). The cervix (n = 19) and the prostate (n = 12) were studied most frequently. Most studies (n = 43) involved a single institution compared with multi-institutional studies (n = 8). Median overall sample size was 130 patients (range, 5-1052). Common metrics used to measure DLAS performance were DSC followed by HD. A summary of all included articles from the literature review can be found in Table 1. A summary of disease site statistics can be found in Table 2. The results of each disease site are summarized below.

Brain
The contouring of CTV for high-grade gliomas (n = 1) and brain metastases (n = 1) has been explored with DLAS without multi-institutional data. Both studies used in-house models and magnetic resonance imaging (MRI) scans for DLAS contouring of CTV volumes. Sample size to train and sample size to validate ranged from 296 to 469 patients and 15 to 40 patients, respectively. DSC was used to measure clinical performance of the DLAS model in both articles. Sadeghi et al described a modified Segmentation-Net (SegNet) model that achieved a mean DSC of 0.896 and mean HD of 1.49 mm in patients with unresected glioblastoma. This model also revealed a statistically significant difference between Dmin and Dmax for automatically delineated CTV versus manual contours; however, no differences were found between Dmean and D98% of the CTV for both sets of contours.14 For brain metastases, one study reported agreement between manually and automatically assessed tumor volumes, quantified by a concordance correlation coefficient of 0.87, and a mean DSC of 0.7 for a NetSUM model that combines multiple individual DLAS models through a summation technique.15 Overall, the performance of DLAS in CNS malignancies, compared with manual segmentation, is clinically acceptable. More literature regarding DLAS in all CNS malignancies, including glioblastoma, meningioma, and brain metastases, is required.

Breast
DLAS of CTV volumes after breast conserving surgery (n = 7) was studied most frequently, followed by CTV volumes after mastectomy (n = 2) and after chemotherapy without surgery (n = 1). Studies used ESTRO, RTOG, or some other international guidelines, and more than half (n = 5, 56%) of the studies included CTV lymph nodes. Most studies were performed and validated by a single institution or 2 institutions. All breast cancer studies used in-house DLAS models and CT scans for model training and contour delineation. Median sample sizes to train and validate/test performance of DLAS models were 128 patients (range, 35-700) and 33.5 patients (range, 19-352), respectively. DSC (n = 9), HD (n = 9), and qualitative rating measures (n = 4) were commonly used to assess performance.
Subjective rating performance was good, and mean DSCs for DLAS models were ≥0.7 for whole breast postlumpectomy or postchemotherapy (DSC range, 0.83-0.95) and chest wall postmastectomy (DSC, 0.73-0.736).17-24 Dai et al performed a multi-institutional study that reported better performance for DLAS models on planning CT scans compared with scanning CT scans.21 Lymph node CTVs are the most difficult fields for DLAS models. Studies reported worse DLAS performance for CTV internal mammary lymph nodes (DSC, 0.51-0.60) and Rotter's space (DSC, 0.63).23,24 Choi et al reported on several different lymph node levels, with the lowest DLAS model performance for ESTRO guideline left CTV supraclavicular nodes (DSC, 0.7).18 Almberg et al reported worst DLAS model performance for CTV interpectoral lymph nodes (DSC, 0.7).22 The DLAS model from Chung et al had a mean DSC of 0.64 for CTV level 3 axillary lymph nodes. This model had poor performance for ESTRO guideline supraclavicular lymph nodes (DSC, 0.67) and intramammary nodes (DSC, 0.67) with CT scans with contrast.20 Mean contour time for one DLAS model was between 4 and 21 seconds.19 Buelens et al reported an average of 11 minutes saved per patient with DLAS versus manual segmentation.23 Of the articles that reported dosimetric data, the differences between autosegmented and manual contours were minimal. However, articles revealed autosegmented contours had decreased dose coverage for axillary node levels I to III and internal mammary nodes.20 Mean ΔD90/ΔD95 for autosegmented CTV was less than 2/4 Gy compared with original manual contour plans.21 Overall, DLAS of the breast is efficient and effective for CTV whole breast and CTV chest wall, but DLAS of certain draining lymph nodes needs improvement.

Cervix (brachytherapy)
DLAS of brachytherapy CTV for cervical cancer has been studied after external beam radiation (n = 3); other studies (n = 2) did not specify the treatment given before brachytherapy. GEC-ESTRO guidelines were commonly used. All 5 studies were performed by a single institution and used in-house DLAS models. One study used MRI (vs 4 studies using CT imaging). Median sample sizes to train and validate/test performance of DLAS models were 61.5 (range, 40-160) and 20 (range, 19-50), respectively. DSC (n = 5), HD (n = 5), and Jaccard index (n = 2) were used most frequently to assess model performance.
26-28 Zhang et al compared 2 DLAS models, a novel 3-dimensional (3D) CNN and the standard 3D U-Net; the proposed novel model outperformed the standard model and was deemed by physicians to improve efficiency and consistency of treatment planning.25 In another study comparing a proposed brachytherapy DLAS model with manually defined contours, DLAS contours evaluated by physicians were shown to be satisfactory without edits.28 Yoganathan et al used 2D and 2.5D ResNet and Inception ResNet models with MRI, showing worse performance of 2D models compared with 2.5D models.29 These models also had worse performance for intermediate risk CTV volumes (DSC, 0.71-0.75). Regarding time savings, Jiang et al reported that their DLAS model cut 60% of total time compared with manual delineations, with a mean duration to contour CTV of 70 seconds.27 Among the articles that reported dosimetric data, certain autosegmentation models performed better than others. For example, the 2-dimensional (2D) model had significantly lower D90 values than manual contours, whereas the D90 of the CTV for 2.5D models was similar to that of manual contours.29 Other models had minimal to no significant dosimetric differences between manual and autosegmented contours.26,28 Overall, DLAS of brachytherapy CTV for cervical cancer is efficient and accurate, but more studies with MRI and in postoperative settings are warranted.
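
Because some studies report the Jaccard index alongside or instead of DSC, it can help to convert between the two when comparing results. The sketch below is illustrative only and is not drawn from the reviewed articles; it uses the standard identities DSC = 2J/(1 + J) and J = DSC/(2 - DSC), and the function names and toy voxel sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard index: shared voxels divided by voxels in either contour."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty masks are treated as perfect agreement
    return len(a & b) / len(union)

def dice_from_jaccard(j):
    """Convert a Jaccard index to the equivalent Dice similarity coefficient."""
    return 2 * j / (1 + j)

def jaccard_from_dice(d):
    """Convert a Dice similarity coefficient to the equivalent Jaccard index."""
    return d / (2 - d)

# Toy 1D example: voxel indices 0-9 (manual) versus 2-11 (auto).
manual = set(range(0, 10))
auto = set(range(2, 12))

j = jaccard(manual, auto)                # 8 shared voxels / 12 in the union
print(round(j, 3))                       # 0.667
print(round(dice_from_jaccard(j), 3))    # 0.8
print(round(jaccard_from_dice(0.7), 3))  # 0.538
```

The last line shows that the commonly cited DSC threshold of 0.7 corresponds to a Jaccard index of roughly 0.54, so the two metrics should not be compared against the same cutoff.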

Cervix (external beam)
DLAS of cervical CTV volumes with external beam radiation has been studied most in patients with no surgery (n = 6). RTOG guidelines were most commonly used among studies. Only 3 included studies were performed or validated by multiple institutions. CT imaging (n = 13, vs 1 MRI study) and in-house DLAS models (n = 13) were commonly used. Median sample sizes to train and validate DLAS models were 134.5 (range, 10-300) and 37.5 (range, 13-81), respectively.31-41 Only Chang et al showed one pretrained DLAS model that had a DSC of 0.68.31 35-38,40,42 When looking at subjective performance metrics, DLAS model accuracy was comparable with that of senior radiation oncologists and superior to that of junior and intermediate radiation oncologists.34 Rayn et al evaluated DLAS of pelvic lymph node volumes across multiple institutions, reporting that 96% of contours required few or minimal edits.43 When evaluating CTV coverage, Chen et al reported 99.86% coverage of the CTV V42.75 and 99.47% coverage of the CTV V45 for the DLAS model.44 When reporting time savings, one study reported an estimated time for DLAS contouring with manual corrections of <15 minutes.32 Other studies found time savings of 88 minutes when comparing DLAS versus resident contouring and 9.8 to 28.9 minutes saved for junior residents when contouring cervical nodal or parametrial volumes.34,35 When considering dosimetric data, certain models had lower dosimetric accuracy regarding V42.75, V100, and Dmean. Specifically, although the 2D model had a higher V42.75 compared with the 3D model, both models were lower in accuracy in comparison to manual contours.32 However, one article reported comparable percent coverage of CTV V42.75 and V45 for the DLAS model and the manual contours.44 Overall, the use of DLAS revealed improvement in accuracy of CTV contours for the cervix with accompanying time savings for both more senior radiation oncologists and residents; however, more emphasis on and improvement in dosimetric performance of DLAS models is required.

Gastrointestinal (rectum and esophagus)
Current literature for DLAS of CTV in gastrointestinal malignancies focuses on the neoadjuvant setting for rectal cancer (n = 2) and postoperative settings in both rectal (n = 1) and esophageal cancer (n = 1) at single institutions using RTOG or other international and institutional guidelines. CTV volumes for rectal cancer all included regional lymph nodes. All gastrointestinal studies used in-house models and CT scans for model training and CTV delineation. Median sample size to train models was 110 patients (range, 58-218), and median sample size to validate/test models was 46.5 (range, 13-111). Common performance metrics used were DSC (n = 4), qualitative or subjective metrics (n = 2), and HD (n = 2).
46,47 Wu et al found that the DLAS model had better performance than manual contouring based on a blinded subjective scoring system. DLAS models were also more efficient than manual contouring. The mean time for DLAS contour creation ranged from 15 to 45 seconds for CTV and OARs. Song et al reported mean CTV correction times for 2 DLAS models of 7.29 and 11.17 minutes.47 Cao et al investigated a 5-fold cross validated DLAS model to segment CTV lymph nodes and CTV esophageal tumor bed after an esophagectomy. For the various DLAS models in this study, DSCs ranged from 0.835 to 0.867. Average time to perform CTV contouring for one DLAS model was 25 seconds.48 DLAS models efficiently and accurately contoured CTV of the rectum and esophagus. However, more studies investigating radiation in both neoadjuvant and adjuvant surgical settings for rectal cancer and neoadjuvant settings for esophageal cancer are required before widespread clinical implementation.

Head and neck
For studies investigating DLAS of head and neck cancers, most studies (n = 6) were in upfront radiation settings without surgery. One study included patients with no surgery and patients in the postoperative setting. All studies were performed by a single institution and used CT scans for treatment planning. Most DLAS models were in-house (n = 7), and other models were commercial (n = 2). Commonly used guidelines included RTOG or international guidelines. Nodes were included in 8 out of 9 studies, with 2 studies reporting on lymph nodes only. Median sample sizes to train and validate/test DLAS models were 72 (range, 17-313) and 28 (range, 5-143) patients, respectively. DSC (n = 6) and HD (n = 5) were most used to assess DLAS model performance.
Several studies noted well-performing models for CTV primary ± CTV lymph nodes based on DSCs (range, 0.72-0.84) or good subjective performance scores comparable with manual contouring.50-57 Wong et al found the commercial DLAS model that was used had worse performance (DSC, 0.72) compared with manual contouring, although the model led to fast contouring of CTV volumes. More data are needed to compare the performance of commercial and in-house DLAS models. Some studies reported data specific to head and neck lymph nodes. Cardenas et al reported better DSC performance in patients with lymph node involvement compared with those without lymph node involvement.53 Weissman et al reported improved DLAS performance when the model was adjusted to the CT slice plane compared with when it was not.55 Van der Veen et al reported best DLAS performance for lymph node levels Ib, II-IVa, VIa, VIb, VIIa, and VIIb (DSC, 0.85), and Kihara et al reported their DLAS model incorrectly segmented level Ib lymph nodes for tonsillar and base of tongue cancer.52,57 Van der Veen et al also measured the time for DLAS of all lymph node levels to be 86 seconds, with the time needed to correct autosegmented contours of lymph nodes (35 minutes) being less than the time to correct manual contours (52 minutes).57 Reported mean times to delineate CTV ranged from 0.86 to 20 seconds.51,52 Overall, DLAS models are successful in efficiently contouring CTV volumes similar to ground truth contours for head and neck cancer, although the development and validation of these models are limited to a single institution.

Prostate
CTV volumes for DLAS of the prostate most frequently included a combination of the prostate ± seminal vesicles (n = 9), followed by the postsurgical bed (n = 3). Guidelines used by these studies included RTOG, ESTRO Advisory Committee for Radiation Oncology Practice (ACROP), and Faculty of Radiation Oncology Genito-Urinary Group (FROGG), although 8 studies did not specify guidelines. Only 3 prostate studies were performed by more than one institution, and only one study reported DLAS of prostate regional nodes. CT scans (n = 8) and in-house models (n = 8) were used for treatment planning and DLAS contouring more often than MRI scans (n = 4) and commercial models (n = 4). DSC (n = 10), HD (n = 6), and qualitative or subjective evaluation methods (n = 6) were used most often to assess model performance.
59-67 Most DLAS models had DSC >0.7, with the exception of U-Net in one study.64 In intact patients with prostate cancer, the use of DLAS models demonstrated superiority, as blind physician evaluation resulted in selection of DLAS more often than manual contouring.65 For patients who received radiation after prostatectomy, DLAS models either outperformed or performed similarly to manual contouring.60,63 However, one study showed DLAS-generated CTVs were scored acceptable in 54% of the cases after prostatectomy, compared with 73% for manual delineations.62 Models that allowed for adaptability to physician style had an average DSC 3.4% higher than a general model that did not differentiate physician style.60 DLAS model performance on CT scans versus MRI scans was comparable, with median DSC values of 0.84 (range, 0.7-0.88) and 0.855 (range, 0.65-0.92), respectively. Also, commercial versus in-house model performance was similar, with median DSC values of 0.83 (range, 0.7-0.88) and 0.855 (range, 0.65-0.92), respectively. When evaluating pelvic lymph nodes in prostate cancer, Rayn et al reported few or minimal edits required for 99% of DLAS lymph node contours.43 Only one article reported on time savings for prostate contouring, with Shen et al showing an average contouring time of <15 seconds.65 From our review, one article reported dosimetric data for CTV, which showed agreement between the DLAS model and manual contours with regard to D98%, D2%, and V95%.66 The use of DLAS shows potential for increased accuracy and efficiency of CTV contours for both intact and postprostatectomy patients with prostate cancer.
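
The dosimetric quantities named above can be read off a list of per-voxel doses inside the CTV. The sketch below is a simplified illustration under stated assumptions: voxels are of equal volume, Dx% is taken as the minimum dose received by the hottest x% of the volume, and Vx% is the percentage of the volume receiving at least x% of the prescription dose. The function names, the 45 Gy prescription, and the dose values are hypothetical and do not come from any reviewed study.

```python
import math

def dose_at_volume(doses, pct):
    """D_pct: minimum dose received by the hottest pct% of equal-volume voxels."""
    ranked = sorted(doses, reverse=True)
    k = max(1, math.ceil(len(ranked) * pct / 100))
    return ranked[k - 1]

def volume_at_dose(doses, threshold):
    """Percentage of voxels receiving at least `threshold` dose."""
    return 100 * sum(d >= threshold for d in doses) / len(doses)

# Hypothetical CTV of 100 voxels with a 45 Gy prescription.
prescription = 45.0
ctv_doses = [45.0] * 90 + [42.0] * 8 + [40.0] * 2

d98 = dose_at_volume(ctv_doses, 98)                   # near-minimum dose
d2 = dose_at_volume(ctv_doses, 2)                     # near-maximum dose
v95 = volume_at_dose(ctv_doses, 0.95 * prescription)  # coverage at 42.75 Gy
print(d98, d2, v95)  # 42.0 45.0 90.0
```

Comparing these values between a plan evaluated on the DLAS contour and the same plan evaluated on the manual contour is one way to express the dosimetric agreement that the reviewed studies seldom reported. Clinical systems derive them from a dose-volume histogram with interpolation, which this sketch omits.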

Discussion
The study of DLAS of CTV has increased in the past 4 years, especially in disease sites such as the cervix and prostate. This could be due to the long contouring time required for external beam cervical cancer cases and the high prevalence of prostate cancer. Even more common cancers, such as lung cancer, did not have DLAS studies meeting our review criteria. DLAS models show promise of accurate contouring of CTV volumes for multiple disease sites based on reported DSC and HD values. Most DLAS models perform CTV contouring faster than manual contouring. In the few articles that reported dosimetric performance, namely in breast and cervical cancer, DLAS models did not perform as well as ground truth contours. These models could reduce the workload burden on radiation oncologists, as their contouring performance is comparable to that of manual contours and atlas-based contours. Manual contours can take up to 60 minutes, and DLAS models can often contour CTV volumes and other volumes in under 10 minutes. However, users must recognize that manual edits to DLAS contours may be required, especially to achieve optimal dosimetry. DLAS model performance may also be limited by variations in the clinical guidelines used, which may not be consistent with the practice patterns of individual physicians or their practices, especially for CTV volumes like regional breast lymph nodes. Limitations of this review include the lack of uniformity in reporting of DLAS model performance, which makes more advanced statistics difficult to perform; the use of only one research database for the literature review; and publication bias, with published studies mostly showing benefit of DLAS models.
Further studies investigating DLAS of CTV volumes are necessary, and there are several improvements that can be made. Future DLAS models investigating disease sites like prostate and rectal cancer require extensive studies and validation in both preoperative and postoperative settings before widespread clinical implementation. Most studies have small sample sizes for DLAS model testing, are limited to data and validation at a single institution, and do not report dosimetric data. Future studies can consider using alternatives to CT imaging, such as MRI or PSMA-PET, to potentially improve accuracy of DLAS models. Also, future studies should include larger sample sizes of patients from multiple institutions, including dosimetric outcomes from DLAS contours, to allow for more generalizable data, which may have wider clinical applicability. Larger sample sizes of patients should include a breakdown of model performance according to stage of cancer and patient demographics, such as race and sex assigned at birth, to better characterize model performance and adaptation to real-world patients. Furthermore, explicit description of guidelines should be enforced across disease sites to allow for consistency.

Conclusion
DLAS will bring significant improvement to the future of contouring within the field, but in the interim, more studies must be done to account for the limitations in data present.

Disclosures
Hefei Liu reports past temporary employment at Varian Medical Systems. Sushil Beriwal has a leadership role as the Vice President of Medical Affairs at Varian Medical Systems, reports grant as an Elsevier consultant, and reports participation in advisory board at Xoft DSMB.

Figure 1
The article screening and inclusion process used in this literature review.

Figure 2
The number of articles on deep learning based autosegmentation of the clinical target volume published each year from 2017 to April 2023.

Table 1
The general characteristics and performance of DLAS model from each article included in this review

Table 2
Summary of DLAS studies across disease site, including number of articles, sample size, surgical status, in-house model use, and performance metric breakdown and summary