Validation and comparison of radiograph-based organ dose reconstruction approaches for Wilms’ tumor radiation treatment plans

Open Access. Published: July 04, 2022.



      To validate and compare the performance of four organ dose reconstruction approaches for historical radiation treatment (RT) planning based on two-dimensional (2D) radiographs.


      We considered 10 Wilms’ tumor patients with planning computed tomography (CT) images for whom we developed typical historic Wilms’ tumor RT plans, using anteroposterior and posteroanterior parallel-opposed 6 MV flank fields, normalized to 14.4 Gy. Two plans were created for each patient, with and without corner blocking. Regions of interest (ROIs: lungs, heart, nipples, liver, spleen, contralateral kidney, and spinal cord) were delineated, and dose-volume metrics including organ mean and minimum dose (Dmean and Dmin) were computed as the reference baseline for comparison. Dosimetry for the 20 plans was then independently reconstructed using four different approaches. Three approaches involved surrogate anatomy: two used demographic-matching criteria for phantom selection/building, and one used machine learning (ML). The fourth approach was also ML-based, but used no surrogate anatomies. Absolute differences in organ dose-volume metrics between the reconstructed and the reference values were calculated.


      For Dmean and Dmin (average and minimum point dose), all four dose reconstruction approaches performed within 10% of the prescribed dose (≤1.4 Gy). The ML-based approaches showed a slight advantage for several of the considered ROIs. For Dmax (maximum point dose), the absolute differences were much higher, i.e., exceeding 14% (2 Gy), with the poorest agreement observed for near-beam and out-of-beam organs for all approaches.


      The studied approaches give comparable dose reconstruction results and the choice of approach for cohort dosimetry for late effects studies should still be largely driven by the available resources (data, time, expertise, and funding).


      Childhood cancer survivors often experience treatment-related late adverse effects (LAEs), which have been linked to an increased risk of chronic morbidity and mortality [1-4]. In particular, radiation treatment (RT) is a component of approximately half of all cancer treatments [5] and is associated with a number of LAEs such as second malignant neoplasms and cardiovascular disease [6-8]. Dose-response relationships can be translated into contemporary RT planning by providing dose constraints to organs-at-risk to help mitigate RT-related LAEs in future childhood cancer survivors [6].
      To date, most large retrospective cohorts of childhood cancer survivors include individuals treated in the pre-computed tomography (CT) era of RT (before approximately 1999), for whom treatment was primarily based on conventional 2D radiographs [9, 10]. For these cohorts, doses to organs of interest are not directly available so that methods must be applied to reconstruct the radiation dose; these methods often involve the use of physical or computational phantoms (3D models of human anatomy) or recent planning CTs of other patients as surrogate anatomies to overcome the lack of 3D anatomical imaging [9, 11, 12]. The mean organ dose or prescribed dose to individual body regions are the most commonly reported dose metrics [13-15]. As LAEs are found to be related to dose in specific organs (e.g., second malignant neoplasms [13]) or their sub-volumes (e.g., coronary artery [16]), there is a growing interest in using more refined dose and dose-volume metrics for dose-response modeling of LAEs [13, 17].
      To provide more refined dose information, dose reconstruction is performed from the limited information available. Prior to dose reconstruction, patients’ historic RT records must be abstracted for patient demographics (sex, age at RT, height, and weight), and specific treatment details including beam energy, dose fractionation, field geometry, blocking details, field weighting, field location and anatomic field borders. In some cases, field geometry superimposed on 2D radiographs is available. Figure 1(a) is an example of a historical 2D radiograph used for field placement [15]. The surrogate anatomy is chosen based on the available patient demographics, and the historical RT plan is then simulated on the surrogate anatomy using the available treatment details [12, 14, 18]. The resulting dose distribution can then be used to derive 3D organ dose-volume metrics for dose-response analysis.
      Figure 1. a) An example of a historical 2D radiograph with field geometry (AP) is shown. The white corners, made by four thin lead wires placed on the patient's body, depict the field boundaries. The white cross indicates the field isocenter and the ruler indicates the field size. b) An example of a digitally reconstructed radiograph (DRR, derived from CT) with field geometry (AP) is shown. The solid yellow lines depict the effective field boundary. The red point, which is the cross of the dashed yellow lines, indicates the field isocenter.
      There exist several options for the surrogate anatomy, each with its pros and cons. Stylized computational phantoms have a simplified geometrical representation of human anatomy, but can be easily scaled to different sizes and adapted to include additional organs or organ substructures. Stylized computational phantoms are commonly used for dosimetry in large retrospective childhood cancer survivor cohorts [9, 11, 18]. Voxel computational phantoms, on the other hand, are created from 3D medical images of real patients of specific ages and thus are more realistic, but are rigid and cannot be flexibly scaled or repositioned [19]. Advanced boundary representation modeling methods offer a hybrid approach which enables organ reshaping and re-positioning, while still allowing for realistic anatomical representation [20]. Such an approach has been used to develop a number of computational phantoms that represent average anatomies of the adult and pediatric populations [20-22] based on CT images, as well as different weight percentiles [23], but these phantoms are typically only available for discrete ages [11, 14, 20, 22, 23]. Patient-specific CT images with organ delineations provide an alternative type of surrogate anatomy [12]. Due to the routine use of CT in RT planning, it is relatively easy to prepare and import a surrogate CT scan into a treatment planning system (TPS) for dose calculation compared to the computational phantoms. However, the planning CT scans typically only include anatomy near the site of treatment, whereas LAEs such as second solid tumors may occur throughout the entire body [24].
      There are different sources of uncertainty in RT dose reconstruction. One source of uncertainty comes from the difference between the surrogate anatomy and the historical patient's unknown anatomy during treatment [11, 14]. The most commonly used patient-to-surrogate matching criteria are age and sex [9, 12]. Other patient-to-surrogate matching criteria based on available patient demographics have also been investigated, such as height and weight percentiles [10, 23] and water equivalent diameter (defined as average of the scanned range of the body) [25]. When a 2D radiograph of the historical patient is available, more features can be used for dose reconstruction (e.g., using 2D radiographs to guide 3D organ deformation [26]). Recently, machine learning (ML) has been leveraged in dose reconstruction approaches based on datasets of CT scans, to predict 3D information (anatomy or dose) from the available 2D features [XX, XX]. Another source of uncertainty comes from the characterization of radiation beams used in the dose reconstruction calculations [14]. Different dose calculation methods, such as the model-based algorithms encountered in TPSs, Monte Carlo radiation transport simulation, and measurement-based algorithms, are then used to estimate dose distributions in the surrogate anatomies [9, 10, 14, 27]. However, studies have shown that the dose calculation algorithms in commercial TPSs underestimate out-of-field doses, while Monte Carlo simulations are more accurate [28, 29].
      Because reconstructed doses are used for dose-response modeling [3, 10, 18], the uncertainty in the reconstructed dose should be reported and incorporated into the models [14, 30]. This is crucial for developing robust dose-response models that can be directly translated into contemporary RT planning, i.e., used to define objectives for organs-at-risk. However, only a few study groups have reported a validation of their dose reconstruction approaches [12, 18, 26, 31].
      In this study, we carried out a collaborative effort among several institutions to validate and compare four different dose reconstruction approaches. To simulate ground truth organ doses for the validation, we used CT scans of recently treated Wilms’ tumor patients and their clinical plans (with small adaptations) as reference data. For each of the contemporary patients, we extracted data analogous to what would be available in typical historic RT records [15] and asked the institutions to independently reconstruct the RT dose according to their previously published approaches.
      Patient cohort, plan design, and organ delineations
      We investigated Wilms’ tumor dose reconstructions because this type of kidney cancer is one of the most common types of childhood cancer in the abdominal region and the RT flank fields have not changed significantly over several decades [32-34]. We considered all pediatric patients (18 patients in total) treated for Wilms’ tumor between 2004 and 2016 in our hospital with a planning CT and complete treatment record. By creating an overview of a set of characteristics associated with these patients (i.e., age, sex, height, weight, tumor laterality, treatment field) and considering the representativeness of these characteristics for the Wilms’ tumor patient population, we selected 10 Wilms’ tumor patients. For these 10 patients, the age at the time of RT planning CT ranged from 2.5 to 5.5 years. The common longitudinal field-of-view shared by these CTs extended from the 10th thoracic vertebra (T10) to the 1st sacral vertebra (S1). Detailed patient information can be found in Table 1.
      Table 1. Characteristics of the 10 patients and the associated RT plans (two plans per patient).
      Patient | Age (year) | Sex | Height (cm) | Weight (kg) | Tumor laterality | Field cranial/caudal borders | Plan ID | Shielding (Yes/No) | Shielded region border
       |  |  |  |  |  |  | P1RB | Yes | Right body contour
       |  |  |  |  |  |  | P3RB | Yes | T8, T9, part of liver
       |  |  |  |  |  |  | P3RB | Yes | Right rib 9-10
       |  |  |  |  |  |  | P5RB | Yes | Right rib 9-10
       |  |  |  |  |  |  | P6RB | Yes | Right rib 10
       |  |  |  |  |  |  | P8LB | Yes | Left rib 10
       |  |  |  |  |  |  | P9LB | Yes | Left rib 9-10
       |  |  |  |  |  |  | P10LB | Yes | Left rib 10
      F: female; M: male; T1 through T12 represent the 12 thoracic vertebrae; L1 through L5 represent the five lumbar vertebrae; S1 through S5 represent the five sacral vertebrae. R: right-sided plan; RB: right-sided plan with a block. L: left-sided plan; LB: left-sided plan with a block. *: The prescribed dose of P1 and P6 was rescaled to 14.4 Gy in the dose reconstruction analysis.
      For each patient (P1-10) we created two typical Wilms’ tumor RT plans with 6 MV anteroposterior (AP) and posteroanterior (PA) parallel opposed flank fields [33, 35]. The first plan involved open fields (P1-10R or L, where R or L refers to a right-sided or left-sided field) and the second (P1-10RB or LB) included small corner blocks (see block information in Table 1). The plans were developed by a pediatric radiation oncologist using the Oncentra TPS (version 4.3, Elekta AB, Stockholm, Sweden). We delineated regions of interest (ROIs) on the CT including lungs, heart, nipples, liver, spleen, contralateral kidney (ipsilateral kidneys surgically removed prior to RT), and a sub-volume of the spinal cord from T10 to S1 using Velocity (version 3.2.0, Varian Medical Systems, Inc., Palo Alto, CA, US). For the lungs and heart, we delineated the portion of these organs that was imaged in the CT scans (for 8 and 5 out of 10 patients, the CT scans did not include complete lungs or heart, respectively).
      Reference dose calculation
      For validation purposes, reference dose values were extracted from the 3D dose distributions of the designed RT plans calculated on the CTs by the Oncentra TPS. All plans were designed assuming an Elekta Linac treatment machine using 6 MV photons. A collapsed cone algorithm was used to calculate the dose, which was reported to achieve good in-field and near-field dose calculation accuracy [36]. The organ dose-volume metrics that were considered included mean dose (Dmean), minimum dose (Dmin), maximum dose (Dmax), and the percentage of organ volume receiving at least 5 Gy and 10 Gy dose (V5 and V10, respectively). Here, minimum dose and maximum dose refer to the minimum and maximum point dose to an ROI in the reconstructed dose matrix. For the nipples, only Dmean was considered (as the nipple volume is small). For those patients where the volume of the lungs or heart was truncated (8/10 and 5/10, respectively), only Dmax values representing the highest dose were used in the analysis.
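      The metric definitions above can be illustrated with a minimal sketch. It assumes that the dose values of all voxels inside one ROI have already been extracted into a flat list and that every voxel has the same volume; the function name `dose_volume_metrics` and the toy dose values are hypothetical and are not taken from the study.

```python
from statistics import mean


def dose_volume_metrics(roi_doses_gy):
    """Summarize the doses (Gy) of all voxels inside one ROI.

    Assumption: all voxels have equal volume, so a voxel fraction
    equals a volume fraction.
    """
    if not roi_doses_gy:
        raise ValueError("ROI contains no voxels")
    n_voxels = len(roi_doses_gy)
    return {
        "Dmean": mean(roi_doses_gy),
        "Dmin": min(roi_doses_gy),   # minimum point dose in the ROI
        "Dmax": max(roi_doses_gy),   # maximum point dose in the ROI
        # Percentage of the ROI volume receiving at least 5 Gy / 10 Gy.
        "V5": 100.0 * sum(d >= 5.0 for d in roi_doses_gy) / n_voxels,
        "V10": 100.0 * sum(d >= 10.0 for d in roi_doses_gy) / n_voxels,
    }


# Hypothetical ROI with five voxels (doses in Gy).
toy_roi = [0.5, 2.0, 6.0, 11.0, 14.4]
metrics = dose_volume_metrics(toy_roi)
print(metrics)
```

      In a real workflow the voxel list would come from the dose matrix masked by the ROI delineation; that extraction step is omitted here.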
      Preparation of input data for dose reconstruction
      The following sections summarize the instructions shared among the participating institutes performing independent dose reconstructions according to their particular approach.
      Digitally reconstructed radiographs with plotted field
      Digitally reconstructed radiographs (DRRs) were generated from the CTs using the built-in module in the TPS for each beam's eye view (i.e., AP and PA) with the field geometry plotted on top of it to simulate historical radiographs (Figure 1(b)). We selected an enhancement setting (min/max CT data threshold -300/3095, center 1500, width 3000, bone threshold 100, and bone enhancement factor 2.5) in Oncentra TPS that gave similar contrast (based on a visual check) as historical radiographs.
      Data coding forms to describe patient and plan information
      The RT record abstraction was prepared according to the data coding forms proposed by [9]. Patient information such as name and date of birth were anonymized. The RT details were abstracted and then checked by an experienced dosimetrist. Abstraction followed the methods described in [18] and is briefly summarized here. In total the data coding forms consisted of three pages. The first page was used to collect basic information such as the maximum target dose to each body region, which is defined as the sum of the prescribed dose from all overlapping fields, i.e., the AP and PA fields. The second page included details on prescription(s) and treatment field parameters, including dose, orientation, energy, field size, weighting, shielding and anatomical borders. The third page was used to collect information about the proximity of organs of interest to the treatment fields, solely based on visually checking the prepared 2D radiographs. Proximity to the treatment field was specified as in-beam, at beam edge, near-beam, out-of-beam, or shielded.
      Dose reconstruction approaches
      The four dose reconstruction approaches included in this study are listed below and the processes are summarized in Figure 2 and described in detail in the following paragraphs. The examples of the surrogate anatomies are illustrated for methods 1-3 in Figure 3.
      Figure 2. An illustration of steps taken by the four different dose reconstruction approaches given the same input data, i.e., data coding forms and historical-like radiographs of the two beams.
      Figure 3. a) An illustration of the stylized computational phantom showing the organs (represented by 3D grids of points) used by APPROACH 1. Note that since the time of this study, the heart model in this phantom has been updated [XX]. b) A coronal view of an example of a computational phantom used by APPROACH 2. The colored regions are representations of the segmented organs. c) A front sectional view of a patient-specific surrogate anatomy constructed by APPROACH 3. The colored regions are representations of the ‘implanted’ organs.
      1) APPROACH 1: an age-scaled stylized computational phantom-based approach [XX, XX],
      2) APPROACH 2: a multiple-feature matched computational phantom-based approach [XX, XX],
      3) APPROACH 3: a surrogate anatomy ML-based approach [XX],
      4) APPROACH 4: a surrogate-free ML-based approach [XX, XX].
      APPROACH 1
      The details of the age-scalable stylized computational phantom-based approach and its use in dose reconstructions for late effects studies are described in the literature [XX, XX]. The computational phantom consists of rectangular cuboids for the head, neck, trunk, arms, and legs; organs are specified by 3D grids of evenly spaced points.
      For each of the 10 patients, the phantom was scaled to their age at RT by applying 3D scaling functions that account for non-uniform growth of different body regions [XX]. Organs for each phantom were also scaled to age at RT according to the scaling functions that were applied to each of the respective body regions. RT plans were then reconstructed on the age-scaled phantoms based on the field parameters in the coding forms and a visual check of field placements compared to the radiographs [XX]. Doses to all points in each organ were calculated using analytical dose models [XX], from which Dmax, Dmean, and Dmin were reported. For the right and left nipple, doses were reported for a single point on each side. For the spinal cord, doses were the average of the central point in each vertebra (T10, T11, L1 to L5, and S1). For the spleen, no dose was reported because a dose grid representing the spleen is not available in the computational phantom. For this study, only the RT plans with rectangular open-beam fields (n=10) were reconstructed using APPROACH 1; however, in principle, blocking is also possible [XX].
      APPROACH 2
      The multiple-feature matched computational phantom-based approach, based on a library (n=351) of whole-body computational phantoms covering a large portion of the population in the United States in terms of age, sex, height, and weight, was previously developed [XX, XX]. The approach considers multiple features (e.g., age, sex, height, and weight), as available for a particular study, when selecting the surrogate phantom. In this study, patient height, weight, and sex were provided in the data coding forms and were used to select the closest matched phantom from the phantom library as the surrogate anatomy. Next, the phantom (in the format of DICOM files [XX] [10]) was imported into a commercial TPS, Eclipse™ (Varian Medical Systems, Palo Alto, CA). RT plans were reconstructed based on the data coding forms and radiographs. Once the RT plans were created, the plan data were exported from the TPS for organ dose calculation using an RT-dedicated Monte Carlo transport code [10]. Additional details on this approach are available in the literature [XX, XX]. Right and left nipple doses were not reported as these structures were not explicitly defined in the phantom. Point doses in T10-S1 vertebrae were reported as a surrogate for the dose to the spinal cord.
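      The phantom selection step can be sketched as a nearest-neighbor lookup. The text does not specify how "closest matched" is measured, so the distance used here (sex-matched candidates only, then relative differences in height and weight combined as a Euclidean distance) is an assumption for illustration, as are the `Phantom` class, the function `closest_phantom`, and the three toy library entries.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Phantom:
    phantom_id: str   # hypothetical identifier
    sex: str          # "F" or "M"
    height_cm: float
    weight_kg: float


def closest_phantom(library, sex, height_cm, weight_kg):
    """Return the sex-matched phantom closest in height and weight.

    Assumed metric: Euclidean distance of the relative height and
    weight differences. The published approach may use another rule.
    """
    candidates = [p for p in library if p.sex == sex]
    if not candidates:
        raise ValueError(f"no phantom of sex {sex!r} in the library")

    def distance(p):
        return hypot((p.height_cm - height_cm) / height_cm,
                     (p.weight_kg - weight_kg) / weight_kg)

    return min(candidates, key=distance)


# Toy library; the real library contains 351 whole-body phantoms.
library = [
    Phantom("F_095cm_14kg", "F", 95.0, 14.0),
    Phantom("F_105cm_17kg", "F", 105.0, 17.0),
    Phantom("M_105cm_17kg", "M", 105.0, 17.0),
]
match = closest_phantom(library, sex="F", height_cm=103.0, weight_kg=16.0)
print(match.phantom_id)
```

      Using relative rather than absolute differences keeps height (cm) and weight (kg) on a comparable scale; other weightings are equally plausible.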
      APPROACH 3
      APPROACH 3 is the latest extension of [XX]. The approach incorporates ML to automatically construct patient-specific phantoms. Among the several ML models we tested, we selected the model resulting from the Gene-pool Optimal Mixing Evolutionary Algorithm for Genetic Programming (GP-GOMEA) for this study. GP-GOMEA is a state-of-the-art algorithm for learning interpretable machine learning models in the form of mathematical expressions [37] and, in particular, GP-GOMEA was recently shown to have better prediction performance than several other models (e.g., LASSO and random forest) in the task of constructing individualized phantoms [38]. The training data set was similar to that used in [XX], with some enhancements. Specifically, the training data set included a larger database of 136 CTs of pediatric cancer patients in the age range of 1 to 8 years, with more organs delineated. For each of the 136 CT scans, various features analogous to those available in historical radiographs were extracted from DRRs. Multiple ML models (one per ROI) were trained to separately predict the most similar organs and body contours, and the most likely location of each organ's center of mass, based on the extracted features [XX]. Next, the predicted best-matching organs (which may belong to different surrogate CTs) were automatically ‘virtually implanted’ at the predicted locations within the predicted body contour, forming a composite patient-specific phantom.
      A list of the features of the 10 patients, extracted from the coding forms and 2D radiographs, was used as input for this approach. The result was 10 patient-specific phantoms, which were then imported into the Oncentra TPS for manual reconstruction of the RT plans on the phantom as described in [XX]. Doses were calculated by the TPS based on the surrogate anatomy using the same collapsed cone algorithm as was used for the reference dose calculation.
      APPROACH 4
      APPROACH 4 is a dose prediction approach based on ML which does not require any surrogate anatomy [XX]. For this approach we also used the ML algorithm GP-GOMEA to build the models. In the implementation of the approach for this study, ML models were generated to directly predict organ dose-volume metrics given a list of available 2D patient and plan features. The training data included 136 abdominal CTs of patients between 1 and 8 years old and 300 artificial Wilms’ tumor RT plans. The artificial RT plans were automatically generated by sampling within plan border ranges defined by an experienced clinical oncologist. Each plan was simulated on each CT, resulting in a total of 40,800 dose distributions. The calculated organ dose-volume metrics were then used as response variables to train the ML models, whereas patients’ features available in historical records and detectable on DRRs, as well as features of the RT plan, were used as explanatory variables. Separate ML models were then generated for each organ dose-volume metric. The features of the 10 patients and 20 plans were then fed into the trained ML models, from which organ dose-volume metrics were obtained.
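      The size of the training set follows from pairing every artificial plan with every CT (136 × 300 = 40,800). The sketch below reproduces that count and shows one possible way to sample plan borders. The border names, the numeric ranges, and the use of uniform sampling are assumptions made purely for illustration; the text only states that plans were sampled within ranges defined by a clinical oncologist.

```python
import random
from itertools import product

N_CTS = 136     # abdominal CTs in the training data (from the text)
N_PLANS = 300   # artificial Wilms' tumor RT plans (from the text)

# Hypothetical border shifts in cm relative to anatomical landmarks.
BORDER_RANGES_CM = {
    "cranial": (-2.0, 2.0),
    "caudal": (-2.0, 2.0),
    "medial": (-1.0, 1.0),
    "lateral": (-1.0, 1.0),
}


def sample_plan(rng):
    """Draw one artificial plan by sampling each border uniformly (assumed)."""
    return {name: rng.uniform(low, high)
            for name, (low, high) in BORDER_RANGES_CM.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
plans = [sample_plan(rng) for _ in range(N_PLANS)]

# Each plan is simulated on each CT: one dose distribution per (CT, plan) pair.
ct_plan_pairs = list(product(range(N_CTS), range(N_PLANS)))
n_dose_distributions = len(ct_plan_pairs)
print(n_dose_distributions)
```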
      Dose evaluation
      To assess the level of agreement between the reference doses and the reconstructed doses obtained by the four approaches, we computed the absolute difference (subtracting the reconstructed value from the reference value and taking the absolute value) for organ mean, minimum, maximum dose, V5 and V10 (denoted by DEmean, DEmin, DEmax, DEV5, and DEV10, respectively). To make results comparable, all the plans were normalized to a prescribed dose of 14.4 Gy.
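      As a minimal sketch of the comparison described above, the snippet below rescales a dose to the common 14.4 Gy prescription, computes the absolute difference, and expresses it as a percentage of the prescribed dose. The function names, the 10.8 Gy original prescription, and the organ dose values are hypothetical and serve only as a worked example.

```python
PRESCRIBED_DOSE_GY = 14.4  # common prescription used for all plans


def rescale_to_prescription(dose_gy, plan_prescription_gy,
                            target_gy=PRESCRIBED_DOSE_GY):
    """Linearly rescale a dose (Gy) from the plan's own prescription.

    Applies to dose values (Dmean, Dmin, Dmax); threshold-based metrics
    such as V5 and V10 must be recomputed from the rescaled dose
    distribution instead.
    """
    return dose_gy * target_gy / plan_prescription_gy


def absolute_difference(reference, reconstructed):
    """DE = |reference value - reconstructed value|."""
    return abs(reference - reconstructed)


def percent_of_prescription(dose_gy, target_gy=PRESCRIBED_DOSE_GY):
    """Express a dose difference as a percentage of the prescribed dose."""
    return 100.0 * dose_gy / target_gy


# Hypothetical example: reference Dmean of 5.4 Gy in a plan originally
# prescribed to 10.8 Gy, compared with a reconstructed Dmean of 6.0 Gy
# that is already on the 14.4 Gy scale.
reference_dmean = rescale_to_prescription(5.4, plan_prescription_gy=10.8)
de_mean = absolute_difference(reference_dmean, 6.0)
print(round(reference_dmean, 2), round(de_mean, 2),
      round(percent_of_prescription(de_mean), 1))
```

      On this scale, 10% of the prescribed dose corresponds to 1.44 Gy.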
      In addition to providing the average and range of the differences for each of the organ dose-volume metrics, Wilcoxon rank-sum testing was performed to check whether differences between the deviation distributions obtained by the various approaches were statistically significant (p-value <0.05).
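      For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon rank-sum test using the large-sample normal approximation (average ranks for ties, no tie or continuity correction). The text does not state which implementation or variant was used in the study, so this is an illustration only; the function names and the two samples of deviation values are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided rank-sum test via the normal approximation.

    Returns (z, p). For very small samples an exact test is preferable.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                          # rank sum of sample x
    expected = n1 * (n1 + n2 + 1) / 2
    std_dev = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / std_dev
    p = erfc(abs(z) / sqrt(2))                   # equals 2 * (1 - Phi(|z|))
    return z, p


# Hypothetical DEmin values (Gy) for two approaches.
de_a = [0.1, 0.2, 0.3, 0.4, 0.5]
de_b = [1.1, 1.2, 1.3, 1.4, 1.5]
z, p = rank_sum_test(de_a, de_b)
print(round(z, 3), p < 0.05)
```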
      The average and range of the magnitude of the organ dose-volume metric differences obtained by the four approaches compared to the reference doses are summarized in Table 2 (for DEmean, DEV5, and DEV10) and Table 3 (for DEmin and DEmax). The values of Dmean calculated by the four approaches along with the reference values for the 20 cases are presented in the supplementary material. Most of the organs considered in this study are in-field or near-field organs (the contralateral kidney, spleen, liver, and spinal cord) for which the reference dose metrics calculated from the TPS can be considered to be ground truth for comparison purposes.
      Table 2. Average (range) of DEmean (in Gy), DEV5 and DEV10 (in %) of organ dose reconstructions (for a subset of the organs) obtained for the 20 reconstruction cases. The doses for all plans were scaled to a prescribed dose of 14.4 Gy before comparison. For each dose-volume metric and for each organ, the smallest average of the deviation values among the approaches is indicated in bold. Similarly, the smallest range of the deviation values is formatted in bold in the brackets. R.: Right, L.: Left. Empty fields indicate that the organ dose-volume metric is not available for the respective approach.

      DEmean (Gy)
      Organ | APPROACH 1* | APPROACH 2 | APPROACH 3 | APPROACH 4
      R. Nipple | 1.0 (0.2-5.1) |  | 0.6 (0.0-2.8) | 0.6 (0.0-3.6)
      L. Nipple | 0.6 (0.0-1.5) |  | 0.3 (0.0-2.6) | 0.4 (0.0-1.5)
      Liver | 1.4 (0.2-3.0) | 1.4 (0.2-3.3) | 1.1 (0.1-2.3) | 1.4 (0.1-3.6)
      Spleen |  | 0.5 (0.0-1.3) | 0.9 (0.1-4.7) | 0.9 (0.0-2.9)
      R. Kidney | 0.6 (0.3-0.9) | 1.3 (0.6-2.2) | 1.3 (0.1-2.5) | 1.2 (0.5-2.3)
      L. Kidney | 0.8 (0.2-1.4) | 0.7 (0.0-2.0) | 0.7 (0.3-1.0) | 0.4 (0.0-0.8)
      Spinal Cord | 0.8 (0.2-1.4) | 0.7 (0.0-2.0) | 0.7 (0.3-1.0) | 0.4 (0.0-0.8)

      DEV5 (%)
      Organ | APPROACH 1* | APPROACH 2 | APPROACH 3 | APPROACH 4
      Spleen |  | 15 (0-39) | 7 (1-17) | 10 (3-25)
      R. Kidney |  | 3 (0-10) | 7 (0-34) | 6 (0-27)
      L. Kidney |  | 11 (4-20) | 11 (1-20) | 11 (5-22)
      Spinal Cord |  | 6 (0-15) | 6 (2-9) | 3 (1-6)

      DEV10 (%)
      Organ | APPROACH 1* | APPROACH 2 | APPROACH 3 | APPROACH 4
      Spleen |  | 13 (0-27) | 8 (2-17) | 10 (2-26)
      R. Kidney |  | 3 (0-11) | 6 (0-35) | 7 (0-23)
      L. Kidney |  | 6 (1-10) | 9 (1-18) | 7 (2-14)
      Spinal Cord |  | 4 (2-11) | 4 (1-6) | 3 (0-4)

      *: For APPROACH 1, the reconstruction for plans with a corner block applied was not performed. The statistics of the results are based on reconstruction outcomes of plans with open fields only.
      Table 3. Average (range) of DEmin and DEmax values (in Gy) of dose reconstructions obtained for the 20 reconstruction cases. DEmin is available for a subset of organs. The doses for all plans were scaled to a prescribed dose of 14.4 Gy before comparison. For each organ dose-volume metric, the approach with the smallest average of the deviation values is indicated in bold. Similarly, the smallest range of the deviation values among the approaches is formatted in bold in the brackets. R.: Right, L.: Left.

      DEmin (Gy)
      Organ | APPROACH 1* | APPROACH 2 | APPROACH 3 | APPROACH 4
      Liver | 3.4 (0.2-14.2) | 0.3 (0.0-1.0) | 0.3 (0.0-0.5) | 0.1 (0.0-0.3)
      Spleen |  | 0.8 (0.0-10.4) | 0.8 (0.0-12.2) | 0.7 (0.0-6.5)
      R. Kidney | 0.9 (0.7-1.1) | 0.2 (0.0-0.4) | 0.1 (0.0-0.2) | 0.1 (0.0-0.1)
      L. Kidney | 1.0 (0.4-1.5) | 0.2 (0.0-0.4) | 0.1 (0.0-0.2) | 0.1 (0.0-0.2)
      Spinal Cord | 0.7 (0.0-2.5) | 0.4 (0.0-1.7) | 0.1 (0.0-1.3) | 0.9 (0.0-7.2)

      DEmax (Gy)
      Organ | APPROACH 1* | APPROACH 2 | APPROACH 3 | APPROACH 4
      R. Lung | 3.0 (0.2-13.2) | 3.5 (0.1-12.3) | 2.9 (0.0-13.0) | 3.5 (0.5-8.6)
      L. Lung | 3.4 (0.2-10.9) | 3.3 (0.0-11.0) | 4.4 (0.1-11.9) | 3.7 (0.9-14.3)
      Heart | 2.0 (0.2-7.4) | 3.6 (0.0-10.3) | 2.4 (0.0-12.8) | 3.2 (0.0-5.8)
      Liver | 0.6 (0.0-1.8) | 0.3 (0.0-0.9) | 0.5 (0.0-1.6) | 0.4 (0.0-1.2)
      Spleen |  | 6.1 (0.0-14.0) | 3.1 (0.0-12.3) | 3.0 (0.1-10.1)
      R. Kidney | 9.9 (9.3-10.6) | 1.6 (0.2-6.5) | 0.5 (0.1-1.4) | 5.0 (2.7-6.1)
      L. Kidney | 10.2 (8.7-12.2) | 1.1 (0.0-2.9) | 3.0 (0.2-7.5) | 1.5 (0.2-3.6)
      Spinal Cord | 0.5 (0.0-2.7) | 0.6 (0.0-2.9) | 0.4 (0.0-1.8) | 0.5 (0.0-2.6)

      *: For APPROACH 1, the reconstruction for plans with a corner block applied was not performed. The statistics of the results are based on reconstruction outcomes of plans with open fields only.
      For DEmean, an average deviation ≤1.4 Gy (10% of the prescribed dose) was found for most of the organs. Among the seven organs in Table 2, APPROACH 3 was found to have the smallest average DEmean for four organs. However, for none of the organs considered were these differences found to be significantly smaller than those of the other three approaches (p>0.05).
      The largest DEmean values among organs for the four approaches were 5.1 Gy (right nipple) for APPROACH 1, 3.3 Gy (liver) for APPROACH 2, 4.7 Gy (spleen) for APPROACH 3 and 3.6 Gy (right nipple and liver) for APPROACH 4. For DEV5 and DEV10, an average deviation ≤15% of the volume was found for all the organs.
      Except for the values reported by APPROACH 1 for the liver, an average DEmin ≤1.0 Gy was found for all organs by all approaches. APPROACH 4 was found to achieve the smallest DEmin in both average and largest values for four out of five organs. For four out of eight organs (both lungs, heart, and spleen), DEmax was found to be on average ≥2.0 Gy for all approaches. Among the eight organs, APPROACH 2 and APPROACH 3 each had the smallest average DEmax for three organs, while APPROACH 4 obtained the smallest average DEmax for two organs. The largest deviation (i.e., worst case) reported for each organ's DEmin and DEmax over all approaches was 7.2 Gy and 14.3 Gy, respectively.
      APPROACH 3 was found to have significantly smaller DEmin than the other three approaches for the spinal cord (p=0.01). No other distributions of DEmin or DEmax were found to be statistically different (p>0.05).
      We observed that the reconstructed Dmean of the liver by all approaches has similar variations among the plans with the same laterality (see supplementary material). The differences between Dmean for left-sided plans were smaller than for the right-sided plans. The reconstructed Dmean values of the kidneys generally show small average differences (≤1.3 Gy) for all approaches. The reconstructed Dmean of the spinal cord has similar values across all plans for any of the approaches. When the Dmean of an organ for plans with a corner block has a different value than for plans with a rectangular open field, approaches 2, 3, and 4 were found to be able to capture the trend (i.e., decrease) of the differences but not always accurately (i.e., the magnitude of decrease), e.g., see P9L vs. P9LB in the supplementary material. Furthermore, no significant differences in DEmean between plans with corner blocks applied and plans with open fields were found for these three approaches.
      In this study, the performance of four organ dose reconstruction approaches was validated and compared for the same RT plans. Multiple institutes participated and independently performed the dose reconstruction using their own approach based on the same input data. A comprehensive analysis was performed in order to assess and compare the dosimetry results.
      The results indicate that, on average, the approaches achieved agreement within 10% of the prescribed dose for Dmean and Dmin, and within 15% of the organ volume for V5 and V10. Lower agreement was observed for reconstructed Dmax doses for all approaches for near-field and out-of-field organs (e.g., kidneys and spleen for right-sided plans). For near-field organs, this is mainly attributed to the high dose gradient at the edge of the field. For out-of-field organs, additionally, the reference dose values were calculated by the collapsed cone algorithm in the Oncentra TPS. This algorithm is known to be subject to underestimation compared to Monte Carlo simulations [31], as used in APPROACH 2, and compared to analytical models based on physical measurements, as applied in APPROACH 1. Across all surrogate-based dose reconstruction approaches considered here, for in-beam/near-beam organs, the mismatch between the anatomy of the surrogate phantom and that of the patient represents the main cause of reconstructed dose inaccuracies [11, 14].
      APPROACH 1 yielded the largest DEmin values for all organs except the spinal cord (largest for APPROACH 4). This can potentially be due to the rough geometrical modeling of the human anatomy in the APPROACH 1 phantom (Figure 3).
      Limitations of our study include the small number of patients and plans, as well as the sole focus on Wilms’ tumor plans in the abdominal region, which provided a limited spread of anatomical patient and geometrical plan variations. In addition, we did not consider the uncertainty introduced by using a different dose calculation algorithm in the reference case than the dose calculation algorithms used by APPROACH 1 and APPROACH 2 [10, 18]. APPROACH 3 and APPROACH 4 (training stage) used the same dose calculation (collapsed cone algorithm) as was used for the reference dose calculation. Thus, a bias towards smaller dose differences for out-of-field organs may exist for the two ML-based approaches versus APPROACH 1 and APPROACH 2. Furthermore, some organ dose metrics were not reported for all approaches, such as the spleen dose for APPROACH 1 and the right and left nipple doses for APPROACH 2. Doses for these organs could be added in the future. Finally, for APPROACH 1, dose reconstruction was not performed for plans involving corner blocks.
      In general, compared to APPROACH 1 and 2, APPROACH 3 achieved slightly better average values for Dmean (up to 0.3 Gy) and Dmax (up to 1.1 Gy) for in-field and near-field organs, indicating promising applications of ML for individualized phantom construction. However, this slight advantage does not apply to all organs and was not found to be statistically significant. Considering the relatively small number of patient anatomies (136 CTs) used in the training stage, it is likely that the ML approach can perform better as more data become available to train the ML models, although this remains to be seen [39, 40]. A disadvantage of APPROACH 3 is the possibly unrealistic 3D anatomy (e.g., overlapping organs), as organ shapes and locations were predicted independently [X]. In this study, all 10 automatically assembled patient-specific phantoms had overlapping organ contours to some extent. In contrast, APPROACH 2 utilizes more anatomically consistent phantoms containing a complete set of organs (we refer to [X] for available organs). Overall, better results were obtained by APPROACH 2 than by APPROACH 1 for Dmin and Dmax. This may indicate that more realistic phantoms (APPROACH 2) are a better surrogate anatomy than stylized phantoms, or that height and weight (used to select the representative phantom in APPROACH 2) are important features to consider for patient-phantom matching (APPROACH 1 only considers age). However, for a subset of the organs, APPROACH 2 and APPROACH 3 still had larger dose differences than APPROACH 1, e.g., DEmean for the right kidney was found to be 0.7 Gy smaller for APPROACH 1 than for APPROACH 2 and 3. So, even if APPROACH 1 is arguably simpler than the others in that it uses stylized phantoms and relies on age alone, it remains competitive on some organs. 
This suggests that the position and shape of some organs (like the right kidney) remain very hard to predict well even with more complex matching criteria (APPROACH 2) or ML (APPROACH 3).
      The predicted dose metric values of APPROACH 4 were found to be comparable to the investigated surrogate-based approaches or even better (e.g., for DEmin). Furthermore, in general a smaller range (or largest value) of differences between the reconstructed organ dose-volume metrics and the reference metrics was observed for APPROACH 4, compared to the other three approaches. This indicates that APPROACH 4 has fewer outliers (i.e., large deviations in dose reconstruction), and could be considered more robust. The downside of APPROACH 4 is the absence of the entire 3D dose distribution as it can only predict dose metrics for which the ML models are trained. Nevertheless, the promising results of the two ML-based approaches indicate that leveraging ML on a dataset where variability of patient anatomy is captured can benefit the accuracy, efficiency, and robustness of an ML-based dose reconstruction approach.
      In epidemiologic studies of late effects, the level of detail of dosimetry that is required remains largely unknown [2, 14]. The required dose accuracy for a study greatly depends on the type of tumor, organs of interest, available outcome data, and study design, such as cohort or case-control [26]. Some existing modeling studies use dose bins of 2 Gy, which indicates that the granularity of dose-effect relationships is limited to effects that can be distinguished at the level of 1 Gy dose difference [26]. Dose bins of 1 Gy are almost never used because each category would contain too few people. However, in future studies, if enough data are available and better dose reconstruction accuracy is desirable, smaller dose bins should be used to obtain finer dose-effect relationship models, or the dose might be modeled as a continuous variable instead of a categorical variable. The question of what level of accuracy is needed can be answered more solidly once models using finer dose bins or continuous variables have been evaluated.
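The effect of reconstruction error on categorical dose modeling can be sketched as follows. This is a minimal illustration that assumes equal-width bins starting at 0 Gy. The function names and the example doses are hypothetical and are not study data.

```python
import math


def dose_bin(dose_gy, bin_width_gy=2.0):
    """Return the (lower, upper) edges in Gy of the bin that contains dose_gy."""
    if dose_gy < 0:
        raise ValueError("dose must be non-negative")
    lower = math.floor(dose_gy / bin_width_gy) * bin_width_gy
    return (lower, lower + bin_width_gy)


def bins_shifted(reference_gy, reconstructed_gy, bin_width_gy=2.0):
    """Number of bins separating the reference and reconstructed dose categories."""
    reference_index = math.floor(reference_gy / bin_width_gy)
    reconstructed_index = math.floor(reconstructed_gy / bin_width_gy)
    return abs(reconstructed_index - reference_index)


# Hypothetical doses, for illustration only (not study data).
print(dose_bin(5.0))           # edges of the 2 Gy bin containing 5.0 Gy
print(bins_shifted(5.0, 5.9))  # a 0.9 Gy error that stays within the same bin
print(bins_shifted(5.0, 6.4))  # a 1.4 Gy error that crosses one bin edge
```

An error smaller than the bin width can move a dose by at most one category, whereas a continuous dose variable would carry the full error into the model.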
      We chose several dose metrics for this study. The mean organ dose represents the average of the dose distributed in an organ and is commonly used to model organ-specific effects in published dose-effect studies [13, 16]. The minimum organ dose is considered as an indication of whether the organ is located near the irradiated region. The maximum organ dose is considered because some organs (e.g., the spinal cord) have a serial functional organization, so the relationship between the maximum dose to the organ and LAEs needs to be understood. We further considered DEV5 and DEV10, as dose-volume histogram metrics are more commonly used as dose toxicity predictors and are used in current clinical practice of treatment planning for dose optimization and evaluation [17, 41].
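The metrics discussed above can be computed from per-voxel organ doses as in the following minimal sketch. It assumes equal-volume voxels and takes Vx to mean the percentage of the organ volume receiving at least x Gy, which is the common convention but is an assumption here. The function name and the example doses are hypothetical and are not study data.

```python
def dose_volume_metrics(voxel_doses_gy, thresholds_gy=(5.0, 10.0)):
    """Compute Dmean, Dmin, Dmax, and Vx (% of volume receiving >= x Gy).

    Assumes every voxel has the same volume, so volume fractions reduce
    to voxel counts.
    """
    if not voxel_doses_gy:
        raise ValueError("at least one voxel dose is required")
    n_voxels = len(voxel_doses_gy)
    metrics = {
        "Dmean": sum(voxel_doses_gy) / n_voxels,
        "Dmin": min(voxel_doses_gy),
        "Dmax": max(voxel_doses_gy),
    }
    for threshold in thresholds_gy:
        n_receiving = sum(1 for dose in voxel_doses_gy if dose >= threshold)
        metrics[f"V{threshold:g}"] = 100.0 * n_receiving / n_voxels
    return metrics


# Hypothetical voxel doses for one organ, for illustration only.
organ_doses = [0.5, 2.0, 6.0, 11.5, 14.0]
for name, value in dose_volume_metrics(organ_doses).items():
    print(name, round(value, 2))
```

The corresponding dose errors (e.g., DEmean or DEV5) would then be the absolute differences between the metrics of the reconstructed and the reference dose distributions.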
      Due to the scarcity of data and relevant studies, it is currently difficult to establish the clinical relevance of different dose metrics and what level of inaccuracy in reconstruction is acceptable to still obtain good dose-response modeling. In turn, this is exactly why more research on the validation of dose reconstruction approaches is needed, as well as studies on their use in dose-response modeling and the associated robustness of the dose-response modeling to deviations in the input (i.e., the dose reconstruction values).
      From a practical point of view, considering our results, we conclude that the selection of a dose reconstruction approach should be primarily based on the available historical data and the time, effort, and funding available for dose reconstruction. Approaches that require height and weight, or 2D radiographs, may not be appropriate when these data are not available, which is often the case. For the four approaches we compared, 2D radiographs are necessary for APPROACH 3 and APPROACH 4, but not for APPROACH 1 and APPROACH 2 (although 2D radiographs were used in this study by APPROACH 1 and APPROACH 2 to accurately position a plan on the phantom, which almost certainly benefitted the performance of these approaches). When height and weight information of the patient is not available, APPROACH 2 will only use the age information and impute height and weight from standard growth tables (e.g., using growth charts of the USA). When limited patient information is available, APPROACHES 1 and 2 are more applicable than the ML-based approaches, because age is the only patient feature needed to scale the phantoms. In terms of efficiency, APPROACH 4 is a fully automatic pipeline to generate organ dose-volume metrics. Once the models are trained, the pipeline will automatically generate the required organ dose-volume metrics in seconds, given the historical features of a patient and 2D radiographs. APPROACH 3 takes longer to run (minutes) as it handles 3D imaging to assemble a phantom according to ML predictions, but this is also done automatically. The subsequent plan emulation and dose calculation step can also be carried out by an automatic plan emulation pipeline [XX]. In terms of time, if a cohort includes a large number of patients treated with similar types of RT plans and 2D radiographs are available, APPROACH 4 is recommended as an efficient solution with competitive performance. 
However, if a study cohort includes a smaller amount of patient data associated with RT plans of large variability, the three surrogate-based approaches that provide individualized manual plan emulations are better choices.
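The practical guidance above can be summarized in a small decision sketch. This is a simplified, hypothetical encoding of the considerations in the text, not a validated selection rule. The function name and the two boolean inputs are assumptions made for illustration, and real choices would also depend on time, expertise, and funding.

```python
def candidate_approaches(has_radiographs, large_cohort_with_similar_plans):
    """Suggest candidate dose reconstruction approaches from data availability.

    Hypothetical simplification of the guidance in the text: only the
    availability of 2D radiographs and the cohort type are considered.
    """
    if not has_radiographs:
        # APPROACH 3 and 4 need 2D radiographs; APPROACH 1 and 2 can work
        # from age alone (APPROACH 2 imputes height and weight if missing).
        return ["APPROACH 1", "APPROACH 2"]
    if large_cohort_with_similar_plans:
        # Fully automatic pipeline, efficient for large homogeneous cohorts.
        return ["APPROACH 4"]
    # Smaller cohorts with highly variable plans: surrogate-based approaches
    # with individualized plan emulation.
    return ["APPROACH 1", "APPROACH 2", "APPROACH 3"]


print(candidate_approaches(has_radiographs=True,
                           large_cohort_with_similar_plans=True))
print(candidate_approaches(has_radiographs=False,
                           large_cohort_with_similar_plans=True))
```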
      We compared the performance of four different dose reconstruction approaches for 2D radiograph-based organ dose reconstruction by using the same patient dataset. On average, all dose reconstruction approaches obtained Dmean with similar accuracy (deviation ≤1.4 Gy) for the investigated organs, which can provide reliable input for dose modeling using 2 Gy dose bins. Conversely, predictions for Dmax were found to have much larger deviations, irrespective of the approach, suggesting that their use should be discouraged. A voxel phantom approach using multi-feature matching (APPROACH 2) provided the most realistic anatomy. An age-scaled phantom approach (APPROACH 1) uses the least patient information while providing comparable dose reconstruction outcomes for most of the investigated organs. Finally, ML-based individualized approaches (APPROACH 3/4) achieve results that are competitive with the other two approaches and are likely to improve further with more data, while employing automatic dose reconstruction procedures, which increases efficiency, especially for larger cohorts.
      Conflicts of interest
      Dr. Tanja Alderliesten, Dr. Arjan Bel, and Prof. dr. Peter A.N. Bosman are involved in projects supported by Elekta. Dr. Arjan Bel is involved in projects supported by Varian. KiKa, Elekta, and Varian were not involved in the study design, in the collation, analysis or interpretation of data, in the writing of the manuscript, or in the decision to submit the manuscript for publication.
      Funding statement
      Financial support for this work was provided by Stichting Kinderen Kankervrij (KiKa; project no. 187). Dr. Cécile Ronckers was supported by a grant from the Dutch Cancer Society (#UVA2012-5517).
      Data Sharing Statement
      Research data are not available at this time.
      The authors thank Dr. Petra S. Kroon and Dr. Geert O. Janssens (Department of Radiation Oncology, University Medical Center Utrecht, Princess Máxima Center for Pediatric Oncology, Utrecht, the Netherlands), Prof. dr. Marcel B. van Herk (Manchester Cancer Research Centre, Division of Cancer Sciences, University of Manchester, United Kingdom), Prof. David C. Hodgson (Department of Radiation Oncology, Princess Margaret Cancer Centre, Canada), and Dr. Lorna Zadravec Zaletel (Department of Radiation Oncology, Institute of Oncology Ljubljana, Slovenia) for sharing (anonymized) data (CTs and patient features) of patients treated at their departments for use in the development of the two ML-based approaches. The authors thank the Maurits en Anna de Kock Stichting for financing a high-performance computing system and Elekta for providing the research software ADMIRE for automatic segmentation for data preparation for the two ML-based approaches. Dr. Matthew M. Mille and Dr. Choonsik Lee performed their portions of this study using the computational resources of the NIH high-performance computing Biowulf cluster.
      Declaration of interests
      ☒ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
      Ziyuan Wang reports financial support was provided by Children Cancer Free Foundation. Tanja Alderliesten, Arjan Bel, and Peter Bosman report a relationship with Elekta AB that includes: funding grants. Cecile Ronckers reports a relationship with Dutch Cancer Society that includes: funding grants.
      1. Cheung, Y.T., T.M. Brinkman, C. Li, et al., Chronic health conditions and neurocognitive function in aging survivors of childhood cancer: A report from the Childhood Cancer Survivor Study. JNCI, 2017. 110(4): p. 411-419.
      2. Hoppe, B., R. Howell, M. Ladra, et al., Spermatogenesis After Testicular Radiation Exposure in Children: Initial Results from the Pediatric Normal Tissue Effects in the Clinic (PENTEC) Initiative. IJROBP, 2019. 105(1): p. E631-E632.
      3. Bowers, D.C., Y. Liu, W. Leisenring, et al., Late-occurring stroke among long-term survivors of childhood leukemia and brain tumors: a report from the Childhood Cancer Survivor Study. JCO, 2006. 24(33): p. 5277-5282.
      4. Mulrooney, D.A., G.T. Armstrong, S. Huang, et al., Cardiac outcomes in adult survivors of childhood cancer exposed to cardiotoxic therapy: A cross-sectional study from the St. Jude lifetime cohort. Ann. Intern. Med., 2016. 164(2): p. 93.
      5. Jairam, V., K.B. Roberts, and B.Y. James, Historical trends in the use of radiation therapy for pediatric cancers: 1973-2008. IJROBP, 2013. 85(3): p. e151-e155.
      6. Constine, L., C. Ronckers, C.-H. Hua, et al., Pediatric Normal Tissue Effects in the Clinic (PENTEC): An International Collaboration to Analyse Normal Tissue Radiation Dose–Volume Response Relationships for Paediatric Cancer Patients. Clin. Oncol., 2019. 31(3): p. 199-207.
      7. Turcotte, L.M., Q. Liu, Y. Yasui, et al., Temporal trends in treatment and subsequent neoplasm risk among 5-year survivors of childhood cancer, 1970-2015. JAMA, 2017. 317(8): p. 814-824.
      8. Bates, J.E., R.M. Howell, Q. Liu, et al., Therapy-related cardiac risk in childhood cancer survivors: an analysis of the Childhood Cancer Survivor Study. JCO, 2019. 37(13): p. 1090.
      9. Stovall, M., R. Weathers, C. Kasper, et al., Dose reconstruction for therapeutic and diagnostic radiation exposures: use in epidemiological studies. Radiat. Res, 2006. 166(1): p. 141-157.
      10. Lee, C., J.W. Jung, C. Pelletier, et al., Reconstruction of organ dose for external radiotherapy patients in retrospective epidemiologic studies. Phys. Med. Biol., 2015. 60(6): p. 2309.
      11. Xu, X.G., An exponential growth of computational phantom research in radiation protection, imaging, and radiotherapy: a review of the fifty-year history. Phys. Med. Biol., 2014. 59(18): p. R233.
      12. Wang, Z., I.W. van Dijk, J. Wiersma, et al., Are age and gender suitable matching criteria in organ dose reconstruction using surrogate childhood cancer patients’ CT scans? Med. Phys., 2018. 45(6): p. 2628-2638.
      13. de Gonzalez, A.B., E. Gilbert, R. Curtis, et al., Second solid cancers after radiation therapy: a systematic review of the epidemiologic studies of the radiation dose-response relationship. IJROBP, 2013. 86(2): p. 224-233.
      14. Bezin, J.V., R.S. Allodji, J.-P. Mège, et al., A review of uncertainties in radiotherapy dose reconstruction and their impacts on dose–response relationships. J. Radiol. Prot., 2017. 37(1): p. R1.
      15. van Dijk, I.W., M.C. Cardous-Ubbink, H.J. van der Pal, et al., Dose-effect relationships for adverse events after cranial radiation therapy in long-term childhood cancer survivors. IJROBP, 2013. 85(3): p. 768-775.
      16. Hahn, E., H. Jiang, A. Ng, et al., Late cardiac toxicity after mediastinal radiation therapy for Hodgkin lymphoma: contributions of coronary artery and whole heart dose-volume variables to risk prediction. IJROBP, 2017. 98(5): p. 1116-1123.
      17. Gagliardi, G., L.S. Constine, V. Moiseenko, et al., Radiation dose–volume effects in the heart. IJROBP, 2010. 76(3): p. S77-S85.
      18. Howell, R.M., S.A. Smith, R.E. Weathers, et al., Adaptations to a Generalized Radiation Dose Reconstruction Methodology for Use in Epidemiologic Studies: An Update from the MD Anderson Late Effect Group. Radiat. Res, 2019. 192: p. 169-188.
      19. Xu, X.G. and K.F. Eckerman, Handbook of anatomical models for radiation dosimetry. 2009: Taylor & Francis.
      20. Geyer, A.M., S. O'Reilly, C. Lee, et al., The UF/NCI family of hybrid computational phantoms representing the current US population of male and female children, adolescents, and adults—application to CT dosimetry. Phys. Med. Biol., 2014. 59(18): p. 5225.
      21. Cassola, V., V. de Melo Lima, R. Kramer, and H. Khoury, FASH and MASH: female and male adult human phantoms based on polygon mesh surfaces: I. Development of the anatomy. Phys. Med. Biol., 2009. 55(1): p. 133.
      22. Lee, C., D. Lodwick, J. Hurtado, et al., The UF family of reference hybrid phantoms for computational radiation dosimetry. Phys. Med. Biol., 2010. 55(2): p. 339.
      23. Cassola, V., F. Milian, R. Kramer, et al., Standing adult human phantoms based on 10th, 50th and 90th mass and height percentiles of male and female Caucasian populations. Phys. Med. Biol., 2011. 56(13): p. 3749.
      24. Diallo, I., N. Haddy, E. Adjadj, et al., Frequency distribution of second solid cancer locations in relation to the irradiated volume among 115 patients treated for childhood cancer. IJROBP, 2009. 74(3): p. 876-883.
      25. Stepusin, E.J., D.J. Long, E.L. Marshall, and W.E. Bolch, Assessment of different patient‐to‐phantom matching criteria applied in Monte‐Carlo based computed tomography dosimetry. Med. Phys., 2017. 44.
      26. Ng, A., K.K. Brock, M.B. Sharpe, et al., Individualized 3D reconstruction of normal tissue dose for patients with long-term follow-up: a step toward understanding dose risk for late toxicity. IJROBP, 2012. 84(4): p. e557-e563.
      27. Wang, Z., B. Balgobind, M. Virgolin, et al., How do patient characteristics and anatomical features correlate to accuracy of organ dose reconstruction for Wilms’ tumor radiation treatment plans when using a surrogate patient's CT scan? J. Radiol. Prot., 2019. 39: p. 598-619.
      28. Howell, R.M., S.B. Scarboro, S.F. Kry, et al., Accuracy of out-of-field dose calculations by a commercial treatment planning system. Phys. Med. Biol., 2010. 55(23): p. 6999.
      29. Chen, W.-Z., Y. Xiao, and J. Li, Impact of dose calculation algorithm on radiation therapy. World Journal of Radiology, 2014. 6(11): p. 874.
      30. Ntentas, G., S.C. Darby, M.C. Aznar, et al., Dose-response relationships for radiation-related heart disease: Impact of uncertainties in cardiac dose reconstruction. Radiother. Oncol., 2020. 153: p. 155-162.
      31. Mille, M.M., J.W. Jung, C. Lee, et al., Comparison of normal tissue dose calculation methods for epidemiological studies of radiotherapy patients. J. Radiol. Prot., 2018. 38(2): p. 775.
      32. van Dijk, I.W., F. Oldenburger, M.C. Cardous-Ubbink, et al., Evaluation of late adverse events in long-term Wilms' tumor survivors. IJROBP, 2010. 78(2): p. 370-378.
      33. Jereb, B., J.M.V. Burgers, M.F. Tournade, et al., Radiotherapy in the SIOP (International Society of Pediatric Oncology) nephroblastoma studies: a review. Med. Pediatr. Oncol., 1994. 22(4): p. 221-227.
      34. Termuhlen, A.M., J.M. Tersak, Q. Liu, et al., Twenty‐five year follow‐up of childhood Wilms tumor: A report from the Childhood Cancer Survivor Study. Pediatr. Blood Cancer, 2011. 57(7): p. 1210-1216.
      35. D'Angio, G.J., M. Tefft, N. Breslow, and J.A. Meyer, Radiation therapy of Wilms' tumor: Results according to dose, field, post-operative timing and histology. IJROBP, 1978. 4(9-10): p. 769-780.
      36. Ahnesjö, A., Collapsed cone convolution of radiant energy for photon dose calculation in heterogeneous media. Med. Phys., 1989. 16(4): p. 577-592.
      37. Virgolin, M., T. Alderliesten, C. Witteveen, and P.A.N. Bosman, Improving model-based genetic programming for symbolic regression of small expressions. Evol. Comput., 2021. 29: p. 211-237.
      38. Virgolin, M., Z. Wang, T. Alderliesten, and P.A. Bosman. Machine learning for automatic construction of pediatric abdominal phantoms for radiation dose reconstruction. in Medical Imaging 2020: Imaging Informatics for Healthcare, Research, and Applications. 2020. International Society for Optics and Photonics.
      39. Wei, Z., W. Wang, J. Bradfield, et al., Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. Am. J. Hum. Genet., 2013. 92(6): p. 1008-1012.
      40. Zhou, L., S. Pan, J. Wang, and A.V. Vasilakos, Machine learning on big data: Opportunities and challenges. Neurocomputing, 2017. 237: p. 350-361.
      41. Kirkpatrick, J.P., A.J. van der Kogel, and T.E. Schultheiss, Radiation dose–volume effects in the spinal cord. IJROBP, 2010. 76(3): p. S42-S49.

      Appendix. Supplementary materials