This notebook has all the code required to emulate the results from a simulation of tumour growth shown in - Williams, M. J., Werner, B., Barnes, C. P., Graham, T. A., & Sottoriva, A. (2016). Identification of neutral tumor evolution across cancer types. Nature Genetics. http://doi.org/10.1038/ng.3489
Notebook is laid out as follows, the first cell includes code to include the required packages, and the module NeutralEvolution that defines all types and functions used in the notebook. The following sections each have headings and short descriptions of the results generated and produces equivalent figures to those seen in the publication. In the publication, outputs from the simulations were saved and figures generated in R(v3), in this notebook we used the Gadfly package to produce inline plots.
Below is taken from the paper and explains the methodology of the simulation.
To further validate our analytical model and to test the robustness to the noise in NGS data, we developed a stochastic simulation of tumor growth and accumulation of mutations that allowed us to generate synthetic datasets. The model was written and analysed in the Julia programming language. We then applied the analytical model to the simulated data to confirm that sources of noise in NGS data do not considerably impact our results. In particular, we verified that we can reliably extract input parameters of the simulation (namely the mutation rate) from “noisy” synthetic data. Confounding factors in the data include normal contamination, sampling effects, the detection limit of NGS mutation calling, and variable read depth. We simulate a tumor using a branching process with discrete generations, beginning with a single “transformed” cancer cell that gives rise to the malignancy. Under exponential growth, the population at time t will be given by:
Where R is the average number of offspring per cell and the time t is in units of generations. We will consider primarily the case when R=2 (a cell always divides into 2), but we will also consider values <2, noting that R must be greater than 1 to have growth. At each division, cells acquire new mutations at a rate μ and we assume every new mutation is unique (infinite sites approximation). The number of mutations acquired by a newborn cell at division is a random number drawn from a Poisson distribution. Each cell in the population is defined by its mutations and its ancestral history (by recording it’s parent cell). Using this information we can then reconstruct the history of the whole tumor and crucially, calculate the variant allele frequency of all mutations in the population. To relate the discrete simulation to the continuous analytical model we will now re-derive equation 7 within the context of our model. As we simulate a growing tumor using discrete generations, both the mutation rate μ and per capita growth rate @@1@@ are in units of generations. For an offspring probability distribution @@2@@ where @@3@@=P(# of OFFSPRING = k), the average number of offspring R is simply given by the expected value of P:
For example, for R=2 we have P=(@@5@@=0,@@6@@=0,@@7@@=1). By choosing different offspring probability distributions we can easily modulate the growth rate. We note that we are now expressing both μ and λ as rates per generation rather than probabilities (all rates are scaled by units of generation). This allows us to write the growth function as @@8@@. Proceeding as in the main text, our cumulative number of mutations with an allelic frequency f is therefore:
Therefore, when fitting the model to our stochastic simulation we extract @@10@@ from the linear fit, making it straightforward to compare the simulation with the analytical model. NGS data only captures a small fraction of the variability in a tumor, as the resolution is often limited to alleles with frequency >10% due to sequencing depth and limitations in mutation calling. To account for this, we employ a multistage sampling scheme in our simulations. For all simulations reported here we grow the tumor to size 1024 cells, which gives a minimum allele frequency of ~0.1%, considerably smaller than the 10% attainable in next generation sequencing data. After growing the tumor and calculating the VAF for all alleles, we take a sample of the alleles in the population, noting that we are assuming the population is well mixed and has no spatial structure. We can vary the percentage of alleles we sample, thus allowing us to investigate the effect of the depth of sequencing on our results. As we know the true allelic frequency in the simulated population, we can use the multinomial distribution to produce a sample of the “sequenced” alleles, where the probability of sampling allele i is proportional to its frequency. The probability mass function is given by:
where @@12@@ is the sampled frequency of allele @@13@@, @@14@@ is the number of trials (the chosen percentage of alleles sampled) and @@15@@ is the probability of sampling allele i (which has frequency @@16@@ in the original population):
The variant allele frequency VAF is therefore given by:
Where @@19@@ is the total number of sampled cells from which every sampled allele is derived. As we are assuming a constant mutation rate @@20@@, we can assume that the percentage of alleles sampled comes from an equivalent percentage of cells. However, to include an additional element of noise that resembles the variability of read depth, we calculate a new @@21@@ for each allele @@22@@, which approximates the read depth. For a desired “sequencing” depth D we calculate the corresponding percentage of the population we need to sample that will give us our desired depth. For example, for a desired depth of 100X from a population of 1000 cells, we would need to sample 10% of the population. To include some variability in depth across all alleles we use Binomial sampling so that @@23@@ is a distribution with mean D. Contamination from non-tumor cells in NGS results in variant allele frequencies being underestimated. To include this effect in our simulation we can modify our N_i by an additional fraction ε – the percentage of normal contamination. Our VAF calculation thus becomes: @@24@@ We also include detection limit in our sampling scheme, we only include alleles that have an allelic frequency greater than a specified limit in the original tumor population.
Here we show that our stochastic simulations can recapitulate NGS data nicely and that the stochastic simulation converges to our analytical result:
For all simulation considered here we consider an ideal branching process where all cells divide into 2 at each generation.
Here we confirm that over a large number of simulations we can accurately recover the mutation rate from the model fit. The slight overestimation is likely due to some of the elements of noise introduced we bias our results.
Here we confirm the robustness to the number of clonal mutations.
Here we confirm the robustness to the amount of normal contamination.
Here we demonstrate that we still recover the mutation rate and neutrality under different growth regimes, but that the variance in the mutation rate estimates (@@0@@) is more variable and that we would missclassify a higher proportion of tumours as non - nuetral due to poorer @@1@@ values.
Now we consider the case of 2 mixed clones which may arise as described in the online methods of the paper.
Here we consider the effect of strong selection occurring early during tumour growth. We define the second population with @@0@@, with @@1@@. We print out the population size when the fitter mutant is introduced, as can be seen the presence of selection can only be seen when the population size is small due to the limit of detectability. Included are some numbers to seed the random number generator that produces examples of histograms clearly influenced by selection.
Finally we show that a change in mutation rate also influences the distribution.