While I am not a devotee of all aspects of the so-called “tidyverse”, I am taking to ggplot, one reason (aside from the attractiveness of the outcome) is the ability to add layers to graphics. That permits one from starting with the simple and moving to the complex. Here's an example. First we will generate some randomly distributed data and plot its density
dat <-data.frame(Data=rnorm(1000,0,1)) # Create some normally distributed data
pl <-ggplot(dat,aes(Data)) +
geom_density()
pl
Now, suppose we want to superimpose a histogram of the actual points( and will make them red and somewhat transparent). We can add it to the plot rather simply:
pl2 <-pl +
geom_histogram(aes(y=..density..),fill="red", alpha=.5,bins=40)
pl2
And we could do more. But it turns out that this is not the only package to use such an approach – the coala
package (available on CRAN) one to build coalescent models for simulation via ms (and a couple of other simulators) in a similar fashion. A coala model has 3 elements:
- The basic model, consisting of the sample size and number of genes to simulate.
- features, such as scaled mutation rate, migration rate, divergence model, etc.
- summary statistics – things like nucleotide diversity, segregating sites (returned as a matrix), Tajima's D, and trees in Newick format. So let's do a simple simulation of a 10 copies of single gene with the default length of 1000 and a scaled mutation rate (θ) of 4, and ask for a tree to be generated.
activate_ms(priority=400)
mod1 <-coal_model(sample_size = 10,loci_number = 1, loci_length = 1000) +
feat_mutation(rate=4) +
sumstat_trees()
sim1 <-simulate (mod1,seed = 5)
Now we can use ggtree to make a plot of it (and layers can be added as above for ggplot). First, however, we have to convert the Newick tree to a phylo object using the read.tree function from ape
tr <-read.tree(text=sim1$trees[[1]])
ggtree(tr) +
geom_tiplab()
Now suppose we want to learn more about the model – perhaps get nucleotide diversity and Tajima's D. We can add those summary statistics to the model and rerun it:
mod2 <-mod1 +
sumstat_nucleotide_div() +
sumstat_tajimas_d() +
sumstat_sfs()
sim2 <-simulate(mod2,seed = 5)
And we can see the output as follows:
paste0("Nucleotide Diversity = ", sim2$pi)
## [1] "Nucleotide Diversity = 6.84444444444444"
paste0("Tajima's D = ", sim2$tajimas_d)
## [1] "Tajima's D = 0.089447094573295"
ggplot(data.frame(n=1:length(sim2$sfs),sfs=sim2$sfs),aes(x=n, y=sfs)) +
geom_col(fill="red", alpha=.5) +
scale_x_continuous(breaks=1:10)
labs(x="Number of Derived Alleles",y="Count")
## $x
## [1] "Number of Derived Alleles"
##
## $y
## [1] "Count"
##
## attr(,"class")
## [1] "labels"
And there is a lot more, but this is just a taste.