#############################################################
NOTE FROM CBSU
This document consists of two parts stacked together. Please be sure to
read the second part - it contains important updates about the format
of the input files.
##############################################################

------------
| P A R T 1 |
------------

NOTICE
------
This program may be freely used, copied, redistributed, and modified.
Comments, criticisms etc to m.a.beaumont@reading.ac.uk.

Introduction
------------
This program is designed to help the user explore the most probable
demographic and genealogical histories consistent with a sample of
chromosomes typed at one or more loci. It relies on Markov Chain Monte
Carlo (MCMC) simulation. It is NOT an automatic "black box" method, and
other graphical/statistical programs will be needed to interpret the
output. It should be used in conjunction with the paper "Detecting
expansion and decline using microsatellites". It should be possible (I
hope!) to replicate the results given in the paper using the real and
simulated data sets described there.

The program is very slow. It is not practicable to run it on a Mac
slower than a G3, or on anything slower than a Pentium II. In addition,
it creates large files - of the order of 1 Mb per locus - so make sure
your file system is not full.

The program differs slightly from that described in the Genetics paper.
For the paper I ran the loci individually and then used density
estimation methods to combine across loci, as described there. I think
it would be too much to expect most people to go through that effort,
and have therefore modified the program to run all the loci
simultaneously. The problem with running all the loci simultaneously is
that the acceptance rate can often be quite low, and convergence takes
a long time. I recommend that you first start with small sample sizes
to obtain some idea of how well the method is going to work for your
data set.
To make it easier to work with subsamples of the data I have included a
sampling program called sinf.exe. Given an infile, formatted as
described below, the program allows the user to obtain samples of the
requested size without replacement.

Description of the model
------------------------
A sample of chromosomes typed at a microsat locus can be assumed to
have a genealogical history consisting of coalescence events and
mutation events going back in time until the most recent common
ancestor of the sample. The assumed demographic history is of a single
stable population that was of size N1 chromosomes at some time ta ago
and subsequently changed gradually in size to N0 chromosomes over the
period from ta to the current time. For purposes of estimation the
parameters are scaled in terms of the current population size, giving
two demographic parameters tf = ta/N0 and r = N0/N1. The mutation rate
mu is scaled in terms of the current population size to give
theta = 2*N0*mu. It is assumed that r and tf are the same across loci
and that theta varies among loci.

Two models of population change are assumed - a linear change between
N1 and N0 over the time interval ta, and exponential change. The user
has a choice of which model to use. The program does not allow one to
compare the goodness of fit of each model.

I hope all terminology and jargon used in this paper is defined in the
section "Bayesian terminology and general notes" below. The program
estimates the posterior distribution of parameters describing the
genealogical and demographic history of the sample, given some
rectangular prior distribution for the demographic and mutational
parameters (theta, r, and tf) and the data. By a rectangular
distribution I mean one in which, within the bounds of the rectangle
(an oblong in 3-d), the prior distribution is flat, and outside the
bounds it is 0. Within the bounds of the prior the 3-d posterior
distribution for theta, r, and tf is proportional to the likelihood
surface.
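The scaling just described can be written out explicitly. Here is a
minimal sketch (the function name and example numbers are mine, not
part of msvar):

```python
def scale_params(N0, N1, mu, ta):
    """Convert the natural parameters to the scaled parameters used by
    msvar, following the definitions above."""
    theta = 2.0 * N0 * mu  # scaled mutation rate, theta = 2*N0*mu
    r = N0 / N1            # r > 1 suggests expansion, r < 1 decline
    tf = ta / N0           # time of the size change scaled by N0
    return theta, r, tf

# For example, N0 = 10000, N1 = 100000, mu = 5e-4, ta = 5000 gives
# theta = 10.0, r = 0.1 (a tenfold decline), tf = 0.5.
```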
Note, however, that the marginal posterior distribution for each
parameter will not be proportional to the marginal likelihood within
the prior bounds. This is because the marginal likelihood for one
parameter would be obtained by averaging over the likelihoods for ALL
values of the other parameters, whereas the marginal posterior
distribution is obtained by averaging within the bounds of the prior.

Restrictions on the Model
-------------------------
a) It assumes fewer than 100 loci (you would need a network of Crays!).
b) It assumes that there are no more than 2000 coalescent and
   mutational events in the genealogy. (This can easily be changed, but
   seems reasonable at the moment - email me if there is a problem.)
c) It assumes you have microsatellite data evolving by a single-step
   mutation model.

Data input - Infile
-------------------
The file "infile" contains the microsat data in the format below. The
example file here is for the 8 polymorphic wombat loci described in the
manuscript. There should be no text, characters or extraneous numbers
until after the last locus. Lines have no significance - it is the
sequence of numbers that is important.

8      - number of loci (it will only read this number - so you can
         have notes, other unused loci etc underneath).
2      - number of alleles (allelic classes) at locus
28 20  - counts of chromosomes with the same length
0 7    - lengths corresponding to the counts above (these have to be
         relative to something - I have zero here, but it could be any
         number; they should be in the same units - ie use number of
         repeats)
2      - number of alleles at next locus
.
.
.
2      - number of alleles at last locus
10 46
0 4
anything you want down here.

Control of simulations - init_v_file
------------------------------------
The file "init_v_file" contains information as described below. There
should be no text, characters or extraneous numbers until after the
last required number. Lines have no significance - it is the sequence
of numbers that is important.
13485801  - starting displacement of the random number sequence - more
            details below.
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
          - starting value of theta for each locus. There should be as
            many starting values as there are loci specified in
            "infile".
0.05 1.0  - starting values of r and tf respectively (assumed to be the
            same for all loci).
0 0 0     - indicators showing whether theta, r, tf (respectively)
            should be kept fixed. 1 indicates that the corresponding
            parameter should be kept fixed (ie you are conditioning on
            a fixed value); 0 indicates that it can vary.
-6 1      - the lower and upper bounds on (respectively) log10(theta),
-4 0        log10(tf), log10(r). Note that these are log values but the
-2 1        starting values are NOT logged.
1         - indicates whether linear change (0) or exponential change
            (1) is to be modelled.
20000     - the maximum number of thinned (see below) update steps. Ie
            in this example the total number of update steps is
            20000 * 10000 = 200000000.
10000     - the degree of thinning - ie the interval (in update steps)
            between successive lines of output.
0.5       - a scale parameter for the MCMC simulation. It is difficult
            to work out a fully automatic choice, so it is left to the
            user. For many data sets a value similar to the reciprocal
            of the square root of the number of loci seems adequate.

Notes. If the maximum number of thinned update steps is set to be
greater than 2^31 you'll have problems of integer representation. For
multilocus data sets the number of steps is evenly divided among the
loci, so the number of updates per locus will be 200000000/8 in the
wombat example here, and the number of updates between successive lines
of output will be 10000/8 per locus.

Running the program
-------------------
Double click on the application. There is no Windows GUI, so it will
create an MS-DOS window. Typing control-C or closing the window will
kill the program. No indication is given on the screen of how far the
program has got (we are usually talking hours or days).
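To make the bookkeeping in the init_v_file notes concrete, here is the
arithmetic for the wombat example as a short sketch (my code, not
output from the program):

```python
import math

n_loci = 8            # as in "infile"
thinned_steps = 20000 # maximum number of thinned update steps
thinning = 10000      # update steps between lines of output

# Total update steps = thinned steps * thinning.
total_updates = thinned_steps * thinning        # 200000000

# For multilocus data the steps are divided evenly among the loci.
updates_per_locus = total_updates // n_loci     # 25000000
interval_per_locus = thinning // n_loci         # 1250 steps per locus
                                                # between output lines

# Rule-of-thumb scale parameter: 1/sqrt(number of loci).
scale = 1.0 / math.sqrt(n_loci)                 # about 0.35
```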
To see how much it has done you need to look at the output files while
the program is running. I suggest the safest way to do this in Windows
is to copy/paste a file (e.g. out1) to make a copy (e.g. 'copy of
out1') and then look at that.

I suggest running the program on the test data set to check that it
works and to see what the output looks like. The infile given here is
for the wombat data analysed in Beaumont (1999). If the program is run
correctly it should be possible to replicate the posterior distribution
of r and tf for the polymorphic loci in figure 6.

Using sinf.exe
--------------
Double click on sinf.exe (or run it from a DOS prompt to get warnings
if the requested sample size is larger than the actual size). It
prompts for a sample size. It then produces two files - one called
"sampfile", which contains the requested sample, and the other called
"remainderfile", which contains the remainder. "sampfile" can then be
renamed "infile" to run with msvar (making sure you keep a copy of the
original infile).

Random Number Generation - INTFILE
----------------------------------
The program uses a rather ancient Generalized Feedback Shift Register
(GFSR) random number generator devised by Lewis and Payne in the 1970s.
The advantages of using this are that it is very fast, has an immense
cycle length, and has good statistical properties. One disadvantage is
that it requires a large file describing the current state of the
generator - INTFILE - which is regularly updated during and at the end
of an MCMC run. The other disadvantage is that it is not possible
(meaning I don't know how to do it) to choose a random starting point
in the sequence. If all runs were done sequentially this would not be a
problem: you would just use the INTFILE from the end of the previous
run. However, for runs in parallel you want to start with different
INTFILEs (otherwise you'll replicate the exact same MCMC sequence).
To overcome this, the first line in init_v_file specifies a number of
random numbers to generate before starting. This number must be less
than 100,000,000 (an arbitrary upper limit). The program will walk
through this number of random numbers before doing the MCMC
simulations. This is somewhat unsatisfactory because it means that
parallel MCMC runs will probably share a proportion of the random
number sequence, but given the high degree of contingency ("chaos?") in
the simulations this does not lead to obvious correlation in the MCMC
outputs between parallel runs (arms waving madly). The program gives a
message that it is starting the MCMC simulation after it has run
through the initial number of random numbers, so that one can get an
idea of how long it takes to get through them. I suggest staggering the
starting points by quite a large number in parallel runs, just to avoid
the chance of the program being sucked into the same sequence.

Output files - out1...n
-----------------------
Each locus has its own output file, labelled 1..n, referring to loci in
the order they are given in infile. There are 10 columns:

Column 1  The line number, starting from 0. (This number plus 1)
          multiplied by the thinning number is the number of update
          steps carried out to reach that point.
Column 2  The number of mutations in the genealogical history of the
          sample.
Column 3  The time of the most recent common ancestor of the sample.
Column 4  Log10(theta)
Column 5  Log10(r)
Column 6  Log10(tf)
Column 7  Log probability of the genealogical history.
Column 8  The number of mutations in the tips of the genealogy divided
          by the total number of mutations in the tree. By "tip" I mean
          the lineage leading from a current individual chromosome to
          the first coalescence involving that lineage.
Column 9  The total branch length of the tips of the genealogy (defined
          above) divided by the sum of all the branch lengths.
This should be very similar to Column 8 when the number of mutations in
the genealogy is large (because the distribution of mutations in the
genealogy is proportional to branch length). This is what I call the
Distal Branch Index in the ms.

Column 10 The proportion of coalescences that are above tf.

These can be regarded as correlated samples from the posterior
distribution. Read the notes below before doing anything with them.

General notes on Convergence
----------------------------
The key idea in MCMC is that the stationary distribution of the Markov
chain is the same as the required posterior distribution. It is
important to run the chain for a sufficiently long time that your
sampled points (bearing in mind that they are strongly correlated)
provide a good representation of the posterior distribution.

The simplest informal way to assess this is to run a number of
independent chains with DIFFERENT starting parameter values and
visually compare the marginal posterior distributions for key
parameters. Note that the starting parameters should be DIFFERENT,
because if you start them out at the same point, different chains may
be similar simply because of the shared starting point.

Another informal way to check for convergence is to look at a plot of,
say, log(theta), log(r), log(tf) against time. If the chain is very
"sticky" and not "mixing" well (MCMC jargon) it should become apparent:
the movement through parameter space will be very lumpy. Essentially,
if it is mixing well, any visual features of the plot should be
repeated a very large number of times, so there are no large-scale
features. By large-scale feature I mean a trend, an oscillation, or
flipping between two regimes. In a well-mixing system the plot will
appear very spiky, with the general appearance in one small part of the
sequence (say a tenth) looking very similar to another small part.

More formal ways of assessing convergence are provided in the CODA
package for Splus and R.
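For the informal checks above you need the chains in a usable form.
Here is a minimal loader for the out1..n files (my sketch, not part of
the distribution; the column names are mine, and it takes any iterable
of lines, so an open file works directly):

```python
# One name per column, in the order documented above.
COLUMNS = ["line", "n_mutations", "tmrca", "log10_theta", "log10_r",
           "log10_tf", "log_prob", "tip_mutation_frac",
           "tip_length_frac", "ancestral_coal_frac"]

def load_out(lines):
    """Parse the 10 whitespace-separated columns of an out file into a
    dict mapping column name -> list of values."""
    rows = [list(map(float, ln.split())) for ln in lines if ln.strip()]
    return {name: [row[i] for row in rows]
            for i, name in enumerate(COLUMNS)}
```

For example, `with open("out1") as f: chain = load_out(f)`, then plot
`chain["log10_r"]` against `chain["line"]` to inspect mixing.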
The two most commonly used methods are the Raftery-Lewis method, which
uses a single chain and estimates how long a run you need to estimate
certain quantiles to a required degree of precision, and the
Gelman-Rubin method, which uses independent chains and quantifies the
variability in the mean value of a parameter within and between chains
(on the basis that at equilibrium the mean value should be the same
between chains).

In my experience the Raftery-Lewis statistic is depressingly
conservative for the coalescent MCMC described here (especially when
the possibility of population growth is allowed). What I mean by this
is that it will indicate that individual runs are too short even
though, if you do independent runs from different starting positions,
you end up with fairly similar posterior distributions, as verified by
the Gelman-Rubin statistic. The reason is that it was developed for
Gibbs samplers (a type of MCMC) where in general the degree of
autocorrelation is low(ish); it is designed to be conservative and
becomes more conservative as the degree of autocorrelation increases.
The degree of autocorrelation in the coalescent MCMC (particularly with
population growth) is many orders of magnitude more than assumed by the
statistic. Much more useful, in my experience, is the Gelman-Rubin
statistic.

However, much though it offends our sensibilities, there is still an
'art' to assessing convergence, and even if you don't have access to
the CODA package you can get a reasonable idea of whether convergence
has been attained by looking at the output of independent runs and
visually inspecting the stickiness of the chains. But beware: there
will always be the possibility that if you ran the chain 1000 times
longer it would discover a whole new region of parameter space, and
your apparently well-mixing sequence would turn out to be poorly mixing
on a very large scale. You can never, no matter what statistics are
used, circumvent this possibility.
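If CODA is not to hand, the Gelman-Rubin diagnostic is simple enough to
sketch directly. A minimal version for a single parameter (my own
implementation of the standard formula, not code from CODA or msvar):

```python
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter, from several
    independent equal-length chains (burn-in already removed).
    Values near 1 are consistent with convergence; values much greater
    than 1 are not."""
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    # Between-chain variance of the means, and mean within-chain variance.
    b = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    w = statistics.fmean(statistics.variance(c) for c in chains)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5
```

Chains sampling the same distribution give a value near 1; chains stuck
in different regions give a much larger value.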
My advice is roughly as follows. Assuming you have multilocus data,
choose the locus with the highest amount of variability and the largest
sample size (with more weight on the former). Run it by itself for,
say, 10^8 updates. Based on the output of this, ignoring the first 10%
of the points (which may be affected by the starting values and not be
representative of the equilibrium distribution), choose starting values
of theta, r, tf that are at the edges of (but not too far from) the
observed posterior distribution (you may need to look at the 3
bivariate marginals to get a good idea). Use these to start off another
4 independent chains. Cut off the first 10% of each chain. Plot
histograms of the marginal distributions of the parameters. Do all 5
coincide reasonably well? How short can each chain be and still give
adequate estimates (it doesn't matter if the histograms are noisy -
it's the relative displacement of the densities that is important)?

Once you have decided on this minimum number, multiply it by a number
that is rather less than the number of loci. Do a run with all the
loci. Look at the output. Choose starting values of the parameters for
another 4 chains based on the posterior distribution (all being well,
this should be much tighter than for the 1-locus case). You may decide
at this stage that you can get away with shorter chains. Run it again.
Compare the output from the 5 independent chains. If you are happy that
all is well, you could combine all the output together. Alternatively
(and probably better for publishing results) you can estimate summary
statistics from each chain individually and then use the variability
between chains to give standard errors on these estimates.

Specific Notes on Convergence
-----------------------------
The method converges well for both single- and multilocus data when
they support a model of population decline. Convergence is poorer for
expanding populations.
If there is strong cross-locus support for expansion, a multilocus
linear model will not converge in a reasonable time. An exponential
model will work better. In the case of the linear model it then becomes
necessary to consider running loci independently, estimating densities
separately, and then multiplying them together as in the Genetics
manuscript. A tell-tale signal that the model is not converging well in
a growing population is when regions of parameter space are explored
where log(r) changes greatly but log(tf) varies only a little.

In general the method will converge more slowly if:
a) Sample sizes are large (more than 200 chromosomes per locus is not
   recommended).
b) Samples are very polymorphic or have a wide difference between
   minimum and maximum allele sizes. This implies a lot of mutations.
   One restriction placed on the model is that coalescent nodes plus
   mutations must not exceed 2000 in the genealogical history.

A note on estimating summary statistics from MCMC output
--------------------------------------------------------
Many summary statistics, such as the posterior mean, median, and
quantiles, can be estimated using standard statistical methods
implemented in many packages, and I will not go into them here.
Estimation of the mode and HPD limits requires density estimation. Many
packages will do this for you too. Often these use kernel density
estimation procedures that assume no boundaries and, since the
posterior distributions here are bounded, this can lead to problems
(the density will always be strongly underestimated at the edges). In
my view the best density estimation package available is called Locfit
(it's free). It is implemented as a standalone package or (my
recommendation) as a library for the statistical packages
(interpreters? languages?) Splus and its free lookalike R. If you want
to estimate modes, HPD limits, and make pretty plots, I strongly
recommend using Locfit.

Bayesian Terminology and general notes
--------------------------------------
Prior distribution - the probability distribution of the parameters
independent of any data. This has to come from somewhere (you).

Likelihood - the probability of the data given a set of parameter
values.

Posterior distribution - the probability distribution of the parameters
conditional on the data. The posterior distribution is proportional to
the likelihood multiplied by the prior. It is only proportional because
we need to be able to standardize the volume so that it is 1 - ie to
give us a probability distribution. This integration is often
difficult, which is where MCMC comes in (see below).

Conditional posterior distribution - the (joint) distribution of one
(or more) parameter(s) when the other parameters are fixed at certain
values. This is like taking a slice through the distribution and
normalizing it to have volume 1.

Marginal posterior distributions - in many models there are parameters
that are a necessary part of the description of a model but in which we
have no particular interest. Again, the model may have many parameters
that we are interested in, but we may only want to say something about
one parameter at a time rather than about all the parameters jointly.
However, if we fixed the parameters that we were not immediately
interested in to some set of values, we might find that the posterior
distribution for the parameter we are interested in (the conditional
distribution) changes according to the values we assign to the other
parameters. Since we don't know the values of these parameters, but
have varying degrees of belief about what they might be (given by the
joint posterior distribution), it is reasonable to estimate the
posterior distribution for the parameter we are interested in by
summing (integrating), for each value of this parameter, the
frequencies over all possible values of the other parameters. This is
the marginal distribution.
Another way of looking at it is as an average of all the conditional
distributions, each scaled to have volume 1. The shape of the marginal
distribution will be more strongly influenced by the shape of the
conditional distributions in high-density regions than in low-density
regions.

Improper prior distribution - an example of an improper prior is to
assume that all parameter values (from, say, -infinity to +infinity)
have equal prior frequency (a flat prior). This distribution is
improper because it has an infinite volume. However, if the likelihood
function goes to 0 at extreme values of the parameters, then an
improper prior multiplied by the likelihood function can give rise to a
proper posterior distribution (ie one that can be normalized to have
volume 1).

Relationship between posterior distribution and likelihood function -
if a flat prior is used, the posterior distribution will be
proportional to the likelihood function. (It will not be the same,
because it will be scaled to have volume 1.)

Summary statistics - in the Bayesian world view, all that we can say
about the model (conditional on the data) is given by the posterior
distribution. We need to report summary statistics based on this
distribution. In general, point estimates are not as useful as interval
estimates. Examples of the former are the mode, mean, and median of the
posterior distributions. An example of the latter is the equal-tail
probability interval - ie reporting, say, the 0.025 and 0.975 quantiles
of a marginal posterior distribution for some parameter. Another useful
summary (which I like, but which is more difficult to estimate) is the
HPD (highest posterior density) interval. This is like a water level
chosen so that, say, 95% of the posterior probability lies above it.
Many biologists influenced by Edwards' "Likelihood" might like to use
an HPD limit such that the log density at the limit is log(2) less than
the log modal density (at the maximum likelihood estimate) - so-called
support limits.
From a Bayesian point of view this is not a very good summary, because
you could have a substantial posterior probability outside this limit
depending on the thickness of the tails of the posterior distribution.
Graphical methods of displaying posterior distributions are also very
useful - they give good visual summaries.

MCMC - Markov Chain Monte Carlo. This is a method of integrating the
likelihood multiplied by the prior. In effect you simulate
(autocorrelated) random samples from the posterior distribution. There
are two key aspects to the use of MCMC. One is this integration. The
other is that it automatically gives you marginal distributions: you
just ignore the other values. It's like estimating the distribution of
people's heights. You could have measured things like sex, big-toe
length, earlobe size etc. The distribution of heights that you estimate
is a marginal distribution over all the other things you could have
measured. In my view the ability to report these marginal distributions
is the key feature of this program.

Final Note on msvar
-------------------
This program should allow you to replicate the results in the
manuscript. The purpose of the paper is primarily to analyse the
likelihood surface for models of population growth. Because the
likelihood asymptotes for extreme values of all the parameters, it
seemed sensible to use rectangular priors. Within these limits the
joint posterior and the conditional posteriors for the parameters are
proportional to the likelihood. The marginal posterior distribution is
not proportional, for reasons discussed above. However, using
rectangular priors makes it easier to understand the patterns obtained.
The main aim of the paper is to try to understand the likelihood
function, even though we can't write it down. Using priors that are not
rectangular makes it more difficult to distinguish the effects of the
prior from those of the likelihood.
This does not mean, however, that the use of rectangular priors is in
general desirable. For real data sets you do not have clearly defined
boundaries on the parameter values within which all values have equal
probability and outside which they have zero probability. Rather, the
prior distribution is more reasonably represented by a smoothly varying
function.

A second criticism of msvar is that we are interested in the natural
parameters, N0, N1, mu, ta, but the likelihood function is only a
function of 3 of them scaled by the fourth (I have used N0 here). Thus,
given a posterior distribution for tf, we might want to integrate over
some prior for N0 to estimate ta. In addition (especially if the
populations have been growing) we might prefer to have looked at mu*ta
rather than theta, and then estimate ta by integrating over some belief
about how mu varies among loci. However, from a Bayesian perspective it
makes sense to abandon these arbitrary parameterizations and use the
natural parameters directly (Tavare' et al., 1997; Wilson and Balding,
1998). Of course this means that there will be strong correlations
between the parameters (the model is over-determined), but it does have
the advantage of giving estimates for the parameters that we are
interested in without further guesswork.

Useful references
-----------------
Beaumont, M.A. (1999) Detecting population expansion and decline using
microsatellites. Genetics, 153: 2013-2029.
Gelman, A. et al. (1995) Bayesian Data Analysis. Chapman and Hall.
Gilks, W.R. et al. (1996) Markov Chain Monte Carlo in Practice. Chapman
and Hall.
Tavare' et al. (1997) Inferring coalescence times from DNA sequence
data. Genetics, 145: 505-518.
Wilson and Balding (1998) Genealogical inference from microsatellite
data. Genetics, 150: 499-510.
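Returning briefly to the summary statistics discussed above: an HPD
interval can be approximated directly from the MCMC samples as the
shortest interval containing the required posterior mass. A rough
sketch of that idea (my code; this is not how Locfit does it - Locfit
smooths the density first):

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing the given fraction of the samples:
    a crude sample-based HPD estimate for a unimodal marginal."""
    s = sorted(samples)
    n = len(s)
    k = max(1, math.ceil(mass * n))  # points the interval must cover
    # Slide a window of k consecutive order statistics and keep the
    # narrowest one.
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]
```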
** I strongly recommend analysing the output using the R package:

http://www.ci.tuwien.ac.at/R/contents.html
        The Comprehensive R Archive
http://www.ci.tuwien.ac.at/R/src/contrib/PACKAGES.html
        where to get CODA for R
http://www.mrc-bsu.cam.ac.uk/bugs/docs/docs.html
        documentation for CODA
http://cm.bell-labs.com/cm/ms/departments/sia/project/locfit
        all you want for Locfit
http://www.stats.bris.ac.uk/MCMC
        the MCMC preprint service at Bristol (also has manuscript
        versions, in postscript, of already published papers)

------------
| P A R T 2 |
------------

HISTORY
-------
The original Windows executable was compiled with mingw in June 2002.
It was recompiled in June 2004 with Borland C++. This necessitated a
small change in the code to cope with the Borland fp system (to make it
IEEE compliant). Also, a bug in printerr2 in myutil.c (opening already
open files) was corrected. This rough draft readme was updated June
2004.

README 14 June 2002. Updated 30 June 2004.

This needs to be read in conjunction with the readme for msvar0.4.x.
The basic structure of the program is the same. The parameterisation,
however, is now in terms of the effective sizes and times etc rather
than the scaled parameters theta, r, and tf. Run it in a similar way to
msvar0.4.x.

The program now outputs a new file called hpars.dat, which has the
following columns:

1  output line number
2  (ignore)
3  total log likelihood (over all loci)
4  mean current population size (over loci)
5  variance (among loci) of current population size
6  mean ancestral population size
7  variance of ancestral population size
8  mean mutation rate
9  variance in mutation rate
10 mean time (in years) since the population started to decline/expand
11 variance of the time since the population started to decline/expand

It is these parameters that you are going to want to report. The
parameters for the individual loci are in out1..n as before.
In these files the demographic parameters will vary between loci
because, this time, the model assumes that there is some variation in
each parameter among loci (your prior for the amount of this variation
can be set in the parameter file). In principle, you could detect
selection if one locus is behaving wildly differently from the others
for, e.g., growth rate.

So, to summarise, each out1..n file contains:

line number
number of mutations
tmrca
log10(N0) for that locus
log10(N1) for that locus
log10(mu) for that locus
log10(tfa) for that locus
log likelihood
proportion of mutations in terminal lineages
proportion of total tree size in terminal lineages
proportion of coalescences in ancestral pop

NB: don't expect the 'means' and 'variances' in hpars to be exactly the
same as those calculated across out1..outn - the latter are assumed to
be random draws from distributions with means and variances given in
hpars.

The init_v_file is changed. It has the following structure:

number of 'turns' of the random number generator
ploidy number
generation time
starting values for current size for all loci
starting values for ancestral size for all loci
starting values for mutation rate for all loci
starting values for time since decline/expansion for all loci
indicators (0,1) whether to update values of these parameters
starting values for prior mean and variance for current size
..... ancestral size
..... mutation rate
..... time
hyperprior mean and variance for means and variances for:
    ancestral size
    current size
    mutation rate
    time since decline/expansion
an indicator as in msvar0.4.x - 0 = linear growth; 1 = exponential
number of lines of output
number of iterations between lines of output (the example gives 10000,
but you will probably want this 10 or 100 times larger to get good
convergence with large numbers of loci or sample sizes)
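The hpars.dat columns listed above can be pulled into named columns
with the same kind of sketch as for the out files (my code; the column
names are mine, not the program's):

```python
# One name per hpars.dat column, in the documented order.
HPARS_COLUMNS = ["line", "ignore", "log_lik", "mean_N0", "var_N0",
                 "mean_N1", "var_N1", "mean_mu", "var_mu",
                 "mean_time", "var_time"]

def load_hpars(lines):
    """Parse the 11 whitespace-separated columns of hpars.dat into a
    dict mapping column name -> list of values."""
    rows = [list(map(float, ln.split())) for ln in lines if ln.strip()]
    return {name: [row[i] for row in rows]
            for i, name in enumerate(HPARS_COLUMNS)}
```

For example, `with open("hpars.dat") as f: h = load_hpars(f)`, then
summarise and report, say, `h["mean_N0"]` and `h["mean_time"]`.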
Further points to note
----------------------
There is an arbitrary upper limit on the number of events in a
genealogy, set at 1995 (events = number of mutations + sample size
- 1). Any trial update that makes the genealogy bigger than that is not
accepted (think of it as a prior on the number of mutations
permissible). This is done just because it is computationally
convenient to set, at the outset, the maximum 'size' of the genealogy,
and 2000 seems a reasonable number.

** If this limit is reached when setting up the initial genealogy at
the start of the run, the program stops. Otherwise, if it is reached as
part of a trial update, the update is simply not accepted and the
program continues to run. Generally the latter is a rare occurrence -
most problems occur at the beginning because of the setting of the
initial parameter values.

** To solve the problem if it occurs right at the beginning, you just
need to set the initial values to something that won't give initial
genealogies with too many mutations (e.g. just keep adding an extra 0
to the mutation rates for each locus until the problem goes away).