Population genetics

- Individuals who possess a C allele at the rs429358 SNP, located in the 4th exon of the ApoE gene, are more likely to have a variant of the ApoE protein that can increase the risk of Alzheimer’s disease relative to a T allele. You run a set of samples on a SNP chip and determine that your population has 2 CC individuals, 16 CT individuals, and 32 TT individuals. The C and T alleles in the ApoE gene are segregating at the frequencies observed in the 1st generation in which you observe the population. If you assume a standard Wright-Fisher model of genetic drift, what is the *expected probability* of eventual fixation of the T allele? Is this a certain outcome?
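
As a rough sketch of the bookkeeping behind this question, the Python below counts gene copies from diploid genotype counts and applies the standard neutral Wright-Fisher result that an allele's fixation probability equals its current frequency. The function names are hypothetical, and neutrality (no selection on either allele) is an assumption of the sketch.

```python
def allele_frequencies(n_cc, n_ct, n_tt):
    """Return (freq_C, freq_T) from diploid genotype counts."""
    total_copies = 2 * (n_cc + n_ct + n_tt)   # each diploid carries 2 gene copies
    c_copies = 2 * n_cc + n_ct                # CC contributes 2 C's, CT contributes 1
    t_copies = 2 * n_tt + n_ct                # TT contributes 2 T's, CT contributes 1
    return c_copies / total_copies, t_copies / total_copies


def neutral_fixation_probability(initial_frequency):
    """Assumes pure drift (no selection): P(fixation) = starting frequency."""
    return initial_frequency


freq_c, freq_t = allele_frequencies(n_cc=2, n_ct=16, n_tt=32)
p_fix_t = neutral_fixation_probability(freq_t)
print(f"freq(C) = {freq_c:.2f}, freq(T) = {freq_t:.2f}, P(T fixes) = {p_fix_t:.2f}")
```

The number is an expectation over many hypothetical replicate populations, which is worth keeping in mind when answering whether the outcome is certain.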

- In a population of 4000 diploid organisms, what is the expected time (in numbers of generations) to the most recent common ancestor for 2 randomly sampled alleles from the population?
- Imagine that you have a population of size *2N* = 1000 gene copies. You draw two samples from this population: **Sample 1 = 2 gene copies; Sample 2 = 10 gene copies.**
- A) If you were to model the coalescent history for each of these samples independently, would you expect the first coalescent event (closest to present) to occur in Sample 1 or Sample 2 and why? Calculate the expected time to prove it.
- B) Which sample would have the longest expected time to the most recent common ancestor of the entire set of sampled gene copies? Again calculate the TMRCAs to prove it.
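
The expected times in the two coalescent questions above can be sketched in a few lines of Python. This assumes the standard neutral coalescent, in which each pair of lineages coalesces with probability 1/(2N) per generation, so that k lineages wait on average 2N / C(k, 2) generations for their next coalescent event. The function names are hypothetical.

```python
from math import comb


def expected_time_to_first_coalescence(n_lineages, gene_copies):
    """Expected generations until the first coalescence among n_lineages.

    gene_copies is 2N, the total number of gene copies in the population.
    """
    return gene_copies / comb(n_lineages, 2)


def expected_tmrca(sample_size, gene_copies):
    """Expected generations back to the MRCA of the whole sample.

    Sums the expected waiting times while k = sample_size, ..., 2 lineages remain.
    """
    return sum(
        expected_time_to_first_coalescence(k, gene_copies)
        for k in range(2, sample_size + 1)
    )


for n in (2, 10):
    first = expected_time_to_first_coalescence(n, gene_copies=1000)
    tmrca = expected_tmrca(n, gene_copies=1000)
    print(f"n = {n}: first event ~ {first:.1f} gens, TMRCA ~ {tmrca:.1f} gens")
```

The same helper covers the two-allele question for 4000 diploid organisms: call `expected_tmrca(2, gene_copies=2 * 4000)`.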
- Go to the following link: https://keholsinger.shinyapps.io/coalescent/

This website will run a coalescent simulation when you click “Go”. The simulation is conducted with a population size of Ne = 100 diploid organisms (so you will also need to think about what the number of “gene copies” should be here to place these results in the same world as our equations from class!).

- a) Set the Number of alleles to 2 (this represents the sample size, so 2 gene copies are sampled) and click “Go”. This will show a coalescent tree with a statistic called “tree depth”. This represents the coalescent time in units of 4Ne generations, as the author of the package has it set up: a value of t = 1 corresponds to 4 × 100 = 400 generations (note this is twice the value that we set for t = 1 in class, so keep that in mind!).
- b) Record the value of tree depth in a spreadsheet. Click “Go” again and record the tree depth in a second row. Repeat this for a total of 50 values of tree depth.
- c) Report the mean value of t (tree depth, or time to common ancestor for the two alleles) and its variance (you can just use built-in functions such as average() in Excel or whatever else you want… you don’t need to calculate by hand). Then make a histogram with 10 equally spaced bins. If you don’t know how to do this, go to a website (e.g., this one seems good: https://statscharts.com/bar/histogram) and paste in your data from the 2-allele tree depth simulations. Click “Generate Histogram”, then on the next page select 10 bins and click “Edit Chart”.
- d) How does this mean value match what we discussed in class for coalescent times? Does your answer make sense? Describe the shape of the histogram, and what this means to you about the coalescent.
- e) Repeat this experiment for a sample of 10 alleles. How does the estimate of t change (don’t worry about the histogram)? Does the difference make sense to you? Explain.

Note: 50 samples is not many for a simulation, but it should get the point across here. With more simulated values, your means will be much better estimates of the parameters and the histograms will look nicer. Usually we do hundreds or thousands of simulations to estimate things.
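
If you want to check your spreadsheet numbers against a larger run, the Python sketch below draws tree depths directly from the standard neutral coalescent. It is not the code behind the web app. It assumes tree depth is reported in units of 4Ne generations, as described above, so the waiting time while k lineages remain is exponential with rate 2 × C(k, 2) in those units. The function names and the seed are arbitrary choices.

```python
import random
from math import comb
from statistics import mean, variance


def simulate_tree_depth(sample_size, rng):
    """Draw one tree depth (time to the MRCA) in units of 4Ne generations."""
    depth = 0.0
    for k in range(sample_size, 1, -1):
        # Rate is C(k, 2) per 2Ne generations, i.e. 2 * C(k, 2) per 4Ne generations.
        depth += rng.expovariate(2 * comb(k, 2))
    return depth


def histogram_counts(values, n_bins=10):
    """Count values in n_bins equally spaced bins between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


rng = random.Random(1)  # fixed seed so the run is repeatable
depths = [simulate_tree_depth(2, rng) for _ in range(50)]
print("mean:", mean(depths), "variance:", variance(depths))
print("histogram counts:", histogram_counts(depths))
```

Raising 50 to a few thousand, or changing the sample size from 2 to 10, shows how the mean and the shape of the histogram settle down with more replicates.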

- You have sequenced a region of mitochondrial DNA in three bumble bee species, *Bombus vosnesenskii*, *Bombus bifarius*, and *Bombus ternarius*. You are interested in determining the divergence time of *B. bifarius* and *B. ternarius* to test a hypothesis regarding speciation events. The divergence time for *B. vosnesenskii* and the other two species is 4 million years (bumble bees have 1 generation per year, so this also equals 4 million generations). For the gene you sequenced, you count 80 substitutions between *B. vosnesenskii* and *B. bifarius* and 47 substitutions between *B. bifarius* and *B. ternarius*. Assuming a molecular clock, (A) what is the mutation rate of this gene, and (B) what is the divergence time of *B. bifarius* and *B. ternarius*? *Note: you should ignore the possibility of deep coalescence for this… just focus on Chapter 2 material here.*
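
A minimal Python sketch of the molecular-clock arithmetic follows. It assumes that substitutions between two species accumulate along both lineages since the split, so d = 2μt, and it treats μ as a rate for the whole sequenced region per generation (an assumption, since the region length is not given). The function names are hypothetical.

```python
def mutation_rate_from_calibration(substitutions, divergence_generations):
    """Solve d = 2 * mu * t for mu, using a pair with known divergence time."""
    return substitutions / (2 * divergence_generations)


def divergence_time(substitutions, mutation_rate):
    """Solve d = 2 * mu * t for t, using the calibrated rate."""
    return substitutions / (2 * mutation_rate)


# Calibrate on B. vosnesenskii vs. B. bifarius (80 substitutions, 4 million generations).
mu = mutation_rate_from_calibration(substitutions=80, divergence_generations=4_000_000)

# Apply the same clock to B. bifarius vs. B. ternarius (47 substitutions).
t_split = divergence_time(substitutions=47, mutation_rate=mu)

print(f"mu ~ {mu:.2e} per region per generation")
print(f"divergence ~ {t_split:,.0f} generations")
```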

- Describe in your own words the link between DNA replication, identity by descent, and the coalescent process. Your answer should consider time directionality. Use complete sentences.

- A) You use a set of allelic genetic markers in a population of lake trout and estimate the expected heterozygosity = 0.18. Assuming an __infinite alleles model__, if the mutation rate of these markers is approximately 0.0001 mutations per generation, what is the effective population size (in diploid individuals, *N*) of the population?

B) You go out into the same lake and drop a car battery into the water, temporarily shocking all the fish so they float to the surface (they’ll wake up and be ok, don’t worry… also don’t actually do this!). You count 1500 trout. Do you think you made a mistake in your genetic calculation of population size? Is the estimate from the genetic data somehow incorrect, or is this something you might expect? Provide several reasons why your calculation might differ from the actual count of fish in the lake. There are a few important terms/definitions I am looking for in your answer.
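
For part A, the algebra can be sketched in Python as below. It assumes the usual infinite alleles equilibrium for a diploid population, H = θ / (1 + θ) with θ = 4Nμ, and solves for N. The function name is hypothetical.

```python
def effective_size_from_heterozygosity(heterozygosity, mutation_rate):
    """Solve H = 4*N*mu / (1 + 4*N*mu) for N (diploid individuals).

    Assumes mutation-drift equilibrium under the infinite alleles model.
    """
    theta = heterozygosity / (1 - heterozygosity)  # theta = 4 * N * mu
    return theta / (4 * mutation_rate)


n_e = effective_size_from_heterozygosity(heterozygosity=0.18, mutation_rate=0.0001)
print(f"estimated effective size ~ {n_e:.0f} diploid individuals")
```

The value returned is an effective size under the model's assumptions, which is the point of comparison for the head count in part B.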

- Baudry et al. (2004) sequenced a series of *Drosophila melanogaster* populations from throughout the globe at a set of genes (note all of these are on the X chromosome). A table of nucleotide diversities (presented as a percentage here) for the gene *sex-lethal* is shown below. Despite the fact that the numbers of flies living in these different regions seem to be approximately the same, they observed some interesting results. Provide a hypothesis to explain these data.

| Population | Percent nucleotide diversity (per 100 sites) |
|---|---|
| Kenya (Africa) | 1.6 |
| Zimbabwe (Africa) | 1.81 |
| Ivory Coast (Africa) | 1.93 |
| Niger (Africa) | 1.92 |
| China (Asia) | 0.03 |
| France (Europe) | 0.00 |
| Russia (Asia) | 0.06 |
| U.S. (N. Am) | 0.03 |

- The following data set contains aligned DNA sequences (17 total sites) from a gene sampled from a set of humans.

| Individual | Sequence |
|---|---|
| Ind1 | AAGATGACAGATAGGCA |
| Ind2 | CTGGTGACTGATAGGCA |
| Ind3 | CTGGTGACTGATAGGCT |
| Ind4 | CAGATGACTGATAGGCT |

- A) How many SNPs are there?
- B) How many segregating sites (S) are there?
- C) Calculate the nucleotide diversity (or the number of pairwise differences) for the sequenced region.
- D) Calculate the two estimators of THETA (Tajima and Watterson) from your π and *S* (note: don’t be too worried that they aren’t exactly identical… they should be close-ish, but there are reasons for the two not to be the same… we will cover them later).
- E) Go on the web and find an estimate of the __genome-wide__ per-base-pair nucleotide mutation rate in humans (provide the value and source reference in your response). Use your estimate of Tajima’s THETA to calculate the number of people (__average__ *N*). Hint: if your mutation rate is a “per site” rate, be sure your THETAs are calculated “per site” as well.
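
If you would like to check your hand counts, the Python sketch below computes the summary statistics for the four aligned sequences. It assumes the standard definitions: *S* is the number of alignment columns with more than one base, Tajima's estimator is the mean number of pairwise differences, and Watterson's estimator is S divided by the harmonic sum 1 + 1/2 + … + 1/(n − 1). The function names are hypothetical, and the mutation rate for part E is deliberately left as a parameter for you to supply from your own source.

```python
from itertools import combinations

sequences = {
    "Ind1": "AAGATGACAGATAGGCA",
    "Ind2": "CTGGTGACTGATAGGCA",
    "Ind3": "CTGGTGACTGATAGGCT",
    "Ind4": "CAGATGACTGATAGGCT",
}


def segregating_sites(seqs):
    """Number of alignment columns that contain more than one base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def mean_pairwise_differences(seqs):
    """Tajima's estimator: average number of differences over all pairs."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total / len(pairs)


def watterson_theta(n_segregating, n_sequences):
    """Watterson's estimator: S divided by sum of 1/i for i = 1 .. n-1."""
    a1 = sum(1 / i for i in range(1, n_sequences))
    return n_segregating / a1


def population_size_from_theta(theta_per_site, mu_per_site):
    """Solve theta = 4 * N * mu for N; both inputs must be per site."""
    return theta_per_site / (4 * mu_per_site)


seqs = list(sequences.values())
n_sites = len(seqs[0])
s = segregating_sites(seqs)
pi = mean_pairwise_differences(seqs)
theta_w = watterson_theta(s, len(seqs))

print(f"S = {s}, pi = {pi:.3f} per sequence, {pi / n_sites:.4f} per site")
print(f"Watterson = {theta_w:.3f} per sequence, {theta_w / n_sites:.4f} per site")
```

For part E, pass your per-site Tajima estimate and the per-site mutation rate you found to `population_size_from_theta`.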