Biology 697: Introduction to Computational Data Analysis
Welcome to your mid-term! I hope you enjoy it. Note that for all of the questions below, there are easy, not-so-code-intensive ways of answering them, and there are longer, more involved, yet still workable ways. I would suggest that before you dive into analyses, you do the following. First, breathe. Second, think about the steps you need to execute to answer the question. Write them down. Third, for those parts of problems that require code, put those steps, in sequence, in comments in your script file. Use those as signposts to walk step by step through the things you need to do. Fourth, go over these steps, and see if there are any that could be easily abstracted into functions, could be vectorized, or otherwise done so that you can expend the minimum amount of effort on the problem to get the correct answer.
Each of you has a study system you work in and a question of interest. Give an example of one variable that you would sample in order to get a sense of its variation in nature. Describe, in detail, how you would sample for the population of that variable in order to understand its distribution. Questions to consider include, but are not limited to: Just what is your sample versus your population? What would your sampling design be? Why would you design it that particular way? What are potential confounding influences of both sampling technique and sample design that you need to be careful to avoid? What statistical distribution might the variable take, and why?
At the Plum Island Ecosystems LTER, they have been collecting data on the rapidly expanding and potentially harmful Phragmites australis along transects in a salt marsh (Argulla Rd.) restoration site and a reference control site (Rough Meadows) since 1997. For the sampling, they took measurements of the two tallest Phragmites stems per five-meter interval along transects. The data is available here -
1. Before you even look at the data (no peeking!), would you expect the population that is being sampled from at each site in each year to be normal? Why or why not?
2. Visualize the data in an informative way to see differences in the population of sampled Phragmites across space and time.
3. Based on your observations of the data, what property or properties of the sampled populations would you want to compare between the restoration and reference sites to determine the effectiveness of restoration?
4. One way to ask if two samples differ in an arbitrary property is to calculate the bootstrapped confidence interval of the difference between them - i.e., calculate a property of a resampled replicate of population a, do the same for population b, take their difference, then rinse and repeat to get a confidence interval on the difference. Write a function to do this, and apply it to one of your properties, pooling across all years. What is your null hypothesis, and what does the result tell you with regard to your null? Use 1000 simulated draws.
5. Now look at whether the difference between the restoration and reference sites changes across years. How would you interpret this analysis? Feel free to look at additional properties if you think it will help you describe the differences between the restoration and reference sites.
6. Extra Credit: There is a particular distribution that may describe this data well. Using likelihood, fit the distribution to each site x year’s data. Visually examine change in the parameter values over time at the control versus restoration site.
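The resampling procedure described in item 4 can be sketched roughly as follows. This is only one possible shape for such a function, not a model answer: the function name `boot_diff_ci`, its arguments, and the made-up data in the usage lines are all assumptions for illustration.

```r
# Bootstrapped confidence interval for the difference in a property
# between two samples. All names here are illustrative assumptions.
boot_diff_ci <- function(a, b, fun = mean, n_sims = 1000, level = 0.95) {
  diffs <- replicate(n_sims, {
    # resample each sample with replacement, compute the property on each,
    # and record the difference
    fun(sample(a, replace = TRUE)) - fun(sample(b, replace = TRUE))
  })
  tail_prob <- (1 - level) / 2
  quantile(diffs, probs = c(tail_prob, 1 - tail_prob))
}

# Usage with made-up numbers (NOT the Phragmites measurements)
set.seed(697)
site_a <- rnorm(50, mean = 200, sd = 30)
site_b <- rnorm(50, mean = 180, sd = 30)
ci <- boot_diff_ci(site_a, site_b, fun = mean)
ci
```

Passing the property in as `fun` means the same function works for a mean, a median, a standard deviation, or any other summary that returns a single number.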
In their 2012 paper, Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests, Mudge et al. outline a procedure where one uses both the type I and type II error rates to calculate a third quantity, ω. For any data set, we can calculate β given α, a sample size, a measure of effect size for an estimated parameter that we deem critical, and variation as measured in our data. Once we have obtained α and β, we can calculate ω as their average (weighting the two error types equally),

ω = (α + β) / 2

and then plot a curve of the relationship between α and ω. The value of α at the minimum value of ω is the ‘optimal α’ that balances type I and type II error against one another. For example, here’s a plot of α versus ω for one particular test with a dashed line at the minimum value of ω to highlight the optimal value of α.
The other great property of this is that we can calculate this optimal α after sampling our data. We can use the variation observed in our data in the calculation of power. Only the effect size, sample size, and α levels need to be specified a priori.
Let’s assume you’re interested in testing whether the observed temperature anomaly (the difference from the long-term average) around the globe is different from 0. To appease critics, your assessment of a critical effect size is 1.5 degrees C. You know from looking at all of your observed temperatures that the standard deviation of temperatures across the globe is 5 degrees C. Using simulation to calculate β, what is your optimal α for 100 samples? How does your optimal α change with sample sizes from 10 to 1000? How does this relationship change if the standard deviation across all of the temperature sensors was 10 degrees C? Note, using functions to help you avoid heavy lifting is going to be pretty key here.
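One way the simulation could be organized is sketched below. It makes several assumptions that are not stated in the question: a two-sided one-sample t-test against 0, ω taken as the simple average of α and β, and the hypothetical helper names `omega_curve` and `optimal_alpha`. Treat it as a starting scaffold rather than the expected solution.

```r
# Simulate p-values once for a given sample size, then read off beta
# (the type II error rate) for a whole grid of alpha values.
# Assumes a two-sided one-sample t-test of mean = 0.
omega_curve <- function(n, effect, sd,
                        alphas = seq(0.001, 0.5, by = 0.001),
                        n_sims = 1000) {
  # data are simulated with the true mean set to the critical effect size
  p_vals <- replicate(n_sims,
                      t.test(rnorm(n, mean = effect, sd = sd))$p.value)
  # beta: fraction of simulations that fail to reject at each alpha
  beta <- sapply(alphas, function(a) mean(p_vals > a))
  # omega as the average of alpha and beta (an equal-weights assumption)
  data.frame(alpha = alphas, beta = beta, omega = (alphas + beta) / 2)
}

# alpha at the minimum of the omega curve
optimal_alpha <- function(curve) curve$alpha[which.min(curve$omega)]

set.seed(697)
curve_100 <- omega_curve(n = 100, effect = 1.5, sd = 5)
opt_100 <- optimal_alpha(curve_100)
opt_100
```

Because the sample size and standard deviation are arguments, the same two functions can be looped over a vector of sample sizes (for example with `sapply`) to see how the optimal α changes.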
In Maestre and Reynolds’s 2006 paper, they examine the effect of species diversity on the root:shoot ratio of biomass in plants. As good scientists, they deposited their data at Dryad.
Is there a general relationship between aboveground and belowground biomass in their data set? Evaluate this relationship. Visualize the fit and prediction confidence intervals. Next, visualize the fit and prediction confidence intervals using simulation - i.e., for each simulated line, draw values for each coefficient using a normal distribution with the coefficients’ means and SEs. You may want to use separate figures to show simulations incorporating just fit error versus prediction error. Also, make sure to overlay the best fit line on top so that we can tell which is our fit line and which are the simulations used to show error. You may need many simulated draws to accurately show error.
Note, to get simulated residual standard error values, you need to use an inverse chi-square distribution. So, here’s an example where n is the sample size and est.se is the residual standard error, extracted from the summary of the linear model (you’ll need to do that a bit here, or use vcov on the lm object to get the parameter variance), to get one random draw of a residual standard error -

df <- n - 1                    # degrees of freedom used in this example
X <- rchisq(1, df = df)        # one draw from a chi-square distribution
ses <- est.se * sqrt(df / X)   # one simulated residual standard error
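A rough sketch of how the simulated lines might be drawn follows. It uses made-up data in place of the biomass measurements, draws each coefficient independently (so it ignores the covariance that vcov would give you), and uses the model’s residual degrees of freedom via `df.residual`, which is a choice made here and differs slightly from the n - 1 in the note above. All variable names are illustrative.

```r
set.seed(697)
# Made-up data standing in for the biomass measurements (an assumption)
x <- runif(40, 0, 10)
y <- 2 + 0.5 * x + rnorm(40, sd = 1)
fit <- lm(y ~ x)

# Coefficient estimates and standard errors from the model summary
coefs <- summary(fit)$coefficients
n_sims <- 500

# Fit error: draw each coefficient from a normal with its mean and SE
sim_int <- rnorm(n_sims, coefs["(Intercept)", "Estimate"],
                 coefs["(Intercept)", "Std. Error"])
sim_slope <- rnorm(n_sims, coefs["x", "Estimate"],
                   coefs["x", "Std. Error"])

# Prediction error: also draw a residual SD for each simulation
res_df <- df.residual(fit)
sim_sigma <- summary(fit)$sigma * sqrt(res_df / rchisq(n_sims, df = res_df))
pred_x <- runif(n_sims, min(x), max(x))
pred_y <- sim_int + sim_slope * pred_x + rnorm(n_sims, mean = 0, sd = sim_sigma)

# Simulated predictions, simulated lines, data, then the best-fit line on top
plot(x, y, type = "n", ylim = range(y, pred_y))
points(pred_x, pred_y, col = "lightblue")
for (i in seq_len(n_sims)) abline(a = sim_int[i], b = sim_slope[i], col = "grey80")
points(x, y, pch = 19)
abline(fit, col = "red", lwd = 2)
```

This overlays both kinds of error in one figure for brevity; splitting the grey lines and the blue points into separate plots is straightforward.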
In their 2011 paper, Stanton-Geddes and Anderson assessed the role of a facultative mutualism between a legume and soil rhizobia in limiting the plant’s range.
After publishing, they deposited their data at Dryad. As part of their field experiment, they looked at a variety of plant properties in the field. One of interest to us is the relationship between plant height and number of leaves in July. Examine the relationship using likelihood with your choice of error distribution and a linear function. Why did you choose this error distribution? Plot the fitted curve along with fit and prediction error. Is there another distribution you could have used? Why? How do results from your model fit compare to those of another error distribution? What does this tell you about the relationship between plant height and number of leaves?
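For orientation, here is a minimal sketch of fitting a straight line by likelihood using base R’s `optim`. The normal error distribution is used purely for illustration and is not a recommendation for this data; the data are made up, and the names `nll_normal` and `ml_fit` are hypothetical.

```r
set.seed(697)
# Made-up data standing in for plant height and leaf number (an assumption)
height <- runif(60, 5, 40)
leaves <- 3 + 0.8 * height + rnorm(60, sd = 4)

# Negative log-likelihood for a linear function with normal errors.
# The third parameter is log(sigma), so sigma stays positive during the search.
nll_normal <- function(par, x, y) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Minimize the negative log-likelihood from a rough starting guess
ml_fit <- optim(par = c(0, 1, log(sd(leaves))), fn = nll_normal,
                x = height, y = leaves, method = "BFGS")

est <- c(intercept = ml_fit$par[1], slope = ml_fit$par[2],
         sigma = exp(ml_fit$par[3]))
est
```

Swapping in a different error distribution only requires changing the density call inside the likelihood function (and, where needed, the link between the linear function and the distribution’s parameters), which makes it easy to compare fits.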
Hubway, the Boston-based bike rental company, is releasing all of their trip data. The data set is huge - about 60 MB. They’re also providing lat and long information for all stations. They are hosting a data visualization challenge. For your extra credit, find and visualize something interesting in the data. Note, ggplot and its map geom might come in handy (or not). If you also want to play with breaking down and analyzing data using different groupings, you may want to look into the plyr library, available on CRAN. We’ll be using plyr later in the course, but it might be useful for exploring the data.
Extra points for each interesting or surprising thing you find. And, heck, ifyou get into this, enter the challenge!