monologues with multivariate analysis
I wanted to rererefresh my understanding of multivariate analysis in ecology. So, here is my monologue of my googleventures.
First stop: I accidentally landed on this paper:
TEACHING MULTIVARIATE STATISTICS TO ECOLOGISTS AND THE DESIGN OF ECOLOGICAL EXPERIMENTS TO STATISTICIANS: LESSONS FROM BOTH SIDES
Snippets from: Link
“Ecologists generally become interested in multivariate analysis because they already have multivariate data”
“Incorrect inferences and conclusions can be drawn from ecological experiments that fail to take into account natural temporal and spatial variability.”
Which species are responsible for group differences?
This last line led me to another paper by the same author. At the time, I was interested in finding ways to identify the species that cause the difference between two communities: link
CANONICAL ANALYSIS OF PRINCIPAL COORDINATES: A USEFUL METHOD OF CONSTRAINED ORDINATION FOR ECOLOGY
An unconstrained ordination may be useful to visualize overall patterns of dispersion, but this simple example also demonstrates how real differences in location, which were masked in the PCA, were uncovered by the canonical approach.
In either case, correlations of species with canonical axes will provide a good indication of which species should be investigated in more detail with univariate analysis.
Clearly, this use of correlations with canonical axes is an indirect ‘‘post hoc’’ way of identifying possible contributions of individual species to differences among groups.
I failed to identify the right procedure. However, judging by the last snippet above, it is unclear whether such a procedure should be trusted anyway.
Dilemma in multivariate testing in ecology: My test is better than yours.
Snippets from papers that conclude one multivariate test is better than the others for variance partitioning. Remember that these are just snippets and do not convey the overall message of each paper. If I list a pro here, you can be sure there is a con somewhere else in the paper (follow the link), and vice versa. At the end of the day, none of the tests is perfect, but each works best when used and interpreted as its authors’ manual prescribes.
 “Regardless of the philosophical merits of distance-based or raw-data based methods for testing beta diversity (Legendre, Borcard & Peres-Neto 2005; Tuomisto & Ruokolainen 2006), it is clear that correlations based on distance matrices are inferior to RDA for modelling spatial patterns.” from link1

“The inflation of R2 statistics and the irregularities in the forward selection of eigenvectors indicate that the PCNM and MEM methods are unstable and vulnerable to statistical artefacts” (link1)
Jargons from Ecology
Some common terms that I frequently run into in ecology papers, each followed by links to relevant articles or papers. The links are usually top Google hits and highly relevant to understanding the jargon they follow. In some cases they are the result of “midnight caffeine-driven search rashes”: they explain the jargon well but are not always among the top Google hits.
 Polynomial Trend Surface Analysis
(link1) : “A variant form of multiple regression can be used to fit a nonlinear model of an explanatory variable x (or several explanatory variables xj) to a response variable y. ”
 Hellinger transformation
(link1):”The Hellinger transformation is relativization by row (sample unit) totals, followed by taking the square root of each element in the matrix.”
 Unimodal relationships
(link1):”a function f(x) is a unimodal function if for some value m, it is monotonically increasing for x ≤ m and monotonically decreasing for x ≥ m.”
 PCNM (principal coordinates of neighbor matrices)
(link1)”The technique represents the spatial configuration of sample points using principal coordinates of a truncated distance matrix amongst points. The resulting PCNM axes with positive eigenvalues are used as spatial components in variation partitioning, with each axis potentially modelling species clustering at different distances amongst sampling units. ”
(link2)”We need statistical methods to model spatial or temporal structures at all scales. ”
 Spectral decomposition
(link1): “In broad terms the spectral theorem provides conditions under which an operator or a matrix can be diagonalized (that is, represented as a diagonal matrix in some basis)”
 Canonical Correspondence Analysis
(link1): “The result is that the axes of the final ordination, rather than simply reflecting the dimensions of the greatest variability in the species data, are a linear combination of the environmental variables and the species data.”
“The choice of environmental variables greatly influences the outcome of CCA and other constrained ordinations.”
“The length of the arrow is proportional to the rate of change, so a long pH arrow indicates a large change and indicates that change in pH is strongly correlated with the ordination axes and thus with the community variation shown by the diagram.”
“In any case, you can always remove superfluous variables if they are confusing or difficult to interpret”
 Contingency Table
(link1):”A contingency table is a tabular representation of categorical data.”
 Reciprocal Averaging
(link1): “it starts from assigning arbitrary numerical scores to one variable values”
 Unconstrained Methods (Ordination)
(link1): “An unconstrained ordination procedure does not use a priori hypotheses in any way, but reduces dimensions on the basis of some general criterion, such as minimizing residual variance (as in PCA) or minimizing a stress function (NMDS) ”
Examples: principal component analysis (PCA), correspondence analysis (CA), metric multidimensional scaling (also called principal coordinate analysis or PCO), and nonmetric multidimensional scaling (NMDS).
 Chi-square distance
(link1)” The first premise of this distance function is that it is calculated on relative counts, and not on the original ones, and the second is that it standardizes by the mean and not by the variance. ”
 Spatial autocorrelation
(link1)”locations close to each other exhibit more similar values than those further apart”.
 Direct gradient analysis
(link1)”new techniques were developed to constrain the ordination according to the table E of explanatory environmental variables (‘‘direct comparison,’’ ‘‘direct gradient analysis’’; ”
“Technically, direct gradient analysis can be viewed as an extension of multiple regression, which has a single response variable, to the case of a multispecies response table: ”
 Indirect gradient analysis
(link1)“Historically, ecologists have first used indirect approaches for interpreting the structures of species assemblages (structural information extracted by the eigenanalysis of Y) in relation to environmental variability: site scores along the ordination axes, which are composite indices of species abundances contained in Y, were compared a posteriori to environmental variables (‘‘indirect comparison,’’ ‘‘indirect gradient analysis’’)”
 Constrained ordination (or canonical analysis)
(link1)”concentrates on the eigenanalysis of the fitted community table, allowing the direct analysis of the variation in species abundances explained by the environmental variability. ”
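Of all the jargon above, the Hellinger transformation is the easiest one to try by hand: divide each value by its row (sample unit) total, then take the square root. Below is a minimal awk sketch of that definition and nothing more. The function name hellinger and the toy numbers are made up for illustration, and it assumes a plain whitespace-separated abundance table with no header line and no row names.

```shell
# Hypothetical helper (the name is made up for this sketch):
# Hellinger-transform a whitespace-separated abundance table,
# rows = sample units, columns = species, no header, no row names.
hellinger() {
  awk '{
    total = 0
    for (j = 1; j <= NF; j++) total += $j   # row (sample unit) total
    for (j = 1; j <= NF; j++) {
      # relativize by the row total, then take the square root;
      # an all-zero row is left as zeros to avoid dividing by zero
      v = (total > 0) ? sqrt($j / total) : 0
      printf "%.4f%s", v, (j < NF ? " " : "\n")
    }
  }' "$@"
}

# Toy data (made up): two sample units, three species
printf '1 3 0\n4 4 8\n' | hellinger
```

For the toy table this prints 0.5000 0.8660 0.0000 on the first line and 0.5000 0.5000 0.7071 on the second. For real data, the vegan package in R offers the same transformation through decostand(x, method = "hellinger").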
How I ran into the PNAS LaTeX template.
Surprisingly, very few peer-reviewed journals in bioscience provide an official LaTeX template for submission. I have yet to find one that is fully supported by the publisher. I think every publisher should offer this option, since a LaTeX-formatted manuscript can make reviewing a bit easier than a double-spaced Word file with figures and legends in different places.
Anyway, PNAS has a LaTeX template, along with the class and style files, which can be downloaded from here http://www.pnas.org/site/authors/LaTex.xhtml (or just google Latex PNAS).
However, when I tried to compile it, it DIDN’T. (I use TeXShop 2.47 and TeX Live 2009. OK! I need to update.) But after a quick google, I found the solution here
http://www.latexcommunity.org/viewtopic.php?f=23&t=1470
which involved changing fonts in the .sty file provided by PNAS.
Sunday Morning Links
 Here is a link to slides from a metagenomics class at Stanford, taught by Alexander V. Alekseyenko and Susan P. Holmes.
 A link about Obama and Big Data. What’s in the big data moving forward?
 What do you need to know to assemble those genomes de novo?
 Lastly, this is what knocked out Manny.
Deleting blank lines using sed
# delete ALL blank lines from a file
sed '/^$/d'    # method 1
sed '/./!d'    # method 2
from http://sed.sourceforge.net/sed1line.txt
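One catch worth knowing: both methods above only remove lines that are completely empty, so a line holding nothing but spaces or tabs survives. Here is a small sketch on made-up input that shows the difference. The function names strip_empty and strip_blank are hypothetical, and the second pattern uses the POSIX [[:space:]] character class, which is my addition and not part of the sed1line.txt one-liners.

```shell
# Hypothetical wrappers (names made up for this sketch)
strip_empty() { sed '/^$/d' "$@"; }               # method 1: truly empty lines only
strip_blank() { sed '/^[[:space:]]*$/d' "$@"; }   # also whitespace-only lines

# Toy input: one empty line and one line containing only three spaces
printf 'first\n\n   \nsecond\n' | strip_empty    # the three-space line is kept
printf 'first\n\n   \nsecond\n' | strip_blank    # only "first" and "second" remain
```

Both functions pass their arguments straight to sed, so they also accept file names, for example strip_blank notes.txt > notes_clean.txt (file names made up).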
Batch rename of zillions of sequences in a single FASTA file
So, working with Illumina reads, I ran into a problem: all the sequences were anonymous, as every one of them was named No_name. I needed to rename them so that each sequence had a unique name. Obviously, in a situation like this, ‘awk’ came to mind. A life saver for perl deniers. Anyway, a simple one-liner gave my sequences unique names: each No_name was replaced by a number, so the first sequence became “1”, the second “2”, and so on to the end of the file.
$ awk '/^>/{$0=">" ++i}1' test.fna > test1.fna
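To see what the one-liner does without touching real data, here is a small sketch on a made-up FASTA snippet. It is the same idea written out a little more explicitly, with an optional prefix added in front of the counter. The function name rename_fasta, the prefix argument, and the read_ prefix are all hypothetical additions, not part of the original one-liner.

```shell
# Hypothetical helper (name and prefix argument made up for this sketch).
# Usage: rename_fasta PREFIX [FILE...]
# Pass "" as PREFIX to get plain numbers, as in the one-liner above.
rename_fasta() {
  prefix=$1
  shift
  # header lines (starting with ">") get a running number; other lines pass through
  awk -v prefix="$prefix" '/^>/ { n++; print ">" prefix n; next } { print }' "$@"
}

# Toy input (made up): two anonymous reads, the second spanning two lines
printf '>No_name\nACGT\n>No_name\nGGCC\nTTAA\n' | rename_fasta "read_"
```

On the toy input the headers become >read_1 and >read_2, and the sequence lines are left unchanged. A prefix can be handy when reads from several files are later merged, since bare numbers would then collide.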