Archive for the ‘Statistics’ Category

Dilemma in multivariate testing in ecology: My test is better than yours.

Snippets from papers that conclude one multivariate test is better than others for variance partitioning. Remember that these are just snippets and do not convey a paper's overall message. However, if I list a pro here, you can be sure there is a con somewhere else in the paper (follow the link), and vice versa. At the end of the day, none of the tests is perfect, but each works best when used and interpreted according to its authors' manual.

  • “Regardless of the philosophical merits of distance-based or raw-data based methods for testing beta diversity (Legendre, Borcard & Peres-Neto 2005; Tuomisto & Ruokolainen 2006), it is clear that correlations based on distance matrices are inferior to RDA for modelling spatial patterns.” from link1
  • “The inflation of R2 statistics and the irregularities in the forward selection of eigenvectors indicate that the PCNM and MEM methods are unstable and vulnerable to statistical artefacts” from link1

Categories: Ecology, Statistics

Sunday Morning Links

December 9, 2012 Leave a comment

A link on how to reduce variables in multivariate analyses such as CCA or RDA.

April 23, 2012 Leave a comment


January 30, 2012 Leave a comment

Probability is defined as the assessment of the possible outcomes of an experiment whose outcome is “random”. In this definition, the term “outcome” is not limited to the result of an experiment; it also covers an explanatory variable that is not fixed. For example, in a drug study the experimenter decides the drug dose, but if the subjects are chosen randomly then the experimenter has no control over their age; age is therefore not fixed and can be classified as an “outcome” under probability theory.

Every experiment has a set of possible outcomes, and the collection of all possible outcomes is called the sample space (S). The outcomes in a sample space should be mutually exclusive, collectively exhaustive, and as simple as possible.

Any subset of the sample space is called an event. For example, in a die-rolling experiment there are six possible outcomes (1u, 2u, 3u, 4u, 5u, 6u, with each term representing one possibility), and subsets such as {1u} or {1u, 6u} are events. In probability theory, we compute the chance that one of these events will occur, taking into consideration the probabilities of the elementary outcomes in the sample space.

However, the outcomes of most experiments, even the die experiment above, are not numbers but situations such as 1u, 2u, and so on. For convenience, these outcomes are mapped to (represented by) integers or real numbers, e.g. 1 to 6 for the die experiment instead of 1u to 6u. Technically, such a mapping is called a random variable, and random variables are commonly denoted X, Y, and Z.
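The die example above can be sketched in a few lines of R (the language used elsewhere on this blog). Note that the fair-die assumption, equally likely elementary outcomes, is mine and not stated in the post:

```r
# Sample space of the die experiment: six elementary outcomes.
S <- c("1u", "2u", "3u", "4u", "5u", "6u")

# An event is any subset of S, e.g. "rolled a 1u or a 6u".
A <- c("1u", "6u")

# For a fair die every elementary outcome has probability 1/6,
# so P(A) is just the fraction of outcomes that fall in A.
p_A <- length(A) / length(S)
p_A  # 1/3

# A random variable maps each outcome to a number, here 1 to 6.
X <- setNames(1:6, S)
X["6u"]  # 6
```

The named vector X is just one convenient way to represent the outcome-to-number mapping; any function from S to the reals would do.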


Categories: Statistics

ANOVA (Analysis of Variance)

January 22, 2012 Leave a comment

ANOVA, or Analysis of Variance, is the most commonly used test of hypotheses about three or more means. The null hypothesis of the test is that all of the means are derived from the same population. A two-sample t-test is a special case of ANOVA, and just as there are many types of t-test, there are many types of ANOVA, the simplest being one-way ANOVA. Unlike the t-test, however, ANOVA is not limited to two groups.

Like any other statistical test, there are assumptions for the test:

  • The data in each group must be normally distributed.
  • The standard deviations (sigma) of all populations in the model/datasets must be equal.

ANOVA is also known as the F-test, after its developer, R. A. Fisher.
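As a quick illustration, here is a one-way ANOVA sketch in R. The groups and values are made up for this example, not taken from any study:

```r
# Three groups of 10 observations; groups A and B share a mean,
# group C is shifted, so the null of equal means should be rejected.
set.seed(1)
values <- c(rnorm(10, mean = 5), rnorm(10, mean = 5), rnorm(10, mean = 7))
group  <- factor(rep(c("A", "B", "C"), each = 10))

# aov() fits the one-way ANOVA; summary() reports the F statistic
# (hence "F-test") and its p-value.
fit <- aov(values ~ group)
summary(fit)
```

With a shift of two standard deviations in group C, the p-value comes out far below 0.05, so the null hypothesis of equal means is rejected.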

Not complete yet!!! 🙂


Categories: Statistics

Mann-Whitney U Test

January 22, 2012 Leave a comment

A non-parametric test (a statistical test that does not assume the data are normally distributed) used to test the difference between two data sets. It ranks the pooled data and compares the two groups through their rank sums (in effect, their medians). From the result we can assess whether the difference between the two datasets is real or just a fluke. It is the non-parametric equivalent of the t-test for independent samples. A few things to consider when deciding if this test is appropriate:

  • You are investigating the difference between two datasets.
  • The data need not be normally distributed, but must be at least ordinal (i.e. can be ranked).
  • Although it is a non-parametric test, both datasets should have similarly shaped distributions.
  • Each sample should contain more than 5 observations, and fewer than 20 is also recommended.

Remember, the test is also known as the Mann–Whitney–Wilcoxon (MWW) or Wilcoxon rank-sum test.

MWW in R.

Let's compare the points scored per season by Kobe Bryant and Michael Jordan to see whether they differ.

I downloaded the points per season for both players.

I only intend to compare the first 15 seasons, as Jordan played only 15 seasons and Kobe is in his 16th.

Here we start with the null hypothesis that there is no significant difference between the two samples.

Here is the dataset that I used:

Bryant Jordan
539 2313
1220 408
996 3041
1485 2868
1938 2633
2019 2753
2461 2580
1557 2404
1819 2541
2832 457
2430 2491
2323 2431
2201 2357
1970 1375
2078 1640

I saved this as a tab-delimited text file and then loaded the dataset in R using:

> BryantJordan <- read.table("BryantJordan.txt", header = TRUE, sep = "\t")

> Bryant <- BryantJordan$Bryant

> Jordan <- BryantJordan$Jordan

> wilcox.test(Bryant, Jordan, correct = FALSE)

Wilcoxon rank sum test

data: Bryant and Jordan
W = 69, p-value = 0.0742
alternative hypothesis: true location shift is not equal to 0

The p-value is greater than 0.05, which means we fail to reject the null hypothesis that there is no significant difference between the two data sets.

If I had run wilcox.test(Jordan, Bryant, correct = FALSE), the p-value would have been exactly the same; only the W statistic changes (W = 156 instead of 69, and the two always sum to n1 × n2 = 15 × 15 = 225):


Wilcoxon rank sum test

data: Jordan and Bryant
W = 156, p-value = 0.0742
alternative hypothesis: true location shift is not equal to 0


Categories: Statistics