Downloading complete genomes from NCBI FTP (from the terminal)

May 31, 2012

First,

Open Terminal

Second,

Connect to NCBI genome FTP

$ftp ftp://ftp.ncbi.nih.gov/genomes/Bacteria/

Third,

Check out the list of genomes

ftp>ls

Fourth,

cd into the directory of your organism

ftp>cd <favorite_microbe>
Fifth,

Download the files you want using mget

ftp>mget *.gbk

which will prompt you with

mget *.gbk [anpqy?]?

Type y and press Enter, and the file will be downloaded to your computer (into the directory from which you connected to ftp).
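For repeated downloads, the interactive steps above can be scripted. The following is only a minimal sketch in R, assuming the directory layout is base/organism/file as in the steps above. The helper names genome_url and fetch_genome are made up for illustration, and the organism and file names in the usage comment are placeholders, not real entries. It does not check whether the server still serves this directory.

```r
# Assumption: files live at <base_url>/<organism>/<file>, as in the ftp steps.
base_url <- "ftp://ftp.ncbi.nih.gov/genomes/Bacteria"

# Hypothetical helper: build the full URL of one file.
genome_url <- function(organism, file) {
  paste(base_url, organism, file, sep = "/")
}

# Hypothetical helper: download one file into dest_dir and return its path.
fetch_genome <- function(organism, file, dest_dir = ".") {
  dest <- file.path(dest_dir, file)
  download.file(genome_url(organism, file), destfile = dest, mode = "wb")
  invisible(dest)
}

# Usage (placeholders, replace with a real directory and file name):
# fetch_genome("<favorite_microbe>", "<file>.gbk")
```

Unlike mget, this fetches one named file at a time, so you need to know the file name in advance (for example from the ls listing).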

Categories: Bioinformatics, BLAST

A collection of AppleScripts for TeXShop

Categories: Latex, Writing

A link on how to reduce variables in multivariate analysis like CCA or RDA

April 23, 2012

A list of R packages for environmental and ecological data analysis

April 16, 2012

Effective Scientific Writing: Clear and Concise

April 16, 2012

A few tips on clear and concise scientific writing, from the webinar lecture on YouTube (http://www.youtube.com/watch?v=rh-NHu5yOYc) given by Kristin L. Sainani, PhD, from Stanford.

1. Cut and Kill unnecessary words and phrases

2. Always follow Subject + Verb + Object or Subject + Verb

3. Use active voice. (It's OK to start a sentence with We or I)

4. Use strong verbs (Don’t turn verbs into nouns)

Loud music came from speakers embedded in the walls, and the entire arena moved as the hungry crowd got to its feet.

Compared to

Loud music exploded from speakers embedded in the walls, and the entire arena shook as the hungry crowd leaped to its feet.

from the novel Bringing Down the House

5. Don't bury the main verb

A few more tips:

a. Passive voice is OK in the Methods section, where it avoids the conundrum of who is doing what.

b. Use active voice in Introduction, Results, and Discussion.

c. When to use "which" and "that"? Which = non-essential clause, that = essential clause.

d. Write as you go.

e. Write for your readers, not yourself.

f. Avoid burying readers under complex detail, such as massive tables.

Categories: Writing

Learning ggplot2

March 31, 2012

I have always been content with the base graphing capabilities of R, which are indeed very powerful. However, I have been hearing a lot of praise for ggplot2 around the internet, and the graphs it generates really impressed me. After going through some basic ggplot2 tutorials, I think it will be a great tool to add to my R arsenal. So I decided to write a post on a step-by-step venture into the world of ggplot2. Most of the examples here are very preliminary plots, but the difference between those generated by the base graphics system and by ggplot2 will be apparent.

To follow this tutorial, I am assuming that you have the ggplot2 package installed in R. I am using the UScereal data from the MASS package. To see a list of the data sets available in R:

> data()

So, the first step is to load the two packages in R.

library(ggplot2)

library(MASS)

Now, let's load the UScereal data.

data(UScereal)

The dataset is basically a cereal-brand-by-property table: the cereal names are the rows and their properties are the columns. It lists 11 properties:

> colnames(UScereal)
[1] "mfr" "calories" "protein" "fat" "sodium" "fibre" "carbo" "sugars" "shelf" "potassium" "vitamins"

Here mfr (manufacturer) and vitamins are categorical, or factors in R terms. All the other variables are numeric. Now, let's look at some preliminary analysis. The first thing to do is to look at the distribution of these variables using histograms. Let's first look at the histogram of calories using ggplot2.

> qplot(calories, data=UScereal)
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Histogram of Calories

Here a message is also printed because we did not specify a binwidth; the graph is still produced, as ggplot2 sets a binwidth automatically. Let's see how the graph looks if we set a binwidth of 30.

> qplot(calories, data=UScereal, binwidth=30)

Histogram with binwidth of 30

Now, what if we want to see the histogram of calories separately for each manufacturer? We could do this with base graphics by setting par(mfrow=c(nrows,ncols)) and drawing each histogram in turn, but ggplot2 has a much easier way.

> qplot(calories, data=UScereal,binwidth=30,facets=.~mfr)

Histogram divided to different facet based on manufacturing company

Here the facets option splits the plot into multiple panels based on a categorical variable. The formula has the form rows ~ columns: .~mfr places one panel per manufacturer side by side in a single row, while mfr~. would stack the panels in a single column.
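For comparison, here is a rough sketch of the same per-manufacturer histograms in base graphics using par(mfrow=...). The grid shape (three columns) and the bin edges starting at 0 are my own choices for illustration, not something the faceted plot above dictates.

```r
library(MASS)   # provides the UScereal data set

mfrs <- levels(UScereal$mfr)

# Fixed 30-calorie bins that cover the whole range of the data.
breaks <- seq(0, max(UScereal$calories) + 30, by = 30)

# Lay the panels out in a grid with three columns; remember the old settings.
old_par <- par(mfrow = c(ceiling(length(mfrs) / 3), 3))

# Draw one histogram per manufacturer.
for (m in mfrs) {
  hist(UScereal$calories[UScereal$mfr == m],
       breaks = breaks, main = m, xlab = "calories")
}

# Restore the previous layout.
par(old_par)
```

The loop, the layout bookkeeping, and the shared bin edges all have to be handled by hand here, which is what the facets option takes care of in a single argument.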

Now, let's look at the relationship between the sugar content and the calories of the cereals.

> qplot(sugars,calories, data=UScereal)

Relationship between calories and sugar content of Cereals

Notice that the same qplot command drew a scatterplot here, whereas in the previous examples it drew a histogram. When qplot is given two variables it assumes a scatterplot; given a single variable, it draws a histogram. In other words, it chooses a default plot type based on the variables supplied. We are still free to choose the kind of plot we want through the geom option. For example, to see the kernel density estimate of calories, we can simply add geom="density" to the qplot command that generated the first histogram.

> qplot(calories, data=UScereal, geom="density")

Density Plot of Calories
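The idea of choosing a geom is perhaps easier to see with the layered ggplot() interface, where the data and the mapping are stated once and the geom is added explicitly. This is a small sketch of my own, not part of the original qplot examples; the object names are arbitrary. Note also that recent ggplot2 releases mark qplot() as deprecated, so this form is the one more likely to keep working.

```r
library(ggplot2)
library(MASS)   # provides the UScereal data set

# Shared data and aesthetic mapping: calories on the x axis.
base_plot <- ggplot(UScereal, aes(x = calories))

# Same data and mapping, two different geoms.
p_hist    <- base_plot + geom_histogram(binwidth = 30)
p_density <- base_plot + geom_density()

print(p_hist)
print(p_density)
```

The binwidth argument belongs to the histogram only; a density plot has no bins, so it is left out there.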

Now, if we want to look at the relationship between calorie content and sugar content broken down by both manufacturer and vitamin content, we can use this simple command:

qplot(sugars,calories, data=UScereal, facets=mfr~vitamins)
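As a sketch, the same two-way faceting can be written with ggplot() and facet_grid(), which takes the same rows ~ columns formula. The object name p_facets is arbitrary.

```r
library(ggplot2)
library(MASS)   # provides the UScereal data set

# Scatterplot of calories against sugars, split into a grid of panels:
# one row per manufacturer, one column per vitamin level.
p_facets <- ggplot(UScereal, aes(x = sugars, y = calories)) +
  geom_point() +
  facet_grid(mfr ~ vitamins)

print(p_facets)
```

Combinations of manufacturer and vitamin level with no cereals simply show up as empty panels.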

Now, let's look at the relationship between protein content, sugars, fat, and the manufacturer of these cereals in just one graph. I used point size for fat content and point color for the manufacturer.

> qplot(protein,sugars, data=UScereal, size=fat,color=mfr)

Finally, let's label this graph.

> qplot(protein,sugars, data=UScereal, size=fat,color=mfr,xlab="grams of protein in one portion",ylab="grams of sugar in one portion",main="Protein content Vs. sugar content of common US cereals")
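With this many arguments the qplot call gets hard to read. A possible alternative, sketched here with ggplot(), separates the mapping, the geom, and the labels into their own layers. The labels are the same strings as in the command above; the object name p_scatter is arbitrary.

```r
library(ggplot2)
library(MASS)   # provides the UScereal data set

# Map protein and sugars to position, fat to point size,
# and manufacturer to point colour.
p_scatter <- ggplot(UScereal,
                    aes(x = protein, y = sugars, size = fat, colour = mfr)) +
  geom_point() +
  labs(x = "grams of protein in one portion",
       y = "grams of sugar in one portion",
       title = "Protein content Vs. sugar content of common US cereals")

print(p_scatter)
```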

In the previous two-variable examples, where we did not specify a geom, qplot defaulted to points. Many other types of graph can be generated using the geom option. For example, we can generate a line graph by specifying geom="line":

> qplot(protein,sugars, data=UScereal, size=fat,color=mfr,geom="line",xlab="grams of protein in one portion",ylab="grams of sugar in one portion",main="Protein content Vs. sugar content of common US cereals")

Obviously, this is not an extensive tutorial on this amazing package. There are many visually striking graphs that you can generate with it. I will try to keep this page updated as I venture further into the beautiful world of R.

Categories: Graphics, R

Probability

January 30, 2012

Probability is the assessment of the possible outcomes of an experiment whose outcome is “random”. In this definition, the term “outcome” is not limited to the result of the experiment; it also covers any explanatory variable that is not fixed. For example, in a drug study the experimenter decides the drug dose, but if the subjects are chosen randomly, the experimenter has no control over their age. Age is therefore not fixed and counts as an “outcome” under probability theory.

Every experiment has a set of possible outcomes, and the collection of all possible outcomes is called the sample space (S). The outcomes in a sample space should be mutually exclusive, exhaustive, and as simple as possible.

Any subset of the sample space is called an event. For example, in a dice experiment there are six possible outcomes (1u, 2u, 3u, 4u, 5u, 6u, with each term representing one possibility); {1u}, {1u, 6u}, and so on are events. In probability theory, we compute the chance that one of these events will occur by taking into account the probabilities of the elementary outcomes in the sample space.
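To make this concrete, here is a small R sketch of the dice example. It assumes a fair die, so every elementary outcome gets probability 1/6; that assumption and the helper name prob_event are mine, not from the source.

```r
# Sample space for one roll of a die, using the labels from the text.
S <- c("1u", "2u", "3u", "4u", "5u", "6u")

# Assumption: a fair die, so each elementary outcome has probability 1/6.
p <- setNames(rep(1 / 6, length(S)), S)

# An event is a subset of S; its probability is the sum of the
# probabilities of the elementary outcomes it contains.
prob_event <- function(event) sum(p[event])

A <- c("1u", "6u")   # the event "the die shows 1 or 6"
prob_event(A)        # 1/3 under the fair-die assumption
```

The whole sample space is itself an event, and prob_event(S) adds up to 1.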

However, the outcomes of most experiments, including the dice experiment above, are not numbers but situations like 1u, 2u, and so on. Thus, mostly for convenience, these outcomes are mapped to integers or real numbers, such as 1 to 6 for the dice experiment instead of 1u to 6u. Technically, this mapping from outcomes to numbers is called a random variable. Random variables are commonly denoted X, Y, Z.
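The mapping from outcomes to numbers can be written down directly. In this R sketch, X maps each die outcome to the number 1 to 6, and Y is a second, made-up random variable on the same sample space (1 if the roll is even, 0 otherwise), included only to show that one experiment can carry several random variables.

```r
# Sample space for one roll of a die, using the labels from the text.
S <- c("1u", "2u", "3u", "4u", "5u", "6u")

# Random variable X: map each outcome to the integer 1..6.
X <- function(outcome) match(outcome, S)

# A second (illustrative) random variable Y: 1 if the roll is even, else 0.
Y <- function(outcome) as.integer(X(outcome) %% 2 == 0)

# Simulate ten rolls of a fair die and look at the values X and Y take.
set.seed(1)
rolls <- sample(S, size = 10, replace = TRUE)
x <- X(rolls)
y <- Y(rolls)

data.frame(outcome = rolls, X = x, Y = y)
```

Once the outcomes are numbers, ordinary arithmetic such as mean(x) becomes meaningful, which is the convenience the text refers to.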

Source:

http://www.stat.cmu.edu/~hseltman/309/Book

Categories: Statistics