Archive for the ‘Bioinformatics’ Category

Sunday Morning Links

December 9, 2012

Batch rename of zillions of sequences in single fasta file

While working with Illumina reads, I ran into a problem: every sequence was anonymous, named No_name. I needed to rename them so that each sequence had a unique name. Obviously, in this situation ‘awk’ came to mind. It is a life saver for Perl deniers. Anyway, a simple awk one-liner gave my sequences unique names. Each No_name was replaced by a number: the first sequence became “1”, the second “2”, and so on to the end of the file.

$ awk '/^>/{$0=">"++i}1' test.fna > test1.fna
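For anyone who would rather not use awk, the same renaming can be sketched in Python. This is only an illustration of what the one-liner does; the function names rename_fasta_headers and rename_fasta_file are made up for this example.

```python
def rename_fasta_headers(lines):
    """Yield FASTA lines, replacing every header line with a running number."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            # Header line: drop the old name (e.g. No_name) and number it.
            count += 1
            yield f">{count}\n"
        else:
            # Sequence line: pass through unchanged.
            yield line


def rename_fasta_file(in_path, out_path):
    """Read in_path and write a copy with numbered headers to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(rename_fasta_headers(src))
```

Calling rename_fasta_file("test.fna", "test1.fna") would mirror the awk command above.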

Categories: Bioinformatics

Downloading complete genomes from ncbi ftp (from terminal)

May 31, 2012


Open Terminal


Connect to NCBI genome FTP



Check out the list of genomes



cd into the directory of your organism

ftp> cd <favorite_microbe>

Download the files you want using mget

ftp> mget *.gbk

which will result in

mget *.gbk [anpqy?]?

Type y and press Enter, and the file will be downloaded to your computer (into the directory from which you connected to the FTP server).
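The same download can be scripted instead of typed at the ftp prompt. Below is a minimal sketch using Python's standard ftplib module, assuming an anonymous login is accepted. The function names, the host, and the directory in the usage comment are placeholders for this example, not NCBI's actual addresses.

```python
from fnmatch import fnmatchcase
from ftplib import FTP


def matching_files(names, pattern="*.gbk"):
    """Return the file names that match a shell-style pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


def download_matching(host, remote_dir, pattern="*.gbk"):
    """Download every file in remote_dir that matches pattern.

    Files are written to the current working directory, just like mget.
    """
    with FTP(host) as ftp:
        ftp.login()          # anonymous login (assumption)
        ftp.cwd(remote_dir)  # same as "cd <favorite_microbe>"
        for name in matching_files(ftp.nlst(), pattern):
            with open(name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)


# Hypothetical usage; replace the host and path with the real ones:
# download_matching("ftp.example.org", "path/to/<favorite_microbe>")
```

Unlike mget, this does not ask for confirmation before each file.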

Categories: Bioinformatics, BLAST

Common OTUs across sample using mothur

November 23, 2011

What do I have?
Three technical replicates of 16S amplification from a sample.
What do I intend to do?
Test how reproducible the replicates are.
How am I going to do it?
Following the Zhou et al. 2011 ISMEJ paper, I will compare the OTUs common to all the samples. At the end, I want to make a Venn diagram showing the number of OTUs that are shared across the samples.
Let it begin...
I will use mothur v 1.22 to do this. Let's name my samples A1, A2, and A3. I have a fasta file for each sample: A1.fasta, A2.fasta, and A3.fasta. The sequences were already filtered to remove erroneous and chimeric reads. Additionally, they were trimmed to the same region, then aligned and filtered (to remove the gap columns from the alignment).
First, I created a group file. It's a two-column file that contains the name of the sequence in the first column and the sample name in the second column. For example:
G45667889 A1
G47879890 A2
G45454800 A3
G5803i808 A1
mothur has a command, make.group, that creates this file if you did not create it during the demultiplexing step. The call ends with the sample names: ...,group=A1-A2-A3)

It will create a file called merge.groups, which will be used extensively for downstream analysis.
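To make the group file format concrete, here is a minimal Python sketch that builds the same two-column table from per-sample fasta files. It is not mothur's implementation, only an illustration of the layout described above; the function names and the tab separator are assumptions for this example.

```python
def fasta_ids(lines):
    """Yield the sequence name from every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The name is the first token after ">".
                yield fields[0]


def group_rows(samples):
    """Yield 'sequence_name<TAB>sample_name' rows.

    samples maps a sample name to an iterable of that sample's FASTA lines.
    """
    for sample, lines in samples.items():
        for seq_id in fasta_ids(lines):
            yield f"{seq_id}\t{sample}"


def write_group_file(fasta_paths, out_path):
    """fasta_paths maps sample name -> fasta file path."""
    with open(out_path, "w") as out:
        for sample, path in fasta_paths.items():
            with open(path) as handle:
                for row in group_rows({sample: handle}):
                    out.write(row + "\n")
```

For the samples here, write_group_file({"A1": "A1.fasta", "A2": "A2.fasta", "A3": "A3.fasta"}, "my.groups") would produce one row per sequence; the output name my.groups is made up.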

Since we are going to find the OTUs common across samples, we will combine all the fasta files using cat.

cat A1.fasta A2.fasta A3.fasta > A123.fasta

Demultiplexing sff files based on barcode

November 15, 2011

Most data repositories prefer that submitters of 16S data demultiplex the dataset into one sff file per sample. There are a number of ways to do it, but the common requirement for all the methods is sfffile. It's a tool that comes with the 454 software. Here is what its info says:

 The sfffile program constructs a single SFF file containing the reads from
   a list of SFF files and/or 454 runs.  The reads written to the new SFF
   file can be filtered using inclusion and exclusion lists of accessions.

So, basically, with this command we can deconstruct and reconstruct the sff file as per our need.

Here, I will describe how I deconstructed a single sff file into multiple sff files based on their barcode, or MID. First, I made a list of the barcodes corresponding to the samples that I wanted to extract. Then I formatted that file exactly like the MIDConfig.parse file. Here is what that file looks like:

GSMIDs
{
mid = "MID1", "ACGAGTGCGT", 2;
mid = "MID2", "ACGCTCGACA", 2;
}

mid is the keyword that sfffile looks for. “MID1” is the name of your sample, followed by the barcode, and the number at the end is the number of mismatches to allow during demultiplexing. “GSMIDs” can be replaced with whatever text you prefer, but the exact same text needs to be specified when you run sfffile.
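With 20-30 samples, typing the configuration by hand gets error-prone, so the file can be generated instead. The Python sketch below assumes the layout described above: a label line (GSMIDs by default), followed by a brace-delimited block of mid entries. The function name format_mid_config is made up for this example.

```python
def format_mid_config(barcodes, label="GSMIDs", mismatches=2):
    """Return the text of a MID configuration file.

    barcodes maps a sample name to its barcode sequence.
    label is the text later passed to sfffile; mismatches is the
    number of mismatches allowed for every barcode.
    """
    lines = [label, "{"]
    for name, barcode in barcodes.items():
        lines.append(f'mid = "{name}", "{barcode}", {mismatches};')
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Writing the result of format_mid_config({"MID1": "ACGAGTGCGT", "MID2": "ACGCTCGACA"}) to your_barcode_file.txt gives the two entries shown above.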

So, we have everything that we need (the sff file, the barcode information, and the 454 toolkit installed).

sfffile -s GSMIDs -mcf your_barcode_file.txt -nmft your_sff_file.sff
-s : splits the reads by barcode into separate files
-mcf : the MID configuration file
-nmft : stops a manifest from being written into the sff files

As a result, your sff file will be split into several sff files based on the barcode information that you included in the configuration file. The output file names follow the pattern 454Reads.*.* (the same pattern passed to md5sum below).

Now, we need to generate an MD5 checksum for each of the files that we generated. An MD5 checksum, or hash sum, is used to detect errors introduced during storage or transfer. SRA uses the file name and MD5 checksum to track and link files to their proper Runs. A tutorial on this can also be found on the SRA website here.

Since there will be 20-30 sff files, it's tedious to generate the MD5 for each file one at a time, so here is a simple and easy way to do it. Before executing this, make sure you are in the directory that contains all your demultiplexed sff files.

you@yourserver/dir_with_sff_file> md5sum 454Reads.*.* >> md5.txt

It will append the MD5 checksums of all the matching sff files in the directory to md5.txt.
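On a machine without md5sum, the same checksums can be computed with Python's standard hashlib module. This is a minimal sketch; the function names are made up for this example, and the output imitates md5sum's "checksum, two spaces, file name" layout.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_lines(directory, pattern="454Reads.*.*"):
    """Yield one md5sum-style line per file matching pattern in directory."""
    for path in sorted(Path(directory).glob(pattern)):
        yield f"{md5_of_file(path)}  {path.name}"
```

Writing the lines from md5_lines(".") to md5.txt gives the same information as the shell command above.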

Please note that this was done on a Linux machine with the 454 tools installed.

SRA: Seriously Ridiculously “bentbackward” Archive

June 22, 2011

OK, so I wanted to rerun some published 454 data with some in-house shell scripts to test whether they give me the same results. First, I needed to get my hands on those fasta files, which I thought would be “like this“, but it turned out to be the opposite. All the data generated from 454 are stored in SRA (Short Read Archive), hosted by NCBI. SRA is shutting down due to budget constraints, but it is still in service.

So, as per the paper, I went to the SRA website (I used Google Chrome) with the study# and the sample#. I entered the study# (it starts with SRP00**) in the search box, and the result page had all the samples from the study ready to download. However, there was one caveat: Asperasoft, a high-speed file transfer utility, must be installed to download the files. Additionally, the software only works in Firefox. So I downloaded Asperasoft, but this plugin is not a “double-click” type of installation where you double-click the file and it installs somewhere I don't wanna know. For installing Asperasoft, there is an instruction here (download the right version). All I needed to do was go to the terminal and run the downloaded file (which is apparently a shell script).


The name of the file will be different for Mac and Windows. Duh!

After the installation, I fired up Firefox, went to the SRA website, typed the study#, and downloaded all the .SRA files that I needed.

Well, problem solved. But wait a minute, I want fasta files, not SRA files. So now how do I go about getting the fasta and qual files from the SRA files? Take a guess.

Yes, you are right. Another software.

A utility is provided by NCBI here. I downloaded and untarred the software, which contains a number of scripts that deal with SRA files, but all I need is a script that gives me a fasta and a qual file. For that, I first converted the .SRA files to a FASTQ file, which contains both the nucleotides and their quality scores.

COMMAND: ./fastq-dump -A SRR0**** SRR0****.sra

Remember to cd into the directory with the scripts and copy the .sra file into the directory.
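For the last step, splitting the FASTQ into a fasta and a qual file, a minimal Python sketch is below. It is not part of the NCBI toolkit. It assumes plain four-line FASTQ records and Phred+33 quality encoding (the offset parameter), and the function names are made up for this example.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        seq = next(it).strip()
        next(it)  # the "+" separator line
        qual = next(it).strip()
        yield header[1:], seq, qual  # drop the leading "@"


def to_fasta_and_qual(lines, offset=33):
    """Return (fasta_text, qual_text) for the given FASTQ lines.

    Quality characters are converted to integer scores by
    subtracting offset from each character code.
    """
    fasta, qual = [], []
    for name, seq, quality in parse_fastq(lines):
        fasta.append(f">{name}\n{seq}\n")
        scores = " ".join(str(ord(char) - offset) for char in quality)
        qual.append(f">{name}\n{scores}\n")
    return "".join(fasta), "".join(qual)
```

Opening the FASTQ file and passing the file handle to to_fasta_and_qual gives two strings that can be written out as the fasta and qual files. A truncated final record will raise an error rather than be silently dropped.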

Categories: Bioinformatics