Friday, October 08, 2010

Uploaded Sanjana's video

Uploaded Sanjana's video.  Click here to view it.

Saturday, September 11, 2010

Basic Sequence objects - object oriented programming concepts

In bioinformatics, parsing and processing nucleotide or protein sequences is a common first step in most analyses. In this tutorial, I will explain
  1. How a simple Sequence class can be created and used in programming languages like Java, Perl and Python.
  2. How sequence objects can be easily extended to include more functionality.
  3. How you can use the sequence objects already available in BioJava, BioPerl and BioPython instead of creating your own classes.
Thus this tutorial should help in understanding basic programming concepts, using sequence objects as the example.

Perl

A. Using a data structure and a procedural programming approach:

Let's call the program 'seq_hash.pl'

#!/usr/bin/env perl
use strict;
use warnings;

# Anonymous Hash reference in Perl used for storing the Sequence content
my $sequence_hash = {
  "name"       => "cytochrome",
  "type"       => "DNA",
  "seq_string" => "ADSFASDFASDF"
};

# A function/subroutine to get the sequence hash reference contents as FASTA string
sub seq_hash_to_fasta {
   my ($seq_hash) = @_;
   my $fasta_out  = ">$seq_hash->{'name'}\n$seq_hash->{'seq_string'}\n";
   return $fasta_out;
}

my $fasta_string = &seq_hash_to_fasta($sequence_hash);

print "FASTA string of Sequence:\n";
print $fasta_string;


Running the program from the command prompt
Prompt$ perl seq_hash.pl
FASTA string of Sequence:
>cytochrome
ADSFASDFASDF


B. Using Object-Oriented Concept - Perl Module:

use strict;
use warnings;

package SimpleSequence;


# Function: Constructor for the class; returns a SimpleSequence object
# Usage   : $seq = SimpleSequence->new('seq Name')
# Returns : A SimpleSequence object. A quick synopsis:
#           $seq_object->name() - name of the sequence
#           $seq_object->sequence() - sequence as a string
#
sub new {
        my ($class,$name) = @_;
        my $data = {
                "name"     => $name,
                "sequence" => "",
        };
        bless $data, $class;
        return $data;
}

# Get or Set Sequence name
# Usage   : $seq_obj->name('gi:145443344') - sets the name of sequence
#         : $name= $seq_obj->name() - gets the name of sequence
sub name {
        my ($self, $arg) = @_;
        if (defined $arg) {
                $self->{"name"} = $arg;
        }else {
                return $self->{"name"};
        }
}

# Get or Set sequence string
# Usage   : $seq_obj->sequence('ATAGAFAF') - sets the sequence
#         : $seq_str = $seq_obj->sequence() - gets the sequence as string
sub sequence {
        my ($self, $arg) = @_;
        if (defined $arg) {
                $self->{"sequence"} = $arg;
        }else {
                return $self->{"sequence"};
        }
}

# Get FASTA string
# Usage   : $fasta_str = $seq_obj->to_fasta()
sub to_fasta {
     my ($self) = @_;
     my $name = $self->{"name"};
     my $seq_str = $self->{"sequence"};
     my $fasta_out  = ">$name\n$seq_str\n";
     return $fasta_out;
}

# A Perl module file must end with a true value
1;
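
The module can be tried out with a short script. The following is a minimal sketch and not part of the original post: it repeats a compact version of the SimpleSequence class inline so that it runs as a single file, and the subclass name CountingSequence and its seq_length() method are hypothetical names, chosen only to illustrate how a sequence class might be extended.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Compact inline copy of the SimpleSequence class, repeated here
# only so that this sketch runs as a single file.
package SimpleSequence;

sub new {
    my ($class, $name) = @_;
    my $data = { "name" => $name, "sequence" => "" };
    return bless $data, $class;
}

sub name {
    my ($self, $arg) = @_;
    $self->{"name"} = $arg if defined $arg;
    return $self->{"name"};
}

sub sequence {
    my ($self, $arg) = @_;
    $self->{"sequence"} = $arg if defined $arg;
    return $self->{"sequence"};
}

sub to_fasta {
    my ($self) = @_;
    return ">" . $self->name() . "\n" . $self->sequence() . "\n";
}

# Hypothetical subclass (an assumption for illustration): it inherits
# every method of SimpleSequence and adds one extra method.
package CountingSequence;
our @ISA = ('SimpleSequence');

sub seq_length {
    my ($self) = @_;
    return length $self->sequence();
}

package main;

my $seq = CountingSequence->new('cytochrome');
$seq->sequence('ATGGCGTAA');

print $seq->to_fasta();
print "Length: ", $seq->seq_length(), "\n";
```

Because CountingSequence lists SimpleSequence in @ISA, the call to to_fasta() is resolved in the parent class, which is the usual way a Perl class is extended without touching the original code.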

Tuesday, August 31, 2010

PERL - Lesson 2 : Variables, Subroutines and Objects

Variables
"In computer programming, a variable is a symbolic name given to some known or unknown quantity or value, for the purpose of allowing the name to be used independently of the value it represents ... A variable has three essential attributes: a symbolic name (also known as an identifier), a data location (generally in storage or memory, comprised of address and length), and the value, represented by the data contents of that location. These attributes are often assigned at separate times during the program execution ... Variables often also have a fourth attribute, a type or class which specifies the kind of information the variable stores." -- WikiPedia

Variables are essential components of any programming language. They provide a symbolic name for accessing data held in computer memory. Different programming languages have different ways to initialize and manipulate variables. In this post I will focus on the Perl programming language and use a simple script to explain the different types of variables available and the common operations performed on them. The explanations are embedded as comments within the script.

I will also explain a few other programming concepts, such as conditions and loops in Perl, through this script. So stay alert to learn them.

1. Scalar Variables

#!/usr/bin/perl 

use strict;
use warnings;

# Scalar
# ======

# A scalar variable holds a single value
#     * begins with the '$' sign
#     * strings, numbers and references are the typical scalar types
#     * notice there is no int, string or float type declaration
#     * single or double quotes can be used for strings or characters
#     * no distinction between a character and a string

# Assignment
#
my $value_num = 1;
my $value_str = "1";

# Printing 
print "Value number : $value_num\n"; # note that variables are interpolated inside double quotes along with the rest of the string (unlike Java or C)
print "Value String : $value_str\n";

# Double quotes
print "Value String : $value_str\n";

# Single quotes
print 'Value String : $value_str\n'; # in single quotes, variables and escapes such as \n are not interpolated

# qq can be used as a double-quote delimiter
# q can be used as a single-quote delimiter

print qq{Value String : $value_str\n}; # similar to double quotes
print q{Value String : $value_str\n}; # similar to single quotes

# Joining

my $first_name = "vivek";
my $last_name  = "gopalan";

my $full_name1 = "$first_name $last_name"; 
my $full_name2 = $first_name . " " . $last_name;  # dot is used for concatenating scalars

print "Full name1 : $full_name1\n";
print "Full name2 : $full_name2\n";

my $value_sum = $value_num + 1; # arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/)

my $empty_num = undef; # initialize with the undefined value
undef $empty_num; # reset the value to undef - the variable itself stays declared

# string manipulation can be done using built-in functions - substr(), index(), length()
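
The string functions mentioned in the last comment can be sketched as follows. This is a minimal illustration, assuming a short made-up DNA string; the variable names are my own and not part of the original script.

```perl
use strict;
use warnings;

my $dna = "ATGGCGTAA";   # made-up example sequence

my $len      = length $dna;          # number of characters in the string
my $codon    = substr($dna, 0, 3);   # 3 characters starting at offset 0
my $pos      = index($dna, "GCG");   # zero-based offset of the first match, -1 if absent
my $lower    = lc $dna;              # lower-case copy of the string
my $reversed = scalar reverse $dna;  # reverse in scalar context reverses the characters

print "Length      : $len\n";
print "First codon : $codon\n";
print "GCG found at: $pos\n";
print "Lower case  : $lower\n";
print "Reversed    : $reversed\n";
```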

2. Array Variables

# Array 
# ------
# Ordered collection of scalar values 
#  * begins with @
#  * zero-based
#  * the initial size does not need to be defined
#  * elements can be any type of scalar
#  * elements are separated by commas
# Assignment
my @colors = ('red', 'blue', 'yellow');  # notice the parentheses '()' 
my @mixed_scalars = ('red', 'blue', 'yellow',1);  # does not matter if the content is a string, number or reference

my @empty_array =();

# Printing
print "Colors : @colors\n"; # prints as a space-delimited string
print @colors;              # prints without any delimiter when not within double quotes

# Common operations: get the size of the array, add, delete or update an element, get a subset of elements, join into a scalar, search for an element 

# get element - zero-based
my $color = $colors[0]; # notice the square brackets '[]' to retrieve an element from the array
my @subset = @colors[0..1]; # array slice: the first two elements

# updating element
$colors[1]= 'gray';

# number of elements in the array (an array in scalar context gives its size)
my $color_array_size = @colors; 
my $color_array_size_better_way = scalar @colors; 
my $color_array_max_index_size = $#colors;  # index of the last element = array size - 1


# manipulation

my @new_colors = @colors;
push @new_colors, 'green'; # adds to the end of the array
unshift @colors, 'green';  # adds to the beginning of the array - the counterpart of push

my $last_color  = pop @new_colors; # removes and returns the last element of the array
my $first_color = shift @new_colors; # removes and returns the first element of the array - the counterpart of pop

# foreach loop
foreach my $color_tmp (@colors) {
   print  "Color [foreach] : $color_tmp\n";
}

# for loop
for (my $index=0 ; $index <= $#colors ; $index++) {
   print  "Color [for] : $colors[$index]\n";
}

# time to introduce the $_ variable
# it is the default variable used when no loop variable is declared
foreach (@colors) {
   print  " Color [foreach] : ";
   print ;      # print without arguments prints $_
   #print $_;
   print  "\n";
}

# search array elements
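
The original script stops at the heading for searching. One common way to search an array is sketched below, assuming the same @colors data; the variable names are illustrative only. It uses the built-in grep function and a plain loop.

```perl
use strict;
use warnings;

my @colors = ('red', 'blue', 'yellow');

# grep returns the elements for which the block is true;
# in scalar context it returns the number of matching elements.
my @with_l   = grep { /l/ } @colors;            # elements containing the letter 'l'
my $has_blue = grep { $_ eq 'blue' } @colors;   # count of elements equal to 'blue'

# Find the index of the first matching element with a plain loop.
my $blue_index = -1;
for my $i (0 .. $#colors) {
    if ($colors[$i] eq 'blue') {
        $blue_index = $i;
        last;   # stop at the first match
    }
}

print "Colors containing 'l' : @with_l\n";
print "Found blue at index $blue_index\n" if $has_blue;
```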


3. Hash Variables

Hash variables are similar to arrays, except that any scalar value, rather than only an integer, can be used as the index under which a value is stored (keys are stored as strings). Hash variables are also called dictionaries or associative arrays. The indices of a hash are referred to as keys, and each unique key can be assigned one scalar value. The main advantage of a hash is that a value can be looked up directly when its key is known, whereas finding a specific value in an array requires iterating through its elements.

my %personal_info = (
        "first_name" => "vivek",
        "last_name"  => "gopalan"
);  # notice the parentheses '()' and the % symbol
my $fname1 = $personal_info{first_name}; # notice the 'curly' brackets

# 'first_name' and 'last_name' are the keys, and "vivek" and "gopalan" are the values. Values can be any scalar; keys are stored as strings.
# '=>' is used to map a key to its value. Each key-value pair should be separated by ',' (comma)
#   * hash variables start with '%'
#   * parentheses enclose the keys and values (similar to arrays)
#   * pairs are separated by commas
#   * each unique key is associated with one value

# other representation
# my %personal_info = ("first_name","vivek","last_name","gopalan"); # written like an array; alternating elements are taken as keys and values.
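
The common hash operations can be sketched in the same way as for arrays. This is a minimal example, assuming the same %personal_info data; the 'city' key is a hypothetical addition used only to show adding and deleting an entry.

```perl
use strict;
use warnings;

my %personal_info = (
    first_name => "vivek",
    last_name  => "gopalan",
);

# Add a key (hypothetical 'city' key, for illustration only)
$personal_info{city} = "unknown";

# exists checks whether a key is present in the hash
my $has_city = exists $personal_info{city} ? 1 : 0;

# delete removes the key and its value
delete $personal_info{city};

# keys returns the list of keys; sort gives a predictable order
foreach my $key (sort keys %personal_info) {
    print "$key => $personal_info{$key}\n";
}

# keys in scalar context gives the number of entries
my $n_keys = scalar keys %personal_info;
print "Number of keys : $n_keys\n";
```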


Summary

# A. Scalar variable
my $first_name = "vivek"; # notice the dollar sign
my $last_name  = "gopalan";

# B. Array variable

my @personal_info = ( "vivek", $last_name); # notice the parentheses and the @ symbol
my $fname = $personal_info[0]; # notice the 'square' bracket and the integer index

# C. Hash variable
my %personal_info = (
        first_name => "vivek",
        last_name  => "gopalan"
);  # notice the parentheses and the % symbol
my $fname1 = $personal_info{first_name}; # notice the 'curly' brackets


Simple Puzzle

# Identify the mistakes in the following variable declarations.

# A
my @personal_info = ["vivek", "gopalan"];
# B
my %personal_info = {"first_name","vivek"};
# C
my %status = {"company" => "ABC company", "degrees" => ("B.Tech", "PhD") };

# D
my @details = ( ("vivek", "gopalan"), ("B.Tech", "PhD"));
my @name_details = $details[0];


Subroutines

Subroutines or functions are another essential component of a programming language. A subroutine wraps a set of statements that perform a specific task and can be reused. It keeps the code organized. Subroutines usually take some input values, perform a specific task and then return output values. A subroutine is not required to take an input or to return an output.

# subroutine to calculate the sum of the first 'n' integers.
# 'n' is the input value and the sum is the output value
sub get_sum {
   my ($n) = @_; # the first value of the @_ array is assigned to $n
   my $sum = 0;
   for (my $i = 1 ; $i <= $n ; $i++) {
      $sum += $i;
   }
   return $sum;
}

# Calling the function
my $numb = 10;
my $num_total = &get_sum($numb); # the 'ampersand' prefix is optional; get_sum($numb) works equally well

# Notice that the arguments are not named in the subroutine definition (unlike Java or C).
# All the input arguments are passed in a special array - @_.
# @_ is the default array variable inside a subroutine.
# Each value of the @_ array has to be a scalar (arrays and hashes can be passed as references -- defined in the next section)
# The 'return' statement gives the value back to the caller. It can return a scalar or a list.
# If no return statement is used, the value of the last evaluated expression is returned (not a good practice)

# Instead of 'my ($n) = @_' the following line can be used..
my $n = shift; # inside a subroutine, shift and pop without an argument operate on @_

# $_[0], $_[1] .. are used to access specific elements of the @_ array.
# the $n variable declared with 'my' within the get_sum function is lexically scoped (it can only be accessed within the function)
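
As a small sketch of the points above, the subroutine below reads its argument with shift and receives an array as a reference (references are explained in the next section). The name sum_of is hypothetical and not part of the original post.

```perl
use strict;
use warnings;

# Sum the elements of an array that is passed in by reference.
sub sum_of {
    my $numbers_ref = shift;   # shift without an argument takes the first element of @_
    my $sum = 0;
    $sum += $_ foreach @$numbers_ref;   # dereference and add each element
    return $sum;
}

my @values = (1 .. 10);
my $total  = sum_of(\@values);   # the backslash passes a reference to the array

print "Sum of 1..10 : $total\n";
```

Passing \@values keeps the whole array as a single scalar argument, so it is not flattened into @_ together with any other arguments.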

Time to introduce references

As explained above, variables are stored at some address in computer memory. A reference is nothing but the memory address associated with a variable of any type (scalar, array or hash). It is important to remember that a reference to any type of variable is always a scalar, i.e. it is just the location value associated with the variable. For arrays and hashes, you can think of the reference as the unique memory identifier of the whole variable.
# scalar reference
my $first_name = "vivek";
my $first_name_ref = \$first_name; # notice the 'backslash'

my $first_name0 = $$first_name_ref; # dereferencing: extra 'dollar' sign

# Array reference
my @personal_info = ("vivek", "gopalan");
my $personal_info_array_ref = \@personal_info; # notice the 'backslash'
my @personal_info0 = @$personal_info_array_ref; # dereferencing: get the array back from reference

my $first_name1 = $personal_info_array_ref->[0]; # dereferencing: notice the '->' operator to extract the data

# Hash reference
my %personal_info = ("first_name","vivek");
my $personal_info_hash_ref = \%personal_info; # notice the backslash
my %personal_info0 = %$personal_info_hash_ref; # dereferencing: get the hash back from reference

my $first_name2= $personal_info_hash_ref->{"first_name"}; # dereferencing: notice the '->' operator
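
One consequence of references is worth seeing in action: a change made through a reference affects the original variable, while dereferencing into a new array makes an independent copy. A minimal sketch, with illustrative variable names of my own:

```perl
use strict;
use warnings;

my @colors      = ("red", "blue");
my $colors_ref  = \@colors;        # reference to the original array
my @colors_copy = @$colors_ref;    # dereferencing into a new array makes a copy

push @$colors_ref, "yellow";       # changes @colors itself
$colors_copy[0] = "changed";       # changes only the copy

print "Original : @colors\n";
print "Copy     : @colors_copy\n";
```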


Anonymous arrays and hashes

In the previous section we saw how to define references to an existing array or hash variable. It is a two-step process: first an array or hash variable is defined, and then it is converted to a scalar reference by adding a 'backslash' before the variable. Now I will show how array and hash references can be created without defining the variables first. This is a very important concept for handling complex data models such as multidimensional arrays, and even object-oriented Perl. Once you comprehend this concept, you can call yourself a Perl "Code Breaker".
# Anonymous Array reference 
my $personal_info_array_ref = ["vivek", "gopalan"]; # just use square brackets instead of parentheses and assign the value to a scalar ($) instead of an array (@).

# This has the same effect as defining an array and then storing a reference to it in a scalar variable
my @personal_info = ("vivek", "gopalan");
my $personal_info_array_ref = \@personal_info; # notice the 'backslash'


my $first_name1 = $personal_info_array_ref->[0]; # dereferencing: notice the '->' operator to extract the data

# Anonymous Hash reference 

my $personal_info_hash_ref = { "first_name" => "vivek" }; # notice the curly brackets and assignment to the scalar($) value.

# This has the same effect as defining a hash and then obtaining a reference to it through the backslash
my %personal_info = ("first_name" => "vivek");
my $personal_info_hash_ref = \%personal_info; # notice the backslash

Power of Anonymous References

One more step to understand the power of anonymous references, this time by example, so that you will find it easy to learn. I will take two-dimensional (table) data as the example. This could be a typical database result in a real application.
# Important rule to reiterate for arrays and hashes

# The value of each element in an array or hash must be a scalar

# my %info = ("degrees" => ("B.Tech","PhD") ); # This statement is wrong: the inner list is flattened, so an array cannot be stored directly as a hash value
# solution - an anonymous array solves this problem
my %info = ("degrees" => ["B.Tech","PhD"] ); # now the value is an anonymous array reference rather than a regular array.. very easy huh..

# dereferencing:

# the three different ways to get the "B.Tech" degree out of the %info variable

my $degrees_ref = $info{degrees}; # get the array reference value associated with the key
my $first_degree1 = $degrees_ref->[0]; # 1 - get the specific array value from the reference

my @degrees_array = @$degrees_ref; # Dereference the reference value
my $first_degree2 = $degrees_array[0]; # 2 - a single element is accessed with '$', not '@'

my $first_degree3 = $info{"degrees"}->[0]; # 3 -- the one liner

# How will you know the type of a specific reference? Important when you
# handle someone else's data.

# New concept: the ref() function can be used to find out the type of a reference.
# ref($degrees_ref) gives ARRAY
# common return values are ARRAY, HASH, SCALAR and CODE; an empty string is returned if the value is not a reference
# ref($first_degree3) gives an empty string
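
A short sketch of ref() in use, with illustrative variable names. Note that ref() returns an empty string (a false value) for anything that is not a reference, which is why the '||' fallback below works.

```perl
use strict;
use warnings;

my $array_ref  = ["B.Tech", "PhD"];
my $hash_ref   = { "first_name" => "vivek" };
my $name       = "vivek";
my $scalar_ref = \$name;

# Collect the reference type of each value
my @kinds;
foreach my $item ($array_ref, $hash_ref, $scalar_ref, $name) {
    push @kinds, ref($item) || 'not a reference';
}

print "$_\n" foreach @kinds;
```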


# Coming back to the 2 dimensional data.
# Let's first try to store a simple 2 x 2 matrix as a Perl variable. Let's take the following example.
#   00 01
#   10 11
my @matrix = (("00","01"),("10","11"));
print "Number of elements in the matrix array : " . scalar @matrix . "\n";
# the nested parentheses are flattened, so @matrix is just a flat array with 4 elements, i.e. $matrix[2] is "10".

# References can help to solve this problem

my @first_row = ("00","01");
my @second_row = ("10","11");

my @matrix = (\@first_row, \@second_row);
# to get "01" (first row, second column value)

my $value01 = $matrix[0]->[1]; # $matrix[0] gives the array reference to @first_row, and the '->' operator is then used to get the second value from that array reference.

# similarly to get the "11" element (second row, second column entry) from the matrix
my $value11 = $matrix[1]->[1];

# simplifying things using anonymous references
my @matrix = (["00","01"], ["10","11"]); # notice the outer parentheses.
my $val1 = $matrix[1]->[1]; # to get "11" value

my $matrix_ref = [ ["00","01"], ["10","11"] ]; # notice the outer square brackets
my $val2 = $matrix_ref->[1]->[1]; # to get "11" value, notice the double '->' usage


# to print out the results as 2D table from the $matrix_ref variable

my $nrows = scalar @$matrix_ref; # the size of the dereferenced array is the number of rows

for (my $row = 0; $row < $nrows ; $row ++) {
   my $ncols = scalar @{$matrix_ref->[$row]}; # dereferencing the row data, which is itself an array reference.
   for (my $col = 0; $col < $ncols ; $col ++) {
      print $matrix_ref->[$row]->[$col], "\t";
   }
   print "\n";
}

# the same logic can be used to store matrices of any dimension in Perl.
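
A matrix reference does not have to be written out literally; it can also be filled in a loop. The following is a minimal sketch, assuming a made-up 2 x 3 table whose cell values are simply the row and column indices. It relies on two Perl features: rows are created automatically on first assignment (autovivification), and the arrow between adjacent subscripts is optional.

```perl
use strict;
use warnings;

my $matrix_ref = [];   # start with an empty anonymous array

foreach my $row (0 .. 1) {
    foreach my $col (0 .. 2) {
        # $matrix_ref->[$row][$col] is short for $matrix_ref->[$row]->[$col];
        # the row array is created automatically the first time it is used
        $matrix_ref->[$row][$col] = "$row$col";
    }
}

# Build one tab-separated line per row
my @lines;
foreach my $row_ref (@$matrix_ref) {
    push @lines, join("\t", @$row_ref);
}

print "$_\n" foreach @lines;
```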

Perl data model in practice

Let me use my car data as an example to explain this concept

my $car_data_ref = {
   "model" => "Hyundai Accent",
   "make"  => "2007",
   "color"  => "Platinum",
   "type"  => "sub-compact car",
   "amount_spend_on_services" => {
      2008 => 200,
      2009 => 500, 
      2010  => 300
   },
   "owners" => ["Vivek","Dhivya"],

};

# As you may have noticed, we stored all the details about the car in a complex hash reference variable.
# You need to know how to perform CRUD (Create, Read, Update and Delete) operations on the contents of the data. Let me show how this can be done.

# Create: Add a new attribute called 'gear_type' and assign 'automatic' as scalar value.
$car_data_ref->{'gear_type'} = "automatic"; # set the new attribute for the hash reference

# Read : Find the total amount of money spent on servicing

my $total = 0;
my $amounts_hashref = $car_data_ref->{'amount_spend_on_services'};

foreach my $year (keys %$amounts_hashref) {
    $total += $amounts_hashref->{$year};
}
print "Total amount spent on service : ", $total, " USD\n";

# Update : add 'Sanjana' as owner along with Vivek and Dhivya
push @{$car_data_ref->{'owners'}}, "Sanjana"; # dereference the array and then push the new scalar value to it.

# Delete : remove the 'type' attribute from the car data
delete $car_data_ref->{'type'}; # assigning undef would only clear the value; delete removes the key as well
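
The difference between assigning undef and using delete can be checked with exists. A minimal sketch on a cut-down copy of the car data (the variable names are illustrative):

```perl
use strict;
use warnings;

my $car = {
    "model" => "Hyundai Accent",
    "type"  => "sub-compact car",
    "color" => "Platinum",
};

# Assigning undef clears the value, but the key is still present
$car->{'type'} = undef;
my $type_exists_after_undef = exists $car->{'type'} ? 1 : 0;

# delete removes the key together with its value
delete $car->{'color'};
my $color_exists_after_delete = exists $car->{'color'} ? 1 : 0;

print "type key present after undef   : $type_exists_after_undef\n";
print "color key present after delete : $color_exists_after_delete\n";
print "Remaining keys : ", join(", ", sort keys %$car), "\n";
```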

One of the critical components of any software development is the data model. One has to spend enough time to design a good data model.

Object-Oriented Perl: Blessed references

In the previous section, we learned about references and anonymous references. Using the matrix and car data as examples, I explained how complex data models can be created and manipulated. Now I will go through some of the problems with complex data models built from plain references and explain how these problems can be solved using blessed references, or objects. The $car_data_ref hash reference that we created in the previous example can store the attributes of a car. But it would be very useful if we could associate subroutines with it to modify or extract its details. For example, methods such as get_total_amount_spent(), set_type() or get_owners() that operate on the contents of $car_data_ref would be very valuable when you want to share your data model with others or provide meaningful access to it. The data, along with its properties and all its behaviors (subroutines), is referred to as an object in the object-oriented programming world.
"In the domain of object-oriented programming an object is usually taken to mean a compilation of attributes (object elements) and behaviors (methods or subroutines) encapsulating an entity. In this way, whilst primitive or simple data types are still just single pieces of information, object oriented objects are complicated types that have multiple pieces of information and specific properties (or attributes). Instead of merely being assigned a value, (like int =10), objects have to be "constructed". In the real world, if a gun (let's say a Colt 45) is an "object", its physical properties and its function to shoot would have been individually specified. Once the properties of this Colt 45 "object" had been specified into the form of a class (let's call it 'gun'), it can be endlessly copied to create identical objects that look and function in just the same way. As an alternative example, animal is a superclass of primate and primate is a superclass of human. Individuals such as Joe Bloggs or John Doe would be particular examples or 'objects' of the human class, and consequently possess all the characteristics of the human class (and of the primate and animal superclasses as well)." -- Wikipedia
In Perl, the data attributes of an object are usually represented as an anonymous hash reference. The 'bless' command is then used to link the anonymous hash reference with a specific class. I will explain this concept by extending the example from the previous section.
package Car;

sub new {
    my ($class) = @_;
    my $data = {
      "model" => "Hyundai Accent",
      "make"  => "2007",
      "color"  => "Platinum",
      "type"  => "sub-compact car",
      "amount_spend_on_services" => {
         2008 => 200,
         2009 => 500, 
         2010  => 300
      },
      "owners" => ["Vivek","Dhivya"],
   };

   bless $data, $class;
   return $data;
}
sub get_total_amount_spent {
   my ($self) = @_; # note the first argument is the object data..
   my $total = 0;
   my $amounts_hashref = $self->{'amount_spend_on_services'};

   foreach my $year (keys %$amounts_hashref) {
       $total += $amounts_hashref->{$year};
   }
   return $total;
}

sub set_type {
   my ($self, $type) = @_; # the first argument is the object data and the second argument is the value passed
   $self->{'type'} = $type;
}

# end of Car package
# beginning of the default package..
package main;

my $car_obj = Car->new();

$car_obj->set_type('Compact Car'); # Call the subroutine
my $total_amt_spent = $car_obj->get_total_amount_spent();

print "Total amount spent : $total_amt_spent USD \n";

my $type = $car_obj->{'type'}; ## Note: this works, but as good practice in Object-Oriented Programming (OOP), the attributes should be hidden and exposed only through subroutines.
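
The get_owners() method mentioned earlier is not implemented in the Car package. One way it might look is sketched below. This is an illustration under stated assumptions, not the original design: the constructor here takes the attributes as arguments instead of hard-coding them, and add_owner() is a hypothetical helper.

```perl
use strict;
use warnings;

package Car;

# Constructor that takes attributes as key-value arguments
# (an assumption; it differs from the hard-coded constructor).
sub new {
    my ($class, %args) = @_;
    my $self = {
        "model"  => $args{model},
        "owners" => $args{owners} || [],
    };
    return bless $self, $class;   # bless links the hash reference to the class
}

# Accessor: returns a copy of the owners list, so callers
# cannot change the stored array by accident.
sub get_owners {
    my ($self) = @_;
    return @{ $self->{'owners'} };
}

# Hypothetical helper: adds an owner and returns the new owner count.
sub add_owner {
    my ($self, $owner) = @_;
    push @{ $self->{'owners'} }, $owner;
    return scalar @{ $self->{'owners'} };
}

package main;

my $car_obj = Car->new(model => "Hyundai Accent", owners => ["Vivek", "Dhivya"]);
$car_obj->add_owner("Sanjana");

my @owners = $car_obj->get_owners();
print "Owners : @owners\n";
print "Class  : ", ref($car_obj), "\n";   # a blessed reference reports its class name
```

With accessors like these, code outside the package does not need to know that the owners are kept in an array reference inside a hash.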

Need to explain 'package', bless and subroutine statements - vivek.

PERL - Lesson 1 : Introduction and Hello World

Introduction

I have always believed that everyone can write programs and enjoy building computer applications. With this goal in mind, I am planning to write a few posts in my free time about programming concepts and how I came to understand them. I will use the Perl programming language as the guide to explain these concepts.

PERL - Practical Extraction and Report Language

Positive features:
  • Scripting language - no separate compile step
  • System administration - easier-to-use syntax than shell script
  • Easy to parse text content and extract information using regular expressions
  • Fewer lines of code compared to Java or C++ 
  • Object-oriented concepts - available from version 5.0 onwards

1. Hello world program
The first step in learning any new programming language is writing a script that prints hello world. Here is how it is done in Perl.

a. From a file
create a file hello_world.pl

#!/usr/bin/perl 
use strict;
use warnings;

# print statement
print "hello world!\n";
  1. first line - the shebang line, used to identify the interpreter (perl)
  2. '#' is used for comments
  3. 'use' is used for loading libraries - strict and warnings should be used as good practice so that the perl interpreter reports errors and warnings, e.g. for undeclared variables.
  4. 'print' command prints a string.
  5. every statement should end with a semicolon. 
In the command prompt:
M_BCBB_10017:~ vivek$ perl hello_world.pl 
hello world!

M_BCBB_10017:~ vivek$ chmod 775 hello_world.pl # make the code executable
M_BCBB_10017:~ vivek$ ./hello_world.pl 
hello world!

b.  from command prompt
M_BCBB_10017:~ vivek$ perl -e 'print "hello world!\n";'
hello world!

2. Checking Perl Version
Perl comes preinstalled on most UNIX, Linux and Mac OS X operating systems. You can check the version of Perl installed on your computer by typing 'perl -v' at the command prompt. I used my MacBook Pro laptop with Mac OS X (10.6) and Darwin Kernel (10.4.0). If you are using the Windows operating system, then you have to download and install ActivePerl (http://www.activestate.com/activeperl/downloads) from ActiveState.

command-prompt$ perl -v
This is perl, v5.10.0 built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2007, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.