RESUME



SUMMARY OF EXPERIENCE :
* Worked as a domain consultant, with diversified experience in developing bioinformatics tools and solutions.
* Experienced in researching and analyzing media trends.
* Experienced in chemical and biological information exchange on the WWW.
* Experience in software testing / quality assurance with an understanding of the software life cycle.


EXTRA - CURRICULAR ACTIVITIES / AWARDS :
* First prize winner in the Painting and Quiz categories of competitions organized by the United Nations Organization.
* Active participation in various Social Service Programs.
* Participated in various Inter-Collegiate (National level): Quiz, Painting, Essay Writing, Model Making, Poster Display and Satellite Symposium competitions.
* Held post of college president for two consecutive years.
* Held post of student editorial member for college magazine - MASHAAL.
* Active participant in Polio Vaccination and HIV/AIDS awareness programs run by Govt. of India and Rotary International.
* Volunteer in National Service Scheme for two years and toured more than seven villages to create awareness about AIDS/HIV and voluntary blood donation camps.



MY PROJECT :

THE 5-STUDENT PROJECT : "ASTRA"

PROJECT AIM :
To design a software framework for the globin family and construct a hidden Markov model (HMM) of length L.

PROJECT DESCRIPTION :
HMM in computational biology : Large databases of biological information create both challenging data-mining problems and opportunities, each requiring new ideas. Conventional computer science algorithms have been useful here, but are increasingly unable to address many of the most interesting sequence analysis problems. This is due to the inherent complexity of biological systems, brought about by evolutionary tinkering, and to our lack of a comprehensive theory of life's organization at the molecular level. Machine-learning approaches (e.g. neural networks, hidden Markov models, belief networks), on the other hand, are ideally suited for domains characterized by large amounts of data, "noisy" patterns, and the absence of general theories. The fundamental idea behind these approaches is to learn the theory automatically from the data, through a process of inference, model fitting, or learning from examples. They thus form a viable complement to conventional methods.

Hidden Markov models : Hidden Markov models (HMMs) offer a systematic approach to estimating model parameters. Formally, an HMM is a five-tuple (Omega_X, Omega_O, A, B, pi), where Omega_X is the set of hidden states, Omega_O the set of observable symbols, A the state transition matrix, B the emission matrix, and pi the initial state distribution. HMMs are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif.


When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy.


Hidden Markov models are a powerful statistical machine-learning technique, widely applied to language-related tasks such as speech recognition and text segmentation, and to information extraction tasks such as extracting gene names and locations from trained models.


An HMM is a finite state automaton with stochastic state transitions and symbol emissions. The model contains two sets of states and three sets of probabilities. Hidden states are the states of a system that may be described by a Markov process; observable states are the states of the process that are visible. The pi vector contains the probability of each hidden state being the initial state. The state transition matrix holds the probability of a hidden state given the previous hidden state, and the emission matrix contains the probability of observing a particular observable state given that the model is in a particular hidden state.
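The three sets of probabilities described above can be sketched with plain Python structures. The document specifies no implementation language, so Python is used here for illustration; the two-state model and its parameters are toy values, not the actual ASTRA globin model.

```python
# A minimal sketch of the HMM components: hidden states, observable
# symbols, pi vector, transition matrix A, and emission matrix B.
# All names and numbers are illustrative assumptions.

states = ["match", "insert"]          # hidden states
symbols = ["A", "C", "G"]             # observable symbols

pi = {"match": 0.8, "insert": 0.2}    # initial state probabilities

# transition matrix: A[i][j] = P(next state j | current state i)
A = {"match":  {"match": 0.9, "insert": 0.1},
     "insert": {"match": 0.4, "insert": 0.6}}

# emission matrix: B[i][k] = P(observing symbol k | hidden state i)
B = {"match":  {"A": 0.5, "C": 0.3, "G": 0.2},
     "insert": {"A": 0.1, "C": 0.2, "G": 0.7}}

# sanity check: every probability distribution must sum to 1
for dist in [pi, *A.values(), *B.values()]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Each row of A and B is a conditional distribution, which is why the rows (rather than the columns) must sum to one.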


Once the HMM is constructed, three problems can be solved. First, it can evaluate the probability of the generative process whereby a sequence of symbols is produced by transitioning from state to state until the final state is reached; the forward algorithm calculates this probability. Second, and most importantly, an HMM can find the most probable sequence of hidden states given some observations. The last critical task, which our ASTRA TOOL also requires, is to train the HMM.



A hidden Markov model is a stochastic process which attempts to model a family of sequences and can then provide the probability that other test sequences belong to that family. The first major use of HMMs was in the development and implementation of automatic speech recognition (ASR) systems, which have proved successful (Jelinek, 1997).
Protein sequences are fairly similar to speech signals, as both are represented as a linear sequence of characters. So in 1992 David Haussler and his colleagues at UC Santa Cruz decided to look at how HMMs could be used to model protein sequences, with very promising results. Since then a great deal of research has gone into refining HMMs for protein sequence analysis. To construct a model we need a set of training data from which to estimate the best model parameters.
To use an HMM you have to accomplish three tasks :
1. Evaluation : computing the probability of a model producing a certain sequence of data.
2. Decoding : given an output sequence and a model, computing the most probable state sequence.
3. Training : maximising the likelihood of the model M producing the output sequences in the training set.
This project is only concerned with the first problem. To evaluate an HMM you compute the probability that a model M with parameters θ generated the test sequence of output symbols, P(Y_N | θ, M). To find the probability of a sequence being produced by an HMM, a summation is performed over every possible state sequence that could have produced it, so the computation grows dramatically as the length of the sequence increases. During this phase the observer does not know which state matches which symbol of the test sequence, which is where an HMM gets its name. Various techniques have been developed to shortcut this evaluation. One such method is the Viterbi algorithm, which assumes that most state sequences produce very low scores, so that a good approximation can be found by concentrating only on the most favourable state sequences (Rabiner, 1996). Most modern implementations use the forward algorithm, as it provides an exact solution: it uses two recurrence relations which together simulate calculating every single sequence, one moving forward through the model and one moving backward.
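The evaluation step above can be sketched in Python. This is an illustrative implementation of the forward recurrence, not ASTRA's actual code; the toy two-state model is an assumption, and the brute-force summation over all state paths is included only to show that the recurrence computes the same quantity.

```python
from itertools import product

def forward(pi, A, B, obs):
    """Likelihood P(obs | model) via the forward recurrence:
    alpha_t(j) = B[j][obs_t] * sum_i alpha_{t-1}(i) * A[i][j]."""
    states = list(pi)
    alpha = {i: pi[i] * B[i][obs[0]] for i in states}     # initialization
    for sym in obs[1:]:                                   # induction
        alpha = {j: B[j][sym] * sum(alpha[i] * A[i][j] for i in states)
                 for j in states}
    return sum(alpha.values())                            # termination

# toy two-state model (illustrative parameters, not the ASTRA globin HMM)
pi = {"M": 0.8, "I": 0.2}
A = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.4, "I": 0.6}}
B = {"M": {"A": 0.5, "C": 0.3, "G": 0.2}, "I": {"A": 0.1, "C": 0.2, "G": 0.7}}

obs = ["A", "C", "G"]
p = forward(pi, A, B, obs)

# brute-force summation over every possible state path, for comparison:
# this is the exponential computation the recurrence avoids
brute = 0.0
for path in product(pi, repeat=len(obs)):
    p_path = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p_path *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    brute += p_path
assert abs(p - brute) < 1e-12
```

The recurrence runs in time proportional to (sequence length) × (number of states)², while the brute-force sum grows exponentially with sequence length.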

Implementation :

The first step in building ASTRA is to construct an HMM for this specific purpose (the GLOBIN FAMILY). As the initial approach, we use a fully connected model: each class (or tag) corresponds to one state, and transitions from any state to any other state are allowed. Such a model is computationally expensive, but it is a good starting point for building such a complex system.

In our HMM model, the query sequence is in text format. Our HMM has been trained using the training data set. The initial probability array is obtained as follows :

(I) The first state is read from every sequence in the training data set; we count the occurrences of each state as a first state, then divide by the total count over all states. For the transition probability matrix, a similar trick counts each transition from one state to another; for normalization we also count the total number of transitions from that state to all states.

(II) The probability of a transition from one state to another is its count divided by the total number of transitions from that state to all states.


(III) To get the emission probability matrix, for each character in the sequence we count the number of times that character appears in a given state, then divide by the total counts of that character across all states.
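The counting scheme in steps (I)-(III) can be sketched as follows. This is an assumption-laden illustration: the labelled toy data is invented (ASTRA's real training data is the globin family), and the emission counts are normalized per state, the conventional maximum-likelihood estimate, rather than per character.

```python
from collections import Counter, defaultdict

def estimate_hmm(training):
    """Estimate pi, A, B by counting, given labelled training data:
    a list of (state_sequence, symbol_sequence) pairs."""
    init = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for state_seq, symbol_seq in training:
        init[state_seq[0]] += 1                      # (I) first-state counts
        for prev, curr in zip(state_seq, state_seq[1:]):
            trans[prev][curr] += 1                   # (I)/(II) transition counts
        for s, sym in zip(state_seq, symbol_seq):
            emit[s][sym] += 1                        # (III) emission counts

    # divide each count by the total for its row to obtain probabilities
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    pi = norm(init)
    A = {s: norm(c) for s, c in trans.items()}
    B = {s: norm(c) for s, c in emit.items()}
    return pi, A, B

# toy labelled data (hypothetical, two short sequences over two states)
training = [
    (["M", "M", "I"], ["A", "C", "G"]),
    (["M", "I", "I"], ["A", "G", "G"]),
]
pi, A, B = estimate_hmm(training)
```

With this data every sequence starts in state M, so pi["M"] is 1.0, and two of the three transitions out of M go to I, so A["M"]["I"] is 2/3.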


What algorithms are implemented in ASTRA :

The implementation comprises the following classes and algorithms on Hidden Markov Models (HMMs):

HMM :
The class of hidden Markov models. An HMM object includes state names, a matrix of transition probabilities, emission names, and a matrix of emission probabilities.

Viterbi :
Uses Viterbi decoding to compute the most probable path through an HMM for a given observed sequence.

Forward :
Uses the Forward algorithm on an HMM to compute the likelihood of a given observed sequence, and to build the forward table.
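The Viterbi class described above can be sketched as follows. This is not ASTRA's actual implementation; the toy two-state model and its parameters are assumptions used only to show the decoding recurrence.

```python
def viterbi(pi, A, B, obs):
    """Most probable hidden-state path for `obs` (Viterbi decoding).
    For each state, keep the best score and the path ending there."""
    best = {s: (pi[s] * B[s][obs[0]], [s]) for s in pi}   # initialization
    for sym in obs[1:]:
        nxt = {}
        for j in pi:
            # pick the best predecessor state i for state j at this step
            score, path = max((best[i][0] * A[i][j], best[i][1]) for i in pi)
            nxt[j] = (score * B[j][sym], path + [j])
        best = nxt
    return max(best.values())   # (probability, state path)

# toy two-state model (illustrative parameters, not the ASTRA globin HMM)
pi = {"M": 0.8, "I": 0.2}
A = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.4, "I": 0.6}}
B = {"M": {"A": 0.5, "C": 0.3, "G": 0.2}, "I": {"A": 0.1, "C": 0.2, "G": 0.7}}

prob, path = viterbi(pi, A, B, ["A", "G", "G"])
```

Unlike the forward algorithm, which sums over all state paths, Viterbi maximises over them, so it returns a single best path together with its probability.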





Conclusion :
In this project a system has been developed which identifies protein sequences belonging to a particular family. The system identifies certain families with near-perfect accuracy, but other families have a fairly high error rate.

The HMM is a linear model and is unable to capture higher-order correlations among amino acids in a protein molecule. These correlations include hydrogen bonds between non-adjacent amino acids in a polypeptide chain, hydrogen bonds between amino acids in multiple chains, and disulfide bridges, chemical bonds between cysteine (C) amino acids which are distant from each other within the molecule. In conclusion, the SVM-Fisher method is a very promising method for the classification of protein sequences. This project has shown that the method is feasible and should be investigated further.



CHALLENGES AHEAD - WHAT WE STILL DON'T KNOW?

* Gene number, exact locations, functions, and regulation.

* DNA sequence organization.

* Chromosomal structure and organization.

* Non-coding DNA types, amount, distribution, information content, and functions.

* Coordination of gene expression, protein synthesis and post-translational events.

* Interaction of proteins in complex molecular machines.

* Predicted vs experimentally determined gene function.

* Evolutionary conservation among organisms.

* Protein conservation (structure and function).

* Proteomes (total protein content and function) in organisms.

* Correlation of SNPs with health and disease.

* Disease-susceptibility prediction based on gene sequence variation.

* Genes involved in complex traits and multigene diseases.


RESOURCES:

Here is a list of books on bioinformatics that are quite handy, in addition to online resources.


1. Molecular Modelling: Principles and Applications by Andrew R. Leach (Paperback)
2. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Second Edition by Andreas D. Baxevanis (Editor), et al (Paperback)
3. Introduction to Protein Structure by Carl-Ivar Branden, John Tooze (Paperback)
4. Biological Sequence Analysis : Probabilistic Models of Proteins and Nucleic Acids by Richard Durbin (Editor), et al (Paperback)
5. Computational Molecular Biology by Pavel A. Pevzner (Hardcover)
6. Developing Bioinformatics Computer Skills by Cynthia Gibas, Per Jambeck (Paperback)
7. Computational Molecular Biology: An Introduction by Peter Clote, Rolf Backofen (Paperback)
8. Post-Genome Informatics by Minoru Kanehisa (Paperback)
9. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield
10. Bioinformatics: Sequence and Genome Analysis by David W. Mount
11. Bioinformatics (Adaptive Computation and Machine Learning) by Pierre Baldi, Soren Brunak
12. Proteome Research : New Frontiers in Functional Genomics (Principles and Practice) by M. R. Wilkins (Editor) 1997
13. Bioinformatics: Methods and Protocols by Stephen Misener (Editor), Stephen A. Krawetz (Editor)
14. Genetics Databases by M. J. Bishop (Editor) (Paperback - September 1999)
15. Bioinformatics : Sequence, Structure, and Databanks : A Practical Approach (Practical Approach Series) by Des Higgins (Editor), Willie Taylor (Editor) (Paperback)
16. Guide to Human Genome Computing by M. J. Bishop (Editor) (Hardcover)
17. Statistical Methods in Bioinformatics: An Introduction by Gregory R. Grant, Warren J. Ewens (June 2001)
18. Proteomics: From Protein Sequence to Function by S. Pennington (Editor), M. J. Dunn (Editor)
19. Genomics: The Science and Technology Behind the Human Genome Project by Charles R. Cantor, et al
20. Genomics and Proteomics : Functional and Computational Aspects by Sandor Suhai (Editor) (September 2000)
21. Bioinformatics Basics Applications in Biological Science and Medicine by Hooman H. Rashidi, Lukas K. Buehler
22. Computational Modeling of Genetic and Biochemical Networks by James M. Bower (Editor), Hamid Bolouri (Editor)
23. Bioinformatics : A Biologist's Guide to Biocomputing and the Internet by Stuart M. Brown
24. Neural Networks and Genome Informatics by Cathy H. Wu, Jerry W. McLarty (Hardcover - September 2000)
25. Computational Analysis of Biochemical Systems : A Practical Guide for Biochemists and Molecular Biologists by Eberhard O. Voit (November 2000)
tanwir23@yahoo.com