Dr. P. S. Neelakanta
BME 6762 – BIOINFORMATICS: BIOENGINEERING PERSPECTIVES
Instructor: Dr. P. S. Neelakanta
(EE. 96/Rm517)
Telephone: 561/297-3469
E-Mail: neelakan@fau.edu
Fax: 561/297-2800
Fall 2017
ASSIGNMENTS D & E: REGULAR EXPRESSION & EVOLUTIONARY TREE
BUILDING
TUTORIAL INSTRUCTIONS/NOTES: REGULAR EXPRESSIONS
The conserved sequence motifs in a set of multiple sequences are called consensus (sub)sequences;
they show which residues are conserved and which residues have changed. The terms “patterns” and
“motifs” are used interchangeably to mean “a recurring thing in something” when describing the
consensus segments.
In bioinformatic contexts, the recurring “something” is usually a (sub)sequence, which may be a
sequence of nucleic bases at the genomic level or a proteomic primary sequence of amino acids. The
“thing” can represent a functional region in the genome, such as a promoter site where
transcription factors bind. For proteins, it can be a catalytic site common to several proteins.
These patterns or motifs help to find functions for new sequences or to group them into families or
subfamilies within the multiple sequence set analyzed.
A given set of multiple sequences is represented in a compact form using regular expressions. In
theoretical computer science and formal language theory, a regular expression (abbreviated regex
or regexp, and sometimes called a rational expression) [Forta, Ben (2004). Sams Teach Yourself
Regular Expressions in 10 Minutes. Sams. ISBN 0-672-32566-7] is a sequence of characters that
forms a search pattern, mainly for use in pattern matching with strings, that is, in
“find and replace”-like operations. Each character in a regular expression is understood to be
either a metacharacter with a special meaning or a regular character with its literal meaning.
Together, they can be used to identify text that fits a given pattern, where the match can range
from exact equality to a very general similarity. A typical use of a regular expression in a text
editor is to locate the same word spelled two different ways; for example, the regular expression
seriali[sz]e matches both “serialise” and “serialize”.
In bioinformatic contexts, a set of syntax notations is prescribed for writing regular expressions.
For example, consider a set of multiple sequences with a motif segment (the five letters at
positions 3 to 7 of each line, shown in red in the original), namely:
…AATAGTCGC…
…GGTAGTCTA…
…ATTAGTCGA…
…GCTAGTCGG…
…CTTACTCGG…
the following pattern or regular expression can be prescribed:
T-A-[GC]-T-C
This means that the 3rd letter can be a G or a C.
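As a minimal sketch, the pattern T-A-[GC]-T-C can be tried against the five example sequences in Python. The only assumption is the translation to Python syntax: the '-' separators are dropped, while the bracket notation carries over unchanged.

```python
import re

# Sequences copied from the example above (flanking ellipses dropped).
sequences = ["AATAGTCGC", "GGTAGTCTA", "ATTAGTCGA", "GCTAGTCGG", "CTTACTCGG"]

# T-A-[GC]-T-C in Python regex syntax: separators removed, [GC] kept as-is.
motif = re.compile(r"TA[GC]TC")

# Record the 1-based start position and the matched text for each sequence.
hits = []
for seq in sequences:
    match = motif.search(seq)
    hits.append((match.start() + 1, match.group()))

for seq, (position, text) in zip(sequences, hits):
    print(seq, position, text)
```

Every sequence matches at position 3, and only the third letter of the motif (G or C) varies.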
Bioinformatic contexts – Regular Expressions …
Motifs are not one specific sequence, they can have several variants
Motif databases have commonly been used to:
Classify proteins
Provide functional alignment
Identify structural and evolutionary relationships
Perl (a programming language) has powerful text processing power
It easily manipulates text files
http://expasy.org/tools/scanprosite/
Sites like this have different regular expression ‘symbols’ than Perl, but use the same concepts
One-letter codes for amino acids
Symbol ‘x’ is a wildcard
Alternation is provided by the ‘[]’ brackets
Negated alternation is provided by the ‘{}’ brackets
A ‘-’ is just a separator
X(3) = x-x-x
A(2,4) = A-A or A-A-A or A-A-A-A
Examples:
(a)
[AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
http://en.wikipedia.org/wiki/Amino_acid
Uses standard amino acid abbreviations
(b)
[ACG]-XXAG-V-X(4)-{AEGD}
[Alanine or cysteine or glycine], any, any, alanine, glycine, valine, any, any, any, any,
{not alanine, glutamic acid, glycine, or aspartic acid}
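The notation rules listed above can be sketched as a small converter from this pattern syntax to Python regex syntax. The function name prosite_to_regex is hypothetical, and the sketch assumes that a repeat count such as (4) or (2,4) always follows a single element (one letter, one [] group, or one {} group).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a pattern such as [AC]-x-V-x(4)-{ED} into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):          # '-' is only a separator
        # Split off an optional repeat count: (n) or (n,m).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core.startswith("{"):                # {ED}: anything but E or D
            core = "[^" + core[1:-1] + "]"
        elif not core.startswith("["):          # plain letters; x or X is the wildcard
            core = "".join("." if ch in "xX" else ch for ch in core)
        if low and high:                        # A(2,4) -> A{2,4}
            core += "{" + low + "," + high + "}"
        elif low:                               # x(4) -> .{4}
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))
print(prosite_to_regex("[ACG]-XXAG-V-X(4)-{AEGD}"))
```

The resulting strings can be passed to re.search or re.fullmatch to scan a protein sequence written in one-letter codes.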
References
An article in 2007 (?) uses probabilities, gaps, and local optimization combined with regular
expressions and the results are comparable to those obtained via CLUSTALW
http://www.ncbi.nlm.nih.gov/pubmed/19534754 – Article June 2009, Regular expression
Blasting algorithm
(c)
A[CT]N{A}YR
In this notation, A means that an A is always found in that position; [CT] stands for either C or T;
N stands for any base; and {A} means any base except A. Y represents any pyrimidine, and R
indicates any purine.
In this example, the notation [CT] does not give any indication of the relative frequency of C or T
occurring at that position.
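For nucleotide patterns such as A[CT]N{A}YR, a similar sketch expands the ambiguity letters into explicit base sets. The helper name is hypothetical, and only the three ambiguity codes used in the example (N, Y, R) are included; DNA bases A, C, G, T are assumed.

```python
import re

# Ambiguity codes used in the example: N = any base, Y = pyrimidine, R = purine.
AMBIGUITY = {"N": "[ACGT]", "Y": "[CT]", "R": "[AG]"}

def nucleotide_pattern_to_regex(pattern: str) -> str:
    """Expand a pattern such as A[CT]N{A}YR into a plain Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":                       # [CT]: keep the alternatives as written
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "{":                     # {A}: every base except the listed ones
            j = pattern.index("}", i)
            excluded = pattern[i + 1:j]
            out.append("[" + "".join(b for b in "ACGT" if b not in excluded) + "]")
            i = j + 1
        else:                               # literal base or ambiguity code
            out.append(AMBIGUITY.get(ch, ch))
            i += 1
    return "".join(out)

regex = nucleotide_pattern_to_regex("A[CT]N{A}YR")
print(regex)
```

As the text notes, the expanded character sets carry no information about how often each base occurs at a position.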
Motif bioinformatics…
Consider the N-glycosylation site motif …:
Asn, followed by anything but Pro, followed by either Ser or Thr, followed by anything but
Pro
This pattern may be written as N{P}[ST]{P} where N = Asn, P = Pro, S = Ser, T = Thr;
{X} means any amino acid except X; and [XY] means either X or Y.
The notation [XY] does not give any indication of the probability of X or Y occurring in the pattern.
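A short sketch of scanning a protein for this motif follows. The test sequences are made up for illustration, and the use of a lookahead (so that overlapping sites are all reported) is a design choice of this sketch rather than something prescribed by the notes.

```python
import re

# N{P}[ST]{P} in Python syntax; the lookahead lets overlapping sites be found.
SEQUON = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(protein: str):
    """Return (1-based position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(protein)]

# Hypothetical sequences, chosen to show a hit, two rejected sites, and an overlap.
print(find_n_glycosylation_sites("MNGSANPSTNKTP"))
print(find_n_glycosylation_sites("NNTSA"))
```

In the first sequence the Asn at position 6 is rejected because Pro follows it, and the Asn at position 10 is rejected because Pro occupies the fourth position.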
———————————————————————————————————————-
A tutorial note on C- and N- termini:
Each amino acid has a carboxyl group and an amine group, and amino acids link to one another to
form a chain by a dehydration reaction by joining the amine group of one amino acid to the
carboxyl group of the next. Thus polypeptide chains have an end with an unbound carboxyl group,
the C-terminus, and an end with an amine group, the N-terminus.
When the protein is translated from messenger RNA, it is created from N-terminus to C-terminus.
The amino end of an amino acid (on a charged tRNA) during the elongation stage of translation,
attaches to the carboxyl end of the growing or nascent chain. Since the start codon of the genetic
code codes for the amino acid methionine, most protein sequences start with a methionine (or, in
bacteria, mitochondria and chloroplasts, the modified version N-formylmethionine, fMet).
However, some proteins are modified posttranslationally, for example by cleavage from a protein
precursor, and therefore may have different amino acids at their N-terminus.
In short, proteins are naturally synthesized starting from the N-terminus and ending at the
C-terminus. The C-terminus (also known as the carboxyl-terminus, carboxy-terminus, C-terminal
tail, C-terminal end, or COOH-terminus) is the end of an amino acid chain (protein or
polypeptide), terminated by a free carboxyl group (-COOH). The convention for writing peptide
sequences is to put the C-terminal end on the right and write the sequence from N- to C-terminus.
A tetrapeptide (for example, Val-Gly-Ser-Ala) has an N-terminal α-amino acid (here L-valine,
highlighted in green in the original figure) and a C-terminal α-amino acid (here L-alanine, marked
in blue). This tetrapeptide could be encoded by the mRNA sequence 5′-GUUGGUAGUGCU-3′.
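The mRNA-to-peptide correspondence can be checked with a short sketch that reads the message codon by codon from the 5′ end, producing the peptide from N- to C-terminus. It assumes the standard genetic code, built here from the usual compact UCAG-ordered table; the function names are illustrative.

```python
# Standard genetic code in UCAG order ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def translate(mrna: str) -> str:
    """Translate an mRNA (5' to 3') into one-letter codes, N- to C-terminus."""
    peptide = []
    for start in range(0, len(mrna) - len(mrna) % 3, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":                  # stop codon ends translation
            break
        peptide.append(residue)
    return "".join(peptide)

def to_three_letter(peptide: str) -> str:
    """Write a one-letter peptide in hyphenated three-letter form."""
    return "-".join(THREE_LETTER[r] for r in peptide)

print(to_three_letter(translate("GUUGGUAGUGCU")))
```

The first codon (GUU) gives the N-terminal valine and the last codon (GCU) gives the C-terminal alanine.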
_______________________________________________________________________________
EXAMPLE
Translate the following regular expressions:
[GA]-T-{C, G}(2)-X-[TGC]-G(3)-[TC]
Solution
Glycine or alanine; threonine; any except cysteine or glycine (twice); any; threonine or glycine
or cysteine; glycine (three times); and threonine or cysteine.
———————————————————————————————————————-
Problem E.1
Translate the following regular expressions:
(a)
[TCG]-{A, C}(3)-P-x-[ATG]-x-[VIL]-[IVT]-x-[GS]-G-Y-S-[QL]-A
(b)
[TAG]-XXAG-V-X(4)-{AEGD}-[AC]-x-V-x(4)-{ED}
(c)
Write regular expression to match each string in the C terminus:
V or L, any (two to four times), A, T, any but D or E
———————————————————————————————————————-
Problem E.2
(a)
For the following multiple sequence alignment, construct the regular expression
and expand it in terms of the 3-letter code for amino acids:
T N N N A K A N K
E G G G E E E P E
C P P A C C C C E
V V T V V V T V V
L L I M I I I I M
A A T M C C C A M
R R R R R R R R R
T T T T T T T T T
I I I I I I S T I
(b)
For the following multiple sequence alignment, construct the regular
expression and expand it in terms of the relevant nucleotide bases:
T T C A C G C G C
C G C T G T G G G
C G A A T A C T C
C A A T T G G C G
G A C A A T C G G
C G C C C C T T T
G G G G A G G A A
A A G T C C C T G
T T T T A G G G G
———————————————————————————————————————–
ASSIGNMENT E: EVOLUTIONARY TREE BUILDING
TUTORIAL INSTRUCTIONS/NOTES
Process of Reconstructing a Tree
A class of phylogeny is known as molecular phylogenetics or molecular systematics. It uses the
structure of molecules to gain information on the evolutionary relationships among organisms. The
result of such molecular phylogenetic analysis can be expressed in a phylogenetic tree. In pursuing
molecular phylogeny, the associated methods are classified into distance and character-state
approaches, as outlined below.
As mentioned earlier, the process of building a tree, in essence, refers to reconstructing the
required phylogenetic tree structure using the multiple sequences aligned and prepared. Suppose a
set of N multiple sequences is considered. Then, it can be shown that the corresponding number of
possible (unrooted) trees, M, becomes extremely large when the value of N is large. For example,
M = 1 for N = 3; M = 3 for N = 4; …; M = 2,027,025 for N = 10 and so on. Therefore, unless only a
small number of sequences is considered, the total number of feasible trees grows faster than
exponentially. As such, with the sequence data available, multiple alignment is performed only on
a limited number of sequences (which, however, means that only a suboptimal number of trees is
reconstructed).
Fig. I Phylogenetic tree configurations: (a) Rooted tree, with the root vertex marked; (b) unrooted tree
In all, building the tree of evolution implies assessing the underlying phylogeny. This assessment
can be done by two approaches, namely, (i) the distance-based approach and (ii) the
clustering-based approach. The distance-based approach introduces a “weight” concept to the basic
tree structures, and its underlying concept is as follows: In a rooted tree, the root refers to the
most recent common ancestor of the taxa in the tree, and the path from the root to a leaf-node
signifies the evolutionary path. Such rooted trees are often represented with a root vertex as shown
in Figure I(a), emphasizing that the root corresponds to the ancestral species. In contrast, the
unrooted tree shown in Figure I(b) bears no assumption as regards the position of an
evolutionary ancestor (root) in the tree. That is, no assumption about the origin of species prevails
in such unrooted trees.
The concept of weight in the distance-based methods can be illustrated as in Figure II
where, for example, there are six leaf-nodes (taxa), I, II, III, IV, V and VI, and a
positive weight (or length) is assigned to each edge between two consecutive nodes as shown. This
length may, for example, depict the number of mutations on the evolutionary path. Quantitatively,
the length (d) of the path between any two vertices can be specified as the sum of the weights on
the path between them. In Figure II, for example, the length d between nodes I and V is given by:
d = (13 + 13 + 14 + 18 + 12) = 70. In general, given a weighted tree (T) with n taxa (end-nodes),
the path length d_{i,j}(T) between any two leaves (i, j) can be computed as indicated by the above
example.
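The path-length computation can be sketched as a walk over a weighted tree stored as an adjacency map. The topology below is an assumption made for illustration: only the five weights on the I to V path (13, 13, 14, 18, 12) come from the worked example, while the internal node names u1 to u4 and the weights on the remaining edges are hypothetical.

```python
def build_tree(edges):
    """Build an undirected adjacency map from (node, node, weight) triples."""
    tree = {}
    for a, b, weight in edges:
        tree.setdefault(a, {})[b] = weight
        tree.setdefault(b, {})[a] = weight
    return tree

def path_length(tree, start, goal):
    """Sum the edge weights along the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, weight in tree[node].items():
            if neighbor != parent:          # never walk back along the same edge
                stack.append((neighbor, node, dist + weight))
    raise ValueError("nodes are not connected")

# Hypothetical topology; only the I-u1-u2-u3-u4-V weights follow the text.
EDGES = [
    ("I", "u1", 13), ("VI", "u1", 13), ("u1", "u2", 13),
    ("IV", "u2", 15), ("u2", "u3", 14), ("II", "u3", 10),
    ("u3", "u4", 18), ("III", "u4", 11), ("V", "u4", 12),
]
tree = build_tree(EDGES)
print(path_length(tree, "I", "V"))
```

Calling path_length for every pair of leaves would fill in the full distance matrix for the tree.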
Figure II Phylogenetic tree configuration depicting a weighted unrooted (star) topology with
taxonic vertices I–VI (edge weights shown in the figure: 13, 13, 13, 14, 15, 18, 10, 12).
Now, considering an inverse problem, suppose an [n × n] distance matrix with entries d_{i,j}(T)
for every two leaf-nodes (i, j) is available (as mostly gathered via biological experiments). A
method is then required to search for a tree T that has n leaf-nodes and is consistent with the
data in hand. Whenever the matrix size is small, say [3 × 3], and it is symmetric as well as
non-negative, construction of the tree could be trivial. But for larger matrix sizes, the number of
candidate trees becomes unwieldy. This can be seen from the following relations: Given n
leaf-nodes, the number of rooted trees that can be formed is
R = (2n – 3)!! = (2n – 3)(2n – 5)(2n – 7) … 3 · 1; and the number of unrooted trees is
UR = (2n – 5)!! = (2n – 5)(2n – 7)(2n – 9) … 3 · 1.
Illustrated in Figure III are a couple of simple examples.
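The growth of these counts can be seen with a minimal sketch of the double-factorial relations, taking n as the number of leaf-nodes. The function names are illustrative.

```python
def double_factorial(k: int) -> int:
    """Product k (k - 2) (k - 4) ... down to 1 for odd k; 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def rooted_tree_count(n: int) -> int:
    """R = (2n - 3)!! rooted trees on n leaf-nodes (n >= 2)."""
    if n < 2:
        raise ValueError("need at least two leaf-nodes")
    return double_factorial(2 * n - 3)

def unrooted_tree_count(n: int) -> int:
    """UR = (2n - 5)!! unrooted trees on n leaf-nodes (n >= 3)."""
    if n < 3:
        raise ValueError("need at least three leaf-nodes")
    return double_factorial(2 * n - 5)

for n in (3, 4, 5, 10):
    print(n, rooted_tree_count(n), unrooted_tree_count(n))
```

With three leaf-nodes there is a single unrooted tree but three rooted ones, since the root can be placed on any of the three branches.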
Figure III Realization of R and UR for a given number of nodes (n): (a) a rooted tree and
(b) an unrooted (star topology) tree with n = 3.
Distance-based Approach in Tree-construction
In the distance-based approach, [N. Saitou and M. Nei, The Neighbor-Joining Method: A New
Method for Reconstruction of Phylogenetic Trees, Molecular Biology and Evolution, 4, 1987,
406-425], evolutionary distances are calculated first for all pairs of taxa; then, a phylogenetic tree
is constructed by means of an algorithm, which establishes some functional relationships among
the evaluated distance values. Hence a distance matrix is deduced and presented as a table that
contains the “distances” (or counts on the number of evolutionary events) that separate each pair of
sequences in the data set of aligned multiple sequences.
Popularly, the tree is derived from the distance matrix via the UPGMA method mentioned earlier,
which is regarded as a simple approach to tree reconstruction. It essentially assumes that the rate
of evolution is nearly constant among different evolutionary lineages. That is, the evolutionary
distance is considered proportional to the time elapsed since divergence. Essentially,
distance-matrix methods of phylogeny are non-parametric schemes originally adopted for phenetic
data using a matrix of pairwise distances. The distances so obtained are then used to make a tree;
that is, the phylogram generated has informative branch lengths that carry the information on the
underlying evolutionary process.
Normally as said before, the distance matrix is a compiled result of biological experiments
and expressed via measured values. Such data, for example, is elucidated from morphometric
analysis, pairwise distance formulations (such as, Euclidean distance between discrete
morphological characters), genetic distance calculations pertinent to sequence restriction fragments
and allozyme data on variant forms of an enzyme that are encoded by different alleles at the same
locus.
Raw distance values for phylogenetic character data can be obtained by simple counts
of the number of pairwise differences in character states. (Specifically described as the Manhattan
distance, these raw distance values conform to what is known as taxicab geometry,
proposed by Hermann Minkowski in the 19th century. It refers to a form of geometry wherein the
conventional Euclidean metric is supplanted by a new metric in which the distance
between two points is the sum of the absolute differences of their coordinates. This taxicab
metric, also known as rectilinear distance or Manhattan length, is named after the grid layout of
most streets on the island of Manhattan: the length of the shortest path that a taxicab could take
between two points in the city equals the distance between the points in taxicab geometry.)
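A brief sketch shows why a count of character-state differences is a Manhattan distance. It assumes the characters are coded as 0/1 (absent/present); the two taxa below are made-up examples.

```python
def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

def character_differences(states_a, states_b):
    """Count the characters whose states differ between two taxa."""
    if len(states_a) != len(states_b):
        raise ValueError("taxa must be scored for the same characters")
    return sum(1 for a, b in zip(states_a, states_b) if a != b)

# Hypothetical taxa scored for five binary characters.
taxon_a = [1, 0, 1, 1, 0]
taxon_b = [1, 1, 0, 1, 0]

print(manhattan_distance((0, 0), (3, 4)))       # grid route: 3 blocks plus 4 blocks
print(character_differences(taxon_a, taxon_b))
print(manhattan_distance(taxon_a, taxon_b))     # same value for 0/1 coding
```

For multistate characters coded with arbitrary labels, the difference count is still well defined, whereas the Manhattan distance depends on the numeric coding chosen.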
Inasmuch as the distance-matrix approach requires “genetic distance” evaluation between
the sequences being classified, it needs a multiple sequence alignment (MSA), described in earlier
chapters, as an input. This genetic distance is often defined as the fraction of mismatches at
aligned positions, with gaps either ignored or counted as mismatches [D. M. Mount, Bioinformatics:
Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, Cold Spring Harbor,
NY, 2004].
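The fraction-of-mismatches definition can be sketched as follows for two rows of an alignment. The function name, the gaps parameter, and the choice to skip columns where both sequences have a gap are assumptions of this sketch; '-' is taken as the gap symbol.

```python
def p_distance(seq_a: str, seq_b: str, gaps: str = "ignore") -> float:
    """Fraction of mismatches at aligned positions.

    gaps="ignore"   : columns containing a gap are skipped.
    gaps="mismatch" : a gap opposite a residue counts as a mismatch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if gaps not in ("ignore", "mismatch"):
        raise ValueError("gaps must be 'ignore' or 'mismatch'")
    compared = mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            if gaps == "ignore" or a == b:  # skip gap columns (or gap-gap columns)
                continue
            compared += 1
            mismatches += 1
            continue
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0

print(p_distance("ACGTAC", "ACCTAG"))
print(p_distance("ACGT-A", "AC-TTA"))
print(p_distance("ACGT-A", "AC-TTA", gaps="mismatch"))
```

Applying the function to every pair of rows in an MSA gives the all-to-all distance matrix used by the distance-based methods.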
Further, distance methods imply constructing an all-to-all matrix from the sequence query
set, giving the distance between each sequence pair. The phylogenetic tree reconstructed from this
distance matrix places closely related sequences under the same interior node, and its branch
lengths closely reproduce the observed distances between sequences. Further,
the tree reconstructed can be either rooted or unrooted, depending on the type of
algorithm adopted. Distance methods lay the foundation for progressive and iterative types of MSA.
But such methods do not efficiently use the information about any local high-variation regions
that may appear across multiple sub-trees [J. Felsenstein, Inferring Phylogenies, Sinauer
Associates, Sunderland, MA, 2004].
The genetic distance concept is adopted as a data clustering strategy in a method known as
the neighbor-joining (NJ) approach, which enables reconstruction of unrooted trees. Neighbor-joining
does not assume a constant rate of evolution across lineages.
As such, the time of evolutionary divergence cannot be found directly from the mutations incurred.
Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees. That
is, the topology that gives the least total branch-length is preferred at each step of the algorithm.
However, neighbor-joining may not find the “true” tree topology with the least total branch
length because it is, in essence, a greedy algorithm that constructs the tree in a step-wise fashion.
Despite being conceptually a sub-optimal solution, the NJ algorithm has been extensively tested
and the tree derived is usually fairly close to the optimal tree. (Nevertheless, it has been largely
superseded in phylogenetics by methods that do not rely on distance measures and offer superior
accuracy under most conditions.)
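One greedy step of neighbor-joining can be sketched as below. The notes do not spell out the selection formula, so the sketch uses the standard criterion Q(i, j) = (n – 2) d(i, j) – r(i) – r(j), where r is the row sum of the distance matrix; the pair with the smallest Q is joined. The five-taxon matrix is an illustrative example, not data from this assignment.

```python
def neighbor_joining_step(names, dist):
    """Pick the pair to join and compute its branch lengths (one NJ step)."""
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = [sum(row) for row in dist]             # row sums r(i)
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from the joined taxa to their new interior node.
    limb_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    limb_j = dist[i][j] - limb_i
    # Distances from the new interior node to every remaining taxon.
    to_new = {
        names[k]: 0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
        for k in range(n) if k not in (i, j)
    }
    return names[i], names[j], q, limb_i, limb_j, to_new

# Illustrative symmetric distance matrix for taxa a to e.
NAMES = ["a", "b", "c", "d", "e"]
DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining_step(NAMES, DIST))
```

Repeating the step on the reduced matrix (the joined pair replaced by the new interior node) until three nodes remain yields the unrooted tree. The unequal branch lengths to a and b reflect that no constant rate of evolution is assumed.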
The main virtue of neighbor-joining relative to other methods is its computational
efficiency. That is, neighbor-joining is a polynomial-time algorithm. It can be used on very large
data sets for which, other means of phylogenetic analysis (for example, minimum evolution,
maximum parsimony, maximum likelihood) are computationally intense.
Unlike the UPGMA algorithm for phylogenetic tree reconstruction, neighbor-joining does
not assume that all lineages evolve at the same rate (molecular clock hypothesis) and produces an
unrooted tree. Rooted trees can be created by using an outgroup and the root can then effectively
be placed on the point in the tree where the edge from the outgroup connects.
Furthermore, neighbor-joining is statistically consistent under many models of evolution.
Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high
probability.
Thus, the distance-based methods of tree construction are based on certain distance
measures, such as the number of nucleotide or amino-acid substitutions. The UPGMA, the
transformed-distance method, and the neighbors-relation method are the generic versions of this
approach. Each method is outlined below:
A. UPGMA method
This refers to a tree construction strategy developed originally for taxonomic phenograms, that is,
trees that reflect the phenotypic similarities between operational taxonomic units (OTUs).
The underlying consideration is based on the relationship between organisms viewed in terms of the
similarity that exists between them (instead of probing their genealogy). That is, similar organisms
are grouped (clustered) together (per the old adage that birds of a feather flock together).
The method of such grouping exerci …