Biological Sequence Analysis I – Andy Baxevanis (2012)


Andy Baxevanis:
Before we get into this week’s lecture, for those of you who are joining us for the first
time, if you didn’t pick up a copy of the syllabus last week, they’re up in the corners
by the door, so please pick one up on your way in. Also on the syllabus is the URL for
the course website, which, once again, is the place where you can find information about
the course. Most importantly, for those of you who may miss a lecture, there — this
is the place where we post all of the YouTube versions of each one of the live lectures.
Our goal is to have the lectures up the Thursday following the live lecture so that if you
miss a lecture one week, you at least have some time to watch last week’s lecture before
this week’s lecture. Just by way of reminder, if you are interested in CNME credit, please
be sure to sign the sign-in sheets in the back of the room. My disclosure for the week
is right there. Okay. So, in the same way that Dr. Green last week
laid the groundwork for upcoming lectures in the course by providing you a broad overview
of the field of human genomics last week, my job, over the two weeks that I’m with you,
is to give you a solid foundation for the upcoming lectures devoted to bioinformatics.
I mentioned last week that genomics and bioinformatics really do go hand in hand with one another,
and Eric reinforced that sentiment when he showed you this figure from NHGRI’s current
vision document that was published in Nature last February, showing the spectrum of genomics-based
research, going from understanding human genomes, just the structure and the actual sequence,
going through the biology of what’s encoded in those genomes, how that relates to our
understanding of human disease, advancing that into medical applications to finally,
hopefully, improve the effectiveness of health care. And our ability to work across this entire
spectrum really rests, in large part, on our ability to use computational approaches to
generate the kinds of — and analyze the kinds of data that are generated in whole genome
sequences studies and genome-wide association studies, clinical sequencing efforts, pharmacogenomic
studies, the whole range of genomic science that you’re going to hear about throughout
this course. So given this inextricable relationship between genomics and bioinformatics, my job,
again, during the two weeks I’m going to be lecturing, is to give you as gentle of an
introduction as I can to the major tools that are used in bioinformatics. You’re going to
see these tools and the concepts that underly these tools over and over and over again in
the next 11 weeks. So it’s really important to have a basic understanding
of the theory that underline these techniques and to start to take them away from the realm
of being the black box. It’s very easy to go to a website, you see a box, you stick
your sequence in it, you hit a button, you get some results and you just assume that
those results are correct. And many times — in fact, in the vast majority of those
times, those results maybe are not biologically significant. So that’s one of the major themes
that we’re going to hit through all of these bioinformatics lectures. So we have a lot
of ground to cover today. We’re probably going to go right up against the 11:00 mark, so
— but as we start, I hope that you find all this background information useful. And for
those of you who are actively doing these kinds of studies or using these kinds of tools,
at whatever level, you’ll hopefully find something that’s of practical use in your own studies. So what we’re going to do today is we’re going
to start off defining some terms. We’re going to talk about the importance of alignments
as one of the most powerful approaches that we have in the study of sequence data. We’re
then going to talk about how we actually evaluate how good alignments are and whether we can
infer biological conclusions based on those alignments, and then we’re going to spend
the vast majority of our time talking about two big tools that are used in biological
discovery, sequence-based discovery, BLAST and BLAT. And again, we’re going to use this
an a foundation for my next lecture with you, two weeks from now, that will now build upon
these approaches. Okay, so why do sequence alignments — that’s
what we’re going to be talking about for the next 90 minutes — and what these sequence
alignments allow you to do is figure out some measure of relatedness between two protein
sequences or two nucleotide sequences. And when you go through the effort of trying to
align two sequences with one another, you can start to make some interesting biological
inferences, whether they have to do with whether two sequences have a structural relationship
to one another, whether their functions are similar, or whether they have a certain evolutionary
relationship with one another. As we start to talk about these things, it’s important
to use the right word. So we have to cover a little bit in the definition department.
When we align two sequences to one another, so we have sequence A and sequence B, the
metric that we can use to assess — the easiest metric we have to use to assess how good that
alignment is, is to simply count how many positions in that alignment match between
sequence A and sequence B. So that observation, that observable is similarity. So it’s — the term similarity is usually
expressed as a percent identity. You can use that to quantify changes that occur as two
sequences diverge from one another by looking at substitutions at individual positions,
insertions or deletions, and you can also use them to identify residues that are crucial
for maintaining a particular protein’s structure or function. And this is very important when
you do a multiple sequence alignment — we’ll talk about how to do those in two weeks’ time
— because you can really pick out, by looking at these multiple sequence alignments, what’s
important from a structural standpoint, what’s important from a functional standpoint. What
residues do you have to preserve to have the right alpha helices or beta strands, or to
have the right binding pockets or so on. Now when we see high degrees of similarity
between sequence A and sequence B, it might imply to you either some sort of common evolutionary
history, that they have a common ancestor, or some possible commonality of biological
function. So you may have an unknown, a new protein or a new gene that you’ve sequenced,
you have no idea what it actually does in the cell, but if you find that it is similar,
by virtue of one of these sequence similarity methods, to something where the function is
known or the structure is known, you can infer a relatedness; you can start to assign function
to that newly identified sequence that you have. Okay, so this term is pretty easy to understand.
The second one, though, is often misused, and that is the term homology, okay? So genes
are or are not homologous. This is never measured in degrees so you cannot say that something
is 15 percent homologous or 50 percent homologous or 100 percent homologous with something else.
So the similarity takes the number, the percentage; the homology part implies the evolutionary
relationship. So to put a little bit of a finer point on this, a quote from Walter Fitch:
“It’s worth repeating here that homology, like pregnancy, is indivisible.” You either
are homologous — pregnant — or you are not. Okay? And it goes on to say that if 80 percent
of the character states, in this case, the aligned residues, are identical, one should
speak of 80 percent identity and not 80 percent homology. So again, similarity is the metric,
homology is the conclusion. Okay? Is that clear? Good? All right. So just to make things a little bit more complicated,
there are two types of homology, so let’s define both of those. So the term may apply
either to two genes that have been separated by an event of speciation — those are called
orthologs — or between genes that have been separated by a genetic duplication event and
these are called paralogs. So a little bit more detail. When we talk about orthologs,
these are sequences that have come from a common ancestor. And because they’ve come
from a common ancestor, they most likely have similar domain structure, similar three-dimensional
structure, and share a common biological function. Okay? When we talk about paralogs, they’re
related, again, through a gene duplication event. And these are kind of interesting from
an evolutionary standpoint because this gives you some insight as to how mother nature has
tinkered round with the armamentarium of genes that are available, to start to use genes
in different contexts. So the whole idea of evolutionary innovation, taking a particular
gene that’s already being used for one function, changing it a little bit and co-opting it
into either another pathway or into another function. Okay, I think this makes more sense when we
look at a diagram. So let’s say we have this gene, alpha, here, and we’re going to call
this the most recent common ancestor of three genes that are up here: gene 1, gene 2, and
gene 3. The relationship between gene 1 and 2, 2 and 3, or 1 and 3 is that all of those
are orthologous to one another. Okay? Because they rose from this common ancestor. All right,
now let’s say a little further back from A, we’ve have a gene duplication event that gave
rise to gene alpha and to gene beta. So perhaps alpha and beta are exact copies of each other,
perhaps they’re slightly different, but from that point on, they’ve had independent lineages
arising from them, okay? So beta, in this case, gave rise to 4, 5, and 6, so these genes,
again, because they came from a common ancestor, 4, 5, and 6 are all orthologous to one another.
But, when we talk about the relationship of any of 1, 2, or 3 to any of 4, 5, or 6, so
any of the ones in this box to any of the ones in this box, these are all paralogous
to one another because they’re related to each way back here because of this gene duplication
event. Okay? Does that make sense? Okay. So this takes a little while to learn the
terms and when I was in graduate school, I had to post this on the board in front of
my desk to finally learn them, but it’s important to know what these terms are, because again,
these do imply different kinds of relationships and unfortunately they’re not always used
the right way. Okay. So, we’re going to put the definitions behind us and now start to
talk about the alignments themselves and how we look at the quality of those alignments.
Okay. So I want you to imagine, again, two sequences,
sequence A and sequence B, and we’re going to try to align those sequences. And if we
try to align those sequences across their entire length, from the N terminal end to
the C terminal end of A and B, or, if we’re looking at nucleotide sequences, from the
5′-end to the 3′-ends of both of those, that’s called a global sequence alignment, because
you’re trying to force an alignment along the entire length of both of those two sequences.
Now this works incredibly well when you have two sequences of approximately the same length
and that are highly similar to one another, because aligning those things over such a
long stretch when they’re very similar, it’s an easy proposition. But as you start to see
the degree of sequence similarity going down, doing these global alignments gets harder
and harder to do. And when you try to force these global alignments amongst more dissimilar
sequences, you start to miss important biological relationships. You may miss the alignment
of an important protein domain that’s responsible for either structure or function or some other
similar feature. So from a biological point of view, a slightly different approach is
better. So rather than doing global sequence alignments,
we do local sequence alignments. So here, we still start with sequence A and sequence
B and try to align them to one another, but we don’t try to force the alignment over the
entire length of both of those sequences. Instead, we look for regions that align well.
So the goal here is to find what, in the parlance, are called paired subsequences, just a glorified
term for short regions that align well with one another. Anything outside the area of
those local alignments are excluded, and you can imagine because we’re looking for these
local alignment areas that we could have more than one good alignment for two sequences
that we’re comparing to one another. So really, this is, by far, much better for sequences
that share some similarity or when you have sequences of different lengths. And in the
business that we’re all in, in trying to figure out what various genes do, how they’re related
to each other, trying to identify members of protein families, this is what we’re trying
to accomplish here. So not necessarily look for global alignments, but those key regions
that are common from protein to protein to protein or from gene to gene to gene. So this
is really the approach of choice for biological discovery. Okay. So we’re going to try to find these local
regions, we’re going to align the letters in sequence A and sequence B the best that
we can, but at some point we have to have some sort of qualitative measure to tell us
how good the alignment is. So we talked about percent similarity before; that’s one metric.
But there are other metrics that are available to us as well. And in order to calculate these
metrics, we use something called scoring matrices. So these are empirical weighting schemes that
look at the physicochemical properties and the biological characteristics of nucleotides
and of amino acids. So specifically, side chains, their chemistries, their function,
their structure. So when we look at these scoring matrices, what’s being taken into
account are factors like, you know, looking at the fact that cystines and prolines are
always important for a structure in function. You need those cystines in order to make disulphide
cross-bridges. Prolines often break alpha helices, so you’ll see either a proline or
a proline-proline at the end of an alpha helix. Things like a tryptophan; big, bulky side
chain can only be accommodated in certain places in proteins. Things like lysines and
arginines, they all have — they both have a plus charge on them. Things that you would
want to conserve when you’re comparing two proteins to one another. Okay. So the game here, what we’re trying to take
into account and trying to capture through these scoring matrices, involve conservation.
So which residues can substitute for one another and not affect the function of the protein?
Okay. So things like isoleucines and valines can substitute freely for one another because
they’re both small, they’re both hydrophobic. When we look are serines and threonines, again,
we often see those substituting for one another because they’re both polar. So we want to
try to conserve, in this game, the charge, the size, the hydrophobicity, and other physicochemical
factors. The other thing we take into account is the frequency of these individual amino
acids, asking the question, “How often do those amino acids occur in — amongst the
entire constellation of proteins?” Things like glycines, all over the place. Things
like tryptophans, the most rare of the amino acids. So we want to account for the frequency
of those amino acids as we start to score these alignments. Okay. So why do you care about all this? Why is
this actually important to you? When you do a BLAST search or a BLAT search or any sort
of analysis that involves comparing one sequence to another sequence or one sequence to a set
of sequences, these scoring matrices are all being used in the background. So again, we’re
talking about the black box here. So when you do those BLAST searches, these scoring
matrices are being used. They’re always part of a sequence analysis. The selection of them
is important because each one of them — there’s not just one, there’s a whole slew of these
scoring matrices — and each one of them implicitly represents a particular evolutionary pattern.
So the choice of the matrix really can strongly influence the outcome of your analysis, the
results you get back when you do that BLAST search. Okay? And for a lot of things that
you might want to do, using the default values on NCBI’s BLAST website may not be the right
choice. So we’re going to spend a good amount of time talking about these individual scoring
matrices, and more importantly, some guidelines about which ones to choose. Okay, how are
we doing so far? Good? All right. So let’s do the easy part first. If we compare
two nucleotide sequences to each other, we have here just a simple match/mismatch scheme.
So we’ve aligned nucleotide sequence A with nucleotide sequence B and we’re just going
to look for exact matches. Every time we see an exact match, we give that particular position
two points. Every time we see a mismatch, we subtract three points. So you want to go,
obviously, in favor of the matches. So for example, if we look at this, this is the actual
scoring matrix. If we have a particular position where a C has aligned with a C, you just look
for where these intersect and there’s the two points for the match. And you’ll notice
on the diagonal, you will always see the value, two. And then off the diagonal, any place
there’s a mismatch, you see the minus three. So very simple, okay? Now this assumes that
each of the nucleotides occurs 25 percent of the time, so here’s where the frequency
business comes into play. In the nucleotide world, this is easy, the side chains are very
chemically similar to one another, you can have a very simple scoring scheme. But it
looks a little bit different when you get to the protein side of the world. Okay. So this unreadable slide, I know, is what
is called the BLOSUM62 matrix. This is the default matrix when you do a BLAST search
on the NCBI website. So how do things work here? So as before, instead of having the
nucleotides across the top and down the side, we’ve got each one of the amino acids across
the top and down the side. The exact matches are represented on the diagonal — and it’s
probably easier to look at the screen than look at your handouts right now — and you’ll
see that the values on the diagonal, unlike the nucleotide case, are not all the same.
So when things are infrequent, they get higher scores. And in fact, if we look at where — if
we have a particular position in our amino acid alignment, in our protein alignment,
where our tryptophan has aligned with a tryptophan and we see where that intersects, that particular
position would be scored with 11 points. It’s the highest number on the diagonal because
the tryptophan is the least common of the amino acids when you look at this entire constellation
of proteins. Okay. Now, of course, we don’t want to just look
for exact matches. We want to look for conservative substitutions as well because we might learn
something from proteins that are somewhat similar to the one that we’re looking for.
We don’t want to just look for the ones that are exactly the same. So starting with this
tryptophan again, let’s say we have a particular position where instead of having the tryptophan,
the tryptophan has been substituted by a tyrosine. So we’ve got the W up here, we’ve got the
Y over here for tyrosine, and if we now see where that intersects, two points, okay? Still
a positive value to represent that it’s a conservative substitution, that the tyrosine
can substitute for the tryptophan, but not as high as if you would have had the exact
match. Okay? Finally, let’s say you just totally mess things
up, you’ve got a position where two things that should not be aligned are now aligned
with each other. So let’s say you’ve got a cystine that has aligned with a glutamine,
in this particular case, minus three. You’re subtracting points because those two should
not be aligned with one another. They don’t have the kind of relationship that we’re looking
for. They can’t serve in the same role, functionally or structurally. So, simply put, you get the most points for
an exact match. We score higher for rare residues than we do for common residues. Okay? If you
have a conservative substitution, you still earn points, just not as many as if you would
have had an exact match. And finally, any time you have a mismatch, you subtract points.
Does that makes sense? Good. All right. Okay, so this is — this is what the basic
structure of these matrices look like. Now I’ve already alluded to the fact that there
are more than one of these and this is one of a family of matrices called the BLOSUM
matrices. These were developed by Steve and Jorja Henikoff at the Hutch, back in 1992.
BLOSUM stands for blocks substitution matrix. And what they did was they took all of the
proteins that were known at the time, did multiple sequence alignments on those protein
families to look for the natural patterns of what could substitute for what, and which
amino acids were more frequent than other amino acids. So this is based on empirical
evidence from the sequence databases, so they’re directly calculated. Because they come from
real data, they are very sensitive to detecting structural or functional substitution. So
really these matrices have become the benchmark for whatever we want to assess the quality
of any pair-wise sequence alignment. Many years previous, there was another series
— still is — called PAM, point associated mutations, that was developed by Margaret
Dayhoff back in the early ’60s. She was one of the founders of the field of bioinformatics.
But as you can imagine, in the 1960s, we didn’t really have very many sequences of any type.
Most of what we had were globular proteins; they were very similar to one another. You’ll
often see this PAM offered to you as a choice in phylogenetic programs, but just because
of the point in time where they were made and the kinds of proteins they were based
upon, they’re really not the best choice for most analyses. So whenever you have a choice
between the PAM matrices and the BLOSUM matrices, pick the BLOSUM matrices. And there’ll be
some more information on that in your readings later. Since I hear a phone going off, if
you haven’t already, could you please put your phones on stun and that would be great.
Thank you. Okay, so how do you now select which one of
these BLOSUM matrices to use? When we say BLOSUM, it’s always followed by a number.
The example that I showed you before was BLOSUM62. What that “n” represents is based on how it
was calculated. So the N just says, all right, it was calculated from sequences sharing no
more than N percent identity. So in the case of a BLOSUM62, the sequences that were used
were no more than 62 percent similar to any of the other sequences in this set that were
used to calculate the matrix. I’ll show that example in a minute. More importantly, what
this does is it helps to remove bias from the set. So the contribution of sequences
that are higher than the number are clustered and weighted to one. Okay? When we do whole
genome sequencing, the whole history of how proteins have been sequenced and genes were
found in individual laboratories — these are not uniformly distributed amongst proteins.
They’re not uniformly distributed amongst tact, so we have overrepresentation of certain
sequences in the public sequence databases. So we need to normalize all of that out, remove
the bias from the dataset and take those closely related sequences out so that we don’t miss
something that might actually be biologically relevant if we hadn’t done that. Okay. So to hopefully make this make a little bit
more sense, here is a multiple sequence alignment. There’s six sequences in the set. You’ll notice
that there are some absolutely conserved positions, the ones that are marked with the star and
that are in the lighter blue. And if I were to now say, “Okay, I want you to cluster this
alignment into blocks where we’re going to apply an 80 percent cut-off.” So I want you
to just group this together, all of the sequences that are 80 percent or more similar to another
sequence in the set. And when you do that — all right, so this is not similar to any
of the others at this percent relationship. These three are 80 percent or more similar
to one another. And finally, these two are 80 percent or more similar to one another.
Once this is done, this is counted as one sequence, these are counted as one sequence,
and these are counted as one sequence. So we’ve started with a set of six, we’ve collapsed
it to a set of three. That’s now going to be used as the basis for calculating the matrix,
okay? So again, this is how the bias is removed from the set. Okay? Hopefully that made sense.
Okay. All right, so, implications. Okay. Again,
when we do this clustering, reduces the contribution of closely related sequences, we have less
bias towards substitutions that are in really closely related members of a family. At the
end of the day, the take home message, reducing N yields more distantly related sequences.
So your first cheat sheet of the day, put a big star next to it, circle the slide. Okay. Which one of these do you actually choose,
after all of that? You know how they’re constructed now, you know what the — what the underlying
implications are. Which ones do you pick? Okay. The one in yellow is the default again,
and this was determined in studies where the answers were known in advance, going backwards,
to try to figure out which ones should be the default on the NCBI website. That again
is BLOSUM62: it’s the most effective one for finding all potential similarities between
two sequences and generally works best when you’re in this kind of percent similarity
range, in that 30 to 40 percent range. As you go further up, you do — you’ll do better
finding more closely related sequences, things that are almost exactly the same. So by the
time you get up to BLOSUM90, you’re going to get shorter alignments but they’re going
to be highly similar in the 70 to 90 percent similarity range. Going the other way, when
you go down, you start to get longer and weaker local alignments, things that might be less
than 30 percent similar to one another, but this is really where a lot of the biological
action is. Okay? You can look by eye and see two sequences are similar to one another once
you have them aligned. Once you get down to the bottom is where you start making these
interesting discoveries, finding out that two proteins are related to one another, even
though there may be very little sequence similarity between the two of them. A great example of
this, homeodomain proteins. I know everybody’s familiar with the homeobox genes. When you
look at the protein complement, you’re looking at a 60 amino acid stretch, six or seven of
those positions are absolutely conserved over evolutionary time, and the rest of it, it’s
kind of willy-nilly. So, but if you were to just do a routine search and possibly pick
the wrong matrix, you’d never find all the members of the family. Okay? So this is why
the choice of these things is important. So the strategy that I advocate here is one
that was originally put forth by Steve Altschul, at NCBI, the guy who was behind developing
BLAST back in 1990. And instead of doing one BLAST search, I want you to do three BLAST
searches, okay? Use the default, pick one of the more — higher number one’s to give
you — to look for closer similarities, but also pick one of the lower ones to look for
the more distant similarities. So when you go back to lab, take your sequence of interest,
run it using three different matrices. The vast majority of the results will be the same,
admittedly, but you will start to see subtle differences that you wouldn’t have found if
you just did one search. So I think it’s a good approach to start to find those relationships
that you wouldn’t normally otherwise find. Of course, if you know what you’re doing in
advance, if you’re looking for things that are really close or really distance, you can
just pick one and go from there. But if you’re just really trying to cover the field, blanket
all your possibilities, take the default, take a higher one, take a lower one, as well.
Okay? So really, what I’m trying to get across here
is no single matrix is the complete answer for all sequence comparisons. You need to
make this intelligent choice and not again use that NCBI website as a black box. It’s
a great resource, but there’s a lot of things you can change along the way to help refine
what you get. Okay? Any questions? Okay, great. All right, so for those of you who would like
a little bit more background, there is a unit in Current Protocols in Bioinformatics, Unit
3.5, that talks more about the PAM matrices that I really went through very quickly, a
lot more on the BLOSUM matrices, and some more information about specialized scoring
matrices. So we’re looking at, in this case, things like transmembrane proteins places,
where the usual rules don’t apply because of where they are found in the cell, because
they’re embedded in membranes and so on. So the frequencies and substitution patterns
are often very different because of those either compartmentalized or embedded proteins.
Okay. All right. So one last thing we need to talk about before
we actually get into BLAST, and those are gaps. So again, I want you to imagine sequence
A and sequence B. You’ve just aligned them and there might be some place in that alignment
where it would be fortuitous to improve the alignment to insert a gap. Okay? Now I want
you to remember that when you put those gaps in, it’s not just for sake of making a nicer
alignment, it’s not a word processing exercise, okay? Each time you put a gap in, it actually
represents a biological event. So every gap represents either an insertion or a deletion.
Okay? So because of that, you can’t just put them in willy-nilly again — I’ve used willy-nilly
twice now today — okay, you can’t just put them any place you want. You have to keep
in mind that they represent biological events and keep them to some sort of plausible number.
A good rule of thumb is about one in every 20 residues, but the fewer number of gaps,
the better. Now if you think about your position by position alignment again and you think
about your scoring matrices — so you’ve got an amino acid lined up with another amino
acid. You can go to your BLOSUM matrices, see where the values intersect and assess
a number for that particular position. But when you have a gap, okay, you have a residue
and then nothing. Okay? We didn’t see gaps in the scoring matrices before, so we can’t
simply score those as a match or a mismatch, as we did before. So instead, there are gap
penalties that are assigned. With apologies, an equation. And basically
all this equation is saying is that any time you insert a gap, you take a penalty, and
you also take an additional penalty for the length of the gap. So one for the gap and
one for the length of the gap. So the equation is just G plus L times n. So G is the penalty
for opening the gap. You see the values here for both nucleotide and protein sequence comparisons.
L is the gap extension penalty, this value right here, so it’s two for nucleotides, one
for proteins. N is just the length of the gap. So let’s say I have a protein alignment.
It is of length 2, okay? So to assess how many points I’m going to subtract for that,
okay, so G for a protein is 11, L times n. So here’s L, the gap extension penalty is
1, the length is 2. So 1 times 2 is 2, 11 plus 2 makes 13. So you subtract 13 points
for the gap, okay? If you’ll remember, the tryptophan to tryptophan was 11, so you get
plus 11. So to give you some measure of how these compare to exact matches — so when
you start opening big gaps, you start plummeting as far as the quality of your alignment goes
and the scores of that alignment actually go. Okay? All right. So with that, it’s time to talk about BLAST.
Okay. So how many of you have used it before? How many of you have used BLAST? The vast
majority of people in the room. Great. How many of you actually know how it works? Okay,
a few. [laughs] And from the people who are left, how many people have actually ever changed
the default values on the NCBI website? One, two, a few. Okay, good. All right, so — maybe
not. So all will now be revealed. We’re going to go through this in gross detail and hopefully
you’ll all be your lab’s experts at this at the end of the day, and again, use this tool
to its — to much better advantage. Okay. All right, so BLAST is the basic local alignment
search tool, B-L-A-S-T. By far, the work horse in bioinformatics, the most important tool
that we use in our field, because the whole act of doing these sequence alignments, whether
it’s one by one or one to a database, really help us discover these previously unknown
relationships. So we’re going to, again, hit this pretty hard today to make sure you know
how to use it properly. Part of the reason for that is not just making sure you know
how to use it right, but there are many, unfortunately, cases in the literature where biological conclusions
have been made that are incorrect based on people’s either misinterpretation of BLAST
results, they didn’t either know how to use the program or they didn’t know how to interpret
the results. So — and that’s a pitfall that I don’t want you all to fall into. So one
of the things to keep in the back of our heads as we go through this or any of the techniques
is that just because a result appears on a screen or just because it’s printed out on
a piece of paper does not automatically mean it’s biologically significant. Okay? And we’re
going to talk about that a little bit more as we go on, but just keep that in mind that
that does not automatically mean it’s a great result and that you should run with it. Okay. So when we do this, what we’re trying to do
— this is a local alignment technique, so we’re looking for these paired subsequences.
In this case, they’re called high scoring segment pairs, HSPs. So a pair of sequences,
it can be aligned well with one another, we are going to try to align them to make them
as long as possible. They have to be above a certain score threshold — we’ll talk about
that in a minute — and they can either be gapped or ungapped. Because we’re — again,
we’re looking for local alignments, we can have more than one good alignment between
sequence A and sequence B. All right. There are various flavors of BLAST that are
shown on this slide. The ones that are used much more frequently are BLASTN — the N at
the end just stands for nucleotide, so we start with a nucleotide sequence and we search
that against a database of nucleotide sequences. BLASTP — P is for protein, so we start with
a protein query looking against a protein database. BLASTX is a little bit funky; we
start with a nucleotide sequence, we translate it in six frames and then we take that translation
and compare it to the protein databases. Okay? The two at the bottom are shown for sake of
completion; they’re not used very often. By far, you’ll be using the three at the top
most often and we’re going to concentrate on BLASTP for most of our examples today. All right, so before we actually pretend to
do a search, how does the alignment actually take place? So this is a query sequence, just
an amino acid sequence, and we’re going to use this as the basis of comparison to a public
database. And what I’ve done in the middle is I’ve highlighted three residues: the P,
the Q, and the G. Okay, so this is what is called the query word. By default, BLAST will
nucleate a search with a three amino acid stretch. Now of course it starts at the end
so it’ll start with the GSQ and then go to the next three SQ and move across, but for
the sake of symmetry on the slide, we’re going to just start with the P, the Q, and the G.
So what BLAST is now going to do is look in the public database for any other protein
sequence that contains, in that order, the P, the Q, and the G. But we also want to find places where there
are conservative substitutions, places where there might be a little wiggle in the sequence
but might produce an important, biologically relevant match, places where there are conservative
substitutions. So certainly, we’re going to look for things for the PQG. What I’ve done
for this example is just changed the middle letter here. So looking for things like the
PEG, the PRG, and so on. Each one of these, you see, has a score next to it. You now know
how to calculate this score, so if you were to take the BLOSUM62 matrices you have in
your handouts and just looked at P for P, Q for Q, G for G, see where they intersect,
you’d get 7 and 5 and 6 for a title of 18, the best match because it’s the exact match.
The next one down, we have a conservative substitution in the middle position. We see
that the score goes down a little bit and the scores keep going down as we go through
this. At some point, BLAST is going to say, well,
we’ve gone kind of far enough, you know, we’re now starting to go into territory where the
sequences are starting to diverge too much. That’s something called the neighborhood score
threshold. In this case, it’s 13 points. This changes on the fly from search to search and
depending on where you are along the length of the sequence. So you don’t have to worry
about changing that value. You can, but this is something I don’t think I’ve ever personally
changed in a search. Okay. All right, so we have now established these
other three-letter words that are in the neighborhood and we’re going to use this as the basis for
continuing into the next step of the BLAST search. So we’re going to look for everything
that has a PQ, GPE, GPRG, and so on. All right, so this is from the previous slide. Here’s
our PQG again, and in this particular case, we’ve aligned with a sequence that instead
has a PMG. So it’s not an exact match, but if we look up here, PMG is in the neighborhood,
so that’s okay. And then the next thing BLAST will do is, starting from this position, this
query word, three across, it’s going to go out in both directions as far as it can, aligning
as many of these of these amino acid positions as it can. Okay, so we’re going to go out
to the N terminus, we’re going to go out to the C terminus. Okay. And by looking at this,
before we talk about the numbers, you can also do a qualitative assessment of how good
the alignment is. In this middle row, any time you see an exact match, you see that
the letter has repeated itself; any place you see a conservative substitution, you see
a plus sign. So just by eye, you can see how good the alignment actually is and you want
as much of that middle filled in as possible. Okay. So, now, time to bring the numbers into play.
I told you we start with that nucleated PQG, go out in both directions — well, how far
is too far? Okay, so here’s our alignment again and here is a graph. So the extension
on the X axis is just how many positions have we aligned, how long is the alignment? The
cumulative score, which comes from our BLOSUM matrix, is on the Y axis. Back here, at the
first position, this is this part of the alignment, the PQG with the PMG, the 13 points that we
had from the previous slide, above our score threshold. Okay? So as we start to go in both
directions, we’re going to assume that we’re going to have more matches than mismatches.
So remember, when we match, we’re adding points, when we are mismatching, we’re subtracting
points, but we’re going to assume that we’re going in favor of the exact matches. So the
score, as we start to extend, goes up. At some point, you’re going to hit this score
threshold S. Okay? This is also set on the fly, search to search. If you break S, this
alignment is returned to you in your BLAST results. It does not, again, mean that it
is biologically significant. It just means it’s going to return the result to you. All
right? So, one thing to remember. So we’re going to break this score threshold, it’s
going to now appear in our BLAST results and we’re going to keep aligning merrily away.
But at some point, we’re going to go too far. The mismatches are going to start to outweigh
the matches, so we’re going to start subtracting points at a faster rate than we’re adding
them. We might start having some gap penalties, again, subtracting points. And at some point,
we’re going to hit something called a significance delay. When we go off of a peak like this,
a certain amount down, BLAST is going to say, okay, you’ve gone a little bit too far. We’re
going to go back up to the peak, so whatever that length alignment corresponds to is the
length of this alignment at the peak, that is now your high-scoring segment pair, the
maximal length alignment in that region between sequence A and sequence B. Got it? Okay. All
right. Now I’ve just made a big deal about the scores.
Okay, we need the scores to assess how good a match is position by position by position.
But we don’t actually use the scores to assess biological significance. What we’re going
to use instead is the result of this equation. Okay? Don’t even worry about what this equation
is, okay? This is something called a Karlin-Altschul equation. What the score allows us to do — there’s
the score — based on the length of the database we’re looking at, the length of the sequences
we’re looking at, composition, all of that, it allows us to calculate a normalized probability,
and that’s the value you’re going to look at. Okay? So why I am making a big deal out of this?
If you now have some understanding of what those scoring matrices look like, I want you
to think of two alignments. So we have maybe sequence A compared to sequence B and that
might be of a certain length, but it’s all composed of common residues. You may have
another alignment of sequence C and sequence D that is of the same length as the first
alignment, but it’s all composed of rare residues. Okay? You already know that rare residues
score higher than common residues. So you could end up with a situation where you have
two alignments of the same length with the same number of positions matching up, but
because of taking this rare versus common thing into account, the score of one is going
to be vastly higher than the score of the other one. So you need some way to compare,
from alignment to alignment to alignment, some normalized measure that allows you to
say, “Okay, this one is good and this one is not so good.” So that’s why the score itself
is not a good measure, but it is the key for us to get to the probability value. I hope
that made sense. Okay, so, at the end of the day, you’re going
to forget about all of that. This is what’s important. What that E value represents to
you is the number of high-scoring segment pairs that come from comparing sequence 1
to sequence 2 that were found purely by chance: the number of false positives. Okay? So because
this represents the number of false positives, you want E to be as low as possible. Okay?
So lower values of E signify higher similarity. And again, because this is a normalized value,
you can use this as a metric to compare any two sequences — excuse me, any two alignments
to each other. You don’t have to worry about lengths and other vagaries. Okay. So rule number 1: okay, if you are doing a
nucleotide-based search, you’ve started with a nucleotide sequence, you searched against
GenBank, the rule I want you to apply here is you’re going to look for values of E that
are 10 to the minus six or less. If you are doing a protein-based search — so this is
BLASTN and BLASTX. Remember, BLASTX, even though you’re starting with a nucleotide sequence,
the comparison’s at the protein level, you’re looking for values of 10 to the minus 3rd,
.001 or smaller. Okay? So that’s rule number one, and we’re going to come back to the rules
in a moment. All right. Questions? Good. All right. Okay, so let’s go through an example, finally,
after all of that background. And again, we’re going to consider a BLASTP example. We’re
going to pretend we’re at the computer. The screen captures are all in your handout, so
everything I show you, you have, and I would really recommend that you go through these
exercises when you get back to your labs or to your offices. It’s one thing to watch me
explain this to you, it’s another thing for your hands to get onto the keyboards and actually
do them yourselves. That’s the only way you’re going to learn how to do this stuff. Okay. So here’s the NCBI website at ncbi.nlm.nih.gov.
Let me orient you to the page a little bit. There are some common features, some popular
resources that are shown in the right sidebar here. And the very first one on the list is
BLAST, so we’re just going to pretend that we’ve clicked on that. That will take us to
the BLAST page, or you could just go directly to this URL. There are three sections on this
page. The first one allows you to search assembled genomes directly. So if you know that you
only want to search against a particular organism, you can pick one of these off the list and
there’s actually a link that will take you to all of the ones that are available to you.
At the bottom, there’s some specialized BLAST stuff, we’ll come back to that a little later.
But what we’re going to concentrate on now is this middle block, basic BLAST, and it
says nucleotide BLAST, protein BLAST, and so on. So we’re going to do a BLASTP search,
we’re going to click on that protein BLAST link. Before we can actually do the search,
we need a sequence. Okay? So again, for you to reinforce these concepts when you get back
to your offices or to your labs, there is a website at this URL that has all of the
sequences that I’m using as the examples for these lectures. So please go to this website,
copy them, go through step by step exactly what we’re doing here. All right. So this is now the BLAST query page. So at
the very top we have a sequence box, so from that processing page, I just took the first
sequence, pasted it into the box. If I just leave that the way it is, it will search the
entire length of that sequence, from position 1 to however long it is, but let’s say I only
want to use part of the sequence. I can tell BLAST, all right, only use a subrange: start
at position 10 and only go to position 50. If you have a reason for picking a particular
domain, you can do that rather than trimming the sequence manually. Okay. In the next block
down we have some choices to make regarding what we want to search. We know now what our
query is; what’s going to be the subject, the target of our search? By default, it says
“non-redundant protein sequences” or “nr,” so this is what people colloquially call GenBank.
Okay? All of the sequences that are publicly available through NCBI, so the non-redundant
database. There are a couple of other interesting ones, though, that I want to talk to you about.
The first one is called RefSeq and the second we’re going to talk about in a couple weeks
called Swiss-Prot. But these are specialized databases because they are highly curated. So let me tell you a little bit about RefSeq.
So RefSeq stands for “reference sequence.” This is NCBI’s reference sequence project
and the goal of this particular project is to represent each molecule in the central
dogma, each DNA sequence, each protein sequence, or each mRNA sequence, once and only once.
So those of you who have done sequence-based searches before may have come across the situation
where you put in a gene name, DCC, and you might get one one, five, 10, 20 entries back,
and you sort of quizzically look at the page and you wonder, well, which one is the right
one? Which is the canonically accepted sequence for that particular gene? So that’s the whole
point of this effort is that you don’t have to worry about trying to figure out which
one is “the right one.” The curators at NCBI have already done this for you and identified
which sequence is the accepted, correct sequence for a particular gene, mRNA, or protein. Because
of this curatorial effort, by definition, this is a non-redundant database. Each sequence
is represented once and only once. And the cool thing is that there are curators, I believe
somewhere below ground in Building 38A — I’m not sure how many subbasements down — who
read the literature and make sure that all of the things in the entry are kept up to
date. Okay. Now, how do you know when you’re looking at
a RefSeq entry? So I want to introduce the concept of the accesion number, and you’re
going to hear that term many times throughout the course. So the accesion number is just
a unique identifier, that when you use that as the basis for your search, will return
one and only one sequence. So in essence, it is that sequence’s Social Security number.
Okay? They take various formats; they have codes. And the way you can tell which ones
are in the RefSeq series is by looking at the first couple of letters. So the first
set I have up here says from curation of GenBank entries. They all start with an N, so the
ones that are NT underscore — and then six digits — these are genomic contigs. They
are the sequences that come off of the sequencing machines n the kinds of large-scale whole
genome efforts that Eric Green spoke about last week, just cranking all of this genome
data out. By the same token, the NMs are the mRNAs; NP, the P stands for proteins; and
there’s also entries for the non-coding transcripts, which are shown by NRs. So the thing that
all of these have in common is that these were directly experimentally verified. Somebody
had these molecules in their hands at some point and directly sequenced them. Now you can imagine a situation where, because
of these large-scale genome sequencing efforts we do, the sequences are being cranked out
at just amazing, breakneck speeds. Sometimes what we do is take these genomic contigs,
the data that comes off these sequencing machines, and we apply computational techniques to try
to predict where we think the genes are, where the introns are, where the exons are, genomic
structure, all of that. So what we can start with one of these genomic contigs and then
if one of these computational techniques has been applied, we would then have a series
of model mRNAs, the X series here, so XM for model mRNAs, and then based on those predictions,
in turn, we could predict where the model proteins — what the model proteins are, and
these are the XPs. So the important takeaway here is when you do a search against RefSeq,
you always want to go in favor of the Ns, because again, these have all been experimentally
verified. If these are not available, take a look at the Xs, but keep in mind, again,
that these are computational predictions; they have not been verified. The computational
predictions are pretty damn good, okay? But they are not fool-proof. So you can look at
those to get an idea of, “Well, I think I can make some sort of conclusion,” but you’re
going to need some other sort of verification to support that. So again, the Ns are experimental,
the X are predictions. Clear? Okay, very important. Okay, so we can pick — we’re going back to
our BLAST page — we can pick a page — excuse me, a target database that we want to use.
Let’s say we want to limit our search to particular organisms or particular types of organisms.
We could just put terms in this box here, so you can say “human” or “bilaterian” or
whatever you’d like to limit the results that you get back. Okay. In this section here,
it says “program selection,” BLASTP is already selected for us. You’ll see that there are
two other choices here — PSI-BLAST and PHI-BLAST — we will talk about those in two weeks’
time. Now most people at this point will just click the BLAST button and go off on their
merry way, but if you’ll notice underneath, it says “algorithm parameters.” So we’re going
to pretend that we’ve now clicked on that. It will expand the screen, there’s more below.
And this is where we now get to fine tune our searches. So all of the stuff that I talked
to you about in the first 45 minutes comes into play here. Okay. The first thing it asks you, “How many sequences
do you want returned to you?” The default is 100, okay? I’ve, for the purposes of the
example, changed this to 250. I would rather get too many results back than too few and
make the decision myself about what is relevant and what is not relevant. Okay? And we’re
going to see in this example that this is critical. Okay, expect threshold. What E value
cut-off do you want to apply? So I’ve already given you a rule about the E value — it’s
10 to the minus 3rd or less when it comes to protein sequence comparisons. The default
is 10. For purposes of the example, we’re going to leave it at 10. Okay? But you can
change that there if you want. Right below, you can see the word size is 3, our P, Q,
and G. Okay. Going a little further down, here’s now where you can pick the matrices.
So remember, I’m advocating you using multiple matrices; here’s where you make that selection.
Above that, before we get to the next one, you can see gap costs– remember I was telling
you about the affine gap penalty — if you want to make the gaps more permissive or less
permissive, you can change the default values here. So by default, it’s 11 for the — opening
the gap and 1 for the extension. If you want to make it more permissive, you’d push those
values down. If you want to make it less permissive, you’d push the values up. Okay. Filters and masking. So always, you should
check this particular box off, to filter for things called low-complexity regions. So what
is a low-complexity region? So these are regions of biased composition. What you’ll sometimes
see in sequences are just long, homopolymeric runs, short-period repeats, or just subtle
over-representation of the same residues. So here’s an example where we’ve got a homopolymeric,
just run of As and Qs, an alanine/glutamine track here. And unfortunately, BLAST really
relies on having somewhat of a uniform distribution of residues. So when it sees something like
that, it sometimes doesn’t know what to do with it, because you’ve got all of this, in
this case, all these As and Qs. If it’s got another sequence with an equal run like that,
it doesn’t quite know what register to put them in. Okay? So it’s best to actually remove
those, mask those from the sequence to not have the results get confounded. This usually
— just biologically — this usually comes from either DNA replication errors or you
might have some unequal crossing over, still a question mark on that one. But as far as
these analyses go, it’s important to mask them because again, it often leads to false
positives. So you just get in the habit of checking that box off when you do these searches.
Okay. Finally, we’ve gotten to the bottom of the
page. It tells us we can go ahead and search. There’s a little check box here that I always
check off, that says “Show results in a new window.” Okay? So when we finally click that
last BLAST button, by checking that box, your results will, as it says, appear in a new
window. The reason I like doing that is, if you don’t do that, everything’s going to be
in the same window and if you want to back up, it’s harder to do that. So just from a
practical standpoint, you’ve got your results in a different window, if you want to now
go back and make an adjustment, you still have your original window without having to
put the sequence back in, pick all the same parameters again. It’s just a nicer way of
doing it. Okay, so finally, time to issue the query,
and this is what you get back. Okay. Now you’ll recall the default value for the number of
results to return from the target database was 100. If you look at the screen, it says
“Distribution of 204 BLAST hits on the query sequence,” so we would have not seen the vast
majority of our results, which, in this case, most of them are actually biologically significant,
as you’ll see in a moment. Above this box, you’ll see that BLAST has identified some
important protein domains — and again, we’ll come back to this in a couple of weeks. So let’s concentrate on this box here. So
this gives you sort of a general overview of your BLAST results. At the very top, you
have a color key that shows you just sort of a heat map of as the scores get higher,
we move from the blacks to the reds. Remember the score is now what you’re looking at; you
— we’re going to look at the probability values, but we’ll just work with this for
now. The alignments, each one of the hits that have been found — so any time I use
the word hit, it just means a sequence that was found in the target database that was
deemed similar to the one that we started with. So the best ones are, by score, are
at the top, then going down in descending score order. If you were to take your mouse
and just mouse any place over any of these bars, there’s a window at the very top that
would give you the identity of what protein it is that was hit in this particular analysis.
What’s next? Okay. Now, you’ll see some of these are connected
by these very faint, thin bars. Some of them are not. Whenever you see two of them connected
like this, this means that we have a protein that was found in the public database where
there are multiple high-scoring segment pairs. Okay? So in the very case at the top, there
were two separate alignments for sequence A and sequence B. Okay? The case where the
arrows showing 1, 2, 3, 4. Some places you’ll see this very light vertical bar. And what
that is intended to convey to you is that, all right, in these cases, you know, there’s
some space between this high-scoring segment pair and this one. In this case, either the
space is too small to show, but it wants to give you an indication that there is a boundary
there, or you have two high-scoring segment pairs that overlap with one another. Okay?
And that happens quite frequently. So in — let’s find a good one — well, we’ll come back to
it later. Okay. Okay, any time you see things on a line that are not connected, they’re
unrelated; it’s just done to save space. Okay? So that’s sort of the overview. What we’re
going to get — I like numbers, so we’re going to see much more useful information when we
scroll down a little bit and look at the actual results. So this is the table that you get
back, nicely formatted. The accesion numbers, the unique identifiers, are in the first column,
and if you were to click on any of those, it would take you into GenBank and you could
just see the GenBank entry for that particular protein. In the description column, you’re
given a brief one-liner about what was found, usually just the name of the protein and the
organism that it’s in. The third column gives you the score. Remember, the score is not
important: the E value here is. And thank you to NCBI for finally putting them in E
value order instead of score order. For many years, those of you who may have done this
before will remember that the results were always sorted by score instead of E value.
You’ll notice that there is a small up arrow here, so the sort is actually the way you
want it by default. But if you, for some reason, want to sort on one of the other values, you
can just click on the column header and it will do that for you. You have some links in the final column. The
cheat sheet is above the table here, so if you see U, U stands for UniGene, which gives
you some information from a gene-centric standpoint about gene clusters. E is for GEO, for the
gene expression omnibus. So if you see a E next to one of these — we don’t have one
here — that just means that there’s gene expression information available to you as
well. The G stands for EntrezGene, a nice gene-centric view of the world. So if you
want o know about — more about the gene that encodes for that particular protein, you could
go there. A couple of cases here, we have Ss. S is for structure, so that means that
there is a solid, three-dimensional structure, either by x-ray or NMR for that particular
protein on that line. Might not be the whole protein, might just be a domain, but there
is some real structural information available. M would take you to the NCBI map; you’re — Tyra
will cover that very briefly next week. And finally, if there’s some bioassay information
available through PubChem, you’ll get that as well. All right. So that’s the orientation
to the table. This is the column, again, we’re paying attention
to. At the very top, you’ll notice that the values are zero. So what zero just means is
that it is less than 10 to the minus 1,000. So basically, there’s just not enough room
to show the value. And then, you know, so they’ll go out to three digits in the exponent,
but not four. So if you see a zero, you’re sitting pretty at that point. Okay. As we
go down, the values still look pretty good, given the criteria that I had given you before
and we’re going to pretend to — oh, I’m sorry — just so you know how to read these, if
you’re not used to the notation, if you see something like 4e minus 97, that just means
4 times 10 to the minus 97. So the part after the e is the exponent. Okay. All right. So now we’re going to pretend to scroll down
and the values, if this is the E value column, again, they look pretty good, we’re going,
we’re going, we’re going, but at some point, we now are going to apply that .001 criteria
that I gave you and draw a cut-off line. Okay? So just a couple things to reinforce here.
We changed how many hits to return. If we had stopped at 100, we would have only seen
something maybe further up here and not seen any of the ones at the bottom. We left the
E value at 10 and I think that’s actually okay, as long as you know what the cut-off
values are so you can draw the line yourself. So in this particular case, the line would
go here because this one, 10 to the minus 7, that one’s okay, but .006 is just slightly
above our cut-off guide. So everything from there up, we’re going to accept for now, and
everything below, we’re going to reject. We’re going to come back to this point in a minute
because you’ll see in — if you look at these def lines, you see a protein called Prospero
above the line, but you also see some below the line. So we’re going to talk about that
in a minute. Just keep that in mind. All right. We’re now going to imagine we’ve
scrolled further down. We’re now looking at one of the alignments. And so you’ll always
see something like this. At the very top, it tells you what the hit is. So again, here’s
protein Prospero. In this case, the first high-scoring segment pair starts at position
17 of the query that’s aligned with position 317 in the subject, and you can see how the
numbers go across as you go down. The part to pay particular attention to is right here.
So expect — that’s your E value — again, this one is zero, that one’s good. You see
two metrics here, identities and positives. So identities are the number of exact matches.
Okay? Percent identity. In this case, it’s 100 percent identical. The positives are the
exact matches plus the conservative substitutions. Okay? In this case, because the identity is
100, the positives have to be 100, but we’ll see those numbers changing momentarily. So
here’s where guideline number two comes into play. This is a protein-based search. What
I want you to look for is greater than 25 percent identity between the two sequences.
If it was a nucleotide search, 70 percent or more. Okay? You’ll see that some of these
amino acids are in lower case and grayed out a bit; that represents the low-complexity
regions that we talked about before. So these have been masked out for purpose of the comparison. Now we’re going to pretend again, we’re going
to scroll down just a little bit. This is the bottom of the piece that we just looked
at, ending at position 704. And you’ll notice here, now, is another block of statistics.
But unlike — let’s just go back one real fast — unlike the previous one, we don’t
see any def line information, we don’t see any protein information being given to us.
So what that means is that it’s the second high-scoring segment pair for that same protein.
Okay? So our protein query against the same sequence, different alignment, that was found.
As before, we can see, you know, where the sequences align. We have our statistics, so
here, 93 percent identical and again, 93 percent positives. Here’s our first case of a gap,
so you can see what that looks like when you have a gap in the alignment. Okay? And in
this particular case, there’s actually a third one in this set as well. So if we look at this overview again, here’s
just the two little blocks of scores from the alignments in the two previous slides.
So the very first one we had and the second, our E values were zero, so that beats our
10 to the minus 3rd cut-off. We look at the percent identities, we want this to be more
than 25. Well, both of those values beat our cut-off. And if we just look at where these
are, here’s the first high-scoring segment pair from 17 to 704 in the query, this bar
right here. Here is the second one, going from this black mark going all the way over,
and the third one, which we didn’t take a look at, is right here as well. Okay. Questions?
Okay. It’s — I know it’s starting to get a little bit overwhelming at this point. Okay, so cheat sheet number two, to try to
bring some rhyme and reason to all of this. Okay, so just by way of reminder. If we’re
looking at nucleotide sequences, we want an E value better than 10 to the minus 6th, sequence
identity 70 percent or better. If we’re looking at protein sequences, we want an E value 10
to the minus 3rd or better, 25 percent sequence identity or more. Okay? Now, these are guidelines,
okay? And I told you to pay attention to what’s above and below the line. I don’t want you
to use these cut-offs blindly. These are just guidelines as a starting point, as you get
more and more experienced with using BLAST. Okay? You have to pay attention to what’s
on either side of the line. So if you see things that are below the line,
that you have biological evidence that supports that that thing below the line, that protein
below the line, is actually related to the one that you started with, the one that you
used as your query, your biological evidence will always trump the computational prediction
and you will accept that as being related. So whenever you’ve got biological information,
you always use that first. Okay? And then you use these to either confirm the relationship
or again, to discover new relationships. But don’t just say, “Oh, Andy told me I have to
draw a line here and so I’m not going to consider this.” Always make sure to dig back into what
you know from the literature, what you know from your own studies, when you’re looking
at things on either side of the line. There might be things slightly above it that you
want to reject, there might be things slightly below it that you want to accept. Okay? So
that will help you get — not fall into the pit, again, if you keep your biology into
account. Okay. Just very quickly, some things to keep in
mind that can mess up a BLAST search. So again, we’ve already talked about low complexity
regions. Repetitive elements also, for the same reason, because of the repeats, tend
to confound the analyses, and I give you some ways of dealing with those here. Low-quality
sequence hits are sort of less of an issue now that we’re cranking out sequences at the
pace that we are, but they still exist in the databases, so you may, occasionally, if
you’re doing nucleotide searches, get hits to expressed sequence tags, to EST’s, or to
single-pass sequence reads. There’s a Poisson distribution that governs how good a single-pass
read is, as far as accuracy goes, with respect to genome sequencing. So when you only sequence
once, the accuracy is on the order of something like, what is it, 63 percent, okay? When you
do the second pass, it goes up to 87 and then so on. So that’s why we do genomes at 10 or
12x coverage, to make sure we’ve got it right. So just keep that in the back of your mind.
If you hit one of these lower-quality reads, you may want to discount the result. Okay.
All right. So that is how you use BLAST, and hopefully
that has helped to take it a little bit more out of the realm of the black box. There are
some variations on B LAST; we’re going to talk about one of them today, called BLAST
2 sequences. So let’s say you’ve sequence A and sequence B. You don’t want to search
against the databases; you just want to do an alignment of A with B. So this is the tool
that you would use for that, either at the protein level or on the nucleotide level.
All of the BLAST programs are available to you. You can, as before, select any of the
BLOSUM matrices, any of the PAM matrices; the scoring works exactly the same. Okay.
So we’re back to our BLAST homepage at this URL. We’re now going to look at this bottom
block here, the specialized BLAST searches, and just click where it says “Align two sequences
using BLAST.” Okay, once you click on that, the screen looks
very similar to what we saw before. At the very top, BLASTP has been highlighted, the
BLASTP tab. But what’s slightly different from before, before we had a query sequence
for one sequence, because we’re now comparing two pre-identified sequences: sequence A,
sequence B. Okay? And as before, if we want to select a subrange of that sequence, we
can. BLASTP is selected by default, it’s the only thing you can do here. Again, as before,
we can take a look at the algorithm parameters, so this will take us further down the page.
In the first block, how many sequences do we want returned. E value, the E value is
somewhat irrelevant here, as far as performing the search, because you’ve, in advance, identified
which two sequences you want to align. But you do want to look at the result later to
get the measure of statistical significance. You can change the word size and so on. As
before, matrices are selected the exact same way. You’ve got your low-complexity regions
here. You’re going to make sure you always check that box, check the box to get the results
in the new window, and then finally, go ahead and BLAST. Okay. And the result format is exactly the same
as what we’ve seen before, with minor exception. We only were comparing one sequence to one
other sequence, so there is only one line here in our overview. The results formatted
exactly the same way, so this was the sequence in the second box, which was used as the target.
It tells us the E value is 4e to the minus 24, so we accept that, using our criteria.
And if, as before, we look at our percent identities, the cut-off is 25 percent or more.
Here, these two sequences in this region are 46 percent identical to one another, 74 percent
similarity. Okay. So just a very quick way to align two sequences.
Here’s the first high-scoring segment pair, here’s a very teeny one as well. Okay? This
one, we would reject because here, if you look at the E value, 1.9, it’s way above our
threshold. Okay. Another way to look at these, here’s just a graphical view of where those
two alignments line up. So this is sequence 1, starting at position 1 going to position
178. Here’s sequence 2, going from 1 to 134. This was the statistically significant high-scoring
segment pair that we identified going from 95 to 178 along here, going from 51 to 134
going up here. This is the one that we rejected, but you can see that it overlapped this one
as well. So we’re going to, again, by basis of our statistics, just reject this one outright
and accept this as the best alignment of the two sequences. Okay. All right, very quickly, we’re going to talk
about nucleotide BLAST. So two methods: one’s called MegaBLAST, the other one’s called BLASTN.
We’re not — I’m not going to walk you through an example, I’m just going to show you how
to actually do it. So again, we’re going back to our BLAST homepage, the center section.
Nucleotide BLAST is the first option in the second section, it takes us to our BLAST page,
and what I want to zero in on here is a section you have not seen before where it asks you
which nucleotide comparison program do you want to use, MegaBLAST, discontiguous MegaBLAST,
or BLASTN? Okay? And that’s, of course, incredibly obvious, so here we go. [laughs] So here’s a table that just gives you, all
right, what the word lengths are, the plus/minus, so for matches and mismatches, what the scores
are, how the gaps are scored. So affine is what I described to you before, where it’s
a penalty to open the gap plus the extension. When it’s linear, you don’t asses a penalty
to open the gap, you just score the length of the gap, you subtract points for the length
of the gap. If you have two sequences that are highly similar to one another, really
long stuff, chromosome length, you know, big contig length stuff, you want to use the first
one, which is called MegaBLAST. So this is what you use if you’ve got very long sequences
or highly similar sequences, I mean, really, highly similar sequences, 95 percent or greater.
You’re going to get very long alignments back and it’s incredibly fast. I’m going to jump
to the lower block real fast. So the lower block is if you’ve got really short stuff
that you want to make sure you’ve got almost exact matches for. These are the parameters
that NCBI recommends for this, but they don’t recommend putting a cut-off in, so this is
some place where manual inspection is very important. In the vast majority of cases, you’re in the
middle, where you’ve got things that are somewhat similar to one another. The results that you
would get using either a discontiguous MegaBLAST or BLASTN are going to be not quite exact,
but they’re going to be pretty much very similar to one another. The only difference is that
MegaBLAST is going to run much faster in relative terms. It’s like, how fast can you word process,
right? So you might want to just do both of those methods to see if you get slightly different
results when you do your BLASTN searches. So that’s all we’re going to talk about on
the nucleotide side of the house and we’re going to keep coming back to this throughout
the course, so you’ll see some more treatment of this later on. Okay, so finally, in the last 11 minutes,
we’re going to get to the last thing on the list, which is called BLAT. So everything
we’re looked at so far, the scoring matrices, all of that stuff, has revolved around NCBI’s
basic local alignment search tool. But the same basic premises of how the scoring matrices
are computed, how they’re used to assess similarities, how searches are nucleated to find other sequences
that are similar to one another, also apply to BLAT. So BLAT is a tool called the BLAST-Like Alignment
Tool that’s made available by UC Santa Cruz. And what this is designed to do, like MegaBLAST,
is to really rapidly align longer nucleotide sequences, so things that are longer than
40 in length that have really, really high sequence similarity. So again, we’re talking
things on more of a genomic scale at this point. It really is, you know, a method of
choice when you’re looking for exact matches in nucleotide databases. It’s incredibly speedy,
500 times faster than BLAST for either mRNA or DNA-based searches. Because it is pushing
in the direction of exact matches, you might miss divergent or shorter sequence alignments.
That’s when you’re going to go back and use BLASTN instead, over at NCBI. It can be used
on protein sequences; truth be told, I’ve never actually done that. I’ve always used
this on the nucleotide side, but just that you’re aware that you can do that as well. All right. So when do you use this instead
of one of the BLAST algorithms? All right, you might have a particular unknown gene or
you might have a sequence fragment, so it’ll allow you to place it onto a chromosome to
find its genomic coordinates. It allows you to help determine the gene structure by virtue
of this comparison; where are the introns, where are the exons? Or it might help you
find some markers of interest in the vicinity of a sequence, so there might be upstream
regulatory elements or other important genetic features of — note, Laura Elnitski will talk
about this in great length when she talks about epigenetics and upstream regulatory
regions. You, as already inferred, use it to find highly similar sequences, so other
gene family members or putative homologs. You know what a homolog is now. And also,
just to display your sequences as a specific track. So, track is a new concept. It’s just
a way to describe linearly a bunch of sequences that are related to one another, and we’ll
see an example of that in a moment. Okay, so we now leave Building 38A, we go
to the West Coast, genome.ucsc.edu. This is UCSC’s genome bioinformatics homepage. All
of the tools that are available to you are on the left-hand side. The fourth one down
is BLAT, so we’re going to pretend we’re clicking on that for purposes of the example. Again,
I’ve just pasted a sequence into the box from the sample sequence page that I told you about
earlier. This is a rat CDNA clone, so I’ve changed the genome to rat. It’s using the
latest assembly, so by default, it will give you the latest set of information available.
Tyra will touch on that a little bit next week and why it’s important to keep track
of the assemblies. The query type, it’s a DNA sequence. And here, we’re just going to
submit it, okay? There are ways to change the parameters, but we’re just going to go
ahead and click on the submit button and see what we get back. And we end up getting back a very simple format
on our results. And they are, in this case, shown by — in score. You don’t see probability
values here. Okay? The best hit is the one at the top, so — because it has the highest
score. It is also the longest one, so the span in the last column tells you how long
the alignment is. So in this case, we started with a sequence. It aligned from position
1 to position 733: 98 percent identity. It’s on chromosome 5 on the plus strand and it
gives us the genomic coordinates of where it landed. It’s one thing to look at it that
way. We can now see it in the context of the genome browser by clicking on the browser
link. So this is now what a standard UCSC genome browser page looks like, and again,
Tyra will make you experts in how to use this by the end of next week. Very simply, here’s
a red bar on this ideogram here, so we can see that we are at 5Q 31, is where our sequence
landed. It’s right here, your sequence from BLAT search, and the accesion number is at
the front. You’ll remember that this was 98.1 percent similar to the chromosomal sequence,
so there are some gaps out here to show you where there have been gaps inserted. You’ll
also see, very faintly, there are some red bars that are in there as well, and those
just indicate to you the positions of mismatches. Okay? On your handout, I’ve given you a color
key, so red genome and query sequence have different bases at this position. You can
read for yourselves when you see something in orange, in purple, in green, what those
all represent. So different characteristics of the alignment. Okay. There are a slew of other tracks here that
are of great interest. So there are some rat EST, some conservation tracks that tell you
how well the sequences line up between, in this case, mouse and human and dog and so
on, in this particular region. There’s SNP information available to you, a whole slew
of really cool genomic information, and again, Tyra will show you all of that next week.
Okay. There’s another way to look at these results,
by clicking the second link over here, the details link, so we get a more text-based
result here. The first sequence here is the one we started with. This was the query that
I pasted into the BLAT search page. This was the result. You’ll see that some of these
are capitalized, they’re in different colors. It tells you at the very top what all of that
means, so matching bases and CDNA and genomic sequences are colored blue and capitalized,
light blue bases mark the boundaries of gaps in either sequence. And if we just pretend
we scroll down a little bit more, here is the alignment in the more traditional way
that you’re now used to looking at these alignments. Okay? So again, very powerful tool, gives you chromosomal
context, whether you’re looking at human sequences, rat sequences, what have you, and allows you
to now start thinking about things, not just between the two sequences lined up, but a
whole just amazing collection of genomic information, and may help you think about how that particular
gene is implicated in certain biological processes and perhaps in certain disease processes as
well. Okay, the last item. It’s called FASTA. So
FASTA actually predates BLAST as a sequence similarity method. It, like BLAST, identifies
region of local alignment. It uses a very complex algorithm called a Smith-Waterman
algorithm to do the comparison. So the method is significantly different than the one I
described to you earlier in this lecture, but it’s an incredibly robust method. While
we don’t have time to talk about it today, I wanted to point it out because, again, if
we think about where those cut-off lines are, if you want confirmatory evidence by some
other method for some of those BLAST hits, you could go ahead and do a FASTA search or
use some of the methods I will tell you about in Week 4 to help you confirm or reject results
that are around the cut-off line. So the same way that in the lab you might use a second
or a third method to confirm your results or your conclusions, you do the same thing
in the computational world as well. Here’s where you can find some online implementations
of that. And I know we covered a lot of ground in the
last hour and a half. For those of you who would like some more information on how to
use these methods, some more of the fine points of how these methods are utilized to answer
specific biological questions, you can read Chapter 11 in my own textbook, which talks
about both BLAST and FASTA. In the Mount textbook, Bioinformatics Chapter 6 talks about sequence
similarity as well. Both of these books are available in the NIH library, so you can just
check them out at your convenience. All right, so that’s it for this week. What
we’re going to do two weeks from now is build on all of this stuff to start thinking about
things in more context, looking at profiles and patterns and motifs. We’re going to talk
about structure a little bit, something that usually scares people but actually is pretty
cool and is not that hard to do, and talk about how to start constructing multiple sequence
alignments. So just by way of reminder, next week, Tyra Wolfsberg will pick up from here,
talk to you about genome browsers, the stuff that I alluded to towards the end of today’s
lecture. So thank you for coming, I’m glad to take any questions at the podium. We’ll
see you again next week. All right. [applause]