Our ability to generate sequence data and catalog interactions between DNA and the proteins that bind it has increased dramatically since the development of ChIP-Seq in 2007. Our computational ability to deal with this data has also grown, albeit more slowly. What has not changed is our method of visualizing binding preferences: we still rely on a system of tall and short letters developed over two decades ago to represent these patterns.
Sequence logos (seqLogos) are a graphical means of viewing a position weight matrix (PWM), such as those derived from a multiple sequence alignment. They are typically used to describe nucleotide or amino acid motifs in transcription factor binding sites, integration sites, or any other instances where proteins are likely to interact with specific DNA sequences.
The y-axis of a nucleotide seqLogo is typically measured in bits of “information content” (a log2 scale), a metric in which 2 bits represents 100% conservation at a position. A tall stack indicates a strong preference for one or at most two nucleotides.
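To make the 2-bit ceiling concrete, here is a minimal Python sketch of how the stack height at one position could be computed. It assumes the common definition (maximum entropy minus observed entropy) and leaves out the small-sample correction that some logo tools apply; the function names are made up for illustration.

```python
import math

def information_content(freqs):
    """Information content in bits for one position (no small-sample correction)."""
    # Shannon entropy of the observed frequencies; zero frequencies contribute nothing.
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    # Maximum entropy is log2(4) = 2 bits for nucleotides.
    return math.log2(len(freqs)) - entropy

def letter_heights(freqs):
    """Each letter's height is its frequency times the position's information content."""
    ic = information_content(freqs)
    return {base: p * ic for base, p in freqs.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
split = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, column in [("conserved", conserved), ("split", split), ("uniform", uniform)]:
    print(name, information_content(column), letter_heights(column))
```

A fully conserved position reaches the full 2 bits, an even split between two nucleotides gives 1 bit, and a uniform position gives 0 bits, so it disappears from the logo altogether.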
SeqLogos feature prominently in papers, in part due to the ease with which they can be generated from the user-friendly webLogo application. While born during the Sanger era, seqLogos have experienced something of a renaissance due to the success of high-throughput ChIP-Seq profiling.
However, seqLogos present some notable deficiencies when portraying nucleotide PWMs. (Amino acid seqLogos are often a totally incomprehensible jumble of letters and will not be tackled here, although the alternatives are even worse.)
Problems with seqLogos
1. The most frequent nucleotide gets top billing even if it is only a fraction more frequent than the runner-up.
Proponents of sequence logos tout their ability to illustrate ambiguity, but the stacking pattern tends to exaggerate the importance of the winner even when there is considerable doubt.
In ties, the top dog is simply a question of alphabetic order.
AT, TT, AA, TA
GG, GC, CG, CC
Worse yet, the winner gets to stand on the shoulders of his vanquished enemies, adding to possible misinterpretation.
2. “Nullotides” are left off the credits.
Nucleotides that are never observed (“nullotides”) are in turn absent from seqLogos, leaving us to see how the remaining three nucleotides stack up. The nullotide might be more interesting from a biological perspective.
3. They look obnoxious.
Granted, this is subjective, but the adjacent placement of absurdly tall and squat letters offends the senses. There must be a reason actual corporate “logos” don’t look this way.
Can we do better?
R guru and ggplot2 creator Hadley Wickham’s response is priceless:
Is there a diagram that avoids these problems, and can we find a seqLogo for the ChIP-Seq age?
With this in mind, I was reminded of a figure developed by Charles C. Berry for “Selection of Target Sites for Mobile DNA Integration in the Human Genome” to distinguish patterns of integration site motifs between different retroviruses and retrotransposons.
berryLogo: a better seqLogo
I have coined this figure a “berryLogo”, and its implementation in R and ggplot2 is available here.
Instead of “information content”, the y-axis is the log relative frequency with respect to the background frequency, generated here from a GC content parameter.
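As a rough illustration of that y-axis, the Python sketch below derives a background distribution from a GC content parameter and computes the log relative frequency of each nucleotide at one position. It is not the berryLogo implementation: the function names are hypothetical, the log base (2) is an assumption since the text does not specify one, and never-observed nucleotides are returned as negative infinity, which a plot would have to clip to some floor.

```python
import math

def background_from_gc(gc_content):
    """Background frequencies implied by a GC content, split evenly within G/C and A/T."""
    half_gc = gc_content / 2.0
    half_at = (1.0 - gc_content) / 2.0
    return {"A": half_at, "C": half_gc, "G": half_gc, "T": half_at}

def log_relative_frequency(freqs, gc_content=0.5):
    """log2(observed / background) per nucleotide; unobserved bases map to -inf."""
    background = background_from_gc(gc_content)
    return {
        base: math.log2(freqs[base] / background[base]) if freqs[base] > 0 else float("-inf")
        for base in background
    }

# One position where T is never observed (a "nullotide").
column = {"A": 0.5, "C": 0.25, "G": 0.25, "T": 0.0}
print(log_relative_frequency(column, gc_content=0.5))
```

On this scale, enriched nucleotides sit above zero, depleted ones sit below it, and a nullotide falls to the bottom of the plot instead of vanishing.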
What information is missing from a seqLogo that could be added to enhance a berryLogo?
Can we quantify the contribution of each position to the binding specificity? How does each position distinguish or trim the number of loci to which a protein would bind?
One way we can estimate this is by iteratively “backgrounding” each position in the logo and counting matches to the genome. The following plot is generated using the matchPWM function in Biostrings. The background hit rate represents the loss of specificity created by removing positions in the position weight matrix.
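The backgrounding idea can be sketched in plain Python. This is a stand-in for illustration, not the Biostrings matchPWM workflow: the function names, the pseudocount, and the choice to set the cutoff as a fraction of the best score attainable by each modified matrix are all assumptions, and the input sequence is assumed to contain only uppercase A, C, G, and T.

```python
import math

BASES = "ACGT"

def log_odds(pwm, background, pseudocount=0.01):
    """Convert per-position probabilities into log2 odds against the background."""
    # The pseudocount is weighted by the background so that a column equal to
    # the background scores (approximately) zero for every base.
    return [
        {b: math.log2((col[b] + pseudocount * background[b]) / (1 + pseudocount) / background[b])
         for b in BASES}
        for col in pwm
    ]

def count_hits(scores, sequence, min_fraction=0.8):
    """Count windows scoring at least min_fraction of the best attainable score."""
    width = len(scores)
    threshold = min_fraction * sum(max(col.values()) for col in scores)
    hits = 0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if sum(col[b] for col, b in zip(scores, window)) >= threshold:
            hits += 1
    return hits

def background_each_position(pwm, background, sequence, min_fraction=0.8):
    """Replace one column at a time with the background and recount the hits."""
    counts = []
    for i in range(len(pwm)):
        modified = list(pwm)
        modified[i] = dict(background)  # this position no longer discriminates
        counts.append(count_hits(log_odds(modified, background), sequence, min_fraction))
    return counts

background = {b: 0.25 for b in BASES}
# Toy three-position matrix that strongly prefers "ACG".
pwm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
sequence = "ACGTACGAACTTTACG"

print(count_hits(log_odds(pwm, background), sequence))
print(background_each_position(pwm, background, sequence))
```

A position whose removal causes a large jump in hits is doing much of the work of narrowing down where the protein would bind; a position whose removal changes little contributes little to specificity.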
Another possible view, for proteins with a known active site, involves progressively adding positions on either side, and plotting the selectivity as a function of width.