The Awsomics bioinformatics framework

Analysis of genomic data has become highly complicated and is a focus of modern biomedical research. Individual labs have a growing need to analyze different types of genomic data in a customized fashion to meet their unique research goals. Whether the genomic data come from their own experiments or from public repositories, preparing them for custom analysis is very time-consuming and beyond the capacity of most biomedical researchers.

We are currently developing the Awsomics framework (http://awsomics.org) to make genomic data more accessible for custom data analysis. Awsomics has three major components. The backend is a growing archive of genomic information, ranging from standard annotations (dbSNP, Entrez Gene, OMIM, etc.) to public experimental data sets (GWAS, transcriptome, protein interaction, etc.). The frontend is a Shiny web server hosting a series of generic and project-specific APPs. The middle layer implements existing and novel bioinformatics methods that perform integrative analysis of the archived data and deliver the results to users through the web server. Running on an instance of Amazon Web Services (AWS), the whole framework is highly portable and can be easily deployed on other public or private servers.


The goal of Awsomics is to promote exploratory analysis of complex genomic data and facilitate knowledge discovery in biomedical research. Its target users are biomedical scientists who need custom analysis but lack the time or resources to do it themselves. It will help them incorporate the enormous amount of genomic data in public repositories into their own research. The “bottom-up”, “data-centered” development strategy of the framework keeps it flexible and expandable as the field of genome sciences continues to evolve.

We have been developing the Awsomics framework for ~18 months. The archive of genomic data keeps growing, and currently includes over 10 major categories. For example:

  • Over 7,000 result sets from GWAS analyses
  • Close to 300 curated transcriptome data sets in 8 collections
  • 3 million pre-defined gene sets

We have developed several generic bioinformatics APPs that support general exploratory analysis of the archived data.

We have also worked closely with several research labs to develop project-specific APPs.

Posted in Uncategorized

Analysis of high dimensional data: a customized workflow

Genomic data are often high-dimensional. For example, a typical transcriptome data set has fewer than 100 samples but over ten thousand genes/variables. This article introduces a three-part workflow for analyzing high-dimensional data: t-SNE for dimension reduction, DBSCAN for sample classification, and Plotly for visualization. The data set used here is the ExpO (Expression Project for Oncology) microarray data set, which includes gene expression measurements of 19,361 genes in 2,158 tumor samples.


t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular dimension reduction method that maps high-dimensional data to a 2D or 3D space. This method is implemented by the Rtsne {Rtsne} R function. In a t-SNE space, the similarity between samples is measured by their overall KL (Kullback–Leibler) divergence. An optimal t-SNE run should have a relatively low KL divergence. Each t-SNE run of this analysis involves the following sub-steps:

  • Run the Barnes-Hut approximation to reduce computational time;
  • Calculate the Euclidean distance between each pair of samples;
  • Fit the distance of each sample to its nearest neighbors to a Cauchy distribution;
  • Adjust sample positions in the low-dimensional space over 1,000 iterations to lower their overall KL divergence.

Multiple t-SNE runs on the same data set will not have identical results because their initial states depend on random seeds, so the overall KL divergence can be used to select an optimal run (the lower the better). The figure below shows the distribution of overall KL divergence from 1,000 t-SNE runs. Because the worst run is only ~4% higher than the best run, we can confidently say that t-SNE performs quite consistently on the ExpO data.

[Figure: distribution of overall KL divergence across 1,000 t-SNE runs on the ExpO data]
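
The post runs t-SNE through the Rtsne R package; purely as an illustration of the seed-selection loop described above, here is a minimal Python sketch using scikit-learn’s TSNE, where the expr matrix, the number of runs, and the perplexity value are placeholders rather than the settings used in the post:

# Illustrative sketch only (not the post's R/Rtsne code): run t-SNE several
# times with different random seeds and keep the embedding with the lowest
# final KL divergence. `expr` is a hypothetical samples-by-genes matrix.
import numpy as np
from sklearn.manifold import TSNE

def best_tsne(expr, n_runs=20, perplexity=30.0):
    best = None
    for seed in range(n_runs):
        # the Barnes-Hut approximation and ~1,000 optimization iterations
        # are scikit-learn defaults, mirroring the sub-steps listed above
        tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
        emb = tsne.fit_transform(expr)
        if best is None or tsne.kl_divergence_ < best[0]:
            best = (tsne.kl_divergence_, emb)
    return best  # (lowest KL divergence, corresponding 2D embedding)

# e.g. kl, embedding = best_tsne(np.random.rand(200, 500), n_runs=5)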

DBSCAN

DBSCAN (Density-based spatial clustering of applications with noise) is an unsupervised clustering method based on sample density in a multi-dimensional space. It groups samples that are packed together into high-density regions of the given space. This analysis uses the dbscan {dbscan} R implementation of DBSCAN and takes as input the result of the t-SNE run with the lowest KL divergence.

Each DBSCAN run has two key parameters: epsilon (the size of the epsilon neighborhood) and minCore (the minimal number of core samples required to form a cluster). Both parameters can be chosen arbitrarily, but here we run another iterative process to identify their optimal combination based on the silhouette. The silhouette measures clustering cohesion: for each clustered sample, its silhouette value compares its distance to its own cluster with its distance to other clusters, and ranges between -1 and 1 (the higher the better). In this analysis, we use the average silhouette value of all samples to select the optimal parameter combination, searching epsilon from 1E-6 to 1E+6 and minCore from 2 to 100. The figure below plots the average silhouette value for each minCore value and shows that the clustering of ExpO is optimal when minCore = 78, which assigns about 89% of all samples to 8 clusters.

[Figure: average silhouette value as a function of minCore, peaking at minCore = 78]
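
As an illustration only (the post itself uses the dbscan R package), here is a rough Python sketch of the epsilon/minCore grid search, run on the 2D embedding from the best t-SNE run and scored by the average silhouette of the clustered samples; the parameter grids in the usage comment are placeholders rather than the full 1E-6 to 1E+6 sweep:

# Illustrative sketch only: grid-search DBSCAN parameters on the 2D t-SNE
# embedding, scoring each combination by the average silhouette of the
# clustered (non-noise) samples.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def search_dbscan(embedding, eps_grid, min_core_grid):
    best = None
    for eps in eps_grid:
        for min_core in min_core_grid:
            labels = DBSCAN(eps=eps, min_samples=min_core).fit_predict(embedding)
            clustered = labels != -1              # -1 marks noise samples
            if len(set(labels[clustered])) < 2:   # silhouette needs >= 2 clusters
                continue
            sil = silhouette_score(embedding[clustered], labels[clustered])
            if best is None or sil > best[0]:
                best = (sil, eps, min_core, labels)
    return best  # (average silhouette, eps, minCore, cluster labels)

# e.g. best = search_dbscan(embedding, np.logspace(-2, 2, 9), range(2, 101))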

plotly

Now, we can have fun visually inspecting the clustering results using an online graphing tool called Plotly. Plotly makes online graphs with interactive features such as zooming, figure downloading, and hover-over text. An example of a Plotly graph can be found here (click the “Load an example” button to load the results of a t-SNE run on the ExpO data).

[Figure: interactive Plotly scatter plot of the t-SNE embedding colored by DBSCAN cluster]
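
As a hypothetical companion to the figure above (not the post’s actual plotting code), here is a short Python/Plotly sketch that renders the t-SNE embedding colored by DBSCAN cluster, with hover-over sample labels and a standalone HTML export:

# Illustrative sketch only: interactive scatter plot of the 2D t-SNE
# embedding, colored by DBSCAN cluster, with hover text per sample.
import plotly.express as px

def plot_clusters(embedding, labels, sample_ids, out_html="tsne_clusters.html"):
    fig = px.scatter(
        x=embedding[:, 0],
        y=embedding[:, 1],
        color=[str(l) for l in labels],   # cluster label (-1 = noise)
        hover_name=sample_ids,            # shown on mouse-over
        labels={"x": "t-SNE 1", "y": "t-SNE 2", "color": "cluster"},
    )
    fig.write_html(out_html)              # self-contained interactive HTML page
    return fig

Opening the exported HTML file in a browser gives the zooming, download, and hover features mentioned above.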

In summary, the t-SNE/DBSCAN/Plotly combination makes a practical workflow for unsupervised clustering of samples in high-dimensional data. The whole workflow is implemented in the RoCA project.

Posted in Uncategorized

Reduce Snakemake runtime by collapsing similar jobs into one multiprocessing job

Snakemake is a pipeline management system written in Python. It enables users to encode tasks as rules that describe how to create output files from input files. These rules are written in a Snakefile (similar to a makefile), and are chained together using input and output files (matched by file path patterns) to form a pipeline. For instance, a simple variant calling pipeline that starts with fastq files and ends with vcf files has an alignment rule, a sort/index bam rule, and a variant calling rule. The alignment rule shows how to take a fastq file and make an aligned bam file. The sort/index rule describes how to make a sorted and indexed bam file from the aligned bam file. The variant calling rule shows how to use the indexed bam file to make a vcf file containing variants.
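
As a hypothetical sketch of what such a three-rule Snakefile might look like (the tool choices, reference name, and file layout are made up for illustration and are not from this post):

# hypothetical three-rule variant calling pipeline, for illustration only
rule align:
    input:  'fastq/{sample}.fastq'
    output: 'bam/{sample}.bam'
    shell:  'bwa mem ref.fa {input} | samtools view -b - > {output}'

rule sort_index:
    input:  'bam/{sample}.bam'
    output: bam='sorted/{sample}.bam',
            bai='sorted/{sample}.bam.bai'
    shell:  'samtools sort -o {output.bam} {input} && samtools index {output.bam}'

rule call_variants:
    input:  'sorted/{sample}.bam'
    output: 'vcf/{sample}.vcf'
    shell:  'bcftools mpileup -f ref.fa {input} | bcftools call -mv -o {output}'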

I use Snakemake in every project because it allows me to maintain a record of every step in my analysis. It also has a number of convenient features. It will re-run rules based on upstream file changes. In the variant calling example, if I change the mapping of my bam file, Snakemake knows that my vcf file needs to be regenerated. Snakemake will show me the progress of my pipeline, as well as tell me what rules will be executed in a dry run of the pipeline. I consider Snakemake to be one of the most valuable tools I use, but it does have some issues.

The biggest problem I have with Snakemake is scalability. Snakemake becomes unbearably slow when it manages over ten thousand files. It can easily take half an hour for Snakemake to do a dry run of some of my pipelines. That’s a long time to wait if I just need to process 10 new samples. Below I’ll describe a specific instance of a scaling issue I had, and how I solved it using Python’s multiprocessing module.

I start with 700 whole genome samples. The variant calls and associated data like read depth and alternative allele count are stored in sample specific HDF5 files. I have another file with thousands of positions that I want to query against each sample’s HDF5 variant set. I begin with one sample, and query its HDF5 variant file using Python’s PyTables. In my Snakefile, I have one rule that takes an HDF5 file and my query position file as input, and runs a script to make an output file containing the query results.

# Snakefile 1: queries all positions by sample
# too slow

rule query_hdf5:
    input:  'hdf5bySample/{sample}.hdf',
            'query_pos_file'
    output: 'hdf5QueryAll/{sample}'
    shell:  'python query_hdf5.py {wildcards.sample} {input} {output}'

rule all:
    input: 'hdf5QueryAll/sample_1'

I run the rule for one sample, and I kill the job after half an hour because it has only finished a few chromosomes. This rule needs to run chromosomes in parallel to finish one sample in a reasonable amount of time.

I split my query position file by chromosome, and alter my Snakefile rule so that it takes a chromosome specific query position file. When all 24 chromosomes run at once, I can do one sample in ten minutes.

# Snakefile 2: queries chrom positions by sample
# runs fast in parallel

rule query_hdf5:
    input:  'hdf5bySample/{sample}.hdf',
            'query_pos_file.{chrom}'
    output: 'hdf5QueryAll/{sample}.{chrom}'
    shell:  'python query_hdf5.py {wildcards.sample} {input} {output}'

rule all:
    input: expand('hdf5QueryAll/sample_1.{chrom}', chrom=CHROMS)
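
One detail these snippets leave implicit is where CHROMS (and, in the next version, SAMPLES) come from; they need to be defined near the top of the Snakefile. A hypothetical definition (not from the original post) might look like:

# hypothetical definitions, not from the original post
CHROMS = [str(c) for c in range(1, 23)] + ['X', 'Y']    # 24 chromosomes
SAMPLES, = glob_wildcards('hdf5bySample/{sample}.hdf')  # one entry per HDF5 file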

When I try to run several hundred samples at the same time, I hit Snakemake’s scalability issue. Instead of running the initial sample based rule 700 times in Snakefile 1, I’m running the sample/chromosome based rule 16,800 times in Snakefile 2. Snakemake is now so slow that I can grab lunch before it finishes enumerating all the files it needs to make. This will be horrible for me in a month when I come back with more samples, or need to add an additional rule to the pipeline.

I have a solution that maintains the ten minute per sample run time, and provides Snakemake with only 700 output files (one per sample) to manage. I maintain one Snakemake rule per sample. That rule takes the sample’s HDF5 variants as well as 24 files listing the query positions by chromosome, and outputs one file listing all query results.

# Snakefile 3: rules operate by sample
# but still split queries by chrom
# query_hdf5_mp.py handles multiprocessing by chrom

rule query_hdf5:
    input:   'hdf5bySample/{sample}.hdf',
             expand('query_pos_file.{chrom}', chrom=CHROMS)
    output:  'hdf5QueryAll/{sample}'
    threads: 24
    shell:   
             'python query_hdf5_mp.py {wildcards.sample} \
              {input} {output}'

rule all:
    input: expand('hdf5QueryAll/{sample}', sample=SAMPLES)

Assigning 24 threads to this rule allows me to grab 24 slots for multiprocessing when run on a cluster. Inside this rule, I use a script that creates one process per chromosome. This process loops through the chromosome’s query positions, and queries the HDF5 variants. When all processes are done, the results are collapsed into one file. The query script with multiprocessing is below.

"""
query_hdf5_mp.py
Python script to query all chroms in parallel
"""
import sys, tables, csv, os
from multiprocessing import Process, Pool

def query_hdf(table, chrom, start):
    return [x for x in 
            table.where('(chrom == %d) & (pos == %d)'
                         % (chrom, start))]
def process_row(sample, row, table, fout):
    start = int(row['start']) + 1
    chrom = row['chrom']
    result_ls = query_hdf(table, chrom, start)
    for r in result_ls:
        print('\t'.join(r), file=fout)

def query_file(params):
    query_file_name, out_file_name, hdf_file, sample = params

    # open connection to hdf5
    h5file = tables.open_file(hdf_file, mode="r")
    table = h5file.root.posCollection.posLs

    with open(query_file_name) as f, open(out_file_name, 'w') as fout:
        reader = csv.DictReader(f)
        for row in reader:
            process_row(sample, row, table, fout)

def main(sample, hdf_file, query_files, out_file):
    # create tmp files to store query results
    out_files = [afile + '.' + sample + '.tmp'
                 for afile in query_files]

    # run 24 processes at a time
    pool = Pool(processes=24)
    process_param_ls = []
    for query_pos_file, out_file in zip(query_files, out_files):
        args = [query_pos_file, out_file, hdf_file, sample]
        input_ls.append(args)

    # start processes
    result = pool.map(read_file, input_ls)

    # cat tmp query files
    os.system('touch ' + out_file)
    for gemini_file in files:
        os.system('cat {} >> {}'.format(gemini_file + '.' + sample
                                        + '.tmp',
                                        out_file))

if __name__ == "__main__":
    sample, hdf_file = sys.argv[1:3]
    gemini_files = sys.argv[3:-1]
    out_file = sys.argv[-1]
    main(sample, hdf_file, gemini_files, out_file)

I can run this on our cluster using 24 slots for each query job, and Snakemake uses almost no time determining what rules to run:

snakemake -s snakefile.py all --drmaa " -pe smp {threads}" \
-j 100 --latency-wait 40 --greediness 0.7

In conclusion, if you find yourself waiting on Snakemake dry runs and dependency checks, try limiting the number of files you produce by moving some of the parallelism into a Python multiprocessing script.

Posted in Uncategorized

Gene function prediction: insights into the correlation of protein and mRNA

How often do we try to compare transcript and protein expression levels, only to find that they just don’t agree? Can we really use transcriptome-level information to predict cellular phenotype, which is mostly governed by proteins? Recent technological advances in quantifying proteins (mass spectrometry-based proteomics) and transcripts (RNA-Seq) have helped us understand the dichotomy between mRNA and protein levels in complex biological systems.

The central dogma of biology tightly links DNA, RNA, and protein in biological samples. It is mechanistically well understood how genes are transcribed, how mRNA is processed and then translated into amino acid chains at the ribosome, and how those chains subsequently fold into functional proteins. However, the relationship between the mRNA and protein abundances derived from a particular locus is not trivial (Li and Biggin, 2015).

There are multiple biological processes beyond transcript concentration that affect the expression level of a protein. These include translation rate, translation rate modulation, modulation of protein half-life, protein synthesis delay, and protein delocalization. There is therefore a significant discrepancy between mRNA and protein profiles in complex biological samples. Zhang et al. (Zhang et al., 2014) presented the first global analysis of transcript-protein relationships in a large human colon and rectal cancer cohort (87 tumors for which 3,764 genes had both mRNA and protein measurements). They showed that although 89% of the genes had a positive mRNA-protein correlation, only 32% had a statistically significant correlation. The average Spearman’s correlation between mRNA and protein variation was 0.23, which implies that mRNA abundance explains only a modest fraction of the variation in protein concentration (Figure 1). They also showed that genes encoding intermediary metabolism functions (e.g. arginine and proline metabolism) had high mRNA-protein correlations, whereas genes involved in regulation, chromatin organization, and transcriptional regulation had low or negative correlations.

Figure 1. The mean Spearman’s correlation for mRNA and protein variation is 0.23 (A). The poor concordance between protein and mRNA variation can be related to the biological function of the gene product (B) (figure modified from Zhang et al., 2014).
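
As a side note on how such per-gene correlations are typically computed (an illustrative sketch, not the code used by Zhang et al.), one can pair each gene’s mRNA and protein profiles across the same samples and take the Spearman correlation; mrna and protein below are hypothetical genes-by-samples pandas DataFrames sharing the same gene index and sample columns:

# Illustrative sketch only: per-gene Spearman correlation between matched
# mRNA and protein profiles measured on the same samples.
import pandas as pd
from scipy.stats import spearmanr

def mrna_protein_correlation(mrna, protein):
    genes = mrna.index.intersection(protein.index)        # genes measured on both levels
    samples = mrna.columns.intersection(protein.columns)  # shared samples
    rows = []
    for g in genes:
        rho, p = spearmanr(mrna.loc[g, samples], protein.loc[g, samples],
                           nan_policy="omit")
        rows.append({"gene": g, "spearman_rho": rho, "p_value": p})
    return pd.DataFrame(rows).set_index("gene")

# e.g. corr = mrna_protein_correlation(mrna, protein); corr["spearman_rho"].mean()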

Several new proteogenomics studies published in high-impact journals (Liu et al., 2016; Mertins et al., 2016; Ruggles et al., 2016; Zhang et al., 2014; Zhang et al., 2016) showed that the proteome tracks the cell phenotype better than the transcriptome does. But can we improve biological function prediction if we only have RNA-Seq data for our samples? In a new study by Jing Wang et al. (Wang et al., 2017), gene co-expression networks were constructed from mRNA and protein profiling data sets for three cancer types (breast, colorectal, and ovarian). The authors systematically investigated the relative utility of mRNA and proteome data in predicting co-functionality. They concluded that mRNA co-expression is driven by both co-function and chromosomal co-localization of the genes, whereas protein co-expression is driven primarily by functional similarity between co-expressed genes. Therefore, by taking the chromosomal localization of genes into account, we may improve the utility of RNA-Seq data for the functional characterization of genomes.

References:

Li, J.J., and Biggin, M.D. (2015). Gene expression. Statistics requantitates the central dogma. Science 347, 1066-1067.

Liu, Y., Beyer, A., and Aebersold, R. (2016). On the Dependency of Cellular Protein Levels on mRNA Abundance. Cell 165, 535-550.

Mertins, P., Mani, D.R., Ruggles, K.V., Gillette, M.A., Clauser, K.R., Wang, P., Wang, X., Qiao, J.W., Cao, S., Petralia, F., et al. (2016). Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55-62.

Ruggles, K.V., Tang, Z., Wang, X., Grover, H., Askenazi, M., Teubl, J., Cao, S., McLellan, M.D., Clauser, K.R., Tabb, D.L., et al. (2016). An Analysis of the Sensitivity of Proteogenomic Mapping of Somatic Mutations and Novel Splicing Events in Cancer. Mol Cell Proteomics 15, 1060-1071.

Wang, J., Ma, Z., Carr, S.A., Mertins, P., Zhang, H., Zhang, Z., Chan, D.W., Ellis, M.J., Townsend, R.R., Smith, R.D., et al. (2017). Proteome Profiling Outperforms Transcriptome Profiling for Coexpression Based Gene Function Prediction. Mol Cell Proteomics 16, 121-134.

Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., Shi, Z., Chambers, M.C., Zimmerman, L.J., Shaddox, K.F., Kim, S., et al. (2014). Proteogenomic characterization of human colon and rectal cancer. Nature 513, 382-387.

Zhang, H., Liu, T., Zhang, Z., Payne, S.H., Zhang, B., McDermott, J.E., Zhou, J.Y., Petyuk, V.A., Chen, L., Ray, D., et al. (2016). Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell 166, 755-765.


Posted in Uncategorized

Turning Scala code into Spark

Note: reposted from http://anthonycros.com/turn-scala-into-spark/

The other day I found myself explaining to a bioinformatician on our team how we could process some data at scale. She is mostly used to processing data in-memory due to the relatively small size of the data she typically deals with, but this time we were going to have to leverage a cluster in order to make it happen. The size of the data wasn’t all that big, somewhere in the vicinity of 1TB, but it was just too big to fit in the most powerful machine we had handy. An HPC-based solution could also have been considered here, but would have required more low-level manipulations[1]. So I thought it’d be a good occasion to introduce her to Apache Spark. But because she was mostly used to processing data in an imperative manner (as opposed to the declarative style used here), I figured I would first show her what the processing flow would look like in Scala[2]. That way I would not introduce both a new computational paradigm and an entirely new framework at once.


Scala approach

The original task at hand, part of a Stand Up 2 Cancer effort, consisted of turning a bunch of CSV files[3] into a highly nested structure and annotating the nested documents using various web services. The most important part was to do two nested GROUP BYs, so I only kept that part. I made up two dummy input files that vaguely resembled the original ones:

f1.tsv:

gene  sample  other1
g1    s1      a1
g1    s2      b1
g1    s3a     c1
g2    s4      d1

And

f2.tsv:

gene  sample  other21  other22
g1    s1      a21      a22
g1    s2      b21      b22
g1    s3b     c21      c22
g2    s4      d21      d22
g3    s5      f21      f22

And came up with the following Scala code:

package demo.scala

// intentionally verbose so it reads more easily
// shouldn't actually be all in one file either
// ===========================================================================
object Inputs {

    trait HasGene   { val gene:   String } 
    trait HasSample { val sample: String }

    sealed trait F
      extends HasGene
         with HasSample

    // ---------------------------------------------------------------------------
    case class F1(
            gene:   String,
            sample: String,
            other1: String)
        extends F

        object F1 {

            def apply(line: String): F1 = { // factory of F1s
                val it = line.split("\t").iterator

                F1(
                    gene   = it.next, // TODO: should add checks
                    sample = it.next,
                    other1 = it.next)
            }

        }

    // ---------------------------------------------------------------------------
    case class F2(
            gene:    String,
            sample:  String,
            other21: String,
            other22: String)
        extends F

        object F2 {

            def apply(line: String): F2 = { // factory of F2s
                val it = line.split("\t").iterator

                F2(
                    gene    = it.next,
                    sample  = it.next,
                    other21 = it.next,
                    other22 = it.next)
            }

        }

}

// ===========================================================================
object Outputs {

  import Inputs._

    // ---------------------------------------------------------------------------
    case class GeneCentric(
        gene: String,
        samples: Iterable[SampleCentric])

    // ---------------------------------------------------------------------------
    case class SampleCentric(
        sample: String,
        extras: Iterable[Extra])

    // ---------------------------------------------------------------------------
    sealed trait Extra

        object Extra {

            case class Extra1(
                other1: String)
                extends Extra

            case class Extra2(
                other21: String,
                other22: String)
                extends Extra

            // factory of "extras" (either as Extra1 or Extra2,
            // based on the type of f we get)
            def apply(f: F): Extra =
                // this pattern matching is safe because F is sealed
                // (compiler will warn if we forget a case)
                // pattern matching is one of the most powerful scala constructs, see
                // alvinalexander.com/scala/using-match-expression-like-switch-statement
                f match {               
                    case F1(_, _, other1) =>           Extra1(other1)
                    case F2(_, _, other21, other22) => Extra2(other21, other22)
                }

        }

}

// ===========================================================================
object Demo extends App { // i.e. main ("App" trait)

    import Inputs._
    import Outputs._
    import Outputs.Extra._

    // ---------------------------------------------------------------------------
    // These are simply type aliases used for illustration purposes here
    type GenericSeq[A]      = Seq[A]
    type GenericIterable[A] = Iterable[A]

    // ---------------------------------------------------------------------------
    // read lines and transform each into a F1/F2
    // "apply()" is an implicit factory that case classes all have
    val f1s: GenericSeq[F] = readIn(args(0)).map(F1.apply)
    val f2s: GenericSeq[F] = readIn(args(1)).map(F2.apply)

    // ---------------------------------------------------------------------------
    val genes: GenericSeq[(String /* genes */, Map[String /* sample */, Iterable[F]])] =
        f1s

            // combine both file contents
            .union(f2s)

            // group by "gene" since they both are guaranteed
            // to have this property (they transitively extend HasGene via F)
            .groupBy(f => f.gene)

            // ignore key for now and focus on values (the groupings)
            .mapValues(geneGroup =>
                geneGroup

                    // within each such grouping,
                    // do the same thing but with "sample" 
                    .groupBy(f => f.sample))
            .toSeq

    //---------------------------------------------------------------------------
    val geneCentrics: GenericIterable[GeneCentric] =
        genes
            .map { case (genes, samples) => genes ->
                samples
                    .mapValues(f =>

                        // lastly extract last bits of interest
                        f.map(Extra.apply))

                    // we can now build the sample-centric object
                    // (does not capture the parent gene, though it could)
                    // note that this uses a scala trick to be able to pass
                    // a tuple to the primary constructor
                    .map((SampleCentric.apply _).tupled) }

            // we can now build the gene-centric object
            .map((GeneCentric.apply _).tupled)

    // ---------------------------------------------------------------------------
    // write all as giant json array
    val fw = new java.io.FileWriter("/tmp/demo.json")
        fw.write( // TODO: a scalable solution should stream instead
                geneCentrics
                    .map(geneCentric =>
                            net.liftweb.json.Serialization.writePretty
                                (geneCentric)
                                (net.liftweb.json.DefaultFormats))
                    .mkString("[", ",", "]"))
            fw.close()

    println("done.")

  // ===========================================================================
  def readIn(filePath: String): GenericSeq[String] =
    scala.io.Source
          .fromFile(filePath) // TODO: should actually be closed
            .getLines()
            .drop(1) // drop header (TODO: ideally would read schema from it)
            .toSeq // TODO: a scalable solution should stream instead

}

The code produces the following JSON output (slightly reformatted, order not guaranteed):

[{
  "gene":"g3",
  "samples":[
    {
      "sample":"s5",
      "extras":[
        { "other21":"f21", "other22":"f22" } ]
    }
  ]
},{
  "gene":"g2",
  "samples":[
    {
      "sample":"s4",
      "extras":[
        { "other1":"d1" },
        { "other21":"d21", "other22":"d22" }
      ]
    }
  ]
},{
  "gene":"g1",
  "samples":[
    {
      "sample":"s2",
      "extras":[
        { "other1":"b1" },
        { "other21":"b21", "other22":"b22" }
      ]
    },
    {
      "sample":"s1",
      "extras":[
        { "other1":"a1" },
        { "other21":"a21", "other22":"a22" }
      ]
    },
    {
      "sample":"s3b",
      "extras":[
        { "other21":"c21", "other22":"c22" } ]
    },
    {
      "sample":"s3a",
      "extras":[
        { "other1":"c1" } ]
    }
  ]
}]

Now obviously the above code would be way overkill for such a trivial task, but I also thought it had value in showing the strength of the Scala type system[4]. If one pays close attention to the code, the readIn method and its call sites are really the only places where a runtime error can actually happen. The rest is guaranteed safe by the compiler. Now it doesn’t guarantee that the program is semantically correct, but it guarantees that we’ve wired the various entities in a compatible manner. So the idea is to imagine how this could be part of a much more complex pipeline, where having such compile-time support would really pay dividends (otherwise I still like a good old Python or Groovy script for simpler tasks).

So after walking her through that Scala code, it seems she was able to understand roughly how the different pieces fit together. She was still a bit unsettled by this new way of thinking about computation, but seemed to see the value in a more declarative approach to data processing.

The next logical step was to show her the Spark equivalent, albeit with its Scala API since I’m not as a familiar with the python one. All that would be left for her to do was to “translate” it to the python API, since that’s the language she was most comfortable with. So I set out to rewrite the code with Spark. But slowly, I realized I was almost exactly replicating what I had just written in Scala… So upon completion, I tried reorganizing my Spark code in order to make it resemble the Scala one. And that’s when it really struck me, how similar Scala collections and Spark really are… I was aware of the similarities, but it’s not until I finished this little exercise that I realized how much so![5]

So for the rest of this post, I propose to walk you through the few steps required to turn the above Scala code into Spark code that can run locally (not all that useful for big computation but that’s beside the point here).


Spark approach

Here is the resulting file diffed with the original file:

[Figure: side-by-side diff of the Spark version against the original Scala file]

Let’s walk through the differences.

At the top, we need to import the relevant Spark entities:

import org.apache.spark
import org.apache.spark.{SparkContext,SparkConf}
import org.apache.spark.rdd.RDD

We then need to change the collection types of the – admittedly contrived – collection aliases I created for illustration purposes:

- type GenericSeq[A]      = Seq[A]
+ type GenericSeq[A]      = RDD[A]
- type GenericIterable[A] = Iterable[A]
+ type GenericIterable[A] = RDD[A]

Then we need to add a SparkContext like so:

val sc =
  new spark.SparkContext(
    new spark.SparkConf()
      .setAppName("demo")
      .setMaster("local"))

A little bit of cheating to compensate for the fact that Spark uses sequences of tuples instead of maps:

- .toSeq

A way to collect the result, i.e. actually triggering the workflow (which up until then is only a “plan of action”):

+ .collect()

Lastly we need to change how the files are being read in:

    def readIn(filePath: String): GenericSeq[String] =
-       io.Source
-           .fromFile(filePath) // TODO: close
-           .getLines()
-           .drop(1) // drop header
-           .toSeq
+       sc
+           .textFile(filePath)
+           // not quite as clean as the Scala version
+           // but I didn't want to complicate things too much here
+           .filter(!_.startsWith("gene\t")) 

Et voilà ![6]


Conclusion

As you can see, 90% of the lines are the same, verbatim. And obviously, it produces the same result (order aside)!

In an upcoming post, I will show a significantly more compact way to express the same computation in Scala, though obviously at the expense of readability and scalability. I will then compare it to a python equivalent and comment on the relative sizes and merits of each[7].

Now I recognize that Scala is far from the easiest language to deal with, and that many things could be done to bolster its adoption in bioinformatics[8], but one has to admit that turning the above code into a scalable equivalent in just a few line changes is rather impressive!

Anthony Cros

@ The Data Access & Management Group

@ The Bioinformatics Group

@ The Department of Biomedical and Health Informatics

@ The Children’s Hospital of Philadelphia

A big thank you to the people who were nice enough to review the post, in alphabetical order: Emilie Lalonde, Pichai Raman, Deanne Taylor and Bob Tiernay.


Footnotes:

  • [1] The kind I would like bioinformaticians in our team not to have to deal with: they have precious knowledge I’d rather see applied to solve new problems rather than reinventing the wheel every time.
  • [2] See also the excellent Twitter Scala School and Daniel Westheide‘s Neophyte’s Guide to Scala as learning resources.
  • [3] Bioinformaticians sure like their CSV files!
  • [4] Not the most diplomatic case for it but still valuable.
  • [5] I had googled it around first and found relatively little guidance on the matter.
  • [6] I’m actually French, so it’s legit.
  • [7] It is possible to write very compact Scala code, though I tend to find that it defeats its purpose if overused.
  • [8] See upcoming post (in the works, I’ll link from here once it’s ready).
Posted in Uncategorized

Continuous Integration

Automated code testing with Jenkins.


Overview

It is critical to build and test your application at various stages of its development. Every once in a while you’ll build the application from the integrated code (assuming the various components or units of the code have already passed unit tests) and run it on test data to make sure the application works as expected and to uncover bugs as early as possible. Ideally you’d like a system that triggers the build process upon specific events, such as the merge of a branch into the master branch of a GitHub repository, if you have decided to always build the final application from the code in the master branch (the master branch is the default branch created when you create a GitHub repository; please click here for more info). This is especially important if the code is contributed by multiple developers and you want to ensure that the integrated code (and the product built from it) in the master branch is always in working condition.

In the development of bioinformatics applications, we often build pipelines to manage workflows that integrate various tools to process biological data for data mining and knowledge discovery, such as processing sequence data to identify genes or mutations associated with diseases. In such cases, we need to test a pipeline every time new tools are incorporated into it or parameters are changed for some of its tools.

Here we present the approach we implemented in the bioinformatics group of the Department of Biomedical and Health Informatics at the Children’s Hospital of Philadelphia (CHOP) to automatically run a pipeline with Jenkins when changes to the pipeline code are pushed or merged to the master branch of the GitHub repository. In this particular case, the pipeline is coded in a snakefile and executed with snakemake. Snakemake is a general-purpose workflow management system coupled with the Python language. Please read the tutorial if you’re not familiar with snakemake. The recipes of the pipeline come from more than one developer, each writing and testing code in a separate branch before pushing it and merging it into the master branch, whose code is used as the production copy. Once the code from a branch is merged into the master branch, a test of the pipeline with the merged code is automatically triggered, and the status of the test is reported to the developers (see below).

Fig. 1 Create Jenkins Project

Create Jenkins Project

Jenkins is an open source automation server for building, testing, and deploying projects. It works in a similar way to Travis CI, except that the repository can come from your internal GitHub Enterprise server and the tests are run on your own servers. Here we assume a Jenkins server has been set up and an account has been created for you on it.

To set up your application under development for continuous integration with Jenkins, the first step is to create a new project (item). In this specific case, we created a pipeline (Fig. 1). You can create another kind of project as appropriate for your case.

Configure Jenkins Project

Fig. 2 Set Notification Endpoints

Once a Jenkins project is created, you need to set its configuration. The settings described here are just for your reference; you should adjust them to fit your case.

Check the following items and set their parameters.

Set Job Notifications

Add two notification endpoints (one at job start and one at job completion) for Jenkins to send a notice to a web CGI script (Fig. 2). The CGI script jenkinsjobs can be downloaded from the links listed at the end of this post. Of course, you need to point the URL to your own web server.

Fig. 3 Set Build Trigger

Then check the following options and, if needed, adjust their settings:

  • This build is parameterized (Fig. 3). This option is for branch specification:
    • select String Parameter
    • define the name of the parameter to use (e.g. BRANCH)
    • set the default value (e.g. master)
    • write the description, if needed
  • Prepare an environment for the run
  • Keep Jenkins Environment Variables
  • Keep Jenkins Build Variables
  • Execute concurrent builds if necessary

Build Triggers

Fig. 4 Set Build Trigger

Check Trigger builds remotely (e.g., from scripts) and set an Authentication Token. The token will be included in the URL used to trigger the build (Fig. 4).

Please note that here you also have the option to have the build triggered by a push to a GitHub repository. We chose not to use this option because we wanted to trigger the build only by a push or merge to the master branch, not by a push to just any branch. The webhook you can set up at GitHub doesn’t give you the option to specify a particular branch. (Also, for unknown reasons, we could not get it to work after a half dozen attempts to trigger a build on the Jenkins server upon a push to the GitHub repository.)

Pipeline

You can specify how to run your pipeline when the build is triggered. In our case, we chose to run the pipeline in a Groovy script (select Pipeline script and check the Use Groovy Sandbox box) on a specific node, as shown below.

node ('respublica-slave') {
    // in Groovy
    stage 'Build'
    sh """
        # cd to the repository directory and check out the specified branch

        git checkout ${BRANCH}
        git pull

        # prepare your environments if necessary

        # our pipeline is coded in a snakefile
        snakemake
    """
}

Create Webhook on Github

On the GitHub repository page, click Settings on the menu bar, then click Hooks & services in the left panel, followed by the Add webhook button at the upper right. On the Webhooks / Manage webhook frame, specify:

  • Payload URL: http://mitomapd.research.chop.edu/cgi-bin/jenkins?user=zhangs3&branch=master&token=grin_test@chop&url=http://jenkins-ops-dbhi.research.chop.edu/view/BiG/job/grin_master/buildWithParameters
    • user: the user account under which to run the build on Jenkins. See comments in the script jenkins on how to authenticate to the Jenkins server.
    • branch: the branch(es) a push to which will trigger the build for the specified branch on the Jenkins server. You can specify more than one branch, separated by commas (,) or semicolons (;). You must specify at least one branch. Here the master branch is specified.
    • token: the token from Jenkins project, as specified in the configuration of the Jenkins project (see Fig. 4).
    • url: the URL for remotely triggering the build on Jenkins. Change the server name and the path to your Jenkins project to fit your case.
  • Content type: application/x-www-form-urlencoded

Note that you need to change the server and the path to the CGI script jenkins in the URL to fit your case. The script jenkins can be downloaded from the links listed below.
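
The jenkins script linked below is the authoritative version; purely to illustrate the flow it implements, here is a minimal CGI sketch in Python (Jenkins authentication omitted) that reads the user/branch/token/url parameters from the payload URL, checks the pushed branch reported by GitHub against the allowed branches, and calls the Jenkins buildWithParameters URL with the token:

#!/usr/bin/env python3
# Illustrative sketch only; the real "jenkins" CGI script linked below is the
# authoritative version, and Jenkins authentication is omitted here.
import cgi, json, urllib.parse, urllib.request

form = cgi.FieldStorage()   # user, branch, token, url from the payload URL
allowed = [b.strip() for b in
           form.getfirst("branch", "master").replace(";", ",").split(",")
           if b.strip()]

# GitHub sends the event JSON in the "payload" field when the webhook
# content type is application/x-www-form-urlencoded.
event = json.loads(form.getfirst("payload", "{}"))
pushed = event.get("ref", "").rsplit("/", 1)[-1]   # refs/heads/master -> master

if pushed in allowed:
    query = urllib.parse.urlencode({"token": form.getfirst("token", ""),
                                    "BRANCH": pushed})
    urllib.request.urlopen(form.getfirst("url", "") + "?" + query)
    status = "build triggered for branch " + pushed
else:
    status = "push to branch '%s' ignored" % pushed

print("Content-Type: text/plain")
print()
print(status)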

CGI Scripts on a Web Server

Click the script names to get them.

  • Script to receive notice from GitHub and trigger the build on Jenkins: jenkins
  • Script to receive notifications from the Jenkins server and send notifications (Slack/Email): jenkinsjobs

Acknowledgement

I’d like to thank Jeremy Leipzig for his input and help with the implementation of the CI system and the preparation of this post, and LeMar Davidson for his help with debugging the configuration of the project on the Jenkins server.


Posted in Uncategorized

Harvest: Data Discovery for Humans



Biomedical data can be a pain to work with. Data lives in multiple places: electronic health records, proprietary vendor databases, MS Access databases, and even flat files. Furthermore, healthcare data models are often incredibly complex, with multiple temporal attributes and a dizzying array of tables.

This kind of environment puts a damper on biomedical research, making it hard for analysts and clinicians to explore and test hypotheses about their patient populations. We believe that as medicine and biomedical research become increasingly data-driven, we need to democratize access to data for a wide range of people with varying skill sets in order to maximize its utility. For this reason, we created the Harvest framework: a toolkit for building highly interactive, data-intensive biomedical applications. Harvest was built using several guiding principles:

  1. Promote data discovery
  2. Enable real time query and reporting
  3. Render data in a way that makes sense to the clinician/researcher
  4. Give easy routes to customizing data sets for export to popular analysis software

We think Harvest is going to be very popular inside the biomedical research community, but we are also excited to see who else might want to use it. So where do you go if you want to reach the biggest audience of open source developers, users, and fans? You go to the O’Reilly Open Source Convention in Portland, Oregon. OSCON is the sort of place where you never know who you’re going to meet sitting at breakfast or lunch, and that’s part of the magic of it all. If you’re at OSCON on Thursday, come check out our talk, Harvest: Data Discovery for Humans, and let us know what you think!

Posted in Announcement