Advanced features

Handling multiple input files

Sometimes you have multiple input files that you want to merge for one sample. A common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged and then run through the pipeline as one. Rather than putting multiple lines in your sample annotation sheet, which causes conceptual and analytical challenges, we provide two ways to merge these inputs:

  1. Use shell expansion characters (like ‘*’ or ‘[]’) in your data_source definition or filename (good for simple merges)
  2. Specify a merge table which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).

To do the first option, just change your data source specifications, like this:

data_R1: "${DATA}/{id}_S{nexseq_num}_L00*_R1_001.fastq.gz"
data_R2: "${DATA}/{id}_S{nexseq_num}_L00*_R2_001.fastq.gz"
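
For example, for a hypothetical sample with id frog_1 (and nexseq_num 1), the R1 pattern above would expand to match the files from every lane, which looper then merges and passes to the pipeline as a single input:

${DATA}/frog_1_S1_L001_R1_001.fastq.gz
${DATA}/frog_1_S1_L002_R1_001.fastq.gz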

To do the second option, just provide a merge table in the metadata section of your project config:

metadata:
  merge_table: mergetable.csv

Make sure the sample_name column of this table matches the sample_name column in your sample annotation sheet, and then include any columns you need to point to the data. Looper will automatically include all of these files as input passed to the pipelines. Warning: do not use both of these options simultaneously for the same sample, as it will lead to multiple merges.

Note: different classes of input files, like read1 and read2, are not merged with each other; they should be handled as separate derived columns in the main sample annotation sheet (and therefore passed as separate arguments to the pipeline).
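
For example, a minimal merge table for a hypothetical sample sequenced on two lanes might look like this (the file_R1 and file_R2 column names are illustrative; use whatever data-pointing columns your pipeline arguments require):

sample_name,file_R1,file_R2
frog_1,data/frog_1_L001_R1.fastq.gz,data/frog_1_L001_R2.fastq.gz
frog_1,data/frog_1_L002_R1.fastq.gz,data/frog_1_L002_R2.fastq.gz

Here, the two rows for frog_1 are merged into one sample, while the R1 and R2 files remain separate arguments to the pipeline.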

Connecting to multiple pipelines

If you have a project that contains samples of different types, then you may need to specify multiple pipeline repositories for your project. Starting in version 0.5, looper can handle a priority list of pipelines, specified in the metadata.pipeline_interfaces attribute. Starting with version 0.6, these pointers should point directly at pipeline interface files (instead of at directories, as previously).

For example:

metadata:
  pipeline_interfaces: [pipeline_iface1.yaml, pipeline_iface2.yaml]

In this case, for a given sample, looper will first look in pipeline_iface1.yaml to see if an appropriate pipeline exists for this sample type. If it finds one, it will use this pipeline (or set of pipelines, as specified in the protocol_mappings section of the pipeline interface file). Having submitted a suitable pipeline, it will ignore the pipeline_iface2.yaml interface. However, if there is no suitable pipeline in the first interface, looper will check the second and, if it finds a match, will submit that. If no suitable pipeline is found in any of the interfaces, the sample will be skipped as usual.

If your project contains samples with different protocols, you can use this to run several different pipelines. For example, if you have ATAC-seq, RNA-seq, and ChIP-seq samples in your project, you may want to include pipeline interfaces for 3 different pipelines, each accepting one of those protocols. If more than one of the pipeline interface files provides a pipeline for the same protocol, looper will only submit the pipeline from the first interface. Thus, this list specifies a priority order for pipeline repositories.
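
As a sketch of how this priority works, suppose the two interfaces have protocol_mappings sections that overlap (the pipeline script names here are hypothetical):

# pipeline_iface1.yaml
protocol_mappings:
  ATAC-seq: atacseq.py
  RNA-seq: rnaseq.py

# pipeline_iface2.yaml
protocol_mappings:
  RNA-seq: alternate_rnaseq.py
  ChIP-seq: chipseq.py

With the priority list above, an RNA-seq sample would be submitted to rnaseq.py from the first interface (alternate_rnaseq.py is ignored), while a ChIP-seq sample, which has no match in the first interface, falls through to chipseq.py in the second.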

Command-line argument pass-through

Any command-line arguments passed to looper run that are not consumed by looper will be added to the command of every pipeline submitted in that looper run. This gives you a handy way to pass through command-line arguments that you want applied to every job in a given looper run. For example, pypiper pipelines understand the --recover flag; so if you want to pass this flag through looper to all your pipeline runs, you may run looper run config.yaml --recover. Since looper does not understand --recover, it will be passed to every pipeline. Obviously, this feature is limited to passing flags that looper does not understand, because arguments looper does understand will be consumed by looper and not passed through to individual pipelines.
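
For example, assuming your project config is named config.yaml:

# looper consumes its own arguments; --recover is unrecognized by looper,
# so it is appended to the command of every submitted pipeline:
looper run config.yaml --recover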

Project models

Looper uses the peppy package to model Project and Sample objects under the hood. These project objects are actually useful outside of looper. If you define your project using looper’s standardized project definition format, you can use the project models to instantiate an in-memory representation of your project and all of its samples, without using looper.

If you’re interested in this, you should check out the peppy package. All the documentation for model objects has moved there.
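
For example, here is a minimal sketch of using the models directly (assuming a project config named project_config.yaml; see the peppy documentation for the authoritative API):

from peppy import Project

# Instantiate an in-memory representation of the project
# defined by the standard project config file.
proj = Project("project_config.yaml")

# Access the project's samples without invoking looper.
for sample in proj.samples:
    print(sample.sample_name)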