How to define a project

To use looper with your project, you must define your project using Looper’s standard project definition format. If you follow this format, your project can be read not only by looper for submitting pipelines, but also by other tools for tasks like summarizing pipeline output, analysis in R (using the project.init package), or building UCSC track hubs.

The format is simple and modular, so you only need to define the components you plan to use. You need to supply 2 files:

  1. Project config file - a yaml file describing input and output file paths and other (optional) project settings
  2. Sample annotation sheet - a csv file with 1 row per sample

Quick example: In the simplest case, project_config.yaml is just a few lines of yaml. Here’s a minimal example:

metadata:
  sample_annotation: /path/to/sample_annotation.csv
  output_dir: /path/to/output/folder
  pipeline_interfaces: path/to/pipeline_interface.yaml

The output_dir key specifies where to save results. The pipeline_interfaces key points to your looper-compatible pipelines (described in linking the pipeline interface). The sample_annotation key points to another file, which is a comma-separated value (csv) file describing samples in the project. Here’s a small example of sample_annotation.csv:

Minimal Sample Annotation Sheet

sample_name,library,file
frog_1,RNA-seq,frog1.fq.gz
frog_2,RNA-seq,frog2.fq.gz
frog_3,RNA-seq,frog3.fq.gz
frog_4,RNA-seq,frog4.fq.gz

With those two simple files, you could run looper, and that’s fine for just running a quick test on a few files. In practice, you’ll probably want to use some of the more advanced features of looper by adding additional information to your configuration yaml file and your sample annotation csv file.

For example, by default, your jobs will run serially on your local computer, where you’re running looper. If you want to submit to a cluster resource manager (like SLURM or SGE), you just need to specify a compute section.
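
For instance, a minimal compute section might look like the sketch below (the partition name longq is just a placeholder; see the compute section later on this page for details):

compute:
  # placeholder partition/queue name; use whatever your cluster provides
  partition: longq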

Let’s go through the more advanced details of both annotation sheets and project config files:

Sample annotation sheet

The sample annotation sheet is a csv file containing information about all samples in a project. It should be regarded as immutable, and it is the most important piece of metadata in a project. One row corresponds to one sample (or, more specifically, to one pipeline run).

A sample annotation sheet may contain any number of columns you need for your project. You can think of these columns as sample attributes, and you may use these columns later in your pipelines or analysis (for example, you could define a column called organism and use this to adjust the reference genome to use for each sample).

Special columns

Certain keyword columns are required or provide looper-specific features. Any additional columns become attributes of your sample and will be part of the project’s metadata for the samples. You have complete control over any other column names you want to add, but there are a few reserved column names:

  • sample_name - a unique string identifying each sample [1]. This is the only required column, and it is required for Sample construction.
  • organism - a string identifying the organism (“human”, “mouse”, “mixed”). Recommended but not required.
  • library - While not needed to build a Sample, this column is required for submission of job(s). It specifies the source of data for the sample (e.g. ATAC-seq, RNA-seq, RRBS). Looper uses this information to determine which pipelines are relevant for the Sample.
  • data_source - This column is used by default to specify the location of the input data file for each sample. Usually you want your annotation sheet to specify the locations of files corresponding to each sample; this column lets you point to file locations with a neat string-replacement method that keeps things clean and portable (for details, see the advanced section Derived columns). In fact, any column specifying at least one input data file will do, and such a column is required for Looper to submit job(s) for a Sample.
  • toggle - If the value of this column is not 1, Looper will not submit the pipeline for that sample. This enables you to submit a subset of samples.

Here are a few example annotation sheets:

Example Sample Annotation Sheet

sample_name,library,organism,ip,data_source
atac-seq_PE,ATAC-seq,human,,microtest
atac-seq_SE,ATAC-seq,human,,microtest
chip-seq_PE,CHIP-seq,human,H3K27ac,microtest
chip-seq_SE,CHIP-seq,human,H3K27ac,microtest
chipmentation_PE,ChIPmentation,human,H3K27ac,microtest
chipmentation_SE,ChIPmentation,human,H3K27ac,microtest
cpgseq_example_data,CpG-seq,human,,microtest
quant-seq_SE,Quant-seq,human,,microtest
rrbs,RRBS,human,,microtest
rrbs_PE,RRBS,human,,microtest
wgbs,WGBS,human,,microtest
RNA_TRUseq_50SE,SMART,human,,microtest
RNA_SMART_50SE,SMART,human,,microtest
rrbs_PE_fq,RRBS,human,,microtest
rrbs_fq,RRBS,human,,microtest
Example Sample Annotation Sheet

sample_name,library,organism,flowcell,lane,BSF_name,data_source
albt_0h,RRBS,albatross,BSFX0190,1,albt_0h,bsf_sample
albt_1h,RRBS,albatross,BSFX0190,1,albt_1h,bsf_sample
albt_2h,RRBS,albatross,BSFX0190,1,albt_2h,bsf_sample
albt_3h,RRBS,albatross,BSFX0190,1,albt_3h,bsf_sample
frog_0h,RRBS,frog,,,,frog_data
frog_1h,RRBS,frog,,,,frog_data
frog_2h,RRBS,frog,,,,frog_data
frog_3h,RRBS,frog,,,,frog_data

Footnotes

[1] This should be a string without whitespace (spaces, tabs, etc.). If it contains whitespace, an error will be thrown. Similarly, looper will not allow any duplicate entries under sample_name.

Project config file

A minimal project config file requires very little: only a single section, metadata (see above). Here are additional details on this and other optional project config file sections:

Project config section: metadata

The metadata section contains paths to various parts of the project: the output directory (the parent directory), the results subdirectory, the submission subdirectory (where submit scripts are stored), and pipeline interfaces, plus a pointer to the sample annotation sheet. This is the only required section.

Example:

metadata:
  sample_annotation: /path/to/sample_annotation.csv
  output_dir: /path/to/output/folder
  pipeline_interfaces: /path/to/pipeline_interface.yaml
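
The optional results_subdir and submission_subdir keys used in the complete example at the end of this page can also be set here. A slightly fuller sketch (with placeholder paths) might look like:

metadata:
  sample_annotation: /path/to/sample_annotation.csv
  output_dir: /path/to/output/folder
  # optional: subdirectories under output_dir for sample results and submit scripts
  results_subdir: results_pipeline
  submission_subdir: submission
  pipeline_interfaces: /path/to/pipeline_interface.yaml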

Project config section: data_sources

The data_sources section uses regex-like commands to point to different spots on the filesystem for data. The variables (specified by {variable}) are populated by sample attributes (columns in the sample annotation sheet). You can also use shell environment variables (like ${HOME}) in these.

Example:

data_sources:
  source1: /path/to/raw/data/{sample_name}_{sample_type}.bam
  source2: /path/from/collaborator/weirdNamingScheme_{external_id}.fastq
  source3: ${HOME}/{test_id}.fastq

For more details, see Derived columns.

Project config section: derived_columns

derived_columns is a simple list that tells looper which column names it should treat as data sources. The corresponding sample attributes will then have as their value not the entry in the table, but the value derived from it by string replacement of sample attributes, as specified in the config file. This enables you to point to more than one input file for each sample (for example, read1 and read2).

Example:

derived_columns: [read1, read2, data_1]

For more details, see Derived columns.
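
To see how the pieces fit together, here is a small sketch (with hypothetical paths and sample names) of how a derived column is resolved: the annotation sheet stores a data source key, the data_sources section maps that key to a path template, and looper fills in the {variable} codes from each sample’s attributes.

# project_config.yaml (sketch)
derived_columns: [data_source]
data_sources:
  source1: /path/to/raw/data/{sample_name}.bam

# For a sample with sample_name "frog_1" whose data_source column contains
# "source1", the sample's data_source attribute becomes:
#   /path/to/raw/data/frog_1.bam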

Project config section: implied_columns

implied_columns lets you add (imply) additional sample attributes based on the value of an existing attribute, which can be useful for pipeline arguments. In the example below, any sample whose organism attribute is human gains two extra attributes: genome ("hg38") and macs_genome_size ("hs").

Example:

implied_columns:
  organism:
    human:
      genome: "hg38"
      macs_genome_size: "hs"

For more details, see Implied columns.

Project config section: subprojects

Subprojects are useful to define multiple similar projects within a single project config file. Under the subprojects key, you can specify names of subprojects, and then underneath these you can specify any project config variables that you want to overwrite for that particular subproject. Tell looper to load a particular subproject by passing --sp subproject-name on the command line.

For example:

subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv

This example specifies two subprojects with almost exactly the same settings, differing only in their metadata.sample_annotation parameter (each subproject points to a different sample annotation sheet). Rather than defining two 99% identical project config files, you can use a subproject.

Project config section: pipeline_config

Occasionally, a particular project needs to run a particular flavor of a pipeline. Rather than creating an entirely new pipeline, you can parameterize the differences with a pipeline config file, and then specify that file in the project config file.

Example:

pipeline_config:
  # pipeline configuration files used in project.
  # Key string must match the _name of the pipeline script_ (including extension)
  # Relative paths are relative to this project config file.
  # Default (null) means use the generic config for the pipeline.
  rrbs.py: null
  # Or you can point to a specific config to be used in this project:
  wgbs.py: wgbs_flavor1.yaml

This will instruct looper to pass -C wgbs_flavor1.yaml to any invocations of wgbs.py (for this project only). Your pipelines will need to understand the config file (which will happen automatically if you use pypiper).

Project config section: pipeline_args

Sometimes a project requires tweaking a pipeline, but does not justify a completely separate pipeline config file. For simpler cases, you can use the pipeline_args section, which lets you specify command-line parameters via the project config. This lets you fine-tune your pipeline, so it can run slightly differently for different projects.

Example:

pipeline_args:
  rrbs.py:  # pipeline identifier: must match the name of the pipeline script
    # here, include all project-specific args for this pipeline
    "--flavor": simple
    "--flag": null

For flag-like options (options without parameters), you should set the value to the yaml keyword null (which means no value). Looper will pass the key to the pipeline without a value. The above specification will now pass --flavor=simple and --flag (no parameter) whenever rrbs.py is invoked – for this project only. This is a way to control (and record!) project-level pipeline arg tuning. The only keyword here is pipeline_args; all other variables in this section are specific to particular pipelines, command-line arguments, and argument values.

Project config section: compute

You can specify project-specific compute settings in a compute section. However, you’re better off specifying this globally using a pepenv environment configuration. Instructions are at the pepenv repository. If you do need project-specific control over compute settings (like submitting a certain project to a certain resource account), you can do this by specifying variables in a project config compute section, which will override global pepenv values for that project only.

compute:
  partition: project_queue_name

Project config section: track_configurations

Warning

The track_configurations section is for making UCSC trackhubs. This is a work in progress that is functional, but ill-documented, so it is best avoided for now.

Project config complete example

Here’s an example. Additional fields can be added as well and will be ignored.
metadata:
  # Relative paths are considered relative to this project config file.
  # Typically, this project config file is stored with the project metadata
  # sample_annotation: one-row-per-sample metadata
  sample_annotation: table_experiments.csv
  # merge_table: input for samples with more than one input file
  merge_table: table_merge.csv
  # compare_table: comparison pairs or groups, like normalization samples
  compare_table: table_compare.csv
  # output_dir: the parent, shared space for this project where results go
  output_dir: /fhgfs/groups/lab_bock/shared/projects/example
  # results and submission subdirs are subdirectories under parent output_dir
  # results: where output sample folders will go
  # submission: where cluster submit scripts and log files will go
  results_subdir: results_pipeline
  submission_subdir: submission
  # pipeline_interfaces: the pipeline_interface.yaml file or files for Looper pipeline
  # scripts (and accompanying pipeline config files) for submission.
  pipeline_interfaces: /path/to/shared/projects/example/pipeline_interface.yaml

data_sources:
  # specify the ABSOLUTE PATH of input files using variable path expressions
  # entries correspond to values in the data_source column in sample_annotation table
  # {variable} can be used to replace environment variables or other sample_annotation columns
  # If you use {variable} codes, you should quote the field so python can parse it.
  bsf_samples: "$RAWDATA/{flowcell}/{flowcell}_{lane}_samples/{flowcell}_{lane}#{BSF_name}.bam"
  encode_rrbs: "/path/to/shared/data/encode_rrbs_data_hg19/fastq/{sample_name}.fastq.gz"

implied_columns:
  # supported genomes/transcriptomes and organism -> reference mapping
  organism:
    human:
      genome: hg38
      transcriptome: hg38_cdna
    mouse:
      genome: mm10
      transcriptome: mm10_cdna

pipeline_config:
  # pipeline configuration files used in project.
  # Default (null) means use the generic config for the pipeline.
  rrbs: null
  # Or you can point to a specific config to be used in this project:
  # rrbs: rrbs_config.yaml
  # wgbs: wgbs_config.yaml
  # cgps: cpgs_config.yaml