Hello World! example for looper

This tutorial demonstrates how to install looper and use it to run a pipeline on a PEP project.

1. Install the latest version of looper:

pip install --user --upgrade looper

2. Download and unzip the hello_looper repository

The hello_looper repository contains a basic functional example project (in /project) and a looper-compatible pipeline (in /pipeline) that can run on that project. Let's download and unzip it:

!wget https://github.com/pepkit/hello_looper/archive/master.zip
--2020-05-21 08:23:43--  https://github.com/pepkit/hello_looper/archive/master.zip
Resolving github.com (github.com)...
Connecting to github.com (github.com)||:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/pepkit/hello_looper/zip/master [following]
--2020-05-21 08:23:43--  https://codeload.github.com/pepkit/hello_looper/zip/master
Resolving codeload.github.com (codeload.github.com)...
Connecting to codeload.github.com (codeload.github.com)||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’

master.zip              [ <=>                ]   5.20K  --.-KB/s    in 0.004s  

2020-05-21 08:23:44 (1.25 MB/s) - ‘master.zip’ saved [5328]

!unzip master.zip
Archive:  master.zip
   creating: hello_looper-master/
  inflating: hello_looper-master/README.md  
   creating: hello_looper-master/data/
  inflating: hello_looper-master/data/frog1_data.txt  
  inflating: hello_looper-master/data/frog2_data.txt  
  inflating: hello_looper-master/looper_pipelines.md  
  inflating: hello_looper-master/output.txt  
   creating: hello_looper-master/pipeline/
  inflating: hello_looper-master/pipeline/count_lines.sh  
  inflating: hello_looper-master/pipeline/pipeline_interface.yaml  
   creating: hello_looper-master/project/
  inflating: hello_looper-master/project/project_config.yaml  
  inflating: hello_looper-master/project/sample_annotation.csv  

3. Run it

Run it by invoking looper run on the project configuration file:

!looper run hello_looper-master/project/project_config.yaml
Looper version: 1.2.0-dev
Command: run
Ignoring invalid pipeline interface source: ../pipeline/pipeline_interface.yaml. Caught exception: FileNotFoundError(2, 'No such file or directory')
> Not submitted: No pipeline interfaces defined
> Not submitted: No pipeline interfaces defined

Looper finished
Samples valid for job generation: 0 of 2
Commands submitted: 0 of 0
Jobs submitted: 0

1 unique reasons for submission failure: No pipeline interfaces defined

Summary of failures:
No pipeline interfaces defined: frog_2, frog_1

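Voila! When looper can load the pipeline interface, this single command runs your pipeline across every sample in the project. Note that the log captured above shows the failure case instead: looper could not find ../pipeline/pipeline_interface.yaml, so neither sample (frog_1, frog_2) was submitted.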

Exploring the results

Now, let's inspect the hello_looper repository you downloaded. It has 3 components, each in a subfolder:

!tree hello_looper-master/*/
hello_looper-master/data/
├── frog1_data.txt
└── frog2_data.txt
hello_looper-master/pipeline/
├── count_lines.sh
└── pipeline_interface.yaml
hello_looper-master/project/
├── project_config.yaml
└── sample_annotation.csv

0 directories, 6 files

These are:

  • /data -- contains 2 data files for 2 samples. These input files were each passed to the pipeline.
  • /pipeline -- contains the script we want to run on each sample in our project. Our pipeline is a very simple shell script named count_lines.sh, which (duh!) counts the number of lines in an input file.
  • /project -- contains 2 files that describe metadata for the project (project_config.yaml) and the samples (sample_annotation.csv). This particular project describes just two samples listed in the annotation file. These files together make up a PEP-formatted project, and can therefore be read by any PEP-compatible tool, including looper.
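
To make the pipeline component concrete, here is a minimal sketch of what a line-counting script could look like. It is a hypothetical stand-in, not the contents of the real count_lines.sh: the function name count_lines and the output wording are assumptions made for illustration.

```shell
# Hypothetical stand-in for pipeline/count_lines.sh; the real script may differ.
count_lines() {
    # Reading from stdin makes wc print only the count, without the file name.
    # Arithmetic expansion strips the padding that some wc implementations add.
    echo "Number of lines: $(( $(wc -l < "$1") ))"
}
```

Calling count_lines hello_looper-master/data/frog1_data.txt would report the line count for the first sample's data file. The script takes a single input file as its argument, which fits the way each input file is passed to the pipeline.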

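When we invoked looper from the command line, we told it to run project/project_config.yaml. looper reads that file, which points to a few things:

  • sample_annotation -- the sample_annotation.csv file, which lists the samples in the project
  • output_dir -- the folder where looper saves results ($HOME/hello_looper_results in this example)
  • pipeline_interfaces -- the pipeline_interface.yaml file, which tells looper how to connect the project to the pipeline (count_lines.sh)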

The 3 folders (data, project, and pipeline) are modular; there is no need for these to live in any predetermined folder structure. For this example, the data and pipeline are included locally, but in practice, they are usually in a separate folder; you can point to anything (so data, pipelines, and projects may reside in distinct spaces on disk). You may also include more than one pipeline interface in your project_config.yaml, so in a looper project, many-to-many relationships are possible.
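
Because the config can point anywhere, it helps to check where a relative entry such as ../pipeline/pipeline_interface.yaml actually lands. The sketch below assumes that relative paths are resolved from the directory containing the config file; that is an assumption to verify against the looper and PEP docs. The helper name resolve_from_config is hypothetical.

```shell
# Hypothetical helper: print the absolute location that a path in a project
# config refers to, ASSUMING it is resolved from the config file's directory.
resolve_from_config() (
    CDPATH=            # keep cd from searching other directories
    config_file=$1
    target=$2
    # Step into the config's directory, then into the target's directory.
    # The helper exits non-zero if either directory does not exist.
    cd "$(dirname "$config_file")" || exit 1
    cd "$(dirname "$target")" || exit 1
    printf '%s/%s\n' "$(pwd)" "$(basename "$target")"
)
```

For example, resolve_from_config hello_looper-master/project/project_config.yaml ../pipeline/pipeline_interface.yaml prints the absolute path being referenced. A non-zero exit status means the referenced folder is missing, which may help when looper reports that a pipeline interface source cannot be found.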

Pipeline outputs

Outputs of pipeline runs will be under the directory specified in the output_dir variable under the paths section in the project config file (see defining a project). Let's inspect that project_config.yaml file to see what it says under output_dir:

!cat hello_looper-master/project/project_config.yaml
  sample_annotation: sample_annotation.csv
  output_dir: $HOME/hello_looper_results
  pipeline_interfaces: ../pipeline/pipeline_interface.yaml

Alright, next let's explore what this pipeline stuck into our output_dir:

!tree $HOME/hello_looper_results
├── results_pipeline
└── submission
    ├── count_lines.sh_frog_1.log
    ├── count_lines.sh_frog_1.sub
    ├── count_lines.sh_frog_2.log
    ├── count_lines.sh_frog_2.sub
    ├── frog_1.yaml
    └── frog_2.yaml

2 directories, 6 files

Inside the output_dir there will be two directories:

  • results_pipeline - a directory with output of the pipeline(s), for each sample/pipeline combination (often one per sample)
  • submission - which holds a YAML representation of each sample, plus a submission script (.sub) and a log file (.log) for each submitted job
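
To read what each job printed, you can loop over the log files in the submission folder. This is a minimal sketch; the helper name show_job_logs is hypothetical, and it assumes the logs end in .log as in the listing above.

```shell
# Hypothetical helper: print every job log found under <output_dir>/submission.
show_job_logs() {
    for log in "$1"/submission/*.log; do
        # If nothing matched, the loop sees the unexpanded pattern; skip it.
        [ -e "$log" ] || continue
        printf '== %s ==\n' "$(basename "$log")"
        cat "$log"
    done
}
```

Run it as show_job_logs "$HOME/hello_looper_results" to see one header per job followed by that job's log.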

Going from here to running hundreds of samples of various types takes virtually the same effort!

A few more basic looper options

Looper also provides a few other simple arguments that let you adjust what it does. You can find a complete reference of usage in the docs. Here are a few of the more common options:

For looper run:

  • -d: Dry run mode (creates submission scripts, but does not execute them)
  • --limit: Only run a few samples
  • --lumpn: Run several commands together as a single job. This is useful when you have a quick pipeline to run on many samples and want to group them.
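
These options can be combined on one command line. The sketch below wraps looper run in a small function, looper_run, which is a hypothetical convenience and not part of looper: it prints each command and only executes it if looper is on the PATH. The numeric values passed to --limit and --lumpn are illustrative choices.

```shell
config=hello_looper-master/project/project_config.yaml

# Hypothetical wrapper: print the looper command, then run it if looper is installed.
looper_run() {
    echo "+ looper run $*"
    if command -v looper >/dev/null 2>&1; then
        looper run "$@"
    fi
}

# Preview one sample: create its submission script without executing it.
looper_run -d --limit 1 "$config"

# Group two commands into each submitted job.
looper_run --lumpn 2 "$config"
```

Starting with a dry run limited to one sample is a cheap way to inspect the generated submission script before submitting the whole project.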

There are also other commands:

  • looper check: checks on the status (running, failed, completed) of your jobs
  • looper summarize: produces an output file that summarizes your project results
  • looper destroy: completely erases all results so you can restart
  • looper rerun: reruns only the jobs that have failed

On your own

To use looper on your own, you will need to prepare 2 things: a project (metadata that define what you want to process), and pipelines (how to process data). To link your project to looper, you will need to define a project. You will want to either use pre-made looper-compatible pipelines or link your own custom-built pipelines. These docs will also show you how to connect your pipeline to your project.