Project Models

Workflow explained:
  • Create a Project object
    • Samples are created and added to the project automatically

In the process, Models will check:
  • Project structure (created if it does not exist)
  • Existence of a CSV sample sheet with the minimal required fields
  • Path to each sample’s input file (constructed, then checked for existence)
  • Read type and read length of samples (optional)


import os

from looper.models import Project
prj = Project("config.yaml")
# that's it!


# see all samples
prj.samples
# get fastq file of first sample
# get all bam files of WGBS samples
[s.mapped for s in prj.samples if s.library == "WGBS"]

prj.metadata.results  # results directory of project
# export again the project's annotation
prj.sheet.write(os.path.join(prj.metadata.output_dir, "sample_annotation.csv"))

# project options are read from the config file
# but can be changed on the fly:
prj = Project("test.yaml")
# change options on the fly
prj.config["merge_technical"] = False
# annotation sheet not specified initially in config file
class looper.models.AttributeDict(entries=None, _force_nulls=False, _attribute_identity=False)[source]

A class that converts nested mappings into objects whose key-value pairs are accessed with attribute syntax (attr_dict.attribute) instead of getitem syntax (attr_dict["key"]). The conversion is recursive, so nested mappings also become objects, which makes attribute traversal possible (e.g., attr_dict.attr.attr).
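
The recursive conversion described above can be illustrated with a minimal sketch. This is not looper's implementation: the class name AttrDictSketch and its add_entries method are hypothetical stand-ins chosen for illustration, and the real class takes further options that are ignored here.

```python
from collections.abc import Mapping


class AttrDictSketch:
    """Hypothetical stand-in for AttributeDict; not looper's code."""

    def __init__(self, entries=None):
        if entries:
            self.add_entries(entries)

    def add_entries(self, entries):
        # Accept either a mapping or an iterable of (key, value) pairs.
        pairs = entries.items() if isinstance(entries, Mapping) else entries
        for key, value in pairs:
            # Wrap nested mappings recursively so that attribute
            # traversal (obj.a.b) works at every level.
            if isinstance(value, Mapping):
                value = AttrDictSketch(value)
            setattr(self, key, value)

    def __getitem__(self, key):
        # Keep item-access syntax available alongside attribute syntax.
        return getattr(self, key)


config = AttrDictSketch({"metadata": {"output_dir": "/tmp/out"}, "genome": "hg38"})
print(config.metadata.output_dir)  # attribute traversal
print(config["genome"])            # getitem syntax still works
```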


Update this AttributeDict with provided key-value pairs.

Parameters:entries (collections.Iterable | collections.Mapping) – collection of pairs of keys and values

Copy self to a new object.

class looper.models.PipelineInterface(config)[source]

This class parses, holds, and returns information for a YAML file that specifies how to interact with each individual pipeline. This includes both the resources to request for cluster job submission and the arguments to be passed from the sample annotation metadata to the pipeline.

Parameters:config (str | Mapping) – path to file from which to parse configuration data, or pre-parsed configuration data.
choose_resource_package(pipeline_name, file_size)[source]

Select resource bundle for given input file size to given pipeline.

Parameters:
  • pipeline_name (str) – Name of pipeline.
  • file_size (float) – Size of input data.

Returns:resource bundle appropriate for given pipeline, for given input file size

Return type:

Raises:
  • ValueError – if indicated file size is negative, or if the file size value specified for any resource package is negative
  • _InvalidResourceSpecificationException – if no default resource package specification is provided
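
As a rough illustration of size-based bundle selection, the sketch below picks the bundle with the largest size threshold that does not exceed the input size. It is an assumption-laden sketch, not the library's code: the function name, the "file_size" threshold key, the "default" bundle name, and the use of KeyError in place of the dedicated exception are all hypothetical.

```python
def choose_resource_package_sketch(packages, file_size):
    """Return the bundle whose threshold is the largest one not exceeding file_size."""
    if file_size < 0:
        raise ValueError("File size must be non-negative, got {}".format(file_size))
    if "default" not in packages:
        # The documented method raises a dedicated exception type here;
        # KeyError is used only to keep this sketch self-contained.
        raise KeyError("No default resource package specified")

    thresholds = {}
    for name, bundle in packages.items():
        # Assumption: the default bundle applies from size 0 upward.
        threshold = 0.0 if name == "default" else float(bundle["file_size"])
        if threshold < 0:
            raise ValueError("Negative file size for package '{}'".format(name))
        thresholds[name] = threshold

    # Among bundles whose threshold is met, take the most demanding one.
    eligible = [name for name in thresholds if thresholds[name] <= file_size]
    best = max(eligible, key=thresholds.get)
    return packages[best]


packages = {
    "default": {"file_size": 0, "cores": 1, "mem": 4000},
    "large": {"file_size": 10, "cores": 8, "mem": 32000},
}
small_job = choose_resource_package_sketch(packages, 2.5)
large_job = choose_resource_package_sketch(packages, 12)
```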

Copy self to a new object.

get_arg_string(pipeline_name, sample, submission_folder_path='', **null_replacements)[source]

For a given pipeline and sample, return the argument string.

Parameters:
  • pipeline_name (str) – Name of pipeline.
  • sample (Sample) – current sample for which job is being built
  • submission_folder_path (str) – path to folder in which files related to submission of this sample will be placed.
  • null_replacements (dict) – mapping from name of Sample attribute to value to use in arg string if the Sample attribute’s value is null

Return str:command-line argument string for pipeline
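
One way to picture how an argument string might be assembled, including the role of null replacements, is sketched below. The function name, the option-to-attribute mapping, and the choice to skip options that remain null are assumptions made for illustration; the documented method may behave differently.

```python
def build_arg_string_sketch(arg_map, sample_attrs, **null_replacements):
    """Join option/value pairs, substituting replacements for null attributes."""
    parts = []
    for option, attr_name in arg_map.items():
        value = sample_attrs.get(attr_name)
        if value is None:
            # Fall back to the caller-supplied replacement, if any.
            value = null_replacements.get(attr_name)
        if value is None:
            # Assumption: an option with no usable value is left out.
            continue
        parts.append("{} {}".format(option, value))
    return " ".join(parts)


arg_map = {"--sample-name": "sample_name", "--genome": "genome"}
sample_attrs = {"sample_name": "s1", "genome": None}
arg_string = build_arg_string_sketch(arg_map, sample_attrs, genome="hg38")
print(arg_string)
```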

get_attribute(pipeline_name, attribute_key, path_as_list=True)[source]

Return the value of the named attribute for the pipeline indicated.

Parameters:
  • pipeline_name (str) – name of the pipeline of interest
  • attribute_key (str) – name of the pipeline attribute of interest
  • path_as_list (bool) – whether to ensure that a string attribute is returned as a list; this is useful for safe iteration over the returned value.
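
The reason for wrapping a string in a list is that iterating over a bare string yields its characters, not the string itself. A minimal sketch of that guard follows; ensure_list is a hypothetical helper name, not part of the documented API.

```python
def ensure_list(value):
    """Wrap a lone string in a list; leave other values unchanged."""
    return [value] if isinstance(value, str) else value


# Iterating a bare string walks its characters, which is rarely intended.
chars = [item for item in "a.py"]

# With the guard, a single path and a list of paths iterate the same way.
single = [item for item in ensure_list("a.py")]
several = [item for item in ensure_list(["a.py", "b.py"])]
```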

Translate a pipeline name (e.g., stripping file extension).

Parameters:pipeline (str) – Pipeline name or script (top-level key in pipeline interface mapping).
Returns:translated pipeline name, as specified in config or by stripping the pipeline’s file extension
Return type:str
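
The translation rule (use a configured name if present, otherwise strip the file extension) can be sketched as follows. The function name and the "name" key in the per-pipeline configuration are assumptions for illustration only.

```python
import os


def translate_pipeline_name_sketch(pipeline, interface_config):
    """Prefer a configured name; otherwise strip the script's extension."""
    entry = interface_config.get(pipeline) or {}
    configured = entry.get("name")  # hypothetical config key
    if configured:
        return configured
    return os.path.splitext(os.path.basename(pipeline))[0]


interface_config = {
    "wgbs.py": {},
    "rnaseq.py": {"name": "RNA-seq"},
}
stripped = translate_pipeline_name_sketch("wgbs.py", interface_config)
configured = translate_pipeline_name_sketch("rnaseq.py", interface_config)
```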

Names of pipelines about which this interface is aware.

Return Iterable[str]:
 names of pipelines about which this interface is aware

Keyed collection of pipeline interface data.

Return Mapping:pipeline interface configuration data

Determine whether the indicated pipeline uses looper arguments.

Parameters:pipeline_name (str) – name of a pipeline of interest
Returns:whether the indicated pipeline uses looper arguments
Return type:bool
class looper.models.Project(config_file, subproject=None, default_compute=None, dry=False, permissive=True, file_checks=False, compute_env_file=None, no_environment_exception=None, no_compute_exception=None, defer_sample_construction=False)[source]

A class to model a Project.

Parameters:
  • config_file (str) – Project config file (YAML).
  • subproject (str) – Subproject to use within configuration file, optional
  • default_compute (str) – Configuration file (YAML) for default compute settings.
  • dry (bool) – If dry mode is activated, no directories will be created upon project instantiation.
  • permissive (bool) – Whether an error should be thrown if a sample’s input file(s) do not exist or cannot be opened.
  • file_checks (bool) – Whether sample input files should be checked for their attributes (read type, read length) if this is not set in sample metadata.
  • compute_env_file (str) – Looperenv YAML file specifying compute settings.
  • no_environment_exception (type) – type of exception to raise if environment settings can’t be established, optional; if null (the default), a warning message will be logged, and no exception will be raised.
  • no_compute_exception (type) – type of exception to raise if compute settings can’t be established, optional; if null (the default), a warning message will be logged, and no exception will be raised.
  • defer_sample_construction (bool) – whether to wait to build this Project’s Sample objects until they’re needed, optional; by default, the basic Sample is created during Project construction

from looper.models import Project
prj = Project("config.yaml")

Create all Sample objects for this project for the given protocol(s).

Return pandas.core.frame.DataFrame:
 DataFrame built from the base version of each of this Project’s samples, for the indicated protocol(s) if given, else for all of this Project’s samples
build_submission_bundles(protocol, priority=True)[source]

Create pipelines to submit for each sample of a particular protocol.

The priority flag controls whether to submit pipeline(s) from only one of the project’s known pipeline locations with a match for the protocol, or from all locations with a match for the protocol.

Parameters:
  • protocol (str) – name of the protocol/library for which to create pipeline(s)
  • priority (bool) – whether to submit pipeline(s) from only the first of the pipelines location(s) (indicated in the project config file) that has a match for the given protocol; optional, default True

Return Iterable[(PipelineInterface, str, str)]:

Raises:AssertionError – if there’s a failure in the attempt to partition an interface’s pipeline scripts into disjoint subsets of those already mapped and those not yet mapped
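
The effect of the priority flag can be sketched with plain data structures. This is only an illustration under assumptions: the function name and the representation of locations as ordered (name, protocol mapping) pairs are hypothetical, and the real method returns richer bundles.

```python
def select_pipelines_sketch(locations, protocol, priority=True):
    """Collect (location, pipeline) pairs matching the protocol.

    locations is an ordered list of (location_name, {protocol: [pipelines]}).
    """
    selected = []
    for location, mapping in locations:
        pipelines = mapping.get(protocol)
        if not pipelines:
            continue  # this location has no match for the protocol
        selected.extend((location, pipeline) for pipeline in pipelines)
        if priority:
            break  # stop at the first location with a match
    return selected


locations = [
    ("repo_a", {"WGBS": ["wgbs.py"]}),
    ("repo_b", {"WGBS": ["wgbs_alt.py"], "ATAC": ["atac.py"]}),
]
first_only = select_pipelines_sketch(locations, "WGBS")
all_matches = select_pipelines_sketch(locations, "WGBS", priority=False)
```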


Environment variable through which to access compute settings.

Return str:name of the environment variable pointing to compute settings

Copy self to a new object.


Path to default compute environment settings file.


Finalize the establishment of a path to this project’s pipelines.

With the passed argument, override anything already set. Otherwise, prefer path provided in this project’s config, then local pipelines folder, then a location set in project environment.


Parameters:pipe_path (str) – (absolute) path to pipelines

Raises:
  • PipelinesException – if the (prioritized) search to confirm or set the pipelines directory fails
  • TypeError – if pipeline(s) path(s) argument is provided and can’t be interpreted as a single path or as a flat collection of path(s)
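
The precedence order described above (explicit argument, then project config, then local pipelines folder, then environment) can be sketched as a simple prioritized search. The function and parameter names are hypothetical, and LookupError stands in for the documented PipelinesException.

```python
def resolve_pipelines_dir_sketch(pipe_path=None, config_path=None,
                                 local_dir=None, env_path=None):
    """Return the first available location, in documented priority order."""
    if pipe_path is not None:
        # An explicit argument overrides anything already set.
        return pipe_path
    for candidate in (config_path, local_dir, env_path):
        if candidate:
            return candidate
    # Stand-in for the dedicated exception named in the documentation.
    raise LookupError("Could not establish a pipelines directory")


from_config = resolve_pipelines_dir_sketch(config_path="cfg/pipelines", env_path="env/pipelines")
overridden = resolve_pipelines_dir_sketch(pipe_path="cli/pipelines", config_path="cfg/pipelines")
```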

For this project, given a pipeline, return an argument string specified in the project config file.

static infer_name(path_config_file)[source]

Infer project name based on location of configuration file.

Provide the project with a name, taken to be the name of the folder in which its configuration file lives.

Parameters:path_config_file (str) – path to the project’s configuration file.
Return str:name of the configuration file’s folder, to name project.
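
A minimal sketch of the naming rule, assuming nothing beyond what is stated above (the project name is the name of the folder containing the configuration file). The function name is hypothetical, and any special cases handled by the real method are not reproduced.

```python
import os


def infer_name_sketch(path_config_file):
    """Name the project after the folder that holds its config file."""
    config_folder = os.path.dirname(os.path.abspath(path_config_file))
    return os.path.basename(config_folder)


example_path = os.path.join("projects", "microtest", "config.yaml")
project_name = infer_name_sketch(example_path)
```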

Creates project directory structure if it doesn’t exist.


Number of samples available in this Project.


Directory in which to place results and submissions folders.

By default, assume that the project’s configuration file specifies an output directory, and that this is therefore available within the project metadata. If that assumption does not hold, though, consider the folder in which the project configuration file lives to be the project’s output directory.

Return str:path to the project’s output directory, either as specified in the configuration file or the folder that contains the project’s configuration file.

Parse the provided YAML config file and check that required fields exist.

Raises:KeyError – if config file lacks required section(s)
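
The required-section check can be sketched on an already-parsed mapping. The function name and the example list of required sections are assumptions for illustration; reading the YAML file itself is left out so the sketch needs only the standard library.

```python
def check_required_sections_sketch(config, required):
    """Raise KeyError if the parsed config lacks any required section."""
    missing = [section for section in required if section not in config]
    if missing:
        raise KeyError(
            "Config lacks required section(s): {}".format(", ".join(missing))
        )
    return config


required_sections = ["metadata"]  # hypothetical requirement
parsed = {"metadata": {"output_dir": "/tmp/out"}}
checked = check_required_sections_sketch(parsed, required_sections)
```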

Names of folders to nest within a project output directory.

Return Iterable[str]:
 names of output-nested folders

Determine this Project’s unique protocol names.

Return Set[str]:
 collection of this Project’s unique protocol names

Names of metadata fields that must be present for a valid project.

Make a base project as unconstrained as possible by requiring no specific metadata attributes. It’s likely that some common-sense requirements may arise in domain-specific client applications, in which case this can be redefined in a subclass.

Return Iterable[str]:
 names of metadata fields required by a project

Names of samples of which this Project is aware.


Generic/base Sample instance for each of this Project’s samples.

Return Iterable[Sample]:
 Sample instance for each of this Project’s samples

Set the compute attributes according to the specified settings in the environment file.

Parameters:setting (str) – name for non-resource compute bundle, the name of a subsection in an environment configuration file
Return bool:success flag for attempt to establish compute settings

Make the project’s public_html folder executable.


Path to folder with default submission templates.

Return str:path to folder with default submission templates

Parse data from environment configuration file.

Parameters:env_settings_file (str) – path to file with new environment configuration data
class looper.models.ProtocolInterface(interface_data_source)[source]

PipelineInterface and ProtocolMapper for a single pipelines location.

This class facilitates use of pipelines from multiple locations by a single project. Also stored are path attributes with information about the location(s) from which the PipelineInterface and ProtocolMapper came.

Parameters:interface_data_source (str) – location (e.g., code repository) of pipelines

Fetch the mapping for a particular protocol, null if unmapped.

Parameters:protocol (str) – name/key for the protocol for which to fetch the pipeline(s)
Return str | Iterable[str] | NoneType:
 pipeline(s) to which the given protocol is mapped, otherwise null
fetch_sample_subtype(protocol, strict_pipe_key, full_pipe_path)[source]

Determine the interface and Sample subtype for a protocol and pipeline.

Parameters:
  • protocol (str) – name of the relevant protocol
  • strict_pipe_key (str) – key for specific pipeline in a pipeline interface mapping declaration; this must exactly match a key in the PipelineInterface (or the Mapping that represents it)
  • full_pipe_path (str) – (absolute, expanded) path to the pipeline script

Return type:Sample subtype to use for jobs for the given protocol that use the pipeline indicated

Raises:KeyError – if given a pipeline key that’s not mapped in this ProtocolInterface instance’s PipelineInterface


Determine pipeline’s full path, arguments, and strict key.

This handles multiple ways in which to refer to a pipeline (by key) within the mapping that contains the data that defines a PipelineInterface. It also ensures proper handling of the path to the pipeline (i.e., ensuring that it’s absolute), and that the text for the arguments is appropriately parsed and passed.

Parameters:pipeline_key (str) – the key in the pipeline interface file used for the protocol_mappings section. Previously was the script name.
Return (str, str, str):
 more precise version of input key, along with absolute path for pipeline script, and full script path + options
class looper.models.ProtocolMapper(mappings_input)[source]

Map protocol/library name to pipeline key(s). For example, “WGBS” –> wgbs.

Parameters:mappings_input (str | Mapping) – data encoding correspondence between a protocol name and pipeline(s)

Create command-line text for given protocol’s pipeline(s).

Parameters:protocol (str) – Name of protocol.

Copy self to a new object.

class looper.models.Sample(series, prj=None)[source]

Class to model Samples based on a pandas Series.

Parameters:series (Mapping | pandas.core.series.Series) – Sample’s data.
from looper.models import Project, SampleSheet, Sample
prj = Project("ngs")
sheet = SampleSheet("~/projects/example/sheet.csv", prj)
s1 = Sample(sheet.iloc[0])

Returns a pandas.Series object with all the sample’s attributes.

Return pandas.core.series.Series:
 pandas Series representation of this Sample, with its attributes.

Check provided sample annotation is valid.

Parameters:required (Iterable[str]) – collection of required sample attribute names, optional; if unspecified, only a name is required.

Copy self to a new object.


Determine which of this Sample’s required attributes/files are missing.

Return (type, str):
 hypothetical exception type along with message about what’s missing; null and empty if nothing exceptional is detected

Create a name for file in which to represent this Sample.

This uses knowledge of the instance’s subtype, sandwiching a delimiter between the name of this Sample and the name of the subtype before the extension. If the instance is a base Sample type, then the filename is simply the sample name with an extension.

Parameters:delimiter (str) – what to place between sample name and name of subtype; this is only relevant if the instance is of a subclass
Return str:name for file with which to represent this Sample on disk
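
The filename rule described above can be sketched as follows. The function name, the explicit subtype_name argument, and the ".yaml" extension are assumptions for illustration; the real method derives the subtype from the instance itself.

```python
def generate_filename_sketch(sample_name, subtype_name=None,
                             delimiter="_", extension=".yaml"):
    """Build <name>[<delimiter><subtype>]<extension> for a sample."""
    if subtype_name is None:
        # Base Sample: just the sample name plus the extension.
        base = sample_name
    else:
        # Subclass: sandwich the delimiter between name and subtype.
        base = "{}{}{}".format(sample_name, delimiter, subtype_name)
    return base + extension


base_filename = generate_filename_sketch("sample1")
subtype_filename = generate_filename_sketch("sample1", subtype_name="ATACseqSample")
```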

Generate name for the sample by joining some of its attribute strings.


Get value corresponding to each given attribute.

Parameters:attrlist (str) – name of an attribute storing a list of attr names
Return list | NoneType:
 value (or empty string) corresponding to each named attribute; null if this Sample’s value for the attribute given by the argument to the “attrlist” parameter is empty/null, or if this Sample lacks the indicated attribute

Create key-value pairs for items originally passed in via the sample sheet.

This is useful for summarizing; it provides a representation of the sample that excludes things like config files and derived entries.

Return OrderedDict:
 mapping from name to value for data elements originally provided via the sample sheet (i.e., a map-like representation of the instance, excluding derived items)

Infer value for additional field(s) from other field(s).

Add columns/fields to the sample based on the values of fields that are already set, where the sample’s project defines those values as implying values for additional fields.

Parameters:implications (Mapping) – Project’s implied columns data
Return None:this function mutates state and is strictly for effect
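
A sketch of how implied fields might be filled in is shown below. The nesting of the implications mapping (source field, then source value, then implied field and value) is an assumption about its layout, and the function operates on a plain dict and not on a Sample.

```python
def infer_columns_sketch(sample, implications):
    """Mutate sample in place, adding fields implied by existing values."""
    for source_field, by_value in implications.items():
        current = sample.get(source_field)
        # Look up what, if anything, this field's value implies.
        for implied_field, implied_value in by_value.get(current, {}).items():
            sample[implied_field] = implied_value
    return None  # strictly for effect, as in the documented method


implications = {
    "organism": {
        "human": {"genome": "hg38"},
        "mouse": {"genome": "mm10"},
    }
}
sample = {"sample_name": "s1", "organism": "human"}
result = infer_columns_sketch(sample, implications)
```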

Determine whether this Sample is inactive.

By default, a Sample is regarded as active. That is, if it lacks an indication about activation status, it’s assumed to be active. If, however, there’s an indication of such status, it must be ‘1’ in order for the Sample to be considered switched ‘on.’

Return bool:whether this Sample’s been designated as dormant
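
The activation rule can be sketched in a few lines. The attribute name "toggle" is a hypothetical placeholder for whatever field carries the activation status, and the sample is represented as a plain dict.

```python
def is_dormant_sketch(sample, flag="toggle"):
    """Return True only if a status flag is present and is not '1'."""
    if flag not in sample:
        # No indication of status: the sample is assumed active.
        return False
    return str(sample[flag]) != "1"


unmarked = is_dormant_sketch({"sample_name": "s1"})
switched_on = is_dormant_sketch({"sample_name": "s2", "toggle": "1"})
switched_off = is_dormant_sketch({"sample_name": "s3", "toggle": "0"})
```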
locate_data_source(data_sources, column_name='data_source', source_key=None, extra_vars=None)[source]

Uses the template path provided in the project config section “data_sources” to piece together an actual path by substituting variables (encoded by “{variable}”) with sample attributes.

Parameters:
  • data_sources (Mapping) – mapping from key name (as a value in a cell of a tabular data structure) to, e.g., filepath
  • column_name (str) – Name of sample attribute (equivalently, sample sheet column) specifying a derived column.
  • source_key (str) – The key of the data_source, used to index into the project config data_sources section. By default, the source key will be taken as the value of the specified column (as a sample attribute). For cases where the sample doesn’t have this attribute yet (e.g. in a merge table), you must specify the source key.
  • extra_vars (dict) – By default, this will look to populate the template location using attributes found in the current sample; however, you may also provide a dict of extra variables that can also be used for variable replacement. These extra variables are given a higher priority.

Return str:regex expansion of data source specified in configuration, with variable substitutions made

Raises:ValueError – if argument to data_sources parameter is null/empty
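
The template substitution can be sketched with Python's str.format. This is a simplified illustration, not the documented method: the function name is hypothetical, the sample is a plain dict, and any further expansion of the resulting path is omitted.

```python
def locate_data_source_sketch(data_sources, sample_attrs,
                              column_name="data_source",
                              source_key=None, extra_vars=None):
    """Fill a data source template with sample attributes."""
    if not data_sources:
        raise ValueError("No data sources provided")
    # By default the source key is the sample's value in the given column.
    key = source_key if source_key is not None else sample_attrs[column_name]
    template = data_sources[key]
    variables = dict(sample_attrs)
    # Extra variables take priority over the sample's own attributes.
    variables.update(extra_vars or {})
    return template.format(**variables)


data_sources = {"local": "/data/{sample_name}.fastq"}
sample_attrs = {"sample_name": "s1", "data_source": "local"}
default_path = locate_data_source_sketch(data_sources, sample_attrs)
merged_path = locate_data_source_sketch(
    data_sources, sample_attrs, extra_vars={"sample_name": "merged"}
)
```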


Creates sample directory structure if it doesn’t exist.


Sets the paths of all files for this sample.

Parameters:project (Project) – object with pointers to data paths and such

Set the genome for this Sample.

Parameters:genomes (Mapping[str, str]) – genome assembly by organism name
set_pipeline_attributes(pipeline_interface, pipeline_name, permissive=True)[source]

Set pipeline-specific sample attributes.

Some sample attributes are relative to a particular pipeline run, like which files should be considered inputs, what is the total input file size for the sample, etc. This function sets these pipeline-specific sample attributes, provided via a PipelineInterface object and the name of a pipeline to select from that interface.

Parameters:
  • pipeline_interface (PipelineInterface) – A PipelineInterface object that has the settings for this given pipeline.
  • pipeline_name (str) – Which pipeline to choose.
  • permissive (bool) – whether to simply log a warning or error message rather than raising an exception if sample file is not found or otherwise cannot be read, default True
set_read_type(n=10, permissive=True)[source]

For a sample with attr ngs_inputs set, this sets the read type (single, paired) and read length of an input file.

Parameters:
  • n (int) – Number of reads to read to determine read type. Default=10.
  • permissive (bool) – whether to simply log a warning or error message rather than raising an exception if sample file is not found or otherwise cannot be read, default True

Set the transcriptome for this Sample.

Parameters:transcriptomes (Mapping[str, str]) – transcriptome assembly by organism name
to_yaml(path=None, subs_folder_path=None, delimiter='_')[source]

Serializes itself in YAML format.

Parameters:
  • path (str) – A file path to write yaml to; provide this or the subs_folder_path
  • subs_folder_path (str) – path to folder in which to place file that’s being written; provide this or a full filepath
  • delimiter (str) – text to place between the sample name and the suffix within the filename; irrelevant if there’s no suffix

Return str:filepath used (same as input if given, otherwise the path value that was inferred)

Raises:ValueError – if neither full filepath nor path to extant parent directory is provided.
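
How the output path might be resolved from the two alternative arguments is sketched below. The function name, the explicit sample_name and suffix arguments, and the ".yaml" extension are assumptions; the sketch also does not check that the folder exists, which the documented method appears to require.

```python
import os


def resolve_yaml_path_sketch(sample_name, path=None, subs_folder_path=None,
                             suffix="", delimiter="_"):
    """Use the explicit path if given; otherwise build one in the folder."""
    if path:
        return path
    if not subs_folder_path:
        raise ValueError("Provide a full filepath or a submission folder path")
    # The delimiter only matters when there is a suffix to attach.
    filename = sample_name + (delimiter + suffix if suffix else "") + ".yaml"
    return os.path.join(subs_folder_path, filename)


explicit = resolve_yaml_path_sketch("s1", path="out/custom.yaml")
inferred = resolve_yaml_path_sketch("s1", subs_folder_path="submission")
```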


Update Sample object with attributes from a dict.