Configuring Flow.bio as an Admin
Adding Pipelines to Flow.bio
If you are an admin on Flow, you can control what pipelines appear to users, and even add new pipelines.
Pipelines are managed in the 'Pipelines' tab of the admin panel. To add a new pipeline, first add the git repository the pipeline lives in, then add the pipeline itself, and finally specify one or more versions of the pipeline (identified by a branch, tag, commit hash, etc.). Once added, you can manage how users see your pipelines in the 'Pipeline Categories' tab.
Most Nextflow pipelines can be added to Flow with little modification beyond the addition of a JSON schema file.
Pipeline Repositories
The pipeline must exist in a public git repository hosted on GitHub, GitLab, or any other service which provides a URL. It only needs to be public when first added (or when you want to sync).
To add the repository to Flow, go to the admin panel, and in the first tab ('Pipelines') click 'Add New Repo'. You need to specify the URL of the repository, and the directory name it will have on Flow's backend. If your repository is a fork of some other repository and you want to highlight that (perhaps an nf-core repository that you've forked to add Flow-specific content), you can also provide the URL of that upstream repository.
Once saved, this will clone the repository files to Flow. You can manage this local copy of the repository with the 'pull' and 'checkout' buttons.
Adding Pipelines
To tell Flow about the existence of a pipeline in your repo, click 'Add New Pipeline' within the repo box on the first tab of the admin panel. Here you give general descriptive metadata about the pipeline - its name, which subcategory it should be shown under, and a few checkboxes:
- nf-core - if checked, this will show an nf-core badge next to the pipeline to indicate it is a community pipeline. It has no other effects.
- demultiplex - if checked, once runs of this pipeline complete, Flow will interrogate the output files and try to create sample objects using information in any provided annotation sheets.
- imports samples - if checked, once runs of this pipeline complete, Flow will interrogate the output files and try to create sample objects using information in the input parameters.
If in doubt, leave all three unchecked.
As you can see, one repository can have multiple pipelines within it.
Pipeline Versions
On Flow, you don't run the pipeline per se - you run a specific version of it. Each pipeline version is defined by a git branch, tag, or commit hash - anything that can be passed to a git checkout command. When the user goes to run the pipeline, a copy of the files is made, checked out at the ref specified by that version, and that copy is run.
Click 'Add New Version' for the interface to add a new version. You provide:
- Name - the name of the version as it will be shown to the user.
- Git - the git identifier for this version. It's often simplest to set the name to the same value.
- Description - a short descriptive text that will be shown in the list of pipelines (if this is the default version).
- Long Description - longer descriptive text that will be shown on the pipeline page itself when this version is selected.
- Path - the relative path within the repository to the Nextflow .nf file that represents this pipeline. If you were running at the command line, it is the file you would pass to nextflow run.
- Schema Path - the relative path within the repository to the schema file.
- Config Paths - the relative paths within the repository to any config files that should be passed when running, as a comma-separated list.
- Private - if checked, the pipeline version will only be available to admins.
- Active - if unchecked, the pipeline version will not be runnable - useful for older, obsoleted versions.
- Fileset Pipeline Versions - some pipelines can take the output of other pipeline executions as their inputs. If this is the case, this specifies which pipeline/version this one can take as its inputs.
Most attributes are set at the pipeline version level, rather than the pipeline level, so that they can vary by version as needed.
Once you add a new pipeline version as a branch, you need to 'checkout' and 'pull' from that branch for it to appear in the interface.
Schema Files
You do not typically need to modify the pipeline itself in order for it to run on Flow, but you do need to add a Flow-specific JSON file - the schema file. The schema file tells Flow what input parameters to expose and how, and which output files are important.
An example schema file for our RNA-Seq pipeline can be seen here.
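At the top level, the schema file is a JSON document containing the inputs and outputs sections described below. A minimal, hypothetical skeleton might look like this (all names and values are illustrative, not taken from a real pipeline):

```json
{
  "inputs": [
    {
      "name": "Reads",
      "description": "The sequencing reads to process.",
      "params": {}
    }
  ],
  "outputs": [
    {
      "name": "QC reports",
      "description": "Per-sample quality control reports.",
      "process": "FASTQC",
      "filetype": "html"
    }
  ]
}
```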
Inputs
The inputs section is a list of 'section' objects, each of which groups related parameters. Each section has:
- name - the name of the section.
- description - a brief overview of what the parameters in this section have in common.
- advanced - if true, this section will be hidden by default in the interface. Not recommended if it has required params with no defaults.
- from_fileset - if true, when rendered the user will be presented with an option to select a fileset they have access to, which will then be used to auto-fill data inputs in this section where possible. This is useful where a section takes raw genome files (for example).
- fileset_requires_organism - if true and from_fileset is true, only filesets associated with an organism will be selectable.
- from_execution - if true, when rendered the user will be presented with an option to select an execution, which will then be used to auto-fill data inputs in this section where possible. This is useful where a section takes prepared genome files (for example). The executions available to select from are set in the admin panel of the Flow instance.
- modes - an array of mode names. You can opt to have a toggle at the top of the section that switches between different 'modes', with different parameters being available in different modes.
- params - a mapping of parameter names to parameter objects. The key is the command line string the parameter uses (after the --).
A parameter object creates an interface for populating a command line parameter's value. Each parameter has:
- name - the human-readable name of the parameter, which can be different from the raw string passed at the command line (e.g. --fasta_fai might have the name "Genome FASTA index:").
- param - the parameter name to use at the command line, if different from the key used. This is useful where different modes have different versions of the same parameter.
- description - an explanation of what the parameter does.
- required - whether or not this parameter must have a value before the pipeline can be submitted.
- modes - if the section has different modes, this attribute (an array of mode names) indicates which modes the parameter should appear in.
- type - the type of parameter. All parameters ultimately become a string when passed at the command line, but the interface for generating them will vary.
There are other attributes that are only required for certain param types.
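For illustration, a single section containing one required string parameter might look like the sketch below. The section name and the run_name parameter are hypothetical, not taken from a real schema:

```json
{
  "name": "General options",
  "description": "Settings that apply to every run.",
  "advanced": false,
  "params": {
    "run_name": {
      "name": "Run name:",
      "description": "A label passed as --run_name and used in report filenames.",
      "required": true,
      "type": "string",
      "default": "my-run"
    }
  }
}
```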
String Parameters
"type": "string"
parameters are the simplest - the interface renders a text box, and whatever is entered is simply passed straight in as the command line value. Specific attributes:
default
- a pre-loaded value that the user can replace if they wish.valid
- a list of acceptable values, that will be rendered as a dropdown.
Number Parameters
"type": "number"
parameters are rendered as a number text input - that is, one which only allows numbers to be entered - and the resultant number gets passed as a string. Specific attributes:
default
- a pre-loaded value that the user can replace if they wish.valid
- a list of acceptable values, that will be rendered as a dropdown.
Boolean Parameters
"type": "boolean"
parameters are rendered as toggles, which produce the string "true" or the string "false".
Specific attributes:
default
- a pre-loaded value that the user can replace if they wish. If there is no default, and the user does not interact with the toggle interface, the parameter will not be passed.
Hidden Parameters
"type": "hidden"
parameters are not rendered in the interface, and are there to ensure some parameter is always passed to the pipeline. Its attributes:
default
- a the value that should always be passed. If you don't provide this, there is no real point to having this parameter.
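As a sketch, the four simple parameter types might be declared like this inside a section's params mapping. All parameter names and defaults here are hypothetical, and the exact format expected for default values (JSON strings versus native numbers and booleans) is an assumption:

```json
{
  "adapter": {
    "name": "Adapter sequence:",
    "description": "Adapter to trim from reads.",
    "type": "string",
    "valid": ["AGATCGGAAGAGC", "CTGTCTCTTATACACATCT"]
  },
  "min_length": {
    "name": "Minimum read length:",
    "description": "Reads shorter than this are discarded.",
    "type": "number",
    "default": 20
  },
  "save_trimmed": {
    "name": "Save trimmed reads?",
    "description": "Keep intermediate trimmed FASTQ files.",
    "type": "boolean",
    "default": "false"
  },
  "skip_md5": {
    "description": "Always passed so checksumming is skipped.",
    "type": "hidden",
    "default": "true"
  }
}
```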
Data Parameters
"type": "data"
parameters represent data files. In the interface the user selects Flow data from navigator, and Flow converts this into a path to the file on disk. Some type specific attributes here are:
pattern
- when given, the available data will be filtered by those whose filename matches this regex.fileset_pattern
- if the section can be auto-filled by some fileset, this is the pattern used to select a matching data file from that fileset.category
- only allow certain Flow data categories (1 is generic, 2 is multiplexed annotation file, 3 is multiplexed file, 4 is demultiplexed file).data_types
- you can limit the selectable data to data of a specific type - the install of Flow needs to have these types added first.execution_output
- if the section can be auto-filled from some execution, this object determines where to look. It has two attributes:process
- the process execution to look in. In larger pipelines the full name may be multiple identifiers separated by a:
- you can just use the rightmost name, or as far to the left as you like, such asFASTQC
orPREPARE_SMRNA_GENOME:CUSTOM_GETCHROMSIZES
. Ifnull
, the inputs will be searched instead.pattern
- the filename regex to filter by within the process execution.
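A data parameter for a genome FASTA index, say, might be sketched as follows. The parameter name, regex patterns, and process name are hypothetical:

```json
{
  "fasta_fai": {
    "name": "Genome FASTA index:",
    "description": "Index file for the reference genome.",
    "type": "data",
    "pattern": "\\.fai$",
    "fileset_pattern": "\\.fai$",
    "execution_output": {
      "process": "CUSTOM_GETCHROMSIZES",
      "pattern": "\\.fai$"
    }
  }
}
```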
CSV Parameters
"type": "csv"
parameters are the most complex inputs. They are rendered as an editable data, which gets turned into a CSV file, the path to which is passed to the pipeline as a param.
Its type-specific attributes determine how this is rendered:
takes_samples
- iftrue
, the user must pick a sample for each row, and the final CSV will have a row per sample fileset. The other columns will be able to refer to attributes of the sample.sample_types
- you can limit the selectable samples to samples of a specific type - the install of Flow needs to have these types added first.takes_filesets
- iftrue
, the user must pick a fileset for each row. The other columns will be able to refer to data in this fileset.fileset_category
- you can limit the selectable filesets to those whose data is all of a particular category.fileset_size
- you can limit the selectable filesets to those with a particular number of data contents.allow_custom_columns
- iftrue
, the interface will allow the user to add their own custom columns.columns
- an array of column objects, which determines what the columns of the CSV file will be.
The column objects broadly have the same structure as other inputs, and can be of any of the types above (string
, data
etc. - but not csv
, as nested CSVs don't make sense). They have two important extra attributes though:
from_sample
- you can opt to have a default value taken from the sample for this row, iftakes_sample
istrue
. For most input types this must be a metadata attribute of the sample, but fordata
types, the value here will be used to select a file from the sample's fileset - an integer to get the file by index, a string to get the file by regex pattern.from_fileset
- you can opt to have a default value taken from the fileset for this row, iftakes_filesets
,fordata
types - an integer to get the file by index, a string to get the file by regex pattern.render
- iffalse
the column won't be shown in the interface, and the column will be populated usingdefault
orfrom_sample
.
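Putting this together, a hypothetical CSV parameter describing a per-sample design sheet might look like the sketch below. The column names, the from_sample values, and the assumption that a column's name doubles as the CSV header are illustrative only:

```json
{
  "input": {
    "name": "Sample sheet:",
    "description": "One row per sample, written to a CSV and passed via --input.",
    "required": true,
    "type": "csv",
    "takes_samples": true,
    "allow_custom_columns": false,
    "columns": [
      {
        "name": "fastq_1",
        "type": "data",
        "from_sample": 1,
        "render": false
      },
      {
        "name": "condition",
        "type": "string",
        "from_sample": "condition",
        "required": true
      }
    ]
  }
}
```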
Outputs
The outputs section is a list of curated output objects, each of which has:
- name - the name of this curated output.
- description - an overview of what these files are and why they are important.
- process - the name of the process execution to look in for the files. In larger pipelines the full name may be multiple identifiers separated by a ':' - you can just use the rightmost name, or as far to the left as you like, such as FASTQC or PREPARE_SMRNA_GENOME:CUSTOM_GETCHROMSIZES.
- filetype (optional) - the file extension to filter by within the process execution. For directories, this should be a blank string.
- pattern (optional) - the filename regex to filter by within the process execution.
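For example, a curated output collecting alignment log files might be declared like this (the process name, filetype, and pattern are hypothetical):

```json
{
  "name": "Alignment logs",
  "description": "Per-sample alignment summaries, useful for spotting failed samples.",
  "process": "STAR_ALIGN",
  "filetype": "out",
  "pattern": "Log\\.final\\.out$"
}
```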
Pipeline Categories
In Flow, all pipelines are organised by category, and then by subcategory. As an admin, it is entirely up to you how to organise your pipelines for your users.
To add these, consult the second tab of the admin panel ('Pipeline Categories'). Each category and subcategory needs a name, a brief description, and an ordering.