In this tutorial, we’d like to introduce you to the core features of Genestack Platform. You will learn how our system deals with files, how it helps you organise and manage your data and how to share data with your colleagues. You will see how easy it is to work on private and public data simultaneously and seamlessly, and how to reproduce complex analyses with data flows, a built-in mechanism for capturing and replaying your research.
In this tutorial we will walk you through:
1. Creating an Account and Managing Users
3. Organising your research
4. Importing data onto the Genestack Platform
6. Managing and Sharing Data
7. Reproducing the Analyses with Data Flows
Creating an Account and Managing Users
It’s easy to register on Genestack. All you need to do is provide an email and set up a password.
You will quickly receive a confirmation email with a link to click on and then you’ll be able to log in. After you log in, the system will take you to the Welcome Page.
This is your main point of entry. From here you can manage and search data using the File Manager, view your recent results and share your findings with colleagues, set up and launch analysis pipelines, and visit the tutorials section. You can always go back to the Welcome Page by clicking the Genestack icon in the upper left corner. If you prefer, you can change your account settings to make the File Manager, rather than the Welcome Page, your main point of entry to the platform.
Now that you have set up your own account, let’s talk about user management. Try opening the menu in the top right-hand corner of the screen, where your email is displayed.
If you click on Manage Users you will go to the user management screen. Every user in Genestack Platform belongs to an organisation. When you signed up to use Genestack via the sign-up dialog, we created a new organisation for you, and you automatically became its first user and its administrator.
As an organisation administrator you can create as many new users for your organisation as you want. For instance you can create accounts for your colleagues. Being in one organisation means you can share data without any restrictions. The user management screen allows you to get an overview of all users in your organisation. You can change a user’s password, make any user an administrator or lock a user out of the system.
Organising your research
So you have created a Genestack Platform account, logged in, and created a bunch of users. Let’s now talk about data organisation on Genestack Platform. From the Welcome Page go to the File Manager and explore it a little bit. Right now you do not have any private files, but you have access to all public data available on the platform.
We have preloaded the platform with hundreds of thousands of publicly available experiments, curated datasets and reference genomes. In the public data folder you will also find Public Data Flows that you can use in the future.
While you browse around all the folders, we’d like to point out a key feature of Genestack Platform: all files are format-free objects. Each Genestack file can be considered a container, packing several physical files or even a database, with complex and rich metadata. Let us take a look at an example. In the Reference genomes folder you will see several pre-loaded genomes:
Take a look at the KIND column. These files are of the Reference Genome type. There is no single, standard, commonly accepted file format for storing and exchanging genomic sequence and features: sequence can be stored in FASTA, EMBL or GenBank formats, and genomic features (introns, exons, etc.) can, for example, be represented via GFF or GTF files. Each of these formats has its own flavours and versions, which occasionally suffer from incompatibilities and conflicts. In Genestack you no longer have to know or worry about file formats (“low-level implementation details” as programmers call them).
A Reference Genome file contains packed sequence and genomic features. When data, such as reference genomes, is imported onto Genestack (and several different formats can be imported) it is “packed” into a Genestack file, meaning all reference genomes will behave identically, regardless of any differences in the physical formats underneath. You can browse reference genomes with our Genome Browser, you can use them to map raw sequence reads, to analyse variations, to add and manage rich annotations such as Gene Ontology and you never have to think about formats again. Of course, not only reference genomes are format-free. All files in Genestack Platform are format-free: raw reads, mapped sequence, statistics, genome annotations, genomic data, codon tables, and so forth.
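To make the idea of “packing” more concrete, here is a minimal sketch in Python. It is not Genestack code: the Feature class and the pack_feature function are hypothetical names invented for this illustration, and the two input lines are toy data. It only shows how one gene, written once in GFF3 style and once in GTF style, can end up as the same format-independent object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Format-independent view of one genomic feature (hypothetical model)."""
    seqid: str
    kind: str
    start: int
    end: int
    strand: str
    feature_id: str


def _gff3_attributes(text):
    # GFF3 attributes look like: ID=gene1;Name=abc
    return dict(pair.split("=", 1) for pair in text.strip().split(";") if pair)


def _gtf_attributes(text):
    # GTF attributes look like: gene_id "gene1"; transcript_id "t1";
    attrs = {}
    for pair in text.strip().split(";"):
        pair = pair.strip()
        if pair:
            key, value = pair.split(" ", 1)
            attrs[key] = value.strip().strip('"')
    return attrs


def pack_feature(line, fmt):
    """Turn one 9-column GFF3 or GTF line into the same Feature object."""
    seqid, _source, kind, start, end, _score, strand, _phase, raw = (
        line.rstrip("\n").split("\t")
    )
    if fmt == "gff3":
        feature_id = _gff3_attributes(raw).get("ID", "")
    elif fmt == "gtf":
        feature_id = _gtf_attributes(raw).get("gene_id", "")
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return Feature(seqid, kind, int(start), int(end), strand, feature_id)


# The same toy gene written in two different physical formats.
gff3_line = "chr1\tdemo\tgene\t100\t200\t.\t+\t.\tID=gene1"
gtf_line = 'chr1\tdemo\tgene\t100\t200\t.\t+\t.\tgene_id "gene1";'

from_gff3 = pack_feature(gff3_line, "gff3")
from_gtf = pack_feature(gtf_line, "gtf")
print(from_gff3 == from_gtf)
```

Once both lines have been packed, any code that works on Feature objects no longer needs to know which format they came from.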
We can take a look at any file’s metadata by clicking the “eye” icon in the File Manager. All files have rich metadata, different for each file type. Some metadata fields are filled in by our curators, some are available for you to edit via different applications in Genestack Platform, and some are computed when files are initialised.
Importing data onto the Genestack Platform
An additional option for importing your data is to use import templates. On the Welcome Page you can find an “Add import template” option. Import templates allow you to specify required and optional metainfo attributes for different file kinds. When you scroll down to the bottom of the page, you’ll see an “Add import template” button.
Initialising files and various file types
Now that you know how to import data onto the platform, we will walk you through file initialisation. All files on Genestack are created by various applications. When an application creates a new file, it specifies what should happen when the file is initialised: running a script, a download, indexing, or a computation. In practice this means that uninitialised files are cheap and quick to create, can be configured and used as inputs to applications to create other files, and can then be computed later, all at once.
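The behaviour described above can be pictured with a small Python sketch. This is only a toy model under our own assumptions: the LazyFile class, its methods and the example data are hypothetical and are not part of any Genestack API. It shows files that are cheap to create, can be configured and chained while uninitialised, and are computed only when initialisation is requested.

```python
class LazyFile:
    """Toy model of a file whose content is computed only on initialisation."""

    def __init__(self, name, compute, inputs=(), **params):
        self.name = name
        self._compute = compute
        self.inputs = list(inputs)
        self.params = dict(params)
        self.data = None
        self.initialised = False

    def set_param(self, key, value):
        # Parameters can be edited only while the file is uninitialised.
        if self.initialised:
            raise RuntimeError(f"{self.name} is already initialised")
        self.params[key] = value

    def initialise(self):
        # Initialise upstream files first, then compute this one exactly once.
        if not self.initialised:
            upstream = [f.initialise() for f in self.inputs]
            self.data = self._compute(*upstream, **self.params)
            self.initialised = True
        return self.data


def load_reads():
    return "ACGTNNACGTNN"  # toy data, not real sequencing reads


def trim_end(reads, bad):
    return reads.rstrip(bad)


def read_length(reads):
    return len(reads)


raw = LazyFile("raw reads", load_reads)
trimmed = LazyFile("trimmed reads", trim_end, inputs=[raw], bad="X")
length = LazyFile("read length", read_length, inputs=[trimmed])

# Nothing has been computed yet, so the trimming parameter can still change.
trimmed.set_param("bad", "N")

# Initialising the last file computes the whole chain in one go.
print(length.initialise())
```

After the final call, all three files are initialised and a further call to trimmed.set_param would raise an error, which mirrors how parameters become fixed once a computation has run.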
Let’s look at an example. Go to the public experiment library and choose the experiment “Analysis of the intestinal microbiota of hybrid house mice reveals evolutionary divergence in a vertebrate hologenome” by Wang et al. Select the raw sequencing reads file called “FS01”, right click on it, and select “Preprocessing” and then the “Trim Low Quality Bases” app. This creates a file called “Trimmed FS01” that is not initialised yet. What is special about our system is that you do not have to start initialisation right away! In fact, you can already use this file as input to applications for creating other files.
Notice that you can edit the initialisation parameters of the new file. You can change them because the file is not yet initialised, i.e. the computation (in this case, trimming) has not yet been started. After initialisation has completed, these parameters are fixed and are there to inform you about how the file was created. They can be used to reproduce your work identically. If you want to start initialisation of this newly created file, click on the name of the file and select “Start initialisation”.
Here, however, we will show you how to use this file as an input for a different application. The trimmed file can, for example, be mapped to a reference genome. To do this, click on “add step” and select the Spliced Mapping application. Using the “edit parameters” option you can check whether the system has suggested the correct reference genome and, if not, select the correct one (in this case it should be a mouse genome). These actions create another file called “Mapped reads for Trimmed FS01” that is waiting to be initialised.
This again can be used as an input for a different application. As a last step you could for example create a genetic variations file by choosing the Variant Calling app in the “add step” option. In order to see the entire data flow we have just created, click on the name of the last created file, go to “manage” and “File provenance”.
It will show you the processes that have been completed and the ones that still need to be initialised. To initialise only one of the steps, click on the corresponding cell, then on “Actions”, and then select “Start initialization”. To initialise all of the uninitialised dependencies, simply click on the blue “Start initialisation” button at the top.
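Initialising all uninitialised dependencies implies that upstream files have to be computed before the files that depend on them. As a rough illustration only (we are not describing how the platform schedules tasks internally), the sketch below uses Python’s standard graphlib module to derive such an order from a provenance graph. The function name and the file names are hypothetical.

```python
from graphlib import TopologicalSorter


def initialisation_plan(depends_on, initialised):
    """List uninitialised files in an order that respects their dependencies."""
    order = TopologicalSorter(depends_on).static_order()
    return [name for name in order if name not in initialised]


# Each file maps to the set of files it was created from (toy names).
provenance = {
    "variants": {"mapped reads"},
    "mapped reads": {"trimmed reads", "reference genome"},
    "trimmed reads": {"raw reads"},
}
already_initialised = {"raw reads", "reference genome"}

plan = initialisation_plan(provenance, already_initialised)
print(plan)
```

The resulting plan lists the trimmed reads first, then the mapped reads, then the variants, while the files that are already initialised are skipped.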
You can track the progress of your computations using the Task Manager that can be found at the top of the page. All the files created in the above example are located in the tutorial folder. To read more about data flows scroll down.
One additional thing we should mention is that if you want to analyse more than one file using the same app, it’s very easy: just tick all the files you want to analyse, right click on them and select the app you wish to use.
All the steps you need to take are identical to those for analysing a single file. In this example we have created 100 files, which we have to initialise in order to start the tasks.
Now let’s talk a bit about different types of files that can be found on the platform. As we demonstrated, all our files have a built-in system type. Some of these file types are particularly useful when it comes to organising your research and now we will discuss them in more detail.
There are many different file types in Genestack Platform. Every file is created by an application and there’s a lot of metadata associated with each file. For example, every file has one or more unique accessions, a name and a description. Applications use file type and metadata to make suggestions about what kinds of analyses a given file can be used in. Almost anywhere you see file names and accessions, e.g., File Manager or in other applications, you can click on them and a file context menu will show up. For example, clicking on a file containing raw sequenced reads displays a menu:
You can view and edit file metadata via the Edit Metainfo option, which appears under the Manage submenu.
You can open the metainfo viewer on any file in the system by clicking on the eye icon. Here it is on a sequencing assay:
Folders in Genestack behave the same as folders in other systems. You can put files in folders, and you can remove files from folders. There’s one very useful difference, however, from most systems: each file can be added (or, as we sometimes say, “linked”) to multiple folders. No data gets copied, of course; the file simply appears in multiple locations. This is very handy for organising your work. For example, you can collect files from multiple experiments into one folder and work on them as if they were all part of one experiment.
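For readers who think in code, linking can be compared to keeping references to one object in several lists. The following Python sketch is a toy model with hypothetical File and Folder classes; it does not reflect how Genestack stores files.

```python
class File:
    def __init__(self, name):
        self.name = name


class Folder:
    def __init__(self, name):
        self.name = name
        self.files = []

    def link(self, file):
        # Store a reference to the file; the file itself is never copied.
        if not any(existing is file for existing in self.files):
            self.files.append(file)

    def unlink(self, file):
        # Removing the link leaves the file untouched in other folders.
        self.files = [existing for existing in self.files if existing is not file]


reads = File("toy reads")
experiment = Folder("Experiment A")
combined = Folder("Combined analysis")

experiment.link(reads)
combined.link(reads)

# Both folders point at the same file, so a change is visible in both places.
reads.name = "toy reads (renamed)"
print(experiment.files[0].name, "|", combined.files[0].name)

# Unlinking from one folder does not remove the file from the other.
experiment.unlink(reads)
print(len(experiment.files), len(combined.files))
```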
Experiments, Assays, and Assay Groups
An experiment is a very special kind of folder. It contains only assays, that is, files that contain experimentally collected data. One can think of experiments as packages for experimental data. Assays are a general category of file types that store experimentally collected data. Assay groups are a way to collect assays with common metadata into experimental subgroups, e.g., technical replicates, biological samples undergoing the same treatment, and so forth.
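The idea of grouping assays by common metadata can be sketched in a few lines of Python. This is purely illustrative: the group_assays function, the metadata keys and the assay names below are made up for the example and do not correspond to Genestack’s own data model.

```python
from collections import defaultdict


def group_assays(assays, keys):
    """Group assay names by the values of the chosen metadata keys."""
    groups = defaultdict(list)
    for assay in assays:
        label = tuple(assay["metadata"].get(key) for key in keys)
        groups[label].append(assay["name"])
    return dict(groups)


# Toy assays with hypothetical metadata fields.
assays = [
    {"name": "assay 1", "metadata": {"treatment": "control", "replicate": 1}},
    {"name": "assay 2", "metadata": {"treatment": "control", "replicate": 2}},
    {"name": "assay 3", "metadata": {"treatment": "treated", "replicate": 1}},
]

by_treatment = group_assays(assays, ["treatment"])
for label, names in by_treatment.items():
    print(label, names)
```

Grouping by a different key, such as the replicate number, only requires changing the list of keys passed to the function.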
Managing and sharing data
Reproducing your work with data flows
So, you have learned how to work with files and folders, and you have even created a simple analytical data flow to go from raw sequence to a list of variants. Now, let’s talk about reproducibility. We will now show you how to take any data file in Genestack Platform and repeat the analysis steps that led up to it on different data.
Let’s go back to the genetic variations file you created, called “Variants from Mapped reads from Trimmed FS01”. You can use the Welcome Page to find it in Recent Results, or go to the “Created files” folder in the File Manager. You can also find it in the tutorial folder. Rather than viewing its provenance like we did before, let’s see if we can reuse the provenance. To do this, select the file, go to “Manage” and select “Create new Data Flow”.
In the next screen you will see the data flow we have previously created.
The data flow editor has one core goal: to help you create more files using this diagram. To do this you will need to make some decisions for the boxes in the diagram via the Action menu. If you want to select different files, go to “Choose another file”. If you want to keep the original file, simply don’t change anything.
In this example, we will use this data flow to produce variant calls for another raw sequence data file, FS02, reproducing the entire workflow, including trimming low quality bases, spliced mapping and variant calling. All you need to do is choose another input file and click on the “Run dataflow” button at the top of the page.
You will be given a choice: you can initialise the entire data flow now or delay initialisation.
If you decide to delay initialisation until later, you will be brought back to the Data Flow Runner page, where you can initialise individual files by clicking on the file name and then selecting “Start initialization”.
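Conceptually, a data flow is a recorded sequence of steps with their parameters that can be replayed on a new input. The Python sketch below illustrates that idea with toy steps on toy strings. The run_data_flow function and the two steps are hypothetical stand-ins and have nothing to do with the real trimming, mapping or variant calling applications.

```python
def trim(reads, bad="N"):
    # Toy stand-in for a preprocessing step: drop trailing low-quality bases.
    return reads.rstrip(bad)


def count_bases(reads):
    # Toy stand-in for a downstream analysis step.
    return {base: reads.count(base) for base in sorted(set(reads))}


# A recorded data flow: an ordered list of (name, function, parameters).
data_flow = [
    ("trim", trim, {"bad": "N"}),
    ("count bases", count_bases, {}),
]


def run_data_flow(steps, source):
    """Replay every recorded step on a new source and keep all results."""
    results = [source]
    for _name, func, params in steps:
        results.append(func(results[-1], **params))
    return results


# The same recorded steps, replayed on two different toy inputs.
first = run_data_flow(data_flow, "AACGNN")
second = run_data_flow(data_flow, "TTNNN")
print(first[-1])
print(second[-1])
```

Because the steps and their parameters are stored separately from the input, swapping the input is the only change needed to repeat the analysis on different data.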
This is the end of this tutorial. We hope you found it useful and that you are now ready to make the most out of our platform. If you have any questions you can post them on our forum and we will answer them as soon as we can. Alternatively, you can e-mail us.