1 How to cite

To cite PlanMine, please refer to the following publication:

Brandl, H., Moon, H., Vila-Farré, M., Liu, S.-Y., Henry, I., & Rink, J. C.
PlanMine - a mineable resource of planarian biology and biodiversity.
Nucleic Acids Research, gkv1148. doi:10.1093/nar/gkv1148 (2015)

2 What is PlanMine?

PlanMine is a database for mining planarian transcriptomes. We host independently assembled transcriptomes of the model species S. mediterranea contributed by different groups, as well as transcriptomes of “wild” planarian species for comparative analysis. You can search PlanMine by sequence (using our BLAST query) or by annotation (this is easiest using our predefined templates). Mineable information currently includes BLAST homologies, GO terms, orthologues in other planarian species, gene expression information and taxonomic information on the represented species. PlanMine is built with the InterMine data warehouse platform, which is also used by other model organism communities, including WormMine, FlyMine, YeastMine, and ZebrafishMine, and thus allows easy cross-system comparisons.

3 Contig identifier naming scheme

3.1 ID Scheme prefix

Each contig in PlanMine has an ID that consists of a two-letter city code, a four-letter species code, a version number, and a number that uniquely identifies the contig within that assembly. For example, the Dresden S. mediterranea contigs, whose annotation has been updated six times, are prefixed with dd_Smed_v6_ followed by the contig ID number.

City identifiers currently in use include: dd (Dresden - Rink lab); mu (Muenster - Bartscherer lab); be (Berlin - Rajewsky lab); to (Toronto - Pearson lab); ox (Oxford - Aboobaker lab); uc (Urbana Champaign - Newmark lab); bo (Boston - Reddien lab); ka (Kansas - Sanchez lab).

3.2 Contig suffix

Taking the example of dd_Smed_v6_10001_0_1, the ID prefix is dd_Smed_v6 (as described above). The suffix 10001_0_1 is derived from the contig ID assigned by the Trinity assembler (comp10001_c0_seq1): we simply remove the comp, c and seq characters and keep the underscores.
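The renaming rule above can be sketched in a few lines of Python. This is an illustration only, not the code used to build PlanMine; the function name and its default arguments are hypothetical, and it assumes Trinity IDs of the form compN_cN_seqN as in the example.

```python
import re

# Trinity contig IDs of the form comp<N>_c<N>_seq<N>, as in the example above.
TRINITY_ID = re.compile(r"^comp(\d+)_c(\d+)_seq(\d+)$")


def planmine_contig_id(trinity_id, city="dd", species="Smed", version=6):
    """Build a PlanMine-style ID from a Trinity ID (hypothetical helper).

    The comp, c and seq characters are dropped and the underscores are kept;
    the city/species/version prefix is prepended.
    """
    match = TRINITY_ID.match(trinity_id)
    if match is None:
        raise ValueError(f"not a comp/c/seq style Trinity ID: {trinity_id!r}")
    component, subcomponent, sequence = match.groups()
    return f"{city}_{species}_v{version}_{component}_{subcomponent}_{sequence}"
```

Calling planmine_contig_id("comp10001_c0_seq1") reproduces the example ID from the text.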

For user-contributed assemblies we add the ID scheme prefix, but the contig suffix is taken from the original contributor’s nomenclature for easy comparison with the source data.

3.3 Gene

For contigs that we have assembled ourselves, the gene ID is the contig ID without the last underscore-separated number (i.e. seq1, seq2, seq3 in the original Trinity IDs). The final number of a contig ID is therefore omitted in this pseudo-gene ID and instead identifies the individual isoforms of a gene.
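As a minimal sketch of this convention (the helper names are hypothetical, and the rule only applies to contigs assembled in-house with Trinity, not to contributed assemblies), the gene ID can be derived by dropping the trailing isoform number, and contigs can then be grouped by gene:

```python
from collections import defaultdict


def gene_id(contig_id):
    """Return the pseudo-gene ID: the contig ID without its final isoform number."""
    prefix, separator, isoform = contig_id.rpartition("_")
    if not separator or not isoform.isdigit():
        raise ValueError(f"contig ID has no trailing isoform number: {contig_id!r}")
    return prefix


def group_isoforms(contig_ids):
    """Map each pseudo-gene ID to the list of its isoform contig IDs."""
    genes = defaultdict(list)
    for contig_id in contig_ids:
        genes[gene_id(contig_id)].append(contig_id)
    return dict(genes)
```

For example, dd_Smed_v6_10001_0_1 and dd_Smed_v6_10001_0_2 would both be grouped under dd_Smed_v6_10001_0.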

3.4 Contig and Gene annotations

It is important to note that PlanMine does not use gene names, e.g. “b-Catenin-1”. Therefore, even transcripts originating from published genes are designated by the above ID scheme. The annotated “published genes” list offers a workaround for querying/analyzing transcripts on the basis of published gene names. Gene name searches bring up entries in the “published genes” list. From there, you can retrieve the corresponding contigs in the various assemblies either via the list of orthologues or by using the provided fasta sequence of the published gene as a BLAST query. Contigs are further associated with annotation information through protein functional domain predictions (InterProScan) or RefSeq entries with strong Blast homology. These contig annotations can also be searched via the Search box in the top right hand corner of every page.

4 Overview of Page Navigation

This section gives a brief overview of the PlanMine functions accessible via the tabs in the page header. Just follow the links for detailed explanations of individual functions or try out the tutorials at the end of the help manual.

4.1 Home

This is our home page, which gives you access to many common tasks (see the Home section for more details).

4.2 Blast

The integrated BLAST search page allows you to search against all contigs stored in PlanMine with a query sequence of your choice. The type of query, e.g., protein or nucleic acid, is automatically recognized. You can search against all assemblies or just select those you wish. Advanced BLAST parameters are also supported for search customization. Blast functionality is provided through the integration of the SequenceServer tool. See the BLAST section for more details.

4.3 Templates

This is a list of predefined queries that are useful to most users. It is possible to customize and modify these predefined templates and also to create and save your own if you are logged in to PlanMine (MyMine). See the Templates section for more details.

4.4 QueryBuilder

This tool allows users to create more complex queries and to edit existing template queries. For the majority of cases a template should already exist to perform the query you require. QueryBuilder is very powerful but takes some time to master. See the Query Builder section for more details and check out our tutorial.

4.5 Lists

Lists are a very powerful feature of PlanMine, allowing you, for example, to export multiple contig sequences or to perform GO-term enrichment analysis on differentially expressed genes. Available lists include all the transcriptomes and the stem cell-, progenitor- or differentiated cell-enriched contigs (based on Labbe et al.). You can also create your own lists, either by importing lists of contig identifiers or by saving lists from search result tables. See the Lists section for more details.

4.6 Data Sources

The data sources page provides overview information about the transcriptomes stored in PlanMine. This includes basic assembly statistics (where available), an assembly report with detailed assembly information for each transcriptome, the ability to bulk download information and the contact information of those who have contributed data to PlanMine. If you have contributed data and are missing from this page, please contact us. See the Assembly report section for more details.

4.7 Community

The community page provides links to the various planarian labs on the planet and other community-related information. If your lab is missing from this page, please contact us.

4.8 API

If you are using PlanMine as part of an automated workflow it is possible to access the data stored in PlanMine programmatically using the APIs provided (Perl, Python, Ruby, Java).
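As a rough sketch of what programmatic access can look like, the snippet below builds the URL for an InterMine-style query request using only the Python standard library; it does not send the request. The base URL is a placeholder, and the class and field names (Contig, primaryIdentifier, length) are assumptions about the data model that should be checked against the QueryBuilder model browser. The official client libraries wrap this kind of request for you.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Placeholder: replace with the real PlanMine web service root.
BASE_URL = "https://example.org/planmine/service"


def build_query_url(views, constraint_path, value, base_url=BASE_URL, fmt="tab"):
    """Return a query/results URL for a single-constraint InterMine path query.

    views           -- output columns, e.g. ["Contig.primaryIdentifier"] (assumed names)
    constraint_path -- field to filter on
    value           -- exact value the field must equal
    """
    query = ET.Element("query", {"model": "genomic", "view": " ".join(views)})
    ET.SubElement(query, "constraint",
                  {"path": constraint_path, "op": "=", "value": value})
    xml = ET.tostring(query, encoding="unicode")
    return f"{base_url}/query/results?" + urlencode({"query": xml, "format": fmt})


# Example: columns and constraint for one contig (field names are assumptions).
example_url = build_query_url(
    ["Contig.primaryIdentifier", "Contig.length"],
    "Contig.primaryIdentifier",
    "dd_Smed_v6_5822_0_1",
)
```

The resulting URL could be passed to any HTTP client; building it separately keeps the sketch free of network access.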

4.9 MyMine

The creation of a user account through the MyMine tab allows you to save your own queries, lists, and templates. Just try it out, no strings attached, and you will be able to get the most out of PlanMine.

5 Contig Information Page

Contig information pages are perhaps the most useful pages in PlanMine. They include all data that is known about a given contig, including expression level, protein domain occurrences, Blast homology outside and within planarians and, for the dd_Smed and dd_Dlac contigs, differential expression of the gene under diverse experimental conditions. See the sections below to find out more about the information available for contigs.

5.1 Transcript View

The transcript view gives a graphical overview of features associated with a contig using our integrated JBrowse viewer. Blast homology to proteins in RefSeq is shown in green, protein functional domains are shown in blue, predicted open reading frames (ORFs) and their direction in orange, gene family predictions in yellow (TreeFam), and the read coverage of the contig used for our assembly with the Trinity assembler is shown as a light blue coverage track (only shown for internally assembled contigs not for contributed assemblies).

Jointly, this information allows you to:

Get information about the expression level of your gene: The scale of the raw read density track provides a useful expression metric (within the dataset used for transcriptome assembly).

5.2 Gene expression information

An extremely useful feature of PlanMine is the ability to query the differential expression of a contig under diverse experimental conditions. For this purpose, we re-map raw reads from published RNAseq datasets against the dd_Smed and dd_Dlac assemblies.

Two types of gene expression data are currently available:

  1. RNAi experiments and

  2. expression in different cell types.

For RNAi experiments, the results are displayed as a bar chart showing up- or down-regulation of the gene relative to control and, by colour coding, whether or not the change is significant (defined using edgeR with raw counts and filtered with a False Discovery Rate (FDR) cutoff of 1%, as described on the Trinity website). Please note that our criteria for defining significance may differ from the ones used in the original study, which may lead to differences in the results. The section to the right designates 1) the gene that was knocked down, 2) a cartoon illustrating the analyzed tissue area and the location and orientation of the amputation cut (in case of a regeneration experiment) and 3) further information about the analyzed sample (e.g. time post amputation). Note that you can access both the original raw data of the study and the publication using the quick links provided.

For cell type experiments, the bar graphs symbolize expression level (FPKM) in the respective cell populations WITHOUT normalization to a control. This display mode is useful for visualizing gene expression differences between stem cells, progeny or differentiated cells. The sample information to the right is organized as described above.
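For readers unfamiliar with the unit, FPKM stands for fragments per kilobase of transcript per million mapped fragments. The sketch below shows the textbook definition only; the quantification pipeline actually used for the PlanMine values is not described here, so treat the function as an illustration of the unit rather than of our workflow.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
```

Because the value is normalized for transcript length and library size but not to a control sample, it is suited to comparing expression levels across cell populations, as in the bar graphs described above.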

Due to manpower constraints, gene expression data are currently only available for dd_Smed and dd_Dlac transcripts. If you are using one of the other Smed transcriptomes as a starting point, simply identify the respective dd_Smed orthologue and access the expression information via the respective dd_Smed contig page. We are always happy to include new or unpublished gene expression data sets in PlanMine; please notify us of any data sets you would like us to include.

Please note that the PlanMine search function or the QueryBuilder can be used to extract all genes that are significantly differentially expressed in a given RNAi experiment. Please see the respective tutorial for how-to information. This does not work for the cell type experiments (because we display raw FPKM rather than fold change). We therefore provide pre-saved lists of contigs that are differentially expressed between the three cell populations (low stringency: significant expression difference AND fold-change >3; high stringency: significant expression difference AND fold-change >8). You can access and analyze these lists via the Lists tab.
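The two stringency levels can be expressed as a simple rule. The sketch below is illustrative only: the function name is hypothetical, the fold change is assumed to be the ratio of the enriched population over the comparison population, and the 1% FDR cutoff is borrowed from the RNAi analysis above as an assumption, since the significance threshold for the cell type lists is not stated explicitly.

```python
def stringency(fdr, fold_change, fdr_cutoff=0.01):
    """Classify a contig as 'high', 'low' or None (not in either list).

    fdr         -- false discovery rate of the expression difference
    fold_change -- enriched population / comparison population (assumed orientation)
    fdr_cutoff  -- assumed significance threshold
    """
    if fdr >= fdr_cutoff:
        return None          # not significant, excluded from both lists
    if fold_change > 8:
        return "high"        # also satisfies the low-stringency criterion
    if fold_change > 3:
        return "low"
    return None
```

Note that every contig in the high-stringency list also meets the low-stringency criteria.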

5.3 Gene Homologue information

Reciprocal Blast best hits are used to define orthologues in other species or other assemblies of the same species. These data are useful for finding corresponding contigs in other assemblies. Please consult the respective Reference Section for the orthology search parameters.

5.4 Gene Ontology information

Gene Ontology information for a contig is taken from the GO information stored in the NCBI database for RefSeq proteins that show strong Blast homology with our contigs. See the Reference section for blast homology settings.

5.5 Blast Homology and protein domain information

All transcriptomes stored in PlanMine have been annotated with RefSeq BLAST homologies. To keep the information manageable, this table only displays the top three RefSeq hits, with preference given to well-annotated model organisms. This information provides a useful indication of what the function of the contig might be, but please note that automatically generated annotations are not perfect and that manual double checking is a good idea for important results. See the reference section for specific parameter settings.

Domain hits annotated by InterProScan are similarly shown to give a potential indication of protein function. For more information about domain annotation see the reference section.

5.6 Open Reading Frame information

Predicted Open Reading Frame information is displayed along with the ability to retrieve either the nucleotide sequence of an ORF or a translated ORF protein sequence. See the Reference section for specific parameter settings.

6 Home Page

The Home Page contains three main items: the Blast search box, the template search box, and thumbnail image links to the species pages of the planarian species currently represented in PlanMine.

In the top left is the Blast search box, which allows you to paste a sequence in fasta format into the text area field and select a transcriptome to search against. For users requiring more advanced searching capability we recommend visiting the dedicated Blast page by clicking on the Blast tab at the top of the page. Note that multi-query searches are currently not possible.

6.2 Templates

The template tabs to the right of the Blast search box allow easy access to a number of pre-configured searches for querying transcriptomes by functional annotation data. Such searches return tables of data that can be sorted and filtered with powerful data-type aware functionality.

Tabs allow users to

6.2.1 Compare S. mediterranea assemblies

With an S. mediterranea contig ID from one of the Smed assemblies it is possible to retrieve orthologous contigs in Smed assemblies contributed by other research groups and display just the orthologous IDs or the longest Open Reading Frame (ORF) corresponding to these orthologous IDs. The “Keyword -> Smed published transcripts” search is a useful shortcut for finding/analyzing published genes in PlanMine.

6.2.2 Compare Species

These predefined searches are useful for searching all contigs in PlanMine using a common gene symbol or keyword (these are annotated to a contig through their Blast orthologues in other annotated organisms in RefSeq), InterPro domain name (annotated using InterProScan), other protein domain name (also annotated through InterProScan but includes Pfam and Smart annotations), Treefam ID (annotated based on the Treefam database of gene families), or simply by a known contig ID.

6.2.3 Explore Transcripts

Here it is possible to retrieve sequences matching (partial matching supported) a particular contig ID or just return the longest Open Reading Frame (ORF) corresponding to an ID. It is also possible to search for keywords (e.g. gene name or description) associated with known published transcripts resulting mainly from classical cloning techniques.

6.2.4 Explore contig functions

These searches allow users to search PlanMine by a known contig ID and return a table of data that can be filtered and sorted. Four kinds of table can be returned, covering contig annotated domains, blast annotation, gene ontology terms, and Treefam domains. In addition, follow the tutorial below to see how to perform a GO term enrichment analysis on all the transcripts that are differentially expressed in the Bartscherer lab FoxD(RNAi) experiment.

For advanced searches, the QueryBuilder tool is very powerful; see the QueryBuilder section below for help.

6.3 List of flatworm species in PlanMine

This table shows pictures of the planarian species for which transcriptomes are currently available in PlanMine.

Clicking on the picture or name of a flatworm takes you to a species page with expert-curated information about the species (anatomy, habitat, regenerative ability, geographical distribution, and reproductive strategy) along with close-up images useful for identifying the species, a list of relevant taxonomic publications, a geographical distribution map (courtesy of the Turbellarian Taxonomy Database), and summary information about the transcriptomes in PlanMine relevant to this species. Note that the blue icon on the location map indicates the sampling location of the strain used for the PlanMine transcriptome. Further, we provide a link to the Turbellarian Taxonomy Database as the current expert repository of taxonomic information on planarians.

Clicking on “Blast against ID” takes users to a Blast search page with filters predefined to only search against that particular species.

7 Searching by Keyword

7.1 What can you search for?

In the top right hand corner of every page you will find the PlanMine Search Box. The Box allows you to search all of PlanMine for:

7.2 Interpreting and filtering the results

All occurrences of the search term are already grouped into different categories. Clicking a category filters the search results by that category (e.g. Contig, InterPro Domain, Other Protein Domain, GO Term). It is also possible to filter by organism and assembly (for species with more than one assembly). When results are filtered by category you can also save the results as a list for further use.

7.3 Saving results as tables and lists

Filtered search results can then be saved as lists for later use (for more information about lists see the lists section). Note that you need to create a “MyMine” account in order to store/retrieve lists created in a previous session.

8 Blast Page

8.1 Sequence Input

The input field takes a sequence in fasta format as is standard for Blast searching. The type of sequence, i.e., protein or nucleotide, is automatically recognized.

8.2 Selecting databases

We primarily provide nucleotide databases containing all contigs for an assembly in PlanMine. You can check the select all box to search against all assemblies or choose one or more specific assemblies of interest. If you enter a protein sequence, a tblastn search is automatically carried out.

8.3 Advanced Options

Any additional advanced parameters can be added here, such as an e-value cutoff (clicking on the question mark gives a full list of advanced options). Clicking the Blast button then performs the appropriate kind of blast based on the input sequence type and the selected Blast databases. Note that relaxing the e-value cutoff, i.e. raising it (e.g., -evalue 1.0), is sometimes useful for identifying weak homologies.

8.4 Blast Output

We have customised the Blast output provided by SequenceServer to add a query contig specific graphical view. For this reason we recommend submitting single query sequences rather than multi-fasta queries.

The graphical output shows a query centric view with a line representing the query sequence at the top and multiple coloured lines representing high scoring pairs (HSPs) below. HSPs are coloured by evalue with very confident hits in red and weak hits in blue. The query centric viewer also provides information on the parts of a transcript that are not part of the HSP. This is particularly useful for spotting fragmented contigs or so-called chimeras, erroneous fusions between two independent transcripts. Chimeras are a common error category in de novo assemblies. For instance, in queries against multiple Smed assemblies, a transcript displaying a long non-homologous extension in only one assembly is likely to be a chimera.

A standard Blast output summary table is also shown. It allows fasta sequences to be downloaded for later use and provides links to the individual HSP alignments.

Clicking a particular contig ID opens the contig page for this particular contig.

9 Templates

Templates provide predefined searches that are commonly useful and generally required by the community. Templates provide a simple form to paste your input data (keyword, id, list) and query the database using various filters to give a results table.

Pre-configured templates cover questions such as:

The templates tab shows a list of all available templates, including user-defined queries saved into MyMine if the user is logged in. Templates can be saved with a unique name and a description, along with keyword tags to enable easy searching.

For advanced searches, the QueryBuilder tool is very powerful; see the QueryBuilder section below for help.

10 QueryBuilder

QueryBuilder allows in-depth access to the underlying data model, providing powerful search capabilities through a user interface that is as intuitive as possible. However, such an interface can still be overwhelming for new users. In our tutorials section we give an example of how to use QueryBuilder.

11 Lists

Lists provide a powerful means for querying and manipulating sets of data at once. You can use lists to:

Creating lists from external data: You can simply cut/paste contig IDs from external files into the window (e.g., Excel files of RNAseq results) or you can import saved text files with contig IDs using the import button.

Creating lists within PlanMine: You can also create lists via the various search functions of PlanMine. Here we save the output of a keyword search as a list:

Our widgets also allow GO term or protein domain enrichment analysis to be done on saved lists.

12 Manipulating Tabular Result Data

By default, contig lists are displayed as follows:

12.1 Column Specific Options

Each column can be modified using the four icons in the header line (from left to right): “sort by this column”, “remove this column”, “toggle column visibility”, and finally the “column summary” button, which provides a very useful overview of your data and options to filter it appropriately.

The nice thing about the column summary is that the summary you get depends on the structure and type of the data in the column selected as can be seen below:

In each case, you can filter according to the type of data represented in the column, e.g., by a particular species in the first example or by a specified range of contig length in the second example.

12.2 Controlling Filters

The buttons above the table allow you to add further columns to the table (e.g., GO terms or BLAST annotations) and provide further filtering options. From left to right, the icons are: “Manage Columns”, “View active filters”, and the “Undo” button.

The manage columns button allows users to reorder columns in the table or even add or remove columns. The addition of new columns is based around the QueryBuilder functionality and the underlying data model. This is a powerful feature that allows you to make full use of the different layers of annotation in PlanMine. Even if you initially just import a list of contig IDs, you can, by adding columns, add all PlanMine features to your analysis, including for example BLAST descriptions, GO terms or differential expression status in a given RNAseq data set. Together with the various filter modalities, this allows you to identify the contigs that you are dealing with, carry out GO term enrichment analysis (see below) or learn which of your contigs are differentially expressed in one of the RNAi RNAseq experiments stored in PlanMine. The active filter button shows active filters and allows new ones to be added through a QueryBuilder-like interface. Finally, the Undo button allows you to revert previous actions and return to an unfiltered, default sorted table. The large number of column options and filter functionalities may appear a bit daunting at first, but an exploration of the various features is well worth the effort and will allow you to get the most out of PlanMine.

12.3 Exporting Tabular Data

Tabular data can also be exported in various ways using the buttons shown below:

From left to right, we have the “list button”, the “get code button”, and the “download button”. The download button is particularly useful for routine applications, allowing, amongst other options, the export of fasta files or Excel tables.

12.4 Enrichment Analysis

Widgets also allow GO term or protein domain enrichment analysis to be done on saved lists. Opening a saved list automatically brings up the widget window at the bottom of the list:

You can select the type of enrichment algorithm, the cutoff p-value or the ontology group to analyze. Below, you have the option to modify the background population list. The default is the full transcriptome assembly from which your list derives. Some searches may require modification of the default, for example lists combining multiple transcriptomes.

The table displays the individual GO terms with p-values below the cut-off and the number of contigs annotated with the term in your list. By selecting a particular GO term and clicking “view” or “download” you can retrieve the IDs of the contigs that are annotated with that GO term.
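To make the meaning of the enrichment p-value more concrete, the sketch below computes a one-sided hypergeometric p-value, which is the kind of statistic commonly used for this type of analysis. It is an illustration under that assumption: the widget's exact procedure, including its multiple-testing correction, is not reproduced here, and the function name is hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated contigs in a list by chance.

    k -- contigs in your list annotated with the GO term
    n -- size of your list
    K -- contigs in the background population annotated with the GO term
    N -- size of the background population (e.g. the full assembly)
    """
    if not (0 <= k <= n <= N and 0 <= K <= N):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., min(n, K) hits.
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, min(n, K) + 1))
    return favourable / total
```

The sketch also shows why the choice of background population matters: changing N and K changes the p-value even when your list stays the same.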

The Protein Domain Enrichment to the right provides the same analysis and retrieval options on basis of domain annotations.

12.5 Table Display and Navigation

13 Parameter Settings & reference information

13.1 Assembly

Raw reads were trimmed using cutadapt to remove common Illumina and PCR amplification adapters. Subsequently, low-complexity reads (more than 75% A or T) were removed. The remaining reads were assembled with Trinity (default settings). Raw contigs were searched with blastn against the refseq_genomic database, using an e-value cutoff of 0.0001, to identify and remove bacterial contamination. We considered contigs as bacterial if they matched a bacterial reference sequence with a percentage identity greater than 90% and a query coverage greater than 90%. Finally, we used a read direction bias (if present in the data) to correct contig directionality.
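The two filtering rules can be sketched as simple predicates. This is not the production code: the function names are hypothetical, and the low-complexity rule is interpreted here as "a single base, A or T, makes up more than 75% of the read", which is an assumption about the wording above.

```python
def is_low_complexity(read, threshold=0.75):
    """True if A alone or T alone exceeds the threshold fraction of the read.

    Assumed interpretation of 'more than 75% A or T'.
    """
    read = read.upper()
    if not read:
        return False
    return max(read.count("A"), read.count("T")) / len(read) > threshold


def is_bacterial_hit(percent_identity, query_coverage_percent):
    """True if a blastn hit to a bacterial reference meets both 90% criteria."""
    return percent_identity > 90 and query_coverage_percent > 90
```

Both thresholds are strict inequalities, matching the "greater than" wording in the text.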

13.2 Contig Annotation Methods

13.2.1 Blast and Gene Ontology Term Annotation

The NCBI blast+ tool was used in blastx mode (-evalue 0.0001 -num_alignments 100 -num_threads 6 -outfmt '6 std ppos slen qlen qframe sframe') to reveal sequence homologues in the refseq_protein database. Only those blastx hits were retained that fulfilled the following filter criterion: ((subject_coverage > 0.2 & query_coverage > 0.2) | e_value < 1E-30) & (PC_similarity > 40). To limit the number of blastx results in PlanMine, we preferred hits in mouse, human, or fruit fly over those in other species, and kept just the three most significant blastx hits for display in PlanMine. Our contigs are also assigned Gene Ontology (GO) terms based on the terms associated at the NCBI with the RefSeq proteins identified through high Blast homology to our contig.
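The filter criterion and the selection of display hits can be sketched as follows. The record type, field names and species labels are hypothetical, and the exact way in which species preference and significance are combined is an assumption (here: preferred species first, then ascending e-value).

```python
from dataclasses import dataclass


@dataclass
class BlastxHit:
    subject: str
    species: str
    evalue: float
    query_coverage: float     # fraction, 0 to 1
    subject_coverage: float   # fraction, 0 to 1
    pc_similarity: float      # percent positives (ppos)


# Mouse, human and fruit fly, as named in the text.
PREFERRED_SPECIES = {"Mus musculus", "Homo sapiens", "Drosophila melanogaster"}


def passes_filter(hit):
    """((subject_coverage > 0.2 & query_coverage > 0.2) | e_value < 1E-30) & (PC_similarity > 40)"""
    covered = hit.subject_coverage > 0.2 and hit.query_coverage > 0.2
    return (covered or hit.evalue < 1e-30) and hit.pc_similarity > 40


def select_display_hits(hits, limit=3):
    """Keep filtered hits, preferred species first, then by ascending e-value."""
    kept = [hit for hit in hits if passes_filter(hit)]
    kept.sort(key=lambda hit: (hit.species not in PREFERRED_SPECIES, hit.evalue))
    return kept[:limit]
```

With this ordering, a weaker hit in a preferred species is displayed ahead of a stronger hit in any other species.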

13.2.2 Protein Domain Annotation

To predict domain content, assembled contigs were translated in all 6 reading frames and split up at stop-codon positions, and the resulting chunks were processed using InterProScan 5. In addition to InterPro domains we also incorporated Treefam domains into our annotation workflow to provide gene-family annotations. These tools were run without any alteration to the default settings.
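The translation step can be illustrated with a short, self-contained sketch. It is not the script used for PlanMine: the function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are translated as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; '*' marks stops, 'X' unknown codons."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_chunks(seq, min_len=1):
    """Return (frame, peptide) pairs for all six frames, split at stop codons."""
    chunks = []
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            protein = translate(template[offset:])
            for chunk in protein.split("*"):
                if len(chunk) >= min_len:
                    chunks.append((f"{strand}{offset + 1}", chunk))
    return chunks
```

Each returned peptide is a stretch between two stop codons (or a sequence end), which is the kind of chunk that would then be passed to a domain prediction tool.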

13.2.3 Open Reading Frame Annotation

To annotate open reading frames in the assemblies, we used getorf (EMBOSS tool) and refined its results to also include fragments without a start or stop codon. Only open reading frames longer than 30 amino acids were included in PlanMine.

13.2.4 Gene Homologue information

To establish inter- and intra-species homology relations between all assembled transcripts, we performed a reciprocal blastp (-evalue 0.001) analysis between the longest ORFs of each Trinity graph component. This analysis was carried out separately for all pairwise assembly combinations. The resulting graph-component homology relations were extrapolated to all contigs of the corresponding graph components.
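The idea of reciprocal best hits can be sketched as follows. The input format (query, subject, e-value tuples) and the function names are hypothetical, and ranking hits by lowest e-value is an assumption; the ranking criterion used for PlanMine is not stated above.

```python
def best_hits(hits):
    """Map each query to its best subject (lowest e-value; first one wins ties)."""
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=0.001):
    """Return sorted (a, b) pairs that are each other's best hit.

    a_vs_b -- blastp hits with assembly A ORFs as queries
    b_vs_a -- blastp hits with assembly B ORFs as queries
    """
    forward = best_hits(h for h in a_vs_b if h[2] <= max_evalue)
    backward = best_hits(h for h in b_vs_a if h[2] <= max_evalue)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

A pair is reported only if the relation holds in both directions, which is what makes the criterion conservative.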

13.2.5 Contig Coverage Track

Sequencing data was mapped back against each assembly (bowtie2 -k15) and the alignment results converted into read coverage tracks using genomeCoverageBed.

13.2.6 Wnt and Frizzled Phylogenies

We also tested each assembly for the presence of Wnt and Frizzled gene family members: genes were extracted if they contained the respective domains and added to a multiple sequence alignment (MSA) containing genes of the same family. From these extended MSAs we inferred phylogenetic trees using clustalw (-BOOTSTRAP=1000 -KIMURA -BOOTLABELS=node). By doing so we hope to reveal how assembled Wnt and Frizzled transcripts relate to orthologues in other (planarian) reference species.

13.2.7 Creation of Protein Coding Fraction (PCF) for PlanMine Import

We filtered the raw assemblies to create a putative protein coding fraction (PCF) for import into the PlanMine database. Contigs were kept if they contained an open reading frame longer than 75 amino acids, an annotated domain, or a blast hit.
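Expressed as code, the PCF rule is a simple logical OR over the three criteria. The function and parameter names below are hypothetical and only illustrate the rule as stated.

```python
def in_protein_coding_fraction(longest_orf_aa, has_domain, has_blast_hit,
                               min_orf_aa=75):
    """True if a contig meets at least one of the three PCF criteria."""
    return longest_orf_aa > min_orf_aa or has_domain or has_blast_hit


def filter_pcf(contigs):
    """Keep contig IDs whose (orf_length, has_domain, has_blast_hit) record passes."""
    return [contig_id for contig_id, record in contigs.items()
            if in_protein_coding_fraction(*record)]
```

A contig with a short ORF is therefore still kept if it carries an annotated domain or a blast hit.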

Since most mentioned annotation steps are computationally demanding, processing was performed in a batch-wise manner using an HPC environment.

13.3 Assembly Reports

Assembly reports were created for each transcriptome in PlanMine, with different levels of reporting depending on whether we assembled from raw reads or just annotated a contributed assembly.

13.3.1 Read Quality Control

The first step in the workflow is to run the raw sequencing reads through FastQC to assess their quality.

We then perform contamination filtering by running blastn with an e-value cutoff of 0.0001 against the refseq_genomic database to identify and remove bacterial sequences. We consider contigs as bacterial if they match a bacterial reference sequence with a percentage identity > 90% and a query coverage > 90%. A list of potential contaminants is generated for subsequent filtering.

13.3.2 Annotation for Blast and Domains

Frequency distributions of the query coverage of Blast high scoring pairs are plotted to show how our contigs compare with the sequences they share Blast homology with.

The frequency of domains predicted by InterProScan domain prediction tools is also reported, along with the number of contigs that have at least one annotated domain.

Reading frames of predicted domains are also reported, which enables us to see if contigs are mostly in the forward (frames 1-3) or reverse direction (frames 4-6). Finally, frame specificity is reported: if more than one domain is present, we check whether the domains are in the same reading frame, which gives an indication of the presence of frameshifts.

13.3.3 Wnt and Frizzled Phylogenies

We also tested each assembly for the presence of Wnt and Frizzled gene family members: genes were extracted if they contained the respective domains and added to a multiple sequence alignment (MSA) containing genes of the same family. From these extended MSAs we inferred phylogenetic trees using clustalw (-BOOTSTRAP=1000 -KIMURA -BOOTLABELS=node). By doing so we hope to reveal how assembled Wnt and Frizzled transcripts relate to orthologues in other (planarian) reference species.

13.3.4 Open Reading Frame (ORFs)

To assess the open reading frames within our contigs three plots are created. Firstly, the distribution of the longest ORF for each contig is plotted, secondly the ratio of the longest ORF to contig length is plotted, and finally the relationship between contig length and the ratio of the longest ORF to contig length is plotted.

13.3.5 Contig Coverage

After an assembly is complete we map all the reads used to create the assembly onto the final assembly to create coverage tracks and assess the median coverage of our contigs.

13.3.6 Coverage of eukaryotic core genes

In order to assess how complete each transcriptome is, we blast all assembled transcriptomes against a set of eukaryotic core genes. We then order our contigs by their query coverage of this set and plot the query coverage of eukaryotic core genes against the percentile of our contigs that reach this coverage.

13.3.7 Assembly Filtering

In this section we report the fraction of contigs filtered using our filtering criteria of having either annotated Blast Homology, an annotated domain or an open reading frame of more than 75 amino acids. Plots are also created showing the contig length distribution of those filtered vs. those contigs kept, and the distribution of the longest Open Reading Frame per contig.

Venn diagrams showing the numbers of contigs with various annotation types used for filtering are shown.

Finally the number of filtered contigs (isoforms) for each Trinity graph component (gene) is shown.

14 Tutorials

14.1 Using Query Builder to find all Gene Ontology Terms for a contig

How are they connected? Our contigs are assigned Gene Ontology (GO) terms based on the terms associated with RefSeq genes from model organisms that show high Blast homology to our contig. We would like to get a list of GO terms that are associated with our contig of interest. In this example we use dd_Smed_v6_5822_0_1 for illustration purposes.

In order to begin our search we first must choose the data type we would like to use to start our Query. In our case this would be a contig ID. So we select contig and press the ‘select’ button.

Upon pressing select we are taken to another page that further allows refinement of our query. The whole data model hierarchy is shown on the left hand side with options to constrain by a field and show the field in the resulting table output. In this example we want to show and constrain (filter) by contig ID so we click on the blue show button and then the red constrain button next to Id underneath the contig section.

This brings up the following field where we can filter by our contig ID and click “Add to Query”:

We can now browse the hierarchy to locate the Gene Ontology information associated with our contig ID that we would like to display. Looking down the list we can see a heading called “GO Annotation”; clicking the plus sign expands this part of the hierarchy, allowing us to see its sub-elements. We are interested in Ontology Terms and so can click on this plus sign too. We can then click show for Name and Namespace to give the following display:

The right hand side of the screen now shows us that we are constraining by contig ID dd_Smed_v6_5822_0_1 and would like to display the Name and Namespace of associated GO terms. Underneath we also get a summary of what the resulting table will look like. We can drag columns around below to change column order.

Once we are happy, clicking show results gives the following output table:

The output table can then be saved, exported or further filtered as normal (see the manipulating tabular data section for more information).
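InterMine queries can also be written down as XML, which makes the structure of the query built above easier to see: one constraint on the contig ID and two output columns. The sketch below builds such a query description with the Python standard library only. The path names (`Contig.id`, `Contig.goAnnotation.ontologyTerm.name`, ...) and the model name are guesses derived from the labels in the Query Builder, not verified against the PlanMine data model, so please check them against the hierarchy shown in the Query Builder before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical paths, guessed from the Query Builder labels in this tutorial.
ID_PATH = "Contig.id"
VIEW_PATHS = [
    "Contig.id",
    "Contig.goAnnotation.ontologyTerm.name",
    "Contig.goAnnotation.ontologyTerm.namespace",
]


def build_go_query(contig_id, model="genomic"):
    """Return a query as an XML string: show VIEW_PATHS, constrain by contig ID.

    The model name 'genomic' is an assumption.
    """
    query = ET.Element("query", {"model": model, "view": " ".join(VIEW_PATHS)})
    ET.SubElement(
        query,
        "constraint",
        {"path": ID_PATH, "op": "=", "value": contig_id},
    )
    return ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    print(build_go_query("dd_Smed_v6_5822_0_1"))
```

The `view` attribute corresponds to the fields we clicked “show” for, and the `constraint` element to the field we clicked “constrain” for.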

14.2 Using a search template to find all Wnt genes in Polycelis tenuis (Pten)

Let’s say you want to retrieve all Wnt transcripts from Polycelis tenuis.

Again, note that this result will be entirely based on the parameter settings of our automated transcriptome annotation. Therefore, it is always a good idea to manually verify the search result.

14.3 Using a search template to find Gene Ontology terms enriched in significantly up-regulated differentially expressed contigs after FoxD knockdown by RNAi

We can use the pre-defined search from the contig functions tab in the template box on the home page to search for differentially expressed contigs for certain RNAi experiments.

Clicking on “Gene Knocked Down --> Differentially Expressed Contigs” takes us to the search configuration page. In our case the defaults are OK as we want to look for contigs affected by FoxD knockdown. It is important to note that you can currently only get differentially expressed contigs by selecting a gene used in an RNAi experiment, as this feature is still under development. We also restrict the search to differentially expressed contigs after FoxD knockdown by setting “Gene Expression Analysis Result > Differentially Expressed” to “= TRUE”. Finally we set “Gene Expression Analysis Result > Score” to “> 0” as in our case we want up-regulated genes after FoxD knockdown.

When we click Show Results we get a results table showing contigs that are up-regulated under FoxD knockdown under certain conditions. It is possible to further filter and sort this table but for our purposes we just want to get the unique contig list.
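If you export the results table and prefer to work on it offline, the same selection (gene knocked down, differentially expressed, score above zero, then unique contigs) can be sketched as below. The column names are hypothetical stand-ins for the template fields, and any rows passed in are your own exported data.

```python
def upregulated_contigs(rows, gene="FoxD"):
    """Unique contig IDs that are differentially expressed with score > 0.

    Each row is a dict with the hypothetical keys 'contig',
    'gene_knocked_down', 'differentially_expressed' and 'score'.
    The order of first appearance in the table is preserved.
    """
    seen = set()
    unique = []
    for row in rows:
        selected = (
            row["gene_knocked_down"] == gene
            and row["differentially_expressed"]
            and row["score"] > 0
        )
        if selected and row["contig"] not in seen:
            seen.add(row["contig"])
            unique.append(row["contig"])
    return unique
```

A contig can appear several times in the table (for example under different conditions), which is why duplicates are dropped here.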

In order to get this unique contig list we can click on “create new list” and select “All 53 Contigs” to open a list creation box.

Here we can give our list a suitable name and description for later use. If you are logged into MyMine this list can be saved for later use.

Once the list has been saved you will see a green success bar under the header bar. Clicking on the list name takes us to the newly created list page.

This page shows basic contig information and it is possible to filter, sort and add more columns, but for our purposes we are most interested in the enrichment analysis widgets below the list.

These enrichment analysis widgets show enriched Gene Ontology terms based on the terms associated with the contig’s significant BLAST homology assignments. Protein domain enrichment is based on the InterPro domains annotated for each contig. It is possible to change the background population to any predefined list, and to change the method for multiple testing correction, the p-value significance cut-off, and, in the case of GO enrichment, the ontology type (i.e. biological process, molecular function, cellular component).
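To give an idea of what such an enrichment calculation involves, here is a minimal sketch. It assumes a hypergeometric test for each term followed by Benjamini-Hochberg correction, which is a common combination for this kind of analysis; the text above does not state which test the widgets use, so treat this as an illustration and not as a description of the widget internals.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): probability of at least k annotated contigs in the list.

    k: contigs in the list carrying the term
    n: size of the list
    K: contigs in the background population carrying the term
    N: size of the background population
    """
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Changing the background population in the widget corresponds to changing `K` and `N` here, and the significance cut-off is applied to the adjusted p-values.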

We can now download these lists for later use if desired.

15 Submit Data to PlanMine

PlanMine is a community effort and we love to receive data for integration into PlanMine. Thank you VERY MUCH for your participation: the more data, the more rewarding the mining for all of us… As always, time is a precious resource and we therefore ask for your help in the submission/integration process. The text below specifies the submission details for 1) RNAseq dataset submissions; 2) transcriptome assemblies.

15.1 RNAseq Dataset Submissions

Please note that we can only integrate datasets that have been deposited in the Sequence Read Archive (SRA). SRA deposition is in any case a prerequisite for publication and we rely critically on their raw data storage and archiving services, which we cannot replicate at the required scale. To integrate new data sets, we need to know i) where the raw sequencing reads are stored and ii) the experimental details pertaining to the data set. The attached Excel form queries the relevant bits of information. Please e-mail the completed form as an attachment to planmine@mpi-cbg.de. Use the same e-mail address if you encounter any problems in the process.

The submission form background:

  1. The form contains pre-filled examples to help you in the completion process. Simply replace all red text with the data pertaining to your experiment.
  2. PlanMine groups differential gene expression data into 3 categories; the submission form for each is found on a separate tab. Just complete the form that best fits your experiment. For time course data, always use the time course form, even if it is a time course under different RNAi conditions. For cell-type specific sequencing data under RNAi, always use the RNAi form. ONLY SUBMIT ONE FORM: simply delete the other two tabs prior to mailing the file to us.
  3. PlanMine import requires retrieval of your raw reads, for which we need the individual SRR IDs of each sequencing sample. The various abbreviations used by SRA may be a bit confusing at first, but become quite clear in light of the underlying hierarchical data scheme:
  4. You will have to designate which of the SRR IDs refer to controls. This is important for correctly calculating the fold-change in gene expression. Note that the three differential gene expression data categories in PlanMine use different normalizations, hence the control designation depends on the experiment category (see examples on sheet).
  5. Designating replicates: all samples that have the same BioSource designation will automatically be considered replicates.
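The replicate rule in point 5 can be pictured as a simple grouping step. The sketch below is only an illustration of that rule, not the PlanMine import code; the function name and the input layout (pairs of SRR ID and BioSource designation, as entered on the form) are our assumptions.

```python
from collections import defaultdict


def group_replicates(samples):
    """Group SRR IDs by BioSource designation.

    samples: iterable of (srr_id, biosource) pairs.
    Returns a dict mapping each BioSource to its list of SRR IDs, so every
    list with more than one entry is a set of replicates.
    """
    groups = defaultdict(list)
    for srr_id, biosource in samples:
        groups[biosource].append(srr_id)
    return dict(groups)
```

In practice this means that a typo in a BioSource designation silently splits a replicate group, so it is worth checking the spelling on the form.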

15.2 Transcriptome Assembly Submissions

For integrating your transcriptome assembly into PlanMine, we need i) the actual assembly in the form of a FASTA file; ii) background information on the raw data and assembly techniques that were used for generating the assembly. For i), please drop an e-mail to planmine@mpi-cbg.de to arrange the transfer (ftp server, dropbox or similar). For ii), please complete the attached questionnaire. Please enter as much information as possible as this will greatly increase the utility of the transcriptome.
