chloroExtractor: extraction and assembly of the chloroplast genome from whole genome shotgun data

The chloroExtractor is a Perl-based program that provides a pipeline for extracting chloroplast DNA from plant whole genome shotgun data. Excessive amounts of chloroplast DNA can cause problems for the assembly of whole genome data. One solution to this problem can be a core extraction before sequencing, but this can be expensive. The chloroExtractor takes whole genome data and extracts the chloroplast DNA, so the different DNA fractions are easily separated. Furthermore, the chloroExtractor tries to assemble the extracted chloroplast DNA. This is possible because of the conserved nature of the chloroplast's primary and secondary structure. Through k-mer filtering, the k-mers that contain chloroplast sequences are extracted and can then be used to assemble the chloroplast genome in an assembly guided by several other chloroplast genomes.
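The k-mer filtering idea can be illustrated with a minimal Python sketch. It is not the chloroExtractor implementation (which is written in Perl); it assumes that reads are kept when a sufficient fraction of their k-mers also occurs in a set of reference chloroplast sequences. The function names, the default `min_fraction` threshold and the toy sequences are all hypothetical.

```python
# Illustrative sketch only: reference-based k-mer filtering of reads.
# All names and thresholds are assumptions, not chloroExtractor's API.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Yield the canonical form of every overlapping k-mer in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield canonical(seq[i:i + k])


def build_reference_kmers(references, k):
    """Collect all canonical k-mers found in the reference sequences."""
    reference_kmers = set()
    for seq in references:
        reference_kmers.update(kmers(seq, k))
    return reference_kmers


def filter_reads(reads, reference_kmers, k, min_fraction=0.5):
    """Keep reads whose share of reference k-mers is at least min_fraction."""
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k
        hits = sum(1 for kmer in read_kmers if kmer in reference_kmers)
        if hits / len(read_kmers) >= min_fraction:
            kept.append(read)
    return kept


if __name__ == "__main__":
    reference = build_reference_kmers(["ACGTACGTTAGC"], k=4)
    reads = ["ACGTACGT", "GGGGGGGG", "GCTAACGT"]
    print(filter_reads(reads, reference, k=4))
```

Using canonical k-mers lets a read match the reference regardless of the strand it was sequenced from. Real data would need a much larger k and a dedicated k-mer counter rather than an in-memory Python set.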

Freely available at GitHub:


Pollen/Plant ITS2 reference set for the RDP classifier (2014)

The identification of pollen plays an important role in ecology, palaeo-climatology, honey quality control and other areas. Currently, expert knowledge and reference collections are essential to identify pollen origin through light microscopy. Pollen identification through molecular sequencing and DNA barcoding has been proposed as an alternative approach, but the assessment of mixed pollen samples originating from multiple plant species is still a tedious and error-prone task. Next-generation sequencing has been proposed to avoid this hindrance. In this study we assessed mixed pollen probes through next-generation sequencing of amplicons from the highly variable, species-specific internal transcribed spacer two region of nuclear ribosomal DNA. Further, we developed a bioinformatic workflow to analyse these high-throughput data with a newly created reference database.

To evaluate the feasibility, we compared results from classical identification based on light microscopy from the same samples with our sequencing results. We assessed in total 16 mixed pollen samples, 14 originated from honeybee colonies and two from solitary bee nests. The sequencing technique resulted in higher taxon richness (deeper assignments and more identified taxa) compared to light microscopy. Abundance estimations from sequencing data were significantly correlated with counted abundances through light microscopy. Simulation analyses of taxon specificity and sensitivity indicate that 96% of taxa present in the database are correctly identifiable at the genus level and 70% at the species level. Next-generation sequencing thus presents a useful and efficient workflow to identify pollen at the genus and species level without requiring specialised palynological expert knowledge.

Reference: Keller A, N Danner, G Grimmer, M Ankenbrand, K von der Ohe, W von der Ohe, S Rost, S Härtel, I Steffan-Dewenter (2014) Evaluating multiplexed next-generation sequencing as a method in palynology for mixed pollen samples. Plant Biology.


16S2Genome: Genomic traits for 16S rDNA microbiota studies

Molecular sequencing techniques help to understand microbial biodiversity with regard to species richness, assembly structure and function. In this context, available methods are barcoding, metabarcoding, genomics and metagenomics. The first two are restricted to taxonomic assignments, whilst genomics only refers to functional capabilities of a single organism. Metagenomics, by contrast, yields information about organismal and functional diversity of a community. However, it is currently very demanding regarding labour and costs and thus not feasible for most laboratories. Here, we show in a proof-of-concept that computational approaches are able to retain functional information about microbial communities assessed through 16S rDNA (meta)barcoding by referring to reference genomes. We developed an automatic pipeline to show that such integration may infer preliminary or supplementary genomic content of a community.
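As a rough illustration of how reference genomes can lend functional information to 16S rDNA data, the following Python sketch computes abundance-weighted mean traits for a community. It is not the 16S2Genome pipeline; the weighting scheme, the trait names and the example values are assumptions made for this sketch, and it assumes every reference genome reports the same set of traits.

```python
# Illustrative sketch only: abundance-weighted genomic traits of a community.
# Trait names and values are hypothetical, not taken from 16S2Genome.


def community_traits(abundances, genome_traits):
    """Return (weighted mean traits, fraction of abundance with a reference genome).

    abundances:    {taxon: read count from 16S rDNA (meta)barcoding}
    genome_traits: {taxon: {trait name: numeric value}} for taxa with a genome
    """
    total = sum(abundances.values())
    covered = 0.0
    weighted_sums = {}
    for taxon, count in abundances.items():
        traits = genome_traits.get(taxon)
        if traits is None:
            continue  # taxon without a reference genome is skipped
        covered += count
        for trait, value in traits.items():
            weighted_sums[trait] = weighted_sums.get(trait, 0.0) + count * value
    if covered == 0:
        return {}, 0.0
    means = {trait: s / covered for trait, s in weighted_sums.items()}
    return means, covered / total


if __name__ == "__main__":
    counts = {"taxon_A": 30, "taxon_B": 10, "taxon_C": 60}
    traits = {
        "taxon_A": {"genome_size_mb": 4.0, "rrn_copies": 6},
        "taxon_B": {"genome_size_mb": 2.0, "rrn_copies": 2},
    }
    print(community_traits(counts, traits))
```

Reporting the covered fraction alongside the means makes it visible how much of the community the inference actually rests on, which matters when many taxa lack a sequenced relative.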

Reference: Keller A, Horn H, Förster F, Schultz J. (2014) Computational integration of genomic traits into 16S rDNA microbiota sequencing studies. Gene. 549:1 186–191


HMM based ITS2 annotation

The internal transcribed spacer 2 (ITS2) of the nuclear ribosomal repeat unit is one of the most commonly applied phylogenetic markers. It is a fast evolving locus, which makes it appropriate for studies at low taxonomic levels, whereas its secondary structure is well conserved, and tree reconstructions are possible at higher taxonomic levels. However, annotation of start and end positions of the ITS2 differs markedly between studies. This is a severe shortcoming, as prediction of a correct secondary structure by standard ab initio folding programs requires accurate identification of the marker in question. Furthermore, the correct structure is essential for multiple sequence alignments based on individual structural features. The present study describes a new tool for the delimitation and identification of the ITS2. It is based on hidden Markov models (HMMs) and verifies annotations by comparison to a conserved structural motif in the 5.8S/28S rRNA regions. Our method was able to identify and delimit the ITS2 in more than 30 000 entries lacking start and end annotations in GenBank. Furthermore, 45 000 ITS2 sequences with a questionable annotation were re-annotated. Approximately 30 000 entries from the ITS2-DB, which uses a homology-based method for structure prediction, were re-annotated. We show that the method is able to correctly annotate an ITS2 as small as 58 nt from Giardia lamblia and an ITS2 as large as 1160 nt from humans. Thus, our method should be a valuable guide during the first and crucial step in any ITS2-based phylogenetic analysis: the delineation of the correct sequence. Sequences can be submitted to the following website for HMM-based ITS2 delineation:

ITS2 database update III (with Dept. of Bioinformatics)

The internal transcribed spacer 2 (ITS2) is a widely used phylogenetic marker. In the past, it has mainly been used for species level classifications. Nowadays, a wider applicability becomes apparent. Here, the conserved structure of the RNA molecule plays a vital role. We have developed the ITS2 Database, which holds information about sequence, structure and taxonomic classification of all ITS2 in GenBank. In the new version, we use hidden Markov models (HMMs) for the identification and delineation of the ITS2, resulting in a major redesign of the annotation pipeline. This allowed the identification of more than 160,000 correct full length and more than 50,000 partial structures. In the web interface, these can now be searched with a modified BLAST considering both sequence and structure, enabling rapid taxon sampling. Novel sequences can be annotated using the HMM based approach and modelled according to multiple template structures. Sequences can be searched for known and newly identified motifs. Together, the database and the web server build an exhaustive resource for ITS2 based phylogenetic analyses.
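The delineation step can be pictured with a small Python sketch: once the flanking 5.8S and 28S regions have been located, the ITS2 is the stretch between them. The sketch does not run any HMM and is not the ITS2 Database pipeline; the `Hit` structure, the coordinates (as they might be reported by an external HMM search) and the score and length cut-offs are hypothetical.

```python
# Illustrative sketch only: cut out the ITS2 between two flanking hits.
# Hit coordinates would come from an external HMM search; none is run here.
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    start: int    # 0-based, inclusive
    end: int      # 0-based, exclusive
    score: float  # search score of the flank match


def delimit_its2(seq: str,
                 hit_58s: Optional[Hit],
                 hit_28s: Optional[Hit],
                 min_score: float = 0.0,
                 min_len: int = 50,
                 max_len: int = 1500) -> Optional[str]:
    """Return the sequence between the 5.8S end and the 28S start, or None.

    The cut-offs are assumptions for this sketch, not published parameters.
    """
    if hit_58s is None or hit_28s is None:
        return None  # both flanks are required
    if hit_58s.score < min_score or hit_28s.score < min_score:
        return None  # flank match too weak
    start, end = hit_58s.end, hit_28s.start
    if not (min_len <= end - start <= max_len):
        return None  # flanks overlap, are out of order, or too far apart
    return seq[start:end]


if __name__ == "__main__":
    toy = "A" * 20 + "C" * 60 + "G" * 20
    its2 = delimit_its2(toy, Hit(5, 20, 10.0), Hit(80, 95, 12.0))
    print(len(its2) if its2 is not None else None)
```

Requiring both flanks and a plausible length is a simple way to reject partial or misordered matches before any structure prediction is attempted.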

Reference: Koetschan, C., Förster, F., Keller, A., Schleicher, T., Ruderisch, B., Schwarz, R., Müller, T., Wolf, M., and Schultz, J. (2010) The ITS2 Database III—sequences and structures for phylogeny, Nucleic Acids Research, Oxford Univ Press 38, D275–D279.

ITS2 database: