The Illumina draft data was also assembled with Velvet, version 1.1.05 [35], and the consensus sequences were computationally shredded into 1.5 Kb overlapping fake reads (shreds). The Illumina draft data was then assembled again with Velvet, using the shreds from the first Velvet assembly to guide the second assembly. The consensus from the second Velvet assembly was likewise shredded into 1.5 Kb overlapping fake reads. The fake reads from the Allpaths assembly and both Velvet assemblies, together with a subset of the Illumina CLIP paired-end reads, were assembled using parallel phrap, version 4.24 (High Performance Software, LLC). Possible mis-assemblies were corrected with manual editing in Consed [36-38]. Gap closure was accomplished using repeat resolution software (Wei Gu, unpublished) and sequencing of bridging PCR fragments with PacBio technology (Cliff Han, unpublished).
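The shredding step described above can be illustrated with a minimal Python sketch. It is not the pipeline's actual tool: the function name `shred` and the 500 bp overlap are assumptions made for illustration, since the text gives only the 1.5 Kb shred length and does not state the overlap.

```python
from typing import List, Tuple


def shred(consensus: str, shred_len: int = 1500, overlap: int = 500) -> List[Tuple[int, str]]:
    """Cut a consensus sequence into overlapping fake reads (shreds).

    shred_len follows the 1.5 Kb figure in the text; the overlap value
    is an assumed placeholder, not a documented parameter.
    Returns (start_offset, shred_sequence) pairs.
    """
    if shred_len <= 0:
        raise ValueError("shred_len must be positive")
    if not 0 <= overlap < shred_len:
        raise ValueError("overlap must satisfy 0 <= overlap < shred_len")
    if not consensus:
        return []

    step = shred_len - overlap  # distance between consecutive shred starts
    shreds = []
    start = 0
    while True:
        end = min(start + shred_len, len(consensus))
        shreds.append((start, consensus[start:end]))
        if end == len(consensus):  # the last shred reaches the contig end
            break
        start += step
    return shreds


if __name__ == "__main__":
    toy_contig = "ACGT" * 1000  # 4,000 bp toy consensus
    for offset, fake_read in shred(toy_contig):
        print(offset, len(fake_read))
```

Because consecutive shreds share `overlap` bases, a downstream assembler can rejoin them into the original consensus while also merging them with reads from other sources.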
For the improved high-quality draft, 4 PCR PacBio consensus sequences were completed to close gaps and to raise the quality of the final sequence. The estimated total size of the genome is 7 Mb, and the final assembly is based on 6,036 Mb of Illumina draft data, which provides an average 862× coverage of the genome.

Genome annotation

Genes were identified using Prodigal [39] as part of the DOE-JGI annotation pipeline [40], followed by a round of manual curation using the JGI GenePRIMP pipeline [41]. The predicted CDSs were translated and used to search the National Center for Biotechnology Information (NCBI) non-redundant database, UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and InterPro databases.
These data sources were combined to assert a product description for each predicted protein. Non-coding genes and miscellaneous features were predicted using tRNAscan-SE [42], RNAmmer [43], Rfam [44], TMHMM [45], and SignalP [46]. Additional gene prediction analyses and functional annotation were performed within the Integrated Microbial Genomes (IMG-ER) platform [47,48].

Genome properties

The genome is 6,905,599 nucleotides long with 60.67% GC content (Table 4) and comprises 7 scaffolds (Figures 3, 4, 5, 6, 7, 8 and 9) of 7 contigs. Of a total of 6,836 genes, 6,750 were protein-encoding genes and 86 were RNA-only encoding genes. The majority of genes (77.98%) were assigned a putative function, whilst the remaining genes were annotated as hypothetical. The distribution of genes into COG functional categories is presented in Table 5.
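The headline figures quoted in the text (average coverage and gene tallies) can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the helper name `mean_coverage` is hypothetical, and the coverage is computed against the estimated 7 Mb genome size rather than the final assembled length.

```python
def mean_coverage(sequenced_mb: float, genome_mb: float) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return sequenced_mb / genome_mb


# Figures quoted in the text.
ILLUMINA_MB = 6036          # Illumina draft data, in Mb
ESTIMATED_GENOME_MB = 7     # estimated genome size, in Mb
TOTAL_GENES = 6836
PROTEIN_GENES = 6750
RNA_GENES = 86

coverage = mean_coverage(ILLUMINA_MB, ESTIMATED_GENOME_MB)
protein_fraction = PROTEIN_GENES / TOTAL_GENES

if __name__ == "__main__":
    print(f"average coverage: about {coverage:.0f}x")
    print(f"gene tallies consistent: {PROTEIN_GENES + RNA_GENES == TOTAL_GENES}")
    print(f"protein-encoding share: {protein_fraction:.2%}")
```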
Table 4 Genome Statistics for Rhizobium leguminosarum bv. trifolii SRDI565

Figure 3 Graphical map of the genome of Rhizobium leguminosarum bv. trifolii strain SRDI565 (scaffold 1.1). From bottom to the top of each scaffold: Genes on forward strand (color by COG categories as denoted by the IMG platform), Genes on reverse strand (color …

Figure 4 Graphical map of the genome of Rhizobium leguminosarum bv. trifolii strain SRDI565 (scaffold 2.2).