What is "GenBank - SnapGene" Format?

SnapGene allows me to import and export "GenBank - SnapGene" format. What is this format and how can I use it?

The "GenBank - SnapGene" format adheres to [GenBank | GenPept] conventions. However, it uses additional qualifiers to encode SnapGene-specific information that the [GenBank | GenPept] format does not typically capture, such as:

  • the sequence author
  • custom map label and alias
  • feature directionality, segment ranges, segment colors, segment names, and cleavage arrow positions
  • primer name, description, sequence, color, 5' phosphorylation, and date added

You can use the following format specifications in your external workflows to generate "GenBank - SnapGene" format files that SnapGene can read.

File Header

For the additional information in qualifiers and other fields to be decoded, the file must clearly advertise that it uses the "GenBank - SnapGene" format, as follows:

1. The LOCUS name must contain Exported.

LOCUS Exported 4373 bp ds-DNA circular SYN 07-MAR-2017

2. The last REFERENCE must include the TITLE Direct Submission as well as a JOURNAL entry that contains SnapGene.

REFERENCE 2 (bases 1 to 4373)
    AUTHORS  [Listed here is the Sequence Author or a "." character.]
    TITLE    Direct Submission
    JOURNAL  Exported Apr 28, 2017 from         
             SnapGene 3.3.45 http://www.snapgene.com

Note that these requirements are case-sensitive.
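As a rough illustration, the two header requirements could be checked in Python along the following lines. The function name is_snapgene_genbank is hypothetical and not part of SnapGene. The sketch assumes the usual GenBank layout, in which top-level keywords start in the first column and AUTHORS, TITLE, and JOURNAL are indented beneath REFERENCE.

```python
import re

def is_snapgene_genbank(text):
    """Return True if the header advertises the "GenBank - SnapGene" format."""
    lines = text.splitlines()

    # Requirement 1: the LOCUS name (second token) must contain "Exported".
    locus = next((ln for ln in lines if ln.startswith("LOCUS")), "")
    parts = locus.split()
    if len(parts) < 2 or "Exported" not in parts[1]:
        return False

    # Locate the last REFERENCE line in the header (everything before FEATURES).
    last_ref, end = None, len(lines)
    for i, ln in enumerate(lines):
        if ln.startswith("FEATURES"):
            end = i
            break
        if ln.startswith("REFERENCE"):
            last_ref = i
    if last_ref is None:
        return False

    # Collect the indented sub-fields of that REFERENCE, joining wrapped lines.
    fields, key = {}, None
    for ln in lines[last_ref + 1:end]:
        if ln and not ln[0].isspace():
            break  # reached the next top-level keyword, e.g. COMMENT
        m = re.match(r"\s+(AUTHORS|TITLE|JOURNAL)\s+(.*)", ln)
        if m:
            key = m.group(1)
            fields[key] = m.group(2).strip()
        elif key and ln.strip():
            fields[key] += " " + ln.strip()

    # Requirement 2 (case-sensitive): TITLE and JOURNAL of the last REFERENCE.
    return (fields.get("TITLE") == "Direct Submission"
            and "SnapGene" in fields.get("JOURNAL", ""))
```

The comparisons are deliberately case-sensitive, so a JOURNAL entry that mentions "snapgene" in lowercase would not pass this check.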

Sequence Label

If specified, the custom map label is encoded using the KEYWORDS field:

KEYWORDS    Custom Map Label

Sequence Alias

If an alias is required, it is encoded using the last COMMENT field:

COMMENT Alias: pBSG307

Sequence Author

If the last REFERENCE has text other than "." in the AUTHORS field, that text will be imported as the Sequence Author. An alternative JOURNAL entry would be:

 JOURNAL SnapGene GenBank format

Features

Information about features is encoded using an additional /note qualifier.  

Such information is only decoded from the last /note qualifier, if present. Key/value pairs are encoded as key: value. Multiple key/value pairs are separated by semicolons, for example:

 /note=color: #ffd281; direction: BOTH

Any prior /note qualifiers hold actual notes about the feature.
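A minimal Python sketch of how an external tool might split such a note value into its key/value pairs is given here. The function name parse_snapgene_note is hypothetical, and the sketch assumes the surrounding quotes (if any) have already been removed from the qualifier value.

```python
def parse_snapgene_note(value):
    """Split 'key: value; key: value' into a dict.

    Entries without a colon (bare flags) are stored with the value True.
    """
    result = {}
    for item in value.split(";"):
        item = " ".join(item.split())  # collapse line wraps and extra spaces
        if not item:
            continue
        key, sep, val = item.partition(":")
        if sep:
            result[key.strip()] = val.strip()
        else:
            result[item] = True
    return result
```

For the example above, parse_snapgene_note("color: #ffd281; direction: BOTH") gives a dictionary with the keys "color" and "direction".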

Directionality

Note that directionality determines how the direction of a feature is depicted in SnapGene.  

The orientation of a feature, if on the reverse strand, is defined using the complement qualifier:

CDS             complement(3518..4378)

The GenBank format provides ambiguous information regarding directionality. There is no way to encode bidirectionality, and directionality can only be implied using the /translation and /direction qualifiers. As a result, it is impossible to detect the directionality of non-translated forward directional features. We encode the directionality as:

direction: [RIGHT | LEFT | BOTH]

Directionality is omitted if the feature is nondirectional, or if the directionality is implicit because the feature is translated or has a /direction qualifier.

Feature Color

The color of a single-segment feature is encoded as color: #RRGGBB using the standard hexadecimal format for RGB colors.

/note="color: #FF0000"

Line appearance is encoded as #------.
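When generating files, the color and direction keys for a single-segment feature could be assembled as in the following Python sketch. The function name feature_note and its parameters are hypothetical; the implicit_direction flag stands in for the case where the feature is translated or has a /direction qualifier. The line-appearance encoding is not covered here.

```python
import re

def feature_note(color=None, direction=None, implicit_direction=False):
    """Build the final /note qualifier for a single-segment feature.

    Returns None when there is nothing to encode.
    """
    parts = []
    if color is not None:
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", color):
            raise ValueError("color must look like #RRGGBB")
        parts.append(f"color: {color}")
    # Directionality is omitted for nondirectional features (direction=None)
    # and when it is already implied elsewhere in the feature.
    if direction is not None and not implicit_direction:
        if direction not in ("RIGHT", "LEFT", "BOTH"):
            raise ValueError("direction must be RIGHT, LEFT, or BOTH")
        parts.append(f"direction: {direction}")
    return '/note="' + "; ".join(parts) + '"' if parts else None
```

The sketch does not wrap long lines to the GenBank column width, so that step would need to be handled separately by the caller.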

Multi-Segment Features

In order to encode the name, color, and range for segments in a multi-segment feature, a multi-line note qualifier is employed whose value is enclosed in quotes.

/note="This FEATURE has N segments:
   1: # .. # / #ff0000 / First Segment Name
   ...
   N: # .. # / #0000ff / Last Segment Name"

For example:

/note="This bidirectional feature has 2 segments:
   1: 1001 .. 1298 / #ff0000 / One
   2: 1299 .. 1596 / #00ff00 / Two"

The first line is used to indicate the number of segments as well as the feature directionality and can take the following forms:

This feature has # segments:
This forward directional feature has # segments:
This reverse directional feature has # segments:
This bidirectional feature has # segments:

The first variant is used for a non-directional feature, or for a feature in which the directionality is implicit because the feature is translated or has a /direction qualifier.

Subsequent lines are used to encode information about individual non-gap segments. Each segment is encoded using the following format, where SEGMENT_NAME and the preceding slash are only included for named segments.

SEGMENT_NUMBER: FIRST_BASE .. LAST_BASE / #RRGGBB / SEGMENT_NAME

For example:

1: 1000 .. 2000 / #ff0000 / Red Segment

For each segment, the segment number is followed by a colon, and the range, color, and, if specified, name are separated by /'s. Segments are always listed in order, although gaps between segment ranges may be present.
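The multi-segment layout described above could be read back with a Python sketch such as the one here. The function name parse_segments and the returned dictionary keys are hypothetical, and the note value is assumed to have had its enclosing quotes removed. Lines that do not look like segment lines (for example a trailing cleavage line) are skipped.

```python
import re

HEADER = re.compile(
    r"This (?:(forward directional|reverse directional|bidirectional) )?"
    r"feature has (\d+) segments?:"
)
SEGMENT = re.compile(
    r"\s*(\d+):\s*(\d+)\s*\.\.\s*(\d+)\s*/\s*(#[0-9A-Fa-f]{6})"
    r"(?:\s*/\s*(.*\S))?\s*$"
)

def parse_segments(note):
    """Return (directionality phrase or None, list of segment dicts)."""
    lines = note.strip().splitlines()
    head = HEADER.match(lines[0].strip()) if lines else None
    if head is None:
        raise ValueError("not a multi-segment note")
    segments = []
    for line in lines[1:]:
        m = SEGMENT.match(line)
        if m:
            number, first, last, color, name = m.groups()
            segments.append({
                "number": int(number),
                "first": int(first),
                "last": int(last),
                "color": color,
                "name": name,  # None for unnamed segments
            })
    return head.group(1), segments
```

The first element of the result is the directionality phrase from the first line, or None for the plain "This feature has # segments:" form.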

Cleavage Arrows

If a cleavage arrow is present between two segments or at an end of the feature, this information is encoded at the end of the last /note qualifier, using the following format:

Cleavage site after base [BASE NUMBER]

or,

Cleavage sites after bases [BASE NUMBER, BASE NUMBER, ...]

For example, the last line of the last /note qualifier might read:

Cleavage sites after bases 5, 16, 200
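A short Python sketch for extracting the cleavage positions from the last line of a note value is given here. The function name parse_cleavage_sites is hypothetical, and the sketch assumes the enclosing quotes have already been stripped from the qualifier value.

```python
import re

CLEAVAGE = re.compile(
    r"Cleavage sites? after bases? (\d+(?:\s*,\s*\d+)*)\s*$"
)

def parse_cleavage_sites(note):
    """Return the base numbers listed on the last line of the note, if any."""
    lines = note.strip().splitlines()
    if not lines:
        return []
    m = CLEAVAGE.search(lines[-1])
    if m is None:
        return []
    return [int(n) for n in m.group(1).split(",")]
```

Both the singular and the plural form of the line are accepted, and a note without a cleavage line yields an empty list.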

Primers

Primers are encoded as primer_bind features.

Any primer_bind features that include sequence data are imported not as primer_bind features, but rather as SnapGene primers. If there are multiple binding sites for a primer, only one copy of the primer is generated during import.

Name

The primer name is encoded using the /label qualifier, e.g.

/label=Primer Name

Description

If present, the primer description is recorded using a /note qualifier, e.g.

/note="This is a primers description."

Other Attributes

The final /note qualifier uses keys to specify primer color, primer sequence, phosphorylation (if present), and if known, the date the primer was added to the sequence.

Keys are separated from values using Key: Value format. Multiple Key: Value pairs are separated by semicolons.

/note="color: orange; 
       sequence: GCTCATGCCATTGGCGTTAACTCTGCTTCTTGGGCTCCAGCTACC; 
       added: 2021-04-05; 
       5' phosphorylated"

Color

The color value must be one of the following, written in lowercase:

[ black | red | orange | green | blue | purple | gray ]

Sequence

The primer sequence is case sensitive and can include a mixture of upper and lower case characters.

Date

The UTC date the primer was added to the file, if known, is encoded with:

YYYY-MM-DD

5' Phosphorylation

If the primer is 5' phosphorylated, this is indicated at the end of the final /note qualifier:

5' phosphorylated
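Putting the primer attributes together, the final /note qualifier could be generated with a Python sketch like the one here. The function name primer_note and its parameters are hypothetical. The sketch emits the value on a single line and does not wrap it to the GenBank column width.

```python
from datetime import date

ALLOWED_COLORS = {"black", "red", "orange", "green", "blue", "purple", "gray"}

def primer_note(color, sequence, added=None, phosphorylated=False):
    """Build the final /note qualifier for a primer_bind feature."""
    if color not in ALLOWED_COLORS:
        raise ValueError(f"unsupported primer color: {color!r}")
    # The sequence is written as given, so its upper/lower case is preserved.
    parts = [f"color: {color}", f"sequence: {sequence}"]
    if added is not None:
        parts.append(f"added: {added.isoformat()}")  # date -> YYYY-MM-DD
    if phosphorylated:
        parts.append("5' phosphorylated")  # always the last entry
    return '/note="' + "; ".join(parts) + '"'
```

For instance, primer_note("black", "aaacactGGCCAAATAagaacgtagaag", date(2021, 4, 5), True) yields a single-line version of the M13rev note in the example file.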

Example

LOCUS       Exported                2894 bp DNA     linear   UNA 05-APR-2021
KEYWORDS    Custom Map Label
REFERENCE   1  (bases 1 to 2894)
  AUTHORS   .
  TITLE     Direct Submission
  JOURNAL   Exported Monday, Apr 5, 2021 from SnapGene 5.3.0
            https://www.snapgene.com
COMMENT     Alias: This is an example of an alias
FEATURES    Location/Qualifiers
     misc_feature    740..1000
                     /label=Reverse Directional Green Feature
                     /note="color: #00FF00; direction: LEFT"
     misc_feature    1001..1894
                     /label=Simple Name
                     /note="This bidirectional feature has 3 segments:
                      1: 1001 .. 1298 / #ff0000 / First Named Segment
                      2: 1299 .. 1596 / #00ff00
                      3: 1597 .. 1894 / #0000ff / Last Named Segment
                     Cleavage site after base 1800"
     primer_bind     1427..1471
                     /label=FOR
                     /note="Here is the forward primers description."
                     /note="color: orange; sequence: 
                     GCTCATGCCATTGGCGTTAACTCTGCTTCTTGGGCTCCAGCTACC; added: 
                     2021-04-05"
     primer_bind     complement(1649..1676)
                     /label=M13rev
                     /note="standard sequencing primer"
                     /note="color: black; sequence: 
                     aaacactGGCCAAATAagaacgtagaag; added: 2021-04-05;
                     5' phosphorylated"

Importing

Information at the top of the [GenBank | GenPept] file is imported into the Description Panel. Most of the conversion follows an obvious path, but the following should be noted:

  • The DEFINITION is imported into the Description box.
  • The KEYWORDS field is normally not used, in which case it is populated with a "." character. If any other text is in the KEYWORDS field, it is recognized upon import as a sequence label for the map.
  • A Natural DNA sequence has a three-letter code other than SYN in the LOCUS line. The SOURCE is imported as the Source (called the Source Organism prior to version 4.1), and the Sequence Class is imported from the three-letter code in the LOCUS line.
  • A Synthetic DNA sequence has the three-letter code SYN in the LOCUS line. If present, the /lab_host qualifier in the source feature is imported as the Laboratory Host (called the Laboratory Host Organism prior to SnapGene version 4.1).
  • In the FEATURES section, the contents of the first /label qualifier are used as the default choice for the feature name, for example:

/label=AmpR

Note that prior to SnapGene 3.3.4, the feature name was typically encoded in the first /note qualifier, and that format is still recognized by the importer in the absence of a /label qualifier.
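The naming rule described above can be sketched in Python as follows. This is only an illustration of the stated fallback order, not SnapGene's actual importer; the function name feature_name and the list-of-pairs representation of qualifiers are assumptions made for the example.

```python
def feature_name(qualifiers):
    """Pick a default feature name from ordered (key, value) qualifier pairs.

    The first /label wins; without any /label, the first /note is used.
    """
    for wanted in ("label", "note"):
        for key, value in qualifiers:
            if key == wanted:
                return value
    return None
```

Note that in a "GenBank - SnapGene" file the last /note qualifier may carry color and direction data rather than a name, so a caller would likely want to exclude that note before applying the fallback.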