What is "GenBank - SnapGene" Format?
SnapGene allows me to import and export "GenBank - SnapGene" format. What is this format and how can I use it?
The "GenBank - SnapGene" format adheres to [GenBank | GenPept] conventions. However, additional qualifiers are used to encode information used by SnapGene that is not typically captured in the [GenBank | GenPept] format, such as:
- the sequence author
- custom map label and alias
- feature directionality, segment ranges, segment colors, segment names, and cleavage arrow positions
- primer name, description, sequence, color, 5' phosphorylation, and date added
You can use the following format specifications in your external workflows to generate "GenBank - SnapGene" format files that can be read by SnapGene.
In order to be able to decode additional information from qualifiers and other fields, it is necessary to clearly advertise the file is using the "GenBank - SnapGene" format as follows:
1. The LOCUS name must contain "Exported":
LOCUS Exported 4373 bp ds-DNA circular SYN 07-MAR-2017
2. The last REFERENCE must include the TITLE "Direct Submission" as well as a JOURNAL entry that contains "SnapGene":
REFERENCE   2  (bases 1 to 4373)
  AUTHORS   [Listed here is the Sequence Author or a "." character.]
  TITLE     Direct Submission
  JOURNAL   Exported Apr 28, 2017 from SnapGene 3.3.45
            http://www.snapgene.com
Note that these requirements are case-sensitive.
If specified, the custom map label is encoded using the KEYWORDS field:
KEYWORDS Custom Map Label
If an alias is required, it is encoded using the last COMMENT field:
COMMENT Alias: pBSG307
If the last REFERENCE has text other than "." in the AUTHORS field, that text will be imported as the Sequence Author. An alternative JOURNAL entry would be:
JOURNAL SnapGene GenBank format
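As a rough illustration of how an external script might read these header values back, the following Python sketch scans the header for the KEYWORDS, AUTHORS, and COMMENT fields. The function name read_header_fields is hypothetical, and the sketch assumes each field fits on a single line; it is not SnapGene's own importer.

```python
def read_header_fields(record: str) -> dict:
    """Collect the map label, sequence author, and alias from GenBank header text.

    Assumption: each field value sits on one line (no continuation lines).
    """
    map_label = author = last_comment = None
    for line in record.splitlines():
        keyword, _, value = line.strip().partition(" ")
        value = value.strip()
        if keyword == "FEATURES":
            break  # the header ends where the feature table begins
        if keyword == "KEYWORDS":
            # A lone "." means the field is unused.
            map_label = value if value not in ("", ".") else None
        elif keyword == "AUTHORS":
            # Overwritten on every REFERENCE, so the last REFERENCE wins.
            author = value if value not in ("", ".") else None
        elif keyword == "COMMENT":
            last_comment = value  # only the last COMMENT is inspected
    alias = None
    if last_comment and last_comment.startswith("Alias:"):
        alias = last_comment[len("Alias:"):].strip()
    return {"map_label": map_label, "author": author, "alias": alias}
```

A file with no custom label, a "." author, and no alias comment yields None for all three values.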
Information about features is encoded using an additional /note qualifier. Such information is only decoded from the last /note qualifier, if present. Key/value pairs are encoded as key: value. Multiple values are separated by semicolons, for example:
/note=color: #ffd281; direction: BOTH
Any other /note qualifiers hold actual notes about the feature.
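The key: value convention can be decoded with a few lines of Python. This is a minimal sketch; parse_note is a hypothetical helper name, and it expects the qualifier value with the /note= prefix and any surrounding quotes already removed.

```python
def parse_note(value: str) -> dict:
    """Split 'key: value; key: value' text into a dictionary."""
    pairs = {}
    for item in value.split(";"):
        key, separator, val = item.partition(":")
        if separator:  # skip fragments that are not key: value pairs
            pairs[key.strip()] = val.strip()
    return pairs


print(parse_note("color: #ffd281; direction: BOTH"))
```

Fragments without a colon are ignored here, which is a design choice of this sketch rather than documented importer behavior.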
Note that directionality determines how the direction of a feature is depicted in SnapGene.
The orientation of a feature, if on the reverse strand, is defined using complement in the feature location, for example complement(1649..1676).
The GenBank format provides ambiguous information regarding directionality. There is no way to encode bidirectionality, and directionality can only be implied using the /direction qualifiers. As a result, it is impossible to detect the directionality of non-translated forward directional features. We encode the directionality as:
direction: [RIGHT | LEFT | BOTH]
Directionality is omitted if the feature is nondirectional, or if the directionality is implicit because the feature is translated or has a
The color of a single-segment feature is encoded as color: #RRGGBB, using the standard hexadecimal format for RGB colors.
Line appearance is encoded as
In order to encode the name, color, and range for segments in a multi-segment feature, a multi-line note qualifier is employed whose value is enclosed in quotes.
/note="This FEATURE has N segments:
1: # .. # / #ff0000 / First Segment Name
...
N: # .. # / #0000ff / Last Segment Name"

/note="This bidirectional feature has 2 segments:
1: 1001 .. 1298 / #ff0000 / One
2: 1299 .. 1596 / #00ff00 / Two"
The first line is used to indicate the number of segments as well as the feature directionality and can take the following forms:
This feature has # segments:
This forward directional feature has # segments:
This reverse directional feature has # segments:
This bidirectional feature has # segments:
The first variant is used for a non-directional feature, or for a feature in which the directionality is implicit because the feature is translated or has a
Subsequent lines are used to encode information about individual non-gap segments. Each segment is encoded using the following format, where SEGMENT_NAME and the preceding slash are only included for named segments.
SEGMENT_NUMBER: FIRST_BASE .. LAST_BASE / #RRGGBB / SEGMENT_NAME
1: 1000 .. 2000 / #ff0000 / Red Segment
For each segment, the segment number, range, color, and, if specified, name, are separated by slashes (/). Segments are always listed in order, although gaps between segment ranges may be present.
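The multi-line layout described above can be generated with a short Python sketch. The name segments_note is hypothetical. Two assumptions are made: segments are numbered consecutively from 1, and RIGHT corresponds to the "forward directional" first line (LEFT as "reverse" matches the full example in this article).

```python
FIRST_LINE = {
    None: "This feature has",
    "RIGHT": "This forward directional feature has",  # assumed mapping
    "LEFT": "This reverse directional feature has",
    "BOTH": "This bidirectional feature has",
}


def segments_note(segments, direction=None):
    """Build the note text for a multi-segment feature.

    segments: iterable of (first_base, last_base, color, name) tuples,
    already in order; use name=None for an unnamed segment.
    """
    segments = list(segments)
    lines = [f"{FIRST_LINE[direction]} {len(segments)} segments:"]
    for number, (first, last, color, name) in enumerate(segments, start=1):
        line = f"{number}: {first} .. {last} / {color}"
        if name:  # the name and its slash appear only for named segments
            line += f" / {name}"
        lines.append(line)
    return "\n".join(lines)


print(segments_note(
    [(1001, 1298, "#ff0000", "One"), (1299, 1596, "#00ff00", "Two")],
    direction="BOTH",
))
```

The returned text still needs to be wrapped in quotes and prefixed with /note= when written to the file.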
If a cleavage arrow is present between two segments or at an end of the feature, this information is encoded at the end of the last /note qualifier, using the following format:
Cleavage site after base [BASE NUMBER]
Cleavage sites after bases [BASE NUMBER, BASE NUMBER, ...]
For example, the last line of the last /note qualifier might read:
Cleavage sites after bases 5, 16, 200
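Reading the cleavage positions back is a matter of matching the last line of the note. The sketch below uses a hypothetical helper, cleavage_positions, and a regular expression that is slightly more lenient than the two forms shown above (it does not check that singular and plural wording agree).

```python
import re

_CLEAVAGE = re.compile(r"Cleavage sites? after bases? (\d+(?:,\s*\d+)*)")


def cleavage_positions(note: str) -> list[int]:
    """Return the base numbers listed on the last line of a note, if any."""
    lines = note.strip().splitlines()
    if not lines:
        return []
    match = _CLEAVAGE.fullmatch(lines[-1].strip())
    if match is None:
        return []
    return [int(number) for number in match.group(1).split(",")]


print(cleavage_positions("Cleavage sites after bases 5, 16, 200"))
```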
Primers are encoded as primer_bind features. primer_bind features that include sequence data are imported not as primer_bind features, but rather as SnapGene primers. If there are multiple binding sites for a primer, only one copy of the primer is generated during import.
The primer name is encoded using the /label qualifier, e.g.
/label=FOR
If present, the primer description is recorded using a /note qualifier, e.g.
/note="This is a primers description."
The last /note qualifier uses keys to specify the primer color, sequence, phosphorylation (if present), and, if known, the date the primer was added to the sequence.
Keys are separated from values using the Key: Value format. Multiple key/value pairs are separated by semicolons.
/note="color: orange; sequence: GCTCATGCCATTGGCGTTAACTCTGCTTCTTGGGCTCCAGCTACC; added: 2021-04-05; 5' phosphorylated"
The color values must be lowercase:
[ black | red | orange | green | blue | purple | gray ]
The primer sequence is case sensitive and can include a mixture of upper and lower case characters.
The UTC date the primer was added to the file, if known, is encoded with the added key, e.g. added: 2021-04-05
If the primer is 5' phosphorylated, this is included at the end of the terminal note qualifier as the text 5' phosphorylated.
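Putting the primer keys together, a script could decode the terminal note roughly as follows. parse_primer_note is a hypothetical helper, and the date is assumed to be in YYYY-MM-DD form as in the example above; the sketch expects the qualifier value without the /note= prefix or quotes.

```python
from datetime import date

PRIMER_COLORS = {"black", "red", "orange", "green", "blue", "purple", "gray"}


def parse_primer_note(value: str) -> dict:
    """Decode color, sequence, date added, and 5' phosphorylation."""
    primer = {"color": None, "sequence": None, "added": None, "phosphorylated": False}
    for item in value.split(";"):
        item = item.strip()
        if item == "5' phosphorylated":
            primer["phosphorylated"] = True
            continue
        key, separator, val = item.partition(":")
        key, val = key.strip(), val.strip()
        if not separator:
            continue
        if key == "color" and val in PRIMER_COLORS:
            primer["color"] = val  # color names must be lowercase
        elif key == "sequence":
            primer["sequence"] = val  # case is preserved
        elif key == "added":
            primer["added"] = date.fromisoformat(val)  # assumes YYYY-MM-DD
    return primer


print(parse_primer_note(
    "color: black; sequence: aaacactGGCCAAATAagaacgtagaag; "
    "added: 2021-04-05; 5' phosphorylated"
))
```

Unrecognized color names are left as None here rather than raising an error; a stricter workflow might prefer to reject them.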
LOCUS       Exported                2894 bp DNA     linear   UNA 05-APR-2021
KEYWORDS    Custom Map Label
REFERENCE   1  (bases 1 to 2894)
  AUTHORS   .
  TITLE     Direct Submission
  JOURNAL   Exported Monday, Apr 5, 2021 from SnapGene 5.3.0
            https://www.snapgene.com
COMMENT     Alias: This is an example of an alias
FEATURES             Location/Qualifiers
     misc_feature    740..1000
                     /label=Reverse Directional Green Feature
                     /note="color: #00FF00; direction: LEFT"
     misc_feature    1001..1894
                     /label=Simple Name
                     /note="This bidirectional feature has 3 segments:
                     1: 1001 .. 1298 / #ff0000 / First Named Segment
                     2: 1299 .. 1596 / #00ff00
                     3: 1597 .. 1894 / #0000ff / Last Named Segment
                     Cleavage site after base 1800"
     primer_bind     1427..1471
                     /label=FOR
                     /note="Here is the forward primers description."
                     /note="color: orange; sequence: GCTCATGCCATTGGCGTTAACTCTGCTTCTTGGGCTCCAGCTACC; added: 2021-04-05"
     primer_bind     complement(1649..1676)
                     /label=M13rev
                     /note="standard sequencing primer"
                     /note="color: black; sequence: aaacactGGCCAAATAagaacgtagaag; added: 2021-04-05; 5' phosphorylated"
Information at the top of the [GenBank | GenPept] file is imported into the Description Panel. Most of the conversion follows an obvious path, but the following should be noted:
- The DEFINITION is imported into the Description box.
- The KEYWORDS field is normally not used, in which case it is populated with a "." character. If any other text is in the KEYWORDS field, it is recognized upon import as a sequence label for the map.
- A Natural DNA sequence has a three-letter code other than SYN in the LOCUS line. The SOURCE is imported as the Source (called the Source Organism prior to version 4.1), and the Sequence Class is imported from the three-letter code in the LOCUS line.
- A Synthetic DNA sequence has the three-letter code SYN in the LOCUS line. If present, the /lab_host qualifier in the source feature is imported as the Laboratory Host (called the Laboratory Host Organism prior to SnapGene version 4.1).
- In the FEATURES section, the contents of the first /label qualifier are used as the default choice for the feature name, for example /label=Simple Name.
Note that prior to SnapGene 3.3.4, the feature name was typically encoded in the first /note qualifier, and that format is still recognized by the importer in the absence of a /label qualifier.