GFF3 format (skbio.io.format.gff3)

GFF3 (Generic Feature Format version 3) is a standard file format for describing features for biological sequences. It contains lines of text, each consisting of 9 tab-delimited columns [1].

Format Support

Has Sniffer: Yes

Reader  Writer  Object Class
------  ------  ------------------------------------------------------------------------
Yes     Yes     skbio.sequence.Sequence
Yes     Yes     skbio.sequence.DNA
Yes     Yes     skbio.metadata.IntervalMetadata
Yes     Yes     generator of tuple (seq_id of str type, skbio.metadata.IntervalMetadata)

Format Specification

State: Experimental as of 0.5.1.

The first line of the file is a comment that identifies the format and version. This is followed by a series of data lines. Each data line corresponds to an annotation and consists of 9 columns: SEQID, SOURCE, TYPE, START, END, SCORE, STRAND, PHASE, and ATTR.
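
As a rough illustration of the nine-column layout described above, the following standard-library sketch splits one data line into named fields. The helper name split_data_line is hypothetical and is not part of scikit-bio; the column names are taken from the list above.

```python
# Column names as listed in the format description.
GFF3_COLUMNS = ("SEQID", "SOURCE", "TYPE", "START", "END",
                "SCORE", "STRAND", "PHASE", "ATTR")


def split_data_line(line):
    """Split one GFF3 data line into a dict keyed by column name.

    Hypothetical helper for illustration only; values are kept as strings.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(
            f"expected 9 tab-delimited columns, got {len(fields)}")
    return dict(zip(GFF3_COLUMNS, fields))


record = split_data_line("seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1\n")
print(record["TYPE"], record["START"], record["END"])  # gene 10 90
```

Keeping every value as a string leaves the interpretation of placeholders such as "." to the caller.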

Column 9 (ATTR) is a list of feature attributes in the format “tag=value”. Multiple “tag=value” pairs are delimited by semicolons. Multiple values of the same tag are separated by commas. The following tags have predefined meanings: ID, Name, Alias, Parent, Target, Gap, Derives_from, Note, Dbxref, Ontology_term, and Is_circular.

The meaning and format of these columns and attributes are explained in detail in the format specification [1]. They are read in using the vocabulary defined in the GenBank parser (skbio.io.format.genbank).
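
The ATTR syntax described above (semicolon-delimited “tag=value” pairs, with commas between multiple values) can be sketched with plain string operations. The function parse_attributes is a hypothetical helper, not scikit-bio's parser, and it ignores any character escaping that the specification defines.

```python
def parse_attributes(attr):
    """Parse a GFF3 ATTR column into a dict of {tag: [values]}.

    Hypothetical helper for illustration only; no escaping is handled.
    """
    attributes = {}
    for pair in attr.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Multiple values of the same tag are comma-separated.
        attributes[tag] = value.split(",")
    return attributes


parsed = parse_attributes("ID=gen1;Name=abc;Dbxref=GO:0001,GO:0002")
print(parsed)
# {'ID': ['gen1'], 'Name': ['abc'], 'Dbxref': ['GO:0001', 'GO:0002']}
```

Storing every value as a list keeps single-valued and multi-valued tags uniform for downstream code.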

Format Parameters

Reader-specific Parameters

The IntervalMetadata GFF3 reader requires one parameter, seq_id. It reads the annotations for the specified sequence ID from the GFF3 file into an IntervalMetadata object.

The DNA and Sequence GFF3 readers require an integer parameter, seq_num. It specifies which GFF3 record to read from a GFF3 file that contains annotations for multiple sequences. Records are numbered from 1, so seq_num=1 selects the first sequence.

Writer-specific Parameters

skip_subregion is a boolean parameter accepted by all the GFF3 writers. It specifies whether to skip writing each non-contiguous sub-region of a feature annotation as a separate line. For example, if an IntervalMetadata object contains an interval feature for a gene with two exons, the writer produces one line in the GFF3 file when skip_subregion is True and 3 lines (one for the gene and one for each exon) when skip_subregion is False. Default is True.

In addition, the IntervalMetadata GFF3 writer needs a seq_id parameter. It specifies the sequence ID (column 1 in the GFF3 file) that the annotations belong to.
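
To make the effect of skip_subregion on the number of output lines concrete, here is a standard-library sketch that mimics the behavior described above. It is not scikit-bio's writer: the function feature_lines is hypothetical, and the choices to span the parent line from the first start to the last end and to label sub-regions "exon" with a Parent attribute are assumptions made for illustration.

```python
def feature_lines(seq_id, feature_type, feature_id, bounds,
                  skip_subregion=True):
    """Build GFF3 data lines for one feature (illustrative only).

    bounds is a list of (start, end) pairs in 1-based, inclusive
    GFF3 coordinates, ordered along the sequence.
    """
    def line(ftype, start, end, attr):
        return "\t".join(
            [seq_id, ".", ftype, str(start), str(end), ".", "+", ".", attr])

    # Assumption: the parent line spans the first start to the last end.
    lines = [line(feature_type, bounds[0][0], bounds[-1][1],
                  f"ID={feature_id}")]
    if not skip_subregion and len(bounds) > 1:
        # Assumption: each sub-region is written as an exon of the parent.
        for start, end in bounds:
            lines.append(line("exon", start, end, f"Parent={feature_id}"))
    return lines


exons = [(10, 30), (50, 90)]
one_line = feature_lines("seq_1", "gene", "gen1", exons)
three_lines = feature_lines("seq_1", "gene", "gen1", exons,
                            skip_subregion=False)
print(len(one_line), len(three_lines))  # 1 3
```

With the default setting only the gene line is produced; turning the flag off adds one line per sub-region.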

Examples

Let’s create a file stream with the following data in GFF3 format:

>>> from skbio import Sequence, DNA
>>> gff_str = """
... ##gff-version 3
... seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1
... seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1
... seq_1\t.\texon\t50\t90\t.\t+\t.\tParent=gen1
... seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2
... ##FASTA
... >seq_1
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... >seq_2
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
... """
>>> import io
>>> from skbio.metadata import IntervalMetadata
>>> from skbio.io import read
>>> gff = io.StringIO(gff_str)

We can read it into IntervalMetadata. Each line will be read into an interval feature in the IntervalMetadata object:

>>> im = read(gff, format='gff3', into=IntervalMetadata,
...           seq_id='seq_1')
>>> im  
3 interval features
-------------------
Interval(interval_metadata=<4604421736>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'phase': 0, 'strand': '+', 'source': '.', 'score': '.', 'ID': 'gen1'})
Interval(interval_metadata=<4604421736>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
Interval(interval_metadata=<4604421736>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'source': '.', 'type': 'exon', 'Parent': 'gen1', 'score': '.'})
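
The bounds printed above are consistent with converting the 1-based, inclusive START and END columns into 0-based, half-open intervals: the gene at 10 to 90 appears as (9, 90). A minimal sketch of that arithmetic follows; both function names are hypothetical and not part of scikit-bio.

```python
def gff3_to_bounds(start, end):
    """Convert 1-based, inclusive START/END to a 0-based, half-open pair."""
    return (start - 1, end)


def bounds_to_gff3(bounds):
    """Convert a 0-based, half-open pair back to 1-based, inclusive START/END."""
    return (bounds[0] + 1, bounds[1])


print(gff3_to_bounds(10, 90))    # (9, 90)
print(bounds_to_gff3((49, 90)))  # (50, 90)
```

Only the start coordinate shifts, so the interval length is simply end minus start in the 0-based form.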

We can write the IntervalMetadata object back to a GFF3 file:

>>> with io.StringIO() as fh:    
...     print(im.write(fh, format='gff3', seq_id='seq_1').getvalue())
##gff-version 3
seq_1   .       gene    10      90      .       +       0       ID=gen1
seq_1   .       exon    10      30      .       +       .       Parent=gen1
seq_1   .       exon    50      90      .       +       .       Parent=gen1

If the GFF3 file does not contain the requested sequence ID, the reader returns an empty IntervalMetadata object:

>>> gff = io.StringIO(gff_str)
>>> im = read(gff, format='gff3', into=IntervalMetadata,
...           seq_id='foo')
>>> im
0 interval features
-------------------

We can also read the GFF3 file into a generator that yields one (sequence ID, IntervalMetadata) tuple per sequence:

>>> gff = io.StringIO(gff_str)
>>> gen = read(gff, format='gff3')
>>> for im in gen:   
...     print(im[0])   # the seq id
...     print(im[1])   # the interval metadata on this seq
seq_1
3 interval features
-------------------
Interval(interval_metadata=<4603377592>, bounds=[(9, 90)], fuzzy=[(False, False)], metadata={'type': 'gene', 'ID': 'gen1', 'source': '.', 'score': '.', 'strand': '+', 'phase': 0})
Interval(interval_metadata=<4603377592>, bounds=[(9, 30)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
Interval(interval_metadata=<4603377592>, bounds=[(49, 90)], fuzzy=[(False, False)], metadata={'strand': '+', 'type': 'exon', 'Parent': 'gen1', 'source': '.', 'score': '.'})
seq_2
1 interval feature
------------------
Interval(interval_metadata=<4603378712>, bounds=[(79, 96)], fuzzy=[(False, False)], metadata={'strand': '-', 'type': 'gene', 'ID': 'gen2', 'source': '.', 'score': '.'})
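
The per-sequence grouping shown above can be approximated with the standard library alone. The sketch below is not how scikit-bio implements its generator; iter_by_seq_id is a hypothetical helper, and it assumes that data lines for the same sequence are adjacent and that annotation lines end at the ##FASTA marker.

```python
import io
from itertools import groupby


def iter_by_seq_id(fh):
    """Yield (seq_id, [data lines]) pairs from a GFF3 text stream.

    Hypothetical helper for illustration only.
    """
    def data_lines():
        for raw in fh:
            line = raw.rstrip("\n")
            if line.startswith("##FASTA"):
                return  # sequences follow; no more annotation lines
            if not line or line.startswith("#"):
                continue  # skip blank lines, directives and comments
            yield line

    # Assumption: lines for the same sequence are adjacent in the file.
    for seq_id, group in groupby(data_lines(),
                                 key=lambda l: l.split("\t", 1)[0]):
        yield seq_id, list(group)


text = (
    "##gff-version 3\n"
    "seq_1\t.\tgene\t10\t90\t.\t+\t0\tID=gen1\n"
    "seq_1\t.\texon\t10\t30\t.\t+\t.\tParent=gen1\n"
    "seq_2\t.\tgene\t80\t96\t.\t-\t.\tID=gen2\n"
    "##FASTA\n"
    ">seq_1\n"
    "ATGC\n"
)
counts = [(seq_id, len(lines))
          for seq_id, lines in iter_by_seq_id(io.StringIO(text))]
print(counts)  # [('seq_1', 2), ('seq_2', 1)]
```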

For a GFF3 file that includes sequences, we can read it into Sequence or DNA:

>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=Sequence, seq_num=1)
>>> seq
Sequence
--------------------------------------------------------------------
Metadata:
    'description': ''
    'id': 'seq_1'
Interval metadata:
    3 interval features
Stats:
    length: 100
--------------------------------------------------------------------
0  ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
>>> gff = io.StringIO(gff_str)
>>> seq = read(gff, format='gff3', into=DNA, seq_num=2)
>>> seq
DNA
--------------------------------------------------------------------
Metadata:
    'description': ''
    'id': 'seq_2'
Interval metadata:
    1 interval feature
Stats:
    length: 120
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 50.00%
--------------------------------------------------------------------
0  ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC
60 ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC ATGCATGCAT GCATGCATGC

References

[1] https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md