User Guide¶
There are two ways of using Freqgen: CLI and Python API. For each step, we’ll look at how to use the CLI and then how to use the (more complicated yet flexible) Python API.
As an example for the purposes of this guide, we’re going to be making the sequence for green fluorescent protein from Aequorea victoria have k-mer frequencies more similar to those of the highly expressed genes (HEGs) in Escherichia coli.
Generate an amino acid sequence¶
The first step of using Freqgen is to have an amino acid sequence for which to generate a coding DNA sequence. If you already have one in a FASTA file (.fasta or .faa filetypes), you can skip this step If not, read on!
CLI¶
Freqgen has two ways of generating an amino acid sequence from the CLI. The first is simply to translate a DNA sequence:
$ freqgen aa --mode=seq gfp.fna
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*
To exclude stop codons, use the -s
flag:
$ freqgen aa --mode=seq gfp.fna -s
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
To put the sequence into a formatted FASTA file, there’s a -o
flag:
$ freqgen aa --mode=seq gfp.fna -o gfp.faa
$ cat output_sequence.fasta
>Generated by Freqgen from gfp.fna
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*
The other option is to use the amino acid frequencies of a reference set of
sequences to create a new sequence of a given length. This is the default
setting, so there’s no need for a --mode
switch. However, you do need to
provide -l
, the desired length of the sequence, excluding the stop codon:
$ freqgen aa gfp.fna -l 20
TVGQAGMNDTFEQSTNSQTD*
The -o
and -s
commands work as before.
Additionally, if your sequences are using an alternate genetic code, you can use
the -t
flag to provide Freqgen the alternate table’s NCBI ID.
Python API¶
To generate a new amino acid sequence from frequencies, we first need to
calculate the frequencies of your reference set. Luckily, that’s pretty trivial
with k_mer_frequencies()
:
>>> k_mer_frequencies("LLNL", 1, include_missing=False)
{1: {'L': 0.75, 'N': 0.25}}
(Don’t worry about the include_missing
argument for now; it’s for use on DNA.)
Now that we have the frequencies of each amino acid we can generate a sequence
using them with amino_acid_seq()
:
>>> length = 8 # the length of the sequence to generate
>>> aa_sequence = amino_acid_seq(length, k_mer_frequencies("ALLQ", 1))
>>> aa_sequence
'ALAAQLQL'
Featurize reference sequences¶
The next step of using Freqgen to generate a coding DNA sequence is to tell it what features to optimize for. This can be \(k\)-mers and/or codons.
CLI¶
The CLI can be used to generate a YAML file containing the frequencies of each \(k\)-mer in the reference set. For example, to featurize the 1-mers of a sequence:
$ freqgen featurize ecoli.heg.fna -k 1
1:
A: 0.24778707477586917
C: 0.25553373220861103
G: 0.27406970099491756
T: 0.22260949202060226
Just as before, the -o
flag can give it an output file:
$ freqgen featurize ecoli.heg.fna -k 2 -o ecoli.heg.yaml
To include the codon usage, use the -c
flag:
$ freqgen featurize ecoli.heg.fna -k 1 -c
1:
A: 0.24778707477586917
C: 0.25553373220861103
G: 0.27406970099491756
T: 0.22260949202060226
codons:
AAA: 0.04896629238995924
AAC: 0.03350325268786685
AAG: 0.011909492399041792
.
.
.
TTG: 0.003530840930507147
TTT: 0.010183808085739262
Python API¶
We need to assemble a dictionary that looks like this:
{1: {'A': 0.24778707477586917,
'C': 0.25553373220861103,
'G': 0.27406970099491756,
'T': 0.22260949202060226}}
To do so, let’s find the 1-mers of a reference sequence:
>>> sequence = "ATGTGCAGTGGTCCGTCCCGATACGGCTAG"
>>> features = k_mer_frequencies(sequence, 1)
>>> features
{1: {'A': 0.16666666666666666,
'C': 0.26666666666666666,
'G': 0.3333333333333333,
'T': 0.23333333333333334}}
If we wanted to add codon usage to the features, we can do so by passing codons=True
:
k_mer_frequencies(sequence, 1, codons=True)
Note
k_mer_frequencies()
and
codon_frequencies()
can take a single sequence or
list of sequences as its arguments.
Generate a coding sequence¶
CLI¶
Assuming the same files as generated above, provide the freqgen
command with
the -s
flag for the amino acid sequence file and the -f
flag for the
target frequency file to generate a new coding sequence:
$ freqgen --original gfp.faa --target ecoli.heg.yaml
ATGAGCAAAGGCGAAGAACTTTTCACAGGCGTGGTGCCCATCT...
To take a look at the progress of optimization, use the -v
flag:
$ freqgen --original gfp.faa --target ecoli.heg.yaml -v
Gen: 161 Since Improvement: 50/50 Fitness: 0.009440865845955711
The -o
flag for output file and -t
for translation table work as usual:
$ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna
If optimization is taking too long, you can use ^C
(or control-C
for
those on Macs) to stop early:
$ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna
^C
Stopping early...
Python API¶
Assuming the same features
and aa_sequence
variables from above,
generating a sequence with the desired parameters is easy with the
generate()
function:
>>> generate(features, aa_sequence)
'TTACTGCAAGCACTGGCGGCGTTG'
The verbose
option can print out the progress as you go along, just as in
the CLI:
>>> generate(features, aa_sequence, verbose=True)
Gen: 51 Since Improvement: 50/50 Fitness: 0.000401269136411031
'TTGCTGCAAGCGTTAGCGGCACTG'
Visualize the results¶
CLI¶
To get a feeling for the results of the sequence generation, Freqgen has a visualization utility built in. To use it, pass in the target frequencies YAML file and the optimized sequence:
$ freqgen visualize --target ecoli.heg.yaml --optimized gfp_ecoli.fna
Note
You can click on the legend to control display of the categories.
To compare the original frequencies of a DNA sequence (if you specified the
amino acid sequence) to that of the target and result, there’s an optional
--original
argument:
$ freqgen visualize --original gfp.fna --target ecoli.heg.yaml --optimized gfp_ecoli.fna
For full details, including how to change the dimensions, control the title, and more, check out the detailed help command argument listing.