User Guide

There are two ways to use Freqgen: the CLI and the Python API. For each step, we’ll look at how to use the CLI and then how to use the (more complicated yet more flexible) Python API.

As a running example for this guide, we’ll recode the sequence for green fluorescent protein (GFP) from Aequorea victoria so that its k-mer frequencies more closely resemble those of the highly expressed genes (HEGs) in Escherichia coli.

Generate an amino acid sequence

The first step of using Freqgen is to have an amino acid sequence for which to generate a coding DNA sequence. If you already have one in a FASTA file (.fasta or .faa filetypes), you can skip this step. If not, read on!

CLI

Freqgen has two ways of generating an amino acid sequence from the CLI. The first is simply to translate a DNA sequence:

$ freqgen aa --mode=seq gfp.fna
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*

To exclude stop codons, use the -s flag:

$ freqgen aa --mode=seq gfp.fna -s
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
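
To make the translation step concrete, here is a minimal pure-Python sketch of what translating a coding sequence involves. It is not Freqgen’s implementation: the function name translate, the include_stop parameter, and the choice to stop reading at the first stop codon are all assumptions made for illustration, and only the standard genetic code (NCBI table 1) is covered.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons of the standard genetic code (NCBI table 1),
# listed in the order TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, include_stop=True):
    """Translate an in-frame DNA string, reading until the first stop codon.

    Hypothetical helper for illustration; include_stop=False mimics the
    effect of dropping the trailing '*' from the output.
    """
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            if include_stop:
                protein.append(aa)
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGAGCAAAGGCTAA"))                      # MSKG*
print(translate("ATGAGCAAAGGCTAA", include_stop=False))  # MSKG
```

The short input is just the first four codons of the GFP sequence above followed by a stop codon.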

To write the sequence to a formatted FASTA file, use the -o flag:

$ freqgen aa --mode=seq gfp.fna -o gfp.faa
$ cat gfp.faa
>Generated by Freqgen from gfp.fna
MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*

The other option is to use the amino acid frequencies of a reference set of sequences to create a new sequence of a given length. This is the default setting, so there’s no need for a --mode switch. However, you do need to provide -l, the desired length of the sequence, excluding the stop codon:

$ freqgen aa gfp.fna -l 20
TVGQAGMNDTFEQSTNSQTD*

The -o and -s flags work as before.

Additionally, if your sequences are using an alternate genetic code, you can use the -t flag to provide Freqgen the alternate table’s NCBI ID.

Python API

To generate a new amino acid sequence from frequencies, we first need to calculate the frequencies of your reference set. Luckily, that’s pretty trivial with k_mer_frequencies():

>>> k_mer_frequencies("LLNL", 1, include_missing=False)
{1: {'L': 0.75, 'N': 0.25}}

(Don’t worry about the include_missing argument for now; it’s for use on DNA.)
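
If you are curious what a k-mer frequency calculation boils down to, the following sketch reproduces the output shape shown above using only the standard library. It is a stand-in written for this guide, not Freqgen’s own code: the name kmer_frequencies_sketch is hypothetical, and counting overlapping windows is an assumption (for k = 1 it makes no difference).

```python
from collections import Counter


def kmer_frequencies_sketch(sequence, k):
    """Return {k: {k-mer: relative frequency}} over all overlapping windows."""
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    # Sort the k-mers so the resulting dictionary has a stable, readable order.
    return {k: {kmer: n / total for kmer, n in sorted(counts.items())}}


print(kmer_frequencies_sketch("LLNL", 1))  # {1: {'L': 0.75, 'N': 0.25}}
```

Note that this sketch only reports k-mers that actually occur in the input.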

Now that we have the frequencies of each amino acid, we can generate a sequence using them with amino_acid_seq():

>>> length = 8 # the length of the sequence to generate
>>> aa_sequence = amino_acid_seq(length, k_mer_frequencies("ALLQ", 1))
>>> aa_sequence
'ALAAQLQL'
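
Conceptually, generating a sequence from frequencies is weighted random sampling. The sketch below shows one way to do that with the standard library. The helper name sample_amino_acids, the seed parameter, and the flat frequency dictionary are assumptions for illustration (Freqgen’s own dictionaries are nested under the k value, as shown earlier), and Freqgen may sample differently internally.

```python
import random


def sample_amino_acids(length, frequencies, seed=None):
    """Draw `length` residues independently, weighted by their frequencies."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    residues = list(frequencies)
    weights = [frequencies[residue] for residue in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


freqs = {"A": 0.25, "L": 0.5, "Q": 0.25}  # the 1-mer frequencies of "ALLQ"
seq = sample_amino_acids(8, freqs, seed=0)
print(len(seq), set(seq) <= set(freqs))  # 8 True
```

Because the residues are drawn at random, a short sequence will only approximate the requested frequencies; longer sequences track them more closely.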

Featurize reference sequences

The next step of using Freqgen to generate a coding DNA sequence is to tell it what features to optimize for. These can be \(k\)-mers and/or codons.

CLI

The CLI can be used to generate a YAML file containing the frequencies of each \(k\)-mer in the reference set. For example, to featurize the 1-mers of a sequence:

$ freqgen featurize ecoli.heg.fna -k 1
1:
  A: 0.24778707477586917
  C: 0.25553373220861103
  G: 0.27406970099491756
  T: 0.22260949202060226

Just as before, the -o flag can give it an output file:

$ freqgen featurize ecoli.heg.fna -k 2 -o ecoli.heg.yaml

To include the codon usage, use the -c flag:

$ freqgen featurize ecoli.heg.fna -k 1 -c
1:
  A: 0.24778707477586917
  C: 0.25553373220861103
  G: 0.27406970099491756
  T: 0.22260949202060226
codons:
  AAA: 0.04896629238995924
  AAC: 0.03350325268786685
  AAG: 0.011909492399041792
  .
  .
  .
  TTG: 0.003530840930507147
  TTT: 0.010183808085739262

Python API

We need to assemble a dictionary that looks like this:

{1: {'A': 0.24778707477586917,
     'C': 0.25553373220861103,
     'G': 0.27406970099491756,
     'T': 0.22260949202060226}}

To do so, let’s find the 1-mers of a reference sequence:

>>> sequence = "ATGTGCAGTGGTCCGTCCCGATACGGCTAG"
>>> features = k_mer_frequencies(sequence, 1)
>>> features
{1: {'A': 0.16666666666666666,
     'C': 0.26666666666666666,
     'G': 0.3333333333333333,
     'T': 0.23333333333333334}}

To add codon usage to the features, pass codons=True:

>>> k_mer_frequencies(sequence, 1, codons=True)
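
Codon usage differs from 3-mer frequencies in that only in-frame, non-overlapping triplets are counted. The sketch below illustrates the idea. It is not Freqgen’s code: codon_frequencies_sketch is a hypothetical name, and normalizing over all codons (rather than per amino acid) is an assumption, although it is consistent with the CLI output shown earlier, where the two lysine codons AAA and AAG do not sum to 1.

```python
from collections import Counter


def codon_frequencies_sketch(sequence):
    """Return {codon: fraction of all in-frame codons} for a coding sequence."""
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in sorted(counts.items())}


usage = codon_frequencies_sketch("ATGTGCAGTGGTCCGTCCCGATACGGCTAG")
print(len(usage), usage["ATG"])  # 10 0.1
```

The example sequence contains ten codons, each occurring once, so every codon gets a frequency of 0.1.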

Note

k_mer_frequencies() and codon_frequencies() can each take a single sequence or a list of sequences as their argument.

Generate a coding sequence

CLI

Assuming the same files as generated above, pass the amino acid sequence file with --original and the target frequency file with --target to generate a new coding sequence:

$ freqgen --original gfp.faa --target ecoli.heg.yaml
ATGAGCAAAGGCGAAGAACTTTTCACAGGCGTGGTGCCCATCT...

To take a look at the progress of optimization, use the -v flag:

$ freqgen --original gfp.faa --target ecoli.heg.yaml -v
Gen: 161        Since Improvement: 50/50      Fitness: 0.009440865845955711

The -o flag for output file and -t for translation table work as usual:

$ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna

If optimization is taking too long, you can use ^C (or control-C for those on Macs) to stop early:

$ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna
^C
Stopping early...

Python API

Assuming the same features and aa_sequence variables from above, generating a sequence with the desired parameters is easy with the generate() function:

>>> generate(features, aa_sequence)
'TTACTGCAAGCACTGGCGGCGTTG'

The verbose option can print out the progress as you go along, just as in the CLI:

>>> generate(features, aa_sequence, verbose=True)
Gen: 51        Since Improvement: 50/50      Fitness: 0.000401269136411031
'TTGCTGCAAGCGTTAGCGGCACTG'
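
To build some intuition for what an optimizer like this is doing, here is a deliberately simple toy version. This guide does not describe Freqgen’s actual algorithm or fitness function, so everything in the sketch is an assumption made for illustration: the names toy_generate and distance, the random-search strategy, the sum-of-squared-differences score, and the flat 1-mer target dictionary. The only properties it shares with the real thing by construction are that it picks among synonymous codons and that lower scores mean a closer match.

```python
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def distance(dna, target):
    """Sum of squared differences between observed and target base frequencies."""
    counts = Counter(dna)
    return sum((counts[b] / len(dna) - target[b]) ** 2 for b in "ACGT")


def toy_generate(target, protein, tries=2000, seed=0):
    """Random search: keep the best-scoring synonymous encoding seen so far."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(tries):
        candidate = "".join(rng.choice(SYNONYMS[aa]) for aa in protein)
        score = distance(candidate, target)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


target = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dna, score = toy_generate(target, "ALAAQLQL")
back = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
print(len(dna), back)  # 24 ALAAQLQL
```

Whatever DNA sequence the search settles on, translating it back always yields the original amino acid sequence, because every candidate is assembled from synonymous codons only.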

Visualize the results

CLI

To get a feeling for the results of the sequence generation, Freqgen has a visualization utility built in. To use it, pass in the target frequencies YAML file and the optimized sequence:

$ freqgen visualize --target ecoli.heg.yaml --optimized gfp_ecoli.fna
[Interactive Bokeh plot: target and optimized frequencies]

Note

You can click on the legend to control display of the categories.

To compare the original frequencies of a DNA sequence (if you specified the amino acid sequence) to those of the target and result, there’s an optional --original argument:

$ freqgen visualize --original gfp.fna --target ecoli.heg.yaml --optimized gfp_ecoli.fna
[Interactive Bokeh plot: original, target, and optimized frequencies]

For full details, including how to change the dimensions, control the title, and more, see the detailed argument listing in the command’s help.