User Guide
==========

There are two ways of using Freqgen: CLI and Python API. For each step, we'll
look at how to use the CLI and then how to use the (more complicated yet
flexible) Python API.

As an example for the purposes of this guide, we're going to be making the
sequence for `green fluorescent protein
<https://en.wikipedia.org/wiki/Green_fluorescent_protein>`_ from *Aequorea
victoria* have *k*-mer frequencies more similar to those of the `highly
expressed genes <http://genomes.urv.cat/HEG-DB/>`_ (HEGs) in *Escherichia coli*.

Generate an amino acid sequence
-------------------------------

The first step of using Freqgen is to have an amino acid sequence for which to
generate a coding DNA sequence. If you already have one in a FASTA file (.fasta
or .faa filetypes), you can :ref:`skip this step <Featurize reference
sequences>` If not, read on!

CLI
~~~

Freqgen has two ways of generating an amino acid sequence from the CLI. The
first is simply to translate a DNA sequence::

    $ freqgen aa --mode=seq gfp.fna
    MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
    VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
    NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
    HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*

To exclude stop codons, use the ``-s`` flag::

    $ freqgen aa --mode=seq gfp.fna -s
    MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
    VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
    NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
    HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

To put the sequence into a formatted FASTA file, there's a ``-o`` flag::

    $ freqgen aa --mode=seq gfp.fna -o gfp.faa
    $ cat output_sequence.fasta
    >Generated by Freqgen from gfp.fna
    MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL
    VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV
    NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD
    HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK*

The other option is to use the amino acid frequencies of a reference set of
sequences to create a new sequence of a given length. This is the default
setting, so there's no need for a ``--mode`` switch. However, you do need to
provide ``-l``, the desired length of the sequence, excluding the stop codon::

    $ freqgen aa gfp.fna -l 20
    TVGQAGMNDTFEQSTNSQTD*

The ``-o`` and ``-s`` commands work as before.

Additionally, if your sequences are using an alternate genetic code, you can use
the ``-t`` flag to provide Freqgen the alternate table's NCBI ID.

Python API
~~~~~~~~~~

To generate a new amino acid sequence from frequencies, we first need to
calculate the frequencies of your reference set. Luckily, that's pretty trivial
with :func:`~freqgen.freqgen.k_mer_frequencies`::

    >>> k_mer_frequencies("LLNL", 1, include_missing=False)
    {1: {'L': 0.75, 'N': 0.25}}

(Don't worry about the ``include_missing`` argument for now; it's for use on DNA.)

Now that we have the frequencies of each amino acid we can generate a sequence
using them with :func:`~freqgen.freqgen.amino_acid_seq`::

    >>> length = 8 # the length of the sequence to generate
    >>> aa_sequence = amino_acid_seq(length, k_mer_frequencies("ALLQ", 1))
    >>> aa_sequence
    'ALAAQLQL'

Featurize reference sequences
-----------------------------

The next step of using Freqgen to generate a coding DNA sequence is to tell it
what features to optimize for. This can be :math:`k`-mers and/or codons.

CLI
~~~

The CLI can be used to generate a YAML file containing the frequencies of each
:math:`k`-mer in the reference set. For example, to featurize the 1-mers of a
sequence::

    $ freqgen featurize ecoli.heg.fna -k 1
    1:
      A: 0.24778707477586917
      C: 0.25553373220861103
      G: 0.27406970099491756
      T: 0.22260949202060226

Just as before, the ``-o`` flag can give it an output file::

    $ freqgen featurize ecoli.heg.fna -k 2 -o ecoli.heg.yaml

To include the codon usage, use the ``-c`` flag::

    $ freqgen featurize ecoli.heg.fna -k 1 -c
    1:
      A: 0.24778707477586917
      C: 0.25553373220861103
      G: 0.27406970099491756
      T: 0.22260949202060226
    codons:
      AAA: 0.04896629238995924
      AAC: 0.03350325268786685
      AAG: 0.011909492399041792
      .
      .
      .
      TTG: 0.003530840930507147
      TTT: 0.010183808085739262

Python API
~~~~~~~~~~

We need to assemble a dictionary that looks like this::

    {1: {'A': 0.24778707477586917,
         'C': 0.25553373220861103,
         'G': 0.27406970099491756,
         'T': 0.22260949202060226}}

To do so, let's find the 1-mers of a reference sequence::

    >>> sequence = "ATGTGCAGTGGTCCGTCCCGATACGGCTAG"
    >>> features = k_mer_frequencies(sequence, 1)
    >>> features
    {1: {'A': 0.16666666666666666,
         'C': 0.26666666666666666,
         'G': 0.3333333333333333,
         'T': 0.23333333333333334}}

If we wanted to add codon usage to the features, we can do so by passing ``codons=True``::

    k_mer_frequencies(sequence, 1, codons=True)

.. note::

     :func:`~freqgen.freqgen.k_mer_frequencies` and
     :func:`~freqgen.freqgen.codon_frequencies` can take a single sequence or
     list of sequences as its arguments.

Generate a coding sequence
--------------------------

CLI
~~~

Assuming the same files as generated above, provide the ``freqgen`` command with
the ``-s`` flag for the amino acid sequence file and the ``-f`` flag for the
target frequency file to generate a new coding sequence::

    $ freqgen --original gfp.faa --target ecoli.heg.yaml
    ATGAGCAAAGGCGAAGAACTTTTCACAGGCGTGGTGCCCATCT...

To take a look at the progress of optimization, use the ``-v`` flag::

    $ freqgen --original gfp.faa --target ecoli.heg.yaml -v
    Gen: 161        Since Improvement: 50/50      Fitness: 0.009440865845955711

The ``-o`` flag for output file and ``-t`` for translation table work as usual::

    $ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna

If optimization is taking too long, you can use ``^C`` (or ``control-C`` for
those on Macs) to stop early::

    $ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna
    ^C
    Stopping early...


Python API
~~~~~~~~~~

Assuming the same ``features`` and ``aa_sequence`` variables from above,
generating a sequence with the desired parameters is easy with the
:func:`~freqgen.generate` function::

    >>> generate(features, aa_sequence)
    'TTACTGCAAGCACTGGCGGCGTTG'

The ``verbose`` option can print out the progress as you go along, just as in
the CLI::

    >>> generate(features, aa_sequence, verbose=True)
    Gen: 51        Since Improvement: 50/50      Fitness: 0.000401269136411031
    'TTGCTGCAAGCGTTAGCGGCACTG'

Visualize the results
---------------------

CLI
~~~

To get a feeling for the results of the sequence generation, Freqgen has a
visualization utility built in. To use it, pass in the target frequencies YAML
file and the optimized sequence::

    $ freqgen visualize --target ecoli.heg.yaml --optimized gfp_ecoli.fna

.. raw:: html
    :file: figures/no_original.html

.. note::

   You can click on the legend to control display of the categories.

To compare the original frequencies of a DNA sequence (if you specified the
amino acid sequence) to that of the target and result, there's an optional
``--original`` argument::

    $ freqgen visualize --original gfp.fna --target ecoli.heg.yaml --optimized gfp_ecoli.fna

.. raw:: html
    :file: figures/with_original.html

For full details, including how to change the dimensions, control the title, and
more, check out the :ref:`detailed help command argument listing <Result
Visualization Reference>`.