User Guide ========== There are two ways of using Freqgen: CLI and Python API. For each step, we'll look at how to use the CLI and then how to use the (more complicated yet flexible) Python API. As an example for the purposes of this guide, we're going to be making the sequence for `green fluorescent protein `_ from *Aequorea victoria* have *k*-mer frequencies more similar to those of the `highly expressed genes `_ (HEGs) in *Escherichia coli*. Generate an amino acid sequence ------------------------------- The first step of using Freqgen is to have an amino acid sequence for which to generate a coding DNA sequence. If you already have one in a FASTA file (.fasta or .faa filetypes), you can :ref:`skip this step ` If not, read on! CLI ~~~ Freqgen has two ways of generating an amino acid sequence from the CLI. The first is simply to translate a DNA sequence:: $ freqgen aa --mode=seq gfp.fna MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK* To exclude stop codons, use the ``-s`` flag:: $ freqgen aa --mode=seq gfp.fna -s MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK To put the sequence into a formatted FASTA file, there's a ``-o`` flag:: $ freqgen aa --mode=seq gfp.fna -o gfp.faa $ cat output_sequence.fasta >Generated by Freqgen from gfp.fna MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTL VTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLV NRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLAD HYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK* The other option is to use the amino acid frequencies of a reference set of sequences to create a new sequence of a given length. This is the default setting, so there's no need for a ``--mode`` switch. However, you do need to provide ``-l``, the desired length of the sequence, excluding the stop codon:: $ freqgen aa gfp.fna -l 20 TVGQAGMNDTFEQSTNSQTD* The ``-o`` and ``-s`` commands work as before. Additionally, if your sequences are using an alternate genetic code, you can use the ``-t`` flag to provide Freqgen the alternate table's NCBI ID. Python API ~~~~~~~~~~ To generate a new amino acid sequence from frequencies, we first need to calculate the frequencies of your reference set. Luckily, that's pretty trivial with :func:`~freqgen.freqgen.k_mer_frequencies`:: >>> k_mer_frequencies("LLNL", 1, include_missing=False) {1: {'L': 0.75, 'N': 0.25}} (Don't worry about the ``include_missing`` argument for now; it's for use on DNA.) Now that we have the frequencies of each amino acid we can generate a sequence using them with :func:`~freqgen.freqgen.amino_acid_seq`:: >>> length = 8 # the length of the sequence to generate >>> aa_sequence = amino_acid_seq(length, k_mer_frequencies("ALLQ", 1)) >>> aa_sequence 'ALAAQLQL' Featurize reference sequences ----------------------------- The next step of using Freqgen to generate a coding DNA sequence is to tell it what features to optimize for. This can be :math:`k`-mers and/or codons. CLI ~~~ The CLI can be used to generate a YAML file containing the frequencies of each :math:`k`-mer in the reference set. For example, to featurize the 1-mers of a sequence:: $ freqgen featurize ecoli.heg.fna -k 1 1: A: 0.24778707477586917 C: 0.25553373220861103 G: 0.27406970099491756 T: 0.22260949202060226 Just as before, the ``-o`` flag can give it an output file:: $ freqgen featurize ecoli.heg.fna -k 2 -o ecoli.heg.yaml To include the codon usage, use the ``-c`` flag:: $ freqgen featurize ecoli.heg.fna -k 1 -c 1: A: 0.24778707477586917 C: 0.25553373220861103 G: 0.27406970099491756 T: 0.22260949202060226 codons: AAA: 0.04896629238995924 AAC: 0.03350325268786685 AAG: 0.011909492399041792 . . . TTG: 0.003530840930507147 TTT: 0.010183808085739262 Python API ~~~~~~~~~~ We need to assemble a dictionary that looks like this:: {1: {'A': 0.24778707477586917, 'C': 0.25553373220861103, 'G': 0.27406970099491756, 'T': 0.22260949202060226}} To do so, let's find the 1-mers of a reference sequence:: >>> sequence = "ATGTGCAGTGGTCCGTCCCGATACGGCTAG" >>> features = k_mer_frequencies(sequence, 1) >>> features {1: {'A': 0.16666666666666666, 'C': 0.26666666666666666, 'G': 0.3333333333333333, 'T': 0.23333333333333334}} If we wanted to add codon usage to the features, we can do so by passing ``codons=True``:: k_mer_frequencies(sequence, 1, codons=True) .. note:: :func:`~freqgen.freqgen.k_mer_frequencies` and :func:`~freqgen.freqgen.codon_frequencies` can take a single sequence or list of sequences as its arguments. Generate a coding sequence -------------------------- CLI ~~~ Assuming the same files as generated above, provide the ``freqgen`` command with the ``-s`` flag for the amino acid sequence file and the ``-f`` flag for the target frequency file to generate a new coding sequence:: $ freqgen --original gfp.faa --target ecoli.heg.yaml ATGAGCAAAGGCGAAGAACTTTTCACAGGCGTGGTGCCCATCT... To take a look at the progress of optimization, use the ``-v`` flag:: $ freqgen --original gfp.faa --target ecoli.heg.yaml -v Gen: 161 Since Improvement: 50/50 Fitness: 0.009440865845955711 The ``-o`` flag for output file and ``-t`` for translation table work as usual:: $ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna If optimization is taking too long, you can use ``^C`` (or ``control-C`` for those on Macs) to stop early:: $ freqgen --original gfp.faa --target ecoli.heg.yaml -o gfp_ecoli.fna ^C Stopping early... Python API ~~~~~~~~~~ Assuming the same ``features`` and ``aa_sequence`` variables from above, generating a sequence with the desired parameters is easy with the :func:`~freqgen.generate` function:: >>> generate(features, aa_sequence) 'TTACTGCAAGCACTGGCGGCGTTG' The ``verbose`` option can print out the progress as you go along, just as in the CLI:: >>> generate(features, aa_sequence, verbose=True) Gen: 51 Since Improvement: 50/50 Fitness: 0.000401269136411031 'TTGCTGCAAGCGTTAGCGGCACTG' Visualize the results --------------------- CLI ~~~ To get a feeling for the results of the sequence generation, Freqgen has a visualization utility built in. To use it, pass in the target frequencies YAML file and the optimized sequence:: $ freqgen visualize --target ecoli.heg.yaml --optimized gfp_ecoli.fna .. raw:: html :file: figures/no_original.html .. note:: You can click on the legend to control display of the categories. To compare the original frequencies of a DNA sequence (if you specified the amino acid sequence) to that of the target and result, there's an optional ``--original`` argument:: $ freqgen visualize --original gfp.fna --target ecoli.heg.yaml --optimized gfp_ecoli.fna .. raw:: html :file: figures/with_original.html For full details, including how to change the dimensions, control the title, and more, check out the :ref:`detailed help command argument listing `.