Tutorial: Generating a sequence logo plot

This document will walk through how to generate a frequency/probability-based logo plot from sequence data in the format of an input to ortho_seqs, using the logo-plot CLI command. The logo plot can provide information on the frequencies of nucleotides/amino acids present in your sequence dataset before running orthogonal-polynomial.

Logo plots are generated using the logomaker package in Python. More customization options exist (font, color schemes, etc) that are not (yet) implemented here.

1. Requirements for logo-plot

The sequence data, formatted as input to ortho_seqs. This can take the form of either:

  • .txt file: single file containing sequences, separated by line breaks.

  • .csv or .xlsx file: single file containing sequence data in the first column. Can (but doesn’t have to) contain phenotype data in the second column.

2. logo-plot flags:

logo-plot will require the following flags.

--filename

This will be the sequence data file, formatted as described in (1).

--molecule

This is the molecule type. Should either be DNA, RNA, or protein. Default is DNA.

--out_dir

This is where you want the logo plot to be stored.

3. Running logo-plot:

You will run logo-plot in the CLI the same way you would run orthogonal-polynomial or rf1d-viz. The general format is:

ortho_seq logo-plot filename --molecule --out_dir

4. Guided example with test dataset:

The example uses the Sidhu dataset that has also been used in the other tutorials.

The sequence data that will be used for this example is called sidhu_insulin_cdrh3_seqs.xlsx. Given that this dataset is about proteins, our CLI input will be

ortho_seq logo-plot docs/source/sidhu_insulin_cdrh3_seqs.xlsx --molecule protein --out_dir docs/source/tutorial_outputs/

The saved figure will look like:

_images/sidhu_insulin_cdrh3_seqs_freq_logo1.png