.. _running-decode-docs:

===================
Running DELi Decode
===================
DELi is both a package and a command line tool. This means you can run the decoding process either by writing a script
that uses the DELi package or by using the :ref:`command line interface (CLI) <deli-cli-docs>` provided by DELi.
In most cases the CLI is the quickest option, but if you want more control you can use the package.

What you need to run a decoding job
-----------------------------------
To run a decoding job, you will need the following:

- A :ref:`sequenced selection file <selection-file-docs>` that specifies the libraries (and tool compounds) used in the selection along with the sequence information
- The raw sequencing data from the selection in FASTQ format. The data should be demultiplexed by selection, so that it includes only the selection of interest (most sequencing providers will do this for you).

And that's it! DELi keeps it simple and will take care of all the other logic for you.

Running a decode job
--------------------
To run a decode job, you can use the :ref:`deli decode CLI <decode-cli-docs>`.
You can provide the decoding settings as command line arguments, in a YAML file, or in the selection file itself.
See the :ref:`decoding settings docs <decoding-settings-docs>` for more information on the available settings.

As an example, here is a valid DELi decode command:

.. code-block:: text

    deli decode run my_selection.yaml

The :ref:`selection file <selection-file-docs>` contains everything needed to do decoding.

Decoding job outputs
--------------------
After running a decode job, you will get the following outputs every time:

- A TSV file with the successfully decoded reads. There are three columns: "library_id", "bb_ids", and "umi". If you requested to save fastq information (read ids and file source) the additional fastq columns will also be included.
- A JSON file with the decoding statistics. This contains aggregate information about how many compounds were read and decoded (both in total and by library).

.. note::
    The file is TSV because the building block ids are ',' separated. DELi does not generate full compound IDs until
    the very end, as they cannot be guaranteed to be convertible back into the original library and building block information.
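
As a rough illustration of the decoded reads format, the sketch below reads such a TSV with Python's standard
``csv`` module. The column names come from the list above; the presence of a header row, the sample values, and the
``read_decoded_reads`` helper are assumptions made for the example and are not part of DELi.

.. code-block:: python

    import csv
    import io

    # Hypothetical sample of a decoded reads TSV. The header row and the
    # values are assumptions; only the column names come from the docs.
    SAMPLE_TSV = (
        "library_id\tbb_ids\tumi\n"
        "LIB001\tBB1,BB2,BB3\tACGTACGT\n"
        "LIB001\tBB1,BB2,BB3\tACGTACGA\n"
        "LIB002\tBB7,BB9\tTTGCAAGC\n"
    )


    def read_decoded_reads(handle):
        """Yield (library_id, building_block_ids, umi) for each decoded read."""
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            # Building block ids are ',' separated inside a single TSV field.
            yield row["library_id"], tuple(row["bb_ids"].split(",")), row["umi"]


    # For a real file, pass open("my_decodes.tsv", newline="") instead.
    decoded_reads = list(read_decoded_reads(io.StringIO(SAMPLE_TSV)))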

There are a handful of optional files as well:

- A decoding report in HTML format. This is a human readable report that gives a high level overview of what happened in the decoding run. This is generated by default unless turned off.
- A "deli.log" file that contains detailed logging information about the decoding run. This is generated by default unless turned off.
- A TSV file with the failed decoding results. This is a file that contains all the reads (read name and sequence) that were not successfully decoded, along with information about why they failed. This is not generated by default, but can be turned on with the ``--save-failed`` flag.

Collecting decoding runs
------------------------
Once all your decoding jobs for a given selection are complete, the files need to be merged (if there is more than one) and then
the compounds aggregated to get the set of UMIs (and their occurrences) for each decoded compound. This is done with
the :ref:`deli collect CLI <collect-cli-docs>` using something like ``deli decode collect my_decodes.tsv``.

.. note::
    This process will not merge decode statistics files. Use the ``deli decode merge_stats`` command for that.

Collecting requires all the decoded reads to be loaded into memory, which makes it more memory intensive
than the decoding step, but this is necessary to get the correct UMI counts for each compound.

The collect step outputs a single NDJSON (newline delimited JSON) file where each line is a JSON object for a single compound and
contains information on the UMIs for that compound. Each object has the following keys:

- "library_id": the library the compound belongs to
- "bb_ids": the building blocks that make up the compound (separated by ',')
- "umis": a list of the UMIs observed for that compound, along with their counts. Each UMI is represented as a JSON object with two keys: "k" (the UMI sequence) and "v" (the number of times that UMI was observed for that compound).
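
To make the record layout concrete, here is a minimal sketch that parses NDJSON lines with the keys listed above
using Python's standard ``json`` module. The sample records and the ``parse_collected`` helper are hypothetical and
only illustrate the structure; they are not DELi code.

.. code-block:: python

    import json

    # Hypothetical records following the key layout described above.
    sample_records = [
        {
            "library_id": "LIB001",
            "bb_ids": "BB1,BB2,BB3",
            "umis": [{"k": "ACGTACGT", "v": 3}, {"k": "ACGTACGA", "v": 1}],
        },
        {
            "library_id": "LIB002",
            "bb_ids": "BB7,BB9",
            "umis": [{"k": "TTGCAAGC", "v": 2}],
        },
    ]
    # NDJSON is one JSON object per line.
    sample_lines = [json.dumps(record) for record in sample_records]


    def parse_collected(lines):
        """Yield (library_id, bb_ids, {umi: count}) for each NDJSON line."""
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. at the end of a file
            record = json.loads(line)
            umi_counts = {umi["k"]: umi["v"] for umi in record["umis"]}
            yield record["library_id"], record["bb_ids"].split(","), umi_counts


    compounds = list(parse_collected(sample_lines))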

.. note::
    DELi uses NDJSON over JSON so that the next step of counting the UMIs can be parallelized more easily
    (since you can just use something like ``split`` to break up the file into smaller chunks). If you prefer a
    different format you can convert to that instead.
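
Because every line is an independent JSON object, the file can be split at any line boundary. The sketch below shows
one possible way to do this chunking in Python rather than with ``split``. The ``chunk_lines`` helper and the chunk
size are assumptions for illustration, not something DELi provides.

.. code-block:: python

    import itertools


    def chunk_lines(lines, chunk_size):
        """Yield lists of at most ``chunk_size`` lines from an iterable of lines."""
        iterator = iter(lines)
        while True:
            chunk = list(itertools.islice(iterator, chunk_size))
            if not chunk:
                return
            yield chunk


    # Stand-in lines; with a real file, pass the open file handle instead.
    example_lines = ['{"record": %d}' % i for i in range(5)]
    chunks = list(chunk_lines(example_lines, 2))

Each chunk can then be handed to a separate worker, since no compound spans more than one line.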

Counting UMIs
-------------
The final step of decoding is to generate the final UMI counts. After the collect step, only the "raw" (number of times a compound was seen overall)
and "dedup" (number of unique UMIs observed for that compound) counts for each compound are available, and they are still in a
relatively raw, unprocessed form. Counting extracts these counts and generates a final TSV of the decoding results.
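
The two counts can be derived from a single collected record as in the sketch below. It assumes the NDJSON keys
described in the collecting section; the ``count_compound`` helper and the sample record are hypothetical and do not
reflect how DELi implements counting internally.

.. code-block:: python

    import json


    def count_compound(ndjson_line):
        """Return (library_id, bb_ids, raw_count, dedup_count) for one record."""
        record = json.loads(ndjson_line)
        # "raw": total number of times the compound was seen.
        raw_count = sum(umi["v"] for umi in record["umis"])
        # "dedup": number of distinct UMIs observed for the compound.
        dedup_count = len({umi["k"] for umi in record["umis"]})
        return record["library_id"], record["bb_ids"], raw_count, dedup_count


    # Hypothetical collected record with two distinct UMIs seen 3 and 1 times.
    sample_line = json.dumps(
        {
            "library_id": "LIB001",
            "bb_ids": "BB1,BB2,BB3",
            "umis": [{"k": "ACGTACGT", "v": 3}, {"k": "ACGTACGA", "v": 1}],
        }
    )
    result = count_compound(sample_line)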

The counting step can also conduct more advanced counting. Currently that only includes UMI clustering. This can help address
possible sequencing errors in the UMIs and give a more accurate final count of the number of unique UMIs observed. It
is recommended in nearly all cases.
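
To give an intuition for why clustering changes the deduplicated count, the sketch below uses a simple greedy scheme:
UMIs are visited from most to least abundant, and each one is merged into the first existing cluster whose center
differs by at most one base. This is an illustrative technique chosen for the example, with a hypothetical
``cluster_umis`` helper and a one-mismatch threshold; it is not necessarily the algorithm DELi uses.

.. code-block:: python

    def within_one_mismatch(first, second):
        """Return True if two equal-length UMIs differ at one position at most."""
        if len(first) != len(second):
            return False
        return sum(a != b for a, b in zip(first, second)) <= 1


    def cluster_umis(umi_counts):
        """Greedily merge UMIs into more abundant UMIs within one mismatch."""
        # Most abundant first, so likely "true" UMIs become cluster centers.
        ordered = sorted(umi_counts.items(), key=lambda item: (-item[1], item[0]))
        clusters = {}
        for umi, count in ordered:
            for center in clusters:
                if within_one_mismatch(umi, center):
                    clusters[center] += count
                    break
            else:
                # No existing center is close enough: start a new cluster.
                clusters[umi] = count
        return clusters


    # Hypothetical UMI counts for one compound; the second UMI is one
    # mismatch away from the first and is treated as a sequencing error.
    observed = {"ACGTACGT": 3, "ACGTACGA": 1, "TTGCAAGC": 2}
    clustered = cluster_umis(observed)

In this example the naive deduplicated count is 3, while the clustered count is 2.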

Lastly, this step can also update your decode statistics file to include the final UMI counts per library.
These counts are used for some enrichment metrics like the PolyO score. It is not required to update the stats file,
but it is recommended.