.. _decoding-algorithm: ================== Decoding Algorithm ================== Broadly speaking, DELi decodes raw DNA read into compound IDs using a simple 2 step system: - Barcode Alignment: Determine which spans of the read belong to which sections of a DELs DNA barcode - Section Calling: Use the alignments to call the decodable sections of the barcode It would be nice if these steps were completely separate and flowed linearly, but in DELi these steps are intertwined. For example, to align the barcode you need to know which library to align the read too. To do that you need to know which library the read is from (so you can get to correct alignment). To do that you need to align the read and call the library section, which we already established cannot be done until we know the library. This doesn't make decoding impossible, instead it means a partial alignment to locate the library section must be done first, then a the library section called, then the alignment can be finished and the rest of the sections called. Most of this confusing interweaving happens during the first stages of decoding, where the reads library is determined. To isolated this, DELi refers to this stage as "library demultiplexing". It is during this stage the the reads library is called and an alignment between that libraries barcode and the read is generated. After reads are converted to compound IDs, DELi can also do UMI correction ("degeneration" as it is called in DELi) to create a final count of unique compounds observed for a given compound ID. .. _library-demultiplexing: Library Demultiplexing ---------------------- DELi implements four different approaches for library demultiplexing: - Full barcode demultiplexing - Library section demultiplexing - Single section demultiplexing - Flanking section demultiplexing The last three approaches are roughly grouped together as :ref:`"static alignment" ` approaches, as they all rely on finding static (known) sections of the barcode to help align the read to the barcode design rather than doing a full sequence alignment. Each have benefits and drawbacks under different situations. The goal of so many methods is to give flexibility between speed and robustness. .. _full-barcode-demulti: Full barcode demultiplexing ^^^^^^^^^^^^^^^^^^^^^^^^^^^ An obvious way to determine which library is the read is from is to just try a full alignment to each possible library barcode schema. The one with the best alignment score is (likely) the correct library. In ``full`` mode DELi uses a or a semi-global alignment with a custom scoring matrix that does not penalize for the 'N' wildcards in the barcode schemas during alignment. It is implemented using Biopython's `PairwiseAligner `_. The benefits of this approach are that the semi-global alignment is very flexible to read quality and it can be "reused" when calling the other decodable sections, since the alignment to the full library barcode schema was already done to find the library. However, this approach is "brute". If your DEL selection used 70 libraries, then for every read you will need to run 70 semi-global alignments, one per library. It also scales directly with the size of reads and number of libraries. This makes the full barcode demultiplexing the slowest of the four approaches .. _static-section-alignment:: Static section alignments ^^^^^^^^^^^^^^^^^^^^^^^^^ Rather than spending the resources to do full dynamic sequence alignments, it can be alot faster to just look for smaller static sections of the barcode to help align the read. Once the location of a static section is known in the read, DELi can use that as reference to build the full alignment. This is much faster than doing a full semi-global alignment, as it only needs to figure out how to map the decodable sections in between the static sections. Luckly this can be done once and stored for the full decode run, allowing for nearly instant alignment generation after the static section(s) are found. This is done by looking at the distance between the static sections and the other sections in the barcode design, and then uses the known location of the static sections used to find the library to guess where the other sections should be. For a single known static section this will just seek out the variable sections based on their expected distance from the known static section. For example, if the first building block section is known to be 15 bp ahead of the static section and is 11 bp long, it will grab the DNA sequence 15-25 bp ahead of the known static section alignment and assign that as the barcode sequence. However, if there are INDELS between these two section this could result in a misalignment. With two known static sections on either side of the section in question, DELi can do a bit better. It can use the observed vs expected distance between the two known static sections to adjust the alignment of the section that fall between to account for INDELs. For example I could have the read and barcode schema of:: AGCTAGCTGGGGAGCTAGACTAGCTAGCT AGCTAGCTJJJJNNNNNNNKKAGCTAGCT Where the N is the section we want and J and K are some other sections we don't care about. If three was an INDEL such that:: AGCTAGCTGG-GAGCTAGACTAGCTAGCT AGCTAGCTJJJJNNNNNNNKKAGCTAGCT Then by looking at the distance between the two static sections (AGCTAGCT) DELi can see that there is a 1 bp deletion in the read between them. Therefore it can adjust the variable sections accordingly, so it would extract ``NNNNK`` instead of ``NNNK``. DELi also supports a "wiggle" mode for static alignments, where the alignment span is extended by 1 bp to the right. This dramatically increases the ability to recover reads with INDELs during :ref:`barcode calling `. Under the hood, DELi uses the ``StaticSectionAligner`` class to handle this logic. .. _library-section-demulti: Library section demultiplexing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Using static section alignment, you could instead just look for the each unique library section rather than align to each DELs barcode. Since all libraries must have unique library section barcodes this will still allow for determining the library. This is called ``library`` mode and is about ten times faster that using :ref:`full barcode demultiplexing `. There are still downsides with ``library`` mode that is not easy to overcome. The library barcodes being so short is one. Searching for barcodes that are as small as 8-12bp has an increase risk of mismatches cause by SNPs or random chance. DELi has some extra bells and whistles to try and compensate (using the known size of the barcode to help it pick the best match for example), but it is still a problem. There is also the issue of scaling with the number of libraries. While the searches are faster, it would still be better if we could get away with doing less searches rather than one per library. .. _single-section-demulti: Single section demultiplexing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Single section demultiplexing, called ``single`` in DELi, addressed both of the issues presented by the library section demultiplexing method. Instead of searching for the library barcode a static section is searched for instead to generate the static alignment. The main benefit here is that unlike library sections, which must be unique between each library (thus requiring the a single search per library) static sections can be shared among many library barcode designs. This is often the case, as DELs made by a single organization or vendor will often have the same primer sections that are static across all the DELs. If we used 50 DELs in our selection, but they all have the same primer (static) section, then we only need to do one search, not 50. In practice, this means we can group DELs based on having the same static section, and then only do a search per group rather than per library. Of course this causes a new issue: how to tell which library the read is from when the generated match could be for any of the libraries. All that is needed to fix this is to use the alignment to locate the library section (just like we would for any other section). Then we can use the :ref:`barcode calling module ` to call the library section and determine which library this read is from. This requires a small tweak to the static alignment method, which instead will only do a partial alignment for just the library section, rather than the full barcode design. After we have the library a second and full alignment can be generated between the read and the full library barcode. Yet this tweak also causes a new issue: what if the libraries share a static section, but the distance between it and the library section is not the same. Then the partial library section only alignment will not work for all the DELs in this group. Instead, we need to further sub group the DELs. These sub groups are determined by the distance between the static section and the library tag. If the distance is the same, between DELs, they can be in the same subgroup. Otherwise they are in separate subgroups. Just like how one search must be made per group, one partial library alignment must be made per subgroup in a group. This might seem a bit too much to handle, having to group DELs based on various barcode schema compatibilities. Lucky for us DELi can do this on its own. Given a set of libraries it can find the set of static sections to use that will minimize the number of groups and subgroups (thus minimizing the number of searches needed). In practice, you can often speed up a decoding run about 20-40x by using ``single`` mode over ``library``. However, in cases where none of the libraries share any static sections, most of the benefits of this method are lost, and ``library`` mode might be a better choice, especially if the size of the static sections are small and library barcodes are larger. .. _flanking-section-demulti: Flanking primer demultiplexing ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The last method extends the single section demultiplexing method by using two static sections instead of one. The benifits of this are discussed in the :ref:`static section alignment ` section, but the main takeaway is this further improves the your ability to over come INDELs, at the cost of a minor increase risk in :ref:`miscalling of barcodes `. DELi can also automatically determine which static sections to use as the flanking sections, so you don't have to worry about that either. Core demultiplexing algorithms ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ To actually carry out the above demultiplexing approaches DELi uses existing tools to perform the alignment and searches As mentioned already, Biopython's `PairwiseAligner `_ is used for :ref:`full barcode demultiplexing ` and can be used for the other approaches as well if you set `realign = True` (which will reject the search based alignment and replace it with a semi-global alignment). For the search based approaches, DELi can either use Cutadapt's `Adapter search `_ or the Regex packages `fuzzy search `_ .. note:: ``regex`` is not the same as ``re`` from the python standard library. It is a third party library that supports more complex regular expression syntax (but is still compatible with python ``re``). Cutadapt ~~~~~~~~ This will use Cutadapt's alignment algorithm to find the static section(s) in the read. Cutadapt is a fantastically optimized tool for locating adapters in sequencing reads. However, adapters are meant to be at the ends of reads, while static sections could be anywhere in the read. Usually, primers are at the ends of reads so things work out okay, but if you have static sections in the middle of the read, Cutadapt might not be able to find them as well as other methods. This is because the alignment algorithm prioritizes finding matches at the ends of reads, and will pick them over matches in the middle of reads *even if* there match in the middel is better (less errors). This is not a bug but a feature of cutadapt, as it is designed for adapter trimming, not general sequence searching. However, it can cause strange behavior in DELi. Only use this approach if you are sure your static sections are at the ends of your reads (like primers). DELi will warn you if it detects static sections in the middle of reads when using Cutadapt. .. warning:: If you are considering using cutadapt over regex for demultiplexing, you should read the above paragraph, there are important caveats to be aware of. Regex ~~~~~ This will conduct a fuzzy regular expression search for the static section(s) in the read. Unlike a normal regular expression which cannot handle insertion, deletion or substitution errors appearing arbitrarily in a specific section of the query, the `regex` library supports fuzzy searching. DELi will build out regex expressions that look like ``(?:(?PAGCTAGCT)){e<=1}.{{10,20}}(?:(?PCGTACGTA)){e<=1}``. While confusing to read, these queries are quick powerful and efficient. In nearly all cases, regex is faster at finding static sections in a given read than any other method. Regex is the default and recommend method to use. .. _barcode-calling: Barcode Calling --------------- After the alignment is complete and we know which spans of the read contain which sections of the barcode, we can call the decodable sections to determine the final ID associated with the read. On the surface, this is pretty straight forward, as we just need to loop through each decodable section spans and look up its match from the available options provided. Yet is isn't so simple. First, looking for perfect matches leaves alot on the table. Depending on sequencing quality and number of decodable section as many as 30-40% of reads could have an error that would cause them to be uncallable if a perfect match was needed. For DELs this is especially wasteful, as the set of valid barcodes is often far smaller than the set of possible barcodes. This means even with a single error in the barcode we can be pretty confident about what the original barcode was and still call it correctly. To address this DELi uses a hashmap based approach, where prior to decoding a list of each barcode and all its possible 1-Levenshtein distance (1 INDEL or SNP) neighbors is generated and stored. This allows for O(1) lookup of the correct barcode even if there is an error in the read, as long as it is only 1 error away from a valid barcode. If there is a case where two of the valid barcodes are similar enough that they share a 1-Levenshtein distance neighbor (making it too ambiguous to know the correct barcode), DELi can detect this and correct all the read as non-decodable. This is the ``ambiguous`` mode in the error correcting settings. You can read more about his in the :ref:`barcode calling documentation `. There is one more "problem" to address. If you read the above sections about :ref:`library demultiplexing `, you will notice that the alignments generated are not guaranteed to be the correct size, or even just 1 bp larger. This would automatically cause the barcode calling to fail, as the hashmap does not cover that many Levenshtein edits. Worry not, DELi is designed to handle this, and in fact, it is a core feature in making barcode calling more robust. When DELi get asked to call a section. It is given the aligned span from the read. It will then sweep through all possible windows of the expected size, plus one smaller and one larger (to account for possible INDELs) for the section to try and find a match in the hashmap. There are several modes for this, but the default is a "greedy first perfect" which will pick the first window with a perfect (zero errors) match or pick the window that results in a match with the least number of errors (if no perfect match is found). This is still really fast, as the matching is an O(1) operation thanks to the hashmap callers. Using this DELi can recover reads with as many as 2-3 INDELs with high accuracy. This also address another problem we have yet to address, alignment spans that overlap. If decodable barcode sections are close, and there are INDELs, it is possible for the alignments to overlap. By sweeping the windows, DELi can find an alignment that either avoids overlaps between matching windows (which would result in an impossible decode output since one base pair can't be claimed by two different sections) or allows DELi to fail when an non-resolvable overlap is detected. This is generally quite rare, but DELi is designed to handle it when it does happen. Barcode Decoding ---------------- Once the read has been assigned to a specific library and an alignment(s) between the read and the library are created, then the sections can be decoded. This is pretty straight forward. DELi will loop through each alignment and try to decode the building block barcodes based on the mapped nucleotides for each variable section. If it fails (because that barcode doesn't exist) it will try the next alignment and continue until it finds a valid decoding or runs out of alignments. Hashmap based :ref:`barcode calling <>` is used to allow for possible errors in the barcode sections when decoding while keep runtime as low as possible (O(1) in this case). Once all the variable sections are called, DELi will then locate the UMI by both using the alignment to find where the UMI is located in the read *and* looking taking the closest aligned variable section to the UMI and searching *n* bp away to find the UMI sequence. Both these (if different) are save as possible UMIs for the read, with the *n* bp away being the primary UMI call. Future updates will build upon this to do more advanced UMI correction. This produces a full decoded read, mapping it to a unique compound ID and associating it with a UMI. UMI extraction -------------- After the compound ID is decoded, the alignment is also used to extract the UMI sequence from the read. DELi *does not* use the alignment for this. Instead it uses the known distance between the UMI and the closest decoded section to find the UMI sequence. This is to adjust for INDELs that may have occurred in the read, since window scanning does not work on the UMI (there is nothing to call, the region is meant to be random). Decode Parallelization and Collection ------------------------------------- The process of decoding a single read into a compound ID and UMI is embarrassingly parallel, as each read can be decoded independently of the others. This is why DELi's decoding algorithm does not implement any native multithreading or multiprocessing; it is easy (and way faster) to just run multiple processes of DELi on different chunks of the input FASTQ files. You can read more on how to do this in the :ref:`parallelization section `. However, before count and UMI degeneration can occur, the decoded compounds need to be aggregated; we need to know how many time each compound showed up and with what UMIs. This process is not easily parallelizable, since all the data must be looked at in a single process (at some point). This is also the reason this process is sepatated from decoding. DELi can do this collection for you; it is just a massive groupby and count operation. Polars' streaming engine is used for this to keep the memory usage low. UMI Degeneration ---------------- Once the compounds are collected and counted, DELi can do UMI degeneration to correct for PCR amplification bias. Technically this is optional, but non-UMI corrected counts are not very useful due to all the noise introduced by PCR amplification bias. DELi implements two methods for UMI degeneration, :ref:`Exact ` and :ref:`Clustered `. .. _exact-umi-degeneration: Exact UMI Degeneration ^^^^^^^^^^^^^^^^^^^^^^ This is the simplest method, where all reads with the same UMI are grouped together and their counts are summed. There is no room for error in the UMI. Given that even high end sequencers have error rates of about 1/100 base pairs, This is going to introduce some noise, as AACCTTGG and AACCTTG*A* will be counted as the **different** UMIs, even though, statistically speaking, the odds are far more likely one is a misread of the other. To demonstrate this, say we have 100bp DNA barcode with a 10 bp UMI. on average 1 out of every 10 reads (10%) will have a single error in the UMI, giving it a hamming distance on 1. Now, with 10bp UMIs, there are 1,048,576 possible UMIs, but for any given UMI, there are only 30 other UMIs with a hamming distance of 1, or a 0.0029% chance of a read being misread as one of those. If we assume that the chance there are far more possible UMIs that those actually in the DEL (Since DELs are synthesized at such low concentrations) it is 4 fold more likely that two UMIs with hamming distance of 1 are just one single UMI (one has a misread) than it is to be two unique UMIs. In practice these odds are even lower, as UMIs are often longer than 10bp. .. _clustered-umi-degeneration: Clustered UMI Degeneration ^^^^^^^^^^^^^^^^^^^^^^^^^^ To compensate for misreads and PCR errors during sequencing that end up in the UMI section, DELi implements a "clustered" UMI degeneration method. It is an implementation of the "directional" method outlined in Smith et al [1]_. References ---------- .. [1] Smith, T., Heger, A., & Sudbery, I. (2017). UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research, 27(3), 491-499. doi:10.1101/gr.209601.116