.. _analysis-config-docs: Analysis configuration (YAML) ============================== The automated analysis driver reads a single YAML file. A full skeleton with multiple experiments, comparisons, and optional **raw_indexes** is in the repository at ``examples/analysis_examples/DEL005_cube_config_sample.yaml``. How to run the driver is described in :doc:`analysis_docs/analysis_readme`; this page only documents the config file itself. General section --------------- All of these keys live under **general**: - **data**: Path to the cube CSV (path is resolved relative to the process working directory unless you pass an absolute path). - **lib_size**: Encoded library size (integer). Required for sampling-depth metrics (``SD_min``, ``NSC_values``), PolyO, and the normalized z-score path when enabled. - **output_dir**: Base directory for outputs; the driver also creates a dated subdirectory for each run. - **ID_col**: Column used as the compound identifier (for example ``DEL_ID``). If omitted, the first column of the CSV is used. Indexes and controls -------------------- - **indexes**: Maps each **experiment name** (a label you choose) to a list of count columns for that experiment—typically UMI-corrected or otherwise decodable read counts per replicate or channel. - **control_cols**: Maps each experiment name to the NTC (or other control) column(s) used by metrics that compare selection to control—for example MLE and log-based z-scores. Keys should match the experiment names you use in **indexes** where a control exists. - **raw_indexes**: Optional. If **indexes** is empty, **raw_indexes** is copied to **indexes**. For current implementations, NSC, ``SD_min``, and PolyO use the columns listed under **indexes** (not a separate raw-only path). Keeping **raw_indexes** in the file is optional and mainly useful for bookkeeping or future use. Flags block ----------- All pipeline switches sit under **flags**. They are booleans unless noted. The driver runs steps in a fixed order; some flags depend on others (for example disynthon-based plots require **disynthon_data** first). Understanding the flags (grouped) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Use the following groupings when reading an example config (such as the multi-target sample linked above): **Enrichment and QC metrics** - **SD_min** / **NSC_values**: Sampling depth and normalized sequence counts from **indexes** and **lib_size**. - **MLE**: Maximum-likelihood enrichment style ratios using **indexes** and **control_cols**. - **Z_score**: Normalized z-scores under a uniform-library null (expected fraction ``1 / lib_size``), using **indexes**; requires **lib_size**. - **z_score_log_data**: Log-space z-style scores using experiment and control columns from **control_cols**; independent of **Z_score**’s theoretical null. **Disynthon / PolyO / overlap** - **disynthon_data**: Builds disynthon-level tables; required before PolyO and disynthon overlap steps that consume them. - **polyO**: PolyO scores on the cube (uses **indexes** and **lib_size**). - **trisynthon_overlap** / **disynthon_overlap**: Reproducibility-style plots at tri- and disynthon resolution. - **disynthon_threshold**: Numeric cutoff for overlap filtering, or the string ``"auto"`` where supported. **Normalization** - **normalized_data**: Subtracts control counts from selection columns (per experiment). Run after steps that still need raw control columns if you want normalized exports downstream. **Downstream CSVs and ML (optional)** - **simple_spotfire_version**: Writes a reduced column set for Spotfire or similar tools. - **ml_fingerprints_to_RF_reg** / **ml_fingerprints_to_RF_clf**: Random forest models from fingerprints; **clf_thresh** sets the classification cutoff. - **gnn_classifier**, **gnn_threshold**, **gnn_arch**: Optional graph neural network classifier (architecture string such as ``GAT`` or ``GIN``). **Hits and report** - **top_hits** / **top_hits_metric**: How many compounds to list and whether to rank by **sum** or **avg** over index columns. - **report**: Writes the bundled HTML report and a dated cube CSV next to the run. Nested flag structures ^^^^^^^^^^^^^^^^^^^^^^ **top_disynthons** A **comparisons** list; each item is one chart batch. Fields include **comparison** (``"control"``, ``"exp2"``, or ``"none"``), **exp_name**, optional **exp2_name**, **control_name** where required by the comparison type, **top_count**, and **comparison_metric** (``sum`` or ``avg``). Multiple rows mean multiple comparison sets. **top_delta_compounds** A **comparisons** list of **exp_name** / **control_name** pairs (keys into **indexes** / **control_cols**), plus **metric** and **top_count**. **monosynthon_chemical_space** Either enabled as a boolean (all experiments) or as a mapping with **experiments**: a list of experiment names to include in the t-SNE-style view. **report** is typically ``true`` when you want the HTML summary; it collects outputs from whatever earlier flags actually ran.