SimChA: Simulator of Chromosomal Aberrations

SimChA is a fitness-driven simulator of copy-number evolution. SimChA can simulate 22 event types and SNVs within 25 different cancer types, or use pan-cancer profiles.

SimChA can be used in three ways:

repeats N: generates N repeats of the simulation,
tree <phylogeny>: generates a tree of clones based on the provided phylogeny structure,
profiles <cn_profiles>: scores the provided CN profiles.

There are three event-selection modes (set with -m):

basic: events are selected at random, without considering fitness.
evolution: events are selected based on their fitness, with a higher chance of selecting events that increase fitness.
matching: events are selected to minimize the distance to a target fitness value provided on input.

The fitness is calculated based on the following principles:

Tumor Suppressor Genes (TSG) and Oncogenes (OG) contribute to fitness, with TSG loss and OG gain increasing fitness.
Essential genes contribute to fitness, with their full loss penalizing fitness.
Abnormally high ploidy is penalized by a stress factor.

Quick Start

For quickstart, Git >= 2.4 and Conda >= 22 (or equivalent) are required.

The program can be run on a platform of your choice in the provided Conda environment. The following commands should make SimChA display the available commands.

git clone git@bitbucket.org:schwarzlab/simcha.git
cd simcha
conda env create --file simcha.yml
conda activate simcha
dotnet run

Tested platforms

The program has been tested on:

Windows 11 - PowerShell
Windows 11 - WSL2 Ubuntu
Ubuntu 24.04
MacOS X 10

Execution

The repository is a solution (SimChA.sln) with two projects: SimChA, the simulator, and Tests, the unit test suite. SimChA is the default project — its .csproj sits at the repository root (with sources in src/), so dotnet run runs the simulator directly:

git clone git@bitbucket.org:schwarzlab/simcha.git
cd simcha
dotnet run

The results will be written to the folder ./out

Options

Use dotnet run -- [options] to specify any of the following:

 
  -O, --output                 (Default: ./out) The path to the output files.

  -C, --config                 (Default: ./configs/main_config.json) A json file with configuration of the experiment.

  -T, --tree <path>            Clone-tree TSV/CSV file path. Required columns: ID, ParentID, Distance (int). Optional column: Fitness (float; only used/required with -m matching). Delimiter can be tab or comma.

  -R, --repeats <int>          (Default: 1) Positive integer number of independent repeats/samples in repeat mode. Cannot be combined with -T or -P when the value is greater than 1.

  -P, --cnprofiles <path>      CNA profile TSV file path (tab-separated, header + at least 6 columns): SampleID, Chr, Start, End, CN_hap1, CN_hap2. Extra columns are optional/ignored. Start is interpreted as 1-based by default, or 0-based with -z.

  -m, --mode                   (Default: evolution) The event selection mode: 'basic' (events are selected at random), 'evolution' (events are selected to increase fitness), or 'matching' (events are selected to minimize the distance to a target fitness).

  -s, --segments               (Default: false) Write out copy number segments.

  -S, --consistent-segments    (Default: false) Write out copy number segments under a minimum consistent segmentation.

  -k, --karyotypes             (Default: false) Write out karyotype after each event.
  
  -d, --delta                  (Default: false) Write out the lost and gained regions for each event.

  -v, --variants               (Default: false) Write out VCF file with the variants of the final simulated karyotype. Requires `data/<assembly>/genome.fa` (e.g. `data/hg19/genome.fa`). Use `scripts/DownloadRefData.sh` to download it.

  -f, --fasta                  (Default: false) Write out a FASTA file for each sample. Requires `data/<assembly>/genome.fa` (e.g. `data/hg19/genome.fa`). Use `scripts/DownloadRefData.sh` to download it. WARNING! Average file size is 6GB per sample.

  -z, --zero-index             (Default: false) Flag for zero-indexed input copy number profiles.

  --root                       (Default: .) A path to the folder that will be considered root for relative paths.

  -h, --help                   Display this help screen.

  --version                    Display version information.

Input files

Input files are only required for specific execution modes:

| Option | Used for | Required? |
|----|----|----|
| `-R, --repeats <int>` | Repeat-mode simulation (default mode when no `-T`/`-P` is given) | No input file required |
| `-T, --tree <path>` | Tree-mode simulation | Required in tree mode |
| `-P, --cnprofiles <path>` | Profile scoring mode | Required in profiles mode |

Clone tree file (`-T`, `--tree`)

A .tsv or .csv file with a header row. The extension determines the separator: tab for .tsv, comma for .csv.

Columns:

ID (string): clone/sample identifier
ParentID (string): parent clone identifier
Distance (int): number of events from parent to child
Fitness (float): target fitness value — only read when using -m matching, otherwise ignored

The root is the row where ParentID == ID, or where ParentID does not match any ID in the file. Exactly one root is expected.

Minimal tree example (.tsv):

ID	ParentID	Distance
A	A	0
B	A	12
C	B	7

Matching-mode example (.tsv):

ID	ParentID	Distance	Fitness
A	A	0	0.0
B	A	12	3.5
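
The root rule above can be sketched in Python. This is an illustrative reader, not SimChA's own parser; the function names `read_tree` and `find_root` are hypothetical and the delimiter is passed explicitly rather than derived from the file extension.

```python
import csv
import io


def read_tree(text, delimiter="\t"):
    """Parse a clone tree with a header row into a list of dicts."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    for row in rows:
        row["Distance"] = int(row["Distance"])
        # Fitness is optional; it only matters for -m matching.
        if row.get("Fitness") not in (None, ""):
            row["Fitness"] = float(row["Fitness"])
    return rows


def find_root(rows):
    """Return the ID of the single root row, following the rule stated above."""
    ids = {row["ID"] for row in rows}
    roots = [
        row["ID"]
        for row in rows
        if row["ParentID"] == row["ID"] or row["ParentID"] not in ids
    ]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]


example = "ID\tParentID\tDistance\nA\tA\t0\nB\tA\t12\nC\tB\t7\n"
print(find_root(read_tree(example)))  # prints: A
```

Checking a tree this way before passing it to -T can catch files with zero or several roots early.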

CNA profile file (`-P`, `--cnprofiles`)

A tab-separated file with a header row and at least 6 columns per data row. Additional columns are ignored.

First 6 columns (in order):

SampleID (string)
Chr (string, e.g. chr1)
Start (int): 1-based by default; use -z for 0-based input
End (int)
CN_hap1 (numeric, rounded to nearest int)
CN_hap2 (numeric, rounded to nearest int)

chrX and chrY rows are skipped when SimParams.AutosomesOnly = true.

Minimal CNA example (.tsv):

SampleID	Chr	Start	End	CN_hap1	CN_hap2
S1	chr1	1	248956422	1	1
S1	chr8	1	145138636	2	1
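
As a rough illustration of the column rules above, the following Python sketch reads such a file. It is not SimChA's reader: the function name `read_cn_profiles` is hypothetical, converting 0-based starts to 1-based is an assumption made only to get a single convention in the output, and Python's `round` resolves exact .5 ties to the even integer, which may differ from SimChA's behaviour.

```python
import csv
import io

SEX_CHROMS = {"chrX", "chrY"}


def read_cn_profiles(text, zero_indexed=False, autosomes_only=False):
    """Parse the first six tab-separated columns of a CNA profile file."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    segments = []
    for fields in reader:
        if not fields:
            continue  # tolerate blank lines
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}")
        sample, chrom, start, end, cn1, cn2 = fields[:6]  # extra columns ignored
        if autosomes_only and chrom in SEX_CHROMS:
            continue
        start = int(start)
        if zero_indexed:
            start += 1  # assumption: normalise every start to 1-based
        segments.append(
            (sample, chrom, start, int(end), round(float(cn1)), round(float(cn2)))
        )
    return segments
```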

Path resolution (`--root`)

Relative paths for -T, -P, -C, and output are resolved from the current working directory, or from --root if provided.

Configuration files

Default parameters are found in the file: ./configs/main_config.json. The exact parameters are dependent on the execution mode.

The default execution corresponds to running

dotnet run -- --config ./configs/main_config.json

This file contains optimized simple event and fitness parameters for pan-cancer simulation.

Cancer-type configs

Pre-built configuration files for each cancer type (and pan-cancer) are provided in ./configs/ and two of its subfolders:

| Location | Description |
|----|----|
| `configs/spice_<type>.json` | Optimized event profiles (hg19) derived from TCGA/PCAWG data. |
| `configs/hg38/spice_<type>.json` | Same as above with `SimParams.Assembly` set to `"hg38"`. |
| `configs/basic/spice_<type>.json` | Non-optimized profiles - these should be used with `basic` mode, since the WGD probability is not affected by stress. |

<type> is the TCGA cancer-type abbreviation (e.g. LUAD, BRCA) or pancancer for the pan-cancer profile.

| File Name | Full Name |
|----|----|
| `ACC` | Adrenocortical Carcinoma |
| `BLCA` | Bladder Urothelial Carcinoma |
| `BRCA` | Breast Invasive Carcinoma |
| `CESC` | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma |
| `COAD` | Colon Adenocarcinoma |
| `ESCA` | Esophageal Carcinoma |
| `GBM` | Glioblastoma Multiforme |
| `HNSC` | Head and Neck Squamous Cell Carcinoma |
| `KIRC` | Kidney Renal Clear Cell Carcinoma |
| `KIRP` | Kidney Renal Papillary Cell Carcinoma |
| `LGG` | Brain Lower Grade Glioma |
| `LIHC` | Liver Hepatocellular Carcinoma |
| `LUAD` | Lung Adenocarcinoma |
| `LUSC` | Lung Squamous Cell Carcinoma |
| `MESO` | Mesothelioma |
| `OV` | Ovarian Serous Cystadenocarcinoma |
| `PAAD` | Pancreatic Adenocarcinoma |
| `PCPG` | Pheochromocytoma and Paraganglioma |
| `PRAD` | Prostate Adenocarcinoma |
| `READ` | Rectum Adenocarcinoma |
| `SARC` | Sarcoma |
| `SKCM` | Skin Cutaneous Melanoma |
| `STAD` | Stomach Adenocarcinoma |
| `TGCT` | Testicular Germ Cell Tumors |
| `UCEC` | Uterine Corpus Endometrial Carcinoma |

Customizing configuration

`SimParams`

The parameters controlling the simulation of events.

Seed: int (0): The seed for the random number generator. If < 0, the seed will be generated randomly at runtime.
Assembly: string ("hg19"): The reference genome assembly to use (e.g. "hg19", "hg38").
Sex: ["Any", "Male", "Female"] ("Any"): One of Any, Male, Female. If Any, each sample's sex is generated at random.
RateDist: ["Uniform", "Geometric", "Poisson"] ("Uniform"): The distribution of the mutation rate.
RateMean: float (1.0): the mean of the mutation rate (mutations between two nodes).
TetraploidStart: bool (false): If true, the root karyotype will undergo a whole genome doubling before simulation begins.
AutosomesOnly: bool (false): If true, karyotypes will only contain autosomes (chromosomes 1-22), excluding sex chromosomes.
Mixture: ["Single", "Constant", "Dirichlet"] ("Constant"): How multiple signatures are mixed for each sample. Single means that only one signature is used (selected based on its relative probability), Constant means that each signature has a fixed probability of being selected, while Dirichlet means that the probabilities are drawn from a Dirichlet distribution.

`FitParams`

The parameters controlling the fitness of the samples.

Stress: float (0.0): Stress penalizes abnormally high ploidy.
TsgOg: float (0.0): Affected by the number of Tumor Suppressors (TSG) and Oncogenes (OG) in the sample. TSG loss and OG gain increase fitness.
Essentiality: float (0.0): Penalizes full loss of essential genes.
GeneSet: string ("Empty"): The gene set to use for fitness calculations (e.g. "spice_all"). This should be a folder name relative to the assembly directory.

`Signatures`

Signatures define the mutational processes that generate structural variants during simulation. Each signature contains:

Name: string: A descriptive name for the signature.
Prob: double: The relative probability of this signature being selected (compared to other signatures).
Events: array: A list of event types with their parameters and probabilities.

The Mixture parameter in SimParams controls how multiple signatures are combined:

Single: Only one signature is selected per sample based on relative probabilities.
Constant: Each signature maintains a fixed probability throughout simulation.
Dirichlet: Signature probabilities are drawn from a Dirichlet distribution for each sample.

See the Signatures section below for details on configuring individual event types.
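
To make the three Mixture options more concrete, here is a small Python sketch that turns a list of signature Prob values into per-sample selection weights. It is only an illustration: the function name `mixture_weights` is hypothetical, and using the Prob values as Dirichlet concentration parameters is an assumption, since the concentration is not specified here.

```python
import random


def mixture_weights(probs, mixture, rng):
    """Return per-sample signature weights (summing to 1) for one Mixture mode."""
    total = sum(probs)
    if mixture == "Constant":
        # Fixed weights: the relative Prob values, normalised.
        return [p / total for p in probs]
    if mixture == "Single":
        # One signature is picked per sample and used exclusively.
        chosen = rng.choices(range(len(probs)), weights=probs)[0]
        return [1.0 if i == chosen else 0.0 for i in range(len(probs))]
    if mixture == "Dirichlet":
        # Assumption: Prob values act as Dirichlet concentration parameters.
        # Normalised gamma draws give a Dirichlet sample.
        draws = [rng.gammavariate(p, 1.0) for p in probs]
        s = sum(draws)
        return [d / s for d in draws]
    raise ValueError(f"unknown Mixture: {mixture}")


rng = random.Random(0)
print(mixture_weights([1, 5], "Constant", rng))
```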

Fitness Matching Mode

When using -m matching (fitness matching mode), events are selected using the same mechanism as evolution mode, but instead of maximizing fitness, each event is chosen to minimize the distance to a target fitness value. The EvoParams.MaxTries parameter controls the full candidate-search budget per step. Across that budget, the existing acceptance rule is still evaluated, but its influence fades smoothly from early tries to late tries, so the search transitions continuously from exploration toward strict best-match selection.

`EvoParams`

The parameters controlling the evolutionary mode of event simulation (selection of events based on fitness).

Acceptance: float (0.0): See publication for details. The higher the value, the less likely an event is to be accepted. Usually between 0 and 1.
MaxTries: int (1): How many candidate events are sampled before matching mode settles on the best available event or gives up and moves to the next sample.
Decay: float (0.0): Used in fitness matching mode (-m matching). Controls how strongly the acceptance rule influences candidate selection during the early part of the try budget. The decay increases linearly from 0 (first event) to Decay (last event), so earlier events get more exploration pressure, while later events lean more quickly toward pure distance-to-target matching.

Working path

The --root option sets the base directory used to resolve all relative paths. By default it is . (the working directory from which the command is run), so paths like ./configs/main_config.json and ./out resolve relative to wherever you invoke dotnet run.

If you run SimChA from a different directory — for example via a script or a workflow manager — set --root to the repository folder so that default config and output paths resolve correctly without having to specify each one individually:

dotnet run -- --root /path/to/simcha

With this setting, ./configs/main_config.json resolves to /path/to/simcha/configs/main_config.json and ./out to /path/to/simcha/out, regardless of the current working directory.
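
The resolution rule can be sketched in a few lines of Python. This only mirrors the behaviour described above; the function name `resolve` is hypothetical, and leaving absolute paths untouched is an assumption.

```python
from pathlib import PurePosixPath


def resolve(path, root="."):
    """Resolve a possibly relative path against the --root directory."""
    p = PurePosixPath(path)
    if p.is_absolute():
        return p  # assumption: absolute paths are not re-rooted
    return PurePosixPath(root) / p


print(resolve("./configs/main_config.json", "/path/to/simcha"))
# prints: /path/to/simcha/configs/main_config.json
```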

Input Data

Reference data is located in the data/ folder, organized by assembly name (matching SimParams.Assembly). Each assembly folder must contain the chromosome and centromere descriptions, plus the gene score files in a subfolder (matching FitParams.GeneSet). We provide GRCh37 and GRCh38 in the folders ./data/hg19 and ./data/hg38 respectively.

Chromosomes

chromosomes.tsv

The chromosome file contains two columns, one with the name of a chromosome, one with its number of bases.

Example file:

chr1	249250621
...
chrY	59373566

TSG/OG/Essentiality Score

tsgs_select.tsv, ogs_select.tsv, essentials_select.tsv

We use three data files, providing the TSG/OG/Essentiality score. Each file is a tab-separated file with the following columns:

Chromosome: string - The name of the chromosome.
Start: long - The start position of the gene (inclusive).
End: long - The end position of the gene (inclusive).
Gene: string - The name of the gene.
Score: float - The score of the gene.

Example file:

chr3	178865902	178957881	PIK3CA	1
chr7	140419127	140624564	BRAF	0.991919559
chr12	25357723	25403870	KRAS	0.990684897
...

For each gene, the:

TSG score is the probability that the gene is a Tumor Suppressor Gene (TSG)
OG score is the probability that the gene is an Oncogene (OG)
Essentiality score is the log2 fold change in the reproducibility of a cell after a knock-out of the gene (both alleles).

Centromeres

centromeres.tsv

Each chromosome's centromere is split into a p-arm portion and a q-arm portion, each defined by a start and an end. The file has three columns: chromosome name, start point, end point. Each chromosome has two rows, one for each portion.

It is also possible to provide a single row for each chromosome, corresponding to the whole centromeric region.

Example file:

chr1    121500000   125000000
chr1    125000000   128900000
...
chrY    11600000    12500000
chrY    12500000    13400000

Signatures

SimChA simulates events based on mutational signatures. Each signature is a set of events and their associated parameters. For example, consider the following excerpt from a configuration file:

"Signatures" : [
  {
    "Name": "WholeChromEvents",
    "Prob": 1,
    "Events": [
      {
        "Type": "ChromDeletion",
        "Prob": 1
      },
      {
        "Type": "ChromDuplication",
        "Prob": 2
      }        
    ]
  },
  {
    "Name": "InternalEvents",
    "Prob": 5,
    "Events": [
      {
        "Type": "InternalDeletion",
        "Prob": 1,
        "Size": 1000000
      },
      {
        "Type": "InternalDuplication",
        "Prob": 7,
        "Size": 500000
      }
    ]
  }
]

These are two signatures, one for whole-chromosome events and one for internal events. Their relative selection weights are 1 : 5, so the InternalEvents signature is five times as likely to be selected.

If the WholeChromEvents signature is selected, a deletion and a duplication are chosen with relative probability 1 : 2. SimChA works with contigs, i.e. contiguous sequences of bases, so a deletion or duplication always acts on a contig. A contig may comprise parts of different chromosomes, e.g. after a translocation; at the start of the simulation, however, the set of contigs is the same as the set of chromosomes.

If InternalDeletion is selected, the size of the deleted segment is exponentially distributed with a mean of 1 Mb, while InternalDuplication events have a mean segment size of 500 kb.
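
The relative weights in the excerpt can be turned into absolute probabilities with a short Python sketch. It assumes the two-stage selection described above (first a signature, then an event within it) with fixed signature weights, as under the Constant mixture; the helper names `event_probabilities` and `sample_size` are hypothetical.

```python
import random

signatures = [
    {"Name": "WholeChromEvents", "Prob": 1, "Events": [
        {"Type": "ChromDeletion", "Prob": 1},
        {"Type": "ChromDuplication", "Prob": 2},
    ]},
    {"Name": "InternalEvents", "Prob": 5, "Events": [
        {"Type": "InternalDeletion", "Prob": 1, "Size": 1000000},
        {"Type": "InternalDuplication", "Prob": 7, "Size": 500000},
    ]},
]


def event_probabilities(signatures):
    """Overall probability of each event type under two-stage selection."""
    sig_total = sum(s["Prob"] for s in signatures)
    result = {}
    for sig in signatures:
        ev_total = sum(e["Prob"] for e in sig["Events"])
        for ev in sig["Events"]:
            result[ev["Type"]] = (sig["Prob"] / sig_total) * (ev["Prob"] / ev_total)
    return result


def sample_size(mean, rng):
    """Draw an exponentially distributed segment size with the given mean."""
    return rng.expovariate(1.0 / mean)


for event_type, prob in event_probabilities(signatures).items():
    print(f"{event_type}: {prob:.4f}")
```

With these weights a whole-chromosome deletion has probability 1/6 * 1/3 = 1/18, while an internal duplication has probability 5/6 * 7/8 = 35/48.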

Each Event has an associated type and a probability, which is relative to all the other events in the signature. The following events are available:

ChromDuplication
ChromDeletion
TailDuplication
TailDeletion
CentromereBoundDuplication
CentromereBoundDeletion
ArmDuplication
ArmDeletion
InternalDuplication
InternalDeletion
InternalInversion
InvertedDuplication
Translocation
BreakageFusionBridge
WholeGenomeDoubling
Chromothripsis
Chromoplexy
TIChain
TICycle
TIBridge
Pyrgo
Rigma
SNV
Pass

Events have parameters from the following:

Type: string - The type of the event, one from the list above.
Prob: double - The probability of the event being selected.
Frac: double - The mean size of the event as a fraction of contig length, exponentially distributed. Only applicable to internal, tail, and centromere-bound events.
Frag: double - Some complex events cause fragmentation, this is the mean number of fragments.

Contig selection

Once an event type is chosen, the contig(s) it acts on are selected with a probability that depends on the event category:

Within-contig events (internal deletion/duplication/inversion, inverted duplication, tail deletion/duplication, breakage-fusion-bridge, chromothripsis, pyrgo, rigma, SNV) select a contig with probability proportional to its length — a longer contig is more likely to contain a given breakpoint.
Arm and centromere-bound events (arm deletion/duplication, centromere-bound deletion/duplication) select a contig with probability proportional to its number of centromeres. Contigs without a centromere are never selected; a fused contig carrying several centromeres is proportionally more likely.
Multi-contig events (translocation, templated insertions, chromoplexy) and whole-chromosome events (chromosome deletion/duplication, whole-genome doubling) select contigs uniformly at random.

If no contig is eligible for the chosen event — for example an arm or centromere-bound event when no remaining contig carries a centromere — the event cannot be generated and is skipped: it is recorded in the output as a Skip event (see below) and leaves the karyotype unchanged.
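
The weighting rules above can be sketched in Python as follows. This is a simplified illustration, not SimChA's implementation: contigs are represented as dicts with hypothetical `length` and `centromeres` fields, and only a single contig is drawn, whereas multi-contig events draw several.

```python
import random

LENGTH_WEIGHTED = {
    "InternalDeletion", "InternalDuplication", "InternalInversion",
    "InvertedDuplication", "TailDeletion", "TailDuplication",
    "BreakageFusionBridge", "Chromothripsis", "Pyrgo", "Rigma", "SNV",
}
CENTROMERE_WEIGHTED = {
    "ArmDeletion", "ArmDuplication",
    "CentromereBoundDeletion", "CentromereBoundDuplication",
}


def select_contig(event_type, contigs, rng):
    """Pick a contig index for an event, or None when the event is skipped."""
    if event_type in LENGTH_WEIGHTED:
        weights = [c["length"] for c in contigs]
    elif event_type in CENTROMERE_WEIGHTED:
        weights = [c["centromeres"] for c in contigs]
    else:
        weights = [1] * len(contigs)  # uniform for the remaining categories
    if sum(weights) == 0:
        return None  # no eligible contig: corresponds to a Skip event
    return rng.choices(range(len(contigs)), weights=weights)[0]
```

Returning `None` keeps the "nothing eligible" case explicit, so the caller can record the step instead of silently dropping it.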

Chromosome Deletion (Prob)

A single contig is selected at random and removed.

Chromosome Duplication (Prob)

A single contig is selected at random and duplicated.

Arm Deletion (Prob)

For a contig with at least one centromere, one of the arms is selected at random and removed - the end within the centromere is selected uniformly.

Arm Duplication (Prob)

For a contig with at least one centromere, one of the arms is selected at random and duplicated - the end within the centromere is selected uniformly.

Whole Genome Doubling (Prob)

All the existing contigs are duplicated.

Tail Deletion (Prob, Frac)

A tail of a length given by the Frac parameter is removed from an end of a contig. The end is selected by a coin flip.

Tail Duplication (Prob, Frac)

A tail of a length given by the Frac parameter is duplicated from an end of a contig. The duplicated segment is placed at the same end. The end is selected by a coin flip.

Internal Deletion (Prob, Frac)

A single contig is selected, from which a segment with length distributed according to the Frac parameter is removed. The position of the segment is guaranteed to be internal and is uniformly distributed.

Internal Duplication (Prob, Frac)

A single contig is selected, from which a segment with length distributed according to the Frac parameter is duplicated. The position of the segment is guaranteed to be internal and is uniformly distributed. The copy is pasted directly after its original position.

Internal Inversion (Prob, Frac)

A single contig is selected, from which a segment with length distributed according to the Frac parameter is inverted. The position of the segment is guaranteed to be internal and is uniformly distributed.

Inverted Duplication (Prob, Frac)

Like a duplication, but the segment is inverted before being pasted.

Centromere-Bound Duplication (Prob, Frac)

For a contig with at least one centromere, a segment is selected that extends from within the centromere to a distance given by the Frac parameter. The breakpoint within the centromere is selected uniformly. This segment is then duplicated.

Centromere-Bound Deletion (Prob, Frac)

For a contig with at least one centromere, a segment is selected that extends from within the centromere to a distance given by the Frac parameter. The breakpoint within the centromere is selected uniformly. This segment is then deleted.

Translocation (Prob, Frac)

Two contigs are selected and a segment is chosen on each, with length distributed according to the Frac parameter and position selected individually for each contig. The two segments are then swapped. A coin flip decides if one of the segments is inverted before being pasted.

BreakageFusionBridge (Prob, Frac)

A contig and one of its tails are selected (see above). The tail is removed, the remainder is copied, the copy is reversed, and the two copies are joined at the breakage location.
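
A toy Python version of this operation is shown below. It treats a contig as a list of (name, strand) blocks and always breaks at the right-hand end; both are simplifying assumptions, and the function names are hypothetical.

```python
def flip(segment):
    """Reverse the orientation of a (name, strand) segment."""
    name, strand = segment
    return (name, "-" if strand == "+" else "+")


def breakage_fusion_bridge(contig, break_index):
    """Drop the tail after break_index and fuse the rest to its reversed copy."""
    kept = contig[:break_index]                       # the tail is removed
    mirrored = [flip(seg) for seg in reversed(kept)]  # reversed copy of the rest
    return kept + mirrored                            # joined at the break


contig = [("a", "+"), ("b", "+"), ("c", "+"), ("d", "+")]
print(breakage_fusion_bridge(contig, 3))
# prints: [('a', '+'), ('b', '+'), ('c', '+'), ('c', '-'), ('b', '-'), ('a', '-')]
```

The result is palindromic around the break, so segments close to the breakpoint end up adjacent to their own inverted copies.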

Chromothripsis (Prob, Frac)

A contig is broken into a number of fragments, such that the fragment size is distributed exponentially with a mean of Frac.
Let f be the number of fragments from the first step. Then k fragments are randomly selected such that 0 < k <= f.
The k fragments are then reassembled in a random order, potentially with inverted orientations.
The result is a highly rearranged chromosomal region with multiple breakpoints clustered in one genomic area.
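
The steps above can be sketched in Python on a single contig of a given length. The sketch assumes that Frac is a fraction of the contig length and that k is drawn uniformly from 1..f; neither detail is stated here, and the function name `chromothripsis` is hypothetical.

```python
import random


def chromothripsis(length, frac, rng):
    """Shatter [0, length) and return (reassembled fragments, f)."""
    mean_size = frac * length  # assumption: Frac is a fraction of contig length
    # Step 1: cut the contig at exponentially spaced breakpoints.
    fragments, start = [], 0.0
    while start < length:
        end = min(length, start + rng.expovariate(1.0 / mean_size))
        fragments.append((start, end))
        start = end
    # Step 2: keep k of the f fragments, with 0 < k <= f (uniform k assumed).
    k = rng.randint(1, len(fragments))
    kept = rng.sample(fragments, k)  # sample() returns them in random order
    # Step 3: give every retained fragment a random orientation.
    return [(s, e, rng.choice("+-")) for s, e in kept], len(fragments)


rng = random.Random(42)
rearranged, f = chromothripsis(1_000_000, 0.1, rng)
print(f"{len(rearranged)} of {f} fragments retained")
```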

Chromoplexy (Prob, Frac, Frag)

A number c of contigs is selected, following the probability distribution listed below.
Each contig is broken into a number of fragments, such that the fragment size is normally distributed with a mean of Frac.
These contigs are then reassembled in a random order into a single contig.
This contig is broken down into new contigs with mean number of contigs equal to Frag.

Probabilities are sourced from Ashby et al., 2019:

2 contigs: 21%
3 contigs: 46%
4 contigs: 18%
5 contigs: 10%
6 contigs: 5%
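
Drawing the number of contigs from the probabilities listed above takes only a few lines of Python. The names `CONTIG_COUNT_PROBS` and `sample_contig_count` are hypothetical and used for illustration only.

```python
import random

# Probabilities for the number of contigs involved, as listed above.
CONTIG_COUNT_PROBS = {2: 0.21, 3: 0.46, 4: 0.18, 5: 0.10, 6: 0.05}


def sample_contig_count(rng):
    """Draw the number of contigs taking part in one chromoplexy event."""
    counts = list(CONTIG_COUNT_PROBS)
    weights = list(CONTIG_COUNT_PROBS.values())
    return rng.choices(counts, weights=weights)[0]


rng = random.Random(0)
print(sample_contig_count(rng))
```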

Pyrgo (Prob, Frac, Frag)

A contig is selected and a segment of length distributed along Frac is chosen.
This segment is fragmented into multiple pieces, with the number of fragments drawn from a geometric distribution with mean Frag.
Each fragment has a length drawn from an exponential distribution with mean Frac / Frag.
These fragments are then duplicated and randomly inserted throughout the genome, creating dispersed duplications.

Rigma (Prob, Frac, Frag)

A contig is selected and a starting position is chosen.
From this position, a series of deletions are made, with the number of deletions drawn from a geometric distribution with mean Frag.
Each deletion has a length drawn from an exponential distribution scaled by Frac / Frag.
This creates a pattern of multiple local deletions originating from a single starting point.

Template Insertion Chain (Prob, Frac, Frag)

A number of contigs is selected from a geometric distribution with mean Frag (minimum 1).
From each contig (except the first and last), a segment is selected with length distributed along Frac.
The first segment has zero length (just a breakpoint), and the last segment also has zero length.
These segments are then chained together sequentially, with random orientations, creating a linear arrangement of templated insertions.
The resulting chain is inserted back into the genome as a single new contig.

Template Insertion Cycle (Prob, Frac, Frag)

A number of contigs is selected from a geometric distribution with mean Frag (minimum 1).
From each contig, a segment is selected with length distributed along Frac.
Unlike TIChain, all segments (including the first) have non-zero length.
These segments are chained together and form a cycle (the last segment connects back to the first).
The cycle is then integrated into the genome, creating a circular arrangement of templated insertions.

Somatic Nucleotide Variants - SNV (Prob)

SimChA can apply the Jukes-Cantor nucleotide substitution model. To attach the proposed SNV events to a reference genome, the reference genomes have to be downloaded first. For this we provide the DownloadRefData.sh script, which can be run from the root directory of the project as follows:

chmod +x scripts/DownloadRefData.sh && ./scripts/DownloadRefData.sh

The download script places the reference FASTA files at data/hg19/genome.fa and data/hg38/genome.fa. These files are required when using -v/--variants and -f/--fasta (for the assembly selected in SimParams.Assembly).

Note that if you don't want to download both reference genomes (hg19 and hg38), simply comment out or remove the relevant section of the script.
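
Under the Jukes-Cantor model mentioned above, all substitutions are equally likely, so a substituted base is replaced by one of the three other bases with equal probability. A minimal Python sketch of that single step follows; the function name is hypothetical and this is not SimChA's SNV code.

```python
import random

BASES = "ACGT"


def jukes_cantor_substitute(ref_base, rng):
    """Replace a base with one of the three other bases, chosen uniformly."""
    ref = ref_base.upper()
    if ref not in BASES:
        raise ValueError(f"cannot substitute base {ref_base!r}")
    alternatives = [b for b in BASES if b != ref]
    return rng.choice(alternatives)


rng = random.Random(3)
print(jukes_cantor_substitute("A", rng))  # one of C, G, T
```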

Template Insertion Bridge (Prob, Frac, Frag)

A number of contigs is selected from a geometric distribution with mean Frag (minimum 2).
The first segment has zero length (just a breakpoint), while the remaining segments have lengths distributed along Frac.
These segments are chained together with random orientations.
The resulting structure bridges the original breakpoint with templated insertions from other genomic locations.
This mimics the mechanism of template switching during DNA repair or replication.

Pass (Prob)

A no-op event that leaves the karyotype unchanged. Pass can be added to a signature like any other event to reserve part of the probability mass for "nothing happens" — i.e. a mutation step that draws this event consumes a step but makes no structural change. It is the only no-op event that is intended to be configured.

Skip

Skip is an internal event that is not configured in signatures. It is substituted automatically when a drawn event cannot be applied, in two situations:

No contig is eligible for the chosen event (e.g. an arm or centromere-bound event when no remaining contig carries a centromere).
In evolution/matching modes, the candidate search exhausts EvoParams.MaxTries without accepting an event.

Like Pass, a Skip leaves the karyotype unchanged, but it is still recorded as a row in events.tsv so that skipped steps remain visible in the output.

Output

The text files are primarily used as source for plots shown below.

`copynumbers.tsv`

CN output in the format exemplified below:

sample_id	chrom	start	end	cn_a	cn_b	n_snvs
0	chr1	24721	98434	1	4	2

`sim_params.json`

Stores the configuration parameters used for this simulation, including the random seed. If this file is provided as input, the exact same simulation will be executed.

`samples.tsv`

Information about the individual samples at the end of the simulation.

`events.tsv`

Information about the individual events in the simulation. The columns are:

sample_id	event_type	depth	description	delta_fitness	total_fitness	num_rejections	signature	regions_gained	regions_lost
sample_0	ChromDuplication	12	contig:23	0.5378	11.5378	0	CNA	>chr11[0:135006516)
sample_0	InternalDuplication	13	contig:18;length:59128983;start:49287088;end:50048544	0.0000	11.5378	12	CNA	>chr19[49287088:50048544)
sample_0	InternalDeletion	14	contig:12;length:115169878;start:2802803;end:7456363	0.0000	11.5378	2	CNA		>chr13[2802803:7456363)

The regions_gained/regions_lost columns list all individual regions that were added or removed by the event (separated by | if multiple). This includes regions from whole contigs that were created or deleted, as well as regions that were duplicated or removed within existing contigs.
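
For downstream analysis, the region strings can be parsed with a small Python helper. The pattern below is inferred from the example rows only: the leading `>` is treated as an optional prefix whose meaning is not specified here, and the function name `parse_regions` is hypothetical.

```python
import re

# Matches e.g. ">chr19[49287088:50048544)": optional ">", name, half-open range.
REGION = re.compile(r"^>?(?P<chrom>[^\[]+)\[(?P<start>\d+):(?P<end>\d+)\)$")


def parse_regions(field):
    """Split a regions_gained/regions_lost cell into (chrom, start, end) tuples."""
    regions = []
    for item in filter(None, field.split("|")):
        match = REGION.match(item.strip())
        if match is None:
            raise ValueError(f"unrecognised region: {item!r}")
        regions.append((match["chrom"], int(match["start"]), int(match["end"])))
    return regions


print(parse_regions(">chr19[49287088:50048544)"))
# prints: [('chr19', 49287088, 50048544)]
```

Summing `end - start` over the parsed tuples gives the number of bases gained or lost by an event.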

`vcf.tsv`

When using the -v or --variants flag, generates a simple VCF file of all the observable (i.e. final and nucleotide-altering) SNVs that occurred to the samples during simulation. This requires data/<assembly>/genome.fa (for example data/hg19/genome.fa), which can be downloaded with scripts/DownloadRefData.sh (see SNV section above).

The file contains standard VCF format headers and columns:

##fileformat=VCFv4.3

##source=SimChAV1.0

##reference=verily_hg19_genome.fa

| `#SAMPLEID` | CHROM | POS | ID | REF | ALT |
|----|----|----|----|----|----|
| 0 | chr1 | 8114090 | . | A | C |

`genome.fa`

Note that this currently only works with single-sample simulations.

Generates the final simulated karyotype sequence from the relevant human genome reference. It will also include the introduced SNVs if applicable. Using -f requires a reference FASTA named genome.fa in the selected assembly folder, e.g. data/hg19/genome.fa or data/hg38/genome.fa. You can create these files with scripts/DownloadRefData.sh.

`segments.tsv`

When using the -s or --segments flag, outputs copy number segments in a simplified format. Each segment represents a contiguous region with uniform copy number state.

`consistent_segments.tsv`

When using the -S or --consistent-segments flag, outputs segments under minimum consistent segmentation. This provides a more refined segmentation that is consistent across all samples in the simulation.

`karyotypes.tsv`

When using the -k or --karyotypes flag, outputs detailed karyotype information. Unlike CN segments, karyotypes retain information about:

The connections between segments
Orientations (5' to 3' or 3' to 5')
Structural relationships between different chromosomal regions
The complete architecture of derived chromosomes

Each karyotype is represented as a series of regions in the format ChromID*[start:end) where * is either + for 5' to 3' or - for 3' to 5'.

Plots

SimChA outputs can be visualized using the provided Python scripts:

Karyotype Plots

Use scripts/plot_karyotype.py to visualize the structural arrangement of chromosomes:

python scripts/plot_karyotype.py --input out/karyotypes.tsv --sample sample_0

This generates a horizontal arrow plot showing:

Each contig as a row
Chromosomal segments colored by chromosome of origin
Orientation indicated by arrow direction
Scale in megabases

Copy Number Plots

Use scripts/plot_cns.py with the --cn flag to visualize allele-specific copy number profiles:

python scripts/plot_cns.py --input out/karyotypes.tsv --sample sample_0 --cn

This generates a per-chromosome CN track plot showing:

One row per chromosome, scaled proportionally to its CN range
cn_a (major allele) displayed as a rectangle above the CN value
cn_b (minor allele) displayed as a rectangle below the CN value
Chromosomes colored using consistent chromosome colors
Shared x-axis scaled to the longest chromosome

Additional Plots

When using SimChA as part of optimization workflows (see project-simcha), additional plots are generated including:

Event frequency heatmaps
WGD and event count distributions
Fitness trajectory plots
Copy number frequency plots across the genome

Examples

Basic Simulation

Run a simulation with events selected at random (-m basic):

dotnet run -- --output ./out --config ./configs/main_config.json -m basic

Evolution Mode

Simulate tumor evolution with fitness-based selection, using the gene set named in FitParams.GeneSet of the configuration file (e.g. "spice_all"):

dotnet run -- -O ./out -C ./configs/main_config.json -m evolution

Fitness Matching Mode

Run a fitness matching simulation. The target fitness values are read from the Fitness column of the clone tree passed with -T:

dotnet run -- -O ./out -C ./configs/main_config.json -T /path/to/tree.tsv -m matching

With All Outputs

Generate all available output formats:

dotnet run -- -O ./out -C ./configs/main_config.json -s -S -k -v -f

Note: The -f flag generates large FASTA files (~6GB per sample) and should only be used when necessary.

Performance Considerations

Memory: Each sample requires approximately 100-500 MB depending on event count
Runtime: Typical simulations (100 events, 50 samples) complete in 1-5 minutes
FASTA output: Adds significant time and disk space (~6GB per sample)
Parallelization: SimChA is single-threaded; run multiple instances for parallel simulations

Running Tests

The Tests project is not run by default. Run the test suite manually with:

dotnet test

To target the test project explicitly:

dotnet test test

Contact

Email questions, feature requests and bug reports to Adam Streck, adam.streck@iccb-cologne.org.

License

SimChA is available under the MIT License.
