Merged
53 changes: 40 additions & 13 deletions README.md
@@ -5,7 +5,7 @@
[![PyPI license](https://img.shields.io/pypi/l/attpc_spyral.svg)](https://pypi.python.org/pypi/attpc_spyral/)
[![DOI](https://zenodo.org/badge/528950398.svg)](https://doi.org/10.5281/zenodo.14143006)

Spyral is an analysis library for data from the Active Target Time Projection Chamber (AT-TPC). Spyral provides a flexible analysis pipeline, transforming the raw trace data into physical observables over several tunable steps. The analysis pipeline is also extensible, supporting a diverse array of datasets. Sypral can process multiple data files in parallel, allowing for scalable performance over larger experiment datasets.
Spyral is an analysis library for data from the Active Target Time Projection Chamber (AT-TPC). Spyral provides a flexible analysis pipeline, transforming the raw trace data into physical observables over several tunable steps. The analysis pipeline is also extensible, supporting a diverse array of datasets. Spyral can process multiple data files in parallel, allowing for scalable performance over larger experiment datasets.

## Installation

@@ -51,6 +51,7 @@ from spyral import (
DetectorParameters,
ClusterParameters,
OverlapJoinParameters,
TripclustParameters,
SolverParameters,
EstimateParameters,
DEFAULT_MAP,
@@ -63,7 +64,7 @@ workspace_path = Path("/some/workspace/path/")
trace_path = Path("/some/trace/path/")

run_min = 94
run_max = 94
run_max = 97
n_processes = 4

pad_params = PadParameters(
@@ -104,16 +105,42 @@ det_params = DetectorParameters(

cluster_params = ClusterParameters(
min_cloud_size=50,
min_points=3,
min_size_scale_factor=0.05,
min_size_lower_cutoff=10,
cluster_selection_epsilon=10.0,
overlap_join=OverlapJoinParameters(
min_cluster_size_join=15.0,
circle_overlap_ratio=0.25,
hdbscan_parameters = None,
# hdbscan_parameters = HdbscanParameters(
# min_points=3,
# min_size_scale_factor=0.03,
# min_size_lower_cutoff=10,
# cluster_selection_epsilon=10.0),
# overlap_join=OverlapJoinParameters(
# min_cluster_size_join=15,
# circle_overlap_ratio=0.25,
# ),
# continuity_join=None,
continuity_join = ContinuityJoinParameters(
join_radius_fraction=0.4,
join_z_fraction=0.2),
overlap_join=None,
outlier_scale_factor=0.1,
direction_threshold=0.5,
# tripclust_parameters=None,
tripclust_parameters=TripclustParameters(
r=6,
rdnn=True,
k=12,
n=3,
a=0.03,
s=0.3,
sdnn=True,
t=0.0,
tauto=True,
dmax=0.0,
dmax_dnn=False,
ordered=True,
link=0,
m=50,
postprocess=False,
min_depth=25,
),
continuity_join=None,
outlier_scale_factor=0.05,
)

estimate_params = EstimateParameters(
@@ -172,7 +199,7 @@ The core of Spyral is the Pipeline. A Pipeline is a complete description of an a

### Parallel Processing

Spyral is capable of running multiple data files in parallel. This is acheived through the python `multiprocessing` library. In the `start_pipeline` function a parameter named `n_processors` indicates to Spyral the *maximum* number of processors which can be spawned. Spyral will then inspect the data load that was submitted in the configuration and attempt to balance the load across the processors as equally as possible.
Spyral is capable of running multiple data files in parallel. This is achieved through the python `multiprocessing` library. In the `start_pipeline` function a parameter named `n_processors` indicates to Spyral the *maximum* number of processors which can be spawned. Spyral will then inspect the data load that was submitted in the configuration and attempt to balance the load across the processors as equally as possible.
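
As a rough illustration of the balancing described above, the runs in the configured range could be distributed round-robin across the workers. The function below is a hypothetical sketch, not Spyral's scheduler; the real implementation inspects the data load itself, so its split may differ.

```python
# Hypothetical sketch of round-robin load balancing; Spyral's real
# scheduler inspects the data load itself, so its split may differ.
def balance_runs(run_min: int, run_max: int, n_processes: int) -> list[list[int]]:
    """Distribute an inclusive run range across at most n_processes workers."""
    runs = list(range(run_min, run_max + 1))
    n_workers = min(n_processes, len(runs))  # never spawn idle workers
    chunks: list[list[int]] = [[] for _ in range(n_workers)]
    for i, run in enumerate(runs):
        chunks[i % n_workers].append(run)
    return chunks

print(balance_runs(94, 97, 4))  # four runs, four workers: one run each
```

With `run_min=94`, `run_max=97`, and `n_processes=4` as in the example configuration, each process would receive a single run.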

Some notes about parallel processing:

@@ -183,7 +210,7 @@ Some notes about parallel processing:

### Logs and Output

Spyral creates a set of logfiles when it is run (located in the log directory of the workspace). These logfiles can contain critical information describing the state of Spyral. In particular, if Spyral has a crash, the logfiles can be useful for determining what went wrong. A logfile is created for each process (including the parent process). The files are labeld by process number (or as parent in the case of the parent).
Spyral creates a set of logfiles when it is run (located in the log directory of the workspace). These logfiles can contain critical information describing the state of Spyral. In particular, if Spyral has a crash, the logfiles can be useful for determining what went wrong. A logfile is created for each process (including the parent process). The files are labeled by process number (or as parent in the case of the parent).

## Notebooks

164 changes: 133 additions & 31 deletions docs/user_guide/config/cluster.md
@@ -2,38 +2,48 @@

The cluster parameters control the clustering, joining, and outlier detection algorithms.

The default recommended settings when using the continuity join method are:
The default recommended settings for HDBSCAN when using the continuity join method are:

```python
cluster_params = ClusterParameters(
min_cloud_size=50,
min_points=5,
min_size_scale_factor=0.0,
min_size_lower_cutoff=5,
cluster_selection_epsilon=13.0,
# hdbscan_parameters = None
hdbscan_parameters = HdbscanParameters(
min_points=5,
min_size_scale_factor=0.0,
min_size_lower_cutoff=5,
cluster_selection_epsilon=13.0),
overlap_join=None,
continuity_join=ContinuityJoinParameters(
join_radius_fraction=0.3,
join_z_fraction=0.2,
),
# continuity_join = None
outlier_scale_factor=0.05,
direction_threshold = 0.5,
tripclust_parameters = None,
)
```

The default recommended settings when using the circle overlap join method are:
The default recommended settings for HDBSCAN when using the circle overlap join method are:

```python
cluster_params = ClusterParameters(
min_cloud_size=50,
min_points=3,
min_size_scale_factor=0.05,
min_size_lower_cutoff=10,
cluster_selection_epsilon=10.0,
# hdbscan_parameters = None
hdbscan_parameters = HdbscanParameters(
min_points=3,
min_size_scale_factor=0.05,
min_size_lower_cutoff=10,
cluster_selection_epsilon=10.0),
# overlap_join=None,
overlap_join=OverlapJoinParameters(
min_cluster_size_join=15.0,
circle_overlap_ratio=0.25,
),
continuity_join=None,
direction_threshold = 0.5,
tripclust_parameters = None,
outlier_scale_factor=0.05,
)
```
@@ -46,56 +56,148 @@ This is the minimum size a point cloud must be (in number of points) to be sent

## minimum_points

The minimum number of samples (points) in a neighborhood for a point to be a core point. This is a re-exposure of the `min_samples` parameter of
[scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). See
their documentation for more details. Larger values will make the algorithm more likely to identify points as noise. See the original
The minimum number of samples (points) in a neighborhood for a point to be a core point. This is a re-exposure of the `min_samples` parameter of
[scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). See
their documentation for more details. Larger values will make the algorithm more likely to identify points as noise. See the original
[HDBSCAN docs](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#) for details on why this parameter is important and how it can impact the data.

## minimum_size_scale_factor

HDBSCAN requires a minimum size (the hyper parameter `min_cluster_size` in
HDBSCAN requires a minimum size (the hyper parameter `min_cluster_size` in
[scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN)) in terms of samples for a group to
be considered a valid cluster. AT-TPC point clouds vary dramatically in size, from tens of points to thousands. To handle this wide scale, we use a scale factor
to determine the appropriate minimum size, where `min_cluster_size = minimum_size_scale_factor * n_cloud_points`. The default value was found through some testing,
be considered a valid cluster. AT-TPC point clouds vary dramatically in size, from tens of points to thousands. To handle this wide scale, we use a scale factor
to determine the appropriate minimum size, where `min_cluster_size = minimum_size_scale_factor * n_cloud_points`. The default value was found through some testing,
and may need serious adjustment to produce the best results. Note that the scale factor should be *small*. When using the continuity join method, scaling was found to be unnecessary.

## minimum_size_lower_cutoff

As discussed in the above `minimum_size_scale_factor`, we need to scale the `min_cluster_size` parameter to the size of the point cloud. However, there must be
a lower limit (i.e. you can't have a minimum cluster size of 0). This parameter sets the lower limit; that is any `min_cluster_size` calculated using the scale factor
that is smaller than this cutoff is replaced with the cutoff value. As an example, if the cutoff is set to 10 and the calculated value is 50, the calculated value would
As discussed in the above `minimum_size_scale_factor`, we need to scale the `min_cluster_size` parameter to the size of the point cloud. However, there must be
a lower limit (i.e. you can't have a minimum cluster size of 0). This parameter sets the lower limit; that is any `min_cluster_size` calculated using the scale factor
that is smaller than this cutoff is replaced with the cutoff value. As an example, if the cutoff is set to 10 and the calculated value is 50, the calculated value would
be used. However, if the calculated value is 5, the cutoff would be used instead.
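
Putting the scale factor and lower cutoff together, the effective `min_cluster_size` handed to HDBSCAN can be sketched as follows (the function name is illustrative, not part of Spyral's API):

```python
# Illustrative sketch of the scaling described above; the function name
# is not part of Spyral's API.
def effective_min_cluster_size(
    n_cloud_points: int,
    min_size_scale_factor: float,
    min_size_lower_cutoff: int,
) -> int:
    scaled = int(min_size_scale_factor * n_cloud_points)
    return max(scaled, min_size_lower_cutoff)

print(effective_min_cluster_size(1000, 0.05, 10))  # scaled value 50 is used
print(effective_min_cluster_size(100, 0.05, 10))   # cutoff 10 replaces 5
```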

## cluster_selection_epsilon

A re-exposure of the `cluster_selection_epsilon` paramter of
[scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). This parameter will merge clusters that
are less than epsilon apart. Note that this epsilon must be on the scale of the scaled data (i.e. it is not in normal units). The impact of this parameter is large, and
small changes to this value can produce dramatically different results. Larger values will bias the clustering to assume the point cloud is onesingle cluster (or all noise),
while smaller values will cause the algorithm to revert to the default result of HDBSCAN. See the original
A re-exposure of the `cluster_selection_epsilon` parameter of
[scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). This parameter will merge clusters that
are less than epsilon apart. Note that this epsilon must be on the scale of the scaled data (i.e. it is not in normal units). The impact of this parameter is large, and
small changes to this value can produce dramatically different results. Larger values will bias the clustering to assume the point cloud is one single cluster (or all noise),
while smaller values will cause the algorithm to revert to the default result of HDBSCAN. See the original
[HDBSCAN docs](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#) for details on why this parameter is important and how it can impact the data.
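
To make the mapping concrete, the three HDBSCAN-related parameters above correspond to keyword arguments of `sklearn.cluster.HDBSCAN`. The helper below is hypothetical; it builds the keywords using the circle-overlap defaults shown earlier:

```python
# Hypothetical helper showing how the exposed parameters map onto
# scikit-learn's HDBSCAN keywords; values are the circle-overlap defaults.
def hdbscan_kwargs(n_cloud_points: int) -> dict:
    min_points = 3
    min_size_scale_factor = 0.05
    min_size_lower_cutoff = 10
    cluster_selection_epsilon = 10.0
    min_cluster_size = max(
        int(min_size_scale_factor * n_cloud_points), min_size_lower_cutoff
    )
    # These would be consumed as sklearn.cluster.HDBSCAN(**hdbscan_kwargs(n))
    return {
        "min_samples": min_points,
        "min_cluster_size": min_cluster_size,
        "cluster_selection_epsilon": cluster_selection_epsilon,
    }

print(hdbscan_kwargs(2000))
```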

The recommended parameters when using the TRIPCLUST (Dalitz) clustering algorithm are:
(see also the publication [Dalitz clustering](https://doi.org/10.1016/j.cpc.2018.09.010) for more information)

```python
# tripclust_parameters=None,
tripclust_parameters=TripclustParameters(
r=6,
rdnn=True,
k=12,
n=3,
a=0.03,
s=0.3,
sdnn=True,
t=0.0,
tauto=True,
dmax=0.0,
dmax_dnn=False,
ordered=True,
link=0,
m=50,
postprocess=False,
min_depth=25,
)
```
A breakdown of each parameter:

## r

The neighbor distance for smoothing.

## rdnn

When this boolean is set to True, the value of `r` is calculated automatically from dnn (the mean nearest-neighbor distance of the point cloud).

## k

The number of tested neighbors for each triplet midpoint.

## n

The maximum number of triplets retained for each midpoint.

## a

The maximum allowed angle between the two triplet branches, expressed as 1 - cos(alpha).

## s

The distance scale factor in the metric of triplet distance.

## sdnn

When set to True, the value of s is calculated automatically using dnn.

## t

The threshold for the "distance" between triplets.

## tauto

When set to True, the value of t is set automatically.

## dmax

The maximum gap width.

## dmax_dnn

When set to True, dnn is used to determine dmax.

## ordered

When set to True, the point cloud is sorted into chronological order.

## link

Linkage method for hierarchical clustering of the triplets.

## m

The minimum number of triplets per cluster.

## postprocess

When set to True, the post-processing algorithm is attempted.

## min_depth

The minimum depth used in the post-processing algorithm.



## outlier_scale_factor

We use [scikit-learn's LocalOutlierFactor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) as a last round of noise elimination on
a cluster-by-cluster basis. This algorithim requires a number of neighbors to search over (the `n_neighbors` parameter). As with the `min_cluster_size` in HDBSCAN, we need to
scale this value off the size of the cluster. This factor multiplied by the size of the cluster gives the number of neighbors to search over
(`n_neighbors = outlier_scale_factor * cluster_size`). This value tends to have a "sweet spot" where it is most effective. If it is too large, every point has basically the
same outlier factor as you're including the entire cluster for every point. If it is too small the variance between neighbors can be too large and the results will be
We use [scikit-learn's LocalOutlierFactor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) as a last round of noise elimination on
a cluster-by-cluster basis. This algorithm requires a number of neighbors to search over (the `n_neighbors` parameter). As with the `min_cluster_size` in HDBSCAN, we need to
scale this value off the size of the cluster. This factor multiplied by the size of the cluster gives the number of neighbors to search over
(`n_neighbors = outlier_scale_factor * cluster_size`). This value tends to have a "sweet spot" where it is most effective. If it is too large, every point has basically the
same outlier factor as you're including the entire cluster for every point. If it is too small the variance between neighbors can be too large and the results will be
unpredictable. Note that if the value of `outlier_scale_factor * cluster_size` is less than 2, `n_neighbors` will be set to 2 as this is the minimum allowed value.
This algorithm is also used to clean clusters prior to the joining process, using the same parameter.
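
The neighbor count described above can be sketched as follows (the helper name is hypothetical); the result would feed `sklearn.neighbors.LocalOutlierFactor(n_neighbors=...)`:

```python
# Hedged sketch of the n_neighbors scaling described above; the helper
# name is illustrative, not Spyral's internal API.
def outlier_n_neighbors(cluster_size: int, outlier_scale_factor: float) -> int:
    # LocalOutlierFactor requires at least 2 neighbors
    return max(2, int(outlier_scale_factor * cluster_size))

print(outlier_n_neighbors(500, 0.05))  # -> 25
print(outlier_n_neighbors(20, 0.05))   # -> 2 (clamped to the minimum)
```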

## Overlap Join Parameters

### min_cluster_size_join

The minimum size of a cluster for it to be considered in the joining step of the clustering. After HDBSCAN has made the initial clusters we attempt to combine any clusters which
The minimum size of a cluster for it to be considered in the joining step of the clustering. After HDBSCAN has made the initial clusters we attempt to combine any clusters which
have overlapping circles in the 2-D projection (see `circle_overlap_ratio`). However, many times, small pockets of noise will be clustered and often sit within the larger trajectory.
To avoid these being joined we require a cluster to have a minimum size.

### circle_overlap_ratio

The minimum amount of overlap between circles fit to two clusters for the clusters to be joined together into a single cluster. HDBSCAN often fractures trajectories into multiple
The minimum amount of overlap between circles fit to two clusters for the clusters to be joined together into a single cluster. HDBSCAN often fractures trajectories into multiple
clusters as the point density changes due to the pad size, gaps, etc. These fragments are grouped together based on how much the circles fit to their 2-D (X-Y) projections overlap.
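
A minimal sketch of the join decision described in this section (names hypothetical; computing the circle fits and the overlap ratio itself is outside this snippet):

```python
# Hypothetical sketch of the overlap-join decision; the circle fits and
# the overlap ratio are computed elsewhere in the pipeline.
def should_join(
    size_a: int,
    size_b: int,
    overlap_ratio: float,
    min_cluster_size_join: float = 15.0,
    circle_overlap_ratio: float = 0.25,
) -> bool:
    # Small noise pockets are never joined, even if their circles overlap
    if min(size_a, size_b) < min_cluster_size_join:
        return False
    return overlap_ratio >= circle_overlap_ratio

print(should_join(120, 60, 0.40))  # large overlapping fragments: join
print(should_join(120, 8, 0.90))   # tiny noise pocket: never joined
```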

## Continuity Join Parameters
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -1,13 +1,14 @@
[project]
name = "attpc_spyral"
version = "1.0.0"
version = "1.1.1"
description = "AT-TPC analysis pipeline"
authors = [
{name = "gwm17", email = "gordonmccann215@gmail.com"},
{name = "turinath", email = "turi@frib.msu.edu"},
{name = "DBazin", email = "bazin@frib.msu.edu"},
]
dependencies = [
"spyral-utils>=2.0.0",
"spyral-utils>=2.1.0",
"contourpy>=1.2.1",
"h5py>=3.11.0",
"lmfit>=1.3.0",
2 changes: 2 additions & 0 deletions src/spyral/__init__.py
@@ -20,6 +20,7 @@
ClusterParameters,
OverlapJoinParameters,
ContinuityJoinParameters,
TripclustParameters,
EstimateParameters,
SolverParameters,
DEFAULT_MAP,
@@ -52,6 +53,7 @@
"ClusterParameters",
"OverlapJoinParameters",
"ContinuityJoinParameters",
"TripclustParameters",
"EstimateParameters",
"SolverParameters",
"DEFAULT_MAP",