Skip to content

Conversation

@S81D
Copy link
Collaborator

@S81D S81D commented Sep 23, 2025

Describe your changes

Since the start of Phase II there was a decision to ditch the PMT waveforms for the ProcessedData and only save the extracted hits (ID, charge, timing, etc...). This was done for a good reason as typically analyzers only need the Hits information and the ProcessedData size was cut down from ~80MB (with the full PMT waveforms) to ~6MB (Hits only) per part file.

Frank brought up a good point that we should not be "throwing away" the pulse information in case there is ever a need to examine the noise or do any sort of analysis on the pulses that are saved in our data.

That information isn't "throw away" technically (as it still can be found in the RawData), but in order to actually look at the raw waveforms and traces, you had to essentially re-run the event building tools via the old DataDecoder (or the newEventBuilderRaw) toolchain without the hit finding tools, then run another toolchain to pull the waveforms out in a readable format via something like the PrintADCData tool... this is a pain and many of the tools involved are somewhat deprecated.

Instead of saving the entirety of the waveforms, we can instead just save the identified pulse from the hit finding (from pulse start to pulse end) in the ADCPulse class along with the start, stop, amplitude, charge, etc... so that anyone can access it from the ProcessedData if needed.

This PR expands the ADCPulse class to include the raw traces for each hit, allowing it to be accessed by other tools via RecoADCHits. The PhaseIIADCHitfinder tool populates this with the x (ns) and y (adc) points of the identified pulse, saving them to vectors.

The only caution (and question) I have is that if you run an older version of ToolAnalysis (without the changes in this PR) over any newly ProcessedData with the ADCPulse expansion, i.e. if you run the event building tools with these changes and try to load the ProcessedData with the traces using an older version of ToolAnalysis, the LoadANNIEEvent tool will fail to load it from the store. Andrew Sutton made a similar modification to the ADCPulse class as part of his closed PR #324 (he added the stop time of the pulse to the class - I omitted it from PR #358 as I knew i was going to eventually add the whole pulse), and his remedy was to add to ADCPulse.h something like:

template <class Archive> void serialize(Archive& ar,
      const unsigned int version)
    {
      Hit::serialize(ar, version);

      ar & start_time_;
      ar & peak_time_;
      ...
      ...
      if (version > 0) {
        ar & stop_time;
      }

...
...

};

// Need to increment the class version since we added time as a new variable
// the version number ensures backward compatibility when serializing 
BOOST_CLASS_VERSION(ADCPulse, 1)

I have added this to my modifications of the class, though it doesn't seem to remedy that problem. These changes do preserve compatibility when running toolchains over older ProcessedData without the ADCPulse expansion. So any older data remains readable by the newer version of ToolAnalysis.

If anyone (Ben or Marcus in particular) have a comment on what could be happening with this I would be happy to alter my PR. Not sure the issue identified above is really a back-breaker - anyone analyzing data with an out-of-date ToolAnalysis will only be hindered if they are trying to look at the newest data from the new EventBuilderV2 tools... and in that case there are so many new modifications from Yue's latest PRs that they should be used the latest ToolAnalysis anyways.

I also added more information to the PhaseIIADCHitFinder README as it was a little opaque (and we always need more documentation).

Checklist before submitting your PR

  • This PR implements a single change (one new/modified Tool, or a set of changes to implement one new/modified feature)
  • This PR alters the minimum number of files to affect this change
  • [N/A] If this PR includes a new Tool, a README and minimal demonstration ToolChain is provided
  • [N/A] If a new Tool/ToolChain requires model or configuration files, their paths are not hard-coded, and means of generating those files is described in the readme, with examples provided on /pnfs/annie/persistent
  • [N/A] For every new usage, there is a reason the data must be on the heap
  • [N/A] For every new there is a delete, unless I explicitly know why (e.g. ROOT or a BoostStore takes ownership)

Modifications are made to the DataModel so additional oversight by Level 0 is needed.

Additional information

The impact on file sizes by adding the traces is not large. Here is a test from an AmBe run on two part files:

without trace | with trace
--------------------------
22MB      |   24 MB
24MB      |   26 MB

Looks to add ~2MB per part file.

Beef up Hit finding README (yay more documentation)
Add entire ADC trace to the pulse class
Expand ADCpulse class to include entire trace
Extend ADCpulse class to include traces
bug fix and additional documentation for the newly added pulse features
more bug fixes to ADCpulse (should work now!)
bug fix for storing traces
trying to make it as backwards compatible as possible
@S81D S81D assigned marc1uk and jminock and unassigned marc1uk Sep 23, 2025
@S81D
Copy link
Collaborator Author

S81D commented Sep 23, 2025

Also there will be a follow up PR for a toolchain to easily read the newly added traces.

Copy link
Collaborator

@jminock jminock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing surface level incorrect with changes to the Tool or DataModel. Would like further review of version conditional in DataModel if this is appropriate fix for handling different iterations of files @marc1uk or @brichards64

ar & raw_area_;
ar & raw_amplitude_;
ar & calibrated_amplitude_;
if (version > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does version work? How does it iterate? Is this the correct way to ensure the version differences? These are questions more for Ben and Marcus.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty much the same as ROOT's schema evolution - the default class version is 0. If you modify the class definition, you need to manually alter the class version (what value you give it is up to you) and the (de)serialisation code to check that and behave appropriately. I don't believe there's any automatic way to ensure it does the 'correct' thing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No functionality is being taken away, and the changes would only impact the ADCHitFinder to my knowledge; so, I don't think there's anything wrong with making changes to the DataModel in principle. Will need further review from Ben and/or Marcus regarding the version conditional.

}
}

// extract the x and y points of the pulse (subtract off baseline and "zero" the pulse to the pulse start)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this same block is added in four times in this PR. Can we make it a function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of duplicated code in this tool (different pulse-finding approaches are very similar; it's not the most elegant tool). It agree it would be worth while to re-organize this tool into more concise functions. Could we for now add it as a TODO? Rather than making this particular feature its own function, since much of the pulse-finding approaches use the same infrastructure it could instead be folded into a single function that could be called (more than just the trace extraction).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm inclined to think that the fact that this code is added as part of a new feature, even if this was to be refactored, it would be fair for it to be a separate function. Besides, even if the code could do with refactoring, putting unrelated improvements off as TODO is one thing, but making the situation worse is something esle. It shouldn't be a difficult change.


// Store the freshly made pulse in the vector of found pulses
pulses.emplace_back(channel_key,
( wmin * NS_PER_SAMPLE )-timing_offset,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm curious what this change is about? I can't spot a variable rename from NS_PER_SAMPLE to NS_PER_ADC_SAMPLE....

Copy link
Collaborator Author

@S81D S81D Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tool calls both (and both are defined the same way in constants):

constexpr unsigned int NS_PER_SAMPLE = 2; // ns
and
constexpr unsigned int NS_PER_ADC_SAMPLE = 2; // ns

I was trying to make it all consistent. It functionally doesn't change anything.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this pushes the question upstream - why are there two constants that appear to both represent the same thing. "The tool calls both" is not a good thing (and doesn't really explain why this instance in particular should be changed).
Can we do a find-and-replace and remove the redundancy?

@marc1uk
Copy link
Collaborator

marc1uk commented Oct 9, 2025

ok, so andrew's code is the correct way to update a class definition when adding in new variables, but of course causality dictates that old code (which doesn't check the version) assumes the old format and will encounter unexpected data during deserialisation. I don't think there's a way around this - but as Steven says, it's pretty unlikely that you'll be running old analysis code on new data. If you did want to do so, it shouldn't be too difficult to patch in the new variable to the old code.

An alternative would be to store the pulse waveform snippet separately, rather than as part of the ADCPulse class, but i expect that'll introduce more complexity than it's worth to maintain backwards compatibilty for edge case studies.

unsigned short raw_amplitude, double calibrated_amplitude,
double charge);
unsigned short raw_amplitude, double calibrated_amplitude, double charge,
const std::vector<double>& trace_x = std::vector<double>(),
Copy link
Collaborator

@marc1uk marc1uk Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very odd for the constructor to be taking references as arguments, and in general references should not be initialised to temporaries (I'm a little surprised the compiler let you do it). I suppose ultimately it may work, but I would make this non-references to avoid confusion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree it is odd / not standard. I haven't noticed any issues though and it seems to behave fine with the event building. We can flag it moving forward if you want.

Copy link
Collaborator

@marc1uk marc1uk Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's odd/non-standard, then it's probably something to not do. 😅
Assuming the point here is to avoid a second copy as part of constructing this object from optional existing data, i think the correct way to do it would be to accept a pointer that defaults to null, and to make the single copy within the constructor if the passed pointer is not null.

inline double amplitude() const { return calibrated_amplitude_; }

// @brief Returns the x [ns] and y [ADC] points of the "found" pulse (baseline-subtracted and relative to pulse start point)
inline const std::vector<double>& GetTraceXPoints() const { return trace_x_; }
Copy link
Collaborator

@marc1uk marc1uk Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functions should not return references to member variables.
If the ADCPulse object is destroyed (goes out of scope or gets deleted), a caller may be holding onto a dangling reference to a member of an object that no longer exists (i.e., a segfault waiting to happen). Just return by value and if the caller is smart, they can capture that into a reference.

Copy link
Collaborator Author

@S81D S81D Oct 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing this out. I agree that returning by value is safer. I have not observed any problems in my workflow but its likely because any instances I'm fetching it, its lifetime is short.

In practice I was having some troubles getting something to work, and this ended up working without slowing down the event building or giving any observed issues (both creating the ProcessedData and reading from it). One potential problem I have with changing it is the potential performance impact. This object is created and accessed millions of times per run (15k triggers per part file, 128 PMTs, many hits per event). A real concern is the potential slowdown if returning by value and copying the vector millions of times when churning through the (already) slow event building. I unfortunately don't have a ton of time to stress test this (but have stress tested this change ~thousands of times.

If I had to guess these traces will not be used often as analyzers are largely focused on the hits information. It was mostly added just in case anyone wants it down the line. I think for potential performance concerns (and limited time) we could keep it unless you feel strongly otherwise.

@marc1uk
Copy link
Collaborator

marc1uk commented Oct 22, 2025

ok in light of trying to get things merged I'll leave these as TODOS

@marc1uk marc1uk merged commit aea75aa into ANNIEsoft:Application Oct 22, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants