escalation of computation times with more than 250k observations #102

@francisvolh

Description

Hello! I have been trying to use CoordinateCleaner's clean_coordinates to clean a dataset of about 4 million observations, but it looks like it will take a very long time. I read in the reference publication for the package that you worked with 200k observations at a time for computational speed.
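
As context, here is a minimal sketch of the chunked approach I am considering (big_df stands in for my actual 4M-row data frame; I have not verified whether per-species tests such as the outlier test behave the same when a species is split across chunks):

library(CoordinateCleaner)

# Split the data into ~200k-row chunks, clean each chunk separately,
# then stack the resulting flag tables back together.
chunk_size <- 200000
chunks <- split(big_df, ceiling(seq_len(nrow(big_df)) / chunk_size))
flags <- do.call(rbind, lapply(chunks, clean_coordinates))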

What would be the average time you estimate it takes to clean 200k observations? I ran a simple test on my computer with a simulated dataset of 250 rows, then increased it to 2,500 and then 25,000, and the computation time went from roughly 60 seconds for the two smaller sizes to 1.8 hours for the largest. So I wonder whether this is a memory issue (16 GB of RAM on my local machine) or a hard-drive limitation.


library(CoordinateCleaner)

system.time({
  # Simulate example data
  minages <- runif(250000, 0, 65)
  exmpl <- data.frame(species = sample(letters, size = 250000, replace = TRUE),
                      decimalLongitude = runif(250000, min = 42, max = 51),
                      decimalLatitude = runif(250000, min = -26, max = -11),
                      min_ma = minages,
                      max_ma = minages + runif(250000, 0.1, 65),
                      dataset = "clean")

  # Run record-level tests with default settings
  rl <- clean_coordinates(x = exmpl)
})
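
For completeness, here is a sketch of how the three smaller sizes mentioned above can be timed in one run (same simulation as above, only the number of rows n varies):

sizes <- c(250, 2500, 25000)
times <- sapply(sizes, function(n) {
  minages <- runif(n, 0, 65)
  exmpl <- data.frame(species = sample(letters, size = n, replace = TRUE),
                      decimalLongitude = runif(n, min = 42, max = 51),
                      decimalLatitude = runif(n, min = -26, max = -11),
                      min_ma = minages,
                      max_ma = minages + runif(n, 0.1, 65),
                      dataset = "clean")
  # elapsed wall-clock seconds for a default clean_coordinates() run
  system.time(clean_coordinates(x = exmpl))["elapsed"]
})
rbind(n = sizes, seconds = times)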

I wonder how the computation time of clean_coordinates scales with the number of observations, and whether specifying different parameters moves these times up or down.
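
For example, restricting the tests argument is what I had in mind (a sketch; I am assuming from ?clean_coordinates that the per-species outlier test is the expensive one at scale, since it compares records within each species, but I have not confirmed which test actually dominates):

# Same simulated exmpl as above, but skipping the outlier test
system.time(
  rl_subset <- clean_coordinates(x = exmpl,
                                 tests = c("capitals", "centroids", "equal",
                                           "gbif", "institutions", "zeros"))
)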

Thanks.
PS: I can provide additional information if required.
