Skip to content

calculate_distance() possible improvement... #9

@coforfe

Description

@coforfe

Hi Przemek,

For data.frames with more than 200k lines, there is an important opportunity to improve the speed in the calculation of this function, which is the core of calculate_covariate_fit() function.

Instead of using rank() here:

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }

It would improve a lot if you use frank() from data.table package.

calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
   after_cuts <- cut(frank(c(variable_old, variable_new)), bins)
  }

Well, after that calculation there is another calculation based on table() that also can be improved significantly by using a data.table calculation. If you accept to add data.tabledependency in shifter I will open a PR with these calculations. with these changes, I was able to pass from some minutes calculations in my comparison data.frame to barely 30 secs.

Thanks,
Carlos.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions