Using main.py for computation of basic tags is very very slow for pitch computation #42

@kubicra

print("Compute pitch")
pitch_dataset = dataset.cast_column(audio_column_name, Audio(sampling_rate=16_000)).map(
    pitch_apply,
    batched=True,
    batch_size=args.batch_size,
    with_rank=True if torch.cuda.device_count()>0 else False,
    num_proc=torch.cuda.device_count()*args.num_workers_per_gpu_for_pitch if torch.cuda.device_count()>0 else args.cpu_num_workers,
    remove_columns=[audio_column_name], # tricks to avoid rewriting audio
    fn_kwargs={"audio_column_name": audio_column_name, "penn_batch_size": args.penn_batch_size},
)

This is the configuration I use to run the script (I modified the loading part, since I work with local data rather than Hugging Face):

# --penn_batch_size 4096 causes a CUDA OOM error
python main.py \
    --source "local" \
    --metadata_path "/mnt/personal/kubicra3/Czech_par/Czech_par_dataset/metadata_filtered.tsv" \
    --dataset_path "/mnt/personal/kubicra3/Czech_par/Czech_par_dataset" \
    --configuration "default" \
    --text_column_name "text" \
    --audio_column_name "path" \
    --output_dir '/mnt/personal/kubicra3/data-speech/Czech_par' \
    --cpu_num_workers 32 \
    --num_workers_per_gpu_for_squim 4 \
    --num_workers_per_gpu_for_pitch 4 \
    --num_workers_per_gpu_for_snr 4 \
    --rename_column \
    --repo_id "ParlaCZ-tts-tags" \
    --apply_squim_quality_estimation \
    --penn_batch_size 2048 \
    --batch_size 64
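
For reference, here is a minimal sketch of how the `num_proc` expression in the `map` call resolves with these flags. `resolve_num_proc` is a hypothetical helper name used only for illustration, not a function from main.py, and the GPU count is passed in explicitly instead of being read from `torch.cuda.device_count()`.

```python
def resolve_num_proc(gpu_count: int, workers_per_gpu: int, cpu_workers: int) -> int:
    """Mirror the num_proc expression from the map() call (illustrative only)."""
    # With at least one GPU, the process count is GPUs x workers per GPU;
    # the CPU worker count is only used when no GPU is visible.
    return gpu_count * workers_per_gpu if gpu_count > 0 else cpu_workers


# 2 GPUs with --num_workers_per_gpu_for_pitch 4
print(resolve_num_proc(2, 4, 32))
# No GPU visible: falls back to --cpu_num_workers 32
print(resolve_num_proc(0, 4, 32))
```

So on a GPU machine `--cpu_num_workers 32` does not affect the pitch step; only the per-GPU worker count does.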

The pitch computation takes ages. The dataset is fairly large (around 900 hours of recordings, each 0.5 to 30 s long), but I am running on 2 GPUs on my institution's cluster. I have tried different values of penn_batch_size and batch_size, but nothing seems to speed it up. At the current rate, this step alone will take about 30 hours. I know that processing this many recordings is heavy work, but it still seems odd that it takes this long. Is this normal or not?
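
To put the numbers in perspective, here is a rough back-of-the-envelope sketch of the throughput implied by the figures above. The 900 hours and 30 hours are approximate, and the count of 8 worker processes is an assumption (2 GPUs times 4 workers per GPU); `realtime_factor` is a hypothetical helper name.

```python
def realtime_factor(audio_hours: float, wall_hours: float) -> float:
    """Hours of audio processed per hour of wall-clock time."""
    return audio_hours / wall_hours


overall = realtime_factor(audio_hours=900, wall_hours=30)
# Assumed: 2 GPUs x 4 workers per GPU = 8 worker processes
per_worker = overall / 8

print(f"overall: {overall:.2f}x real time")
print(f"per worker: {per_worker:.2f}x real time")
```

Under these assumptions the step runs at roughly 30 times real time overall, or a little under 4 times real time per worker process.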

Thank you for your response.
