print("Compute pitch")
pitch_dataset = dataset.cast_column(audio_column_name, Audio(sampling_rate=16_000)).map(
    pitch_apply,
    batched=True,
    batch_size=args.batch_size,
    with_rank=torch.cuda.device_count() > 0,
    num_proc=torch.cuda.device_count() * args.num_workers_per_gpu_for_pitch if torch.cuda.device_count() > 0 else args.cpu_num_workers,
    remove_columns=[audio_column_name],  # trick to avoid rewriting the audio
    fn_kwargs={"audio_column_name": audio_column_name, "penn_batch_size": args.penn_batch_size},
)
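If I understand correctly, with_rank=True makes map pass a rank keyword argument to pitch_apply, so each worker process can pin itself to one of the available GPUs. A minimal sketch of that device-assignment pattern as I read it (the function signature and defaults here are my assumption, not the actual dataspeech implementation):

import torch

def pitch_apply(batch, rank=None, audio_column_name="audio", penn_batch_size=4096):
    # With map(..., with_rank=True), `rank` is this worker's index in
    # [0, num_proc). Round-robin over the visible GPUs, so that
    # num_workers_per_gpu_for_pitch workers end up sharing each device.
    if rank is not None and torch.cuda.device_count() > 0:
        device = f"cuda:{rank % torch.cuda.device_count()}"
    else:
        device = "cpu"
    # ... move the pitch model to `device` and run it over the batch
    # in chunks of `penn_batch_size` frames ...
    return batch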
This is my configuration for running the script (I modified the loading part of it, since I work with local data rather than the Hugging Face Hub):
python main.py \
    --source "local" \
    --metadata_path "/mnt/personal/kubicra3/Czech_par/Czech_par_dataset/metadata_filtered.tsv" \
    --dataset_path "/mnt/personal/kubicra3/Czech_par/Czech_par_dataset" \
    --configuration "default" \
    --text_column_name "text" \
    --audio_column_name "path" \
    --output_dir "/mnt/personal/kubicra3/data-speech/Czech_par" \
    --cpu_num_workers 32 \
    --num_workers_per_gpu_for_squim 4 \
    --num_workers_per_gpu_for_pitch 4 \
    --num_workers_per_gpu_for_snr 4 \
    --rename_column \
    --repo_id "ParlaCZ-tts-tags" \
    --apply_squim_quality_estimation \
    --penn_batch_size 2048 \
    --batch_size 64
# Note: --penn_batch_size 4096 causes a CUDA OOM error.
The pitch computation part takes ages. I am computing it for a fairly large dataset (around 900 hours of recordings, each 0.5-30 s long), on 2 GPUs on my institution's cluster. I have been playing with penn_batch_size and batch_size, but nothing seems to speed it up. It looks like it will take about 30 hours to compute everything in this part. I know that processing that many audio recordings is heavy work, but it still seems oddly slow to me. Is this normal or not?
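For reference, this is roughly how I arrived at the 30-hour estimate: timing the same map call on a small slice and extrapolating (a rough sketch; it assumes dataset, args, and pitch_apply from the script above are in scope, and the 1000-row subset size is arbitrary):

import time

# Time the pitch map on a small slice and extrapolate to the full dataset.
subset = dataset.select(range(1_000)).cast_column(audio_column_name, Audio(sampling_rate=16_000))
start = time.time()
subset.map(
    pitch_apply,
    batched=True,
    batch_size=args.batch_size,
    with_rank=True,
    num_proc=torch.cuda.device_count() * args.num_workers_per_gpu_for_pitch,
    remove_columns=[audio_column_name],
    fn_kwargs={"audio_column_name": audio_column_name, "penn_batch_size": args.penn_batch_size},
    load_from_cache_file=False,  # force recomputation so the timing is real
)
elapsed = time.time() - start
print(f"{elapsed:.1f} s for 1000 rows -> ~{elapsed * len(dataset) / 1_000 / 3600:.1f} h total")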
Thank you for any response!