In this work, we fine-tune a DistilBERT model on a drilling logs dataset from ONGC. The goal is to automate event extraction. The input variable is "Service hour type", a text attribute. The output, i.e. the variable to be predicted, is "Cde" (the code value), which is alphanumeric. The pipeline consists of the following steps:
- Converting the given dataset from Excel format to a Python DataFrame object.
- Dropping all null values.
- Preprocessing the text of the input column, i.e., "Service hour type".
- Discarding very rare instances of the input, because keeping them in the training set would lead to a class imbalance issue (the threshold for the number of instances is set to 10).
- Label encoding the output/target features.
- Splitting the dataset into train and test sets in a 7:3 ratio, using stratified splitting.
- Tokenizing the train and test texts to obtain encodings.
- Forming PyTorch datasets.
- Training the model for 30 epochs.
- Evaluating the training loss & accuracy.
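
The data preparation steps above (dropping nulls, preprocessing, discarding rare instances, label encoding, stratified 7:3 splitting) can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used in this work: the function names, the lowercasing/whitespace normalisation, the sample codes such as `D1`, and the choice to count rarity per target code (keeping codes with at least 10 instances) are all assumptions. Libraries such as pandas and scikit-learn provide equivalents, for example `train_test_split` with its `stratify` argument.

```python
import random
from collections import Counter, defaultdict

MIN_INSTANCES = 10   # threshold from the text; ">=" is an assumption
TEST_FRACTION = 0.3  # 7:3 train/test split


def preprocess(text):
    """Hypothetical normalisation: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def prepare(rows, seed=0):
    """rows: list of (service_hour_type, code) pairs; values may be None."""
    # Drop rows that contain a null value.
    rows = [(t, c) for t, c in rows if t is not None and c is not None]
    # Preprocess the input text.
    rows = [(preprocess(t), c) for t, c in rows]
    # Discard rare instances (assumption: rarity is counted per target code).
    counts = Counter(c for _, c in rows)
    rows = [(t, c) for t, c in rows if counts[c] >= MIN_INSTANCES]
    # Label-encode the target: each remaining code maps to an integer id.
    label_to_id = {c: i for i, c in enumerate(sorted({c for _, c in rows}))}
    # Stratified split: split each class separately so that both sets
    # keep roughly the same class proportions.
    by_label = defaultdict(list)
    for t, c in rows:
        by_label[label_to_id[c]].append(t)
    rng = random.Random(seed)
    train, test = [], []
    for label, texts in sorted(by_label.items()):
        rng.shuffle(texts)
        n_test = round(len(texts) * TEST_FRACTION)
        test += [(t, label) for t in texts[:n_test]]
        train += [(t, label) for t in texts[n_test:]]
    return train, test, label_to_id


# Made-up sample rows; the texts and codes are purely illustrative.
sample_rows = (
    [("Drilling ahead", "D1")] * 20
    + [("Tripping   OUT", "T2")] * 10
    + [("Rare event", "X9")] * 3      # below the threshold, gets discarded
    + [(None, "D1")]                  # null input, gets dropped
)
train_set, test_set, label_map = prepare(sample_rows)
```

The resulting `train_set` and `test_set` hold `(text, label_id)` pairs that would then be tokenized and wrapped in PyTorch datasets, as in the remaining steps.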