- Create a GitHub repo and clone it locally (add experiments)
- Add the src folder along with all components (run them individually)
- Add the data, models, and reports directories to the .gitignore file
- Now git add, commit, push
- Create dvc.yaml file and add stages to it.
- Run "dvc init", then do "dvc repro" to test the pipeline automation (check the stage graph with "dvc dag")
- Now git add, commit, push
- Add the params.yaml file
- Add the params setup (mentioned below)
- Do "dvc repro" again to test the pipeline along with the params
- Now git add, commit, push
- pip install dvclive
- Add the dvclive code block (mentioned below)
- Do "dvc exp run"; it will create a new dvc.yaml (if not already there) and a dvclive directory (DVC treats each run as an experiment)
- Do "dvc exp show" in the terminal to see the experiments, or use the DVC extension for VS Code (install it first)
- Do "dvc exp remove {exp-name}" to remove an experiment (optional) | "dvc exp apply {exp-name}" to restore a previous experiment in the workspace
- Change params, re-run code (produce new experiments)
- Now git add, commit, push
- Login to AWS console
- Create an IAM User
- Create S3 bucket
- To connect DVC to S3
pip install "dvc[s3]"
- To connect to AWS
pip install awscli
- Configure the IAM user's credentials for the project
aws configure
- Set the remote storage address (saved in .dvc/config)
dvc remote add -d dvcstore s3://bucketname
- dvc commit, push the experiment outcome that you want to keep. DVC tracks the output of each component in the pipeline, so the data that belongs to the pipeline is tracked.
dvc commit
dvc push
- Finally git add, commit, push
- To roll back to a previous code version, find its commit hash, check it out with "git checkout {commit-hash}", and then do
dvc pull
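The rollback step above can also be scripted. The following is a minimal sketch, assuming that git and dvc are on the PATH and that the remote is already configured; the function names rollback_commands and rollback are hypothetical helpers, not part of DVC.

```python
import subprocess
from typing import List


def rollback_commands(commit_hash: str) -> List[List[str]]:
    """Return the commands that restore code and data for a given commit."""
    return [
        # Restore the code together with dvc.yaml and dvc.lock.
        ["git", "checkout", commit_hash],
        # Download the data versions that the restored dvc.lock points to.
        ["dvc", "pull"],
    ]


def rollback(commit_hash: str) -> None:
    """Run each command in order and stop at the first failure."""
    for cmd in rollback_commands(commit_hash):
        subprocess.run(cmd, check=True)
```

Calling rollback with a real commit hash would run both commands; rollback_commands on its own only builds them, which is useful as a dry run.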
logging setup:
import logging
import os
# Ensure the "logs" directory exists
log_dir = 'logs'
os.makedirs(log_dir, exist_ok=True)
# logging configuration
logger = logging.getLogger('model_building')
logger.setLevel('DEBUG')
console_handler = logging.StreamHandler()
console_handler.setLevel('DEBUG')
log_file_path = os.path.join(log_dir, 'model_building.log')
file_handler = logging.FileHandler(log_file_path)
file_handler.setLevel('DEBUG')
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
console_handler.setFormatter(formatter)
file_handler.setFormatter(formatter)
logger.addHandler(console_handler)
logger.addHandler(file_handler)
logger.propagate = False
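Since each pipeline component needs this configuration under its own logger name, it could be wrapped in a helper. This is a minimal sketch, assuming a hypothetical function name get_logger (not in the original code); it mirrors the setup above and skips re-adding handlers if the logger was already configured.

```python
import logging
import os


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Build a DEBUG-level logger writing to the console and <log_dir>/<name>.log."""
    os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    # Avoid attaching duplicate handlers when the helper is called twice.
    if logger.handlers:
        return logger

    formatter = logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    )
    handlers = [
        logging.StreamHandler(),
        logging.FileHandler(os.path.join(log_dir, f"{name}.log")),
    ]
    for handler in handlers:
        handler.setLevel(logging.DEBUG)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A component would then start with logger = get_logger('model_building').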
params.yaml setup:
- import yaml
- add func:
def load_params(params_path: str) -> dict:
    """Load parameters from a YAML file."""
    try:
        with open(params_path, 'r') as file:
            params = yaml.safe_load(file)
        logger.debug('Parameters retrieved from %s', params_path)
        return params
    except FileNotFoundError:
        logger.error('File not found: %s', params_path)
        raise
    except yaml.YAMLError as e:
        logger.error('YAML error: %s', e)
        raise
    except Exception as e:
        logger.error('Unexpected error: %s', e)
        raise
- Add to main():
# data_ingestion
params = load_params(params_path='params.yaml')
test_size = params['data_ingestion']['test_size']
# feature_engineering
params = load_params(params_path='params.yaml')
max_features = params['feature_engineering']['max_features']
# model_building
params = load_params('params.yaml')['model_building']
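The lookups above imply a nested layout for params.yaml, with one top-level section per stage. The sketch below shows that layout as the dict load_params would return and checks that the keys read in main() are present. The values, the n_estimators key under model_building, and the helper name missing_params are illustrative assumptions, not taken from the project.

```python
# Shape of the dict that load_params('params.yaml') would return.
# All values below are placeholders chosen for illustration.
EXAMPLE_PARAMS = {
    "data_ingestion": {"test_size": 0.2},
    "feature_engineering": {"max_features": 50},
    "model_building": {"n_estimators": 25},  # hypothetical key
}

# Keys that the main() snippets read from each section.
REQUIRED_KEYS = {
    "data_ingestion": ["test_size"],
    "feature_engineering": ["max_features"],
    "model_building": [],  # the whole section is passed on as a dict
}


def missing_params(params: dict) -> list:
    """Return dotted names of required entries that are absent from params."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        if section not in params:
            missing.append(section)
            continue
        for key in keys:
            if key not in params[section]:
                missing.append(f"{section}.{key}")
    return missing
```

Running such a check right after load_params would make a typo in params.yaml fail early with a clear list of names instead of a KeyError inside a stage.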
- import dvclive and yaml:
from dvclive import Live
import yaml
- Add the load_params function and initialize the "params" variable in main
- Add below code block to main:
# y_pred: the model's predictions on the test set (variable name assumed)
with Live(save_dvc_exp=True) as live:
    live.log_metric('accuracy', accuracy_score(y_test, y_pred))
    live.log_metric('precision', precision_score(y_test, y_pred))
    live.log_metric('recall', recall_score(y_test, y_pred))
    live.log_params(params)
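To make the role of the two metric arguments explicit, here is a self-contained sketch of the same logging flow. It assumes binary 0/1 labels, computes the metrics in plain Python instead of scikit-learn so that it runs without extra packages, and uses hypothetical helper names (binary_metrics, log_run). The live argument is any object exposing log_metric and log_params, such as a dvclive Live instance.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def log_run(live, metrics, params):
    """Send every metric, then the parameters, to a Live-like tracker."""
    for name, value in metrics.items():
        live.log_metric(name, value)
    live.log_params(params)
```

With dvclive installed, this could be called inside the with Live(save_dvc_exp=True) as live: block as log_run(live, binary_metrics(y_test, y_pred), params), where y_pred (an assumed variable name) holds the model's predictions for the test set. Passing the true labels as both arguments would always report perfect scores.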


