
Conversation

@dbczumar
Owner

No description provided.


@sueann left a comment


A few comments for you inline. Just to be clear, this is WIP (e.g. _get_assumed_role_arn is not yet used) with https://dbc-c0ef404d-7f59.cloud.databricks.com/?o=2527845275686535#notebook/4455645117498241/command/4455645117498246 as the main testing mechanism (not done yet), right?


-def deploy(app_name, model_path, execution_role_arn, bucket, run_id=None,
-           image="mlflow_sage", region_name="us-west-2"):
+def deploy(app_name, model_path, bucket=None, image_url=DEFAULT_IMAGE_URL, run_id=None,

not sure where we stand on breaking API changes for MLflow currently - would leave it as close to the original as possible but can ask in the PR against master


@dbczumar (Owner, Author)

Agreed. @tomasatdatabricks - what are your thoughts?


Hmm, I think it's early enough to get away with breaking API changes, although I would double-check with Matei on that. I agree with @sueann that we should also update the CLI. Not sure whether we want to pass the container URL or stick to using the name in the CLI.

We should also think through what happens if someone wants to build their own custom image and use it instead of the default one. E.g., should build-container return the URL now?


  1. we should change the argument ordering back to what it was
  2. decide whether to change the image name to a URL or not (what this allows is using another account's ECR repo, which might be useful - e.g. SageMaker has published TensorFlow images you can manually use from their account)
  3. see if we can infer the region name automatically (at least on Databricks) - I don't think you can have the cluster in one region and do all the operations in another?? (to be checked); see the sketch below
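For point 3, a minimal sketch of region inference, assuming boto3 can resolve the region from the runtime environment; _get_default_region is a hypothetical helper name, not code from the PR:

import boto3

def _get_default_region():
    # boto3 resolves the region from AWS_DEFAULT_REGION or ~/.aws/config;
    # fall back to the previous hard-coded default if nothing is set.
    return boto3.session.Session().region_name or "us-west-2"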


-DEFAULT_IMAGE_NAME = "mlflow_sage"
+PYFUNC_IMAGE_URL = "707343435239.dkr.ecr.us-west-2.amazonaws.com/mlflow-pyfunc-test:latest"

note: we want to change this to the databricks prod account when submitting the PR to mlflow master


I think we should tag the image with the current mlflow version instead of latest.


just to clarify, so for master, we'd make it the version of the next release, right? is that 0.2.2 now? @tomasatdatabricks


I think it would make the most sense for this to point to the current version of MLflow - i.e. mlflow.version.VERSION. That way we do not have to manage updating versions in multiple places.
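A minimal sketch of that versioned tag; ECR_REPOSITORY_URL is a hypothetical constant name, not necessarily what the PR uses:

import mlflow.version

# Hypothetical repository constant; would point at the Databricks prod
# account before the PR against mlflow master.
ECR_REPOSITORY_URL = "707343435239.dkr.ecr.us-west-2.amazonaws.com/mlflow-pyfunc"

# Tagging with the running library's version keeps the container and the
# client in lockstep, with no second place to bump on release.
DEFAULT_IMAGE_URL = "{repo}:{version}".format(
    repo=ECR_REPOSITORY_URL, version=mlflow.version.VERSION)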

def _get_default_s3_bucket(region_name):
# create bucket if it does not exist
account_id = _get_account_id()
bucket_name = "{pfx}-{aid}".format(pfx=DEFAULT_BUCKET_NAME_PREFIX, aid=account_id)

btw, is the idea here that multiple accounts could access the same s3 bucket and so we want to distinguish them by account name? what bad things would happen if we just let the bucket name be the default name? what kind of naming convention does sagemaker itself use (when it creates the bucket for you)?

@dbczumar (Owner, Author)

S3 bucket names are global, so we're forced to create a unique bucket name. Come to think of it, this naming scheme doesn't handle the case where that bucket name is taken by a different account, which would be unfortunate.

@dbczumar (Owner, Author)

I'm not familiar with the sagemaker default bucket naming scheme. That should be explored.

@dbczumar (Owner, Author)

From the sagemaker docs: When you make an Amazon SageMaker API call that accesses an S3 bucket location and one is not specified, the Session creates a default bucket based on a naming convention which includes the current AWS account ID.
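A sketch of a bucket-name helper along those lines; the helper name and prefix are illustrative, and SageMaker's own Session reportedly uses the convention "sagemaker-{region}-{account_id}":

import boto3

def _get_default_s3_bucket_name(region_name):
    # Folding the account ID into the (globally unique) bucket name keeps
    # it from colliding with other accounts' buckets in the common case.
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    return "mlflow-sagemaker-{region}-{account}".format(
        region=region_name, account=account_id)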

@sueann (Jun 28, 2018)

ah, we used to have some sagemaker-auto-generated S3 buckets visible in the SageMaker Models UI (e.g. for your very first tensorflow one), which showed the naming convention. can't quite find them right now...

@tomasatdatabricks left a comment

Looks good overall, left a few comments.



-    prefix = run_id + "/" + prefix
-run_id = _check_compatible(model_path)
+    prefix = os.path.join(run_id, prefix)
+    run_id = _check_compatible(model_path)


Wrong indent here. The last line is not supposed to be part of the if block.
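The intended control flow, reconstructed as a sketch from the surrounding diff context (the if run_id: guard is an assumption; the helpers are the PR's own):

if run_id:
    model_path = _get_model_log_dir(model_path, run_id)
    prefix = os.path.join(run_id, prefix)
# Must run whether or not a run_id was supplied.
run_id = _check_compatible(model_path)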

@dbczumar (Owner, Author)

Fixed



 eprint('tag response', response)
-return '{}/{}/{}'.format(s3.meta.endpoint_url, bucket, key)
+return os.path.join(s3.meta.endpoint_url, bucket, key)


This is supposed to return a URL, so maybe don't use os.path.join (it joins with the OS path separator, which is a backslash on Windows).

@dbczumar (Owner, Author)

Got it. Replaced it with slightly-tweaked slash concatenation. Wanted to make sure that this would work if the arguments contained trailing slashes.
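One way to phrase that tweaked slash concatenation (a sketch; _make_url is a hypothetical name, not necessarily the code in the PR):

def _make_url(*parts):
    # Join with single slashes even when pieces carry their own
    # leading/trailing slashes, e.g. an endpoint URL ending in "/".
    return "/".join(str(part).strip("/") for part in parts)

# _make_url("https://s3.us-west-2.amazonaws.com/", "my-bucket", "/models/")
# -> "https://s3.us-west-2.amazonaws.com/my-bucket/models"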

 model_path = _get_model_log_dir(model_path, run_id)
 prefix = os.path.join(run_id, prefix)
-    run_id = _check_compatible(model_path)
+run_id = _check_compatible(model_path)

this changes the behavior? was it incorrect before?



ah sorry, I was viewing a subset of changes - this was changing it back to the original logic

print("activating custom environment")
env = conf[pyfunc.ENV]
env_path_dst = os.path.join("/opt/mlflow/", env)
env_path_dst_dir = os.path.dirname(env_path_dst)

@tomasatdatabricks do you know why this might have been added?

@dbczumar (Owner, Author)

Because we were attempting to copy files to a destination directory that may not exist; we ran into an error during copying because of this. There may be an easier way to force directory creation by adding a flag to the copy command.
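A sketch of the directory-creation approach (hypothetical helper name; exist_ok requires Python 3.2+):

import os
import shutil

def _copy_creating_parents(src, dst):
    # shutil.copy fails when the destination directory is missing,
    # so create the parent tree first.
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy(src, dst)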


Yes, it looks reasonable to me. Btw, I fixed this in my pending PR too, so we might need to merge at some point.

dbczumar added a commit that referenced this pull request Aug 17, 2018
* Preliminary Java support

* Renames

* Package restructuring

* Add mleap python flavor - preliminary

* Implement javafunc with correct loader module

* Input parsing

* Add missing files

* Failed attempt at package installation

* Fix up naming semantics, loader module impl

* Server impl

* Fixes - serving works

* Benchmarking results

* Benchmarking script

* SparkML critical path lats

* add batching to bmark

* Remove runID from SageMakerServer

* ..

* Add method that allows deserialization of MLeap pipeline in native format

* Move JavaFunc into protected sagemaker package

* Modify pyfunc container to handle mleap

* Python API changes progress

* Use package protection to hide modules that we don't yet wish to expose to users

* Fixes

* Container mods

* Remove mlflow-java directory

* Serving is functional. Add sparkml module

* Exception message fixes

* Throw unsupportedop from attempting to load a model from runId

* Remove nonfunctional PackageInstaller

* Package restructuring, remove unnecessary lines

* Improved error handling

* Doc tweaks

* JavaFunc error coherency improvements

* Remove unused main method

* Fix more merge conflicts

* Remove invalid comments

* Attempt at MLeap serialization parameter validation

* Debugging

* Exception tweaks

* Add javafunc experimental comments

* Remove javafunc line

* Pom fixes

* More pom fixes, make container use jar, correct version

* Construct leapframe from pandas

* Add missing files

* Docs fixes

* Comment fix

* Docs fix

* pom fixes

* New tests

* Test improvements

* Fix print statement omission

* Docker build, spark lib fixes

* Renaming, new yaml/json utilities, prediction column selection from MLeap

* Container emits json-serialized predictions column

* lambda syntax fix

* MLeap predictor return json

* Merge docker tests

* Use

* Add test for correct handling of invalid inputs

* helper function documentation

* Pom changes, docs

* Revert MLeap, sagemaker container

* Remove leapframe, sagemaker java infra

* Revert tests

* Revert helper functions for tests

* Fix spacing

* revert setup py

* Remove commented pom

* Remove javafunc

* Remove py

* Add style checker to pom with correct configuration

* Docs

* Add tests - FOR REF LATER

* Remove tests

* Remove unused tracking utils

* Revert "Remove tests"

This reverts commit c88d6b2.

* Remove pipeline deserialization test

* Serialization utils docs

* Add unit tests

* Error message fix

* Revert py changes

* Revert py

* Remove unnecessary test resources

* Doc link fixes

* Change package name in java file

* Change package name in directory structure

* Pom updates

* Print stack trace on modeltest fail

* Remove trailing  from test case names

* Rename tests to comply with style guide, add class argument to serialization  utility

* Format code

* Format tests

* Rename  to

* Update travis yml with Java tests

* Remove superfluous files

* Test fix

* Remove swn files

* Remove swm
dbczumar pushed a commit that referenced this pull request Oct 3, 2018
* begin R service API refactoring

* fix tracking uri getting

* remove old mlflow_create_experiment()

* reorg files

* local server registration

* experiment adding

* mlflow client server url attribute

* delete/restore experiments

* get run

* log metric

* fix log metric docs

* set tag

* log param

* get param

* get metric

* get metric history

* list and get experiment

* set terminated

* reorg files

* refactor mlflow_log_artifact()

* reorg files

* remove unused mlflow_ensure_run_id

* refactor mlflow_end_run()

* delete old cold

* fix get experiment

* options for view_type for list experiments

* revise tests

* fix mlflow_get_run dispatch

* deregister local servers in tests

* fix tests

* pass client to rest calls

* more fixes

* get or start run for all fluent

* actually refresh state for tests

* remove mlflow_connect()

* rename mlflow_tracking_uri to mlflow_get_tracking_uri

* remove mlflow_disconnect()

* remove active connection code

* rename mlflow_connection class to mlflow_server

* remove unused mlflow_connection_url

* remove mentions of connection

* cosmetics

* typo

* remove dead code

* with.mlflow_active_run condition handling

* error message when client isn't specified for tracking service API

* active experiment id

* setting tracking uri shouldn't change other globals

* refactor entities run

* test server -> validate server

* reorg files

* reorg test files

* remove run_uuid parameter from mlflow_log_model

* clean up create/list experiment rest wrapper

* remove unnecessary assignments in rest wrappers

* set experiment id independent of run

* consistent naming getters and setters

* some r cmd check fixes

* fix r cmd check

* rename mlflow_get_active_run -> mlflow_active_run and make private mlflow_set_active_run

* unexport experiment id getting/setting

* add mlflow_get_experiment_by_name() and mlflow_set_experiment()

* typo

* with.mlflow_run

* refactor create experiment

* refactor list experiments

* refactor get run

* refactor create run

* refactor get experiment

* refactor get experiment by name

* refactor delete experiment

* refactor restore experiment

* log metric

* refactor client api

* fluent api refactor

* reorg files

* tracking client docs

* fluent docs

* rbuildignore internal

* typos

* default end_time of mlflow_client_set_terminated to current time

* fix mlflow_end_run()

* default user

* document tracking client api

* shorten function name

* add mlflow_client_list_artifacts()

* ignore noisy lintr warnings (#1)
@dbczumar mentioned this pull request May 19, 2020
dbczumar added a commit that referenced this pull request Aug 25, 2020
dbczumar added a commit that referenced this pull request Aug 27, 2020
dbczumar pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>
dbczumar pushed a commit that referenced this pull request Nov 3, 2025
Signed-off-by: B-Step62 <yuki.watanabe@databricks.com>