
Commit 2b8abde

Merge pull request #431 from kbase/develop
D->M Fix Update Timestamp Bug
2 parents 5f6dea6 + 18408b2 commit 2b8abde

8 files changed

Lines changed: 301 additions & 36 deletions


RELEASE_NOTES.md

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 # execution_engine2 (ee2) release notes
 =========================================

+## 0.0.7
+* Fixed a bug that could cause missing `queued` timestamps if many jobs were submitted in a
+  batch
+
 ## 0.0.6
 * Release of MVP


docs/adrs/004-SplitAndAggregate.md

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
# Replace KBParallels with another solution to avoid Deadlocks

Date: 2021-09-22

[Related ADR](https://github.com/kbase/execution_engine2/blob/develop/docs/adrs/rework-batch-analysis-architecture.md)

## Note
This ADR is more a place to record the discussions we had at https://docs.google.com/document/d/1AWjayMoqCoGkpO9-tjXxEvO40yYnFtcECbdne5vTURo than to make a decision. There is still more planning, scoping and testing involved before we can fully design this system.

Still to be determined (not in scope of this ADR):
* UI and how it relates to the bulk execution
* XSV Analysis and how it relates to the bulk execution


## Intro
Sometimes a calculation requires too many resources from one node (walltime, memory, disk), so the calculation is spread across multiple machines.
The final step of an app that uses KBParallel is to create a report. This step may use results from all of the computed jobs to create the final report.
In order to do this, the following apps use a mechanism called KBParallel:
* kb_Bowtie2
* refseq_importer
* kb_concoct
* kb_phylogenomics
* kb_hisat2
* kb_meta_decoder
* kb_Bwa

## The current implementation of Batch Analysis in [kb_BatchApp](https://github.com/kbaseapps/kb_BatchApp) at KBase has the following issues:

* The current UI is not adequate: users shouldn't have to code in order to run batch analysis. It is also difficult to do so, even for those familiar with KBase code (they have to find object names).
* Dependency on [KBParallel](https://github.com/kbaseapps/KBParallel): any changes to KBParallel could affect kb_BatchApp and subsequently all other apps.
* Queue deadlocking: users have a maximum of 10 slots in the queue. With the current implementation, one management job is created to manage the jobs that it submits. This can lead to deadlock scenarios: there can be 10 management jobs waiting to submit computation jobs, but they cannot, as all slots are already used up.
* KBP can spawn other KBP jobs. Batch jobs can spawn other batch jobs.
* Missing the ability to run, manage (cancel) and track jobs and their subjobs, along with the ability to specify resources differently between the main and sub jobs
* No good way to test, and hard to benchmark or measure performance
* Code is split more than is necessary
* The UI doesn't properly display the progress of batch jobs
## Author(s)

@bio-boris, @mrcreosote

## Status
Needs more planning, but the current ideas are documented here


## Decision Outcome (pending more research to iron out more details)

For the first pass, we would likely limit the number of KBParallel runs.

For the next pass, we would want to create a comprehensive, generalized solution to submit, split and aggregate, with recipes or conveniences for common operations such as creating sets, reports, or things of that nature.

We would also want to do a user study on what we want from the UI and which functionality we want, as the UI may inform the design of the backend system.


### Deprecate KBP and instead break out apps into 3 parts

* Fan out (FO)
* Process in parallel (PIP)
* Fan in (FI)

### Steps:
1. User launches job as normal
2. Possibly the job is marked as a FO job. This makes it easier for the UI to display the job correctly initially. Ideally it would be marked in the spec, but this might be a lot of work; it could potentially be marked in the catalog UI (e.g. along with the job requirements)
3. Job figures out what the PIP / sub jobs should be
4. Job sends the following info to EE2:
    * Its own job ID
    * The parameters for each of the sub jobs
    * The app of the FI job, e.g. kb_phylogenomics/build_microbial_speciestree_reduce
    * EE2 starts the subjobs and associates them with the FO job (probably needs retry handling for the subjobs to deal with transient errors)
5. Whenever a subjob finishes, EE2 checks to see if all the subjobs are finished (a sketch of this check follows the list)
    * If true, EE2 starts the FI job, providing the outputs of the subjobs as a list to the reduce job
    * When the FI job is finished, the job is done.
    * The various jobs can communicate by storing temporary data in the caching service or in the Blobstore. If the latter is used, the FI job should clean up the Blobstore nodes when it's complete.
    * Could we make a helper app for this?
    * What about workflow engines (WDL, Cromwell)? Are we reinventing the wheel here?
    * Can new EE2 endpoints speed up or reduce the complexity of any of these steps?

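To make the EE2-side bookkeeping in steps 4 and 5 concrete, below is a minimal, hypothetical sketch. None of these names (`FanOutRegistry`, `register_fan_out`, `on_subjob_finished`) are existing EE2 code; they only illustrate the state EE2 would need to track and the check it would run whenever a subjob finishes.

```
from typing import Dict, List

class FanOutRegistry:
    """Tracks which subjobs belong to a fan out (FO) job and which FI app to run."""

    def __init__(self):
        # fo_job_id -> {"subjobs": {subjob_id: output or None}, "fi_app": app name}
        self._fan_outs: Dict[str, Dict] = {}

    def register_fan_out(self, fo_job_id: str, subjob_ids: List[str], fi_app: str):
        self._fan_outs[fo_job_id] = {
            "subjobs": {sub_id: None for sub_id in subjob_ids},
            "fi_app": fi_app,
        }

    def on_subjob_finished(self, fo_job_id: str, subjob_id: str, output, run_job):
        """Record a subjob result; start the fan in (FI) job once all subjobs are done."""
        fan_out = self._fan_outs[fo_job_id]
        fan_out["subjobs"][subjob_id] = output
        if all(result is not None for result in fan_out["subjobs"].values()):
            # All subjobs finished: launch the FI (reduce/report) job with their outputs
            run_job(app=fan_out["fi_app"], params=list(fan_out["subjobs"].values()))
```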
### Notes about DAG in ee2 Endpoints
```
Your dag would need to have (at least) a first job followed by a SUBDAG EXTERNAL.
Somewhere in the first job you'd generate a new dag workflow that
defines the N clusters followed by the N+1 job, which runs in the
subdag.

As for DAGMan support in the Python bindings, we do this in the
following two ways:

1) There is a htcondor.Submit.from_dag() option which takes the name
of a dag filename. You then submit the resulting object just like any
regular job.
2) We have a htcondor.dags library which can be used to
programmatically construct a DAG workflow in computer memory, then
write to a .dag file and submit using the function mentioned in 1)
above.
```

Between these there are several different ways to do what you want.

There's a useful example here that shows the general workflow in the
bindings: https://htcondor.readthedocs.io/en/latest/apis/python-bindings/tutorials/DAG-Creation-And-Submission.html#Describing-the-DAG-using-htcondor.dags

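As a point of reference, here is a condensed sketch of a fan out / fan in DAG built with the `htcondor.dags` approach described above, loosely following the linked tutorial. The executable names, file names and the count of 10 subjobs are placeholders, and the exact submission call may differ between HTCondor versions.

```
from pathlib import Path

import htcondor
from htcondor import dags

dag = dags.DAG()

# Fan out layer: N independent compute (PIP) jobs
fan_out = dag.layer(
    name="pip",
    submit_description=htcondor.Submit(
        {
            "executable": "run_subjob.sh",  # placeholder
            "arguments": "$(subjob_index)",
            "output": "pip-$(subjob_index).out",
            "error": "pip-$(subjob_index).err",
            "log": "pip.log",
        }
    ),
    vars=[{"subjob_index": str(i)} for i in range(10)],
)

# Fan in layer: runs only after every fan out node has completed
fan_in = fan_out.child_layer(
    name="fan_in",
    submit_description=htcondor.Submit(
        {
            "executable": "make_report.sh",  # placeholder
            "output": "fan_in.out",
            "error": "fan_in.err",
            "log": "fan_in.log",
        }
    ),
)

# Write the .dag file and submit it like a regular job
# (the submission call may vary by HTCondor/bindings version)
dag_file = dags.write_dag(dag, Path("fan_out_fan_in_dag"))
dag_submit = htcondor.Submit.from_dag(str(dag_file), {"force": 1})
htcondor.Schedd().submit(dag_submit)
```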
## Consequences

* We will have to implement a new narrative UI; however, this is work that would happen regardless, as we are looking to improve the UX for batch upload and analysis at KBase.
* This will take significant time to further research and engineer the solutions

Still to be determined (not in scope of this ADR):
* UI and how it relates to the bulk execution
* XSV Analysis and how it relates to the bulk execution

## Alternatives Considered

* Ignore most issues and just limit apps that run KBParallel to N instances of KBParallel per user to avoid deadlocks
* Remove KBParallel and change apps to a collection of 2-3 apps that do submit, split and aggregate, and use an ee2 endpoint to create a DAG
* Different DevOps solutions
* Rewriting KBP or swapping it out for a lightweight alternative that has a subset of the KBP features

## Pros and Cons of the Alternatives

### General Notes
* With the current implementation of KBP, having a separate KBP queue with multiple machines can save a spot from a user's 10 job maximum for running more jobs, but it takes up / wastes compute resources (especially if the nodes sit idle). The user still gets 10 jobs, but there are fewer spots for jobs to run overall in the system if we make another queue, as this requires taking more compute nodes that are currently dedicated to the NJS queue.
* Without changing apps that use KBP, multiple KBP apps running on the same machine can interfere with each other, and we want to avoid this.
* If we scrap KBP in favor of a "lightweight alternative" we can avoid some of the previous issues, provided we modify all apps that use KBP to use the lightweight alternative. A lightweight alternative would have to guarantee that no computation besides job management occurred; then we could have the management jobs sit and wait for other jobs without interfering with other jobs on the system.

### Increase number of slots per user > 10
* `+` Simple solution, quick turnaround, fixes the deadlock issue for small numbers of jobs.
* `-` Doesn't fix the deadlock issue, as the user can still submit more KBP jobs
* `-` Addresses only the deadlocking issue; the UI is still broken for regular runs and batch runs
* `-` A small number of users can take over the entire system by being able to submit more than 10 jobs
* `-` More than 10 nodes will continue to be taken up by jobs that do little computation, as each job gets its own node
* `-` Capacity is still wasted, as some KBP jobs sit around waiting for other jobs to run

### LIMIT KBP jobs to a maximum of N<10 active KBP jobs per user
* `+` Simple solution: requires ee2 to maintain a list of KBP apps and add a KBP_LIMIT to jobs from this list. [Condor](https://github.com/kbase/condor/pull/26) will need KBP_LIMIT added (a sketch of how such a limit could be attached to jobs follows this list)
* `+` The list of apps is not frequently updated
* `+` Apps do not need to be modified
* `-` If a new app uses KBP and the app is not on the list, it won't be limited by the KBP_LIMIT unless the owner lets us know.
* `-` If an existing app no longer uses KBP, the app is still limited unless the owner lets us know.
* `-` Nodes will continue to be taken up by jobs that do little computation, as each job gets its own node.
* `-` Users may not be able to effectively use up their 10 job spots
* `-` Capacity is still wasted, as some KBP jobs sit around waiting for other jobs to run

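A minimal sketch of how ee2 could attach such a limit when building submit descriptions with the htcondor bindings. The limit name `kbp.<username>`, the app list and the helper function are all illustrative rather than existing ee2 code; the matching per-limit cap (e.g. `CONCURRENCY_LIMIT_DEFAULT_KBP = 5`) would live in the condor configuration referenced in the PR above.

```
import htcondor

# Illustrative only: maintained list of apps known to use KBParallel
KBP_APPS = {"kb_Bowtie2", "kb_hisat2", "kb_Bwa"}

def add_kbp_limit(submit_attrs: dict, app_name: str, username: str) -> dict:
    """If the app's module is on the KBP list, cap concurrent runs per user via
    an HTCondor concurrency limit (requires a matching negotiator config such as
    CONCURRENCY_LIMIT_DEFAULT_KBP = 5)."""
    if app_name.split("/")[0] in KBP_APPS:
        submit_attrs["concurrency_limits"] = f"kbp.{username}"
    return submit_attrs

# e.g. a KBP app submitted for user "alice" gets concurrency_limits = kbp.alice
sub = htcondor.Submit(add_kbp_limit({"executable": "runner.sh"}, "kb_Bowtie2/align_reads", "alice"))
```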
### LIMIT KBP jobs to a maximum of N<10 active jobs per user + Separate queue for KBParallel apps
* `+` Same pros as above
* `+` Users will be able to more effectively use their 10 job spots
* `+` Allows us to group KBP jobs onto fewer machines, instead of giving each its own node
* `-` Requires going through each app and understanding the worst-case computational needs in order to set the estimated CPU and memory needs for each app
* `-` Apps can interfere with other innocent apps and take them down
* `-` Creating a new queue requires balancing how many nodes are active for KBP vs how many nodes are available for other NJS jobs.
* `-` Capacity is still wasted, as some KBP jobs sit around waiting for other jobs to run


### Build KBP Lightweight Version + KBP Queue


#### Design of new version


* All apps must be modified to use the new KBP lightweight version; we can either modify KBP or create a new tool/package to use instead of KBP. The new version will:


1) Launch a management job called the *Job Manager* that sits in the KBP Queue, alongside other KBP jobs. Other jobs are launched in the NJS queue.
2) Launch the *Setup Job*, which will
    * Use the *User Parameters* and/or
    * Optionally download the data from the *User Parameters* to figure out the *Job Manager* parameters
    * Use the results of information gathered from the initial download and/or the *User Parameters*
    * Generate final parameters to be sent to the *Job Manager* to launch *Fan Out* jobs, or directly launch *Fan Out* jobs and return job ids
3) The *Job Manager* now has enough parameters to launch and/or monitor *Fan Out* jobs, and to monitor/manage the jobs (and possibly retry them upon failure)
4) *Fan Out* jobs download data, perform calculations, save the results back to the system, and return references to the saved objects
5) The *Job Manager* launches one *FanIn* job based on the User Parameters and/or the results of the *Fan Out* jobs
6) The *FanIn* (a.k.a. Group/Reduce/Report) job downloads objects from the system, creates a set or other grouping, and then saves the object(s) back to the system. The final data and report are uploaded back to the system
7) The *Job Manager* returns the reference to the results of the *Report Job* (a sketch of this flow follows the list)

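A hypothetical sketch of the *Job Manager* loop described in steps 1-7. The `ee2_client.run_job` / `check_jobs` calls and the shapes of their results are simplified stand-ins rather than the exact ee2 client API, and the retry/error handling from step 3 is omitted.

```
import time
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class JobManagerParams:
    # produced by the Setup Job from the User Parameters
    fan_out_params: List[Dict[str, Any]]
    fan_in_app: str  # e.g. "kb_phylogenomics/build_microbial_speciestree_reduce"

def manage(ee2_client, params: JobManagerParams) -> str:
    """Launch the Fan Out jobs, wait for them all, then launch the single FanIn job."""
    # 3) launch and monitor the Fan Out jobs
    fan_out_ids = [ee2_client.run_job(p) for p in params.fan_out_params]
    while True:
        states = ee2_client.check_jobs(fan_out_ids)
        if all(s["status"] == "completed" for s in states):
            break
        time.sleep(30)
    # 5) launch one FanIn job over the Fan Out results
    fan_in_id = ee2_client.run_job(
        {
            "method": params.fan_in_app,
            "params": [{"object_refs": [s["job_output"] for s in states]}],
        }
    )
    # 7) return the reference to the results of the Report Job
    return fan_in_id
```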
Pros/Cons

* `+` All KBP jobs can run on a small subset of machines; the deadlock issue is fixed
* `+` No changes to ee2 required
* `-` Addresses the deadlocking issue, but the UI is still broken for regular runs and batch runs if we re-use KBP
* `-` Would have to rewrite apps that use KBP to use this new paradigm, on an as-needed basis


### Modify Apps to do only local submission by removing KBP and moving the job
* `+` Simple solution, quick turnaround, fixes the deadlock issue, fixes the UI issues
* `-` We have a limited number of machines with larger resources
* `-` Continued dependency on deprecated KBP tools
* `-` App runs may take longer, since fewer resources may be available to the app run

kbase.yml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ service-language:
     python

 module-version:
-    0.0.5
+    0.0.7

 owners:
     [bsadkhin, tgu2, wjriehl, gaprice]

lib/execution_engine2/db/MongoUtil.py

Lines changed: 43 additions & 4 deletions
@@ -3,8 +3,7 @@
 import time
 import traceback
 from contextlib import contextmanager
-from datetime import datetime
-from typing import Dict, List
+from typing import Dict, List, NamedTuple
 from bson.objectid import ObjectId
 from mongoengine import connect, connection
 from pymongo import MongoClient, UpdateOne
@@ -17,7 +16,11 @@
 )

 from lib.execution_engine2.utils.arg_processing import parse_bool
-from execution_engine2.sdk.EE2Runjob import JobIdPair
+
+
+class JobIdPair(NamedTuple):
+    job_id: str
+    scheduler_id: str


 class MongoUtil:
@@ -272,6 +275,42 @@ def check_if_already_finished(job_status):
             return True
         return False

+    def update_job_to_queued(
+        self, job_id: str, scheduler_id: str, scheduler_type: str = "condor"
+    ) -> None:
+        f"""
+        * Updates a {Status.created.value} job to queued and sets scheduler state.
+        Always sets scheduler state, but will only update to queued if the job is in the
+        {Status.created.value} state.
+        :param job_id: the ID of the job.
+        :param scheduler_id: the scheduler's job ID for the job.
+        :param scheduler_type: The scheduler this job was queued in, default condor
+        """
+        if not job_id or not scheduler_id or not scheduler_type:
+            raise ValueError("None of the 3 arguments can be falsy")
+        # could also test that the job ID is a valid job ID rather than having mongo throw an
+        # error
+        mongo_collection = self.config["mongo-jobs-collection"]
+        queue_time_now = time.time()
+        with self.pymongo_client(mongo_collection) as pymongo_client:
+            ee2_jobs_col = pymongo_client[self.mongo_database][mongo_collection]
+            # should we check that the job was updated and do something if it wasn't?
+            ee2_jobs_col.update_one(
+                {"_id": ObjectId(job_id), "status": Status.created.value},
+                {"$set": {"status": Status.queued.value, "queued": queue_time_now}},
+            )
+            # originally had a single query, but seems safer to always record the scheduler
+            # state no matter the state of the job
+            ee2_jobs_col.update_one(
+                {"_id": ObjectId(job_id)},
+                {
+                    "$set": {
+                        "scheduler_id": scheduler_id,
+                        "scheduler_type": scheduler_type,
+                    }
+                },
+            )
+
     def update_jobs_to_queued(
         self, job_id_pairs: List[JobIdPair], scheduler_type: str = "condor"
     ) -> None:
@@ -285,7 +324,7 @@ def update_jobs_to_queued(

         bulk_update_scheduler_jobs = []
         bulk_update_created_to_queued = []
-        queue_time_now = datetime.utcnow().timestamp()
+        queue_time_now = time.time()
         for job_id_pair in job_id_pairs:
             if job_id_pair.job_id is None:
                 raise ValueError(
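A minimal usage sketch of the new per-job update; the real call site is in the `EE2Runjob.py` diff below, and the helper name and ID pairs here are illustrative only.

```
# Illustrative only: assumes an already-constructed MongoUtil instance and
# (ee2 job id, condor cluster id) pairs collected during submission.
def mark_jobs_queued(mongo_util, submitted):
    for job_id, condor_job_id in submitted:
        # Each job is marked queued right after its own condor submission,
        # rather than in one batch after all submissions, so a job that starts
        # running quickly still gets its `queued` timestamp.
        mongo_util.update_job_to_queued(job_id, condor_job_id)
```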

lib/execution_engine2/execution_engine2Impl.py

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ class execution_engine2:
     # state. A method could easily clobber the state set by another while
     # the latter method is running.
     ######################################### noqa
-    VERSION = "0.0.5"
+    VERSION = "0.0.7"
     GIT_URL = "https://github.com/mrcreosote/execution_engine2.git"
     GIT_COMMIT_HASH = "2ad95ce47caa4f1e7b939651f2b1773840e67a8a"

lib/execution_engine2/sdk/EE2Runjob.py

Lines changed: 14 additions & 18 deletions
@@ -75,11 +75,6 @@ class PreparedJobParams(NamedTuple):
     job_id: str


-class JobIdPair(NamedTuple):
-    job_id: str
-    scheduler_id: str
-
-
 from typing import TYPE_CHECKING

 if TYPE_CHECKING:
@@ -312,17 +307,11 @@ def _run_multiple(self, runjob_params: List[Dict]):
         ).start()
         return job_ids

-    def _update_to_queued_multiple(self, job_ids, scheduler_ids):
+    def _finish_multiple_job_submission(self, job_ids):
         """
         This is called during job submission. If a job is terminated during job submission,
         we have the chance to re-issue a termination and remove the job from the Job Queue
         """
-        if len(job_ids) != len(scheduler_ids):
-            raise Exception(
-                "Need to provide the same amount of job ids and scheduler_ids"
-            )
-        jobs_to_update = list(map(JobIdPair, job_ids, scheduler_ids))
-        self.sdkmr.get_mongo_util().update_jobs_to_queued(jobs_to_update)
         jobs = self.sdkmr.get_mongo_util().get_jobs(job_ids)

         for job in jobs:
@@ -377,14 +366,21 @@ def _submit_multiple(self, job_submission_params):
                 )
                 raise RuntimeError(error_msg)
             condor_job_ids.append(condor_job_id)
-
-        self.logger.error(f"It took {time.time() - begin} to submit jobs to condor")
-        # It took 4.836009502410889 to submit jobs to condor
+            # Previously the jobs were updated in a batch after submitting all jobs to condor.
+            # This led to issues where a large job count could result in jobs switching to
+            # running prior to all jobs being submitted and so the queued timestamp was
+            # never added to the job record.
+            self.sdkmr.get_mongo_util().update_job_to_queued(job_id, condor_job_id)
+
+        self.logger.error(
+            f"It took {time.time() - begin} to submit jobs to condor and update to queued"
+        )

         update_time = time.time()
-        self._update_to_queued_multiple(job_ids=job_ids, scheduler_ids=condor_job_ids)
-        # It took 1.9239885807037354 to update jobs
-        self.logger.error(f"It took {time.time() - update_time} to update jobs ")
+        self._finish_multiple_job_submission(job_ids=job_ids)
+        self.logger.error(
+            f"It took {time.time() - update_time} to finish job submission"
+        )

         return job_ids