Skip to content

Commit df26fbb

Browse files
even more typos
1 parent 70ae13f commit df26fbb

File tree

1 file changed

+8
-8
lines changed

1 file changed

+8
-8
lines changed

CodeFlareSDK_Design_Doc.md

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -2,17 +2,17 @@
22

33
## Context and Scope
44

5-
The primary purpose for the CodeFlare SDK is to provide a pythonic interaction layer between a user and the CodeFlare Stack (a set of services that enable advanced queuing, resource management and distributed compute on a kubernetes cluster).
5+
The primary purpose for the CodeFlare SDK is to provide a pythonic interaction layer between a user and the CodeFlare Stack (a set of services that enable advanced queuing, resource management and distributed compute on Kubernetes).
66

7-
The reason that this SDK is needed is due to the fact that many of the benefits associated with the CodeFlare stack are aimed at making the work of data scientists simpler and more efficient. However, since all parts of the CodeFlare stack are separate Kubernetes services, there needs to be something that unifies the interactions between the user and these separate services. Furthermore, we do not expect the average user to be experienced working with kubernetes infrastructure, and want to provide them with a python native way of interacting with these services.
7+
The reason that this SDK is needed is due to the fact that many of the benefits associated with the CodeFlare stack are aimed at making the work of data scientists simpler and more efficient. However, since all parts of the CodeFlare stack are separate Kubernetes services, there needs to be something that unifies the interactions between the user and these separate services. Furthermore, we do not expect the average user to be experienced working with Kubernetes infrastructure, and want to provide them with a Python native way of interacting with these services.
88

99
The SDK should support any operation that a user would need to do in order to successfully submit machine learning training jobs to their kubernetes cluster.
1010

1111
## Goals
1212

1313
* Serve as an interaction layer between a data scientist and the CodeFlare Stack (MCAD, InstaScale, KubeRay)
1414
* Abstract away user’s infrastructure concerns; Specifically, dealing with Kubernetes resources, consoles, or CLI’s (kubectl).
15-
* Provide users the ability to programmatically request, monitor and stop the kubernetes resources associated with the CodeFlare stack.
15+
* Provide users the ability to programmatically request, monitor, and stop the kubernetes resources associated with the CodeFlare stack.
1616

1717
## Non-Goals
1818

@@ -35,7 +35,7 @@ In order to achieve this we need the capacity to:
3535

3636
### Framework Clusters:
3737

38-
In order to create these framework clusters, we will start with a template AppWrapper yaml file with reasonable defaults that will generate a valid RayCluster via MCAD.
38+
In order to create these framework clusters, we will start with a [template AppWrapper yaml file](/src/codeflare_sdk/templates/base-template.yaml) with reasonable defaults that will generate a valid RayCluster via MCAD.
3939

4040
Users can customize their AppWrapper by passing their desired parameters to `ClusterConfig()` and applying that configuration when initializing a `Cluster()` object. When a `Cluster()` is initialized, it will update the AppWrapper template with the user’s specified requirements, and save it to the current working directory.
4141

@@ -45,17 +45,17 @@ With a valid AppWrapper, we will use the Kubernetes python client to apply the A
4545

4646
We will also use the Kubernetes python client to get information about both the RayCluster and AppWrapper custom resources to monitor the status of our Framework Cluster via `cluster.status()` and `cluster.details()`.
4747

48-
The RayCluster deployed on your kubernetes cluster can be interacted with in two ways: Either through an interactive session via `ray.init()` or through the submission of batch jobs.
48+
The RayCluster deployed on your Kubernetes cluster can be interacted with in two ways: Either through an interactive session via `ray.init()` or through the submission of batch jobs.
4949

5050
Finally we will use the Kubernetes python client to delete the AppWrapper via `cluster.down()`
5151

5252
### Training Jobs:
5353

54-
For the submission of Jobs we will rely on the [TorchX](https://pytorch.org/torchx/latest/) job launcher to handle orchestrating the distribution of our model training jobs across the available resources on our cluster. We will support two distributed backend schedulers: Ray and Kuberentes-MCAD. TorchX is designed to be used primarily as a CLI, so we will wrap a limited subset of its functionality into our SDK so that it can be used as part of a python script.
54+
For the submission of Jobs we will rely on the [TorchX](https://pytorch.org/torchx/latest/) job launcher to handle orchestrating the distribution of our model training jobs across the available resources on our cluster. We will support two distributed backend schedulers: Ray and Kubernetes-MCAD. TorchX is designed to be used primarily as a CLI, so we will wrap a limited subset of its functionality into our SDK so that it can be used as part of a python script.
5555

5656
Users can define their jobs with `DDPJobDefinition()` providing parameters for the script they want to run as part of the job, the resources required for the job, additional args specific to the script being run and scheduler being used.
5757

58-
Once a job is defined it can be submitted to the Kuberentes cluster to be run via `job.submit()`. If `job.submit()` is left empty the SDK will assume the Kuberentes-MCAD scheduler is being used. If a RayCluster is specified like, `job.submit(cluster)`, then the SDK will assume that the Ray scheduler is being used and submit the job to that RayCluster.
58+
Once a job is defined it can be submitted to the Kubernetes cluster to be run via `job.submit()`. If `job.submit()` is left empty the SDK will assume the Kubernetes-MCAD scheduler is being used. If a RayCluster is specified like, `job.submit(cluster)`, then the SDK will assume that the Ray scheduler is being used and submit the job to that RayCluster.
5959

6060
After the job is submitted, a user can monitor its progress via `job.status()` and `job.logs()` to retrieve the status and logs output by the job. At any point the user can also call `job.cancel()` to stop the job.
6161

@@ -85,7 +85,7 @@ In either case, users can log out and clear the authentication inputs with `.log
8585
## Security Considerations
8686

8787

88-
We will rely on the kubernetes cluster’s default security, where users cannot perform any operations on a cluster if they are not authenticated correctly.
88+
We will rely on the Kubernetes cluster’s default security, where users cannot perform any operations on a cluster if they are not authenticated correctly.
8989

9090
## Testing and Validation
9191

0 commit comments

Comments
 (0)