Commit cefb6c4

Merge pull request #253 from carsonmh/cli-design-doc
Cli Design Markdown
2 parents 15e6f26 + cd7a2b0 commit cefb6c4

1 file changed, +179 -0 lines changed

@@ -0,0 +1,179 @@
1+
# CodeFlare CLI Design
2+
3+
4+
## Context and Scope
5+
6+
7+
The primary purpose of the CLI is to serve as an interaction layer between a user and the CodeFlare stack (MCAD, InstaScale, KubeRay) from within the terminal. It is needed because many of our target users come from a high-performance computing background and are most comfortable submitting jobs to a cluster via a CLI.
8+
9+
10+
The CLI will use the existing CodeFlare SDK and expose similar operations (such as Ray cluster and job management) in the terminal. On top of the SDK, the CLI adds a few functions, saves time, keeps workspaces simpler, and lets users automate certain processes with bash scripts.
11+
12+
13+
14+
15+
## Goals
16+
17+
18+
- Provide users the ability to request, monitor and stop the Kubernetes resources associated with the CodeFlare stack within the terminal.
19+
- Serve as an interaction layer between the data scientist and the CodeFlare stack (MCAD, InstaScale, KubeRay)
20+
- Allow for a user-friendly workflow within the terminal
21+
- Allow for automation and scripting of job/RayCluster management via bash scripts
22+
23+
24+
## Non-Goals
25+
26+
27+
- Re-implementing functionality that already exists in the CodeFlare SDK or in any of the SDK’s clients for Ray, MCAD, or any other service
28+
29+
30+
## Architecture and Design
31+
32+
33+
The CodeFlare CLI is an extension to the CodeFlare SDK package that allows a user to create, monitor, and shut down framework clusters (RayClusters for now) and distributed training jobs on an authenticated Kubernetes cluster from the terminal.
34+
35+
36+
The user should have the ability to do the following from within the terminal:
37+
- Create, submit, and delete Ray clusters via AppWrappers, and view their details and status
38+
- Create, submit, and delete jobs, and view their logs and status
39+
- List all jobs
40+
- List all Ray clusters
41+
- Log in to a Kubernetes cluster
42+
- Log out of a Kubernetes cluster
43+
44+
45+
To support these operations, additions to the SDK may include:
46+
- Formatted listing of Ray clusters
47+
- Formatted listing of jobs
48+
- Getting a job by name
49+
50+
51+
For most of its functionality, the CLI will rely on what the SDK already provides.
52+
53+
54+
### CLI Framework
55+
56+
57+
[Click](https://click.palletsprojects.com/en/8.1.x/) is the chosen CLI framework for the following reasons:
58+
- Simple syntax/layout: Because the CLI commands are complex, it is important that the CLI framework doesn’t add any unnecessary complexity.
59+
- Supports functional commands instead of objects: The SDK is designed around functions, so a function-based CLI mirrors it and is easier to read.
60+
- Comes with testing and help generation: The built-in testing library and automatic help generation speed up development.
61+
- Large community support/documentation: Extensive documentation and a large community lead to fewer errors and easier development.
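
A minimal sketch of what function-style commands could look like in Click, assuming a top-level `codeflare` group with a `list` subgroup. The function names, option wiring, and echoed message are illustrative placeholders rather than the actual implementation, and Click must be installed for it to run.

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    """Hypothetical top-level `codeflare` command group."""


@cli.group(name="list")
def list_group():
    """Hypothetical `codeflare list ...` subgroup."""


@list_group.command(name="raycluster")
@click.option("-n", "--namespace", default=None, help="Namespace to list from.")
@click.option("--all", "all_namespaces", is_flag=True, help="List across all namespaces.")
def list_raycluster(namespace, all_namespaces):
    """List Ray clusters (placeholder body; a real command would call the SDK)."""
    if not all_namespaces and namespace is None:
        raise click.UsageError("Specify -n/--namespace or --all.")
    scope = "all namespaces" if all_namespaces else f"namespace {namespace}"
    click.echo(f"Listing Ray clusters in {scope}")


# Click's bundled test runner invokes commands in-process, without a subprocess.
runner = CliRunner()
result = runner.invoke(cli, ["list", "raycluster", "-n", "default"])
missing = runner.invoke(cli, ["list", "raycluster"])
help_text = runner.invoke(cli, ["--help"]).output
```

Each command is a plain decorated function, and the help text is generated from the docstrings and option declarations, which are the two properties the list above relies on.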
62+
63+
64+
### Framework Clusters
65+
66+
67+
When the user invokes the `define raycluster` command, a YAML file with default values is created in the user’s current working directory. Users can customize their clusters by passing parameters to the define command; these values override the defaults when the AppWrapper YAML file is created.
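
The override behavior can be sketched as a dictionary merge. This is only an illustration: the default fields, the `<name>.yaml` file naming, and the `define_raycluster` helper are assumptions, and a real AppWrapper YAML is nested rather than a flat mapping.

```python
from pathlib import Path
import tempfile

# Hypothetical defaults; the real field names and values come from the SDK.
DEFAULTS = {"name": "raycluster", "namespace": "default", "num_workers": 1, "gpu": 0}


def define_raycluster(directory, **overrides):
    """Merge user overrides into the defaults and write <name>.yaml."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}  # later keys win, so overrides replace defaults
    path = Path(directory) / f"{config['name']}.yaml"
    # A flat mapping of simple scalars can be emitted as `key: value` YAML lines.
    path.write_text("".join(f"{key}: {value}\n" for key, value in config.items()))
    return path


with tempfile.TemporaryDirectory() as workdir:
    written = define_raycluster(workdir, name="mycluster", gpu=2)
    contents = written.read_text()
    filename = written.name
```

Values containing YAML-special characters would need quoting, so a real implementation would more likely use a YAML library than hand-written lines.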
68+
69+
70+
Once the AppWrapper is defined, the user can create the Ray cluster with the `create raycluster` command, specifying the name of the cluster to submit. The CLI first checks whether a cluster with that name is already present in the Kubernetes cluster. If it is not, the CLI searches the current working directory for a YAML file corresponding to the cluster name and applies it to the Kubernetes cluster. If the wait flag is specified, the CLI displays a loading indicator with status updates until the cluster is up.
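
The two checks described above might look roughly like the following. The `find_cluster_file` helper, the `<name>.yaml` naming convention, and the `existing_clusters` argument (standing in for a query against the Kubernetes cluster) are assumptions for illustration; applying the file and the wait loop are left out.

```python
from pathlib import Path
import tempfile


def find_cluster_file(name, existing_clusters, workdir):
    """Return the YAML file to apply for `name`, or raise if a check fails."""
    # Check 1: refuse to create a cluster whose name is already in use.
    if name in existing_clusters:
        raise ValueError(f"Ray cluster '{name}' already exists")
    # Check 2: look for a matching definition file in the working directory.
    for suffix in (".yaml", ".yml"):
        candidate = Path(workdir) / f"{name}{suffix}"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No YAML file for '{name}' in {workdir}")


with tempfile.TemporaryDirectory() as workdir:
    (Path(workdir) / "mycluster.yaml").write_text("name: mycluster\n")
    found = find_cluster_file("mycluster", existing_clusters=set(), workdir=workdir).name
```

Raising distinct exceptions for the two failure cases lets the command print a specific message for each one.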
71+
72+
73+
We will try to find a good balance between exposing more parameters and simplifying the process by acting on feedback from CLI users.
74+
75+
76+
When the user invokes `delete raycluster` with a cluster name, the CLI will shut down that cluster and delete it.
77+
78+
79+
### Training Jobs
80+
81+
82+
When the user invokes the `define job` command, a DDPJobDefinition object will be created and saved to a file. Users can customize their jobs by passing parameters to the define command.
83+
84+
85+
Once the job is defined, the user can submit it with the `submit job` command, specifying the job name. The CLI will check whether the job already exists on the Kubernetes cluster and, if it does not, submit it. The submitted job will be a DDPJob, and it will run on a specified Ray cluster.
86+
87+
88+
To delete a job, the user invokes the `delete job` command, and the CLI will stop the job and delete it. This can be done at any time while the job is running.
89+
90+
91+
### Authentication
92+
93+
94+
Users must be authenticated to a Kubernetes cluster to perform any operation.
95+
96+
97+
If the user tries to perform an operation without being logged in, the CLI will prompt them to authenticate. A valid kubeconfig must be present in the user’s environment to perform any operation.
98+
99+
100+
The user will be able to log in using a simple `login` command and can choose to log in with a server and token. The user can also choose whether or not they want TLS verification. If a kubeconfig already exists, the CLI will update it; otherwise it will create one for the user.
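
The update-or-create behavior can be sketched on the in-memory kubeconfig structure. The `apply_login` helper and the `codeflare` entry name are hypothetical, and reading or writing the file on disk (which needs a YAML library) is omitted.

```python
import copy


def apply_login(kubeconfig, server, token, skip_tls_verify=False, name="codeflare"):
    """Return a kubeconfig mapping with the given server and token set.

    Pass `kubeconfig=None` to build one from scratch; an existing mapping is
    copied and updated. `name` is a hypothetical label for the entries.
    """
    config = copy.deepcopy(kubeconfig) if kubeconfig else {
        "apiVersion": "v1", "kind": "Config",
        "clusters": [], "users": [], "contexts": [],
    }

    def upsert(section, entry):
        # Replace an entry with the same name, or append a new one.
        items = [item for item in (config.get(section) or []) if item.get("name") != name]
        config[section] = items + [{"name": name, **entry}]

    upsert("clusters", {"cluster": {"server": server,
                                    "insecure-skip-tls-verify": skip_tls_verify}})
    upsert("users", {"user": {"token": token}})
    upsert("contexts", {"context": {"cluster": name, "user": name}})
    config["current-context"] = name
    return config


# Example values only; the server URL and tokens are made up.
created = apply_login(None, "https://api.example.com:6443", "sha256~example")
updated = apply_login(created, "https://api.example.com:6443", "sha256~rotated")
```

Logging in a second time replaces the earlier entries instead of appending duplicates, and entries belonging to other clusters in an existing kubeconfig are left untouched.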
101+
102+
103+
Alternatively, the user can invoke the login command with the path to their kubeconfig file, and the CLI will log them in using that file.
104+
105+
106+
Users can log out of their cluster using the `logout` command.
107+
108+
109+
110+
111+
### Listing Info
112+
113+
114+
Users can list both Ray cluster information and job information by invoking the respective commands. The CLI will list information for each Ray cluster or job, such as requested resources, status, name, and namespace.
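
One possible shape for the formatted listing is a fixed-width text table. The `format_table` helper, the column names, and the sample rows (including the status values) are illustrative assumptions, not output from the SDK.

```python
def format_table(rows, columns):
    """Render a list of dicts as a left-aligned, fixed-width text table."""
    # Each column is as wide as its header or its longest value.
    widths = {
        col: max([len(col)] + [len(str(row.get(col, ""))) for row in rows])
        for col in columns
    }

    def render(values):
        cells = (str(value).ljust(widths[col]) for col, value in zip(columns, values))
        return "  ".join(cells).rstrip()

    lines = [render(col.upper() for col in columns)]
    lines += [render(row.get(col, "") for col in columns) for row in rows]
    return "\n".join(lines)


# Made-up sample data for illustration.
clusters = [
    {"name": "mycluster", "namespace": "default", "status": "READY", "gpu": 0},
    {"name": "big", "namespace": "ml-team", "status": "PENDING", "gpu": 4},
]
table = format_table(clusters, ["name", "namespace", "status", "gpu"])
```

Plain aligned text is easy to read in a terminal and also easy to process from bash scripts, which fits the automation goal.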
115+
116+
117+
## Alternatives Considered
118+
119+
120+
- Existing CodeFlare CLI
121+
  - Written in TypeScript and overcomplicated. Did not support
122+
- Just using SDK
123+
  - Making a CLI saves a lot of time and is easier for the user in some cases
124+
- Interactive CLI
125+
  - Interactive CLIs make it harder for automation via bash scripts
126+
- Other CLI libraries
127+
  - **Cliff:** Ugly syntax, less readability, not much functionality.
128+
  - **Argparse:** Less functionality out of the box. More time spent on unnecessary reimplementation.
129+
  - **Cement:** Ugly syntax and low community support.
130+
131+
132+
## Security Considerations
133+
134+
135+
We will rely on Kubernetes default security, where users cannot perform any operations on a cluster unless they are authenticated correctly.
136+
137+
138+
## Testing and Validation
139+
The CLI lives inside the SDK, so it will be [tested](https://github.com/project-codeflare/codeflare-sdk/blob/main/CodeFlareSDK_Design_Doc.md#testing-and-validation) the same way.
140+
141+
142+
## Deployment and Rollout
143+
- The CLI will be deployed within the CodeFlare SDK so similar [considerations](https://github.com/project-codeflare/codeflare-sdk/blob/main/CodeFlareSDK_Design_Doc.md#deployment-and-rollout) will be taken into account.
144+
145+
146+
## Command Usage Examples
147+
Create ray cluster
148+
- `codeflare create raycluster [options]`
149+
150+
151+
Perform an operation on a Ray cluster
152+
- `codeflare {operation} raycluster {cluster_name} [options e.g. --gpu=0]`
153+
154+
155+
Create job
156+
- `codeflare create job [options]`
157+
158+
159+
Perform an operation on a job
160+
- `codeflare {operation} job {job_name} [options e.g. --cluster-name="mycluster"]`
161+
- Namespace and ray cluster name will be required as options
162+
163+
164+
Listing out clusters
165+
- `codeflare list raycluster -n {namespace}`
- `codeflare list raycluster --all`
166+
167+
168+
Listing out jobs
169+
- `codeflare list job -c {cluster_name} -n {namespace}`
170+
- `codeflare list job -n {namespace}`
171+
- `codeflare list job --all`
172+
173+
174+
Log in to a Kubernetes cluster
175+
- `codeflare login [options e.g. --configpath={path/to/kubeconfig}]` (if `--configpath` is omitted, the default value is used)
176+
177+
178+
Log out of a Kubernetes cluster
179+
- `codeflare logout`
