From 1f14d7fdb678db55cfabe58c96d0a70d570ce877 Mon Sep 17 00:00:00 2001 From: SamyOubouaziz Date: Thu, 15 May 2025 12:05:08 +0200 Subject: [PATCH 1/4] docs(dwc): add doc on persistent volume MTA-5885 --- pages/data-lab/concepts.mdx | 9 +++++++-- pages/data-lab/how-to/create-data-lab.mdx | 5 ++++- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index 5fd2b68d79..f26eb98154 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -12,6 +12,10 @@ categories: - managed-services --- +## Apache Spark Cluster + +An Apache Spark cluster is an orchestrated set of machines over which the distributed/Big data calculus is going to be processed. In the case of this project, the Apache Spark cluster is a Kubernetes cluster, upon which Apache Spark has been installed in every pod deployed. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html). + ## Data Lab A Data Lab is a project setup that combines a Notebook and an Apache Spark Cluster for data analysis and experimentation. it comes with the required infrastructure and tools to allow data scientists, analysts, and researchers to explore data, create models, and gain insights. @@ -40,9 +44,9 @@ Lighter is a technology that enables SparkMagic commands to be readable and exec A notebook for an Apache Spark cluster is an interactive, web-based tool that allows users to write and execute code, visualize data, and share results in a collaborative environment. It connects to an Apache Spark cluster to run large-scale data processing tasks directly from the notebook interface, making it easier to develop and test data workflows. -## Apache Spark Cluster +## Persistent volume -An Apache Spark cluster is an orchestrated set of machines over which the distributed/Big data calculus is going to be processed. 
In the case of this project, the Apache Spark cluster is a Kubernetes cluster, upon which Apache Spark has been installed in every pod deployed. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html). +A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions. ## SparkMagic @@ -50,4 +54,5 @@ SparkMagic is a set of tools that allows you to interact with Apache Spark clust ## Transaction + An SQL transaction is a sequence of one or more SQL operations (such as queries, inserts, updates, or deletions) executed as a single unit of work. These transactions ensure data integrity and consistency, following the ACID properties: Atomicity, Consistency, Isolation, and Durability, meaning all operations within a transaction either complete successfully or none of them take effect. An SQL transaction can be rolled back in case of an error. \ No newline at end of file diff --git a/pages/data-lab/how-to/create-data-lab.mdx b/pages/data-lab/how-to/create-data-lab.mdx index 840707e43f..e5e999359f 100644 --- a/pages/data-lab/how-to/create-data-lab.mdx +++ b/pages/data-lab/how-to/create-data-lab.mdx @@ -34,7 +34,10 @@ Data Lab for Apache Spark™ is a product designed to assist data scientists and Provisioning zero worker nodes lets you retain and access you cluster and notebook configurations, but will not allow you to run calculations. - - Optionally, choose an Object Storage bucket in the desired region to store the data source and results. + - Activate the [persistent volume](/data-lab/concepts/#persistent-volume) if required, then enter a volume size according to your needs. + + Persistent volume usage depends on your workload, and only the actual usage will be billed, within the limit defined. 
A minimum of 1 GB is required to run the notebook. + - Enter a name for your Data Lab. - Optionally, add a description and/or tags for your Data Lab. - Verify the estimated cost. From 9aa535b4b53b5085cdb3e9fea4a8bba4d12d333c Mon Sep 17 00:00:00 2001 From: SamyOubouaziz Date: Fri, 16 May 2025 11:19:42 +0200 Subject: [PATCH 2/4] docs(dwc): update concept --- pages/data-lab/concepts.mdx | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index f26eb98154..ceba1f2ace 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -48,6 +48,10 @@ A notebook for an Apache Spark cluster is an interactive, web-based tool that al A Persistent Volume (PV) is a cluster-wide storage resource that ensures data persistence beyond the lifecycle of individual pods. Persistent volumes abstract the underlying storage details, allowing administrators to use various storage solutions. +Apache Spark® executors require storage space for various operations, particularly to shuffle data during wide operations such as sorting, grouping, and aggregation. Wide operations are transformations that require data from different partitions to be combined, often resulting in data movement across the cluster. During the map phase, executors write data to shuffle storage, which is then read by reducers. + +A PV sized properly ensures a smooth execution of your workload. + ## SparkMagic SparkMagic is a set of tools that allows you to interact with Apache Spark clusters through Jupyter notebooks. It provides magic commands for running Spark jobs, querying data, and managing Spark sessions directly within the notebook interface, facilitating seamless integration and execution of Spark tasks. For more details, check out the [SparkMagic repository](https://github.com/jupyter-incubator/sparkmagic). 
From 745b58ab5fe606eaa03c5ce5ce7fcc3358d81857 Mon Sep 17 00:00:00 2001 From: SamyOubouaziz Date: Fri, 16 May 2025 13:49:07 +0200 Subject: [PATCH 3/4] Update pages/data-lab/concepts.mdx Co-authored-by: Rowena Jones <36301604+RoRoJ@users.noreply.github.com> --- pages/data-lab/concepts.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index ceba1f2ace..f22eb7e45e 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -14,7 +14,7 @@ categories: ## Apache Spark Cluster -An Apache Spark cluster is an orchestrated set of machines over which the distributed/Big data calculus is going to be processed. In the case of this project, the Apache Spark cluster is a Kubernetes cluster, upon which Apache Spark has been installed in every pod deployed. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html). +An Apache Spark cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html). 
## Data Lab From c7219fa1f84ef7428984228d0b05df30176bede5 Mon Sep 17 00:00:00 2001 From: SamyOubouaziz Date: Fri, 6 Jun 2025 15:05:35 +0200 Subject: [PATCH 4/4] Update pages/data-lab/concepts.mdx Co-authored-by: Jessica <113192637+jcirinosclwy@users.noreply.github.com> --- pages/data-lab/concepts.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pages/data-lab/concepts.mdx b/pages/data-lab/concepts.mdx index f22eb7e45e..e680bff59b 100644 --- a/pages/data-lab/concepts.mdx +++ b/pages/data-lab/concepts.mdx @@ -12,7 +12,7 @@ categories: - managed-services --- -## Apache Spark Cluster +## Apache Spark cluster An Apache Spark cluster is an orchestrated set of machines over which distributed/Big data calculus is processed. In the case of Scaleway Data Lab, the Apache Spark cluster is a Kubernetes cluster, with Apache Spark installed in each pod. For more details, check out the [Apache Spark documentation](https://spark.apache.org/documentation.html).