Updated on October 31, 2023, 06:09 AM

About the service (Beta)

Apache Spark is an engine for processing big data. It provides APIs in Java, Scala, Python and R, along with higher-level tools: Spark SQL for SQL queries, the pandas API, MLlib for machine learning, GraphX for graph processing and Structured Streaming for stream processing. Spark is most often used as part of a Hadoop cluster.

Cloud Spark is a solution based on Apache Spark Operator and the Kubernetes PaaS from VK Cloud. It lets you deploy Spark inside Kubernetes using an image from Docker Registry, without a Hadoop cluster.
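
To illustrate what deploying Spark through an operator looks like, here is a minimal sketch that builds a SparkApplication manifest as a plain Python dictionary. The field names follow the v1beta2 custom resource of the open-source Spark Operator for Kubernetes; whether Cloud Spark accepts exactly this shape is an assumption. The function name, namespace, service account, Spark version and resource sizes are hypothetical placeholders, not values from VK Cloud.

```python
import json


def spark_application_manifest(name, image, main_file, executors=2):
    """Build a SparkApplication manifest as a plain dict.

    Assumption: field names follow the v1beta2 CRD of the open-source
    Spark Operator; the namespace, service account, Spark version and
    resource sizes below are illustrative placeholders.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            # Pre-built Spark image pulled from a Docker registry.
            "image": image,
            # Entry point of the job inside the image.
            "mainApplicationFile": main_file,
            "sparkVersion": "3.4.1",
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "2g"},
        },
    }


if __name__ == "__main__":
    manifest = spark_application_manifest(
        name="demo-job",
        image="registry.example.com/spark:latest",  # hypothetical image
        main_file="local:///app/job.py",            # hypothetical path
    )
    # kubectl accepts JSON manifests, so the output can be saved to a file.
    print(json.dumps(manifest, indent=2))
```

Building the manifest in code rather than by hand makes it easier to vary the image or the number of executors per job.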

What tasks is the service suitable for?

  • Distributed processing of big data.
  • Reading data from S3 and exporting it to a database (ClickHouse / Greenplum / PostgreSQL) for processing. Data can also be transferred from the database to S3.
  • Distributed training of ML models using big data.
  • Graph calculations using the GraphX component.
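
The S3-to-database scenario from the list above can be sketched with the standard PySpark DataFrame reader and JDBC writer. This is a rough sketch under assumptions: the Spark image is assumed to include the S3A connector and the PostgreSQL JDBC driver, the source data is assumed to be Parquet, and the helper names, bucket, host and credentials are hypothetical.

```python
def postgres_jdbc_options(host, port, database, table, user, password):
    """Return options for Spark's built-in JDBC data source (PostgreSQL)."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def copy_s3_to_postgres(spark, s3_path, jdbc_options):
    """Read Parquet files from S3 and append the rows to a database table.

    `spark` is an existing pyspark.sql.SparkSession; `s3_path` is an
    s3a:// URI (assumes the S3A connector is available in the image).
    """
    df = spark.read.parquet(s3_path)
    df.write.format("jdbc").options(**jdbc_options).mode("append").save()


# Usage inside a Spark job (all values are hypothetical):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("s3-to-postgres").getOrCreate()
#   opts = postgres_jdbc_options("db.example.internal", 5432, "analytics",
#                                "public.events", "spark_user", "secret")
#   copy_s3_to_postgres(spark, "s3a://example-bucket/events/", opts)
```

The reverse direction follows the same pattern: read with `spark.read.format("jdbc")` and write the resulting DataFrame to an `s3a://` path.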

Service features

  • Deployment of the Spark cluster inside Kubernetes.
  • Automatic configuration of the master nodes of the deployed clusters.
  • Connecting a pre-assembled Spark image via Docker Registry.
  • Horizontal and vertical cluster scaling with autoscaling support.
  • Access control using tokens and a role model.
  • Scheduled automatic deletion of the cluster or switching it to sleep mode.
  • Service management using the API.

Interaction of service components