Apache Spark is an open-source Big Data framework for distributed batch and stream processing of unstructured and semi-structured data. It is part of the Hadoop ecosystem.
Working with Spark in the Ambari interface
To create a Spark cluster, follow the instructions and set the required parameters.
After the cluster is created, a link to the Ambari management interface, together with the login and password, is published on the cluster's "General information" tab:
To open the Ambari interface, enter the Web UI address in the browser address bar, then enter your login and password in the window that opens:
This will open the Ambari interface for managing the Spark cluster and its components:
Working with Spark in Zeppelin
Apache Zeppelin is a web-based notebook that allows you to process, analyze, and visualize data on the Hadoop platform. Spark clusters in MCS include Zeppelin notebooks that can be used to run Spark jobs.
There are several ways to access the Zeppelin web interface, described below.
Most services in the Hadoop ecosystem have no built-in authorization, so by default access to them is allowed only from the internal network. After the cluster starts, the security groups are configured as follows:
| Traffic type | Restrictions |
|---|---|
| Outgoing traffic | No restrictions |
| Inbound and outbound traffic between cluster nodes within the internal network | No restrictions |
| Inbound traffic to TCP port 8080 of the head node | No restrictions; used to access the Ambari interface (login and password authorization) |
| Inbound traffic to TCP port 22 of the head node | No restrictions; used for SSH access (SSH key authorization) |
| Incoming ICMP traffic to the head node | No restrictions |
Thus, the default configuration provides access only to the Ambari web interface and SSH on the head node. Internal web interfaces and service APIs are reachable only from the internal network, which is isolated from the public Internet. To gain access, the following options are available:
- Configuring VPN access to Mail.Ru Cloud Solutions resources from the organization's network.
- Using the master host as a proxy server.
- Opening access to specific ranges of IP addresses through security groups.
Organization of VPN access
To provide access to all resources of the Hadoop cluster, it is recommended to create a VPN connection from the organization's network to the MCS network. The resources of Mail.Ru Cloud Solutions will then be available as if they were inside the organization's internal network.
To use a VPN with a Hadoop cluster, do the following:
- Create a new internal network and router within the project (see the CLI sketch after this list).
- Start the Hadoop cluster using the created network as the cluster's internal network.
- Set up a VPN connection to Mail.Ru Cloud Solutions services.
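The first step can also be performed with the OpenStack CLI (a sketch, not the exact MCS procedure; the network, subnet, and router names and the 10.0.0.0/24 range are placeholders):

openstack network create hadoop-net
openstack subnet create --network hadoop-net --subnet-range 10.0.0.0/24 hadoop-subnet
openstack router create hadoop-router
openstack router add subnet hadoop-router hadoop-subnet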
Using the head node as a proxy
Alternatively, you can use an SSH tunnel to the head node. To set up the tunnel, you need a machine with an OpenSSH client of any version. Start the SSH client with the -D option, which launches a SOCKS5 proxy over an SSH tunnel to the remote server:
ssh -D 3128 -C -q -N -f centos@<host>
- the -D 3128 option starts a SOCKS5 proxy server on local port 3128;
- the -C option requests data compression;
- the -q option minimizes the output of status messages to the console;
- the -N option tells SSH not to run any remote commands and to use the connection exclusively for the tunnel;
- the -f option sends the SSH client to the background.
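To check that the tunnel works, you can route a test request through the proxy (a minimal sketch; the internal host name is a placeholder, and port 8080 is only an example):

# send a request through the SOCKS5 proxy started above;
# --socks5-hostname resolves the host name through the tunnel as well
curl --socks5-hostname localhost:3128 http://<internal-host>:8080/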
If necessary, the SSH SOCKS5 tunnel can be created automatically at system boot. To do this, create a systemd .service file.
/etc/systemd/system/mcs-proxy.service:

[Unit]
Description=Setup SOCKS Proxy
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/ssh -D 8123 -C -q -N centos@HOST -i /home/user/.ssh/id_rsa

[Install]
WantedBy=multi-user.target
After creating the file, reload the systemd configuration, start the service, and enable it so that it starts at boot:

systemctl daemon-reload
systemctl start mcs-proxy
systemctl enable mcs-proxy
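To verify that the proxy is up, check the service status and the listening port (8123 comes from the unit file above):

systemctl status mcs-proxy
ss -lnt | grep 8123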
More information on enabling a SOCKS5 proxy in your browser can be found in the Firefox and Chrome documentation. Detailed information on the tunnel setup procedure and its configuration parameters is contained in the SSH client documentation.
Setting up security groups
Services can also be made accessible by changing the security group settings. To do this, open the "Virtual networks" → "Firewall settings" section of the MCS interface and select the security group whose name contains the cluster name and the node type (head or worker).
For example, if the cluster is named test1, the security group of the head node contains test1-hadoop-head in its name, and the group of the worker nodes contains test1-hadoop-worker. These groups are created automatically for each new cluster.
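The same rule can be added with the OpenStack CLI (a sketch; the Zeppelin port 9995, the 203.0.113.0/24 source range, and the group name are assumptions for illustration):

openstack security group rule create --ingress --protocol tcp --dst-port 9995 --remote-ip 203.0.113.0/24 test1-hadoop-head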
Once access is set up, log in to Zeppelin with the following credentials:
- Login: admin
- Password: admin
Use the control bar in the upper right corner of a cell to execute the code in that cell. The result is displayed immediately below the code:
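For example, a minimal PySpark paragraph might look like this (a sketch; it assumes the default Spark interpreter group is bound to the note):

%pyspark
# count even numbers in a small test RDD
rdd = sc.parallelize(range(100))
print(rdd.filter(lambda x: x % 2 == 0).count())  # prints 50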
Working with Spark in the console
Spark MCS clusters include console utilities for working with Spark.
Connect to the head node of the cluster via SSH, then launch one of the available shells, depending on the language you use.
To work with Spark 2, set the environment variable SPARK_MAJOR_VERSION=2.
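The variable can also be exported once per SSH session instead of prefixing every command (assuming a bash shell on the head node):

export SPARK_MAJOR_VERSION=2
spark-submit --version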
1. To run tasks in Scala, launch spark-shell:
SPARK_MAJOR_VERSION=2 spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://spark-1-spark-head-0.novalocal:4041
Spark context available as 'sc' (master = local[*], app id = local-1526372942530).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.0-292
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.
2. To run tasks in Python, launch pyspark:
SPARK_MAJOR_VERSION=2 pyspark
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.5 (default, Aug 4 2017, 00:39:18)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/15 08:31:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.0-292
      /_/

Using Python version 2.7.5 (default, Aug 4 2017 00:39:18)
SparkSession available as 'spark'.
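Inside the shell, the sc and spark objects are already defined, so you can run a quick smoke test (a minimal sketch):

>>> rdd = sc.parallelize(range(100))
>>> rdd.filter(lambda x: x % 2 == 0).count()
50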
3. To run tasks in R, install the R interpreter on the head node and all worker nodes. To do this, enable the EPEL repository and install the R package:
yum install epel-release
yum install R
Then launch sparkR:
SPARK_MAJOR_VERSION=2 sparkR
SPARK_MAJOR_VERSION is set to 2, using Spark2

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Launching java with spark-submit command /usr/hdp/current/spark2-client/bin/spark-submit "sparkr-shell" /tmp/Rtmpu2FF4v/backend_port26c2218284a
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0.2.6.5.0-292
      /_/

SparkSession available as 'spark'.