
First Experience with a Spark Cluster

Contents

  • First experience with a Spark cluster

```
spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --num-executors 1 \
  --driver-memory 1g \
  --executor-cores 1 \
  --conf "spark.app.name=SparkPi" \
  /opt/cloudera/xx.jar
```

Alternatively, Spark ships a wrapper script that runs the same bundled example:

```
run-example SparkPi
```

Prerequisites

  • Not enough memory: raise the YARN memory limits in Cloudera Manager (CM):

    yarn.scheduler.maximum-allocation-mb
    yarn.nodemanager.resource.memory-mb

  • HDFS: the user submitting jobs needs a home directory on HDFS (see below)
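For reference, outside of CM these two limits live in yarn-site.xml. The fragment below is only an illustration; the 4096 MB values are placeholders, not recommendations:

```xml
<!-- yarn-site.xml: illustrative values only -->
<property>
  <!-- Total memory YARN may allocate to containers on this node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <!-- Largest single container the scheduler will grant -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
```

The scheduler maximum must not exceed the per-node resource limit, or container requests above the node capacity can never be satisfied.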

Running spark-shell directly as root fails with the following error:

```
╰─# spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/02/27 16:47:23 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
```

Inspect the HDFS root:

```
╰─# hadoop fs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase            0 2020-02-27 15:49 /hbase
drwxrwxrwt   - hdfs  supergroup       0 2020-02-27 01:22 /tmp
drwxr-xr-x   - hdfs  supergroup       0 2020-02-27 01:22 /user
```

Solution

You need to have a user home directory on HDFS. Log in as the HDFS user and create a home directory for root. The same applies to other users.

Hadoop's superuser is hdfs, which by default cannot log in directly, so run the commands via sudo -u:

```
sudo -u hdfs hadoop fs -mkdir /user/root
sudo -u hdfs hadoop fs -chown root:root /user/root
```

```
╰─# hadoop fs -ls /user
Found 6 items
drwxrwxrwx   - mapred hadoop           0 2020-02-27 01:21 /user/history
drwxrwxr-t   - hive   hive             0 2020-02-27 01:21 /user/hive
drwxrwxr-x   - hue    hue              0 2020-02-27 01:22 /user/hue
drwxr-xr-x   - root   root             0 2020-02-27 17:27 /user/root
drwxr-x--x   - spark  spark            0 2020-02-27 16:59 /user/spark
drwxr-xr-x   - hdfs   supergroup       0 2020-02-27 01:20 /user/yarn
```

With the home directory in place, root can run spark-shell.

In fact, running it directly as the spark user works without any setup:

```
╰─# sudo -u spark spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/02/27 17:31:33 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
20/02/27 17:31:33 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Spark context Web UI available at http://master-23:4040
Spark context available as 'sc' (master = yarn, app id = application_1582792452399_0004).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.3.2
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
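Once the shell is up, a quick smoke test is a one-off Monte Carlo estimate of π, the same idea as the SparkPi example above (the sample count here is arbitrary):

```scala
// Sample random points in the unit square and count how many fall
// inside the quarter circle; the ratio approximates pi / 4.
val n = 100000
val inside = sc.parallelize(1 to n).filter { _ =>
  val x = math.random
  val y = math.random
  x * x + y * y <= 1
}.count()
println(s"Pi is roughly ${4.0 * inside / n}")
```

If this prints a value near 3.14, the shell is really scheduling work on the YARN executors, not just running locally.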

References

Hadoop 3.2 documentation

RDD

Resilient Distributed Dataset: a distributed collection of data.

Creating an RDD

  • SparkContext's parallelize
  • Reading external data (HDFS, message queues, etc.)
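Both creation paths can be sketched from the shell; the HDFS path below is a made-up placeholder, not a file that exists on this cluster:

```scala
// 1) From an in-memory collection
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(nums.sum())  // 15.0

// 2) From external storage (path is illustrative)
val lines = sc.textFile("hdfs:///user/spark/input.txt")
println(lines.count())  // number of lines in the file
```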

Pair RDD: key-value operations
https://blog.csdn.net/u014646662/article/details/84673920
https://blog.csdn.net/JasonDing1354/article/details/46845585
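A minimal sketch of the kind of key-value operation those posts cover, runnable in the same spark-shell session (the data is made up):

```scala
// An RDD of (key, value) tuples gets extra operations such as reduceByKey
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey merges all values that share a key with the given function
val sums = pairs.reduceByKey(_ + _)
sums.collect().foreach(println)  // (a,4) and (b,2), in no guaranteed order
```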

Spark SQL
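As a starting point, a DataFrame can be registered as a temporary view and queried with SQL from the same shell; the table and column names below are illustrative (toDF relies on the implicits spark-shell imports automatically):

```scala
// Build a small DataFrame from an in-memory Seq
val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Register it so SQL can refer to it by name
df.createOrReplaceTempView("people")

// Run a SQL query against the view
spark.sql("SELECT name FROM people WHERE id = 2").show()
```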