I. Build Environment
- Five virtual machines created on OpenStack: one master node (hostname bigdatamaster) and four slave nodes (hostnames bigdataslave1, bigdataslave2, bigdataslave3, bigdataslave4)
- OS: CentOS 7.2_1511
- JDK: Oracle JDK 1.8_191
- Maven: 3.5.2
- Hadoop: Apache Hadoop 2.7.1
- Hive: 0.13.1
- Scala: 2.11.8
- Spark: 2.3.2
- CarbonData: 1.5.0
II. Build Process
1. Selecting the Source Code
Download the source from the CarbonData archive (http://archive.apache.org/dist/carbondata/1.5.0/ or https://dist.apache.org/repos/dist/release/carbondata/):
[root@bigdatamaster Desktop]# wget https://dist.apache.org/repos/dist/release/carbondata/1.5.0/apache-carbondata-1.5.0-source-release.zip
...
[root@bigdatamaster Desktop]# ls
apache-carbondata-1.5.0-source-release.zip
[root@bigdatamaster Desktop]# unzip apache-carbondata-1.5.0-source-release.zip
[root@bigdatamaster Desktop]# ls
apache-carbondata-1.5.0-source-release.zip carbondata-parent-1.5.0
[root@bigdatamaster carbondata-parent-1.5.0]# ls
assembly build conf datamap dev examples hadoop LICENSE NOTICE processing store tools
bin common core DEPENDENCIES docs format integration licenses-binary pom.xml README.md streaming
Note: if the underlying Hadoop version is 2.7.2, the Scala version is 2.11.8, and the Spark version is 2.2.1, the prebuilt release can be used directly and there is no need to build from source. Since the cluster in this article runs Hadoop 2.7.1, Scala 2.11.8, and Spark 2.3.2, the source must be downloaded and rebuilt. You can confirm the versions actually deployed before deciding, as sketched below.
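A minimal sketch for checking the deployed versions, using the standard version flags (assuming the commands are on your PATH):
hadoop version          # prints the Hadoop release, e.g. "Hadoop 2.7.1"
spark-submit --version  # prints the Spark version and the Scala version it was built against
mvn -version            # confirms Maven and the JDK used for the build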
2. Building the Source
[root@bigdatamaster carbondata-parent-1.5.0]# mvn -DskipTests -Pspark-2.3 -Dspark.version=2.3.2 -Dhadoop.version=2.7.1 clean package
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [ 8.404 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [ 15.636 s]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [01:00 min]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 27.459 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 13.402 s]
[INFO] Apache CarbonData :: Streaming ..................... SUCCESS [03:11 min]
[INFO] Apache CarbonData :: Store SDK ..................... SUCCESS [ 37.462 s]
[INFO] Apache CarbonData :: Spark Datasource .............. SUCCESS [01:31 min]
[INFO] Apache CarbonData :: Spark Common .................. SUCCESS [01:32 min]
[INFO] Apache CarbonData :: Search ........................ SUCCESS [ 34.174 s]
[INFO] Apache CarbonData :: Lucene Index DataMap .......... SUCCESS [01:35 min]
[INFO] Apache CarbonData :: Bloom Index DataMap ........... SUCCESS [ 13.619 s]
[INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [02:35 min]
[INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [01:09 min]
[INFO] Apache CarbonData :: DataMap Examples .............. SUCCESS [ 3.621 s]
[INFO] Apache CarbonData :: Assembly ...................... SUCCESS [ 15.694 s]
[INFO] Apache CarbonData :: CLI ........................... SUCCESS [ 24.015 s]
[INFO] Apache CarbonData :: Hive .......................... SUCCESS [ 32.317 s]
[INFO] Apache CarbonData :: presto ........................ SUCCESS [01:06 min]
[INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [ 52.898 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18:22 min
[INFO] Finished at: 2019-04-20T21:36:16+08:00
[INFO] Final Memory: 201M/1411M
[INFO] ------------------------------------------------------------------------
...
Inspecting pom.xml shows that the default Scala version for compilation is already 2.11.8, so there is no need to add "-Pscala-2.11 -Dscala.version=2.11.8" here to pin the Scala version.
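One way to verify this is to grep the version properties in the parent pom (a sketch; the property names are an assumption based on the usual Spark-ecosystem pom layout):
grep -m1 "<scala.version>" pom.xml          # expect 2.11.8
grep -m1 "<scala.binary.version>" pom.xml   # expect 2.11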
After the build completes, the CarbonData jar built against the specified Hadoop and Spark versions appears under the assembly module's target directory, as the listing below shows.
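The jar lands under assembly/target/scala-2.11 (the same path the cp command in the installation section copies from):
ls assembly/target/scala-2.11/
# apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar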
III. Installation
1. Install and configure CarbonData on the Spark cluster by following the official documentation at https://carbondata.apache.org/quick-start-guide.html (the "Installing and Configuring CarbonData on Standalone Spark Cluster" section).
Prerequisites:
- Hadoop HDFS and YARN are both running normally (a Hadoop 2.7.1 cluster is installed and running; a quick check is sketched after this list)
- Spark is running normally (a Spark 2.3.2 cluster is installed and running)
- The CarbonData user must have access permissions on HDFS (everything here runs as root)
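A quick way to sanity-check these prerequisites is jps on each node (a sketch; which daemons appear depends on what is co-located where):
jps
# expected on bigdatamaster: NameNode, ResourceManager, Master (Spark standalone)
# expected on each bigdataslaveN: DataNode, NodeManager, Worker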
1) Create a carbonlib directory under the Spark installation directory:
[root@bigdatamaster ~]# cd $SPARK_HOME
[root@bigdatamaster spark-2.3.2]# ls
bin data jars LICENSE NOTICE R RELEASE sparkdata
conf examples kubernetes licenses python README.md sbin yarn
[root@bigdatamaster spark-2.3.2]# mkdir carbonlib
[root@bigdatamaster spark-2.3.2]# ls
bin conf examples kubernetes licenses python README.md sbin yarn
carbonlib data jars LICENSE NOTICE R RELEASE sparkdata
2) Copy the apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar built in step 2 into the carbonlib directory:
[root@bigdatamaster spark-2.3.2]# cp ~/Desktop/carbondata-parent-1.5.0/assembly/target/scala-2.11/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar ./carbonlib/
[root@bigdatamaster spark-2.3.2]# ls carbonlib/
apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar
3) Edit the spark-env.sh file under the conf directory of the Spark installation and add carbonlib to the Spark classpath:
[root@bigdatamaster spark-2.3.2]# vim conf/spark-env.sh
(append the following line at the end of the file)
export SPARK_CLASSPATH=$SPARK_CLASSPATH:${SPARK_HOME}/carbonlib/*
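Note that SPARK_CLASSPATH is deprecated in Spark 2.x and spark-shell prints a deprecation warning when it is set; it still works here. An alternative sketch uses the extraClassPath settings in spark-defaults.conf instead (listing the jar explicitly, since wildcard handling can differ):
spark.driver.extraClassPath   /root/data/spark-2.3.2/carbonlib/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar
spark.executor.extraClassPath /root/data/spark-2.3.2/carbonlib/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar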
4) Copy the carbon.properties.template file into the conf directory of the Spark installation and rename it carbon.properties:
[root@bigdatamaster carbondata-parent-1.5.0]# ls
assembly common datamap docs hadoop licenses-binary processing streaming
bin conf DEPENDENCIES examples integration NOTICE README.md target
build core dev format LICENSE pom.xml store tools
[root@bigdatamaster carbondata-parent-1.5.0]# ls conf/
carbon.properties.template dataload.properties.template
[root@bigdatamaster carbondata-parent-1.5.0]# cp conf/carbon.properties.template $SPARK_HOME/conf/carbon.properties
[root@bigdatamaster carbondata-parent-1.5.0]# ls $SPARK_HOME/conf
carbon.properties metrics.properties.template spark-env.sh
docker.properties.template slaves spark-env.sh.template
fairscheduler.xml.template slaves.template
log4j.properties.template spark-defaults.conf.template
5) Configure carbon.properties:
[root@bigdatamaster spark-2.3.2]# vim conf/carbon.properties
(add the following lines)
carbon.storelocation=hdfs://bigdatamaster:9000/carbon/Store
carbon.ddl.base.hdfs.url=hdfs://bigdatamaster:9000/carbon/Data
carbon.badRecords.location=hdfs://bigdatamaster:9000/carbon/BadRecords
carbon.lock.type=HDFSLOCK
See https://carbondata.apache.org/configuration-parameters.html for the meaning of these four parameters.
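All four locations live on HDFS. As far as I know CarbonData creates the store path on first use, but pre-creating the directories is a harmless way to verify HDFS write access using the paths configured above (a sketch):
hadoop fs -mkdir -p /carbon/Store /carbon/Data /carbon/BadRecords
hadoop fs -ls /carbon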
6) Configure the spark-defaults.conf file under the conf directory of the Spark installation:
[root@bigdatamaster spark-2.3.2]# cd conf/
[root@bigdatamaster conf]# ls
carbon.properties metrics.properties.template spark-env.sh
docker.properties.template slaves spark-env.sh.template
fairscheduler.xml.template slaves.template
log4j.properties.template spark-defaults.conf.template
[root@bigdatamaster conf]# cp spark-defaults.conf.template spark-defaults.conf
[root@bigdatamaster conf]# ls
carbon.properties metrics.properties.template spark-defaults.conf.template
docker.properties.template slaves spark-env.sh
fairscheduler.xml.template slaves.template spark-env.sh.template
log4j.properties.template spark-defaults.conf
[root@bigdatamaster conf]# vim spark-defaults.conf
(add the following two lines)
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=/root/data/spark-2.3.2/conf/carbon.properties
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=/root/data/spark-2.3.2/conf/carbon.properties
7) Copy hive-site.xml into the conf directory of the Spark installation:
[root@bigdatamaster spark-2.3.2]# cp ~/data/hive-0.13.1/conf/hive-site.xml conf/
Note: if this step is skipped, the table-creation test below still completes, but it reports warnings like the following (Spark falls back to its built-in metastore instead of the Hive one):
scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age Int) STORED BY 'carbondata'")
2019-04-21 14:10:47 WARN ObjectStore:6666 - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
2019-04-21 14:10:47 WARN ObjectStore:568 - Failed to get database default, returning NoSuchObjectException
2019-04-21 14:10:50 WARN ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
2019-04-21 14:10:52 AUDIT CarbonCreateTableCommand:207 - [bigdatamaster][root][Thread-1]Creating Table with Database name [default] and Table name [test_table]
2019-04-21 14:10:53 WARN HiveExternalCatalog:66 - Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `default`.`test_table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
2019-04-21 14:10:54 AUDIT CarbonCreateTableCommand:207 - [bigdatamaster][root][Thread-1]Table created with Database name [default] and Table name [test_table]
res0: org.apache.spark.sql.DataFrame = []
8) Repeat steps 1) through 6) on the other Spark nodes.
Here I transferred everything directly with scp. Steps 1) through 6) touch the following directory and files under the Spark installation directory:
- carbonlib ($SPARK_HOME/carbonlib)
- spark-env.sh ($SPARK_HOME/conf/spark-env.sh)
- spark-defaults.conf ($SPARK_HOME/conf/spark-defaults.conf)
- carbon.properties ($SPARK_HOME/conf/carbon.properties)
Simply scp this one directory and the three configuration files to the corresponding locations on the other nodes of the Spark cluster (a loop form is sketched after these commands):
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave1:~/data/spark-2.3.2/
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave2:~/data/spark-2.3.2/
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave3:~/data/spark-2.3.2/
[root@bigdatamaster spark-2.3.2]# scp -r carbonlib root@bigdataslave4:~/data/spark-2.3.2/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave1:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave2:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave3:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-env.sh root@bigdataslave4:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave1:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave2:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave3:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/spark-defaults.conf root@bigdataslave4:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave1:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave2:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave3:~/data/spark-2.3.2/conf/
[root@bigdatamaster spark-2.3.2]# scp conf/carbon.properties root@bigdataslave4:~/data/spark-2.3.2/conf/
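The sixteen scp commands above can be collapsed into a loop; a sketch, assuming the same ~/data/spark-2.3.2 layout on every slave:
for i in 1 2 3 4; do
  scp -r carbonlib root@bigdataslave$i:~/data/spark-2.3.2/
  scp conf/spark-env.sh conf/spark-defaults.conf conf/carbon.properties \
    root@bigdataslave$i:~/data/spark-2.3.2/conf/
done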
IV. Testing
1) Create the test data and upload it to HDFS (note: the test dataset and test steps below are taken from reference 9; my thanks to its author)
[root@bigdatamaster Desktop]# vim carbonTestData.csv
(add the following four lines)
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
[root@bigdatamaster Desktop]# hadoop dfs -put carbonTestData.csv /
[root@bigdatamaster Desktop]# hadoop dfs -ls /
Found 6 items
-rw-r--r-- 2 root supergroup 78 2019-04-21 14:06 /carbonTestData.csv
drwxr-xr-x - root supergroup 0 2019-04-04 14:55 /hadoopdata
drwxr-xr-x - root supergroup 0 2019-03-06 16:16 /hbase
drwxr-xr-x - root supergroup 0 2019-03-06 16:04 /root
drwxrwxr-x - root supergroup 0 2019-03-06 17:01 /tmp
drwxr-xr-x - root supergroup 0 2019-03-06 16:27 /user
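To double-check the upload, you can cat the file back from HDFS (same hadoop dfs form as above, though hdfs dfs is the current spelling):
hadoop dfs -cat /carbonTestData.csv
# should print the four CSV lines entered above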
2) Start the spark-shell on bigdatamaster by entering the following command in the terminal:
spark-shell \
--master spark://bigdatamaster:7077 \
--jars /root/data/spark-2.3.2/carbonlib/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar \
--total-executor-cores 2 \
--executor-memory 2G
[root@bigdatamaster ~]# spark-shell \
> --master spark://bigdatamaster:7077 \
> --jars /root/data/spark-2.3.2/carbonlib/apache-carbondata-1.5.0-bin-spark2.3.2-hadoop2.7.1.jar \
> --total-executor-cores 2 \
> --executor-memory 2G
2019-04-21 14:08:56 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://bigdatamaster:4040
Spark context available as 'sc' (master = spark://bigdatamaster:7077, app id = app-20190421140908-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._

scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://bigdatamaster:9000/carbon/Store")
2019-04-21 14:10:07 WARN SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The enable unsafe sort value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The enable off heap sort value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The custom block distribution value "null" is invalid. Using the default value "false
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The enable vector reader value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The carbon task distribution value "null" is invalid. Using the default value "block
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The enable auto handoff value "null" is invalid. Using the default value "true
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The specified value for property carbon.sort.storage.inmemory.size.inmbis invalid.
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The specified value for property 512is invalid.
2019-04-21 14:10:07 WARN CarbonProperties:168 - main The specified value for property carbon.sort.storage.inmemory.size.inmbis invalid. Taking the default value.512
carbon: org.apache.spark.sql.SparkSession = org.apache.spark.sql.CarbonSession@53e166ad

(Create the table schema)
scala> carbon.sql("CREATE TABLE IF NOT EXISTS test_table(id string, name string, city string, age Int) STORED BY 'carbondata'")
2019-04-21 14:12:34 AUDIT CarbonCreateTableCommand:207 - [bigdatamaster][root][Thread-1]Creating Table with Database name [default] and Table name [test_table]
res1: org.apache.spark.sql.DataFrame = []

(Load the data)
scala> carbon.sql("LOAD DATA INPATH 'hdfs://bigdatamaster:9000/carbonTestData.csv' INTO TABLE test_table")
2019-04-21 14:21:12 WARN DeleteLoadFolders:168 - main Files are not found in segment hdfs://bigdatamaster:9000/carbon/Store/default/test_table/Fact/Part0/Segment_0 it seems, files are already being deleted
2019-04-21 14:21:12 AUDIT CarbonDataRDDFactory$:207 - [bigdatamaster][root][Thread-1]Data load request has been received for table default.test_table
2019-04-21 14:21:18 AUDIT CarbonDataRDDFactory$:207 - [bigdatamaster][root][Thread-1]Data load is successful for default.test_table
2019-04-21 14:21:18 AUDIT MergeIndexEventListener:207 - [bigdatamaster][root][Thread-1]Load post status event-listener called for merge index
res4: org.apache.spark.sql.DataFrame = []

(Query the table)
scala> carbon.sql("SELECT * FROM test_table").show()
+---+-----+--------+---+
| id| name| city|age|
+---+-----+--------+---+
| 1|david|shenzhen| 31|
| 2|eason|shenzhen| 27|
| 3|jarry| wuhan| 35|
+---+-----+--------+---+

scala> carbon.sql("SELECT city, avg(age), sum(age) FROM test_table GROUP BY city").show()
+--------+--------+--------+
| city|avg(age)|sum(age)|
+--------+--------+--------+
| wuhan| 35.0| 35|
|shenzhen| 29.0| 58|
+--------+--------+--------+
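Back in a shell, you can also inspect how CarbonData laid out the loaded segment under the store location (the table path below follows the segment path printed in the LOAD log above; the exact file layout inside varies by version):
hadoop dfs -ls -R /carbon/Store/default/test_table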
V. References
1. CarbonData usage example (Java): https://blog.csdn.net/u013181284/article/details/77574094
2. Building CarbonData, installing it, and integrating with Spark 2.2: https://blog.csdn.net/wuzhilon88/article/details/78864735
3. Spark 2.1.0 + CarbonData 1.0.0 cluster deployment and getting started: https://blog.csdn.net/coridc/article/details/61915801
4. Apache CarbonData: a new Hadoop file format for faster data analysis: https://blog.csdn.net/u011239443/article/details/52015680
5. [Mind map] Comparing the Parquet, ORC, and CarbonData columnar storage formats: https://blog.csdn.net/lxhandlbb/article/details/80754252
6. CarbonData installation guide: https://blog.csdn.net/u013181284/article/details/73331170
7. A roundup of Apache CarbonData learning materials: https://blog.csdn.net/xubo245/article/details/84336960
8. Apache CarbonData documentation in Chinese: https://www.iteblog.com/archives/tag/carbondata/
9. Building and deploying Apache CarbonData 1.0.0 on Mac OS: https://ask.hellobi.com/blog/marsj/6164
VI. Appendix
To finish, a few screenshots from the mvn source build were attached here; quite a treat for the eyes. (Images not reproduced.)