Several problems encountered while using CarbonData, and how to solve them

This article summarizes a few problems I ran into while using CarbonData, together with their solutions. The environment used here is Spark 2.1.0 and CarbonData 1.2.0.


The HDFS nameservices must be specified

When the CarbonSession is initialized without specifying the HDFS nameservices, loading data works fine, but queries then fail because the data cannot be found:

scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs:///user/iteblog/carb")

scala> carbon.sql("""CREATE TABLE temp.iteblog(id bigint) STORED BY 'carbondata'""")

17/11/09 16:20:58 AUDIT command.CreateTable: [www.iteblog.com][iteblog][Thread-1]Creating Table with Database name [temp] and Table name [iteblog]

17/11/09 16:20:58 WARN hive.HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `temp`.`iteblog` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

17/11/09 16:20:59 AUDIT command.CreateTable: [www.iteblog.com][iteblog][Thread-1]Table created with Database name [temp] and Table name [iteblog]

res2: org.apache.spark.sql.DataFrame = []

scala> carbon.sql("insert overwrite table temp.iteblog select id from temp.mytable limit 10")

17/11/09 16:21:46 AUDIT rdd.CarbonDataRDDFactory$: [www.iteblog.com][iteblog][Thread-1]Data load request has been received for table temp.iteblog

17/11/09 16:21:46 WARN util.CarbonDataProcessorUtil: main sort scope is set to LOCAL_SORT

17/11/09 16:23:03 AUDIT rdd.CarbonDataRDDFactory$: [www.iteblog.com][iteblog][Thread-1]Data load is successful for temp.iteblog

res3: org.apache.spark.sql.DataFrame = []

scala> carbon.sql("select * from temp.iteblog limit 10").show(10,100)

17/11/09 16:23:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 1011, static.iteblog.com, executor 2): java.lang.RuntimeException: java.io.FileNotFoundException: /user/iteblog/carb/temp/iteblog/Fact/Part0/Segment_0/part-0-0_batchno0-0-1510215706696.carbondata (No such file or directory)

at org.apache.carbondata.core.indexstore.blockletindex.IndexWrapper.<init>(IndexWrapper.java:39)

at org.apache.carbondata.core.scan.executor.impl.AbstractQueryExecutor.initQuery(AbstractQueryExecutor.java:141)

at org.apache.carbondata.core.scan.executor.impl.AbstractQueryExecutor.getBlockExecutionInfos(AbstractQueryExecutor.java:216)

at org.apache.carbondata.core.scan.executor.impl.VectorDetailQueryExecutor.execute(VectorDetailQueryExecutor.java:36)

at org.apache.carbondata.spark.vectorreader.VectorizedCarbonRecordReader.initialize(VectorizedCarbonRecordReader.java:116)

at org.apache.carbondata.spark.rdd.CarbonScanRDD.internalCompute(CarbonScanRDD.scala:229)

at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:62)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

at org.apache.spark.scheduler.Task.run(Task.scala:99)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.io.FileNotFoundException: /user/iteblog/carb/temp/iteblog/Fact/Part0/Segment_0/part-0-0_batchno0-0-1510215706696.carbondata (No such file or directory)

at java.io.FileInputStream.open(Native Method)

at java.io.FileInputStream.<init>(FileInputStream.java:138)

at java.io.FileInputStream.<init>(FileInputStream.java:93)

at org.apache.carbondata.core.datastore.impl.FileFactory.getDataInputStream(FileFactory.java:128)

at org.apache.carbondata.core.reader.ThriftReader.open(ThriftReader.java:77)

at org.apache.carbondata.core.reader.CarbonHeaderReader.readHeader(CarbonHeaderReader.java:46)

at org.apache.carbondata.core.util.DataFileFooterConverterV3.getSchema(DataFileFooterConverterV3.java:90)

at org.apache.carbondata.core.util.CarbonUtil.readMetadatFile(CarbonUtil.java:925)

at org.apache.carbondata.core.indexstore.blockletindex.IndexWrapper.<init>(IndexWrapper.java:37)

... 20 more

As you can see, if the HDFS nameservices is not specified when the CarbonSession is created, loading data works fine, but queries fail because the files cannot be found. The most direct fix is to specify the HDFS nameservices when creating the CarbonSession. A further improvement would be for CarbonData to fill in the HDFS nameservices automatically from the Hadoop configuration it is given.
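For reference, a minimal sketch of the fix in the spark-shell (the nameservices name mycluster is the one that appears in the logs later in this post; replace it and the store path with your own):

// import the implicit that adds getOrCreateCarbonSession to SparkSession.Builder
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

// pass the full hdfs://<nameservices> prefix so executors resolve the same absolute location
val carbon = SparkSession
  .builder()
  .config(sc.getConf)
  .getOrCreateCarbonSession("hdfs://mycluster/user/iteblog/carb")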

The tinyint data type is not supported

If you try to create a table with a tinyint column, the statement fails:

scala> carbon.sql("""CREATE TABLE temp.iteblog(status tinyint) STORED BY 'carbondata'""")

org.apache.carbondata.spark.exception.MalformedCarbonCommandException: Unsupported data type: StructField(status,ByteType,true).getType

at org.apache.spark.sql.parser.CarbonSpark2SqlParser$$anonfun$getFields$1.apply(CarbonSpark2SqlParser.scala:427)

at org.apache.spark.sql.parser.CarbonSpark2SqlParser$$anonfun$getFields$1.apply(CarbonSpark2SqlParser.scala:417)

at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

at scala.collection.immutable.List.foreach(List.scala:381)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)

at scala.collection.immutable.List.map(List.scala:285)

at org.apache.spark.sql.parser.CarbonSpark2SqlParser.getFields(CarbonSpark2SqlParser.scala:417)

at org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:135)

at org.apache.spark.sql.parser.CarbonSqlAstBuilder.visitCreateTable(CarbonSparkSqlParser.scala:72)

at org.apache.spark.sql.catalyst.parser.SqlBaseParser$CreateTableContext.accept(SqlBaseParser.java:578)

at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:42)

at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)

at org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleStatement$1.apply(AstBuilder.scala:66)

at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:93)

at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:65)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:54)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:82)

at org.apache.spark.sql.parser.CarbonSparkSqlParser.parse(CarbonSparkSqlParser.scala:68)

at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)

at org.apache.spark.sql.parser.CarbonSparkSqlParser.parsePlan(CarbonSparkSqlParser.scala:49)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)

... 50 elided

This is because CarbonData does not currently support the tinyint type; the data types it does support are listed at http://carbondata.apache.org/supported-data-types-in-carbondata.html. Oddly, CARBONDATA-18 already resolved this issue, so it is unclear why the current version no longer supports it.
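A simple way around this, assuming a wider integer type is acceptable for the column, is to declare it with a supported type and cast on insert. A sketch (temp.mytable and the status column are just the example names used above):

// create the table with a supported integer type instead of tinyint
carbon.sql("""CREATE TABLE temp.iteblog(status smallint) STORED BY 'carbondata'""")

// cast the tinyint source column while loading
carbon.sql("insert into table temp.iteblog select cast(status as smallint) from temp.mytable")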

NoSuchTableException when adding a partition

If you add a partition with a statement like ALTER TABLE temp.iteblog ADD PARTITION('2017'), you will run into the following exception:

scala> carbon.sql("ALTER TABLE temp.iteblog ADD PARTITION('2012')")

org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'iteblog' not found in database 'default';

at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:76)

at org.apache.spark.sql.hive.client.HiveClient$$anonfun$getTable$1.apply(HiveClient.scala:76)

at scala.Option.getOrElse(Option.scala:121)

at org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:76)

at org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)

at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:110)

at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:110)

at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)

at org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable(HiveExternalCatalog.scala:109)

at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:601)

at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$getTable$1.apply(HiveExternalCatalog.scala:601)

at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)

at org.apache.spark.sql.hive.HiveExternalCatalog.getTable(HiveExternalCatalog.scala:600)

at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:106)

at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)

at org.apache.spark.sql.hive.CarbonSessionCatalog.lookupRelation(CarbonSessionState.scala:83)

at org.apache.spark.sql.internal.CatalogImpl.refreshTable(CatalogImpl.scala:461)

at org.apache.spark.sql.execution.command.AlterTableSplitPartitionCommand.processSchema(carbonTableSchema.scala:283)

at org.apache.spark.sql.execution.command.AlterTableSplitPartitionCommand.run(carbonTableSchema.scala:229)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)

at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)

at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)

at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)

at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)

at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)

... 50 elided

After the SQL statement above runs, the partition has in fact already been added to the CarbonData table, but refreshing the table metadata through Spark fails. As the error shows, even though we passed the database the table lives in, Spark's catalyst never received it: the code simply does not hand the table's database information over to catalyst. The same bug also affects partition split operations. It has already been fixed in CARBONDATA-1593.
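Until you are on a build that contains that fix, one possible (unverified) workaround follows from the error message: the refresh falls back to the current database 'default', so switching the session to the table's database first may let the lookup succeed. A sketch, not guaranteed on every build:

// switch the current database before altering the partition
// (assumption: the refresh then resolves 'iteblog' in 'temp' instead of 'default')
carbon.sql("use temp")
carbon.sql("ALTER TABLE iteblog ADD PARTITION('2012')")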

Running insert overwrite three or more times produces an NPE

If you run insert overwrite three or more times while loading data, then congratulations, you are guaranteed to hit the exception below:

scala> carbon.sql("insert overwrite table temp.iteblog select id from co.order_common_p where dt = '2012-10'")

17/10/26 13:00:05 AUDIT rdd.CarbonDataRDDFactory$: [www.iteblog.com][iteblog][Thread-1]Data load request has been received for table temp.iteblog

17/10/26 13:00:05 WARN util.CarbonDataProcessorUtil: main sort scope is set to LOCAL_SORT

17/10/26 13:00:08 ERROR filesystem.AbstractDFSCarbonFile: main Exception occurred:File does not exist: hdfs://mycluster/user/iteblog/carb/temp/iteblog/Fact/Part0/Segment_0

17/10/26 13:00:09 ERROR command.LoadTable: main

java.lang.NullPointerException

at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.isDirectory(AbstractDFSCarbonFile.java:88)

at org.apache.carbondata.core.util.CarbonUtil.deleteRecursive(CarbonUtil.java:364)

at org.apache.carbondata.core.util.CarbonUtil.access$100(CarbonUtil.java:93)

at org.apache.carbondata.core.util.CarbonUtil$2.run(CarbonUtil.java:326)

at org.apache.carbondata.core.util.CarbonUtil$2.run(CarbonUtil.java:322)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)

at org.apache.carbondata.core.util.CarbonUtil.deleteFoldersAndFiles(CarbonUtil.java:322)

at org.apache.carbondata.spark.load.CarbonLoaderUtil.recordLoadMetadata(CarbonLoaderUtil.java:333)

at org.apache.carbondata.spark.rdd.CarbonDataRDDFactory$.updateStatus$1(CarbonDataRDDFactory.scala:595)

at org.apache.carbondata.spark.rdd.CarbonDataRDDFactory$.loadCarbonData(CarbonDataRDDFactory.scala:1107)

at org.apache.spark.sql.execution.command.LoadTable.processData(carbonTableSchema.scala:1046)

at org.apache.spark.sql.execution.command.LoadTable.run(carbonTableSchema.scala:754)

at org.apache.spark.sql.execution.command.LoadTableByInsert.processData(carbonTableSchema.scala:651)

at org.apache.spark.sql.execution.command.LoadTableByInsert.run(carbonTableSchema.scala:637)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)

at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)

at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)

at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)

at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)

at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)

at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)

at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:36)

at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:38)

at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:40)

at $line20.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:42)

at $line20.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:44)

at $line20.$read$$iw$$iw$$iw$$iw.<init>(<console>:46)

at $line20.$read$$iw$$iw$$iw.<init>(<console>:48)

at $line20.$read$$iw$$iw.<init>(<console>:50)

at $line20.$read$$iw.<init>(<console>:52)

at $line20.$read.<init>(<console>:54)

at $line20.$read$.<init>(<console>:58)

at $line20.$read$.<clinit>(<console>)

at $line20.$eval$.$print$lzycompute(<console>:7)

at $line20.$eval$.$print(<console>:6)

at $line20.$eval.$print()

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)

at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)

at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)

at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)

at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)

at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)

at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)

at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)

at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)

at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)

at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)

at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)

at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:415)

at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:923)

at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)

at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)

at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)

at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)

at org.apache.spark.repl.Main$.doMain(Main.scala:68)

at org.apache.spark.repl.Main$.main(Main.scala:51)

at org.apache.spark.repl.Main.main(Main.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:498)

at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)

at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)

at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

17/10/26 13:00:09 AUDIT command.LoadTable: [www.iteblog.com][iteblog][Thread-1]Dataload failure for temp.iteblog. Please check the logs

java.lang.NullPointerException

at org.apache.carbondata.core.datastore.filesystem.AbstractDFSCarbonFile.isDirectory(AbstractDFSCarbonFile.java:88)

at org.apache.carbondata.core.util.CarbonUtil.deleteRecursive(CarbonUtil.java:364)

at org.apache.carbondata.core.util.CarbonUtil.access$100(CarbonUtil.java:93)

at org.apache.carbondata.core.util.CarbonUtil$2.run(CarbonUtil.java:326)

at org.apache.carbondata.core.util.CarbonUtil$2.run(CarbonUtil.java:322)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)

at org.apache.carbondata.core.util.CarbonUtil.deleteFoldersAndFiles(CarbonUtil.java:322)

at org.apache.carbondata.spark.load.CarbonLoaderUtil.recordLoadMetadata(CarbonLoaderUtil.java:333)

at org.apache.carbondata.spark.rdd.CarbonDataRDDFactory$.updateStatus$1(CarbonDataRDDFactory.scala:595)

at org.apache.carbondata.spark.rdd.CarbonDataRDDFactory$.loadCarbonData(CarbonDataRDDFactory.scala:1107)

at org.apache.spark.sql.execution.command.LoadTable.processData(carbonTableSchema.scala:1046)

at org.apache.spark.sql.execution.command.LoadTable.run(carbonTableSchema.scala:754)

at org.apache.spark.sql.execution.command.LoadTableByInsert.processData(carbonTableSchema.scala:651)

at org.apache.spark.sql.execution.command.LoadTableByInsert.run(carbonTableSchema.scala:637)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)

at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)

at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)

at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)

at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)

at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)

at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)

at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)

at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)

... 50 elided

scala>

Although the NPE is thrown, the data has actually already been loaded into the CarbonData table. The cause is that every insert overwrite has to delete the previous data (that is, the Segment directory), but the Segment directory ends up being deleted twice; the second attempt cannot find the directory, hence the NPE. This problem was fixed in CARBONDATA-1486.

Columns longer than 32767 characters are not supported

If a column contains values longer than 32767 characters (Short.MaxValue) and enable.unsafe.sort=true, loading data into the CarbonData table fails with the following exception:

java.lang.NegativeArraySizeException

at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.getRow(UnsafeCarbonRowPage.java:182)

at org.apache.carbondata.processing.newflow.sort.unsafe.holder.UnsafeInmemoryHolder.readRow(UnsafeInmemoryHolder.java:63)

at org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeSingleThreadFinalSortFilesMerger.startSorting(UnsafeSingleThreadFinalSortFilesMerger.java:114)

at org.apache.carbondata.processing.newflow.sort.unsafe.merger.UnsafeSingleThreadFinalSortFilesMerger.startFinalMerge(UnsafeSingleThreadFinalSortFilesMerger.java:81)

at org.apache.carbondata.processing.newflow.sort.impl.UnsafeParallelReadMergeSorterImpl.sort(UnsafeParallelReadMergeSorterImpl.java:105)

at org.apache.carbondata.processing.newflow.steps.SortProcessorStepImpl.execute(SortProcessorStepImpl.java:62)

at org.apache.carbondata.processing.newflow.steps.DataWriterProcessorStepImpl.execute(DataWriterProcessorStepImpl.java:87)

at org.apache.carbondata.processing.newflow.DataLoadExecutor.execute(DataLoadExecutor.java:51)

at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD$$anon$2.<init>(NewCarbonDataLoadRDD.scala:442)

at org.apache.carbondata.spark.rdd.NewDataFrameLoaderRDD.internalCompute(NewCarbonDataLoadRDD.scala:405)

at org.apache.carbondata.spark.rdd.CarbonRDD.compute(CarbonRDD.scala:62)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

This is a design flaw in CarbonData, and there is currently no real fix for it, although a data type similar to varchar(size) could be implemented.
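As a stop-gap rather than a fix, you can keep the offending column below the limit at load time, for example by truncating it in the insert statement. A sketch, assuming a hypothetical source table temp.mytable with an over-long string column named payload, and assuming losing the tail of the value is acceptable:

// keep the value comfortably below Short.MaxValue (32767) so the unsafe sort row page can hold it
carbon.sql("""
  insert into table temp.iteblog
  select substring(payload, 1, 32000) from temp.mytable
""")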

A wrong date format leads to data loss

If you load data that contains a date column into a CarbonData table, the data may be silently lost:

scala> carbon.sql("""CREATE TABLE temp.iteblog(dt DATE) STORED BY 'carbondata'""")

17/11/09 16:44:46 AUDIT command.CreateTable: [www.iteblog.com][iteblog][Thread-1]Creating Table with Database name [temp] and Table name [iteblog]

17/11/09 16:44:47 WARN hive.HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.CarbonSource. Persisting data source table `temp`.`iteblog` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

17/11/09 16:44:47 AUDIT command.CreateTable: [www.iteblog.com][iteblog][Thread-1]Table created with Database name [temp] and Table name [iteblog]

res1: org.apache.spark.sql.DataFrame = []

scala> carbon.sql("select dt from temp.mydate").show(10,100)

17/11/09 16:44:52 ERROR lzo.LzoCodec: Failed to load/initialize native-lzo library

+--------+

| dt|

+--------+

|20170509|

|20170511|

|20170507|

|20170504|

|20170502|

|20170506|

|20170501|

|20170508|

|20170510|

|20170505|

+--------+

only showing top 10 rows

scala> carbon.sql("insert into table temp.iteblog select dt from temp.mydate limit 10")

17/11/09 16:45:14 AUDIT rdd.CarbonDataRDDFactory$: [www.iteblog.com][iteblog][Thread-1]Data load request has been received for table temp.iteblog

17/11/09 16:45:14 WARN util.CarbonDataProcessorUtil: main sort scope is set to LOCAL_SORT

17/11/09 16:45:16 AUDIT rdd.CarbonDataRDDFactory$: [www.iteblog.com][iteblog][Thread-1]Data load is successful for temp.iteblog

res3: org.apache.spark.sql.DataFrame = []

scala> carbon.sql("select * from temp.iteblog limit 10").show(10,100)

+----+

| dt|

+----+

|null|

|null|

|null|

|null|

|null|

|null|

|null|

|null|

|null|

|null|

+----+

This happens because CarbonData has a default format for the DATE type, controlled by the carbon.date.format parameter, whose default is yyyy-MM-dd. Parsing a value like 20170505 with the yyyy-MM-dd format is bound to fail, so the values end up as null and the data is lost. Likewise, the TIMESTAMP type has its own default format, controlled by carbon.timestamp.format, which defaults to yyyy-MM-dd HH:mm:ss.
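Two ways to avoid the nulls, sketched under the assumption that the source column stores dates as yyyyMMdd strings (as in 20170505 above); adjust the pattern to your data:

// (1) tell CarbonData which format the incoming dates use, before loading
import org.apache.carbondata.core.util.CarbonProperties
CarbonProperties.getInstance().addProperty("carbon.date.format", "yyyyMMdd")

// (2) or normalize the values to the default yyyy-MM-dd format inside the insert itself
carbon.sql("""
  insert into table temp.iteblog
  select to_date(from_unixtime(unix_timestamp(cast(dt as string), 'yyyyMMdd')))
  from temp.mydate limit 10
""")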
