Apache Spark 2.1.0 Officially Released, with Major Advances in Structured Streaming

iteblog 过往记忆大数据

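Apache Spark 2.1.0 is the second release on the 2.x line. It makes significant progress toward production readiness for Structured Streaming, which now supports event time watermarks and Kafka 0.10. Beyond that, the release focuses on usability, stability, and polish, and resolves more than 1200 tickets. The changes in this release are listed below.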

Core and Spark SQL

API updates
- SPARK-17864: Data type APIs are stable APIs.
- SPARK-18351: from_json and to_json for parsing JSON for string columns.
- SPARK-16700: When creating a DataFrame in PySpark, Python dictionaries can be used as values of a StructType.

Performance and stability
- SPARK-17861: Scalable Partition Handling. Hive metastore stores all table partition metadata by default for Spark tables stored with Hive’s storage formats as well as tables stored with Spark’s native formats. This change reduces first query latency over partitioned tables and allows for the use of DDL commands to manipulate partitions for tables stored with Spark’s native formats. Users can migrate tables stored with Spark’s native formats created by previous versions by using the MSCK command.
- SPARK-16523: Speeds up group-by aggregate performance by adding a fast aggregation cache that is backed by a row-based hashmap.

Other notable changes
- SPARK-9876: parquet-mr upgraded to 1.8.1.

Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.
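
To make the from_json / to_json item (SPARK-18351) more concrete, here is a minimal plain-Python sketch of the idea: a string column holding JSON is parsed into a structured record restricted to a declared set of fields, and a record is serialized back to a JSON string. The names `parse_json_column` and `to_json_string` are hypothetical helpers invented for this illustration, not Spark APIs, and returning `None` for unparseable input is an assumption modelled on how a null result is commonly used for bad rows.

```python
import json


def parse_json_column(text, fields):
    """Parse one JSON string and keep only the declared fields.

    Hypothetical helper: returns None when the input is missing or is not
    a valid JSON object, instead of raising.
    """
    if text is None:
        return None
    try:
        obj = json.loads(text)
    except (TypeError, ValueError):
        return None
    if not isinstance(obj, dict):
        return None
    # Fields absent from the JSON object become None, like a null struct field.
    return {name: obj.get(name) for name in fields}


def to_json_string(record):
    """Serialize a record back to a compact JSON string (None stays None)."""
    if record is None:
        return None
    return json.dumps(record, separators=(",", ":"), sort_keys=True)


# A tiny "column" of JSON strings, including one malformed value.
column = ['{"id": 1, "name": "a", "extra": true}', "{not json", None]
parsed = [parse_json_column(value, ["id", "name"]) for value in column]
```

In Spark itself the equivalent work is done per row by the from_json and to_json column functions against a declared schema; the sketch only mirrors the shape of that transformation.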

Structured Streaming

API updates
- SPARK-17346: Kafka 0.10 support in Structured Streaming
- SPARK-17731: Metrics for Structured Streaming
- SPARK-17829: Stable format for offset log
- SPARK-18124: Observed delay based Event Time Watermarks
- SPARK-18192: Support all file formats in structured streaming
- SPARK-18516: Separate instantaneous state from progress performance statistics

Stability
- SPARK-17267: Long running structured streaming requirements

Programming guide: Structured Streaming Programming Guide.
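
The event time watermark item (SPARK-18124) can be illustrated with a small simulation. The sketch below is plain Python, not Spark code: it assumes a simplified model in which the watermark is recomputed at the end of each micro-batch as the maximum event time seen so far minus an allowed delay, and events older than the current watermark are dropped. The function name `apply_watermark` and the sample data are made up for the example.

```python
from datetime import datetime, timedelta


def apply_watermark(batches, delay):
    """Simulate watermark-based handling of late events.

    batches: list of micro-batches, each a list of (event_time, value) pairs.
    delay:   how far behind the newest observed event time data may arrive.
    Returns (accepted values, dropped values, final watermark).
    """
    max_event_time = None
    watermark = None
    accepted, dropped = [], []
    for batch in batches:
        for event_time, value in batch:
            # Events older than the watermark fixed by earlier batches are too late.
            if watermark is not None and event_time < watermark:
                dropped.append(value)
            else:
                accepted.append(value)
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
        # Advance the watermark at the end of the batch; it never moves backwards.
        if max_event_time is not None:
            candidate = max_event_time - delay
            if watermark is None or candidate > watermark:
                watermark = candidate
    return accepted, dropped, watermark


t0 = datetime(2016, 12, 28, 12, 0, 0)
sample_batches = [
    [(t0, "a"), (t0 + timedelta(minutes=10), "b")],
    [
        (t0 + timedelta(minutes=6), "c"),   # late, but within the 5-minute delay
        (t0 + timedelta(minutes=1), "d"),   # older than the watermark: dropped
        (t0 + timedelta(minutes=20), "e"),
    ],
]
accepted, dropped, final_watermark = apply_watermark(sample_batches, timedelta(minutes=5))
```

The point of the design is that the engine can bound how much old state it keeps: once the watermark passes a window, that window can no longer receive data and its state can be discarded.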


MLlib

API updates
- SPARK-5992: Locality Sensitive Hashing
- SPARK-7159: Multiclass Logistic Regression in DataFrame-based API
- SPARK-16000: ML persistence: Make model loading backwards-compatible with Spark 1.x with saved models using spark.mllib.linalg.Vector columns in DataFrame-based API

Performance and stability
- SPARK-17748: Faster, more stable LinearRegression for < 4096 features
- SPARK-16719: RandomForest: communicate fewer trees on each iteration

Programming guide: Machine Learning Library (MLlib) Guide.
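
As background for the Locality Sensitive Hashing item (SPARK-5992), the following is a minimal sketch of MinHash, one common LSH family for Jaccard similarity between sets. It is plain Python written for illustration and is not how MLlib implements the feature; the helper names, the use of SHA-1 as the hash family, and the default of 64 hash functions are all assumptions of this example.

```python
import hashlib


def _seeded_hash(seed, item):
    """Hash an item under a given seed; each seed acts as one hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """Return the MinHash signature of a non-empty set of items."""
    return [min(_seeded_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


doc_a = {"spark", "sql", "streaming", "mllib"}
doc_b = {"spark", "sql", "graphx", "mllib"}
estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
exact = exact_jaccard(doc_a, doc_b)
```

Similar sets agree in many signature slots, so signatures can be bucketed to find near neighbours without comparing every pair of sets; more hash functions give a tighter estimate at the cost of longer signatures.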


SparkR

The main focus of SparkR in the 2.1.0 release was adding extensive support for ML algorithms, which include:
- New ML algorithms in SparkR including LDA, Gaussian Mixture Models, ALS, Random Forest, Gradient Boosted Trees, and more
- Support for multinomial logistic regression providing similar functionality as the glmnet R package
- Enable installing third party packages on workers using spark.addFile (SPARK-17577).
- Standalone installable package built with the Apache Spark release. We will be submitting this to CRAN soon.

Programming guide: SparkR (R on Spark).


GraphX

- SPARK-11496: Personalized PageRank

Programming guide: GraphX Programming Guide.
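
For readers unfamiliar with personalized PageRank (SPARK-11496), here is a short plain-Python sketch of the textbook formulation: a random walk that, with some reset probability, jumps back to a single source vertex instead of to a uniformly random one. This is not GraphX code and GraphX's normalization may differ; the function name, the default `reset_prob` of 0.15, the fixed iteration count, and the handling of vertices without outgoing edges are assumptions of this example.

```python
def personalized_pagerank(edges, source, reset_prob=0.15, num_iter=50):
    """Rank vertices by proximity to `source` using power iteration.

    edges: list of (src, dst) directed edges.
    Returns a dict mapping each vertex to its score; scores sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge} | {source})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # All probability mass starts at the source vertex.
    rank = {n: (1.0 if n == source else 0.0) for n in nodes}
    for _ in range(num_iter):
        new_rank = {n: 0.0 for n in nodes}
        for n in nodes:
            targets = out_links[n]
            if targets:
                share = (1.0 - reset_prob) * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a vertex with no out-edges sends its mass back to the source.
                new_rank[source] += (1.0 - reset_prob) * rank[n]
        # The reset step always returns to the source, not to a random vertex.
        new_rank[source] += reset_prob * sum(rank.values())
        rank = new_rank
    return rank


sample_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
scores = personalized_pagerank(sample_edges, source="a")
```

In this sample, vertex "d" links into the cycle but cannot be reached from "a", so it receives no score; that asymmetry is what distinguishes the personalized variant from global PageRank.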


Deprecations

MLlib
- SPARK-18592: Deprecate unnecessary Param setter methods in tree and ensemble models

Changes of behavior

Core and SQL
- SPARK-18360: The default table path of tables in the default database will be under the location of the default database instead of always depending on the warehouse location setting.
- SPARK-18377: spark.sql.warehouse.dir is a static configuration now. Users need to set it before the start of the first SparkSession and its value is shared by sessions in the same application.
- SPARK-14393: Values generated by non-deterministic functions will not change after coalesce or union.
- SPARK-18076: Fix default Locale used in DateFormat, NumberFormat to Locale.US.
- SPARK-16216: CSV and JSON data sources write timestamp and date values as ISO 8601 formatted strings. Two options, timestampFormat and dateFormat, are added to these two data sources to let users control the string representation of timestamp and date values, respectively. Please refer to the API doc of DataFrameReader and DataFrameWriter for more details about these two options.
- SPARK-17427: Function SIZE returns -1 when its input parameter is null.
- SPARK-16498: LazyBinaryColumnarSerDe is fixed as the SerDe for RCFile.
- SPARK-16552: If a user does not specify the schema of a table and relies on schema inference, the inferred schema will be stored in the metastore. The schema will not be inferred again when this table is used.

Structured Streaming
- SPARK-18516: Separate instantaneous state from progress performance statistics

MLlib
- SPARK-17870: ChiSquareSelector now accounts for degrees of freedom by using pValue rather than raw statistic to select the top features.
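
To show what writing timestamps and dates as ISO 8601 strings (SPARK-16216) looks like in practice, here is a small plain-Python sketch that formats the temporal values of a record before it would be written out as CSV or JSON. It uses Python's own ISO 8601 formatting, so the exact output pattern is an assumption and may not match Spark's default pattern character for character; `format_row` is a hypothetical helper, not a Spark API.

```python
from datetime import date, datetime, timezone


def format_row(row):
    """Return a copy of `row` with dates and timestamps as ISO 8601 strings."""
    out = {}
    for key, value in row.items():
        # datetime is a subclass of date, so it must be checked first.
        if isinstance(value, datetime):
            out[key] = value.isoformat(timespec="milliseconds")
        elif isinstance(value, date):
            out[key] = value.isoformat()
        else:
            out[key] = value
    return out


sample_row = {
    "id": 1,
    "created": datetime(2016, 12, 28, 9, 30, 0, tzinfo=timezone.utc),
    "day": date(2016, 12, 28),
}
formatted = format_row(sample_row)
```

In Spark, the timestampFormat and dateFormat options named above play the role that the hard-coded formatting plays here, letting a job keep an older or custom representation when downstream consumers depend on it.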

Known Issues

- SPARK-17647: In SQL LIKE clause, wildcard characters ‘%’ and ‘_’ right after backslashes are always escaped.
- SPARK-18908: If a StreamExecution fails to start, users need to check stderr for the error.

