Skip to content

Rebase#1

Open
rui-mo wants to merge 793 commits into
rel-3.1.1from
rebase
Open

Rebase#1
rui-mo wants to merge 793 commits into
rel-3.1.1from
rebase

Conversation

@rui-mo
Copy link
Copy Markdown
Owner

@rui-mo rui-mo commented May 14, 2021

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

huaxingao and others added 30 commits April 23, 2020 15:03
### What changes were proposed in this pull request?
Need to address a few more comments

### Why are the changes needed?
Fix a few problems

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Manually build and check

Closes apache#28306 from huaxingao/literal-folllowup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit f543d6a)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…ther()

### What changes were proposed in this pull request?

Backport this in order to backport apache#28245 more smoothly later.

1. Combine  `BarrierRequestToSync` and `AllGatherRequestToSync` into `RequestToSync`, which is distinguished by `RequestMethod` type.

2. Remove unnecessary Json serialization/deserialization

3. Clean up some codes to make runBarrier() and `BarrierCoordinator` more general

4. Remove unused imports.

### Why are the changes needed?

To make codes simpler for better maintain in the future.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is pure code refactor, so should be covered by existed tests.

Closes apache#28283 from Ngone51/backport-3.0.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ator v1 APIs

### What changes were proposed in this pull request?

Add migration guide for removed accumulator v1 APIs.

### Why are the changes needed?

Provide better guidance for users' migration.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes apache#28309 from Ngone51/SPARK-16775-migration-guide.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6c018b3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…s or exception with abortableRpcFuture check

### What changes were proposed in this pull request?

Rewrite the periodically check logic of  `abortableRpcFuture` to make sure that barrier task would always return either desired messages or expected exception.

This PR also simplify a bit around `AbortableRpcFuture`.

### Why are the changes needed?

Currently, the periodically check logic of  `abortableRpcFuture` is done by following:

```scala
...
var messages: Array[String] = null

while (!abortableRpcFuture.toFuture.isCompleted) {
   messages = ThreadUtils.awaitResult(abortableRpcFuture.toFuture, 1.second)
   ...
}
return messages
```
It's possible that `abortableRpcFuture` complete before next invocation on `messages = ...`. In this case, the task may return null messages or execute successfully while it should throw exception(e.g. `SparkException` from `BarrierCoordinator`).

And here's a flaky test which caused by this bug:

```
[info] BarrierTaskContextSuite:
[info] - share messages with allGather() call *** FAILED *** (18 seconds, 705 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(0, 2) finished unsuccessfully.
[info] java.lang.NullPointerException
[info] 	at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:204)
[info] 	at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:204)
[info] 	at scala.collection.IndexedSeqOptimized.toList(IndexedSeqOptimized.scala:285)
[info] 	at scala.collection.IndexedSeqOptimized.toList$(IndexedSeqOptimized.scala:284)
[info] 	at scala.collection.mutable.ArrayOps$ofRef.toList(ArrayOps.scala:198)
[info] 	at org.apache.spark.scheduler.BarrierTaskContextSuite.$anonfun$new$4(BarrierTaskContextSuite.scala:68)
...
```

The test exception can be reproduced by changing the line `messages = ...` to the following:

```scala
messages = ThreadUtils.awaitResult(abortableRpcFuture.toFuture, 10.micros)
Thread.sleep(5000)
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually test and update some unit tests.

Closes apache#28312 from Ngone51/cherry-pick-31472.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…gurations should be static

### What changes were proposed in this pull request?
HiveClient instance is cross-session, the following configurations which are defined in HiveUtils and used to create it should be considered static:

1. spark.sql.hive.metastore.version - used to determine the hive version in Spark
2. spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client
3. spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes -  package names of classes that are shared or separated between SparkContextLoader and hive client class loader

Those are used only once when creating the hive metastore client. They should be static in SQLConf for retrieving them correctly. We should avoid them being changed by users with SET/RESET command.

Speaking of spark.sql.hive.version - the fake of the spark.sql.hive.metastore.version, it is used by jdbc/thrift client for backward compatibility.

### Why are the changes needed?

bugfix, these configurations should not be changed.

### Does this PR introduce any user-facing change?

Yes, the following set of configs are not allowed to change.
```
Seq("spark.sql.hive.metastore.version ",
      "spark.sql.hive.metastore.jars",
      "spark.sql.hive.metastore.sharedPrefixes",
      "spark.sql.hive.metastore.barrierPrefixes")
```
### How was this patch tested?

add unit test

Closes apache#28302 from yaooqinn/SPARK-31522.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8dc2c02)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;

 - Added a new test suite named `ExpressionInfoSuite`
 - To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
 - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
 - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
 - Added a comment in `ExpressionInfoSuite` (followup of apache#28224)

### Why are the changes needed?

To improve test suites/coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes apache#28308 from maropu/SPARK-31526.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 42f496f)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…shdown

### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days.
3. Add implicit conversion from `LocalDate` to `Expression` (`Literal`).

### Why are the changes needed?
To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes apache#28259 from MaxGekk/parquet-filter-java8-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2616542)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…n in CompactibleFileStreamLog

### What changes were proposed in this pull request?

This patch adds some log messages to log elapsed time for "compact" operation in FileStreamSourceLog and FileStreamSinkLog (added in CompactibleFileStreamLog) to help investigating the mysterious latency spike during the batch run.

### Why are the changes needed?

Tracking latency is a critical aspect of streaming query. While "compact" operation may bring nontrivial latency (it's even synchronous, adding all the latency to the batch run), it's not measured and end users have to guess.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A for UT. Manual test with streaming query using file source & file sink.

> grep "for compact batch" <driver log>

```
...
20/02/20 19:27:36 WARN FileStreamSinkLog: Compacting took 24473 ms (load: 14185 ms, write: 10288 ms) for compact batch 21359
20/02/20 19:27:39 WARN FileStreamSinkLog: Loaded 1068000 entries (397985432 bytes in memory), and wrote 1068000 entries for compact batch 21359
20/02/20 19:29:52 WARN FileStreamSourceLog: Compacting took 3777 ms (load: 1524 ms, write: 2253 ms) for compact batch 21369
20/02/20 19:29:52 WARN FileStreamSourceLog: Loaded 229477 entries (68970112 bytes in memory), and wrote 229477 entries for compact batch 21369
20/02/20 19:30:17 WARN FileStreamSinkLog: Compacting took 24183 ms (load: 12992 ms, write: 11191 ms) for compact batch 21369
20/02/20 19:30:20 WARN FileStreamSinkLog: Loaded 1068500 entries (398171880 bytes in memory), and wrote 1068500 entries for compact batch 21369
...
```

![Screen Shot 2020-02-21 at 12 34 22 PM](https://user-images.githubusercontent.com/1317309/75002142-c6830100-54a6-11ea-8da6-17afb056653b.png)

This messages are explaining why the operation duration peaks per every 10 batches which is compact interval. Latency from addBatch heavily increases in each peak which DOES NOT mean it takes more time to write outputs, but we have no idea if such message is not presented.

NOTE: The output may be a bit different from the code, as it may be changed a bit during review phase.

Closes apache#27557 from HeartSaVioR/SPARK-30804.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 39bc50d)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Seems like in certain environment, it requires to set `setwd` as below:

```
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
Loading required package: usethis
Error: Could not find package root, is your working directory inside a package?
```

see also https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root and https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg

We can make up another story too. For example, if you set a specific directory in your `~/.Rprofile`, then R documentation build will fail as below:

```
echo 'setwd("~")' > ~/.Rprofile
sh R/create-rd.sh
```

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Error: Can't find './pkg'.
Execution halted
```

This PR proposes to set the `setwd` explicitly so it does not get affected on the global environment.

To make R dev env more independent.

No, dev only.

Manually tested:

```bash
echo 'setwd("~")' > ~/.Rprofile
sh R/create-rd.sh
```

Before:

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Error: Can't find './pkg'.
Execution halted
```

After:

```
Using R_SCRIPT_PATH = /usr/local/bin
Loading required package: usethis
Updating SparkR documentation
Loading SparkR
Creating a new generic function for ‘as.data.frame’ in package ‘SparkR’
Creating a new generic function for ‘colnames’ in package ‘SparkR’
Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
Creating a new generic function for ‘cov’ in package ‘SparkR’
Creating a new generic function for ‘drop’ in package ‘SparkR’
Creating a new generic function for ‘na.omit’ in package ‘SparkR’
Creating a new generic function for ‘filter’ in package ‘SparkR’
Creating a new generic function for ‘intersect’ in package ‘SparkR’
...
```

Closes apache#28285
### What changes were proposed in this pull request?
Document identifier in SQL Reference

### Why are the changes needed?
make SQL Reference complete

### Does this PR introduce any user-facing change?
Yes
<img width="1049" alt="Screen Shot 2020-04-23 at 11 14 10 PM" src="https://user-images.githubusercontent.com/13592258/80180695-2f2a4f00-85b8-11ea-819b-f96872956d05.png">

<img width="1050" alt="Screen Shot 2020-04-23 at 11 32 32 PM" src="https://user-images.githubusercontent.com/13592258/80182062-e6c06080-85ba-11ea-9502-1c38358c97c9.png">

### How was this patch tested?
Manually build and check

Closes apache#28277 from huaxingao/identifier.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b14b980)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… for untyped scala udf

### What changes were proposed in this pull request?

Give more friendly warning message/migration guide of deprecated scala udf to users.

### Why are the changes needed?

User can not distinguish function signature between typed and untyped scala udf. Instead, we shall tell user what to do directly.

### Does this PR introduce any user-facing change?

No, it's newly added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes apache#28311 from Ngone51/update_udf_doc.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 463c544)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…shdown

### What changes were proposed in this pull request?

This PR aims to add a benchmark suite for nested predicate pushdown with parquet file:

Performance comparison: Nested predicate pushdown disabled vs enabled,  with the following queries scenarios:

1.  When predicate pushed down, parquet reader are able to filter out all the row groups without loading them.

2. When predicate pushed down, parquet reader only loads one of the row groups.

3. When predicate pushed down, parquet reader can't filter out any row group in order to see if we introduce too much overhead or not when enabling nested predicate push down.

### Why are the changes needed?

No benchmark exists today for nested fields predicate pushdown performance evaluation.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
 Benchmark runs and reporting result.

Closes apache#28319 from JiJiTang/SPARK-31364.

Authored-by: Jian Tang <jian_tang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit 6a57616)
Signed-off-by: DB Tsai <d_tsai@apple.com>
… the existing active or default SparkSession

### What changes were proposed in this pull request?

SparkSessionBuilder shoud not propagate static sql configurations to the existing active/default SparkSession
This seems a long-standing bug.

```scala
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|spark.sql.warehou...|file:/Users/kenty...|
+--------------------+--------------------+

scala> spark.sql("set spark.sql.warehouse.dir=2");
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir;
  at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
  at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
  at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
  at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
  ... 47 elided

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
getClass   getOrCreate

scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate
20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession6403d574

scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.warehou...|  xyz|
+--------------------+-----+

scala>
OptionsAttachments
```

### Why are the changes needed?
bugfix as shown in the previous section

### Does this PR introduce any user-facing change?

Yes, static SQL configurations with SparkSession.builder.config do not propagate to any existing or new SparkSession instances.

### How was this patch tested?

new ut.

Closes apache#28316 from yaooqinn/SPARK-31532.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 8424f55)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…ing Point Special Values

### What changes were proposed in this pull request?
Re-arrange Data Types page to document Floating Point Special Values

### Why are the changes needed?
To complete SQL Reference

### Does this PR introduce any user-facing change?
Yes

- add Floating Point Special Values in Data Types page
- move NaN Semantics to Data Types page

<img width="1050" alt="Screen Shot 2020-04-24 at 9 14 57 AM" src="https://user-images.githubusercontent.com/13592258/80233996-3da25600-860c-11ea-8285-538efc16e431.png">

<img width="1050" alt="Screen Shot 2020-04-24 at 9 15 22 AM" src="https://user-images.githubusercontent.com/13592258/80234001-4004b000-860c-11ea-8954-72f63c92d50d.png">

<img width="1049" alt="Screen Shot 2020-04-24 at 9 15 44 AM" src="https://user-images.githubusercontent.com/13592258/80234006-41ce7380-860c-11ea-96bf-15e1aa2102ff.png">

### How was this patch tested?
Manually build and check

Closes apache#28264 from huaxingao/datatypes.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 054bef9)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…ssFor

### What changes were proposed in this pull request?

the 2 method `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, the cases in `dataTypeFor` are not fully handled in `arrayClassFor`

For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(<console>:57)
  ... 53 elided

scala>
```

In this PR, we add the missing cases to `arrayClassFor`

### Why are the changes needed?

bugfix as described above

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

add a test for array encoders

Closes apache#28324 from yaooqinn/SPARK-31552.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit caf3ab8)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…FileSuite

### What changes were proposed in this pull request?

 Add V1/V2 tests for TextSuite and WholeTextFileSuite

### Why are the changes needed?

This is missing part since apache#24207. We should have these tests for test coverage.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests.

Closes apache#28335 from gengliangwang/testV2Suite.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 16b9615)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…deGenerator in DOC

### What changes were proposed in this pull request?
This PR proposes to remove the non-existed `hiveClientCalls.count` metric documentation of `CodeGenerator` of the Spark metrics system in the monitoring guide.

There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and  `namespace=CodeGenerator` bullet lists, but there is only one defined inside object `HiveCatalogMetrics`.

Closes apache#28292 from wezhang/monitoringdoc.

Authored-by: Wei Zhang <wezhang@outlook.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3e83ccc)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… merged block

### What changes were proposed in this pull request?

Fix the wrong fetch size.

### Why are the changes needed?

The fetch size should be the sum of the size of merged block and the total size of those merging blocks. But we missed the size of merged block.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a regression test.

Closes apache#28301 from Ngone51/fix_merged_block_size.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ab8cada)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…st's internal types

### What changes were proposed in this pull request?
In the PR, I propose to fix the `InSet.sql` method for the cases when input collection contains values of internal Catalyst's types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types, and wrapped by `Literal` to properly form SQL view of the input collection.

### Why are the changes needed?
The changes fixed the bug in `InSet.sql` that makes wrong assumption about types of collection elements. See more details in SPARK-31563.

### Does this PR introduce any user-facing change?
Highly likely, not.

### How was this patch tested?
Added a test to `ColumnExpressionSuite`

Closes apache#28343 from MaxGekk/fix-InSet-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7d8216a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…/date_trunc fucntions

The PR backport apache#28313 from master to 3.0

### What changes were proposed in this pull request?

Similar to https://jira.apache.org/jira/browse/SPARK-31507, millennium, century, and decade are not commonly used in most modern platforms.

For example
Negative:
https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts
https://prestodb.io/docs/current/functions/datetime.html#date_trunc
https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc
https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/

Positive:
https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html
https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC

This PR removes these `fmt`s support for trunc and date_trunc functions.

### Why are the changes needed?

clean uncommon datetime unit for easy maintenance, we can add them back if they are found very useful later.

### Does this PR introduce any user-facing change?
no, targeting 3.0.0, these are newly added in 3.0.0

### How was this patch tested?

remove and modify existing units tests

Closes apache#28340 from yaooqinn/SPARK-31528-30.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

Fix flakiness by checking `1970/01/01` instead of `1970`.
The test was added by SPARK-27125 for 3.0.0.

### Why are the changes needed?

the `org.apache.spark.sql.execution.ui.AllExecutionsPageSuite.SPARK-27019:correctly display SQL page when event reordering happens` test is flaky for just checking the `html` content not containing 1970. I will add a ticket to check and fix that.
In the specific failure https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121799/testReport, it failed because the `html`

```
...
<td sorttable_customkey="1587806019707">
...

```
contained `1970`.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

passing jenkins

Closes apache#28344 from yaooqinn/SPARK-31564.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7959808)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

1. Remove console.log(), which seems unnecessary in the releases.
2. Replace the double equals to triple equals
3. Reuse jquery selector.

### Why are the changes needed?

For better code quality.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests + manual test.

Closes apache#28333 from gengliangwang/removeLog.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
(cherry picked from commit f59ebde)
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
…t_date, and current_timestamp

### What changes were proposed in this pull request?

This PR intends to add entries for substring, current_date, and current_timestamp in the SQL built-in function documents. Specifically, the entries are as follows;

 - SELECT current_date;
 - SELECT current_timestamp;
 - SELECT substring('abcd' FROM 1);
 - SELECT substring('abcd' FROM 1 FOR 2);

### Why are the changes needed?

To make the SQL (built-in functions) references complete.

### Does this PR introduce any user-facing change?

<img width="1040" alt="Screen Shot 2020-04-25 at 16 51 07" src="https://user-images.githubusercontent.com/692303/80274851-6ca5ee00-8718-11ea-9a35-9ae82008cb4b.png">

<img width="974" alt="Screen Shot 2020-04-25 at 17 24 24" src="https://user-images.githubusercontent.com/692303/80275032-a88d8300-8719-11ea-92ec-95b80169ae28.png">

<img width="862" alt="Screen Shot 2020-04-25 at 17 27 48" src="https://user-images.githubusercontent.com/692303/80275114-36696e00-871a-11ea-8e39-02e93eabb92f.png">

### How was this patch tested?

Added test examples.

Closes apache#28342 from maropu/SPARK-31562.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit e01125d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)
```
Before this PR the result was `1`.

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new test case.

Closes apache#28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 4f53bfb)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…` values in ORC

### What changes were proposed in this pull request?
Convert `java.time.LocalDate` to `java.sql.Date` in pushed down filters to ORC datasource when Java 8 time API enabled.

Closes apache#28272

### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
Wrong value class java.time.LocalDate for DATE.EQUALS leaf
java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf
	at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
	at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
	at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
	at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `OrcFilterSuite`.

Closes apache#28261 from MaxGekk/orc-date-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit bd139bd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?

This PR unifies the font color of label as `#333333` among all DAG-viz.

### Why are the changes needed?

For the consistent appearance among all DAG-viz.
There are three types of DAG-viz in the WebUI.
One is for stages, another one is for RDDs and the last one is for query plans.
But the font color of labels are slightly different among them.

For stages, the color is `#333333` (simply 333) which is specified by `spark-dag-viz.css`.
<img width="355" alt="job-graph" src="https://user-images.githubusercontent.com/4736016/80321397-b517f580-8857-11ea-8c8e-cf68f648ab05.png">
<img width="310" alt="job-graph-color" src="https://user-images.githubusercontent.com/4736016/80321399-ba754000-8857-11ea-8708-83bdef4bc1d1.png">

For RDDs, the color is `#212529` which is specified by `bootstrap.min.js`.
<img width="386" alt="stage-graph" src="https://user-images.githubusercontent.com/4736016/80321438-f0b2bf80-8857-11ea-9c2a-13fa0fd1431c.png">
<img width="313" alt="stage-graph-color" src="https://user-images.githubusercontent.com/4736016/80321444-fa3c2780-8857-11ea-81b2-4f1203d47896.png">

For query plans, the color is `black` which is specified by `spark-sql-viz.css`.
<img width="449" alt="plan-graph" src="https://user-images.githubusercontent.com/4736016/80321490-61f27280-8858-11ea-9c3a-2c98d3d4d03b.png">
<img width="316" alt="plan-graph-color" src="https://user-images.githubusercontent.com/4736016/80321496-6ae34400-8858-11ea-8fe8-0d6e4a821608.png">

After the change, the appearance is like as follows (no change for stages).

For RDDs.
<img width="389" alt="stage-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80321613-6b300f00-8859-11ea-912f-d92474aa9f47.png">

For query plans.
<img width="456" alt="plan-graph-fixed" src="https://user-images.githubusercontent.com/4736016/80321638-9a468080-8859-11ea-974c-33c56a8ffe1a.png">

### Does this PR introduce any user-facing change?

Yes. The unified color is slightly lighter than ever.

### How was this patch tested?

Confirmed that the color code among all DAG-viz are `#333333` using browser's debug console.

Closes apache#28352 from sarutak/unify-label-color.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 91ec2ea)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…Split with pipeline estimator cannot save and load model

### What changes were proposed in this pull request?
Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.

Most pyspark estimators/transformers inherit `JavaParams`, but some estimators are special (in order to support pure python implemented nested estimators/transformers):
* Pipeline
* OneVsRest
* CrossValidator
* TrainValidationSplit

But note that, currently, in pyspark, estimators listed above, their model reader/writer do NOT support pure python implemented nested estimators/transformers. Because they use java reader/writer wrapper as python side reader/writer.

Pyspark CrossValidator/TrainValidationSplit model reader/writer require all estimators define the `_transfer_param_map_to_java` and `_transfer_param_map_from_java` (used in model read/write).

OneVsRest class already defines the two methods, but Pipeline do not, so it lead to this bug.

In this PR I add `_transfer_param_map_to_java` and `_transfer_param_map_from_java` into Pipeline class.

### Why are the changes needed?
Bug fix.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test.

Manually test in pyspark shell:
1) CrossValidator with Simple Pipeline estimator
```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of tree stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001')
CrossValidatorModel.load('/tmp/cv_model001')
```

2) CrossValidator with Pipeline estimator which include a OneVsRest estimator stage, and OneVsRest estimator nest a LogisticRegression estimator.

```
from pyspark.ml.linalg import Vectors
from pyspark.ml import Estimator, Model
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel, OneVsRest
from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
    MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder, \
    TrainValidationSplit, TrainValidationSplitModel
from pyspark.sql.functions import rand
from pyspark.testing.mlutils import SparkSessionTestCase

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

ova = OneVsRest(classifier=LogisticRegression())
lr1 = LogisticRegression().setMaxIter(100)
lr2 = LogisticRegression().setMaxIter(150)
grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build()
evaluator = MulticlassClassificationEvaluator()

pipeline = Pipeline(stages=[ova])

cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
cvModel.save('/tmp/model002')

cvModel2 = CrossValidatorModel.load('/tmp/model002')
```

TrainValidationSplit testing code are similar so I do not paste them.

Closes apache#28279 from WeichenXu123/fix_pipeline_tuning.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
(cherry picked from commit 4a21c4c)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
…uite doesn't match results correctly

### What changes were proposed in this pull request?

`CliSuite.runCliWithin` was not matching for expected results correctly. It was matching for expected lines anywhere in stdout or stderr.
On the example of `Single command with --database` test:
In
```
    runCliWithin(2.minute)(
      "CREATE DATABASE hive_db_test;"
        -> "",
      "USE hive_test;"
        -> "",
      "CREATE TABLE hive_test(key INT, val STRING);"
        -> "",
      "SHOW TABLES;"
        -> "hive_test"
    )
```
It was looking for lines containing "", "", "" and then "hive_test".
However, the string "hive_test" was contained in "hive_test_db", and hence:
```
2020-04-08 17:53:12,752 INFO  CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 0: ""
2020-04-08 17:53:12,765 INFO  CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 1: ""
2020-04-08 17:53:17,688 INFO  CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
2020-04-08 17:53:12,765 INFO  CliSuite - stderr> found expected output line 2: ""
2020-04-08 17:53:18,069 INFO  CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds
2020-04-08 17:53:18,087 INFO  CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test;
2020-04-08 17:53:12,765 INFO  CliSuite - stdout> found expected output line 3: "hive_test"
2020-04-08 17:53:21,742 INFO  CliSuite - Found all expected output.
```
Because of that, it could kill the CLI process without really even creating the table. This was not expected. The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually creates the table.

I make the output checking more robust to not match on unexpected output, by making it check the echo of query output on the CLI. Also, wait for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly.

### Why are the changes needed?

org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests in CLISuite. Tested several times with no flakiness. Was getting flaky results almost on every run before.
```
[info] CliSuite:
[info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds)
[info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
[info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds)
[info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds)
[info] - Simple commands (16 seconds, 238 milliseconds)
[info] - Single command with -e (9 seconds, 967 milliseconds)
[info] - Single command with --database (21 seconds, 205 milliseconds)
[info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds)
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds)
[info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
[info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds)
[info] - list jars (9 seconds, 520 milliseconds)
[info] - list jar <jarfile> (9 seconds, 277 milliseconds)
[info] - list files (9 seconds, 828 milliseconds)
[info] - list file <filepath> (9 seconds, 646 milliseconds)
[info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
[info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
[info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
[info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds)
[info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds)
[info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds)
```

Closes apache#28156 from juliuszsompolski/SPARK-31388.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 560bd54)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ther than Julian days

This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days.

The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'.

No

Manual testing and modified existing unit tests.

Closes apache#28345 from bersprockets/SPARK-31557.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a911287)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ecision in ansi mode

To follow ANSI,the expressions - `date + interval`, `interval + date` and `date - interval` should only accept intervals which the `microseconds` part is 0.

Better ANSI compliance

No, this PR should target 3.0.0 in which this feature is newly added.

add more unit tests

Closes apache#28310 from yaooqinn/SPARK-31527.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ebc8fa5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sarutak and others added 30 commits June 1, 2020 14:33
### What changes were proposed in this pull request?

This PR backports the change of apache#28583 (SPARK-31764) to branch-3.0, which changes JsonProtocol to write RDDInfos#isBarrier.

### Why are the changes needed?

JsonProtocol reads RDDInfos#isBarrier but not write it so it's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I added a testcase.

Closes apache#28660 from sarutak/SPARK-31764-branch-3.0.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
…emory to reliably publish

### What changes were proposed in this pull request?
Allow overriding the zinc options in the docker release and set a higher so the publish step can succeed consistently.

### Why are the changes needed?

The publish step experiences memory pressure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Running test locally with fake user to see if publish step (besides svn part) succeeds

Closes apache#28698 from holdenk/SPARK-31889-docker-release-script-does-not-allocate-enough-memory-to-reliably-publish.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
(cherry picked from commit ab9e5a2)
Signed-off-by: Holden Karau <hkarau@apple.com>
…here is a join in the plan

### What changes were proposed in this pull request?

This PR proposes to check `DetectAmbiguousSelfJoin` only if there is `Join` in the plan. Currently, the checking is too strict even to non-join queries.

For example, the codes below don't have join at all but it fails as the ambiguous self-join:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val df = Seq(1, 1, 2, 2).toDF("A")
val w = Window.partitionBy(df("A"))
df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true)
```

It is because `ExtractWindowExpressions` can create a `AttributeReference` with the same metadata but a different expression ID, see:

https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679
https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63
https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180

Before:

```
'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Relation[A#19] parquet
```

After:

```
Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19]
      +- Project [A#19 AS X#21, A#19]
         +- Relation[A#19] parquet
```

`X#21` holds the same metadata of DataFrame ID and column position with `A#19` but it has a different expression ID which ends up with the checking fails.

### Why are the changes needed?

To loose the checking and make users not surprised.

### Does this PR introduce _any_ user-facing change?

It's the changes in unreleased branches only.

### How was this patch tested?

Manually tested and unittest was added.

Closes apache#28695 from HyukjinKwon/SPARK-28344-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ea45fc5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…l shuffle" test having no skew join

### What changes were proposed in this pull request?
Fix configurations and ensure there is skew join in the test "Do not optimize skew join if additional shuffle".

### Why are the changes needed?
The existing "Do not optimize skew join if additional shuffle" test has no skew join at all.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Fixed existing test.

Closes apache#28679 from manuzhang/spark-31870.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 283814a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…s to Parquet

### What changes were proposed in this pull request?
Fixed conversions of `java.sql.Timestamp` to milliseconds in `ParquetFilter` by using existing functions from `DateTimeUtils` `fromJavaTimestamp()` and `microsToMillis()`.

### Why are the changes needed?
The changes fix the bug:
```scala
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
scala> Seq(java.sql.Timestamp.valueOf("1000-06-14 08:28:53.123")).toDF("ts").write.mode("overwrite").parquet("/Users/maximgekk/tmp/ts_millis_old_filter")
scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false)
+---+
|ts |
+---+
+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, after the changes (for the example above):
```scala
scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1000-06-14 08:28:53.123|
+-----------------------+
```

### How was this patch tested?
Modified tests in `ParquetFilterSuite` to check old timestamps.

Closes apache#28699 from MaxGekk/parquet-ts-millis-filter-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…nearRegressionWithTests.test_train_prediction

### What changes were proposed in this pull request?

It increases the timeout for `StreamingLinearRegressionWithTests.test_train_prediction`

```
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 503, in test_train_prediction
    self._eventually(condition)
  File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 498, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2
```

This could likely happen when the PySpark tests run in parallel and it become slow.

### Why are the changes needed?

To make the tests less flaky. Seems it's being reported multiple times:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123144/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123146/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123141/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123142/testReport/

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Jenkins will test it out.

Closes apache#28701 from HyukjinKwon/SPARK-29137.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 64cb6f7)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
We should use dataType.catalogString to unified the data type mismatch message.
Before:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
```

After:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': string to int;
```

### How was this patch tested?
UT.

Closes apache#28654 from lipzhu/SPARK-31834.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit d79a8a8)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…down

1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.Instant` attributes.
2. Added `ParquetFilters.timestampToMicros()` to support both types `java.sql.Timestamp` and `java.time.Instant` in conversions to microseconds.
3. Re-used `timestampToMicros` in constructing of Parquet filters.

To support pushed down filters with `java.time.Instant` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.

No

Modified tests to `ParquetFilterSuite` to check the case when Java 8 API is enabled.

Closes apache#28696 from MaxGekk/support-instant-parquet-filters.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Only push the release tag after the build has finished.

### Why are the changes needed?

If the build fails we don't need a release tag.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Running locally with a fake user in apache#28667

Closes apache#28700 from holdenk/SPARK-31860-build-master-only-push-tags-on-success.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
(cherry picked from commit 69ba9b6)
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?

Add cross build support to our docker image script using the new dockerx extension.

### Why are the changes needed?

We have a CI for Spark on ARM, we should support building images for ARM and AMD64.

### Does this PR introduce _any_ user-facing change?

Yes, a new flag is added to the docker image build script to cross-build

### How was this patch tested?
Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed.

Closes apache#28615 from holdenk/cross-build.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
(cherry picked from commit 2570228)
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?

This PR backports the change of apache#28627 .
Some fixes for UI like apache#28690 should be merged to `branch-3.0` including testcases so apache#28627 is need to be backported.

This PR mainly adds two things.

1. Real headless browser support for UI test
2. A test suite using headless Chrome as one instance of  those browsers.

Also, for environment where Chrome and Chrome driver is not installed, `ChromeUITest` tag is added to filter out the test suite.
By default, test suites with `ChromeUITest` is disabled.

### Why are the changes needed?

In `branch-3.0`, there are two problems for UI test.
1. Lots of tests especially JavaScript related ones are done manually.
Appearance is better to be confirmed by our eyes but logic should be tested by test cases ideally.

2. Compared to the real web browsers, HtmlUnit doesn't seem to support JavaScript enough.
I added a JavaScript related test before for SPARK-31534 using HtmlUnit which is simple library based headless browser for test.
The test I added works somehow but some JavaScript related error is shown in unit-tests.log.

```
======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904)
        at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807)
        at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216)
        at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52)
        at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102)
        at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426)
        at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157)
        at java.lang.Thread.run(Thread.java:748)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2)
        at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009)
        at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
        at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
        at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889)
        ... 10 more
JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)".
== CALLING JAVASCRIPT ==
  function () {
      throw e;
  }
======= EXCEPTION END ========
```
I tried to upgrade HtmlUnit to 2.40.0 but what is worse, the test become not working even though it works on real browsers like Chrome, Safari and Firefox without error.
```
[info] UISeleniumSuite:
[info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1)
```
To resolve those problems, it's better to support headless browser for UI test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I tested with following patterns. Both Chrome and Chrome driver should be installed to test.

1. sbt / with default excluded tags (ChromeUISeleniumSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed)
`build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite"`

2. sbt / overwrite default excluded tags as empty string (Both suites are expected to succeed)
`build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite"`

3. sbt / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped)
`build/sbt -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite org.apache.spark.sql.SQLQueryTestSuite"`

4. Maven / with default excluded tags (ChromeUISeleniumSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed)
`build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test`

5. Maven / overwrite default excluded tags as empty string (Both suites are expected to succeed)
`build/mvn -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test`

6. Maven / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped)
`build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest  -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.ui.ChromeUISeleniumSuite,org.apache.spark.sql.SQLQueryTestSuite test`

Closes apache#28702 from sarutak/real-headless-browser-support-3.0.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?

This PR fix an issue related to DAG-viz.
Because DAG-viz for a job fetches link urls for each stage from the stage table, rendering can fail with pagination.
You can reproduce this issue with the following operation.
```
sc.parallelize(1 to 10).map(value => (value ,value)).repartition(1).repartition(1).repartition(1).reduceByKey(_ + _).collect
```
And then, visit the corresponding job page.
There are 5 stages so show <5 stages in the paged table.
<img width="1440" alt="dag-rendering-issue1" src="https://user-images.githubusercontent.com/4736016/83376286-c29f3d00-a40c-11ea-891b-eb8f42afbb27.png">
<img width="1439" alt="dag-rendering-issue2" src="https://user-images.githubusercontent.com/4736016/83376288-c3d06a00-a40c-11ea-8bb2-38542e5010c1.png">

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly added test case with following command.
`build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31882"`

Closes apache#28690 from sarutak/fix-dag-rendering-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
(cherry picked from commit 271eb26)
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
…ase to be consistent with Scala side

### What changes were proposed in this pull request?

Scala:

```scala
scala> spark.range(10).explain("cost")
```
```
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```

PySpark:

```python
>>> spark.range(10).explain("cost")
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 333, in explain
    raise TypeError(err_msg)
TypeError: extended (optional) should be provided as bool, got <class 'str'>
```

In addition, it is consistent with other codes too, for example, `DataFrame.sample` also can support `DataFrame.sample(1.0)` and `DataFrame.sample(False)`.

### Why are the changes needed?

To provide the consistent API support across APIs.

### Does this PR introduce _any_ user-facing change?

Nope, it's only changes in unreleased branches.
If this lands to master only, yes, users will be able to set `mode` as `df.explain("...")` in Spark 3.1.

After this PR:

```python
>>> spark.range(10).explain("cost")
```
```
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```

### How was this patch tested?

Unittest was added and manually tested as well to make sure:

```python
spark.range(10).explain(True)
spark.range(10).explain(False)
spark.range(10).explain("cost")
spark.range(10).explain(extended="cost")
spark.range(10).explain(mode="cost")
spark.range(10).explain()
spark.range(10).explain(True, "cost")
spark.range(10).explain(1.0)
```

Closes apache#28711 from HyukjinKwon/SPARK-31895.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit e1d5201)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This PR switches the default Locale from the `US` to `GB` to change the behavior of the first day of the week from Sunday-started to Monday-started as same as v2.4

```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {code DayOfWeek} enum.
     *
     * return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moved a day backward.

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

With this change, it restores the first day of week calculating for functions when using the default locale.

Yes, but the behavior change is used to restore the old one of v2.4

add unit tests

Closes apache#28692 from yaooqinn/SPARK-31879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c59f51b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR disables week-based date filed for parsing

closes apache#28674

1. It's an un-fixable behavior change to fill the gap between SimpleDateFormat and DateTimeFormater and backward-compatibility for different JDKs.A lot of effort has been made to prove it at apache#28674

2. The existing behavior itself in 2.4 is confusing, e.g.

```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
  The 'u' here seems not to go to the Monday of the first week in week-based form or the first day of the year in non-week-based form but go to the Monday of the second week in week-based form.

And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```

  I think we don't need to introduce all the weird behavior from Java

3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined using it.

4. Avoiding JDK bugs

https://issues.apache.org/jira/browse/SPARK-31880

Yes, the 'Y/W/w/u/F/E' pattern cannot be used datetime parsing functions.

more tests added

Closes apache#28706 from yaooqinn/SPARK-31892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit afe95bd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR fixes a wrong coloring issue in the DAG-viz.
In the Job Page and Stage Page, nodes which are associated with "barrier mode" in the DAG-viz will be colored pale green.
But, with some type of jobs, nodes which are not associated with the mode will also colored.
You can reproduce with the following operation.
```
sc.parallelize(1 to 10).barrier.mapPartitions(identity).repartition(1).collect()
```
<img width="376" alt="wrong-coloring" src="https://user-images.githubusercontent.com/4736016/83403670-1711df00-a444-11ea-9457-c683f75bc566.png">

In the screen shot above, `repartition` in `Stage 1` is not associated with barrier mode so the corresponding node should not be colored pale green.

The cause of this issue is the logic which chooses HTML elements to be colored is wrong.
The logic chooses such elements based on whether each element is associated with a style class (`clusterId` in the code).
But when an operation crosses over shuffle (like `repartition` above), a `clusterId` can be duplicated and non-barrier mode node is also associated with the same `clusterId`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly added test case with the following command.
```
build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31886"
```

Closes apache#28694 from sarutak/fix-wrong-barrier-color.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
(cherry picked from commit 8ed93c9)
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
… K8S IT

### What changes were proposed in this pull request?

This PR aims to activate `hadoop-2.7` profile by default in Kubernetes IT module.

### Why are the changes needed?

While SPARK-31881 added Hadoop 3.2 support, one default test dependency was moved to `hadoop-2.7` profile. It works when we give one of `hadoop-2.7` and `hadoop-3.2`, but it fails when we don't give any profile.

**BEFORE**
```
$ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests
...
[ERROR] [Error] /APACHE/spark-merge/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:23:
object amazonaws is not a member of package com
```

**AFTER**
```
$ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests
..
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
```

The default activated profile will be override when we give `hadoop-3.2`.
```
$ mvn help:active-profiles -Pkubernetes-integration-tests
...
Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT':

The following profiles are active:

 - hadoop-2.7 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT)
 - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
```
```
$ mvn help:active-profiles -Pkubernetes-integration-tests -Phadoop-3.2
...
Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT':

The following profiles are active:

 - hadoop-3.2 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT)
 - hadoop-3.2 (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins UT and IT.

Currently, all Jenkins build and tests (UT & IT) passes without this patch. This should be tested manually with the above command.

`hadoop-3.2` K8s IT also passed like the following.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 33 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes apache#28716 from dongjoon-hyun/SPARK-31881-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit e5b9b86)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Replace `def dateFormatter` to `val dateFormatter`.
2. Modify the `date formatting in hive result` test in `HiveResultSuite` to check modified code on various time zones.

### Why are the changes needed?
To avoid creation of `DateFormatter` per every incoming date in `HiveResult.toHiveString`. This should eliminate unnecessary creation of `SimpleDateFormat` instances and compilation of the default pattern `yyyy-MM-dd`. The changes can speed up processing of legacy date values of the `java.sql.Date` type which is collected by default.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified a test in `HiveResultSuite`.

Closes apache#28687 from MaxGekk/HiveResult-val-dateFormatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 125a89c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

This PR set the hour to 12/0 when the AMPM_OF_DAY field exists

### Why are the changes needed?

When the hour is absent but the am-pm is present, the time is incorrect for pm

### Does this PR introduce _any_ user-facing change?
yes, the change is user-facing but to change back to 2.4 to keep backward compatibility

e.g.
```sql
spark-sql> select to_timestamp('33:33 PM', 'mm:ss a');
1970-01-01 12:33:33
spark-sql> select to_timestamp('33:33 AM', 'mm:ss a');
1970-01-01 00:33:33

```

otherwise, the results are all `1970-01-01 00:33:33`

### How was this patch tested?

add unit tests

Closes apache#28713 from yaooqinn/SPARK-31896.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit afcc14c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… with fresh attribute IDs

### What changes were proposed in this pull request?

This is a followup of apache#26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs.

This PR is to eliminate this side effect.

### Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes apache#28717 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc0709f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ion guide

### What changes were proposed in this pull request?
The Pyspark Migration Guide needs to mention a breaking change of the Pyspark ML API.

### Why are the changes needed?
In SPARK-29093, all setters have been removed from `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public pyspark ML API, hence this is a breaking change.

### Does this PR introduce _any_ user-facing change?
Only documentation.

### How was this patch tested?
Visually.

Closes apache#28663 from EnricoMi/branch-pyspark-migration-guide-setters.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 4bbe3c2)
Signed-off-by: Sean Owen <srowen@gmail.com>
…w metrics in Query UI

### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in the separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action and Query UI shows the plan graph without metrics.

We should call `serveToStream` first, then `withAction` in it.

### Why are the changes needed?

When calling toPandas, usually Query UI shows each plan node's metric and corresponding Stage ID and Task ID:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 07 PM](https://user-images.githubusercontent.com/506656/83815735-bec22380-a675-11ea-8ecc-bf2954731f35.png)

but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 27 PM](https://user-images.githubusercontent.com/506656/83815804-de594c00-a675-11ea-933a-d0ffc0f534dd.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local.

![Screen Shot 2020-06-04 at 3 19 41 PM](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a676-11ea-84b8-2a8d80428bc6.png)

Closes apache#28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 632b5bc)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?

This PR proposes to add one more newline to clearly separate JVM and Python tracebacks:

Before:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

After:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;

JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

This is kind of a followup of apache@e694660 (SPARK-31849).

### Why are the changes needed?

To make it easier to read.

### Does this PR introduce _any_ user-facing change?

It's in the unreleased branches.

### How was this patch tested?

Manually tested.

Closes apache#28732 from HyukjinKwon/python-minor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 53ce58d)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ormatting too

After all these attempts apache#28692 and apache#28719 an apache#28727.
they all have limitations as mentioned in their discussions.

Maybe the only way is to forbid them all

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {code DayOfWeek} enum.
     *
     * return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moved a day or a year or a week backward.

e.g.

For the date `2019-12-29(Sunday)`, in the Sunday Start system(e.g. en-US), it belongs to 2020 of week-based-year, in the Monday Start system(en-GB), it goes to 2019. the week-of-week-based-year(w) will be affected too

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

With this change, user can not use 'YwuW',  but 'e' for 'u' instead. This can at least turn this not to be a silent data change.

add unit tests

Closes apache#28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9d5b5d0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

It appears I have unintentionally used nested JDBC statements in the two tests I added.

### Why are the changes needed?

Cleanup a typo. Please merge to master/branch-3.0

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes apache#28735 from juliuszsompolski/SPARK-31859-fixup.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ea01013)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…rmatting

### What changes were proposed in this pull request?

In this PR, we throw `SparkUpgradeException` when getting `DateTimeException` for datetime formatting in the `EXCEPTION` legacy Time Parser Policy.

### Why are the changes needed?
`DateTimeException` is also declared by `java.time.format.DateTimeFormatter#format`, but in Spark, it can barely occur. We have suspected one that due to a JDK bug so far. see https://bugs.openjdk.java.net/browse/JDK-8079628.

For `from_unixtime` function, we will suppress the DateTimeException caused by `DD` and result `NULL`. It is a silent date change that should be avoided in Java 8.

### Does this PR introduce _any_ user-facing change?

Yes,  when running on Java8 and using `from_unixtime` function with pattern `DD` to format datetimes, if dayofyear>=100, `SparkUpgradeException` will alert users instead of silently resulting null. For `date_format`, `SparkUpgradeException` take the palace of  `DateTimeException`.

### How was this patch tested?

add unit tests.

Closes apache#28736 from yaooqinn/SPARK-31867-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fc6af9d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…tition columns

### What changes were proposed in this pull request?
```sql
CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
SELECT * FROM t2 WHERE b = 'A';
```
Above SQL throws MetaException

> Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
	... 114 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string, or integral types)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
	at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
	at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
	at org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
	at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
	at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
	... 119 more

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a unit test.

Closes apache#28724 from LantaoJin/SPARK-31904.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit 5079831)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.