2 changes: 1 addition & 1 deletion docs/content.zh/docs/concepts/flink-architecture.md
Original file line number Diff line number Diff line change
@@ -101,7 +101,7 @@

### Flink Session Cluster

* **Cluster lifecycle**: in a Flink Session Cluster, the client connects to a pre-existing, long-running cluster that can accept multiple job submissions. Even after all jobs are finished, the cluster (and the JobManager) keeps running until the session is manually stopped. The lifetime of a Flink Session Cluster is therefore not bound to the lifetime of any Flink job.
* **Cluster lifecycle**: in a Flink Session Cluster, the client connects to a pre-existing, long-running cluster that can accept multiple application submissions. Even after all applications are finished, the cluster (and the JobManager) keeps running until the session is manually stopped. The lifetime of a Flink Session Cluster is therefore not bound to the lifetime of any Flink application.

* **Resource isolation**: TaskManager slots are allocated by the ResourceManager on job submission and released once the job is finished. Because all jobs share the same cluster, there is some competition for cluster resources, such as network bandwidth in the submit-job phase. One limitation of this shared setup is that if one TaskManager crashes, then all jobs that have tasks running on this TaskManager will fail; in a similar way, if some fatal error occurs on the JobManager, it will affect all jobs running in the cluster.

5 changes: 5 additions & 0 deletions docs/content.zh/docs/concepts/glossary.md
@@ -84,6 +84,11 @@
The JobResultStore is a Flink component that persists the results of globally terminated (i.e. finished, cancelled or failed) jobs to a file system, allowing the results to outlive a finished job.
Flink then uses these results to determine whether a job should be recovered in highly-available clusters.

#### ApplicationResultStore

The ApplicationResultStore is a Flink component that persists the results of globally terminated (i.e. finished, cancelled or failed) applications to a file system, allowing the results to outlive a finished application.
Flink then uses these results to determine whether an application should be recovered in highly-available clusters.

#### Logical Graph

A logical graph is a directed graph where the nodes are [operators](#operator) and the edges define the input/output relationships of the operators, corresponding to data streams or data sets. A logical graph is created by submitting jobs from a [Flink Application](#flink-application).
27 changes: 20 additions & 7 deletions docs/content.zh/docs/deployment/advanced/historyserver.md
@@ -27,7 +27,7 @@ under the License.

# History Server

Flink provides a history server that can be used to query the statistics of completed jobs after the corresponding Flink cluster has been shut down.
Flink provides a history server that can be used to query the statistics of completed jobs and applications after the corresponding Flink cluster has been shut down.

Furthermore, it exposes a REST API that accepts HTTP requests and responds with JSON data.

@@ -37,7 +37,7 @@

## Overview

The HistoryServer allows you to query the status and statistics of completed jobs that have been archived by a JobManager.
The HistoryServer allows you to query the status and statistics of completed jobs and applications that have been archived by a JobManager.

After having configured the HistoryServer *and* JobManager, you can start and stop the HistoryServer via its corresponding script:

@@ -58,20 +58,24 @@ bin/historyserver.sh (start|start-foreground|stop)

**JobManager**

The archiving of completed jobs happens on the JobManager, which uploads the archived job information to a file system directory. You can configure the directory to archive completed jobs via `jobmanager.archive.fs.dir` in the [Flink configuration file]({{< ref "docs/deployment/config#flink-配置文件" >}}).
The archiving of completed jobs and applications happens on the JobManager, which uploads the archived job and application information to a file system directory. You can configure the directory to archive completed jobs and applications via `jobmanager.archive.fs.dir` in the [Flink configuration file]({{< ref "docs/deployment/config#flink-配置文件" >}}).

```yaml
# Directory to upload information about completed jobs
jobmanager.archive.fs.dir: hdfs:///completed-jobs
# Directory to upload information about completed jobs and applications
jobmanager.archive.fs.dir: hdfs:///archives
```

{{< hint info >}}
For details about the directory structure, please refer to [FLIP-549: Support Application Management](https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management).
{{< /hint >}}

**HistoryServer**

The HistoryServer can be configured to monitor a comma-separated list of directories via `historyserver.archive.fs.dir`. The configured directories are regularly polled for new archives; the polling interval can be configured via `historyserver.archive.fs.refresh-interval`.

```yaml
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Monitor the following directories for completed jobs and applications
historyserver.archive.fs.dir: hdfs:///archives

# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
@@ -105,6 +109,15 @@
The values in angle brackets are variables, for example the `http://hostname:port/jobs/<jobid>/exceptions` request for job `7684be6004e4e955c2a558a9bc463f65` has to be written as `http://hostname:port/jobs/7684be6004e4e955c2a558a9bc463f65/exceptions`.

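The substitution rule above can be sketched in a few lines of Python (the `fill_path` helper is hypothetical and for illustration only, not part of Flink):

```python
# Illustrative sketch: fill the angle-bracket variables of a HistoryServer
# request path with concrete values. The helper is an assumption for
# illustration; Flink itself does not ship such a function.

def fill_path(pattern: str, **values: str) -> str:
    """Replace each <name> placeholder in the pattern with values[name]."""
    for name, value in values.items():
        pattern = pattern.replace(f"<{name}>", value)
    return pattern

url = "http://hostname:port" + fill_path(
    "/jobs/<jobid>/exceptions", jobid="7684be6004e4e955c2a558a9bc463f65"
)
print(url)  # http://hostname:port/jobs/7684be6004e4e955c2a558a9bc463f65/exceptions
```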
**Application related requests**

- `/applications/overview`
- `/applications/<applicationid>`
- `/applications/<applicationid>/jobmanager/config`
- `/applications/<applicationid>/exceptions`

**Job related requests**

- `/config`
- `/jobs/overview`
- `/jobs/<jobid>`
4 changes: 4 additions & 0 deletions docs/content.zh/docs/deployment/config.md
@@ -258,6 +258,10 @@

{{< generated/common_high_availability_jrs_section >}}

**Options for the ApplicationResultStore in high-availability setups**

{{< generated/common_high_availability_ars_section >}}

**Options for high-availability setups with ZooKeeper**

{{< generated/common_high_availability_zk_section >}}
24 changes: 21 additions & 3 deletions docs/content.zh/docs/deployment/ha/overview.md
@@ -29,7 +29,14 @@ under the License.
# High Availability

The JobManager High Availability (HA) mode hardens a Flink cluster against JobManager failures.
This feature ensures that a Flink cluster will always keep executing your submitted jobs.
This feature ensures that a Flink cluster will always re-execute the applications that were running when the failure occurred.

{{< hint warning >}}
After recovery, jobs that the application submitted before the failure may either continue executing or be discarded, depending on the execution path taken by the application's main() method.

Jobs from before and after the failure are matched by name; jobs with the same name are further matched by their submission order.
To avoid mismatches, especially when the job submission order is non-deterministic, it is recommended to assign each job a unique name via execute(jobName).
{{< /hint >}}
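The matching rule above can be sketched as a small simulation (illustrative only, not Flink's actual recovery code):

```python
# Minimal sketch of the recovery matching rule described above: jobs are
# matched by name first, and jobs sharing a name are paired in submission
# order. This simulation is purely illustrative.
from collections import defaultdict

def match_jobs(before: list[str], after: list[str]) -> list[tuple[str, int, int]]:
    """Pair jobs submitted before and after a failover.

    Returns (name, index_before, index_after) for every matched pair.
    """
    by_name_before = defaultdict(list)
    for i, name in enumerate(before):
        by_name_before[name].append(i)
    seen = defaultdict(int)  # how many jobs of this name were matched so far
    pairs = []
    for j, name in enumerate(after):
        candidates = by_name_before.get(name, [])
        if seen[name] < len(candidates):
            pairs.append((name, candidates[seen[name]], j))
            seen[name] += 1
    return pairs

# Two jobs named "etl" are disambiguated by submission order;
# unique names avoid relying on that order at all.
print(match_jobs(["etl", "etl", "report"], ["etl", "report", "etl"]))
```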

## JobManager High Availability

@@ -70,13 +77,24 @@ Flink provides two kinds of high availability service implementations:

## High Availability Data Lifecycle

To recover submitted jobs, Flink persists metadata and the job artifacts. The HA data will be kept until the corresponding job either succeeds, is cancelled or fails terminally. Once this happens, all the HA data, including the metadata stored in the HA services, will be deleted.
To recover submitted applications, Flink persists the applications' metadata.
The HA data is kept until the corresponding application either succeeds, is cancelled or fails terminally.
Once this happens, all the HA data, including the metadata stored in the HA services, is deleted.
A similar lifecycle also applies to the HA data of individual jobs.

{{< top >}}

## Application Result Store

The Application Result Store is used to archive the final result of applications that reached a terminal state (i.e. finished, cancelled or failed). Its data is stored on a file system (see [application-result-store.storage-path]({{< ref "docs/deployment/config#application-result-store-storage-path" >}})).
An entry is considered dirty as long as the corresponding application has not been properly cleaned up (the data lives in the application's subfolder of [high-availability.storageDir]({{< ref "docs/deployment/config#high-availability-storagedir" >}})).
Dirty entries are subject to cleanup, i.e. the corresponding application is cleaned up either right away or as part of its recovery, and the dirty entries are deleted once the cleanup succeeds. See the Application Result Store configuration parameters under the [HA configuration options]({{< ref "docs/deployment/config#high-availability" >}}) for more details on how to adjust this behavior.
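As a rough illustration of this dirty-entry lifecycle (the file names and layout below are invented for the sketch and do not mirror Flink's actual on-disk format):

```python
# Illustrative simulation of the dirty-entry lifecycle described above.
# Paths and file names are made up for this sketch only.
from pathlib import Path
import tempfile

def archive_result(store: Path, app_id: str) -> Path:
    """Write a dirty entry for an application that reached a terminal state."""
    entry = store / f"{app_id}.json.dirty"
    entry.write_text('{"state": "FINISHED"}')
    return entry

def mark_clean(entry: Path) -> Path:
    """Promote a dirty entry once cleanup of the application succeeded."""
    clean = entry.with_suffix("")  # drop the ".dirty" marker
    entry.rename(clean)
    return clean

with tempfile.TemporaryDirectory() as d:
    store = Path(d)
    dirty = archive_result(store, "app-42")   # dirty until cleanup succeeds
    clean = mark_clean(dirty)                 # cleanup done: dirty marker gone
    print(clean.name)  # app-42.json
```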

{{< top >}}

## Job Result Store

The Job Result Store is used to archive the final result of jobs that reached a globally-terminal state (i.e. finished, cancelled or failed). Its data is stored on a file system (see [job-result-store.storage-path]({{< ref "docs/deployment/config#job-result-store-storage-path" >}})).
An entry is considered dirty as long as the corresponding job has not been properly cleaned up (the data lives in the job's subfolder of [high-availability.storageDir]({{< ref "docs/deployment/config#high-availability-storagedir" >}})).
Dirty entries are subject to cleanup, i.e. the corresponding job is cleaned up either right away or as part of its recovery, and the dirty entries are deleted once the cleanup succeeds. See the Job Result Store configuration parameters under the [HA configuration options]({{< ref "docs/deployment/config#high-availability" >}}) for more details on how to adjust this behavior.
Dirty entries are subject to cleanup, i.e. the corresponding job is cleaned up either right away or as part of its recovery. These entries are deleted once the cleanup succeeds and the corresponding application has created its dirty entry.
{{< top >}}
40 changes: 27 additions & 13 deletions docs/content.zh/docs/deployment/overview.md
@@ -73,8 +73,8 @@
JobManager is the name of the central work coordination component of Flink. It has implementations for different resource providers, which differ on high-availability, resource allocation behavior and supported job submission modes. <br />
JobManager <a href="#deployment-modes">modes for job submissions</a>:
<ul>
<li><b>Application Mode</b>: runs the cluster exclusively for one application. The job's main method (or client) gets executed on the JobManager. Calling `execute`/`executeAsync` multiple times in an application is supported.</li>
<li><b>Session Mode</b>: one JobManager instance manages multiple jobs sharing the same cluster of TaskManagers</li>
<li><b>Application Mode</b>: runs the cluster exclusively for one application. The application's main method (or client) gets executed on the JobManager. Calling `execute`/`executeAsync` multiple times in an application is supported.</li>
<li><b>Session Mode</b>: one JobManager instance manages multiple applications (and all jobs within them) sharing the same cluster of TaskManagers</li>
</ul>
</td>
<td>
@@ -168,6 +168,10 @@
not covered by the repeatable cleanup, i.e. they have to be deleted manually, still. This is
covered by [FLINK-26606](https://issues.apache.org/jira/browse/FLINK-26606).

The application resource cleanup is similar (see the
[High Availability Services / ApplicationResultStore]({{< ref "docs/deployment/ha/overview#applicationresultstore" >}})
section for further details).

## Deployment Modes

Flink can execute applications in two modes:
@@ -184,14 +188,14 @@ Flink can execute applications in two modes:

#### Application Mode

In all the other modes, the application's `main()` method is executed on the client side. This process
If the application's `main()` method is executed on the client side, this process
includes downloading the application's dependencies locally, executing the `main()` to extract a representation
of the application that Flink's runtime can understand (i.e. the `JobGraph`) and shipping the dependencies and
the `JobGraph(s)` to the cluster. This makes the Client a heavy resource consumer as it may need substantial
network bandwidth to download dependencies and ship binaries to the cluster, and CPU cycles to execute the
`main()`. This problem can be more pronounced when the Client is shared across users.

Building on this observation, the *Application Mode* creates a cluster per submitted application, but this time,
Building on this observation, the *Application Mode* creates a cluster per submitted application, and
the `main()` method of the application is executed on the JobManager. Creating a cluster per application can be
seen as creating a session cluster shared only among the jobs of a particular application, and torn down when
the application finishes. With this architecture, the *Application Mode* provides the application granularity resource isolation
@@ -213,12 +217,14 @@
non-blocking, will lead to the "next" job starting before "this" job finishes.

{{< hint warning >}}
The Application Mode allows for multi-`execute()` applications but
High-Availability is not supported in these cases. High-Availability in Application Mode is only
supported for single-`execute()` applications.

Additionally, when any of multiple running jobs in Application Mode (submitted for example using
`executeAsync()`) gets cancelled, all jobs will be stopped and the JobManager will shut down.
The Application Mode allows for multi-job applications (by calling `execute()` or `executeAsync()` multiple times in the `main()` method) but
High-Availability is limited in these cases. High-Availability in Application Mode is only
supported for applications with a single streaming job or multiple batch jobs.
For more details, see [FLIP-560](https://cwiki.apache.org/confluence/display/FLINK/FLIP-560%3A+Application+Capability+Enhancement).

Additionally, when any of multiple running jobs in Application Mode (submitted for example using
`executeAsync()`) gets cancelled, all jobs will be stopped and the JobManager will shut down by default.
This behavior can be configured through the [`execution.terminate-application-on-any-job-terminated-exceptionally`]({{< ref "docs/deployment/config" >}}#execution-terminate-application-on-any-job-terminated-exceptionally) option.
Regular job completions (by the sources shutting down) are supported.
{{< /hint >}}
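As an illustration, the shutdown behavior mentioned above could be relaxed in the Flink configuration file like this (a sketch only; verify the option's default value against the configuration reference):

```yaml
# Keep the application and its remaining jobs running even if one job
# terminates exceptionally (the default shutdown-on-failure behavior
# is assumed here).
execution.terminate-application-on-any-job-terminated-exceptionally: false
```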

Expand All @@ -234,13 +240,21 @@ restarting jobs accessing the filesystem concurrently and making it unavailable
Additionally, having a single cluster running multiple jobs implies more load for the JobManager, who
is responsible for the book-keeping of all the jobs in the cluster.

In Session Mode, the application's `main()` method can be executed either on the client or on the cluster.
When submitting applications via Command-Line Interface (CLI) or the SQL Client, the `main()` method is executed on the client.
However, when submitting applications via the REST API `/jars/:jarid/run-application`,
the `main()` method is executed on the cluster.
This provides the same benefits as Application Mode in terms of resource usage and network bandwidth for the client,
while still maintaining the shared cluster resource model of Session Mode.

#### Summary

In *Session Mode*, the cluster lifecycle is independent of that of any job running on the cluster
and the resources are shared across all jobs. The
*Application Mode* creates a session cluster per application and executes the application's `main()`
In *Session Mode*, the cluster lifecycle is independent of that of any application running on the cluster
and the resources are shared across all applications. The application's `main()` method can be executed either on the client or on the cluster.
*Application Mode* creates a session cluster per application and executes the application's `main()`
method on the cluster.
It thus comes with better resource isolation as the resources are only used by the job(s) launched from a single `main()` method.
This comes at the price of spinning up a dedicated cluster for each application.


