Cluster Features¶
The nf-k8s-cloud executor runs Nextflow tasks as Kubernetes Jobs on EKS. This page documents the resource directives, scheduling controls, S3 staging model, error handling, and bin-packing considerations available to workflow developers.
Resource Directives¶
Standard Nextflow resource directives are mapped to Kubernetes pod resource specifications:
| Directive | K8s Mapping | Behavior |
|---|---|---|
| cpus | CPU request (+ optional limit) | Always set as request. Limit only when cluster enables it. Pods can burst if node allows. |
| memory | Memory request and limit | Both set to same value. OOM-killed (exit 137) if exceeded. |
| disk | Ephemeral storage request and limit | Both always set. Evicted if exceeded. Must cover staged inputs + outputs + temp files. |
| time | Active deadline | Killed with SIGTERM (exit 143) when deadline expires. |
| accelerator | GPU resources | Request GPU devices for tasks that need them. |
Warning
Memory and disk limits are hard enforced by Kubernetes. A pod that exceeds its memory limit is OOM-killed immediately. A pod that exceeds its disk limit is evicted. Always include headroom in your resource allocations.
Retry Scaling¶
Resource directives can be scaled with task.attempt for automatic retry scaling:
process MY_PROCESS {
    memory { 16.GB * task.attempt }
    disk { 100.GB * task.attempt }
    time { 8.h * task.attempt }
    maxRetries 2
    // ...
}
Note
Never scale CPUs with task.attempt. No task fails from insufficient CPUs -- they just run slower. Scaling CPUs on retry wastes resources without fixing the actual failure.
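To make the effect of the retry-scaled directives concrete, the arithmetic can be sketched in Python. This is an illustration only: `scaled_request` is a hypothetical helper, and the sketch assumes that `maxRetries 2` allows up to three attempts (the first run plus two retries).

```python
def scaled_request(base: float, attempt: int) -> float:
    """Mirror `base * task.attempt` from the process definition."""
    if attempt < 1:
        raise ValueError("task.attempt starts at 1")
    return base * attempt


MAX_RETRIES = 2  # from `maxRetries 2`; assumed to mean attempts 1..3

# One row per attempt, using the base values from the example process.
schedule = [
    {
        "attempt": attempt,
        "memory_gb": scaled_request(16, attempt),
        "disk_gb": scaled_request(100, attempt),
        "time_h": scaled_request(8, attempt),
    }
    for attempt in range(1, MAX_RETRIES + 2)
]

for row in schedule:
    print(row)
```

Under these assumptions the final attempt requests 48 GB of memory, 300 GB of disk, and a 24-hour deadline, which is worth checking against the largest node the cluster can provision.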
Resource Labels (Scheduling Hints)¶
The cluster config reads resourceLabels from each process and uses them to control pod scheduling. These are not standard Nextflow directives -- they are custom labels interpreted by the cluster configuration.
| Label | Values | Effect |
|---|---|---|
| capacityType | "spot", "on-demand" | Schedule on spot or on-demand instances. |
| instanceLocalNvme | true | Require local NVMe storage. For I/O-heavy processes. |
| instanceNetworkBandwidth | integer (Mbps) | Require minimum network bandwidth. For S3-bottlenecked processes. |
Spot vs On-Demand Strategy¶
The standard pattern runs attempt 1 on cheaper spot instances and falls back to on-demand on retry:
process {
    // Closure, so task.attempt is evaluated per task attempt rather than when the config is parsed
    resourceLabels = {
        [capacityType: task.attempt == 1 ? 'spot' : 'on-demand']
    }
}
Spot instances can be preempted by EC2 at any time. When this happens, Nextflow retries the task automatically. The retry cost (wasted compute time + wall-clock delay) is the key trade-off -- data integrity is not at risk because partial outputs from interrupted pods are never used.
Guidelines for capacity type selection:
| Avg Execution Time | Recommendation |
|---|---|
| < 15 min | Spot-first -- very low interruption risk |
| 15 min - 45 min | Spot-first -- low risk, cheap retry |
| 45 min - 2 hr | Case-by-case -- spot saves money, retries cost time |
| > 2 hr | On-demand justified if retry cost is high |
Some processes may warrant on-demand for all attempts if they are on the critical path and their runtime makes retries very expensive. Use the Cost & Performance analysis to quantify per-process savings.
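The guideline table can be encoded as a small lookup for use in analysis scripts. The sketch below is one possible reading, not part of the executor: `recommend_capacity` is a hypothetical helper, and the handling of the exact boundary values (15, 45, and 120 minutes) is an assumption, since the table does not say which row they belong to.

```python
def recommend_capacity(avg_minutes: float) -> str:
    """Map an average execution time (minutes) to the table's recommendation."""
    if avg_minutes < 0:
        raise ValueError("execution time cannot be negative")
    if avg_minutes < 15:
        return "spot-first: very low interruption risk"
    if avg_minutes <= 45:  # boundary handling is an assumption
        return "spot-first: low risk, cheap retry"
    if avg_minutes <= 120:  # boundary handling is an assumption
        return "case-by-case: spot saves money, retries cost time"
    return "on-demand justified if retry cost is high"


for minutes in (5, 30, 90, 300):
    print(minutes, "->", recommend_capacity(minutes))
```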
NVMe and Network Bandwidth¶
process IO_HEAVY_TASK {
    // Closure form, so task.attempt is resolved for each attempt
    resourceLabels {
        [
            capacityType: task.attempt == 1 ? 'spot' : 'on-demand',
            instanceLocalNvme: true,
            instanceNetworkBandwidth: 25000 // 25 Gbps
        ]
    }
    // ...
}
Tip
Request NVMe storage for processes that are staging-bound (high overhead relative to execution time). Request higher network bandwidth for processes with large S3 transfers.
S3 Staging Model¶
The nf-k8s-cloud executor uses s5cmd for high-performance S3 file transfers. Understanding the staging model is important for disk sizing.
How It Works¶
- Stage-in: Input files are downloaded from S3 to pod-local /tmp before the task command runs.
- Execution: The task command reads inputs from and writes outputs to /tmp.
- Stage-out: Output files are uploaded from /tmp back to S3 after the task command completes.
The cluster config mounts an emptyDir at /tmp and sets scratch = "/tmp", so all staged inputs, outputs, and temporary files land in ephemeral storage.
Disk Budget¶
The disk directive must cover the peak ephemeral storage footprint:
disk_needed = staged_inputs + task_outputs + temporary_files
Where:
- Staged inputs -- All input files downloaded from S3 before the task starts. For processes with large BAM/CRAM/FASTQ inputs, this is often the dominant component.
- Task outputs -- Files produced by the task command.
- Temporary files -- Working data created during execution (e.g., samtools sort temp files).
Warning
There is no runtime metric in Tracker that directly measures peak disk usage during execution. The rchar/wchar I/O metrics in Tracker only capture the task command's I/O -- they do not include staged input files. Use Datadog's kubernetes.ephemeral_storage.usage metric for actual disk usage data.
Disk Sizing Formula¶
A reasonable starting point for disk allocation:
recommended_disk = (estimated_input_size + wchar + working_scratch) * 1.5
The 1.5x factor is the safety margin. Validate the result against the process's history of disk-full failures and raise it if tasks are still being evicted.
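As a worked example of the sizing formula, the following Python sketch computes a disk request from the three components. The helper name, the rounding up to whole gigabytes, and the input figures are all illustrative assumptions; only the formula itself comes from this page.

```python
import math


def recommended_disk_gb(
    estimated_input_gb: float,
    wchar_gb: float,
    working_scratch_gb: float,
    safety_margin: float = 1.5,
) -> int:
    """(estimated_input_size + wchar + working_scratch) * safety margin."""
    peak_gb = estimated_input_gb + wchar_gb + working_scratch_gb
    # Rounding up to a whole GB is an assumption made for convenience.
    return math.ceil(peak_gb * safety_margin)


# Hypothetical task: 40 GB staged inputs, 25 GB written, 15 GB sort scratch.
print(recommended_disk_gb(40, 25, 15))  # (40 + 25 + 15) * 1.5 = 120
```

The result would then go into the process as `disk 120.GB`, optionally scaled by `task.attempt`.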
Error Handling and Exit Codes¶
When a task fails, the exit code and exception type indicate the cause:
| Scenario | Exit Code | Exception | Notes |
|---|---|---|---|
| Script error | 1, 2, etc. | -- | Check .command.err |
| Command not found | 127 | -- | Binary missing from image |
| OOM killed | 137 | K8sOutOfMemory... | Memory exceeded limit |
| Timeout | 143 | -- | Deadline exceeded (time) |
| Spot preemption | null or 137 | NodeTermination... | EC2 reclaimed spot instance |
| Disk exhaustion | 137 or eviction | -- | Ephemeral storage exceeded |
| Pod unschedulable | -- | PodUnschedulable... | No node fits resource requests |
| Image pull failure | -- | PodUnschedulable... | Image not found or inaccessible |
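Because several rows share an exit code, a lookup from exit code to cause returns a list of candidates rather than a single answer. The Python sketch below encodes only the rows of the table that carry an exit code. The names are hypothetical, `None` stands in for a null exit code, and treating any other non-zero code as a script error is an assumption based on the "1, 2, etc." row.

```python
from typing import List, Optional

# Exit codes taken from the table; rows without an exit code
# (pod unschedulable, image pull failure) are identified by exception instead.
CANDIDATE_CAUSES = {
    127: ["command not found"],
    137: ["OOM killed", "spot preemption", "disk exhaustion"],
    143: ["timeout"],
    None: ["spot preemption"],
}


def candidate_causes(exit_code: Optional[int]) -> List[str]:
    """Return every cause from the table that is consistent with the exit code."""
    if exit_code in CANDIDATE_CAUSES:
        return CANDIDATE_CAUSES[exit_code]
    if exit_code is not None and exit_code != 0:
        return ["script error"]  # assumption: other non-zero codes come from the script
    return []


print(candidate_causes(137))
print(candidate_causes(1))
```

Exit code 137 alone yields three candidates, which is why the exit code has to be combined with the exception type and the Datadog metrics before drawing a conclusion.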
Distinguishing Killed Tasks¶
Exit codes 137, 143, and null are ambiguous -- multiple causes share the same exit codes. Tracker's peak_rss metric is never captured for killed tasks because the pod is terminated before Nextflow can record resource metrics.
To determine the actual cause of a killed task:
- Check disk first using Datadog: kubernetes.ephemeral_storage.usage{kube_job:<native_id>}. Disk exhaustion is commonly mistaken for OOM.
- Check memory using Datadog: kubernetes.memory.rss{kube_job:<native_id>}. Compare against the allocated memory.
- Check for timeout: If the task duration closely matches the configured time directive (e.g., exactly 16 hours), this is likely a timeout.
- Check for spot preemption: If multiple tasks on the same node fail at the same time with infrastructure exit codes (not 1 or 2), this suggests a node-level event. If the instance was spot, preemption is likely.
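The checks above can be expressed as an ordered decision procedure. The Python sketch below is a simplified illustration, not the logic of any existing tool. The `KilledTask` fields are hypothetical names for values you would collect by hand from Datadog and Tracker, and the 95% "close to the limit" threshold is an assumption, chosen because sampled metrics rarely show the exact peak.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KilledTask:
    disk_peak_gb: Optional[float]    # from kubernetes.ephemeral_storage.usage, if available
    disk_limit_gb: float             # the disk directive
    memory_peak_gb: Optional[float]  # from kubernetes.memory.rss, if available
    memory_limit_gb: float           # the memory directive
    duration_h: float
    time_limit_h: float              # the time directive
    peers_failed_on_node: int        # other tasks on the same node that failed at the same time
    capacity_type: str               # "spot" or "on-demand"


def triage(task: KilledTask, near: float = 0.95) -> str:
    """Apply the checks in the documented order: disk, memory, timeout, node event."""
    if task.disk_peak_gb is not None and task.disk_peak_gb >= near * task.disk_limit_gb:
        return "disk exhaustion"
    if task.memory_peak_gb is not None and task.memory_peak_gb >= near * task.memory_limit_gb:
        return "out of memory"
    if task.duration_h >= near * task.time_limit_h:
        return "timeout"
    if task.peers_failed_on_node > 0:
        return "spot preemption" if task.capacity_type == "spot" else "node-level event"
    return "undetermined"


example = KilledTask(98.0, 100.0, 4.0, 16.0, 1.0, 8.0, 0, "spot")
print(triage(example))
```

Disk is checked before memory on purpose, matching the guidance that disk exhaustion is commonly mistaken for OOM.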
Note
See Claude Skills for AI-powered investigation tools that automate this analysis.
Common Failure Patterns¶
| Pattern | Likely Cause | Next Step |
|---|---|---|
| Single task failed, exit 1 | Code bug for specific input | Check .command.err |
| Multiple tasks, exit 1 | Systematic code issue | Compare across revisions |
| Exit 137/143, same process | Resource limit (disk, memory, timeout) | Check Datadog metrics |
| Exit 137/143, multiple processes, same time | Cluster or node event | Check if tasks shared a node |
| Retries succeed | Transient (spot preemption, node pressure) | Likely self-resolving |
| Retries fail identically | Deterministic (code bug, resource limit) | Needs a fix |
Trace Fields¶
The nf-k8s-cloud executor adds EKS-specific fields to each task attempt, available in Tracker for investigation:
| Field | Description | Example |
|---|---|---|
| native_id | Kubernetes Job name | nf-a1b2c3d4-5678 |
| hostname | EC2 instance ID | i-0abc123def456789 |
| instance_type | EC2 instance type | i4i.4xlarge |
| capacity_type | Spot or on-demand | spot |
The native_id is the key for joining Tracker data with Datadog metrics (via the kube_job tag) and Kubernetes inspection (via kubectl).
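As a small illustration of that join, the Python sketch below turns a `native_id` into the two Datadog queries used on this page and a `kubectl` command line. The helper name is hypothetical, and the `kubectl describe job` call assumes the Job still exists and lives in the current namespace, which this page does not state.

```python
from typing import Dict, List, Union


def investigation_handles(native_id: str) -> Dict[str, Union[str, List[str]]]:
    """Build Datadog query strings and a kubectl command for one task attempt."""
    tag = f"kube_job:{native_id}"
    return {
        "disk_query": f"kubernetes.ephemeral_storage.usage{{{tag}}}",
        "memory_query": f"kubernetes.memory.rss{{{tag}}}",
        # Assumption: the Job is in the current namespace and has not been cleaned up.
        "kubectl": ["kubectl", "describe", "job", native_id],
    }


handles = investigation_handles("nf-a1b2c3d4-5678")
for name, value in handles.items():
    print(name, "=", value)
```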
Bin-Packing Considerations¶
The cluster uses i4i instances with a fixed ratio of 8 GB memory per vCPU. A task's resource request determines how many vCPU-slots it occupies on a node, which affects how many tasks can run in parallel.
Effective vCPU-Equivalents¶
effective_vcpu_equiv = max(cpus, ceil(memory_gb / 8))
This represents the number of vCPU-slots the task occupies, accounting for the memory-to-CPU ratio.
| Example | CPUs | Memory | Eff. vCPU-equiv | Status |
|---|---|---|---|---|
| Memory-bound | 2 | 32 GB | 4 | Wastes 2 CPU slots per task |
| CPU-bound | 14 | 8 GB | 14 | Efficient |
| Balanced | 2 | 16 GB | 2 | Efficient |
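The formula and the three table rows can be checked with a few lines of Python. The function name mirrors the formula; the constant name is a hypothetical label for the 8 GB per vCPU ratio stated above.

```python
import math

GB_PER_VCPU = 8  # fixed memory-to-vCPU ratio of the cluster's instances


def effective_vcpu_equiv(cpus: int, memory_gb: float) -> int:
    """max(cpus, ceil(memory_gb / 8)): vCPU-slots a task occupies on a node."""
    return max(cpus, math.ceil(memory_gb / GB_PER_VCPU))


print(effective_vcpu_equiv(2, 32))   # memory-bound row: 4
print(effective_vcpu_equiv(14, 8))   # CPU-bound row: 14
print(effective_vcpu_equiv(2, 16))   # balanced row: 2
```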
Node Capacity¶
On an i4i.4xlarge (16 vCPU, 128 GB memory):
tasks_per_node = floor(16 / effective_vcpu_equiv)
Ephemeral storage is not typically the constraining dimension on i4i instances because they have large NVMe drives (3,750 GB on i4i.4xlarge).
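To see how the request shape translates into parallelism, the following Python sketch applies the node-capacity formula to an i4i.4xlarge. It is a simplified estimate: the helper names are hypothetical, and it ignores any CPU and memory reserved for the kubelet and system pods, so real nodes may fit fewer tasks.

```python
import math

NODE_VCPUS = 16   # i4i.4xlarge
GB_PER_VCPU = 8   # 128 GB / 16 vCPU


def tasks_per_node(cpus: int, memory_gb: float) -> int:
    """floor(16 / effective_vcpu_equiv) for an i4i.4xlarge."""
    effective = max(cpus, math.ceil(memory_gb / GB_PER_VCPU))
    return NODE_VCPUS // effective


def idle_vcpus(cpus: int, memory_gb: float) -> int:
    """vCPUs left unrequested when the node is packed with this task shape."""
    return NODE_VCPUS - tasks_per_node(cpus, memory_gb) * cpus


print(tasks_per_node(2, 32), idle_vcpus(2, 32))   # 4 tasks, 8 vCPUs idle
print(tasks_per_node(14, 8), idle_vcpus(14, 8))   # 1 task, 2 vCPUs idle
print(tasks_per_node(2, 16), idle_vcpus(2, 16))   # 8 tasks, 0 vCPUs idle
```

The 2 CPU / 32 GB shape fills the node's memory with four tasks while half of its vCPUs go unrequested.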
Over-Provisioned Memory¶
When effective_vcpu_equiv > 2 * cpus, the process is requesting disproportionately more memory than CPU. Each task wastes CPU slots on the node that could serve other tasks.
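This threshold is easy to apply across a list of process configurations. The Python sketch below is a hypothetical helper, assuming the same 8 GB per vCPU ratio; it flags a request only when the condition is strictly true, as written above.

```python
import math


def is_memory_over_provisioned(cpus: int, memory_gb: float) -> bool:
    """True when effective_vcpu_equiv > 2 * cpus."""
    effective = max(cpus, math.ceil(memory_gb / 8))
    return effective > 2 * cpus


print(is_memory_over_provisioned(2, 64))  # 8 > 4, flagged
print(is_memory_over_provisioned(2, 32))  # 4 > 4 is false, not flagged
```

A 2 CPU / 32 GB request sits exactly on the threshold, so it is not flagged by this rule even though it still leaves CPU slots unused.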
Tip
If a process is memory-bound, check whether its tool supports a disk-backed mode (e.g., samtools sort, STAR) before increasing memory. Trading memory for disk is often more efficient on storage-optimized instances.
For systematic resource optimization across all processes, see the nf-right-sizing Claude skill.