Cluster Features

The nf-k8s-cloud executor runs Nextflow tasks as Kubernetes Jobs on EKS. This page documents the resource directives, scheduling controls, S3 staging model, error handling, and bin-packing considerations available to workflow developers.

Resource Directives

Standard Nextflow resource directives are mapped to Kubernetes pod resource specifications:

Directive	K8s Mapping	Behavior
`cpus`	CPU request (+ optional limit)	Always set as request. Limit only when cluster enables it. Pods can burst if node allows.
`memory`	Memory request and limit	Both set to same value. OOM-killed (exit 137) if exceeded.
`disk`	Ephemeral storage request and limit	Both always set. Evicted if exceeded. Must cover staged inputs + outputs + temp files.
`time`	Active deadline	Killed with SIGTERM (exit 143) when deadline expires.
`accelerator`	GPU resources	Request GPU devices for tasks that need them.

Warning

Kubernetes strictly enforces memory and disk limits. A pod that exceeds its memory limit is OOM-killed immediately; a pod that exceeds its disk limit is evicted. Always include headroom in your resource allocations.

Retry Scaling

Resource directives can be multiplied by task.attempt so that each retry requests more resources:

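process MY_PROCESS {
    // Each retry requests one more multiple of the base value.
    memory { 16.GB * task.attempt }
    disk { 100.GB * task.attempt }
    time { 8.h * task.attempt }
    // maxRetries only takes effect when errorStrategy is 'retry'
    // (set in the process or inherited from the cluster config).
    maxRetries 2

    // ...
}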

Note

Never scale cpus with task.attempt. Tasks do not fail from insufficient CPU; they only run slower. Scaling CPUs on retry wastes resources without fixing the actual failure.
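The linear scaling in the MY_PROCESS example can be tabulated to see what each attempt actually requests. The following is a minimal Python sketch of that arithmetic only; the helper name `retry_schedule` is hypothetical and is not part of Nextflow or the executor.

```python
def retry_schedule(base, max_retries):
    """Return the value requested on each attempt under `base * task.attempt`.

    With maxRetries = N, a task can run up to N + 1 times (attempts 1..N+1).
    """
    return [base * attempt for attempt in range(1, max_retries + 2)]


# Base values taken from the MY_PROCESS example (maxRetries 2).
memory_gb = retry_schedule(16, max_retries=2)
disk_gb = retry_schedule(100, max_retries=2)
time_h = retry_schedule(8, max_retries=2)

for attempt, (m, d, t) in enumerate(zip(memory_gb, disk_gb, time_h), start=1):
    print(f"attempt {attempt}: memory={m} GB, disk={d} GB, time={t} h")
```

The final attempt requests three times the base values (48 GB memory, 300 GB disk, 24 h), which is worth checking against node capacity before choosing a base value.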

Resource Labels (Scheduling Hints)

The cluster config reads resourceLabels from each process and uses them to control pod scheduling. The label keys below are not standard Nextflow directives; they are custom labels interpreted by the cluster configuration.

Label	Values	Effect
`capacityType`	`"spot"`, `"on-demand"`	Schedule on spot or on-demand instances.
`instanceLocalNvme`	`true`	Require local NVMe storage. For I/O-heavy processes.
`instanceNetworkBandwidth`	integer (Mbps)	Require minimum network bandwidth. For S3-bottlenecked processes.

Spot vs On-Demand Strategy

The standard pattern runs attempt 1 on cheaper spot instances and falls back to on-demand on retry:

process {
    // A closure defers evaluation until task.attempt is known for each task.
    resourceLabels = {
        [capacityType: task.attempt == 1 ? 'spot' : 'on-demand']
    }
}

Spot instances can be preempted by EC2 at any time. When this happens, Nextflow retries the task automatically. The retry cost (wasted compute time + wall-clock delay) is the key trade-off -- data integrity is not at risk because partial outputs from interrupted pods are never used.

Guidelines for capacity type selection:

Avg Execution Time	Recommendation
< 15 min	Spot-first -- very low interruption risk
15 min - 45 min	Spot-first -- low risk, cheap retry
45 min - 2 hr	Case-by-case -- spot saves money, retries cost time
> 2 hr	On-demand justified if retry cost is high

Some processes may warrant on-demand for all attempts if they are on the critical path and their runtime makes retries very expensive. Use the Cost & Performance analysis to quantify per-process savings.
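The guideline table can be expressed as a small lookup, which is handy when screening many processes at once. This is a sketch only: the function name `recommend_capacity` is hypothetical, and the handling of the exact boundaries (15, 45, and 120 minutes) is an assumption because the table does not specify it.

```python
def recommend_capacity(avg_minutes):
    """Map a process's average execution time to the guideline table's advice.

    Assumption: 15 and 45 minutes fall into the longer bucket, and exactly
    120 minutes is still treated as case-by-case.
    """
    if avg_minutes < 15:
        return "spot-first (very low interruption risk)"
    if avg_minutes < 45:
        return "spot-first (low risk, cheap retry)"
    if avg_minutes <= 120:
        return "case-by-case (spot saves money, retries cost time)"
    return "on-demand justified if retry cost is high"


for minutes in (5, 30, 90, 300):
    print(f"{minutes:>3} min -> {recommend_capacity(minutes)}")
```

The result is a starting point, not a decision: critical-path processes may still warrant on-demand regardless of the bucket they fall into.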

NVMe and Network Bandwidth

process IO_HEAVY_TASK {
    // Closure form is needed because the labels depend on task.attempt.
    resourceLabels {
        [
            capacityType: task.attempt == 1 ? 'spot' : 'on-demand',
            instanceLocalNvme: true,
            instanceNetworkBandwidth: 25000  // Mbps (25 Gbps)
        ]
    }

    // ...
}

Tip

Request NVMe storage for processes that are staging-bound (high overhead relative to execution time). Request higher network bandwidth for processes with large S3 transfers.

S3 Staging Model

The nf-k8s-cloud executor uses s5cmd for high-performance S3 file transfers. Understanding the staging model is important for disk sizing.

How It Works

1. Stage-in: Input files are downloaded from S3 to pod-local /tmp before the task command runs.
2. Execution: The task command reads inputs from and writes outputs to /tmp.
3. Stage-out: Output files are uploaded from /tmp back to S3 after the task command completes.

The cluster config mounts an emptyDir at /tmp and sets scratch = "/tmp", so all staged inputs, outputs, and temporary files land in ephemeral storage.

Disk Budget

The disk directive must cover the peak ephemeral storage footprint:

disk_needed = staged_inputs + task_outputs + temporary_files

Where:

Staged inputs -- All input files downloaded from S3 before the task starts. For processes with large BAM/CRAM/FASTQ inputs, this is often the dominant component.
Task outputs -- Files produced by the task command.
Temporary files -- Working data created during execution (e.g., samtools sort temp files).

Warning

There is no runtime metric in Tracker that directly measures peak disk usage during execution. The rchar/wchar I/O metrics in Tracker only capture the task command's I/O -- they do not include staged input files. Use Datadog's kubernetes.ephemeral_storage.usage metric for actual disk usage data.

Disk Sizing Formula

A reasonable starting point for disk allocation:

recommended_disk = (estimated_input_size + wchar + working_scratch) * 1.5

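Here wchar is the volume of data written by the task command as reported in Tracker, and the 1.5x factor is the safety margin. Validate the result against the process's history of disk-full failures.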
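As a worked example of the sizing formula, the sketch below computes a starting `disk` value from the three components. The function name `recommended_disk_gb` and the sample sizes are hypothetical, and all inputs are assumed to have been converted to GB beforehand.

```python
def recommended_disk_gb(estimated_input_gb, wchar_gb, working_scratch_gb, margin=1.5):
    """Apply the sizing formula: (inputs + wchar + scratch) * margin."""
    if min(estimated_input_gb, wchar_gb, working_scratch_gb) < 0:
        raise ValueError("sizes must be non-negative")
    return (estimated_input_gb + wchar_gb + working_scratch_gb) * margin


# Hypothetical task: 40 GB of staged inputs, 20 GB written by the command,
# and 20 GB of temporary sort files.
disk = recommended_disk_gb(estimated_input_gb=40, wchar_gb=20, working_scratch_gb=20)
print(f"suggested disk directive: {disk:.0f} GB")
```

For this hypothetical task the formula gives 120 GB, of which 40 GB is margin. If Datadog shows actual peak usage well below the allocation, the margin can be reduced for that process.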

Error Handling and Exit Codes

When a task fails, the exit code and exception type indicate the cause:

Scenario	Exit Code	Exception	Notes
Script error	1, 2, etc.	--	Check `.command.err`
Command not found	127	--	Binary missing from image
OOM killed	137	`K8sOutOfMemory...`	Memory exceeded limit
Timeout	143	--	Deadline exceeded (`time`)
Spot preemption	null or 137	`NodeTermination...`	EC2 reclaimed spot instance
Disk exhaustion	137 or eviction	--	Ephemeral storage exceeded
Pod unschedulable	--	`PodUnschedulable...`	No node fits resource requests
Image pull failure	--	`PodUnschedulable...`	Image not found or inaccessible

Distinguishing Killed Tasks

Exit codes 137, 143, and null are ambiguous because several causes share them. Tracker's peak_rss metric is never captured for killed tasks because the pod is terminated before Nextflow can record resource metrics.

To determine the actual cause of a killed task:

1. Check disk first using Datadog: kubernetes.ephemeral_storage.usage{kube_job:<native_id>}. Disk exhaustion is commonly mistaken for OOM.
2. Check memory using Datadog: kubernetes.memory.rss{kube_job:<native_id>}. Compare against the allocated memory.
3. Check for timeout: If the task duration closely matches the configured time directive (e.g., exactly 16 hours), this is likely a timeout.
4. Check for spot preemption: If multiple tasks on the same node fail at the same time with infrastructure exit codes (not 1 or 2), this suggests a node-level event. If the instance was spot, preemption is likely.
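The order of these checks can be sketched as a small decision function. This is an illustration under stated assumptions, not a tool shipped with the executor: the `KilledTask` record, its field names, and the 95% "close to the limit" threshold are all hypothetical, and the observations are assumed to have been collected by hand from Datadog and Tracker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KilledTask:
    """Hand-collected observations for one killed task (hypothetical fields)."""
    peak_disk_gb: Optional[float]     # from kubernetes.ephemeral_storage.usage
    disk_limit_gb: float              # the task's `disk` directive
    peak_rss_gb: Optional[float]      # from kubernetes.memory.rss
    memory_limit_gb: float            # the task's `memory` directive
    duration_h: float                 # observed wall-clock duration
    time_limit_h: float               # the task's `time` directive
    node_co_failures: int = 0         # other tasks killed on the same node at the same time
    capacity_type: str = "on-demand"  # "spot" or "on-demand"


def likely_cause(task, near=0.95):
    """Apply the checks in order: disk, memory, timeout, node event.

    `near` is an assumed threshold for "close to the limit".
    """
    if task.peak_disk_gb is not None and task.peak_disk_gb >= near * task.disk_limit_gb:
        return "disk exhaustion"
    if task.peak_rss_gb is not None and task.peak_rss_gb >= near * task.memory_limit_gb:
        return "out of memory"
    if task.duration_h >= near * task.time_limit_h:
        return "timeout"
    if task.node_co_failures > 0:
        return "spot preemption" if task.capacity_type == "spot" else "node-level event"
    return "undetermined"


task = KilledTask(peak_disk_gb=99.0, disk_limit_gb=100, peak_rss_gb=9.0,
                  memory_limit_gb=16, duration_h=3.0, time_limit_h=8)
print(likely_cause(task))
```

Disk is checked before memory on purpose, mirroring the advice that disk exhaustion is commonly mistaken for OOM. The result is a hint for where to look next, not a definitive diagnosis.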

Note

See Claude Skills for AI-powered investigation tools that automate this analysis.

Common Failure Patterns

Pattern	Likely Cause	Next Step
Single task failed, exit 1	Code bug for specific input	Check `.command.err`
Multiple tasks, exit 1	Systematic code issue	Compare across revisions
Exit 137/143, same process	Resource limit (disk, memory, timeout)	Check Datadog metrics
Exit 137/143, multiple processes, same time	Cluster or node event	Check if tasks shared a node
Retries succeed	Transient (spot preemption, node pressure)	Likely self-resolving
Retries fail identically	Deterministic (code bug, resource limit)	Needs a fix

Trace Fields

The nf-k8s-cloud executor adds EKS-specific fields to each task attempt, available in Tracker for investigation:

Field	Description	Example
`native_id`	Kubernetes Job name	`nf-a1b2c3d4-5678`
`hostname`	EC2 instance ID	`i-0abc123def456789`
`instance_type`	EC2 instance type	`i4i.4xlarge`
`capacity_type`	Spot or on-demand	`spot`

The native_id is the key for joining Tracker data with Datadog metrics (via the kube_job tag) and Kubernetes inspection (via kubectl).
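A short sketch of how these fields might be used during an investigation: building the Datadog queries for one task from its native_id, and grouping failed attempts by hostname to spot node-level events. The helper names and the sample records are hypothetical; the records are assumed to be rows exported from Tracker as dictionaries keyed by the field names above.

```python
from collections import defaultdict


def datadog_queries(native_id):
    """Build the disk and memory queries for one task, keyed by kube_job tag."""
    tag = f"{{kube_job:{native_id}}}"
    return {
        "disk": f"kubernetes.ephemeral_storage.usage{tag}",
        "memory": f"kubernetes.memory.rss{tag}",
    }


def failures_by_host(attempts):
    """Group failed attempts by `hostname`; keep nodes with more than one failure."""
    groups = defaultdict(list)
    for attempt in attempts:
        groups[attempt["hostname"]].append(attempt["native_id"])
    return {host: ids for host, ids in groups.items() if len(ids) > 1}


# Hypothetical failed attempts exported from Tracker.
attempts = [
    {"native_id": "nf-a1b2c3d4-5678", "hostname": "i-0abc123def456789"},
    {"native_id": "nf-a1b2c3d4-9012", "hostname": "i-0abc123def456789"},
    {"native_id": "nf-a1b2c3d4-3456", "hostname": "i-0fedcba987654321"},
]
shared = failures_by_host(attempts)
print(shared)
print(datadog_queries("nf-a1b2c3d4-5678")["disk"])
```

Grouping by hostname alone does not confirm a node event; the failures also need to have occurred at about the same time, which this sketch does not check.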

Bin-Packing Considerations

The cluster uses i4i instances with a fixed ratio of 8 GB memory per vCPU. A task's resource request determines how many vCPU-slots it occupies on a node, which affects how many tasks can run in parallel.

Effective vCPU-Equivalents

effective_vcpu_equiv = max(cpus, ceil(memory_gb / 8))

This represents the number of vCPU-slots the task occupies, accounting for the memory-to-CPU ratio.

Example	CPUs	Memory	Eff. vCPU-equiv	Status
Memory-bound	2	32 GB	4	Wastes 2 CPU slots per task
CPU-bound	14	8 GB	14	Efficient
Balanced	2	16 GB	2	Efficient
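The formula and the three table rows can be reproduced with a few lines of Python. This is a minimal sketch; the function simply mirrors the formula above, and the constant name `GB_PER_VCPU` is introduced here for illustration.

```python
import math

GB_PER_VCPU = 8  # i4i memory-to-vCPU ratio stated above


def effective_vcpu_equiv(cpus, memory_gb):
    """Number of vCPU-slots a task occupies: max(cpus, ceil(memory_gb / 8))."""
    return max(cpus, math.ceil(memory_gb / GB_PER_VCPU))


examples = [("Memory-bound", 2, 32), ("CPU-bound", 14, 8), ("Balanced", 2, 16)]
for name, cpus, memory_gb in examples:
    print(f"{name}: {effective_vcpu_equiv(cpus, memory_gb)}")
```

The ceiling matters for requests that are not a multiple of 8 GB: a task asking for 2 CPUs and 17 GB occupies 3 slots, not 2.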

Node Capacity

On an i4i.4xlarge (16 vCPU, 128 GB memory):

tasks_per_node = floor(16 / effective_vcpu_equiv)

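Treat this as an upper bound: Kubernetes reserves part of each node for system components, so allocatable capacity is somewhat below the instance totals. Ephemeral storage is not typically the constraining dimension on i4i instances because they have large NVMe drives (3,750 GB on i4i.4xlarge).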
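The tasks_per_node formula can be evaluated for a few slot sizes to see how quickly parallelism drops. The sketch below assumes the full 16 vCPUs of an i4i.4xlarge are available to tasks; the function name and the `node_vcpus` parameter are hypothetical.

```python
def tasks_per_node(effective_vcpu_equiv, node_vcpus=16):
    """floor(node_vcpus / effective_vcpu_equiv), defaulting to an i4i.4xlarge."""
    if effective_vcpu_equiv < 1:
        raise ValueError("effective_vcpu_equiv must be at least 1")
    return node_vcpus // effective_vcpu_equiv


for slots in (2, 4, 14):
    fit = tasks_per_node(slots)
    idle = 16 - fit * slots
    print(f"{slots} slots/task -> {fit} tasks per node, {idle} vCPU slots idle")
```

A task that occupies 14 slots fits only once per node and leaves 2 slots idle, so requests just above a divisor of the node size are the most expensive to pack.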

Over-Provisioned Memory

When effective_vcpu_equiv > 2 * cpus, the process is requesting disproportionately more memory than CPU. Each task wastes CPU slots on the node that could serve other tasks.
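The over-provisioning check can be sketched as follows. The function name `cpu_slot_usage` and the returned keys are hypothetical; the threshold is the `effective_vcpu_equiv > 2 * cpus` rule stated above, and "wasted" slots are taken to be the difference between occupied slots and requested CPUs.

```python
import math


def cpu_slot_usage(cpus, memory_gb, gb_per_vcpu=8):
    """Report occupied slots, wasted CPU slots, and the over-provisioning flag."""
    effective = max(cpus, math.ceil(memory_gb / gb_per_vcpu))
    return {
        "effective_vcpu_equiv": effective,
        "wasted_cpu_slots": effective - cpus,
        "over_provisioned": effective > 2 * cpus,
    }


print(cpu_slot_usage(cpus=2, memory_gb=32))
print(cpu_slot_usage(cpus=2, memory_gb=64))
```

Because the comparison is strict, a task with 2 CPUs and 32 GB wastes 2 slots but sits exactly at the threshold and is not flagged, whereas 2 CPUs with 64 GB wastes 6 slots and is.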

Tip

If a process is memory-bound, check whether its tool supports a disk-backed mode (e.g., samtools sort, STAR) before increasing memory. Trading memory for disk is often more efficient on storage-optimized instances.

For systematic resource optimization across all processes, see the nf-right-sizing Claude skill.