Cluster Features

The nf-k8s-cloud executor runs Nextflow tasks as Kubernetes Jobs on EKS. This page documents the resource directives, scheduling controls, S3 staging model, error handling, and bin-packing considerations available to workflow developers.

Resource Directives

Standard Nextflow resource directives are mapped to Kubernetes pod resource specifications:

| Directive | K8s Mapping | Behavior |
| --- | --- | --- |
| cpus | CPU request (+ optional limit) | Always set as request. Limit only when cluster enables it. Pods can burst if node allows. |
| memory | Memory request and limit | Both set to same value. OOM-killed (exit 137) if exceeded. |
| disk | Ephemeral storage request and limit | Both always set. Evicted if exceeded. Must cover staged inputs + outputs + temp files. |
| time | Active deadline | Killed with SIGTERM (exit 143) when deadline expires. |
| accelerator | GPU resources | Request GPU devices for tasks that need them. |

Warning

Memory and disk limits are hard enforced by Kubernetes. A pod that exceeds its memory limit is OOM-killed immediately. A pod that exceeds its disk limit is evicted. Always include headroom in your resource allocations.

Retry Scaling

Resource directives can be scaled with task.attempt for automatic retry scaling:

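process MY_PROCESS {
    // maxRetries only takes effect with the 'retry' error strategy;
    // omit this line if the cluster config already sets one
    errorStrategy 'retry'
    memory { 16.GB * task.attempt }
    disk { 100.GB * task.attempt }
    time { 8.h * task.attempt }
    maxRetries 2  // up to 3 attempts in total

    // ...
}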
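To see what the retry scaling above requests on each attempt, the multiplication can be worked through in a few lines of Python. This is only an illustration of the arithmetic; the dictionary keys and the function name are made up for this sketch and are not part of Nextflow or the executor.

```python
# Base values copied from the process block above (units in the key names).
BASE = {"memory_gb": 16, "disk_gb": 100, "time_h": 8}
MAX_RETRIES = 2  # maxRetries 2 allows up to 3 attempts in total


def resources_for_attempt(attempt: int) -> dict:
    """Return the resources requested on a given attempt (1-based)."""
    return {name: value * attempt for name, value in BASE.items()}


# Print the request for every possible attempt.
for attempt in range(1, MAX_RETRIES + 2):
    print(attempt, resources_for_attempt(attempt))
```

The final attempt asks for three times the base allocation, so it is worth checking that the largest request still fits on a node.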

Note

Never scale CPUs with task.attempt. No task fails from insufficient CPUs -- they just run slower. Scaling CPUs on retry wastes resources without fixing the actual failure.

Resource Labels (Scheduling Hints)

The cluster config reads resourceLabels from each process and uses them to control pod scheduling. These are not standard Nextflow directives -- they are custom labels interpreted by the cluster configuration.

| Label | Values | Effect |
| --- | --- | --- |
| capacityType | "spot", "on-demand" | Schedule on spot or on-demand instances. |
| instanceLocalNvme | true | Require local NVMe storage. For I/O-heavy processes. |
| instanceNetworkBandwidth | integer (Mbps) | Require minimum network bandwidth. For S3-bottlenecked processes. |

Spot vs On-Demand Strategy

The standard pattern runs attempt 1 on cheaper spot instances and falls back to on-demand on retry:

process {
    // The closure defers evaluation until task.attempt is known
    resourceLabels = {
        [capacityType: task.attempt == 1 ? 'spot' : 'on-demand']
    }
}

Spot instances can be preempted by EC2 at any time. When this happens, Nextflow retries the task automatically. The retry cost (wasted compute time + wall-clock delay) is the key trade-off -- data integrity is not at risk because partial outputs from interrupted pods are never used.

Guidelines for capacity type selection:

| Avg Execution Time | Recommendation |
| --- | --- |
| < 15 min | Spot-first -- very low interruption risk |
| 15 min - 45 min | Spot-first -- low risk, cheap retry |
| 45 min - 2 hr | Case-by-case -- spot saves money, retries cost time |
| > 2 hr | On-demand justified if retry cost is high |

Some processes may warrant on-demand for all attempts if they are on the critical path and their runtime makes retries very expensive. Use the Cost & Performance analysis to quantify per-process savings.
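The guideline table can be expressed as a small lookup, which is handy when reviewing many processes at once. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the handling of the exact boundary values (15, 45, and 120 minutes) is a choice made here, since the table does not say which bucket they belong to.

```python
def capacity_recommendation(avg_minutes: float) -> str:
    """Map a process's average execution time to the guideline above."""
    if avg_minutes < 15:
        return "spot-first (very low interruption risk)"
    if avg_minutes < 45:
        return "spot-first (low risk, cheap retry)"
    if avg_minutes <= 120:
        # Boundary handling at 45 and 120 minutes is an assumption.
        return "case-by-case (spot saves money, retries cost time)"
    return "on-demand justified if retry cost is high"


print(capacity_recommendation(10))
print(capacity_recommendation(90))
print(capacity_recommendation(300))
```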

NVMe and Network Bandwidth

process IO_HEAVY_TASK {
    // Directive syntax (no '='); the closure resolves task.attempt per attempt
    resourceLabels {
        [
            capacityType: task.attempt == 1 ? 'spot' : 'on-demand',
            instanceLocalNvme: true,
            instanceNetworkBandwidth: 25000  // Mbps, i.e. 25 Gbps
        ]
    }

    // ...
}

Tip

Request NVMe storage for processes that are staging-bound (high overhead relative to execution time). Request higher network bandwidth for processes with large S3 transfers.

S3 Staging Model

The nf-k8s-cloud executor uses s5cmd for high-performance S3 file transfers. Understanding the staging model is important for disk sizing.

How It Works

  1. Stage-in: Input files are downloaded from S3 to pod-local /tmp before the task command runs.
  2. Execution: The task command reads inputs from and writes outputs to /tmp.
  3. Stage-out: Output files are uploaded from /tmp back to S3 after the task command completes.

The cluster config mounts an emptyDir at /tmp and sets scratch = "/tmp", so all staged inputs, outputs, and temporary files land in ephemeral storage.

Disk Budget

The disk directive must cover the peak ephemeral storage footprint:

disk_needed = staged_inputs + task_outputs + temporary_files

Where:

  • Staged inputs -- All input files downloaded from S3 before the task starts. For processes with large BAM/CRAM/FASTQ inputs, this is often the dominant component.
  • Task outputs -- Files produced by the task command.
  • Temporary files -- Working data created during execution (e.g., samtools sort temp files).

Warning

There is no runtime metric in Tracker that directly measures peak disk usage during execution. The rchar/wchar I/O metrics in Tracker only capture the task command's I/O -- they do not include staged input files. Use Datadog's kubernetes.ephemeral_storage.usage metric for actual disk usage data.

Disk Sizing Formula

A reasonable starting point for disk allocation:

recommended_disk = (estimated_input_size + wchar + working_scratch) * 1.5

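Here wchar is the Tracker metric for bytes written by the task command. The 1.5x factor is the safety margin and is already part of the formula, so do not apply it a second time. Validate the result against the process's disk-full failure history.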
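As a worked example of the sizing formula, the following Python sketch computes a recommended disk value from the three components. The function name, the GB units, and the sample numbers are assumptions chosen for illustration; only the formula itself comes from this page.

```python
import math


def recommended_disk_gb(
    estimated_input_gb: float,
    wchar_gb: float,
    working_scratch_gb: float,
    safety_margin: float = 1.5,
) -> int:
    """Apply the sizing formula and round up to a whole GB."""
    peak = estimated_input_gb + wchar_gb + working_scratch_gb
    return math.ceil(peak * safety_margin)


# Hypothetical task: 60 GB of staged inputs, 40 GB written, 20 GB of scratch.
print(recommended_disk_gb(60, 40, 20))  # (60 + 40 + 20) * 1.5 = 180
```

Rounding up keeps the request on the safe side when the inputs are fractional.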

Error Handling and Exit Codes

When a task fails, the exit code and exception type indicate the cause:

| Scenario | Exit Code | Exception | Notes |
| --- | --- | --- | --- |
| Script error | 1, 2, etc. | -- | Check .command.err |
| Command not found | 127 | -- | Binary missing from image |
| OOM killed | 137 | K8sOutOfMemory... | Memory exceeded limit |
| Timeout | 143 | -- | Deadline exceeded (time) |
| Spot preemption | null or 137 | NodeTermination... | EC2 reclaimed spot instance |
| Disk exhaustion | 137 or eviction | -- | Ephemeral storage exceeded |
| Pod unschedulable | -- | PodUnschedulable... | No node fits resource requests |
| Image pull failure | -- | PodUnschedulable... | Image not found or inaccessible |

Distinguishing Killed Tasks

Exit codes 137, 143, and null are ambiguous: several causes can produce the same code (for example, 137 can mean OOM, spot preemption, or disk exhaustion). Tracker's peak_rss metric is never captured for killed tasks because the pod is terminated before Nextflow can record resource metrics.

To determine the actual cause of a killed task:

  1. Check disk first using Datadog: kubernetes.ephemeral_storage.usage{kube_job:<native_id>}. Disk exhaustion is commonly mistaken for OOM.
  2. Check memory using Datadog: kubernetes.memory.rss{kube_job:<native_id>}. Compare against the allocated memory.
  3. Check for timeout: If the task duration closely matches the configured time directive (e.g., exactly 16 hours), this is likely a timeout.
  4. Check for spot preemption: If multiple tasks on the same node fail at the same time with infrastructure exit codes (not 1 or 2), this suggests a node-level event. If the instance was spot, preemption is likely.
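The four checks above can be sketched as a small decision function that is applied in the same order. Everything here is illustrative: the KilledTask fields, the function name, and the 95% "near the limit" threshold are assumptions made for this sketch, and the input values would have to be collected by hand from Datadog and Tracker. Sampled metrics can miss a short-lived peak, so an "undetermined" result does not rule out a resource limit.

```python
from dataclasses import dataclass

NEAR_LIMIT = 0.95  # assumed threshold for "usage reached the limit"


@dataclass
class KilledTask:
    disk_used_gb: float     # peak of kubernetes.ephemeral_storage.usage
    disk_limit_gb: float    # value of the disk directive
    rss_gb: float           # peak of kubernetes.memory.rss
    memory_limit_gb: float  # value of the memory directive
    duration_h: float       # observed task duration
    time_limit_h: float     # value of the time directive
    node_peers_failed: int  # other tasks on the same node failing at the same time
    capacity_type: str      # "spot" or "on-demand"


def likely_cause(t: KilledTask) -> str:
    """Walk the checks in order: disk, memory, timeout, node-level event."""
    if t.disk_used_gb >= NEAR_LIMIT * t.disk_limit_gb:
        return "disk exhaustion"
    if t.rss_gb >= NEAR_LIMIT * t.memory_limit_gb:
        return "out of memory"
    if t.duration_h >= NEAR_LIMIT * t.time_limit_h:
        return "timeout"
    if t.node_peers_failed > 0:
        return "spot preemption" if t.capacity_type == "spot" else "node-level event"
    return "undetermined"


task = KilledTask(
    disk_used_gb=98, disk_limit_gb=100,
    rss_gb=10, memory_limit_gb=16,
    duration_h=2, time_limit_h=8,
    node_peers_failed=0, capacity_type="spot",
)
print(likely_cause(task))
```

Checking disk before memory mirrors the advice above that disk exhaustion is commonly mistaken for OOM.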

Note

See Claude Skills for AI-powered investigation tools that automate this analysis.

Common Failure Patterns

| Pattern | Likely Cause | Next Step |
| --- | --- | --- |
| Single task failed, exit 1 | Code bug for specific input | Check .command.err |
| Multiple tasks, exit 1 | Systematic code issue | Compare across revisions |
| Exit 137/143, same process | Resource limit (disk, memory, timeout) | Check Datadog metrics |
| Exit 137/143, multiple processes, same time | Cluster or node event | Check if tasks shared a node |
| Retries succeed | Transient (spot preemption, node pressure) | Likely self-resolving |
| Retries fail identically | Deterministic (code bug, resource limit) | Needs a fix |

Trace Fields

The nf-k8s-cloud executor adds EKS-specific fields to each task attempt, available in Tracker for investigation:

| Field | Description | Example |
| --- | --- | --- |
| native_id | Kubernetes Job name | nf-a1b2c3d4-5678 |
| hostname | EC2 instance ID | i-0abc123def456789 |
| instance_type | EC2 instance type | i4i.4xlarge |
| capacity_type | Spot or on-demand | spot |

The native_id is the key for joining Tracker data with Datadog metrics (via the kube_job tag) and Kubernetes inspection (via kubectl).
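A minimal sketch of that join: given a native_id from Tracker, build the Datadog metric queries used elsewhere on this page and a kubectl command for inspecting the Job. The helper name is hypothetical, and the kubectl command assumes the current context and namespace already point at the right cluster; add a namespace flag if they do not.

```python
def investigation_handles(native_id: str) -> dict:
    """Build Datadog queries and a kubectl command for one task attempt."""
    tag = "{kube_job:" + native_id + "}"
    return {
        "disk_query": "kubernetes.ephemeral_storage.usage" + tag,
        "memory_query": "kubernetes.memory.rss" + tag,
        # Assumes the default namespace of the current kubectl context.
        "kubectl": "kubectl describe job " + native_id,
    }


for name, value in investigation_handles("nf-a1b2c3d4-5678").items():
    print(name, "->", value)
```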

Bin-Packing Considerations

The cluster uses i4i instances with a fixed ratio of 8 GB memory per vCPU. A task's resource request determines how many vCPU-slots it occupies on a node, which affects how many tasks can run in parallel.

Effective vCPU-Equivalents

effective_vcpu_equiv = max(cpus, ceil(memory_gb / 8))

This represents the number of vCPU-slots the task occupies, accounting for the memory-to-CPU ratio.

| Example | CPUs | Memory | Eff. vCPU-equiv | Status |
| --- | --- | --- | --- | --- |
| Memory-bound | 2 | 32 GB | 4 | Wastes 2 CPU slots per task |
| CPU-bound | 14 | 8 GB | 14 | Efficient |
| Balanced | 2 | 16 GB | 2 | Efficient |
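The rows of the table can be reproduced with a direct translation of the formula. This is a sketch for checking your own processes; the function and constant names are chosen here for illustration.

```python
import math

GB_PER_VCPU = 8  # memory-to-vCPU ratio of the i4i instances


def effective_vcpu_equiv(cpus: int, memory_gb: float) -> int:
    """Number of vCPU-slots a task occupies, given its cpus and memory."""
    return max(cpus, math.ceil(memory_gb / GB_PER_VCPU))


# (label, cpus, memory in GB) taken from the table above.
for label, cpus, memory_gb in [
    ("Memory-bound", 2, 32),
    ("CPU-bound", 14, 8),
    ("Balanced", 2, 16),
]:
    print(label, effective_vcpu_equiv(cpus, memory_gb))
```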

Node Capacity

On an i4i.4xlarge (16 vCPU, 128 GB memory):

tasks_per_node = floor(16 / effective_vcpu_equiv)

Ephemeral storage is not typically the constraining dimension on i4i instances because they have large NVMe drives (3,750 GB on i4i.4xlarge).

Over-Provisioned Memory

When effective_vcpu_equiv > 2 * cpus, the process is requesting disproportionately more memory than CPU. Each task wastes CPU slots on the node that could serve other tasks.
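Putting the node capacity formula and the over-provisioning rule together, the sketch below reports how many tasks fit on an i4i.4xlarge and whether a process crosses the 2x threshold. The function names are hypothetical, and the calculation uses the nominal 16 vCPUs; the capacity Kubernetes can actually allocate is usually somewhat lower because of system reservations, so treat the result as an upper bound.

```python
import math

NODE_VCPUS = 16   # i4i.4xlarge, nominal
GB_PER_VCPU = 8   # memory-to-vCPU ratio of the i4i instances


def effective_vcpu_equiv(cpus: int, memory_gb: float) -> int:
    return max(cpus, math.ceil(memory_gb / GB_PER_VCPU))


def tasks_per_node(cpus: int, memory_gb: float) -> int:
    """Upper bound on parallel tasks per node."""
    return NODE_VCPUS // effective_vcpu_equiv(cpus, memory_gb)


def is_memory_over_provisioned(cpus: int, memory_gb: float) -> bool:
    """True when the task occupies more than twice as many slots as it has CPUs."""
    return effective_vcpu_equiv(cpus, memory_gb) > 2 * cpus


# Hypothetical process requesting 2 CPUs and 64 GB of memory.
cpus, memory_gb = 2, 64
n = tasks_per_node(cpus, memory_gb)
print("tasks per node:", n)
print("over-provisioned:", is_memory_over_provisioned(cpus, memory_gb))
print("idle vCPUs per full node:", NODE_VCPUS - n * cpus)
```

Note that a 2 CPU, 32 GB task sits exactly at the threshold (4 slots versus 2 * 2) and is therefore not flagged, even though it leaves CPU slots idle.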

Tip

If a process is memory-bound, check whether its tool supports a disk-backed mode (e.g., samtools sort, STAR) before increasing memory. Trading memory for disk is often more efficient on storage-optimized instances.

For systematic resource optimization across all processes, see the nf-right-sizing Claude skill.