The Deepfactor instrumentation webhook monitors instrumented pods for restarts. If a pod restarts repeatedly, the webhook stops instrumenting it to avoid a potential restart loop. A pod may restart because it hit resource limits (possibly due to overhead added by Deepfactor), because of probe failures, because of an incompatibility with the Deepfactor instrumentation library, or because of an actual bug in the application.
This feature can be disabled via the K8s cluster or namespace level configuration Advanced Option ‘Enable staged instrumentation’. You can also configure the number of allowed restarts after which Deepfactor aborts instrumentation using the ‘Abort Deepfactor instrumentation if pod restarts continuously’ option.
The Deepfactor instrumentation stages are as follows:
- Nominal: When staged instrumentation is enabled, a pod starts[1] with Deepfactor in the Nominal (telemetry) state. It stays in this state for up to (total number of allowed restarts / 2) restarts, rounded down to a whole number, after which it is started in the Debug state.
- Debug: In this state, the pod runs with Deepfactor in the Debug (telemetry and logging) state, in which Deepfactor adds more logs to help with debugging. It stays in this state up to the configured total number of allowed restarts, after which Deepfactor disables instrumentation.
- Disabled: In this state, Deepfactor will not instrument the pod and will start the pod without Deepfactor indefinitely until the pod is deleted. If the pod continues to restart in this state, it is most likely due to an application issue.
The default number of allowed restarts is 6, so the pod runs in the Nominal state for up to 3 restarts and then in the Debug state for up to 6 restarts, after which Deepfactor no longer instruments the pod.
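The stage thresholds described above can be sketched as a small function. This is only an illustration of the arithmetic, not Deepfactor's implementation: the function name and parameters are hypothetical, and treating "up to" as an inclusive boundary is an assumption based on the default example (Nominal through restart 3, Debug through restart 6).

```python
def instrumentation_stage(restart_count: int, allowed_restarts: int = 6) -> str:
    """Return the expected stage for a container that has restarted
    `restart_count` times (hypothetical helper, for illustration only)."""
    if restart_count < 0 or allowed_restarts < 0:
        raise ValueError("restart counts must be non-negative")

    # Half of the allowed restarts, rounded down to a whole number.
    nominal_limit = allowed_restarts // 2

    # Assumption: both thresholds are inclusive ("up to N restarts").
    if restart_count <= nominal_limit:
        return "Nominal"
    if restart_count <= allowed_restarts:
        return "Debug"
    return "Disabled"


if __name__ == "__main__":
    # Walk through the default configuration of 6 allowed restarts.
    for restarts in range(8):
        print(restarts, instrumentation_stage(restarts))
```

With an odd limit such as 5, the rounding step places the Nominal threshold at 2 restarts under the same assumptions.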
Nominal Stage: The container is configured so that every dynamically linked process using a supported libc is instrumented with the Deepfactor runtime. See the Deepfactor support matrix.
Debug Stage: The same as Nominal, with Deepfactor debug logging enabled. The logs may provide timing information that helps diagnose whether a restart occurred due to a resource limit, a probe, or a Deepfactor support issue. This stage is skipped when the Advanced Option “Enable logging” is set, since that option effectively makes the Nominal stage behave as Debug.
Disabled Stage: The container is configured so that no dynamically linked process runs with Deepfactor.
[1] – Every container start in an individual pod, or pod replica, is observed, and Deepfactor determines whether the current container instance should run in the Nominal, Debug, or Disabled state.
Staged Instrumentation Limitations
The Deepfactor Staged Instrumentation implementation depends on the writable ephemeral directory /tmp for a state lock file. The lock file is expected to be cleared on container restart. Most containers in Kubernetes are configured with a default /tmp directory that is writable and not a special emptyDir/etc. volume. Staged Instrumentation, when enabled, will behave with the following limitations depending on the type and usage of the /tmp directory.
a) The /tmp directory is writable, and not a special emptyDir/etc. volume.
– No limitations
b) The /tmp directory is writable, and not a special emptyDir/etc. volume, but the /tmp/df-instr-state.lock file is removed by a process inside the container and the next process that starts clears the environment key-value pair DF_INSTR_STATE_LOCK.
– If the next process that starts does not clear its environment, the process recovers the lock and there is no limitation. However, if all conditions are met, all subsequent processes in the container immediately transition to the next stage in the Nominal, Debug, Disabled sequence, as if the container had restarted.
c) The /tmp directory is writable and a special emptyDir/etc. volume. The /tmp volume is not cleared on container restart.
– All instrumentation will remain in the Nominal stage indefinitely, regardless of container restarts. The effect is the same as if Staged Instrumentation were not enabled.
d) The /tmp directory does not exist or is read-only.
– All processes will be in the Disabled stage indefinitely.
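To check which of the cases above a workload is likely to fall into, the container section of a pod spec can be inspected for how /tmp is provided. The sketch below is a rough heuristic, not a Deepfactor tool: the function name is hypothetical, it takes the container spec as a plain dictionary using the standard Kubernetes fields volumeMounts, mountPath, readOnly, and securityContext.readOnlyRootFilesystem, and it assumes the image itself ships a /tmp directory. Case b depends on runtime behavior inside the container and cannot be detected this way.

```python
def classify_tmp(container: dict) -> str:
    """Guess which limitation case (a, c, or d) applies to a container spec.

    Hypothetical helper: `container` is one entry of spec.containers,
    loaded as a dict (for example from a YAML manifest).
    """
    mounts = container.get("volumeMounts") or []

    # Look for a volume mounted exactly at /tmp (tolerating a trailing slash).
    tmp_mount = next(
        (m for m in mounts if m.get("mountPath", "").rstrip("/") == "/tmp"),
        None,
    )

    if tmp_mount is not None:
        # A read-only mount makes /tmp unwritable: case d.
        if tmp_mount.get("readOnly", False):
            return "d"
        # A writable volume survives container restarts: case c.
        return "c"

    # No volume at /tmp: writability follows the root filesystem setting.
    security = container.get("securityContext") or {}
    if security.get("readOnlyRootFilesystem", False):
        return "d"

    # Default writable /tmp in the container filesystem: case a.
    return "a"


if __name__ == "__main__":
    example = {
        "name": "app",
        "volumeMounts": [{"name": "scratch", "mountPath": "/tmp"}],
    }
    print(classify_tmp(example))
```

A result of "c" or "d" suggests that staged instrumentation will not step through its stages as described for that container.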