Kubernetes AppOps Security Part 3: Security Context (1/2) – Good Practices
If you’ve ever looked under the hood of a container runtime such as Docker, or run applications on a container runtime in production, you’ll know that the virtual construct of a “container” is at its core a normal Linux process that runs largely isolated from the rest of the system using certain kernel features. This makes containers more lightweight but also more vulnerable than virtual machines (VMs). To reduce this attack surface, container runtimes offer a variety of settings whose default values strike a compromise between usability and security. Beyond that, developers can reduce the attack surface further by following a number of good practices. The benchmark from the Center for Internet Security (CIS) offers a good overview for the widely used container runtime Docker. It contains host and container runtime configurations as well as specific security-related good practices for containers and images.
When running containers on Kubernetes, the same statements apply: as a container orchestrator, Kubernetes abstracts away the underlying container runtimes on which the actual containers run. Many of the container-related good practices that apply to Docker therefore also apply to Kubernetes. In addition, Kubernetes changes parts of the runtimes’ default configuration. In Kubernetes, configurations can be applied to the container runtime at different levels, thereby increasing security. Kubernetes also offers further recommended security mechanisms, such as network policies, which are described in the first part of this article series.
This article describes the configurations that are available in Kubernetes and explains pragmatic good practices that can increase security without too much effort. If you know your way around Docker, you will recognize some of the points made here from the CIS Benchmark for Docker. The CIS Benchmark for Kubernetes, in turn, focuses on cluster operation (API server, kubelet, etc.) and unfortunately contains few recommendations for running applications on the cluster.
Security Context
The most direct way to apply security-relevant configurations in Kubernetes is the security context. It is available on two levels: per pod and per container. Some configurations are possible on both levels; in that case, the container-level configuration takes precedence. Listing 1 shows an example in YAML: it specifies a pod whose containers must not run as the user “root”, with an exception made for special initialization containers (init containers).
```yaml
apiVersion: v1
kind: Pod
# ...
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - name: mustNotRunAsRoot
  initContainers:
  - name: isAllowedToRunAsRoot
    securityContext:
      runAsNonRoot: false
```
Listing 1: A pod with a security context for pod and container
Good Practices
Under the respective “securityContext”, there are multiple configurations to choose from, and there are more options on container level than on pod level. Which configurations are advisable here, and how much effort do they involve? Listing 2 contains a number of container configurations that restrict the default values and which, as experience has shown, provide a good starting point, as they block certain attack vectors without causing too much effort. If an application does not run with this configuration, it can be made less restrictive later. Note that in Kubernetes versions 1.18 and lower, the seccompProfile is configured via an annotation: `seccomp.security.alpha.kubernetes.io/pod: runtime/default`.
The configurations and the effects they have on the application in the container will be explained here for each individual point. If you want to try it yourself in a defined environment, you will find complete examples with instructions in the “cloudogu/k8s-security-demos” repository on GitHub.
```yaml
apiVersion: v1
kind: Pod
# ...
metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
spec:
  containers:
  - name: restricted
    securityContext:
      runAsNonRoot: true
      runAsUser: 100000
      runAsGroup: 100000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
```
Listing 2: A pod whose configuration is restricted compared with the default
Running a container with an unprivileged user
By exploiting a defective configuration or vulnerabilities in the kernel or container runtime, an attacker may be able to break out of the isolation of the container (“container escape”). If this attempt succeeds, the attacker gains the rights of the user that runs the process on the host system outside of the container. It is therefore crucial to avoid running the container as the user “root” (user ID 0), since in this case the attacker would have full rights on the host system. Moreover, it is advisable to give containers a user ID and group ID that are not already allocated to an existing user on the system, since existing users may have access rights to files on the host that the attacker could exploit.
There are possible solutions to this issue on several levels:
- In the image (for example with the Dockerfile command `USER 100000`).
- In the Kubernetes security context with `runAsNonRoot: true`: the container will be prevented from running if it attempts to launch with UID 0 (“root”).
- In the Kubernetes security context with `runAsUser` and `runAsGroup`. It is advisable to select a value greater than 10000, as this reduces the likelihood that this value is already taken on the host system. In Dockerfiles, 1000 is often used, which usually corresponds to the first user created on the system. This is convenient during development, but it can provide an attack vector in a production system.
Of all of the configurations shown in Listing 2, these have the greatest impact on reducing the attack surface. However, they also cause the most effort, because:
- Many official images are designed to run as “root” (for example, NGINX or PostgreSQL). There are usually third-party alternatives available, such as those from [bitnami](https://hub.docker.com/u/bitnami "bitnami's Profile – Docker Hub"). NGINX even offers a dedicated image that runs as an unprivileged user (link to the image). Generally, care must be taken when selecting the image, since not all images found on the Internet are trustworthy. Another alternative is to build an image for the desired application yourself that explicitly defines a user. The simplest approach is to use the official image as a base image.
- `runAsNonRoot` only works with a numerical ID. Therefore, if the Dockerfile contains `USER node`, for example, the container will fail to run. This can be circumvented by explicitly setting `runAsUser`.
- Depending on the owner and access mode of files in a container or in volumes, the container may not have access to them at runtime. This can be resolved by changing owners, groups, or the mode (using the `chmod` command). Depending on how it is used, this will require a change when launching the pod or the creation of a new image. With volumes, this can be done using an `initContainer` that has more rights than the actual application container. Listing 1 shows one possibility, and a minimal sketch follows this list. Another common solution is to run the processes in the root group (group ID 0), for example in OpenShift. For this to work, group 0 must be given access rights to the files in the image, for example like this: `chgrp -R 0 /some/directory && chmod -R g=u /some/directory`.
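As announced in the list above, here is a minimal sketch of the `initContainer` approach; the volume name `data-volume`, the mount path `/data`, and the images are assumptions for illustration only, not part of the original example:

```yaml
apiVersion: v1
kind: Pod
# ...
spec:
  securityContext:
    runAsNonRoot: true       # applies to all containers unless overridden
    runAsUser: 100000
    runAsGroup: 100000
  volumes:
  - name: data-volume        # hypothetical volume name
    emptyDir: {}
  initContainers:
  - name: fix-permissions
    image: busybox           # assumption: any image that provides chown
    # hand the volume over to the unprivileged user of the application container
    command: ["sh", "-c", "chown -R 100000:100000 /data"]
    securityContext:
      runAsNonRoot: false    # exception: this container may run as root
      runAsUser: 0
    volumeMounts:
    - name: data-volume
      mountPath: /data
  containers:
  - name: app
    image: my-app:1.0        # placeholder image
    volumeMounts:
    - name: data-volume
      mountPath: /data
```

Because the container-level security context takes precedence, only the init container runs as root; the application container keeps the unprivileged user and group defined at pod level.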
A recent example of a vulnerability whose exploitation this configuration can prevent is CVE-2019-5736: a vulnerability in the low-level container runtime runc (which is also used by Docker) that allows attackers to escape the container if it is run as the user “root”.
Read-only root file system
An attacker can also cause damage without escaping the container. For example, they can compromise the application’s code, thereby gaining access to user data. Moreover, within the container, an attacker can download programs from the Internet in order to extend the attack to a non-public network that is reachable from the container. The first case in particular can be prevented entirely by using a read-only file system in the container. In order to allow for the shortest possible startup times and efficient storage usage, containers use a copy-on-write mechanism: the image’s file system is not copied upon launch but only layered. Each container launched from an image is assigned its own layer, in which the files written at runtime are stored. If a file that is present in the image is changed, it is first copied to the container layer and then changed. A read access first checks whether the file is present in the container layer; if it is not, it is delivered from the image.
With `readOnlyRootFilesystem: true`, the container layer is deactivated, which means that it is no longer possible to write to the file system during runtime. This means that the code can no longer be changed. Even if package managers are installed, these will usually no longer work, which will make downloading applications more difficult. A positive side effect is improved performance when reading.
However, most applications are not innately compatible with read-only file systems. For example, web servers require temporary directories for caching (often `/tmp`). In a Kubernetes pod, these required directories can be made available as volumes, for example as `emptyDir`.
The `diff` command in Docker can be used to identify these directories. It shows what is in the container layer. One possible approach is as follows:
- Launch the application locally as a Docker container (not with a read-only file system),
- Run tests on the system and run through all potential use cases (otherwise, the results may be incomplete),
- Run `docker diff <containerId>`.
The result is a list of files that were written during the container’s runtime. If all affected directories are mounted in Kubernetes as volumes, the container can be run there with `readOnlyRootFilesystem` without any problems.
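A minimal sketch of such a pod might look as follows, assuming that `docker diff` only revealed writes under `/tmp`; the container name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
# ...
spec:
  volumes:
  - name: tmp-volume         # writable scratch space, discarded together with the pod
    emptyDir: {}
  containers:
  - name: web
    image: my-web-app:1.0    # placeholder image
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: tmp-volume
      mountPath: /tmp        # the only directory the application can write to
```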
For most applications, this solution is easy to implement and offers increased security. In some cases, for example when the application is designed to change files from the image upon launch or during runtime, this can be more problematic. There are several possible ways to remedy this: In the Dockerfile, the affected directory can be declared as a `VOLUME`. The files will then be copied there by the container engine at runtime. Alternatively, this can also be resolved in Kubernetes: in a pod, the `initContainer` and the other containers share the volumes. As such, an `initContainer` can copy files from the image to the volume (see the sketch below). In both cases, it is important to configure the users, groups, and mode so that the application container has access.
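The Kubernetes variant with an `initContainer` that copies files from the image into a shared volume could be sketched as follows; the image name and the path `/app/config` are purely illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
# ...
spec:
  volumes:
  - name: writable-config    # hypothetical volume name
    emptyDir: {}
  initContainers:
  - name: copy-config
    image: my-app:1.0        # same image as the application, so it contains the files
    # copy the files shipped in the image into the shared volume
    # (assumes the image provides a shell and cp)
    command: ["sh", "-c", "cp -r /app/config/. /writable/"]
    volumeMounts:
    - name: writable-config
      mountPath: /writable
  containers:
  - name: app
    image: my-app:1.0        # placeholder image
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    - name: writable-config
      mountPath: /app/config # mount the writable copy over the original path
```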
Preventing privilege escalation
Even if the container does not run as the user “root”, it is still possible for an attacker to obtain additional rights (i.e., escalate privileges). This mechanism is surely familiar to most Linux users: “sudo” (without any additional parameters) can be used to run a command with the rights of the “root” user. The “sudo” command should therefore never be installed or configured in a container; hardly any images contain it anyway. Another potential way to obtain additional rights is through vulnerabilities, such as CVE-2015-3339 in the kernel. By setting `allowPrivilegeEscalation: false`, you can easily prevent this. Unwanted side effects for the application in the container are unlikely when using this setting: if an image was designed to run without “root” rights, it does not need any additional rights at runtime.
Restricting capabilities
In Linux, it is possible to extend the rights of processes in a fine-grained way without having to run them as the user “root” (who is allowed to do anything). For this purpose, there are defined capabilities that can be granted to processes. Well-known examples include access to raw sockets (e.g., for the `ping` application, the `NET_RAW` capability) or binding to ports below 1024 (e.g., for web servers, the `NET_BIND_SERVICE` capability).
Container runtimes launch the container processes with selected capabilities, which represents a compromise between usability and security (example: Docker-capabilities). Often, these capabilities are not required by applications, but they grant an attacker more rights and therefore increase the attack surface. For example, with the `NET_RAW` capability, a man-in-the-middle attack on the communication of all containers on a host can be performed by means of DNS spoofing.
A whitelisting approach can be used to make this scenario more secure: Launch without capabilities, only allowing for selected ones where needed.
As with the read-only file system, Docker can be used locally to empirically identify which capabilities are required. This is done by launching the image in question with `--cap-drop ALL` and running through all use cases with the container. The error messages quickly reveal which capabilities are missing; these can then be added as required with the `--cap-add` parameter. NGINX is a good example of this, since it cannot launch without certain capabilities. These include the `NET_BIND_SERVICE` capability for binding to port 80. In this specific case, an alternative to adding the capability is to use a different image that configures NGINX so that it does not bind to port 80 but to a port greater than 1024 (see nginx-unpriv, among others).
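Transferred to the Kubernetes security context, this whitelisting approach might look like the following sketch; it assumes a hypothetical web server that needs nothing beyond binding to a port below 1024 (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
# ...
spec:
  containers:
  - name: web
    image: my-web-server:1.0   # placeholder image
    securityContext:
      capabilities:
        drop:
          - ALL                # start without any capabilities
        add:
          - NET_BIND_SERVICE   # re-add only what is needed to bind to port 80
```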
Activating the seccomp default profile
In the Linux kernel, container isolation is achieved using security mechanisms such as seccomp. seccomp allows syscalls to be restricted, thereby limiting access to functions implemented in the kernel. After comprehensive testing of all Dockerfiles on GitHub (frazelle-container-security), Docker introduced a default profile in 2016, which blocks 44 of the more than 300 syscalls. As such, it offers moderate security, but it is still compatible with most applications (docker-seccomp). In Docker, all container processes use this profile by default. In Kubernetes, however, it has been explicitly deactivated due to concerns about compatibility (k8s-20870). A switch to enable it cluster-wide, the kubelet flag `--seccomp-default`, has only been available as an alpha feature since Kubernetes 1.22. Seccomp is a key security feature of Docker and has been in use there for years; the lack of a seccomp default profile in Kubernetes has also been pointed out in a security audit.
The obvious approach is therefore to explicitly configure the profile from the start and only deactivate it where needed. This should only cause problems in exceptional cases. For example, official images on Docker Hub, the standard registry for Docker, are only permitted to deviate from the default configuration in justified exceptional cases (docker-lib-official).
Up to Kubernetes 1.18, seccomp had not made it into the official API and therefore could not be defined via the security context; instead, it had to be set using an annotation. This is the usual practice for alpha features before they are added to a particular API object. The annotation is specified on the pod and can either apply to the entire pod (see Listing 2) or be specified per container. Since Kubernetes 1.19, the profile can be set directly in the security context via the `seccompProfile` field.
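For Kubernetes 1.19 and later, a minimal sketch of requesting the runtime’s default profile via the security context might look as follows; the container name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
# ...
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault   # use the container runtime's default seccomp profile
  containers:
  - name: restricted
    image: my-app:1.0        # placeholder image
```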
If you wish to check whether a seccomp profile is active in a container, you can:
- Query this within the container with `grep Seccomp /proc/1/status`. `Seccomp: 0` means that there is no active profile. The `proc` file system displays information about all processes, and only one process should run in the container, typically with process ID 1.
- Query this outside of the container in Docker with `docker inspect`, which will only show a result if there is a deviation from the default. An explicitly deactivated seccomp profile (as in Kubernetes) is displayed as `seccomp:unconfined`.
Conclusion and recommendations
This article recommends implementing the following configurations for each container in Kubernetes:
- Allow only unprivileged users to execute the container,
- Use a read-only file system,
- Prevent privilege escalation,
- Restrict capabilities, and
- Activate the seccomp default profile.
This presents a pragmatic approach that allows the rights a container runs with to be reduced with minimal effort. Experience has shown that it is often sufficient to create a volume for the `/tmp` directory, which allows web applications (for example in Java with Spring Boot) to run without any problems with these settings. If a specific application cannot be made to run any other way, it is always more secure to relax individual security settings for it than to grant more rights than necessary from the beginning. This “least privilege” approach improves security for the entire cluster, as several attack vectors for containers are blocked. We will discuss these attacks in more detail, the other security options that exist, and other advanced topics related to security context in the next article in this series.