Kubernetes AppOps Security Part 4: Security Context (2/2) – Background
A container is basically a normal Linux process that runs isolated from the rest of the system via certain kernel components. This makes containers more lightweight, but also more vulnerable, than virtual machines (VMs). To reduce this attack surface, container runtimes offer a variety of settings whose default values strike a compromise between usability and security. As a developer, you can reduce the attack surface further by following a number of best practices. This applies when using a container runtime like Docker directly as well as when using a container orchestrator like Kubernetes, since the orchestrator merely abstracts from the underlying container runtime. The previous article in this series covers the settings that exist in Kubernetes and describes how they can be used to increase security pragmatically. In short, the following settings are recommended for each container via the "securityContext" in Kubernetes:
- Allow running containers with unprivileged users only,
- Use a read-only file system,
- Prevent privilege escalation,
- Restrict capabilities, and
- Activate the Seccomp default profile.
This article discusses how these settings relate to attack vectors on containers, how container isolation works, and how containers differ from VMs in this respect. Finally, we present tools and offer an outlook on additional security-relevant settings.
The effects of the settings can be tried out in a defined environment. You will find complete examples with instructions in the “cloudogu/k8s-security-demos” repository on GitHub.
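To recap, a pod manifest that applies all five recommendations could look roughly like the following sketch. The names, image, and user ID are placeholder values; the seccompProfile field is available in newer Kubernetes versions, older versions use an annotation instead.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                     # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true                  # refuse to start containers that run as root
    runAsUser: 100000                   # arbitrary unprivileged user ID (example value)
    runAsGroup: 100000
    seccompProfile:
      type: RuntimeDefault              # activate the runtime's default Seccomp profile
  containers:
    - name: app
      image: example/app:1.0.0          # placeholder image
      securityContext:
        readOnlyRootFilesystem: true    # read-only root file system
        allowPrivilegeEscalation: false # prevent privilege escalation
        capabilities:
          drop: [ "ALL" ]               # restrict capabilities
      volumeMounts:
        - name: tmp
          mountPath: /tmp               # writable emptyDir, in case the app needs temp files
  volumes:
    - name: tmp
      emptyDir: {}
```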
Isolation of containers
In general, the default settings in container runtimes provide some isolation of the process through the kernel features namespaces and cgroups as well as security facilities such as capabilities, Seccomp, AppArmor, and SELinux. These mechanisms partly overlap, thereby providing protection even when an attacker manages to defeat one of them by exploiting a vulnerability or misconfiguration. An example from frazelle-container-security: the "mount" syscall is blocked by the default AppArmor and Seccomp profiles, and it additionally requires the CAP_SYS_ADMIN capability, which is not granted by default.
Beyond these default settings, container runtimes provide security settings which would also be useful for running VMs, such as a read-only root file system or a general exclusion of privilege escalation. These will be presented in later sections.
Container vs. VM
Here is a look at the differences between VMs and containers: VMs are more isolated than containers because the VM hypervisor abstracts from the host at the hardware level and runs its own operating system (dedicated kernel) in the VM. In contrast, containers share the kernel of the host operating system. Since it is often the case that only one application is run on a VM, each application has its own kernel.
Therefore, it is harder to extend the attack from a vulnerable application to other applications that run on the same physical host. For containers, it is easier for attackers to exploit vulnerabilities in the container runtime, the kernel, or a configuration error in order to break out of the container. However, an escape cannot be ruled out even when using a VM. For examples of container and VM escapes, see docker-k8s-high-sec-env.
Attack scenario
Given this knowledge, the benefits of the described settings can be seen by looking at the following attack scenario: Similar to operating on a VM or physical host, a web application that runs in a container initially exposes only ports. An attacker needs to be able to remotely execute code in order to get into the container from outside. To do so, the attacker can exploit vulnerabilities in the system's building blocks (operating system, server software, platforms such as Java, and libraries). A prominent example of a remote code execution vulnerability is CVE-2017-5638 in the Java web framework Apache Struts, which exists in all versions earlier than 2.3.32 and 2.5.10.1. It was exploited in the Equifax breach, during which 143 million customer records were stolen. The vulnerability can be exploited through a crafted HTTP request to execute arbitrary commands on the host.
Since it is cumbersome to execute many commands this way, in the next step the attacker will often download and launch a tool such as "netcat" on the host (in this case, the container). This is used to establish a connection to a remote control server on the Internet (a reverse shell). The attacker can then use the control server to interactively execute commands in the container, similar to SSH. This allows the attack to be extended: additional tools can be used to search the network for other hosts and open ports and to access services that cannot otherwise be reached from the outside. For example, MongoDB instances are often reachable without authentication. Further weaknesses can be exploited to break out of the container (a "container escape"). An example is CVE-2019-5736 in the low-level container runtime runc (which is also used by Docker). If the container runs as the root user, the attacker is also root on the host. With root privileges, the attack can penetrate much further: all containers on the node can be taken over, the configuration can be viewed and, potentially, the whole cluster can be taken over.
Defense measures
How is it possible to defend against such attacks? Of course, the first step is always to use the most up-to-date versions of the system's building blocks. However, this does not protect against unknown or unresolved security issues (zero-day attacks). Here, additional layers of defense are needed in order to contain the damage or prevent the attack from being extended.
The scenario described above can be prevented or at least made more difficult on different layers by applying the settings that are recommended here:
- An unprivileged user cannot simply install packages in the container. In addition, in the case of a container escape, the attacker is not "root" on the host and ideally has no rights there.
- A read-only root file system prevents the installation of packages at runtime, even for the root user. More importantly, the attacker cannot compromise the code of the application.
- Preventing privilege escalation ensures that, even if vulnerabilities exist, an unprivileged user who gains access to the container cannot subsequently become the root user. Examples of such vulnerabilities can be found in docker-security.
- If the container is executed without capabilities, this increases its isolation and limits the options of the attacker. For example, with the "NET_RAW" capability, a man-in-the-middle attack can be performed on the communications of all containers on a host using DNS spoofing.
- The container can be further isolated, and the attacker's options further restricted, using a Seccomp profile. A whole series of kernel vulnerabilities that can be eliminated through such a profile are listed in docker-security (see link above).
- This also demonstrates the benefit of network policies: they allow you to block the Internet connection that an attacker could use to download additional tools and set up a reverse shell, and they prevent access to corporate networks that can be reached from the cluster (a minimal example follows this list).
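As a sketch of that last point: a NetworkPolicy that blocks all outgoing traffic from the pods of a namespace could look like this. The names are placeholders, DNS and any required internal services would have to be allowed in additional rules, and enforcement depends on the cluster's network plugin supporting network policies.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-egress        # hypothetical name
  namespace: example       # placeholder namespace
spec:
  podSelector: {}          # applies to all pods in the namespace
  policyTypes:
    - Egress               # with no egress rules listed, all outgoing traffic (incl. DNS) is denied
```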
Looking at the security vulnerabilities of the past, it is easy to imagine that other, as yet unknown vulnerabilities exist, and it is almost certain that more will be revealed in the future. In this respect, a "least privilege" approach offers even more security than can be foreseen at the current time. For example, when CVE-2019-5736 became known (see above), containers running with an unprivileged user were not affected by the vulnerability.
Finally, it should at least be mentioned at this point that many of the attack vectors described above that rely on root privileges could be mitigated by so-called "user namespace remapping": the user ID in the container is mapped to a different user ID outside the container, for example, ID 0 (root) in the container to 10000 outside of it. Thus, the container has no extended rights on the host. Container runtimes like LXC/LXD or Podman use this by default. However, it is not the default setting for Docker, and users of a managed cluster have limited influence on it. Since this series of articles focuses on how to use the cluster rather than how to operate it, this option will not be discussed further.
Additional settings in the security context
In addition to the settings recommended so far, which change the defaults, the Security Context offers further options. Some of them are briefly mentioned here.
The privileged option is false by default and should stay that way. Enabling privileged effectively eliminates the container's isolation and would render all the above settings useless. The option was originally included for running Docker in Docker, which may be useful in certain situations, such as on a CI server. For such a scenario, however, a dedicated machine or cluster, run separately from the productive applications, is recommended.
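If you want to make this explicit in your manifests nonetheless, the container's security context can state it directly (a minimal sketch):

```yaml
securityContext:
  privileged: false   # the default; enabling it would effectively disable container isolation
```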
The Linux security modules SELinux and AppArmor can also be configured in Kubernetes. For both, however, Kubernetes does not interfere by default; it is the responsibility of the cluster operator to configure the underlying container runtime. In the case of Docker, similar to Seccomp, an AppArmor default profile exists: if AppArmor is installed and active on the node, it is automatically applied by Docker. Those who prefer SELinux (often on Red Hat-based Linux distributions) can enable it, for example, in the Docker daemon settings. Nevertheless, the security context offers settings for SELinux ("seLinuxOptions") if you need special settings per container. In addition, a dedicated AppArmor profile can be configured, similarly to Seccomp, through annotations. In general, it is possible to write your own, more restrictive Seccomp, AppArmor, or SELinux profiles (e.g., using tools like bane) and activate them with these settings. However, this has a steep learning curve and is laborious, which is why it is not described in further detail in this article.
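As an illustration, a per-container AppArmor profile and SELinux options could be set roughly as follows. This is a sketch: pod name, container name, image, and the SELinux level are placeholder values, and the annotation-based AppArmor syntax shown here is the one available at the time of writing.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                  # hypothetical name
  annotations:
    # AppArmor profile for the container named "app"; "runtime/default" activates the
    # runtime's default profile, "localhost/<profile>" would select a custom one
    container.apparmor.security.beta.kubernetes.io/app: runtime/default
spec:
  containers:
    - name: app
      image: example/app:1.0.0       # placeholder image
      securityContext:
        seLinuxOptions:              # only effective on nodes where SELinux is enabled
          level: "s0:c123,c456"      # example MCS level
```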
Last but not least, since Kubernetes 1.15 there has been support for Windows nodes, where certain security settings can be configured in the security context. The topic is still fairly new at the time of writing. Anyone who has read carefully so far will have noticed that all of the options discussed are based on Linux. In this respect, the topic of containers on Windows goes beyond the scope of this article.
Tooling
If you would like to check your cluster interactively for compliance with certain recommended options, there are several tools to choose from. These tools come with their own rules, which are far more extensive than the ones recommended in this series of articles, whose focus is a compromise between effort and security. They are definitely worth a look: the different viewpoints can help you decide which options best fit your own use case, and the tools can also be adapted to your own requirements. It is also possible to automate the checking of these settings in the CI/CD process.
Three well-known tools are kubesec, kubeaudit and kube-bench.
The latter automates the auditing of points from the CIS benchmark for Kubernetes, which, as was mentioned in the first part, focuses on cluster operation (API server, Kubelet, etc.) and makes few recommendations for the operation of applications on the cluster.
kubesec and kubeaudit check different points, only some of which are mentioned in this article. kubesec is close to the recommendations presented here. By default, however, it also checks for the existence of resource limits to protect against denial-of-service attacks. This setting can degrade the response time of the application on the cluster (Jac18), and it can also be implemented using surrounding infrastructure (reverse proxy and CDN). Therefore, you should not simply implement this check without thinking.
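For reference, the resource limits that kubesec checks for are declared per container, for example like this (a sketch with arbitrary values):

```yaml
containers:
  - name: app                  # placeholder name
    image: example/app:1.0.0   # placeholder image
    resources:
      requests:
        cpu: "250m"            # reserved share, used for scheduling decisions
        memory: "256Mi"
      limits:
        cpu: "500m"            # the container is throttled above this value
        memory: "512Mi"        # the container is OOM-killed if it exceeds this value
```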
Generally speaking, kubeaudit has a lot of checks, whose results can be overwhelming. These include, for example, the resource limits audit mentioned above as well as checks for network policies and for the presence of Seccomp and AppArmor annotations. For AppArmor, container runtimes typically ship default profiles that remain active as long as the operator of the cluster does not explicitly switch them off. Hence, this setting does not have to be repeated on every pod.
Enforcing the settings throughout the entire cluster
Speaking of repetition: the settings shown in the security context are specified per pod or container, which means they sometimes have to be entered several times per application. They can also be defined cluster-wide without repetition using the so-called "Pod Security Standards". These are the successors of the "Pod Security Policies", which are deprecated as of Kubernetes version 1.21 and will be completely removed in version 1.25. The Pod Security Standards provide even more options for protecting the node and the container runtime. However, they have a higher entry hurdle, as allowing exemptions is significantly more time-consuming. Therefore, as a compromise between effort and security, this article first shows the settings using the Security Context. This approach can make sense in practice: in smaller teams, you can agree on which options are set and then roll this out successively to all applications.
The Security Context also makes it easy to test whether applications still work with the more restrictive security settings. For secure defaults as a starting point in new clusters, in larger organizations, or wherever a larger group of people with access to the cluster may not be completely trusted, it may be necessary to enforce the settings. In this case, it is worth taking a look at the Pod Security Standards.
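With the admission mechanism behind the Pod Security Standards, enforcement happens per namespace via labels, roughly as in the following sketch (the namespace name is a placeholder; this requires a Kubernetes version in which the Pod Security Admission is available):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: example-namespace   # placeholder name
  labels:
    # reject pods that violate the "restricted" Pod Security Standard
    pod-security.kubernetes.io/enforce: restricted
    # additionally warn about violations, e.g. during a transition phase
    pod-security.kubernetes.io/warn: restricted
```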
Conclusion
This article provides the background to the good practices for the Kubernetes Security Context that were presented in the last part: because containers are less isolated than VMs, there are several attack vectors that can be mitigated with little effort using the settings in the Security Context. There are other settings in the Security Context whose default values can be regarded as "secure by default".