Docker, VM & Hypervisors

May 28, 2020

What's going on here?

Everyone’s heard of Dockers, VM’s and Hypervisors, yet what exactly are they and how are they built? What are these terminologies?

Docker: A way to run packaged software in something called ‘containers’. Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels.

Virtual Machine(VM): ‘Emulation’ of a machine. This can mean many things. Most commonly it means running one OS atop another, like Linux on Windows.

Hypervisor: A software that allows you to create & run multiple virtual machines. [VM’s run atop hypervisors]

Essentially there are 2 types of Hypervisors:

Bare-Metal Hypervisors: They run directly on the hardware and control and manage the ‘guest’ operating systems.
Hosted Hypervisors: They run atop the Host operating system to run ‘guest’ operating systems.

You may say, okay what is hyperthreading, I’ve heard of that too. Is it related? Answer is, not exactly.

Hyperthreading: This allows a single processor core to appear like two processor cores to the operating system. This allows higher ‘throughput’. This is Intel’s implementation of Simultaneous Multithreading(SMT).

If you look from a use case point of view, containers are meant to be ephemeral, whereas virtual machines are expected to last as long as possible.

Containers how do they work?

To run containers like docker, linux provides us with certain OS facilities like cgroups and namespaces. Atop these functionalities, Docker brings their own containerd and runc. Upon these, user-facing tools are built like Docker VLI, Kubernetes,etc.

Namespaces allow abstraction of filesystem, network access and process trees (PID trees), & user ID’s. For example, if you wrap a program (like Firefox) with a filesystem namespace in /home/firefox, it now treats /home/firefox as the root dir. From the perspective of the host this looks like the entire filesystem. This can be applied in PID’s as well, a docker running MySQL will give the MySQL a PID of 1, ‘fooling’ it into thinking its the only process running.

Linux cgroups (Control Groups) are used to limit, manage adn isolate resource usage of process collections. This prevents certain programs from hogging all the resources and preventing other programs from running.

All these allow us to built containers like docker which make life simpler. Because now you don’t need to worry about dependencies, or configurations. Each large process runs in its own ‘container’.

Hypervisors how do they work?

Virtualization is a pretty old concept tracing back to the days of the mainframes built by IBM in the 60’s and 70’s. Early attempts were into paravirtualization, where software interfaces were provided to the guest OS. Intel and AMD soon added hardware support to build Hardware-assisted virtualization.

Before all these, Gerald J. Popek and Robert P. Goldberg built their set of virtualization requirements, These are a set of conditions sufficient for a computer architecture to support system virtualization efficiently. According to them, a Hypervisor (aka VMM, Virtual Machine Monitor) must present these properties:

Equivalence / Fidelity: A program running under the VMM should exhibit a behavior essentially identical to that demonstrated when running on an equivalent machine directly.
Resource control / Safety: The VMM must be in complete control of the virtualized resources.
Efficiency / Performance: A statistically dominant fraction of machine instructions must be executed without VMM intervention.

To ideally design such a system, they made 3 theorems based on categorizing the Instruction Set Architecture (ISA) into:

Privileged instructions: Those that trap [‘interrupt’] if the processor is in user mode and do not trap if it is in system mode (supervisor mode). 2: Control sensitive instructions: Those that attempt to change the configuration of resources in the system. 3: Behavior sensitive instruction: Those whose behavior or result depends on the configuration of resources (the content of the relocation register or the processor’s mode).

Theorem: An effective VMM may be constructed if the set of sensitive instructions for that computer is a subset of the set of privileged instructions.

This means that one can build a para-virtualization system even if there is no hardware support for the VMM. The VMM has to trap (interrupt) & emulate every sensitive intstruction. This is called classic virtualization or trap-and-emulate virtualization.

To do this:

Binary Translate ring 3 instructions of the guest OS into ring 0. [Since the guest OS will be running on Ring 3]
‘Shadow’ OS data structures like paged virtual memory and MMU
Emulate I/O Devices

The other two theorems allow you to define a recursively virtualizable system and a hybrid virtualizable system.

But with the introduction of Intel’s VT-x and AMD’s Amd-V virtualization instructions one does not need to build a trap-and-emulate virtualization. Extra Instructions like VMREAD, VMWRITE, VMCALL which permit entering and exiting a virtual execution mode where the guest OS perceives itself as running with full privilege (ring 0), but the host OS remains protected. Extended Page Tables were also added.

Hardware-assisted virtualization reduces the maintenance overhead of paravirtualization as it reduces the changes needed in the guest operating system.