UPDATE 2020/Jan/29: As pointed out in some of the feedback, the term Virtual Machine in this article refers specifically to full x86/x64 virtualization, as described in the current Wikipedia article: the use of hypervisors and similar technologies to emulate an entire physical machine in software. Please keep this in mind while reading, so it is not confused with other kinds of virtual machines, such as the JVM, .NET, or interpreted language runtimes.
I’ve been working with Docker, containers, and Kubernetes for over three years now. And, from my perspective, I managed to catch up with this new trend just before it picked up full steam in most developer forums. I’ll admit, it took me more than a few months to understand what a container actually is and how it works.
If you’ve been working in software operations and infrastructure for quite some time, and by any chance you are just now beginning to catch up, do not be scared. You have A LOT to take in. It takes effort and getting used to. I remember the feeling, the confusion, the urge to give up and go back to provisioning stuff the old-fashioned way. I distinctly remember wishing for a nice blog post describing things in a simple way, without making so many assumptions. By now, I’m pretty sure some folks at /r/docker are getting used to watching thread, after thread, after thread of people rambling about their frustration: they need to migrate a full stack to containers and nothing seems to make sense.
So, I decided to write up this simple, quick introduction to welcome beginners into a new era. I will try to uncover some of the magic behind containers, so you don’t feel so lost in the dark. It serves as an introduction to containers before you are introduced to Docker. Something that I feel is missing from most tutorials and guides.
Hopefully, it’ll help you deal with the frustration, clear some of the most basic concepts and pave the way for a better experience with Docker, Kubernetes and everything else.
First things first
If you work with Linux, the basic idea is not really hard to grasp. I wish I had someone tell me this from the beginning:
“To understand Linux Containers, you should first understand what makes a Linux Distribution.” (said me to myself when I finally got it)
Ubuntu, CentOS, Arch, Alpine, Debian, Fedora… We each have our favorite. But whatever flavor you love, they all have one important thing in common: the Linux Kernel. Making a new Linux Distribution almost never means writing your own kernel from scratch; a very good one already exists, driven by a strong community. For the most part, you just take it, compile it, and bundle it with other stuff to create your distribution.
Inside every common Linux Distro, you will find basically the same types of components grouped into directories in the filesystem:
- /boot – The kernel, along with whatever it needs to be bootstrapped.
- /bin – Basic program binaries like cp, ls, cat, grep, echo…
- /sbin – System binaries reserved for the root user.
- /etc – System wide configuration files.
- /lib – System wide libraries.
- /usr – The secondary hierarchy: user-land software, with its own binaries and libraries.
- /opt – Optional, self-contained software that doesn’t follow the above directory structure.
- /home – User files.
Of course, there’s more to that structure: variations, extra directories, and exceptions. But that’s the basic overview. The cherry on top is a package manager, so users can install and manage additional software: dpkg, apt, yum, synaptic, pacman, zypper, rpm… One is enough, so take your pick.
Bundle all that into an ISO image that boots as an installer program, and voilà! You’ll have yourself a working Linux Distribution.
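You can poke at these pieces on any Linux box you have handy. A quick sketch (paths assume a fairly standard distro; a very minimal install may lack some of them):

```shell
# Which distribution is this? Most modern distros ship /etc/os-release.
grep '^NAME=' /etc/os-release

# The one thing they all share: a Linux kernel.
uname -s

# A few of the standard top-level directories described above:
for d in /bin /etc /usr; do
  [ -e "$d" ] && echo "found $d" || echo "missing $d"
done
```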
Remember the basics: how programs work
When you run a program, a copy of it goes to RAM and becomes a process managed by the kernel. From there, it expects all of its dependencies to be in place and readily accessible. Among various things, it will usually:
- Load configuration files from /etc
- Load libraries from directories like /lib or /usr/lib
- Write data to /var/some/directory
As long as everything is in place, exactly as expected, a process will run happily ever after.
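You can see those expectations with your own eyes. `ldd` lists the shared libraries a binary will try to load at startup (this assumes a dynamically linked `ls`, which is the norm):

```shell
# List the shared libraries that /bin/ls expects to find on this host.
# If any of them were missing, or the wrong version, the process would
# fail before it even reached main().
ldd /bin/ls
```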
So, what’s the problem?
Dedicated servers typically run a small number of dedicated processes. For example, to host a WordPress blog, a single Linux host can easily be configured as a LAMP stack, with the Apache, MySQL, and PHP packages installed.
But… what if you need to host more than one WordPress installation? What if each one needs its own MySQL instance? Let’s keep going… What if you need to deploy older stacks that require different PHP versions? Different modules? Conflicting libraries? Binaries compiled with different flags and modules?
We are used to solving this problem very bluntly: increase the cost and pay for more resources. The standard response to complex requirements has been the same for decades:
We just can’t run everything in one host. Either give me more hosts or create more Virtual Machines. We need to keep things ISOLATED!
Isolation! That’s the key word here
From very early on, the Linux community has been looking for ways to isolate running processes, both to avoid dependency conflicts and to improve security. Solutions like chroot, and FreeBSD’s jails outside the Linux world, were notable forerunners of what came to be known as Linux Containers (LXC). While chroot was popular and relatively easy, it lacked advanced features; the complexity of LXC, on the other hand, kept it from wide adoption.
Up until now, the traditional way of isolating services with security and quality guarantees has mostly meant one thing: running different services on different hosts, each with its own Linux installation and dedicated kernel.
The kernel has evolved, and most people never even noticed
For quite some time now, the Linux Kernel has been growing exciting new features. Today, several different ways to isolate processes are baked into the kernel itself and are quite ready for production: control groups, namespaces, virtual network interfaces… all kinds of interesting features are there. LXC was the first real attempt to harness those features, but it failed to keep things simple.
Putting it the simplest way possible:
Creating a container means running a Linux process, much like any other, except with very strong isolation, the likes of which no one had ever seen before.
In practice, it means:
- Create a directory dedicated to your application.
- Place the application binary, along with everything it needs, inside that directory: dependent libraries, configuration files, data directories…
- Spawn the application process asking the kernel to isolate EVERYTHING, giving it restrictions like:
- Its own user space, including a root user that is actually unprivileged outside the container, with no visibility of (or UID/GID conflicts with) users already created outside the container.
- Its own filesystem structure, with the most important parts (like /etc/hosts) as read-only, even for the container’s root user.
- Its own process space, with no visibility into any other processes or PIDs running on the same kernel.
- Its own network interface where it can have its own IP and not worry about conflicting ports.
- Limits to how much time it can spend consuming CPU cycles.
- Limits to how much memory it can use.
Think of it as chroot or jails on steroids.
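None of this is Docker magic: the kernel exposes these isolation features as namespaces, and every process, containerized or not, already belongs to one namespace of each type. You can inspect them under `/proc` (Linux-only; the `unshare` line is shown commented out because it needs extra privileges):

```shell
# Every process has a set of namespaces: mnt (mounts), pid, net,
# user, uts (hostname), ipc... List this shell's namespaces:
ls /proc/self/ns

# Processes in the same container share these namespace IDs; a process
# in a different container gets different ones:
readlink /proc/self/ns/pid

# Given privileges (or unprivileged user namespaces), util-linux's
# unshare spawns a shell in fresh namespaces: a bare-bones "container".
#   sudo unshare --pid --mount --uts --net --fork --mount-proc /bin/sh
```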
You can literally cram an entire different Linux Distribution inside the container directory. A process running inside the container isolation shares the same kernel as other processes, but it can easily think that it’s running completely alone and that it’s part of an entirely different operating system.
“If it walks like Alpine, and it quacks like Alpine… well, I guess I REALLY AM running in the Alpine OS!” (says the process running in a container, sharing a kernel bootstrapped by an Ubuntu host)
Containers start much faster than Virtual Machines because they are not bootstrapping a different kernel into a new memory space, along with every other process a full operating system needs. They are simply spawning a new process in the same kernel. The isolation is what makes that process special.
And now, you are ready for Docker
Docker came along as a bold rewrite and recycling of the same ideas: it started out on top of LXC and later replaced it with its own runtime. It completely re-imagined how to create, manage, and distribute containers, and it made things much, much easier, especially at large scale.
Instead of manually creating all of that structure, you just need dockerd installed and running as a regular system daemon. Containers are described in intuitive Dockerfiles, packed into layered tarball images, and easily ported to other hosts. Under the hood, Docker uses union filesystems such as OverlayFS to share and merge multiple directory layers into a single view inside the container. It’s a powerful trick that makes Docker containers so versatile.
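As a sketch of how simple this gets, here is a hypothetical Dockerfile for a small PHP site (the image tag, paths, and names are illustrative, not from any real project):

```shell
# Each Dockerfile instruction becomes a filesystem layer; OverlayFS merges
# the layers into the single root filesystem the container process sees.
cat > Dockerfile <<'EOF'
FROM php:7.4-apache
COPY ./src /var/www/html
EXPOSE 80
EOF

# Then (requires a running dockerd, so shown commented out):
#   docker build -t myblog .        # build the layered image
#   docker run -p 8080:80 myblog    # spawn an isolated process from it
```

That `docker build` / `docker run` pair replaces all of the manual directory and namespace plumbing from the previous section.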
I’m sure you will find many people listing all the advantages (and disadvantages) of running containerized applications. But the most important, IMHO, is automation. Infrastructure provisioning related to each application becomes code. Code that you write, that can be committed to VCS, traced, shared and integrated with other tools into a pipeline. It’s a bold new way of thinking about infrastructure. It changes a lot. But at large scale, full automation from development, to testing, to production becomes a concrete reality.
Don’t worry! There’s a lot more to learn. But, hopefully, this introduction has given you a foothold so you can dive into the official docs with more confidence: https://docs.docker.com
Godspeed, and best of luck!