Utkarsh

What are Containers and how to build one from scratch?

I have been using containers through Docker for a while now, and they have made my life very easy. From pulling and running images from Docker Hub to building an image of my application and deploying it to the cloud, containerization has been a revelation for someone like me. I always thought containers were some kind of lightweight virtual machine, and I was wrong. So I built a container from scratch using Golang, and I'll share what I've learnt.

What are containers?

Containers are not virtual machines. They are ordinary processes with some extra sauce, and that sauce is namespaces and cgroups. We'll discuss what namespaces and cgroups are later. A container gives you an isolated environment in which you can run processes without affecting your host and its processes. A container also has its own view of the filesystem, separate from your host's filesystem. And you can limit how many resources a container is allowed to use. So if a process is isolated, has its own filesystem, and can be restricted in how many resources it uses, it is effectively a container. Now that we know the what, let's discuss the how.

How can we build a container of our own?

Now here starts the fun part. Before discussing the code, we have to discuss two important Linux constructs: namespaces and cgroups.

Namespaces

Namespaces provide process-level isolation by wrapping global system resources into abstractions. In simpler terms, they deal with what a process can see. When you create a namespace for a specific resource in a process, the process then believes that it has an isolated copy of that resource. Any changes made to the isolated resource by the process won't affect the global resource. There are different namespaces for different resources; we'll look at them later when we discuss the code.

Cgroups

Control groups, popularly known as cgroups, allow us to allocate, limit, prioritize, and monitor system resources. They help us regulate the resources allocated to a process. You create groups, which are exposed as directories and files, and add processes to a group, which then regulates and monitors their resource use.
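You can see which groups a process belongs to by reading /proc/self/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The sketch below parses that file; it assumes a Linux system, and the type and function names (cgroupEntry, parseCgroupLine) are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// cgroupEntry holds the three colon-separated fields of one line of
// /proc/<pid>/cgroup.
type cgroupEntry struct {
	Hierarchy   string
	Controllers string
	Path        string
}

// parseCgroupLine splits a single line into its fields. It returns false
// when the line does not have exactly three fields.
func parseCgroupLine(line string) (cgroupEntry, bool) {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, false
	}
	return cgroupEntry{Hierarchy: parts[0], Controllers: parts[1], Path: parts[2]}, true
}

func main() {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to open: %v\n", err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if e, ok := parseCgroupLine(scanner.Text()); ok {
			fmt.Printf("controllers=%q path=%s\n", e.Controllers, e.Path)
		}
	}
}
```

On a cgroup v1 system you would expect one line per controller hierarchy, while a cgroup v2 system typically shows a single line with an empty controller list.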

What containers actually are

Since we've got these two covered, let's move on to the coding part. Make sure you are on a Linux system, because namespaces and cgroups are Linux kernel features; if you're on Windows you can use WSL. I have used Golang to build my container. You can use any programming language, but make sure it gives you low-level constructs such as interacting with the OS, forking and running processes, and making system calls. You'll also need a copy of a Linux filesystem. You can use Docker to pull and export the alpine image:

docker export $(docker create alpine) -o alpine.tar
mkdir containerfs
tar -xf alpine.tar -C containerfs

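As the code is small, I've written it in a single main.go file. All the snippets below belong to package main and use the fmt, os, os/exec, path, strconv, and syscall packages from the standard library.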

The main function has a simple switch statement that checks the first argument passed on the command line and calls the matching function.

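func main() {
	// Expect a subcommand plus at least one command to run in the container.
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: main run <command> [args...]")
		os.Exit(1)
	}

	switch os.Args[1] {
	case "run":
		run()
	case "child":
		child()
	default:
		panic("unknown command: " + os.Args[1])
	}
}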

To run containers with Docker we use the docker CLI. Here we run the main.go file instead and pass run as the first argument; whatever follows run is the command (with its arguments) that we want to execute inside the container.

docker         run <image>
go run main.go run <commands>

In the switch statement you'll see a "child" case; I'll explain the use of that later. First let's implement the run function, which will actually start our container.

func run() {
	cmd := exec.Command(os.Args[2], os.Args[3:]...)

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Start()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to start: %v\n", err)
		os.Exit(1)
	}

	err = cmd.Wait()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Process exited with error: %v\n", err)
		os.Exit(1)
	}
}

Here we're running the command passed in the arguments using the exec package, which uses fork- and exec-style system calls under the hood to create and run a process. So far we've only started a simple process. Now, let's make it a container.

func run() {
	fmt.Printf("Running %v with pid %d\n", os.Args[2:], os.Getpid())

	// Re-execute this binary with "child" prepended to the user's command.
	cmd := exec.Command("/proc/self/exe", append([]string{"child"}, os.Args[2:]...)...)

	// Start the new process in its own UTS, PID and mount namespaces.
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags:   syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		Unshareflags: syscall.CLONE_NEWNS,
	}

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err := cmd.Start()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to start: %v\n", err)
		os.Exit(1)
	}

	fmt.Printf("Child started with PID %d\n", cmd.Process.Pid)

	err = cmd.Wait()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Child exited with error: %v\n", err)
		os.Exit(1)
	}
}

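Now we've added some attributes to our exec command. In SysProcAttr, Cloneflags are the flags that Go passes to the kernel's clone system call when it creates the new process. Unshareflags additionally unshares the mount namespace, so that mounts made inside the container are not propagated back to the host. Each of the clone flags used here creates a new namespace: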

Namespace flag    Isolates
CLONE_NEWUTS      Hostname and domain name
CLONE_NEWPID      Process IDs
CLONE_NEWNS       Mounts

You can also see that instead of running the command from the arguments directly, I'm running "/proc/self/exe" with "child" prepended to the arguments. /proc is a pseudo filesystem through which the Linux kernel exposes information about running processes to user space. "/proc/self/exe" is a symbolic link to the executable of the current process, so running it starts the same program again. Basically, I'm re-running the current program with "child" as the first argument so that it takes the "child" branch instead of "run". Here's the child function:
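If you want to see what /proc/self/exe resolves to, a small sketch like the following reads the link. It assumes Linux with /proc mounted; selfExe is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"os"
)

// selfExe resolves /proc/self/exe, a symlink the kernel points at the
// binary of whichever process reads it.
func selfExe() (string, error) {
	return os.Readlink("/proc/self/exe")
}

func main() {
	viaProc, err := selfExe()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to read link: %v\n", err)
		os.Exit(1)
	}
	// os.Executable is the portable way to ask Go for the same information.
	viaGo, err := os.Executable()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to get executable: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("/proc/self/exe ->", viaProc)
	fmt.Println("os.Executable  ->", viaGo)
}
```

When started with go run, the path usually points at a temporary binary that the Go tool built, which is exactly the binary that gets re-executed.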

func child() {
	fmt.Printf("Running %v with pid %d\n", os.Args[2:], os.Getpid())

	// Put this process into a cgroup before it starts the user's command.
	cg()

	// Only affects the new UTS namespace, not the host.
	err := syscall.Sethostname([]byte("container"))

	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to sethostname: %v\n", err)
		os.Exit(1)
	}

	// "containerfs" is the directory extracted earlier; use the path where
	// you unpacked the Alpine filesystem.
	err = syscall.Chroot("containerfs")

	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to chroot: %v\n", err)
		os.Exit(1)
	}

	err = syscall.Chdir("/")

	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to chdir: %v\n", err)
		os.Exit(1)
	}

	// Mount a fresh proc so tools like ps see the container's processes.
	err = syscall.Mount("proc", "/proc", "proc", 0, "")
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to mount proc: %v\n", err)
		os.Exit(1)
	}
	// Note: deferred calls do not run on the os.Exit paths below.
	defer func() {
		syscall.Unmount("/proc", syscall.MNT_DETACH)
	}()

	// Run the user's command; it inherits the hostname, root and namespaces.
	cmd := exec.Command(os.Args[2], os.Args[3:]...)

	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	err = cmd.Start()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed to start child: %v\n", err)
		os.Exit(1)
	}

	err = cmd.Wait()
	if err != nil {
		fmt.Fprintf(os.Stderr, "Failed command: %v\n", err)
		os.Exit(1)
	}
}

But why do I need a child function? You can see that I'm changing the hostname, root, and current directory of my process. These changes have to happen inside the new namespaces, and the namespaces only exist once the cloned process is running, so run cannot make them beforehand. Instead, run re-executes the program inside new namespaces. That new process enters child, sets the hostname, root and so on, and then starts the user's command as its own child process. Because the user's command is started from inside the namespaced process, it inherits the hostname, root, and namespaces. This is why I'm re-running the program instead of making a normal function call.
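The re-execution pattern can be tried on its own, without namespaces or root privileges. This is only a sketch of the control flow under that simplification: it assumes Linux (for /proc/self/exe), and childArgs is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// childArgs builds the argument list for the re-executed binary: the word
// "child" followed by the user's command.
func childArgs(userCmd []string) []string {
	return append([]string{"child"}, userCmd...)
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		// Second run of the same binary. In the container this is where the
		// hostname, root and mounts are set up before the user's command starts.
		fmt.Printf("child:  pid=%d ppid=%d args=%v\n", os.Getpid(), os.Getppid(), os.Args[2:])
		return
	}

	fmt.Printf("parent: pid=%d\n", os.Getpid())

	// First run: start the same binary again with "child" as the first argument.
	cmd := exec.Command("/proc/self/exe", childArgs(os.Args[1:])...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintf(os.Stderr, "Failed to re-execute: %v\n", err)
		os.Exit(1)
	}
}
```

Without a new PID namespace, the ppid printed by the child should match the pid printed by the parent, which shows that both lines come from two runs of the same program.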

Flow Diagram

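Moreover, I'm also mounting proc inside the container. The PID namespace already gives the container its own process IDs, but tools like ps read them from /proc, so the container needs its own proc mount to see them instead of the host's. You can verify this by running the "ps" command on both the host and inside the container.
Now what's the "cg" function doing? It creates a control group on my host machine, adds the container process to it, and limits the group to 20 processes in total. Note that it uses the cgroup v1 layout, where the pids controller lives under /sys/fs/cgroup/pids; on systems that use cgroup v2 the paths and files differ. Here's the function: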

func cg() {
	// cgroup v1 layout: each controller has its own directory under /sys/fs/cgroup.
	utkPath := path.Join("/sys", "fs", "cgroup", "pids", "utk")

	// MkdirAll creates missing parents and succeeds if the path already exists.
	if err := os.MkdirAll(utkPath, 0755); err != nil {
		fmt.Fprintf(os.Stderr, "Failed to make dir: %v\n", err)
		os.Exit(1)
	}

	// Allow at most 20 processes in this group.
	if err := os.WriteFile(path.Join(utkPath, "pids.max"), []byte("20"), 0700); err != nil {
		fmt.Fprintf(os.Stderr, "Failed to write file: %v\n", err)
		os.Exit(1)
	}
	// Ask the kernel to notify its release agent when the group becomes empty.
	if err := os.WriteFile(path.Join(utkPath, "notify_on_release"), []byte("1"), 0700); err != nil {
		fmt.Fprintf(os.Stderr, "Failed to set notify_on_release: %v\n", err)
		os.Exit(1)
	}
	// Move the current process into the group; its children inherit membership.
	if err := os.WriteFile(path.Join(utkPath, "cgroup.procs"), []byte(strconv.Itoa(os.Getpid())), 0700); err != nil {
		fmt.Fprintf(os.Stderr, "Failed to add pid to cgroup: %v\n", err)
		os.Exit(1)
	}
}

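Finally, you can run your container by running the file with run followed by the command you want to execute. Creating namespaces, calling chroot, and writing to cgroups normally require root privileges, so you will likely need to run this as root. I suggest running a shell in it: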

go run main.go run /bin/sh

You can find the complete code on my GitHub.

Now you know what a container is and how to make one. I hope this was a fun read for you. I would suggest that you code this out yourself and explore a bit.

Be Curious. See you soon!

Some Resources:

  1. Containers From Scratch • Liz Rice • GOTO 2018
  2. Cgroups, namespaces, and beyond: what are containers made from?