Modern Jenkins Unit 1 / Part 1: Planning a Jenkins + Docker CI System

What are we doing here?

I have been introduced to far more installations of Jenkins that suck than don’t. One of the issues with a system of this level of complexity that is shared between many teams is that they tend to grow organically over time. Without much “oversight” by a person or team with a holistic view of the system and a plan for growth, these build systems can turn into spaghetti infrastructure the same way a legacy codebase can turn into spaghetti code.

Devil Jenkins!

Eventually, failures end up evenly divided among infrastructure problems, bad commits, and flaky tests. Once this happens and trust is eroded, the only plausible fix for most consumers is to re-run the build. This lack of trust makes it much less fun to develop quality, reliable software. It is very important to be able to trust that the system verifying your work is sound, and not the other way around: an error in a build should unambiguously point to the code, not to the system itself.

While not all companies can afford a designated CI team to support the build infrastructure, it is possible to design the system from the start (or refactor it) in a way that encourages good practice and stability, elevating developers’ confidence in the system and the speed with which they can deliver new features. This series of posts will hopefully get you started in the right direction when you have to build or refactor a CI / CD system using Jenkins.

Describing the characteristics of the end state

I am a big fan of Agile software development1 myself. I don’t believe that there is one kind of ‘Agile’ that works for everyone or anything like that, but I do believe 100% in the iterative approach agile takes to solve complex problems. That means breaking work into small deliverables and using those chunks for planning at multiple intervals such as:

  • Yearly: Very high level direction
  • Quarterly: General feature requirements
  • Sprint: Feature implementation details

Agile Lifecycle

When you have multiple perspectives on the scope and movement of a project, it really gives you the ability to manage expectations and make the most of your time. Since you have a general idea of what the long-term goals are, you (ideally) can then make trade-off decisions with accurate values on the scales. This leads to fewer rewrites and the ability to put a bit more into the code you write, because you know it will become legacy and you know Future You will appreciate the effort.

This is perhaps in opposition to the idea of hacking your way to a solution, which is also a valuable process, but better suited to a problem with an undefined domain. Luckily, shipping software has most of its problem domain well defined, so we are able to set long-, medium-, and short-term goals for our CI / CD infrastructure.

Anyways, let’s state some of the properties we want from our build system:

Super happy developer guy

  • Easy to develop: A common problem with complex systems is that the process of setting up a local development environment is just as complex, if not more so. We will be developing just about the entire system locally on our laptop so we should hopefully get this for minimal cost.

  • Safe to develop: Another property that goes hand in hand with easy to develop is safe to develop. If our local environment is always reaching out to prod to comment on PRs or perhaps pushing branches and sending Slack notifications, it can be misleading and confusing to figure out where exactly these things are coming from. We will attempt to null out any non-local side effects in our development environment to provide a safe place to work.

  • Consistently deployable: Hitting the button and getting an error during a deployment is one of the worst feelings ever. In most cases when we need to deploy, we need to deploy right now! While building this project we will always keep master releasable and work with our automation in a way that no matter when we need to deploy, it will go.

    Roll safe my friend

  • Easy to upgrade: The best way to mitigate dangerous upgrades is to do them more frequently. While this may sound counterintuitive, it works for a couple of reasons:
    1. The delta is smaller the more frequently you upgrade (and deploy!)
    2. Each time you upgrade and deploy, you get more practice at it
    3. If you fix a small error each upgrade, eventually most errors are fixed :)

    As we will see, what enables easy upgrades is a rollback safety net and a good way to verify that the changes are safe. Since we’ll never be able to catch all the bugs, having the rollback as a backup tool is clutch.

    It's Groovy baby, groovy yeah!

  • Programmatically configurable: This is a giant one. Humans are really terrible at doing or creating the same thing over and over. This is way worse once you have a group of humans trying to do the same thing over and over (like creating jobs or configuring the master). Since we cannot be trusted to click the right buttons in the right sequence, we need to make sure the system does not depend on us doing that. We will cover a variety of tools to configure our system including the Netflix Job DSL, Groovy configuration via the Jenkins Init system, and later on Jenkinsfiles. There should be nothing that is hand jammed!

    Revision Control like a boss

  • Immutable: “Immutable Infrastructure” 2 is a term coined by Chad Fowler back in 2013. He proposes that the benefits of immutability in software architecture (like those brought by functional programming) also apply to infrastructure. What this translates to is that instead of mutating running servers by applying updates, rebooting, and changing configuration in place, we redeploy a fresh set of servers with the updated code whenever a change needs to be made. This makes it much easier to reason about what state a system is in. You will see that this is one of the main characteristics that makes this Jenkins infrastructure different from (and I think much better than) a legacy installation.

  • 100% in SCM: Since we are committing to programmatic configuration, it is crucial that we keep everything within source control management. We should apply the same rigor with our CI / CD system’s codebase that we do with the one it is testing. PRs and automated tests should be the norm with anything you are creating, even if it is a one man show (IMHO).

    Secure

  • Secure: Security is constantly an afterthought, which is a dangerous way to operate a build system. These systems are so complex and so critical to the company (they actually assemble and package the product) that we MUST make them secure by default. Sane secrets management, least-privilege access, immutable systems, and enforcing an audit trail all lead to a more secure and healthy environment.

    Scalable

  • Scalable: Build systems at successful companies only grow, they do not shrink. Teams are constantly adding more code, more tests, more jobs, and more artifacts. Normally this occurs in a polyglot environment, leading to exponential growth of the requirements on the machines that actually do the builds. It does not scale to have a single server with every single requirement installed. This pattern soon becomes unwieldy and hard to work with. We should containerize everything and separate the infrastructure from the system running on top of it to allow independent and customized scaling and maintenance.

If we can build a system that meets these requirements, we will get a lot of stuff for free as well including:

PROFIT

  • Easy disaster recovery
  • Portability to different service providers
  • Fun working environment
  • High levels of productivity
  • Profit!

An iterative approach

Each iteration will build a shippable increment on top of the iteration before it and eventually we will have a production Jenkins! Along the way we will learn a bunch of stuff, build some Docker containers, write a compose file, automate some configuration, get real groovy, and much, much more. I encourage you to stay tuned and work through solving this problem with me.

All code will be published to this blog’s git repo 3 so that you can verify your answers and fix inconsistencies if you get stuck along the way. The repo should be tagged to match up with this series.

Now that we have a general idea of how we want our system to behave and how we will build it out, let’s dig into the architectural concerns.

Next Post: Architecting a Jenkins + Docker CI System

Intro to Docker Swarm: Part 4 - Demo

Vagrant up up and away!

The primary output of my hackweek endeavours was a Docker Swarm cluster in a Vagrant environment. This post will go over how to get it spun up and then how to interact with it.

What is it?

This is a fully functional Docker Swarm cluster contained within a Vagrant environment. The environment consists of 4 nodes:

  • dockerhost01
  • dockerhost02
  • dockerhost03
  • dockerswarm01

The Docker nodes (dockerhost01-3) are running the Docker daemon as well as a couple of supporting services. The main processes of interest on the Docker hosts are:

  • Docker daemon: Running with a set of tags
  • Registrator daemon: This daemon connects to Consul in order to register and de-register containers that have their ports exposed. The entries from this service can be seen under the /services path in Consul’s key/value store
  • Swarm client: The Swarm client is what maintains the list of Swarm nodes in Consul. This list is kept under /swarm and contains a list of <ip>:<ports> of the Swarm nodes participating in the cluster
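For reference, the Swarm client's job is just to heartbeat this host's address into Consul. The exact invocation lives in the repo's Ansible roles and the flag names differ between Swarm releases, so treat the following as an illustrative sketch rather than the canonical command:

# Keep this host's record fresh under /swarm in Consul
# (flag names vary by Swarm version, e.g. --addr vs --advertise;
#  the port matches the Docker daemon's port: 2375, or 2376 with TLS)
swarm join \
  --addr=10.100.199.201:2376 \
  consul://dockerswarm01:8500/swarm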

The Docker Swarm node (dockerswarm01) is also running a few services. Since this is just an example, a lot of services have been condensed onto a single machine; for production, I would not recommend this exact layout.

  • Swarm daemon: Acting as master and listening on the network for Docker commands while proxying them to the Docker hosts
  • Consul: A single-node Consul instance is running. Its UI is available at http://dockerswarm01/ui/#/test/
  • Nginx: Proxying to Consul for the UI

How to provision the cluster

1. Set up prerequisites

  • The GitHub Repo: https://github.com/technolo-g/docker-swarm-demo
  • Vagrant (latest): https://www.vagrantup.com/downloads.html
  • Vagrant hosts plugin: vagrant plugin install vagrant-hosts
  • VirtualBox: https://www.virtualbox.org/wiki/Downloads
  • Ansible: brew install ansible
  • Host entries: Add the following lines to /etc/hosts:
10.100.199.200 dockerswarm01
10.100.199.201 dockerhost01
10.100.199.202 dockerhost02
10.100.199.203 dockerhost03

2a. Clone && Vagrant up (No TLS)

This process may take a while and will download a few gigs of data. In this case we are not using any TLS. If you want to use TLS with Swarm, go to 2b.

# Clone our repo
git clone https://github.com/technolo-g/docker-swarm-demo.git
cd docker-swarm-demo

# Bring up the cluster with Vagrant
vagrant up

# Provision the host files on the vagrant hosts
vagrant provision --provision-with hosts

# Activate your environment
source bin/env
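Sourcing bin/env points your local Docker client at the Swarm master instead of a local daemon. It boils down to exporting DOCKER_HOST; the exact value lives in the repo's script, so the port below is an assumption:

# Roughly what sourcing bin/env accomplishes (port is an assumption; check the repo)
export DOCKER_HOST=tcp://dockerswarm01:2375

The TLS flavor (bin/env_tls, used in 2b) presumably also sets DOCKER_TLS_VERIFY and DOCKER_CERT_PATH so the client can verify the generated certificates.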

2b. Clone && Vagrant up (With TLS)

This will generate certificates and bring up the cluster with TLS enabled.

# Clone our repo
git clone https://github.com/technolo-g/docker-swarm-demo.git
cd docker-swarm-demo

# Generate Certs
./bin/gen_ssl.sh

# Enable TLS for the cluster
echo -e "use_tls: True\ndocker_port: 2376" > ansible/group_vars/all.yml

# Bring up the cluster with Vagrant
vagrant up

# Provision the host files on the vagrant hosts
vagrant provision --provision-with hosts

# Activate your TLS-enabled environment
source bin/env_tls

3. Confirm it’s working

Now that the cluster is provisioned and running, you should be able to confirm it. We’ll do that a few ways. First, let’s take a look with the Docker client:

$ docker version
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.4
Git commit (client): 5bc2ff8
OS/Arch (client): darwin/amd64
Server version: swarm/0.0.1
Server API version: 1.16
Go version (server): go1.2.1
Git commit (server): n/a

$ docker info
Containers: 0
Nodes: 3
 dockerhost02: 10.100.199.202:2376
 dockerhost01: 10.100.199.201:2376
 dockerhost03: 10.100.199.203:2376

Now browse to Consul at http://dockerswarm01/ui/#/test/kv/swarm/ and confirm that the Docker hosts are listed with their proper port like so:

Consul Swarm cluster
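If you prefer the command line over the UI, the same data is available from Consul's HTTP API. This assumes the API is reachable on Consul's default port 8500 on the Swarm node (in this demo only the UI is proxied through Nginx on port 80):

# List the Swarm node registrations straight from Consul's key/value store
# (values are returned base64-encoded in the JSON response)
curl -s http://dockerswarm01:8500/v1/kv/swarm?recurse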

The cluster seems to be alive, so let’s provision a (fake) app to it!

How to use it

You can now interact with the Swarm cluster to provision containers on it. The images used in this demo were pulled down during the Vagrant provisioning, so these commands should work as-is to spin up 2x external proxy containers and 3x internal webapp containers. Two things to note about the commands:

  • The constraints need to match tags that were assigned when Docker was started. This is how Swarm’s filter knows what Docker hosts are available for scheduling.
  • The SERVICE_NAME variable is set for Registrator. Since we are using a generic container (nginx) we are instead specifying the service name in this manner.
# Primary load balancer
docker run -d \
  -e constraint:zone==external \
  -e constraint:status==master \
  -e SERVICE_NAME=proxy \
  -p 80:80 \
  nginx:latest

# Secondary load balancer
docker run -d \
  -e constraint:zone==external \
  -e constraint:status==non-master \
  -e SERVICE_NAME=proxy \
  -p 80:80 \
  nginx:latest

# 3 Instances of the webapp
docker run -d \
  -e constraint:zone==internal \
  -e SERVICE_NAME=webapp \
  -p 80 \
  nginx:latest

docker run -d \
  -e constraint:zone==internal \
  -e SERVICE_NAME=webapp \
  -p 80 \
  nginx:latest

docker run -d \
  -e constraint:zone==internal \
  -e SERVICE_NAME=webapp \
  -p 80 \
  nginx:latest

Now if you do a docker ps or browse to Consul here:

http://dockerswarm01/ui/#/test/kv/services/

Consul Swarm services

You can see the two services registered! Since the routing and service discovery part is extra credit, this app will not actually work but I think you get the idea.

I hope you have enjoyed this series on Docker Swarm. What I have discovered is that Docker Swarm is a very promising application developed by a fast-moving team of great developers. I believe that it will change the way we treat our Docker hosts and will simplify things greatly when running complex applications.

All of the research behind these blog posts was made possible due to the awesome company I work for: Rally Software in Boulder, CO. We get at least 1 hack week per quarter and it enables us to hack on awesome things like Docker Swarm. If you would like to cut to the chase and directly start playing with a Vagrant example, here is the repo that is the output of my Q1 2014 hack week efforts: https://github.com/technolo-g/docker-swarm-demo

Intro to Docker Swarm: Part 3 - Example Swarm SOA

A Docker Swarm SOA

One of the most exciting things that Docker Swarm brings to the table is the ability to create modern, resilient, and flexible architectures with very little overhead. Being able to interact with a heterogeneous cluster of Docker hosts as if it were a single host enables the existing toolchains in use today to build everything we need to create a beautifully simple SOA!

This article is going to attempt to describe a full SOA architecture built around Docker Swarm that has the following properties:

  • A hypervisor layer composed of individual Docker hosts (Docker/Registrator)
  • A cluster layer tying the Docker hosts together (Docker Swarm)
  • A service discovery layer (Consul)
  • A routing layer to direct traffic based on the services in Consul (HAProxy / Nginx)

Hypervisor Layer

The hypervisor layer is made up of a group of discrete Docker hosts. Each host runs the services that allow it to participate in the cluster:

  • Docker daemon: The Docker daemon is configured to listen on a network port in addition to the local Linux socket so that the Swarm daemon can communicate with it. In addition, each Docker host is configured to run with a set of tags that work with Swarm’s scheduler to define where containers are placed. The tags describe the Docker host and are where any identifying information can be associated with it. This is an example of a set of tags a Docker host would be started with:

    • zone: application/database
    • disk: ssd/hdd
    • env: dev/prod
  • Swarm daemon: The Swarm client daemon runs alongside the Docker daemon in order to keep the node in the Swarm cluster. This Swarm daemon runs in join mode and basically heartbeats to Consul to keep its record updated in the /swarm location. This record is what the Swarm master uses to create the cluster. If the daemon were to die, the list in Consul should automatically be updated to remove the node. The Swarm client daemon would use a path in Consul like /swarm, and it would contain a list of the Docker hosts:

    View of Swarm cluster in Consul

  • Registrator daemon: The Registrator app1 is what will be updating Consul when a container is created or destroyed. It listens to the Docker socket and upon each event will update the Consul key/value store. For example, an app named deepthought that requires 3 instances on separate hosts and that is running on port 80 would create a structure in Consul like this:

    View of Services in Consul

    The pattern being:

    /services/<service>-<port>/<dhost>:<cname>:<cport> value: <ipaddress>:<cport>

    • service: The name of the container’s image
    • port: The container’s exposed port
    • dhost: The Docker host that the container is running on
    • cname: The name of the container
    • cport: The container’s exposed port
    • ipaddress: The IP address of the Docker host running the container

    The output of a docker ps for the above service looks like so:

  $ docker ps
  CONTAINER ID        IMAGE                       COMMAND                CREATED             STATUS              PORTS                                   NAMES
  097e142c1263        mbajor/deepthought:latest   "nginx -g 'daemon of   17 seconds ago      Up 13 seconds       10.100.199.203:49166->80/tcp   dockerhost03/grave_goldstine
  1f7f3bb944cc        mbajor/deepthought:latest   "nginx -g 'daemon of   18 seconds ago      Up 14 seconds       10.100.199.201:49164->80/tcp   dockerhost01/determined_hypatia
  127641ff7d37        mbajor/deepthought:latest   "nginx -g 'daemon of   20 seconds ago      Up 16 seconds       10.100.199.202:49158->80/tcp   dockerhost02/thirsty_babbage
  

This is the most basic way to record the services and locations. Registrator also supports passing metadata along with the container that includes key information about the service2.

Another thing to mention: it seems the author of Registrator intends the daemon to be run as a Docker container. Since a Docker Swarm cluster is meant to be treated as a single Docker host, I prefer the idea of running the Registrator app as a daemon on the Docker hosts themselves. This allows a state in which zero containers are running and the cluster is still alive. It seems like a very appropriate place to draw the line between platform and applications.
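To make the hypervisor layer a bit more concrete, here is a rough sketch of how the host-level daemons might be started. The flags are illustrative assumptions (they differ between Docker and Registrator versions), not the exact invocations used in the demo repo:

# Docker daemon listening on the network and tagged for Swarm's scheduler
# (on newer versions this is `dockerd`; label support varies by version)
docker -d \
  -H unix:///var/run/docker.sock \
  -H tcp://0.0.0.0:2376 \
  --label zone=application \
  --label disk=ssd \
  --label env=dev

# Registrator watching the local Docker socket and updating Consul
# (registry URI and backend syntax differ between Registrator versions)
registrator -ip 10.100.199.201 consul://dockerswarm01:8500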

Cluster Layer

At this layer we have the Docker Swarm master running. It is configured to read from Consul’s key/value store under the /swarm prefix, and it generates its list of nodes from that information. It is also what listens for client connections to Docker (create, delete, etc.) and routes those requests to the proper backend Docker host. This means that it has the following requirements (a sketch of what the master’s invocation might look like follows the list):

  • Listening on the network
  • Able to communicate with Consul
  • Able to communicate with all of the Docker daemons
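Assuming Consul's default port and Swarm's consul:// discovery backend, the master's invocation might look something like this sketch (flag syntax varies between Swarm releases):

# Swarm master: build the node list from Consul's /swarm prefix and
# listen on the network for Docker client connections
swarm manage \
  -H tcp://0.0.0.0:2375 \
  consul://dockerswarm01:8500/swarm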

So far I have yet to see any mention of making the Swarm daemon itself HA, but after working with it there really do not seem to be any reasons that it could not be. I expect that a load-balancing proxy with TCP support (HAProxy) could be put in front of a few Swarm daemons with relative ease. Sticky sessions would have to be enabled, and possibly an active/passive setup if there are state-synchronization issues between multiple Swarm daemons, but it seems doable. Since the containers continue to run and are accessible even in the case of a Swarm failure, we are going to accept the risk of a non-HA Swarm node over the complexity and overhead of load balancing the nodes. Tradeoffs, right?

Service Discovery Layer

The service discovery layer is run on a cluster of Consul nodes; specifically, its key/value store. In order to maintain quorum (n/2 + 1 nodes) even in the case of a failure, there should be an odd number of nodes. Consul has a very large feature set3 including auto service discovery, health checking, and a key/value store, to name a few. We are only using the key/value store, but I would expect there are benefits to incorporating the other aspects of Consul into your architecture. For this example configuration, the following processes are acting on the key/value store (a minimal server-agent invocation is sketched after the list):

  • The Swarm clients on the Docker hosts will be registering themselves in /swarm
  • The Swarm master will be reading /swarm in order to build its list of Docker hosts
  • The Registrator daemon will be taking nodes in and out of the /services prefix
  • Consul-template will be reading the key/value store to generate the configs for the routing layer
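For illustration, one of the Consul server agents in a three-node cluster might be started along these lines (node name and bind address are hypothetical; see Consul's docs for the full flag set):

# One of three Consul servers; quorum requires 2 of the 3 to be alive
consul agent -server \
  -bootstrap-expect 3 \
  -data-dir /var/lib/consul \
  -node consul01 \
  -bind 10.100.199.210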

This is the central datastore for all of the clustering metadata. Consul is what ties the containers on the Docker hosts to the entries in the routing backend.

Consul also has a GUI that can be installed in addition to everything else, and I highly recommend installing it for development work. It makes figuring out what has been registered and where much easier. Once the cluster is up and running you may have no more need for it, though.

Routing Layer

This is the edge layer that all external application traffic will run through. These nodes sit on the edge of the Swarm cluster, are statically IP’d, and have DNS entries that can be CNAME’d to for any service run on the cluster. These nodes listen on ports 80/443, etc., and have the following services running:

  • Consul-template: This daemon polls Consul’s key/value store (under /services) and, when it detects a change, writes a new HAProxy/Nginx config and gracefully reloads the service. The templates are written in Go templating and the output should be in standard HAProxy or Nginx form. (A sketch of how this might be invoked follows this list.)

  • HAProxy or Nginx: Either of these servers is fully battle-proven and ready for anything that is needed, even on the edge. The service is configured dynamically by Consul-template and reloaded when needed. The main change that happens frequently is the modification of the list of backends for a particular vhost. Since the list is maintained by what is actually alive and registered in Consul, it changes as frequently as the containers do.
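As a hedged example, consul-template on a routing node could be wired up roughly like this; the template path and reload command are assumptions, and the flag names changed in later consul-template releases:

# Watch Consul and re-render the HAProxy config, then gracefully reload
# the service whenever the /services data changes
consul-template \
  -consul dockerswarm01:8500 \
  -template "/etc/haproxy/haproxy.ctmpl:/etc/haproxy/haproxy.cfg:service haproxy reload"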

This is a high level overview of a Docker Swarm cluster that is based on an SOA. In the next post I will demonstrate a working infrastructure as described above in a Vagrant environment. This post will be coming after our Docker Denver Meetup4 so stay tuned (or better yet, come to the Meetup for the live demo)!

All of the research behind these blog posts was made possible due to the awesome company I work for: Rally Software in Boulder, CO. We get at least 1 hack week per quarter and it enables us to hack on awesome things like Docker Swarm. If you would like to cut to the chase and directly start playing with a Vagrant example, here is the repo that is the output of my Q1 2014 hack week efforts: https://github.com/technolo-g/docker-swarm-demo

  1. https://github.com/progrium/registrator 

  2. https://github.com/progrium/registrator#single-service-with-metadata 

  3. http://www.consul.io/docs/index.html 

  4. http://www.meetup.com/Docker-Denver/events/218859311/