Graeme Fawcett

Peregrine

In 2015 I needed to deploy an early Angular application along with a companion Jetty service in Amazon's recently acquired OpsWorks service. As a manual operation it was awful, with way too many manual steps in the web UI, so I built a tool called deployoryah in Ruby to do it for me. A decade later, the early pipelines I built from deployoryah had become what was known as Peregrine, and Peregrine had grown up.

Archetypes, Actors & Activities

Peregrine is based around what is known as a Context Accumulating Directed Acyclic Graph (DAG). This is just a fancy way of saying a sequence of activities that never loops and that passes around a shared notebook the activities can use to leave notes for each other. In the context of a deployment pipeline, the activities are steps like compiling source code, provisioning cloud infrastructure, and the myriad quality and compliance steps that modern software delivery requires.
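The shared-notebook idea can be sketched in a few lines of Ruby (Peregrine's home language). Everything here is illustrative - the class, stage names, and shapes are my stand-ins, not Peregrine's actual internals:

```ruby
# A minimal context-accumulating pipeline: each stage is a callable that
# receives the shared context hash and returns notes to merge back in.
# Names and shapes are illustrative, not Peregrine's real internals.
class Pipeline
  def initialize(stages)
    @stages = stages # an ordered list: a DAG flattened to one topological path
  end

  def run(context = {})
    @stages.reduce(context) do |ctx, (_name, stage)|
      notes = stage.call(ctx)  # the stage reads what earlier stages left...
      ctx.merge(notes || {})   # ...and leaves notes for later ones
    end
  end
end

stages = {
  build:  ->(ctx) { { artifact: "app-#{ctx[:commit]}.jar" } },
  deploy: ->(ctx) { { service_url: "https://example.test/#{ctx[:artifact]}" } },
}
result = Pipeline.new(stages).run(commit: "abc123")
# result[:service_url] => "https://example.test/app-abc123.jar"
```

The deploy stage never needs to know how the artifact was produced; it just reads the note the build stage left in the context.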

Peregrine implemented its DAG recursively: as the context flowed through the graph, portions would be sliced off and passed through handlers to actors at each stage. Each stage, such as build, deploy, or validate, allowed configuration at three points: a pre activity list, the stage's primary activity, and a post activity list.

Within the context of a parent orchestration engine - I chose Jenkins because it was convenient and available, but any wrapper with SCM and workspace primitives would work - Peregrine acts as a compiler that turns YAML-declared workflows into immutably deployed infrastructure and captures an auditable trail of context along the way.

```yaml
pipeline:
  conditions:
    is_feature_build:
      all:
        - type: environmentValue
          args:
            key: FEATURE_BRANCH
            value: nil
          negate: true
    is_devl_build:
      all:
        - type: environmentValue
          args:
            key: PIPELINE_ENVIRONMENT
            value: devl
    is_preprod_build:
      all:
        - type: environmentValue
          args:
            key: PIPELINE_ENVIRONMENT
            value: preprod
  conditional_values:
    validateConfig:
      - values:
          disabledTests: [regression]
        conditions:
          any:
            - !Mapped pipeline.conditions.is_feature_build
            - !Mapped pipeline.conditions.is_devl_build
      - values:
          disabledTests: [integration]
        conditions:
          any:
            - !Mapped pipeline.conditions.is_preprod_build
      - values:
          disabledTests: []
  stages:
    build:
      type: maven
      args:
        maven_args: -Dskip.tests=true
    deploy:
      type: ecs
      args:
        url: !Environment pipeline.stages.deploy.args.url
      post:
        success:
          - type: teams
            args:
              webhook_url:
              message: My really cool app has deployed to <%= resources.service_url %>
        failed:
          - type: pagerduty
            args:
              service_id:
            conditions:
              all:
                - type: environmentValue
                  args:
                    key: PIPELINE_ENVIRONMENT
                    value: prod
    validate:
      pre:
        - type: publishFeature
          conditions: !Mapped pipeline.conditions.is_feature_build
      type: ecs
      args:
        disableTests: !Conditional validateConfig disableTests

environments:
  default:
    pipeline:
      stages:
        deploy:
          args:
            url: my-super-cool-app.<%= pipeline_environment %>.my-super-cool.co
  feature:
    pipeline:
      stages:
        deploy:
          args:
            url: my-super-cool-app-<%= feature_branch %>.devl.my-super-cool.co
```

This example focuses on the orchestration language found in the stages block of the pipeline definition. The static definitions for things like the artifacts to be deployed, and the components that defined the infrastructure resources to deploy them, are omitted for brevity. They were composed of standard items that would be familiar to anyone who has seen the AWS APIs for ECS, CloudFormation, or Lambda, or configured Terraform or the AWS CDK. The final form of Peregrine that I knew could define as many artifacts of each type the build and image stages supported, which could in turn be deployed in as many components as were required. The proof of concept for this final V3 release was an upgrade of the global Content I/O Network, which the company used to ride the AWS backbone as a distributed FTP network between content providers and consumers that all still required the ancient protocol. It consisted of 25 full stacks distributed across 5 regions, each creating isolated VPCs with Storage Gateway backed FTP nodes, all sharing a globally distributed RDS database for authentication across private links. It was deployed with a single button. I was proud of that one.

```yaml
pipeline:
  conditions:
    is_feature_build:
      all:
        - type: environmentValue
          args:
            key: FEATURE_BRANCH
            value: nil
          negate: true
    is_devl_build:
      all:
        - type: environmentValue
          args:
            key: PIPELINE_ENVIRONMENT
            value: devl
    is_preprod_build:
      all:
        - type: environmentValue
          args:
            key: PIPELINE_ENVIRONMENT
            value: preprod
```

The first section to highlight is the conditions definition. This is just naked YAML that could hold anything a developer wanted; the common convention was to define named conditions that could be referenced elsewhere, like !Mapped pipeline.conditions.is_devl_build, using a custom YAML tag that retrieved the value at the named path. This produced configurations that were both DRY and semantically clear.
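Under the hood, a tag like !Mapped only needs a dotted-path lookup against the parsed document. A hypothetical sketch - resolve_mapped is my stand-in, not Peregrine's actual implementation:

```ruby
require "yaml"

# Resolve a dotted path like "pipeline.conditions.is_devl_build" against a
# parsed YAML document. A hypothetical stand-in for Peregrine's !Mapped tag.
def resolve_mapped(doc, path)
  path.split(".").reduce(doc) { |node, key| node.fetch(key) }
end

doc = YAML.safe_load(<<~YAML)
  pipeline:
    conditions:
      is_devl_build:
        all:
          - type: environmentValue
            args: { key: PIPELINE_ENVIRONMENT, value: devl }
YAML

condition = resolve_mapped(doc, "pipeline.conditions.is_devl_build")
# condition now holds the full "all:" condition block, ready to evaluate
```

Because the tag resolves to the node itself rather than a copy of its text, the same condition can be referenced from any number of places without drifting.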

```yaml
conditional_values:
  validateConfig:
    - values:
        disabledTests: [regression]
      conditions:
        any:
          - !Mapped pipeline.conditions.is_feature_build
          - !Mapped pipeline.conditions.is_devl_build
    - values:
        disabledTests: [integration]
      conditions:
        any:
          - !Mapped pipeline.conditions.is_preprod_build
    - values:
        disabledTests: []
```

This is where the utility of that !Mapped tag becomes obvious. The conditional_values block is a fixed point in the configuration used by the !Conditional custom tag for context-sensitive value substitution. It works on the concept of value groups that are evaluated in order, with the first group whose conditions match being selected. The any (OR) and all (AND) operators nest, providing a rich language for declaratively supplying configuration to the artifact and component definitions, as well as to the pipeline graph itself.
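The first-match selection over value groups can be sketched like this. This is a simplified stand-in for Peregrine's !Conditional resolution: here a leaf condition is reduced to "environment lookup equals expected value", with an optional negate, and a group with no conditions acts as the unconditional fallback:

```ruby
# Evaluate a nested any/all condition tree against an environment hash.
# Simplified: a leaf matches when env[args.key] == args.value (negatable).
def condition_met?(node, env)
  if node.key?("any")
    node["any"].any? { |n| condition_met?(n, env) }
  elsif node.key?("all")
    node["all"].all? { |n| condition_met?(n, env) }
  else
    match = env[node.dig("args", "key")] == node.dig("args", "value")
    node["negate"] ? !match : match
  end
end

# Return the values of the first group whose conditions match (or that
# has no conditions at all - the fallback group).
def select_values(groups, env)
  chosen = groups.find { |g| g["conditions"].nil? || condition_met?(g["conditions"], env) }
  chosen && chosen["values"]
end

groups = [
  { "values" => { "disabledTests" => ["regression"] },
    "conditions" => { "any" => [
      { "args" => { "key" => "PIPELINE_ENVIRONMENT", "value" => "devl" } } ] } },
  { "values" => { "disabledTests" => [] } }, # unconditional fallback
]
select_values(groups, { "PIPELINE_ENVIRONMENT" => "devl" })
# => { "disabledTests" => ["regression"] }
```

With PIPELINE_ENVIRONMENT set to anything else, the first group fails to match and the empty fallback group wins instead.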

```yaml
stages:
  build:
    type: maven
    args:
      maven_args: -Dskip.tests=true
  deploy:
    type: ecs
    args:
      url: !Environment pipeline.stages.deploy.args.url
    post:
      success:
        - type: teams
          args:
            webhook_url:
            message: My really cool app has deployed to <%= resources.service_url %>
      failed:
        - type: pagerduty
          args:
            service_id:
          conditions:
            all:
              - type: environmentValue
                args:
                  key: PIPELINE_ENVIRONMENT
                  value: prod
  validate:
    pre:
      - type: publishFeature
        conditions: !Mapped pipeline.conditions.is_feature_build
    type: ecs
    args:
      disableTests: !Conditional validateConfig disableTests

environments:
  default:
    pipeline:
      stages:
        deploy:
          args:
            url: my-super-cool-app.<%= pipeline_environment %>.my-super-cool.co
  feature:
    pipeline:
      stages:
        deploy:
          args:
            url: my-super-cool-app-<%= feature_branch %>.devl.my-super-cool.co
```

In most pipelines, stages slot into a fixed structure provided by an archetype (EC2, ECS, infrastructure, etc.) and are orchestrated as a standard Jenkins pipeline. The system could also be treated like a state machine using a variation of the !Environment thunk - a hole that is filled by the compiler based on the current deployment environment and a path into the environments section. This could also be treated as a stack pointer: a custom activity type using an archetype with a single run phase could perform activities like declarative investigate-and-repair runbooks. This was extended to allow the step sequences to be read from an external source, and eventually to allow an LLM to be that external source, with context and a choice of options as the interface. This was fun; you should try it some time.
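One plausible reading of how a thunk gets filled: pick the overlay for the current deployment environment (falling back to default), deep-merge it over the pipeline document, then read the value at the thunk's dotted path. A hypothetical sketch under those assumptions, not Peregrine's real compiler:

```ruby
require "yaml"

# Recursively merge an environment overlay over the base document;
# overlay values win, nested hashes merge.
def deep_merge(base, overlay)
  base.merge(overlay) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Fill an !Environment hole: overlay the chosen environment, then
# resolve the thunk's dotted path against the merged document.
def fill_environment_thunk(doc, env_name, path)
  overlay = doc["environments"][env_name] || doc["environments"]["default"]
  merged  = deep_merge(doc, overlay)
  path.split(".").reduce(merged) { |node, key| node.fetch(key) }
end

doc = YAML.safe_load(<<~YAML)
  pipeline:
    stages:
      deploy:
        args:
          url: UNSET        # the hole the thunk fills
  environments:
    default:
      pipeline:
        stages:
          deploy:
            args: { url: my-app.devl.example.test }
YAML

fill_environment_thunk(doc, "devl", "pipeline.stages.deploy.args.url")
# => "my-app.devl.example.test"
```

The same mechanism explains the feature overlay in the example above: a feature build simply selects a different overlay before the merge, so the deploy stage never knows which environment it is aimed at.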

```yaml
deploy:
  type: ecs
  args:
    url: !Environment pipeline.stages.deploy.args.url
  post:
    success:
      - type: teams
        args:
          webhook_url:
          message: My really cool app has deployed to <%= resources.service_url %>
    failed:
      - type: pagerduty
        args:
          service_id:
        conditions:
          all:
            - type: environmentValue
              args:
                key: PIPELINE_ENVIRONMENT
                value: prod
```

Next we can see the shape of the recursion: each stage has a single type to be executed with its args, and the pre and post activity lists contain the same shapes. In each case, the conditions that controlled our value groups before can also be used to control flow, and this applies at the stage level as well. This means that a single YAML file, when configured with the !Environment tag and conditional flows and values where necessary, can describe the complete end-to-end life cycle of an application in a file that can be committed to source control.

The pipelines were designed as two phases. The first took source code and produced immutable images containing the compiled code and any runtime and non-external or ephemeral configuration required to support it in any potential target deployment environment. Once the image was built it could never be reopened. The same applied to the infrastructure that was created to support it - where appropriate at least; we made exclusions for cases like ECS where the infrastructure was just wires. The resources were captured and held in our own graph, regardless of IaC provider, so that !Resource and !ComponentResource tags could reference them elsewhere in the graph internally, and the !Value tag could fetch from their RESOURCES key under their pipeline node at any point. This allowed a delayed form of self-healing: if infrastructure like shared load balancers was injured through user or physical action, a rebuild of the central infrastructure pipeline would reseed the graph with updated ARNs that would be consumed upon rebuilding the downstream services. We had distributed Terraform back when Chef was still popular.

Tools and Toys

Behind the compiler, Peregrine grew over the years from a group of things that mostly didn't suck.

Core

The deployoryah toolkit grew to be a full AWS SDK that shared a lot of opinions on how to do things with FinOps and InfoSec. It reached about 83,000 lines of code and integrated 38 services, with CLI interfaces for most.

Tools

The CLI tools (firkin, nimbus, jeeves, etc.) were embedded in AMIs that were distributed by the platform itself to the Jenkins instances that powered it, through a custom version of the EC2 cloud plugin. This was customized to consume an internal API through apirah to automatically pull the latest AMI for each of our supported node configurations. This gave Peregrine an immutable means of distributing itself through itself.

Eventually the Ruby base outgrew itself, and for performance reasons the hot-path items were migrated to Go. This reduced startup time for tools that were mainly CLI interfaces to Peregrine's graph and didn't require logic to be dually maintained. Any tools that required consistent logic were instead ported to one of Peregrine's APIs, and the Go client would be implemented as a wrapper. The deployoryah core library was shared between the tools deployed on build nodes and the APIs on our platform ECS clusters. This provided synchronized logic and controls across both interfaces, with the hot paths bypassing the API for simple CRUD access. The shared control plane provided a single gated point of access to the platform for automation, operations, and developers. This extended to Copilot and fully agentic assistants through an earlier iteration of Wanderland as a live documentation and constrained agentic control plane.

The live documentation through Wanderland's executable markdown provided both a view into how to use the platform and what was currently running on it. The executable fences that the platform extracted from the prose describing them would be computed and merged back in, providing consumers, both human and assistant, with tools made of text files.

Compliance

All images were tagged with the commit hashes of both their application and infrastructure contributors. This allowed the same inputs to ride along with the image, which was Peregrine's deployment unit and was immutable through the lifespan of the build - from integration through production. This meant that any running service across the dozens of ECS clusters and hundreds of individual AutoScaling Groups could be traced back to its source, both application and IaC.

Post acquisition, I was invited to be a member of a cross-cutting business unit team to determine and then implement corporate-wide container security policies. As these things do, nothing much was ever really decided or implemented, but Peregrine grew the ability to collect and report on the automated image scan results that our ECR-based registries were providing. These were included in the automated build reports generated from each pipeline execution at each environment boundary and provided the most up-to-date results of the daily scans. These results could also be configured as gates at the promotion boundaries, so that thresholds could be defined with hard-stop triggers. These were never enabled, as the InfoSec team responsible for defining the threshold and distribution mechanism for policies was replaced and outsourced, and the project was discontinued.

APIs

The Grape/Sinatra-based API apirah provided HTTP access to consumers external to the platform core, or those that could not access a Ruby runtime. It shared a library with the CLI tools and mirrored the same shapes as closely as possible.

A similar progression happened for apirah, where a reimplementation was eventually required. This was triggered by technical debt more than anything. The original implementation was made by a young engineer who was still learning about API design, and his choices were difficult to explain to his own young engineers. A clean slate in that regard was a gift.

Supporting Stuff

A custom Terraform provider integrated Peregrine pipelines that chose Terraform as their IaC provider with the corporate "service registry", which provided tags sufficient for compliance. To ease the burden on our developers, Peregrine allowed them to specify the system, environment, and sub_environment tuple that was the minimum set needed to identify an application and its target deployment environment. This would retrieve the full complement of tags, which by the end had grown to more than 20, with the two taxonomies (pre- and post-acquisition) merged at the interface where required. It often takes a while to turn off all the old compliance checks before the metadata can be fully purged.

This was coupled with a custom registry, so we could run something simple in-network to host the provider. I always find it easier to get adoption if you make it easier to adopt.

Documentation

Documentation was always the most challenging thing as a solo platform engineer. My best effort was a sizeable, manually constructed JSON Schema covering the entire YAML schema. I think it was 10K lines on its own. It was a fun weekend. I didn't update it for two years. Eventually I decomposed it into YAML files, which were later generated automatically.

I had created a Hugo-based documentation site organized around the ideas of Divio: deep-dive explanations for whatever recurring issues I felt were education based; tutorials for the core workflows, like creating full end-to-end application pipelines; how-to guides for more focused tasks; and then the engineer's best friend, the reference manuals, which were automated as much as possible. I hijacked the formatters in Ruby's Commander and Go's aptly named Cobra CLI frameworks to generate output suitable for consumption by a pipeline that produced Markdown for Hugo.

The same drive to keep code and documentation in sync that produced the annotation-driven Groovy classes configuring the compiler in the pipeline library also meant that we could automatically create user guides for them, using their own test scenarios as examples so we could ensure they were valid. As you can tell, I may have a thing for not having to repeat myself. It also means I can ensure that the code and the documentation are as close to each other as possible.

In this case, the Jenkins pipeline library that acted as the compiler - converting the YAML files into steps to be fired by Jenkins - was decomposed into a series of annotations that could be applied to the Groovy classes that defined those steps. Those annotations were the AST for a CommandBuilder that compiled the actual sh commands to run. Completely data-driven where possible, with an escape hatch for Groovy logic where needed. You could compile Peregrine's YAML compiler from YAML, if you were the kind of guy who hand-wrote a 10K-line JSON schema on the weekend.

You shouldn't.