Best Practices for Implementing DevOps in Very Large Enterprises
Earlier this year at DevOps Days Toronto and Vancouver, we gave a presentation outlining several tips and techniques for implementing DevOps within very large enterprises. We used a mixture of humour and memes to convey some of the difficulties of moving from a traditional ITIL-style approach in a conventional datacentre to DevOps methodologies on a hyperscale cloud provider such as Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform (GCP).
This talk covered many of the approaches and techniques we’ve advocated across numerous successful cloud transformations, and we’ve distilled some select learnings into this blog post. Naturally, this is derived from our experience and does not fully encompass our approach, nor will every one of the following tips necessarily apply to your organisation – though we believe such cases will be the exception rather than the rule.
Identify your ‘Masthead’ application
The success of any organisational change is predicated on the enthusiasm of those subject to the change. Accordingly, getting your stakeholders excited by demonstrating the value of DevOps practices on a high-visibility application – a masthead application – will drive early adoption and buy-in to your transformation program. Choosing the right application is important, so consider several criteria:
- Visibility is paramount: the application needs to be important to the organisation;
- Ensuring that the application has recognisable branding will reinforce the sense of ownership of the transformation;
- A clear need for DevOps practices to remediate the current deployment’s lack of scale, low agility, or other constraints will provide a strong motive for other teams facing similar constraints to migrate;
- If possible, choosing an application with a relatively low level of tech complexity will provide for an easier migration and shorter delivery time; and
- The migration should generate lessons for the organisation in terms of security, compliance, and architectural patterns.
Develop an opinionated pipeline
Every organisation has performance, reliability, resiliency and security ‘opinions’ on the configuration of applications, infrastructure and operations. For example, an organisation may have an opinion that all databases are replicated across multiple fault domains for availability and resiliency. Within an organisation, these opinions are supplied by many stakeholders – operations, security and compliance, to name a few. By developing an opinionated pipeline that takes these opinions and provides them as a consumable outcome for developers – think ‘MySQL database’ instead of ‘private network zone, replicated MySQL with encryption at rest…’ – we can greatly reduce the burden of deploying and migrating applications whilst ensuring compliance with the organisation’s performance, reliability, resiliency, security and other requirements.
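As a minimal illustration of this idea, the sketch below expands a developer’s simple resource request into a fully opinionated configuration. All names here (`OPINIONS`, `expand_request`, the specific opinion values) are hypothetical, not a real platform API:

```python
# A hypothetical 'opinionated' resource catalogue: developers ask for a
# simple outcome, the platform supplies the organisation's opinions.
OPINIONS = {
    "mysql": {
        "engine": "mysql",
        "subnet_tier": "private",     # opinion: databases never face the internet
        "multi_az": True,             # opinion: replicate across fault domains
        "storage_encrypted": True,    # opinion: encryption at rest is mandatory
        "backup_retention_days": 35,  # opinion: retain backups for audit
    },
}

def expand_request(resource_type: str, app_name: str) -> dict:
    """Turn a developer's one-line request into the fully opinionated config."""
    if resource_type not in OPINIONS:
        raise ValueError(f"No opinion defined for '{resource_type}'")
    config = dict(OPINIONS[resource_type])
    config["identifier"] = f"{app_name}-{resource_type}"
    return config

# The developer writes one line; the pipeline deploys the full opinionated stack.
print(expand_request("mysql", "payments"))
```

The design choice is that developers only express intent; every non-functional requirement stays under the platform team’s control and can be tightened centrally.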
Practise compliance-as-code
Large organisations are often subject to multiple regulatory regimes – especially if they are multi-national. Managing the controls and protections to ensure continual compliance can be onerous if deployed and managed in a ‘high-touch’ manner. Practising ‘compliance-as-code’ greatly reduces the operational burden of deploying controls and protections, as well as streamlining future audits and assessments:
- Identify the types of compliance to adhere to such as PIPA, GDPR, PCI, HIPAA, etc.;
- Disable non-compliant services (if possible); and
- Create CI/CD jobs to apply compliance rules or create cloud-native assets, such as Config Rules, WAF, Systems Manager, or Lambda as required (a minimal sketch follows this list).
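The sketch below shows what such a CI/CD job might look like on AWS, assuming boto3 and credentials in the environment. It deploys an AWS-managed Config rule that flags unencrypted S3 buckets; the rule name and description are illustrative:

```python
# A minimal compliance-as-code job: deploy an AWS-managed Config rule that
# flags S3 buckets without server-side encryption. A CI/CD pipeline would
# run this on every change to the compliance rule set.
import boto3

config = boto3.client("config")

config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-bucket-encryption-enabled",
        "Description": "Flags S3 buckets without server-side encryption.",
        "Source": {
            "Owner": "AWS",
            # AWS-managed rule identifier; a custom rule would point at a Lambda.
            "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED",
        },
    }
)
```

Because the rule lives in version control and is applied by the pipeline, an auditor can trace exactly when each control was introduced and by whom.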
Build security in from the beginning
Enabling DevOps practices to lower the time-to-market for application developers is a powerful business driver, but if security is not a key concern at an early stage, the result can be an increased risk of security breaches. Breaches – especially those affecting large organisations – usually become major PR incidents, as seen at Instagram, Uber and even the NSA. They may also present regulatory difficulties depending on the relevant jurisdictions and the type of data leaked by the organisation.
To minimize the risk of accidentally exposing sensitive information, introduce reliable, multi-layered security into your DevOps practices from the beginning of the transformation program. Examples may include:
- Performing static code analysis during deployments; consider enabling ‘report-only’ mode for moderate- or low-risk items initially to give developers a chance to remedy issues themselves (see the sketch after this list);
- If the application is containerised, investigate tools such as container scanners, which are designed to highlight security issues specific to these platforms;
- Create user-friendly documentation and training/support – if maintaining the organisation’s security posture is a positive experience, developers will be more likely to not only comply but suggest improvements and take ownership themselves; and
- Work directly with Security personnel to help them onboard the first few apps and tools, then support them as they gradually operate, maintain and improve the organisation’s security controls.
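As a sketch of the ‘report-only’ static analysis gate mentioned above, the following uses the Bandit static analyser for Python as an example tool; the `src/` path and severity policy are illustrative assumptions:

```python
# A 'report-only' security gate for a CI pipeline: high-severity findings
# fail the build, while moderate/low findings are surfaced but non-blocking,
# giving developers a chance to remediate before the gate is tightened.
import json
import subprocess
import sys

result = subprocess.run(
    ["bandit", "-r", "src/", "-f", "json"],
    capture_output=True, text=True,
)
findings = json.loads(result.stdout).get("results", [])

high = [f for f in findings if f["issue_severity"] == "HIGH"]
other = [f for f in findings if f["issue_severity"] != "HIGH"]

for f in other:  # report-only: surface the issue, don't block the build
    print(f"REPORT-ONLY: {f['filename']}:{f['line_number']} {f['issue_text']}")

if high:
    for f in high:
        print(f"BLOCKING: {f['filename']}:{f['line_number']} {f['issue_text']}")
    sys.exit(1)  # fail the deployment only on high-risk findings
```

Over time, the blocking threshold can be lowered as teams clear their backlog of reported findings.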
Define and build a Standard Operating Environment
Whilst a Standard Operating Environment (SOE) is not a DevOps-specific concept, it is worth reiterating, particularly when building out a new cloud platform. This also presents an opportunity to review the processes used to produce the organisation’s SOE and identify improvements such as:
- Automating the build of an SOE per operating system (OS), including managing the configuration as code;
- Managing security and OS patching at the SOE level (CIS/NIST standards are valuable references); and
- Ensuring application teams deploy their applications on the SOE, but also that they do not ‘roll back’ patches or configuration in the SOE – ideally reporting if the application drifts (intentionally or otherwise) from the SOE configuration, as sketched below.
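A minimal sketch of the drift reporting mentioned in the last point follows, assuming a Debian/Ubuntu-based SOE with a baseline package manifest baked into the image. The manifest path and reporting mechanism are hypothetical; real estates might use AWS Systems Manager or their configuration-management tool’s own reporting instead:

```python
# Report packages whose installed version has drifted from the SOE baseline.
import json
import subprocess

BASELINE_PATH = "/etc/soe/baseline.json"  # hypothetical manifest baked into the image

def installed_packages() -> dict:
    """Return installed package versions via dpkg (Debian/Ubuntu SOE assumed)."""
    out = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Package} ${Version}\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split(" ", 1) for line in out.strip().splitlines())

def report_drift() -> list:
    with open(BASELINE_PATH) as f:
        baseline = json.load(f)
    current = installed_packages()
    return [
        f"{pkg}: baseline={ver} current={current.get(pkg, 'REMOVED')}"
        for pkg, ver in baseline.items()
        if current.get(pkg) != ver
    ]

for line in report_drift():
    print("SOE DRIFT:", line)  # in practice, ship these to a central log/alerting sink
```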
Enforce appropriate change control using ‘self-tainting’ ephemeral instances
Having builds fail due to manual changes that have not been captured in code is not only frustrating, but also inefficient and a potential cause of application downtime. A technique to prevent reliance on manual changes during deployments is to enforce the concept of ‘self-tainting ephemeral instances’: upon user login to a given server, the access is logged and the server is ‘marked’ for automatic redeployment within a defined period (24 hours is common).
This helps ensure that any changes introduced are automatically discarded unless they have been made through the appropriate deployment methodology – usually a change to the application deployment code, deployed via the CI/CD tooling.
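One way this could be implemented on AWS is sketched below, assuming EC2 instances in an auto-scaling group and boto3. The tag name and grace period are illustrative; the first function would run from a login hook on the instance, the second from a scheduled job such as a Lambda:

```python
# 'Self-tainting' sketch: tag an instance on login, then reap tainted
# instances past the grace period so the ASG redeploys them from clean code.
import datetime
import urllib.request
import boto3

def taint_self():
    """Tag this instance as tainted; call from a login (e.g. PAM/profile) hook."""
    # Instance metadata service (IMDSv1 shown for brevity; v2 adds a token step).
    instance_id = urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).read().decode()
    boto3.client("ec2").create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "tainted-at", "Value": datetime.datetime.utcnow().isoformat()}],
    )

def reap_tainted(grace_hours: int = 24):
    """Terminate tainted instances past the grace period (scheduled job/Lambda)."""
    ec2 = boto3.client("ec2")
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(hours=grace_hours)
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag-key", "Values": ["tainted-at"]}]
    )["Reservations"]
    for r in reservations:
        for inst in r["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if datetime.datetime.fromisoformat(tags["tainted-at"]) < cutoff:
                ec2.terminate_instances(InstanceIds=[inst["InstanceId"]])
```

Logging the login separately gives teams a window to capture any necessary change in code before the instance is recycled.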
Don’t try to fix process problems solely with technological solutions
A common trap is to rely on new technologies and platforms to fix what may be process problems. This approach can replicate the same problems in the new environment rather than resolving them. A better approach is to identify these issues and remediate them, either in place or during the migration. This resolves existing issues and instils a DevOps culture that considers outcomes not just from a technological point of view, but from a holistic view encompassing application, operations and security.
Utilise test-driven Infrastructure as Code
To maximise confidence in deployments – whether of applications or of changes to the consumable opinions (see above) – we should not only run linting and rule checks, but also utilise ‘test-driven Infrastructure as Code’ to rapidly provide feedback to developers and platform maintainers. Tests can range from as simple as ‘does it deploy successfully?’ for non-production environments to full testing frameworks such as Serverspec/InSpec.
This will shorten the feedback loop during development whilst also minimising the chance of failed deployments in production.
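As a Python stand-in for the Serverspec/InSpec approach, the sketch below uses pytest and boto3 to test a deployed CloudFormation stack; the stack name is hypothetical, and a CI job would run these tests immediately after each non-production deployment:

```python
# Test-driven IaC sketch: assert that the deployed stack matches expectations.
import boto3
import pytest

STACK_NAME = "masthead-app-dev"  # illustrative stack name

@pytest.fixture(scope="module")
def stack():
    cfn = boto3.client("cloudformation")
    return cfn.describe_stacks(StackName=STACK_NAME)["Stacks"][0]

def test_deploys_successfully(stack):
    # The simplest possible test: did the deployment actually complete?
    assert stack["StackStatus"] in ("CREATE_COMPLETE", "UPDATE_COMPLETE")

def test_termination_protection_enabled(stack):
    # Example of testing an organisational 'opinion' against real infrastructure.
    assert stack.get("EnableTerminationProtection") is True
```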
Implement blue/green deployment methodology
A ‘blue/green deployment’ is a deployment methodology where a copy of the application stack containing the desired changes (the green deployment) is created alongside the existing released application stack (the blue deployment). On ‘release’, traffic is routed away from the blue stack to the green stack, usually by means of a DNS update. This provides confidence in production updates: if there is any issue with the new deployment that testing in non-production did not identify, the original ‘blue’ deployment can be ‘re-released’, returning the system to its pre-update state. This methodology is particularly powerful on hyperscale cloud platforms, where organisations are not subject to the resource constraints of a traditional datacentre and can leverage the per-second billing often offered by cloud providers.
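A minimal sketch of the DNS-based ‘release’ step follows, assuming Route 53 and boto3; the hosted zone ID and hostnames are illustrative. The cutover is a single UPSERT, and rollback is the same call pointing back at the blue stack’s endpoint:

```python
# Blue/green cutover via a Route 53 record update.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0EXAMPLE"   # hypothetical hosted zone
RECORD_NAME = "app.example.com."

def release(target_endpoint: str):
    """Point production DNS at the given stack's endpoint
    (green to release, blue to roll back)."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"Blue/green cutover to {target_endpoint}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "TTL": 60,  # a short TTL keeps rollback fast
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }],
        },
    )

release("green-stack-lb.example.com")   # cut over to the new deployment
# release("blue-stack-lb.example.com")  # roll back if an issue is found
```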
Discourage premature multi-clouding
Whilst it is common for large organisations to desire a move to multi-cloud – usually for additional resiliency, cost arbitrage or access to the differing offerings of multiple vendors – it is important to remediate any existing deployment issues and refine operational processes before attempting to adopt more than one cloud provider. The decision to begin a migration to even one cloud is often a momentous one and will represent a large multi-year investment of time and resources in building out capability on the new platform. Additionally, delivering the masthead application (see above) will identify pain points for your organisation during the cloud migration. By attempting to go multi-cloud before maturity in at least one cloud provider is achieved, the organisation risks introducing too much change into its systems and processes, which can result in unrealised capability, potential security risks (via a misunderstood or unrealised attack surface) or bill shock via runaway costs, to name a few challenges. Limiting cloud adoption to a single provider initially keeps the amount of change introduced to the organisation manageable whilst systems and processes are updated to address the new challenges presented by a cloud platform.
Summary
We hope that the tips above have provided some guidance on implementing DevOps in a large organisation. As mentioned above, this is by no means an exhaustive list, and some items discussed may not apply to your organisation’s situation. If any of the topics discussed above interest you and you’d like to know more, either as an organisation or a DevOps professional, please reach out to us either via LinkedIn or via email at enquiries@blue-sandbox.com.
We presented this talk at the Toronto and Vancouver DevOpsDays 2018 to share some of our experiences in helping large organisations adopt DevOps practices. We’re also interested in what some of your key learnings have been in moving organisations (of all sizes) towards DevOps practices, or the cloud generally. If you’d like to share your thoughts with us and the community, feel free to visit our LinkedIn profile https://www.linkedin.com/company/sourced-group/ and leave your thoughts as a comment on the post. To see the full video visit: https://www.devopsdays.org/events/2018-toronto/program/shawn-sterling/
Nan is a Senior Consultant with Sourced Group, who began his career in hardware and software engineering. With over 12 years of experience, he specialises in architecting public cloud solutions for highly-regulated financial institutions.