
All Day DevOps 2020 notes

:monkey_face: ToC:

General

Structure:

  • 6 tracks (CI/CD, Cultural transformation, DevSecOps, Government, Modern Infrastructure, and SRE)
  • Each track consists of 26 sessions of 30min each.
  • 6 keynotes of 50min each.

Recorded sessions

Link to all sessions: https://content.sonatype.com/l/addo-2020-on-demand

Sessions

Ask me anything keynote: CI/CD

(Cloudbees + HSBC engineers)

  • CI/CD: Automation is about removing friction. The possibility to go through a process as fast as possible.
  • Business perspective: Cultural shift. Continuous everything, we want to do our best.
  • Going faster means breaking less, smaller releases. The business is impacted by going faster.
  • Regulatory impact. Controls are codified and deploys together with code. Engineers don't think about it, it just happens.
  • Complete views of what is met and what is not met automatically all the time.
  • Goal centric approach over tools centric approach.
  • Principles for CD: Break work down into small pieces. Release + monitor small parts + automated roll-back + gradual roll-outs.
  • How do you think about releasing code?
  • How to convince upper management? Customer value stream + reduce turmoil.
  • Champions are required to drive progress.
  • Set a goal for yourself. Start small. Measure everything. You have to do it. Build community around it. Make it public (advertise it).

:+1: Are we really moving faster? How visualizing flow changed the way we work

(Roman Pickl, Elektrobit)

  • Are we really moving faster? AFTER the journey to K8s with monitoring.
  • Problem: Delivering value
  • Impediments: Long dev cycles, high workloads, changing priorities, decreasing morale
  • Fast physical feedback and visibility of problems
  • Clear feedback signals.
  • Ambient awareness: Extreme feedback devices (physical feedback devices in highly visible locations)
  • Provoke conversation, show nothing to hide. Traces back to your production system.
  • Weekly status meeting regarding Operations.
  • "We shape our buildings, and afterwards our building shape us".
  • Dashboard metrics: Open pull requests, open support requests, tickets in progress, tickets done but not released.
  • Confluence pages do not provide value until developers see them.
  • High variability in design domain, low variability in delivery domain
  • Flow framework: Flow load (items being worked on), Flow time (time through the value stream), Flow efficiency (share actively worked on), Flow velocity (number done), Flow distribution (allocation in a specific state) -> Flow board + happiness + quality :-) (see the sketch after this list)
  • Separate Features vs bugfix.
  • Local optimization and the urgency paradox (measure the full value-stream, not only the development section where things are pushed to go fast).
  • Value vs non-value-adding activities. Goal: eliminate or optimize. Use the value-stream mapping technique.
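
A minimal sketch of the flow metrics mentioned in the flow-framework item above, assuming work items carry created/done timestamps and a total of actively-worked time (the item structure and numbers are illustrative, not from the talk):

```python
from datetime import datetime, timedelta

# Hypothetical work items: timestamps and active-work totals are invented.
items = [
    {"created": datetime(2020, 11, 1), "done": datetime(2020, 11, 5),
     "active": timedelta(hours=6)},   # time actually spent working on the item
    {"created": datetime(2020, 11, 2), "done": datetime(2020, 11, 4),
     "active": timedelta(hours=10)},
]

def flow_time(item):
    """Elapsed time through the value stream, from creation to done."""
    return item["done"] - item["created"]

def flow_efficiency(item):
    """Share of flow time the item was actively worked on."""
    return item["active"] / flow_time(item)

for it in items:
    print(flow_time(it), f"{flow_efficiency(it):.0%}")

# Flow velocity over a period: number of items finished in that period.
velocity = sum(1 for it in items if it["done"] >= datetime(2020, 11, 1))
print("flow velocity:", velocity)
```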

:-1: DevOps in a highly regulated environment - Aiming high like a Unicorn in skates

(Marc Cluet, Ukon Cherry Ltd)

  • Governance, risk and control: Approval process crafted to ensure compliance.
  • All data tightly classified and controlled. Long process to ensure we don't do anything silly (design/technical/regulatory forums).
  • Challenges: No confidence in Agile/DevOps. Upskilling required. Maintain levels of control. Increase speed of innovation.
  • We still want to reap benefits of cloud.
  • DevOps is about people, process and tools. Each impacts/creates tension with the others.
  • People: Experts help accelerate.
  • Change through metrics: DORA.
  • Create CCoE
  • Agile governance: Controls are business requirements, need to ensure we accelerate controls.
  • Continuous governance: Controls as code, governance as code, policy as code.
  • Tooling: Open Policy Agent, HashiCorp Sentinel (see the policy-as-code sketch after this list).
  • Shift left security: Zero trust security in cloud, code security checks in pipeline, dependency checks in pipeline, external dependencies scanned.
  • Platform security: Platform releases at intervals, pentest of any new platform release, everybody required to be on the new platform, anything living for more than 3 months is killed and refreshed.
  • Tools: Everything must be in CI. Deployments are also auditable and need to align with governance.
  • Tools: Everything is an API. Follow twelve-factor app rules. RESTful interfaces always secured. Short-lived sessions.
  • It's possible for big/regulated organisations to transform to Agile + DevOps
  • It takes time. Focus on people first, then process + tools. Use technology as a catalyst for change. Upskill people for happiness and retention.
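
A minimal policy-as-code sketch, assuming an Open Policy Agent server is already running locally with a package named `deploy` loaded; the package name, rule name, and input fields below are illustrative, not from the talk:

```python
import json
import urllib.request

# Illustrative input document describing a proposed deployment.
deployment = {"image_scanned": True, "approved_by": "peer-review", "env": "prod"}

# Query a hypothetical "deploy/allow" rule on a locally running OPA server.
req = urllib.request.Request(
    "http://localhost:8181/v1/data/deploy/allow",
    data=json.dumps({"input": deployment}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    allowed = json.load(resp).get("result", False)

if not allowed:
    raise SystemExit("policy check failed: deployment blocked")
print("policy check passed")
```

Run as a pipeline step, this makes the control ("controls as code") a gate rather than a manual approval.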

:-1: Bullet-Proof Coding : Adaptive Collaboration for Resilience

(Anton Weiss, Otomato Software)

  • Solitary coding != collaborative coding.
  • Collaboration = communication + coordination + alignment.
  • Collaboration is inversely proportional to scale.
  • Promise theory: publish intentions in form of promises.
  • Promises != executive commands.
  • Adapt to a rate of unkept promises.
  • Resilience: "how a system bends instead of breaking."
  • Careless engineering helps to build resilient systems.
  • Build resilient teams (teams that can adapt to new situations): Diversity, sharing.
  • Efficient != resilient.
  • Dialogue: Different form of conversation. No judgement or regulation.
  • Empathy: Understand other teams. Shadowing.
  • Championship. Champions promote DevOps.
  • Manage stress.

:+1: Site Reliability Engineering: Anti-patterns in Everyday Life and What They Teach Us

(Jennifer Petoff, Google Ireland)

  • SRE patterns:
  • Users should never notice outage before you
  • Eliminate classes of errors rather than point fixes
  • Don't feed machines with human toil
  • Failure is an opportunity to improve
  • Coffee machine example:
  • Set and meet SLOs aligned with happiness.
  • Don't overload -> cascading failure.
  • Avoid Ops overload -> don't overload people.
  • Actively plan and manage serving capacity
  • Hotel lock example:
  • Incident management framework:
  • Assess impact, mitigate and verify, debug, short-term fix, underlying causes, long-term fix, resolved.
  • Impact - High risk - Low risk - Resolved.
  • "Human" errors are really systems problems (can't fix people, but can fix systems to make people do better).
  • Lack of ownership -- empower teams to mitigate and resolve issues.
  • Case for blamelessness:
  • Failure happens
  • Embrace failure to improve MTTD and MTTR
  • Proactively address failure -- more robust systems
  • Takeaways:
  • Reap a return on the unplanned investment by learning
  • It's never human error

:+1: The Past, Present, and Future of Cloud Native API Gateways

(Daniel Bryant, Ambassador Labs)

  • Edge: Boundary between data center and your users.
  • Thesis: Evolution of edge has been driven by app architecture and technology.
  • History:
  • 90s: Hardware LB: HA/Scalability.
  • 00s: Nginx/HAProxy: HA/scalability (lower cost).
  • 05s: Ajax -- Application Delivery Controller (ADC). HA + application acceleration. SSL offload, caching, compression, LB.
  • 10s: API gateway 1st gen. Expose APIs ("api management"), L7 routing, publishing, dev portal, analytics, monetization.
  • 15s: API gateway 2nd gen. ADC, API Gateway, Monolith, Mini-services: Centralize cross-cutting app concerns, Authentication, rate limiting, monitoring, routing.
  • Building cloud-native applications: Microservices, built/released/scaled independently.
  • App Architecture: Spectrum of services.
  • Different locations: K8s/VMs/FaaS
  • Different protocols: gRPC, HTTP, WebSockets, TCP
  • Different load-balancing: sticky sessions, round robin.
  • Different authentication requirements
  • Cloud gateway needs:
  • Need API gateway management stuff.
  • Need ADC-like traffic management
  • Real time service discovery
  • Microservices lead to even bigger change:
  • "You build it, you run it." (from code to production)
  • Workflow: Full cycle development
  • Stack self-service tooling, all backed by specialists:
  • Build tools
  • Deployment pipelines
  • Metrics and alerts
  • Insights tools
  • Dev teams use this "platform" self services to optimize their workflow.
  • Thesis: The future evolution of the edge will be driven by application architecture, technology, and workflow.
  • Two biggest challenges with k8s and edge:
  • Scaling edge management - more things at the edge, so changes to the edge become a bottleneck.
  • Supporting diverse edge requirements.
  • Strategies for Edge with Kubernetes
  • Deploy an additional Kubernetes API gateway below the existing gateway.
  • Extend the existing gateway (K8s operators integrate)
  • Deploy an in-cluster edge stack.

:+1: CI and CD for Documentation

(Olivier Jacques & Laurent Gil, DXC Technology)

  • The quest for great documentation.
  • Widely used: Microsoft, GitHub, GitLab, AWS.
  • fast, secure, cheap (static sites)
  • easier to contribute (git flow)
  • battle tested, engineered (automated tests, different outputs - sites/pdf/whatever)
  • monitoring (analytics)
  • challenges: diagrams (what changes?), broken links, publishing, DRY, common voice, common look/feel.
  • CI: Spell checking, approved acronyms, one voice, dead links.
  • CD: automate publishing
  • CI/CD tools for documentation:
  • Authoring: Markdown
  • Editors: Intellij/VSCode/Eclipse/etc
  • Markdownlint, Draw.io (drawings), PlantUML (diagrams as code), Marp (slides)
  • Github Codespaces or GitPod -- edit directly from browser, make it easy for tech editors (no local install), shared extensions.
  • Pick a tool:
  • Jekyll -- not good experience (-)
  • Hugo -- powerful, blazing fast (/)
  • Marp -- slides as code in markdown
  • MkDocs + material theme (/)
  • Orchestrating:
  • GitHub Actions, GitLab CI, Jenkins, AWS Code Pipeline, AzureDevops
  • Linter: github super-linter, markdownlint (vscode plugin)
  • Spell-checker: spellcheck-github-actions, spellcheck-cli, vscode code-spell-checker extension
  • Link checker: markdown-link-check (see the sketch after this list)
  • Style/voice: Vale
  • Hosting: Github pages, Gitlab pages, Netlify, S3 bucket
  • Making it easy: Github templates, Gitlab project templates
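
To give a flavour of the CI checks above, here is a minimal dead-link checker for Markdown files; a sketch only, real setups would use markdown-link-check or similar:

```python
import pathlib
import re
import urllib.request

LINK_RE = re.compile(r"\[[^\]]*\]\((https?://[^)\s]+)\)")  # matches [text](http...)

def broken_links(path):
    """Return (url, error) pairs for links in one Markdown file that fail a HEAD request."""
    broken = []
    for url in LINK_RE.findall(path.read_text(encoding="utf-8")):
        try:
            req = urllib.request.Request(url, method="HEAD")
            urllib.request.urlopen(req, timeout=10)
        except Exception as exc:
            broken.append((url, exc))
    return broken

# Check every Markdown file under the current directory.
for md in pathlib.Path(".").rglob("*.md"):
    for url, exc in broken_links(md):
        print(f"{md}: broken link {url} ({exc})")
```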

:+1: Your own Kubernetes Operator: Not Only in Go

(Nicolas Frรคnkel, Hazelcast)

  • Controllers: Control loops that modify state.
  • Deployment controller, Job controller, etc.
  • Out-of-the-box controllers are in the control plane.
  • Custom controllers can run anywhere (inside or outside cluster).
  • Operator: Application specific controller.
  • Extends kubernetes API with CRD, and controls the CRD.
  • Operator is controller that manages a CRD.
  • Talks to the API server, REST-based (see the watch-loop sketch after this list)
  • Mostly written in Go (K8s is written in Go, there is the operator-sdk in Go)
  • Using fabric8 kubernetes-client + GraalVM to make a small image.
(the talk was missing a discussion of dealing with CRDs in the fabric8 kubernetes-client)
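
The talk used Java (fabric8 + GraalVM); as a language-neutral illustration of the same watch/reconcile idea, here is a minimal loop over a custom resource using the official Kubernetes Python client. The group/version/plural values are placeholders for your own CRD:

```python
from kubernetes import client, config, watch

# Load credentials from ~/.kube/config (use config.load_incluster_config() inside a pod).
config.load_kube_config()
api = client.CustomObjectsApi()

# Placeholder CRD coordinates -- replace with your own group/version/plural.
GROUP, VERSION, PLURAL = "example.com", "v1", "widgets"

# The controller pattern: observe events for the custom resource and reconcile.
for event in watch.Watch().stream(
    api.list_cluster_custom_object, GROUP, VERSION, PLURAL
):
    obj = event["object"]
    name = obj["metadata"]["name"]
    spec = obj.get("spec", {})
    # Reconcile: compare desired state (spec) with actual state and act on the diff.
    print(f"{event['type']} {name}: desired spec = {spec}")
```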

Serverless databases: the good, the bad, and the ugly

(Renato Losio, Funambol)

  • Aurora database, MySQL "automatic" scaling
  • boto3 (see the scaling sketch after this list)
  • Elastic scaling databases
  • Pay for what you need (bad queries -> you pay more until you fix the issue)
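
A minimal boto3 sketch of the elastic-scaling knob, assuming an existing Aurora Serverless v1 cluster named `my-cluster` (a placeholder) and AWS credentials configured in the environment:

```python
import boto3

rds = boto3.client("rds")

# Adjust the elastic scaling range of a hypothetical Aurora Serverless v1 cluster.
rds.modify_db_cluster(
    DBClusterIdentifier="my-cluster",      # placeholder cluster name
    ScalingConfiguration={
        "MinCapacity": 1,                  # smallest capacity unit the cluster may scale to
        "MaxCapacity": 8,                  # cap the bill even when bad queries push load up
        "AutoPause": True,                 # pause when idle...
        "SecondsUntilAutoPause": 300,      # ...after 5 minutes without load
    },
)
```

Capping MaxCapacity is one way to bound the "bad queries -> you pay more" failure mode until the query is fixed.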

:+1: Service Mesh Past, Present, and Future with Envoy Proxy and WebAssembly

(Idit Levine, Solo.io)

  • From monolith to microservices -> the network is the queen of the cloud.
  • Challenges:
  • Routing: Make A talk to B (traffic control, resilience)
  • Security: Make A talk to B securely (root cert, mTLS, policy)
  • Observability: A talks to B1, B2, B3, you need to be able to observe it (metrics, logs)
  • From: library / application specific routing/security
  • To: Network abstraction: Service mesh: Separate operation code from business logic in separate layers.
  • How: Sidecars next to your service set up the routing, security, and observability (it's a proxy: Envoy)
  • Control plane (envoy config)
  • Envoy (data plane)
  • Service
  • Service mesh challenges:
  • Hard to tell which is best
  • Complicated on-boarding
  • Different APIs
  • Contenders: Linkerd, NGINX, Consul, Istio, Kuma, Open Service Mesh, Mæsh
  • Some are built on top of Envoy, some are not.
  • Supergloo: API on top of existing meshes.
  • KISS: Source -> Policy rule -> Destination
  • Standardized: Service mesh interface (SMI)

:+1: I Have an SLO. Now What?

(Alex Hidalgo)

  • What to do with all the data we have (logs, metrics, traces)? Set up SLOs
  • The reliability stack:
  • SLIs (service level indicator): Measurements from user perspective (measure what customers actually need from us)
  • SLOs: Target percentages for SLIs (how often do they show correct, pre-done math)
  • 100% isn't reasonable, it's too expensive to even try
  • Error budgets: Measurements of SLOs over time (see the sketch after this list)
  • Measure service reliability!
  • Service has one job: Do what the user needs it to do.
  • Using error budgets (this is outdated!):
  • Error budget surplus -- ship features
  • Error budget exceeded -- stop, fix reliability
  • What to do with your SLO data:
  • Maintain balance between shipping and fixing reliability
  • Determine the focus of project work (reliability improvements are features; not everyone owns the code they run) -- measuring a better SLI is project work, and so is picking better SLO thresholds. Examine measurements often. They could be wrong!
  • Are users happy? Align error budgets to user happiness.
  • Examine your risk factors. Determine biggest risks according to burning error budgets. Figure out when you're unreliable (proactively improve reliability).
  • Justify experimentation / chaos engineering. When do you start burning from error chart? Use error budget to experiment.
  • Schedule load tests, stress tests, and blackholes. Where on the curve does stuff break?
  • Just turn stuff off. Burn error budget to get a feeling of what fails when components are down.
  • Do nothing. Nothing at all. SLOs are data, not mandate (you don't need to do anything)
  • Report your service reliability in a more meaningful way. (error budgets over tickets and mttX)
  • Have better conversations, leading to better decisions(!)
  • Everything is about humans and data driven decisions
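
A minimal sketch of the reliability stack in numbers, assuming the SLI is the share of "good" requests over a window; all figures below are invented for illustration:

```python
# Hypothetical request counts for a 30-day window.
total_requests = 1_000_000
good_requests = 997_500          # e.g. responses below the latency threshold

slo_target = 0.999               # SLO: 99.9% of requests should be "good"

sli = good_requests / total_requests                 # measured service level
budget_total = (1 - slo_target) * total_requests     # requests allowed to be "bad"
budget_spent = total_requests - good_requests
budget_left = budget_total - budget_spent

print(f"SLI: {sli:.4%}")
print(f"error budget: {budget_spent:.0f} of {budget_total:.0f} bad requests used")
# The simple "surplus -> ship, exceeded -> fix" rule is what the talk calls outdated,
# but it shows the arithmetic behind the error budget.
print("surplus -> ship features" if budget_left > 0 else "exceeded -> fix reliability")
```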

Fast & Simple: Observing Code & Infra Deployments At Honeycomb

(Liz Fong-Jones, honeycomb.io)

  • How to ship at honeycomb:
  • Instrument as we code
  • Functional and visual testing
  • Design for feature flag deployment
  • Automated integration
  • Human PR review
  • Green button merge
  • Auto-updates, rollbacks and pins
  • Observe behaviour in production
  • Prod: customers observe data
  • Dogfood: observes prod
  • SLO adoption: page views
  • Kibble observes dogfood
  • That's how 12 engineers deploy 12 times/day
  • Tips:
  • Start with lead-time
  • Deploy frequency goes up
  • Change fail rate goes down
  • time to restore goes down
  • 0.1% fail-rate
  • What about infrastructure?
  • Infrastructure empowers products
  • Kubernetes isn't the goal: reliability and simplicity are
  • Everyone starts somewhere
  • Automate painful parts
  • Keep environments clean
  • Repeatable infrastructure with code
  • Centralize state and locking
  • Diff and release in browser
  • Remote run from git
  • Deploy changes incrementally
  • Feature flags for infra!
  • Ephemeral fleet and autoscaling
  • Quarantine bad traffic
  • Scaling challenges
  • Team size has doubled
  • Commit frequency is up
  • Consumer traffic is pouring in
  • Bottlenecks
  • Releases not granular enough
  • Manual rollbacks are too slow to maintain SLOs
  • SLOs:
  • 99% Store incoming telemetry
  • 99% Default
  • Sleeping easily at night is the top priority in the company

:-1: Continuous Delivery for Machine Learning

(Adarsh Shah, Consultant)

  • ML code is a small part of the machine learning application system (feature extraction, analysis, etc.)
  • Data management - experimentation - Production
  • Challenges unique to ML:
  • Data management
  • Experimentation (lots of CPU for training models)
  • Production deployment (offline/online prediction, monitoring/alerting)
  • Solutions
  • Data management: Data pipelines (automated), versioned datasets (see the sketch after this list)
  • Experimentation: CI (training code)
  • Production deployment: CI (application code), smoke tests
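
A minimal sketch of dataset versioning, assuming datasets are plain files and a content hash is enough to identify a version; real setups would use a tool like DVC, and the file path below is a placeholder:

```python
import hashlib
import pathlib

def dataset_version(path) -> str:
    """Content-address a dataset file so a training run can record exactly
    which data it used."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]

# Hypothetical dataset file; store the version alongside the trained model artifact.
data_file = pathlib.Path("data/train.csv")
if data_file.exists():
    print("dataset version:", dataset_version(data_file))
```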

:-1: Six Categories of Monitoring in the DevOps Pipeline

(Hasan Yasar, CMU)

  • metrics logs reports -> data
  • devops metrics
  • monitoring -- architecture, metrics to dashboard,
  • A log is a specific event. A metric is a measurement at a point in time.
  • Pipelines (flow) generates a lot of data.
  • We want to quantify impact of investment in software delivery
  • Devops metrics pyramid
  • Guidelines:
  • avoid relying on single metric
  • look for trends, outliers and level shifts -- not only averages
  • when to measure:
  • bug report submitted
  • change request submitted
  • code commit
  • build progress
  • test results
  • deployment activities
  • operation failure and recovery
  • application usage and latency
  • metrics categories
  • productivity (deployment frequency; see the metrics sketch after this list)
  • reliability (mttr, mttd)
  • quality (failed deployments, number of tickets)
  • security (change req)
  • operations
  • monitoring: collect, interpret, and make action from information gathered
  • types of monitoring: development (devops process), usability, performance, security, business (KPI), functional (use-cases: how are capabilities working)
  • development: build failures, testing failures, issue monitoring
  • operations: outage monitoring, resource usage
  • security: vuln monitoring
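
A minimal sketch of two of the categories above (deployment frequency and MTTR), assuming deployments and incidents are available as timestamped records; the data below is invented:

```python
from datetime import datetime, timedelta

# Hypothetical events pulled from the pipeline and the incident tracker.
deployments = [datetime(2020, 11, d) for d in (2, 3, 3, 5, 9, 10)]
incidents = [  # (detected, recovered)
    (datetime(2020, 11, 3, 10, 0), datetime(2020, 11, 3, 11, 30)),
    (datetime(2020, 11, 9, 22, 0), datetime(2020, 11, 10, 0, 0)),
]

window = timedelta(days=14)
deploy_freq = len(deployments) / (window.days / 7)   # deployments per week
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

print(f"deployment frequency: {deploy_freq:.1f}/week")
print(f"MTTR: {mttr}")
# Per the guidelines above: look at trends and outliers over time, not a single average.
```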

:+1: How to Scale Your Company's Security

(Clint Gibler, r2c)

  • Distilled insights from surveying lots of companies
  • Mindsets and principles
  • automate as much as possible
  • guardrails not gatekeepers (minimize no's)
  • prefer high-signal low-noise tools and alerting (better to miss something than drown)
  • developers are your customers (build useful features, telemetry logging, etc)
  • self-service security (tools and services that can be used without security team interaction)
  • Choose how to invest time: now, short-term, long-term (yes)
  • Focus: Is X better use of time than Y?
  • Fundamentals: vulnerability management (know the current state, minimize friction for devs, track in the same system as your usual workflow -- Jira), continuous scanning.
  • data driven bug bounty
  • open vulnerabilities by prio over time
  • open vulns by subteam and priority
  • bug bounty cost per vuln class (need bug bounty program)
  • cont scanning:
  • scan new code with static and dynamic tools
  • dracon
  • scan on PRs
  • show tool findings within PR
  • capture metrics
  • go for linting/AST
  • asset inventory
  • what do you own and how do they connect?
  • use the asset inventory to answer questions (which servers are exposed through internet?)
  • use the asset inventory to shorten the vulnerability window by making it easier to find who has to fix things.
  • security engineering: implement libraries/wrappers to take care of classes of security problems.
(e.g. library to call SQL which takes care of SQL injections)
  • Automating least privilege
  • Enforce invariants: alert on things that should never be true (e.g. auto-shutdown if we can access the internet, or alert if we can access everything on the internet); see the sketch after this list
  • what should always or never be true in your environment
  • which can we programmatically alert on
  • and automatically take action on?
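
A minimal sketch of an invariant check using boto3, assuming the invariant is "no security group allows SSH from 0.0.0.0/0" (the invariant itself is an example, not from the talk); alerting/remediation would be wired up elsewhere:

```python
import boto3

ec2 = boto3.client("ec2")

# Invariant: no security group should allow SSH (port 22) from the whole internet.
violations = []
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in sg.get("IpPermissions", []):
        if rule.get("FromPort") == 22 and any(
            r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])
        ):
            violations.append(sg["GroupId"])

if violations:
    # In a real setup: page someone, or automatically revoke the offending rule.
    print("invariant violated by security groups:", violations)
```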

:+1: Sensory Friendly Monitoring: Keeping the Noise Down

(Quintessence Anx, PagerDuty)

  • Lots of notifications, we feel overloaded. In own life, put headphones on.
  • Buries important/high-sev alerts in a sea of low-prio notifications -- the people who need to be notified won't be.
  • Let's find a happy medium for high sev alerts.
  • Consider: the cost of noise -- your brain on alerts.
  • Interruptions burst the "bubble" of the problem you were working on -- a simple interrupting question costs ~25 min (studies show)
  • Costs quality.
  • It's not possible to compensate for the interruption, and the cost does not depend on the problem's "complexity"
  • Be aware, not overwhelmed:
  • Determine the sources of noise
  • Categorize the types of noise.
  • Channel noise into productive flow
  • Create a routine to clear clutter
  • Sources of noise: email, tickets, slack, PRs, humans, etc. (you're the source of your noise)
  • How often do you check mail, etc....?
  • Communication boundaries:
  • Setting time to focus, including friends and family (e.g. create focus-time for yourself)
  • Define "relevant emergency"
  • Set reasonable expectations for yourself and others.
  • What about external sources:
  • Categorize your noise (severity, false positives, false negatives, fragility, frequency). Fix it!
  • Noise flow:
  • What needs to be known
  • who needs to know it
  • how soon should they know it
  • how should they be notified
  • Be specific in alerting, do not train people to ignore notifications
  • Re-evaluate redundancy (e.g. slack is down, can we dynamically re-route the notifications to email/etc)
  • Resilient noise builds trust. Silence is artificial (why are there no notifications?)
  • How reliable are your services? How much notification duplication is needed?
  • Keep alerts relevant: For every alert triggered ask:
  • Was it needed? (if not, delete it)
  • How was the incident resolved?
  • Can the solution be automated? (switch what is triggering the alert to the automation?)
  • Is the solution permanent? (condition does not exist anymore, delete it)
  • How urgently was a solution needed? (adjust if too severe or not severe enough)
  • Rules of engagement (see the routing sketch after this list):
  • When to ping
  • When to call
  • Response-SLAs depending on media (email, etc)
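
A minimal sketch of routing noise by severity and media, in the spirit of the rules of engagement above; all channel names and response expectations are illustrative:

```python
from dataclasses import dataclass

# Illustrative rules of engagement: medium and expected response time per severity.
ROUTING = {
    "high":   ("phone call", "15 minutes"),
    "medium": ("chat ping",  "4 hours"),
    "low":    ("email",      "next business day"),
}

@dataclass
class Alert:
    source: str
    severity: str   # "high" | "medium" | "low"
    message: str

def route(alert: Alert) -> str:
    """Pick the medium and response SLA for an alert instead of blasting every channel."""
    medium, sla = ROUTING.get(alert.severity, ("email", "best effort"))
    return f"[{alert.severity}] {alert.source}: {alert.message} -> {medium} (respond within {sla})"

print(route(Alert("checkout-service", "high", "error rate above SLO threshold")))
print(route(Alert("build-bot", "low", "nightly docs build warning")))
```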

:+1: SLOs: You're missing the point!

(Jason Yee, Gremlin)

  • Execs say companies depend on innovation; most innovations fail; how do we innovate successfully?
  • SLAs: Service level agreements (contracts, guarantees)
  • SLOs: Service level objectives (tighter threshold so you don't run into breaking SLAs)
  • SLIs: Service level indicators (what signals can we look for, we're functioning but not performant)
  • Example:
  • SLA: 90% web requests latency <500ms over a month or customer gets money back.
  • SLO: 95% web requests latency <500ms over a month
  • SLI: 95% web requests latency <500ms over a month (what we monitor for)
  • Missing the point if only focus on reliability -- it's about error budgets!
  • Error budgets:
  • Embracing and managing risk
  • what to do with the 5% from the example above (100% perfect - 95% SLO = 5%); see the worked example after this list
  • Innovate!
  • Don't spend all time on reliability. Use Error budget for innovating.
  • How to innovate successfully?
  • -Google: 20% rule (100% + 20% = 120%)-
  • rather: How to successfully experiment?
  • Do the least amount of work, to get the most amount of information
  • Keys to successful experimentation:
  • validate assumptions (hypothesis)
  • learning/reducing unknowns
  • improving what exists
  • failing frugally
  • How to spend your error budget frugally:
  • Don't spend it on unplanned incidents (improve mttxs, ensure good monitoring and alerting, latency, errors, traffic, saturation. practice incident response, update documentation and runbooks)
  • Practice chaos engineering (learn about weaknesses in the systems, validate our model of how the system works, prepare for real incidents)
  • Implement feature flags (expose new features/experiments with a select group of users, limit blast radius of failures, experiment just enough to get usable data)
  • Game days: Focused chaos engineering days. Red/blue team in security.
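
The 5% budget from the example above, turned into concrete downtime. A rough worked example only: it ignores partial degradation and traffic weighting.

```python
from datetime import timedelta

slo = 0.95                        # 95% of web requests under 500 ms over a month
month = timedelta(days=30)

error_budget = (1 - slo) * month  # time-equivalent budget for the month
print(error_budget)               # 1 day, 12:00:00 -> roughly 36 hours

# Spend it deliberately: chaos experiments, load tests, feature roll-outs --
# not on unplanned incidents.
```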

:+1: Towards Continuous Auditing

(Dan Petit, ServiceMaster)

  • How audits work: Upfront, revisited periodically. Define controls, processes, policies, procedures etc in the organisation.
  • Key area: Change management where traditional practices are impacted substantially by CI/CD.
  • Traditional audits:
  • Manual processes, time consuming, expensive.
  • After-the-fact. Findings are far away from when things happened. (Shift left: how could we find it closer to when it happened?)
  • Depends on attestations (opinions) from people as evidence rather than verifiable facts.
  • Rely on evidence sampling, which is incomplete.
  • In typical change management, approvers are far removed from the actual work (more risk means more uninformed people signing off; the problem is that people who add no value stand in the way of getting things into production).
  • Does not align very well with devops culture (friction leads to slow-down/hindrance, which leads to ignoring/challenging processes).
  • Autonomy: People are more decoupled/decentralized. Governance is harder.
  • Pace: Processes designed for things that change yearly/monthly are not suitable when changes happen all the time.
  • Automation: Many controls (and processes and procedures) are designed to manage people-behaviour. With lots of automation, practices are not appropriate.
  • Documentation: Relying on written documentation/paperwork as evidence is troublesome.
  • Change management: Traditional CM processes slow down things to favor alignment.
  • Towards continuous auditing
  • Drive "compliant behaviour", so systems are already secure and compliant.
  • Want compliance to be a continuous thing.
  • Extensive automation means potential for rich data:
  • PR: Was there an approval? How were the unit tests? Was the code security-scanned, and the images? What JIRA ticket drove the change? (commits, and finally deployments)
  • Data available in real time, and provides detailed view of what happened, rather than attestation that someone said they did it.
  • Pipelines can be configured to allow system changes to be merged (and deployed) only if they pass peer review and all pipeline tests, creating gates in real time that prevent failing an audit (see the sketch after this list).
  • Problem: Teams involved in Devops do not care about audit and compliance -- must engage devops and compliance teams early.
  • Item1: Approach problem differently: Empower those close to the problem to determine the compliance. Making process streamlined.
  • Item2: Leverage consistent, secure, pipelines.
  • Ensure that: All critical systems get a robust pipeline that saves all the required data.
  • Any validations needed for compliance are automatically performed by pipelines: Non-compliant changes cannot pass.
  • Security of pipelines is solid. Must be able to trust data from them without hesitation.
  • Item3: Use peer code review and PRs.
  • Get rid of "management" receipts. They slow down, and are a waste of time. Move seperation of duty to people in peer review.
  • Item4: Make sure data is transparent.
  • Insist that all pipeline output is saved and readily available, even from ephemeral environments.
  • Use data to create rich on-demand reports and dashboards: See everything, not just samples, instantly collect without inconvenience, automate audit checks (actually find and visualize outliers for example).
  • Get complete, timely answers to everything an auditor can ask: What changed? What business requirement drove the change? Did every change have approval? Did every change pass quality gates?
  • High transparency helps build trust.
  • Question legacy procedures/controls. Collaborate with IT/auditors.
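
A minimal sketch of an automated audit gate, assuming changes flow through GitHub pull requests and a token with read access is available in `GITHUB_TOKEN`; the repo name, PR number, and rule are placeholders for illustration:

```python
import json
import os
import urllib.request

OWNER, REPO, PR_NUMBER = "my-org", "my-service", 42   # placeholder values

# Audit question: did this change have at least one peer approval?
url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/reviews"
req = urllib.request.Request(
    url, headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
)
with urllib.request.urlopen(req) as resp:
    reviews = json.load(resp)

approved = any(r["state"] == "APPROVED" for r in reviews)
if not approved:
    raise SystemExit(f"PR #{PR_NUMBER} has no approval -- blocking the deploy gate")
print(f"PR #{PR_NUMBER} approved by peer review -- audit evidence recorded")
```

Saving this output with the pipeline run gives the auditor verifiable facts instead of attestations.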

:+1: The One (Pipeline)

(German Rodriguez, H-E-B)

  • Before:
  • UI-based jobs, plugin sprawl, Jenkins sprawl, global SSH target configurations. Groovy pipelines solving the same problems over and over.
  • Pipeline reconditioning was a primary service.
  • Long painful experience to support teams with their own Jenkins configurations
  • Common job steps:
  • Compile and build (gradle/npm), package (rpm), publish (nexus), deploy (ssh)
  • Everyone doing some variation of the same steps
  • 1st gen:
  • Groovy pipeline template
  • Complementary to existing jobs
  • A set of files to copy into your project
  • Instructions to create pipeline jobs
  • Popular, teams not maintaining own jenkins, faster onboarding (metrics for first deployment)
  • Drifted over time (not easy to distribute updates, teams would make own modifications)
  • Complex initial setup (credentials)
  • 2nd gen:
  • Opinionated (very) central pipeline generator
  • Run a (meta) pipeline to create project pipeline
  • One way to run, all tokens preconfigured
  • k8s targets only
  • very strict, no room for deviation
  • Learnings: Low acceptance (too strict), good learnings on k8s deployments, templates, etc...
  • 3rd gen:
  • Shared Jenkins pipeline script inner sourced.
  • Controlled by parameters and a YAML file in the repository (see the sketch after this list)
  • Multiple languages, packaging and deployment options supported
  • Ability to optionally tag git repos back (from pipeline)
  • pipeline versioned, stable+ beta tags
  • all associated tools already integrated
  • Expansions: Security scans, Slack integration, multiple-step builds, integration with a custom PaaS offering (managed deployment setup: create SA, secrets, etc. in the environment and save in the pipeline)
  • More global view of what's going on because all builds/deployments are done through the same pipeline
  • Centralized instrumentation: dashboards for builds and deployments.
  • Learnings: High acceptance; low drift (the central team owns the code); self-service makes it very easy for teams to set up new pipelines/builds; new build tasks are easy to integrate; teams help by contributing to the pipeline through PRs; flexible but still embeds best practice. (Jenkins problem: "Method too large" -- a pipeline is a single method in Java.) Jenkins node and memory management is challenging with lots of pipelines. The pipeline starts to look like GitLab/CircleCI declarative models.
  • 4th gen:
  • Refactor, simpler codebase (engine model)
  • Port to other CI tools (e.g. Gitlab CI, GithubActions), take advantage of native features.
  • Looking into Waypoint model as common platform.
  • Long term learnings:
  • Shared pipelines reduce time to deployment, reduce tech debt, provide insights.
  • Flexible and transparent creates success
  • Best practice and innovation can be accomplished, while abstracting common concerns.
  • Unknown unknowns are reduced (only one place to look)
  • Cognitive load on teams reduced.
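
A minimal sketch of the "parameters plus a YAML file in the repository" idea from the 3rd generation above, assuming PyYAML is installed; the config schema and step commands are invented for illustration:

```python
import subprocess
import yaml  # PyYAML; the config schema below is invented for illustration

# Example of what a repo-level pipeline.yml might declare.
EXAMPLE_CONFIG = """
language: gradle
steps:
  build:   echo "compile and build"
  package: echo "create package"
  deploy:  echo "deploy to k8s"
"""

config = yaml.safe_load(EXAMPLE_CONFIG)

# The shared "engine": one central place that knows how to run every team's steps.
for name, command in config["steps"].items():
    print(f"--- {name} ({config['language']}) ---")
    subprocess.run(command, shell=True, check=True)
```

The point of the pattern: teams only edit the declarative file, while the engine (and its best practices) is owned centrally.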

How to Build Better Software with the Scientific Method

(Dawn Parzych, LaunchDarkly)

  • How do you think about solving problems?
  • Growth mindset vs fixed mindset
  • The scientific method:
  • Isolate particular process
  • Form a hypothesis
  • Create an experiment
  • Produce repeatable results
  • Share knowledge with others
  • Hypothesis is a prediction that can be tested. Questions lead to a hypothesis.
  • Experiment: Provide learning opportunities.
  • Types of experiments:
  • Test in production (feature toggles)
  • Game days
  • A/B testing
  • Need the right culture - embrace experiments, embrace failure (turn mistakes into ideas and advice)
  • Think: hypothesis proven or not proven.
  • Blameless culture: Think, experiment failed, not person failed. Blame negates innovation.
  • Psychological safety: Show up to work without fear of consequences.
  • Avoid: bias and gaming the system.
  • Biases that can impact: Anchoring bias (initial bias), IKEA effect (similar to not-invented-here syndrome: we over-value things we build ourselves), bias blind spot (we think we're not biased, but everyone is), framing effect (frame in a positive way, constructive feedback).
  • Define success: Single definition of when an experiment is successful.
  • Identify and avoid vanity metrics: A number that always goes up and makes us feel good. Can we influence a metric if it goes up/down? Example: Followers, lines of code, number of incidents closed. McDonalds: Billions of burgers sold (always goes up, we don't know the delta).
  • Good metrics:
  • Customer-focused (what matters to customers)
  • Concrete (specific)
  • Tied to business (what is the impact of this metric to the organisation)
  • Example: Lower mttr, happier customers, business impact.
  • Create culture of learning and experimentation
  • Recognize biases
  • Don't be swayed by vanity metrics
  • Use feature flags to run experiments (see the sketch below)
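
A minimal hand-rolled sketch of a feature-flag-gated experiment; a real setup would use a feature-flag service SDK (the talk is by LaunchDarkly), this only shows the shape. The flag name, rollout percentage, and variants are illustrative. A stable hash buckets users so results are repeatable:

```python
import hashlib

# Hypothesis: the new checkout flow improves conversion. Expose it to 10% of users.
ROLLOUT_PERCENT = 10

def in_experiment(user_id: str, flag: str = "new-checkout") -> bool:
    """Deterministically bucket a user so repeat visits get the same variant."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < ROLLOUT_PERCENT

def checkout(user_id: str) -> str:
    if in_experiment(user_id):
        return "new checkout flow"     # variant under test
    return "existing checkout flow"    # control

for uid in ("alice", "bob", "carol"):
    print(uid, "->", checkout(uid))
# Record which variant each user saw alongside the success metric, then compare
# conversion between groups to prove or disprove the hypothesis.
```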

:-1: Our DevOps Journey is Incomplete without Data

(BMK Lakshminarayanan, Bank of New Zealand)

  • Data management/architecture in devops
  • Problem context:
  • Manual database deployments continue to be software delivery bottleneck.
  • Neglecting data in devops/CD has significant business impact.
  • Continues to slow us down.
  • Yesterday's DBAs: Experts, gatekeepers (dependency when we have deployments/errors)
  • Types of data:
  • Systems of record
  • Systems of intelligence
  • Systems of engagement
  • Recommendations:
  • Understand current state (tools, process, flow, roles)
  • Make work visible (workloads, workloads in pipeline)
  • Value stream management (new features, tech debt, incidents)
  • Upskilling (devops, ci/cd, automation)
  • Investment (education, time, tools, self-service)
  • Role-play (engineer/dba developer days)
  • Outstanding challenges:
  • Automated governance
  • Policies via automation and in pipeline
  • Compliance as code
  • Everyone is custodian of data