Two weeks ago I had the chance to participate in Usenix’s LISA’17 conference in San Francisco, CA. This is the third conference I have attended in the recent past - after SREcon earlier this year and Velocity last year - and I can say it was the best so far, with a great mixture of technical and cultural topics. In addition to these core aspects, it was interesting how different the social program was: instead of a simple reception, the organizers planned many Birds-of-a-Feather sessions as well as other events (e.g. lock picking). As the LISA Build team was not allowed to take care of the WiFi (the hotel insisted on owning that and failed right away), they used their available time to take care of the LISA Lab: an exercise and experimentation realm I sadly could not fully investigate.
The following is my personal summary of the talks I have seen. The ones marked with an asterisk are those I would recommend checking out if you are interested in the respective topic.
Opening Plenary: Security in Automation
Focused on the concepts that automation can bring security and that security can be automated, the two speakers ran through different incarnations of ways to marry these two aspects of systems engineering and explained what the combined results can look like. An interesting entry point to the field and its basic ideas.
The UK NHS defines Never Events as incidents that “arise from [the] failure of strong systemic protective barriers”, i.e. things that already have effective protection. These range from “Retained foreign object post-procedure” to “Wrong site surgery” and provide a framework to evaluate and drive improvements to processes that are already in place.
It is important to highlight two aspects of never events: 1) they do not require that a patient has been (terminally) harmed, only that a set of safeguards has failed, and 2) they require the safeguards to be “strong”, which excludes single-person checks and trusting in simple processes without validation. The entire process is aimed at improving already reliable processes.
An interesting tidbit from the presentation: childbirth runs into issues far more regularly than surgery does, likely related to the lower amount of three-person validation (which is common in surgeries but uncommon for births).
ChatOps at Shopify: Inviting Bots in Our Day-to-Day Operations
Presenting the chat bot used at Shopify to manage systems, deploy software and provide support during incidents. A simple but comprehensive walkthrough of what can be done with chat bots and how one company is using them.
Working with DBA’s in a DevOps World
Many “modern” software teams consist solely of software engineers, but there is still a place for specialized engineering such as DBAs in the context of this talk. Most of the presented material should be applicable to other kinds of specialized engineers such as NEs as well. The core thesis is the idea of leveraging the specialists as early in the SDLC as possible instead of bringing them in late, to avoid mistakes and gain the most from their experience.
Queueing Theory in Practice: Performance Modeling for the Working Engineer*
A simple primer on queueing theory and how it applies to web services and distributed systems. It starts by explaining the modelling of a single thread in a single server, then extends to many parallel threads and closes with a peek at the Universal Scalability Law and the importance of communication overhead.
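The single-server model the talk starts from can be sketched with the classic M/M/1 result. This is my own illustration, not code from the talk; the request rates are made up:

```python
# Sketch of the single-server (M/M/1) queueing model: Poisson arrivals
# at rate lam, exponential service at rate mu. Mean time in the system
# (queueing plus service) is 1 / (mu - lam), which explodes as
# utilization rho = lam / mu approaches 1.

def mm1_mean_latency(lam: float, mu: float) -> float:
    """Mean time a request spends in the system (waiting + service)."""
    if lam >= mu:
        raise ValueError("unstable: arrival rate >= service rate")
    return 1.0 / (mu - lam)

mu = 100.0  # the server handles 100 requests/sec on average
for lam in (50.0, 90.0, 99.0):
    print(f"utilization {lam / mu:.0%}: {mm1_mean_latency(lam, mu) * 1000:.0f} ms")
```

The non-linear blow-up is the key intuition: going from 50% to 99% utilization takes mean latency from 20 ms to 1000 ms, which is why capacity headroom matters so much.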
Scalability is Quantifiable: The Universal Scalability Law*
Continues more or less where the queueing theory talk ended and walks from Amdahl’s Law to the Universal Scalability Law, explaining the trade-offs between coordination and parallelism and how these can be observed and modelled in real systems. Closes with some simple strategies for reducing communication overhead and scaling systems without fundamental re-architecture.
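The law itself is compact enough to sketch. The sigma and kappa values below are illustrative choices of mine, not numbers from the talk:

```python
# Sketch of the Universal Scalability Law: relative throughput at N
# workers, with a contention penalty (sigma, Amdahl-style serialization)
# and a coherency/crosstalk penalty (kappa). With kappa > 0, throughput
# peaks and then *declines* as more workers are added.

def usl_throughput(n: int, sigma: float, kappa: float) -> float:
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

sigma, kappa = 0.05, 0.001  # illustrative values, not from the talk
for n in (1, 8, 32, 128):
    print(f"N={n:4d}: speedup {usl_throughput(n, sigma, kappa):.1f}x")
```

Setting kappa = 0 recovers Amdahl’s Law; with kappa > 0 the peak sits near N = sqrt((1 - sigma) / kappa), roughly 31 workers for these values, after which adding capacity makes the system slower.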
Persistent SRE Antipatterns: Pitfalls on the Road to Creating a Successful SRE Program Like Netflix and Google*
Despite the quite corporate title, this was one of the most entertaining talks of the entire conference, highlighting some very good points around reliability engineering, operational excellence and how to integrate these things into a concrete company culture. Driven by actual conversations the presenters had over the years, they highlight misunderstandings around what these ideas mean.
Disaggregating the Network: Switching as a Service
Facebook presents their Wedge platform, i.e. the in-house-built switches/routers running in their data centers. While conceptually quite interesting, the talk stays at a very high level and does not share any relevant learnings beyond: collaborating across job families is hard and requires good common models.
Charliecloud: Unprivileged Containers for User-Defined Software Stacks in HPC
Modern HPC clusters (such as the ones at Los Alamos discussed in this talk) have a software packaging problem: what is the easiest way to get research software onto the cluster without burdening the systems engineers or the researchers? The talk presents containers as a great packaging format as they allow compilation on the researcher’s workstation and efficient execution on the HPC cluster - a combination achieved by neither classic package formats nor VMs.
Operational Compliance: From Requirements to Reality
How can compliance and security be integrated into DevOps? Easy: by having automated tests for each compliance and security requirement and executing them as part of infrastructure or software integration tests. This way, compliance and security are ensured all the way through without additional overhead, and the process is transparent and understandable to everybody involved.
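As a minimal sketch of the idea, one such requirement could be encoded as an ordinary test. The requirement, config snippet and function name here are hypothetical examples of mine, not from the talk:

```python
# Sketch of compliance-as-test: encode a (hypothetical) requirement --
# "SSH password authentication must be disabled" -- as an automated
# check that can run in CI alongside other integration tests.

def check_password_auth_disabled(sshd_config: str) -> bool:
    """Return True if the config explicitly disables password auth."""
    for line in sshd_config.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("passwordauthentication"):
            return line.split()[-1].lower() == "no"
    return False  # no explicit setting counts as non-compliant here

config = "Port 22\nPasswordAuthentication no  # hardened\n"
assert check_password_auth_disabled(config), "compliance check failed"
```

Run against the real rendered configuration in the integration pipeline, a failing assertion blocks the rollout exactly like any other failing test would.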
Stories from the Trenches of Government Technology*
Even the US government has computers and IT problems - probably not surprising to anyone who has tried to submit their tax forms online. The presentations by two tech leaders from the USDS and the VA show examples of the influence modern software engineering can have on government operations and the outsized impact seemingly simple projects can have on people’s lives - all wrapped up with a pitch to complete a tour with the USDS and help your country!
Linux Container Performance Analysis*
Brendan Gregg gives an overview of debugging and performance-analysis tools and their usage in the context of containerized applications. In particular, he highlights some pitfalls and limitations of existing tools and how they can be avoided by combining measurements on the host with measurements inside the container. While overall simple, the talk gives a nice overview of the available tools and their usage.
The Actor Model and the Queue or “Batch is the New Black”*
The talk starts out with an introduction to the actor model and using queues as mailboxes for actors. It continues by showing different actor archetypes such as supervisors and supply clerks and closes with implementation concerns such as pull vs. push, security and state handling.
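The queue-as-mailbox idea can be sketched in a few lines. This is a minimal illustration of mine, the `Counter` actor and its message names are made up:

```python
# Minimal sketch of the actor model with a queue as mailbox: each actor
# owns a Queue, processes one message at a time on its own thread, and
# communicates with the outside world only via messages.
import queue
import threading

class Actor:
    def __init__(self):
        self.mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:  # poison pill shuts the actor down
                break
            self.receive(msg)

    def receive(self, msg):
        raise NotImplementedError

class Counter(Actor):
    def __init__(self):
        self.count = 0
        self.done = threading.Event()
        super().__init__()  # start the thread only after state exists

    def receive(self, msg):
        if msg == "incr":
            self.count += 1
        elif msg == "report":
            self.done.set()

c = Counter()
for _ in range(3):
    c.send("incr")
c.send("report")
c.done.wait(timeout=5)
print(c.count)  # → 3
```

Because the mailbox is drained by a single thread in FIFO order, the actor’s state never needs locking - which is exactly what makes the model attractive for the batching patterns the talk discusses.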
Coherent Communications—What We Can Learn from Theoretical Physics*
Earlier in 2017, physicists showed that quantum entanglement can be used for communications. To make this and other feats possible, physicists have developed a special communication culture that aims to optimize working efficiency. The presentation tries to capture highlights of this culture and how it can be applied to other fields. One aspect I personally found curious is the combination of separate and joint working phases: it is common that work partners do most of their work separately and only come together for comparatively short amounts of time, but use that time to ensure coherent thinking.
Clarifying Zero Trust: The Model, the Philosophy, the Ethos
Classic system designs employ the model of a secure perimeter dividing the wild world (outside) from the peaceful world (inside). Zero trust is built on the observation that this model becomes highly complex if you are looking to use multiple processing sites (PagerDuty) or have a very large system (Google BeyondCorp). The presentation lays out these points and their background, and explains how one can remove the trust requirements in a data center environment. The talk glances over the need for host identity/attestation; the speakers pointed me in the direction of spiffe.io in this regard.
DevOps in Regulatory Spaces: It’s Only 25% What You Thought It Was…
When working in regulated areas, DevOps has to be extended to tie into qualification and compliance processes to be successful, efficient and compliant. Luckily, this change is not as stark as it might sound at first, and there is a lot DevOps can teach about how to run regulation and compliance more efficiently. The presentation sets the scene for the integration and lays out, at a high level, what these synergies are and how they can be achieved.
Failure Happens: Improving Incident Response in Large-Scale Organizations
DevOps in theory is nice, but too often teams are not empowered to execute and mitigate issues and instead are required to escalate. The presentation walks through an especially horrendous (but real) case and makes a case for how tech organizations can empower their operations teams to be self-sufficient without increasing risk. Though the speaker is clearly arguing why a product like his is a great idea, the concepts and problems behind the argument stand on their own and are partly validated by systems such as Netflix’s Winston.
Sample Your Traffic but Keep the Good Stuff!*
For most companies/systems it is not feasible to build metrics/operations platforms at the same scale as the core product. To avoid this, most often some form of sampling or aggregation is employed. The presentation focuses on the sampling side of the question (as the metrics aggregation part is already well covered elsewhere) and explains concerns and different sampling schemes which can come to the rescue. It closes out with a walkthrough of the sampling employed in a system owned by the presenter and how they use the resulting data.
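One common scheme in this spirit - my own illustration, not necessarily the presenter’s - is to keep every error but only 1-in-N successes, recording a weight with each kept event so that aggregate counts can still be reconstructed:

```python
# Sketch of "keep the good stuff" sampling: errors are always kept,
# successes are kept 1-in-N, and each kept event carries its sampling
# weight so totals can be re-estimated later. Names are illustrative.
import random

SUCCESS_SAMPLE_RATE = 100  # keep 1 in 100 successful requests

def sample(event):
    """Return the event annotated with its weight, or None if dropped."""
    if event.get("status", 200) >= 500:
        return {**event, "sample_weight": 1}  # errors are always kept
    if random.randrange(SUCCESS_SAMPLE_RATE) == 0:
        return {**event, "sample_weight": SUCCESS_SAMPLE_RATE}
    return None  # dropped

events = [{"status": 200}] * 10_000 + [{"status": 503}] * 5
kept = [s for e in events if (s := sample(e)) is not None]
estimated_total = sum(s["sample_weight"] for s in kept)
```

Here `estimated_total` approximates the true 10,005 events while only about 1% of the successes are stored - and every one of the five errors survives, which is exactly the “good stuff” the title refers to.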
Where’s the Kaboom? There Was Supposed to Be an Earth-Shattering Kaboom!*
Is there something software/systems engineering can learn from demolition? Probably yes! Demolition has developed into the natural counter-art to architecture and construction - in the end, without demolition no new buildings can be constructed. The relationship is similar for computer systems: while we (often) don’t need to demolish the old thing first, it still needs to be deprecated to ensure the new thing is the focus of attention. While the metaphors are sometimes a bit sketchy, they still seem fitting and worth considering.
Wait for Us! Evolving On-Call as Your Company Grows
In a small company, most of the technical employees have a good grasp of the entire application the company offers. This simple model fades as the company grows and matures, and it becomes ever more important to partition on-call and other operations duties. The presentation lays out some schemes and tips on how this can be achieved and what lies in the shadows. For example, growth also means it is harder to keep the overall system health and quality in check, and it becomes important to establish operational review procedures. While none of it is rocket science, a good primer on growing systems operations from nascent to mid-sized groups.
System Crash, Plane Crash: Lessons from Commercial Aviation and Other Engineering Fields*
No systems engineering conference is complete without a presentation about plane crashes and comparisons between systems engineering and aviation. This incarnation focuses on what can be learned by looking at aviation, surgery and nuclear systems (which are all of a similar age as computer systems, i.e. dating back to around WW2). Interesting examples include team training (called Crew Resource Management in aviation), communication training and industry-wide lesson sharing à la the NTSB.