SRE / DevOps / Kubernetes Weekly Report Roundup #79 (2021/8/1 to 8/6) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

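The title is "The SPACE of Developer Productivity".
It explains several common myths and misconceptions about developer productivity and proposes the SPACE framework mentioned above.
The SPACE framework aims to help individuals, teams, and organizations identify appropriate metrics that give a holistic picture of productivity, leading to more thoughtful discussions about productivity and to the design of more impactful solutions.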

All projects include some level of testing. But do we agree on what we mean by testing when it comes to software development? Not always, as discussed in this post.

The title is "We need to talk about testing".
It covers several testing topics from the perspective of "how programmers and testers can work together to lead happy and fulfilling lives".

Microservice environments present a novel set of reliability challenges. This post looks at some of those challenges, and some of the best practices to address them.

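The title is "The Unique Reliability Engineering Requirements of Microservices".
It makes the point that "SREs must adapt their practices to special requirements regardless of the type of environment they support" by showing how managing the reliability of microservices-based apps differs from working with a monolith.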

The first two posts in a series on the theory of monitoring, starting with defining terms and then discussing indicators and synthetics.

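A series that explains the theory of monitoring so that readers can broaden their view of the tech monitoring world by examining it "from scratch". The post linked above is titled "Monitoring theory, from scratch — The definitions". Its summary/takeaways are as follows.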
- Monitoring is about efficiently and promptly observing things important to your business
- Good monitoring is monitoring that allows a human to derive insights about your business’ operation, in order to prevent or minimize damage.
- Good tech monitoring is monitoring that allows a human to ensure your business operates correctly, smoothly, efficiently, securely and in compliance while reducing false positives and false negatives to a minimum.
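
The goal of keeping false positives and false negatives to a minimum can be made concrete by comparing alert history against real incidents. The following is a minimal Python sketch under stated assumptions: the function name `alert_quality`, the idea of labelling time windows with IDs such as `w1`, and the sample data are all hypothetical and do not come from the article.

```python
def alert_quality(alerted: set, incidents: set) -> dict:
    """Compare windows where an alert fired with windows where an incident really happened."""
    false_positives = alerted - incidents  # alert fired, nothing was wrong
    false_negatives = incidents - alerted  # something was wrong, no alert fired
    true_positives = alerted & incidents   # alert fired and something was wrong
    # Assumption: with no alerts (or no incidents) the ratio is reported as 1.0.
    precision = len(true_positives) / len(alerted) if alerted else 1.0
    recall = len(true_positives) / len(incidents) if incidents else 1.0
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "precision": precision,
        "recall": recall,
    }


# Hypothetical sample data: IDs of time windows, not real measurements.
alerted_windows = {"w1", "w2", "w5"}
incident_windows = {"w2", "w3", "w5"}
report = alert_quality(alerted_windows, incident_windows)
```

Here `w1` is a false positive (noise for the operator) and `w3` is a false negative (a missed problem). Tracking both over time gives a rough signal of whether monitoring is getting better or worse.
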
The second post is titled "Monitoring theory, from scratch — Indicators and synthetics". Its summary/takeaways are as follows.
- Try to pick indicators that offer the smallest gap between your desired definition of ‘up’ and their ability to report it.
- Document and train operators on such gaps and implementation limits.
- Try to pick indicators that are as discrete as possible and leave as little room as possible for interpretation.
- If you end up scratching your head much while looking at your indicators, revise and re-iterate.
- Remember that indicators aren’t predictive by nature and need to be complemented with other measures/systems.
- Remember that system state changes and you need to make sure your indicators are fresh and react to it. It’s always an on-going process.
- Consider supporting synthetic transactions in your indicator strategy, in particular if you have a complex and distributed end to end system.
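
The last takeaway, synthetic transactions feeding a discrete indicator, can be sketched in a few lines of Python. This is only an illustration under assumptions: `probe` is a hypothetical callable that performs one end-to-end action and raises on failure, and the names `CheckResult` and `run_synthetic_check` and the 0.5-second latency budget are invented for this example rather than taken from the article.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    up: bool          # discrete indicator: up or down, little room for interpretation
    latency_s: float  # how long the synthetic transaction took
    reason: str       # short explanation for the operator


def run_synthetic_check(probe: Callable[[], None],
                        latency_budget_s: float = 0.5) -> CheckResult:
    """Run one synthetic transaction and reduce it to an up/down state."""
    start = time.monotonic()
    try:
        probe()  # hypothetical end-to-end action, e.g. log in and fetch a page
    except Exception as exc:
        return CheckResult(False, time.monotonic() - start, f"probe failed: {exc}")
    elapsed = time.monotonic() - start
    if elapsed > latency_budget_s:
        return CheckResult(False, elapsed, "probe succeeded but exceeded latency budget")
    return CheckResult(True, elapsed, "ok")


def healthy_probe() -> None:
    pass  # stand-in for a real request


def broken_probe() -> None:
    raise ConnectionError("backend unreachable")


if __name__ == "__main__":
    print(run_synthetic_check(healthy_probe))
    print(run_synthetic_check(broken_probe))
```

Treating "too slow" as down is a design choice made here to keep the indicator discrete. A real setup would also need to run the check on a schedule so that the indicator stays fresh.
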

A look at one organisation's tracing infrastructure, discussing various aspects of the implementation from clients to capacity planning and other challenges.

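The title is "Making Tracing as a part of Engineering DNA".
ShareChat's platform team built a solution for understanding performance and response behavior, specifically how 10,000 CPU cores and RAM are used by hundreds of microservices under various conditions.
They share the meaningful insights gained from pursuing detailed observability and monitoring metrics from every resource on ShareChat's network.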

Tools

Authorino is a cloud native AuthN/AuthZ enforcer for Zero Trust API protection built on top of Envoy. It provides a wide range of identity verification methods, policy enforcement options and caching options.

The GitHub page for "Authorino", a cloud-native AuthN/AuthZ enforcer for Zero Trust API protection.

Naml is a configuration management tool for describing Kubernetes configuration in Go. It has a nice tool for converting YAML (including direct from the Kubernetes API) into Go as a way to bootstrap as well.

The GitHub page for "Naml (Not another markup language)".
It lets you replace Kubernetes YAML with Go, so you can write and deploy your apps in Go.

SRE Weekly Issue #281 August 1st, 2021

Articles

Learning from incidents – Formula 1

The incident: a Formula 1 car hit the side barrier just over 20 minutes before the race was about to start. The team sprang into action with an incredibly calm, orderly and speedy incident response to replace the damaged parts faster than they ever have before.

This article is a great analysis, and there’s also an excellent 8-minute video that I highly recommend. Listen to the way the sporting director and everyone else communicates so calmly. It’s a rare treat to get video footage of a production incident like this.

Chris Evans — incident.io

It analyzes and comments on an excellent incident response in F1. As the editor's comment above says, the embedded 8-minute video is very good and I recommend it.

Observe a Service; Not a Server

The underlying components become the cattle, and the services become the new Pet that you tend to with your utmost care.

Piyush Verma — Last9

Using the "pets vs. cattle" analogy, it covers the following topics.
- Establishing Service KPIs
- Features vs. Stability
- Cascading
The illustrations in the article are cute.
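
The idea of observing the service rather than any single server can be sketched as a KPI that is aggregated across interchangeable instances. The following is a minimal Python sketch under assumptions: the per-instance counters, the 99.9% target, and the names `InstanceStats`, `service_availability`, and `meets_target` are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InstanceStats:
    name: str      # individual servers are "cattle": the name is incidental
    requests: int  # requests handled in the measurement window
    errors: int    # requests that failed in the same window


def service_availability(instances: Iterable[InstanceStats]) -> float:
    """Availability of the service as a whole: good requests / all requests."""
    instances = list(instances)
    total = sum(i.requests for i in instances)
    failed = sum(i.errors for i in instances)
    if total == 0:
        return 1.0  # assumption: no traffic is treated as available
    return (total - failed) / total


def meets_target(instances: Iterable[InstanceStats], target: float = 0.999) -> bool:
    """Check the service-level KPI against a hypothetical availability target."""
    return service_availability(instances) >= target


# Hypothetical sample data, not real measurements.
fleet = [
    InstanceStats("web-1", requests=50_000, errors=10),
    InstanceStats("web-2", requests=50_000, errors=20),
    InstanceStats("web-3", requests=0, errors=0),  # a drained instance has no effect
]
```

With this shape, an alert would be tied to `meets_target(fleet)` rather than to the health of `web-1` or `web-2` individually, which is the "service as the new pet" view.
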

aws-samples/aws-incident-response-playbooks

AWS posted these example/template incident response playbooks for customers to use in their incident response process.

AWS

A set of playbooks covering several common scenarios that AWS users face. They outline procedures, based on the "NIST Computer Security Incident Handling Guide (Special Publication 800-61 Revision 2)", that can be used for the following purposes.
- Gather evidence
- Contain and then eradicate the incident
- Recover from the incident
- Conduct post-incident activities, including post-mortem and feedback processes

(All) DNS Resource Records

A list with descriptions of all DNS record types, even the obscure ones. Tag yourself, I’m HIP.

Jan Schaumann

Omitted, since it is covered above under DEVOPS WEEKLY ISSUE #553.

What’s a Major Incident Anyway?

This one includes a useful set of questions to prompt you as you develop your incident response and classification process.

Hollie Whitehead — xMatters

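It explains how to define a "Major Incident" and how to respond to one effectively, with the aim of making sure operations are running properly and directing attention to the work that actually matters.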

How to be better, together

The author of this article shows us how they communicate actively, perform incident retrospectives, and even discuss “near misses” and normal work in order to better learn how their system works — all skills that apply directly to SRE.

Jason Koppe — Learning From Incidents

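Based on the author's belief that "some of the approaches I used to get better at climbing can be applied to software engineering with similar effect", it covers the following points.
- Discuss everyday work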
- Learn from incidents
- What the software industry can learn from the climbing community

The Unique Reliability Engineering Requirements of Microservices

Although the fundamental concepts of site reliability engineering are the same in any environment, SREs must adapt practices to different technologies, like microservices.

JJ Tang — Rootly

Omitted, since it is covered above under DEVOPS WEEKLY ISSUE #553.

It’s Time to Rethink Outage Reports

This one uses Akamai’s incident report from their July 22 major outage as a jumping-off point to discuss openness in incident reports. The text of Akamai’s incident report is included in full.

Geoff Huston — CircleID

As the title suggests, it urges a rethink of what goes into outage reports, from the following perspective.
- It would be a positive step forward for this industry if Akamai’s outage report were not unusual in any way. It would be good if all service providers spent the time and effort post rectification of an operational problem to produce such outage reports as a matter of standard operating procedure.

Culture & Conduct Risk: The Normalization of Deviance

Drawing from the “normalization of deviance” concept introduced in the Challenger disaster study [Diane Vaughan], this article explores the idea of studying your organization culture to catch problems early, rather than waiting to respond after they happen.

Stephen Scott

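It explains that companies and regulators have generally held a "detect and correct" mindset, in which misconduct problems can only be discovered after they occur, and that a "predict and prevent" approach to managing an organization's conduct risk is now beginning to emerge.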

Lorin Hochstein (Netflix) [StaffEng Podcast]

This episode of the StaffEng Podcast is an interview with Lorin Hochstein, whose writings I’ve featured here numerous times. My favorite part of this episode is when they talk about doing incident analysis for near misses. One of the hosts points out that it’s much easier for folks to talk about what happened, because there was no incident so they’re not worried about being blamed.

David Noël-Romas and Alex Kessinger – StaffEng Podcast

A roughly 48-minute podcast episode on resilience with guest Lorin Hochstein, a Senior Software Engineer at Netflix whose articles this blog features every week.

Outages

Let’s Encrypt
Snapchat
Wikipedia
To fact-check this one, I looked at their Grafana dashboard. Neat!
Netflix
Venmo
Blackboard Learn
eBay
reddit

Outage information for the companies listed above.

KubeWeekly #271 August 6th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Kubernetes 1.22: Reaching new peaks

The Kubernetes release team is pleased to announce the release of Kubernetes 1.22, the second release of 2021!

This release consists of 53 enhancements: 13 enhancements have graduated to stable, 24 enhancements are moving to beta, and 16 enhancements are entering alpha. Also, three features have been deprecated. Learn more about the release from the blog, or listen to an interview with release team lead Savitha Raghunathan.

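The Kubernetes 1.22 release post on the Kubernetes Blog. The release contains the number of changes noted above, which are explained under the following major headings. There is a lot of information, so I will catch up on it gradually.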
- Major Themes
- Major Changes
- Other Updates
- Release notes
- Release Team
- Release Logo
- User Highlights
- Project Velocity
- Ecosystem Updates
- Event Updates
- Upcoming release webinar
- Get Involved
The Kubernetes 1.22 release logo is shown below.
Kubernetes 1.22 Release Logo

CNCF Unveils Schedule for KubeCon + CloudNativeCon North America 2021 in Los Angeles and Virtual

The schedule for KubeCon + CloudNativeCon North America 2021 is finally here! Attendees will experience over 230 sessions, including keynotes and breakouts, with over 70 presentations hosted by project maintainers. From non-technical and end user case studies to advanced engineering deep dives – the conference has content for everyone interested in cloud native technology. Explore the schedule and start planning your experience today.

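The schedule for "KubeCon + CloudNativeCon North America 2021", which starts on October 11, has been announced. I plan to attend online, but I forgot to buy an early-bird ticket, so I bought a ticket for $75 and registered. I am still considering the co-located events.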

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Cloud Native Live:

Humanising your cloud native platform

Lee Briggs, Pulumi

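A roughly one-hour session that looks at the rapid growth of platforms across cloud native organizations and explains how building software delivery platforms and infrastructure pipelines, as Pulumi does, can satisfy and help users.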

On-demand Webinars:

Securing your continuous everything strategy

Abubakar Siddiq Ango, GitLab

A roughly 13-minute session explaining the vulnerabilities that can arise at various stages of the continuous development lifecycle and how to mitigate them.

Kubernetes clusters need persistent data

James Spurin, StorageOS

A roughly 29-minute session that focuses on the often-overlooked CSI and covers the following. The slides and demo were very easy to follow, which I liked.
- Why the CSI and the installation of an effective data plane should be a key consideration for Kubernetes deployments
- Opportunities for improvements that include multi-tenancy, high availability, compliance with encryption at rest, ease of use with GitOps and the transition of traditional and legacy workloads, dependent on persistent data
- Live demo, showcasing the benefits of a Kubernetes data plane

Visit our Online Programs playlist on YouTube for more content.

We used our data to look at the highest velocity CNCF and OSS projects from 2020 and found some interesting takeaways!

For one @opentelemetry and @argoproj are growing VERY fast since entering CNCF 📈

Read more! https://t.co/GPqqKzttwL pic.twitter.com/GWFpcYG1uK
— CNCF (@CloudNativeFdn) August 2, 2021

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Two year update: Building an open source marketplace for Kubernetes

Alex Ellis, OpenFaaS Ltd

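It describes the past two years' journey of an "open source marketplace for Kubernetes": from its initial creation, through community growth and landing the first sponsored app, to the next steps.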

Implementing traffic policies in Kubernetes

Cody De Arkland, Kong

A guest post on the CNCF Blog, originally published on the "Kong blog".
It walks through in detail the process of observing and monitoring Kubernetes traffic using "Kuma", a modern distributed control plane bundled with Envoy Proxy.

July 2021 Flux update

Flux blog

Omitted, since it was previously covered in "KubeWeekly #267 July 9th, 2021".

Encrypt your Kubernetes secrets with Mozilla SOPS

Thorsten Hans blog

It explains how to encrypt and decrypt Kubernetes Secrets (YAML files) using SOPS in combination with Azure Key Vault. This lets you store Secrets directly in git alongside your other Kubernetes manifests.

Excited to share that I'll be speaking at @KubeCon_ North America with some amazing folks!

It was last minute but we were able to submit some talks for a brand new "student track" 🚀

I still remember the call I had with @breakawaybilly a few days before the CFP closed 😄
— Kunal Kushwaha (@kunalstwt) August 4, 2021