SRE / DevOps / Kubernetes Weekly Report Summary #93 (2021/11/7 to 11/12) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

The English Version of this blog is here.
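This post is a memo and link collection from reading the following three Weekly Reports published from 2021/11/7 to 11/12.
Because I want to deliver and share the information as quickly as possible, I publish the post early, as soon as I have checked the blog links. I add my own comments as I go.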
DEVOPS WEEKLY ISSUE #567 November 7th, 2021
- News
- Tools
SRE Weekly Issue #295 November 7th, 2021
- Articles
- Outages
KubeWeekly #283 November 12th, 2021

I hope it serves as an information source for someone or saves some search effort.

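If anything in this post is unclear or raises questions, I would appreciate it if you could check the original article via its URL and then point the issue out to me.
I make a point of commenting even on areas I understand only shallowly, so there are likely many misunderstandings caused by my own mistakes or insufficient explanation.
Because there is a lot of information, I have limited the post to text and links.
Some of the articles featured in each report date from 2020 or earlier, so they are not necessarily the latest.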

DEVOPS WEEKLY ISSUE #567 November 7th, 2021

News

An excellent post on the subtleties of building trust in systems, including the technical systems and the people that make complex software work.

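The title is "In our systems we trust".
Through a story, it discusses how users' and stakeholders' trust in our systems and products can be maintained in an environment where failures are routine, and how we, as team members, can keep trust in each other and confidence in the quality of each other's work. It covers the following points: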
- Health of code
- Health of relation
- Let’s talk about trust.
- A word of advice
- Final thoughts

Another great post, this one on a long term effort to reduce the cost and improve the developer experience of a large, and growing, CI infrastructure.

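The title is "Infrastructure Observability for Changing the Spend Curve".
It explains in detail how Slack, by repeatedly studying and changing its CI infrastructure, achieved an order-of-magnitude change in spend over the past two years.
I found the following three ideas, described in the Hacklang and Takeaways sections, interesting.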
- Adaptive capacity to decrease the cost of each test by changing the infrastructure runtime.
- Circuit breakers to decrease the number of tests by changing the infrastructure workflow.
- Pipeline changes to decrease the number of tests by changing our user workflows.

A new programming language based on Lua! Luau is described as a fast, small, safe, gradually typed embeddable scripting language. Lots of use cases for this, I hope it attracts an active community.

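An article introducing "Luau", a fast, small, safe, gradually typed, embeddable scripting language derived from Lua. Roblox has open-sourced the language, which Roblox game developers use to write game code and which the company's engineers use to implement much of the user-facing application code and parts of the editor (Roblox Studio) as plugins.
The GitHub page is here.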

CI systems are so key to modern software development that some companies develop their own custom solutions. Not something most folks should do, but an interesting post from one team that took the custom approach.

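The title is "Developing Databricks’ Runbot CI Solution".
It explains the motivation behind developing "Runbot", a custom CI system built specifically for Databricks' needs, the key design decisions that went into it, and how Runbot was used to significantly improve the experience of every developer in the Databricks engineering organization.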

A handy list of observability themed talks from the recent KubeCon event. Compiled by https://monitoring.love.

The playlist "KubeCon 2021 o11y Talks" collects the observability sessions from KubeCon + CloudNativeCon NA 2021.

Nginx was always one of my favourite bits of software to manage. This post looks into how to monitor it and provides an overview of various mainly SaaS solutions that can help, written by one of those SaaS providers.

The title is "NGINX Monitoring: Best Tools and Key Metrics You Should Know About".
It focuses on the key metrics for NGINX and on the following "The Top 7 NGINX Monitoring Tools".
1. Sematext
2. Prometheus and Grafana
3. New Relic
4. Datadog
5. AppDynamics
6. SolarWinds Server & Application Monitor
7. Dynatrace

A post introducing the monitoring golden signals (latency, traffic, errors and saturation) from first principles.

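The title is "Golden Signals - Monitoring from first principles".
The first post in a three-part blog series. It explains the four main SRE golden signals for metric-driven measurement and the role they play in the overall context of monitoring.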
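
As a note to self, here is a minimal Python sketch of how the four golden signals (latency, traffic, errors, saturation) could be computed for one observation window. It is not taken from the article: the `Request` record, the worker-pool notion of saturation, and the choice of a nearest-rank 95th percentile are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, not from the article)."""
    duration_ms: float
    status: int


def golden_signals(requests, window_s, busy_workers, total_workers):
    """Summarise one observation window into the four golden signals."""
    if not requests or window_s <= 0 or total_workers <= 0:
        raise ValueError("need requests, a positive window and a positive worker count")

    durations = sorted(r.duration_ms for r in requests)
    n = len(durations)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * n), in integer arithmetic.
    rank = (95 * n + 99) // 100
    failed = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[rank - 1],
        "traffic_rps": n / window_s,
        "error_ratio": failed / n,
        # Saturation is assumed here to be the busy share of a fixed worker pool.
        "saturation": busy_workers / total_workers,
    }


# Ten requests of 10..100 ms in a 5 second window, two of them server errors.
sample = [Request(10.0 * i, 200) for i in range(1, 9)]
sample += [Request(90.0, 500), Request(100.0, 503)]
signals = golden_signals(sample, window_s=5, busy_workers=3, total_workers=4)
```

Only 5xx statuses count as errors in this sketch; what counts as an error is a policy decision for each service.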

Tools

A useful public GitHub template for bootstrapping an AWS EKS cluster using Terraform. Good accompanying blog post as well about the usefulness of such boilerplate templates.

A blog post on "Boilerplate for a basic AWS infrastructure with EKS cluster", built with Terraform.
The GitHub page is here.

SRE Weekly Issue #295 November 7th, 2021

Articles

MTTR is a Misleading Metric—Now What?

I love this crystal clear argument based on statistics and research. MTTR as a metric is simply meaningless.

Courtney Nash — Verica

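Part 2 of a series focusing on each of the key findings of the 2021 VOID Report. It raises the problem stated in the title and in the Editor's comment above, and considers what to use instead of MTTx metrics.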
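
To make the statistical point concrete for myself, here is a small Python sketch. The durations are made-up numbers, not data from the VOID report, and the assumption that incident durations are strongly skewed is my reading of the argument rather than a quote. It shows how a single long incident drags the mean far away from what a typical incident looks like.

```python
from statistics import mean, median

# Hypothetical time-to-resolve values in minutes: mostly short, one very long.
durations_min = [5, 8, 10, 12, 15, 20, 480]

mttr = mean(durations_min)       # the "MTTR" a dashboard would report
typical = median(durations_min)  # the middle incident of the sorted list

print(f"mean = {mttr:.1f} min, median = {typical} min")
```

With these numbers the mean is more than six times the median, even though six of the seven incidents were resolved in 20 minutes or less.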

Five steps to better customer communication

Their steps for better communication during an outage:

* Provide context to minimise speculation
* Explain what you’re doing to demonstrate you’re ‘on it’
* Set some expectations for when things will return to normal
* Tell people what they should do
* Let folks know when you’ll be updating them next

Chris Evans — incident.io

It centers on the five steps named in the title, which the Editor excerpts above.

Heroku Incident 2365 Follow-Up

Despite checking in advance to be sure their systems would support the new Let’s Encrypt certificate chain, they ran into trouble.

[…] we discovered that several HTTP client libraries our systems use were using their own vendored root certificates.

Heroku

A retrospective on the incident that occurred on September 30, 2021. It was triggered by the expiry of an old root certificate used by Let's Encrypt.

Multicloud failover is almost always a terrible idea

This is the best case I’ve seen yet against multi-cloud infrastructure. I really like the airline analogy.

Lydia Leong

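In line with the title, it explains that the huge cost and complexity of a multicloud implementation are a harmful distraction from what should actually be done: improving uptime and reducing risk.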

An Update on Our Outage – Roblox

Roblox had a major, several-day outage starting on October 28. I don’t usually include game outages in the Outages section since they’re so common and there’s not usually much information to learn from, but I sure do like a good post-incident report. Thanks, folks!

David Baszucki — Roblox

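Roblox's incident information. In keeping with its value of respecting the community, the company shows through action its commitment to transparency in the post-mortem.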

40 Ms Bug

When you’re sending small TCP packets, two optimizations can conspire to introduce an artificial 40 millisecond (not megasecond…) delay.

Vorner

It shares the story of tracking down a bug that occurred in production in an app written in Rust, in the following points.
- A bit of backstory
- The increased latencies
- It was acting really weird
- The benchmarks
- Configuration options
- Overriding the defaults of http1_writev
- Splitting vectored writes
- Nagle’s algorithm
- Ok, but why 40ms?
- Conclusion
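
The section list above names Nagle's algorithm. As a reminder for myself, this is a minimal sketch of how Nagle's algorithm can be switched off on a TCP socket with the standard `TCP_NODELAY` option. It is written in Python rather than the Rust of the article, the helper names are hypothetical, and I am not claiming it is the fix the author settled on.

```python
import socket


def disable_nagle(sock: socket.socket) -> None:
    """Ask the kernel to send small writes immediately instead of coalescing them."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)


def nagle_disabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


# No connection is needed: the option can be set on a fresh TCP socket.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    before = nagle_disabled(s)
    disable_nagle(s)
    after = nagle_disabled(s)
```

Disabling Nagle trades fewer stalls for more small packets on the wire, so it tends to suit request/response traffic better than bulk transfers.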

Google Incident report — Meet

Here’s Google’s follow-up report for their October 25-26 Meet outage.

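If you view the report in Japanese or another language and are directed to the English page for additional information, go here.
The "Improvements identified" are as follows: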
- Increased resource allocation for the backend message delivery system in the short term, and automatically detect message delivery overload in the long term.
- Enhancements to monitoring systems to capture real time data on quality of livestreams for all volumes of traffic.
- Alert logic updates to capture spikes in rebuffering rate proactively to help mitigate before any customer impact.

/r/sre — How to deal with retries in SLIs

Should you count failed requests toward your SLI if the client retries and succeeds? A good argument can be made on either side.

u/Sufficient_Tree4275 and other Reddit users

As the Editor's comment above says, how to treat retries in SLIs is argued from both sides, which makes for an interesting discussion.
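
To see what the two positions mean in numbers, here is a small Python sketch. The data shape (each client operation as a list of attempt outcomes) and the function names are my own assumptions, not something from the thread. One SLI counts every attempt, the other counts an operation as good if any attempt succeeded.

```python
def attempt_sli(operations):
    """Availability where every attempt, including each retry, is one event."""
    attempts = [ok for op in operations for ok in op]
    return sum(attempts) / len(attempts)


def operation_sli(operations):
    """Availability where an operation is good if any of its attempts succeeded."""
    return sum(1 for op in operations if any(op)) / len(operations)


# Hypothetical traffic: True is a successful attempt, False a failed one.
operations = [
    [True],                 # succeeded first time
    [False, True],          # failed once, retry succeeded
    [False, False, False],  # gave up after three failures
    [True],
]

per_attempt = attempt_sli(operations)      # 3 good attempts out of 7
per_operation = operation_sli(operations)  # 3 good operations out of 4
```

The per-attempt number reflects what the server did, while the per-operation number is closer to what the user experienced. The same traffic gives roughly 43% under one definition and 75% under the other.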

What the SRE team wants to achieve with the development team

Mercari restructured its SRE team, moving toward an embedded model to adapt to their growing microservice architecture.

ShibuyaMitsuhiro — Mercari

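As noted below, this is the English translation of a blog post published on January 29, 2021. About ten months have passed since it was written, so the team has probably moved on to the next stage, and I am curious about what came next.
- This article is a translation of the Japanese article written on Jan. 29th, 2021.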

Episode 1: Honeycomb and the Kafka Migration – The VOID

There’s a really great discussion in this episode about leaving slack in the system in the form of bits of capacity and inefficiency that can be drawn upon to buy time during an outage.

Courtney Nash, with guests Liz Fong-Jones and Fred Hebert — Verica

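A roughly 31-minute podcast on the story behind the report Honeycomb published in May about a series of incidents related to its Kafka architecture migration. Beyond the topic in the Editor's comment above, it also covers the following topics, which go beyond specific technical details.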
- Complex socio-technical systems and the kinds of failures that can happen in them (they're always surprises)
- Transparency and the benefits of companies sharing these outage reports
- Safety margins, performance envelopes, and the role of expertise in developing a sense for them
- Honeycomb's incident response philosophy and process
- The cognitive costs of responding to incidents
- What we can (and can't) learn from incident reports

Why a ‘Reliability Mindset’ Must Be Adopted Beyond SRE

Here’s how non-SREs can use SRE principles to improve their systems.

Laurel Frazier — Transposit

It centers on the following six ways non-SREs can start adopting a "reliability mindset".
1. Be prepared
2. Embrace automation
3. Let the data do the talking
4. Debrief without blame
5. Close the feedback loop
6. Be customer-centric

Outages

Facebook, Messenger and Instagram
Or Meta or whatever.
Google Nest

Outage information for the companies above.

KubeWeekly #283 November 12th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

How Pokemon Go creator builds on Kubernetes for developers

B. Cameron Gain, The New Stack

In this latest episode of The New Stack Makers podcast, Ria Bhatia, senior product manager of Niantic, discusses why the Pokemon Go platform remains relevant today to developer customers and why Kubernetes will remain an integral part of the platform.

It explains why the Pokemon Go platform remains relevant, and why Kubernetes will remain an integral part of the platform, given the hope of bringing more "developer customers" to Kubernetes. A podcast and a YouTube video are both embedded.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Breaking tradition: The future of package management with Kubernetes

Aaron Hurley & Dmitriy Kalinin, VMware

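A roughly one-hour session that digs deeper into the ideas from the session of the same name presented in the keynote at KubeCon + CloudNativeCon NA 2021.
It explains in detail how the Carvel project team is rethinking package management for Kubernetes and providing a modern, declarative way to automate end-to-end lifecycle management of packaged apps and their dependencies.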

Improve core-to-edge mobility and resiliency for cloud native applications

Ben Morrison, Trilio

A roughly 53-minute session that introduces how to implement mobility and resiliency for cloud native apps at the edge, in the following points.
- How to further simplify deployment and management of cloud-native applications to improve resiliency and availability for edge clouds and help customers better curate their data for competitive advantage.
- How to protect and migrate workloads between core and edge using enterprise and lightweight Kubernetes with data management tooling.

If you, like us, are missing @KubeCon_ + #CloudNativeCon, then it's time to start thinking about EU 2022 in Valencia, Spain! 🌊 🥘

The CFP is open NOW through Dec 17!

Learn more: https://t.co/lYE8E9PUEA pic.twitter.com/NeJGhnEf68
CNCF (@CloudNativeFdn), October 26, 2021

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

How Krateo PlatformOps integrates Backstage

Diego Brag, Kiratech

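As the title says, it introduces the value and importance of integrating "Krateo PlatformOps", an open source project based on a comprehensive, modular architecture for centrally creating and managing any type of resource, with "Backstage", an open platform created by the Spotify IT team. Backstage is a CNCF sandbox project.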

Kube-lineage: A CLI tool for visualizing Kubernetes object relationships

Justin Toh

It introduces "kube-lineage", a CLI tool for visualizing the relationships between Kubernetes objects.

Multifactor SSO authentication for Postgres on Kubernetes

Jonathan Katz, Crunchy Data

PostgreSQL 12 made multifactor authentication possible in the database, so the post shows how to set it up.

Flux security audit has concluded

Daniel Holbach, Weaveworks

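It shares the results of the now-completed security audit of "Flux", a CNCF Incubation project, carried out by the CNCF and OSTIF (the Open Source Technology Improvement Fund).
The main goals of the audit were to assess Flux's fundamental security posture and to identify the next steps in its security story.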

Horizontal pod autoscaling with custom metrics in Kubernetes

Natalie Serrino, Pixie

As the title says, it explains how to autoscale a Kubernetes Deployment on custom application metrics using kind: HorizontalPodAutoscaler.
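
For reference, the core scaling rule documented for the HorizontalPodAutoscaler is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). Below is a simplified Python sketch of that rule only. The function name, the example numbers and the replica bounds are my own; the real controller also deals with things like unready pods, missing metrics and stabilization, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale replicas by the ratio of current to target metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")

    ratio = current_value / target_value
    # Skip scaling when the ratio is close enough to 1.0
    # (0.1 is the default tolerance of the Kubernetes controller).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds that minReplicas / maxReplicas would set in the HPA spec.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical custom metric: requests per second per pod, with a target of 100.
scale_up = desired_replicas(4, current_value=200, target_value=100)
no_change = desired_replicas(4, current_value=105, target_value=100)
scale_down = desired_replicas(4, current_value=50, target_value=100)
capped = desired_replicas(4, current_value=1000, target_value=100)
```

With four replicas and a target of 100, a reading of 200 doubles the Deployment to eight replicas, 105 stays inside the tolerance band, 50 halves it to two, and 1000 is capped at the assumed maximum of ten.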

What does pizza have to do with Kubernetes?
More than you think. Read here to discover: https://t.co/Hv8zkBR4SB

And register now (you still have time!): https://t.co/mQVyPVFPy8

Join us this year virtually, you'll get a pizza on next year's edition 🍕😎 pic.twitter.com/puZj8mkKny
Kubernetes Community Days Italy (@KCDItaly), November 11, 2021

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Announcing the 2021 Steering Committee election results

Kaslin Fields, Google

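As the title says, this announces the results of the 2021 Steering Committee Election for the Steering Committee, whose members are elected from the Kubernetes community for two-year terms. It introduces the four members newly elected or re-elected this time and the three members who continue without being up for election, and thanks everyone involved.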

Kubernetes podcast from Google: Knative 1.0, with Ville Aikas

Craig Box & Jimmy Moore

The Kubernetes Podcast, run by Google employees. This episode's host is Craig Box, with guest host Jimmy Moore.
To mark the release of Knative 1.0, the guest is Ville Aikas, a member of the Knative Steering Committee and co-founder of Chainguard Inc.
This episode skips the News of the week and goes straight into the guest interview.

Security microservices, configuration and observability take the stage at KubeCon NA 2021

Patrick Nelson, SiliconANGLE

It explains the following five key points from the security-related sessions in the KubeCon + CloudNativeCon NA 2021 keynotes.
1. Modern security practices take hold
2. Configuration is more important in elaborate environments than cyberattack prevention
3. Supply chain hacks are escalating, and in the spotlight
4. Streamlining app deployment to Kubernetes
5. Costs getting reined in

Key takeaways from KubeCon: deeper focus on FinOps, GitOps

Charlotte Dunlap, GlobalData

It excerpts and explains key takeaways on FinOps and GitOps from KubeCon + CloudNativeCon NA 2021.
The Summary Bullets are the following two.
- The Open Source Security Foundation (OpenSSF), a new group focused on software security supply chain problems, added $10 million in vendor funding.
- Google Cloud recently joined the FinOps Foundation, representing the first major cloud provider to commit.