SRE / DevOps / Kubernetes Weekly Report Roundup #57 (2021/2/28 - 3/5) - 運び屋 (A carrier (forwarder) changed his career to an engineer)

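The title is "Security Logging in Cloud Environments - AWS".
Part of the author's blog series "Continuous Visibility into Ephemeral Cloud Environments", this article explains the design of a state-of-the-art, multi-account security logging platform on AWS.
The author apparently plans to cover similar setups for GCP and Kubernetes in future posts.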

A post on how the software community came to appreciate systems administrators a little more with the hugops movement.

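The title is "An oral history of #hugops: How tech’s first responders built a culture of empathy".
It explains how the engineers who keep the cloud running created their own culture of empathy. It centers on how the Twitter hashtag #hugops spread, told through stories that convey the struggles of operations engineers.
A Japanese-language explainer has also been published: 「インターネット上のあらゆるサービスを稼働させ続ける運用保守エンジニアをねぎらうハッシュタグ「#hugops」とは？」 (roughly, "What is #hugops, the hashtag that thanks the operations engineers who keep every service on the internet running?").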

Some good tips for scaling infrastructure as code across teams and organizations. Observations about public modules, standards, reusable code, having a formal release/versioning process and more.

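The title is "Infrastructure as Code at Enterprise Scale: Identify the Right Approach for Your Organization".
Focusing on AWS and Azure, the two largest public clouds, it presents tools and detailed guidelines that help scale an IaC approach.
How to define the "enterprise" in the title is up to the reader. The author defines it as follows.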
- How you define “enterprise” is up to you: whether you’re a Fortune 500 company or a garage-based upstart, this guide is for you.

JSON comes in a surprisingly large number of formats, with subtle differences. Throw in different JSON parsers in different languages and there is the potential for vulnerabilities caused by interoperability issues.

The title is "An Exploration of JSON Interoperability Vulnerabilities".
The TL;DR is as follows, and its link leads to a hands-on labs page on GitHub.
- TL;DR The same JSON document can be parsed with different values across microservices, leading to a variety of potential security risks. If you prefer a hands-on approach, try the labs and when they scare you, come back and read on.
The author divides JSON interoperability security risks into the following five categories.
1. Inconsistent Duplicate Key Precedence
2. Key Collision: Character truncation and Comments
3. JSON Serialization Quirks
4. Float and Integer Representation
5. Permissive Parsing and Other Bugs
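
As a small illustration of the first category, here is a minimal sketch using only Python's standard `json` module. It is not taken from the article: the sample document and the helper `reject_duplicate_keys` are hypothetical names chosen for illustration. It shows that Python silently keeps the last value of a repeated key, and how a stricter parse could surface the ambiguity instead.

```python
import json


def reject_duplicate_keys(pairs):
    """Hypothetical object_pairs_hook that raises if a JSON object repeats a key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


# Hypothetical document with the same key twice.
doc = '{"qty": 1, "qty": -1}'

# The standard json module keeps the last value for a repeated key.
last_wins = json.loads(doc)

# A stricter parse reports the duplicate instead of picking a winner.
try:
    json.loads(doc, object_pairs_hook=reject_duplicate_keys)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(last_wins)     # {'qty': -1}
print(strict_error)  # duplicate key: 'qty'
```

If another service in the chain used a parser that keeps the first value, the two services would disagree about `qty`, which is the kind of mismatch the article's first category describes.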

A good roundup of Linux server monitoring. Looking quickly at sar, vmstat, nethogs and monitorix.

The title is "Linux System Monitoring Fundamentals".
In line with the title, it introduces the following four Linux system monitoring tools as important and worth investigating further.
1. Sar
2. Vmstat
3. Monitorix
4. Nethogs

A post on Kubernetes robustness, showing with examples how to bring up various Kubernetes services after failure.

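The title is "Breaking down and fixing Kubernetes".
The opening illustration showing rm -rf /etc/kubernetes gave me a start. The article explains how to break a Kubernetes cluster with this command, deleting its certificates, and how to recover from there.
The same author also has an etcd version, "Breaking down and fixing etcd cluster". Both are good for understanding Kubernetes' file structure and behavior.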

A comparison of System Manager Parameter Store and the newer Secrets Manager for managing secrets in AWS environments.

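The title is "Parameter Store vs Secrets Manager".
The illustration at the top of the web page is "Street Fighter II: Ryu vs Ken"‼️
It compares the two services along the lines of the title, in the following structure.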
- Round 1: Key Value Store
- Round 2: Storage Limitations
- Round 3: Encryption
- Round 4: Rotation
- Round 5: Cost
- The Verdict

A nice worked example of live debugging using VSCode when you have a monorepo application and multiple container-based applications.

The title is "Seamless Multi-Container Live Debugging in VSCode | DevContainers on Steroid".
It explains remote live debugging of multi-container or monorepo-style workspaces for containerized apps.
The source code is on this GitHub page.

Tools

cloudquery transforms your cloud infrastructure into SQL or Graph database for easy monitoring, governance and security.

The GitHub page for "cloudquery", a tool that pulls, normalizes, exposes, and monitors cloud infrastructure and SaaS apps as a SQL or graph (Neo4j) database.

A new bash-like shell with a few interesting features. In-line spell checking, typed pipelines, built-in testing framework, user-friendly error handling and more.

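A shell in the vein of bash / zsh / fish. This is the GitHub page for "murex".
It reportedly follows a syntax similar to POSIX shells such as Bash, but supports more advanced features than you would normally expect from a $SHELL.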

SRE Weekly Issue #259 February 28th, 2021

Articles

Increment: Reliability

This quarter’s Increment issue is about Reliability, and I haven’t had this much fun since their first issue about on-call. I’ll include a few of the articles here and more in later issues as I have a chance to review them.

Stripe

"Increment" is a print and digital magazine about how teams build and operate software systems at scale, and the theme of Issue 16 (February 2021) is "Reliability". This time, the following three articles from Increment are featured.

[Increment: Reliability] Everything is broken, and it’s okay

Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes.

Understanding that no system is without errors is critical to building resilient systems.

Heidi Waterhouse

As the subtitle says, "Accepting that imperfect things still work is fundamental to preventing failures from becoming catastrophes". The article covers the following points.
- Control is an illusion
- Failure is inevitable
- Responding to fragility
- Designing against disasters
- Accept imperfection, within limits

[Increment: Reliability] How to build organizational resilience

The very first sentence sets the tone, and I love it:

Resilience is a process: something you must actively perform, not something you check off a list once.

Ryn Daniels

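As the subtitle suggests, by encoding resilience into an organization's culture, engineering teams can prepare to tackle unknown and unexpected problems. The article explains how to build a growth-oriented culture that can keep learning, improving, and building resilience for years to come.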

[Increment: Reliability] Embrace your inner incident commander

Most of all, having an incident commander only works if everyone believes in the role. Someone stepping in to address a crisis and saying “I’m Batman” doesn’t help unless people have bought into the idea of Batman.

The next time I’m incident commander, I am totally going to jump in and say, “I’m Batman!”.

This article is a great primer on what an IC is and how to adopt incident command at your organization.

Tanya Reilly

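It explains, in the following structure, that how you fight fires affects how quickly you can resolve an outage, that appointing an incident commander helps, and that the reader can become one too.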
- Enter incident command
- The incident commander’s role
- Making it work
- You’ve got to believe
- It’s your turn

Retry pattern in microservices

After reading this blog post, you will have an understanding of the retry pattern used in microservices architecture, why it should be used, a few considerations while using the retry pattern, and how to use it in Python.

I love the W. C. Fields quote.

Anand Prashant

The content is as described above, and it is organized as follows. The diagrams and code are clearly presented, which is nice.
- Microservices
- Retry pattern
- Considerations
- Adding delays between retries
- Retrying only on certain exceptions
- Few other considerations
- Conclusion
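
The two considerations listed above (delays between retries, and retrying only on certain exceptions) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the article's code: `TransientError`, `retry`, and `flaky` are hypothetical names, and exponential backoff with jitter is one common way to space out retries, not necessarily the one the author uses.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure."""


def retry(func, attempts=3, base_delay=0.1, retry_on=(TransientError,), sleep=time.sleep):
    """Call func, retrying only on the given exceptions, with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Add random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay))


calls = {"count": 0}


def flaky():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = retry(flaky, attempts=5, base_delay=0.01)
print(result, calls["count"])  # ok 3
```

Exceptions not listed in `retry_on` propagate immediately, so a bug such as a `KeyError` is not masked by repeated attempts.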

2021 Site Reliability Engineering (SRE) Survey Now Open

It’s that time again! Be sure to fill out the survey, not only so they can gather useful data, but also because Catchpoint will donate $5 to charity.

DevOps Institute, Catchpoint, and VMWare Tanzu

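An announcement of the above survey by DevOps Institute. A report will be produced from the survey results and published.
The deadline is April 1, 2021, and, as noted above, a charity donation is also being made. To go straight to the survey, use "Take the survey now".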

QA Engineers, This is How SRE will Transform your Role

When considering the value of a QA test, SLIs can provide very valuable context.

SRE and QA can work hand in hand.

Emily Arnott — Blameless

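Quoting illustrations from Alex Hidalgo's "Implementing Service Level Objectives", it explains that implementing SRE changes almost every role within an IT organization, and that one of the biggest transformations is in the QA team.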

Silent data corruption: Mitigating effects at scale

This kind of thing keeps me up at night. Silent data corruption can destroy your reliability just as quickly as a backhoe on a non-redundant link.

Harish Dattatraya Dixit — Facebook

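Drawing on the paper, it explains best practices for detecting and fixing silent data corruption at a scale of hundreds of thousands of machines.
The full version of the paper is available here: "Silent data corruptions at scale".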

How Etsy Prepared for Historic Volumes of Holiday Traffic in 2020

Etsy experienced years of growth practically overnight in 2020 as quarantines set in. Here’s how they handled it.

Mike Adler — Etsy

It covers what the editor describes above, in the following structure. This is an organization where blameless postmortems really work.
- The Challenge
- Modulating Our Pace of Change
- Adapting Our “Macro” Load Testing
- Modeling History To Inform Capacity Planning
- Cresting The Peak
- Gratitude

Outages

Let’s Encrypt
Google Voice
This is Google’s analysis for the incident on February 16, caused by a TLS certificate management mishap.
India’s National Stock Exchange (NSE)
LinkedIn
US Federal Reserve
The US Fed’s computer system was down, preventing transfers between banks from going through.
Venmo
Facebook and Instagram
Reddit
Discord

Outage information for each of the companies above.

KubeWeekly #253 March 5th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Schedule for KubeCon + CloudNativeCon Europe 2021 – Virtual is now available!

KubeCon + CloudNativeCon Europe 2021 Virtual is happening May 4-7, 2021 and the schedule is now available. Experts from organizations including Adobe, Apple, CERN, NVIDIA, and OVHcloud will deliver 100+ sessions, keynotes, lightning talks, and breakout sessions. There will also be more than 60 sessions hosted by project maintainers – spanning beginner-level introductions, end user case studies, and technical deep dives.

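As stated above, the schedule for KubeCon + CloudNativeCon Europe 2021 Virtual has been published. In Japan it falls in the latter half of Golden Week, so I will have time and plan to choose which sessions to watch at my leisure.
The article also introduces the community-curated schedule, and I plan to watch the following.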
- The community-curated schedule will feature sessions from leading open source technologists, including:
  - “Your Path To Non-code Contribution In The Kubernetes Community” – Kaslin Fields, Google; Kat Cosgrove, JFrog; Matt Broberg, Red Hat; Kohei Ota, HPE

Will be speaking at KCD Africa and looking forward to submitting CFPs for KCD Bengaluru! 🌈

Offer to help with initial review of CFPs still stands. Reach out on DM or slack if you'd like help! :) https://t.co/LUGpBElnTe
— Nikhita Raghunath (@TheNikhita) March 4, 2021

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Top Kubernetes Health Metrics You Must Monitor

Ajit Chelat, Logiq

It explains the Kubernetes health metrics that you should monitor in particular.
The table of contents is as follows.
1. Crash Loops
2. Cluster State Metrics
3. Disk and Memory Pressure
4. Network Unavailable
5. CPU Utilization
6. Job Failures
7. DaemonSets
8. Monitoring Kubernetes Health Metrics

Troubleshooting Services on Google Kubernetes Engine by Example

Yuri Grinshteyn, Reliability Engineer, Google Cloud

It covers the following two things.
- We'll walk through deploying a sample app to your cluster and configuring an alerting policy that will notify you if there are any container restarts observed.
- From there, we'll trigger the alert and explore how the new GKE dashboard makes it easy to identify the issue and determine exactly what's going on with your workload or infrastructure that may be causing it.
A video with the above title, from the "Stack Doctor" (#stackdoctor) series on the Google Cloud Tech YouTube channel, is also embedded.

Protocol Detection and Opaque Ports in Linkerd

Charles Pretzer, Buoyant

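The Linkerd 2.10 release adds a new feature, "Opaque Ports". Because the Linkerd community asked quite a few questions about it on Slack and GitHub, the article focuses on "Protocol Detection", one of the most important fundamental features that enable Linkerd to do this.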

Integrating Backpressure into the Infrastructure

Simone Busoli, NearForm

I could not reach the linked web page (as of 2021/03/06 12:35 JST). Clicking the blog title from the site's top page did not work either. I wonder what is going on?

Multi-Cluster Monitoring with Thanos

Kevin Lefevre, CTO, Particule

It explains the limitations of a Prometheus-only monitoring stack, and why moving to a Thanos-based stack can improve metric retention and reduce overall infrastructure cost.

Securing Istio Workloads with mTLS Using Cert-Manager

Josh van Leeuwen, Jetstack

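Against the background of the history and the current state, it reports that Jetstack's cert-manager team worked with the Istio community's security WG and many customers to deliver an integration that lets cert-manager sign workload certificates for the Istio service mesh, and it shares the lessons learned.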

Understanding the Kubernetes Event Horizon

Bryan Boreham, WeaveWorks

As the title says, it explains Kubernetes Events with example log output, and includes the following warning.
- Warning: ‘kubectl get events’ can spew out a lot of information, especially as your cluster gets busier. Sadly it does not list the events in timestamp order, so you either have to have some idea what you are looking for, or pipe the output to a file and analyze it with the Mk 1 eyeball.
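
One way to get around the ordering problem in the warning above is to sort the events yourself. The following is a minimal sketch, not from the article: the sample JSON is hypothetical and only shaped like trimmed `kubectl get events -o json` output, and the fallback from `lastTimestamp` to `metadata.creationTimestamp` is an assumption for events where the former is null. `kubectl get` also has a `--sort-by` flag (for example `--sort-by=.metadata.creationTimestamp`) that may be enough on its own.

```python
import json

# Hypothetical sample, shaped like trimmed `kubectl get events -o json` output.
raw = """
{"items": [
  {"lastTimestamp": "2021-03-05T10:02:00Z", "reason": "BackOff",
   "metadata": {"creationTimestamp": "2021-03-05T10:00:30Z"}},
  {"lastTimestamp": "2021-03-05T10:00:00Z", "reason": "Scheduled",
   "metadata": {"creationTimestamp": "2021-03-05T10:00:00Z"}},
  {"lastTimestamp": null, "reason": "Pulled",
   "metadata": {"creationTimestamp": "2021-03-05T10:00:10Z"}}
]}
"""


def event_time(event):
    """Prefer lastTimestamp; fall back to the creation time when it is null."""
    return event.get("lastTimestamp") or event["metadata"]["creationTimestamp"]


# Timestamps in the same UTC ISO 8601 format sort correctly as plain strings.
events = sorted(json.loads(raw)["items"], key=event_time)
ordered_reasons = [event["reason"] for event in events]
print(ordered_reasons)  # ['Scheduled', 'Pulled', 'BackOff']
```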

Introduction to Litmus Chaos | Rawkode Live

David McKay

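A 90-minute webinar video with the above title. It includes a demo, and the chapter feature on the right lets you jump to the part you want to watch, which is nice.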

Canary Deployments using Ketch

Saiyam Pathak, Civo

A 7-minute webinar video with the above title. In the comments section, a viewer's request for beginner-level content gets a kind reply, which I thought was nice.

How to Manage Multi-Cluster Kubernetes with Operators

Sascha Haase, Kubermatic

It explains why multi-cluster management is needed, how Kubermatic Kubernetes Platform leverages Kubernetes Operators to automate cluster lifecycle management across multiple clusters, clouds, and regions, and how to get started today.

Getting Started With Kubernetes: Clusters and Nodes

Sofia Parafina, Pulumi

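It explains how to use infrastructure as code to create basic Kubernetes objects and the higher-level abstractions built on top of them.
Specifically, it shows how to set up Kubernetes clusters on AWS, Azure, and GCP using Pulumi. Cluster creation differs by cloud provider, but the process is generally the same.
This is the first article in a series on using infrastructure as code for Kubernetes. The next article will apparently cover basic Kubernetes objects such as Pods, Services, and volumes.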

Migrating Jenkins Freestyle Job to Multibranch Pipeline

Aman Bisht, Infracloud

It explains why one of the author's enterprise customers needed to switch to Jenkins multibranch pipelines, and how that made their lives easier.
- Freestyle Vs Pipeline jobs
- Why did we move to Multibranch Pipeline?
- Sample Jenkinsfile Template
- Benefits of Multi-branch Pipeline
- Challenges
- Conclusion

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Rethinking your Company’s Cloud Security in the Shadow of the SolarWinds Attack

Amir Kaushansky & Leonid Sandler @ARMO

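It analyzes the SolarWinds attack, aiming for a deeper understanding of the vulnerabilities of cloud-native environments such as Kubernetes, and then lists effective measures to eliminate or mitigate the risks inherent in cloud environments.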

Demystifying Kubernetes Network Policy

Thomas Graf @Isovalent

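It covers Kubernetes network policy from the basics through to more advanced concepts.
It proceeds step by step, from setting up simple policies to tackling harder questions: finding and avoiding conflicting rules, reviewing common mistakes, and examining advanced real-world policy examples similar to those implemented by major Kubernetes users.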

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

In the Clouds: DevSec + SecOps w/ Kirsten Newcomer

Chris Short and Kirsten Newcomer, Red Hat

A roughly one-hour session that, in line with the title, discusses topics such as the following.
- Security isn't just for Ops teams anymore - what do we need to do to make security a focal point of app dev as well? And why is security important for containers and Kubernetes?

How I Became a Kubernetes Maintainer in 4 hours a Week

Matthew Broberg, Red Hat

The author shares what he learned about contributing to Kubernetes, in the hope that it helps readers find the focus and time to get involved.

7 Reasons to Adopt a Kubernetes Native Backup Solution

Gaurav Rishi, Kasten

It explains the following seven reasons why a Kubernetes-native backup solution is the best fit for protecting a growing Kubernetes environment.
1. It accommodates Kubernetes deployment patterns.
2. It aligns with “Shift-left” development.
3. It simplifies operations.
4. It accommodates multicluster scalability.
5. It closes protection gaps.
6. It bolsters security.
7. Integration with the cloud native ecosystem.

How Fidelity Investments Built its Multi-cloud Strategy with Cloud Native Technologies

CNCF

A case study article on Fidelity Investments. It covers the following items.
- Challenge
- Solution
- Impact
- One issue that quickly arose was that Fidelity also had distributions of Kubernetes on-prem, as well as on other cloud providers. How could they introduce, for example, a new security process across 1,000 distributed applications?
The web page embeds a video sharing the case study, "End User Panel: GITOPS in the Enterprise -Real World Experiences - Cheryl Hung".

Great work, Chris on setting up this awesome list of Kubernetes-related reading resources! Feel like something important's missing? Submit a PR! https://t.co/mJRGl4Glyy
— Kaslin Fields (@kaslinfields) March 3, 2021

Upcoming CNCF Online Programs

This Week in Cloud Native (Livestream): Kubernetes Community Days: Ask me Anything
Bill Mulligan @CNCF
March 10, 2021
Register Now

Deploying K3s at the Edge for Multiplayer Gaming
Marco Mancini @OpenNebula
March 11, 2021
Register Now

CNCF Online Programs Playlist on YouTube

Check out our playlist for more curated content you don’t want to miss! New content is added every Friday.

For more information, please visit our updated Online Programs page.

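How was it? Did any of the articles or information catch your interest?

There is still a lot here that I have not fully digested myself, so I will keep using this memo and link collection to deepen my understanding.

See you next time.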

Bye now!!

Yoshiki Fujiwara