SRE / DevOps / Kubernetes Weekly Report Roundup #77 (2021/7/18 - 7/23) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

The English Version of this blog is here.
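This post is a combined memo and link collection from my reading of the following three weekly reports, published between July 18 and July 23, 2021.
Because I want to deliver and share the information as quickly as possible, I publish early, as soon as I have checked the blog links. I add my own comments as I go.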
DEVOPS WEEKLY ISSUE #551 July 18th, 2021
- News
- Tools
  - A handy Kubernetes operator that simplifies the management of Role Bindings and Service Accounts, using a declarative configuration for RBAC with new custom resources.
  - Moco is a MySQL operator on Kubernetes using GTID-based semi-synchronous replication.
SRE Weekly Issue #279 July 18th, 2021
- Articles
- Outages
KubeWeekly #269 July 23rd, 2021 (apparently uploaded on July 26)

I hope it serves as a source of information for someone, or saves some search effort.

DEVOPS WEEKLY ISSUE #551 July 18th, 2021

SRE Weekly Issue #279 July 18th, 2021

KubeWeekly #269 July 23rd, 2021 (apparently uploaded on July 26)

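If anything in this post is unclear or raises questions, I would appreciate it if you could check the original article via its URL and then point it out.
I make a point of commenting even on topics I understand only superficially, so there are probably many misunderstandings caused by my own mistakes or insufficient explanation.
Because there is a lot of information, I have limited the post to text and links.
Some of the articles featured in each report date from 2020 or earlier, so they are not necessarily the latest.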

DEVOPS WEEKLY ISSUE #551 July 18th, 2021

News

The NTIA Multistakeholder process has published the minimum elements for a software bill of materials. Lower level than most considers will care for, but lots happening in this space at the moment on the standards and early tooling fronts.

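This is a pair of articles. The first link above is titled "NTIA Releases Minimum Elements for a Software Bill of Materials".
- An introduction to the SBOM minimum elements published by the Department of Commerce and the NTIA (National Telecommunications and Information Administration), following the Executive Order from US President Biden directing that cyber incidents be treated as a top priority. The aim is to help build a more transparent and secure software supply chain.
The second is titled "The Minimum Elements For a Software Bill of Materials (SBOM)".
- It introduces and links to the report of the same name, "The Minimum Elements For a Software Bill of Materials (SBOM)".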
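
To make the "minimum elements" more concrete for myself, here is a minimal Python sketch of one SBOM entry carrying the seven data fields the NTIA report lists (supplier name, component name, version, other unique identifiers, dependency relationship, author of SBOM data, timestamp). The class name, attribute names, helper function, and sample values are my own illustration, not a schema defined by the report; real SBOMs would use a format such as SPDX or CycloneDX.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class SbomComponent:
    """One component record; attribute names are illustrative, not a standard schema."""
    supplier_name: str
    component_name: str
    version: str
    unique_identifiers: List[str]  # e.g. a package URL or CPE
    depends_on: List[str]          # names of upstream components
    sbom_author: str
    timestamp: str                 # when this SBOM data was assembled


def missing_fields(component: SbomComponent) -> List[str]:
    """Return the names of string fields that were left empty."""
    return [
        f.name
        for f in fields(component)
        if isinstance(getattr(component, f.name), str) and not getattr(component, f.name)
    ]


# Hypothetical sample values, for illustration only.
example = SbomComponent(
    supplier_name="Example Corp",
    component_name="example-lib",
    version="1.2.3",
    unique_identifiers=["pkg:pypi/example-lib@1.2.3"],
    depends_on=["example-base"],
    sbom_author="Example Corp build system",
    timestamp="2021-07-18T00:00:00Z",
)
print(missing_fields(example))
```

A check like `missing_fields` is only a toy completeness test; it says nothing about whether the recorded values are accurate.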

A look at the tradeoffs between testing in a pipeline and observability in production for data engineering.

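The title is "Testing vs Observability: Which is right for your data quality needs?".
It explains in detail the two main approaches for judging whether data is correct and of high quality. It covers each approach and its potential drawbacks, and also touches on the cases where combining the two makes the most sense.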
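
Here is a rough Python sketch of how I picture the difference between the two approaches: a pipeline test is a hard rule that stops a batch before it is published, while observability watches a metric of the data already in production and flags drift. The function names, the `amount` column, the tolerance, and the sample rows are all hypothetical and do not come from the article.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def pipeline_test(rows: List[Row]) -> None:
    """Testing: fail the pipeline run when a known rule is violated."""
    for i, row in enumerate(rows):
        if row["amount"] is None:
            raise ValueError(f"row {i}: amount is null")
        if row["amount"] < 0:
            raise ValueError(f"row {i}: amount is negative")


def null_rate(rows: List[Row]) -> float:
    """Observability metric: share of rows whose amount is missing."""
    return sum(1 for r in rows if r["amount"] is None) / len(rows)


def is_anomalous(history: List[float], today: float, tolerance: float = 0.05) -> bool:
    """Observability: flag today's metric if it drifts from the recent average."""
    return abs(today - mean(history)) > tolerance


good_batch = [{"amount": 10.0}, {"amount": 25.5}]
pipeline_test(good_batch)  # passes silently, so the batch may be published

production_rows = [{"amount": 10.0}, {"amount": None}, {"amount": None}, {"amount": 3.0}]
print(is_anomalous([0.01, 0.02, 0.01], null_rate(production_rows)))
```

The test only catches rules someone thought to write down in advance, whereas the metric can surface a problem nobody anticipated, which is roughly why I can see the two being combined.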

A post describing 6 categories of security posture in need of management, from cloud and applications to identity and devices.

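The title is "6 Categories of Cybersecurity Posture".
This is the second article in the nine-part series "Building Your Cybersecurity Posture". As the title says, it lists the following six "Cybersecurity Postures", a collection of key risk indicators that together measure an organization's exposure to potential risk. Later articles in the series apparently explain them one by one.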
1. Cloud Security Posture Management
2. Application Security Posture Management
3. Data Security Posture Management
4. Identity Access Posture Management
5. Network Security Posture Management
6. Device Security Posture Management

The OpenTelemetry standard is making it easier for generic client libraries to have built-in instrumentation, but there are still interesting tradeoffs and design decisions as discussed in this post.

The title is "OpenTelemetry in client libraries".
As stated in the Disclaimer at the top, the author gives opinions on the topic based on personal experience and hopes for feedback from readers.

Large open source projects have interesting dynamics. This thoughtful post from the Knative project considers whether a project or product mindset would be best for the long term future of the project.

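The title is "On Product vs. Project".
One article in what the author calls the "Knative Product Series". It further defines the following Knative-related concepts and explains the impact of this shift and how "formalizing" it would affect Knative's future.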
1. This change is a simple shift in mental model: a shift from Knative the project to Knative the product.
2. One could even chart this change along recent community actions: beginning with the acceptance of the UX WG and cresting with the conversations being had around the introduction of func.
3. The question we, as a community, ought to be asking ourselves is simple: who is the audience of our work? Are we building a project for vendors or a product for end users?

A post on the complexity of branching strategies and the fact many teams just take that friction for granted rather than try something simpler.

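The title is "ON THE EVILNESS OF FEATURE BRANCHING - A TALE OF TWO TEAMS".
As the title suggests, this is the first article in the author's "On the Evilness of Feature Branching" series. Personally, I have never experienced Trunk-Based Development (TBD) and started out with GitFlow, so I find it very interesting.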

A deep dive on what’s happening under the hood of AWS Lambda.

The title is "Behind the scenes, AWS Lambda".
An article that digs into Lambda under the following headings.
- Deployment Interoperability
- Breaking down the architecture
- Load Balancing/Scaling
- Worker layers
- Synchronous Execution Path
- Asynchronous/Events Execution Path
- Firecracker
- Footnotes

An example of using Google Cloud, Pulumi and Debezium, a Change Data Capture framework, to build a fault tolerant event driven architecture.

The title is "Building a fault-tolerant event-driven architecture with Google Cloud, Pulumi and Debezium".
As the title says, it explains how to build a fault-tolerant event-driven architecture (a resilient and scalable system).

A nice summary of what observability is and why it’s important.

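The title is "Unpacking Observability".
The explanation centers on "Observability-Driven Development (ODD)".
The term "instrumentation" also comes up in other articles, and I had not really understood it. Is it right to understand it as in the following quotation?
- "The term instrumentation here means the specific capability to monitor or measure a product's level of performance and to diagnose errors. In programming, it refers to the capabilities an application incorporates."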
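
To check my own understanding of "instrumentation", here is a minimal Python sketch in which a decorator adds call, error, and timing measurements to a function without changing its logic. The decorator name `instrumented` and the in-memory `METRICS` store are hypothetical; a real setup would hand these measurements to a library such as OpenTelemetry rather than a dictionary.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a metrics backend (an assumption for this sketch).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(func):
    """Record call count, error count, and elapsed time for func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        stats = METRICS[func.__name__]
        stats["calls"] += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise  # instrumentation observes the error; it does not swallow it
        finally:
            stats["total_seconds"] += time.perf_counter() - start
    return wrapper


@instrumented
def divide(a, b):
    return a / b


divide(6, 3)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass

print(METRICS["divide"]["calls"], METRICS["divide"]["errors"])  # prints "2 1"
```

Seen this way, instrumentation is the measuring code built into the application, and observability is what you can then learn from the measurements it emits.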

Tools

A handy Kubernetes operator that simplifies the management of Role Bindings and Service Accounts, using a declarative configuration for RBAC with new custom resources.

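The GitHub page of "RBAC Manager", which simplifies authorization in Kubernetes.
It is an Operator that supports declarative configuration for RBAC using new custom resources. Instead of managing role bindings or service accounts directly, you specify the desired state.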
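
As I understand it, "specifying the desired state" is the usual reconcile pattern: compare what should exist with what does exist and act on the difference. The following Python sketch illustrates only that idea. It is not RBAC Manager's code or API, and the `Binding` tuple, the `reconcile` function, and the sample names are my assumptions.

```python
from typing import NamedTuple, Set, Tuple


class Binding(NamedTuple):
    subject: str    # user or service account
    role: str       # Role or ClusterRole name
    namespace: str


def reconcile(desired: Set[Binding], actual: Set[Binding]) -> Tuple[Set[Binding], Set[Binding]]:
    """Return (to_create, to_delete) so that actual converges on desired."""
    return desired - actual, actual - desired


# Hypothetical state: what the declarative config asks for vs. what the cluster has.
desired = {Binding("ci-bot", "edit", "web"), Binding("alice", "view", "web")}
actual = {Binding("alice", "view", "web"), Binding("bob", "admin", "web")}

to_create, to_delete = reconcile(desired, actual)
print(sorted(to_create), sorted(to_delete))
```

In this example the binding for `ci-bot` would be created and the stale one for `bob` removed, while the binding for `alice` is left untouched.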

Moco is a MySQL operator on Kubernetes using GTID-based semi-synchronous replication.

The GitHub page of "Moco", a MySQL Operator for Kubernetes.
Its main function is managing MySQL clusters that use GTID-based semi-synchronous replication. It does not manage group replication clusters.

SRE Weekly Issue #279 July 18th, 2021

Articles

Managing the Risk of Cascading Failure

This is a presentation by Laura Nolan (with text transcript) all about cascading failure, what causes it, how to avoid it, and how to deal with it when it happens.

I love how succinct this is:

[…] in any system where we design to fail over, so any mechanism at all that redistributes load from a failed component to still working components, we create the potential for a cascading failure to happen.

Laura Nolan — Slack (presented at InfoQ)

As the Editor's comment above says, this is a transcript of a presentation on cascading failure. It includes six anti-patterns and a Q&A, so there is plenty of material.
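
The quoted sentence about failover can be illustrated with a toy simulation: when the load of a failed node is spread evenly over the survivors, everything depends on whether the survivors have headroom. This is a simplified sketch of my own, not a model from the talk; the function name, the even-spread assumption, and the numbers are hypothetical.

```python
def surviving_nodes(nodes: int, capacity: float, total_load: float, failed: int = 1) -> int:
    """Return how many nodes remain once failover redistribution settles.

    Load is spread evenly over healthy nodes. If the per-node share exceeds
    capacity, assume one more node fails and the load is redistributed again.
    """
    healthy = nodes - failed
    while healthy > 0 and total_load / healthy > capacity:
        healthy -= 1  # an overloaded node fails, pushing its load onto the rest
    return healthy


# 10 nodes, each able to serve 100 requests per second, with one initial failure.
print(surviving_nodes(nodes=10, capacity=100, total_load=850))  # prints 9
print(surviving_nodes(nodes=10, capacity=100, total_load=950))  # prints 0
```

With 850 req/s the nine survivors absorb the failure, but with 950 req/s each redistribution overloads the remaining nodes further until none are left, which is the cascade.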

The greedy exec trap

It’s so easy to explain an incident by describing how management could have prevented it by investing additional resources.

Lorin goes on to explain the “trap” part: it’s easy to stop investigating an incident too soon and declare the cause “greedy executives”, preventing us from learning more.

Lorin Hochstein

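As the Editor's comment and the excerpt above say, once "greedy executives" is declared the cause, any additional investment in or learning from the incident comes to nothing.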

r/WallStreetBets Incident Anthology (What Worked Edition): Recently Consumed

They redesigned one of their caching systems in 2020, and it paid off handsomely during the GameStop saga. This article discusses the redesign and considers what would have happened without it.

Garrett Hoffman — Reddit

One article in an incident retrospective series. As the "What Worked" edition, this one explains how the redesigned caching mechanism paid off.

Pragmatic Incident Response: 3 Lessons Learned from Failures

The lessons are:

1. Do retrospectives for small incidents first.
2. Do a retrospective soon after the incident.
3. Alert on the user experience.

All great advice, and #1 is an interesting idea I hadn’t heard before.

Robert Ross — FireHydrant

It digs into the three lessons the Editor excerpted above, with the aim of helping readers build a better system for pragmatic incident response.

De-Siloing Incident Management: How to Make Reliability Engineering Everyone’s Job

We can’t engineer reliability in a vacuum. This is a great explainer on how SRE siloing happens, the problems it causes, and how to break SRE out of its shell.

JJ Tang — Rootly

It explains why breaking through the silo walls that separate SRE from other teams is so important, and presents four practical strategies for doing so.

CALLBACK 498, July 2021 – Aircrew Resilience

This ASRS (Aviation Safety Reporting System) Callback issue has some real-world examples of resilient systems in action.

NASA ASRS

The issue is built around the following three incident reports shared by CALLBACK.
- Cleared for Takeoff
- Initial Operating Experience (IOE)
- A Slippery Slope

Automatic Remediation of Kubernetes Nodes

Facing a common kubernetes node failure modes, Cloudflare uses open source tools (one published by them) to perform automatic restarts.

In the past 30 days, we’ve used the above automatic node remediation process to action 571 nodes. That has saved our humans a considerable amount of time.

Andrew DeMaria — Cloudflare

As above, it introduces the automatic Kubernetes node remediation process that Cloudflare used over the past 30 days, along with "Sciuro", a tool the company open-sourced.

Outages

Outage information from the various companies listed in the issue.

KubeWeekly #269 July 23rd, 2021 (apparently uploaded on July 26)

The Headlines

Editor’s pick of the highlights from the past week.

Last chance to submit for 9 KubeCon + CloudNativeCon North America co-located event CFPs

Calling all speakers! Do you have a tutorial or interesting use case to share related to our co-located events at KubeCon + CloudNativeCon North America? Be sure to submit your CFP as the deadline for the following events is July 25 at 11:59 PM PT. Details below:

As above, this announces the CFP deadline for the KubeCon + CloudNativeCon North America co-located events. However, KubeWeekly #269 was uploaded on July 26, so the announcement effectively came after the deadline.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

Kyverno in production

Jim Bugwadia, Nirmata

An approximately one-hour session explaining how to use Kyverno to improve the security posture of Kubernetes clusters.

Performance improvements in Etcd 3.5 release

Wilson Wang, Bytedance

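As the title says, an approximately 22-minute session explaining the important changes in the etcd 3.5 release, which brought significant performance improvements.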

How Airbnb manages a dense SOA of 1000s of services across dozens of clusters

Stephan Chan, Airbnb

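In an approximately 47-minute CNCF End User Lounge session, Airbnb shares how it has grown while running Kubernetes in production since 2018, along with what it learned by giving careful consideration to cluster immutability and to the interfaces it exposes to both users and other infrastructure teams.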

Hope you are having a great honking day today /honk pic.twitter.com/lKMnYuShdU
— Jason DeTiberus (@detiber) July 22, 2021

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

How to deploy a cross-cloud Kubernetes cluster with built-in disaster recovery

Alex Feiszli, ITnext

It explains how to run a single Kubernetes cluster spanning a hybrid cloud environment with MicroK8s, WireGuard, and Netmaker in order to improve fault tolerance.

Kubernetes Semaphore: A modular and nonintrusive framework for cross cluster communication

UW Labs

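Assuming "an environment of three clusters spanning three different providers (AWS, GCP, on-premises), where applications running in different clusters need to be able to communicate with each other", it explains how to use the framework named in the title to implement the following.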
- Pod to Pod comms are encrypted across clusters
- Ability to target a remote cluster Kubernetes Service
- Add rules to allow certain applications from a remote cluster talk to local endpoints

Why you should build on Kubernetes from day one

Max Horstmann, Stack Overflow

As the title says, it argues on the following points that "if you are building a new app, it may be worth taking a closer look at making it cloud native and using Kubernetes from the start".
- Managed Kubernetes does the heavy lifting
- You can stay (somewhat) cloud agnostic
- You can easily spin up new environments—as many as you like!

Learn how to simplify application management with Operators with the CNCF Operator White Paper

TAG App Delivery

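An introduction to a white paper that aims to provide a definition of, and a comprehensive guide to, operators for cloud native applications in the context of Kubernetes and other container orchestrators.
The full text is here. The PDF version is here. An overview follows.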
- The Operator Design Pattern and emerging patterns for the future.
- Recommended configuration, implementation, and use cases for an operator application management system.
- Best practices including observability and security, technical implementation, and CNCF maintained code samples.
- Advice for organizations wanting to design their own operators.

KubeEdge@MEC: Combining the Kubernetes ecosystem with 5G

KubeEdge Maintainers

It opens by noting that the KubeEdge community launched the MEC SIG to address the rapidly developing field of MEC (Multi-Access Edge Computing), then explains the topic under the following headings.
- Background
- 5G Edge-Cloud Synergy from Different Perspectives
- Challenges in 5G Edge-Cloud Synergy
- Introduction to KubeEdge MEC SIG
- Conclusion

Those who register for a #Kubernetes certification exam, including #CKA, #CKAD and #CKS, will now receive access to the @_killer_shell exam simulator, providing two practice tests! Learn more at https://t.co/and0ZQcz4G #learnlinux #LFCert #k8s @cloudnativefdn pic.twitter.com/L7bKvtkWaT
— Linux Foundation Training & Certification (@LF_Training) July 21, 2021

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

The Internet is my computer

Alex Ellis, CNCF Ambassador

It digs a little deeper into what the 1984 phrase "The Network is the computer." by John Gage of Sun Microsystems means for us today, and presents very practical use cases of the Internet.

Why Kubernetes was inevitable

Lars Larsson, Elastisys

An article that follows its title and makes its case on the following points.
- Those who do not understand Kubernetes are condemned to reinvent it, poorly
- Kubernetes gets deployment and orchestration right
- That time I cobbled together a crappy Kubernetes-like platform and why it sucked
- How Kubernetes helps deploy and operate applications
It concludes that "Kubernetes was inevitable.", but I think the sentences that follow are what really matter.

Tech employers get good news from Survey of Industry Workforce

Lawrence E Hecht, The New Stack

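Based on the results of the "Tech Employee Survey 2021" by Protocol | Workplace, it explains that working conditions in the IT industry are good. How about in Japan? I suspect the answer would vary greatly depending on who is surveyed and how the questions are asked.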

DevOps and Cloud InfoQ Trends Report - July 2021

InfoQ

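It summarizes how InfoQ currently sees the "cloud computing and DevOps" space, focusing on fundamental infrastructure and operational patterns, the realization of those patterns in technology frameworks, and the design processes and skills that software architects or engineers need to cultivate.
The Key Takeaways alone run to the following length, so it is packed with information.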
- Hybrid cloud options have evolved beyond the traditional definition, and have expanded to enable the functionality of cloud services to run outside of the cloud. Services such as Azure Arc and Google Anthos allow for a much more seamless "hybrid" experience for developers and operators.
- In the emerging "no-copy data-sharing" approach it is not necessary to move or replicate the data to be able to access it from different services. We believe that the recently announced "Delta Sharing" open standard will contribute to the upward trajectory of no-copy data sharing.
- We believe that there has been limited progress on undoing the confusion between continuous integration (CI) and continuous delivery (CD) tooling and practices. Both GitOps and site reliability engineering (SRE) practices are increasingly being adopted.
- Observability practices and tooling continue to mature. The logging and metrics domains of the three pillars of observability are relatively well-adopted, but the tracing pillar remains less so. There are a number of encouraging advancements in this space, especially with the more widespread adoption of OpenTelemetry.
- We are following the innovative developments with FinOps, real-time information flow for cost analysis, aimed at the finance teams with public cloud vendors like Microsoft and AWS at the frontline.
- Increasingly popular practices, such as "Policy as Code", as promoted by Open Policy Agent (OPA), and remote access management tooling, e.g. HashiCorp's Boundary, are pushing forward identity as code and privacy as code.
- We have seen the Team Topologies book become the de facto reference for arranging teams within an organization to enable effective software delivery. There is also increasing focus on post-incident "blameless postmortems" in becoming more akin to "healthy retrospectives", from which the entire organisation can learn from.

Upcoming CNCF Online Programs

Cloud Native Live

July 28 at 9am PT: Building the Telegraf Kubernetes Operator presented by Wojciech Kocjan, InfluxData - RSVP

YouTube playlist submissions

Looking for more great curated content? Visit our Online Programs playlist on YouTube.

Learn more about CNCF Online Programs

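How was it? Did any of the articles or information catch your eye?

There is still a lot here that I have not fully digested myself, so I plan to deepen my understanding while making use of this memo and link collection.

See you next time.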

Bye now!!

Yoshiki Fujiwara