SRE / DevOps / Kubernetes Weekly Report Roundup #43 (11/22-11/27) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

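The title is "Learnings From Two Years of Kubernetes in Production".
Drawing on their experience operating Kubernetes, the author shares past and present insights and ideas. They reportedly began adopting Kubernetes with kops before EKS became GA in the AWS Singapore Region. Impressive.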

Some observations from last week's KubeCon event, looking at the content presented and what it might mean about the maturity of the community and project.

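The title is "KubeCon 2020 Recap – Maturity in Cloud Native".
Between the time difference and other events I could hardly attend, so I am grateful for this summary. I was surprised that the number of Certified K8s Distributions is shrinking. Thinking about it, though, the competition is fierce and offerings are being consolidated.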

Distributed systems have lots of interesting properties that warrant detailed study, and this up-to-date set of course material (notes slides and videos) is a great start for anyone seeking more in-depth knowledge.

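The title is "New courses on distributed systems and elliptic curve cryptography".
Introduces a new eight-lecture course on distributed systems and a tutorial on elliptic curve cryptography. Notes and YouTube videos are available, which is much appreciated.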

An interesting look at the scale of the growing BPF ecosystem. Lots of tools at lots of different layers of the stack, with a focus on Kubernetes use cases.

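The title is "Beyond the buzzword: BPF's unexpected role in Kubernetes".
From a KubeCon NA presentation. I keep glancing at the entrance to the eBPF rabbit hole without ever stepping in, so I should just block time for it on my calendar.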

A new report on container adoption. Lots of interesting aggregate data on cloud provider Kubernetes adoption, popular stateful container applications, container registry usage and more.

The title is "11 FACTS ABOUT REAL-WORLD CONTAINER USE".
There are many items, but each one is short and easy to scan, so it is a quick and useful read.
1. Kubernetes runs in half of container environments
2. Nearly 90 percent of containers are orchestrated
3. A majority of Kubernetes workloads are underutilizing CPU and memory
4. GKE, AKS, and EKS dominate on their respective cloud platforms
5. 1 in 3 AWS container environments runs Fargate
6. Larger Kubernetes clusters contain larger nodes
7. Networking technologies are prevalent among DaemonSets
8. The most popular Kubernetes version is 17 months old
9. Organizations are in the early stages of service mesh adoption
10. Half of all containers are now managed by cloud provider and third-party registries
11. NGINX, Redis, and Postgres are the most popular container images

A 12 part series on running .NET applications on Kubernetes. Everything from Helm charts to health checks and database migrations to rolling deployments.

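The title is "Series: Deploying ASP.NET Core applications to Kubernetes".
A series in which the author shares what they learned while deploying ASP.NET Core apps to Kubernetes. It currently runs to Part 12, and each part is written to be easy for readers to follow.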

Tools

Ever found yourself wanting to quickly look up DNS information? Dog is a CLI tool with a nice user interface and the ability to output JSON.

Web page for "dog", an open-source command-line DNS client. The output is colorful and easy to read.

illuminatio is a tool for automatically testing kubernetes network policies. Simply execute illuminatio clean run and illuminatio will scan your kubernetes cluster for network policies, build test cases accordingly and execute them to determine if the policies are in effect.

GitHub page for "illuminatio", a tool that automatically tests Kubernetes Network Policies.

Karpenter is a metrics-driven autoscaler built for Kubernetes and can run in any Kubernetes cluster anywhere. It's performant, extensible, and can autoscale anything that implements the Kubernetes scale subresource

GitHub page for "Karpenter", a metrics-driven autoscaler built for Kubernetes that can run in any Kubernetes cluster anywhere. It is still in developer preview. I have a feeling I covered it before, but I cannot find the entry.

SRE Weekly Issue #245 November 22nd, 2020

Articles

Trust Asia 2021 has produced inconsistent STHs

A Certificate Transparency (CT) log failed, resulting in its permanent retirement. The incident involved unintended effects from load testing being performed in a staging environment. I have a huge amount of admiration and respect for the transparency of certification authorities (CAs) when things go wrong.

Trust Asia

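A thread on the "Certificate Transparency Policy" Google Group that shows the full flow of the CA Trust Asia's incident: the initial inquiry, the investigation results, and the planned remediation.
The root cause was that the test cluster mistakenly used the production etcd address together with the test machines, so the machines the test cluster used for elastic scaling were wrongly attached to the production cluster.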

Knowing your systems and how they can fail: Twilio and AWS talk at Chaos Conf 2020

I like the idea that adding the ability to fail over to your system makes it much more complicated and thus more likely to fail.

Andre Newman — Gremlin

Picks up and explains two presentations from Chaos Conf 2020. Both talks are embedded in the web page.
Both presentations offer answers to the following key question.
- What are we aiming to accomplish with Chaos Engineering, and how do we do it thoughtfully?

Building for reliability at HelloSign

This one introduces some interesting concepts: the error kernel and property testing.

Kenneth Cross — HelloSign

Introduces HelloSign's product and concepts under the following headings.
1. Kernel panic!
2. A brief guide on how to save the day
3. How we use property testing
4. Conclusion
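
For readers new to the property testing mentioned in the blurb above, here is a minimal sketch of the idea: instead of checking a few hand-picked examples, generate many random inputs and assert that an invariant holds for all of them. This is not HelloSign's code. The run-length encoder, the function names, and the round-trip property are all hypothetical, chosen only for illustration, and the sketch uses just the standard library rather than a dedicated property-testing framework.

```python
import random


def encode_rle(text):
    """Run-length encode a string into a list of (char, count) pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            # Same character as the previous run: extend that run.
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def decode_rle(runs):
    """Inverse of encode_rle: expand (char, count) pairs back into a string."""
    return "".join(ch * count for ch, count in runs)


def check_roundtrip_property(trials=500, seed=0):
    """Property: decoding an encoded string always returns the original."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(trials):
        length = rng.randint(0, 30)
        text = "".join(rng.choice("ab c") for _ in range(length))
        assert decode_rle(encode_rle(text)) == text, repr(text)
    return trials
```

Calling `check_roundtrip_property()` exercises the invariant on 500 random strings. A small alphabet is used on purpose so that repeated characters, the interesting case for this encoder, occur often.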

Tech Startup Dilemmas: Resilient Deployment vs. Exhaustive Tests

[…] to be resilient, we must test everything, which consumes time that we don’t spend innovating. A good trade-off is to test in production.

Xavier Grand — Algolia

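In line with the title, the article explains the need to find the right balance between resilience and innovation in order to respond to market changes and new needs.
The point of view in the excerpt above was new to me: testing in production as a trade-off.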

8 Tips to Create an Accurate and Helpful Post-Mortem Incident Report

More useful tips as you develop your post-incident analysis process. I like their definition of “blameless”.

Zachary Flower — Splunk

As the title says, it explains the following eight tips.
1. Don’t assign blame
2. Do take responsibility
3. Don’t procrastinate
4. Do gather information
5. Don’t be vague
6. Do define clear owners
7. Don’t lose focus
8. Do use a consistent template

Achieving exactly-once message processing with Ably

Exactly once delivery is hard to implement and requires explicit coordination at all levels, including the client. Ably explains how their flavor works.

Paddy Byers — Ably

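The article aims to clarify what "exactly-once" means in the context of distributed pub/sub systems and to help readers understand what Ably's "exactly-once" guarantees actually mean.
In line with that theme, it digs in starting from the semantic types of messaging.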
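
To make the client-side coordination in the blurb above more concrete, here is a generic sketch of one widely used building block: an idempotent consumer that deduplicates by message ID, so that redelivered messages under at-least-once delivery are processed only once. This is not Ably's mechanism. The class and parameter names are hypothetical, and the in-memory set stands in for storage that a real system would need to persist atomically with the handler's side effects.

```python
class IdempotentConsumer:
    """Processes each message ID at most once, even if it is redelivered."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: an in-memory set is enough for this illustration.
        self._seen_ids = set()

    def receive(self, message_id, payload):
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried on redelivery.
        self._seen_ids.add(message_id)
        return True
```

For example, delivering IDs `m1`, `m1`, `m2` to a consumer whose handler appends to a list leaves that list with two entries, because the second `m1` is dropped as a duplicate.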

Why you should frequently turn down ~30% of canary instances

The most effective (if scary) way to understand how your stateless service operates under load

Utsav Shah — Software at Scale

Explains approaches you can take when it is unclear how to find the scalability limits of a service.
- Common approach: start with synthetic load generated by scripts
- Utilization DRT (Disaster Recovery Test)
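
As a back-of-the-envelope illustration of why turning down instances reveals headroom, the sketch below estimates per-instance utilization after removing a fraction of a stateless fleet. It rests on two simplifying assumptions of mine, not taken from the article: traffic is spread evenly across instances, and utilization scales linearly with traffic. The function name is hypothetical.

```python
def utilization_after_turndown(current_utilization, turndown_fraction):
    """Estimate per-instance utilization once a fraction of instances is removed.

    Assumes evenly balanced traffic and linear scaling: the same total load
    is shared by the remaining (1 - turndown_fraction) of the fleet.
    """
    if not 0 <= turndown_fraction < 1:
        raise ValueError("turndown_fraction must be in [0, 1)")
    return current_utilization / (1 - turndown_fraction)
```

Under these assumptions, a fleet running at 35% utilization would land at roughly 50% after a 30% turn-down. If the estimate exceeds 100%, the remaining instances could not absorb the load.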

The Engineer’s Guide to Preparing for Black Friday 2020

Some good tips here — and a reminder that we may see even more traffic than normal due to social distancing.

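The opening notes that, with COVID-19 around, crowds flocking to stores on Black Friday are dangerous, and that the event has been shifting online over the past few years, a trend expected to accelerate this year.
It explains how to handle a Black Friday unlike any seen before, around the following point.
- How SLO-based alerting, runbooks, and other practices that drive preparedness are key to a successful holiday season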
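
As a rough sketch of what SLO-based alerting can look like, the code below computes a burn rate: the observed error rate divided by the error budget implied by the SLO target. A burn rate of 1 means the budget is being spent exactly as fast as the SLO allows. This is a generic illustration rather than the article's method. The function names are hypothetical, and the default paging threshold of 14.4 is an assumption borrowed from commonly cited guidance, to be tuned per service.

```python
def burn_rate(error_count, request_count, slo_target):
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if request_count == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    error_budget = 1.0 - slo_target
    return (error_count / request_count) / error_budget


def should_page(error_count, request_count, slo_target, threshold=14.4):
    """Page when the budget burns faster than the (assumed) threshold."""
    return burn_rate(error_count, request_count, slo_target) > threshold
```

With a 99% target, 10 errors in 1,000 requests burns the budget at about the allowed pace, while 200 errors in 1,000 requests burns it about 20 times too fast and would trigger a page under the default threshold.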

Outages

ASX (Australian Stock Exchange)
Coinbase
GoDaddy
GoDaddy’s statement took care to explicitly state that the outage was not a security incident. This may be because they appear to have had an unrelated security incident around the same time, and some customer domains were taken over.
Nest

Outage information for the companies listed above.

KubeWeekly #242 ← appears to be on break again this week

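How was it? Did any of the articles or information catch your interest?

There is still a lot here that I have not fully digested myself, so I plan to deepen my understanding while using this combined memo and link collection.

See you next time.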

Bye now!!

Yoshiki Fujiwara