SRE / DevOps / Kubernetes Weekly Report Roundup #52 (2021/1/24-1/29) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

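This post is a memo and link collection based on reading the following three Weekly Reports published between 2021/1/24 and 2021/1/29.
Because I want to share information as quickly as possible, I publish the post as soon as I have checked the blog links, and add my own comments afterwards.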
English Version of this blog is here.
DEVOPS WEEKLY ISSUE #526 January 24th, 2021
- News
- Tools
SRE Weekly Issue #254 January 24th, 2021
- Articles
- Outages
KubeWeekly #248 January 29th, 2021

I hope it serves as a source of information for someone, or saves some search effort.

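If anything in this post raises questions or is unclear, I would appreciate it if you could check the original text via the URL and point it out.
I make a point of commenting even on topics I understand only superficially, so there are likely many misunderstandings caused by my own mistakes or insufficient explanation.
Because there is a lot of information, I have limited the post to text and links.
Some of the articles featured in each report date from 2020 or earlier, so they are not necessarily the latest.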

DEVOPS WEEKLY ISSUE #526 January 24th, 2021

News

A post on the evolution of the relationship between development and security teams, proposing 4 levels of maturity.

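The title is "Four levels of maturity that bridge the AppSec / engineering divide".
It proposes integrating security work into continuous delivery (CD) as one of the most useful tools for getting security and engineering to work well together.
It describes the following four typical maturity levels that security and engineering organizations pass through when building continuous integration (CI) and automation pipelines.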
- Level 1: Security finds problems; Engineering fixes them
- Level 2: Security and Engineering collaborate to produce test cases and remediations
- Level 3: After the issue is fixed, Security and Engineering collaborate to find systemic fixes and develop checks
- Level 4: Security and Engineering now also proactively look for new classes of issues and create systemic checks before an actual problem occurs

There are lots of interesting things about the rise of ARM for server workloads, but one that will likely drive adoption is price/performance. This post looks at a series of PostgreSQL benchmarks.

The title is "PostgreSQL on ARM-based AWS EC2 Instances: Is It Any Good?".
Following AWS's May 2020 announcement of EC2 instances based on the second-generation Graviton2, the post tests PostgreSQL on ARM-based EC2 instances, as the title says.

A good web performance case study, with lots of examples, discussion of tools, code samples and improvements made.

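The title is "How We Improved SmashingMag Performance".
It explains in detail the improvement work on the website that hosts this blog post, which runs on a JAMStack using React. The team optimized web performance and improved its Core Web Vitals metrics.
Core Web Vitals are a subset of Web Vitals. Web Vitals is an initiative Google announced in 2020 to provide unified guidance on the quality signals essential to delivering a great user experience on the web.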

Rust is picking up lots of interest recently, especially for systems work or low-level CLI tooling. But it might not be suitable, as a language or an ecosystem, yet for higher-level work like web development and APIs.

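The title is "Rust is a hard way to make a web API".
After touching on Rust's strengths at the start, it describes the struggles referred to in the title, based on the author's own experience.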

Lots of software benefits from a custom installer, but what makes for a good user experience for this kind of software? This post shares some thoughts and examples.

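The title is "Design choices for a declarative installer".
Focusing on installing, upgrading, and removing a set of Kubernetes components, the post opens by noting that, depending on the target environment, the off-the-shelf software below can be used, but that the amount of configuration that has to be tweaked to integrate multiple components can become a source of frustration, errors, and nightmares.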
- For Kubernetes apps there is Helm and Continuous Delivery systems like Argo that can manage applications lifecycle described simply in naked yaml.
- For pure operators there’s Operator Lifecycle Manager (OLM).
- For more general infrastructure there is Terraform.
It then presents the design choices that led to the current approach, as a way to solve the problems above and create a better user experience.

As the editor notes, this is the web page of "Konveyor", a community focused on tool-assisted migration. It works on tools including the following.
- crane - Migrate namespaces between Kubernetes clusters.
- forklift - Migrate virtual machines to KubeVirt.
- move2kube - Migrate from Cloud Foundry or Docker Swarm to Kubernetes.
- pelorus - Measure the four critical measures to software delivery performance.
- windup - Analyze applications for modernization paths.

Tools

Cinc is a community project to build a free distribution of the Chef software stack (currently including the Infra, Workstation and Inspec tools), released under an Apache 2.0 license.

The web page of the "Cinc" project, which has the following two goals.
1. Making Chef Software Inc’s open source products easily distributable, by anyone
2. Creating free distributions of Chef Software Inc’s open source products
The phrase under the logo, "CINC is not Chef", is reminiscent of "YAML Ain't Markup Language".

Web Assembly is a low level technology which is likely to have wide ranging influence. A good example of the kinds of innovation it makes possible are things like Artichoke, a new Ruby language which compiles to a WASM binary.

The web page of "Artichoke", a Ruby implementation written in Rust and Ruby.
The GitHub page is here.

PolicyHub CLI is a CLI tool that provides a simple discovery engine for finding useful Rego policies for Open Policy Agent.

The GitHub page of "PolicyHub CLI", which gives policy authors a standard format for sharing policies so that the policies become searchable.

Biome is a community distribution of Chef Habitat released under the Apache 2.0 license.

The web page of "Biome", a community distribution of Chef Habitat™.
The GitHub page is here.

SRE Weekly Issue #254 January 24th, 2021

Articles

Coinbase Incident Post Mortem: January 6–7, 2021

This one’s juicy. At one point, the front-end was blocked up, so the back-end saw less traffic and scaled down. Then when the traffic came flooding back, the back-end was ill-prepared. We can all learn from this.

Coinbase

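As the title says, this is Coinbase's post mortem. It has been updated now that the findings are all in and the post mortem is complete. It explains in detail the cause of the downtime, how it was fixed, and the steps being taken to prevent similar outages.
The outage affected coinbase.com and the API used to serve the mobile apps, but trading on the exchange via the API and the health of the underlying markets were not affected.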

Soar: Simulation for Observability, reliAbility, and secuRity

Cloudflare has what amounts to a sophisticated staging environment for testing new code.

Yan Zhai — Cloudflare

It explains simulation, one of the techniques used to fight software complexity.
Cloudflare's simulation system "SOAR", which gives the post its title, is the following environment.
- Simply put, it’s a data center built specifically for simulations. It runs the same software stack as our production data centers, but without any production traffic. Within SOAR, there are end-user servers, product servers, and origin servers (Figure 2). The product servers behave exactly the same as servers in our production edge network, and they are the targets that we want to test.

Failing to make progress under excess request load

Sometimes rolling back doesn’t actually get you back to a good state, especially when there’s pent-up demand.

Rachel By the Bay

The author shares an outage they experienced. What happened is as described in the title and in the editor's comment above.
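
To make the "pent-up demand" situation concrete, here is a minimal simulation sketch. It is not taken from the article: the single FIFO worker, the function name `useful_responses`, and all numbers are assumptions chosen for illustration. It compares a server that works through every queued request with one that skips requests whose client has already given up, which is one common mitigation.

```python
from collections import deque


def useful_responses(arrivals, service_time, client_timeout, drop_expired):
    """Count responses that finish before the client's timeout.

    Hypothetical model: one worker, FIFO queue, `arrivals` sorted ascending.
    """
    queue = deque(arrivals)
    now = 0
    useful = 0
    while queue:
        arrived = queue.popleft()
        now = max(now, arrived)  # the worker may sit idle until the request arrives
        if drop_expired and now - arrived >= client_timeout:
            continue  # the client has already given up, so skip the work
        now += service_time
        if now - arrived <= client_timeout:
            useful += 1  # answered while the client was still waiting
    return useful


# Assumed workload: a backlog of 10 requests at t=0, then one new request per time unit.
arrivals = sorted([0] * 10 + list(range(1, 11)))
without_shedding = useful_responses(arrivals, 1, 2, drop_expired=False)
with_shedding = useful_responses(arrivals, 1, 2, drop_expired=True)
```

In this toy model the server that processes everything in order spends almost all of its time on requests nobody is waiting for any more, while the one that drops expired requests catches up with live traffic.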

Google Cloud Issue Summary — Google Meet — 2021-01-08

Here’s Google’s follow-up on a Google Meet outage earlier this month.

Google

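A summary of the outage named in the title. The Google Meet outage made the landing page inaccessible. With the release of a new landing page, redirects had been configured between the old and new landing pages, and a redirect loop occurred.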
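
As a rough illustration of how a redirect loop of this kind can be caught before release, the sketch below follows a table of redirect rules and reports a cycle. The function name `find_redirect_loop` and the paths are hypothetical and do not reflect Google's actual configuration.

```python
def find_redirect_loop(redirects, start):
    """Follow redirect rules from `start`.

    Returns the chain of paths that forms a loop, or None if the chain ends.
    `redirects` maps a source path to its target path.
    """
    seen = []
    url = start
    while url in redirects:
        if url in seen:
            # We came back to a path already visited: report the cycle.
            return seen[seen.index(url):] + [url]
        seen.append(url)
        url = redirects[url]
    return None


# Hypothetical rules: the old page points at the new one and vice versa.
looping_rules = {"/old-landing": "/new-landing", "/new-landing": "/old-landing"}
loop = find_redirect_loop(looping_rules, "/old-landing")
```

A check like this could run against the redirect configuration in a pre-release test, assuming the rules are available as a simple source-to-target mapping.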

The Next Gen Database Servers Powering Let’s Encrypt

Those are some seriously big database servers.

Josh Aas and James Renken — Let’s Encrypt

It explains that the database server upgrade Let's Encrypt carried out in late 2020 has delivered satisfying results.

Incident Management in 2021: from Basics to Best Practices

A great general overview of all aspects of incident response, including definitions and best practices.

Better Uptime

It covers the topic in the title along the following items under "5 parts of the incident management process" and "5 steps to a bulletproof incident management process".
- 5 parts of the incident management process
  1. Best incident monitoring practices
  2. Best on-call practices
  3. Best incident alerting practices
  4. Best incident communication practices
  5. Best incident response practices
- 5 steps to a bulletproof incident management process
  1. Best incident monitoring practices
  2. Best on-call practices
  3. Best incident alerting practices
  4. Best incident communication practices
  5. Best incident response practices

Using GPT-3 for plain language incident root cause from logs

Check out what happens when you unleash a generalized language model AI on some log messages related to an incident.

Larry Lancaster — Zebrium

It gives a glimpse of what the author has been doing with OpenAI's "GPT-3 language model". It shares a few simple results obtained using only basic prompts.

Taming Operational Load with VMware CRE

The CRE team at VMware undertook a project to find and reduce toil. Note that “with VMware CRE” does not mean “with some product named VMware CRE™”.

Gustavo Franco — VMware

It explains the following outcome of an operational load assessment recently completed by VMware's CRE (Customer Reliability Engineering) team.
- As a result, we significantly reduced that load, improved our team well-being, and increased the amount of spare time and energy we have to invest in reliability engineering projects to improve Tanzu.

Slack RCA for outage on January 4, 2021

This is Slack’s RCA for their outage earlier this month. This is a great example of a complex incident with many contributing factors — certainly no single “root cause” here.

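As the title says, this is the final version of Slack's report on the outage.
As the editor's comment above says, the outage was caused by multiple factors rather than a single root cause, so there are also many corrective actions. For some items, improvements were worked out with the cooperation of the cloud provider.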

Outages

Slack
Signal
Apple iCloud
Facebook
CBS

Outage information for each of the companies above.

KubeWeekly #248 January 29th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

Maintainer Spotlight: Kevin Wang of KubeEdge and Volcano

CNCF blog

This month’s spotlight focuses on Kevin Wang, a contributor in the CNCF community since its beginning, leader of the cloud native open source team at Huawei, and co-founder of the KubeEdge and Volcano projects. Read the blog to learn more about Kevin’s experience with the CNCF community over the past five years.

Kevin Wang, in the spotlight this time, is also running in the CNCF TOC election. I checked TOC Elections for 2021, and given the schedule below, the results should be out soon.
- Election closes Feb 1, announced at noon

Great to see @LF_Networking projects joining forces https://t.co/7a2JO6WvE9
Bill Mulligan (@breakawaybilly), January 26, 2021

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Operator integration testing for Operator Lifecycle Manager

Taneem Ibrahim, Red Hat

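As the title says, it walks through the steps needed to test an Operator's OLM (Operator Lifecycle Manager) integration. The demo uses a simple Operator that prints a test message to the shell.
The tools needed for the local development environment used in this hands-on are as follows. It also points readers to a free Red Hat Quay.io account.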
- Red Hat CodeReady Containers (CRC)
- Podman, or a Docker daemon process running on the local machine
- Operator SDK toolkit, v1.0.0 or higher (optional)
- Operator Package Manager (OPM)
- OpenShift Container Platform, cluster version 4.5 or higher

Kubernetes and GitOps with Flux CD V2.0

Raynix

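It describes working hands-on through the topic in the title by following the official instructions. A colleague had recommended their own helpful project, k8s-gitops, but the author wanted to fully understand how to use Flux CD and so started from scratch with the official instructions. Enabling GitOps on the author's own cluster reportedly did not take long.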

Kubernetes at scale using Rancher Fleet

Saiyam Pathak, Civo

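A YouTube webinar video covering the topic in the title.
The speaker, Saiyam Pathak, is a CNCF Ambassador and Director of Technical Evangelism at @civocloud who actively publishes interview videos from event livestreams as well as webinar videos like this one. They are useful, so I subscribed to the channel.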

Database Migrations Using Screwdriver and Kubernetes

Zhongkai Liu, Software Dev Engineer II & Palash Agrawal, Principal Software Dev Engineer

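It explains the differences between the previous and current DB migration processes at Yahoo Sports. The tool used is Screwdriver, an open-source CD (continuous delivery) platform.
At the start, it confirms that the term "Migration", as used in the title and the body, means changes made to a database, such as adding or dropping tables, populating the database with data, and removing entries from the database.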
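
To illustrate "migration" in the sense above (schema changes, data population, entry removal), here is a minimal sketch using Python's built-in `sqlite3`. It is not the Screwdriver-based process from the article; the table names, migration names, and the `schema_migrations` bookkeeping table are all assumptions made for the example.

```python
import sqlite3

# Hypothetical migrations, applied in order. Names and schema are illustrative only.
MIGRATIONS = [
    ("001_create_players", "CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_seed_players", "INSERT INTO players (name) VALUES ('example')"),
    ("003_remove_seed", "DELETE FROM players WHERE name = 'example'"),
]


def apply_migrations(conn):
    """Apply migrations that have not been recorded yet; return their names."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    newly_applied = []
    for name, sql in MIGRATIONS:
        if name in applied:
            continue  # already applied on a previous run
        # `with conn` commits on success and rolls back any open transaction on error.
        with conn:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
        newly_applied.append(name)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn)
second_run = apply_migrations(conn)  # nothing left to apply
```

Recording each applied migration makes the run repeatable, which is the property a CD pipeline would typically rely on when it executes migrations automatically.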

Firecracker: start a VM in less than a second

Julia Evans

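The author explains using Firecracker from a more DIY, "I just want to run a VM" perspective.
The author did not initially expect it to be something they would use themselves, and the post opens by explaining the following points that prompted them to try it.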
- Firecracker is relatively straightforward to use (or at least as straightforward as anything else that’s for running VMs)
- The documentation and examples are pretty clear
- You definitely don’t need to be a cloud provider to use it
- As advertised, it starts VMs really fast!

Scaling Kubernetes to 7,500 Nodes

OpenAI

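It shares insights from using Kubernetes as a highly flexible platform for the research needs of the author's team.
It lists the following two items as unsolved problems. For the metrics problem, a migration is in progress, and a future blog post on the results appears to be planned.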
- Metrics
- Pod network traffic shaping

Docker security scanning cheat sheet 2021

Jim Armstrong, Snyk

It introduces the "Docker Vulnerability Scanning CLI cheatsheet", which helps you get started scanning container images with Docker Desktop and Snyk.

How to unit-test your helm charts with Golang

Alistair Hey

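As the title says, it explains how to write unit tests for Helm charts in Golang, so that quality stays high and changes can be made with confidence.
It also summarizes "The Upsides" and "The downsides?", and the repository containing the basic example Helm chart prepared by the author is here.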

Create Kubernetes federated clusters on AWS

Theo “Bob” Massard, particle.io

It introduces the new solution recently released by AWS for orchestrating federated EKS clusters, starting from Kubefed (used for Kubernetes cluster federation), on which the solution is based.

Self-Service Velero Backups with Kyverno

Ritesh Patel, Nirmata

It explains how to enable developer self-service backups with Velero using Kyverno, a new CNCF sandbox project.

ICYMI: CNCF online programs this week

A weekly summary of CNCF online programs from this week.

CNCF Live webinar: Kubernetes 1.20

Jeremy Rickard, VMware and Kirsten Garrison, Red Hat

The release team explains in detail the new features and important deprecations in Kubernetes 1.20.
Kubernetes 1.20 is reportedly one of the largest releases to date, with more than 40 enhancements.

This Week in Cloud Native: Cloud Native Infrastructure in the Data Center with Cluster API & Tinkerbell (CAPT) (livestream)

Jason DeTiberus, Equinix and Manny Mendez, Equinix

Starting from the following concerns, it explains how to use Cluster API and Tinkerbell to bring truly cloud native infrastructure management to the data center.
- Up until now managing Kubernetes infrastructure outside of cloud providers has been difficult, and while there have been attempts to ease management of Kubernetes clusters within the data center previously we feel those attempts have been focused mostly on trying to shoehorn the management of clusters into legacy practices.

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Siri, Storage, and Solutions, with Josh Bernstein

Craig Box, Kubernetes Podcast from Google

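The Kubernetes Podcast, hosted by Google employees. The current co-host is Craig Box. Adam Glick has left for greener pastures. For the next few weeks, past guests will take part as guest hosts.
This week's guest host is Jasmine Jaksic, a Staff TPM & Manager at Google responsible for Istio and Anthos.
The guest is Joshua Bernstein, who recently joined Google Cloud as director of infrastructure modernization solutions.
The topics that caught my attention in News of the week are as follows.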
- New Google Cloud Run networking features
- Kubernetes honey tokens by Brad Geesaman
- Bad pods: privilege escalation by Seth Art
- The US Air Force are feeling supersonic

Nvidia Views Kubernetes as Key to GPU Accelerated AI Scale

Tobias Mann, SDxCentral

I had no involvement or contact at all with the topic in this title, so I learned a lot. It seems worth at least skimming the following points.
- He explained that Nvidia’s work in this arena has been somewhat drowned out by webscale applications which have been and remain the primary use case for Kubernetes. However, Lamb argues there is a huge potential for GPU-accelerated Kubernetes clusters in artificial intelligence (AI) workloads, an arena where Nvidia has long dominated.
- Looking to the future, Lamb expects GPUs will begin to the move into the mainstream of Kubernetes, especially as “AI serving becomes a GPU-accelerated workload, which is just at the inflection point of taking off.”
- “As things expand, I think most people are going to be able to just think about GPU accelerated as a fast button or an efficient button and not have to think about GPU development or programming,” he added.

Closed Box Monitoring, the Artist Formerly Known as Black Box Monitoring

Rick Rackow, Red Hat

It explains closed box monitoring and how the OpenShift Dedicated SREs use it to complement their observability stack.

Announcing Vitess 9

Vitess team

As the title says, this is a CNCF article announcing the release of Vitess 9.
The article is organized around the following items as Major Themes.
- Compatibility (MySQL, frameworks)
- Migration
- Innovation
- Documentation
The Release Notes are here.

Mentorship Spotlight: CommunityBridge Mentee with Keptn

CNCF blog

An article in which the author reports completing the CommunityBridge Program as a mentee with Keptn, a CNCF sandbox project.
The "CommunityBridge program" has reportedly now been renamed the "LFX Mentorship program".

# 63 – From Prometheus to Thanos with Simon Pasquier (in French)

Electro Monkeys podcast

A French-language podcast. It reportedly discusses what a project like Thanos brings to Prometheus, how it works, and what its features are.

What is GitOps?

Salman Iqbal

A roughly six-minute YouTube webinar video explaining the principles of GitOps and its benefits.

The Cloud Native Landscape: The Application Definition and Development Layer

Catherine Paganini, Buoyant and Jason Morgan, VMware

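An article in a series by Catherine Paganini, co-chair of the "Cloud Native Computing Foundation Business Value Subcommittee", and Jason Morgan. The series explains the cloud native landscape so that non-technical readers as well as technical ones can follow it.
This article covers the development layer of the cloud native landscape. The next article will reportedly focus on cloud native platforms.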

Kubernetes Begins Year With A Bang — And You Can Expect More

Chris Metinko, Crunchbase

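An article on investments and acquisitions in the Kubernetes ecosystem, which has already seen activity since the start of 2021, along with predictions for what comes next.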

Linkerd User Survey 2021- Take the survey

A one-page survey on Google Forms. Worth checking if you use Linkerd or are interested in it.

Another one in the books!
Great job today, @SethMcCombs! ✨ https://t.co/5iFceLlhEI
Stephen Augustus (@stephenaugustus), January 26, 2021

Upcoming CNCF Online Programs

This Week in Cloud Native: Kubernetes Policies-as-code
Jim Bugwadia, Nirmata
February 3, 2021 at 11:00 am PT
Register Now

CNCF On-demand Webinar: Policy as Code to Manage Security Risk in Kubernetes Before and After Deployment
Cesar Rodriguez, Accurics
February 4, 2021
Register Now

For more information, please visit our updated Online Programs page.

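How was it? Did any of the articles or information catch your interest?

There is still a lot here that I have not fully digested myself, so I intend to deepen my understanding while making use of this memo and link collection.

See you next time.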

Bye now!!

Yoshiki Fujiwara