SRE / DevOps / Kubernetes Weekly Report Roundup #8 (3/22~3/27) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

This post is a memo and link collection based on my reading of the following three weekly reports published between March 22 and March 27, 2020.
An English version of this blog is here.
DEVOPS WEEKLY ISSUE #482 March 22nd, 2020
- News
- Tools
SRE Weekly Issue #212 March 23rd, 2020
- Articles
- Outages
KubeWeekly #209: March 27th, 2020

I hope it serves as an information source for someone, or saves some time spent searching.

DEVOPS WEEKLY ISSUE #482 March 22nd, 2020

SRE Weekly Issue #212 March 23rd, 2020

KubeWeekly #209: March 27th, 2020

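If anything in this post raises questions or seems unclear, I would appreciate it if you could check the original article via its URL and then point it out.
I try to comment on every item, even in areas I do not know well, so there are likely many misunderstandings on my part and ambiguities caused by insufficient explanation.
Because there is a lot of information, I have limited the post to text and links.
Some of the articles featured in each report date from 2019 or earlier, so they are not necessarily the latest.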

DEVOPS WEEKLY ISSUE #482 March 22nd, 2020

News

A detailed look at the Envoy proxy, focused on usage patterns rather than technology. Interesting looking at the different use case, both low level and integrated into higher-level service mesh tooling.

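The title is "On the state of Envoy Proxy control planes".
An analysis of the current state of Envoy Proxy control planes and where they are headed over the next few years, posted on the personal blog of Matt Klein, a Software Engineer at Lyft and the creator of Envoy, who also appeared in last week's post (KubeWeekly #208: March 20th, 2020).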

An interesting post looking at using Kubernetes DaemonSets to help cluster administrators manage the cluster. Nice examples of provisioning SSH access to cluster nodes and running a virus scanning.

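The title is "> MANAGING YOUR K8S CLUSTER VIA DAEMONSETS".
A proposal for using DaemonSets to manage the software, systems, and configuration needed to run a production environment.
I was personally curious that the post's label is "aikido". When I opened the personal site of the author, James Hunt, I was greeted by a "Chrono Trigger" wallpaper, among other things.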
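
To give a feel for the idea, here is a minimal sketch that builds a DaemonSet manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The name `node-scanner` and the image `example.com/scanner:latest` are hypothetical placeholders and are not taken from the article.

```python
import json


def daemonset_manifest(name: str, image: str) -> dict:
    """Build a minimal apps/v1 DaemonSet that schedules one pod per node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            # The selector must match the labels on the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, for illustration only.
    manifest = daemonset_manifest("node-scanner", "example.com/scanner:latest")
    print(json.dumps(manifest, indent=2))
```

A real node-management DaemonSet would usually also need host mounts or elevated privileges, which this sketch leaves out.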

A discussion of abstractions and how that maps to serverless architectures and some thoughts on configuration management.

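The title is "Abstractions and serverless".
It explains the significance of abstraction and digs into serverless in that context.
The author argues that building systems with clear abstraction boundaries that can be understood, maintained, and evolved is a problem many IT systems need to solve.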

A discussion of incident management practices, in particular looking at involving developers in incident response and on-call activities.

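The title is "Involving Engineers in Incident Management: QCon London Q&A".
A Q&A based on a session at QCon London (held March 2 to March 6) by Samuel Parkinson, Principal Engineer at the Financial Times, on the benefits of learning from past incidents and the case for involving engineers in incident management.
The comment that "there is always something new to discover even in past events, and new members bring perspectives that existing members had not noticed" may sound obvious, but I felt that whether a team sincerely works that way depends heavily on the character of its members, the leadership of its leader, and its history.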

Another post on architectural approaches to splitting up a large monolithic application, in particular looking at the strangler pattern and the importance of observability.

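The title is "Break that big ball of mud!". An article from October 2017.
The content was originally presented on the "NDC 2016 Blog". The author, a Star Wars fan, uses the Force, Yoda, the Death Star, and so on as metaphors.
Quoting Yoda from Star Wars, the author says that in more than 15 years of coding, dealing with legacy code has quite often come with fear, anger, hate, and suffering, and goes on to explain why.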

Analysis of a recent paper on analyzing characteristics of serverless usage, looking at the interesting optimisation where users want fast function start times and the cloud provider wants to minimise resources consumed.

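The title is "Serverless in the wild: characterizing and optimising the serverless workload at a large cloud provider".
Part of Adrian Colyer's series that looks through computer science research.
This time he picks a paper from arXiv (pronounced the same as "archive") on characterizing and optimizing the serverless workload at a large cloud provider (Azure). The PDF version is here.
He noticed the paper after Jonathan Mace tweeted about it.
It is explained with figures and graphs from the viewpoints of cold starts, pre-warming, keep-alive, idle time, resource management, and cost. Very interesting.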
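
The tension between cold starts and idle resources can be illustrated with a toy model. The sketch below assumes a fixed keep-alive window, instantaneous executions, and a made-up list of invocation times. It is not the policy proposed in the paper, only a way to see the trade-off.

```python
def count_cold_starts(invocation_times, keep_alive):
    """Count cold starts under a fixed keep-alive policy.

    invocation_times: sorted invocation timestamps in minutes.
    keep_alive: minutes an instance stays loaded after its last invocation.
    """
    cold_starts = 0
    last = None
    for t in invocation_times:
        # The first call, or a call after the instance was unloaded, is cold.
        if last is None or t - last > keep_alive:
            cold_starts += 1
        last = t
    return cold_starts


def idle_minutes(invocation_times, keep_alive):
    """Minutes an instance sits loaded but unused between invocations."""
    gaps = (b - a for a, b in zip(invocation_times, invocation_times[1:]))
    return sum(min(gap, keep_alive) for gap in gaps)


# Hypothetical invocation pattern (minutes since start).
times = [0, 5, 12, 40, 45, 120]
for keep_alive in (10, 30, 60):
    print(keep_alive, count_cold_starts(times, keep_alive), idle_minutes(times, keep_alive))
```

With these sample times, a longer keep-alive removes some cold starts but keeps memory occupied for more idle minutes, which is the user-versus-provider tension described in the blurb above.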

A post on the importance of egress filtering of network traffic. Although this particular post talks about Serverless, this is relevant to any architecture or infrastructure I think.

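The title is "Egress Filtering in Serverless Applications".
It explains the importance of filtering outbound traffic, which is often overlooked in serverless applications, along with how to do it and real examples of the risks.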
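
As a rough illustration of the allowlist idea behind egress filtering, the sketch below checks an outbound URL against a set of permitted hosts before a function would call it. The host names are hypothetical, and the article may well recommend enforcing this at the network layer rather than in application code.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the function is expected to talk to.
ALLOWED_HOSTS = {"api.example.com", "storage.example.com"}


def is_egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only if the URL's host is explicitly allowlisted."""
    host = urlparse(url).hostname  # lowercased by urlparse, None if absent
    return host is not None and host in allowed_hosts


print(is_egress_allowed("https://api.example.com/v1/items"))
print(is_egress_allowed("https://attacker.example.net/exfil"))
```

Denying by default means that a compromised dependency cannot quietly send data to a destination nobody anticipated.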

Tools

Backstage is described as a platform for building developer portals. It has an impressive vision, to become a standard toolbox for the open source infrastructure landscape.

The web page of "Backstage", a tool that provides a unified front-end portal for developers.
The GitHub page is here.

Docker released a useful new GitHub Action which makes building and publishing Docker images easier. Some nice touches like automatic tagging and building multiple tags.

The GitHub page of the GitHub Action newly released by Docker. It makes building and publishing Docker images easier and implements features such as automatic tagging and multiple tags.

With a surge of developers and IT practitioners working remotely, there’s also a surge of confusion and operational inefficiency. See how data and automation is improving the way DevOps and IT operations engineers build, release and maintain reliable services remotely:

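A blog post from VictorOps, the sponsor of DevOps Weekly.
The title is "Using Data and Automation to Help Engineering Teams Work Remotely".
On remote work, currently the topic attracting the most interest among engineers, it refers to the Network Operations Center (NOC) model and others, touches on the automation and data integration that should be considered, and proposes a 14-day free trial of the company's own service as the solution.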

SRE Weekly Issue #212 March 23rd, 2020

Articles

Meaningful availability

This very clearly written paper describes the Google G Suite team’s search for a meaningful availability metric: one that accurately reflected what their end users experienced, and that could be used by engineers to pinpoint issues and guide improvements.

Hauer et al. — NSDI’20 (original paper) Adrian Colyer — The Morning Paper (summary)

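This is from Adrian Colyer's series looking through CS research, which was also featured in DevOps Weekly this week.
This time he picks a paper from NSDI '20, hosted by USENIX (February 25 to 27 in Santa Clara, California), on the Google G Suite team's search for a meaningful availability metric. The PDF is here.
He says the paper was recommended to him by Damien Mathieu.
I like that the text clearly defines what its keywords mean. For example, "meaningful" means that the metric captures the user experience.
This is material to read closely and repeatedly and then discuss. Homework for me.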
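
To see why a metric needs to capture what users experience, here is a simplified sketch that compares a plain request success ratio with an average of each user's own success fraction. Both functions and the sample traffic are my own illustration and are not the metric defined in the paper.

```python
from collections import defaultdict


def success_ratio(requests):
    """Fraction of all requests that succeeded."""
    return sum(ok for _, ok in requests) / len(requests)


def mean_per_user_availability(requests):
    """Average, over users, of each user's own success fraction."""
    per_user = defaultdict(list)
    for user, ok in requests:
        per_user[user].append(ok)
    return sum(sum(v) / len(v) for v in per_user.values()) / len(per_user)


# Hypothetical traffic: one very active client succeeds 98 times,
# while two ordinary users each fail their single request.
requests = [("batch-client", True)] * 98 + [("alice", False), ("bob", False)]
print(success_ratio(requests))
print(mean_per_user_availability(requests))
```

The request-count ratio looks healthy because one heavy client dominates it, while two of the three users saw nothing but failures.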

Our Top 5 On-Call Practices – Blameless: Better Reliability Through SRE

Their top 5 are:

Use Meaningful Severity Levels
Create Detailed Runbooks
Load Balance Through Qualitative Metrics
Get Ahead of Incidents
Cultivate a Culture of On-Call Empathy

Emily Arnott — Blameless

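It opens with the premise that "you might see on-call as a necessary evil", then proposes, through the five best practices above, ways to speed up team response, build more resilient systems, and minimize recurring interruptions.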

NTP: Building a more accurate time service at Facebook scale

Synchronizing clocks can be critical in an HA system, and Facebook went to great lengths to ensure clock accuracy.

Zoe Talamantes and Oleg Obleukhov — Facebook

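An introduction to the importance of NTP (Network Time Protocol) at Facebook's scale and how they achieve accuracy.
It compares chrony and ntpd in detail, which is a useful reference.
PTP (Precision Time Protocol) had not been on my radar until now, so I want to look into it as well.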
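
As background, the sketch below shows the standard four-timestamp calculation that an NTP client uses to estimate its clock offset and the round-trip delay. The timestamps are made-up values, and this comes from the protocol itself rather than from the Facebook article.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one NTP exchange.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Hypothetical exchange: the server clock is 5 s ahead, each network leg
# takes 1 s, and the server spends 0.5 s processing the request.
offset, delay = ntp_offset_and_delay(t0=100.0, t1=106.0, t2=106.5, t3=102.5)
print(offset, delay)
```

The offset estimate is exact only when the two network legs take the same time, so asymmetric paths translate directly into clock error.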

The Fallacy of Move Fast and Break Things

You might end up just breaking things.
Dawn Parzych — LaunchDarkly

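The content was originally published by the author on DevOps.com.
It notes that Mark Zuckerberg's phrase "move fast and break things" at Facebook has become the motto of many development teams and that many would-be unicorns have followed it, but argues that it does not work across every industry or for every team.
It states that "high-performing teams have good systems and processes that make this way of thinking work; they are not taking the words at face value", and proposes building a culture and tooling that fit your own organization and its risks.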

InSearch: LinkedIn’s new message search platform

LinkedIn’s message search system takes advantage of the fact that relatively few users actually search their message. It only builds a search index the first time a user performs a search.
Suruchi Shah and Hari Shankar — LinkedIn

About LinkedIn rebuilding the backend behind message search and introducing a new platform it calls "InSearch".
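
The idea of building an index only on a user's first search can be sketched as follows. The class and its in-memory inverted index are hypothetical and far simpler than whatever LinkedIn actually runs. They only show the lazy-build pattern.

```python
from collections import defaultdict


class LazyMessageSearch:
    """Builds a per-user inverted index the first time that user searches."""

    def __init__(self, messages_by_user):
        self._messages = messages_by_user  # user -> list of message strings
        self._indexes = {}                 # user -> index, created on demand

    def _build_index(self, user):
        index = defaultdict(set)
        for msg_id, text in enumerate(self._messages.get(user, [])):
            for word in text.lower().split():
                index[word].add(msg_id)
        return index

    def search(self, user, word):
        # Users who never search never pay the indexing cost.
        if user not in self._indexes:
            self._indexes[user] = self._build_index(user)
        ids = self._indexes[user].get(word.lower(), set())
        return [self._messages[user][i] for i in sorted(ids)]


store = LazyMessageSearch({"alice": ["Hello world", "Lunch tomorrow?", "hello again"]})
print(store.search("alice", "hello"))
```

The first search for a user is slower because it pays for indexing, which is acceptable when, as the blurb says, relatively few users search at all.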

Destiny 2 Outage and Rollback

This followup post from Bungie covers two related incidents in February that caused loss of user data.
Bungie

About an outage and rollback in "Destiny 2", the game developed and operated by Bungie.

Involving Engineers in Incident Management: QCon London Q&A

An interview about how one company got their developers to join the on-call rotation. It covers how they trained them to help them build confidence and what benefits they got by joining.
Ben Linders — InfoQ

Omitted here because it is covered above under DEVOPS WEEKLY ISSUE #482.

Outages

Statuspage.io
- The text of this incident originally mentioned Heroku, and it lines up with the Heroku outage below.
- They also had this unrelated outage.
Heroku
- Heroku suffered two short bouts of 85% request failure to applications hosted on their platform. Separately, they recently posted a couple of followup reports for previous incidents:
  - Incident #1961: logging outage
  - Incident #1968: EU application errors
Zoom
MacStadium
Hulu
Bumble
Microsoft Teams and Office 365
Discord
- Discord posted this gem of a followup analysis just a few days after their outage last week.
GoToMeeting
Google Nest
DoorDash

Outage information for the companies above.

KubeWeekly #209: March 27th, 2020

The Headlines

Editor’s pick of the highlights from the past week.

Kubernetes 1.18

Kubernetes 1.18 is the first release of 2020! Kubernetes 1.18 consists of 38 enhancements: 15 enhancements are moving to stable, 11 enhancements in beta, and 12 enhancements in alpha.
Kubernetes 1.18 is a "fit and finish" release. Significant work has gone into improving beta and stable features to ensure users have a better experience. An equal effort has gone into adding new developments and exciting new features that promise to enhance the user experience even more. Having almost as many enhancements in alpha, beta, and stable is a great achievement. It shows the tremendous effort made by the community on improving the reliability of Kubernetes as well as continuing to expand its existing functionality.

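The release announcement for Kubernetes 1.18. As stated above, it contains 38 enhancements (15 moving to stable, 11 in beta, and 12 in alpha).
It gathers the essential information and links, including the release logo, major changes, release notes, and the GitHub download page, so I need to go through them.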

Kubernetes 1.18, with release team manager Jorge Alarcon

Adam Glick and Craig Box, Kubernetes Podcast from Google

Kubernetes 1.18 is out – almost! A bug has pushed it back a day. While you’re waiting, release team lead Jorge Alarcon will tell you all about the fit and finish you can expect in the release when it’s out tomorrow. Adam and Craig bring you the other community news of the week, as well as some podcast follow-up.

The Kubernetes Podcast, hosted by Google employees. The current co-hosts are Craig Box and Adam Glick.
The guest is Jorge Alarcon, release team lead in the Kubernetes community and SRE at searchable.ai.
Three topics in News of the week caught my attention.

ICYMI: CNCF Webinars

Weekly recap of CNCF member and project webinars that you might have missed.
You can view all CNCF recorded and upcoming webinars here.

CNCF Project Webinar: How to Migrate a MySQL Database to Vitess

Liz van Dijk, Solution Architect and Field Operations @PlanetScale

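A webinar video in which Liz van Dijk, Solution Architect & Field Operations at PlanetScale, explains how to migrate a MySQL database to Vitess.
It includes a demo and is easy to follow.
The audio drops out occasionally, probably due to the speaker's network environment during recording, so please bear with it.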

CNCF Member Webinar: Lowering the Barrier to Kubernetes Proficiency – Navigating the Stormy Seas of Information Overload

Angel Rivera, Developer Advocate @CircleCI

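A webinar video in which Angel Rivera, Developer Advocate at CircleCI, explains Kubernetes with the aim of lowering the barrier to proficiency for beginners.
It carefully covers the background that made Kubernetes necessary, along with abbreviations, resources, and components.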

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Anatomy of my Kubernetes Cluster

Antonin Stefanutti

Antonin Stefanutti, a Software Engineer at Red Hat, built a home Kubernetes cluster to his own requirements and dissects it thoroughly, anatomy style. Impressive.

Writing Kubernetes network policies with Inspektor Gadget’s Network Policy Advisor

Alban Crequy, Kinvolk

Alban Crequy, CTO & co-founder of Kinvolk, shows how to write Kubernetes network policies using the "Network Policy Advisor" of Inspektor Gadget, an open source collection of gadgets for debugging and inspecting applications on Kubernetes.
They are looking for contributors and also invite readers to join the discussion in #inspektor-gadget on the Kubernetes Slack.

Okteto Push – Your Code to Kubernetes in Seconds

Pablo Chico de Guzman, Okteto

An article introducing Okteto Push on the company blog, by Pablo Chico de Guzman, Founder & CTO of Okteto.
It is billed as "the fastest tool for pushing code to Kubernetes".
They are eagerly seeking feedback via Okteto's Twitter or the Kubernetes #okteto channel.

Converting an Old MacBook Into an Always-On Personal Kubernetes Cluster

Sid Palas, DevOps Directive

The author wanted an always-on Kubernetes cluster and so turned an unused 2012 MacBook Air that was lying around into one.

Quality of Service and OOM in Kubernetes

Ciro S. Costa, OpsTips

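An article on the personal blog of Ciro S. Costa, a Software Engineer at VMware (going by his Twitter profile as the latest information).
He says that although he has used Kubernetes resources for a long time, he had never personally dug into them at the Node level, which is the theme of this article.
It carefully explains the three QoS (quality of service) classes, the OOM score, the cgroup tree, per-cgroup memory, and the kubelet's Pod eviction.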
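
As a quick reference for the three QoS classes, the sketch below classifies a pod from its containers' CPU and memory requests and limits. It is a simplification written for this post. It compares quantity strings literally, so "500m" and "0.5" would count as different, and it ignores resources other than CPU and memory.

```python
def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    containers: list of dicts with optional "requests" and "limits" dicts
    keyed by "cpu" and "memory".
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
print(qos_class([{"requests": {"memory": "64Mi"}}]))
```

The class matters under memory pressure, because it influences the OOM score and eviction order that the article goes on to explain.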

Kubernetes secrets

Ciro S. Costa, OpsTips

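Same author as the article directly above.
An article that investigates and explains how the kubelet makes Kubernetes Secrets available to processes on a Node.
Secret management is a theme I see often lately, but my own understanding is shallow. More homework.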
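
From the application's side, one common way a Secret shows up is as files in a mounted volume, one file per key. The sketch below reads such a directory. The mount path in the usage comment is a hypothetical example, and the article itself looks at the kubelet side rather than at client code like this.

```python
from pathlib import Path


def read_mounted_secret(mount_dir: str) -> dict:
    """Return {key: value} for every key file under a Secret volume mount."""
    secrets = {}
    for entry in Path(mount_dir).iterdir():
        # Skip the hidden bookkeeping entries whose names start with "..".
        if entry.name.startswith("..") or not entry.is_file():
            continue
        secrets[entry.name] = entry.read_text()
    return secrets


# Usage inside a pod, assuming the Secret is mounted at this hypothetical path:
# credentials = read_mounted_secret("/etc/app-secrets")
```

Reading the files at the time of use, rather than caching them at startup, lets a process pick up updated values when the mounted Secret changes.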

Setting up a ProxySQL Sidecar Container

Jake Davis, Percona

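An introduction to setting up a ProxySQL sidecar by Jake Davis, a DBA (Database Administrator) at Percona.
Their customer Duolingo had been hitting Aurora's maximum of 16,000 connections (reportedly a hard limit for all instance classes), but by using ProxySQL sidecars they now (as of March 23, 2020) stay at around 6,000 at peak.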

OpenShift 4.4 OKD Bare Metal Install on VMWare Home Lab

Craig Robinson, East Carolina University

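OKD is the upstream, community-supported version of Red Hat's OCP (OpenShift Container Platform).
The article walks through the setup so that you can test an OKD 4.4 cluster in a home environment.
A virtualization platform, basic Linux knowledge, and the ability to search Google are all you need.
The screenshots and explanations are thorough.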

Building a TODO API in Golang with Kubernetes

Alex Ellis

An article by CNCF Ambassador Alex Ellis for Kubernetes beginners who want to write a practical API in Go and deploy and manage a to-do list on Kubernetes.

A Guide On The Installation Of Spinnaker in Kubernetes Cluster

Vikas Saini, Magalix

An article that walks through installing Spinnaker on GKE using halyard, a tool for installing, configuring, and updating Spinnaker.

A Primer: Continuous Integration and Continuous Delivery (CI/CD)

Catherine Paganini, Kublr

An article in a series that explains IT concepts to business leaders.
This installment covers CI/CD, explained with keywords alongside easy-to-picture diagrams.

#minikube just crossed the 1-millionth GitHub download threshold!

I'm very excited about the minikube v1.9 release this week, in sync with Kubernetes v1.18. It's huge! 🎉☸️💪 pic.twitter.com/f1EzRrJvcH
— Thomas Strömberg (@thomrstrom) March 24, 2020

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Threading the Needle on Kubernetes Complexity with AI-Powered Observability

Andreas Grabner, DevOps.com

About the need to address the complexity and volume of data that Kubernetes brings with AI-powered observability. It does not discuss specific tools.

A ‘No-BS’ Checklist for Kubernetes

Oleg Chunikhin, Kublr

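The authors created and share a "no-BS" (that is, no-nonsense) checklist for identifying vendors and services that do not meet the requirements for running Kubernetes in enterprise production. It also lists desirable extras as "Nice-to-Haves".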

We are pleased to announce that the new event dates for #KubeCon + #CloudNativeCon Europe 2020 have been confirmed and that we are now planning to hold the event at the RAI Amsterdam from Thursday, August 13 to Sunday, August 16, 2020 https://t.co/lYE8E9yjg0
— CNCF (@CloudNativeFdn) March 26, 2020

Upcoming CNCF webinars

If a webinar catches your interest, register and check it out. The ones below were picked up as the upcoming ones.

Container Security at Scale: Lessons Learned from the Front Lines with ABN AMRO and Palo Alto Networks
Wiebe de Roos, CI/CD Consultant @Flusso and ABN Amro
Keith Mokris, Technical Marketing Engineer @Palo Alto Networks
Member webinar
April 1, 2020 10:00 AM Pacific Time
REGISTER NOW »

Taming Your AI/ML Workloads with Kubeflow The Journey to Version 1.0
David Aronchick @Microsoft
Elvira Dzhureava, Technical Product Engineer AI/M @Cisco
Johnu George, Technical lead @Cisco Systems
Member webinar
April 2, 2020 9:00 AM Pacific Time
REGISTER NOW »

Welcome to CloudLand! An Illustrated Intro to the Cloud Native Landscape
Kaslin Fields, Developer Advocate @Google
Ambassador webinar
April 3, 2020 10:00 AM Pacific Time
REGISTER NOW »

Pravega: Rethinking storage for streams
Dell
Member webinar
April 7, 2020 10:00 AM Pacific Time
REGISTER NOW »

Best Practices for Deploying a Service Mesh in Production: From Technology to Teams
Buoyant
Member webinar
April 8, 2020 10:00 AM Pacific Time
REGISTER NOW »

New thoughts on distributed file system in the cloud native era
JD.com
Member webinar
April 9, 2020 10:00 AM Pacific Time
REGISTER NOW »

Declarative Host Upgrades From Within Kubernetes
Adrian Goins, Director of Community and Evangelism @Rancher Labs
Dax McDonald, Software Engineer @Rancher Labs
Jacob Blain Christen, Principal Software Engineer @Rancher Labs
Member webinar
April 14, 2020 10:00 AM Pacific Time
REGISTER NOW »

如何让你的Windows应用运行在Kubernetes平台
杨雨 Alex Yang, 解决方案架构师 Solution Architect @Mirantis
张文墨Larry Zhang, 解决方案架构师 Solution Architect @Mirantis
Member webinar
This webinar will be delivered in Chinese
April 23, 2020 10:00 AM China Standard Time
REGISTER NOW »

Kubernetes 1.18
Kubernetes team
Project webinar
April 23, 2020 9:00 AM Pacific Time
REGISTER NOW »

Pivoting Your Pipeline from Legacy to Cloud Native
Tracy Ragan, CEO of DeployHub and CDF Board Member
Member webinar
June 30, 2020 10:00 AM Pacific Time
REGISTER NOW »

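How was it? Did any of the articles or information catch your interest?

There is still a lot here that I have not fully digested myself, so I will keep using this memo and link collection to deepen my understanding.

See you next time.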

Bye now!!

Yoshiki Fujiwara