SRE / DevOps / Kubernetes Weekly Report Summary #50 (2021/1/10 - 1/15) - 運び屋 (A carrier (forwarder) changed his career to an engineer)

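This post is a memo and link collection based on the following three Weekly Reports, issued between 2021/1/10 and 2021/1/15.
Because I want to deliver and share information as early as possible, I publish the post ahead of time as soon as I have checked the blog links. I add my own comments as I go.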
English Version of this blog is here.
DEVOPS WEEKLY ISSUE #524 January 10th, 2021
- News
- Events
  - 3 a.m. wake ups are for heart surgeons, newborn parents… and SREs. Sarah Wells from the Financial Times is on a mission with Jamie Dobson, CEO of Container Solutions to keep good night rests sacred. Join them in the latest WTFinar, on Alert Fatigue and how to manage it. Sign up here:
SRE Weekly Issue #252 January 10th, 2021
- Articles
- Outages
KubeWeekly #246 January 15th, 2021

I hope this serves as an information source for someone, or saves them some search effort.

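If anything in this post raises questions or is unclear, I would appreciate it if you could check the original article via its URL and then point it out to me.
I make a point of commenting even on topics I understand only shallowly, so there are likely many misunderstandings caused by my own mistakes or insufficient explanation.
Because there is a lot of information, I have limited the post to text and links.
Some of the articles featured in each report date from 2019 or earlier, so they are not necessarily the latest.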

DEVOPS WEEKLY ISSUE #524 January 10th, 2021

News

A post on different sources of on-going maintenance, and some discussion of ways to improve the situation.

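The title is "Software is drowning the world".
The author cites the ability to see commonalities as one of the many benefits of having worked at many organizations, and discusses "technical debt" from the following perspective.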
- Every time you decide to solve a problem with code, you are committing part of your future capacity to maintaining and operating that code. Software is never done.

As development teams grow, change becomes harder. This post looks at an approach for addressing high impact changes that are spread out across teams, and how to get buy in.

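The title is "Campaigns".
The author proposes a tool called a Campaign.
It is a tool/framework for coordinating many groups of people, holding those groups accountable, and ultimately succeeding when paying down technical debt or changing architecture to improve customer experience or reduce costs. The post explains, among other things, that a Campaign needs the following.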
- A Goal
- Metrics toward that goal
- Buy-in
- Method of Accountability
- A “Window”
- A Target Date

Discussion of the evolution of frameworks in software development, in particular looking at how AWS itself can be considered a framework, providing primitives for logging, events, scaling, monitoring and more.

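The title is "AWS as a Framework".
It discusses the title topic from the following perspective. It aims to help readers see the validity of both AWS as a framework and the unique possibilities that open up when it is used to the full.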
- AWS doesn’t sound like an “infrastructure” provider anymore, not even a “platform” provider. It sounds like a framework!

FOSDEM is going virtual this year on the 6th and 7th February, and lots of the devrooms have announced sessions. I’m particularly looking forward to the software composition devroom.

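The web page of FOSDEM (Free and Open source Software Developers' European Meeting), a two-day online event organized by volunteers to promote the widespread use of free and open source software. The link above introduces each track.
The "software composition devroom" that the editor is interested in is here.
It is normally held in Brussels (Belgium), and according to the site, "FOSDEM is widely recognised as the best such conference in Europe."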

Lots of organisations will have a bunch of Perl code busily servicing critical needs. The Perl Foundation is looking for input on what they can do to better support the community.

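The title is "Coding in Perl? What support do you need?".
To support engineers who are looking to move to Perl or to grow with Perl, a survey that takes only a few minutes is under way to find out what they want or need.
The survey runs throughout January, and the results will be presented at FOSDEM, mentioned in the previous item.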

A walkthrough of setting up a build and deployment pipeline for AWS ECS using Terraform, Terragrunt and GitHub Actions.

The title is "CI/CD Workflow for AWS ECS via Terragrunt and GitHub Actions".
It walks through the title topic in the following order, with diagrams and color-highlighted code that make it easy to follow.
- Initial Setup
- Workflow via GitHub Flow
- Configure Infrastructure and Deployment Targets
- Configure Container Environment and Secrets
- Integration via GitHub Actions – Pytest
- Deployment via GitHub Actions – Terragrunt
- Conclusion

A great resource for learning Google Cloud Platform, this repo contains comprehensive sketchnotes covering all of the main GCP services.

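As described above, this is a GitHub page of sketchnotes covering the main GCP services. One of the topics is "Next 2020 Summary Announcements"; a summary of the services announced at an event like this seems helpful for getting a picture of them.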

A good reading list for anyone moving into more management roles in software.

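The title is "Recommended Engineering Management Books".
Book recommendations from an author who has been an engineering manager for the past three and a half years. The author presents a carefully selected list of books that helped and influenced them "in the process of growing as an engineering manager, an entirely new challenge after more than 10 years as a professional software engineer", and strongly recommends them to engineering managers.
The list is below. It is nice that the author explains what was good about each book, drawing on personal experience.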
- The Manager’s Path: A Guide for Tech Leaders Navigating Growth & Change by Camille Fournier
- Thanks for the Feedback by Douglas Stone & Sheila Heen
- The Hard Thing About Hard Things: Building a Business When There are No Easy Answers by Ben Horowitz
- Accelerate: Building and Scaling High Performing Technology Organizations by Nicole Forsgren, PhD, Jez Humble, and Gene Kim
- Dare to Lead: Brave Work. Tough Conversations. Whole Hearts. by Brené Brown
- Switch: How to Change Things When Change is Hard by Chip Heath & Dan Heath
- Atomic Habits: An Easy & Proven Way to Build Good Habits by James Clear

A how to for setting up Kubernetes to use AWS EC2 spot instances to reduce cost and maintain a zero-downtime cluster as instances come and go.

The title is "Run Kubernetes Production Environment on EC2 Spot Instances With Zero Downtime: A Complete Guide".
Omitted here because it was covered in last week's KubeWeekly #245.

Events

An introduction to a webinar on the theme of "Alert Fatigue".
It is scheduled for Thursday 1/14 at 11:00 CET (Central European Time), which is 19:00 Japan time.
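
The time conversion can be checked with a few lines of Python. This is a minimal sketch that assumes fixed UTC offsets (CET = UTC+1 in January, JST = UTC+9); the variable names are mine and do not come from the event page.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets are enough here: in January, Central European Time is UTC+1,
# and Japan Standard Time is UTC+9 all year (Japan has no daylight saving time).
CET = timezone(timedelta(hours=1), "CET")
JST = timezone(timedelta(hours=9), "JST")

# Webinar start time as announced: 2021-01-14 11:00 CET
start_cet = datetime(2021, 1, 14, 11, 0, tzinfo=CET)
start_jst = start_cet.astimezone(JST)

print(start_jst.strftime("%Y-%m-%d %H:%M %Z"))  # 2021-01-14 19:00 JST
```

For dates during European summer time, the fixed offset would be wrong, and a time zone database (for example Python's `zoneinfo` module) would be the safer choice.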

SRE Weekly Issue #252 January 10th, 2021

Articles

Building On-Call Culture at GitHub

Their on-call started out as four 24 hour shifts per person interspersed throughout the year. Find out how they transitioned to a new approach in a process that spanned the start of the pandemic.

Mary Moore-Simmons — GitHub

It covers the title topic under the following main headings.
- Monolithic On-Call
- New On-Call Culture
- Continuing the Journey
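I found the expression "Monolithic On-Call" and the hurdles seen from various angles interesting.
I do think the text is packed a little too tightly. A few more line breaks would make it easier to read.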

Google Cloud Issue Summary — Google Meet — 2020-12-14

A new Meet version had a higher storage usage requirement, and a backend system filled up.

Google

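A summary of the incident that occurred in Google Meet on 2020-12-14 from 08:20 AM to 11:36 AM (PST). The cause was a sharp rise in storage usage when a new feature was released, which exhausted the resources of one data store. The measures to prevent recurrence are as follows.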
- Review alerting processes to improve detection of data store capacity issues
- Adjust automated monitoring system logs to be more concise and exact to assist in troubleshooting
- Evaluate existing troubleshooting processes to determine available improvements to mitigation and resolution times.

WTF is Alert Fatigue

This is a webinar on alert fatigue, coming up on January 14.

Sarah Wells — Financial Times Jamie Dobson — Container Solutions

Omitted here because it is covered above in DEVOPS WEEKLY ISSUE #524.

Announcing the Security Chaos Engineering Report

The chaos experiments you do for security purposes can often expose weak points in reliability as well.

Aaron Rinehart — Verica Kelly Shortridge — Capsul8

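The first article in a multi-part series about a free O'Reilly report.
After delivering the line below, it gives an overview of the report while touching on Security Chaos Engineering (SCE) and "ChaoSlingr", a core tool of SCE.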
- Hope isn’t a strategy. Likewise, perfection isn’t a plan.

Little Known Ways to Better Use Your Error Budgets

Here are four nifty outside-the-box ideas to use the data you may already have.

Emily Arnott — Blameless

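It explains how error budgets can help cross-functional teams across the whole organization, such as QA, legal, and executives, under the following items. It also touches on how engineers can use error budgets beyond development planning.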
- Legal teams can use error budgets as early warnings
- Executives can use error budgets to take the pulse of development
- Error budgets and SLOs elevate the role of QA
- Error budgets provide objectivity for experimentation
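
For readers new to the term, an error budget is the amount of unreliability that an SLO permits over a given window. The sketch below shows only the basic arithmetic; the 99.9% target, the 30-day window, and the function names are illustrative assumptions, not figures or code from the article.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# Assumed example: a 99.9% availability SLO over 30 days.
budget = error_budget_minutes(0.999, 30)
print(round(budget, 1))  # 43.2 minutes of allowed downtime

# After 10.8 minutes of downtime, three quarters of the budget is left.
print(round(budget_remaining(0.999, 30, 10.8), 2))  # 0.75
```

A single number like the remaining fraction is the kind of signal that non-engineering teams such as legal or executives could follow without reading raw monitoring data.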

Lessons learned in incident management

Their custom incident management tool, DropSEV, can detect incident-worthy availability drops and file an incident automatically, obviating the need for an engineer to decide on severity level on the fly.

Joey Beyda and Ross Delinger — DropBox

It explains in detail the lessons Dropbox learned in incident management, divided into the following six items.
1. Background
2. The SEV process
3. Detection
4. Diagnosis
5. Recovery
6. Continuous improvement
The authors hope the article will serve as a case study in how an organization can systematically understand its own incident response and evolve it to fit user needs.
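
The automatic detection described for DropSEV can be pictured as mapping an observed availability drop to a severity level. The following is only a sketch of that idea: the thresholds, the severity labels, and the function name are hypothetical and are not taken from Dropbox's tool.

```python
from typing import Optional

# Hypothetical thresholds: (minimum availability drop in percentage points, severity).
# Ordered from most to least severe. These are NOT Dropbox's real values.
SEVERITY_THRESHOLDS = [
    (5.0, "SEV1"),
    (1.0, "SEV2"),
    (0.1, "SEV3"),
]


def classify_drop(baseline_pct: float, current_pct: float) -> Optional[str]:
    """Return a severity label if availability dropped enough, else None."""
    drop = baseline_pct - current_pct
    for min_drop, severity in SEVERITY_THRESHOLDS:
        if drop >= min_drop:
            return severity
    return None


print(classify_drop(99.95, 99.9))  # None (a drop of about 0.05 points is below every threshold)
print(classify_drop(99.95, 98.0))  # SEV2 (a drop of about 1.95 points)
print(classify_drop(99.95, 90.0))  # SEV1 (a drop of about 9.95 points)
```

Encoding the mapping as data means the severity decision is made ahead of time, so nobody has to choose a level under pressure during the incident.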

GitHub Availability Report: December 2020

This one has some additional detail on a November outage involving MySQL replication lag.

Keith Ballinger — GitHub

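The December 2020 edition of GitHub's monthly Availability Report, which this blog has covered several times.
Because no incidents leading to service downtime occurred in December, the report provides an overview of how the incident described in the November report was handled, along with follow-up details.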

Outages

Slack
My first couple hours of work this year were oddly quiet…
Heroku
Google Meet
This is different from the one above.
Fanduel
Twitch
Coinbase
Archive of Our Own

Outage information for each of the companies above.

KubeWeekly #246 January 15th, 2021

The Headlines

Editor’s pick of the highlights from the past week.

CNCF Security Whitepaper Shows the Complexity of Securing Cloud Native Operations

Jack Wallen, The New Stack

Jack Wallen of The New Stack dives into CNCF’s Security whitepaper that focuses on the security of cloud native applications and highlights key learnings. The whitepaper discusses everything from cloud native layers, to the full lifecycle of development, to compliance (and everything in between).

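Introduces the "Security whitepaper" released by CNCF.
It touches on the need to develop and manage systems whose complexity spans multiple layers of the cloud, and digs into the whitepaper from an administrator's perspective.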

Hello folx - @CloudNativeFdn is hiring! Apply to work with yours truly and the rest of our awesome team as a Content and Comms person to up level our storytelling and communication across various channels. https://t.co/tYcrl48IIl
Priyanka Sharma (@pritianka), January 14, 2021

The Technical

Tutorials, tools, and more that take you on a deep dive into the code.

Analyze Kubernetes files for errors with KubeLinter

Jessica Cherry, Opensource.com

An article explaining "KubeLinter", an open source tool from StackRox (whose acquisition Red Hat has announced) that analyzes YAML files for security issues and erroneous code.
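
To give a feel for what a manifest linter checks, here is a toy sketch in Python. It is not KubeLinter's implementation or rule set: the two checks, the function name, and the sample manifest are assumptions chosen for illustration, and the manifest is written as a Python dict to avoid a YAML dependency.

```python
from typing import Dict, List


def lint_pod_spec(manifest: Dict) -> List[str]:
    """Return human-readable findings for a Deployment-like manifest."""
    findings = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Look for a tag only in the last path segment, so a registry port
        # such as "registry:5000/app" is not mistaken for a tag.
        last_segment = image.rsplit("/", 1)[-1]
        tag = image.rsplit(":", 1)[1] if ":" in last_segment else ""
        # Flag images with no tag or with the mutable "latest" tag.
        if tag in ("", "latest"):
            findings.append(f"{name}: image '{image}' is not pinned to a version")
        # Flag containers that declare no resource limits.
        if not container.get("resources", {}).get("limits"):
            findings.append(f"{name}: no resource limits set")
    return findings


# Hypothetical manifest, equivalent to what a YAML file would describe.
deployment = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "web", "image": "nginx:latest"},
        {"name": "api", "image": "example/api:1.4.2",
         "resources": {"limits": {"memory": "128Mi"}}},
    ]}}},
}

for finding in lint_pod_spec(deployment):
    print(finding)
```

Here only the `web` container is reported, once for its `latest` tag and once for its missing limits. A real linter covers far more cases and reads the YAML files directly.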

Getting started with Buildah

Cedric Clyburn, Red Hat

As the title says, it explains how to get started with "Buildah". A YouTube video of an interactive session is also embedded in the web page.

Isolate a Pod in Kubernetes

Salman Iqbal

Continuing from last week, this picks up Salman's YouTube webinar series explaining how individual Kubernetes components behave. At around 10 minutes, it is easy to watch.

Build Your Kubernetes Operator With the Right Tool

Alex Handy, Red Hat

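Noting that there are currently many tools to choose from when building a Kubernetes Operator for your software, it explains various approaches that make it easier to decide according to your use case.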

The Editorial

Articles, announcements, and more that give you a high-level overview of challenges and features.

Sysdig 2021 container security and usage report: Shifting left is not enough

Aaron Newcomb, Sysdig

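As the title says, this is Sysdig's annual report, now in its fourth edition. It also goes into detail on metrics usage, popular alerts, container density trends, and Kubernetes usage patterns.
The figures and percentages for each element are presented clearly through a combination of charts and graphs.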

Vertical Pod Autoscaling: The Definitive Guide

Povilas Versockas

As the author puts it, this is a "Definitive/Complete guide" that comprehensively explains vertical scaling of Pods under the following items. An article I want to read again later.
- Why we need Vertical Pod Autoscaling?
- Kubernetes Resource Requirements Model
- What is Vertical Pod Autoscaling?
- Understanding Recommendations
- When to use VPA?
- VPA Limitations
- Real-World Examples
- How does VPA work?
- VPA’s Recommendation model
- Lots more

What’s Your Kubernetes Maturity?

Danielle Cook, Fairwinds

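It explains the "Kubernetes Maturity Model", which provides an end-to-end overview of the Kubernetes journey, the phases you pass through, and the skills and activities you need to learn or carry out in each.
To check the details of each phase, follow the link here. The article itself keeps to a brief summary of each phase.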
- Phase 1 Prepare
- Phase 2 Transform
- Phase 3 Deploy
- Phase 4 Build Confidence
- Phase 5 Improve Operations
- Phase 6 Measure & Control
- Phase 7 Optimize & Automate

#WebAssembly is one of the hot technologies for 2021 (at least that’s what we think on @CloudNativeFdn TOC). @linuxfoundation have a free course on it coming up: https://t.co/opHmdUg5KC
Liz Rice 🏡 (@lizrice), January 13, 2021

Upcoming CNCF Online Programs

We have expanded our webinar program to Online Programs! Visit our website for the latest updates.

I checked the linked page, and as of 2021/01/16 "Upcoming webinars" showed "No Results Found", so this year's webinars seem to be still awaiting an update.

How was it? Did any article or piece of information catch your interest?

There is still a lot here that I have not fully digested myself, so I intend to deepen my understanding while making use of this memo and link collection.

See you next time.

Bye now!!

Yoshiki Fujiwara