SRE / DevOps / Kubernetes Weekly Report Summary #42 (11/15 to 11/20) - 運び屋 (A carrier (forwarder) changed his career to an engineer)

This post is a memo and link collection that I keep after reading the following three Weekly Reports published between 2020/11/15 and 2020/11/20.

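Because I want to deliver and share the information as quickly as possible, I publish the post as soon as I have checked the blog links, and I add my own comments afterward as I go.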

I hope it serves as an information source for someone, or saves some search effort.

DEVOPS WEEKLY ISSUE #516 November 15th, 2020

SRE Weekly Issue #244 November 15th, 2020

KubeWeekly #242 ← Apparently no issue this week (as of 2020/11/22 10:00 JST: no update, and no notice of a break could be found)

English Version of this blog is here.

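If anything in this post is unclear or raises questions, I would appreciate it if you could check the original article via its URL and then let me know.
I make a point of commenting even on areas I do not understand well, so there are probably many misunderstandings caused by my own mistakes or insufficient explanation.
Because of the volume of information, I limit the post to text and links.
Some of the articles featured in each report date from 2019 or earlier, so they are not necessarily the latest.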

DEVOPS WEEKLY ISSUE #516 November 15th, 2020

News

A series of videos on building a modern CI/CD pipeline for a typical Java application using ArgoCD and Tekton.

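The title is "Video course on cloud-native CI/CD with Tekton & ArgoCD".
It explains how to implement a proper continuous delivery pipeline for a modern enterprise Java project using Tekton and ArgoCD. The nine YouTube videos in which the author walks through it are embedded in the web page.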

A talk (video and slides) about measuring continuous delivery through lead time, deployment frequency, change failure rate and time to recovery.

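The title is "Measuring DevOps".
Google's research has developed and validated the following four metrics, which provide a high-level systems view of software delivery and performance and predict an organization's ability to achieve its goals. The video shows how to automate the generation and collection of these four metrics with GCP and Tekton.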
- Deployment Frequency
- Lead Time to Change
- Change Fail Rate
- Time to Restore
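
As a rough illustration of what these four metrics measure, here is a minimal Python sketch that derives them from a list of deployment records. It is not the GCP and Tekton automation shown in the video. The `Deployment` record, its field names, and the choice of the median as the summary statistic are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; a real pipeline would export its own fields.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deployments, period_days):
    """Summarize the four metrics over a period of period_days days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at
                     for d in failures if d.restored_at is not None]
    return {
        # Deployment Frequency: deployments per day in the period
        "deployment_frequency_per_day": len(deployments) / period_days,
        # Lead Time to Change: commit to production (median, an assumption)
        "median_lead_time": median(lead_times),
        # Change Fail Rate: share of deployments that caused a failure
        "change_fail_rate": len(failures) / len(deployments),
        # Time to Restore: failure to recovery (median, an assumption)
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Calling `four_key_metrics(records, period_days=7)` on a week of records returns a dictionary with one entry per metric. The two duration metrics are `timedelta` values.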

A look at building on top of the new Pulumi automation API, and some thoughts about the emergence of platform teams.

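The title is "Self-Service Platform Development Made Easy with Pulumi".
Drawing on experience consulting for dozens of internal platform teams, the author touches on Puppet's "the State of DevOps Report 2020", which appears in a later item, says it was surprising that every section is about "Scaling DevOps practices with internal platforms", and explains its contents.
The post also introduces "Cloud platform", a prototype that the author built and open-sourced. It gives platform teams an easy way to create self-service offerings.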

A post talking about a specific example of helping development teams address security problems (in this case leaking sensitive data in log files) and how to embed a culture of security in engineering organizations.

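The title is "Fixing leaky logs: how to find a bug and ensure it never returns".
It presents a case of putting security enforcement into developers' hands. The author explains how the author and fellow developers successfully identified data leaking into logs, fixed the problem, and put measures in place to keep it from happening again.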

A good post on the different challenges posed by edge environments around hardware discovery, manageability, provisioning and more.

The title is "Facing Challenges at the Edge".
It explains in detail several challenges around the manageability of platforms and apps at the edge, along with potential designs and solutions.

The latest Puppet State of Devops Report is available, with some interesting industry stats and analysis, in particular around platform teams and change management.

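This item points to two DevOps articles. The first link above is titled "2020 State of DevOps Report is here!". The survey found a strong relationship between DevOps evolution and the use of internal platforms, and the analysis revealed the following four distinct approaches to change management, based on approval processes (orthodox and adaptive), automated testing and deployment, and advanced risk mitigation techniques.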
- Operationally mature
- Engineering driven
- Governance focused
- Ad hoc
The second is titled "Two secret weapons DevOps can use to take over the entire enterprise". It also touches on the report above and shares ideas about internal platform teams.

Events

WTF is Platform as a Product? Companies are going full speed ahead into treating their platforms as products. But WTF does that mean? And WTF are the advantages? In this free 90-minute event on 19 November, you’ll get insight from Matthew Skelton, co-author of Team Topologies, and Jamie Dobson, CEO of Container Solutions, with a special appearance by Dave Farley! Register now.

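Continuing from last week, this issue features an event by Container Solutions. This time the guest is Matthew Skelton, co-author of "Team Topologies". As described above, a 90-minute session is scheduled. It starts on Thursday 11/19 at 14:15 CET (Central European Time), which is 22:15 Japan time.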

Books

SRE: The Cloud Native Approach to Operations explains how SRE, or Site Reliability Engineering, can help your organisation balance innovation with reliability. In this new e-book from Michael Mueller, a managing director at Container Solutions, you’ll learn what SRE is, and why you might need it; the differences between SRE and DevOps; best practices, and more. Get your free e-book.

This introduces "SRE: The Cloud Native Approach to Operations", a free e-book from Container Solutions.
Enter your full name and email address, and a link to the download page is sent by email. It is 30 pages long.

Tools

ctlptl aims to make it easier to grab an ephemeral, local, Kubernetes cluster for development purposes. Rather than competing with Docker Desktop, KIND, Minikube or similar tools it provides a higher-level user interface.

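The GitHub page of "ctlptl", a CLI tool for declaratively setting up local Kubernetes clusters.
I like that the Goals and Non-Goals are stated clearly and that it introduces itself with reference to other tools.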

Athenz is a platform for X.509 certificate based service authentication and fine grained access control in dynamic infrastructures. It supports provisioning and configuration (centralized authorization) use cases as well as serving/runtime (decentralized authorization) use cases.

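The GitHub page of "Athenz", an open source platform for X.509 certificate based service authentication and fine-grained access control in dynamic infrastructures.
It supports provisioning and configuration (centralized authorization) use cases as well as serving/runtime (decentralized authorization) use cases. The Athenz authentication system uses X.509 certificates and industry-standard mutual TLS bound OAuth 2 access tokens.
The name "Athenz" comes from "AuthNZ" (N for authentication, Z for authorization).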

PowerfulSeal is a chaos testing tool for Kubernetes. Describe scenarios in YAML and PowerfulSeal can kill running resources and check services are still running, and export results to Prometheus and other monitoring tools.

Skipped, since last week's KubeWeekly covered it.

K0s is a new small Kubernetes distribution intended for anything from local development usage to large-scale edge deployments.

This introduces "K0s", a new Kubernetes distribution. It is developed by the team behind Lens.
The .io page is here.
The answer to "Why zero?" is as follows.
- The “zero” in k0s really captures our aspiration to not compromise as we build the ultimate Kubernetes distribution:
  - Zero Friction
  - Zero Dependencies
  - Zero Overhead
  - Zero Cost
  - Zero Downtime

SRE Weekly Issue #244 November 15th, 2020

Articles

Type in the exact number of machines to proceed

If you’re gonna operate on a pile of computers all at once that numbers 6+ figures, making you type that number in is a way to make you pause and think about what you’re doing.

Rachel by the bay

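When operating on many machines from a CLI and generating a confirmation prompt as a sanity check, the post suggests having the operator read and type the number of machines, as below, instead of asking for a Y / N answer.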
- Blah blah blah 123,456 machines will be affected. Proceed?
  
  Enter number of machines to confirm: 123456
  OK! Continuing.
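Indeed, when asked only whether to proceed, people just press Yes. Showing the target and having the operator type the number should keep them from misjudging the blast radius or unknowingly sweeping a large number of machines into an operation.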
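
A prompt of this kind could be sketched in Python roughly as follows. This is only an illustration of the idea, not code from the original post. The function name `confirm_by_count`, the `read_input` parameter (added so the prompt can be exercised without a terminal), and the decision to accept thousands separators are all assumptions.

```python
def confirm_by_count(machine_count, read_input=input):
    """Ask the operator to type the exact number of affected machines.

    Returns True only when the typed value equals machine_count.
    """
    print(f"{machine_count:,} machines will be affected. Proceed?")
    answer = read_input("Enter number of machines to confirm: ")
    # Accept "123456" or "123,456", but nothing that is not the exact number.
    typed = answer.strip().replace(",", "")
    if typed.isdecimal() and int(typed) == machine_count:
        print("OK! Continuing.")
        return True
    print("Count does not match. Aborting.")
    return False
```

A habitual "y" or "yes" is rejected here, because the only accepted answer is the number that was just displayed.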

IT metrics: Why the five 9s must go

Find out why they decided to focus less on nines, and what they did instead.

Robert Sullivan

It points out the problems with piling up "nines" of availability and introduces the following approach devised by the company's SRE, L1/L2 support, and operations teams.
1. Count all the minutes that affect business performance. These are “impacted” minutes.
2. Not all incidents are the same, so it’s important to agree on definitions. Here’s what we decided:
3. Count all impacted minutes (global, partial, and degraded) against the total number of minutes in a month.
4. Meet with business leadership regularly (we do this weekly) to discuss the numbers and the impact of service interruptions on the business.
5. Track instances in which your monitoring leads to action that avoids impacted time. (We refer to these as “mitigated events.”)
6. Count the minutes that high-availability services were not fully redundant.
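
To make step 3 concrete, the arithmetic can be sketched in Python as below. The category names global, partial and degraded come from the list above. Everything else is an assumption for illustration: the `(category, minutes)` input shape, the function names, and counting every category's minutes with equal weight. The article's own definitions are not reproduced here.

```python
import calendar


def minutes_in_month(year, month):
    """Total number of minutes in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24 * 60


def impacted_share(incidents, year, month):
    """Sum impacted minutes per category and compare them to the month.

    incidents is a list of (category, minutes) pairs, where category is
    "global", "partial" or "degraded". All categories are weighted equally,
    which is an assumption of this sketch.
    """
    totals = {"global": 0, "partial": 0, "degraded": 0}
    for category, minutes in incidents:
        totals[category] += minutes
    impacted = sum(totals.values())
    return totals, impacted, impacted / minutes_in_month(year, month)
```

For example, 108 impacted minutes in a 30-day month (43,200 minutes) is 0.25% of the month. That is a number that can be put in front of business leadership alongside the incidents that produced it.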

Rule 1: It’s ALWAYS DNS

Reminds me of the classic:

It’s not DNS There’s no way it’s DNS It was DNS

— (ssbroski on reddit)
Mike S.

This is a 2017 article, so it is fairly old. DNS is hard, and the "network solutions haiku" link made me laugh.

Moving OkCupid from REST to GraphQL

Their front-end made duplicate calls to the new API to test load and response time prior to cutting over.

Michael P. Geraci — OkCupid

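A case study from OkCupid, which migrated from a REST API to a GraphQL API on a site with millions of users without degrading performance.
It shares the following four-step process, along with "things we should have done" learned through the release.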
1. Pick an appropriate page to convert
2. Build the schema
3. Add a shadow request to call the new API while still fetching data via the REST API
4. Do an A/B test with real users that changes the data source
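
The shadow request in step 3 can be sketched as follows. According to the newsletter summary, OkCupid issued the duplicate calls from their front-end. This sketch instead uses Python with a thread pool purely to show the pattern, and the names `fetch_with_shadow`, `fetch_rest`, `fetch_graphql` and `record` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Small pool so shadow traffic cannot starve the real requests (assumed size).
_shadow_pool = ThreadPoolExecutor(max_workers=4)


def fetch_with_shadow(fetch_rest, fetch_graphql, record):
    """Serve data from REST while exercising GraphQL in the background.

    fetch_rest and fetch_graphql are zero-argument callables supplied by the
    caller; record receives a dict with the shadow call's outcome and timing.
    Returns the REST result plus the future of the shadow call.
    """
    def shadow():
        started = time.monotonic()
        try:
            fetch_graphql()
            ok = True
        except Exception:
            ok = False  # a shadow failure must never affect the user
        record({"ok": ok, "seconds": time.monotonic() - started})

    future = _shadow_pool.submit(shadow)
    # The user-visible data still comes from the existing REST API.
    return fetch_rest(), future
```

The response of the new API is discarded and only its success and latency are recorded, so the load and response time of the new endpoint can be observed before any user depends on it.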

New Arctic Air Crash Aftermath Role-Play Simulation Orchestrating a Fundamental Surprise

This is really cool. The researchers created a role-play scenario based on a real plane crash. They tried to get participants to blame “human error”, so that they could then surprise them with all of the (many) contributing factors that were involved.

Emily S. Patterson, Richard I. Cook, David D. Woods, Marta L. Render

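The researchers deliberately create a mismatch between the lessons learned during the role-play and the potential lessons, constructing a "fundamental surprise" situation in which participants are led to make oversimplified assumptions about complex systems. Interesting.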

From Sysadmin to SRE

Tips from one Sysadmin’s journey to becoming an SRE.

Josh Duffney — Octopus Deploy

In line with the title, the author shares thoughts under the following headings for people who are new to the industry.
- It isn’t about tools, but...
- Learn to code from the command-line
- Start at the source
- Pull requests mean deployments

Outages

YouTube
Macs
Mac users had issues launching applications, owing to an outage of ocsp.apple.com. Apple confirmed the issue.
PrometheusKube
The link points to their awesome writeup of what went wrong and the on-the-fly reworking they had to do to fix it.
Instagram
Hotmail
Various stock trading platforms
There’s some speculation that this was a result of increased trading volume following Pfizer’s announcement about vaccine trial results.
Robinhood
Increased Error Rates

Outage information for the services listed above.

KubeWeekly #242 ← Apparently no issue this week.

How was it? Did any of the articles or information catch your interest?

There is still a lot I have not fully digested myself, so I intend to deepen my understanding while making use of this memo and link collection.

See you next time.

Bye now!!

Yoshiki Fujiwara
