SRE / DevOps / Kubernetes Weekly Report Roundup #47 (12/20~12/25) - 運び屋 (A carrier(forwarder) changed his career to an engineer)

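This post is a memo and link collection written after reading the following three Weekly Reports published 2020/12/20~12/25.
Because I want to deliver and share the information as quickly as possible, I publish ahead of time as soon as I have confirmed the blog links. My own comments are added as I go.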
English Version of this blog is here.
DEVOPS WEEKLY ISSUE #521 December 20th, 2020
- News
SRE Weekly Issue #249 December 20th, 2020
- Articles
- Outages
KubeWeekly #245 December 25th, 2020 ← To be updated once received. There has been no announcement, but it may be on a holiday break.

I hope this serves as an information source for someone, or saves them some search effort.

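If anything in this post is unclear or questionable, I would appreciate it if you could check the original text via the URL and then point it out.
I try to comment even on topics I understand only shallowly, so there are likely many misunderstandings caused by my own mistakes or insufficient explanation.
Because there is a lot of information, I have limited the post to text and links.
Some of the articles featured in each report date from 2019 or earlier, so they are not necessarily the latest.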

DEVOPS WEEKLY ISSUE #521 December 20th, 2020

News

There are lots of tools for storing data, but how do you find the right dataset for analysis? This post explores a number of different architectural approaches and discusses pros and cons.

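The title is "DataHub: Popular metadata architectures explained".
It describes the three generations of architecture the industry has produced so far for data discovery tools, and explains where many of the well-known options fall along that spectrum.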
- First-generation architecture: Monolith everything
- Second-generation architecture: 3-tier app with a service API
- Third-generation architecture: Event-sourced metadata
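The progress between generations is also reflected in the evolution of the architecture of DataHub at LinkedIn, the company that published this article. LinkedIn has promoted the latest best practices through the following open-source releases.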
- (first open sourced and shared with the world as WhereHows in 2016, and then rewritten completely and re-shared with the open source community in 2019 as DataHub).
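
To make the third generation more concrete, here is a minimal sketch of the general event-sourcing idea: metadata changes are appended to a log, and any consumer can rebuild its own view by replaying that log. This is a generic illustration, not DataHub's actual API; the names `MetadataChangeEvent`, `MetadataLog`, `emit`, and `materialize`, and the identifier format, are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class MetadataChangeEvent:
    """Hypothetical event shape: one change to one aspect of one entity."""
    urn: str      # entity identifier (format assumed for illustration)
    aspect: str   # which facet changed, e.g. "ownership" or "schema"
    value: Any    # the new value of that aspect


@dataclass
class MetadataLog:
    """Append-only log of changes; views are derived by replaying it."""
    events: List[MetadataChangeEvent] = field(default_factory=list)

    def emit(self, event: MetadataChangeEvent) -> None:
        # Producers only append; nothing is updated in place.
        self.events.append(event)

    def materialize(self) -> Dict[str, Dict[str, Any]]:
        # Replay every event in order; the latest value per aspect wins.
        view: Dict[str, Dict[str, Any]] = {}
        for event in self.events:
            view.setdefault(event.urn, {})[event.aspect] = event.value
        return view
```

Because the log keeps every change, a search index or a lineage graph could each be built from the same events without the producers knowing about them.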

A good writeup from a recent AWS re:Invent talk focused on AWS’s Serverless services. This post focused on what this means for operations, which is often neatly ignored in the marketing.

The title is "Does AWS Serverless care about IT Operations? Their service naming says "no" but their breadth and quality of choice says "yes"".
It opens by touching on what "serverless" means: not literally doing away with servers. The author puts it as follows.
- I believe quite the opposite, that serverless is the wave beyond VM configuration management in empowering operations-minded people to reclaim their focus, creativity, and business relevance.
It picks out the serverless-related announcements from AWS re:Invent and explains them theme by theme.
- I wrote operations in this post about as many times as AWS uses the word innovation in their presentations, but I’m walking away from re:Invent with the impression that AWS is serious about both.

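The title is "Raft does not Guarantee Liveness in the face of Network Faults".
It refers to the Cloudflare postmortem "A Byzantine failure in the real world", covered here previously, and, following the discussion on Twitter about the Raft distributed consensus algorithm, explains the issue through the following three questions.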
- Does Raft guarantee liveness in the presence of network failures?
- So, does Raft with PreVote guarantee liveness then?
- Does Raft with PreVote and CheckQuorum guarantee liveness?
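
As a rough illustration of the CheckQuorum idea named in the last question, the sketch below shows the core check as it is commonly described: a leader that has not heard from a majority of the cluster within an election timeout steps down. This is a simplified assumption-based sketch, not code from the article or from any Raft library; the function name and parameters are hypothetical.

```python
from typing import Dict


def leader_should_step_down(
    last_heard: Dict[str, float],  # follower id -> time of its last response
    now: float,
    election_timeout: float,
    cluster_size: int,             # total nodes, including the leader
) -> bool:
    """Return True if the leader can no longer confirm a majority."""
    # The leader always counts itself as active.
    active = 1 + sum(
        1 for heard_at in last_heard.values() if now - heard_at <= election_timeout
    )
    # A majority needs strictly more than half of the cluster.
    return active <= cluster_size // 2
```

In a five-node cluster, the leader keeps its role as long as at least two followers have responded recently.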

This long read introduces YOLOSec and FOMOSec as terms to describe problematic but all-too-common approaches to security strategy, driven either by short-termism or by chasing fashion.

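The title is "On YOLOsec and FOMOsec".
The author, who coined the terms, explains why both YOLO security (YOLOsec) and FOMO security (FOMOsec) are harmful to infosec defense, and how to spot them so that an organization's security strategy can be protected from them.
The moment I saw "33 minutes" at the top left of the title, my spirit was fairly well broken. Excerpts from the tl;dr and the Conclusion follow.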
- The tl;dr is that #yolosec and #fomosec are disconnected from the goals and needs of the business, forsaking pragmatism and prudence in favor of fanatical flavors of recklessness. YOLOsec reflects a security strategy driven by a “you only live once” mentality – one that emboldens people to ignore future concerns around security to achieve today’s gratification. FOMOsec reflects a security strategy driven by a fear of missing out – one that frightens people into misallocating resources towards what makes them feel better about their security efforts.
- If security must shun both YOLOsec and FOMOsec, how should it look instead? To simultaneously alleviate a longing for belonging, envy, and myopia, infosec defenders must seek out and share the identity of “builder” with software engineers. Aligning infosec metrics to software delivery metrics facilitates the alignment of infosec work to software delivery work. Acting upon this alignment – not just paying lip service – engenders the opportunity for security teams to more tangibly connect the work they perform with value and meaning produced.

More and more teams are now needing to manage multiple Kubernetes clusters. This post takes a look at the monitoring challenges that brings, and how to solve them with Prometheus and Grafana.

The title is "How to monitor multi-cloud Kubernetes with Prometheus and Grafana".
This was covered in last week's KubeWeekly #244, so I will skip it.

A post exploring DNS routing in Kubernetes, stepping through several potential solutions to a specific problem.

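The title is "Forbidden lore: hacking DNS routing for k8s".
A story of wrestling with DNS: there are multiple registries in Harbor, and the author tries to make container image pulls point at a different registry depending on how they are used.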

Ensuring SSL certificates don’t expire is an essential if annoying problem, and several services exist to help. This post runs down a list of different solutions.

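The title is "10 Best Tools to Monitor SSL Certificate Expiry, Validity & Change".
As the title says, it walks through the following 10 tools for monitoring SSL certificate expiry, validity, and changes, each with figures and other visuals.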
1. Sematext Synthetics
2. TrackSSL
3. Pingdom
4. Smartbear
5. Keychest
6. Site24x7
7. Sucuri
8. SSL Certificate Expiration Alerts
9. Certificate Expiry Monitor
10. SSL Certification Expiration Checker
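
For context, the basic check that tools like these automate can be sketched with Python's standard library alone. This is only a minimal sketch under stated assumptions, not how any of the listed services works: the helper names `fetch_not_after`, `days_until_expiry`, and `needs_renewal` and the 30-day threshold are hypothetical. `ssl.cert_time_to_seconds` and `SSLSocket.getpeercert` are standard-library calls.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the server certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    With the default context the handshake verifies the certificate, so an
    already-expired certificate raises an ssl.SSLError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until the notAfter timestamp (negative once expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (threshold is an assumption)."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `fetch_not_after` for each hostname and alert when `needs_renewal` returns True; the hosted tools add change detection, dashboards, and notifications on top of this kind of check.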

A look at building a Kubernetes-based platform using Argo Workflows and Argo Events.

The title is "Building Kubernetes Clusters using Kubernetes".
It explains "how to build Kubernetes clusters using Kubernetes, with Argo Events and Argo Workflows".
It adds that SAP Concur, the subject of this article, uses EKS, and that the same concept can be applied to other cloud providers.
- Note: SAP Concur uses AWS EKS, and a similar concept can be applied to Google’s GKE, Azure’s AKS, or any other cloud provider’s Kubernetes offering.

SRE Weekly Issue #249 December 20th, 2020

Articles

Generic mitigations

Every service needs a couple of big hammers that are easy to swing.

Jennifer Mace — O’Reilly and Google

Explains the concept of "generic mitigation" with cute illustrations.

How Facebook keeps its large-scale infrastructure hardware up and running

Answer: automation. Lots of automation. And automation of the automation.

Fred Lin, Harish Dattatraya Dixit, and Sriram Sankar — Facebook

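The flow diagrams are easy to follow: they show tools chained together to detect hardware failures automatically and periodically, fire alerts, and repair automatically.
The post also introduces four papers for readers who want the details.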

Tips for On Call Engineers During the Holidays

Oh, how quaint! This article was written back when people traveled for the holidays.

Ashley Roof —

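Introduces tips for on-call duty during the holiday season.
Because the people at Transposit know the pain of on-call firsthand, they banded together and came up with the following five tips to make holiday shifts as painless as possible.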
- Share the love (or spread the pain) when organizing on call shifts, and incentivize communal behavior.
- Communicate early and often, with and without runbooks.
- Plan around potential travel problems
- Let friendly allies help you manage the social side of the situation
- Pat yourself and your team on the back

Raft does not Guarantee Liveness in the face of Network Faults

Surprise! Fortunately, there are some ways to fix this limitation.

Heidi Howard, Ittai Abraham — Decentralized Thoughts

Covered above under DEVOPS WEEKLY ISSUE #521, so I will skip it.

Anatomy of Unsuccessful Incident Management

A common question when a company is implementing incident management is: why do we need this process?

It turns out that the easiest way to answer this question is to look at the world of unsuccessful incident management.

Kintaba

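As the easiest way to answer "why do we need this process?", a common question when a company is implementing incident management, it describes the following characteristics of unsuccessful incident management.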
- Confusion about Process
- Panic and Thrash
- Lack of Awareness
- Blame
- Uncoordinated & Conflicting Response
- Confusion over Ownership
- Repeat Problems

Just Culture: Standardizing Fire Service Accountability

Whether you’re new to Just Culture or an old hand, there’s a lot of great detail in this article.

Tory Thompson — Firehouse

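It explains "Just Culture" as an industry term for a values-based accountability model that examines the behaviors, systems, and expectations that make up an organization.
Fostering a just culture requires a multifaceted approach to managing risk, and it is important to take a holistic approach when investigating the problems and risks inherent in an organization's operations. The article covers this from the following angles.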
- Knowledge, systems, safeguards
- Human performance
- How we make mistakes
- Safety and reporting culture
- Systems and safeguards
- Our experience
- Standardization and bias reduction
- Big data
- Building trust

Let’s Talk: Full-Service Ownership

Not sold yet on full service ownership for development teams? This interview may help.

Vivian Chan — PagerDuty

It introduces the adoption of "full-service ownership" as a response to these challenges and answers questions in interview form. The questions are as follows.
- Q: First things first, what exactly is a service?
- Q: So what’s the big deal about full-service ownership? Why should IT and engineering leaders care? Paint me a picture.
- Q: What is one of the biggest drivers for moving to a model of full-service ownership?
- Q: Where does one even start?

Jeli.io: Supporting Grounded Incident Analysis

While ostensibly about Jeli.io, this article makes a great case for why incident analysis is important in general and what kind of data we should be trying to gather.

John Allspaw — Adaptive Capacity Labs

An article by an angel investor introducing "Jeli.io", an analysis platform specializing in software-related incidents.

Heroku incident #2130 follow-up: Heroku Connect Sync Issue

A new feature roll-out resulted in impaired service for some customers.

Incident information for Heroku's Heroku Connect. Synchronization with Salesforce was affected for 25% of connections in production.

Uber’s adventures in the adaptive universe

The adaptive universe: where adaptations to challenges feed back and cause more challenges, requiring more adaptations.

Lorin Hochstein

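It explains what the title and the editor's comment describe, but because the article is based on a Twitter thread by former Uber engineer McLaren Stanley, the author strongly recommends reading the original thread, as follows.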
- I highly recommend reading the original thread in full. My writing above is based solely on that thread, I don’t have any additional information, and I probably got some stuff wrong. I also created a concept map based on Stanley’s thread.

The Shadow Request: Troubleshooting OkCupid’s First GraphQL Release

Our first GraphQL release was twice as slow as our old REST API. Here’s how we fixed it.

Another great example of making a duplicate request to a new API in the background to test it before deploying it.

Michael P. Geraci — OkCupid

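Because the GraphQL API was being built on an entirely new stack, the team wanted to see how it measured up against the old REST API under real production load without hurting the user experience, so they released a "Shadow Request".
- With the Shadow Request, on the target page the user loads the page data from the REST API as usual and the page is displayed; the same data is then loaded from GraphQL, the timing of that call is measured, and the data is discarded.
It covers the improvements found in the Docker and Node environments, how GraphQL resolvers behave on lists of entities, and CORS requests.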
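
The shadow-request pattern described above can be sketched in a few lines. This is a minimal sketch in Python rather than OkCupid's client-side code, and it runs the shadow call sequentially for simplicity, whereas a real page would issue it in the background. The function name `with_shadow_request` and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, List, Tuple


def with_shadow_request(
    primary: Callable[[], Any],          # the existing call, e.g. the REST fetch
    shadow: Callable[[], Any],           # the new call under test, e.g. GraphQL
    timings: List[Tuple[float, bool]],   # collects (elapsed seconds, succeeded)
) -> Any:
    """Serve the primary result; time the shadow call and discard its data."""
    result = primary()  # the user-visible data always comes from the primary

    start = time.perf_counter()
    try:
        shadow()        # the response is deliberately thrown away
        succeeded = True
    except Exception:
        succeeded = False  # a failing shadow call must never affect the user
    timings.append((time.perf_counter() - start, succeeded))

    return result
```

The recorded timings can then be compared with the primary API's latency before any user is switched to the new stack.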

Outages

Google Workspace Status Dashboard
All Google services that use OAuth were unreachable due to an issue with Google’s User ID service. Click through for their report. This one caused issues for the start of my daughters’ school day since Meet and Classroom were down.
Google Cloud Status Dashboard
Gmail
Delivery of messages to @gmail.com addresses failed permanently and would not be retried. This report by Google has the details.
Instagram
Microsoft Outlook
Galileo (satellite navigation system)
Spotify

Outage information for the companies and services above.

KubeWeekly #245 December 25th, 2020 ← To be updated once received. There has been no announcement, but it may be on a holiday break.

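How was it? Did any of the articles or information catch your interest?

There is still a lot here that I have not fully digested myself, so I plan to deepen my understanding while making use of this memo and link collection.

See you next time.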

Bye now!!

Yoshiki Fujiwara