SRE / DevOps / Kubernetes Weekly Report Roundup #48 (12/27~1/1) - 運び屋 (A carrier (forwarder) changed his career to an engineer)

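The title is "How Shopify Uses WebAssembly Outside of the Browser".
It explains how Shopify came to choose WebAssembly, a universal format that ensures performance, security, and flexibility (background quoted below), and covers the choice from the security, performance, flexibility, and community-driven angles, along with the architecture.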
- We want Partners to focus on using their domain knowledge to solve problems, and not on managing scalable web services. To make this a reality we’re keeping the flexibility of untrusted Partner code, but executing it on our own infrastructure. We choose a universal format for that code that ensures it’s performant, secure, and flexible: WebAssembly.

Details of one team moving away from (some) microservices and merging code back into a monolithic application. Good discussion of the advantages and costs of microservices and how to right-size individual services.

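The title is "Why I've Been Merging Microservices Back Into The Monolith At InVision".
The author's team owns a legacy service that had been split into microservices; the post is about merging those services back into the monolith to right-size it.
As quoted below, the author states up front: "I am not anti-microservices. I wanted to right-size the monolith and solve my team's pain point."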
- To be very clear, I wanted to start this post off by stating unequivocally that I am not anti-microservices. My merging of services back into the monolith is not some crusade to get microservices out of my life. This quest is intended to "right size" the monolith. What I am doing is solving a pain-point for my team. If it weren't reducing friction, I wouldn't spend so much time (and opportunity cost) lifting, shifting, and refactoring old code.
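The perspectives on what problems microservices solve, how the company came to adopt them, and what the author would do if starting over were very helpful.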
- In short, all the benefits of Conway's Law for the organization have become liabilities over time for my "legacy" team. And so, we've been trying to "right size" our domain of responsibility, bringing balance back to Conway's Law. Or, in other words, we're trying to alter our service boundaries to match our team boundary. Which means, merging microservices back into the monolith.
- A far more helpful term would have been, "right sized". Microservices were never intended to be "small services", they were intended to be "right sized services."
- For my team, "right sized" means fewer repositories, fewer deployment queues, fewer languages, and fewer operational dashboards. For my rather small team, "right sized" is more about "People" than it is about "Technology". So, in the same way that InVision originally introduced microservices to solve "People problems", my team is now destroying those very same microservices in order to solve "People problems".

A good architecture post on building a realtime platform API, moving from polling to gRPC-based bi-directional streaming.

The title is "Uber’s Real-Time Push Platform".
How Uber moved its app updates from polling to a gRPC-based bidirectional streaming protocol to build its app experience.
I did a double take at the section "Eliminating polling, introducing RAMEN". RAMEN apparently stands for Realtime Asynchronous MEssaging Network. Seeing "RAMEN Server" and "Scaling RAMEN globally" made me hungry. Time for Uber Eats, perhaps.

A post on how applying GitOps practices can improve the security characteristics of your deployment pipeline.

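The title is "How GitOps Improves the Security of Your Development Pipelines".
A summary article of a session from the virtual event "GitOpsDays 2020". The session's YouTube video is embedded.
It argues that GitOps lets you control changes and review them from a single source, and explains the following three points.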
1. Config as Code
2. Changes are auditable
3. Production matches the desired state kept in Git
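
The third point, production matching the desired state kept in Git, can be illustrated with a minimal drift check. This is only a sketch under assumptions: `diff_state`, `desired_state`, and `live_state` are hypothetical names and literals introduced here, not anything from the article or from a particular GitOps tool. Real tooling would read manifests from a Git checkout and query the cluster instead.

```python
from typing import Any, Dict, List


def diff_state(desired: Dict[str, Any], live: Dict[str, Any]) -> List[str]:
    """Return human-readable differences between desired and live state."""
    drift = []
    for key in sorted(set(desired) | set(live)):
        if key not in live:
            drift.append(f"missing in production: {key}")
        elif key not in desired:
            drift.append(f"not tracked in Git: {key}")
        elif desired[key] != live[key]:
            drift.append(
                f"changed: {key} (Git={desired[key]!r}, live={live[key]!r})"
            )
    return drift


# Hypothetical example data standing in for "what Git says" and
# "what is actually running".
desired_state = {"replicas": 3, "image": "app:1.4.2"}
live_state = {"replicas": 5, "image": "app:1.4.2", "debug": True}

for line in diff_state(desired_state, live_state):
    print(line)
```

An empty result means production matches Git; any entry is either an unreviewed change to investigate or something to bring back under version control.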

Most Dockerfiles are simple, but it’s possible to solve more complex problems too. This example shows cross-compilation patterns for expensive compilation targets.

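The title is "Compiling Qt with Docker multi-stage and multi-platform".
How to build Qt, the cross-platform development framework, with Docker using multi-stage and multi-platform builds.
For embedded devices, compilation is far from easy, and compiling Qt (and QtWebEngine) is a very heavy operation. The author therefore precompiles and distributes Qt so that a Dockerfile can download and include it during the build process (rather than compiling it as part of the installation process).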

A look at OpenTelemetry and in particular its usage in Java applications.

The title is "OpenTelemetry Java: All you need to know".
As a tutorial, it explains how to attach the OpenTelemetry Java Agent, trace methods, span methods, and more.
The GitHub page is here.

Tools

Tobs is a distribution of monitoring tools for Kubernetes. Set up a full stack with Prometheus, Grafana, Promscale, Promlens and more with Helm or a custom CLI.

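The GitHub page for "Tobs (The Observability Stack for Kubernetes)", a tool that aims to make installing a full observability stack into a Kubernetes cluster as easy as possible.
It provides a CLI tool that simplifies deployment and operations, as well as Helm charts that can be used directly or as sub-charts of other projects.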

Singer is an open source toolkit for ETL. At its core is a specification, and a system of taps (for extracting data) and targets (for saving it)

As described above, the GitHub page for "Singer", an OSS tool for ETL. It moves data between databases, web APIs, files, queues, and just about anything else you can think of.
The GitHub page is here.

Grafana-sync is a handy tool for syncing dashboards between Grafana installs using the Grafana API.

As the name suggests, the GitHub page for "grafana-sync", a tool that syncs Grafana dashboards.

SRE Weekly Issue #250 December 27th, 2020

Articles

Salt Incident: May 3rd 2020 Retrospective and Update

Here’s how Algolia was affected by the Salt Stack RCE vulnerability earlier this year and how they dealt with it.

Julien Lemoine — Algolia

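Algolia's postmortem, dated 2020/05/05, of the incident on 2020/05/03.
Through the Salt configuration-management vulnerability "CVE-2020-11651", Algolia's infrastructure came under attack, and two kinds of malware code were potentially able to infiltrate Algolia's configuration manager.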

How to Prepare for a Site Reliability Engineer Interview

Includes background information on SRE and example interview questions.

Marlo Vernon — Splunk

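An article for people interviewing for SRE positions, organized into the following three sections. The sections and their content should also be useful for those doing the hiring.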
- What is a site reliability engineer? (SRE)
- Primary roles and responsibilities of an SRE
- Questions to expect in a site reliability engineer interview

6 Scary Outage Stories from CTOs

DNS, TLS certificates, and Unicode, among other issues, make for some great (and cringe-worthy) stories.

Adam LaGreca, with stories from Charity Majors, Matthew Fornaciari, Liran Haimovitch, Daniel Spoonhower, Lee Liu, and Tina Huang

As the title says, the editor picks up a Halloween feature in which CTOs from six companies each talk about an outage at their own company.

The Day of the RDS Multi-AZ Failover

In this story of a failover gone wrong, they discovered that they had had innodb_flush_log_at_trx_commit set incorrectly, explaining how they lost data when they weren’t expecting to.

Rajeev Rai — Razorpay

Shares the background, response, and lessons learned from the failed RDS Multi-AZ failover the company experienced in 2019.
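
For readers unfamiliar with `innodb_flush_log_at_trx_commit`, the sketch below summarizes the three values documented for InnoDB and which of them keeps every committed transaction on disk if the host dies. It is an illustration only: `FLUSH_MODES` and `survives_host_crash` are hypothetical names introduced here, and the summary above does not say which value Razorpay had configured.

```python
# Summary of the documented InnoDB behaviour for each value of
# innodb_flush_log_at_trx_commit.
FLUSH_MODES = {
    0: "log written and flushed to disk about once per second",
    1: "log written and flushed to disk at every commit (default)",
    2: "log written at every commit, flushed to disk about once per second",
}


def survives_host_crash(value: int) -> bool:
    """True if every committed transaction is on disk when the host dies."""
    if value not in FLUSH_MODES:
        raise ValueError(f"unsupported value: {value}")
    # Only value 1 flushes the redo log to disk as part of each commit.
    return value == 1


for value, meaning in FLUSH_MODES.items():
    print(value, meaning, "| durable on host crash:", survives_host_crash(value))
```

With values 0 and 2, roughly the last second of committed transactions can be lost when the host fails, which is the kind of unexpected data loss a failover can expose.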

Much that we’ve gotten wrong about Site Reliability Engineering

This is a nice little comic about the role of SRE. Engineer the bridge, don’t be the bridge.

Piyush Verma — Last9

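As described above, a comic-style article. My takeaway: for SREs to be able to "engineer" and "observe" the platform, both the SREs and the people around them need to understand the role, and take care not to fall into ad hoc responses that depend on particular individuals.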
- SREs should’ve been engineering and observing the bridge, but instead they became the bridge.

You Reap What You Code

Lots of great concepts about human/computer systems, including this gem:

log facts, not interpretations

Fred Hebert

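A loose transcript of the author's talk at "Deserted Island DevOps Summer Send-Off", an online conference held during COVID-19. The embedded images are laid-back, but the content is packed.
The entire conference was held inside the game "Animal Crossing: New Horizons".
A retrospective article on the conference is here.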

The Mysterious Case of the Bad Gateway (502)

In this troubleshooting story, an innocent-seeming dependency upgrade introduced a subtle but nasty bug.

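As the title says, an investigation into 502 errors that were occasionally returned for API requests. The cause was that the company's TCP backlog length was set to 1 instead of the default of 128.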

Jordan Place — Transposit
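
To show where a TCP backlog length like the one in the story above is set, here is a minimal sketch using Python's standard `socket` module. It assumes a plain Python socket server, which is not necessarily the stack used in the article; `make_listener` is a hypothetical helper introduced for illustration.

```python
import socket


def make_listener(backlog: int, host: str = "127.0.0.1") -> socket.socket:
    """Create a TCP listening socket with the requested accept backlog."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0 lets the OS pick a free port
    # The backlog is a hint for how many not-yet-accepted connections the
    # kernel may queue; the OS can clamp or round the value.
    server.listen(backlog)
    return server


listener = make_listener(backlog=128)
port = listener.getsockname()[1]

# One client connects, the server accepts it and sends a short reply.
client = socket.create_connection(("127.0.0.1", port), timeout=2)
conn, _ = listener.accept()
conn.sendall(b"ok")
reply = client.recv(2)
print(reply)

for s in (conn, client, listener):
    s.close()
```

With a very small backlog, bursts of new connections that arrive faster than the server calls `accept()` can be refused or dropped, which a proxy in front of the server may then report as a 502.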

Google Cloud Platform

Google released an update to their post-analysis for the December 14th outage involving Google OAuth.

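The editor picks this up because the following correction was made to the post-incident analysis of the outage mentioned in last week's Outages section.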
- The following is a correction to the previously posted ISSUE SUMMARY, which after further research we determined needed an amendment. All services that require sign-in via a Google Account were affected with varying impact. Some operations with Cloud service accounts experienced elevated error rates on requests to the following endpoints: www.googleapis.com or oauth2.googleapis.com. Impact varied based on the Cloud Service and service account. Please open a support case if you were impacted and have further questions.

Outages

Filecoin
Gucci Online store

Outage information for the services listed above.

KubeWeekly #245 January 1st, 2021 ← To be updated once received. There has been no announcement, but the issue may be skipped for the holidays.

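How was it? Did any of the articles or information catch your eye?

There is still a lot here that I have not fully digested myself, so I plan to use this memo and link collection to deepen my understanding.

See you next time.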

Bye now!!

Yoshiki Fujiwara