Date: Tue, 2 Feb 2021 01:57:15 -0700
From: Yu Zhao
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Augmented Page Reclaim

======================
Augmented Page Reclaim
======================
We would like to share some work with you and see whether there is enough
interest to warrant a run for the mainline. This work is part of the result
of a decade of research and experimentation in memory overcommit at Google:
an augmented page reclaim that, in our experience, is performant, versatile
and, more importantly, simple.

Performance
===========
On Android, our most advanced simulation that generates memory pressure from
realistic user behavior shows 18% fewer low-memory kills, which in turn
reduces cold starts by 16%. This is on top of per-process reclaim, a
predecessor of ``MADV_COLD`` and ``MADV_PAGEOUT``, against background apps.

On Borg (warehouse-scale computers), a similar approach enables us to
identify jobs that underutilize their memory and downsize them considerably
without compromising any of our service level indicators. Our findings are
published in the papers listed below, e.g., 32% of memory usage on Borg has
been idle for at least 2 minutes.

On Chrome OS, our field telemetry reports 96% fewer low-memory tab discards
and 59% fewer OOM kills from fully-utilized devices and no UX regressions
from underutilized devices. Our real-world benchmark that browses popular
websites in multiple tabs demonstrates 51% less CPU usage from ``kswapd``
and 45% (some) and 52% (full) less PSI on v5.11-rc6 built from the tree
below.
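
The Borg figure above (memory idle for at least 2 minutes) is the kind of
metric that can be derived from a histogram of page counts by idle time. The
following is only a minimal sketch of that arithmetic. It assumes the metrics
can be reduced to such a histogram; the ``Generation`` type, its field names
and the numbers are hypothetical and do not reflect the actual interface
format, which is described in the documentation in the tree.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Generation:
    """One bucket of a hypothetical age histogram (not the real format)."""
    min_idle_seconds: float  # pages in this bucket have been idle at least this long
    nr_pages: int


def idle_fraction(generations: Iterable[Generation], threshold_seconds: float) -> float:
    """Return the fraction of pages idle for at least threshold_seconds."""
    gens = list(generations)
    total = sum(g.nr_pages for g in gens)
    if total == 0:
        return 0.0
    idle = sum(g.nr_pages for g in gens if g.min_idle_seconds >= threshold_seconds)
    return idle / total


# Made-up numbers, for illustration only.
histogram = [
    Generation(min_idle_seconds=0, nr_pages=500),
    Generation(min_idle_seconds=30, nr_pages=250),
    Generation(min_idle_seconds=120, nr_pages=150),
    Generation(min_idle_seconds=600, nr_pages=100),
]
print(f"{idle_fraction(histogram, 120):.0%} idle for at least 2 minutes")
```

A job whose idle fraction stays high across many samples would be a
candidate for downsizing; how to sample and what threshold to use are policy
decisions outside this sketch.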

Versatility
===========
Userspace can trigger aging and eviction independently via the ``debugfs``
interface [note]_ for working set estimation, proactive reclaim, far memory
tiering, NUMA-aware job scheduling, etc. The metrics from the interface are
easily interpretable, which allows intuitive provisioning and discoveries
like the Borg example above. For a warehouse-scale computer, the interface
is intended to be a building block of a closed-loop control system, with a
machine learning algorithm being the controller.

Simplicity
==========
The workflow [note]_ is well defined and each step in it has a clear
meaning. There are no magic numbers or heuristics involved, only a few basic
data structures that have a negligible memory footprint. This simplicity has
served us well as the scale and the diversity of our workloads constantly
grow.

Repo
====
git pull https://linux-mm.googlesource.com/page-reclaim refs/changes/80/1080/1

Gerrit: https://linux-mm-review.googlesource.com/c/page-reclaim/+/1080

.. [note] See ``Documentation/vm/multigen-lru.rst`` in the tree.

FAQ
===
What is the motivation for this work?
-------------------------------------
In our case, DRAM is a major factor in total cost of ownership, and
improving memory overcommit brings a high return on investment. Moreover,
Google-Wide Profiling has been observing high CPU overhead [note]_ from
page reclaim.

Why not try to improve the existing code?
-----------------------------------------
We have tried but concluded that the two limiting factors [note]_ in the
existing code are fundamental, and therefore changes made atop them will not
result in substantial gains on any of the aspects above.

What particular workloads does it help?
---------------------------------------
This work optimizes page reclaim for workloads that are not IO bound,
because we find they are the norm on servers and clients in the cloud era.
It would most likely help any workload that shares the common
characteristics [note]_ we observed.

How would it benefit the community?
-----------------------------------
Google is committed to promoting sustainable development of the community.
We hope that successful adoption of this work will grow steadily over time.
To that end, we would be happy to learn about your workloads and work with
you case by case, and we will do our best to keep the repo fully maintained.
For those whose workloads rely on the existing code, we will make sure you
will not be affected in any way.

References
==========
1. Long-term SLOs for reclaimed cloud computing resources
2. Profiling a warehouse-scale computer
3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
4. Software-defined far memory in warehouse-scale computers
5. Borg: the Next Generation