From: Shakeel Butt
Date: Wed, 26 Feb 2020 15:36:50 -0800
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
To: Johannes Weiner
Cc: Yang Shi, Andrew Morton, Michal Hocko, Tejun Heo, Roman Gushchin,
 Linux MM, Cgroups, LKML, Kernel Team
In-Reply-To: <20200226222642.GB30206@cmpxchg.org>
References: <20200219181219.54356-1-hannes@cmpxchg.org>
 <20200226222642.GB30206@cmpxchg.org>

On Wed, Feb 26, 2020 at 2:26 PM Johannes Weiner wrote:
>
> On Wed, Feb 26, 2020 at 12:25:33PM -0800, Shakeel Butt wrote:
> > On Wed, Feb 19, 2020 at 10:12 AM Johannes Weiner wrote:
> > >
> > > We have received regression reports from users whose workloads moved
> > > into containers and subsequently encountered new latencies. For some
> > > users these were a nuisance, but for others they meant missing their
> > > SLA response times. We tracked those delays down to cgroup limits,
> > > which inject direct reclaim stalls into the workload where previously
> > > all reclaim was handled by kswapd.
> > >
> > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > while keeping direct reclaim as a fallback. In our testing, this
> > > eliminated all direct reclaim from the affected workload.
> > >
> > > memory.high has a grace buffer of about 4% between when it becomes
> > > exceeded and when allocating threads get throttled. We can use the
> > > same buffer for the async reclaimer to operate in. If the worker
> > > cannot keep up and the grace buffer is exceeded, allocating threads
> > > will fall back to direct reclaim before getting throttled.
> > >
> > > For irq context, there is already async memory.high enforcement.
> > > Reuse that work item for all allocating contexts, but switch it to
> > > the unbound workqueue so reclaim work doesn't compete with the
> > > workload. The work item is per cgroup, which means the workqueue
> > > infrastructure will create at most one worker thread per reclaiming
> > > cgroup.
> > >
> > > Signed-off-by: Johannes Weiner
> > > ---
> > >  mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
> > >  mm/vmscan.c     | 10 +++++++--
> >
> > This reminds me of the per-memcg kswapd proposal from LSFMM 2018
> > (https://lwn.net/Articles/753162/).
>
> Ah yes, I remember those discussions. :)
>
> One thing that has changed since we last tried to implement this is
> the workqueue concurrency code. We don't have to worry about a single
> thread or a fixed number of threads per cgroup, because the workqueue
> code has improved significantly in handling concurrency demands, and
> having one work item per cgroup ensures we get anywhere between zero
> and one worker thread per reclaiming cgroup, completely on demand.
>
> Also, with cgroup2, memory and cpu always have overlapping control
> domains, so the question of whom to account the work to becomes a much
> easier one to answer.
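For reference, the shape of that mechanism is roughly the following (a
simplified sketch, not the patch itself: memory_high_wq and the exact
charge-path hook are placeholders, and the real code also has to deal
with css reference lifetimes and the existing irq-context path):

/* One work item per memcg; the unbound workqueue provides concurrency. */
static struct workqueue_struct *memory_high_wq;

static void high_work_func(struct work_struct *work)
{
	struct mem_cgroup *memcg;

	memcg = container_of(work, struct mem_cgroup, high_work);
	/* Reclaim the excess back down toward memory.high. */
	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}

/*
 * In the charge path: as soon as usage crosses memory.high, kick the
 * async worker; direct reclaim and throttling remain the fallback if
 * the worker cannot keep up.
 */
if (page_counter_read(&memcg->memory) > memcg->high)
	queue_work(memory_high_wq, &memcg->high_work);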
>
> > If I understand this correctly, the use-case is that the job, instead
> > of direct reclaiming (potentially in latency-sensitive tasks), prefers
> > a background, non-latency-sensitive task to do the reclaim. I am
> > wondering if we can use the memory.high notification along with a new
> > memcg interface (like memory.try_to_free_pages) to implement a user
> > space background reclaimer. That would resolve the cpu accounting
> > concerns, as the user space background reclaimer can share the cpu
> > cost with the task.
>
> The idea is not necessarily that the background reclaimer is lower
> priority work, but that it can execute in parallel on a separate CPU
> instead of being forced into the execution stream of the main work.
>
> So we should be able to fully resolve this problem inside the kernel,
> without going through userspace, by accounting CPU cycles used by the
> background reclaim worker to the cgroup that is being reclaimed.
>
> > One concern with this approach will be that the memory.high
> > notification is too late and the latency-sensitive task has already
> > faced the stall. We could either introduce a threshold notification
> > or another notification-only limit like memory.near_high, which can
> > be set based on the job's rate of allocations; when the usage hits
> > this limit, just notify user space.
>
> Yeah, I think it would be a pretty drastic expansion of the memory
> controller's interface.

I understand the concern about expanding the interface and the
preference for resolving the problem within the kernel, but there are
genuine use-cases that can be fulfilled by these interfaces. We have a
distributed caching service that keeps its caches in anon pages and
manages their hotness. When nearing a stall/oom/memory-pressure
condition, it is preferable to drop a cold cache known to the
application in user space rather than let the kernel swap it out and
face a stall on fault, because the caches are replicated and other
nodes can serve them. For such workloads, kernel reclaim does not
help.

What would be your recommendation for such a workload? I can envision
memory.high + PSI notifications, but note that these are based on
stalls, which are exactly what the application wants to avoid.
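To make the envisioned mechanism concrete, the user-space side could
look roughly like the sketch below: arm a PSI trigger on the cgroup's
memory.pressure file and drop application-known cold cache when it
fires. The cgroup path, the trigger parameters (150ms of stall per 1s
window), and drop_cold_cache() are illustrative placeholders, and, as
said above, the trigger only fires after some stall has already
occurred:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Application-specific: release replicated cold cache in user space. */
static void drop_cold_cache(void)
{
}

int main(void)
{
	/* Hypothetical cgroup path; trigger: 150ms stall per 1s window. */
	const char *path = "/sys/fs/cgroup/cache-service/memory.pressure";
	const char *trig = "some 150000 1000000";
	struct pollfd pfd;

	pfd.fd = open(path, O_RDWR | O_NONBLOCK);
	if (pfd.fd < 0 || write(pfd.fd, trig, strlen(trig) + 1) < 0) {
		perror("psi trigger");
		return 1;
	}
	pfd.events = POLLPRI;

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (pfd.revents & POLLERR)
			break;		/* cgroup was removed */
		if (pfd.revents & POLLPRI)
			drop_cold_cache();
	}
	close(pfd.fd);
	return 0;
}

Shakeel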