From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 30 Oct 2024 12:51:58 -0700
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
[..]

> > My second point about the mitigation is as follows: For a system (or
> > memcg) under severe memory pressure, especially one without hardware TLB
> > optimization, is enabling mTHP always the right choice? Since mTHP operates
> > at a larger granularity, some internal fragmentation is unavoidable,
> > regardless of optimization. Could the mitigation code help in automatically
> > tuning this fragmentation?
> >
> I agree with the point that always enabling mTHP is not the right thing to do
> on all platforms.
> I also think it might be the case that enabling mTHP
> might be a good thing for some workloads, but enabling mTHP swapin along with
> it might not.
>
> As you said, when you have apps switching between foreground and background
> in Android, it probably makes sense to have large folio swapping, as you
> want to bring in all the pages from the background app as quickly as
> possible, and also get all the TLB optimizations and smaller LRU overhead
> once you have brought in all the pages.
> The Linux kernel build test doesn't really benefit from the TLB optimization
> and smaller LRU overhead, as the pages are probably very short-lived. So I
> think it doesn't show the benefit of large folio swapin properly, and
> large folio swapin should probably be disabled for this kind of workload,
> even though mTHP should be enabled.
>
> I am not sure that the approach we are trying in this patch is the right way:
> - This patch makes it a memcg issue, but you could have memcg disabled and
> then the mitigation being tried here won't apply.

Is the problem reproducible without memcg? I imagine only if the
entire system is under memory pressure. I guess we would want the same
"mitigation" either way.

> - Instead of this being a large folio swapin issue, is it more of a readahead
> issue? If we zswap (without the large folio swapin series) and change the
> window to 1 in swap_vma_readahead, we might see an improvement in Linux
> kernel build time when cgroup memory is limited, as readahead would probably
> cause swap thrashing as well.

I think large folio swapin would make the problem worse anyway. I am
also not sure if the readahead window adjusts on memory pressure or
not.

> - Instead of looking at the cgroup margin, maybe we should try to look at
> the rate of change of workingset_restore_anon? This might be a lot more
> complicated to do, but is probably the right metric to determine swap
> thrashing.
> It also means that this could be used in both the synchronous swapcache
> skipping path and the swapin_readahead path.
> (Thanks Johannes for suggesting this.)
>
> With large folio swapin, I do see the large improvement when considering
> only swapin performance and latency, in the same way as you saw in zram.
> Maybe the right short-term approach is to have
> /sys/kernel/mm/transparent_hugepage/swapin
> and have that disabled by default to avoid regression.
> If the workload owner sees a benefit, they can enable it.
> I can add this when sending the next version of large folio zswapin if that
> makes sense?

I would honestly prefer we avoid this if possible. It's always easy to
just put features behind knobs, and then users have the toil of
figuring out if/when they can use them, or they just give up. We
should find a way to avoid the thrashing due to hitting the memcg
limit (or being under global memory pressure); it seems like something
the kernel should be able to do on its own.

> Longer term, I can try to have a look at whether we can do something with
> workingset_restore_anon to improve things.

I am not a big fan of this, mainly because reading a stat from the
kernel puts us in a situation where we have to choose between:

- Doing a memcg stats flush in the kernel, which is something we are
trying to move away from due to various problems we have been running
into.
- Using potentially stale stats (up to 2s), which may be fine but is
suboptimal at best. We may have blips of thrashing due to stale stats
not showing the refaults.
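
For what it's worth, the userspace side of the workingset_restore_anon
heuristic is easy to prototype before deciding whether it belongs in the
kernel. Below is a rough Python sketch that samples a cgroup v2 memory.stat
file and flags a high anon restore rate; the path, interval, and threshold
are illustrative assumptions, not values anyone proposed in this thread:

```python
import time


def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat ("key value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def restore_rate(prev, curr, interval):
    """Anon workingset restores per second between two memory.stat samples."""
    delta = (curr.get("workingset_restore_anon", 0)
             - prev.get("workingset_restore_anon", 0))
    return delta / interval


def monitor(path="/sys/fs/cgroup/memory.stat", interval=2.0, threshold=1000):
    """Poll memory.stat and report when restores exceed `threshold`/sec.

    The 2s interval mirrors the staleness window mentioned above; short
    thrashing blips inside one window would still be missed.
    """
    with open(path) as f:
        prev = parse_memory_stat(f.read())
    while True:
        time.sleep(interval)
        with open(path) as f:
            curr = parse_memory_stat(f.read())
        rate = restore_rate(prev, curr, interval)
        if rate > threshold:
            print(f"possible swap thrashing: {rate:.0f} anon restores/sec")
        prev = curr
```

An in-kernel consumer would face exactly the tradeoff described above:
either flush the memcg stats on each sample or act on values that may lag
reality by the flush period.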