From: "Huang, Ying" <ying.huang@intel.com>
To: Yosry Ahmed
Cc: Minchan Kim, Chris Li, Michal Hocko, Liu Shixin, Yu Zhao,
	Andrew Morton, Sachin Sant, Johannes Weiner, Kefeng Wang,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v10] mm: vmscan: try to reclaim swapcache pages if no swap space
In-Reply-To: (Yosry Ahmed's message of "Mon, 27 Nov 2023 21:41:54 -0800")
References: <87msv58068.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87h6l77wl5.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87bkbf7gz6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87msuy5zuv.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87fs0q5xsq.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87bkbe5tha.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 28 Nov 2023 13:52:42 +0800
Message-ID: <875y1m5sr9.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)

Yosry Ahmed writes:

> On Mon, Nov 27, 2023 at 9:39 PM Huang, Ying wrote:
>>
>> Yosry Ahmed writes:
>>
>> > On Mon, Nov 27, 2023 at 8:05 PM Huang, Ying wrote:
>> >>
>> >> Yosry Ahmed writes:
>> >>
>> >> > On Mon, Nov 27, 2023 at 7:21 PM Huang, Ying wrote:
>> >> >>
>> >> >> Yosry Ahmed writes:
>> >> >>
>> >> >> > On Mon, Nov 27, 2023 at 1:32 PM Minchan Kim wrote:
>> >> >> >>
>> >> >> >> On Mon, Nov 27, 2023 at 12:22:59AM -0800, Chris Li wrote:
>> >> >> >> > On Mon, Nov 27, 2023 at 12:14 AM Huang, Ying wrote:
>> >> >> >> > > > I agree with Ying that anonymous pages typically have different page
>> >> >> >> > > > access patterns than file pages, so we might want to treat them
>> >> >> >> > > > differently to reclaim them effectively.
>> >> >> >> > > > One random idea:
>> >> >> >> > > > How about we put the anonymous page in a swap cache in a different LRU
>> >> >> >> > > > than the rest of the anonymous pages. Then shrinking against those
>> >> >> >> > > > pages in the swap cache would be more effective. Instead of having
>> >> >> >> > > > [anon, file] LRU, now we have [anon not in swap cache, anon in swap
>> >> >> >> > > > cache, file] LRU
>> >> >> >> > >
>> >> >> >> > > I don't think that it is necessary. The patch is only for a special use
>> >> >> >> > > case, where the swap device is used up while some pages are in swap
>> >> >> >> > > cache. The patch will kill performance, but it is used to avoid OOM
>> >> >> >> > > only, not to improve performance. Per my understanding, we will not use
>> >> >> >> > > up swap device space in most cases. This may be true for ZRAM, but will
>> >> >> >> > > we keep pages in swap cache for long when we use ZRAM?
>> >> >> >> >
>> >> >> >> > I asked the question regarding how many pages can be freed by this patch
>> >> >> >> > in this email thread as well, but haven't got the answer from the
>> >> >> >> > author yet. That is one important aspect in evaluating how valuable
>> >> >> >> > that patch is.
>> >> >> >>
>> >> >> >> Exactly. Since the swap cache has a different lifetime from the page
>> >> >> >> cache, its pages would usually be dropped when they are unmapped (unless
>> >> >> >> they are shared with others, but anon is usually exclusive and private),
>> >> >> >> so I wonder how much memory we can save.
>> >> >> >
>> >> >> > I think the point of this patch is not saving memory, but rather
>> >> >> > avoiding an OOM condition that will happen if we have no swap space
>> >> >> > left, but some pages left in the swap cache. Of course, the OOM
>> >> >> > avoidance will come at the cost of extra work in reclaim to swap those
>> >> >> > pages out.
>> >> >> >
>> >> >> > The only case where I think this might be harmful is if there's plenty
>> >> >> > of pages to reclaim on the file LRU, and instead we opt to chase down
>> >> >> > the few swap cache pages. So perhaps we can add a check to only set
>> >> >> > sc->swapcache_only if the number of pages in the swap cache is more
>> >> >> > than the number of pages on the file LRU or similar? Just make sure we
>> >> >> > don't chase the swapcache pages down if there's plenty to scan on the
>> >> >> > file LRU?
>> >> >>
>> >> >> The swap cache pages can be divided into 3 groups.
>> >> >>
>> >> >> - group 1: pages have been written out, at the tail of the inactive
>> >> >>   LRU, but not reclaimed yet.
>> >> >>
>> >> >> - group 2: pages have been written out, but failed to be reclaimed
>> >> >>   (e.g., were accessed before reclaiming)
>> >> >>
>> >> >> - group 3: pages have been swapped in, but were kept in the swap
>> >> >>   cache. The pages may be in the active LRU.
>> >> >>
>> >> >> The main target of the original patch should be group 1. And the pages
>> >> >> may be cheaper to reclaim than file pages.
>> >> >>
>> >> >> Group 2 is hard to reclaim if swap_count() isn't 0.
>> >> >>
>> >> >> Group 3 should be reclaimed in theory, but the overhead may be high.
>> >> >> And we may need to reclaim the swap entries instead of the pages if the
>> >> >> pages are hot. But we can start to reclaim the swap entries before the
>> >> >> swap space runs out.
>> >> >>
>> >> >> So, if we can count group 1, we may use that as an indicator to scan
>> >> >> anon pages. And we may add code to reclaim group 3 earlier.
>> >> >>
>> >> >
>> >> > My point was not that reclaiming the pages in the swap cache is more
>> >> > expensive than reclaiming the pages in the file LRU. In a lot of
>> >> > cases, as you point out, the pages in the swap cache can just be
>> >> > dropped, so they may be as cheap or cheaper to reclaim than the pages
>> >> > in the file LRU.
>> >> >
>> >> > My point was that scanning the anon LRU when swap space is exhausted
>> >> > to get to the pages in the swap cache may be much more expensive,
>> >> > because there may be a lot of pages on the anon LRU that are not in
>> >> > the swap cache, and hence are not reclaimable, unlike pages in the
>> >> > file LRU, which should mostly be reclaimable.
>> >> >
>> >> > So what I am saying is that maybe we should not make the effort of
>> >> > scanning the anon LRU in the swapcache_only case unless there aren't a
>> >> > lot of pages to reclaim on the file LRU (relatively). For example, if
>> >> > we have 100 pages in the swap cache out of 10000 pages in the anon
>> >> > LRU, and there are 10000 pages in the file LRU, it's probably not
>> >> > worth scanning the anon LRU.
>> >>
>> >> For group 1 pages, they are at the tail of the anon inactive LRU, so
>> >> the scan overhead is low too. For example, if the number of group 1
>> >> pages is 100, we just need to scan 100 pages to reclaim them. We can
>> >> choose to stop scanning when the number of non-group-1 pages reaches
>> >> some threshold.
>> >>
>> >
>> > We should still try to reclaim pages in groups 2 & 3 before OOMing
>> > though. Maybe the motivation for this patch is group 1, but I don't
>> > see why we should special-case them. Pages in groups 2 & 3 should be
>> > roughly equally cheap to reclaim. They may have higher refault cost,
>> > but IIUC we should still try to reclaim them before OOMing.
>>
>> The scan cost of group 3 may be high; you may need to scan all anonymous
>> pages to identify them.
>> The reclaim cost of group 2 may be high; it may just cause thrashing
>> (shared pages that are accessed by just one process). So I think that
>> we can allow reclaiming group 1 in all cases. Try to reclaim swap
>> entries for group 3 during normal LRU scanning after more than half of
>> the swap space limit is used. As a last resort before OOM, try to
>> reclaim groups 2 and 3. Or, limit the scan count for groups 2 and 3.
>
> It would be nice if this can be done auto-magically without having to
> keep track of the groups separately.

Some rough idea may be,

- trying to scan the anon LRU if there are swap cache pages.

- if some number of pages other than group 1 are encountered, stop
  scanning the anon LRU list.

- the threshold for stopping can be tuned according to whether we are
  going to OOM.

We can try to reclaim swap entries for group 3 when we haven't run out
of swap space yet.

>>
>> BTW, in some situations, OOM is not the worst outcome. For example,
>> thrashing may kill interaction latency, while killing the memory hog
>> (which may be caused by a memory leak) preserves system response time.
>
> I agree that in some situations OOMs are better than thrashing; it's
> not an easy problem.

--
Best Regards,
Huang, Ying