From: Joanne Koong
Date: Thu, 2 Jan 2025 11:59:25 -0800
Subject: Re: [PATCH v6 4/5] mm/migrate: skip migrating folios under writeback with AS_WRITEBACK_INDETERMINATE mappings
To: Shakeel Butt
Cc: David Hildenbrand, Bernd Schubert, Zi Yan, miklos@szeredi.hu, linux-fsdevel@vger.kernel.org, jefflexu@linux.alibaba.com, josef@toxicpanda.com, linux-mm@kvack.org, kernel-team@meta.com, Matthew Wilcox, Oscar Salvador, Michal Hocko
References: <71d7ac34-a5e5-4e59-802b-33d8a4256040@redhat.com> <9404aaa2-4fc2-4b8b-8f95-5604c54c162a@redhat.com> <3f3c7254-7171-4987-bb1b-24c323e22a0f@redhat.com> <0ed5241e-10af-43ee-baaf-87a5b4dc9694@redhat.com>

On Mon, Dec 30, 2024 at 12:04 PM Shakeel Butt wrote:
>
> On Mon, Dec 30, 2024 at 10:38:16AM -0800, Joanne Koong wrote:
> > On Mon, Dec 30, 2024 at 2:16 AM David Hildenbrand wrote:
>
> Thanks David for the response.
>
> > >
> > > >> BTW, I just looked at NFS out of interest, in particular
> > > >> nfs_page_async_flush(), and I spot some logic about re-dirtying pages +
> > > >> canceling writeback. IIUC, there are default timeouts for UDP and TCP,
> > > >> whereby the TCP default one seems to be around 60s (* retrans?), and the
> > > >> privileged user that mounts it can set higher ones. I guess one could run
> > > >> into similar writeback issues?
> > > >
> > >
> > > Hi,
> > >
> > > sorry for the late reply.
> > >
> > > > Yes, I think so.
> > > >
> > > >>
> > > >> So I wonder why we never required AS_WRITEBACK_INDETERMINATE for nfs?
> > > >
> > > > I feel like INDETERMINATE in the name is the main cause of confusion.
> > >
> > > We are adding logic that says "unconditionally, never wait on writeback
> > > for these folios, not even any sync migration". That's the main problem
> > > I have.
> > >
> > > Your explanation below is helpful. Because ...
> > >
> > > > So, let me explain why it is required (but later I will tell you how it
> > > > can be avoided). The FUSE thread which is actively handling writeback of
> > > > a given folio can cause memory allocation either through syscall or page
> > > > fault. That memory allocation can trigger global reclaim synchronously
> > > > and in cgroup-v1, that FUSE thread can wait on the writeback on the same
> > > > folio whose writeback it is supposed to end, causing a deadlock. So,
> > > > AS_WRITEBACK_INDETERMINATE is used to just avoid this deadlock.
> > > >
> > > > The in-kernel fs avoid this situation through the use of GFP_NOFS
> > > > allocations. The userspace fs can also use a similar approach which is
> > > > prctl(PR_SET_IO_FLUSHER, 1) to avoid this situation. However I have been
> > > > told that it is hard to use as it is per-thread flag and has to be set
> > > > for all the threads handling writeback which can be error prone if the
> > > > threadpool is dynamic.
> > > > Second it is very coarse such that all the
> > > > allocations from those threads (e.g. page faults) become NOFS which
> > > > makes userspace very unreliable on highly utilized machine as NOFS can
> > > > not reclaim potentially a lot of memory and can not trigger oom-kill.
> > > >
> > >
> > > ... now I understand that we want to prevent a deadlock in one specific
> > > scenario only?
> > >
> > > What sounds plausible for me is:
> > >
> > > a) Make this only affect the actual deadlock path: sync migration
> > >    during compaction. Communicate it either using some "context"
> > >    information or with a new MIGRATE_SYNC_COMPACTION.
> > > b) Call it sth. like AS_WRITEBACK_MIGHT_DEADLOCK_ON_RECLAIM to express
> > >    that very deadlock problem.
> > > c) Leave all others sync migration users alone for now
> >
> > The deadlock path is separate from sync migration. The deadlock arises
> > from a corner case where cgroupv1 reclaim waits on a folio under
> > writeback where that writeback itself is blocked on reclaim.
> >
>
> Joanne, let's drop the patch to migrate.c completely and let's rename
> the flag to something like what David is suggesting and only handle in
> the reclaim path.
>
> > >
> > > Would that prevent the deadlock? Even *better* would be to be able to
> > > ask the fs if starting writeback on a specific folio could deadlock.
> > > Because in most cases, as I understand, we'll not actually run into the
> > > deadlock and would just want to wait for writeback to just complete
> > > (esp. compaction).
> > >
> > > (I still think having folios under writeback for a long time might be a
> > > problem, but that's indeed something to sort out separately in the
> > > future, because I suspect NFS has similar issues. We'd want to "wait
> > > with timeout" and e.g., cancel writeback during memory
> > > offlining/alloc_cma ...)
>
> Thanks David and yes let's handle the folios under writeback issue
> separately.
>
> >
> > I'm looking back at some of the discussions in v2 [1] and I'm still
> > not clear on how memory fragmentation for non-movable pages differs
> > from memory fragmentation from movable pages and whether one is worse
> > than the other.
>
> I think the fragmentation due to movable pages becoming unmovable is
> worse as that situation is unexpected and the kernel can waste a lot of
> CPU to defrag the block containing those folios. For non-movable blocks,
> the kernel will not even try to defrag. Now we can have a situation
> where almost all memory is backed by non-movable blocks and higher order
> allocations start failing even when there is enough free memory. For
> such situations either system needs to be restarted (or workloads
> restarted if they are cause of high non-movable memory) or the admin
> needs to setup ZONE_MOVABLE where non-movable allocations don't go.

Thanks for the explanations.

The reason I ask is that I'm trying to figure out whether a time-bounded
wait or retry mechanism, instead of skipping migration outright, would be
a viable solution. When attempting to migrate a folio that is under
writeback and whose mapping has AS_WRITEBACK_INDETERMINATE set, migration
would wait on folio writeback for a bounded amount of time, and skip the
folio only if it is still under writeback once that time has elapsed.

There are two cases for fuse folios under writeback (for folios not
under writeback, migration will work as is):

a) Normal case: the server is not malicious or buggy, and writeback
   completes in a timely manner. In this case migration would succeed,
   and there'd be no difference between having no temp pages vs temp
   pages.

b) Server is malicious or buggy, e.g. the server never completes
   writeback.
   With no temp pages: the folio under writeback prevents a memory block
   (not sure how big this usually is?)
   from being compacted, leading to memory fragmentation.
   With temp pages: fuse allocates a non-movable page for every page it
   needs to write back, which worsens memory usage. These pages will
   never get freed, since the server never finishes writeback on them.
   The non-movable pages could also fragment memory blocks, like in the
   scenario with no temp pages.

Is the b) case with no temp pages worse for memory health than the
scenario with temp pages? For the CPU usage issue (e.g. the kernel keeps
trying to defrag blocks containing these problematic folios), it seems
like this could potentially be mitigated by marking these blocks as
uncompactable?


Thanks,
Joanne

>
> > Currently fuse uses movable temp pages (allocated with
> > gfp flags GFP_NOFS | __GFP_HIGHMEM), and these can run into the same
> > issue where a buggy/malicious server may never complete writeback.
>
> So, these temp pages are not an issue for fragmenting the movable blocks
> but if there is no limit on temp pages, the whole system can become
> non-movable (there is a case where movable blocks on non-ZONE_MOVABLE
> can be converted into non-movable blocks under low memory). ZONE_MOVABLE
> will avoid such scenario but tuning the right size of ZONE_MOVABLE is
> not easy.
>
> > This has the same effect of fragmenting memory and has a worse memory
> > cost to the system in terms of memory used. With not having temp pages
> > though, now in this scenario, pages allocated in a movable page block
> > can't be compacted and that memory is fragmented. My (basic and maybe
> > incorrect) understanding is that memory gets allocated through a buddy
> > allocator and movable vs nonmovable pages get allocated to
> > corresponding blocks that match their type, but there's no other
> > difference otherwise. Is this understanding correct? Or is there some
> > substantial difference between fragmentation for movable vs nonmovable
> > blocks?
>
> The main difference is the fallback of high order allocation which can
> trigger compaction or background compaction through kcompactd. The
> kernel will only try to defrag the movable blocks.
>
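
To make the time-bounded wait idea above a bit more concrete, here is a
rough userspace model in C. It is only a sketch of the control flow, not
kernel code and not a proposed patch: struct model_folio,
model_try_migrate() and the MODEL_* results are hypothetical names made
up for illustration, and the polling loop stands in for whatever
wait primitive the kernel would actually use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a folio; only the writeback bit is modelled. */
struct model_folio {
	atomic_bool under_writeback;
};

/* Hypothetical outcome of one migration attempt in this model. */
enum model_result { MODEL_MIGRATE, MODEL_SKIP };

/* Monotonic clock in milliseconds. */
static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait up to timeout_ms for writeback to finish. If it finishes in time,
 * the folio can be migrated; if the deadline passes while the folio is
 * still under writeback, give up and skip it.
 */
static enum model_result model_try_migrate(struct model_folio *f,
					   long timeout_ms)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };
	long long deadline;

	assert(f != NULL && timeout_ms >= 0);
	deadline = now_ms() + timeout_ms;

	while (atomic_load(&f->under_writeback)) {
		if (now_ms() >= deadline)
			return MODEL_SKIP;
		nanosleep(&nap, NULL);	/* poll roughly every 1 ms */
	}
	return MODEL_MIGRATE;
}
```

In this model, case a) corresponds to under_writeback being cleared
before the deadline (MODEL_MIGRATE), and case b) to a server that never
clears it (MODEL_SKIP after timeout_ms). The cost in case b) is that
every migration attempt burns the full timeout, which is the CPU/latency
concern raised above.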