From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 88090C433EF for ; Thu, 24 Mar 2022 13:14:34 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D1BE46B0071; Thu, 24 Mar 2022 09:14:33 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id CCB2C6B0073; Thu, 24 Mar 2022 09:14:33 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B93426B0074; Thu, 24 Mar 2022 09:14:33 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.26]) by kanga.kvack.org (Postfix) with ESMTP id AC0776B0071 for ; Thu, 24 Mar 2022 09:14:33 -0400 (EDT) Received: from smtpin11.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay13.hostedemail.com (Postfix) with ESMTP id 7C20C61C82 for ; Thu, 24 Mar 2022 13:14:33 +0000 (UTC) X-FDA: 79279324026.11.E5CDAAC Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by imf02.hostedemail.com (Postfix) with ESMTP id D61298001C for ; Thu, 24 Mar 2022 13:14:32 +0000 (UTC) Received: from relay2.suse.de (relay2.suse.de [149.44.160.134]) by smtp-out1.suse.de (Postfix) with ESMTP id AA57F210FD; Thu, 24 Mar 2022 13:14:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.com; s=susede1; t=1648127671; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=VNnFxSS83AKZ6mnxU8f26Q342BEjDVtjsTkx4noK0Kw=; b=CXiEmbSksC8S0uFG7nfe5goyrnD7vsPUiuu+18xeZ6CQa1oo1MTpp+Bc/iPPAM4YHeED16 46aPCruoGBseREY4lOBxtBj6YPCbLipsOT+mDGkwuaVQVLyUg/vYxa1xAh//JWQxhbwipl A++vMknOYm4oUJfX2eoeKM4g89TFxHE= Received: from suse.cz (unknown [10.100.201.86]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay2.suse.de (Postfix) with ESMTPS id 721A1A3B89; Thu, 24 Mar 2022 13:14:31 +0000 (UTC) Date: Thu, 24 Mar 2022 14:14:30 +0100 From: Michal Hocko To: Charan Teja Kalla Cc: akpm@linux-foundation.org, minchan@kernel.org, surenb@google.com, vbabka@suse.cz, rientjes@google.com, nadav.amit@gmail.com, edgararriaga@google.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error Message-ID: References: <0fa1bdb5009e898189f339610b90ecca16f243f4.1648046642.git.quic_charante@quicinc.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <0fa1bdb5009e898189f339610b90ecca16f243f4.1648046642.git.quic_charante@quicinc.com> Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=suse.com header.s=susede1 header.b=CXiEmbSk; spf=pass (imf02.hostedemail.com: domain of mhocko@suse.com designates 195.135.220.28 as permitted sender) smtp.mailfrom=mhocko@suse.com; dmarc=pass (policy=quarantine) header.from=suse.com X-Rspam-User: X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: D61298001C X-Stat-Signature: t7qwc1ph85do5d9714brp4dizod18fi3 X-HE-Tag: 1648127672-413682 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote: > From: Charan Teja Reddy > > The commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with > process_madvise") fixes the issue to return number of bytes that are > successfully advised before hitting error with iovec elements > processing. But, when the user passed unmapped ranges in iovec, the > syscall ignores these holes and continues processing and returns ENOMEM > in the end, which is same as madvise semantic. This is a problem for > vector processing where user may want to know how many bytes were > exactly processed in a iovec element to make better decissions in the > user space. As in ENOMEM case, we processed all bytes in a iovec element > but still returned error which will confuse the user whether it is > failed or succeeded to advise. Do you have any specific example where the initial semantic is really problematic or is this mostly a theoretical problem you have found when reading the code? > As an example, consider below ranges were passed by the user in struct > iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole) and > iovec3(ranges: vma4). In the current implementation, it fully advise > iovec1 and iovec2 but just returns number of processed bytes as iovec1 > range. Then user may repeat the processing of iovec2, which is already > processed, which then returns with ENOMEM. Then user may want to skip > iovec2 and starts processing from iovec3. Here because of wrong return > processed bytes, iovec2 is processed twice. I think you should be much more specific why this is actually a problem. This would surely be less optimal but is this a correctness issue? [...] > + vma = find_vma_prev(mm, start, &prev); > + if (vma && start > vma->vm_start) > + prev = vma; > + > + blk_start_plug(&plug); > + for (;;) { > + /* > + * It it hits a unmapped address range in the [start, end), > + * stop processing and return ENOMEM. > + */ > + if (!vma || start < vma->vm_start) { > + error = -ENOMEM; > + goto out; > + } > + > + tmp = vma->vm_end; > + if (end < tmp) > + tmp = end; > + > + error = madvise_vma_behavior(vma, &prev, start, tmp, behavior); > + if (error) > + goto out; > + tmp_bytes_advised += tmp - start; > + start = tmp; > + if (prev && start < prev->vm_end) > + start = prev->vm_end; > + if (start >= end) > + goto out; > + if (prev) > + vma = prev->vm_next; > + else > + vma = find_vma(mm, start); > + } > +out: > + /* > + * partial_bytes_advised may contain non-zero bytes indicating > + * the number of bytes advised before failure. Holds zero incase > + * of success. > + */ > + *partial_bytes_advised = error ? tmp_bytes_advised : 0; Although this looks like a fix I am not sure it is future proof. madvise_vma_behavior doesn't report which part of the range has been really processed. I do not think that currently supported madvise modes for process_madvise support an early break out with return to the userspace (madvise_cold_or_pageout_pte_range bails on fatal signals for example) but this can change in the future and then you are back to "imprecise" return value problem. Yes, this is a theoretical problem but so it sounds the problem you are trying to fix IMHO. I think it would be better to live with imprecise return values reporting rather than aiming for perfection which would be fragile and add a future maintenance burden. On the other hand if there are _real_ workloads which suffer from the existing semantic then sure the above seems to be an appropriate fix AFAICS. -- Michal Hocko SUSE Labs