linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: Charan Teja Kalla <quic_charante@quicinc.com>
Cc: akpm@linux-foundation.org, minchan@kernel.org, surenb@google.com,
	vbabka@suse.cz, rientjes@google.com, nadav.amit@gmail.com,
	edgararriaga@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/2] mm: madvise: return exact bytes advised with process_madvise under error
Date: Thu, 24 Mar 2022 14:14:30 +0100
Message-ID: <Yjxutr35QLGhjJ57@dhcp22.suse.cz>
In-Reply-To: <0fa1bdb5009e898189f339610b90ecca16f243f4.1648046642.git.quic_charante@quicinc.com>

On Wed 23-03-22 20:54:10, Charan Teja Kalla wrote:
> From: Charan Teja Reddy <quic_charante@quicinc.com>
> 
> Commit 5bd009c7c9a9 ("mm: madvise: return correct bytes advised with
> process_madvise") fixed the syscall to return the number of bytes
> successfully advised before an error is hit while processing iovec
> elements. But when the user passes unmapped ranges in an iovec, the
> syscall ignores these holes, continues processing, and returns ENOMEM
> at the end, which matches madvise semantics. This is a problem for
> vector processing, where the user may want to know exactly how many
> bytes were processed in an iovec element in order to make better
> decisions in user space. In the ENOMEM case, we processed all bytes in
> an iovec element but still returned an error, which leaves the user
> unsure whether the advice failed or succeeded.

Do you have a specific example where the existing semantics are really
problematic, or is this mostly a theoretical problem you found while
reading the code?


> As an example, consider the following ranges passed by the user in
> struct iovec: iovec1(ranges: vma1), iovec2(ranges: vma2 -- vma3 -- hole)
> and iovec3(ranges: vma4). The current implementation fully advises
> iovec1 and iovec2 but returns only the iovec1 range as the number of
> processed bytes. The user may then repeat the processing of iovec2,
> which is already processed, and get ENOMEM again. The user may then
> skip iovec2 and start processing from iovec3. Because of the wrong
> count of processed bytes, iovec2 is processed twice.

I think you should be much more specific about why this is actually a
problem. Processing iovec2 twice is surely less optimal, but is it a
correctness issue?
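
To make the userspace side of this concrete, here is a minimal sketch of
how a caller might map a "bytes advised" return value back to a position
in its iovec array. The helper find_resume_pos() and struct resume_pos
are hypothetical names used only for illustration; they are not part of
the patch or of any kernel or libc API. The sketch assumes the return
value counts bytes from the start of the vector.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>	/* struct iovec */

/* Hypothetical: where in the vector a caller would resume. */
struct resume_pos {
	size_t index;	/* first element that was not fully advised */
	size_t offset;	/* bytes of that element already advised */
};

/*
 * Hypothetical helper: translate a byte count, measured from the start
 * of the vector, into an element index and an offset inside it.
 * Returns index == n when every element is fully covered.
 */
static struct resume_pos find_resume_pos(const struct iovec *iov, size_t n,
					 size_t advised)
{
	struct resume_pos pos = { 0, 0 };

	assert(iov != NULL || n == 0);

	/* Consume whole elements while the count covers them. */
	while (pos.index < n && advised >= iov[pos.index].iov_len) {
		advised -= iov[pos.index].iov_len;
		pos.index++;
	}
	/* Whatever is left over lies inside the current element. */
	if (pos.index < n)
		pos.offset = advised;
	return pos;
}
```

With the three-element example from the changelog, a return value equal
to the length of iovec1 yields index 1, offset 0, so the caller restarts
at the beginning of iovec2. A return value that also covers vma2 and
vma3 would instead yield a non-zero offset inside iovec2, which is the
extra information the patch is after.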

[...]
> +	vma = find_vma_prev(mm, start, &prev);
> +	if (vma && start > vma->vm_start)
> +		prev = vma;
> +
> +	blk_start_plug(&plug);
> +	for (;;) {
> +		/*
> +		 * If it hits an unmapped address range in the [start, end),
> +		 * stop processing and return ENOMEM.
> +		 */
> +		if (!vma || start < vma->vm_start) {
> +			error = -ENOMEM;
> +			goto out;
> +		}
> +
> +		tmp = vma->vm_end;
> +		if (end < tmp)
> +			tmp = end;
> +
> +		error = madvise_vma_behavior(vma, &prev, start, tmp, behavior);
> +		if (error)
> +			goto out;
> +		tmp_bytes_advised += tmp - start;
> +		start = tmp;
> +		if (prev && start < prev->vm_end)
> +			start = prev->vm_end;
> +		if (start >= end)
> +			goto out;
> +		if (prev)
> +			vma = prev->vm_next;
> +		else
> +			vma = find_vma(mm, start);
> +	}
> +out:
> +	/*
> +	 * partial_bytes_advised may contain non-zero bytes indicating
> +	 * the number of bytes advised before failure. Holds zero in case
> +	 * of success.
> +	 */
> +	*partial_bytes_advised = error ? tmp_bytes_advised : 0;

Although this looks like a fix, I am not sure it is future proof.
madvise_vma_behavior doesn't report which part of the range has really
been processed. I do not think that any of the madvise modes currently
supported by process_madvise can break out early and return to
userspace (madvise_cold_or_pageout_pte_range bails on fatal signals,
for example), but this can change in the future, and then you are back
to the "imprecise" return value problem. Yes, this is a theoretical
problem, but so, it seems to me, is the problem you are trying to fix.
I think it would be better to live with imprecise return value
reporting than to aim for perfection that would be fragile and add a
future maintenance burden.

On the other hand, if there are _real_ workloads that suffer from the
existing semantics, then sure, the above seems to be an appropriate fix
AFAICS.
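
For reference, the stop-at-first-hole accounting in the quoted loop can
be modelled in plain userspace C. This is only a sketch under stated
assumptions: struct range is a stand-in for a VMA, advise_until_hole()
is a hypothetical name, and the mappings are assumed to be sorted and
non-overlapping. Unlike the patch, it reports the byte count on success
as well.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for a VMA: covers [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/*
 * Hypothetical model of the quoted loop: walk [start, end) across
 * sorted, non-overlapping mappings and stop at the first hole.
 * Returns 0 on success or -ENOMEM when a hole is hit; *advised holds
 * the bytes covered before stopping.
 */
static int advise_until_hole(const struct range *maps, size_t n,
			     unsigned long start, unsigned long end,
			     unsigned long *advised)
{
	size_t i = 0;

	assert(maps != NULL || n == 0);
	*advised = 0;

	/* Like find_vma(): first mapping that ends above start. */
	while (i < n && maps[i].end <= start)
		i++;

	while (start < end) {
		unsigned long tmp;

		/* No mapping left, or a gap before the next one. */
		if (i == n || start < maps[i].start)
			return -ENOMEM;

		tmp = maps[i].end < end ? maps[i].end : end;
		/* Assumes the whole chunk [start, tmp) was advised. */
		*advised += tmp - start;
		start = tmp;
		i++;
	}
	return 0;
}
```

The line that adds tmp - start is where the concern above applies: the
count is exact only as long as each per-mapping step either handles its
whole chunk or fails outright.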
-- 
Michal Hocko
SUSE Labs

