From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.5 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, FSL_HELO_FAKE,INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1F4D4C3B187 for ; Tue, 11 Feb 2020 21:08:58 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id B35CE20708 for ; Tue, 11 Feb 2020 21:08:57 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="e4Aq+FYO" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B35CE20708 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 429AA6B0320; Tue, 11 Feb 2020 16:08:57 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 3D9496B0321; Tue, 11 Feb 2020 16:08:57 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2A23A6B0322; Tue, 11 Feb 2020 16:08:57 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0131.hostedemail.com [216.40.44.131]) by kanga.kvack.org (Postfix) with ESMTP id 104116B0320 for ; Tue, 11 Feb 2020 16:08:57 -0500 (EST) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id B5EDF180AD807 for ; Tue, 11 Feb 2020 21:08:56 +0000 (UTC) X-FDA: 76479085872.24.ducks86_30e39ef9f9418 X-HE-Tag: ducks86_30e39ef9f9418 X-Filterd-Recvd-Size: 21800 Received: from mail-pj1-f66.google.com (mail-pj1-f66.google.com [209.85.216.66]) by imf32.hostedemail.com (Postfix) with ESMTP for ; Tue, 11 Feb 2020 21:08:55 +0000 (UTC) Received: by mail-pj1-f66.google.com with SMTP id f2so1165423pjq.1 for ; Tue, 11 Feb 2020 13:08:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=ltO6Zce63S8zmNhRANcObeczMOjpx1NtTOE8FTwRD38=; b=e4Aq+FYO9rLKD8OJ7P7pujZ2/4R0bEetguMcna0ShlnkpmOf249ddX9KMYBwjp8dYT Vb2R746gaeOO52JxbRXZ24R1H5jo+PWO0+IZs8O/BSx3+pxITuMgKgMr9kLtvLWuLelp jrEuan1KpT8juL5gjtALHzljSDuMVDsR4DoHPJ5h3Kj3lHtYzURflkmKw3PKy5nj8nve JipRjh3ZdR80T1fJUdlMuwq5/l0QsU5NdJ8i9EAshR0XSN7BpZi0CI5o42YOZ8KmaxU0 nSO/ZsjdikMfeOEkraWoP12uRIxyfvfbnLql3zEcTsEwaZp7S487gV8yKoIeZWQzS93n Q+mw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=ltO6Zce63S8zmNhRANcObeczMOjpx1NtTOE8FTwRD38=; b=Q22/zX/Q2FllL4Rt9226XtjlQGT1kqbL4mUALOMNlW6r/57w2LYt7J0i2YY6jnZEuK u1aCb0JGz5pz766hZPxQFF6glw3M26PRyaGD553H9/OyiBaV2LtHZjmJuE94qjdseECz bpRRleZ6MBzuL0nfT2iknrv+8hplRcQyCmTxV8n2Efp6beBXbO3maxiHCg0c6zAwKnqa 8irpOAvUihSMgMGCf+Jk1esbncAxROeilrqjTN95bexaxMw+uB7Ppd18u485Afputb5E o9NXzQn1jw92kykGqnwp9SrfkiXxQ+eQF1RYl6pSK1f5O+7C52L8iORDt3yKJaSwZlXS l7sA== X-Gm-Message-State: APjAAAUdQ97EapoIsbozHJ0H3G6P7XX5RRwXnRBKIeq5jnCAuS/mmRUk G0b0VPkJMBJaC3XcpZmsmzk= X-Google-Smtp-Source: APXvYqwg6l3gsGrGg5pvld5vC15F4Uso+nUnV7fasT9cFrDNZYZknQUtgLIdsH0L6Z17uv7G8p5uow== X-Received: by 2002:a17:90a:102:: with SMTP id b2mr5833537pjb.64.1581455334301; Tue, 11 Feb 2020 13:08:54 -0800 (PST) Received: from google.com ([2620:15c:211:1:3e01:2939:5992:52da]) by smtp.gmail.com with ESMTPSA id b3sm5620337pfr.88.2020.02.11.13.08.52 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 11 Feb 2020 13:08:53 -0800 (PST) Date: Tue, 11 Feb 2020 13:08:51 -0800 From: Minchan Kim To: Alexander Duyck Cc: Andrew Morton , LKML , linux-mm , Linux API , oleksandr@redhat.com, Suren Baghdasaryan , Tim Murray , Daniel Colascione , Sandeep Patil , Sonny Rao , Brian Geffon , Michal Hocko , Johannes Weiner , Shakeel Butt , John Dias , Joel Fernandes Subject: Re: [PATCH v3 1/5] mm: factor out madvise's core functionality Message-ID: <20200211210851.GB185752@google.com> References: <20200128001641.5086-1-minchan@kernel.org> <20200128001641.5086-2-minchan@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi Alexander, On Mon, Feb 10, 2020 at 03:00:41PM -0800, Alexander Duyck wrote: > On Mon, Jan 27, 2020 at 4:17 PM Minchan Kim wrote: > > > > This patch factor out madvise's core functionality so that upcoming > > patch can reuse it without duplication. It shouldn't change any behavior. > > > > Signed-off-by: Minchan Kim > > There is a lot to unpack here. I really feel like this description > doesn't do the changes below any service. Really I feel like this > patch should probably be broken up over a few patches to make it > easier to review. Once I read up below, yes, I agree with you. I will try to make it more clear as you suggested. > > You have the moving of the function form the syscall madvise to the > function madvise_common. Then you have various function arguments that > are being added throughout. > > Also this patchset needs to be rebased. It looks like there is already > a patch that is making madvise accessible: > https://lore.kernel.org/io-uring/20200110154739.2119-3-axboe@kernel.dk/ Thanks for heads up! > > > --- > > mm/madvise.c | 194 +++++++++++++++++++++++++++++---------------------- > > 1 file changed, 111 insertions(+), 83 deletions(-) > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index bcdb6a042787..0c901de531e4 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -35,6 +35,7 @@ > > struct madvise_walk_private { > > struct mmu_gather *tlb; > > bool pageout; > > + struct task_struct *task; > > }; > > > > /* > > @@ -306,12 +307,13 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, > > bool pageout = private->pageout; > > struct mm_struct *mm = tlb->mm; > > struct vm_area_struct *vma = walk->vma; > > + struct task_struct *task = private->task; > > pte_t *orig_pte, *pte, ptent; > > spinlock_t *ptl; > > struct page *page = NULL; > > LIST_HEAD(page_list); > > > > - if (fatal_signal_pending(current)) > > + if (fatal_signal_pending(task)) > > return -EINTR; > > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > > Is this the only spot in this function that uses the task? If so why > bother with adding the argument above and not just call private->task > instead of current? I just wanted to check target process signal initially but afterwards realized we need to check caller process as well as target's one. Anyway, it would be more clear not to handle it in this patchset to make it "non-behavior change" for easier review. > > > @@ -469,12 +471,14 @@ static const struct mm_walk_ops cold_walk_ops = { > > }; > > > > static void madvise_cold_page_range(struct mmu_gather *tlb, > > + struct task_struct *task, > > struct vm_area_struct *vma, > > unsigned long addr, unsigned long end) > > { > > struct madvise_walk_private walk_private = { > > .pageout = false, > > .tlb = tlb, > > + .task = task, > > }; > > > > tlb_start_vma(tlb, vma); > > @@ -482,7 +486,7 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, > > tlb_end_vma(tlb, vma); > > } > > > > -static long madvise_cold(struct vm_area_struct *vma, > > +static long madvise_cold(struct task_struct *task, struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > unsigned long start_addr, unsigned long end_addr) > > { > > @@ -495,19 +499,21 @@ static long madvise_cold(struct vm_area_struct *vma, > > > > lru_add_drain(); > > tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > > - madvise_cold_page_range(&tlb, vma, start_addr, end_addr); > > + madvise_cold_page_range(&tlb, task, vma, start_addr, end_addr); > > tlb_finish_mmu(&tlb, start_addr, end_addr); > > > > return 0; > > } > > > > Is there any specific reason for adding the task in the middle of the > list of arguments instead of just placing it at the end? It makes it a > bit harder to review when arguments are added in the middle of the > argument list, or at least that is my opinion. Personally, I prefer top-down hierarchy list from bigger one to smaller. I have thought it's popular way for code strucdturing(but I am only one person to agree on it ;-)) for consistency and easy to think. If we add back to the list for new arguments, it would hurt it, which is my concern. > > > static void madvise_pageout_page_range(struct mmu_gather *tlb, > > + struct task_struct *task, > > struct vm_area_struct *vma, > > unsigned long addr, unsigned long end) > > { > > struct madvise_walk_private walk_private = { > > .pageout = true, > > .tlb = tlb, > > + .task = task, > > }; > > > > tlb_start_vma(tlb, vma); > > @@ -531,9 +537,9 @@ static inline bool can_do_pageout(struct vm_area_struct *vma) > > inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; > > } > > > > -static long madvise_pageout(struct vm_area_struct *vma, > > - struct vm_area_struct **prev, > > - unsigned long start_addr, unsigned long end_addr) > > +static long madvise_pageout(struct task_struct *task, > > + struct vm_area_struct *vma, struct vm_area_struct **prev, > > + unsigned long start_addr, unsigned long end_addr) > > { > > struct mm_struct *mm = vma->vm_mm; > > struct mmu_gather tlb; > > @@ -547,7 +553,7 @@ static long madvise_pageout(struct vm_area_struct *vma, > > > > lru_add_drain(); > > tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr); > > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr); > > tlb_finish_mmu(&tlb, start_addr, end_addr); > > > > return 0; > > @@ -751,7 +757,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma, > > return 0; > > } > > > > Okay so we were adding a task_struct pointer as an argument up to this > point and now we are switching to adding an mm_struct pointer here. I > would split these two off as separate patches and explain in the patch > description why you are doing that instead of just passing the task > here as well and then accessing task->mm. Sure. > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > +static long madvise_dontneed_free(struct mm_struct *mm, > > + struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > unsigned long start, unsigned long end, > > int behavior) > > @@ -763,8 +770,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > if (!userfaultfd_remove(vma, start, end)) { > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > - down_read(¤t->mm->mmap_sem); > > - vma = find_vma(current->mm, start); > > + down_read(&mm->mmap_sem); > > + vma = find_vma(mm, start); > > if (!vma) > > return -ENOMEM; > > if (start < vma->vm_start) { > > @@ -811,7 +818,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma, > > * Application wants to free up the pages and associated backing store. > > * This is effectively punching a hole into the middle of a file. > > */ > > -static long madvise_remove(struct vm_area_struct *vma, > > +static long madvise_remove(struct mm_struct *mm, > > + struct vm_area_struct *vma, > > struct vm_area_struct **prev, > > unsigned long start, unsigned long end) > > { > > @@ -845,13 +853,13 @@ static long madvise_remove(struct vm_area_struct *vma, > > get_file(f); > > if (userfaultfd_remove(vma, start, end)) { > > /* mmap_sem was not released by userfaultfd_remove() */ > > - up_read(¤t->mm->mmap_sem); > > + up_read(&mm->mmap_sem); > > } > > error = vfs_fallocate(f, > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > offset, end - start); > > fput(f); > > - down_read(¤t->mm->mmap_sem); > > + down_read(&mm->mmap_sem); > > return error; > > } > > > > @@ -925,21 +933,23 @@ static int madvise_inject_error(int behavior, > > #endif > > > > static long > > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > > +madvise_vma(struct task_struct *task, struct mm_struct *mm, > > + struct vm_area_struct *vma, struct vm_area_struct **prev, > > unsigned long start, unsigned long end, int behavior) > > { > > switch (behavior) { > > case MADV_REMOVE: > > - return madvise_remove(vma, prev, start, end); > > + return madvise_remove(mm, vma, prev, start, end); > > case MADV_WILLNEED: > > return madvise_willneed(vma, prev, start, end); > > case MADV_COLD: > > - return madvise_cold(vma, prev, start, end); > > + return madvise_cold(task, vma, prev, start, end); > > case MADV_PAGEOUT: > > - return madvise_pageout(vma, prev, start, end); > > + return madvise_pageout(task, vma, prev, start, end); > > case MADV_FREE: > > case MADV_DONTNEED: > > - return madvise_dontneed_free(vma, prev, start, end, behavior); > > + return madvise_dontneed_free(mm, vma, prev, start, > > + end, behavior); > > default: > > return madvise_behavior(vma, prev, start, end, behavior); > > } > > Here you added both task and mm. Is ti truly necessary to have both? > Would it be possible to make use of either task->mm or mm->owner in > order to make it so that you only had to carry one of these? Good point. I guess task would be enough to pass mm_struct. I will look into that. > > > @@ -984,67 +994,19 @@ madvise_behavior_valid(int behavior) > > } > > > > /* > > - * The madvise(2) system call. > > + * madvise_common - request behavior hint to address range of the target process > > * > > - * Applications can use madvise() to advise the kernel how it should > > - * handle paging I/O in this VM area. The idea is to help the kernel > > - * use appropriate read-ahead and caching techniques. The information > > - * provided is advisory only, and can be safely disregarded by the > > - * kernel without affecting the correct operation of the application. > > + * @task: task_struct got behavior hint, not giving the hint > > + * @mm: mm_struct got behavior hint, not giving the hint > > + * @start: base address of the hinted range > > + * @len_in: length of the hinted range > > + * @behavior: requested hint > > * > > - * behavior values: > > - * MADV_NORMAL - the default behavior is to read clusters. This > > - * results in some read-ahead and read-behind. > > - * MADV_RANDOM - the system should read the minimum amount of data > > - * on any access, since it is unlikely that the appli- > > - * cation will need more than what it asks for. > > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed > > - * once, so they can be aggressively read ahead, and > > - * can be freed soon after they are accessed. > > - * MADV_WILLNEED - the application is notifying the system to read > > - * some pages ahead. > > - * MADV_DONTNEED - the application is finished with the given range, > > - * so the kernel can free resources associated with it. > > - * MADV_FREE - the application marks pages in the given range as lazy free, > > - * where actual purges are postponed until memory pressure happens. > > - * MADV_REMOVE - the application wants to free up the given range of > > - * pages and associated backing store. > > - * MADV_DONTFORK - omit this area from child's address space when forking: > > - * typically, to avoid COWing pages pinned by get_user_pages(). > > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking. > > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this > > - * range after a fork. > > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK > > - * MADV_HWPOISON - trigger memory error handler as if the given memory range > > - * were corrupted by unrecoverable hardware memory failure. > > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. > > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in > > - * this area with pages of identical content from other such areas. > > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others. > > - * MADV_HUGEPAGE - the application wants to back the given range by transparent > > - * huge pages in the future. Existing pages might be coalesced and > > - * new pages might be allocated as THP. > > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by > > - * transparent huge pages so the existing pages will not be > > - * coalesced into THP and new pages will not be allocated as THP. > > - * MADV_DONTDUMP - the application wants to prevent pages in the given range > > - * from being included in its core dump. > > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > > - * > > - * return values: > > - * zero - success > > - * -EINVAL - start + len < 0, start is not page-aligned, > > - * "behavior" is not a valid value, or application > > - * is attempting to release locked or shared pages, > > - * or the specified address range includes file, Huge TLB, > > - * MAP_SHARED or VMPFNMAP range. > > - * -ENOMEM - addresses in the specified range are not currently > > - * mapped, or are outside the AS of the process. > > - * -EIO - an I/O error occurred while paging in data. > > - * -EBADF - map exists, but area maps something that isn't a file. > > - * -EAGAIN - a kernel resource was temporarily unavailable. > > + * @task could be a zombie leader if it calls sys_exit so accessing mm_struct > > + * via task->mm is prohibited. Please use @mm instead of task->mm. > > */ > > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > +static int madvise_common(struct task_struct *task, struct mm_struct *mm, > > + unsigned long start, size_t len_in, int behavior) > > { > > unsigned long end, tmp; > > struct vm_area_struct *vma, *prev; > > So I would save this piece and move it to a separate patch. All of the > noise from the comment move would be greatly reduced and make it > easier to review the rest of this. Agree. > > > @@ -1082,10 +1044,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > > > write = madvise_need_mmap_write(behavior); > > if (write) { > > - if (down_write_killable(¤t->mm->mmap_sem)) > > + if (down_write_killable(&mm->mmap_sem)) > > return -EINTR; > > } else { > > - down_read(¤t->mm->mmap_sem); > > + down_read(&mm->mmap_sem); > > } > > > > /* > > @@ -1093,7 +1055,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > * ranges, just ignore them, but return -ENOMEM at the end. > > * - different from the way of handling in mlock etc. > > */ > > - vma = find_vma_prev(current->mm, start, &prev); > > + vma = find_vma_prev(mm, start, &prev); > > if (vma && start > vma->vm_start) > > prev = vma; > > > > @@ -1118,7 +1080,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > tmp = end; > > > > /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ > > - error = madvise_vma(vma, &prev, start, tmp, behavior); > > + error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior); > > if (error) > > goto out; > > start = tmp; > > @@ -1130,14 +1092,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > if (prev) > > vma = prev->vm_next; > > else /* madvise_remove dropped mmap_sem */ > > - vma = find_vma(current->mm, start); > > + vma = find_vma(mm, start); > > } > > out: > > blk_finish_plug(&plug); > > if (write) > > - up_write(¤t->mm->mmap_sem); > > + up_write(&mm->mmap_sem); > > else > > - up_read(¤t->mm->mmap_sem); > > + up_read(&mm->mmap_sem); > > > > return error; > > } > > Okay so the rest of this is just more of the task/mm being added as > separate arguments. > Thanks for the review!