From: Dan Williams
Date: Mon, 11 Jun 2018 09:48:45 -0700
Subject: Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
In-Reply-To: <20180611154146.jc5xt4gyaihq64lm@quack2.suse.cz>
References: <152850182079.38390.8280340535691965744.stgit@dwillia2-desk3.amr.corp.intel.com>
 <152850187437.38390.2257981090761438811.stgit@dwillia2-desk3.amr.corp.intel.com>
 <20180611154146.jc5xt4gyaihq64lm@quack2.suse.cz>
To: Jan Kara
Cc: linux-nvdimm, Christoph Hellwig, Linux MM, linux-fsdevel

On Mon, Jun 11, 2018 at 8:41 AM, Jan Kara wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams
>
> Some comments below...
>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>  	}
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>
> Why do you return struct page here? Any reason behind that? Because struct
> page exists and can be accessed through pfn_to_page() regardless of the
> result of this function, so it looks a bit confusing. Also the name
> dax_lock_page() seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?
>
>> +	pgoff_t index;
>> +	struct inode *inode;
>> +	wait_queue_head_t *wq;
>> +	void *entry = NULL, **slot;
>> +	struct address_space *mapping;
>> +	struct wait_exceptional_entry_queue ewait;
>> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> +	rcu_read_lock();
>> +	for (;;) {
>> +		mapping = READ_ONCE(page->mapping);
>> +
>> +		if (!mapping || !IS_DAX(mapping->host))
>> +			break;
>> +
>> +		/*
>> +		 * In the device-dax case there's no need to lock, a
>> +		 * struct dev_pagemap pin is sufficient to keep the
>> +		 * inode alive.
>> +		 */
>> +		inode = mapping->host;
>> +		if (S_ISCHR(inode->i_mode)) {
>> +			ret = page;
>> +			break;
>> +		}
>> +
>> +		xa_lock_irq(&mapping->i_pages);
>> +		if (mapping != page->mapping) {
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			continue;
>> +		}
>> +		index = page->index;
>> +
>> +		init_wait(&ewait.wait);
>> +		ewait.wait.func = wake_exceptional_entry_func;
>
> This initialization could be before the loop.
>
>> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +				&slot);
>> +		if (!entry ||
>> +		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			break;
>> +		} else if (!slot_locked(mapping, slot)) {
>> +			lock_slot(mapping, slot);
>> +			ret = page;
>> +			xa_unlock_irq(&mapping->i_pages);
>> +			break;
>> +		}
>> +
>> +		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
>> +		prepare_to_wait_exclusive(wq, &ewait.wait,
>> +				TASK_UNINTERRUPTIBLE);
>> +		xa_unlock_irq(&mapping->i_pages);
>> +		rcu_read_unlock();
>> +		schedule();
>> +		finish_wait(wq, &ewait.wait);
>> +		rcu_read_lock();
>> +	}
>> +	rcu_read_unlock();
>
> I don't like how this duplicates a lot of get_unlocked_mapping_entry().
> Can we possibly factor this out similarly, as done for wait_event()?

Ok, I'll give that a shot.
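[Editor's note: one possible shape for the factoring Jan suggests, sketched from the code quoted above. The helper name and exact signature are hypothetical, not from the patch; the RCU handling is deliberately left to the callers, since only dax_lock_page() drops rcu_read_lock() around the sleep while get_unlocked_mapping_entry() does not.]

```c
/*
 * Hypothetical helper: wait for the locked radix tree entry at @index
 * to be unlocked.  Called with xa_lock held; drops it before sleeping
 * and returns with it released.  Callers must restart their lookup
 * afterwards, because the entry may have changed while we slept.
 */
static void wait_entry_unlocked(struct address_space *mapping, pgoff_t index,
		void *entry)
{
	struct wait_exceptional_entry_queue ewait;
	wait_queue_head_t *wq;

	init_wait(&ewait.wait);
	ewait.wait.func = wake_exceptional_entry_func;

	wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
	prepare_to_wait_exclusive(wq, &ewait.wait, TASK_UNINTERRUPTIBLE);
	xa_unlock_irq(&mapping->i_pages);
	schedule();
	finish_wait(wq, &ewait.wait);
}
```

With this, the tail of the dax_lock_page() loop would reduce to dropping rcu_read_lock(), calling wait_entry_unlocked(), retaking rcu_read_lock(), and continuing, while get_unlocked_mapping_entry() could call the same helper directly.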