From: Robin Holt <holt@sgi.com>
To: Hugh Dickins <hugh@veritas.com>
Cc: Robin Holt <holt@sgi.com>, Nick Piggin <nickpiggin@yahoo.com.au>,
Ingo Molnar <mingo@elte.hu>, Christoph Lameter <clameter@sgi.com>,
Jack Steiner <steiner@sgi.com>, Rik van Riel <riel@redhat.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
linux-mm@kvack.org
Subject: Re: Can get_user_pages( ,write=1, force=1, ) result in a read-only pte and _count=2?
Date: Tue, 24 Jun 2008 10:19:08 -0500 [thread overview]
Message-ID: <20080624151908.GO10062@sgi.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0806232134330.19691@blonde.site>
OK. I just gave it a try and I still get a failure with this patch
applied.
I can't help wondering if we (XPMEM, IB, etc) shouldn't be setting a
page flag indicating that the attempt to swap this page out should not
even be tried. I would guess this has already been discussed to death.
If so, I am sorry, but I missed those discussions.
Thanks,
Robin
On Mon, Jun 23, 2008 at 09:58:42PM +0100, Hugh Dickins wrote:
> On Mon, 23 Jun 2008, Robin Holt wrote:
> > On Mon, Jun 23, 2008 at 05:48:17PM +0100, Hugh Dickins wrote:
> >
> > > reuse test in do_wp_page that I'm still working on - of which Nick sent
> > > a lock_page approximation for you to try? Would you still be able to
> > > try mine when I'm ready, or does it now appear irrelevant to you?
> >
> > Before your response, I had convinced myself my problem was specific to
> > XPMEM, but I see your point and may agree that it is a problem for all
> > get_user_pages() users.
> >
> > I can certainly test when you have it ready.
>
> Thanks a lot, Robin. Here it is below.
>
> >
> > I had confused myself about Nick's first patch. I will give that
> > another look over and see if it fixes the problem.
>
> Nick's _first_ patch? The one I was thinking of was the one
> with lock_page in do_wp_page, but it shouldn't be necessary now if
> what's below works - though his is a much smaller and less worrying
> patch, so anyone looking for a quick fix to the issue might well
> prefer his.
>
> >
> > > http://lkml.org/lkml/2006/9/14/384
> > >
> > > but it's a broken thread, with misunderstanding on all sides,
> > > so rather hard to get a grasp of it.
> >
> > That is extremely similar to the issue I am seeing. I think that if
> > Infiniband were using the mmu_notifier stuff, they would be closer, but
> > IIRC, there are significant hardware restrictions which prevent demand
> > paging for working on some IB devices.
>
> Ah, I'm glad you've managed to glean something from it, good.
>
> Here's the rollup of the patches I'm proposing for two issues:
> it doesn't get my signoff yet because I'll want to split it into
> little stages properly commented, and I'll want to do more strenuous
> testing; but this shouldn't blow up in your face. Against 2.6.26-rc7,
> should apply quite easily to earlier, but a little more work against
> 2.6.26-rc5-mm3 - though it simplifies some of that too (Rik+Lee Cced).
>
> I say two issues, two competing issues. One is the issue which may be
> your issue, that we want to decide COW in do_wp_page without needing
> PageLocked, so a concurrent shrink_page_list() doesn't occasionally
> force an unwanted COW, messing up some get_user_pages() uses. The
> other, of no interest to you, is that we do want PageLocked when
> deciding COW in do_wp_page, because that's a good moment to free
> up the swap space - leaving the modified page with swap just makes
> a big seek likely when it's next written to swap.
>
> Hugh
>
> include/linux/swap.h | 21 ++++--
> mm/memory.c | 15 +++-
> mm/migrate.c | 5 -
> mm/page_io.c | 2
> mm/swap_state.c | 12 ++-
> mm/swapfile.c | 128 ++++++++++++++++++++++-------------------
> 6 files changed, 105 insertions(+), 78 deletions(-)
>
> --- 2.6.26-rc7/include/linux/swap.h 2008-05-03 21:55:11.000000000 +0100
> +++ linux/include/linux/swap.h 2008-06-23 18:12:47.000000000 +0100
> @@ -250,8 +250,9 @@ extern unsigned int count_swap_pages(int
> extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
> extern sector_t swapdev_block(int, pgoff_t);
> extern struct swap_info_struct *get_swap_info_struct(unsigned);
> -extern int can_share_swap_page(struct page *);
> -extern int remove_exclusive_swap_page(struct page *);
> +extern int can_reuse_swap_page_unlocked(struct page *);
> +extern int can_reuse_swap_page_locked(struct page *);
> +extern int try_to_free_swap(struct page *);
> struct backing_dev_info;
>
> extern spinlock_t swap_lock;
> @@ -319,8 +320,6 @@ static inline struct page *lookup_swap_c
> return NULL;
> }
>
> -#define can_share_swap_page(p) (page_mapcount(p) == 1)
> -
> static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
> gfp_t gfp_mask)
> {
> @@ -337,9 +336,19 @@ static inline void delete_from_swap_cach
>
> #define swap_token_default_timeout 0
>
> -static inline int remove_exclusive_swap_page(struct page *p)
> +static inline int can_reuse_swap_page_unlocked(struct page *page)
> {
> - return 0;
> + return 0; /* irrelevant: never called */
> +}
> +
> +static inline int can_reuse_swap_page_locked(struct page *page)
> +{
> + return 0; /* irrelevant: never called */
> +}
> +
> +static inline int try_to_free_swap(struct page *page)
> +{
> + return 0; /* irrelevant: never called */
> }
>
> static inline swp_entry_t get_swap_page(void)
> --- 2.6.26-rc7/mm/memory.c 2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/memory.c 2008-06-23 18:12:47.000000000 +0100
> @@ -1686,9 +1686,14 @@ static int do_wp_page(struct mm_struct *
> * not dirty accountable.
> */
> if (PageAnon(old_page)) {
> - if (!TestSetPageLocked(old_page)) {
> - reuse = can_share_swap_page(old_page);
> - unlock_page(old_page);
> + if (page_mapcount(old_page) == 1) {
> + if (!PageSwapCache(old_page))
> + reuse = 1;
> + else if (!TestSetPageLocked(old_page)) {
> + reuse = can_reuse_swap_page_locked(old_page);
> + unlock_page(old_page);
> + } else
> + reuse = can_reuse_swap_page_unlocked(old_page);
> }
> } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> (VM_WRITE|VM_SHARED))) {
> @@ -2185,7 +2190,7 @@ static int do_swap_page(struct mm_struct
>
> inc_mm_counter(mm, anon_rss);
> pte = mk_pte(page, vma->vm_page_prot);
> - if (write_access && can_share_swap_page(page)) {
> + if (write_access && can_reuse_swap_page_locked(page)) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> write_access = 0;
> }
> @@ -2196,7 +2201,7 @@ static int do_swap_page(struct mm_struct
>
> swap_free(entry);
> if (vm_swap_full())
> - remove_exclusive_swap_page(page);
> + try_to_free_swap(page);
> unlock_page(page);
>
> if (write_access) {
> --- 2.6.26-rc7/mm/migrate.c 2008-06-21 08:41:19.000000000 +0100
> +++ linux/mm/migrate.c 2008-06-23 18:12:47.000000000 +0100
> @@ -330,8 +330,10 @@ static int migrate_page_move_mapping(str
> get_page(newpage); /* add cache reference */
> #ifdef CONFIG_SWAP
> if (PageSwapCache(page)) {
> - SetPageSwapCache(newpage);
> set_page_private(newpage, page_private(page));
> + /* page_swapcount() relies on private whenever PageSwapCache */
> + smp_wmb();
> + SetPageSwapCache(newpage);
> }
> #endif
>
> @@ -398,7 +400,6 @@ static void migrate_page_copy(struct pag
> #endif
> ClearPageActive(page);
> ClearPagePrivate(page);
> - set_page_private(page, 0);
> page->mapping = NULL;
>
> /*
> --- 2.6.26-rc7/mm/page_io.c 2008-04-17 03:49:44.000000000 +0100
> +++ linux/mm/page_io.c 2008-06-23 18:12:47.000000000 +0100
> @@ -98,7 +98,7 @@ int swap_writepage(struct page *page, st
> struct bio *bio;
> int ret = 0, rw = WRITE;
>
> - if (remove_exclusive_swap_page(page)) {
> + if (try_to_free_swap(page)) {
> unlock_page(page);
> goto out;
> }
> --- 2.6.26-rc7/mm/swap_state.c 2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swap_state.c 2008-06-23 18:12:47.000000000 +0100
> @@ -76,13 +76,15 @@ int add_to_swap_cache(struct page *page,
> BUG_ON(PagePrivate(page));
> error = radix_tree_preload(gfp_mask);
> if (!error) {
> + set_page_private(page, entry.val);
> + /* page_swapcount() relies on private whenever PageSwapCache */
> + smp_wmb();
> write_lock_irq(&swapper_space.tree_lock);
> error = radix_tree_insert(&swapper_space.page_tree,
> entry.val, page);
> if (!error) {
> page_cache_get(page);
> SetPageSwapCache(page);
> - set_page_private(page, entry.val);
> total_swapcache_pages++;
> __inc_zone_page_state(page, NR_FILE_PAGES);
> INC_CACHE_INFO(add_total);
> @@ -105,7 +107,6 @@ void __delete_from_swap_cache(struct pag
> BUG_ON(PagePrivate(page));
>
> radix_tree_delete(&swapper_space.page_tree, page_private(page));
> - set_page_private(page, 0);
> ClearPageSwapCache(page);
> total_swapcache_pages--;
> __dec_zone_page_state(page, NR_FILE_PAGES);
> @@ -188,13 +189,14 @@ void delete_from_swap_cache(struct page
> *
> * Its ok to check for PageSwapCache without the page lock
> * here because we are going to recheck again inside
> - * exclusive_swap_page() _with_ the lock.
> + * try_to_free_swap() _with_ the lock.
> * - Marcelo
> */
> static inline void free_swap_cache(struct page *page)
> {
> - if (PageSwapCache(page) && !TestSetPageLocked(page)) {
> - remove_exclusive_swap_page(page);
> + if (PageSwapCache(page) && !page_mapped(page) &&
> + !TestSetPageLocked(page)) {
> + try_to_free_swap(page);
> unlock_page(page);
> }
> }
> --- 2.6.26-rc7/mm/swapfile.c 2008-05-03 21:55:12.000000000 +0100
> +++ linux/mm/swapfile.c 2008-06-23 18:12:47.000000000 +0100
> @@ -251,7 +251,6 @@ static struct swap_info_struct * swap_in
> goto bad_offset;
> if (!p->swap_map[offset])
> goto bad_free;
> - spin_lock(&swap_lock);
> return p;
>
> bad_free:
> @@ -300,90 +299,104 @@ void swap_free(swp_entry_t entry)
>
> p = swap_info_get(entry);
> if (p) {
> + spin_lock(&swap_lock);
> swap_entry_free(p, swp_offset(entry));
> spin_unlock(&swap_lock);
> }
> }
>
> /*
> - * How many references to page are currently swapped out?
> + * How many page table references are there to this page's swap entry?
> + * including references to the page itself if add_mapcount.
> + *
> + * page_swapcount is only called on a page which was recently PageSwapCache
> + * (but might have been deleted from swap cache just before getting here).
> + *
> + * When called with PageLocked, the swapcount is stable, and the mapcount
> + * cannot rise (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * When called without PageLocked, the swapcount is stable while we hold
> + * swap_lock, and the mapcount cannot rise from 1 while we hold that page
> + * table lock (but may fall if the page is concurrently unmapped from a
> + * page table whose lock we don't hold).
> + *
> + * do_swap_page and unuse_pte call page_add_anon_rmap before swap_free,
> + * try_to_unmap_one calls swap_duplicate before page_remove_rmap:
> + * so in general, swapcount+mapcount should never be seen too low -
> + * but you need to consider more memory barriers if extending its use.
> */
> -static inline int page_swapcount(struct page *page)
> +static int page_swapcount(struct page *page, int add_mapcount)
> {
> int count = 0;
> struct swap_info_struct *p;
> swp_entry_t entry;
>
> - entry.val = page_private(page);
> - p = swap_info_get(entry);
> - if (p) {
> - /* Subtract the 1 for the swap cache itself */
> - count = p->swap_map[swp_offset(entry)] - 1;
> - spin_unlock(&swap_lock);
> + spin_lock(&swap_lock);
> + if (add_mapcount)
> + count = page_mapcount(page);
> + if (PageSwapCache(page)) {
> + /* We can rely on page_private once PageSwapCache is visible */
> + smp_rmb();
> + entry.val = page_private(page);
> + p = swap_info_get(entry);
> + if (p) {
> + /* Subtract the 1 for the swap cache itself */
> + count += p->swap_map[swp_offset(entry)] - 1;
> + }
> }
> + spin_unlock(&swap_lock);
> return count;
> }
>
> /*
> - * We can use this swap cache entry directly
> - * if there are no other references to it.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * But something else, probably shrink_page_list(), already has PageLocked.
> + * Don't be misled to COW the page unnecessarily: check swapcount+mapcount.
> */
> -int can_share_swap_page(struct page *page)
> +int can_reuse_swap_page_unlocked(struct page *page)
> {
> - int count;
> -
> - BUG_ON(!PageLocked(page));
> - count = page_mapcount(page);
> - if (count <= 1 && PageSwapCache(page))
> - count += page_swapcount(page);
> - return count == 1;
> + return page_swapcount(page, 1) == 1;
> }
>
> /*
> - * Work out if there are any other processes sharing this
> - * swap cache page. Free it if you can. Return success.
> + * Can do_wp_page() make the faulting swapcache page writable without COW?
> + * having acquiring PageLocked. In this case, since we have that lock and
> + * are about to modify the page, we'd better free its swap space - it won't
> + * be read again, and writing there later would probably require an extra seek.
> */
> -int remove_exclusive_swap_page(struct page *page)
> +int can_reuse_swap_page_locked(struct page *page)
> {
> - int retval;
> - struct swap_info_struct * p;
> - swp_entry_t entry;
> + VM_BUG_ON(!PageLocked(page));
> + if (page_swapcount(page, 1) != 1)
> + return 0;
> + if (!PageSwapCache(page))
> + return 1;
> + if (PageWriteback(page))
> + return 1;
>
> - BUG_ON(PagePrivate(page));
> - BUG_ON(!PageLocked(page));
> + delete_from_swap_cache(page);
> + SetPageDirty(page);
> + return 1;
> +}
>
> +/*
> + * If swap is getting full, or if there are no more mappings of this page,
> + * then try_to_free_swap is called to free its swap space.
> + */
> +int try_to_free_swap(struct page *page)
> +{
> + VM_BUG_ON(!PageLocked(page));
> if (!PageSwapCache(page))
> return 0;
> if (PageWriteback(page))
> return 0;
> - if (page_count(page) != 2) /* 2: us + cache */
> + if (page_swapcount(page, 0)) /* Here we don't care about mapcount */
> return 0;
>
> - entry.val = page_private(page);
> - p = swap_info_get(entry);
> - if (!p)
> - return 0;
> -
> - /* Is the only swap cache user the cache itself? */
> - retval = 0;
> - if (p->swap_map[swp_offset(entry)] == 1) {
> - /* Recheck the page count with the swapcache lock held.. */
> - write_lock_irq(&swapper_space.tree_lock);
> - if ((page_count(page) == 2) && !PageWriteback(page)) {
> - __delete_from_swap_cache(page);
> - SetPageDirty(page);
> - retval = 1;
> - }
> - write_unlock_irq(&swapper_space.tree_lock);
> - }
> - spin_unlock(&swap_lock);
> -
> - if (retval) {
> - swap_free(entry);
> - page_cache_release(page);
> - }
> -
> - return retval;
> + delete_from_swap_cache(page);
> + SetPageDirty(page);
> + return 1;
> }
>
> /*
> @@ -400,6 +413,7 @@ void free_swap_and_cache(swp_entry_t ent
>
> p = swap_info_get(entry);
> if (p) {
> + spin_lock(&swap_lock);
> if (swap_entry_free(p, swp_offset(entry)) == 1) {
> page = find_get_page(&swapper_space, entry.val);
> if (page && unlikely(TestSetPageLocked(page))) {
> @@ -410,14 +424,10 @@ void free_swap_and_cache(swp_entry_t ent
> spin_unlock(&swap_lock);
> }
> if (page) {
> - int one_user;
> -
> - BUG_ON(PagePrivate(page));
> - one_user = (page_count(page) == 2);
> - /* Only cache user (+us), or swap space full? Free it! */
> + /* Not mapped elsewhere, or swap space full? Free it! */
> /* Also recheck PageSwapCache after page is locked (above) */
> if (PageSwapCache(page) && !PageWriteback(page) &&
> - (one_user || vm_swap_full())) {
> + (!page_mapped(page) || vm_swap_full())) {
> delete_from_swap_cache(page);
> SetPageDirty(page);
> }
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2008-06-24 15:19 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-06-18 16:41 Robin Holt
2008-06-18 17:29 ` Nick Piggin
2008-06-18 19:01 ` Hugh Dickins
2008-06-18 20:33 ` Robin Holt
2008-06-18 21:46 ` Hugh Dickins
2008-06-19 3:31 ` Nick Piggin
2008-06-19 3:34 ` Nick Piggin
2008-06-19 11:39 ` Hugh Dickins
2008-06-19 12:07 ` Nick Piggin
2008-06-19 12:21 ` Nick Piggin
2008-06-19 17:48 ` Christoph Lameter
2008-06-19 12:34 ` Hugh Dickins
2008-06-19 12:53 ` Nick Piggin
2008-06-19 13:25 ` Hugh Dickins
2008-06-19 13:35 ` Robin Holt
2008-06-19 16:32 ` Robin Holt
2008-06-20 9:23 ` Nick Piggin
2008-06-19 3:07 ` Nick Piggin
2008-06-19 11:09 ` Hugh Dickins
2008-06-19 13:38 ` Robin Holt
2008-06-19 13:49 ` Hugh Dickins
2008-06-23 15:54 ` Robin Holt
2008-06-23 16:48 ` Hugh Dickins
2008-06-23 17:52 ` Robin Holt
2008-06-23 20:58 ` Hugh Dickins
2008-06-24 11:56 ` Robin Holt
2008-06-24 15:19 ` Robin Holt [this message]
2008-06-24 20:19 ` Hugh Dickins
2008-06-23 19:11 ` Robin Holt
2008-06-23 19:12 ` Robin Holt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080624151908.GO10062@sgi.com \
--to=holt@sgi.com \
--cc=Lee.Schermerhorn@hp.com \
--cc=clameter@sgi.com \
--cc=hugh@veritas.com \
--cc=linux-mm@kvack.org \
--cc=mingo@elte.hu \
--cc=nickpiggin@yahoo.com.au \
--cc=riel@redhat.com \
--cc=steiner@sgi.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox