From: 'David Gibson' <david@gibson.dropbear.id.au>
To: Andrew Morton <akpm@osdl.org>
Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>,
'Christoph Lameter' <christoph@schroedinger.engr.sgi.com>,
Hugh Dickins <hugh@veritas.com>,
bill.irwin@oracle.com, Adam Litke <agl@us.ibm.com>,
linux-mm@kvack.org
Subject: Re: [RFC] reduce hugetlb_instantiation_mutex usage
Date: Fri, 27 Oct 2006 11:47:40 +1000 [thread overview]
Message-ID: <20061027014740.GD11733@localhost.localdomain> (raw)
In-Reply-To: <20061026233137.GA11733@localhost.localdomain>
On Fri, Oct 27, 2006 at 09:31:37AM +1000, 'David Gibson' wrote:
> On Thu, Oct 26, 2006 at 03:44:51PM -0700, Andrew Morton wrote:
[snip]
> > The key points:
> >
> > - Use tree_lock to prevent the race
> >
> > - allocate the hugepage inside tree_lock so we never get into this
> > two-threads-tried-to-allocate-the-final-page problem.
> >
> > - The hugepage is zeroed without locks held, under lock_page()
> >
> > - lock_page() is used to make the other thread(s) sleep while the winner
> > thread is zeroing out the page.
> >
> > It means that rather a lot of add_to_page_cache() will need to be copied
> > into hugetlb_no_page().
>
> This handles the case of processes racing on a shared mapping, but not
> the case of threads racing on a private mapping. In the latter case
> the race ends at the set_pte() rather than the add_to_page_cache()
> (well, strictly with the whole page_table_lock atomic lump). And we
> can't move the clear after the set_pte() :(.
At various times many people have proposed "solutions" which address
the SHARED case, or the PRIVATE case, but not both. As Andrew points
out in a later mail his approach may be fixable for the PRIVATE case,
but nonetheless it's important to check both cases.
So, here's another patch for libhugetlbfs, extending its testcase for
this race to check the MAP_PRIVATE case as well as the MAP_SHARED
case. Adam, please apply. Everyone else, please test proposed
approaches against the testsuite. It's not exhaustive, but it's easy
to run and a good start :).
libhugetlbfs: Testcase for MAP_PRIVATE OOM-liable race condition
The spurious OOM condition which can be caused by race conditions in
the hugetlb fault handler can be triggered with both SHARED mappings
(separate processes racing on the same address_space) and with PRIVATE
mappings (different threads racing on the same vma).
At present the alloc-instantiate-race testcase only tests the SHARED
mapping case. Since at various times kernel fixes have been proposed
which address only one or the other of the cases, extend the testcase
to check the MAP_PRIVATE in addition to the MAP_SHARED case.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Index: libhugetlbfs/tests/alloc-instantiate-race.c
===================================================================
--- libhugetlbfs.orig/tests/alloc-instantiate-race.c 2006-09-04 17:08:33.000000000 +1000
+++ libhugetlbfs/tests/alloc-instantiate-race.c 2006-10-27 11:31:54.000000000 +1000
@@ -42,13 +42,18 @@
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
+#include <pthread.h>
+#include <linux/unistd.h>
#include <hugetlbfs.h>
#include "hugetests.h"
+_syscall0(pid_t, gettid);
+
static int hpage_size;
static pid_t child1, child2;
+static pthread_t thread1, thread2;
void cleanup(void)
{
@@ -58,9 +63,8 @@ void cleanup(void)
kill(child2, SIGKILL);
}
-
-static void one_racer(void *p, int cpu,
- volatile int *mytrigger, volatile int *othertrigger)
+static int one_racer(void *p, int cpu,
+ volatile int *mytrigger, volatile int *othertrigger)
{
volatile int *pi = p;
cpu_set_t cpuset;
@@ -70,7 +74,7 @@ static void one_racer(void *p, int cpu,
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
- err = sched_setaffinity(getpid(), CPU_SETSIZE/8, &cpuset);
+ err = sched_setaffinity(gettid(), CPU_SETSIZE/8, &cpuset);
if (err != 0)
CONFIG("sched_setaffinity(cpu%d): %s", cpu, strerror(errno));
@@ -83,16 +87,39 @@ static void one_racer(void *p, int cpu,
/* Instantiate! */
*pi = 1;
- exit(0);
+ return 0;
+}
+
+static void proc_racer(void *p, int cpu,
+ volatile int *mytrigger, volatile int *othertrigger)
+{
+ exit(one_racer(p, cpu, mytrigger, othertrigger));
}
-static void run_race(void *syncarea)
+struct racer_info {
+ void *p; /* instantiation address */
+ int cpu;
+ int race_type;
+ volatile int *mytrigger;
+ volatile int *othertrigger;
+ int status;
+};
+
+static void *thread_racer(void *info)
+{
+ struct racer_info *ri = info;
+ int rc;
+
+ rc = one_racer(ri->p, ri->cpu, ri->mytrigger, ri->othertrigger);
+ return ri;
+}
+static void run_race(void *syncarea, int race_type)
{
volatile int *trigger1, *trigger2;
int fd;
void *p;
int status1, status2;
- pid_t ret;
+ int ret;
memset(syncarea, 0, sizeof(*trigger1) + sizeof(*trigger2));
trigger1 = syncarea;
@@ -104,47 +131,90 @@ static void run_race(void *syncarea)
FAIL("hugetlbfs_unlinked_fd()");
verbose_printf("Mapping final page.. ");
- p = mmap(NULL, hpage_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ p = mmap(NULL, hpage_size, PROT_READ|PROT_WRITE, race_type, fd, 0);
if (p == MAP_FAILED)
FAIL("mmap(): %s", strerror(errno));
verbose_printf("%p\n", p);
- child1 = fork();
- if (child1 < 0)
- FAIL("fork(): %s", strerror(errno));
- if (child1 == 0)
- one_racer(p, 0, trigger1, trigger2);
-
- child2 = fork();
- if (child2 < 0)
- FAIL("fork(): %s", strerror(errno));
- if (child2 == 0)
- one_racer(p, 1, trigger2, trigger1);
-
- /* wait() calls */
- ret = waitpid(child1, &status1, 0);
- if (ret < 0)
- FAIL("waitpid() child 1: %s", strerror(errno));
- verbose_printf("Child 1 status: %x\n", status1);
-
-
- ret = waitpid(child2, &status2, 0);
- if (ret < 0)
- FAIL("waitpid() child 2: %s", strerror(errno));
- verbose_printf("Child 2 status: %x\n", status2);
-
- if (WIFSIGNALED(status1))
- FAIL("Child 1 killed by signal %s",
- strsignal(WTERMSIG(status1)));
- if (WIFSIGNALED(status2))
+ if (race_type == MAP_SHARED) {
+ child1 = fork();
+ if (child1 < 0)
+ FAIL("fork(): %s", strerror(errno));
+ if (child1 == 0)
+ proc_racer(p, 0, trigger1, trigger2);
+
+ child2 = fork();
+ if (child2 < 0)
+ FAIL("fork(): %s", strerror(errno));
+ if (child2 == 0)
+ proc_racer(p, 1, trigger2, trigger1);
+
+ /* wait() calls */
+ ret = waitpid(child1, &status1, 0);
+ if (ret < 0)
+ FAIL("waitpid() child 1: %s", strerror(errno));
+ verbose_printf("Child 1 status: %x\n", status1);
+
+
+ ret = waitpid(child2, &status2, 0);
+ if (ret < 0)
+ FAIL("waitpid() child 2: %s", strerror(errno));
+ verbose_printf("Child 2 status: %x\n", status2);
+
+ if (WIFSIGNALED(status1))
+ FAIL("Child 1 killed by signal %s",
+ strsignal(WTERMSIG(status1)));
+ if (WIFSIGNALED(status2))
FAIL("Child 2 killed by signal %s",
strsignal(WTERMSIG(status2)));
- if (WEXITSTATUS(status1) != 0)
- FAIL("Child 1 terminated with code %d", WEXITSTATUS(status1));
+ status1 = WEXITSTATUS(status1);
+ status2 = WEXITSTATUS(status2);
+ } else {
+ struct racer_info ri1 = {
+ .p = p,
+ .cpu = 0,
+ .mytrigger = trigger1,
+ .othertrigger = trigger2,
+ };
+ struct racer_info ri2 = {
+ .p = p,
+ .cpu = 1,
+ .mytrigger = trigger2,
+ .othertrigger = trigger1,
+ };
+ void *tret1, *tret2;
+
+ ret = pthread_create(&thread1, NULL, thread_racer, &ri1);
+ if (ret != 0)
+ FAIL("pthread_create() 1: %s\n", strerror(errno));
+
+ ret = pthread_create(&thread2, NULL, thread_racer, &ri2);
+ if (ret != 0)
+ FAIL("pthread_create() 2: %s\n", strerror(errno));
+
+ ret = pthread_join(thread1, &tret1);
+ if (ret != 0)
+ FAIL("pthread_join() 1: %s\n", strerror(errno));
+ if (tret1 != &ri1)
+ FAIL("Thread 1 returned %p not %p, killed?\n",
+ tret1, &ri1);
+ ret = pthread_join(thread2, &tret2);
+ if (ret != 0)
+ FAIL("pthread_join() 2: %s\n", strerror(errno));
+ if (tret2 != &ri2)
+ FAIL("Thread 2 returned %p not %p, killed?\n",
+ tret2, &ri2);
+
+ status1 = ri1.status;
+ status2 = ri2.status;
+ }
- if (WEXITSTATUS(status2) != 0)
- FAIL("Child 2 terminated with code %d", WEXITSTATUS(status2));
+ if (status1 != 0)
+ FAIL("Racer 1 terminated with code %d", status1);
+
+ if (status2 != 0)
+ FAIL("Racer 2 terminated with code %d", status2);
}
int main(int argc, char *argv[])
@@ -153,14 +223,25 @@ int main(int argc, char *argv[])
int fd;
void *p, *q;
unsigned long i;
+ int race_type;
test_init(argc, argv);
- if (argc != 2)
- CONFIG("Usage: alloc-instantiate-race <# total available hugepages>");
+ if (argc != 3)
+ CONFIG("Usage: alloc-instantiate-race"
+ "<# total available hugepages> <private|shard>");
totpages = atoi(argv[1]);
+ if (strcmp(argv[2], "shared") == 0) {
+ race_type = MAP_SHARED;
+ } else if (strcmp(argv[2], "private") == 0) {
+ race_type = MAP_PRIVATE;
+ } else {
+ CONFIG("Usage: alloc-instantiate-race"
+ "<# total available hugepages> <private|shard>");
+ }
+
hpage_size = gethugepagesize();
if (hpage_size < 0)
CONFIG("No hugepage kernel support");
@@ -189,7 +270,7 @@ int main(int argc, char *argv[])
memset(p + (i * hpage_size), 0, sizeof(int));
verbose_printf("done\n");
- run_race(q);
+ run_race(q, race_type);
PASS();
}
Index: libhugetlbfs/tests/run_tests.sh
===================================================================
--- libhugetlbfs.orig/tests/run_tests.sh 2006-10-27 10:08:23.000000000 +1000
+++ libhugetlbfs/tests/run_tests.sh 2006-10-27 10:45:06.000000000 +1000
@@ -147,7 +147,8 @@ functional_tests () {
# killall -HUP hugetlbd
# to make the sharing daemon give up the files
run_test chunk-overcommit `free_hpages`
- run_test alloc-instantiate-race `free_hpages`
+ run_test alloc-instantiate-race `free_hpages` shared
+ run_test alloc-instantiate-race `free_hpages` private
run_test truncate_reserve_wraparound
run_test truncate_sigbus_versus_oom `free_hpages`
}
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2006-10-27 1:47 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2006-10-26 22:17 Chen, Kenneth W
2006-10-26 22:44 ` Andrew Morton
2006-10-26 23:31 ` 'David Gibson'
2006-10-27 0:04 ` Andrew Morton
2006-10-27 3:11 ` 'David Gibson'
2006-10-27 3:35 ` Andrew Morton
2006-10-27 4:06 ` 'David Gibson'
2006-10-31 2:54 ` Chen, Kenneth W
2006-10-31 3:17 ` 'David Gibson'
2006-10-31 5:15 ` Chen, Kenneth W
2006-10-31 11:05 ` 'David Gibson'
2006-10-31 12:48 ` Hugh Dickins
2006-11-01 6:18 ` Nick Piggin
2006-11-01 10:17 ` Chen, Kenneth W
2006-11-02 3:06 ` Nick Piggin
2006-11-02 2:29 ` 'David Gibson'
2006-10-27 1:47 ` 'David Gibson' [this message]
2006-10-30 20:55 ` Adam Litke
2006-10-26 23:47 ` 'David Gibson'
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20061027014740.GD11733@localhost.localdomain \
--to=david@gibson.dropbear.id.au \
--cc=agl@us.ibm.com \
--cc=akpm@osdl.org \
--cc=bill.irwin@oracle.com \
--cc=christoph@schroedinger.engr.sgi.com \
--cc=hugh@veritas.com \
--cc=kenneth.w.chen@intel.com \
--cc=linux-mm@kvack.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox