+ Linux memory policy supports the following optional mode flags:
+
+ MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
+ the user should not be remapped if the task or VMA's set of allowed
+ nodes changes after the memory policy has been defined.
+
+ Without this flag, anytime a mempolicy is rebound because of a
+ change in the set of allowed nodes, the node (Preferred) or
+ nodemask (Bind, Interleave) is remapped to the new set of
+ allowed nodes. This may result in nodes being used that were
+ previously undesired.
+
+ With this flag, if the user-specified nodes overlap with the
+ nodes allowed by the task's cpuset, then the memory policy is
+ applied to their intersection. If the two sets of nodes do not
+ overlap, the Default policy is used.
+
+ For example, consider a task that is attached to a cpuset with
+ mems 1-3 that sets an Interleave policy over the same set. If
+ the cpuset's mems change to 3-5, the Interleave will now occur
+ over nodes 3, 4, and 5. With this flag, however, since only node
+ 3 is allowed from the user's nodemask, the "interleave" only
+ occurs over that node. If no nodes from the user's nodemask are
+ now allowed, the Default behavior is used.
+
+ MPOL_F_STATIC_NODES cannot be combined with the
+ MPOL_F_RELATIVE_NODES flag. It also cannot be used for
+ MPOL_PREFERRED policies that were created with an empty nodemask
+ (local allocation).
+
+ MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
+ by the user will be mapped relative to the set of the task or VMA's
+ set of allowed nodes. The kernel stores the user-passed nodemask,
+ and if the allowed nodes changes, then that original nodemask will
+ be remapped relative to the new set of allowed nodes.
+
+ Without this flag (and without MPOL_F_STATIC_NODES), anytime a
+ mempolicy is rebound because of a change in the set of allowed
+ nodes, the node (Preferred) or nodemask (Bind, Interleave) is
+ remapped to the new set of allowed nodes. That remap may not
+ preserve the relative nature of the user's passed nodemask to its
+ set of allowed nodes upon successive rebinds: a nodemask of
+ 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
+ allowed nodes is restored to its original state.
+
+ With this flag, the remap is done so that the node numbers from
+ the user's passed nodemask are relative to the set of allowed
+ nodes. In other words, if nodes 0, 2, and 4 are set in the user's
+ nodemask, the policy will be effected over the first (and in the
+ Bind or Interleave case, the third and fifth) nodes in the set of
+ allowed nodes. The nodemask passed by the user represents nodes
+ relative to task or VMA's set of allowed nodes.
+
+ If the user's nodemask includes nodes that are outside the range
+ of the new set of allowed nodes (for example, node 5 is set in
+ the user's nodemask when the set of allowed nodes is only 0-3),
+ then the remap wraps around to the beginning of the nodemask and,
+ if not already set, sets the node in the mempolicy nodemask.
+
+ For example, consider a task that is attached to a cpuset with
+ mems 2-5 that sets an Interleave policy over the same set with
+ MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
+ interleave now occurs over nodes 3,5-6. If the cpuset's mems
+ then change to 0,2-3,5, then the interleave occurs over nodes
+ 0,3,5.
+
+ Thanks to the consistent remapping, applications preparing
+ nodemasks to specify memory policies using this flag should
+ disregard their current, actual cpuset imposed memory placement
+ and prepare the nodemask as if they were always located on
+ memory nodes 0 to N-1, where N is the number of memory nodes the
+ policy is intended to manage. Let the kernel then remap to the
+ set of memory nodes allowed by the task's cpuset, as that may
+ change over time.
+
+ MPOL_F_RELATIVE_NODES cannot be combined with the
+ MPOL_F_STATIC_NODES flag. It also cannot be used for
+ MPOL_PREFERRED policies that were created with an empty nodemask
+ (local allocation).
+
+MEMORY POLICY REFERENCE COUNTING
+
+To resolve use/free races, struct mempolicy contains an atomic reference
+count field. Internal interfaces, mpol_get()/mpol_put() increment and
+decrement this reference count, respectively. mpol_put() will only free
+the structure back to the mempolicy kmem cache when the reference count
+goes to zero.
+
+When a new memory policy is allocated, it's reference count is initialized
+to '1', representing the reference held by the task that is installing the
+new policy. When a pointer to a memory policy structure is stored in another
+structure, another reference is added, as the task's reference will be dropped
+on completion of the policy installation.
+
+During run-time "usage" of the policy, we attempt to minimize atomic operations
+on the reference count, as this can lead to cache lines bouncing between cpus
+and NUMA nodes. "Usage" here means one of the following:
+
+1) querying of the policy, either by the task itself [using the get_mempolicy()
+ API discussed below] or by another task using the /proc/<pid>/numa_maps
+ interface.
+
+2) examination of the policy to determine the policy mode and associated node
+ or node lists, if any, for page allocation. This is considered a "hot
+ path". Note that for MPOL_BIND, the "usage" extends across the entire
+ allocation process, which may sleep during page reclaimation, because the
+ BIND policy nodemask is used, by reference, to filter ineligible nodes.
+
+We can avoid taking an extra reference during the usages listed above as
+follows:
+
+1) we never need to get/free the system default policy as this is never
+ changed nor freed, once the system is up and running.
+
+2) for querying the policy, we do not need to take an extra reference on the
+ target task's task policy nor vma policies because we always acquire the
+ task's mm's mmap_sem for read during the query. The set_mempolicy() and
+ mbind() APIs [see below] always acquire the mmap_sem for write when
+ installing or replacing task or vma policies. Thus, there is no possibility
+ of a task or thread freeing a policy while another task or thread is
+ querying it.
+
+3) Page allocation usage of task or vma policy occurs in the fault path where
+ we hold them mmap_sem for read. Again, because replacing the task or vma
+ policy requires that the mmap_sem be held for write, the policy can't be
+ freed out from under us while we're using it for page allocation.
+
+4) Shared policies require special consideration. One task can replace a
+ shared memory policy while another task, with a distinct mmap_sem, is
+ querying or allocating a page based on the policy. To resolve this
+ potential race, the shared policy infrastructure adds an extra reference
+ to the shared policy during lookup while holding a spin lock on the shared
+ policy management structure. This requires that we drop this extra
+ reference when we're finished "using" the policy. We must drop the
+ extra reference on shared policies in the same query/allocation paths
+ used for non-shared policies. For this reason, shared policies are marked
+ as such, and the extra reference is dropped "conditionally"--i.e., only
+ for shared policies.
+
+ Because of this extra reference counting, and because we must lookup
+ shared policies in a tree structure under spinlock, shared policies are
+ more expensive to use in the page allocation path. This is especially
+ true for shared policies on shared memory regions shared by tasks running
+ on different NUMA nodes. This extra overhead can be avoided by always
+ falling back to task or system default policy for shared memory regions,
+ or by prefaulting the entire shared memory region into memory and locking
+ it down. However, this might not be appropriate for all applications.
+