mem in linux

 df /dev/shm

Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs            1886116  1544   1884572   1% /dev/shm

The default size of this tmpfs mount is half of the physical RAM.

ramdisk in fstab

none /ramdisk tmpfs defaults,size=256m 0 0

VM = virtual memory: the addressing scheme the Linux kernel uses to handle memory.

The physical memory is not necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even different implementations of the same architecture have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex and to avoid this complexity a concept of virtual memory was developed.

Virtual memory abstracts the details of physical memory from the application software, allows keeping only the needed information in physical memory (demand paging), and provides a mechanism for the protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual address. When the CPU decodes an instruction that reads (or writes) from (or to) the system memory, it translates the virtual address encoded in that instruction to a physical address that the memory controller can understand.

The physical system memory is divided into page frames, or pages. The size of each page is architecture specific. Some architectures allow selection of the page size from several supported values; this selection is performed at the kernel build time by setting an appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual pages. These mappings are described by page tables that allow translation from a virtual address used by programs to the physical memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical addresses of actual pages used by the software. The tables at higher levels contain physical addresses of the pages belonging to the lower levels. The pointer to the top level page table resides in a register. When the CPU performs the address translation, it uses this register to access the top level page table. The high bits of the virtual address are used to index an entry in the top level page table. That entry is then used to access the next level in the hierarchy with the next bits of the virtual address as the index to that level page table. The lowest bits in the virtual address define the offset inside the actual page.
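The walk described above can be sketched in a few lines. This is only an illustration, assuming the common x86-64 layout with 4 KiB pages and four table levels of 9 index bits each; other architectures and 5-level paging use different widths, and the function name is made up for this note.

```python
# Sketch: split a virtual address into page table indices and a page offset.
# Assumes x86-64 4-level paging with 4 KiB pages (9 bits per level, 12-bit offset).
LEVELS = 4
LEVEL_BITS = 9
OFFSET_BITS = 12


def split_virtual_address(vaddr):
    """Return ([top-level index, ..., lowest-level index], offset in page)."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    indices = []
    # The highest bits index the top level table, the next bits the next level.
    for level in range(LEVELS - 1, -1, -1):
        shift = OFFSET_BITS + level * LEVEL_BITS
        indices.append((vaddr >> shift) & ((1 << LEVEL_BITS) - 1))
    return indices, offset


# An address built so that each level index is easy to recognise.
example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_virtual_address(example))  # ([1, 2, 3, 4], 5)
```

The real translation is done by the MMU in hardware; the code only shows which bits select which table entry.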

Huge pages:

The address translation requires several memory accesses, and memory accesses are slow relative to CPU speed. To avoid spending precious processor cycles on address translation, CPUs maintain a cache of such translations called the Translation Lookaside Buffer (or TLB). The TLB is usually a scarce resource, and applications with a large memory working set will take a performance hit because of TLB misses.

Many modern CPU architectures allow mapping of the memory pages directly by the higher levels in the page table. For instance, on x86, it is possible to map 2M and even 1G pages using entries in the second and the third level page tables. In Linux such pages are called huge. Usage of huge pages significantly reduces pressure on TLB, improves TLB hit-rate and thus improves overall system performance.
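A bit of arithmetic shows why huge pages relieve the TLB. The sketch below counts how many pages are needed to map a working set and how much memory a TLB can cover; the 64-entry TLB is a made-up figure for illustration, not a property of any particular CPU.

```python
# Sketch: pages needed for a working set and TLB "reach" per page size.
PAGE_SIZES = {
    "4K": 4 * 1024,
    "2M": 2 * 1024**2,
    "1G": 1024**3,
}


def pages_needed(working_set_bytes, page_size):
    """Number of pages required to map the working set (rounded up)."""
    return -(-working_set_bytes // page_size)


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


working_set = 8 * 1024**3  # 8 GiB
for name, size in PAGE_SIZES.items():
    # Hypothetical TLB with 64 entries.
    print(name, pages_needed(working_set, size), tlb_reach(64, size))
```

For the 8 GiB working set this gives 2097152 pages at 4K, 4096 at 2M and 8 at 1G, so far fewer translations compete for the same TLB entries.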

There are two mechanisms in Linux that enable mapping of physical memory with huge pages. The first one is the HugeTLB filesystem, or hugetlbfs. It is a pseudo filesystem that uses RAM as its backing store. For the files created in this filesystem, the data resides in memory and is mapped using huge pages. The hugetlbfs is described at HugeTLB Pages.

Another, more recent, mechanism that enables use of the huge pages is called Transparent HugePages, or THP. Unlike the hugetlbfs that requires users and/or system administrators to configure what parts of the system memory should and can be mapped by the huge pages, THP manages such mappings transparently to the user and hence the name. See Transparent Hugepage Support for more details about THP.

Zones

Often hardware poses restrictions on how different physical memory ranges can be accessed. In some cases, devices cannot perform DMA to all the addressable memory. In other cases, the size of the physical memory exceeds the maximal addressable size of virtual memory and special actions are required to access portions of the memory. Linux groups memory pages into zones according to their possible usage. For example, ZONE_DMA will contain memory that can be used by devices for DMA, ZONE_HIGHMEM will contain memory that is not permanently mapped into kernel’s address space and ZONE_NORMAL will contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all architectures define all zones, and requirements for DMA are different for different platforms.

Numa

Many multi-processor machines are NUMA - Non-Uniform Memory Access - systems. In such systems the memory is arranged into banks that have different access latency depending on the “distance” from the processor. Each bank is referred to as a node and for each node Linux constructs an independent memory management subsystem. A node has its own set of zones, lists of free and used pages and various statistics counters. You can find more details about NUMA in What is NUMA? and in NUMA Memory Policy.

page cache

The physical memory is volatile and the common case for getting data into the memory is to read it from files. Whenever a file is read, the data is put into the page cache to avoid expensive disk access on the subsequent reads. Similarly, when one writes to a file, the data is placed in the page cache and eventually gets into the backing storage device. The written pages are marked as dirty and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data.

Anonymous memory

Anonymous memory, or anonymous mappings, represent memory that is not backed by a filesystem. Such mappings are implicitly created for a program's stack and heap or by explicit calls to the mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. Read accesses will result in the creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.

Reclaim

Throughout the system lifetime, a physical page can be used for storing different types of data. It can be kernel internal data structures, DMA’able buffers for device drivers use, data read from a filesystem, memory allocated by user space processes etc.

Depending on the page usage it is treated differently by the Linux memory management. The pages that can be freed at any time, either because they cache the data available elsewhere, for instance, on a hard disk, or because they can be swapped out, again, to the hard disk, are called reclaimable. The most notable categories of the reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA buffers cannot be repurposed, and they remain pinned until freed by their user. Such pages are called unreclaimable. However, in certain circumstances, even pages occupied with kernel data structures can be reclaimed. For instance, in-memory caches of filesystem metadata can be re-read from the storage device and therefore it is possible to discard them from main memory when the system is under memory pressure.

The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (low watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
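The two thresholds can be pictured with a toy classifier. This is a deliberately simplified sketch: the real kernel keeps watermarks per zone, also has a high watermark at which kswapd goes back to sleep, and takes many more factors into account. The function name and return strings are invented for this note.

```python
# Sketch: which reclaim path an allocation would take, given free pages
# and the min/low watermarks described above (simplified, single zone).
def reclaim_action(free_pages, min_wmark, low_wmark):
    if free_pages <= min_wmark:
        # Allocation stalls until enough pages are reclaimed.
        return "direct reclaim"
    if free_pages <= low_wmark:
        # Allocation proceeds; kswapd reclaims in the background.
        return "wake kswapd"
    return "allocate from free pages"


for free in (5000, 900, 300):
    print(free, reclaim_action(free, min_wmark=400, low_wmark=1000))
```

With these made-up watermarks, 5000 free pages means a plain allocation, 900 wakes kswapd and 300 forces direct reclaim.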

Compaction

As the system runs, tasks allocate and free memory and it becomes fragmented. Although with virtual memory it is possible to present scattered physical pages as a virtually contiguous range, sometimes it is necessary to allocate large physically contiguous memory areas. Such a need may arise, for instance, when a device driver requires a large buffer for DMA, or when THP allocates a huge page. Memory compaction addresses the fragmentation issue. This mechanism moves occupied pages from the lower part of a memory zone to free pages in the upper part of the zone. When a compaction scan is finished, free pages are grouped together at the beginning of the zone and allocations of large physically contiguous areas become possible.

Like reclaim, the compaction may happen asynchronously in the kcompactd daemon or synchronously as a result of a memory allocation request.
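The compaction scan can be imitated on a toy zone. The sketch below models a zone as a list of booleans (True = occupied) and uses two scanners, one walking up from the bottom looking for occupied pages and one walking down from the top looking for free pages. It is only a model of the idea; the kernel works on page blocks, skips unmovable pages and can stop early.

```python
# Sketch: toy memory compaction on a list of pages (True = occupied).
def compact(zone):
    """Return a compacted copy: occupied pages moved to the upper part."""
    pages = list(zone)
    migrate = 0               # scans upward for occupied pages
    free = len(pages) - 1     # scans downward for free pages
    while True:
        while migrate < free and not pages[migrate]:
            migrate += 1
        while free > migrate and pages[free]:
            free -= 1
        if migrate >= free:   # scanners met: the scan is finished
            break
        # "Move" the occupied page into the free slot near the top.
        pages[migrate], pages[free] = False, True
    return pages


zone = [True, False, True, False, False, True, False, False]
print(compact(zone))
# [False, False, False, False, False, True, True, True]
```

Before the scan the largest free run in this zone is two pages; afterwards five contiguous free pages sit at the beginning of the zone.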

OOM killer

It is possible that on a loaded machine memory will be exhausted and the kernel will be unable to reclaim enough memory to continue to operate. In order to save the rest of the system, it invokes the OOM killer.

The OOM killer selects a task to sacrifice for the sake of the overall system health. The selected task is killed in a hope that after it exits enough memory will be freed to continue normal operation.

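process page table

/proc/pid/pagemap

This file lets a userspace process find out which physical frame each virtual page is mapped to. It contains one 64-bit value for each virtual page, holding the following data (from fs/proc/task_mmu.c, above pagemap_read):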

Bits 0-54 page frame number (PFN) if present
Bits 0-4 swap type if swapped
Bits 5-54 swap offset if swapped
Bit 55 pte is soft-dirty (see Soft-Dirty PTEs)
Bit 56 page exclusively mapped (since 4.2)
Bit 57 pte is uffd-wp write-protected (since 5.13) (see Userfaultfd)
Bits 58-60 zero
Bit 61 page is file-page or shared-anon (since 3.5)
Bit 62 page swapped
Bit 63 page present

Since Linux 4.0, only users with the CAP_SYS_ADMIN capability can get PFNs; from 4.2 on, the PFN field reads as zero for everyone else.
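The bit layout above translates directly into a small decoder. This is a sketch, not a tool: the function and key names are made up for this note, and read_pagemap_entry assumes a Linux system where the caller is allowed to open the target's pagemap file.

```python
import os
import struct


def decode_pagemap_entry(entry):
    """Decode one 64-bit /proc/pid/pagemap value using the bit layout above."""
    info = {
        "present": bool((entry >> 63) & 1),
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared_anon": bool((entry >> 61) & 1),
        "uffd_wp": bool((entry >> 57) & 1),
        "exclusive": bool((entry >> 56) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
    }
    if info["present"]:
        info["pfn"] = entry & ((1 << 55) - 1)          # bits 0-54
    elif info["swapped"]:
        info["swap_type"] = entry & 0x1F                # bits 0-4
        info["swap_offset"] = (entry >> 5) & ((1 << 50) - 1)  # bits 5-54
    return info


def read_pagemap_entry(pid, vaddr):
    """Read the entry for the page containing vaddr (one 8-byte value per page)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)
        data = f.read(8)
    return struct.unpack("Q", data)[0]  # native byte order


# Example with a hand-made value: present, soft-dirty, PFN 0x1234.
print(decode_pagemap_entry((1 << 63) | (1 << 55) | 0x1234))
```

Without CAP_SYS_ADMIN the flags are still readable, but the pfn value comes back as 0.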

memory use in linux for:

kernel buffers for things like network stacks, SCSI queues, etc
applications
disk/file cache

Kernel buffers must always stay in RAM.

Applications and cache don’t need to stay in RAM:
the cache can be dropped, and the applications can be paged out to the 
swap file. Dropping cache means a potential performance hit. Likewise 
with paging applications out.

The vm.swappiness parameter helps the kernel decide which of the two to do. 
A high value (the maximum is 100, or 200 since kernel 5.8) makes the kernel 
swap very aggressively. With 0 the kernel will only swap to protect 
against an out-of-memory condition.

Swapping is bad in a virtual environment, at any level

 sudo sysctl -w vm.swappiness=0

To make it persistent, append “vm.swappiness = 0” to /etc/sysctl.conf.

The default is 60:

/sbin/sysctl  vm.swappiness 
vm.swappiness = 60
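For scripts it can be handy to check such settings without shelling out to sysctl. The sketch below parses sysctl.conf-style text and maps a sysctl name to its file under /proc/sys. The helper names are invented for this note, and the simple dot-to-slash mapping does not handle names whose components contain dots themselves (some network interface names do).

```python
from pathlib import Path


def sysctl_path(name):
    """Map a sysctl name such as vm.swappiness to its /proc/sys file."""
    return Path("/proc/sys") / name.replace(".", "/")


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def read_sysctl(name):
    """Read the live value (Linux only)."""
    return sysctl_path(name).read_text().strip()


conf = "# keep swapping to a minimum\nvm.swappiness = 0\n"
print(parse_sysctl_conf(conf))        # {'vm.swappiness': '0'}
print(sysctl_path("vm.swappiness"))   # /proc/sys/vm/swappiness
```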

The Linux kernel stages disk writes into cache, and over time 
asynchronously flushes them to disk. This has a nice effect of speeding 
disk I/O but it is risky. When data isn’t written to disk there is an 
increased chance of losing it.

There is also the chance that a lot of I/O will overwhelm the cache. 
Ever written a lot of data to disk all at once, and seen large 
pauses on the system while it tries to deal with all that data? Those 
pauses happen when the kernel decides that there is too much dirty data to 
be written asynchronously (as a non-blocking background operation, 
letting the application process continue) and switches to writing 
synchronously (blocking and making the process wait until the I/O is 
committed to disk). Of course, a filesystem also has to preserve write 
order, so when it starts writing synchronously it first has to destage 
the cache. Hence the long pause.

vm.dirty_background_ratio

Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which the background kernel flusher threads will start writing out dirty data.

In other words, this is the percentage of memory that can be filled with “dirty” pages (memory pages that still need to be written to disk) before the pdflush/flush/kdmflush background processes kick in to write them to disk.

The total available memory is not equal to total system memory.

vm.dirty_background_bytes

Contains the amount of dirty memory at which the background kernel flusher threads will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.

vm.dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained.

vm.dirty_ratio

Contains, as a percentage of total available memory that contains free pages and reclaimable pages, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.

In other words, this is the absolute maximum amount of memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point, all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but it is a safeguard against too much data being cached unsafely in memory.

If you set the _bytes version the _ratio version will become 0, and vice-versa.
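How the two ratios turn into page counts can be sketched as below. It is a simplification under stated assumptions: available_pages stands for the "free plus reclaimable" figure mentioned above, and the clamp that halves the background threshold when it is not below the hard limit reflects my understanding of the kernel's behaviour rather than something stated in this note. The real calculation also accounts for the _bytes variants and per-device limits.

```python
# Sketch: dirty page thresholds from the two ratio sysctls.
def dirty_thresholds(available_pages, background_ratio, dirty_ratio):
    """Return (background_threshold, dirty_threshold) in pages."""
    thresh = available_pages * dirty_ratio // 100
    background = available_pages * background_ratio // 100
    # Assumption: a background limit at or above the hard limit is
    # treated as half of the hard limit.
    if background >= thresh:
        background = thresh // 2
    return background, thresh


available = 1_000_000  # hypothetical number of free + reclaimable pages
print(dirty_thresholds(available, 5, 10))   # (50000, 100000)
print(dirty_thresholds(available, 50, 80))  # (500000, 800000)
```

The first value is where background writeback starts; the second is where a writing process is made to do writeback itself.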

vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. In this case it’s 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in hundredths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

 /sbin/sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 3000

 grep -E "dirty|writeback" /proc/vmstat
nr_dirty 19
nr_writeback 0
nr_writeback_temp 0
nr_dirty_threshold 90853
nr_dirty_background_threshold 45371

 I have 19 dirty pages waiting to be written to disk.
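The counters are in pages, so a quick conversion makes them easier to read. The sketch below parses the output shown above and converts page counts to MiB, assuming 4 KiB pages (check with getconf PAGESIZE); the function names are made up for this note.

```python
# Sketch: parse /proc/vmstat-style text and convert page counts to MiB.
def parse_vmstat(text):
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def pages_to_mib(pages, page_size=4096):
    # page_size is an assumption; 4096 is typical on x86.
    return pages * page_size / 2**20


sample = """nr_dirty 19
nr_writeback 0
nr_writeback_temp 0
nr_dirty_threshold 90853
nr_dirty_background_threshold 45371
"""
stats = parse_vmstat(sample)
for name in ("nr_dirty", "nr_dirty_background_threshold", "nr_dirty_threshold"):
    print(f"{name}: {pages_to_mib(stats[name]):.2f} MiB")
```

For the sample, the hard threshold of 90853 pages is roughly 355 MiB and the 19 dirty pages are about 76 KiB.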

case 1

For critical data in storage, keep less dirty data in memory. Add these lines to /etc/sysctl.conf and reload with “sysctl -p”:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

This is a good fit on virtual machines, as well as Linux-based hypervisors.

case 2

Raising the limits can help where the data contained on a Linux guest isn’t 
critical and can be lost, and usually where an application is writing to 
the same files repeatedly or in repeatable bursts. In theory, by allowing 
more dirty pages to exist in memory you’ll rewrite the same blocks over and 
over in cache, and just need to do one write every so often to the actual disk.

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

case 3

Infrequent, bursty traffic to a slow disk (batch jobs at the top of the 
hour, midnight, writing to an SD card on a Raspberry Pi, etc.). In that 
case an approach might be to allow all that write I/O to be deposited in 
the cache so that the background flush operations can deal with it 
asynchronously over time:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 80

https://www.kernel.org/doc/html/latest/admin-guide/sysctl/vm.html

https://www.kernel.org/doc/html/latest/admin-guide/mm/index.html

https://man7.org/linux/man-pages/man5/proc.5.html

vfs

https://docs.kernel.org/filesystems/vfs.html

VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so on
are called from a process context.  Filesystem locking is described in
the document Locking.

Decreasing the virtual file system (VFS) cache parameter value may improve system responsiveness:

vm.vfs_cache_pressure = 50

The value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects (VFS cache). Lowering it from the default value of 100 makes the kernel less inclined to reclaim VFS cache (do not set it to 0, this may produce out-of-memory conditions).

 https://sysctl-explorer.net/

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security_guide/sect-security_guide-server_security-disable-source-routing

/etc/udev/rules.d/60-ioschedulers.rules

# NVMe SSD

ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
