Friday, March 4, 2011

Learning the Linux Boot Memory Allocator (MIPS)

Do you ever look at the output of 'cat /proc/meminfo' and wonder why the reported MemTotal does not equal the amount of memory in the system? The following is an example report from one of my systems. Line 2 shows that the total memory is 60860 kB (59.4 MB):

1 # cat /proc/meminfo
2 MemTotal:          60860 kB
3 MemFree:           47448 kB
4 Buffers:               0 kB
5 Cached:             7180 kB
6 SwapCached:            0 kB
7 ... 

This differs from the 64 MB that the bootloader handed to the kernel:
1 ...
2 ## Transferring control to Linux (at address 800042f0) ...
3 ## Giving linux memsize in MB, 64 


According to line 4 of the kernel log, the total physical memory is 64 MB (0x4000000). The available memory (line 20) is 59176k because 6360k is reserved: available = total - reserved (59176k = 65536k - 6360k).
 1 # dmesg
 2 ...
 3 Determined physical RAM map:
 4  memory: 04000000 @ 00000000 (usable)
 5 Initrd not found or empty - disabling initrd
 6 Zone PFN ranges:
 7   Normal   0x00000000 -> 0x00004000
 8 Movable zone start PFN for each node
 9 early_node_map[1] active PFN ranges
10     0: 0x00000000 -> 0x00004000
11 Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16256
12 Kernel command line: console=ttyS1,57600n8 root=/dev/ram0 rootfstype=squashfs
13 PID hash table entries: 256 (order: -2, 1024 bytes)
14 Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
15 Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
16 Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
17 Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
18 Writing ErrCtl register=00000000
19 Readback ErrCtl register=00000000
20 Memory: 59176k/65536k available (2905k kernel code, 6360k reserved, 980k data, 1684k init, 0k highmem)
21 ...
22 Freeing unused kernel memory: 1684k freed
23 ... 

However, you may notice that MemTotal and the available memory are not equal. This is because the kernel can free memory that is needed only for initialization during booting, as long as that memory is not going to be used again. Developers mark such code and data with a specific attribute (such as __init or __initdata), and the linker places them in the .init sections. The kernel releases this memory by invoking free_initmem() when the initial bootup is done.

Thus, MemTotal = total - reserved + .init
In the example, 60860k = 65536k - 6360k + 1684k
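
The two formulas can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration, and the values are copied from the kernel log above.

```python
# Values in kB, copied from the "Memory:" and
# "Freeing unused kernel memory:" lines of the kernel log.
total_kb = 65536      # 64 MB handed over by the bootloader
reserved_kb = 6360    # "6360k reserved"
init_kb = 1684        # "1684k init", freed after bootup

# available = total - reserved
available_kb = total_kb - reserved_kb

# MemTotal = total - reserved + .init
mem_total_kb = available_kb + init_kb

print(f"available = {available_kb} kB")
print(f"MemTotal  = {mem_total_kb} kB ({mem_total_kb / 1024:.1f} MB)")
```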

The question is: what does the kernel reserve that memory for? Most of the reserved memory holds the kernel image. Take a look at the section information read from vmlinux below. Sections 0 through 13 are required, including the well-known text, data, and bss sections. Section 13 (.bss) ends at 0x80571000 + 0x36b90 = 0x805a7b90, and because MIPS kseg0 address 0x80000000 maps to physical address 0, the image occupies the first 0x5a7b90 bytes of memory. The minimum unit of reserved memory is 1 page (4K bytes by default). Hence 0x5a7b90 is rounded up to a 4K boundary, which gives 0x5a8000.

 1 $ readelf -S vmlinux
 2 There are 19 section headers, starting at offset 0x576694:
 3 
 4 Section Headers:
 5   [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
 6   [ 0]                   NULL            00000000 000000 000000 00      0   0  0
 7   [ 1] .text             PROGBITS        80000000 002000 2d66f8 00  AX  0   0 2048
 8   [ 2] __ex_table        PROGBITS        802d6700 2d8700 001b20 00   A  0   0  4
 9   [ 3] .rodata           PROGBITS        802d9000 2db000 0b09fc 00   A  0   0 32
10   [ 4] __ksymtab         PROGBITS        803899fc 38b9fc 004d20 00   A  0   0  4
11   [ 5] __ksymtab_gpl     PROGBITS        8038e71c 39071c 0022c8 00   A  0   0  4
12   [ 6] __ksymtab_strings PROGBITS        803909e4 3929e4 00f55e 00   A  0   0  1
13   [ 7] __param           PROGBITS        8039ff44 3a1f44 0010bc 00   A  0   0  4
14   [ 8] .data             PROGBITS        803a2000 3a4000 0299c0 00  WA  0   0 8192
15   [ 9] .data..shared_ali PROGBITS        803cb9c0 3cd9c0 000080 00  WA  0   0 32
16   [10] .init.text        PROGBITS        803cc000 3ce000 024270 00  AX  0   0  4
17   [11] .init.data        PROGBITS        803f0270 3f2270 17f6f2 00  WA  0   0  8
18   [12] .exit.text        PROGBITS        8056f964 571964 0015d8 00  AX  0   0  4
19   [13] .bss              NOBITS          80571000 572f3c 036b90 00  WA  0   0 4096
20   [14] .mdebug.abi32     PROGBITS        805a7b90 572f3c 000000 00      0   0  1
21   [15] .comment          PROGBITS        00000000 572f3c 0036a2 00      0   0  1
22   [16] .shstrtab         STRTAB          00000000 5765de 0000b3 00      0   0  1
23   [17] .symtab           SYMTAB          00000000 57698c 063290 10     18 18308  4
24   [18] .strtab           STRTAB          00000000 5d9c1c 077ce3 00      0   0  1
25 Key to Flags:
26   W (write), A (alloc), X (execute), M (merge), S (strings)
27   I (info), L (link order), G (group), x (unknown)
28   O (extra OS processing required) o (OS specific), p (processor specific) 
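
The end of the kernel image and the first free page can be derived from the .bss line of the listing. The sketch below assumes the usual MIPS32 kseg0 mapping, where virtual address 0x80000000 corresponds to physical address 0; the constant and variable names are chosen for illustration only.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT          # 4K pages
KSEG0 = 0x80000000                   # assumption: kseg0 base, maps to physical 0

# Addr and Size of section [13] .bss from the readelf output.
bss_addr = 0x80571000
bss_size = 0x036b90

end_virt = bss_addr + bss_size       # should match the _end symbol
end_phys = end_virt - KSEG0          # physical address of the image end

# Round up to the next page boundary, then convert to a page frame number.
end_aligned = (end_phys + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
first_free_pfn = end_aligned >> PAGE_SHIFT

print(f"_end       = {end_virt:#x}")
print(f"aligned    = {end_aligned:#x}")
print(f"first PFN  = {first_free_pfn:#x}")
print(f"image size = {first_free_pfn * PAGE_SIZE // 1024} kB")
```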


To dig deeper, let's turn bootmem debugging on by adding "bootmem_debug=1" to the kernel command line. We focus on MIPS in this example. The implementation of the bootmem allocator on other architectures may be slightly different, but the basic idea is the same. The bootmem allocator uses a bitmap to record which pages of memory are in use. It lets us allocate and free memory by simply setting or clearing the corresponding bits before the zone allocator is ready.

 1 # dmesg
 2 ...
 3 Determined physical RAM map:
 4  memory: 04000000 @ 00000000 (usable)
 5 bootmem::init_bootmem_core nid=0 start=0 map=5a8 end=4000 mapsize=800
 6 bootmem::mark_bootmem_node nid=0 start=5a8 end=4000 reserve=0 flags=0
 7 bootmem::__free nid=0 start=5a8 end=4000
 8 bootmem::mark_bootmem_node nid=0 start=5a8 end=5a9 reserve=1 flags=0
 9 bootmem::__reserve nid=0 start=5a8 end=5a9 flags=0
10 Initrd not found or empty - disabling initrd
11 Zone PFN ranges:
12   Normal   0x00000000 -> 0x00004000
13 Movable zone start PFN for each node
14 early_node_map[1] active PFN ranges
15     0: 0x00000000 -> 0x00004000
16 bootmem::alloc_bootmem_core nid=0 size=80000 [128 pages] align=20 goal=1000000 limit=0
17 bootmem::__reserve nid=0 start=1000 end=1080 flags=1
18 bootmem::alloc_bootmem_core nid=0 size=8 [1 pages] align=20 goal=1000000 limit=0
19 bootmem::__reserve nid=0 start=1080 end=1081 flags=1
20 bootmem::alloc_bootmem_core nid=0 size=200 [1 pages] align=20 goal=1000000 limit=0
21 bootmem::__reserve nid=0 start=1081 end=1081 flags=1
22 bootmem::alloc_bootmem_core nid=0 size=1c [1 pages] align=20 goal=1000000 limit=0
23 bootmem::__reserve nid=0 start=1081 end=1081 flags=1
24 bootmem::alloc_bootmem_core nid=0 size=49 [1 pages] align=20 goal=1000000 limit=0
25 bootmem::__reserve nid=0 start=1081 end=1081 flags=1
26 bootmem::alloc_bootmem_core nid=0 size=49 [1 pages] align=20 goal=1000000 limit=0
27 bootmem::__reserve nid=0 start=1081 end=1081 flags=1
28 Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 16256
29 Kernel command line: console=ttyS1,57600n8 root=/dev/ram0 rootfstype=squashfs bootmem_debug=1
30 bootmem::alloc_bootmem_core nid=0 size=400 [1 pages] align=20 goal=1000000 limit=0
31 bootmem::__reserve nid=0 start=1081 end=1081 flags=1
32 PID hash table entries: 256 (order: -2, 1024 bytes)
33 bootmem::alloc_bootmem_core nid=0 size=8000 [8 pages] align=20 goal=1000000 limit=0
34 bootmem::__reserve nid=0 start=1081 end=1089 flags=1
35 Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
36 bootmem::alloc_bootmem_core nid=0 size=4000 [4 pages] align=20 goal=1000000 limit=0
37 bootmem::__reserve nid=0 start=1089 end=108d flags=1
38 Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
39 Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
40 Primary data cache 32kB, 4-way, PIPT, no aliases, linesize 32 bytes
41 Writing ErrCtl register=00000000
42 Readback ErrCtl register=00000000
43 bootmem::free_all_bootmem_core nid=0 start=0 end=4000 aligned=1
44 bootmem::free_all_bootmem_core nid=0 released=39cb
45 Memory: 59176k/65536k available (2905k kernel code, 6360k reserved, 980k data, 1684k init, 0k highmem)
46 ...
47 Freeing unused kernel memory: 1684k freed
48 ...  

bootmem_init() initialises the bootmem allocator and sets up initrd-related data if needed. It calls init_bootmem_node(), which calls init_bootmem_core() and marks all pages as reserved initially. In bootmem_init(), the variable mapstart comes from the symbol _end. It is the first free PFN (page frame number) and gives the location of the bootmem bitmap.

The _end symbol from the symbol table:
 1 $ tail System.map
 2 805a7ab0 b sit_net_id
 3 805a7ac0 b ip6_tnl_net_id
 4 805a7ad0 B br_should_route_hook
 5 805a7ae0 b br_fdb_cache
 6 805a7ae4 b fdb_salt
 7 805a7af0 b br_mac_zero_aligned
 8 805a7b00 B vlan_net_id
 9 805a7b04 b vlan_group_hash
10 805a7b90 A __bss_stop
11 805a7b90 A _end 

Line 5 of the kernel log tells us that it initialises node ID 0 (this is a UMA system, so only 1 node exists), that the PFN range is 0 to 0x3fff (end=4000 is exclusive), and that the bootmem bitmap is located at PFN 0x5a8 and is 0x800 bytes in size. 64MB of memory needs 64M/4K = 16K = 0x4000 pages, and 16K pages need a bootmem bitmap of 16K/8 = 2K = 0x800 bytes.
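
The numbers in that log line follow directly from the memory size. Here is a small sketch of the calculation, assuming 4K pages and one bitmap bit per page; the variable names are illustrative.

```python
PAGE_SIZE = 4096

mem_bytes = 0x4000000                       # 64 MB from the RAM map
pages = mem_bytes // PAGE_SIZE              # one page frame per 4K
last_pfn = pages - 1                        # the "end" in the log is exclusive

bitmap_bytes = (pages + 7) // 8             # one bit per page
bitmap_pages = (bitmap_bytes + PAGE_SIZE - 1) // PAGE_SIZE

print(f"pages        = {pages:#x}")
print(f"last PFN     = {last_pfn:#x}")
print(f"bitmap bytes = {bitmap_bytes:#x}")
print(f"bitmap pages = {bitmap_pages}")
```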

Line 6~7: bootmem_init() then calls free_bootmem(), which calls mark_bootmem(). It marks the pages after the kernel image as usable. The range is PFN 0x5a8 to 0x3fff.

Line 8~9: bootmem_init() then calls reserve_bootmem(), which calls mark_bootmem(). It reserves memory for the bootmem bitmap. The bitmap is 0x800 bytes, less than 1 page, so PFN 0x5a8 alone is enough.
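
To make the bitmap idea concrete, here is a toy model of the first three steps in the log (reserve everything, free the pages after the kernel image, reserve the bitmap page again). It is a sketch, not the kernel's code: the class ToyBootmem and its methods are hypothetical names, and it only assumes what the log shows, namely one bit per page where a set bit means reserved.

```python
class ToyBootmem:
    """Hypothetical toy model: one bit per page, 1 = reserved, 0 = free."""

    def __init__(self, pages):
        self.pages = pages
        # Every page starts out reserved, as after init_bootmem_core().
        self.bitmap = bytearray([0xFF] * ((pages + 7) // 8))

    def mark(self, start, end, reserve):
        """Set (reserve) or clear (free) the bits for PFNs in [start, end)."""
        for pfn in range(start, end):
            mask = 1 << (pfn % 8)
            if reserve:
                self.bitmap[pfn // 8] |= mask
            else:
                self.bitmap[pfn // 8] &= ~mask & 0xFF

    def is_reserved(self, pfn):
        return bool(self.bitmap[pfn // 8] & (1 << (pfn % 8)))

    def free_count(self):
        return sum(not self.is_reserved(p) for p in range(self.pages))


bm = ToyBootmem(0x4000)                 # log line 5: end=4000, mapsize=800
bm.mark(0x5a8, 0x4000, reserve=False)   # log lines 6~7: free after the image
bm.mark(0x5a8, 0x5a9, reserve=True)     # log lines 8~9: keep the bitmap page

print(f"bitmap size = {len(bm.bitmap):#x} bytes")
print(f"free pages  = {bm.free_count():#x}")
```

At this point the only free pages are PFN 0x5a9 and above. The kernel image and the bitmap page stay reserved.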

Line 11~15: The kernel initialises the paging system, calls free_area_init_nodes() to initialise all pg_data_t and zone data, and prints information about all zones, movable zones, and the early node map.

Line 16~17: free_area_init_nodes() calls free_area_init_node() for each node. It calculates the total pages of the node and calls alloc_node_mem_map() to allocate memory for the memory map. One page structure is 32 (0x20) bytes, and we need 0x4000 of them in total, so the allocation size is 0x20 * 0x4000 = 0x80000 bytes, which is 128 pages (0x80000 / 0x1000). The goal 0x1000000 means that the memory is allocated in the normal zone, skipping the first 16MB of DMA addresses. Because address 0x1000000 is PFN 0x1000, alloc_bootmem_core() allocates pages starting from PFN 0x1000.
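
The size of the memory map and the starting PFN can be verified as follows. This is a sketch of the arithmetic only; it takes the 32-byte page structure size from the text as given, and the variable names are illustrative.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT

struct_page_size = 32        # bytes per page structure on this system (0x20)
nr_pages = 0x4000            # pages in the node

mem_map_bytes = struct_page_size * nr_pages
mem_map_pages = mem_map_bytes // PAGE_SIZE

goal = 0x1000000             # goal from log line 16
goal_pfn = goal >> PAGE_SHIFT

print(f"mem_map size = {mem_map_bytes:#x} bytes = {mem_map_pages} pages")
print(f"goal         = {goal // (1024 * 1024)} MB -> PFN {goal_pfn:#x}")
print(f"reserved PFN = {goal_pfn:#x} .. {goal_pfn + mem_map_pages:#x} (exclusive)")
```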

Line 18~21: free_area_init_node() calls free_area_init_core() to build the memory map and to initialise the free lists and buddy bitmaps of every zone in the node. It calls setup_usemap() to allocate bootmem for pageblock_flags. The usemap size is 0x4000 / 0x400 * 3 = 48 bits, rounded up to a multiple of 32 (64 bits), then divided by 8, which gives 8 bytes (line 18). During zone data initialization, free_area_init_core() also calls zone_wait_table_init() to initialise the wait queue hash table. The size of wait_queue_head_t is 8 bytes and there are 64 wait queues, so the allocation size is 8 * 0x40 = 0x200 bytes (line 20).
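
Both allocation sizes can be reproduced with the figures given in the text. The helper round_up() and the variable names below are made up for this sketch; the constants (0x400 pages per pageblock, 3 bits per pageblock, 8-byte wait queue heads, 64 entries) are the ones quoted above.

```python
def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return (value + multiple - 1) // multiple * multiple


zone_pages = 0x4000
pageblock_pages = 0x400      # pages covered by one pageblock
bits_per_pageblock = 3       # flag bits stored per pageblock

usemap_bits = zone_pages // pageblock_pages * bits_per_pageblock
usemap_bytes = round_up(usemap_bits, 32) // 8

wait_queue_head_size = 8     # bytes, as stated in the text
wait_table_entries = 0x40
wait_table_bytes = wait_queue_head_size * wait_table_entries

print(f"usemap     = {usemap_bits} bits -> {usemap_bytes} bytes")
print(f"wait table = {wait_table_bytes:#x} bytes")
```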

Line 21: Why are the start and end PFN the same? Line 19 reserves less than 1 page, and alloc_bootmem_core() attempts to use the free fragment of the last allocated page. On each call it computes where the allocation ends and records that offset in bdata->last_end_off. When last_end_off is not page-aligned, a free fragment exists in that page. The next bootmem allocation tries to use the fragment if it is large enough. We can see that the allocations in lines 21, 23, 25, 27, and 31 all fit into the page reserved in line 19, so no new page is reserved for them.
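
The fragment reuse can be imitated with a very small model that tracks only last_end_off and the next page to try. This is a simplified sketch, not the kernel's alloc_bootmem_core(): the class FragmentModel is a hypothetical name, the bitmap search is left out, and it assumes that every page from the goal onwards is still free (which holds for this log). Fed with the sizes from the log, it yields the same start/end pairs as the __reserve lines.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT


def pfn_down(addr):
    return addr >> PAGE_SHIFT


def pfn_up(addr):
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT


def align_up(value, align):
    return (value + align - 1) & ~(align - 1)


class FragmentModel:
    """Hypothetical model of the partial-page reuse; no bitmap search."""

    def __init__(self):
        self.last_end_off = 0   # byte offset where the last allocation ended
        self.hint_pfn = 0       # first page not touched by earlier allocations

    def alloc(self, size, align, goal):
        # Assumption: all pages from max(goal, hint) onwards are free.
        sidx = max(pfn_down(goal), self.hint_pfn)
        partial = self.last_end_off & (PAGE_SIZE - 1)
        if partial and pfn_down(self.last_end_off) + 1 == sidx:
            # Continue inside the fragment left in the previous page.
            start_off = align_up(self.last_end_off, align)
        else:
            start_off = sidx << PAGE_SHIFT
        merge = 1 if pfn_down(start_off) < sidx else 0
        end_off = start_off + size
        self.last_end_off = end_off
        self.hint_pfn = pfn_up(end_off)
        # Range of pages that still has to be marked reserved.
        return pfn_down(start_off) + merge, pfn_up(end_off)


model = FragmentModel()
sizes = [0x80000, 0x8, 0x200, 0x1c, 0x49, 0x49, 0x400, 0x8000, 0x4000]
reserved = [model.alloc(size, align=0x20, goal=0x1000000) for size in sizes]

for size, (start, end) in zip(sizes, reserved):
    print(f"size={size:x} -> __reserve start={start:x} end={end:x}")
```

When start equals end, the whole allocation fits into a page that is already reserved, so no bit in the bitmap changes.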

Line 22~23: The kernel allocates bootmem for a resource structure in resource_init(). The size of the resource structure is 28 (0x1c) bytes.

Line 24~27: The kernel allocates bootmem for saved_command_line and static_command_line in setup_command_line(). The command line is 72 characters long, so each copy needs 73 (0x49) bytes including the terminating NUL. The command line is also printed in line 29.
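
The size 0x49 can be checked against the command line printed in the log. The sketch assumes that each copy is allocated as the string length plus one byte for the terminating NUL.

```python
# Command line exactly as printed in line 29 of the kernel log.
cmdline = ("console=ttyS1,57600n8 root=/dev/ram0 "
           "rootfstype=squashfs bootmem_debug=1")

# Assumption: each copy needs strlen + 1 bytes (terminating NUL).
alloc_size = len(cmdline) + 1

print(f"length     = {len(cmdline)} characters")
print(f"allocation = {alloc_size} bytes = {alloc_size:#x}")
```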

Line 28: The number of pages in this zone is 16384 (0x4000), and 128 of them are used for the memory map. Therefore, present_pages is 16384 - 128 = 16256.

Line 30~32: The kernel allocates bootmem for the PID hash table in pidhash_init(). Each pid_hash entry is 4 bytes, so the table is 256 * 4 = 1024 (0x400) bytes.

Line 33~38: The kernel allocates bootmem for the Dentry cache and Inode cache hash tables in vfs_caches_init_early(). One hash list head is 4 bytes, so 8192 Dentry cache hash entries take 8192 * 4 = 0x8000 bytes, and 4096 Inode cache hash entries take 0x4000 bytes.
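
The three hash table sizes follow the same pattern, entries times 4 bytes. A short sketch with illustrative variable names, using the entry counts printed in the log:

```python
PAGE_SIZE = 4096
ENTRY_SIZE = 4   # bytes per hash list head on this 32-bit system

# Entry counts as printed in the kernel log.
tables = {"PID": 256, "Dentry cache": 8192, "Inode-cache": 4096}

sizes = {name: entries * ENTRY_SIZE for name, entries in tables.items()}
pages = {name: (size + PAGE_SIZE - 1) // PAGE_SIZE for name, size in sizes.items()}

for name in tables:
    print(f"{name}: {sizes[name]:#x} bytes, {pages[name]} page(s)")
```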

Line 43~44: When the memory subsystem is ready, the kernel calls free_all_bootmem(), which calls free_all_bootmem_core() for each bootmem bitmap to release all free pages to the buddy allocator. It examines node_bootmem_map for all pages in the node (PFN 0 ~ 0x4000) and calls __free_pages_bootmem() to release either 32 pages or 1 page at a time (depending on whether vec is all 1s or not). The free pages are PFN 0x5a9 ~ 0xfff and 0x108d ~ 0x3fff, 0x39ca pages in total. We can check the size of the reserved memory: (0x4000 - 0x39ca) * 4096 bytes = 6360K bytes. After freeing those pages, the pages storing the bootmem bitmap can also be freed, which is 1 page in this example. Thus, the total number of released pages is 0x39cb (line 44).
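
The page counts can be re-derived from the two free ranges. This is a sketch of the arithmetic with illustrative names; the ranges are inclusive, as written in the text.

```python
PAGE_SIZE = 4096
TOTAL_PAGES = 0x4000

# Inclusive PFN ranges that are still free when bootmem retires.
free_ranges = [(0x5a9, 0xfff), (0x108d, 0x3fff)]
free_pages = sum(last - first + 1 for first, last in free_ranges)

reserved_kb = (TOTAL_PAGES - free_pages) * PAGE_SIZE // 1024

bitmap_pages = 1                      # the bootmem bitmap itself
released = free_pages + bitmap_pages  # "released=" value in the log

print(f"free pages = {free_pages:#x}")
print(f"reserved   = {reserved_kb}k")
print(f"released   = {released:#x}")
```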

Line 47: When the kernel completes the initial bootup and is ready to start user-mode code, it calls free_initmem() to free the .init sections between the symbols __init_begin and __init_end.


In conclusion,
  1. available memory = total - reserved
  2. MemTotal = total - reserved + .init
  3. The bootmem allocator uses a simple bitmap to record which pages are in use.
  4. Reserved memory is mainly used to store the kernel image.
  5. The rest of the reserved memory is for the memory map, various hash tables, and some miscellaneous data.
  6. After the memory subsystem and the zone allocator are ready, the kernel releases the memory used for the bootmem bitmap (bootmem retires).
  7. When the initial bootup is done, the kernel releases the .init sections, which will never be used again.
