Linux 2.6.31
The Linux 2.6.31 kernel was released on 9 September 2009.
Spam: Valerie Aurora has published on LWN a great article explaining some parts of the deep internals of btrfs. Since btrfs is expected to replace Ext4 at some point, it's an interesting read.
Summary: This version adds USB 3.0 support, an equivalent of FUSE for
character devices that can be used for proxying OSS sound to ALSA, some
memory management changes that improve interactivity on desktops,
readahead improvements, ATI Radeon kernel modesetting support, support
for Intel's Wireless Multicomm 3200 WiFi devices, kernel support and a
userspace tool for performance counters, gcov support, a memory checker
for uninitialized memory, a memory leak detector, a reimplementation of
inotify and dnotify on top of a new filesystem notification
infrastructure, btrfs improvements, support for the IEEE 802.15.4
network standard, IPv4 over FireWire, many new drivers, small
improvements and fixes.
- Prominent features (the cool stuff)
  - USB 3 support
  - CUSE (character devices in userspace) and OSS Proxy
  - Improve desktop interactivity under memory pressure
  - ATI Radeon Kernel Mode Setting support
  - Performance Counters
  - IEEE 802.15.4 Low-Rate Wireless Personal Area Networks support
  - Gcov support
  - Kmemcheck
  - Kmemleak
  - Fsnotify
  - Preliminary NFS 4.1 client support
  - Context Readahead algorithm and mmap readahead improvements
- Various core changes
- Filesystems
- Networking
- Security
- Tracing/Profiling
- DM
- Crypto
- Virtualization
- PCI
- Block
- Memory management
- Architecture-specific changes
- Drivers
  - Graphics
  - Storage
  - Network
  - Input
  - USB
  - Sound
  - V4L/DVB
  - Staging
  - FireWire
  - MTD
  - WATCHDOG
  - HWMON
  - HID
  - RTC
  - Serial
  - I2C
  - MFD
  - Rfkill
  - MMC
  - Regulator
  - Various
- Other news sources tracking the kernel changes
1. Prominent features (the cool stuff)
1.1. USB 3 support
This version of Linux adds support for USB 3.0 devices (contributed by Sarah Sharp from Intel) and for the hardware that implements the eXtensible Host Controller Interface (xHCI) 0.95 specification. No xHCI hardware has made it onto the market yet, but these patches have been tested on a Fresco Logic host controller prototype.
Code: drivers/usb/host/xhci*
1.2. CUSE (character devices in userspace) and OSS Proxy
Recommended LWN article: Character devices in user space
CUSE is an extension of FUSE that allows character devices to be implemented in userspace. It has been contributed by Tejun Heo (SUSE).
It can be used for many things, for example "proxying" OSS audio from OSS apps through the ALSA userspace layer, or to an audio system which can forward the sound through the network. ALSA contains OSS emulation, but sadly the emulation is in the kernel, behind the userland multiplexing layer, which means that if your sound card doesn't support multiple audio streams (most modern cards don't), only one of the ALSA or OSS emulation interfaces is usable at any given moment.
OSS Proxy uses CUSE to implement the OSS interface: /dev/dsp, /dev/adsp and /dev/mixer. From the point of view of the applications, these devices are proper character devices and behave exactly the same way, so it can be used as a replacement for the in-kernel ALSA OSS emulation layer. The application sends the audio to these CUSE devices, and OSS Proxy forwards it to a "slave" (currently only one slave is implemented, pulseaudio).
Code: CUSE (commit)
OSS Proxy home and code:
http://userweb.kernel.org/~tj/ossp/
1.3. Improve desktop interactivity under memory pressure
PROT_EXEC pages are pages that normally belong to currently running executables and their linked libraries. They should be cached aggressively to provide a good user experience: if they aren't, desktop applications suffer long and noticeable pauses whenever their code path jumps to a part of the code that is not cached in memory and has to be read from the disk, which is very slow. Due to some memory management scalability work in recent kernel versions, there are some (commonly used) workloads which can send these PROT_EXEC pages to the list of inactive filesystem-backed pages (the ones used to map files), from where they can get flushed out of the working set. The result is a desktop environment with poor interactivity: the applications become unresponsive too easily.
In this version, some heuristics have been added to make it much harder to get mapped executable pages out of the list of active pages. The result is an improved desktop experience: benchmarks on memory-tight desktops show clock time and major faults reduced by 50%, and pswpin numbers reduced to ~1/3, which means that X desktop responsiveness is doubled under high memory/swap pressure. Memory flushing benchmarks on a file server show the number of major faults going from 50 to 3 during 10% cache hot reads. See the commit links for more details and benchmarks.
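The idea of keeping mapped executable pages on the active list can be illustrated with a toy model. This is only a sketch under stated assumptions: the text above does not spell out the heuristics, so the rule used here (a referenced executable page gets another round on the active list, everything else is deactivated) is an assumption, and the `Page` class and `shrink_active_list` function are hypothetical names, not kernel code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    executable: bool = False  # stands in for a PROT_EXEC file mapping
    referenced: bool = False  # touched since the last scan


def shrink_active_list(active, inactive, nr_to_scan):
    """Scan pages from the cold end of the active list.

    Assumed heuristic: referenced executable pages stay on the active
    list; every other page moves to the inactive list, where it becomes
    a candidate for eviction.
    """
    kept = []
    for _ in range(min(nr_to_scan, len(active))):
        page = active.popleft()
        if page.executable and page.referenced:
            page.referenced = False  # must be touched again to survive the next scan
            kept.append(page)
        else:
            inactive.append(page)
    active.extend(kept)
    return active, inactive


active = deque([
    Page("libgtk.so text", executable=True, referenced=True),
    Page("big-file data", referenced=True),
    Page("old-binary text", executable=True),
])
inactive = deque()
shrink_active_list(active, inactive, nr_to_scan=3)
print([p.name for p in active])    # ['libgtk.so text']
print([p.name for p in inactive])  # ['big-file data', 'old-binary text']
```

In this model a large streaming read cannot push the code of a running application out of the active list, while executable pages that are no longer used still age out normally.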
Code: (commit 1, 2, 3)
1.4. ATI Radeon Kernel Mode Setting support
This version adds Kernel Mode Setting (KMS) support for ATI Radeon. The supported hardware is R1XX, R2XX, R3XX, R4XX and R5XX (Radeon up to X1950). Work is underway to provide support for R6XX, R7XX and newer hardware (Radeon from HD2XXX to HD4XXX).
Code: (commit), (commit)
1.5. Performance Counters
Recommended LWN article: Followups: performance counters, ksplice, and fsnotify
The Performance Counter subsystem provides an abstraction of the special performance counter hardware registers available on most modern CPUs. These registers count the number of certain types of hardware events, such as instructions executed, cache misses suffered or branches mispredicted, without slowing down the kernel or applications. These registers can also trigger interrupts when a threshold number of events has passed, and can thus be used to profile the code that runs on that CPU. In this release, support for x86 and PPC and partial support for S390 and FRV have been added.
Users are not expected to use the API themselves. Instead, a powerful performance analysis tool has been built: "perf", which is available at tools/perf/ (in an unusual decision to include kernel-related userspace software in the kernel tree).
perf supports several modes of operation. "perf top" shows a top-like interface, which you can restrict to any given set of events, process or CPU. "perf record" records a profile into a file, "perf report" reads the profile and shows it on the screen, and "perf annotate" reads the data and displays the annotated code. There are also "perf list", which shows the list of events supported by the hardware, and "perf stat", which runs a command and gathers performance statistics which are printed to the screen. All the documentation and man pages are available in the 'Documentation' subdirectory. Some examples:
$ ./perf stat -r 3 -- echo -n
Performance counter stats for 'echo -n' (3 runs):
2.337404 task-clock-msecs # 0.566 CPUs ( +- 1.704% )
1 context-switches # 0.000 M/sec ( +- 0.000% )
0 CPU-migrations # 0.000 M/sec ( +- 0.000% )
184 page-faults # 0.079 M/sec ( +- 0.000% )
4319963 cycles # 1848.188 M/sec ( +- 1.615% )
5024608 instructions # 1.163 IPC ( +- 0.722% )
73278 cache-references # 31.350 M/sec ( +- 1.636% )
2019 cache-misses # 0.864 M/sec ( +- 6.535% )
0.004126139 seconds time elapsed ( +- 24.603% )
$ perf report -s comm,dso,symbol -C firefox -d /usr/lib64/xulrunner-1.9.1/libxul.so | grep :: | head
2.21% [.] nsDeque::Push(void*)
1.78% [.] GraphWalker::DoWalk(nsDeque&)
1.30% [.] GCGraphBuilder::AddNode(void*, nsCycleCollectionParticipant*)
1.27% [.] XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode)
1.18% [.] imgContainer::DrawFrameTo(gfxIImageFrame*, gfxIImageFrame*, nsRect&)
1.13% [.] nsDeque::PopFront()
1.11% [.] nsGlobalWindow::RunTimeout(nsTimeout*)
0.97% [.] nsXPConnect::Traverse(void*, nsCycleCollectionTraversalCallback&)
0.95% [.] nsJSEventListener::cycleCollection::Traverse(void*, nsCycleCollectionTraversalCallback&)
0.95% [.] nsCOMPtr_base::~nsCOMPtr_base()
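The derived column in the "perf stat" output above can be reproduced from the raw counts. The short sketch below recomputes the cycles rate and the IPC figure from the numbers in the sample; the cache miss ratio is not part of that output and is derived here only as an illustration.

```python
# Raw values copied from the 'perf stat' sample output above.
task_clock_msecs = 2.337404
cycles = 4319963
instructions = 5024608
cache_references = 73278
cache_misses = 2019

# Events per millisecond divided by 1000 gives millions of events per second.
cycles_m_per_sec = cycles / task_clock_msecs / 1000

# Instructions retired per CPU cycle.
ipc = instructions / cycles

# Not printed by perf stat in this sample; derived here for illustration.
cache_miss_ratio = cache_misses / cache_references

print(f"{cycles_m_per_sec:.3f} M/sec")
print(f"{ipc:.3f} IPC")
print(f"{cache_miss_ratio:.2%} of cache references missed")
```

The first two values match the 1848.188 M/sec and 1.163 IPC shown next to the cycles and instructions rows.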
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
1.6. IEEE 802.15.4 Low-Rate Wireless Personal Area Networks support
IEEE Std 802.15.4 defines low data rate, low power and low complexity short range wireless personal area networks. It was designed to organise networks of sensors, switches and similar automation devices. The maximum allowed data rate is 250 kb/s and the typical personal operating space is around 10 m.
Code: (commit 1, 2, 3, 4, 5)
1.7. Gcov support
This version enables the use of GCC's coverage testing tool gcov with the Linux kernel. gcov may be useful for debugging (has this code been reached at all?), test improvement (how do I change my test to cover these lines?), minimizing kernel configurations (do I need this option if the associated code is never run?) and other things.
Code: (commit 1, 2)
1.8. Kmemcheck
Kmemcheck is a debugging feature for the Linux kernel. More specifically, it is a dynamic checker that detects and warns about some uses of uninitialized memory. Userspace programmers might be familiar with Valgrind's memcheck. The main difference between memcheck and kmemcheck is that memcheck works for userspace programs only, and kmemcheck works for the kernel only.
Enabling kmemcheck on a kernel will probably slow it down to the extent that the machine will not be usable for normal workloads such as an interactive desktop. kmemcheck will also cause the kernel to use about twice as much memory as normal. For this reason, kmemcheck is strictly a debugging feature.
Code: (commit 1, 2, 3, 4, 5, 6, 7)
1.9. Kmemleak
Recommended LWN article: Detecting kernel memory leaks
Kmemleak provides a way of detecting possible kernel memory leaks in a way similar to a tracing garbage collector, with the difference that the orphan objects are not freed. Instead, a kernel thread scans the memory every 10 minutes (by default), reports any new unreferenced objects it finds in /sys/kernel/debug/kmemleak and warns about them. A similar method is used by the Valgrind tool (memcheck --leak-check) to detect memory leaks in user-space applications.
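The "tracing garbage collector that does not free anything" idea can be sketched in a few lines. This is a simplified model, not kmemleak's implementation: allocations are represented as a dictionary from address to the addresses found inside them, and `find_unreferenced`, `heap` and `roots` are hypothetical names chosen for the illustration.

```python
def find_unreferenced(objects, roots):
    """Return the tracked allocations that cannot be reached from the roots.

    `objects` maps an allocation's address to the addresses found inside it
    when its memory is scanned. `roots` are addresses found in memory that is
    always scanned (for example the data section and stacks in this model).
    """
    reachable = set()
    worklist = [addr for addr in roots if addr in objects]
    while worklist:
        addr = worklist.pop()
        if addr in reachable:
            continue
        reachable.add(addr)
        # Any value that looks like the address of a tracked object counts
        # as a reference to it.
        worklist.extend(ref for ref in objects[addr] if ref in objects)
    # Unlike a garbage collector, nothing is freed: orphans are only reported.
    return sorted(set(objects) - reachable)


heap = {
    0x1000: [0x2000],  # referenced from a root, points to 0x2000
    0x2000: [],
    0x3000: [0x4000],  # nothing points here: a possible leak
    0x4000: [],        # only referenced from the leaked object
}
leaks = find_unreferenced(heap, roots=[0x1000, 0xDEAD])
print([hex(addr) for addr in leaks])  # ['0x3000', '0x4000']
```

Because any value that merely looks like a pointer keeps an object alive, a scan of this kind reports "possible" leaks rather than certain ones, which matches the wording used above.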
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
1.10. Fsnotify
Fsnotify is a backend for filesystem notification. Fsnotify itself does not provide any userspace interface, but it does provide the basis needed for other notification schemes such as dnotify, inotify and fanotify (this last notification interface will be included in future releases). In fact, in this release dnotify and inotify have been rewritten on top of fsnotify, removing at the same time the ugly and complex code from those systems. Fsnotify provides a mechanism for "groups" to register for some set of filesystem events and then delivers those events to those groups for processing, and the locking is much simpler. Fsnotify has other benefits, like shrinking the size of an inode.
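The "groups register for a set of events, the backend delivers matching events" model can be sketched as follows. All names here (`Group`, `Notifier`, the `EV_*` bits) are hypothetical and do not mirror the kernel's structures or constants; the sketch only shows the registration and mask-based delivery described above.

```python
# Hypothetical event bits; the names do not mirror the kernel's constants.
EV_CREATE, EV_MODIFY, EV_DELETE = 1, 2, 4


class Group:
    """A listener (for example an inotify-like front end) with an event mask."""

    def __init__(self, name, mask):
        self.name = name
        self.mask = mask
        self.queue = []

    def handle_event(self, path, event):
        self.queue.append((path, event))


class Notifier:
    """Backend that knows nothing about how groups expose events to users."""

    def __init__(self):
        self.groups = []

    def register(self, group):
        self.groups.append(group)

    def notify(self, path, event):
        # Deliver the event only to groups whose mask includes it.
        for group in self.groups:
            if group.mask & event:
                group.handle_event(path, event)


backend = Notifier()
watcher = Group("watcher", EV_CREATE | EV_DELETE)
auditor = Group("auditor", EV_MODIFY)
backend.register(watcher)
backend.register(auditor)

backend.notify("/tmp/a", EV_CREATE)
backend.notify("/tmp/a", EV_MODIFY)
print(watcher.queue)  # [('/tmp/a', 1)]
print(auditor.queue)  # [('/tmp/a', 2)]
```

In this picture dnotify and inotify would each be one kind of group, with its own way of handing queued events to userspace.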
Code: (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
1.11. Preliminary NFS 4.1 client support
2.6.30 added some developer support for NFS 4.1. This version enables optional support for minor version 1 of the NFSv4 protocol (draft-ietf-nfsv4-minorversion1) in the kernel's NFS client.
Code: (commit)
1.12. Context Readahead algorithm and mmap readahead improvements
This version introduces a page cache context based readahead algorithm. The current readahead algorithm detects interleaved reads in a passive way, whereas the context readahead algorithm is guaranteed to discover the sequentiality no matter how the streams are interleaved. The beneficiaries are strictly interleaved reads and cooperative I/O processes (i.e. NFS and SCST). SCST benchmarks show 6%~40% performance gains in various cases and equal performance in others.
There are also some improvements to mmap readahead. On an NFS-root desktop, mmap readahead reduced major faults by 1/3 with no obvious overhead, and mmap I/O can be further reduced by 1/4.
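A rough way to picture "page cache context" is to look at which pages just before the requested offset are already cached: if some are, the request probably continues a sequential stream, regardless of what other streams did in between. The sketch below is an illustration under that assumption only. The function names, the `max_pages` limit and the "double the history" sizing rule are hypothetical and are not taken from the kernel code.

```python
def count_history_pages(cached, offset, max_pages):
    """Count consecutive cached pages immediately before `offset`."""
    count = 0
    while count < max_pages and (offset - 1 - count) in cached:
        count += 1
    return count


def context_readahead(cached, offset, max_pages=32):
    """Return the page range to read ahead for a miss at `offset`, or None.

    `cached` is the set of page indexes of the file that are already in the
    page cache. No per-stream state is kept: the cache itself is the context.
    """
    history = count_history_pages(cached, offset, max_pages)
    if history == 0:
        return None  # no cached context: treat it as a random read
    # Assumed sizing rule: the more history, the larger the readahead window.
    size = min(2 * history, max_pages)
    return range(offset, offset + size)


# Two interleaved sequential streams, one near page 0 and one near page 1000.
cached = {0, 1, 2, 3, 1000, 1001}
print(context_readahead(cached, 4))     # range(4, 12)
print(context_readahead(cached, 1002))  # range(1002, 1006)
print(context_readahead(cached, 5000))  # None
```

Both streams are recognised as sequential even though their reads alternate, which is the property the text attributes to the context readahead algorithm.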
Code: (commit 1, 2, 3, 4)
2. Various core changes
- Add caching of ACLs in struct inode (commit 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
- Provide generic atomic64_t implementation (commit)
- eventfd: revised interface and cleanups (commit)
- modules: sysctl to block module loading (commit)
- poll: avoid extra wakeups in select/poll (commit)
- proc: export more page flags in /proc/kpageflags (commit), export statistics for softirq to /proc (commit)
- ramdisk: remove long-deprecated "ramdisk=" boot-time parameter (commit)
- RCU: make treercu be default (commit)
- Add requeue_pi functionality (commit), (commit), (commit), (commit), (commit), (commit)
- signals: implement sys_rt_tgsigqueueinfo (commit), (commit)
- softirq: introduce statistics for softirq (commit)
- splice: implement pipe to pipe splicing (commit), implement default splice_read method (commit), implement default splice_write method (commit)
- timers: Framework for identifying pinned timers (commit), (commit), logic to move non pinned timers (commit), /proc/sys sysctl hook to enable timer migration (commit)
- Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls (commit)
- vsprintf: introduce %pf format specifier (commit)
- printk: add support of hh length modifier for printk (commit)
3. Filesystems
- Btrfs
  - Mixed back reference (FORWARD ROLLING FORMAT CHANGE). It scales significantly better with a large number of snapshots (commit)
  - Use hybrid extents+bitmap rb tree for free space. Currently btrfs has a problem where it can use a ridiculous amount of RAM simply tracking free space. As free space gets fragmented, we end up with thousands of entries on an rb-tree per block group, which usually spans 1 gigabyte of disk space. This patch solves this problem by using bitmaps for parts of the free space cache. The maximum amount of RAM that should ever be used to track 1 gigabyte of disk space will be 32k of RAM (commit)
  - Async block group caching (commit)
  - Reduce mount -o ssd CPU usage (commit)
  - Add mount -o nossd (commit)
  - Add mount -o ssd_spread to spread allocations out (commit)
  - Autodetect SSD devices (commit)
  - Implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION (attributes set via chattr and read via lsattr) (commit)
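The 32k bound quoted in the Btrfs free space item follows from simple arithmetic if one bit tracks one block. The sketch below assumes a 4 KiB block per bit, which is an assumption made for the illustration rather than something stated above; `mark_free` is a hypothetical helper.

```python
GIB = 1 << 30
BLOCK_SIZE = 4096  # assumption: one bitmap bit per 4 KiB block

bits_needed = GIB // BLOCK_SIZE  # one bit per block
bitmap_bytes = bits_needed // 8
print(bitmap_bytes // 1024, "KiB")  # 32 KiB


def mark_free(bitmap, start_block, nr_blocks):
    """Set the bits for a run of free blocks."""
    for block in range(start_block, start_block + nr_blocks):
        bitmap[block // 8] |= 1 << (block % 8)


# The bitmap has a fixed size no matter how fragmented the free space is.
bitmap = bytearray(bitmap_bytes)
mark_free(bitmap, 10, 3)
mark_free(bitmap, 5000, 1)
free_blocks = sum(bin(byte).count("1") for byte in bitmap)
print(free_blocks)  # 4
```

An extent-based tree needs one entry per free fragment, so its size grows with fragmentation, while the bitmap's size depends only on the amount of space it covers.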
- Ext4
  - Avoid unnecessary spinlock in critical POSIX ACL path (3% improvement in stat and open/close latencies) (commit), (commit)
  - Hook fiemap operation for directories (commit)
  - Add EXT4_IOC_MOVE_EXT ioctl (will be used for online defrag in the future) (commit)
  - Teach the inode allocator to use a goal inode number (commit)
  - Convert instrumentation from markers to tracepoints (commit), (commit)
- Ext3
  - Avoid unnecessary spinlock in critical POSIX ACL path (3% improvement in stat and open/close latencies) (commit), (commit)
- CIFS
  - Add addr= mount option alias for ip= (commit)
  - Add mention of new mount parm (forceuid) to cifs readme (commit)
- NILFS2
  - Allow future expansion of metadata read out via get info ioctl (commit)
  - Pagecache usage optimization on NILFS2 (commit)
- XFS
  - Use generic POSIX ACL code (commit)
- FAT
  - Add 'errors' mount option (commit)
- GFS2
  - Add tracepoints (commit)
- NFS
  - Add support for splice writes (commit)
- OCFS2
  - Add statistics for the checksum and ecc operations (commit)
- UBIFS
  - Start using hrtimers (commit)
4. Networking
- netfilter
  - conntrack: add support for DCCP handshake sequence to ctnetlink (commit)
  - conntrack: optional reliable conntrack event delivery (commit)
  - nf_ct_tcp: TCP simultaneous open support (commit)
  - passive OS fingerprint xtables match (commit)
  - xt_NFQUEUE: queue balancing support (commit)
- WiFi
  - cfg80211: add cipher capabilities (commit), add rfkill support (commit), allow adding/deleting stations on mesh (commit), allow setting station parameters in mesh (commit)
  - mac80211: implement beacon filtering in software (commit), improve powersave implementation (commit)
- wimax: Add netlink interface to get device state (commit)
- TX_RING and packet mmap, makes packet socket more efficient for transmission (commit)
- nl80211: Add IEEE 802.1X PAE control for station mode (commit), add set/get for frag/rts threshold and retry limits (commit), add support for configuring MFP (commit)
- sit: stateless autoconf for isatap (commit)
- af_iucv: Provide new socket type SOCK_SEQPACKET (commit)
- ipv4: New multicast-all socket option (commit)
- tcp: extend ECN sysctl to allow server-side only ECN (commit)
- irda: new Blackfin on-chip SIR IrDA driver (commit)
- irda-usb: suspend/resume support (commit)
- dropmon: add ability to detect when hardware drops rx packets (commit)
5. Security
- SELinux: Permissive domain in userspace object manager (commit)
- smack: implement logging V3 (commit), (commit)
- IMA: Minimal IMA policy and boot param for TCB IMA policy (commit)
- Don't raise all privs on setuid-root file with fE set (v2) (commit)
6. Tracing/Profiling
- tracing: add average time in function to function profiler (commit)
- tracing: add function profiler (commit)
- tracing: add hierarchical enabling of events (commit)
- tracing: adding function timings to function profiler (commit)
- tracing: export stats of ring buffers to userspace (commit)
- oprofile: add support for Core i7 and Atom (commit)
- oprofile: introduce module_param oprofile.cpu_type (commit)
- oprofile: re-add force_arch_perfmon option (commit)
- oprofile: remove undocumented oprofile.p4force option (commit)
- ring-buffer: add benchmark and tester (commit)
7. DM
8. Crypto
- aes-ni - Add support for more modes (commit)
- padlock - Enable on x86_64 (commit)
- talitos - Add ablkcipher algorithms (commit)
9. Virtualization
- KVM
  - Add VT-x machine check support (commit)
  - Enable MSI-X for KVM assigned device (commit)
  - Enable snooping control for supported hardware (commit)
  - Add SVM NMI injection support (commit)
- VT-d
  - Add device IOTLB invalidation support (commit)
  - Parse ATSR in DMA Remapping Reporting Structure (commit)
  - Support the device IOTLB (commit)
- virtio
  - Expose features in sysfs (commit)
  - blk: SG_IO passthru support (commit)
  - pci: optional MSI-X support (commit)
- Xen
  - Add "capabilities" file (commit)
  - Add /dev/xen/evtchn driver (commit)
  - Add /sys/hypervisor support (commit)
- lguest: improve interrupt handling, speed up stream networking (commit), PAE support (commit)