http://lwn.net/Articles/428584/

Optimizing Linux with cheap flash drives

February 18, 2011

This article was contributed by Arnd Bergmann

Flash drives are getting larger and cheaper; as a result, they are showing up in an increasing number of devices. These drives are not the same as the rotating-media drives which preceded them, and they have different performance characteristics. If Linux is to make proper use of this class of hardware, it must drive it in a way which is aware of its advantages and disadvantages.

This article will review the properties of typical flash devices and list some optimizations that should allow Linux to get the most out of low-cost flash drives. The kernel working group of the Linaro project is currently researching this topic as an increasing number of embedded designs move away from raw NAND flash devices to embedded MMC or SD drives that hide the NAND interface and provide a simplified linear block device. This drives down system design complexity and cost, but it also means that regular block-oriented filesystems are used instead of the Linux MTD layer that can talk to raw flash.

Most filesystems and the block layer in Linux are highly optimized for rotating media, in particular by organizing all accesses to avoid seeks. It has become clear that some of these optimizations are pointless or even counterproductive with solid-state storage media. Recent kernels have a per-device flag for non-rotational devices that treats them slightly differently, by assuming that all seeks are free, but is that really enough to get good I/O performance on solid-state drives? High-end drives are getting fast enough to make optimizations for CPU load more interesting than optimizations for ideal access patterns. In contrast, the more common SD cards and USB flash drives are very sensitive to specific access patterns and can show very high latencies for writes unless they are used with the preformatted FAT32 file layout.

As an example, a desktop machine using a 16 GB, 25 MB/s CompactFlash card to hold an ext3 root filesystem ended up freezing the user interface for minutes during phases of intensive block I/O, despite having gigabytes of free RAM available. Similar problems often happen on small embedded and mobile machines that rely on SD cards for their filesystems.

To understand why this happens, it is important to find out how the embedded controllers on these cards work. Since very little information is publicly documented, most of the following information had to be gathered using reverse engineering based on timing data collected from a large number of SD cards and other devices.
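The per-device flag mentioned above is visible from user space through sysfs, so it is easy to check what the kernel has decided about a given drive. A minimal sketch in Python (the device name is only an example):

#!/usr/bin/env python3
"""Report whether the kernel treats a block device as rotational."""
import sys
from pathlib import Path

def is_rotational(dev: str) -> bool:
    # "1" means the kernel assumes a rotating disk, "0" a solid-state device
    flag = Path(f"/sys/block/{dev}/queue/rotational").read_text().strip()
    return flag == "1"

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "mmcblk0"   # example device name
    kind = "rotational" if is_rotational(dev) else "non-rotational"
    print(f"{dev}: {kind}")

On most SD cards and USB sticks this reports "non-rotational", but as the rest of the article shows, that flag alone says nothing about the access patterns the drive actually prefers.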

Pages, erase blocks and segments

All NAND flash chips are physically organized into "pages" and "erase blocks." A page is the smallest unit that can be addressed in a single read or write operation by the embedded microcontroller on a managed flash device, and it has an effective size between 2KB and 32KB in current consumer flash drives. This means that while a single 512-byte access is possible on the host interface (USB, ATA, MMC, ...), it takes almost the same time as a full page access inside of the drive.

Although it is usually possible to write single pages, the data cannot be overwritten without being erased first, and erasing is only possible in much larger units, typically between 128KB and 2MB. The controllers group these erase blocks into even larger segments, called "erase block groups," "allocation units," or simply "segments." The most common size for these segments is 4MB for drives in the multi-gigabyte class, and all operations on the drive happen in these units; in particular, the drive will never erase any unit smaller than a segment.

The drives have a single lookup table which contains a mapping between logical segments and physical segments. On a typical 8GB SD card using 4MB segments, this table contains a little under 2000 entries, which is small enough to be kept in the RAM of the card's microcontroller at all times. A small number of physical segments is set aside in a pool to handle wear leveling, bad blocks and garbage collection. Ideally, the drive expects all data to be written in full segments, which is what happens when recording a live video or storing a music collection on a FAT32 filesystem.

The way the physical characteristics of the card make themselves felt can be seen in the plot to the right (click on the thumbnail for the full-size version), which summarizes the results of a number of tests on an SDHC memory card. The best-case read throughput is 13.5MB/s, while the linear write throughput is 11.5MB/s. The results show that the segment size is 4MB; any properly-aligned, 4MB write will be fast. The smallest efficient block size for reads and writes is 64KB; all accesses smaller than that are significantly slower. Individual pages are 8KB; the costs of extra garbage collection caused by smaller writes can be seen. The card as a whole has been optimized for linear write operations; random writes are much slower. Additionally, only one segment can be open at a time; alternating between two segments will cause garbage collection at every access, slowing write speeds to a mere 33KB/s. That said, the FAT file table area (from 4MB to 8MB) is managed differently, enabling small writes to be done efficiently there.

The second image to the right shows a plot of read access times, in page granularity, on the first 32MB of a Panasonic Class 10 SDHC card. This plot illustrates various properties of the card. The segment size of 4MB can clearly be seen from the various changes in performance at the boundaries between segments. All closed segments have the same read performance, as do all erased segments, which are a little faster to read. The FAT area in the second segment is a bit slower when reading because it uses a block remapping algorithm. One segment was opened for writing by writing a few blocks in the middle before the read test; that segment can be seen to be a little faster to read on this specific card. Also, an effect of multi-level-cell (MLC) flash is that it alternates between slightly slower and faster pages, which the plot shows as two parallel lines for some segments.
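These boundaries can be observed from user space by timing reads of different sizes while bypassing the page cache, which is essentially what the flashbench tool mentioned later automates. The following rough sketch illustrates the idea; the device path is an example, single-shot timings will be noisy, and O_DIRECT needs aligned buffers, which is why the buffer comes from mmap:

#!/usr/bin/env python3
"""Time O_DIRECT reads at increasing block sizes to spot page/segment effects.
A simplified sketch of the flashbench idea; /dev/mmcblk0 is only an example
and reading the device requires appropriate permissions."""
import os, mmap, time

DEV = "/dev/mmcblk0"          # example device
MAX_BLOCK = 1 << 20           # probe block sizes up to 1 MiB

def time_read(fd, offset, size, buf):
    start = time.perf_counter()
    os.preadv(fd, [memoryview(buf)[:size]], offset)
    return time.perf_counter() - start

def main():
    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    # mmap returns page-aligned memory, which O_DIRECT requires
    buf = mmap.mmap(-1, MAX_BLOCK)
    size = 4096
    while size <= MAX_BLOCK:
        t = time_read(fd, 0, size, buf)
        print(f"{size:>8} bytes: {t*1e3:7.3f} ms  ({size/t/1e6:6.1f} MB/s)")
        size *= 2
    os.close(fd)

if __name__ == "__main__":
    main()

On a card like the one described above, the per-read cost barely changes until the read size exceeds a page, and throughput keeps climbing until the access reaches the smallest efficient block size.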

Wear leveling

When a segment that already contains data is written to, a new segment is allocated from the free pool and the drive writes the new data into that segment. Once the segment has been written to from start to finish, the lookup table will be updated to point to the new segment, while the old segment is put into the free pool and erased in the background.

By always allocating a new segment, the drive can avoid wearing out a single physical segment in cases where the host always writes to the same block addresses. Instead, all writes are statistically distributed to all the segments that get written to from time to time. The better memory cards and SSDs also do static wear leveling, meaning they occasionally move a logical segment that contains static data to a physical segment that has been erased many times to even out the wear and increase the expected lifetime of the card. However, the vast majority of cheap memory cards do not do this but, instead, rely on the host software to write to every segment of the drive at some time or other. The diagram to the right shows how this mapping works in a typical flash drive; click on it for an animated version.

To improve wear leveling, the host can also issue trim or erase commands on full segments to increase the size of the free pool. However, filesystems in Linux do not know the segment size and typically issue trim commands on partial segments, which can improve write performance inside that segment but not help wear leveling across segments.
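On drives that support it, such a trim can be issued from user space with the BLKDISCARD ioctl, which is what the blkdiscard(8) utility wraps. The sketch below discards one whole segment; the ioctl number is the value of _IO(0x12, 119) from <linux/fs.h>, the 4 MB segment size and the device name are assumptions for illustration, and since discarding destroys data it should be read as pseudocode to adapt rather than a tool to run blindly:

#!/usr/bin/env python3
"""Discard (trim) one whole segment on a flash device.
Sketch only: BLKDISCARD erases the data in the given range."""
import fcntl, os, struct

BLKDISCARD = 0x1277              # _IO(0x12, 119) from <linux/fs.h>
SEGMENT_SIZE = 4 * 1024 * 1024   # assumed 4 MB segment

def discard_segment(dev: str, segment_index: int) -> None:
    start = segment_index * SEGMENT_SIZE
    # The ioctl takes a (start, length) pair of 64-bit byte values
    arg = struct.pack("QQ", start, SEGMENT_SIZE)
    fd = os.open(dev, os.O_WRONLY)
    try:
        fcntl.ioctl(fd, BLKDISCARD, arg)
    finally:
        os.close(fd)

if __name__ == "__main__":
    discard_segment("/dev/mmcblk0", 10)   # example: discard the 11th segment

Issuing the discard on a full, segment-aligned range is what gives the controller a whole physical segment back for its free pool; a partial-segment discard, as noted above, only helps writes inside that segment.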

Garbage Collection

In real life, writing 4 MB segments at once is more the exception than the rule, so drives need to cope with partial updates of segments. While data gets written to a logical segment, the controller normally has an old and a new physical segment associated with it. In order to free up the extra segment, it has to combine all the logical blocks in that segment into physical blocks on only one segment and discard all the previously used physical blocks, a process called garbage collection. A number of garbage collection techniques can be observed in current drives, including special optimizations using caching in RAM or NOR flash and dynamically adapting to the access patterns. Most drives, however, use a very simple garbage collection method, typically one of the following three. Each description below is accompanied by a diagram which, when clicked, will lead to an animated version showing how the technique works.

Linear-access optimized garbage collection. Drives that are advertised as being ideal for video storage usually expect long, contiguous reads and writes. They always write a physical segment from start to end, so, if the first write into a segment does not address the first logical block inside it, the drive copies all blocks in front of it from the old segment before writing the new data. Similarly, a subsequent write to a block that is not logically contiguous to the previously written one requires the drive to copy all intermediate blocks. Garbage collection simply fills the new segment up to the end with copies of the unchanged blocks from the old segment. The advantage is optimum performance for all reads and for long writes, but the disadvantage is that the drive ends up copying almost an entire segment for each block that gets written in the wrong order, for instance when the block elevator algorithm writes the blocks in reverse order attempting to avoid long seeks. Also, writing linear data smaller than the minimum block size of the drive makes it write the same block twice, which forces an immediate garbage collection. The minimum block size that the drive expects here is normally the cluster size of the preformatted FAT32 filesystem, between 4KB and 32KB, but on SD cards it can be even larger than that. Because of this, drives that are hardwired to linear-access optimized segments are basically useless for ext3 and most other Linux filesystems, which keep small data structures like inodes and block bitmaps in front of the actual data and need to seek back to these in order to write new small files.

Block remapping. Fortunately, a significant number of flash drives support random access within a logical segment, by remapping logical blocks to free physical blocks as they get written. Since this requires maintaining another lookup mechanism, both read and write accesses are slightly slower than the ideal linear-access behavior, and a small amount of out-of-band data needs to be reserved to store the lookup table. This method also does not allow efficient writing in small units when the manufacturers optimize for larger blocks in order to keep the size of the lookup table small. Writing the same block repeatedly still requires a full garbage collection, which makes this method unsuitable for storing an ext3 journal or any other data that frequently gets written to the same area on the drive.

Data logging. The best random-access behavior is provided by using the same approach that log-structured filesystems like jffs2, logfs or nilfs2 and block remappers like UBI in Linux use. Data that is written anywhere in the logical segment always goes to the next free block in the new physical segment, and the drive keeps a log of all the writes cached. Once the last free block is used up, a garbage collection is performed using a third physical segment. In the end, writing this way is slower than the other two approaches in the best case, because every block is written at least twice, but the worst case is much better. This approach is normally used only in the first few segments on the drive, which contain the file allocation table on FAT32-preformatted drives. Some drives are also able to use this mode when they detect access patterns that match writes to a FAT32-style directory entry. Obviously, any such optimization does not normally do the right thing when the drive is used with a different filesystem than the one it was intended for, but there is some potential for optimization, e.g. by ensuring that the ext3 journal uses the blocks that are designed to hold the FAT.
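The cost difference between these schemes can be illustrated with a toy model that is not tied to any particular card: it simply counts how many blocks a purely linear-access optimized controller would have to copy for a given write pattern, following the rules described above (copy everything in front of an out-of-order write, garbage-collect when a write goes backwards):

#!/usr/bin/env python3
"""Toy model of linear-access optimized garbage collection.
Counts blocks the controller copies for a given write pattern.
Purely illustrative; block and segment sizes are assumptions."""

BLOCKS_PER_SEGMENT = 256   # e.g. a 4 MB segment with 16 KB blocks

def copies_for_pattern(writes):
    """writes: block indices (0..BLOCKS_PER_SEGMENT-1) within one segment."""
    copied = 0
    cursor = 0                       # next block the open physical segment expects
    for blk in writes:
        if blk < cursor:
            # Out-of-order write: fill the rest of the segment from the old copy
            # (garbage collection), then start over in a fresh physical segment.
            copied += BLOCKS_PER_SEGMENT - cursor
            cursor = 0
        # Copy the unchanged blocks sitting in front of the block being written.
        copied += blk - cursor
        cursor = blk + 1             # the new data itself is not a copy
    return copied

if __name__ == "__main__":
    linear = list(range(BLOCKS_PER_SEGMENT))
    reverse = list(reversed(linear))
    print("linear writes copy ", copies_for_pattern(linear), "blocks")
    print("reverse writes copy", copies_for_pattern(reverse), "blocks")

In this model a strictly sequential pass copies nothing, while writing the same segment in reverse order copies close to a full segment per block, which is the pathology described above for elevator-style reverse writes.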

Restrictions on open segments

One major difference between the various manufacturers is how many segments they can write to at any given time. Starting to write a segment requires another physical segment, or two in the case of a data logging algorithm, to be reserved, and requires some RAM on the embedded microcontroller to maintain the segment. Writing to a new segment will cause garbage collection on a previously open segment. That can lead to thrashing as the drive must repeatedly switch open segments; see the animation behind the diagram to the right for a visualization of how that works.

On many of the better drives, five or more segments can be open simultaneously, which is good enough for most use cases, but some brands can only have one or two segments open at a time, which causes them to constantly go through garbage collection when used with most of the common filesystems other than FAT32. When a drive reserves segments specifically to hold the FAT, these will always be open to allow updating the FAT while writing streaming data to other segments.
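Why the number of open segments matters so much can be shown with another simple model: hold the access pattern fixed and count forced garbage collections as a function of how many segments the controller can keep open. The LRU policy below is an assumption for illustration, not a claim about any specific controller:

#!/usr/bin/env python3
"""Toy model of open-segment thrashing.
Counts garbage collections when a drive can keep only N segments open.
Illustrative assumption: touching a segment that is not open forces the
least-recently-used open segment to be garbage-collected and closed."""
from collections import OrderedDict

def garbage_collections(segment_sequence, max_open):
    open_segments = OrderedDict()      # LRU order of currently open segments
    gcs = 0
    for seg in segment_sequence:
        if seg in open_segments:
            open_segments.move_to_end(seg)
            continue
        if len(open_segments) == max_open:
            open_segments.popitem(last=False)   # close (and GC) the LRU segment
            gcs += 1
        open_segments[seg] = True
    return gcs

if __name__ == "__main__":
    # Alternate writes between three segments, roughly what ext3 does when
    # the journal, the inode table and the data live in different places.
    pattern = [0, 1, 2] * 1000
    for n in (1, 2, 4):
        print(f"{n} open segment(s): {garbage_collections(pattern, n)} garbage collections")

With one or two open segments every single access in this pattern triggers a garbage collection, while a drive that can keep all three open never collects at all; that is the difference between the 33KB/s worst case measured earlier and full write speed.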

Partitioning

When a filesystem wants to optimize its block allocation to the geometry of a flash drive, it needs to know the position of the segments on the drive. On partitioned media, this also implies that each partition is aligned to the start of a segment, and this is true for all preformatted SD cards and other media that require special care for segment optimizations.

Unfortunately, the fdisk and sfdisk tools from util-linux make it particularly hard to do this correctly, because they try to preserve an archaic geometry of 255 "heads" and 63 "sectors" and, by default, align partitions to "cylinder" boundaries. None of these units have any significance on today's hard drives or flash drives, but they are kept for backwards compatibility with existing software. The result is that most partitions are as misaligned as possible: they start on an odd-numbered 512-byte sector, which defeats all optimizations that a filesystem can do to align its accesses to logical blocks and segments inside of the partition.

The same problem has been discussed a lot in the light of hard drives with 4KB sectors, but it is much more significant when dealing with flash media. Current versions of fdisk ask the kernel about the physical sector size (BLKPBSZGET) and the optimal I/O size (BLKIOOPT), but these are currently rarely reported correctly by the kernel for flash drives, because the kernel itself does not have the necessary information. SDHC cards report the segment size in sysfs, but this is not used by any partitioning tools, and all cards currently seem to report 4MB segments, even those that actually use 2MB or 8MB segments internally. The linaro-media-create tool (from Linaro Image Tools) has recently been changed to align partitions to 4 MB boundaries when installing to a bootable SD card, to work around this problem.
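For SD/MMC devices the sysfs attribute in question is preferred_erase_size. A small helper can read it and round a proposed partition start up to the next segment boundary, falling back to 4 MiB when nothing useful is reported, which matches the linaro-media-create workaround. A sketch, assuming 512-byte logical sectors and an example device name:

#!/usr/bin/env python3
"""Suggest a segment-aligned start sector for a partition on an SD/MMC card.
Sketch: assumes 512-byte logical sectors and falls back to 4 MiB alignment."""
from pathlib import Path

SECTOR = 512
FALLBACK_ALIGN = 4 * 1024 * 1024   # 4 MiB, the common segment size

def segment_size(dev: str) -> int:
    """dev is e.g. 'mmcblk0'; preferred_erase_size is reported by the mmc driver."""
    p = Path(f"/sys/block/{dev}/device/preferred_erase_size")
    try:
        return int(p.read_text())
    except (OSError, ValueError):
        return FALLBACK_ALIGN

def aligned_start_sector(dev: str, proposed_sector: int) -> int:
    seg_sectors = segment_size(dev) // SECTOR
    # Round up to the next segment boundary.
    return ((proposed_sector + seg_sectors - 1) // seg_sectors) * seg_sectors

if __name__ == "__main__":
    dev = "mmcblk0"                    # example device name
    print("segment size:", segment_size(dev), "bytes")
    print("use start sector", aligned_start_sector(dev, 63), "instead of 63")

The same rounding can be fed to sfdisk or parted when creating partitions by hand; given that most cards report 4MB regardless of their real segment size, treating the reported value as a lower bound is the safer interpretation.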

Future work

There is a huge potential for optimizing Linux to better deal with the deficiencies of flash media in various places in the kernel and elsewhere. With the storage and filesystem summit coming up this April, there is hopefully time to discuss these and other ideas:

- All partition tools should default to a much larger alignment for flash media, e.g. 4 MB or whatever the drive itself reports, and ignore cylinder boundaries.
- The page cache could benefit from the fact that larger accesses end up taking less time than accesses shorter than a flash page. When a drive reads 16KB, the kernel may as well add all of it to the page cache.
- The elevator and I/O scheduler algorithms can do much better than they do today for drives that only do linear access. Ideally, all outstanding writes to one segment should be submitted in order within a segment before moving to another segment (see the sketch after this list).
- A stacked block device can be used to reorder blocks during write, creating a copy-on-write log-structured device on top of drives that can only write to one segment at a time. A first draft design for such a device is available on the FlashDeviceMapper page at Linaro.
- The largest potential is probably in the block allocation algorithm in the filesystem. The filesystem can ensure that it submits writes in the correct order to avoid garbage collection most of the time. Btrfs, nilfs2 and logfs get this right to a certain degree, but could probably get much better.
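As a rough illustration of the I/O scheduler point in the list above, a flash-friendly elevator would batch outstanding writes by segment and submit each batch in ascending order before moving on. The sketch below only models that ordering policy in user space, with an assumed 4 MB segment size; it is not kernel code:

#!/usr/bin/env python3
"""Order pending writes the way a flash-friendly elevator might:
group by segment, then ascending within each segment.
Pure illustration; the 4 MB segment size is an assumption."""
from itertools import groupby

SEGMENT_BYTES = 4 * 1024 * 1024

def flash_friendly_order(pending_offsets):
    """pending_offsets: byte offsets of queued writes, in arrival order."""
    by_segment = sorted(pending_offsets, key=lambda o: (o // SEGMENT_BYTES, o))
    # Emit one segment at a time so only a single segment is open per burst.
    for segment, offsets in groupby(by_segment, key=lambda o: o // SEGMENT_BYTES):
        yield segment, list(offsets)

if __name__ == "__main__":
    queue = [8_400_000, 12_000, 4_200_000, 16_000, 8_500_000, 4_100_000]
    for segment, offsets in flash_friendly_order(queue):
        print(f"segment {segment}: write offsets {offsets}")

On a linear-access optimized card, submitting the queue in this order avoids both the backwards writes and the segment ping-pong that trigger garbage collection in the models shown earlier.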

Resources

More information about specific measurements can be found in the Linaro flash card survey. Readers are welcome to add data about their memory cards and USB drives to the list. The tool that was used to do all measurements is available from git://git.linaro.org/people/arnd/flashbench.git.

Comments

Optimizing Linux with cheap flash drives
Posted Feb 18, 2011 22:02 UTC (Fri) by boog (subscriber, #30882)

How depressing. I have a Dell laptop with an early flash drive. Performance is sometimes awful and I always suspected the drive. (It is still quiet, which is appreciated.) So I should reformat it as FAT to have any hope of a speed-up?

Optimizing Linux with cheap flash drives
Posted Feb 18, 2011 22:39 UTC (Fri) by jengelh (subscriber, #33263)

And lose all the features of a modern POSIX filesystem? Starting with permissions... (unless you go via layers like posixovl)

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 5:00 UTC (Sat) by eru (subscriber, #2753)

> And lose all the features of a modern POSIX filesystem?

Just resurrect the UMSDOS filesystem!

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 8:41 UTC (Sat) by arnd (subscriber, #8866)

It seems that Samsung have done something like this with their RFS (Reliable FAT file system) that they use on a lot of phones with eMMC storage. Unfortunately, their implementation has a lot of other performance problems and, worse, it's not even available under a GPL-compatible license.

Optimizing Linux with cheap flash drives
Posted Feb 22, 2011 17:34 UTC (Tue) by ttonino (subscriber, #4073)

My Samsung Galaxy S shows:

On the NAND flash with a software translation layer:

/ rootfs ro,relatime
/mnt/.lfs j4fs rw,relatime
/system rfs ro,relatime
/dbdata rfs rw,relatime
/cache rfs rw,relatime

On the internal SD device:

/data rfs rw
/mnt/sdcard vfat rw,dirsync,noatime,nodiratime
/mnt/sdcard/external_sd vfat rw,dirsync,noatime,nodiratime

Thus, rfs and normal vfat are both used. The rfs description reminded me of the phase tree FAT implementation that was going around a long time ago. The software translation for the NAND flash is GPL.

Optimizing Linux with cheap flash drives
Posted Feb 18, 2011 23:26 UTC (Fri) by bronson (subscriber, #4806)

Maybe your partitions are unaligned?

unaligned partitions
Posted Feb 23, 2011 12:11 UTC (Wed) by alex (subscriber, #1355)

We found that write performance for small files improved by at least a factor of 10 on our embedded SSDs when we fixed the partition alignment. fdisk doesn't help when it tries to work with the fake geometry most SSDs report.

unaligned partitions
Posted Feb 24, 2011 16:14 UTC (Thu) by mgedmin (subscriber, #34497)

How do you check for partition alignment? fdisk -u -l /dev/sdX? I'm guessing 63 sectors (of 512 bytes) is not a good alignment? Are there any tools for fixing partition alignment? Does parted's "move" command shift the data in the partition, or just adjust the boundaries in the partition table?

unaligned partitions
Posted Feb 24, 2011 18:53 UTC (Thu) by meyering (guest, #48285)

You can make parted list the partition table in units of sectors, then ensure that (assuming 512-byte sectors) each partition's start sector is divisible by some round number, like 2048 if you want them to be 1MiB-aligned. For example, here all partitions are MiB-aligned, except for the first one, which is only 32KiB-aligned. But since it's only for grub, that is ok:

$ parted -m -s -- /dev/sdb u s p free
BYT;
/dev/sdb:117231408s:scsi:512:512:gpt:ATA OCZ-VERTEX2;
1:34s:63s:30s:free;
1:64s:4095s:4032s:ext2:_grub_bios:bios_grub;
2:4096s:1048575s:1044480s:ext3:_/boot:boot;
3:1048576s:12582911s:11534336s:ext4:_/:;
4:12582912s:16777215s:4194304s:linux-swap(v1):_/swap:;
5:16777216s:37748735s:20971520s:ext4:_/usr:;
6:37748736s:52428799s:14680064s:ext4:_/var:;
7:52428800s:52449279s:20480s:ext4:_/full:;
8:52449280s:117229567s:64780288s:ext4:_/h:;
1:117229568s:117231374s:1807s:free;

Please do not use parted's "move" command. It is risky since it tries to be smart and is file-system aware. In addition to moving the partition it may try to move an embedded file system, too, but its built-in FS-aware code is so old and unreliable that it is slated to be removed altogether. If you try to use that sub-command (or e.g., mkpartfs, which is in the same boat), recent versions of parted will emit a big warning telling you some of the above.

unaligned partitions
Posted Feb 25, 2011 0:15 UTC (Fri) by jnh (subscriber, #69758)

It depends on the physical sector size of the device. If you have physical sectors of 512 bytes, then partitions measured in 512-byte logical sectors are aligned regardless of where they start; with larger physical sectors, starting a partition at LBA 63 isn't going to be aligned. Annoyingly, many SSDs do not correctly report their true internal topology, so even modern partitioning tools which can use that information may need to be given hints; but that said, it isn't immediately clear to me exactly what an SSD should report its topology as, given the current interfaces. I recommend reading Martin K. Petersen's advanced storage papers from http://oss.oracle.com/~mkp/

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 8:29 UTC (Sat) by roblucid (subscriber, #48964)

IIRC early OEM drives may have poor controllers (like the JMicron) with things like the so-called "write stutter" problem once you'd run out of virgin pages (segments?). A "secure erase" utility would be required to restore original performance. There should be some good articles on SSD performance issues (mainly or wholly used with Windows but still should be relevant enough), and you could perhaps see if an SSD upgrade to more modern firmware might be beneficial.

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 8:31 UTC (Sat) by roblucid (subscriber, #48964)

Whoops, I left out the site name: AnandTech, http://www.anandtech.com/

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 8:47 UTC (Sat) by arnd (subscriber, #8866)

The CF card I mentioned in the article is similar to a lot of the really cheap SSDs. It can have three 4 MB segments open for linear writing and one for random access, which is apparently just not enough for ext3 with a journal. Aligning the partition to 4 MB and changing to btrfs solved it for me for that card. I have not yet done thorough testing to find out what the specific requirements of the possible file systems are, but it's worth a try. I'd also be interested to see what flashbench shows about this drive.
If you can create an empty 4 MB aligned partition on it, please run it and send the results to flashbench-results@lists.linaro.org.Optimizing Linux with cheap flash drivesPosted Feb 19, 2011 10:30 UTC (Sat) by aleXXX (subscriber, #2742) [Link]Very nice article, but what is a bit unclear to me is whether these issues mostly apply to CF- and SD-cards and USB memory sticks or do they also apply to SSD drives ?AlexOptimizing Linux with cheap flash drivesPosted Feb 19, 2011 10:40 UTC (Sat) by arnd (subscriber, #8866) [Link]The cheapest SSD drives are basically CF cards in a different form factor, or nowadays with a PATA-SATA converter. This will show the exact same behavior as good SD cards.High-end SSDs come with significant amounts of RAM that can be used to hide most of the nasty effects, or to do something much smarter altogether, such as implementing the entire drive as a log structured file.The caching unfortunately makes it a lot harder to reverse-engineer the drive through timing attacks, so it's much harder to tell what it really does.What we know is that the underlying NAND flash technology is very similar, so in the best case, an SSD will be able to hide the problems, but not completely avoid them. If I were to design an SSD controller, I'd do the same things that I'm suggesting in https://wiki.linaro.org/WorkingGroups/KernelConsolidation...Optimizing Linux with cheap flash drivesPosted Feb 19, 2011 18:23 UTC (Sat) by aleXXX (subscriber, #2742) [Link]You mention read/write speed around 15 MB/s.How does that fit together with the number between 150 and 350 MB/s which are listed for SSD drives e.g. on alternate.de ?Actually I can remember that when writing to raw NAND we had also rates somewhere in the 10 to 15 MB/s range.AlexOptimizing Linux with cheap flash drivesPosted Feb 19, 2011 20:03 UTC (Sat) by arnd (subscriber, #8866) [Link]15 MB/s is typical for good SD cards (e.g. Class 6), which are limited by design to 20-25 MB/s anyway (UHS-1 SDHC will be faster, but is still rare today). High-end SSDs can be much faster for a number of reasons:* SATA is a much faster interface than SD/MMC* NCQ and write caching allows optimizing the accesses by reordering and batching NAND flash accesses* Using SLC NAND instead of MLC improves raw accesses* Using multiple NAND chips in parallel gives a better combined throughput* Expensive microcontrollers on the drive can use smarter algorithmsAll of these cost money, so you don't find them on the low end drives that I analyzed.Optimizing Linux with cheap flash drivesPosted Feb 20, 2011 9:12 UTC (Sun) by alonz (subscriber, #815) [Link]Actually, according to information in this AnandTech article, some high-end controllers use even weirder techniques... (They mention specifically real-time compression and real-time deduplication, and there's likely a lot more)Optimizing Linux with cheap flash drivesPosted Apr 6, 2011 18:36 UTC (Wed) by taggart (subscriber, #13801) [Link]The compression and deduplication of the Sandforce controller show big benefits over the controllers that don't have them. But those benefits are lost if your data isn't compressable/redundant like if it's encrypted :(Live CDPosted Apr 20, 2011 19:12 UTC (Wed) by dmarti (subscriber, #11625) [Link]What about a live CD that you boot from, type "yes I want to trash my flash drive" and it automatically tries different partition schemes, runs benchmarks, and tells you which one is fast? Don't trust what the drive says, just try it a bunch of possible ways and see what works for real. 
(I'd pay $14.95 for the iso assuming the underlying code was Free.)Optimizing Linux with cheap flash drivesPosted Feb 18, 2011 22:13 UTC (Fri) by ewen (subscriber, #4772) [Link]Excellent article. Thanks for the research and writing it up.Reading the article, it seems to me that we're going through another "abstracted by an IO controller" shift similar to what happened when SCSI and then IDE drives were introduced. Before that -- back in the late 80s/early 90s hard drives ("MFM" and then the later "RLL" with 33% more storage!) -- were interfaced with motor control signals and analogue read/write head signals through to a controller card in the computer which managed the drive at a pretty low level (mostly took care of the digital to analogue signals, and a little bit of the timing requirements); the disk driver was required to do the rest in software, including things like remapping bad blocks. As drives with SCSI (and later IDE) controllers on board were introduced, this detail was abstracted away -- it became "read block", "write block", etc, with quite a bit of abstraction as to how that was done. And this abstraction has only increased over time, to the point where (as the article notes), things like heads/cylinders/sectors are completely hidden and irrelevant. (And unlike the MFM/RLL days, there very much aren't the same number of sectors per track; old drives were used with constant angular velocity, and modern drives are used with constant linear velocity.) So basically as things got more complicated, it was offloaded to a separate computer -- built onto the (SCSI/IDE) drive -- and it took a while to get those to the point of being optimally efficient. It also took a _long_ time for OS developers to get over the idea that they could optimise for track layout, etc, the way they used to have to do so when "closer" to the media.With the shift from raw-NAND-storage to SSD/MMC/CF/IDE-attached/SCSI-attached flash, we seem to be at a similar point to where we were with magnetic storage in the early 1990s: everything is clearly moving in the direction of offloading the IO work to a separate controller (the flash controller), but the implementations in those separate controllers are either fairly inefficient and naive, or end up optimising for a particular benchmark ("FAT32 file system", "video recording/playback"). Hopefully given a few more years, and increasing uses in situations where FAT32 isn't the answer (eg, lots of non-Win-XX based embedded devices) the flash controller firmware will improve to be more generally optimised for scenarios other than FAT32/video. (And presumably over the same time micro controller costs will come down a bit so that, eg, not everything has to be fitted into 2000 entries in RAM in a tiny micro controller.)Until then, knowing what the algorithms are inside these devices really does help in making more optimal choices of device for the workload, and in designing file systems to use on them. (Finally there seems to be a media type that really cries out for a log based file system at either the flash controller level or the OS level -- or both. On magnetic media the seeks required used to be a major problem, and flash eliminates that issue.)EwenOptimizing Linux with cheap flash drivesPosted Feb 18, 2011 22:36 UTC (Fri) by zlynx (subscriber, #2285) [Link]The firmware in 2.5" SSDs is already at the advanced levels you are talking about. 
They can handle any sort of access pattern and maintain great performance while doing it.The problem is going to remain with CF and SD cards and memory sticks. These small devices cannot include the powerful processsing chips and/or 64 MB DRAM that some of the SSDs require for their magic.I think these small devices would be much better off to create a new bypass mode in the IDE/SCSI/USB that exposes the actual flash to the host system. Give the real block size, block error status, erase, read and write commands. Might also need commands to access special metadata write areas like block maps, last written block, etc. Then we could run real flash filesystems on them.Optimizing Linux with cheap flash drivesPosted Feb 18, 2011 23:35 UTC (Fri) by bronson (subscriber, #4806) [Link]> I think these small devices would be much better off to create a new bypass modeI think everyone will agree with this. But, short of waving a magic wand (i.e. Microsoft or Intel write specs), I don't see any way of making this happen. It's a monster chicken-and-egg problem: OSes can't add support for devices that don't exist, and vendors won't bother implementing a raw interface until OSes can use it.Optimizing Linux with cheap flash drivesPosted Feb 19, 2011 0:00 UTC (Sat) by saffroy (subscriber, #43999) [Link]Also I suspect the patents on FTL (flash translation layer) algorithms still make it hard for free OSes to use similar approaches.Optimizing Linux with cheap flash drivesPosted Feb 19, 2011 12:40 UTC (Sat) by willy (subscriber, #9762) [Link]It's thought to be too hard to "release specs". The algorithms for handling any particular generation (and type) of Intel flash are substantially different from each other. That's just one manufacturer ... Linux would have a horrendous time trying to keep up with the dozens of flash manufacturers each releasing a new generation of flash every 18 months, possibly in several different flavours (1, 2, 3 and 4 bit per cell).It's probably not even possible for Linux mainline to keep up with that frequency, let alone the enterprise distros or the embedded distros (I was recently asked "So what changed in the USB system between 2.6.10 and 2.6.37?"). And then there's the question about what to do for other OSes.It's not just a question of suboptimal performance if you use the wrong algorithms for a given piece of flash; there are real problems of data loss and the flash wearing out. No flash manufacturer wants to be burdened with a massive in-warranty return because some random dude decided to change an '8' to a '16' in their OS that tens of millions of machines ended up running.So yes, as Arnd says, the industry is looking to abstract away the difference between NAND chips and run the algorithms down on the NAND controller. I'm doing my best to help in the NVMHCI working group ... seehttp://www.bswd.com/FMS10/FMS10-Huffman-Onufryk.pdf for a presentation given last August.(I work for Intel, but these are just my opinions).Optimizing Linux with cheap flash drivesPosted Feb 19, 2011 19:06 UTC (Sat) by ewen (subscriber, #4772) [Link]Perhaps the middle ground is to come up with some (de facto?) standardised way for manufacturers to categorise how their flash algorithms are optimised. (In addition to any minimum/maximum speed claims, etc.) "Optimised for video streaming", and "optimised for FAT32" being two raised by the article as relatively common, but there's a need for several more categories. 
At least that way, even without knowing the exact details of how, one could attempt to match the media purchased to the intended workload. Because at the moment it seems tricky as a purchaser to do that, outside perhaps video streaming and assuming everything else is optimised for the FAT file system on it at purchase.EwenOptimizing Linux with cheap flash drivesPosted Feb 19, 2011 21:03 UTC (Sat) by arnd (subscriber, #8866) [Link]The NVMHCI concept (thanks for the Link!) makes a lot of sense at the high end where the drives can be smart enough to do a good job at providing high performance and low wear.However, at the low end that I looked at, most drives get everything wrong to start with: there is too little RAM and processing power to do the reordering that would be needed for ideal NAND access patterns, the drives only do dynamic wear leveling, if any, so they break down more quickly than necessary.The way that the SD card association deals with the problem is to declare all file systems other than FAT32 (with 32KB clusters) unsupported.What we'd instead need for these devices is indeed a way to be smarter in the host about what it's doing. The block discard a.k.a. trim logic is one example of this that sometimes works already, but is not really enough to work with dumb controllers. What I'd like to see is an abstraction on segment level, using commands like "open this segment for sequential writes", "garbage-collect this segment now", "report status of currently open segments", "how often has this segment been erased?".Optimizing Linux with cheap flash drivesPosted Feb 19, 2011 22:40 UTC (Sat) by willy (subscriber, #9762) [Link]Yes, these are quite different devices ... I would estimate a factor of 100+ difference in price, and probably similar factors in terms of capacity, speed, power, etc, etc.The API you're suggesting makes a ton of sense for the low end devices. I don't think there's a whelk's chance in a supernova of it coming to anything, though. You'd need the SD association to require it, and I can't see it being in the interest of any of their members. When the reaction to "hey, your cards suck for this other filesystem" is "your filesystem is wrong", I can't see them being enthusiastic about something this radical.I do see that Intel are members. I'll try to find out internally if something like this could fly.Optimizing Linux with cheap flash drivesPosted Feb 20, 2011 3:25 UTC (Sun) by Oddscurity (subscriber, #46851) [Link]So in summary, if I want to run an ext3 filesystem on a USB stick, I'm better off formatting the stick as FAT32 and then running the ext3 as a loop?Or would that be the wrong conclusion?Not that it's all I took away from this great article, but I'm wondering what I can do in the meantime to optimise my use with such devices.Optimizing Linux with cheap flash drivesPosted Feb 20, 2011 4:07 UTC (Sun) by ewen (subscriber, #4772) [Link]Alas, no, running ext3 in a loop on FAT32 doesn't magically change your file system access patterns from ext3 access patterns to FAT access patterns. 
(Eg, in that scenario the FAT would hardly ever change since you allocate a huge file for the loop and then just write within it, versus native FAT32 with it changing with each file change, so the cheap flash drives optimisation for the 4MB holding the FAT would be wasted; and you'd still get random updates frequently in "unexpected" -- by the naive firmware -- locations.)It appears if you want to run ext3 on a cheap flash drive, you pretty much have to assume that it's going to be slower than advertised (possibly MUCH slower, especially for write), and that there's a very real risk of wearing out some areas of the flash faster than might be expected. Probably okay for a mostly-read workload if you ensure that you turn off atime completely (or every read is also a write!), but not ideal for something with regular writes.If it's an option for your use case, then sticking with the original FAT file system -- and using it natively -- is probably the least bad option. Certainly that's what I do with all my portable drives that see any kind of regular updates. (It also has the benefit that I don't have to worry about drivers for the file system on any system I might plug it into.)EwenOptimizing Linux with cheap flash drivesPosted Feb 20, 2011 14:11 UTC (Sun) by Oddscurity (subscriber, #46851) [Link]Thanks for the comprehensive answer.I may as well switch to just FAT32 for part of the use cases and the other ones are dominated by reads, so can stay on ext.Optimizing Linux with cheap flash drivesPosted Feb 21, 2011 14:41 UTC (Mon) by marcH (subscriber, #57642) [Link]> It's a monster chicken-and-egg problem: OSes can't add support for devices that don't exist, and vendors won't bother implementing a raw interface until OSes can use it.Agreed, and any way out of this situation would require (at least) a transition phase were some devices support either mode, letting the operating system choose.Is such a "dual-mode" technically feasible?Optimizing Linux with cheap flash drivesPosted Feb 22, 2011 14:44 UTC (Tue) by etienne (subscriber, #25256) [Link]I do not know a lot about it, but there is specs athttp://onfi.org/specifications/There is even a connector for FLASH looking like the SDRAM connector.Re: ONFI (Optimizing Linux with cheap flash drives)Posted Apr 27, 2011 20:44 UTC (Wed) by frr (guest, #74556) [Link]Thanks for that link :-) I've noticed that industry group before, but didn't pay much attention. To me, it's been just another flash chip interface standard from the JEDEC stable - notably without Samsung :-) After your remark about the standard connectors, I've taken a better look...Since 2006 or 2007, there have been several revisions of the ONFI interface standard: 1.0, 1.1, 2.0, 2.1, 2.2, 2.3, and recently 3.0. The most visible differences are in transfer rate. The "NAND connector" spec from 2008 is a separate paper - not an integral part of the main standard document. The NAND Connector paper refers to ONFI 1.0 and 2.0 standards documents. But - have you ever seen some motherboard or controller board with an ONFI socket? I haven't. In the meantime, there's ONFI 3.0 - it postulates some changes to the set of electrical signals, for the sake of PCB simplification - but there's no update to the "NAND connector" paper. To me that would hint that the NAND connector is a dead end - a historical branch of evolution that has proved fruitless... 
Please correct me if I'm wrong there, as I'd love to be :-)ONFI 3.0 does refer to an LGA-style socket (maybe two flavours thereof), apart from a couple of standard BGA footprints. Which would possibly allow for field-replaceable/upgradeable chip packages, similar to today's CPU's. Note that the 3.0 spec doesn't contain a single occurrence of the word "connector" :-)As far as I'm concerned, for most practical purposes, ONFI remains a Flash chip-level interface standard. It seems ONFI is inside the current Intel SSD's - it's the interface between the flash chips and the multi-channel target-mode SATA Flash controller. The multiple channels are ONFI channels. The SATA Flash controller comprises the SSD's disk-like interface to the outside world, and does all the "Flash housekeeping" in a hidden way.Note that there's an FAQ at the ONFI web site, claiming that "No, ONFI is not another card standard."From a different angle, note that the ONFI electrical-level interface (set of signals, framing, traffic protocol) is different from the native busses you can typically see in today's computers, such as FSB/QPI/PCI-e/PCI/LPC/ISA/DDR123_RAM. ONFI is not "seamless" or "inherent" to today's PC's: you have nowhere to attach that bus to, such that you'd have the Flash memory e.g. linear-mapped into the host system's memory space - which doesn't look like a good idea anyway, considering the Flash capacities and the CPU cores' address bus width (no it's not a full 64 bits - it's more like 32, 36 or maybe slightly more with the Xeons). Getting a "NAND connector" slot in your PC is not just a matter of the bus and connector and some passive PCB routing to some existing chipset platform. You'd need a "bridge" or "bus interface", most likely from PCI-e to ONFI (less likely straight from the root complex / memory hub). For several practical purposes, the hypothetical PCI interface would likely use a MMIO window + paged access to the ONFI address space, or possibly SG-DMA for optimum performance. I could imagine a simple interface using a general-purpose "PCI slave bridge" with DMA capabilities, similar to those currently made by PLX Corp. - except that those cannot do DDR, the transfer rates are too low, the FIFO buffers are perhaps too small for a full NAND Flash page and the bridges can't do SG-DMA... The initiative would IMO have to come from chipset makers (read: Intel) who could integrate an ONFI port in the south bridge. I haven't found a single hint of any initiative in that vein. There are even no stand-alone chips implementing a dedicated PCI-to-ONFI "dumb bridge". Google reveals some "ONFI silicon IP cores" from a couple fabless silicon design companies - those could be used as the ONFI part of such a bridge, if some silicon maker should decide to go that way, or maybe some are "synthesizable" in a modern FPGA.As for the basic idea, which is to "present raw NAND chips to the host system and let the host OS do the Flash housekeeping in software, with full knowledge of the gory details": clearly ONFI isn't going that way. And quite possibly, it's actually heading in precisely the opposite direction :-) There is a tendency to hide some of the gory details even at the chip interface level. On the ONFI Specs page you can find another "stand-alone paper" specifying "Block Abstracted NAND", as an enhancement to the basic ONFI 2.1 standards document. The paper is also referred back to by the ONFI 3.0 standard (where it lists BA NAND opcodes). 
Looks like an "optional LBA access mechanism to NAND Flash" (does this correlate with the moment SanDisk got a seat at the ONFI table, by any chance?) And in the ONFI 3.0 spec, you can find a chapter on "EZ NAND", which is to hide some of the gory details of ECC handling (at the chip interface level).Ahh well...Optimizing Linux with cheap flash drivesPosted Feb 24, 2011 21:35 UTC (Thu) by ajb (subscriber, #9694) [Link]Possibly it would be easier, instead of exposing the internals of the SD card, for the OS to provide computation and memory services to the SD card. This would have to be optional, because the SD card might be plugged into a cheap camera or something with no memory either. But it would be a fairly simple interface, which would not need to change based on the card internals.Optimizing Linux with cheap flash drivesPosted Feb 24, 2011 22:27 UTC (Thu) by zlynx (subscriber, #2285) [Link]Perhaps a Java class stored on the storage card. It could implement some well-defined interface type and its constructor could take some parameters for things like a hardware interface class, memory buffer, debug logger and a few other things.It could run from userspace with the right interface class. Or from the kernel if someone wrote a simplified Java interpreter or maybe a module compiler.I suppose instead of Java it could be written in whatever VM it is that ACPI uses. Kernels already have interpreters for that.It could be fairly nifty.Optimizing Linux with cheap flash drivesPosted Feb 18, 2011 23:54 UTC (Fri) by ewen (subscriber, #4772) [Link]That's good news about the 2.5" SSD flash drives. From what I'd read to date it seemed like the better ones were using decent (random read/write performance oriented) algorithms, but the cheaper ones were still using a fairly naive approach (and, eg, getting much slower as soon as all the flash cells had been written at least once). Perhaps we're further along the "adopt the IO controller" curve this time than I thought.As you say, the (physically) smaller devices (CF/SD/etc), especially at the low cost end of the market, are always going to be constrained by available processing resources. So maybe some sort of "direct media control" API is the most optimal answer at the low end, especially if we can avoid the worst of the "fakeraid" situation (one-OS-version-only binary blobs). (There's also a higher risk of bricking the drive if you're moving, eg, a SD card back and forth between something doing its own low level access and something using the higher level API and firmware access. But like the NTFS driver presumably eventually enough the details will be right that people can trust it. And embedded devices with, eg, internal SD, can mostly ignore that risk.)EwenOptimizing Linux with cheap flash drivesPosted Feb 19, 2011 9:03 UTC (Sat) by roblucid (subscriber, #48964) [Link]I just can't see why or who in the industry is going to have enough interests in this to "make it so". These cards that I come across are intended for storing camera type output, creating jpg and avi files or in embedded Sat Nav systems (replacing DVD).They may be slow (and especially on writes) but for intended purpose, they're quick enough, how is there ever going to be momentum to create a market standard to convenice the small minority "enthusiast" market segment, who want to "hack" around with the hardware.The SSD drive manufacturers see peformance block I/O support for NTFS, ext3/4, xfs and eventually btrfs as a way to add value and differentiate their product. 
Any alternative to the ATA interface, requires widespread software & hardware support, perhaps it would have happened if Flash had the kind of marketing hoop-la & attention that CPU architecture receives. The fact is, MS tried Readyboost feature in Vista to allow flash drives for fast virtual memory, and it flopped horribly in practice as the drives weren't quick enough, and memory prices fell fast enough to throw RAM at the paging problem.Now perhaps there has been an opportunity in smart Phones and embedded manufacturers wanting to avoid MS patent taxes on use of FAT; but again when I read around on reviews, noone talks about filesystem performance, they're getting excited by multi-core, to resolve possible latency issues which show up as "UI sluggishness". It seems again that either the performance of the flash drive is good enough, or they've mitigated the issues in products and that becomes part of the competitive advantage.Without encumbunts needing it, who's going to develop "direct media control"?Optimizing Linux with cheap flash drivesPosted Feb 18, 2011 22:49 UTC (Fri) by jengelh (subscriber, #33263) [Link]>and modern drives are used with constant linear velocity.)I beg to differ. CAV (with zoned recording), please. You know the horrid seek times of CLV as used with CD-ROM.Optimizing Linux with cheap flash drivesPosted Feb 18, 2011 23:44 UTC (Fri) by ewen (subscriber, #4772) [Link]Yes, "zoned CAV" was what I had in mind when I wrote "constant linear velocity" for modern hard drives: the idea being that in the areas where there's more media passing under the head, you can have higher bit density than in the areas where there's less media passing under the head. But as you point out trying to optimise this on a per-track basis is horribly counter productive.I probably should have said "semi-constant" linear velocity (the "zoned CAV" term didn't even come to mind until you pointed it out).EwenOptimizing Linux with cheap flash drivesPosted Feb 19, 2011 0:57 UTC (Sat) by jnh (subscriber, #69758) [Link]To be fair, recent versions of util-linux are improved with respect to partition alignment. fdisk tries to get it right. sfdisk uses whatever units you tell it to, so you've always had to know what you're doing if you use it, and I don't think its ever had any auto-alignment logic. Unfortunately I don't think cfdisk has been updated to take topology clues from the kernel into account yet, so there it might be fair to say it makes it hard to do partitioning correctly. But yes, the topology info is only useful if the device doesn't lie, as the article underscores. Hardware RAID controllers are another sore spot when it comes to all this.Optimizing Linux with cheap flash drivesPosted Feb 21, 2011 10:51 UTC (Mon) by etienne (subscriber, #25256) [Link]Note that for FAT filesystems, the start of the partition has very few relation with the data block alignement, because the FAT itself is located at the beginning and is of variable size (depends on the number of clusters).When I have written the FAT creation support for the Gujin bootloader, I did try to adjust and display the alignment of data clusters, there is all fields needed in the FAT superblock to leave gaps to align anything you want.Obvioulsy you should not try to *move* the whole partition and change the alignment of the first sector, even if there are tools to do so.thanks, NTFS and utilsPosted Feb 19, 2011 1:27 UTC (Sat) by pflugstad (subscriber, #224) [Link]First and foremost, thank you for an excellent article. 
This is why I pay for LWN and it's payed off again. I would think that with more and more systems using WinXP/Vista/7, many of these drives would start optimizing for NTFS instead of FAT, especially the higher-end SSDs. I don't know anything about NTFS, but presumably it doesn't have the same access pattens as FAT. I believe it has a log like ext3, right? Is this happening, or is that what the "data logging" drives are doing? Anyway to tell other than by testing which algorithms a drive uses?Are there utils available for testing the layouts of drives and trying to optimize them for your usage pattern. Even something like whatever scripts were created to create the plots from this article would be useful I think. At a minimum, they can help you identify drives that are not optimized in a useful way. Finally, other than optimizing partitioning and layout, are there benefit to using a filesystem such as logfs, jffs2, etc? (hmmm... can one even use them with a block device?)Again, thanks for the article. thanks, NTFS and utilsPosted Feb 19, 2011 9:04 UTC (Sat) by arnd (subscriber, #8866) [Link]For the extreme low end (SD cards, specifically), NTFS does not help because the SD card standard mandates not only the exact type of file system to be used (FAT16 up to 2 GB, FAT32 from 4 to 32 GB, ExFAT from 64 GB to 2 TB), but also the specific layout.For USB sticks, the situation may be a little better because I've seen ads for ones that are allegedly optimized for NTFS. I have yet to get hold of those to find out what they do.On the negative side, I have seen one USB stick with the typical FAT optimization (the second 4 MB segment being optimized for random access), which came preformatted with the FAT in another segment.Regarding logfs, in theory it should be really well optimized for SSDs and it can work on a block device. Unfortunately, it's designed on the assumption that you can have around a dozen segments open at a time, which I have shown not to be possible on most media. However, if you can get the alignment and the layout right, logfs should still give you the best possible performance of the drive as long as you don't do a single fsync, at which point it will theoretically get into the worst case of thrashing.jffs2 is completely useless on multi-gigabyte media, and does not really work on block devices.ubifs on top of ubi on top of mtd on top of block2mtd on top of the block device might be an option, but stacking so many layers sounds scary to me.Monitoring wear levelingPosted Feb 19, 2011 2:20 UTC (Sat) by ayers (subscriber, #53541) [Link]I keep wondering whether there is a standard interface (SMART attribute?) which measures/monitors the wear leveling to indicate when it's time to replace the SSD/SD cards before the risk of data loss exceeds some probability.Monitoring wear levelingPosted Feb 20, 2011 0:47 UTC (Sun) by rwa (subscriber, #69887) [Link]Well not really standard but Intel SSDs support something like this. 
This is from one of our build servers:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age  Offline  -           0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age  Offline  -           0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age  Always   -           11
  9 Power_On_Hours          0x0032   100   100   000    Old_age  Always   -           3516
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age  Always   -           26
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age  Always   -           11
225 Host_Writes_32MiB       0x0030   200   200   000    Old_age  Offline  -           715312
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age  Always   -           3701
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age  Always   -           1
228 Workload_Minutes        0x0032   100   100   000    Old_age  Always   -           16928063
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail Always   -           0
233 Media_Wearout_Indicator 0x0032   096   096   000    Old_age  Always   -           0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail Always   -           0

The wearout indicator (233) started at 100 and is now at 96 after writing 21TB in almost 5 months. It's a 160GB X-25M from Intel.

Monitoring wear leveling
Posted Feb 20, 2011 3:33 UTC (Sun) by dougg (subscriber, #1894)

# smartctl -a /dev/sda
...
Model Family: Intel X18-M/X25-M/X25-V G2 SSDs
...
  9 Power_On_Hours          0x0032   100   100   000    Old_age  Always   -           1028
...
233 Media_Wearout_Indicator 0x0032   099   099   000    Old_age  Always   -           0
...

No need to wonder.

Firmware
Posted Feb 19, 2011 3:02 UTC (Sat) by pabs (subscriber, #43278)

Firmware is top of the list of why we can't have nice things.

Superb article
Posted Feb 19, 2011 12:01 UTC (Sat) by kragilkragil (guest, #72832)

I really hope most of the important kernel hackers have a LWN subscription. BTW, does Linus read LWN? I think my 2-year-old EeePC 901 will get a better second life when this work is included in 2.6.39.

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 18:32 UTC (Sat) by Tara_Li (subscriber, #26706)

I'm most interested in this in terms of speeding up my general, fairly vanilla desktop system. To wit, it'd be nice to grab a couple of 8GB or 16GB USB2 keys and do something nice with them. Different parts of the filesystem have different access patterns: /etc, /bin, /sbin, and /usr are like 99% read; /tmp, /var, and /home are a lot of read/write mix; and something with swap would be nice too. What actually works for this if someone wanted to do something along these lines?

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 19:24 UTC (Sat) by Richard_J_Neill (subscriber, #23093)

It's important to analyse this correctly: how much of the workload is

- first-time reads (at boot, or new data)
- cached reads
- in-principle-cached reads (if you had enough RAM)
- writes

In terms of speeding up a Linux desktop system, the key benefit is going to arise from lots of RAM (8GB or more) to make perfect use of the filesystem cache. This won't speed up boot, but after a few days' uptime, all the apps will be starting from memory. USB flash disks really won't get you much. After that, the next important thing is to use ext4 (with relatime) rather than ext3. USB flash devices are great for low-cost, low-power, small-size, silent systems. But they are still terrible for writes. Also, swap isn't something for "normal" use anymore: it's there so that when you run right out of RAM, your system doesn't OOM-kill an app, but just gets very slow instead.
So don't worry about it.

Optimizing Linux with cheap flash drives
Posted Feb 20, 2011 2:23 UTC (Sun) by Tara_Li (subscriber, #26706)

*meh* Since I only *have* 4 GB, and most of my apps seem to want to alloc it all and let none of it go, I'm usually filling up 2 GB of swap on top of the RAM every day or two. So basically, what you're saying is that flash keys will remain the floppy disks of our age.

I really wish I could turn off caching on some files - there's no good reason to read-cache an MP3 or video file - it's going in and out of the app, and pretty much wastes memory when it's done so.

Frankly, I'm not sure it's worth going to 8 GB of RAM - I'd have to move to 64-bit code, which, from what I can tell, takes at least twice as much (and sometimes more) RAM to execute in, with all of its alignment issues.

Optimizing Linux with cheap flash drives
Posted Feb 20, 2011 2:33 UTC (Sun) by Richard_J_Neill (subscriber, #23093)

Well, if you are stuck with only 4 GB (and presumably 32-bit), then there might be some mileage in putting swap on USB keys - but may I advise you to RAID-1 them (and use different root controllers). Incidentally, yes, 64-bit is worth having: it allows lots more RAM, it allows apps to *map* more than 4 GB, and the architecture, though wasting some space on larger pointers, has more registers, so it's still a win. It also lets Firefox address 8 GB of swap (this is purely for memory leaks, so it doesn't matter that it's slow: fx merely has to write the data successfully, it will never read it again).

As for caching files, you *might* be able to do it with mplayer.

Optimizing Linux with cheap flash drives
Posted Feb 20, 2011 13:45 UTC (Sun) by nix (subscriber, #2304)

> I really wish I could turn off caching on some files - there's no good reason to read-cache an MP3 or video file - it's going in and out of the app, and pretty much wastes memory when it's done so.

It takes code changes, but posix_fadvise() takes a POSIX_FADV_NOREUSE flag which may be useful for mpg123 et al (if indeed they are actually using most of the data only once: I don't know enough about media formats to know if this is true). But mpg123 doesn't use it... and the kernel doesn't use it either. Oops.

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 19:32 UTC (Mon) by Tara_Li (subscriber, #26706)

Yeah, oops.

I'm looking at the fact that, by far, the thing slowing my system down the most is my hard drives. And looking at the specs of the newer hard drives, I'm not seeing any real advances in anything *BUT* capacity, and maybe in really large reads/writes. But the myriad of little reads and writes from Gnome/KDE putting snippets of XML into individual files, the .cache directory, the web browser and its plethora of relatively tiny files, the oftentimes widely scattered reads and writes of swap... these are getting no loving whatsoever. Meanwhile, everyone's saying just add RAM. Except my motherboard only goes up to 8 GB of RAM, and I'm not so sure that's really going to improve things that much.

And I'm caching stuff that it doesn't seem to make sense to - multi-megabyte audio files, half-gigabyte and up video files - and once a program allocates itself some RAM, it *SURE* doesn't seem to want to give it up.
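On the wish above for uncached media files: there is no per-file mount option, but GNU coreutils dd (8.11 or later) can ask the kernel to drop a single file's cached pages via posix_fadvise(POSIX_FADV_DONTNEED), a blunter cousin of the POSIX_FADV_NOREUSE hint mentioned earlier. A minimal sketch, with a hypothetical file name:

    # Advise the kernel to drop any cached pages of this file;
    # count=0 means no data is actually copied.
    dd if=big-video.mkv iflag=nocache count=0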
(Of course, this is really hard to be sure of, because any long-time Linux user will tell you that the memory columns in ps don't really mean anything useful, and when you dig further, it's hard to be sure the *KERNEL* knows what memory belongs to whom - although it does do a good job of cleaning up after an application crashes, so it knows *SOMETHING*.)

So why not a two-pronged approach? A file-system bit, like the sticky bit used to be to force caching of some kind (Linux nearly ignores it, apparently), but to tell the kernel that said file is *NOT* to be cached. With /etc, /bin, etc. all on a flash drive, there'd really be little reason to cache them, so you could mark them so - and you could also mark media files likewise, and not need to change the apps. There could be a flag for opening the file to force caching of it (though I'm not sure I can see why), allowing user space to handle things as it feels the need; but if it doesn't, the mostly-right thing happens anyway. updatedb would use the call to open *EVERYTHING* in non-caching mode. vlc wouldn't need to worry about it - it could do the limited caching it needs to operate, since it could assume media files wouldn't be cached if the user doesn't want them to be - while other programs could say "cache it", even if it doesn't seem to need it.

And for that matter - does the OOM killer create its own swap for some reason? I turned off swap a few minutes ago, after a reboot, opened some memory hogs, and when memory filled up (according to XOSview - I'm sure it's not the most accurate system meter, but it sure makes sense to me!), I got *MASSIVE* hard drive I/O, which pretty much tied the whole system into knots for well over 15 minutes (before I gave X its version of the Vulcan nerve pinch - it took another 5 to recognize *THAT*). What was the hard drive doing thrashing, since there was no designated swap for memory to be pushed off to temporarily?

I'm going to admit it here - I'm a Linux *USER*. I can - and have - done make clean ; make ; make install to build a kernel, or to install a program that didn't have a convenient package (and why can't I, as a user, install a package to *my* home directory, instead of having to go to root and install it system-wide? I know, I wander off topic...). I just want a machine that runs nicely. The recent bit that does some kind of ulimiting per console (some kind of easy kernel patch that someone else showed could be almost as easily done with a few lines in .bashrc?) helped a fair bit - but still, every day I gnash my teeth as I wait for my hard drives to catch up with everything else.

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 19:44 UTC (Mon) by nix (subscriber, #2304)

You won't be seeing slowdowns from writes unless a lot is being written or you are terribly short of memory, as they can always be cached and written back later. It's blocking for reads that's killing you.

The best way to speed up reads on current systems is probably to use RAID: add lots of disks and reads speed up enormously, given a fast enough bus (at a cost in write speed). For example, my four-way RAID-5 here combines four fairly slow low-power disks to give an aggregate read speed between 190 MB/s and 250 MB/s.
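An array along those lines is a one-liner to create with mdadm; a minimal sketch, with hypothetical device names:

    # Build a four-disk RAID-5 array and put a filesystem on it.
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.ext4 /dev/md0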
That knocks the socks off any single disk: even high-speed ones at the fast outer edge of the disk are slower than this RAID array is at the slow inside.

It might also be worth trying cachefs, caching onto a USB key (and not caching the filesystem on which your media files are located), but I'm not sure that this will actually gain you anything. (Worth trying, though.)

If you turn off swap and the system is already short of memory, performance will hit a wall, because all of a sudden rarely-used but dirtied pages of non-file-backed memory *have* to be held in RAM, rather than being swapped to disk once and forgotten about. (glibc creates several hundred KB of these for every program that might use locales, which pretty much means anything that calls printf(), i.e. everything.) So all of a sudden your available memory plunges and lots of frequently-used but read-only pages of program text get forced out of memory, leading to major thrashing as they get pulled back off disk all the time. (So you will probably find iostat and/or blktrace reporting major *reads* in this situation, but not major *writes*.)

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 21:58 UTC (Mon) by Tara_Li (subscriber, #26706)

How do you tell Linux not to cache a filesystem? I can't find it in the mount options.

Optimizing Linux with cheap flash drives
Posted Feb 22, 2011 0:24 UTC (Tue) by nix (subscriber, #2304)

This isn't about the page cache. I was talking about FS-Cache, which is a separate filesystem layer that can be used to cache other filesystems on local media (such as, well, flash drives, or, more often, hard drives). Unfortunately, it needs specific support for each filesystem, and it doesn't look like this has been added for any non-networked filesystems. Curses. (See Documentation/filesystems/caching/.)

Optimizing Linux with cheap flash drives
Posted Feb 22, 2011 10:22 UTC (Tue) by cladisch (✭ supporter ✭, #50193)

> once a program allocates itself some RAM, it *SURE* doesn't seem to want to give it up.

A program's allocations tend to be too fragmented, so functions like free() usually do not even try to give the memory back to the OS.

> Of course, this is really hard to be sure of, because any long-time Linux user will tell you that the memory columns in ps don't really mean anything useful, and when you dig further, it's hard to be sure the *KERNEL* knows what memory belongs to whom - although it does do a good job of cleaning up after an application crashes, so it knows *SOMETHING*.

The biggest problem is shared libraries; the kernel knows who uses them, but it's not clear how their memory should be counted. (Which process gets charged for that memory? And when it exits, does the memory usage of the other processes suddenly increase?)

> What was the hard drive doing thrashing, since there was no designated swap for memory to be pushed off to temporarily?

Normal data can get swapped out, if there is swap. Code from executable files does not need to be saved to swap because it can be reloaded from the executable file. In other words, every executable file is a read-only swap file.

Optimizing Linux with cheap flash drives
Posted Feb 25, 2011 21:39 UTC (Fri) by oak (guest, #2786)

> The biggest problem is shared libraries; the kernel knows who uses them, but it's not clear how their memory should be counted. (Which process gets charged for that memory?

Look at the PSS figures in the /proc/PID/smaps file. A nice tool for that is, e.g., smem: http://www.selenic.com/smem/ (You can just apt-get it and then do "smem --pie=name".)

> And when it exits, does the memory usage of the other processes suddenly increase?)

If you're looking at PSS figures, yes.

Installing to your home drive
Posted Feb 22, 2011 13:46 UTC (Tue) by pflugstad (subscriber, #224)

> (and why can't I, as a user, install a package to *my* home directory, instead of having to go to root and install it system-wide? I know, I wander off topic...)

    $ ./configure --prefix=/my/home/dir
    $ make
    $ make install

Beyond this, you need to dig into the Makefile (or whatever the build utility is) to figure out how it's installed. Very often there is an install_prefix variable of some kind.

Installing to your home drive
Posted Feb 22, 2011 18:15 UTC (Tue) by nye (guest, #51576)

The point was to install a *package*. To my knowledge, no widely-used package manager allows per-user package installation.

Installing to your home drive
Posted Feb 23, 2011 5:45 UTC (Wed) by idupree (subscriber, #71169)

Most package managers offer binary (already-compiled) packages. Most Linux/Unix software can't have its prefix changed *after* compilation. Everyone's user directory has a different path (thus, prefix). Thus, problems.

Options I've heard of and played with: GoboLinux's "Rootless" project is a system for installing from source in your home directory (on any distro). ZeroInstall I believe does source and binary (not sure how it manages binary) without root privileges. Some NixOS research has looked into rewriting the paths in compiled packages (though not changing their total-number-of-characters length).

Installing to your home drive
Posted Feb 23, 2011 9:34 UTC (Wed) by nix (subscriber, #2304)

Also, gnulib contains support code to make 'relocatable packages' work (by looking at argv[0], or, if this contains no path, by hunting along $PATH to find itself, then using relative paths everywhere in the knowledge of the location of the binary). GCC has worked this way forever, but it's only fairly recently that relocatable support has started to find its way into other GNU programs. (Most non-GNU programs still don't care, but the GNU Project cares about keeping its stuff installable into people's home directories on random systems: that's how it started, after all.)

Installing to your home drive
Posted Feb 23, 2011 19:08 UTC (Wed) by talex (subscriber, #19139)

> ZeroInstall I believe does source and binary (not sure how it manages binary) without root privileges.

In my experience (0install developer), a surprising number of programs are relocatable:
- Anything that's been ported to Windows or Mac will be relocatable.
- Anything that encourages non-technical end-users to download beta versions will be relocatable.
- Things written in languages with built-in string concatenation (i.e. anything except C) are usually relocatable.
- Libraries always seem to be relocatable (not sure why this is; maybe to allow them to be bundled with other relocatable programs?).

There were a few suggestions for supporting non-relocatable programs (e.g. using Plash to adjust paths at runtime, Klik-style binary rewriting, etc.), but there don't seem to be many programs that need it these days.

Installing to your home drive
Posted Feb 23, 2011 22:36 UTC (Wed) by nix (subscriber, #2304)

> Libraries always seem to be relocatable (not sure why this is; maybe to allow them to be bundled with other relocatable programs?).

That is definitely not always true.
The KDE3 libraries, for instance, were not relocatable: they had $datadir and $libdir/kde3 baked into them. (I think the same is true of glib and gtk as well.)

Optimizing Linux with cheap flash drives
Posted Feb 25, 2011 19:22 UTC (Fri) by giraffedata (subscriber, #1954)

How do you know you're caching these files? I don't know exactly what Linux's page replacement policy is this week, but I'd be surprised if it caches video and music files. It knows you're accessing these pages only once, so keeping them around in preference to something else would be a loss. It usually takes two accesses to a page to get it any significant priority for memory allocation. Linux also knows you're accessing the file sequentially, so it knows even sooner than it otherwise would that the pages won't be accessed a second time.

Of course, if there's absolutely nothing else worth using memory for, Linux will just go ahead and fill it with this data, just in case. But that's not a problem.

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 10:06 UTC (Mon) by dgm (subscriber, #49227)

> the key benefit is going to arise from lots of RAM (8GB or more)

At the price of slower suspend/resume and (slightly) increased power consumption. Also, it will speed things up "eventually", but will do nothing for you if you just turn on your computer for some quick lookups and turn it off again.

No, there has to be a better way. If cheap flash cards are weak at writing but great at seeking and reading, why not use them for small, sparingly-written files? I think /usr and /etc would be good candidates.

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 11:50 UTC (Mon) by mgedmin (subscriber, #34497)

The speed of suspend-to-RAM doesn't change with the amount of memory you have, as far as I know. Did you mean suspend-to-disk, aka hibernation?

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 14:36 UTC (Mon) by dgm (subscriber, #49227)

Yes, I meant hibernation, but I was not sure of the term for the reverse process (un-hibernation? thaw?), so I used the generic suspend/resume.

Optimizing Linux with cheap flash drives
Posted Feb 19, 2011 20:57 UTC (Sat) by nhippi (subscriber, #34640)

Good article, but I think we are settling for too little. We shouldn't accept a situation where you can:

1) buy cheap SD/CF/SATA NAND drives where you have to second-guess how the card expects things to be written to it - basically trying to force us to make our FS access look like FAT access; or
2) buy very expensive SSD drives where the price is justified by embedded controller code that tries to guess what the host computer is trying to write and distribute it to the proprietary filesystem on the SSD drive's cheap NAND chips.

Instead, we should press for the SD/SATA standards to include a raw (or semi-raw) NAND access channel. Otherwise we'll get stuck, like in the BIOS/ACPI world, emulating Windows behaviour.

Optimizing Linux with cheap flash drives
Posted Feb 20, 2011 1:18 UTC (Sun) by rwa (subscriber, #69887)

Some time ago, while pondering buying an SSD and doing some research, I came across the following two posts on Ted Ts'o's blog:

http://thunk.org/tytso/blog/2009/02/20/aligning-filesyste...
http://thunk.org/tytso/blog/2009/03/01/ssds-journaling-an...

The posts are from 2009 but are still interesting - including the numerous comments.

Optimizing Linux with cheap flash drives
Posted Feb 20, 2011 13:37 UTC (Sun) by arnd (subscriber, #8866)

Ted has many great posts on this topic; http://thunk.org/tytso/blog/2009/02/22/should-filesystems... is also really worth reading.

There are two pieces of information that I only found during my research, and that he evidently also didn't know about:

* Erase blocks are much larger these days than they used to be. Traditionally, erase blocks have always been 128 KB, but now they are effectively 4 MB in most cases (2 MB physical erase blocks with multi-plane accesses).

* The question of whether you should use ext3, ext2 or btrfs for best performance is highly media-dependent. Ted showed that there was no performance penalty for ext3 over ext2 on his SSD, but for media with fewer open segments, the answer can be different. This is one area that I still need to analyze further.

Optimizing Linux filesystems for cheap flash drives
Posted Feb 20, 2011 9:33 UTC (Sun) by alonz (subscriber, #815)

Is it possible to hack a special "cheap flash mode" into filesystems such as ext3/ext4/btrfs?

Such a mode could, e.g., move the log + inode table + allocation bitmaps etc. to the "FAT" segment (either breaking this information apart from the block groups or maybe even using a single block group for the entire device). Plus it could set the block size to the expected FAT cluster size (after all, these filesystems do know how to use block sizes != page size). Possibly other optimizations can be applied.

Optimizing Linux filesystems for cheap flash drives
Posted Feb 20, 2011 13:47 UTC (Sun) by nix (subscriber, #2304)

That sounds like a good idea, as the main rationale for positioning most of these things in the middle of the disk is to reduce seek time, which hardly matters on an SSD.

One downside, though: I suspect the FAT segment is (much) too small.

Optimizing Linux filesystems for cheap flash drives
Posted Feb 21, 2011 13:15 UTC (Mon) by mchazaux (guest, #64024)

Then do the opposite: add UID, GID, permissions and such to a FAT-like filesystem ;-)

Optimizing Linux filesystems for cheap flash drives
Posted Feb 21, 2011 15:21 UTC (Mon) by giggls (subscriber, #48434)

Once upon a time there was the UMSDOS filesystem, which did exactly this. I'm unaware if this is still available in current kernels, though.

Optimizing Linux filesystems for cheap flash drives
Posted Feb 21, 2011 19:36 UTC (Mon) by nix (subscriber, #2304)

No, it was removed some years ago.

Optimizing Linux filesystems for cheap flash drives
Posted Feb 21, 2011 19:37 UTC (Mon) by arnd (subscriber, #8866)

It's not; see the discussion above.
I think we can do much better than FAT as well, even given the characteristics of the current drives. Ted Ts'o has some ideas for ext4, and my understanding of btrfs is that it does not rely on a specific block allocation at all, so that could be an excellent target as well.

Starting out a completely new file system designed only for SD cards would of course make it possible to get the best result, but that would also be an enormous amount of work.

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 7:56 UTC (Mon) by HBM (guest, #72284)

I really would hope that some industry consortium (like Linaro) pushes for more sane SD card filesystems instead of fixing the stuff afterwards. I mean, using FAT for flash storage seems pretty awkward to me.

Besides, having an abstraction layer for naked flash in the kernel seems like a good idea, so shoving this stuff into the proprietary firmware layer doesn't sound like a good idea to me.

Tim

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 19:18 UTC (Mon) by xxiao (subscriber, #9631)

I second this idea; SD cards are increasingly used in various embedded devices these days. I'm comparing various SD cards (SLC and MLC) using different file systems, and the results are pretty random, so it's hard to find the best SD card with the best filesystem easily for certain applications these days.

Optimizing Linux with cheap flash drives
Posted Feb 21, 2011 19:43 UTC (Mon) by arnd (subscriber, #8866)

Please contact me by email about your test work. It would be very good to correlate these high-level benchmarks with the low-level measurements that I started on https://wiki.linaro.org/WorkingGroups/KernelConsolidation... .

Also, everyone else: if you have a lot of SD cards or USB sticks, please run flashbench on them and send me the results.

Put the smarts in mkfs
Posted Feb 22, 2011 2:19 UTC (Tue) by sethml (subscriber, #8471)

To me it's always seemed like the best solution would be allowing raw flash access, but I've come to accept that anything which requires mass industry cooperation has roughly zero chance of happening. Now I think a good practical approach would be to tune the filesystem to the device. In particular, give, say, ext4 the ability to store various device characteristics (erase block size and alignment, any good region of the partition for frequently-changing data, maximum number of concurrently open segments, etc.) in the filesystem header, and then have the kernel filesystem code tune its accesses to work well with the limitations of the device. Then add a flag to mkfs and tunefs which causes them to spend a few minutes benchmarking the device and heuristically deciding what the device characteristics are. Even better, of course, would be to combine this approach with a log-structured fs, to really avoid the weaknesses of the hardware. Not perfect, but a heck of a lot more likely to be useful than petitioning device manufacturers to do anything different.
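Some of this tuning can already be approximated by hand today: start the first partition on a segment boundary and pass the segment geometry to mke2fs. A minimal sketch, assuming a 4 MB erase segment, a 4 KB block size and a hypothetical /dev/sdb (the stride/stripe_width values are simply the segment size divided by the block size):

    # Start the partition at 4 MiB (8192 sectors of 512 bytes) so it is segment-aligned.
    parted -s /dev/sdb mklabel msdos mkpart primary ext4 4MiB 100%

    # Tell ext4 about the 4 MB segment: 4 MB / 4 KB = 1024 blocks.
    mkfs.ext4 -b 4096 -E stride=1024,stripe_width=1024 /dev/sdb1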
partitioning tools
Posted Feb 22, 2011 8:38 UTC (Tue) by rh-kzak (guest, #51571)

fdisk has supported partition alignment according to the I/O limits since util-linux[-ng] 2.17 (Jan 2010). fdisk ***does not use CHS addressing by default*** and has used a 1 MiB grain for partitions since util-linux[-ng] 2.18 (Jun 2010). GNU Parted was also updated about a year ago.

Partitioning
Posted Feb 23, 2011 9:50 UTC (Wed) by shane (subscriber, #3335)

The Arch Linux distribution recommends using GPT for partitioning on the wiki:
https://wiki.archlinux.org/index.php/Solid_State_Drives#P...

Partitioning
Posted Feb 24, 2011 10:20 UTC (Thu) by arnd (subscriber, #8866)

Yes. Both the gdisk recommended there and the new fdisk mentioned by rh-kzak align partitions to 1 MB, which is much better than what the old fdisk does in many current distros.

However, the alignment should really be 4 MB or higher, not 1 MB, at least on the low-end devices. I hope to get optimizations for 4 MB segments into btrfs, ext4 and other file systems, but they can only work if the file system is fully aligned.

Optimizing Linux with cheap flash drives
Posted Feb 24, 2011 10:43 UTC (Thu) by jond (subscriber, #37669)

When comparing filesystem performance, should LVM be considered separately? I'm guessing it makes alignment guarantees even more difficult, or impossible.

Flash memory and partitioning "optimization"
Posted Feb 25, 2011 15:29 UTC (Fri) by meuh (subscriber, #22042)

Are there any flash devices using the "offset by one" trick used by some 4 KB-sector hard drives, which makes partitions that are logically aligned to the 255/63 CHS geometry come out physically aligned on a 4 KB boundary, as reported on https://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues#S-1._Yet_another_workaround_from_the_firmware_-_offset-by-one ?

Optimizing Linux with cheap flash drives
Posted Mar 6, 2011 12:12 UTC (Sun) by pabs (subscriber, #43278)

http://blog.datalight.com/why-raw-nand-flash-with-hardwar...

Optimizing Linux with cheap flash drives
Posted Mar 6, 2011 18:06 UTC (Sun) by bronson (subscriber, #4806)

All true, but until someone defines a workable API/ABI/software interface, it's never going to happen.

Consider the esoteric optimization features that he mentions... If the engineers writing the low-level API didn't anticipate multi-plane access, then unmanaged flash will still be slower than managed flash.

Is anyone out there actually trying to write a high-performance, low-level flash API that's intended to displace SSD controllers?

Optimizing Linux with cheap flash drives
Posted Apr 20, 2011 13:47 UTC (Wed) by Thom (guest, #73471)

By interface, do you mean a generic read and write that can handle the vagaries of all sorts of flash parts? Or do you mean a driver to handle the flash properly? Datalight's solution is the latter. By working with flash vendors and creating custom Flash Interface Modules, our Flash Management software utilizes the optimizations of each flash part. This blog post calls for what ONFI specified as EZ-NAND, and the Datalight solution supports those modules also. Fully supporting the on-die ECC of chips like ClearNAND, plus wear leveling and bad block management, both visible and customizable, is truly the best of both worlds.

In order to displace an SSD controller, much more than throughput and endurance has to be considered - for example, hardware compression, or aggressive caching.
With the right file system support, JEDEC's eMMC might be the best opponent for an SSD.

Utility to find ideal blocksize
Posted Mar 6, 2011 15:05 UTC (Sun) by gmatht (guest, #58961)

If anyone is interested, I wrote a utility to help detect the ideal blocksize and alignment for writing to a device (particularly cheap flash devices). It allows you to set the read pattern, blocksize, and offset (the offset may be useful on drives that correct for XP's weird alignment of partitions); it will then benchmark writes with those settings.

http://dansted.co.cc/scripts/detectblocksize.c

For example, I found that on my device, if we write sequentially, writing blocks of 64 KB is sufficient to maximize the data transfer rate, while if blocks are written randomly, 4 MB is required.

This utility was discussed on the linux-bcache list, but the old mail archives don't seem to be on the web. I could discuss this further if anyone is interested.

Utility to find ideal blocksize
Posted Mar 9, 2011 15:50 UTC (Wed) by arnd (subscriber, #8866)

The results you found are very typical, and match what the flashbench tool referenced in the last sentence of the article finds on many media. The other interesting number is how many (4 MB) segments can be written to in alternation, which you can find out with

    flashbench --open-au --open-au-nr=<NR> --erasesize=$[4096 * 1024] [--random]

with varying values for NR. With low numbers, it will be fast for all block sizes, while with large numbers of open segments, the time to write all segments is basically independent of the block size, because every write forces a garbage collection on one of the other open segments. There is usually a very sharp contrast between the slow and fast results, e.g. five being very fast but six already being very slow.

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
