A White Paper from the X/Open Base Working Group.
Abstract
This paper is an abridged version of the submission received by X/Open from the Large File Summit, an industry initiative to produce a common specification for support of files that are bigger than the current limit of 2GB on existing 32-bit systems. It details the modifications to X/Open's Single UNIX Specification to support large files with unlimited file offsets. These changes have been incorporated into the next issue of the Single UNIX Specification.
This document is based on the 20Mar96 Large File Summit Submission. It has been abridged to refer only to the set of changes to the Single UNIX Specification.
Last Update: 14Aug96
X/Open gratefully acknowledges the Large File Summit for their work in developing the set of changes to X/Open's Single UNIX Specification to support large files.
For further details of the Large File Summit, please see http://www.sas.com/standards/large.file.
A number of major system vendors and users met at the "Large File Summit" (LFS) for over a year to develop a set of changes to the existing Single UNIX Specification (SUS) that allow both new and converted programs to address files of arbitrary sizes. This set of changes was provided to X/Open for inclusion into the next version of the SUS. In addition, a set of transitional extensions intended to permit users to immediately implement large file support on typical 32-bit UNIX operating systems was proposed. This abridged document only contains the identified changes to the SUS document and the accompanying rationale.
(Note the attribute "resource limits" as used in the SUS is not defined.)
For regular files, no data transfer will occur past the offset maximum established in the open file description associated with aiocbp->aio_fildes.
ERRORS
Note: This is a new error condition.
The following is an additional condition which may be detected synchronously or asynchronously:
- [EOVERFLOW]
- The file is a regular file, aiocbp->aio_nbytes is greater than 0 and the starting offset in aiocbp->aio_offset is before the end-of-file and is at or beyond the offset maximum in the open file description associated with aiocbp->aio_fildes.
For regular files, no data transfer will occur past the offset maximum established in the open file description associated with aiocbp->aio_fildes.
ERRORS
Note: This is an additional EFBIG error condition.
The following is an additional condition which may be detected synchronously or asynchronously:
- [EFBIG]
- The file is a regular file, aiocbp->aio_nbytes is greater than 0 and the starting offset in aiocbp->aio_offset is at or beyond the offset maximum in the open file description associated with aiocbp->aio_fildes.
The saved resource limits in the new process image are set to be a copy of the process's corresponding hard and soft resource limits.
Note: This is an additional EFBIG error condition.
These functions will fail if:
- [EFBIG]
- The file is a regular file and an attempt was made to write at or beyond the offset maximum associated with the corresponding stream.
An unlock (F_UNLCK) request in which l_len is non-zero and the offset of the last byte of the requested segment is the maximum value for an object of type off_t, when the process has an existing lock in which l_len is 0 and which includes the last byte of the requested segment, will be treated as a request to unlock from the start of the requested segment with an l_len equal to 0. Otherwise an unlock (F_UNLCK) request will attempt to unlock only the requested segment.
ERRORS
Note: These are new error conditions.
The fcntl() function will fail if:
- [EOVERFLOW]
- One of the values to be returned cannot be represented correctly.
- [EOVERFLOW]
- The cmd argument is F_GETLK, F_SETLK or F_SETLKW and the smallest or, if l_len is non-zero, the largest, offset of any byte in the requested segment cannot be represented correctly in an object of type off_t.
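As a rough illustration of the l_len semantics described above, the following sketch locks or unlocks everything from a starting offset to the end of the file by passing an l_len equal to 0. The helper name lock_to_eof is hypothetical and is not part of the specification.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: apply a lock of the given type (F_RDLCK, F_WRLCK
 * or F_UNLCK) from offset `start` to the end of the file.
 * An l_len of 0 means "through end-of-file, however large the file grows".
 * Returns 0 on success, or -1 with errno set (EOVERFLOW is one possibility). */
int lock_to_eof(int fd, off_t start, short type)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = 0;               /* whole remainder of the file */

    return fcntl(fd, F_SETLK, &fl);
}
```

Because the request never names a last byte, it cannot run into the case where the offset of the last byte is unrepresentable in an object of type off_t.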
The fdopen() function will preserve the offset maximum previously set for the open file description corresponding to fildes.
Note: This is a new error condition.
These functions will fail if data needs to be read and:
- [EOVERFLOW]
- The file is a regular file and an attempt was made to read at or beyond the offset maximum associated with the corresponding stream.
Note: This is a new error condition.
The fgetpos() function will fail if:
- [EOVERFLOW]
- The current value of the file position cannot be represented correctly in an object of type fpos_t.
The largest value that can be represented correctly in an object of type off_t will be established as the offset maximum in the open file description.
ERRORS
Note: This is a new error condition.
The fopen() and freopen() functions will fail if:
- [EOVERFLOW]
- The named file is a regular file and the size of the file cannot be represented correctly in an object of type off_t.
Variable        Value of name       Notes
FILESIZEBITS    _PC_FILESIZEBITS    3,4
Note: This is an additional EFBIG error condition.
These functions will fail if either the stream is unbuffered or the stream's buffer needed to be flushed and:
- [EFBIG]
- The file is a regular file and an attempt was made to write at or beyond the offset maximum.
Note: This is a new error condition.
The fseek() function will fail if:
- [EOVERFLOW]
- The resulting file offset would be a value which cannot be represented correctly in an object of type long.
Note: This is a new function.
The fseeko() function is identical to the modified fseek() except that the offset argument is of type off_t and the EOVERFLOW error is changed as follows:
ERRORS
- [EOVERFLOW]
- The resulting file offset would be a value which cannot be represented correctly in an object of type off_t.
Note: This is an additional EOVERFLOW error condition.
These functions will fail if:
- [EOVERFLOW]
- The file size in bytes or the number of blocks allocated to the file or the file serial number cannot be represented correctly in the structure pointed to by buf.
Note: This is a new error condition.
These functions will fail if:
- [EOVERFLOW]
- One of the values to be returned cannot be represented correctly in the structure pointed to by buf.
Note: This is a new error condition.
The ftell() function will fail if:
- [EOVERFLOW]
- The current file offset cannot be represented correctly in an object of type long.
Note: This is a new function.
The ftello() function is identical to the modified ftell() except that the return value is of type off_t and the EOVERFLOW error is changed as follows:
ERRORS
- [EOVERFLOW]
- The current file offset cannot be represented correctly in an object of type off_t.
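A minimal sketch of how a program might use fseeko() and ftello() together, checking both return values so that an EOVERFLOW failure is not silently ignored. The helper name stream_size is hypothetical and is used here only for illustration.

```c
#define _XOPEN_SOURCE 700   /* request fseeko()/ftello() declarations */
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: return the size in bytes of the file behind an open
 * stream, or (off_t)-1 on failure with errno set by the failing call.
 * The stream's original position is restored before returning. */
off_t stream_size(FILE *stream)
{
    off_t saved, size;

    saved = ftello(stream);
    if (saved == (off_t)-1)
        return (off_t)-1;           /* e.g. EOVERFLOW */

    if (fseeko(stream, 0, SEEK_END) != 0)
        return (off_t)-1;

    size = ftello(stream);          /* may itself fail with EOVERFLOW */

    if (fseeko(stream, saved, SEEK_SET) != 0)
        return (off_t)-1;

    return size;
}
```

The same logic written with fseek() and ftell() would be limited to offsets that fit in an object of type long.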
Note: This is an additional EFBIG error condition.
The ftruncate() function will fail if:
- [EFBIG]
- The file is a regular file and length is greater than the offset maximum established in the open file description associated with fildes.
When using the getrlimit() function, if a resource limit can be represented correctly in an object of type rlim_t then its representation is returned; otherwise if the value of the resource limit is equal to that of the corresponding saved hard limit the value returned is RLIM_SAVED_MAX; otherwise the value returned is RLIM_SAVED_CUR.
When using the setrlimit() function, if the requested new limit is RLIM_INFINITY the new limit will be "no limit"; otherwise if the requested new limit is RLIM_SAVED_MAX the new limit will be the corresponding saved hard limit; otherwise if the requested new limit is RLIM_SAVED_CUR the new limit will be the corresponding saved soft limit; otherwise the new limit will be the requested value. In addition, if the corresponding saved limit can be represented correctly in an object of type rlim_t then it will be overwritten with the new limit.
The result of setting a limit to RLIM_SAVED_MAX or RLIM_SAVED_CUR is unspecified unless a previous call to getrlimit() returned that value as the soft or hard limit for the corresponding resource limit.
The determination of whether a limit can be correctly represented in an object of type rlim_t is implementation-dependent. For example, some implementations permit a limit whose value is greater than RLIM_INFINITY and others do not.
The exec family of functions also cause resource limits to be saved. (See 2.2.1.3 exec).
For regular files, no data transfer will occur past the offset maximum established in the open file description associated with aiocbp->aio_fildes.
ERRORS
Note: These are additional EFBIG and EOVERFLOW error conditions.
The following are additional error codes which may be set for each aiocb control block:
- [EOVERFLOW]
- The aiocbp->aio_lio_opcode is LIO_READ, the file is a regular file, aiocbp->aio_nbytes is greater than 0, and the aiocbp->aio_offset is before the end-of-file and is greater than or equal to the offset maximum in the open file description associated with aiocbp->aio_fildes.
- [EFBIG]
- The aiocbp->aio_lio_opcode is LIO_WRITE, the file is a regular file, aiocbp->aio_nbytes is greater than 0, and the aiocbp->aio_offset is greater than or equal to the offset maximum in the open file description associated with aiocbp->aio_fildes.
An F_ULOCK request in which size is non-zero and the offset of the last byte of the requested section is the maximum value for an object of type off_t, when the process has an existing lock in which size is 0 and which includes the last byte of the requested section, will be treated as a request to unlock from the start of the requested section with a size equal to 0. Otherwise an F_ULOCK request will attempt to unlock only the requested section.
ERRORS
Note: This is a clarification of the EINVAL error condition.
The lockf() function will fail if:
- [EINVAL]
- The function argument is not one of F_LOCK, F_TLOCK, F_TEST or F_ULOCK; or size plus the current file offset is less than 0.
- [EOVERFLOW]
- The offset of the first, or if size is not 0 then the last, byte in the requested section cannot be represented correctly in an object of type off_t.
Note: EOVERFLOW is a new error condition.
Note: This is a new error condition.
The lseek() function will fail if:
- [EOVERFLOW]
- The resulting file offset would be a value which cannot be represented correctly in an object of type off_t.
Note: This is a new error condition.
The mmap() function will fail if:
- [EOVERFLOW]
- The file is a regular file and the value of off plus len exceeds the offset maximum established in the open file description associated with fildes.
The largest value that can be represented correctly in an object of type off_t will be established as the offset maximum in the open file description.
ERRORS
Note: This is a new error condition.
The open() function will fail if:
- [EOVERFLOW]
- The named file is a regular file and the size of the file cannot be represented correctly in an object of type off_t.
For regular files, no data transfer will occur past the offset maximum established in the open file description associated with fildes.
ERRORS
Note: This is a new error condition.
The read() and readv() functions will fail if:
- [EOVERFLOW]
- The file is a regular file, nbyte is greater than 0, the starting position is before the end-of-file and the starting position is greater than or equal to the offset maximum established in the open file description associated with fildes.
Note: This is a new error condition.
The readdir() function will fail if:
- [EOVERFLOW]
- One of the values in the structure to be returned cannot be represented correctly.
For regular files, no data transfer will occur past the offset maximum established in the open file description associated with fildes.
ERRORS
Note: This is an additional EFBIG error condition.
These functions will fail if:
- [EFBIG]
- The file is a regular file, nbyte is greater than 0 and the starting position is greater than or equal to the offset maximum established in the open file description associated with fildes.
Name: FILESIZEBITS
Description: Minimum number of bits needed to represent, as a signed integer value, the maximum size of a regular file allowed in the specified directory.
Acceptable Value: *
int fseeko(FILE *stream, off_t offset, int whence);
off_t ftello(FILE *stream);
The type off_t is defined through typedef as described in <sys/types.h>.
RLIM_SAVED_MAX    A value of type rlim_t indicating an unrepresentable saved hard limit.
RLIM_SAVED_CUR    A value of type rlim_t indicating an unrepresentable saved soft limit.
On implementations where all resource limits are representable in an object of type rlim_t, RLIM_SAVED_MAX and RLIM_SAVED_CUR need not be distinct from RLIM_INFINITY.
blkcnt_t st_blocks number of blocks allocated for this object.
fsblkcnt_t f_blocks    total number of blocks in the file system in units of f_frsize.
fsblkcnt_t f_bfree     total number of free blocks.
fsblkcnt_t f_bavail    number of free blocks available to non-privileged process.
fsfilcnt_t f_files     total number of file serial numbers.
fsfilcnt_t f_ffree     total number of free file serial numbers.
fsfilcnt_t f_favail    number of free file serial numbers available to non-privileged process.
blkcnt_t      Used for file block counts.
fsblkcnt_t    Used for file system block counts.
fsfilcnt_t    Used for file system file counts.
The types blkcnt_t and off_t are defined as extended signed integral types.
The types fsblkcnt_t, fsfilcnt_t, and ino_t are defined as extended unsigned integral types.
_PC_FILESIZEBITS
The following utilities will support files of any size up to the maximum that can be created by the implementation. This support includes correct writing of file size related values (such as file sizes and offsets, line numbers, and block counts) and correct interpretation of command line arguments that contain such values.
basename    return non-directory portion of pathname
cat         concatenate and print files
cd          change working directory
chgrp       change file group ownership
chmod       change file modes
chown       change file ownership
cksum       write file checksums and sizes
cmp         compare two files
cp          copy files
dd          convert and copy a file
df          report free disk space
dirname     return directory portion of pathname
du          estimate file space usage
find        find files
ln          link files
ls          list directory contents
mkdir       make directories
mv          move files
pathchk     check pathnames
pwd         return working directory name
rm          remove directory entries
rmdir       remove directories
sh          shell, the standard command language interpreter
sum         print checksum and block or byte count of a file
test        evaluate expression
touch       change file access and modification times
ulimit      set or report file size limit
Exceptions to the requirement that utilities support files of any size up to the maximum are:
Pathname expansion will not fail due to the size of a file.
Shell input and output redirections will have an implementation-specific offset maximum that will be established in the open file description.
The pax utility is not able to handle arbitrary file sizes. There is currently a proposal in ballot in IEEE Project 1003.2b to address this issue.
The reader is referred to http://www.sas.com/standards/large.file for the full rationale for this section. Only the rationale relevant to the Single UNIX Specification is included in this abridged paper.
    if (stat(path, ...) < 0) {
        /* assume file does not exist, so create it */
        if ((fd = creat(path, ...)) < 0) {
            /* print out error text */
        }
    }
In this example the stat() function is being used to determine the existence of a file. But if the file size cannot be represented correctly in an object of type off_t then stat() will fail (see 2.2.1.14 fstat(), lstat() and stat()) and if creat() did not then fail it would have the unintended effect of truncating the file to 0 length. Many applications and standard utilities have code similar to this example, including typical implementations of the touch utility.
Several existing implementations of fcntl() permit locking the byte whose offset is the maximum value that can be represented correctly in an object of type off_t, even though write() cannot write to that offset. This specification permits that behavior.
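One way an application might avoid the truncation hazard in the stat()/creat() example is sketched below. It does not treat every stat() failure as "file does not exist", and it creates the file with O_EXCL so that an existing file is never truncated. The helper names file_exists and create_if_absent are hypothetical; this is an illustration under those assumptions, not a change required by the specification.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path exists, 0 if it does not,
 * and -1 (errno set) if existence could not be determined. */
int file_exists(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == 0)
        return 1;
    if (errno == EOVERFLOW)
        return 1;       /* exists, but its size does not fit in struct stat */
    if (errno == ENOENT)
        return 0;
    return -1;          /* some other failure: do not guess */
}

/* Hypothetical helper: create path only if it is absent.
 * O_EXCL makes open() fail with EEXIST rather than truncate. */
int create_if_absent(const char *path, mode_t mode)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
}
```

Using O_CREAT with O_EXCL also closes the window between the existence test and the creation, which the two-step stat()/creat() sequence leaves open.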
The fcntl() function will fail if the cmd argument is F_GETLK and the first lock which blocks the lock description has a starting offset or length which cannot be represented correctly in an object of type off_t. Information about such a lock cannot be correctly returned.
Discussion of the semantics of fcntl() locks that cross the off_t boundary resulted in six competing proposals:
An advantage of 2, 4, and 6 is that they do not change existing behavior of a 32-bit application.
Proposals 1 and 5 can result in a new type of failure in the case where the program creates a lock with l_len equal to 0 and then clips off the beginning leaving behind an unrepresentable lock.
Proposal 4 precludes truly "whole file" locking.
Proposal 6 was adopted because it preserves existing 32-bit behavior and is less disruptive than proposal 2 (which extends lock requests in addition to unlock requests).
The fcntl() and lockf() functions will fail if the offset of the first byte in the region, or if l_len (size) is non-zero then the offset of the last byte in the region, exceeds the largest possible value in an object of type off_t. Otherwise the process could create a lock which would be "beyond" the ability of the program to represent.
Programs typically, but incorrectly, fail to check the return value of these functions, which renders the error return less useful. On the other hand, returning an incorrect offset can result in serious malfunction as well.
An lseek() to the end of a file using
    lseek(fd, 0, SEEK_END);
is quite common. It is unfortunate that such calls fail on a too-large file, since the return value is usually ignored. One alternative that was considered was for lseek() to move the file offset for all valid requests and then return an error if the resulting offset is too large. That is, the call would succeed for applications that do not check the return code, but also fail for applications that do check. This option was deemed too bizarre to adopt. For example, it might be difficult to implement using a remote procedure call system that was constructed to return either results or an error, but not both. In addition, the POSIX 1003.1 standard requires the file offset to remain unchanged if an error is returned by lseek().
Another potentially serious consequence of ignoring the return value of lseek() is that programs which extend data files by attempting to seek beyond the end-of-file and then writing may instead overwrite existing data.
For example, typical implementations of the dbm and ndbm libraries contain code such as:
    (void) lseek(db->dbm_pagf, blkno*PBLKSIZ, L_SET);
    if (write(db->dbm_pagf, pagebuf, PBLKSIZ) != PBLKSIZ)
        ... error handling ...
The problem is that the return code of lseek() is not checked, so if "blkno*PBLKSIZ" overflows, the lseek() will fail (or will seek to an unintended offset) and the data will be written to an unintended offset.
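A defensive version of the block write might look like the sketch below, which checks the multiplication for overflow before seeking and refuses to write if lseek() fails. The helper names off_t_max, block_offset and write_block are hypothetical, and the computation of the largest off_t value assumes that off_t is a signed integer type with no padding bits.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: off_t is a signed integer type without padding bits. */
static off_t off_t_max(void)
{
    return (off_t)(((uintmax_t)1 << (sizeof(off_t) * CHAR_BIT - 1)) - 1);
}

/* Hypothetical helper: compute blkno * blksize without overflowing off_t.
 * Returns 0 and stores the offset, or -1 with errno set. */
int block_offset(off_t blkno, size_t blksize, off_t *out)
{
    if (blkno < 0 || blksize == 0) {
        errno = EINVAL;
        return -1;
    }
    if ((uintmax_t)blksize > (uintmax_t)off_t_max() ||
        blkno > off_t_max() / (off_t)blksize) {
        errno = EOVERFLOW;      /* product is not representable in off_t */
        return -1;
    }
    *out = blkno * (off_t)blksize;
    return 0;
}

/* Hypothetical helper: write one block at block number blkno.
 * Nothing is written unless the seek demonstrably succeeded. */
int write_block(int fd, off_t blkno, const void *buf, size_t blksize)
{
    off_t off;
    ssize_t n;

    if (block_offset(blkno, blksize, &off) != 0)
        return -1;
    if (lseek(fd, off, SEEK_SET) == (off_t)-1)
        return -1;              /* offset unchanged; do not write */
    n = write(fd, buf, blksize);
    if (n < 0)
        return -1;
    if ((size_t)n != blksize) {
        errno = EIO;            /* short write reported as an error here */
        return -1;
    }
    return 0;
}
```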
The _PC_FILESIZEBITS option makes it possible for a process to determine how large a file can be created in a given directory. It takes into account implementation limitations in the file system (e.g. due to the size of file size and block count variables), and it takes into account long term policy limitations (e.g. due to the mount utility's -o nolargefiles option). It does not take into account dynamic restrictions such as the RLIMIT_FSIZE resource limit or the number of available file blocks, so the process must perform appropriate checks.
When the current directory is on a typical large file capable file system and is mounted with the -o nolargefiles option,
    pathconf(".", _PC_FILESIZEBITS);
will return 32. In general, if the maximum size file that could ever exist on the mounted file system is maxsize then the returned value is 2 plus the floor of the base 2 logarithm of maxsize.
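The relationship between maxsize and the returned value can be sketched as follows. The function filesizebits_for computes 2 plus the floor of the base 2 logarithm of maxsize (for maxsize greater than 0), and dir_filesizebits simply wraps the pathconf() query. Both helper names are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: number of bits needed to hold maxsize as a signed
 * integer, i.e. 2 + floor(log2(maxsize)) for maxsize > 0. */
int filesizebits_for(uintmax_t maxsize)
{
    int bits = 1;               /* start with the sign bit */

    while (maxsize != 0) {      /* one more bit per significant binary digit */
        bits++;
        maxsize >>= 1;
    }
    return bits;
}

/* Hypothetical helper: query FILESIZEBITS for a directory.
 * Returns the value, or -1; errno stays 0 if the limit is simply not
 * reported by the implementation. */
long dir_filesizebits(const char *dir)
{
    errno = 0;
    return pathconf(dir, _PC_FILESIZEBITS);
}
```

For a maxsize of 2147483647 (2^31 - 1) the floor of the base 2 logarithm is 30, so filesizebits_for gives 32, matching the nolargefiles case above.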
When ftruncate() is used to increase the size of a file, the semantics are similar to a write() of zeroes to the file. For consistency with write(), the ftruncate() function will fail when the request is beyond the offset maximum (even if the effect of the request would be to shorten the file).
If setrlimit() fails for any reason (for example, EPERM), the resource limits and saved resource limits remain unchanged.
This proposal does not specify any particular value for RLIM_INFINITY, RLIM_SAVED_MAX or RLIM_SAVED_CUR. Typical current implementations use the value 0x7FFFFFFF for RLIM_INFINITY, and it is recommended that RLIM_SAVED_MAX and RLIM_SAVED_CUR have similar large values.
Few, if any, programs will need to refer explicitly to RLIM_SAVED_MAX or RLIM_SAVED_CUR. Those that do should not use them in C-language switch cases since they may have the same value in some implementations (see 2.2.2.3 <sys/resource.h>).
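A program that does need to distinguish these values could compare them in sequence instead of using a switch, as in the sketch below. The enum and the function classify_limit are hypothetical names introduced only for illustration.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <sys/resource.h>

/* Hypothetical classification of a value returned by getrlimit(). */
enum limit_kind {
    LIMIT_NONE,         /* RLIM_INFINITY: "no limit" */
    LIMIT_SAVED_MAX,    /* unrepresentable saved hard limit */
    LIMIT_SAVED_CUR,    /* unrepresentable saved soft limit */
    LIMIT_VALUE         /* an ordinary representable value */
};

/* An if-chain stays valid even where RLIM_SAVED_MAX and RLIM_SAVED_CUR
 * have the same value as RLIM_INFINITY; a switch with duplicate case
 * labels would not compile on such implementations. */
enum limit_kind classify_limit(rlim_t v)
{
    if (v == RLIM_INFINITY)
        return LIMIT_NONE;
    if (v == RLIM_SAVED_MAX)
        return LIMIT_SAVED_MAX;
    if (v == RLIM_SAVED_CUR)
        return LIMIT_SAVED_CUR;
    return LIMIT_VALUE;
}
```

Where the three constants are not distinct, the earlier comparisons win, so the later enumerators are simply never returned.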
A limit that can be represented correctly in an object of type rlim_t is either "no limit", which is represented with RLIM_INFINITY, or has a value not equal to any of RLIM_INFINITY or RLIM_SAVED_MAX or RLIM_SAVED_CUR and which can be represented correctly in an object of type rlim_t and which meets any additional implementation-specific criteria for correct representation.
A rejected alternative proposal was to map limits that could not be represented to and from RLIM_INFINITY. This would avoid the need for the new symbols RLIM_SAVED_MAX and RLIM_SAVED_CUR. But such mapping would arguably be a lie, and the resulting information loss would cause unintuitive program behavior, especially in programs running with appropriate privileges needed to raise hard limits.
A rejected alternative proposal was that if getrlimit() could not correctly return a current limit then it should instead return -1 and set errno to EOVERFLOW. But that would result in unnecessary breakage of programs. (Note that this breakage occurs even when no large files are present.) It would also result in malfunction of programs that assume that they are calling getrlimit() properly and so failure "cannot happen". For example, in the 4.4 BSD-Lite distribution, there are at least 15 unchecked calls to getrlimit(). When the 4.4 BSD csh limit function is used to report the current limits, there is no check of the return code and so the reported results can be entirely incorrect. Also, non-superuser programs typically unlimit themselves with:
    getrlimit(RLIMIT_STACK, &rl);
    rl.rlim_cur = rl.rlim_max;
    setrlimit(RLIMIT_STACK, &rl);
If the getrlimit() fails then garbage is passed to setrlimit() which may result in an unwanted and extremely restricted limit. Several utilities that are part of the GNU C compiler have this problem.
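For comparison, a checked form of the same idiom is sketched here. The helper name raise_soft_to_hard is hypothetical; the point is only that the rlimit structure is never passed to setrlimit() unless getrlimit() has filled it in.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <sys/resource.h>

/* Hypothetical helper: raise the soft limit of `resource` to its hard limit.
 * Returns 0 on success, or -1 with errno set by the failing call. */
int raise_soft_to_hard(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;              /* rl is indeterminate: do not use it */

    rl.rlim_cur = rl.rlim_max;  /* may legitimately be RLIM_SAVED_MAX */
    return setrlimit(resource, &rl);
}
```

A caller would use it as raise_soft_to_hard(RLIMIT_STACK) and report the error if the result is non-zero.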
Vendor and third-party backup software is also unable to support large files and will require modification in order to do so.
Typical core utilities must be compiled in a "large" off_t compilation environment or must use the transitional APIs. Using the compilation environment reduces the number of editing changes required to port a program, but it does not reduce the effort required to ensure the correctness of the port.
The chgrp, chmod, chown, ln, and rm utilities probably require use of large file capable versions of stat(), lstat(), ftw(), and the stat structure.
The cat, cksum, cmp, cp, dd, mv, sum, and touch utilities probably require use of large file capable versions of creat(), open(), and fopen().
The cat, cksum, cmp, dd, df, du, ls, and sum utilities may require writing large integer values. For example,
The dd, find and test utilities may need to interpret command arguments that contain 64-bit values. For dd the arguments include skip=n, seek=n, and count=n. For find the arguments include -size n. For test the arguments are those associated with algebraic comparisons.
The df utility might need to access large file systems with statvfs().
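As an illustration of why wide arithmetic matters for a df-like utility, the sketch below multiplies the block count by the block size in uintmax_t rather than in a 32-bit type. The helper name available_bytes is hypothetical, and the sketch assumes the product fits in uintmax_t.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper: bytes available to a non-privileged process on the
 * file system containing `path`. Returns 0 on success, -1 with errno set.
 * Assumption: f_bavail * f_frsize is representable in uintmax_t. */
int available_bytes(const char *path, uintmax_t *out)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return -1;

    /* f_bavail is counted in units of f_frsize */
    *out = (uintmax_t)vfs.f_bavail * (uintmax_t)vfs.f_frsize;
    return 0;
}
```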
The ulimit utility will need to use large file capable versions of getrlimit() and setrlimit() and be able to read and write large integer values.
Conversion between off_t (or other derived types) and ASCII is unspecified, which is a significant practical deficiency. This is being considered by other groups. For example, see: ftp://ftp.dmk.com/DMK/sc22wg14/c9x/extended-integers/
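One possible approach, assuming an implementation that provides the intmax_t type and the associated strtoimax() and "%jd" facilities (which are not part of this specification), is to convert through the widest signed integer type. The helper names parse_offset and format_offset are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: parse a non-negative decimal string into an off_t.
 * Returns 0 on success, or -1 with errno set to ERANGE, EINVAL or EOVERFLOW. */
int parse_offset(const char *s, off_t *out)
{
    char *end;
    intmax_t v;

    errno = 0;
    v = strtoimax(s, &end, 10);
    if (errno == ERANGE)
        return -1;                      /* does not even fit in intmax_t */
    if (end == s || *end != '\0' || v < 0) {
        errno = EINVAL;                 /* empty, trailing junk, or negative */
        return -1;
    }
    /* Round-trip check; an out-of-range narrowing conversion is
     * implementation-defined, and typically does not compare equal. */
    if ((intmax_t)(off_t)v != v) {
        errno = EOVERFLOW;
        return -1;
    }
    *out = (off_t)v;
    return 0;
}

/* Hypothetical helper: format an off_t as decimal text.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_offset(off_t off, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%jd", (intmax_t)off);

    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

A utility such as dd could apply parse_offset to operands like skip=n and seek=n, rejecting values that the current compilation environment cannot represent.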
The offset maximum used for shell input and output redirections is implementation-specific. Some vendors prefer to use the smallest supported off_t, others prefer the largest.
Read or download the complete Single UNIX Specification from http://www.UNIX-systems.org/go/unix.
Copyright © 1997-1998 The Open Group
UNIX is a registered trademark of The Open Group.