When programming in Python, you will often need to read and write files. The various file modes (read, write, append, and so on) can be confusing, and so can choosing among methods such as open, read, readline, readlines, write, and writelines.
I hope this article will help you better understand how to read and write files and use the most appropriate methods in the most appropriate places.
What is a file?
Before we start looking at how to use files in Python, it’s important to understand exactly what files are and how modern operating systems handle certain aspects of them.
Essentially, a file is a contiguous set of bytes used to store data. This data is organized in a specific format and can be anything as simple as a text file or as complex as a program executable. Ultimately, these bytes are translated into binary 1s and 0s for easier processing by the computer.
A file on most modern file systems consists of three main parts.
- Header : metadata about the contents of the file (file name, size, type, etc.)
- Data : the contents of the file as written by the creator or editor
- End of File (EOF) : A special character indicating the end of the file
What the data represents depends on the format specification used, which is usually indicated by the extension. For example, a file with a .gif extension most likely conforms to the Graphics Interchange Format specification. There are hundreds, if not thousands, of file extensions.
File path
File paths are required to access files on the operating system. A file path is a string that represents the location of a file. It is divided into three main parts.
- folder path: the location of the folder on the file system, with subsequent folders separated by a forward slash / (Unix) or a backslash \ (Windows)
- filename: the actual name of the file
- Extension: the end of the file path, prefixed with a period (.), used to indicate the file type
Python file path-related operations.
- Use os.getcwd() to get the current working directory, i.e., the directory in which the Python script is currently running, similar to the pwd command in Linux.
- In Python 2 on Windows, a file name or path containing Chinese characters had to be decoded with "GBK" before printing; in Python 3, paths are Unicode strings and need no such decoding.
- Join paths using os.path.join()
- Use os.sep for the path separator: a backslash on Windows and a forward slash on Linux (in an ordinary string literal, a backslash must be escaped as \\)
- Relative paths
  - (.) A single dot indicates the current folder
  - (..) Two dots indicate the parent folder
- Converting between relative and absolute paths
  - os.path.abspath(path): return the absolute path
  - os.path.isabs(path): determine whether it is an absolute path
  - os.path.relpath(path, start): return the path relative to start
- Splitting paths
  - os.path.dirname(path): return the directory where the file is located; os.path.basename(path): return the file name
  - os.path.split(path) == (os.path.dirname(path), os.path.basename(path)), whereas path.split(os.sep) splits the path string at every separator
  - os.path.basename(): get the file name
  - os.path.splitext(): split off the extension
- View file size: os.path.getsize(filename) reports the size of a single file; to total the contents of a folder, you need to traverse it yourself and sum the file sizes.
- Get file attributes: os.stat(file)
- Check the validity of the path
  - os.path.exists: whether the path exists
  - os.path.isdir: whether it is a directory
  - os.path.isfile: whether it is a file
- Return the names of all files and directories in the specified directory: os.listdir()
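As a rough illustration of several of the functions above, the following sketch builds a small throwaway directory with tempfile and inspects a path inside it. The names reports and summary.txt are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical layout: <tmp_dir>/reports/summary.txt
    file_path = os.path.join(tmp_dir, "reports", "summary.txt")
    os.makedirs(os.path.dirname(file_path))
    with open(file_path, "w", encoding="utf-8") as f:
        f.write("hello")

    # Split the path into directory, file name, and extension.
    directory, filename = os.path.split(file_path)
    stem, extension = os.path.splitext(filename)

    # Size in bytes, and basic validity checks.
    size = os.path.getsize(file_path)
    is_file = os.path.isfile(file_path)
    is_dir = os.path.isdir(directory)

    # Convert between relative and absolute forms.
    relative = os.path.relpath(file_path, start=tmp_dir)
    absolute_again = os.path.isabs(os.path.abspath(relative))

    # Names directly inside tmp_dir.
    entries = os.listdir(tmp_dir)
```

Building paths with os.path.join() rather than hard-coding a separator keeps the same code working on Windows and Unix.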
Other methods.
- Remove a file: os.remove()
- Remove directories: os.removedirs(r"c:\python"), which removes the leaf directory and then any parent directories that become empty
- Rename: os.rename(old, new), which works for both files and directories
- Create multi-level directories: os.makedirs(r"c:\python\test"), which creates intermediate folders, similar to the Linux command mkdir -p
- Create a single directory: os.mkdir("test")
- Modify file permissions: os.chmod(path, mode); modify timestamps: os.utime(path, times)
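The calls above can be tried safely inside a scratch directory. This is only a sketch: the directory and file names are invented for the example, and everything happens under a temporary folder created by tempfile.mkdtemp().

```python
import os
import tempfile

base = tempfile.mkdtemp()  # scratch directory for the demo

# makedirs creates the intermediate "python" folder as well as "test".
nested = os.path.join(base, "python", "test")
os.makedirs(nested)

# mkdir creates one level only; the parent must already exist.
single = os.path.join(base, "single")
os.mkdir(single)

old_name = os.path.join(nested, "old.txt")
new_name = os.path.join(nested, "new.txt")
with open(old_name, "w", encoding="utf-8") as f:
    f.write("data")

# rename works for files and directories alike.
os.rename(old_name, new_name)
renamed_exists = os.path.exists(new_name) and not os.path.exists(old_name)

os.chmod(new_name, 0o644)  # owner read/write, others read-only
os.remove(new_name)

# removedirs deletes "test", then "python"; it stops at base,
# which is not empty because "single" is still there.
os.removedirs(nested)
python_dir_gone = not os.path.exists(os.path.join(base, "python"))

# Clean up the rest by hand.
os.rmdir(single)
os.rmdir(base)
```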
Directory operations.
- Copy a file (shutil module)
  - shutil.copyfile("oldfile", "newfile"): both oldfile and newfile must be files
  - shutil.copy("oldfile", "newfile"): oldfile must be a file; newfile can be a file or a target directory
- Copy a folder: shutil.copytree("olddir", "newdir"): both olddir and newdir must be directories, and by default newdir must not already exist
- Move a file or directory: shutil.move("oldpos", "newpos")
- Delete a directory
  - os.rmdir("dir"): only empty directories can be deleted
  - shutil.rmtree("dir"): both empty directories and directories with contents can be deleted
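A short sketch of these copy, move, and delete operations, run inside a temporary directory so nothing important is touched. The file and folder names simply mirror the placeholders used in the list above.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()

src = os.path.join(base, "oldfile.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("content")

# copyfile: file to file.
shutil.copyfile(src, os.path.join(base, "newfile.txt"))

# copy: the destination may be a directory; the new file's path is returned.
old_dir = os.path.join(base, "olddir")
os.mkdir(old_dir)
copied_path = shutil.copy(src, old_dir)

# copytree: the destination directory must not exist yet.
new_dir = os.path.join(base, "newdir")
shutil.copytree(old_dir, new_dir)

# move: works for files and directories.
shutil.move(os.path.join(base, "newfile.txt"), os.path.join(base, "moved.txt"))

names = sorted(os.listdir(base))
new_dir_names = os.listdir(new_dir)

# rmtree removes the directory together with everything inside it.
shutil.rmtree(base)
```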
End of line
A common problem encountered when working with file data is the representation of new lines or line endings. Line endings originated in the days of Morse code, when a specific symbol was used to indicate the end of a transmission or the end of a line.
Later, line endings were standardized for teletypewriters by both the International Organization for Standardization (ISO) and the American Standards Association (ASA). The ASA standard specifies that line endings should use the carriage return (CR or \r) and line feed (LF or \n) characters together (CR+LF or \r\n). The ISO standard, however, allows either CR+LF or LF alone.
Windows uses the CR+LF characters to indicate a new line, while Unix and newer Mac versions use only the LF character. This can lead to some complications when you are working with files that originate on different operating systems. Here’s a simple example. Suppose we examine the file dog_breeds.txt created on a Windows system.
The same output will be interpreted differently on Unix devices.
Solution.
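One way to handle this, sketched here with made-up file contents, is to rely on Python's text mode. By default, open() translates \r\n to \n when reading, so the same code sees the same lines on every platform. Passing newline="" turns the translation off if you need to see the raw line endings.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dog_breeds.txt")

    # Simulate a file created on Windows: every line ends with CR+LF.
    with open(path, "wb") as f:
        f.write(b"Pug\r\nJack Russell Terrier\r\n")

    # Default text mode: line endings are normalized to "\n".
    with open(path, "r", encoding="utf-8") as f:
        translated = f.read()

    # newline="" disables the translation and exposes the raw "\r\n".
    with open(path, "r", encoding="utf-8", newline="") as f:
        untranslated = f.read()

    # splitlines() copes with either convention.
    breeds = untranslated.splitlines()
```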
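Traversing folders in Python
Use os.walk()
os.walk() yields a (dirpath, dirnames, filenames) tuple for each directory it visits, so for every directory the folder path comes first, followed by its subfolder names and then its file names.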
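A minimal sketch of a walk over a small, invented directory tree. Sorting dirnames in place is optional; it is done here only to make the visiting order predictable.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical tree: top.txt, a/inner.txt, a/b/
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.txt", os.path.join("a", "inner.txt")):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as f:
            f.write("x")

    dirs_seen = []
    files_seen = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # controls the order in which subfolders are visited
        dirs_seen.append(os.path.relpath(dirpath, root))
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            files_seen.append(os.path.relpath(full_path, root))
```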
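Using os.listdir()
Calling os.listdir() recursively lets the output follow the directory tree structure. Because os.listdir() returns names in arbitrary order, sort each listing if you want alphabetical output.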
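A recursive traversal of this kind might look like the sketch below. The helper name list_tree and the sample tree are assumptions made for the example.

```python
import os
import tempfile


def list_tree(path, depth=0):
    """Return indented names under path, sorted alphabetically per folder."""
    lines = []
    for name in sorted(os.listdir(path)):
        lines.append("  " * depth + name)
        full_path = os.path.join(path, name)
        if os.path.isdir(full_path):
            # Recurse into subfolders, indenting one level deeper.
            lines.extend(list_tree(full_path, depth + 1))
    return lines


with tempfile.TemporaryDirectory() as root:
    # Hypothetical tree: top.txt, a/inner.txt, a/b/
    os.makedirs(os.path.join(root, "a", "b"))
    for rel in ("top.txt", os.path.join("a", "inner.txt")):
        with open(os.path.join(root, rel), "w", encoding="utf-8") as f:
            f.write("x")

    tree = list_tree(root)
```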
glob module
The glob module is one of the simplest modules in the standard library. Use it to find the pathnames of files that match a given pattern.
glob.glob
Returns a list of all matching file paths. Its main argument, pathname, defines the matching rule and can be either an absolute or a relative path. Example.
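A small sketch, using invented file names in a temporary directory. glob.escape() is applied to the directory part so that any special characters in it are not treated as wildcards, and the result is sorted because glob.glob() does not guarantee an order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt", "c.py"):
        open(os.path.join(root, name), "w").close()

    # "*" matches any run of characters within a single path component.
    pattern = os.path.join(glob.escape(root), "*.txt")
    matches = sorted(glob.glob(pattern))
    names = [os.path.basename(p) for p in matches]
```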
Python reads and writes file contents
Open a file with the with statement
Reading and writing files in Python requires 3 steps.
- Calling the open function, which returns a File object
- Calling the read() or write() method of the File object
- Calling the close() method of the File object to close the file
Common opening modes for files.
- 'r': read-only (the default; raises an error if the file does not exist)
- 'w': write-only (creates the file if it does not exist, and truncates it if it does)
- 'a': append to the end of the file
- 'r+': read and write
If you need to open the file in binary mode, add the character "b" to the mode, as in "rb" or "wb".
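The effect of each mode can be seen in a quick experiment. This is a sketch only, and the file name notes.txt is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")

    with open(path, "w", encoding="utf-8") as f:  # creates the file
        f.write("first\n")
    with open(path, "w", encoding="utf-8") as f:  # truncates what was there
        f.write("second\n")
    with open(path, "a", encoding="utf-8") as f:  # adds to the end
        f.write("third\n")

    with open(path, "r", encoding="utf-8") as f:  # text mode returns str
        text = f.read()
    with open(path, "rb") as f:                   # binary mode returns bytes
        raw = f.read()

    # Opening a missing file for reading raises FileNotFoundError.
    try:
        open(os.path.join(tmp_dir, "missing.txt"), "r")
    except FileNotFoundError:
        missing_raises = True
    else:
        missing_raises = False
```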
If you don’t use the with statement, the code is as follows.
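A minimal sketch of the pattern, assuming a hypothetical file data.txt that is created first so the example can run on its own:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")

# Prepare a sample file, again without using with.
f = open(path, "w", encoding="utf-8")
f.write("hello\n")
f.close()

# The three steps: open, read, close.
f = open(path, "r", encoding="utf-8")
data = f.read()
f.close()
```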
There are two problems here:
- The file handle may never be closed if you forget to call close().
- If an exception occurs while reading the file, nothing handles it.
Enhanced code.
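One way to address both problems is a try/except/finally block. The sketch below uses a hypothetical data.txt and handles OSError, which is an assumption about the kind of failure worth catching here.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")
with open(path, "w", encoding="utf-8") as sample:
    sample.write("hello\n")

f = open(path, "r", encoding="utf-8")
try:
    data = f.read()
except OSError as exc:
    # Reading failed: report the error and fall back to no data.
    print(f"Could not read {path}: {exc}")
    data = None
finally:
    # Runs whether or not an exception occurred, so the file is always closed.
    f.close()
```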
Although this code works well, it is too long. Besides having a more elegant syntax, the with statement ensures that the file is closed even if an exception is raised inside its block.
The with version of the code.
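A sketch of the same read using with, again on a hypothetical data.txt:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.txt")
with open(path, "w", encoding="utf-8") as sample:
    sample.write("hello\n")

# The file is closed automatically when the block ends,
# even if read() raises an exception.
with open(path, "r", encoding="utf-8") as f:
    data = f.read()

closed_after_block = f.closed
```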
The workflow of with.
- After the expression immediately following with is evaluated, the __enter__() method of the returned object is called, and the return value of this method is assigned to the variable following as.
- When the block of code following the with has finished executing, the __exit__() method of that object is called.
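To make this workflow visible, here is a small sketch of a custom context manager. The class name Tracker is made up for the example; it simply records the order in which things happen.

```python
class Tracker:
    """Toy context manager that records when each step runs."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this value is bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False  # do not suppress exceptions raised in the block


with Tracker() as tracker:
    tracker.events.append("body")

events = tracker.events
```

File objects returned by open() follow the same protocol, with __exit__() closing the file.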
Sometimes you may want to read a file and write to another file at the same time. Both files can be opened in a single with statement.
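As a sketch, the following reads one file line by line and writes an upper-cased copy to another. The file names and the transformation are assumptions chosen for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "source.txt")
    target_path = os.path.join(tmp_dir, "target.txt")

    with open(source_path, "w", encoding="utf-8") as f:
        f.write("pug\nbeagle\n")

    # Both files are opened together and both are closed when the block ends.
    with open(source_path, "r", encoding="utf-8") as reader, \
            open(target_path, "w", encoding="utf-8") as writer:
        for line in reader:
            writer.write(line.upper())

    with open(target_path, "r", encoding="utf-8") as f:
        result = f.read()
```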
Text reading: the difference between read(), readline(), readlines()
read()
read() is the simplest way to read a file: it loads the entire contents into memory at once as one big string.
Advantages of read().
- Convenient and simple
- Reads the file in one big string at a time, fastest
Disadvantages of read().
- When the file is too large, it will take up too much memory
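A brief sketch with a hypothetical two-line file. Passing a size to read() limits how much is read in one call, which is one way to keep memory use down for large files.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("line one\nline two\n")

    # Read everything at once.
    with open(path, "r", encoding="utf-8") as f:
        whole = f.read()

    # Read the first 5 characters, then whatever remains.
    with open(path, "r", encoding="utf-8") as f:
        first_chunk = f.read(5)
        rest = f.read()
```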
readline()
readline() reads the text one line at a time; each call returns a single line as a string
Advantages of readline().
- Small memory footprint, read line by line
Disadvantages of readline().
- Slower because it reads line by line
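A sketch of a readline() loop on a hypothetical two-line file. At the end of the file readline() returns an empty string, which is what ends the loop.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("line one\nline two\n")

    lines = []
    with open(path, "r", encoding="utf-8") as f:
        while True:
            line = f.readline()
            if not line:  # empty string means end of file
                break
            lines.append(line.rstrip("\n"))
```

Iterating over the file object directly (for line in f) reads line by line in the same way and is usually the more idiomatic form.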
readlines()
readlines() reads all the contents of the text at once; the result is a list of lines
Each line read this way ends with a '\n' line break (you can use line.rstrip('\n') to remove it)
Advantages of readlines().
- Reads text content in one go, faster
Disadvantages of readlines().
- As the text grows, it takes up more and more memory
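A short sketch on the same kind of hypothetical two-line file:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("line one\nline two\n")

    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()  # every line, each still ending in "\n"

    stripped = [line.rstrip("\n") for line in lines]
```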
Text writing: write() and writelines()
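- write() is passed a single string
- writelines() is passed a list (or any iterable) of strings; it does not add line breaks, so each string must include its own '\n'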
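The difference can be sketched as follows, with an invented file name and contents:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.txt")

    with open(path, "w", encoding="utf-8") as f:
        f.write("header\n")                  # one string
        f.writelines(["a\n", "b\n"])         # a list of strings
        # writelines() adds no separators, so append "\n" yourself.
        f.writelines(item + "\n" for item in ["c", "d"])

    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
```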