
1.2 Manage Files and Directories

File Editing

Note: throughout this writing I use the words column and field interchangeably, as well as record and row.

In the Unix Philosophy, programs are written to handle text streams, as they are the universal interface. The following programs are essential for manipulating these text streams for common tasks at the terminal as well as more complex bash scripting in Linux.

Cut

The simplest command for manipulating fields is cut. The default delimiter is a tab, but another can be specified with -d[character]; only a single character is supported. Select the desired fields with -f.
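For instance, with the default tab delimiter (using printf here just to fabricate a hypothetical tab-separated line):

~$ printf 'one\ttwo\tthree\n' | cut -f2
two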

Let's take a look at the /etc/passwd file, which contains all human and non-human users on the system.

~$ cat /etc/passwd 
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin

Now let's just grab the username and shell. I've set the delimiter with -d: and specified that I want fields 1 and 7 with -f1,7.

~$ cut -d: -f1,7 /etc/passwd
root:/bin/bash
daemon:/usr/sbin/nologin
bin:/usr/sbin/nologin

Awk

The awk command is a much larger utility: an entire programming language in and of itself. It supports selecting both columns and rows. Awk is often compared to Perl (Practical Extraction and Report Language) for complex text/data processing jobs, such as handling large CSV files.

Specify your field separator with the -F flag, and make sure it's uppercase. Let's print out just the username and shell of /etc/passwd, as before with the cut command.

~$ awk -F: '{print $1, $7}' /etc/passwd
root /bin/bash
daemon /usr/sbin/nologin
bin /usr/sbin/nologin

Unlike cut, which just supports field separation, awk can select records (rows) with NR (the record number).

~$ awk -F: 'NR==2 {print $1, $7}' /etc/passwd
daemon /usr/sbin/nologin

If a record has 8 fields, $8 will select the last field, as the number of fields (NF) is eight. In this case $NF would be equivalent to $8. Knowing this, we can always select the last field of any given record with $NF.

~$ awk -F: '{print $NF}' /etc/passwd
/bin/bash
/usr/sbin/nologin
/usr/sbin/nologin

Awk is a very powerful tool and this is just the tip of the iceberg, but I hope this was a good introduction.

Sed

Say we have a file todo.txt and want to manipulate it by records/rows.

~$ cat todo.txt 
1. water plants
2. walk dog
3. finish homework
4. study linux

By default, sed prints every line of the file in addition to the line specified with p, so the specified line appears twice.

~$ sed '1p' todo.txt 
1. water plants
1. water plants
2. walk dog
3. finish homework
4. study linux

To prevent this, use the -n flag for (n)o output. Think of it as (n)ot printing the whole file, only what you specify.

~$ sed -n '1p' todo.txt 
1. water plants

Specify ranges with a comma.

~$ sed -n '1,3p' todo.txt 
1. water plants
2. walk dog
3. finish homework

Use a semicolon as a delimiter between multiple commands.

~$ sed -n '1p;4p' todo.txt 
1. water plants
4. study linux

When deleting lines, omit the -n flag.

~$ sed '1,2d' todo.txt 
3. finish homework
4. study linux

Besides selecting records of a file to print, we can find and replace with regex syntax. Say I have a Dockerfile and I've decided to change the username. Instead of manually editing the file and changing each instance, I can use sed to substitute each value.

~$ sed 's/promptier/NEWUSERNAME/' Dockerfile 
FROM ubuntu
RUN apt-get update && \
    apt-get install -y sudo && \
    useradd -m NEWUSERNAME && \
    echo "NEWUSERNAME:1234" | chpasswd && \
    adduser NEWUSERNAME sudo

USER NEWUSERNAME

As this example only prints to STDOUT, we can choose instead to edit the file in place with the -i flag or redirect the output to a new file.

~$ sed -i 's/promptier/NEWUSERNAME/' Dockerfile 
OR
~$ sed 's/promptier/NEWUSERNAME/' Dockerfile > Dockerfile2

Keep in mind the substitution runs a single time on each line. If there were two or more instances of "promptier" on a single line, I would add the global flag g at the end to replace all instances: sed 's/promptier/NEWUSERNAME/g'
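A quick hypothetical demonstration with echo:

~$ echo "promptier likes promptier" | sed 's/promptier/NEWUSERNAME/'
NEWUSERNAME likes promptier
~$ echo "promptier likes promptier" | sed 's/promptier/NEWUSERNAME/g'
NEWUSERNAME likes NEWUSERNAME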

Two useful characters to know are the caret ^ and dollar sign $, which represent the beginning and end of a line. This is common regex syntax used in both sed and Vim. As an example, the way to delete all blank lines is to match lines whose beginning is immediately followed by the end.

~$ cat todo.txt 
1. water plants

2. walk dog

3. finish homework

4. study linux

~$ sed '/^$/d' todo.txt 
1. water plants
2. walk dog
3. finish homework
4. study linux

It is possible to insert lines before or after a match with i and a. Here I append a note after "walk dog"; I could instead insert before the line with i. Note the inserted text begins after a backslash.

❯ sed '/dog/a\  -remember a leash' todo.txt 
1. water plants
2. walk dog
  -remember a leash
3. finish homework
4. study linux
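For comparison, inserting before the match with i:

❯ sed '/dog/i\  -remember a leash' todo.txt 
1. water plants
  -remember a leash
2. walk dog
3. finish homework
4. study linux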

Swap characters with y. This works like the tr (translate) command.

❯ sed 'y/sed/XYZ/' todo.txt 
1. watYr plantX
2. walk Zog
3. finiXh homYwork
4. XtuZy linux

❯ cat todo.txt | tr 'sed' 'XYZ' 
1. watYr plantX
2. walk Zog
3. finiXh homYwork
4. XtuZy linux

Say I have a bash script with many comments and blank lines. I could make a short, concise version by removing the comments and the blank lines.

sed '/^#/d;/^$/d' verbose-script.sh > short-script.sh
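One caveat: a shebang also begins with #, so /^#/d will delete it too. A quick sketch with a hypothetical verbose-script.sh:

❯ cat verbose-script.sh
#!/bin/bash
# greet the user

echo "hello"
❯ sed '/^#/d;/^$/d' verbose-script.sh
echo "hello"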

Say I have a very long file of authors with their relevant works and info. I could see on which lines "Hubbard" appears with =. This is very similar to grep -n.

❯ sed -n '/Hubbard/=' poems.csv 
2
3
❯ grep -n 'Hubbard' poems.csv 
2:1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET I./Sent to Miss  , with a Braid of Hair.
3:1771/1798/E. H. (Elihu Hubbard)/Smith/8/SONNET II./Sent to Mrs.  , with a Song.

Sort

By default, the sort command sorts lines alphabetically (effectively by the first column, since it compares from the start of each line), and it separates columns by whitespace unless the separator is specified with -t.

Say I wanted to sort kernel modules by their size. I would numeric sort with -n, then choose the second column/key with -k 2.

lsmod | sort -n -k 2
Module                  Size  Used by
algif_hash             16384  1
algif_skcipher         16384  1
cmac                   16384  2

Now say I want to sort the users in /etc/passwd based on their user ID, which is the third colon-separated column, in descending order. I can use -t for the field separator and -k3 for the third column. For descending order I will use the reverse flag -r and make sure to numeric sort with -n.

sort -t: -rn -k3 /etc/passwd
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
joseph:x:1000:1000:joseph,,,:/home/joseph:/bin/bash
hplip:x:126:7:HPLIP system user,,,:/run/hplip:/bin/false
fwupd-refresh:x:125:135:fwupd-refresh user,,,:/run/systemd:/usr/sbin/nologin

Keep in mind that /etc/passwd is a file we are reading, so it can come after the command. The previous command, lsmod, was not a file, so I had to pipe its output into sort.

File Metadata

Inodes keep track of files and directories in Linux; they contain everything but the filename and the contents. This includes permissions, timestamps, file size, and ownership. Find exact inode numbers with ls -i.

ls -i
18219914 file1  18221167 file2  18256138 file3

Two commands are used to check metadata: file and stat. file is much simpler and is often used to check the type of a file. stat shows much more information, including file size, modification and access times, user and group ID, etc. This is read directly from the inode.

$ file obs.sh
obs.sh: Bourne-Again shell script, ASCII text executable

$ stat obs.sh
  File: obs.sh
  Size: 694         Blocks: 8          IO Block: 4096   regular file
Device: 802h/2050d  Inode: 18220124    Links: 2
Access: (0775/-rwxrwxr-x)  Uid: ( 1000/     joe)   Gid: ( 1000/     joe)
Access: 2023-12-09 12:58:05.139821191 -0600
Modify: 2023-10-15 11:13:26.272931896 -0500
Change: 2023-12-09 12:58:02.799821244 -0600
 Birth: 2023-10-15 11:13:26.272931896 -0500

Archiving and Compressing

Compressed files use less disk space and download faster over a network. On Windows, files are often downloaded in a zip format and then unzipped (decompressed). When installing software on Linux, files are often archived and compressed, with a tar.gz ending. An archive is simply a collection of files and/or directories created with tar (tape archive). The gz ending means it has been compressed using gzip.

As an example, I will extract (meaning take from an archive) and decompress the files in nvim-linux64.tar.gz. x is extract, v is verbose, f is file, and C places the output in a directory.

sudo tar -xvf nvim-linux64.tar.gz -C /usr/local/lib

Note it is not necessary to mention the compression type, such as z for gzip, J for xz and so on. tar will automatically detect the compression type. These flags are only necessary when creating compressed archives.
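Conversely, to create a gzip-compressed archive of a hypothetical directory mydir, use c for create along with the compression flag:

tar -czvf mydir.tar.gz mydir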

In all the compression tools, there is a direct relationship between CPU processing time (compression time) and compression ratio, which means a trade-off. If you want the maximum amount of disk space saved, it will require more CPU time to compress. If you want to save on CPU time, a lower compression ratio can be used, but the result will use more disk space.

In general xz achieves the highest compression level, followed by bzip2 and then gzip. In order to achieve better compression however xz usually takes the longest to complete, followed by bzip2 and then gzip.
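These tools also expose the trade-off as numbered levels, from -1 (fastest, lower ratio) to -9 (slowest, best ratio). A sketch with gzip, using -c to write to stdout so the input is left alone (somefile is hypothetical):

gzip -1 -c somefile > fast.gz     # quick, lower compression ratio
gzip -9 -c somefile > small.gz    # slow, higher compression ratio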

Each program has similar syntax. With no arguments, they compress; -d will decompress. Each deletes the input file during compression or decompression. Use -k to keep it, verbose -v to display the compression ratio, and -t for an integrity test on compressed files.
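A quick sketch of those flags with gzip on the same hypothetical file (xz and bzip2 follow the same pattern):

gzip -v somefile       # compress, printing the compression ratio
gzip -t somefile.gz    # integrity test, silent on success
gzip -dk somefile.gz   # decompress, keeping somefile.gz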

To demonstrate, I will create a 100 megabyte file and compress it with each tool. dd is a powerful but dangerous command, so be careful. /dev/zero as the input file (if) is just an endless stream of null characters, which are ASCII code zero and signify the end of a string (\0).

❯ dd if=/dev/zero of=bigfile bs=1M  count=100 
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.0834263 s, 1.3 GB/s
❯ xz -k bigfile; gzip -k bigfile; bzip2 -k bigfile
❯ ls -lh
total 101M
-rw-rw-r-- 1 promptier promptier 100M Dec 23 10:11 bigfile
-rw-rw-r-- 1 promptier promptier  113 Dec 23 10:11 bigfile.bz2
-rw-rw-r-- 1 promptier promptier 100K Dec 23 10:11 bigfile.gz
-rw-rw-r-- 1 promptier promptier  16K Dec 23 10:11 bigfile.xz

Comparing the sizes, gzip shrunk it down to 100K, while xz came in at 16K. bzip2 came in at just 113 bytes, presumably because its run-length encoding collapses a file of identical bytes almost entirely.

Backup with dd

dd is a tool that will copy everything from a file, bit by bit, in the same order. It is known by many names, including "disk dump", "disk destroyer", and "(d)on't (d)o it". It can be dangerous if you don't know what you are doing. Specify your input file with if and output file with of. The following are some common uses.

Create a backup image file of a drive, then restore it to another drive.

dd if=/dev/sda of=~/backup.img
dd if=~/backup.img of=/dev/sdb

You could also copy directly to another drive if desired.
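For instance, cloning one drive straight onto another (assuming /dev/sda is the source and /dev/sdb the destination; double-check the device names before running anything like this):

dd if=/dev/sda of=/dev/sdb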

We can also use special character devices to zero out/wipe a disk, or fill it with random data (a common step before setting up disk encryption).

dd if=/dev/zero of=/dev/sda
dd if=/dev/urandom of=/dev/sdb

Text Editors

Though I will not provide an entire guide to either Nano or Vim, I will touch on them briefly and give my advice on which you should spend your time using.

Nano is easy to use and will perform the job just as well as Vim. However, if you plan to spend a lot of time at the terminal, learning Vim will initially set you back in terms of speed and productivity, but the investment pays off within one to two weeks of use.

Vim movements are not just confined to the editor. Try opening up a man page and you will find the same navigation keys such as k and j for up and down, gg and G to move to the top and bottom of the file, and / to search.

less is a program that displays the contents of a file one page/screen at a time, as opposed to using cat and then viewing the contents through stdout. Similar to man pages, less also has vim like navigation.
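For example (j/k to scroll, / to search, q to quit):

~$ less /etc/passwd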

Overall, learning Vim will not just help you while using the editor, but provide universal navigation across many programs on Unix-like systems. Additionally, once you have committed the Vim movements to muscle memory, you can bring these keybindings to any text editor or note-taking app. VS Code and Obsidian are two examples which support Vim, either natively or through a third-party extension. This makes Vim a valuable skill to acquire for Linux and beyond.

Rsync

Rsync (remote synchronization) originated as an algorithm presented in a PhD thesis on sorting and synchronization for remote data update. In the first phase, a client and server are defined: the client initiates and connects to the server either locally, via a remote shell, or via a network socket. After the connection is established, the client and server assume the roles of sender and receiver respectively. Rsync takes into account the file path, ownership, mode (read, write, etc.), permissions, size, modification time, and an optional checksum. Primarily, rsync determines which files differ between the sender and receiver by comparing the modification time and size of each file.

The general use is to copy the contents of dir1 to dir2. Keep in mind that if a file already exists in dir2, rsync will simply skip it, unless the source has a more recent modification time, in which case it will be updated; hence the nature of syncing.

~$ rsync -r dir1/ dir2

Instead of recursive -r, many use archive -a, which preserves links, timestamps, permissions, etc., as well as copying recursively. Add -v for verbose output.
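The same copy with archive mode and verbose output would look like:

~$ rsync -av dir1/ dir2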

If copying a large number of files to a remote server, it is recommended to do a dry run first, simulating the command without actually running it to catch errors before they cause problems.

rsync -av --dry-run send-dir joe@8.8.8.212:/home/receive-dir

Here is how I backed up my downloads folder to an external USB drive.

sudo rsync -av /home/promptier/downloads/ /media/promptier/USB/downloads/