Managing Big Files On Linux

When working with big data, it becomes crucial to manage big files efficiently. Printing a single line from a 150GB text file (SQL dumps, triples, quads, or whatever) can be painfully slow and takes some thought, let alone editing the file.

Printing Big Files

Use head and tail to get the first or last N lines:

// first 5 lines
head -n 5 my_big_file

// last 5 lines
tail -n 5 my_big_file
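
The two can also be chained to slice a range out of the middle of a file, although sed (below) is more direct:

// print lines 1500 to 2000 (501 lines)
head -n 2000 my_big_file | tail -n 501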

Use sed (the stream editor) to print any single line or any range of lines:

// print line 1500
sed -n '1500p' my_big_file

// print lines 1500 to 2000
sed -n '1500,2000p' my_big_file
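
By default sed keeps scanning to the end of the file even after the requested lines have been printed. On a 150GB file, quitting right after the last match saves the rest of the read:

// print line 1500, then stop reading
sed -n '1500{p;q;}' my_big_file

// print lines 1500 to 2000, then stop reading
sed -n '1500,2000p;2000q' my_big_file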

Splitting Big Files

Editing big files becomes doable if you split the file first, edit the pieces, and finally concatenate them again.

// split a file and make every piece 1000 lines
split -l 1000 my_big_file my_big_file_segment_

// split a file and make every piece 500 MB in size
split -b 500M my_big_file my_big_file_segment_

Both commands will output my_big_file_segment_aa, my_big_file_segment_ab, and so on.
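
Splitting by bytes can cut a line in two at the boundary. If the pieces must stay line-aligned for editing, GNU split's -C flag caps the piece size but breaks only at line endings:

// split into pieces of at most 500 MB without cutting lines in half
split -C 500M my_big_file my_big_file_segment_

Because the suffixes sort alphabetically, a shell glob puts the pieces back together in the right order:

// concatenate the pieces back into one file
cat my_big_file_segment_* > my_big_file_rejoined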

Counting Lines

When splitting files, it can be useful to know how many lines a file has.

wc -l my_big_file
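
With GNU split, the line arithmetic can be skipped entirely by asking for a fixed number of chunks that break only at line boundaries:

// split into 10 pieces without cutting lines in half
split -n l/10 my_big_file my_big_file_segment_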

Remove First Line

Removing the first line is handy for stripping the header row from CSV files:

sed -i '1d' my_big_file.csv
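
Note that sed -i rewrites the whole file in place, which takes a while on a huge file. Streaming everything from line 2 onward into a new file does the same job without touching the original:

// copy everything except the first line to a new file
tail -n +2 my_big_file.csv > my_big_file_no_header.csv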