Tricks concerning awk, sed, csplit and other tools that work with unstructured text.

Awk

A data-driven programming language 💓. Gawk: Effective AWK Programming (gawk manual)

Misc

Some “aha, I forgot!” awk snippets.

# Don't print newlines: use printf e.g:
{sum+=$4}; END  {printf "%f",sum/NR}
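A quick way to try it (the two-line input is made up; column 4 gets averaged):

```shell
# Average the 4th column; printf avoids the trailing newline that print adds
printf '1 2 3 10\n1 2 3 20\n' | awk '{sum+=$4} END {printf "%f", sum/NR}'
# → 15.000000
```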

# Split $NF by ":" into array a (arrays indexed from 1):
{split($NF,a,":"); print a[1]}
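For example (a made-up colon-separated record; the line has no whitespace, so the whole line is the single field $NF):

```shell
echo 'alice:x:1000' | awk '{split($NF,a,":"); print a[1]}'
# → alice
```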


# Jump to next line while processing
/pattern/{
    # Do something with the line
    next # skips last "catch all block"
    }
/next_pattern/{next}

{# catch all block
}
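A minimal sketch of that control flow (patterns and input invented for illustration):

```shell
printf 'skip me\nnext me\nkeep me\n' | awk '
  /skip/ { next }               # handled, jump to the next input line
  /next/ { next }
  { print "caught:", $0 }       # catch-all runs only for unmatched lines
'
# → caught: keep me
```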

Uniq via awk

AWK can be used as a quicker alternative to sort | uniq: it doesn’t have to sort the input, it uses a hash table, and as a bonus it preserves the original line order. It has to keep every unique line in memory though. If you really need speed, my best choice is quniq.c

awk '!visited[$0]++' your_file > deduplicated_file
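Unlike sort | uniq, this keeps the first occurrence of each line in its original position:

```shell
# Each line's counter is 0 on first sight, so !visited[$0]++ is true exactly once
printf 'b\na\nb\nc\na\n' | awk '!visited[$0]++'
# → b
#   a
#   c
```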

Sed, tail, head

# sed - remove 1st line
sed '1d' xx01

# Tail - omit first line
# "start passing through on the second line of output".
cat file | tail -n +2

# Head - omit last line (negative -n is a GNU extension, not in BSD/macOS head)
cat /etc/passwd | head -n -1
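Side by side on a throwaway three-line file (the /tmp path is just for illustration):

```shell
printf 'one\ntwo\nthree\n' > /tmp/f
sed '1d' /tmp/f        # two, three
tail -n +2 /tmp/f      # two, three
head -n -1 /tmp/f      # one, two   (GNU head only)
```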

csplit

Split a file into multiple files. Each resulting file will have the split string as its first line. Filenames are generated automatically (xx00, xx01, …).

# splitting on '</doc>'
csplit <filename> '/</doc>/' '{*}'
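A sketch with a toy file (escaping the inner slash keeps the /regex/ delimiter unambiguous; -q silences the per-file byte counts):

```shell
cd "$(mktemp -d)"
printf 'a\n</doc>\nb\n</doc>\nc\n' > docs.txt
csplit -q docs.txt '/<\/doc>/' '{*}'
ls          # xx00 xx01 xx02
cat xx01    # </doc> then b — each match starts a new file
```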

base64

Generate random ASCII text.

base64 /dev/urandom # endless stream of wrapped lines

# only alphanum
base64 /dev/urandom | sed 's/[+/]/a/g' | head -c 1024
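A tr-based variant I find handy when I want a fixed-length token with no newlines (tr -dc deletes every byte outside the given set):

```shell
tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 16; echo
```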

Worth noting

  • sd: fast sed alternative. Does not play well with endless streams though. Written in Rust.
  • ripgrep: fastest (AFAIK) grep alternative. Written in Rust.