Tricks concerning awk, sed, csplit and other tools that work with unstructured text.

Awk

A data-driven programming language 💓. Gawk: Effective AWK Programming (gawk manual)

Misc

Some “aha, I forgot!” awk snippets.

# Don't print newlines: use printf e.g:
{sum+=$4}; END  {printf "%f",sum/NR}
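A quick way to try it (the two-line input is made up; column 4 gets averaged):

```shell
# Average the 4th column; printf avoids the trailing newline that print adds
printf '1 2 3 10\n1 2 3 20\n' | awk '{sum+=$4} END {printf "%f", sum/NR}'
# → 15.000000
```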

# Split $NF by ":" into array a (arrays indexed from 1):
{split($NF,a,":"); print a[1]}
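For example (a made-up colon-separated record; the line has no whitespace, so the whole line is the single field $NF):

```shell
echo 'alice:x:1000' | awk '{split($NF,a,":"); print a[1]}'
# → alice
```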


# Jump to next line while processing
/pattern/{
    # Do something with the line
    next # skips last "catch all block"
    }
/next_pattern/{next}

{# catch all block
}
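A minimal sketch of that control flow (patterns and input invented for illustration):

```shell
printf 'skip me\nnext me\nkeep me\n' | awk '
  /skip/ { next }               # handled, jump to the next input line
  /next/ { next }
  { print "caught:", $0 }       # catch-all runs only for unmatched lines
'
# → caught: keep me
```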

Uniq via awk

AWK can be used as a quicker alternative to sort | uniq: it doesn’t have to sort the input, it uses a hash table, and as a bonus it preserves the original line order. It has to keep every unique line in memory though. If you really need speed, my best choice is quniq.c

awk '!visited[$0]++' your_file > deduplicated_file
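Unlike sort | uniq, this keeps the first occurrence of each line in its original position:

```shell
# Each line's counter is 0 on first sight, so !visited[$0]++ is true exactly once
printf 'b\na\nb\nc\na\n' | awk '!visited[$0]++'
# → b
#   a
#   c
```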

Sed, tail, head

# sed - remove 1st line
sed '1d' xx01

# Tail - omit first line
# "start passing through on the second line of output".
cat file | tail -n +2

# Head - omit last line (negative -n is a GNU extension, not in BSD/macOS head)
cat /etc/passwd | head -n -1
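Side by side on a throwaway three-line file (the /tmp path is just for illustration):

```shell
printf 'one\ntwo\nthree\n' > /tmp/f
sed '1d' /tmp/f        # two, three
tail -n +2 /tmp/f      # two, three
head -n -1 /tmp/f      # one, two   (GNU head only)
```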

csplit

Split a file into multiple files. Each resulting file will have the split string as its first line. Filenames are generated automatically (xx00, xx01, …).

# splitting on '</doc>'
csplit <filename> '/</doc>/' '{*}'
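A sketch with a toy file (escaping the inner slash keeps the /regex/ delimiter unambiguous; -q silences the per-file byte counts):

```shell
cd "$(mktemp -d)"
printf 'a\n</doc>\nb\n</doc>\nc\n' > docs.txt
csplit -q docs.txt '/<\/doc>/' '{*}'
ls          # xx00 xx01 xx02
cat xx01    # </doc> then b — each match starts a new file
```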

base64

Generate random ASCII text.

base64 /dev/urandom # endless stream of wrapped lines

# only alphanum
base64 /dev/urandom | sed 's/[+/]/a/g' | head -c 1024
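A tr-based variant I find handy when I want a fixed-length token with no newlines (tr -dc deletes every byte outside the given set):

```shell
tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 16; echo
```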

Worth noting

  • sd: fast sed alternative. Does not play well with endless streams though. Written in Rust.
  • ripgrep: fastest (AFAIK) grep alternative. Written in Rust.