This repository was archived by the owner on Nov 11, 2023. It is now read-only.

Awk and Sort Tricks

JP de Vooght edited this page May 16, 2020 · 1 revision

Sampling

nucoll uses text files to store its data. If you have access to GNU/Linux utilities such as grep, sort, or head you can perform additional steps in data preparation.

For example, the command below samples 100 random handles from a large .dat file.

$ sort -R input.dat | head -n 100 >output.dat
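One caveat: GNU sort -R orders lines by a random hash of the key, so identical lines always end up adjacent and a file with duplicate handles is not sampled uniformly. Where GNU coreutils' shuf is available, it draws a uniform sample directly. A minimal sketch (input.dat stands in for your data file):

```shell
# shuf -n 100 emits 100 lines chosen uniformly at random, without replacement.
# Unlike sort -R, duplicate lines are treated as independent candidates.
shuf -n 100 input.dat > output.dat
```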

Diff'ing Sets

I saved some handles on urban intelligence in a list and noticed a similar "smart cities" list from @dr_rick. How can I find the handles in my list that @dr_rick didn't include?

First, let's retrieve the data.

$ nucoll init -m smart-cities dr_rick

$ nucoll init -m urbanintel jdevoo

Now, let's use awk to answer the question above.

awk -F ',' 'NR==FNR {m[$1]++; next} !m[$1]' dr_rick.dat jdevoo.dat > diff.dat

This command tells awk to use a comma as the field separator and to build a lookup array m keyed on the identifiers found in the first file (dr_rick.dat); the counts stored as values are not used here, only the keys matter. While reading the first file, NR==FNR holds and next skips the print step. For the second file, awk prints each line whose identifier $1 is not a key in the lookup array. The result is not displayed but written to diff.dat.
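The set-difference behavior is easy to verify on toy data. A small sketch with made-up handles and file names, standing in for dr_rick.dat and jdevoo.dat:

```shell
# Build two small comma-separated files; field 1 is the handle.
printf 'alice,1\nbob,2\ncarol,3\n' > theirs.dat
printf 'bob,9\ndave,4\ncarol,5\neve,6\n' > mine.dat

# Keep only lines of mine.dat whose handle never appears in theirs.dat.
awk -F ',' 'NR==FNR {m[$1]++; next} !m[$1]' theirs.dat mine.dat
# Prints the dave and eve lines: those handles occur only in mine.dat.
```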

You could then process diff.dat with nucoll fetch diff and move on to generating the .gml file with edgelist.
