Robert Mesibov
2023-08-16 23:48:57 UTC
I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:
awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo
which returns
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple
I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:
002 pear bb hat apple
007 pear gg hat apple
001 rose aa hat apple
003 rose cc hat apple
I can do that by piping the first AWK command's output to
sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'
but this seems like a lot of coding for a result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:
awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo
which returns
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple
I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:
002 pear bb hat apple
007 pear gg hat apple
001 rose aa hat apple
003 rose cc hat apple
I can do that by piping the first AWK command's output to
sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'
but this seems like a lot of coding for a result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.