Discussion:
Simplify an AWK pipeline?
Robert Mesibov
2023-08-16 23:48:57 UTC
Permalink
I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.

fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron

To find the partial duplicate records which are identical except in those unique codes, I can parse "demo" twice like this:

awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo

which returns

001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple

I would like those 2 sets of partial duplicates (the rose-hat-apple set and the pear-hat-apple set) to be sorted alphabetically and separated, like this:

002 pear bb hat apple
007 pear gg hat apple

001 rose aa hat apple
003 rose cc hat apple

I can do that by piping the first AWK command's output to

sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'

but this seems like a lot of coding for a result. I'd be grateful for suggestions on how to get the sorted, separated result in a single AWK command, if possible.
Kaz Kylheku
2023-08-17 00:59:20 UTC
Permalink
Post by Robert Mesibov
I'm hoping for ideas to simplify a duplicate-listing job. In this space-separated file "demo", fields 1 and 3 are codes unique to each record.
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo
which returns
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple
I would like those 2 sets of partial duplicates (the rose-hat-apple
set and the pear-hat-apple set) to be sorted alphabetically and
002 pear bb hat apple
007 pear gg hat apple
001 rose aa hat apple
003 rose cc hat apple
Like this?

$ txr group.tl < data
002 pear bb hat apple
007 pear gg hat apple

006 pear ff law tiger

001 rose aa hat apple
003 rose cc hat apple

008 shoe hh cup heron

004 shoe dd try tiger

009 worm ii cup heron

005 worm ee law tiger

$ cat group.tl
(flow (get-lines)
      (sort-group @1 (opip (spl " ") [callf list* 1 3 4..:]))
      (each ((group @1))
        (put-lines group)
        (put-line)))

Here's a dime kid, ...
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Kaz Kylheku
2023-08-17 23:00:45 UTC
Permalink
Post by Kaz Kylheku
(flow (get-lines)
^^^^^^^^
Post by Kaz Kylheku
[...]
This selects the second, fourth and fifth fields and each field after
the fifth, as the non-unique fields on which to group.

I inferred the requirement that the complement of the unique fields
should be used: all fields which are not the unique ones.
Janis Papanagnou
2023-08-17 03:38:54 UTC
Permalink
Post by Robert Mesibov
I'm hoping for ideas to simplify a duplicate-listing job. In this
space-separated file "demo", fields 1 and 3 are codes unique to each
record.
fld1 fld2 fld3 fld4 fld5
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
004 shoe dd try tiger
005 worm ee law tiger
006 pear ff law tiger
007 pear gg hat apple
008 shoe hh cup heron
009 worm ii cup heron
To find the partial duplicate records which are identical except in
awk 'FNR==NR {$1=$3=1; a[$0]++; next} {x=$0; $1=$3=1} a[$0]>1 {print x}' demo demo
which returns
001 rose aa hat apple
002 pear bb hat apple
003 rose cc hat apple
007 pear gg hat apple
I would like those 2 sets of partial duplicates (the rose-hat-apple
set and the pear-hat-apple set) to be sorted alphabetically and
002 pear bb hat apple
007 pear gg hat apple
001 rose aa hat apple
003 rose cc hat apple
I can do that by piping the first AWK command's output to
sort -t" " -k2 | awk 'NR==1 {print; $1=$3=1; x=$0} NR>1 {y=$0; $1=$3=1; print $0==x ? y : "\n"y; x=$0}'
but this seems like a lot of coding for a result. I'd be grateful for
suggestions on how to get the sorted, separated result in a single
AWK command, if possible.
You can alternatively do it all in one awk instance, e.g. like this...

{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
END { for(k in a) if (c[k]>1) print a[k] }

which is not (or not much) shorter character-wise, but it doesn't
need the external sort command, it is all in one awk instance (as
you want), and it is single-pass. (I think the code is also a bit
clearer than the one you posted above, but YMMV.)
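[Editorial aside: Janis's two lines can be exercised as below, with a cut-down sample fed via printf. Note that the for (k in a) loop visits the groups in an unspecified order, and each group prints with a leading blank line because a[k] starts with RS.]

```shell
# Run the grouping snippet on a cut-down sample: only keys
# (fld2, fld4, fld5) seen more than once are printed.
out=$(printf '%s\n' \
  '001 rose aa hat apple' \
  '002 pear bb hat apple' \
  '003 rose cc hat apple' \
  '004 shoe dd try tiger' \
  '007 pear gg hat apple' |
awk '{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
     END { for(k in a) if (c[k]>1) print a[k] }')
printf '%s\n' "$out"
```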

Janis
Robert Mesibov
2023-08-17 20:56:47 UTC
Permalink
Post by Janis Papanagnou
You can alternatively do it (e.g.) in one instance also like this...
{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
END { for(k in a) if (c[k]>1) print a[k] }
which is not (not much) shorter character wise but doesn't need the
external sort command, it is all in one awk instance (as you want),
and single pass. (I think the code is also a bit clearer than the
one you posted above, but YMMV.)
Janis
Many thanks, Janis, that's very nice, but it depends on specifying the non-unique fields 2, 4 and 5. In the real-world cases I work with, there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID fields (2, 4, 5...300+). That's why I replace the unique-ID fields with the arbitrary value "1" when testing for duplication.

Bob
Kenny McCormack
2023-08-17 21:27:41 UTC
Permalink
In article <b00f43d1-f50f-44ca-bb1f-***@googlegroups.com>,
Robert Mesibov <***@gmail.com> wrote:
...
Many thanks, Janis, that's very nice, but it depends on specifying the
non-unique fields 2, 4 and 5. In the real-world cases I work with,
there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
the arbitrary value "1" when testing for duplication.
1) Well, it seems like it shouldn't be too hard for you to retrofit your
hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
"" instead of 1.

2) You probably don't need to mess with SUBSEP. Your data seems to be OK
with assuming no embedded spaces (i.e., using space as the delimiter is OK).
Note that SUBSEP is intended to be used as the delimiter for the
implementation of old-fashioned pseudo-multi-dimensional arrays in AWK, but
nobody uses that functionality anymore. Therefore, some AWK programmers
have co-opted SUBSEP as a symbol provided by the language to represent a
character that is more-or-less guaranteed to never occur in user data.

3) I don't see how Janis's solution implements your need for sorting.
Unless he is using the WHINY_USERS option. Or asort or asorti or
PROCINFO["sorted_in"] or ...
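[Editorial sketch: the PROCINFO["sorted_in"] route would look roughly like this. It assumes GNU awk; other awks will still run it, but the group order is then unspecified.]

```shell
# With GNU awk, PROCINFO["sorted_in"] = "@ind_str_asc" makes
# "for (k in a)" traverse keys in ascending string order, so the
# duplicate groups come out alphabetically without an external sort.
out=$(printf '%s\n' \
  '001 rose aa hat apple' \
  '002 pear bb hat apple' \
  '003 rose cc hat apple' \
  '007 pear gg hat apple' |
awk '
{ x = $0; $1 = $3 = ""
  a[$0] = ($0 in a ? a[$0] RS : "") x
  c[$0]++ }
END {
  PROCINFO["sorted_in"] = "@ind_str_asc"   # no effect outside gawk
  for (k in a) if (c[k] > 1) print a[k] RS
}')
printf '%s\n' "$out"
```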
--
"Every time Mitt opens his mouth, a swing state gets its wings."

(Should be on a bumper sticker)
Janis Papanagnou
2023-08-17 22:03:50 UTC
Permalink
Post by Kenny McCormack
...
Many thanks, Janis, that's very nice, but it depends on specifying the
non-unique fields 2, 4 and 5. In the real-world cases I work with,
there are 1-2 unique ID code fields and sometimes 300+ non-unique-ID
fields (2, 4, 5...300+). That's why I replace the unique-ID fields with
the arbitrary value "1" when testing for duplication.
1) Well, it seems like it shouldn't be too hard for you to retrofit your
hack ($1 = $3 = 1) into Janis's hack. FWIW, I would probably just set to
"" instead of 1.
Yes, indeed. (See my other post.)
Post by Kenny McCormack
2) You probably don't need to mess with SUBSEP. Your data seems to be OK
with assuming no embedded spaces (i.e., so using space as the delimiter is OK)
Note that SUBSEP is intended to be used as the delimiter for the
implementation of old-fashioned pseudo-multi-dimensional arrays in AWK, but
nobody uses that functionality anymore. Therefore, some AWK programmers
have co-opted SUBSEP as a symbol provided by the language to represent a
character that is more-or-less guaranteed to never occur in user data.
Yes, SUBSEP is the default separation character for array subscripts. Of
course you can use other characters (that require less text). Why
you think that "nobody uses that functionality anymore" is beyond
me; I doubt you have any evidence for that, so I interpret it just
as "I [Kenny] don't use it anymore.", which is fine by me.
Post by Kenny McCormack
3) I don't see how Janis's solution implements your need for sorting.
Sorting can make sense at three different levels here.

I interpreted the OP as doing the 'sort' just to be able to compare
the actual data set with the previous data set, to have them together;
this is unnecessary, though, with the approach I used, with the keys
in an associative array. Since the original data is also already sorted
by a unique numeric key, and I sequentially concatenate the data, it's
also not necessary to sort the data in that respect. So what's left
is the third thing that can be sorted, and that's the order of the
classes; that all, say, "pear" elements come before all "rose"
elements. This sort, in case it is desired, is not reflected
in my approach.

Janis
Post by Kenny McCormack
Unless he is using the WHINY_USERS option. Or asort or asorti or
PROCINFO["sorted_in"] or ...
Robert Mesibov
2023-08-17 22:36:38 UTC
Permalink
Apologies for not explaining that there are numerous non-unique-ID fields, and yes, what I am aiming for is a sort beginning with the first non-unique-ID field.

My code is complicated because I need to preserve the original records for the output, while also modifying the original records by "de-uniquifying" the unique-ID fields in order to hunt for partial duplicates.

I'll continue to tinker with this and report back if I can simplify the code, but I would be grateful for any other AWK solutions.
Janis Papanagnou
2023-08-17 23:47:10 UTC
Permalink
Post by Robert Mesibov
I'll continue to tinker with this and report back if I can simplify
the code, but I would be grateful for any other AWK solutions.
For any additional sorting Kenny gave hints (see his point 3) that
can simply be added if you're using GNU awk.

Janis
Robert Mesibov
2023-08-18 08:23:00 UTC
Permalink
Many thanks again, Janis. I doubt that I can improve on

awk '{x=$0; $1=$3=1; y=$0; a[y]=a[y] RS x; b[y]++}; END {for (i in a) if (b[i]>1) print a[i]}' demo

and the sorting isn't critical.

Bob
Kenny McCormack
2023-08-21 14:54:53 UTC
Permalink
In article <ubm5g7$3u7rt$***@dont-email.me>,
Janis Papanagnou <janis_papanagnou+***@hotmail.com> wrote:
...
Post by Janis Papanagnou
2) You probably don't need to mess with SUBSEP. Your data seems
to be OK with assuming no embedded spaces (i.e., so using space
as the delimiter is OK) Note that SUBSEP is intended to be
used as the delimiter for the implementation of old-fashioned
pseudo-multi-dimensional arrays in AWK, but nobody uses that
functionality anymore. Therefore, some AWK programmers have co-opted
SUBSEP as a symbol provided by the language to represent a character
that is more-or-less guaranteed to never occur in user data.
Yes, SUBSEP is the default separation character for array subscripts. Of
course you can use other characters (that require less text). Why
you think that "nobody uses that functionality anymore" is beyond
me; I doubt you have any evidence for that, so I interpret it just
as "I [Kenny] don't use it anymore.", which is fine by me.
It may be a language barrier - I understand that English is not your first
language - but in colloquial English, the phrase "nobody does X anymore"
often means something close to "nobody should do X anymore" or "Only uncool
people still do X". Obviously, *some* people still do. BTW, see also the
famous Yogi Berra quip: (Of a certain restaurant) "Nobody goes there
anymore; it's too crowded."

Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
arrays. They never really worked well, and now that we have true MDAs,
nobody should be using the old stuff.
--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/GodDelusion
Janis Papanagnou
2023-08-21 16:10:10 UTC
Permalink
Post by Kenny McCormack
...
Post by Janis Papanagnou
2) You probably don't need to mess with SUBSEP. Your data seems
to be OK with assuming no embedded spaces (i.e., so using space
as the delimiter is OK) Note that SUBSEP is intended to be
used as the delimiter for the implementation of old-fashioned
pseudo-multi-dimensional arrays in AWK, but nobody uses that
functionality anymore. Therefore, some AWK programmers have co-opted
SUBSEP as a symbol provided by the language to represent a character
that is more-or-less guaranteed to never occur in user data.
Yes, SUBSEP is the default separation character for arrays and. Of
course you can use other characters (that require less text). Why
you think that "nobody uses that functionality anymore" is beyond
me; I doubt you have any evidence for that, so I interpret it just
as "I [Kenny] don't use it anymore.", which is fine by me.
It may be a language barrier - I understand that English is not your first
language - but in colloquial English, the phrase "nobody does X anymore"
often means something close to "nobody should do X anymore" or "Only uncool
people still do X". Obviously, *some* people still do. BTW, see also the
famous Yogi Berra quip: (Of a certain restaurant) "Nobody goes there
anymore; it's too crowded."
Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
arrays. They never really worked well, and now that we have true MDAs,
nobody should be using the old stuff.
Okay, thanks for explaining. So I interpreted it right
(despite any language barrier that may exist). - And I
disagree with you, in the given thread context but also
more generally.

"True" multi-dimensional arrays are unnecessary here, and
using separate keys where you need only one composite key
is not only unnecessary, it seems to complicate matters.
(But you may provide code to prove me wrong if you like;
how would multi-dimensional arrays help here?)

In the past I used GNU Awk's multi-dimensional arrays in
contexts where they were necessary, and there they simplified
*those* things. But usually when using awk I have observed
that "simple [associative] arrays" are what I need in 98% of
my awk applications[*] - of course the situations where _you_
(personally) use Awk arrays may be different (that would
actually mean "I [Kenny] don't use it anymore.", as I
interpreted upthread).[**]

Since a[k] is the common use, the question is: in which
contexts is a[k1][k2] necessary, and in which is a[k1,k2]
sufficient? - My observation is that a[k1][k2] is advantageous
only where you need true multi-dimensional access; but this
appears not to be the common case. (BTW, [***].)

I think it boils down to observing that the concrete given
solution uses just one composed index, and that there's no
need for the non-standard "true multi-dimensional arrays",
because there are no multi-dimensional arrays here.[****]

Thanks for reading.

Janis

[*] Reminds me of the reasons why Pascal supported only
loops based on integral indices (and not floating point);
there was evidence that this was what was used most of the
time. (It doesn't mean that there aren't sensible
applications beyond that.)

[**] Of course you may also provide evidence and reasons
for the given hypotheses "nobody should do X anymore" -
why? - and "only uncool people still do X" - "uncool"? -
for X = "use simple awk arrays with composed keys". - I
think such statements just make no sense when they are
merely fuzzy (undetermined) or personal, without evidence.

[***] I deliberately ignored that the GNU Awk extension
is also non-standard, since it's not necessary for our
dispute.

[****] You see that where 'k' is composed and only a[k]
and c[k] used; simply and without disadvantage.
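[Editorial sketch, to make the composed-key point concrete: in the classic pseudo-multi-dimensional idiom, a["pear","hat"] really stores the single string "pear" SUBSEP "hat"; split() on SUBSEP recovers the components, and (k1, k2) in a tests membership. The sort is only there to make the output order deterministic.]

```shell
out=$(awk 'BEGIN {
  a["pear", "hat"] = 1            # really the key "pear" SUBSEP "hat"
  a["rose", "hat"] = 2
  if (("pear", "hat") in a)       # membership test on the composed key
    print "pear,hat present"
  for (k in a) {
    split(k, p, SUBSEP)           # recover the components
    print p[1] "/" p[2], a[k]
  }
}' | LC_ALL=C sort)
printf '%s\n' "$out"
```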
Janis Papanagnou
2023-08-21 16:56:33 UTC
Permalink
Post by Kenny McCormack
Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
arrays. They never really worked well, and now that we have true MDAs,
nobody should be using the old stuff.
I think the misunderstandings in this subthread were...
- we have no disagreement about where "MDAs" are _necessary_ and used,
- in this thread's solutions we had no application of "MDAs"
  (just a composed key), and "MDAs" also weren't necessary,
- (thesis) basic associative arrays are predominantly used
  (mileage may vary depending on where awk is used),
- "MDAs" support associative functionality and are thus hardly
  avoidable (is a[k] "old stuff", or is it an MDA with one dimension?)

Janis
Kenny McCormack
2023-08-21 19:04:17 UTC
Permalink
Post by Janis Papanagnou
Post by Kenny McCormack
Anyway, this is definitely true of old-fashioned AWK pseudo-multi-dimensional
arrays. They never really worked well, and now that we have true MDAs,
nobody should be using the old stuff.
I think the misunderstandings in this subthread were...
- we have no disagreement where "MDAs" are _necessary_ and used,
- in this thread's solutions we had no application of "MDAs"
(just a composed key), and "MDAs" also weren't necessary,
- (thesis) basic associative arrays are predominantly used
(mileages may probably vary depending on where awk is used),
- "MDAs" support associative functionality thus hardly avoidable
(is a[k] "old stuff" or is it an MDA with one dimension?)
I never said anything about any of that - That is, anything about whether
or not MDAs were needed in the context of this thread (Clearly, they are
not).

My content was, as it usually is, entirely "meta". Thus, the following two
comments:

1) It sounded like you had misunderstood my comment about "nobody does
that anymore", so I clarified what the colloquial meaning of that
expression is. Note that I have hit a similar thing a while back in the
shell group - where I stated that nobody uses backticks anymore,
because we now have $(), which, as we all know, is better in just about
every way (the only exception that I can think of is that if you are
programming in csh or tcsh, then you have to use backticks - although
this may sound facetious, I still do some tcsh stuff, so I have to keep
this in mind).

I got a lot of blowback from indignant people who wanted me to know
that they still use backticks and they were personally insulted that I
claimed that no one did that anymore. Clearly, those people did not
understand the idiomatic meaning of the expression either.
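[Editorial aside: the backtick point is easy to demonstrate in shell — $() nests without any escaping, while backticks need a backslash per inner level.]

```shell
# Both compute the same thing, but the $() form nests cleanly.
a=$(basename $(dirname /tmp/x/y))     # modern form
b=`basename \`dirname /tmp/x/y\``     # backticks: inner pair escaped
printf '%s %s\n' "$a" "$b"
```

Both variables end up holding "x" (dirname gives /tmp/x, basename gives x); only the readability differs.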

2) You had used SUBSEP in your script (reply to OP), but were
(obviously) not using (any form of) MDAs, so I made some comments (not
for your benefit, but for OP's) about your usage of SUBSEP (i.e., how
it is usually only used when using pseudo-MDAs, but that some people
have co-opted it for other uses).
--
The plural of "anecdote" is _not_ "data".
Janis Papanagnou
2023-08-17 21:42:50 UTC
Permalink
Post by Robert Mesibov
Post by Janis Papanagnou
You can alternatively do it (e.g.) in one instance also like this...
{ k = $2 SUBSEP $4 SUBSEP $5 ; a[k] = a[k] RS $0 ; c[k]++ }
END { for(k in a) if (c[k]>1) print a[k] }
which is not (not much) shorter character wise but doesn't need the
external sort command, it is all in one awk instance (as you want),
and single pass. (I think the code is also a bit clearer than the
one you posted above, but YMMV.)
Janis
Many thanks, Janis, that's very nice, but it depends on specifying
the non-unique fields 2, 4 and 5. In the real-world cases I work
with, there are 1-2 unique ID code fields and sometimes 300+
non-unique-ID fields (2, 4, 5...300+). That's why I replace the
unique-ID fields with the arbitrary value "1" when testing for
duplication.
That was not apparent from your description. But defining the key
by constructing it is not mandatory; you can also define it by
elimination (as in your code). The point is what follows in the
code after the k=... statement.

Janis
Post by Robert Mesibov
Bob