Discussion:
writing a good gsub regexp for matching between two specific characters
Bryan
2023-03-12 00:06:09 UTC
I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and provide a lot of material in case it is useful or there is conflict in the script, but I am trying not to ramble.

I prepared a test script below - which should be easy to copy/paste into a shell, e.g. bash. I am focused on the gsub regexps, which are obviously contrived to replace all these different strings which - as they vary from output from another program - take the general form (attempting a "plain English" version):

[open apostrophe][the word "path"][maybe an underscore][various digits][end apostrophe]

I want to take all of that ^^^ and delete it - or equivalently replace it with nothing (ideally), to prepare input to gnuplot as "x,y" or "x y" data - two columns.

I tried using this type of command :

gsub("^[a-z]{4}$","TEST") ;

... and more, e.g. trying sub and gensub - but did not get far - I am aware of a curly brace escape that is important or not depending on the awk version, so I also tried with \{ and \}.

I put "TEST" in the present case for testing a few different cases. I wrote this script based on extensive reading of a certain popular online resource and The AWK Programming Language (1988 - maybe time for a newer edition?). This is a useful script because as I find new types of output from the upstream program (a whole other story), I might add new gsub commands to take care of it.

copy/paste example script:

echo "\
{\"path_1234567\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_123456\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_1234\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path1234\"\
:[`seq -s',' -f '%f' 1 20 `]}" | \
gawk -F, '
{
gsub("\{","") ;
gsub("\}","") ;
gsub("\]","") ;
gsub("^[a-z]{4}$","TEST") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
for (i=1;i<=NF;i++)
{
printf("%s%s",$i,i%2?",":"\n")
}
}'

... the last printf thing is perhaps for another post, but (IIUC) matches every 2nd comma and replaces it with a newline. So that's the "x,y" data idea. I hope that is clear - I imagine the regexps in the [a-z][0-9] parts ought to be able to go all into one gsub if I knew the syntax or what to read about.
Janis Papanagnou
2023-03-12 02:52:27 UTC
First, I cannot really decipher what you actually want to do and
where your problems are. The usual procedure is to post sample data:
input data and the corresponding output data at least (not shell
code that creates the input data). Anyway you find below some hints
and suggestions...
Post by Bryan
I'm using gawk 5.1.0, bash 5.1.16, Ubuntu 22.04.2. I will write and
provide a lot of material in case it is useful or there is conflict
in the script, but I am trying not to ramble.
I prepared a test script below - which should be easy to copy/paste
into a shell, e.g. bash. I am focused on the gsub regexps, which are
obviously contrived to replace all these different strings which - as
they vary from output from another program - take the general form
[open apostrophe][the word "path"][maybe an underscore][various digits][end apostrophe]
I want to take all of that ^^^ and delete it - or equivalently
replace it with nothing (ideally), to prepare input to gnuplot as
"x,y" or "x y" data - two columns.
gsub("^[a-z]{4}$","TEST") ;
This is fine for replacing lines that contain _only_ a sequence of
four lowercase letters with "TEST". gsub() _without_ the ^ and $
anchors will substitute every occurrence of that pattern on a line.
You can provide a third argument to gsub() to operate on variables
or specific fields; in that case the anchors ^ and $ will define
the beginning and end of that variable or field respectively.
It is also advantageous to use /.../ syntax for constant patterns
instead of the string form "...".
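
A minimal sketch of the three cases just described, assuming gawk (or another awk with interval-expression support) is on the PATH. The function names are made up for illustration and are not part of the original script:

```shell
# Prefer gawk (the implementation used in this thread); fall back to awk.
AWK=$(command -v gawk || command -v awk)

# Anchored: substitutes only when the whole record is four lowercase letters.
anchored() {
  printf '%s\n' "$1" | "$AWK" '{ gsub(/^[a-z]{4}$/, "TEST"); print }'
}

# Unanchored: substitutes every run of four lowercase letters in the record.
unanchored() {
  printf '%s\n' "$1" | "$AWK" '{ gsub(/[a-z]{4}/, "TEST"); print }'
}

# Third argument: ^ and $ now mean the start and end of field 2 only.
second_field() {
  printf '%s\n' "$1" |
    "$AWK" -F, -v OFS=, '{ gsub(/^[a-z]{4}$/, "TEST", $2); print }'
}
```

With these definitions, `anchored path` prints `TEST`, `unanchored path_1234` prints `TEST_1234`, and `second_field abcd,path,wxyz` prints `abcd,TEST,wxyz`.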
Post by Bryan
... and more, e.g. trying sub and gensub - but did not get far - I am
aware of a curly brace escape that is important or not depending on
the awk version, so I also tried with \{ and \}.
There's no need to escape these braces.
Post by Bryan
I put "TEST" in the present case for testing a few different cases. I
wrote this script based on extensive reading of a certain popular
online resource and The AWK Programming Language (1988 - maybe
time for a newer edition?). This is a useful script because as I find
new types of output from the upstream program (a whole other story),
I might add new gsub commands to take care of it.
echo "\
{\"path_1234567\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_123456\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path_1234\"\
:[`seq -s',' -f '%f' 1 20 `],\
\"path1234\"\
:[`seq -s',' -f '%f' 1 20 `]}" | \
gawk -F, '
{
gsub("\{","") ;
gsub("\}","") ;
gsub("\]","") ;
gsub("^[a-z]{4}$","TEST") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSEVEN") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9][0-9][0-9]\":\\\[","TESTSIX") ;
gsub("\"[a-z][a-z][a-z][a-z][0-9][0-9][0-9][0-9]\":\\\[","TESTFOURB") ;
gsub("\"[a-z][a-z][a-z][a-z]_[0-9][0-9][0-9][0-9]\":\\\[","TESTFOURA") ;
for (i=1;i<=NF;i++)
{
printf("%s%s",$i,i%2?",":"\n")
}
}'
Instead of echo arguments with quotes and newline escapes, I suggest
using here-documents in the shell, with this syntax:

awk '
# ... your awk program ...
...
' <<EOT
your data line 1
your data line 2
...
EOT

and with the more contemporary $(...) a line might be

{"path_1234567":[$(seq -s',' -f '%f' 1 20)], ...

but I wouldn't call seq many times but only once and assign it to a
variable and use that repeatedly

s=$(seq -s',' -f '%f' 1 20)
awk '
...
' <<EOT
{"path_1234567":[${s}], ...
...
EOT

If you pipe in or redirect other input just omit the code from <<EOT
onward.
data_from_some_process | awk '...'
awk '...' < data_from_some_file

(But for testing the here-documents have advantages.)
Post by Bryan
... the last printf thing is perhaps for another post, but (IIUC)
matches every 2nd comma and replaces it with a newline.
printf doesn't replace anything. After every odd-numbered field it
prints a comma, and after every even-numbered field a newline.
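
A small sketch of that behaviour in isolation, using the same printf expression as the script; the function name `pairs` is only for illustration:

```shell
# Print comma-separated values two per line.
pairs() {
  printf '%s\n' "$1" | awk -F, '{
    # Odd-numbered fields are followed by ",", even-numbered ones by "\n".
    for (i = 1; i <= NF; i++)
      printf("%s%s", $i, (i % 2) ? "," : "\n")
  }'
}
```

`pairs 1,2,3,4` prints `1,2` and `3,4` on separate lines. Note that with an odd number of fields the output ends in a dangling comma and no final newline.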
Post by Bryan
So that's the
"x,y" data idea. I hope that is clear - I imagine the regexps in the
[a-z][0-9] parts ought to be able to go all into one gsub if I knew
the syntax or what to read about.
To match more than one regexp for the _same_ replacement you can
combine them with the | (or) operator. For an example from your
code above use, e.g., gsub(/{|}|]/, "") to remove those three
braces/brackets in one expression.
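
As a sketch of that single-gsub form: the braces and the bracket are backslash-escaped inside the regexp constant here, which is an assumption made for portability across awk implementations rather than something gawk requires. The function name is illustrative only:

```shell
# Remove every "{", "}" and "]" from a line with one alternation.
strip_brackets() {
  printf '%s\n' "$1" | awk '{ gsub(/\{|\}|\]/, ""); print }'
}
```

`strip_brackets '{"a":[1,2]}'` prints `"a":[1,2`, leaving the opening `[` for a later substitution to handle.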

But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parentheses (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.
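
A hedged sketch of how such a pattern could be used to delete the keys, assuming the first atom is meant to be [a-z]{4} and that each key is wrapped in double quotes and followed by `:[` as in the sample data. The function name is hypothetical:

```shell
# Prefer gawk (the implementation used in this thread); fall back to awk.
AWK=$(command -v gawk || command -v awk)

# Delete keys made of 4 letters, an optional "_", and 4 or 6 digits,
# together with the surrounding quotes and the following ":[".
delete_keys() {
  printf '%s\n' "$1" |
    "$AWK" '{ gsub(/"[a-z]{4}_?[0-9]{4}([0-9]{2})?":\[/, ""); print }'
}
```

`delete_keys '"path_1234":[1,2],"path123456":[3,4]'` prints `1,2],3,4]`. A key with seven digits is left untouched by this pattern, so the digit count would need a wider interval to cover it.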

Janis
Bryan
2023-03-12 16:25:50 UTC
Apologies for the `seq` synthetic data, I'll prepare it the better way next time.
Post by Janis Papanagnou
But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parentheses (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.
This is exactly what I was looking for and it works (I think a typo is in there but let's leave it for now).

I tried {1-4} to get a range, but it didn't work - is that the idea? so

[a-z]{4}_?[0-9]{4}([0-9]{1-4})?

to match any number of digits from 1 to 4?
Kenny McCormack
2023-03-12 16:49:42 UTC
Post by Bryan
Apologies for the `seq` synthetic data, I'll prepare it the better way next time.
Post by Janis Papanagnou
But with your samples above you can also use other regexp syntaxes,
like ? (for optional parts) and use grouping with parentheses (...)
for longer subexpressions, e.g.
[a-z][4}_?[0-9]{4}([0-9]{2})?
for an optional underscore and two optional digits.
This is exactly what I was looking for and it works (I think a typo is
in there but let's leave it for now).
I tried {1-4} to get a range, but it didn't work - is that the idea? so
[a-z]{4}_?[0-9]{4}([0-9]{1-4})?
to match any number of digits from 1 to 4?
It is: {1,4}
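
For illustration, a sketch of the pattern from the question with the comma form of the interval, assuming gawk or another awk with interval support. The function name is made up:

```shell
# Prefer gawk (the implementation used in this thread); fall back to awk.
AWK=$(command -v gawk || command -v awk)

# Delete keys made of 4 letters, an optional "_", 4 digits, and
# optionally 1 to 4 further digits, plus the quotes and the ":[".
delete_keys_ranged() {
  printf '%s\n' "$1" |
    "$AWK" '{ gsub(/"[a-z]{4}_?[0-9]{4}([0-9]{1,4})?":\[/, ""); print }'
}
```

`delete_keys_ranged '"path_1234567":[1,2],"path1234":[3,4]'` prints `1,2],3,4]`, so the seven-digit and four-digit keys are both covered.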
--
"If our country is going broke, let it be from feeding the poor and caring for
the elderly. And not from pampering the rich and fighting wars for them."

--Living Blue in a Red State--
Bryan
2023-03-12 20:11:09 UTC
This is great. My old awk book (Aho, Kernighan, and Weinberger) has a table on p.32 saying :

"expression [c1-c2] matches any character in the range beginning with c1 and ending with c2."

... p.30 has more discussion, and I never saw anything about the comma "," to indicate a range - perhaps this is a strong indication I need to get a better book.

And, I apologize, but I must say - this discussion reached a good answer in less than 24 hours - even though discussion doesn't "scale", and I can't cast a vote on it.

IOW Thank you!
Bryan
2023-03-12 20:43:38 UTC
addendum : in writing a separate question about the printf statement, I found a better way to print a newline instead of every 2nd comma from a long string of signed floating points, so I simply share the method here :

digits=$(seq -s',' -f '%f' -10 10)
gawk -F, '
{
for (i=1;i<=NF;i++)
{
printf("%3.6f%s",$i,i%2?",":"\n")
}
}' <<EOT
${digits}
EOT
Janis Papanagnou
2023-03-12 21:42:10 UTC
Post by Bryan
This is great. My old awk book (Aho, Kernighan, and Weinberger) has a
"expression [c1-c2] matches any character in the range beginning
with c1 and ending with c2."
You are referring to something different here. Put slightly simplified:
[a-z] is a regexp matching any single lowercase letter
[0-9] any single digit
[0-9a-fA-F] any hexadecimal digit

The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
classic awk ("nawk") that the book by Aho et al. is based on. More recent
and commonly used Awks like GNU awk support it, though. That's why there's
no mention of it in that book.
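
To make the difference concrete, a small sketch that uses both notations in one pattern: the bracket expression chooses which single character may appear, and the braces say how many times it repeats. It assumes an awk with interval support, and the function name is illustrative:

```shell
# Prefer gawk (the implementation used in this thread); fall back to awk.
AWK=$(command -v gawk || command -v awk)

# "yes" if the argument is exactly two hexadecimal digits, else "no".
# [0-9a-fA-F] is a character range; {2} is a repetition count.
is_hex_byte() {
  printf '%s\n' "$1" |
    "$AWK" '{ print ($0 ~ /^[0-9a-fA-F]{2}$/ ? "yes" : "no") }'
}
```

`is_hex_byte 3f` prints `yes`, while `is_hex_byte 3g` and `is_hex_byte 3ff` both print `no`.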
Post by Bryan
... p.30 has more discussion, and I never saw anything about the
comma "," to indicate a range - perhaps this is a strong indication I
need to get a better book.
The old book is excellently written and contains everything that makes
up the power of the awk language. (Don't ignore it or throw it away!)

But I suggest, especially if you use GNU awk which supports yet more
features, to get a copy of Arnold Robbins's "Effective Awk Programming"
which is based on GNU Awk. (It's also available online in a searchable
digital form.)

Janis
Janis Papanagnou
2023-03-13 21:03:26 UTC
Post by Janis Papanagnou
This is great. My old awk book (Aho, Kernighan, and Weinberger) [...]
The multiplicity syntax {N}, {N,}, {,M}, {N,M} is not supported by the
classic awk ("nawk") that the book by Aho et al. is based on. More recent
and commonly used Awks like GNU awk support it, though. That's why there's
no mention of it in that book.
While true for classic awk ("nawk") Arnold Robbins informed me that
in more recent versions of "nawk" this syntax is also supported, now
already for years. (Just in case my post was misinterpreted.)

To my knowledge, though, there are no newer/updated releases of the book
you mentioned; it is based on the old version of (n)awk, and thus it
does not describe that (newer) feature. (Which was my point.)

Janis
Bryan
2023-03-14 13:55:57 UTC
I noticed in the "Computerphile" video with Brian Kernighan - shared on this user group - that a new version of The Awk Book might be in the works as of August 2022.

Meanwhile, the overnight delivery is in-hand now, and, from page 45:

"[begin quote]
{n}
{n,}
{n,m}
One or two numbers inside braces denote an *interval expression*. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If [p. 46] there is one number followed by a comma, then the preceding regexp is repeated at least n times:[end quote]"

... examples shown are :
wh{3}y matches 'whhhy', but not 'why' or 'whhhhy'.
wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
wh{2,}y matches 'whhy', 'whhhy', and so on.

There is more.
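
The interval rules quoted from the manual can be checked with a short sketch like the following, assuming gawk or another awk with interval support. The helper name `matches` is made up, and it anchors the pattern so that the whole string has to match:

```shell
# Prefer gawk (the implementation used in this thread); fall back to awk.
AWK=$(command -v gawk || command -v awk)

# usage: matches REGEXP STRING
# Prints "yes" if STRING as a whole matches REGEXP, otherwise "no".
matches() {
  printf '%s\n' "$2" |
    "$AWK" -v re="$1" '{ print ($0 ~ ("^(" re ")$") ? "yes" : "no") }'
}
```

`matches 'wh{3}y' whhhy` prints `yes` and `matches 'wh{3}y' why` prints `no`; `matches 'wh{3,5}y' whhhhy` and `matches 'wh{2,}y' whhy` both print `yes`.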

Lastly, from the back cover :

"You have the freedom to copy and modify this GNU manual."

Glad to support the FSF in this way!
Janis Papanagnou
2023-03-14 23:14:30 UTC
Post by Bryan
I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.
I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?

Or are you speaking about Arnold Robbins's book? (Especially since
below you mention GNU and the FSF.)

I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.
Post by Bryan
Meanwhile, the overnight delivery is in-hand now, [...] There is
more.
"You have the freedom to copy and modify this GNU manual."
Glad to support the FSF in this way!
Janis
Ben Bacarisse
2023-03-14 23:46:24 UTC
Post by Janis Papanagnou
Post by Bryan
I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.
I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?
Or are you speaking about Arnold Robbins's book? (Especially since
below you mention GNU and the FSF.)
I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.
The phrase "might be in the works" means only that there is a possibility
that a new edition might be in preparation. Is that what's confusing?

Bryan is clearly talking about a new version of the original book, but
he is referring to the most vague suggestion that there might, soon, be
a new edition. As far as I can tell there isn't one, but there could be
one "in the works" (i.e. in preparation).
--
Ben.
Janis Papanagnou
2023-03-15 00:22:23 UTC
Post by Ben Bacarisse
Post by Janis Papanagnou
Post by Bryan
I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.
I cannot find a new version of the original Awk book with Google
(or other commercial providers). Could you provide a link, please?
Or are you speaking about Arnold Robbins's book? (Especially since
below you mention GNU and the FSF.)
I'm certainly confused by your mention of Brian Kernighan, one of
the authors of the original book.
The phrase "might be in the works" means only that there is a possibility
that a new edition might be in preparation. Is that what's confusing?
Various things confused me (but not "in the works" per se):
- "might be in the works" vs. "the overnight delivery is in-hand now"
- "GNU" and "FSF" vs. "The [original][commercial] Awk Book"
- and the date "August 2022" I couldn't assign to both books mentioned
Post by Ben Bacarisse
Bryan is clearly talking about a new version of the original book, but
he is referring to the most vague suggestion that there might, soon, be
a new edition. As far as I can tell there isn't one, but there could be
one "in the works" (i.e. in preparation).
I am certainly interested in any new version. I read his post as if he
had already got it. But I didn't find anything online.

Janis
Bryan
2023-03-15 15:31:02 UTC
I apologize for the confusion!

I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).
Ed Morton
2023-03-15 17:12:09 UTC
Post by Bryan
I apologize for the confusion!
I will make a note on the Brian Kernighan video thread - the video I listened to/watched when stuck (not a bad idea, IMHO).
You're posting on usenet, not a forum, so please make sure every post
has enough context included to make sense stand-alone. Right now you're
truncating/removing all context on all of your posts.

Thanks.
Keith Thompson
2023-03-14 23:49:00 UTC
Post by Bryan
I noticed in the "Computerphile" video with Brian Kernighan - shared
on this user group - that a new version of The Awk Book might be in
the works as of August 2022.
"[begin quote]
{n}
{n,}
{n,m}
One or two numbers inside braces denote an *interval expression*. If
there is one number in the braces, the preceding regexp is repeated n
times. If there are two numbers separated by a comma, the preceding
regexp is repeated n to m times. If [p. 46] there is one number
followed by a comma, then the preceding regexp is repeated at least n
times:[end quote]"
wh{3}y matches 'whhhy', but not 'why' or 'whhhhy'.
wh{3,5}y matches 'whhhy', 'whhhhy', or 'whhhhhy' only.
wh{2,}y matches 'whhy', 'whhhy', and so on.
There is more.
"You have the freedom to copy and modify this GNU manual."
Glad to support the FSF in this way!
That's the GNU Awk manual. I don't have a printed version, but it
appears to have the same content as the online manual available by
typing "info gawk" (if you have the right things installed)
or at <https://www.gnu.org/software/gawk/manual/gawk.html>.

"The Awk Book" presumably refers to the original "The AWK Programming
Language" by Aho, Kernighan, and Weinberger, published in 1988.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Working, but not speaking, for XCOM Labs
void Void(void) { Void(); } /* The recursive call of the void */
Kpop 2GM
2023-08-01 04:11:18 UTC
Post by Keith Thompson
"The Awk Book" presumably refers to the original "The AWK Programming
Language" by Aho, Kernighan, and Weinberger, published in 1988.
I've seen the entirety of the original 1988 book scanned and viewable in PDF format online

( I'll refrain from linking it here since I'm uncertain about copyrights of the PDFs, but shouldn't be too hard to locate via google search or somewhere on github )

That said, even the original authors didn't do a particularly good job at selling awk's real strengths. If I began my awk journey with that book, I would've jumped ship to perl long, long ago.

Thank goodness I didn't step into that sarlacc pit that is perl5, or worse, raku.


The 4Chan Teller

Janis Papanagnou
2023-08-01 15:19:41 UTC
Post by Kpop 2GM
Post by Keith Thompson
"The Awk Book" presumably refers to the original "The AWK
Programming Language" by Aho, Kernighan, and Weinberger, published
in 1988.
That said, even the original authors didn't do a particularly good job
at selling awk's real strengths.
When I had first read about the awk command I was curiously looking
for more detailed information than just "it's a language to process
text patterns", so I was quite glad to find that book. It came out
very quickly, only a year after the official release (three years
after the stable version had been developed). The book is very well
written and provides everything you need to understand the concepts
of Awk which are, IMO, the "real strengths" of the Awk language. Of
course it's not a long-developed "hacker book" with tips and tricks.
Nor does it have all that fancy stuff that we have been publishing or
discussing here in this newsgroup during the past decades. I agree
with you, though, that there wasn't - maybe still isn't - anything
worthwhile on that "hacker level". But I wouldn't blame that old book or
its authors for this deficiency. After all, folks who came up with
advanced ideas likely read that book (and maybe other later sources)
to develop application ideas that the original authors did not have
in mind.

And I also think that the more advanced methods that contribute to
Awk's strengths further would likely have repelled possible users;
many are cryptic and not too easy to understand for newbies. - The
book was, IMHO, exactly what was necessary at that time! - I would
still recommend it to Awk-beginners, even today.[*]
Post by Kpop 2GM
If I began my awk journey with that
book, I would've jumped ship to perl long, long ago.
I started with that book (and a brain that came for free), and
nothing else. (And at times I still look into that book to look
things up.)

With which sources did you "[begin your] awk journey", since you
seem to avoid Perl and enjoy Awk at an advanced level?

Janis

[*] With the drawback of the unpleasantly high price of the booklet.
jeorge
2023-08-01 20:32:33 UTC
Post by Keith Thompson
"The Awk Book" presumably refers to the original "The AWK
Programming Language" by Aho, Kernighan, and Weinberger, published
in 1988.
<snip>
.. I would still recommend it to Awk-beginners, even today.[*]
<snip>
[*] With the drawback of the unpleasantly high price of the booklet.

Speaking of, I came across an announcement of a new edition:

The AWK Programming Language, Second Edition
https://awk.dev/
"The book will be available by the end of September."
Janis Papanagnou
2023-08-01 21:02:51 UTC
Post by jeorge
Post by Keith Thompson
"The Awk Book" presumably refers to the original "The AWK
Programming Language" by Aho, Kernighan, and Weinberger, published
in 1988.
<snip>
.. I would still recommend it to Awk-beginners, even today.[*]
<snip>
[*] With the drawback of the unpleasantly high price of the booklet.
The AWK Programming Language, Second Edition
https://awk.dev/
"The book will be available by the end of September."
It would be interesting to know whether it's just a reprint or
a reworked (updated/enhanced/extended) edition.

Janis
Keith Thompson
2023-08-01 21:20:46 UTC
Post by Janis Papanagnou
Post by jeorge
Post by Keith Thompson
"The Awk Book" presumably refers to the original "The AWK
Programming Language" by Aho, Kernighan, and Weinberger, published
in 1988.
<snip>
.. I would still recommend it to Awk-beginners, even today.[*]
<snip>
[*] With the drawback of the unpleasantly high price of the booklet.
The AWK Programming Language, Second Edition
https://awk.dev/
"The book will be available by the end of September."
It would be interesting to know whether it's just a reprint or
a reworked (updated/enhanced/extended) edition.
A mere reprint would not be called the "Second Edition".

From the cited web page:

The first edition was written by Al Aho, Brian Kernighan and Peter
Weinberger in 1988. Awk has evolved since then, there are multiple
implementations, and of course the computing world has changed
enormously. The new edition of the Awk book reflects some of those
changes.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Janis Papanagnou
2023-08-01 22:33:52 UTC
Post by Keith Thompson
Post by Janis Papanagnou
It would be interesting to know whether it's just a reprint or
a reworked (updated/enhanced/extended) edition.
A mere reprint would not be called the "Second Edition".
Ah, okay, thanks for the hint.[*]

I can only speak for publishers hereabouts; "zweite Auflage" (en.
"second edition") just means a new edition after the first one, and
if there isn't anything mentioned like "überarbeitete" (en. revised),
"verbesserte" (en. improved), "durchgesehene" (en. revised version),
"korrigierte" (en. corrected), "erweiterte" (en. extended), or many
other possible adjectives declaring the type of the edition, then
it's usually (or even generally?) just a reprint because of new or
significantly more customer demand than originally expected.
Post by Keith Thompson
[...]
And thanks for the quote. (I could have looked it up myself but was
too lazy.)

Janis

[*] Though I have also seen English-language books that carry a
"rev. ed." note (e.g. Bolsky, Korn), so I guess it varies?
Keith Thompson
2023-08-01 21:14:29 UTC
Post by Kpop 2GM
Post by Keith Thompson
"The Awk Book" presumably refers to the original "The AWK Programming
Language" by Aho, Kernighan, and Weinberger, published in 1988.
I've seen the entirety of the original 1988 book scanned and viewable in PDF format online
( I'll refrain from linking it here since I'm uncertain about
copyrights of the PDFs, but shouldn't be too hard to locate via google
search or somewhere on github )
I'm far more certain. The 1988 book is still under copyright, and any
PDF copy that's not explicitly authorized by the publisher is in
violation of that copyright.

(The 1988 AWK book doesn't appear to be available in electronic form.
Amazon has it in paperback for $114.71. The second edition is supposed
to be available 2023-09-22, at a much more reasonable price.)

[...]