Generic transformations of arbitrary data entities

Discussion:


Janis Papanagnou

2023-10-12 12:04:56 UTC

In a recent thread I posted an Awk code pattern that finds words
matching a pattern and conditionally transforms them; it relied only
on POSIX Awk features. Actually, though, it's a generally usable code
pattern. With standard Awk you can substitute the entity pattern and
the function that transforms the matched data entities as necessary.

GNU Awk supports a couple of newer features that make this
generalization more explicit: strongly typed regexp constants
(@/.../) and indirect function calls (@name(...)).

# generic function to transform specified data entities
function trent (line, pattern, transform,    out)
{
    for (line=$0; match(line, pattern);
         line=substr(line, RSTART+RLENGTH))
    {
        out = out substr(line, 1, RSTART-1) \
              @transform(substr(line, RSTART, RLENGTH))
    }
    out = out line
    return out
}

With a transformation function like

function highlight (str)
{
    return "\033[7m" str "\033[0m"
}

a sample usage can be

BEGIN { words = @/[[:alpha:]]+/ }
{
    print trent($0, words, "highlight")
}

Applied to the task from the other thread you can provide

function isogram_highlight (str)
{
    return (isogram(str) ? "\033[7m" str "\033[0m" : str)
}

using Mike's (only slightly changed by me) isogram() algorithm

function isogram(str,    c, x, y) {
    y = length(str)
    for (x = 1; x < y; x++) {
        c = substr(str, x, 1)
        # a character that shows up again further right disqualifies the word
        if (index(substr(str, x + 1), c)) return 0
    }
    return 1
}

in a context like

BEGIN { words = @/[[:alpha:]]+/ }
{
    print trent($0, words, "highlight")
    print trent($0, words, "isogram_highlight")
}

Note again that this solution based on a generalized algorithm
uses GNU Awk specific features and is not conforming to POSIX!
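
For comparison, the same generalization can be sketched in Python, where
functions are ordinary values and no indirect-call syntax is needed. This
is only an illustration and not part of the original post: the names
mirror the Awk functions above, re.sub with a callable replacement is
swapped in for the explicit match()/substr() loop, and the ASCII class
[A-Za-z] is an assumption standing in for [[:alpha:]].

```python
import re

def trent(line, pattern, transform):
    """Apply transform to every match of pattern; keep everything else as is."""
    # re.sub calls the function once per match and splices in its return
    # value, which is the job of the match()/substr() loop in the Awk version.
    return re.sub(pattern, lambda m: transform(m.group(0)), line)

def highlight(s):
    # ANSI reverse video on, the text, attributes off
    return "\033[7m" + s + "\033[0m"

def isogram(s):
    # True if no character occurs more than once
    return len(set(s)) == len(s)

def isogram_highlight(s):
    return highlight(s) if isogram(s) else s

# Assumption: ASCII letters stand in for the POSIX class [[:alpha:]]
WORDS = re.compile(r"[A-Za-z]+")

if __name__ == "__main__":
    print(trent("dodo  and, emu", WORDS, highlight))
    print(trent("dodo  and, emu", WORDS, isogram_highlight))
```

As in the Awk version, spacing and punctuation between the matches pass
through unchanged; only the matched entities are handed to the transform.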

Janis

Janis Papanagnou

2023-10-12 16:23:58 UTC


The line=$0 assignment was a leftover from an earlier version. Here
you don't want it, since 'line' is passed as a function parameter
and the assignment would overwrite the argument. So make that just

function trent (line, pattern, transform,    out)
{
    for ( ; match(line, pattern);
         line=substr(line, RSTART+RLENGTH))
    {
        out = out substr(line, 1, RSTART-1) \
              @transform(substr(line, RSTART, RLENGTH))
    }
    out = out line
    return out
}

Mike Sanders

2023-10-12 19:00:13 UTC



Good stuff. Adding this to my notes, in fact. I was really hoping
others would see some value in using hilite(). It's handy on my end too.

--
:wq
Mike Sanders

Janis Papanagnou

2023-10-13 07:33:05 UTC


Post by Mike Sanders

Post by Janis Papanagnou
[...]
{
print trent($0, words, "highlight")
print trent($0, words, "isogram_highlight")
}
Note again that this solution based on a generalized algorithm
uses GNU Awk specific features and is not conforming to POSIX!

Good stuff. Adding this to my notes, in fact. I was really hoping
others would see some value in using hilite(). It's handy on my end too.

I'm using ANSI escapes from time to time, and also just recently,
e.g. for coloring.

But my point here was more the generalization. The task of changing
some entities on a line while preserving the spacing, delimiters,
and other information is quite common. I used it a couple of times
and always reprogrammed the two-line loop with a different pattern
for different transformations. That's why I think that GNU Awk's
features - too sad you cannot use them! - are valuable; they can
emulate quite nicely what other languages do with real function
arguments.

I expanded my test program[*] with some more simple applications
that lead to

BEGIN {
    ...
    words   = @/[[:alpha:]]+/
    numbers = @/[[:digit:]]+/
    names   = @/([[:upper:]][.])*[[:upper:]][[:lower:]]*/
}
{
    print trent($0, words, "highlight")
    print trent($0, words, "isogram_highlight")
    print trent($0, numbers, "black_out")
    print trent($0, names, "black_out")
    print trent($0, names, "anonymize")
}

Just to demonstrate the point by possible combinations of patterns
(that can of course be simply refined) and functions (identified
by their names).
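
To make the pattern/function pairing concrete, here is a minimal Python
sketch of the combinations listed above. The bodies of black_out and
anonymize are not shown in this post, so the versions here are guesses
made up for illustration; the ASCII character classes are likewise an
assumption standing in for the POSIX classes.

```python
import re

def trent(line, pattern, transform):
    # replace each match of pattern by transform(match); leave the rest alone
    return re.sub(pattern, lambda m: transform(m.group(0)), line)

# Hypothetical transforms: named in the post, but their code is not shown there.
def black_out(s):
    return "#" * len(s)      # hide the content, keep the width

def anonymize(s):
    return "N.N."            # fixed placeholder, width not preserved

# ASCII classes stand in for [[:digit:]], [[:upper:]] and [[:lower:]]
NUMBERS = re.compile(r"[0-9]+")
NAMES = re.compile(r"(?:[A-Z][.])*[A-Z][a-z]*")

if __name__ == "__main__":
    line = "J.Smith paid 250"
    # any pattern can be paired with any transform
    for pattern, transform in ((NUMBERS, black_out),
                               (NAMES, black_out),
                               (NAMES, anonymize)):
        print(trent(line, pattern, transform))
```

The loop at the end is the whole point: the traversal code is written
once, and each new application is just another (pattern, function) pair.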

Janis

[*] Extended test program: volatile.gridbug.de/transform_words

Mike Sanders

2023-10-14 00:39:14 UTC


Post by Janis Papanagnou
I'm using ANSI escapes from time to time, and also just recently,
e.g. for coloring.

Myself as well...

<https://drive.google.com/file/d/1tf_X3U3TwJQz67z3gdFBSZo2oKW2vcao/view>

Post by Janis Papanagnou
But my point here was more the generalization. The task of changing
some entities on a line while preserving the spacing, delimiters,
and other information is quite common. I used it a couple of times
and always reprogrammed the two-line loop with a different pattern
for different transformations. That's why I think that GNU Awk's
features - too sad you cannot use them! - are valuable; they can
emulate quite nicely what other languages do with real function
arguments.

I hope to soon =) But for a while longer I can't.

Post by Janis Papanagnou
I expanded my test program[*] with some more simple applications
that lead to
BEGIN {
...
}

That is so cool!

Post by Janis Papanagnou
[*] Extended test program: volatile.gridbug.de/transform_words

Will you have an index page of your projects/snippets
in the future Janis?

--
:wq
Mike Sanders

Janis Papanagnou

2023-10-14 11:17:42 UTC


Post by Mike Sanders

Post by Janis Papanagnou
[*] Extended test program: volatile.gridbug.de/transform_words

Will you have an index page of your projects/snippets
in the future Janis?

Unfortunately(?), no. - I've never[*] started to systematically publish
any code (and I don't intend to do so). My approach has been discussion
in Usenet, sharing knowledge, and code only on demand or where it
supports the topics discussed. There's also too much stuff that has
accumulated over the decades; it would require quite some effort to
provide that in a form of sufficient quality. My view was that anything
useful that I posted could eventually be retrieved using some search
engine[**]. The ideas (those that are worth it) and insights can still
spread (or become forgotten). For me it's "Open Ideas", something like
Open Source for non-code contributions. Occasionally I drop some code
on gridbug.de ('volatile' for stuff I might delete, 'random' for stuff
that might stay available), but that's just a small fraction of the
stuff I have on my disks. These two sub-domains thus have no index
page[***]; their content is bound to a post (or an email), but links
previously posted in Usenet might still lead to the information.

For the purpose of my previous post the code for the sample functions
was unnecessary, but I wanted to provide it as an "amendment" for folks
who want to see some complete and runnable code.

Feel free to ask if you need something specific.

Janis

[*] "never" = only rarely, or only in specific cases.

[**] Sadly whenever I now try to find some older stuff I often cannot
find it any more (using Google).

[***] Other sub-domains for specific topics do have an organized form
with an index.

Mike Sanders

2023-10-16 18:49:23 UTC


Unfortunately(?), no...

No? I say 'yes'. Much to read/learn...

Me? I think I will in fact. Index by the end of the week
and lots of interesting (at least to me) items on the way.

You only live once Janis, I hope someday you'll reconsider
for the benefit of others =)

--
:wq
Mike Sanders

Janis Papanagnou

2023-10-16 19:38:40 UTC


Post by Mike Sanders
You only live once Janis, I hope someday you'll reconsider
for the benefit of others =)

For the benefit of others, spread the word... - with or without
an index. :-)

I promise I will reconsider it in my next life! ;-)

Janis

Kpop 2GM

2023-10-27 19:37:56 UTC


Hmm... a heterogram is a word in which the number of unique characters equals the string length, but technically an isogram only requires that all of its characters occur with the same frequency.

For example, "DODO" is an isogram, but the function above returns FALSE (0) for it. The code below should fix those test cases. The updated function adds two early-exit criteria: (a) the input string is empty or only one character long (accepted at once), or (b) the total input length isn't an integer multiple of the number of copies of the leftmost character (rejected at once). From there on, the count returned by each subsequent gsub(...) must match that of the leftmost character.

 #  input         orig  new
 1  FRR             0    0
 2  DODO            0    1  <-----
 3  ECBFADEDCFAB    0    1  <-----
 4  KWNAWKAN        0    1  <-----
 5  BAIDU           1    1
 6  BLACKHORSE      1    1
 7  DUBAI           1    1
 8  DUMBWAITER      1    1
 9  ISOGRAM         1    1
10  PATHFINDER      1    1

======================================

function isogram_new(__, _, ___) {

    if ( ! ((_ = (___ = length(__)) <= !!___) ||
            ___ % (___ = gsub(substr(__, ++_, _--), "", __))))

        for (_++; __; )
            ___ == gsub(substr(__, _, _), "", __) || _ *= __ = ""

    return _
}
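
For readers who find the function above hard to follow, the underlying
test (every distinct character occurs equally often) can be sketched in
a few lines of Python. This is a plain restatement of the idea, not a
translation of the gsub() trick; the name isogram_equal_freq is made up
for this example.

```python
from collections import Counter

def isogram_equal_freq(s):
    """True if every distinct character of s occurs the same number of times."""
    if len(s) <= 1:
        return True              # empty and one-character strings pass trivially
    counts = Counter(s)          # character -> number of occurrences
    return len(set(counts.values())) == 1

if __name__ == "__main__":
    for word in ("FRR", "DODO", "ECBFADEDCFAB", "KWNAWKAN", "BAIDU"):
        print(word, int(isogram_equal_freq(word)))
```

A heterogram test in the sense of the original isogram() is the special
case where that common frequency is 1.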

— The 4Chan Teller