Discussion:
djb2 hash using awk
(too old to reply)
Michael Sanders
2023-09-22 19:50:09 UTC
Permalink
Greetings folks, just sharing the knowledge (& test posting using google groups).

Beware wordwrap...

function djb2(str, hash, i, char) {

# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)

# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
}

return hash
}

function ord(char) {

return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))

}

BEGIN {

for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i

print djb2("my key is...")

}

notes:

https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

https://en.wikipedia.org/wiki/Daniel_J._Bernstein
Janis Papanagnou
2023-09-23 08:45:06 UTC
Permalink
Post by Michael Sanders
Greetings folks, just sharing the knowledge (& test posting using google groups).
Hello,

the post through Google groups seems technically fine - it certainly
wasn't in the past -, only Usenet posting conventions seem violated
(line length).

But the Awk code rises a couple questions...
Post by Michael Sanders
Beware wordwrap...
function djb2(str, hash, i, char) {
# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)
Variable 'n' not declared as local (just for good measure, and in
case you want to use the code in other code contexts).
Post by Michael Sanders
# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
The previous comment suggests that the subsequent code should use
the constant 2147483648 if you want a 32-bit unsigned integer. On
the other hand such formulas often use primes (and 2147483647 is
one). So it leaves me with uncertainty what is actually intended.
Post by Michael Sanders
hash = (hash * 33 + ord(char)) % 2147483647
Is it guaranteed that the expression evaluates correctly in all
awks? The intermediate results can become as large as 70866960411
(0x108000001B, which is larger than 32 bit), which needs to be
representable in awk (not sure what POSIX awk defines/requires
here).
Post by Michael Sanders
}
return hash
}
function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
Why use this costly pattern match and conditional expression
if you can anyway simply access the ord[] array where all data
is already present?

(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
Post by Michael Sanders
}
BEGIN {
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
What will the code produce with, say, "extended" ASCII, or the
common ISO 8859-x family of character sets? Is there any reason
to restrict it here?

What are the criteria for the chosen character subset? How will
or how should a TAB character in the data change the result?
Post by Michael Sanders
print djb2("my key is...")
}
As I said, just a couple of more or less obvious questions.

Janis
Janis Papanagnou
2023-09-23 08:48:57 UTC
Permalink
Post by Janis Papanagnou
Post by Michael Sanders
[...]
# modulo simulates 32-bit unsigned integer
The previous comment suggests that the subsequent code should use
the constant 2147483648 if you want a 32-bit unsigned integer. [...]
Erm..., _unsigned_ 32 bit integer would of course mean 4294967296.

Janis
Michael Sanders
2023-09-23 11:39:05 UTC
Permalink
Post by Janis Papanagnou
Variable 'n' not declared as local (just for good measure, and in
case you want to use the code in other code contexts).
Aye, I spotted this *after* I posted (of course...). Updated
function signature below:

function djb2(str, hash, n, i, char)
Post by Janis Papanagnou
Is it guaranteed that the expression evaluates correctly in all
awks? The intermediate results can become as large as 70866960411
(0x108000001B, which is larger than 32 bit), which needs to be
representable in awk (not sure what POSIX awk defines/requires
here).
Huh? From my initial post in this thread:

# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647

Modulo wraps at 2147483647, it'll never outgrow
it limits. Maybe I'm not grokking what you meant.
Post by Janis Papanagnou
Erm..., _unsigned_ 32 bit integer would of course mean 4294967296
That's the max, but certainly not the only possibility...
Post by Janis Papanagnou
[...]
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
And yet, this rational assumes an extension to the language,
which implies *only* gawk though right? What about the *standard*
awks (tested dbj2() using only mawk & busybox's awk thus far -
the smallest implementations out there). Still, have you by chance
invoked gawk as:

gawk --traditional

I cant on my end as the hard drive died on porkchop.bsd,
my OpenBSD box...
Post by Janis Papanagnou
What will the code produce with, say, "extended" ASCII, or the
common ISO 8859-x family of character sets? Is there any reason
to restrict it here?
What are the criteria for the chosen character subset? How will
or how should a TAB character in the data change the result?
No issues with TAB (see my function ord()), & non 7bit character
sets? Dunno... only need lower ASCII here, hence the hard-coded
limit:

min: 32, max: 126

If so compelled, test other sets & post your findings here.
Would be interested to read what you come up with.
Post by Janis Papanagnou
As I said, just a couple of more or less obvious questions.
Thanks Janis, as always, appreciate your insights. Hoping to get
back to using tin as my news reader soon, the google groups
interface its perhaps not my cup of tea. :/
Janis Papanagnou
2023-09-23 13:17:32 UTC
Permalink
Post by Michael Sanders
[...]
Post by Janis Papanagnou
Is it guaranteed that the expression evaluates correctly in all
awks? The intermediate results can become as large as 70866960411
(0x108000001B, which is larger than 32 bit), which needs to be
representable in awk (not sure what POSIX awk defines/requires
here).
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
Modulo wraps at 2147483647, it'll never outgrow
it limits. Maybe I'm not grokking what you meant.
With this formula the hash variable may (because of the modulus)
get values up to 2147483647-1, so within the expression the value
may grow beyond that limit to (2147483647-1)*33+126 because of
the previous (potentially large) hash value calculated.

(There's a good chance that modern systems and languages calculate
that without problem, but if we just assume "32-bit integer" then
it will (typically implicitly) overflow and produce wrong results.
A comment in the code may be useful and is often sufficient here,
so that folks using/porting the code to their environment may have
a visible caveat and may check that. - Myself I haven't inspected
the POSIX specs on that, that's why I formulated it as question.)
Post by Michael Sanders
Post by Janis Papanagnou
[...]
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
And yet, this rational assumes an extension to the language,
which implies *only* gawk though right? [...]
Maybe I misunderstand you. My suggestion here is not relying on
extensions; it's basically just a change of the expression '#<<'

for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
hash = (hash * 33 + ord[char]) % 2147483647 #<<
}

# function ord(char) ## deleted

# once in the BEGIN section (as you've done it)
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
Post by Michael Sanders
Post by Janis Papanagnou
What will the code produce with, say, "extended" ASCII, or the
common ISO 8859-x family of character sets? Is there any reason
to restrict it here?
What are the criteria for the chosen character subset? How will
or how should a TAB character in the data change the result?
No issues with TAB (see my function ord()), & non 7bit character
sets? Dunno... only need lower ASCII here, hence the hard-coded
min: 32, max: 126
I understand that. My question with TAB was aiming at what this
sample text should calculate

Hello<Blank>brave<Blank>new<Tab>World!<Newline>

Shall the Blank be part of the hash but the Tab not? - Appears to
me to be inconsistent, and if it is "as designed" it should have
an explicit and clear comment on that difference (and also that
(and maybe also why) e.g. [other] control characters [deliberately]
evaluate to 0).
Post by Michael Sanders
If so compelled, test other sets & post your findings here.
Not sure what that means - by "sets" you mean "code-sets"? - ...
Post by Michael Sanders
Would be interested to read what you come up with.
...if so then it would be simple; I'd include the whole range of
8-bit characters (including control characters), i.e. 0..255 to
cover all octet streams (and take also means to not exclude the
\n, or RS and FS in Awk parlance).
Post by Michael Sanders
Post by Janis Papanagnou
As I said, just a couple of more or less obvious questions.
Thanks Janis, as always, appreciate your insights. Hoping to get
back to using tin as my news reader soon, the google groups
interface its perhaps not my cup of tea. :/
In your private environment at least you are free to choose. The
horror is when a company forces you to switch from Unix to Windows,
when all applications are getting Web based, and if you're not
allowed to install "local" tools.

Janis
Michael Sanders
2023-09-23 12:25:39 UTC
Permalink
Post by Janis Papanagnou
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
One more quick question... does gawk have an ord() builtin
function or not?
Janis Papanagnou
2023-09-23 13:21:10 UTC
Permalink
Post by Michael Sanders
Post by Janis Papanagnou
(I would also try to avoid name clashes like ord() and ord[],
BTW. Here it's simple, because the function is superfluous;
just replace the ord(char) at the calling side by ord[char]
and remove the function completely.)
One more quick question... does gawk have an ord() builtin
function or not?
Not that I know of. - I seem to recall that when I once needed that
I implemented an array (as you did). - There's a chance that GNU Awk
has something in some extension libraries (but I'm too lazy to check).

Janis
Manuel Collado
2023-09-24 08:27:43 UTC
Permalink
Post by Janis Papanagnou
Post by Michael Sanders
One more quick question... does gawk have an ord() builtin
function or not?
Not that I know of. - I seem to recall that when I once needed that
I implemented an array (as you did). - There's a chance that GNU Awk
has something in some extension libraries (but I'm too lazy to check).
Yes, ord() and chr() are included in the gawk distribution as a dynamic
extension.

gawk -l ordchr 'BEGIN{print ord("x")}'
120

HTH
--
Manuel Collado - http://mcollado.z15.es
Michael Sanders
2023-09-24 10:31:44 UTC
Permalink
On Sunday, September 24, 2023 at 3:27:47 AM UTC-5, Manuel Collado wrote:

Yes it does help. Earnest thanks Manuel!
Post by Manuel Collado
Yes, ord() and chr() are included in the gawk distribution as a dynamic
extension.
gawk -l ordchr 'BEGIN{print ord("x")}'
120
HTH
Ben Bacarisse
2023-09-24 19:45:30 UTC
Permalink
Post by Michael Sanders
Yes it does help. Earnest thanks Manuel!
Note that the gawk library extension 'ord' will not do what your ord
function does, particularly with digits. But treating digits
differently to other characters is not what DJB2 does, so the library
extension ord will fix a bug.
--
Ben.
Manuel Collado
2023-09-23 08:54:25 UTC
Permalink
There is something wrong:

$ LC_ALL=C gawk 'function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}

BEGIN {print ord("x")}'
gawk: cmd. line:3: error: function `ord' called with space between name
and `(',
or used as a variable or an array

Please note ord[char] vs. ord(char)

HTH
Post by Michael Sanders
Greetings folks, just sharing the knowledge (& test posting using google groups).
Beware wordwrap...
function djb2(str, hash, i, char) {
# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)
# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
}
return hash
}
function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}
BEGIN {
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
print djb2("my key is...")
}
https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm
https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function
https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk
https://en.wikipedia.org/wiki/Daniel_J._Bernstein
--
Manuel Collado - http://mcollado.z15.es
Michael Sanders
2023-09-23 11:42:52 UTC
Permalink
Hi Manuel.

Question, have you tried: --traditional or --posix options?

See also:

https://www.gnu.org/software/gawk/manual/html_node/Compatibility-Mode.html
Post by Manuel Collado
$ LC_ALL=C gawk 'function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
}
BEGIN {print ord("x")}'
gawk: cmd. line:3: error: function `ord' called with space between name
and `(',
or used as a variable or an array
Please note ord[char] vs. ord(char)
HTH
Ben Bacarisse
2023-09-23 21:29:34 UTC
Permalink
Post by Michael Sanders
Greetings folks, just sharing the knowledge (& test posting using google groups).
Beware wordwrap...
function djb2(str, hash, i, char) {
# initialize hash to an arbitrary prime number
hash = 5381
n = length(str)
# iterate over characters and compute hash
for(i = 1; i <= n; i++) {
char = substr(str, i, 1)
# modulo simulates 32-bit unsigned integer
hash = (hash * 33 + ord(char)) % 2147483647
This modulus does nor do what you claim you want to do (as has been
pointed out). To simulate 32-bit unsigned arithmetic you need

hash = (hash * 33 + ord(char)) % 4294967296

here. Mind you, there is no formal definition of what the DJB2 hash is.
The only reliable reference is a C function that uses unsigned long int
calculations. This could be 32-bits wide, but is often in wider.
Post by Michael Sanders
}
return hash
}
function ord(char) {
return (char ~ /[[:digit:]]/ ? int(char) : int(ord[char]))
What version of AWK, run with what flags, accepts this? I can't get it
past gawk, nawk, mawk nor original-awk.
Post by Michael Sanders
}
BEGIN {
for(i=32;i<=126;i++) ord[sprintf("%c", i)] = i
print djb2("my key is...")
}
--
Ben.
Michael Sanders
2023-09-24 02:20:51 UTC
Permalink
Post by Ben Bacarisse
This modulus does nor do what you claim you want to do (as has been
pointed out).
No sir, it does, in fact.

# modulo simulates 32-bit unsigned integer

https://en.wikipedia.org/wiki/2,147,483,647

https://www.bing.com/search?FORM=U523DF&PC=U523&q=is+this+2147483647+a+32-bit+number

https://www.google.com/search?q=is+this+2147483647+a+32-bit+number
Post by Ben Bacarisse
What version of AWK, run with what flags, accepts this? I can't get it
past gawk, nawk, mawk nor original-awk.
https://gnuwin32.sourceforge.net/packages/mawk.htm

https://frippery.org/busybox/
Janis Papanagnou
2023-09-24 04:38:59 UTC
Permalink
Post by Michael Sanders
Post by Ben Bacarisse
This modulus does nor do what you claim you want to do (as has been
pointed out).
No sir, it does, in fact.
# modulo simulates 32-bit unsigned integer
For unsigned ints with a range that is implementable by binary words
x % 4294967296 (modulo) is equivalent to x & 4294967295 (bitwise-and),
x % 2147483648 (modulo) is equivalent to x & 2147483647 (bitwise-and).
(Mind the differing last digit of modulo and bitwise-and expressions!)
The expressions on the first line are for 32-bit unsigned integer and
the expressions on the second line are for 31-bit unsigned integer
calculations.

Your sample (the fact I pointed out in an earlier post and Ben repeated
here) uses x % 2147483647 which doesn't match a "32-bit simulation".
(Compare it with the four forms above and think about it.)

To add something new with this post I want to point out that a modulus
with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
in random number generators or in cryptography. In these areas you may
have large numbers x and a comparable small prime modulus. Since you
don't have such large registers for the x you can iterate over parts
of the number sequentially (using a loop) to perform the operation.
This insight can also be applied in case of the issue you have with
your original code where internal overflows can arise; in case that
the prime number _2147483647_ was *intended* (as opposed to 31/32 bit
unsigned integer operations) you can avoid these computation errors
by that iterative method (here, in fact, a simple split of x in only
two parts would suffice for the "iteration").

Janis
Michael Sanders
2023-09-24 10:30:26 UTC
Permalink
Post by Janis Papanagnou
Your sample (the fact I pointed out in an earlier post and Ben repeated
here) uses x % 2147483647 which doesn't match a "32-bit simulation".
That's not correct & there is is no error using 2147483647, it works just
as I intended. But elsewhere... I do think my choice of naming ord()
& ord[] were poor, as these may be intrinsic to some versions of
a given awk's namespace.
Post by Janis Papanagnou
(Mind the differing last digit of modulo and bitwise-and expressions!)
See the dbj2gawk() version I posted elsewhere in this thread. But of
course the older awks lack bitwise...
Post by Janis Papanagnou
Since you don't have such large registers...
Exactly, hence the signed int masquerading as an unsigned int.
We dont want to get sidetracked on that point, there are lots of
other 32-bit numbers that can be chosen in any event (& ought to
be to for security reasons). I took one that works with mawk,
busybox awk. Just think about this & you'll see what I mean.
Post by Janis Papanagnou
in case that the prime number _2147483647_ was *intended*...
Yes, intended.

Thanks Janis.
Janis Papanagnou
2023-09-24 13:28:02 UTC
Permalink
Post by Michael Sanders
Post by Janis Papanagnou
Your sample (the fact I pointed out in an earlier post and Ben repeated
here) uses x % 2147483647 which doesn't match a "32-bit simulation".
That's not correct & there is is no error using 2147483647, it works just
as I intended.
I cannot see how "it works" if it is not "working" for the valid
31-bit number 2147483647; the modulus 2147483647 % 2147483647 == 0
thus it doesn't "simulate" 31-bit arithmetic (and also not 32-bit
arithmetic) with that modulus. But maybe you just have a different
understanding when speaking about unsigned integer "simulation".

For an operation x % 2147483647 you are (as I previously mentioned,
similar as in some coding theory applications) doing an arithmetic
reduction (or "folding") of longer numbers to smaller ones. That's
fine per se, but you are not "simulating" unsigned integer math. If
you wanted to say that you are restricting computation and numeric
output to 31 bit values then it's okay. With the already mentioned
flaw that the expressions exceeds 31 and 32 bit in the expression,
so the math processor needs to support larger integers or FP math
with sufficient resolution in the mantissa. Which is another point
why your expression is not [reliably] working [correctly] within a
32 bit range. To illustrate let's assume that the expression will
(as shown to be possible) overflow the 32 bit, then the result will
effectively have implicitly undergone an implicit x % 2147483648
operation and then an explicit x % 2147483647 operation; together
res = x % 2147483648 % 2147483647. By this there is algorithmically
a discontinuity introduced which is almost always not desired.
Post by Michael Sanders
But elsewhere... I do think my choice of naming ord()
& ord[] were poor, as these may be intrinsic to some versions of
a given awk's namespace.
Post by Janis Papanagnou
(Mind the differing last digit of modulo and bitwise-and expressions!)
See the dbj2gawk() version I posted elsewhere in this thread. But of
course the older awks lack bitwise...
Yes, that's why you'd use the modulus; but x % 2147483648 are
necessary to handle all 31 bit unsigned integer values and not
x % 2147483647.

Notwithstanding that the use the latter may be algorithmically
needed. Such algorithms may choose also other primes p in x % p.
But still note the undesired introduced discontinuity mentioned
above in case of the overflows within expressions.

Janis
Janis Papanagnou
2023-09-24 13:37:20 UTC
Permalink
Post by Janis Papanagnou
To add something new with this post I want to point out that a modulus
with primes (like 2147483647, a Mersenne prime) is regularly done e.g.
in random number generators or in cryptography. In these areas you may
have large numbers x and a comparable small prime modulus. Since you
don't have such large registers for the x you can iterate over parts
of the number sequentially (using a loop) to perform the operation. [...]
I just found on an old disk some Awk code where I used this method
to compute and check IBANs some years ago. Here's the diff of two
approaches, one using external multiple-precision arithmetic using
'bc' co-process (GNU Awk) and one working on chunks of the numeric
data string.

$ diff iban1 iban2
59a60,72
Post by Janis Papanagnou
function mod(n, s, l, z, m) {
while (length(s) > 4) {
z = substr(s, 1, 4)
m = z % n
s = m substr(s, 5)
}
return s % n
}
function mod97(s) {
return mod(97, s)
}
66,67c79
< print "98 - ( " iban " % 97 )" |& "bc"
< "bc" |& getline checksum
---
Post by Janis Papanagnou
checksum = 98 - mod97(iban)
72,73c84
< print iban " % 97 == 1" |& "bc"
< "bc" |& getline ok
---
Post by Janis Papanagnou
ok = mod97(iban) == 1
(Just for illustration purposes in case someone's interested in that
method.)

Janis
Ben Bacarisse
2023-09-24 19:39:44 UTC
Permalink
Post by Michael Sanders
Post by Ben Bacarisse
This modulus does nor do what you claim you want to do (as has been
pointed out).
No sir, it does, in fact.
I can only repeat that it does not. A simple test shows that it does
not, so it is probable that you don't know what "simulates 32-bit
unsigned integer" means.

Positive integers are not "unsigned". 32-bit unsigned arithmetic has no
sign representation at all, and the operations are all done modulo
2**32.

Even if you interpreted "32-bit unsigned integers" to mean that the
range should be that of the positive (but signed) 32-bit integers, your
modulus is still wrong. For that, you would need to use 2147483648 as
the modulus, giving 0 to 2147483647 inclusive.
Post by Michael Sanders
# modulo simulates 32-bit unsigned integer
https://en.wikipedia.org/wiki/2,147,483,647
https://www.bing.com/search?FORM=U523DF&PC=U523&q=is+this+2147483647+a+32-bit+number
https://www.google.com/search?q=is+this+2147483647+a+32-bit+number
What do you think these links show other than the obvious and
uncontested fact that the largest signed 32-bit number is a 32-bit
number?
Post by Michael Sanders
Post by Ben Bacarisse
What version of AWK, run with what flags, accepts this? I can't get it
past gawk, nawk, mawk nor original-awk.
https://gnuwin32.sourceforge.net/packages/mawk.htm
What version of mawk is that? On my system:

$ mawk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan

random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
$ mawk -f hash.awk
mawk: hash.awk: line 19: syntax error at or near [
mawk: hash.awk: line 25: syntax error at or near [

Lines 19 and 25 have 'ord['... where ord has been previously defined as
a function.
Post by Michael Sanders
https://frippery.org/busybox/
Ah, yes, busybox awk accepts it. I wonder why? It's definitely an
outlier. Don't use ord as the name of a function and of an array.
--
Ben.
Michael Sanders
2023-09-24 21:17:30 UTC
Permalink
On Sunday, September 24, 2023 at 2:39:49 PM UTC-5, Ben Bacarisse wrote:

[...]

Ben, listen the number wraps (or folds as stated elsewhere)
to zero at its max using modulo. Thats obvious. I, you, Janis
(& likely others) see that.

Now the thing is, the comment I placed in the script was the
word 'simulates', if you want to fritter away your time debating
that, have at it. Me? Too busy to deal with the snarky attitude.

The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
on my end & it seems rich to me that you cant get them to run but
otherwise know that I'm wrong. Hmmm... None of this is the point,
we simply want a large number (32bit - 1 as is the case) to lesson
the chance of collisions in the hash & that's it... I cant believe
you & Janis cant see that for what it is. Good grief, I should've known...
Janis Papanagnou
2023-09-24 22:45:50 UTC
Permalink
[...]
You are right, we all wasted our time.
Kenny McCormack
2023-09-24 23:36:49 UTC
Permalink
Post by Janis Papanagnou
[...]
You are right, we all wasted our time.
And you've got no one to blame but yourself.

You & Ben.

It was pretty clear OP wasn't interested in the usual "human compiler"
treatment, but you guys went ahead with it anyway.
--
The book "1984" used to be a cautionary tale;
Now it is a "how-to" manual.
Janis Papanagnou
2023-09-25 00:38:02 UTC
Permalink
Post by Kenny McCormack
Post by Janis Papanagnou
[...]
You are right, we all wasted our time.
And you've got no one to blame but yourself.
Sure.
Post by Kenny McCormack
You & Ben.
It was pretty clear OP wasn't interested in the usual "human compiler"
treatment, but you guys went ahead with it anyway.
The OP initially said about his interest it's "sharing the knowledge".
Anyone tell us what that would be in this thread. - Rhetorical, don't
bother.

He could as well have contributed BEGIN { print 42 } . "It's working."
And it's not rising all the questions the OP's code did and still does.

And (with your contribution and my reply) we continue wasting our time,
don't we, Kenny?
Ben Bacarisse
2023-09-25 01:27:45 UTC
Permalink
Post by Michael Sanders
[...]
Ben, listen the number wraps (or folds as stated elsewhere)
to zero at its max using modulo. Thats obvious. I, you, Janis
(& likely others) see that.
Indeed we do all know that.
Post by Michael Sanders
Now the thing is, the comment I placed in the script was the
word 'simulates', if you want to fritter away your time debating
that, have at it. Me? Too busy to deal with the snarky attitude.
No, I have no interest in debating it. I thought you might want to know
how to implement the behaviour described by the comment you wrote,
especially as that behaviour was a key part of how the original DJB hash
was defined.
Post by Michael Sanders
The scripts I posted, both dbj2() & dbj2gawk() work fine & correctly
on my end & it seems rich to me that you cant get them to run but
otherwise know that I'm wrong.
Of course I can get your code to run. It's trivial to fix the bug of
using ord for two incompatible purposes, but I find it rich that you
think it at all odd that someone could know your code is wrong without
running it. That's what programmers do every day.
Post by Michael Sanders
Hmmm... None of this is the point,
we simply want a large number
Both the subject line of the thread, and the links you gave in the
original post, suggest you wanted to implement a specific hash function:
DJB's 32-bit "multiply by 33" hash. But your code does not do that.
For example, the DJB hash of the string "abcdef" is 4048079738, but your
code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
the original one uses the full range of 32-bit unsigned int.
Post by Michael Sanders
(32bit - 1 as is the case) to lesson
the chance of collisions in the hash & that's it...
This is why halving the range by using only 31 bits is not a good idea.
It's not a /terrible/ idea because 31 bits is still a huge range, but
since 32-bit wrapping is free on most hardware, there is absolutely no
upside to using modulo 2147483647 arithmetic for a hash.

And your saying "32bit - 1 as is the case" means you still don't know
what the code you wrote is really doing.
Post by Michael Sanders
I cant believe
you & Janis cant see that for what it is. Good grief, I should've known...
Well now you know!

1) DJB's has hash usually uses 32-bit unsigned arithmetic. (Very old
implementation might use 16-bit arithmetic, and some newer ones might
use 64 bits.)

2) You can do that in many modern awks with modulo 4294967296
arithmetic.

3) In gawk you could use its logical 'and' function along with its
hexadecimal numbers to make it very clear what's going on

hash = and(hash * 33 + ord(c), 0xFFFFFFFF)
--
Ben.
Michael Sanders
2023-09-25 07:26:58 UTC
Permalink
Post by Ben Bacarisse
For example, the DJB hash of the string "abcdef" is 4048079738, but your
code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
the original one uses the full range of 32-bit unsigned int.
[...]
This is why halving the range by using only 31 bits is not a good idea.
It's not a /terrible/ idea because 31 bits is still a huge range, but
since 32-bit wrapping is free on most hardware, there is absolutely no
upside to using modulo 2147483647 arithmetic for a hash.
And your saying "32bit - 1 as is the case" means you still don't know
what the code you wrote is really doing.
Reader's from the future, re-read the 1st paragraph...

Here's where good ol' Ben has (unwittingly, chuckle) proven, & in his own
words mind you, the importance of using of using a unique key number.
Ben Bacarisse
2023-09-25 22:50:41 UTC
Permalink
Post by Michael Sanders
Post by Ben Bacarisse
For example, the DJB hash of the string "abcdef" is 4048079738, but your
code gives 1900599327. Your hash can go no higher than 2**31 - 2, but
the original one uses the full range of 32-bit unsigned int.
[...]
This is why halving the range by using only 31 bits is not a good idea.
It's not a /terrible/ idea because 31 bits is still a huge range, but
since 32-bit wrapping is free on most hardware, there is absolutely no
upside to using modulo 2147483647 arithmetic for a hash.
And your saying "32bit - 1 as is the case" means you still don't know
what the code you wrote is really doing.
Reader's from the future, re-read the 1st paragraph...
Here's where good ol' Ben has (unwittingly, chuckle) proven, & in his own
words mind you, the importance of using of using a unique key number.
What's a key number?
--
Ben.
Michael Sanders
2023-09-26 03:32:22 UTC
Permalink
Post by Ben Bacarisse
What's a key number?
It doesn't matter Ben. I do appreciate your
willingness to help despite our p*ssing contest.
Moving forward, here's an intended usage example.

Given a CSV file of: Smith, John, HASH

And parsed as, oh something like:

if ($1 == "Smith" && $2 == "John") {

if ($3 == dbj2("policy_number")) {...}
}

We've gained simple a way to obfuscate small chucks
of patient records.
Ben Bacarisse
2023-09-26 11:20:17 UTC
Permalink
Post by Michael Sanders
Post by Ben Bacarisse
What's a key number?
It doesn't matter Ben.
OK. I thought you might want to be understood.
--
Ben.
Mike Sanders
2023-09-26 19:21:19 UTC
Permalink
Post by Ben Bacarisse
OK. I thought you might want to be understood.
Well, I wake up in a new world everyday it seems =)

Have look up thread somewhere or another (subject
contains '[Complete]'.
--
:wq
Mike Sanders
Michael Sanders
2023-09-24 02:50:41 UTC
Permalink
Post by Michael Sanders
Greetings folks, just sharing the knowledge (& test posting using google groups).
Note to self: built a version for gawk & tested online here:

https://ideone.com/ZZVQl2

script follows:

function djb2gawk(str, n, i, hash, ascii) {

hash = 5381
n = length(str)

for(i = 1; i <= n; ++i) {
char = substr(str, i, 1)
ascii = alphanum(char)
hash = (hash * 33 + ascii) % ((2^32) - 1)
}

return hash
}

function alphanum(char) {

return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", char)
}

BEGIN {
str = "comp.lang.awk"
print djb2gawk(str)
}
Michael Sanders
2023-09-25 19:28:24 UTC
Permalink
Post by Michael Sanders
Greetings folks, just sharing the knowledge (& test posting using google groups).
For the sake of completeness, a key generator (you supply a 32bit key):

# Michael Sanders 2023: key generator for dbj2() & dbj2gawk()
#
# creates a key block (25 keys, 5x5 matrix) as shown below:
#
# 2147483647 2147483646 2147483645 2147483644 2147483643
# 2147483642 2147483641 2147483640 2147483639 2147483638
# 2147483637 2147483636 2147483635 2147483634 2147483633
# 2147483632 2147483631 2147483630 2147483629 2147483628
# 2147483627 2147483626 2147483625 2147483624 2147483623

BEGIN {

KEYMAX = 2147483647 # your key's max number
BLOCKS = 5 # number of blocks to create

for (x = KEYMAX; x >= 0; x--) {
printf("%010d ", x);
if ((KEYMAX - x) % 5 == 4) {
printf("\n");
if (++y % 5 == 0) {
printf("\n")
if (--BLOCKS < 1) exit
}
}
}

}
Mike Sanders
2023-09-26 19:11:30 UTC
Permalink
Well, about to bring this little project to its conclusion =)
A special thanks (in no particular order) to:

Janis Papanagnou, Manuel Collado, & Ben Bacarisse.

Great insights, advice & help, appreciate you all.

tags: awk, sh, djb2, hash, keys, crypto

# Michael Sanders 2023: djb2 - simple string hash for awk
#
# https://porkchop.neocities.org/notes/djb2.txt
#
# usage example...
#
# given a CSV file of: Smith, John, HASH
#
# you could deploy djb2() in the following manner:
#
# if ($1 == "Smith" && $2 == "John" && $3 == djb2("medical_condition")) {...}
#
# if a unique number is desired, use one of the key generators below

function djb2(str, hash, x, y, ascii) {

hash = 5381
y = length(str)

for(x = 1; x <= y; x++) {
ascii = substr(str, x, 1)
hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
# hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
}

return hash

}

function alphanum(tmp) {

return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)

}

BEGIN {

print djb2("sensitive data")

}

# -----------------------------------------------------------------

# Michael Sanders 2023: key generator for djb2.awk
#
# creates 1250 random & unique keys, where each block
# contains 25 keys within a 5x5 matrix, sample output:
#
# 4044119583 2096446957 0146530746 2307162583 1145196028
# 2074570158 0782926302 0126577871 1957612719 1264962473
# 2768662034 3770197738 1165669763 3328823363 1160752685
# 1609089706 2699782284 0621370497 3966710650 2024400675
# 1453519246 1174954443 1588003602 3567230071 2877329117

BEGIN {

MIN = 0123456789 # min 10 digit num
MAX = 4294967295 # max 10 digit num
BLX = 50 # number of blocks

seed = systime(); srand(seed)
printf("\nSeed: %s\n\n", seed)

for (x = 1; x <= BLX; x++) {
for (y = 1; y <= 5; y++) {
s = ""
for (z = 1; z <= 5; z++) {
do {key = int(rand() * (MAX - MIN + 1)) + MIN} while(key in keys)
keys[key]
s = (s == "" ? sprintf("%010d", key) : s " " sprintf("%010d", key))
}
printf("%s\n", s)
}
if (x < BLX) printf("%s", "\n")
}
}

# -----------------------------------------------------------------

# Michael Sanders 2023: key generator2 for older awks using dbj2.awk
#
# use this key generator instead if your awk lacks the abilty to
# use either large integers or the systime() builtin
#
# creates key blocks (25 keys, 5x5 matrix) as shown below:
#
# 2147483647 2147483646 2147483645 2147483644 2147483643
# 2147483642 2147483641 2147483640 2147483639 2147483638
# 2147483637 2147483636 2147483635 2147483634 2147483633
# 2147483632 2147483631 2147483630 2147483629 2147483628
# 2147483627 2147483626 2147483625 2147483624 2147483623

BEGIN {

KEYMAX = 2147483647 # your 10 digit max number
BLOCKS = 50 # number of blocks to create

for (x = KEYMAX; x >= 0; x--) {
printf("%010d ", x);
if ((KEYMAX - x) % 5 == 4) {
printf("\n");
if (++y % 5 == 0) {
printf("\n")
if (--BLOCKS < 1) exit
}
}
}

}

# -----------------------------------------------------------------

notes:

https://stackoverflow.com/questions/1579721/why-are-5381-and-33-so-important-in-the-djb2-algorithm

https://stackoverflow.com/questions/10696223/reason-for-the-number-5381-in-the-djb-hash-function

https://groups.google.com/g/comp.lang.c/c/lSKWXiuNOAk

https://en.wikipedia.org/wiki/Daniel_J._Bernstein

https://groups.google.com/g/comp.lang.awk/c/FZNpn8Sxf9k

# eof
--
:wq
Mike Sanders
Keith Thompson
2023-09-26 21:31:40 UTC
Permalink
***@invalid.foo (Mike Sanders) writes:
[...]
Post by Mike Sanders
function djb2(str, hash, x, y, ascii) {
hash = 5381
y = length(str)
for(x = 1; x <= y; x++) {
ascii = substr(str, x, 1)
hash = (hash * 33 + alphanum(ascii)) % ((2^32) - 1) # 32-bit key or
# hash = (hash * 33 + alphanum(ascii)) % 2147483647 # generated key
}
return hash
}
function alphanum(tmp) {
return index("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", tmp)
}
[...]

Why do you call this "djb2"?

Here's a description of the djb2 hash function from
<https://www.eecs.yorku.ca/~oz/hash.html>:

this algorithm (k=33) was first reported by dan bernstein many years
ago in comp.lang.c. another version of this algorithm (now favored
by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
magic of number 33 (why it works better than many other constants,
prime or not) has never been adequately explained.

unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;

while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}

I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the
same thing. Would you agree that calling it "djb2" is misleading?

I find it odd that your function treats all non-alphanumeric characters
as the same. Is there a reason for that?
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Mike Sanders
2023-09-26 22:20:56 UTC
Permalink
Post by Keith Thompson
Why do you call this "djb2"?
Well, because that is what its called.
Post by Keith Thompson
Here's a description of the djb2 hash function from
Yes, one of the links in the notes section has a reference to that url.
Did you not read the notes?
Post by Keith Thompson
this algorithm (k=33) was first reported by dan bernstein many years
ago in comp.lang.c. another version of this algorithm (now favored
by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the
magic of number 33 (why it works better than many other constants,
prime or not) has never been adequately explained.
I know, intesting stuff I thought!
Post by Keith Thompson
unsigned long
hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while (c = *str++)
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
Here's my take in virtually indentical take in c (uint32_t):

uint32_t djb2(const char *str) {

uint32_t hash = 5381;
int c;
while ((c = *str++))
hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

return hash;
}
Post by Keith Thompson
I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the
same thing. Would you agree that calling it "djb2" is misleading?
Nope, wont change it either. =) I've provided my notes, so others can study it,
use it, extend it, trace it back to its source if so compelled... I make
no claims as to being its originator. Heck the *cough* notes *cough* alone
demonstrate that. I feel its 'nifty' for no reason than its short & sweet.
awk (or least I) needed this functionality. Besides, I'm hard-pressed to think
anyone will mistake it as anything misleading. I reckon no good deed will
go unpunished. Such is life on usenet...
Post by Keith Thompson
I find it odd that your function treats all non-alphanumeric characters
as the same. Is there a reason for that?
Yes, just dealing (a-z, A-Z, 0-9) here. Personally don't want to imbue
it with extra, uneeded baggage. Feel free to modify it to suit your tastes.
--
:wq
Mike Sanders
Keith Thompson
2023-09-26 22:38:47 UTC
Permalink
Post by Mike Sanders
Post by Keith Thompson
Why do you call this "djb2"?
Well, because that is what its called.
You mean that's what you decided to call it. I'm asking why.
Post by Mike Sanders
Post by Keith Thompson
Here's a description of the djb2 hash function from
Yes, one of the links in the notes section has a reference to that url.
Did you not read the notes?
I did. That's how I noticed that you've implemented a different
algorithm than any version of the original djb2.

[...]
Post by Mike Sanders
Post by Keith Thompson
I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the
same thing. Would you agree that calling it "djb2" is misleading?
Nope, wont change it either. =)
[...]

Noted.

What you've posted is not djb2. You shouldn't claim that it is. You've
made it clear that you intend to continue to do so.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Mike Sanders
2023-09-26 23:12:11 UTC
Permalink
Post by Keith Thompson
You mean that's what you decided to call it. I'm asking why.
Well, I don't think I have an anwser that will satisfy you.
Post by Keith Thompson
What you've posted is not djb2. You shouldn't claim that it is. You've
made it clear that you intend to continue to do so.
What do you feel I ought to do to clarify the problem with this cavet:

It wont be taken down or renamed.

Perhaps add clarification to the document how my take only deals with
alphanumeric chars? Let's work it out.
--
:wq
Mike Sanders
Keith Thompson
2023-09-26 23:53:41 UTC
Permalink
Post by Mike Sanders
Post by Keith Thompson
You mean that's what you decided to call it. I'm asking why.
Well, I don't think I have an anwser that will satisfy you.
Then post an answer that doesn't satisfy me. So far you've said nothing.
Post by Mike Sanders
Post by Keith Thompson
What you've posted is not djb2. You shouldn't claim that it is. You've
made it clear that you intend to continue to do so.
What do you feel I ought to do to clarify the problem
Rename it.
Post by Mike Sanders
It wont be taken down or renamed.
If you wrote a hash function that yields a 160-bit result that differs
from the result returned by the actual sha1 function, would you release
it under the name "sha1"? Assuming your answer is no, how is this
different? (Admittedly djb2 isn't as well established.)
Post by Mike Sanders
Perhaps add clarification to the document how my take only deals with
alphanumeric chars? Let's work it out.
Work it out yourself.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Mike Sanders
2023-09-27 00:49:14 UTC
Permalink
Post by Keith Thompson
Work it out yourself.
Will do.
--
:wq
Mike Sanders
Mike Sanders
2023-09-27 01:32:05 UTC
Permalink
Post by Mike Sanders
Will do.
https://porkchop.neocities.org/notes/mHash.txt
--
:wq
Mike Sanders
Janis Papanagnou
2023-09-27 08:03:59 UTC
Permalink
Post by Keith Thompson
[...]
Why do you call this "djb2"?
Here's a description of the djb2 hash function from
[...]
I see some similarities between the original djb and your hash function,
and yours was undoubtedly inspired by djb2, but they're very much not the
same thing. Would you agree that calling it "djb2" is misleading?
I find it odd that your function treats all non-alphanumeric characters
as the same. Is there a reason for that?
Indeed. I as well was misleaded by the impression I've got that he
was trying to implement (just in a wrong way) an existing algorithm.
I noticed that (my) mistake not before I saw two versions of his code
in one post, each using different character set encodings - which of
course would lead to different results even if we compare only the
OP's own implementations' results. This arbitrariness in conjunction
with the inherent flaws the code versions had (and may still have;
I don't bother any more) makes clear that we obviously have a "Not
Invented Here" case (https://en.wikipedia.org/wiki/Not_invented_here),
which most likely makes all in depth discussions void anyway. Glad
it became obvious after all. (Bad that I noticed that too late. And I
see that the OP just recently followed your suggestion to rename his
hash function.)

It's always good to get information about the _intention_ of a piece
of posted code early and with the original post; that way bandworm
threads and misunderstandings could be avoided.

<OT>
Let's come back to the actual application and application domain ...

Inferred from the OP's posting where he wrote: dbj2("policy_number"))
and "simple a way to obfuscate small chucks of patient records." ...

In statistical medicine context there's demand to anonymize patient
data. For statistical purposes and cause-effect insights it's usually
necessary to be able to assign data to an "individual entity" without
disclosing his identity. But a hash code does not support that; it
can lead to "ID"-clashes that assigns different results to one same
subject. You need to create (real) keys. - Remember Ben's confusion
about (the OP's misnomer) "key number"! The OP may seem to require
keys but he's creating a hash value.

Of course if a hash value is sufficient - only the OP knows this! -
then I'd have used some existing and proven algorithm instead of
inventing my own (with possible flaws or non-obvious bad properties).
(But this as well has the OP to decide and is not an Awk question
anyway.)
</OT>

Janis
Mike Sanders
2023-09-27 10:45:18 UTC
Permalink
Wow... I've simply attempted to learn & share that with others.
I cant recall every seeing such, I dunno, at a loss for words.
It would've taken me a few more iterations to bring it all togther
but with this environment, its not okay.
Post by Janis Papanagnou
Indeed. I as well was misleaded by the impression I've got that he
was trying to implement (just in a wrong way) an existing algorithm.
I noticed that (my) mistake not before I saw two versions of his code
in one post, each using different character set encodings - which of
course would lead to different results even if we compare only the
OP's own implementations' results. This arbitrariness in conjunction
with the inherent flaws the code versions had (and may still have;
I don't bother any more) makes clear that we obviously have a "Not
Invented Here" case (https://en.wikipedia.org/wiki/Not_invented_here),
which most likely makes all in depth discussions void anyway. Glad
it became obvious after all. (Bad that I noticed that too late. And I
see that the OP just recently followed your suggestion to rename his
hash function.)
It's always good to get information about the _intention_ of a piece
of posted code early and with the original post; that way bandworm
threads and misunderstandings could be avoided.
<OT>
Let's come back to the actual application and application domain ...
Inferred from the OP's posting where he wrote: dbj2("policy_number"))
and "simple a way to obfuscate small chucks of patient records." ...
In statistical medicine context there's demand to anonymize patient
data. For statistical purposes and cause-effect insights it's usually
necessary to be able to assign data to an "individual entity" without
disclosing his identity. But a hash code does not support that; it
can lead to "ID"-clashes that assigns different results to one same
subject. You need to create (real) keys. - Remember Ben's confusion
about (the OP's misnomer) "key number"! The OP may seem to require
keys but he's creating a hash value.
Of course if a hash value is sufficient - only the OP knows this! -
then I'd have used some existing and proven algorithm instead of
inventing my own (with possible flaws or non-obvious bad properties).
(But this as well has the OP to decide and is not an Awk question
anyway.)
</OT>
Janis
--
:wq
Mike Sanders
Michael Sanders
2023-09-27 18:13:21 UTC
Permalink
Post by Janis Papanagnou
Of course if a hash value is sufficient - only the OP knows this!
key = 32bitmax - random_lesser

(admin issues keys so no duplicates)

Loading...