Discussion:
serial numbers as RS
raj
2023-01-18 03:30:39 UTC
Hi
I have a file with 7 fields.
The first field is a serial number.
In some records the 5th field is missing.
A few records got run together with the next record on the same line. In the sample file
I have shown only two records run together, but in some cases three or four records are joined.
sample file:

1 651 643786485 107249 5190 M SMITH 1284
2 963 212018826 103480 M746 R WADHWA 156
3 232 215036022 105012 M743 SAMBA 337
4 232 215036023 105012 M743 SAMBA 443
5 054 215036704 103325 KIYA K 351 ====> 5th field is missing
6 205 308363068 103402 5537 Mc DON 943
7 231 343328800 105880 MANO M 6403 8 231 343329128 105880 MANO M 8324 =====> in both the records 5th field is missing
9 309 361257222 103595 M564 C R SAM 102 10 309 361297561 103595 M564 C R SAM 332
11 216 308659868 625402 9693 FERNAND 365

The required output:

1 651 643786485 107249 5190 M SMITH 1284
2 963 212018826 103480 M746 R WADHWA 156
3 232 215036022 105012 M743 SAMBA 337
4 232 215036023 105012 M743 SAMBA 443
5 054 215036704 103325 4897 KIYA K 351
6 205 308363068 103402 5537 Mc DON 943
7 231 343328800 105880 MANO M 6403
8 231 343329128 105880 MANO M 8324
9 309 361257222 103595 M564 C R SAM 102
10 309 361297561 103595 M564 C R SAM 332

I tried treating the serial number as RS, but did not get the desired result:

awk 'BEGIN{RS="[0-9]+"}{
print $0 RT
}' file

Actually I only need the first four fields (including the serial number) and the last field.
It would be even more helpful if the output used "," as the delimiter.

Thank you
Janis Papanagnou
2023-01-18 05:56:33 UTC
The content of your post is inconsistent...
Post by raj
Hi
I have file with 7 fields.
No. The number of fields varies. A typical value is 8.
Post by raj
The first field is serial number
No. There are gaps, or subsequent lines joined together.
Post by raj
In some records 5th field is missing.
Also other fields in joined lines.
Post by raj
Few records got truncated with the next record. In the sample file
I have shown only two records truncation but in some cases even three to four records got truncated.
1 651 643786485 107249 5190 M SMITH 1284
2 963 212018826 103480 M746 R WADHWA 156
3 232 215036022 105012 M743 SAMBA 337
4 232 215036023 105012 M743 SAMBA 443
5 054 215036704 103325 KIYA K 351 ====> 5th field is missing
6 205 308363068 103402 5537 Mc DON 943
7 231 343328800 105880 MANO M 6403 8 231 343329128 105880 MANO M 8324 =====> in both the records 5th field is missing
9 309 361257222 103595 M564 C R SAM 102 10 309 361297561 103595 M564 C R SAM 332
11 216 308659868 625402 9693 FERNAND 365
1 651 643786485 107249 5190 M SMITH 1284
2 963 212018826 103480 M746 R WADHWA 156
3 232 215036022 105012 M743 SAMBA 337
4 232 215036023 105012 M743 SAMBA 443
5 054 215036704 103325 4897 KIYA K 351
And where should that "4897" come from?
Post by raj
6 205 308363068 103402 5537 Mc DON 943
7 231 343328800 105880 MANO M 6403
8 231 343329128 105880 MANO M 8324
You want records with 7 and 8 fields mixed?
Post by raj
9 309 361257222 103595 M564 C R SAM 102
10 309 361297561 103595 M564 C R SAM 332
I have tried by considering the serial number as RS but did not get the desired result
awk 'BEGIN{RS="[0-9]+"}{
print $0 RT
}' file
Actually I need first four fields(including serial number) and the last field.
This does not match the "required output" above.
Post by raj
If the "," delimiter is given in the output that would be more helpful.
Thank you
...so fix your data sample and requirements first.

And have a closer look at how to define lines whose number of
fields may be 14, 15, or 16, and how to distinguish that data.

And speak with whoever created that trash data to fix their process.

Janis
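
One way to take that closer look is to tally the lines by field count before attempting any repair. This is only a sketch: field_counts is a hypothetical helper name, and it assumes whitespace-separated fields as in the posted sample.

```shell
# Hypothetical helper: report how many lines have each field count,
# plus the first line number where that count occurs.
field_counts() {
    awk '
        { count[NF]++; if (!(NF in first)) first[NF] = NR }
        END {
            for (n in count)
                printf "%d fields: %d line(s), first at line %d\n", n, count[n], first[n]
        }
    ' "$@" | sort -n
}
```

Run as "field_counts file". Counts of 14 and above point at lines where two or more records were joined.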
raj
2023-01-18 14:57:33 UTC
Post by Janis Papanagnou
[...]
The data was copied and pasted into a text editor from a PDF file.
The user does not have any tool or access to convert the PDF to DOC or Excel.

The problem arises when the text is copied directly from the PDF file.
That is the reason for the inconsistency.

awk 'BEGIN{RS="[0-9]+"}{
print $0 RT
}' file
The result of the above is that each field ends up as a separate record:

1
651
643786485
107249
5190
M SMITH 1284

2
963
212018826
103480
M746
R WADHWA 156

3
232
215036022
105012
M743
SAMBA 337

4
232
215036023
105012
M743
SAMBA 443

5
054
215036704
103325
4897
KIYA K 351

....
.....
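
The split happens because a regex RS ends a record at every run of digits, not only at serial numbers (RT, used above, is also a gawk extension). A possible alternative is to leave RS alone and look for record starts by field shape. The sketch below rests on assumptions taken from the sample only: each record starts with a numeric serial number followed by a 3-digit, a 9-digit and a 6-digit field, and the "====>" annotations are not in the real file. The names split_records, isnum and starts_at are made up for illustration.

```shell
# Sketch only: split joined records and print fields 1-4 plus the last
# field of each record, comma-separated.
split_records() {
    awk -v OFS=, '
        # True if s is all digits and, when n > 0, exactly n characters long.
        function isnum(s, n) {
            return s ~ /^[0-9]+$/ && (n == 0 || length(s) == n)
        }
        # True if a record appears to start at field i of the current line.
        function starts_at(i) {
            return i + 3 <= NF && isnum($i, 0) && isnum($(i+1), 3) &&
                   isnum($(i+2), 9) && isnum($(i+3), 6)
        }
        {
            i = 1
            while (i <= NF) {
                if (!starts_at(i)) { i++; continue }   # skip stray fields
                # The record ends just before the next record start, or at NF.
                j = i + 4
                while (j <= NF && !starts_at(j)) j++
                print $i, $(i+1), $(i+2), $(i+3), $(j-1)
                i = j
            }
        }
    ' "$@"
}
```

Run as "split_records file". Fields that do not fit the assumed shape are skipped silently, so the result should be checked against the serial numbers for gaps.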
Janis Papanagnou
2023-01-18 15:26:54 UTC
Post by raj
[...]
The data was copy and pasted in a text editor from a pdf file.
If all you have is a PDF, I suggest using a more sophisticated
PDF tool to extract the text in a more accurate plain-text form,
or otherwise fixing the worst formatting issues by hand before posting.
Post by raj
The user is not having any tool/access to convert the pdf to doc or excel.
The problem is arising when it is directly copied from the pdf file.
That is the reason for inconsistency.
And don't forget to answer or clarify the other issues that have
been pointed out to you.

Janis
Kees Nuyt
2023-01-18 14:45:37 UTC
On Tue, 17 Jan 2023 19:30:39 -0800 (PST), raj
Post by raj
Actually I need first four fields(including serial number) and the last field.
The "last field" can always be addressed with $NF
Post by raj
If the "," delimiter is given in the output that would be more helpful.
Have a look at OFS or printf. Your choice.
--
Kees Nuyt
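
As a rough illustration of both suggestions, the two hypothetical helpers below print the first four fields and $NF with a comma between them, once via OFS and once via printf. They assume one record per line, so joined records would have to be split first.

```shell
# Variant 1: set OFS so that print joins its arguments with commas.
with_ofs() {
    awk -v OFS=, '{ print $1, $2, $3, $4, $NF }' "$@"
}

# Variant 2: spell the output format out with printf.
with_printf() {
    awk '{ printf "%s,%s,%s,%s,%s\n", $1, $2, $3, $4, $NF }' "$@"
}
```

Either can be run as "with_ofs file". OFS is shorter, while printf gives more control if a field later needs padding or quoting.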