Filling file with 0xFF gives C3BF in OSX
This command fills a file with 0xff bytes on Linux:
dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin
When I run it on OS X, the results are different:
$ dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin
100+0 records in
200+0 records out
102400 bytes transferred in 0.000781 secs (131104008 bytes/sec)
$ hexdump -C paddedFile.bin
00000000  c3 bf c3 bf c3 bf c3 bf  c3 bf c3 bf c3 bf c3 bf  |................|
*
00032000
What's going on here?
macos dd
asked Aug 16 at 3:35
Synesso
2 Answers
accepted
Straight to the point.
It all hinges on the LANG or LC_ALL value set in your terminal session when you run tr. Linux probably has them set to C, while macOS has it set to something like en_US.UTF-8. Of course that en_US could be some other language setting such as en_GB (UK English), but the point is that the [something].UTF-8 setting, instead of the plain C locale, is what is causing this.
More details.
It seems that tr on macOS converts the 0xff to its UTF-8 equivalent, c3 bf, instead of emitting the raw byte 0xff. This is explained in this Apple community support thread:
Linux doesn't handle Unicode in the Terminal like the Mac does. If you set the "LANG" environment variable to "C" (as it probably is on Linux), it will work. Otherwise, all those high-order bits are going to get interpreted as Unicode characters.
And using that LANG tip works! Just do the following; I tested this personally on macOS 10.13.6 (High Sierra).
First, make a note of the existing LANG value like this:
echo $LANG
The output I see is:
en_US.UTF-8
Now set the LANG value to C like this:
LANG=C
And run that command again:
dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin
Now the hexdump values should look like this:
hexdump -C paddedFile.bin
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
00019000
To reset the LANG value, just close that terminal session or run this command:
LANG=en_US.UTF-8
Or, as pointed out in the comments, you can set the LANG value directly on the command line, just before calling tr, like this:
dd if=/dev/zero ibs=1k count=100 | LANG=C tr "\000" "\377" >paddedFile.bin
And you can even use LC_ALL instead of LANG, because LC_ALL takes precedence over LANG, like this:
dd if=/dev/zero ibs=1k count=100 | LC_ALL=C tr "\000" "\377" >paddedFile.bin
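If you want to confirm the result without reading a hexdump, something like the following sketch may help. It assumes a POSIX shell, and padded_check.bin is a hypothetical file name used only for this example. It checks the byte count and then checks that nothing is left once every 0xff byte is deleted.

```shell
#!/bin/sh
# Sketch: create a 100 KiB file of 0xff bytes and check the result.
# "padded_check.bin" is a hypothetical file name used only for this example.
out=padded_check.bin

# LC_ALL=C is set for tr only, so tr treats \377 as a single byte.
dd if=/dev/zero bs=1024 count=100 2>/dev/null | LC_ALL=C tr '\000' '\377' >"$out"

# The size should be 102400 bytes; a UTF-8 encoded result would be twice that.
size=$(wc -c <"$out" | tr -d ' ')

# Delete every 0xff byte; anything left over means the file is not pure 0xff.
leftover=$(LC_ALL=C tr -d '\377' <"$out" | wc -c | tr -d ' ')

echo "size=$size leftover=$leftover"
```

A size of 204800 would point to the c3 bf pairs from the question.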
"Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian, env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still, the OP's original command yields 0xff out of the box. Could it be because the tr implementation itself differs between Linux and Mac?
– Kamil Maciorowski, Aug 16 at 5:27
Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian, tr 'ł' 'L' translates ł to LL (ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters.
– Kamil Maciorowski, Aug 16 at 5:46
Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file.
– grawity, Aug 16 at 5:47
It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf '\0' | tr '\0' '\377' | hexdump -C plainly shows ff.
– ilkkachu, Aug 16 at 9:03
And, actually, changing LANG might not be enough. The relevant locale setting is LC_CTYPE, and the value it gets comes first from LC_ALL, then LC_CTYPE, then LANG, with the first one set taking effect (that's the same for all other locale settings). So, if LC_CTYPE is set, changing LANG doesn't do anything in this case. To reliably override it, you'd need to set LC_ALL. Also, it's enough to set it just for tr, i.e. ... | LC_ALL=C tr '\0' '\377' | ...
– ilkkachu, Aug 16 at 9:08
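The lookup order described in the last comment can be sketched in shell as follows. The helper name effective_ctype is made up for this illustration and is not a standard command. The fallback value printed when nothing is set is also an assumption, since the real default is implementation-defined.

```shell
#!/bin/sh
# Sketch of the lookup order for the LC_CTYPE category:
# LC_ALL first, then LC_CTYPE, then LANG; the first non-empty value wins.
# effective_ctype is a hypothetical helper, not a standard command.
effective_ctype() {
    # $1 = LC_ALL, $2 = LC_CTYPE, $3 = LANG (an empty string means "not set")
    for v in "$1" "$2" "$3"; do
        if [ -n "$v" ]; then
            printf '%s\n' "$v"
            return
        fi
    done
    # Nothing set: the implementation default applies (assumed to be POSIX here).
    printf 'POSIX\n'
}

# What the current session resolves to:
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"

# Changing LANG alone does not help when LC_CTYPE is set:
a=$(effective_ctype "" "en_US.UTF-8" "C")
# LC_ALL overrides both:
b=$(effective_ctype "C" "en_US.UTF-8" "en_US.UTF-8")
echo "LANG=C only: $a / LC_ALL=C: $b"
```

This is why the LC_ALL=C form is the more reliable override for a single tr invocation.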
The issue is that GNU tr, which you have on Linux, doesn't really have a concept of multibyte characters; it works one byte at a time.
The tr man page and online documentation speak of characters, but that's a bit of a simplification. The TODO file in the source code package mentions this item (taken from coreutils 8.30):
Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
multibyte aware. The problem is that I want to avoid duplicating
significant blocks of logic, yet I also want to incur only minimal
(preferably 'no') cost when operating in single-byte mode.
On a Linux system, even with a UTF-8 locale (en_US.UTF-8), GNU tr replaces an ä as two "characters" (the UTF-8 representation of ä has two bytes):
linux$ echo 'ä' | tr 'ä' 'x'
xx
In the same vein, mixing an ä and an ö produces funny results, since their UTF-8 representations share a common byte:
linux$ echo 'ö' | tr ä x
x�
Or the other way around (the x doesn't apply here):
linux$ echo ab | tr ab äx
ä
And in your case, GNU tr takes the \377 as a raw byte value.
The tr on Mac is different: it knows the concept of multibyte characters and acts accordingly:
mac$ echo 'ä' | tr ä x
x
mac$ echo ab | tr ab äx
äx
The UTF-8 representation of the character with numerical value 0377 (U+00ff) is the two bytes c3 bf, so that's what you get.
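To see where c3 bf comes from, here is a small sketch of the two-byte UTF-8 encoding rule done with shell arithmetic. It only covers code points in the range U+0080 to U+07FF, and the variable names are made up for the example.

```shell
#!/bin/sh
# Sketch: two-byte UTF-8 encoding of a code point in U+0080..U+07FF.
cp=$((0xff))                    # 0xff is octal 377, decimal 255

# Lead byte: bits 110xxxxx carry the upper five bits of the code point.
b1=$(( 0xC0 | (cp >> 6) ))
# Continuation byte: bits 10xxxxxx carry the lower six bits.
b2=$(( 0x80 | (cp & 0x3F) ))

encoded=$(printf '%02x %02x' "$b1" "$b2")
echo "$encoded"                 # prints: c3 bf
```

A multibyte-aware tr in a UTF-8 locale writes the character U+00FF in this encoded form, which is why each zero byte became two bytes in the output.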
The easy way to have tr work byte by byte is to have it use the C locale instead of a UTF-8 locale. This gives the funny behavior again:
$ echo 'ä' | LC_ALL=C tr 'ä' 'x'
xx
And in your case, you can use:
... | LC_ALL=C tr "\000" "\377"
Or you could use something like Perl to generate those \xff bytes:
perl -e 'printf "\377" x 1000 for 1..100'
Accepted answer: answered Aug 16 at 4:00 by JakeGould, edited Aug 16 at 16:20
Second answer: answered Aug 16 at 19:41 by ilkkachu, edited Aug 16 at 23:54 by JakeGould