Filling file with 0xFF gives C3BF in OSX

This command will fill the file with 0xff in Linux.



dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin


When I run it in OSX, the results are different.



$ dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin
100+0 records in
200+0 records out
102400 bytes transferred in 0.000781 secs (131104008 bytes/sec)
$ hexdump -C paddedFile.bin
00000000 c3 bf c3 bf c3 bf c3 bf c3 bf c3 bf c3 bf c3 bf |................|
*
00032000


What's going on here?







asked Aug 16 at 3:35 by Synesso


          2 Answers

          9 votes, accepted

          Straight to the point.



          It all hinges on the LANG or LC_ALL value set in your terminal session when you run tr. Linux has them set to C while macOS has it set to something like en_US.UTF-8. Of course, that en_US could be some other language variant such as en_GB (UK English), but the point is that the [something].UTF-8 setting, rather than plain ASCII via C, is what causes this.



          More details.



          It seems that tr on macOS converts 0xff to its UTF-8 encoding, c3 bf, instead of writing the raw byte 0xff. This is explained in this Apple community support thread:




          Linux doesn't handle Unicode in the Terminal like the Mac does. If you set the "LANG" environment variable to "C" (as it probably is on Linux), it will work. Otherwise, all those high-order bits are going to get interpreted as Unicode characters.




          And using that LANG tip works! Just do the following; tested personally by me just now on macOS 10.13.6 (High Sierra).



          First, make note of what the existing LANG value is like this:



          echo $LANG


          The output I see is:



          en_US.UTF-8


          Now set the LANG value to C like this:



          LANG=C


          And run that command again:



          dd if=/dev/zero ibs=1k count=100 | tr "\000" "\377" >paddedFile.bin


          Now the hexdump values should look like this:



          hexdump -C paddedFile.bin
          00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
          *
          00019000


          To reset the LANG value just close that terminal session or just run this command:



          LANG=en_US.UTF-8


          Or, as pointed out in the comments, you can set LANG for just the tr command by putting the assignment directly in front of it, like this:



          dd if=/dev/zero ibs=1k count=100 | LANG=C tr "\000" "\377" >paddedFile.bin


          You can also use LC_ALL instead of LANG. LC_ALL takes precedence over both LC_CTYPE and LANG, so it is the more reliable override:



          dd if=/dev/zero ibs=1k count=100 | LC_ALL=C tr "\000" "\377" >paddedFile.bin
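
A minimal sketch of why the prefixed form needs no reset step afterwards: an assignment written in front of a command applies only to that one child process. The variable names `inside` and `after` are made up for this illustration, and the `unset LC_ALL` line is only there so the demonstration starts from a known state.

```shell
# Start from a known state (assumption: you are fine with LC_ALL being
# unset in the shell where you try this).
unset LC_ALL

# The assignment prefix is visible only to this one child process.
inside=$(LC_ALL=C sh -c 'printf "%s" "$LC_ALL"')

# The current shell is untouched afterwards.
after=${LC_ALL:-unset}

echo "inside=$inside after=$after"
```

By contrast, a bare `LANG=C` on its own line changes the variable for the rest of the session, which is why that variant needs the reset step.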





          • 4




            "Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still the OP's original command yields 0xff out of the box. Could it be because the tr implementation itself differs between Linux and Mac?
            – Kamil Maciorowski
            Aug 16 at 5:27






          • 1




            Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian tr 'Ł' 'L' translates Ł to LL (Ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters.
            – Kamil Maciorowski
            Aug 16 at 5:46






          • 3




            Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file.
            – grawity
            Aug 16 at 5:47










          • It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf ' ' | tr ' ' '\377' | hexdump -C plainly shows ff.
            – ilkkachu
            Aug 16 at 9:03










          • And, actually, changing LANG might not be enough. The relevant locale setting is LC_CTYPE, and the value it gets comes first from LC_ALL, then LC_CTYPE, then LANG, with the first one set taking effect (that's the same for all other locale settings). So, if LC_CTYPE is set, changing LANG doesn't do anything in this case. To reliably override it, you'd need to set LC_ALL. Also, it's enough to set it just for tr, i.e. ... | LC_ALL=C tr ' ' '\377' | ...
            – ilkkachu
            Aug 16 at 9:08
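
The precedence described in the last comment (LC_ALL first, then LC_CTYPE, then LANG) can be sketched as a small shell function. This is only an illustration of the lookup order, not how any particular tr resolves its locale internally; the function name `effective_ctype` is hypothetical.

```shell
# Print the locale name that governs character handling (LC_CTYPE),
# following the precedence LC_ALL > LC_CTYPE > LANG.
# An empty value is treated the same as an unset one.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"   # default when none of the three is set
    fi
}

effective_ctype
```

If this prints a UTF-8 locale because LC_CTYPE or LC_ALL is set, then changing LANG alone will not change what tr sees. The standard `locale` command reports the same information.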



















          4 votes

          The issue is that GNU tr, which you have on Linux, doesn't really have a concept of multibyte characters, but instead works one byte at a time.



          The tr man page and online documentation speak of characters, but that's a bit of a simplification. The TODO file in the source code package mentions this item (picked from coreutils 8.30):




          Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
          multibyte aware. The problem is that I want to avoid duplicating
          significant blocks of logic, yet I also want to incur only minimal
          (preferably 'no') cost when operating in single-byte mode.




          On a Linux system, even with a UTF-8 locale (en_US.UTF-8), GNU tr treats an ä as two "characters" (the UTF-8 representation of ä has two bytes):



          linux$ echo 'ä' | tr 'ä' 'x'
          xx


          In the same vein, mixing an ä and an ö produces funny results, since their UTF-8 representations share a common byte:



          linux$ echo 'ö' | tr ä x
          x�


          Or the other way around (the x is never used here, because a and b are already mapped to the two bytes of ä):



          linux$ echo ab | tr ab äx
          ä


          And in your case, GNU tr takes the \377 as a raw byte value.



          The tr on Mac is different: it knows the concept of multibyte characters and acts accordingly:



          mac$ echo 'ä' | tr ä x
          x

          mac$ echo ab | tr ab äx
          äx


          The UTF-8 representation of the character with numerical value 0377 (U+00ff) is the two bytes c3 bf, so that's what you get.
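
To see where c3 bf comes from, here is a rough sketch of the two-byte UTF-8 encoding done with plain shell arithmetic. It only covers code points from U+0080 to U+07FF, and the function name `utf8_two_byte` is made up for this example.

```shell
# Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes.
# Layout: 110xxxxx 10xxxxxx, with the code point's bits filling the x's.
utf8_two_byte() {
    cp=$(( $1 ))
    lead=$(( 0xC0 | (cp >> 6) ))     # 110 prefix plus the upper bits
    cont=$(( 0x80 | (cp & 0x3F) ))   # 10 prefix plus the low six bits
    printf '%02x %02x\n' "$lead" "$cont"
}

utf8_two_byte 0377   # octal 377 = 0xFF = U+00FF, prints: c3 bf
utf8_two_byte 0xE4   # ä (U+00E4), prints: c3 a4
utf8_two_byte 0xF6   # ö (U+00F6), prints: c3 b6
```

The last two calls also show the common lead byte c3 that ä and ö share, which is what produces the odd results from GNU tr in the earlier examples.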



          The easy way to have tr work byte-by-byte is to have it use the C locale, instead of a UTF-8 locale. This gives the funny behavior again:



          $ echo 'ä' | LC_ALL=C tr 'ä' 'x'
          xx


          And in your case, you can use:



          ... | LC_ALL=C tr "\000" "\377"


          Or you could use something like Perl to generate those \xff bytes (100 blocks of 1024 bytes, to match the dd invocation):



          perl -e 'print "\377" x 1024 for 1..100'
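
Whichever approach you use, it may be worth checking the result. The following is a sketch under a few assumptions: the output name `paddedFile.bin` is taken from the question, `bs=1k` is used instead of `ibs=1k` for simplicity, and the `size` and `leftover` variable names are only for illustration.

```shell
out=paddedFile.bin

# Write 100 KiB of 0xFF bytes; LC_ALL=C forces tr to work on raw bytes.
dd if=/dev/zero bs=1k count=100 2>/dev/null | LC_ALL=C tr '\000' '\377' > "$out"

# The file should be exactly 102400 bytes long ...
size=$(wc -c < "$out" | tr -d ' ')

# ... and deleting every 0xFF byte should leave nothing behind.
leftover=$(LC_ALL=C tr -d '\377' < "$out" | wc -c | tr -d ' ')

echo "size=$size leftover=$leftover"
```

A size of 204800 instead of 102400 would mean that each byte was written as the two-byte sequence c3 bf, as in the question.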





          Accepted answer: answered Aug 16 at 4:00 by JakeGould, edited Aug 16 at 16:20

            • 4




              "Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still the OP's original command yields 0xff out of the box. Could it be because trimplementation itself differs between Linux and Mac?
              – Kamil Maciorowski
              Aug 16 at 5:27






            • 1




              Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian tr 'Ł' 'L' translates Ł to LL (Ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters.
              – Kamil Maciorowski
              Aug 16 at 5:46






            • 3




              Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file.
              – grawity
              Aug 16 at 5:47










            • It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf ' ' | tr ' ' '377' | hexdump -C plainly shows ff.
              – ilkkachu
              Aug 16 at 9:03










            • And, actually, changing LANG might not be enough. The relevant locale setting is LC_CTYPE, and the value it gets comes first from LC_ALL, then LC_CTYPE, then LANG, with the first one set taking effect (that's the same for all other locale settings). So, if LC_CTYPE is set, changing LANG doesn't do anything in this case. To reliably override it, you'd need to set LC_ALL. Also, it's enough to set it just for tr, i.e. ... | LC_ALL=C tr ' ' '377' | ...
              – ilkkachu
              Aug 16 at 9:08













            • 4




              "Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still the OP's original command yields 0xff out of the box. Could it be because trimplementation itself differs between Linux and Mac?
              – Kamil Maciorowski
              Aug 16 at 5:27






            • 1




              Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian tr 'Ł' 'L' translates Ł to LL (Ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters.
              – Kamil Maciorowski
              Aug 16 at 5:46






            • 3




              Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file.
              – grawity
              Aug 16 at 5:47










            • It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf ' ' | tr ' ' '377' | hexdump -C plainly shows ff.
              – ilkkachu
              Aug 16 at 9:03










            • And, actually, changing LANG might not be enough. The relevant locale setting is LC_CTYPE, and the value it gets comes first from LC_ALL, then LC_CTYPE, then LANG, with the first one set taking effect (that's the same for all other locale settings). So, if LC_CTYPE is set, changing LANG doesn't do anything in this case. To reliably override it, you'd need to set LC_ALL. Also, it's enough to set it just for tr, i.e. ... | LC_ALL=C tr ' ' '377' | ...
              – ilkkachu
              Aug 16 at 9:08








            4




            4




            "Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still the OP's original command yields 0xff out of the box. Could it be because trimplementation itself differs between Linux and Mac?
            – Kamil Maciorowski
            Aug 16 at 5:27




            "Linux has that set to C while macOS has it set to something like en_US.UTF-8" -- I'm not sure this is the whole story. In my Kubuntu or Debian env | grep -E 'LANG|LC' returns LANG=pl_PL.UTF-8 only, so it's Unicode. Still the OP's original command yields 0xff out of the box. Could it be because trimplementation itself differs between Linux and Mac?
            – Kamil Maciorowski
            Aug 16 at 5:27




            1




            1




            Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian tr 'Ł' 'L' translates Ł to LL (Ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters.
            – Kamil Maciorowski
            Aug 16 at 5:46




            Regarding my doubt, I have found this answer which says "many implementations of tr, including the one in GNU coreutils, don't support multibyte encodings". Seems legit. In my Debian tr 'Ł' 'L' translates Ł to LL (Ł is a Polish letter, I use LANG=pl_PL.UTF-8), so it apparently treats its first argument as two characters.
            – Kamil Maciorowski
            Aug 16 at 5:46




            3




            3




            Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file.
            – grawity
            Aug 16 at 5:47




            Yes, it has to be done by tr. It would make negative sense for such conversion to happen when writing to file.
            – grawity
            Aug 16 at 5:47












            It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf ' ' | tr ' ' '377' | hexdump -C plainly shows ff.
            – ilkkachu
            Aug 16 at 9:03




            It's not really hard to test that it's not about the locale setting. With LANG=en_US.UTF-8 (on a Linux system that has that locale generated), printf ' ' | tr ' ' '377' | hexdump -C plainly shows ff.
            – ilkkachu
            Aug 16 at 9:03












            And, actually, changing LANG might not be enough. The relevant locale setting is LC_CTYPE, and the value it gets comes first from LC_ALL, then LC_CTYPE, then LANG, with the first one set taking effect (that's the same for all other locale settings). So, if LC_CTYPE is set, changing LANG doesn't do anything in this case. To reliably override it, you'd need to set LC_ALL. Also, it's enough to set it just for tr, i.e. ... | LC_ALL=C tr ' ' '377' | ...
            – ilkkachu
            Aug 16 at 9:08





            The issue is that GNU tr, which you have on Linux, doesn't really have a concept of multibyte characters, but instead works one byte at a time.



            The tr man page and online documentation speak of characters, but that's a bit of a simplification. The TODO file in the source code package mentions this item (picked from coreutils 8.30):




            Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
            multibyte aware. The problem is that I want to avoid duplicating
            significant blocks of logic, yet I also want to incur only minimal
            (preferably 'no') cost when operating in single-byte mode.




            On a Linux system, even with a UTF-8 locale (en_US.UTF-8), GNU tr treats an ä as two "characters" (the UTF-8 representation of ä has two bytes):



            linux$ echo 'ä' | tr 'ä' 'x'
            xx


            In the same vein, mixing an ä and an ö produces funny results, since their UTF-8 representations share a common byte:



            linux$ echo 'ö' | tr ä x
            x�


            Or the other way around (the x is never used here, because a and b are mapped to the two bytes of ä):



            linux$ echo ab | tr ab äx
            ä


            And in your case, GNU tr takes the \377 as a raw byte value.



            The tr on Mac is different: it knows the concept of multibyte characters and acts accordingly:



            mac$ echo 'ä' | tr ä x
            x

            mac$ echo ab | tr ab äx
            äx


            The UTF-8 representation of the character with numerical value 0377 (U+00ff) is the two bytes c3 bf, so that's what you get.
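The encoding claim above can be checked directly. This is only a sketch, and it assumes iconv and od are available (neither tool is mentioned in the original answer): it reads the single byte 0xFF as ISO-8859-1, where it stands for U+00FF, and converts it to UTF-8.

```shell
# Byte 0xFF (octal 377) interpreted as ISO-8859-1 is the character U+00FF.
# Converting that character to UTF-8 and dumping it in hex shows the two
# bytes that ended up in the file on the Mac.
printf '\377' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

The hex dump should show the bytes c3 and bf; the exact spacing of od's output differs between systems.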



            The easy way to have tr work byte-by-byte is to have it use the C locale, instead of a UTF-8 locale. This gives the funny behavior again:



            $ echo 'ä' | LC_ALL=C tr 'ä' 'x'
            xx


            And in your case, you can use:



            ... | LC_ALL=C tr "\000" "\377"
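Combined with the dd command from the question, the full pipeline might look like the sketch below. It is not taken from the original answer: it uses bs=1024 in place of the question's ibs=1k, discards dd's status messages, and adds two verification steps as an illustration.

```shell
# Generate 100 KiB of zero bytes and translate each one to 0xFF.
# LC_ALL=C is set for tr only, so tr treats \377 as a single raw byte.
dd if=/dev/zero bs=1024 count=100 2>/dev/null | LC_ALL=C tr '\000' '\377' > paddedFile.bin

# Verify the size: this should report 102400 bytes.
wc -c < paddedFile.bin

# Verify the content: delete every 0xFF byte and count what is left.
# A count of 0 means the file contains nothing but 0xFF.
LC_ALL=C tr -d '\377' < paddedFile.bin | wc -c
```

Setting LC_ALL on the tr command alone leaves the locale of the rest of the shell session untouched.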


            Or you could use something like Perl to generate those \xff bytes:



            perl -e 'printf "\377" x 1000 for 1..100'





                edited Aug 16 at 23:54









                JakeGould

                answered Aug 16 at 19:41









                ilkkachu
