Clustering of pseudo-random sequences based on transition matrices

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
0
down vote

favorite












THE BACKGROUND



A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,



def generate_domain(seed):

(year, month, day) = seed

domain = ''

for i in range(16):
year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
domain += chr(((year ^ month ^ day) % 25) + 97)

return domain+'.com'


and run it for August 7, 2018, i.e., seed = (2018, 8, 7), you'd get nqggjqogjlcywphn.com. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.



Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.



THE QUESTION



Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.










share|cite|improve this question

























    up vote
    0
    down vote

    favorite












    THE BACKGROUND



    A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,



    def generate_domain(seed):

    (year, month, day) = seed

    domain = ''

    for i in range(16):
    year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
    month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
    day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
    domain += chr(((year ^ month ^ day) % 25) + 97)

    return domain+'.com'


    and run it for August 7, 2018, i.e., seed = (2018, 8, 7), you'd get nqggjqogjlcywphn.com. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.



    Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.



    THE QUESTION



    Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.










    share|cite|improve this question























      up vote
      0
      down vote

      favorite









      up vote
      0
      down vote

      favorite











      THE BACKGROUND



      A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,



      def generate_domain(seed):

      (year, month, day) = seed

      domain = ''

      for i in range(16):
      year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
      month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
      day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
      domain += chr(((year ^ month ^ day) % 25) + 97)

      return domain+'.com'


      and run it for August 7, 2018, i.e., seed = (2018, 8, 7), you'd get nqggjqogjlcywphn.com. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.



      Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.



      THE QUESTION



      Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.










      share|cite|improve this question













      THE BACKGROUND



      A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,



      def generate_domain(seed):

      (year, month, day) = seed

      domain = ''

      for i in range(16):
      year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
      month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
      day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
      domain += chr(((year ^ month ^ day) % 25) + 97)

      return domain+'.com'


      and run it for August 7, 2018, i.e., seed = (2018, 8, 7), you'd get nqggjqogjlcywphn.com. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.



      Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.



      THE QUESTION



      Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.







      linear-algebra general-topology algebraic-topology random cryptography






      share|cite|improve this question













      share|cite|improve this question











      share|cite|improve this question




      share|cite|improve this question










      asked Sep 7 at 8:21









      Tfovid

      32




      32

























          active

          oldest

          votes











          Your Answer




          StackExchange.ifUsing("editor", function ()
          return StackExchange.using("mathjaxEditing", function ()
          StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
          StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
          );
          );
          , "mathjax-editing");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "69"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: false,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          noCode: true, onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













           

          draft saved


          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2908389%2fclustering-of-pseudo-random-sequences-based-on-transition-matrices%23new-answer', 'question_page');

          );

          Post as a guest



































          active

          oldest

          votes













          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes















           

          draft saved


          draft discarded















































           


          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2908389%2fclustering-of-pseudo-random-sequences-based-on-transition-matrices%23new-answer', 'question_page');

          );

          Post as a guest













































































          這個網誌中的熱門文章

          How to combine Bézier curves to a surface?

          Carbon dioxide

          Why am i infinitely getting the same tweet with the Twitter Search API?