Clustering of pseudo-random sequences based on transition matrices

up vote
0
down vote

favorite

THE BACKGROUND

A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,

def generate_domain(seed):

 (year, month, day) = seed

 domain = ''

 for i in range(16):
 year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
 month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
 day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
 domain += chr(((year ^ month ^ day) % 25) + 97)

 return domain+'.com'

and run it for August 7, 2018, i.e., seed = (2018, 8, 7), you'd get nqggjqogjlcywphn.com. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.

Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.

THE QUESTION

Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.

asked Sep 7 at 8:21

Tfovid

add a commentÂ |Â

up vote
0
down vote

favorite

THE BACKGROUND

def generate_domain(seed):

 (year, month, day) = seed

 domain = ''

 for i in range(16):
 year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
 month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
 day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
 domain += chr(((year ^ month ^ day) % 25) + 97)

 return domain+'.com'

Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.

THE QUESTION

asked Sep 7 at 8:21

Tfovid

add a commentÂ |Â

up vote
0
down vote

favorite

THE BACKGROUND

def generate_domain(seed):

 (year, month, day) = seed

 domain = ''

 for i in range(16):
 year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
 month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
 day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
 domain += chr(((year ^ month ^ day) % 25) + 97)

 return domain+'.com'

Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.

THE QUESTION

asked Sep 7 at 8:21

Tfovid

THE BACKGROUND

def generate_domain(seed):

 (year, month, day) = seed

 domain = ''

 for i in range(16):
 year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
 month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
 day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
 domain += chr(((year ^ month ^ day) % 25) + 97)

 return domain+'.com'

Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.

THE QUESTION

linear-algebra general-topology algebraic-topology random cryptography

asked Sep 7 at 8:21

Tfovid

asked Sep 7 at 8:21

Tfovid

asked Sep 7 at 8:21

Tfovid

asked Sep 7 at 8:21

Tfovid

asked Sep 7 at 8:21

Tfovid

add a commentÂ |Â

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
);
);
, "mathjax-editing");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "69"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2908389%2fclustering-of-pseudo-random-sequences-based-on-transition-matrices%23new-answer', 'question_page');

);

Post as a guest

Name

active

oldest

votes

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

Vtyjkyuk