Clustering of pseudo-random sequences based on transition matrices
Clash Royale CLAN TAG#URR8PPP
up vote
0
down vote
favorite
THE BACKGROUND
A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com
but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,
def generate_domain(seed):
(year, month, day) = seed
domain = ''
for i in range(16):
year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
domain += chr(((year ^ month ^ day) % 25) + 97)
return domain+'.com'
and run it for August 7, 2018, i.e., seed = (2018, 8, 7)
, you'd get nqggjqogjlcywphn.com
. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.
Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.
THE QUESTION
Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.
linear-algebra general-topology algebraic-topology random cryptography
add a comment |Â
up vote
0
down vote
favorite
THE BACKGROUND
A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com
but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,
def generate_domain(seed):
(year, month, day) = seed
domain = ''
for i in range(16):
year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
domain += chr(((year ^ month ^ day) % 25) + 97)
return domain+'.com'
and run it for August 7, 2018, i.e., seed = (2018, 8, 7)
, you'd get nqggjqogjlcywphn.com
. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.
Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.
THE QUESTION
Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.
linear-algebra general-topology algebraic-topology random cryptography
add a comment |Â
up vote
0
down vote
favorite
up vote
0
down vote
favorite
THE BACKGROUND
A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com
but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,
def generate_domain(seed):
(year, month, day) = seed
domain = ''
for i in range(16):
year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
domain += chr(((year ^ month ^ day) % 25) + 97)
return domain+'.com'
and run it for August 7, 2018, i.e., seed = (2018, 8, 7)
, you'd get nqggjqogjlcywphn.com
. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.
Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.
THE QUESTION
Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.
linear-algebra general-topology algebraic-topology random cryptography
THE BACKGROUND
A lot of malware is propagated through the internet via so-called DGAs, i.e., algorithmically-generated domain names that serve as rallying points between the infected clients and the hacker's "mothership" machine. These domain name look like random strings such as qxcsdfluyefedjqpla.com
but are actually pseudo-random in that they require a seed value. For example, if you examine the following code snippet,
def generate_domain(seed):
(year, month, day) = seed
domain = ''
for i in range(16):
year = ((year ^ 8 * year) >> 11) ^ ((year & 0xFFFFFFF0) << 17)
month = ((month ^ 4 * month) >> 25) ^ 16 * (month & 0xFFFFFFF8)
day = ((day ^ (day << 13)) >> 19) ^ ((day & 0xFFFFFFFE) << 12)
domain += chr(((year ^ month ^ day) % 25) + 97)
return domain+'.com'
and run it for August 7, 2018, i.e., seed = (2018, 8, 7)
, you'd get nqggjqogjlcywphn.com
. As you can see, the domain is the result of a convoluted, but deterministic string manipulation, so in principle, all randomness should be traceable to the seed (in this case the date). In other words, there seems to be a well-defined, non-linear map from the empty string to the resulting string, interspersed with rotations/translations of pseudo-random magnitude.
Each type of malware operates in a similar fashion. It's only the details of the string manipulation will be different.
THE QUESTION
Given that we have access neither to the generating code nor to the seed, is there a way to nonetheless cluster these domain names based on some similarity measure? In particular, one of the mathematical objects I'm thinking of using to derive similarity metrics for the clustering could be the transition probability matrix from on character to the next. I.e., the matrices for two domain names that were generated by the same algorithm but with different seeds would look more similar than two domains that were generated by different algorithms. It does not have to be the transition matrices, so please suggest other mappings that could help to achieve the same goal.
linear-algebra general-topology algebraic-topology random cryptography
linear-algebra general-topology algebraic-topology random cryptography
asked Sep 7 at 8:21
Tfovid
32
32
add a comment |Â
add a comment |Â
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2908389%2fclustering-of-pseudo-random-sequences-based-on-transition-matrices%23new-answer', 'question_page');
);
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password