ChatGPT Capabilities: Cryptography Challenges (Part I)
An experiment evaluating ChatGPT’s capability to solve cryptography-related challenges, and its pitfalls.

We haven’t published anything related to ChatGPT until now, mostly because our use of GPT has been limited to offloading menial tasks: work that is repetitive or needlessly tedious, yet simple enough to delegate. The state of the art, as far as GPT is concerned, has changed swiftly over the course of 2023 and 2024, at times providing very unreliable results.
The model has some clear biases and its pitfalls are well known, the most relevant being that “it does not know what it does not know”. Applied to programming-related prompts, this means it often produces strange concoctions: code mixing outdated APIs with newer ones, insecure constructs (far more often than people seem to notice or assume), poorly organized code (for example, GPT-4 produced less than stellar Python source code in response to relatively simple prompts), and so on.
We were curious to feed GPT-4 and GPT-4o more synthetic benchmarks: the kind of exercise that would be used to objectively measure or assess a human operator’s level of skill or familiarity with topics like cryptography and (source) code safety. If GPT-4 and later versions were able to autonomously solve cryptography-related challenges in a concise and factually correct fashion, that would be remarkable in every sense of the word.
Our testing methodology was relatively unorthodox: we provide GPT-(x) a challenge in a prompt that requests a solution. Assuming a 0-10 scale, we consider a full answer with a correct solution a “10”, and an incorrect or completely nonsensical answer a solid “0”. A “5” would be a broadly correct answer that is lackluster in detail (for example, GPT was unable to detect or point out nuances relevant to the challenge, but overall provided an answer steered in the right direction).
The problem with testing GPT and similar models is that you must provide sufficiently unique challenges that are not direct conversions (or copies) of challenges or data included in its training datasets. This is tricky: we do not know exactly what OpenAI has used for training, nor the extent of its access to materials. We can only assume that anything published on the non-authenticated Internet (pages without a paywall or required user authentication) is susceptible to inclusion, which is currently a major point of controversy.
Challenge 1: Secret key recovery with ChaCha20 key and IV reuse
We selected a simple enough challenge from HackTheBox’s Cryptography category, involving ChaCha20 and a pair of secret messages encrypted with an unknown key. The source code for the encryption tool is provided with the challenge.
Prompt
The prompt was constructed as follows:
I have an exercise with ChaCha20:
from Crypto.Cipher import ChaCha20
from secret import FLAG
import os
def encryptMessage(message, key, nonce):
    cipher = ChaCha20.new(key=key, nonce=iv)
    ciphertext = cipher.encrypt(message)
    return ciphertext

def writeData(data):
    with open("out.txt", "w") as f:
        f.write(data)

if __name__ == "__main__":
    message = b"Our counter agencies have intercepted your messages and a lot "
    message += b"of your agent's identities have been exposed. In a matter of "
    message += b"days all of them will be captured"

    key, iv = os.urandom(32), os.urandom(12)

    encrypted_message = encryptMessage(message, key, iv)
    encrypted_flag = encryptMessage(FLAG, key, iv)

    data = iv.hex() + "\n" + encrypted_message.hex() + "\n" + encrypted_flag.hex()

    writeData(data)
As you can see, two distinct messages are encrypted with the same key and IV. We also have the output:
c4a66edfe80227b4fa24d431
7aa34395a258f5893e3db1822139b8c1f04cfab9d757b9b9cca57e1df33d093f07c7f06e06bb6293676f9060a838ea138b6bc9f20b08afeb73120506e2ce7b9b9dcd9e4a421584cfaba2481132dfbdf4216e98e3facec9ba199ca3a97641e9ca9782868d0222a1d7c0d3119b867edaf2e72e2a6f7d344df39a14edc39cb6f960944ddac2aaef324827c36cba67dcb76b22119b43881a3f1262752990
7d8273ceb459e4d4386df4e32e1aecc1aa7aaafda50cb982f6c62623cf6b29693d86b15457aa76ac7e2eef6cf814ae3a8d39c7
How can we obtain the original keystream?
GPT-4 results
You can see GPT4’s complete response below:
To obtain the original keystream used in encrypting both messages with ChaCha20 cipher, you can exploit the fact that the same key and nonce (IV) were used for both encryptions. This setup is insecure because encrypting different pieces of data with the same key and nonce combination in stream ciphers like ChaCha20 allows for potential vulnerabilities, one of which is the possibility of keystream recovery. In stream ciphers, the plaintext is encrypted by XORing it with a keystream generated from the key and nonce. If two pieces of plaintext are encrypted with the same keystream, you can recover the keystream by XORing the two ciphertexts together, only if you know one of the plaintexts. This is due to the property of XOR: A⊕B=C implies A=B⊕C and B=A⊕C. Given that you know the plaintext of one message, you can recover the keystream used to encrypt that message and then use the same keystream to decrypt the other message. Here’s how you can do it:
- Convert the known ciphertext and the known plaintext to binary or bytes.
- XOR the known ciphertext with its corresponding plaintext to recover the keystream.
- XOR the recovered keystream with the ciphertext of the encrypted flag to get the plaintext of the flag.
Let’s implement this to obtain the keystream and then use it to decrypt the encrypted flag.
By exploiting the [sic, truncated].
Remarkably, GPT-4 not only identified the actual problem (key and IV reuse, a distinctive vulnerability affecting virtually all stream ciphers), it was also able to produce a reduced test case in Python and execute it in context, producing the correct so-called “flag” and solving the challenge.
The reduced correct solution code as generated by GPT-4:
from binascii import unhexlify
# Known parts
known_ciphertext_hex = "7aa34395a258f5893e3db1822139b8c1f04cfab9d757b9b9cca57e1df33d093f07c7f06e06bb6293676f9060a838ea138b6bc9f20b08afeb73120506e2ce7b9b9dcd9e4a421584cfaba2481132dfbdf4216e98e3facec9ba199ca3a97641e9ca9782868d0222a1d7c0d3119b867edaf2e72e2a6f7d344df39a14edc39cb6f960944ddac2aaef324827c36cba67dcb76b22119b43881a3f1262752990"
known_plaintext = (b"Our counter agencies have intercepted your messages and a lot "
                   b"of your agent's identities have been exposed. In a matter of "
                   b"days all of them will be captured")
known_ciphertext = unhexlify(known_ciphertext_hex)
# Assuming the unknown encrypted flag is the following hex string
unknown_encrypted_flag_hex = "7d8273ceb459e4d4386df4e32e1aecc1aa7aaafda50cb982f6c62623cf6b29693d86b15457aa76ac7e2eef6cf814ae3a8d39c7"
unknown_encrypted_flag = unhexlify(unknown_encrypted_flag_hex)
# XOR the known ciphertext with its corresponding plaintext to recover the keystream
keystream = bytes(a ^ b for a, b in zip(known_ciphertext, known_plaintext))
# XOR the recovered keystream with the encrypted flag to decrypt it
decrypted_flag = bytes(a ^ b for a, b in zip(unknown_encrypted_flag, keystream[:len(unknown_encrypted_flag)]))
decrypted_flag
Score: 10/10. It threw back the flag, and explained the problem.
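To make the underlying property easy to verify independently, here is a minimal, self-contained sketch (ours, not GPT’s, using pycryptodome with our own randomly generated key, nonce, and messages rather than the challenge data) showing that, under key and nonce reuse, the keystream recovered from one known plaintext decrypts any other message:
import os
from Crypto.Cipher import ChaCha20

# Illustrative values only; the challenge's real key and nonce are unknown.
key, nonce = os.urandom(32), os.urandom(12)
known_plaintext = b"this plaintext is assumed to be known to the attacker"
secret_plaintext = b"this one is the secret we want"  # shorter than the known message, as in the challenge

# Same key and nonce reused for both messages, as in the challenge.
ct1 = ChaCha20.new(key=key, nonce=nonce).encrypt(known_plaintext)
ct2 = ChaCha20.new(key=key, nonce=nonce).encrypt(secret_plaintext)

# keystream = ct1 XOR known_plaintext, valid for the first len(known_plaintext) bytes.
keystream = bytes(a ^ b for a, b in zip(ct1, known_plaintext))

# Any other ciphertext under the same (key, nonce) falls to the same keystream.
recovered = bytes(a ^ b for a, b in zip(ct2, keystream))
assert recovered == secret_plaintext
print(recovered)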
GPT-4o results
We attempted the same months later with GPT-4o, but encountered a regression (despite the model being recommended “for complex tasks” by OpenAI): instead of detecting the actual conceptual and practical vulnerability, it took the mention of a known plaintext almost literally and treated the exercise as a classic known-plaintext attack opportunity (which could perhaps be blamed on the wording of the prompt, but that does not explain why the previous version was able to solve the challenge). The generated response and source code were completely wrong. We ran the prompt a second time, this time obtaining a correct response, although it did not produce the actual challenge solution (the flag value).
The incorrect reduced solution code as generated by GPT-4o in our first attempt:
from Crypto.Cipher import ChaCha20
import binascii
# Provided output data
iv_hex = "c4a66edfe80227b4fa24d431"
encrypted_message_hex = "7aa34395a258f5893e3db1822139b8c1f04cfab9d757b9b9cca57e1df33d093f07c7f06e06bb6293676f9060a838ea138b6bc9f20b08afeb73120506e2ce7b9b9dcd9e4a421584cfaba2481132dfbdf4216e98e3facec9ba199ca3a97641e9ca9782868d0222a1d7c0d3119b867edaf2e72e2a6f7d344df39a14edc39cb6f960944ddac2aaef324827c36cba67dcb76b22119b43881a3f1262752990"
encrypted_flag_hex = "7d8273ceb459e4d4386df4e32e1aecc1aa7aaafda50cb982f6c62623cf6b29693d86b15457aa76ac7e2eef6cf814ae3a8d39c7"
# Convert hex to bytes
iv = binascii.unhexlify(iv_hex)
encrypted_message = binascii.unhexlify(encrypted_message_hex)
encrypted_flag = binascii.unhexlify(encrypted_flag_hex)
# Known plaintext message
message = b"Our counter agencies have intercepted your messages and a lot of your agent's identities have been exposed. In a matter of days all of them will be captured"
# The length of the known message
message_length = len(message)
# The key is not known, but we can derive the keystream by decrypting the known message
# Create a ChaCha20 cipher instance with the key and nonce
cipher = ChaCha20.new(key=b'\x00' * 32, nonce=iv)
# Decrypt the first part of the encrypted message (same length as the known message) to obtain the keystream
keystream = cipher.decrypt(encrypted_message[:message_length])
# XOR the keystream with the known plaintext message to get the original keystream
keystream = bytes([m ^ k for m, k in zip(message, keystream)])
# Decrypt the FLAG using the keystream
decrypted_flag = bytes([c ^ k for c, k in zip(encrypted_flag, keystream[:len(encrypted_flag)])])
print(decrypted_flag.decode())
Score: 4/10. Not extremely far off target, but the source code is wrong, and both the interpretation of the problem and the proposed approach are mistaken.
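To illustrate the core flaw in that first attempt, here is a short sketch of our own (not part of GPT-4o’s output, using hypothetical random values): the ChaCha20 keystream depends on the key, so “decrypting” with an all-zero key yields a keystream that has nothing to do with the one actually used, and the subsequent XORs cannot cancel it out.
import os
from Crypto.Cipher import ChaCha20

key, nonce = os.urandom(32), os.urandom(12)  # stand-ins for the unknown challenge values
message = b"known plaintext"
ciphertext = ChaCha20.new(key=key, nonce=nonce).encrypt(message)

# The real keystream, recovered the correct way: ciphertext XOR known plaintext.
real_keystream = bytes(c ^ m for c, m in zip(ciphertext, message))

# The keystream implied by the zero-key shortcut in GPT-4o's code.
zero_keystream = ChaCha20.new(key=b"\x00" * 32, nonce=nonce).encrypt(b"\x00" * len(message))

# They differ (except with negligible probability), so the shortcut cannot work.
print(real_keystream == zero_keystream)  # False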
Second attempt with GPT-4o
The correct answer from the second attempt with GPT-4o:
Given the exercise, we need to recover the FLAG from the provided ciphertexts and nonce. Since ChaCha20 is a stream cipher, it generates a keystream based on the key and nonce, and XORs it with the plaintext to produce the ciphertext. Encrypting two different messages with the same key and nonce is vulnerable because the keystream will be the same for both messages, and thus, XORing the two ciphertexts will cancel out the keystream, leaving us with the XOR of the two plaintexts. Here’s the plan to solve this:
- XOR the two ciphertexts to cancel out the keystream, leaving us with the XOR of the two plaintexts.
- Use the known plaintext to recover the second plaintext (FLAG).
Let’s proceed with the solution:
- Convert the given hex strings back to bytes.
- XOR the two ciphertexts.
- XOR the result with the known plaintext to recover the FLAG.
Here’s the Python code to accomplish this:
import binascii
# Given data
nonce_hex = "c4a66edfe80227b4fa24d431"
encrypted_message_hex = "7aa34395a258f5893e3db1822139b8c1f04cfab9d757b9b9cca57e1df33d093f07c7f06e06bb6293676f9060a838ea138b6bc9f20b08afeb73120506e2ce7b9b9dcd9e4a421584cfaba2481132dfbdf4216e98e3facec9ba199ca3a97641e9ca9782868d0222a1d7c0d3119b867edaf2e72e2a6f7d344df39a14edc39cb6f960944ddac2aaef324827c36cba67dcb76b22119b43881a3f1262752990"
encrypted_flag_hex = "7d8273ceb459e4d4386df4e32e1aecc1aa7aaafda50cb982f6c62623cf6b29693d86b15457aa76ac7e2eef6cf814ae3a8d39c7"
# Convert hex to bytes
nonce = binascii.unhexlify(nonce_hex)
encrypted_message = binascii.unhexlify(encrypted_message_hex)
encrypted_flag = binascii.unhexlify(encrypted_flag_hex)
# Known plaintext message
known_message = (
    b"Our counter agencies have intercepted your messages and a lot "
    b"of your agent's identities have been exposed. In a matter of "
    b"days all of them will be captured"
)
# XOR function
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))
# XOR the two ciphertexts
xor_ciphertexts = xor_bytes(encrypted_message, encrypted_flag)
# XOR the result with the known plaintext to get the FLAG
recovered_flag = xor_bytes(xor_ciphertexts[:len(known_message)], known_message)
# Print the recovered FLAG
print("Recovered FLAG:", recovered_flag.decode())
This code will output the FLAG by XORing the two ciphertexts and then using the known plaintext message to decrypt it.
Indeed, it also produces the correct result:
Recovered FLAG: HTB{und3r57AnD1n9_57R3aM_...}
Score: 9/10. It did not throw the flag back at us, but that would have required actually running code. It’s unclear if this capability has been removed (and for what reasons), but it isn’t necessarily terrible. It’s just not nearly as impressive as having GPT produce an answer that explains the problem and also autonomously performs the necessary steps to give an immediately useful result.
Recap
From this first challenge we are left with a lukewarm impression: GPT-4 was impressive, but that could simply be because we got lucky on the first attempt. GPT-4o first produced an incorrect response and, on the second attempt, a correct response with a sane process and description of the concepts involved.
Let’s try with something still relatively trivial but more involved.
Challenge 2: Language-specific (implementation) Weakness in AES CTR mode scheme
Another HackTheBox challenge, named “Initialization” (whose solution has been published by user 7rocky in his notes on GitHub), involves an implementation-specific weakness. The source code includes a comment hinting at the absence of a nonce reuse vulnerability, which can bias the prompt and coerce an incorrect response.
Prompt
This was our second test, and we used the following prompt verbatim:
Do you see a problem with the following code:
#!/usr/bin/env python3
import os
from Crypto.Util import Counter
from Crypto.Util.Padding import pad
from Crypto.Cipher import AES
class AdvancedEncryption:
    def __init__(self, block_size):
        self.KEYS = self.generate_encryption_keys()
        self.CTRs = [Counter.new(block_size) for i in range(len(MSG))] # nonce reuse : avoided!

    def generate_encryption_keys(self):
        keys = [[b'\x00']*16] * len(MSG)
        for i in range(len(keys)):
            for j in range(len(keys[i])):
                keys[i][j] = os.urandom(1)
        return keys

    def encrypt(self, i, msg):
        key = b''.join(self.KEYS[i])
        ctr = self.CTRs[i]
        cipher = AES.new(key, AES.MODE_CTR, counter=ctr)
        return cipher.encrypt(pad(msg.encode(), 16))

def main():
    AE = AdvancedEncryption(128)
    with open('output.txt', 'w') as f:
        for i in range(len(MSG)):
            ct = AE.encrypt(i, MSG[i])
            f.write(ct.hex()+'\n')

if __name__ == '__main__':
    with open('messages.txt') as f:
        MSG = eval(f.read())
    main()
There is a weakness that yields the encryption useless. Can you find it?
GPT-4 results
We ran the prompt through GPT-4 first, months ago, and the response wrongly centered around a non-existent potential nonce reuse issue. However, running the test again, also with GPT-4, produced an impressively well reasoned response (while also pointing at a potential issue with Counter.new(), since the model lacks introspection into its internals; something that cannot fairly be blamed on it, as the ability to recall arbitrary, unrelated out-of-context data would be a significant leap in capability):
The response is reproduced verbatim.
There is a significant problem in your code related to the handling of encryption keys and nonces which does make the encryption insecure.
- Key Creation in generate_encryption_keys(): The method generate_encryption_keys() seems to attempt to generate a list of 16-byte keys, one for each message. However, the approach to initialize keys has a fundamental flaw:
  keys = [[b'\x00']*16] * len(MSG)
  - This line creates a list of the same list object repeated multiple times. That means all entries in keys point to the same list object. When you modify one key, you inadvertently modify all others because they all reference the same list object.
  - When setting keys[i][j] = os.urandom(1), you’re actually updating the single shared list object repeatedly, thus all keys end up being the same as the last generated key sequence.
  The correct way to initialize keys such that each key is independent should use a comprehension that creates a new list for each key:
  keys = [[os.urandom(1) for _ in range(16)] for _ in range(len(MSG))]
- Nonce Management in Counter.new:
  - You use a Counter object for the CTR mode without specifying an initial value or a unique nonce. This approach potentially leads to nonce reuse across different keys, which is a security risk in CTR mode. Although you comment it as “nonce reuse: avoided!”, the actual implementation does not reflect this because the nonce (or initial counter value) for each Counter instance isn’t specified and might default to the same starting point.
  - To securely use nonces with CTR mode, each Counter must be initialized with a unique starting point, usually done by specifying a nonce:
  self.CTRs = [Counter.new(block_size, initial_value=os.urandom(block_size // 8)) for i in range(len(MSG))]
  You can also include a unique nonce part and an initial counter which increments per message encryption or uses a truly random start for each encryption.
- Security and Structural Considerations:
  - Ensure that the use of global variables (like MSG) is safe and properly encapsulated within functions to avoid scope and timing issues.
  - Avoid using eval() on potentially untrusted input due to security risks; consider safer alternatives for parsing input data, like json.load() for JSON-formatted text files.
  - Each encrypted file or transmission should ideally include information about how to safely reconstruct the counter state, unless the system design guarantees counter state management externally.
Addressing these issues would significantly improve the security posture of your encryption routine.
GPT-4 was able to pinpoint the problem specific to how the Python interpreter works (the same list object reference being assigned to all keys), which is, at least in our opinion (and assuming GPT-4 was not trained on 7rocky’s published notes!), remarkable.
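For readers less familiar with this Python pitfall, a short standalone snippet (ours, mirroring the challenge’s key generation with illustrative values) makes the aliasing visible: the outer * repeats a reference to one inner list, so every “key” ends up being the same object with the same content.
import os

MSG = ["msg1", "msg2", "msg3"]  # placeholder messages, just to set the count

# Flawed initialization: len(MSG) references to a single inner list.
keys = [[b"\x00"] * 16] * len(MSG)
for i in range(len(keys)):
    for j in range(len(keys[i])):
        keys[i][j] = os.urandom(1)

print(all(k is keys[0] for k in keys))   # True: all entries are the same object
print(len({b"".join(k) for k in keys}))  # 1: only one distinct key was generated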
This is where our test might produce misleading results: we used a public piece of source code, referenced by another public note, both of which may have been used to train the model. We decided to take just the vulnerable key generation method and prompt GPT-4 again.
# I am using this function to generate keys for an encryption scheme, is this code correct or does it have any security issues?
def calc_k(msg):
    ret = [[DEFAULT_BYTE]*KEY_LENGTH] * len(msg)
    for i in range(len(ret)):
        for j in range(len(ret[i])):
            ret[i][j] = os.urandom(1)
    return ret
The response was similar and again correct, pointing out the problem, but failing to actually warn about the consequences specific to this context:
Key Matrix Initialization:
ret = [[DEFAULT_BYTE]*KEY_LENGTH] * len(msg): This line creates a list of lists, where each sublist is a reference to the same list. Thus, any change made to one sublist will reflect in all sublists. This could potentially be a major bug in your application, depending on how you plan to use ret.
…
List Initialization Bug:
The primary issue is with the way the lists are initialized. If you intend for each position in ret[i][j] to be independent, you need to initialize ret differently to avoid the aliasing bug where all sublists reference the same list.
The response could be sufficient for someone who correctly deduces that the bug causes all keys to be generated with the same content.
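The context-specific consequence GPT-4 failed to spell out is that, combined with Counter.new(block_size) starting every counter from the same default initial value, identical keys mean every message is XORed with the same AES-CTR keystream, reducing the scheme to the same many-time-pad situation as Challenge 1. A minimal sketch of that consequence (our own key and messages, not 7rocky’s published solution):
import os
from Crypto.Cipher import AES
from Crypto.Util import Counter
from Crypto.Util.Padding import pad

key = os.urandom(16)  # the single key that every message effectively shares

def encrypt(msg):
    ctr = Counter.new(128)  # default initial value: identical for every message
    return AES.new(key, AES.MODE_CTR, counter=ctr).encrypt(pad(msg, 16))

ct1 = encrypt(b"first secret message")
ct2 = encrypt(b"second, longer secret message")

# The shared keystream cancels out, leaving the XOR of the padded plaintexts.
xored = bytes(a ^ b for a, b in zip(ct1, ct2))
expected = bytes(a ^ b for a, b in zip(pad(b"first secret message", 16),
                                       pad(b"second, longer secret message", 16)))
assert xored == expected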
The proposed fixed code by GPT-4 is as follows:
import os

def calc_k(msg, key_length):
    # Initialize each sublist independently to avoid reference issues
    ret = [[os.urandom(1) for _ in range(key_length)] for _ in range(len(msg))]
    return ret
GPT-4 recap
We are still unsure whether there is training data bias explicitly related to this code rather than to the general conceptual problem, and it is impossible for us to test this further or in any conclusive fashion, since we do not get any sort of debugging or tracing information from OpenAI. We simply don’t know if GPT answered correctly because it is able to deduce the problem, or because it already knew the specific solution. Giving GPT-4 the benefit of the doubt, this is still a remarkably good result, but it also comes with a significant risk of human error.
Score: 9/10.
Conclusions
ChatGPT as of today is rather unreliable for these kinds of tasks, yet it seems to have acquired a false reputation of reliability, to the point that it is being used in professional settings. Perhaps asking GPT to solve cryptography riddles is a bit much to ask, but the overarching theme is that people are trusting GPT with code and problem-solving tasks where it might produce flawed results. Still, the results are remarkable: it is undoubtedly impressive that GPT can pinpoint implementation nuances and general conceptual mistakes. It might struggle with very niche questions and problems, but in many other instances the performance is still fantastic. Benchmarking or testing GPT is quite difficult, as we do not know how much bias exists between accounts and within an account itself (we did not repeat tests across different, disconnected accounts).
We have also been able to conduct interesting research related to filtering bypasses and “model coercion”. The obscurity of how these models are trained, and the reliability of their source material, remain major gaps that are not being clearly addressed. The lack of repeatability and consistency is a major warning sign: a human operator with enough experience and talent should, and will, produce consistent responses to the same or similar challenges or tasks, that is, repeatable results. Repeatability is a measure of reliability, in that a system (or an individual) given similar circumstances or conditions will produce the same result. When it comes to talent (such as the ability to spot cryptographic or source code vulnerabilities), especially in a professional setting, this is something that needs to be guaranteed.
Still, OpenAI has done an amazing job of introducing AI to the masses (the general public), to the point where it has acquired everyday relevance. GPT is also remarkably useful, accelerating workflows that would otherwise suffer from tasks that require very little “synthetic” talent but are time consuming. For those tasks, it’s a tremendous help.
However, the real question is: is ChatGPT becoming a reincarnation of the “6 USD/hr Indian Java developer” meme? Judging by how people are predictably switching over to GPT for all sorts of tasks, including at the workplace, this might very well be the case, and it does not come without a whole host of troubling concerns. Then there is the proverbial can of worms of privacy, concerning both the data we provide to the system in our prompts and the training data itself. Be careful with the content of your prompts!

Let this be (another) caveat emptor, as usually happens with every piece of so-called groundbreaking technology that gains substantial popularity.