Hmm, these numbers seem very low. I wonder how these scores were determined.
They weren't, because LLMs don't have reasoning ability, at least not in the way you do as a human. They are generative models, so the short answer is that the model most likely made the numbers up, though there's a chance it pulled them directly from training data that's likely completely unrelated to the user's prompt.
What a model generates is supposed to have a multidimensional correlation structure similar to that of the data it was trained on, so there are complex relationships between the question asked and the output given, but those processes look nothing like the steps you would go through to answer the same question.
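
To make "made the numbers up" a bit more concrete, here is a toy sketch in Python. It is not how any real LLM is implemented: the probability table and the `sample_score` function are invented for illustration, standing in for whatever distribution a model might have learned for the text that follows "Score: ". The point is only that the number is drawn from a distribution rather than worked out from a rubric.

```python
import random

# Hypothetical next-token probabilities for the text following "Score: ".
# These values are invented for illustration; a real model computes them
# from the whole prompt using billions of learned parameters.
NEXT_TOKEN_PROBS = {"2": 0.35, "3": 0.30, "4": 0.20, "7": 0.10, "9": 0.05}


def sample_score(rng: random.Random) -> str:
    """Pick a 'score' by weighted sampling; nothing is evaluated or graded."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Same "prompt", five runs: the numbers can differ from run to run
    # because each one is drawn from the distribution above.
    print([sample_score(rng) for _ in range(5)])
```

Ask the same question several times and you can get different scores, none of which came from going through the answer step by step.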