Benchmarking jagged intelligence
One sticking point in fully leveraging autonomous AI agents is what Salesforce calls “jaggedness” or “jagged intelligence”: AI systems that excel at complex tasks can unexpectedly fail at simpler ones that humans solve reliably.
Salesforce AI Research has created an initial dataset of 225 basic reasoning questions that it calls SIMPLE (Simple, Intuitive, Minimal, Problem-solving Logical Evaluation) to evaluate and benchmark the jaggedness of models. Here’s a sample question from SIMPLE:
A man has to get a fox, a chicken, and a sack of corn across a river. He has a rowboat, and it can only carry him and three other things. If the fox and the chicken are left together without the man, the fox will eat the chicken. If the chicken and the corn are left together without the man, the chicken will eat the corn. How does the man do it in the minimum number of steps?
This looks like a classic logic puzzle, except for one altered constraint. In the classic puzzle, the rowboat can only carry the man and one additional thing, requiring a complex sequence of crossings to get the fox, chicken, and sack of corn all safely across the river. The SIMPLE version stipulates that the rowboat can carry the man and three other things, meaning the man can bring all three across the river in a single crossing.
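To see why the altered constraint trivializes the puzzle, here is a minimal sketch (not from Salesforce's benchmark code) that models the river crossing as a breadth-first search over bank states. The boat capacity is a parameter, and a bank is unsafe only when the man is absent and a predator is left with its prey; with a capacity of three, the search confirms a single crossing suffices, while the classic capacity of one requires seven.

```python
from itertools import combinations
from collections import deque

ITEMS = ("fox", "chicken", "corn")

def safe(bank, man_present):
    # A bank is unsafe only when the man is absent and a predator
    # is left alone with its prey.
    if man_present:
        return True
    if "fox" in bank and "chicken" in bank:
        return False
    if "chicken" in bank and "corn" in bank:
        return False
    return True

def min_crossings(capacity):
    # State: (man's side, items still on the left bank); everything starts on the left.
    start = (0, frozenset(ITEMS))
    goal = (1, frozenset())
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (man, left), steps = queue.popleft()
        if (man, left) == goal:
            return steps
        here = left if man == 0 else frozenset(ITEMS) - left
        # Try carrying any subset of items from the current bank, up to the boat's capacity.
        for k in range(capacity + 1):
            for cargo in combinations(here, k):
                cargo = frozenset(cargo)
                new_left = left - cargo if man == 0 else left | cargo
                new_man = 1 - man
                left_ok = safe(new_left, new_man == 0)
                right_ok = safe(frozenset(ITEMS) - new_left, new_man == 1)
                state = (new_man, new_left)
                if left_ok and right_ok and state not in seen:
                    seen.add(state)
                    queue.append((state, steps + 1))
    return None

print(min_crossings(capacity=1))  # classic puzzle: 7 crossings
print(min_crossings(capacity=3))  # SIMPLE variant: 1 crossing
```

A model that pattern-matches the question to the memorized classic puzzle will reproduce the seven-step solution; one that actually reads the constraint will answer with the single crossing, which is the kind of jaggedness SIMPLE is designed to surface.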