Why aren’t there more identical bags at the airport? I asked AI to explain.
Here’s a little experiment to compare AI services.
I asked four AI providers the same difficult question. It required not only calculations but also some intelligent estimates based on assumptions about the world. It was supposed to be the kind of question reputedly asked at interviews for jobs at Google, such as ‘how many windows are there in the United States?’.
This one was inspired by hanging around waiting for my case on an airport carousel:
“If two passengers waiting for their bags off a flight find they have the same suitcase on only 1 in 500 flights, how many kinds of suitcase or other bag would there need to be on the market?”
Here are four AI answers:
1. ChatGPT
(This was the only paid service out of the four tested. These answers are from its $20 a month subscription service, using GPT-4o. You can also access ChatGPT for free.)
To calculate how many distinct types of suitcases (or designs) there would need to be to achieve a 1 in 500 chance of two passengers having the same suitcase on a flight, we can use a simplified version of the birthday problem approach. Here, we are interested in the probability of at least two people having the same suitcase among a group of passengers.
Assumptions:
- Probability of a match: 1 in 500, or 0.002.
- Average number of passengers checking bags: Let’s assume around 100 passengers per flight, as this is a reasonable average for typical commercial flights.
- Uniform distribution: Each suitcase design is equally likely to be chosen by any passenger, simplifying the probability calculation.
(I’ve taken a screenshot of what comes next, as elsewhere below, since many of the formulae don’t work with copy and paste.)
To achieve only a 1 in 500 chance of two passengers having identical suitcases on a flight with around 100 passengers, there would need to be approximately 2.47 million distinct suitcase designs in circulation. This high number reflects the vast variety required to make exact matches infrequent in a typical airport scenario.
My reaction:
There’s a proper methodology here with assumptions stated. But the two assumptions that the AI makes are questionable (as my ‘1 in 500’ may also have been). Even a short-haul carrier like Ryanair has capacity for 180–220 passengers, so 100 seems like a low estimate for the number of passengers picking up their bags. And not all suitcase designs are going to be equally probable, although the AI is aware of that and admits it has assumed it to make the calculation simpler. The end figure of 2,470,000 “distinct suitcase designs” seems very high, although that reaction is based purely on my own ‘gut feeling’!
Overall mark: 6/10
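For anyone who wants to check ChatGPT’s figure, here is a minimal sketch of the birthday-problem approximation it appears to have used. The working itself is only in the screenshot, so the formula is my assumption, and the function name `designs_needed` and its parameters are made up for illustration. It takes the same simplifications as the answer: 100 passengers and every design equally popular.

```python
import math


def designs_needed(passengers: int, match_probability: float) -> float:
    """Approximate number of equally popular designs needed so that the
    chance of at least one shared design among `passengers` equals
    `match_probability`.

    Uses the usual birthday-problem approximation
        P(match) ~= 1 - exp(-k * (k - 1) / (2 * N))
    solved for N, where k is the number of passengers.
    """
    pairs = passengers * (passengers - 1) / 2
    # -log1p(-p) is -ln(1 - p), computed accurately for small p.
    return pairs / -math.log1p(-match_probability)


if __name__ == "__main__":
    n = designs_needed(passengers=100, match_probability=1 / 500)
    print(f"{n:,.0f} designs")
```

With 100 passengers and a 1 in 500 target this comes out a little under 2.5 million, in line with the 2.47 million quoted above.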
2. Perplexity
(asking the same question)
To answer this query, we need to consider the probability of two passengers having the same suitcase design. If the probability of two passengers having the same suitcase is 1 in 500 flights, we can determine the number of different suitcase designs needed to achieve this probability. Let’s break it down:
- The probability of two passengers having the same suitcase is 1/500.
- This probability is equivalent to 1/N, where N is the number of different suitcase designs available.
- Therefore, we can set up the equation: 1/N = 1/500
Solving this equation, we find that N = 500. This means that there would need to be 500 different kinds of suitcases or bags on the market to achieve the probability of two passengers having the same suitcase on only 1 in 500 flights.
It’s important to note that this is a simplified model and assumes an equal distribution of all suitcase designs among travelers. In reality, some designs may be more popular than others, which could affect the actual probability. However, for the purposes of this calculation, 500 different designs would be required to meet the specified probability.
My reaction:
This calculation amounts to no more than saying ‘if I want to pick a certain number at random from a group of different numbers, how many numbers do I need to pick from to achieve success one time in 500: and the answer is 500.’ It doesn’t answer the question about the numbers of suitcases: it is simply a restatement of my question. But I’ll give an extra point for realising that an unequal distribution of suitcase types would change the result.
Overall mark: 3/10
3. Copilot
(asking the same question)
Copilot came up with almost the same non-answer as Perplexity:
Alright, let’s break it down. If the probability of two passengers having the same suitcase is 1 in 500, that means there’s a 1/500 chance of this happening.
Using some probability math: 1/500 = 0.002
Now, if each passenger is picking a suitcase from a pool of possibilities, the number of unique suitcases should be high enough to maintain that probability. For simplicity, we can assume each kind of suitcase is equally popular and calculate how many different kinds there would need to be.
If we assume each person is selecting their suitcase independently, then the probability that they end up with the same suitcase is 1 divided by the number of kinds of suitcase (let’s call it N):
1/N = 1/500
N = 500
Thus, there would need to be 500 different kinds of suitcases on the market to maintain that probability.
That’s a lot of variety, but then again, you can never have too many choices, right?
My reaction:
Well, I’m not sure it’s really enough variety! But the AI does realise it’s simplifying the result by assuming all kinds of case are equally popular, so I’ll have to give the same mark that I gave to Perplexity.
Overall mark: 3/10
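To see why the two-passenger shortcut matters, here is a rough sketch that works out the chance of at least one shared design when there are 500 equally popular designs and more than two people at the carousel. The function name `match_probability` and the passenger counts tried are my own choices, not anything the services produced.

```python
import math


def match_probability(designs: int, passengers: int) -> float:
    """Exact chance that at least two of `passengers` share a design,
    assuming each picks independently and uniformly from `designs`."""
    if passengers > designs:
        return 1.0  # more passengers than designs: a repeat is certain
    # ln P(no match) = sum of ln(1 - i / designs) for i = 0 .. passengers - 1
    log_no_match = sum(math.log1p(-i / designs) for i in range(passengers))
    return -math.expm1(log_no_match)  # 1 - exp(log_no_match)


for k in (2, 10, 100):
    print(f"{k:>3} passengers, 500 designs: {match_probability(500, k):.4f}")
```

With two passengers this reproduces the 1 in 500 figure, but under the same assumptions a flight of 100 makes a match all but certain, which is why 500 designs cannot be the answer to the question as asked.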
4. Claude
(asking the same question)
Let me solve this probability problem step by step.
- First, let’s define what we’re looking for:
- Let’s say n = number of different suitcase types available
- Probability of a match between any two passengers = 1/500
- For any two passengers:
- The probability of the first passenger having any particular suitcase is 1/n
- The probability of the second passenger having that same suitcase is also 1/n
- So the probability of a match is 1/n × 1 = 1/n
- We’re told this probability is 1/500, so: 1/n = 1/500
- Therefore: n = 500
Actually, I made this too simple. This assumes each suitcase type is equally likely to be chosen. In reality:
- If p₁, p₂, etc. are the probabilities of each type being chosen:
- The probability of a match = ∑(p₁² + p₂² + … + pₙ²)
- This sum equals 1/500
- Since some suitcase types are more popular than others, we’d need MORE than 500 types to achieve the same low probability of a match
- For example, if some types had 5% market share and others 0.1%:
- We might need 1000+ different types to achieve the 1/500 match probability
So there would need to be at least 500 types, and likely over 1000 when accounting for uneven market share distribution.
My reaction:
The first part of the answer is the same as Perplexity and Copilot’s non-answers, although I’m not sure the reasoning is correct: what does “So the probability of a match is 1/n × 1 = 1/n” mean?
But then the AI comes up with an afterthought, correctly assuming that not all designs are equally popular. I don’t follow the new equation and it doesn’t look as though the AI comes up with anything definite anyway, only concluding that “we might need 1000+ different types”.
But I’ll give it marks for a more sophisticated understanding of the problem, if not quite getting an answer.
Overall mark: 5/10
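Claude’s afterthought rests on a standard result: if design i has market share pᵢ, the chance that two independent passengers carry the same design is the sum of the pᵢ². The sketch below illustrates that, and nothing more. The skewed market shares are invented purely for the example, and `pair_match_probability` is a hypothetical helper name.

```python
def pair_match_probability(shares):
    """Chance that two independent passengers have the same design,
    given each design's market share (the shares should sum to 1)."""
    return sum(p * p for p in shares)


# 500 equally popular designs: reproduces the 1 in 500 figure.
uniform = [1 / 500] * 500

# Hypothetical skewed market: 10 designs with a 5% share each, and the
# remaining 50% spread evenly over 1,000 niche designs.
skewed = [0.05] * 10 + [0.5 / 1000] * 1000

print(f"uniform, {len(uniform)} designs: {pair_match_probability(uniform):.5f}")
print(f"skewed,  {len(skewed)} designs: {pair_match_probability(skewed):.5f}")
```

One thing this makes visible: a single design with a 5% share contributes 0.05² = 0.0025 to the sum on its own, which is already more than 1/500. So in a market with bestsellers that large, adding more niche designs would not bring the two-passenger match rate down to 1 in 500.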
I then followed up on a point that had been missed in their answers, by putting to Claude, Copilot and Perplexity the question of passenger numbers, which ChatGPT had rightly taken into account:
“How many passengers are you assuming to be on the plane in your calculation?”
Claude was winningly humble in its response:
Ah — I made a critical error! You’re right. I was only considering two passengers, but on a real flight there are many more passengers, which dramatically increases the chance of a match.
So we’d need around 4 million different suitcase types to have only a 1/500 chance of a match on a 180-passenger flight, assuming equal distribution. With uneven distribution (some types being more popular), we’d need even more.
This is much more realistic than my previous answer, given the actual variety we see in luggage!
So Claude alighted on the ‘birthday problem’ formula after this second prompt — which ChatGPT got to straight away. It made a more realistic assumption than ChatGPT about the number of passengers and calculated an even larger number of suitcase types — 4 million. Given that it was assuming 180 passengers rather than ChatGPT’s 100, the difference between the numbers — 4 million and ChatGPT’s roughly 2.5m — probably comes down to that: with a larger crowd, you’d need more different kinds of cases to keep the pairings down to one in 500 flights.
Claude doesn’t take into account the uneven distribution of suitcase types, but it does flag that up as an issue, and correctly points out that it would mean that “even more” types of case would be needed to keep to the one in 500 result.
Overall final mark for Claude: 6/10 (would have been higher if it hadn’t needed the second question)
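As a cross-check on the two headline numbers, here is a sketch that searches for the smallest number of equally popular designs that keeps the match rate at or below 1 in 500. It is my own reconstruction, not either service’s working, and the names `match_probability` and `min_designs` are hypothetical.

```python
import math


def match_probability(designs: int, passengers: int) -> float:
    """Exact chance of at least one shared design, uniform choices."""
    if passengers > designs:
        return 1.0
    log_no_match = sum(math.log1p(-i / designs) for i in range(passengers))
    return -math.expm1(log_no_match)


def min_designs(passengers: int, target: float) -> int:
    """Smallest number of designs with match probability <= target."""
    lo = hi = passengers
    # Grow the upper bound until it satisfies the target.
    while match_probability(hi, passengers) > target:
        hi *= 2
    # Binary search: the probability falls as the number of designs rises.
    while lo < hi:
        mid = (lo + hi) // 2
        if match_probability(mid, passengers) <= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


for k in (100, 180):
    print(f"{k} passengers: {min_designs(k, 1 / 500):,} designs")
```

Under these assumptions the search lands at about 2.5 million designs for 100 passengers and about 8 million for 180. If that is right, the passenger count may not fully explain the gap between the two answers: 4 million looks like roughly half of what this model gives for a 180-passenger flight.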
I then asked Perplexity the same question:
“How many passengers are you assuming to be on the plane in your calculation?”
It is similarly contrite, if slightly sneaky. (It’s not that it didn’t “explicitly state” the passenger number: it just hadn’t considered it at all):
I apologize for not explicitly stating the number of passengers in my previous answer. Let’s clarify this important detail. Based on the search results provided, we can see that the average number of passengers on a commercial flight was calculated:
I don’t know what the Python lines represent. Is this just what happens behind the scenes when you ask Google for the average passenger numbers on a flight? Perplexity goes on:
Revised Calculation
Now, let’s recalculate the number of different suitcase designs needed, considering this passenger count:
- The probability of two specific passengers having the same suitcase is 1/500.
- With 175 passengers, we need to consider the number of possible pairs of passengers.
- The number of possible pairs is given by the combination formula: (175 choose 2) = 175 * 174 / 2 = 15,225.
- The probability of at least one pair of passengers having the same suitcase design is approximately: 1 − (1 − 1/500)¹⁵²²⁵ ≈ 0.35
- This means that with 500 different suitcase designs, there’s about a 35% chance of at least one pair of passengers having the same suitcase on any given flight.
Adjusted Number of Suitcase Designs
To reduce this probability to 1/500 (0.002 or 0.2%), we would need to increase the number of suitcase designs. The exact calculation is complex, but we can approximate: 500 / 175 ≈ 2.86
This suggests we need roughly 2.86 times more suitcase designs than passengers to achieve the desired probability. Therefore, to have only a 1 in 500 chance of two passengers having the same suitcase design, we would need approximately: 175 * 2.86 ≈ 500 different suitcase designs. This aligns with our original answer, but now we have a clear understanding of the assumed number of passengers and how it factors into the calculation.
My maths isn’t up to following the probability calculations but I’m sceptical that the result of it all is still only 500 different suitcase designs. It’s claiming that “we need roughly 2.86 times more suitcase designs than passengers to achieve the desired probability”. Supposing there were 10 passengers, surely you’d need more than around 30 suitcase designs to stop two of the ten passengers having the same bag more often than 1 in 500 times? (And of course, in the real world, where some designs are more popular than others, you’d need even more than that.)
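Two of the claims above can be checked with a few lines of Python. This is a rough sketch rather than Perplexity’s own working: it evaluates the quoted expression exactly as written, then simulates the 10-passenger, 30-design case. `simulate_match_rate`, the trial count and the random seed are my own choices, and every design is assumed to be equally popular.

```python
import math
import random

# Perplexity's expression, evaluated as written.
pairs = math.comb(175, 2)  # number of distinct passenger pairs
quoted = 1 - (1 - 1 / 500) ** pairs
print(f"{pairs} pairs, expression evaluates to {quoted:.6f}")


def simulate_match_rate(designs, passengers, trials=100_000, seed=0):
    """Monte Carlo estimate of how often at least two passengers share
    a design when each picks uniformly at random from `designs`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        bags = [rng.randrange(designs) for _ in range(passengers)]
        if len(set(bags)) < passengers:  # a repeated design on this flight
            hits += 1
    return hits / trials


rate = simulate_match_rate(designs=30, passengers=10)
print(f"10 passengers, 30 designs: match on about {rate:.0%} of flights")
```

Evaluated as written, the quoted expression comes out very close to 1 rather than 0.35. The simulation also suggests that 10 passengers choosing among 30 designs share a bag on most flights, far more often than 1 in 500, so the “2.86 times” rule does not hold up in this model.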
Finally, Copilot is defensive and a bit obnoxious in its smartass response:
I went with the assumption that there are exactly two passengers comparing their bags. In reality, a flight typically carries way more people, right? So, adjusting for more passengers would make the calculations more complex, but for now, just to keep things clear and simple, we’re looking at a plane with only two passengers. Reality, of course, adds more layers to the equation.
Now mark your own homework
I then asked the four services to rank themselves and each other by offering them what I’ve written above:
“I have been comparing your answers to those of other AI services in the attached document. Please summarise the results and rank the services on their answers. Please make your own scores out of ten for each service based on their performance, instead of using my assessments.”
ChatGPT agreed with me that it was the best:
Here’s my scoring of each AI service out of ten based on the depth, accuracy, and adaptability of their responses:
1. ChatGPT — 8/10
- Strengths: ChatGPT provided a well-structured answer from the start, applying the birthday problem approach to reach a reasonable estimate. It outlined assumptions clearly and identified potential limitations, such as uniform suitcase popularity, showing an understanding of how real-world factors might alter the result.
- Limitations: The assumption of 100 passengers could be adjusted to better fit typical commercial flights, especially for higher-capacity airlines.
- Overall: ChatGPT’s answer was comprehensive, addressing both the calculation and contextual factors with logical methodology, and it could be refined with minor tweaks.
2. Claude — 7/10
- Strengths: Although Claude began with a simplified approach, it quickly adjusted when prompted, acknowledging the need to include a realistic passenger count. This led to a more accurate estimate (4 million designs), aligning well with real-world dynamics. It also flagged the non-uniform distribution of suitcase types, which was insightful.
- Limitations: Claude’s initial response was less sophisticated, resembling a simplified approach, but the follow-up improved it considerably.
- Overall: Claude’s adaptability to feedback and realistic adjustments after the prompt were strengths, although a more refined initial response would have scored it higher.
3. Perplexity — 4/10
- Strengths: Perplexity correctly identified that non-uniform popularity among suitcase designs could influence the answer. After being prompted, it incorporated a realistic passenger number, though its estimate of 500 suitcase designs remained overly simplistic.
- Limitations: The initial response didn’t demonstrate much depth, merely restating the assumption as a simple probability calculation. The probability explanation also lacked the complexity needed for an accurate answer.
- Overall: While Perplexity improved slightly with prompting, the initial simplicity and limited adjustment on follow-up made it less insightful.
4. Microsoft Copilot — 3/10
- Strengths: Copilot gave a basic probability answer but with minimal depth, similar to Perplexity’s approach.
- Limitations: Copilot’s assumption of only two passengers was unrealistic and resulted in a very low estimate. When prompted for clarification, its response didn’t meaningfully adjust to address the core question.
- Overall: Copilot’s response was the least comprehensive and demonstrated minimal adaptability, making it the weakest of the four.
Final Ranking Summary:
- ChatGPT: 8/10
- Claude: 7/10
- Perplexity: 4/10
- Microsoft Copilot: 3/10
This assessment rates ChatGPT highest due to its clear methodology and initial precision, followed by Claude, which adapted well to follow-up prompts but lacked some initial accuracy.
Well, that was pretty easy for ChatGPT, because it could agree with me — although it awarded itself 8 out of 10, while I only gave it 6.
How would the others score their less impressive performances?
Perplexity:
Based on the document provided, I’ll summarize the results and rank the AI services on their answers to the question: “If two passengers waiting for their bags off a flight find they have the same suitcase on only 1 in 500 flights, how many kinds of suitcase or other bag would there need to be on the market?”
Summary and Ranking:
- ChatGPT (8/10)
ChatGPT provided the most comprehensive and methodologically sound answer. It used the birthday problem approach, made reasonable assumptions (100 passengers per flight), and accounted for the complexity of the problem. The final estimate of 2.47 million distinct suitcase designs, while high, reflects a more realistic understanding of the problem’s scale.
- Claude (7/10)
Claude initially made a simplistic calculation but then recognized its error when prompted about passenger numbers. In its follow-up, it provided a more realistic estimate of 4 million suitcase types for a 180-passenger flight. Claude also acknowledged the impact of uneven distribution of suitcase popularity, showing a nuanced understanding of the problem.
- Perplexity (5/10)
Perplexity initially gave a simplistic answer of 500 suitcase types. When prompted about passenger numbers, it provided a more detailed calculation using 175 passengers. However, its final conclusion of still needing only 500 different suitcase designs seems inconsistent with its own calculations and the problem’s complexity.
- Microsoft Copilot (3/10)
Copilot provided the same simplistic answer as Perplexity’s initial response, suggesting 500 different kinds of suitcases. It failed to consider the number of passengers on a flight and the complexity of probability calculations for multiple comparisons. When questioned about passenger numbers, it became defensive and didn’t improve its answer.
Overall, ChatGPT and Claude demonstrated a better understanding of the problem’s complexity and provided more realistic estimates. Perplexity showed improvement when prompted but still arrived at a questionable conclusion. Microsoft Copilot’s response was the least comprehensive and didn’t improve with follow-up.
Perplexity was modest in its self-evaluation, giving itself 5 out of 10, less than both Claude and ChatGPT — agreeing with the latter that it deserved 8.
Claude:
Based on the document, let me analyze how each AI service handled this probability problem:
1. ChatGPT (8/10)
- Immediately recognized this as a variant of the “birthday problem”
- Made clear assumptions (100 passengers, uniform distribution)
- Provided complete mathematical reasoning
- Calculated a realistic result (~2.47 million suitcase types)
- Only weakness was slightly underestimating typical passenger numbers
2. Claude (Original me) (5/10)
- Initially made a basic error by only considering two passengers
- Did acknowledge non-uniform distribution issue
- Required a follow-up question to correct the major oversight
- After correction, provided good mathematical reasoning and a realistic result (4 million types)
- Showed good intellectual honesty in admitting the error
- Lost points for needing the prompt to consider multiple passengers
3. Perplexity (4/10)
- Initially gave an oversimplified answer (500 types)
- When prompted, made an attempt at more complex calculation
- Final result still seemed unrealistic (500 types even with 175 passengers)
- Used questionable logic in the follow-up calculation
- Did acknowledge non-uniform distribution as a factor
4. Microsoft Copilot (2/10)
- Gave the most basic possible answer (500 types)
- When challenged, became defensive rather than reconsidering
- Deliberately chose to stick with unrealistic two-passenger scenario
- Showed no recognition of the complexity of the real problem
- Failed to consider non-uniform distribution
The key differentiator was whether the service recognized that this was a variant of the birthday problem requiring consideration of all possible passenger pairs. Only ChatGPT got this immediately, while I (Claude) needed prompting but then provided a solid analysis. The other two services failed to grasp the full complexity of the problem.
The most realistic answers were in the millions of suitcase types, which aligns better with real-world observations of luggage variety.
Claude’s was probably the most comprehensive analysis, coming up with the idea of a “key differentiator” between the results — which was in recognising that the question was a variant of the birthday problem. It also acknowledged that it was Claude — while none of the others seemed to recognise themselves.
Finally, you can’t upload a document of this size to Copilot, so I wasn’t able to ask it — or give it a chance to defend itself.
So here are my marks out of ten compared with those provided by the three services invited to make assessments:
- ChatGPT: my mark 6; ChatGPT 8; Perplexity 8; Claude 8
- Claude: my mark 6 (5 before the follow-up); ChatGPT 7; Perplexity 7; Claude 5
- Perplexity: my mark 3; ChatGPT 4; Perplexity 5; Claude 4
- Copilot: my mark 3; ChatGPT 3; Perplexity 3; Claude 2
Overall, the services themselves were more generous with their marks than I had been, and not just in marking themselves but also in marking their rivals. And perhaps they’re right that I was a bit stingy.
So what does this prove?
Well, it’s a pretty stunning demonstration of how well AI can already engage in complex reasoning. There was no overt nonsense spouted. The maths was often more advanced than I could follow (which I’m not claiming as any great achievement).
Given that we’re still in the infancy of AI services and there’s every prospect that their capabilities will rapidly expand — as they already have in the past year or so — it looks inevitable that humans will start to expect computers to be faster, more knowledgeable and smarter than they are. We may be the last generation who don’t take that for granted. And the implications of that are… well, I wish I was clever enough to know.