Want a job working on mongoDB? Your first online interview is in this post

Posted April 23, 2011
Filed under: Uncategorized |

I like smart programmers.

So, I thought I’d put out a programming puzzle. If you an solve it, you are well on your way to a very cool job. If you can’t solve it, send it to a smart friend who can. Even if you don’t want a job, I promise you’ll enjoy the solution when you find it.

Here’s the puzzle:

You’re given an array of N 64 bit integers. N may be very large. You know that every integer 1-N appears once in the array, except there is one integer missing and one integer duplicated.

I’d like a linear time algorithm to find the missing and duplicated numbers. Further, your algorithm should run in small constant space and leave the array untouched.

— Max

PS I love smart people in other jobs too; while I think this problem will appeal mostly to programmers, 10gen is a company where solving a problem like this helps you get a job in marketing, finance, or any other area that fits your skills. Send me a solution and tell me what job you want to interview for!

NOTE 4/24 4am: I have caught up with all the responses but plan to take Sunday off and spend it with my family. Thanks for making a job at 10gen more popular than baby tips from Elton John today.

260 comments so far

Cliff Elam on April 23, 2011

I would put it into Excel column A, sort it, then copy this excel formula next to it in B:

=IF(A20A19+1,IF(A20=A19,”Dupe”, “Missing”),” “)

Then I would use my highly sophisticated Mark 1 human eyeball to spot the problem.

Problem solution benefits:
1> Small space. Small is relative, this works on a cheap PC doing a thousand other things and doesn’t require any reboot or configuration.

2> Untouched array. Copy/paste, ’nuff said.

3> N is very large. Depends on what you mean by large. I can to 5M+ rows in excel now. God knows how we lived without that.

4> Linear. Well, something this simple probably is almost linear inside Excel, really, but more complex operations than this bog down as N gets large. But it’s close and the really expensive part of the equation, well, It took me longer to type in the first paragraph above than it would to actually solve the problem in the first place.

5> Additional benefits. Nobody will take that code fragment and leave it lying about in a production release.

Yours in excel mastery and advanced joshing algorithms,

-XC

Reply
- Max Schireson on April 23, 2011
  
  LOL 🙂
  
  Reply
  - Mohammed Ahmed on April 23, 2011
    
    Hire this guy already!
- Girish on April 24, 2011
  
  Best answer. Ever.
  
  Reply
- Think2day (@think2day) on November 30, 2011
  
  You must be a manager in MongoDB. Smart enough 🙂
  
  Reply
Joris Verschoor on April 23, 2011

loop 1..N as x
find x in array as index
if not found, x is missing
if found, find again at index. if found, duplicate
end

this is easily parallelized (word?)

Reply
- Max Schireson on April 23, 2011
  
  You loop N times, each time having to find x in the array. How long does each find take? Looking for something linear.
  
  Reply
Joris Verschoor on April 23, 2011

With map/reduce one would first emit(m, {status: ok/missing/duplicate}) in the map function.
Then you could reduce that down to missing + duplicate values.

Reply
Joris Verschoor on April 23, 2011

sum(array) – N! gives you duplicate-missing
you can use this to loop over the array, and to check if a value + (duplicate-missing) exists

If it’s missing, you know how to calculate the duplicate value: (duplicate-missing) + missing.

I might have some + / – wrong though 😉

Reply
Joris Verschoor on April 23, 2011

Sorry.. final one 🙂

sum(array) – N! gives you duplicate-missing (dm)
Then loop from 1..N, and check each value for existence.
If it does not exist, you know which value is duplicate, because you have dm. Duplicate = dm + nonexisting

This is small in size, and loops 1 time over the array to sum, and does one find per value

Reply
- Joris Verschoor on April 23, 2011
  
  N! would me 1*2*3..*N but I meant 1+2+3+4..+N =)
  
  Reply
  - Max Schireson on April 23, 2011
    
    Interesting path, but still needing that one find per value. Finding a value in a large array takes time. You found duplicate-missing quickly, can you extend that to find duplicate and find missing?
  - Joris Verschoor on April 23, 2011
    
    Someone beat me with the xor..
    
    I’m no algo-expert, or novice even =)
    Was a nice lil’ puzzle, and I’m proud of the sum dup-miss thingy =)
    
    I’m not really a fan of this sort of interviews, but that wans’t my goal.
    
    Thanks for the challenge! Happy easter!
- Martin on April 23, 2011
  
  sum(array) – (1+2+3+4..+N) does not give you the missing. It gives you the difference between the missing and the duplicate. For instance:
  [10,2,6,9,8,5,9,1,3,7]
  
  Sum([10,2,6,9,8,5,9,1,3,7]) = 60
  Sum([1,2,3,4,5,6,7,8,9,10]) = 55
  
  60 – 55 = 5 <– No the missing! But, it is (9 – 4), the difference between the missing and the duplicate
  
  =)
  
  Reply
Edward Ribeiro on April 23, 2011

The solution I came up with is as follows:

1. Creates a bit vector and initialize each element with zero. The “i” number will occupy the i-eth position in the bit vector, from left to right.

2. For each x from input-vector do:
2.1. check to see if the x-eth position of the bit vector is already set to 1. If so, the it’s the repeated element, and print ‘x’ as the repeated;
2.2. otherwise, set the x-eth position of the bit to 1;

3. Out of the loop, use Bit Twiddling Hacks to find the first zero bit from left to right (Hacker’s Delight book shows just how to do that).

This solution is space efficient because the bit vector, and runs in linear time. Below is a quick and dirty snippet of java code that does exactly that, but it is limited by a huge memory consumption (for the input array), and is limited to Integer.MAX_VALUE, that is pretty big, but is could not be large enough. For a robust implementation, I would use C and a lot of Bit Twiddling Hacks to implement the above algorithm. 🙂

============================ Java Code ==========================

public static void search(final Long[] input) {

BitSet bitarray = new BitSet(input.length + 1); // should be from 1-N not from 0- (N-1)

int repeated = 0;

for (Long val : input) {
if (bitarray.get(val.intValue())) {
repeated = val.intValue();
}
bitarray.set(val.intValue());
}

System.out.println(“repeated: ” + repeated);
System.out.println(“absent:” + bitarray.nextClearBit(1)); //ignores first bit (ZERO)
}

===============================================================

Reply
- Max Schireson on April 23, 2011
  
  The space you’re using is still linear in the number of values. One bit per, but that adds up if you have billions of values… Can you do it in (small) constant space?
  
  Reply
  - Edward Ribeiro on April 24, 2011
    
    Taking in account the constant space requirement, I would use something like Bloom Filter. It’s great to check if a value is in a set, even with false positives’ probability. But a relatively modest bit vector size can reduce this probability.
    
    As to the repetition checking, I don’t know by heart now, but I am curious to see the final results. 😉
- Wedge on April 23, 2011
  
  Congrats, you’ve just managed to use 2^64 bits, or 2.3 exabytes of RAM. This may not be the most efficient solution.
  
  Reply
  - Edward Ribeiro on April 24, 2011
    
    LOL, are you sure you really understood my proposed solution? It’s basically a bitmap!
    
    Each bit in the bit vector (implemented as an array of unsigned ints or whatever) identifies if the i-th value appeared or not. If you want to represent a list with 1 to 1 billion values (any order) then this bitmap will be 125MB in size.
    
    The java snippet is just to show some code tough, as it should be done in C/C++ if billions of values are to be stored in the input list.
dm on April 23, 2011

i guess Gauss would get the job

Reply
- Max Schireson on April 23, 2011
  
  he’d get the live interview. then we’d have to see if he could actually program, which is different from coming up with clever algorithms 🙂
  
  Reply
Bob Kimble on April 23, 2011

Joris is on the right track. As Max notes, his technique yields duplicate-missing. Repeating the process by summing the squares of the array similarly yields duplicate^2-missing^2. Dividing the first into the second yields duplicate+missing. The two equations together quickly yields both values. FWIW, the sum of the integers from 1 to N is N*(N+1)/2, and the sum of the squares of the integers from 1 to N is N*(N+1)*(2*N+1)/6. However, I feel as if I’m cheating, because I am a mathematician as well as a programmer. 🙂

Reply
- Joris Verschoor on April 23, 2011
  
  Ahhh =)
  
  Yeah I guess that’s cheating 😉
  
  Reply
- Max Schireson on April 23, 2011
  
  i felt guilty of cheating for answering a fibonacci question with neither iteration nor recursion…
  
  Reply
  - Max Schireson on April 24, 2011
    
    But bob was the first to have it correct, even if it was submitted a comment on another post rather than his own post.
Cliff Elam on April 23, 2011

See, I’d have had this done in Excel and be playing Scrabble by now.

-XC

Reply
ken on April 23, 2011

Keep 64 counters, one for each bit of a 64 bit integer. For each entry i encountered, increment the counters for each bit that is on in i. At the end, compare the counters to what you expect. Given that we know there was exactly one entry incorrect, we can tell how its bits differed from what we expected, giving both the missing number and the extra number.

Reply
- Wedge on April 23, 2011
  
  I’m not sure this is sufficient. For example: what if the number 8 is missing and the number 9 is duplicated? All you’d end up with is a difference in the 1’s counter, there isn’t enough information left to tell you about the other bits of the missing or duplicated numbers.
  
  Reply
Levi Strope on April 23, 2011

Amazon asked me this same question (nearly verbatim) in an interview and handed me a pad of paper. Fun times.

Reply
- Max Schireson on April 23, 2011
  
  Next time you’ll be better prepared having seen all these responses!
  
  Reply
Mike Thvedt on April 23, 2011

let arr = the array, miss = the missing integer, dup = the duplicated integer
x = the sum of all elements of the array
I posit that x = dup – miss. x != 0.

we’re going to partition the array by modulus, and binary search. since we only have to search two “sub-arrays” at once, constant space is required. we perform at most B passes over the array, where B is the bit size of the integers (here 64)

let search(array) {
global dup = 0, miss = 0
search(array, 1, 0)
return (dup, miss)
}

let search(array, modulus, the_filter) = {
if modulus == 2^64, scan the array for the_filter
if found, set dup = the_filter; return
if not found, set miss = the_filter; return
sum (filter (\x: x mod modulus == the_filter) array)
//comment: the filter can be done lazily
if sum == 0, break
if sum != 0 {
search(array, modulus * 2, the_filter)
search(array, modulus * 2, modulus * 2 + the_filter)
}
}

since each ‘sub-array’ contains, for all x, its additive inverse ~x, except for one duplicated x and one missing x, if the sum of the sub-array is 0, then the sub-array has a duplicated x, missing x, or both.

disclaimer: code is not tested.

Reply
- Mike Thvedt on April 23, 2011
  
  hey, this code is a little sloppy (break and return are confused, for example), but i’m convinced this is the right way to do it
  
  Reply
  - Torsak Chaiwattanaphong on April 23, 2011
    
    The recursion can make stack overflow due to the very big number (2^64). You can adapt it as “for loop” procedure by split the array to a smaller part (Ex. 1..100, 101..200, 201..300 and so on) and trying to check just the smaller part in each loop. In this way, you can have a constant space and linear time algorithm and won’t make your stack to be overflowed by a very large number.
  - Mike Thvedt on April 23, 2011
    
    I posted a way less crazy solution here, though it builds primarily off someone else’s solution: http://news.ycombinator.com/item?id=2477401
  - Max Schireson on April 23, 2011
    
    Mike, i’ll check your hacker news solution and comment there since you seem to like it better than this one.
    
    — Max
itrend on April 23, 2011

Aha, inspired by Joris Verschoor’s half solution: iterate over the array accumulating the sum (S) of the numbers and the sum of the numbers squared (S2). If d is the duplicate and m is the missing then:

d – m = S – n*(n+1)/2
d^2 – m^2 = S2 – n*(n+1)*(2*n+1)/6

We have now reduced the problem to a well known one 🙂

Reply
- itrend on April 23, 2011
  
  Thought I might as well post my quick proof of concept Haskell code for the curious (formatting will probably be lost though):
  
  import Data.List
  
  solve a = (dupe, miss)
  where
  dupe = (delta*delta + deltaSq) `div` (delta*2)
  miss = dupe – delta
  n = length a
  delta = sum – (n*(n+1) `div` 2)
  deltaSq = sumSq – (n*(n+1)*(2*n+1) `div` 6)
  (sum, sumSq) = foldl’ (\(sum, sumSq) x -> (sum+x, sumSq+x*x)) (0,0) a
  
  Reply
- Max Schireson on April 23, 2011
  
  (edited) First solution posted as such though we had one from Bob Kimble in a comment earlier so really its second 🙂
  
  Reply
Jamie Rumbelow on April 23, 2011

Are the duplicate values consecutive?

Reply
- Max Schireson on April 23, 2011
  
  not necessarily…
  
  Reply
zvpebfbsg_fhpxf on April 23, 2011

Treat each array entry as an index back into the array, which induces a linked-list in the array. Find the cycle in this list in constant space (another good interview question). The entrypoint into the cycle is the duplicated value.

To find the missing value, take the XOR of all numbers 1 to N, with the exception of the duplicated value. Then, XOR this value with each value in the array. The result is the missing number, since all the others canceled each other out.

I first saw this problem posed by Rapleaf here: https://www.rapleaf.com/careers/challenges#array

Reply
- David K on April 23, 2011
  
  Ah… so I got the cycle detection bit… but was still trying to reverse that and try and find 2 heads to the graph to determine the missing number.
  
  Neat.
  
  Reply
- dm on April 23, 2011
  
  ( 2, 1, … )
  
  cycles on the first two values, yet those may not be duplicated in the rest of the input.
  
  Reply
- Max Schireson on April 23, 2011
  
  Not quite the same problem (assuming you mean question 2 in rapleaf’s challenge). For better or worse, theirs is easier.
  
  I can make the function in rapleaf’s question 1 really really really fast.
  
  Question 3 is interesting and harder.
  
  — Max
  
  Reply
Aaron on April 23, 2011

//Assuming we can use an iterator style interface with the array

//Also assuming we don’t know the array length in advance so we don’t
//want to do something like array.length, since that would involve counting
//the elements, so that’s some extra CPU we don’t want to waste 🙂

var i=1, skipped = 0, duped = 0;

while (_next = next(array) && !skipped && !duped)
{
if (i == _next) continue;
else if (i < _next)
{
skipped = i;
i++; // i needs to catch up!
}
else
{
duped = i;
i–; // i needs to slow down!
}

i++ ;
}

//If you wanted to be clever you could condense the if/else into ternary, but then it would be too damn hard to read.

Reply
- Aaron on April 23, 2011
  
  Just saw the assumption that it’s NOT sorted, which seems obvious now. In that case this solution completely fails 🙂
  
  Reply
Emanuele Di Saverio on April 23, 2011

Ok let me try…

we can to find a couple (missing number, duplicated number)

calculate Sum(1 to N) = N(N+1)/2 (constant)
calculate Sum of input data (linear in N)
delta between the two

now we now that the delta between the missing and the duplicate. If we find one, we find the other. Let’s find the missing

if delta > 0 , missingN is in the interval [1, N – delta]
if delta < 0, missingN is in the interval [delta, N]

now you have the interval of missing

sum all numbers of the interval (linear in N) , be it S

cycle through input data,
for every number in the interval
subtract from S

what is left over is missing number.

1) is that correct?
2) sure it's ugly algorithm
3) doesn't use the "64 bit" piece of the problem statement

had fun trying anyway 🙂

-Emanuele

Reply
- Max Schireson on April 23, 2011
  
  I believe its not correct. Say delta is -1. Missing and duplicate could be say 8,7 or some such. if N is 100 both missing and duplicate are in the interval and you’d get delta again. Which way you were calculating delta wasn’t quite clear, so I may have a sign reversed but I don’t see an obvious way to fix your approach.
  — Max
  
  Reply
Pius on April 23, 2011

This sounds like a job for bitwise XOR.

Reply
- Max Schireson on April 23, 2011
  
  It’s the magic bullet if you have either a missing value or a dupe, but not enough when you have both at once.
  
  Reply
Andy Gelfond on April 23, 2011

Works for N < 2^32, O(N) time, O(1) memory (not including array)

For alpha = 1 to N
let omega = arr[alpha]
if omega has its top bit set, you have the duplicate number (clear omega's top bit to get it) and its position (alpha)
otherwise, set arr[omega]'s top bit

You can find the missing one by finding the index of the element that does not have its top bit set, either in line, or a separate loop

You can then clear the top bits in the array

To try make this work for N < 2^64, I would consider looking at some xor magic, either within the array, or in some extra variable, but I do not have the confidence in this approach to spend the time since it appears all bits would be needed

Part time job for 1/2 an answer ?

🙂

Reply
- Max Schireson on April 24, 2011
  
  🙂
  
  Reply
Ron on April 23, 2011

Do you want solutions posted here or sent somewhere else?

Reply
- Max Schireson on April 23, 2011
  
  Post away (though I’ll check email to, you can find my address in “about”
  
  Reply
Warren Wilkinson on April 23, 2011

CONSTANT total = sum the numbers 0..63
CONSTANT square_total = sum of squares of numbers from 0..63

iterate array,
summing the numbers into sum,
summing square of numbers into square_sum.

LET diff = total – sum
LET square_diff = square_total – sum

duplicate = (square_diff – (expt diff 2)) / (diff * 2)
missing = duplicate + diff.

————

How it works.

Comparing the expected total to the computed total gives the difference between the missing number and the duplicate number.

missing – doubled = some constant

Comparing the expected square total to the computed square total gives another equation:

(missing * missing) – (doubled * doubled) = another constant.

Now that you have two equations, you can solve for both variables.

Reply
- Max Schireson on April 23, 2011
  
  Correct solution. Congrats.
  
  Reply
David K on April 23, 2011

I’m not looking for a job, but having found this via HN thought I’d give it a whirl.

You don’t mention… but is the array sorted?

If it’s not, the solution I found for linear time required 16 exabytes of storage for a bitmap capable of storing the size of the language.

If it’s sorted, then you could use a very small bitmap to track the last X Ns and detect the dupe whenever you set the same flag, and determine the gap whenever you have a flag not set. This fulfils all of the requirements, small space, linear time, and immutable array… but I’m betting it’s not that simple.

If it isn’t sorted, then I’ll take another look.

Reply
- Max Schireson on April 23, 2011
  
  It isn’t sorted, and I’m looking for something without the 16 exabytes 🙂
  Under 1 kilobyte is no problem once you see the answer…
  
  — Max
  
  Reply
  - David K on April 23, 2011
    
    I’m thinking if the array is treated as a directed graph, then cycle detection would find the dupe. I’m less sure how to make that detect the missing element though.
    
    Might knock something up to test the theory. Seems like a tortoise + hare linked list, cycle detection thing.
    
    At least… it’s the only way I currently have to do away with a bit map.
Chintan Shah on April 23, 2011

If SEQ is the array,
D is the duplicate number ,
and M is the missing number, then

First compute sum of all elements in SEQ. Let it be SUM.
(I) N*(N+1)/2 – SUM = D – M /*(Using the identity for sum of first N numbers.)*/
(II) N*(N+1)*(2N+1)/6 – sum(arr[i]^2 = d^2 – m^2 /*(Using the identity for the sum of squares of first N numbers)*/

We have 2 equations and 2 unknowns. Solving these
we can get D and M.

Since N can be large, to avoid overflow we should do the left sides of (I) and (II) through a loop instead of using the formula directly.

The loop for LHS of equation (I) is
int lhs_1st = 0;

for (int i = 0; i < SEQ.length; i++)
{lsh_1st += i + 1 – SEQ[i];}
And the loop for LHS of eq (II) is
int lhs_2nd = 0;
for (int i = 0; i < SEQ.length; i++)
{lhs_2nd += (i + 1 + SEQ[i]) * (i + 1 – SEQ[i]);}
From (I) we have
(III) lhs_1st= D -M
From (II)
(IV) lhs_2nd = D^2 – M^2
Divide D by C we have
(V) lhs_2nd/lhs_1st = D + M

Solving (III) and (IV)
D = (lhs_1st + lhs_2nd/lhs_1st) / 2
M = (lhs_2nd/lhs_1st – lhs_1st ) / 2

Reply
- Chintan Shah on April 23, 2011
  
  Java code for the above:
  
  public static void findMissingAndDuplicate(int[] seq) {
  long eq1 = 0;
  long eq2 = 0;
  for (int i = 0; i < seq.length; i++) {
  eq1 += i + 1 – seq[i];
  eq2 += (i + 1 – seq[i]) * (i + 1 + seq[i]);
  }
  long dup_minus_missing = eq1;
  long dup_plus_missing = eq2 /dup_minus_missing;
  long M = (dup_plus_missing + dup_minus_missing ) / 2;
  long D = (dup_plus_missing – dup_minus_missing ) / 2;
  
  System.out.println("Missing number in the array: " + M);
  System.out.println("Duplicate number in the array: " + D);
  }
  
  Reply
  - Max Schireson on April 23, 2011
    
    Nicely done, basically correct. Accumulating along the way reduces overflow worries.
    
    That said to nitpick I think you still have some overflow worries. Do you see where?
    
    – Max
Nick Parker on April 23, 2011

I think something like this might do it:

int findMissingElementFor (int [] items, int size){

int sum = 0;
int sumOfSquares = 0;
int bound = size – 1;
int sumOfFirstNaturalNumbers = (bound * (bound + 1)) / 2;
int sumOfSquaresOfFirstNaturlNumbers = (bound * (bound + 1) * (2 * n + 1)) / 6;

for(int index = 0; index < size; index++){
sum = sum + items[index];
sumOfSquares = sumOfSquares + (items[index] * items[index]);
}
return ((sumOfFirstNaturalNumbers – sum) + (sumOfSquaresOfFirstNaturlNumbers – sumOfSquares) / (sumOfFirstNaturalNumbers – sum)) / 2;
}

Reply
- Max Schireson on April 23, 2011
  
  Correct but for some overflow worries 🙂
  
  Nice job.
  
  — Max
  
  Reply
Felix on April 23, 2011

I’m stuck and curious if I have chosen the right way.

I loop through the array and update two values:

sum = n1 + n2 + … + nN
x = n1 xor n2 xor … xor nN

In the end, I know that
(1+N)*(N/2) – missing + duplicate = sum
and
missing xor duplicate = x

I guess this could be enough, but I’m not sure how to proceed.

Reply
- Max Schireson on April 23, 2011
  
  On a good path but not quite there. You have two equations in two unknowns but they aren’t independent enough to solve. Say (duplicate-missing) is 1 and the difference is 1. All you know is that they’re consecutive and dup is odd.
  
  — Max
  
  Reply
Fedor Indutny on April 23, 2011

// code javascript
var sum = 0,
xor = 0;

for (var i = 0; i < N; i++) {
sum += i – a[i];
xor ^= a[i];
};

var duplicate = xor,
skipped = xor + sum;

Reply
- Fedor Indutny on April 23, 2011
  
  If job you’re offering will consist of solving such kind of tasks – I’m definitely up for it! 🙂
  
  Reply
  - Rolken on April 23, 2011
    
    It’s easy to demonstrate that your code doesn’t work. If you add the value N+1 to the end of the array, you’ll get that piled into an entirely different xor, so it can’t be the duplicate.
  - Max Schireson on April 23, 2011
    
    It will definitely have some challenging problem solving 🙂
- Fedor Indutny on April 23, 2011
  
  Addition and subtraction can be made in 64bit algebraic field.
  
  Reply
- Martin on April 23, 2011
  
  I am trying your code with this data set:
  
  var a=[10,2,6,9,8,5,9,1,3,7];
  
  (where, 9 is dupe and 4 is miss) And, It does not seems to work…
  
  Reply
  - Fedor Indutny on April 24, 2011
    
    Oh, missed that numbers should start from one.
Attila Gazso on April 23, 2011

Calculate the sum of 1..N (S), and calculate the sum of the given numbers (T). The diff equals T-S=-x+y where x is the missing number, and y is the dupe.

Calculate the product of 1..N (P), and calculate the product of the given numbers (Q). P/x*y=Q or Q/P=y/x

Now we have two equations and two unknown values. From here it’s basic math.

Reply
- itrend on April 23, 2011
  
  To store a value X, you need log(X) bits, hence you need log(N!) bits to store the value of N!. However, log(N!) is O(N*logN) so your solution actually uses [worse than] linear space (see http://en.wikipedia.org/wiki/Factorial#Rate_of_growth ).
  
  Reply
  - Attila Gazso on April 23, 2011
    
    You are right, instead of using product, it would be better to use sum of squares.
- Max Schireson on April 23, 2011
  
  Your corrected approach is correct. Congrats.
  
  Reply
Stephen A on April 23, 2011

a = given array, d = duplicate, m = missing
diff = (sum [1..n]) – (sum a) — m – d
nfact = n!
fdiff = nfact – (foldl1 (*) a) — n!(m-d)/m
m = nfact * diff / fdiff
d = m – diff

Reply
- Max Schireson on April 23, 2011
  
  if the list is really big, calculating n factorial is a problem. say n = 1 billion…
  
  can you come up with an easier way?
  
  — Max
  
  Reply
  - S.G. on April 23, 2011
    
    Take lg of x_i instead of multiplying x_i.
    
    PI(1,N){x_i} => SIGMA(1,N) lg{x_i}
  - Max Schireson on April 23, 2011
    
    can you come up with a solution that doesn’t include floating point math?
  - S.G. on April 23, 2011
    
    Without using FP calculations
    
    Let y = missing number, x = duplicate
    Let alpha = SUM(1,N) = ((N+1)N)/2
    Let beta = Sum(array)
    => alpha – beta = y – x
    
    Let gamma = SIGMA(1,N){x_i}^2 = (N(N+1)(2N+1))/6
    Let delta = Sum(array value^2)
    => gamma – delta = y^2 – x^2 = (y – x)(y + x)
    
    ==> (gamma – delta) / (alpha – beta) = (y + x)
    ===> (y+x) + (y-x) = 2y / 2 = y. Solve for x however you desire.
- Max Schireson on April 23, 2011
  
  got it with the correction, nice job.
  
  Reply
Michal Illich on April 23, 2011

1. Calculate Sum(1..N) – N*(1+N)/2 = DUP-MIS

2. Calculate MultiplyAll(1..N) / N! = DUP/MIS

3. You have two equations with two variables, so you just solve it.

Linear time, very small constant space used.

Reply
- Michal Illich on April 23, 2011
  
  Or Xor instead of multiplication would work also.
  
  (2. XorAll(1..N) Xor XorOfAllNumbersFrom1toN = DUP XOR MIS)
  
  Reply
- Michal Illich on April 23, 2011
  
  Well, the best second step is sum of squared numbers – but I have to admit that this idea isn’t mine.
  
  Reply
  - Max Schireson on April 23, 2011
    
    Yes; the XOR doesn’t actually get you enough additional information to solve. As discussed in some comments the original approach with multiplication is not constant space (more than linear).
    
    — Max
Drake Wilson on April 23, 2011

Well, I’m going to assume you want it in a comment since I don’t see another preferred contact method in the post.

Something that is quite clean mathematically and has the right complexity but isn’t necessarily good-performance on the microscale:

overwrite_delta = sum(1..count) − sum(array)
squares_delta = sum((x → x^2) (1..count)) − sum((x → x^2) array)
overwrite_sum = squares_delta / overwrite_delta
duplicate = (overwrite_delta + overwrite_sum) / 2
missing = overwrite_sum − duplicate

Of course you probably want to unify the loops so you don’t do multiple passes over the array and thrash the cache.

Possible problem: this requires integer arithmetic with integers of something like 3*ceil(log2(N)) bits (haven’t done the proof for that part). We know that N ≤ 2^64 − 1 (otherwise the array is impossible), so it’s still O(1) with “small” amounts of extra space (a few hundred bytes?). Nonetheless the big integers might turn out to be very inconvenient to implement if you have to make them yourself.

It also requires O(N) 64×64-bit multiplications and a single giant divmod near the end. (The / 2 can be a cheap bitshift.) If N is large enough and if the ALU/memory speed is high enough, I suspect this is okay, especially since the multiplications will parallelize well with memory accesses, but I haven’t measured.

Whether the microperformance issues matter depends on the situation.

Email me if you want to see the C99 code that handles the 20-bit case (I didn’t want to bother with the large integers for testing).

Reply
- Drake Wilson on April 23, 2011
  
  An extra note: I see a lot of earlier solutions and/or pseudo-solutions (haven’t analyzed them all) with a combination of summation and mass XOR. I fiddled with something along those lines but decided not to go that route. This was partly because it took longer than a few minutes and looked like a potential dead end (and I didn’t want to spend the entire day on this), and partly because this answer has amusing implementation characteristics while being very easy to understand and fitting the constraints. (Assuming a cute sum+xor solution exists, it’d be better for a continuous process in actual production, but this was a puzzle, so.)
  
  Reply
  - Max Schireson on April 23, 2011
    
    XOR is a dead end, feel glad you didn’t spend more than a few minutes.
    
    Your solution is good. Good job catching the overflow worries.
    
    — Max
  - Max Schireson on April 23, 2011
    
    Upon reading further, I found a solution which salvages the XOR (though not in conjunction with the sum; he found a way to do an additional XOR to get the right answer), so it wasn’t a total dead end (thought it seemed that way to both of us!)
Frank Denis on April 23, 2011

Horrible solution (and it assumes that numbers are starting at 0) but I wanted to quickly do something before going to bed:

bogus = []
a.each_with_index { |n, i| a[n], a[i] = n, a[n] }
a.each_with_index { |n, i| bogus << n << i if n != i }
puts "Bogus numbers: #{bogus.uniq.join ' and '}"

Reply
Alex Lerman on April 23, 2011

expectedSum = 0
actualSum = 0
expectedFactorial = 1
actualFactorial = 1
for( i=0; i<n; i++) {
expectedSum += ( i + 1 )
actualSum += arr[i]
expectedFactorial *= ( i+ 1 )
actualFactorial *= arr[i]
}

diff = actualSum – expectedSum = a -b
quotient = actualFactorial / expectedFactorial = a / b

We have 2 equations and 2 unknowns, go forth and calculate. ( We can use the sign of the diff to determine which is the missing and which is the duplicate. )

Reply
- Lim Kim Yong on April 23, 2011
  
  Not a very practical answer. for large N the factorial is going to overflow pretty quickly.
  
  Reply
Fahad on April 23, 2011

“You know that every integer 1-N appears once in the array,” – Based on this I’m assuming by 1-N you mean 1 to N. I may be wrong, but here is my solution for this,

1) Sort the N-array Or have the array maintained in sorted order.
2) Form tuples with pair-wise order (as tuples), which can be described as below,

>>>

returns a sequence of each element in input sequence and its predecessor, with the exception of the first element which is only returned as a predecessor of the second element.

If there is a sequence Seq then it returns Seq. As an example, if we have the N-array (sorted that is),

1, 2, 3, 4, 5, 5, 7, 8, 9…. N

the sequence generated is,

(1, 2)
(2, 3)
(3, 4)
(4, 5)
(5, 5)
(5, 7)
(7, 8)
(8, 9)
…
(N – 1, N)
>>>

We have three check conditions here,

for el in [pair-wise array] do
if second(tuple) – first(tuple) = 1 then
“Normal couple” // just go the next item in the pair-wise array
elif second(tuple) – first(tuple) = 0 then
“Duplicate”
elif second(tuple) – first(tuple) > 1 then
Missing = first(tuple) + (second(tuple) – first(tuple) – 1) // example: Missing = 5 + (7 – 5 – 1) = 6

The above algo is simple and linear without doing any rescan of the whole array or finding anything. The alg just checks these items when it is looping thru, which I think should be done.

I’m thinking in terms of F# language here. Let me know what you think on this.

-Fahad

Reply
- Max Schireson on April 23, 2011
  
  Sorting the array takes more than linear time, and you need separate (linear) space to do it unless you want to clobber the array you got (which I think I asked you not to do).
  
  — Max
  
  Reply
Bala on April 23, 2011

I have seen few solutions above and all are suggesting either sorting, additional O(N) storage or map reduce(involves sorting internally) solutions, when this can be done without any storage.

Let suppose the duplicated number is X, and missing number is Y.

If you sum numbers from 1 to N, let it be SN
Sum the numbers in the array, lets call it SM

SN – SM = Y-X

Now, you just need one more equation in X and Y to figure out X and Y. We can either choose either to multiply 1 to N and divide by the array product.This will give Y/X. One might be concerned that product of N 64-bit integers might overflow.

Other option is to use XOR from 1 to N, and then XOR the given array. So we get X xor Y.

— I have other solution in mind.

Because the elements are from 1 to N, one could put element in the correct index and find the duplicated number. To make sure that we don’t keep visiting the same elements again, we can negate the numbers. And in one more pass, one could reverse the sign of the integers. The code will look as below.

public class FindMissingAndDuplicate {

public static void main(String[] args)
{
int arr[] = {4,5,2,3,6,9,6,1,8,0};

int duplicateNumber = -1;

for(int i = 0; i < arr.length; i++)
{
int j = i;

while(arr[j] != j)
{
if(arr[arr[j]] == arr[j])
{
duplicateNumber = arr[j];
break;
}
int temp = arr[arr[j]];
arr[arr[j]] = arr[j];
arr[j] = temp;

j = arr[j];
}

if(duplicateNumber != -1)
break;
}
System.out.println("Duplicate Number " + duplicateNumber);

int sum = 0;
int sumi = 0;
int missingNumber = 0;
for(int i = 0; i < 10; i++)
{
sumi += i;
sum += arr[i];
}

missingNumber = sumi + duplicateNumber – sum;

System.out.println("Missing Number " + missingNumber);

}

}

Reply
- Max Schireson on April 23, 2011
  
  I haven’t quite verified if the code matches the algorithm you described, but I asked you to leave the original array untouched.
  
  Agree with your concerns on overflow of the products.
  
  XOR doesn’t get you enough additional information.
  
  — Max
  
  Reply
Fahad on April 23, 2011

There seems to be a smiley in my last example, must be due to some combination, do ignore it!

Reply
Daniel Rodríguez on April 23, 2011

Supposing we have a Big Integer library for doing the maths without overflowing. Let’s call x and y to the missing and the repeated numbers.
1. Sum all the array.
2. Calculate N(N+1)/2.
3. Calculate (1)-(2). Call it Z. That’s equal to x – y, so x – y = Z.
4. Sum the squares of all the array.
5. Calculate N(N+1)(2N+1)/6.
6. Calculate (4)-(5). Call it W. That’s equal to x^2 – y^2, so x^2 – y^2 = W.
7. Substitute (3) into (6). (Z+y)^2 – y^2 = W -> Z^2+y^2+2Zy-y^2 = W -> y = (W-Z^2)/2Z.
8. Substitute (7) into (3). x = Z – (W-Z^2)/2Z.

Linear time, constant space.

Reply
- Max Schireson on April 23, 2011
  
  correct.
  
  Reply
  - liruqi on April 24, 2011
    
    if you can ensure that N is less than 2^63 (and permit write operation on the array), that is fit signed 64 bit int type, there is a way quicker:
    
    calculate x – y = Z as above (step 2)
    
    while sweeping through the array($arr), for the $i’th array item, set the corresponding index value highest bit, that is: $arr[ $arr[i] ] |= (1<<63), and check the highest bit before this operation… then we can find the duplicated y in this way.
    
    the point is using the highest bit to find duplicated one.
  - Max Schireson on April 24, 2011
    
    interesting suggestion. you’re kind of sneaking in a bit vector in unused space 🙂
    
    ironically if the array isn’t in physical ram (a real possibility with a large data set) it will be _much_ slower this way as you’ll be swapping things in off disk to manipulate that one bit.
    
    if its all in RAM and apart from the space, the bit vector will be fast.
    
    — Max
Zack Maril on April 23, 2011

Not a computer science dude, but try counting digits:

If you have a set of numbers ranging 1 to N each represented by 64 bit integers, then you know that there needs to be a certain number of repetitions of 0’s and 1’s within the various places, i.e. 8 1’s in the one’s place, 8 0’s in the ten’s place, etc. etc. The expected values can be calculated explicitly from the value of n.

Tally up the number of 0’s and 1’s from the actual set of n integers and then compare that to the calculated expected total. If you think in terms of digits, a repetition will add one to the expected values across the board and a deletion will subtract one across the digit matrix. If it were just an either addition or deletion, then you would just look in each column for the digit which had an incorrect value compared to the expected and then put all of the erroneousness digits together into the missing or repeated number.

But when there are two operations on the set of numbers it gets way more complicated. Let’s say you have n=1111 and you repeat 1110 and delete 1100. Normally you would just look for the varying values. But this time you would see that only the second place would show any error. The deletion and repetition cancelled each other out! (Note: this is the worst case scenario and the following method should work for something like n=1111 and 1110 1101 but faster.)

But we do know in which place they varied! We know we have one too many 1’s and one too few 0’s in the second place. Do the same tallying routine again but this time separate the set of numbers into those that have 1’s in the second place and those that have 0’s in the second place. Since we know that these sets cannot intersect by definition, when we run the tally algorithm on them, we know there will only be one error in each resulting table of information. If you calculate a modified digit matrix holding the correct values constant, i.e. all integers less than n with 1 in the second column, then compare that to actual digit array, using the same variation method you should be able to just read the integer off the table.

Since you are only searching through the array 3 times and tallying up things into a 2 by 64 array of characters, the space and time feels like it would be small. If done correctly, the code for it shouldn’t be too massive and it feels like it would be about as fast as you can search through the array. It could probably generalize to N errors of the same types, but you would need a better splitting algorithm for when things overlapped.

But, I haven’t really formally studied computer science like this much so I could be way out in left field here. Hope this works!
-Zack

Reply
- Max Schireson on April 23, 2011
  
  Good thinking. Your solution is unique but I think describes a workable approach. Very strong problem solving. If you are the Zachary Maril who is a freshman at Texas A&M, study computer science and math and you’ll do well.
  
  Though you take 3 passes through, you can actually do the bitwise comparisons very fast. It wouldn’t surprise me if your solution is actually faster than many of the others that take one pass.
  
  Do you know the XOR function? To accomplish the first two paragraphs of your solution, all you need to do is loop through the array; start with zero and for each element XOR the value and the index with the previous XOR result.
  
  Congrats on a unique solution.
  
  — Max
  
  Reply
  - a non on April 24, 2011
    
    Can you explain a little more about the xor method? I understand if you were to xor the array values (XA), xor with the expected xor values (XN), then XA^XN = M^N. But I am a bit fuzzy on what the second and third passes are supposed to do…
  - Zack Maril on April 24, 2011
    
    Cool! I had heard of the XOR function before this, but computer logic isn’t a strong point of mine so I went with something that seemed right. XOR would have made things much easier to explain though in retrospect.
    
    More thoughts on this digit matrix stuff:
    1. When you have overlapping errors, you can just do one more pass after the initial one. You can be populating both of the new arrays at once with two conditionals. It’s just a two pass technique.
    2. I think there is potential for one more trick to get it down to one pass. If you look at the numbers in different bases, then the digits in the corresponding representations should fit a new pattern for numbers of digits.
    New example: n= 1111111 with repeated 1110111 and deleted 1100111.
    Converted:
    B2- 1111111 1110111 1100111
    B6- 331 315 251
    
    So you would get 2 tables like this after subtracting the expected values:
    
    Binary
    Index: 6 5 4 3 2 1 0
    0: 0 0 1 0 0 0 0
    1: 0 0-1 0 0 0 0
    
    Base 6:
    Index: 2 1 0
    0: 0 0 0
    1: 0 1-1
    2: -1 0 0
    3: 1 0 0
    4: 0 0 0
    5: 0-1 1
    And so you can just read a repeated 315 and a deleted -251 off the chart!
    
    Drawing from previous solutions, you would only need two arrays for this method. You can populate the arrays at the very beginning with the expected values and then subtract off the digit representations as you go.
    
    For this specific example, you would only need a 64 by 2 array for the binary and then a 25 (rounded up of log(2^64)/log(6)) by 6 array for the base 6 one.
    
    Problems/Ideas:
    Don’t pick a base that is a power of two. After some experimentation, you’ll see that the presentation continues to vary only in one spot from the binary due to the way the groupings of digits work.
    In this example I picked 6 because it made everything work, but I am not sure how the numbers affect what the correct base is yet. There probably isn’t a “perfect” base for every example. There might be a way to make a really good guess though. Not sure.
    The real problem is that I think that shifting bases moves it out of the bitwise operations space and into a slower more complicated space. Especially if you are shifting into a base that isn’t a power of 2.
    I think it could deal with more errors than just 2 by playing it like a weird suduko. Every time you read a new deleted or repeated number from the charts, you update all the charts and keep going until you have no more errors on any of the charts.
    A problem is that you could have something like n=999 in binary and then delete/repeat 919/929 and 828/818. The errors would end up all canceling each other out in a decimal representation, but hopefully would still remain in the other base representations. I haven’t found an example where it would cancel on 2 different base charts though but I haven’t proved it either way.
    
    I think this is still within the solution space of this problem. It takes a constant space up and takes only one pass. If you can make a good guess for the base, then this would be fun to use.
    
    Thanks for the interesting problem! (Sorry for the long responses with no code.)
Torsak Chaiwattanaphong on April 23, 2011

sum = 0;
sumIndex = 0;
array[0…2^64];
bool firstNotEqual;
for (int i=1; i sumIndex) println(array[array[i]] + ” missing”);
if ( sum < sumIndex) println((array[array[i]]-1) + " dup");
break;
}
}

you can find duplicate value by calculating:
dup = (sum(1..N) – sum(array[1]..array[N])) + missing value
or
missing = (sum(1..N) – sum(array[1]..array[N])) + dup value

Reply
Torsak Chaiwattanaphong on April 23, 2011

sum = 0;
sumIndex = 0;
array[0…2^64];
for ( i = 1 to i sumIndex ) then array[[array[i]] is missing then break
if ( sum < sumIndex ) then array[[array[i]] -1 is dup then break
endfor

you can find duplicate value by calculating:
dup = (sum(1..N) – sum(array[1]..array[N])) + missing value
or
missing = (sum(1..N) – sum(array[1]..array[N])) + dup value

Reply
- Max Schireson on April 24, 2011
  
  I can’t quite follow your logic here, sorry.
  
  Reply
Mike D on April 23, 2011

2 step solution:
Duplicate:
public static int findWithXOR(int[] input){
int xorArray=1;//,dup=0;
for(int i=2; i<=(input.length-1);i++){
xorArray = xorArray ^ i;
//System.out.println(“XOR : ” + xorArray);
}
System.out.println(“XOR Array : ” + xorArray);
for(int k=0;k<=input.length-1;k++){
xorArray = xorArray ^ input[k];
System.out.println(xorArray);
}
//System.out.println(“Duplicate : ” + dup);
return xorArray;
}

Missing:
int sum = 0;
int idx = -1;
for (int i = 0; i < arr.length; i++) {
if (arr[i] == 0) {
idx = i;
} else {
sum += arr[i];
}
}

// the total sum of numbers between 1 and arr.length.
int total = (arr.length + 1) * arr.length / 2;

System.out.println("missing number is: " + (total – sum) + " at index " + idx);

Reply
- Mike D on April 23, 2011
  
  no comment?
  
  Reply
  - Max Schireson on April 24, 2011
    
    I think there are a few problems with your code. I don’t think the dup will work, nor the missing?
    
    — Max
Kunal on April 23, 2011

2 Arrays, 1 of the unordered list of values. One of 0 to n-1 all defaulted to zero. Iterate through the unordered list and mark a -1 on the index that correlates to. After you are done, scan through the array and -2 is the dupe, 0 is the missing element.

Reply
- Martin on April 23, 2011
  
  Yeah, but first you need to generate the “good” array, with highs N, it will take time only to generate that array!
  
  I think that this is not what they want.
  
  Reply
Justin Holmgren on April 23, 2011

Call the missing number x and the additional number y.
We will find y – x and y^2 – x^2. Then dividing, we can get y + x which will in turn give us y and x.

Sum all of the integers. This will be N(N+1)/2 + y – x
Sum all of the integers squared. This will be N(N+1)(2N+1)/6 + y^2 – x^2

Now you’re sort of done. To make it actually constant space, you should do all of your summing mod 2^256, since y^2 – x^2 and y – x will certainly be between -2^128 and 2^128 (so when you find them mod 2^256, that’s the same as finding their actual values)

Reply
- Justin Holmgren on April 23, 2011
  
  Oops instead of mod 2^256, I meant mod 2^129
  
  Reply
Carl-Johan Blomqvist on April 23, 2011

A solution from a business person’s point of view would be to post on a geek forum asking how to solve the issue… 😉

Reply
- Vivek on April 23, 2011
  
  Ok You should failed for using the words “geek forum” ….
  
  Reply
Charles Keepax on April 23, 2011

See I would have disabled comments on this post. I take it posting a correct answer would be uncool?

Reply
- Max Schireson on April 23, 2011
  
  I’m allowing interesting attempts because I think the discussion is fun, but holding the correct answers a little while 🙂
  
  Reply
sparc5 on April 23, 2011

Assume the duplicated one is y and missing one is x while input array is
array[1..n]

sum = 0
for ( int i = 1; i <= n; i++ ) {
sum += array[i];
}

// diff is (y – x)
diff = sum – (n(n – 1)/2)

prod_sum_diff = 0;
for ( int i = 1; i <=n; i++ ) {
prod_sum_diff += array[i] * (array[i] – diff ) – i * (i – diff );
}

// pro_sum_diff is y * (y – (y – x ) ) – x * ( x – ( y – x ) ) = – 2x^2

x = sqrt( – ( prod_sum_diff / 2 ) )
y = diff – x

complex is o(n)

Reply
Jason Zhang on April 23, 2011

Assume the duplicated one is y and missing one is x while input array is
array[1..n]

sum = 0
for ( int i = 1; i <= n; i++ ) {
sum += array[i];
}

// diff is (y – x)
diff = sum – (n(n – 1)/2)

prod_sum_diff = 0;
for ( int i = 1; i <=n; i++ ) {
prod_sum_diff += array[i] * (array[i] – diff ) – i * (i – diff );
}

// pro_sum_diff is y * (y – (y – x ) ) – x * ( x – ( y – x ) ) = – 2x^2

x = sqrt( – ( prod_sum_diff / 2 ) )
y = diff – x

Reply
Gonzalo on April 23, 2011

In one pass you can sum the numbers and sum the squares of the numbers.

Their sum reveals X – Y.

Their sum of squares reveals X^2 – Y^2 = (X – Y)(X + Y).

Assuming N < 2^63, it suffices to use arithmetic in 64 bits for the sum and 128 bits for the sum of squares. Memory usage is 24 bytes.

Reply
- Gonzalo on April 23, 2011
  
  Actually, you could choose a prime p close to 2^64, and do all the arithmetic mod p. Assuming N < p, then X-Y is always invertible.
  
  Memory usage would then be just 16 bytes.
  
  Reply
  - Max Schireson on April 24, 2011
    
    Very good. Assume you’re sticking with the math not wanting a programming job 🙂
    
    — Max
Jason Zhang on April 23, 2011

A little update
———————
Assume the duplicated one is y and missing one is x while input array is
array[1..n]

sum = 0
for ( int i = 1; i <= n; i++ ) {
sum += array[i];
}

// diff is (y – x)
diff = sum – (n(n – 1)/2)

prod_sum_diff = 0;
for ( int i = 1; i <=n; i++ ) {
prod_sum_diff += array[i] * (array[i] – diff ) – i * (i – diff );
}

// pro_sum_diff is y * (y – (y – x ) ) – x * ( x – ( y – x ) ) = – 2x^2

x = sqrt( – ( prod_sum_diff / 2 ) )
y = diff + x

Reply
- Max Schireson on April 24, 2011
  
  I will get back to yours later I think it is correct but I am too tired to check the prod_sum_diff math right now 🙂
  
  Reply
Andrew on April 23, 2011

// PHP Code
// I assume that the array contains consecutive integers,
// is sorted ascending, and the first element contains 1

$numbers = array(1, 2, 3, 4, .. , N);

for ( $i=0; $i< sizeof($numbers) ; $i++ )
{
$value = $numbers[$i];
$estimate_value = $i + 1;
if ($value != $estimate_value)
{
echo $value . " is the duplicate value\n";
echo $estimate_value . " is the missing value\n";
break;
}
}

Reply
- Max Schireson on April 23, 2011
  
  The problem is too easy when they’re sorted. Try it without that assumption.
  — Max
  
  Reply
Lim Kim Yong on April 23, 2011

My solution if N is even

consider the case that then extra number is odd and missing number is even and extra number is even and missing number is odd:

add all even value numbers and add all odd value numbers

you can find the extra and missing numbers by finding the
difference between the calculated sum and the sum of all the terms from 1 to N (for even and odd)

if the additional numbers are both odd or even, you can find
that by looking at the number of odd/even numbers.

just take the respective group of numbers (even or odd)
and make a 1 bit shift to the right thus removing the
LSB. Recurse the above steps.

Best case O(1), worst case O(2N), requires at least N/2
storage components.

Reply
- Lim Kim Yong on April 23, 2011
  
  I meant at most N/2 storage components.
  
  Reply
  - Max Schireson on April 24, 2011
    
    Unique!
    But if say the extra and missing numbers are divisible by a large power of 2, you have to repeatedly remove bits and re-count to see if you’ve split them yet. That requires either linear storage or up to log n passes unless you have a clever way around that?
Igor Evsukov on April 23, 2011

Kids stuff, I’ve solved that kind of problems in 9nth grade: no jokes, really, I studied in very strong math school. Please tell email address to which I should send solution(description + source code) ?

Reply
- Igor Evsukov on April 23, 2011
  
  Sorry, I need to go, I’ve sent email with description and source code to Your personal email on Gmail.
  
  Reply
  - Max Schireson on April 24, 2011
    
    I got your email and your algorithm is correct. Haven’t checked your code yet.
- Max Schireson on April 24, 2011
  
  The soviet system for all its flaws had some spectacular math schools. A friend of mine had two of his math teammates from high school win the Fields medal (only given every 4 years).
  
  Reply
ijrussell on April 23, 2011

You need four values;

sum{1..n} n
sum{1..n} value
sum{1..n} n ^ 2
sum{1..n} value ^ 2

The difference between the sum of n and value gives you the difference between the duplicate and missing values.

Then using ((x ^ 2) + ((x + difference) ^ 2) = abs(sum of n ^ 2 – sum of value ^ 2), solve for x.

Reply
- Max Schireson on April 24, 2011
  
  good solution.
  
  Reply
AS on April 23, 2011

Here’s a solution that runs in O(n) time and O(n) space.

import java.util.HashSet;
import java.util.Set;

public class Test {

public static void main(final String[] args) {
final long[] data = {1, 2, 3, 4, 5, 6, 7, 8, 3, 10};

final Set seen = new HashSet();

long duplicate = -1;
long difference = 0;

for (int i = 0; i < data.length; i++) {
final long current = data[i];

if(!seen.add(current)) {
duplicate = current;
}

difference += (current – i – 1);
}

System.out.println("duplicate " + duplicate);
System.out.println("missing " + (duplicate – difference));
}
}

Reply
- AS on April 24, 2011
  
  Here’s an improved solution. Runs in O(n) time and O(1) space.
  
  import java.math.BigInteger;
  
  public class Test {
  
  public static void main(final String[] args) {
  final long[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 2};
  
  BigInteger sumDifference = BigInteger.ZERO;
  BigInteger squaredDifference = BigInteger.ZERO;
  
  for (int i = 0; i < data.length; i++) {
  final BigInteger current = BigInteger.valueOf(data[i]);
  final BigInteger index = BigInteger.valueOf(i + 1);
  
  sumDifference = sumDifference.add(current).subtract(index);
  squaredDifference = squaredDifference.add(current.pow(2)).subtract(index.pow(2));
  }
  
  final BigInteger duplicate = sumDifference.pow(2).add(squaredDifference).divide(sumDifference.multiply(BigInteger.valueOf(2)));
  final BigInteger missing = duplicate.subtract(sumDifference);
  
  System.out.println("duplicate " + duplicate);
  System.out.println("missing " + missing);
  }
  }
  
  Reply
  - Max Schireson on April 24, 2011
    
    Good solution AS.
Brian on April 23, 2011

I’d ask the guy who makes the array generation routine to cache the omitted and duplicated values. Then I’d grab a pipe wrench and chase down anyone who alters my array without accessors.

Reply
- Max Schireson on April 24, 2011
  
  Your colleagues need to either write solid code or wear a helmet i guess!
  
  Reply
Scott Ruhnau on April 23, 2011

In this problem let X be the missing integer and Y be the repeated integer. Now we have 2 variables and as such need two equations to solve them.

First we sum the number from 1 to N, a linear time operation. Call this sum A. Also sum the integer in the array call this sum B. Perform a subtraction: B – A = C. Now we know that Y – X = B – A = C.

Now for the second equation we repeat the process with multiplication. At this point we must make one of two assumptions either we will not overflow a 64 bit integer or that we can perform effectively constant time multiplications on whatever size integer we need. So take the product of the number from 1 to N call this D. Take the product of the numbers in the array call this E. Divide them E/D = F. We know that F = Y/X.

Now we have two simple equations and can perform a simple algebraic solve:

F = Y/X
Y = FX
C = Y – X
C = FX – X
X = (F-1)/C
Y = F(F – 1)/C

The problem here comes from the assumption of constant time multiplication.

Reply
- Scott Ruhnau on April 23, 2011
  
  It also uses constant space which is the hard part of this problem. Without the space and rearrangement constraints simple solutions using a radix sort or using the idea of counting sort would be possible.
  
  Reply
  - Max Schireson on April 24, 2011
    
    Scott,
    Correct given constant time and space to multiply, but the numbers get linearly longer if you use say a bigint, so it actually isn’t constant space.
    — Max
  - Scott Ruhnau on April 24, 2011
    
    Fair enough. The sums of squares solution may also overflow a 64bit integer though. Less likely and too a lesser but still a possibility you may end up with 16777216 + a lot of other stuff.
    
    But yes my solution was inferior, haha. Poor me no interview.
  - Max Schireson on April 24, 2011
    
    Much less likely. Sum of squares will overflow around a million elements but if you use a bigint it will be 192 bits or less. But when the sum of squares is 192 bits, the product will take exabytes to store!
    
    You can always send in a resume!
    
    — Max
  - Scott Ruhnau on April 24, 2011
    
    I think I have a solution to the size problem. We can use the properties of multiplication and division to divide as we go. So instead of multiplying each array element and then dividing. We can setup a variable, F in the above, initialized to 1. As we iterate from 1 to N through the array we can multiply F by the array element and divide it by the index. For example on the first element: F = (F* Array[1])/1 and for the ith element: F = (F*Array[i])/i
    
    This way we end up with the same end result but can use a much smaller amount of space.
  - Max Schireson on April 26, 2011
    
    Interesting and makes incremental improvements, but doesn’t solve it. In a worst case where array[i] is reverse sorted, the first 2^31 operations will all at least multiply the result by 2 so you will have a gigabytes-sized product. There is a clever way to avoid this that so far no-one has suggested (hint: requires some advanced math).
    
    — Max
Max Schireson on April 23, 2011

Just published the correct answers (which I had been holding back for a little while) along with a number of other posts, including a funny one from Brian.

I’ll catch up on responding individually after dinner.

Thanks for a lot of discussion, and a few good candidates!

This is fun, we’ll have to do it again.

— Max

Reply
- pepe on April 23, 2011
  
  Where is the answer?
  
  Reply
  - Max Schireson on April 24, 2011
    
    Sums and sums of squares is by far the most popular solution.
    
    There was a unique solution which used XOR. I will try to paraphrase it here somewhat more concisely than originally stated; hopefully the rest of the posts provide enough context to follow:
    1. XOR the values and the indexes, the result is DUP XOR MISSING
    2. Choose one non-zero bit of that, say bit j.
    3. XOR again the values and indexes – but only when bit j is zero. This will give one of DUP or missing, but you won’t know which
    4. Then go back through and look for those values to see which is which (omitted from the post but I was so impressed with the originality I didn’t quibble)
    3 passes but no bigints (you can do the sums mod 2^64).
    
    — Max
- a non on April 24, 2011
  
  Where are the answers? Seems like calculating sum and square is the only right way…
  
  #!/usr/bin/env perl
  
  my $len = 10;
  my @a;
  for my $i (1 .. $len) {
  $a[$i-1]=$i;
  }
  
  my ($m, $n) = (0,0);
  while ($m == $n) {
  ($m, $n) = (int(rand($len)),int(rand($len)));
  }
  $a[$m] = $n+1;
  
  print “Missing : “.($m+1).”\n”;
  print “Duplicate: “.($n+1).”\n”;
  
  print join(“,”,@a).”\n”;
  
  shuffle(\@a);
  
  print join(“,”,@a).”\n”;
  
  my ($na, $nn, $sa, $sn) = (0,0,0,0);
  
  my $sn = ($len*($len+1))/2;
  my $nn = ($len*($len+1)*(2*$len+1))/6;
  
  for my $i (1 .. $len) {
  $sa += $a[$i-1];
  $na += $a[$i-1] ** 2;
  }
  
  my $sd = $sa – $sn;
  my $nd = $na – $nn;
  
  my $n2 = ($nd – ($sd**2))/(2*$sd);
  my $m2 = $sd + $n2;
  
  print “Missing : $n2\n”;
  print “Duplicate: $m2\n”;
  
  sub shuffle {
  my $array = shift;
  my $i;
  for ($i = @$array; –$i; ) {
  my $j = int rand ($i+1);
  next if $i == $j;
  @$array[$i,$j] = @$array[$j,$i];
  }
  }
  
  Reply
  - Max Schireson on April 24, 2011
    
    Basic algorithm looks good and i think you got all the +1’s and -1’s correct.
    
    Don’t know perl so i could easily have missed overflows, and didn’t check the shuffle since generating the data set wasn’t part of the problem.
    
    — Max
Binh Nguyen on April 23, 2011

Called the missing number m, duplicate d.
We will find m+d and m-d. Then it’s trivial to find m and d.

1) Find n XOR d:
X_array = 1;
I_array = 1;
for (i=2..N)
X_array = X_array XOR A[i]; I_array = I_array XOR i;
endfor

Then m XOR d = I_array XOR X_array

2) Find m – d
It will be a little bit more complicated. The idea is to get the sum of (1..N) subtract the sum of our array. But we have to be careful not to over or underflow the result.

Diff = 0;
i = j = 0; // i index I, j index A
while (i < N) && (j < N)
if (Diff + i < N-1)
Diff = Diff + i;
i++; //consume one in I
else
Diff = Diff – A[j];
j++; // consume one in J
endif
end

// Need some corner case check here to but I will omit

3) So we have (m XOR d) and m – d. Also, note that m XOR d is basically m + d without the carry bit so it will be trivial to get m and d.

running time : 3 N <- can be squeeze to 2N if merge the two logic
space : O(1)

Reply
- Binh Nguyen on April 24, 2011
  
  Correction:
  Once we find m XOR d, it’s not trivial to get m+d.
  
  Assume d XOR m = a. Find k = position of the first bit (from MSB to LSB) in a. Then d + m can be in [a + 2^(k+1),a + 2^(k+2),…,a + 2^63].
  We will have less than 63 cases. In each case, after we find d and m, make sure they are in range and then linear test.
  So in the end, we may need to run at most 63+3 = 66 N. But it still linear though.
  
  I believe it will work, but it may not be the most beautiful solution.
  
  PS: I try to solve the problem just purely for fun. I already had a job! 🙂
  
  Reply
  - Max Schireson on April 24, 2011
    
    I’m not quite following what you’re doing here, but it might be me at this late hour. There is a way to fix the XOR to produce D and M (but not know which is which) in one additional pass, its in one of my prior comments.
    
    — Max
Mark on April 23, 2011

Hi Max!

I will dispense here with code and verbalize the algorithm. If I’m close and you would like me to fill in the blanks… well you know the answer.

Using four variables, iterate (single pass) the array determining for each element:

min?, max?, count, and sum … and a fifth variable – or flag if you will.

Once the length of the array is determined by count, and using comparison operations on each element to obtain the minimum and maximum integers in the array, we can easily arrive at:

max(max+1)/2 – min(min+1)/2 – sum = duplicate integer – missing integer

Thanks to Mr. Gauss…

Our fifth variable is a “flag” (var duplicate) ; each element must either update the min? or max? from the previous element (appearing only once – the next element is either greater than or less than the current min? or max? variable); if not, that element is the duplicate integer, and from the above, we then can also determine the missing integer. Our count variable must accommodate the 64 bit integer sum of course.

I’m not a ninja programmer – but I am having fun putting monogdb to good use!

Reply
- Max Schireson on April 24, 2011
  
  Glad you’re having fun with mongodb!
  
  I think your assertion that each element updates either min or max is false. Say N=100, A[1]=1 and A[2]=100. Then each additional element will not update min or max and you will automatically think A[3] is the dupe.
  
  — Max
  
  Reply
Colin on April 23, 2011

def qs(a)
(pivot = a.pop) ? qs(a.select{|i| i pivot}) : []
end
def pa(a)
qs(a).each do |e|
@current = el
begin
state = @current – @previous
rescue
unless e == 0
state = -1
end
end
case state
when -1
puts “skipped number is #{0}”
@skipped = 0
when 0
puts “repeated number is #{@current}”
@repeated = @current
when 2
puts “skipped number is #{@current – 1}”
@skipped = @current -1
end
@previous = @current
end
@repeated
@skipped
end

You could easily speed this up doing only an xor on the right most bit and using registers, but that’s a little hard to illustrate in ruby. Maybe I’ll give this a shot in C later on tonight.

Reply
- Colin on April 23, 2011
  
  Mistake on line 6. Typo on the value, change that from “el” to “e”.
  
  Reply
  - Max Schireson on April 24, 2011
    
    I can’t understand ruby to save my life but i think you might be assuming the list is ordered?? Either C or plain english are best for me (but i understand java and c++ reasonably well and can muddle through perl easier than ruby)
    
    — Max
Julien Bouvier on April 23, 2011

I have used the sum and sum of squares of the integers within the array to find the duplicate and missing values.

My analytical solution is online at:

Click to access Max%20Schireson%20Algorithm%20by%20Julien%20Bouvier.pdf

Reply
- Max Schireson on April 24, 2011
  
  Good solution Julien.
  
  Reply
Kabir on April 23, 2011

# Python solution, returns (duplicate, missing)
# Uses Python BigInts but all numbers are no larger than 128 bits

def findDupMiss(array):
n = len(array)
correct = n*(n+1)/2
accumulator = sum(array)
difference = correct-accumulator
for number in array:
tmp = accumulator – number + difference
if tmp == correct: return number, number + difference

Reply
Top Posts — WordPress.com on April 24, 2011

[…] Want a job working on mongoDB? Your first online interview is in this post I like smart programmers. […]

Reply
- Max Schireson on April 24, 2011
  
  Thanks (blush)
  
  Looks like this post is number 34 in wordpress right now. Ahead of teen stabbings on CBS NY but behind Philles shutouts in San Diego.
  
  — Max
  
  Reply
Daniel Plaisted on April 24, 2011

You can do this in two passes over the array.

First, compare the sum of the elements of the array to the sum of the integers from 1 to N. The difference between these sums is the difference between the duplicate and missing number.

Now you want to find a way to partition the numbers from 1 to N into to sets, so that the missing number is in a different set than the duplicate number. To figure out which set a number is in, divide it by the difference between the duplicate and missing numbers. If the result is even, it belongs to one set, if the result is odd, it belongs to the other set.

So you make a second pass through the array, calculating the sum of the elements that belong to the even and odd sets. You also do the same calculation for the numbers from 1 to N to get the “expected” results if there were no numbers added or missing.

You will end up with one “actual” result bigger than expected and one less than expected. The difference between the result that is bigger than expected and its expected result is the duplicate number, and the other difference is the missing number.

In Pseudocode (where ints are 64 bits):

int N = array.Length;
int actualSum = 0;
int expectedSum = ((N + 1) * N) / 2;
for (int i=0; i< array.Length; i++)
{
actualSum += array[i];
}

int diff = Math.Abs(actualSum – expectedSum);

int actualEvenSum = 0;
int expectedEvenSum = 0;
int actualOddSum = 0;
int expectedOddSum = 0;

for (int i=1; i 0)
{
duplicate = evenDiff;
missing = -oddDiff;
}
else
{
duplicate = oddDiff;
missing = -evenDiff;
}

If there are more than about 2^33 numbers, then you will overflow a 64-bit integer when adding up the numbers. As long as this doesn’t raise an exception, the calculation should still work correctly, since the difference will fit in a 64-bit integer.

Reply
- Daniel Plaisted on April 24, 2011
  
  It looks like my pseudocode got eaten, starting with a less than or equal sign in the second loop. Hopefully my description of the algorithm makes sense by itself.
  
  Reply
- Daniel Plaisted on April 24, 2011
  
  Here’s a link to the un-mangled code: http://codepaste.net/3mvtfw
  
  Reply
  - Max Schireson on April 24, 2011
    
    Looks great Daniel!
Marcus Clyne on April 24, 2011

Hi Max,

Note: in the time between seeing this, trying to find out where to send the answer (i.e. to post it on this wall or to find your actual email address), and the time you posted the answers, I wrote this: I know it’s a bit late now, but I thought I’d send it to you anyway.

Here’s a rough outline of one possible approach, where x represents the each number you have:

– take the sum of x
– take the sum of x^2
– solve the simple equation you have using the two sums to get the missing and duplicated numbers

Things we know:

– the sum of squares of numbers from 1 to n is: n(n+1)/2
– the sum of squares of numbers from 1 to n is: n(n+1)(n+2)/6

Things we set up:
Two fields for our adding. We represent these as field 1 and 2 respectively for the two sums mentioned above.
Depending on n, we may be able to use 64-bit integers for all fields (if (n(n+1)(n+2)/6 + (n^2-1)) < 2^64 – I didn't bother to work out what this value would be – the (n^2-1) part at the end is to guarantee that we have enough space for if the duplicated value is n and the missing value is 1, which would give the highest sum of the squares).
If the previous equation results in an integer greater than can fit in a 64-bit integer, then we could either use a library to deal with integers that are greater than 64bits but use 64-bit integers, or could just use something like the gmp library (so we would be converting all our numbers to gmp integers). Although this would obviously be slower than using integer variables, it would approximately scale almost linearly (see note below).

The iteration part (which will be our part that scales linearly):
– loop through all the numbers you have in your array (represented as x)
– on each loop:
– add x to field 1
– add x^2 to field 2

The differences:
In order to get the differences from the 'complete' values, we just take the value of the above two equations above respectively from the two totals we got in field 1 and 2.

The resulting equations:
We are left now with two equations and two values (the differences calculated above). In these sums, 'd' represents the duplicated number, 'm' represents the missing number, and D[1|2] represent the differences above (which are known values at the end of our process):
1) d – m = D1
2) d^2 – m^2 = D2

Solving the equations:
(from 1) (d – m)^2 = D1^2
(from 1) d = m + D1
(from 2) d^2 – m^2 – D2 = 0
(d – m)^2 – D1^2 = 0
d^2 – 2dm + m^2 – D1^2 = d^2 – m^2 – D2 (we can cancel the d^2)
2m^2 – 2dm – D1^2 + D2 = 0
2m^2 – 2(m + D1)m – D1^2 + D2 = 0
2m^2 – 2m^2 – 2D1m – D1^2 + D2 = 0 (we can cancel the m^2)
– 2D1m – D1^2 + D2 = 0
OR
2D1m + D1^2 – D2 = 0
SO
m = (D2 – D1^2)/2D1
And we solve 'd' from equation 1 above.
Note: you could also solve 'd' first, then 'm' by cancelling out first m then d in the above workings.

One issue with this approach related to O(n):
If we were to use something like gmp, then the library itself would not quite scale linearly. The processing of larger integers would take longer than small integers, so there would be a very slight increase in time. Since we are *only* dealing with 64-bit integers, then the difference would probably be pretty small, and in practice may not be that noticeable in the 1 – 2^64 range.
If you wanted to be absolutly pedantic about linearity with the scaling, you could always set up a timer at a regular interval that was considerably longer than it takes for the gmp library to caclulate ((2^64)+1)^2, and then use some kind of loop that only did the next calculation after the alotted time had elapsed.
This is a completely pointless activity in practice, though would make it strictly linear (at least the iteration part).

Concluding comments:
I believe that this solution solves your problem/criteria. I hope there are no errors in my working – I did double-check them. 🙂
By the way, I actually contacted you guys about two weeks ago asking if you'd be interested in hiring me to do some work regarding your C driver and module for Nginx. There are some major issues with integrating MongoDB with Nginx because your C driver is blocking, and Nginx is non-blocking, so the GridFS Nginx module you have is going to kill performance on Nginx on high-load, so I offered to do some contract work to write a non-blocking C driver and an Nginx module that would allow Nginx to run as an HTTP front-end to both standard Mongo and GridFS.
I was asked to send in my resume (by Ben Sabrin), which I did, but I've not yet heard back from anyone. I didn't want to be pushy because I know that you guys are probably very busy, and possibly hadn't had time to consider my proposal. Just thought I'd mention it.

Thanks,
Marcus Clyne.

Reply
- Pavel on April 23, 2011
  
  …
  >- the sum of squares of numbers from 1 to n is: n(n+1)(n+2)/6
  …
  Correct is N(N+1)(2N+1)/6
  
  Reply
  - Marcus Clyne on April 24, 2011
    
    🙂 Oops – mis-remembered that one. Thanks.
Charles Keepax on April 24, 2011

Well my answer was this:

Sum all numbers, we’ll call this A.
Calculate N(N 1)/2, we’ll call this B.
Subtract the numbers one at a time from A and B, until A-B changes. The number that causes the change is one of the incorrect numbers. It is then trivial to calculate the other incorrect number.

Requires at worst two passes through the array though so better answers have been posted.

Reply
- Charles Keepax on April 24, 2011
  
  Well that will teach me to solve problems whilst after drinking 🙂 Sorry all.
  
  Reply
Hugo T R on April 24, 2011

I would use a bitmap or cumulative bitwise operations.

Reply
sparc5 on April 24, 2011

just try to see if the comments are post immediately or not. Somehow I did not see my post

Reply
Nick LaRacuente on April 24, 2011

Basic half-finished idea:
1) Compute the expected # of total 1 bits in positions 0-63. This should run in O(logn) or less than 64.
2) Repeat the above for different recombinations of bits, based on the butterfly algorithm for the Walsh-Hadamard transform, but with +/- replaced with XOR/NXOR. The intention is to create alternate expressions that couple bits.
3) Record the actual ones counts for each bit and recombined bit (this should require a constant-sized table of long integers).
4) Starting with the untransformed case, we can find all the bits that were 1 in the removed and 0 in the duplicate, or vice versa. From here, we can backtrack through each level of the transform to determine the values of bits that were the same in the duplicated and removed, but would have been coupled with a changed bit in the transforms.

Unfortunately, it appears I’ve returned here too late. Ah well. Probably should get back to actually working.

Reply
bro on April 24, 2011

Step 1:
let T be the XOR of all numbers from 1 to N
let S be the XOR of all numbers in the array
Then XOR(T,S) = XOR(d,m) = P where d is the duplicate
value and m is the missing value.
Step 2:
let i = 1
let Z2 = xor of all numbers in the array except the
ith element
if XOR(XOR(Z2, T), P) == ith element
then i th element is the duplicate

Now this is repeated for all indices i. The thing is the xor of all values in the array except the ith element
can be computed in one operation, that is XOR(S, ith element).

Reply
- Max Schireson on April 24, 2011
  
  I like it. Original and fast.
  
  Reply
  - Antti Rasinen on April 24, 2011
    
    I do not think it works. I’ll use + for XOR, and mark the ith element with simply i. Then the Z2 variable becomes S + i.
    
    Now (Z2 + T) + P = S + i + T + P. Since S + T = P, they cancel out. In other words, the equation
    
    (Z2 + T) + P == i
    
    is always true. Bummer.
Ajay Ohri on April 24, 2011

as a follow up, the unemployment rate in techs, math guys would be interesting given the interest generated by the job/post title

Reply
- Max Schireson on April 24, 2011
  
  As a former math guy myself always have a soft spot for math guys who can program 🙂
  
  Reply
Zachary Voase on April 24, 2011

I’ve written a C solution to the problem. It runs in O(n) time and O(n) constant space (i.e. it doesn’t grow/shrink over the course of its execution). You can find it here, along with example input and output files: https://gist.github.com/939287

It’s not very idiomatic because it’s my first C program, but the basic idea is simple: store a bit array of size N, where `check_array[number]` indicates whether or not we’ve seen `number`. Loop through the array setting `check_array[input_array[i]] = 1`. If we encounter a number which we’ve already seen, then that must be the duplicate.

As other people have noted, the difference between the duplicate and the missing number is the same as that between the sum of all input elements and the ‘triangle’ number (N * (N + 1) / 2). So, finding the missing number becomes trivial as long as we calculate the sum as we’re going along.

My solution above solves this problem correctly (as far as I can determine through generated random test input). It also does so very quickly, most likely because it’s written in such a low-level (yet well-suited) language.

Reply
- Max Schireson on April 24, 2011
  
  The challenge is to get that down to o(1) space – a constant amount of space that does’t depend on n!
  
  Reply
Pavel on April 24, 2011

Hi, this is working solution. It can be easy adopted for 64bit numbers.

#!/usr/local/bin/python
import sys
from random import shuffle

# prepare sequence
N = 10 # len of sequence
nums = range(1, N+1) # generate range of numbers from [1..N]
shuffle(nums) # make random order of numbers
nums[0] = nums[1] # dup 0-idx number as 1-idx number
print “Numbers: %s” % nums

# calculate sum of n and n^2 from 1..N
S = int(((1+N)*N)/2)
S2 = int(N*(N+1)*((2*N)+1)/6)
# print “Sum of numbers in range 1..N: %d” % S
# print “Sum of numbers^2 in range 1..N: %d” % S2

# Calculate sum of current numbers
E = E2 = 0
for n in nums:
E += n
E2 += n*n

if S == E:
print “All numbers in place”
sys.exit(0)

D = (S2 – (S*S) + (2*E*S) – E2 – (E*E)) / ((2*S) – (2*E))
print “Duplicated number: %d” % D

M = S – E + D
print “Missed number: %d” % M

Reply
- Max Schireson on April 24, 2011
  
  good work
  
  Reply
Pavel on April 24, 2011

Here you can see with the right highlighting and indented.
http://pastebin.com/KyFTXpEr

Sorry for duplicated comment.

Reply
Binh Nguyen on April 23, 2011

One of my friend came up with this solution:

m is missing,d is duplicate. I is 1..N, A is input.

Step 1: Compute m XOR d by XOR all values of A and I together.
let say x = m XOR d. Then x must has at least one 1 bit (or else m == d). Call the bit p. So now we can separate A into two list A0 are those with bit p == 0 and A1 are those with bit p == 1. So we have separated m and d into two different list.

Step 2: Iterate through I, maintain two number x0 and x1 as the results of XOR operations of those elements with p bit == 0 and == 1 respectively. Do the same for A to get a0 and a1. Then we know that m is either a0 XOR x0 or a1 XOR x1. One more linear test in A to confirm the missing element m.

In the end, the running time can be squeezed into 2 N.
space = O(1).

Reply
- Max Schireson on April 24, 2011
  
  Yes, that is the direction I was suggesting for the last part of your solution. It was also the direction Zach was taking, but your friend had by far the clearest explanation of it. He should send his resume too!
  
  — Max
  
  Reply
  - Binh Nguyen on April 24, 2011
    
    Thank you! That’s very nice!
    But both he and I are not looking for a job right now. 🙂
    Looking forward to another fun question in the future.
    
    PS: I don’t think calculate sum of squares of all number from 1 to N where N is ~ 2^63 is linear in term of N. At least not in practice!
Ittay Dror on April 23, 2011

(disclaimer: without reading the previous comments, so i may be presenting a wrong idea that has been presented before)

The first array is A. I’d build another array of N numbers, filled with 1, name it B.

foreach i in A
if B[A[i]] == 0 print “found duplicate” A[i]
B[A[i]] = 0

foreach i in B
if B[i] == 1 print “found missing” B[i]

I’m assuming of course, that I can allocate B.

If not, then reuse A. So A[0] gives another index in A, so I can save i = A[A[0]] and then A[A[0]] = 0 and continue with i. Something like
i = A[0]
while not i == 0
i = A[i]
if A[i] == 0 print “found duplicate” A[i]
A[i] = 0

foreach i in A
if A[i] != 0 print “found missing” A[i]

(there may be corner cases, like needing to add 1 to indexes and such, i don’t think it is worth bothering with)

Reply
- Ittay Dror on April 24, 2011
  
  oops. missed the space constraints. sorry
  
  Reply
Sergiy on April 23, 2011

D – duplicate M-missing
If sum(array) is computable (possible to compute), than we can find

MD=M-D by comparing sum(array) to sum(1, N)

Now lets set variables, each named X(0 to MD-1), – total md vars.
X0 would represent numbers that are equal 0 while modulo MD
So, say if MD=3, than X1 represents 4, 7, 10 …, X2 represents 5,8,11 …

Now lets make iterating procedure, that will go through the array, starting with first element and adding it to the top of the corresponding Xn var. (no matter in what order they arranged)

X1=(31,1,10..)
X2=sum(8,2,11, ..)

Xn= sum((n+MD*y)…)

When finished sum, one of the Xn will contain D and will

have missing M.

It is easy to find which Xn contains M and D, since “ideal” sequence Xn (4,7,10..) is easily computable and compared to the “actual” Xn.

Now when we know which Xn var has M and D, we also know that in this array M and D are consecutive numbers with difference MD – for example M=10 D=7

Lets brake this sum – Nx into 2 sums.
Each of them sums alternating numbers
Xn1=sum(4,10,16..)
Xn2=sum(1,7,13…)

Basically this procedure is made during the first procedure when we calculate Xn.
We add (n+MD*y) where y is odd to Xn1 and we add (n+MD*y) where y is even to Xn2.

Xn will be equal Xn1+Xn2.

Having actual Nx1 and Nx2 (one we know has Missing M and
another Duplicate D) and comparing them to “ideal” sequence Xn1 and Xn2 is easy find out M and D.

If “actual Xn1” is less than “ideal Xn1”, then difference is M.
If “actual Xn1” is more than “ideal Xn1”, then difference is D.

Тhe same to Xn2.

Reply
- Max Schireson on April 26, 2011
  
  Looks like you need MD variables, which could be N-1. Looking for a constant space solution.
  
  Reply
anxrn on April 24, 2011

(I didn’t read all comments to see if this solution has already been posted. Apologies if it has)

(I will assume all integers are positive)

Let x be the duplicated element
Let y be the missing element
Sum of all elements in the array = S
Product of all elements in the array = P

We have,
Eqn #1: S – x + y = n(n+1)/2 — (formula for sum of numbers 1..n)
Eqn #2: P * y/x = n!

So, 2 equations, and 2 variables. Simple linear equation. Solved. Oh, and time complexity is O(n), sum and product is computed in one pass. Space complexity is constant.

Reply
- Max Schireson on April 24, 2011
  
  As n gets large, p uses a lot of space to store as a bigint. Eventually exabytes when n is near 2^64. So space isn’t really constant…
  
  Reply
想去MongoDB工作？解决这个问题 on April 24, 2011

[…] 这个贴到目前为止已经有了160多条回复，下面是其中几个比较有意思的，如果你有时间，也可以去原帖发掘更多。 […]

Reply
- Max Schireson on April 24, 2011
  
  Sorry, I can only competently review responses in English. I will give German or Spanish a try but you’ll need to forgive my mistakes.
  
  — Max
  
  Reply
rmh on April 24, 2011

Let

Q be the sum over 1 to n of n^2
S be the sum over 1 to n of n

Determine

q, the sum of the squares of the elements in the array
s, the sum of the elements in the array

If

x = q – Q
y = s – S

then

m = \frac{x – y^2}{2y}
d = m + y

where

m is the missing number
d is the duplicate number

Reply
- Max Schireson on April 24, 2011
  
  well done
  
  Reply
  - Alex on April 24, 2011
    
    is that faster than looking at them?
    
    Oh I get it now, leaves the array untouched and in a constant space. whereas all that sorting and looping was a problem.
- Dylan on April 24, 2011
  
  Using similar methods, you can actually fill in an arbitrary number of missing or duplicated elements.
  
  Reply
  - Max Schireson on April 26, 2011
    
    An arbitrary but known number isn’t too hard 🙂

Alex on April 24, 2011


function FindDupeAndMissingIn(N) {
  // returns an array of two parts
  // assuming one duplicate and one missing integer
  N.sort();  
  theDupe = 'not found';
  theMissing = 'not found';

  for (i = 0; i < N.length(); i++) {
   if (N[i] == N[i+1]){
       theDupe = N[i];
   }
   if (N[i+1] == N[i]+2) {
       theMissing = N[i]+1;
   }
   if ((theDupe != 'not found')&&(theMissing != 'not found')) {
       return [theDupe, theMissing];
   }
 }
}

Max Schireson on April 25, 2011

Looking to avoid a sort, that will take n log n time, plus shuffle the array.

— Max

Reply

Jose luis figueira on April 24, 2011

Can’t you just:

x = [set of data]
N = ubound_limit
e = [1 to N] #empty
dupe = 0

for i in x
v_sum += x(i)
if e(i) > 0 # found a dupe
dupe = i
end
e(i) = e(i) + 1

s_sum = sum(1 to N)
diff = v_sum – s_sum #diff between miss and dupe

# Here, we can loop over e, and found an e(i) = 0, thats
# the miss, but we dont want to loop again, so…

# miss = dupe – diff
# whe have dupe and diff, we need to found miss
# then, do the maths…

Reply
- Jose luis figueira on April 24, 2011
  
  I forget, of course, to get s_sum = sum(1 to N), you don’t need to loop from 1 to N (thanks gauss):
  
  ((1 + N) * N) / 2
  
  Reply
- Max Schireson on April 25, 2011
  
  Part of what I was asking for was to avoid using the memory to store e.
  
  — Max
  
  Reply
Sergio Castaño Arteaga on April 24, 2011

Python:

from mpmath import *

def findDupAndMis(array):
vp, vs = mpf(1.0), 0
for i in xrange(1,len(array)+1):
vp = vp * (float(array[i-1])/float(i))
vs = vs + (array[i-1]-i)
mis = round(vs / (vp – 1))
dup = round(vs + mis)
print “Found: Array %s \nDup %d Mis %d” % (array, dup, mis)

Reply
Sergio Castaño Arteaga on April 24, 2011

Python (… to show tabs):

from mpmath import *

def findDupAndMis(array):
….vp, vs = mpf(1.0), 0
….for i in xrange(1,len(array)+1):
……..vp = vp * (float(array[i-1])/float(i))
……..vs = vs + (array[i-1]-i)
….mis = round(vs / (vp – 1))
….dup = round(vs + mis)
….print “Found: Array %s \nDup %d Mis %d” % (array, dup, mis)

Reply
- Max Schireson on April 25, 2011
  
  I think you’ll overflow with any decent sized N (unless phython just expands the size of your float, in which case you’ll use a lot of memory).
  
  Check some of the other comments about multiplying,
  
  — Max
  
  Reply
Venkatesan on April 24, 2011

#include
#include
#include

void find_missing_and_duplicated(std::vector& arr, ACE_INT64& dup, ACE_INT64& missing)
{
ACE_INT64 curXor = 0;
ACE_INT64 expXor = 0;

for(int i=0;i<arr.size();i++)
{
expXor ^= i;
curXor ^= arr[i];
}

//this is the xor of missing numbers
ACE_INT64 deltaXor = curXor ^ expXor;

//except for the missing number, every other combo yields deltaXor
for(int i=0;i<arr.size();i++)
{
if( ((expXor^i)^(curXor^arr[i])) == 0)
{
missing = i;
break;
}
}

dup = missing ^ deltaXor;

}

int main()
{
std::vector arr;
const int N = 100;

arr.resize(N);

for(int i=0;i<N;i++)
arr[i] = i;

int victim = 76;//rand() % N;

//arr[victim] = (victim+1) % N;

arr[10] = N-1;

std::cout << "Array size is " << arr.size() << ", Victim is " << victim << std::endl;

ACE_INT64 dup = 0;
ACE_INT64 missing = 0;
find_missing_and_duplicated(arr, dup, missing);

std::cout << "Missing: " << missing << ", Duplicate: " << dup << std::endl;

return 0;
}

Reply
- Max Schireson on April 25, 2011
  
  I’m not sure i agree with your comment on line 20?
  
  Have you tried a test case?
  
  — Max
  
  Reply
Quentin Calvez on April 24, 2011

(Sorry if my solution has already been posted, there are 200+ comments so I haven’t read everything)

I will assume that all numbers are positive.

1. I compute the sum of all integers in the array and the sum of numbers from 1 to N. I then call DIFF the difference between both.

2. I now iterate again trough the array and I build two sums : The numbers which euclidean division by DIFF is even on one side and those for which it’s odd on the other side. I do the same for the numbers from 1 to N.

3. Finally : I now have 4 sums from (2.), I compare the sums from the numbers (1-N) to those from the array (even with even and odd with odd). If the difference is positive, it’s the missing number and if the difference is negative, it’s the duplicate one.

I have made an implementation of this in pseudo c# that you can find here : http://pastebin.com/rKAeUEfa

The bottleneck here is the division… Because I don’t known if there are algorithms to do it with a complexity < N…

Reply
- Max Schireson on April 25, 2011
  
  nice solution Quentin.
  
  — Max
  
  Reply
  - Quentin Calvez on April 25, 2011
    
    Thank you 🙂
    
    BTW, I’m currently seeking an internship for this summer, so if you take interns at MongoDB, I’d be happy to apply !
Matthew H. on April 24, 2011

Don’t quite have the math for an easy solution, but you can see that

d = sum(n in 1-N) – sum(n in N Array)
x – y = d
(product(n in 1-N) / y) * x = product(n in N Array)

and now it’s quite easy to solve.

naively one can find all solutions to x – y = d and then test against the second equality.

Reply
- Max Schireson on April 25, 2011
  
  two comments:
  
  1. calculating the product will use a LOT of memory if N is large
  2. again with large N, finding all the x-y = d is very time consuming, but yes, there are other ways to solve once you are at that point; issue is just the size of the product.
  
  — Max
  
  Reply
Bruce Moreland on April 24, 2011

I haven’t read any of the above except the question and the answer by “rmh” that you declared to be correct. I normally wouldn’t find that kind of solution myself.

Here’s the answer I came up with while eating dinner after my son presented me with this problem. I’ll try to find the duplicate number. It’s possible to derive the missing number from that.

Pick a radix, for example 10. You can use binary or base 256 or whatever but for purposes of explanation I’ll use base 10. Regard the series as a bunch of base 10 numbers.

For each possible number in the 1’s column, 10’s column, etc., you know the frequency with which any digit should appear. For example, if the numbers are 1..20, you should have in the 10’s column 9×0’s, 10×1’s, and 1×2’s. In the 1’s column you should have two each of 0..9.

Examine your data and compare with what you expect to see. If a digit appears too often in a column, you know that this digit must appear in that column in the duplicated number. If all digits in a column appear with correct frequency, you know nothing about that column, yet.

It’s a fact that you’ll learn something in this pass, because something has to appear too often. You’ll learn something about at least one column, and possibly more.

What you have learned creates a subset of data, from which you can compute new frequency information. Repeat the above steps using the subset and the new frequency information.

You are done when you know what is going on in each column. The algorithm has to make progress so you’ll get done eventually. It’s arguably linear since it depends upon the size of the numbers but it’s a hell of a lot more linear than duplicating the array and quick-sorting it.

Example:

1 2 3 5 6 7 8 9 10 11 12 13 14 14 15 16 17 18 19 20

1..20 with 14 duplicated and 4 missing. These wouldn’t have to be in order, of course.

First pass would show the 10’s column as having 1 too many 1’s in the 10’s column, so you know that your answer is 1X, where X is unknown.

New subset of data, which is everything in the original data that had a 1 in the 10’s column.:

10 11 12 13 14 14 15 16 17 18 19

If the data you expected when you started was here, you’d have 1 each of 0..9.

You run over this and since there are too many 4’s, so you know that your duplicate has a 1 in the 10’s column and a 4 in the 1’s column, i.e. 14.

You can find out that 4 is the missing number trivially. You sum the original data set, find out that you have 10 too many, and 14 – 10 = 4.

This answer would work although depending upon how large the numbers were it might be impossible to sum them in 64 bits.

If someone tasked me with this problem I would want to know more about it, to avoid bugs and dumb. My answer might be wrong to start with, if there isn’t time to do the calculations I suggest, or if it’s impossible to go over the data twice, etc., so it would be pointless to waste time on it if this were true. It may also be pointless to try to find a better solution than this one if this one works orders of magnitude faster than it “has” to.

Also, if this was critically important and commonly solved, I’d probably try to find some information like this webpage, read the whole thing, swallow any residual pride, and use the best solution contained here, because I’m smart enough to realize that other people can be smarter, spend more time, or look at things differently.

Reply
- Max Schireson on April 25, 2011
  
  interesting approach.
  
  that said, if you had non-empty hundreds and thousands places (and potentially more…) i don’t think your second pass will learn anything new?
  
  — Max
  
  Reply
  - Bruce Moreland on April 27, 2011
    
    I don’t understand your comment, but I will restate in the hopes that I answer your question by accident. I’m pretty sure that I do answer it, indirectly.
    
    When writing this up I came to realize that the algorithm is simpler than I thought, and that will be reflected in the following.
    
    Definitions:
    Ideal data – the set 1..N. I assume this is represented mathematically only.
    Actual data – the set 1..N, with one missing and one duplicate. I vaguely assume this is an unsorted array that can be iterated over.
    
    The point of this algorithm is that you can use a frequency count to identify and characterize a subset of the actual data that must contain the duplicate, and which cannot contain the missing number.
    
    This allows you to trap the duplicate while excluding the missing number from consideration, which is key.
    
    If for example there are too many 6’s in the 10’s column in the actual data, the duplicate must have a 6 in the 10’s column, and the missing number cannot have a 6 in the 10’s column.
    
    Depending upon the radix you use and the character of the data, there’s a fair, maybe excellent, chance that you could solve the problem in one stroke with this frequency step.
    
    But assuming you don’t, this still allows you to identify a subset of the actual data that contains the duplicate, and allows you to generate the subset of ideal data that contains exactly the same values. In this example, that subset is the set of numbers in either ideal or actual data sets that have a 6 in the 10’s column.
    
    So:
    
    I is the ideal data
    A is the actual data
    
    X is the subset of I that must include the duplicate and can’t include the missing number, i.e. everything in I that has a 6 in the 10’s column.
    
    Given this:
    
    I’ is I intersected with X.
    A’ is A intersected with X (assuming that this will cause the duplicated element to appear twice).
    
    The result is that A’ is I’ with one extra element. The extra element is the duplicate.
    
    At this point you can take a second frequency pass to find the duplicate (I include this for consistency with my original solution), or likely more simply just use XOR, since all elements of A’ XOR’d with all elements of I’ will equal the duplicate element. (I would have figured out that use of XOR without having seen the other solution.)
    
    I hope there are no bugs in this. I don’t feel confident given that I didn’t write and test it, but it seems okay. Maybe more fat would fall off in the writing.
    
    The very strict definition of the problem implied set relationships that suggested that tricky math existed, and if my life had been on line if I didn’t find that, I would have gone looking for it.
    
    I realized though that the way I chose was more fun for me personally and does meet the criteria:
    
    1) Linear.
    2) Sane memory.
    3) Non-destructive.
    
    It would also be easy to implement this in a way that in a practical sense, meaning with natural integer operations and typical data structures, and would scale given that you used the phrase “extremely large”.
Yousuf Fauzan on April 25, 2011

Been thinking that XOR must have something to do with the solution, here is my second try.

A = range(1..N)
B = Our Array

M = Missing Number
D = Duplicate Number

A’ = XOR of elements in A
B’ = XOR of elements in B

A’ XOR B’ = M XOR D (Since all other elements will produce 0)

A’ XOR M = B’ XOR D —– 1

Since A’ XOR M is all elements of A excluding M
B already does not have M, but due to the duplicity of D, XOR is missing D too.

We know that (M XOR D) XOR D = M —-2

Using 1 and 2 and iterating through the elements in B, we can find D

for i in B:
if B’ XOR i == A’ XOR (M XOR D XOR i) then i is the duplicate number

Finding M is trivial from here.

Reply
- Max Schireson on April 25, 2011
  
  good solution!
  
  — Max
  
  Reply
  - Duarte on April 26, 2011
    
    It doesn’t work. A’ XOR (M XOR D XOR i) is A’ XOR A’ XOR B’ XOR i, which is B’ XOR i, so the condition is always true.
  - Max Schireson on April 26, 2011
    
    Duarte,
    
    Good catch; I think my eyes were glassing over!
    
    Sorry Yousuf,
    
    — Max
Yousuf Fauzan on April 25, 2011

This was my first try. It didnt get posted for some reason.

A=sum(1..N) = N*(N+1)/2
B=sum(1^2, 2^2…N^2)= N*(N+1)*(2N+1)/6

A’ = sum of the integers in the Array
B’ = sum of squares of integers in the Array

D = Duplicate
M = Missing

A’ = A + D – M
B’ = B + D^2 – M^2

=> D – M = A’ – A
D + M = (B’ – B)/(A’ – A)

Solve for D and M

Reply
- Max Schireson on April 25, 2011
  
  Caught in wordpress’s spam filter, sorry for the delay (i do occasionally check my spam).
  
  Good solution, congrats!
  
  — Max
  
  Reply
Chris Hurlburt on April 25, 2011

in 5N + overhead with a memory requirement based on 6 numbers plus
code storage overhead.

NOTE: Since this is sample code and in python, the max value for N is not
going to be 2^64 as in the problem because the range command is used. However,
this is a language limitation and not an algorithm limitation.

“””

def create_list(x,y,n):
“””
This creates a list of the type described in the puzzle for test purposes.
“””
test_obj = range(1,n+1)
test_obj.remove(x)
test_obj.append(y)

return test_obj

def seeker(list):
“””
Goes through the list finding both the missing and the duplicate. We are
assuming that the list is of the form described in the problem.

Notationally, x is the missing element and y is the duplicate element and
we will return x, y
“””

n = len(list)

#Find x-y a.k.a the first pass through the list
xy_diff = int(n*(n+1)/2) – sum(list)
if xy_diff == 0:
print “List is not in problem format!\n”
return 0,0

#Find the first binary digit where diff is non-zero.
sort_place = len(bin(xy_diff)) – bin(xy_diff).rfind(‘1’)

print “The first non_zero binary digit in the difference occurs at %d\n”%(sort_place)

#Initialize counters/summation variables
bucket0 = 0
ref_bucket0 = 0
count0 = 0
ref_count0 = 0
bucket1 = 0
ref_bucket1 = 0

#Go through the list counting and summing….
for i in range(n):
#Add the i+1 value into the appropriate bucket & increment counter
ibin = bin(i+1)
if len(ibin) – 2 >= sort_place:
if ibin[len(ibin) – sort_place] == ‘0’:
ref_bucket0 = ref_bucket0 + i+1
ref_count0 = ref_count0 + 1
else:
ref_bucket1 = ref_bucket1 + i+1
else:
ref_bucket0 = ref_bucket0 + i+1
ref_count0 = ref_count0 + 1

#Add the list entry to the appropriate bucket & increment counter
lbin = bin(list[i])
if len(lbin) – 2 >= sort_place:
if lbin[len(lbin) – sort_place] == ‘0’:
bucket0 = bucket0 + list[i]
count0 = count0 + 1
else:
bucket1 = bucket1 + list[i]
else:
bucket0 = bucket0 + list[i]
count0 = count0 + 1

#Process the results to find x and y

if count0 == ref_count0:
print “YOUR ALGORITHM FAILED!!!”
return 0,0

if count0 > ref_count0:
x = ref_bucket1 – bucket1
y = bucket0 – ref_bucket0
else:
x = ref_bucket0 – bucket0
y = bucket1 – ref_bucket1

print “Missing is %d\n”%(x)
print “Doubled is %d\n”%(y)

return x,y

Reply
- Max Schireson on April 25, 2011
  
  Algorithm looks good; my python is not good enough to meaningfully check your code, but congrats on a good algorithm!
  
  Reply
  - Chris Hurlburt on April 26, 2011
    
    Hi Max,
    
    Here’s a a more formal description. Coding allows me to check for what I like to call moron errors. Let x be the missing integer and y be the duplicated integer.
    
    Step 1: Compute x-y. (Gauss’ formula – sum of the list)
    
    Step 2: Find the smallest number i such that the coefficient a_i != 0 in the unique (binary) representation of
    |x-y| = a_0 + a_1 2^1 + a_2 2^2 + a_3 2^3 + …
    where a_j in {0, 1} for all j.
    
    Step 3: For each number on the list, determine whether the ith coefficient in the binary representation is 1 or 0. While doing so compute the following:
    S_0 = sum of all numbers on the list whose ith coefficient is 0
    C_0 = the cardinality of the set of numbers on the list whose ith coefficient is 0
    R_0 = sum of all numbers in 1..N whose ith coefficient is 0
    D_0 = the cardinality of the set of numbers in 1..N whose ith coefficient is 0.
    S_1 = sum of all numbers on the list whose ith coefficient is 1
    R_1 = sum of all numbers in 1..N whose ith coefficient is 1
    
    Step 4: If C_0 D_0 then y = S_0 – R_0 and x = R_1 – S_1.
    
    Lemma: Given the notation above, x and y must differ in the ith coefficient.
    
    Proof: For j < i, a_j = 0. Therefore the jth coefficients in the binary representation of x and y for j < i must be the same. If they were also the same for i, then the ith binary coefficient in |x-y| would be 0. However, i is chosen such that the ith binary coefficient in |x-y| is non-zero and hence x and y must also differ in the ith coefficient.
    
    Notes: Some of the computations in step 3 are redundant. For example, there is no need to actually compute either S_1 or R_1.
    
    In my implementation I compute all 6 values in step three while looping through the list—I did this because I didn't want to bother figuring out closed forms for R_0 and D_0 and I didn't want to code the more complicated set of conditionals that is required if you don't compute S_1 and R_1. This means the above code does 6N computations in the second pass for through the list rather than a more minimal 2N. It's also much easier to debug this way.
    
    If N is close enough to 2^64, then cleverness will be necessary to compute R_0 – S_0 without computing either R_0 or S_0.
    
    There are all sorts of various clever tricks that could improve the above implementation by little bits here and there. None of the ones I can think of improve the algorithm beyond O(N) compute and O(1) memory.
    
    With this approach binning on the remainder modulo |x-y| will not work. Here is a counter example. Let x = 5 and y = 7. Then binning on the remainder modulo the difference of 2 will put both 5 and 7 in the odd bin which will once again get you x-y (or y-x), but not x or y.
    
    Python as I know it is not ideally suited to this type of bit manipulation. However, I find it a useful language for fast prototyping of algorithmic ideas. Also by coding it I was able to stop the little mice in my head from working on closed forms for R_0 and D_0 because I realized that even if I did find one, chances are the cases would make a coding/debug nightmare. Thanks for a nice puzzle.
  - Chris Hurlburt on April 26, 2011
    
    ps. Somehow this comment area ate part of step 4. More words and less symbols:
    
    If C0 is less than D0, then x = R0 -S0 and y = S1 – R1.
    
    If C0 is greater than D0, then y = S0 – R0 and x = R1 -S1.
Joseph Bowman on April 25, 2011

Well, if I had to knock it out with a shell script I’d just make a file with a line for each integer and then

#!/bin/bash
DUPE=$(sort -n array.txt | uniq -d)
ARRAY=$(sort -n array.txt)
KEY=0
for i in $ARRAY; do
if [ $KEY -ne $i ]; then
echo “MISSING: $KEY”;
break;
fi
KEY=$(($KEY+1));
done
echo “DUPE: ${DUPE}”

I’m assume I understand the description right and the first number would always be 0. If not, then I could always head the sorted output to get the first number and start counting from there. I suppose I could get a little better about storing file in memory after the first sort for the whole “compact space” part.

Reply
- Max Schireson on April 25, 2011
  
  Hi. Looking to avoid the time of a sort (which is more than linear) and also not to have to hold a copy of the array. You can see some correct solutions and further commentary in the comments.
  
  — Max
  
  Reply
Ankur on April 26, 2011

Wow. This is probably the easiest one out there. It’s as simple as summing all of them and deducting it from n*(n 1)/2

I was really expecting something better!! I think this question was on my class 8 math exam.

Reply
- Max Schireson on April 26, 2011
  
  That’s simple, but it doesn’t actually answer the question asked 🙂
  
  — Max
  
  Reply
Duarte on April 26, 2011

Here’s my take. It’s based on XOR and on the observation that a bit set to 1 on (Missing XOR Duplicate) means that Missing and Duplicate can be distinguished based on that bit. My solution is in C#.

long total = 0;
long values = 0;

for (long i = 0; i < a.Length; i++)
{
total ^= i + 1;
values ^= a[i];
}

long x = total ^ values;

int bit;
for (bit = 1; (x & bit) == 0; bit <<= 1) { }

total = 0;
values = 0;

for (long i = 0; i < a.Length; i++)
{
var tmp = i + 1;
if ((tmp & bit) != 0)
{
total ^= tmp;
}

tmp = a[i];
if ((tmp & bit) != 0)
{
values ^= tmp;
}
}

long duplicate = total ^ values;

int count = 0;
for (long i = 0; i < a.Length; i++)
{
if (a[i] == duplicate && ++count == 2)
{
break;
}
}

duplicate = count == 2 ? duplicate : (duplicate ^ x);
long missing = duplicate ^ x;

We could use the equation SUM(1..N) – SUM(a) – Duplicate = Missing to reduce the solution to two loops, but I think this is faster.

Reply
- Duarte on April 26, 2011
  
  Oops, the third loop can be:
  
  bool found = false;
  for (long i = 0; i < a.Length; i++)
  {
  if (a[i] == duplicate)
  {
  found = true;
  break;
  }
  }
  
  duplicate = found ? duplicate : (duplicate ^ x);
  long missing = duplicate ^ x;
  
  Reply
  - Max Schireson on April 26, 2011
    
    looks good to me.
    
    — Max
  - Duarte on April 27, 2011
    
    Actually, I tested and this version is better:
    
    long xor = 0; // duplicate ^ missing
    long diff = 0; // duplicate – missing
    
    for (long i = 0; i < a.Length; i++)
    {
    long index = i + 1;
    xor ^= index ^ a[i];
    diff += a[i] – index;
    }
    
    int bit;
    for (bit = 1; (xor & bit) == 0; bit <<= 1) { }
    
    long duplicate = 0;
    
    for (long i = 0; i 0 ? Math.Max(duplicate, missing) : Math.Min(duplicate, missing);
    missing = duplicate ^ xor;
  - Duarte on April 27, 2011
    
    Hum, something went wrong.
    
    long xor = 0; // duplicate ^ missing
    long diff = 0; // duplicate – missing
    
    for (long i = 0; i < a.Length; i++)
    {
    long index = i + 1;
    xor ^= index ^ a[i];
    diff += a[i] – index;
    }
    
    int bit;
    for (bit = 1; (xor & bit) == 0; bit <<= 1) { }
    
    duplicate = 0;
    
    for (long i = 0; i 0 ? Math.Max(duplicate, missing) : Math.Min(duplicate, missing);
    missing = duplicate ^ xor;
  - Duarte on April 27, 2011
    
    Oh well. It’s here: https://gist.github.com/943484
Andrew on April 27, 2011

An interesting problem but more of a mathematical challenge than a programming ability test.

I wouldn’t turn a good programmer away just because thier knowledge of algebra is rusty.

Reply
- Max Schireson on April 27, 2011
  
  I wouldn’t either.
  These problems are no substitute for a real interview. They can identify some sharp folks. When people post code samples, you get to see how careful they are. And finally, you get to see how well people listen to requirements, which is certainly relevant.
  — Max
  
  Reply
- Bob Kimble on April 27, 2011
  
  I agree with your comment and with Max’ response. However, if you’re looking for really smart people, the kind of problem Max posted is a good example of the kind of problem a really smart person might be able to solve. Like it or not, computer science is really an offshoot of mathematics. Algorithmic efficiency is strongly tied to mathematical analysis, and this problem was not just about algebraic knowledge. I agree that a strong mathematical background is no substitute for programming ability. OTOH, it’s really good to find people who have both.
  
  Reply
wangjun on July 11, 2011

my english is very poor,so i write a programm to introduce my solution
void search(unsigned long *arry, unsigned long long N)
{
unsigned long base, i, duplicat = 0, miss = 0;
unsigned int vec[65536];
base = 0;
while(base < N)
{
for(i = 0; i < 65536; i ++)
vec[i] = 0;
for(i = 0; i = base && arry[i] < base + 65536 * 32)
{
int pos = (arry[i] – base) / 32, bit = (arry[i] – base)% 32;
if(vec[pos] & (1 << bit) ==0)
vec[pos] = vec[pos] | (1 << bit);
else
duplicat = arry[i];
}
for(i = 0; i < 65536; i++)
if(vec[i] != 0xffffffff)
{
miss = base + i * 32;
break;
}
if(i != 65536)
{
while(arry[i] != 0)
if(arry[i] % 2 != 0)
{
miss ++;
}
else
break;
}
}
if(miss != 0 && duplicat != 0)
break;
base += 65536 * 32;
}
}

Reply
Zhikun Zhong on July 12, 2011

I try to solve such a problem by count digit string appear times in the set. In the code, I try to seperate big number to two to four pieces, depend on how big the number is. The length of every piece is 6(Less than 1 million). We know that, the number set is regular, so, every piece appear times is regular and we can figure it out. But there get a problem, if duplicate number and miss number get some repeat part, like these two numbers(1221111, 1256300), theirs first two numbers is the same, so, we can’t get that digit string “12” appear more than expected or less than expected. But luckily, we know that, we got two numbers, they are end by 21111 and 56300 respectivility, theirs length is between 6 to 12 and they get same heigh apart, we can find out these two numbers use one more search(o(n)) in the set, if you think it is a way out to solve such a problem, I am get to archive this part, thanks very much.

#include
#include

using namespace std;

vector generateArray(int dupNum, int missNum, int N){

vector retVec;
retVec.resize(N + 1);

if (dupNum > N || missNum > N)
{
return retVec;
}

for(int i = 1; i <= N; i++)
{
if (i == missNum)
{
retVec[i] = dupNum;
continue;
}
retVec[i] = i;
}

return retVec;
}

pair gotDupAndMissNum(vectormyArray, __int64 N)
{
if (myArray.size() < N)
{
return pair(0, 0);
}
__int64 dup = 0, miss = 0;
const int vecSize = 10 * 10 * 10 * 10 * 10 * 10;
__int64 MidNum = ((__int64)(vecSize)) * (vecSize);
__int64 TopNum = ((__int64)(vecSize)) * (vecSize) * (vecSize);
int limit = vecSize;

__int64 cmpVal = vecSize;
__int64 v1Cnt, v2Cnt, v3Cnt;

vector myVec1, myVec2, myVec3, myVec4;
int dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
int misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;

myVec1.resize(vecSize + 1);
myVec2.resize(vecSize + 1);
myVec3.resize(vecSize + 1);
myVec4.resize(100);

for(int i = 0; i < 100; i++)
{
myVec4[i] = 0;
}

for (int i = 0; i < vecSize; i++)
{
myVec1[i] = 0;
myVec2[i] = 0;
myVec3[i] = 0;
}

int cnt = 0;

while(cnt++ < 4)
{
switch (cnt)
{
case 1:
for (int i = 0; i < N; i++)
{
if (myArray[i] 1)
{
dup = myArray[i];
}
}
}

cmpVal = limit = vecSize;
if (N < vecSize)
{
cmpVal = limit = N;
}
for (int i = 1; i < limit; i++)
{
if (!myVec1[i])
{
miss = i;
}
myVec1[i] = 0;
}
myVec1[0] = 0;
break;
case 2:
for (int i = 0; i = vecSize && myArray[i] < MidNum)
{
myVec1[myArray[i] % vecSize]++;
myVec2[myArray[i] / vecSize]++;
}
}

v1Cnt = vecSize – 1;
v2Cnt = vecSize;
cmpVal = vecSize;

if (N < MidNum)
{
v1Cnt = N / vecSize;
cmpVal = N / vecSize;
if (cmpVal == 1)
{
limit = N % vecSize;
}
}

for (int i = 0; i < limit; i++)
{
if ( cmpVal == vecSize || cmpVal != vecSize && i v1Cnt)
{
dupIdxV1 = i;
}
else if (myVec1[i] v1Cnt – 1)
{
dupIdxV1 = i;
}
else if (myVec1[i] = cmpVal)
{
continue;
}

if (myVec2[i] > v2Cnt )
{
dupIdxV2 = i;
}
else if (myVec2[i] < v2Cnt)
{
misIdxV2 = i;
}
}

if (cmpVal != vecSize)
{
if (myVec2[cmpVal] N % vecSize){
dupIdxV2 = cmpVal;
}
}

if (!dup && dupIdxV1 && dupIdxV2)
{
dup = myArray[dupIdxV2] * vecSize + myArray[dupIdxV1];
dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
}
if (!miss && misIdxV1 && misIdxV2)
{
miss = myArray[misIdxV2] * vecSize + myArray[misIdxV1];
misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;
}

for (int i = 0; i < vecSize; i++)
{
myVec1[i] = myVec2[i] = 0;
}
break;
case 3:
for (int i = 0; i = MidNum && myArray[i] < TopNum)
{
myVec1[myArray[i] % vecSize]++;
myVec2[myArray[i] / vecSize % vecSize]++;
myVec3[myArray[i] / MidNum]++;
}
}

v1Cnt = (__int64)vecSize * (vecSize – 1);
v2Cnt = v1Cnt;
v3Cnt = v1Cnt + 1;
cmpVal = vecSize;
limit = vecSize;

if (N < TopNum)
{
cmpVal = N / MidNum;
v1Cnt = v2Cnt = (__int64)(cmpVal – 1) * vecSize;
if (cmpVal == 1)
{
limit = N/vecSize%vecSize;
}
}

for(int i = 0; i < vecSize; i++){
if ( cmpVal == vecSize || cmpVal != vecSize && i v1Cnt)
{
dupIdxV1 = i;
}
else if (myVec1[i] v1Cnt – 1)
{
dupIdxV1 = i;
}
else if (myVec1[i] limit)
{
continue;
}
if (myVec2[i] > v2Cnt)
{
dupIdxV2 = i;
}
if (myVec2[i] = cmpVal)
{
continue;
}

if (myVec3[i] > v3Cnt )
{
dupIdxV3 = i;
}
else if (myVec3[i] < v3Cnt )
{
misIdxV3 = i;
}
}

if (cmpVal != vecSize)
{
if (myVec3[cmpVal] N % ((__int64)vecSize*vecSize)){
dupIdxV3 = cmpVal;
}
}

if (!dup && dupIdxV1 && dupIdxV2 && dupIdxV3)
{
dup = myArray[dupIdxV3] * vecSize * vecSize + myArray[dupIdxV2] * vecSize + myArray[dupIdxV1];
dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
}
if (!miss && misIdxV1 && misIdxV2 && misIdxV3)
{
miss = myArray[misIdxV3] * vecSize * vecSize + myArray[misIdxV2] * vecSize + myArray[misIdxV1];
misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;
}
for (int i = 0; i < vecSize; i++)
{
myVec1[i] = myVec2[i] = myVec3[i] = 0;
}
break;
default:
for (int i = 0; i = TopNum)
{
myVec1[myArray[i] % vecSize]++;
myVec2[myArray[i] / vecSize % vecSize]++;
myVec3[myArray[i] / MidNum % vecSize]++;
myVec4[myArray[i] / TopNum]++;
}
}

cmpVal = N / TopNum;
v1Cnt = (__int64)vecSize * vecSize * vecSize * cmpVal;
v2Cnt = v1Cnt;
v3Cnt = v1Cnt;
limit = N/vecSize/vecSize%vecSize;

for(int i = 0; i v1Cnt)
{
dupIdxV1 = i;
}
else if (myVec1[i] v2Cnt)
{
dupIdxV2 = i;
}
else if (myVec2[i] limit)
{
continue;
}

if (myVec3[i] > v3Cnt)
{
dupIdxV3 = i;
}
else if (myVec3[i] = cmpVal)
{
continue;
}

if(myVec4[i] cmpVal)
{
dupIdxV4 = i;
}
}

if (cmpVal != vecSize)
{
if (myVec3[cmpVal] N % ((__int64)vecSize*vecSize*vecSize)){
dupIdxV3 = cmpVal;
}
}

if (!dup && dupIdxV1 && dupIdxV2 && dupIdxV3 && dupIdxV4)
{
dup = myArray[dupIdxV4] * vecSize * vecSize *vecSize + myArray[dupIdxV3] * vecSize * vecSize + myArray[dupIdxV2] * vecSize + myArray[dupIdxV1];
dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
}
if (!miss && misIdxV1 && misIdxV2 && misIdxV3 && misIdxV4)
{
dup = myArray[misIdxV4] * vecSize * vecSize *vecSize + myArray[misIdxV3] * vecSize * vecSize + myArray[misIdxV2] * vecSize + myArray[misIdxV1];
misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;
}
break;
}

if (cmpVal != vecSize || dup && miss)
{
break;
}
}

return pair(dup, miss);
}

int _tmain(int argc, _TCHAR* argv[])
{
gotDupAndMissNum(generateArray(2154001, 785412, 5610121), 5610121);
return 0;
}

Reply
Dude on December 8, 2011

this is a math problem not necessarily a programming problem.

Reply
Dean on May 16, 2012

Haven’t done any programming in a while but, I would push the data into a hash table with a try/catch. This will show the duplicate in the error, then search for the one and only NULL entry in the has table.
Job done in about 5 lines of code, and hash tables are very fast.

Reply
- Max Schireson on May 17, 2012
  
  I was looking for a solution that doesn’t require copying the data (say in case it was very very large).
  
  Interesting approach though.
  
  — Max
  
  Reply
Albert on July 6, 2012

CONSTANT total = sum the numbers 0..63
CONSTANT square_total = sum of squares of numbers from 0..63

iterate array,
summing the numbers into sum,
summing square of numbers into square_sum.

LET diff = total – sum
LET square_diff = square_total – sum

duplicate = (square_diff – (expt diff 2)) / (diff * 2)
missing = duplicate + dif

Reply
Sudha on July 30, 2012

#include
#include
void main()
{
int arr[20],i,sum=0,dup,len,check;
printf(“enter the array length”);
scanf(“%d”,&len);
check=(len*(len+1)/2);
printf(“\n check = %d”,check);
printf(“\nThe array with %d element”,len);
printf(“\nenter the array element”);
for(i=1;i<=len;i++)
{
scanf("%d",&arr[i]);
if(i==arr[i])
sum=i+sum;
else
dup=arr[i];
}
printf("the Missing no is %d and the duplicate number is %d",(check-sum),dup);
getch();
}

Reply
Markiv on October 5, 2012

static void Main(string[] args)
{
int[] arr = new int[5] {1, 2, 4, 5, 5};

float sum = 0;
float product = 1;
float expectedSum = 0;
float expectedProduct = 1;
for (int i = 0; i < arr.Length; i++)
{
sum += arr[i];
product *= arr[i];
expectedSum += (i + 1);
expectedProduct *= (i + 1);
}

float missing = (expectedSum – sum) / (1 – product / expectedProduct);
float duplicate = missing – (expectedSum – sum);

Console.WriteLine("Missing – {0}, Duplicate – {1}", missing, duplicate);
Console.ReadKey();
}

Reply
Antoine Guiral (@antoineguiral) on December 11, 2012

I don’t know where are stored the data (mongo? file?) but if they are in file you can use comm() http://en.wikipedia.org/wiki/Comm (and sort before the comparison :)).

Reply
James on June 10, 2013

Years later, some python. Unemployed and bored… it seems to work fine, probably not linear though. At every index it checks front and back, then hops two. Nothing to do with Mongo but it’s 1:
miss = arr[i]-1
elif (arr[i+1]-arr[i]) > 1:
miss = arr[i]+1

if (i+2) == (len(arr)-1):
step = 1
i = i + step
except Exception:
True

print “Missing: %d — Duplicate: %d” %(miss, dup)

Reply
- James on June 10, 2013
  
  Site messed up my codepaste.
  
  arr = [1,2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27]
  
  miss = 0
  dup = 0
  
  try:
  i = 1
  step = 2
  
  for x in range(1, (len(arr)/2)+1):
  if arr[i] == arr[i-1] or arr[i] == arr[i+1]:
  dup = arr[i]
  elif (arr[i]-arr[i-1]) > 1:
  miss = arr[i]-1
  elif (arr[i+1]-arr[i]) > 1:
  miss = arr[i]+1
  
  if (i+2) == (len(arr)-1):
  step = 1
  i = i + step
  except Exception:
  True
  
  print “Missing: %d — Duplicate: %d” %(miss, dup)
  
  Reply
- Max Schireson on June 10, 2013
  
  It looks like you only covered the special case of sorted list? The problem statement does not require the numbers to be sorted.
  
  Reply