Want a job working on mongoDB? Your first online interview is in this post

I like smart programmers.

So, I thought I’d put out a programming puzzle. If you an solve it, you are well on your way to a very cool job. If you can’t solve it, send it to a smart friend who can. Even if you don’t want a job, I promise you’ll enjoy the solution when you find it.

Here’s the puzzle:

You’re given an array of N 64 bit integers. N may be very large. You know that every integer 1-N appears once in the array, except there is one integer missing and one integer duplicated.

I’d like a linear time algorithm to find the missing and duplicated numbers. Further, your algorithm should run in small constant space and leave the array untouched.

— Max

PS I love smart people in other jobs too; while I think this problem will appeal mostly to programmers, 10gen is a company where solving a problem like this helps you get a job in marketing, finance, or any other area that fits your skills. Send me a solution and tell me what job you want to interview for!

NOTE 4/24 4am: I have caught up with all the responses but plan to take Sunday off and spend it with my family. Thanks for making a job at 10gen more popular than baby tips from Elton John today.

260 comments so far

  1. Cliff Elam on

    I would put it into Excel column A, sort it, then copy this excel formula next to it in B:

    =IF(A20A19+1,IF(A20=A19,”Dupe”, “Missing”),” “)

    Then I would use my highly sophisticated Mark 1 human eyeball to spot the problem.

    Problem solution benefits:
    1> Small space. Small is relative, this works on a cheap PC doing a thousand other things and doesn’t require any reboot or configuration.

    2> Untouched array. Copy/paste, ’nuff said.

    3> N is very large. Depends on what you mean by large. I can to 5M+ rows in excel now. God knows how we lived without that.

    4> Linear. Well, something this simple probably is almost linear inside Excel, really, but more complex operations than this bog down as N gets large. But it’s close and the really expensive part of the equation, well, It took me longer to type in the first paragraph above than it would to actually solve the problem in the first place.

    5> Additional benefits. Nobody will take that code fragment and leave it lying about in a production release.

    Yours in excel mastery and advanced joshing algorithms,

    -XC

  2. Joris Verschoor on

    loop 1..N as x
    find x in array as index
    if not found, x is missing
    if found, find again at index. if found, duplicate
    end

    this is easily parallelized (word?)

    • Max Schireson on

      You loop N times, each time having to find x in the array. How long does each find take? Looking for something linear.

  3. Joris Verschoor on

    With map/reduce one would first emit(m, {status: ok/missing/duplicate}) in the map function.
    Then you could reduce that down to missing + duplicate values.

  4. Joris Verschoor on

    sum(array) – N! gives you duplicate-missing
    you can use this to loop over the array, and to check if a value + (duplicate-missing) exists

    If it’s missing, you know how to calculate the duplicate value: (duplicate-missing) + missing.

    I might have some + / – wrong though 😉

  5. Joris Verschoor on

    Sorry.. final one 🙂

    sum(array) – N! gives you duplicate-missing (dm)
    Then loop from 1..N, and check each value for existence.
    If it does not exist, you know which value is duplicate, because you have dm. Duplicate = dm + nonexisting

    This is small in size, and loops 1 time over the array to sum, and does one find per value

    • Joris Verschoor on

      N! would me 1*2*3..*N but I meant 1+2+3+4..+N =)

      • Max Schireson on

        Interesting path, but still needing that one find per value. Finding a value in a large array takes time. You found duplicate-missing quickly, can you extend that to find duplicate and find missing?

      • Joris Verschoor on

        Someone beat me with the xor..

        I’m no algo-expert, or novice even =)
        Was a nice lil’ puzzle, and I’m proud of the sum dup-miss thingy =)

        I’m not really a fan of this sort of interviews, but that wans’t my goal.

        Thanks for the challenge! Happy easter!

    • Martin on

      sum(array) – (1+2+3+4..+N) does not give you the missing. It gives you the difference between the missing and the duplicate. For instance:
      [10,2,6,9,8,5,9,1,3,7]

      Sum([10,2,6,9,8,5,9,1,3,7]) = 60
      Sum([1,2,3,4,5,6,7,8,9,10]) = 55

      60 – 55 = 5 <– No the missing! But, it is (9 – 4), the difference between the missing and the duplicate

      =)

  6. Edward Ribeiro on

    The solution I came up with is as follows:

    1. Creates a bit vector and initialize each element with zero. The “i” number will occupy the i-eth position in the bit vector, from left to right.

    2. For each x from input-vector do:
    2.1. check to see if the x-eth position of the bit vector is already set to 1. If so, the it’s the repeated element, and print ‘x’ as the repeated;
    2.2. otherwise, set the x-eth position of the bit to 1;

    3. Out of the loop, use Bit Twiddling Hacks to find the first zero bit from left to right (Hacker’s Delight book shows just how to do that).

    This solution is space efficient because the bit vector, and runs in linear time. Below is a quick and dirty snippet of java code that does exactly that, but it is limited by a huge memory consumption (for the input array), and is limited to Integer.MAX_VALUE, that is pretty big, but is could not be large enough. For a robust implementation, I would use C and a lot of Bit Twiddling Hacks to implement the above algorithm. 🙂

    ============================ Java Code ==========================

    public static void search(final Long[] input) {

    BitSet bitarray = new BitSet(input.length + 1); // should be from 1-N not from 0- (N-1)

    int repeated = 0;

    for (Long val : input) {
    if (bitarray.get(val.intValue())) {
    repeated = val.intValue();
    }
    bitarray.set(val.intValue());
    }

    System.out.println(“repeated: ” + repeated);
    System.out.println(“absent:” + bitarray.nextClearBit(1)); //ignores first bit (ZERO)
    }

    ===============================================================

    • Max Schireson on

      The space you’re using is still linear in the number of values. One bit per, but that adds up if you have billions of values… Can you do it in (small) constant space?

      • Edward Ribeiro on

        Taking in account the constant space requirement, I would use something like Bloom Filter. It’s great to check if a value is in a set, even with false positives’ probability. But a relatively modest bit vector size can reduce this probability.

        As to the repetition checking, I don’t know by heart now, but I am curious to see the final results. 😉

    • Wedge on

      Congrats, you’ve just managed to use 2^64 bits, or 2.3 exabytes of RAM. This may not be the most efficient solution.

      • Edward Ribeiro on

        LOL, are you sure you really understood my proposed solution? It’s basically a bitmap!

        Each bit in the bit vector (implemented as an array of unsigned ints or whatever) identifies if the i-th value appeared or not. If you want to represent a list with 1 to 1 billion values (any order) then this bitmap will be 125MB in size.

        The java snippet is just to show some code tough, as it should be done in C/C++ if billions of values are to be stored in the input list.

  7. dm on

    i guess Gauss would get the job

    • Max Schireson on

      he’d get the live interview. then we’d have to see if he could actually program, which is different from coming up with clever algorithms 🙂

  8. Bob Kimble on

    Joris is on the right track. As Max notes, his technique yields duplicate-missing. Repeating the process by summing the squares of the array similarly yields duplicate^2-missing^2. Dividing the first into the second yields duplicate+missing. The two equations together quickly yields both values. FWIW, the sum of the integers from 1 to N is N*(N+1)/2, and the sum of the squares of the integers from 1 to N is N*(N+1)*(2*N+1)/6. However, I feel as if I’m cheating, because I am a mathematician as well as a programmer. 🙂

    • Joris Verschoor on

      Ahhh =)

      Yeah I guess that’s cheating 😉

    • Max Schireson on

      i felt guilty of cheating for answering a fibonacci question with neither iteration nor recursion…

      • Max Schireson on

        But bob was the first to have it correct, even if it was submitted a comment on another post rather than his own post.

  9. Cliff Elam on

    See, I’d have had this done in Excel and be playing Scrabble by now.

    -XC

  10. ken on

    Keep 64 counters, one for each bit of a 64 bit integer. For each entry i encountered, increment the counters for each bit that is on in i. At the end, compare the counters to what you expect. Given that we know there was exactly one entry incorrect, we can tell how its bits differed from what we expected, giving both the missing number and the extra number.

    • Wedge on

      I’m not sure this is sufficient. For example: what if the number 8 is missing and the number 9 is duplicated? All you’d end up with is a difference in the 1’s counter, there isn’t enough information left to tell you about the other bits of the missing or duplicated numbers.

  11. Levi Strope on

    Amazon asked me this same question (nearly verbatim) in an interview and handed me a pad of paper. Fun times.

    • Max Schireson on

      Next time you’ll be better prepared having seen all these responses!

  12. Mike Thvedt on

    let arr = the array, miss = the missing integer, dup = the duplicated integer
    x = the sum of all elements of the array
    I posit that x = dup – miss. x != 0.

    we’re going to partition the array by modulus, and binary search. since we only have to search two “sub-arrays” at once, constant space is required. we perform at most B passes over the array, where B is the bit size of the integers (here 64)

    let search(array) {
    global dup = 0, miss = 0
    search(array, 1, 0)
    return (dup, miss)
    }

    let search(array, modulus, the_filter) = {
    if modulus == 2^64, scan the array for the_filter
    if found, set dup = the_filter; return
    if not found, set miss = the_filter; return
    sum (filter (\x: x mod modulus == the_filter) array)
    //comment: the filter can be done lazily
    if sum == 0, break
    if sum != 0 {
    search(array, modulus * 2, the_filter)
    search(array, modulus * 2, modulus * 2 + the_filter)
    }
    }

    since each ‘sub-array’ contains, for all x, its additive inverse ~x, except for one duplicated x and one missing x, if the sum of the sub-array is 0, then the sub-array has a duplicated x, missing x, or both.

    disclaimer: code is not tested.

    • Mike Thvedt on

      hey, this code is a little sloppy (break and return are confused, for example), but i’m convinced this is the right way to do it

      • Torsak Chaiwattanaphong on

        The recursion can make stack overflow due to the very big number (2^64). You can adapt it as “for loop” procedure by split the array to a smaller part (Ex. 1..100, 101..200, 201..300 and so on) and trying to check just the smaller part in each loop. In this way, you can have a constant space and linear time algorithm and won’t make your stack to be overflowed by a very large number.

      • Mike Thvedt on

        I posted a way less crazy solution here, though it builds primarily off someone else’s solution: http://news.ycombinator.com/item?id=2477401

      • Max Schireson on

        Mike, i’ll check your hacker news solution and comment there since you seem to like it better than this one.

        — Max

  13. itrend on

    Aha, inspired by Joris Verschoor’s half solution: iterate over the array accumulating the sum (S) of the numbers and the sum of the numbers squared (S2). If d is the duplicate and m is the missing then:

    d – m = S – n*(n+1)/2
    d^2 – m^2 = S2 – n*(n+1)*(2*n+1)/6

    We have now reduced the problem to a well known one 🙂

    • itrend on

      Thought I might as well post my quick proof of concept Haskell code for the curious (formatting will probably be lost though):

      import Data.List

      solve a = (dupe, miss)
      where
      dupe = (delta*delta + deltaSq) `div` (delta*2)
      miss = dupe – delta
      n = length a
      delta = sum – (n*(n+1) `div` 2)
      deltaSq = sumSq – (n*(n+1)*(2*n+1) `div` 6)
      (sum, sumSq) = foldl’ (\(sum, sumSq) x -> (sum+x, sumSq+x*x)) (0,0) a

    • Max Schireson on

      (edited) First solution posted as such though we had one from Bob Kimble in a comment earlier so really its second 🙂

  14. Jamie Rumbelow on

    Are the duplicate values consecutive?

  15. zvpebfbsg_fhpxf on

    Treat each array entry as an index back into the array, which induces a linked-list in the array. Find the cycle in this list in constant space (another good interview question). The entrypoint into the cycle is the duplicated value.

    To find the missing value, take the XOR of all numbers 1 to N, with the exception of the duplicated value. Then, XOR this value with each value in the array. The result is the missing number, since all the others canceled each other out.

    I first saw this problem posed by Rapleaf here: https://www.rapleaf.com/careers/challenges#array

    • David K on

      Ah… so I got the cycle detection bit… but was still trying to reverse that and try and find 2 heads to the graph to determine the missing number.

      Neat.

    • dm on

      ( 2, 1, … )

      cycles on the first two values, yet those may not be duplicated in the rest of the input.

    • Max Schireson on

      Not quite the same problem (assuming you mean question 2 in rapleaf’s challenge). For better or worse, theirs is easier.

      I can make the function in rapleaf’s question 1 really really really fast.

      Question 3 is interesting and harder.

      — Max

  16. Aaron on

    //Assuming we can use an iterator style interface with the array :/

    //Also assuming we don’t know the array length in advance so we don’t
    //want to do something like array.length, since that would involve counting
    //the elements, so that’s some extra CPU we don’t want to waste 🙂

    var i=1, skipped = 0, duped = 0;

    while (_next = next(array) && !skipped && !duped)
    {
    if (i == _next) continue;
    else if (i < _next)
    {
    skipped = i;
    i++; // i needs to catch up!
    }
    else
    {
    duped = i;
    i–; // i needs to slow down!
    }

    i++ ;
    }

    //If you wanted to be clever you could condense the if/else into ternary, but then it would be too damn hard to read.

    • Aaron on

      Just saw the assumption that it’s NOT sorted, which seems obvious now. In that case this solution completely fails 🙂

  17. Emanuele Di Saverio on

    Ok let me try…

    we can to find a couple (missing number, duplicated number)

    calculate Sum(1 to N) = N(N+1)/2 (constant)
    calculate Sum of input data (linear in N)
    delta between the two

    now we now that the delta between the missing and the duplicate. If we find one, we find the other. Let’s find the missing

    if delta > 0 , missingN is in the interval [1, N – delta]
    if delta < 0, missingN is in the interval [delta, N]

    now you have the interval of missing

    sum all numbers of the interval (linear in N) , be it S

    cycle through input data,
    for every number in the interval
    subtract from S

    what is left over is missing number.

    1) is that correct?
    2) sure it's ugly algorithm
    3) doesn't use the "64 bit" piece of the problem statement

    had fun trying anyway 🙂

    -Emanuele

    • Max Schireson on

      I believe its not correct. Say delta is -1. Missing and duplicate could be say 8,7 or some such. if N is 100 both missing and duplicate are in the interval and you’d get delta again. Which way you were calculating delta wasn’t quite clear, so I may have a sign reversed but I don’t see an obvious way to fix your approach.
      — Max

  18. Pius on

    This sounds like a job for bitwise XOR.

    • Max Schireson on

      It’s the magic bullet if you have either a missing value or a dupe, but not enough when you have both at once.

  19. Andy Gelfond on

    Works for N < 2^32, O(N) time, O(1) memory (not including array)

    For alpha = 1 to N
    let omega = arr[alpha]
    if omega has its top bit set, you have the duplicate number (clear omega's top bit to get it) and its position (alpha)
    otherwise, set arr[omega]'s top bit

    You can find the missing one by finding the index of the element that does not have its top bit set, either in line, or a separate loop

    You can then clear the top bits in the array

    To try make this work for N < 2^64, I would consider looking at some xor magic, either within the array, or in some extra variable, but I do not have the confidence in this approach to spend the time since it appears all bits would be needed

    Part time job for 1/2 an answer ?

    🙂

  20. Ron on

    Do you want solutions posted here or sent somewhere else?

    • Max Schireson on

      Post away (though I’ll check email to, you can find my address in “about”

  21. Warren Wilkinson on

    CONSTANT total = sum the numbers 0..63
    CONSTANT square_total = sum of squares of numbers from 0..63

    iterate array,
    summing the numbers into sum,
    summing square of numbers into square_sum.

    LET diff = total – sum
    LET square_diff = square_total – sum

    duplicate = (square_diff – (expt diff 2)) / (diff * 2)
    missing = duplicate + diff.

    ————

    How it works.

    Comparing the expected total to the computed total gives the difference between the missing number and the duplicate number.

    missing – doubled = some constant

    Comparing the expected square total to the computed square total gives another equation:

    (missing * missing) – (doubled * doubled) = another constant.

    Now that you have two equations, you can solve for both variables.

  22. David K on

    I’m not looking for a job, but having found this via HN thought I’d give it a whirl.

    You don’t mention… but is the array sorted?

    If it’s not, the solution I found for linear time required 16 exabytes of storage for a bitmap capable of storing the size of the language.

    If it’s sorted, then you could use a very small bitmap to track the last X Ns and detect the dupe whenever you set the same flag, and determine the gap whenever you have a flag not set. This fulfils all of the requirements, small space, linear time, and immutable array… but I’m betting it’s not that simple.

    If it isn’t sorted, then I’ll take another look.

    • Max Schireson on

      It isn’t sorted, and I’m looking for something without the 16 exabytes 🙂
      Under 1 kilobyte is no problem once you see the answer…

      — Max

      • David K on

        I’m thinking if the array is treated as a directed graph, then cycle detection would find the dupe. I’m less sure how to make that detect the missing element though.

        Might knock something up to test the theory. Seems like a tortoise + hare linked list, cycle detection thing.

        At least… it’s the only way I currently have to do away with a bit map.

  23. Chintan Shah on

    If SEQ is the array,
    D is the duplicate number ,
    and M is the missing number, then

    First compute sum of all elements in SEQ. Let it be SUM.
    (I) N*(N+1)/2 – SUM = D – M /*(Using the identity for sum of first N numbers.)*/
    (II) N*(N+1)*(2N+1)/6 – sum(arr[i]^2 = d^2 – m^2 /*(Using the identity for the sum of squares of first N numbers)*/

    We have 2 equations and 2 unknowns. Solving these
    we can get D and M.

    Since N can be large, to avoid overflow we should do the left sides of (I) and (II) through a loop instead of using the formula directly.

    The loop for LHS of equation (I) is
    int lhs_1st = 0;

    for (int i = 0; i < SEQ.length; i++)
    {lsh_1st += i + 1 – SEQ[i];}
    And the loop for LHS of eq (II) is
    int lhs_2nd = 0;
    for (int i = 0; i < SEQ.length; i++)
    {lhs_2nd += (i + 1 + SEQ[i]) * (i + 1 – SEQ[i]);}
    From (I) we have
    (III) lhs_1st= D -M
    From (II)
    (IV) lhs_2nd = D^2 – M^2
    Divide D by C we have
    (V) lhs_2nd/lhs_1st = D + M

    Solving (III) and (IV)
    D = (lhs_1st + lhs_2nd/lhs_1st) / 2
    M = (lhs_2nd/lhs_1st – lhs_1st ) / 2

    • Chintan Shah on

      Java code for the above:

      public static void findMissingAndDuplicate(int[] seq) {
      long eq1 = 0;
      long eq2 = 0;
      for (int i = 0; i < seq.length; i++) {
      eq1 += i + 1 – seq[i];
      eq2 += (i + 1 – seq[i]) * (i + 1 + seq[i]);
      }
      long dup_minus_missing = eq1;
      long dup_plus_missing = eq2 /dup_minus_missing;
      long M = (dup_plus_missing + dup_minus_missing ) / 2;
      long D = (dup_plus_missing – dup_minus_missing ) / 2;

      System.out.println("Missing number in the array: " + M);
      System.out.println("Duplicate number in the array: " + D);
      }

      • Max Schireson on

        Nicely done, basically correct. Accumulating along the way reduces overflow worries.

        That said to nitpick I think you still have some overflow worries. Do you see where?

        – Max

  24. Nick Parker on

    I think something like this might do it:

    int findMissingElementFor (int [] items, int size){

    int sum = 0;
    int sumOfSquares = 0;
    int bound = size – 1;
    int sumOfFirstNaturalNumbers = (bound * (bound + 1)) / 2;
    int sumOfSquaresOfFirstNaturlNumbers = (bound * (bound + 1) * (2 * n + 1)) / 6;

    for(int index = 0; index < size; index++){
    sum = sum + items[index];
    sumOfSquares = sumOfSquares + (items[index] * items[index]);
    }
    return ((sumOfFirstNaturalNumbers – sum) + (sumOfSquaresOfFirstNaturlNumbers – sumOfSquares) / (sumOfFirstNaturalNumbers – sum)) / 2;
    }

    • Max Schireson on

      Correct but for some overflow worries 🙂

      Nice job.

      — Max

  25. Felix on

    I’m stuck and curious if I have chosen the right way.

    I loop through the array and update two values:

    sum = n1 + n2 + … + nN
    x = n1 xor n2 xor … xor nN

    In the end, I know that
    (1+N)*(N/2) – missing + duplicate = sum
    and
    missing xor duplicate = x

    I guess this could be enough, but I’m not sure how to proceed.

    • Max Schireson on

      On a good path but not quite there. You have two equations in two unknowns but they aren’t independent enough to solve. Say (duplicate-missing) is 1 and the difference is 1. All you know is that they’re consecutive and dup is odd.

      — Max

  26. Fedor Indutny on

    // code javascript
    var sum = 0,
    xor = 0;

    for (var i = 0; i < N; i++) {
    sum += i – a[i];
    xor ^= a[i];
    };

    var duplicate = xor,
    skipped = xor + sum;

    • Fedor Indutny on

      If job you’re offering will consist of solving such kind of tasks – I’m definitely up for it! 🙂

      • Rolken on

        It’s easy to demonstrate that your code doesn’t work. If you add the value N+1 to the end of the array, you’ll get that piled into an entirely different xor, so it can’t be the duplicate.

      • Max Schireson on

        It will definitely have some challenging problem solving 🙂

    • Fedor Indutny on

      Addition and subtraction can be made in 64bit algebraic field.

    • Martin on

      I am trying your code with this data set:

      var a=[10,2,6,9,8,5,9,1,3,7];

      (where, 9 is dupe and 4 is miss) And, It does not seems to work…

      • Fedor Indutny on

        Oh, missed that numbers should start from one.

  27. Attila Gazso on

    Calculate the sum of 1..N (S), and calculate the sum of the given numbers (T). The diff equals T-S=-x+y where x is the missing number, and y is the dupe.

    Calculate the product of 1..N (P), and calculate the product of the given numbers (Q). P/x*y=Q or Q/P=y/x

    Now we have two equations and two unknown values. From here it’s basic math.

    • itrend on

      To store a value X, you need log(X) bits, hence you need log(N!) bits to store the value of N!. However, log(N!) is O(N*logN) so your solution actually uses [worse than] linear space (see http://en.wikipedia.org/wiki/Factorial#Rate_of_growth ).

      • Attila Gazso on

        You are right, instead of using product, it would be better to use sum of squares.

    • Max Schireson on

      Your corrected approach is correct. Congrats.

  28. Stephen A on

    a = given array, d = duplicate, m = missing
    diff = (sum [1..n]) – (sum a) — m – d
    nfact = n!
    fdiff = nfact – (foldl1 (*) a) — n!(m-d)/m
    m = nfact * diff / fdiff
    d = m – diff

    • Max Schireson on

      if the list is really big, calculating n factorial is a problem. say n = 1 billion…

      can you come up with an easier way?

      — Max

      • S.G. on

        Take lg of x_i instead of multiplying x_i.

        PI(1,N){x_i} => SIGMA(1,N) lg{x_i}

      • Max Schireson on

        can you come up with a solution that doesn’t include floating point math?

      • S.G. on

        Without using FP calculations

        Let y = missing number, x = duplicate
        Let alpha = SUM(1,N) = ((N+1)N)/2
        Let beta = Sum(array)
        => alpha – beta = y – x

        Let gamma = SIGMA(1,N){x_i}^2 = (N(N+1)(2N+1))/6
        Let delta = Sum(array value^2)
        => gamma – delta = y^2 – x^2 = (y – x)(y + x)

        ==> (gamma – delta) / (alpha – beta) = (y + x)
        ===> (y+x) + (y-x) = 2y / 2 = y. Solve for x however you desire.

    • Max Schireson on

      got it with the correction, nice job.

  29. Michal Illich on

    1. Calculate Sum(1..N) – N*(1+N)/2 = DUP-MIS

    2. Calculate MultiplyAll(1..N) / N! = DUP/MIS

    3. You have two equations with two variables, so you just solve it.

    Linear time, very small constant space used.

    • Michal Illich on

      Or Xor instead of multiplication would work also.

      (2. XorAll(1..N) Xor XorOfAllNumbersFrom1toN = DUP XOR MIS)

    • Michal Illich on

      Well, the best second step is sum of squared numbers – but I have to admit that this idea isn’t mine.

      • Max Schireson on

        Yes; the XOR doesn’t actually get you enough additional information to solve. As discussed in some comments the original approach with multiplication is not constant space (more than linear).

        — Max

  30. Drake Wilson on

    Well, I’m going to assume you want it in a comment since I don’t see another preferred contact method in the post.

    Something that is quite clean mathematically and has the right complexity but isn’t necessarily good-performance on the microscale:

    overwrite_delta = sum(1..count) − sum(array)
    squares_delta = sum((x → x^2) (1..count)) − sum((x → x^2) array)
    overwrite_sum = squares_delta / overwrite_delta
    duplicate = (overwrite_delta + overwrite_sum) / 2
    missing = overwrite_sum − duplicate

    Of course you probably want to unify the loops so you don’t do multiple passes over the array and thrash the cache.

    Possible problem: this requires integer arithmetic with integers of something like 3*ceil(log2(N)) bits (haven’t done the proof for that part). We know that N ≤ 2^64 − 1 (otherwise the array is impossible), so it’s still O(1) with “small” amounts of extra space (a few hundred bytes?). Nonetheless the big integers might turn out to be very inconvenient to implement if you have to make them yourself.

    It also requires O(N) 64×64-bit multiplications and a single giant divmod near the end. (The / 2 can be a cheap bitshift.) If N is large enough and if the ALU/memory speed is high enough, I suspect this is okay, especially since the multiplications will parallelize well with memory accesses, but I haven’t measured.

    Whether the microperformance issues matter depends on the situation.

    Email me if you want to see the C99 code that handles the 20-bit case (I didn’t want to bother with the large integers for testing).

    • Drake Wilson on

      An extra note: I see a lot of earlier solutions and/or pseudo-solutions (haven’t analyzed them all) with a combination of summation and mass XOR. I fiddled with something along those lines but decided not to go that route. This was partly because it took longer than a few minutes and looked like a potential dead end (and I didn’t want to spend the entire day on this), and partly because this answer has amusing implementation characteristics while being very easy to understand and fitting the constraints. (Assuming a cute sum+xor solution exists, it’d be better for a continuous process in actual production, but this was a puzzle, so.)

      • Max Schireson on

        XOR is a dead end, feel glad you didn’t spend more than a few minutes.

        Your solution is good. Good job catching the overflow worries.

        — Max

      • Max Schireson on

        Upon reading further, I found a solution which salvages the XOR (though not in conjunction with the sum; he found a way to do an additional XOR to get the right answer), so it wasn’t a total dead end (thought it seemed that way to both of us!)

  31. Frank Denis on

    Horrible solution (and it assumes that numbers are starting at 0) but I wanted to quickly do something before going to bed:

    bogus = []
    a.each_with_index { |n, i| a[n], a[i] = n, a[n] }
    a.each_with_index { |n, i| bogus << n << i if n != i }
    puts "Bogus numbers: #{bogus.uniq.join ' and '}"

  32. Alex Lerman on

    expectedSum = 0
    actualSum = 0
    expectedFactorial = 1
    actualFactorial = 1
    for( i=0; i<n; i++) {
    expectedSum += ( i + 1 )
    actualSum += arr[i]
    expectedFactorial *= ( i+ 1 )
    actualFactorial *= arr[i]
    }

    diff = actualSum – expectedSum = a -b
    quotient = actualFactorial / expectedFactorial = a / b

    We have 2 equations and 2 unknowns, go forth and calculate. ( We can use the sign of the diff to determine which is the missing and which is the duplicate. )

    • Lim Kim Yong on

      Not a very practical answer. for large N the factorial is going to overflow pretty quickly.

  33. Fahad on

    “You know that every integer 1-N appears once in the array,” – Based on this I’m assuming by 1-N you mean 1 to N. I may be wrong, but here is my solution for this,

    1) Sort the N-array Or have the array maintained in sorted order.
    2) Form tuples with pair-wise order (as tuples), which can be described as below,

    >>>

    returns a sequence of each element in input sequence and its predecessor, with the exception of the first element which is only returned as a predecessor of the second element.

    If there is a sequence Seq then it returns Seq. As an example, if we have the N-array (sorted that is),

    1, 2, 3, 4, 5, 5, 7, 8, 9…. N

    the sequence generated is,

    (1, 2)
    (2, 3)
    (3, 4)
    (4, 5)
    (5, 5)
    (5, 7)
    (7, 8)
    (8, 9)

    (N – 1, N)
    >>>

    We have three check conditions here,

    for el in [pair-wise array] do
    if second(tuple) – first(tuple) = 1 then
    “Normal couple” // just go the next item in the pair-wise array
    elif second(tuple) – first(tuple) = 0 then
    “Duplicate”
    elif second(tuple) – first(tuple) > 1 then
    Missing = first(tuple) + (second(tuple) – first(tuple) – 1) // example: Missing = 5 + (7 – 5 – 1) = 6

    The above algo is simple and linear without doing any rescan of the whole array or finding anything. The alg just checks these items when it is looping thru, which I think should be done.

    I’m thinking in terms of F# language here. Let me know what you think on this.

    -Fahad

    • Max Schireson on

      Sorting the array takes more than linear time, and you need separate (linear) space to do it unless you want to clobber the array you got (which I think I asked you not to do).

      — Max

  34. Bala on

    I have seen few solutions above and all are suggesting either sorting, additional O(N) storage or map reduce(involves sorting internally) solutions, when this can be done without any storage.

    Let suppose the duplicated number is X, and missing number is Y.

    If you sum numbers from 1 to N, let it be SN
    Sum the numbers in the array, lets call it SM

    SN – SM = Y-X

    Now, you just need one more equation in X and Y to figure out X and Y. We can either choose either to multiply 1 to N and divide by the array product.This will give Y/X. One might be concerned that product of N 64-bit integers might overflow.

    Other option is to use XOR from 1 to N, and then XOR the given array. So we get X xor Y.

    — I have other solution in mind.

    Because the elements are from 1 to N, one could put element in the correct index and find the duplicated number. To make sure that we don’t keep visiting the same elements again, we can negate the numbers. And in one more pass, one could reverse the sign of the integers. The code will look as below.

    public class FindMissingAndDuplicate {

    public static void main(String[] args)
    {
    int arr[] = {4,5,2,3,6,9,6,1,8,0};

    int duplicateNumber = -1;

    for(int i = 0; i < arr.length; i++)
    {
    int j = i;

    while(arr[j] != j)
    {
    if(arr[arr[j]] == arr[j])
    {
    duplicateNumber = arr[j];
    break;
    }
    int temp = arr[arr[j]];
    arr[arr[j]] = arr[j];
    arr[j] = temp;

    j = arr[j];
    }

    if(duplicateNumber != -1)
    break;
    }
    System.out.println("Duplicate Number " + duplicateNumber);

    int sum = 0;
    int sumi = 0;
    int missingNumber = 0;
    for(int i = 0; i < 10; i++)
    {
    sumi += i;
    sum += arr[i];
    }

    missingNumber = sumi + duplicateNumber – sum;

    System.out.println("Missing Number " + missingNumber);

    }

    }

    • Max Schireson on

      I haven’t quite verified if the code matches the algorithm you described, but I asked you to leave the original array untouched.

      Agree with your concerns on overflow of the products.

      XOR doesn’t get you enough additional information.

      — Max

  35. Fahad on

    There seems to be a smiley in my last example, must be due to some combination, do ignore it!

  36. Daniel Rodríguez on

    Supposing we have a Big Integer library for doing the maths without overflowing. Let’s call x and y to the missing and the repeated numbers.
    1. Sum all the array.
    2. Calculate N(N+1)/2.
    3. Calculate (1)-(2). Call it Z. That’s equal to x – y, so x – y = Z.
    4. Sum the squares of all the array.
    5. Calculate N(N+1)(2N+1)/6.
    6. Calculate (4)-(5). Call it W. That’s equal to x^2 – y^2, so x^2 – y^2 = W.
    7. Substitute (3) into (6). (Z+y)^2 – y^2 = W -> Z^2+y^2+2Zy-y^2 = W -> y = (W-Z^2)/2Z.
    8. Substitute (7) into (3). x = Z – (W-Z^2)/2Z.

    Linear time, constant space.

    • Max Schireson on

      correct.

      • liruqi on

        if you can ensure that N is less than 2^63 (and permit write operation on the array), that is fit signed 64 bit int type, there is a way quicker:

        calculate x – y = Z as above (step 2)

        while sweeping through the array($arr), for the $i’th array item, set the corresponding index value highest bit, that is: $arr[ $arr[i] ] |= (1<<63), and check the highest bit before this operation… then we can find the duplicated y in this way.

        the point is using the highest bit to find duplicated one.

      • Max Schireson on

        interesting suggestion. you’re kind of sneaking in a bit vector in unused space 🙂

        ironically if the array isn’t in physical ram (a real possibility with a large data set) it will be _much_ slower this way as you’ll be swapping things in off disk to manipulate that one bit.

        if its all in RAM and apart from the space, the bit vector will be fast.

        — Max

  37. Zack Maril on

    Not a computer science dude, but try counting digits:

    If you have a set of numbers ranging 1 to N each represented by 64 bit integers, then you know that there needs to be a certain number of repetitions of 0’s and 1’s within the various places, i.e. 8 1’s in the one’s place, 8 0’s in the ten’s place, etc. etc. The expected values can be calculated explicitly from the value of n.

    Tally up the number of 0’s and 1’s from the actual set of n integers and then compare that to the calculated expected total. If you think in terms of digits, a repetition will add one to the expected values across the board and a deletion will subtract one across the digit matrix. If it were just an either addition or deletion, then you would just look in each column for the digit which had an incorrect value compared to the expected and then put all of the erroneousness digits together into the missing or repeated number.

    But when there are two operations on the set of numbers it gets way more complicated. Let’s say you have n=1111 and you repeat 1110 and delete 1100. Normally you would just look for the varying values. But this time you would see that only the second place would show any error. The deletion and repetition cancelled each other out! (Note: this is the worst case scenario and the following method should work for something like n=1111 and 1110 1101 but faster.)

    But we do know in which place they varied! We know we have one too many 1’s and one too few 0’s in the second place. Do the same tallying routine again but this time separate the set of numbers into those that have 1’s in the second place and those that have 0’s in the second place. Since we know that these sets cannot intersect by definition, when we run the tally algorithm on them, we know there will only be one error in each resulting table of information. If you calculate a modified digit matrix holding the correct values constant, i.e. all integers less than n with 1 in the second column, then compare that to actual digit array, using the same variation method you should be able to just read the integer off the table.

    Since you are only searching through the array 3 times and tallying up things into a 2 by 64 array of characters, the space and time feels like it would be small. If done correctly, the code for it shouldn’t be too massive and it feels like it would be about as fast as you can search through the array. It could probably generalize to N errors of the same types, but you would need a better splitting algorithm for when things overlapped.

    But, I haven’t really formally studied computer science like this much so I could be way out in left field here. Hope this works!
    -Zack

    • Max Schireson on

      Good thinking. Your solution is unique but I think describes a workable approach. Very strong problem solving. If you are the Zachary Maril who is a freshman at Texas A&M, study computer science and math and you’ll do well.

      Though you take 3 passes through, you can actually do the bitwise comparisons very fast. It wouldn’t surprise me if your solution is actually faster than many of the others that take one pass.

      Do you know the XOR function? To accomplish the first two paragraphs of your solution, all you need to do is loop through the array; start with zero and for each element XOR the value and the index with the previous XOR result.

      Congrats on a unique solution.

      — Max

      • a non on

        Can you explain a little more about the xor method? I understand if you were to xor the array values (XA), xor with the expected xor values (XN), then XA^XN = M^N. But I am a bit fuzzy on what the second and third passes are supposed to do…

      • Zack Maril on

        Cool! I had heard of the XOR function before this, but computer logic isn’t a strong point of mine so I went with something that seemed right. XOR would have made things much easier to explain though in retrospect.

        More thoughts on this digit matrix stuff:
        1. When you have overlapping errors, you can just do one more pass after the initial one. You can be populating both of the new arrays at once with two conditionals. It’s just a two pass technique.
        2. I think there is potential for one more trick to get it down to one pass. If you look at the numbers in different bases, then the digits in the corresponding representations should fit a new pattern for numbers of digits.
        New example: n= 1111111 with repeated 1110111 and deleted 1100111.
        Converted:
        B2- 1111111 1110111 1100111
        B6- 331 315 251

        So you would get 2 tables like this after subtracting the expected values:

        Binary
        Index: 6 5 4 3 2 1 0
        0: 0 0 1 0 0 0 0
        1: 0 0-1 0 0 0 0

        Base 6:
        Index: 2 1 0
        0: 0 0 0
        1: 0 1-1
        2: -1 0 0
        3: 1 0 0
        4: 0 0 0
        5: 0-1 1
        And so you can just read a repeated 315 and a deleted -251 off the chart!

        Drawing from previous solutions, you would only need two arrays for this method. You can populate the arrays at the very beginning with the expected values and then subtract off the digit representations as you go.

        For this specific example, you would only need a 64 by 2 array for the binary and then a 25 (rounded up of log(2^64)/log(6)) by 6 array for the base 6 one.

        Problems/Ideas:
        Don’t pick a base that is a power of two. After some experimentation, you’ll see that the presentation continues to vary only in one spot from the binary due to the way the groupings of digits work.
        In this example I picked 6 because it made everything work, but I am not sure how the numbers affect what the correct base is yet. There probably isn’t a “perfect” base for every example. There might be a way to make a really good guess though. Not sure.
        The real problem is that I think that shifting bases moves it out of the bitwise operations space and into a slower more complicated space. Especially if you are shifting into a base that isn’t a power of 2.
        I think it could deal with more errors than just 2 by playing it like a weird suduko. Every time you read a new deleted or repeated number from the charts, you update all the charts and keep going until you have no more errors on any of the charts.
        A problem is that you could have something like n=999 in binary and then delete/repeat 919/929 and 828/818. The errors would end up all canceling each other out in a decimal representation, but hopefully would still remain in the other base representations. I haven’t found an example where it would cancel on 2 different base charts though but I haven’t proved it either way.

        I think this is still within the solution space of this problem. It takes a constant space up and takes only one pass. If you can make a good guess for the base, then this would be fun to use.

        Thanks for the interesting problem! (Sorry for the long responses with no code.)

  38. Torsak Chaiwattanaphong on

    sum = 0;
    sumIndex = 0;
    array[0…2^64];
    bool firstNotEqual;
    for (int i=1; i sumIndex) println(array[array[i]] + ” missing”);
    if ( sum < sumIndex) println((array[array[i]]-1) + " dup");
    break;
    }
    }

    you can find duplicate value by calculating:
    dup = (sum(1..N) – sum(array[1]..array[N])) + missing value
    or
    missing = (sum(1..N) – sum(array[1]..array[N])) + dup value

  39. Torsak Chaiwattanaphong on

    sum = 0;
    sumIndex = 0;
    array[0…2^64];
    for ( i = 1 to i sumIndex ) then array[[array[i]] is missing then break
    if ( sum < sumIndex ) then array[[array[i]] -1 is dup then break
    endfor

    you can find duplicate value by calculating:
    dup = (sum(1..N) – sum(array[1]..array[N])) + missing value
    or
    missing = (sum(1..N) – sum(array[1]..array[N])) + dup value

  40. Mike D on

    2 step solution:
    Duplicate:
    public static int findWithXOR(int[] input){
    int xorArray=1;//,dup=0;
    for(int i=2; i<=(input.length-1);i++){
    xorArray = xorArray ^ i;
    //System.out.println(“XOR : ” + xorArray);
    }
    System.out.println(“XOR Array : ” + xorArray);
    for(int k=0;k<=input.length-1;k++){
    xorArray = xorArray ^ input[k];
    System.out.println(xorArray);
    }
    //System.out.println(“Duplicate : ” + dup);
    return xorArray;
    }

    Missing:
    int sum = 0;
    int idx = -1;
    for (int i = 0; i < arr.length; i++) {
    if (arr[i] == 0) {
    idx = i;
    } else {
    sum += arr[i];
    }
    }

    // the total sum of numbers between 1 and arr.length.
    int total = (arr.length + 1) * arr.length / 2;

    System.out.println("missing number is: " + (total – sum) + " at index " + idx);

    • Mike D on

      no comment?

      • Max Schireson on

        I think there are a few problems with your code. I don’t think the dup will work, nor the missing?

        — Max

  41. Kunal on

    2 Arrays, 1 of the unordered list of values. One of 0 to n-1 all defaulted to zero. Iterate through the unordered list and mark a -1 on the index that correlates to. After you are done, scan through the array and -2 is the dupe, 0 is the missing element.

    • Martin on

      Yeah, but first you need to generate the “good” array, with highs N, it will take time only to generate that array!

      I think that this is not what they want.

  42. Justin Holmgren on

    Call the missing number x and the additional number y.
    We will find y – x and y^2 – x^2. Then dividing, we can get y + x which will in turn give us y and x.

    Sum all of the integers. This will be N(N+1)/2 + y – x
    Sum all of the integers squared. This will be N(N+1)(2N+1)/6 + y^2 – x^2

    Now you’re sort of done. To make it actually constant space, you should do all of your summing mod 2^256, since y^2 – x^2 and y – x will certainly be between -2^128 and 2^128 (so when you find them mod 2^256, that’s the same as finding their actual values)

    • Justin Holmgren on

      Oops instead of mod 2^256, I meant mod 2^129

  43. Carl-Johan Blomqvist on

    A solution from a business person’s point of view would be to post on a geek forum asking how to solve the issue… 😉

    • Vivek on

      Ok You should failed for using the words “geek forum” ….

  44. Charles Keepax on

    See I would have disabled comments on this post. I take it posting a correct answer would be uncool?

    • Max Schireson on

      I’m allowing interesting attempts because I think the discussion is fun, but holding the correct answers a little while 🙂

  45. sparc5 on

    Assume the duplicated one is y and missing one is x while input array is
    array[1..n]

    sum = 0
    for ( int i = 1; i <= n; i++ ) {
    sum += array[i];
    }

    // diff is (y – x)
    diff = sum – (n(n – 1)/2)

    prod_sum_diff = 0;
    for ( int i = 1; i <=n; i++ ) {
    prod_sum_diff += array[i] * (array[i] – diff ) – i * (i – diff );
    }

    // pro_sum_diff is y * (y – (y – x ) ) – x * ( x – ( y – x ) ) = – 2x^2

    x = sqrt( – ( prod_sum_diff / 2 ) )
    y = diff – x

    complex is o(n)

  46. Jason Zhang on

    Assume the duplicated one is y and missing one is x while input array is
    array[1..n]

    sum = 0
    for ( int i = 1; i <= n; i++ ) {
    sum += array[i];
    }

    // diff is (y – x)
    diff = sum – (n(n – 1)/2)

    prod_sum_diff = 0;
    for ( int i = 1; i <=n; i++ ) {
    prod_sum_diff += array[i] * (array[i] – diff ) – i * (i – diff );
    }

    // pro_sum_diff is y * (y – (y – x ) ) – x * ( x – ( y – x ) ) = – 2x^2

    x = sqrt( – ( prod_sum_diff / 2 ) )
    y = diff – x

  47. Gonzalo on

    In one pass you can sum the numbers and sum the squares of the numbers.

    Their sum reveals X – Y.

    Their sum of squares reveals X^2 – Y^2 = (X – Y)(X + Y).

    Assuming N < 2^63, it suffices to use arithmetic in 64 bits for the sum and 128 bits for the sum of squares. Memory usage is 24 bytes.

    • Gonzalo on

      Actually, you could choose a prime p close to 2^64, and do all the arithmetic mod p. Assuming N < p, then X-Y is always invertible.

      Memory usage would then be just 16 bytes.

      • Max Schireson on

        Very good. Assume you’re sticking with the math not wanting a programming job 🙂

        — Max

  48. Jason Zhang on

    A little update
    ———————
    Assume the duplicated one is y and missing one is x while input array is
    array[1..n]

    sum = 0
    for ( int i = 1; i <= n; i++ ) {
    sum += array[i];
    }

    // diff is (y – x)
    diff = sum – (n(n – 1)/2)

    prod_sum_diff = 0;
    for ( int i = 1; i <=n; i++ ) {
    prod_sum_diff += array[i] * (array[i] – diff ) – i * (i – diff );
    }

    // pro_sum_diff is y * (y – (y – x ) ) – x * ( x – ( y – x ) ) = – 2x^2

    x = sqrt( – ( prod_sum_diff / 2 ) )
    y = diff + x

    • Max Schireson on

      I will get back to yours later I think it is correct but I am too tired to check the prod_sum_diff math right now 🙂

  49. Andrew on

    // PHP Code
    // I assume that the array contains consecutive integers,
    // is sorted ascending, and the first element contains 1

    $numbers = array(1, 2, 3, 4, .. , N);

    for ( $i=0; $i< sizeof($numbers) ; $i++ )
    {
    $value = $numbers[$i];
    $estimate_value = $i + 1;
    if ($value != $estimate_value)
    {
    echo $value . " is the duplicate value\n";
    echo $estimate_value . " is the missing value\n";
    break;
    }
    }

    • Max Schireson on

      The problem is too easy when they’re sorted. Try it without that assumption.
      — Max

  50. Lim Kim Yong on

    My solution if N is even

    consider the case that then extra number is odd and missing number is even and extra number is even and missing number is odd:

    add all even value numbers and add all odd value numbers

    you can find the extra and missing numbers by finding the
    difference between the calculated sum and the sum of all the terms from 1 to N (for even and odd)

    if the additional numbers are both odd or even, you can find
    that by looking at the number of odd/even numbers.

    just take the respective group of numbers (even or odd)
    and make a 1 bit shift to the right thus removing the
    LSB. Recurse the above steps.

    Best case O(1), worst case O(2N), requires at least N/2
    storage components.

    • Lim Kim Yong on

      I meant at most N/2 storage components.

      • Max Schireson on

        Unique!
        But if say the extra and missing numbers are divisible by a large power of 2, you have to repeatedly remove bits and re-count to see if you’ve split them yet. That requires either linear storage or up to log n passes unless you have a clever way around that?

  51. Igor Evsukov on

    Kids stuff, I’ve solved that kind of problems in 9nth grade: no jokes, really, I studied in very strong math school. Please tell email address to which I should send solution(description + source code) ?

    • Igor Evsukov on

      Sorry, I need to go, I’ve sent email with description and source code to Your personal email on Gmail.

      • Max Schireson on

        I got your email and your algorithm is correct. Haven’t checked your code yet.

    • Max Schireson on

      The soviet system for all its flaws had some spectacular math schools. A friend of mine had two of his math teammates from high school win the Fields medal (only given every 4 years).

  52. ijrussell on

    You need four values;

    sum{1..n} n
    sum{1..n} value
    sum{1..n} n ^ 2
    sum{1..n} value ^ 2

    The difference between the sum of n and value gives you the difference between the duplicate and missing values.

    Then using ((x ^ 2) + ((x + difference) ^ 2) = abs(sum of n ^ 2 – sum of value ^ 2), solve for x.

  53. AS on

    Here’s a solution that runs in O(n) time and O(n) space.

    import java.util.HashSet;
    import java.util.Set;

    public class Test {

    public static void main(final String[] args) {
    final long[] data = {1, 2, 3, 4, 5, 6, 7, 8, 3, 10};

    final Set seen = new HashSet();

    long duplicate = -1;
    long difference = 0;

    for (int i = 0; i < data.length; i++) {
    final long current = data[i];

    if(!seen.add(current)) {
    duplicate = current;
    }

    difference += (current – i – 1);
    }

    System.out.println("duplicate " + duplicate);
    System.out.println("missing " + (duplicate – difference));
    }
    }

    • AS on

      Here’s an improved solution. Runs in O(n) time and O(1) space.

      import java.math.BigInteger;

      public class Test {

      public static void main(final String[] args) {
      final long[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 2};

      BigInteger sumDifference = BigInteger.ZERO;
      BigInteger squaredDifference = BigInteger.ZERO;

      for (int i = 0; i < data.length; i++) {
      final BigInteger current = BigInteger.valueOf(data[i]);
      final BigInteger index = BigInteger.valueOf(i + 1);

      sumDifference = sumDifference.add(current).subtract(index);
      squaredDifference = squaredDifference.add(current.pow(2)).subtract(index.pow(2));
      }

      final BigInteger duplicate = sumDifference.pow(2).add(squaredDifference).divide(sumDifference.multiply(BigInteger.valueOf(2)));
      final BigInteger missing = duplicate.subtract(sumDifference);

      System.out.println("duplicate " + duplicate);
      System.out.println("missing " + missing);
      }
      }

  54. Brian on

    I’d ask the guy who makes the array generation routine to cache the omitted and duplicated values. Then I’d grab a pipe wrench and chase down anyone who alters my array without accessors.

    • Max Schireson on

      Your colleagues need to either write solid code or wear a helmet i guess!

  55. Scott Ruhnau on

    In this problem let X be the missing integer and Y be the repeated integer. Now we have 2 variables and as such need two equations to solve them.

    First we sum the number from 1 to N, a linear time operation. Call this sum A. Also sum the integer in the array call this sum B. Perform a subtraction: B – A = C. Now we know that Y – X = B – A = C.

    Now for the second equation we repeat the process with multiplication. At this point we must make one of two assumptions either we will not overflow a 64 bit integer or that we can perform effectively constant time multiplications on whatever size integer we need. So take the product of the number from 1 to N call this D. Take the product of the numbers in the array call this E. Divide them E/D = F. We know that F = Y/X.

    Now we have two simple equations and can perform a simple algebraic solve:

    F = Y/X
    Y = FX
    C = Y – X
    C = FX – X
    X = (F-1)/C
    Y = F(F – 1)/C

    The problem here comes from the assumption of constant time multiplication.

    • Scott Ruhnau on

      It also uses constant space which is the hard part of this problem. Without the space and rearrangement constraints simple solutions using a radix sort or using the idea of counting sort would be possible.

      • Max Schireson on

        Scott,
        Correct given constant time and space to multiply, but the numbers get linearly longer if you use say a bigint, so it actually isn’t constant space.
        — Max

      • Scott Ruhnau on

        Fair enough. The sums of squares solution may also overflow a 64bit integer though. Less likely and too a lesser but still a possibility you may end up with 16777216 + a lot of other stuff.

        But yes my solution was inferior, haha. Poor me no interview.

      • Max Schireson on

        Much less likely. Sum of squares will overflow around a million elements but if you use a bigint it will be 192 bits or less. But when the sum of squares is 192 bits, the product will take exabytes to store!

        You can always send in a resume!

        — Max

      • Scott Ruhnau on

        I think I have a solution to the size problem. We can use the properties of multiplication and division to divide as we go. So instead of multiplying each array element and then dividing. We can setup a variable, F in the above, initialized to 1. As we iterate from 1 to N through the array we can multiply F by the array element and divide it by the index. For example on the first element: F = (F* Array[1])/1 and for the ith element: F = (F*Array[i])/i

        This way we end up with the same end result but can use a much smaller amount of space.

      • Max Schireson on

        Interesting and makes incremental improvements, but doesn’t solve it. In a worst case where array[i] is reverse sorted, the first 2^31 operations will all at least multiply the result by 2 so you will have a gigabytes-sized product. There is a clever way to avoid this that so far no-one has suggested (hint: requires some advanced math).

        — Max

  56. Max Schireson on

    Just published the correct answers (which I had been holding back for a little while) along with a number of other posts, including a funny one from Brian.

    I’ll catch up on responding individually after dinner.

    Thanks for a lot of discussion, and a few good candidates!

    This is fun, we’ll have to do it again.

    — Max

    • pepe on

      Where is the answer?

      • Max Schireson on

        Sums and sums of squares is by far the most popular solution.

        There was a unique solution which used XOR. I will try to paraphrase it here somewhat more concisely than originally stated; hopefully the rest of the posts provide enough context to follow:
        1. XOR the values and the indexes, the result is DUP XOR MISSING
        2. Choose one non-zero bit of that, say bit j.
        3. XOR again the values and indexes – but only when bit j is zero. This will give one of DUP or missing, but you won’t know which
        4. Then go back through and look for those values to see which is which (omitted from the post but I was so impressed with the originality I didn’t quibble)
        3 passes but no bigints (you can do the sums mod 2^64).

        — Max

    • a non on

      Where are the answers? Seems like calculating sum and square is the only right way…

      #!/usr/bin/env perl

      my $len = 10;
      my @a;
      for my $i (1 .. $len) {
      $a[$i-1]=$i;
      }

      my ($m, $n) = (0,0);
      while ($m == $n) {
      ($m, $n) = (int(rand($len)),int(rand($len)));
      }
      $a[$m] = $n+1;

      print “Missing : “.($m+1).”\n”;
      print “Duplicate: “.($n+1).”\n”;

      print join(“,”,@a).”\n”;

      shuffle(\@a);

      print join(“,”,@a).”\n”;

      my ($na, $nn, $sa, $sn) = (0,0,0,0);

      my $sn = ($len*($len+1))/2;
      my $nn = ($len*($len+1)*(2*$len+1))/6;

      for my $i (1 .. $len) {
      $sa += $a[$i-1];
      $na += $a[$i-1] ** 2;
      }

      my $sd = $sa – $sn;
      my $nd = $na – $nn;

      my $n2 = ($nd – ($sd**2))/(2*$sd);
      my $m2 = $sd + $n2;

      print “Missing : $n2\n”;
      print “Duplicate: $m2\n”;

      sub shuffle {
      my $array = shift;
      my $i;
      for ($i = @$array; –$i; ) {
      my $j = int rand ($i+1);
      next if $i == $j;
      @$array[$i,$j] = @$array[$j,$i];
      }
      }

      • Max Schireson on

        Basic algorithm looks good and i think you got all the +1’s and -1’s correct.

        Don’t know perl so i could easily have missed overflows, and didn’t check the shuffle since generating the data set wasn’t part of the problem.

        — Max

  57. Binh Nguyen on

    Called the missing number m, duplicate d.
    We will find m+d and m-d. Then it’s trivial to find m and d.

    1) Find n XOR d:
    X_array = 1;
    I_array = 1;
    for (i=2..N)
    X_array = X_array XOR A[i]; I_array = I_array XOR i;
    endfor

    Then m XOR d = I_array XOR X_array

    2) Find m – d
    It will be a little bit more complicated. The idea is to get the sum of (1..N) subtract the sum of our array. But we have to be careful not to over or underflow the result.

    Diff = 0;
    i = j = 0; // i index I, j index A
    while (i < N) && (j < N)
    if (Diff + i < N-1)
    Diff = Diff + i;
    i++; //consume one in I
    else
    Diff = Diff – A[j];
    j++; // consume one in J
    endif
    end

    // Need some corner case check here to but I will omit

    3) So we have (m XOR d) and m – d. Also, note that m XOR d is basically m + d without the carry bit so it will be trivial to get m and d.

    running time : 3 N <- can be squeeze to 2N if merge the two logic
    space : O(1)

    • Binh Nguyen on

      Correction:
      Once we find m XOR d, it’s not trivial to get m+d.

      Assume d XOR m = a. Find k = position of the first bit (from MSB to LSB) in a. Then d + m can be in [a + 2^(k+1),a + 2^(k+2),…,a + 2^63].
      We will have less than 63 cases. In each case, after we find d and m, make sure they are in range and then linear test.
      So in the end, we may need to run at most 63+3 = 66 N. But it still linear though.

      I believe it will work, but it may not be the most beautiful solution.

      PS: I try to solve the problem just purely for fun. I already had a job! 🙂

      • Max Schireson on

        I’m not quite following what you’re doing here, but it might be me at this late hour. There is a way to fix the XOR to produce D and M (but not know which is which) in one additional pass, its in one of my prior comments.

        — Max

  58. Mark on

    Hi Max!

    I will dispense here with code and verbalize the algorithm. If I’m close and you would like me to fill in the blanks… well you know the answer.

    Using four variables, iterate (single pass) the array determining for each element:

    min?, max?, count, and sum … and a fifth variable – or flag if you will.

    Once the length of the array is determined by count, and using comparison operations on each element to obtain the minimum and maximum integers in the array, we can easily arrive at:

    max(max+1)/2 – min(min+1)/2 – sum = duplicate integer – missing integer

    Thanks to Mr. Gauss…

    Our fifth variable is a “flag” (var duplicate) ; each element must either update the min? or max? from the previous element (appearing only once – the next element is either greater than or less than the current min? or max? variable); if not, that element is the duplicate integer, and from the above, we then can also determine the missing integer. Our count variable must accommodate the 64 bit integer sum of course.

    I’m not a ninja programmer – but I am having fun putting monogdb to good use!

    • Max Schireson on

      Glad you’re having fun with mongodb!

      I think your assertion that each element updates either min or max is false. Say N=100, A[1]=1 and A[2]=100. Then each additional element will not update min or max and you will automatically think A[3] is the dupe.

      — Max

  59. Colin on

    def qs(a)
    (pivot = a.pop) ? qs(a.select{|i| i pivot}) : []
    end
    def pa(a)
    qs(a).each do |e|
    @current = el
    begin
    state = @current – @previous
    rescue
    unless e == 0
    state = -1
    end
    end
    case state
    when -1
    puts “skipped number is #{0}”
    @skipped = 0
    when 0
    puts “repeated number is #{@current}”
    @repeated = @current
    when 2
    puts “skipped number is #{@current – 1}”
    @skipped = @current -1
    end
    @previous = @current
    end
    @repeated
    @skipped
    end

    You could easily speed this up doing only an xor on the right most bit and using registers, but that’s a little hard to illustrate in ruby. Maybe I’ll give this a shot in C later on tonight.

    • Colin on

      Mistake on line 6. Typo on the value, change that from “el” to “e”.

      • Max Schireson on

        I can’t understand ruby to save my life but i think you might be assuming the list is ordered?? Either C or plain english are best for me (but i understand java and c++ reasonably well and can muddle through perl easier than ruby)

        — Max

  60. Julien Bouvier on

    I have used the sum and sum of squares of the integers within the array to find the duplicate and missing values.

    My analytical solution is online at:

    Click to access Max%20Schireson%20Algorithm%20by%20Julien%20Bouvier.pdf

  61. Kabir on

    # Python solution, returns (duplicate, missing)
    # Uses Python BigInts but all numbers are no larger than 128 bits

    def findDupMiss(array):
    n = len(array)
    correct = n*(n+1)/2
    accumulator = sum(array)
    difference = correct-accumulator
    for number in array:
    tmp = accumulator – number + difference
    if tmp == correct: return number, number + difference

  62. Top Posts — WordPress.com on

    […] Want a job working on mongoDB? Your first online interview is in this post I like smart programmers. […]

    • Max Schireson on

      Thanks (blush)

      Looks like this post is number 34 in wordpress right now. Ahead of teen stabbings on CBS NY but behind Philles shutouts in San Diego.

      — Max

  63. Daniel Plaisted on

    You can do this in two passes over the array.

    First, compare the sum of the elements of the array to the sum of the integers from 1 to N. The difference between these sums is the difference between the duplicate and missing number.

    Now you want to find a way to partition the numbers from 1 to N into to sets, so that the missing number is in a different set than the duplicate number. To figure out which set a number is in, divide it by the difference between the duplicate and missing numbers. If the result is even, it belongs to one set, if the result is odd, it belongs to the other set.

    So you make a second pass through the array, calculating the sum of the elements that belong to the even and odd sets. You also do the same calculation for the numbers from 1 to N to get the “expected” results if there were no numbers added or missing.

    You will end up with one “actual” result bigger than expected and one less than expected. The difference between the result that is bigger than expected and its expected result is the duplicate number, and the other difference is the missing number.

    In Pseudocode (where ints are 64 bits):

    int N = array.Length;
    int actualSum = 0;
    int expectedSum = ((N + 1) * N) / 2;
    for (int i=0; i< array.Length; i++)
    {
    actualSum += array[i];
    }

    int diff = Math.Abs(actualSum – expectedSum);

    int actualEvenSum = 0;
    int expectedEvenSum = 0;
    int actualOddSum = 0;
    int expectedOddSum = 0;

    for (int i=1; i 0)
    {
    duplicate = evenDiff;
    missing = -oddDiff;
    }
    else
    {
    duplicate = oddDiff;
    missing = -evenDiff;
    }

    If there are more than about 2^33 numbers, then you will overflow a 64-bit integer when adding up the numbers. As long as this doesn’t raise an exception, the calculation should still work correctly, since the difference will fit in a 64-bit integer.

    • Daniel Plaisted on

      It looks like my pseudocode got eaten, starting with a less than or equal sign in the second loop. Hopefully my description of the algorithm makes sense by itself.

    • Daniel Plaisted on

      Here’s a link to the un-mangled code: http://codepaste.net/3mvtfw

  64. Marcus Clyne on

    Hi Max,

    Note: in the time between seeing this, trying to find out where to send the answer (i.e. to post it on this wall or to find your actual email address), and the time you posted the answers, I wrote this: I know it’s a bit late now, but I thought I’d send it to you anyway.

    Here’s a rough outline of one possible approach, where x represents the each number you have:

    – take the sum of x
    – take the sum of x^2
    – solve the simple equation you have using the two sums to get the missing and duplicated numbers

    Things we know:

    – the sum of squares of numbers from 1 to n is: n(n+1)/2
    – the sum of squares of numbers from 1 to n is: n(n+1)(n+2)/6

    Things we set up:
    Two fields for our adding. We represent these as field 1 and 2 respectively for the two sums mentioned above.
    Depending on n, we may be able to use 64-bit integers for all fields (if (n(n+1)(n+2)/6 + (n^2-1)) < 2^64 – I didn't bother to work out what this value would be – the (n^2-1) part at the end is to guarantee that we have enough space for if the duplicated value is n and the missing value is 1, which would give the highest sum of the squares).
    If the previous equation results in an integer greater than can fit in a 64-bit integer, then we could either use a library to deal with integers that are greater than 64bits but use 64-bit integers, or could just use something like the gmp library (so we would be converting all our numbers to gmp integers). Although this would obviously be slower than using integer variables, it would approximately scale almost linearly (see note below).

    The iteration part (which will be our part that scales linearly):
    – loop through all the numbers you have in your array (represented as x)
    – on each loop:
    – add x to field 1
    – add x^2 to field 2

    The differences:
    In order to get the differences from the 'complete' values, we just take the value of the above two equations above respectively from the two totals we got in field 1 and 2.

    The resulting equations:
    We are left now with two equations and two values (the differences calculated above). In these sums, 'd' represents the duplicated number, 'm' represents the missing number, and D[1|2] represent the differences above (which are known values at the end of our process):
    1) d – m = D1
    2) d^2 – m^2 = D2

    Solving the equations:
    (from 1) (d – m)^2 = D1^2
    (from 1) d = m + D1
    (from 2) d^2 – m^2 – D2 = 0
    (d – m)^2 – D1^2 = 0
    d^2 – 2dm + m^2 – D1^2 = d^2 – m^2 – D2 (we can cancel the d^2)
    2m^2 – 2dm – D1^2 + D2 = 0
    2m^2 – 2(m + D1)m – D1^2 + D2 = 0
    2m^2 – 2m^2 – 2D1m – D1^2 + D2 = 0 (we can cancel the m^2)
    – 2D1m – D1^2 + D2 = 0
    OR
    2D1m + D1^2 – D2 = 0
    SO
    m = (D2 – D1^2)/2D1
    And we solve 'd' from equation 1 above.
    Note: you could also solve 'd' first, then 'm' by cancelling out first m then d in the above workings.

    One issue with this approach related to O(n):
    If we were to use something like gmp, then the library itself would not quite scale linearly. The processing of larger integers would take longer than small integers, so there would be a very slight increase in time. Since we are *only* dealing with 64-bit integers, then the difference would probably be pretty small, and in practice may not be that noticeable in the 1 – 2^64 range.
    If you wanted to be absolutly pedantic about linearity with the scaling, you could always set up a timer at a regular interval that was considerably longer than it takes for the gmp library to caclulate ((2^64)+1)^2, and then use some kind of loop that only did the next calculation after the alotted time had elapsed.
    This is a completely pointless activity in practice, though would make it strictly linear (at least the iteration part).

    Concluding comments:
    I believe that this solution solves your problem/criteria. I hope there are no errors in my working – I did double-check them. 🙂
    By the way, I actually contacted you guys about two weeks ago asking if you'd be interested in hiring me to do some work regarding your C driver and module for Nginx. There are some major issues with integrating MongoDB with Nginx because your C driver is blocking, and Nginx is non-blocking, so the GridFS Nginx module you have is going to kill performance on Nginx on high-load, so I offered to do some contract work to write a non-blocking C driver and an Nginx module that would allow Nginx to run as an HTTP front-end to both standard Mongo and GridFS.
    I was asked to send in my resume (by Ben Sabrin), which I did, but I've not yet heard back from anyone. I didn't want to be pushy because I know that you guys are probably very busy, and possibly hadn't had time to consider my proposal. Just thought I'd mention it.

    Thanks,
    Marcus Clyne.

    • Pavel on


      >- the sum of squares of numbers from 1 to n is: n(n+1)(n+2)/6

      Correct is N(N+1)(2N+1)/6

      • Marcus Clyne on

        🙂 Oops – mis-remembered that one. Thanks.

  65. Charles Keepax on

    Well my answer was this:

    Sum all numbers, we’ll call this A.
    Calculate N(N 1)/2, we’ll call this B.
    Subtract the numbers one at a time from A and B, until A-B changes. The number that causes the change is one of the incorrect numbers. It is then trivial to calculate the other incorrect number.

    Requires at worst two passes through the array though so better answers have been posted.

    • Charles Keepax on

      Well that will teach me to solve problems whilst after drinking 🙂 Sorry all.

  66. Hugo T R on

    I would use a bitmap or cumulative bitwise operations.

  67. sparc5 on

    just try to see if the comments are post immediately or not. Somehow I did not see my post

  68. Nick LaRacuente on

    Basic half-finished idea:
    1) Compute the expected # of total 1 bits in positions 0-63. This should run in O(logn) or less than 64.
    2) Repeat the above for different recombinations of bits, based on the butterfly algorithm for the Walsh-Hadamard transform, but with +/- replaced with XOR/NXOR. The intention is to create alternate expressions that couple bits.
    3) Record the actual ones counts for each bit and recombined bit (this should require a constant-sized table of long integers).
    4) Starting with the untransformed case, we can find all the bits that were 1 in the removed and 0 in the duplicate, or vice versa. From here, we can backtrack through each level of the transform to determine the values of bits that were the same in the duplicated and removed, but would have been coupled with a changed bit in the transforms.

    Unfortunately, it appears I’ve returned here too late. Ah well. Probably should get back to actually working.

  69. bro on

    Step 1:
    let T be the XOR of all numbers from 1 to N
    let S be the XOR of all numbers in the array
    Then XOR(T,S) = XOR(d,m) = P where d is the duplicate
    value and m is the missing value.
    Step 2:
    let i = 1
    let Z2 = xor of all numbers in the array except the
    ith element
    if XOR(XOR(Z2, T), P) == ith element
    then i th element is the duplicate

    Now this is repeated for all indices i. The thing is the xor of all values in the array except the ith element
    can be computed in one operation, that is XOR(S, ith element).

    • Max Schireson on

      I like it. Original and fast.

      • Antti Rasinen on

        I do not think it works. I’ll use + for XOR, and mark the ith element with simply i. Then the Z2 variable becomes S + i.

        Now (Z2 + T) + P = S + i + T + P. Since S + T = P, they cancel out. In other words, the equation

        (Z2 + T) + P == i

        is always true. Bummer.

  70. Ajay Ohri on

    as a follow up, the unemployment rate in techs, math guys would be interesting given the interest generated by the job/post title

    • Max Schireson on

      As a former math guy myself always have a soft spot for math guys who can program 🙂

  71. Zachary Voase on

    I’ve written a C solution to the problem. It runs in O(n) time and O(n) constant space (i.e. it doesn’t grow/shrink over the course of its execution). You can find it here, along with example input and output files: https://gist.github.com/939287

    It’s not very idiomatic because it’s my first C program, but the basic idea is simple: store a bit array of size N, where `check_array[number]` indicates whether or not we’ve seen `number`. Loop through the array setting `check_array[input_array[i]] = 1`. If we encounter a number which we’ve already seen, then that must be the duplicate.

    As other people have noted, the difference between the duplicate and the missing number is the same as that between the sum of all input elements and the ‘triangle’ number (N * (N + 1) / 2). So, finding the missing number becomes trivial as long as we calculate the sum as we’re going along.

    My solution above solves this problem correctly (as far as I can determine through generated random test input). It also does so very quickly, most likely because it’s written in such a low-level (yet well-suited) language.

    • Max Schireson on

      The challenge is to get that down to o(1) space – a constant amount of space that does’t depend on n!

  72. Pavel on

    Hi, this is working solution. It can be easy adopted for 64bit numbers.

    #!/usr/local/bin/python
    import sys
    from random import shuffle

    # prepare sequence
    N = 10 # len of sequence
    nums = range(1, N+1) # generate range of numbers from [1..N]
    shuffle(nums) # make random order of numbers
    nums[0] = nums[1] # dup 0-idx number as 1-idx number
    print “Numbers: %s” % nums

    # calculate sum of n and n^2 from 1..N
    S = int(((1+N)*N)/2)
    S2 = int(N*(N+1)*((2*N)+1)/6)
    # print “Sum of numbers in range 1..N: %d” % S
    # print “Sum of numbers^2 in range 1..N: %d” % S2

    # Calculate sum of current numbers
    E = E2 = 0
    for n in nums:
    E += n
    E2 += n*n

    if S == E:
    print “All numbers in place”
    sys.exit(0)

    D = (S2 – (S*S) + (2*E*S) – E2 – (E*E)) / ((2*S) – (2*E))
    print “Duplicated number: %d” % D

    M = S – E + D
    print “Missed number: %d” % M

  73. Pavel on

    Here you can see with the right highlighting and indented.
    http://pastebin.com/KyFTXpEr

    Sorry for duplicated comment.

  74. Binh Nguyen on

    One of my friend came up with this solution:

    m is missing,d is duplicate. I is 1..N, A is input.

    Step 1: Compute m XOR d by XOR all values of A and I together.
    let say x = m XOR d. Then x must has at least one 1 bit (or else m == d). Call the bit p. So now we can separate A into two list A0 are those with bit p == 0 and A1 are those with bit p == 1. So we have separated m and d into two different list.

    Step 2: Iterate through I, maintain two number x0 and x1 as the results of XOR operations of those elements with p bit == 0 and == 1 respectively. Do the same for A to get a0 and a1. Then we know that m is either a0 XOR x0 or a1 XOR x1. One more linear test in A to confirm the missing element m.

    In the end, the running time can be squeezed into 2 N.
    space = O(1).

    • Max Schireson on

      Yes, that is the direction I was suggesting for the last part of your solution. It was also the direction Zach was taking, but your friend had by far the clearest explanation of it. He should send his resume too!

      — Max

      • Binh Nguyen on

        Thank you! That’s very nice!
        But both he and I are not looking for a job right now. 🙂
        Looking forward to another fun question in the future.

        PS: I don’t think calculate sum of squares of all number from 1 to N where N is ~ 2^63 is linear in term of N. At least not in practice!

  75. Ittay Dror on

    (disclaimer: without reading the previous comments, so i may be presenting a wrong idea that has been presented before)

    The first array is A. I’d build another array of N numbers, filled with 1, name it B.

    foreach i in A
    if B[A[i]] == 0 print “found duplicate” A[i]
    B[A[i]] = 0

    foreach i in B
    if B[i] == 1 print “found missing” B[i]

    I’m assuming of course, that I can allocate B.

    If not, then reuse A. So A[0] gives another index in A, so I can save i = A[A[0]] and then A[A[0]] = 0 and continue with i. Something like
    i = A[0]
    while not i == 0
    i = A[i]
    if A[i] == 0 print “found duplicate” A[i]
    A[i] = 0

    foreach i in A
    if A[i] != 0 print “found missing” A[i]

    (there may be corner cases, like needing to add 1 to indexes and such, i don’t think it is worth bothering with)

    • Ittay Dror on

      oops. missed the space constraints. sorry

  76. Sergiy on

    D – duplicate M-missing
    If sum(array) is computable (possible to compute), than we can find

    MD=M-D by comparing sum(array) to sum(1, N)

    Now lets set variables, each named X(0 to MD-1), – total md vars.
    X0 would represent numbers that are equal 0 while modulo MD
    So, say if MD=3, than X1 represents 4, 7, 10 …, X2 represents 5,8,11 …

    Now lets make iterating procedure, that will go through the array, starting with first element and adding it to the top of the corresponding Xn var. (no matter in what order they arranged)

    X1=(31,1,10..)
    X2=sum(8,2,11, ..)

    Xn= sum((n+MD*y)…)

    When finished sum, one of the Xn will contain D and will

    have missing M.

    It is easy to find which Xn contains M and D, since “ideal” sequence Xn (4,7,10..) is easily computable and compared to the “actual” Xn.

    Now when we know which Xn var has M and D, we also know that in this array M and D are consecutive numbers with difference MD – for example M=10 D=7

    Lets brake this sum – Nx into 2 sums.
    Each of them sums alternating numbers
    Xn1=sum(4,10,16..)
    Xn2=sum(1,7,13…)

    Basically this procedure is made during the first procedure when we calculate Xn.
    We add (n+MD*y) where y is odd to Xn1 and we add (n+MD*y) where y is even to Xn2.

    Xn will be equal Xn1+Xn2.

    Having actual Nx1 and Nx2 (one we know has Missing M and
    another Duplicate D) and comparing them to “ideal” sequence Xn1 and Xn2 is easy find out M and D.

    If “actual Xn1” is less than “ideal Xn1”, then difference is M.
    If “actual Xn1” is more than “ideal Xn1”, then difference is D.

    Тhe same to Xn2.

    • Max Schireson on

      Looks like you need MD variables, which could be N-1. Looking for a constant space solution.

  77. anxrn on

    (I didn’t read all comments to see if this solution has already been posted. Apologies if it has)

    (I will assume all integers are positive)

    Let x be the duplicated element
    Let y be the missing element
    Sum of all elements in the array = S
    Product of all elements in the array = P

    We have,
    Eqn #1: S – x + y = n(n+1)/2 — (formula for sum of numbers 1..n)
    Eqn #2: P * y/x = n!

    So, 2 equations, and 2 variables. Simple linear equation. Solved. Oh, and time complexity is O(n), sum and product is computed in one pass. Space complexity is constant.

    • Max Schireson on

      As n gets large, p uses a lot of space to store as a bigint. Eventually exabytes when n is near 2^64. So space isn’t really constant…

  78. […] 这个贴到目前为止已经有了160多条回复,下面是其中几个比较有意思的,如果你有时间,也可以去原帖发掘更多。 […]

    • Max Schireson on

      Sorry, I can only competently review responses in English. I will give German or Spanish a try but you’ll need to forgive my mistakes.

      — Max

  79. rmh on

    Let

    Q be the sum over 1 to n of n^2
    S be the sum over 1 to n of n

    Determine

    q, the sum of the squares of the elements in the array
    s, the sum of the elements in the array

    If

    x = q – Q
    y = s – S

    then

    m = \frac{x – y^2}{2y}
    d = m + y

    where

    m is the missing number
    d is the duplicate number

    • Max Schireson on

      well done

      • Alex on

        is that faster than looking at them?

        Oh I get it now, leaves the array untouched and in a constant space. whereas all that sorting and looping was a problem.

    • Dylan on

      Using similar methods, you can actually fill in an arbitrary number of missing or duplicated elements.

      • Max Schireson on

        An arbitrary but known number isn’t too hard 🙂

  80. Alex on

    function FindDupeAndMissingIn(N) {
      // returns an array of two parts
      // assuming one duplicate and one missing integer
      N.sort();  
      theDupe = 'not found';
      theMissing = 'not found';
    
      for (i = 0; i < N.length(); i++) {
       if (N[i] == N[i+1]){
           theDupe = N[i];
       }
       if (N[i+1] == N[i]+2) {
           theMissing = N[i]+1;
       }
       if ((theDupe != 'not found')&&(theMissing != 'not found')) {
           return [theDupe, theMissing];
       }
     }
    }
    

    • Max Schireson on

      Looking to avoid a sort, that will take n log n time, plus shuffle the array.

      — Max

  81. Jose luis figueira on

    Can’t you just:

    x = [set of data]
    N = ubound_limit
    e = [1 to N] #empty
    dupe = 0

    for i in x
    v_sum += x(i)
    if e(i) > 0 # found a dupe
    dupe = i
    end
    e(i) = e(i) + 1

    s_sum = sum(1 to N)
    diff = v_sum – s_sum #diff between miss and dupe

    # Here, we can loop over e, and found an e(i) = 0, thats
    # the miss, but we dont want to loop again, so…

    # miss = dupe – diff
    # whe have dupe and diff, we need to found miss
    # then, do the maths…

    • Jose luis figueira on

      I forget, of course, to get s_sum = sum(1 to N), you don’t need to loop from 1 to N (thanks gauss):

      ((1 + N) * N) / 2

    • Max Schireson on

      Part of what I was asking for was to avoid using the memory to store e.

      — Max

  82. Sergio Castaño Arteaga on

    Python:

    from mpmath import *

    def findDupAndMis(array):
    vp, vs = mpf(1.0), 0
    for i in xrange(1,len(array)+1):
    vp = vp * (float(array[i-1])/float(i))
    vs = vs + (array[i-1]-i)
    mis = round(vs / (vp – 1))
    dup = round(vs + mis)
    print “Found: Array %s \nDup %d Mis %d” % (array, dup, mis)

  83. Sergio Castaño Arteaga on

    Python (… to show tabs):

    from mpmath import *

    def findDupAndMis(array):
    ….vp, vs = mpf(1.0), 0
    ….for i in xrange(1,len(array)+1):
    ……..vp = vp * (float(array[i-1])/float(i))
    ……..vs = vs + (array[i-1]-i)
    ….mis = round(vs / (vp – 1))
    ….dup = round(vs + mis)
    ….print “Found: Array %s \nDup %d Mis %d” % (array, dup, mis)

    • Max Schireson on

      I think you’ll overflow with any decent sized N (unless phython just expands the size of your float, in which case you’ll use a lot of memory).

      Check some of the other comments about multiplying,

      — Max

  84. Venkatesan on

    #include
    #include
    #include

    void find_missing_and_duplicated(std::vector& arr, ACE_INT64& dup, ACE_INT64& missing)
    {
    ACE_INT64 curXor = 0;
    ACE_INT64 expXor = 0;

    for(int i=0;i<arr.size();i++)
    {
    expXor ^= i;
    curXor ^= arr[i];
    }

    //this is the xor of missing numbers
    ACE_INT64 deltaXor = curXor ^ expXor;

    //except for the missing number, every other combo yields deltaXor
    for(int i=0;i<arr.size();i++)
    {
    if( ((expXor^i)^(curXor^arr[i])) == 0)
    {
    missing = i;
    break;
    }
    }

    dup = missing ^ deltaXor;

    }

    int main()
    {
    std::vector arr;
    const int N = 100;

    arr.resize(N);

    for(int i=0;i<N;i++)
    arr[i] = i;

    int victim = 76;//rand() % N;

    //arr[victim] = (victim+1) % N;

    arr[10] = N-1;

    std::cout << "Array size is " << arr.size() << ", Victim is " << victim << std::endl;

    ACE_INT64 dup = 0;
    ACE_INT64 missing = 0;
    find_missing_and_duplicated(arr, dup, missing);

    std::cout << "Missing: " << missing << ", Duplicate: " << dup << std::endl;

    return 0;
    }

    • Max Schireson on

      I’m not sure i agree with your comment on line 20?

      Have you tried a test case?

      — Max

  85. Quentin Calvez on

    (Sorry if my solution has already been posted, there are 200+ comments so I haven’t read everything)

    I will assume that all numbers are positive.

    1. I compute the sum of all integers in the array and the sum of numbers from 1 to N. I then call DIFF the difference between both.

    2. I now iterate again trough the array and I build two sums : The numbers which euclidean division by DIFF is even on one side and those for which it’s odd on the other side. I do the same for the numbers from 1 to N.

    3. Finally : I now have 4 sums from (2.), I compare the sums from the numbers (1-N) to those from the array (even with even and odd with odd). If the difference is positive, it’s the missing number and if the difference is negative, it’s the duplicate one.

    I have made an implementation of this in pseudo c# that you can find here : http://pastebin.com/rKAeUEfa

    The bottleneck here is the division… Because I don’t known if there are algorithms to do it with a complexity < N…

    • Max Schireson on

      nice solution Quentin.

      — Max

      • Quentin Calvez on

        Thank you 🙂

        BTW, I’m currently seeking an internship for this summer, so if you take interns at MongoDB, I’d be happy to apply !

  86. Matthew H. on

    Don’t quite have the math for an easy solution, but you can see that

    d = sum(n in 1-N) – sum(n in N Array)
    x – y = d
    (product(n in 1-N) / y) * x = product(n in N Array)

    and now it’s quite easy to solve.

    naively one can find all solutions to x – y = d and then test against the second equality.

    • Max Schireson on

      two comments:

      1. calculating the product will use a LOT of memory if N is large
      2. again with large N, finding all the x-y = d is very time consuming, but yes, there are other ways to solve once you are at that point; issue is just the size of the product.

      — Max

  87. Bruce Moreland on

    I haven’t read any of the above except the question and the answer by “rmh” that you declared to be correct. I normally wouldn’t find that kind of solution myself.

    Here’s the answer I came up with while eating dinner after my son presented me with this problem. I’ll try to find the duplicate number. It’s possible to derive the missing number from that.

    Pick a radix, for example 10. You can use binary or base 256 or whatever but for purposes of explanation I’ll use base 10. Regard the series as a bunch of base 10 numbers.

    For each possible number in the 1’s column, 10’s column, etc., you know the frequency with which any digit should appear. For example, if the numbers are 1..20, you should have in the 10’s column 9×0’s, 10×1’s, and 1×2’s. In the 1’s column you should have two each of 0..9.

    Examine your data and compare with what you expect to see. If a digit appears too often in a column, you know that this digit must appear in that column in the duplicated number. If all digits in a column appear with correct frequency, you know nothing about that column, yet.

    It’s a fact that you’ll learn something in this pass, because something has to appear too often. You’ll learn something about at least one column, and possibly more.

    What you have learned creates a subset of data, from which you can compute new frequency information. Repeat the above steps using the subset and the new frequency information.

    You are done when you know what is going on in each column. The algorithm has to make progress so you’ll get done eventually. It’s arguably linear since it depends upon the size of the numbers but it’s a hell of a lot more linear than duplicating the array and quick-sorting it.

    Example:

    1 2 3 5 6 7 8 9 10 11 12 13 14 14 15 16 17 18 19 20

    1..20 with 14 duplicated and 4 missing. These wouldn’t have to be in order, of course.

    First pass would show the 10’s column as having 1 too many 1’s in the 10’s column, so you know that your answer is 1X, where X is unknown.

    New subset of data, which is everything in the original data that had a 1 in the 10’s column.:

    10 11 12 13 14 14 15 16 17 18 19

    If the data you expected when you started was here, you’d have 1 each of 0..9.

    You run over this and since there are too many 4’s, so you know that your duplicate has a 1 in the 10’s column and a 4 in the 1’s column, i.e. 14.

    You can find out that 4 is the missing number trivially. You sum the original data set, find out that you have 10 too many, and 14 – 10 = 4.

    This answer would work although depending upon how large the numbers were it might be impossible to sum them in 64 bits.

    If someone tasked me with this problem I would want to know more about it, to avoid bugs and dumb. My answer might be wrong to start with, if there isn’t time to do the calculations I suggest, or if it’s impossible to go over the data twice, etc., so it would be pointless to waste time on it if this were true. It may also be pointless to try to find a better solution than this one if this one works orders of magnitude faster than it “has” to.

    Also, if this was critically important and commonly solved, I’d probably try to find some information like this webpage, read the whole thing, swallow any residual pride, and use the best solution contained here, because I’m smart enough to realize that other people can be smarter, spend more time, or look at things differently.

    • Max Schireson on

      interesting approach.

      that said, if you had non-empty hundreds and thousands places (and potentially more…) i don’t think your second pass will learn anything new?

      — Max

      • Bruce Moreland on

        I don’t understand your comment, but I will restate in the hopes that I answer your question by accident. I’m pretty sure that I do answer it, indirectly.

        When writing this up I came to realize that the algorithm is simpler than I thought, and that will be reflected in the following.

        Definitions:
        Ideal data – the set 1..N. I assume this is represented mathematically only.
        Actual data – the set 1..N, with one missing and one duplicate. I vaguely assume this is an unsorted array that can be iterated over.

        The point of this algorithm is that you can use a frequency count to identify and characterize a subset of the actual data that must contain the duplicate, and which cannot contain the missing number.

        This allows you to trap the duplicate while excluding the missing number from consideration, which is key.

        If for example there are too many 6’s in the 10’s column in the actual data, the duplicate must have a 6 in the 10’s column, and the missing number cannot have a 6 in the 10’s column.

        Depending upon the radix you use and the character of the data, there’s a fair, maybe excellent, chance that you could solve the problem in one stroke with this frequency step.

        But assuming you don’t, this still allows you to identify a subset of the actual data that contains the duplicate, and allows you to generate the subset of ideal data that contains exactly the same values. In this example, that subset is the set of numbers in either ideal or actual data sets that have a 6 in the 10’s column.

        So:

        I is the ideal data
        A is the actual data

        X is the subset of I that must include the duplicate and can’t include the missing number, i.e. everything in I that has a 6 in the 10’s column.

        Given this:

        I’ is I intersected with X.
        A’ is A intersected with X (assuming that this will cause the duplicated element to appear twice).

        The result is that A’ is I’ with one extra element. The extra element is the duplicate.

        At this point you can take a second frequency pass to find the duplicate (I include this for consistency with my original solution), or likely more simply just use XOR, since all elements of A’ XOR’d with all elements of I’ will equal the duplicate element. (I would have figured out that use of XOR without having seen the other solution.)

        I hope there are no bugs in this. I don’t feel confident given that I didn’t write and test it, but it seems okay. Maybe more fat would fall off in the writing.

        The very strict definition of the problem implied set relationships that suggested that tricky math existed, and if my life had been on line if I didn’t find that, I would have gone looking for it.

        I realized though that the way I chose was more fun for me personally and does meet the criteria:

        1) Linear.
        2) Sane memory.
        3) Non-destructive.

        It would also be easy to implement this in a way that in a practical sense, meaning with natural integer operations and typical data structures, and would scale given that you used the phrase “extremely large”.

  88. Yousuf Fauzan on

    Been thinking that XOR must have something to do with the solution, here is my second try.

    A = range(1..N)
    B = Our Array

    M = Missing Number
    D = Duplicate Number

    A’ = XOR of elements in A
    B’ = XOR of elements in B

    A’ XOR B’ = M XOR D (Since all other elements will produce 0)

    A’ XOR M = B’ XOR D —– 1

    Since A’ XOR M is all elements of A excluding M
    B already does not have M, but due to the duplicity of D, XOR is missing D too.

    We know that (M XOR D) XOR D = M —-2

    Using 1 and 2 and iterating through the elements in B, we can find D

    for i in B:
    if B’ XOR i == A’ XOR (M XOR D XOR i) then i is the duplicate number

    Finding M is trivial from here.

    • Max Schireson on

      good solution!

      — Max

      • Duarte on

        It doesn’t work. A’ XOR (M XOR D XOR i) is A’ XOR A’ XOR B’ XOR i, which is B’ XOR i, so the condition is always true.

      • Max Schireson on

        Duarte,

        Good catch; I think my eyes were glassing over!

        Sorry Yousuf,

        — Max

  89. Yousuf Fauzan on

    This was my first try. It didnt get posted for some reason.

    A=sum(1..N) = N*(N+1)/2
    B=sum(1^2, 2^2…N^2)= N*(N+1)*(2N+1)/6

    A’ = sum of the integers in the Array
    B’ = sum of squares of integers in the Array

    D = Duplicate
    M = Missing

    A’ = A + D – M
    B’ = B + D^2 – M^2

    => D – M = A’ – A
    D + M = (B’ – B)/(A’ – A)

    Solve for D and M

    • Max Schireson on

      Caught in wordpress’s spam filter, sorry for the delay (i do occasionally check my spam).

      Good solution, congrats!

      — Max

  90. Chris Hurlburt on

    in 5N + overhead with a memory requirement based on 6 numbers plus
    code storage overhead.

    NOTE: Since this is sample code and in python, the max value for N is not
    going to be 2^64 as in the problem because the range command is used. However,
    this is a language limitation and not an algorithm limitation.

    “””

    def create_list(x,y,n):
    “””
    This creates a list of the type described in the puzzle for test purposes.
    “””
    test_obj = range(1,n+1)
    test_obj.remove(x)
    test_obj.append(y)

    return test_obj

    def seeker(list):
    “””
    Goes through the list finding both the missing and the duplicate. We are
    assuming that the list is of the form described in the problem.

    Notationally, x is the missing element and y is the duplicate element and
    we will return x, y
    “””

    n = len(list)

    #Find x-y a.k.a the first pass through the list
    xy_diff = int(n*(n+1)/2) – sum(list)
    if xy_diff == 0:
    print “List is not in problem format!\n”
    return 0,0

    #Find the first binary digit where diff is non-zero.
    sort_place = len(bin(xy_diff)) – bin(xy_diff).rfind(‘1’)

    print “The first non_zero binary digit in the difference occurs at %d\n”%(sort_place)

    #Initialize counters/summation variables
    bucket0 = 0
    ref_bucket0 = 0
    count0 = 0
    ref_count0 = 0
    bucket1 = 0
    ref_bucket1 = 0

    #Go through the list counting and summing….
    for i in range(n):
    #Add the i+1 value into the appropriate bucket & increment counter
    ibin = bin(i+1)
    if len(ibin) – 2 >= sort_place:
    if ibin[len(ibin) – sort_place] == ‘0’:
    ref_bucket0 = ref_bucket0 + i+1
    ref_count0 = ref_count0 + 1
    else:
    ref_bucket1 = ref_bucket1 + i+1
    else:
    ref_bucket0 = ref_bucket0 + i+1
    ref_count0 = ref_count0 + 1

    #Add the list entry to the appropriate bucket & increment counter
    lbin = bin(list[i])
    if len(lbin) – 2 >= sort_place:
    if lbin[len(lbin) – sort_place] == ‘0’:
    bucket0 = bucket0 + list[i]
    count0 = count0 + 1
    else:
    bucket1 = bucket1 + list[i]
    else:
    bucket0 = bucket0 + list[i]
    count0 = count0 + 1

    #Process the results to find x and y

    if count0 == ref_count0:
    print “YOUR ALGORITHM FAILED!!!”
    return 0,0

    if count0 > ref_count0:
    x = ref_bucket1 – bucket1
    y = bucket0 – ref_bucket0
    else:
    x = ref_bucket0 – bucket0
    y = bucket1 – ref_bucket1

    print “Missing is %d\n”%(x)
    print “Doubled is %d\n”%(y)

    return x,y

    • Max Schireson on

      Algorithm looks good; my python is not good enough to meaningfully check your code, but congrats on a good algorithm!

      • Chris Hurlburt on

        Hi Max,

        Here’s a a more formal description. Coding allows me to check for what I like to call moron errors. Let x be the missing integer and y be the duplicated integer.

        Step 1: Compute x-y. (Gauss’ formula – sum of the list)

        Step 2: Find the smallest number i such that the coefficient a_i != 0 in the unique (binary) representation of
        |x-y| = a_0 + a_1 2^1 + a_2 2^2 + a_3 2^3 + …
        where a_j in {0, 1} for all j.

        Step 3: For each number on the list, determine whether the ith coefficient in the binary representation is 1 or 0. While doing so compute the following:
        S_0 = sum of all numbers on the list whose ith coefficient is 0
        C_0 = the cardinality of the set of numbers on the list whose ith coefficient is 0
        R_0 = sum of all numbers in 1..N whose ith coefficient is 0
        D_0 = the cardinality of the set of numbers in 1..N whose ith coefficient is 0.
        S_1 = sum of all numbers on the list whose ith coefficient is 1
        R_1 = sum of all numbers in 1..N whose ith coefficient is 1

        Step 4: If C_0 D_0 then y = S_0 – R_0 and x = R_1 – S_1.

        Lemma: Given the notation above, x and y must differ in the ith coefficient.

        Proof: For j < i, a_j = 0. Therefore the jth coefficients in the binary representation of x and y for j < i must be the same. If they were also the same for i, then the ith binary coefficient in |x-y| would be 0. However, i is chosen such that the ith binary coefficient in |x-y| is non-zero and hence x and y must also differ in the ith coefficient.

        Notes: Some of the computations in step 3 are redundant. For example, there is no need to actually compute either S_1 or R_1.

        In my implementation I compute all 6 values in step three while looping through the list—I did this because I didn't want to bother figuring out closed forms for R_0 and D_0 and I didn't want to code the more complicated set of conditionals that is required if you don't compute S_1 and R_1. This means the above code does 6N computations in the second pass for through the list rather than a more minimal 2N. It's also much easier to debug this way.

        If N is close enough to 2^64, then cleverness will be necessary to compute R_0 – S_0 without computing either R_0 or S_0.

        There are all sorts of various clever tricks that could improve the above implementation by little bits here and there. None of the ones I can think of improve the algorithm beyond O(N) compute and O(1) memory.

        With this approach binning on the remainder modulo |x-y| will not work. Here is a counter example. Let x = 5 and y = 7. Then binning on the remainder modulo the difference of 2 will put both 5 and 7 in the odd bin which will once again get you x-y (or y-x), but not x or y.

        Python as I know it is not ideally suited to this type of bit manipulation. However, I find it a useful language for fast prototyping of algorithmic ideas. Also by coding it I was able to stop the little mice in my head from working on closed forms for R_0 and D_0 because I realized that even if I did find one, chances are the cases would make a coding/debug nightmare. Thanks for a nice puzzle.

      • Chris Hurlburt on

        ps. Somehow this comment area ate part of step 4. More words and less symbols:

        If C0 is less than D0, then x = R0 -S0 and y = S1 – R1.

        If C0 is greater than D0, then y = S0 – R0 and x = R1 -S1.

  91. Joseph Bowman on

    Well, if I had to knock it out with a shell script I’d just make a file with a line for each integer and then

    #!/bin/bash
    DUPE=$(sort -n array.txt | uniq -d)
    ARRAY=$(sort -n array.txt)
    KEY=0
    for i in $ARRAY; do
    if [ $KEY -ne $i ]; then
    echo “MISSING: $KEY”;
    break;
    fi
    KEY=$(($KEY+1));
    done
    echo “DUPE: ${DUPE}”

    I’m assume I understand the description right and the first number would always be 0. If not, then I could always head the sorted output to get the first number and start counting from there. I suppose I could get a little better about storing file in memory after the first sort for the whole “compact space” part.

    • Max Schireson on

      Hi. Looking to avoid the time of a sort (which is more than linear) and also not to have to hold a copy of the array. You can see some correct solutions and further commentary in the comments.

      — Max

  92. Ankur on

    Wow. This is probably the easiest one out there. It’s as simple as summing all of them and deducting it from n*(n 1)/2

    I was really expecting something better!! I think this question was on my class 8 math exam.

    • Max Schireson on

      That’s simple, but it doesn’t actually answer the question asked 🙂

      — Max

  93. Duarte on

    Here’s my take. It’s based on XOR and on the observation that a bit set to 1 on (Missing XOR Duplicate) means that Missing and Duplicate can be distinguished based on that bit. My solution is in C#.

    long total = 0;
    long values = 0;

    for (long i = 0; i < a.Length; i++)
    {
    total ^= i + 1;
    values ^= a[i];
    }

    long x = total ^ values;

    int bit;
    for (bit = 1; (x & bit) == 0; bit <<= 1) { }

    total = 0;
    values = 0;

    for (long i = 0; i < a.Length; i++)
    {
    var tmp = i + 1;
    if ((tmp & bit) != 0)
    {
    total ^= tmp;
    }

    tmp = a[i];
    if ((tmp & bit) != 0)
    {
    values ^= tmp;
    }
    }

    long duplicate = total ^ values;

    int count = 0;
    for (long i = 0; i < a.Length; i++)
    {
    if (a[i] == duplicate && ++count == 2)
    {
    break;
    }
    }

    duplicate = count == 2 ? duplicate : (duplicate ^ x);
    long missing = duplicate ^ x;

    We could use the equation SUM(1..N) – SUM(a) – Duplicate = Missing to reduce the solution to two loops, but I think this is faster.

    • Duarte on

      Oops, the third loop can be:

      bool found = false;
      for (long i = 0; i < a.Length; i++)
      {
      if (a[i] == duplicate)
      {
      found = true;
      break;
      }
      }

      duplicate = found ? duplicate : (duplicate ^ x);
      long missing = duplicate ^ x;

  94. Andrew on

    An interesting problem but more of a mathematical challenge than a programming ability test.

    I wouldn’t turn a good programmer away just because thier knowledge of algebra is rusty.

    • Max Schireson on

      I wouldn’t either.
      These problems are no substitute for a real interview. They can identify some sharp folks. When people post code samples, you get to see how careful they are. And finally, you get to see how well people listen to requirements, which is certainly relevant.
      — Max

    • Bob Kimble on

      I agree with your comment and with Max’ response. However, if you’re looking for really smart people, the kind of problem Max posted is a good example of the kind of problem a really smart person might be able to solve. Like it or not, computer science is really an offshoot of mathematics. Algorithmic efficiency is strongly tied to mathematical analysis, and this problem was not just about algebraic knowledge. I agree that a strong mathematical background is no substitute for programming ability. OTOH, it’s really good to find people who have both.

  95. wangjun on

    my english is very poor,so i write a programm to introduce my solution
    void search(unsigned long *arry, unsigned long long N)
    {
    unsigned long base, i, duplicat = 0, miss = 0;
    unsigned int vec[65536];
    base = 0;
    while(base < N)
    {
    for(i = 0; i < 65536; i ++)
    vec[i] = 0;
    for(i = 0; i = base && arry[i] < base + 65536 * 32)
    {
    int pos = (arry[i] – base) / 32, bit = (arry[i] – base)% 32;
    if(vec[pos] & (1 << bit) ==0)
    vec[pos] = vec[pos] | (1 << bit);
    else
    duplicat = arry[i];
    }
    for(i = 0; i < 65536; i++)
    if(vec[i] != 0xffffffff)
    {
    miss = base + i * 32;
    break;
    }
    if(i != 65536)
    {
    while(arry[i] != 0)
    if(arry[i] % 2 != 0)
    {
    miss ++;
    }
    else
    break;
    }
    }
    if(miss != 0 && duplicat != 0)
    break;
    base += 65536 * 32;
    }
    }

  96. Zhikun Zhong on

    I try to solve such a problem by count digit string appear times in the set. In the code, I try to seperate big number to two to four pieces, depend on how big the number is. The length of every piece is 6(Less than 1 million). We know that, the number set is regular, so, every piece appear times is regular and we can figure it out. But there get a problem, if duplicate number and miss number get some repeat part, like these two numbers(1221111, 1256300), theirs first two numbers is the same, so, we can’t get that digit string “12” appear more than expected or less than expected. But luckily, we know that, we got two numbers, they are end by 21111 and 56300 respectivility, theirs length is between 6 to 12 and they get same heigh apart, we can find out these two numbers use one more search(o(n)) in the set, if you think it is a way out to solve such a problem, I am get to archive this part, thanks very much.

    #include
    #include

    using namespace std;

    vector generateArray(int dupNum, int missNum, int N){

    vector retVec;
    retVec.resize(N + 1);

    if (dupNum > N || missNum > N)
    {
    return retVec;
    }

    for(int i = 1; i <= N; i++)
    {
    if (i == missNum)
    {
    retVec[i] = dupNum;
    continue;
    }
    retVec[i] = i;
    }

    return retVec;
    }

    pair gotDupAndMissNum(vectormyArray, __int64 N)
    {
    if (myArray.size() < N)
    {
    return pair(0, 0);
    }
    __int64 dup = 0, miss = 0;
    const int vecSize = 10 * 10 * 10 * 10 * 10 * 10;
    __int64 MidNum = ((__int64)(vecSize)) * (vecSize);
    __int64 TopNum = ((__int64)(vecSize)) * (vecSize) * (vecSize);
    int limit = vecSize;

    __int64 cmpVal = vecSize;
    __int64 v1Cnt, v2Cnt, v3Cnt;

    vector myVec1, myVec2, myVec3, myVec4;
    int dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
    int misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;

    myVec1.resize(vecSize + 1);
    myVec2.resize(vecSize + 1);
    myVec3.resize(vecSize + 1);
    myVec4.resize(100);

    for(int i = 0; i < 100; i++)
    {
    myVec4[i] = 0;
    }

    for (int i = 0; i < vecSize; i++)
    {
    myVec1[i] = 0;
    myVec2[i] = 0;
    myVec3[i] = 0;
    }

    int cnt = 0;

    while(cnt++ < 4)
    {
    switch (cnt)
    {
    case 1:
    for (int i = 0; i < N; i++)
    {
    if (myArray[i] 1)
    {
    dup = myArray[i];
    }
    }
    }

    cmpVal = limit = vecSize;
    if (N < vecSize)
    {
    cmpVal = limit = N;
    }
    for (int i = 1; i < limit; i++)
    {
    if (!myVec1[i])
    {
    miss = i;
    }
    myVec1[i] = 0;
    }
    myVec1[0] = 0;
    break;
    case 2:
    for (int i = 0; i = vecSize && myArray[i] < MidNum)
    {
    myVec1[myArray[i] % vecSize]++;
    myVec2[myArray[i] / vecSize]++;
    }
    }

    v1Cnt = vecSize – 1;
    v2Cnt = vecSize;
    cmpVal = vecSize;

    if (N < MidNum)
    {
    v1Cnt = N / vecSize;
    cmpVal = N / vecSize;
    if (cmpVal == 1)
    {
    limit = N % vecSize;
    }
    }

    for (int i = 0; i < limit; i++)
    {
    if ( cmpVal == vecSize || cmpVal != vecSize && i v1Cnt)
    {
    dupIdxV1 = i;
    }
    else if (myVec1[i] v1Cnt – 1)
    {
    dupIdxV1 = i;
    }
    else if (myVec1[i] = cmpVal)
    {
    continue;
    }

    if (myVec2[i] > v2Cnt )
    {
    dupIdxV2 = i;
    }
    else if (myVec2[i] < v2Cnt)
    {
    misIdxV2 = i;
    }
    }

    if (cmpVal != vecSize)
    {
    if (myVec2[cmpVal] N % vecSize){
    dupIdxV2 = cmpVal;
    }
    }

    if (!dup && dupIdxV1 && dupIdxV2)
    {
    dup = myArray[dupIdxV2] * vecSize + myArray[dupIdxV1];
    dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
    }
    if (!miss && misIdxV1 && misIdxV2)
    {
    miss = myArray[misIdxV2] * vecSize + myArray[misIdxV1];
    misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;
    }

    for (int i = 0; i < vecSize; i++)
    {
    myVec1[i] = myVec2[i] = 0;
    }
    break;
    case 3:
    for (int i = 0; i = MidNum && myArray[i] < TopNum)
    {
    myVec1[myArray[i] % vecSize]++;
    myVec2[myArray[i] / vecSize % vecSize]++;
    myVec3[myArray[i] / MidNum]++;
    }
    }

    v1Cnt = (__int64)vecSize * (vecSize – 1);
    v2Cnt = v1Cnt;
    v3Cnt = v1Cnt + 1;
    cmpVal = vecSize;
    limit = vecSize;

    if (N < TopNum)
    {
    cmpVal = N / MidNum;
    v1Cnt = v2Cnt = (__int64)(cmpVal – 1) * vecSize;
    if (cmpVal == 1)
    {
    limit = N/vecSize%vecSize;
    }
    }

    for(int i = 0; i < vecSize; i++){
    if ( cmpVal == vecSize || cmpVal != vecSize && i v1Cnt)
    {
    dupIdxV1 = i;
    }
    else if (myVec1[i] v1Cnt – 1)
    {
    dupIdxV1 = i;
    }
    else if (myVec1[i] limit)
    {
    continue;
    }
    if (myVec2[i] > v2Cnt)
    {
    dupIdxV2 = i;
    }
    if (myVec2[i] = cmpVal)
    {
    continue;
    }

    if (myVec3[i] > v3Cnt )
    {
    dupIdxV3 = i;
    }
    else if (myVec3[i] < v3Cnt )
    {
    misIdxV3 = i;
    }
    }

    if (cmpVal != vecSize)
    {
    if (myVec3[cmpVal] N % ((__int64)vecSize*vecSize)){
    dupIdxV3 = cmpVal;
    }
    }

    if (!dup && dupIdxV1 && dupIdxV2 && dupIdxV3)
    {
    dup = myArray[dupIdxV3] * vecSize * vecSize + myArray[dupIdxV2] * vecSize + myArray[dupIdxV1];
    dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
    }
    if (!miss && misIdxV1 && misIdxV2 && misIdxV3)
    {
    miss = myArray[misIdxV3] * vecSize * vecSize + myArray[misIdxV2] * vecSize + myArray[misIdxV1];
    misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;
    }
    for (int i = 0; i < vecSize; i++)
    {
    myVec1[i] = myVec2[i] = myVec3[i] = 0;
    }
    break;
    default:
    for (int i = 0; i = TopNum)
    {
    myVec1[myArray[i] % vecSize]++;
    myVec2[myArray[i] / vecSize % vecSize]++;
    myVec3[myArray[i] / MidNum % vecSize]++;
    myVec4[myArray[i] / TopNum]++;
    }
    }

    cmpVal = N / TopNum;
    v1Cnt = (__int64)vecSize * vecSize * vecSize * cmpVal;
    v2Cnt = v1Cnt;
    v3Cnt = v1Cnt;
    limit = N/vecSize/vecSize%vecSize;

    for(int i = 0; i v1Cnt)
    {
    dupIdxV1 = i;
    }
    else if (myVec1[i] v2Cnt)
    {
    dupIdxV2 = i;
    }
    else if (myVec2[i] limit)
    {
    continue;
    }

    if (myVec3[i] > v3Cnt)
    {
    dupIdxV3 = i;
    }
    else if (myVec3[i] = cmpVal)
    {
    continue;
    }

    if(myVec4[i] cmpVal)
    {
    dupIdxV4 = i;
    }
    }

    if (cmpVal != vecSize)
    {
    if (myVec3[cmpVal] N % ((__int64)vecSize*vecSize*vecSize)){
    dupIdxV3 = cmpVal;
    }
    }

    if (!dup && dupIdxV1 && dupIdxV2 && dupIdxV3 && dupIdxV4)
    {
    dup = myArray[dupIdxV4] * vecSize * vecSize *vecSize + myArray[dupIdxV3] * vecSize * vecSize + myArray[dupIdxV2] * vecSize + myArray[dupIdxV1];
    dupIdxV1 = 0, dupIdxV2 = 0, dupIdxV3 = 0, dupIdxV4 = 0;
    }
    if (!miss && misIdxV1 && misIdxV2 && misIdxV3 && misIdxV4)
    {
    dup = myArray[misIdxV4] * vecSize * vecSize *vecSize + myArray[misIdxV3] * vecSize * vecSize + myArray[misIdxV2] * vecSize + myArray[misIdxV1];
    misIdxV1 = 0, misIdxV2 = 0, misIdxV3 = 0, misIdxV4 = 0;
    }
    break;
    }

    if (cmpVal != vecSize || dup && miss)
    {
    break;
    }
    }

    return pair(dup, miss);
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
    gotDupAndMissNum(generateArray(2154001, 785412, 5610121), 5610121);
    return 0;
    }

  97. Dude on

    this is a math problem not necessarily a programming problem.

  98. Dean on

    Haven’t done any programming in a while but, I would push the data into a hash table with a try/catch. This will show the duplicate in the error, then search for the one and only NULL entry in the has table.
    Job done in about 5 lines of code, and hash tables are very fast.

    • Max Schireson on

      I was looking for a solution that doesn’t require copying the data (say in case it was very very large).

      Interesting approach though.

      — Max

  99. Albert on

    CONSTANT total = sum the numbers 0..63
    CONSTANT square_total = sum of squares of numbers from 0..63

    iterate array,
    summing the numbers into sum,
    summing square of numbers into square_sum.

    LET diff = total – sum
    LET square_diff = square_total – sum

    duplicate = (square_diff – (expt diff 2)) / (diff * 2)
    missing = duplicate + dif

  100. Sudha on

    #include
    #include
    void main()
    {
    int arr[20],i,sum=0,dup,len,check;
    printf(“enter the array length”);
    scanf(“%d”,&len);
    check=(len*(len+1)/2);
    printf(“\n check = %d”,check);
    printf(“\nThe array with %d element”,len);
    printf(“\nenter the array element”);
    for(i=1;i<=len;i++)
    {
    scanf("%d",&arr[i]);
    if(i==arr[i])
    sum=i+sum;
    else
    dup=arr[i];
    }
    printf("the Missing no is %d and the duplicate number is %d",(check-sum),dup);
    getch();
    }

  101. Markiv on

    static void Main(string[] args)
    {
    int[] arr = new int[5] {1, 2, 4, 5, 5};

    float sum = 0;
    float product = 1;
    float expectedSum = 0;
    float expectedProduct = 1;
    for (int i = 0; i < arr.Length; i++)
    {
    sum += arr[i];
    product *= arr[i];
    expectedSum += (i + 1);
    expectedProduct *= (i + 1);
    }

    float missing = (expectedSum – sum) / (1 – product / expectedProduct);
    float duplicate = missing – (expectedSum – sum);

    Console.WriteLine("Missing – {0}, Duplicate – {1}", missing, duplicate);
    Console.ReadKey();
    }

  102. Antoine Guiral (@antoineguiral) on

    I don’t know where are stored the data (mongo? file?) but if they are in file you can use comm() http://en.wikipedia.org/wiki/Comm (and sort before the comparison :)).

  103. James on

    Years later, some python. Unemployed and bored… it seems to work fine, probably not linear though. At every index it checks front and back, then hops two. Nothing to do with Mongo but it’s 1:
    miss = arr[i]-1
    elif (arr[i+1]-arr[i]) > 1:
    miss = arr[i]+1

    if (i+2) == (len(arr)-1):
    step = 1
    i = i + step
    except Exception:
    True

    print “Missing: %d — Duplicate: %d” %(miss, dup)

    • James on

      Site messed up my codepaste.

      arr = [1,2,3,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26,27]

      miss = 0
      dup = 0

      try:
      i = 1
      step = 2

      for x in range(1, (len(arr)/2)+1):
      if arr[i] == arr[i-1] or arr[i] == arr[i+1]:
      dup = arr[i]
      elif (arr[i]-arr[i-1]) > 1:
      miss = arr[i]-1
      elif (arr[i+1]-arr[i]) > 1:
      miss = arr[i]+1

      if (i+2) == (len(arr)-1):
      step = 1
      i = i + step
      except Exception:
      True

      print “Missing: %d — Duplicate: %d” %(miss, dup)

    • Max Schireson on

      It looks like you only covered the special case of sorted list? The problem statement does not require the numbers to be sorted.


Leave a comment