If that is a typo for "SQL", and your data is in a database ... then you can let the DB do the heavy lifting for you.
Simplest algorithm, but has one technical flaw:
Initialize two variables (one "old_key", one "old_value") to impossible values, and set a counter to 0.
Run this query:
SELECT KEY, VALUE INTO new_key, new_value FROM TABLE ORDER BY KEY, VALUE;
Read one line at a time from the result set.
If the new_key matches the old_key, but the new_value does not match: you have a duplicate. Increase the counter. If the counter is 2, print the old_key and the old_value. No matter what the counter value was, print the new_value.
If the new_key does not match the old_key: if the old_key is not the bogus value and the counter is greater than 1, print the counter. Either way, set the old_key to the new_key and reset the counter to 1.
Always set the old_value to the new_value.
When the data set is exhausted, if the counter is > 1, print it.
Starting the counter at 0 instead of 1 causes the bogus data to disappear into the ether. Restarting it at 1 is because you have real data ... that you may or may not want to print on the next pass.
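The steps above can be sketched in Python. This is a minimal, hypothetical demo: it uses the stdlib sqlite3 module with an in-memory table t(key, value) standing in for TABLE, and collects the "printed" lines in a list so they are easy to inspect.

```python
import sqlite3

# Hypothetical sample data; table "t" stands in for TABLE in the post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key TEXT, value TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", "1"), ("a", "2"), ("a", "2"),
                  ("b", "1"), ("c", "1"), ("c", "3")])

def find_dupes(conn):
    """Walk the sorted rows once, reporting keys with more than one value."""
    out = []                          # collected report lines (stands in for print)
    old_key, old_value = None, None   # "impossible" sentinel values
    counter = 0
    for new_key, new_value in conn.execute(
            "SELECT key, value FROM t ORDER BY key, value"):
        if new_key == old_key and new_value != old_value:
            counter += 1
            if counter == 2:          # second distinct value: emit key + first value
                out.append(f"{old_key}: {old_value}")
            out.append(f"{old_key}: {new_value}")
        elif new_key != old_key:
            if old_key is not None and counter > 1:
                out.append(f"count: {counter}")   # trailer for the previous key
            old_key = new_key
            counter = 1
        # exact duplicate rows (same key AND value) fall through untouched
        old_value = new_value
    if counter > 1:                   # flush the trailer for the final key
        out.append(f"count: {counter}")
    return out
```

Note how the exact-duplicate row ("a", "2") twice in the sample data is silently skipped: the value-match check does the de-duplication that DISTINCT would otherwise do.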
The problem with the above algorithm is that it forces the database to sort the entire table. RolandII's point about having so much data that it will not fit into memory breaks the above algorithm.
Slurping the entire table across a 10MB network connection between the database and the machine running the program may cause headaches for other people trying to use that same pipe.
The technical flaw that I mentioned? This algorithm cannot print the count of "dupes" in the header row - it only knows that number when it prints the trailer row.
Another algorithm, that abuses the database a bit differently:
SELECT
KEY,
COUNT(1) "DUPLICATES"
FROM
(SELECT DISTINCT KEY, VALUE FROM TABLE)
GROUP BY
KEY
HAVING COUNT(1) > 1;
This gives you the keys that have more than one associated value, plus a count of those values that you can print, while performing all of the sorting and filtering inside the database engine. You take that list and then go back to the database, one key at a time, to get the values.
The first algorithm (sort, then parse) is O(n) once your program starts processing - you only touch each data line one time. (The database sort is probably O(n*log(n)) or worse - but who really knows?)
The second algorithm hits the database once for every set of duplicates. The database sort could be the same as before.
For true evil done unto a database:
SELECT DISTINCT KEY, VALUE FROM TABLE WHERE KEY IN
(SELECT
KEY
FROM
(SELECT DISTINCT KEY, VALUE FROM TABLE)
GROUP BY
KEY
HAVING COUNT(1) > 1);
or even:
SELECT
A.KEY, A.DUPES, T.VALUE
FROM
(SELECT
KEY, COUNT(1) AS DUPES
FROM
(SELECT DISTINCT KEY, VALUE FROM TABLE)
GROUP BY
KEY
HAVING COUNT(1) > 1) A
JOIN TABLE T ON T.KEY = A.KEY;
For small data sets, no one will notice. For large data sets, you might get the hard drives in the database server to grind hard enough to walk the server around the room.