The Daily Static
  The Daily Static
UF Archives
Register
UF Membership
Ad Free Site
Postcards
Community

Geekfinder
UFie Gear
Advertise on UF

Forum Rules
& FAQ


Username

Password


Create a New Account

 
 

Back to UserFriendly Strip Comments Index

data structure in C++ / STL? by dennismv2012-01-08 16:22:49
  "STL"? by rebill 2012-01-08 20:20:35

If that is a typo for "SQL", and your data is in a database ... then you can let the DB do the heavy lifting for you.

Simplest algorithm, but has one technical flaw:

Inialize two variables (one "old_key", one "old_value") to impossible values. Set a counter to 0;

Run this query:

SELECT KEY, VALUE INTO new_key, new_value FROM TABLE ORDER BY KEY, VALUE;

Read one line at a time from the result set.

If the new_key matches the old_key, but the new_value does not match: you have a duplicate. Increase the counter. If the counter is 2, print the old_key and the old_value. No matter what the counter value was, print the new_value.

If the new_key does not match the old_key: if the old_key is not the bogus value, and the counter is greater than 1 print the counter. Either way, set the old_key to the new key, reset the counter to 1.

Always set the old_value = to the new value.

When the data set is exhausted, if the counter is > 1, print it.

Starting the counter at 0 instead of 1 causes the bogus data to disappear into the ether. Restarting it at 1 is because you have real data ... that you may or may not want to print on the next pass.

The problem with the above algorithm is that it forces the database to sort the entire table. RolandII's point about having so much data that it will not fit into memory breaks the above algorithm.

Slurping the entire table across a 10MB network connection between the database and the machine running the program may cause headaches for other people trying to use that same pipe.

The technical flaw that I mentioned? This algorithm cannot print the count of "dupes" in the header row - it only knows that number when it prints the trailer row.

Another algorithm, that abuses the database a bit differently:

SELECT KEY, COUNT (1) "DUPLICATES" FROM (SELECT DISTINCT KEY, VALUE FROM TABLE) HAVING COUNT (1) > 1 GROUP BY KEY;

This gives you the keys that have duplicates, but performs all of the sorting and filtering inside the database engine. You take that list, and then go back to the database one at a time to get the values.

That gives you the keys that have more than one associated value, and a total number of associated values that you can print.

The first algoritm (sort, then parse) is O(n) once your program starts processing - you only touch each data line one time. (The database sort is probably O(n*log(n)) or worse - but who really knows?)

The second algoritm hits the database for every set of duplicates. The database sort could be the same as before.

For true evil done unto a database:

SELECT DISTINCT KEY, VALUE FROM TABLE WHERE KEY IN (SELECT KEY FROM (SELECT DISTINCT KEY, VALUE FROM TABLE) HAVING COUNT (1) > 1 GROUP BY KEY);

or even:

SELECT A.KEY, A.DUPES, T.VALUE FROM (SELECT KEY, COUNT(1) AS DUPES FROM (SELECT DISTINCT KEY, VALUE FROM TABLE) HAVING COUNT (1) > 1 GROUP BY KEY) A JOIN TABLE T ON T.KEY = A.KEY;

For small data sets, no one will notice. For large data sets, you might get the hard drives in the database server to grind hard enough to walk the server around the room.

[ Reply ]
    No, I think "Standard Template Library" by DesiredUsername2012-01-08 21:59:17

 

[Todays Cartoon Discussion] [News Index]

Come get yer ARS (Account Registration System) Source Code here!
All images, characters, content and text are copyrighted and trademarks of J.D. Frazer except where other ownership applies. Don't do bad things, we have lawyers.
UserFriendly.Org and its operators are not liable for comments or content posted by its visitors, and will cheerfully assist the lawful authorities in hunting down script-kiddies, spammers and other net scum. And if you're really bad, we'll call your mom. (We're not kidding, we've done it before.)