How can I find the common passwords in two given 20GB files?

I have two 20GB files. I have to remove the passwords that are common to both files from one of them.

I sorted the second file with the UNIX sort command. After that, I split the sorted file into many smaller files with the split command, so that each one fits in RAM. After splitting into n files, I used a structure array of size n to store the first password of each split file together with its file name.

Then, for each key in the first file, I binary-searched that structure array (comparing the key against the first password stored in each structure) to find the index of the corresponding split file, and then binary-searched that split file.

I assumed a maximum password length of 20 characters.
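
For reference, the index I build over the split files looks roughly like this (a simplified sketch of the approach described above, not my actual code; the struct layout and names are only illustrative):

#include <cstring>
#include <string>
#include <vector>

const int MAX_PW = 20;              // assumed maximum password length

struct SplitIndex {
    char first_pw[MAX_PW + 1];      // first (smallest) password in the split file
    std::string filename;           // name of that split file
};

// The index is sorted by first_pw. Binary search returns the position of the
// split file that should contain `key`, i.e. the last entry whose first_pw <= key.
int find_split(const std::vector<SplitIndex>& idx, const char* key) {
    int lo = 0, hi = (int)idx.size() - 1, ans = 0;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (std::strcmp(idx[mid].first_pw, key) <= 0) { ans = mid; lo = mid + 1; }
        else { hi = mid - 1; }
    }
    return ans;     // the key, if present, must be inside idx[ans].filename
}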

This program is not yet efficient.

Please help me make it more efficient, if possible.

Please also give me some advice on how to sort the 20GB file efficiently.

I am on a 64-bit system with 8 GB of RAM and an Intel Core i3 quad-core processor.

I tested my program with two 10 MB files; it took about 2.66 hours without any compiler optimization options. According to my estimates, checking every password of the 20GB files will take about 7-8 hours after the splitting, sorting, and binary searching.

Can I improve its time complexity? In other words, can I make it run faster?

Answers


Searching in external files is going to be painfully slow, even using binary search. You might speed it up by putting the data in an actual database designed for fast lookups. You could also sort both text files once and then do a single linear scan over them to filter out the common words. Something like the following pseudocode:

sort the files using any suitable sorting utility
open files A and B for reading
read wordA from A
read wordB from B
while (A not EOF and B not EOF)
{
    if (wordA < wordB)
      write wordA to output
      read wordA from A
    else if (wordA > wordB)
      read wordB from B
    else
      /* match found, don't output wordA */
      read wordA from A
}
while (A not EOF) /* output remaining words */
{
    write wordA to output
    read wordA from A
}

Check out external sorting. See http://www.umbrant.com/blog/2011/external_sorting.html which does have code at the end of the page (https://github.com/umbrant/extsort).

The idea behind external sorting is to select and sort equidistant samples from the file, then partition the file at the sample points, sort the partitions, and merge the results.

example numbers = [1, 100, 2, 400, 60, 5, 0, 4]
example samples (distance 4) = 1, 60
chunks = {0,1,2,5,4} , {60, 100, 400}
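
Here is a toy C++ sketch of that sampling and partitioning step, run on the numbers above and kept entirely in memory just to show the idea; a real external sort would write each chunk to its own temporary file, sort it, and then concatenate the sorted chunks:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> numbers = {1, 100, 2, 400, 60, 5, 0, 4};

    // Take equidistant samples (every 4th element) and sort them.
    std::vector<int> samples;
    for (std::size_t i = 0; i < numbers.size(); i += 4)
        samples.push_back(numbers[i]);
    std::sort(samples.begin(), samples.end());            // -> 1, 60

    // Partition the data at the sample points.
    std::vector<std::vector<int>> chunks(samples.size());
    for (int x : numbers) {
        std::size_t j = std::upper_bound(samples.begin(), samples.end(), x)
                        - samples.begin();
        if (j == 0) j = 1;                                 // values below the first sample go in chunk 0
        chunks[j - 1].push_back(x);
    }

    // Sort each chunk; concatenating them yields the fully sorted data.
    for (auto& c : chunks) {
        std::sort(c.begin(), c.end());
        for (int x : c) std::cout << x << ' ';
        std::cout << '\n';
    }
    return 0;
}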

Also, I don't think splitting the file is a good idea, because you need to write another 20GB to disk just to split it. You might as well create the structure on the fly by seeking within the file.
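
For example, since the question already assumes a 20-character maximum, you could rewrite the sorted file once as fixed-width records and then binary-search it directly with seeks, with no physical split at all. A rough sketch, where the record layout (each password NUL-padded to 21 bytes) and names are assumptions:

#include <cstring>
#include <fstream>
#include <string>

const long long RECORD = 21;    // 20-character maximum password, NUL-padded

// f must be a sorted, fixed-width file opened with std::ios::binary;
// nrecords is simply the file size in bytes divided by RECORD.
bool contains(std::ifstream& f, long long nrecords, const std::string& key) {
    char buf[RECORD];
    long long lo = 0, hi = nrecords - 1;
    while (lo <= hi) {
        long long mid = lo + (hi - lo) / 2;
        f.seekg(mid * RECORD);
        f.read(buf, RECORD);
        int cmp = std::strcmp(key.c_str(), buf);   // buf is NUL-terminated by the padding
        if (cmp == 0) return true;
        if (cmp < 0) hi = mid - 1; else lo = mid + 1;
    }
    return false;
}

Each lookup still costs about log2(nrecords) disk seeks, but you avoid writing another 20GB just to split the file.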


For a previous SE question, "What algorithm to use to delete duplicates?", I described an algorithm for a probably similar problem, except with 50GB files instead of 20GB. The method is faster than sorting the big files in that problem.

Here is an adaptation of the method to your problem. Let's call the original two files A and B, and suppose A is larger than B. I don't understand from your problem description what is supposed to happen if or when a duplicate is detected, but in the following I assume you want to leave file A unchanged, and remove from B any items that also are in A. I also assume that entries within A are specified to be unique within A at the outset, and similarly for B. If that is not the case, the method needs more adapting and about twice as much I/O.

Suppose you can fit 1/k'th of file A into memory and still have room for the other required data structures. The whole file B can then be processed in k or fewer passes, as below, and this has a chance of being much faster than sorting either file, depending on line lengths and sort-algorithm constants. Sorting averages O(n ln n) and the process below is O(k n) worst case. For example, if lines average 10 characters and there are n = 2G lines, ln(n) ~ 21.4, likely to be about 4 times as bad as O(k n) if k=5. (Algorithm constants still can change the situation either way, but with a fast hash function the method has good constants.)

Process:

  1. Let Q = B (ie rename or copy B to Q)
  2. Allocate a few gigabytes for a work buffer W, and a gigabyte or so for a hash table H. Open input files A and Q, output file O, and temp file T.
  3. Fill work buffer W by reading from file A.
  4. For each line L in W, hash L into H, such that H[hash[L]] indexes line L.
  5. Read all of Q, using H to detect duplicates, writing non-duplicates to temp file T.
  6. Close and delete Q, rename T to Q, open new temp file T.
  7. If EOF(A), rename Q to B and quit, else go to step 3.

Note that after each pass (i.e. at the start of step 7) none of the lines in Q are duplicates of what has been read from A so far. Thus, 1/k'th of the original file is processed per pass, and processing takes k passes. Also note that although processing will be I/O bound, you can read and write several times faster with big buffers (e.g. 8 MB) than line by line. The algorithm as stated above does not include buffering details or how to deal with partial lines in big buffers.
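
Here is a simplified C++ sketch of one pass of this process (steps 3 through 6), using std::unordered_set in place of the work buffer W and hash table H, and ignoring the big-buffer I/O details mentioned above; the file names and chunk size are placeholders:

#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <unordered_set>

// One pass: load the next chunk of A into memory, then rewrite Q without
// any line that appears in that chunk. Returns false once A is exhausted.
bool one_pass(std::ifstream& a, std::size_t chunk_lines) {
    std::unordered_set<std::string> h;              // stands in for W + H
    std::string line;
    while (h.size() < chunk_lines && std::getline(a, line))
        h.insert(line);
    if (h.empty()) return false;                    // EOF(A): nothing left to filter against

    std::ifstream q("Q.txt");
    std::ofstream t("T.txt");
    while (std::getline(q, line))
        if (h.find(line) == h.end())                // not a duplicate of this chunk of A
            t << line << '\n';
    q.close();
    t.close();
    std::remove("Q.txt");                           // step 6: delete Q, rename T to Q
    std::rename("T.txt", "Q.txt");
    return true;
}

int main() {
    // Step 1 (copying B to Q) is assumed to have been done beforehand.
    std::ifstream a("A.txt");
    while (one_pass(a, 100000000)) {}               // ~1e8 lines per chunk; tune to your RAM
    std::rename("Q.txt", "B.filtered");             // step 7: the filtered B
    return 0;
}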

Here is a simple performance example: Suppose A and B both are 20GB files, that each has about 2G passwords in it, and that duplicates are quite rare. Also suppose 8GB RAM is enough for work buffer W to be 4GB in size, leaving enough room for hash table H to have say 0.6G 4-byte entries. Each pass (steps 3-6) reads 20% of A and reads and writes almost all of B, at each pass weeding out any password already seen in A. I/O is about 120GB read (1*A + 5*B), 100GB written (5*B).

Here is a more involved performance example: Suppose about 1G randomly distributed passwords in B are duplicated in A, with all else as in previous example. Then I/O is about 100GB read and 70GB written (20+20+18+16+14+12 and 18+16+14+12+10, respectively).


Like so:

  1. Concatenate the two files.
  2. Use sort to sort the total result.
  3. Use uniq to remove duplicates from the sorted total.

If C++ is an option for you, the ready-to-use STXXL library should be able to handle your dataset.

Anyway, if you use an external sort in C, as suggested by another answer, I think you should sort both files and then scan them sequentially. The scan should be fast, and the two sorts can be done in parallel.

