Fast diff implementation in C++ to count line differences in large files

Searched but didn't seem to find exactly what I needed.

I'm looking for a fast C++ equivalent of running the following command:

diff file1.log file2.log | wc -l

At present, I'm using pipes to run diff from the command line. However, I need to do this inside a large, deeply nested loop, and it takes considerably longer than I anticipated. The files being diffed are roughly 150-200 MB each, and each diff takes roughly 1-2 minutes.

Is there a faster solution that can be written in C++?

Here is my present method of calling it:

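#include <cstdio>
#include <string>

// Run a shell command and return everything it writes to stdout,
// or "err" if the pipe cannot be opened.
static std::string run_cmd(const std::string& in)
{
  FILE* pipe = popen(in.c_str(), "r");
  if (!pipe)  return "err";

  char buff[128];
  std::string res;
  // Loop on fgets itself: testing feof() first never terminates
  // if a read error occurs before end-of-file is reached.
  while (fgets(buff, sizeof buff, pipe) != NULL)
    res += buff;
  pclose(pipe);
  return res;
}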

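// Diff two files and return the number of lines diff prints.
// Note: this counts every output line, including hunk headers and "---" separators.
// `base` is a path prefix defined elsewhere (not shown here).
std::string fileDiff(const std::string& file1, const std::string& file2)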
{
  std::string f1 = base + file1;
  std::string f2 = base + file2;
  std::string cmd = "diff " + f1 + " " + f2 + " | wc -l";

  std::string res = run_cmd(cmd);
  if (res == "err") 
    return "E: Diff on [" + f1 + "] and [" + f2 + "]";

  return res;
}

Edit:

What I am essentially doing is logging code coverage. I've inserted logging statements into each nook and cranny of the codebase I'm working in, and I'm writing each run to its own log file. I've attempted to minimize the writing penalty by not including them in constructors, loops, etc., and have buffered the actual writing process.

The program typically took about 10 minutes to run. With my added logging and diff calls, it has scaled up to about 1 day.

I only care about the number of line differences in this case, as it feeds a fitness function in a genetic algorithm. The spread of execution paths between iterations is important at this point.

Answers


Launching an external process is fast. At 1-2 minutes per diff, the process spawn overhead is an insignificant fraction of the total. So you must be limited by either 1) the performance of the diff command itself or 2) inefficient reading and storing of the pipe's output data. Try running the diff command in the shell and redirecting its output to a file. Is it much faster? If not, the bottleneck is 1); if so, it is 2).
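
The same check can be made from inside the program. The following is only a minimal sketch, assuming a POSIX system where popen is available; time_command is a hypothetical helper name, not something from the question. It runs a command, throws its output away as cheaply as it can, and reports the wall-clock time.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical helper: runs `cmd` through popen, discards its output,
// and returns the elapsed wall-clock time in seconds (-1.0 if popen fails).
double time_command(const std::string& cmd)
{
  const auto start = std::chrono::steady_clock::now();

  FILE* pipe = popen(cmd.c_str(), "r");
  if (!pipe) return -1.0;

  // Drain the output in large chunks without storing it, so the cost of
  // reading stays as small as possible.
  char buf[1 << 16];
  while (fread(buf, 1, sizeof buf, pipe) > 0) {}
  pclose(pipe);

  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Timing the diff command once as is and once with "> /dev/null" appended (so nothing travels through the pipe) gives two numbers to compare. If they are close, the time is probably going into diff itself rather than into reading the pipe.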

I don't know much about Unix pipes, but a 128-byte buffer sounds small. The diff command is old and widely used, so it's unlikely that you could write a faster version.
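
One way to act on the buffer remark, offered as a sketch rather than a recommendation: read the pipe in larger chunks and count newlines directly instead of appending to a string. The name count_output_lines and the 64 KiB default are assumptions made for illustration. Note that in the command as posted, "wc -l" already reduces the output to a single number, so the read buffer only matters if that stage is dropped and the raw diff output is read by the program.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: runs `cmd` and returns the number of newline
// characters it writes to stdout, or -1 if the pipe cannot be opened.
long count_output_lines(const std::string& cmd, std::size_t buf_size = 1 << 16)
{
  FILE* pipe = popen(cmd.c_str(), "r");
  if (!pipe) return -1;

  std::vector<char> buf(buf_size);
  long lines = 0;
  std::size_t n;
  // Read raw chunks and count '\n' bytes instead of building a string.
  while ((n = fread(buf.data(), 1, buf.size(), pipe)) > 0)
    for (std::size_t i = 0; i < n; ++i)
      if (buf[i] == '\n') ++lines;

  pclose(pipe);
  return lines;
}
```

Called with the plain diff command (without "| wc -l"), this would return the same count as the original pipeline. If diff itself dominates the run time, it is unlikely to change the total much.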

