Graph Paths Abstraction Algorithm Needed
I have a data structure holding a graph like the one in the following picture:
In this tree, a node can have any number of unique children from the levels below it. The tree in the picture represents a set of paths, where every path begins with a node from Level 1 and ends with a node marked with "*". So the paths of the tree in the picture are:
A then C then G
A then C then G then J
A then D then G
A then D then G then J
A then D then K
and so on...
Actually, my original tree is huge (around 2 million sequences), with at most 61 nodes per level across 11 levels. This causes many memory consumption problems in my application (a computer vision application for SAMSUNG).
My target is to have an iterative algorithm that represents these paths in a more compact string format. I think the problem divides into three steps, as follows. I have built the tree data structure (step 2), but I still cannot derive an iterative algorithm that produces the output string/sequence of step 3 from the tree.
1- Input String:
(A C G) | (A C G J) | (A D G) | (A D G J) | (A D K) | ....,
Where "|" represents alternatives.
2- Building Tree Data Structure of These Paths.
3- Required Output String:
(A (C G [J]) | (D (G [J]) | K)) | (B ....).
Where "|" represents alternatives and "[ ]" encloses options. The target output string should be fully factored, i.e. there should be no remaining common factors that could be pulled out to simplify it further.
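For concreteness, steps 1 and 2 can be sketched roughly as follows in Python. This is only an illustration, not the actual implementation: the function name `build_graph`, the use of an empty string as a dummy root, and the assumption that each label identifies exactly one node (so G is shared between C and D, as in the picture) are all choices made for the example.

```python
def build_graph(spec):
    """Parse '(A C G) | (A C G J) | ...' into (children, destinations)."""
    children = {"": []}      # "" is a hypothetical dummy root above Level 1
    destinations = set()     # nodes that end at least one path (the "*" nodes)
    for part in spec.split("|"):
        labels = part.strip().strip("()").split()
        if not labels:
            continue
        parent = ""
        for label in labels:
            children.setdefault(label, [])
            if label not in children[parent]:
                children[parent].append(label)  # keep children unique
            parent = label
        destinations.add(labels[-1])
    return children, destinations


children, destinations = build_graph(
    "(A C G) | (A C G J) | (A D G) | (A D G J) | (A D K)"
)
```

For the five example paths, this yields C and D as the children of A, G and K as the children of D, and G, J and K as the marked nodes.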
You can use a modification of iterative DFS, which utilizes a stack to keep track of unprocessed nodes. This algorithm never stores more than 6 characters on the stack* for any one node, and there are always fewer than N nodes on the stack (where N is the number of nodes in the graph). You've indicated that N will be at most 61*11=671, so there will be a maximum of about 4000 elements possible on the stack.
In the pseudocode below, a "destination" node is a starred node in the example above, e.g. G*.
A dummy node Φ is introduced with an edge from Φ to each of the "root" nodes, e.g. nodes A and B above. The token for Φ is assumed to be a non-printing character, or you can explicitly check before adding it to the output string. The node Φ is pushed onto the stack before calling the function.
    outString := ""
    while stack not empty
       pop token
       if token is node
          outString := outString + node(token) // Line 5 - explanation below
          if node(token) has children
             if node(token) is destination
                outString := outString + "["
                push "]"
             end
             if node(token) has multiple children
                for each child of node(token), from right to left
                   push ")"
                   push child
                   push "("
                   push "|"
                end
                pop // remove last "|"
             else
                push child
             end
          end
       else // token is one of ( ) | ]
          outString := outString + token
       end
    end
The output of this algorithm for the first part of your graph (A and its children) is (with extra spaces added for clarity; the spaces can be easily added to the code):
A (C G [J]) | (D (G [J]) | (K))
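If it helps, here is a minimal Python sketch of the pseudocode above. The dictionary encoding of the graph, the names `compact`, `CHILDREN` and `DESTINATIONS`, and the use of an empty string for Φ are assumptions made for the example. Φ is given only the child A here, so that the result matches the partial output shown above.

```python
PHI = ""  # dummy root; an empty string adds nothing to the output
PUNCTUATION = {"(", ")", "|", "]"}  # assumes no node label equals one of these

# Hypothetical encoding of the A-subgraph: children listed left to right.
CHILDREN = {
    PHI: ["A"],
    "A": ["C", "D"],
    "C": ["G"],
    "D": ["G", "K"],
    "G": ["J"],
    "J": [],
    "K": [],
}
DESTINATIONS = {"G", "J", "K"}  # the starred nodes


def compact(children, destinations, root=PHI):
    out = []
    stack = [root]
    while stack:
        token = stack.pop()
        out.append(token)  # "Line 5" for nodes; plain output for punctuation
        if token in PUNCTUATION:
            continue
        kids = children.get(token, [])
        if not kids:
            continue
        if token in destinations:
            out.append("[")    # everything below a destination is optional
            stack.append("]")
        if len(kids) > 1:
            for child in reversed(kids):  # right to left
                stack.extend([")", child, "(", "|"])
            stack.pop()  # remove the last "|"
        else:
            stack.append(kids[0])
    return "".join(out)


print(compact(CHILDREN, DESTINATIONS))  # A(CG[J])|(D(G[J])|(K))
```

The stack holds plain strings, so one list serves for both node tokens and punctuation tokens.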
You'll notice a deviation between your result and mine: the final node K is enclosed in parentheses in my solution. If this is undesirable (it could result in ugliness like A[(B)|(C)]), you can eliminate it by performing an additional check when you pop a node token off of the stack at the cost of some additional overhead. Simply replace Line 5 above with:
    if (node(token) has no children
        AND last character of outString is "("
        AND next token on stack is ")")
       remove trailing "(" from outString
       concatenate token to outString
       pop ")" from stack and ignore
    else
       outString := outString + node(token) // as above
    end
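A self-contained Python sketch of the algorithm with this extra check might look like the following. As before, the graph encoding and the name `compact_trimmed` are assumptions for illustration, and an empty string stands in for Φ.

```python
PUNCTUATION = {"(", ")", "|", "]"}  # assumes no node label equals one of these


def compact_trimmed(children, destinations, root=""):
    out = []       # list of emitted tokens, joined at the end
    stack = [root]
    while stack:
        token = stack.pop()
        if token in PUNCTUATION:
            out.append(token)
            continue
        kids = children.get(token, [])
        # Childless node wrapped as "(X)": drop both parentheses.
        if not kids and out and out[-1] == "(" and stack and stack[-1] == ")":
            out.pop()    # remove the trailing "("
            stack.pop()  # pop the matching ")" and ignore it
        out.append(token)
        if not kids:
            continue
        if token in destinations:
            out.append("[")
            stack.append("]")
        if len(kids) > 1:
            for child in reversed(kids):
                stack.extend([")", child, "(", "|"])
            stack.pop()  # remove the last "|"
        else:
            stack.append(kids[0])
    return "".join(out)


graph = {"": ["A"], "A": ["C", "D"], "C": ["G"], "D": ["G", "K"],
         "G": ["J"], "J": [], "K": []}
print(compact_trimmed(graph, {"G", "J", "K"}))  # A(CG[J])|(D(G[J])|K)
```

Because `out` is kept as a list of tokens, checking the "last character" is just a look at the last list element.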
Let me know if you have any questions or I've missed anything.
* This will happen in the (probably highly unlikely) case of a node being written as |[(A)]. Most nodes will take up 4 or fewer characters in the stack.