Organising inconsistent values

No idea if this is ok to ask here since it's not programming but I have no idea where else to go: I want to organise the following data in a consistent way. At the moment it's a mess, with only the first two columns (comma seperated) consistent. The remaining columns can number anywhere from 1-9 and are usually different. In other words, I want to sort it so the text matches (all of the value columns in a row, all of the recoil columns in a row, etc). Then I can remove the text and add a header, and it will still make sense.

bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1 
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1 
bm_wp_upg_o_marksmansight_front, extra, value = 1 
bm_wp_m4_upper_reciever_edge, upper_reciever, value = 3, recoil = 1 
bm_wp_m4_upper_reciever_round, upper_reciever, value = 1 
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2 

Any suggestions (even on just where the right place is to actually ask this) would be great. Context is just raw data ripped from a game fiile that I'm trying to organise.

Thanks everyone!

Answers


I'm afraid regex isn't going to help you much here because of the irregular nature of your input (it would be possible to match it, but it would be a bear to get it all arranged one way or another). This could be done pretty easily with any programming language, but for stuff like this, I always go to awk.

Assuming your input is in a file called input.txt, put the following in a program called parse.awk:

BEGIN {
    FS=" *, *";
    formatStr = "%32s,%8s,%8s,%8s,%10s,%16s,%8s,%18s,%10s,%10s,%16s,%16s\n";
    printf( formatStr, "id", "sight", "value", "zoom", "recoil", "spread_moving", "extra", "upper_receiver", "barrel", "damage", "spread_moving", "concealment" );
}

{
    split("",a);
    for( i=2; i<=NF; i++ ) {
        if( split( $(i), kvp, " *= *" ) == 1 ) {
            a[kvp[1]] = "x";
        } else {
            a[kvp[1]] = gensub( /^\s*|\s*$/, "", "g", kvp[2] );
        }
    }
    printf( formatStr, $1, a["sight"], a["value"], a["zoom"], a["recoil"],
        a["spread_moving"], a["extra"], a["upper_receiver"], 
        a["barrel"], a["damage"], a["spread_moving"], a["concealment"] );
}

Run awk against it:

awk -f parse.awk input.txt

And get your output:

                              id,   sight,   value,    zoom,    recoil,   spread_moving,   extra,    upper_receiver,    barrel,    damage,   spread_moving,     concealment
             bm_wp_upg_o_t1micro,       x,       3,       3,         1,              -1,        ,                  ,          ,          ,              -1,
  bm_wp_upg_o_marksmansight_rear,       x,       3,       1,         1,                ,        ,                  ,          ,          ,                ,
 bm_wp_upg_o_marksmansight_front,        ,       1,        ,          ,                ,       x,                  ,          ,          ,                ,
    bm_wp_m4_upper_reciever_edge,        ,       3,        ,         1,                ,        ,                  ,          ,          ,                ,
   bm_wp_m4_upper_reciever_round,        ,       1,        ,          ,                ,        ,                  ,          ,          ,                ,
            bm_wp_m4_uupg_b_long,        ,       4,        ,          ,              -2,        ,                  ,         x,         1,              -2,              -2

Note that I chose to just use an 'x' for sight, which seems to a present/absent thing. You can use whatever you want there.

If you're using Linux or a Macintosh, you should have awk available. If you're on Windows, you'll have to install it.


I did make another awk version. I think this should a little easier to read. All value/column are read from the file to make it as dynamic as possible.

awk -F, '
    {
    ID[$1]=$2                   # use column 1 as index
    for (i=3;i<=NF;i++ )        # loop through all fields from #3 to end
        {
        gsub(/ +/,"",$i)            # remove space from field
        split($i,a,"=")         # split field in name and value a[1] and a[2]
        COLUMN[a[1]]++          # store field name as column name
        DATA[$1" "a[1]]=a[2]    # store data value in DATA using field #1 and column name as index
        }
    } 
END {
    printf "%49s   ","info"     # print info
    for (i in COLUMN)
        {printf "%15s",i}       # print column name
    print ""
    for (i in ID)               # loop through all ID
        {
        printf "%32s %16s ",i, ID[i]    # print ID and info
        for (j in COLUMN)
            {
            printf "%14s ",DATA[i" "j]+0    # print value
            }
        print ""
        }
    }' file

Output

                                             info            spread         recoil           zoom    concealment  spread_moving         damage          value
   bm_wp_m4_upper_reciever_round   upper_reciever              0              0              0              0              0              0              1
            bm_wp_m4_uupg_b_long           barrel              1              0              0             -2             -2              1              4
  bm_wp_upg_o_marksmansight_rear            sight              1              1              1              0              0              0              3
 bm_wp_upg_o_marksmansight_front            extra              0              0              0              0              0              0              1
    bm_wp_m4_upper_reciever_edge   upper_reciever              0              1              0              0              0              0              3
             bm_wp_upg_o_t1micro            sight              0              1              3              0             -1              0              3

Stick with Ethan's answer — this is just me enjoying myself. (And yes, that makes me pretty weird!)

awk script
awk 'BEGIN  {
                # f_idx[field] holds the column number c for a field=value item
                # f_name[c]    holds the names
                # f_width[c]   holds the width of the widest value (or the field name)
                # f_fmt[c]     holds the appropriate format
                FS = " *, *"; n = 2;
                f_name[0] = "id";   f_width[0] = length(f_name[0])
                f_name[1] = "type"; f_width[1] = length(f_name[1])
            }
            {
                #-#print NR ":" $0
                line[NR,0] = $1
                len = length($1)
                if (len > f_width[0])
                    f_width[0] = len
                line[NR,1] = $2
                len = length($2)
                if (len > f_width[1])
                    f_width[1] = len
                for (i = 3; i <= NF; i++)
                {
                    split($i, fv, " = ")
                    #-#print "1:" fv[1] ", 2:" fv[2]
                    if (!(fv[1] in f_idx))
                    {
                        f_idx[fv[1]] = n
                        f_width[n++] = length(fv[1])
                    }
                    c = f_idx[fv[1]]
                    f_name[c] = fv[1]
                    gsub(/ /, "", fv[2])
                    len = length(fv[2])
                    if (len > f_width[c])
                        f_width[c] = len
                    line[NR,c] = fv[2]
                    #-#print c ":" f_name[c] ":" f_width[c] ":" line[NR,c]
                }
            }
     END    {
                for (i = 0; i < n; i++)
                    f_fmt[i] = "%s%" f_width[i] "s"
                #-#for (i = 0; i < n; i++)
                #-#    printf "%d: (%d) %s %s\n", i, f_width[i], f_name[i], f_fmt[i]
                #-#    pad = ""
                for (j = 0; j < n; j++)
                {
                    printf f_fmt[j], pad, f_name[j]
                    pad = ","
                }
                printf "\n"
                for (i = 1; i <= NR; i++)
                {
                    pad = ""
                    for (j = 0; j < n; j++)
                    {
                        printf f_fmt[j], pad, line[i,j]
                        pad = ","
                    }
                    printf "\n"
                }
            }' data

This script adapts to the data it finds in the file. It assigns the column heading 'id' to column 1 of the input, and 'type' to column 2. For each of the sets of values in columns 3..N, it splits up the data into key (in fv[1]) and value (in fv[2]). If the key has not been seen before, it is assigned a new column number, and the key is stored as the column name, and the width of key as the initial column width. Then the value is stored in the appropriate column within the line.

When all the data's read, the script knows what the column headings are going to be. It can then create a set of format strings. Then it prints the headings and all the rows of data. If you don't want fixed width output, then you can simplify the script considerably. There are some (mostly minor) simplifications that could be made to this script.

Data file
bm_wp_upg_o_t1micro, sight, value = 3, zoom = 3, recoil = 1, spread_moving = -1 
bm_wp_upg_o_marksmansight_rear, sight, value = 3, zoom = 1, recoil = 1, spread = 1 
bm_wp_upg_o_marksmansight_front, extra, value = 1 
bm_wp_m4_upper_receiver_edge, upper_receiver, value = 3, recoil = 1 
bm_wp_m4_upper_receiver_round, upper_receiver, value = 1 
bm_wp_m4_uupg_b_long, barrel, value = 4, damage = 1, spread = 1, spread_moving = -2, concealment = -2
Output
                             id,          type,value,zoom,recoil,spread_moving,spread,damage,concealment
            bm_wp_upg_o_t1micro,         sight,    3,   3,     1,           -1,      ,      ,           
 bm_wp_upg_o_marksmansight_rear,         sight,    3,   1,     1,             ,     1,      ,           
bm_wp_upg_o_marksmansight_front,         extra,    1,    ,      ,             ,      ,      ,           
   bm_wp_m4_upper_receiver_edge,upper_receiver,    3,    ,     1,             ,      ,      ,           
  bm_wp_m4_upper_receiver_round,upper_receiver,    1,    ,      ,             ,      ,      ,           
           bm_wp_m4_uupg_b_long,        barrel,    4,    ,      ,           -2,     1,     1,         -2

Need Your Help

sbt.ResolveException on compilation in IDEA

playframework intellij-idea playframework-2.0

Everything works fine when I run "play clean compile" directly from comsole.

Ensure that mouse cursor coordinate is equal to setBounds coordinate of a component

java swing mouseevent mousemove

This is a followup question from my previous question. I am trying to solve the issue in the previous question.

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.