Should I use real or sample data for unit tests?
I'm writing a parser for the output of a legacy application, and since there are no specs on the file syntax I've got as many samples of these files as I could.
Now I'm writing the unit tests before implementing the parser (because there is no other sane way to do this) but I'm not sure whether I should:
- use the real files produced by the application, reading from them and comparing the output with the output that I would store in json format in another file.
- or create a sample string with the tokens and possibilities I want to test and a dict (this is python) with the expected output.
I'm inclined to use the second alternative because I would test only what I need to, without all the "real-world" data included on the actual files, but I'm afraid I could forget to test for one possibility or another.
What do you think?
My suggestion is to do both. Write a set of integration tests that run through all the files you have with the expected outputs then unit test with your expected inputs to isolate the parsing logic.
I would recommend writing the integration tests first so you write your parser outside in, it might be disparaging to see a bunch of failing tests, but it'll help you isolate your edge cases earlier.
Btw, I think this is a great question. I recently came across something a similar problem which was transforming large xml feeds from an upstream system into a proprietary format. My solution was to write a set of integration black box tests for the full feeds testing things like record counts and other high level success metrics, then break down inputs into smaller and smaller chunks until I was able to test all the permutations of the data. It was only then that I had a good understanding of how to build the parser.