Hi there,
The requirements look clear and straightforward to implement. However, Tika is not the right tool for this job: it targets metadata and structured text. Unless I'm missing something, what you need is optical character recognition (OCR) to parse the data from scanned images, and those two are very different things.
Rolling a solution from scratch is way out of scope for this project and overkill in the first place, so I suggest using tesseract-ocr, a renowned open source engine for this type of work. I have used it several times with pretty good results. I should note, though, that as with any OCR implementation, the success rate won't be 100%, meaning there will be files it can't parse, e.g. a badly scanned image.
About your questions,
1. I'm planning to use Python with Tesseract.
2. It won't work on different templates. The effort needed to make it work for another template depends directly on how much the templates differ. For instance, if it's a completely different template with a new layout, font, color, etc., a brand new parser would have to be written for it from scratch.
3. It's impossible to give any decent figure for that. Layout, font, coloring, and clarity all affect the result. For instance, the last page of the example document is very hard, if not impossible, to parse.
4. I can start next Wednesday, 27th of March, and expect this to take 15 days.
I'll need a lot of these files to train the engine.
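
To give a rough idea of what I mean in points 1 and 2, here is a minimal sketch, not the final implementation. It assumes the pytesseract wrapper and Pillow are installed alongside Tesseract itself. The field names, the regular expressions, and the names EXAMPLE_TEMPLATE, ocr_image and parse_fields are all made up for illustration; the real patterns would have to be written against your actual documents.

```python
import re

# Hypothetical template: field name -> regex with exactly one capture group.
# These labels are invented for illustration only; a real template has to be
# written against the actual scanned documents.
EXAMPLE_TEMPLATE = {
    "invoice_no": r"Invoice\s*No\.?\s*[:#]?\s*(\S+)",
    "date": r"Date\s*:?\s*(\d{2}/\d{2}/\d{4})",
    "total": r"Total\s*:?\s*([\d.,]+)",
}


def ocr_image(path):
    """Run Tesseract on one scanned page and return the raw text."""
    # Imported lazily so the parsing code below can be used and tested
    # without the OCR stack (pytesseract, Pillow, Tesseract) installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))


def parse_fields(text, template):
    """Extract each template field from OCR text; None means 'not found'."""
    result = {}
    for name, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result
```

Usage would look like parse_fields(ocr_image("page1.png"), EXAMPLE_TEMPLATE), where "page1.png" is a placeholder path. Every pattern is tied to one layout, which is why a different template needs its own set of patterns, and a field that comes back as None flags a page the OCR could not read reliably.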