You don't need the tags: image processing alone can track troops across the tabletop. Since you are taking known entities within a known fixed space and doing known fixed things with them, this is a closed-form image processing problem.
I dislike the "easy .. hard" continuum because difficulty is really a multidimensional thing. Usually, I think we can get away with expressing what I term the two major dimensions: complexity and level of effort. So processing the image of a wargame tabletop and locating all the units – and why not dice while we're at it – (presumably to feed to an AR app for players or possibly an automated adjudication system) is low complexity, high level of effort.
Lots of grunt work, but not lots of figuring things out.
Depending on how well your figures blend with the tabletop, it will take more or less effort to train a clustering algorithm to find and identify them. (Dice are easy on both fronts, even QILS dice.) This will also drive the minimum camera resolution (pixels, colour depth) you need. I doubt you could find a digital camera today that couldn't handle the speed requirement.
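To make the contrast point concrete, here is a minimal sketch in plain Python. It is not the trained clustering approach described above; it swaps in a much simpler stand-in (colour thresholding against the tabletop colour, then connected-component grouping) just to show how poorly contrasting figures drop out. The function name find_blobs, the colour values, the threshold, and the tiny 6x8 "image" are all made up for illustration.

```python
from collections import deque

def find_blobs(image, table_colour, threshold=60, min_pixels=2):
    """Group pixels that differ from the tabletop colour into connected blobs.

    image is a list of rows of (r, g, b) tuples. Returns a list of
    (centroid_row, centroid_col, pixel_count), in scan order.
    """
    rows, cols = len(image), len(image[0])

    def is_foreground(r, c):
        # Manhattan distance in RGB space; a crude contrast measure.
        return sum(abs(a - b) for a, b in zip(image[r][c], table_colour)) > threshold

    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not is_foreground(r, c):
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and is_foreground(ny, nx)):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) >= min_pixels:
                cy = sum(p[0] for p in pixels) / len(pixels)
                cx = sum(p[1] for p in pixels) / len(pixels)
                blobs.append((cy, cx, len(pixels)))
    return blobs

# Hypothetical colours: green felt, a red unit, and a drab unit that blends in.
TABLE = (40, 120, 40)
RED_UNIT = (200, 30, 30)
DRAB_UNIT = (60, 110, 60)

image = [[TABLE] * 8 for _ in range(6)]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    image[r][c] = RED_UNIT
for r, c in [(4, 5), (4, 6)]:
    image[r][c] = DRAB_UNIT

print(find_blobs(image, TABLE))                # only the red unit is found
print(find_blobs(image, TABLE, threshold=40))  # a lower threshold also finds the drab unit
```

With the default threshold the drab unit is invisible; lowering the threshold recovers it, but on a real table that would also start picking up shadows and terrain texture. That tuning is the effort that varies with how well the figures stand out.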
One other bit you would want to budget for is the LOE (level of effort) cost of putting the camera overhead. Otherwise, the tool you are building will have the same "I forgot about those guys behind the building" problem as players.
The biggest part of the work would be slogging through training the clustering algorithm to recognize the troops across the desired presentation set. Orientation may or may not cost a lot of work (LOE for this is also lessened by putting the camera overhead). The effort of dealing with contrast against different terrain types will vary quite a bit with the nature of your terrain and figures.
So as you change terrain tables, you may or may not have to retrain the algorithm. This is where the low level of complexity comes in … knowing, for this set of minis and this set of terrain, whether or not you have to retrain the algorithm.
Another bit of this issue is that training for, say, 10 troop types and 10 terrain types creates billions of combinations. You don't have to train for every possible combination, but the bigger the state space, the bigger the training load. But if you know these 3 troop types will only be used with 4 terrain types, the next 4 with 6 of them, and the last 3 with only 3 of them, the aggregate performance space is much, much smaller. Ten and ten are probably very, very low numbers, so figuring out how to optimize your performance space by subdividing the possibilities could become moderately complex.
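A back-of-the-envelope sketch of the subdividing idea, under a big simplifying assumption: it counts only single troop-type/terrain-type pairings, not the full state space (orientation, lighting, mixes of units in one shot), which is where the much larger numbers come from. The helper name pairings is made up for illustration.

```python
def pairings(groups):
    """Count troop-type/terrain-type pairs, given (troop_types, terrain_types) per group."""
    return sum(troops * terrains for troops, terrains in groups)

# Everything can appear on everything: 10 troop types x 10 terrain types.
full = pairings([(10, 10)])

# Subdivided as above: 3 troop types on 4 terrains, 4 on 6, 3 on 3.
subdivided = pairings([(3, 4), (4, 6), (3, 3)])

print(full, subdivided)  # 100 45
```

Even at this crude level the subdivided space is under half the size, and the gap widens once the other dimensions multiply in.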
Overall, I would say your NFC idea is the more elegant solution.
If you are the kind of nerd who would enjoy setting up the image recognition approach (I am) and would consider the training aspects fun Zen work instead of drudgery, it would be fine.
Or you could hire people to build it for you.