Hello, here are some considerations:
If you are only interested in the final static 3D model, you may consider an approach based on multiple photos (not just one or two) with little or no user interaction.
This can be done by running a structure-from-motion algorithm on the server (there are good implementations available, such as VisualSFM) to obtain a sparse point-based 3D model, and then fitting a parametric head model to it. The parametric model can be tagged with special features that can be detected automatically in the pictures (e.g. the tip of the nose, the eyes, the mouth corners).
This way you would need more photos, and the subject would have to hold the same expression across them, but the final result would be superior and practically no interaction would be required, apart from minor editing of the 3D model.
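As a rough illustration of the fitting step, here is a minimal least-squares alignment of a parametric model's landmark points to the corresponding points recovered from the photos. This is only a sketch: it solves for scale and translation in closed form, while a real fit would also solve for rotation and the model's shape coefficients, and the landmark data below is made up for the example.

```python
def fit_scale_translation(model_pts, target_pts):
    """Find scale s and translation t minimizing sum |s*m + t - p|^2
    over corresponding landmark pairs (m, p), both lists of (x, y, z)."""
    n = len(model_pts)
    # centroids of both landmark sets
    cm = [sum(p[i] for p in model_pts) / n for i in range(3)]
    ct = [sum(p[i] for p in target_pts) / n for i in range(3)]
    # closed-form least-squares scale from centered coordinates
    num = sum((m[i] - cm[i]) * (p[i] - ct[i])
              for m, p in zip(model_pts, target_pts) for i in range(3))
    den = sum((m[i] - cm[i]) ** 2 for m in model_pts for i in range(3))
    s = num / den
    # translation maps the scaled model centroid onto the target centroid
    t = [ct[i] - s * cm[i] for i in range(3)]
    return s, t

# hypothetical model landmarks and their reconstructed positions
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
target = [(1.0, 2.0, 3.0), (3.0, 2.0, 3.0), (1.0, 4.0, 3.0), (1.0, 2.0, 5.0)]
print(fit_scale_translation(model, target))  # scale 2, translation (1, 2, 3)
```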
In this case I would use JavaScript for the web interface, and Python and C++ code for the SfM pipeline.
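On the server, the reconstruction step could simply drive the VisualSFM command-line tool from Python. The binary name, flag string, and paths below are assumptions to be checked against the actual installation, so the command is built separately from the call that executes it.

```python
import subprocess

def build_sfm_command(image_dir, output_nvm, binary="VisualSFM"):
    """Assemble the reconstruction command.
    'sfm+pmvs' (assumed flag: sparse SfM followed by dense PMVS)
    over a folder of photos, writing an .nvm model file."""
    return [binary, "sfm+pmvs", image_dir, output_nvm]

def run_sfm(image_dir, output_nvm):
    # raises CalledProcessError if the reconstruction fails
    subprocess.run(build_sfm_command(image_dir, output_nvm), check=True)
```

A web request handler would call `run_sfm` on the uploaded photo folder and hand the resulting sparse model to the fitting stage.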
If using only one or two photos is mandatory and the user is expected to do the work, as in FaceWorx, then I would need a parametric head model (just like FaceWorx) and all the code would be written in JavaScript and, of course, WebGL.
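At its core, a FaceWorx-style parametric head is morph-target (blendshape) interpolation: each control adds a weighted per-vertex offset to a base mesh. Sketched here in Python for brevity, although in this variant the equivalent code would live in the JavaScript/WebGL client; the mesh data and target names are made up.

```python
def morph(base, deltas, weights):
    """Blend a base mesh with weighted morph targets.
    base: list of (x, y, z) vertices.
    deltas: dict mapping target name -> per-vertex (dx, dy, dz) offsets.
    weights: dict mapping target name -> blend weight."""
    out = []
    for i, (x, y, z) in enumerate(base):
        for name, w in weights.items():
            dx, dy, dz = deltas[name][i]
            # each active target shifts the vertex by its weighted offset
            x += w * dx
            y += w * dy
            z += w * dz
        out.append((x, y, z))
    return out

# two vertices, one hypothetical "wide" target at half strength
base = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
deltas = {"wide": [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]}
print(morph(base, deltas, {"wide": 0.5}))  # [(0.5, 0.0, 0.0), (1.0, 2.0, 1.0)]
```

In the app, the weights would be driven by the user's edits in the browser and the blended vertices uploaded to a WebGL vertex buffer each frame.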
In both cases, very frankly, $5000 is very little money for this kind of work.