STFT magnitude vs. power spectrum in a musical instrument recognition system?


I am a master's student doing an internship, and I am currently building a musical instrument recognition system. I have read several papers on the topic, and I am curious about one thing:

All the papers and journal articles I have read use the magnitude of the STFT, i.e. |X(t,f)| for a signal x(t), to extract several spectral features that serve as the input to the recognition system.

What are the reasons for using the magnitude |X(t,f)| instead of the power spectrum |X(t,f)|^2?
(Strictly speaking, a power spectral density is the expectation of |X(f)|^2, i.e. E[|X(f)|^2], up to a normalization factor.)
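
To make the comparison concrete, here is a minimal sketch of the two quantities I mean, assuming only NumPy. The function name, the frame length, the hop size, the Hann window and the 440 Hz test tone are my own illustrative choices, not taken from any of the papers.

```python
import numpy as np

def stft_magnitude_and_power(x, frame_len=1024, hop=512):
    """Return (|X(t,f)|, |X(t,f)|^2) for a 1-D signal x, one row per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)   # complex STFT, shape (n_frames, frame_len // 2 + 1)
    magnitude = np.abs(X)             # |X(t,f)|
    power = magnitude ** 2            # |X(t,f)|^2
    return magnitude, power

# Hypothetical test signal: a 440 Hz sine, one second at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

magnitude, power = stft_magnitude_and_power(x)

# Averaging |X(t,f)|^2 over frames gives a rough, unnormalized estimate
# of E[|X(f)|^2], in the spirit of Welch's method.
psd_estimate = power.mean(axis=0)
peak_bin = int(np.argmax(psd_estimate))
peak_hz = peak_bin * fs / 1024
```

Both arrays have the same shape and peak at the same bin; squaring only changes how strongly large bins dominate small ones, which is why I wonder whether the choice matters for the extracted features.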

Thanks in advance.


