
Re: STFT vs Power Spectral in Musical recognition system ?


A power spectral density is strictly defined only for stationary signals, and music is not stationary. The STFT generalizes the idea to short segments, provided you use its squared magnitude.
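To make the short-segment idea concrete, here is a minimal sketch in Python with NumPy. The function name `stft_power`, the Hann window, the frame length, the hop size, and the 1 kHz test tone are all assumptions chosen for illustration, not anything prescribed above.

```python
import numpy as np

def stft_power(x, frame_len=1024, hop=256):
    """Squared-magnitude STFT: one short-time power spectrum per frame.

    frame_len and hop are illustrative choices, not recommendations.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Cut the signal into overlapping frames and window each one.
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    X = np.fft.rfft(frames, axis=1)  # shape (n_frames, frame_len // 2 + 1)
    return np.abs(X) ** 2

# Hypothetical test signal: one second of a 1 kHz sine sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)

P = stft_power(x)
# Averaging over frames (as in Welch's method) gives a PSD estimate,
# which is only meaningful when the signal really is stationary.
peak_bin = int(P.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 1024
```

For a steady tone like this one, the frame average is a reasonable PSD estimate. For music you would normally keep the individual frames rather than average them.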

The absolute value, square, log, etc. differ only by point nonlinearities, which do not change the information content but do change the metric structure of the space somewhat. The log is too compressive, putting too much emphasis on near-silent segments, while the square (the power you ask about) is too expansive, putting too much emphasis on the louder parts. A good compromise is around the square root or cube root of the magnitude (roughly matching perceived magnitude via Stevens's power law), but the magnitude itself is also sometimes acceptable, depending on what you're doing.
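A small numerical sketch of the emphasis argument, in Python with NumPy. The two magnitudes (a loud bin and one 60 dB below it) and the "doubling" experiment are made-up illustrations. The idea is to see how far each representation moves when a bin's magnitude doubles, for a loud bin versus a near-silent one.

```python
import numpy as np

# Hypothetical magnitudes: a loud bin and a near-silent bin 60 dB below it.
loud, quiet = 1.0, 1e-3

# Point nonlinearities applied to the STFT magnitude.
maps = {
    "log": np.log,
    "cube root": np.cbrt,
    "square root": np.sqrt,
    "magnitude": lambda m: m,
    "power": np.square,
}

# Distance moved in each representation when the magnitude doubles,
# and the ratio of the loud-bin change to the quiet-bin change.
ratio = {}
for name, f in maps.items():
    d_loud = abs(f(2 * loud) - f(loud))
    d_quiet = abs(f(2 * quiet) - f(quiet))
    ratio[name] = d_loud / d_quiet
    print(f"{name:12s} loud: {d_loud:.3g}  quiet: {d_quiet:.3g}  "
          f"ratio: {ratio[name]:.3g}")
```

Under the log, the same relative change counts equally at any level (ratio 1), so near-silent bins weigh as much as loud ones in a distance measure. Under the power, the loud bin's change outweighs the quiet one's by a factor of a million for this 60 dB gap. The square root and cube root fall in between, at about 32 and 10.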


At 7:12 AM -0700 8/25/06, Edwin Sianturi wrote:


I am just a master's student, doing my internship. Right now, I am building a musical instrument recognition system. I have read several papers on it, and I am just curious:

All the papers and journal articles that I have read use the STFT magnitude, i.e. the |X(t,f)| of a signal x(t), to extract several (spectral) features used as the input to the recognition system.

What are the reasons for using |X(t,f)| instead of the "power spectrum" |X(t,f)|^2?
(Technically speaking, a power spectral density is the expectation of |X(f)|^2, i.e. E(|X(f)|^2).)

Thanks in advance,