Re: STFT vs Power Spectral in Musical recognition system ?

Arturo, I totally agree with the idea of using log(1 + KM), or log(M + epsilon) as I usually do it. This kind of nonlinearity is especially important in systems with an imprecisely known zero level or a variable noise floor. On the other hand, a power law, though it has an infinite slope at 0, is not half as bad as a plain log, and lots of people use it anyway. A stabilized power law, like (M + epsilon)^(1/3), is another good choice, probably more in line with perception than letting it go log-like at high magnitudes. With any of these, adjusting your parameters to accommodate a realistic range of input signal levels becomes important; you can no longer ignore scale factors and hope that algorithms will work on inputs varying over many orders of magnitude in scale.
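
For readers who want to try these, here is a minimal sketch of the three stabilized compressions mentioned above, assuming NumPy. The function names and the default values of K and epsilon are illustrative assumptions, not values from this thread; as noted, they need to be tuned to the actual range of input levels.

```python
import numpy as np

def log1p_compress(M, K=1000.0):
    """log(1 + K*M): slope K at M = 0, log-like once K*M >> 1. K is an assumed value."""
    return np.log1p(K * np.asarray(M, dtype=float))

def log_eps_compress(M, eps=1e-3):
    """log(M + eps): slope 1/eps at M = 0. eps is an assumed value."""
    return np.log(np.asarray(M, dtype=float) + eps)

def stabilized_power(M, eps=1e-3, exponent=1.0 / 3.0):
    """(M + eps)^exponent: slope exponent * eps^(exponent - 1) at M = 0."""
    return (np.asarray(M, dtype=float) + eps) ** exponent

def slope_at_zero(f, h=1e-9):
    """Forward-difference estimate of the slope of f at M = 0."""
    return float((f(h) - f(0.0)) / h)

# Magnitudes spanning several orders of magnitude, including exact silence.
M = np.array([0.0, 1e-4, 1e-2, 1.0])

for name, f in [("log(1+KM)", log1p_compress),
                ("log(M+eps)", log_eps_compress),
                ("(M+eps)^(1/3)", stabilized_power)]:
    print(name, f(M), "slope at 0 ~", slope_at_zero(f))

# Unstabilized square root for comparison: the estimate keeps growing as h shrinks.
print("sqrt slope estimates:", [slope_at_zero(np.sqrt, h) for h in (1e-3, 1e-6, 1e-9)])
```

With epsilon = 1/K the two log forms differ only by the constant log(epsilon), since log(M + epsilon) = log(epsilon) + log(1 + M/epsilon), so the choice between them is mostly a matter of where the zero of the output sits.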


At 6:32 PM -0400 8/31/06, Arturo Camacho wrote:
One problem of the square-root compression is that its slope
approaches infinity as the magnitude M approaches zero. A more
appropriate approach may be to use log(1+KM), where K is a constant to
be determined. The response of this function is almost logarithmic for
high magnitudes and almost linear for low magnitudes. Of course, the
determination of the optimal value for K given an input is not


 Arturo Camacho
 PhD Candidate
 Computer and Information Science and Engineering
 University of Florida

 E-mail: acamacho@xxxxxxxxxxxx
 Web page: www.cise.ufl.edu/~acamacho

On Fri, 25 Aug 2006, Richard F. Lyon wrote:


 A power spectral density is only defined for stationary signals, not
 music.  The STFT generalizes it to short segments, if you use the
 squared magnitude.

 The differences between the absolute value, square, log, etc. are just
 point nonlinearities that do not change the information content, but
 do change the metric structure of the space a bit.  Log is too
 compressed, leading to too much emphasis on near-silent segments,
 while the square (the power you ask about) is too expanded, leading
 to too much emphasis on the louder parts.  A good compromise is
 around a square root or cube root of magnitude (roughly matching
 perceptual magnitude via Stevens's law), but the magnitude itself is
 also sometimes acceptable, depending on what you're doing.
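
 As a rough illustration of the point above, the following sketch computes an STFT magnitude with plain NumPy and applies several point nonlinearities to it. The frame length, hop size, test signal, and the small floor added before the log are all assumptions chosen for the example; none of them come from this thread.

```python
import numpy as np

def stft_magnitude(x, frame_len=1024, hop=256):
    """Return |X(t, f)| as an array of shape (n_frames, frame_len // 2 + 1)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# Assumed test signal: a 440 Hz tone that drops by a factor of 100 halfway through.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) * np.where(t < 0.5, 1.0, 0.01)

mag = stft_magnitude(x)

# The same information under different point nonlinearities.
features = {
    "power": mag ** 2,               # expanded: loud frames dominate
    "magnitude": mag,
    "cube_root": np.cbrt(mag),       # compressive power law
    "log": np.log(mag + 1e-12),      # assumed tiny floor only to avoid log(0)
}

# Ratio of the spectral peak in a loud frame to that in a quiet frame.
for name in ("power", "magnitude", "cube_root"):
    f = features[name]
    print(name, "loud/quiet peak ratio:", f[0].max() / f[-1].max())
```

 The 100:1 level change becomes about 10^4:1 in power but only about 4.6:1 after the cube root, which is one way to see how the choice of nonlinearity reshapes distances between frames without changing what information is present.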


 At 7:12 AM -0700 8/25/06, Edwin Sianturi wrote:
 >I am just a master's student, doing my internship. Right now, I am
 >building a musical instrument recognition system. I have read
 >several papers on it, and I am just curious:
 >All the papers/journals that I have read use the STFT, a.k.a. the
 >|X(t,f)| of a signal x(t), in order to extract several (spectral)
 >features to be used as the input to the recognition system.
 >What are the reasons behind using the |X(t,f)| instead of using the
 >"power spectral" |X(t,f)|^2 ?
 >(technically speaking, a power spectral density is the expectation
 >of |X(f)|^2, i.e. E(|X(f)|^2) )
 >Thanks in advance,