# Re: STFT vs Power Spectral in Musical recognition system ?

Arturo, I totally agree with the idea of using log(1 + KM), or log(M + epsilon) as I usually do. This kind of nonlinearity is especially important in systems with an imprecisely known zero level or a variable noise floor. On the other hand, a power law, though it has an infinite slope at 0, is not half as bad as a plain log, and lots of people use it anyway. A stabilized power law, like (M + epsilon)^(1/3), is another good choice, probably more in line with perception than letting it go log-like at high magnitudes. With any of these, adjusting your parameters to accommodate a realistic range of input signal levels becomes important; you can no longer ignore scale factors and hope for algorithms to work fine on inputs varying over many orders of magnitude in scale.
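To make the scale-factor point concrete, here is a minimal Python sketch of the stabilized log and stabilized cube root. The value epsilon = 1e-3, the two test magnitudes, and the helper name `spread` are assumptions chosen only for illustration, not recommended settings.

```python
import math

def plain_log(m):
    """log(M): unbounded below as M -> 0 (and undefined at M = 0)."""
    return math.log(m)

def stabilized_log(m, eps=1e-3):
    """log(M + epsilon): bottoms out at log(epsilon) when M = 0."""
    return math.log(m + eps)

def stabilized_cube_root(m, eps=1e-3):
    """(M + epsilon)^(1/3): a power law with a finite slope at M = 0."""
    return (m + eps) ** (1.0 / 3.0)

def spread(f, gain, quiet=1e-4, loud=1.0):
    """Compressed-domain gap between a quiet and a loud magnitude,
    after both have been scaled by the same (hypothetical) input gain."""
    return f(gain * loud) - f(gain * quiet)

# Compare the quiet-to-loud gap at two input gains.
for gain in (1.0, 1000.0):
    print(
        f"gain={gain:g}  "
        f"log: {spread(plain_log, gain):.3f}  "
        f"log(M+eps): {spread(stabilized_log, gain):.3f}  "
        f"(M+eps)^(1/3): {spread(stabilized_cube_root, gain):.3f}"
    )
```

With a plain log the gap is the same at any gain, because a gain only adds a constant. With the stabilized versions the gap depends on where the signal sits relative to epsilon, which is why epsilon has to be set with the expected input levels in mind.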

`Dick`

At 6:32 PM -0400 8/31/06, Arturo Camacho wrote:
```One problem of the square-root compression is that its slope
approaches infinity as the magnitude M approaches zero. A more
appropriate approach may be to use log(1+KM), where K is a constant to
be determined. The response of this function is almost logarithmic for
high magnitudes and almost linear for low magnitudes. Of course, the
determination of the optimal value for K given an input is not
trivial.```
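The slope argument can be checked with a short Python sketch using the analytic derivatives of sqrt(M) and log(1 + KM). The value K = 100 and the sample magnitudes are arbitrary assumptions for illustration.

```python
import math

def sqrt_slope(m):
    """d/dM sqrt(M) = 1 / (2 sqrt(M)); grows without bound as M -> 0."""
    return 1.0 / (2.0 * math.sqrt(m))

def log1p_slope(m, k):
    """d/dM log(1 + K*M) = K / (1 + K*M); tends to the finite value K as M -> 0."""
    return k / (1.0 + k * m)

K = 100.0  # hypothetical constant, not a tuned value

for m in (1e-8, 1e-4, 1.0, 1e4):
    print(
        f"M={m:g}  sqrt slope={sqrt_slope(m):.4g}  "
        f"log(1+KM) slope={log1p_slope(m, K):.4g}  "
        f"log(1+KM)={math.log1p(K * m):.4g}"
    )
```

For K*M much smaller than 1 the function is close to K*M (linear), and for K*M much larger than 1 it is close to log(K*M) (logarithmic). Since log(1 + KM) = log(K) + log(M + 1/K), it is the same curve as log(M + epsilon) with epsilon = 1/K, shifted by a constant.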

```Arturo
--
__________________________________________________```

``` Arturo Camacho
PhD Candidate
Computer and Information Science and Engineering
University of Florida```

``` E-mail: acamacho@xxxxxxxxxxxx
Web page: www.cise.ufl.edu/~acamacho
__________________________________________________```

`On Fri, 25 Aug 2006, Richard F. Lyon wrote:`

` Edwin,`

``` A power spectral density is only defined for stationary signals, not
music.  The STFT generalizes it to short segments, if you use the
squared magnitude.```
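As a rough sketch of the squared-magnitude STFT, the following pure-Python example frames a signal, applies a window, and computes |X|^2 per frame with a naive DFT. The Hann window, the frame length of 64, the hop of 32, and the two-tone test signal are all assumptions made for illustration; a real system would use an FFT library.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft_power(frame):
    """Squared magnitude |X[k]|^2 for bins 0..N/2, via a naive O(N^2) DFT."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(acc) ** 2)
    return power

def short_time_power(x, frame_len=64, hop=32):
    """One squared-magnitude spectrum per windowed, overlapping frame."""
    w = hann(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        segment = [x[start + i] * w[i] for i in range(frame_len)]
        frames.append(dft_power(segment))
    return frames

# Non-stationary test signal: the tone jumps from bin 8 to bin 16 halfway through.
x = [
    math.sin(2 * math.pi * (8 if t < 256 else 16) * t / 64)
    for t in range(512)
]

spec = short_time_power(x)
peak_first = max(range(len(spec[0])), key=spec[0].__getitem__)
peak_last = max(range(len(spec[-1])), key=spec[-1].__getitem__)
print(len(spec), "frames; peak bin moves from", peak_first, "to", peak_last)
```

A single long-term power spectrum of this signal would show both tones at once; the per-frame version keeps track of when each one occurs.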

``` The absolute value, square, log, etc. differ only by point
nonlinearities that do not change the information content, but
do change the metric structure of the space a bit.  Log is too
compressed, leading to too much emphasis on near-silent segments,
while the square is too expanded, leading to too much emphasis on
the louder parts.  A good compromise is
around a square root or cube root of magnitude (roughly matching
perceptual magnitude via Stevens's law), but the magnitude itself is
also sometimes acceptable, depending on what you're doing.```
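The effect on the metric structure can be sketched numerically in Python by comparing how far apart two near-silent magnitudes and two loud magnitudes end up under each nonlinearity. The magnitude pairs below are made-up values chosen for illustration only.

```python
import math

# Point nonlinearities applied to a magnitude M > 0.
nonlinearities = {
    "log": math.log,
    "cube root": lambda m: m ** (1.0 / 3.0),
    "square root": math.sqrt,
    "magnitude": lambda m: m,
    "square": lambda m: m * m,
}

quiet_pair = (1e-5, 1e-4)  # two near-silent magnitudes (hypothetical)
loud_pair = (1.0, 2.0)     # two loud magnitudes (hypothetical)

def distance(f, pair):
    """Distance between the two magnitudes after applying nonlinearity f."""
    a, b = pair
    return abs(f(b) - f(a))

# Ratio > 1 means the loud pair dominates the metric; < 1 means the quiet pair does.
ratios = {
    name: distance(f, loud_pair) / distance(f, quiet_pair)
    for name, f in nonlinearities.items()
}

for name, ratio in ratios.items():
    print(f"{name:12s} loud/quiet distance ratio = {ratio:.3g}")
```

Under the log the near-silent pair ends up farther apart than the loud pair, while under the square the near-silent pair contributes almost nothing; the square root and cube root fall between these extremes.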

` Dick`

``` At 7:12 AM -0700 8/25/06, Edwin Sianturi wrote:
>
>Hello,
>
>I am just a master's student, doing my internship. Right now, I am
>building a musical instrument recognition system. I have read
>several papers on it, and I am just curious:
>
>All the papers/journals that I have read use the STFT, a.k.a. the
>|X(t,f)| of a signal x(t), in order to extract several (spectral)
>features to be used as the input to the recognition system.
>
>What are the reasons behind using the |X(t,f)| instead of using the
>"power spectral" |X(t,f)|^2 ?
>(technically speaking, a power spectral density is the expectation
>of |X(f)|^2, i.e. E(|X(f)|^2) )
>