I'm not completely clear on what you might be looking for, but would these examples be pointing in the right direction:
My reading of your question[s] is that it is a study of what I call the first principle of ASA: segmentation, this being the multi-layered application of the concept of points of articulation. Schoenberg's concept of Klangfarben is among the first European formal articulations of spectrum as having possible structural significance. As Schoenberg had little access to sounds that could transform continuously in the spectral domain, independently of frequency and amplitude, his examples are restricted.
[My hearing says that the voice / spoken language meets many of your criteria, the voice [in most Western European languages] being segmented and 'decoded' almost completely in terms of spectrum.]
Recently, while introducing the vocoder in a class, I recalled that the raw information extracted by a [traditional] vocoder is a set of amplitude values, one per frequency band -- in some senses [and applications], a 'spectral extractor'. [My preference here is for spectral, as I use this term as a metric; timbre I use as a psychometric.]
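[For anyone who wants to see the 'spectral extractor' idea as numbers, here is a minimal sketch of the analysis half of a vocoder. It is an illustration under assumptions, not how any particular vocoder is built: a traditional vocoder uses a bank of band-pass filters with envelope followers, whereas this sketch swaps in windowed FFT frames and groups the bins into bands. The function name band_envelopes, the band edges, the frame size and the 1 kHz test tone are all hypothetical choices of mine.]

```python
import numpy as np


def band_envelopes(signal, sample_rate, band_edges_hz, frame_size=1024, hop=512):
    """Return per-frame, per-band amplitude estimates, shape (n_frames, n_bands).

    Sketch only: FFT bins are grouped into bands, standing in for the
    band-pass filters and envelope followers of a traditional vocoder.
    """
    window = np.hanning(frame_size)
    freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    n_frames = 1 + (len(signal) - frame_size) // hop
    n_bands = len(band_edges_hz) - 1
    env = np.zeros((n_frames, n_bands))
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_size] * window
        magnitude = np.abs(np.fft.rfft(frame))
        for b in range(n_bands):
            in_band = (freqs >= band_edges_hz[b]) & (freqs < band_edges_hz[b + 1])
            if in_band.any():
                # RMS of the bin magnitudes as a rough band amplitude
                env[i, b] = np.sqrt(np.mean(magnitude[in_band] ** 2))
    return env


# Hypothetical test signal: one second of a 1 kHz sine at 16 kHz sampling.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

# Hypothetical band edges in Hz (five bands).
edges = [100, 300, 600, 1200, 2400, 4800]
env = band_envelopes(tone, sample_rate, edges)

# Index of the band carrying the most amplitude, averaged over time.
strongest_band = int(np.argmax(env.mean(axis=0)))
print(env.shape, strongest_band)
```

[Each row of env is one moment in time and each column one band, so the array is a coarse picture of the spectrum changing over time, with pitch detail inside each band discarded. For the test tone, the strongest column should be the 600 to 1200 Hz band.]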
And the famous 'frequency to rhythm' section of Kontakte:
In covering these concepts in classes I refer back to the concept of a sonic / musical identity. The Beethoven Fifth Symphony [in its musical essence / identity] remains consistent across the application of different spectral / timbral sources.
And as a last possibility for a universal understanding:
On 2013, Feb 8, ... rhythm perception