A simple ``grammar'' is proposed to describe how ``auditory subevents,'' i.e., onsets, terminations, fillings, and silences, are connected perceptually. A rapid increase of sound intensity in a certain frequency range serves as a perceptual cue to an onset, whereas a rapid decrease serves as a cue to a termination. A thick distribution of sound energy for a certain duration in the time--frequency plane indicates a filling (or fillings), and a thin distribution of energy for a certain duration after a thicker part indicates a silence (or silences). The auditory system seems to interpret these cues according to the grammar, which allows three patterns of ``auditory events'': (1) an onset followed by a silence, (2) an onset and a filling followed by another onset, and (3) an onset, a filling, and a termination followed by a silence. Clearer cues and contexts facilitate the formation of auditory events. The proximity principle operates when onsets and terminations are organized into an ``auditory stream,'' which is defined as a chain of auditory events and silences. Silences also obey the grammar. To avoid any violation of the grammar, the auditory system sometimes neglects or reuses cues to auditory subevents, or adds subevents.
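
The three permitted patterns can be illustrated with a small checker. The following is only a minimal sketch under stated assumptions, not the model itself: the one-character symbols for the subevents, the function name parse_stream, and the rule that a silence must be followed by an onset or end the stream are all introduced here for illustration. The abstract says only that silences obey the grammar, without stating the rule.

```python
# Hypothetical one-character symbols for the four auditory subevents.
ONSET, FILLING, TERMINATION, SILENCE = "<", "=", ">", "/"


def parse_stream(subevents):
    """Split a string of subevents into auditory events and silences.

    Returns a list of units, or None if the sequence violates the grammar.
    """
    units = []
    i = 0
    n = len(subevents)
    while i < n:
        if subevents[i] == SILENCE:
            # Assumption: a silence is followed by an onset or ends the stream.
            if i + 1 < n and subevents[i + 1] != ONSET:
                return None
            units.append(SILENCE)
            i += 1
            continue
        if subevents[i] != ONSET:
            return None  # every auditory event begins with an onset
        full = ONSET + FILLING + TERMINATION
        open_ended = ONSET + FILLING
        if subevents.startswith(full, i) and subevents[i + 3:i + 4] == SILENCE:
            # Pattern (3): onset, filling, termination, followed by a silence.
            units.append(full)
            i += 3
        elif subevents.startswith(open_ended, i) and subevents[i + 2:i + 3] == ONSET:
            # Pattern (2): onset and filling, followed by another onset.
            units.append(open_ended)
            i += 2
        elif subevents[i + 1:i + 2] == SILENCE:
            # Pattern (1): a bare onset followed by a silence.
            units.append(ONSET)
            i += 1
        else:
            return None
    return units
```

In this sketch a sequence such as an onset, a filling, and a termination with no following silence is rejected. That is the kind of violation the auditory system is said to avoid by neglecting, reusing, or adding subevents; the sketch does not attempt to model that repair step.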