I had no idea about the answer. But I liked the question - the fundamental curiosity behind it!
So I tried searching and found this thread, which explains it far more simply than the verbose expert in the video above. It seems that the waves of the various simultaneous sounds/frequencies actually add up into one composite wave, which is what the speaker produces, and our ears and brain can work out the component waves again just by listening to that one composite wave. Now, isn't that fascinating!
This is my limited (and possibly erroneous) understanding from the thread. I'd like to improve it based on other responses.
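To make the "add up, then decipher back" idea concrete, here is a minimal sketch in Python. It is not how the ear actually works, just an illustration that the components remain recoverable from one summed waveform. The sample rate, the window length, and the names `composite` and `dft_magnitude` are my own assumptions for the example; the 1000Hz and 1100Hz tones are borrowed from the answer in this thread.

```python
import cmath
import math

SAMPLE_RATE = 8000   # samples per second (assumed for this sketch)
N = 400              # 0.05 s of signal, which gives 20 Hz per DFT bin

def composite(t):
    # One waveform: the sum of a 1000 Hz tone and an 1100 Hz tone.
    return math.sin(2 * math.pi * 1000 * t) + math.sin(2 * math.pi * 1100 * t)

samples = [composite(n / SAMPLE_RATE) for n in range(N)]

def dft_magnitude(x, k):
    # Magnitude of the k-th bin of a plain (slow) discrete Fourier transform.
    n_total = len(x)
    acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_total)
              for n in range(n_total))
    return abs(acc)

bin_width = SAMPLE_RATE / N
magnitudes = [dft_magnitude(samples, k) for k in range(N // 2)]

# The two strongest bins sit at the two component frequencies.
strongest = sorted(range(N // 2), key=lambda k: magnitudes[k], reverse=True)[:2]
recovered_hz = sorted(k * bin_width for k in strongest)
print(recovered_hz)  # [1000.0, 1100.0]
```

The speaker only ever sees the single list `samples`, yet a frequency analysis of that list still picks out both tones.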
I will try to answer this question. The two tones you mentioned, 1001Hz and 1002Hz, are very close to each other, and when you combine them, the resulting signal looks almost exactly like the original tones over a short time window. Practically, I don't know whether there is any system that can resolve the two tones from this signal.
In the above plots, you can see that the signal in the third plot is nearly identical in shape to the first and second tones, except that its amplitude is almost doubled. However, if the tones are farther apart, like 1000Hz and 1100Hz, then the result looks interesting.
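As a rough numerical check of the "almost doubled" observation, here is a small Python sketch. It assumes two ideal sine tones of amplitude 1 that start in phase; the helper names `two_tone` and `peak` and the sampling rate are made up for the example. It also looks at the signal around 0.5s, because two tones 1Hz apart drift in and out of step once per second, which a plot of a few milliseconds cannot show.

```python
import math

def two_tone(t, f1, f2):
    # Sum of two unit-amplitude sine tones at time t (in seconds).
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def peak(f1, f2, start, stop, rate=100_000):
    # Largest |value| of the summed signal in [start, stop), sampled at `rate` Hz.
    n0, n1 = int(start * rate), int(stop * rate)
    return max(abs(two_tone(n / rate, f1, f2)) for n in range(n0, n1))

# First 10 ms: the tones are still in step, so the peak is close to 2.
short_window_peak = peak(1001, 1002, 0.0, 0.01)

# Time between successive loud spots: 1 / (difference in frequency).
beat_period_s = 1 / (1002 - 1001)

# Around t = 0.5 s the tones are out of step and nearly cancel.
near_null_peak = peak(1001, 1002, 0.495, 0.505)

print(short_window_peak, beat_period_s, near_null_peak)
```

So over a short window the sum does look like one tone of nearly double amplitude, and the difference between the tones only shows up as a very slow swell and fade.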
In this case, from the 3rd plot, you can see that sometimes the tones reinforce each other, almost doubling the amplitude, and sometimes they cancel each other, resulting in a brief period of near silence. All this happens so quickly that you may not be able to distinguish (hear) each of these phenomena separately.
To answer your question about how the speaker is able to reproduce two tones, assume that the y-axis of the third plot is also the displacement of the cone, i.e., 0V corresponds to a stationary cone, 2V corresponds to the cone pushed forward by 2cm, -2V corresponds to the cone pushed backward by 2cm, and so on. I am oversimplifying here, and a real speaker will never be as linear as the way I put it. When you feed the two-tone signal to the speaker, the cone starts moving. As the voltage rises from 0V to 2V, the cone gets pushed forward by 2cm. When the voltage falls to 0V and then to -2V, the cone gets pushed backward by 2cm. The cone experiences displacements of up to +/-2cm from the resting position during the period [0s, 0.002s]. During [0.004s, 0.006s], the cone moves much less. If you note the times carefully, you can see that the periods of max and min displacement each last only on the order of 2ms, i.e., they happen very quickly. If you look at a longer time window, you can see that these periods repeat.
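The cone picture can be sketched in a few lines of Python. This uses the same simplification as the description above (a perfectly linear cone, 1V moving it 1cm) and assumes the two tones are unit-amplitude sines that start in phase. The constant `CM_PER_VOLT` and the function names are hypothetical and exist only for this illustration.

```python
import math

CM_PER_VOLT = 1.0  # assumed linear mapping: 2 V moves the cone 2 cm

def drive_voltage(t, f1=1000.0, f2=1100.0):
    # Two-tone signal fed to the speaker, in volts.
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def cone_displacement_cm(t):
    # Idealised cone position: it simply follows the voltage.
    return CM_PER_VOLT * drive_voltage(t)

def envelope_cm(t, f1=1000.0, f2=1100.0):
    # sin(a) + sin(b) = 2 sin((a+b)/2) cos((a-b)/2);
    # the slow cosine factor limits how far the cone can swing at time t.
    return CM_PER_VOLT * 2 * abs(math.cos(math.pi * (f2 - f1) * t))

def peak_displacement_cm(start, stop, rate=200_000):
    # Largest excursion from rest within [start, stop], sampled at `rate` Hz.
    n0, n1 = round(start * rate), round(stop * rate)
    return max(abs(cone_displacement_cm(n / rate)) for n in range(n0, n1 + 1))

loud = peak_displacement_cm(0.0, 0.002)     # close to 2 cm
quiet = peak_displacement_cm(0.004, 0.006)  # well under 1 cm
repeat_period_s = 1 / (1100.0 - 1000.0)     # loud/quiet pattern repeats every 10 ms

print(loud, quiet, repeat_period_s)
```

The cone never does two things at once. It traces this single curve, and the loud and quiet stretches fall out of the one summed signal.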
As the cone moves, it causes compression and rarefaction of the air, which produces sound. This particular pattern of movement is what causes us to hear two tones. Hope this helps!