Digital audio fundamental question

Prem has answered the question in the best possible manner. Mathematically, any band-limited signal can be represented as a sum of a series of sinc functions weighted by a set of coefficients (the sample values). To reconstruct the analog signal from these coefficients, one needs to be able to generate a perfect impulse, known as a Dirac delta function. However, a perfect impulse is impossible to generate. Most DACs approximate it with a stepped, square-wave-like output, which leaves images of the original signal above half the sampling rate at the output. So this output must be filtered again to remove these spurious components. To do this, DACs have a brick-wall filter at the output that removes content above 20 kHz, or half the sampling rate.
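
For anyone curious what that sinc-based reconstruction looks like in practice, here is a minimal sketch (my own illustration, assuming NumPy; the stored sample values act as the coefficients and each one scales a shifted sinc):

Code:
import numpy as np

def sinc_reconstruct(samples, fs, t):
    # Whittaker-Shannon interpolation: rebuild a band-limited signal
    # from its samples by summing shifted, scaled sinc functions
    n = np.arange(len(samples))
    return np.array([np.sum(samples * np.sinc(fs * ti - n)) for ti in t])

fs = 44100                                 # sampling rate in Hz
ts = np.arange(0, 0.001, 1 / fs)           # 1 ms worth of sample instants
samples = np.sin(2 * np.pi * 1000 * ts)    # a 1 kHz tone, well below fs/2
t_fine = np.linspace(0, 0.001, 2000)       # dense grid standing in for "analog" time
analog = sinc_reconstruct(samples, fs, t_fine)

The reconstructed curve passes exactly through the stored samples and, for any tone below half the sampling rate, matches the original (apart from small edge effects from the finite window used here).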

However, these filters actually have a roll-off and are not the ideal theoretical brick wall, as it is impossible to engineer one. Any filter rings or causes distortion. A traditional (linear-phase) filter actually causes something called pre-ringing and post-ringing. Pre-ringing happens before a transient and post-ringing after it. Think of post-ringing as the sound of a bell: after the bell rings, it smoothly decays and then stops. Pre-ringing, however, is an unnatural phenomenon - a sort of effect before cause. To combat this, Meridian and a few others designed what are called minimum-phase or apodizing filters. These filters avoid pre-ringing at the cost of slightly increased post-ringing, and sound more natural. Pre-ringing is what people associate with the bright, annoying "digital" sound.
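
A quick way to see the difference - a sketch only, not any particular DAC's filter - is to design a linear-phase low-pass in SciPy and derive a minimum-phase version from it (the 255-tap length and 20 kHz cutoff are just assumptions for illustration):

Code:
import numpy as np
from scipy import signal

fs = 44100
# linear-phase FIR low-pass near the audible limit; its taps are symmetric,
# so half of the ringing sits *before* the main impulse (pre-ringing)
linear_phase = signal.firwin(numtaps=255, cutoff=20000, fs=fs)
# a minimum-phase version with the same magnitude response: all the ringing
# is pushed after the impulse, at the price of a longer decay
min_phase = signal.minimum_phase(linear_phase)

print(np.argmax(np.abs(linear_phase)))  # peak sits in the middle of the response
print(np.argmax(np.abs(min_phase)))     # peak sits right near the start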

In addition to all this, there is the sound of the DAC itself, based on its architecture - multi-bit, delta-sigma, or hybrid. This profoundly affects the quality of the output.
 
I read the thread. Where it does, it only asserts what the theorem says, as you have here. Isn't there someone who can explain this theorem without formulae, to non-maths people? It isn't anything to do with human physiological restrictions, AFAIK, within the specified frequencies. The 44.1 part is selected based on human physiology, but that is just an application of the principles proved, not the theorem itself.

There isn't really anything more to assert than what the theorem says:)

The Shannon-Nyquist theorem basically says that a signal can be reconstructed completely just by sampling it at twice its highest frequency. I don't know how to put that across without words like "sampling", "twice", "highest frequency" :lol:. I guess you have to bear with a bit of maths and physics concepts here (shorn of the formulae). An analogy would be: if we have a sufficient number of blind men to touch and feel the proverbial elephant, their combined description of the elephant will be a complete description of said mammal, and not just the fractured, partial and incomplete descriptions of individual blind men.

The 44.1 kHz sampling frequency is born out of human physiology. The upper limit of average human hearing is 20,000 Hz. With a buffer of 2,050 Hz added to that, the sampling rate becomes (20,000 + 2,050) x 2 = 44,100 Hz. It is an application of the theorem.
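
To see why content has to stay below half that rate, here is a small NumPy check (my own illustration, not from the thread): a 25 kHz tone sampled at 44.1 kHz produces exactly the same samples as a tone folded back to 19.1 kHz, which is why the recorder must band-limit the input before sampling.

Code:
import numpy as np

fs = 44100                     # CD sampling rate: (20000 + 2050) * 2
n = np.arange(64)              # a handful of sample instants
t = n / fs

# 25 kHz is above fs/2 = 22.05 kHz, so it cannot be represented;
# its samples are indistinguishable from a tone at fs - 25000 = 19100 Hz
above_nyquist = np.sin(2 * np.pi * 25000 * t)
folded        = np.sin(2 * np.pi * (fs - 25000) * t)
print(np.allclose(above_nyquist, -folded))   # True: same samples apart from sign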
 
Another stumbling block is the idea that Nyquist theory might not actually explain everything that is happening in digital music. The answer to this is that Nyquist theory is not an explanation; it is the reason the whole thing works.

The theory doesn't exist because of the music; the music exists because of the theory

Someone put this far better than I can. Might have been Monty, see Xiph links above.

There has been some thought / discussion devoted to the fact that Nyquist works great with sine waves. But it may not work as well with dynamic signals (music), and may not account for the rise-time (amplitude) changes in a dynamic signal or the artifacts of mixing waves of different frequencies (again, music). But with the state of digital today and ongoing improvements, it may be good enough.

Digital is still not bettering my mid-level analogue rig (VPI, Souther arm, Shelter 901) in the midrange, with the DACs that I've listened to. But it is getting better year by year, and I look forward to the day when it is better than the analogue rig I can afford....
 
Thanks, guys, this has been a very fascinating and most useful read. In particular what worked for me is the air vibrations explanation for different frequencies and sound level strengths.
If I may restate to see if I got it right:
Sound is nothing but vibrations in the air. The sound vibrations have a width and height characteristic - to use those terms for the time and sound levels involved in the vibrations.
What is said to be an analog wave form is nothing but joining the dots of the two points of the width extremes and the two points of the height extremes to get a wave form representation. There is actually no wave - or maybe there is and perhaps this is also the good old wave particle duality thing in light as it applies to sound. But it doesn't matter for this purpose.
So, if one were to know these four points, joining the dots would produce exactly the same wave form representation of the vibration. Exactly the same waveform, naturally, because one is just joining the same dots.
If I have understood it right then, capturing enough information of these points for a 20 kHz frequency sound means that one has also captured it for every frequency below that, all the way down to 20 Hz.
I understand the parts about the engineering challenges to translate this reality into working digital audio systems, but my question was to the fundamental concept.
Again, much appreciated:clapping:.
 
I'm avoiding any discussion of which is best. We have and will continue to have, lots of those :)
rsud said:
There has been some thought / discussion devoted to the fact that Nyquist works great with sine waves.
There is also some thought that all music is sine waves. OK... lots of them, added together, but still, sine waves.
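
To make that concrete, here is a tiny sketch of my own (assuming NumPy, with 440 Hz picked arbitrarily): add a handful of sine waves together and the result no longer looks like a sine at all, yet it contains nothing else.

Code:
import numpy as np

fs = 44100
t = np.arange(fs) / fs                     # one second of time
fundamental = 440                          # Hz, an arbitrary note for illustration
# the fundamental plus a few odd harmonics at decreasing levels:
wave = sum(np.sin(2 * np.pi * fundamental * k * t) / k for k in (1, 3, 5, 7, 9))
# 'wave' now has a squarish shape, but it is literally just five sines added
# together - which is the sense in which all music is "just" sine waves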

One of the mysteries of digital sound to me is that it reproduces music, not just tones. Tones would be easy to grasp: all one would need (as per the earlier basic explanation) is a measure of frequency and a measure of amplitude; encode it, give it to the DAC, which then outputs a beep at the right note and volume.

How is it, though, that one set of numeric values sounds like a flute, another like a violin, and yet more sound like the hundred instruments of an orchestra? Can anyone look at sample values, recognise them, and hum the tune?

but back to earth...

But it may not work as well with dynamic signals (music) and may not account for the rise time (amplitude) changes in a dynamic signal

I was trying to get my head around this only the other day. The internet abounds with people who are prepared to say that the Nyquist thing either doesn't work, or can be beaten by something trivial. Thankfully, it also has those who give answers.

What about an amazingly short musical event that occurs between two sample points? This one still bugs me, but, from what I got from some of the explanations given... if it happens between two sample points, then its frequency content is higher than the Nyquist frequency, and no, it would not be recorded. But, given 44.1, that means it would be above 22 kHz and would not have been heard anyway. The system is, and has to be, bandwidth-limited. Don't pick on "44.1" (just in case anyone felt like it ;)) but just apply the same to the sampling rate and bandwidth of your choice.
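
One way to convince yourself of this (a rough sketch using SciPy, with an arbitrary 8th-order filter standing in for the system's band limit): band-limit a single-sample "click" to 20 kHz and it is no longer a single instant at all, so nothing audible can hide between two sample points.

Code:
import numpy as np
from scipy import signal

fs = 44100
click = np.zeros(1024)
click[512] = 1.0                              # an "infinitely short" event

# band-limit it to the audible range, as any 44.1 kHz system must
sos = signal.butter(8, 20000, fs=fs, output='sos')
band_limited = signal.sosfilt(sos, click)

# after band-limiting, the event's energy is smeared across neighbouring
# sample periods instead of sitting "between" two samples
spread = np.flatnonzero(np.abs(band_limited) > 0.01 * np.abs(band_limited).max())
print(spread.min(), spread.max())             # the click now spans several samples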
 
Well, here is my understanding of the Nyquist theorem and aliasing. What Nyquist says is that the usable frequency range extends up to half the sampling frequency.

Manoj, sorry, that is quite wrong - and your visualization is in fact the reason why people get confused all the time.

A very simple way to think about this is - our ears perceive sound because air molecules hit our eardrums - the way rain falls on a tin roof and makes a sound.

We perceive a 20 kHz audio note because air molecules are hitting our eardrums 20,000 times a second. One entire sine wave you depicted in your diagram is one "drum beat" - and this drum beat is occurring 20,000 times a second.

So, when your diagram shows sampling of pieces inside this single wave, it makes no sense and is thus a highly misleading visualization.

And the reason for this common misperception is that a sound wave is really a compression wave - and our eardrums only "feel" the sound because physical air molecules hit the eardrum. (Which is why no sound can be heard in a perfect vacuum.)

Please note once again - sound is a discrete phenomenon - there is nothing continuous about it. It is discrete the same way a drum beat is - if you sample 100 times between two consecutive drum beats, you are not achieving anything!

And the maximum drumbeat (frequency) our eardrum can perceive is when air molecules hit our eardrums 20,000 times a second.

So, it is sufficient for us to *perfectly* record audio (up to 20Khz) in a digital format, as long as we store these frequencies (drum beats) as fast as 40,000 times a second. Not only is our information store 100% accurate for this frequency range, increasing the rate at which we store this information does absolutely nothing to the quality of our recording (which is already perfect).

At best, you end up storing some ultrasonic frequencies.
 
Thanks, guys, this has been a very fascinating and most useful read. In particular what worked for me is the air vibrations explanation for different frequencies and sound level strengths.
If I may restate to see if I got it right:
Sound is nothing but vibrations in the air. The sound vibrations have a width and height characteristic - to use those terms for the time and sound levels involved in the vibrations.
What is said to be an analog wave form is nothing but joining the dots of the two points of the width extremes and the two points of the height extremes to get a wave form representation. There is actually no wave - or maybe there is and perhaps this is also the good old wave particle duality thing in light as it applies to sound. But it doesn't matter for this purpose.
So, if one were to know these four points, joining the dots would produce exactly the same wave form representation of the vibration. Exactly the same waveform, naturally, because one is just joining the same dots.
If I have understood it right then, capturing enough information of these points for a 20 kHz frequency sound means that one has also captured it for every frequency below that, all the way down to 20 Hz.
I understand the parts about the engineering challenges to translate this reality into working digital audio systems, but my question was to the fundamental concept.
Again, much appreciated:clapping:.

Yes, sir. Spot on.

Only one thing to add- it is even easier to think of this as the drumbeat analogy as I posted in another comment.

And regardless of how the wave propagates in the medium (air), we only perceive it because air molecules physically push our eardrum. This is firmly because of the particle nature of matter, not the wave nature of matter.

If our ears were built to work with the wave nature of matter, then we would be able to "hear" electromagnetic radiation - which can travel in vacuum. We would theoretically be able to "hear" light, X-rays, radio waves, etc. (Admittedly, there is a whole different subject about how photons can also behave like particles, but that is too complex for me, and in any case, I can still say that our audio sense only works if particles hit our eardrums -and that happens in a discrete way, and our ears can only detect bombardments of up to 20,000 a second).

There is no wave (as far as we are concerned).
 
Thanks, guys, this has been a very fascinating and most useful read. In particular what worked for me is the air vibrations explanation for different frequencies and sound level strengths.
If I may restate to see if I got it right:
Sound is nothing but vibrations in the air. The sound vibrations have a width and height characteristic - to use those terms for the time and sound levels involved in the vibrations.
What is said to be an analog wave form is nothing but joining the dots of the two points of the width extremes and the two points of the height extremes to get a wave form representation. There is actually no wave - or maybe there is and perhaps this is also the good old wave particle duality thing in light as it applies to sound. But it doesn't matter for this purpose.
So, if one were to know these four points, joining the dots would produce exactly the same wave form representation of the vibration. Exactly the same waveform, naturally, because one is just joining the same dots.
If I have understood it right then, capturing enough information of these points for a 20 kHz frequency sound means that one has also captured it for every frequency below that, all the way down to 20 Hz.
I understand the parts about the engineering challenges to translate this reality into working digital audio systems, but my question was to the fundamental concept.
Again, much appreciated:clapping:.

I think you are mixing up several things that are different questions:

1. How would sound in the air look if we could see it?

2. What does that sound look like as an analogue electrical signal?

3. The conversion of that analogue signal to digital.

asliarun... Can you help out here?

manoj.p I think that, as well as some of the same confusions, you are also trying to connect your points with a pencil, whereas the technology does it with a flexible strip.
 
And regardless of how the wave propagates in the medium (air), we only perceive it because air molecules physically push our eardrum. This is firmly because of the particle nature of matter, not the wave nature of matter.

If our ears were built to work with the wave nature of matter, then we would be able to "hear" electromagnetic radiation - which can travel in vacuum. We would theoretically be able to "hear" light, X-rays, radio waves, etc. (Admittedly, there is a whole different subject about how photons can also behave like particles, but that is too complex for me.
Digressing a little - I would say that eyes are also matter, and therefore respond to the particle nature of light? It is just that they are built to respond to the bombardment of photon particles, representing different kinds of vibrations? So while they can see light, they cannot see all light, such as infrared or ultraviolet?
Of course, all matter is nothing but a form of energy, so....but this is getting to be complex beyond my understanding as well. And not relevant to this thread.
 
(So! You keep owl hours too! :) )

The 16 bit is actually amazingly easy: Dynamic range. The more bits you have, the greater range between soft and loud you can express. Given a starting point of silence, you could get louder with 24 bits than with 16.

Sorry, can't quote all those dB numbers, they just don't stick in my head, but I believe that 16 bits is capable of just about the range of human hearing, and is more than enough for the useful range, at least for those who do not want to cause instant deafness.

Digital volume controls actually truncate each sample, chopping bits off. Horror of horrors, the signal is no longer bit perfect! :eek:. Purists don't like that. And, to be honest, lots of us don't like the idea, whether it matters in practice or not.

Of course, if your music is 24-bit, then you have a lot more bits to throw away before you get to anything you don't want to lose.
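
Here is the arithmetic behind that, plus a crude sketch of what a digital volume control does to the bottom bits (the specific numbers are just for illustration):

Code:
import math

def dynamic_range_db(bits):
    # ideal PCM dynamic range: roughly 6.02 dB per bit
    return 20 * math.log10(2 ** bits)

print(round(dynamic_range_db(16), 1))   # ~96.3 dB - roughly the useful span of human hearing
print(round(dynamic_range_db(24), 1))   # ~144.5 dB - far more headroom than anyone needs

# a crude digital volume control: attenuating by ~24 dB is close to dividing by 16,
# which shifts 4 bits off the bottom of every 16-bit sample, leaving ~12 bits of
# resolution - this truncation is what the purists object to
sample = 12345                          # some 16-bit sample value
attenuated = sample >> 4                # roughly -24 dB, now effectively 12-bit
print(attenuated)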
 
(So! You keep owl hours too! :) )

The 16 bit is actually amazingly easy: Dynamic range. The more bits you have, the greater range between soft and loud you can express. Given a starting point of silence, you could get louder with 24 bits than with 16.
Yes, I do, when something has been gnawing at me like this little thing has all of yesterday!
OK, then, about the 16-bit: it would mean that it fully covers the silent side of the range, and more than 16 is for the loudness, to go to, as you said, deafness.
16, then, is selected for how loud it can go, taking into account the human levels of listening without damage to the ears?
But why can't the same thing be achieved with, say, 8-bit and more amplification?
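
(An aside on that last question: the quantisation noise gets amplified together with the music, so the gap between the loudest sound and the digital noise floor stays fixed by the bit depth. A rough sketch of my own, assuming NumPy; the 1 kHz test tone and half-scale level are arbitrary choices:)

Code:
import numpy as np

def quantise(x, bits):
    # round a signal in the range [-1, 1] to 2**bits levels
    levels = 2 ** (bits - 1)
    return np.round(x * levels) / levels

t = np.linspace(0, 1, 44100, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)      # a 1 kHz tone at half of full scale

for bits in (8, 16):
    err = quantise(tone, bits) - tone
    snr = 10 * np.log10(np.mean(tone ** 2) / np.mean(err ** 2))
    print(bits, round(snr, 1))                 # roughly 44 dB at 8 bits vs 92 dB at 16

# turning up the amplifier afterwards raises the quantisation error by exactly
# as much as the music, so the 8-bit noise floor never gets any further away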
 
Manoj, sorry, that is quite wrong - and your visualization is in fact the reason why people get confused all the time.

A very simple way to think about this is - our ears perceive sound because air molecules hit our eardrums - the way rain falls on a tin roof and makes a sound.

We perceive a 20 kHz audio note because air molecules are hitting our eardrums 20,000 times a second. One entire sine wave you depicted in your diagram is one "drum beat" - and this drum beat is occurring 20,000 times a second.

So, when your diagram shows sampling of pieces inside this single wave, it makes no sense and is thus a highly misleading visualization.

And the reason for this common misperception is that a sound wave is really a compression wave - and our eardrums only "feel" the sound because physical air molecules hit the eardrum. (Which is why no sound can be heard in a perfect vacuum.)

Please note once again - sound is a discrete phenomenon - there is nothing continuous about it. It is discrete the same way a drum beat is - if you sample 100 times between two consecutive drum beats, you are not achieving anything!

And the maximum drumbeat (frequency) our eardrum can perceive is when air molecules hit our eardrums 20,000 times a second.

So, it is sufficient for us to *perfectly* record audio (up to 20Khz) in a digital format, as long as we store these frequencies (drum beats) as fast as 40,000 times a second. Not only is our information store 100% accurate for this frequency range, increasing the rate at which we store this information does absolutely nothing to the quality of our recording (which is already perfect).

At best, you end up storing some ultrasonic frequencies.

Arun,

Sorry, but I would love to hear the real meaning of the Nyquist theorem with respect to sampling frequency. I know we can all throw words around about how sound travels and how we hear it. But if one were to think about the rationale behind the "double sampling rate", this is what I understood. I am quite open to hearing the reasoning behind the "double sample rate".

As for a single drum beat - a drum beat will never sound as a single frequency. There are multiple frequencies produced by a single drum beat, and none is produced at 20 kHz, by the way. But that's a moot point. Nobody is sampling between two drum beats. The sampling frequency is constant and does not change. In the case of CD, the sampling frequency is 44,100 Hz. That means 44,100 measurements are taken per second, continuously, and the amplitude is stored. At each measurement, there can be various frequencies with different amplitudes. How much of the amplitude can be stored depends upon the bits. For CDs that limit is 16-bit, and that defines the dynamic range of the recording.

We can all talk about how sound is generated/captured/transmitted/read/converted/output from speakers/travels. I do understand that when a raindrop hits a tin roof, there are multiple sounds generated at that instant, with various frequencies and amplitudes. Some travel faster, some slower. Some decay too fast. The sampling rate + bits will define how much of that data can be captured and reproduced realistically. Higher values of both will obviously give better resolution, but beyond a certain limit, none of that can be heard or perceived. For some, that limit happens to be 16/44.1; for some, 24/96. For some, analog all the way is the BEST.
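
Just to put numbers on what those two choices add up to (a trivial back-of-the-envelope calculation):

Code:
sample_rate = 44100        # samples per second, per channel
bit_depth = 16             # bits per sample
channels = 2               # stereo
bits_per_second = sample_rate * bit_depth * channels
print(bits_per_second)     # 1,411,200 bit/s - the familiar ~1411 kbps of CD audio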

Again - this is how I understood it. I would love to hear another explanation of the Nyquist theorem in simple terms.
 
I think you are mixing up several things that are different questions:

1. How would sound in the air look if we could see it?

2. What does that sound look like as an analogue electrical signal?

3. The conversion of that analogue signal to digital.

asliarun... Can you help out here?

manoj.p I think that, as well as some of the same confusions, you are also trying to connect your points with a pencil, whereas the technology does it with a flexible strip.

To answer Manoj's illustration:
(not sure why this image is not showing)
Link: http://imgur.com/bViy0sz

1. I saw a good illustration online of a sound wave but cannot find it now. It basically looks like a compression wave - like how bellows or a harmonium works. Think really long bellows, or dominoes: the speaker acts as a piston and pushes at one end. The compression wave then propagates, like dominoes. Ultimately, the final layer of air pushes our eardrum, which is picked up by the neurons attached to the eardrum, which then send this signal to our brain.

2. A set of voltage changes (?) that allows a speaker driver to react to those changes in an electromagnetic way, and thus translates the electrical voltage changes into pistonic motion.

3. Information. About a set of samples.

Apologies in advance if I am getting anything wrong. This is how I think and I could have some fundamental holes in my knowledge too :)
 
To answer Manoj's illustration:
(not sure why this image is not showing)
Link: http://imgur.com/bViy0sz

1. I saw a good illustration online of a sound wave but cannot find it now. It basically looks like a compression wave - like how bellows or a harmonium works. Think really long bellows, or dominoes: the speaker acts as a piston and pushes at one end. The compression wave then propagates, like dominoes. Ultimately, the final layer of air pushes our eardrum, which is picked up by the neurons attached to the eardrum, which then send this signal to our brain.

2. A set of voltage changes (?) that allows a speaker driver to react to those changes in an electromagnetic way, and thus translates the electrical voltage changes into pistonic motion.

3. Information. About a set of samples.

Apologies in advance if I am getting anything wrong. This is how I think and I could have some fundamental holes in my knowledge too :)

Okay, I get what you are thinking now. But there are some things I would like to clarify.

I agree with the part that one may not need 10,000 sampling points to recreate a waveform. But if we have to think about how many samples we need for a given frequency, how will we come to that conclusion?

For a moment, let's forget about frequency. Just take one simple wave, like in my illustration. That's pic 1.
We can take one sample per complete wave. That will give us one point; let's assume it's at the peak, and we get picture 2 in my illustration. We won't get anything useful.
Same with 1.5 samples per wave - that's pic 3. We still won't get a good wave.
When we go to 2 samples per wave, that gives us something useful - pic 4.

So, if someone is designing an audio format and a target is given to them that the format should capture up to 20,000 Hz, what should be the minimum sampling frequency that can capture that realistically? We get our answer.
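
A quick numerical way to see the same thing (my own sketch in NumPy, not from the post; the phase offset is only there so the samples are not all zero):

Code:
import numpy as np

f = 20000                                  # the tone we want to capture, in Hz
phase = 0.3                                # arbitrary phase offset

def take_samples(f_sample, n=6):
    t = np.arange(n) / f_sample
    return np.round(np.sin(2 * np.pi * f * t + phase), 3)

print(take_samples(20000))   # 1 sample/cycle: every sample identical - looks like DC
print(take_samples(30000))   # 1.5 samples/cycle: the pattern repeats every 3 samples,
                             # i.e. it looks like a 10 kHz tone (an alias)
print(take_samples(44100))   # more than 2 samples/cycle: enough to rebuild the 20 kHz tone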

Now coming back to what you are saying - that we don't need the dips, we need only the peaks of the waveform, because that's what hits the ears. I am sorry to say, but sound is not only peaks. The sound travels as a wave. That's how it's created and that's how it hits the eardrums. If digital has to be truthful, then it has to capture the complete wave, not just the peaks.

Hope this clarifies.
 
This owl has to go to bed now :( even though the conversation is hot. See you tomorrow :)

but...

1. Thinking of how to illustrate sound in the air: how about a slinky spring, which clearly shows the compression and expansion of its sections as an impulse passes along it.

2. But it doesn't look like a wave*. But??? it does if we record it and look at its analogue signal on a scope?

later...




*DOH! Yes, of course it does, because it is coils. But I'm not sure that's relevant. I should have gone to bed!
 
As you can now guess, if you assume that someone has bat hearing and can hear up to 40 kHz, you will now need to sample that sound wave 80,000 times a second to be able to properly capture the ultrasonic transients and nuances.
This is another key sentence - I would just make one change to it. Instead of the word properly, I would use the word completely. And one commonly understands the word sample to mean a representative sample, although the word itself includes a 100% sample.
 
This is another key sentence - I would just make one change to it. Instead of the word properly, I would use the word completely. And one commonly understands the word sample to mean a representative sample, although the word itself includes a 100% sample.

Umm, I won't say completely. It means 80,000 samples per second are needed to recreate a 40,000 Hz frequency at a minimum. More samples will give a better waveform.

You are right, the word "sample" here does not mean a representative sample. But that's far from saying complete. In digital, you can break time down into infinitesimally small pieces and capture more "samples". A "second" is quite a small unit of time, but one can break it down infinitesimally: milliseconds, nanoseconds, picoseconds, irrespective of whether humans can perceive such a small unit.
 
Now coming back to what you are saying - that we don't need the dips, we need only the peaks of the waveform, because that's what hits the ears. I am sorry to say, but sound is not only peaks. The sound travels as a wave. That's how it's created and that's how it hits the eardrums. If digital has to be truthful, then it has to capture the complete wave, not just the peaks.
Let me test my understanding by taking a stab at an answer:).
We do need the dips, but these are nothing more than variations in peak heights, all the way down to zero height. So if all the peak heights are captured, so are the dips.
And the wave is just a representation of the sound vibrations, using successive peak-height information, and "how many peaks are present in a second" information. Thinking of the wave as more than that leads to thinking that there have to be data points for other parts of the "wave". There aren't any, in reality.
I am now pretty sure that the wave you see on an oscilloscope is drawn by the electronics in the instrument from just these two bits of information, and we end up thinking that there are data points for all of the points on the lines shown connecting these information points.
For digital to be 100% truthful it has to carry all the information about the peak heights for as many times as they occur in a second. And sampling 40,000 times a second allows this to be done for all the data points that exist for sound frequencies up to 20 kHz. The 44,100 times is for a margin of safety.
The times when this does not happen are not because of a flaw in this reasoning, but because of engineering constraints that have to be overcome in converting this state of affairs into equally truthful electrical signals that reach the speakers.
Digressing again, my experience of modern digital audio equipment tells me that these constraints have now been overcome to the point that further progress isn't audible in a well-constructed listening test. And the solutions overcoming them have by now also found their way into cheap digital components.
 
The reason why I started this thread was to understand the theory behind why digital works as well as it does - at least in my set ups, at my home and to my ears.
After some careful AB listening tests at home I had shelved - now sold - my turntable, and I was using my SACD player only for its DAC with a wireless digital front end solution into a very decent 2 channel system.
Two days ago, I resurrected an old Arcam iPod dock that was lying in a box - it is a well-constructed little thing, with a button to start/stop device charging, but it has only RCA analog output sockets, which means it keeps the internal DAC of the iDevice in play. It does have its own op-amps, keeping the amplification circuits of the docked device out of play. I wired it to the inputs of my preamp, stuck a three-year-old iPod touch into it, and started listening to lossless files on it, as well as recent iTunes 256 kbps purchases of some very good ECM recordings - ECM being one company that takes great care over mastering quality. I was pleasantly surprised - even amazed - at the sound pouring out of the speakers.
That spurred me to try to understand why digital today sounds as good as it does to me, even when the source is a three-year-old iPod touch. Or, as I also discovered, a four-year-old iPod classic with 160 GB of space for music.
 