# Big data hitting the fan? Nyquist-Shannon TOOL SAMPLE can save you

You are working on a big data project that collects data from sensors that can be polled every 0.1 of a second. But just because you can doesn’t mean you should, so how do we decide how frequently to poll sensors? The tempting answer is to collect it all, every last ping. That way, not only is your back covered but you can …

This topic is closed for new posts.
1. #### I can't use it

I already knew the Sample Frequency I need, and I can't use it. Almost nobody in telcos can.

Reason: it would bring all the systems to a halt.

So we are stuck with "good enough" and "hope for the best". We also have better approaches: a mix of data polling and SNMP traps (alerts).

Anyway, I think what people need is top-down alerts AND checks, so you just know the health of the system; you aren't really interested in components.

2. #### Just occurred to me

What would happen if a waveform was sampled at non-regular (but defined) intervals? That way, the problem of always measuring a signal as it crosses zero would be avoided. One for the mathematicians, I think.

1. #### Re: Just occurred to me

If you're measuring something that is periodic and predictable then you may be able to do that. But if it's periodic and predictable why are you measuring it?

If the signal isn't periodic and predictable, then the longest sample period (lowest frequency) of your non-regular interval needs to be no more than half the period (or at least twice the frequency) of the signal change that you are interested in, otherwise you run the risk of missing something.

1. #### Re: Just occurred to me

"If the signal isn't periodic and predictable, then the longest sample period ( lowest frequency) of your non-regular interval needs to be half the period ( or twice the frequency ) "

In theory yes, but in reality you need to sample faster than that. If, for example, you're measuring a sine wave and sample at exactly twice its frequency, you could hit the point at which the wave crosses zero every time, in which case it'll look like there is no signal. In reality you need to measure at 2x the frequency and also at 2x the frequency with a 90 deg phase shift. This is partly how a discrete Fourier transform works: it multiplies by the sine AND the cosine before integrating.
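The zero-crossing case described above is easy to reproduce numerically. This is a minimal sketch, assuming a unit-amplitude 1 Hz sine wave sampled at exactly 2 Hz; the function name `sample_sine` and its parameters are illustrative and not from the thread.

```python
import math

def sample_sine(signal_hz, sample_hz, n_samples, phase_deg=0.0):
    """Sample sin(2*pi*f*t + phase) at a fixed rate and return the readings."""
    phase = math.radians(phase_deg)
    return [
        math.sin(2 * math.pi * signal_hz * (k / sample_hz) + phase)
        for k in range(n_samples)
    ]

# A 1 Hz sine sampled at exactly 2 Hz, with the first sample on a zero crossing.
in_phase = sample_sine(1.0, 2.0, 8)

# Same rate, but the signal is shifted 90 degrees relative to the sampling instants.
quadrature = sample_sine(1.0, 2.0, 8, phase_deg=90.0)

print(max(abs(v) for v in in_phase))    # effectively zero (floating-point noise only)
print(min(abs(v) for v in quadrature))  # effectively one: every sample lands on a peak
```

The first set of samples would suggest there is no signal at all, while the shifted set sees the full amplitude, which is why sampling at exactly twice the frequency is not enough in practice.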

3. #### Slight case of subject drift

The article started off talking about stored data volumes, i.e. storing logging data, and then drifted off into sampling rates, which is all very interesting and must be considered when deciding how to get a true picture of the behaviour over time of the variable being sampled.

The answer to the storage problem that I expected to see is to record only the timestamped new value each time the sampled variable changes. Unless the change rate approaches the sampling rate, the storage saved by logging timestamped changes will easily exceed the overhead of recording the timestamp.
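A minimal sketch of change-only logging, assuming readings arrive as `(timestamp, value)` pairs in time order. The function name `log_changes` and the sample data are hypothetical, chosen only to illustrate the idea.

```python
def log_changes(samples):
    """Keep only the (timestamp, value) pairs whose value differs from the previous one.

    `samples` is an iterable of (timestamp, value) pairs in time order.
    """
    changes = []
    last_value = object()  # sentinel that never equals a real reading
    for timestamp, value in samples:
        if value != last_value:
            changes.append((timestamp, value))
            last_value = value
    return changes

# Six polls at 0.1 s intervals, but only three actual changes.
readings = [(0.0, 21), (0.1, 21), (0.2, 21), (0.3, 22), (0.4, 22), (0.5, 21)]
print(log_changes(readings))  # [(0.0, 21), (0.3, 22), (0.5, 21)]
```

Here half the polled readings are dropped. With a slowly changing variable the saving would be much larger, at the cost of storing a timestamp with every retained value.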

1. #### Re: Slight case of subject drift

That would require intelligence at each sensor to push the data to a central location when an event/change occurs. But if the sensor fails, nothing will ever be logged again. So then you need the central location to periodically poll the intelligent sensors, "You still alive and working?", to check for faults. You also need a network that can queue events in case two sensors push data at the exact same instant. The more complexity you add, the more possible failure points you add too. There are many advantages to K.I.S.S. https://en.wikipedia.org/wiki/KISS_principle

1. #### Re: Slight case of subject drift

Nope - he was talking about capture, i.e. permanent data storage. IOW it doesn't matter whether all sensors autonomously send in readings or the logging system(s) poll them for data. Once the data arrives at the server that will record it, it's easy to scan through the stream from each device and discard everything except the changes in a sensor reading.

Think systems don't work that way? Here's a real-life example: the switches in mobile phone cells are polled on a daily basis and their call data pulled down via FTP as a file containing a megabyte or two of data. This is then processed in various ways, e.g. run through fraud detection kit and analysed by the network performance team, before being used to populate one or more databases.

4. #### Don't forget...

Statistical sampling. If testing 10 or 100 bullets in a batch of 10,000 rounds of ammunition was A-OK for the guy that invented it... YMMV.

...and paradoxically, redundant and fail-safe sensors. Thermometers tend to fail catastrophically, going to either the top or the bottom of the scale, which would cause your HVAC to freeze or toast everybody. A voting circuit reading an odd number of thermometers would not just increase your reliability, it would also give you equally reliable fault alarms. Old school 4-20mA sensors don't fall prey to silent failure, exactly because of those 4mA. As always, YMMV.
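Both ideas can be sketched in a few lines. This is a rough illustration, assuming median voting across an odd number of thermometers and a simple validity band for a 4-20mA loop. The band limits and the function names `median_vote` and `faulty_loops` are assumptions for illustration, not values from the comment.

```python
import statistics

# Assumed validity band for a 4-20 mA loop; the exact limits are illustrative.
MIN_VALID_MA = 3.8
MAX_VALID_MA = 20.5

def median_vote(readings):
    """Return the median reading, so one sensor pinned at either end of scale is outvoted."""
    return statistics.median(readings)

def faulty_loops(currents_ma):
    """Return the indices of loops whose current falls outside the valid band.

    A healthy loop never reads near 0 mA, so a dead sensor or a broken wire
    shows up as an out-of-band current rather than as a plausible low reading.
    """
    return [
        i for i, current in enumerate(currents_ma)
        if not MIN_VALID_MA <= current <= MAX_VALID_MA
    ]

# One of three thermometers has failed to the top of its scale.
print(median_vote([21.4, 150.0, 21.6]))  # 21.6

# The second loop reads 0 mA, which a live sensor cannot produce.
print(faulty_loops([12.1, 0.0, 12.3]))   # [1]
```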

I'm not denying anything in the article, but sometimes you don't need perfect sampling. You can combine both, of course, with your datalogger polling randomly 10% of your sensors every 100 minutes, coupled with "IRQ" alarms.
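The random polling part might look something like the sketch below, assuming sensors are identified by strings and that the scheduling (every 100 minutes) and the alarm path are handled elsewhere. `pick_sensors_to_poll` and the sensor names are hypothetical.

```python
import random

def pick_sensors_to_poll(sensor_ids, fraction=0.10, rng=random):
    """Choose a random subset of sensors for one polling round.

    Returns at least one sensor when the list is non-empty.
    """
    if not sensor_ids:
        return []
    count = max(1, round(len(sensor_ids) * fraction))
    return rng.sample(sensor_ids, count)

sensors = [f"sensor-{n:03d}" for n in range(200)]
batch = pick_sensors_to_poll(sensors)
print(len(batch))  # 20
```

Each round draws a fresh subset without repeats, so over many rounds every sensor is likely to be polled, while anything urgent still arrives through the alarm path.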


Biting the hand that feeds IT © 1998–2020