Contents

Trace Amplitude Histogram

Manual fit

Automatic fit

Analysis region

Other stuff

Trace Amplitude Histogram

This shows a simple simulation of single-channel patch-clamp data, in which event channels a and b represent the opening of two ion channels within a patch, and the data traces represent the consequent current recordings. Each channel contributes a fundamental current of 1 unit. In trace 1 a random Gaussian noise signal with a standard deviation of 0.2 has been superimposed on the currents, while trace 2 is a copy of trace 1 with an additional random signal with a standard deviation of 0.3 superimposed. The overall standard deviation of the noise in trace 2 is thus 0.36, since the standard deviation of the sum of two independent signals is the square root of the sum of their variances (the standard deviation of the trace itself is higher because of the current steps). Trace 3 is identical to trace 2, except that the sequence of the values has been randomly shuffled.
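For readers who want to experiment outside DataView, a minimal Python sketch of this kind of simulation might look as follows. The sample count, dwell lengths and gating statistics here are invented for illustration; they are not the parameters of the tutorial file.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20000  # hypothetical sample count, not the tutorial file's value

    def telegraph(n, mean_dwell=200):
        """Random 0/1 open-closed signal with geometric dwell lengths."""
        out = np.empty(n)
        state, i = 0, 0
        while i < n:
            d = rng.geometric(1.0 / mean_dwell)  # random dwell length
            out[i:i + d] = state
            state ^= 1                           # toggle open/closed
            i += d
        return out

    current = telegraph(n) + telegraph(n)        # channels a and b, 1 unit each
    trace1 = current + rng.normal(0.0, 0.2, n)   # Gaussian noise, s.d. 0.2
    trace2 = trace1 + rng.normal(0.0, 0.3, n)    # extra noise, s.d. 0.3
    trace3 = rng.permutation(trace2)             # trace 2 with samples shuffled

    # Combined noise s.d. in trace 2: sqrt(0.2**2 + 0.3**2) = sqrt(0.13) ≈ 0.36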

The histogram displays the amplitude distribution of the data values in trace 1 that are currently visible in the main view (the rule for allocating a data value to a histogram bin is: lower bound <= value < upper bound). The histogram X axis scales are initially set to those of the axis in the main view on which trace 1 displays, but they can be adjusted if desired.

By default the scaled PDF of a single Gaussian (normal) distribution is shown superimposed on the histogram, with the mean and standard deviation set from the data. In this case the distribution is obviously wrong - the histogram has 3 clear peaks, so a unimodal PDF is inappropriate. The lowest level (left-hand) peak in the histogram represents both channels closed (i.e. the resting state), the middle peak represents either channel open, while the highest peak represents coincident opening of both channels.
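As an illustration of what "scaled PDF" means here: a Gaussian PDF integrates to 1, so to overlay it on a count histogram it must be multiplied by the number of samples and the bin width. A minimal sketch, using an invented stand-in for trace 1:

    import numpy as np
    from scipy.stats import norm

    # Stand-in for trace 1: three current levels plus noise of s.d. 0.2.
    rng = np.random.default_rng(0)
    trace1 = (rng.choice([0.0, 1.0, 2.0], 20000, p=[0.45, 0.42, 0.13])
              + rng.normal(0.0, 0.2, 20000))

    counts, edges = np.histogram(trace1, bins=100)  # bins are [lower, upper)
    binwidth = edges[1] - edges[0]
    centres = edges[:-1] + binwidth / 2

    # Single-Gaussian "fit": mean and s.d. taken directly from the data,
    # scaled so the curve's area matches the histogram's total area.
    mu, sd = trace1.mean(), trace1.std()
    scaled_pdf = len(trace1) * binwidth * norm.pdf(centres, mu, sd)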

DataView allows fitting a mixture of up to 4 normal distributions to the histogram. This can be done manually or automatically. At any stage you can return to an earlier set of mixture parameters by clicking the Back button.

The manual procedure will be described first, since following it will give a clearer understanding of what is going on.

Manual fit

The histogram clearly has 3 peaks, so a mixture of 3 components is the natural starting point.

Initially we will adjust the parameters of the mixture by eye, and then, when they are approaching reasonable values, we will optimize the fit by using Powell's method to maximize the likelihood. Note that the fit optimization is to the histogram bin counts, not to the underlying data.
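As a hedged sketch of this kind of optimization (the helper below is an invention for illustration, not DataView's internal code): treat the bin counts as multinomial samples, compute the negative log-likelihood of the counts under the mixture, and minimise it with Powell's derivative-free method.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def fit_mixture(counts, edges, k):
        """Fit a k-component, equal-SD Gaussian mixture to histogram counts."""
        def nll(theta):
            m = theta[:k]                            # component means
            sd = np.exp(theta[k])                    # common s.d., kept positive
            w = np.exp(np.append(theta[k + 1:], 0.0))
            w /= w.sum()                             # mixing proportions
            cdf = sum(wi * norm.cdf(edges, mi, sd) for wi, mi in zip(w, m))
            p = np.clip(np.diff(cdf), 1e-12, None)   # per-bin probability mass
            return -np.sum(counts * np.log(p))       # multinomial, up to a constant
        theta0 = np.concatenate([np.linspace(edges[0], edges[-1], k + 2)[1:-1],
                                 [np.log(0.3)], np.zeros(k - 1)])
        return minimize(nll, theta0, method="Powell")

    # Stand-in data resembling trace 2 (three levels, noise s.d. ~0.36):
    rng = np.random.default_rng(1)
    data = (rng.choice([0.0, 1.0, 2.0], 20000, p=[0.45, 0.42, 0.13])
            + rng.normal(0.0, 0.36, 20000))
    counts, edges = np.histogram(data, bins=100)
    res3 = fit_mixture(counts, edges, 3)  # res3.fun is the minimised NLL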

We have no reason to suspect that the different distributions have different standard deviations, so leave the Equal SDs box checked.

We now have 3 clear peaks in the PDF in approximately the correct locations. This initial fit is rather poor, but it will do as a starting point.

The PDF rapidly adjusts to what I hope is a visually good fit to the histogram. The fitted mean and s.d. values also show a good fit to the known "ground truth" from the data construction parameters. It would take some work to figure out the true proportions of the mixtures, since the channel opening events occur randomly, but it could be done from the event time parameters. However, the fitted proportions appear quite reasonable.

The increased noise blurs the histogram peaks to the point where it is not absolutely clear whether there are two or three peaks present. Furthermore, the previous fit is now not very good.

The fit looks good, but is it better than a fit with just 2 components? We know from its construction that the data contain 3 current steps, but in a real experiment we would only have the histogram to go on, and it is not clear whether the slight bump on the right-hand side of the histogram is a 3rd component, or just a bump occurring by random chance.
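One way to formalise that question (not something the tutorial itself does; this is offered as a hedged illustration) is an information criterion such as AIC, which penalises the extra parameters of the 3-component model. Continuing from the hypothetical fit_mixture sketch above, a k-component equal-SD model has 2k free parameters (k means, 1 s.d., k - 1 proportions):

    # Continuing from the fit_mixture sketch in the manual-fit section.
    res2 = fit_mixture(counts, edges, 2)
    res3 = fit_mixture(counts, edges, 3)
    aic2 = 2 * (2 * 2) + 2 * res2.fun   # 4 parameters for k = 2
    aic3 = 2 * (2 * 3) + 2 * res3.fun   # 6 parameters for k = 3
    print("preferred component count:", 2 if aic2 < aic3 else 3)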

We get back to our original parameter values despite the poor guess of the third set of parameters, which is rather nice.

Automatic fit

DataView can attempt an automatic fit of Gaussian mixtures by applying a clustering algorithm to partition the raw data underlying the histogram. From Bouman's web page: "The ... program applies the expectation-maximization (EM) algorithm together with an agglomerative clustering strategy to estimate the number of clusters which best fit the data. The estimation is based on the Rissanen order identification criterion known as minimum description length (MDL). This is equivalent to maximum-likelihood (ML) estimation when the number of clusters is fixed, but in addition it allows the number of clusters to be accurately estimated." (Bouman, 1997)
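Bouman's program is not, to my knowledge, available as a Python library, but a rough analogue of the automatic fit can be sketched with scikit-learn: fit EM mixtures over a range of component counts and keep the count that minimises BIC, a criterion closely related to MDL. This is an analogous approach, not DataView's actual algorithm.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Stand-in data resembling trace 2; sklearn expects a 2-D (n, 1) array.
    rng = np.random.default_rng(2)
    data = (rng.choice([0.0, 1.0, 2.0], 20000, p=[0.45, 0.42, 0.13])
            + rng.normal(0.0, 0.36, 20000)).reshape(-1, 1)

    # EM fit for each candidate cluster count; keep the minimum-BIC model.
    fits = [GaussianMixture(n_components=k, random_state=0).fit(data)
            for k in range(1, 7)]
    best = min(fits, key=lambda g: g.bic(data))
    print(best.n_components,
          best.weights_.round(3),                       # proportions
          best.means_.ravel().round(3),                 # means
          np.sqrt(best.covariances_.ravel()).round(3))  # standard deviations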

Note that the Target cluster count edit box shows 0. This instructs the algorithm to automatically choose the "best" number of clusters in the data.

After a brief interval the Cluster dialog shows that the data can best be described by a mixture of 4 normal distributions. It lists the proportion (fraction of the data belonging to that distribution), mean and standard deviation of each sub-class (component) of the mixture.

The Fit Mixtures dialog now shows the parameters of the 4 component distributions, and the red line shows the resulting mixture PDF superimposed on the histogram.

At first sight, finding 4 mixtures seems plain wrong, because we know that there are only 3 current steps in the data. However, the noise itself constitutes a data component, and in trace 2 the noise is sufficiently large that it contributes a significant amount to the histogram. So actually, finding 4 components is quite appropriate.

The manual fit produces an even closer match between the red PDF line and the histogram. This is not surprising, because this is exactly what the fitting procedure optimizes. What is slightly surprising is that the mixture parameters also end up closer to the ground-truth of the data. The 3 components with means close to 0, 1 and 2 reflect the 3 current steps (and their standard deviation is close to that of the actual noise). The fourth component has a mean of about 0.4 and a larger standard deviation; this is the noise component that is "smeared" across the whole recording.

Visual inspection of trace 2 in the main view shows us clearly that there are 3 steps in the distribution. However, this is only clear because sections of data "stick" in one of the 3 steps for several (in fact many) contiguous samples. Trace 3 contains exactly the same set of numerical values, but they have been shuffled so that the temporal sequencing is lost.

The histogram is unchanged in shape, even though trace 3 has a very different visual appearance compared to trace 2 in the main view. This is exactly the way it should be, because the histogram is uninfluenced by the time sequence in which the raw values occur.
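This permutation-invariance is easy to verify directly, for example:

    import numpy as np

    rng = np.random.default_rng(3)
    trace2 = rng.normal(0.0, 1.0, 10000)   # any trace will do for this check
    trace3 = rng.permutation(trace2)       # shuffled copy, same values

    edges = np.linspace(trace2.min(), trace2.max(), 101)
    print(np.array_equal(np.histogram(trace2, edges)[0],
                         np.histogram(trace3, edges)[0]))  # prints True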

Once again, 4 clusters are found, but the parameter values are slightly different, although still a reasonable estimate of the ground-truth.

This emphasises an important point about the automatic clustering algorithm - it is influenced by the data sequence. This is because it works by making an initial assignment of each data point in sequence to one of 20 arbitrary clusters, and as the algorithm progresses the assignments are changed and the number of clusters reduced, until an optimum is achieved. In cases such as trace 2, where the distinction between clusters is very fuzzy due to the noise, a different optimum can be reached if the initial assignment is different, which it will be if the data values are shuffled. In fact, with some shuffles of this trace, the cluster algorithm suggested that the optimum was just two clusters, which is actually a reasonable reflection of the visual appearance of the histogram, although not correct for the underlying noise-free data.

If a similar shuffle is performed on trace 1, where the noise level is lower and the clusters are much more distinct, the algorithm nearly always converges on 3 clusters, with the parameters very similar between each run, and very close to the ground-truth.

Analysis region

Data files can be very long, and the analysis functions may run very slowly, or indeed run out of memory completely, if an attempt is made to analyse the entire file. For this reason DataView allows the user to include only part of the file in the histogram display and any analysis. The default is to analyse only the data visible in the main view, but of course this could be the entire file, so when working with a very large file the user is advised to view only a portion of it before activating the Trace Amplitude Histogram dialog.

Hopefully, the Analysis window options at the top left of the dialog are self-explanatory.

Other stuff

The histogram can also display a rug plot (described here) and a kernel density estimate (KDE: described here).