    How to Mathematically Choose the Optimal Bins for Your Histogram


    Have you ever wondered how to choose the bins of a histogram? Have you asked yourself whether there are deeper reasons for a choice beyond the result simply looking nice? Histograms are among the most fundamental tools for data visualization, and setting their resolution matters, especially when the histogram itself is used for further analyses. Histograms are often computed to visualize the density of the data. In this post, we explore the mathematics of density fitting, looking specifically at how bins should shrink as the dataset grows. Inspired by adjacent fields such as perturbation theory in physics and Taylor expansions in mathematics, we will arrive at a rigorous method for constructing densities.

    All images are by the author

    Background

    Approximations

    The intuition is simple: the more data you have, the more detail you should be able to see. If you are looking at a sample of ten observations, two or three wide bins are likely all you can afford before your visualization becomes a sparse collection of empty gaps. But if you have ten million observations, those wide bins start to feel like a low-resolution pixelated photograph. You want to “zoom in” by increasing the number of bins. The question, however, is: How exactly should we scale this resolution?

    In physics, when we face a system that is too complex to solve exactly, we often turn to Perturbation Theory. In Quantum Electrodynamics (QED), for example, we approximate complex interactions by expanding them in terms of a small coupling constant, like the electron charge e. This “interaction strength” provides a natural hierarchy for our approximations. But for a histogram, what is the analogous “charge”? Is there a fundamental parameter that governs the interaction between our discrete data points and the underlying distribution we are trying to estimate?

    Mathematics offers another path: the Taylor Expansion. If we assume the underlying density function is sufficiently smooth (analytic), we can describe it locally using its derivatives. This feels like a promising lead, as the higher-order terms can be shown to become negligible. However, even if we accept the restriction to analytic distributions, it is not clear how this leads to a particular bin size.

    Alternatively, we might treat the problem as an Expansion in Basis Functions. Just as we can represent a piecewise continuous function using a Fourier series or Legendre polynomials, we could view histogram bins as a set of basis functions. With such an approach we could approximate the function in the L2 sense. But this approach introduces its own set of hurdles. How do we compute the coefficients for these functions efficiently? And more importantly, how do we satisfy the constraints of a probability density function? Unlike a general Fourier series, a density function must be non-negative everywhere and normalized to one. We will see in the following that the method obtained from information theory has aspects in common with an expansion in basis functions.
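
    As a sketch of the basis-function view, a histogram density can be written as a weighted sum of normalized indicator functions. The symbols used here (weights w_k, bins B_k with widths h_k, basis functions \phi_k) are notation chosen for illustration.

```latex
\hat{p}(x) = \sum_{k=1}^{K} w_k \, \phi_k(x),
\qquad
\phi_k(x) = \frac{1}{h_k} \, \mathbf{1}[x \in B_k],
\qquad
w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
```

    Because each \phi_k integrates to one and the weights are non-negative and sum to one, \hat{p} is automatically non-negative and normalized, which is exactly the pair of constraints that a general series expansion does not guarantee.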

    Information Theory

    Priors & Posteriors

    For an introduction to Bayesian statistics or information theory, the reader is referred to (Murphy, 2022). In a Bayesian approach, a model 𝑃(𝑋|𝜃), where 𝑋 are the observables we want to model and 𝜃 are our parameters, also contains a prior distribution 𝑃(𝜃|ℳ) that reflects our belief about the parameters before any data was observed. After the data has been observed, we can estimate the posterior distribution 𝑃(𝜃|𝑋):

    𝑃(𝜃|𝑋) = 𝑃(𝑋|𝜃)𝑃(𝜃|ℳ)/𝑃(𝑋)

    This procedure is mathematically elegant because it guards against overfitting by construction. However, it demands a strict discipline: we are not allowed to choose our model or prior after having seen the data. If we use the data to decide which model structure to use, we break the underlying logic of the inference.

    The most-likely model given the data versus model weighting

    The quality of a model can be quantified by its surprisal (see, e.g., Vries, 2026):

    log 𝑃(𝑋|ℳ) = −surprisal = accuracy − complexity
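
    One standard way to make the two terms explicit is the identity below, which holds when the expectation is taken over the exact posterior. The text does not state which decomposition it has in mind, so this is an illustration under that assumption.

```latex
\log P(X \mid \mathcal{M})
= \underbrace{\mathbb{E}_{P(\theta \mid X, \mathcal{M})}\big[\log P(X \mid \theta)\big]}_{\text{accuracy}}
- \underbrace{\mathrm{KL}\big[\, P(\theta \mid X, \mathcal{M}) \,\|\, P(\theta \mid \mathcal{M}) \,\big]}_{\text{complexity}}
```

    The accuracy term rewards parameters that explain the data well; the complexity term measures how far the posterior had to move away from the prior to achieve that.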

    Models with an excessive number of parameters (because one may be tempted to include all kinds of hypothetical interactions) may achieve an incredible accuracy, but they are “killed” by the penalty of their own complexity. The ideal model isn’t the most detailed one; it is the one that captures the most information with the least amount of unnecessary baggage.

    When considering a set of models, one can compute the posterior probability of each model relative to the other models under consideration:

    𝑃(ℳ𝑖 | 𝑋) ∝ 𝑃(𝑋 | ℳ𝑖) 𝑃(ℳ𝑖)

    It is tempting to simply pick the model with the highest probability and move on. But this “winner-takes-all” approach carries risks:

    • Statistical Fluctuations: The data 𝑋 might contain a random fluke that makes a sub-optimal model look temporarily superior.
    • The Weight of the Crowd: Sometimes, the sum of many “less likely” models actually outweighs the probability of the single “best” model.

    Because of this, a more robust path is to carry all models forward, weighting them by their probability. It is important to note that this is not a “mixture” of different truths; we still assume only one model is actually true, but we use the full distribution of possibilities to account for our own uncertainty.
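
    The weighting step can be sketched in a few lines. The function name `model_posteriors` and the example log evidences are hypothetical; the only assumption beyond the text is that models receive equal prior weight unless log priors are passed in.

```python
import math

def model_posteriors(log_evidences, log_priors=None):
    """Turn log P(X|M_i) (plus optional log P(M_i)) into normalised model probabilities."""
    if log_priors is None:
        # Assumption: every model gets the same prior weight.
        log_priors = [0.0] * len(log_evidences)
    scores = [le + lp for le, lp in zip(log_evidences, log_priors)]
    top = max(scores)  # subtract the maximum so that exp() cannot underflow to all zeros
    unnormalised = [math.exp(s - top) for s in scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

# Hypothetical log evidences for three candidate models.
weights = model_posteriors([-1050.0, -1047.5, -1049.0])
```

    Working in log space matters here: evidences of realistic data sets are far too small to exponentiate directly.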

    Densities

    A density using a Bayesian approach

    To treat a density as a formal model, we view each of its 𝐾 bins as a parameter. Specifically, we assign a weight 𝑤ₖ to each bin, representing the probability of a data point falling into that interval. Because the total probability must sum to one (∑ₖ 𝑤ₖ = 1), a density with 𝐾 bins is defined by 𝐾 − 1 independent parameters. Such models are also called mixtures. In our Bayesian framework, we need to assign a prior to these weights. Given that we are dealing with categorical proportions that must sum to one, the Dirichlet distribution is the mathematically natural choice.

    Choosing the Hyperparameters

    The Dirichlet distribution is governed by hyperparameters, often denoted as 𝛼. These values represent our “pseudo-counts”, essentially what we believe the density looks like before we have even seen the first data point. When we assume a flat prior (where the evidence 𝑃(𝑋) is constant), two primary strategies emerge for choosing 𝛼:

    • 𝛼 = 1/𝐾 (The Sparse Choice): This is often used when we expect the data to be highly concentrated. It assumes a priori that the majority of bins will be empty, making it a “sparsity-promoting” prior.
    • 𝛼 = 1 (The Uniform Choice): Also known as the flat or Laplace prior, this assumes that every possible distribution of weights is equally likely. It essentially adds one “virtual” observation to every bin before the real data arrives.

    For the purpose of constructing a standard density, the second choice 𝛼 = 1 is often the most natural. It reflects a neutral starting point where we assume the data is uniformly distributed across the interval until the evidence proves otherwise.

    By defining our bins this way, we have transformed the “pixelation” of a density into a rigorous model. We now have a fixed set of parameters (𝐾 − 1 weights) and a clear prior (𝛼 = 1). The next step is to use the data to determine the optimal number of bins 𝐾 by balancing the accuracy of the fit against the complexity of the parameters.
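
    As a minimal sketch of how the evidence for a given 𝐾 could be computed, the code below uses the closed-form marginal likelihood of a Dirichlet-multinomial model with equal-width bins on a fixed interval. The text does not spell out this formula, so the function names, the interval arguments `lo` and `hi`, and the example data are assumptions for illustration.

```python
import math

def bin_counts(data, k, lo, hi):
    """Count observations in k equal-width bins on [lo, hi] (data assumed inside the interval)."""
    counts = [0] * k
    width = (hi - lo) / k
    for x in data:
        idx = min(int((x - lo) / width), k - 1)  # x == hi goes into the last bin
        counts[idx] += 1
    return counts

def log_evidence(data, k, lo, hi, alpha=1.0):
    """log P(X | K) for a K-bin histogram density with a symmetric Dirichlet(alpha) prior."""
    n = len(data)
    counts = bin_counts(data, k, lo, hi)
    width = (hi - lo) / k
    # Ratio of Dirichlet normalisers: B(counts + alpha) / B(alpha)
    log_ev = math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
    log_ev += sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
    # Each observation contributes a density factor 1 / width inside its bin.
    log_ev -= n * math.log(width)
    return log_ev

# Hypothetical data concentrated in the left part of [0, 1].
example = [0.01 * i for i in range(1, 21)]
evidence_by_k = {k: log_evidence(example, k, 0.0, 1.0) for k in (1, 2, 4, 8)}
```

    More bins raise the density inside occupied bins (the accuracy side), while the Gamma-function terms penalize spreading prior mass over many weights (the complexity side).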

    Example

    Please look at the data in the figure below:

    When fitting with 8 bins we obtain:

    Note that in this density the right-most bin is above zero although no data points fell into this bin. This is a result of the Bayesian approach, which estimates the density from both our prior belief and the data that we observed.

    Summarizing, we obtained a density using a Bayesian approach. We defined a prior 𝑃(𝜃) that reflected our expectation for a uniform density. Then we took the data and we computed the posterior 𝑃(𝜃|𝑋) that underlies the resulting density.
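
    The non-zero height of an empty bin follows directly from the posterior mean of the Dirichlet weights. The sketch below assumes equal-width bins and uses made-up counts, not the data from the figure.

```python
def posterior_density(counts, lo, hi, alpha=1.0):
    """Posterior-mean density per bin: (n_k + alpha) / ((N + K * alpha) * bin width)."""
    k = len(counts)
    n = sum(counts)
    width = (hi - lo) / k
    return [(c + alpha) / ((n + k * alpha) * width) for c in counts]

# Hypothetical counts for 8 bins on [0, 8]; the right-most bin is empty.
counts = [3, 7, 12, 9, 5, 3, 1, 0]
density = posterior_density(counts, 0.0, 8.0)
```

    With 𝛼 = 1 the empty bin behaves as if it held one virtual observation, so its height is small but strictly positive, and the density still integrates to one.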

    Weighted densities

    Using the approach of the previous section we can construct densities with 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 bins. More bins give a more accurate fit of the data but also introduce additional complexity. As discussed earlier, accuracy and complexity together determine the evidence of a model. Viewing each density as a model, we can compute its probability of being the true one relative to the other models under consideration. This yields the figure below:

    In the previous section it was discussed that one may choose the “best” model, which in this case would be the density with 8 bins. However, it is safer to take a weighted sum over all the models. This yields:

    It is important to realize that, from a Bayesian perspective, this is the best that we can do. Also note that this graph still contains a contribution from the density with 1024 bins. Lastly, one can prove that the contributions of densities of higher orders 𝑁 diminish.
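
    A weighted sum over resolutions might be implemented along the following lines. This is a sketch under stated assumptions: equal-width bins on a fixed interval, a symmetric Dirichlet prior with 𝛼 = 1, equal prior weight for every bin count, and hypothetical function names.

```python
import math

def counts_for(data, k, lo, hi):
    """Equal-width bin counts on [lo, hi]."""
    width = (hi - lo) / k
    counts = [0] * k
    for x in data:
        counts[min(int((x - lo) / width), k - 1)] += 1
    return counts

def log_evidence(counts, lo, hi, alpha=1.0):
    """Dirichlet-multinomial log evidence of a histogram density (assumed form)."""
    k, n = len(counts), sum(counts)
    width = (hi - lo) / k
    return (math.lgamma(k * alpha) - math.lgamma(n + k * alpha)
            + sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
            - n * math.log(width))

def averaged_density(data, lo, hi, bin_options, alpha=1.0):
    """Return (model weight per K, function x -> model-averaged density)."""
    all_counts = {k: counts_for(data, k, lo, hi) for k in bin_options}
    log_ev = {k: log_evidence(all_counts[k], lo, hi, alpha) for k in bin_options}
    top = max(log_ev.values())
    unnorm = {k: math.exp(v - top) for k, v in log_ev.items()}
    total = sum(unnorm.values())
    weights = {k: u / total for k, u in unnorm.items()}  # equal prior over K assumed
    n = len(data)

    def density(x):
        value = 0.0
        for k in bin_options:
            width = (hi - lo) / k
            idx = min(int((x - lo) / width), k - 1)
            # Posterior-mean density of the K-bin model, weighted by P(K | X).
            value += weights[k] * (all_counts[k][idx] + alpha) / ((n + k * alpha) * width)
        return value

    return weights, density
```

    Calling `averaged_density(data, 0.0, 1.0, [1, 2, 4, 8, 16])` returns the per-resolution weights and a callable density. Fine resolutions remain in the sum, but their weights shrink when the data do not support the extra detail.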

    Densities with unequal bins

    The density obtained above looks a bit blocky, which originates from the choice of equal-width bins. Other options are available, such as taking random splits (and compensating the prior for it). This yields the graph below:

    Densities with error bars

    To close off the construction of densities, it may be of interest to visualize our uncertainty in these densities. Although numerically expensive to compute, the expression for the standard deviation of the density is remarkably straightforward (F. Pijlman, 2023):

    \sigma_{P(x|X)}^2 = P(x|X) \left( P(x|\{x,X\}) - P(x|X) \right)

    This yields the densities below:
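
    For a single histogram model the expression can be evaluated in closed form. The sketch below reads P(x|{x,X}) as the predictive density at x after x has been added to the data once more, which is an interpretation of the notation; the counts are hypothetical.

```python
import math

def density_and_std(counts, lo, hi, alpha=1.0):
    """Per-bin (density, standard deviation) for one K-bin Dirichlet model."""
    k, n = len(counts), sum(counts)
    width = (hi - lo) / k
    total = n + k * alpha
    result = []
    for c in counts:
        p = (c + alpha) / (total * width)                     # P(x | X)
        p_plus = (c + alpha + 1.0) / ((total + 1.0) * width)  # P(x | {x, X})
        variance = max(p * (p_plus - p), 0.0)                 # guard against rounding below zero
        result.append((p, math.sqrt(variance)))
    return result

# Hypothetical counts for 8 bins on [0, 8].
bars = density_and_std([3, 7, 12, 9, 5, 3, 1, 0], 0.0, 8.0)
```

    For this single-model case the result coincides with the textbook variance of a Dirichlet weight divided by the squared bin width, which is a useful sanity check on the formula.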

    Conclusions

    We began with a simple question: Is there a mathematical foundation for choosing the bins in a histogram? As the concept of bins inherently connects data points with densities, we studied how to choose bins for densities.

    Using a Bayesian approach (information theory) one can fit densities without having to worry about overfitting (too many bins showing too much detail). Although one can compute the “best” bin width, we saw that:

    • Model weighting allows us to combine multiple resolutions, providing a smoother and more honest representation of the data.
    • Dirichlet Priors give us a rigorous way to express our initial assumptions about data distribution.

    Just as perturbation theory provides a hierarchy for physical interactions, this Bayesian framework provides a hierarchy for data resolution. The resolution scales naturally as more data becomes available. Note that such ideas can also be used when learning models in which one has an expansion in interactions.

    The method of combining densities of various resolutions was also explored for randomly chosen bins. This led to smooth histograms, which may appear more natural for most data sets.

    We also presented the use of standard deviations in histograms. Although the calculation of standard deviations was derived for Bayesian models, the calculation procedure suggests a wider applicability. As such, it can be used for visualizing the remaining uncertainties in densities.

    Acknowledgements

    The EdgeAI “Edge AI Technologies for Optimised Performance Embedded Processing” project has received funding from Key Digital Technologies Joint Undertaking (KDT JU) under grant agreement No. 101097300. The KDT JU receives support from the European Union’s Horizon Europe research and innovation program and Austria, Belgium, France, Greece, Italy, Latvia, Luxembourg, Netherlands, and Norway.

    References

    • F. Pijlman, J. L. (2023). Variance of Likelihood of Data. https://sitb2023.ulb.be/proceedings/, 34/37.
    • Murphy, K. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
    • Vries, B. d. (2026). Active Inference for Physical AI Agents. arXiv.

    Bio

    Fetze Pijlman is a Principal Scientist at Signify Research in Eindhoven, the Netherlands. His research focus spans probabilistic machine learning, Bayesian inference, and signal processing, with a particular interest in applying these mathematical frameworks to IoT, sensing, and smart systems.
