We wanted to dedicate an entire post to the lovely functions cross entropy and Kullback-Leibler divergence, which are very widely used in training ML models but not very intuitive. Luckily these two loss functions are intricately related, and in this post we'll explore the intuitive ideas behind both, and compare & contrast the two so you can decide which is more appropriate to use in your case. We've also created a short interactive demo you can play around with at the bottom of the post.

We'll start off reviewing the concepts of information and entropy. "Information" is such an overloaded word in English, but in the statistical/information-theoretic sense, information is a quantification of uncertainty in a probability distribution. The more uncertain or random the weather seems, the more information you gain when you learn what the weather actually is. Entropy is centered around this information transfer: it's the expected amount of information, measured in bits, in some transmission of data.

The rarer the outcome, the more information you'll need to discern that event from others – hence the inverse relationship between probability and information for a given event:

\[ Info(x) = -\log_2(P(x)) \]

To further illustrate this formula, let's consider the 2 possible outcomes from learning what the weather is:

1. The weather station tells us it's rainy. The station must've transmitted 11, which is precisely 2 bits long.
2. The weather station tells us it's sunny. This is a bit trickier than the rainy case: the station transmitted one of 00, 01, or 10, which is less than 2 bits of info because we're not certain what each bit was.

Now to calculate the expected amount of information (entropy!), we take the weighted sum of these individual amounts of information, with weights being the probability of each outcome:

\[ Entropy = P(rainy) \cdot Info(rainy) + P(sunny) \cdot Info(sunny) \]

Plugging in \(P(rainy) = \tfrac{1}{4}\) with \(Info(rainy) = 2\) bits, and \(P(sunny) = \tfrac{3}{4}\) with \(Info(sunny) = -\log_2(3/4) \approx 0.415\) bits, this works out to \(\tfrac{1}{4} \cdot 2 + \tfrac{3}{4} \cdot 0.415 \approx 0.811\) bits.

This tells us that given the weather follows our distribution of ¾ sunny, ¼ rainy, we'd expect that learning what the weather is for a given day (i.e., receiving the information about the weather's value) gives us 0.811 bits' worth of information. Or, from the other side, we expect it would take 0.811 bits to send a message saying what today's weather was. This brings us to the general formula for entropy, which is defined for some probability distribution (in this case, of the weather) \(p\):

\[ H(p) = \sum_x p(x) \cdot Info(x) = -\sum_x p(x) \log_2 p(x) \]

Let's get to the main attraction: cross entropy is the expected number of bits needed to encode data from one distribution using another. So this is where our model's prediction of the weather comes in. Maybe we live in a temperate place, so we think P(sunny) = 9/10. With our predicted distribution, this would require us to encode a "sunny" outcome with \(-\log_2(9/10) \approx 0.15\) bits. Since the actual probability it's sunny is still ¾, the expected amount of information is ¾ * 0.15 bits for the sunny outcome.
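To tie the numbers together, here is a minimal Python sketch (our own illustration, not the post's interactive demo; the names `p`, `q`, and `info` are ours) that recomputes the quantities above and, continuing the sum over both outcomes, the full cross entropy of the true ¾/¼ weather under the predicted 9/10 model:

```python
import math

# True weather distribution from the running example: ¾ sunny, ¼ rainy.
p = {"sunny": 3 / 4, "rainy": 1 / 4}
# Our model's predicted distribution for the temperate place: 9/10 sunny.
q = {"sunny": 9 / 10, "rainy": 1 / 10}

def info(prob: float) -> float:
    """Information, in bits, gained by observing an outcome with this probability."""
    return -math.log2(prob)

# Information of each individual outcome under the true distribution p.
print(f"Info(rainy) = {info(p['rainy']):.3f} bits")   # 2.000 (the '11' message)
print(f"Info(sunny) = {info(p['sunny']):.3f} bits")   # ~0.415

# Entropy: expected information when outcomes really follow p.
entropy = sum(p[w] * info(p[w]) for w in p)
print(f"H(p)    = {entropy:.3f} bits")                # ~0.811

# Cross entropy: expected bits when outcomes follow p but we encode
# them using code lengths derived from the predicted distribution q.
cross_entropy = sum(p[w] * info(q[w]) for w in p)
print(f"H(p, q) = {cross_entropy:.3f} bits")          # ~0.944
```

Note that the cross entropy (about 0.944 bits) comes out higher than the entropy (about 0.811 bits): encoding the weather with the predicted distribution costs at least as many bits as encoding it with the true distribution, and the gap between the two is exactly the KL divergence.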