MLLabelUtils.jl’s documentation

This package represents a community effort to provide the necessary functionality for interpreting class-predictions, as well as converting classification targets from one encoding to another. As such it is part of the JuliaML ecosystem.

The main intent of this package is to be a light-weight back-end for other JuliaML packages that deal with classification problems. In particular, this library is designed for package developers who require their classification targets to be in a specific format. To that end, the core focus of this package is to provide all the tools needed to deal with classification targets of arbitrary format. This includes asserting whether the targets are of a desired encoding, inferring the concrete encoding the targets are in and how many classes they represent, and converting from their native encoding to the desired one.

From an end-user’s perspective one normally does not need to import this package directly. That said, some functionality (in particular convertlabel()) can also be useful to end-users who code their own special Machine Learning scripts.

Where to begin?

If this is the first time you are considering MLLabelUtils for your machine learning experiments or packages, make sure to check out the “Getting Started” section, specifically “How to …?”, which lists some of the most common scenarios and links to the appropriate places that should guide you on how to approach these scenarios using the functionality provided by this or other packages.

Getting Started

MLLabelUtils is the result of a collaborative effort to design an efficient but also convenient-to-use library for working with the most commonly utilized class-label encodings in Machine Learning. As such, this package provides functionality to derive or assert properties about some label-encoding or target array, as well as the functions needed to convert given targets into a different format.

Installation

To install MLLabelUtils.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.

Pkg.add("MLLabelUtils")

Additionally, if you encounter any sudden issues, or if you would like to contribute to the package, you can manually choose to be on the latest (untagged) version.

Pkg.checkout("MLLabelUtils")

Overview

Let us take a look at some examples (with only minor explanation) to get a feeling for what one can do with this package. Once installed the package can be imported just as any other Julia package.

using MLLabelUtils

For starters, the library provides a few utility functions to compute various properties of the target array. These include the number of labels (see nlabel()), the labels themselves (see label()), and a mapping from label to the elements of the target array (see labelmap() and labelfreq()).

julia> true_targets = [0, 1, 1, 0, 0];

julia> label(true_targets)
2-element Array{Int64,1}:
 1
 0

julia> nlabel(true_targets)
2

julia> labelmap(true_targets)
Dict{Int64,Array{Int64,1}} with 2 entries:
  0 => [1,4,5]
  1 => [2,3]

julia> labelfreq(true_targets)
Dict{Int64,Int64} with 2 entries:
  0 => 3
  1 => 2

Tip

Because labelfreq() utilizes a Dict to store its result, it is straightforward to visualize the class distribution (using the absolute frequencies) right in the REPL using the UnicodePlots.jl package.

julia> using UnicodePlots
julia> barplot(labelfreq([:yes,:no,:no,:maybe,:yes,:yes]), symb="#")
#        ┌────────────────────────────────────────┐
#    yes │##################################### 3 │
#  maybe │############ 1                          │
#     no │######################### 2             │
#        └────────────────────────────────────────┘

If you find yourself writing some custom function that is intended to train some specific supervised model, chances are that you want to assert if the given targets are in the correct encoding that the model requires. We provide a few functions for such a scenario, namely labelenc() and islabelenc().

julia> true_targets = [0, 1, 1, 0, 0];

julia> labelenc(true_targets) # determine encoding using heuristics
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.5)

julia> islabelenc(true_targets, LabelEnc.ZeroOne)
true

julia> islabelenc(true_targets, LabelEnc.ZeroOne(Int))
true

julia> islabelenc(true_targets, LabelEnc.ZeroOne(Float32))
false

julia> islabelenc(true_targets, LabelEnc.MarginBased)
false

In the case that it turns out the given targets are in the wrong encoding you may want to convert them into the format you require. For that purpose we expose the function convertlabel().

julia> true_targets = [0, 1, 1, 0, 0];

julia> convertlabel(LabelEnc.MarginBased, true_targets)
5-element Array{Int64,1}:
 -1
  1
  1
 -1
 -1

julia> convertlabel(LabelEnc.MarginBased(Float64), true_targets)
5-element Array{Float64,1}:
 -1.0
  1.0
  1.0
 -1.0
 -1.0

julia> convertlabel([:yes,:no], true_targets)
5-element Array{Symbol,1}:
 :no
 :yes
 :yes
 :no
 :no

julia> convertlabel(LabelEnc.OneOfK, true_targets)
2×5 Array{Int64,2}:
 0  1  1  0  0
 1  0  0  1  1

julia> convertlabel(LabelEnc.OneOfK{Bool}, true_targets)
2×5 Array{Bool,2}:
 false   true   true  false  false
  true  false  false   true   true

julia> convertlabel(LabelEnc.OneOfK{Float64}, true_targets, obsdim=1)
5×2 Array{Float64,2}:
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0
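
To make the one-of-k layout concrete, the following plain-Julia sketch builds the same 2×5 matrix by hand. It is not the package's implementation: the helper name sketch_onehot and its explicit labels argument are assumptions made for illustration, with the positive label listed first and the observations stored along the last dimension.

```julia
# Hypothetical helper (not part of MLLabelUtils): builds a K×N one-hot matrix
# in which row k marks the observations whose target equals labels[k].
function sketch_onehot(targets::AbstractVector, labels::AbstractVector)
    mat = zeros(Int, length(labels), length(targets))
    for (i, t) in enumerate(targets)
        k = findfirst(isequal(t), labels)
        k === nothing && throw(ArgumentError("unknown label: $t"))
        mat[k, i] = 1
    end
    return mat
end

# Positive label (1) first, negative label (0) second.
onehot = sketch_onehot([0, 1, 1, 0, 0], [1, 0])
```

Each column contains exactly one non-zero entry, and its row index identifies the label of that observation.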

It may be interesting to point out explicitly that we provide LabelEnc.OneVsRest to conveniently convert a multi-class problem into a two-class problem.

julia> convertlabel(LabelEnc.OneVsRest(:yes), [:yes,:no,:no,:maybe,:yes,:yes])
6-element Array{Symbol,1}:
 :yes
 :not_yes
 :not_yes
 :not_yes
 :yes
 :yes

julia> convertlabel(LabelEnc.ZeroOne, [:yes,:no,:no,:maybe,:yes,:yes], LabelEnc.OneVsRest(:yes))
6-element Array{Float64,1}:
 1.0
 0.0
 0.0
 0.0
 1.0
 1.0
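
Conceptually, a one-vs-rest conversion only asks whether a target equals the chosen positive label. The sketch below shows that idea in plain Julia; sketch_one_vs_rest is a hypothetical name and the code is an illustration under that assumption, not how the package implements LabelEnc.OneVsRest.

```julia
# Hypothetical helper: collapse a multi-class target vector into
# "positive label vs. everything else", encoded as 1.0 / 0.0.
sketch_one_vs_rest(targets, positive) = [t == positive ? 1.0 : 0.0 for t in targets]

binary = sketch_one_vs_rest([:yes, :no, :no, :maybe, :yes, :yes], :yes)
```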

Some encodings come with an implicit contract of what the raw predictions of some model should look like and how to classify a raw prediction into a predicted class-label. For that purpose we provide the function classify() and its mutating version classify!().

For LabelEnc.ZeroOne the convention is that the raw prediction is between 0 and 1 and represents a degree of certainty that the observation is of the positive class. That means that in order to classify a raw prediction to either positive or negative, one needs to define a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to classify as positive.

julia> classify(0.3f0, 0.5); # equivalent to below
julia> classify(0.3f0, LabelEnc.ZeroOne) # preserves type
0.0f0

julia> classify(0.3f0, LabelEnc.ZeroOne(0.5)) # defaults to Float64
0.0

julia> classify(0.3f0, LabelEnc.ZeroOne(Int,0.2))
1

julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4))
2-element Array{Int64,1}:
 0
 1
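
The threshold rule itself is a single comparison. The following sketch is a plain-Julia illustration with a hypothetical helper name; treating a prediction that equals the threshold as positive is an assumption here, since the text above does not specify how ties are handled.

```julia
# Hypothetical helper mirroring the zero-one convention: a raw prediction at
# or above the threshold is classified as positive (1), otherwise negative (0).
# Counting an exact tie as positive is an assumption of this sketch.
sketch_classify01(pred::Real, threshold::Real) = pred >= threshold ? 1 : 0

low_certainty  = sketch_classify01(0.3, 0.5)            # 0
lenient_cutoff = sketch_classify01(0.3, 0.2)            # 1
batch          = sketch_classify01.([0.3, 0.5], 0.4)    # [0, 1]
```

Lowering the threshold makes the classifier more willing to predict the positive class, which is why the same raw prediction 0.3 ends up on different sides in the first two calls.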

For LabelEnc.MarginBased on the other hand the decision boundary is predefined at 0, meaning that any raw prediction greater than or equal to zero is considered a positive prediction, while any negative raw prediction is considered a negative prediction.

julia> classify(0.3f0, LabelEnc.MarginBased) # preserves type
1.0f0

julia> classify(-0.3f0, LabelEnc.MarginBased()) # defaults to Float64
-1.0

julia> classify.([-2.3,6.5], LabelEnc.MarginBased(Int))
2-element Array{Int64,1}:
 -1
  1
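
Because the boundary is fixed at zero, the margin-based rule needs no parameter. A minimal plain-Julia sketch of the rule described above, using a hypothetical helper name, could look like this:

```julia
# Hypothetical helper: raw predictions greater than or equal to zero are
# positive (1), strictly negative raw predictions are negative (-1).
sketch_classify_margin(pred::Real) = pred >= 0 ? 1 : -1

margin_batch = sketch_classify_margin.([-2.3, 6.5])   # [-1, 1]
on_boundary  = sketch_classify_margin(0.0)            # 1
```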

The encoding LabelEnc.OneOfK is special in that it is matrix-based and thus there exists the concept of ObsDim, i.e. the freedom to choose which array dimension denotes the observations. The classified prediction will be the index of the largest element of an observation. By default the “obsdim” is defined as the last array dimension.

julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
 0.1  0.4  0.3  0.2
 0.8  0.3  0.6  0.2
 0.1  0.3  0.1  0.6

julia> classify(pred_output, LabelEnc.OneOfK)
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify(pred_output', LabelEnc.OneOfK, obsdim=1) # note the transpose
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
3
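
The rule "index of the largest element of an observation" can be sketched in plain Julia as an argmax over each observation. The helper name and its integer obsdim keyword are assumptions for illustration only, and the code relies on eachcol and eachrow, which require Julia 1.1 or later. Ties are resolved by Julia's argmax (first maximum); whether the package resolves them the same way is not stated here.

```julia
# Hypothetical helper: for every observation, return the index of its largest
# raw prediction. `obsdim` names the matrix dimension that holds observations.
function sketch_classify_onehot(pred::AbstractMatrix; obsdim::Int = 2)
    obs = obsdim == 2 ? eachcol(pred) : eachrow(pred)
    return [argmax(o) for o in obs]
end

pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]

by_columns = sketch_classify_onehot(pred_output)                            # [2, 1, 2, 3]
by_rows    = sketch_classify_onehot(permutedims(pred_output), obsdim = 1)   # [2, 1, 2, 3]
```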

Getting Help

To get help on specific functionality you can either look up the information here, or if you prefer you can make use of Julia’s native doc-system. The following example shows how to get additional information on LabelEnc.OneOfK within Julia’s REPL:

?LabelEnc.OneOfK

If you find yourself stuck or have other questions concerning the package, you can find us on Gitter or in the Machine Learning domain on discourse.julialang.org.

If you encounter a bug or would like to participate in the further development of this package, come find us on GitHub.

API Documentation

This section gives a more detailed treatment of all the exposed functions and their available methods. We start by discussing what we mean by terms such as “classification targets” and the available functionality to compute properties about them.

Classification Targets

In this section we will outline the functionality that this package provides in order to work with classification targets. We will start by discussing the terms we use and how they are used in the context of this package.

Terms and Definitions

In a classification setting one usually treats the desired output variable (also called ground truths, or targets) as a discrete categorical variable. That is true even if the values themselves are of numerical type, which they often are for practical reasons.

We use the term targets when we talk about concrete data. Concretely, targets are the desired output of some dataset and are themselves also part of the dataset. If a dataset includes targets we call it labeled data. In a labeled dataset, each observation has its own target. Thus we have as many targets as we have observations, as the target is treated as a part of each observation.

Tip

Let us look at an example of what targets could look like and how they relate to some dataset, or in this case data subset. The following code snippet loads the first 3 observations of the iris dataset using the RDatasets package.

julia> using RDatasets
julia> iris = head(dataset("datasets", "iris"), 3)
3×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species  │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ "setosa" │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ "setosa" │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ "setosa" │

For this data subset the targets would be ["setosa","setosa","setosa"]. Note how only one of the three available classes of the dataset is represented here.

The term “target” itself applies to both regression and classification scenarios. In a classification setting (which is the domain that this package operates in) the targets are treated as a discrete categorical variable. If the classification targets can just take one of two values, we call the classification problem binary, or two-class.

For our purposes we treat the term “class” as an abstract concept with little to no practical appearance in the functionality provided by this package. In essence we think about a class as the abstract interpretation behind some concrete value. For example: Let’s say we try to predict if a tumor is malignant or benign. The two classes could then be described as “malignant tumor” and “benign tumor”. One could argue that we could translate these abstract concepts into a string or symbol quite easily and thus make it concrete, but that is not the point. The point is that the concrete interpretation behind the prediction targets is of little consequence for the library and as such it should not talk about it.

Instead, this library cares about representation. The representation can vary a lot from one model to another, while the “class” remains the same. For example, some models require the targets in the form of numbers in the set \(\{1,0\}\), others in \(\{1,-1\}\) etc.

We call a concrete and consistent representation of a single class a label. That implies that each class should consistently be represented by a single label. What a label looks like is completely up to the user, but there are some forms that are more common than others. A convention of what labels to use to represent a fixed number of classes will be referred to as a label-encoding, or encoding for short.

Note

To be fair, the term “class-encoding” would be more appropriate. However, when considering that we need to use the defined terms for naming the functions and types, it seemed more reasonable (and user-friendly) to keep the list of utilized domain-specific words small and consistent.

Determine the Labels

Now that we have settled on the terminology, let us investigate what kind of functionality this package provides to work with classification targets. The first thing we may be interested in is determining what kind of labels we are working with when presented with some targets.

In general we try to make few assumptions about the type of the object containing the targets, just that it supports unique. The functions listed here do, however, expect the object containing the targets to include all possible labels of the classification scenario.

label(iter) → Vector

Returns the labels represented in the given iterator iter. Note that the order of the resulting labels matters in general, because other functions expect the first label to denote the positive label for binary classification problems. Thus, for consistency reasons there are some heuristics involved that try to guarantee this for the common encodings.

Parameters:
  • iter (Any) – Any object for which the type either implements the iterator interface, or which provides a custom implementation for unique.
Returns:

The unique labels in the form of a vector. In the case of two labels, the first element will represent the positive label and the second element the negative label respectively.

julia> label([:yes,:no,:no,:maybe,:yes,:no])
3-element Array{Symbol,1}:
 :yes
 :no
 :maybe

julia> label([-1,1,1,-1,1])
2-element Array{Int64,1}:
  1
 -1

As described above, we may change the order of the result of unique for consistency reasons in those cases where the labels describe a common binary label-encoding. The reason for this is that we want the first element to denote the positive label. The following example highlights the different results for unique and label() in the case of targets in “zero-one” form.

julia> unique([0,1,0,0,1])
2-element Array{Int64,1}:
 0
 1

julia> label([0,1,0,0,1])
2-element Array{Int64,1}:
 1
 0
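
One way to picture such a reordering is sketched below in plain Julia. The rule used here (for exactly two numeric labels, put the larger value first) is purely an assumption that happens to reproduce the examples in this section; it is not claimed to be the heuristic the package actually applies, and sketch_label is a hypothetical name.

```julia
# Hypothetical heuristic (an assumption, not the package's actual rule):
# keep the order produced by `unique`, except that exactly two numeric labels
# are sorted in descending order so the larger one acts as the positive label.
function sketch_label(targets)
    lbls = unique(targets)
    if length(lbls) == 2 && eltype(lbls) <: Real
        sort!(lbls, rev = true)
    end
    return lbls
end

zero_one = sketch_label([0, 1, 0, 0, 1])                         # [1, 0]
margin   = sketch_label([-1, 1, 1, -1, 1])                       # [1, -1]
symbols  = sketch_label([:yes, :no, :no, :maybe, :yes, :no])     # order of first appearance
```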

While the generic iterator implementation covers most cases, we do selectively treat some iterators (such as Dict) differently, or even disallow some completely (such as any AbstractArray that has more than two dimensions).

label(dict) → Vector

Returns the keys of the dictionary in the form of a vector. The reasoning behind this convention for how to interpret the content of a Dict is that we utilize dictionaries to store label-specific information, such as the class-frequency (see labelfreq()).

Note again, that for consistency reasons there are heuristics in place that try to enforce the correct label-order for numeric label-vectors that have exactly two elements.

Parameters:
  • dict (Dict) – Any Julia dictionary.
Returns:

The unique labels in the form of a vector. In the case of two labels, the first element will represent the positive label and the second element the negative label respectively.

We also treat matrices in a special way. The reason for this is that for our purposes it is not their values that encode the information about the labels, but their structure.

label(mat[, obsdim]) → Vector

Returns a vector that enumerates the dimension of the given matrix mat that does not denote the observations. In other words it returns the indices of that dimension.

Parameters:
  • mat (AbstractMatrix) – A numeric array that is assumed to be in the form of a one-hot encoding or similar.
  • obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword (Note: for this method the return-value will be type-stable either way). Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

A vector of indices that enumerate the particular dimension of mat that does not denote the observations.

julia> label([0 1 0 0; 1 0 1 0; 0 0 0 1])
3-element Array{Int64,1}:
 1
 2
 3

julia> label([0 1 0; 1 0 0; 0 1 0; 0 0 1], obsdim = 1)
3-element Array{Int64,1}:
 1
 2
 3

julia> label([0 1 0; 1 0 0; 0 1 0; 0 0 1], ObsDim.First()) # positional obsdim
3-element Array{Int64,1}:
 1
 2
 3

For convenience one can also just query for the label that corresponds to the positive class or the negative class respectively. These helper functions check if the given targets contain exactly two unique labels and will throw an ArgumentError if this assumption is violated.

poslabel(iter) → eltype(iter)

If label() returns a vector of length = 2, then this function will return the first element of it, which denotes the positive label. Otherwise an error will be thrown.

julia> poslabel([-1,1,1,-1,1])
1

julia> poslabel([:yes,:no,:no,:maybe,:yes,:no])
ERROR: ArgumentError: The given object has more or less than two labels, thus poslabel is not defined.

neglabel(iter) → eltype(iter)

If label() returns a vector of length = 2, then this function will return the second element of it, which denotes the negative label. Otherwise an error will be thrown.

julia> neglabel([-1,1,1,-1,1])
-1

julia> neglabel([:yes,:no,:no,:maybe,:yes,:no])
ERROR: ArgumentError: The given object has more or less than two labels, thus neglabel is not defined.

Number of Labels

We can compute the number of unique labels using nlabel(). It works by first computing the labels and then counting them. As such it has the same restrictions as label().

nlabel(iter) → Int

Returns the number of labels represented in the given iterator iter. It uses the function label() internally, so the same properties and restrictions apply.

Parameters:
  • iter (Any) – Any object for which the function label() is implemented.

julia> nlabel([:yes,:no,:no,:maybe,:yes,:no])
3

julia> nlabel([-1,1,1,-1,1])
2

Mapping Labels to Observations

In many classification scenarios we have to deal with what is called an imbalanced class distribution. In essence that means that some classes are represented more often in a given dataset than the other classes. While we won’t go into detail about the implications of such a scenario, the key takeaway is that there exist strategies to deal with those situations by using information about how the class-labels are distributed. More importantly, some of them require a mapping from each label to all the observations that have that label as target. We call such a mapping from labels to observation-indices a label-map.

labelmap(iter) → Dict

Computes a mapping from the labels in iter to all the individual element-indices in iter that correspond to that label. Note that there is actually no check or requirement that iter must implement length or getindex. Instead, it is assumed that the first element of the iterator has the index 1 and the indices are incremented by 1 with each element of the iterator.

Parameters:
  • iter (Any) – Any object for which the type implements the iterator interface.
Returns:

A dictionary that for each label as key, has a vector as value that contains all indices of the observations that observed that label.

julia> labelmap([0, 1, 1, 0, 0])
Dict{Int64,Array{Int64,1}} with 2 entries:
  0 => [1,4,5]
  1 => [2,3]

julia> labelmap([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Array{Int64,1}} with 3 entries:
  :yes   => [1,5]
  :maybe => [4]
  :no    => [2,3,6]
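
The indexing convention described above (the first element has index 1 and each following element increments it by 1) is exactly what enumerate provides. The following plain-Julia sketch illustrates how such a label-map can be built; sketch_labelmap is a hypothetical name and the code is not taken from the package.

```julia
# Hypothetical helper: map every label to the positions at which it occurs.
# Only iteration is required; `enumerate` supplies the indices 1, 2, 3, ...
function sketch_labelmap(iter)
    lm = Dict{eltype(iter),Vector{Int}}()
    for (idx, lbl) in enumerate(iter)
        # create an empty index list the first time a label is seen
        push!(get!(() -> Int[], lm, lbl), idx)
    end
    return lm
end

lm_numbers = sketch_labelmap([0, 1, 1, 0, 0])
lm_symbols = sketch_labelmap([:yes, :no, :no, :maybe, :yes, :no])
```

Since the result is a Dict, the order in which the entries are displayed is not meaningful; only the index vectors themselves are ordered.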

We also provide a mutating version to update an existing label-map. In those cases we also have to specify the index or indices of the new observation(s).

labelmap!(dict, idx, elem) → Dict

Updates the given label-map dict with the new element elem, which is assumed to be associated with the index idx. Note that the given index is not checked for being a duplicate.

Parameters:
  • dict (Dict) – The dictionary that may or may not already contain an existing label-mapping. It will be updated with the new element.
  • idx (Int) – The observation-index that elem corresponds to in the context of the overall dataset.
  • elem (Any) – The new target of the observation denoted by idx. It is expected to be in the form of a label.
Returns:

Returns the mutated dict for convenience.

julia> lm = labelmap([0, 1, 1, 0, 0])
Dict{Int64,Array{Int64,1}} with 2 entries:
  0 => [1,4,5]
  1 => [2,3]

julia> labelmap!(lm, 6, 0)
Dict{Int64,Array{Int64,1}} with 2 entries:
  0 => [1,4,5,6]
  1 => [2,3]

labelmap!(dict, indices, iter) → Dict

Updates the given label-map dict with the new elements in the given iterator iter. Each element in iter is assumed to be associated with the corresponding index in indices. This implies that both iter and indices must provide the same number of elements. Note that the given indices are not checked for being duplicates.

Parameters:
  • dict (Dict) – The dictionary that may or may not already contain an existing label-mapping. It will be updated with the new elements in iter.
  • indices (AbstractVector{Int}) – The indices for each element in iter.
  • iter (Any) – Any object for which the type implements the iterator interface.
Returns:

Returns the mutated dict for convenience.

julia> lm = labelmap([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Array{Int64,1}} with 3 entries:
  :yes   => [1,5]
  :maybe => [4]
  :no    => [2,3,6]

julia> labelmap!(lm, 7:8, [:no,:maybe])
Dict{Symbol,Array{Int64,1}} with 3 entries:
  :yes   => [1,5]
  :maybe => [4,8]
  :no    => [2,3,6,7]

There also is a convenience function to reverse a labelmap into a label vector.

labelmap2vec(dict) → Vector

Computes a Vector of labels by element-wise traversal of the entries in dict.

Parameters:
  • dict (Dict{T, Vector{Int}}) – A label-map with labels of type T.
Returns:

A Vector{T} of labels.

julia> labelvec = [:yes,:no,:no,:yes,:yes];

julia> lm = labelmap(labelvec)
Dict{Symbol,Array{Int64,1}} with 2 entries:
    :yes => [1, 4, 5]
    :no  => [2, 3]

julia> labelmap2vec(lm)
5-element Array{Symbol,1}:
    :yes
    :no
    :no
    :yes
    :yes
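
Reversing a label-map amounts to writing each label back to the positions stored for it. The plain-Julia sketch below illustrates this under the assumption that the stored indices cover 1 to n without gaps or duplicates; sketch_labelmap2vec is a hypothetical name, not the package's implementation.

```julia
# Hypothetical helper: rebuild the target vector from a label-map.
# Assumes the indices in `lm` are exactly 1:n, each occurring once.
function sketch_labelmap2vec(lm::Dict{T,Vector{Int}}) where {T}
    n = mapreduce(length, +, values(lm); init = 0)
    vec = Vector{T}(undef, n)
    for (lbl, idxs) in lm, i in idxs
        vec[i] = lbl
    end
    return vec
end

lm = Dict(:yes => [1, 4, 5], :no => [2, 3])
restored = sketch_labelmap2vec(lm)
```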

Frequency of Labels

Another useful piece of information to compute is the absolute frequency of each label in the dataset of interest. In contrast to labelmap(), this function does not care about indices but instead simply counts occurrences. We call such a dictionary a frequency-map.

labelfreq(iter) → Dict

Computes the absolute frequencies for each label in iter and adds it as a key (label) value (count) pair to the resulting dictionary.

Parameters:
  • iter (Any) – Any object for which the type implements the iterator interface.
Returns:

A dictionary that for each label as key, has an Int as value that denotes how often the corresponding label was encountered in iter.

julia> labelfreq([0, 1, 1, 0, 0])
Dict{Int64,Int64} with 2 entries:
  0 => 3
  1 => 2

julia> labelfreq([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Int64} with 3 entries:
  :yes   => 2
  :maybe => 1
  :no    => 3

If you have already created a mapping using labelmap(), then you can reuse that dictionary to compute the frequencies more efficiently.

labelfreq(dict) → Dict

Converts a label-map to a frequency map by counting the number of indices associated with each label.

Parameters:
  • dict (Dict) – A dictionary produced by labelmap().
Returns:

A dictionary that for each label as key, has an Int as value that denotes how many indices were stored in dict for the corresponding label.

julia> lm = labelmap([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Array{Int64,1}} with 3 entries:
  :yes   => [1,5]
  :maybe => [4]
  :no    => [2,3,6]

julia> labelfreq(lm)
Dict{Symbol,Int64} with 3 entries:
  :yes   => 2
  :maybe => 1
  :no    => 3
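
The reason this is cheaper than recounting is that each frequency is just the length of an index vector that already exists. A minimal plain-Julia sketch of that conversion, with a hypothetical helper name, is:

```julia
# Hypothetical helper: turn a label-map into a frequency-map by taking the
# length of each stored index vector; the original targets are not revisited.
sketch_labelfreq(lm::Dict) = Dict(lbl => length(idxs) for (lbl, idxs) in lm)

label_map = Dict(:yes => [1, 5], :maybe => [4], :no => [2, 3, 6])
freq_from_map = sketch_labelfreq(label_map)
```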

For some data sources it may not be useful or even possible to associate an observation with an index (e.g. streaming data). For such cases it may still prove useful to continuously keep track of the number of times each label was encountered. To that end we provide a mutating version that updates a frequency-map in-place.

labelfreq!(dict, iter) → Dict

Updates the given frequency-map dict with the number of times each label occurs in the given iterator iter. Note that these occurrences are added to the current values.

Parameters:
  • dict (Dict) – The dictionary that may or may not already contain existing frequency information. It will be updated with the new elements in iter.
  • iter (Any) – Any object for which the type implements the iterator interface.
Returns:

Returns the mutated dict for convenience.

julia> lf = labelfreq([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Int64} with 3 entries:
  :yes   => 2
  :maybe => 1
  :no    => 3

julia> labelfreq!(lf, [:no,:maybe])
Dict{Symbol,Int64} with 3 entries:
  :yes   => 2
  :maybe => 2
  :no    => 4
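
For a streaming source, the same in-place update can be applied batch by batch. The sketch below imitates that pattern in plain Julia; sketch_labelfreq! and the three hard-coded batches are assumptions standing in for the package function and a real data stream.

```julia
# Hypothetical helper: add the label counts of `iter` to an existing
# frequency-map and return the mutated dictionary.
function sketch_labelfreq!(freq::Dict, iter)
    for lbl in iter
        freq[lbl] = get(freq, lbl, 0) + 1
    end
    return freq
end

# Stand-in for a stream: observations arrive in small batches.
freq = Dict{Symbol,Int}()
for batch in ([:yes, :no, :no], [:maybe, :yes, :no], [:no, :maybe])
    sketch_labelfreq!(freq, batch)
end
```

Only the running counts are kept in memory, so the stream never has to be stored or indexed.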

Next we focus on label-encodings. We will show how to create them and how they can be used to transform classification targets from one encoding-convention to another. Some even define methods for a classification function that can be used to transform raw model-predictions into a class-label.

Working with Encodings

Now that we have an understanding of how to extract the label-related information from our targets, let us consider how to instantiate (or infer) a label-encoding, and what we can do with it once we have one. In particular, these encodings will enable us to transform the targets from one representation into another without losing the ability to convert them back afterwards.

Inferring the Encoding

In many cases we may not want to simply assume or guess the particular encoding that some user-provided targets are in. Instead we would rather let the targets themselves inform us what encoding they are using. To that end we provide the function labelenc().

labelenc(vec) → LabelEncoding

Tries to determine the most appropriate label-encoding to describe the given vector vec, based on the result of label(vec). Note that in most cases this function is not type-stable, because the eltype of vec is usually not enough to infer the encoding or number of labels reliably.

Parameters:
  • vec (AbstractVector) – The classification targets in vector form.
Returns:

The label-encoding that is deemed most appropriate to describe the values found in vec.

julia> labelenc([:yes,:no,:no,:maybe,:yes,:no])
MLLabelUtils.LabelEnc.NativeLabels{Symbol,3}(Symbol[:yes,:no,:maybe],Dict(:yes=>1,:maybe=>3,:no=>2))

julia> labelenc([-1,1,1,-1,1])
MLLabelUtils.LabelEnc.MarginBased{Int64}()

julia> labelenc(UInt8[0,1,1,0,1])
MLLabelUtils.LabelEnc.ZeroOne{UInt8,Float64}(0.5)

julia> labelenc([false,true,true,false,true])
MLLabelUtils.LabelEnc.TrueFalse()
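
The kind of value-based reasoning involved can be illustrated with a small plain-Julia sketch. The rules below are assumptions chosen only to reproduce the four examples above, they return plain symbols instead of encoding objects, and they ignore every encoding not shown in those examples; the package's actual heuristics may differ.

```julia
# Hypothetical heuristic (not the package's implementation): guess the name
# of an encoding family from the element type and the unique values.
function sketch_infer_encoding(vec::AbstractVector)
    lbls = unique(vec)
    if eltype(vec) == Bool
        return :TrueFalse
    elseif eltype(vec) <: Real && length(lbls) == 2
        all(l -> l == 0 || l == 1, lbls) && return :ZeroOne
        all(l -> l == -1 || l == 1, lbls) && return :MarginBased
    end
    return :NativeLabels   # fall back to using the values as they are
end

guesses = [
    sketch_infer_encoding([:yes, :no, :no, :maybe, :yes, :no]),
    sketch_infer_encoding([-1, 1, 1, -1, 1]),
    sketch_infer_encoding(UInt8[0, 1, 1, 0, 1]),
    sketch_infer_encoding([false, true, true, false, true]),
]
```

The result depends on the values and not just on the element type, which is why such a function cannot be type-stable when it returns differently typed encoding objects.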

For matrices we allow an additional (but optional) parameter with which the user can specify the array dimension that denotes the observations.

labelenc(mat[, obsdim]) → LabelEncoding

Computes the concrete matrix-based label-encoding that is used, by determining the size of the matrix for the dimension that is not used for denoting the observations.

Parameters:
  • mat (AbstractMatrix) – A numeric matrix that is assumed to be in the form of a one-hot encoding or similar.
  • obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

The label-encoding that is deemed most appropriate to describe the structure and values found in mat.

julia> labelenc([0 1 0 0; 1 0 1 0; 0 0 0 1])
MLLabelUtils.LabelEnc.OneOfK{Int64,3}()

julia> labelenc(Float32[0 1; 1 0; 0 1; 0 1], obsdim = 1)
MLLabelUtils.LabelEnc.OneOfK{Float32,2}()

Asserting Assumptions

When writing a function that requires the classification targets to be in a specific encoding (for example \(\{1, -1\}\) in the case of SVMs), it can be useful to check if the user-provided targets are already in the appropriate encoding, or if they first have to be converted. To check if the targets are of a specific encoding, or family of encodings, we provide the function islabelenc().

islabelenc(vec, encoding) → Bool

Checks if the given values in vec can be described as being produced by the given encoding. This function checks not only the values but also their type. Furthermore, it checks if the total number of labels is appropriate for what the encoding expects it to be.

Parameters:
  • vec (AbstractVector) – The classification targets in vector form.
  • encoding (LabelEncoding) – A concrete instance of a label-encoding that one wants to work with.
Returns:

True, if both the values in vec as well as their types are consistent with the given encoding.

julia> islabelenc([0,1,1,0,1], LabelEnc.ZeroOne(Int))
true

julia> islabelenc([0,1,1,0,1], LabelEnc.ZeroOne(Float64))
false

julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased(Int))
false

julia> islabelenc(Int8[-1,1,1,-1,1], LabelEnc.MarginBased(Int8))
true

julia> islabelenc(Int8[-1,1,1,-1,1], LabelEnc.MarginBased(Int16))
false

julia> islabelenc([2,1,2,3,1], LabelEnc.Indices(Int,3))
true

julia> islabelenc([2,1,2,3,1], LabelEnc.Indices(Int,4)) # it allows missing labels
true

julia> islabelenc([2,1,2,3,1], LabelEnc.Indices(Int,2)) # more labels than expected
false
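
For the index-based examples above, the three checks (element type, values, number of labels) can be sketched in plain Julia as follows. The helper name and signature are hypothetical and the code is only an illustration of the idea, not the package's implementation.

```julia
# Hypothetical check in the spirit of LabelEnc.Indices(T, K): the element
# type must match exactly and every value must be a valid index in 1:K.
# Labels that never occur are allowed; values above K are not.
sketch_isindices(vec::AbstractVector, ::Type{T}, K::Integer) where {T} =
    eltype(vec) == T && all(x -> 1 <= x <= K, vec)

exact_fit     = sketch_isindices([2, 1, 2, 3, 1], Int, 3)       # true
missing_label = sketch_isindices([2, 1, 2, 3, 1], Int, 4)       # true
too_many      = sketch_isindices([2, 1, 2, 3, 1], Int, 2)       # false
wrong_eltype  = sketch_isindices(Int8[2, 1, 2, 3, 1], Int, 3)   # false
```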

Similar to label() we treat matrices in a special way to account for the fact that information about the number of labels is contained in the size of a matrix and not its values. Additionally the user has the freedom to choose which matrix dimension denotes the observations.

islabelenc(mat, encoding[, obsdim]) → Bool

Checks if the values and the structure of the given matrix mat are consistent with the specified encoding. This function also checks for the correct type and dimensions.

Parameters:
  • mat (AbstractMatrix) – The classification targets in matrix form.
  • encoding (LabelEncoding) – A concrete instance of a matrix-based label-encoding that one wants to work with.
  • obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

True, if the values in mat, its eltype, and the shape of mat are consistent with the given encoding.

julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int,3))
true

julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int8,3))
false

julia> islabelenc([1 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int,3)) # matrix is not one-hot
false

julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int,4)) # only 3 rows
false

julia> islabelenc([0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK(Int,2), obsdim = 1)
true

julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK(Int,2), obsdim = 1)
false

julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK(UInt8,2), obsdim = 1)
true
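
The structural part of this check, namely that every observation is a valid one-hot vector, can be sketched in plain Julia as shown below. The helper name and its integer obsdim keyword are assumptions; the sketch deliberately ignores the element type and the expected number of labels, which the real function also verifies. It uses eachcol and eachrow and therefore requires Julia 1.1 or later.

```julia
# Hypothetical helper: true if every observation contains exactly one entry
# equal to one and all remaining entries equal to zero.
function sketch_isonehot(mat::AbstractMatrix; obsdim::Int = 2)
    obs = obsdim == 2 ? eachcol(mat) : eachrow(mat)
    return all(o -> count(isone, o) == 1 && count(iszero, o) == length(o) - 1, obs)
end

valid_cols = sketch_isonehot([0 1 0 0; 1 0 1 0; 0 0 0 1])           # true
two_ones   = sketch_isonehot([1 1 0 0; 1 0 1 0; 0 0 0 1])           # false, first column has two ones
valid_rows = sketch_isonehot([0 1; 1 0; 0 1; 0 1], obsdim = 1)      # true
```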

So far islabelenc() was very restrictive concerning the element types of the given target array. In many cases, however, we may not actually care too much about the concrete numeric type but only if the encoding-scheme itself is followed. In fact we usually don’t want to be restrictive about concrete types at all, since we have Julia’s multiple-dispatch system to take care of that later on. In other words we may be more interested in asserting if the labels of the given targets belong to a family of possible label-encodings.

islabelenc(vec, type) → Bool

Checks if the given values in vec can be described as being produced by any possible instance of the given type. In other words, this function checks if the labels in vec can be described as being consistent with the family of label-encodings specified by type. This means that the check is much more tolerant concerning the eltype and the total number of labels, since some families of encodings are appropriate for any number of labels.

Parameters:
  • vec (AbstractVector) – The classification targets in vector form.
  • type (DataType) – Any subtype of LabelEncoding{T,K,1}
Returns:

True, if the values in vec are consistent with the given family of encodings specified by type.

julia> islabelenc([0,1,1,0,1], LabelEnc.ZeroOne)
true

julia> islabelenc(UInt8[0,1,1,0,1], LabelEnc.ZeroOne)
true

julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased)
false

julia> islabelenc(Float32[-1,1,1,-1,1], LabelEnc.MarginBased)
true

julia> islabelenc(Int8[-1,1,1,-1,1], LabelEnc.MarginBased)
true

julia> islabelenc([2,1,2,3,1], LabelEnc.Indices)
true

julia> islabelenc(Int8[2,1,2,3,1], LabelEnc.Indices)
true

julia> islabelenc(Int8[2,1,2,3,1], LabelEnc.Indices{Int}) # restrict type but not nlabels
false

We again provide a special version for matrices.

islabelenc(mat, type[, obsdim]) → Bool

Checks if the values and the structure of the given matrix mat can be described as being produced by any possible instance of the given type. This means that the check is much more tolerant concerning the eltype and the size of the matrix, since some families of encodings are appropriate for any number of labels.

Parameters:
  • mat (AbstractMatrix) – The classification targets in matrix form.
  • type (DataType) – Any subtype of LabelEncoding{T,K,2}
  • obsdim (ObsDimension) – Optional. Specifies which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

True, if the values in mat are consistent with the given family of encodings specified by type.

julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK)
true

julia> islabelenc(Int8[0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK)
true

julia> islabelenc([1 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK) # matrix is not one-hot
false

julia> islabelenc([0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK, obsdim = 1)
true

julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK, obsdim = 1)
true

julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK{Int32}, obsdim = 1) # restrict type but not nlabels
false
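
To make the matrix check more concrete, the following is a minimal sketch in plain Julia of what "one-hot" means for a matrix whose columns are the observations. The helper names is_onehot_column and is_onehot are hypothetical and not part of MLLabelUtils; the sketch assumes Julia 1.1 or newer (for eachcol) and ignores the eltype and obsdim handling that islabelenc() performs.

```julia
# Hypothetical helpers, for illustration only (not the package's implementation).

# A column is one-hot if it contains only zeros and ones, and exactly one 1.
is_onehot_column(col) =
    all(x -> x == 0 || x == 1, col) && count(==(1), col) == 1

# A matrix is one-hot if every observation (here: every column) is one-hot.
is_onehot(mat::AbstractMatrix) = all(is_onehot_column, eachcol(mat))

is_onehot([0 1 0 0; 1 0 1 0; 0 0 0 1])  # true
is_onehot([1 1 0 0; 1 0 1 0; 0 0 0 1])  # false: the first column has two ones
```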

Properties of an Encoding

Once we have an instance of some label-encoding, we can compute a number of useful properties about it. For example we can query all the labels that an encoding uses to represent the classes.

label(encoding) → Vector

Returns all the labels that a specific encoding uses in their appropriate order.

Parameters:
  • encoding (LabelEncoding) – The specific label-encoding.
Returns:

The unique labels in the form of a vector. In the case of two labels, the first element will represent the positive label and the second element the negative label respectively.

julia> label(LabelEnc.ZeroOne(UInt8))
2-element Array{UInt8,1}:
 0x01
 0x00

julia> label(LabelEnc.MarginBased())
2-element Array{Float64,1}:
  1.0
 -1.0

julia> label(LabelEnc.Indices(Float32,5))
5-element Array{Float32,1}:
 1.0
 2.0
 3.0
 4.0
 5.0

For convenience one can also just query for the label that corresponds to the positive class or the negative class respectively. These helper functions are only defined for binary label-encodings and will throw a MethodError for multi-class encodings.

poslabel(encoding)

If the encoding is binary it will return the positive label of it. The function will throw an error otherwise.

Parameters:
  • encoding (LabelEncoding) – The specific label-encoding.
Returns:

The value representing the positive label of the given encoding in the appropriate type.

julia> poslabel(LabelEnc.ZeroOne(UInt8))
0x01

julia> poslabel(LabelEnc.MarginBased())
1.0

julia> poslabel(LabelEnc.Indices(Float32,2))
1.0f0

julia> poslabel(LabelEnc.Indices(Float32,5))
ERROR: MethodError: no method matching poslabel(::MLLabelUtils.LabelEnc.Indices{Float32,5})

neglabel(encoding)

If the encoding is binary it will return the negative label of it. The function will throw an error otherwise.

Parameters:
  • encoding (LabelEncoding) – The specific label-encoding.
Returns:

The value representing the negative label of the given encoding in the appropriate type.

julia> neglabel(LabelEnc.ZeroOne(UInt8))
0x00

julia> neglabel(LabelEnc.MarginBased())
-1.0

julia> neglabel(LabelEnc.Indices(Float32,2))
2.0f0

julia> neglabel(LabelEnc.Indices(Float32,5))
ERROR: MethodError: no method matching neglabel(::MLLabelUtils.LabelEnc.Indices{Float32,5})

We can also query the number of labels that a concrete encoding uses. In other words we can query the number of classes the given label-encoding is able to represent.

nlabel(encoding) → Int

Returns the number of labels that a specific encoding uses.

Parameters:
  • encoding (LabelEncoding) – The specific label-encoding.

julia> nlabel(LabelEnc.ZeroOne(UInt8))
2

julia> nlabel(LabelEnc.NativeLabels([:a,:b,:c]))
3

More interestingly, we can infer the number of labels for a family of encodings. This allows for some compile time decisions, but only works for some types of encodings (i.e. binary ones).

nlabel(type) → Int

Returns the number of labels that the family of encodings type can describe. Note that this function will fail if the number of labels can not be inferred from the given type.

Parameters:
  • type (DataType) – Some subtype of LabelEncoding{T,K,M} with a fixed K
Returns:

The type-parameter K of type.

julia> nlabel(LabelEnc.ZeroOne)
2

julia> nlabel(LabelEnc.NativeLabels)
ERROR: ArgumentError: number of labels could not be inferred for the given type

We can also query a family of encodings for their label-type. In this case we decided to not throw an error if the type can not be inferred but instead return the most specific abstract type.

labeltype(type) → DataType

Determines the type of the labels represented by the given family of label-encodings. If the type cannot be inferred, then Any is returned.

Parameters:
  • type (DataType) – Some subtype of LabelEncoding{T,K,M}
Returns:

The type-parameter T of type if specified, or the most specific abstract type otherwise.

julia> labeltype(LabelEnc.TrueFalse)
Bool

julia> labeltype(LabelEnc.ZeroOne{Int})
Int64

julia> labeltype(LabelEnc.ZeroOne)
Number

julia> labeltype(LabelEnc.NativeLabels)
Any

Converting to/from Indices

As stated before, the order of the labels returned by label() matters. In a binary setting, for example, the first label is interpreted as the positive class and the second label as the negative class. This is simply the arbitrary convention that we follow. That said, even in a multi-class setting it is important to be consistent with the ordering. This is crucial in order to make sure that converting to a different encoding and then converting back yields the original values.

Every encoding understands the concept of a label-index, which is a unique representation of a class that all encodings share. For example the positive label of a binary label-encoding always has the label-index 1 and the negative 2 respectively.
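
The idea of a shared label-index can be sketched in plain Julia, assuming that an encoding is described by nothing more than its ordered vector of labels (positive label first in the binary case). The names label_to_index, index_to_label, and convert_via_index are hypothetical and only illustrate the concept; they are not functions of this package. The sketch assumes Julia 1.x.

```julia
# Hypothetical helpers, for illustration only.

# Two "encodings", each represented by its ordered labels (positive first).
native_labels = [:yes, :no]
margin_labels = [1, -1]

# label -> label-index (position of the label in the ordered label vector)
label_to_index(label, labels) = findfirst(isequal(label), labels)

# label-index -> label
index_to_label(index, labels) = labels[index]

# Converting between encodings goes through the shared label-index.
convert_via_index(dst_labels, label, src_labels) =
    index_to_label(label_to_index(label, src_labels), dst_labels)

convert_via_index(margin_labels, :no, native_labels)  # -1
convert_via_index(native_labels, -1, margin_labels)   # :no (round trip)
```

Because both directions use the same ordering, converting to another encoding and back returns the original label.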

To convert a label-index into the label that a specific encoding uses to represent the underlying class we provide the function ind2label().

ind2label(index, encoding)

Converts the given index into the corresponding label defined by the encoding. Note that in the binary case, index = 1 represents the positive label and index = 2 the negative label.

This function supports broadcasting.

Parameters:
  • index (Int) – Index of the desired label. This variable can be specified either as an Int or as a Val. Note that indices are one-based.
  • encoding (LabelEncoding) – The encoding one wants to get the label from.
Returns:

The label of the specified index for the specified encoding.

julia> ind2label(1, LabelEnc.MarginBased(Float32))
1.0f0

julia> ind2label(Val{1}, LabelEnc.MarginBased(Float32))
1.0f0

julia> ind2label(2, LabelEnc.MarginBased(Float32))
-1.0f0

julia> ind2label(3, LabelEnc.OneOfK(Int8,4))
4-element Array{Int8,1}:
 0
 0
 1
 0

julia> ind2label(3, LabelEnc.NativeLabels([:a,:b,:c,:d]))
:c

julia> ind2label.([1,2,2,1], LabelEnc.ZeroOne(UInt8)) # broadcast support
4-element Array{UInt8,1}:
 0x01
 0x00
 0x00
 0x01

We also provide an inverse function for converting a label of a specific encoding into the corresponding label-index. Note that this function does not check if the given label is of the expected type, but simply that it is of the appropriate value.

label2ind(label, encoding) → Int

Converts the given label into the corresponding index defined by the encoding. Note that in the binary case, the positive label will result in the index 1 and the negative label in the index 2 respectively.

This function supports broadcasting.

Parameters:
  • label (Any) – A label in the format familiar to the encoding.
  • encoding (LabelEncoding) – The encoding to compute the label-index with.
Returns:

The index of the specified label for the specified encoding.

julia> label2ind(1.0, LabelEnc.MarginBased())
1

julia> label2ind(-1.0, LabelEnc.MarginBased())
2

julia> label2ind([0,0,1,0], LabelEnc.OneOfK(4))
3

julia> label2ind(:c, LabelEnc.NativeLabels([:a,:b,:c,:d]))
3

julia> label2ind.([1,0,0,1], LabelEnc.ZeroOne()) # broadcast support
4-element Array{Int64,1}:
 1
 2
 2
 1

Converting between Encodings

In the case that the given targets are not in the encoding that your algorithm expects them to be in, you may want to convert them into the format you require. For that purpose we expose the function convertlabel().

convertlabel(dst_encoding, src_label, src_encoding)

Converts the given input label src_label from src_encoding into the corresponding label described by the desired output encoding dst_encoding.

Note that both encodings are expected to be vector-based, meaning that this method does not work for LabelEnc.OneOfK. It does, however, support broadcasting.

Parameters:
  • dst_encoding (LabelEncoding) – The vector-based label-encoding that should be used to produce the output label.
  • src_label (Any) – The input label one wants to convert. It is expected to be consistent with src_encoding.
  • src_encoding (LabelEncoding) – A vector-based label-encoding that is assumed to have produced the given src_label.
Returns:

The label from dst_encoding that corresponds to src_label in src_encoding

julia> convertlabel(LabelEnc.OneOfK(2), -1, LabelEnc.MarginBased()) # OneOfK is not vector-based
ERROR: MethodError: no method matching [...]

julia> convertlabel(LabelEnc.NativeLabels([:a,:b,:c,:d]), 3, LabelEnc.Indices(4))
:c

julia> convertlabel(LabelEnc.ZeroOne(), :yes, LabelEnc.NativeLabels([:yes,:no]))
1.0

julia> convertlabel(LabelEnc.ZeroOne(), :no, LabelEnc.NativeLabels([:yes,:no]))
0.0

julia> convertlabel(LabelEnc.MarginBased(Int), 0, LabelEnc.ZeroOne())
-1

julia> convertlabel(LabelEnc.NativeLabels([:a,:b]), -1, LabelEnc.MarginBased())
:b

julia> convertlabel.(LabelEnc.NativeLabels([:a,:b]), [-1,1,1,-1], LabelEnc.MarginBased()) # broadcast support
4-element Array{Symbol,1}:
 :b
 :a
 :a
 :b

Aside from the one broadcastable method that is implemented for converting single labels, we provide a range of methods that work on whole arrays. These are more flexible, because with an array as input these methods have more information available to make reasonable decisions. As a consequence, we can consider the “source-encoding” parameter optional, because these methods can make use of labelenc() internally to infer it automatically.

convertlabel(dst_encoding, arr[, src_encoding][, obsdim])

Converts the given array arr from the src_encoding into the dst_encoding. If src_encoding is not specified it will be inferred automatically using the function labelenc(). This should not negatively influence type-inference.

Note that both encodings should have the same number of labels, or a MethodError will be thrown in most cases.

Parameters:
  • dst_encoding (LabelEncoding) – The desired output format.
  • arr (AbstractArray) – The input targets that should be converted into the encoding specified by dst_encoding.
  • src_encoding (LabelEncoding) – The input encoding that arr is expected to be in.
  • obsdim (ObsDimension) – Optional. Only possible if one of the two encodings is a matrix-based encoding. Defines which of the two array dimensions denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

A converted version of arr using the specified output encoding dst_encoding.

julia> convertlabel(LabelEnc.NativeLabels([:yes,:no]), [-1,1,-1,1,1,-1])
6-element Array{Symbol,1}:
 :no
 :yes
 :no
 :yes
 :yes
 :no

julia> convertlabel(LabelEnc.OneOfK(Float32,2), [-1,1,-1,1,1,-1])
2×6 Array{Float32,2}:
 0.0  1.0  0.0  1.0  1.0  0.0
 1.0  0.0  1.0  0.0  0.0  1.0

julia> convertlabel(LabelEnc.TrueFalse(), [-1,1,-1,1,1,-1])
6-element Array{Bool,1}:
 false
  true
 false
  true
  true
 false

julia> convertlabel(LabelEnc.Indices(3), [:no,:maybe,:yes,:no], LabelEnc.NativeLabels([:yes,:maybe,:no]))
4-element Array{Int64,1}:
 3
 2
 1
 3

It may be interesting to point out explicitly that we provide special treatment for LabelEnc.OneVsRest to conveniently convert a multi-class problem into a two-class problem.

julia> convertlabel(LabelEnc.OneVsRest(:yes), [:yes,:no,:no,:maybe,:yes,:yes])
6-element Array{Symbol,1}:
 :yes
 :not_yes
 :not_yes
 :not_yes
 :yes
 :yes

julia> convertlabel(LabelEnc.ZeroOne(Float64), [:yes,:no,:no,:maybe,:yes,:yes], LabelEnc.OneVsRest(:yes))
6-element Array{Float64,1}:
 1.0
 0.0
 0.0
 0.0
 1.0
 1.0
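
The one-vs-rest rule itself is simple and can be sketched in plain Julia. The function one_vs_rest below is a hypothetical stand-in, not part of the package; it assumes the caller supplies both the positive label and a placeholder for the negative label, and it relies on Julia 1.x treating Symbols as scalars when broadcasting.

```julia
# Hypothetical helper, for illustration only.
# Everything equal to `pos` stays positive; everything else becomes `neg`.
one_vs_rest(label, pos, neg) = label == pos ? pos : neg

targets = [:yes, :no, :no, :maybe, :yes, :yes]

# Broadcast over the targets; `:yes` and `:not_yes` act as scalars.
binary = one_vs_rest.(targets, :yes, :not_yes)
# binary == [:yes, :not_yes, :not_yes, :not_yes, :yes, :yes]
```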

We also allow a more concise way to specify that you are using a LabelEnc.NativeLabels encoding: simply pass the label-vector that you would normally pass to its constructor directly.

julia> convertlabel([:yes,:no], [-1,1,-1,1,1,-1])
6-element Array{Symbol,1}:
 :no
 :yes
 :no
 :yes
 :yes
 :no

julia> convertlabel(LabelEnc.Indices(3), [:no,:maybe,:yes,:no], [:yes,:maybe,:no])
4-element Array{Int64,1}:
 3
 2
 1
 3

In many cases it can be inconvenient that one has to explicitly specify the label-type and number of labels for the desired output-encoding. To that end we also allow the output-encoding to be specified in terms of an encoding-family (i.e. as DataType).

convertlabel(dst_family, arr[, src_encoding][, obsdim])

Converts the given array arr from the src_encoding into some concrete label-encoding that is a subtype of dst_family. This way the method tries to preserve the eltype of arr if it is numeric. Furthermore, the concrete number of labels need not be specified explicitly, but will instead be inferred from src_encoding.

If src_encoding is not specified it will be inferred automatically using the function labelenc(). This should not negatively influence type-inference.

Parameters:
  • dst_family (DataType) – Any subtype of LabelEncoding{T,K,M}. It denotes the desired family of label-encodings that one wants the return value to be in.
  • arr (AbstractArray) – The input targets that should be converted into some encoding specified by the type dst_family.
  • src_encoding (LabelEncoding) – The input encoding that arr is expected to be in.
  • obsdim (ObsDimension) – Optional. Only possible if one of the two encodings is a matrix-based encoding. Defines which of the two array dimensions denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

A converted version of arr using a label-encoding that is member of the encoding-family dst_family.

julia> convertlabel(LabelEnc.OneOfK, Int8[-1,1,-1,1,1,-1])
2×6 Array{Int8,2}:
 0  1  0  1  1  0
 1  0  1  0  0  1

julia> convertlabel(LabelEnc.OneOfK{Float32}, Int8[-1,1,-1,1,1,-1], obsdim = 1)
6×2 Array{Float32,2}:
 0.0  1.0
 1.0  0.0
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0

julia> convertlabel(LabelEnc.TrueFalse, [-1,1,-1,1,1,-1])
6-element Array{Bool,1}:
 false
  true
 false
  true
  true
 false

julia> convertlabel(LabelEnc.Indices, [:no,:maybe,:yes,:no], LabelEnc.NativeLabels([:yes,:maybe,:no]))
4-element Array{Int64,1}:
 3
 2
 1
 3

For vector-based encodings (which means all except LabelEnc.OneOfK), we provide a lazy version of convertlabel() that does not allocate a new array for the outputs, but instead creates a MappedArray into the original targets.

convertlabelview(dst_encoding, vec[, src_encoding])

Creates a MappedArray that provides a lazy view into vec, which makes it look like the values are actually of the provided output encoding dst_encoding. This means that the conversion happens on the fly when an element of the resulting mapped array is accessed. The resulting mapped array will even be writable, unless src_encoding is LabelEnc.OneVsRest.

Note that both encodings are expected to be vector-based, meaning that this method does not work for LabelEnc.OneOfK.

Parameters:
  • dst_encoding (LabelEncoding) – The desired vector-based output encoding.
  • vec (AbstractVector) – The input targets that one wants to convert using dst_encoding. It is expected to be consistent with src_encoding.
  • src_encoding (LabelEncoding) – A vector-based label-encoding that is assumed to have produced the values in vec.
Returns:

A MappedArray or ReadonlyMappedArray that makes vec look like it is in the encoding specified by dst_encoding

julia> true_targets = [-1,1,-1,1,1,-1]
6-element Array{Int64,1}:
 -1
  1
 -1
  1
  1
 -1

julia> A = convertlabelview(LabelEnc.NativeLabels([:yes,:no]), true_targets)
6-element MappedArrays.MappedArray{Symbol,1,...}:
 :no
 :yes
 :no
 :yes
 :yes
 :no

julia> A[2] = :no
:no

julia> A
6-element MappedArrays.MappedArray{Symbol,1,...}:
 :no
 :no
 :no
 :yes
 :yes
 :no

julia> true_targets
6-element Array{Int64,1}:
 -1
 -1
 -1
  1
  1
 -1
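
To illustrate why writing into the view changes the original targets, here is a minimal sketch of a lazy, writable wrapper in plain Julia. The type LazyLabels is hypothetical and much simpler than the MappedArray that convertlabelview() actually returns; it only assumes a pair of functions that map stored values to labels and back. The sketch assumes Julia 1.x.

```julia
# Hypothetical lazy wrapper, for illustration only.
struct LazyLabels{T,A,F,G} <: AbstractVector{T}
    data::A   # the original targets (not copied)
    f::F      # stored value -> label shown to the reader
    finv::G   # written label -> stored value
end

# Convenience constructor that only asks for the label type T.
LazyLabels{T}(data::A, f::F, finv::G) where {T,A,F,G} =
    LazyLabels{T,A,F,G}(data, f, finv)

Base.size(v::LazyLabels) = size(v.data)
Base.IndexStyle(::Type{<:LazyLabels}) = IndexLinear()
# Reading converts on the fly; nothing is stored in the wrapper itself.
Base.getindex(v::LazyLabels, i::Int) = v.f(v.data[i])
# Writing converts back and mutates the wrapped array.
Base.setindex!(v::LazyLabels, label, i::Int) = (v.data[i] = v.finv(label); v)

targets = [-1, 1, -1]
lazy = LazyLabels{Symbol}(targets,
                          y -> y == 1 ? :yes : :no,
                          l -> l == :yes ? 1 : -1)

first_read = lazy[2]  # :yes, computed on access
lazy[2] = :no         # stores -1 in `targets`
```

No second array is allocated for the labels, which is the trade-off such a view makes: each access pays for the conversion instead.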

Classifying Predictions

Some encodings come with an implicit interpretation of what the raw predictions of some model (often denoted as \(\hat{y}\), written yhat) should look like and how they can be classified into a predicted class-label. For that purpose we provide the function classify() and its mutating version classify!().

classify(yhat, encoding)

Returns the classified version of yhat given the encoding. That means that if yhat can be interpreted as a positive label, the positive label of encoding is returned. If yhat can not be interpreted as a positive value then the negative label is returned.

This method supports broadcasting.

Parameters:
  • yhat (Number) – The numeric prediction that should be classified into either the label representing the positive class or the label representing the negative class
  • encoding (LabelEncoding) – A concrete instance of a label-encoding that one wants to work with.
Returns:

The label that the encoding uses to represent the class that yhat is classified into.

For LabelEnc.MarginBased the decision boundary between classifying into a negative or a positive label is predefined at zero. More precisely a raw prediction greater than or equal to zero is considered a positive prediction, while any strictly negative raw prediction is considered a negative prediction.

julia> classify(-0.3f0, LabelEnc.MarginBased()) # defaults to Float64
-1.0

julia> classify.([-2.3,6.5], LabelEnc.MarginBased(Int))
2-element Array{Int64,1}:
 -1
  1
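
The margin-based rule described above can be written down in one line. The function margin_classify is a hypothetical sketch in plain Julia, not the package's implementation; it assumes the desired label type is passed explicitly as the first argument.

```julia
# Hypothetical helper, for illustration only.
# Predictions >= 0 are positive (+1); strictly negative ones are negative (-1).
margin_classify(::Type{T}, yhat::Number) where {T<:Number} =
    yhat >= 0 ? one(T) : -one(T)

margin_classify(Float64, -0.3f0)         # -1.0
margin_classify.(Int, [-2.3, 6.5])       # [-1, 1]
margin_classify(Int, 0.0)                # 1, zero counts as positive
```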

For LabelEnc.ZeroOne the assumption is that the raw prediction is in the closed interval \([0, 1]\) and represents a degree of certainty that the observation is of the positive class. That means that in order to classify a raw prediction to either positive or negative, one needs to decide on a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to classify as positive.

julia> classify(0.3f0, LabelEnc.ZeroOne(0.5)) # defaults to Float64
0.0

julia> classify(0.3f0, LabelEnc.ZeroOne(Int,0.2))
1

julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4))
2-element Array{Int64,1}:
 0
 1

We recognize that such a probabilistic interpretation of the raw predicted value is fairly common. So much so that we provide a convenience method for when one is working under the assumption of a LabelEnc.ZeroOne encoding.

classify(yhat, threshold)

Returns the classified version of yhat given the decision margin threshold. This method assumes that yhat denotes a probability and will either return zero(yhat) if yhat is below threshold, or one(yhat) otherwise.

This method supports broadcasting.

Parameters:
  • yhat (Number) – The numeric prediction. It is assumed to be a value between 0 and 1.
  • threshold (Number) – The threshold below which yhat will be classified as 0.
Returns:

The classified version of yhat of the same type.

julia> classify(0.3f0, 0.5)
0.0f0

julia> classify(0.3f0, 0.2)
1.0f0

julia> classify.([0.3,0.5], 0.4)
2-element Array{Float64,1}:
 0.0
 1.0
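
The threshold rule can be sketched in plain Julia as follows. The function threshold_classify is a hypothetical name used only for illustration; it follows the description above by returning zero(yhat) below the threshold and one(yhat) otherwise, so the type of yhat is preserved.

```julia
# Hypothetical helper, for illustration only.
threshold_classify(yhat::Number, threshold::Number) =
    yhat >= threshold ? one(yhat) : zero(yhat)

threshold_classify(0.3f0, 0.5)          # 0.0f0
threshold_classify(0.3f0, 0.2)          # 1.0f0
threshold_classify.([0.3, 0.5], 0.4)    # [0.0, 1.0]
```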

For matrix-based encodings, such as LabelEnc.OneOfK, we provide a special method that allows one to optionally specify the dimension of the matrix that denotes the observations.

classify(yhat, encoding[, obsdim])

If yhat is a vector (i.e. a single observation), this function returns the index of the element that has the largest value. If yhat is a matrix, this function returns a vector of indices for each observation in yhat.

Parameters:
  • yhat (AbstractArray) – The numeric predictions in the form of either a vector or a matrix.
  • encoding (LabelEncoding) – A concrete instance of a matrix-based label-encoding that one wants to work with.
  • obsdim (ObsDimension) – Optional iff yhat is a matrix. Specifies which of the two array dimensions of yhat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

The classified version of yhat. This will either be an integer or a vector of indices.

julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
 0.1  0.4  0.3  0.2
 0.8  0.3  0.6  0.2
 0.1  0.3  0.1  0.6

julia> classify(pred_output, LabelEnc.OneOfK(3))
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify(pred_output', LabelEnc.OneOfK(3), obsdim=1) # note the transpose
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK(4)) # single observation
3
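
Conceptually, classifying a matrix of predictions amounts to taking the index of the largest value for each observation. The following plain-Julia sketch shows this; argmax_classify is a hypothetical helper (not the package's implementation) that takes an integer obsdim keyword instead of an ObsDimension, and it assumes Julia 1.1 or newer for eachcol and eachrow.

```julia
# Hypothetical helper, for illustration only.
# obsdim = 2: each column is one observation; obsdim = 1: each row is.
argmax_classify(yhat::AbstractMatrix; obsdim::Int = 2) =
    obsdim == 2 ? [argmax(col) for col in eachcol(yhat)] :
                  [argmax(row) for row in eachrow(yhat)]

# A single observation is just a vector of scores.
argmax_classify(yhat::AbstractVector) = argmax(yhat)

pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]

argmax_classify(pred_output)               # [2, 1, 2, 3]
argmax_classify(pred_output', obsdim = 1)  # [2, 1, 2, 3]
argmax_classify([0.1, 0.2, 0.6, 0.1])      # 3
```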

Similar to other functions we expose a version that can be called with a family of encodings (i.e. a type with free type parameters) instead of a concrete instance.

classify(yhat, type)

Returns the classified version of yhat given the family of encodings specified by type. That means that if yhat can be interpreted as a positive label, the positive label of that family is returned (and the negative otherwise). Furthermore, the type of yhat is preserved.

This method supports broadcasting.

Parameters:
  • yhat (Number) – The numeric prediction that should be classified into either the label representing the positive class or the label representing the negative class
  • type (DataType) – Any subtype of LabelEncoding{T,K,1}
Returns:

The classified version of yhat of the same type.

julia> classify(0.3f0, LabelEnc.ZeroOne) # threshold fixed at 0.5
0.0f0

julia> classify(0.3, LabelEnc.ZeroOne)
0.0

julia> classify(4f0, LabelEnc.MarginBased)
1.0f0

julia> classify(-4, LabelEnc.MarginBased)
-1

classify(yhat, type[, obsdim])

If yhat is a vector (i.e. a single observation), this function returns the index of the element that has the largest value. If yhat is a matrix, this function returns a vector of indices for each observation in yhat.

Parameters:
  • yhat (AbstractArray) – The numeric predictions in the form of either a vector or a matrix.
  • type (DataType) – Any subtype of LabelEncoding{T,K,2}
  • obsdim (ObsDimension) – Optional iff yhat is a matrix. Specifies which of the two array dimensions of yhat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns:

The classified version of yhat. This will either be an integer or a vector of indices.

julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
 0.1  0.4  0.3  0.2
 0.8  0.3  0.6  0.2
 0.1  0.3  0.1  0.6

julia> classify(pred_output, LabelEnc.OneOfK)
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify(pred_output', LabelEnc.OneOfK, obsdim=1) # note the transpose
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
3

We also provide a mutating version. This is mainly of interest when working with LabelEnc.OneOfK(), in which case broadcast is not defined on the previous methods.

classify!(out, arr, encoding[, obsdim])

Same as classify, but uses out to store the result. In the case of a vector-based encoding this will use broadcast internally. It is mainly provided to offer a consistent API between vector-based and matrix-based encodings.

For convenience we also provide boolean versions that assert if the given raw prediction could be interpreted as either a positive or a negative prediction.

isposlabel(yhat, encoding) → Bool

Checks if the given value yhat can be interpreted as the positive label given the encoding. This function takes potential classification rules into account.

julia> isposlabel([1,0], LabelEnc.OneOfK(2))
true

julia> isposlabel([0,1], LabelEnc.OneOfK(2))
false

julia> isposlabel(-5, LabelEnc.MarginBased())
false

julia> isposlabel(2, LabelEnc.MarginBased())
true

julia> isposlabel(0.3f0, LabelEnc.ZeroOne(0.5))
false

julia> isposlabel(0.3f0, LabelEnc.ZeroOne(0.2))
true

isneglabel(yhat, encoding) → Bool

Checks if the given value yhat can be interpreted as the negative label given the encoding. This function takes potential classification rules into account.

julia> isneglabel([1,0], LabelEnc.OneOfK(2))
false

julia> isneglabel([0,1], LabelEnc.OneOfK(2))
true

julia> isneglabel(-5, LabelEnc.MarginBased())
true

julia> isneglabel(2, LabelEnc.MarginBased())
false

julia> isneglabel(0.3f0, LabelEnc.ZeroOne(0.5))
true

julia> isneglabel(0.3f0, LabelEnc.ZeroOne(0.2))
false

Lastly, we provide an organized list of the implemented label-encodings that this package exposes. We will also discuss their properties, differences, and other nuances.

Supported Encodings

The design of this package revolves around a number of immutable types, each of which represents a specific label-encoding. These types are contained within their own namespace LabelEnc. The reason for the namespace is mainly convenience, as it allows for a simple form of auto-completion and also more concise names that could otherwise be considered too ambiguous.

Abstract LabelEncoding

We offer a number of different encodings that can best be described in terms of two orthogonal properties. The first property is the number of classes it represents, and the second property is the number of array dimensions it operates on.

LabelEncoding{T,K,M}

Abstract super-type of all label encodings. Mainly intended for dispatch. As such this type is not exported. It defines three type-parameters that are useful to divide the different encodings into groups.

T

The label-type of the encoding, which specifies the concrete type that all labels of that particular encoding have.

K

The number of labels that the label-encoding can deal with. So for binary encodings this will be the constant 2.

M

The number of array dimensions that the encoding works with. For most encodings this will be 1, meaning that a target array of that encoding is expected to be some vector. In contrast, the encoding LabelEnc.OneOfK defines M=2, because it represents the target array as a matrix.
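
How such type-parameters can drive dispatch is sketched below in plain Julia. All names (SketchEncoding, SketchTrueFalse, SketchOneOfK, sketch_nlabel, sketch_ndims) are hypothetical and only mirror the T, K, M layout described above; they are not the package's definitions. The sketch assumes Julia 1.x.

```julia
# Hypothetical types, for illustration only.
abstract type SketchEncoding{T,K,M} end

# A binary, vector-based encoding with Bool labels: T=Bool, K=2, M=1.
struct SketchTrueFalse <: SketchEncoding{Bool,2,1} end

# A matrix-based encoding whose label type and label count are free: M=2.
struct SketchOneOfK{T,K} <: SketchEncoding{T,K,2} end

# The parameters can be read off the type itself, without any instance.
sketch_nlabel(::Type{<:SketchEncoding{T,K,M}}) where {T,K,M} = K
sketch_ndims(::Type{<:SketchEncoding{T,K,M}}) where {T,K,M} = M

sketch_nlabel(SketchTrueFalse)       # 2
sketch_ndims(SketchTrueFalse)        # 1
sketch_ndims(SketchOneOfK{Int,3})    # 2
```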

TrueFalse

LabelEnc.TrueFalse

Denotes the classes as boolean values, for which true corresponds to the positive class, and false to the negative class.

julia> supertype(LabelEnc.TrueFalse)
MLLabelUtils.LabelEncoding{Bool,2,1}

It belongs to the family of binary vector-based encodings, and as such represents the targets as a vector that is using only two distinct values for its elements. That implies that it is by definition always binary and as such the number of labels can be inferred at compile time.

julia> nlabel(LabelEnc.TrueFalse)
2

TrueFalse() → LabelEnc.TrueFalse

Returns the singleton that represents the encoding. All information about the encoding is already contained within the type. As such there is no need to specify additional parameters.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> true_targets = [false, true, true, false];

julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.TrueFalse()

julia> label(LabelEnc.TrueFalse())
2-element Array{Bool,1}:
  true
 false

julia> nlabel(LabelEnc.TrueFalse())
2

ZeroOne

LabelEnc.ZeroOne

Denotes the classes as numeric values, for which 1 corresponds to the positive class, and 0 to the negative class. This type of encoding is often used when the predictions denote a probability.

julia> supertype(LabelEnc.ZeroOne)
MLLabelUtils.LabelEncoding{T<:Number,2,1}

It belongs to the family of binary numeric vector-based encodings, and as such represents the targets as a vector that is using only two distinct values for its elements. In fact, it is by definition always binary and as such the number of labels can be inferred at compile time.

julia> nlabel(LabelEnc.ZeroOne)
2

This type also comes with support for classification (see classify()). It assumes that the raw predictions (often called \(\hat{y}\)) are in the closed interval \([0, 1]\) and represent something resembling a probability (or some degree of certainty) that the observation is of the positive class. That means that in order to classify a raw prediction to either positive or negative, one needs to decide on a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to classify as positive.

threshold

A real number between 0 and 1 that defines the “cutoff” point for classification. Any prediction less than this value will be classified as negative and any prediction equal to or greater than this value will be classified as a positive prediction.

ZeroOne([labeltype][, threshold]) → LabelEnc.ZeroOne

Creates a new label-encoding of the LabelEnc.ZeroOne family.

Parameters:
  • labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Float64.
  • threshold (Number) – The classification threshold that should be used in classify(). Defaults to 0.5.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> LabelEnc.ZeroOne(Int, 0.3) # threshold = 0.3
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.3)

julia> true_targets = [0, 1, 1, 0];

julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.5)

julia> label(LabelEnc.ZeroOne())
2-element Array{Float64,1}:
 1.0
 0.0

julia> nlabel(LabelEnc.ZeroOne())
2

MarginBased

LabelEnc.MarginBased

Denotes the classes as numeric values, for which 1 corresponds to the positive class, and -1 to the negative class. This type of encoding is very prominent for margin-based classifiers, in particular SVMs.

julia> supertype(LabelEnc.MarginBased)
MLLabelUtils.LabelEncoding{T<:Number,2,1}

It belongs to the family of binary numeric vector-based encodings, and as such represents the targets as a vector that is using only two distinct values for its elements. In fact, it is by definition always binary and as such the number of labels can be inferred at compile time.

julia> nlabel(LabelEnc.MarginBased)
2

This type also comes with support for classification (see classify()). It expects the raw predictions to be real numbers of arbitrary value. The decision boundary between classifying into a negative or a positive label is predefined at zero. More precisely a raw prediction greater than or equal to zero is considered a positive prediction, while any strictly negative raw prediction is considered a negative prediction.

MarginBased([labeltype]) → LabelEnc.MarginBased

Creates a new label-encoding of the LabelEnc.MarginBased family.

Parameters:
  • labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Float64.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> true_targets = [-1, 1, 1, -1];

julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.MarginBased{Int64}()

julia> label(LabelEnc.MarginBased())
2-element Array{Float64,1}:
  1.0
 -1.0

julia> nlabel(LabelEnc.MarginBased())
2

OneVsRest

LabelEnc.OneVsRest

This is a special type of binary encoding that allows converting a multi-class problem into a binary one. It does so by only “caring” about what the positive label is, and treating everything that is not equal to it as negative.

julia> supertype(LabelEnc.OneVsRest)
MLLabelUtils.LabelEncoding{T,2,1}

It belongs to the family of binary vector-based encodings. It is by definition always binary and as such the number of labels can be inferred at compile time.

julia> nlabel(LabelEnc.OneVsRest)
2

While this encoding only uses the positive label to assert class membership, it still needs to have a placeholder-value of the same type for a negative label in order for convertlabel() to work.

poslabel

The value that will be used to represent the positive class. This value will be used to determine if a given value is positive (if it is equal) or negative.

neglabel

Placeholder to represent the negative class. This value will not be used to determine membership, but simply to impute a reasonable value when converting to such an encoding.
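
The roles of the two values can be sketched in plain Julia. The function `one_vs_rest` below is a hypothetical stand-in written for illustration, not the package's conversion routine: membership is decided solely by comparing against the positive label, and the negative label is only imputed as a placeholder.

```julia
# Hypothetical stand-in (not part of MLLabelUtils) for the one-vs-rest rule.
# Both labels must share the same type T, as the encoding requires.
function one_vs_rest(targets::AbstractVector{T}, poslabel::T, neglabel::T) where {T}
    # Only equality with `poslabel` is ever tested; `neglabel` is never
    # compared against, it merely replaces every other value.
    return [t == poslabel ? poslabel : neglabel for t in targets]
end

targets = [:yes, :no, :maybe, :yes]
binary_targets = one_vs_rest(targets, :yes, :not_yes)
```

Here both `:no` and `:maybe` collapse into the placeholder `:not_yes`, which turns the three-class targets into a binary problem.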

OneVsRest(poslabel[, neglabel]) → LabelEnc.OneVsRest

Creates a new label-encoding of the one-vs-rest family. While both a positive and a negative label have to be known to the encoding, only the positive label is used for comparison and asserting class membership. Note that both parameters have to be of the same type.

Parameters:
  • poslabel (Any) – The label of interest.
  • neglabel (Any) – The negative label. It is optional for the common types, such as symbol, number, or string. For any other label type it has to be provided explicitly.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> true_targets = [:yes, :no, :maybe, :yes];

julia> convertlabel(LabelEnc.OneVsRest(:yes), true_targets)
4-element Array{Symbol,1}:
 :yes
 :not_yes
 :not_yes
 :yes

julia> convertlabel(LabelEnc.MarginBased, true_targets, LabelEnc.OneVsRest(:yes))
4-element Array{Float64,1}:
  1.0
 -1.0
 -1.0
  1.0

julia> label(LabelEnc.OneVsRest(:yes))
2-element Array{Symbol,1}:
 :yes
 :not_yes

julia> nlabel(LabelEnc.OneVsRest(:yes))
2

Indices

LabelEnc.Indices

A multi-class encoding that uses the integer numbers in \(\{1, 2, ..., K\}\) as labels to denote the classes. While these “indices” are integers in terms of their values, they don’t need to be Int as a type.

julia> supertype(LabelEnc.Indices)
MLLabelUtils.LabelEncoding{T<:Number,K,1}

It belongs to the family of numeric vector-based encodings and can encode any number of classes. As such the number of labels K is a free type-parameter. It is considered a binary encoding if and only if K = 2.

Indices([labeltype, ]k) → LabelEnc.Indices

Creates a new label-encoding of the LabelEnc.Indices family.

Parameters:
  • labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Int.
  • k (Int) – The number of classes that the encoding should represent. This parameter can be specified as an Int or in a type-stable manner as Val{k}.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> true_targets = [1, 2, 1, 3, 1, 2];

julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.Indices{Int64,3}()

julia> label(LabelEnc.Indices(3))
3-element Array{Int64,1}:
 1
 2
 3

julia> label(LabelEnc.Indices(Float32,4))
4-element Array{Float32,1}:
 1.0
 2.0
 3.0
 4.0

julia> nlabel(LabelEnc.Indices(Val{5})) # type-stable
5

OneOfK

LabelEnc.OneOfK

A multi-class encoding that uses one of the two matrix dimensions to denote the label. More precisely, it uses an indicator-encoding to explicitly state what class an observation represents and what it does not represent, by only setting one element of each observation to 1 and the rest to 0.
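
The indicator idea can be sketched in plain Julia. The helper `one_of_k` is hypothetical and not part of MLLabelUtils. It assumes the layout used in the example of this section, where each column is one observation and each row is one class.

```julia
# Hypothetical helper (not part of MLLabelUtils): builds an indicator matrix
# from class indices. Assumed layout: rows are classes, columns are observations.
function one_of_k(::Type{T}, indices::AbstractVector{<:Integer}, k::Integer) where {T<:Number}
    encoded = zeros(T, k, length(indices))
    for (obs, class) in enumerate(indices)
        1 <= class <= k || throw(ArgumentError("class index $class is outside 1:$k"))
        # Exactly one element per observation is set to 1; the rest stay 0.
        encoded[class, obs] = one(T)
    end
    return encoded
end

class_indices = [2, 1, 2, 3]
indicator = one_of_k(Int, class_indices, 3)

# Every column contains exactly one 1, so the column sums are all 1.
column_sums = vec(sum(indicator, dims = 1))
```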

julia> supertype(LabelEnc.OneOfK)
MLLabelUtils.LabelEncoding{T<:Number,K,2}

It belongs to the family of numeric matrix-based encodings and can encode any number of classes. As such the number of labels K is a free type-parameter. It is considered a binary encoding if and only if K = 2.

OneOfK([labeltype, ]k) → LabelEnc.OneOfK

Creates a new label-encoding of the matrix-based LabelEnc.OneOfK family.

Parameters:
  • labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Int.
  • k (Int) – The number of classes that the encoding should represent. This parameter can be specified as an Int or in a type-stable manner as Val{k}.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> true_targets = [0 1 0 0; 1 0 1 0; 0 0 0 1]
3×4 Array{Int64,2}:
 0  1  0  0
 1  0  1  0
 0  0  0  1

julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.OneOfK{Int64,3}()

julia> label(LabelEnc.OneOfK(Float32, 4)) # returns the indices
4-element Array{Int64,1}:
 1
 2
 3
 4

julia> ind2label(3, LabelEnc.OneOfK(Float32, 4))
4-element Array{Float32,1}:
 0.0
 0.0
 1.0
 0.0

julia> nlabel(LabelEnc.OneOfK(Val{4}))
4

NativeLabels

LabelEnc.NativeLabels

A multi-class encoding that can use arbitrary values to represent any number of labels. It does so by mapping each label-index to a class label. The class labels can be of arbitrary type as long as the type is consistent for all labels. Furthermore, all labels have to be specified explicitly.

julia> supertype(LabelEnc.NativeLabels)
MLLabelUtils.LabelEncoding{T,K,1}

It belongs to the family of vector-based encodings that can encode any number of classes. As such the number of labels K is a free type-parameter. It is considered a binary encoding if and only if K = 2.

label

A vector that contains all the used labels in their defined order. If it only contains two values, then the first value will be interpreted as the positive label and the second value as the negative label.

invlabel

A Dict that maps each label to its index in the vector label. This map is used for fast lookup and is generated automatically.
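
How the two fields relate can be sketched in plain Julia. The variable names below are illustrative only and the snippet does not use the package itself: the vector gives the index-to-label direction, while the Dict gives the label-to-index direction.

```julia
# Illustrative variables only (not the package's internals).
labels = [:yes, :no, :maybe]

# Inverse lookup: each label maps to its position in `labels`.
# This assumes the labels are distinct, as the encoding expects.
invlabel = Dict(l => i for (i, l) in enumerate(labels))

targets = [:no, :yes, :maybe, :no]

# Label -> index via the Dict (constant-time lookup per element).
target_indices = [invlabel[t] for t in targets]

# Index -> label via plain vector indexing recovers the original targets.
restored = labels[target_indices]
```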

NativeLabels(label[, k]) → LabelEnc.NativeLabels

Creates a new vector-based label-encoding for the given values in label. The values in label are expected to be distinct.

Parameters:
  • label (Vector) – The labels that the encoding should use, in their intended order.
  • k (DataType) – The number of labels in label. This parameter is optional and will be computed from label if omitted. However, if the number of labels is known at compile time, this parameter can be provided using Val{k}.

For more information on how to use such an encoding, please look at the corresponding parts of the documentation.

julia> true_targets = [:a, :b, :a, :c, :b, :a];

julia> le = labelenc(true_targets)
MLLabelUtils.LabelEnc.NativeLabels{Symbol,3}(Symbol[:a,:b,:c],Dict(:c=>3,:a=>1,:b=>2))

julia> label(le)
3-element Array{Symbol,1}:
 :a
 :b
 :c

julia> nlabel(le)
3

julia> LabelEnc.NativeLabels([:yes, :no, :maybe], Val{3}) # type inferrable
MLLabelUtils.LabelEnc.NativeLabels{Symbol,3}(Symbol[:yes,:no,:maybe],Dict(:yes=>1,:maybe=>3,:no=>2))

FuzzyBinary

LabelEnc.FuzzyBinary

A vector-based binary label interpretation without a specific labeltype. It is primarily intended for fuzzy comparison of binary true targets and predicted targets. It basically assumes that the encoding is either TrueFalse, ZeroOne, or MarginBased by treating all strictly positive values (and true) as positive outputs and everything else as negative.

LICENSE

The MLLabelUtils.jl package is licensed under the MIT “Expat” License; see LICENSE.md in the GitHub repository.