MLLabelUtils.jl’s documentation
This package represents a community effort to provide the necessary functionality for interpreting class-predictions, as well as converting classification targets from one encoding to another. As such it is part of the JuliaML ecosystem.
The main intent of this package is to be a lightweight back-end for other JuliaML packages that deal with classification problems. In particular, this library is designed for package developers who require their classification targets to be in a specific format. To that end, the core focus of this package is to provide all the tools needed to deal with classification targets of arbitrary format. This includes asserting whether the targets are of a desired encoding, inferring the concrete encoding the targets are in and how many classes they represent, and converting from their native encoding to the desired one.
From an end-user’s perspective one normally does not need to import this package directly. That said, some functionality (in particular convertlabel()) can also be useful to end-users who code their own special Machine Learning scripts.
Where to begin?
If this is the first time you are considering MLLabelUtils for your machine learning experiments or packages, make sure to check out the “Getting Started” section; specifically “How to …?”, which lists some of the most common scenarios and links to the appropriate places that should guide you on how to approach these scenarios using the functionality provided by this or other packages.
Getting Started
MLLabelUtils is the result of a collaborative effort to design an efficient but also convenient-to-use library for working with the most commonly utilized class-label encodings in Machine Learning. As such, this package provides functionality to derive or assert properties about some label-encoding or target array, as well as the functions needed to convert given targets into a different format.
Installation
To install MLLabelUtils.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.
Pkg.add("MLLabelUtils")
Additionally, if you encounter any sudden issues, or if you would like to contribute to the package, you can manually choose to be on the latest (untagged) version.
Pkg.checkout("MLLabelUtils")
Overview
Let us take a look at some examples (with only minor explanation) to get a feeling for what one can do with this package. Once installed the package can be imported just as any other Julia package.
using MLLabelUtils
For starters, the library provides a few utility functions to compute various properties of the target array. These include the number of labels (see nlabel()), the labels themselves (see label()), and a mapping from label to the elements of the target array (see labelmap() and labelfreq()).
julia> true_targets = [0, 1, 1, 0, 0];
julia> label(true_targets)
2-element Array{Int64,1}:
1
0
julia> nlabel(true_targets)
2
julia> labelmap(true_targets)
Dict{Int64,Array{Int64,1}} with 2 entries:
0 => [1,4,5]
1 => [2,3]
julia> labelfreq(true_targets)
Dict{Int64,Int64} with 2 entries:
0 => 3
1 => 2
Tip
Because labelfreq() utilizes a Dict to store its result, it is straightforward to visualize the class distribution (using the absolute frequencies) right in the REPL using the UnicodePlots.jl package.
julia> using UnicodePlots
julia> barplot(labelfreq([:yes,:no,:no,:maybe,:yes,:yes]), symb="#")
# ┌────────────────────────────────────────┐
# yes │##################################### 3 │
# maybe │############ 1 │
# no │######################### 2 │
# └────────────────────────────────────────┘
If you find yourself writing a custom function that trains a specific supervised model, chances are that you want to assert that the given targets are in the encoding that the model requires. We provide a few functions for such a scenario, namely labelenc() and islabelenc().
julia> true_targets = [0, 1, 1, 0, 0];
julia> labelenc(true_targets) # determine encoding using heuristics
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.5)
julia> islabelenc(true_targets, LabelEnc.ZeroOne)
true
julia> islabelenc(true_targets, LabelEnc.ZeroOne(Int))
true
julia> islabelenc(true_targets, LabelEnc.ZeroOne(Float32))
false
julia> islabelenc(true_targets, LabelEnc.MarginBased)
false
In the case that it turns out the given targets are in the wrong encoding you may want to convert them into the format you require. For that purpose we expose the function convertlabel().
julia> true_targets = [0, 1, 1, 0, 0];
julia> convertlabel(LabelEnc.MarginBased, true_targets)
5-element Array{Int64,1}:
-1
1
1
-1
-1
julia> convertlabel(LabelEnc.MarginBased(Float64), true_targets)
5-element Array{Float64,1}:
-1.0
1.0
1.0
-1.0
-1.0
julia> convertlabel([:yes,:no], true_targets)
5-element Array{Symbol,1}:
:no
:yes
:yes
:no
:no
julia> convertlabel(LabelEnc.OneOfK, true_targets)
2×5 Array{Int64,2}:
0 1 1 0 0
1 0 0 1 1
julia> convertlabel(LabelEnc.OneOfK{Bool}, true_targets)
2×5 Array{Bool,2}:
false true true false false
true false false true true
julia> convertlabel(LabelEnc.OneOfK{Float64}, true_targets, obsdim=1)
5×2 Array{Float64,2}:
0.0 1.0
1.0 0.0
1.0 0.0
0.0 1.0
0.0 1.0
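To make the idea behind these conversions concrete, the following is a minimal sketch in plain Julia. It is not the package's implementation; the helper names to_margin and to_onehot are hypothetical and exist only for this illustration. The sketch assumes zero-one input and, for the one-hot case, that the positive label is listed first.

```julia
# Hypothetical helpers for illustration; not part of MLLabelUtils.

# Map zero-one targets to margin-based targets: 1 -> 1, anything else -> -1.
to_margin(targets) = [t == 1 ? 1 : -1 for t in targets]

# Build a one-hot matrix with one column per observation.
# `labels` lists the labels in order; row i corresponds to labels[i].
function to_onehot(targets, labels)
    mat = zeros(Int, length(labels), length(targets))
    for (j, t) in enumerate(targets)
        i = findfirst(isequal(t), labels)
        i === nothing && throw(ArgumentError("unknown label $t"))
        mat[i, j] = 1
    end
    return mat
end

true_targets = [0, 1, 1, 0, 0]
margin = to_margin(true_targets)
onehot = to_onehot(true_targets, [1, 0])  # positive label first
```

Both results line up with the convertlabel() output shown above for LabelEnc.MarginBased and LabelEnc.OneOfK.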
It may be interesting to point out explicitly that we provide LabelEnc.OneVsRest to conveniently convert a multi-class problem into a two-class problem.
julia> convertlabel(LabelEnc.OneVsRest(:yes), [:yes,:no,:no,:maybe,:yes,:yes])
6-element Array{Symbol,1}:
:yes
:not_yes
:not_yes
:not_yes
:yes
:yes
julia> convertlabel(LabelEnc.ZeroOne, [:yes,:no,:no,:maybe,:yes,:yes], LabelEnc.OneVsRest(:yes))
6-element Array{Float64,1}:
1.0
0.0
0.0
0.0
1.0
1.0
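The one-vs-rest idea itself is simple enough to sketch without the package. The helper one_vs_rest below is hypothetical and only illustrates the principle of collapsing every label other than the chosen one into a single "rest" class.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# `true` where the target equals the chosen positive label, `false` elsewhere.
one_vs_rest(targets, positive) = [t == positive for t in targets]

targets = [:yes, :no, :no, :maybe, :yes, :yes]
binary = one_vs_rest(targets, :yes)
zero_one = Float64.(binary)  # the same information in zero-one form
```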
Some encodings come with an implicit contract of what the raw predictions of some model should look like and how to classify a raw prediction into a predicted class-label. For that purpose we provide the function classify() and its mutating version classify!().
For LabelEnc.ZeroOne the convention is that the raw prediction is between 0 and 1 and represents a degree of certainty that the observation is of the positive class. That means that in order to classify a raw prediction to either positive or negative, one needs to define a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to classify as positive.
julia> classify(0.3f0, 0.5); # equivalent to below
julia> classify(0.3f0, LabelEnc.ZeroOne) # preserves type
0.0f0
julia> classify(0.3f0, LabelEnc.ZeroOne(0.5)) # defaults to Float64
0.0
julia> classify(0.3f0, LabelEnc.ZeroOne(Int,0.2))
1
julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4))
2-element Array{Int64,1}:
0
1
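As a rough sketch of the thresholding rule, consider the hypothetical helper below. Treating a prediction exactly equal to the threshold as positive is an assumption of this sketch; the examples above do not exercise that boundary case.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Assumption: a prediction equal to the threshold counts as positive.
threshold_classify(pred, threshold) = pred >= threshold ? 1 : 0

preds = [0.3, 0.5]
classes = threshold_classify.(preds, 0.4)  # broadcast over all predictions
```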
For LabelEnc.MarginBased on the other hand the decision boundary is predefined at 0, meaning that any raw prediction greater than or equal to zero is considered a positive prediction, while any negative raw prediction is considered a negative prediction.
julia> classify(0.3f0, LabelEnc.MarginBased) # preserves type
1.0f0
julia> classify(-0.3f0, LabelEnc.MarginBased()) # defaults to Float64
-1.0
julia> classify.([-2.3,6.5], LabelEnc.MarginBased(Int))
2-element Array{Int64,1}:
-1
1
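The margin-based rule stated above can be sketched in one line. The helper name margin_classify is hypothetical. Note that Julia's built-in sign function would not do here, because sign(0.0) is 0.0 rather than a positive label.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Zero and above map to the positive label 1, everything else to -1.
margin_classify(pred) = pred >= 0 ? 1 : -1

classes = margin_classify.([-2.3, 6.5, 0.0])
```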
The encoding LabelEnc.OneOfK is special in that it is matrix-based and thus there exists the concept of ObsDim, i.e. the freedom to choose which array dimension denotes the observations. The classified prediction will be the index of the largest element of an observation. By default the “obsdim” is defined as the last array dimension.
julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
0.1 0.4 0.3 0.2
0.8 0.3 0.6 0.2
0.1 0.3 0.1 0.6
julia> classify(pred_output, LabelEnc.OneOfK)
4-element Array{Int64,1}:
2
1
2
3
julia> classify(pred_output', LabelEnc.OneOfK, obsdim=1) # note the transpose
4-element Array{Int64,1}:
2
1
2
3
julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
3
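The "index of the largest element" rule can be sketched with Base Julia alone. The helper classify_columns is hypothetical and assumes that each column is one observation, which matches the default obsdim described above.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Classify each column (observation) by the row index of its largest entry.
classify_columns(mat) = [argmax(mat[:, j]) for j in 1:size(mat, 2)]

pred_output = [0.1 0.4 0.3 0.2;
               0.8 0.3 0.6 0.2;
               0.1 0.3 0.1 0.6]
classes = classify_columns(pred_output)
```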
How to … ?
Chances are you ended up here with a very specific use-case in mind. This section outlines a number of different but common scenarios and links to places that explain how this or a related package can be utilized to solve them.
What MLLabelUtils can do
- Infer which encoding some classification targets use.
- Assert if some classification targets are of the encoding I need them in.
- Convert targets into a specific encoding that my model requires.
- Group observations according to their class-label.
- Work with matrices in which the user can choose whether the rows or the columns denote the observations.
- Classify model predictions into class labels appropriate for the encoding of the targets.
What MLLabelUtils can NOT do (outsourced)
Getting Help
To get help on specific functionality you can either look up the information here, or if you prefer you can make use of Julia’s native doc-system. The following example shows how to get additional information on LabelEnc.OneOfK within Julia’s REPL:
?LabelEnc.OneOfK
If you find yourself stuck or have other questions concerning the package you can find us at gitter or the Machine Learning domain on discourse.julialang.org.
If you encounter a bug or would like to participate in the further development of this package come find us on GitHub.
API Documentation
This section gives a more detailed treatment of all the exposed functions and their available methods. We start by discussing what we mean by terms such as “classification targets” and the available functionality to compute properties about them.
Classification Targets
In this section we will outline the functionality that this package provides in order to work with classification targets. We will start by discussing the terms we use and how they are used in the context of this package.
Terms and Definitions
In a classification setting one usually treats the desired output variable (also called ground truths, or targets) as a discrete categorical variable. That is true even if the values themselves are of numerical type, which they often are for practical reasons.
We use the term targets when we talk about concrete data. Concretely, targets are the desired output of some dataset and are themselves part of the dataset. If a dataset includes targets we call it labeled data. In a labeled dataset, each observation has its own target. Thus we have as many targets as we have observations, as the target is treated as a part of each observation.
Tip
Let us look at an example of what targets could look like and how they relate to some dataset, or in this case data subset. The following code snippet loads the first 3 observations of the iris dataset using the RDatasets package.
julia> using RDatasets
julia> iris = head(dataset("datasets", "iris"), 3)
3×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼──────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ "setosa" │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ "setosa" │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ "setosa" │
For this data subset the targets would be ["setosa","setosa","setosa"]. Note how only one of the three available classes of the dataset is represented here.
The term “target” itself applies for both regression and classification scenarios. In a classification setting (which is the domain that this package operates in) the targets are treated as a discrete categorical variable. If the classification targets can just take one of two values, we call the classification problem binary, or two-class.
For our purposes we treat the term “class” as an abstract concept with little to no practical appearance in the functionality provided by this package. In essence we think about a class as the abstract interpretation behind some concrete value. For example: Let’s say we try to predict if a tumor is malignant or benign. The two classes could then be described as “malignant tumor” and “benign tumor”. One could argue that we could translate these abstract concepts into a string or symbol quite easily and thus make it concrete, but that is not the point. The point is that the concrete interpretation behind the prediction targets is of little consequence for the library and as such it should not talk about it.
Instead, this library cares about representation. The representation can vary a lot from one model to another, while the “class” remains the same. For example, some models require the targets in the form of numbers in the set \(\{1,0\}\), others in \(\{1,-1\}\) etc.
We call a concrete and consistent representation of a single class a label. That implies that each class should consistently be represented by a single label. What a label looks like is completely up to the user, but there are some forms that are more common than others. A convention of what labels to use to represent a fixed number of classes will be referred to as a label-encoding, or encoding for short.
Note
To be fair, the term “class-encoding” would be more appropriate. However, when considering that we need to use the defined terms for naming the functions and types, it seemed more reasonable (and user-friendly) to keep the list of utilized domain-specific words small and consistent.
Determine the Labels
Now that we settled on the terminology, let us investigate what kind of functionality this package provides to work with classification targets. The first thing we may be interested in is determining what kind of labels we are working with when presented with some targets.
In general we try to make few assumptions about the type of the object containing the targets, just that it supports unique. The functions listed here do, however, expect the object containing the targets to include all possible labels of the classification scenario.
label(iter) → Vector
Returns the labels represented in the given iterator iter. Note that the order of the resulting labels matters in general, because other functions expect the first label to denote the positive label for binary classification problems. Thus, for consistency reasons there are some heuristics involved that try to guarantee this for the common encodings.
Parameters: iter (Any) – Any object whose type either implements the iterator interface or provides a custom implementation of unique.
Returns: The unique labels in the form of a vector. In the case of two labels, the first element will represent the positive label and the second element the negative label.
julia> label([:yes,:no,:no,:maybe,:yes,:no])
3-element Array{Symbol,1}:
:yes
:no
:maybe
julia> label([-1,1,1,-1,1])
2-element Array{Int64,1}:
1
-1
As described above, we may mutate the order of the result of unique for consistency reasons in those cases that they describe a common binary label-encoding. The reason for this is that we want the first element to denote the positive label. The following example highlights the different results for unique and label() in the case of targets in “zero-one” form.
julia> unique([0,1,0,0,1])
2-element Array{Int64,1}:
0
1
julia> label([0,1,0,0,1])
2-element Array{Int64,1}:
1
0
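One way to picture such a reordering heuristic is sketched below. This is not the package's implementation. The helper ordered_labels is hypothetical, and the rule it uses (for exactly two numeric labels, put the larger one first) is an assumption that merely reproduces the examples shown in this section.

```julia
# Hypothetical helper for illustration; not the package's implementation.
# Assumption: for exactly two numeric labels, the larger one is treated as
# the positive label and is therefore placed first.
function ordered_labels(targets)
    labels = unique(targets)
    if length(labels) == 2 && eltype(labels) <: Number && labels[1] < labels[2]
        reverse!(labels)
    end
    return labels
end

zero_one = ordered_labels([0, 1, 0, 0, 1])
margin = ordered_labels([-1, 1, 1, -1, 1])
native = ordered_labels([:yes, :no, :no, :maybe, :yes, :no])
```

Non-numeric labels and label sets with more than two elements keep the order of first appearance, as unique returns them.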
While the generic iterator implementation covers most cases, we do selectively treat some iterators (such as Dict) differently, or even disallow some completely (such as any AbstractArray that has more than two dimensions).
label(dict) → Vector
Returns the keys of the dictionary in the form of a vector. The reasoning behind this convention for how to interpret the content of a Dict is that we utilize dictionaries to store label-specific information, such as the class-frequency (see labelfreq()). Note again that for consistency reasons there are heuristics in place that try to enforce the correct label-order for numeric label-vectors that have exactly two elements.
Parameters: dict (Dict) – Any Julia dictionary.
Returns: The unique labels in the form of a vector. In the case of two labels, the first element will represent the positive label and the second element the negative label.
We also treat matrices in a special way. The reason for this is that for our purposes it is not their values that encode the information about the labels, but their structure.
label(mat[, obsdim]) → Vector
Returns a vector that enumerates the dimension of the given matrix mat that does not denote the observations. In other words it returns the indices of that dimension.
Parameters:
- mat (AbstractMatrix) – A numeric array that is assumed to be in the form of a one-hot encoding or similar.
- obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword (Note: for this method the return-value will be type-stable either way). Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: A vector of indices that enumerate the particular dimension of mat that does not denote the observations.
julia> label([0 1 0 0; 1 0 1 0; 0 0 0 1])
3-element Array{Int64,1}:
1
2
3
julia> label([0 1 0; 1 0 0; 0 1 0; 0 0 1], obsdim = 1)
3-element Array{Int64,1}:
1
2
3
julia> label([0 1 0; 1 0 0; 0 1 0; 0 0 1], ObsDim.First()) # positional obsdim
3-element Array{Int64,1}:
1
2
3
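In other words, for matrices only the shape matters. A minimal sketch of this behaviour follows. The helper matrix_labels is hypothetical, and it takes obsdim as a plain integer keyword for simplicity rather than the ObsDim types used by the package.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Enumerate the dimension of `mat` that does NOT denote the observations.
function matrix_labels(mat::AbstractMatrix; obsdim::Int = 2)
    labeldim = obsdim == 2 ? 1 : 2
    return collect(1:size(mat, labeldim))
end

by_columns = matrix_labels([0 1 0 0; 1 0 1 0; 0 0 0 1])
by_rows = matrix_labels([0 1 0; 1 0 0; 0 1 0; 0 0 1], obsdim = 1)
```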
For convenience one can also just query for the label that corresponds to the positive class or the negative class respectively. These helper functions check if the given targets contain exactly two unique labels and will throw an ArgumentError if this assumption is violated.
poslabel(iter) → eltype(iter)
If label() returns a vector of length 2, then this function will return the first element of it, which denotes the positive label. Otherwise an error will be thrown.
julia> poslabel([-1,1,1,-1,1])
1
julia> poslabel([:yes,:no,:no,:maybe,:yes,:no])
ERROR: ArgumentError: The given object has more or less than two labels, thus poslabel is not defined.
neglabel(iter) → eltype(iter)
If label() returns a vector of length 2, then this function will return the second element of it, which denotes the negative label. Otherwise an error will be thrown.
julia> neglabel([-1,1,1,-1,1])
-1
julia> neglabel([:yes,:no,:no,:maybe,:yes,:no])
ERROR: ArgumentError: The given object has more or less than two labels, thus neglabel is not defined.
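The two-label check that both helpers rely on can be sketched as follows. The helper check_binary is hypothetical and its error message is made up for this illustration; the sketch deliberately leaves out the ordering of the two labels.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Return the unique labels, but only if there are exactly two of them.
function check_binary(targets)
    labels = unique(targets)
    length(labels) == 2 ||
        throw(ArgumentError("expected exactly two labels, found $(length(labels))"))
    return labels
end

two = check_binary([-1, 1, 1, -1, 1])
```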
Number of Labels
We can compute the number of unique labels using nlabel(). It works by first computing the labels and then counting them. As such it has the same restrictions as label().
nlabel(iter) → Int
Returns the number of labels represented in the given iterator iter. It uses the function label() internally, so the same properties and restrictions apply.
Parameters: iter (Any) – Any object for which the function label() is implemented.
julia> nlabel([:yes,:no,:no,:maybe,:yes,:no])
3
julia> nlabel([-1,1,1,-1,1])
2
Mapping Labels to Observations
In many classification scenarios we have to deal with what is called an imbalanced class distribution. In essence that means that some classes are represented more often in a given dataset than the other classes. While we won’t go into detail about the implications of such a scenario, the key takeaway is that there exist strategies to deal with those situations by using information about how the class-labels are distributed. More importantly, some of these strategies require a mapping from each label to all the observations that have that label as target. We call such a mapping from labels to observation-indices a label-map.
labelmap(iter) → Dict
Computes a mapping from the labels in iter to all the individual element-indices in iter that correspond to that label. Note that there is actually no check or requirement that iter must implement length or getindex. Instead, it is assumed that the first element of the iterator has the index 1 and the indices are incremented by 1 with each element of the iterator.
Parameters: iter (Any) – Any object whose type implements the iterator interface.
Returns: A dictionary that for each label as key, has a vector as value that contains all indices of the observations that observed that label.
julia> labelmap([0, 1, 1, 0, 0])
Dict{Int64,Array{Int64,1}} with 2 entries:
0 => [1,4,5]
1 => [2,3]
julia> labelmap([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Array{Int64,1}} with 3 entries:
:yes => [1,5]
:maybe => [4]
:no => [2,3,6]
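A label-map of this kind can be built with a single pass over the targets. The sketch below uses only Base Julia; the helper build_labelmap is hypothetical and mirrors the indexing convention described above (first element has index 1).

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
function build_labelmap(iter)
    lm = Dict{eltype(iter), Vector{Int}}()
    for (idx, elem) in enumerate(iter)
        # get! creates the index vector the first time a label is seen
        push!(get!(() -> Int[], lm, elem), idx)
    end
    return lm
end

lm = build_labelmap([0, 1, 1, 0, 0])
```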
We also provide a mutating version to update an existing label-map. In that case we also have to specify the index (or indices) of the new observation(s).
labelmap!(dict, idx, elem) → Dict
Updates the given label-map dict with the new element elem, which is assumed to be associated with the index idx. Note that the given index is not checked for being a duplicate.
Parameters:
- dict (Dict) – The dictionary that may or may not already contain existing label-mapping. It will be updated with the new element.
- idx (Int) – The observation-index that elem corresponds to in the context of the overall dataset.
- elem (Any) – The new target of the observation denoted by idx. It is expected to be in the form of a label.
Returns: Returns the mutated dict for convenience.
julia> lm = labelmap([0, 1, 1, 0, 0])
Dict{Int64,Array{Int64,1}} with 2 entries:
0 => [1,4,5]
1 => [2,3]
julia> labelmap!(lm, 6, 0)
Dict{Int64,Array{Int64,1}} with 2 entries:
0 => [1,4,5,6]
1 => [2,3]
labelmap!(dict, indices, iter) → Dict
Updates the given label-map dict with the new elements in the given iterator iter. Each element in iter is assumed to be associated with the corresponding index in indices. This implies that both iter and indices must provide the same number of elements. Note that the given indices are not checked for being duplicates.
Parameters:
- dict (Dict) – The dictionary that may or may not already contain existing label-mapping. It will be updated with the new elements in iter.
- indices (AbstractVector{Int}) – The indices for each element in iter.
- iter (Any) – Any object whose type implements the iterator interface.
Returns: Returns the mutated dict for convenience.
julia> lm = labelmap([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Array{Int64,1}} with 3 entries:
:yes => [1,5]
:maybe => [4]
:no => [2,3,6]
julia> labelmap!(lm, 7:8, [:no,:maybe])
Dict{Symbol,Array{Int64,1}} with 3 entries:
:yes => [1,5]
:maybe => [4,8]
:no => [2,3,6,7]
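The in-place update can be sketched in the same spirit. The helper update_labelmap! is hypothetical. Like the documented behaviour, it does not check for duplicate indices; unlike the package, it silently stops at the shorter of the two inputs because it relies on zip.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
function update_labelmap!(lm::Dict, indices, iter)
    # zip pairs each new target with its index; it stops at the shorter input
    for (idx, elem) in zip(indices, iter)
        push!(get!(() -> Int[], lm, elem), idx)
    end
    return lm
end

lm = Dict(:yes => [1, 5], :maybe => [4], :no => [2, 3, 6])
update_labelmap!(lm, 7:8, [:no, :maybe])
```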
There is also a convenience function to reverse a label-map into a label vector.
labelmap2vec(dict) → Vector
Computes a Vector of labels by element-wise traversal of the entries in dict.
Parameters: dict (Dict{T,Vector{Int}}) – A label-map with labels of type T.
Returns: A Vector{T} of labels.
julia> labelvec = [:yes,:no,:no,:yes,:yes]
julia> lm = labelmap(labelvec)
Dict{Symbol,Array{Int64,1}} with 2 entries:
:yes => [1, 4, 5]
:no => [2, 3]
julia> labelmap2vec(lm)
5-element Array{Symbol,1}:
:yes
:no
:no
:yes
:yes
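Reversing a label-map amounts to writing each label back to every index stored for it. The sketch below uses the hypothetical helper labelmap_to_vec and assumes that the stored indices cover 1 to n exactly once, which holds for any label-map built from a plain vector.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Assumption: the indices stored in `lm` cover 1:n without gaps or duplicates.
function labelmap_to_vec(lm::Dict{T, Vector{Int}}) where {T}
    n = mapreduce(length, +, values(lm); init = 0)
    out = Vector{T}(undef, n)
    for (label, indices) in lm
        for idx in indices
            out[idx] = label
        end
    end
    return out
end

labelvec = labelmap_to_vec(Dict(:yes => [1, 4, 5], :no => [2, 3]))
```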
Frequency of Labels
Another useful piece of information to compute is the absolute frequency of each label in the dataset of interest. In contrast to labelmap(), this function does not care about indices but instead simply counts occurrences. We call such a dictionary a frequency-map.
labelfreq(iter) → Dict
Computes the absolute frequency of each label in iter and adds it as a key (label) value (count) pair to the resulting dictionary.
Parameters: iter (Any) – Any object whose type implements the iterator interface.
Returns: A dictionary that for each label as key, has an Int as value that denotes how often the corresponding label was encountered in iter.
julia> labelfreq([0, 1, 1, 0, 0])
Dict{Int64,Int64} with 2 entries:
0 => 3
1 => 2
julia> labelfreq([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Int64} with 3 entries:
:yes => 2
:maybe => 1
:no => 3
If you have already created a mapping using labelmap(), then you can reuse that dictionary to compute the frequencies more efficiently.
labelfreq(dict) → Dict
Converts a label-map to a frequency map by counting the number of indices associated with each label.
Parameters: dict (Dict) – A dictionary produced by labelmap().
Returns: A dictionary that for each label as key, has an Int as value that denotes how many indices were stored in dict for the corresponding label.
julia> lm = labelmap([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Array{Int64,1}} with 3 entries:
:yes => [1,5]
:maybe => [4]
:no => [2,3,6]
julia> labelfreq(lm)
Dict{Symbol,Int64} with 3 entries:
:yes => 2
:maybe => 1
:no => 3
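Both routes to a frequency-map are easy to sketch in Base Julia. The helpers count_labels and freq_from_map are hypothetical. The second one shows why reusing a label-map is cheaper: it only needs the length of each index vector and never revisits the individual targets.

```julia
# Hypothetical helpers for illustration; not part of MLLabelUtils.

# Count occurrences by iterating over the targets once.
function count_labels(iter)
    freq = Dict{eltype(iter), Int}()
    for elem in iter
        freq[elem] = get(freq, elem, 0) + 1
    end
    return freq
end

# Derive the counts from an existing label-map.
freq_from_map(lm::Dict) = Dict(label => length(indices) for (label, indices) in lm)

direct = count_labels([:yes, :no, :no, :maybe, :yes, :no])
derived = freq_from_map(Dict(:yes => [1, 5], :maybe => [4], :no => [2, 3, 6]))
```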
For some data sources it may not be useful or even possible to associate an observation with an index (e.g. streaming data). For such cases it may still prove useful to continuously keep track of the number of times each label was encountered. To that end we provide a mutating version that updates a frequency-map in-place.
labelfreq!(dict, iter) → Dict
Updates the given frequency-map dict with the number of times each label occurs in the given iterator iter. Note that these occurrences are added to the current values.
Parameters:
- dict (Dict) – The dictionary that may or may not already contain existing frequency information. It will be updated with the new elements in iter.
- iter (Any) – Any object whose type implements the iterator interface.
Returns: Returns the mutated dict for convenience.
julia> lf = labelfreq([:yes,:no,:no,:maybe,:yes,:no])
Dict{Symbol,Int64} with 3 entries:
:yes => 2
:maybe => 1
:no => 3
julia> labelfreq!(lf, [:no,:maybe])
Dict{Symbol,Int64} with 3 entries:
:yes => 2
:maybe => 2
:no => 4
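For the streaming scenario mentioned above, the update would typically be applied once per incoming batch. The sketch below illustrates that pattern with the hypothetical helper update_freq!; the batches are made-up data.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
function update_freq!(freq::Dict, iter)
    for elem in iter
        freq[elem] = get(freq, elem, 0) + 1  # add to the current count
    end
    return freq
end

freq = Dict{Symbol, Int}()
# Made-up batches standing in for a stream of targets.
for batch in ([:yes, :no], [:no, :maybe], [:no])
    update_freq!(freq, batch)
end
```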
Next we focus on label-encodings. We will show how to create them and how they can be used to transform classification targets from one encoding-convention to another. Some even define methods for a classification function that can be used to transform raw model-predictions into a class-label.
Working with Encodings
Now that we have an understanding of how to extract the label-related information from our targets, let us consider how to instantiate (or infer) a label-encoding, and what we can do with it once we have one. In particular, these encodings will enable us to transform the targets from one representation into another without losing the ability to convert them back afterwards.
Inferring the Encoding
In many cases we may not want to simply assume or guess the particular encoding that some user-provided targets are in. Instead we would rather let the targets themselves inform us what encoding they are using. To that end we provide the function labelenc().
labelenc(vec) → LabelEncoding
Tries to determine the most appropriate label-encoding to describe the given vector vec, based on the result of label(vec). Note that in most cases this function is not type-stable, because the eltype of vec is usually not enough to infer the encoding or number of labels reliably.
Parameters: vec (AbstractVector) – The classification targets in vector form.
Returns: The label-encoding that is deemed most appropriate to describe the values found in vec.
julia> labelenc([:yes,:no,:no,:maybe,:yes,:no])
MLLabelUtils.LabelEnc.NativeLabels{Symbol,3}(Symbol[:yes,:no,:maybe],Dict(:yes=>1,:maybe=>3,:no=>2))
julia> labelenc([-1,1,1,-1,1])
MLLabelUtils.LabelEnc.MarginBased{Int64}()
julia> labelenc(UInt8[0,1,1,0,1])
MLLabelUtils.LabelEnc.ZeroOne{UInt8,Float64}(0.5)
julia> labelenc([false,true,true,false,true])
MLLabelUtils.LabelEnc.TrueFalse()
For matrices we allow an additional (but optional) parameter with which the user can specify the array dimension that denotes the observations.
labelenc(mat[, obsdim]) → LabelEncoding
Computes the concrete matrix-based label-encoding that is used, by determining the size of the matrix for the dimension that is not used for denoting the observations.
Parameters:
- mat (AbstractMatrix) – A numeric matrix that is assumed to be in the form of a one-hot encoding or similar.
- obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: The label-encoding that is deemed most appropriate to describe the structure and values found in mat.
julia> labelenc([0 1 0 0; 1 0 1 0; 0 0 0 1])
MLLabelUtils.LabelEnc.OneOfK{Int64,3}()
julia> labelenc(Float32[0 1; 1 0; 0 1; 0 1], obsdim = 1)
MLLabelUtils.LabelEnc.OneOfK{Float32,2}()
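To give a feeling for how value-based inference of this kind might work for vectors, here is a rough sketch. It is not the package's heuristic. The helper guess_encoding is hypothetical, it returns plain symbols instead of encoding objects, and the rules it applies are assumptions chosen to agree with the vector examples above.

```julia
# Hypothetical helper for illustration; not the package's heuristic.
# Returns a Symbol naming the encoding family that the targets resemble.
function guess_encoding(targets::AbstractVector)
    labels = Set(targets)
    if eltype(targets) == Bool
        return :TrueFalse            # checked first, since true == 1
    elseif labels == Set([0, 1])
        return :ZeroOne
    elseif labels == Set([-1, 1])
        return :MarginBased
    else
        return :NativeLabels         # fall back to the labels as given
    end
end

guesses = [guess_encoding([:yes, :no, :no, :maybe, :yes, :no]),
           guess_encoding([-1, 1, 1, -1, 1]),
           guess_encoding(UInt8[0, 1, 1, 0, 1]),
           guess_encoding([false, true, true, false, true])]
```

Because the result depends on the values and not only on the element type, a function of this shape cannot be type-stable, which is the same limitation noted for labelenc().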
Asserting Assumptions
When writing a function that requires the classification targets to be in a specific encoding (for example \(\{1, -1\}\) in the case of SVMs), it can be useful to check if the user-provided targets are already in the appropriate encoding, or if they first have to be converted. To check if the targets are of a specific encoding, or family of encodings, we provide the function islabelenc().
islabelenc(vec, encoding) → Bool
Checks if the given values in vec can be described as being produced by the given encoding. This function does not only check the values but also the type. Furthermore it also checks if the total number of labels is appropriate for what the encoding expects it to be.
Parameters:
- vec (AbstractVector) – The classification targets in vector form.
- encoding (LabelEncoding) – A concrete instance of a label-encoding that one wants to work with.
Returns: True, if both the values in vec as well as their types are consistent with the given encoding.
julia> islabelenc([0,1,1,0,1], LabelEnc.ZeroOne(Int))
true
julia> islabelenc([0,1,1,0,1], LabelEnc.ZeroOne(Float64))
false
julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased(Int))
false
julia> islabelenc(Int8[-1,1,1,-1,1], LabelEnc.MarginBased(Int8))
true
julia> islabelenc(Int8[-1,1,1,-1,1], LabelEnc.MarginBased(Int16))
false
julia> islabelenc([2,1,2,3,1], LabelEnc.Indices(Int,3))
true
julia> islabelenc([2,1,2,3,1], LabelEnc.Indices(Int,4)) # it allows missing labels
true
julia> islabelenc([2,1,2,3,1], LabelEnc.Indices(Int,2)) # more labels than expected
false
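The strict check for an index-based encoding can be sketched as a type test plus a range test. The helper is_indices is hypothetical and only models the behaviour visible in the LabelEnc.Indices examples above: missing labels are tolerated, labels outside 1 to k are not.

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
# Strict check: element type must match T and every value must lie in 1:k.
is_indices(targets::AbstractVector, ::Type{T}, k::Integer) where {T} =
    eltype(targets) == T && all(t -> 1 <= t <= k, targets)

exact = is_indices([2, 1, 2, 3, 1], Int, 3)
missing_ok = is_indices([2, 1, 2, 3, 1], Int, 4)   # label 4 never occurs
too_many = is_indices([2, 1, 2, 3, 1], Int, 2)     # label 3 is out of range
wrong_type = is_indices(Int8[2, 1, 2, 3, 1], Int, 3)
```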
Similar to label() we treat matrices in a special way to account for the fact that information about the number of labels is contained in the size of a matrix and not its values. Additionally the user has the freedom to choose which matrix dimension denotes the observations.
islabelenc(mat, encoding[, obsdim]) → Bool
Checks if the values and the structure of the given matrix mat are consistent with the specified encoding. This function also checks for the correct type and dimensions.
Parameters:
- mat (AbstractMatrix) – The classification targets in matrix form.
- encoding (LabelEncoding) – A concrete instance of a matrix-based label-encoding that one wants to work with.
- obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: True, if the values in mat, its eltype, and the shape of mat are consistent with the given encoding.
julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int,3))
true
julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int8,3))
false
julia> islabelenc([1 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int,3)) # matrix is not one-hot
false
julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK(Int,4)) # only 3 rows
false
julia> islabelenc([0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK(Int,2), obsdim = 1)
true
julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK(Int,2), obsdim = 1)
false
julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK(UInt8,2), obsdim = 1)
true
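The matrix checks above combine three conditions: the element type, the size of the label dimension, and the one-hot property itself. A minimal sketch, using the hypothetical helper is_onehot with a plain integer obsdim keyword, could look like this:

```julia
# Hypothetical helper for illustration; not part of MLLabelUtils.
function is_onehot(mat::AbstractMatrix, ::Type{T}, k::Integer; obsdim::Int = 2) where {T}
    eltype(mat) == T || return false                  # strict element type
    labeldim = obsdim == 2 ? 1 : 2
    size(mat, labeldim) == k || return false          # one slot per label
    all(x -> x == 0 || x == 1, mat) || return false   # only zeros and ones
    # every observation must have exactly one entry set to 1
    return all(s -> s == 1, sum(mat, dims = labeldim))
end

ok = is_onehot([0 1 0 0; 1 0 1 0; 0 0 0 1], Int, 3)
not_onehot = is_onehot([1 1 0 0; 1 0 1 0; 0 0 0 1], Int, 3)
rows_as_obs = is_onehot([0 1; 1 0; 0 1; 0 1], Int, 2, obsdim = 1)
```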
So far islabelenc()
was very restrictive concerning the element
types of the given target array. In many cases, however, we may not
actually care too much about the concrete numeric type but only if
the encoding-scheme itself is followed. In fact we usually don’t
want to be restrictive about concrete types at all, since we
have Julia’s multiple-dispatch system to take care of that later on.
In other words we may be more interested in asserting if the labels
of the given targets belong to a family of possible label-encodings.
-
islabelenc
(vec, type) → Bool Checks if the given values in vec can be described as being produced by any possible instance of the given type. In other words, this function checks if the labels in vec can be described as being consistent with the family of label-encodings specified by type. This means that the check is much more tolerant concerning the eltype and the total number of labels, since some families of encodings are appropriate for any number of labels.
Parameters: - vec (AbstractVector) – The classification targets in vector form.
- type (DataType) – Any subtype of
LabelEncoding{T,K,1}
Returns: True, if the values in vec are consistent with the given family of encodings specified by type.
julia> islabelenc([0,1,1,0,1], LabelEnc.ZeroOne)
true
julia> islabelenc(UInt8[0,1,1,0,1], LabelEnc.ZeroOne)
true
julia> islabelenc([0,1,1,0,1], LabelEnc.MarginBased)
false
julia> islabelenc(Float32[-1,1,1,-1,1], LabelEnc.MarginBased)
true
julia> islabelenc(Int8[-1,1,1,-1,1], LabelEnc.MarginBased)
true
julia> islabelenc([2,1,2,3,1], LabelEnc.Indices)
true
julia> islabelenc(Int8[2,1,2,3,1], LabelEnc.Indices)
true
julia> islabelenc(Int8[2,1,2,3,1], LabelEnc.Indices{Int}) # restrict type but not nlabels
false
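The idea of a family-level check can be sketched in plain Julia as a test on the values alone, ignoring the element type. The helpers `toy_iszeroone` and `toy_ismarginbased` below are hypothetical and only approximate what islabelenc does. The real function may treat edge cases differently.

```julia
# Hypothetical helpers, not part of MLLabelUtils.
# They compare values numerically, so any numeric eltype is accepted.
toy_iszeroone(vec::AbstractVector)     = all(x -> x == 0 || x == 1, vec)
toy_ismarginbased(vec::AbstractVector) = all(x -> x == -1 || x == 1, vec)

toy_iszeroone(UInt8[0, 1, 1, 0, 1])          # values fit the 0/1 family
toy_ismarginbased(Float32[-1, 1, 1, -1, 1])  # values fit the -1/1 family
toy_ismarginbased([0, 1, 1, 0, 1])           # 0 is not a margin-based label
```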
We again provide a special version for matrices.
-
islabelenc
(mat, type[, obsdim]) → Bool Checks if the values and the structure of the given matrix mat can be described as being produced by any possible instance of the given type. This means that the check is much more tolerant concerning the eltype and the size of the matrix, since some families of encodings are appropriate for any number of labels.
Parameters: - mat (AbstractMatrix) – The classification targets in matrix form.
- type (DataType) – Any subtype of
LabelEncoding{T,K,2}
- obsdim (ObsDimension) – Optional. Denotes which of the two array dimensions of mat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: True, if the values in mat are consistent with the given family of encodings specified by type.
julia> islabelenc([0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK)
true
julia> islabelenc(Int8[0 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK)
true
julia> islabelenc([1 1 0 0; 1 0 1 0; 0 0 0 1], LabelEnc.OneOfK) # matrix is not one-hot
false
julia> islabelenc([0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK, obsdim = 1)
true
julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK, obsdim = 1)
true
julia> islabelenc(UInt8[0 1; 1 0; 0 1; 0 1], LabelEnc.OneOfK{Int32}, obsdim = 1) # restrict type but not nlabels
false
Properties of an Encoding¶
Once we have an instance of some label-encoding, we can compute a number of useful properties about it. For example we can query all the labels that an encoding uses to represent the classes.
-
label
(encoding) → Vector¶ Returns all the labels that a specific encoding uses in their appropriate order.
Parameters: encoding (LabelEncoding) – The specific label-encoding.
Returns: The unique labels in the form of a vector. In the case of two labels, the first element will represent the positive label and the second element the negative label respectively.
julia> label(LabelEnc.ZeroOne(UInt8))
2-element Array{UInt8,1}:
0x01
0x00
julia> label(LabelEnc.MarginBased())
2-element Array{Float64,1}:
1.0
-1.0
julia> label(LabelEnc.Indices(Float32,5))
5-element Array{Float32,1}:
1.0
2.0
3.0
4.0
5.0
For convenience one can also just query for the label that
corresponds to the positive class or the negative class respectively.
These helper functions are only defined for binary label-encoding and
will throw a MethodError
for multi-class encodings.
-
poslabel
(encoding)¶ If the encoding is binary it will return the positive label of it. The function will throw an error otherwise.
Parameters: encoding (LabelEncoding) – The specific label-encoding.
Returns: The value representing the positive label of the given encoding in the appropriate type.
julia> poslabel(LabelEnc.ZeroOne(UInt8))
0x01
julia> poslabel(LabelEnc.MarginBased())
1.0
julia> poslabel(LabelEnc.Indices(Float32,2))
1.0f0
julia> poslabel(LabelEnc.Indices(Float32,5))
ERROR: MethodError: no method matching poslabel(::MLLabelUtils.LabelEnc.Indices{Float32,5})
-
neglabel
(encoding)¶ If the encoding is binary it will return the negative label of it. The function will throw an error otherwise.
Parameters: encoding (LabelEncoding) – The specific label-encoding.
Returns: The value representing the negative label of the given encoding in the appropriate type.
julia> neglabel(LabelEnc.ZeroOne(UInt8))
0x00
julia> neglabel(LabelEnc.MarginBased())
-1.0
julia> neglabel(LabelEnc.Indices(Float32,2))
2.0f0
julia> neglabel(LabelEnc.Indices(Float32,5))
ERROR: MethodError: no method matching neglabel(::MLLabelUtils.LabelEnc.Indices{Float32,5})
We can also query the number of labels that a concrete encoding uses. In other words we can query the number of classes the given label-encoding is able to represent.
-
nlabel
(encoding) → Int¶ Returns the number of labels that a specific encoding uses.
Parameters: encoding (LabelEncoding) – The specific label-encoding.
julia> nlabel(LabelEnc.ZeroOne(UInt8))
2
julia> nlabel(LabelEnc.NativeLabels([:a,:b,:c]))
3
More interestingly, we can infer the number of labels for a family of encodings. This allows for some compile time decisions, but only works for some types of encodings (i.e. binary).
-
nlabel
(type) → Int Returns the number of labels that the family of encodings type can describe. Note that this function will fail if the number of labels can not be inferred from the given type.
Parameters: type (DataType) – Some subtype of LabelEncoding{T,K,M} with a fixed K
Returns: The type-parameter K of type.
julia> nlabel(LabelEnc.ZeroOne)
2
julia> nlabel(LabelEnc.NativeLabels)
ERROR: ArgumentError: number of labels could not be inferred for the given type
We can also query a family of encodings for their label-type. In this case we decided to not throw an error if the type can not be inferred but instead return the most specific abstract type.
-
labeltype
(type) → DataType¶ Determine the type of the labels represented by the given family of label-encoding. If the type can not be inferred then Any is returned.
Parameters: type (DataType) – Some subtype of LabelEncoding{T,K,M}
Returns: The type-parameter T of type if specified, or the most specific abstract type otherwise.
julia> labeltype(LabelEnc.TrueFalse)
Bool
julia> labeltype(LabelEnc.ZeroOne{Int})
Int64
julia> labeltype(LabelEnc.ZeroOne)
Number
julia> labeltype(LabelEnc.NativeLabels)
Any
Converting to/from Indices¶
As stated before, the order of the labels returned by label() matters.
In a binary setting, for example, the first label is interpreted as
the positive class and the second label as the negative class.
This is simply the arbitrary convention that we follow.
That said, even in a multi-class setting it is important to be
consistent with the ordering. This is crucial in order to make sure
that converting to a different encoding and then converting back
yields the original values.
Every encoding understands the concept of a label-index,
which is a unique representation of a class that all encodings share.
For example the positive label of a binary label-encoding always
has the label-index 1
and the negative 2
respectively.
To convert a label-index into the label that a specific encoding uses
to represent the underlying class we provide the function
ind2label()
.
-
ind2label
(index, encoding)¶ Converts the given index into the corresponding label defined by the encoding. Note that in the binary case, index = 1 represents the positive label and index = 2 the negative label.
This function supports broadcasting.
Parameters: - index (Int) – Index of the desired label. This variable can be specified either as an Int or as a Val. Note that indices are one-based.
- encoding (LabelEncoding) – The encoding one wants to get the label from.
Returns: The label of the specified index for the specified encoding.
julia> ind2label(1, LabelEnc.MarginBased(Float32))
1.0f0
julia> ind2label(Val{1}, LabelEnc.MarginBased(Float32))
1.0f0
julia> ind2label(2, LabelEnc.MarginBased(Float32))
-1.0f0
julia> ind2label(3, LabelEnc.OneOfK(Int8,4))
4-element Array{Int8,1}:
0
0
1
0
julia> ind2label(3, LabelEnc.NativeLabels([:a,:b,:c,:d]))
:c
julia> ind2label.([1,2,2,1], LabelEnc.ZeroOne(UInt8)) # broadcast support
4-element Array{UInt8,1}:
0x01
0x00
0x00
0x01
We also provide an inverse function for converting a label of a specific encoding into the corresponding label-index. Note that this function does not check if the given label is of the expected type, but simply that it is of the appropriate value.
-
label2ind
(label, encoding) → Int¶ Converts the given label into the corresponding index defined by the encoding. Note that in the binary case, the positive label will result in the index 1 and the negative label in the index 2 respectively.
This function supports broadcasting.
Parameters: - label (Any) – A label in the format familiar to the encoding.
- encoding (LabelEncoding) – The encoding to compute the label-index with.
Returns: The index of the specified label for the specified encoding.
julia> label2ind(1.0, LabelEnc.MarginBased())
1
julia> label2ind(-1.0, LabelEnc.MarginBased())
2
julia> label2ind([0,0,1,0], LabelEnc.OneOfK(4))
3
julia> label2ind(:c, LabelEnc.NativeLabels([:a,:b,:c,:d]))
3
julia> label2ind.([1,0,0,1], LabelEnc.ZeroOne()) # broadcast support
4-element Array{Int64,1}:
1
2
2
1
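The role of the label-index as a shared intermediate representation can be sketched in plain Julia. In this toy model an encoding is nothing more than an ordered vector of labels, and the names `toy_ind2label`, `toy_label2ind`, and `toy_convert` are hypothetical stand-ins rather than the package's implementation. The sketch shows why a consistent ordering makes a round trip return the original values.

```julia
# Toy encodings: each is just an ordered vector of labels,
# where position i holds the label for label-index i.
margin_labels = [1.0, -1.0]     # positive label first, then negative
native_labels = [:yes, :no]

# Hypothetical stand-ins for the two conversion directions.
toy_ind2label(i::Integer, labels) = labels[i]
toy_label2ind(x, labels) = findfirst(l -> l == x, labels)

# Convert a label by going through the shared label-index.
toy_convert(dst, x, src) = toy_ind2label(toy_label2ind(x, src), dst)

y    = [-1.0, 1.0, 1.0, -1.0]
z    = [toy_convert(native_labels, v, margin_labels) for v in y]
back = [toy_convert(margin_labels, v, native_labels) for v in z]
```

Here `z` holds the symbols that share a label-index with the entries of `y`, and `back` is equal to `y` again because both label vectors list the positive class first.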
Converting between Encodings¶
In the case that the given targets are not in the encoding that your
algorithm expects them to be in, you may want to convert them into the
format you require.
For that purpose we expose the function convertlabel()
.
-
convertlabel
(dst_encoding, src_label, src_encoding)¶ Converts the given input label src_label from src_encoding into the corresponding label described by the desired output encoding dst_encoding.
Note that both encodings are expected to be vector-based, meaning that this method does not work for LabelEnc.OneOfK. It does, however, support broadcasting.
Parameters: - dst_encoding (LabelEncoding) – The vector-based label-encoding that should be used to produce the output label.
- src_label (Any) – The input label one wants to convert. It is expected to be consistent with src_encoding.
- src_encoding (LabelEncoding) – A vector-based label-encoding that is assumed to have produced the given src_label.
Returns: The label from dst_encoding that corresponds to src_label in src_encoding
julia> convertlabel(LabelEnc.OneOfK(2), -1, LabelEnc.MarginBased()) # OneOfK is not vector-based
ERROR: MethodError: no method matching [...]
julia> convertlabel(LabelEnc.NativeLabels([:a,:b,:c,:d]), 3, LabelEnc.Indices(4))
:c
julia> convertlabel(LabelEnc.ZeroOne(), :yes, LabelEnc.NativeLabels([:yes,:no]))
1.0
julia> convertlabel(LabelEnc.ZeroOne(), :no, LabelEnc.NativeLabels([:yes,:no]))
0.0
julia> convertlabel(LabelEnc.MarginBased(Int), 0, LabelEnc.ZeroOne())
-1
julia> convertlabel(LabelEnc.NativeLabels([:a,:b]), -1, LabelEnc.MarginBased())
:b
julia> convertlabel.(LabelEnc.NativeLabels([:a,:b]), [-1,1,1,-1], LabelEnc.MarginBased()) # broadcast support
4-element Array{Symbol,1}:
:b
:a
:a
:b
Aside from the one broadcast-able method that is implemented for
converting single labels, we provide a range of methods that work on
whole arrays.
These are more flexible because by having an array as input these
methods have more information available to make reasonable
decisions. As a consequence, we can consider the
“source-encoding” parameter optional, because these methods can
now make use of labelenc()
internally to infer it
automatically.
-
convertlabel
(dst_encoding, arr[, src_encoding][, obsdim]) Converts the given array arr from the src_encoding into the dst_encoding. If src_encoding is not specified it will be inferred automatically using the function labelenc(). This should not negatively influence type-inference.
Note that both encodings should have the same number of labels, or a MethodError will be thrown in most cases.
Parameters: - dst_encoding (LabelEncoding) – The desired output format.
- arr (AbstractArray) – The input targets that should be converted into the encoding specified by dst_encoding.
- src_encoding (LabelEncoding) – The input encoding that arr is expected to be in.
- obsdim (ObsDimension) – Optional. Only possible if one of the two encodings is a matrix-based encoding. Defines which of the two array dimensions denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: A converted version of arr using the specified output encoding dst_encoding.
julia> convertlabel(LabelEnc.NativeLabels([:yes,:no]), [-1,1,-1,1,1,-1])
6-element Array{Symbol,1}:
:no
:yes
:no
:yes
:yes
:no
julia> convertlabel(LabelEnc.OneOfK(Float32,2), [-1,1,-1,1,1,-1])
2×6 Array{Float32,2}:
0.0 1.0 0.0 1.0 1.0 0.0
1.0 0.0 1.0 0.0 0.0 1.0
julia> convertlabel(LabelEnc.TrueFalse(), [-1,1,-1,1,1,-1])
6-element Array{Bool,1}:
false
true
false
true
true
false
julia> convertlabel(LabelEnc.Indices(3), [:no,:maybe,:yes,:no], LabelEnc.NativeLabels([:yes,:maybe,:no]))
4-element Array{Int64,1}:
3
2
1
3
It may be interesting to point out explicitly that we provide
special treatment for LabelEnc.OneVsRest
to conveniently
convert a multi-class problem into a two-class problem.
julia> convertlabel(LabelEnc.OneVsRest(:yes), [:yes,:no,:no,:maybe,:yes,:yes])
6-element Array{Symbol,1}:
:yes
:not_yes
:not_yes
:not_yes
:yes
:yes
julia> convertlabel(LabelEnc.ZeroOne(Float64), [:yes,:no,:no,:maybe,:yes,:yes], LabelEnc.OneVsRest(:yes))
6-element Array{Float64,1}:
1.0
0.0
0.0
0.0
1.0
1.0
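The one-vs-rest rule itself is simple enough to sketch in plain Julia. The helper `toy_onevsrest` below is hypothetical and only mirrors the behaviour shown in the two examples above: a value equal to the positive label stays positive, and everything else is mapped to the negative placeholder.

```julia
# Hypothetical helper, not part of MLLabelUtils: anything that is not
# equal to `pos` is replaced by the placeholder `neg`.
toy_onevsrest(y, pos, neg) = [v == pos ? pos : neg for v in y]

targets = [:yes, :no, :no, :maybe, :yes, :yes]

# multi-class symbols reduced to two symbols
binary = toy_onevsrest(targets, :yes, :not_yes)

# the same rule, written directly to a 0/1 numeric encoding
zeroone = [v == :yes ? 1.0 : 0.0 for v in targets]
```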
We also allow a more concise way to specify that you are using a
LabelEnc.NativeLabels
encoding by just passing the
label-vector directly, that you would normally pass to its
constructor.
julia> convertlabel([:yes,:no], [-1,1,-1,1,1,-1])
6-element Array{Symbol,1}:
:no
:yes
:no
:yes
:yes
:no
julia> convertlabel(LabelEnc.Indices(3), [:no,:maybe,:yes,:no], [:yes,:maybe,:no])
4-element Array{Int64,1}:
3
2
1
3
In many cases it can be inconvenient that one has to explicitly
specify the label-type and number of labels for the desired
output-encoding. To that end we also allow the output-encoding
to be specified in terms of an encoding-family (i.e. as DataType
).
-
convertlabel
(dst_family, arr[, src_encoding][, obsdim]) Converts the given array arr from the src_encoding into some concrete label-encoding that is a subtype of dst_family. This way the method tries to preserve the eltype of arr if it is numeric. Furthermore, the concrete number of labels need not be specified explicitly, but will instead be inferred from src_encoding.
If src_encoding is not specified it will be inferred automatically using the function labelenc(). This should not negatively influence type-inference.
Parameters: - dst_family (DataType) – Any subtype of LabelEncoding{T,K,M}. It denotes the desired family of label-encodings that one wants the return value to be in.
- arr (AbstractArray) – The input targets that should be converted into some encoding specified by the type dst_family.
- src_encoding (LabelEncoding) – The input encoding that arr is expected to be in.
- obsdim (ObsDimension) – Optional. Only possible if one of the two encodings is a matrix-based encoding. Defines which of the two array dimensions denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: A converted version of arr using a label-encoding that is a member of the encoding-family dst_family.
julia> convertlabel(LabelEnc.OneOfK, Int8[-1,1,-1,1,1,-1])
2×6 Array{Int8,2}:
0 1 0 1 1 0
1 0 1 0 0 1
julia> convertlabel(LabelEnc.OneOfK{Float32}, Int8[-1,1,-1,1,1,-1], obsdim = 1)
6×2 Array{Float32,2}:
0.0 1.0
1.0 0.0
0.0 1.0
1.0 0.0
1.0 0.0
0.0 1.0
julia> convertlabel(LabelEnc.TrueFalse, [-1,1,-1,1,1,-1])
6-element Array{Bool,1}:
false
true
false
true
true
false
julia> convertlabel(LabelEnc.Indices, [:no,:maybe,:yes,:no], LabelEnc.NativeLabels([:yes,:maybe,:no]))
4-element Array{Int64,1}:
3
2
1
3
For vector-based encodings (which means all except
LabelEnc.OneOfK
), we provide a lazy version of
convertlabel()
that does not allocate a new array for the
outputs, but instead creates a
MappedArray
into the original targets.
-
convertlabelview
(dst_encoding, vec[, src_encoding])¶ Creates a MappedArray that provides a lazy view into vec, which makes it look like the values are actually of the provided output encoding dst_encoding. This means that the conversion happens on the fly when an element of the resulting mapped array is accessed. The resulting mapped array will even be writeable, unless src_encoding is LabelEnc.OneVsRest.
Note that both encodings are expected to be vector-based, meaning that this method does not work for LabelEnc.OneOfK.
Parameters: - dst_encoding (LabelEncoding) – The desired vector-based output encoding.
- vec (AbstractVector) – The input targets that one wants to convert using dst_encoding. It is expected to be consistent with src_encoding.
- src_encoding (LabelEncoding) – A vector-based label-encoding that is assumed to have produced the values in vec.
Returns: A MappedArray or ReadonlyMappedArray that makes vec look like it is in the encoding specified by dst_encoding.
julia> true_targets = [-1,1,-1,1,1,-1]
6-element Array{Int64,1}:
-1
1
-1
1
1
-1
julia> A = convertlabelview(LabelEnc.NativeLabels([:yes,:no]), true_targets)
6-element MappedArrays.MappedArray{Symbol,1,...}:
:no
:yes
:no
:yes
:yes
:no
julia> A[2] = :no
julia> A
6-element MappedArrays.MappedArray{Symbol,1,...}:
:no
:no
:no
:yes
:yes
:no
julia> true_targets
6-element Array{Int64,1}:
-1
-1
-1
1
1
-1
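To illustrate how such a lazy, writeable view can work in principle, the following plain-Julia sketch defines a tiny wrapper type. `ToyLabelView` is a hypothetical stand-in written for this illustration. It is not the MappedArray type and not how the package is implemented. It converts on read with one function and converts back on write with its inverse, so writes reach the original vector.

```julia
# Hypothetical minimal stand-in for a writeable mapped array.
struct ToyLabelView{T,A<:AbstractVector,F,G} <: AbstractVector{T}
    data::A   # the original targets (not copied)
    f::F      # source label -> destination label, applied on read
    finv::G   # destination label -> source label, applied on write
end
ToyLabelView{T}(data, f, finv) where {T} =
    ToyLabelView{T,typeof(data),typeof(f),typeof(finv)}(data, f, finv)

Base.size(v::ToyLabelView) = size(v.data)
Base.IndexStyle(::Type{<:ToyLabelView}) = IndexLinear()
Base.getindex(v::ToyLabelView, i::Int) = v.f(v.data[i])
Base.setindex!(v::ToyLabelView, x, i::Int) = (v.data[i] = v.finv(x); v)

true_targets = [-1, 1, -1, 1, 1, -1]
A = ToyLabelView{Symbol}(true_targets,
                         y -> y == 1 ? :yes : :no,    # read:  -1/1 -> symbol
                         s -> s == :yes ? 1 : -1)     # write: symbol -> -1/1
A[2] = :no   # stored as -1 in true_targets
```

After the assignment, both the view and `true_targets` reflect the change, which mirrors the behaviour of the example above.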
Classifying Predictions¶
Some encodings come with an implicit interpretation of what the
raw predictions of some model (often denoted as \(\hat{y}\),
written yhat
) should look like and how they can be classified
into a predicted class-label.
For that purpose we provide the function classify()
and its
mutating version classify!()
.
-
classify
(yhat, encoding) Returns the classified version of yhat given the encoding. That means that if yhat can be interpreted as a positive label, the positive label of encoding is returned. If yhat can not be interpreted as a positive value then the negative label is returned.
This method supports broadcasting.
Parameters: - yhat (Number) – The numeric prediction that should be classified into either the label representing the positive class or the label representing the negative class
- encoding (LabelEncoding) – A concrete instance of a label-encoding that one wants to work with.
Returns: The label that the encoding uses to represent the class that yhat is classified into.
For LabelEnc.MarginBased
the decision boundary between
classifying into a negative or a positive label is predefined at zero.
More precisely a raw prediction greater than or equal to zero
is considered a positive prediction, while any strictly negative raw
prediction is considered a negative prediction.
julia> classify(-0.3f0, LabelEnc.MarginBased()) # defaults to Float64
-1.0
julia> classify.([-2.3,6.5], LabelEnc.MarginBased(Int))
2-element Array{Int64,1}:
-1
1
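The margin-based rule described above can be written down in one line of plain Julia. The function `toy_classify_margin` is a hypothetical sketch, not the package's implementation. It takes the desired label type as an optional second argument, which is an assumption made for this illustration.

```julia
# Hypothetical sketch of the margin-based rule: a raw prediction greater
# than or equal to zero is positive, a strictly negative one is negative.
toy_classify_margin(yhat::Number, T::Type = Float64) =
    yhat >= 0 ? one(T) : -one(T)

toy_classify_margin(-0.3f0)    # negative label as Float64
toy_classify_margin(0.0)       # exactly on the boundary counts as positive
[toy_classify_margin(v, Int) for v in [-2.3, 6.5]]
```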
For LabelEnc.ZeroOne
the assumption is that the raw
prediction is in the closed interval \([0, 1]\) and represents
a degree of certainty that the observation is of the positive class.
That means that in order to classify a raw prediction to either
positive or negative, one needs to decide on a “threshold” parameter,
which determines at which degree of certainty a prediction is
“good enough” to classify as positive.
julia> classify(0.3f0, LabelEnc.ZeroOne(0.5)) # defaults to Float64
0.0
julia> classify(0.3f0, LabelEnc.ZeroOne(Int,0.2))
1
julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4))
2-element Array{Int64,1}:
0
1
We recognize that such a probabilistic interpretation of the raw
predicted value is fairly common. So much so that we provide a
convenience method for when one is working under the assumption of
a LabelEnc.ZeroOne
encoding.
-
classify
(yhat, threshold) Returns the classified version of yhat given the decision margin threshold. This method assumes that yhat denotes a probability and will either return zero(yhat) if yhat is below threshold, or one(yhat) otherwise.
This method supports broadcasting.
Parameters: - yhat (Number) – The numeric prediction. It is assumed to be a value between 0 and 1.
- threshold (Number) – The threshold below which yhat will be classified as 0.
Returns: The classified version of yhat of the same type.
julia> classify(0.3f0, 0.5)
0.0f0
julia> classify(0.3f0, 0.2)
1.0f0
julia> classify.([0.3,0.5], 0.4)
2-element Array{Float64,1}:
0.0
1.0
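As a rough illustration of the threshold rule, the following plain-Julia sketch returns `zero(yhat)` below the threshold and `one(yhat)` otherwise, which preserves the type of the prediction. The name `toy_classify_threshold` is hypothetical, and the sketch is not the package's implementation.

```julia
# Hypothetical sketch of the threshold rule. A prediction equal to the
# threshold is treated as positive; the result has the type of `yhat`.
toy_classify_threshold(yhat::Number, threshold::Number) =
    yhat >= threshold ? one(yhat) : zero(yhat)

toy_classify_threshold(0.3f0, 0.5)   # below the threshold
toy_classify_threshold(0.3f0, 0.2)   # above the threshold
[toy_classify_threshold(v, 0.4) for v in [0.3, 0.5]]
```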
For matrix-based encodings, such as LabelEnc.OneOfK
we
provide a special method that allows one to optionally specify the
dimension of the matrix that denotes the observations.
-
classify
(yhat, encoding[, obsdim]) If yhat is a vector (i.e. a single observation), this function returns the index of the element that has the largest value. If yhat is a matrix, this function returns a vector of indices for each observation in yhat.
Parameters: - yhat (AbstractArray) – The numeric predictions in the form of either a vector or a matrix.
- encoding (LabelEncoding) – A concrete instance of a matrix-based label-encoding that one wants to work with.
- obsdim (ObsDimension) – Optional iff yhat is a matrix. Denotes which of the two array dimensions of yhat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: The classified version of yhat. This will either be an integer or a vector of indices.
julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
0.1 0.4 0.3 0.2
0.8 0.3 0.6 0.2
0.1 0.3 0.1 0.6
julia> classify(pred_output, LabelEnc.OneOfK(3))
4-element Array{Int64,1}:
2
1
2
3
julia> classify(pred_output', LabelEnc.OneOfK(3), obsdim=1) # note the transpose
4-element Array{Int64,1}:
2
1
2
3
julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK(4)) # single observation
3
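The behaviour shown above amounts to taking the index of the largest value for each observation. The following plain-Julia sketch makes that explicit. The name `toy_classify_onehot` is hypothetical, and `obsdim` is assumed to be a plain integer (1 or 2) rather than an ObsDimension object.

```julia
# Hypothetical sketch: index of the largest value per observation.
# `obsdim` is the dimension (1 or 2) that indexes the observations.
function toy_classify_onehot(yhat::AbstractMatrix; obsdim::Integer = 2)
    if obsdim == 2
        # observations are columns
        return [argmax(view(yhat, :, j)) for j in axes(yhat, 2)]
    else
        # observations are rows
        return [argmax(view(yhat, i, :)) for i in axes(yhat, 1)]
    end
end

# A single observation is just a vector.
toy_classify_onehot(yhat::AbstractVector) = argmax(yhat)

pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
by_columns = toy_classify_onehot(pred_output)
by_rows    = toy_classify_onehot(permutedims(pred_output), obsdim = 1)
single     = toy_classify_onehot([0.1, 0.2, 0.6, 0.1])
```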
Similar to other functions we expose a version that can be called with a family of encodings (i.e. a type with free type parameters) instead of a concrete instance.
-
classify
(yhat, type) Returns the classified version of yhat given the family of encodings specified by type. That means that if yhat can be interpreted as a positive label, the positive label of that family is returned (and the negative otherwise). Furthermore, the type of yhat is preserved.
This method supports broadcasting.
Parameters: - yhat (Number) – The numeric prediction that should be classified into either the label representing the positive class or the label representing the negative class
- type (DataType) – Any subtype of
LabelEncoding{T,K,1}
Returns: The classified version of yhat of the same type.
julia> classify(0.3f0, LabelEnc.ZeroOne) # threshold fixed at 0.5
0.0f0
julia> classify(0.3, LabelEnc.ZeroOne)
0.0
julia> classify(4f0, LabelEnc.MarginBased)
1.0f0
julia> classify(-4, LabelEnc.MarginBased)
-1
-
classify
(yhat, type[, obsdim]) If yhat is a vector (i.e. a single observation), this function returns the index of the element that has the largest value. If yhat is a matrix, this function returns a vector of indices for each observation in yhat.
Parameters: - yhat (AbstractArray) – The numeric predictions in the form of either a vector or a matrix.
- type (DataType) – Any subtype of
LabelEncoding{T,K,2}
- obsdim (ObsDimension) – Optional iff yhat is a matrix. Denotes which of the two array dimensions of yhat denotes the observations. It can be specified as a type-stable positional argument or a smart keyword. Defaults to ObsDim.Last(). See ?ObsDim for more information.
Returns: The classified version of yhat. This will either be an integer or a vector of indices.
julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
0.1 0.4 0.3 0.2
0.8 0.3 0.6 0.2
0.1 0.3 0.1 0.6
julia> classify(pred_output, LabelEnc.OneOfK)
4-element Array{Int64,1}:
2
1
2
3
julia> classify(pred_output', LabelEnc.OneOfK, obsdim=1) # note the transpose
4-element Array{Int64,1}:
2
1
2
3
julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
3
We also provide a mutating version. This is mainly of interest
when working with LabelEnc.OneOfK()
, in which case broadcast
is not defined on the previous methods.
-
classify!
(out, arr, encoding[, obsdim])¶ Same as classify, but uses out to store the result. In the case of a vector-based encoding this will use broadcast internally. It is mainly provided to offer a consistent API between vector-based and matrix-based encodings.
For convenience we also provide boolean versions that assert whether the given raw prediction could be interpreted as either a positive or a negative prediction.
-
isposlabel
(yhat, encoding) → Bool¶ Checks if the given value yhat can be interpreted as the positive label given the encoding. This function takes potential classification rules into account.
julia> isposlabel([1,0], LabelEnc.OneOfK(2))
true
julia> isposlabel([0,1], LabelEnc.OneOfK(2))
false
julia> isposlabel(-5, LabelEnc.MarginBased())
false
julia> isposlabel(2, LabelEnc.MarginBased())
true
julia> isposlabel(0.3f0, LabelEnc.ZeroOne(0.5))
false
julia> isposlabel(0.3f0, LabelEnc.ZeroOne(0.2))
true
-
isneglabel
(yhat, encoding) → Bool¶ Checks if the given value yhat can be interpreted as the negative label given the encoding. This function takes potential classification rules into account.
julia> isneglabel([1,0], LabelEnc.OneOfK(2))
false
julia> isneglabel([0,1], LabelEnc.OneOfK(2))
true
julia> isneglabel(-5, LabelEnc.MarginBased())
true
julia> isneglabel(2, LabelEnc.MarginBased())
false
julia> isneglabel(0.3f0, LabelEnc.ZeroOne(0.5))
true
julia> isneglabel(0.3f0, LabelEnc.ZeroOne(0.2))
false
Lastly, we provide an organized list of the label-encodings that this package exposes. We will also discuss their properties, differences, and other nuances.
Supported Encodings¶
The design of this package revolves around a number of immutable
types, each of which represents a specific label-encoding.
These types are contained within their own namespace LabelEnc.
The reason for the namespace is mainly convenience, as it allows
for a simple form of auto-completion and also more concise names that
could otherwise be considered to be too ambiguous.
Abstract LabelEncoding¶
We offer a number of different encodings that can best be described in terms of two orthogonal properties. The first property is the number of classes it represents, and the second property is the number of array dimensions it operates on.
-
LabelEncoding{T,K,M}
¶ Abstract super-type of all label encodings. Mainly intended for dispatch. As such this type is not exported. It defines three type-parameters that are useful to divide the different encodings into groups.
-
T
¶ The label-type of the encoding, which specifies which concrete type all labels of that particular encoding have.
-
K
¶ The number of labels that the label-encoding can deal with. So for binary encodings this will be the constant
2
-
M
¶ The number of array dimensions that the encoding works with. For most encodings this will be 1, meaning that a target array of that encoding is expected to be some vector. In contrast, the encoding LabelEnc.OneOfK defines M=2, because it represents the target array as a matrix.
-
TrueFalse¶
-
LabelEnc.
TrueFalse
¶ Denotes the classes as boolean values, for which true corresponds to the positive class, and false to the negative class.
julia> supertype(LabelEnc.TrueFalse)
MLLabelUtils.LabelEncoding{Bool,2,1}
It belongs to the family of binary vector-based encodings, and as such represents the targets as a vector that is using only two distinct values for its elements. That implies that it is by definition always binary and as such the number of labels can be inferred at compile time.
julia> nlabel(LabelEnc.TrueFalse)
2
-
TrueFalse
() → LabelEnc.TrueFalse¶ Returns the singleton that represents the encoding. All information about the encoding is already contained within the type. As such there is no need to specify additional parameters.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> true_targets = [false, true, true, false];
julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.TrueFalse()
julia> label(LabelEnc.TrueFalse())
2-element Array{Bool,1}:
true
false
julia> nlabel(LabelEnc.TrueFalse())
2
ZeroOne¶
-
LabelEnc.
ZeroOne
¶ Denotes the classes as numeric values, for which 1 corresponds to the positive class, and 0 to the negative class. This type of encoding is often used when the predictions denote a probability.
julia> supertype(LabelEnc.ZeroOne)
MLLabelUtils.LabelEncoding{T<:Number,2,1}
It belongs to the family of binary numeric vector-based encodings, and as such represents the targets as a vector that is using only two distinct values for its elements. In fact, it is by definition always binary and as such the number of labels can be inferred at compile time.
julia> nlabel(LabelEnc.ZeroOne)
2
This type also comes with support for classification (see classify()). It assumes that the raw predictions (often called \(\hat{y}\)) are in the closed interval \([0, 1]\) and represent something resembling a probability (or some degree of certainty) that the observation is of the positive class. That means that in order to classify a raw prediction to either positive or negative, one needs to decide on a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to classify as positive.
-
threshold
¶ A real number between 0 and 1 that defines the “cutoff” point for classification. Any prediction less than this value will be classified as negative and any prediction equal to or greater than this value will be classified as a positive prediction.
-
-
ZeroOne
([labeltype][, threshold]) → LabelEnc.ZeroOne¶ Creates a new label-encoding of the LabelEnc.ZeroOne family.
Parameters: - labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Float64.
- threshold (Number) – The classification threshold that should be used in classify(). Defaults to 0.5.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> LabelEnc.ZeroOne(Int, 0.3) # threshold = 0.3
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.3)
julia> true_targets = [0, 1, 1, 0];
julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.5)
julia> label(LabelEnc.ZeroOne())
2-element Array{Float64,1}:
1.0
0.0
julia> nlabel(LabelEnc.ZeroOne())
2
MarginBased¶
-
LabelEnc.
MarginBased
¶ Denotes the classes as numeric values, for which 1 corresponds to the positive class, and -1 to the negative class. This type of encoding is very prominent for margin-based classifiers, in particular SVMs.
julia> supertype(LabelEnc.MarginBased)
MLLabelUtils.LabelEncoding{T<:Number,2,1}
It belongs to the family of binary numeric vector-based encodings, and as such represents the targets as a vector that is using only two distinct values for its elements. In fact, it is by definition always binary and as such the number of labels can be inferred at compile time.
julia> nlabel(LabelEnc.MarginBased)
2
This type also comes with support for classification (see
classify()
). It expects the raw predictions to be real numbers of arbitrary value. The decision boundary between classifying into a negative or a positive label is predefined at zero. More precisely a raw prediction greater than or equal to zero is considered a positive prediction, while any strictly negative raw prediction is considered a negative prediction.
-
MarginBased([labeltype]) → LabelEnc.MarginBased¶ Creates a new label-encoding of the LabelEnc.MarginBased family.
Parameters:
- labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Float64.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> true_targets = [-1, 1, 1, -1];
julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.MarginBased{Int64}()
julia> label(LabelEnc.MarginBased())
2-element Array{Float64,1}:
1.0
-1.0
julia> nlabel(LabelEnc.MarginBased())
2
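The decision rule documented for LabelEnc.MarginBased (boundary fixed at zero, with zero itself counted as positive) can be illustrated with a minimal plain-Julia sketch. The name classify_margin is hypothetical and is not a function of this package.

```julia
# Hypothetical helper, not part of MLLabelUtils: mimics the documented
# margin-based rule, where the decision boundary is fixed at zero.
classify_margin(prediction::Real) = prediction >= 0 ? 1.0 : -1.0

raw = [-2.5, -0.1, 0.0, 3.1]
predicted = classify_margin.(raw)  # 0.0 falls on the positive side
```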
OneVsRest¶
-
LabelEnc.OneVsRest¶ This is a special type of binary encoding that allows converting a multi-class problem into a binary one. It does so by only “caring” about what the positive label is, and treating everything that is not equal to it as negative.
julia> supertype(LabelEnc.OneVsRest)
MLLabelUtils.LabelEncoding{T,2,1}
It belongs to the family of binary vector-based encodings. It is by definition always binary and as such the number of labels can be inferred at compile time.
julia> nlabel(LabelEnc.OneVsRest)
2
While this encoding only uses the positive label to assert class membership, it still needs to have a placeholder value of the same type for the negative label in order for convertlabel() to work.
-
poslabel¶ The value that will be used to represent the positive class. A given value is considered positive if it is equal to this value, and negative otherwise.
-
neglabel¶ Placeholder to represent the negative class. This value will not be used to determine membership, but simply to impute a reasonable value when converting to such an encoding.
-
-
OneVsRest(poslabel[, neglabel]) → LabelEnc.OneVsRest¶ Creates a new label-encoding of the one-vs-rest family. While both a positive and a negative label have to be known to the encoding, only the positive label is used for comparison and asserting class membership. Note that both parameters have to be of the same type.
Parameters:
- poslabel (Any) – The label of interest.
- neglabel (Any) – The negative label. It is optional for the common types, such as symbol, number, or string. For other label types it has to be provided explicitly.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> true_targets = [:yes, :no, :maybe, :yes];
julia> convertlabel(LabelEnc.OneVsRest(:yes), true_targets)
4-element Array{Symbol,1}:
:yes
:not_yes
:not_yes
:yes
julia> convertlabel(LabelEnc.MarginBased, true_targets, LabelEnc.OneVsRest(:yes))
4-element Array{Float64,1}:
1.0
-1.0
-1.0
1.0
julia> label(LabelEnc.OneVsRest(:yes))
2-element Array{Symbol,1}:
:yes
:not_yes
julia> nlabel(LabelEnc.OneVsRest(:yes))
2
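The one-vs-rest idea shown above can also be written out by hand. The following plain-Julia sketch uses a hypothetical helper one_vs_rest, which is not part of this package, and takes the placeholder :not_yes from the example output above.

```julia
# Hypothetical helper, not part of MLLabelUtils: keeps the positive label and
# replaces every other value with the negative placeholder.
function one_vs_rest(targets::AbstractVector{T}, poslabel::T, neglabel::T) where {T}
    # Only equality with `poslabel` decides membership; `neglabel` is never compared.
    return [t == poslabel ? poslabel : neglabel for t in targets]
end

targets = [:yes, :no, :maybe, :yes]
binary  = one_vs_rest(targets, :yes, :not_yes)

# The same membership test can feed a numeric binary encoding directly.
margin = [t == :yes ? 1.0 : -1.0 for t in targets]
```

Because only the positive label takes part in the comparison, both :no and :maybe collapse into the same negative class.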
Indices¶
-
LabelEnc.Indices¶ A multi-class encoding that uses the integers in \(\{1, 2, ..., K\}\) as labels to denote the classes. While these “indices” are integers in terms of their values, they don’t need to be of type Int.
julia> supertype(LabelEnc.Indices)
MLLabelUtils.LabelEncoding{T<:Number,K,1}
It belongs to the family of numeric vector-based encodings and can encode any number of classes. As such the number of labels K is a free type-parameter. It is considered a binary encoding if and only if K = 2.
-
Indices([labeltype, ]k) → LabelEnc.Indices¶ Creates a new label-encoding of the LabelEnc.Indices family.
Parameters:
- labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Int.
- k (Int) – The number of classes that the encoding should represent. This parameter can be specified as an Int or in a type-stable manner as Val{k}.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> true_targets = [1, 2, 1, 3, 1, 2];
julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.Indices{Int64,3}()
julia> label(LabelEnc.Indices(3))
3-element Array{Int64,1}:
1
2
3
julia> label(LabelEnc.Indices(Float32,4))
4-element Array{Float32,1}:
1.0
2.0
3.0
4.0
julia> nlabel(LabelEnc.Indices(Val{5})) # type-stable
5
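The statement that index labels are integers in value but not necessarily in type can be checked with a small plain-Julia sketch. The helper is_index_encoded is hypothetical and only illustrates the definition given above; it is not how the package infers an encoding.

```julia
# Hypothetical helper, not part of MLLabelUtils: checks that every target is
# integer-valued and lies in 1:k, regardless of its numeric type.
function is_index_encoded(targets::AbstractVector{<:Number}, k::Integer)
    return all(t -> isinteger(t) && 1 <= t <= k, targets)
end

ok_int    = is_index_encoded([1, 2, 1, 3, 1, 2], 3)   # Int indices
ok_float  = is_index_encoded(Float32[1, 2, 4], 4)     # Float32 values, still indices
bad_zero  = is_index_encoded([0, 1], 2)               # 0 is outside 1:k
bad_fract = is_index_encoded([1.5, 2.0], 2)           # 1.5 is not integer-valued
```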
OneOfK¶
-
LabelEnc.OneOfK¶ A multi-class encoding that uses one of the two matrix dimensions to denote the label. In other words, it uses an indicator-encoding to explicitly state what class an observation represents and what it does not represent, by only setting one element of each observation to 1 and the rest to 0.
julia> supertype(LabelEnc.OneOfK)
MLLabelUtils.LabelEncoding{T<:Number,K,2}
It belongs to the family of numeric matrix-based encodings and can encode any number of classes. As such the number of labels K is a free type-parameter. It is considered a binary encoding if and only if K = 2.
-
OneOfK([labeltype, ]k) → LabelEnc.OneOfK¶ Creates a new label-encoding of the matrix-based LabelEnc.OneOfK family.
Parameters:
- labeltype (DataType) – The type that should be used to represent the labels. Has to be a subtype of Number. Defaults to Int.
- k (Int) – The number of classes that the encoding should represent. This parameter can be specified as an Int or in a type-stable manner as Val{k}.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> true_targets = [0 1 0 0; 1 0 1 0; 0 0 0 1]
3×4 Array{Int64,2}:
0 1 0 0
1 0 1 0
0 0 0 1
julia> labelenc(true_targets)
MLLabelUtils.LabelEnc.OneOfK{Int64,3}()
julia> label(LabelEnc.OneOfK(Float32, 4)) # returns the indices
4-element Array{Int64,1}:
1
2
3
4
julia> ind2label(3, LabelEnc.OneOfK(Float32, 4))
4-element Array{Float32,1}:
0.0
0.0
1.0
0.0
julia> nlabel(LabelEnc.OneOfK(Val{4}))
4
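To make the indicator layout concrete, the following plain-Julia sketch builds such a matrix from class indices and recovers the indices again. Both helper names are hypothetical and not part of this package. The layout (one row per class, one column per observation) is an assumption read off the 3×4 example above, which is inferred as an encoding with K = 3.

```julia
# Hypothetical helpers, not part of MLLabelUtils.
# Assumed layout: rows are classes, columns are observations.
function to_one_of_k(indices::AbstractVector{<:Integer}, k::Integer,
                     T::Type{<:Number} = Int)
    out = zeros(T, k, length(indices))
    for (obs, class) in enumerate(indices)
        1 <= class <= k || throw(ArgumentError("class index $class outside 1:$k"))
        out[class, obs] = one(T)  # exactly one element per observation is set
    end
    return out
end

# Recover the class index of each observation as the position of its largest entry.
from_one_of_k(m::AbstractMatrix) = [argmax(view(m, :, j)) for j in axes(m, 2)]

encoded = to_one_of_k([2, 1, 2, 3], 3)
decoded = from_one_of_k(encoded)
```

With the indices [2, 1, 2, 3] the sketch reproduces the matrix used as true_targets in the example above.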
NativeLabels¶
-
LabelEnc.NativeLabels¶ A multi-class encoding that can use arbitrary values to represent any number of labels. It does so by mapping each label-index to a class label. The class labels can be of arbitrary type as long as the type is consistent for all labels. Furthermore, all labels have to be specified explicitly.
julia> supertype(LabelEnc.NativeLabels)
MLLabelUtils.LabelEncoding{T,K,1}
It belongs to the family of vector-based encodings that can encode any number of classes. As such the number of labels K is a free type-parameter. It is considered a binary encoding if and only if K = 2.
-
label¶ A vector that contains all the used labels in their defined order. If it only contains two values, then the first value will be interpreted as the positive label and the second value as the negative label.
-
invlabel¶ A Dict that maps each label to its index in the vector label. This map is used for fast lookup and is generated automatically.
-
-
NativeLabels(label[, k]) → LabelEnc.NativeLabels¶ Creates a new vector-based label-encoding for the given values in label. The values in label are expected to be distinct.
Parameters:
- label (Vector) – The labels that the encoding should use, in their intended order.
- k (DataType) – The number of labels in label. This parameter is optional and will be computed from label if omitted. However, if the number of labels is known at compile time, this parameter can be provided using Val{k}.
For more information on how to use such an encoding, please look at the corresponding parts of the documentation.
julia> true_targets = [:a, :b, :a, :c, :b, :a];
julia> le = labelenc(true_targets)
MLLabelUtils.LabelEnc.NativeLabels{Symbol,3}(Symbol[:a,:b,:c],Dict(:c=>3,:a=>1,:b=>2))
julia> label(le)
3-element Array{Symbol,1}:
:a
:b
:c
julia> nlabel(le)
3
julia> LabelEnc.NativeLabels([:yes, :no, :maybe], Val{3}) # type inferrable
MLLabelUtils.LabelEnc.NativeLabels{Symbol,3}(Symbol[:yes,:no,:maybe],Dict(:yes=>1,:maybe=>3,:no=>2))
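The relationship between the label vector and the invlabel lookup can be sketched in plain Julia. The variable names below merely mirror the two fields described above; the code is an illustration of the mapping and makes no claim about how the package builds it internally.

```julia
# A label vector in its defined order, and the inverse lookup derived from it.
labels   = [:a, :b, :c]
invlabel = Dict(l => i for (i, l) in enumerate(labels))

targets = [:a, :b, :a, :c, :b, :a]

# label -> index: one Dict lookup per target instead of a linear search.
indices = [invlabel[t] for t in targets]

# index -> label: plain indexing into the label vector.
restored = labels[indices]
```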
FuzzyBinary¶
-
LabelEnc.FuzzyBinary¶ A vector-based binary label interpretation without a specific labeltype. It is primarily intended for fuzzy comparison of binary true targets and predicted targets. It basically assumes that the encoding is either TrueFalse, ZeroOne, or MarginBased by treating all strictly positive values as positive outputs.
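A fuzzy comparison of this kind might look like the following plain-Julia sketch. The helpers fuzzy_ispositive and fuzzy_agree are hypothetical and not part of this package. The sketch assumes that zero counts as negative, which is what keeps the 0 of a ZeroOne target and the -1 of a MarginBased target on the same side.

```julia
# Hypothetical helpers, not part of MLLabelUtils.
# Assumption: a Bool is positive when true, a number when strictly greater than zero.
fuzzy_ispositive(value::Bool)   = value
fuzzy_ispositive(value::Number) = value > zero(value)

# Two values agree if both are interpreted as positive or both as negative,
# even when they come from different binary encodings.
fuzzy_agree(target, output) = fuzzy_ispositive(target) == fuzzy_ispositive(output)

targets = [0, 1, 1, 0]               # ZeroOne-style true targets
outputs = [-1.0, 1.0, -1.0, -1.0]    # MarginBased-style predictions
agreement = fuzzy_agree.(targets, outputs)

bool_agreement = fuzzy_agree.([true, false], [1, -1])  # TrueFalse vs. MarginBased
```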
LICENSE¶
The MLLabelUtils.jl package is licensed under the MIT “Expat” License; see LICENSE.md in the GitHub repository.