Getting Started

MLLabelUtils is the result of a collaborative effort to design an efficient yet convenient library for working with the class-label encodings most commonly used in Machine Learning. To that end, the package provides functions to derive or assert properties of a label encoding or target array, as well as functions to convert given targets into a different format.

Installation

To install MLLabelUtils.jl, start up Julia and type the following code snippet into the REPL. It makes use of the native Julia package manager.

Pkg.add("MLLabelUtils")

If you run into unexpected issues, or if you would like to contribute to the package, you can also manually switch to the latest (untagged) version.

Pkg.checkout("MLLabelUtils")

Overview

Let us take a look at some examples (with only minor explanation) to get a feeling for what one can do with this package. Once installed, the package can be imported just like any other Julia package.

using MLLabelUtils

For starters, the library provides a few utility functions to compute various properties of the target array. These include the number of labels (see nlabel()), the labels themselves (see label()), a mapping from each label to the indices of the elements of the target array that carry it (see labelmap()), and the absolute frequency of each label (see labelfreq()).

julia> true_targets = [0, 1, 1, 0, 0];

julia> label(true_targets)
2-element Array{Int64,1}:
 1
 0

julia> nlabel(true_targets)
2

julia> labelmap(true_targets)
Dict{Int64,Array{Int64,1}} with 2 entries:
  0 => [1,4,5]
  1 => [2,3]

julia> labelfreq(true_targets)
Dict{Int64,Int64} with 2 entries:
  0 => 3
  1 => 2
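
To make the meaning of these two results concrete, the following is a minimal sketch in plain Julia (written for Julia 1.x) of how such dictionaries could be built by hand. The helpers sketch_labelmap and sketch_labelfreq are hypothetical names introduced only for illustration. They are not part of MLLabelUtils and are not meant to reflect how the package implements labelmap() and labelfreq().

```julia
# Hypothetical helpers for illustration only; not the package's implementation.

# Map each distinct label to the indices at which it occurs.
function sketch_labelmap(targets::AbstractVector{T}) where {T}
    indices = Dict{T,Vector{Int}}()
    for (i, t) in enumerate(targets)
        # get! creates an empty index list the first time a label is seen
        push!(get!(() -> Int[], indices, t), i)
    end
    return indices
end

# Count how often each distinct label occurs.
function sketch_labelfreq(targets::AbstractVector{T}) where {T}
    counts = Dict{T,Int}()
    for t in targets
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end

sketch_labelmap([0, 1, 1, 0, 0])   # Dict(0 => [1, 4, 5], 1 => [2, 3])
sketch_labelfreq([0, 1, 1, 0, 0])  # Dict(0 => 3, 1 => 2)
```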

Tip

Because labelfreq() uses a Dict to store its result, it is straightforward to visualize the class distribution (using the absolute frequencies) right in the REPL using the UnicodePlots.jl package.

julia> using UnicodePlots
julia> barplot(labelfreq([:yes,:no,:no,:maybe,:yes,:yes]), symb="#")
#        ┌────────────────────────────────────────┐
#    yes │##################################### 3 │
#  maybe │############ 1                          │
#     no │######################### 2             │
#        └────────────────────────────────────────┘

If you find yourself writing a custom function that trains a specific supervised model, chances are that you want to assert that the given targets are in the encoding the model requires. We provide a few functions for such a scenario, namely labelenc() and islabelenc().

julia> true_targets = [0, 1, 1, 0, 0];

julia> labelenc(true_targets) # determine encoding using heuristics
MLLabelUtils.LabelEnc.ZeroOne{Int64,Float64}(0.5)

julia> islabelenc(true_targets, LabelEnc.ZeroOne)
true

julia> islabelenc(true_targets, LabelEnc.ZeroOne(Int))
true

julia> islabelenc(true_targets, LabelEnc.ZeroOne(Float32))
false

julia> islabelenc(true_targets, LabelEnc.MarginBased)
false
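
The kind of guard described above could look roughly like the following sketch. It uses only the islabelenc() call shown in the examples. The function name check_margin_targets and the error message are hypothetical and chosen purely for illustration, and the sketch assumes the package is installed.

```julia
using MLLabelUtils

# Hypothetical guard for a model that expects -1/+1 targets.
# Returns the targets unchanged if they pass the check.
function check_margin_targets(targets::AbstractVector)
    if !islabelenc(targets, LabelEnc.MarginBased)
        throw(ArgumentError("targets must be in the LabelEnc.MarginBased encoding"))
    end
    return targets
end
```

A training function would call such a guard first and only then proceed with the actual fitting, so that a wrong encoding is reported early instead of silently producing a poor model.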

If it turns out that the given targets are in the wrong encoding, you may want to convert them into the format you require. For that purpose we expose the function convertlabel().

julia> true_targets = [0, 1, 1, 0, 0];

julia> convertlabel(LabelEnc.MarginBased, true_targets)
5-element Array{Int64,1}:
 -1
  1
  1
 -1
 -1

julia> convertlabel(LabelEnc.MarginBased(Float64), true_targets)
5-element Array{Float64,1}:
 -1.0
  1.0
  1.0
 -1.0
 -1.0

julia> convertlabel([:yes,:no], true_targets)
5-element Array{Symbol,1}:
 :no
 :yes
 :yes
 :no
 :no

julia> convertlabel(LabelEnc.OneOfK, true_targets)
2×5 Array{Int64,2}:
 0  1  1  0  0
 1  0  0  1  1

julia> convertlabel(LabelEnc.OneOfK{Bool}, true_targets)
2×5 Array{Bool,2}:
 false   true   true  false  false
  true  false  false   true   true

julia> convertlabel(LabelEnc.OneOfK{Float64}, true_targets, obsdim=1)
5×2 Array{Float64,2}:
 0.0  1.0
 1.0  0.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

It may be interesting to point out explicitly that we provide LabelEnc.OneVsRest to conveniently convert a multi-class problem into a two-class problem.

julia> convertlabel(LabelEnc.OneVsRest(:yes), [:yes,:no,:no,:maybe,:yes,:yes])
6-element Array{Symbol,1}:
 :yes
 :not_yes
 :not_yes
 :not_yes
 :yes
 :yes

julia> convertlabel(LabelEnc.ZeroOne, [:yes,:no,:no,:maybe,:yes,:yes], LabelEnc.OneVsRest(:yes))
6-element Array{Float64,1}:
 1.0
 0.0
 0.0
 0.0
 1.0
 1.0

Some encodings come with an implicit contract of what the raw predictions of some model should look like and how to classify a raw prediction into a predicted class-label. For that purpose we provide the function classify() and its mutating version classify!().

For LabelEnc.ZeroOne the convention is that the raw prediction is between 0 and 1 and represents a degree of certainty that the observation is of the positive class. That means that in order to classify a raw prediction to either positive or negative, one needs to define a “threshold” parameter, which determines at which degree of certainty a prediction is “good enough” to classify as positive.

julia> classify(0.3f0, 0.5); # equivalent to below
julia> classify(0.3f0, LabelEnc.ZeroOne) # preserves type
0.0f0

julia> classify(0.3f0, LabelEnc.ZeroOne(0.5)) # defaults to Float64
0.0

julia> classify(0.3f0, LabelEnc.ZeroOne(Int,0.2))
1

julia> classify.([0.3,0.5], LabelEnc.ZeroOne(Int,0.4))
2-element Array{Int64,1}:
 0
 1
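
The threshold rule itself is simple enough to sketch in plain Julia. The helper threshold_classify below is a hypothetical name used only for illustration and is not part of MLLabelUtils. It also assumes that a prediction exactly equal to the threshold counts as positive, which the description above does not spell out.

```julia
# Hypothetical helper for illustration only; not part of MLLabelUtils.
# Assumption: a prediction exactly at the threshold counts as positive.
threshold_classify(prediction::Real, threshold::Real = 0.5) =
    prediction >= threshold ? 1 : 0

threshold_classify(0.3, 0.5)          # 0
threshold_classify(0.3, 0.2)          # 1
threshold_classify.([0.3, 0.5], 0.4)  # [0, 1]
```

Lowering the threshold makes the rule more willing to predict the positive class, which is why the same raw prediction of 0.3 is classified differently in the first two calls.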

For LabelEnc.MarginBased, on the other hand, the decision boundary is predefined at 0, meaning that any raw prediction greater than or equal to zero is considered a positive prediction, while any negative raw prediction is considered a negative prediction.

julia> classify(0.3f0, LabelEnc.MarginBased) # preserves type
1.0f0

julia> classify(-0.3f0, LabelEnc.MarginBased()) # defaults to Float64
-1.0

julia> classify.([-2.3,6.5], LabelEnc.MarginBased(Int))
2-element Array{Int64,1}:
 -1
  1
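
As a minimal sketch of the rule just described, the fixed decision boundary at 0 can be written in plain Julia as follows. The helper margin_classify is a hypothetical name introduced for illustration and is not part of MLLabelUtils.

```julia
# Hypothetical helper for illustration only; not part of MLLabelUtils.
# Predictions >= 0 are positive (+1), negative predictions are -1.
margin_classify(prediction::Real) = prediction >= 0 ? 1 : -1

margin_classify(0.3)           # 1
margin_classify(-0.3)          # -1
margin_classify(0.0)           # 1, zero falls on the positive side
margin_classify.([-2.3, 6.5])  # [-1, 1]
```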

The encoding LabelEnc.OneOfK is special in that it is matrix-based and thus there exists the concept of ObsDim, i.e. the freedom to choose which array dimension denotes the observations. The classified prediction will be the index of the largest element of an observation. By default the “obsdim” is defined as the last array dimension.

julia> pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]
3×4 Array{Float64,2}:
 0.1  0.4  0.3  0.2
 0.8  0.3  0.6  0.2
 0.1  0.3  0.1  0.6

julia> classify(pred_output, LabelEnc.OneOfK)
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify(pred_output', LabelEnc.OneOfK, obsdim=1) # note the transpose
4-element Array{Int64,1}:
 2
 1
 2
 3

julia> classify([0.1,0.2,0.6,0.1], LabelEnc.OneOfK) # single observation
3
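
To illustrate what "the index of the largest element of an observation" means for either choice of observation dimension, here is a minimal sketch in plain Julia (written for Julia 1.x). The helper argmax_classify and its obsdim keyword are hypothetical and only mirror the behavior shown above for matrices. They are not the package's implementation.

```julia
# Hypothetical helper for illustration only; not part of MLLabelUtils.
# Returns, for every observation, the index of its largest score.
function argmax_classify(scores::AbstractMatrix; obsdim::Integer = 2)
    if obsdim == 2
        # each column is one observation
        return [argmax(view(scores, :, j)) for j in axes(scores, 2)]
    elseif obsdim == 1
        # each row is one observation
        return [argmax(view(scores, i, :)) for i in axes(scores, 1)]
    else
        throw(ArgumentError("obsdim must be 1 or 2 for a matrix"))
    end
end

pred_output = [0.1 0.4 0.3 0.2; 0.8 0.3 0.6 0.2; 0.1 0.3 0.1 0.6]

argmax_classify(pred_output)                            # [2, 1, 2, 3]
argmax_classify(permutedims(pred_output); obsdim = 1)   # [2, 1, 2, 3]
```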

Getting Help

To get help on specific functionality you can either look up the information here, or if you prefer you can make use of Julia’s native doc-system. The following example shows how to get additional information on LabelEnc.OneOfK within Julia’s REPL:

?LabelEnc.OneOfK

If you find yourself stuck or have other questions concerning the package, you can find us on Gitter or in the Machine Learning domain on discourse.julialang.org.

If you encounter a bug or would like to participate in the further development of this package, come find us on GitHub.