Learning from Crowds
Routines for deep learning from crowds.
CoNAL
Bases: Module
Common Noise Adaptation Layers (CoNAL). This method models two types of confusion: worker-specific and global. Each is parameterized by its own confusion matrix. The mixing ratio between the two is determined by the common noise adaptation layer, a trainable function that takes the instance embedding and the worker ID as input and outputs a scalar value between 0 and 1.
Zhendong Chu, Jing Ma, and Hongning Wang. Learning from Crowds by Modeling Common Confusions. Proceedings of the AAAI Conference on Artificial Intelligence, 35(7), 5832-5840, 2021. https://doi.org/10.1609/aaai.v35i7.16730
Examples:
>>> from crowdkit.learning import CoNAL
>>> import torch
>>> input = torch.randn(3, 5)
>>> workers = torch.tensor([0, 1, 0])
>>> embeddings = torch.randn(3, 5)
>>> conal = CoNAL(5, 2)
>>> conal(embeddings, input, workers)
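The mixing step described above can be sketched as follows. The confusion matrices, the noise rate `omega`, and all shapes are illustrative assumptions for exposition, not the library's internals:

```python
import torch
import torch.nn.functional as F

batch_size, num_labels, n_workers = 3, 5, 2
logits = torch.randn(batch_size, num_labels)                     # base classifier logits
workers = torch.tensor([0, 1, 0])                                # worker IDs
omega = torch.rand(batch_size, 1)                                # common noise rate in [0, 1]

global_confusion = torch.eye(num_labels)                         # shared (global) confusion matrix
local_confusion = torch.eye(num_labels).repeat(n_workers, 1, 1)  # per-worker confusion matrices

probs = F.softmax(logits, dim=-1)
global_out = probs @ global_confusion                            # confusion common to all workers
local_out = torch.einsum("bi,bij->bj", probs, local_confusion[workers])
mixed = omega * global_out + (1 - omega) * local_out             # CoNAL's convex combination
```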
Source code in crowdkit/learning/conal.py
__init__(num_labels, n_workers, com_emb_size=20, user_feature=None)
Initializes the CoNAL module.
Parameters:

Name | Type | Description | Default
---|---|---|---
`num_labels` | `int` | Number of classes. | *required*
`n_workers` | `int` | Number of annotators. | *required*
`com_emb_size` | `int` | Embedding size of the common noise module. | `20`
`user_feature` | `ndarray` | User feature vector. | `None`
Source code in crowdkit/learning/conal.py
forward(embeddings, logits, workers)
Forward pass of the CoNAL module.
Parameters:

Name | Type | Description | Default
---|---|---|---
`embeddings` | `Tensor` | Tensor of shape (batch_size, embedding_size). | *required*
`logits` | `Tensor` | Tensor of shape (batch_size, num_classes). | *required*
`workers` | `Tensor` | Tensor of shape (batch_size,) containing the worker IDs. | *required*

Returns:

Type | Description
---|---
`Tensor` | torch.Tensor: Tensor of shape (batch_size, 1) containing the predicted output probabilities.
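A call sketch mirroring the example at the top of the page, with shapes annotated according to the table above:

```python
import torch
from crowdkit.learning import CoNAL

conal = CoNAL(5, 2)                # num_labels=5, n_workers=2
embeddings = torch.randn(3, 5)     # (batch_size, embedding_size)
logits = torch.randn(3, 5)         # (batch_size, num_classes)
workers = torch.tensor([0, 1, 0])  # (batch_size,)
out = conal(embeddings, logits, workers)
```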
Source code in crowdkit/learning/conal.py
simple_common_module(input, workers)
Common noise adaptation module.
Parameters:

Name | Type | Description | Default
---|---|---|---
`input` | `Tensor` | Tensor of shape (batch_size, embedding_size). | *required*
`workers` | `Tensor` | Tensor of shape (batch_size,) containing the worker IDs. | *required*

Returns:

Type | Description
---|---
`Tensor` | torch.Tensor: Tensor of shape (batch_size, 1) containing the common noise rate.
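In the CoNAL paper, this rate is roughly the sigmoid of an inner product between a projection of the instance embedding and a learned worker embedding. A sketch under those assumptions (the projection sizes, normalization, and names are assumptions, not the library's exact code):

```python
import torch
import torch.nn.functional as F

emb_size, com_emb_size, n_workers = 5, 20, 2
instance_proj = torch.nn.Linear(emb_size, com_emb_size)   # projects instance embeddings
worker_emb = torch.nn.Embedding(n_workers, com_emb_size)  # learned worker embeddings

x = torch.randn(3, emb_size)
workers = torch.tensor([0, 1, 0])

u = F.normalize(instance_proj(x), dim=-1)
v = F.normalize(worker_emb(workers), dim=-1)
common_rate = torch.sigmoid((u * v).sum(dim=-1, keepdim=True))  # (batch_size, 1)
```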
Source code in crowdkit/learning/conal.py
CrowdLayer
Bases: Module
CrowdLayer module for classification tasks.
This method applies a worker-specific transformation to the logits. There are four types of transformations:

- MW: multiplication by the worker's confusion matrix.
- VW: element-wise multiplication with the worker's weight vector.
- VB: element-wise addition of the worker's bias vector.
- VW + b: combination of VW and VB: VW * logits + b.
Filipe Rodrigues and Francisco Pereira. Deep Learning from Crowds. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018. https://doi.org/10.1609/aaai.v32i1.11506
Examples:
>>> from crowdkit.learning import CrowdLayer
>>> import torch
>>> input = torch.randn(3, 5)
>>> workers = torch.tensor([0, 1, 0])
>>> cl = CrowdLayer(5, 2, conn_type="mw")
>>> cl(input, workers)
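The four connection types can be sketched as follows. The worker parameters here are illustrative initializations, not the module's internals:

```python
import torch

batch_size, num_labels, n_workers = 3, 5, 2
logits = torch.randn(batch_size, num_labels)
workers = torch.tensor([0, 1, 0])

confusion = torch.eye(num_labels).repeat(n_workers, 1, 1)  # per-worker confusion matrices ('mw')
weight = torch.ones(n_workers, num_labels)                 # per-worker weight vectors ('vw')
bias = torch.zeros(n_workers, num_labels)                  # per-worker bias vectors ('vb')

mw = torch.einsum("bi,bij->bj", logits, confusion[workers])  # matrix product with the confusion matrix
vw = logits * weight[workers]                                # element-wise scaling
vb = logits + bias[workers]                                  # element-wise bias
vw_b = logits * weight[workers] + bias[workers]              # 'vw+b': VW * logits + b
```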
Source code in crowdkit/learning/crowd_layer.py
__init__(num_labels, n_workers, conn_type='mw', device=None, dtype=None)
Parameters:

Name | Type | Description | Default
---|---|---|---
`num_labels` | `int` | Number of classes. | *required*
`n_workers` | `int` | Number of workers. | *required*
`conn_type` | `str` | Connection type. One of 'mw', 'vw', 'vb', 'vw+b'. | `'mw'`
`device` | `DeviceObjType` | Device to use. | `None`
`dtype` | `dtype` | Data type to use. | `None`

Raises:

ValueError: If `conn_type` is not one of 'mw', 'vw', 'vb', 'vw+b'.
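For example, the combined element-wise variant could be constructed like this (values are illustrative):

```python
from crowdkit.learning import CrowdLayer

cl = CrowdLayer(num_labels=5, n_workers=2, conn_type="vw+b")
```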
Source code in crowdkit/learning/crowd_layer.py
forward(outputs, workers)
Forward pass.
Parameters:

Name | Type | Description | Default
---|---|---|---
`outputs` | `Tensor` | Tensor of shape (batch_size, input_dim). | *required*
`workers` | `Tensor` | Tensor of shape (batch_size,) containing the worker IDs. | *required*

Returns:

Type | Description
---|---
`Tensor` | torch.Tensor: Tensor of shape (batch_size, num_labels).
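In training, the worker-adjusted outputs are typically compared against each worker's own label. A rough sketch; the loss choice and tensor names are assumptions, not something prescribed by the library:

```python
import torch
import torch.nn.functional as F
from crowdkit.learning import CrowdLayer

cl = CrowdLayer(5, 2, conn_type="mw")
backbone_logits = torch.randn(3, 5, requires_grad=True)  # (batch_size, num_labels) from the base classifier
workers = torch.tensor([0, 1, 0])                        # which worker labeled each instance
worker_labels = torch.tensor([2, 0, 4])                  # labels those workers provided

adjusted = cl(backbone_logits, workers)                  # (batch_size, num_labels)
loss = F.cross_entropy(adjusted, worker_labels)          # train backbone and CrowdLayer jointly
loss.backward()
```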
Source code in crowdkit/learning/crowd_layer.py
TextSummarization
Bases: BaseTextsAggregator
Text Aggregation through Summarization
The method uses a pre-trained language model for summarization to aggregate crowdsourced texts.
For each task, workers' texts are concatenated with the `|` token and passed as the model's input. If `n_permutations` is not `None`, the texts are randomly shuffled `n_permutations` times and the outputs are aggregated with `permutation_aggregator` if provided. If `permutation_aggregator` is not provided, the resulting aggregate is the most common output over the permuted inputs.

To use a pre-trained model and tokenizer from `transformers`, you need to install `torch`.
M. Orzhenovskii, "Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 8-14. https://ceur-ws.org/Vol-2932/short1.pdf
S. Pletenev, "Noisy Text Sequences Aggregation as a Summarization Subtask," Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, 2021, pp. 15-20. https://ceur-ws.org/Vol-2932/short2.pdf
Examples:
>>> import torch
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, AutoConfig
>>> from crowdkit.learning import TextSummarization
>>> device = 'cuda' if torch.cuda.is_available() else 'cpu'
>>> mname = "toloka/t5-large-for-text-aggregation"
>>> tokenizer = AutoTokenizer.from_pretrained(mname)
>>> model = AutoModelForSeq2SeqLM.from_pretrained(mname)
>>> agg = TextSummarization(tokenizer, model, device=device)
>>> result = agg.fit_predict(df)
...
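Continuing the example above, permutation-based aggregation can be enabled at construction time (the value of `n_permutations` is illustrative):

```python
agg = TextSummarization(tokenizer, model, device=device, n_permutations=5)
result = agg.fit_predict(df)
```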
Source code in crowdkit/learning/text_summarization.py
`concat_token = attr.ib(default=' | ')`

Token used to concatenate the workers' texts.

`device = attr.ib(default='cpu')`

Device to use, such as `cpu` or `cuda`.

`model = attr.ib()`

Pre-trained model for text summarization.

`n_permutations = attr.ib(default=None)`

Number of input permutations to use. If `None`, use a single permutation according to the input's order.

`num_beams = attr.ib(default=16)`

Number of beams for beam search. 1 means no beam search.

`permutation_aggregator = attr.ib(default=None)`

Text aggregation method used to aggregate the outputs of multiple input permutations if `n_permutations` is set.

`tokenizer = attr.ib()`

Tokenizer for the pre-trained model.
fit_predict(data)
Run the aggregation and return the aggregated texts.
Args:

data (DataFrame): Workers' text outputs. A pandas.DataFrame containing `task`, `worker`, and `text` columns.

Returns:

Series: Tasks' texts. A pandas.Series indexed by `task` such that `result.loc[task, text]` is the task's text.
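Continuing the example above, the input might look like this (the task, worker, and text values are made up):

```python
import pandas as pd

df = pd.DataFrame(
    [
        {"task": "t1", "worker": "w1", "text": "the quick brown fox jumps over the lazy dog"},
        {"task": "t1", "worker": "w2", "text": "the quik brown fox jumps over a lazy dog"},
        {"task": "t1", "worker": "w3", "text": "the quick brown fox jump over the lazy dog"},
    ]
)
result = agg.fit_predict(df)  # pandas.Series indexed by task
```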
Source code in crowdkit/learning/text_summarization.py