Zebulon Goriely committed on
Commit
da49dff
·
1 Parent(s): 4a8bdce

Upload model
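
The uploaded `blimp_results.json` reports an `acc,none` and an `acc_stderr,none` value per BLiMP subtask. The sketch below is not part of the commit; it only illustrates one way to sanity-check such a file. It assumes each subtask was scored on 1000 minimal pairs (BLiMP's standard paradigm size) and that the standard error is the sample standard error of a mean of 0/1 outcomes. The helper names `binomial_stderr` and `summarize` are hypothetical, and the excerpt holds only the first few entries copied from the diff.

```python
import json
import math


def binomial_stderr(acc, n):
    """Sample standard error of the mean of n 0/1 outcomes (n - 1 denominator)."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


def summarize(results, group="blimp", k=3):
    """Return (unweighted mean subtask accuracy, the k lowest-scoring subtasks)."""
    subtasks = {name: entry["acc,none"]
                for name, entry in results.items() if name != group}
    mean_acc = sum(subtasks.values()) / len(subtasks)
    worst = sorted(subtasks, key=subtasks.get)[:k]
    return mean_acc, worst


# Excerpt of the "results" object, values copied from the diff.
# For the full file one would instead use json.load(open(path))["results"].
excerpt = json.loads("""
{
  "blimp": {"acc,none": 0.7772686567164179,
            "acc_stderr,none": 0.001440291792047074, "alias": "blimp"},
  "blimp_adjunct_island": {"acc,none": 0.81,
                           "acc_stderr,none": 0.01241185135481633},
  "blimp_anaphor_gender_agreement": {"acc,none": 0.875,
                                     "acc_stderr,none": 0.010463483381956722},
  "blimp_anaphor_number_agreement": {"acc,none": 0.975,
                                     "acc_stderr,none": 0.004939574819698468}
}
""")

# Assumption: 1000 items per subtask.
N_ITEMS = 1000

for name, entry in excerpt.items():
    if name == "blimp":
        continue  # the group entry aggregates all subtasks, so skip it here
    recomputed = binomial_stderr(entry["acc,none"], N_ITEMS)
    print(name, entry["acc_stderr,none"], recomputed)

mean_acc, worst = summarize(excerpt)
print("mean over excerpt:", mean_acc, "lowest:", worst)
```

Run over the full file, `summarize` would also surface the subtasks that score below the 0.5 chance level, such as `blimp_existential_there_quantifiers_2` (0.248).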

ByteSpanSurprisalGlobalIncrement_64000/blimp_results.json ADDED
@@ -0,0 +1,2965 @@
+ {
+ "results": {
+ "blimp": {
+ "acc,none": 0.7772686567164179,
+ "acc_stderr,none": 0.001440291792047074,
+ "alias": "blimp"
+ },
+ "blimp_adjunct_island": {
+ "alias": " - blimp_adjunct_island",
+ "acc,none": 0.81,
+ "acc_stderr,none": 0.01241185135481633
+ },
+ "blimp_anaphor_gender_agreement": {
+ "alias": " - blimp_anaphor_gender_agreement",
+ "acc,none": 0.875,
+ "acc_stderr,none": 0.010463483381956722
+ },
+ "blimp_anaphor_number_agreement": {
+ "alias": " - blimp_anaphor_number_agreement",
+ "acc,none": 0.975,
+ "acc_stderr,none": 0.004939574819698468
+ },
+ "blimp_animate_subject_passive": {
+ "alias": " - blimp_animate_subject_passive",
+ "acc,none": 0.754,
+ "acc_stderr,none": 0.01362606581775064
+ },
+ "blimp_animate_subject_trans": {
+ "alias": " - blimp_animate_subject_trans",
+ "acc,none": 0.877,
+ "acc_stderr,none": 0.01039129342184988
+ },
+ "blimp_causative": {
+ "alias": " - blimp_causative",
+ "acc,none": 0.705,
+ "acc_stderr,none": 0.014428554438445509
+ },
+ "blimp_complex_NP_island": {
+ "alias": " - blimp_complex_NP_island",
+ "acc,none": 0.516,
+ "acc_stderr,none": 0.015811198373114878
+ },
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
+ "alias": " - blimp_coordinate_structure_constraint_complex_left_branch",
+ "acc,none": 0.598,
+ "acc_stderr,none": 0.015512467135715073
+ },
+ "blimp_coordinate_structure_constraint_object_extraction": {
+ "alias": " - blimp_coordinate_structure_constraint_object_extraction",
+ "acc,none": 0.828,
+ "acc_stderr,none": 0.011939788882495321
+ },
+ "blimp_determiner_noun_agreement_1": {
+ "alias": " - blimp_determiner_noun_agreement_1",
+ "acc,none": 0.986,
+ "acc_stderr,none": 0.0037172325482565808
+ },
+ "blimp_determiner_noun_agreement_2": {
+ "alias": " - blimp_determiner_noun_agreement_2",
+ "acc,none": 0.97,
+ "acc_stderr,none": 0.005397140829099201
+ },
+ "blimp_determiner_noun_agreement_irregular_1": {
+ "alias": " - blimp_determiner_noun_agreement_irregular_1",
+ "acc,none": 0.924,
+ "acc_stderr,none": 0.008384169266796394
+ },
+ "blimp_determiner_noun_agreement_irregular_2": {
+ "alias": " - blimp_determiner_noun_agreement_irregular_2",
+ "acc,none": 0.951,
+ "acc_stderr,none": 0.006829761756140915
+ },
+ "blimp_determiner_noun_agreement_with_adj_2": {
+ "alias": " - blimp_determiner_noun_agreement_with_adj_2",
+ "acc,none": 0.926,
+ "acc_stderr,none": 0.008282064512704157
+ },
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
+ "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_1",
+ "acc,none": 0.897,
+ "acc_stderr,none": 0.0096168333396958
+ },
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
+ "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_2",
+ "acc,none": 0.907,
+ "acc_stderr,none": 0.00918887563499668
+ },
+ "blimp_determiner_noun_agreement_with_adjective_1": {
+ "alias": " - blimp_determiner_noun_agreement_with_adjective_1",
+ "acc,none": 0.955,
+ "acc_stderr,none": 0.006558812241406107
+ },
+ "blimp_distractor_agreement_relational_noun": {
+ "alias": " - blimp_distractor_agreement_relational_noun",
+ "acc,none": 0.841,
+ "acc_stderr,none": 0.01156947936827129
+ },
+ "blimp_distractor_agreement_relative_clause": {
+ "alias": " - blimp_distractor_agreement_relative_clause",
+ "acc,none": 0.746,
+ "acc_stderr,none": 0.013772206565168543
+ },
+ "blimp_drop_argument": {
+ "alias": " - blimp_drop_argument",
+ "acc,none": 0.795,
+ "acc_stderr,none": 0.012772554096113112
+ },
+ "blimp_ellipsis_n_bar_1": {
+ "alias": " - blimp_ellipsis_n_bar_1",
+ "acc,none": 0.83,
+ "acc_stderr,none": 0.011884495834541665
+ },
+ "blimp_ellipsis_n_bar_2": {
+ "alias": " - blimp_ellipsis_n_bar_2",
+ "acc,none": 0.917,
+ "acc_stderr,none": 0.00872852720607479
+ },
+ "blimp_existential_there_object_raising": {
+ "alias": " - blimp_existential_there_object_raising",
+ "acc,none": 0.82,
+ "acc_stderr,none": 0.012155153135511952
+ },
+ "blimp_existential_there_quantifiers_1": {
+ "alias": " - blimp_existential_there_quantifiers_1",
+ "acc,none": 0.971,
+ "acc_stderr,none": 0.005309160685756989
+ },
+ "blimp_existential_there_quantifiers_2": {
+ "alias": " - blimp_existential_there_quantifiers_2",
+ "acc,none": 0.248,
+ "acc_stderr,none": 0.013663187134877665
+ },
+ "blimp_existential_there_subject_raising": {
+ "alias": " - blimp_existential_there_subject_raising",
+ "acc,none": 0.871,
+ "acc_stderr,none": 0.010605256784796586
+ },
+ "blimp_expletive_it_object_raising": {
+ "alias": " - blimp_expletive_it_object_raising",
+ "acc,none": 0.763,
+ "acc_stderr,none": 0.01345407046257796
+ },
+ "blimp_inchoative": {
+ "alias": " - blimp_inchoative",
+ "acc,none": 0.684,
+ "acc_stderr,none": 0.01470919305605713
+ },
+ "blimp_intransitive": {
+ "alias": " - blimp_intransitive",
+ "acc,none": 0.768,
+ "acc_stderr,none": 0.013354937452281558
+ },
+ "blimp_irregular_past_participle_adjectives": {
+ "alias": " - blimp_irregular_past_participle_adjectives",
+ "acc,none": 0.857,
+ "acc_stderr,none": 0.01107581480856704
+ },
+ "blimp_irregular_past_participle_verbs": {
+ "alias": " - blimp_irregular_past_participle_verbs",
+ "acc,none": 0.879,
+ "acc_stderr,none": 0.010318210380946087
+ },
+ "blimp_irregular_plural_subject_verb_agreement_1": {
+ "alias": " - blimp_irregular_plural_subject_verb_agreement_1",
+ "acc,none": 0.883,
+ "acc_stderr,none": 0.010169287802713329
+ },
+ "blimp_irregular_plural_subject_verb_agreement_2": {
+ "alias": " - blimp_irregular_plural_subject_verb_agreement_2",
+ "acc,none": 0.898,
+ "acc_stderr,none": 0.00957536880165389
+ },
+ "blimp_left_branch_island_echo_question": {
+ "alias": " - blimp_left_branch_island_echo_question",
+ "acc,none": 0.374,
+ "acc_stderr,none": 0.015308767369006366
+ },
+ "blimp_left_branch_island_simple_question": {
+ "alias": " - blimp_left_branch_island_simple_question",
+ "acc,none": 0.615,
+ "acc_stderr,none": 0.015395194445410808
+ },
+ "blimp_matrix_question_npi_licensor_present": {
+ "alias": " - blimp_matrix_question_npi_licensor_present",
+ "acc,none": 0.431,
+ "acc_stderr,none": 0.0156679444881735
+ },
+ "blimp_npi_present_1": {
+ "alias": " - blimp_npi_present_1",
+ "acc,none": 0.524,
+ "acc_stderr,none": 0.015801065586651758
+ },
+ "blimp_npi_present_2": {
+ "alias": " - blimp_npi_present_2",
+ "acc,none": 0.561,
+ "acc_stderr,none": 0.015701131345400767
+ },
+ "blimp_only_npi_licensor_present": {
+ "alias": " - blimp_only_npi_licensor_present",
+ "acc,none": 0.922,
+ "acc_stderr,none": 0.008484573530118594
+ },
+ "blimp_only_npi_scope": {
+ "alias": " - blimp_only_npi_scope",
+ "acc,none": 0.563,
+ "acc_stderr,none": 0.015693223928730377
+ },
+ "blimp_passive_1": {
+ "alias": " - blimp_passive_1",
+ "acc,none": 0.903,
+ "acc_stderr,none": 0.009363689373248113
+ },
+ "blimp_passive_2": {
+ "alias": " - blimp_passive_2",
+ "acc,none": 0.888,
+ "acc_stderr,none": 0.009977753031397245
+ },
+ "blimp_principle_A_c_command": {
+ "alias": " - blimp_principle_A_c_command",
+ "acc,none": 0.746,
+ "acc_stderr,none": 0.01377220656516854
+ },
+ "blimp_principle_A_case_1": {
+ "alias": " - blimp_principle_A_case_1",
+ "acc,none": 1.0,
+ "acc_stderr,none": 0.0
+ },
+ "blimp_principle_A_case_2": {
+ "alias": " - blimp_principle_A_case_2",
+ "acc,none": 0.934,
+ "acc_stderr,none": 0.007855297938697605
+ },
+ "blimp_principle_A_domain_1": {
+ "alias": " - blimp_principle_A_domain_1",
+ "acc,none": 0.945,
+ "acc_stderr,none": 0.007212976294639239
+ },
+ "blimp_principle_A_domain_2": {
+ "alias": " - blimp_principle_A_domain_2",
+ "acc,none": 0.817,
+ "acc_stderr,none": 0.012233587399477826
+ },
+ "blimp_principle_A_domain_3": {
+ "alias": " - blimp_principle_A_domain_3",
+ "acc,none": 0.611,
+ "acc_stderr,none": 0.015424555647308493
+ },
+ "blimp_principle_A_reconstruction": {
+ "alias": " - blimp_principle_A_reconstruction",
+ "acc,none": 0.453,
+ "acc_stderr,none": 0.015749255189977593
+ },
+ "blimp_regular_plural_subject_verb_agreement_1": {
+ "alias": " - blimp_regular_plural_subject_verb_agreement_1",
+ "acc,none": 0.937,
+ "acc_stderr,none": 0.0076870078762864185
+ },
+ "blimp_regular_plural_subject_verb_agreement_2": {
+ "alias": " - blimp_regular_plural_subject_verb_agreement_2",
+ "acc,none": 0.893,
+ "acc_stderr,none": 0.009779910359847169
+ },
+ "blimp_sentential_negation_npi_licensor_present": {
+ "alias": " - blimp_sentential_negation_npi_licensor_present",
+ "acc,none": 0.941,
+ "acc_stderr,none": 0.007454835650406727
+ },
+ "blimp_sentential_negation_npi_scope": {
+ "alias": " - blimp_sentential_negation_npi_scope",
+ "acc,none": 0.516,
+ "acc_stderr,none": 0.01581119837311488
+ },
+ "blimp_sentential_subject_island": {
+ "alias": " - blimp_sentential_subject_island",
+ "acc,none": 0.371,
+ "acc_stderr,none": 0.015283736211823188
+ },
+ "blimp_superlative_quantifiers_1": {
+ "alias": " - blimp_superlative_quantifiers_1",
+ "acc,none": 0.811,
+ "acc_stderr,none": 0.012386784588117717
+ },
+ "blimp_superlative_quantifiers_2": {
+ "alias": " - blimp_superlative_quantifiers_2",
+ "acc,none": 0.747,
+ "acc_stderr,none": 0.01375427861358708
+ },
+ "blimp_tough_vs_raising_1": {
+ "alias": " - blimp_tough_vs_raising_1",
+ "acc,none": 0.585,
+ "acc_stderr,none": 0.015589035185604623
+ },
+ "blimp_tough_vs_raising_2": {
+ "alias": " - blimp_tough_vs_raising_2",
+ "acc,none": 0.856,
+ "acc_stderr,none": 0.011107987548939149
+ },
+ "blimp_transitive": {
+ "alias": " - blimp_transitive",
+ "acc,none": 0.857,
+ "acc_stderr,none": 0.011075814808567038
+ },
+ "blimp_wh_island": {
+ "alias": " - blimp_wh_island",
+ "acc,none": 0.825,
+ "acc_stderr,none": 0.012021627157731972
+ },
+ "blimp_wh_questions_object_gap": {
+ "alias": " - blimp_wh_questions_object_gap",
+ "acc,none": 0.725,
+ "acc_stderr,none": 0.014127086556490528
+ },
+ "blimp_wh_questions_subject_gap": {
+ "alias": " - blimp_wh_questions_subject_gap",
+ "acc,none": 0.903,
+ "acc_stderr,none": 0.009363689373248104
+ },
+ "blimp_wh_questions_subject_gap_long_distance": {
+ "alias": " - blimp_wh_questions_subject_gap_long_distance",
+ "acc,none": 0.846,
+ "acc_stderr,none": 0.01141991306509869
+ },
+ "blimp_wh_vs_that_no_gap": {
+ "alias": " - blimp_wh_vs_that_no_gap",
+ "acc,none": 0.948,
+ "acc_stderr,none": 0.007024624213817151
+ },
+ "blimp_wh_vs_that_no_gap_long_distance": {
+ "alias": " - blimp_wh_vs_that_no_gap_long_distance",
+ "acc,none": 0.974,
+ "acc_stderr,none": 0.0050348137353181934
+ },
+ "blimp_wh_vs_that_with_gap": {
+ "alias": " - blimp_wh_vs_that_with_gap",
+ "acc,none": 0.564,
+ "acc_stderr,none": 0.01568917302314406
+ },
+ "blimp_wh_vs_that_with_gap_long_distance": {
+ "alias": " - blimp_wh_vs_that_with_gap_long_distance",
+ "acc,none": 0.266,
+ "acc_stderr,none": 0.013979965645145158
+ }
+ },
+ "groups": {
+ "blimp": {
+ "acc,none": 0.7772686567164179,
+ "acc_stderr,none": 0.001440291792047074,
+ "alias": "blimp"
+ }
+ },
+ "group_subtasks": {
+ "blimp": [
+ "blimp_adjunct_island",
+ "blimp_anaphor_gender_agreement",
+ "blimp_anaphor_number_agreement",
+ "blimp_animate_subject_passive",
+ "blimp_animate_subject_trans",
+ "blimp_causative",
+ "blimp_complex_NP_island",
+ "blimp_coordinate_structure_constraint_complex_left_branch",
+ "blimp_coordinate_structure_constraint_object_extraction",
+ "blimp_determiner_noun_agreement_1",
+ "blimp_determiner_noun_agreement_2",
+ "blimp_determiner_noun_agreement_irregular_1",
+ "blimp_determiner_noun_agreement_irregular_2",
+ "blimp_determiner_noun_agreement_with_adj_2",
+ "blimp_determiner_noun_agreement_with_adj_irregular_1",
+ "blimp_determiner_noun_agreement_with_adj_irregular_2",
+ "blimp_determiner_noun_agreement_with_adjective_1",
+ "blimp_distractor_agreement_relational_noun",
+ "blimp_distractor_agreement_relative_clause",
+ "blimp_drop_argument",
+ "blimp_ellipsis_n_bar_1",
+ "blimp_ellipsis_n_bar_2",
+ "blimp_existential_there_object_raising",
+ "blimp_existential_there_quantifiers_1",
+ "blimp_existential_there_quantifiers_2",
+ "blimp_existential_there_subject_raising",
+ "blimp_expletive_it_object_raising",
+ "blimp_inchoative",
+ "blimp_intransitive",
+ "blimp_irregular_past_participle_adjectives",
+ "blimp_irregular_past_participle_verbs",
+ "blimp_irregular_plural_subject_verb_agreement_1",
+ "blimp_irregular_plural_subject_verb_agreement_2",
+ "blimp_left_branch_island_echo_question",
+ "blimp_left_branch_island_simple_question",
+ "blimp_matrix_question_npi_licensor_present",
+ "blimp_npi_present_1",
+ "blimp_npi_present_2",
+ "blimp_only_npi_licensor_present",
+ "blimp_only_npi_scope",
+ "blimp_passive_1",
+ "blimp_passive_2",
+ "blimp_principle_A_c_command",
+ "blimp_principle_A_case_1",
+ "blimp_principle_A_case_2",
+ "blimp_principle_A_domain_1",
+ "blimp_principle_A_domain_2",
+ "blimp_principle_A_domain_3",
+ "blimp_principle_A_reconstruction",
+ "blimp_regular_plural_subject_verb_agreement_1",
+ "blimp_regular_plural_subject_verb_agreement_2",
+ "blimp_sentential_negation_npi_licensor_present",
+ "blimp_sentential_negation_npi_scope",
+ "blimp_sentential_subject_island",
+ "blimp_superlative_quantifiers_1",
+ "blimp_superlative_quantifiers_2",
+ "blimp_tough_vs_raising_1",
+ "blimp_tough_vs_raising_2",
+ "blimp_transitive",
+ "blimp_wh_island",
+ "blimp_wh_questions_object_gap",
+ "blimp_wh_questions_subject_gap",
+ "blimp_wh_questions_subject_gap_long_distance",
+ "blimp_wh_vs_that_no_gap",
+ "blimp_wh_vs_that_no_gap_long_distance",
+ "blimp_wh_vs_that_with_gap",
+ "blimp_wh_vs_that_with_gap_long_distance"
+ ]
+ },
+ "configs": {
+ "blimp_adjunct_island": {
+ "task": "blimp_adjunct_island",
+ "dataset_path": "blimp",
+ "dataset_name": "adjunct_island",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_anaphor_gender_agreement": {
+ "task": "blimp_anaphor_gender_agreement",
+ "dataset_path": "blimp",
+ "dataset_name": "anaphor_gender_agreement",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_anaphor_number_agreement": {
+ "task": "blimp_anaphor_number_agreement",
+ "dataset_path": "blimp",
+ "dataset_name": "anaphor_number_agreement",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_animate_subject_passive": {
+ "task": "blimp_animate_subject_passive",
+ "dataset_path": "blimp",
+ "dataset_name": "animate_subject_passive",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_animate_subject_trans": {
+ "task": "blimp_animate_subject_trans",
+ "dataset_path": "blimp",
+ "dataset_name": "animate_subject_trans",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_causative": {
+ "task": "blimp_causative",
+ "dataset_path": "blimp",
+ "dataset_name": "causative",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_complex_NP_island": {
+ "task": "blimp_complex_NP_island",
+ "dataset_path": "blimp",
+ "dataset_name": "complex_NP_island",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
+ "task": "blimp_coordinate_structure_constraint_complex_left_branch",
+ "dataset_path": "blimp",
+ "dataset_name": "coordinate_structure_constraint_complex_left_branch",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_coordinate_structure_constraint_object_extraction": {
+ "task": "blimp_coordinate_structure_constraint_object_extraction",
+ "dataset_path": "blimp",
+ "dataset_name": "coordinate_structure_constraint_object_extraction",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_1": {
+ "task": "blimp_determiner_noun_agreement_1",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_1",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_2": {
+ "task": "blimp_determiner_noun_agreement_2",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_2",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_irregular_1": {
+ "task": "blimp_determiner_noun_agreement_irregular_1",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_irregular_1",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_irregular_2": {
+ "task": "blimp_determiner_noun_agreement_irregular_2",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_irregular_2",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_with_adj_2": {
+ "task": "blimp_determiner_noun_agreement_with_adj_2",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_with_adj_2",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_1",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_with_adj_irregular_1",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "unsafe_code": false,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_2",
+ "dataset_path": "blimp",
+ "dataset_name": "determiner_noun_agreement_with_adj_irregular_2",
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
850
+ "unsafe_code": false,
851
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
852
+ "description": "",
853
+ "target_delimiter": " ",
854
+ "fewshot_delimiter": "\n\n",
855
+ "num_fewshot": 0,
856
+ "metric_list": [
857
+ {
858
+ "metric": "acc",
859
+ "aggregation": "mean",
860
+ "higher_is_better": true
861
+ }
862
+ ],
863
+ "output_type": "multiple_choice",
864
+ "repeats": 1,
865
+ "should_decontaminate": true,
866
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
867
+ "metadata": {
868
+ "version": 1.0
869
+ }
870
+ },
871
+ "blimp_determiner_noun_agreement_with_adjective_1": {
872
+ "task": "blimp_determiner_noun_agreement_with_adjective_1",
873
+ "dataset_path": "blimp",
874
+ "dataset_name": "determiner_noun_agreement_with_adjective_1",
875
+ "validation_split": "train",
876
+ "doc_to_text": "",
877
+ "doc_to_target": 0,
878
+ "unsafe_code": false,
879
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
880
+ "description": "",
881
+ "target_delimiter": " ",
882
+ "fewshot_delimiter": "\n\n",
883
+ "num_fewshot": 0,
884
+ "metric_list": [
885
+ {
886
+ "metric": "acc",
887
+ "aggregation": "mean",
888
+ "higher_is_better": true
889
+ }
890
+ ],
891
+ "output_type": "multiple_choice",
892
+ "repeats": 1,
893
+ "should_decontaminate": true,
894
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
895
+ "metadata": {
896
+ "version": 1.0
897
+ }
898
+ },
899
+ "blimp_distractor_agreement_relational_noun": {
900
+ "task": "blimp_distractor_agreement_relational_noun",
901
+ "dataset_path": "blimp",
902
+ "dataset_name": "distractor_agreement_relational_noun",
903
+ "validation_split": "train",
904
+ "doc_to_text": "",
905
+ "doc_to_target": 0,
906
+ "unsafe_code": false,
907
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
908
+ "description": "",
909
+ "target_delimiter": " ",
910
+ "fewshot_delimiter": "\n\n",
911
+ "num_fewshot": 0,
912
+ "metric_list": [
913
+ {
914
+ "metric": "acc",
915
+ "aggregation": "mean",
916
+ "higher_is_better": true
917
+ }
918
+ ],
919
+ "output_type": "multiple_choice",
920
+ "repeats": 1,
921
+ "should_decontaminate": true,
922
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
923
+ "metadata": {
924
+ "version": 1.0
925
+ }
926
+ },
927
+ "blimp_distractor_agreement_relative_clause": {
928
+ "task": "blimp_distractor_agreement_relative_clause",
929
+ "dataset_path": "blimp",
930
+ "dataset_name": "distractor_agreement_relative_clause",
931
+ "validation_split": "train",
932
+ "doc_to_text": "",
933
+ "doc_to_target": 0,
934
+ "unsafe_code": false,
935
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
936
+ "description": "",
937
+ "target_delimiter": " ",
938
+ "fewshot_delimiter": "\n\n",
939
+ "num_fewshot": 0,
940
+ "metric_list": [
941
+ {
942
+ "metric": "acc",
943
+ "aggregation": "mean",
944
+ "higher_is_better": true
945
+ }
946
+ ],
947
+ "output_type": "multiple_choice",
948
+ "repeats": 1,
949
+ "should_decontaminate": true,
950
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
951
+ "metadata": {
952
+ "version": 1.0
953
+ }
954
+ },
955
+ "blimp_drop_argument": {
956
+ "task": "blimp_drop_argument",
957
+ "dataset_path": "blimp",
958
+ "dataset_name": "drop_argument",
959
+ "validation_split": "train",
960
+ "doc_to_text": "",
961
+ "doc_to_target": 0,
962
+ "unsafe_code": false,
963
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
964
+ "description": "",
965
+ "target_delimiter": " ",
966
+ "fewshot_delimiter": "\n\n",
967
+ "num_fewshot": 0,
968
+ "metric_list": [
969
+ {
970
+ "metric": "acc",
971
+ "aggregation": "mean",
972
+ "higher_is_better": true
973
+ }
974
+ ],
975
+ "output_type": "multiple_choice",
976
+ "repeats": 1,
977
+ "should_decontaminate": true,
978
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
979
+ "metadata": {
980
+ "version": 1.0
981
+ }
982
+ },
983
+ "blimp_ellipsis_n_bar_1": {
984
+ "task": "blimp_ellipsis_n_bar_1",
985
+ "dataset_path": "blimp",
986
+ "dataset_name": "ellipsis_n_bar_1",
987
+ "validation_split": "train",
988
+ "doc_to_text": "",
989
+ "doc_to_target": 0,
990
+ "unsafe_code": false,
991
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
992
+ "description": "",
993
+ "target_delimiter": " ",
994
+ "fewshot_delimiter": "\n\n",
995
+ "num_fewshot": 0,
996
+ "metric_list": [
997
+ {
998
+ "metric": "acc",
999
+ "aggregation": "mean",
1000
+ "higher_is_better": true
1001
+ }
1002
+ ],
1003
+ "output_type": "multiple_choice",
1004
+ "repeats": 1,
1005
+ "should_decontaminate": true,
1006
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1007
+ "metadata": {
1008
+ "version": 1.0
1009
+ }
1010
+ },
1011
+ "blimp_ellipsis_n_bar_2": {
1012
+ "task": "blimp_ellipsis_n_bar_2",
1013
+ "dataset_path": "blimp",
1014
+ "dataset_name": "ellipsis_n_bar_2",
1015
+ "validation_split": "train",
1016
+ "doc_to_text": "",
1017
+ "doc_to_target": 0,
1018
+ "unsafe_code": false,
1019
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1020
+ "description": "",
1021
+ "target_delimiter": " ",
1022
+ "fewshot_delimiter": "\n\n",
1023
+ "num_fewshot": 0,
1024
+ "metric_list": [
1025
+ {
1026
+ "metric": "acc",
1027
+ "aggregation": "mean",
1028
+ "higher_is_better": true
1029
+ }
1030
+ ],
1031
+ "output_type": "multiple_choice",
1032
+ "repeats": 1,
1033
+ "should_decontaminate": true,
1034
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1035
+ "metadata": {
1036
+ "version": 1.0
1037
+ }
1038
+ },
1039
+ "blimp_existential_there_object_raising": {
1040
+ "task": "blimp_existential_there_object_raising",
1041
+ "dataset_path": "blimp",
1042
+ "dataset_name": "existential_there_object_raising",
1043
+ "validation_split": "train",
1044
+ "doc_to_text": "",
1045
+ "doc_to_target": 0,
1046
+ "unsafe_code": false,
1047
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1048
+ "description": "",
1049
+ "target_delimiter": " ",
1050
+ "fewshot_delimiter": "\n\n",
1051
+ "num_fewshot": 0,
1052
+ "metric_list": [
1053
+ {
1054
+ "metric": "acc",
1055
+ "aggregation": "mean",
1056
+ "higher_is_better": true
1057
+ }
1058
+ ],
1059
+ "output_type": "multiple_choice",
1060
+ "repeats": 1,
1061
+ "should_decontaminate": true,
1062
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1063
+ "metadata": {
1064
+ "version": 1.0
1065
+ }
1066
+ },
1067
+ "blimp_existential_there_quantifiers_1": {
1068
+ "task": "blimp_existential_there_quantifiers_1",
1069
+ "dataset_path": "blimp",
1070
+ "dataset_name": "existential_there_quantifiers_1",
1071
+ "validation_split": "train",
1072
+ "doc_to_text": "",
1073
+ "doc_to_target": 0,
1074
+ "unsafe_code": false,
1075
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1076
+ "description": "",
1077
+ "target_delimiter": " ",
1078
+ "fewshot_delimiter": "\n\n",
1079
+ "num_fewshot": 0,
1080
+ "metric_list": [
1081
+ {
1082
+ "metric": "acc",
1083
+ "aggregation": "mean",
1084
+ "higher_is_better": true
1085
+ }
1086
+ ],
1087
+ "output_type": "multiple_choice",
1088
+ "repeats": 1,
1089
+ "should_decontaminate": true,
1090
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1091
+ "metadata": {
1092
+ "version": 1.0
1093
+ }
1094
+ },
1095
+ "blimp_existential_there_quantifiers_2": {
1096
+ "task": "blimp_existential_there_quantifiers_2",
1097
+ "dataset_path": "blimp",
1098
+ "dataset_name": "existential_there_quantifiers_2",
1099
+ "validation_split": "train",
1100
+ "doc_to_text": "",
1101
+ "doc_to_target": 0,
1102
+ "unsafe_code": false,
1103
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1104
+ "description": "",
1105
+ "target_delimiter": " ",
1106
+ "fewshot_delimiter": "\n\n",
1107
+ "num_fewshot": 0,
1108
+ "metric_list": [
1109
+ {
1110
+ "metric": "acc",
1111
+ "aggregation": "mean",
1112
+ "higher_is_better": true
1113
+ }
1114
+ ],
1115
+ "output_type": "multiple_choice",
1116
+ "repeats": 1,
1117
+ "should_decontaminate": true,
1118
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1119
+ "metadata": {
1120
+ "version": 1.0
1121
+ }
1122
+ },
1123
+ "blimp_existential_there_subject_raising": {
1124
+ "task": "blimp_existential_there_subject_raising",
1125
+ "dataset_path": "blimp",
1126
+ "dataset_name": "existential_there_subject_raising",
1127
+ "validation_split": "train",
1128
+ "doc_to_text": "",
1129
+ "doc_to_target": 0,
1130
+ "unsafe_code": false,
1131
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1132
+ "description": "",
1133
+ "target_delimiter": " ",
1134
+ "fewshot_delimiter": "\n\n",
1135
+ "num_fewshot": 0,
1136
+ "metric_list": [
1137
+ {
1138
+ "metric": "acc",
1139
+ "aggregation": "mean",
1140
+ "higher_is_better": true
1141
+ }
1142
+ ],
1143
+ "output_type": "multiple_choice",
1144
+ "repeats": 1,
1145
+ "should_decontaminate": true,
1146
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1147
+ "metadata": {
1148
+ "version": 1.0
1149
+ }
1150
+ },
1151
+ "blimp_expletive_it_object_raising": {
1152
+ "task": "blimp_expletive_it_object_raising",
1153
+ "dataset_path": "blimp",
1154
+ "dataset_name": "expletive_it_object_raising",
1155
+ "validation_split": "train",
1156
+ "doc_to_text": "",
1157
+ "doc_to_target": 0,
1158
+ "unsafe_code": false,
1159
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1160
+ "description": "",
1161
+ "target_delimiter": " ",
1162
+ "fewshot_delimiter": "\n\n",
1163
+ "num_fewshot": 0,
1164
+ "metric_list": [
1165
+ {
1166
+ "metric": "acc",
1167
+ "aggregation": "mean",
1168
+ "higher_is_better": true
1169
+ }
1170
+ ],
1171
+ "output_type": "multiple_choice",
1172
+ "repeats": 1,
1173
+ "should_decontaminate": true,
1174
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1175
+ "metadata": {
1176
+ "version": 1.0
1177
+ }
1178
+ },
1179
+ "blimp_inchoative": {
1180
+ "task": "blimp_inchoative",
1181
+ "dataset_path": "blimp",
1182
+ "dataset_name": "inchoative",
1183
+ "validation_split": "train",
1184
+ "doc_to_text": "",
1185
+ "doc_to_target": 0,
1186
+ "unsafe_code": false,
1187
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1188
+ "description": "",
1189
+ "target_delimiter": " ",
1190
+ "fewshot_delimiter": "\n\n",
1191
+ "num_fewshot": 0,
1192
+ "metric_list": [
1193
+ {
1194
+ "metric": "acc",
1195
+ "aggregation": "mean",
1196
+ "higher_is_better": true
1197
+ }
1198
+ ],
1199
+ "output_type": "multiple_choice",
1200
+ "repeats": 1,
1201
+ "should_decontaminate": true,
1202
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1203
+ "metadata": {
1204
+ "version": 1.0
1205
+ }
1206
+ },
1207
+ "blimp_intransitive": {
1208
+ "task": "blimp_intransitive",
1209
+ "dataset_path": "blimp",
1210
+ "dataset_name": "intransitive",
1211
+ "validation_split": "train",
1212
+ "doc_to_text": "",
1213
+ "doc_to_target": 0,
1214
+ "unsafe_code": false,
1215
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1216
+ "description": "",
1217
+ "target_delimiter": " ",
1218
+ "fewshot_delimiter": "\n\n",
1219
+ "num_fewshot": 0,
1220
+ "metric_list": [
1221
+ {
1222
+ "metric": "acc",
1223
+ "aggregation": "mean",
1224
+ "higher_is_better": true
1225
+ }
1226
+ ],
1227
+ "output_type": "multiple_choice",
1228
+ "repeats": 1,
1229
+ "should_decontaminate": true,
1230
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1231
+ "metadata": {
1232
+ "version": 1.0
1233
+ }
1234
+ },
1235
+ "blimp_irregular_past_participle_adjectives": {
1236
+ "task": "blimp_irregular_past_participle_adjectives",
1237
+ "dataset_path": "blimp",
1238
+ "dataset_name": "irregular_past_participle_adjectives",
1239
+ "validation_split": "train",
1240
+ "doc_to_text": "",
1241
+ "doc_to_target": 0,
1242
+ "unsafe_code": false,
1243
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1244
+ "description": "",
1245
+ "target_delimiter": " ",
1246
+ "fewshot_delimiter": "\n\n",
1247
+ "num_fewshot": 0,
1248
+ "metric_list": [
1249
+ {
1250
+ "metric": "acc",
1251
+ "aggregation": "mean",
1252
+ "higher_is_better": true
1253
+ }
1254
+ ],
1255
+ "output_type": "multiple_choice",
1256
+ "repeats": 1,
1257
+ "should_decontaminate": true,
1258
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1259
+ "metadata": {
1260
+ "version": 1.0
1261
+ }
1262
+ },
1263
+ "blimp_irregular_past_participle_verbs": {
1264
+ "task": "blimp_irregular_past_participle_verbs",
1265
+ "dataset_path": "blimp",
1266
+ "dataset_name": "irregular_past_participle_verbs",
1267
+ "validation_split": "train",
1268
+ "doc_to_text": "",
1269
+ "doc_to_target": 0,
1270
+ "unsafe_code": false,
1271
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1272
+ "description": "",
1273
+ "target_delimiter": " ",
1274
+ "fewshot_delimiter": "\n\n",
1275
+ "num_fewshot": 0,
1276
+ "metric_list": [
1277
+ {
1278
+ "metric": "acc",
1279
+ "aggregation": "mean",
1280
+ "higher_is_better": true
1281
+ }
1282
+ ],
1283
+ "output_type": "multiple_choice",
1284
+ "repeats": 1,
1285
+ "should_decontaminate": true,
1286
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1287
+ "metadata": {
1288
+ "version": 1.0
1289
+ }
1290
+ },
1291
+ "blimp_irregular_plural_subject_verb_agreement_1": {
1292
+ "task": "blimp_irregular_plural_subject_verb_agreement_1",
1293
+ "dataset_path": "blimp",
1294
+ "dataset_name": "irregular_plural_subject_verb_agreement_1",
1295
+ "validation_split": "train",
1296
+ "doc_to_text": "",
1297
+ "doc_to_target": 0,
1298
+ "unsafe_code": false,
1299
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1300
+ "description": "",
1301
+ "target_delimiter": " ",
1302
+ "fewshot_delimiter": "\n\n",
1303
+ "num_fewshot": 0,
1304
+ "metric_list": [
1305
+ {
1306
+ "metric": "acc",
1307
+ "aggregation": "mean",
1308
+ "higher_is_better": true
1309
+ }
1310
+ ],
1311
+ "output_type": "multiple_choice",
1312
+ "repeats": 1,
1313
+ "should_decontaminate": true,
1314
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1315
+ "metadata": {
1316
+ "version": 1.0
1317
+ }
1318
+ },
1319
+ "blimp_irregular_plural_subject_verb_agreement_2": {
1320
+ "task": "blimp_irregular_plural_subject_verb_agreement_2",
1321
+ "dataset_path": "blimp",
1322
+ "dataset_name": "irregular_plural_subject_verb_agreement_2",
1323
+ "validation_split": "train",
1324
+ "doc_to_text": "",
1325
+ "doc_to_target": 0,
1326
+ "unsafe_code": false,
1327
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1328
+ "description": "",
1329
+ "target_delimiter": " ",
1330
+ "fewshot_delimiter": "\n\n",
1331
+ "num_fewshot": 0,
1332
+ "metric_list": [
1333
+ {
1334
+ "metric": "acc",
1335
+ "aggregation": "mean",
1336
+ "higher_is_better": true
1337
+ }
1338
+ ],
1339
+ "output_type": "multiple_choice",
1340
+ "repeats": 1,
1341
+ "should_decontaminate": true,
1342
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1343
+ "metadata": {
1344
+ "version": 1.0
1345
+ }
1346
+ },
1347
+ "blimp_left_branch_island_echo_question": {
1348
+ "task": "blimp_left_branch_island_echo_question",
1349
+ "dataset_path": "blimp",
1350
+ "dataset_name": "left_branch_island_echo_question",
1351
+ "validation_split": "train",
1352
+ "doc_to_text": "",
1353
+ "doc_to_target": 0,
1354
+ "unsafe_code": false,
1355
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1356
+ "description": "",
1357
+ "target_delimiter": " ",
1358
+ "fewshot_delimiter": "\n\n",
1359
+ "num_fewshot": 0,
1360
+ "metric_list": [
1361
+ {
1362
+ "metric": "acc",
1363
+ "aggregation": "mean",
1364
+ "higher_is_better": true
1365
+ }
1366
+ ],
1367
+ "output_type": "multiple_choice",
1368
+ "repeats": 1,
1369
+ "should_decontaminate": true,
1370
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1371
+ "metadata": {
1372
+ "version": 1.0
1373
+ }
1374
+ },
1375
+ "blimp_left_branch_island_simple_question": {
1376
+ "task": "blimp_left_branch_island_simple_question",
1377
+ "dataset_path": "blimp",
1378
+ "dataset_name": "left_branch_island_simple_question",
1379
+ "validation_split": "train",
1380
+ "doc_to_text": "",
1381
+ "doc_to_target": 0,
1382
+ "unsafe_code": false,
1383
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1384
+ "description": "",
1385
+ "target_delimiter": " ",
1386
+ "fewshot_delimiter": "\n\n",
1387
+ "num_fewshot": 0,
1388
+ "metric_list": [
1389
+ {
1390
+ "metric": "acc",
1391
+ "aggregation": "mean",
1392
+ "higher_is_better": true
1393
+ }
1394
+ ],
1395
+ "output_type": "multiple_choice",
1396
+ "repeats": 1,
1397
+ "should_decontaminate": true,
1398
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1399
+ "metadata": {
1400
+ "version": 1.0
1401
+ }
1402
+ },
1403
+ "blimp_matrix_question_npi_licensor_present": {
1404
+ "task": "blimp_matrix_question_npi_licensor_present",
1405
+ "dataset_path": "blimp",
1406
+ "dataset_name": "matrix_question_npi_licensor_present",
1407
+ "validation_split": "train",
1408
+ "doc_to_text": "",
1409
+ "doc_to_target": 0,
1410
+ "unsafe_code": false,
1411
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1412
+ "description": "",
1413
+ "target_delimiter": " ",
1414
+ "fewshot_delimiter": "\n\n",
1415
+ "num_fewshot": 0,
1416
+ "metric_list": [
1417
+ {
1418
+ "metric": "acc",
1419
+ "aggregation": "mean",
1420
+ "higher_is_better": true
1421
+ }
1422
+ ],
1423
+ "output_type": "multiple_choice",
1424
+ "repeats": 1,
1425
+ "should_decontaminate": true,
1426
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1427
+ "metadata": {
1428
+ "version": 1.0
1429
+ }
1430
+ },
1431
+ "blimp_npi_present_1": {
1432
+ "task": "blimp_npi_present_1",
1433
+ "dataset_path": "blimp",
1434
+ "dataset_name": "npi_present_1",
1435
+ "validation_split": "train",
1436
+ "doc_to_text": "",
1437
+ "doc_to_target": 0,
1438
+ "unsafe_code": false,
1439
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1440
+ "description": "",
1441
+ "target_delimiter": " ",
1442
+ "fewshot_delimiter": "\n\n",
1443
+ "num_fewshot": 0,
1444
+ "metric_list": [
1445
+ {
1446
+ "metric": "acc",
1447
+ "aggregation": "mean",
1448
+ "higher_is_better": true
1449
+ }
1450
+ ],
1451
+ "output_type": "multiple_choice",
1452
+ "repeats": 1,
1453
+ "should_decontaminate": true,
1454
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1455
+ "metadata": {
1456
+ "version": 1.0
1457
+ }
1458
+ },
1459
+ "blimp_npi_present_2": {
1460
+ "task": "blimp_npi_present_2",
1461
+ "dataset_path": "blimp",
1462
+ "dataset_name": "npi_present_2",
1463
+ "validation_split": "train",
1464
+ "doc_to_text": "",
1465
+ "doc_to_target": 0,
1466
+ "unsafe_code": false,
1467
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1468
+ "description": "",
1469
+ "target_delimiter": " ",
1470
+ "fewshot_delimiter": "\n\n",
1471
+ "num_fewshot": 0,
1472
+ "metric_list": [
1473
+ {
1474
+ "metric": "acc",
1475
+ "aggregation": "mean",
1476
+ "higher_is_better": true
1477
+ }
1478
+ ],
1479
+ "output_type": "multiple_choice",
1480
+ "repeats": 1,
1481
+ "should_decontaminate": true,
1482
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1483
+ "metadata": {
1484
+ "version": 1.0
1485
+ }
1486
+ },
1487
+ "blimp_only_npi_licensor_present": {
1488
+ "task": "blimp_only_npi_licensor_present",
1489
+ "dataset_path": "blimp",
1490
+ "dataset_name": "only_npi_licensor_present",
1491
+ "validation_split": "train",
1492
+ "doc_to_text": "",
1493
+ "doc_to_target": 0,
1494
+ "unsafe_code": false,
1495
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1496
+ "description": "",
1497
+ "target_delimiter": " ",
1498
+ "fewshot_delimiter": "\n\n",
1499
+ "num_fewshot": 0,
1500
+ "metric_list": [
1501
+ {
1502
+ "metric": "acc",
1503
+ "aggregation": "mean",
1504
+ "higher_is_better": true
1505
+ }
1506
+ ],
1507
+ "output_type": "multiple_choice",
1508
+ "repeats": 1,
1509
+ "should_decontaminate": true,
1510
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1511
+ "metadata": {
1512
+ "version": 1.0
1513
+ }
1514
+ },
1515
+ "blimp_only_npi_scope": {
1516
+ "task": "blimp_only_npi_scope",
1517
+ "dataset_path": "blimp",
1518
+ "dataset_name": "only_npi_scope",
1519
+ "validation_split": "train",
1520
+ "doc_to_text": "",
1521
+ "doc_to_target": 0,
1522
+ "unsafe_code": false,
1523
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1524
+ "description": "",
1525
+ "target_delimiter": " ",
1526
+ "fewshot_delimiter": "\n\n",
1527
+ "num_fewshot": 0,
1528
+ "metric_list": [
1529
+ {
1530
+ "metric": "acc",
1531
+ "aggregation": "mean",
1532
+ "higher_is_better": true
1533
+ }
1534
+ ],
1535
+ "output_type": "multiple_choice",
1536
+ "repeats": 1,
1537
+ "should_decontaminate": true,
1538
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1539
+ "metadata": {
1540
+ "version": 1.0
1541
+ }
1542
+ },
1543
+ "blimp_passive_1": {
1544
+ "task": "blimp_passive_1",
1545
+ "dataset_path": "blimp",
1546
+ "dataset_name": "passive_1",
1547
+ "validation_split": "train",
1548
+ "doc_to_text": "",
1549
+ "doc_to_target": 0,
1550
+ "unsafe_code": false,
1551
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1552
+ "description": "",
1553
+ "target_delimiter": " ",
1554
+ "fewshot_delimiter": "\n\n",
1555
+ "num_fewshot": 0,
1556
+ "metric_list": [
1557
+ {
1558
+ "metric": "acc",
1559
+ "aggregation": "mean",
1560
+ "higher_is_better": true
1561
+ }
1562
+ ],
1563
+ "output_type": "multiple_choice",
1564
+ "repeats": 1,
1565
+ "should_decontaminate": true,
1566
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1567
+ "metadata": {
1568
+ "version": 1.0
1569
+ }
1570
+ },
1571
+ "blimp_passive_2": {
1572
+ "task": "blimp_passive_2",
1573
+ "dataset_path": "blimp",
1574
+ "dataset_name": "passive_2",
1575
+ "validation_split": "train",
1576
+ "doc_to_text": "",
1577
+ "doc_to_target": 0,
1578
+ "unsafe_code": false,
1579
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1580
+ "description": "",
1581
+ "target_delimiter": " ",
1582
+ "fewshot_delimiter": "\n\n",
1583
+ "num_fewshot": 0,
1584
+ "metric_list": [
1585
+ {
1586
+ "metric": "acc",
1587
+ "aggregation": "mean",
1588
+ "higher_is_better": true
1589
+ }
1590
+ ],
1591
+ "output_type": "multiple_choice",
1592
+ "repeats": 1,
1593
+ "should_decontaminate": true,
1594
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1595
+ "metadata": {
1596
+ "version": 1.0
1597
+ }
1598
+ },
1599
+ "blimp_principle_A_c_command": {
1600
+ "task": "blimp_principle_A_c_command",
1601
+ "dataset_path": "blimp",
1602
+ "dataset_name": "principle_A_c_command",
1603
+ "validation_split": "train",
1604
+ "doc_to_text": "",
1605
+ "doc_to_target": 0,
1606
+ "unsafe_code": false,
1607
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1608
+ "description": "",
1609
+ "target_delimiter": " ",
1610
+ "fewshot_delimiter": "\n\n",
1611
+ "num_fewshot": 0,
1612
+ "metric_list": [
1613
+ {
1614
+ "metric": "acc",
1615
+ "aggregation": "mean",
1616
+ "higher_is_better": true
1617
+ }
1618
+ ],
1619
+ "output_type": "multiple_choice",
1620
+ "repeats": 1,
1621
+ "should_decontaminate": true,
1622
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1623
+ "metadata": {
1624
+ "version": 1.0
1625
+ }
1626
+ },
1627
+ "blimp_principle_A_case_1": {
1628
+ "task": "blimp_principle_A_case_1",
1629
+ "dataset_path": "blimp",
1630
+ "dataset_name": "principle_A_case_1",
1631
+ "validation_split": "train",
1632
+ "doc_to_text": "",
1633
+ "doc_to_target": 0,
1634
+ "unsafe_code": false,
1635
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1636
+ "description": "",
1637
+ "target_delimiter": " ",
1638
+ "fewshot_delimiter": "\n\n",
1639
+ "num_fewshot": 0,
1640
+ "metric_list": [
1641
+ {
1642
+ "metric": "acc",
1643
+ "aggregation": "mean",
1644
+ "higher_is_better": true
1645
+ }
1646
+ ],
1647
+ "output_type": "multiple_choice",
1648
+ "repeats": 1,
1649
+ "should_decontaminate": true,
1650
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1651
+ "metadata": {
1652
+ "version": 1.0
1653
+ }
1654
+ },
1655
+ "blimp_principle_A_case_2": {
1656
+ "task": "blimp_principle_A_case_2",
1657
+ "dataset_path": "blimp",
1658
+ "dataset_name": "principle_A_case_2",
1659
+ "validation_split": "train",
1660
+ "doc_to_text": "",
1661
+ "doc_to_target": 0,
1662
+ "unsafe_code": false,
1663
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1664
+ "description": "",
1665
+ "target_delimiter": " ",
1666
+ "fewshot_delimiter": "\n\n",
1667
+ "num_fewshot": 0,
1668
+ "metric_list": [
1669
+ {
1670
+ "metric": "acc",
1671
+ "aggregation": "mean",
1672
+ "higher_is_better": true
1673
+ }
1674
+ ],
1675
+ "output_type": "multiple_choice",
1676
+ "repeats": 1,
1677
+ "should_decontaminate": true,
1678
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1679
+ "metadata": {
1680
+ "version": 1.0
1681
+ }
1682
+ },
1683
+ "blimp_principle_A_domain_1": {
1684
+ "task": "blimp_principle_A_domain_1",
1685
+ "dataset_path": "blimp",
1686
+ "dataset_name": "principle_A_domain_1",
1687
+ "validation_split": "train",
1688
+ "doc_to_text": "",
1689
+ "doc_to_target": 0,
1690
+ "unsafe_code": false,
1691
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1692
+ "description": "",
1693
+ "target_delimiter": " ",
1694
+ "fewshot_delimiter": "\n\n",
1695
+ "num_fewshot": 0,
1696
+ "metric_list": [
1697
+ {
1698
+ "metric": "acc",
1699
+ "aggregation": "mean",
1700
+ "higher_is_better": true
1701
+ }
1702
+ ],
1703
+ "output_type": "multiple_choice",
1704
+ "repeats": 1,
1705
+ "should_decontaminate": true,
1706
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1707
+ "metadata": {
1708
+ "version": 1.0
1709
+ }
1710
+ },
1711
+ "blimp_principle_A_domain_2": {
1712
+ "task": "blimp_principle_A_domain_2",
1713
+ "dataset_path": "blimp",
1714
+ "dataset_name": "principle_A_domain_2",
1715
+ "validation_split": "train",
1716
+ "doc_to_text": "",
1717
+ "doc_to_target": 0,
1718
+ "unsafe_code": false,
1719
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1720
+ "description": "",
1721
+ "target_delimiter": " ",
1722
+ "fewshot_delimiter": "\n\n",
1723
+ "num_fewshot": 0,
1724
+ "metric_list": [
1725
+ {
1726
+ "metric": "acc",
1727
+ "aggregation": "mean",
1728
+ "higher_is_better": true
1729
+ }
1730
+ ],
1731
+ "output_type": "multiple_choice",
1732
+ "repeats": 1,
1733
+ "should_decontaminate": true,
1734
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1735
+ "metadata": {
1736
+ "version": 1.0
1737
+ }
1738
+ },
1739
+ "blimp_principle_A_domain_3": {
1740
+ "task": "blimp_principle_A_domain_3",
1741
+ "dataset_path": "blimp",
1742
+ "dataset_name": "principle_A_domain_3",
1743
+ "validation_split": "train",
1744
+ "doc_to_text": "",
1745
+ "doc_to_target": 0,
1746
+ "unsafe_code": false,
1747
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1748
+ "description": "",
1749
+ "target_delimiter": " ",
1750
+ "fewshot_delimiter": "\n\n",
1751
+ "num_fewshot": 0,
1752
+ "metric_list": [
1753
+ {
1754
+ "metric": "acc",
1755
+ "aggregation": "mean",
1756
+ "higher_is_better": true
1757
+ }
1758
+ ],
1759
+ "output_type": "multiple_choice",
1760
+ "repeats": 1,
1761
+ "should_decontaminate": true,
1762
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1763
+ "metadata": {
1764
+ "version": 1.0
1765
+ }
1766
+ },
1767
+ "blimp_principle_A_reconstruction": {
1768
+ "task": "blimp_principle_A_reconstruction",
1769
+ "dataset_path": "blimp",
1770
+ "dataset_name": "principle_A_reconstruction",
1771
+ "validation_split": "train",
1772
+ "doc_to_text": "",
1773
+ "doc_to_target": 0,
1774
+ "unsafe_code": false,
1775
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1776
+ "description": "",
1777
+ "target_delimiter": " ",
1778
+ "fewshot_delimiter": "\n\n",
1779
+ "num_fewshot": 0,
1780
+ "metric_list": [
1781
+ {
1782
+ "metric": "acc",
1783
+ "aggregation": "mean",
1784
+ "higher_is_better": true
1785
+ }
1786
+ ],
1787
+ "output_type": "multiple_choice",
1788
+ "repeats": 1,
1789
+ "should_decontaminate": true,
1790
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1791
+ "metadata": {
1792
+ "version": 1.0
1793
+ }
1794
+ },
1795
+ "blimp_regular_plural_subject_verb_agreement_1": {
1796
+ "task": "blimp_regular_plural_subject_verb_agreement_1",
1797
+ "dataset_path": "blimp",
1798
+ "dataset_name": "regular_plural_subject_verb_agreement_1",
1799
+ "validation_split": "train",
1800
+ "doc_to_text": "",
1801
+ "doc_to_target": 0,
1802
+ "unsafe_code": false,
1803
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1804
+ "description": "",
1805
+ "target_delimiter": " ",
1806
+ "fewshot_delimiter": "\n\n",
1807
+ "num_fewshot": 0,
1808
+ "metric_list": [
1809
+ {
1810
+ "metric": "acc",
1811
+ "aggregation": "mean",
1812
+ "higher_is_better": true
1813
+ }
1814
+ ],
1815
+ "output_type": "multiple_choice",
1816
+ "repeats": 1,
1817
+ "should_decontaminate": true,
1818
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1819
+ "metadata": {
1820
+ "version": 1.0
1821
+ }
1822
+ },
1823
+ "blimp_regular_plural_subject_verb_agreement_2": {
1824
+ "task": "blimp_regular_plural_subject_verb_agreement_2",
1825
+ "dataset_path": "blimp",
1826
+ "dataset_name": "regular_plural_subject_verb_agreement_2",
1827
+ "validation_split": "train",
1828
+ "doc_to_text": "",
1829
+ "doc_to_target": 0,
1830
+ "unsafe_code": false,
1831
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1832
+ "description": "",
1833
+ "target_delimiter": " ",
1834
+ "fewshot_delimiter": "\n\n",
1835
+ "num_fewshot": 0,
1836
+ "metric_list": [
1837
+ {
1838
+ "metric": "acc",
1839
+ "aggregation": "mean",
1840
+ "higher_is_better": true
1841
+ }
1842
+ ],
1843
+ "output_type": "multiple_choice",
1844
+ "repeats": 1,
1845
+ "should_decontaminate": true,
1846
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1847
+ "metadata": {
1848
+ "version": 1.0
1849
+ }
1850
+ },
1851
+ "blimp_sentential_negation_npi_licensor_present": {
1852
+ "task": "blimp_sentential_negation_npi_licensor_present",
1853
+ "dataset_path": "blimp",
1854
+ "dataset_name": "sentential_negation_npi_licensor_present",
1855
+ "validation_split": "train",
1856
+ "doc_to_text": "",
1857
+ "doc_to_target": 0,
1858
+ "unsafe_code": false,
1859
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1860
+ "description": "",
1861
+ "target_delimiter": " ",
1862
+ "fewshot_delimiter": "\n\n",
1863
+ "num_fewshot": 0,
1864
+ "metric_list": [
1865
+ {
1866
+ "metric": "acc",
1867
+ "aggregation": "mean",
1868
+ "higher_is_better": true
1869
+ }
1870
+ ],
1871
+ "output_type": "multiple_choice",
1872
+ "repeats": 1,
1873
+ "should_decontaminate": true,
1874
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1875
+ "metadata": {
1876
+ "version": 1.0
1877
+ }
1878
+ },
1879
+ "blimp_sentential_negation_npi_scope": {
1880
+ "task": "blimp_sentential_negation_npi_scope",
1881
+ "dataset_path": "blimp",
1882
+ "dataset_name": "sentential_negation_npi_scope",
1883
+ "validation_split": "train",
1884
+ "doc_to_text": "",
1885
+ "doc_to_target": 0,
1886
+ "unsafe_code": false,
1887
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1888
+ "description": "",
1889
+ "target_delimiter": " ",
1890
+ "fewshot_delimiter": "\n\n",
1891
+ "num_fewshot": 0,
1892
+ "metric_list": [
1893
+ {
1894
+ "metric": "acc",
1895
+ "aggregation": "mean",
1896
+ "higher_is_better": true
1897
+ }
1898
+ ],
1899
+ "output_type": "multiple_choice",
1900
+ "repeats": 1,
1901
+ "should_decontaminate": true,
1902
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1903
+ "metadata": {
1904
+ "version": 1.0
1905
+ }
1906
+ },
1907
+ "blimp_sentential_subject_island": {
1908
+ "task": "blimp_sentential_subject_island",
1909
+ "dataset_path": "blimp",
1910
+ "dataset_name": "sentential_subject_island",
1911
+ "validation_split": "train",
1912
+ "doc_to_text": "",
1913
+ "doc_to_target": 0,
1914
+ "unsafe_code": false,
1915
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1916
+ "description": "",
1917
+ "target_delimiter": " ",
1918
+ "fewshot_delimiter": "\n\n",
1919
+ "num_fewshot": 0,
1920
+ "metric_list": [
1921
+ {
1922
+ "metric": "acc",
1923
+ "aggregation": "mean",
1924
+ "higher_is_better": true
1925
+ }
1926
+ ],
1927
+ "output_type": "multiple_choice",
1928
+ "repeats": 1,
1929
+ "should_decontaminate": true,
1930
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1931
+ "metadata": {
1932
+ "version": 1.0
1933
+ }
1934
+ },
1935
+ "blimp_superlative_quantifiers_1": {
1936
+ "task": "blimp_superlative_quantifiers_1",
1937
+ "dataset_path": "blimp",
1938
+ "dataset_name": "superlative_quantifiers_1",
1939
+ "validation_split": "train",
1940
+ "doc_to_text": "",
1941
+ "doc_to_target": 0,
1942
+ "unsafe_code": false,
1943
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1944
+ "description": "",
1945
+ "target_delimiter": " ",
1946
+ "fewshot_delimiter": "\n\n",
1947
+ "num_fewshot": 0,
1948
+ "metric_list": [
1949
+ {
1950
+ "metric": "acc",
1951
+ "aggregation": "mean",
1952
+ "higher_is_better": true
1953
+ }
1954
+ ],
1955
+ "output_type": "multiple_choice",
1956
+ "repeats": 1,
1957
+ "should_decontaminate": true,
1958
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1959
+ "metadata": {
1960
+ "version": 1.0
1961
+ }
1962
+ },
1963
+ "blimp_superlative_quantifiers_2": {
1964
+ "task": "blimp_superlative_quantifiers_2",
1965
+ "dataset_path": "blimp",
1966
+ "dataset_name": "superlative_quantifiers_2",
1967
+ "validation_split": "train",
1968
+ "doc_to_text": "",
1969
+ "doc_to_target": 0,
1970
+ "unsafe_code": false,
1971
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1972
+ "description": "",
1973
+ "target_delimiter": " ",
1974
+ "fewshot_delimiter": "\n\n",
1975
+ "num_fewshot": 0,
1976
+ "metric_list": [
1977
+ {
1978
+ "metric": "acc",
1979
+ "aggregation": "mean",
1980
+ "higher_is_better": true
1981
+ }
1982
+ ],
1983
+ "output_type": "multiple_choice",
1984
+ "repeats": 1,
1985
+ "should_decontaminate": true,
1986
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1987
+ "metadata": {
1988
+ "version": 1.0
1989
+ }
1990
+ },
1991
+ "blimp_tough_vs_raising_1": {
1992
+ "task": "blimp_tough_vs_raising_1",
1993
+ "dataset_path": "blimp",
1994
+ "dataset_name": "tough_vs_raising_1",
1995
+ "validation_split": "train",
1996
+ "doc_to_text": "",
1997
+ "doc_to_target": 0,
1998
+ "unsafe_code": false,
1999
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2000
+ "description": "",
2001
+ "target_delimiter": " ",
2002
+ "fewshot_delimiter": "\n\n",
2003
+ "num_fewshot": 0,
2004
+ "metric_list": [
2005
+ {
2006
+ "metric": "acc",
2007
+ "aggregation": "mean",
2008
+ "higher_is_better": true
2009
+ }
2010
+ ],
2011
+ "output_type": "multiple_choice",
2012
+ "repeats": 1,
2013
+ "should_decontaminate": true,
2014
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2015
+ "metadata": {
2016
+ "version": 1.0
2017
+ }
2018
+ },
2019
+ "blimp_tough_vs_raising_2": {
2020
+ "task": "blimp_tough_vs_raising_2",
2021
+ "dataset_path": "blimp",
2022
+ "dataset_name": "tough_vs_raising_2",
2023
+ "validation_split": "train",
2024
+ "doc_to_text": "",
2025
+ "doc_to_target": 0,
2026
+ "unsafe_code": false,
2027
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2028
+ "description": "",
2029
+ "target_delimiter": " ",
2030
+ "fewshot_delimiter": "\n\n",
2031
+ "num_fewshot": 0,
2032
+ "metric_list": [
2033
+ {
2034
+ "metric": "acc",
2035
+ "aggregation": "mean",
2036
+ "higher_is_better": true
2037
+ }
2038
+ ],
2039
+ "output_type": "multiple_choice",
2040
+ "repeats": 1,
2041
+ "should_decontaminate": true,
2042
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2043
+ "metadata": {
2044
+ "version": 1.0
2045
+ }
2046
+ },
2047
+ "blimp_transitive": {
2048
+ "task": "blimp_transitive",
2049
+ "dataset_path": "blimp",
2050
+ "dataset_name": "transitive",
2051
+ "validation_split": "train",
2052
+ "doc_to_text": "",
2053
+ "doc_to_target": 0,
2054
+ "unsafe_code": false,
2055
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2056
+ "description": "",
2057
+ "target_delimiter": " ",
2058
+ "fewshot_delimiter": "\n\n",
2059
+ "num_fewshot": 0,
2060
+ "metric_list": [
2061
+ {
2062
+ "metric": "acc",
2063
+ "aggregation": "mean",
2064
+ "higher_is_better": true
2065
+ }
2066
+ ],
2067
+ "output_type": "multiple_choice",
2068
+ "repeats": 1,
2069
+ "should_decontaminate": true,
2070
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2071
+ "metadata": {
2072
+ "version": 1.0
2073
+ }
2074
+ },
2075
+ "blimp_wh_island": {
2076
+ "task": "blimp_wh_island",
2077
+ "dataset_path": "blimp",
2078
+ "dataset_name": "wh_island",
2079
+ "validation_split": "train",
2080
+ "doc_to_text": "",
2081
+ "doc_to_target": 0,
2082
+ "unsafe_code": false,
2083
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2084
+ "description": "",
2085
+ "target_delimiter": " ",
2086
+ "fewshot_delimiter": "\n\n",
2087
+ "num_fewshot": 0,
2088
+ "metric_list": [
2089
+ {
2090
+ "metric": "acc",
2091
+ "aggregation": "mean",
2092
+ "higher_is_better": true
2093
+ }
2094
+ ],
2095
+ "output_type": "multiple_choice",
2096
+ "repeats": 1,
2097
+ "should_decontaminate": true,
2098
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2099
+ "metadata": {
2100
+ "version": 1.0
2101
+ }
2102
+ },
2103
+ "blimp_wh_questions_object_gap": {
2104
+ "task": "blimp_wh_questions_object_gap",
2105
+ "dataset_path": "blimp",
2106
+ "dataset_name": "wh_questions_object_gap",
2107
+ "validation_split": "train",
2108
+ "doc_to_text": "",
2109
+ "doc_to_target": 0,
2110
+ "unsafe_code": false,
2111
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2112
+ "description": "",
2113
+ "target_delimiter": " ",
2114
+ "fewshot_delimiter": "\n\n",
2115
+ "num_fewshot": 0,
2116
+ "metric_list": [
2117
+ {
2118
+ "metric": "acc",
2119
+ "aggregation": "mean",
2120
+ "higher_is_better": true
2121
+ }
2122
+ ],
2123
+ "output_type": "multiple_choice",
2124
+ "repeats": 1,
2125
+ "should_decontaminate": true,
2126
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2127
+ "metadata": {
2128
+ "version": 1.0
2129
+ }
2130
+ },
2131
+ "blimp_wh_questions_subject_gap": {
2132
+ "task": "blimp_wh_questions_subject_gap",
2133
+ "dataset_path": "blimp",
2134
+ "dataset_name": "wh_questions_subject_gap",
2135
+ "validation_split": "train",
2136
+ "doc_to_text": "",
2137
+ "doc_to_target": 0,
2138
+ "unsafe_code": false,
2139
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2140
+ "description": "",
2141
+ "target_delimiter": " ",
2142
+ "fewshot_delimiter": "\n\n",
2143
+ "num_fewshot": 0,
2144
+ "metric_list": [
2145
+ {
2146
+ "metric": "acc",
2147
+ "aggregation": "mean",
2148
+ "higher_is_better": true
2149
+ }
2150
+ ],
2151
+ "output_type": "multiple_choice",
2152
+ "repeats": 1,
2153
+ "should_decontaminate": true,
2154
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2155
+ "metadata": {
2156
+ "version": 1.0
2157
+ }
2158
+ },
2159
+ "blimp_wh_questions_subject_gap_long_distance": {
2160
+ "task": "blimp_wh_questions_subject_gap_long_distance",
2161
+ "dataset_path": "blimp",
2162
+ "dataset_name": "wh_questions_subject_gap_long_distance",
2163
+ "validation_split": "train",
2164
+ "doc_to_text": "",
2165
+ "doc_to_target": 0,
2166
+ "unsafe_code": false,
2167
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2168
+ "description": "",
2169
+ "target_delimiter": " ",
2170
+ "fewshot_delimiter": "\n\n",
2171
+ "num_fewshot": 0,
2172
+ "metric_list": [
2173
+ {
2174
+ "metric": "acc",
2175
+ "aggregation": "mean",
2176
+ "higher_is_better": true
2177
+ }
2178
+ ],
2179
+ "output_type": "multiple_choice",
2180
+ "repeats": 1,
2181
+ "should_decontaminate": true,
2182
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2183
+ "metadata": {
2184
+ "version": 1.0
2185
+ }
2186
+ },
2187
+ "blimp_wh_vs_that_no_gap": {
2188
+ "task": "blimp_wh_vs_that_no_gap",
2189
+ "dataset_path": "blimp",
2190
+ "dataset_name": "wh_vs_that_no_gap",
2191
+ "validation_split": "train",
2192
+ "doc_to_text": "",
2193
+ "doc_to_target": 0,
2194
+ "unsafe_code": false,
2195
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2196
+ "description": "",
2197
+ "target_delimiter": " ",
2198
+ "fewshot_delimiter": "\n\n",
2199
+ "num_fewshot": 0,
2200
+ "metric_list": [
2201
+ {
2202
+ "metric": "acc",
2203
+ "aggregation": "mean",
2204
+ "higher_is_better": true
2205
+ }
2206
+ ],
2207
+ "output_type": "multiple_choice",
2208
+ "repeats": 1,
2209
+ "should_decontaminate": true,
2210
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2211
+ "metadata": {
2212
+ "version": 1.0
2213
+ }
2214
+ },
2215
+ "blimp_wh_vs_that_no_gap_long_distance": {
2216
+ "task": "blimp_wh_vs_that_no_gap_long_distance",
2217
+ "dataset_path": "blimp",
2218
+ "dataset_name": "wh_vs_that_no_gap_long_distance",
2219
+ "validation_split": "train",
2220
+ "doc_to_text": "",
2221
+ "doc_to_target": 0,
2222
+ "unsafe_code": false,
2223
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2224
+ "description": "",
2225
+ "target_delimiter": " ",
2226
+ "fewshot_delimiter": "\n\n",
2227
+ "num_fewshot": 0,
2228
+ "metric_list": [
2229
+ {
2230
+ "metric": "acc",
2231
+ "aggregation": "mean",
2232
+ "higher_is_better": true
2233
+ }
2234
+ ],
2235
+ "output_type": "multiple_choice",
2236
+ "repeats": 1,
2237
+ "should_decontaminate": true,
2238
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2239
+ "metadata": {
2240
+ "version": 1.0
2241
+ }
2242
+ },
2243
+ "blimp_wh_vs_that_with_gap": {
2244
+ "task": "blimp_wh_vs_that_with_gap",
2245
+ "dataset_path": "blimp",
2246
+ "dataset_name": "wh_vs_that_with_gap",
2247
+ "validation_split": "train",
2248
+ "doc_to_text": "",
2249
+ "doc_to_target": 0,
2250
+ "unsafe_code": false,
2251
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2252
+ "description": "",
2253
+ "target_delimiter": " ",
2254
+ "fewshot_delimiter": "\n\n",
2255
+ "num_fewshot": 0,
2256
+ "metric_list": [
2257
+ {
2258
+ "metric": "acc",
2259
+ "aggregation": "mean",
2260
+ "higher_is_better": true
2261
+ }
2262
+ ],
2263
+ "output_type": "multiple_choice",
2264
+ "repeats": 1,
2265
+ "should_decontaminate": true,
2266
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2267
+ "metadata": {
2268
+ "version": 1.0
2269
+ }
2270
+ },
2271
+ "blimp_wh_vs_that_with_gap_long_distance": {
2272
+ "task": "blimp_wh_vs_that_with_gap_long_distance",
2273
+ "dataset_path": "blimp",
2274
+ "dataset_name": "wh_vs_that_with_gap_long_distance",
2275
+ "validation_split": "train",
2276
+ "doc_to_text": "",
2277
+ "doc_to_target": 0,
2278
+ "unsafe_code": false,
2279
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2280
+ "description": "",
2281
+ "target_delimiter": " ",
2282
+ "fewshot_delimiter": "\n\n",
2283
+ "num_fewshot": 0,
2284
+ "metric_list": [
2285
+ {
2286
+ "metric": "acc",
2287
+ "aggregation": "mean",
2288
+ "higher_is_better": true
2289
+ }
2290
+ ],
2291
+ "output_type": "multiple_choice",
2292
+ "repeats": 1,
2293
+ "should_decontaminate": true,
2294
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2295
+ "metadata": {
2296
+ "version": 1.0
2297
+ }
2298
+ }
2299
+ },
2300
+ "versions": {
2301
+ "blimp": 2.0,
2302
+ "blimp_adjunct_island": 1.0,
2303
+ "blimp_anaphor_gender_agreement": 1.0,
2304
+ "blimp_anaphor_number_agreement": 1.0,
2305
+ "blimp_animate_subject_passive": 1.0,
2306
+ "blimp_animate_subject_trans": 1.0,
2307
+ "blimp_causative": 1.0,
2308
+ "blimp_complex_NP_island": 1.0,
2309
+ "blimp_coordinate_structure_constraint_complex_left_branch": 1.0,
2310
+ "blimp_coordinate_structure_constraint_object_extraction": 1.0,
2311
+ "blimp_determiner_noun_agreement_1": 1.0,
2312
+ "blimp_determiner_noun_agreement_2": 1.0,
2313
+ "blimp_determiner_noun_agreement_irregular_1": 1.0,
2314
+ "blimp_determiner_noun_agreement_irregular_2": 1.0,
2315
+ "blimp_determiner_noun_agreement_with_adj_2": 1.0,
2316
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": 1.0,
2317
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": 1.0,
2318
+ "blimp_determiner_noun_agreement_with_adjective_1": 1.0,
2319
+ "blimp_distractor_agreement_relational_noun": 1.0,
2320
+ "blimp_distractor_agreement_relative_clause": 1.0,
2321
+ "blimp_drop_argument": 1.0,
2322
+ "blimp_ellipsis_n_bar_1": 1.0,
2323
+ "blimp_ellipsis_n_bar_2": 1.0,
2324
+ "blimp_existential_there_object_raising": 1.0,
2325
+ "blimp_existential_there_quantifiers_1": 1.0,
2326
+ "blimp_existential_there_quantifiers_2": 1.0,
2327
+ "blimp_existential_there_subject_raising": 1.0,
2328
+ "blimp_expletive_it_object_raising": 1.0,
2329
+ "blimp_inchoative": 1.0,
2330
+ "blimp_intransitive": 1.0,
2331
+ "blimp_irregular_past_participle_adjectives": 1.0,
2332
+ "blimp_irregular_past_participle_verbs": 1.0,
2333
+ "blimp_irregular_plural_subject_verb_agreement_1": 1.0,
2334
+ "blimp_irregular_plural_subject_verb_agreement_2": 1.0,
2335
+ "blimp_left_branch_island_echo_question": 1.0,
2336
+ "blimp_left_branch_island_simple_question": 1.0,
2337
+ "blimp_matrix_question_npi_licensor_present": 1.0,
2338
+ "blimp_npi_present_1": 1.0,
2339
+ "blimp_npi_present_2": 1.0,
2340
+ "blimp_only_npi_licensor_present": 1.0,
2341
+ "blimp_only_npi_scope": 1.0,
2342
+ "blimp_passive_1": 1.0,
2343
+ "blimp_passive_2": 1.0,
2344
+ "blimp_principle_A_c_command": 1.0,
2345
+ "blimp_principle_A_case_1": 1.0,
2346
+ "blimp_principle_A_case_2": 1.0,
2347
+ "blimp_principle_A_domain_1": 1.0,
2348
+ "blimp_principle_A_domain_2": 1.0,
2349
+ "blimp_principle_A_domain_3": 1.0,
2350
+ "blimp_principle_A_reconstruction": 1.0,
2351
+ "blimp_regular_plural_subject_verb_agreement_1": 1.0,
2352
+ "blimp_regular_plural_subject_verb_agreement_2": 1.0,
2353
+ "blimp_sentential_negation_npi_licensor_present": 1.0,
2354
+ "blimp_sentential_negation_npi_scope": 1.0,
2355
+ "blimp_sentential_subject_island": 1.0,
2356
+ "blimp_superlative_quantifiers_1": 1.0,
2357
+ "blimp_superlative_quantifiers_2": 1.0,
2358
+ "blimp_tough_vs_raising_1": 1.0,
2359
+ "blimp_tough_vs_raising_2": 1.0,
2360
+ "blimp_transitive": 1.0,
2361
+ "blimp_wh_island": 1.0,
2362
+ "blimp_wh_questions_object_gap": 1.0,
2363
+ "blimp_wh_questions_subject_gap": 1.0,
2364
+ "blimp_wh_questions_subject_gap_long_distance": 1.0,
2365
+ "blimp_wh_vs_that_no_gap": 1.0,
2366
+ "blimp_wh_vs_that_no_gap_long_distance": 1.0,
2367
+ "blimp_wh_vs_that_with_gap": 1.0,
2368
+ "blimp_wh_vs_that_with_gap_long_distance": 1.0
2369
+ },
2370
+ "n-shot": {
2371
+ "blimp_adjunct_island": 0,
2372
+ "blimp_anaphor_gender_agreement": 0,
2373
+ "blimp_anaphor_number_agreement": 0,
2374
+ "blimp_animate_subject_passive": 0,
2375
+ "blimp_animate_subject_trans": 0,
2376
+ "blimp_causative": 0,
2377
+ "blimp_complex_NP_island": 0,
2378
+ "blimp_coordinate_structure_constraint_complex_left_branch": 0,
2379
+ "blimp_coordinate_structure_constraint_object_extraction": 0,
2380
+ "blimp_determiner_noun_agreement_1": 0,
2381
+ "blimp_determiner_noun_agreement_2": 0,
2382
+ "blimp_determiner_noun_agreement_irregular_1": 0,
2383
+ "blimp_determiner_noun_agreement_irregular_2": 0,
2384
+ "blimp_determiner_noun_agreement_with_adj_2": 0,
2385
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": 0,
2386
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": 0,
2387
+ "blimp_determiner_noun_agreement_with_adjective_1": 0,
2388
+ "blimp_distractor_agreement_relational_noun": 0,
2389
+ "blimp_distractor_agreement_relative_clause": 0,
2390
+ "blimp_drop_argument": 0,
2391
+ "blimp_ellipsis_n_bar_1": 0,
2392
+ "blimp_ellipsis_n_bar_2": 0,
2393
+ "blimp_existential_there_object_raising": 0,
2394
+ "blimp_existential_there_quantifiers_1": 0,
2395
+ "blimp_existential_there_quantifiers_2": 0,
2396
+ "blimp_existential_there_subject_raising": 0,
2397
+ "blimp_expletive_it_object_raising": 0,
2398
+ "blimp_inchoative": 0,
2399
+ "blimp_intransitive": 0,
2400
+ "blimp_irregular_past_participle_adjectives": 0,
2401
+ "blimp_irregular_past_participle_verbs": 0,
2402
+ "blimp_irregular_plural_subject_verb_agreement_1": 0,
2403
+ "blimp_irregular_plural_subject_verb_agreement_2": 0,
2404
+ "blimp_left_branch_island_echo_question": 0,
2405
+ "blimp_left_branch_island_simple_question": 0,
2406
+ "blimp_matrix_question_npi_licensor_present": 0,
2407
+ "blimp_npi_present_1": 0,
2408
+ "blimp_npi_present_2": 0,
2409
+ "blimp_only_npi_licensor_present": 0,
2410
+ "blimp_only_npi_scope": 0,
2411
+ "blimp_passive_1": 0,
2412
+ "blimp_passive_2": 0,
2413
+ "blimp_principle_A_c_command": 0,
2414
+ "blimp_principle_A_case_1": 0,
2415
+ "blimp_principle_A_case_2": 0,
2416
+ "blimp_principle_A_domain_1": 0,
2417
+ "blimp_principle_A_domain_2": 0,
2418
+ "blimp_principle_A_domain_3": 0,
2419
+ "blimp_principle_A_reconstruction": 0,
2420
+ "blimp_regular_plural_subject_verb_agreement_1": 0,
2421
+ "blimp_regular_plural_subject_verb_agreement_2": 0,
2422
+ "blimp_sentential_negation_npi_licensor_present": 0,
2423
+ "blimp_sentential_negation_npi_scope": 0,
2424
+ "blimp_sentential_subject_island": 0,
2425
+ "blimp_superlative_quantifiers_1": 0,
2426
+ "blimp_superlative_quantifiers_2": 0,
2427
+ "blimp_tough_vs_raising_1": 0,
2428
+ "blimp_tough_vs_raising_2": 0,
2429
+ "blimp_transitive": 0,
2430
+ "blimp_wh_island": 0,
2431
+ "blimp_wh_questions_object_gap": 0,
2432
+ "blimp_wh_questions_subject_gap": 0,
2433
+ "blimp_wh_questions_subject_gap_long_distance": 0,
2434
+ "blimp_wh_vs_that_no_gap": 0,
2435
+ "blimp_wh_vs_that_no_gap_long_distance": 0,
2436
+ "blimp_wh_vs_that_with_gap": 0,
2437
+ "blimp_wh_vs_that_with_gap_long_distance": 0
2438
+ },
2439
+ "higher_is_better": {
2440
+ "blimp": {
2441
+ "acc": true
2442
+ },
2443
+ "blimp_adjunct_island": {
2444
+ "acc": true
2445
+ },
2446
+ "blimp_anaphor_gender_agreement": {
2447
+ "acc": true
2448
+ },
2449
+ "blimp_anaphor_number_agreement": {
2450
+ "acc": true
2451
+ },
2452
+ "blimp_animate_subject_passive": {
2453
+ "acc": true
2454
+ },
2455
+ "blimp_animate_subject_trans": {
2456
+ "acc": true
2457
+ },
2458
+ "blimp_causative": {
2459
+ "acc": true
2460
+ },
2461
+ "blimp_complex_NP_island": {
2462
+ "acc": true
2463
+ },
2464
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
2465
+ "acc": true
2466
+ },
2467
+ "blimp_coordinate_structure_constraint_object_extraction": {
2468
+ "acc": true
2469
+ },
2470
+ "blimp_determiner_noun_agreement_1": {
2471
+ "acc": true
2472
+ },
2473
+ "blimp_determiner_noun_agreement_2": {
2474
+ "acc": true
2475
+ },
2476
+ "blimp_determiner_noun_agreement_irregular_1": {
2477
+ "acc": true
2478
+ },
2479
+ "blimp_determiner_noun_agreement_irregular_2": {
2480
+ "acc": true
2481
+ },
2482
+ "blimp_determiner_noun_agreement_with_adj_2": {
2483
+ "acc": true
2484
+ },
2485
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
2486
+ "acc": true
2487
+ },
2488
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
2489
+ "acc": true
2490
+ },
2491
+ "blimp_determiner_noun_agreement_with_adjective_1": {
2492
+ "acc": true
2493
+ },
2494
+ "blimp_distractor_agreement_relational_noun": {
2495
+ "acc": true
2496
+ },
2497
+ "blimp_distractor_agreement_relative_clause": {
2498
+ "acc": true
2499
+ },
2500
+ "blimp_drop_argument": {
2501
+ "acc": true
2502
+ },
2503
+ "blimp_ellipsis_n_bar_1": {
2504
+ "acc": true
2505
+ },
2506
+ "blimp_ellipsis_n_bar_2": {
2507
+ "acc": true
2508
+ },
2509
+ "blimp_existential_there_object_raising": {
2510
+ "acc": true
2511
+ },
2512
+ "blimp_existential_there_quantifiers_1": {
2513
+ "acc": true
2514
+ },
2515
+ "blimp_existential_there_quantifiers_2": {
2516
+ "acc": true
2517
+ },
2518
+ "blimp_existential_there_subject_raising": {
2519
+ "acc": true
2520
+ },
2521
+ "blimp_expletive_it_object_raising": {
2522
+ "acc": true
2523
+ },
2524
+ "blimp_inchoative": {
2525
+ "acc": true
2526
+ },
2527
+ "blimp_intransitive": {
2528
+ "acc": true
2529
+ },
2530
+ "blimp_irregular_past_participle_adjectives": {
2531
+ "acc": true
2532
+ },
2533
+ "blimp_irregular_past_participle_verbs": {
2534
+ "acc": true
2535
+ },
2536
+ "blimp_irregular_plural_subject_verb_agreement_1": {
2537
+ "acc": true
2538
+ },
2539
+ "blimp_irregular_plural_subject_verb_agreement_2": {
2540
+ "acc": true
2541
+ },
2542
+ "blimp_left_branch_island_echo_question": {
2543
+ "acc": true
2544
+ },
2545
+ "blimp_left_branch_island_simple_question": {
2546
+ "acc": true
2547
+ },
2548
+ "blimp_matrix_question_npi_licensor_present": {
2549
+ "acc": true
2550
+ },
2551
+ "blimp_npi_present_1": {
2552
+ "acc": true
2553
+ },
2554
+ "blimp_npi_present_2": {
2555
+ "acc": true
2556
+ },
2557
+ "blimp_only_npi_licensor_present": {
2558
+ "acc": true
2559
+ },
2560
+ "blimp_only_npi_scope": {
2561
+ "acc": true
2562
+ },
2563
+ "blimp_passive_1": {
2564
+ "acc": true
2565
+ },
2566
+ "blimp_passive_2": {
2567
+ "acc": true
2568
+ },
2569
+ "blimp_principle_A_c_command": {
2570
+ "acc": true
2571
+ },
2572
+ "blimp_principle_A_case_1": {
2573
+ "acc": true
2574
+ },
2575
+ "blimp_principle_A_case_2": {
2576
+ "acc": true
2577
+ },
2578
+ "blimp_principle_A_domain_1": {
2579
+ "acc": true
2580
+ },
2581
+ "blimp_principle_A_domain_2": {
2582
+ "acc": true
2583
+ },
2584
+ "blimp_principle_A_domain_3": {
2585
+ "acc": true
2586
+ },
2587
+ "blimp_principle_A_reconstruction": {
2588
+ "acc": true
2589
+ },
2590
+ "blimp_regular_plural_subject_verb_agreement_1": {
2591
+ "acc": true
2592
+ },
2593
+ "blimp_regular_plural_subject_verb_agreement_2": {
2594
+ "acc": true
2595
+ },
2596
+ "blimp_sentential_negation_npi_licensor_present": {
2597
+ "acc": true
2598
+ },
2599
+ "blimp_sentential_negation_npi_scope": {
2600
+ "acc": true
2601
+ },
2602
+ "blimp_sentential_subject_island": {
2603
+ "acc": true
2604
+ },
2605
+ "blimp_superlative_quantifiers_1": {
2606
+ "acc": true
2607
+ },
2608
+ "blimp_superlative_quantifiers_2": {
2609
+ "acc": true
2610
+ },
2611
+ "blimp_tough_vs_raising_1": {
2612
+ "acc": true
2613
+ },
2614
+ "blimp_tough_vs_raising_2": {
2615
+ "acc": true
2616
+ },
2617
+ "blimp_transitive": {
2618
+ "acc": true
2619
+ },
2620
+ "blimp_wh_island": {
2621
+ "acc": true
2622
+ },
2623
+ "blimp_wh_questions_object_gap": {
2624
+ "acc": true
2625
+ },
2626
+ "blimp_wh_questions_subject_gap": {
2627
+ "acc": true
2628
+ },
2629
+ "blimp_wh_questions_subject_gap_long_distance": {
2630
+ "acc": true
2631
+ },
2632
+ "blimp_wh_vs_that_no_gap": {
2633
+ "acc": true
2634
+ },
2635
+ "blimp_wh_vs_that_no_gap_long_distance": {
2636
+ "acc": true
2637
+ },
2638
+ "blimp_wh_vs_that_with_gap": {
2639
+ "acc": true
2640
+ },
2641
+ "blimp_wh_vs_that_with_gap_long_distance": {
2642
+ "acc": true
2643
+ }
2644
+ },
2645
+ "n-samples": {
2646
+ "blimp_adjunct_island": {
2647
+ "original": 1000,
2648
+ "effective": 1000
2649
+ },
2650
+ "blimp_anaphor_gender_agreement": {
2651
+ "original": 1000,
2652
+ "effective": 1000
2653
+ },
2654
+ "blimp_anaphor_number_agreement": {
2655
+ "original": 1000,
2656
+ "effective": 1000
2657
+ },
2658
+ "blimp_animate_subject_passive": {
2659
+ "original": 1000,
2660
+ "effective": 1000
2661
+ },
2662
+ "blimp_animate_subject_trans": {
2663
+ "original": 1000,
2664
+ "effective": 1000
2665
+ },
2666
+ "blimp_causative": {
2667
+ "original": 1000,
2668
+ "effective": 1000
2669
+ },
2670
+ "blimp_complex_NP_island": {
2671
+ "original": 1000,
2672
+ "effective": 1000
2673
+ },
2674
+ "blimp_coordinate_structure_constraint_complex_left_branch": {
2675
+ "original": 1000,
2676
+ "effective": 1000
2677
+ },
2678
+ "blimp_coordinate_structure_constraint_object_extraction": {
2679
+ "original": 1000,
2680
+ "effective": 1000
2681
+ },
2682
+ "blimp_determiner_noun_agreement_1": {
2683
+ "original": 1000,
2684
+ "effective": 1000
2685
+ },
2686
+ "blimp_determiner_noun_agreement_2": {
2687
+ "original": 1000,
2688
+ "effective": 1000
2689
+ },
2690
+ "blimp_determiner_noun_agreement_irregular_1": {
2691
+ "original": 1000,
2692
+ "effective": 1000
2693
+ },
2694
+ "blimp_determiner_noun_agreement_irregular_2": {
2695
+ "original": 1000,
2696
+ "effective": 1000
2697
+ },
2698
+ "blimp_determiner_noun_agreement_with_adj_2": {
2699
+ "original": 1000,
2700
+ "effective": 1000
2701
+ },
2702
+ "blimp_determiner_noun_agreement_with_adj_irregular_1": {
2703
+ "original": 1000,
2704
+ "effective": 1000
2705
+ },
2706
+ "blimp_determiner_noun_agreement_with_adj_irregular_2": {
2707
+ "original": 1000,
2708
+ "effective": 1000
2709
+ },
2710
+ "blimp_determiner_noun_agreement_with_adjective_1": {
2711
+ "original": 1000,
2712
+ "effective": 1000
2713
+ },
2714
+ "blimp_distractor_agreement_relational_noun": {
2715
+ "original": 1000,
2716
+ "effective": 1000
2717
+ },
2718
+ "blimp_distractor_agreement_relative_clause": {
2719
+ "original": 1000,
2720
+ "effective": 1000
2721
+ },
2722
+ "blimp_drop_argument": {
2723
+ "original": 1000,
2724
+ "effective": 1000
2725
+ },
2726
+ "blimp_ellipsis_n_bar_1": {
2727
+ "original": 1000,
2728
+ "effective": 1000
2729
+ },
2730
+ "blimp_ellipsis_n_bar_2": {
2731
+ "original": 1000,
2732
+ "effective": 1000
2733
+ },
2734
+ "blimp_existential_there_object_raising": {
2735
+ "original": 1000,
2736
+ "effective": 1000
2737
+ },
2738
+ "blimp_existential_there_quantifiers_1": {
2739
+ "original": 1000,
2740
+ "effective": 1000
2741
+ },
2742
+ "blimp_existential_there_quantifiers_2": {
2743
+ "original": 1000,
2744
+ "effective": 1000
2745
+ },
2746
+ "blimp_existential_there_subject_raising": {
2747
+ "original": 1000,
2748
+ "effective": 1000
2749
+ },
2750
+ "blimp_expletive_it_object_raising": {
2751
+ "original": 1000,
2752
+ "effective": 1000
2753
+ },
2754
+ "blimp_inchoative": {
2755
+ "original": 1000,
2756
+ "effective": 1000
2757
+ },
2758
+ "blimp_intransitive": {
2759
+ "original": 1000,
2760
+ "effective": 1000
2761
+ },
2762
+ "blimp_irregular_past_participle_adjectives": {
2763
+ "original": 1000,
2764
+ "effective": 1000
2765
+ },
2766
+ "blimp_irregular_past_participle_verbs": {
2767
+ "original": 1000,
2768
+ "effective": 1000
2769
+ },
2770
+ "blimp_irregular_plural_subject_verb_agreement_1": {
2771
+ "original": 1000,
2772
+ "effective": 1000
2773
+ },
2774
+ "blimp_irregular_plural_subject_verb_agreement_2": {
2775
+ "original": 1000,
2776
+ "effective": 1000
2777
+ },
2778
+ "blimp_left_branch_island_echo_question": {
2779
+ "original": 1000,
2780
+ "effective": 1000
2781
+ },
2782
+ "blimp_left_branch_island_simple_question": {
2783
+ "original": 1000,
2784
+ "effective": 1000
2785
+ },
2786
+ "blimp_matrix_question_npi_licensor_present": {
2787
+ "original": 1000,
2788
+ "effective": 1000
2789
+ },
2790
+ "blimp_npi_present_1": {
2791
+ "original": 1000,
2792
+ "effective": 1000
2793
+ },
2794
+ "blimp_npi_present_2": {
2795
+ "original": 1000,
2796
+ "effective": 1000
2797
+ },
2798
+ "blimp_only_npi_licensor_present": {
2799
+ "original": 1000,
2800
+ "effective": 1000
2801
+ },
2802
+ "blimp_only_npi_scope": {
2803
+ "original": 1000,
2804
+ "effective": 1000
2805
+ },
2806
+ "blimp_passive_1": {
2807
+ "original": 1000,
2808
+ "effective": 1000
2809
+ },
2810
+ "blimp_passive_2": {
2811
+ "original": 1000,
2812
+ "effective": 1000
2813
+ },
2814
+ "blimp_principle_A_c_command": {
2815
+ "original": 1000,
2816
+ "effective": 1000
2817
+ },
2818
+ "blimp_principle_A_case_1": {
2819
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_principle_A_case_2": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_principle_A_domain_1": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_principle_A_domain_2": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_principle_A_domain_3": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_principle_A_reconstruction": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_regular_plural_subject_verb_agreement_1": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_regular_plural_subject_verb_agreement_2": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_sentential_negation_npi_licensor_present": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_sentential_negation_npi_scope": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_sentential_subject_island": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_superlative_quantifiers_1": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_superlative_quantifiers_2": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_tough_vs_raising_1": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_tough_vs_raising_2": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_transitive": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_island": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_questions_object_gap": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_questions_subject_gap": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_questions_subject_gap_long_distance": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_vs_that_no_gap": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_vs_that_no_gap_long_distance": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_vs_that_with_gap": {
+       "original": 1000,
+       "effective": 1000
+     },
+     "blimp_wh_vs_that_with_gap_long_distance": {
+       "original": 1000,
+       "effective": 1000
+     }
+   },
+   "config": {
+     "model": "hf",
+     "model_args": "pretrained=outputs/fw57M-tied/42/ByteSpanSurprisalGlobalIncrement_64000/.cache/eval_model",
+     "model_num_parameters": 105785088,
+     "model_dtype": "torch.bfloat16",
+     "model_revision": "main",
+     "model_sha": "",
+     "batch_size": 1,
+     "batch_sizes": [],
+     "device": null,
+     "use_cache": null,
+     "limit": null,
+     "bootstrap_iters": 100000,
+     "gen_kwargs": null,
+     "random_seed": 0,
+     "numpy_seed": 1234,
+     "torch_seed": 1234,
+     "fewshot_seed": 1234
+   },
+   "git_hash": "0296da0",
+   "date": 1751203761.2802591,
+   "pretty_env_info": "'NoneType' object has no attribute 'splitlines'",
+   "transformers_version": "4.52.4",
+   "upper_git_hash": null,
+   "tokenizer_pad_token": [
+     "<|padding|>",
+     "0"
+   ],
+   "tokenizer_eos_token": [
+     "<|endoftext|>",
+     "1"
+   ],
+   "tokenizer_bos_token": [
+     "<|endoftext|>",
+     "1"
+   ],
+   "eot_token_id": 1,
+   "max_length": 2048,
+   "task_hashes": {},
+   "model_source": "hf",
+   "model_name": "outputs/fw57M-tied/42/ByteSpanSurprisalGlobalIncrement_64000/.cache/eval_model",
+   "model_name_sanitized": "outputs__fw57M-tied__42__ByteSpanSurprisalGlobalIncrement_64000__.cache__eval_model",
+   "system_instruction": null,
+   "system_instruction_sha": null,
+   "fewshot_as_multiturn": false,
+   "chat_template": null,
+   "chat_template_sha": null,
+   "start_time": 448849.355028488,
+   "end_time": 449841.840207846,
+   "total_evaluation_time_seconds": "992.4851793580456"
+ }
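The timing fields at the end of blimp_results.json can be cross-checked against each other. The sketch below is only an illustration: it assumes that `date` is a Unix epoch timestamp and that `start_time` and `end_time` come from a monotonic clock (their magnitude is far too small for epoch seconds), neither of which the file itself states. The variable names are hypothetical.

```python
from datetime import datetime, timezone

# Values copied from blimp_results.json.
date = 1751203761.2802591
start_time = 448849.355028488
end_time = 449841.840207846
reported_seconds = float("992.4851793580456")  # stored as a string in the JSON

# Assumption: start_time/end_time are readings of one monotonic clock,
# so only their difference is meaningful.
elapsed = end_time - start_time

# Assumption: `date` is seconds since the Unix epoch.
run_at = datetime.fromtimestamp(date, tz=timezone.utc)

print(f"elapsed: {elapsed:.3f} s (about {elapsed / 60:.1f} min)")
print(f"matches reported total: {abs(elapsed - reported_seconds) < 1e-6}")
print(f"run date (UTC): {run_at.isoformat()}")
```

Under these assumptions the difference between the two clock readings agrees with `total_evaluation_time_seconds`, which suggests the reported total is simply `end_time - start_time`.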
ByteSpanSurprisalGlobalIncrement_64000/config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 1,
+   "head_dim": 32,
+   "hidden_act": "silu",
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "max_position_embeddings": 2048,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 24,
+   "num_hidden_layers": 6,
+   "num_key_value_heads": 24,
+   "pad_token_id": 0,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": null,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": true,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.52.4",
+   "use_cache": true,
+   "vocab_size": 64000
+ }
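The config.json values above are enough to estimate the parameter count by hand. The following is a rough sketch, assuming the standard Llama layout (bias-free Q/K/V/O projections, a gated MLP with three weight matrices, two RMSNorm weight vectors per layer plus a final one, and a single shared embedding matrix when `tie_word_embeddings` is true). The function name `llama_param_count` is hypothetical and not part of any library.

```python
# Subset of config.json needed for the count.
config = {
    "vocab_size": 64000,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_hidden_layers": 6,
    "num_attention_heads": 24,
    "num_key_value_heads": 24,
    "head_dim": 32,
    "tie_word_embeddings": True,
}


def llama_param_count(cfg: dict) -> int:
    """Estimate parameters for a bias-free Llama-style decoder."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o projections
    mlp = 3 * h * cfg["intermediate_size"]              # gate, up, down projections
    norms = 2 * h                                       # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * h
    # With tied embeddings the output head reuses the input matrix.
    lm_head = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * h
    final_norm = h

    return embeddings + lm_head + cfg["num_hidden_layers"] * per_layer + final_norm


total = llama_param_count(config)
print(f"{total:,} parameters")
```

Under these assumptions the estimate can be compared with the `model_num_parameters` value of 105785088 recorded in blimp_results.json. Note that roughly 49M of the parameters sit in the 64000 x 768 embedding matrix, which is why tying matters at this scale.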
ByteSpanSurprisalGlobalIncrement_64000/generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 1,
+   "pad_token_id": 0,
+   "transformers_version": "4.52.4"
+ }
ByteSpanSurprisalGlobalIncrement_64000/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7205f0a69825f83e68c73c1bde51527b17b58c4f7b16cfb178217d3e85f7b611
+ size 211576448
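The model.safetensors entry is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal sketch of reading such a pointer and sanity-checking the size follows. The helper `parse_lfs_pointer` is hypothetical, and attributing the leftover bytes to the safetensors header is an assumption, not something the diff confirms.

```python
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7205f0a69825f83e68c73c1bde51527b17b58c4f7b16cfb178217d3e85f7b611
size 211576448
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer file into its fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(pointer_text)

# model_num_parameters from blimp_results.json, stored as bfloat16 (2 bytes each).
weight_bytes = 105785088 * 2
overhead = pointer["size"] - weight_bytes

print(pointer["algorithm"], pointer["digest"][:12], pointer["size"])
print(f"bytes beyond raw bf16 weights: {overhead}")
```

The few kilobytes left over after subtracting the raw bfloat16 weights are plausibly the file's metadata header, so the pointer size is consistent with a single, untied-free copy of the 105.8M parameters.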
ByteSpanSurprisalGlobalIncrement_64000/special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<|padding|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<|unk|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
ByteSpanSurprisalGlobalIncrement_64000/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
ByteSpanSurprisalGlobalIncrement_64000/tokenizer_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "add_prefix_space": true,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<|padding|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "698": {
+       "content": "<|unk|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "extra_special_tokens": {},
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<|padding|>",
+   "tokenizer_class": "PreTrainedTokenizer",
+   "unk_token": "<|unk|>"
+ }
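The `model_max_length` value in tokenizer_config.json is not a real limit. It equals `int(1e30)`, which appears to be the placeholder written when no maximum length was set on the tokenizer. A small sketch of how a caller might fall back to the model's own limit is shown below; `NO_LIMIT_SENTINEL` and `effective_max_length` are hypothetical names, and the fallback policy is an assumption rather than anything this commit prescribes.

```python
# Value copied from tokenizer_config.json.
model_max_length = 1000000000000000019884624838656
# Value copied from config.json.
max_position_embeddings = 2048

# 1e30 is not exactly representable as a float, so converting it to int
# yields the odd-looking trailing digits seen in the config.
NO_LIMIT_SENTINEL = int(1e30)


def effective_max_length(tokenizer_limit: int, model_limit: int) -> int:
    """Use the model's position limit when the tokenizer reports no limit."""
    if tokenizer_limit >= NO_LIMIT_SENTINEL:
        return model_limit
    return min(tokenizer_limit, model_limit)


print(model_max_length == NO_LIMIT_SENTINEL)
print(effective_max_length(model_max_length, max_position_embeddings))
```

This agrees with the `max_length` of 2048 recorded in blimp_results.json, which suggests the evaluation harness also took its limit from the model configuration rather than from the tokenizer.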