Decision Tree Learning Quiz
Question 1
Recall that entropy was defined in lecture as H(p₁) = −p₁ log₂(p₁) − p₀ log₂(p₀), where p₁ is the fraction of positive examples and p₀ is the fraction of negative examples.
At a given node of a decision tree, 6 of 10 examples are cats and 4 of 10 are not cats. Which expression calculates the entropy H(p₁) of this group of 10 animals?
- −(0.6)log₂(0.6)−(1−0.4)log₂(1−0.4)
- −(0.6)log₂(0.6)−(0.4)log₂(0.4) ✓
- (0.6)log₂(0.6)+(1−0.4)log₂(1−0.4)
- (0.6)log₂(0.6)+(0.4)log₂(0.4)
Answer Location: Found in Section 4: “H(p_1) = -p_1 log_2(p_1) - p_0 log_2(p_0)” where p_1 = 6/10 = 0.6 and p_0 = 4/10 = 0.4.
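As a quick check of the answer, here is a minimal Python sketch of this entropy calculation (the `entropy` helper name is illustrative, not something from the course):

```python
import math

def entropy(p1):
    """Binary entropy in bits: -p1*log2(p1) - (1 - p1)*log2(1 - p1); taken as 0 when p1 is 0 or 1."""
    if p1 in (0, 1):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(6 / 10))  # ~0.971 bits for 6 cats out of 10 animals
```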
Question 2
Recall that information gain was defined as follows: H(p₁^root) - (w^left H(p₁^left) + w^right H(p₁^right))
Before a split, the entropy of a group of 5 cats and 5 non-cats is H(5/10). After splitting on a particular feature, a group of 7 animals (4 of which are cats) has an entropy of H(4/7). The other group of 3 animals (1 of which is a cat) has an entropy of H(1/3). What is the expression for information gain?
- H(0.5)−(4/7 × H(4/7) + 4/7 × H(1/3))
- H(0.5)−(7 × H(4/7) + 3 × H(1/3))
- H(0.5)−(7/10 H(4/7) + 3/10 H(1/3)) ✓
- H(0.5)−(H(4/7) + H(1/3))
Answer Location: Found in Section 5: The information gain formula uses a weighted average of the branch entropies, where w^left = 7/10 (the fraction of examples going to the left branch) and w^right = 3/10 (the fraction going to the right branch).
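A small sketch of the same computation, reusing a binary entropy helper; the printed value (~0.035 bits) is only a sanity check of the correct option, not a number quoted from the course:

```python
import math

def entropy(p1):
    # Binary entropy in bits; defined as 0 at p1 = 0 or 1
    return 0.0 if p1 in (0, 1) else -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

# Root: 5 cats out of 10. Left branch: 7 animals, 4 cats. Right branch: 3 animals, 1 cat.
gain = entropy(5 / 10) - (7 / 10 * entropy(4 / 7) + 3 / 10 * entropy(1 / 3))
print(round(gain, 3))  # ~0.035 bits
```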
Question 3
To represent 3 possible values for the ear shape, you can define 3 features for ear shape: pointy ears, floppy ears, oval ears. For an animal whose ears are not pointy, not floppy, but are oval, how can you represent this information as a feature vector?
- [1, 1, 0]
- [0, 1, 0]
- [0, 0, 1] ✓
- [1, 0, 0]
Answer Location: Found in Section 7: “For an animal whose ears are not pointy, not floppy, but are oval” - in one-hot encoding, exactly one feature equals 1, corresponding to the true category (oval ears).
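A minimal one-hot encoding sketch; the category ordering [pointy, floppy, oval] is an assumption chosen so the output lines up with the answer above:

```python
EAR_SHAPES = ["pointy", "floppy", "oval"]  # assumed ordering of the three ear-shape features

def one_hot(value, categories=EAR_SHAPES):
    """Return a 0/1 vector with exactly one 1, at the position of the matching category."""
    return [1 if value == category else 0 for category in categories]

print(one_hot("oval"))  # [0, 0, 1]
```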
Question 4
For a continuous valued feature (such as weight of the animal), there are 10 animals in the dataset. According to the lecture, what is the recommended way to find the best split for that feature?
- Use a one-hot encoding to turn the feature into a discrete feature vector of 0’s and 1’s, then apply the algorithm we had discussed for discrete features.
- Choose the 9 mid-points between the 10 examples as possible splits, and find the split that gives the highest information gain. ✓
- Try every value spaced at regular intervals (e.g., 8, 8.5, 9, 9.5, 10, etc.) and find the split that gives the highest information gain.
- Use gradient descent to find the value of the split threshold that gives the highest information gain.
Answer Location: Found in Section 8: “sort all of the examples according to the weight or according to the value of this feature and take all the values that are mid points between the sorted list of training examples as the values for consideration for this threshold.”
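A rough sketch of that midpoint-based search, assuming the binary entropy helper from above; the weights and labels below are made-up illustration data, not from the lecture:

```python
import math

def entropy(p1):
    return 0.0 if p1 in (0, 1) else -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def best_threshold(values, labels):
    """Try the midpoints between consecutive sorted values; return (threshold, information gain)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    h_root = entropy(sum(labels) / n)
    best = (None, -1.0)
    for i in range(n - 1):
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for v, y in pairs if v <= mid]
        right = [y for v, y in pairs if v > mid]
        if not left or not right:
            continue
        gain = h_root - (len(left) / n * entropy(sum(left) / len(left))
                         + len(right) / n * entropy(sum(right) / len(right)))
        if gain > best[1]:
            best = (mid, gain)
    return best

# Made-up weights (lbs) and cat labels (1 = cat) for 10 animals; 9 candidate midpoints are tried
weights = [7, 8, 8.5, 9, 9.5, 10, 11, 12, 15, 18]
labels  = [1, 1, 1,   1, 1,   0,  1,  0,  0,  0]
print(best_threshold(weights, labels))  # best midpoint threshold and its information gain
```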
Question 5
Which of these are commonly used criteria to decide to stop splitting? (Choose two.)
- ☑ When the tree has reached a maximum depth ✓
- ☑ When the number of examples in a node is below a threshold ✓
- ☐ When the information gain from additional splits is too large
- ☐ When a node is 50% one class and 50% another class (highest possible value of entropy)
Answer Location: Found in Sections 2 and 6: Stopping criteria include “when further splitting a node will cause the tree to exceed the maximum depth” and “if the number of examples in a node is below a threshold.”
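A sketch of how those two stopping criteria could be checked during recursive splitting; the hyperparameter names and values are assumptions for illustration only:

```python
MAX_DEPTH = 5      # assumed maximum tree depth
MIN_EXAMPLES = 3   # assumed minimum number of examples in a node

def should_stop(depth, n_examples, max_depth=MAX_DEPTH, min_examples=MIN_EXAMPLES):
    """Stop splitting when the node is at the maximum depth or holds too few examples."""
    return depth >= max_depth or n_examples < min_examples

print(should_stop(depth=5, n_examples=40))  # True: splitting would exceed the maximum depth
print(should_stop(depth=2, n_examples=2))   # True: too few examples in the node
```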