We explore a dataset related to my research on natural language understanding in the domain of text adventure games. Text adventure games are a good environment in which to study language acquisition, because the player does not know the implicit ontology of the game and must learn it by interacting with the game environment. Our stretch goal is to determine whether we can design a system that acquires language well enough to learn to play games it has never seen during training.
Our primary questions thus center on whether the collection of text adventure games I have found offers a diverse set of textual environments. In the worst case, the game ontologies are so similar that generalization between games is trivial. In the best case, the game ontologies are rich enough to present an interesting learning problem. More specifically, the questions we'd like to answer are:
We consider the text adventure games available from Jericho, a learning environment for text adventure games from Microsoft Research. After downloading each game, we extract its text. Fully extracting the text of a game requires exhaustive playthroughs that consider every potential path in the game, which is intractable. Instead, we use ZMachine Tools to decompile the game binaries. This results in decompiled assembly code for the game that looks like the following:
Resident data ends at 7f8c, program starts at 7f8c, file ends at 35040

Starting analysis pass at address 7f8a

End of analysis pass, low address = 7f8c, high address = 2049a

[Start of code at 7f8c]

Main routine 7f8c, 0 locals

7f8d: e0 3f 20 88 ff call_vs 8220 -> gef
7f92: ba quit

Routine 7f94, 1 local

7f95: 2d ff 01 store gef local0
7f98: 41 ff 00 4f je gef #00 ~7fa9
7f9c: b3 ... print_ret "Feeding a guest"
7fa9: 41 ff 01 4f je gef #01 ~7fba
7fad: b3 ... print_ret "Finding a turnip"
...
Next, I extract strings from the decompiled code via regular expressions. In the above snippet, the strings are "Feeding a guest" and "Finding a turnip". The extracted text looks like this:
...
Must be where Mama bear cooks all her porridge. I don't know why she doesn't just get herself a microwave.
There is a blazing fire in the hearth!
The hearth is cold and full of ash.
Something glints among the embers...
Somehow I don't think a load of cold ashes is going to make a very good fire.
I'm not sure a ceramic bowl is the best sort of container for cooking porridge in. There must be some other way of heating it up!
The fire's not even lit - besides I'm not sure a ceramic bowl is the best sort of container for cooking porridge in. There must be some other way of heating it up!
I put the pork chops on the hearth. Wow, they defrosted quickly!
...
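As a rough illustration of the regular-expression step, the following minimal sketch pulls the quoted operand out of print instructions in a decompiled listing. The pattern and the function name `extract_strings` are assumptions made for this example; the actual expressions used for the dataset are not shown in this write-up, and a real decompiler may emit other string-bearing opcodes.

```python
import re

# Assumed pattern: a print-style opcode followed by a double-quoted string,
# e.g.  7f9c: b3 ... print_ret "Feeding a guest"
# The opcode list is illustrative; extend it for other string-bearing opcodes.
PRINT_STRING = re.compile(r'\b(?:print_ret|print)\s+"((?:[^"\\]|\\.)*)"')


def extract_strings(disassembly: str) -> list[str]:
    """Return the quoted strings found after print opcodes in the listing."""
    return [match.group(1) for match in PRINT_STRING.finditer(disassembly)]


sample = '''
7f98: 41 ff 00 4f je gef #00 ~7fa9
7f9c: b3 ... print_ret "Feeding a guest"
7fa9: 41 ff 01 4f je gef #01 ~7fba
7fad: b3 ... print_ret "Finding a turnip"
'''
print(extract_strings(sample))
```

Lines without a quoted operand, such as the `je` branches, are simply skipped.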
We can then examine the characteristics of each game using this text dump. We begin with the first question: how rich is the language of text adventure games? To facilitate exploration, we initially focus on two games, "Hunter, in Darkness" and "The Meteor, the Stone and a Long Glass of Sherbet". To start, we plot the distribution of words in each text dump.
Here we compute the number of unique word types in each game in the collection. We find that while some games have a small vocabulary (e.g. acorncourt, 905 word types), others have rich vocabularies of several thousand words (e.g. anchor, curses). This analysis also exposes one problem with the dataset: one game (Murdac) failed to generate a text dump.
Here we plot the number of strings extracted from each game as well as the number of unique word types contained in each game. While the two distributions show largely similar patterns, there are a couple of surprises. For example, the games "LostPig" and "yomomma" contain a large number of extractions but a small number of word types. Conversely, the game "curses" does not yield a large number of extractions, but it contains the second highest number of unique word types.
Here we show the top word types that occur in the text dumps of two text adventure games. In particular, we examine the probability of each word occurring in the text. However, we find that the most common words are rather uninformative. This suggests that we need to do some filtering to surface more informative distinctions between the different games.
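The per-game counts can be sketched as follows. The tokenizer (lowercased alphabetic runs, apostrophes allowed) and the toy `dumps` dictionary are assumptions for illustration; the real counts depend on the tokenization actually used, which this write-up does not specify.

```python
import re

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def word_types(strings):
    """Return the set of unique word types across a game's extracted strings."""
    return {tok for s in strings for tok in TOKEN.findall(s.lower())}


# Hypothetical stand-ins for per-game text dumps, not real game data.
dumps = {
    "game_a": ["The hearth is cold and full of ash.", "The hearth is cold."],
    "game_b": ["You cast the spell.", "The spell fizzles out."],
}

for game, strings in dumps.items():
    # Number of extracted strings next to the number of unique word types.
    print(game, len(strings), len(word_types(strings)))
```

Reporting the two numbers side by side makes it easy to spot games with many extractions but few word types, or the reverse.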
We then plot the distribution of word types:
After removing stop words and short words, we see some interesting differences emerge from the vocabulary. While there are common words shared between both games, such as "noun", "rope", and "back", the word distributions are fairly different and indicative of each game's theme. For instance, "Hunter, in Darkness" contains words relevant to its rogue cave setting such as "bats", "stone", "pit", and "crawl", whereas "The Meteor, the Stone and a Long Glass of Sherbet" contains words indicative of its magical fantasy setting such as "spell" and "Empire".
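A minimal sketch of the filtering step is shown below. The stop-word list and the minimum length of three characters are assumptions chosen for the example; the write-up does not state which list or threshold was used.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")
# Illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "and", "you", "that", "this", "with", "there", "have"}
MIN_LENGTH = 3  # assumed threshold for "short" words


def word_probabilities(strings, stop_words=STOP_WORDS, min_length=MIN_LENGTH):
    """Relative frequency of each word type after filtering."""
    counts = Counter(
        tok
        for s in strings
        for tok in TOKEN.findall(s.lower())
        if tok not in stop_words and len(tok) >= min_length
    )
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


# Hypothetical strings, not taken from a real text dump.
strings = ["There is a rope in the pit.", "The bats circle the pit and the rope."]
probs = word_probabilities(strings)
for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{p:.3f}")
```

Because probabilities are renormalized after filtering, they describe the distribution over the retained words only.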
Instead of plotting raw word probabilities, we can also plot the pointwise mutual information of word types, where $\mathrm{PMI}(w, g) = \log \frac{p(w \mid g)}{p(w)}$ for a word type $w$ and a game $g$. We see that these words are very much distinct from the common words in the previous figure. Another observation we can make from this is that the top 20 indicative words in "Hunter" overlap somewhat with words in "Sherbet", while the reverse is not true: none of the top 20 indicative words in "Sherbet" are found in "Hunter".
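The sketch below computes PMI between word types and games using the standard definition, log p(w | g) / p(w), with p(w) estimated from the counts pooled over all games. The pooling choice, the function name `pmi_by_game`, and the toy counts are assumptions; the exact estimator behind the figure may differ.

```python
import math
from collections import Counter


def pmi_by_game(counts_by_game):
    """Map each game to {word: PMI(word, game)} from per-game word counts."""
    pooled = Counter()
    for counts in counts_by_game.values():
        pooled.update(counts)
    pooled_total = sum(pooled.values())

    scores = {}
    for game, counts in counts_by_game.items():
        game_total = sum(counts.values())
        scores[game] = {
            # log of p(word | game) over p(word) in the pooled collection
            word: math.log((c / game_total) / (pooled[word] / pooled_total))
            for word, c in counts.items()
            if c > 0
        }
    return scores


# Hypothetical counts, not taken from the real text dumps.
counts_by_game = {
    "huntdark": Counter({"bats": 3, "rope": 1}),
    "sherbet": Counter({"spell": 3, "rope": 1}),
}
scores = pmi_by_game(counts_by_game)
for game, words in scores.items():
    print(game, sorted(words, key=words.get, reverse=True))
```

A word that is equally likely in a game and in the pooled collection scores zero, while a word concentrated in one game scores above zero. Very rare words can receive inflated scores, so a minimum count threshold is often applied in practice.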
How might we show the distribution of text of each game in the same space? One way to do this is to visualize how the words are co-located in embedding space. Here we visualize the average GloVe embedding of each word. To facilitate visualization, we plot the first two principal components of the embedding space.
This figure shows the distribution of words of a particular game in embedding space. In particular, we collect the vocabulary of this game and perform principal component analysis (PCA) on the words' GloVe embeddings. We then show the words at locations corresponding to their first two principal components. From this, we see clusters of similar concepts in the game, such as names (top right) and directions (middle right). However, a large quantity of words do not differ significantly from each other (e.g. those in the lower left cluster).
This figure shows the distribution of words as in the last figure, except that we have used a more powerful, non-linear dimensionality reduction technique called TSNE. When we zoom in, we find more interesting collocations such as materials (x,y=2,14), animals (x,y=12,0), and plants (x,y=-12,7).
In this figure, we have applied TSNE reduction to the vocabulary across two games in the collection. From this we see that while a significant number of words overlap, there are still clusters of words that are indicative of a particular game. For example, the "huntdark" game has a strong cluster of cave-dwelling creatures such as "bats" and "bugs", while "sherbet" has a cluster of terms related to magical spells around (x,y=3,17).
In this figure, we have applied TSNE reduction to the phrases extracted across five games in the collection. On high-level inspection, the phrases are distributed on roughly the same manifold. However, we see that certain games tend to gravitate towards certain regions. For example, "gold" and "reverb" tend to fall on the right side of the manifold, whereas "huntdark" tends to fall on the left side. Note that each point displays a tooltip containing the phrase on mouseover.
In this figure, we have applied TSNE reduction to the possible actions extracted across games. We see that a large number of actions are shared between games (e.g. those in red). These include common verbs such as "yes", "no", "ask", "lie", "lock", and "save". However, games also have large numbers of theme-specific actions. For example, "sherbet" has actions related to magic such as "runes", "spellbook", "beam", and "fragment", whereas "huntdark" has actions reflecting its rogue theme such as "mechanism", "crossbow", "needles", "strap", and "bowstrings".
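The PCA projection can be sketched with NumPy as below. The four-dimensional vectors are hypothetical stand-ins for GloVe embeddings, and `pca_2d` is a name chosen for this example; in practice the rows would be the pretrained GloVe vectors of the game's vocabulary.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # center each embedding dimension
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T


# Hypothetical toy embeddings, not real GloVe vectors.
words = ["north", "south", "bats", "bugs", "rope"]
vectors = [
    [1.0, 0.0, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.9, 0.1],
    [0.1, 0.9, 1.0, 0.0],
    [0.5, 0.5, 0.4, 0.9],
]
coords = pca_2d(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word}\t{x:+.3f}\t{y:+.3f}")
```

The resulting coordinates can be passed to any scatter-plot routine, with each point labeled by its word. The signs of the components are arbitrary, so a plot may appear mirrored relative to another run or library.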
We examined the distribution of textual content across text adventure games. In particular, we find that the size of the vocabulary varies greatly from game to game. That is, some games have a rich lexicon while others do not. Next, we examined the overlap in vocabulary between games and found that there is significant overlap in terms of common objects (e.g. chests, doors), but that one can distinguish between games using only their vocabulary by observing theme-specific words (e.g. spellbook), which have high PMI. Moreover, we examined word embeddings as a means to visualize the space of game vocabulary, using dimensionality reduction techniques to interpret these high-dimensional embeddings. Using this technique, we find that games have shared and unique regions on the manifold that are visually distinguishable. We find similar patterns in the affordances between games, namely that there is a large number of shared actions as well as a large number of theme-specific actions. These results indicate that text adventure games are a promising environment for studying language acquisition, in that some language acquired in one game is useful in another, but the agent also needs to acquire new language in order to succeed in the new game.