1 change: 1 addition & 0 deletions .~lock.TextMining_writeup.odt#
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
,maggie,maggie-Latitude-E5470,23.02.2017 22:24,file:///home/maggie/.config/libreoffice/4;
38 changes: 37 additions & 1 deletion README.md
@@ -1,3 +1,39 @@
# TextMining

This is the base repo for the text mining and analysis project for Software Design at Olin College.
## by Margaret Rosner
### Project Overview
I combined two books, Robinson Crusoe and Herland, and used a Markov chain to generate random sentences/quotes from the words of the combined texts. I wanted to see what the two books would produce together, and I hoped to create some interesting sentences. Up to this point the projects had come with a lot of scaffolding, so my learning goal was to think through a project on my own.

### Implementation
I implemented my program by downloading the two texts I had chosen. I then stripped all punctuation from both texts, converted every letter to lowercase, and split the combined text into a list of words. I chose to use the enumerate function because it gives me an index for every word in the combined text. I used enumerate to go through the word list and create a dictionary that takes each unique word as a key and stores, as the value, a list of every word that immediately follows it in the text. With the index from enumerate, I could easily look up the next word (index + 1) and add it to that key's list.
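The dictionary-building step can be sketched on a toy string. This is a minimal illustration and not the project code itself: the function name `build_successors` and the sample text are assumptions made for the example, and it splits on any whitespace.

```python
import string


def build_successors(text):
    """Map each word to the list of words that follow it in `text`."""
    # Strip punctuation and lowercase the text, as described above
    cleaned = ''.join(ch for ch in text if ch not in string.punctuation).lower()
    words = cleaned.split()
    successors = {}
    # Stop one word early so that index + 1 is always a valid position
    for index, word in enumerate(words[:-1]):
        successors.setdefault(word, []).append(words[index + 1])
    return successors


sample = "The cat sat. The cat ran!"
print(build_successors(sample))
# {'the': ['cat', 'cat'], 'cat': ['sat', 'ran'], 'sat': ['the']}
```

A word that follows a key more than once appears more than once in the list, so a random pick from the list is weighted by how often each pair occurs in the text.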
After I had created a dictionary of all the words in the text, I wrote a function that generates a sentence of a specific length. This function randomly chooses a starting key and then randomly chooses a word from that key's list of values. It adds that word to a string and makes that word the new key. Once the string reaches the desired length, the function returns it in quotation marks.
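The generation step can be sketched in the same way. The function name `generate_quote`, the `rng` parameter, and the toy dictionary are assumptions for illustration. The restart when a word has no recorded successor is one possible way to handle that case, not necessarily what the project does.

```python
import random


def generate_quote(successors, length, rng=random):
    """Walk the successor dictionary for `length` words and return a quoted string."""
    keys = list(successors)
    current = rng.choice(keys)
    words = []
    while len(words) < length:
        # Restart from a random key if the current word has no recorded successor
        if current not in successors:
            current = rng.choice(keys)
        current = rng.choice(successors[current])
        words.append(current)
    return '"' + ' '.join(words) + '."'


toy = {'the': ['cat', 'cat'], 'cat': ['sat', 'ran'], 'sat': ['the']}
print(generate_quote(toy, 5))
```

Passing a seeded `random.Random` instance as `rng` makes the output repeatable, which helps when checking the function by hand.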
### Results
I chose to use a Markov chain to create sentences from a combination of the books Herland and Robinson Crusoe because both books document men on a journey, even though they were written almost two centuries apart (Robinson Crusoe was published in 1719 and Herland in 1915). The interesting piece about these two texts is that the journeys of these men take place in very different contexts. Robinson Crusoe is stranded on an island, and the book focuses on his struggle to conquer his environment. Herland, on the other hand, tells the story of a group of men who journey into the unknown in an attempt to find an undiscovered all-female society and are later imprisoned by the women until they learn more about the world. Herland focuses on the themes of love and the definition of femininity.
I thought that because these two books both are about men on journeys, but with very different themes, the sentences generated randomly would be interesting and weird. See a smattering below.

"in cold like this in time as i found for."

"great forests looked round me no effort applied myself a quarter of my head for he called my strength the."

"creatures were armed we being murdered and all good things."

"I made snares to son i inquired if burglars try to believe."

"so that they were decimated by the wall and begged lazily."

"to take much better import some merchants for the guns of bread i that price was by this savage that which i."

"approach seemed to mere nature should happen that went in further into the person but this line i must have a matter."

"cautiously and wisdom justice of cats were if i did not at my condition i sat down immediately driven by."

"remedy for as follows three killed one word i could come that and that if founded on the country seat which i was ashore but did not so that in less a habitation."

"as it to lie down and loaded my pistols with my story at it to me governor what became but the knees and pointing to us if i asked him what he meant."

"to sleep had a pitch of labor too with me for home in the way to this retreat i slept all that part i hastened to their darts or deliverance which."

### Reflection
I ended up finding that the themes of the two books did not really play much into the content of the quotes. If I were to do this project again, I would perform a sentiment analysis and generate sentences based on the sentiment I discovered in each text. I think I could have done a better job with unit testing: I did not write any unit tests because the functions I created use random. Instead, I tested my code with print statements along the way to make sure it was doing what I wanted.
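One possible way to unit test code that relies on random is to pass in a seeded generator so that the result is repeatable. The sketch below is a hypothetical example and is not part of the project: `pick_next` and the toy dictionary are made up for illustration.

```python
import random


def pick_next(successors, word, rng):
    """Choose a random successor of `word` using the supplied generator."""
    return rng.choice(successors[word])


def test_pick_next_is_reproducible():
    successors = {'the': ['cat', 'dog', 'bird']}
    first = pick_next(successors, 'the', random.Random(42))
    second = pick_next(successors, 'the', random.Random(42))
    # Two generators with the same seed make the same choice
    assert first == second
    # The choice always comes from the recorded successors
    assert first in successors['the']


test_pick_next_is_reproducible()
```

Tests like this check properties of the output (it is repeatable and drawn from the right list) without depending on which word happens to be chosen.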



74 changes: 74 additions & 0 deletions TextMining.py
@@ -0,0 +1,74 @@
'''
Software Design Project 3: Text Mining
@author Margaret Rosner
'''

import random
import string

import requests

# Download the full text of each book from Project Gutenberg
herland_full_text = requests.get('http://www.gutenberg.org/files/32/32-0.txt').text
crusoe_full_text = requests.get('http://www.gutenberg.org/files/521/521-0.txt').text

# First remove all punctuation from the texts
exclude = set(string.punctuation)
s1 = ''.join(ch for ch in herland_full_text if ch not in exclude)
s2 = ''.join(ch for ch in crusoe_full_text if ch not in exclude)

# Make all letters lowercase
herland_text = s1.lower()
crusoe_text = s2.lower()

# Join with a space so the last word of Herland and the first word of
# Robinson Crusoe are not fused together
whole_text = herland_text + ' ' + crusoe_text

# Split on any whitespace (spaces and line breaks) so that words on adjacent
# lines are separated and no empty strings end up in the list
word_list = whole_text.split()

# Make an index of all words in Herland & Crusoe.
# The following code creates a dictionary (new_dict) that contains all of the
# words in both texts as keys. Each value is a list of the words that follow
# the key word in the text. If the word already exists in the dictionary, the
# code simply appends the following word to that key's list.
new_dict = {}
for index, word in enumerate(word_list[:-1]):
    if word not in new_dict:
        new_dict[word] = [word_list[index + 1]]
    else:
        new_dict[word].append(word_list[index + 1])
#print(new_dict)

def quote(data, length_quote):
    """ Generate a random sentence/quote from the dictionary created above.
    The function randomly chooses a starting key and then randomly chooses
    one of the words stored as that key's value. That word is added to a
    string and then becomes the next key. This process is repeated until
    the desired length of quote is reached.
    """
    new_string = '"'
    num_words = 0
    x = random.choice(list(data.keys()))
    while num_words < length_quote:
        if num_words > 0:
            new_string += ' '
        # The final word of the text may never appear as a key, so restart
        # from a random key instead of raising a KeyError
        if x not in data:
            x = random.choice(list(data.keys()))
        next_word = random.choice(data[x])
        new_string = new_string + next_word
        x = next_word
        num_words = num_words + 1
    new_string += '."'
    return new_string



Herland_Crusoe = quote(new_dict, 30)
print(Herland_Crusoe)