diff --git a/docs/src/LM.md b/docs/src/LM.md index 0289516d..c76a0e3c 100644 --- a/docs/src/LM.md +++ b/docs/src/LM.md @@ -1,16 +1,16 @@ -# Statistical Language Model +# Statistical Language Models -**TextAnalysis** provide following different Language Models +**TextAnalysis** provides the following different language models: -- **MLE** - Base Ngram model. -- **Lidstone** - Base Ngram model with Lidstone smoothing. -- **Laplace** - Base Ngram language model with Laplace smoothing. -- **WittenBellInterpolated** - Interpolated Version of witten-Bell algorithm. -- **KneserNeyInterpolated** - Interpolated version of Kneser -Ney smoothing. +- **MLE** - Base n-gram model using Maximum Likelihood Estimation. +- **Lidstone** - Base n-gram model with Lidstone smoothing. +- **Laplace** - Base n-gram language model with Laplace smoothing. +- **WittenBellInterpolated** - Interpolated version of the Witten-Bell algorithm. +- **KneserNeyInterpolated** - Interpolated version of Kneser-Ney smoothing. ## APIs -To use the API, we first *Instantiate* desired model and then load it with train set +To use the API, first instantiate the desired model and then train it with a training set: ```julia MLE(word::Vector{T}, unk_cutoff=1, unk_label="") where { T <: AbstractString} @@ -25,31 +25,31 @@ KneserNeyInterpolated(word::Vector{T}, discount:: Float64=0.1, unk_cutoff=1, unk (lm::)(text, min::Integer, max::Integer) ``` -Arguments: +**Arguments:** - * `word` : Array of strings to store vocabulary. + * `word`: Array of strings to store the vocabulary. * `unk_cutoff`: Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary. - * `unk_label`: token for unknown labels + * `unk_label`: Token for unknown labels. - * `gamma`: smoothing argument gamma + * `gamma`: Smoothing parameter gamma. - * `discount`: discounting factor for `KneserNeyInterpolated` + * `discount`: Discounting factor for `KneserNeyInterpolated`. - for more information see docstrings of vocabulary +For more information, see the docstrings of the vocabulary functions. ```julia julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"] julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"] -# voc and train are used to train vocabulary and model respectively +# voc and train are used to train the vocabulary and model respectively julia> model = MLE(voc) MLE(Vocabulary(Dict("khan"=>1,"name"=>1,""=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1…), 1, "", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", ""])) julia> print(voc) -11-element Array{String,1}: +11-element Vector{String}: "my" "name" "is" @@ -62,42 +62,41 @@ julia> print(voc) "Khan" "" -# you can see "" token is added to voc -julia> fit = model(train,2,2) #considering only bigrams +# You can see the "" token is added to voc +julia> fit = model(train,2,2) # considering only bigrams -julia> unmaskedscore = score(model, fit, "is" ,"") #score output P(word | context) without replacing context word with "" +julia> unmaskedscore = score(model, fit, "is" ,"") # score output P(word | context) without replacing context word with "" 0.3333333333333333 julia> masked_score = maskedscore(model,fit,"is","alien") 0.3333333333333333 -#as expected maskedscore is equivalent to unmaskedscore with context replaced with "" +# As expected, maskedscore is equivalent to unmaskedscore with context replaced with "" ``` !!! 
note - When you call `MLE(voc)` for the first time, It will update your vocabulary set as well. + When you call `MLE(voc)` for the first time, it will update your vocabulary set as well. -## Evaluation Method +## Evaluation Methods ### `score` -used to evaluate the probability of word given context (*P(word | context)*) +Used to evaluate the probability of a word given its context (*P(word | context)*): ```@docs score ``` -Arguments: +**Arguments:** -1. `m` : Instance of `Langmodel` struct. -2. `temp_lm`: output of function call of instance of `Langmodel`. -3. `word`: string of word -4. `context`: context of given word +1. `m`: Instance of `Langmodel` struct. +2. `temp_lm`: Output of function call of instance of `Langmodel`. +3. `word`: String of the word. +4. `context`: Context of the given word. -- In case of `Lidstone` and `Laplace` it apply smoothing and, - -- In Interpolated language model, provide `Kneserney` and `WittenBell` smoothing +- For `Lidstone` and `Laplace` models, smoothing is applied. +- For interpolated language models, `KneserNey` and `WittenBell` smoothing are provided. ### `maskedscore` ```@docs @@ -121,9 +120,9 @@ entropy perplexity ``` -## Preprocessing +## Preprocessing - For Preprocessing following functions: +The following functions are available for preprocessing: ```@docs everygram padding_ngram @@ -131,18 +130,18 @@ padding_ngram ## Vocabulary -Struct to store Language models vocabulary +A struct to store language model vocabulary. -checking membership and filters items by comparing their counts to a cutoff value +It checks membership and filters items by comparing their counts to a cutoff value. -It also Adds a special "unknown" tokens which unseen words are mapped to +It also adds a special "unknown" token which unseen words are mapped to: ```@repl using TextAnalysis words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"] vocabulary = Vocabulary(words, 2) -# lookup a sequence or words in the vocabulary +# Look up a sequence of words in the vocabulary word = ["a", "-", "d", "c", "a"] diff --git a/docs/src/classify.md b/docs/src/classify.md index 51957451..09274c76 100644 --- a/docs/src/classify.md +++ b/docs/src/classify.md @@ -1,25 +1,25 @@ # Classifier -Text Analysis currently offers a Naive Bayes Classifier for text classification. +TextAnalysis currently offers a Naive Bayes Classifier for text classification. -To load the Naive Bayes Classifier, use the following command - +To load the Naive Bayes Classifier, use the following command: using TextAnalysis: NaiveBayesClassifier, fit!, predict ## Basic Usage -Its usage can be done in the following 3 steps. +It can be used in the following 3 steps: -1- Create an instance of the Naive Bayes Classifier model - +1. Create an instance of the Naive Bayes Classifier model: ```@docs NaiveBayesClassifier ``` -2- Fitting the model weights on input - +2. Fit the model weights on training data: ```@docs fit! ``` -3- Predicting for the input case - +3. Make predictions on new data: ```@docs predict ``` diff --git a/docs/src/corpus.md b/docs/src/corpus.md index eff92421..57fa4c11 100644 --- a/docs/src/corpus.md +++ b/docs/src/corpus.md @@ -2,16 +2,14 @@ Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents -using the Corpus type: +using the `Corpus` type: ```@docs Corpus ``` ## Standardizing a Corpus -A `Corpus` may contain many different types of documents. 
It is generally more convenient to standardize all of the documents in a -corpus using a single type. This can be done using the `standardize!` -function: +A `Corpus` may contain many different types of documents. It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the `standardize!` function: ```@docs standardize! @@ -19,8 +17,7 @@ standardize! ## Processing a Corpus -We can apply the same sort of preprocessing steps that are defined for -individual documents to an entire corpus at once: +We can apply the same preprocessing steps that are defined for individual documents to an entire corpus at once: ```@repl using TextAnalysis @@ -35,36 +32,33 @@ These operations are run on each document in the corpus individually. ## Corpus Statistics -Often we wish to think broadly about properties of an entire corpus at once. -In particular, we want to work with two constructs: +Often we want to analyze properties of an entire corpus at once. In particular, we work with two key constructs: -* _Lexicon_: The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all of the documents. Often the most interesting words in a document are those words whose frequency within a document is higher than their frequency in the corpus as a whole. -* _Inverse Index_: If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm. +* **Lexicon**: The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all documents. Often the most interesting words in a document are those whose frequency within that document is higher than their frequency in the corpus as a whole. +* **Inverse Index**: If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index provides this information and enables a basic search algorithm. 
-Because computations involving the lexicon can take a long time, a -`Corpus`'s default lexicon is blank: +Because computations involving the lexicon can be time-consuming, a `Corpus` has an empty lexicon by default: ```julia julia> crps = Corpus([StringDocument("Name Foo"), - StringDocument("Name Bar")]) + StringDocument("Name Bar")]) julia> lexicon(crps) Dict{String,Int64} with 0 entries ``` -In order to work with the lexicon, you have to update it and then access it: +To work with the lexicon, you must update it first and then access it: ```julia julia> update_lexicon!(crps) julia> lexicon(crps) Dict{String,Int64} with 3 entries: - "Bar" => 1 - "Foo" => 1 + "Bar" => 1 + "Foo" => 1 "Name" => 2 ``` -But once this work is done, you can easier address lots of interesting -questions about a corpus: +Once this is done, you can easily address many interesting questions about a corpus: ```julia julia> lexical_frequency(crps, "Name") 0.5 @@ -73,11 +67,11 @@ julia> lexical_frequency(crps, "Foo") 0.25 ``` -Like the lexicon, the inverse index for a corpus is blank by default: +Like the lexicon, the inverse index for a corpus is empty by default: ```julia julia> inverse_index(crps) -Dict{String,Array{Int64,1}} with 0 entries +Dict{String,Vector{Int64}} with 0 entries ``` Again, you need to update it before you can work with it: @@ -86,74 +80,94 @@ Again, you need to update it before you can work with it: julia> update_inverse_index!(crps) julia> inverse_index(crps) -Dict{String,Array{Int64,1}} with 3 entries: - "Bar" => [2] - "Foo" => [1] +Dict{String,Vector{Int64}} with 3 entries: + "Bar" => [2] + "Foo" => [1] "Name" => [1, 2] ``` -But once you've updated the inverse index, you can easily search the entire -corpus: +Once you've updated the inverse index, you can easily search the entire corpus: ```julia julia> crps["Name"] - -2-element Array{Int64,1}: +2-element Vector{Int64}: 1 2 julia> crps["Foo"] -1-element Array{Int64,1}: +1-element Vector{Int64}: 1 julia> crps["Summer"] -0-element Array{Int64,1} +Int64[] +``` + +## Converting a Corpus to a DataFrame + +Sometimes we want to apply non-text-specific data analysis operations to a corpus. The easiest way to do this is to convert a `Corpus` object into a `DataFrame`: + +```julia +julia> using DataFrames +julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")]) +julia> df = DataFrame(crps) +2×6 DataFrame + Row │ Language Title Author Timestamp Length Text + │ String? String? String? String? Int64? String? +─────┼──────────────────────────────────────────────────────────────────────────────────────── + 1 │ Languages.English() Untitled Document Unknown Author Unknown Time 8 Name Foo + 2 │ Languages.English() Untitled Document Unknown Author Unknown Time 8 Name Bar ``` -## Converting a DataFrame from a Corpus +This creates a DataFrame with columns for Language, Title, Author, Timestamp, Length, and Text for each document in the corpus. -Sometimes we want to apply non-text specific data analysis operations to a -corpus. 
The easiest way to do this is to convert a `Corpus` object into -a `DataFrame`: +Alternatively, you can manually construct a DataFrame with custom columns: - convert(DataFrame, crps) +```julia +using DataFrames +df = DataFrame( + text = [text(doc) for doc in crps.documents], + language = languages(crps), + title = titles(crps), + author = authors(crps), + timestamp = timestamps(crps) +) +``` ## Corpus Metadata -You can also retrieve the metadata for every document in a `Corpus` at once: +You can retrieve the metadata for every document in a `Corpus` at once: -* `languages()`: What language is the document in? Defaults to `Languages.English()`, a Language instance defined by the Languages package. -* `titles()`: What is the title of the document? Defaults to `"Untitled Document"`. -* `authors()`: Who wrote the document? Defaults to `"Unknown Author"`. -* `timestamps()`: When was the document written? Defaults to `"Unknown Time"`. +* `languages()`: What language is each document in? Defaults to `Languages.English()`, a Language instance defined by the Languages package. +* `titles()`: What is the title of each document? Defaults to `"Untitled Document"`. +* `authors()`: Who wrote each document? Defaults to `"Unknown Author"`. +* `timestamps()`: When was each document written? Defaults to `"Unknown Time"`. ```julia julia> crps = Corpus([StringDocument("Name Foo"), - StringDocument("Name Bar")]) + StringDocument("Name Bar")]) julia> languages(crps) -2-element Array{Languages.English,1}: +2-element Vector{Languages.English}: Languages.English() Languages.English() julia> titles(crps) -2-element Array{String,1}: +2-element Vector{String}: "Untitled Document" "Untitled Document" julia> authors(crps) -2-element Array{String,1}: +2-element Vector{String}: "Unknown Author" "Unknown Author" julia> timestamps(crps) -2-element Array{String,1}: +2-element Vector{String}: "Unknown Time" "Unknown Time" ``` -It is possible to change the metadata fields for each document in a `Corpus`. -These functions use the same metadata value for every document: +You can change the metadata fields for each document in a `Corpus`. These functions set the same metadata value for every document: ```julia julia> languages!(crps, Languages.German()) @@ -161,11 +175,10 @@ julia> titles!(crps, "") julia> authors!(crps, "Me") julia> timestamps!(crps, "Now") ``` -Additionally, you can specify the metadata fields for each document in -a `Corpus` individually: +Additionally, you can specify the metadata fields for each document in a `Corpus` individually: ```julia -julia> languages!(crps, [Languages.German(), Languages.English +julia> languages!(crps, [Languages.German(), Languages.English()]) julia> titles!(crps, ["", "Untitled"]) julia> authors!(crps, ["Ich", "You"]) julia> timestamps!(crps, ["Unbekannt", "2018"]) diff --git a/docs/src/documents.md b/docs/src/documents.md index 53b00ab8..4351774a 100644 --- a/docs/src/documents.md +++ b/docs/src/documents.md @@ -1,12 +1,11 @@ ## Creating Documents -The basic unit of text analysis is a document. The TextAnalysis package -allows one to work with documents stored in a variety of formats: +The basic unit of text analysis is a document. 
The TextAnalysis package allows you to work with documents stored in a variety of formats: -* _FileDocument_ : A document represented using a plain text file on disk -* _StringDocument_ : A document represented using a UTF8 String stored in RAM -* _TokenDocument_ : A document represented as a sequence of UTF8 tokens -* _NGramDocument_ : A document represented as a bag of n-grams, which are UTF8 n-grams that map to counts +* **FileDocument**: A document represented using a plain text file on disk +* **StringDocument**: A document represented using a UTF-8 String stored in RAM +* **TokenDocument**: A document represented as a sequence of UTF-8 tokens +* **NGramDocument**: A document represented as a bag of n-grams, which are UTF-8 n-grams that map to counts !!! note These formats represent a hierarchy: you can always move down the hierarchy, but can generally not move up the hierarchy. A `FileDocument` can easily become a `StringDocument`, but an `NGramDocument` cannot easily become a `FileDocument`. @@ -20,16 +19,14 @@ TokenDocument NGramDocument ``` -An NGramDocument consisting of bigrams or any higher order representation `N` -can be easily created by passing the parameter `N` to `NGramDocument` +An `NGramDocument` consisting of bigrams or any higher-order representation `N` can be easily created by passing the parameter `N` to `NGramDocument`: ```@repl using TextAnalysis NGramDocument("To be or not to be ...", 2) ``` -For every type of document except a `FileDocument`, you can also construct a -new document by simply passing in a string of text: +For every type of document except a `FileDocument`, you can also construct a new document by simply passing in a string of text: ```@repl using TextAnalysis @@ -38,14 +35,9 @@ td = TokenDocument("To be or not to be...") ngd = NGramDocument("To be or not to be...") ``` -The system will automatically perform tokenization or n-gramization in order -to produce the required data. Unfortunately, `FileDocument`'s cannot be -constructed this way because filenames are themselves strings. It would cause -chaos if filenames were treated as the text contents of a document. +The system will automatically perform tokenization or n-gramization to produce the required data. Unfortunately, `FileDocument`s cannot be constructed this way because filenames are themselves strings. It would cause confusion if filenames were treated as the text contents of a document. -That said, there is one way around this restriction: you can use the generic -`Document()` constructor function, which will guess at the type of the inputs -and construct the appropriate type of document object: +However, there is one way around this restriction: you can use the generic `Document()` constructor function, which will infer the type of the inputs and construct the appropriate type of document object: ```julia julia> Document("To be or not to be...") @@ -80,12 +72,11 @@ A NGramDocument{AbstractString} * Snippet: ***SAMPLE TEXT NOT AVAILABLE*** ``` -This constructor is very convenient for working in the REPL, but should be avoided in permanent code because, unlike the other constructors, the return type of the `Document` function cannot be known at compile-time. +This constructor is very convenient for working in the REPL, but should be avoided in production code because, unlike the other constructors, the return type of the `Document` function cannot be known at compile time. 
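To make the compile-time point concrete, here is a brief sketch contrasting the generic constructor with the concrete ones (a minimal example using only constructors shown above):

```julia
using TextAnalysis

# The generic constructor inspects its input at runtime, so the concrete
# return type is only known once the call has executed:
d = Document("To be or not to be...")     # happens to produce a StringDocument

# The concrete constructors always return the type they name, which the
# compiler can rely on in performance-sensitive code:
sd = StringDocument("To be or not to be...")
td = TokenDocument("To be or not to be...")
ngd = NGramDocument("To be or not to be...")
```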
## Basic Functions for Working with Documents -Once you've created a document object, you can work with it in many ways. The -most obvious thing is to access its text using the `text()` function: +Once you've created a document object, you can work with it in many ways. The most obvious operation is to access its text using the `text()` function: ```@repl using TextAnalysis @@ -94,16 +85,9 @@ text(sd) ``` !!! note - This function works without warnings on `StringDocument`'s and - `FileDocument`'s. For `TokenDocument`'s it is not possible to know if the - text can be reconstructed perfectly, so calling - `text(TokenDocument("This is text"))` will produce a warning message before - returning an approximate reconstruction of the text as it existed before - tokenization. It is entirely impossible to reconstruct the text of an - `NGramDocument`, so `text(NGramDocument("This is text"))` raises an error. + This function works without warnings on `StringDocument`s and `FileDocument`s. For `TokenDocument`s it is not possible to know if the text can be reconstructed perfectly, so calling `text(TokenDocument("This is text"))` will produce a warning message before returning an approximate reconstruction of the text as it existed before tokenization. It is entirely impossible to reconstruct the text of an `NGramDocument`, so `text(NGramDocument("This is text"))` raises an error. -Instead of working with the text itself, you can work with the tokens or -n-grams of a document using the `tokens()` and `ngrams()` functions: +Instead of working with the text itself, you can work with the tokens or n-grams of a document using the `tokens()` and `ngrams()` functions: ```@repl using TextAnalysis @@ -112,9 +96,7 @@ tokens(sd) ngrams(sd) ``` -By default the `ngrams()` function produces unigrams. If you would like to -produce bigrams or trigrams, you can specify that directly using a numeric -argument to the `ngrams()` function: +By default the `ngrams()` function produces unigrams. If you want to produce bigrams or trigrams, you can specify that directly using a numeric argument to the `ngrams()` function: ```@repl using TextAnalysis @@ -130,8 +112,7 @@ sd = StringDocument("To be or not to be..."); ngrams(sd, 2, 3) ``` -If you have a `NGramDocument`, you can determine whether an `NGramDocument` -contains unigrams, bigrams or a higher-order representation using the `ngram_complexity()` function: +If you have an `NGramDocument`, you can determine whether it contains unigrams, bigrams, or a higher-order representation using the `ngram_complexity()` function: ```@repl using TextAnalysis @@ -139,23 +120,18 @@ ngd = NGramDocument("To be or not to be ...", 2); ngram_complexity(ngd) ``` -This information is not available for other types of `Document` objects -because it is possible to produce any level of complexity when constructing -n-grams from raw text or tokens. +This information is not available for other types of `Document` objects because it is possible to produce any level of complexity when constructing n-grams from raw text or tokens. ## Document Metadata -In addition to methods for manipulating the representation of the text of a -document, every document object also stores basic metadata about itself, -including the following pieces of information: +In addition to methods for manipulating the text representation of a document, every document object also stores basic metadata about itself, including the following information: * `language()`: What language is the document in? 
Defaults to `Languages.English()`, a Language instance defined by the Languages package. * `title()`: What is the title of the document? Defaults to `"Untitled Document"`. * `author()`: Who wrote the document? Defaults to `"Unknown Author"`. * `timestamp()`: When was the document written? Defaults to `"Unknown Time"`. -Try these functions out on a `StringDocument` to see how the defaults work -in practice: +Try these functions on a `StringDocument` to see how the defaults work in practice: ```@repl using TextAnalysis @@ -166,8 +142,7 @@ author(sd) timestamp(sd) ``` -If you need reset these fields, you can use the mutating versions of the same -functions: +If you need to reset these fields, you can use the mutating versions of the same functions: ```@repl using TextAnalysis, Languages @@ -180,8 +155,7 @@ timestamp!(sd, "Desconocido") ## Preprocessing Documents -Having easy access to the text of a document and its metadata is very -important, but most text analysis tasks require some amount of preprocessing. +Having easy access to the text of a document and its metadata is important, but most text analysis tasks require some preprocessing. At a minimum, your text source may contain corrupt characters. You can remove these using the `remove_corrupt_utf8!()` function: @@ -190,10 +164,7 @@ these using the `remove_corrupt_utf8!()` function: remove_corrupt_utf8! ``` -Alternatively, you may want to edit the text to remove items that are hard -to process automatically. For example, our sample text sentence taken from Hamlet -has three periods that we might like to discard. We can remove this kind of -punctuation using the `prepare!()` function: +Alternatively, you may want to edit the text to remove items that are difficult to process automatically. For example, text may contain punctuation that you want to discard. You can remove punctuation using the `prepare!()` function: ```@repl using TextAnalysis @@ -202,9 +173,7 @@ prepare!(str, strip_punctuation) text(str) ``` -* To remove case distinctions, use `remove_case!()` function: -* At times you'll want to remove specific words from a document like a person's -name. To do that, use the `remove_words!()` function: +To remove case distinctions, use the `remove_case!()` function. You may also want to remove specific words from a document, such as a person's name. To do that, use the `remove_words!()` function: ```@repl using TextAnalysis @@ -215,16 +184,14 @@ remove_words!(sd, ["lear"]) text(sd) ``` -At other times, you'll want to remove whole classes of words. To make this -easier, we can use several classes of basic words defined by the Languages.jl -package: +At other times, you'll want to remove entire classes of words. To make this easier, you can use several classes of basic words defined by the Languages.jl package: -* _Articles_ : "a", "an", "the" -* _Indefinite Articles_ : "a", "an" -* _Definite Articles_ : "the" -* _Prepositions_ : "across", "around", "before", ... -* _Pronouns_ : "I", "you", "he", "she", ... -* _Stop Words_ : "all", "almost", "alone", ... +* **Articles**: "a", "an", "the" +* **Indefinite Articles**: "a", "an" +* **Definite Articles**: "the" +* **Prepositions**: "across", "around", "before", ... +* **Pronouns**: "I", "you", "he", "she", ... +* **Stop Words**: "all", "almost", "alone", ... 
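The `prepare!` flags corresponding to these classes are listed just below; as a minimal sketch (the sample sentence is made up), several of them can be combined in a single call:

```julia
using TextAnalysis

sd = StringDocument("The 3 dogs and a cat sat around the old tree")
# OR the flags together to strip several word classes in one pass
prepare!(sd, strip_articles | strip_numbers | strip_stopwords)
text(sd)
```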
These special classes can all be removed using specially-named parameters: @@ -240,16 +207,11 @@ These special classes can all be removed using specially-named parameters: * `prepare!(sd, strip_frequent_terms)` * `prepare!(sd, strip_html_tags)` -These functions use words lists, so they are capable of working for many -different languages without change, also these operations can be combined -together for improved performance: +These functions use word lists, so they work with many different languages without modification. These operations can also be combined for improved performance: * `prepare!(sd, strip_articles| strip_numbers| strip_html_tags)` -In addition to removing words, it is also common to take words that are -closely related like "dog" and "dogs" and stem them in order to produce a -smaller set of words for analysis. We can do this using the `stem!()` -function: +In addition to removing words, it is also common to take words that are closely related like "dog" and "dogs" and stem them to produce a smaller set of words for analysis. You can do this using the `stem!()` function: ```@repl using TextAnalysis diff --git a/docs/src/evaluation_metrics.md b/docs/src/evaluation_metrics.md index a07258cf..5909ee95 100644 --- a/docs/src/evaluation_metrics.md +++ b/docs/src/evaluation_metrics.md @@ -1,7 +1,6 @@ ## Evaluation Metrics -Natural Language Processing tasks require certain Evaluation Metrics. -As of now TextAnalysis provides the following evaluation metrics. +Natural Language Processing tasks require evaluation metrics. TextAnalysis currently provides the following evaluation metrics: * [ROUGE-N](https://en.wikipedia.org/wiki/ROUGE_(metric)) * [ROUGE-L](https://en.wikipedia.org/wiki/ROUGE_(metric)) @@ -9,8 +8,7 @@ As of now TextAnalysis provides the following evaluation metrics. * [BLEU (bilingual evaluation understudy)](https://en.wikipedia.org/wiki/BLEU) ## ROUGE-N, ROUGE-L, ROUGE-L-Summary -This metric evaluation based on the overlap of N-grams -between the system and reference summaries. +These metrics evaluate text based on the overlap of N-grams between the system and reference summaries. ```@docs argmax @@ -20,16 +18,40 @@ rouge_l_sentence rouge_l_summary ``` +### ROUGE-N Example + +```@example +using TextAnalysis + +candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits." +reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."] + +# Calculate ROUGE-N scores for different N values +rouge_2_scores = rouge_n(reference_summaries, candidate_summary, 2) +rouge_1_scores = rouge_n(reference_summaries, candidate_summary, 1) + +# Get the best scores using argmax +results = [rouge_2_scores, rouge_1_scores] .|> argmax +``` + +### ROUGE-L Examples + +ROUGE-L measures the longest common subsequence between the candidate and reference summaries: + ```@example using TextAnalysis -candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits." -reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. 
Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."] +candidate = "Brazil, Russia, China and India are growing nations." +references = [ + "Brazil, Russia, India and China are the next big political powers.", + "Brazil, Russia, India and China are BRIC nations." +] -results = [ - rouge_n(reference_summaries, candidate_summary, 2), - rouge_n(reference_summaries, candidate_summary, 1) -] .|> argmax +# ROUGE-L for sentence-level evaluation +sentence_scores = rouge_l_sentence(references, candidate) + +# ROUGE-L for summary-level evaluation (requires β parameter) +summary_scores = rouge_l_summary(references, candidate, 8) ``` ## BLEU (bilingual evaluation understudy) @@ -38,32 +60,34 @@ results = [ bleu_score ``` -[NLTK sample](https://www.nltk.org/api/nltk.translate.bleu_score.html) +Example adapted from [NLTK](https://www.nltk.org/api/nltk.translate.bleu_score.html): + ```@example - using TextAnalysis - - reference1 = [ - "It", "is", "a", "guide", "to", "action", "that", - "ensures", "that", "the", "military", "will", "forever", - "heed", "Party", "commands" - ] - reference2 = [ - "It", "is", "the", "guiding", "principle", "which", - "guarantees", "the", "military", "forces", "always", - "being", "under", "the", "command", "of", "the", - "Party" - ] - reference3 = [ - "It", "is", "the", "practical", "guide", "for", "the", - "army", "always", "to", "heed", "the", "directions", - "of", "the", "party" - ] - - hypothesis1 = [ - "It", "is", "a", "guide", "to", "action", "which", - "ensures", "that", "the", "military", "always", - "obeys", "the", "commands", "of", "the", "party" - ] - - score = bleu_score([[reference1, reference2, reference3]], [hypothesis1]) +using TextAnalysis + +reference1 = [ + "It", "is", "a", "guide", "to", "action", "that", + "ensures", "that", "the", "military", "will", "forever", + "heed", "Party", "commands" +] +reference2 = [ + "It", "is", "the", "guiding", "principle", "which", + "guarantees", "the", "military", "forces", "always", + "being", "under", "the", "command", "of", "the", + "Party" +] +reference3 = [ + "It", "is", "the", "practical", "guide", "for", "the", + "army", "always", "to", "heed", "the", "directions", + "of", "the", "party" +] + +hypothesis1 = [ + "It", "is", "a", "guide", "to", "action", "which", + "ensures", "that", "the", "military", "always", + "obeys", "the", "commands", "of", "the", "party" +] + +# Calculate BLEU score +score = bleu_score([[reference1, reference2, reference3]], [hypothesis1]) ``` diff --git a/docs/src/example.md b/docs/src/example.md index a1444414..2415db74 100644 --- a/docs/src/example.md +++ b/docs/src/example.md @@ -1,31 +1,72 @@ # Extended Usage Example -To show you how text analysis might work in practice, we're going to work with -a text corpus composed of political speeches from American presidents given -as part of the State of the Union Address tradition. +To show you how text analysis works in practice, we'll work with a text corpus composed of political speeches from American presidents given as part of the State of the Union Address tradition. 
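If the `sotu` transcripts aren't available locally, the same pipeline can be exercised on a small in-memory corpus first (the document contents below are made up):

```julia
using TextAnalysis

# A tiny stand-in corpus for experimenting without the sotu files
toy = Corpus([
    StringDocument("freedom and the union endure"),
    StringDocument("the union grows and freedom spreads"),
])
update_lexicon!(toy)
update_inverse_index!(toy)
toy["freedom"]                        # indices of documents mentioning "freedom"
dtm(DocumentTermMatrix(toy), :dense)  # dense document-term matrix
```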
```julia - using TextAnalysis, MultivariateStats, Clustering +using TextAnalysis - crps = DirectoryCorpus("sotu") +# Load a directory of text files as a corpus +# Note: For testing, use "test/data/sotu" path in the TextAnalysis.jl repository +crps = DirectoryCorpus("sotu") - standardize!(crps, StringDocument) +# Standardize all documents to StringDocument type for consistency +standardize!(crps, StringDocument) - crps = Corpus(crps[1:30]) +# Work with a subset for faster processing +# Note: Adjust the range based on available documents (e.g., 1:25 if only 29 documents exist) +crps = Corpus(crps[1:min(length(crps), 25)]) - remove_case!(crps) - prepare!(crps, strip_punctuation) +# Preprocessing: convert to lowercase and remove punctuation +remove_case!(crps) +prepare!(crps, strip_punctuation) - update_lexicon!(crps) - update_inverse_index!(crps) +# Build the lexicon and inverse index for efficient searching +update_lexicon!(crps) +update_inverse_index!(crps) - crps["freedom"] +# Search for documents containing specific terms +freedom_docs = crps["freedom"] +println("Documents mentioning 'freedom': ", length(freedom_docs)) - m = DocumentTermMatrix(crps) +# Create a document-term matrix for numerical analysis +m = DocumentTermMatrix(crps) - D = dtm(m, :dense) +# Convert to dense matrix representation +D = dtm(m, :dense) +println("Document-term matrix size: ", size(D)) - T = tf_idf(D) +# Apply TF-IDF (Term Frequency-Inverse Document Frequency) transformation +T = tf_idf(D) +println("TF-IDF matrix size: ", size(T)) - cl = kmeans(T, 5) +# Additional analysis examples +println("\nCorpus Statistics:") +println(" Vocabulary size: ", length(lexicon(crps))) +println(" Matrix density: ", count(x -> x > 0, D) / length(D)) + +# Find most frequent terms +lex = lexicon(crps) +sorted_words = sort(collect(lex), by=x->x[2], rev=true) +println(" Most frequent terms: ", [word for (word, count) in sorted_words[1:5]]) + +# Search for documents containing multiple terms +america_docs = crps["america"] +democracy_docs = crps["democracy"] +println(" Documents mentioning 'america': ", length(america_docs)) +println(" Documents mentioning 'democracy': ", length(democracy_docs)) + +# For clustering analysis, you would need additional packages: +# using MultivariateStats, Clustering +# cl = kmeans(T, 5) ``` + +This example demonstrates the core TextAnalysis workflow: + +1. **Data Loading**: Load multiple documents from a directory +2. **Standardization**: Ensure all documents use the same representation +3. **Preprocessing**: Clean the text (case normalization, punctuation removal) +4. **Indexing**: Build lexicon and inverse index for efficient operations +5. **Search**: Find documents containing specific terms +6. **Vectorization**: Convert text to numerical representation (DTM) +7. **Transformation**: Apply TF-IDF weighting for better feature representation +8. **Analysis**: Explore corpus statistics and term frequencies diff --git a/docs/src/features.md b/docs/src/features.md index 12e90976..eb95e6db 100644 --- a/docs/src/features.md +++ b/docs/src/features.md @@ -1,8 +1,6 @@ ## Creating a Document Term Matrix -Often we want to represent documents as a matrix of word counts so that we -can apply linear algebra operations and statistical techniques. Before -we do this, we need to update the lexicon: +Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. 
Before we do this, we need to update the lexicon: ```@repl using TextAnalysis @@ -12,8 +10,7 @@ update_lexicon!(crps) m = DocumentTermMatrix(crps) ``` -A `DocumentTermMatrix` object is a special type. If you would like to use -a simple sparse matrix, call `dtm()` on this object: +A `DocumentTermMatrix` object is a special type. If you want to use a simple sparse matrix, call `dtm()` on this object: ```julia julia> dtm(m) @@ -30,35 +27,28 @@ julia> dtm(m) [2, 6] = 1 ``` -If you would like to use a dense matrix instead, you can pass this as -an argument to the `dtm` function: +If you want to use a dense matrix instead, you can pass this as an argument to the `dtm` function: ```julia julia> dtm(m, :dense) -2×6 Array{Int64,2}: +2×6 Matrix{Int64}: 1 2 0 1 1 1 1 0 2 1 1 1 ``` ## Creating Individual Rows of a Document Term Matrix -In many cases, we don't need the entire document term matrix at once: we can -make do with just a single row. You can get this using the `dtv` function. -Because individual's document do not have a lexicon associated with them, we -have to pass in a lexicon as an additional argument: +In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the `dtv` function. Because individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument: ```julia julia> dtv(crps[1], lexicon(crps)) -1×6 Array{Int64,2}: +1×6 Matrix{Int64}: 1 2 0 1 1 1 ``` ## The Hash Trick -The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the -"Hash Trick" in which we replace terms with their hashed valued using a hash -function that outputs integers from 1 to N. To construct such a hash function, -you can use the `TextHashFunction(N)` constructor: +The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can employ a trick called the "Hash Trick" in which we replace terms with their hashed values using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the `TextHashFunction(N)` constructor: ```julia julia> h = TextHashFunction(10) @@ -81,16 +71,15 @@ entries by calling the `hash_dtv` function: ```julia julia> hash_dtv(crps[1], h) -1×10 Array{Int64,2}: +1×10 Matrix{Int64}: 0 2 0 0 1 3 0 0 0 0 ``` -This can be done for a corpus as a whole to construct a DTM without defining -a lexicon in advance: +This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance: ```julia julia> hash_dtm(crps, h) -2×10 Array{Int64,2}: +2×10 Matrix{Int64}: 0 2 0 0 1 3 0 0 0 0 0 2 0 0 1 1 0 0 2 0 ``` @@ -100,30 +89,28 @@ using just one argument: ```julia julia> hash_dtm(crps) -2×100 Array{Int64,2}: +2×100 Matrix{Int64}: 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ``` -Moreover, if you do not specify a hash function for just one row of the hash -DTM, a default hash function will be constructed for you: +Moreover, if you do not specify a hash function for just one row of the hash DTM, a default hash function will be constructed for you: ```julia julia> hash_dtv(crps[1]) -1×100 Array{Int64,2}: +1×100 Matrix{Int64}: 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 ``` ## TF (Term Frequency) -Often we need to find out the proportion of a document is contributed -by each term. 
This can be done by finding the term frequency function +Often we need to find out what proportion of a document is contributed by each term. This can be done using the term frequency function: ```@docs tf ``` -The parameter, `dtm` can be of the types - `DocumentTermMatrix` , `SparseMatrixCSC` or `Matrix` +The parameter `dtm` can be of the types `DocumentTermMatrix`, `SparseMatrixCSC`, or `Matrix`. ```@repl using TextAnalysis @@ -155,23 +142,20 @@ m = DocumentTermMatrix(crps) tf_idf(m) ``` -As you can see, TF-IDF has the effect of inserting 0's into the columns of -words that occur in all documents. This is a useful way to avoid having to -remove those words during preprocessing. +As you can see, TF-IDF has the effect of inserting 0's into the columns of words that occur in all documents. This is a useful way to avoid having to remove those words during preprocessing. ## Okapi BM-25 -From the document term matparamterix, [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) document-word statistic can be created. +From the document term matrix, [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25) document-word statistics can be created. bm_25(dtm::AbstractMatrix; κ, β) bm_25(dtm::DocumentTermMatrixm, κ, β) -It can also be used via the following methods Overwrite the `bm25` with calculated weights. +It can also be used via the following method to overwrite the `bm25` with calculated weights: bm_25!(dtm, bm25, κ, β) -The inputs matrices can also be a `Sparse Matrix`. -The parameters κ and β default to 2 and 0.75 respectively. +The input matrices can also be a `SparseMatrix`. The parameters κ and β default to 2 and 0.75 respectively. Here is an example usage - @@ -189,25 +173,19 @@ m = DocumentTermMatrix(crps) bm_25(m) ``` -## Co occurrence matrix (COOM) +## Co-occurrence Matrix (COOM) -The elements of the Co occurrence matrix indicate how many times two words co-occur -in a (sliding) word window of a given size. -The COOM can be calculated for objects of type `Corpus`, -`AbstractDocument` (with the exception of `NGramDocument`). +The elements of the co-occurrence matrix indicate how many times two words co-occur in a (sliding) word window of a given size. The COOM can be calculated for objects of type `Corpus` and `AbstractDocument` (with the exception of `NGramDocument`). CooMatrix(crps; window, normalize) CooMatrix(doc; window, normalize) It takes following keyword arguments: -* `window::Integer` -length of the Window size, defaults to `5`. The actual size of the sliding window is 2 * window + 1, with the keyword argument window specifying how many words to consider to the left and right of the center one -* `normalize::Bool` -normalizes counts to distance between words, defaults to `true` +* `window::Integer`: Length of the window size, defaults to `5`. The actual size of the sliding window is 2 * window + 1, with the keyword argument `window` specifying how many words to consider to the left and right of the center word. +* `normalize::Bool`: Normalizes counts to distance between words, defaults to `true`. -It returns the `CooMatrix` structure from which -the matrix can be extracted using `coom(::CooMatrix)`. -The `terms` can also be extracted from this. -Here is an example usage - +It returns the `CooMatrix` structure from which the matrix can be extracted using `coom(::CooMatrix)`. The `terms` can also be extracted from this structure. 
Here is an example usage: ```@repl using TextAnalysis @@ -217,10 +195,7 @@ coom(C) C.terms ``` -It can also be called to calculate the terms for -a specific list of words / terms in the document. -In other cases it calculates the the co occurrence elements -for all the terms. +It can also be called to calculate the terms for a specific list of words/terms in the document. Otherwise, it calculates the co-occurrence elements for all terms. CooMatrix(crps, terms; window, normalize) CooMatrix(doc, terms; window, normalize) @@ -235,22 +210,20 @@ CooMatrix{Float64}( ``` -The type can also be specified for `CooMatrix` -with the weights of type `T`. `T` defaults to `Float64`. +The type can also be specified for `CooMatrix` with weights of type `T`. `T` defaults to `Float64`. CooMatrix{T}(crps; window, normalize) where T <: AbstractFloat CooMatrix{T}(doc; window, normalize) where T <: AbstractFloat CooMatrix{T}(crps, terms; window, normalize) where T <: AbstractFloat CooMatrix{T}(doc, terms; window, normalize) where T <: AbstractFloat -Remarks: +**Remarks:** -* The sliding window used to count co-occurrences does not take into consideration sentence stops however, it does with documents i.e. does not span across documents -* The co-occurrence matrices of the documents in a corpus are summed up when calculating the matrix for an entire corpus +* The sliding window used to count co-occurrences does not take sentence boundaries into consideration; however, it respects document boundaries (i.e., it does not span across documents). +* The co-occurrence matrices of the documents in a corpus are summed when calculating the matrix for an entire corpus. !!! note - The Co occurrence matrix does not work for `NGramDocument`, - or a Corpus containing an `NGramDocument`. + The co-occurrence matrix does not work for `NGramDocument` or a Corpus containing an `NGramDocument`. ```julia julia> C = CooMatrix(NGramDocument("A document"), window=1, normalize=false) # fails, documents are NGramDocument @@ -263,4 +236,4 @@ TextAnalysis offers a simple text-rank based summarizer for its various document ```@docs summarize -``` \ No newline at end of file +``` diff --git a/docs/src/index.md b/docs/src/index.md index 1027d57c..2890d695 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,7 +1,15 @@ ## Preface -This manual is designed to get you started doing text analysis in Julia. -It assumes that you already familiar with the basic methods of text analysis. +This manual is designed to get you started with text analysis in Julia. It assumes that you are already familiar with the basic methods of text analysis. + +TextAnalysis.jl provides a comprehensive suite of tools for analyzing text data, including: + +* Document representation and preprocessing +* Corpus creation and management +* Feature extraction (TF-IDF, n-grams, co-occurrence matrices) +* Text classification (Naive Bayes) +* Evaluation metrics (ROUGE, BLEU) +* Text summarization and language models ## Installation @@ -11,9 +19,7 @@ The TextAnalysis package can be installed using Julia's package manager: ## Loading -In all of the examples that follow, we'll assume that you have the -TextAnalysis package fully loaded. This means that we think you've -implicitly typed +In all of the examples that follow, we'll assume that you have the TextAnalysis package fully loaded. This means that we assume you've implicitly typed using TextAnalysis @@ -21,5 +27,5 @@ before every snippet of code. 
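As a quick first taste, here is a minimal sketch (each call is covered in detail in the chapters that follow):

```julia
using TextAnalysis

sd = StringDocument("To be, or not to be, that is the question.")
prepare!(sd, strip_punctuation)
tokens(sd)
```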
## TextModels -The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the addition of practical neural network based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies. +The [TextModels](https://github.com/JuliaText/TextModels.jl) package enhances this library with the addition of practical neural network-based models. Some of that code used to live in this package, but was moved to simplify installation and dependencies. diff --git a/docs/src/semantic.md b/docs/src/semantic.md index 6eb2d7d6..e275a486 100644 --- a/docs/src/semantic.md +++ b/docs/src/semantic.md @@ -1,14 +1,11 @@ ## LSA: Latent Semantic Analysis -Often we want to think about documents -from the perspective of semantic content. -One standard approach to doing this, -is to perform Latent Semantic Analysis or LSA on the corpus. +Often we want to analyze documents from the perspective of their semantic content. One standard approach to doing this is to perform Latent Semantic Analysis (LSA) on the corpus. ```@docs lsa ``` -lsa uses `tf_idf` for statistics. +LSA uses `tf_idf` for computing term statistics. ```@repl @@ -19,7 +16,8 @@ crps = Corpus([ ]) lsa(crps) ``` -lsa can also be performed on a `DocumentTermMatrix`. + +LSA can also be performed directly on a `DocumentTermMatrix`: ```@repl using TextAnalysis crps = Corpus([ @@ -36,10 +34,9 @@ lsa(m) ## LDA: Latent Dirichlet Allocation -Another way to get a handle on the semantic content of a corpus is to use -[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation): +Another way to analyze the semantic content of a corpus is to use [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). -First we need to produce the DocumentTermMatrix +First, we need to create a DocumentTermMatrix: ```@docs lda ``` @@ -52,13 +49,18 @@ crps = Corpus([ update_lexicon!(crps) m = DocumentTermMatrix(crps) -k = 2 # number of topics -iterations = 1000 # number of gibbs sampling iterations -α = 0.1 # hyper parameter -β = 0.1 # hyper parameter +k = 2 # Number of topics +iterations = 1000 # Number of Gibbs sampling iterations +α = 0.1 # Hyperparameter for document-topic distribution +β = 0.1 # Hyperparameter for topic-word distribution -ϕ, θ = lda(m, k, iterations, α, β); -ϕ -θ +ϕ, θ = lda(m, k, iterations, α, β); +ϕ # Topic-word distribution matrix +θ # Document-topic distribution matrix ``` + +The `lda` function returns two matrices: +- `ϕ` (phi): The topic-word distribution matrix showing the probability of each word in each topic +- `θ` (theta): The document-topic distribution matrix showing the probability of each topic in each document + See `?lda` for more help. diff --git a/src/LM/api.jl b/src/LM/api.jl index 1498e4c2..7b2b593d 100644 --- a/src/LM/api.jl +++ b/src/LM/api.jl @@ -1,9 +1,9 @@ """ $(TYPEDSIGNATURES) -It is used to evaluate score with masks out of vocabulary words +Evaluate the score with masked out-of-vocabulary words. -The arguments are the same as for [`score`](@ref) +The arguments are the same as for [`score`](@ref). """ function maskedscore(m::Langmodel, temp_lm::DefaultDict, word, context)::Float64 score(m, temp_lm, lookup(m.vocab, [word])[begin], lookup(m.vocab, [context])[begin]) @@ -12,9 +12,9 @@ end """ $(TYPEDSIGNATURES) -Evaluate the log score of this word in this context. +Evaluate the log score of a word in a given context. 
-The arguments are the same as for [`score`](@ref) and [`maskedscore`](@ref) +The arguments are the same as for [`score`](@ref) and [`maskedscore`](@ref). """ function logscore(m::Langmodel, temp_lm::DefaultDict, word, context)::Float64 log2(maskedscore(m, temp_lm, word, context)) @@ -23,9 +23,9 @@ end """ $(TYPEDSIGNATURES) -Calculate *cross-entropy* of model for given evaluation text. +Calculate the cross-entropy of the model for a given evaluation text. -Input text must be `Vector` of ngram of same lengths +Input text must be a `Vector` of n-grams of the same length. """ function entropy(m::Langmodel, lm::DefaultDict, text_ngram::AbstractVector)::Float64 n_sum = sum(text_ngram) do ngram @@ -38,9 +38,9 @@ end """ $(TYPEDSIGNATURES) -Calculates the perplexity of the given text. +Calculate the perplexity of the given text. -This is simply 2 ** cross-entropy(entropy) for the text, so the arguments are the same as [`entropy`](@ref) +This is simply `2^entropy` for the text, so the arguments are the same as [`entropy`](@ref). """ function perplexity(m::Langmodel, lm::DefaultDict, text_ngram::AbstractVector)::Float64 return 2^(entropy(m, lm, text_ngram)) diff --git a/src/LM/counter.jl b/src/LM/counter.jl index f6843340..97e8dc0f 100644 --- a/src/LM/counter.jl +++ b/src/LM/counter.jl @@ -3,8 +3,8 @@ using DataStructures """ $(TYPEDSIGNATURES) -counter is used to make conditional distribution, which is used by score functions to -calculate conditional frequency distribution +Create a conditional distribution counter, which is used by score functions to +calculate conditional frequency distributions. """ function counter2(data, min::Integer, max::Integer) data = everygram(data, min_len=min, max_len=max) diff --git a/src/LM/langmodel.jl b/src/LM/langmodel.jl index bd28b8e2..628bcfbf 100644 --- a/src/LM/langmodel.jl +++ b/src/LM/langmodel.jl @@ -13,9 +13,9 @@ end """ MLE(word::Vector{T}, unk_cutoff=1, unk_label="") where {T <: AbstractString} -Initiate Type for providing MLE ngram model scores. +Initialize a type for providing MLE n-gram model scores. -Implementation of Base Ngram Model. +Implementation of the base n-gram model using Maximum Likelihood Estimation. """ function MLE(word::Vector{T}, unk_cutoff=1, unk_label="") where {T<:AbstractString} @@ -36,10 +36,10 @@ end """ Lidstone(word::Vector{T}, gamma:: Float64, unk_cutoff=1, unk_label="") where {T <: AbstractString} -Function to initiate Type(Lidstone) for providing Lidstone-smoothed scores. +Initialize a Lidstone type for providing Lidstone-smoothed scores. -In addition to initialization arguments from BaseNgramModel also requires -a number by which to increase the counts, gamma. +In addition to initialization arguments from the base n-gram model, this also requires +a number by which to increase the counts (gamma). """ function Lidstone(word::Vector{T}, gamma=1.0, unk_cutoff=1, unk_label="") where {T<:AbstractString} Lidstone(Vocabulary(word, unk_cutoff, unk_label), gamma) @@ -54,10 +54,10 @@ end """ Laplace(word::Vector{T}, unk_cutoff=1, unk_label="") where {T <: AbstractString} -Function to initiate Type(Laplace) for providing Laplace-smoothed scores. +Initialize a Laplace type for providing Laplace-smoothed scores. -In addition to initialization arguments from BaseNgramModel also requires -a number by which to increase the counts, gamma = 1. +In addition to initialization arguments from the base n-gram model, this uses +a smoothing parameter gamma = 1. 
""" struct Laplace <: gammamodel vocab::Vocabulary @@ -77,9 +77,9 @@ end """ score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString) -score is used to output probability of word given that context +Compute the probability of a word given its context using add-one smoothing. -Add-one smoothing to Lidstone or Laplace(gammamodel) models +Applies add-one smoothing to Lidstone or Laplace (gammamodel) models. """ function score(m::gammamodel, temp_lm::DefaultDict, word, context) #score for gammamodel output probabl @@ -93,9 +93,9 @@ end """ $(TYPEDSIGNATURES) -To get probability of word given that context +Get the probability of a word given its context. -In other words, for given context calculate frequency distribution of word +In other words, for a given context, calculate the frequency distribution of words. """ function prob(m::Langmodel, templ_lm::DefaultDict, word, context=nothing)::Float64 (isnothing(context) || isempty(context)) && return 1.0 / length(templ_lm) #provide distribution @@ -116,7 +116,7 @@ end """ score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString) -score is used to output probability of word given that context in MLE +Compute the probability of a word given its context using MLE (Maximum Likelihood Estimation). """ function score(m::MLE, temp_lm::DefaultDict, word, context=nothing) @@ -128,9 +128,9 @@ struct WittenBellInterpolated <: InterpolatedLanguageModel end """ - WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="") where { T <: AbstractString} + WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="") where {T <: AbstractString} -Initiate Type for providing Interpolated version of Witten-Bell smoothing. +Initialize a type for providing an interpolated version of Witten-Bell smoothing. The idea to abstract this comes from Chen & Goodman 1995. @@ -175,10 +175,9 @@ end """ score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString) -score is used to output probability of word given that context in InterpolatedLanguageModel +Compute the probability of a word given its context in an interpolated language model. -Apply Kneserney and WittenBell smoothing -depending upon the sub-Type +Applies Kneser-Ney and Witten-Bell smoothing depending on the sub-type. """ function score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word, context=nothing) @@ -204,9 +203,9 @@ struct KneserNeyInterpolated <: InterpolatedLanguageModel end """ - KneserNeyInterpolated(word::Vector{T}, discount:: Float64,unk_cutoff=1, unk_label="") where {T <: AbstractString} + KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="") where {T <: AbstractString} -Initiate Type for providing KneserNey Interpolated language model. +Initialize a type for providing a Kneser-Ney interpolated language model. The idea to abstract this comes from Chen & Goodman 1995. diff --git a/src/LM/preprocessing.jl b/src/LM/preprocessing.jl index d20540cd..82c92e18 100644 --- a/src/LM/preprocessing.jl +++ b/src/LM/preprocessing.jl @@ -1,14 +1,14 @@ """ - everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1)where { T <: AbstractString} + everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString} -Return all possible ngrams generated from sequence of items, as an Array{String,1} +Return all possible n-grams generated from a sequence of items, as a `Vector{String}`. 
# Example ```julia-repl julia> seq = ["To","be","or","not"] -julia> a = everygram(seq,min_len=1, max_len=-1) - 10-element Array{Any,1}: +julia> a = everygram(seq, min_len=1, max_len=-1) + 10-element Vector{Any}: "or" "not" "To" @@ -34,18 +34,18 @@ function everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1)::Vector{Stri end """ - padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="", right_pad_symbol ="") where { T <: AbstractString} + padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="", right_pad_symbol="") where {T <: AbstractString} -padding _ngram is used to pad both left and right of sentence and out putting ngrmas of order n - - It also pad the original input Array of string +Pad both left and right sides of a sentence and output n-grams of order n. + +This function also pads the original input vector of strings. # Example ```julia-repl julia> example = ["1","2","3","4","5"] julia> padding_ngram(example,2,pad_left=true,pad_right=true) - 6-element Array{Any,1}: + 6-element Vector{Any}: " 1" "1 2" "2 3" @@ -69,15 +69,15 @@ function padding_ngram( end """ - ngramizenew( words::Vector{T}, nlist::Integer...) where { T <: AbstractString} + ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString} -ngramizenew is used to out putting ngrmas in set +Generate n-grams from a sequence of words. # Example ```julia-repl julia> seq=["To","be","or","not","To","not","To","not"] -julia> ngramizenew(seq ,2) - 7-element Array{Any,1}: +julia> ngramizenew(seq, 2) + 7-element Vector{Any}: "To be" "be or" "or not" diff --git a/src/LM/vocab.jl b/src/LM/vocab.jl index ad8f94f6..e424f67f 100644 --- a/src/LM/vocab.jl +++ b/src/LM/vocab.jl @@ -1,11 +1,12 @@ """ - Vocabulary(word,unk_cutoff =1 ,unk_label = "") + Vocabulary(word, unk_cutoff=1, unk_label="") + +Store language model vocabulary. -Stores language model vocabulary. Satisfies two common language modeling requirements for a vocabulary: - When checking membership and calculating its size, filters items -by comparing their counts to a cutoff value. -Adds a special "unknown" token which unseen words are mapped to. + by comparing their counts to a cutoff value. +- Adds a special "unknown" token which unseen words are mapped to. # Example ```julia-repl @@ -51,15 +52,15 @@ julia> lookup("a") julia> word = ["a", "-", "d", "c", "a"] -julia> lookup(vocabulary ,word) - 5-element Array{Any,1}: +julia> lookup(vocabulary, word) + 5-element Vector{Any}: "a" "" "d" "c" "a" -If given a sequence, it will return an Array{Any,1} of the looked up words as shown above. +If given a sequence, it will return a `Vector{Any}` of the looked up words as shown above. It's possible to update the counts after the vocabulary has been created. julia> update(vocabulary,["b","c","c"]) @@ -107,9 +108,9 @@ end """ $(TYPEDSIGNATURES) -lookup a sequence or words in the vocabulary +Look up a sequence of words in the vocabulary. -Return an Array of String +Return a vector of strings. See [`Vocabulary`](@ref) """ diff --git a/src/bayes.jl b/src/bayes.jl index 3a2f58c1..16f0777a 100644 --- a/src/bayes.jl +++ b/src/bayes.jl @@ -7,7 +7,7 @@ simpleTokenise(s) = WordTokenizers.tokenize(lowercase(replace(s, "." => ""))) """ $(TYPEDSIGNATURES) -Create a dict that maps elements in input array to their frequencies. +Create a dictionary that maps elements in input array to their frequencies. 
""" function frequencies(xs::AbstractVector{T})::Dict{T,Int} where {T<:Any} frequencies = Dict{eltype(xs),Int}() @@ -20,7 +20,7 @@ end """ $(TYPEDSIGNATURES) -Compute an Array, mapping the value corresponding to elements of `dict` to the input `AbstractDict`. +Compute an array, mapping the values corresponding to elements of `dict` from the input `AbstractDict`. """ function features(fs::AbstractDict, dict::AbstractVector)::Vector{Int} bag = Vector{Int}(undef, size(dict)) @@ -45,9 +45,9 @@ end A Naive Bayes Classifier for classifying documents. -It takes two arguments: -* `classes`: An array of possible classes that the concerned data could belong to. -* `dict`:(Optional Argument) An Array of possible tokens (words). This is automatically updated if a new token is detected in the Step 2) or 3) +# Arguments +- `classes`: Array of possible classes that the data could belong to +- `dict`: (Optional) Array of possible tokens (words). This is automatically updated if a new token is detected during training or prediction # Example ```julia-repl @@ -79,7 +79,7 @@ probabilities(c::NaiveBayesClassifier) = c.weights ./ sum(c.weights, dims=1) """ extend!(model::NaiveBayesClassifier, dictElement) -Add the dictElement to dictionary of the Classifier `model`. +Add the `dictElement` to the dictionary of the classifier `model`. """ function extend!(c::NaiveBayesClassifier, dictElement) push!(c.dict, dictElement) diff --git a/src/coom.jl b/src/coom.jl index e76cb151..f25e5ca0 100644 --- a/src/coom.jl +++ b/src/coom.jl @@ -11,7 +11,7 @@ coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol) Basic low-level function that calculates the co-occurrence matrix of a document. -Returns a sparse co-occurrence matrix sized `n × n` where `n = length(vocab)` +Return a sparse co-occurrence matrix sized `n × n` where `n = length(vocab)` with elements of type `T`. The document `doc` is represented by a vector of its terms (in order)`. The keywords `window` and `normalize` indicate the size of the sliding word window in which co-occurrences are counted and whether to normalize @@ -86,13 +86,11 @@ coo_matrix(::Type{T}, doc::Vector{<:AbstractString}, vocab::Dict{<:AbstractStrin """ Basic Co-occurrence Matrix (COOM) type. + # Fields - * `coom::SparseMatriCSC{T,Int}` the actual COOM; elements represent -co-occurrences of two terms within a given window - * `terms::Vector{String}` a list of terms that represent the lexicon of -the document or corpus - * `column_indices::OrderedDict{String, Int}` a map between the `terms` and the -columns of the co-occurrence matrix +* `coom::SparseMatrixCSC{T,Int}`: The actual COOM; elements represent co-occurrences of two terms within a given window. +* `terms::Vector{String}`: A list of terms that represent the lexicon of the document or corpus. +* `column_indices::OrderedDict{String, Int}`: A map between the `terms` and the columns of the co-occurrence matrix. """ struct CooMatrix{T} coom::SparseMatrixCSC{T,Int} @@ -104,11 +102,9 @@ end """ CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true]) -Auxiliary constructor(s) of the `CooMatrix` type. The type `T` has to be -a subtype of `AbstractFloat`. The constructor(s) requires a corpus `crps` and -a `terms` structure representing the lexicon of the corpus. The latter -can be a `Vector{String}`, an `AbstractDict` where the keys are the lexicon, -or can be omitted, in which case the `lexicon` field of the corpus is used. 
+Auxiliary constructors of the `CooMatrix` type. The type `T` must be a subtype of `AbstractFloat`. + +The constructors require a corpus `crps` and a `terms` structure representing the lexicon of the corpus. The latter can be a `Vector{String}`, an `AbstractDict` where the keys are the lexicon, or can be omitted, in which case the `lexicon` field of the corpus is used. """ function CooMatrix{T}(crps::Corpus, terms::Vector{String}; @@ -182,11 +178,11 @@ Access the co-occurrence matrix field `coom` of a `CooMatrix` `c`. coom(c::CooMatrix) = c.coom """ - coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true]) + coom(entity, eltype=Float [;window=5, normalize=true]) Access the co-occurrence matrix of the `CooMatrix` associated -with the `entity`. The `CooMatrix{T}` will first have to -be created in order for the actual matrix to be accessed. +with the `entity`. The `CooMatrix{T}` will first be +created in order for the actual matrix to be accessed. """ coom(entity, eltype::Type{T}=Float; window::Int=5, normalize::Bool=true, mode::Symbol=:default) where {T<:AbstractFloat} = diff --git a/src/corpus.jl b/src/corpus.jl index 3190b553..9d3b273b 100644 --- a/src/corpus.jl +++ b/src/corpus.jl @@ -35,7 +35,7 @@ function Corpus(docs::Vector{T}) where {T<:AbstractDocument} ) end -Corpus(docs::Vector{Any}) = Corpus(convert(Array{GenericDocument,1}, docs)) +Corpus(docs::Vector{Any}) = Corpus(convert(Vector{GenericDocument}, docs)) """ DirectoryCorpus(dirname::AbstractString) @@ -158,9 +158,9 @@ end """ lexicon(crps::Corpus) -Shows the lexicon of the corpus. +Return the lexicon of the corpus. -Lexicon of a corpus consists of all the terms that occur in any document in the corpus. +The lexicon of a corpus consists of all terms that occur in any document in the corpus. # Example ```julia-repl @@ -222,19 +222,19 @@ lexical_frequency(crps::Corpus, term::AbstractString) = """ inverse_index(crps::Corpus) -Shows the inverse index of a corpus. +Return the inverse index of a corpus. If we are interested in a specific term, we often want to know which documents in a corpus -contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm. +contain that term. The inverse index provides this information and enables a simplistic search algorithm. """ inverse_index(crps::Corpus) = crps.inverse_index function update_inverse_index!(crps::Corpus) - idx = Dict{String,Array{Int,1}}() + idx = Dict{String,Vector{Int}}() for i in 1:length(crps) doc = crps.documents[i] ngram_arr = isa(doc, NGramDocument) ? collect(keys(ngrams(doc))) : tokens(doc) - ngram_arr = convert(Array{String,1}, ngram_arr) + ngram_arr = convert(Vector{String}, ngram_arr) for ngram in ngram_arr key = get!(() -> [], idx, ngram) push!(key, i) diff --git a/src/document.jl b/src/document.jl index de27920a..e933f9a7 100644 --- a/src/document.jl +++ b/src/document.jl @@ -20,16 +20,14 @@ mutable struct DocumentMetadata custom::Any ) - Stores basic metadata about Document. + Store basic metadata about a document. - ... # Arguments - - `language`: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package. - - `title::String` : What is the title of the document? Defaults to "Untitled Document". - - `author::String` : Who wrote the document? Defaults to "Unknown Author". - - `timestamp::String` : When was the document written? Defaults to "Unknown Time". - - `custom` : user specific data field. Defaults to nothing. - ... 
+ - `language`: Language of the document (default: `Languages.English()`) + - `title`: Title of the document (default: "Untitled Document") + - `author`: Author of the document (default: "Unknown Author") + - `timestamp`: Timestamp when the document was written (default: "Unknown Time") + - `custom`: User-specific data field (default: `nothing`) """ DocumentMetadata( language::Language=Languages.English(), @@ -57,7 +55,7 @@ end """ FileDocument(pathname::AbstractString) -Represents a document using a plain text file on disk. +Represent a document using a plain text file on disk. # Example ```julia-repl @@ -88,7 +86,7 @@ end """ StringDocument(txt::AbstractString) -Represents a document using a UTF8 String stored in RAM. +Represent a document using a UTF8 String stored in RAM. # Example ```julia-repl @@ -117,12 +115,12 @@ end TokenDocument(txt::AbstractString, dm::DocumentMetadata) TokenDocument(tkns::Vector{T}) where T <: AbstractString -Represents a document as a sequence of UTF8 tokens. +Represent a document as a sequence of UTF8 tokens. # Example ```julia-repl julia> my_tokens = String["To", "be", "or", "not", "to", "be..."] -6-element Array{String,1}: +6-element Vector{String}: "To" "be" "or" @@ -159,7 +157,7 @@ end NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1) NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString -Represents a document as a bag of n-grams, which are UTF8 n-grams and map to counts. +Represent a document as a bag of n-grams, which are UTF8 n-grams that map to counts. # Example ```julia-repl @@ -260,7 +258,7 @@ A StringDocument{String} * Snippet: To be or not to be... julia> tokens(sd) -7-element Array{String,1}: +7-element Vector{String}: "To" "be" "or" diff --git a/src/dtm.jl b/src/dtm.jl index c60a3561..35c9cc7c 100644 --- a/src/dtm.jl +++ b/src/dtm.jl @@ -22,7 +22,7 @@ end """ columnindices(terms::Vector{String}) -Creates a column index lookup dictionary from a vector of terms. +Create a column index lookup dictionary from a vector of terms. """ function columnindices(terms::Vector{T}) where {T} column_indices = Dict{T,Int}() @@ -37,12 +37,12 @@ end DocumentTermMatrix(crps::Corpus) DocumentTermMatrix(crps::Corpus, terms::Vector{String}) DocumentTermMatrix(crps::Corpus, lex::AbstractDict) - DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int},terms::Vector{String}) + DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int}, terms::Vector{String}) Represent documents as a matrix of word counts. -Allow us to apply linear algebra operations and statistical techniques. -Need to update lexicon before use. +This representation allows linear algebra operations and statistical techniques +to be applied. The lexicon must be updated before use. # Examples ```julia-repl @@ -108,7 +108,7 @@ DocumentTermMatrix(dtm::SparseMatrixCSC{Int,Int}, terms::Vector{T}) where {T} = dtm(d::DocumentTermMatrix) dtm(d::DocumentTermMatrix, density::Symbol) -Creates a simple sparse matrix of DocumentTermMatrix object. +Create a sparse matrix from a DocumentTermMatrix object. # Examples ```julia-repl @@ -131,7 +131,7 @@ julia> dtm(DocumentTermMatrix(crps)) [2, 6] = 1 julia> dtm(DocumentTermMatrix(crps), :dense) -2×6 Array{Int64,2}: +2×6 Matrix{Int64}: 1 2 0 1 1 1 1 0 2 1 1 1 ``` @@ -189,12 +189,12 @@ end Produce a single row of a DocumentTermMatrix. Individual documents do not have a lexicon associated with them, -we have to pass in a lexicon as an additional argument. +so a lexicon must be passed as an additional argument. 
# Examples ```julia-repl julia> dtv(crps[1], lexicon(crps)) -1×6 Array{Int64,2}: +1×6 Matrix{Int64}: 1 2 0 1 1 1 ``` """ @@ -222,7 +222,7 @@ end hash_dtv(d::AbstractDocument) hash_dtv(d::AbstractDocument, h::TextHashFunction) -Represents a document as a vector with N entries. +Represent a document as a vector with N entries. # Examples ```julia-repl @@ -233,11 +233,11 @@ julia> h = TextHashFunction(10) TextHashFunction(hash, 10) julia> hash_dtv(crps[1], h) -1×10 Array{Int64,2}: +1×10 Matrix{Int64}: 0 2 0 0 1 3 0 0 0 0 julia> hash_dtv(crps[1]) -1×100 Array{Int64,2}: +1×100 Matrix{Int64}: 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 ``` """ @@ -258,7 +258,7 @@ hash_dtv(d::AbstractDocument) = hash_dtv(d, TextHashFunction()) hash_dtm(crps::Corpus) hash_dtm(crps::Corpus, h::TextHashFunction) -Represents a Corpus as a Matrix with N entries. +Represent a Corpus as a Matrix with N entries. """ function hash_dtm(crps::Corpus, h::TextHashFunction) n, p = length(crps), cardinality(h) @@ -355,8 +355,9 @@ end """ merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T} -Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. -For efficiency, this may result in modifications to dtm2 as well. +Merge one DocumentTermMatrix instance into another. Documents are appended +to the end and terms are re-sorted. For efficiency, this may result in +modifications to dtm2 as well. """ function merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T} (length(dtm2.dtm) == 0) && (return dtm1) diff --git a/src/evaluation_metrics.jl b/src/evaluation_metrics.jl index af94bd0a..89cf33da 100644 --- a/src/evaluation_metrics.jl +++ b/src/evaluation_metrics.jl @@ -34,9 +34,9 @@ Base.show(io::IO, score::Score) = Base.write(io, """ average(scores::Vector{Score})::Score -* scores - vector of [`Score`](@ref) +* `scores` - Vector of [`Score`](@ref) objects -Returns average values of scores as a [`Score`](@ref) with precision/recall/fmeasure +Return average values of scores as a [`Score`](@ref) with precision/recall/fmeasure. """ function average(scores::Vector{Score})::Score res = reduce(scores, init=zeros(Float32, 3)) do acc, i @@ -52,9 +52,9 @@ end """ argmax(scores::Vector{Score})::Score -* scores - vector of [`Score`](@ref) +* `scores` - Vector of [`Score`](@ref) objects -Returns maximum by precision fiels of each [`Score`](@ref) +Return the maximum by f-measure field of each [`Score`](@ref). """ Base.argmax(scores::Vector{Score})::Score = argmax(s -> s.fmeasure, scores) @@ -68,14 +68,13 @@ Base.argmax(scores::Vector{Score})::Score = argmax(s -> s.fmeasure, scores) Compute n-gram recall between `candidate` and the `references` summaries. -The function takes the following arguments - +# Arguments +- `references::Vector{T} where T<: AbstractString` - List of reference summaries +- `candidate::AbstractString` - Input candidate summary to be scored against reference summaries +- `n::Integer` - Order of n-grams +- `lang::Language` - Language of the text, useful while generating n-grams (default: `Languages.English()`) -* `references::Vector{T} where T<: AbstractString` = The list of reference summaries. -* `candidate::AbstractString` = Input candidate summary, to be scored against reference summaries. -* `n::Integer` = Order of NGrams -* `lang::Language` = Language of the text, useful while generating N-grams. Defaults value is Languages.English() - -Returns a vector of [`Score`](@ref) +Return a vector of [`Score`](@ref) objects. 
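To make the ROUGE entry points documented above concrete, a small hypothetical sketch follows. The reference and candidate strings are invented, and `average` is qualified with `TextAnalysis.` in case it is not exported.

```julia
using TextAnalysis

refs = ["the cat sat on the mat", "there is a cat on the mat"]
cand = "the cat is on the mat"

scores = rouge_n(refs, cand, 2)         # one Score per reference, using bigrams
best   = argmax(scores)                 # Score with the highest f-measure
avg    = TextAnalysis.average(scores)   # averaged precision/recall/fmeasure
```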
See [Rouge: A package for automatic evaluation of summaries](http://www.aclweb.org/anthology/W04-1013) @@ -117,12 +116,14 @@ end Calculate the ROUGE-L score between `references` and `candidate` at sentence level. -Returns a vector of [`Score`](@ref) +Return a vector of [`Score`](@ref) objects. See [Rouge: A package for automatic evaluation of summaries](http://www.aclweb.org/anthology/W04-1013) -Note: the `weighted` argument enables weighting of values when calculating the longest common subsequence. -Initial implementation ROUGE-1.5.5.pl contains a power function. The function `weight_func` here has a power of 0.5 by default. +!!! note + The `weighted` argument enables weighting of values when calculating the longest common subsequence. + Initial implementation ROUGE-1.5.5.pl contains a power function. The function `weight_func` here + has a power of 0.5 by default. See also: [`rouge_n`](@ref), [`rouge_l_summary`](@ref) """ @@ -151,11 +152,11 @@ end Calculate the ROUGE-L score between `references` and `candidate` at summary level. -Returns a vector of [`Score`](@ref) +Return a vector of [`Score`](@ref) objects. See [Rouge: A package for automatic evaluation of summaries](http://www.aclweb.org/anthology/W04-1013) -See also: [`rouge_l_sentence()`](@ref), [`rouge_n`](@ref) +See also: [`rouge_l_sentence`](@ref), [`rouge_n`](@ref) """ function rouge_l_summary(references::Vector{<:AbstractString}, candidate::AbstractString, β::Int; lang=Languages.English())::Vector{Score} diff --git a/src/hash.jl b/src/hash.jl index 7f3005b1..3b0fc7a2 100644 --- a/src/hash.jl +++ b/src/hash.jl @@ -24,19 +24,18 @@ mutable struct TextHashFunction end """ -``` -TextHashFunction(cardinality) -TextHashFunction(hash_function, cardinality) -``` + TextHashFunction(cardinality) + TextHashFunction(hash_function, cardinality) -The need to create a lexicon before we can construct a document term matrix is often prohibitive. -We can often employ a trick that has come to be called the Hash Trick in which we replace terms -with their hashed valued using a hash function that outputs integers from 1 to N. +The need to create a lexicon before constructing a document term matrix is often prohibitive. +This implementation employs the "Hash Trick" technique, which replaces terms with their hashed +values using a hash function that outputs integers from 1 to N. -Parameters: - - cardinality = Max index used for hashing (default 100) - - hash_function = function used for hashing process (default function present, see code-base) +# Arguments +- `cardinality`: Maximum index used for hashing (default: 100) +- `hash_function`: Function used for hashing process (default: built-in `hash` function) +# Examples ```julia-repl julia> h = TextHashFunction(10) TextHashFunction(hash, 10) @@ -49,16 +48,15 @@ TextHashFunction() = TextHashFunction(hash, 100) cardinality(h::TextHashFunction) = h.cardinality """ -``` -index_hash(str, TextHashFunc) -``` + index_hash(str, TextHashFunc) -Shows mapping of string to integer. +Show mapping of string to integer using the hash trick. 
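A short sketch of the hash-trick workflow described above, combining `TextHashFunction` with `index_hash` and `hash_dtv`; the sample text and the bucket count of 10 are arbitrary.

```julia
using TextAnalysis

doc = StringDocument("the quick brown fox jumps over the lazy dog")
h   = TextHashFunction(10)    # map any term into one of 10 buckets

index_hash("fox", h)          # column index assigned to "fox" (between 1 and 10)
hash_dtv(doc, h)              # 1×10 hashed document-term vector, no lexicon required
```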
-Parameters: - - str = Max index used for hashing (default 100) - - TextHashFunc = TextHashFunction type object +# Arguments +- `str`: String to be hashed +- `TextHashFunc`: TextHashFunction object containing hash configuration +# Examples ```julia-repl julia> h = TextHashFunction(10) TextHashFunction(hash, 10) diff --git a/src/lda.jl b/src/lda.jl index ab07398c..464ae7ae 100644 --- a/src/lda.jl +++ b/src/lda.jl @@ -26,16 +26,21 @@ end Perform [Latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation). -# Required Positional Arguments -- `α` Dirichlet dist. hyperparameter for topic distribution per document. `α<1` yields a sparse topic mixture for each document. `α>1` yields a more uniform topic mixture for each document. -- `β` Dirichlet dist. hyperparameter for word distribution per topic. `β<1` yields a sparse word mixture for each topic. `β>1` yields a more uniform word mixture for each topic. - -# Optional Keyword Arguments -- `showprogress::Bool`. Show a progress bar during the Gibbs sampling. Default value: `true`. - -# Return Values -- `ϕ`: `ntopics × nwords` Sparse matrix of probabilities s.t. `sum(ϕ, 1) == 1` -- `θ`: `ntopics × ndocs` Dense matrix of probabilities s.t. `sum(θ, 1) == 1` +# Arguments +- `dtm::DocumentTermMatrix`: Document-term matrix containing the corpus +- `ntopics::Int`: Number of topics to extract +- `iterations::Int`: Number of Gibbs sampling iterations +- `α::Float64`: Dirichlet distribution hyperparameter for topic distribution per document. + `α < 1` yields a sparse topic mixture, `α > 1` yields a more uniform topic mixture +- `β::Float64`: Dirichlet distribution hyperparameter for word distribution per topic. + `β < 1` yields a sparse word mixture, `β > 1` yields a more uniform word mixture + +# Keyword Arguments +- `showprogress::Bool`: Show a progress bar during Gibbs sampling (default: `true`) + +# Returns +- `ϕ`: `ntopics × nwords` sparse matrix of word probabilities per topic +- `θ`: `ntopics × ndocs` dense matrix of topic probabilities per document """ function lda( dtm::DocumentTermMatrix, ntopics::Int, iteration::Int, diff --git a/src/lsa.jl b/src/lsa.jl index 21b24b7a..28d91e6e 100644 --- a/src/lsa.jl +++ b/src/lsa.jl @@ -2,7 +2,7 @@ lsa(dtm::DocumentTermMatrix) lsa(crps::Corpus) -Performs Latent Semantic Analysis or LSA on a corpus. +Perform Latent Semantic Analysis (LSA) on a corpus or document-term matrix. """ lsa(dtm::DocumentTermMatrix) = svd(Matrix(tf_idf(dtm))) diff --git a/src/metadata.jl b/src/metadata.jl index 5a8a457a..6587d2f1 100644 --- a/src/metadata.jl +++ b/src/metadata.jl @@ -72,7 +72,7 @@ end """ author!(doc, author) -Set the author metadata of doc to `author`. +Set the author metadata of `doc` to `author`. See also: [`author`](@ref), [`authors`](@ref), [`authors!`](@ref) """ @@ -83,7 +83,7 @@ end """ timestamp!(doc, timestamp::AbstractString) -Set the timestamp metadata of doc to `timestamp`. +Set the timestamp metadata of `doc` to `timestamp`. See also: [`timestamp`](@ref), [`timestamps`](@ref), [`timestamps!`](@ref) """ @@ -139,7 +139,7 @@ timestamps!(c::Corpus, nv::AbstractString) = timestamp!.(documents(c), Ref(nv)) Update titles of the documents in a Corpus. -If the input is a String, set the same title for all documents. If the input is a vector, set title of `i`th document to corresponding `i`th element in the vector `vec`. In the latter case, the number of documents must equal the length of vector. +If the input is a String, set the same title for all documents. 
If the input is a vector, set the title of the `i`th document to the corresponding `i`th element in the vector `vec`. In the latter case, the number of documents must equal the length of the vector. See also: [`titles`](@ref), [`title!`](@ref), [`title`](@ref) """ @@ -156,7 +156,7 @@ end Update languages of documents in a Corpus. -If the input is a Vector, then language of the `i`th document is set to the `i`th element in the vector, respectively. However, the number of documents must equal the length of vector. +If the input is a Vector, then the language of the `i`th document is set to the `i`th element in the vector, respectively. However, the number of documents must equal the length of the vector. See also: [`languages`](@ref), [`language!`](@ref), [`language`](@ref) """ diff --git a/src/ngramizer.jl b/src/ngramizer.jl index 1b84ff3e..7235a137 100644 --- a/src/ngramizer.jl +++ b/src/ngramizer.jl @@ -1,7 +1,7 @@ """ ngramize(lang, tokens, n) -Compute the ngrams of `tokens` of the order `n`. +Compute the n-grams of `tokens` of order `n`. # Example @@ -34,7 +34,7 @@ ngramize(lang::Language, str::AbstractString, n::Integer) = ngramize(lang, token """ onegramize(lang, tokens) -Create the unigrams dict for input tokens. +Create the unigrams dictionary for input tokens. # Example diff --git a/src/preprocessing.jl b/src/preprocessing.jl index f9a5ec49..0ed2b10b 100644 --- a/src/preprocessing.jl +++ b/src/preprocessing.jl @@ -50,6 +50,7 @@ remove_corrupt_utf8!(d::FileDocument) = error("FileDocument cannot be modified") """ remove_corrupt_utf8!(doc) remove_corrupt_utf8!(crps) + Remove corrupt UTF8 characters for `doc` or documents in `crps`. Does not support `FileDocument` or Corpus containing `FileDocument`. See also: [`remove_corrupt_utf8`](@ref) @@ -83,6 +84,7 @@ end """ remove_case(str) + Convert `str` to lowercase. See also: [`remove_case!`](@ref) """ @@ -92,6 +94,7 @@ remove_case(s::T) where {T<:AbstractString} = lowercase(s) """ remove_case!(doc) remove_case!(crps) + Convert the text of `doc` or `crps` to lowercase. Does not support `FileDocument` or `crps` containing `FileDocument`. # Example @@ -146,6 +149,7 @@ const html_tags = Regex("<[^>]*>") """ remove_html_tags(str) + Remove html tags from `str`, including the style and script tags. See also: [`remove_html_tags!`](@ref) """ @@ -158,6 +162,7 @@ end """ remove_html_tags!(doc::StringDocument) remove_html_tags!(crps) + Remove html tags from the `StringDocument` or documents `crps`. Does not work for documents other than `StringDocument`. # Example @@ -203,6 +208,7 @@ end """ remove_words!(doc, words::Vector{AbstractString}) remove_words!(crps, words::Vector{AbstractString}) + Remove the occurrences of words from `doc` or `crps`. # Example ```julia-repl @@ -240,8 +246,9 @@ function tag_pos!(entity::Union{Corpus,TokenDocument,StringDocument}) end """ - sparse_terms(crps, alpha=0.05]) -Find the sparse terms from Corpus, occurring in less than `alpha` percentage of the documents. + sparse_terms(crps, alpha=0.05) + +Return the sparse terms from `crps`, occurring in less than `alpha` percentage of the documents. 
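Below is a hypothetical end-to-end sketch of the in-place cleaning functions documented above (`remove_html_tags!`, `remove_case!`, `remove_words!`); the sample markup and the single stop word are invented.

```julia
using TextAnalysis

doc = StringDocument("<p>The Quick Brown Fox!</p>")

remove_html_tags!(doc)         # strip the <p> markup (StringDocument only)
remove_case!(doc)              # lowercase the remaining text
remove_words!(doc, ["the"])    # drop an example stop word
text(doc)                      # inspect the cleaned text
```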
# Example ``` julia> crps = Corpus([StringDocument("This is Document 1"), @@ -254,7 +261,7 @@ A Corpus with 2 documents: Corpus's lexicon contains 0 tokens Corpus's index contains 0 tokens julia> sparse_terms(crps, 0.5) -2-element Array{String,1}: +2-element Vector{String}: "1" "2" ``` @@ -263,7 +270,7 @@ See also: [`remove_sparse_terms!`](@ref), [`frequent_terms`](@ref) function sparse_terms(crps::Corpus, alpha::Real=alpha_sparse) update_lexicon!(crps) update_inverse_index!(crps) - res = Array{String}(undef, 0) + res = Vector{String}(undef, 0) ndocs = length(crps.documents) for term in keys(crps.lexicon) f = length(crps.inverse_index[term]) / ndocs @@ -276,7 +283,8 @@ end """ frequent_terms(crps, alpha=0.95) -Find the frequent terms from Corpus, occurring more than `alpha` percentage of the documents. + +Return the frequent terms from `crps`, occurring more than `alpha` percentage of the documents. # Example ``` julia> crps = Corpus([StringDocument("This is Document 1"), @@ -289,7 +297,7 @@ A Corpus with 2 documents: Corpus's lexicon contains 0 tokens Corpus's index contains 0 tokens julia> frequent_terms(crps) -3-element Array{String,1}: +3-element Vector{String}: "is" "This" "Document" @@ -299,7 +307,7 @@ See also: [`remove_frequent_terms!`](@ref), [`sparse_terms`](@ref) function frequent_terms(crps::Corpus, alpha::Real=alpha_frequent) update_lexicon!(crps) update_inverse_index!(crps) - res = Array{String}(undef, 0) + res = Vector{String}(undef, 0) ndocs = length(crps.documents) for term in keys(crps.lexicon) f = length(crps.inverse_index[term]) / ndocs @@ -312,7 +320,8 @@ end """ remove_sparse_terms!(crps, alpha=0.05) -Remove sparse terms in crps, occurring less than `alpha` percent of documents. + +Remove sparse terms from `crps`, occurring in less than `alpha` percent of documents. # Example ```julia-repl julia> crps = Corpus([StringDocument("This is Document 1"), @@ -336,7 +345,8 @@ remove_sparse_terms!(crps::Corpus, alpha::Real=alpha_sparse) = remove_words!(crp """ remove_frequent_terms!(crps, alpha=0.95) -Remove terms in `crps`, occurring more than `alpha` percent of documents. + +Remove frequent terms from `crps`, occurring in more than `alpha` percent of documents. # Example ```julia-repl julia> crps = Corpus([StringDocument("This is Document 1"), @@ -362,6 +372,7 @@ remove_frequent_terms!(crps::Corpus, alpha::Real=alpha_frequent) = remove_words! """ prepare!(doc, flags) prepare!(crps, flags) + Preprocess document or corpus based on the input flags. # List of Flags * strip_patterns @@ -438,8 +449,8 @@ end """ remove_whitespace(str) -Squash multiple whitespaces to a single one. -And remove all leading and trailing whitespaces. +Remove multiple whitespaces and replace with a single space. +Remove all leading and trailing whitespaces. See also: [`remove_whitespace!`](@ref) """ remove_whitespace(str::AbstractString) = replace(strip(str), r"\s+" => " ") @@ -449,7 +460,7 @@ remove_whitespace(str::AbstractString) = replace(strip(str), r"\s+" => " ") remove_whitespace!(doc) remove_whitespace!(crps) -Squash multiple whitespaces to a single space and remove all leading and trailing whitespaces in document or crps. +Remove multiple whitespaces and replace with a single space, removing all leading and trailing whitespaces in document or corpus. Does no-op for `FileDocument`, `TokenDocument` or `NGramDocument`. 
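The same kind of cleanup can usually be expressed in one call through `prepare!` with OR-ed flags, as sketched below; the flag constants used here (`strip_html_tags`, `strip_punctuation`, `strip_whitespace`) are assumed to be among the flags listed for `prepare!`.

```julia
using TextAnalysis

doc = StringDocument("  <b>Some</b>   markup,   punctuation...   and spaces!  ")

prepare!(doc, strip_html_tags | strip_punctuation | strip_whitespace)
text(doc)    # cleaned text after all three passes
```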
See also: [`remove_whitespace`](@ref) """ diff --git a/src/stemmer.jl b/src/stemmer.jl index e402a912..e0e05496 100644 --- a/src/stemmer.jl +++ b/src/stemmer.jl @@ -1,7 +1,10 @@ """ - stemmer_for_document(doc) + stemmer_for_document(d) -Search for an appropriate stemmer based on the language of the document. +Return an appropriate stemmer based on the language of the document. + +# Arguments +- `d`: Document for which to select stemmer """ function stemmer_for_document(d::AbstractDocument) Stemmer(lowercase(Languages.english_name(language(d)))) @@ -11,9 +14,13 @@ end stem!(doc) stem!(crps) -Stems the document or documents in `crps` with a suitable stemmer. +Apply stemming to the document or documents in `crps` using an appropriate stemmer. + +Does not support `FileDocument` or Corpus containing `FileDocument`. -Stemming cannot be done for `FileDocument` and Corpus made of these type of documents. +# Arguments +- `doc`: Document to apply stemming to +- `crps`: Corpus containing documents to apply stemming to """ function stem!(d::AbstractDocument) stemmer = stemmer_for_document(d) @@ -47,7 +54,10 @@ end """ stem!(crps::Corpus) -Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first) +Apply stemming to an entire corpus. Assumes all documents in the corpus have the same language (determined from the first document). + +# Arguments +- `crps`: Corpus containing documents to apply stemming to """ function stem!(crps::Corpus) stemmer = stemmer_for_document(crps.documents[1]) diff --git a/src/summarizer.jl b/src/summarizer.jl index d3bee689..a92237d1 100644 --- a/src/summarizer.jl +++ b/src/summarizer.jl @@ -1,20 +1,21 @@ """ - summarize(doc [, ns]) + summarize(doc; ns=5) -Summarizes the document and returns `ns` number of sentences. -It takes 2 arguments: +Generate a summary of the document and return the top `ns` sentences. -* `d` : A document of type `StringDocument`, `FileDocument` or `TokenDocument` -* `ns` : (Optional) Mention the number of sentences in the Summary, defaults to `5` sentences. +# Arguments +- `doc`: Document of type `StringDocument`, `FileDocument`, or `TokenDocument` +- `ns`: Number of sentences in the summary (default: 5) -By default `ns` is set to the value 5. +# Returns +- `Vector{SubString{String}}`: Array of the most relevant sentences # Example ```julia-repl julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.") julia> summarize(s, ns=2) -2-element Array{SubString{String},1}: +2-element Vector{SubString{String}}: "Assume this Short Document as an example." "This has too foo sentences." ``` @@ -32,6 +33,19 @@ function summarize(d::AbstractDocument; ns=5) return sentences[sort(sortperm(vec(p), rev=true)[1:min(ns, num_sentences)])] end +""" + pagerank(A; n_iter=20, damping=0.15) + +Compute PageRank scores for nodes in a graph using the power iteration method. + +# Arguments +- `A`: Adjacency matrix representing the graph +- `n_iter`: Number of iterations for convergence (default: 20) +- `damping`: Damping factor for PageRank algorithm (default: 0.15) + +# Returns +- `Matrix{Float64}`: PageRank scores for each node +""" function pagerank(A; n_iter=20, damping=0.15) nmax = size(A, 1) r = rand(1, nmax) # Generate a random starting rank. 
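Before the tagging-scheme changes below, a small sketch of the stemming entry point documented above; the sentence is arbitrary and the exact output depends on the Snowball stemmer selected for the document's language.

```julia
using TextAnalysis

doc = StringDocument("they were running and jumping quickly")

stem!(doc)    # stemmer is chosen from the document's language (English here)
text(doc)     # stemmed text, e.g. "running" -> "run", "jumping" -> "jump"
```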
diff --git a/src/tagging_schemes.jl b/src/tagging_schemes.jl index 83d8611e..2b89bd55 100644 --- a/src/tagging_schemes.jl +++ b/src/tagging_schemes.jl @@ -15,12 +15,17 @@ const available_schemes = Dict(string(nameof(type)) => type for type in subtypes """ tag_scheme!(tags, current_scheme::String, new_scheme::String) -Convert `tags` from `current_scheme` to `new_scheme`. +Convert tags from one tagging scheme to another in-place. -List of tagging schemes currently supported- - * BIO1 (BIO) - * BIO2 - * BIOES +# Arguments +- `tags`: Vector of tags to convert +- `current_scheme`: Name of the current tagging scheme +- `new_scheme`: Name of the target tagging scheme + +# Supported Schemes +- BIO1 (BIO) +- BIO2 +- BIOES # Example ```julia-repl @@ -29,7 +34,7 @@ julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-P julia> tag_scheme!(tags, "BIO1", "BIOES") julia> tags -8-element Array{String,1}: +8-element Vector{String}: "S-LOC" "O" "S-PER" diff --git a/src/tf_idf.jl b/src/tf_idf.jl index 7ccc33b9..2f7c23e9 100644 --- a/src/tf_idf.jl +++ b/src/tf_idf.jl @@ -1,9 +1,14 @@ """ tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat}) -Overwrite `tf` with the term frequency of the `dtm`. +Compute term frequency and store result in `tf` matrix. -Works correctly if `dtm` and `tf` are same matrix. +# Arguments +- `dtm`: Document-term matrix containing term counts +- `tf`: Output matrix for term frequency values (modified in-place) + +# Notes +Works correctly when `dtm` and `tf` are the same matrix. See also: [`tf`](@ref), [`tf_idf`](@ref), [`tf_idf!`](@ref) """ @@ -25,9 +30,14 @@ end """ tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat}) -Overwrite `tf` with the term frequency of the `dtm`. +Compute term frequency for sparse matrices and store result in `tf`. + +# Arguments +- `dtm`: Sparse document-term matrix containing term counts +- `tf`: Output sparse matrix for term frequency values (modified in-place) -`tf` should have the has same nonzeros as `dtm`. +# Notes +The `tf` matrix should have the same nonzero pattern as `dtm`. See also: [`tf`](@ref), [`tf_idf`](@ref), [`tf_idf!`](@ref) """ @@ -59,7 +69,13 @@ tf!(dtm::SparseMatrixCSC{T}) where {T<:Real} = tf!(dtm, dtm) tf(dtm::SparseMatrixCSC{Real}) tf(dtm::Matrix{Real}) -Compute the `term-frequency` of the input. +Compute term frequency for the document-term matrix. + +# Arguments +- `dtm`: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix) + +# Returns +- `Matrix{Float64}` or `SparseMatrixCSC{Float64}`: Term frequency matrix # Example @@ -96,11 +112,16 @@ tf(dtm::SparseMatrixCSC{T}) where {T<:Real} = tf!(dtm, similar(dtm, Float64)) """ tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat}) -Overwrite `tf_idf` with the tf-idf (Term Frequency - Inverse Doc Frequency) of the `dtm`. +Compute TF-IDF (Term Frequency-Inverse Document Frequency) and store result in `tf_idf` matrix. + +# Arguments +- `dtm`: Document-term matrix containing term counts +- `tf_idf`: Output matrix for TF-IDF values (modified in-place) -`dtm` and `tf-idf` must be matrices of same dimensions. +# Notes +The matrices `dtm` and `tf_idf` must have the same dimensions. 
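A brief sketch tying the term-frequency weightings above to a document-term matrix; the two toy documents are invented, and `update_lexicon!` is called first as required by the `DocumentTermMatrix` docstring earlier in this diff.

```julia
using TextAnalysis

crps = Corpus([StringDocument("a warm sunny day"),
               StringDocument("a cold rainy day")])
update_lexicon!(crps)

d = dtm(DocumentTermMatrix(crps))   # sparse counts, documents × terms
tf(d)                               # term-frequency weighting
tf_idf(d)                           # tf-idf: terms shared by all documents are down-weighted
```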
-See also: [`tf`](@ref), [`tf!`](@ref) , [`tf_idf`](@ref) +See also: [`tf`](@ref), [`tf!`](@ref), [`tf_idf`](@ref) """ function tf_idf!(dtm::AbstractMatrix{T1}, tfidf::AbstractMatrix{T2}) where {T1<:Real,T2<:AbstractFloat} n, p = size(dtm) @@ -160,7 +181,10 @@ end """ tf_idf!(dtm) -Compute tf-idf for `dtm` +Compute TF-IDF values for document-term matrix in-place. + +# Arguments +- `dtm`: Document-term matrix to transform (modified in-place) """ tf_idf!(dtm::AbstractMatrix{T}) where {T<:Real} = tf_idf!(dtm, dtm) @@ -170,19 +194,23 @@ tf_idf!(dtm::SparseMatrixCSC{T}) where {T<:Real} = tf_idf!(dtm, dtm) #tf_idf!(dtm::DocumentTermMatrix) = tf_idf!(dtm.dtm) """ - tf(dtm::DocumentTermMatrix) - tf(dtm::SparseMatrixCSC{Real}) - tf(dtm::Matrix{Real}) + tf_idf(dtm::DocumentTermMatrix) + tf_idf(dtm::SparseMatrixCSC{Real}) + tf_idf(dtm::Matrix{Real}) -Compute `tf-idf` value (Term Frequency - Inverse Document Frequency) for the input. +Compute TF-IDF (Term Frequency-Inverse Document Frequency) values for the document-term matrix. -In many cases, raw word counts are not appropriate for use because: +# Arguments +- `dtm`: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix) +# Returns +- `Matrix{Float64}` or `SparseMatrixCSC{Float64}`: TF-IDF weighted matrix + +# Notes +TF-IDF addresses issues with raw word counts: - Some documents are longer than other documents - Some words are more frequent than other words -A simple workaround this can be done by performing `TF-IDF` on a `DocumentTermMatrix` - # Example ```julia-repl @@ -320,7 +348,7 @@ d = dtm(crps) tfm = tf_idf(d) cs = cos_similarity(tfm) Matrix(cs) - # 3×3 Array{Float64,2}: + # 3×3 Matrix{Float64}: # 1.0 0.0329318 0.0 # 0.0329318 1.0 0.0 # 0.0 0.0 1.0 diff --git a/src/tokenizer.jl b/src/tokenizer.jl index 818b3e28..1d4a6c43 100644 --- a/src/tokenizer.jl +++ b/src/tokenizer.jl @@ -1,13 +1,20 @@ """ - tokenize(language, str) + tokenize(lang, s) -Split `str` into words and other tokens such as punctuation. +Split string into words and other tokens such as punctuation. + +# Arguments +- `lang`: Language for tokenization rules +- `s`: String to tokenize + +# Returns +- `Vector{String}`: Array of tokens extracted from the string # Example ```julia-repl julia> tokenize(Languages.English(), "Too foo words!") -4-element Array{String,1}: +4-element Vector{String}: "Too" "foo" "words" @@ -20,14 +27,21 @@ tokenize(lang::S, s::T) where {S<:Language,T<:AbstractString} = WordTokenizers.t """ - sentence_tokenize(language, str) + sentence_tokenize(lang, s) + +Split string into individual sentences. + +# Arguments +- `lang`: Language for sentence boundary detection rules +- `s`: String to split into sentences -Split `str` into sentences. +# Returns +- `Vector{SubString{String}}`: Array of sentences extracted from the string # Example ```julia-repl julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.") -2-element Array{SubString{String},1}: +2-element Vector{SubString{String}}: "Here are few words!" "I am Foo Bar." ``` diff --git a/src/translate_evaluation/bleu_score.jl b/src/translate_evaluation/bleu_score.jl index 29766fce..7692f03a 100644 --- a/src/translate_evaluation/bleu_score.jl +++ b/src/translate_evaluation/bleu_score.jl @@ -15,12 +15,13 @@ """ get_ngrams(segment, max_order) -Extracts all n-grams upto a given maximum order from an input segment. Returns the counter containing all n-grams upto max_order in segment -with a count of how many times each n-gram occurred. 
+Extract all n-grams up to a given maximum order from an input segment. + +Return a counter containing all n-grams up to `max_order` in the segment with a count of how many times each n-gram occurred. # Arguments - - `segment`: text segment from which n-grams will be extracted. - - `max_order`: maximum length in tokens of the n-grams returned by this methods. +- `segment`: Text segment from which n-grams will be extracted. +- `max_order`: Maximum length in tokens of the n-grams returned by this method. """ function get_ngrams(segment::Vector{<:AbstractString}, max_order::Integer) @@ -41,14 +42,15 @@ const DocumentWithTokenizedSentences = Vector{<:ListOfTokens} """ bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false) -Computes BLEU score of translated segments against one or more references. Returns the `BLEU score`, `n-gram precisions`, `brevity penalty`, -geometric mean of n-gram precisions, translation_length and reference_length +Compute the BLEU score of translated segments against one or more references. + +Return the `BLEU score`, `n-gram precisions`, `brevity penalty`, geometric mean of n-gram precisions, `translation_length`, and `reference_length`. # Arguments - - `reference_corpus`: list of lists of references for each translation. Each reference should be tokenized into a list of tokens. - - `translation_corpus`: list of translations to score. Each translation should be tokenized into a list of tokens. - - `max_order`: maximum n-gram order to use when computing BLEU score. - - `smooth=false`: whether or not to apply. Lin et al. 2004 smoothing. +- `reference_corpus`: List of lists of references for each translation. Each reference should be tokenized into a list of tokens. +- `translation_corpus`: List of translations to score. Each translation should be tokenized into a list of tokens. +- `max_order`: Maximum n-gram order to use when computing BLEU score. +- `smooth=false`: Whether or not to apply Lin et al. 2004 smoothing. Example: diff --git a/src/utils.jl b/src/utils.jl index 8c21e4d3..cbaed6b2 100644 --- a/src/utils.jl +++ b/src/utils.jl @@ -1,8 +1,17 @@ """ - weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function) + weighted_lcs(X, Y, weighted=true, f=sqrt) Compute the Weighted Longest Common Subsequence of X and Y. + +# Arguments +- `X`: First sequence +- `Y`: Second sequence +- `weighted`: Whether to use weighted computation (default: true) +- `f`: Weighting function (default: sqrt) + +# Returns +- `Float32`: Length of the weighted longest common subsequence """ function weighted_lcs(X, Y, weighted=true, f=sqrt) result = weighted_lcs_inner(X, Y, weighted, f) @@ -10,6 +19,20 @@ function weighted_lcs(X, Y, weighted=true, f=sqrt) return result.c_table[end, end] end +""" + weighted_lcs_tokens(X, Y, weighted=true, f=sqrt) + +Compute the tokens of the Weighted Longest Common Subsequence of X and Y. + +# Arguments +- `X`: First sequence +- `Y`: Second sequence +- `weighted`: Whether to use weighted computation (default: true) +- `f`: Weighting function (default: sqrt) + +# Returns +- `Vector{String}`: Array of tokens in the longest common subsequence +""" function weighted_lcs_tokens(X, Y, weighted=true, f=sqrt) m, n, c_table, _w_table = weighted_lcs_inner(X, Y, weighted, f) @@ -64,15 +87,17 @@ end """ - fmeasure_lcs(RLCS, PLCS, β) + fmeasure_lcs(RLCS, PLCS, β=1.0) Compute the F-measure based on WLCS. 
 # Arguments
+- `RLCS`: Recall factor for LCS computation
+- `PLCS`: Precision factor for LCS computation
+- `β`: Beta parameter controlling precision vs recall balance (default: 1.0)
-- `RLCS` - Recall Factor
-- `PLCS` - Precision Factor
-- `β` - Parameter
+# Returns
+- `Real`: F-measure score balancing precision and recall
 """
 function fmeasure_lcs(RLCS::Real, PLCS::Real, β=1.0)::Real
     divider = RLCS + (β^2) * PLCS
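Finally, a sketch of how the LCS helpers above combine into an F-measure; the token sequences are invented, the helpers are treated as internal (hence the `TextAnalysis.`-qualified calls), and the recall/precision factors follow the usual ROUGE-L definition.

```julia
using TextAnalysis

X = ["the", "cat", "sat", "on", "the", "mat"]   # reference tokens
Y = ["the", "cat", "lay", "on", "a", "mat"]     # candidate tokens

lcs  = TextAnalysis.weighted_lcs(X, Y, false)   # plain (unweighted) LCS length
rlcs = lcs / length(X)                          # recall factor
plcs = lcs / length(Y)                          # precision factor
TextAnalysis.fmeasure_lcs(rlcs, plcs, 1.0)      # balanced F-measure
```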