Collocations Extraction using Python

Closed Posted Jan 15, 2014 Paid on delivery

Given a large text corpus (about 1 GB), I want to extract two-, three-, four- and five-word collocations or patterns using the log-likelihood ratio (LLR).

More specifically, the requirements are:

(1) Given the corpus, I'd like to get the bigrams, trigrams, 4-grams and 5-grams using LLR.

(2) I also want to find the collocations of any word that contains three or four specific letters. For example, the collocations of words that contain the letters "a", "d" and "f" in that order, whether the letters are adjacent or separated by other letters.
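
To make requirement (2) concrete, here is a minimal sketch of the letter test, assuming "in that order" means the letters form a subsequence of the word and that matching is case-insensitive. The function names are hypothetical, and `scored` is assumed to be a list of `(ngram, score)` pairs produced by whatever LLR step is used.

```python
import re

def letters_in_order_pattern(letters):
    """Build a regex that matches the letters in order, with anything in between."""
    # "adf" becomes the pattern a.*d.*f
    return re.compile(".*".join(re.escape(ch) for ch in letters), re.IGNORECASE)

def has_letters_in_order(word, letters):
    """True if `word` contains `letters` as a (possibly non-adjacent) subsequence."""
    return letters_in_order_pattern(letters).search(word) is not None

def filter_collocations(scored, letters):
    """Keep (ngram, score) pairs in which at least one word passes the letter test."""
    pattern = letters_in_order_pattern(letters)
    return [(ngram, score) for ngram, score in scored
            if any(pattern.search(word) for word in ngram)]
```

For example, "handful" and "adrift" pass the test for "adf", while "fad" does not, because no "f" follows the "d".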

In both cases, I want the output sorted. As noted above, the corpus is about 1 GB, so it is really big.

I prefer working with Python, but I'm a novice, so the code needs to be clear and easy to use and understand.

P.S. The budget is limited to $100.

Thanks

Data Mining Python Statistics

Project ID: #5323958

About the project

6 proposals Remote project Active Feb 21, 2014

6 freelancers are bidding an average of $128 for this job

srinichal

I would like to discuss this further and deliver the project.

$147 USD in 3 days
(17 reviews)
5.1
anuyadav1

Hello, I can write a Python script for this, and I can handle a 1 GB text. Thank you.

$100 USD in 3 days
(7 reviews)
4.1
adamcold

A proposal has not yet been provided

$100 USD in 3 days
(0 reviews)
0.0
mwschultz

Hello! I am an expert at data mining with Python. I am certain that I can provide you with the solution you require. I have a Master's degree in Computer Science, as well as over four years of professional programming …

$83 USD in 5 days
(0 reviews)
0.0
njwiggin

I have a great deal of Python experience, will complete the project in a timely manner, and will do it correctly. Thank you for considering my bid.

$100 USD in 3 days
(0 reviews)
0.0
peterjrow

NLTK (the Python library for this kind of thing) has a class which calculates LL for bigrams and trigrams, and it has a general class which is only missing one method: the contingency matrix. I could write new classes …

$66 USD in 3 days
(0 reviews)
0.0