Need a program to auto catch data from a site.

  • Status: Closed
  • Budget: $250 - $750 USD
  • Number of bids: 16

Project description

Hi there.

I need someone to write a program that can automatically catch data from this site: [url removed, login to view]

The program is what I [url removed, login to view] the data.

First, look at the category list on the right side of the site.

I need the program to automatically catch the data of some of the categories: from [キングダム] to [横浜線ドッペルゲンガー].

*Check the attachment named “category”

You have to output data in the following way:

Data Structure:

id----order

url----URL of the article

cat----category

title----title

magazine----magazine

author----author

genre----genre

character----character

site----target website

article----the text

entry_data_at----publish time

created_at----catch time

picture----the cover of the article

*Check the attachments named “tip1,tip2”
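As a rough illustration of the data structure above, the sketch below maps the fields to a Python dataclass and a table with an auto-increment `id`. The posting asks for MySQL; the standard-library `sqlite3` module is used here only as a stand-in so the example runs on its own. The table name `articles`, the column types, and the helper names `create_table` and `save` are assumptions, not part of the posting.

```python
import sqlite3
from dataclasses import dataclass, fields, astuple


@dataclass
class ArticleRecord:
    """One scraped article; field names follow the posting's data structure."""
    url: str
    cat: str
    title: str
    magazine: str
    author: str
    genre: str          # comma-separated genres
    character: int      # chapter number
    site: str
    article: str        # body of the article
    entry_data_at: int  # publish time as a Unix timestamp
    created_at: int     # catch time as a Unix timestamp
    picture: str        # URL of the cover image


COLUMNS = [f.name for f in fields(ArticleRecord)]
INT_COLUMNS = {"character", "entry_data_at", "created_at"}  # assumed column types


def create_table(conn):
    """Create the (hypothetical) articles table; id is filled in by the database."""
    col_defs = []
    for name in COLUMNS:
        sql_type = "INTEGER" if name in INT_COLUMNS else "TEXT"
        col_defs.append(f'"{name}" {sql_type}')
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, " + ", ".join(col_defs) + ")"
    )


def save(conn, record):
    """Insert one record and return the id assigned by the database."""
    names = ", ".join(f'"{c}"' for c in COLUMNS)
    marks = ", ".join("?" for _ in COLUMNS)
    cur = conn.execute(
        f"INSERT INTO articles ({names}) VALUES ({marks})", astuple(record)
    )
    conn.commit()
    return cur.lastrowid
```

With MySQL the same layout would use an `AUTO_INCREMENT` primary key, and `character` would need quoting there because it is a reserved word.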

Explanation:

[cat] means the name of [url removed, login to view] the category list in the right [url removed, login to view] can see words like [キングダム] and [トキワ来たれり!!]; these are categories.

[title]: click the category [キングダム] and a new page opens. There you can see entries such as [キングダム 最新 492話 ネタバレ&感想 入隊選抜試験と逸材発見!?] or [キングダム 最新 491話 ネタバレ&感想 秦趙決裂と軍備強化]; these are titles.

[article]: click a title such as [キングダム 最新 492話 ネタバレ&感想 入隊選抜試験と逸材発見!?] and a new page opens. There you can see an article with a lot of [url removed, login to view] have to catch the body, from the title (キングダム 最新 492話 ネタバレ&感想 入隊選抜試験と逸材発見!?) to the end of the article (ending just above [第491話へ][第493話へ] and the advertisements).

[entry_data_at] means the publish time of the article. For example, the publish time of キングダム 最新 492話 ネタバレ&感想 入隊選抜試験と逸材発見!? is the one written under the title - 2016/10/[url removed, login to view] have to record it as a timestamp, which would turn 2016/10/01 (taken as 00:00:00 at UTC/GMT+08:00, the time zone used in the [created_at] example) into 1475251200.

[url] means the URL of the article, like [url removed, login to view]

[site]: always write [url removed, login to view]

[character],for example,

[url removed, login to view]

You can see the words written in blue: [第492話 成長への募兵].

In the Developer Tools this appears as:

<span style="font-size: x-large; color: #0000ff;">

<strong>第492話 成長への募兵</strong>

</span>

The number 492 is the [character].
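One possible way to pull the [character] number out of markup like the snippet above is sketched here with Python's standard `html.parser` and `re` modules. The class name `StrongTextParser`, the function name `extract_character`, and the assumption that the chapter heading always sits inside a `<strong>` element in the form 第N話 are illustrative guesses based on this single example, not confirmed behaviour of the site.

```python
import re
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collects the text found inside <strong> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # how many <strong> elements are currently open
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.texts.append(data.strip())


# Assumed heading pattern: 第<number>話, e.g. 第492話
CHAPTER_RE = re.compile(r"第\s*(\d+)\s*話")


def extract_character(html):
    """Return the first chapter number found in a <strong> heading, or None."""
    parser = StrongTextParser()
    parser.feed(html)
    parser.close()
    for text in parser.texts:
        match = CHAPTER_RE.search(text)
        if match:
            return int(match.group(1))
    return None


snippet = (
    '<span style="font-size: x-large; color: #0000ff;">'
    "<strong>第492話 成長への募兵</strong></span>"
)
print(extract_character(snippet))  # 492
```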

For [author], [magazine], [genre], [picture], [id] and [created_at], do the following step first.

Search for any [cat] in [url removed, login to view] and use the first result.

For example, searching for [キングダム] in [url removed, login to view] gives:

作家:原泰久

雑誌・レーベル:ヤングジャンプ

ジャンル: バトル・アクション / 歴史 / 青年マンガ / アニメ化 / 中国史・三国志

So,

[author] means the text after [作家:]. In the example, the [author] is [原泰久].

[magazine] means the text after [雑誌・レーベル:]. In the example, the [magazine] is [ヤングジャンプ].

[genre] means the text after [ジャンル:]; use "," to separate the entries. In the example, the [genre] is [バトル・アクション,歴史,青年マンガ,アニメ化,中国史・三国志].
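The three label rules above can be sketched as a small parser. It assumes the search result has already been reduced to plain text lines like the ones in the example; how to fetch and isolate those lines depends on the search site, which is not shown in the posting. The function name `parse_metadata` is hypothetical, and accepting full-width `:` and `/` alongside the ASCII forms is a precaution, not something the posting specifies.

```python
import re

# Japanese label on the search-result page -> field name in the data structure
LABELS = {"作家": "author", "雑誌・レーベル": "magazine", "ジャンル": "genre"}


def parse_metadata(text):
    """Extract author, magazine and genre from 'label:value' lines."""
    result = {}
    for line in text.splitlines():
        # Split on the first colon (ASCII or full-width)
        parts = re.split(r"[::]", line.strip(), maxsplit=1)
        if len(parts) != 2:
            continue
        label, value = parts[0].strip(), parts[1].strip()
        field = LABELS.get(label)
        if field is None:
            continue
        if field == "genre":
            # "A / B / C" -> "A,B,C"
            genres = [g.strip() for g in re.split(r"[//]", value) if g.strip()]
            value = ",".join(genres)
        result[field] = value
    return result


sample = """作家:原泰久
雑誌・レーベル:ヤングジャンプ
ジャンル: バトル・アクション / 歴史 / 青年マンガ / アニメ化 / 中国史・三国志"""

print(parse_metadata(sample)["genre"])
# バトル・アクション,歴史,青年マンガ,アニメ化,中国史・三国志
```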

[picture]: the cover of the first [url removed, login to view] have to catch covers and store [url removed, login to view] the database there should be a [picture] column holding the URL of each cover.

[id] means the order: the first one is 1, the second one is 2, etc. (MySQL auto-increment field)

[created_at] means the time at which you catch the article; it also has to be recorded as a timestamp. For example, if I catch the data at 2016/10/11 14:40:30 UTC/GMT+08:00, the [created_at] should be 1476168030.
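The timestamp rule can be checked with a few lines of Python. This is a minimal sketch that assumes every date and time on the site is to be read as UTC/GMT+08:00, as in the [created_at] example; the names `SITE_TZ` and `to_timestamp` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: all times are interpreted as UTC+08:00, as in the posting's example
SITE_TZ = timezone(timedelta(hours=8))


def to_timestamp(text, fmt="%Y/%m/%d %H:%M:%S"):
    """Convert a date/time string in the site's time zone to a Unix timestamp."""
    local = datetime.strptime(text, fmt).replace(tzinfo=SITE_TZ)
    return int(local.timestamp())


def now_timestamp():
    """Unix timestamp for the moment of capture (for created_at)."""
    return int(datetime.now(timezone.utc).timestamp())


print(to_timestamp("2016/10/11 14:40:30"))  # 1476168030
```

A publish date without a time of day can be converted the same way by passing `fmt="%Y/%m/%d"`, which treats it as midnight in the same time zone.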

Using [キングダム] as the example and following the steps above, you get:

id:1

url:[url removed, login to view]

cat:キングダム

title:キングダム 最新 492話 ネタバレ&感想 入隊選抜試験と逸材発見!?

magazine:ヤングジャンプ

author:原泰久

genre: バトル・アクション,歴史,青年マンガ,アニメ化,中国史・三国志

character:492

site:[url removed, login to view]

article:<h1 class="entry-title">......

entry_data_at:1475251200

created_at:1476168030

*Check the explanation named “database sample”.

This is what I [url removed, login to view] have to make the program catch data in this way so that my server can recognize the data.

The program needs to catch data once every 2 hours.

You need to send me the program you write to catch the data.

I need the full data scraper, and also a program that catches new data without catching old data again.
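One way to avoid catching old data again is to remember every article URL that has already been stored and skip it on later runs. The sketch below keeps that record in a table with a unique key and relies on `INSERT OR IGNORE`; it uses `sqlite3` only so the example is self-contained, and the names `open_seen_store`, `filter_new`, `run_every` and the `seen` table are hypothetical. With MySQL, a `UNIQUE` index on the url column together with `INSERT IGNORE` would play the same role.

```python
import sqlite3
import time


def open_seen_store(path=":memory:"):
    """Open (or create) the store of already-caught article URLs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
    return conn


def filter_new(conn, urls):
    """Return only the URLs not seen before, and mark them as seen."""
    new_urls = []
    for url in urls:
        cur = conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
        if cur.rowcount == 1:  # 1 = row inserted, 0 = URL was already stored
            new_urls.append(url)
    conn.commit()
    return new_urls


def run_every(job, interval_seconds=2 * 60 * 60):
    """Call job() once every interval (2 hours by default); runs until stopped."""
    while True:
        job()
        time.sleep(interval_seconds)
```

A system scheduler such as cron could replace `run_every`; the deduplication step stays the same either way.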

Type 113114 in your bid.
