kerzol/wikiprep
Wikiprep
==============
Zemanta fork
with MediaWiki::DumpFile
MediaWiki syntax preprocessor and information extractor
Attention!
==========
The Perl package MediaWiki::DumpFile is fairly old, and the latest version of the Wikipedia dump format it has been tested with is 0.5.
Judging from the history at https://github.com/wikimedia/mediawiki/commits/master/docs/export-0.6.xsd,
I expect that Wikiprep works well only with dumps created before November 2011.
For newer dumps, one should either rewrite MediaWiki::DumpFile or find another Wikiprep-like tool.
More info: #2
Content
=======
1. Introduction
1.1 Relation to the original Wikiprep
1.2 Current status
2. Requirements
2.1 Software
2.2 Hardware
3. Usage
3.1 Installation
3.2 Running
3.3 Parallel processing
3.4 Parsing MediaWiki dumps other than English Wikipedia
4. Tools
5. License
6. Hacking
1. Introduction
===============
Wikiprep is a Perl script that parses MediaWiki data dumps in XML format
and extracts useful information from them (MediaWiki is the software that
is best known for running Wikipedia and other Wikimedia Foundation
projects).
MediaWiki uses a markup language (wiki syntax) that is optimized for easy
editing by human beings. It contains a lot of quirks, special cases and
odd corners that help MediaWiki correctly display wiki pages, even when
they contain typing errors. This makes parsing this syntax with other
software highly non-trivial.
One goal of Wikiprep is to implement a parser that:
o is as compatible with MediaWiki as possible, implementing as
much functionality as is needed to achieve the other goals.
o is as fast as possible, allowing it to track the English Wikipedia dataset
as closely as possible (MediaWiki's PHP code is slow).
The other goal is to use that parser to extract from the dump various
information that is suitable for further processing, and to store it in files
with a simple syntax.
1.1 Relation to the original Wikiprep
Wikiprep was initially developed by Evgeniy Gabrilovich to aid his research.
Tomaz Solc adapted his script for use in semantic information extraction
from Wikipedia as part of the Zemanta web service.
This version of Wikiprep has undergone extensive modification to be able
to extract the information needed by Zemanta's engine.
1.2 Current status
Currently implemented MediaWiki functionality:
o Templates:
- Named and positional parameters,
- parameter defaults,
- recursive inclusion to some degree - infinite recursion
breaking is currently implemented in a way incompatible with
MediaWiki and
- support for <noinclude>, <includeonly> and similar syntax.
o Parser functions (currently #if, #ifeq, #language, #switch)
o Magic words (currently urlencode, PAGENAME)
o Redirects
o Internal, external and interwiki links
o Proper handling of <nowiki> and other pseudo-HTML tags
o Disambiguation page recognition and special parsing
o MediaWiki compatible date handling
o Stub page recognition
o Table and math syntax blocks are recognized and removed from the final
output
o Related article identification
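As an illustration of the template parameter handling listed above, the
following Python sketch substitutes named and positional parameters with
optional defaults. It is not Wikiprep's implementation (which is written in
Perl): the function name substitute_params and the regular expression are
assumptions made for this example, and nested templates or nested parameters
are deliberately not handled.

```python
import re

# Matches {{{name}}} or {{{name|default}}}. Nested braces are not handled.
PARAM_RE = re.compile(r"\{\{\{([^{}|]+)(?:\|([^{}]*))?\}\}\}")


def substitute_params(template_body, positional=(), named=None):
    """Replace template parameters in template_body (hypothetical helper).

    Positional arguments are addressed as {{{1}}}, {{{2}}}, ... and named
    arguments as {{{name}}}. A default may follow a pipe: {{{name|default}}}.
    """
    params = {str(i): value for i, value in enumerate(positional, start=1)}
    params.update(named or {})

    def replace(match):
        name = match.group(1).strip()
        default = match.group(2)
        if name in params:
            return params[name]
        if default is not None:
            return default
        # An undefined parameter without a default is left as literal text.
        return match.group(0)

    return PARAM_RE.sub(replace, template_body)


if __name__ == "__main__":
    body = "{{{1}}} by {{{author|unknown}}} ({{{year}}})"
    print(substitute_params(body, ["Dune"], {"author": "Herbert"}))
```

Leaving an undefined parameter untouched mirrors what MediaWiki displays for
a parameter that has neither a value nor a default.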
2. Requirements
===============
2.1 Software
You need a recent version of Perl 5 compiled for a 64 bit architecture with
the following modules installed (names in parentheses are names of respective
Debian packages):
MediaWiki::DumpFile
Regexp::Common (libregexp-common-perl)
Inline::C (libinline-perl)
XML::Writer (libxml-writer-perl)
BerkeleyDB (libberkeleydb-perl)
Log::Handler (liblog-handler-perl)
If you can't use Inline::C for some reason, run wikiprep with the "-pureperl"
flag and it will use the (roughly) equivalent pure Perl implementations.
If you want to process gzip or bzip2 compressed dumps, you will need gzip and
bzip2 installed. Gzip is also required for the -compress option.
To run the unit tests you will also need the xmllint utility (shipped with
libxml2; Debian package libxml2-utils).
2.2 Hardware
English Wikipedia is big and is getting bigger, so the requirements below
slowly grow over time.
As of January 2011, Wikiprep output takes approximately 22 GB of hard disk
space (with -compress, -format composite and default logging). The debug log
takes 20 GB or more.
The prescan phase requires a little over 4 GB of memory in a single Perl
process (hence the requirement for a 64-bit version of Perl). In the transform
phase Wikiprep requires 100 to 200 MB per Perl process (but 4 GB or more
is recommended for decent performance, because the OS caches the BDB tables).
On a dual 6-core 2.6 GHz AMD Opteron with 32 GB of RAM and 12 parallel
processes it takes approximately 16 hours to process the English Wikipedia dump
from 15 January 2011.
3. Usage
========
3.1 Installation
Run the following commands from the top of the Wikiprep distribution:
$ perl Makefile.PL
$ make
$ make test (optional)
And as root:
$ make install
3.2 Running
The most common way to start Wikiprep on an XML dump downloaded
from Wikimedia is:
$ wikiprep -format composite -compress -f enwiki-20090306-pages-articles.xml.bz2
This will produce a number of files in the same directory as the dump.
Alternatively, run the following to get a list of other available options:
$ wikiprep
To run regression tests included in the distribution, run:
$ make test
3.3 Parallel processing
Wikiprep can use multiple CPUs, but to do so you must provide it with
a split XML dump.
First split the XML dump using the splitwiki utility in the tools
subdirectory:
$ bzcat enwiki-20090306-pages-articles.xml.bz2 | \
splitwiki 4 enwiki-20090306-pages-articles.xml
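To illustrate what splitting a dump into parts involves, here is a minimal
Python sketch. It is not the algorithm used by splitwiki: the greedy strategy
(each page goes to the part that is currently smallest) and the names
split_pages and part_name are assumptions for this example. Only the
four-digit part suffix is taken from the file names shown in this README.

```python
def split_pages(pages, parts):
    """Distribute page strings over `parts` buckets of roughly equal size.

    The total size of a streamed dump is not known in advance, so each page
    is simply appended to the bucket that is smallest so far.
    """
    buckets = [[] for _ in range(parts)]
    sizes = [0] * parts
    for page in pages:
        target = sizes.index(min(sizes))  # first bucket with the least data
        buckets[target].append(page)
        sizes[target] += len(page)
    return buckets


def part_name(base, index):
    """Build a part file name with a zero-padded four-digit suffix."""
    return "%s.%04d" % (base, index)


if __name__ == "__main__":
    sample = ["<page>aaaa</page>", "<page>bb</page>", "<page>c</page>"]
    for index, bucket in enumerate(split_pages(sample, 2)):
        print(part_name("enwiki-20090306-pages-articles.xml", index), len(bucket))
```

A real splitter would also have to copy the dump header and footer into
every part so that each file remains a well-formed XML document.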
Then run Wikiprep on the split dump using the -parallel option:
$ perl wikiprep.pl -format composite -compress -parallel \
-f enwiki-20090306-pages-articles.xml.0000
You only need to run Wikiprep once on a machine and it will automatically
split into as many parallel processes as there are parts of the dump
(identified by the .NNNN suffix).
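Finding the sibling parts of a split dump from the name of the first part can
be sketched in Python as follows. This is only an illustration of the .NNNN
naming convention, not Wikiprep's own code; the function name find_dump_parts
is hypothetical.

```python
import re
from pathlib import Path


def find_dump_parts(first_part):
    """Return all files that share the base name of `first_part` and end in
    a four-digit .NNNN suffix, sorted by name.

    If `first_part` has no such suffix it is treated as an unsplit dump and
    returned on its own.
    """
    first = Path(first_part)
    match = re.fullmatch(r"(.+)\.(\d{4})", first.name)
    if match is None:
        return [first]
    # Same base name followed by exactly four digits and nothing else.
    pattern = re.compile(re.escape(match.group(1)) + r"\.\d{4}")
    return sorted(
        path for path in first.parent.iterdir() if pattern.fullmatch(path.name)
    )


if __name__ == "__main__":
    for part in find_dump_parts("enwiki-20090306-pages-articles.xml.0000"):
        print(part)
```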
Wikiprep processing is split into two sequential phases: "prescan", which is
done in one process, and "transform", which can be done in parallel. So you
will need to wait for prescan to finish before you see multiple wikiprep
processes running on the system.
With some scripting it is possible to distribute dump parts to
multiple machines and start Wikiprep separately on each machine.
It's also possible to run prescan only once on a single machine and then
distribute the .db files it creates to other machines, where only the
transform phase is started.
3.4 Parsing MediaWiki dumps other than English Wikipedia
All language and site-specific settings are located in Perl modules under
/lib/Wikiprep/Config. The configuration used can be chosen via the -config
command line parameter (default is Enwiki.pm).
Wikiprep distribution includes configurations for Wikipedia in various
languages. These were contributed by Wikiprep users and can be outdated or
incomplete. At the moment the only configuration that is thoroughly tested
before each release is the English Wikipedia config. Patches are always
welcome.
To add support for a new language or MediaWiki installation, copy Enwiki.pm
to a new file (the convention is to use the same name as is used in naming
Wikimedia dumps) and adjust the values within.
4. Tools
========
There are a couple of tools in the tools/ directory that aim to make your
life a bit easier. Their use should be pretty much obvious from the help
message they return if you run them without any command line arguments.
o findtemplate.sh
Find templates that support a named parameter. Good for searching for
all templates that support for example the "isbn" parameter.
o getpage.py
Uses Special:Export feature of Wikipedia to download a single article
and all templates it depends on. It then constructs a file resembling
a MediaWiki XML dump, containing only these pages.
Good for making testing datasets or researching why a specific page
failed to parse properly.
o samplewiki
Takes a complete MediaWiki XML dump, takes a random sample of pages
from it and creates a new (smaller) dump.
o splitwiki
Splits a MediaWiki XML dump into N equal parts. For example:
$ bzcat enwiki-20090306-pages-articles.xml.bz2 | \
splitwiki 4 enwiki-20090306-pages-articles.xml
Will produce 4 files of roughly equal length in the current directory:
$ ls
enwiki-20090306-pages-articles.xml.0000
enwiki-20090306-pages-articles.xml.0001
enwiki-20090306-pages-articles.xml.0002
enwiki-20090306-pages-articles.xml.0003
o riffle
Takes a complete MediaWiki XML dump and some articles downloaded using the
Special:Export feature and inserts these articles into the dump.
Good for keeping an XML dump up-to-date without having to re-download the
whole thing.
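The idea behind riffle can be sketched in Python with the standard
xml.etree.ElementTree module. This is not the riffle tool's code: the helper
names are hypothetical, replacing a page that has the same title as an
exported one is an assumption about the desired behaviour, and the sketch
loads both documents into memory, which is not practical for a full English
Wikipedia dump.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rpartition("}")[2]


def page_title(page):
    """Return the text of the <title> child of a <page> element, or None."""
    for child in page:
        if local_name(child.tag) == "title":
            return child.text
    return None


def riffle(dump_root, export_root):
    """Merge the pages of export_root into dump_root (modified in place).

    Pages whose title already exists in the dump replace the old copy at the
    same position; all other exported pages are appended at the end.
    """
    fresh = {}
    for element in export_root:
        if local_name(element.tag) == "page":
            fresh[page_title(element)] = element

    for index, element in enumerate(list(dump_root)):
        if local_name(element.tag) == "page":
            title = page_title(element)
            if title in fresh:
                dump_root[index] = fresh.pop(title)

    dump_root.extend(list(fresh.values()))
    return dump_root


if __name__ == "__main__":
    dump = ET.fromstring(
        "<mediawiki><page><title>A</title><text>old</text></page></mediawiki>"
    )
    export = ET.fromstring(
        "<mediawiki><page><title>A</title><text>new</text></page></mediawiki>"
    )
    print(ET.tostring(riffle(dump, export), encoding="unicode"))
```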
5. License
==========
Copyright (C) 2007 Evgeniy Gabrilovich (gabr@cs.technion.ac.il)
Copyright (C) 2009 Tomaz Solc (tomaz.solc@tablix.org)
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA,
or see <http://www.gnu.org/licenses/> and
<http://www.fsf.org/licensing/licenses/info/GPLv2.html>
Some of the example files are copied from English Wikipedia and are
copyright (C) by their respective authors. Text is available under the
terms of the GNU Free Documentation License.
6. Hacking
==========
The code is pretty well documented. Have a look inside first, then ask
on the mailing list.
If you need to customize Wikiprep for a specific language, copy
Wikiprep/Config/Enwiki.pm and change the fields there. You can then use your
new config with the -config option (see section 3.4).
To run a test on just a particular test case, run for example:
$ perl t/cases.t t/cases/infobox.xml