Dictionary
Goal
The goal of this tutorial is to create a simple dictionary: a database of words from one or more languages. The approach is straightforward: we supply some text to the pipeline (as PDF, TXT or any other popular format), and the pipeline parses it and inserts each distinct word into a database.
What do we have to do?
- Parse text which comes in an arbitrary format
- Insert every token from it that satisfies the 'word' criteria into a database, without duplicates
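The two steps above can be illustrated with a minimal sketch in plain Java. It is not the tutorial's stage code (which is still to be written) and does not use the pipeline API. The class name WordExtractor and the 'word' criterion (a run of two or more Unicode letters, lower-cased) are assumptions made for illustration only. A set is used so that each word is kept once.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the tutorial's code base.
class WordExtractor {

    // Assumed 'word' criterion: a run of two or more Unicode letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}{2,}");

    /** Returns the distinct lower-cased words of the text, in order of first appearance. */
    static Set<String> extractWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            // Adding to a set silently drops words that were already seen.
            words.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [the, cat, saw, dog, and, birds]
        System.out.println(extractWords("The cat saw the Cat, a dog and 42 birds."));
    }
}
```

In the real pipeline the text would already have been extracted from the source file by the parsing stage, and the uniqueness check would more likely be enforced by the database itself than by an in-memory set.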
Setup
First, we need to set up a context. It has a very simple structure; for the purpose of the tutorial we will name it "dictionary":
context
\- apps
   \- dictionary
      |- languages
      |  \- en
      |     \- source.pdf
      \- context.properties
You can use any file in place of source.pdf; here it is just an ordinary text downloaded from the Internet. The more words it contains, the better.
Project
The solution consists of 2 artifacts:
- A JAR file which contains the code that checks the words and inserts them into the database
- A ZIP file which contains the compiled solution
tutorials-dictionary
|- distribution
|- stages
\- pom.xml
Distribution
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <artifactId>dcmpc-services-dictionary-distribution</artifactId>
  <name>dcmpc-services-dictionary-distribution</name>
  <url>http://wiki.datacraftmagic.com/display/SFIND/%23Find+Home</url>
  <packaging>pom</packaging>

  <properties>
    <sharpfind.version>1.0.3</sharpfind.version>
  </properties>

  <parent>
    <groupId>com.datacraftmagic</groupId>
    <artifactId>dcmpc-services-dictionary</artifactId>
    <version>1.3.7-SNAPSHOT</version>
  </parent>

  <dependencies>
    <!-- App -->
    <dependency>
      <groupId>net.ccil</groupId>
      <artifactId>ccil-tutorials-dictionary-stages</artifactId>
      <version>${ccil.version}</version>
    </dependency>
    <dependency>
      <groupId>net.ccil</groupId>
      <artifactId>ccil-parse-tika</artifactId>
      <version>${ccil.version}</version>
    </dependency>
    <dependency>
      <groupId>net.ccil</groupId>
      <artifactId>ccil-app</artifactId>
      <version>${ccil.version}</version>
    </dependency>
    <dependency>
      <groupId>cybercore</groupId>
      <artifactId>cybercore-util</artifactId>
      <version>${cybercore.version}</version>
    </dependency>
    <dependency>
      <groupId>net.ccil</groupId>
      <artifactId>ccil-common-generic</artifactId>
      <version>${ccil.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>net.ccil</groupId>
      <artifactId>ccil-common-split</artifactId>
      <version>${ccil.version}</version>
    </dependency>
    <dependency>
      <groupId>net.ccil</groupId>
      <artifactId>ccil-common-sql</artifactId>
      <version>${ccil.version}</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.17</version>
    </dependency>
  </dependencies>

  <build>
    <finalName>${project.name}</finalName>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptors>
            <descriptor>bin.xml</descriptor>
          </descriptors>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>attached</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <organization>
    <name>Data Craft and Magic ltd.</name>
    <url>http://datacraftmagic.com/</url>
  </organization>
</project>
bin.xml
<assembly>
  <id>bin</id>
  <includeBaseDirectory>false</includeBaseDirectory>
  <formats>
    <format>zip</format>
    <format>dir</format>
  </formats>

  <fileSets>
    <fileSet>
      <directory>files/bin</directory>
      <useDefaultExcludes>true</useDefaultExcludes>
      <outputDirectory>bin</outputDirectory>
      <fileMode>0755</fileMode>
    </fileSet>
    <fileSet>
      <directory>files/config</directory>
      <useDefaultExcludes>true</useDefaultExcludes>
      <outputDirectory>config</outputDirectory>
    </fileSet>
    <fileSet>
      <directory>files/context</directory>
      <useDefaultExcludes>true</useDefaultExcludes>
      <outputDirectory>context</outputDirectory>
    </fileSet>
    <fileSet>
      <directory>files/services</directory>
      <useDefaultExcludes>true</useDefaultExcludes>
      <outputDirectory>services</outputDirectory>
    </fileSet>
    <fileSet>
      <directory>files/sql</directory>
      <useDefaultExcludes>true</useDefaultExcludes>
      <outputDirectory>sql</outputDirectory>
    </fileSet>
  </fileSets>

  <dependencySets>
    <!-- lib folder -->
    <dependencySet>
      <useProjectArtifact>false</useProjectArtifact>
      <useProjectAttachments>false</useProjectAttachments>
      <outputDirectory>lib</outputDirectory>
      <useTransitiveDependencies>true</useTransitiveDependencies>
      <excludes>
        <exclude>org.eclipse.jetty:*</exclude>
        <exclude>org.slf4j:*</exclude>
        <exclude>ch.qos.logback:*</exclude>
        <!-- only JARs here -->
        <exclude>*:war:*</exclude>
        <exclude>*:pom:*</exclude>
        <exclude>*:zip:*</exclude>
      </excludes>
    </dependencySet>

    <!-- populate the launcher folder -->
    <dependencySet>
      <useProjectArtifact>false</useProjectArtifact>
      <useProjectAttachments>false</useProjectAttachments>
      <outputDirectory>launcher</outputDirectory>
      <useTransitiveDependencies>true</useTransitiveDependencies>
      <includes>
        <!-- server -->
        <include>cybercore:cybercore-launcher</include>
        <!-- common -->
        <include>ch.qos.logback:logback*</include>
        <include>org.slf4j:jcl-over-slf4j</include>
        <include>org.slf4j:slf4j-api</include>
        <include>log4j:log4j</include>
      </includes>
    </dependencySet>
  </dependencySets>
</assembly>
files
bin
\- ccil-tutorials-dictionary-app.sh
config
|- ccil-tutorials-dictionary-app.ttl
\- logback.xml
context
|- apps
|  |- dictionary
|  |  \- languages
|  |     \- en
|  |        \- source.pdf
|  \- context.properties
\- context.properties
Stages
TBA
pom.xml
TBA
Startup script
bin/ccil-tutorials-dictionary-app.sh
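#!/bin/bash
# The installation root is the parent of the current directory,
# so the script is expected to be started from inside bin/.
CCIL_HOME="$(dirname "$PWD")"
CCIL_CONTEXT="$CCIL_HOME/context"

# Paths are quoted so that directories containing spaces do not break the command.
java -cp "$CCIL_HOME/lib/*:$CCIL_HOME/config:$CCIL_HOME/launcher/*" \
  -Dserver.config.file=ccil-tutorials-dictionary-app.ttl \
  -Dserver.home.dir="$CCIL_HOME" \
  -Xmx1024M \
  -Dserver.context.dir="$CCIL_CONTEXT" \
  -Dserver.jmx.enabled=false \
  net.ccil.execution.CcilConsoleApp -execute -root "$CCIL_HOME/context/apps" "$@"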
Configuration
config/ccil-tutorials-dictionary-app.ttl
TBA
Code
TBA
Parsing the text
TBA
Insert into database
TBA
Purge
TBA
The application context
TBA
Further steps
- Break words into stemmed forms, storing the stem variants in a joined table
- Develop a user interface
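To make the first of these steps more concrete, here is a rough sketch of grouping words by stem. The class StemGrouper and its suffix-stripping stem method are hypothetical, and the stemmer is a deliberately naive stand-in for a real stemming algorithm. Each (stem, word) pair in the resulting map corresponds to what would become one row in the joined table.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the tutorial's code base.
class StemGrouper {

    // Naive placeholder stemmer: strips one common English suffix,
    // as long as at least three characters remain.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Maps each stem to the sorted set of word variants that reduce to it.
    static Map<String, Set<String>> groupByStem(List<String> words) {
        Map<String, Set<String>> variants = new TreeMap<>();
        for (String word : words) {
            variants.computeIfAbsent(stem(word), k -> new TreeSet<>()).add(word);
        }
        return variants;
    }

    public static void main(String[] args) {
        // prints {cat=[cat, cats], walk=[walk, walked, walking, walks]}
        System.out.println(groupByStem(
                Arrays.asList("walk", "walks", "walked", "walking", "cat", "cats")));
    }
}
```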
Links
TBA