Dictionary

From CCIL
Latest revision as of 13:47, 17 May 2017

Goal

The goal of this tutorial is to create a simple dictionary: a database of words from one or more languages. The approach is deliberately simple: we supply some text to the pipeline (in PDF, TXT, or any other popular format), which parses it and inserts each distinct word into a database exactly once.


What do we have to do?

  1. Parse text which comes in an arbitrary format
  2. Insert every token from it that satisfies the 'word' criteria into a database, without duplicates
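
The two steps above can be sketched in plain Java. This is only an illustration, not the pipeline's actual code: the class and method names are hypothetical, the 'word' criterion (letters only, at least two characters) is an assumption because the tutorial does not define it, and an in-memory set stands in for the uniqueness guarantee that the database would provide. The input is assumed to be plain text already extracted from the source file.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the tutorial's code base.
final class UniqueWordsSketch {

    // Assumed 'word' criterion: letters only, at least two characters.
    static boolean isWord(String token) {
        return token.length() >= 2 && token.chars().allMatch(Character::isLetter);
    }

    // Splits already-extracted plain text into tokens and keeps each word once.
    static Set<String> uniqueWords(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{N}]+"))
                .map(token -> token.toLowerCase(Locale.ROOT))
                .filter(UniqueWordsSketch::isWord)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        System.out.println(uniqueWords("The cat saw the other cat in 2017."));
    }
}
```

In a real database the same effect is usually achieved with a unique constraint on the word column, so that a repeated insert of the same word is rejected or ignored.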

Setup

First, we need to set up a context. It has a very simple structure; for the purposes of this tutorial we will name it "dictionary":

context
\- apps
   \- dictionary
      |- languages
      |  \- en
      |     \- source.pdf
      \- context.properties

You can use any file in place of source.pdf. It is just an ordinary text downloaded from the Internet. The more words it contains, the better.
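
To make the layout concrete, here is a minimal sketch of how source files could be grouped by language by listing the languages directory. It uses only the standard java.nio.file API. The class name, and the idea of treating each subdirectory name (such as "en") as a language code, are assumptions made for illustration; the real pipeline may discover its input differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the tutorial's code base.
final class LanguageSourcesSketch {

    // Returns the sorted direct children of a directory that satisfy the filter.
    private static List<Path> children(Path dir, boolean directories) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> directories ? Files.isDirectory(p) : Files.isRegularFile(p))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Maps each language directory name (e.g. "en") to the regular files inside it.
    static Map<String, List<Path>> sourcesByLanguage(Path languagesDir) throws IOException {
        Map<String, List<Path>> result = new TreeMap<>();
        for (Path languageDir : children(languagesDir, true)) {
            result.put(languageDir.getFileName().toString(), children(languageDir, false));
        }
        return result;
    }
}
```

Called with the path context/apps/dictionary/languages, this would return one entry per language directory, each listing the files to be parsed for that language.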

Project

The solution consists of two artifacts:

  • JAR file which contains the code that checks the keywords and inserts them into the database
  • ZIP file which contains the compiled solution

tutorials-dictionary
|- distribution
|- stages
\- pom.xml

Distribution

pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<artifactId>dcmpc-services-dictionary-distribution</artifactId>
	<name>dcmpc-services-dictionary-distribution</name>
	<url>http://wiki.datacraftmagic.com/display/SFIND/%23Find+Home</url>
	<packaging>pom</packaging>
	<properties>
		<sharpfind.version>1.0.3</sharpfind.version>
	</properties>
	<parent>
		<groupId>com.datacraftmagic</groupId>
		<artifactId>dcmpc-services-dictionary</artifactId>
		<version>1.3.7-SNAPSHOT</version>
	</parent>
	<dependencies>
		<!-- App -->
		<dependency>
			<groupId>net.ccil</groupId>
			<artifactId>ccil-tutorials-dictionary-stages</artifactId>
			<version>${ccil.version}</version>
		</dependency>
		<dependency>
			<groupId>net.ccil</groupId>
			<artifactId>ccil-parse-tika</artifactId>
			<version>${ccil.version}</version>
		</dependency>
		<dependency>
			<groupId>net.ccil</groupId>
			<artifactId>ccil-app</artifactId>
			<version>${ccil.version}</version>
		</dependency>
		<dependency>
			<groupId>cybercore</groupId>
			<artifactId>cybercore-util</artifactId>
			<version>${cybercore.version}</version>
		</dependency>
		<dependency>
			<groupId>net.ccil</groupId>
			<artifactId>ccil-common-generic</artifactId>
			<version>${ccil.version}</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>net.ccil</groupId>
			<artifactId>ccil-common-split</artifactId>
			<version>${ccil.version}</version>
		</dependency>
		<dependency>
			<groupId>net.ccil</groupId>
			<artifactId>ccil-common-sql</artifactId>
			<version>${ccil.version}</version>
		</dependency>
		<dependency>
			<groupId>mysql</groupId>
			<artifactId>mysql-connector-java</artifactId>
			<version>5.1.17</version>
		</dependency>
	</dependencies>
	<build>
		<finalName>${project.name}</finalName>
		<plugins>
			<plugin>
				<artifactId>maven-assembly-plugin</artifactId>
				<configuration>
					<descriptors>
						<descriptor>bin.xml</descriptor>
					</descriptors>
				</configuration>
				<executions>
					<execution>
						<id>make-assembly</id>
						<phase>package</phase>
						<goals>
							<goal>attached</goal>
						</goals>
					</execution>
				</executions>
			</plugin>
		</plugins>
	</build>

	<organization>
		<name>Data Craft and Magic ltd.</name>
		<url>http://datacraftmagic.com/</url>
	</organization>
</project>

bin.xml
<assembly>
	<id>bin</id>
	<includeBaseDirectory>false</includeBaseDirectory>
	<formats>
		<format>zip</format>
		<format>dir</format>
	</formats>
	<fileSets>
		<fileSet>
			<directory>files/bin</directory>
			<useDefaultExcludes>true</useDefaultExcludes>
			<outputDirectory>bin</outputDirectory>
			<fileMode>0755</fileMode>
		</fileSet>
		<fileSet>
			<directory>files/config</directory>
			<useDefaultExcludes>true</useDefaultExcludes>
			<outputDirectory>config</outputDirectory>
		</fileSet>
		<fileSet>
			<directory>files/context</directory>
			<useDefaultExcludes>true</useDefaultExcludes>
			<outputDirectory>context</outputDirectory>
		</fileSet>
		<fileSet>
			<directory>files/services</directory>
			<useDefaultExcludes>true</useDefaultExcludes>
			<outputDirectory>services</outputDirectory>
		</fileSet>
		<fileSet>
			<directory>files/sql</directory>
			<useDefaultExcludes>true</useDefaultExcludes>
			<outputDirectory>sql</outputDirectory>
		</fileSet>
	</fileSets>
	<dependencySets>
		<!-- lib folder -->
		<dependencySet>
			<useProjectArtifact>false</useProjectArtifact>
			<useProjectAttachments>false</useProjectAttachments>
			<outputDirectory>lib</outputDirectory>
			<useTransitiveDependencies>true</useTransitiveDependencies>

			<excludes>
				<exclude>org.eclipse.jetty:*</exclude>
				<exclude>org.slf4j:*</exclude>
				<exclude>ch.qos.logback:*</exclude>
				<!-- only JARs here -->
				<exclude>*:war:*</exclude>
				<exclude>*:pom:*</exclude>
				<exclude>*:zip:*</exclude>
			</excludes>
		</dependencySet>
		<!-- populate the launcher folder -->
		<dependencySet>
			<useProjectArtifact>false</useProjectArtifact>
			<useProjectAttachments>false</useProjectAttachments>
			<outputDirectory>launcher</outputDirectory>
			<useTransitiveDependencies>true</useTransitiveDependencies>
			<includes>
				<!-- server -->
				<include>cybercore:cybercore-launcher</include>
				<!-- common -->
				<include>ch.qos.logback:logback*</include>
				<include>org.slf4j:jcl-over-slf4j</include>
				<include>org.slf4j:slf4j-api</include>
				<include>log4j:log4j</include>
			</includes>
		</dependencySet>
	</dependencySets>
</assembly>

files
bin
 \- ccil-tutorials-dictionary-app.sh
config
 |- ccil-tutorials-dictionary-app.ttl
 \- logback.xml
context
 |- apps
 |   |- dictionary
 |   |   \- languages
 |   |       \- en
 |   |           \- source.pdf
 |   \- context.properties
 \- context.properties

Stages

TBA

pom.xml

TBA

Startup script

bin/ccil-tutorials-dictionary-app.sh

#!/bin/bash
# Must be started from inside the bin directory: CCIL_HOME is the parent of the current directory.
CCIL_HOME="$(dirname "$PWD")"
CCIL_CONTEXT="$CCIL_HOME/context"

java -cp "$CCIL_HOME/lib/*:$CCIL_HOME/config:$CCIL_HOME/launcher/*" \
  -Dserver.config.file=ccil-tutorials-dictionary-app.ttl -Dserver.home.dir="$CCIL_HOME" -Xmx1024M \
  -Dserver.context.dir="$CCIL_CONTEXT" -Dserver.jmx.enabled=false \
  net.ccil.execution.CcilConsoleApp -execute -root "$CCIL_HOME/context/apps" "$@"

Configuration

config/ccil-tutorials-dictionary-app.ttl

TBA

Code

TBA

Parsing the text

TBA

Insert into database

TBA

Purge

TBA

The application context

TBA

Further steps

  • Break words into stemmed forms, storing the variants of each stem in a joined table.
  • Develop a user interface.
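
As a rough illustration of the first item, the sketch below groups words under a shared stem, which is the shape that a stem table joined to a table of variants would hold. The suffix-stripping rule is deliberately naive, English-only, and purely an assumption for the example; a real implementation would use a proper stemming algorithm. All names here are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the tutorial's code base.
final class StemGroupingSketch {

    // Suffixes tried in order; longer ones come first so "es" wins over "s".
    private static final List<String> SUFFIXES = Arrays.asList("ing", "ed", "es", "s");

    // Naive rule: strip the first matching suffix if at least three letters remain.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Maps each stem to the sorted set of word variants that reduce to it.
    static Map<String, Set<String>> groupByStem(Collection<String> words) {
        Map<String, Set<String>> variantsByStem = new TreeMap<>();
        for (String word : words) {
            variantsByStem.computeIfAbsent(stem(word), key -> new TreeSet<>()).add(word);
        }
        return variantsByStem;
    }
}
```

Each map key would correspond to a row in the stem table, and each value in its set to a row in the joined variants table that references that stem.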

Links

TBA