Latest revision as of 13:47, 17 May 2017
Goal
The goal of this tutorial is to create a simple dictionary: a database of words from one or more specific languages. It is built in a very simple manner: we supply some text to the pipeline (in PDF, TXT or any other popular format), and the pipeline parses it and inserts each distinct word into a database exactly once.
What do we have to do?
- Parse text that comes in an arbitrary format
- Insert every token from it that satisfies the 'word' criteria into a database, with no duplicates
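The second step can be illustrated with a minimal sketch. The tutorial does not define the 'word' criteria, so the class name `WordExtractor` and the rule used here (a run of letters, optionally joined by apostrophes or hyphens, compared case-insensitively) are assumptions for illustration only, not the pipeline's actual stage code:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the tutorial's code base. */
class WordExtractor {

    // Assumed 'word' criteria: a run of Unicode letters, optionally joined
    // by single apostrophes or hyphens (e.g. "didn't", "well-known").
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    /** Returns the distinct words of the text, lower-cased, in order of first appearance. */
    static Set<String> uniqueWords(String text) {
        Set<String> words = new LinkedHashSet<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            // The set silently drops anything it has already seen.
            words.add(matcher.group().toLowerCase(Locale.ROOT));
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [the, cat, saw, other, didn't, it, times]
        System.out.println(uniqueWords("The cat saw the other cat, didn't it? 42 times."));
    }
}
```

Deduplicating in memory only covers a single run; across runs the database itself still has to reject words it already holds.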
Setup
First, we need to set up a context. Its structure is very simple; for the purposes of this tutorial we will name it "dictionary":
<pre>
context
 \- apps
     \- dictionary
         |- languages
         |   \- en
         |       \- source.pdf
         \- context.properties
</pre>
You can use any file in place of source.pdf; it is just an ordinary text downloaded from the Internet. The more words it contains, the better.
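To make the layout concrete, here is a small sketch that lists the source files found for each language under such a context directory. The class `ContextSources` and its method are hypothetical and only mirror the directory names shown above; the real pipeline may discover its inputs differently:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper, not part of the tutorial's code base. */
class ContextSources {

    /** Maps each language code (a directory under languages/) to the regular files inside it. */
    static Map<String, List<Path>> sourcesByLanguage(Path contextDir) throws IOException {
        Path languages = contextDir.resolve("apps").resolve("dictionary").resolve("languages");
        if (!Files.isDirectory(languages)) {
            return Collections.emptyMap();
        }
        List<Path> languageDirs;
        try (Stream<Path> entries = Files.list(languages)) {
            languageDirs = entries.filter(Files::isDirectory).collect(Collectors.toList());
        }
        Map<String, List<Path>> result = new TreeMap<>();
        for (Path languageDir : languageDirs) {
            try (Stream<Path> files = Files.list(languageDir)) {
                result.put(languageDir.getFileName().toString(),
                        files.filter(Files::isRegularFile).sorted().collect(Collectors.toList()));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to a directory named "context" in the working directory.
        Path contextDir = Paths.get(args.length > 0 ? args[0] : "context");
        sourcesByLanguage(contextDir).forEach((language, files) ->
                System.out.println(language + " -> " + files));
    }
}
```

Adding another language then amounts to creating a sibling directory next to en and dropping a text file into it.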
Project
The solution consists of two artifacts:
- A JAR file containing the code that checks the words and inserts them into the database
- A ZIP file containing the compiled solution
<pre>
tutorials-dictionary
 |- distribution
 |- stages
 \- pom.xml
</pre>
Distribution
pom.xml
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>dcmpc-services-dictionary-distribution</artifactId>
    <name>dcmpc-services-dictionary-distribution</name>
    <url>http://wiki.datacraftmagic.com/display/SFIND/%23Find+Home</url>
    <packaging>pom</packaging>
    <properties>
        <sharpfind.version>1.0.3</sharpfind.version>
    </properties>
    <parent>
        <groupId>com.datacraftmagic</groupId>
        <artifactId>dcmpc-services-dictionary</artifactId>
        <version>1.3.7-SNAPSHOT</version>
    </parent>
    <dependencies>
        <!-- App -->
        <dependency>
            <groupId>net.ccil</groupId>
            <version>${ccil.version}</version>
            <artifactId>ccil-tutorials-dictionary-stages</artifactId>
        </dependency>
        <dependency>
            <groupId>net.ccil</groupId>
            <version>${ccil.version}</version>
            <artifactId>ccil-parse-tika</artifactId>
        </dependency>
        <dependency>
            <artifactId>ccil-app</artifactId>
            <groupId>net.ccil</groupId>
            <version>${ccil.version}</version>
        </dependency>
        <dependency>
            <groupId>cybercore</groupId>
            <version>${cybercore.version}</version>
            <artifactId>cybercore-util</artifactId>
        </dependency>
        <dependency>
            <artifactId>ccil-common-generic</artifactId>
            <groupId>net.ccil</groupId>
            <version>${ccil.version}</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>net.ccil</groupId>
            <artifactId>ccil-common-split</artifactId>
            <version>${ccil.version}</version>
        </dependency>
        <dependency>
            <groupId>net.ccil</groupId>
            <artifactId>ccil-common-sql</artifactId>
            <version>${ccil.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.17</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>${project.name}</finalName>
        <plugins>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptors>
                        <descriptor>bin.xml</descriptor>
                    </descriptors>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>attached</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <organization>
        <name>Data Craft and Magic ltd.</name>
        <url>http://datacraftmagic.com/</url>
    </organization>
</project>
</pre>
bin.xml
<pre>
<assembly>
    <id>bin</id>
    <includeBaseDirectory>false</includeBaseDirectory>
    <formats>
        <format>zip</format>
        <format>dir</format>
    </formats>
    <fileSets>
        <fileSet>
            <directory>files/bin</directory>
            <useDefaultExcludes>true</useDefaultExcludes>
            <outputDirectory>bin</outputDirectory>
            <fileMode>0755</fileMode>
        </fileSet>
        <fileSet>
            <directory>files/config</directory>
            <useDefaultExcludes>true</useDefaultExcludes>
            <outputDirectory>config</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>files/context</directory>
            <useDefaultExcludes>true</useDefaultExcludes>
            <outputDirectory>context</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>files/services</directory>
            <useDefaultExcludes>true</useDefaultExcludes>
            <outputDirectory>services</outputDirectory>
        </fileSet>
        <fileSet>
            <directory>files/sql</directory>
            <useDefaultExcludes>true</useDefaultExcludes>
            <outputDirectory>sql</outputDirectory>
        </fileSet>
    </fileSets>
    <dependencySets>
        <!-- lib folder -->
        <dependencySet>
            <useProjectArtifact>false</useProjectArtifact>
            <useProjectAttachments>false</useProjectAttachments>
            <outputDirectory>lib</outputDirectory>
            <useTransitiveDependencies>true</useTransitiveDependencies>

            <excludes>
                <exclude>org.eclipse.jetty:*</exclude>
                <exclude>org.slf4j:*</exclude>
                <exclude>ch.qos.logback:*</exclude>
                <!-- only JARs here -->
                <exclude>*:war:*</exclude>
                <exclude>*:pom:*</exclude>
                <exclude>*:zip:*</exclude>
            </excludes>
        </dependencySet>
        <!-- populate the launcher folder -->
        <dependencySet>
            <useProjectArtifact>false</useProjectArtifact>
            <useProjectAttachments>false</useProjectAttachments>
            <outputDirectory>launcher</outputDirectory>
            <useTransitiveDependencies>true</useTransitiveDependencies>
            <includes>
                <!-- server -->
                <include>cybercore:cybercore-launcher</include>
                <!-- common -->
                <include>ch.qos.logback:logback*</include>
                <include>org.slf4j:jcl-over-slf4j</include>
                <include>org.slf4j:slf4j-api</include>
                <include>log4j:log4j</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>
</pre>
files
<pre>
bin
 \- ccil-tutorials-dictionary-app.sh
config
 |- ccil-tutorials-dictionary-app.ttl
 \- logback.xml
context
 |- apps
 |   |- dictionary
 |   |   \- languages
 |   |       \- en
 |   |           \- source.pdf
 |   \- context.properties
 \- context.properties
</pre>
Stages
TBA
pom.xml
TBA
Startup script
bin/ccil-tutorials-dictionary-app.sh
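<pre>
#!/bin/bash
# The home directory is taken as the parent of the current directory,
# so run this script from inside bin/.
CCIL_HOME="$(dirname "$PWD")"
CCIL_CONTEXT="$CCIL_HOME/context"

# Paths are quoted so that directories containing spaces still work.
java -cp "$CCIL_HOME/lib/*:$CCIL_HOME/config:$CCIL_HOME/launcher/*" \
    -Dserver.config.file=ccil-tutorials-dictionary-app.ttl \
    -Dserver.home.dir="$CCIL_HOME" \
    -Xmx1024M \
    -Dserver.context.dir="$CCIL_CONTEXT" \
    -Dserver.jmx.enabled=false \
    net.ccil.execution.CcilConsoleApp -execute -root "$CCIL_HOME/context/apps" "$@"
</pre>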
Configuration
config/ccil-tutorials-dictionary-app.ttl
TBA
Code
TBA
Parsing the text
TBA
Insert into database
TBA
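The tutorial's own code for this step is not yet available. As a rough sketch only, the no-duplicates requirement from the Goal section could be met over plain JDBC with MySQL's `INSERT IGNORE`, given that the distribution already bundles the MySQL connector. The class `WordStore`, the table name `words` and the column name `word` are all hypothetical, and the sketch assumes a unique key on that column:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collection;

/** Hypothetical helper; the table and column names are assumptions. */
class WordStore {

    // Assumed schema:
    //   CREATE TABLE words (word VARCHAR(255) NOT NULL, UNIQUE KEY (word));
    // With that unique key, MySQL's INSERT IGNORE skips rows that already exist
    // instead of failing the statement.
    static final String INSERT_SQL = "INSERT IGNORE INTO words (word) VALUES (?)";

    /** Sends all words as one batch and returns the number of statements in that batch. */
    static int insertWords(Connection connection, Collection<String> words) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            for (String word : words) {
                statement.setString(1, word);
                statement.addBatch();
            }
            return statement.executeBatch().length;
        }
    }
}
```

Leaving uniqueness to the database key, rather than checking with a SELECT first, keeps the insert safe even if the same text is processed twice.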
Purge
TBA
The application context
TBA
Further steps
- Break words into stemmed forms, storing the stem variants in a joined table.
- Develop a user interface.
Links
TBA