let $refIdInSource := doc('source.xml')//*[. contains text {$p} all words
should retrieve and maintain the original path. This is not "wrong and
corpus in order to reassemble them later in a REST app. A corpus is not just
display.
Hello.
It's a dictionary, and the words in the dictionary reference the texts with
simple identifiers. The texts are used as examples and can be referenced by
many different words. Thus no explicit bookkeeping is kept about which
words use which texts as examples; this is done implicitly instead, and each
text reference (from the dictionary to -ft-) also holds much extra
information, such as the quality of the example text with respect to the
word. Statistics and summaries of this data are produced by separate queries
and not held explicitly in either database.
Not sure whether I answered your question?
BR
Kristian K
Kristian,
Out of curiosity, how are you linking the normalized texts in the -ft-
database to the source documents? Is keeping a reference from the indexed
text back to the source document a requirement in your application?
Thanks,
Vincent
Kankainen
*Sent:* Friday, June 30, 2017 5:27 PM
*Subject:* Re: [basex-talk] Full-text lemmatizing and xml:lang
Hello
Sorry for being slow to respond; being a full-time father of two kids is
my only excuse.
Thank you for the enlightening answers. At first, creating a separate
database felt wrong and stupid, but after a while it felt just right, and it
helps to organize the different language elements via aggregation instead of
composition.
(:~
This function takes a list of database names and optionally a list of
language codes.
It creates separate full-text indexed databases for lemmatized searching
of each language contained in the original database.
If the list of language codes is empty, all existing values of xml:lang
found in the database are used.
The full-text databases are named 'dbname-ft-langcode'
Another function normalizes the texts, removes duplicate entries and
inserts xml:id attributes
:)
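declare updating function keeleleek:create-ft-indices-for-each-lang(
  $db-names as xs:string*,
  $lang-codes as xs:string*
) {
  for $db-name in $db-names
  let $langs := if( not( empty( $lang-codes )))
    then( $lang-codes )
    (: else branch missing in the archived message; reconstructed from the
       description above: use every xml:lang value found in the database :)
    else( distinct-values( db:open($db-name)//@xml:lang ) )
  for $lang in $langs
  let $ft-db-name := concat($db-name, '-ft-', $lang)
  (: assumption: the definition of $lang-group is also missing; here it is
     taken to be every element tagged with the current language :)
  let $lang-group := db:open($db-name)//*[@xml:lang = $lang]
  (: create full-text db for each language :)
  return
    db:create(
      $ft-db-name,
      <texts>{$lang-group}</texts>,
      $ft-db-name,
      map { 'ftindex': true(), 'language': $lang }
    )
};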
Cheers
Kristian K
Hi,
After reading Christian's answer ( :-) ), I thought it could be interesting
----------------------------------
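distinct-values(
  (: argument missing in the archived message; assumed here: every
     xml:lang value found in the XML files of the directory :)
  file:children('/Users/xavier/Desktop/')[matches(.,'xml$')] ! doc(.)//@xml:lang
)
!
db:create(
  'db-' || .,
  <root xml:lang="{.}">
  {
    (: the body of the return clause is missing as well; assumed here:
       all elements tagged with the current language :)
    let $lang := .
    for $file in file:children('/Users/xavier/Desktop/')[matches(.,'xml$')]
    return doc($file)//*[@xml:lang = $lang]
  }
  </root>,
  "myfile",
  map { 'ftindex': true(), 'language': . }
)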
----------------------------------
Hi Kristian,
It is currently not possible to work with different languages in a
single database. This is mostly because all normalized tokens will end
up in the same internal index, and it would be a lot of effort to
diversify this software behavior.
As Xavier pointed out (thanks!), the best way indeed is to create
different databases, one per language. The following example has been
inspired by Xavier's proposal; it groups all files by their language:
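for $path-group in file:children('input-dir')
where ends-with($path-group, '.xml')
(: group by clause missing in the archived message; reconstructed on the
   assumption that the first xml:lang attribute decides a file's language :)
group by $lang := (doc($path-group)//@xml:lang)[1]
return db:create(
  'db-' || $lang,
  $path-group,
  (),
  map { 'ftindex': true(), 'language': $lang }
)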
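Once the per-language databases exist, each one can be searched with the
matching language settings. A minimal sketch, assuming a database named
'db-en' was produced for English and that an older BaseX release is in use
(db:open; newer releases call it db:get). The search term is made up for
illustration:

```xquery
(: Assumption: 'db-en' is one of the databases created per language. :)
for $hit in db:open('db-en')//*[text() contains text 'houses'
  using stemming using language 'en']
return $hit
```

If the database was built without the 'stemming' option, the query should
still run, but probably without help from the full-text index.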
Hope this helps,
Christian
On Tue, Jun 27, 2017 at 5:19 PM, Xavier-Laurent SALVADOR wrote:
Hi Kristian,
This is useful for automatically creating databases according to the
xml:lang attribute:
let $dir := '/Users/me/myDesktop/'
for $file in file:list($dir)[matches(.,'xml')]
return
return
true(),'language':$flag })
Or you can call ft:tokenize on your string, passing map {'language': $flag}
as options in your query.
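For the ft:tokenize route, a rough sketch could look like this. $flag is set
by hand here, and the sample sentence and the 'stemming' option are
assumptions added for illustration:

```xquery
(: Assumption: $flag holds the language code taken from xml:lang. :)
let $flag := 'en'
return ft:tokenize(
  'The houses were built',
  map { 'language': $flag, 'stemming': true() }
)
```

The result is the sequence of normalized tokens, which can then be compared
against tokens produced with the same options.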
Hope I understood the problem :) Else return 'sorry'
Kristian Kankainen wrote:
Hello
I have documents with text in several languages. When creating a database
in BaseX I can choose *one* language for stemming for the full-text search
index. Is there a way BaseX could lemmatize according to the element's
xml:lang attribute?
Best regards
Kristian K
--
This email may contain material for the sole use of the intended
recipient. Any forwarding without express permission is prohibited. If you
are not the intended recipient, please contact the sender and delete all
copies.