Target Audience - Aisera Admin
Audience
- IT Administrator
Prerequisites
- None
Steps to Install/Setup
- Log in to the AISM UI using the provided credentials, e.g. login.XXX.aisera.com
- Select ‘Datasources’ from the drop-down menu
- Click the ‘New Data Source’ button and select ‘Crawler’
- Click ‘Next’ to set up general parameters
| Name | Description |
| --- | --- |
| Name | The name of this data source. |
| Function | The function of this data source. |
| Schedule | The schedule for the data source. It can be run on demand or on a set cadence, such as daily or weekly. |
| Language | The language of the crawled content. |
| Description | A description of this data source. |
- Click ‘Next’ to set the configuration
| Name | Description |
| --- | --- |
| Crawl Type | Sets the type of crawl desired. Below are the values to choose from: |
| Start URLs | A single URL, or a list of URLs, where the crawl should start. All URLs should be from the same domain. |
| Sitemap URLs | A list of URLs pointing to the sitemaps whose URLs you want to crawl. You can also point to a robots.txt file, and it will be parsed to extract the sitemap URLs from it. Set Crawl Type to SitemapCrawl. |
| Folder Path | The path to the S3 folder location. |
| Javascript | JavaScript can be rendered using the Node Service. Options are below: |
| Title XPath | The XPath for the title in a document. |
| Content XPath | The XPath for the content in a document. |
| Maximum Docs | The limit on the maximum number of documents to crawl. Set this field to null to crawl the whole website. |
| Include URLs | A single regular expression, or a list of them, that the (absolute) URLs must match in order to be extracted (see the sketch after this table). |
| Exclude URLs | A single regular expression, or a list of them, that the (absolute) URLs must match in order to be excluded. |
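Before saving, it can help to sanity-check the XPaths and URL patterns you intend to enter. The following is a minimal, illustrative sketch only; the URL, XPaths, and patterns are hypothetical placeholders rather than Aisera defaults, and it assumes Python with the requests and lxml packages installed. It approximates the Include/Exclude semantics described above.

```python
import re
import requests
from lxml import html

# Hypothetical placeholders -- substitute the values you plan to configure.
START_URL = "https://support.example.com/kb/getting-started"
TITLE_XPATH = "//h1/text()"
CONTENT_XPATH = "//div[@class='article-body']"
INCLUDE_URLS = [r"https://support\.example\.com/kb/.*"]
EXCLUDE_URLS = [r".*\?print=1$"]

def url_allowed(url: str) -> bool:
    """Approximates the semantics above: a URL is extracted only if it
    matches an Include pattern and matches no Exclude pattern."""
    included = any(re.match(p, url) for p in INCLUDE_URLS)
    excluded = any(re.match(p, url) for p in EXCLUDE_URLS)
    return included and not excluded

resp = requests.get(START_URL, timeout=10)
tree = html.fromstring(resp.content)

print("URL allowed: ", url_allowed(START_URL))
print("Title matches:", tree.xpath(TITLE_XPATH))
print("Content nodes:", len(tree.xpath(CONTENT_XPATH)))
```

An empty title list or zero content nodes usually means the XPath needs adjusting before you run the crawl.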
- Click ‘Next’ to specify the Advanced Configuration parameters
| Name | Description |
| --- | --- |
| Custom Parsing | Select if a custom Python script is needed for crawling. If selected, Title XPath and Content XPath are ignored. |
| Allowed Domains | A list of domains to be considered when extracting links. By default, the domain of the base URL is used. |
| Excluded Domains | A list of domains to be excluded when extracting links. |
| Element XPaths | A list of XPaths specifying the areas inside the response where links should be extracted. If specified, only the HTML elements selected by those XPaths will be scanned for links. |
| Canonicalize URLs | Used for duplicate checking. |
| User Agent | Some websites block page crawling. You can change the user agent to avoid this issue. For example: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36 |
| Download Delay | The delay in seconds used to throttle the number of requests Scrapy makes. |
| Message Type | The type of content being crawled. Below are the options to select: |
| Max Retries | The number of times a failed request will be retried. |
| Tag Maps | Custom parsing and mapping of unknown HTML tags to known tags. |
| Optional JSON | Some websites need special parameters. In this case, you can provide the name-value pairs here in JSON format (see the example after this table). |
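For the Optional JSON field, the value is a plain JSON object of name-value pairs. The keys below are purely hypothetical illustrations of the shape, not documented Aisera parameters; use whatever names the target site actually requires.

```json
{
  "accept_language": "en-US",
  "session_cookie": "portal_session=abc123"
}
```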
- Click ‘Next’ to define the Ingestion configuration
| Name | Description |
| --- | --- |
| Ignore Unsupported Documents | If selected, all documents in unsupported formats are ignored; otherwise they are parsed as plain text. |
| Filter Out Non-ASCII Characters | Indicates whether non-ASCII characters are removed during parsing (applicable only to Western character sets). |
| Detect Sections Using Font Properties | Use font properties to delineate the start of sections. |
| Add Contents Section | If selected, a section listing the top-level sections is created for applicable documents. |
| Merge Sections With Similar Subjects | Indicates whether consecutive sections with similar subjects should be joined. |
| Use Predefined Section Titles | A comma-separated list of words, sentences, or regular expressions to be used as section titles for every matching line of text; a value of {} activates the default section titles. |
| Show Sections As Images | If selected, each page of the parsed document becomes a section, ignoring any section limits within the page. |
| Copy Images During Parsing | If selected, images are copied from the source to an Aisera server for faster presentation. |
| Renamed HTML Tags | Specifies HTML tags to be replaced before parsing. For example, to replace `<span>` with `<p>`, use `tag=span,replacetag=p` (see the sketch after this table). |
| HTML Parameters | Custom parameters applicable to HTML document parsing. |
| PDF Parameters | Custom parameters applicable to PDF document parsing. |
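To see what the Renamed HTML Tags setting does conceptually, the sketch below renames `<span>` elements to `<p>` before parsing, mirroring the `tag=span,replacetag=p` example. This is an illustration in Python using lxml, not Aisera's actual implementation.

```python
from lxml import html

# A fragment whose paragraphs are wrapped in <span> instead of <p>.
fragment = html.fromstring("<div><span>Step one.</span><span>Step two.</span></div>")

# Equivalent in spirit to tag=span,replacetag=p: rename each <span> to <p>.
for el in list(fragment.iter("span")):
    el.tag = "p"

print(html.tostring(fragment).decode())
# -> <div><p>Step one.</p><p>Step two.</p></div>
```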
- Click ‘OK’ to complete the process. You will be redirected back to Aisera’s admin dashboard, where you should see a success message
- Navigate to ‘Aisera Apps’ and click the app you want to attach the data source to
- Click ‘Add Data Source’ and select your preferred data source.
- Click OK to confirm your selection.
- Navigate back to the data sources and click the newly created data source. You will now see the ‘Play’ icon, which you can use to crawl documents from your portal/website.
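If the crawl finishes without documents, a frequent cause is the site blocking the crawler's default user agent (see the User Agent setting above). As a quick, illustrative check from your own machine, assuming Python with the requests package (the URL is a hypothetical placeholder):

```python
import requests

url = "https://support.example.com/"  # hypothetical start URL
headers = {
    # The browser-like user agent suggested in the Advanced Configuration table.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36"
}

resp = requests.get(url, headers=headers, timeout=10)
print(resp.status_code)  # 200 suggests the page is reachable with this user agent
```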