Blueprints/FAL-IndexerRework

Blueprint: FAL Indexer refactoring

Proposal Refactor the FAL Indexer
Owner/Starter Steffen Ritter
Participants/Members -
Status Draft, Discussion, Voting Phase, Accepted, Declined, Withdrawn
Current Progress Unknown, Started, Good Progress, Bad Progress, Stalled, Review Needed
Topic for Gerrit falindexer

Target Versions/Milestones

  • Started during TYPO3 CMS 6.2 development

Goals / Motivation

A central pillar of the File Abstraction Layer (FAL) is the assumption that there is an index which always reflects the state of the file systems. If you work in the file module, the index is kept up to date transparently. If your editors have FTP access or you are working with external systems, files change without FAL noticing, and the index has to be brought back in sync by the indexer.

The current indexer has some severe problems. It queries all information about all files at once, including the content hash, which means retrieving the file content. This creates a massive I/O load that is not needed and can be very expensive for remote storages. In addition, extraction services can currently only hook in via a signal which knows about the file object; the service then has to take care of persistence itself. Furthermore, not every service makes sense for every file: Tika document indexing for remote files would not be sensible, as every file would have to be downloaded first. Finally, the concept gets even more complicated because there is a sys_file_metadata table which is versionable and translatable.

A last oddity of the current concept: parts of the indexing logic reside in the scheduler task, outside the File Abstraction Layer. This logic should move into the File Abstraction Layer and only be called from the scheduler task, so that it is usable from other places, too.

Concept

The concept consists of several parts:

  1. Define an interface for metadata extractors and a registry object
  2. Split indexing into two phases
  3. Rework the indexer according to that split
  4. Have separate scheduler tasks for the phases and storages
  5. Improve the Storage/Driver API so that recursive file system information can be retrieved without querying ALL information at once.

Implementation Details

1: Extractor Interface and Registry

The ExtractorRegistry is used to register extractor classes which implement the interface. The interface lets an extractor define a priority as well as restrictions on supported drivers and supported file types. It features the methods canProcess and extractMetaData. Extractors have to return the extracted data as a record array whose keys are the corresponding database fields of sys_file_metadata.

PHP script:
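namespace TYPO3\CMS\Core\Resource\Index;

// Needed so that the Resource\File type hints below resolve
use TYPO3\CMS\Core\Resource;

interface ExtractorInterface {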

	/**
	 * Returns an array of supported file types;
	 * An empty array indicates all filetypes
	 *
	 * @return array
	 */
	public function getFileTypeRestrictions();


	/**
	 * Get all supported driver classes
	 *
	 * Some extractors may only work for local files, while others are
	 * made especially for grabbing data from remote storages.
	 *
	 * Returns an array of strings with the names of the supported drivers.
	 * If a driver did not register a name, its class name is used.
	 * An empty array indicates no restrictions.
	 *
	 * @return array
	 */
	public function getDriverRestrictions();

	/**
	 * Returns the priority of the extraction Service
	 * Should be between 1 and 100
	 *
	 * @return int
	 */
	public function getPriority();

	/**
	 * Checks if the given file can be processed by this Extractor
	 *
	 * @param Resource\File $file
	 * @return boolean
	 */
	public function canProcess(Resource\File $file);

	/**
	 * The actual processing TASK
	 *
	 * Should return an array with database properties for sys_file_metadata to write
	 *
	 * @param Resource\File $file
	 * @return array
	 */
	public function extractMetaData(Resource\File $file);


}
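
To illustrate the contract, here is a minimal sketch of what an extractor could look like. The File class and the interface copy in this snippet are simplified stand-ins so that it runs without TYPO3, and FileNameTitleExtractor is a hypothetical example, not part of the proposal. It assumes the file name is available via getProperty('name') and that "title" is a field of sys_file_metadata.

```php
<?php
// Stand-in for TYPO3\CMS\Core\Resource\File (assumption, only for this sketch)
class File {
	protected $properties;

	public function __construct(array $properties) {
		$this->properties = $properties;
	}

	public function getProperty($key) {
		return isset($this->properties[$key]) ? $this->properties[$key] : NULL;
	}
}

// Trimmed copy of the proposed interface, type-hinted against the stand-in
interface ExtractorInterface {
	public function getFileTypeRestrictions();
	public function getDriverRestrictions();
	public function getPriority();
	public function canProcess(File $file);
	public function extractMetaData(File $file);
}

// Hypothetical extractor: derives a human-readable title from the file name
class FileNameTitleExtractor implements ExtractorInterface {

	// Empty array: all file types are supported
	public function getFileTypeRestrictions() {
		return array();
	}

	// Empty array: works with every driver, since only the name is needed
	public function getDriverRestrictions() {
		return array();
	}

	// Low priority, so more specific extractors can overwrite the title
	public function getPriority() {
		return 10;
	}

	public function canProcess(File $file) {
		return (string)$file->getProperty('name') !== '';
	}

	// Keys of the returned array are sys_file_metadata field names
	public function extractMetaData(File $file) {
		$baseName = pathinfo($file->getProperty('name'), PATHINFO_FILENAME);
		return array('title' => ucfirst(str_replace(array('_', '-'), ' ', $baseName)));
	}
}
```

Because such an extractor never touches the file content, it would be cheap even for remote storages, which is why it declares no driver restrictions.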

The registry currently allows retrieving all extractors as well as only those extractors which support a specific driver. Since the indexing scheduler tasks are arranged by storage, each task only needs the set of extractors matching its driver.

PHP script:
class ExtractorRegistry implements \TYPO3\CMS\Core\SingletonInterface {

	/**
	 * Registered ClassNames
	 * @var array
	 */
	protected $extractors = array();

	/**
	 * Instance Cache for Extractors
	 *
	 * @var ExtractorInterface[]
	 */
	protected $instances = NULL;

	/**
	 * Returns an instance of this class
	 *
	 * @return ExtractorRegistry
	 */
	public static function getInstance() {
		return \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance('TYPO3\\CMS\\Core\\Resource\\Index\\ExtractorRegistry');
	}

	/**
	 * Allows to register MetaData extraction to the FAL Indexer
	 *
	 * @param string $className
	 * @throws \RuntimeException
	 */
	public function registerExtractionService($className) {
		if (!class_exists($className)) {
			throw new \RuntimeException('The Class you are registering is not available');
		} elseif (!in_array('TYPO3\\CMS\\Core\\Resource\\Index\\ExtractorInterface', class_implements($className))) {
			throw new \RuntimeException('The extractor needs to implement the ExtractorInterface');
		} else {
			$this->extractors[] = $className;
			// Reset the instance cache so a late registration is picked up by getExtractors()
			$this->instances = NULL;
		}
	}

	/**
	 * Get all registered extractors
	 *
	 * @return ExtractorInterface[]
	 */
	public function getExtractors() {
		if ($this->instances === NULL) {
			$this->instances = array();
			foreach ($this->extractors as $className) {
				/** @var ExtractorInterface $object */
				$object = \TYPO3\CMS\Core\Utility\GeneralUtility::makeInstance($className);
				// Do not use the priority as array key: two extractors
				// sharing a priority would overwrite each other
				$this->instances[] = $object;
			}
			// Highest priority first
			usort($this->instances, function ($a, $b) {
				return $b->getPriority() - $a->getPriority();
			});
		}
		return $this->instances;
	}

	/**
	 * Get Extractors which work for a special driver
	 *
	 * @param string $driverType
	 * @return ExtractorInterface[]
	 */
	public function getExtractorsWithDriverSupport($driverType) {
		$allExtractors = $this->getExtractors();

		$filteredExtractors = array();
		foreach ($allExtractors as $key => $extractorObject) {
			$driverRestrictions = $extractorObject->getDriverRestrictions();
			// No restrictions at all, or the given driver is explicitly supported
			if (count($driverRestrictions) === 0 || in_array($driverType, $driverRestrictions)) {
				$filteredExtractors[$key] = $extractorObject;
			}
		}
		return $filteredExtractors;
	}

}

2/3: Phase Splitting

Currently everything is done "on the fly", which can be quite slow. In the future, only the things FAL requires in order to work are done directly, which is the information in the sys_file table. The first indexing step always stores an update timestamp in the database.

The following happens here:

  1. Get all file identifiers from the storage
  2. Get the modification time for each file identifier from the storage
  3. Check whether the modification time in the file system is higher than the one in the database (a database key on the timestamp is needed)
  4. If so, add the file to a todo list
  5. Go over the todo list
    1. Create the file object from the index
    2. Update all the information from the file system
    3. Update the file object and push it to IndexRepository->update
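
The detection part of these steps (comparing modification times and collecting the todo list) could be sketched roughly as follows. The function name buildTodoList and the plain arrays standing in for the storage and the index are assumptions made for illustration. Treating files that are missing from the index as outdated is also an assumption, since the steps above only mention the timestamp comparison.

```php
<?php
/**
 * Collects the identifiers of files that need to be (re-)indexed.
 *
 * @param array $storageModificationTimes identifier => modification time reported by the storage
 * @param array $indexTimestamps identifier => update timestamp stored in the index
 * @return array list of file identifiers to process
 */
function buildTodoList(array $storageModificationTimes, array $indexTimestamps) {
	$todoList = array();
	foreach ($storageModificationTimes as $identifier => $modificationTime) {
		// Assumption: a file unknown to the index always needs indexing
		$isNew = !isset($indexTimestamps[$identifier]);
		// Changed in the file system after the last index update
		if ($isNew || $modificationTime > $indexTimestamps[$identifier]) {
			$todoList[] = $identifier;
		}
	}
	return $todoList;
}

// Usage: only the changed and the new file end up on the todo list
$todoList = buildTodoList(
	array('/a.jpg' => 100, '/b.jpg' => 250, '/c.jpg' => 300),
	array('/a.jpg' => 200, '/b.jpg' => 200)
);
```

Only the identifiers and modification times are needed at this point; the expensive information (such as the content hash) is fetched later, and only for the files on the todo list.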

The second step is extraction, and it only works asynchronously via scheduler tasks. The task queries all files in the storage whose last index timestamp (which needs to be introduced) is lower than the update timestamp, i.e. files which changed since their metadata was last extracted. For all those files all matching extractors are executed; extractors with higher priority overwrite the metadata of extractors with lower priority.

The merged result is then stored back in the database and the indexed timestamp is updated. All metadata will only be stored to the live record of language 0.
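
The merge rule could look roughly like the following sketch. The function name mergeExtractorResults and the input format (a list of arrays with "priority" and "data" keys) are assumptions for illustration. In the real indexer the data would come from calling extractMetaData on each matching extractor.

```php
<?php
/**
 * Merges the results of several extractors into one metadata record.
 *
 * @param array $results list of array('priority' => int, 'data' => array)
 * @return array merged record, keyed by sys_file_metadata field
 */
function mergeExtractorResults(array $results) {
	// Lowest priority first, so that results merged later
	// (higher priority) overwrite the earlier ones
	usort($results, function ($a, $b) {
		return $a['priority'] - $b['priority'];
	});
	$merged = array();
	foreach ($results as $result) {
		// array_merge lets later values win for identical string keys
		$merged = array_merge($merged, $result['data']);
	}
	return $merged;
}

// Usage: the title of the priority 50 extractor wins, the width is kept
$merged = mergeExtractorResults(array(
	array('priority' => 50, 'data' => array('title' => 'From EXIF')),
	array('priority' => 10, 'data' => array('title' => 'From file name', 'width' => 800)),
));
```

Fields which only a low-priority extractor provides survive the merge, so a cheap fallback extractor never hides behind a more specific one; it just fills the gaps.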

4: Separate Scheduler Tasks

Some storages do not need to be indexed (because they are only managed via the file module), and for some storages no metadata needs to be extracted, so there should be different tasks for the two phases. In addition, the extraction tasks are configurable to restrict the number of files processed in one run. With that, admins are able to control the load on network and I/O according to their needs.
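
How an extraction task could pick its batch is sketched below. It assumes that a file needs extraction when its update timestamp is newer than the timestamp of its last extraction run. The function name selectFilesForExtraction and the column name last_indexed are hypothetical; tstamp stands for the update timestamp written by the first phase. In practice this would rather be a database query with a LIMIT clause than a loop over rows.

```php
<?php
/**
 * Selects at most $limit file rows whose metadata is outdated.
 *
 * @param array $fileRows rows with the keys 'uid', 'tstamp' and 'last_indexed'
 * @param int $limit maximum number of files to process in one run
 * @return array the rows to hand over to the extractors
 */
function selectFilesForExtraction(array $fileRows, $limit) {
	$outstanding = array();
	foreach ($fileRows as $row) {
		// Updated by the first phase after the last extraction run
		if ($row['tstamp'] > $row['last_indexed']) {
			$outstanding[] = $row;
		}
	}
	// Cap the batch so that one run creates a bounded load
	return array_slice($outstanding, 0, $limit);
}

// Usage: three files are outdated, but only two are processed in this run
$batch = selectFilesForExtraction(array(
	array('uid' => 1, 'tstamp' => 500, 'last_indexed' => 0),
	array('uid' => 2, 'tstamp' => 500, 'last_indexed' => 600),
	array('uid' => 3, 'tstamp' => 700, 'last_indexed' => 600),
	array('uid' => 4, 'tstamp' => 900, 'last_indexed' => 100),
), 2);
```

Files left over in one run are simply picked up by the next one, because their indexed timestamp has not been updated yet.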

5: Adapt Storage/Driver API

The Storage and Driver API currently do not support what the indexer needs, so this has to be added. Most importantly, there is currently no way to get all file identifiers of a storage or folder recursively; a getAllFileIdentifiers method needs to be added.
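
For a local file system, such a method could be sketched with PHP's SPL iterators as shown below. The signature (a base path as the only parameter) and the identifier format (path relative to the base path, with a leading slash) are assumptions; the actual driver method would be defined as part of this rework. The point of the sketch is that only identifiers are collected, without reading sizes, content hashes or other expensive information.

```php
<?php
/**
 * Returns the identifiers of all files below $basePath, recursively.
 *
 * @param string $basePath absolute path of the storage root
 * @return array sorted list of identifiers such as "/folder/file.txt"
 */
function getAllFileIdentifiers($basePath) {
	$basePath = rtrim($basePath, '/');
	// SKIP_DOTS leaves out "." and ".."; the default iteration mode
	// of RecursiveIteratorIterator only yields leaves, not folders
	$iterator = new RecursiveIteratorIterator(
		new RecursiveDirectoryIterator($basePath, FilesystemIterator::SKIP_DOTS)
	);
	$identifiers = array();
	foreach ($iterator as $fileInfo) {
		if ($fileInfo->isFile()) {
			// Strip the base path so the identifier is relative to the storage
			$identifiers[] = substr($fileInfo->getPathname(), strlen($basePath));
		}
	}
	sort($identifiers);
	return $identifiers;
}
```

A remote driver would have to implement the same method with whatever listing call its backend offers, ideally one that returns names without transferring file contents.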

With that addition we are also able to fix https://forge.typo3.org/issues/52778

This will be a breaking change for drivers, but that is not a problem since backwards compatibility for drivers has already been broken.

Risks

  1. Custom drivers need to be adapted
    This is already the case; the refactoring just gets bigger
  2. Current signals work differently
  3. People forget to configure the scheduler tasks
  4. Old scheduler tasks become invalid and need to be recreated manually

Issues and reviews

Dependencies upon other Blueprints

External links for clarification of technologies