# Embed
[![Latest Version on Packagist][ico-version]][link-packagist]
[![Total Downloads][ico-downloads]][link-packagist]
[![Monthly Downloads][ico-m-downloads]][link-packagist]
[![Software License][ico-license]](LICENSE)
PHP library to get information from any web page (using oembed, opengraph, twitter-cards, scraping the html, etc). It's compatible with any web service (youtube, vimeo, flickr, instagram, etc) and has adapters for some sites (archive.org, github, facebook, etc).
Requirements:
* PHP 7.4+
* Curl library installed
* PSR-17 implementation. By default these libraries are detected automatically:
  * [laminas/laminas-diactoros](https://github.com/laminas/laminas-diactoros)
  * [guzzle/psr7](https://github.com/guzzle/psr7)
  * [nyholm/psr7](https://github.com/Nyholm/psr7)
  * [sunrise/http-message](https://github.com/sunrise-php/http-message)
> If you need PHP 5.5-7.3 support, [use the 3.x version](https://github.com/oscarotero/Embed/tree/v3.x)
## Online demo
http://oscarotero.com/embed/demo
## Video Tutorial
https://youtu.be/4YCLRpKY1cs
## Installation
This package is installable and autoloadable via Composer as [embed/embed](https://packagist.org/packages/embed/embed).
```
$ composer require embed/embed
```
## Usage
```php
use Embed\Embed;
$embed = new Embed();
//Load any url:
$info = $embed->get('https://www.youtube.com/watch?v=PP1xn5wHtxE');
//Get content info
$info->title; //The page title
$info->description; //The page description
$info->url; //The canonical url
$info->keywords; //The page keywords
$info->image; //The thumbnail or main image
$info->code->html; //The code to embed the image, video, etc
$info->code->width; //The exact width of the embed code (if exists)
$info->code->height; //The exact height of the embed code (if exists)
$info->code->ratio; //The aspect ratio as a percentage (height / width * 100)
$info->authorName; //The resource author
$info->authorUrl; //The author url
$info->cms; //The cms used
$info->language; //The language of the page
$info->languages; //The alternative languages
$info->providerName; //The provider name of the page (Youtube, Twitter, Instagram, etc)
$info->providerUrl; //The provider url
$info->icon; //The big icon of the site
$info->favicon; //The favicon of the site (an .ico file or a png with up to 32x32px)
$info->publishedTime; //The published time of the resource
$info->license; //The license url of the resource
$info->feeds; //The RSS/Atom feeds
```
## Parallel multiple requests
```php
use Embed\Embed;
$embed = new Embed();
//Load multiple urls asynchronously:
$infos = $embed->getMulti(
'https://www.youtube.com/watch?v=PP1xn5wHtxE',
'https://twitter.com/carlosmeixidefl/status/1230894146220625933',
'https://en.wikipedia.org/wiki/Tordoia',
);
foreach ($infos as $info) {
echo $info->title;
}
```
## Document
The document is the object that stores the html code of the page. You can use it to extract extra info from the html code:
```php
//Get the document object
$document = $info->getDocument();
$document->link('image_src'); //Returns the href of a <link> element
$document->getDocument(); //Returns the DOMDocument instance
$html = (string) $document; //Returns the html code
$document->select('.//h1'); //Search
```
You can perform xpath queries in order to select specific elements. A search always returns an instance of `Embed\QueryResult`:
```php
//Search the A elements
$result = $document->select('.//a');
//Filter the results
$result->filter(fn ($node) => $node->getAttribute('href'));
$id = $result->str('id'); //Return the id of the first result as string
$text = $result->str(); //Return the content of the first result
$ids = $result->strAll('id'); //Return an array with the ids of all results as string
$texts = $result->strAll(); //Return an array with the content of all results as string
$tabindex = $result->int('tabindex'); //Return the tabindex attribute of the first result as integer
$number = $result->int(); //Return the content of the first result as integer
$href = $result->url('href'); //Return the href attribute of the first result as url (converts relative urls to absolutes)
$url = $result->url(); //Return the content of the first result as url
$node = $result->node(); //Return the first node found (DOMElement)
$nodes = $result->nodes(); //Return all nodes found
```
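To make the query helpers above more concrete, here is a minimal standalone sketch of the same ideas using only PHP's DOM extension. It does not use any Embed class: the HTML string and the variable names are assumptions made for illustration, and the real `QueryResult` may behave differently in its details.

```php
// Standalone sketch: plain DOMDocument/DOMXPath, no Embed classes involved.
$html = '<html><body><h1 id="main">Hello</h1>'
    . '<a href="/about" id="first">About</a>'
    . '<a id="second">No link</a></body></html>';

$dom = new DOMDocument();
// Silence the warnings libxml emits for imperfect real-world html
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Similar in spirit to $document->select('.//a')
$nodes = iterator_to_array($xpath->query('.//a'), false);

// Similar in spirit to filter(): keep only the anchors that have an href
$withHref = array_values(array_filter(
    $nodes,
    fn (DOMElement $node) => $node->getAttribute('href') !== ''
));

// Similar in spirit to str('id') and strAll('id')
$firstId = $nodes[0]->getAttribute('id');
$allIds = array_map(fn (DOMElement $node) => $node->getAttribute('id'), $nodes);
```

In this sketch `$firstId` holds the id of the first anchor and `$allIds` the ids of every anchor, which is the distinction between the single-result and the `...All()` helpers.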
## Metas
For convenience, the `Metas` object stores the values of all `<meta>` elements located in the html, so you can get the values more easily. The key of every meta is taken from the `name`, `property` or `itemprop` attribute and the value is taken from `content`.
```php
//Get the Metas object
$metas = $info->getMetas();
$metas->all(); //Return all values
$metas->get('og:title'); //Return a key value
$metas->str('og:title'); //Return the value as string (remove html tags)
$metas->html('og:description'); //Return the value as html
$metas->int('og:video:width'); //Return the value as integer
$metas->url('og:url'); //Return the value as full url (converts relative urls to absolutes)
```
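The key/value rule described above can be sketched without the library. The following standalone example follows the same steps as the loop in `src/Metas.php` (first non-empty of `name`, `property`, `itemprop` as key, lowercased, with every `content` value collected in an array); the HTML input and variable names are assumptions for illustration only.

```php
// Standalone sketch: collect <meta> values keyed by name/property/itemprop.
$html = '<html><head>'
    . '<meta name="Description" content="A page">'
    . '<meta property="og:title" content="Title one">'
    . '<meta property="og:title" content="Title two">'
    . '<meta itemprop="image" content="/img.png">'
    . '<meta name="empty" content="">'
    . '</head><body></body></html>';

$dom = new DOMDocument();
// Silence the warnings libxml emits for imperfect real-world html
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$data = [];

foreach ($xpath->query('.//meta') as $node) {
    // The first non-empty attribute wins as the key
    $type = $node->getAttribute('name')
        ?: $node->getAttribute('property')
        ?: $node->getAttribute('itemprop');
    $value = $node->getAttribute('content');

    // Metas without a key or without content are skipped
    if (!empty($type) && !empty($value)) {
        // Keys are lowercased; repeated keys accumulate their values
        $data[strtolower($type)][] = $value;
    }
}
```

Because every key maps to an array, a page that repeats a meta (two `og:title` elements here) keeps both values, and helpers like `str()` can then pick the first one.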
## OEmbed
In addition to the html and metas, this library uses [oEmbed](https://oembed.com/) endpoints to get additional data. You can get this data as follows:
```php
//Get the oEmbed object
$oembed = $info->getOEmbed();
$oembed->all(); //Return all raw data
$oembed->get('title'); //Return a key value
$oembed->str('title'); //Return the value as string (remove html tags)
$oembed->html('html'); //Return the value as html
$oembed->int('width'); //Return the value as integer
$oembed->url('url'); //Return the value as full url (converts relative urls to absolutes)
```
Additional oEmbed parameters (like Instagram's `hidecaption`) can also be provided:
```php
$embed = new Embed();
$info = $embed->get('https://www.instagram.com/p/B_C0wheCa4V/');
$info->setSettings([
    'oembed:query_parameters' => ['hidecaption' => true]
]);
$oembed = $info->getOEmbed();
```
## LinkedData
Another API available by default, used to extract info using the [JsonLD](https://www.w3.org/TR/json-ld/) schema.
```php
//Get the linkedData object
$ld = $info->getLinkedData();
$ld->all(); //Return all data
$ld->get('name'); //Return a key value
$ld->str('name'); //Return the value as string (remove html tags)
$ld->html('description'); //Return the value as html
$ld->int('width'); //Return the value as integer
$ld->url('url'); //Return the value as full url (converts relative urls to absolutes)
```
## Other APIs
Some sites like Wikipedia or Archive.org provide a custom API that is used to fetch more reliable data. You can get the API object with the method `getApi()`, but note that not all results have this method. The API object has the same methods as oEmbed:
```php
//Get the API object
$api = $info->getApi();
$api->all(); //Return all raw data
$api->get('title'); //Return a key value
$api->str('title'); //Return the value as string (remove html tags)
$api->html('html'); //Return the value as html
$api->int('width'); //Return the value as integer
$api->url('url'); //Return the value as full url (converts relative urls to absolutes)
```
## Extending Embed
Depending on your needs, you may want to extend this library with extra features or change the way it performs some operations.
### PSR
Embed uses some PSR standards to be as interoperable as possible:
- [PSR-7](https://www.php-fig.org/psr/psr-7/) Standard interfaces to represent http requests, responses and uris
- [PSR-17](https://www.php-fig.org/psr/psr-17/) Standard factories to create PSR-7 objects
- [PSR-18](https://www.php-fig.org/psr/psr-18/) Standard interface to send a http request and return a response
Embed comes with a CURL client compatible with PSR-18, but you need to install a PSR-7 / PSR-17 library. [Here you can see a list of popular libraries](https://github.com/middlewares/awesome-psr15-middlewares#psr-7-implementations), and the library automatically detects 'laminas\diactoros', 'guzzleHttp\psr7', 'slim\psr7', 'nyholm\psr7' and 'sunrise\http' (in this order). If you want to use a different PSR implementation, you can do it this way:
```php
use Embed\Embed;
use Embed\Http\Crawler;
$client = new CustomHttpClient();
$requestFactory = new CustomRequestFactory();
$uriFactory = new CustomUriFactory();
//The Crawler is responsible for performing http queries
$crawler = new Crawler($client, $requestFactory, $uriFactory);
//Create an embed instance passing the Crawler
$embed = new Embed($crawler);
```
### Adapters
Some sites have special needs: either they provide public APIs that allow extracting more info (like Wikipedia or Archive.org), or the way the data is extracted needs to change for that particular site. For all those cases there are adapters, which are classes extending the default classes to provide extra functionality.
Before creating an adapter, you need to understand how Embed works: when you execute this code, you get an `Extractor` instance:
```php
//Get the Extractor with all info
$info = $embed->get($url);
//The extractor has document and oembed:
$document = $info->getDocument();
$oembed = $info->getOEmbed();
```
The `Extractor` class has many `Detectors`. Each detector is responsible for detecting a specific piece of info. For example, there's a detector for the title, another for the description, image, code, etc.
So, an adapter is basically an extractor created specifically for a site. It can also contain custom detectors or APIs. You can see all adapters in the `src/Adapters` folder.
If you create an adapter, you also need to register it in Embed, so it knows which website to use it for. To do that, there's the `ExtractorFactory` object, which is responsible for instantiating the right extractor for each site.
```php
use Embed\Embed;
$embed = new Embed();
$factory = $embed->getExtractorFactory();
//Use this MySite adapter for mysite.com
$factory->addAdapter('mysite.com', MySite::class);
//Remove the adapter for pinterest.com, so it will use the default extractor
$factory->removeAdapter('pinterest.com');
//Change the default extractor
$factory->setDefault(CustomExtractor::class);
```
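How the factory picks an adapter can be sketched in a few lines. The example below reproduces the suffix comparison used by `createExtractor()` in `src/ExtractorFactory.php`; the function name `pickAdapter` and the plain string class names are hypothetical and only serve the illustration.

```php
/**
 * Hypothetical helper: returns the adapter registered for a host,
 * or the default when no registered host matches.
 */
function pickAdapter(string $host, array $adapters, string $default): string
{
    foreach ($adapters as $adapterHost => $adapter) {
        // Suffix comparison: "en.wikipedia.org" ends with "wikipedia.org"
        if (substr($host, -strlen($adapterHost)) === $adapterHost) {
            return $adapter;
        }
    }

    return $default;
}

// Placeholder class names, not the real adapter classes
$adapters = [
    'gist.github.com' => 'GistExtractor',
    'github.com' => 'GithubExtractor',
    'wikipedia.org' => 'WikipediaExtractor',
];

$wikipedia = pickAdapter('en.wikipedia.org', $adapters, 'DefaultExtractor');
$gist = pickAdapter('gist.github.com', $adapters, 'DefaultExtractor');
$other = pickAdapter('example.com', $adapters, 'DefaultExtractor');
```

Since the first matching entry wins and the comparison is a plain suffix match, order matters: a more specific host such as `gist.github.com` has to be registered before `github.com`, otherwise the shorter entry would match first.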
### Detectors
Embed comes with several predefined detectors, but you may want to change them or add more. Just create a class extending the `Embed\Detectors\Detector` class and register it in the extractor factory. For example:
```php
use Embed\Embed;
use Embed\Detectors\Detector;
class Robots extends Detector
{
    public function detect(): ?string
    {
        $response = $this->extractor->getResponse();
        $metas = $this->extractor->getMetas();

        //Prefer the http header and fall back to the meta tag
        return $response->getHeaderLine('x-robots-tag')
            ?: $metas->str('robots');
    }
}
//Register the detector
$embed = new Embed();
$embed->getExtractorFactory()->addDetector('robots', Robots::class);
//Use it
$info = $embed->get('http://example.com');
$robots = $info->robots;
```
### Settings
If you need to pass settings to the CurlClient to perform http queries:
```php
use Embed\Embed;
use Embed\Http\Crawler;
use Embed\Http\CurlClient;
$client = new CurlClient();
$client->setSettings([
'cookies_path' => $cookies_path,
'ignored_errors' => [18],
'max_redirs' => 3, // see CURLOPT_MAXREDIRS
'connect_timeout' => 2, // see CURLOPT_CONNECTTIMEOUT
'timeout' => 2, // see CURLOPT_TIMEOUT
'ssl_verify_host' => 2, // see CURLOPT_SSL_VERIFYHOST
'ssl_verify_peer' => 1, // see CURLOPT_SSL_VERIFYPEER
'follow_location' => true, // see CURLOPT_FOLLOWLOCATION
'user_agent' => 'Mozilla', // see CURLOPT_USERAGENT
]);
$embed = new Embed(new Crawler($client));
```
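In the bundled sources, the client's `setSettings()` merges the new values with `$settings + $this->settings`, and the dispatcher reads each option with a `??` fallback. The standalone sketch below illustrates those two mechanisms only; `mergeSettings` and `effectiveTimeout` are hypothetical helper names, not part of the library.

```php
/**
 * Hypothetical helper: same merge expression as setSettings().
 * With the array union operator, keys of the left operand win.
 */
function mergeSettings(array $current, array $new): array
{
    return $new + $current;
}

/**
 * Hypothetical helper: read a setting with a default, the way the
 * dispatcher does for timeout, max_redirs, etc.
 */
function effectiveTimeout(array $settings): int
{
    return $settings['timeout'] ?? 10;
}

$settings = ['timeout' => 10, 'max_redirs' => 5];

//Only "timeout" is overridden; "max_redirs" is kept
$settings = mergeSettings($settings, ['timeout' => 2]);

$timeout = effectiveTimeout($settings);
$defaultTimeout = effectiveTimeout([]);
```

The practical consequence is that calling `setSettings()` on the client several times accumulates options instead of discarding the earlier ones, and any option that was never set falls back to its default.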
If you need to pass settings to your detectors, you can add settings to the `ExtractorFactory`:
```php
use Embed\Embed;
$embed = new Embed();
$embed->setSettings([
'oembed:query_parameters' => [], //Extra parameters sent to oembed
'twitch:parent' => 'example.com', //Required to embed twitch videos as iframe
'facebook:token' => '1234|5678', //Required to embed content from Facebook
'instagram:token' => '1234|5678', //Required to embed content from Instagram
'twitter:token' => 'asdf', //Improve the data from twitter
]);
$info = $embed->get($url);
```
Note: The built-in detectors do not require settings. This feature is only for convenience if you create a specific detector that requires settings.
---
[ico-version]: https://poser.pugx.org/embed/embed/v/stable
[ico-license]: https://poser.pugx.org/embed/embed/license
[ico-downloads]: https://poser.pugx.org/embed/embed/downloads
[ico-m-downloads]: https://poser.pugx.org/embed/embed/d/monthly
[link-packagist]: https://packagist.org/packages/embed/embed
PK 蛫BU2R% % src/ApiTrait.phpnu 炠槧 extractor = $extractor;
}
public function all(): array
{
if (!isset($this->data)) {
$this->data = $this->fetchData();
}
return $this->data;
}
public function get(string ...$keys)
{
$data = $this->all();
foreach ($keys as $key) {
if (!isset($data[$key])) {
return null;
}
$data = $data[$key];
}
return $data;
}
public function str(string ...$keys): ?string
{
$value = $this->get(...$keys);
if (is_array($value)) {
$value = array_shift($value);
}
return $value ? clean((string) $value) : null;
}
public function strAll(string ...$keys): array
{
$all = (array) $this->get(...$keys);
return array_filter(array_map(fn ($value) => clean($value), $all));
}
public function html(string ...$keys): ?string
{
$value = $this->get(...$keys);
if (is_array($value)) {
$value = array_shift($value);
}
return $value ? clean((string) $value, true) : null;
}
public function int(string ...$keys): ?int
{
$value = $this->get(...$keys);
if (is_array($value)) {
$value = array_shift($value);
}
return is_numeric($value) ? (int) $value : null;
}
public function url(string ...$keys): ?UriInterface
{
$url = $this->str(...$keys);
try {
return $url ? $this->extractor->resolveUri($url) : null;
} catch (Throwable $error) {
return null;
}
}
public function time(string ...$keys): ?DateTime
{
$time = $this->str(...$keys);
$datetime = $time ? date_create($time) : null;
if (!$datetime && $time && ctype_digit($time)) {
$datetime = date_create_from_format('U', $time);
}
return ($datetime && $datetime->getTimestamp() > 0) ? $datetime : null;
}
abstract protected function fetchData(): array;
}
PK 蛫BUV8褵 src/HttpApiTrait.phpnu 炠槧 endpoint;
}
private function fetchJSON(UriInterface $uri): array
{
$crawler = $this->extractor->getCrawler();
$request = $crawler->createRequest('GET', $uri);
$response = $crawler->sendRequest($request);
try {
return json_decode((string) $response->getBody(), true) ?: [];
} catch (Exception $exception) {
return [];
}
}
}
PK 蛫BU!S&f f src/functions.phpnu 炠槧 $value) {
if ($value === null) {
continue;
} elseif ($value === true) {
$html .= " $name";
} elseif ($value !== false) {
$html .= ' '.$name.'="'.htmlspecialchars((string) $value).'"';
}
}
if ($tagName === 'img') {
return "{$html} />";
}
return "{$html}>{$content}</{$tagName}>";
}
/**
* Resolve a uri within this document
* (useful to get absolute uris from relative)
*/
function resolveUri(UriInterface $base, UriInterface $uri): UriInterface
{
$uri = $uri->withPath(resolvePath($base->getPath(), $uri->getPath()));
if (!$uri->getHost()) {
$uri = $uri->withHost($base->getHost());
}
if (!$uri->getScheme()) {
$uri = $uri->withScheme($base->getScheme());
}
return $uri
->withPath(cleanPath($uri->getPath()))
->withFragment('');
}
function isHttp(string $uri): bool
{
if (preg_match('/^(\w+):/', $uri, $matches)) {
return in_array(strtolower($matches[1]), ['http', 'https']);
}
return true;
}
function resolvePath(string $base, string $path): string
{
if ($path === '') {
return '';
}
if ($path[0] === '/') {
return $path;
}
if (substr($base, -1) !== '/') {
$position = strrpos($base, '/');
$base = substr($base, 0, $position);
}
$path = "{$base}/{$path}";
$parts = array_filter(explode('/', $path), 'strlen');
$absolutes = [];
foreach ($parts as $part) {
if ('.' == $part) {
continue;
}
if ('..' == $part) {
array_pop($absolutes);
continue;
}
$absolutes[] = $part;
}
return implode('/', $absolutes);
}
function cleanPath(string $path): string
{
if ($path === '') {
return '/';
}
$path = preg_replace('|[/]{2,}|', '/', $path);
if (strpos($path, ';jsessionid=') !== false) {
$path = preg_replace('/^(.*)(;jsessionid=.*)$/i', '$1', $path);
}
return $path;
}
function matchPath(string $pattern, string $subject): bool
{
$pattern = str_replace('\\*', '.*', preg_quote($pattern, '|'));
return (bool) preg_match("|^{$pattern}$|i", $subject);
}
function getDirectory(string $path, int $position): ?string
{
$dirs = explode('/', $path);
return $dirs[$position + 1] ?? null;
}
PK 蛫BU*Oq
src/Metas.phpnu 炠槧 extractor->getDocument();
foreach ($document->select('.//meta')->nodes() as $node) {
$type = $node->getAttribute('name') ?: $node->getAttribute('property') ?: $node->getAttribute('itemprop');
$value = $node->getAttribute('content');
if (!empty($value) && !empty($type)) {
$type = strtolower($type);
$data[$type] ??= [];
$data[$type][] = $value;
}
}
return $data;
}
public function get(string ...$keys)
{
$data = $this->all();
foreach ($keys as $key) {
$values = $data[$key] ?? null;
if ($values) {
return $values;
}
}
return null;
}
}
PK 蛫BU
GzD D
src/Embed.phpnu 炠槧 crawler = $crawler ?: new Crawler();
$this->extractorFactory = $extractorFactory ?: new ExtractorFactory();
}
public function get(string $url): Extractor
{
$request = $this->crawler->createRequest('GET', $url);
$response = $this->crawler->sendRequest($request);
return $this->extract($request, $response);
}
/**
* @return Extractor[]
*/
public function getMulti(string ...$urls): array
{
$requests = array_map(
fn ($url) => $this->crawler->createRequest('GET', $url),
$urls
);
$responses = $this->crawler->sendRequests(...$requests);
$return = [];
foreach ($responses as $k => $response) {
$return[] = $this->extract($requests[$k], $responses[$k]);
}
return $return;
}
public function getCrawler(): Crawler
{
return $this->crawler;
}
public function getExtractorFactory(): ExtractorFactory
{
return $this->extractorFactory;
}
public function setSettings(array $settings): void
{
$this->extractorFactory->setSettings($settings);
}
private function extract(RequestInterface $request, ResponseInterface $response, bool $redirect = true): Extractor
{
$uri = $this->crawler->getResponseUri($response) ?: $request->getUri();
$extractor = $this->extractorFactory->createExtractor($uri, $request, $response, $this->crawler);
if (!$redirect || !$this->mustRedirect($extractor)) {
return $extractor;
}
$request = $this->crawler->createRequest('GET', $extractor->redirect);
$response = $this->crawler->sendRequest($request);
return $this->extract($request, $response, false);
}
private function mustRedirect(Extractor $extractor): bool
{
if (!empty($extractor->getOembed()->all())) {
return false;
}
return $extractor->redirect !== null;
}
}
PK 蛫BU8秹R src/Http/RequestException.phpnu 炠槧 request = $request;
}
public function getRequest(): RequestInterface
{
return $this->request;
}
}
PK 蛫BU4 src/Http/FactoryDiscovery.phpnu 炠槧 responseFactory = $responseFactory ?: FactoryDiscovery::getResponseFactory();
}
public function setSettings(array $settings): void
{
$this->settings = $settings + $this->settings;
}
public function sendRequest(RequestInterface $request): ResponseInterface
{
$responses = CurlDispatcher::fetch($this->settings, $this->responseFactory, $request);
return $responses[0];
}
public function sendRequests(RequestInterface ...$request): array
{
return CurlDispatcher::fetch($this->settings, $this->responseFactory, ...$request);
}
}
PK 蛫BUf霩 src/Http/Crawler.phpnu 炠槧 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:73.0) Gecko/20100101 Firefox/73.0',
'Cache-Control' => 'max-age=0',
];
public function __construct(ClientInterface $client = null, RequestFactoryInterface $requestFactory = null, UriFactoryInterface $uriFactory = null)
{
$this->client = $client ?: new CurlClient();
$this->requestFactory = $requestFactory ?: FactoryDiscovery::getRequestFactory();
$this->uriFactory = $uriFactory ?: FactoryDiscovery::getUriFactory();
}
public function addDefaultHeaders(array $headers): void
{
$this->defaultHeaders = $headers + $this->defaultHeaders;
}
/**
* @param UriInterface|string $uri The URI associated with the request.
*/
public function createRequest(string $method, $uri): RequestInterface
{
$request = $this->requestFactory->createRequest($method, $uri);
foreach ($this->defaultHeaders as $name => $value) {
$request = $request->withHeader($name, $value);
}
return $request;
}
public function createUri(string $uri = ''): UriInterface
{
return $this->uriFactory->createUri($uri);
}
public function sendRequest(RequestInterface $request): ResponseInterface
{
return $this->client->sendRequest($request);
}
public function sendRequests(RequestInterface ...$requests): array
{
if ($this->client instanceof CurlClient) {
return $this->client->sendRequests(...$requests);
}
return array_map(
fn ($request) => $this->client->sendRequest($request),
$requests
);
}
public function getResponseUri(ResponseInterface $response): ?UriInterface
{
$location = $response->getHeaderLine('Content-Location');
return $location ? $this->uriFactory->createUri($location) : null;
}
}
PK 蛫BU`揓2 2 src/Http/NetworkException.phpnu 炠槧 request = $request;
}
public function getRequest(): RequestInterface
{
return $this->request;
}
}
PK 蛫BU src/Http/CurlDispatcher.phpnu 炠槧 exec($responseFactory)];
}
//Init connections
$multi = curl_multi_init();
$connections = [];
foreach ($requests as $request) {
$connection = new static($settings, $request);
curl_multi_add_handle($multi, $connection->curl);
$connections[] = $connection;
}
//Run
$active = null;
do {
$status = curl_multi_exec($multi, $active);
if ($active) {
curl_multi_select($multi);
}
$info = curl_multi_info_read($multi);
if ($info) {
foreach ($connections as $connection) {
if ($connection->curl === $info['handle']) {
$connection->result = $info['result'];
break;
}
}
}
} while ($active && $status == CURLM_OK);
//Close connections
foreach ($connections as $connection) {
curl_multi_remove_handle($multi, $connection->curl);
}
curl_multi_close($multi);
return array_map(
fn ($connection) => $connection->exec($responseFactory),
$connections
);
}
private function __construct(array $settings, RequestInterface $request)
{
$this->request = $request;
$this->curl = curl_init((string) $request->getUri());
$this->settings = $settings;
$cookies = $settings['cookies_path'] ?? str_replace('//', '/', sys_get_temp_dir().'/embed-cookies.txt');
curl_setopt_array($this->curl, [
CURLOPT_HTTPHEADER => $this->getRequestHeaders(),
CURLOPT_POST => strtoupper($request->getMethod()) === 'POST',
CURLOPT_MAXREDIRS => $settings['max_redirs'] ?? 10,
CURLOPT_CONNECTTIMEOUT => $settings['connect_timeout'] ?? 10,
CURLOPT_TIMEOUT => $settings['timeout'] ?? 10,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_SSL_VERIFYHOST => $settings['ssl_verify_host'] ?? 0,
CURLOPT_SSL_VERIFYPEER => $settings['ssl_verify_peer'] ?? false,
CURLOPT_ENCODING => '',
CURLOPT_CAINFO => CaBundle::getSystemCaRootBundlePath(),
CURLOPT_AUTOREFERER => true,
CURLOPT_FOLLOWLOCATION => $settings['follow_location'] ?? true,
CURLOPT_IPRESOLVE => CURL_IPRESOLVE_V4,
CURLOPT_USERAGENT => $settings['user_agent'] ?? $request->getHeaderLine('User-Agent'),
CURLOPT_COOKIEJAR => $cookies,
CURLOPT_COOKIEFILE => $cookies,
CURLOPT_HEADERFUNCTION => [$this, 'writeHeader'],
CURLOPT_WRITEFUNCTION => [$this, 'writeBody'],
]);
}
private function exec(ResponseFactoryInterface $responseFactory): ResponseInterface
{
curl_exec($this->curl);
$info = curl_getinfo($this->curl);
if ($this->error) {
$this->error(curl_strerror($this->error), $this->error);
}
if (curl_errno($this->curl)) {
$this->error(curl_error($this->curl), curl_errno($this->curl));
}
curl_close($this->curl);
$response = $responseFactory->createResponse($info['http_code']);
foreach ($this->headers as $header) {
list($name, $value) = $header;
$response = $response->withAddedHeader($name, $value);
}
$response = $response
->withAddedHeader('Content-Location', $info['url'])
->withAddedHeader('X-Request-Time', sprintf('%.3f ms', $info['total_time']));
if ($this->body) {
//5Mb max
$response->getBody()->write(stream_get_contents($this->body, 5000000, 0));
}
return $response;
}
private function error(string $message, int $code)
{
$ignored = $this->settings['ignored_errors'] ?? null;
if ($ignored === true || (is_array($ignored) && in_array($code, $ignored))) {
return;
}
if ($this->isBinary && $code === CURLE_WRITE_ERROR) {
// The write callback aborted the request to prevent a download of the binary file
return;
}
throw new NetworkException($message, $code, $this->request);
}
private function getRequestHeaders(): array
{
$headers = [];
foreach ($this->request->getHeaders() as $name => $values) {
switch (strtolower($name)) {
case 'user-agent':
break;
default:
$headers[] = $name . ':' . implode(', ', $values);
}
}
return $headers;
}
private function writeHeader($curl, $string): int
{
if (preg_match('/^([\w-]+):(.*)$/', $string, $matches)) {
$name = strtolower($matches[1]);
$value = trim($matches[2]);
$this->headers[] = [$name, $value];
if ($name === 'content-type') {
$this->isBinary = !preg_match('/(text|html|json)/', strtolower($value));
}
} elseif ($this->headers) {
$key = array_key_last($this->headers);
$this->headers[$key][1] .= ' '.trim($string);
}
return strlen($string);
}
private function writeBody($curl, $string): int
{
if ($this->isBinary) {
return -1;
}
if (!$this->body) {
$this->body = fopen('php://temp', 'w+');
}
return fwrite($this->body, $string);
}
}
PK 蛫BU箔穌y y src/EmbedCode.phpnu 炠槧 html = $html;
$this->width = $width;
$this->height = $height;
if ($width && $height) {
$this->ratio = round(($height / $width) * 100, 3);
}
}
public function __toString(): string
{
return $this->html;
}
#[ReturnTypeWillChange]
public function jsonSerialize()
{
return [
'html' => $this->html,
'width' => $this->width,
'height' => $this->height,
'ratio' => $this->ratio,
];
}
}
PK 蛫BU .r src/ExtractorFactory.phpnu 炠槧 Adapters\Slides\Extractor::class,
'pinterest.com' => Adapters\Pinterest\Extractor::class,
'flickr.com' => Adapters\Flickr\Extractor::class,
'snipplr.com' => Adapters\Snipplr\Extractor::class,
'play.cadenaser.com' => Adapters\CadenaSer\Extractor::class,
'ideone.com' => Adapters\Ideone\Extractor::class,
'gist.github.com' => Adapters\Gist\Extractor::class,
'github.com' => Adapters\Github\Extractor::class,
'wikipedia.org' => Adapters\Wikipedia\Extractor::class,
'archive.org' => Adapters\Archive\Extractor::class,
'sassmeister.com' => Adapters\Sassmeister\Extractor::class,
'facebook.com' => Adapters\Facebook\Extractor::class,
'instagram.com' => Adapters\Instagram\Extractor::class,
'imageshack.com' => Adapters\ImageShack\Extractor::class,
'youtube.com' => Adapters\Youtube\Extractor::class,
'twitch.tv' => Adapters\Twitch\Extractor::class,
'bandcamp.com' => Adapters\Bandcamp\Extractor::class,
'twitter.com' => Adapters\Twitter\Extractor::class,
];
private array $customDetectors = [];
private array $settings;
public function __construct(?array $settings = [])
{
$this->settings = $settings ?? [];
}
public function createExtractor(UriInterface $uri, RequestInterface $request, ResponseInterface $response, Crawler $crawler): Extractor
{
$host = $uri->getHost();
$class = $this->default;
foreach ($this->adapters as $adapterHost => $adapter) {
if (substr($host, -strlen($adapterHost)) === $adapterHost) {
$class = $adapter;
break;
}
}
/** @var Extractor $extractor */
$extractor = new $class($uri, $request, $response, $crawler);
$extractor->setSettings($this->settings);
foreach ($this->customDetectors as $name => $detector) {
$extractor->addDetector($name, new $detector($extractor));
}
foreach ($extractor->createCustomDetectors() as $name => $detector) {
$extractor->addDetector($name, $detector);
}
return $extractor;
}
public function addAdapter(string $pattern, string $class): void
{
$this->adapters[$pattern] = $class;
}
public function addDetector(string $name, string $class): void
{
$this->customDetectors[$name] = $class;
}
public function removeAdapter(string $pattern): void
{
unset($this->adapters[$pattern]);
}
public function setDefault(string $class): void
{
$this->default = $class;
}
public function setSettings(array $settings): void
{
$this->settings = $settings;
}
}
PK 蛫BU_7伥E
E
src/Document.phpnu 炠槧 extractor = $extractor;
$html = (string) $extractor->getResponse()->getBody();
$html = str_replace(' ', "\n ", $html);
$html = str_replace(' document = !empty($html) ? Parser::parse($html) : new DOMDocument();
$this->initXPath();
}
private function initXPath()
{
$this->xpath = new DOMXPath($this->document);
$this->xpath->registerNamespace('php', 'http://php.net/xpath');
$this->xpath->registerPhpFunctions();
}
public function __clone()
{
$this->document = clone $this->document;
$this->initXPath();
}
public function remove(string $query): void
{
$nodes = iterator_to_array($this->xpath->query($query), false);
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
}
public function removeCss(string $query): void
{
$this->remove(self::cssToXpath($query));
}
public function getDocument(): DOMDocument
{
return $this->document;
}
/**
* Helper to build xpath queries easily and case insensitive
*/
private static function buildQuery(string $startQuery, array $attributes): string
{
$selector = [$startQuery];
foreach ($attributes as $name => $value) {
$selector[] = sprintf('[php:functionString("strtolower", @%s)="%s"]', $name, mb_strtolower($value));
}
return implode('', $selector);
}
/**
* Select a element in the dom
*/
public function select(string $query, array $attributes = null, DOMNode $context = null): QueryResult
{
if (!empty($attributes)) {
$query = self::buildQuery($query, $attributes);
}
return new QueryResult($this->xpath->query($query, $context), $this->extractor);
}
/**
* Select a element in the dom using a css selector
*/
public function selectCss(string $query, DOMNode $context = null): QueryResult
{
return $this->select(self::cssToXpath($query), null, $context);
}
/**
* Shortcut to select a element and return the href
*/
public function link(string $rel, array $extra = []): ?UriInterface
{
return $this->select('.//link', ['rel' => $rel] + $extra)->url('href');
}
public function __toString(): string
{
return Parser::stringify($this->getDocument());
}
private static function cssToXpath(string $selector): string
{
if (!isset(self::$cssConverter)) {
if (!class_exists(CssSelectorConverter::class)) {
throw new RuntimeException('You need to install "symfony/css-selector" to use css selectors');
}
self::$cssConverter = new CssSelectorConverter();
}
return self::$cssConverter->toXpath($selector);
}
}
PK 蛫BUt>L L src/Extractor.phpnu 炠槧 uri = $uri;
$this->request = $request;
$this->response = $response;
$this->crawler = $crawler;
//APIs
$this->document = new Document($this);
$this->oembed = new OEmbed($this);
$this->linkedData = new LinkedData($this);
$this->metas = new Metas($this);
//Detectors
$this->authorName = new AuthorName($this);
$this->authorUrl = new AuthorUrl($this);
$this->cms = new Cms($this);
$this->code = new Code($this);
$this->description = new Description($this);
$this->favicon = new Favicon($this);
$this->feeds = new Feeds($this);
$this->icon = new Icon($this);
$this->image = new Image($this);
$this->keywords = new Keywords($this);
$this->language = new Language($this);
$this->languages = new Languages($this);
$this->license = new License($this);
$this->providerName = new ProviderName($this);
$this->providerUrl = new ProviderUrl($this);
$this->publishedTime = new PublishedTime($this);
$this->redirect = new Redirect($this);
$this->title = new Title($this);
$this->url = new Url($this);
}
public function __get(string $name)
{
$detector = $this->customDetectors[$name] ?? $this->$name ?? null;
if (!$detector || !($detector instanceof Detector)) {
throw new DomainException(sprintf('Invalid key "%s". No detector found for this value', $name));
}
return $detector->get();
}
public function createCustomDetectors(): array
{
return [];
}
public function addDetector(string $name, Detector $detector): void
{
$this->customDetectors[$name] = $detector;
}
public function setSettings(array $settings): void
{
$this->settings = $settings;
}
public function getSettings(): array
{
return $this->settings;
}
public function getSetting(string $key)
{
return $this->settings[$key] ?? null;
}
public function getDocument(): Document
{
return $this->document;
}
public function getOEmbed(): OEmbed
{
return $this->oembed;
}
public function getLinkedData(): LinkedData
{
return $this->linkedData;
}
public function getMetas(): Metas
{
return $this->metas;
}
public function getRequest(): RequestInterface
{
return $this->request;
}
public function getResponse(): ResponseInterface
{
return $this->response;
}
public function getUri(): UriInterface
{
return $this->uri;
}
/**
* @param UriInterface|string $uri
*/
public function resolveUri($uri): UriInterface
{
if (is_string($uri)) {
if (!isHttp($uri)) {
throw new InvalidArgumentException(sprintf('Uri string must use http or https scheme (%s)', $uri));
}
$uri = $this->crawler->createUri($uri);
}
if (!($uri instanceof UriInterface)) {
throw new InvalidArgumentException('Uri must be a string or an instance of UriInterface');
}
return resolveUri($this->uri, $uri);
}
public function getCrawler(): Crawler
{
return $this->crawler;
}
}
// src/Adapters/Wikipedia/Api.php
$uri = $this->extractor->getUri();
if (!matchPath('/wiki/*', $uri->getPath())) {
return [];
}
$titles = getDirectory($uri->getPath(), 1);
$this->endpoint = $uri
->withPath('/w/api.php')
->withQuery(http_build_query([
'action' => 'query',
'format' => 'json',
'continue' => '',
'titles' => $titles,
'prop' => 'extracts',
'exchars' => 1000,
]));
$data = $this->fetchJSON($this->endpoint);
$pages = $data['query']['pages'] ?? null;
// Return an empty array (not null) when no page is found, matching the early return above.
return $pages ? current($pages) : [];
}
}
// src/Adapters/Wikipedia/Extractor.php
return $this->api;
}
public function createCustomDetectors(): array
{
$this->api = new Api($this);
return [
'title' => new Detectors\Title($this),
'description' => new Detectors\Description($this),
];
}
}
// src/Adapters/Wikipedia/Detectors/Description.php
$api = $this->extractor->getApi();
return $api->str('extract')
?: parent::detect();
}
}
// src/Adapters/Wikipedia/Detectors/Title.php
$api = $this->extractor->getApi();
return $api->str('title')
?: parent::detect();
}
}
// src/Adapters/ImageShack/Api.php
$uri = $this->extractor->getUri();
if (!matchPath('/i/*', $uri->getPath())) {
$uri = $this->extractor->getRequest()->getUri();
if (!matchPath('/i/*', $uri->getPath())) {
return [];
}
}
$id = getDirectory($uri->getPath(), 1);
if (empty($id)) {
return [];
}
$this->endpoint = $this->extractor->getCrawler()->createUri("https://api.imageshack.com/v2/images/{$id}");
$data = $this->fetchJSON($this->endpoint);
return $data['result'] ?? [];
}
}
// src/Adapters/ImageShack/Extractor.php
return $this->api;
}
public function createCustomDetectors(): array
{
$this->api = new Api($this);
return [
'authorName' => new Detectors\AuthorName($this),
'authorUrl' => new Detectors\AuthorUrl($this),
'description' => new Detectors\Description($this),
'image' => new Detectors\Image($this),
'providerName' => new Detectors\ProviderName($this),
'publishedTime' => new Detectors\PublishedTime($this),
'title' => new Detectors\Title($this),
];
}
}
// src/Adapters/ImageShack/Detectors/PublishedTime.php
$api = $this->extractor->getApi();
return $api->time('creation_date')
?: parent::detect();
}
}
// src/Adapters/ImageShack/Detectors/ProviderName.php
$api = $this->extractor->getApi();
return $api->str('owner', 'username')
?: parent::detect();
}
}
// src/Adapters/ImageShack/Detectors/Description.php
$api = $this->extractor->getApi();
return $api->str('description')
?: parent::detect();
}
}
// src/Adapters/ImageShack/Detectors/Title.php
$api = $this->extractor->getApi();
return $api->str('title')
?: parent::detect();
}
}
// src/Adapters/ImageShack/Detectors/AuthorUrl.php
$api = $this->extractor->getApi();
$owner = $api->str('owner', 'username');
if ($owner) {
return $this->extractor->getCrawler()->createUri("https://imageshack.com/{$owner}");
}
return parent::detect();
}
}
// src/Adapters/ImageShack/Detectors/Image.php
$api = $this->extractor->getApi();
return $api->url('direct_link')
?: parent::detect();
}
}
// src/Adapters/Twitter/Api.php
$token = $this->extractor->getSetting('twitter:token');
if (!$token) {
return [];
}
$uri = $this->extractor->getUri();
$id = getDirectory($uri->getPath(), 2);
if (empty($id)) {
return [];
}
$this->extractor->getCrawler()->addDefaultHeaders(array('Authorization' => "Bearer $token"));
$this->endpoint = $this->extractor->getCrawler()->createUri("https://api.twitter.com/2/tweets/{$id}?expansions=author_id,attachments.media_keys&tweet.fields=created_at&media.fields=preview_image_url,url&user.fields=id,name");
return $this->fetchJSON($this->endpoint);
}
}
// src/Adapters/Twitter/Extractor.php
return $this->api;
}
public function createCustomDetectors(): array
{
$this->api = new Api($this);
return [
'authorName' => new Detectors\AuthorName($this),
'authorUrl' => new Detectors\AuthorUrl($this),
'description' => new Detectors\Description($this),
'image' => new Detectors\Image($this),
'providerName' => new Detectors\ProviderName($this),
'publishedTime' => new Detectors\PublishedTime($this),
'title' => new Detectors\Title($this),
];
}
}
// src/Adapters/Twitter/Detectors/PublishedTime.php
$api = $this->extractor->getApi();
return $api->time('data', 'created_at')
?: parent::detect();
}
}
// src/Adapters/Twitter/Detectors/ProviderName.php
$api = $this->extractor->getApi();
return $api->str('includes', 'users', '0', 'name')
?: parent::detect();
}
}
// src/Adapters/Twitter/Detectors/Description.php
$api = $this->extractor->getApi();
return $api->str('data', 'text')
?: parent::detect();
}
}
// src/Adapters/Twitter/Detectors/Title.php
$api = $this->extractor->getApi();
$name = $api->str('includes', 'users', '0', 'name');
if ($name) {
return "Tweet by $name";
}
return parent::detect();
}
}
// src/Adapters/Twitter/Detectors/AuthorUrl.php
$api = $this->extractor->getApi();
$username = $api->str('includes', 'users', '0', 'username');
if ($username) {
return $this->extractor->getCrawler()->createUri("https://twitter.com/{$username}");
}
return parent::detect();
}
}
// src/Adapters/Twitter/Detectors/Image.php
$api = $this->extractor->getApi();
$preview = $api->url('includes', 'media', '0', 'preview_image_url');
if ($preview) {
return $preview;
}
$regular = $api->url('includes', 'media', '0', 'url');
if ($regular) {
return $regular;
}
return parent::detect();
}
}
// src/Adapters/Twitch/Extractor.php
new Detectors\Code($this),
];
}
}